I'm trying to parse tweets data.
My data shape is as follows:
59593936 3061025991 null null <d>2009-08-01 00:00:37</d> <s><a href="http://help.twitter.com/index.php?pg=kb.page&id=75" rel="nofollow">txt</a></s> <t>honda just recalled 440k accords...traffic around here is gonna be light...win!!</t> ajc8587 15 24 158 -18000 0 0 <n>adrienne conner</n> <ud>2009-07-23 21:27:10</ud> <t>eastern time (us & canada)</t> <l>ga</l>
22020233 3061032620 null null <d>2009-08-01 00:01:03</d> <s><a href="http://alexking.org/projects/wordpress" rel="nofollow">twitter tools</a></s> <t>new blog post: honda recalls 440k cars over airbag risk http://bit.ly/2wsma</t> madcitywi 294 290 9098 -21600 0 0 <n>madcity</n> <ud>2009-02-26 15:25:04</ud> <t>central time (us & canada)</t> <l>madison, wi</l>
I want to get the total number of tweets and the number of keyword-related tweets. I prepared the keywords in a text file. In addition, I want to get the tweet text contents and the total number of tweets which contain a mention (#), a retweet (RT), and a URL (I want to save every URL in a separate file).
So, I coded like this.
import time
import os
total_tweet_count = 0
related_tweet_count = 0
rt_count = 0
mention_count = 0
URLs = {}
def get_keywords(filepath, mode):
    with open(filepath, mode) as f:
        for line in f:
            yield line.split().lower()
for line in open('/nas/minsu/2009_06.txt'):
    tweet = line.strip().lower()
    total_tweet_count += 1
    with open('./related_tweets.txt', 'a') as save_file_1:
        keywords = get_keywords('./related_keywords.txt', 'r')
        if keywords in line:
            text = line.split('<t>')[1].split('</t>')[0]
            if 'http://' in text:
                try:
                    url = text.split('http://')[1].split()[0]
                    url = 'http://' + url
                    if url not in URLs:
                        URLs[url] = []
                    URLs[url].append('\t' + text)
                    save_file_3 = open('./URLs_in_related_tweets.txt', 'a')
                    print >> save_file_3, URLs
                except:
                    pass
            if '#' in text:
                mention_count += 1
            if 'RT' in text:
                rt_count += 1
            related_tweet_count += 1
            print >> save_file_1, text

save_file_2 = open('./info_related_tweets.txt', 'w')
print >> save_file_2, str(total_tweet_count) + '\t' + str(related_tweet_count) + '\t' + str(mention_count) + '\t' + str(rt_count)
save_file_1.close()
save_file_2.close()
save_file_3.close()
The following is a sample of the keywords:
Depression
Placebo
X-rays
X-ray
HIV
Blood preasure
Flu
Fever
Oral Health
Antibiotics
Diabetes
Mellitus
Genetic disorders
I think my code has many problems, but the first error is as follows:
line 23, in <module>
if keywords in line:
TypeError: 'in <string>' requires string as left operand, not generator
I wrote the "def ..." part myself, and I think it has a problem. When I try "print keywords" right after the line keywords = get_keywords('./related_keywords.txt', 'r'), it prints something strange (numbers, not words). Please help me out!
Maybe change if keywords in line: to use a regular expression match instead. For example, something like:
import re
...
keywords = "|".join(get_keywords('./related_keywords.txt', 'r'))
matcher = re.compile(keywords)
if matcher.search(line):
    text = ...
    ...
...and change get_keywords to something like this instead (note that search() is used above rather than match(), since match() only matches at the start of the string):
def get_keywords(filepath, mode):
    keywords = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.split()
            for w in sp:
                keywords.append(w.lower())
    return keywords
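One detail worth adding (an assumption on my part, not part of the original answer): if the keyword file contains entries like X-rays or Oral Health, the joined pattern should be escaped with re.escape so hyphens and other characters match literally, and keeping whole lines preserves multi-word keywords. A minimal sketch:

```python
import re

def build_matcher(keywords):
    # Escape each keyword so characters like '-' match literally,
    # and lowercase everything to match the lowercased tweets.
    escaped = (re.escape(k.lower()) for k in keywords)
    return re.compile("|".join(escaped))

# In-memory stand-in for related_keywords.txt; keeping whole lines
# (rather than splitting on whitespace) preserves multi-word keywords.
keywords = ["X-ray", "Flu", "Oral Health"]
matcher = build_matcher(keywords)

line = "got my x-ray results back today"
print(bool(matcher.search(line)))  # True
```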
I'm having a problem with the google search module.
I'm trying to use it to make multiple requests, but I can't get it to run a separate query for each word.
alpha = input(colored("[{}*{}] Enter Path of you're Word : ", 'yellow'))
word = open(alpha, 'r')
Lines = word.readlines()
query = Lines
try:
    print(colored("[{}+{}] Scan started! Please wait... :)", 'red'))
    for gamma in search(query, start=0, tld=beta, num=1000, pause=2):
        print(colored('[+] Found > ', 'yellow') + (gamma))
        with open("googleurl.txt", "a") as f:
            f.write(gamma + "/" + "\n")
except:
    print("[{}-{}] Word Liste not found!")
I don't think the multiple queries are being run: my dorks are loaded into my Python program, but the queries are never made. If I change it to query = "test" I get about 100 results for the word test. I think I'm doing something wrong when building the query from the text file.
I'm sorry for my bad English; I'm a beginner with English and also with Python.
I hope you can help me.
I'm now working with this program:
alpha = input(colored("[{}*{}] Wordlist : ", 'yellow'))
Word = open(alpha, 'r')
Lines = Word.readlines()
query = Lines
beta = random.choice(TLD)
Word_number = 0
for line in Lines:
    Word_number += 1
for query in Lines:
    print("Nombre de Word: " + str(Word_number))
    for i in search(query, start=0, tld=beta, num=1000, pause=2, stop=None):
        print(colored('[+] Found > ', 'yellow') + (i))
        URL_number += 1
        with open("googleurl.txt", "a") as f:
            f.write(i + "/" + "\n")
            f.close()
print(colored("[{}+{}] Total Google URL : ", 'red') + str(URL_number))
And my program does this:
It just found 98 websites and stopped, and it only checked the first word.
word.readlines() returns a list of strings, where each item is the next line in the file. This means that query is a list.
The search() function wants query to be a string, so you'll have to loop through Lines to get each individual query:
for query in Lines:
# perform search with this query
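A no-network illustration of the fix (search() itself is the third-party function from the question):

```python
# Lines as returned by readlines(): each entry keeps its trailing newline.
Lines = ["rat cake\n", "strawberry tart\n", "rat pudding\n"]

# Passing the whole list as one query was the bug; loop instead,
# stripping the newline so each query is a clean string.
for query in Lines:
    query = query.strip()
    print(repr(query))
    # here you would call: search(query, start=0, tld=beta, num=1000, pause=2)
```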
Hey, I finally updated my code, and now I have a problem with proxies.
The code is fixed for requests with dorks, but I can't find how to add a proxy. My code is:
alpha = input(colored("[{}*{}] Dorklist : ", 'yellow'))
dorks = open(alpha, 'r')
Lines = dorks.readlines()
query = Lines
beta = random.choice(TLD)
ceta = input(colored("[{}*{}] Proxylist :", 'yellow'))
prox = open(ceta, 'r')
Lines2 = prox.readlines()
proxy = Lines2
Dorks_number = 0
Proxy_number = 0
for line in Lines:
    Dorks_number += 1
for line in Lines2:
    Proxy_number += 1
print("Nombre de dorks: " + str(Dorks_number))
print("Nombre de Proxy: " + str(Proxy_number))
s = requests.Session(proxies=proxy)
s.cookies.set_policy(BlockAll())
for query in Lines:
    for i in search(query, start=0, tld=beta, num=1000, pause=2, stop=None):
        print(colored('[+] Found > ', 'yellow') + (i))
        URL_number += 1
        with open("googleurl.txt", "a") as f:
            f.write(i + "/" + "\n")
            f.close()
print(colored("[{}+{}] Total Google URL : ", 'red') + str(URL_number))
My error :
s = requests.Session(proxies=proxy)
TypeError: __init__() got an unexpected keyword argument 'proxies'
Does someone have an idea how to do it?
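requests.Session() takes no constructor arguments; you create the session first and then set its proxies attribute. A sketch — the host:port proxy-list format and the build_proxies helper are my assumptions, not part of the original code:

```python
def build_proxies(proxy_line):
    # Turn one 'host:port' line (assumed format) into the mapping
    # that requests expects for its proxies setting.
    proxy = proxy_line.strip()
    return {"http": "http://" + proxy, "https": "http://" + proxy}

# With requests installed you would then do (not executed here):
#   s = requests.Session()
#   s.proxies.update(build_proxies(random.choice(Lines2)))

print(build_proxies("127.0.0.1:8080\n"))
```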
I am using Python 3.7 and have a test.txt file that looks like this:
<P align="left">
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$ and
$ per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
.
</FONT>
I need to extract everything that follows the "be between" (row 4) until "per share" (row 7). Here is the code I run:
price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('  ', ''))  # remove '\n' and '  '
print(price)
['of our common stock is expected to be between']
I first locate the "be between" and then append the line, but the problem is that everything that comes next is cut off, because it is on the following lines.
My desired output would be:
['of our common stock is expected to be between $ and $ per share']
How can I do it?
Thank you very much in advance.
The right way with html.unescape and re.search features:
import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())
m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
if m:
    price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))
print(price_texts)
The output:
[' of our common stock is expected to be between $ and $ per share']
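To see why this works, here is the same pattern run on the relevant fragment of test.txt as an inline string (the exact spacing inside the file is a guess):

```python
import re
from html import unescape

content = unescape(
    "The initial public offering price\n"
    "of our common stock is expected to be between\n"
    "$  and\n"
    "$  per\n"
    "share."
)
# Capture from "price" up to and including "per share"; DOTALL lets
# '.' cross line breaks, and the sub collapses runs of whitespace.
m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
result = re.sub(r'\s{2,}|\n', ' ', m.group(1))
print([result])
# [' of our common stock is expected to be between $ and $ per share']
```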
You need to decide when to append a line to price:
is_capturing = False
is_inside_per_share = False
for line in f:
    if "be between" in line and "per share" in line:
        price.append(line)
        is_capturing = False
    elif "be between" in line:
        is_capturing = True
    elif "per share" in line:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('per share') + len('per share')].rstrip().replace('  ', ''))
        is_capturing = False
        is_inside_per_share = False
    elif line.strip().endswith("per"):
        is_inside_per_share = True
    elif line.strip().startswith("share") and is_inside_per_share:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('share') + len('share')].rstrip().replace('  ', ''))
        is_inside_per_share = False
        is_capturing = False
    if is_capturing:
        price.append(line.rstrip().replace('  ', ''))  # remove '\n' and '  '
This is just a sketch, so you'll probably need to tweak it a little bit
this also works:
import re

with open('test.txt', 'r') as f:
    txt = f.read()
start = re.search('\n(.*?)be between\n', txt)
end = re.search('per(.*?)share', txt, re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace(' ', '').replace('\n', '').replace('and', ' and ')
print(['{} {} {}'.format(start.group().replace('\n', ''), output, end.group().replace('\n', ' '))])
output:
['of our common stock is expected to be between $ and $ per share']
dirty way of doing it:
price = []
with open("test.txt", 'r') as f:
    for i, line in enumerate(f):
        if "be between" in line:
            price.append(line.rstrip().replace('  ', ''))  # remove '\n' and '  '
        if i > 3 and i <= 6:
            price.append(line.rstrip().replace('  ', ''))
print(str(price).split('.')[0] + "]")
Here is another simple solution:
It collects all lines into 1 long string, detects starting index of 'be between', ending index of 'per share', and then takes the appropriate substring.
from re import search

price = []
with open("test.txt", 'r') as f:
    one_line_txt = ''.join(f.readlines()).replace('\n', ' ').replace('  ', '')
start_index = search('be between', one_line_txt).span()[0]
end_index = search('per share', one_line_txt).span()[1]
price.append(one_line_txt[start_index:end_index])
print(price)
Outputs:
['be between $and $per share']
This will also work:
import re

price = []
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace('  ', ''))
text_file = " ".join(price)
be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)
Output:
"be between $and $per share"
I'm working on a minor content analysis program that I was hoping to run through several PDF files and return the sum of frequencies with which some specific words are mentioned in the text. The words searched for are specified in a separate text file (list.txt) and can be altered. The program runs just fine on files in .txt format, but the result is completely different when running it on a .pdf file. To illustrate, the test text that I have the program running through is the following:
"Hello
This is a product development notice
We’re working with innovative measures
A nice Innovation
The world that we live in is innovative
We are currently working on a new process
And in the fall, you will experience our new product development introduction"
The list of words grouped in categories are the following (marked in .txt file with ">>"):
innovation: innovat
product: Product, development, introduction
organization: Process
The output from running the code with a .txt file is the following:
Whereas the output from running it with a .pdf is the following:
As you can see, my issue pertains to the splitting of the words: in the .pdf output I can have a string like "world" be split into 'w', 'o', 'rld'. I have tried tirelessly to find out why this happens, without success. As I am rather new to Python programming, I would appreciate any answer or direction to where I can find an answer to why this happens, should you know any source.
Thanks
The code for the .txt is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()

dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.txt'
    textfile = open(f)
    text = textfile.read().split()  # lowercase the text
    print(text)
    textfile.close()
    i = i + 1

    # a default category for simple word lists
    current_category = "Default"
    scores[current_category] = 0

    # import the dictionary
    for line in lines:
        if line[0:2] == '>>':
            current_category = line[2:].strip()
            scores[current_category] = 0
        else:
            line = line.strip()
            if len(line) > 0:
                pattern = re.compile(line, re.IGNORECASE)
                dic[pattern] = current_category

    # examine the text
    for token in text:
        for pattern in dic.keys():
            if pattern.match(token):
                categ = dic[pattern]
                scores[categ] = scores[categ] + 1

print(os.path.basename(f))
for key in scores.keys():
    print(key, ":", scores[key])
While the code for the .pdf is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()

dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.pdf'
    textfile = open(f, 'rb')
    text = PyPDF2.PdfFileReader(textfile)  # lowercase the text
    for pageNum in range(0, text.numPages):
        texts = text.getPage(pageNum)
        textfile = texts.extractText().split()
        print(textfile)
    i = i + 1

    # a default category for simple word lists
    current_category = "Default"
    scores[current_category] = 0

    # import the dictionary
    for line in lines:
        if line[0:2] == '>>':
            current_category = line[2:].strip()
            scores[current_category] = 0
        else:
            line = line.strip()
            if len(line) > 0:
                pattern = re.compile(line, re.IGNORECASE)
                dic[pattern] = current_category

    # examine the text
    for token in textfile:
        for pattern in dic.keys():
            if pattern.match(token):
                categ = dic[pattern]
                scores[categ] = scores[categ] + 1

print(os.path.basename(f))
for key in scores.keys():
    print(key, ":", scores[key])
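One way to make the counting robust to the broken tokenization is to join all the extracted text into one string and count pattern hits with re.findall instead of matching token by token. A sketch, with a plain string standing in for the PyPDF2-extracted text (the sample sentence and categories are taken from the question):

```python
import re

# Stand-in for the extracted text; in the real script this would be
# ' '.join of extractText() across all pages.
text = "We're working with innovative measures. A nice Innovation. A new process."

# Categories and stems as in list.txt (after the '>>' markers).
categories = {
    "innovation": ["innovat"],
    "product": ["product", "development", "introduction"],
    "organization": ["process"],
}

scores = {}
for category, stems in categories.items():
    # findall counts every occurrence of each stem, case-insensitively,
    # regardless of how the extractor split the surrounding words.
    scores[category] = sum(
        len(re.findall(stem, text, re.IGNORECASE)) for stem in stems
    )
print(scores)
# {'innovation': 2, 'product': 0, 'organization': 1}
```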
I'm matching a file in this form. Each record always begins with InvNo; ~EOR~ is the end-of-record marker.
InvNo: 123
Tag1: rat cake
Media: d234
Tag2: rat pudding
~EOR~
InvNo: 5433
Tag1: strawberry tart
Tag5: 's got some rat in it
~EOR~
InvNo: 345
Tag2: 5
Media: d234
Tag5: rather a lot really
~EOR~
It should become
IN 123
UR blabla
**
IN 345
UR blibli
**
Where UR is a URL. I want to keep the InvNo as first tag. ** is now the end of record marker. This works:
impfile = filename[:4]
media = open(filename + '_earmark.dat', 'w')
with open(impfile, 'r') as f:
    HASMEDIA = False
    recordbuf = ''
    for line in f:
        if 'InvNo: ' in line:
            InvNo = line[line.find('InvNo: ')+7:len(line)]
            recordbuf = 'IN {}'.format(InvNo)
        if 'Media: ' in line:
            HASMEDIA = True
            mediaref = line[7:len(line)-1]
            URL = getURL(mediaref)  # there's more to it, but that's not important now
            recordbuf += 'UR {}\n'.format(URL)
        if '~EOR~' in line:
            if HASMEDIA:
                recordbuf += '**\n'
                media.write(recordbuf)
            HASMEDIA = False
            recordbuf = ''
media.close()
Is there a better, more Pythonic way? Working with the recordbuffer and the HASMEDIA flag seems, well, old hat. Any examples or tips for good or better practice?
(Also, I'm open to suggestions for a more to-the-point title to this post)
You could set InvNo and URL initially to None, and only print a record when InvNo and URL are both not Falsish:
impfile = filename[:4]
with open(filename + '_earmark.dat', 'w') as media, open(impfile, 'r') as f:
    InvNo = URL = None
    for line in f:
        if line.startswith('InvNo: '):
            InvNo = line[line.find('InvNo: ')+7:len(line)]
        if line.startswith('Media: '):
            mediaref = line[7:len(line)-1]
            URL = getURL(mediaref)
        if line.startswith('~EOR~'):
            if InvNo and URL:
                recordbuf = 'IN {}\nUR {}\n**\n'.format(InvNo, URL)
                media.write(recordbuf)
            InvNo = URL = None
Note: I changed 'InvNo: ' in line to line.startswith('InvNo: ') based on the assumption that InvNo always occurs at the beginning of the line. It appears to be true in your example, but the fact that you use line.find('InvNo: ') suggests that 'InvNo:' might appear anywhere in the line.
If InvNo: appears only at the beginning of the line, then use line.startswith(...) and remove line.find('InvNo: ') (since it would equal 0).
Otherwise, you'll have to retain 'InvNo:' in line and line.find (and of course, the same goes for Media and ~EOR~).
The problem with using code like 'Media' in line is that if the Tags can contain anything, it might contain the string 'Media' without being a true field header.
Here is a version if you don't want to slice. If you ever need to write to the same output file again (you may not), you can change 'w' to 'a'.
with open('input_file', 'r') as f, open('output.dat', 'a') as media:
    write_to_file = False
    lines = f.readlines()
    for line in lines:
        if line.startswith('InvNo:'):
            first_line = 'IN ' + line.split()[1] + '\n'
        if line.startswith('Media:'):
            write_to_file = True
        if line.startswith('~EOR~') and write_to_file:
            url = 'blabla'  # Put getUrl() here
            media.write(first_line + url + '\n' + '**\n')
            write_to_file = False
            first_line = ''
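Another option in the same spirit: split the stream into record dicts first, then format only the records you want. getURL is stubbed here, since its real body isn't shown in the question:

```python
def parse_records(lines):
    # Yield one {field: value} dict per ~EOR~-terminated record.
    record = {}
    for line in lines:
        line = line.strip()
        if line == '~EOR~':
            yield record
            record = {}
        elif ': ' in line:
            key, value = line.split(': ', 1)
            record[key] = value

def getURL(mediaref):
    # Stand-in for the asker's real lookup.
    return 'http://example.org/' + mediaref

sample = ["InvNo: 123\n", "Media: d234\n", "~EOR~\n",
          "InvNo: 5433\n", "Tag1: strawberry tart\n", "~EOR~\n"]
for rec in parse_records(sample):
    # Only records that carry both an InvNo and a Media field are emitted.
    if 'InvNo' in rec and 'Media' in rec:
        print('IN {}\nUR {}\n**'.format(rec['InvNo'], getURL(rec['Media'])))
```

This keeps the parsing and the output format separate, so neither a buffer nor a HASMEDIA flag is needed.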
I'm trying to parse tweets data.
My data shape is as follows:
59593936 3061025991 null null <d>2009-08-01 00:00:37</d> <s><a href="http://help.twitter.com/index.php?pg=kb.page&id=75" rel="nofollow">txt</a></s> <t>honda just recalled 440k accords...traffic around here is gonna be light...win!!</t> ajc8587 15 24 158 -18000 0 0 <n>adrienne conner</n> <ud>2009-07-23 21:27:10</ud> <t>eastern time (us & canada)</t> <l>ga</l>
22020233 3061032620 null null <d>2009-08-01 00:01:03</d> <s><a href="http://alexking.org/projects/wordpress" rel="nofollow">twitter tools</a></s> <t>new blog post: honda recalls 440k cars over airbag risk http://bit.ly/2wsma</t> madcitywi 294 290 9098 -21600 0 0 <n>madcity</n> <ud>2009-02-26 15:25:04</ud> <t>central time (us & canada)</t> <l>madison, wi</l>
I want to get the total number of tweets and the number of keyword-related tweets. I prepared the keywords in a text file. In addition, I want to get the tweet text contents and the total number of tweets which contain a mention (#), a retweet (RT), and a URL (I want to save every URL in a separate file).
So, I coded like this.
import time
import os
total_tweet_count = 0
related_tweet_count = 0
rt_count = 0
mention_count = 0
URLs = {}
def get_keywords(filepath):
    with open(filepath) as f:
        for line in f:
            yield line.split()
for line in open('/nas/minsu/2009_06.txt'):
    tweet = line.strip()
    total_tweet_count += 1
    with open('./related_tweets.txt', 'a') as save_file_1:
        keywords = get_keywords('./related_keywords.txt', 'r')
        if keywords in line:
            text = line.split('<t>')[1].split('</t>')[0]
            if 'http://' in text:
                try:
                    url = text.split('http://')[1].split()[0]
                    url = 'http://' + url
                    if url not in URLs:
                        URLs[url] = []
                    URLs[url].append('\t' + text)
                    save_file_3 = open('./URLs_in_related_tweets.txt', 'a')
                    print >> save_file_3, URLs
                except:
                    pass
            if '#' in text:
                mention_count += 1
            if 'RT' in text:
                rt_count += 1
            related_tweet_count += 1
            print >> save_file_1, text

save_file_2 = open('./info_related_tweets.txt', 'w')
print >> save_file_2, str(total_tweet_count) + '\t' + str(related_tweet_count) + '\t' + str(mention_count) + '\t' + str(rt_count)
save_file_1.close()
save_file_2.close()
save_file_3.close()
The keyword set looks like:
Happy
Hello
Together
I think my code has many problems, but the first error is as follows:
Traceback (most recent call last):
  File "health_related_tweets.py", line 21, in <module>
    keywords = get_keywords('./public_health_related_words.txt', 'r')
TypeError: get_keywords() takes exactly 1 argument (2 given)
Please help me out!
The issue is self-explanatory from the error: you have passed two arguments in your call to get_keywords(), but your implementation only takes one parameter. You should change your get_keywords implementation to something like:
def get_keywords(filepath, mode):
    with open(filepath, mode) as f:
        for line in f:
            yield line.split()
Then you can use the following line without that specific error:
keywords = get_keywords('./related_keywords.txt', 'r')
Now you are getting this error:
Traceback (most recent call last):
  File "health_related_tweets.py", line 23, in <module>
    if keywords in line:
TypeError: 'in <string>' requires string as left operand, not generator
The reason is that keywords = get_keywords(...) returns a generator. Logically thinking about it, keywords should be a list of all the keywords. And for each keyword in this list, you want to check if it's in the tweet/line or not.
Sample code:
keywords = get_keywords('./related_keywords.txt', 'r')
has_keyword = False
for keyword in keywords:
    if keyword in line:
        has_keyword = True
        break
if has_keyword:
    # Your code here (for the case when the line has at least one keyword)
(The above code would be replacing if keywords in line:)
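The same check can be written more compactly with any(), which also short-circuits on the first match (a flat list of keyword strings is assumed here):

```python
keywords = ["happy", "hello", "together"]
line = "hello world, happy friday"

# any() is True as soon as one keyword is found in the line.
if any(keyword in line for keyword in keywords):
    print("related tweet:", line)
```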