I'm matching a file in this form. Each record always begins with InvNo, and ~EOR~ is the end-of-record marker.
InvNo: 123
Tag1: rat cake
Media: d234
Tag2: rat pudding
~EOR~
InvNo: 5433
Tag1: strawberry tart
Tag5: 's got some rat in it
~EOR~
InvNo: 345
Tag2: 5
Media: d234
Tag5: rather a lot really
~EOR~
It should become
IN 123
UR blabla
**
IN 345
UR blibli
**
Where UR is a URL. I want to keep the InvNo as the first tag. ** is now the end-of-record marker. This works:
impfile = filename[:4]
media = open(filename + '_earmark.dat', 'w')
with open(impfile, 'r') as f:
    HASMEDIA = False
    recordbuf = ''
    for line in f:
        if 'InvNo: ' in line:
            InvNo = line[line.find('InvNo: ')+7:len(line)]
            recordbuf = 'IN {}'.format(InvNo)
        if 'Media: ' in line:
            HASMEDIA = True
            mediaref = line[7:len(line)-1]
            URL = getURL(mediaref)  # there's more to it, but that's not important now
            recordbuf += 'UR {}\n'.format(URL)
        if '~EOR~' in line:
            if HASMEDIA:
                recordbuf += '**\n'
                media.write(recordbuf)
            HASMEDIA = False
            recordbuf = ''
media.close()
Is there a better, more Pythonic way? Working with the record buffer and the HASMEDIA flag seems, well, old hat. Any examples or tips for better practice?
(Also, I'm open to suggestions for a more to-the-point title to this post)
You could set InvNo and URL initially to None, and only write a record when InvNo and URL are both truthy:
impfile = filename[:4]
with open(filename + '_earmark.dat', 'w') as media, open(impfile, 'r') as f:
    InvNo = URL = None
    for line in f:
        if line.startswith('InvNo: '):
            InvNo = line[line.find('InvNo: ')+7:len(line)]
        if line.startswith('Media: '):
            mediaref = line[7:len(line)-1]
            URL = getURL(mediaref)
        if line.startswith('~EOR~'):
            if InvNo and URL:
                recordbuf = 'IN {}\nUR {}\n**\n'.format(InvNo, URL)
                media.write(recordbuf)
            InvNo = URL = None
Note: I changed 'InvNo: ' in line to line.startswith('InvNo: ') based on the assumption that InvNo always occurs at the beginning of the line. It appears to be true in your example, but the fact that you use line.find('InvNo: ') suggests that 'InvNo:' might appear anywhere in the line.
If InvNo: appears only at the beginning of the line, then use line.startswith(...) and remove line.find('InvNo: ') (since it would equal 0).
Otherwise, you'll have to retain 'InvNo:' in line and line.find (and of course, the same goes for Media and ~EOR~).
The problem with using code like 'Media' in line is that if the Tags can contain anything, it might contain the string 'Media' without being a true field header.
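A quick illustration of the difference, with a made-up line where the field name appears mid-line:

```python
line = 'Tag2: see the Media: gallery'
# the substring test fires even though this is tag content, not a field header
assert 'Media: ' in line
# the anchored test correctly rejects it
assert not line.startswith('Media: ')
```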
Here is a version that avoids slicing. Also, if you ever need to append to the same output file later (you may not), you can change 'w' to 'a'.
with open('input_file', 'r') as f, open('output.dat', 'a') as media:
    write_to_file = False
    lines = f.readlines()
    for line in lines:
        if line.startswith('InvNo:'):
            first_line = 'IN ' + line.split()[1] + '\n'
        if line.startswith('Media:'):
            write_to_file = True
        if line.startswith('~EOR~') and write_to_file:
            url = 'blabla'  # put getURL() here
            media.write(first_line + url + '\n' + '**\n')
            write_to_file = False
            first_line = ''
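Another option (a sketch only; it prints the raw Media ref as a stand-in for your getURL call): parse each record into a dict keyed by field name, then decide what to emit when ~EOR~ arrives. This drops the flags entirely:

```python
def parse_records(lines):
    """Yield one dict per record; keys are field names, values are the rest of the line."""
    record = {}
    for line in lines:
        line = line.strip()
        if line == '~EOR~':
            yield record
            record = {}
        elif ': ' in line:
            key, _, value = line.partition(': ')
            record[key] = value

sample = """InvNo: 123
Tag1: rat cake
Media: d234
~EOR~
InvNo: 5433
Tag1: strawberry tart
~EOR~
""".splitlines()

for rec in parse_records(sample):
    if 'InvNo' in rec and 'Media' in rec:
        # getURL is your function; the raw ref is used here as a stand-in
        print('IN {}\nUR {}\n**'.format(rec['InvNo'], rec['Media']))
```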
I'm having a problem with the google search module.
I'm trying to use it to run multiple requests, but I get an error when running a query for each word.
alpha = input(colored("[{}*{}] Enter Path of you're Word : ", 'yellow'))
word = open(alpha, 'r')
Lines = word.readlines()
query = Lines
try:
    print(colored("[{}+{}] Scan started! Please wait... :)", 'red'))
    for gamma in search(query, start=0, tld=beta, num=1000, pause=2):
        print(colored('[+] Found > ', 'yellow') + (gamma))
        with open("googleurl.txt", "a") as f:
            f.write(gamma + "/" + "\n")
except:
    print("[{}-{}] Word Liste not found!")
I think it's not possible to run multiple queries this way: my dorks are loaded into my Python program, but the query is never run. If I change it to
query = "test"
I get about 100 results for the word test. I think I'm doing something wrong when building the query from the text file.
Sorry for my bad English; I'm a beginner with English and also with Python.
I hope you can help me.
I'm now at this point with the program:
alpha = input(colored("[{}*{}] Wordlist : ", 'yellow'))
Word = open(alpha, 'r')
Lines = Word.readlines()
query = Lines
beta = random.choice(TLD)
Word_number = 0
for line in Lines:
    Word_number += 1
for query in Lines:
    print("Nombre de Word: " + str(Word_number))
    for i in search(query, start=0, tld=beta, num=1000, pause=2, stop=None):
        print(colored('[+] Found > ', 'yellow') + (i))
        URL_number += 1
        with open("googleurl.txt", "a") as f:
            f.write(i + "/" + "\n")
            f.close()
print(colored("[{}+{}] Total Google URL : ", 'red') + str(URL_number))
And here is what my program does:
It just finds 98 websites and stops, and it only checks the first word.
word.readlines() returns a list of strings, where each item is the next line in the file. This means that query is a list.
The search() function wants query to be a string, so you'll have to loop through Lines to get each individual query:
for query in Lines:
    # perform search with this query
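One more thing to watch for: readlines() keeps the trailing newline on each line, so it's worth stripping each query before searching. A small sketch with a made-up word list (in your code the lines come from open(alpha)):

```python
# hypothetical word list; in the real code this comes from the file
raw_lines = ["first dork\n", "second dork\n", "\n"]

# strip newlines and skip blank lines before querying
queries = [line.strip() for line in raw_lines if line.strip()]
print(queries)
```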
Hey, I finally updated my code, and now I have a problem with proxies.
The code is fixed for requests with dorks, but I can't find how to add a proxy. My code is:
alpha = input(colored("[{}*{}] Dorklist : ", 'yellow'))
dorks = open(alpha, 'r')
Lines = dorks.readlines()
query = Lines
beta = random.choice(TLD)
ceta = input(colored("[{}*{}] Proxylist :", 'yellow'))
prox = open(ceta, 'r')
Lines2 = prox.readlines()
proxy = Lines2
Dorks_number = 0
Proxy_number = 0
for line in Lines:
    Dorks_number += 1
for line in Lines2:
    Proxy_number += 1
print("Nombre de dorks: " + str(Dorks_number))
print("Nombre de Proxy: " + str(Proxy_number))
s = requests.Session(proxies=proxy)
s.cookies.set_policy(BlockAll())
for query in Lines:
    for i in search(query, start=0, tld=beta, num=1000, pause=2, stop=None):
        print(colored('[+] Found > ', 'yellow') + (i))
        URL_number += 1
        with open("googleurl.txt", "a") as f:
            f.write(i + "/" + "\n")
            f.close()
print(colored("[{}+{}] Total Google URL : ", 'red') + str(URL_number))
My error :
s = requests.Session(proxies=proxy)
TypeError: __init__() got an unexpected keyword argument 'proxies'
Does someone have an idea how to do it?
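For what it's worth: requests.Session() takes no constructor arguments, which is what the TypeError is saying. Proxies are configured on the session object after it is created. A minimal sketch (the proxy addresses are placeholders, not real servers):

```python
import requests

s = requests.Session()
# Session.proxies maps URL scheme -> proxy URL; these addresses are made up
s.proxies.update({
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
})
print(sorted(s.proxies))
```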
I am using Python 3.7 and have a test.txt file that looks like this:
<P align="left">
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$&nbsp;and
$&nbsp;per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
.
</FONT>
I need to extract everything from "be between" (row 4) through "per share" (row 7). Here is the code I run:
price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
print(price)
['of our common stock is expected to be between']
I first locate "be between" and then append the line, but everything that comes next is cut off because it sits on the following lines.
My desired output would be:
['of our common stock is expected to be between $ and $ per share']
How can I do it?
Thank you very much in advance.
The right way, with the html.unescape and re.search features:
import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())
    m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
    if m:
        price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))
print(price_texts)
The output:
[' of our common stock is expected to be between $ and $ per share']
You need to decide when to append a line to price:
is_capturing = False
is_inside_per_share = False
for line in f:
    if "be between" in line and "per share" in line:
        price.append(line)
        is_capturing = False
    elif "be between" in line:
        is_capturing = True
    elif "per share" in line:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('per share') + len('per share')].rstrip().replace('&nbsp;', ''))
        is_capturing = False
        is_inside_per_share = False
    elif line.strip().endswith("per"):
        is_inside_per_share = True
    elif line.strip().startswith("share") and is_inside_per_share:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('share') + len('share')].rstrip().replace('&nbsp;', ''))
        is_inside_per_share = False
        is_capturing = False
    if is_capturing:
        price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
This is just a sketch, so you'll probably need to tweak it a little bit
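The flag pattern above boils down to: start capturing at the first marker, stop just after the second. A compact, self-contained sketch of that idea, using an inline sample instead of test.txt:

```python
sample = [
    "of our common stock is expected to be between\n",
    "$ and\n",
    "$ per\n",
    "share. We intend to list our common stock\n",
]

captured = []
capturing = False
for line in sample:
    if "be between" in line:
        capturing = True
    if capturing:
        captured.append(line.strip())
    if line.strip().startswith("share"):
        break

# join the captured lines, then trim everything after "share"
result = " ".join(captured)
result = result[:result.find("share") + len("share")]
print(result)
```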
This also works:
import re

with open('test.txt', 'r') as f:
    txt = f.read()
start = re.search('\n(.*?)be between\n', txt)
end = re.search('per(.*?)share', txt, re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace('&nbsp;', '').replace('\n', '').replace('and', ' and ')
print(['{} {} {}'.format(start.group().replace('\n', ''), output, end.group().replace('\n', ' '))])
output:
['of our common stock is expected to be between $ and $ per share']
A dirty way of doing it:
price = []
with open("test.txt", 'r') as f:
    for i, line in enumerate(f):
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
        if i > 3 and i <= 6:
            price.append(line.rstrip().replace('&nbsp;', ''))
print(str(price).split('.')[0] + "]")
Here is another simple solution:
It collects all lines into one long string, finds the starting index of 'be between' and the ending index of 'per share', and then takes the appropriate substring.
from re import search

price = []
with open("test.txt", 'r') as f:
    one_line_txt = ''.join(f.readlines()).replace('\n', ' ').replace('&nbsp;', '')
start_index = search('be between', one_line_txt).span()[0]
end_index = search('per share', one_line_txt).span()[1]
price.append(one_line_txt[start_index:end_index])
print(price)
Outputs:
['be between $and $per share']
This will also work:
import re

price = []
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace('&nbsp;', ''))
text_file = " ".join(price)
be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)
Output:
"be between $and $per share"
I have this large script (I will post the whole thing if I have to, but it is very big) which starts off okay when I run it, but it immediately gives me 'TypeError: cannot concatenate 'str' and 'NoneType' objects' when it comes to this last bit of the code:
with open("self.txt", "a+") as f:
    f = open("self.txt", "a+")
    text = f.readlines()
    text_model = markovify.Text(text)
    for i in range(1):
        tool = grammar_check.LanguageTool('en-GB')
        lin = (text_model.make_sentence(tries=800))
        word = ('' + lin)
        matches = tool.check(word)
        correct = grammar_check.correct(word, matches)
        print ">",
        print correct
        print ' '
        f = open("self.txt", "a+")
        f.write(correct + "\n")
I have searched everywhere but gotten nowhere. It seems to have something to do with word = ('' + lin), but no matter what I do I can't fix it. What am I doing wrong?
I'm not sure how I did it, but with a bit of fiddling and Google I came up with a solution; the corrected code is here (if you're interested):
with open("self.txt", "a+") as f:
    f = open("self.txt", "a+")
    text = f.readlines()
    text_model = markovify.Text(text)
    for i in range(1):
        tool = grammar_check.LanguageTool('en-GB')
        lin = (text_model.make_sentence(tries=200))
        matches = tool.check(lin)
        correct = grammar_check.correct(lin, matches)
        lowcor = (correct.lower())
        print ">",
        print str(lowcor)
        print ' '
        f = open("self.txt", "a+")
        f.write(lowcor + "\n")
Thanks for all the replies, they had me thinking and that's how I fixed it!
You can't concatenate a string and a NoneType object. In your code, it appears your variable lin is not getting assigned the value you think it is. You might try an if block that starts like this:
if type(lin) == str:
    some code
else:
    raise Exception('lin is not the correct datatype')
to verify that lin is the correct datatype before printing.
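For context: markovify's make_sentence() returns None when it can't build a sentence within the given number of tries, which is exactly what makes '' + lin blow up. A minimal guard for that case (markovify itself isn't needed to show the pattern):

```python
def guard(lin):
    """Return lin if it is a usable sentence; raise a clear error on None."""
    if lin is None:
        raise ValueError("make_sentence() returned None; try raising tries")
    return '' + lin

print(guard("hello world"))
```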
fp = open('data.txt', 'r')
saveto = open('backup.txt', 'w')
someline = fp.readline()
savemodfile = ''
while someline:
    temp_array = someline.split()
    print('temp_array[1] {0:20} temp_array[0] {0:20}'.format(temp_array[1], temp_array[0]), '\trating:', temp_array[len(temp_array)-1])
    someline = fp.readline()
    savemodfile = temp_array[1] + ' ' + temp_array[0] + ',\t\trating:' + temp_array[10]
    saveto.write(savemodfile + '\n')
fp.close()
saveto.close()
The input file data.txt has records of this pattern: firstname Lastname age address.
I would like backup.txt to have this format: Lastname firstname address age.
How do I store the data in backup.txt in a nicely formatted way? I think I should use the format() method somehow.
I used the print call in the code to show you what I understood about format() so far. Of course, I do not get the desired results.
To answer your question:
you can indeed use the .format() method on a string template, see the documentation https://docs.python.org/3.5/library/stdtypes.html#str.format
For example:
'the first parameter is {}, the second parameter is {}, the third one is {}'.format("this one", "that one", "there")
Will output: 'the first parameter is this one, the second parameter is that one, the third one is there'
You do not seem to use format() properly in your case: 'temp_array[1] {0:20} temp_array[0] {0:20}'.format(temp_array[1], temp_array[0]) will output something like 'temp_array[1] Lastname temp_array[0] Lastname '. That is because {0:20} will output the 1st parameter to format(), right padded with spaces to 20 characters.
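To make the numbering concrete, a tiny example: {0} refers to the first argument passed to format(), and :20 pads it on the right to 20 characters:

```python
# 'Lastname' is 8 characters, so {0:20} adds 12 spaces of right padding
s = '{0:20}|{1}'.format('Lastname', 'Firstname')
print(repr(s))
```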
Additionally, there are many things to be improved in your code. I guess you are learning Python, so that's normal. Here is a functionally equivalent version that produces the output you want and makes good use of Python features and syntax:
with open('data.txt', 'rt') as finput, \
        open('backup.txt', 'wt') as foutput:
    for line in finput:
        firstname, lastname, age, address = line.strip().split()
        foutput.write("{} {} {} {}\n".format(lastname, firstname, address, age))
This code will give you a formatted output on the screen and in the output file:
fp = open('data.txt', 'r')
saveto = open('backup.txt', 'w')
someline = fp.readline()
savemodfile = ''
while someline:
    temp_array = someline.split()
    # avoid calling this variable 'str': that would shadow the built-in
    formatted = '{:20}{:20}{:20}{:20}'.format(temp_array[1], temp_array[0], temp_array[2], temp_array[3])
    print(formatted)
    savemodfile = formatted
    saveto.write(savemodfile + '\n')
    someline = fp.readline()
fp.close()
saveto.close()
But this is not very nice code for working with files; try using the following pattern:
with open('a', 'w') as a, open('b', 'w') as b:
    do_something()
refer to : How can I open multiple files using "with open" in Python?
fp = open('data.txt', 'r')
saveto = open('backup.txt', 'w')
someline = fp.readline()
savemodfile = ''
while someline:
    temp_array = someline.split()
    someline = fp.readline()
    savemodfile = '{:^20} {:^20} {:^20} {:^20}'.format(temp_array[1], temp_array[0], temp_array[3], temp_array[2])
    saveto.write(savemodfile + '\n')
fp.close()
saveto.close()
I'm trying to parse tweets data.
My data shape is as follows:
59593936 3061025991 null null <d>2009-08-01 00:00:37</d> <s><a href="http://help.twitter.com/index.php?pg=kb.page&id=75" rel="nofollow">txt</a></s> <t>honda just recalled 440k accords...traffic around here is gonna be light...win!!</t> ajc8587 15 24 158 -18000 0 0 <n>adrienne conner</n> <ud>2009-07-23 21:27:10</ud> <t>eastern time (us & canada)</t> <l>ga</l>
22020233 3061032620 null null <d>2009-08-01 00:01:03</d> <s><a href="http://alexking.org/projects/wordpress" rel="nofollow">twitter tools</a></s> <t>new blog post: honda recalls 440k cars over airbag risk http://bit.ly/2wsma</t> madcitywi 294 290 9098 -21600 0 0 <n>madcity</n> <ud>2009-02-26 15:25:04</ud> <t>central time (us & canada)</t> <l>madison, wi</l>
I want to get the total number of tweets and the number of keyword-related tweets. I prepared the keywords in a text file. In addition, I want to get the tweet text contents and the total number of tweets which contain a mention (#), a retweet (RT), or a URL (I want to save every URL in another file).
So, I coded like this.
import time
import os

total_tweet_count = 0
related_tweet_count = 0
rt_count = 0
mention_count = 0
URLs = {}

def get_keywords(filepath, mode):
    with open(filepath, mode) as f:
        for line in f:
            yield line.split().lower()

for line in open('/nas/minsu/2009_06.txt'):
    tweet = line.strip().lower()
    total_tweet_count += 1
    with open('./related_tweets.txt', 'a') as save_file_1:
        keywords = get_keywords('./related_keywords.txt', 'r')
        if keywords in line:
            text = line.split('<t>')[1].split('</t>')[0]
            if 'http://' in text:
                try:
                    url = text.split('http://')[1].split()[0]
                    url = 'http://' + url
                    if url not in URLs:
                        URLs[url] = []
                    URLs[url].append('\t' + text)
                    save_file_3 = open('./URLs_in_related_tweets.txt', 'a')
                    print >> save_file_3, URLs
                except:
                    pass
            if '#' in text:
                mention_count += 1
            if 'RT' in text:
                rt_count += 1
            related_tweet_count += 1
            print >> save_file_1, text

save_file_2 = open('./info_related_tweets.txt', 'w')
print >> save_file_2, str(total_tweet_count) + '\t' + srt(related_tweet_count) + '\t' + str(mention_count) + '\t' + str(rt_count)
save_file_1.close()
save_file_2.close()
save_file_3.close()
The following are the sample keywords:
Depression
Placebo
X-rays
X-ray
HIV
Blood preasure
Flu
Fever
Oral Health
Antibiotics
Diabetes
Mellitus
Genetic disorders
I think my code has many problems, but the first error is as follows:
line 23, in <module>
    if keywords in line:
TypeError: 'in <string>' requires string as left operand, not generator
I wrote the def ... part myself, and I think it has a problem. When I try print keywords right after the line keywords = get_keywords('./related_keywords.txt', 'r'), it prints something strange (a generator object with a hex address), not words. Please help me out!
Maybe change if keywords in line: to use a regular expression search instead. For example, something like:
import re
...
keywords = "|".join(get_keywords('./related_keywords.txt', 'r'))
matcher = re.compile(keywords)
if matcher.search(line):  # search(), not match(): a keyword can appear anywhere in the line
    text = ...
... and change get_keywords to something like this instead:
def get_keywords(filepath, mode):
    keywords = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.split()
            for w in sp:
                keywords.append(w.lower())
    return keywords
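As a side note, if whole-word matching is enough for your keywords, a set intersection avoids regular expressions entirely. A sketch (the sample tweet and keyword set here are made up, not from your data):

```python
def get_keywords(filepath):
    """Collect every whitespace-separated keyword from the file, lowercased, as a set."""
    with open(filepath) as f:
        return {w.lower() for line in f for w in line.split()}

def is_related(tweet, keywords):
    """True if any keyword appears as a whole word in the tweet."""
    return bool(set(tweet.lower().split()) & keywords)

# made-up stand-ins for the real keyword file and tweet stream
keywords = {'flu', 'fever', 'hiv'}
print(is_related('i think i caught the flu today', keywords))
```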