I am using Python 3.7 and have a test.txt file that looks like this:
<P align="left">
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$ and
$ per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
.
</FONT>
I need to extract everything from "be between" (row 4) up to "per share" (row 7). Here is the code I run:
price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace(' ','')) #remove '\n' and ' '

print(price)
['of our common stock is expected to be between']
I first locate "be between" and append that line, but the problem is that everything that comes after it is cut off because it sits on the following lines.
My desired output would be:
['of our common stock is expected to be between $ and $ per share']
How can I do it?
Thank you very much in advance.
The right way, using html.unescape and re.search:
import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())
    m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
    if m:
        price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))

print(price_texts)
The output:
[' of our common stock is expected to be between $ and $ per share']
You need to decide when to append a line to price:
is_capturing = False
is_inside_per_share = False
for line in f:
    if "be between" in line and "per share" in line:
        price.append(line)
        is_capturing = False
    elif "be between" in line:
        is_capturing = True
    elif "per share" in line:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('per share') + len('per share')].rstrip().replace(' ',''))
        is_capturing = False
        is_inside_per_share = False
    elif line.strip().endswith("per"):
        is_inside_per_share = True
    elif line.strip().startswith("share") and is_inside_per_share:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('share') + len('share')].rstrip().replace(' ',''))
        is_inside_per_share = False
        is_capturing = False
    if is_capturing:
        price.append(line.rstrip().replace(' ','')) #remove '\n' and ' '
This is just a sketch, so you'll probably need to tweak it a little bit.
This also works:
import re

with open('test.txt', 'r') as f:
    txt = f.read()

start = re.search('\n(.*?)be between\n', txt)
end = re.search('per(.*?)share', txt, re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace(' ', '').replace('\n', '').replace('and', ' and ')
print(['{} {} {}'.format(start.group().replace('\n', ''), output, end.group().replace('\n', ' '))])
output:
['of our common stock is expected to be between $ and $ per share']
A dirty way of doing it:
price = []
with open("test.txt", 'r') as f:
    for i, line in enumerate(f):
        if "be between" in line:
            price.append(line.rstrip().replace(' ','')) #remove '\n' and ' '
        if i > 3 and i <= 6:
            price.append(line.rstrip().replace(' ',''))

print(str(price).split('.')[0] + "]")
Here is another simple solution:
It collects all lines into one long string, finds the starting index of 'be between' and the ending index of 'per share', and then takes the appropriate substring.
from re import search

price = []
with open("test.txt", 'r') as f:
    one_line_txt = ''.join(f.readlines()).replace('\n', ' ').replace(' ','')
    start_index = search('be between', one_line_txt).span()[0]
    end_index = search('per share', one_line_txt).span()[1]
    price.append(one_line_txt[start_index:end_index])

print(price)
Outputs:
['be between $and $per share']
This will also work:
import re

price = []
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace(' ',''))

text_file = " ".join(price)
be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)
Output:
"be between $and $per share"
Related
I'm trying to generate a dataset based on an existing one. I was able to implement a method that randomly changes the contents of files, but I can't write all of this to a file. Moreover, I also need to write the number of changed words to the file, since I want to use this dataset to train a neural network. Could you help me?
Input: files with 2 lines of text in each.
Output: files with 3 (maybe) lines: the first line does not change, the second changes according to the method, and the third shows the number of words changed (if it is better to do this differently for deep learning tasks, I would be glad for advice, since I'm a beginner).
from random import randrange
import os

Path = "D:\corrected data\\"
filelist = os.listdir(Path)

if __name__ == "__main__":
    new_words = ['consultable', 'partie ', 'celle ', 'également ', 'forte ', 'statistiques ', 'langue ',
                 'cadeaux', 'publications ', 'notre', 'nous', 'pour', 'suivr', 'les', 'vos', 'visitez ', 'thème ', 'thème ', 'thème ', 'produits', 'coulisses ', 'un ', 'atelier ', 'concevoir ', 'personnalisés ', 'consultable', 'découvrir ', 'fournit ', 'trace ', 'dire ', 'tableau', 'décrire', 'grande ', 'feuille ', 'noter ', 'correspondant', 'propre',]
    nb_words_to_replace = randrange(10)

    #with open("1.txt") as file:
    for i in filelist:
        # if i.endswith(".txt"):
        with open(Path + i, "r", encoding="utf-8") as file:
            # for line in file:
            data = file.readlines()
            first_line = data[0]
            second_line = data[1]
            print(f"Original: {second_line}")
            # print(f"FIle: {file}")

            second_line_array = second_line.split(" ")
            for j in range(nb_words_to_replace):
                replacement_position = randrange(len(second_line_array))
                old_word = second_line_array[replacement_position]
                new_word = new_words[randrange(len(new_words))]
                print(f"Position {replacement_position} : {old_word} -> {new_word}")
                second_line_array[replacement_position] = new_word

            res = " ".join(second_line_array)
            print(f"Result: {res}")

        with open(Path + i, "w") as f:
            for line in file:
                if line == second_line:
                    f.write(res)
In short, you have two questions:
How to properly replace line number 2 (and 3) of the file.
How to keep track of number of words changed.
How to properly replace line number 2 (and 3) of the file.
Your code:
with open(Path + i,"w") as f:
for line in file:
if line == second_line:
f.write(res)
Reading is not enabled: the file is opened with "w", so for line in file will not work. Also, f is defined, but file is used instead. To fix this, do the following instead:
with open(Path + i,"r+") as file:
lines = file.read().splitlines() # splitlines() removes the \n characters
lines[1] = second_line
file.writelines(lines)
However, you want to add more lines to it. I suggest you structure the logic differently.
How to keep track of number of words changed.
Add a variable changed_words_count and increment it when old_word != new_word:
Resulting code:
for i in filelist:
    filepath = Path + i

    # The lines that will be replacing the file
    new_lines = [""] * 3

    with open(filepath, "r", encoding="utf-8") as file:
        data = file.readlines()
        first_line = data[0]
        second_line = data[1]

        second_line_array = second_line.split(" ")
        changed_words_count = 0
        for j in range(nb_words_to_replace):
            replacement_position = randrange(len(second_line_array))
            old_word = second_line_array[replacement_position]
            new_word = new_words[randrange(len(new_words))]

            # A word replaced does not mean the word has changed.
            # It could be replacing itself.
            # Check if the replacing word is different.
            if old_word != new_word:
                changed_words_count += 1

            second_line_array[replacement_position] = new_word

        # Add the lines to the new file lines
        new_lines[0] = first_line
        new_lines[1] = " ".join(second_line_array)
        new_lines[2] = str(changed_words_count)

        print(f"Result: {new_lines[1]}")

    with open(filepath, "w") as file:
        file.writelines(new_lines)
Note: Code not tested.
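One caveat to check when you run it: writelines does not add separators between items, so if the rebuilt second line loses its trailing newline (for example, when the last word is the one that gets replaced), the count on the third line can end up glued to it. A small tweak for the final write, not part of the original answer, is to join the pieces with explicit newlines:

with open(filepath, "w", encoding="utf-8") as file:
    # Strip whatever line endings the pieces already carry, then separate them explicitly.
    file.write("\n".join(part.rstrip("\n") for part in new_lines) + "\n")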
I am trying to parse/process some information from a text file using Python. This file contains names, employee numbers and other data. I do not know the names or employee numbers ahead of time. I do know that after the names there is the text: "Per End" and before the employee number there is the text: "File:". I can find these items using the .find() method. But, how do I ask Python to look at the information that comes before or after "Per End" and "File:"? In this specific case the output should be the name and employee number.
The text looks like this:
SMITH, John
Per End: 12/10/2016
File:
002013
Dept:
000400
Rate:10384 60
My code is thus:
file = open("Register.txt", "rt")
lines = file.readlines()
file.close()
countPer = 0
for line in lines:
line = line.strip()
print (line)
if line.find('Per End') != -1:
countPer += 1
print ("Per End #'s: ", countPer)
file = open("Register.txt", "rt")
lines = file.readlines()
file.close()
for indx, line in enumerate(lines):
line = line.strip()
print (line)
if line.find('Per End') != -1:
print lines[indx-1].strip()
if line.find('File:') != -1:
print lines[indx+1].strip()
enumerate(lines) gives access to the index as well as the line, so you can reach the previous and next lines too.
Here is my stdout, run directly in the Python shell:
>>> file = open("r.txt", "rt")
>>> lines = file.readlines()
>>> file.close()
>>> lines
['SMITH, John\n', 'Per End: 12/10/2016\n', 'File:\n', '002013\n', 'Dept:\n', '000400\n', 'Rate:10384 60\n']
>>> for indx, line in enumerate(lines):
...     line = line.strip()
...     if line.find('Per End') != -1:
...         print lines[indx-1].strip()
...     if line.find('File:') != -1:
...         print lines[indx+1].strip()
SMITH, John
002013
Here is how I would do it.
First, some test data.
test = """SMITH, John\n
Per End: 12/10/2016\n
File:\n
002013\n
Dept:\n
000400\n
Rate:10384 60\n"""
text = [line for line in test.splitlines(keepends=False) if line != ""]
Now for the real answer.
count_per, count_num = 0, 0
Using enumerate on an iterable gives you an index automagically.
for idx, line in enumerate(text):
    # Just test whether what you're looking for is in the `str`
    if 'Per End' in line:
        print(text[idx - 1])  # access the full set of lines with idx
        count_per += 1
    if 'File:' in line:
        print(text[idx + 1])
        count_num += 1
print("Per Ends = {}".format(count_per))
print("Files = {}".format(count_num))
yields for me:
SMITH, John
002013
Per Ends = 1
Files = 1
I'm trying to modify a txt file. The file is a movie script in the format:
BEN We’ve discussed this before.
LUKE I can be a Jedi. I’m ready.
I'd like to insert a new line after the character's name:
BEN
We’ve discussed this before.
LUKE
I can be a Jedi. I’m ready.
How do I do this in python? I currently have:
def modify_file(file_name):
    fh = fileinput.input(file_name, inplace=True)
    for line in fh:
        split_line = line.split()
        if(len(split_line)>0):
            first_word = split_line[0]
            replacement = first_word + '\n'
            first_word = first_word.replace(first_word, replacement)
            sys.stdout.write(first_word)
            fh.close()
As suggested in one of the comments, this can be done using split and isupper. An example is provided below:
source_path = 'source_path.txt'

f = open(source_path)
lines = f.readlines()
f.close()

temp = ''
for line in lines:
    words = line.split(' ')
    if words[0].isupper():
        temp += words[0] + '\n' + ' '.join(words[1:])
    else:
        temp += line

f = open(source_path, 'w')
f.write(temp)
f.close()
There are multiple problems with your code.
import fileinput
import sys

def modify_file(file_name):
    # inplace=True redirects stdout to the file being processed,
    # so sys.stdout.write() replaces the file's contents.
    fh = fileinput.input(file_name, inplace=True)
    for line in fh:
        split_line = line.split()
        if len(split_line) > 0:
            x = split_line[0] + "\n" + " ".join(split_line[1:]) + "\n"
            sys.stdout.write(x)
    fh.close()  # ==> this cannot be inside the if block; it has to be at the level of the for loop
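A quick usage sketch (the filenames here are hypothetical): because inplace=True rewrites the file being processed, it is safer to test on a copy first.

import shutil

# Hypothetical filenames: run the fixed function on a copy so the original script survives.
shutil.copy("script.txt", "script_copy.txt")
modify_file("script_copy.txt")
# "BEN We've discussed this before." becomes "BEN" on one line and
# "We've discussed this before." on the next.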
I am matching a file in this form. It always begins with InvNo, and ~EOR~ is the end-of-record marker.
InvNo: 123
Tag1: rat cake
Media: d234
Tag2: rat pudding
~EOR~
InvNo: 5433
Tag1: strawberry tart
Tag5: 's got some rat in it
~EOR~
InvNo: 345
Tag2: 5
Media: d234
Tag5: rather a lot really
~EOR~
It should become
IN 123
UR blabla
**
IN 345
UR blibli
**
Here UR is a URL. I want to keep the InvNo as the first tag, and ** is now the end-of-record marker. This works:
impfile = filename[:4]
media = open(filename + '_earmark.dat', 'w')

with open(impfile, 'r') as f:
    HASMEDIA = False
    recordbuf = ''
    for line in f:
        if 'InvNo: ' in line:
            InvNo = line[line.find('InvNo: ')+7:len(line)]
            recordbuf = 'IN {}'.format(InvNo)
        if 'Media: ' in line:
            HASMEDIA = True
            mediaref = line[7:len(line)-1]
            URL = getURL(mediaref)  # there's more to it, but that's not important now
            recordbuf += 'UR {}\n'.format(URL)
        if '~EOR~' in line:
            if HASMEDIA:
                recordbuf += '**\n'
                media.write(recordbuf)
            HASMEDIA = False
            recordbuf = ''

media.close()
Is there a better, more Pythonic way? Working with the recordbuffer and the HASMEDIA flag seems, well, old hat. Any examples or tips for good or better practice?
(Also, I'm open to suggestions for a more to-the-point title to this post)
You could set InvNo and URL initially to None, and only write a record when InvNo and URL are both truthy:
impfile = filename[:4]

with open(filename + '_earmark.dat', 'w') as media, open(impfile, 'r') as f:
    InvNo = URL = None
    for line in f:
        if line.startswith('InvNo: '):
            InvNo = line[line.find('InvNo: ')+7:len(line)]
        if line.startswith('Media: '):
            mediaref = line[7:len(line)-1]
            URL = getURL(mediaref)
        if line.startswith('~EOR~'):
            if InvNo and URL:
                recordbuf = 'IN {}\nUR {}\n**\n'.format(InvNo, URL)
                media.write(recordbuf)
            InvNo = URL = None
Note: I changed 'InvNo: ' in line to line.startswith('InvNo: ') based on the assumption that InvNo always occurs at the beginning of the line. It appears to be true in your example, but the fact that you use line.find('InvNo: ') suggests that 'InvNo:' might appear anywhere in the line.
If InvNo: appears only at the beginning of the line, then use line.startswith(...) and remove line.find('InvNo: ') (since it would equal 0).
Otherwise, you'll have to retain 'InvNo:' in line and line.find (and of course, the same goes for Media and ~EOR~).
The problem with using code like 'Media' in line is that if the Tags can contain anything, it might contain the string 'Media' without being a true field header.
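A tiny illustration of that pitfall, with a hypothetical tag value:

line = "Tag5: filed under Media: archive"
print('Media: ' in line)           # True, even though this is not a real Media field
print(line.startswith('Media: '))  # False: startswith() only accepts a genuine header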
Here is a version that avoids slicing. The output file is opened with 'a' in case you ever need to write to the same output file again; if you don't, you can change 'a' to 'w'.
with open('input_file', 'r') as f, open('output.dat', 'a') as media:
    write_to_file = False
    lines = f.readlines()
    for line in lines:
        if line.startswith('InvNo:'):
            first_line = 'IN ' + line.split()[1] + '\n'
        if line.startswith('Media:'):
            write_to_file = True
        if line.startswith('~EOR~') and write_to_file:
            url = 'blabla'  # Put getUrl() here
            media.write(first_line + url + '\n' + '**\n')
            write_to_file = False
            first_line = ''
I'm trying to parse tweet data.
My data shape is as follows:
59593936 3061025991 null null <d>2009-08-01 00:00:37</d> <s><a href="http://help.twitter.com/index.php?pg=kb.page&id=75" rel="nofollow">txt</a></s> <t>honda just recalled 440k accords...traffic around here is gonna be light...win!!</t> ajc8587 15 24 158 -18000 0 0 <n>adrienne conner</n> <ud>2009-07-23 21:27:10</ud> <t>eastern time (us & canada)</t> <l>ga</l>
22020233 3061032620 null null <d>2009-08-01 00:01:03</d> <s><a href="http://alexking.org/projects/wordpress" rel="nofollow">twitter tools</a></s> <t>new blog post: honda recalls 440k cars over airbag risk http://bit.ly/2wsma</t> madcitywi 294 290 9098 -21600 0 0 <n>madcity</n> <ud>2009-02-26 15:25:04</ud> <t>central time (us & canada)</t> <l>madison, wi</l>
I want to get the total number of tweets and the number of keyword-related tweets. I prepared the keywords in a text file. In addition, I want to get the tweet text contents and the total number of tweets that contain a mention (#), a retweet (RT), or a URL (I want to save every URL in another file).
So I coded it like this:
import time
import os

total_tweet_count = 0
related_tweet_count = 0
rt_count = 0
mention_count = 0
URLs = {}

def get_keywords(filepath, mode):
    with open(filepath, mode) as f:
        for line in f:
            yield line.split().lower()

for line in open('/nas/minsu/2009_06.txt'):
    tweet = line.strip().lower()
    total_tweet_count += 1

    with open('./related_tweets.txt', 'a') as save_file_1:
        keywords = get_keywords('./related_keywords.txt', 'r')
        if keywords in line:
            text = line.split('<t>')[1].split('</t>')[0]
            if 'http://' in text:
                try:
                    url = text.split('http://')[1].split()[0]
                    url = 'http://' + url
                    if url not in URLs:
                        URLs[url] = []
                    URLs[url].append('\t' + text)
                    save_file_3 = open('./URLs_in_related_tweets.txt', 'a')
                    print >> save_file_3, URLs
                except:
                    pass
            if '#' in text:
                mention_count += 1
            if 'RT' in text:
                rt_count += 1
            related_tweet_count += 1
            print >> save_file_1, text

save_file_2 = open('./info_related_tweets.txt', 'w')
print >> save_file_2, str(total_tweet_count) + '\t' + srt(related_tweet_count) + '\t' + str(mention_count) + '\t' + str(rt_count)

save_file_1.close()
save_file_2.close()
save_file_3.close()
The following is a sample of the keywords:
Depression
Placebo
X-rays
X-ray
HIV
Blood preasure
Flu
Fever
Oral Health
Antibiotics
Diabetes
Mellitus
Genetic disorders
I think my code has many problems, but the first error is as follows:
line 23, in <module>
if keywords in line:
TypeError: 'in <string>' requires string as left operand, not generator
I coded "def ..." part. I think it has a problem. When I try "print keywords" under line (keywords = get_keywords('./related_keywords.txt', 'r')), it gives something strange numbers not words.... . Please help me out!
Maybe change if keywords in line: to use a regular expression match instead. For example, something like:
import re
...
keywords = "|".join(get_keywords('./related_keywords.txt', 'r'))
matcher = re.compile(keywords)
if matcher.search(line):  # search() finds a keyword anywhere in the line; match() would only check the start
    text = ...
... And changed get_keywords to something like this instead:
def get_keywords(filepath, mode):
    keywords = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.split()
            for w in sp:
                keywords.append(w.lower())
    return keywords
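For reference, here is a minimal, self-contained sketch of how the two pieces above could fit together. The file paths are the ones from the question; the re.escape call is an added precaution (an assumption, not in the original answer) so that any regex metacharacters in the keywords are treated literally.

import re

def get_keywords(filepath, mode):
    # Flatten the keyword file into a lowercased list, one entry per word.
    keywords = []
    with open(filepath, mode) as f:
        for line in f:
            for w in line.split():
                keywords.append(w.lower())
    return keywords

keywords = get_keywords('./related_keywords.txt', 'r')
matcher = re.compile("|".join(re.escape(k) for k in keywords))

total_tweet_count = 0
related_tweet_count = 0
with open('/nas/minsu/2009_06.txt') as f:
    for line in f:
        tweet = line.strip().lower()
        total_tweet_count += 1
        # search() looks anywhere in the line; match() would only check the start.
        if matcher.search(tweet):
            related_tweet_count += 1

print(total_tweet_count, related_tweet_count)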