Textual parsing - python

I am a newbie with Python and pandas, but I would like to parse multiple downloaded files (which all have the same format).
In every HTML file there is a section like the one below where the executives are mentioned.
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
Further on in the files there is a section <DIV id=article_qanda class="content_part hid"> where executives such as Ori Shilo are named, followed by an answer, like:
<P><STRONG><SPAN class=answer>Ori Shilo</SPAN></STRONG></P>
<P>Good morning, Vernon. Both safety which is obvious and fertility analysis under the charter of the data and safety monitoring board will be - will be on up.</P>
So far I have only succeeded with an HTML parser that collects all the answers for one individual, selected by name. I am not sure how to proceed and base the code on a variable list of executives. Does someone have a suggestion?
import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        print('{:<30} {:<70}'.format('Name', 'Answer'))
        print('-' * 101)
        for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
            txt = answer.get_text(strip=True)
            s = answer.find_next_sibling()
            while s:
                if s.name == 'strong' or s.find('strong'):
                    break
                if s.name == 'p':
                    txt += ' ' + s.get_text(strip=True)
                s = s.find_next_sibling()
            txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
            print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a"))

To give some color to my original comment, I'll use a simple example. Let's say you've got some code that is looking for the string "Hello, World!" in a file, and you want the line numbers to be aggregated into a list. Your first attempt might look like:
# where I will aggregate my results
line_numbers = []

with open('path/to/file.txt') as fh:
    for num, line in enumerate(fh):
        if 'Hello, World!' in line:
            line_numbers.append(num)
This code snippet works perfectly well. However, it only works to check 'path/to/file.txt' for 'Hello, World!'.
Now, you want to be able to change the string you are looking for. This is analogous to saying "I want to check for different executives". You could use a function to do this. A function allows you to add flexibility into a piece of code. In this simple example, I would do:
# Now I'm checking for a parameter string_to_search
# that I can change when I call the function
def match_in_file(string_to_search):
    line_numbers = []
    with open('path/to/file.txt') as fh:
        for num, line in enumerate(fh):
            if string_to_search in line:
                line_numbers.append(num)
    return line_numbers

# now I'm just calling that function here
line_numbers = match_in_file("Hello, World!")
You'd still have to make a code change, but this becomes much more powerful if you want to search for lots of strings. I could feasibly use this function in a loop if I wanted to (though I would do things a little differently in practice); for the sake of the example, I now have the power to do:
list_of_strings = [
    "Hello, World!",
    "Python",
    "Functions",
]

for s in list_of_strings:
    line_numbers = match_in_file(s)
    print(f"Found {s} on lines ", *line_numbers)
Generalized to your specific problem, you'll want a parameter for the executive that you want to search for. Your function signature might look like:
def find_executive(soup, executive):
    # note the quotes around {executive}: the name contains spaces, so it must
    # be passed to :contains() as a quoted string
    for answer in soup.select(f'p:contains("Question-and-Answer Session") ~ strong:contains("{executive}") + p'):
        # rest of code
You've already read in the soup, so you don't need to do that again. You only need to change the executive in your select statement. The reason you want a parameter for soup is so you aren't relying on variables in global scope.
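For example, a minimal sketch of calling it for several executives (using the names from your sample HTML, and assuming soup has already been built as in your existing loop):

executives = ["Dror Ben Asher", "Ori Shilo", "Guy Goldberg"]

for executive in executives:
    find_executive(soup, executive)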

@C.Nivs Would my code then be the following? Because I now get a block error:
import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        print('{:<30} {:<70}'.format('Name', 'Answer'))
        print('-' * 101)
        def find_executive(soup, executive):
            for answer in soup.select(f'p:contains("Question-and-Answer Session") ~ strong:contains("{executive}") + p'):
                txt = answer.get_text(strip=True)
                s = answer.find_next_sibling()
                while s:
                    if s.name == 'strong' or s.find('strong'):
                        break
                    if s.name == 'p':
                        txt += ' ' + s.get_text(strip=True)
                    s = s.find_next_sibling()
                txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
                print('{:<30} {:<70}'.format(func, txt), file=open("output.txt", "a"))

Related

Python extract text with line cuts

I am using Python 3.7 and have a test.txt file that looks like this:
<P align="left">
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$ and
$ per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
“ ”.
</FONT>
I need to extract everything that follows the "be between" (row 4) until "per share" (row 7). Here is the code I run:
price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('\xa0', ''))  # remove '\n' and non-breaking spaces
print(price)
['of our common stock is expected to be between']
I first locate the "be between" and then append the line, but the problem is that everything that comes next is cut off because it is on the following lines.
My desired output would be:
['of our common stock is expected to be between $ and $ per share']
How can I do it?
Thank you very much in advance.
The right way, with the html.unescape and re.search features:
import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())

m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
if m:
    price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))

print(price_texts)
The output:
[' of our common stock is expected to be between $ and $ per share']
You need to decide when to append a line to price:
is_capturing = False
is_inside_per_share = False
for line in f:
    if "be between" in line and "per share" in line:
        price.append(line)
        is_capturing = False
    elif "be between" in line:
        is_capturing = True
    elif "per share" in line:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('per share') + len('per share')].rstrip().replace('\xa0', ''))
        is_capturing = False
        is_inside_per_share = False
    elif line.strip().endswith("per"):
        is_inside_per_share = True
    elif line.strip().startswith("share") and is_inside_per_share:
        # CAUTION: possible off-by-one error
        price.append(line[:line.find('share') + len('share')].rstrip().replace('\xa0', ''))
        is_inside_per_share = False
        is_capturing = False
    if is_capturing:
        price.append(line.rstrip().replace('\xa0', ''))  # remove '\n' and non-breaking spaces
This is just a sketch, so you'll probably need to tweak it a little bit.
This also works:
import re

with open('test.txt', 'r') as f:
    txt = f.read()
start = re.search('\n(.*?)be between\n', txt)
end = re.search('per(.*?)share', txt, re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace('\xa0', '').replace('\n', '').replace('and', ' and ')
print(['{} {} {}'.format(start.group().replace('\n', ''), output, end.group().replace('\n', ' '))])
output:
['of our common stock is expected to be between $ and $ per share']
A dirty way of doing it:
price = []
with open("test.txt", 'r') as f:
    for i, line in enumerate(f):
        if "be between" in line:
            price.append(line.rstrip().replace('\xa0', ''))  # remove '\n' and non-breaking spaces
        if i > 3 and i <= 6:
            price.append(line.rstrip().replace('\xa0', ''))
print(str(price).split('.')[0] + "]")
Here is another simple solution:
It collects all the lines into one long string, finds the starting index of 'be between' and the ending index of 'per share', and then takes the appropriate substring.
from re import search

price = []
with open("test.txt", 'r') as f:
    one_line_txt = ''.join(f.readlines()).replace('\n', ' ').replace('\xa0', '')
start_index = search('be between', one_line_txt).span()[0]
end_index = search('per share', one_line_txt).span()[1]
price.append(one_line_txt[start_index:end_index])
print(price)
Outputs:
['be between $and $per share']
This will also work:
import re

price = []
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace('\xa0', ''))
text_file = " ".join(price)
be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)
Output:
"be between $and $per share"

How to replace string character pattern using python in csv file

I am new to Python. How can I replace the character pattern ," with ,{ and ", with }, in a .csv file that contains multiple lines?
Here is my content of .csv file
Name, Degree,Some, Occupation, Object
Mr. A,"B.A, M.A",123,"ags,gshs",ass
Mr. ABC,"B.A, M.A",231,"ags,gshs",asas
Mr. D,"BB.A, M.A",44,"ags,gshs",asas
Mr. G,"BBBB.A, M.A",12,"ags,gshs",asasasa
Mr. S,"B.A, MMM.A",10,"ags,gshs",asasas
Mr. R,"B.A, M.A",11,"ags,gshs",asasas
Mr. T,"B.A, M.A",12,"ags,gshs",asasa
Mr. DD,"B.A, M.A",13,"ags,gshs",asasas
So my output will be something like this
Name, Degree,Some, Occupation, Object
Mr. A,{B.A, M.A},123,{ags,gshs},ass
Mr. ABC,{B.A, M.A},231,{ags,gshs},asas
Mr. D,{BB.A, M.A},44,{ags,gshs},asas
Mr. G,{BBBB.A, M.A},12,{ags,gshs},asasasa
Mr. S,{B.A, MMM.A},10,{ags,gshs},asasas
Mr. R,{B.A, M.A},11,{ags,gshs},asasas
Mr. T,{B.A, M.A},12,{ags,gshs},asasa
Mr. DD,{B.A, M.A},13,{ags,gshs},asasas
After opening the file with file.read(), you can use replace(old, new) to replace the string characters you desire. Keep in mind, since the strings ," and ", contain quotes, you must put a \ before the quotes to show they are part of the string.
EDIT: A comment mentioned you could enclose the string in ' '. If you do this, putting \ before the quotes is not required. For example, both ",\"" and ',"' are valid strings.
data = ""
with open("/path/to/file.csv") as file:
data = file.read().replace(",\"", ",{").replace("\",", "},")
with open("/path/to/new_file.csv") as file:
file.write(data)
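To illustrate the EDIT above, a quick sanity check that both spellings denote the same two-character strings:

print(",\"" == ',"')  # True
print("\"," == '",')  # True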
If you only need it once, you could use pandas like this:
import pandas as pd
from io import StringIO

data1 = '''\
Name,Degree,Some,Occupation,Object
Mr. A,"B.A, M.A",123,"ags,gshs",ass
Mr. ABC,"B.A, M.A",231,"ags,gshs",asas
Mr. D,"BB.A, M.A",44,"ags,gshs",asas
Mr. G,"BBBB.A, M.A",12,"ags,gshs",asasasa
Mr. S,"B.A, MMM.A",10,"ags,gshs",asasas
Mr. R,"B.A, M.A",11,"ags,gshs",asasas
Mr. T,"B.A, M.A",12,"ags,gshs",asasa
Mr. DD,"B.A, M.A",13,"ags,gshs",asasas'''

df = pd.read_csv(StringIO(data1), sep=',', dtype=object)
#df = pd.read_csv('input.csv', sep=',', dtype=object) # Use this row for real application
df['Degree'] = '{' + df['Degree'] + '}'
df['Occupation'] = '{' + df['Occupation'] + '}'

# Create custom output
out = '\n'.join([','.join(df.columns), '\n'.join(','.join(i) for i in df.values)])
with open('output.csv', 'w') as f:
    f.write(out)
You can use unpacking:
import csv
with open('filename.csv') as f:
    data = list(filter(None, csv.reader(f)))

with open('filename.csv', 'w') as f1:
    write = csv.writer(f1)
    write.writerows([data[0]] + [[a, '{'+b+'}', c, '{'+d+'}', e] for a, b, c, d, e in data[1:]])
Output:
Name, Degree,Some, Occupation, Object
Mr. A,{B.A, M.A},123,{ags,gshs},ass
Mr. ABC,{B.A, M.A},231,{ags,gshs},asas
Mr. D,{BB.A, M.A},44,{ags,gshs},asas
Mr. G,{BBBB.A, M.A},12,{ags,gshs},asasasa
Mr. S,{B.A, MMM.A},10,{ags,gshs},asasas
Mr. R,{B.A, M.A},11,{ags,gshs},asasas
Mr. T,{B.A, M.A},12,{ags,gshs},asasa
Mr. DD,{B.A, M.A},13,{ags,gshs},asasas
Try:
def find_replace(csv_path, search_characters, replace_with):
    text = open(csv_path, "r")
    text = ''.join([i for i in text]).replace(search_characters, replace_with)
    x = open(csv_path, "w")
    x.writelines(text)
    x.close()

if __name__ == '__main__':
    csv_path = "path/to/csv/file.csv"
    search_characters = ',"'
    replace_with = ',{'
    find_replace(csv_path, search_characters, replace_with)

    search_characters = '",'
    replace_with = '},'
    find_replace(csv_path, search_characters, replace_with)
The above code opens the file, writes some data to it and then closes it.
Or, if you prefer, pass the patterns as lists and use the with statement, which takes care of calling the __exit__ function of the file object even if something bad happens in the code.
def find_replace(csv_path, search_characters, replace_with):
    s_one, s_two = search_characters
    r_one, r_two = replace_with
    with open(csv_path) as file:
        data = file.read().replace(s_one, r_one).replace(s_two, r_two)
    with open(csv_path, 'w') as file:
        file.write(data)

if __name__ == '__main__':
    csv_path = "path/to/csv/file.csv"
    search_characters = [',"', '",']
    replace_with = [',{', '},']
    find_replace(csv_path, search_characters, replace_with)
The main advantage of using a with statement is that it makes sure our file is closed without paying attention to how the nested block exits.
Tested and works nicely on your example.
Replacing a string character pattern in a csv file with Python:
text = open("input.csv", "r")
#text = ''.join([i for i in text]).replace("character to be replaced", "character to be replaced with")
text = ''.join([i for i in text]).replace(",\"", ",{")
#Replacing character from replaced text
text1 = ''.join([i for i in text]).replace("\",", "},")
x = open("output.csv","w")
x.writelines(text1)
x.close()

NYT summary extractor python2

I am trying to access the summary of the NYT articles using the NewsWire API and python 2.7. Here is the code:
from urllib2 import urlopen
import urllib2
from json import loads
from lxml.etree import XMLSyntaxError  # assumed import; needed for the except clause below
import codecs
import time
import newspaper

posts = list()
articles = list()
i = 30
keys = dict()
count = 0
offset = 0

while(offset<40000):
    if(len(posts)>=30000): break
    if(700<offset<800):
        offset = offset + 100
    #for p in xrange(100):
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset="+str(offset)+"&api-key=ACCESSKEY"
        data = loads(urlopen(url).read())
        print str(len(posts)) + " offset=" + str(offset)
        if posts and articles and keys:
            outfile = open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            outfile = open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            indexfile = open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + " " + str(keys[x]))
            indexfile.close()
        for item in data["results"]:
            if(('url' in item) & ('abstract' in item)):
                url = item["url"]
                abst = item["abstract"]
                if(url not in keys.values()):
                    keys[count] = url
                    article = newspaper.Article(url)
                    article.download()
                    article.parse()
                    try:
                        el_post = article.text.replace('\n\n',' ').replace("Advertisement Continue reading the main story",'')
                    except XMLSyntaxError, e:
                        continue
                    articles.append(el_post)
                    count = count + 1
                    res = abst  # url + " " + abst
                    # print res.encode("utf-8")
                    posts.append(res)  # Here is the appending statement.
                    if(len(posts)>=30000):
                        break
    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    except urllib2.URLError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    offset = offset + 19

print str(len(posts))
print str(len(keys))
I was getting good summaries, but sometimes I came across some weird sentences as part of the summary. Here are some examples:
Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.
which are considered to be the summary of some article. Kindly help me in extracting the proper summary of the article from the NYT news. I thought of using the titles if such a case arises, but the titles are weird too.
So, I have taken a look through the summary results.
It is possible to remove repeated statements such as Corrections appearing in print on Monday, August 28, 2017., where only the date is different.
The simplest way to do this is to check whether the statement is present in the variable itself.
Example,
# declare at the top
# create a list that consists of repetitive statements. I found 'quotation of the day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]
And then,
if all(statement not in res for statement in REMOVE_STATEMENTS):
    posts.append(res)
As for the remaining unwanted statements, there is NO way they can be differentiated, unless you search for keywords within res that you want to ignore, or they are repetitive. If you find any, just add them to the list I created.
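For instance, a minimal self-contained sketch of that filter (the sample abstracts below are made up for illustration):

REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]

def is_wanted_summary(res):
    # keep an abstract only if it contains none of the known boilerplate openers
    return all(statement not in res for statement in REMOVE_STATEMENTS)

samples = [
    "Corrections appearing in print on Monday, August 28, 2017.",
    "The senator outlined a new infrastructure proposal.",
]
print([s for s in samples if is_wanted_summary(s)])
# ['The senator outlined a new infrastructure proposal.']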

How to Modify Python Code in Order to Print Multiple Adjacent "Location" Tokens to Single Line of Output

I am new to python, and I am trying to print all of the tokens that are identified as locations in an .xml file to a .txt file using the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokenlist = soup.find_all('token')
output = ''
for x in tokenlist:
    readeachtoken = x.ner.encode_contents()
    checktoseeifthetokenisalocation = x.ner.encode_contents().find("LOCATION")
    if checktoseeifthetokenisalocation != -1:
        output += "\n%s" % x.word.encode_contents()
z = open('exercise-places.txt', 'w')
z.write(output)
z.close()
The program works, and spits out a list of all of the tokens that are locations, each of which is printed on its own line in the output file. What I would like to do, however, is to modify my program so that any time beautiful soup finds two or more adjacent tokens that are identified as locations, it can print those tokens to the same line in the output file. Does anyone know how I might modify my code to accomplish this? I would be entirely grateful for any suggestions you might be able to offer.
This question is very old, but I just got your note @Amanda, and I thought I'd post my approach to the task in case it might help others:
import glob, codecs
from bs4 import BeautifulSoup

inside_location = 0
location_string = ''

with codecs.open("washington_locations.txt", "w", "utf-8") as out:
    for i in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []
        with codecs.open(i, 'r', 'utf-8') as f:
            soup = BeautifulSoup(f.read())
            tokens = soup.findAll('token')
            for token in tokens:
                if token.ner.string == "LOCATION":
                    inside_location = 1
                    location_string += token.word.string + u" "
                else:
                    if location_string:
                        locations.append(location_string)
                        location_string = ''
        out.write(i + "\t" + "\t".join(l for l in locations) + "\n")

Creating a dynamic forum signature generator in python

I have searched and searched, but I have only found solutions involving PHP and not Python/Django. My goal is to make a website (backend coded in Python) that will allow a user to input a string. The backend script would then be run and output a dictionary with some info. What I want is to use the info from the dictionary to draw onto an image I have on the server and give the new image to the user. How can I do this offline for now? What libraries can I use? Any suggestions on the route I should take would be lovely.
I am still a novice so please forgive me if my code needs work. So far I have no errors with what I have but like I said I have no clue where to go next to achieve my goal. Any tips would be greatly appreciated.
This is sort of what I want the end goal to be http://combatarmshq.com/dynamic-signatures.html
This is what I have so far (I used Beautiful Soup as a parser, from here. If this is excessive, or if I did it in a not-so-good way, please let me know if there is a better alternative. Thanks):
The url where I'm getting the numbers I want (these are dynamic) is this: http://combatarms.nexon.net/ClansRankings/PlayerProfile.aspx?user=
The name of the player goes after user=, so an example is http://combatarms.nexon.net/ClansRankings/PlayerProfile.aspx?user=-aonbyte
This is the code with the basic functions to scrape the website:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

def get_avatar(player_name):
    '''Return the players avatar as a binary string.'''
    player_name = str(player_name)
    url = 'http://combat.nexon.net/Avatar/MyAvatar.srf?'
    url += 'GameName=CombatArms&CharacterID=' + player_name
    sock = urlopen(url)
    data = sock.read()
    sock.close()
    return data

def save_avatar(data, file_name):
    '''Saves the avatar data from get_avatar() in png format.'''
    local_file = open(file_name + '.png', 'w' + 'b')
    local_file.write(data)
    local_file.close()

def get_basic_info(player_name):
    '''Returns basic player statistics as a dictionary'''
    url = 'http://combatarms.nexon.net/ClansRankings'
    url += '/PlayerProfile.aspx?user=' + player_name
    sock = urlopen(url)
    html_raw = sock.read()
    sock.close()
    html_original_parse = BeautifulSoup(''.join(html_raw))
    player_info = html_original_parse.find('div', 'info').find('ul')
    basic_info_list = range(6)
    for i in basic_info_list:
        basic_info_list[i] = str(player_info('li', limit=7)[i+1].contents[1])
    basic_info = dict(date=basic_info_list[0], rank=basic_info_list[1], kdr=basic_info_list[2],
                      exp=basic_info_list[3], gp_earned=basic_info_list[4], gp_current=basic_info_list[5])
    return basic_info
And here is the code that tests out those functions:
from grabber import get_avatar, save_avatar, get_basic_info
player = raw_input('Player name: ')
print 'Downloading avatar...'
avatar_data = get_avatar(player)
file_name = raw_input('Save as? ')
print 'Saving avatar as ' + file_name + '.png...'
save_avatar(avatar_data, file_name)
print 'Retrieving ' + player + '\'s basic character info...'
player_info = get_basic_info(player)
print ''
print ''
print 'Info for character named ' + player + ':'
print 'Character creation date: ' + player_info['date']
print 'Rank: ' + player_info['rank']
print 'Experience: ' + player_info['exp']
print 'KDR: ' + player_info['kdr']
print 'Current GP: ' + player_info['gp_current']
print ''
raw_input('Press enter to close...')
If I understand you correctly, you want to get an image from one place, get some textual information from another place, draw text on top of the image, and then return the marked-up image. Do I have that right?
If so, get PIL, the Python Imaging Library. Both PIL and BeautifulSoup are capable of reading directly from an opened URL, so you can forget that socket nonsense. Get the player name from the HTTP request, open the image, use BeautifulSoup to get the data, use PIL's text functions to write on the image, save the image back into the HTTP response, and you're done.
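As a rough illustration, here is a minimal sketch of the drawing step with PIL (the template path, coordinates, and stats dictionary are placeholders, not your real data):

from PIL import Image, ImageDraw, ImageFont

def draw_signature(template_path, out_path, info):
    # open the signature template and draw each stat onto it
    image = Image.open(template_path)
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()  # swap in ImageFont.truetype(...) for a nicer face
    y = 10
    for key, value in info.items():
        draw.text((10, y), '%s: %s' % (key, value), fill='white', font=font)
        y += 15
    image.save(out_path)

draw_signature('template.png', 'signature.png',
               {'rank': 'Sergeant', 'kdr': '1.25', 'exp': '123456'})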
