I am trying to access the summaries of NYT articles using the NewsWire API and Python 2.7. Here is the code:
from urllib2 import urlopen
import urllib2
from json import loads
import codecs
import time
import newspaper
posts = list()
articles = list()
i=30
keys= dict()
count=0
offset=0
while(offset<40000):
    if(len(posts)>=30000): break
    if(700<offset<800):
        offset=offset + 100
    #for p in xrange(100):
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset="+str(offset)+"&api-key=ACCESSKEY"
        data= loads(urlopen(url).read())
        print str(len(posts) )+ " offset=" + str(offset)
        if posts and articles and keys:
            outfile= open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            outfile= open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            indexfile=open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + " " + str(keys[x]))
            indexfile.close()
        for item in data["results"]:
            if(('url' in item) & ('abstract' in item)) :
                url= item["url"]
                abst=item["abstract"]
                if(url not in keys.values()):
                    keys[count]=url
                    article = newspaper.Article(url)
                    article.download()
                    article.parse()
                    try:
                        el_post = article.text.replace('\n\n',' ').replace("Advertisement Continue reading the main story",'')
                    except XMLSyntaxError, e:
                        continue
                    articles.append(el_post)
                    count=count + 1
                    res= abst # url + " " + abst
                    # print res.encode("utf-8")
                    posts.append(res) # Here is the appending statement.
                    if(len(posts)>=30000):
                        break
    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset=offset + 21
        continue
    except urllib2.URLError,e:
        print e
        time.sleep(1)
        offset=offset + 21
        continue
    offset=offset + 19
print str(len(posts))
print str(len(keys))
I was getting good summaries, but sometimes I came across some weird sentences as part of the summary. Here are some examples:
Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.
which are considered to be the summary of some article. Kindly help me extract a proper summary of each article from the NYT news. I thought of using the titles if such a case arises, but the titles are weird too.
So, I have taken a look through the summary results.
It is possible to remove repeated statements such as Corrections appearing in print on Monday, August 28, 2017., where only the date is different.
The simplest way to do this is to check whether the statement is present in the variable itself.
For example:
# declare at the top
# create a list that consists of repetitive statements. I found 'quotation of the day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]
And then,
# a bare generator expression is always truthy, so wrap the test in any()
if not any(statement in res for statement in REMOVE_STATEMENTS):
    posts.append(res)
As for the remaining unwanted statements, there is NO way they can be differentiated, unless you search for keywords within res that you want to ignore, or they are repetitive. If you find any, just simply add them to the list I created.
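For statements where only the date differs, a regular expression may be tighter than a plain substring test. The following is a minimal sketch in Python 3, not anything the NYT API provides: `BOILERPLATE_PATTERNS`, `is_boilerplate` and `filter_abstracts` are hypothetical names, and the patterns are assumptions based only on the examples quoted in this thread.

```python
import re

# Hypothetical patterns, based only on the examples quoted in this thread.
BOILERPLATE_PATTERNS = [
    # e.g. "Corrections appearing in print on Monday, August 28, 2017."
    re.compile(r"^Corrections appearing in print on \w+, \w+ \d{1,2}, \d{4}\.?$"),
    # e.g. "Quotation of the Day for ..." (prefix match only)
    re.compile(r"^Quotation of the Day for\b"),
    # Accepts a curly (U+2019) or straight apostrophe in "Here's".
    re.compile(r"^Here[\u2019']s what you need to know to start your day\.?$"),
]


def is_boilerplate(abstract):
    """Return True if the abstract matches any known boilerplate pattern."""
    text = abstract.strip()
    return any(pattern.search(text) for pattern in BOILERPLATE_PATTERNS)


def filter_abstracts(abstracts):
    """Keep only the abstracts that do not look like boilerplate."""
    return [a for a in abstracts if not is_boilerplate(a)]
```

In the loop from the question, this would amount to calling `is_boilerplate(res)` before `posts.append(res)`; new unwanted statements still have to be added to the pattern list by hand.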
I am a newbie with Python and pandas, but I would like to parse multiple downloaded files (which all have the same format).
In every HTML file there is a section like the one below where the executives are mentioned.
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
Further down in the files there is a section, <DIV id=article_qanda class="content_part hid">, where an executive such as Ori Shilo is named, followed by their answer, like:
<P><STRONG><SPAN class=answer>Ori Shilo</SPAN></STRONG></P>
<P>Good morning, Vernon. Both safety which is obvious and fertility analysis under the charter of the data and safety monitoring board will be - will be on up.</P>
So far I have only succeeded in using an HTML parser to collect all the answers of one individual, selected by name. I am not sure how to proceed and base the code on a variable list of executives. Does someone have a suggestion?
import textwrap
import os
from bs4 import BeautifulSoup
directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')
            print('{:<30} {:<70}'.format('Name', 'Answer'))
            print('-' * 101)
            for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
                txt = answer.get_text(strip=True)
                s = answer.find_next_sibling()
                while s:
                    if s.name == 'strong' or s.find('strong'):
                        break
                    if s.name == 'p':
                        txt += ' ' + s.get_text(strip=True)
                    s = s.find_next_sibling()
                txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
                print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a"))
To give some color to my original comment, I'll use a simple example. Let's say you've got some code that is looking for the string "Hello, World!" in a file, and you want the line numbers to be aggregated into a list. Your first attempt might look like:
# where I will aggregate my results
line_numbers = []
with open('path/to/file.txt') as fh:
    for num, line in enumerate(fh):
        if 'Hello, World!' in line:
            line_numbers.append(num)
This code snippet works perfectly well. However, it only works to check 'path/to/file.txt' for 'Hello, World!'.
Now, you want to be able to change the string you are looking for. This is analogous to saying "I want to check for different executives". You could use a function to do this. A function allows you to add flexibility into a piece of code. In this simple example, I would do:
# Now I'm checking for a parameter string_to_search
# that I can change when I call the function
def match_in_file(string_to_search):
    line_numbers = []
    with open('path/to/file.txt') as fh:
        for num, line in enumerate(fh):
            if string_to_search in line:
                line_numbers.append(num)
    return line_numbers
# now I'm just calling that function here
line_numbers = match_in_file("Hello, World!")
You'd still have to make a code change, but this becomes much more powerful if you want to search for lots of strings. I could feasibly use this function in a loop (though I would do things a little differently in practice). For the sake of the example, I now have the power to do:
list_of_strings = [
"Hello, World!",
"Python",
"Functions"
]
for s in list_of_strings:
    line_numbers = match_in_file(s)
    print(f"Found {s} on lines ", *line_numbers)
Generalized to your specific problem, you'll want a parameter for the executive that you want to search for. Your function signature might look like:
def find_executive(soup, executive):
    # the name is quoted inside :contains(), as in the original hard-coded selector
    for answer in soup.select(f'p:contains("Question-and-Answer Session") ~ strong:contains("{executive}") + p'):
        # rest of code
You've already read in the soup, so you don't need to do that again. You only need to change the executive in your select statement. The reason you want a parameter for soup is so you aren't relying on variables in global scope.
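To drive this from a variable list of executives, the names could be taken from the "Name - Title" lines of the participants section and turned into one selector each. This is only a sketch under those assumptions: `parse_participant` and `build_selector` are hypothetical helpers, the " - " separator is assumed from the sample HTML, and the selector simply mirrors the one in the question, with the name quoted.

```python
def parse_participant(line):
    """Split a 'Name - Title' line into (name, title).

    Assumes the ' - ' separator seen in the sample HTML; a line without
    it is treated as a bare name with an empty title.
    """
    name, _, title = line.partition(" - ")
    return name.strip(), title.strip()


def build_selector(executive):
    """Build the CSS selector from the question for one executive name."""
    return (
        'p:contains("Question-and-Answer Session") ~ '
        f'strong:contains("{executive}") + p'
    )


# Lines as they appear in the participants section of the sample file.
participants = [
    "Dror Ben Asher - CEO",
    "Ori Shilo - Deputy CEO, Finance and Operations",
    "Guy Goldberg - Chief Business Officer",
]

# One selector per executive, keyed by name.
selectors = {}
for line in participants:
    name, title = parse_participant(line)
    selectors[name] = build_selector(name)
```

Each name could then be passed to `find_executive(soup, name)` in a loop, so no executive has to be hard-coded.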
@C.Nivs Would my code then be the following? Because I now get a block error:
import textwrap
import os
from bs4 import BeautifulSoup
directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/'
for filename in os.listdir(directory):
if filename.endswith('.html'):
fname = os.path.join(directory,filename)
with open(fname, 'r') as f:
soup = BeautifulSoup(f.read(),'html.parser')
print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)
def find_executive(soup, executive):
for answer in soup.select(f'p:contains("Question-and-Answer Session") ~ strong:contains({executive}) + p'):
txt = answer.get_text(strip=True)
s = answer.find_next_sibling()
while s:
if s.name == 'strong' or s.find('strong'):
break
if s.name == 'p':
txt += ' ' + s.get_text(strip=True)
s = s.find_next_sibling()
txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
print('{:<30} {:<70}'.format(func, txt), file=open("output.txt", "a"))
At the moment I am working on a plugin for a Twitch chat bot.
I have this working so far, so I am able to add items to a file:
# Variables
f = open("Tank_request_list.txt","a+")
fr = open("Tank_request_list.txt","r")
tr = "EBR" # test input
tank_request = fr.read()
treq = tank_request.split("#")
with open("Tank_request_list.txt") as fr:
    empty = fr.read(1)
    if not empty:
        f.write(tr)
        f.close()  # close must be called; a bare f.close does nothing
    else:
        tr = "#" + tr
        f.write(tr)
        f.close()
I now need to work out how to delete the item at index 0.
I also have this piece of code that I need to implement:
# List Length
list_length = len(treq)
print "There are %d tanks in the queue." % (list_length)
# Next 5 Requests
print "Next 5 Requests are:"
def tank_lst(x):
    for i in range(5):
        print "- " + x[i]
# Run Tank_request
tank_lst(treq)
The following will return the right answer but not write it.
def del_played(tank):
    del tank[0]
    return tank
tanks = treq
print del_played(tanks)
First, remove the existing content: use the file's truncate function to clear the file, then write the new list back into it.
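A minimal sketch of that idea, assuming the file holds the queue as a single "#"-separated string as in the question; `remove_first_request` is a hypothetical helper name. It reads the items, drops index 0, then rewinds, truncates, and writes the rest back.

```python
def remove_first_request(path):
    """Remove the first '#'-separated item from the file and return it.

    Returns None if the file is empty.
    """
    with open(path, "r+") as fh:
        content = fh.read()
        if not content:
            return None
        items = content.split("#")
        played = items.pop(0)
        fh.seek(0)       # go back to the start of the file
        fh.truncate()    # remove the old content
        fh.write("#".join(items))
    return played
```

Calling `remove_first_request("Tank_request_list.txt")` after a tank has been played would then both return that tank and persist the shortened queue.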
I've got these two code snippets from a webinar (slides 7 and 8 respectively).
The first one finds a desired URL. I have tested it and it works:
def SECdownload(year, month):
    import os
    from urllib.request import urlopen
    from urllib.error import HTTPError  # needed by the except clause below
    root = None
    feedFile = None
    feedData = None
    good_read = False
    itemIndex = 0
    edgarFilingsFeed = 'http://www.sec.gov/Archives/edgar/monthly/xbrlrss-' + str(year) + '-' + str(month).zfill(2) + '.xml'
    print( edgarFilingsFeed )
    if not os.path.exists( "sec/" + str(year) ):
        os.makedirs( "sec/" + str(year) )
    if not os.path.exists( "sec/" + str(year) + '/' + str(month).zfill(2) ):
        os.makedirs( "sec/" + str(year) + '/' + str(month).zfill(2) )
    target_dir = "sec/" + str(year) + '/' + str(month).zfill(2) + '/'
    try:
        feedFile = urlopen( edgarFilingsFeed ) # urlopen will not work (python 3) needs from urllib.request import urlopen
        try:
            feedData = feedFile.read()
            good_read = True
        finally:
            feedFile.close()
    except HTTPError as e:
        print( "HTTP Error:", e.code )
and the second one is supposed to parse the RSS Feed to find ZIP filenames:
#Downloading the data - parsing the RSS feed to extract the ZIP file enclosure filename
# Process RSS feed and walk through all items contained
for item in feed.entries:
    print( item[ "summary" ], item[ "title" ], item[ "published" ] )
    try:
        # Identify ZIP file enclosure, if available
        enclosures = [ l for l in item[ "links" ] if l[ "rel" ] == "enclosure" ]
        if ( len( enclosures ) > 0 ):
            # ZIP file enclosure exists, so we can just download the ZIP file
            enclosure = enclosures[0]
            sourceurl = enclosure[ "href" ]
            cik = item[ "edgar_ciknumber" ]
            targetfname = target_dir+cik +' - ' +sourceurl.split('/')[-1]
            retry_counter = 3
            while retry_counter > 0:
                good_read = downloadfile( sourceurl, targetfname )
                if good_read:
                    break
                else:
                    print( "Retrying:", retry_counter )
                    retry_counter -= 1
    except:
        pass
However, whenever I try to run the second snippet I get:
Traceback (most recent call last):
File "E:\Py_env\df2.py", line 3, in <module>
for item in feed.entries:
NameError: name 'feed' is not defined
What am I not understanding in the webinar? And if I must define feed, I have no idea how to do it while keeping a logical link to the data the first code snippet provides!
(As a side note, this is a webinar from a reputable software vendor, so how is it possible for it to have mistakes? There must be something I am doing wrong...)
The problem is like the error message implies: you haven't defined any variable named feed that's in scope when the second snippet executes. Either their code omitted something, or you missed a part that was crucial.
That aside, the formatting on this code is really dodgy and not at all idiomatic Python. You're probably better off looking for a new snippet.
Migrated from a comment.
As your output is showing you and you noticed, feed wasn't defined nor was it shown to you in the slides. It looks as though the slide share is expecting you to make a logical jump, and they do point out in the right column that feedparser is an easy way to parse... feeds (RSS Feeds).
So they are expecting that you can adapt the feedData you found in your first function and can dump it into a method from feedparser.
As you can see in various examples online (such as the docs), this can be done from the string you got:
>>> import feedparser
>>> rawdata = """<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>"""
>>> d = feedparser.parse(rawdata)
>>> d['feed']['title']
u'Sample Feed'
Using that, I bet you can see where it goes (rather than me telling you).
As @PatrickCollins pointed out, these are kind of crappy examples for Python, but that shouldn't get in your way as you're learning it.
I am new to python, and I am trying to print all of the tokens that are identified as locations in an .xml file to a .txt file using the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokenlist = soup.find_all('token')
output = ''
for x in tokenlist:
    readeachtoken = x.ner.encode_contents()
    checktoseeifthetokenisalocation = x.ner.encode_contents().find("LOCATION")
    if checktoseeifthetokenisalocation != -1:
        output += "\n%s" % x.word.encode_contents()
z = open('exercise-places.txt','w')
z.write(output)
z.close()
The program works, and spits out a list of all of the tokens that are locations, each of which is printed on its own line in the output file. What I would like to do, however, is to modify my program so that any time beautiful soup finds two or more adjacent tokens that are identified as locations, it can print those tokens to the same line in the output file. Does anyone know how I might modify my code to accomplish this? I would be entirely grateful for any suggestions you might be able to offer.
This question is very old, but I just got your note @Amanda and I thought I'd post my approach to the task in case it might help others:
import glob, codecs
from bs4 import BeautifulSoup
inside_location = 0
location_string = ''
with codecs.open("washington_locations.txt","w","utf-8") as out:
    for i in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []
        with codecs.open(i,'r','utf-8') as f:
            soup = BeautifulSoup(f.read())
            tokens = soup.findAll('token')
            for token in tokens:
                if token.ner.string == "LOCATION":
                    # extend the current run of adjacent location tokens
                    inside_location = 1
                    location_string += token.word.string + u" "
                else:
                    # a non-location token ends the current run, if any
                    if location_string:
                        locations.append( location_string )
                        location_string = ''
            # flush a run that reaches the end of the file
            if location_string:
                locations.append( location_string )
                location_string = ''
        out.write( i + "\t" + "\t".join(l for l in locations) + "\n" )
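The same grouping can also be expressed with `itertools.groupby`, which handles a location at the very end of the token list without extra bookkeeping. This is a sketch under the assumption that the tokens have already been reduced to `(word, ner)` pairs, for example from `token.word.string` and `token.ner.string`; `group_locations` and the sample tokens are hypothetical.

```python
from itertools import groupby


def group_locations(tokens):
    """Join runs of adjacent LOCATION tokens into single strings.

    `tokens` is an iterable of (word, ner_label) pairs.
    """
    locations = []
    # groupby yields one group per run of consecutive equal keys
    for is_location, run in groupby(tokens, key=lambda t: t[1] == "LOCATION"):
        if is_location:
            locations.append(" ".join(word for word, _ in run))
    return locations


# Made-up sample tokens for illustration.
tokens = [
    ("He", "O"), ("left", "O"),
    ("New", "LOCATION"), ("York", "LOCATION"),
    ("for", "O"),
    ("Boston", "LOCATION"),
]

# One line per run of adjacent location tokens.
print("\n".join(group_locations(tokens)))
```

Because grouping is by run rather than by token, "New" and "York" end up on the same line while "Boston" gets its own.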
Below is a script that I found on a forum, and it is almost exactly what I need, except that I need to read about 30 different URLs and print them all together. I have tried a few options but the script just breaks. How can I merge all 30 URLs, parse them, and then print them out?
If you can help me I would be very grateful, thank you.
import sys
import string
from urllib2 import urlopen
import xml.dom.minidom
var_xml = urlopen("http://www.test.com/bla/bla.xml")
var_all = xml.dom.minidom.parse(var_xml)
def extract_content(var_all, var_tag, var_loop_count):
    return var_all.firstChild.getElementsByTagName(var_tag)[var_loop_count].firstChild.data
var_loop_count = 0
var_item = " "
while len(var_item) > 0:
    var_title = extract_content(var_all, "title", var_loop_count)
    var_date = extract_content(var_all, "pubDate", var_loop_count)
    print "Title: ", var_title
    print "Published Date: ", var_date
    print " "
    var_loop_count += 1
    try:
        var_item = var_all.firstChild.getElementsByTagName("item")[var_loop_count].firstChild.data
    except:
        var_item = ""
If this is standard RSS, I'd encourage you to use http://www.feedparser.org/ ; extracting all items there is straightforward.
You are overwriting var_item, var_title and var_date on each loop. Make a list, and append each var_item, var_title and var_date to it. At the end, just print out your list.
http://docs.python.org/tutorial/datastructures.html
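A minimal sketch of the list-based suggestion, written in Python 3 and without network access: `extract_items` is a hypothetical helper that takes RSS text that has already been downloaded (for example with `urlopen(url).read()`), and the two inline feeds are made-up stand-ins for the 30 URLs. It assumes every `item` element has a `title` and a `pubDate` child.

```python
import xml.dom.minidom


def extract_items(xml_text):
    """Return a list of (title, pubDate) tuples, one per <item>."""
    dom = xml.dom.minidom.parseString(xml_text)
    results = []
    for item in dom.getElementsByTagName("item"):
        title = item.getElementsByTagName("title")[0].firstChild.data
        date = item.getElementsByTagName("pubDate")[0].firstChild.data
        results.append((title, date))
    return results


# Stand-ins for the downloaded feeds (hypothetical content).
feeds = [
    "<rss><channel><title>Feed A</title>"
    "<item><title>First</title><pubDate>Mon, 01 Jan 2018</pubDate></item>"
    "</channel></rss>",
    "<rss><channel><title>Feed B</title>"
    "<item><title>Second</title><pubDate>Tue, 02 Jan 2018</pubDate></item>"
    "<item><title>Third</title><pubDate>Wed, 03 Jan 2018</pubDate></item>"
    "</channel></rss>",
]

# Collect the items of every feed in one list, then print at the end.
all_items = []
for xml_text in feeds:
    all_items.extend(extract_items(xml_text))

for title, date in all_items:
    print("Title:", title)
    print("Published Date:", date)
```

Iterating over the `item` elements directly also avoids the index bookkeeping of the original loop, where the channel-level `title` shifts the indices of the item titles.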