Python - merging many URLs and parsing them

Below is a script that I found on a forum, and it is almost exactly what I need, except that I need to read about 30 different URLs and print them all together. I have tried a few options but the script just breaks. How can I merge all 30 URLs, parse them, and then print them out?
If you can help me I would be very grateful, thank you.
import sys
import string
from urllib2 import urlopen
import xml.dom.minidom
var_xml = urlopen("http://www.test.com/bla/bla.xml")
var_all = xml.dom.minidom.parse(var_xml)
def extract_content(var_all, var_tag, var_loop_count):
    return var_all.firstChild.getElementsByTagName(var_tag)[var_loop_count].firstChild.data

var_loop_count = 0
var_item = " "
while len(var_item) > 0:
    var_title = extract_content(var_all, "title", var_loop_count)
    var_date = extract_content(var_all, "pubDate", var_loop_count)
    print "Title: ", var_title
    print "Published Date: ", var_date
    print " "
    var_loop_count += 1
    try:
        var_item = var_all.firstChild.getElementsByTagName("item")[var_loop_count].firstChild.data
    except:
        var_item = ""

If this is standard RSS, I'd encourage you to use http://www.feedparser.org/ ; extracting all the items there is straightforward.
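For example, a minimal sketch with feedparser (assuming standard RSS where each item has a title and a pubDate, which feedparser normally exposes as entry.title and entry.published; the feed addresses are placeholders for your 30 real URLs):
import feedparser

urls = [
    "http://www.test.com/bla/bla.xml",
    "http://www.test.com/bla/bla2.xml",
]

for url in urls:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        print "Title: ", entry.title
        print "Published Date: ", entry.published
        print " "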

You are overwriting var_item, var_title, and var_date on each loop. Make a list, and append each var_item, var_title, and var_date to it. At the end, just print out your list.
http://docs.python.org/tutorial/datastructures.html
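Putting that together, a minimal sketch that sticks to the minidom code from the question: loop over a list of feed URLs, collect every (title, pubDate) pair into one list, and print the whole list at the end (the URLs are placeholders for your 30 real feeds):
import xml.dom.minidom
from urllib2 import urlopen

# placeholder addresses; replace with your 30 feed URLs
urls = [
    "http://www.test.com/bla/bla.xml",
    "http://www.test.com/bla/other.xml",
]

all_items = []
for url in urls:
    dom = xml.dom.minidom.parse(urlopen(url))
    for item in dom.getElementsByTagName("item"):
        title = item.getElementsByTagName("title")[0].firstChild.data
        date = item.getElementsByTagName("pubDate")[0].firstChild.data
        all_items.append((title, date))

# print everything together once all feeds have been read
for title, date in all_items:
    print "Title: ", title
    print "Published Date: ", date
    print " "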

Related

How to open a file and delete the first item (item at index 0)

At the moment I am working on a plugin for a chat bot for Twitch.
I have the following working so far, so that I am able to add items to a file.
# Variables
f = open("Tank_request_list.txt", "a+")
fr = open("Tank_request_list.txt", "r")
tr = "EBR"  # test input
tank_request = fr.read()
treq = tank_request.split("#")

with open("Tank_request_list.txt") as fr:
    empty = fr.read(1)

if not empty:
    f.write(tr)
    f.close()  # close() needs parentheses to actually be called
else:
    tr = "#" + tr
    f.write(tr)
    f.close()
I now need to work out how to delete an item at index 0.
I also have this piece of code I need to implement:
# List Length
list_length = len(treq)
print "There are %d tanks in the queue." % (list_length)
# Next 5 Requests
print "Next 5 Requests are:"
def tank_lst(x):
    for i in range(5):
        print "- " + x[i]
# Run Tank_request
tank_lst(treq)
The following returns the right answer but does not write it back to the file.
def del_played(tank):
    del tank[0]
    return tank

tanks = treq
print del_played(tanks)
First, remove the content.
Use the truncate function to remove the existing content from the file, then write the new list back into it.
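A minimal sketch of that idea, assuming the same "#"-separated Tank_request_list.txt file as in the question (del_played here is a reworked, file-writing version of the function above):
def del_played(filename="Tank_request_list.txt"):
    # read the current queue
    with open(filename, "r") as f:
        tanks = f.read().split("#")
    if tanks:
        del tanks[0]  # drop the item at index 0
    # remove the old content, then write the updated queue back
    with open(filename, "r+") as f:
        f.truncate(0)
        f.write("#".join(tanks))
    return tanks

print del_played()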

NYT summary extractor python2

I am trying to access the summaries of NYT articles using the NewsWire API and Python 2.7. Here is the code:
from urllib2 import urlopen
import urllib2
from json import loads
import codecs
import time
import newspaper

posts = list()
articles = list()
i = 30
keys = dict()
count = 0
offset = 0
while offset < 40000:
    if len(posts) >= 30000:
        break
    if 700 < offset < 800:
        offset = offset + 100
    # for p in xrange(100):
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset=" + str(offset) + "&api-key=ACCESSKEY"
        data = loads(urlopen(url).read())
        print str(len(posts)) + " offset=" + str(offset)
        if posts and articles and keys:
            outfile = open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            outfile = open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            indexfile = open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + " " + str(keys[x]))
            indexfile.close()
        for item in data["results"]:
            if ('url' in item) & ('abstract' in item):
                url = item["url"]
                abst = item["abstract"]
                if url not in keys.values():
                    keys[count] = url
                    article = newspaper.Article(url)
                    article.download()
                    article.parse()
                    try:
                        el_post = article.text.replace('\n\n', ' ').replace("Advertisement Continue reading the main story", '')
                    except XMLSyntaxError, e:
                        continue
                    articles.append(el_post)
                    count = count + 1
                    res = abst  # url + " " + abst
                    # print res.encode("utf-8")
                    posts.append(res)  # Here is the appending statement.
                    if len(posts) >= 30000:
                        break
    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    except urllib2.URLError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    offset = offset + 19
print str(len(posts))
print str(len(keys))
I was getting good summaries, but sometimes I came across some weird sentences as part of the summary. Here are some examples:
Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.
which are considered to be the summary of some article. Kindly help me extract a proper summary of each article from the NYT news. I thought of using the titles when such cases arise, but the titles are weird too.
So, I have taken a look through the summary results.
It is possible to remove repeated statements such as "Corrections appearing in print on Monday, August 28, 2017.", where only the date is different.
The simplest way to do this is to check whether the statement is present in the variable itself.
Example,
# declare at the top
# create a list that consists of repetitive statements. I found 'quotation of the day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]
And then,
# all() is needed here: a bare generator expression in an if would always be truthy
if all(statement not in res for statement in REMOVE_STATEMENTS):
    posts.append(res)
As for the remaining unwanted statements, there is NO way they can be differentiated, unless you search for keywords within res that you want to ignore, or they are repetitive. If you find any, just simply add them to the list I created.
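For instance, a minimal sketch of how the check behaves (the candidate abstracts below are made up purely for illustration):
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]

# made-up abstracts, only to show which ones survive the filter
candidates = [
    "Corrections appearing in print on Monday, August 28, 2017.",
    "The mayor announced a new transit plan for the city.",
]

posts = []
for res in candidates:
    if all(statement not in res for statement in REMOVE_STATEMENTS):
        posts.append(res)

print posts  # only the second abstract is kept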

Python IndexError: list index out of range when using iterations

I've been trying to download screenshots from the App Store and here's my code (I'm a beginner).
The problem I encounter is "list index out of range" at line 60 (screenshotList = data["results"][resultCounter]["screenshotUrls"]).
The thing is that sometimes the search API returns 0 results for the search term used, and the script then breaks because "resultCount" is 0.
I'm not sure what else it could be, nor how I can fix it. Any help?
# Required libraries
import urllib
import string
import random
import json
import time

""" screenshotCounter is used so that all screenshots have a different name
resultCounter is used to go from result to result in downloaded JSON file
"""
screenshotCounter = 0
resultCounter = 0

""" Create three random letters as search term on App Store
Download JSON results file
Shows used search term
"""
searchTerm = (''.join(random.choice(string.ascii_lowercase) for i in range(3)))
urllib.urlretrieve("https://itunes.apple.com/search?country=us&entity=software&limit=3&term=" + str(searchTerm), "download.txt")
print "Used search term: " + str(searchTerm)

# Function to download screenshots + give it a name + confirmation msg
def download_screenshot(screenshotLink, screenshotName):
    urllib.urlretrieve(screenshotLink, screenshotName)
    print "Downloaded with success:" + str(screenshotName)

# Opens newly downloaded JSON file
with open('download.txt') as data_file:
    data = json.load(data_file)

""" Get the first list of screenshots from stored JSON file,
resultCounter = 0 on first iteration
"""
screenshotList = data["results"][resultCounter]["screenshotUrls"]

# Gives the number of found results and serves as iteration limit
iterationLimit = data["resultCount"]

# Prints the number of found results
print str(iterationLimit) + " results found."

""" Change the number of iterations to the number of results, which will be
different for every request, minus 1 since indexing starts at 0
"""
iterations = [0] * iterationLimit

""" For each iteration (number of results), find each screenshot in the
screenshotList, name it, download it. Then change result to find the next
screenshotList and change screenshotList variable.
"""
for number in iterations:
    for screenshotLink in screenshotList:
        screenshotName = "screenshot" + str(screenshotCounter) + ".jpeg"
        download_screenshot(screenshotLink, screenshotName)
        screenshotCounter = screenshotCounter + 1
    resultCounter = resultCounter + 1
    screenshotList = data["results"][resultCounter]["screenshotUrls"]
    # Sleeping to avoid crash
    time.sleep(1)
I rewrote your code to check for the presence of results before trying anything. If there aren't any, it goes back through the loop with a new search term. If there are, it will stop at the end of that iteration.
# Required libraries
import urllib
import string
import random
import json
import time

# Function to download screenshots + give it a name + confirmation msg
def download_screenshot(screenshotLink, screenshotName):
    urllib.urlretrieve(screenshotLink, screenshotName)
    print "Downloaded with success:" + str(screenshotName)

success = False
while success == False:
    """ Create three random letters as search term on App Store
    Download JSON results file
    Shows used search term
    """
    searchTerm = (''.join(random.choice(string.ascii_lowercase) for i in range(3)))
    urllib.urlretrieve("https://itunes.apple.com/search?country=us&entity=software&limit=3&term=" + str(searchTerm), "download.txt")
    print "Used search term: " + str(searchTerm)

    # Opens newly downloaded JSON file
    with open('download.txt') as data_file:
        data = json.load(data_file)

    """ Get the first list of screenshots from stored JSON file,
    resultCounter = 0 on first iteration
    """
    resultCount = len(data["results"])
    if resultCount == 0:
        continue  # if no results, skip to the next loop
    success = True
    print str(resultCount) + " results found."

    for j, resultList in enumerate(data["results"]):
        screenshotList = resultList["screenshotUrls"]
        """ For each iteration (number of results), find each screenshot in the
        screenshotList, name it, download it. Then change result to find the next
        screenshotList and change screenshotList variable.
        """
        for i, screenshotLink in enumerate(screenshotList):
            screenshotName = "screenshot" + str(i) + '_' + str(j) + ".jpeg"
            download_screenshot(screenshotLink, screenshotName)
            # Sleeping to avoid crash
            time.sleep(1)
Have you tried:
try:
    for screenshotLink in screenshotList:
        screenshotName = "screenshot" + str(screenshotCounter) + ".jpeg"
        download_screenshot(screenshotLink, screenshotName)
        screenshotCounter = screenshotCounter + 1
except IndexError:
    pass
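Another option that stays close to the original code is to check resultCount before indexing into data["results"] at all; a minimal sketch, assuming the same download.txt produced by the search request:
import json

with open('download.txt') as data_file:
    data = json.load(data_file)

if data["resultCount"] == 0:
    print "No results for this search term, pick another one."
else:
    # safe to index now: there is at least one result
    screenshotList = data["results"][0]["screenshotUrls"]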

How to Modify Python Code in Order to Print Multiple Adjacent "Location" Tokens to Single Line of Output

I am new to python, and I am trying to print all of the tokens that are identified as locations in an .xml file to a .txt file using the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokenlist = soup.find_all('token')
output = ''
for x in tokenlist:
    readeachtoken = x.ner.encode_contents()
    checktoseeifthetokenisalocation = x.ner.encode_contents().find("LOCATION")
    if checktoseeifthetokenisalocation != -1:
        output += "\n%s" % x.word.encode_contents()

z = open('exercise-places.txt', 'w')
z.write(output)
z.close()
The program works, and spits out a list of all of the tokens that are locations, each of which is printed on its own line in the output file. What I would like to do, however, is to modify my program so that any time Beautiful Soup finds two or more adjacent tokens that are identified as locations, it prints those tokens on the same line in the output file. Does anyone know how I might modify my code to accomplish this? I would be grateful for any suggestions you might be able to offer.
This question is very old, but I just got your note @Amanda and I thought I'd post my approach to the task in case it might help others:
import glob, codecs
from bs4 import BeautifulSoup

inside_location = 0
location_string = ''

with codecs.open("washington_locations.txt", "w", "utf-8") as out:
    for i in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []
        with codecs.open(i, 'r', 'utf-8') as f:
            soup = BeautifulSoup(f.read())
            tokens = soup.findAll('token')
            for token in tokens:
                if token.ner.string == "LOCATION":
                    inside_location = 1
                    location_string += token.word.string + u" "
                else:
                    if location_string:
                        locations.append(location_string)
                        location_string = ''
        out.write(i + "\t" + "\t".join(l for l in locations) + "\n")
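Another way to get the same grouping, sketched with itertools.groupby (it relies on the same token.ner.string / token.word.string structure as above and on the file names from the question):
import codecs
from itertools import groupby
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokens = soup.find_all('token')

lines = []
# groupby clusters consecutive tokens with the same key, so adjacent
# LOCATION tokens end up in one group and hence on one output line
for is_location, group in groupby(tokens, key=lambda t: t.ner.string == "LOCATION"):
    if is_location:
        lines.append(u" ".join(t.word.string for t in group))

with codecs.open('exercise-places.txt', 'w', 'utf-8') as out:
    out.write(u"\n".join(lines))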

How do I replace a specific part of a string in Python

I am trying to scrape Good.is. The code as it stands gives me the regular image (turn the if statement to True), but I want the higher-res picture. I was wondering how I would replace certain text so that I could download the high-res picture. I want to change the URL http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html to http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html (the end is different). My code is:
import os, urllib, urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser

parser = HTMLParser.HTMLParser()

# make folder.
folderName = 'Good.is'
if not os.path.exists(folderName):
    os.makedirs(folderName)

list = []
# Python ranges start from the first argument and iterate up to one
# less than the second argument, so we need 36 + 1 = 37
for i in range(1, 37):
    list.append("http://www.good.is/infographics/page:" + str(i) + "/sort:recent/range:all")

listIterator1 = []
listIterator1[:] = range(0, 37)

counter = 0
for x in listIterator1:
    soup = BeautifulSoup(urllib2.urlopen(list[x]).read())
    body = soup.findAll("ul", attrs={'id': 'gallery_list_elements'})
    number = len(body[0].findAll("p"))
    listIterator = []
    listIterator[:] = range(0, number)
    for i in listIterator:
        paragraphs = body[0].findAll("p")
        nextArticle = body[0].findAll("a")[2]
        text = body[0].findAll("p")[i]
        if len(paragraphs) > 0:
            # print image['src']
            counter += 1
            print counter
            print parser.unescape(text.getText())
            print "http://www.good.is" + nextArticle['href']
            originalArticle = "http://www.good.is" + nextArticle['href']
            article = BeautifulSoup(urllib2.urlopen(originalArticle).read())
            title = article.findAll("div", attrs={'class': 'title_and_image'})
            getTitle = title[0].findAll("h1")
            article1 = article.findAll("div", attrs={'class': 'body'})
            articleImage = article1[0].find("p")
            betterImage = articleImage.find("a")
            articleImage1 = articleImage.find("img")
            paragraphsWithinSection = article1[0].findAll("p")
            print betterImage['href']
            if len(paragraphsWithinSection) > 1:
                articleText = article1[0].findAll("p")[1]
            else:
                articleText = article1[0].findAll("p")[0]
            print articleImage1['src']
            print parser.unescape(getTitle)
            if not articleText is None:
                print parser.unescape(articleText.getText())
            print '\n'
            link = articleImage1['src']
            x += 1
            actually_download = False
            if actually_download:
                filename = link.split('/')[-1]
                urllib.urlretrieve(link, filename)
Have a look at str.replace. If that isn't general enough to get the job done, you'll need to use a regular expression (re -- probably re.sub).
>>> str1="http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
>>> str1.replace("flash","flat")
'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html'
I think the safest and easiest way is to use a regular expression:
import re
url = 'http://www.google.com/this/is/sample/url/flash.html'
newUrl = re.sub(r'flash\.html$', 'flat.html', url)
The "$" means only match the end of the string. This solution will behave correctly even in the (admittedly unlikely) event that your url includes the substring "flash.html" somewhere other than the end, and also leaves the string unchanged (which I assume is the correct behavior) if it does not end with 'flash.html'.
See: http://docs.python.org/library/re.html#re.sub
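For instance (an interactive illustration, not from the original answer; the second URL is made up to show the non-matching case):
>>> import re
>>> re.sub(r'flash\.html$', 'flat.html', 'http://www.google.com/this/is/sample/url/flash.html')
'http://www.google.com/this/is/sample/url/flat.html'
>>> re.sub(r'flash\.html$', 'flat.html', 'http://example.com/flash.html/archive/index.html')
'http://example.com/flash.html/archive/index.html'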
@mgilson has a good solution, but the problem is that it will replace all occurrences of the string with the replacement; so if you have the word "flash" as part of the URL (and not just in the trailing file name), you'll have multiple replacements:
>>> str = 'hello there hello'
>>> str.replace('hello','world')
'world there world'
An alternate solution is to replace the last part after / with flat.html:
>>> url = 'http://www.google.com/this/is/sample/url/flash.html'
>>> url[:url.rfind('/')+1]+'flat.html'
'http://www.google.com/this/is/sample/url/flat.html'
Using urlparse you can do a few bits and bobs:
from urlparse import urlsplit, urlunsplit, urljoin

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'
url = urlsplit(s)
head, tail = url.path.rsplit('/', 1)
# the trailing slash matters: without it, urljoin would drop the last path segment
new_path = urljoin(head + '/', 'flat.html')
print urlunsplit(url._replace(path=new_path))
