Circumventing Python errors in a script - python

I have a large file containing thousands of links. I've written a script that calls each link line by line and performs various analyses on the respective webpage. However, sometimes the link is faulty (the article has been removed from the website, etc.), and my whole script just stops at that point.
Is there a way to circumvent this problem? Here's my (pseudo)code:
for row in file:
    url = row[4]
    req = urllib2.Request(url)
    tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    # perform analyses
    # append analyses results to lists
# output data
I have tried adding a bare
    except:
        pass
but it royally messes up the script for some reason.

Works for me:
from urllib2 import URLError

for row in file:
    url = row[4]
    try:
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
        # perform analyses
        # append analyses results to lists
    except URLError, e:
        pass
# output data

Try block is the way to go:
from urllib2 import URLError

for row in file:
    url = row[4]
    try:
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except URLError, e:
        continue
    # perform analyses
    # append analyses results to lists
# output data
continue lets you skip any unnecessary computation after the URL check and restart at the next iteration of the loop.
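As a small variant (an extension, not part of either answer), the faulty links can be collected instead of silently skipped, so you can see afterwards which articles were removed:

from urllib2 import URLError
import urllib2
import lxml.html

failed_urls = []  # links that could not be fetched

for row in file:
    url = row[4]
    try:
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except URLError:
        failed_urls.append(url)  # remember the faulty link and move on
        continue
    # perform analyses
    # append analyses results to lists

# output data, plus the list of links that were skipped
print failed_urls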

Related

Python: Attempting to scrape multiple (Similar) websites for (Similar) data

I have a script that retrieves information from HPE's website regarding switches. The script works just fine and outputs the information into a CSV file. However, now I need to loop the script through 30 different switches.
I have a list of URLs that are stored in a CSV document. Here are a few examples.
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4813A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4903A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9019B
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9022A
in my code, I bind 'URL' to one of these links, which pushes that through the code to retrieve the information I need.
Here is my full code:
url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?
ProductNumber=J9775A"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
item = []
for val in row.find_all('td'):
item.append(val.text.encode('utf8').strip())
rows.append(item)
with open('c:\source\output_file.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerow(headers)
writer.writerows(rows)
I am trying to find the best way to automate this, as this script needs to run at least once a week. It will need to output to one CSV file that is overwritten every time. That CSV file then links to my Excel sheet as a data source.
Please find patience in dealing with my ignorance. I am a novice at Python and I haven't been able to find a solution elsewhere.
Are you on a Linux system? You could set up a cron job to run your script whenever you want.
Personally, I would just build a list of the unique "ProductNumber" query parameter values and iterate through that list in a loop.
Then, by wrapping the rest of the code inside that loop, you should be able to accomplish this task; see the sketch below.
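A minimal sketch of that approach, assuming Python 3 (so the encode call from the question is dropped and the csv module handles text directly). The product numbers are hard-coded here for illustration, but they could just as well be read from the existing CSV of URLs:

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical list of product numbers, taken from the example URLs above.
product_numbers = ['J4813A', 'J4903A', 'J9019B', 'J9022A', 'J9775A']
base_url = 'https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber={}'

with open(r'c:\source\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for number in product_numbers:
        url = base_url.format(number)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        table = soup.find('table', attrs={'class': 'hpui-standardHrGrid-table'})
        if table is None:
            continue  # skip pages that do not have the expected table
        headers = [header.text for header in table.find_all('th')]
        rows = [[val.text.strip() for val in row.find_all('td')]
                for row in table.find_all('tr', {'releasetype': 'Current_Releases'})]
        writer.writerow([url])
        writer.writerow(headers)
        writer.writerows(rows)

Because the output file is opened once in 'w' mode, each run overwrites the previous CSV, which matches the weekly-overwrite requirement; the weekly run itself can be scheduled with cron (or Task Scheduler on Windows, given the c:\ path).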

getting JSON data from python several levels deep without names

So I've written some simple Python code to do some web scraping, and I'm fairly new to this, so I have a question. I get my JSON data using:
results = response.json()
This causes me no problems when my site is entered and the parameters are correct.
This JSON file has a few different groups, one of which is entitled 'moments', which itself goes fairly deep.
So, for example, to get part of what I want, I can do
print results['moments'][0][5]
but what I really want is to get
results['moments'][0][5]
results['moments'][1][5]
results['moments'][2][5]
results['moments'][3][5]
etc., through several hundred, so I'm not sure how to iterate over the first index while keeping the [5].
The reason I don't just use the full output of results['moments'] is that I want to export this to Excel, and if I write
results['moments']
directly with csv_writer, it doesn't actually comma-separate the values, so I end up with long bracketed values in column 1; but if I go to the third level, the values are comma-separated when I output to Excel.
I'm sure there are several ways to resolve this issue.
See the code below:
response = session.get('http://xxxxxxxxxxxx', params=params)
results = response.json()

location = results['moments'][0][5]
print location

with open('Location1.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    for y in location:
        csv_writer.writerow(y)
Instead of doing
results['moments'][0][5]
results['moments'][1][5]
results['moments'][2][5]
results['moments'][3][5]
You can use a simple list comprehension to do this for you, where you iterate over the length of the list results['moments']. Note that the other index remains fixed, as shown below:
locations = [results['moments'][i][5] for i in xrange(len(results['moments']))]
or
locations = [moment[5] for moment in results['moments']]
Is this what you're looking for? (Assuming the code you pasted works.)
response = session.get('http://xxxxxxxxxxxx', params=params)
results = response.json()

for i in xrange(len(results['moments'])):
    location = results['moments'][i][5]
    with open('Location' + str(i + 1) + '.csv', 'wb') as test_file:
        csv_writer = csv.writer(test_file)
        for y in location:
            csv_writer.writerow(y)

Using csv module to write file

I'm trying to extract series numbers (that is, the numbers of bus stops) from a CSV file and write them to a new CSV file. These series numbers usually appear in fields like "Queen Street, Bus Station - Platform A3 [BT000998]"; I only need the content enclosed by the brackets. I found that unwanted commas can appear inside such fields (as in the example above), and using the csv module avoids that issue. To do that I wrote the following code:
import csv
import re

fp = open(r'C:\data\input.csv')
fpw = open(r'C:\data\output.csv', 'w')
data = csv.reader(fp)
writer = csv.writer(fpw)

for row in data:
    line = ','.join(row)
    lst = line.split(',')
    try:
        stop = lst[11]  # find the location that contains the stop number
        extr = re.search(r"\[([A-Za-z0-9_]+)\]", stop)  # extract the stop number enclosed by brackets
        stop_id = str(extr.group(1))
        lst[11] = stop_id  # replace the original content with the extracted stop number
        writer.writerow(lst)  # write to the output file (fpw)
    except Exception, e:  # in case of errors such as AttributeError
        writer.writerow(row)
After running this code, no error is raised but only an empty CSV file is generated. I'm quite new to Python; I'd much appreciate it if anyone could help me make this code work.
Thank you in advance.
Sui
====UPDATE====
Based on everyone's reply, I revised the code as follows:
import csv
import re

fp = r'C:\data\input.csv'
fpw = r'C:\data\output.csv'

with open(fp, 'rb') as input, open(fpw, 'wb') as output:
    for row in csv.reader(input):
        try:
            stop = row[11]
            extr = re.search(r"\[([A-Za-z0-9_]+)\]", stop)
            stop_id = str(extr.group(1))
            row[11] = stop_id
            repl_row = ','.join(row) + '\n'
            output.write(repl_row)
        except csv.Error:
            pass
Now the code seems to work. HOWEVER, in the middle of a run a 'line contains NULL byte' error was raised and Python stopped, even though I added try/except as shown above. Does anyone have a suggestion for dealing with this issue so the code can continue? By the way, the CSV file I'm working on is over 2 GB.
Many thanks, Sui
If that's the whole code, you need to close the file with fpw.close() after you are done with all the writer operations.
You can also use the with keyword, as described in the official Python documentation; a sketch of that version is below.
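A minimal sketch of that with-based version, keeping the extraction logic from the question and letting csv.writer handle the quoting (the paths and the column index 11 come from the question):

import csv
import re

with open(r'C:\data\input.csv', 'rb') as fp, open(r'C:\data\output.csv', 'wb') as fpw:
    reader = csv.reader(fp)
    writer = csv.writer(fpw)
    for row in reader:
        try:
            extr = re.search(r"\[([A-Za-z0-9_]+)\]", row[11])  # stop number in brackets
            row[11] = extr.group(1)
        except (IndexError, AttributeError):  # short row or no bracketed number: keep the row as is
            pass
        writer.writerow(row)
# Both files are closed automatically when the with block ends.

Note that the 'line contains NULL byte' error from the update is raised by csv.reader itself while it fetches the next row, so a try/except inside the loop body will not catch it; it has to be handled around the iteration (or the NUL bytes removed from the input first).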

Open URLS from list and write data

I am writing code that creates several URLs, which are stored in a list.
The next step would be to open each URL, download the data (which is only text, formatted in XML or JSON) and save the downloaded data.
My code works fine up to this point, thanks to the online community here. It gets stuck at the point of opening each URL and downloading the data. I want urllib.request to loop through the list of created URLs, call each URL separately, open it, display the data and move on to the next. But it only runs the loop that creates the URLs, and then nothing. No feedback, nothing.
import urllib.request

# ... some calculations for llong and llat ...

# create the URLs and store them in a list
urls = []
for lat, long, lat1, long1 in zip(llat, llong, llat[1:], llong[1:]):
    for pages in range(1, 17):
        print("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages, long, lat, long1, lat1))
print(urls)

# accessing the website
data = []
for amounts in urls:
    response = urllib.request.urlopen(urls)
    flickrapi = data.read()
    data.append(+flickrapi)
    data.close()
print(data)
What am I doing wrong?
The next step would be downloading the data and saving it to a file or somewhere else for further processing.
Since I will receive heaps of data, like a lot lot lot, I am not sure what the best way to store it would be so that I can process it with R (or maybe Python? I need to do some statistical work on it). Any suggestions?
You're not appending your generated urls to the url list, you are printing them:
print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Should be:
urls.append("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Then you can iterate over the urls as planned.
But you'll run into the error on the following line:
response = urllib.request.urlopen(urls)
Here you are feeding the whole list of urls into urlopen, when you should be passing in a single url from urls, which you have named amounts, like so:
response = urllib.request.urlopen(amounts)
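Putting both fixes together, a minimal sketch of the download loop might look like this (assuming urls has been filled by the corrected urls.append line above; collecting the raw response bodies in a list is just one straightforward option):

import urllib.request

data = []
for amounts in urls:                            # one request per generated URL
    response = urllib.request.urlopen(amounts)
    flickrapi = response.read()                 # raw JSON text returned by the API
    data.append(flickrapi)

print(data)

Each element of data is then the raw JSON for one page, which can be written to files or parsed further before handing it to R.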

How to store data while scraping in case of script failure?

I'm doing some web scraping (actually geocoding using a web service) and writing the results to a file:
for i, row in enumerate(data):
    data[0] = url
    output = {}
    try:
        r = requests.get(url)
        if r.status_code == 200:
            results = r.json()
            if results:
                output['Lat'] = results['wgs84_lat']
                output['Lng'] = results['wgs84_lon']
                writer.writerow(output)
    except:
        pass
My problem is that if something goes wrong half-way through and I have to stop the process, I end up with a half-written file.
Then I have two choices: I can either restart from scratch (tedious with a million rows to geocode), or I can add boilerplate code to check whether the row already exists in the output file, and skip it if so.
I feel there must be a more convenient way to check whether the row has already been obtained. What's the neatest, most Pythonic way to do this? Perhaps pickle (which I've never used)?
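One possible sketch of the 'check whether the row already exists in the output file' option described above (an illustration, not an answer from the thread): read the keys already written, then append only the missing ones. The file name, the column names, and the assumption that the URL sits in row[0] are all hypothetical:

import csv
import requests

# Collect the URLs that are already in the output file from a previous run.
done = set()
try:
    with open('geocoded.csv') as existing:
        for row in csv.DictReader(existing):
            done.add(row['Url'])
except IOError:
    pass  # no previous output yet

# 'data' is the list of input rows from the question.
with open('geocoded.csv', 'a') as out:
    writer = csv.DictWriter(out, fieldnames=['Url', 'Lat', 'Lng'])
    if not done:
        writer.writeheader()  # first run: write the header once
    for row in data:
        url = row[0]          # hypothetical: the URL is assumed to be in the first column
        if url in done:
            continue          # already geocoded on a previous run
        r = requests.get(url)
        if r.status_code == 200:
            results = r.json()
            if results:
                writer.writerow({'Url': url,
                                 'Lat': results['wgs84_lat'],
                                 'Lng': results['wgs84_lon']})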
