How to store data while scraping in case of script failure? - python

I'm doing some web scraping (actually geocoding using a web service) and writing the results to a file:
for i, row in enumerate(data):
    url = row[0]
    output = {}
    try:
        r = requests.get(url)
        if r.status_code == 200:
            results = r.json()
            if results:
                output['Lat'] = results['wgs84_lat']
                output['Lng'] = results['wgs84_lon']
        writer.writerow(output)
    except:
        pass
My problem is that if something goes wrong half-way through and I have to stop the process, I end up with a half-written file.
Then I have two choices: I can either restart from scratch (tedious with a million rows to geocode), or I can add boilerplate code to check whether the row already exists in the output file, and skip it if so.
I feel there must be a more convenient way to check whether the row has already been obtained. What's the neatest, most Pythonic way to do this? Perhaps pickle (which I've never used)?
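By that boilerplate I mean something along these lines (a rough sketch: it assumes I also write a unique 'Address' column alongside each result so rows can be matched on restart; the column and file names are just illustrative):
import csv
import os

out_file = 'output.csv'
fields = ['Address', 'Lat', 'Lng']

# Collect the keys of rows that were already written on a previous run.
already_done = set()
file_exists = os.path.exists(out_file)
if file_exists:
    with open(out_file, newline='') as f:
        already_done = {row['Address'] for row in csv.DictReader(f)}

with open(out_file, 'a', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    if not file_exists:
        writer.writeheader()
    for row in data:
        if row[0] in already_done:
            continue  # already geocoded on a previous run
        # ... geocode as above, then:
        # writer.writerow({'Address': row[0], 'Lat': ..., 'Lng': ...})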

Related

Best way to update a json file as data is coming in

I am running a loop with data coming in and writing the data to a json file. Here's what it looks like in a minimal verifiable concrete example.
import json
import random
import string

dc_master = {}
for i in range(100):
    # Below mimics an API call that returns new data.
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_info = {}
    dc_info['Height'] = 'NA'
    dc_master[name] = dc_info
    with open("myfile.json", "w") as filehandle:
        filehandle.write(json.dumps(dc_master))
As you can see from the above, every time it loops, it creates a new dc_info. That becomes the value of the new key-value pair (with the key being the name) that gets written to the json file.
The one disadvantage of the above is that when it fails and I restart, I have to do it from the very beginning. Should I open the json file for reading into dc_master, add a name:dc_info pair to the dictionary, and then write dc_master back to the json file on every turn of the loop? Or should I just append to the json file even if it's a duplicate, and rely on the fact that when I need to use it, I will load it back into a dictionary, which takes care of duplicates automatically?
Additional information: There are occasional timeouts, so I want to be able to start somewhere in the middle if needed. The number of key-value pairs in dc_info is about 30, and the number of overall name:dc_info pairs is about 1000, so it's not huge. Reading it out and writing it back in again is not onerous. But I would like to know if there's a more efficient way of doing it.
I think the full script for fetching and storing API results should look like the example code below. At least I always write code like this for long-running sets of tasks.
I put each result of an API call as a separate single-line JSON object in the result file.
The script may be stopped in the middle, e.g. due to an exception; the file will be correctly closed and flushed thanks to the with manager. Then on restart the script will read the already processed result lines from the file.
Only those results that have not been processed already (whose id is not in processed_ids) should and will be fetched from the API. The id field may be anything that uniquely identifies each API call result.
Each new result will be appended to the json-lines file thanks to the 'a' (append) file mode. buffering specifies the write-buffer size in bytes; the file will be flushed and written in blocks of this size, so that the disk is not stressed with frequent one-line (about 100 bytes) writes. Using a large buffering value is totally alright because Python's with block correctly flushes and writes out all bytes whenever the block exits, due to an exception or any other reason, so you'll never lose even a single small result or byte that has already been written by f.write(...).
Final results will be printed to the console.
Because your task is very interesting and important (at least I have had similar tasks many times), I've decided to also implement a multi-threaded version of the single-threaded code located below; it is especially useful when fetching data from the Internet, as it is usually necessary to download data in several parallel threads. The multi-threaded version can be found and run here and here. This multi-threading can be extended to multi-processing too, for efficiency, using ideas from another answer of mine.
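As a rough sketch only (not the linked version), such a multi-threaded variant could be built on concurrent.futures: worker threads perform the (here mimicked) API calls, while the main thread alone appends finished results to the json-lines file, so no file locking is needed.
import json, os, random, string
from concurrent.futures import ThreadPoolExecutor, as_completed

fname = 'myfile.json'
enc = 'utf-8'
id_field = 'id'

def read_results():
    results, processed_ids = [], set()
    if os.path.exists(fname):
        with open(fname, 'r', encoding = enc) as f:
            results = [json.loads(line) for line in f if line.strip()]
        processed_ids = {r[id_field] for r in results}
    return results, processed_ids

def fetch(id_):
    # Mimics one API call; replace with the real request for the given id_.
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    return {'id': id_, 'name': name, 'field0': 'value0', 'field1': 'value1'}

results, processed_ids = read_results()
todo = [id_ for id_ in range(100) if id_ not in processed_ids]

with open(fname, 'a', encoding = enc) as f:
    with ThreadPoolExecutor(max_workers = 8) as pool:
        futures = [pool.submit(fetch, id_) for id_ in todo]
        for fut in as_completed(futures):
            result = fut.result()
            # Only the main thread writes to the file, so no lock is required.
            f.write(json.dumps(result, ensure_ascii = False) + '\n')
            results.append(result)
            processed_ids.add(result[id_field])

print(read_results()[0])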
Next is the single-threaded version of the code:
Try the code online here!
import json, os, random, string

fname = 'myfile.json'
enc = 'utf-8'
id_field = 'id'

def ReadResults():
    results = []
    processed_ids = set()
    if os.path.exists(fname):
        with open(fname, 'r', encoding = enc) as f:
            data = f.read()
        results = [json.loads(line) for line in data.splitlines() if line.strip()]
        processed_ids = {r[id_field] for r in results}
    return (results, processed_ids)

# First read already processed elements
results, processed_ids = ReadResults()

with open(fname, 'a', buffering = 1 << 20, encoding = enc) as f:
    for id_ in range(100):
        # !!! Only process ids that are not in processed_ids !!!
        if id_ in processed_ids:
            continue
        # Below mimics an API call that returns new data.
        # Should fetch only those objects that correspond to id_.
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        # Fill necessary result fields
        result = {}
        result['id'] = id_
        result['name'] = name
        result['field0'] = 'value0'
        result['field1'] = 'value1'
        cid = result[id_field]  # There should be some unique id field
        assert cid not in processed_ids, f'Processed {cid} twice!'
        f.write(json.dumps(result, ensure_ascii = False) + '\n')
        results.append(result)
        processed_ids.add(cid)

print(ReadResults()[0])
I think you're fine and I'd loop over the whole thing and keep writing to the file, as it's cheap.
As for retries, you would have to check for a timeout and then see if the JSON file is already there, load it up, count your keys and then fetch the missing number of entries.
Also, your example can be simplified a bit.
import json
import random
import string

dc_master = {}
for _ in range(100):
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_master.update({name: {"Height": "NA"}})

with open("myfile.json", "w") as jf:
    json.dump(dc_master, jf, sort_keys=True, indent=4)
EDIT:
On second thought, you probably want to use a JSON list instead of a dictionary as the top level element, so it's easier to check how much you've got already.
import json
import os
import random
import string

output_file = "myfile.json"
max_entries = 100
dc_master = []

def do_your_stuff(data_container, n_entries=max_entries):
    for _ in range(n_entries):
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        data_container.append({name: {"Height": "NA"}})
    return data_container

def dump_data(data, file_name):
    with open(file_name, "w") as jf:
        json.dump(data, jf, sort_keys=True, indent=4)

if not os.path.isfile(output_file):
    dump_data(do_your_stuff(dc_master), output_file)
else:
    with open(output_file) as f:
        data = json.load(f)
    if len(data) < max_entries:
        new_entries = max_entries - len(data)
        dump_data(do_your_stuff(data, new_entries), output_file)
        print(f"Added {new_entries} entries.")
    else:
        print("Nothing to update.")

Python: Attempting to scrape multiple (Similar) websites for (Similar) data

I have code that retrieves information from HPE's website regarding switches. The script works just fine and outputs the information into a CSV file. However, now I need to loop the script through 30 different switches.
I have a list of URLs that are stored in a CSV document. Here are a few examples.
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4813A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4903A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9019B
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9022A
In my code, I bind url to one of these links, which pushes it through the code to retrieve the information I need.
Here is my full code:
url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?
ProductNumber=J9775A"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
item = []
for val in row.find_all('td'):
item.append(val.text.encode('utf8').strip())
rows.append(item)
with open('c:\source\output_file.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerow(headers)
writer.writerows(rows)
I am trying to find the best way to automate this, as the script needs to run at least once a week. It will need to output to one CSV file that is overwritten every time. That CSV file then links to my Excel sheet as a data source.
Please be patient with my ignorance. I am a novice at Python and haven't been able to find a solution elsewhere.
Are you on a Linux system? You could set up a cron job to run your script whenever you want.
Personally, I would just make an array of the unique "ProductNumber" query parameter values and iterate through that array in a loop.
Then, by encapsulating the rest of your code inside that loop, you should be able to accomplish this task; a sketch follows below.
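A minimal sketch of that idea, reusing the scraping code from the question (the product numbers here are just the examples listed above; in practice they would be read from the CSV of URLs, and the whole script can then be scheduled weekly, e.g. with cron):
import csv
import requests
from bs4 import BeautifulSoup

product_numbers = ['J4813A', 'J4903A', 'J9019B', 'J9022A']  # or read these from your CSV of URLs
base_url = 'https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber={}'

# One output file, overwritten on every run.
with open(r'c:\source\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for product in product_numbers:
        url = base_url.format(product)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
        if table is None:
            continue  # skip products whose page has no release table
        headers = [header.text for header in table.find_all('th')]
        writer.writerow([url])
        writer.writerow(headers)
        for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
            writer.writerow([val.text.strip() for val in row.find_all('td')])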

getting JSON data from python several levels deep without names

So I've written some simple Python code to do some web scraping, and I'm fairly new to this, so I have a question. I get my JSON data using:
results = response.json()
This causes me no problems with my site entered and the parameters correct.
This JSON file has a few different groups, one of which is entitled 'moments', which itself goes fairly deep.
So, for example, to get part of what I want, I can do
print results['moments'][0][5]
but what I really want is to get
results['moments'][0][5]
results['moments'][1][5]
results['moments'][2][5]
results['moments'][3][5]
etc., through several hundred, so I'm not sure how to iterate over that while keeping the [5] at the next tier.
The reason I don't just use the full output of results['moments'] is that I want to export this to Excel, and if I just write using csv_writer on
results['moments']
it doesn't actually comma-separate the values, so I end up with long bracketed values in column 1, but if I go to the 3rd level it will be comma-separated when I output to Excel.
I'm sure there are several ways to resolve this issue.
See the code below:
import csv

response = session.get('http://xxxxxxxxxxxx', params=params)
results = response.json()

location = results['moments'][0][5]
print location

with open('Location1.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    for y in location:
        csv_writer.writerow(y)
Instead of doing
results['moments'][0][5]
results['moments'][1][5]
results['moments'][2][5]
results['moments'][3][5]
You can use a simple list comprehension to do this for you, where you iterate on the length of the list results['moments']. Note that the other index remains fixed, as shown below:
locations = [results['moments'][i][5] for i in xrange(len(results['moments']))]
or
locations = [moment[5] for moment in results['moments']]
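If the goal is a single comma-separated file for Excel, and assuming each moment[5] is itself a flat list of values, the collected locations can then be written in one go (the file name is just illustrative):
import csv

locations = [moment[5] for moment in results['moments']]

with open('locations.csv', 'wb') as f:
    csv_writer = csv.writer(f)
    for location in locations:
        csv_writer.writerow(location)  # one comma-separated row per moment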
Is this what you're looking for? (Assuming the code you pasted works.)
import csv

response = session.get('http://xxxxxxxxxxxx', params=params)
results = response.json()

for i in xrange(len(results['moments'])):
    location = results['moments'][i][5]
    with open('Location' + str(i + 1) + '.csv', 'wb') as test_file:
        csv_writer = csv.writer(test_file)
        for y in location:
            csv_writer.writerow(y)

Circumventing Python errors in a script

I have a large file containing thousands of links. I've written a script that calls each link line by line and performs various analyses on the respective webpage. However, sometimes the link is faulty (article removed from the website, etc.), and my whole script just stops at that point.
Is there a way to circumvent this problem? Here's my (pseudo)code:
for row in file:
    url = row[4]
    req = urllib2.Request(url)
    tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    perform analyses
    append analyses results to lists
output data
I have tried
except:
pass
But it royally messes up the script for some reason.
Works for me:
for row in file:
    url = row[4]
    try:
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
        perform analyses
        append analyses results to lists
    except URLError, e:
        pass
output data
A try block is the way to go:
for row in file:
    url = row[4]
    try:
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except URLError, e:
        continue
    perform analyses
    append analyses results to lists
output data
continue will let you skip any unnecessary computation after the URL check and restart at the next iteration of the loop.
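If you also want to know which links were skipped, a small variant of the same idea (same Python 2 / urllib2 setup as above; failed_urls is just an illustrative name) records them instead of silently passing over them:
import urllib2
import lxml.html
from urllib2 import URLError

failed_urls = []

for row in file:
    url = row[4]
    try:
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except URLError, e:
        failed_urls.append((url, str(e)))
        continue
    # perform analyses
    # append analyses results to lists

# output data, then inspect or retry failed_urls if needed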

Open URLS from list and write data

I am writing code that creates several URLs, which are stored in a list.
The next step would be to open each URL, download the data (which is only text, formatted in XML or JSON) and save the downloaded data.
My code works fine up to this point thanks to the online community here. It gets stuck at the point of opening the URLs and downloading the data. I want urllib.request to loop through the list of my created URLs and call each URL separately, open it, display it and move on to the next. But it only does the loop to create the URLs, and then nothing. No feedback, nothing.
import urllib.request

.... some calculations for llong and llat ....

# create the URLs and store them in a list
urls = []
for lat, long, lat1, long1 in zip(llat, llong, llat[1:], llong[1:]):
    for pages in range(1, 17):
        print("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages, long, lat, long1, lat1))

print(urls)

# accessing the website
data = []
for amounts in urls:
    response = urllib.request.urlopen(urls)
    flickrapi = data.read()
    data.append(+flickrapi)
    data.close()

print(data)
What am I doing wrong?
The next step would be downloading the data and saving it to a file or somewhere else for further processing.
Since I will receive heaps of data (like, a lot), I am not sure what would be the best way to store it so I can process it with R (or maybe Python? I need to do some statistical work on it). Any suggestions?
You're not appending your generated urls to the urls list; you are printing them:
print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Should be:
urls.append("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Then you can iterate over the urls as planned.
But then you'll run into an error on the following line:
response = urllib.request.urlopen(urls)
Here you are feeding the whole list urls into urlopen, when you should be passing in a single url from urls, which you have named amounts, like so:
response = urllib.request.urlopen(amounts)
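Putting both fixes together, the fetch loop might look roughly like this (keeping the question's variable names; llat and llong are assumed to be computed earlier, and each raw response is simply kept in memory here):
import urllib.request

urls = []
for lat, long, lat1, long1 in zip(llat, llong, llat[1:], llong[1:]):
    for pages in range(1, 17):
        urls.append("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages, long, lat, long1, lat1))

data = []
for amounts in urls:
    response = urllib.request.urlopen(amounts)
    data.append(response.read())  # raw JSON/XML bytes for this URL

print(data)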
