I am trying to write a script (Python 2.7.11, Windows 10) to collect data from an API and append it to a CSV file.
The API I want to use returns data in JSON.
It limits the number of records returned per request, though, and pages the results: there is a maximum number of records you can get with a single query, and to get the rest you have to run another query with a different page number.
The API tells you how many pages a dataset is divided into.
Let's assume the maximum number of records per page is 100 and the number of pages is 2.
My script:
import json
import urllib2
import csv

url = "https://some_api_address?page="
limit = "&limit=100"
myfile = open('C:\Python27\myscripts\somefile.csv', 'ab')

def api_iterate():
    for i in xrange(1, 2, 1):
        parse_url = url,(i),limit
        json_page = urllib2.urlopen(parse_url)
        data = json.load(json_page)
        for item in data['someobject']:
            print item['some_item1'], ['some_item2'], ['some_item3']
        f = csv.writer(myfile)
        for row in data:
            f.writerow([str(row)])
This does not seem to work: it creates a CSV file, but the file is not populated. Something is clearly wrong with the part of the script that builds the query URL, the part that reads the JSON, the part that writes the results to CSV, or all of them.
I have tried using other resources and tutorials, but at some point I got stuck and I would appreciate your assistance.
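One immediate problem in the snippet above (an observation on the posted code, not part of the answer that follows): parse_url = url,(i),limit builds a tuple rather than a string, so urllib2.urlopen never receives a valid URL. A minimal sketch of the intended concatenation:

parse_url = url + str(i) + limit  # e.g. "https://some_api_address?page=1&limit=100"
json_page = urllib2.urlopen(parse_url)

The answer below sidesteps page counting entirely by following the 'next' link the API returns.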
The URL you have given returns a link to the next page as one of its JSON objects. You can use this to iterate automatically over all of the pages.
The script below gets each page, extracts two of the entries from the Dataobject array and writes them to an output.csv file:
import json
import urllib2
import csv

def api_iterate(myfile):
    url = "https://api-v3.mojepanstwo.pl/dane/krs_osoby"
    csv_myfile = csv.writer(myfile)
    cols = ['id', 'url']
    csv_myfile.writerow(cols)  # Write a header

    while True:
        print url
        json_page = urllib2.urlopen(url)
        data = json.load(json_page)
        json_page.close()

        for data_object in data['Dataobject']:
            csv_myfile.writerow([data_object[col] for col in cols])

        try:
            url = data['Links']['next']  # Get the next url
        except KeyError:
            break

with open(r'e:\python temp\output.csv', 'wb') as myfile:
    api_iterate(myfile)
This will give you an output file looking something like:
id,url
1347854,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1347854
1296239,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1296239
705217,https://api-v3.mojepanstwo.pl/dane/krs_osoby/705217
802970,https://api-v3.mojepanstwo.pl/dane/krs_osoby/802970
Related
import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_csv('env_sequences.csv')
Namedf = df['Name']
Uniprotdf = df['ID']

for row in Uniprotdf:
    theurl = 'https://www.uniprot.org/uniprot/' + row + '.fasta'
    page = requests.get(theurl).content

for row in Namedf:
    fasta = open(row + '.txt', 'w')
    fasta.write(page)
    fasta.close()

# Sample website: https://www.uniprot.org/uniprot/P04578.fasta
I have a .csv file and am using its 'ID' column to generate links to web pages whose content I want to download and save under the corresponding name from the 'Name' column of the same file.
The code stops working at the second for loop, where I get a TypeError for passing the page variable to fasta.write(). Yet if I print(page), I can see the text I want to end up in each file. Is this a case of having to convert the HTML into a string? I am unsure how to proceed from here.
For the given URL, if you print the content of the page, you'll notice the leading b'' prefix, which indicates it is a bytes object rather than a text string:
print(page)
b'>sp|P04578|ENV_HV1H2 Envelope glycoprotein gp160 OS=Human immunodeficiency virus type 1 group M subtype B (isolate HXB2) OX=11706 GN=env PE=1 SV=2\nMRVKEKYQHLWRWGWRWGTMLLGMLMICSATEKLWVTVYYGVPVWKEATTTLFCASDAKA\nYDTEVHNVWATHACVPTDPNPQEVVLVNVTENFNMWKNDMVEQMHEDIISLWDQSLKPCV\nKLTPLCVSLKCTDLKNDTNTNSSSGRMIMEKGEIKNCSFNISTSIRGKVQKEYAFFYKLD\nIIPIDNDTTSYKLTSCNTSVITQACPKVSFEPIPIHYCAPAGFAILKCNNKTFNGTGPCT\nNVSTVQCTHGIRPVVSTQLLLNGSLAEEEVVIRSVNFTDNAKTIIVQLNTSVEINCTRPN\nNNTRKRIRIQRGPGRAFVTIGKIGNMRQAHCNISRAKWNNTLKQIASKLREQFGNNKTII\nFKQSSGGDPEIVTHSFNCGGEFFYCNSTQLFNSTWFNSTWSTEGSNNTEGSDTITLPCRI\nKQIINMWQKVGKAMYAPPISGQIRCSSNITGLLLTRDGGNSNNESEIFRPGGGDMRDNWR\nSELYKYKVVKIEPLGVAPTKAKRRVVQREKRAVGIGALFLGFLGAAGSTMGAASMTLTVQ\nARQLLSGIVQQQNNLLRAIEAQQHLLQLTVWGIKQLQARILAVERYLKDQQLLGIWGCSG\nKLICTTAVPWNASWSNKSLEQIWNHTTWMEWDREINNYTSLIHSLIEESQNQQEKNEQEL\nLELDKWASLWNWFNITNWLWYIKLFIMIVGGLVGLRIVFAVLSIVNRVRQGYSPLSFQTH\nLPTPRGPDRPEGIEEEGGERDRDRSIRLVNGSLALIWDDLRSLCLFSYHRLRDLLLIVTR\nIVELLGRRGWEALKYWWNLLQYWSQELKNSAVSLLNATAIAVAEGTDRVIEVVQGACRAI\nRHIPRRIRQGLERILL\n'
Changing the 'w' to 'wb' when opening the file should fix it. Also, using with open() is the more Pythonic way of handling files:
for row in Namedf:
    with open(row + '.txt', 'wb') as fasta:
        fasta.write(page)
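If you prefer to keep text mode, decoding the bytes yourself should also work (a hedged alternative, not part of the original answer):

for row in Namedf:
    with open(row + '.txt', 'w') as fasta:
        fasta.write(page.decode('utf-8'))  # .content is bytes; decode to str for text mode

Note also that, as written, page only ever holds the content fetched for the last ID, because the download loop and the writing loop are separate.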
I'm a beginner with Python and I'm trying to automate some tasks. What I cannot do is iterate through each URL inside a large CSV file after reading it with pandas and chunksize:
import pandas as pd
import urllib.request, json
import csv

pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 9999999)

finalUrlList = []

# Basically I want to append each URL from the csv to apiBase, then read the url
# and retrieve the JSON for each url and save it to a new csv file
apiBase = "https://script.google.com/macros/s/AKfycbykfWnqp7urCXZLmOOGnuWz6OcAufTFWNoOMHIew2nh3CWKriZS/exec?page="
csv_url = '/Users/Andrea/Desktop/test.csv'

# use chunk size
c_size = 50000
df_chunk = pd.read_csv(csv_url, chunksize=c_size, iterator=True)

# iterate through each url in the chunks and append it to the apiBase, then add it to a list
for chunk in df_chunk:
    urlToParse = apiBase + chunk
    finalUrlList.append(urlToParse)

# iterate through each element of the list and process the url to retrieve json data
index = 0
while index < len(finalUrlList):
    try:
        with urllib.request.urlopen(finalUrlList[index]) as urlToProcess:
            data = json.loads(urlToProcess.read().decode())
        index = index + 1
    except Exception:
        print("An error occurred. I will try again!")
        pass

# Write data into a new csv file
csvfile = "IndexedUrls.csv"
try:
    with open(csvfile, "w") as output:
        writer = csv.writer(output, lineterminator='\n')
        for val in data:
            writer.writerow([val])
    print("Csv file saved successfully!")
except:
    print("An error occured, couldn't save csv file!")
The first part, reading the big file in chunks, works: pandas can read the CSV very quickly. But then I cannot iterate through each URL of the CSV and perform the JSON-reading task on each of them (maybe with multiprocessing to speed up this last step, because it also takes time to open each URL, get the result, store it, and so on).
Is there a fast way to achieve all this? Thanks a lot for your help, and sorry if the code is rough, but I'm a beginner and willing to learn.
THANKS!
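A minimal sketch of how the per-chunk URL construction might look, assuming the CSV holds a single column of page values (the column name 'page' below is hypothetical); the key point is to iterate over the values inside each chunk rather than concatenating the chunk object itself:

import pandas as pd
import urllib.request, json

apiBase = "https://script.google.com/macros/s/AKfycbykfWnqp7urCXZLmOOGnuWz6OcAufTFWNoOMHIew2nh3CWKriZS/exec?page="

finalUrlList = []
for chunk in pd.read_csv('/Users/Andrea/Desktop/test.csv', chunksize=50000):
    for value in chunk['page']:            # 'page' is an assumed column name
        finalUrlList.append(apiBase + str(value))

results = []
for u in finalUrlList:
    with urllib.request.urlopen(u) as resp:
        results.append(json.loads(resp.read().decode()))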
I am trying to bulk download movie information from The Movie Database. The preferred method mentioned on their website is to loop through movie IDs from 1 up to the most recent movie ID. When I pull an individual movie using its ID, I get the entire set of information. However, when I pull it inside a loop, I receive an error 34, resource cannot be found. For my example, I picked a movie ID that I have already fetched individually (Skyfall, 37724), which still returns the resource-cannot-be-found error.
import requests

dataset = []
for i in range(37724, 37725):
    url = 'https://api.themoviedb.org/3/movie/x?api_key=*****&language=en-US'
    movieurl = url[:35] + str(i) + url[36:]
    payload = "{}"
    response = requests.request("GET", url, data=payload)
    data = response.json()
    dataset.append(data)
    print(movieurl)

dataset
[ANSWERED] 1) Is there a reason why the loop cannot pull the information? Is this a programming issue or something specific to the API?
2) Is the way my code is set up the best way to pull the information and store it in bulk? My ultimate goal is to create a CSV file with the data.
Your request uses url, while your actual URL is in the movieurl variable.
To write your data to CSV, I would recommend the Python csv module's DictWriter, as your data are dicts (response.json() produces a dict).
BONUS: If you want to format a string, use the string.format method:
url = 'https://api.themoviedb.org/3/movie/{id}?api_key=*****&language=en-US'.format(id=i)
This is much more robust than slicing the URL by index.
A working, improved version of your code, writing the results to CSV, would be:
import csv
import requests

with open('output.csv', 'w') as csvfile:
    writer = None
    for i in range(37724, 37725):
        url = 'https://api.themoviedb.org/3/movie/{id}?api_key=*****&language=en-US'.format(id=i)
        response = requests.get(url)
        data = response.json()
        if writer is None:
            # DictWriter requires the field names; take them from the first response
            writer = csv.DictWriter(csvfile, fieldnames=sorted(data.keys()))
            writer.writeheader()
        writer.writerow(data)
I am currently pulling data via an API and am attempting to write the data into a CSV in order to run calculations in SQL. I am able to pull the data and open the CSV, but an error occurs when the data is written into the CSV: each individual character ends up separated by a comma.
I am new to working with JSON data, so I am curious whether I need to perform an intermediate step between pulling the JSON data and inserting it into the CSV. Any help would be greatly appreciated, as I am completely stuck on this (even the data provider does not seem to know how to get around it).
Please see the code below:
import requests
import time
import pyodbc
import csv
import json

headers = {'Authorization': 'Token'}

Metric1 = ['Website1', 'Website2']
Metric2 = ['users', 'hours', 'responses', 'visits']
Metric3 = ['Country1', 'Country2', 'Country3']

obs_list = []
obs_file = r'TEST.csv'

with open(obs_file, 'w') as csvfile:
    f = csv.writer(csvfile)
    for elem1 in Metric1:
        for elem2 in Metric2:
            for elem3 in Metric3:
                URL = "www.data.com"
                r = requests.get(URL, headers=headers, verify=False)
                for elem in r:
                    f.writerow(elem)
Edit: When I print the data instead of writing it to a CSV, the data appears in the command window in the following format:
[timestamp, metric], [timestamp, metric], [timestamp, metric] ...
Timestamp = a 12-digit value
Metric = a decimal value
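A minimal sketch of how the write step might look given that shape, assuming r.json() returns a list of [timestamp, metric] pairs (the structure is inferred from the printed output above, not confirmed). The key point is that each row passed to writerow() must be a sequence of fields; passing a single string splits it into individual characters, which is the symptom described.

import csv
import requests

headers = {'Authorization': 'Token'}
URL = "www.data.com"  # placeholder URL from the question

with open('TEST.csv', 'w') as csvfile:
    f = csv.writer(csvfile)
    r = requests.get(URL, headers=headers, verify=False)
    for timestamp, metric in r.json():   # assumed: a list of [timestamp, metric] pairs
        f.writerow([timestamp, metric])  # pass a list of fields, not a string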
This is my first time doing this, so I had better apologize in advance for my rookie mistakes. I'm trying to scrape legacy.com for the first page of results from searching for a first and last name within a state. I'm new to programming and was using ScraperWiki to write the code. It worked, but I ran out of CPU time long before the roughly 10,000 queries had time to process. Now I'm trying to save progress, catch when time is running low, and then resume from where it left off.
I can't get the save to work, and any help with the other parts would be appreciated as well. As of now I'm just grabbing links, but if there were a way to save the main content of the linked pages, that would be really helpful too.
Here's my code:
import scraperwiki
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

f = open('/tmp/workfile', 'w')
# read database, find last, start from there

def searchname(fname, lname, id, stateid):
    url = 'http://www.legacy.com/ns/obitfinder/obituary-search.aspx?daterange=Last1Yrs&firstname= %s &lastname= %s &countryid=1&stateid=%s&affiliateid=all' % (fname, lname, stateid)
    obits = urlopen(url)
    soup = BeautifulSoup(obits)
    obits_links = soup.findAll("div", {"class": "obitName"})
    print obits_links
    s = str(obits_links)
    id2 = int(id)
    f.write(s)
    # save the database here
    scraperwiki.sqlite.save(unique_keys=['id2'], data=['id2', 'fname', 'lname', 'state_id', 's'])

# Import Data from CSV
import scraperwiki
data = scraperwiki.scrape("https://dl.dropbox.com/u/14390755/legacy.csv")
import csv
reader = csv.DictReader(data.splitlines())
for row in reader:
    # scraperwiki.sqlite.save(unique_keys=['id'], 'fname', 'lname', 'state_id', data=row)
    FNAME = str(row['fname'])
    LNAME = str(row['lname'])
    ID = str(row['id'])
    STATE = str(row['state_id'])
    print "Person: %s %s" % (FNAME, LNAME)
    searchname(FNAME, LNAME, ID, STATE)

f.close()
f = open('/tmp/workfile', 'r')
data = f.read()
print data
At the bottom of the CSV loop, write each fname+lname+state combination with save_var. Then, right before that loop, add another loop that goes through the rows without processing them until it passes the saved value.
You should be able to write entire web pages into the datastore, but I haven't tested that.
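A minimal sketch of that resume logic, assuming the classic scraperwiki.sqlite.save_var / get_var helpers are available (the variable name 'last_done' is chosen here purely for illustration); it reuses the searchname function defined in the question above:

import scraperwiki
import csv

data = scraperwiki.scrape("https://dl.dropbox.com/u/14390755/legacy.csv")
reader = csv.DictReader(data.splitlines())

# The combination that finished on the previous run, or None on the first run
last_done = scraperwiki.sqlite.get_var('last_done', None)
skipping = last_done is not None

for row in reader:
    key = '%s|%s|%s' % (row['fname'], row['lname'], row['state_id'])
    if skipping:
        if key == last_done:
            skipping = False  # the next row is where the previous run stopped
        continue
    searchname(row['fname'], row['lname'], row['id'], row['state_id'])
    scraperwiki.sqlite.save_var('last_done', key)  # record progress after each row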