Using chunking with requests to get large JSON data into Python

I am trying to get a large dataset into Python using an API, but I am not able to retrieve all of it. The request returns only the first 1000 rows.
import pandas as pd
import requests

r = requests.get("https://data.cityofchicago.org/resource/6zsd-86xi.json")
data = r.json()
df = pd.DataFrame(data)
df.drop(df.columns[[0, 1, 2, 3, 4, 5, 6, 7]], axis=1, inplace=True)  # dropping some columns
df.shape
Output is
(1000,22)
The website contains almost 6 million records, yet only 1000 are retrieved. How do I get around this? Is chunking the right option? Can someone please help me with the code?
Thanks.

You'll need to paginate through the results to get the entire dataset. Most APIs limit the number of results returned in a single request. According to the Socrata docs, you need to add $limit and $offset parameters to the request URL.
For example, for the first page of results you would start with -
https://data.cityofchicago.org/resource/6zsd-86xi.json?$limit=1000&$offset=0
Then for the next page you would just increment the offset -
https://data.cityofchicago.org/resource/6zsd-86xi.json?$limit=1000&$offset=1000
Continue incrementing until you have the entire dataset.
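A minimal sketch of that loop, reusing the endpoint and column drop from the question; the page size of 1000 mirrors the cap seen above (Socrata 2.1 endpoints generally accept larger $limit values, which would mean fewer requests):
import pandas as pd
import requests

url = "https://data.cityofchicago.org/resource/6zsd-86xi.json"
page_size = 1000  # assumption: one request per 1000 rows, matching the cap above
frames = []
offset = 0

while True:
    r = requests.get(url, params={"$limit": page_size, "$offset": offset})
    r.raise_for_status()
    batch = r.json()
    if not batch:  # an empty page means the dataset is exhausted
        break
    frames.append(pd.DataFrame(batch))
    offset += page_size

df = pd.concat(frames, ignore_index=True)
df.drop(df.columns[[0, 1, 2, 3, 4, 5, 6, 7]], axis=1, inplace=True)  # as in the question
print(df.shape)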

Related

Extract first few lines of URL's text data without 'get'ting entire page data?

Use case: I need to check whether JSON data from a URL has been updated by checking its created_date field, which lies in the first few lines. The entire page's JSON data is huge and I don't want to retrieve the whole page just to check the first few lines.
Currently, for both
x=feedparser.parse(url)
y=requests.get(url).text
#y.split("\n") etc..
the entire URL's data is retrieved and then parsed.
I want to do some sort of next(url), or read only the first 10 lines (chunks), thus not requesting the entire page's data, i.e. just peek at the created_date field and exit.
What can be used to solve this? Thanks for your knowledge, and apologies for the noob question.
Example of URL -> https://www.w3schools.com/xml/plant_catalog.xml
I want to stop reading the URL data if the first PLANT object's LIGHT tag hasn't changed from 'Mostly Shady' (without needing to read/get the data below it).
The original poster stated that the solution below worked:
Instead of a GET request, one can try a HEAD request:
"The GET method requests a representation of the specified resource. Requests using GET should only retrieve data. The HEAD method asks for a response identical to a GET request, but without the response body."
This way, you don't need to request the entire JSON, which speeds things up on the server side and is friendlier to the hosting server!
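A small sketch of both options, assuming the server exposes a freshness header such as Last-Modified; if it doesn't, a streamed GET that reads only the first chunk (arbitrary chunk size) is a fallback:
import requests

url = "https://www.w3schools.com/xml/plant_catalog.xml"

# Option 1: HEAD request - only headers come back, no body is downloaded.
head = requests.head(url)
print(head.status_code, head.headers.get("Last-Modified"))

# Option 2: streamed GET - read just the first chunk, then drop the connection.
with requests.get(url, stream=True) as r:
    first_chunk = next(r.iter_content(chunk_size=1024)).decode("utf-8", errors="replace")
    if "Mostly Shady" in first_chunk:
        print("First PLANT entry unchanged; skipping the rest of the download.")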

Authenticated API Call Python + Pagination

I am making an authenticated API call in Python - and dealing with pagination. If I call each page separately, I can pull in each of the 20 pages of records and combine them into a dataframe, but it's obviously not a very efficient process and the code gets quite lengthy.
I found some instructions on here to retrieve all records across all pages -- and the JSON output confirms there are 20 total pages, ~4900 records -- but somehow I'm still only getting page 1 of the data. Any ideas on how I might pull every page into a single dataframe with a single block of code?
Link I Referenced:
Python to retrieve multiple pages of data from API with GET
My Code:
import requests
import json
import pandas as pd
for page in range(1, 20):
    url = "https://api.xxx.xxx...json?fields=keywords, source, campaign, per_page=250&date_range=all_time"
    headers = {"myauthkey"}
    response = requests.get(url, headers=headers)
    print(response.json())  # Shows Page 1, per_page: 25, total_pages: 20, total_records: 4900
    data = response.json()
    df = pd.json_normalize(data, 'records')
    df.info()  # confirms I've only pulled in 250 records, 1 page of the 20 pages of data.
I've scoured the web and this site and can't find a way to efficiently pull in all 20 pages of data, other than calling each page one by one. I think one solution might be looping until the final page of data has been reached, but I'm not quite sure how to set that up. I thought the above might accomplish that - and maybe it does, but maybe the subsequent code I've written is not right for pulling in all pages of data.
Appreciate any guidance/advice. Thanks!
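One way to set up that loop is sketched below. This is a hedged sketch, not the actual API: the page and per_page query parameters, the Authorization header name, and the total_pages/records fields are assumptions based only on the output shown in the question.
import pandas as pd
import requests

base_url = "https://api.xxx.xxx...json"      # placeholder from the question
headers = {"Authorization": "myauthkey"}     # assumption: real auth header goes here

records = []
page = 1
while True:
    # assumption: the API accepts page/per_page and reports total_pages in the payload
    params = {"page": page, "per_page": 250, "date_range": "all_time"}
    response = requests.get(base_url, headers=headers, params=params)
    response.raise_for_status()
    payload = response.json()
    records.extend(payload["records"])
    if page >= payload["total_pages"]:
        break
    page += 1

df = pd.json_normalize(records)
df.info()  # should now report ~4900 records across all 20 pages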

How to cross the 1000 results limit of the YouTube Data API?

I need all video IDs between X and Y dates for a specific subject. I used the YouTube Data API v3 and everything was OK until it started returning empty items after 1000 results. I searched about this issue and figured out it's an API limitation. Does anyone have a solution?
If it helps, I code in Python. I tested in PHP too with the same result; it doesn't depend on the language.
Thanks.
Just split your date range into multiple requests, so that each request returns fewer results than the limit.
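A hedged sketch of that idea against the search endpoint, using its publishedAfter/publishedBefore filters; the API key, query, date range, and one-week window size are all placeholders, and each window is still paged through via nextPageToken:
import datetime as dt
import requests

API_KEY = "YOUR_API_KEY"     # placeholder
QUERY = "specific subject"   # placeholder
URL = "https://www.googleapis.com/youtube/v3/search"

def video_ids_in_window(published_after, published_before):
    """Collect video IDs for one date window, following nextPageToken."""
    ids, page_token = [], None
    while True:
        params = {
            "part": "id",
            "type": "video",
            "q": QUERY,
            "publishedAfter": published_after,
            "publishedBefore": published_before,
            "maxResults": 50,
            "key": API_KEY,
        }
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(URL, params=params).json()
        ids += [item["id"]["videoId"] for item in data.get("items", [])]
        page_token = data.get("nextPageToken")
        if not page_token:
            return ids

# Split X..Y into one-week windows so each window stays under the result cap.
start, end = dt.datetime(2023, 1, 1), dt.datetime(2023, 3, 1)  # placeholders for X and Y
window = dt.timedelta(days=7)
all_ids = []
while start < end:
    stop = min(start + window, end)
    all_ids += video_ids_in_window(start.isoformat() + "Z", stop.isoformat() + "Z")
    start = stop

print(len(set(all_ids)))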

What is the maximum limit for fetching the attributes of a particular URL on a website?

I have around 5 to 6k URLs in an Excel sheet, and I need to build a scraper in Python with the Beautiful Soup package that fetches each URL from the sheet and scrapes the required attributes from that particular page, looping over all 5k URLs.
Is it possible to do this in one shot?
Can the websites block us because of the large volume of requests? Is there a more optimal way to do the same?
Kindly advise!
There are a couple of approaches you may consider (see the sketch below):
Send requests periodically at a certain interval, e.g. 10 requests every 5 seconds.
Use different IP proxies, so it is hard for the server to determine whether the requests come from the same client.
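A minimal sketch combining both ideas, assuming the URLs live in a single-column Excel sheet; the file name, column name, proxy addresses, and the scraped attribute (the page title) are all placeholders:
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = pd.read_excel("urls.xlsx")["url"]             # assumption: column named "url"
proxies = [None, {"http": "http://10.0.0.1:8080",    # placeholder proxy addresses
                  "https": "http://10.0.0.1:8080"}]

results = []
for i, url in enumerate(urls):
    resp = requests.get(url,
                        headers={"User-Agent": "Mozilla/5.0"},
                        proxies=proxies[i % len(proxies)],  # rotate between direct and proxy
                        timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    results.append({"url": url, "title": soup.title.string if soup.title else None})
    if (i + 1) % 10 == 0:   # throttle: pause after every 10 requests
        time.sleep(5)

pd.DataFrame(results).to_excel("scraped.xlsx", index=False)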

Incomplete data after scraping a website for data

I am working on some web scraping using Python and have run into issues extracting table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder. Below is a snapshot of the tables I am trying to scrape values from.
Here is the code I am trying to use for the scraping.
# Import packages
import pandas as pd
import requests

# Get website URL and send the request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list,
                                   headers={'User-agent': 'Mozilla/5.0'}).text)

# Printing the scraped data to screen
print(etf_df)

# Output the read data into dataframes
frame = {}  # dict to hold one dataframe per parsed table
for i in range(0, len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])
I have several issues.
The tables only contain 20 entries, while the total per table on the website should be 2166 entries. How do I amend the code to pull all the values?
Some of the dataframes could not be properly assigned after scraping the site. For example, the output for frame[0] is not in DataFrame format and nothing is shown for frame[0] when trying to view it as a DataFrame in the Python console; however, it seems fine when printed to the screen. Would it be better to parse the HTML using BeautifulSoup instead?
As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, which checks the Referer header to see if you're allowed to see it.
However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:
>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166
At this point, data contains everything you need; pandas probably has a simple way of loading it into a dataframe.
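For instance, a short sketch of that last step (assuming the response is a list of per-fund records, as the length of 2166 suggests; json_normalize also flattens any nested fields):
import pandas as pd
import requests

r = requests.get(
    "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    headers={"Referer": "http://www.etf.com/etfanalytics/etf-finder"},
)
data = r.json()

# If each element is a flat dict, pd.DataFrame(data) is enough;
# json_normalize additionally flattens nested fields into dotted column names.
df = pd.json_normalize(data)
print(df.shape)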
You only get 20 rows of the table because only 20 rows are present in the HTML page by default; view the source code of the page you are trying to parse. One possible solution would be to iterate through the pagination until the end, but the pagination there is implemented with JavaScript and is not reflected in the URL, so I don't see how you can access the next pages of the table directly.
It looks like there is a request to
http://www.etf.com/etf-finder-funds-api//-aum/100/100/1
on that page when I try to load the 2nd group of 100 rows, but getting access to that URL might be very tricky, if it is possible at all. Maybe for this particular site you should use something like WebBrowser in C# (I don't know what the equivalent is in Python, but I'm sure Python can do everything). You would then be able to imitate a browser and execute JavaScript.
Edit: I've tried running the following JS code in the console on the page you provided.
jQuery.ajax({
    url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    success: function(data) {
        console.log(JSON.parse(data));
    }
});
It logged an array of all 2166 objects representing the table rows you are looking for. Try it yourself to see the result. It looks like in the request URL "0" is a start index and "3000" is a limit.
But if you try this from some other domain you will get 403 Forbidden. This is because they have a Referer header check.
Edit again: as mentioned by @stranac, it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.
