How do I implement API pagination using Python 3 & requests?

I make an API call using requests, and get a JSON response back. The API limit is 20 results per page. I can access the first page without a problem, however I cannot figure out how to include pagination in the query. In the JSON response, at the bottom of the page, it gives me the following information.
},
"_links": {
    "first": {
        "href": "https://search-lastSale?date=20190723-20190823&page=0&size=20"
    },
    "last": {
        "href": "https://search-lastSale?date=20190723-20190823&page=4&size=20"
    },
    "next": {
        "href": "https://search-lastSale?date=20190723-20190823&page=1&size=20"
    },
    "self": {
        "href": "https://search-lastSale?date=20190723-20190823&page=0&size=20"
    }
},
"page": {
    "number": 0,
    "size": 20,
    "totalElements": 77,
    "totalPages": 4
}
I've read the docs at https://2.python-requests.org//en/latest/user/advanced/#link-headers and various other articles and posts, but everything seems very specific to people's own APIs.
I've stripped my code back to a single URL request with an old auth token, just so I can get a grasp of it before scaling back up to my existing project. The code is below:
url = "https://search-api.corelogic.asia/search/au/property/postcode/401249/lastSale"
querystring = {"page":"0","size":"20","date":"20190723-20190823"}
headers = {
'Content-Type': "application/JSON",
'Authorization': "My Token"}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
As far as I can tell from the docs and reading, what I should be doing is either:
Get a JSON response back, read the page count from it, and then send new requests against a custom list of URLs built from that count somehow, i.e.
if totalPages == 4:
    https://www.search/page0
    https://www.search/page1
    https://www.search/page2
    https://www.search/page3
    https://www.search/page4
then loop through each URL and append the JSON to a file; or
Utilise the 'next' link in the JSON response to grab the next URL, until there is no 'next' link left in the JSON, i.e.
while 'next' in json_response:
    keep getting data
    append to open JSON file
Both methods make sense; however, I cannot see where this pagination would fit into my code.

As you can see, the response you receive contains a _links dict; you can use the href inside next to get the next page.
Or you can try to generate those URLs manually:
>>> def paged_url(page: int=0, size=20) -> str:
...     return ("https://search-lastSale?date=20190723-20190823&"
...             f"page={page}&size={size}")
...
>>> paged_url(1)
'https://search-lastSale?date=20190723-20190823&page=1&size=20'
>>> paged_url(2)
'https://search-lastSale?date=20190723-20190823&page=2&size=20'
>>> paged_url(3, 10)
'https://search-lastSale?date=20190723-20190823&page=3&size=10'
Those URLs contain the next page you should fetch.
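If you'd rather not build the URLs yourself, a minimal sketch of the second approach you describe (follow each response's next link until it disappears) could look like this, reusing the headers and querystring from your question and assuming the href values are absolute URLs as in your sample:

import requests

url = "https://search-api.corelogic.asia/search/au/property/postcode/401249/lastSale"
params = {"page": "0", "size": "20", "date": "20190723-20190823"}
headers = {'Content-Type': "application/JSON", 'Authorization': "My Token"}

pages = []
while url:
    data = requests.get(url, headers=headers, params=params).json()
    pages.append(data)  # or pull just the records you need from each page
    # Follow the 'next' link; on the last page it is absent and the loop ends.
    url = data.get("_links", {}).get("next", {}).get("href")
    params = None  # the 'next' href already carries date/page/size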

Related

Scrape a website that loads a table based on dynamically selected dropdown values

I am trying to scrape table values from this website. There are multiple dropdowns, each of which changes based on the selection in the previous dropdown. Upon inspecting the source, the page seems to send HTTP requests and render the results. However, I can't seem to write a script that sends those requests myself and scrapes the data.
This is what I tried:
import requests

URL = 'https://voterlist.election.gov.np/bbvrs1/index_process_1.php'  # API URL
payload = 'vdc=5298&ward=1&list_type=reg_centre'  # unique payload taken from the network request
response = requests.post(URL, data=payload, verify=False)  # POST request for the data using the URL and payload
print(response.text)
This is not giving me the expected response, which should be a table containing the values. What would be the best approach to take in this case?
So after a few hours of digging I found an answer. All I had to do was send a cookie while making the request and also change the format in which the payload was sent.
import requests

session = requests.Session()

headers = {
    'num': number,  # this is the cookie value that should be sent, which happens to be a session id in my case
}
payload = {
    'state': "1",
    'district': "14",
    'vdc_mun': "5132",
    'ward': "3",
    'reg_centre': ""
}

# Then send a POST request to the URL
res = session.post(url, data=payload, headers=headers, verify=False)
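For context, here is what the full flow might look like end to end; treat it as a sketch, since the answer doesn't show how the site hands out the session id. The initial GET and the PHPSESSID cookie name below are assumptions, not something confirmed by the site:

import requests

session = requests.Session()
url = 'https://voterlist.election.gov.np/bbvrs1/index_process_1.php'

# Assumption: an initial GET lets the session collect its cookies, and the
# session id the site expects in the 'num' header lives in one of them.
session.get(url, verify=False)
number = session.cookies.get('PHPSESSID', '')  # hypothetical cookie name

headers = {'num': number}
payload = {
    'state': "1",
    'district': "14",
    'vdc_mun': "5132",
    'ward': "3",
    'reg_centre': ""
}
res = session.post(url, data=payload, headers=headers, verify=False)
print(res.text)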

How do I return all pages from a rest API request without manually specifying the page number?

I'm retrieving data from a paginated API and converting it to JSON format, and I'd like to retrieve all pages in the response without having to specify the page number in the URL. The API accepts page number and results per page (max. 250) as inputs.
I understand that the typical solution is to loop through pages using a key that specifies the address of the next page. However, it appears as though this API doesn't include a next page parameter in the output (see example response below). I can only think that the last page (i.e. total pages) parameter could be useful here? How can I scrape all of the pages without specifying the page number?
My script:
import requests
import json

url = "https://api-v2.pitchbook.com/deals/search?keywords=machine learning accelerator&perPage=250"
payload = {}
headers = {
    'Authorization': 'PB-Token 1234567'
}

response = requests.request("GET", url, headers=headers, data=payload)
data = response.json()
print(data)
Example response
{'stats': {'total': 2, 'perPage': 250, 'page': 1, 'lastPage': 1},
 'items': [{'dealId': '98982-28T', 'companyId': '162120-79', 'companyName': 'companyA'},
           {'dealId': '112532-05T', 'companyId': '233527-87', 'companyName': 'companyB'}]}
without having to specify the page number in the URL
Unless you can pass a page number in the request header, that's not possible. You could pass a very large number to the perPage parameter, but the server could always have more data, or the client could fail to deserialize such large payloads.
appears as though this API doesn't include a next page parameter in the output
It doesn't need to. You have your current page and the number of results "per page". As long as the "total" has not yet been read into your local results and "last page" is greater than your current "page", you should make a new request for page + 1.
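A minimal sketch of that loop, using the field names from your example response (the page query parameter is an assumption about how the API accepts the page number):

import requests

base_url = "https://api-v2.pitchbook.com/deals/search"
headers = {'Authorization': 'PB-Token 1234567'}
params = {
    'keywords': 'machine learning accelerator',
    'perPage': 250,
    'page': 1,  # assumption: the page number is passed as a query parameter
}

items = []
while True:
    data = requests.get(base_url, headers=headers, params=params).json()
    items.extend(data['items'])
    stats = data['stats']
    if stats['page'] >= stats['lastPage']:
        break  # we just read the last page
    params['page'] = stats['page'] + 1

print(len(items), 'deals retrieved')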

How to make an API request?

I'm trying to build an app that alerts when air quality rises above a certain level. I'm trying to get some JSON data from the API at https://api-docs.iqair.com, and they kindly provide simple copy-and-paste code. However, when I run this (with my API key), I get this error message:
requests.exceptions.MissingSchema: Invalid URL '{{urlExternalAPI}}v2/city?city=Los Angeles&state=California&country=USA&key={{my_key}}': No schema supplied. Perhaps you meant http://{{urlExternalAPI}}v2/city?city=Los Angeles&state=California&country=USA&key={{my_key}}?
I tried putting in the http, but then nothing happened.
Here's the code they provide:
import requests

url = "{{urlExternalAPI}}v2/city?city=Los Angeles&state=California&country=USA&key={{YOUR_API_KEY}}"
payload = {}
headers = {}

response = requests.request("GET", url, headers=headers, data=payload)
print(response.text.encode('utf8'))
First of all, you have to put in an actual URL rather than the curly-bracket placeholders. The docs don't spell out the base URL, but after googling it I found I merely had to use the correct one, which was https://api.airvisual.com.
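For the record, a corrected version of their snippet might look like this, with the placeholders replaced (you still substitute your own key; letting requests build the query string also takes care of the space in "Los Angeles"):

import requests

url = "https://api.airvisual.com/v2/city"
params = {
    "city": "Los Angeles",
    "state": "California",
    "country": "USA",
    "key": "YOUR_API_KEY",  # your actual API key goes here
}

response = requests.get(url, params=params)
print(response.text)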

I want to send a Python request to an ASP site but the site shows access denied

The site URL is http://rajresults.nic.in/resbserx18.htm when sending the data, but when the response comes back the URL changes to an .asp page. So which URL should the request be sent to, the .asp or the .htm?
Request:
>>> import requests
>>> # data for getting the result
>>> para = {'roll_no': '2000000', 'B1': 'Submit'}
>>> # this is the URL where the data is entered and the ASP response returned
>>> url = 'http://rajresults.nic.in/resbserx18.htm'
>>> result = requests.post(url, data=para)
>>> result.text
Response
'The page you are looking for cannot be displayed because an invalid method (HTTP verb) is being used.'
Okay, after a little bit of work, I found it's an issue with the headers.
I did some trial and error and found that the site checks to make sure the Host header is set.
To debug this, I just incrementally removed Chrome's request headers and found which one this web service was particular about.
import requests

headers = {
    "Host": "rajresults.nic.in"
}

r = requests.post('http://rajresults.nic.in/resbserx18.asp',
                  headers=headers,
                  data={'roll_no': 2000000, 'B1': 'Submit'})
print(r.text)
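If you want to reproduce that trial-and-error process, a rough sketch looks like the following; the extra header values here are illustrative stand-ins for whatever Chrome actually sent, not an exact capture:

import requests

url = 'http://rajresults.nic.in/resbserx18.asp'
data = {'roll_no': 2000000, 'B1': 'Submit'}
browser_headers = {
    "Host": "rajresults.nic.in",
    "User-Agent": "Mozilla/5.0",  # placeholder value
    "Accept": "text/html",        # placeholder value
}

# Drop one header at a time and see which removal breaks the request.
for name in list(browser_headers):
    trimmed = {k: v for k, v in browser_headers.items() if k != name}
    r = requests.post(url, headers=trimmed, data=data)
    broken = 'invalid method' in r.text  # the error message quoted above
    print(f"without {name}: {'breaks' if broken else 'still works'}")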

Unable to log in to my nytimes account using requests in Python

I am trying to send a POST request using the nice Requests library in Python. I am sending the payload as shown in the code; however, the r.text print statement shows the HTML dump of the myaccount.nytimes.com page, which is not what I want. Does anyone know what's happening?
import requests

payload = {
    'userid': 'myemail',
    'password': 'mypass'
}

s = requests.session()
r = s.post('https://myaccount.nytimes.com/auth/login/?URI=http://www.nytimes.com/2014/09/13/opinion/on-long-island-a-worthy-plan-for-coastal-flooding.html?partner=rss', data=payload)
print(r.text)
There are a couple of hidden <input> fields that you are omitting from your form:
is_continue
expires
token
token looks like it would be required, maybe the others aren't.
And possibly remember, which is the "remember me" tickbox at the bottom of the form.
Starting with token try incrementally adding fields until it works.
Edit from comment: Token is provided to you when you first access the login page. Thus you need to do an initial GET to https://myaccount.nytimes.com/auth/login/, parse the HTML (BeautifulSoup?) to get the token (and other fields), then POST back to the server. Or you could use mechanize to handle this more easily.
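A sketch of that GET-parse-POST flow with BeautifulSoup might look like this; the exact markup of the login form is an assumption, so the hidden-field scraping below is illustrative:

import requests
from bs4 import BeautifulSoup

login_url = 'https://myaccount.nytimes.com/auth/login/'
s = requests.Session()

# 1. GET the login page so the hidden fields (token, expires, is_continue) can be scraped.
soup = BeautifulSoup(s.get(login_url).text, 'html.parser')

payload = {'userid': 'myemail', 'password': 'mypass'}
# 2. Copy every hidden <input> into the payload alongside the credentials.
for hidden in soup.find_all('input', type='hidden'):
    if hidden.get('name'):
        payload[hidden['name']] = hidden.get('value', '')

# 3. POST the combined form back to the server.
r = s.post(login_url, data=payload)
print(r.status_code)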
