Fetch all pages using Python requests - python

I am using an API to fetch orders from a website. The problem is that a single call only fetches 20 orders. I figured out that I need to use a pagination iterator, but I don't know how to use it. How can I fetch all the orders at once?
My code:
def search_orders(self):
    headers = {'Authorization': 'Bearer %s' % self.token, 'Content-Type': 'application/json'}
    url = "https://api.flipkart.net/sellers/orders/search"
    filter = {"filter": {"states": ["APPROVED", "PACKED"]}}
    return requests.post(url, data=json.dumps(filter), headers=headers)
Here is a link to the documentation: Documentation

You need to do what the documentation suggests -
The first call to the Search API returns a finite number of results based on the pageSize value. Calling the URL returned in the nextPageURL field of the response gets the subsequent pages of the search result.
nextPageUrl - String - A GET call on this URL fetches the next page results. Not present for the last page
(Emphasis mine)
You can use response.json() to get the JSON body of the response. Then check the hasMore flag to see if there are more pages; if so, use requests.get() to fetch the next page, and keep doing this until hasMore is false. Example -
def search_orders(self):
    headers = {'Authorization': 'Bearer %s' % self.token, 'Content-Type': 'application/json'}
    url = "https://api.flipkart.net/sellers/orders/search"
    filter = {"filter": {"states": ["APPROVED", "PACKED"]}}
    s = requests.Session()
    response = s.post(url, data=json.dumps(filter), headers=headers)
    orderList = []
    resp_json = response.json()
    # use extend so orderList stays a flat list of order items
    orderList.extend(resp_json["orderItems"])
    while resp_json.get('hasMore'):
        # nextPageUrl is relative, so prepend the API host; pass the auth headers again
        response = s.get('https://api.flipkart.net/sellers{0}'.format(resp_json['nextPageUrl']), headers=headers)
        resp_json = response.json()
        orderList.extend(resp_json["orderItems"])
    return orderList
The above code should return the complete list of orders.

Related

Why does GitHub pagination give varying results?

If I perform a code search using the GitHub Search API and request 100 results per page, I get a varying number of results -
import requests
# url = "https://api.github.com/search/code?q=torch +in:file + language:python+size:0..250&page=1&per_page=100"
url = "https://api.github.com/search/code?q=torch +in:file + language:python&page=1&per_page=100"
headers = {
    'Authorization': 'Token xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
}
response = requests.request("GET", url, headers=headers).json()
print(len(response['items']))
Thanks to this answer, I have the following workaround: I run the query multiple times until I get the required results on a page.
My current project requires me to iterate through the search API looking for files of varying sizes. I am basically repeating the procedure described here. Therefore my code looks something like this -
url = "https://api.github.com/search/code?q=torch +in:file + language:python+size:0..250&page=1&per_page=100"
In this case, I don't know in advance the number of results a page should actually have. Could someone tell me a workaround for this? Maybe I am using the Search API incorrectly?
GitHub provides documentation about Using pagination in the REST API. Each response includes a Link header that includes a link for the next set of results (along with other links); you can use this to iterate over the complete result set.
For the particular search you're doing ("every python file that contains the word 'torch'"), you're going to run into rate limits fairly quickly, but for example the following code would iterate over results, 10 at a time (or so), until 50 or more results have been read:
import os
import requests
import httplink

url = "https://api.github.com/search/code?q=torch +in:file + language:python&page=1&per_page=10"
headers = {"Authorization": f'Token {os.environ["GITHUB_TOKEN"]}'}

# This is the total number of items we want to fetch
max_items = 50

# This is how many items we've retrieved so far
total = 0

try:
    while True:
        res = requests.request("GET", url, headers=headers)
        res.raise_for_status()
        link = httplink.parse_link_header(res.headers["link"])

        data = res.json()
        for i, item in enumerate(data["items"], start=total):
            print(f'[{i}] {item["html_url"]}')

        if "next" not in link:
            break

        total += len(data["items"])
        if total >= max_items:
            break

        url = link["next"].target
except requests.exceptions.HTTPError as err:
    print(err)
    print(err.response.text)
Here I'm using the httplink module to parse the Link header, but you could accomplish the same thing with an appropriate regular expression and the re module.
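For example, a minimal sketch of that regex approach, assuming the header follows the usual <url>; rel="next" convention, could look like this:
import re

def next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    # A Link header looks like: <https://...&page=2>; rel="next", <https://...&page=34>; rel="last"
    for target, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', link_header or ""):
        if rel == "next":
            return target
    return None
In the loop above you would then set url = next_link(res.headers.get("link")) and stop when it returns None.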

Is there a way to store previous response and compare to current?

I usually google my questions but couldn't really find anything this time. I'm not really into programming either, just trying.
Basically I'm trying to write a script that repeatedly makes a GET request to an HTML page, gets the content, stores it for some time, then makes another request, compares the new response content with the previous one, and acts on the result: if they're equal, do nothing; if they're different, perform a new action.
Here is the code I wrote. I understand WHY it's not working: it does both requests at the same time. I just can't figure out how to store the previous response properly.
import requests
import time

url = 'http://127.0.0.1:5000/'

result = requests.request("GET", url, verify=True)
new_result = requests.request("GET", url, verify=True)

print(result.content)
print(new_result.content)

while result.content == new_result.content:
    print('condition 1')
    time.sleep(15)

while result.content != new_result.content:
    print('condition 2')
    time.sleep(15)
This should get you started. I've replaced your 2 loops with just one (infinite for now, but you can always change that to suit your actual requirements). The key is to do the sleep before getting the new_result, then check whether it's equal to the old one. And then finally update the "old" one to be the result you just fetched.
import requests
import time

url = 'http://127.0.0.1:5000/'

result = requests.request("GET", url, verify=True)

while True:
    time.sleep(15)
    new_result = requests.request("GET", url, verify=True)
    print(result.content)
    print(new_result.content)
    if result.content == new_result.content:
        print('condition 1')
    else:
        print('condition 2')
    result = new_result

Python - Finding total number of pages of a REST API

I am trying to loop through a REST API and fetch the complete data set.
url = f'https://apiurl.com/api/1.1/json/tickets?page=1'
auth = (f'{api_key}', f'{auth_code}')
res = requests.get(url, auth=auth)
data = json.loads(res.content)
The above returns data for page 1, and I am able to do it for all other pages, page by page, by specifying the page number in the URL. I am not sure how to find the total number of pages so that I can write a for loop that does it for all the pages in the API feed.
I was able to get the number of pages using the below code:
res = requests.get(url, auth=auth)
data = res.json()
while 'next' in res.links.keys():
    res = requests.get(res.links['next']['url'], auth=auth)
    data.extend(res.json())
page_count = data['page_info']['page_count']  # <-- this returns the max page count
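With the page count known, a for loop over the feed might look like the sketch below. It reuses api_key and auth_code from the question and assumes the API accepts a page query parameter and exposes the count under page_info.page_count, as in the snippet above; adjust those names to the actual response.
import requests

auth = (api_key, auth_code)  # credentials from the question
base_url = 'https://apiurl.com/api/1.1/json/tickets?page={}'

# read the total page count from the first page (field names assumed from the snippet above)
first_page = requests.get(base_url.format(1), auth=auth).json()
page_count = first_page['page_info']['page_count']

pages = [first_page]
for page in range(2, page_count + 1):
    res = requests.get(base_url.format(page), auth=auth)
    pages.append(res.json())  # one parsed JSON payload per page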

Overwrite response from api with python

I would like to get data from an API, but the API is paginated, so I have to iterate through all of the pages and save the wanted data in a variable.
I was trying to fetch each page in a loop and add its data to my response, but all I got was the error "TypeError: must be str, not Response". I wanted to do it this way:
response = "https://api.dane.gov.pl/resources/17201/data?page=1"
for i in range(2,32):
url = "https://api.dane.gov.pl/resources/17201/data?page="+str(i)
response += requests.get(url)
data = response.text
Once I have the data, I want to extract what I need and operate on it.
requests.get(url) returns a Response object. At the moment, you are trying to add the Response object to a string.
Try something like this:
response = []
for i in range(1, 32):  # start at page 1 so the first page is included
    url = "https://api.dane.gov.pl/resources/17201/data?page=" + str(i)
    response.append(requests.get(url).text)
When that finishes running, response will be a list full of the response text instead of response objects.
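Assuming the endpoint returns JSON (it looks like a JSON API), a small follow-up sketch for the "extract and operate" step could parse each page instead of keeping raw text; the "data" key here is an assumption about the payload, so inspect one response to confirm it:
import requests

records = []
for i in range(1, 32):
    url = "https://api.dane.gov.pl/resources/17201/data?page=" + str(i)
    page = requests.get(url).json()       # parse the JSON body instead of keeping raw text
    records.extend(page.get("data", []))  # "data" is an assumed key; adjust to the real structure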

Python client to a paginated HTTP API

I'm using requests to query a paginated API.
This answer already seems pretty un-pythonic to me but my case is even worse because the API doesn't have a last page parameter, it only has a last URL parameter, so I have to do
url = 'http://my_url'
result = request(url)
data = result['datum']
while not result['last_url'] == url:
    url = result['next_page']
    result = request(url)
    data += result['datum']
This doesn't seem pythonic, is there a more elegant way?
I would write a generator function which yields pages:
def pages(url):
    page = {'last_url': None,
            'next_page': url}
    while page['last_url'] != url:
        url = page['next_page']
        page = request(url)
        yield page
Then you can use it like so:
data = ''.join([page['datum'] for page in pages(url)])
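If 'datum' holds lists rather than strings, a small variant of that last line, purely as a sketch, would flatten the pages with itertools.chain:
from itertools import chain

# flatten list-valued 'datum' fields from every page into one list
data = list(chain.from_iterable(page['datum'] for page in pages(url)))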
