I am trying to loop through a REST API and fetch the complete data set.
url = f'https://apiurl.com/api/1.1/json/tickets?page=1'
auth = (f'{api_key}', f'{auth_code}')
res = requests.get(url, auth=auth)
data = json.loads(res.content)
The above returns the data for page 1, and I can do the same for every other page by specifying the page number in the URL. What I am not sure about is how to find the total number of pages so that I can run a for loop over all pages in the API feed.
I was able to get the number of pages using the code below:
res = requests.get(url, auth=auth)
data = res.json()
while 'next' in res.links:
    res = requests.get(res.links['next']['url'])
    data.extend(res.json())

page_count = data['page_info']['page_count']  # <-- this returns the max page count
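With the page count known, a plain for loop over the page index covers the whole feed. A minimal sketch along those lines, assuming the same tickets endpoint and that the first response carries page_info.page_count; the 'tickets' payload key is a hypothetical name:
import requests

auth = (api_key, auth_code)  # credentials as in the question

# Fetch page 1 once to learn the total number of pages
first = requests.get('https://apiurl.com/api/1.1/json/tickets?page=1', auth=auth).json()
page_count = first['page_info']['page_count']

all_records = first.get('tickets', [])  # 'tickets' is a hypothetical payload key
for page in range(2, page_count + 1):
    res = requests.get(f'https://apiurl.com/api/1.1/json/tickets?page={page}', auth=auth)
    res.raise_for_status()
    all_records.extend(res.json().get('tickets', []))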
If I perform a code search using the GitHub Search API and request 100 results per page, I get a varying number of results -
import requests
# url = "https://api.github.com/search/code?q=torch +in:file + language:python+size:0..250&page=1&per_page=100"
url = "https://api.github.com/search/code?q=torch +in:file + language:python&page=1&per_page=100"
headers = {
    'Authorization': 'Token xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
}
response = requests.request("GET", url, headers=headers).json()
print(len(response['items']))
Thanks to this answer, I have the following workaround: I run the query multiple times until I get the required results on a page.
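For illustration, a minimal sketch of that retry workaround; the expected per-page count of 100 and the retry limit are assumptions, and the last page legitimately has fewer items:
import requests

def fetch_page(url, headers, expected=100, max_retries=5):
    # Re-issue the same query until the page returns the expected number
    # of items; the last page legitimately has fewer, so also stop after
    # max_retries attempts and keep the best result seen.
    best = []
    for _ in range(max_retries):
        items = requests.get(url, headers=headers).json().get('items', [])
        if len(items) >= expected:
            return items
        if len(items) > len(best):
            best = items
    return best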
My current project requires me to iterate through the search API looking for files of varying sizes. I am basically repeating the procedure described here. Therefore my code looks something like this -
url = "https://api.github.com/search/code?q=torch +in:file + language:python+size:0..250&page=1&per_page=100"
In this case, I don't know in advance the number of results a page should actually have. Could someone tell me a workaround for this? Maybe I am using the Search API incorrectly?
GitHub provides documentation about Using pagination in the REST API. Each response includes a Link header that includes a link for the next set of results (along with other links); you can use this to iterate over the complete result set.
For the particular search you're doing ("every python file that contains the word 'torch'"), you're going to run into rate limits fairly quickly, but for example the following code would iterate over results, 10 at a time (or so), until 50 or more results have been read:
import os
import requests
import httplink

url = "https://api.github.com/search/code?q=torch +in:file + language:python&page=1&per_page=10"
headers = {"Authorization": f'Token {os.environ["GITHUB_TOKEN"]}'}

# This is the total number of items we want to fetch
max_items = 50

# This is how many items we've retrieved so far
total = 0

try:
    while True:
        res = requests.request("GET", url, headers=headers)
        res.raise_for_status()

        link = httplink.parse_link_header(res.headers["link"])
        data = res.json()

        for i, item in enumerate(data["items"], start=total):
            print(f'[{i}] {item["html_url"]}')

        if "next" not in link:
            break

        total += len(data["items"])
        if total >= max_items:
            break

        url = link["next"].target
except requests.exceptions.HTTPError as err:
    print(err)
    print(err.response.text)
Here I'm using the httplink module to parse the Link header, but you could accomplish the same thing with an appropriate regular expression and the re module.
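For reference, a minimal sketch of that alternative, assuming the Link header follows the usual <url>; rel="name" format:
import re

def parse_link_header(value):
    # Extract rel -> url pairs from a header such as:
    #   <https://api.github.com/search/code?page=2>; rel="next", <...>; rel="last"
    return {rel: url for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', value)}

links = parse_link_header(res.headers["link"])
if "next" in links:
    url = links["next"]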
I am trying to get the links from all the pages on https://apexranked.com/. I tried using
url = 'https://apexranked.com/'
page = 1
while page != 121:
    url = f'https://apexranked.com/?page={page}'
    print(url)
    page = page + 1
However, if you click on the page numbers, the URL doesn't change to include ?page=number the way it does on, for example, https://www.mlb.com/stats/?page=2. How would I go about accessing and getting the links from all pages if the page number never appears in the URL?
The page is not reloading when you click on page 2. Instead, it is firing a GET request to the website's backend.
The request is being sent to : https://apexranked.com/wp-admin/admin-ajax.php
In addition, several parameters are passed in a query string appended to that URL.
?action=get_player_data&page=3&total_pages=195&_=1657230896643
Parameters:
action: as the endpoint can serve several purposes, you must indicate the action being performed. Surely a mandatory parameter; don't omit it.
page: indicates the requested page (i.e. the index you're iterating over).
total_pages: indicates the total number of pages (maybe it can be omitted; otherwise you can scrape it from the main page).
_: this one is a Unix timestamp (in milliseconds here); same idea as above, try omitting it and see what happens. Otherwise you can get one easily from time.time().
Once you get a response, it yields rendered HTML; you could try setting an Accept: application/json field in the request headers to get JSON instead, but that's just a detail.
All this information wrapped up:
import requests
import time

url = "https://apexranked.com/wp-admin/admin-ajax.php"

# Scraped beforehand from the main page
total_pages = 195

params = {
    "total_pages": total_pages,
    "_": round(time.time() * 1000),  # Unix timestamp in milliseconds
    "action": "get_player_data",
}

# Make sure to include all mandatory fields
headers = {
    # ...
}

for k in range(1, total_pages + 1):
    params["page"] = k
    res = requests.get(url, headers=headers, params=params)
    # Do your thing :)
I don't know exactly what you mean, but if you want to get the raw text, for example, you can do it with requests:
import requests

page = 1  # start from the first page
# A loop that keeps going until a page is not found.
while requests.get(f"https://apexranked.com/?page={page}").status_code != 404:
    # scrape content, e.g. the whole page
    link = f"https://apexranked.com/?page={page}"
    page = page + 1
You can then also collect each link in a list with nameOfArray.append(link).
I am trying to search through multiple pages through an API GET request, but I am unsure how to search past the first page of results.
The first GET request downloads the first page of results, but to get to the next page you need to GET the new URL, which is listed at the bottom of the first page.
def getaccount_data():
    view_code = input("Enter Account: ")
    global token
    header = {"X-Fid-Identity": token}
    url = BASE_URL + '/data/v1/entity/account'
    accountdata = requests.get(url, headers=header, verify=False)
    newaccountdata = accountdata.json()
    for data in newaccountdata['results']:
        if (data['fields']['view_code']) == view_code:
            print("Account is set up")
        else:
            url = (newaccountdata['moreUri'])
            print(url)

getaccount_data()
Is there any way to search all the pages by updating the URL to get to the next page?
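One approach is to loop, following whatever URL the response reports for the next page. A minimal sketch under the question's own setup (token and BASE_URL as defined there), assuming moreUri is a path relative to BASE_URL and is absent on the last page:
import requests

def getaccount_data(view_code):
    header = {"X-Fid-Identity": token}
    url = BASE_URL + '/data/v1/entity/account'
    while url:
        page = requests.get(url, headers=header, verify=False).json()
        for data in page['results']:
            if data['fields']['view_code'] == view_code:
                print("Account is set up")
                return
        # Follow the next-page link if present; stop otherwise (assumed behaviour)
        more = page.get('moreUri')
        url = BASE_URL + more if more else None

getaccount_data(input("Enter Account: "))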
I want to crawl an XHR request URL that has a JSON feed, but when I change the page query parameter value to 2 or anything else, it still retrieves the data from page 1. When I do the same in the browser, it shows the data for the requested page.
import json
import requests
url = 'https://www.daraz.pk/computer-graphic-cards/?'
params_dict = {}
params_dict['ajax']= 'true'
params_dict['page']= 1
params_dict['spm'] = 'a2a0e.home.cate_2_9.1.35e349378NoL6f'
res = requests.get(url, params=params_dict)
data = json.loads(res.text)
res.url  # the URL changes, but the content is the same as page 1

info = data.get('mods').get('listItems')
for i in info:
    print(i['name'])
I think there are issues with how the data is being returned. I modified the call slightly by looping over the pages.
Looking at the data returned, it seems that some products are being returned on multiple pages even in the UI.
for page_num in range(1, 7):
    res = requests.get('https://www.daraz.pk/computer-graphic-cards/?ajax=true&page=' + str(page_num)).json()
    info = res.get('mods').get('listItems')
    for i in info:
        print('%s:%s:%s---------%s' % (i['itemId'], i['sellerName'], i['skuId'], i['name']))
    print('----------------------- PAGE %s ------------------------------------------' % (page_num))
Data returned from this code snippet is linked here.
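Since the same product can appear on several pages, it may help to deduplicate by item ID while paging. A minimal sketch of that idea; treating itemId as the unique key is an assumption based on the fields above:
import requests

seen = set()
unique_items = []
for page_num in range(1, 7):
    res = requests.get('https://www.daraz.pk/computer-graphic-cards/?ajax=true&page=' + str(page_num)).json()
    for item in res.get('mods').get('listItems'):
        if item['itemId'] not in seen:  # skip products repeated across pages
            seen.add(item['itemId'])
            unique_items.append(item)
print('%d unique products found' % len(unique_items))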
I am using an API to fetch orders from a website. The problem is that it only fetches 20 orders at a time. I figured out I need to use a pagination iterator but don't know how to use it. How do I fetch all the orders at once?
My code:
def search_orders(self):
    headers = {'Authorization': 'Bearer %s' % self.token, 'Content-Type': 'application/json'}
    url = "https://api.flipkart.net/sellers/orders/search"
    filter = {"filter": {"states": ["APPROVED", "PACKED"]}}
    return requests.post(url, data=json.dumps(filter), headers=headers)
Here is a link to the documentation.
You need to do what the documentation suggests -
The first call to the Search API returns a finite number of results based on the pageSize value. Calling the URL returned in the nextPageURL field of the response gets the subsequent pages of the search result.
nextPageUrl - String - A GET call on this URL fetches the next page results. Not present for the last page
(Emphasis mine)
You can use response.json() to get the JSON body of the response. Then you can check the hasMore flag to see if there are more results; if so, use requests.get() to fetch the next page, and keep doing this until hasMore is false. Example -
def search_orders(self):
    headers = {'Authorization': 'Bearer %s' % self.token, 'Content-Type': 'application/json'}
    url = "https://api.flipkart.net/sellers/orders/search"
    filter = {"filter": {"states": ["APPROVED", "PACKED"]}}
    s = requests.Session()
    response = s.post(url, data=json.dumps(filter), headers=headers)
    orderList = []
    resp_json = response.json()
    orderList.extend(resp_json["orderItems"])
    while resp_json.get('hasMore') == True:
        # nextPageUrl is relative, so prefix the API host
        response = s.get('https://api.flipkart.net/sellers{0}'.format(resp_json['nextPageUrl']))
        resp_json = response.json()
        orderList.extend(resp_json["orderItems"])
    return orderList
The above code should return the complete list of orders.