Hi, I want to crawl an XHR request URL that returns a JSON feed. When I change the page query parameter to 2 (or any other value) it still returns the data from page 1, but when I do the same in the browser it shows the data for the requested page.
import json
import requests

url = 'https://www.daraz.pk/computer-graphic-cards/?'
params_dict = {}
params_dict['ajax'] = 'true'
params_dict['page'] = 1
params_dict['spm'] = 'a2a0e.home.cate_2_9.1.35e349378NoL6f'

res = requests.get(url, params=params_dict)
data = json.loads(res.text)
res.url  # the URL changes, but the content is still that of page 1

info = data.get('mods').get('listItems')
for i in info:
    print(i['name'])
I think the issue lies in how the data is being returned. I modified the call slightly by looping over the pages.
Looking at the data returned, it seems that some products are returned on multiple pages, even in the UI.
for page_num in range(1, 7):
    res = requests.get('https://www.daraz.pk/computer-graphic-cards/?ajax=true&page=' + str(page_num)).json()
    info = res.get('mods').get('listItems')
    for i in info:
        print('%s:%s:%s---------%s' % (i['itemId'], i['sellerName'], i['skuId'], i['name']))
    print('----------------------- PAGE %s ------------------------------------------' % (page_num))
Data returned from this code snippet is linked here.
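Since the same item can appear on several pages, one way to collect a clean list is to de-duplicate on itemId while looping. A minimal sketch, reusing the URL and field names from the snippet above:

import requests

seen = set()
unique_items = []
for page_num in range(1, 7):
    res = requests.get('https://www.daraz.pk/computer-graphic-cards/?ajax=true&page=' + str(page_num)).json()
    for item in res.get('mods').get('listItems'):
        # skip products already collected from an earlier page
        if item['itemId'] in seen:
            continue
        seen.add(item['itemId'])
        unique_items.append(item)

print(len(unique_items), 'unique products')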
I am trying to download PDF files from a website where they are listed in a table with pagination. I can download the PDFs from the first page, but not from all of the 4000+ pages. When I tried to understand the logic by observing the URL during pagination, it appears static, with no additional value appended to it, and I couldn't figure out how to fetch the PDFs from every page of the table using BeautifulSoup.
Attached below is the code I am using to download the PDF files from the table on the website:
# Import libraries
import re
import requests
from bs4 import BeautifulSoup

# URL from which the PDFs are to be downloaded
url = "https://loksabha.nic.in/Questions/Qtextsearch.aspx"

# Request the URL and get the response object
response = requests.get(url)

# Parse the text obtained
soup = BeautifulSoup(response.text, "html.parser")
span = soup.find("span", id="ContentPlaceHolder1_lblfrom")
Total_pages = re.findall(r'\d+', span.text)
print(Total_pages[0])

# Find all hyperlinks present on the webpage
# links = soup.find_all('a')
# assumption: the first table on the page holds the result links
# (table1 was left undefined in the original snippet)
table1 = soup.find("table")

i = 0
# From all links check for a pdf link and,
# if present, download the file
# for link in links:
for link in table1.find_all('a'):
    if '.pdf' in link.get('href', ''):
        list2 = re.findall('CalenderUploading', link.get('href', ''))
        if len(list2) == 0:
            # url = re.findall('hindi', link.get('href', ''))
            print(link.get('href', ''))
            i += 1
            # Get the response object for the link
            response = requests.get(link.get('href'))
            # Write the content to a pdf file
            pdf = open("pdf" + str(i) + ".pdf", 'wb')
            pdf.write(response.content)
            pdf.close()
            print("File ", i, " downloaded")
print("All PDF files downloaded")
Firstly, you need to establish a session on the first call so that the cookie values are stored:
sess = requests.session()
and then use sess.get (and sess.post) instead of requests.get for every subsequent call.
Secondly, it is not static: subsequent pages are not fetched with a GET request.
They are fetched with a POST request that includes ctl00$ContentPlaceHolder1$txtpage="2" for page 2.
So: make a session with requests, and capture the view-state parameters after the first request using BeautifulSoup.
The values of __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION, etc. are in a <div class="aspNetHidden"> when you request the page for the first time.
For subsequent pages you have to pass these parameters, along with the page number (ctl00$ContentPlaceHolder1$txtpage="2"), in the POST body, using "POST" and not "GET".
This, for example, is what is sent by the POST request for page 4001 on the loksabha site.
Work out the other parts yourself; don't expect a complete solution here :-)
import requests
from bs4 import BeautifulSoup as bs

sess = requests.session()
resp = sess.get('https://loksabha.nic.in/Questions/Qtextsearch.aspx')
soup = bs(resp.content, 'html.parser')
vstat = soup.find('input', {'name': '__VIEWSTATE'})['value']
vstatgen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'})['value']
vstatenc = soup.find('input', {'name': '__VIEWSTATEENCRYPTED'})['value']
eventval = soup.find('input', {'name': '__EVENTVALIDATION'})['value']

for pagenum in range(4000):  # change as per your old code
    postback = {'__EVENTTARGET': 'ctl00$ContentPlaceHolder1$cmdNext',
                '__EVENTARGUMENT': '',
                '__VIEWSTATE': vstat,
                '__VIEWSTATEGENERATOR': vstatgen,
                '__VIEWSTATEENCRYPTED': vstatenc,
                '__EVENTVALIDATION': eventval,
                'ctl00$txtSearchGlobal': '',
                'ctl00$ContentPlaceHolder1$ddlfile': '.pdf',
                'ctl00$ContentPlaceHolder1$TextBox1': '',
                'ctl00$ContentPlaceHolder1$btn': 'allwordbtn',
                'ctl00$ContentPlaceHolder1$btn1': 'titlebtn',
                'ctl00$ContentPlaceHolder1$txtpage': str(pagenum)}
    resp = sess.post('https://loksabha.nic.in/Questions/Qtextsearch.aspx', data=postback)
    soup = bs(resp.content, 'html.parser')
    # refresh the hidden view-state fields for the next request
    vstat = soup.find('input', {'name': '__VIEWSTATE'})['value']
    vstatgen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'})['value']
    vstatenc = soup.find('input', {'name': '__VIEWSTATEENCRYPTED'})['value']
    eventval = soup.find('input', {'name': '__EVENTVALIDATION'})['value']
    ### process next page...extract pdfs here
    ###
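To fill in the "extract pdfs here" step, here is a minimal sketch, assuming (as in the question's code) that the result links are <a> tags whose href contains ".pdf"; you may need to prepend the site root if the hrefs turn out to be relative:

def download_pdfs_from_page(soup, sess, counter):
    # Process one page's soup from the loop above, reusing the same session.
    # Skips the "CalenderUploading" links, mirroring the question's filter.
    for link in soup.find_all('a'):
        href = link.get('href', '')
        if '.pdf' in href and 'CalenderUploading' not in href:
            counter += 1
            pdf_resp = sess.get(href)
            with open("pdf" + str(counter) + ".pdf", 'wb') as f:
                f.write(pdf_resp.content)
    return counter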
I am trying to get the links from all the pages on https://apexranked.com/. I tried using
url = 'https://apexranked.com/'
page = 1
while page != 121:
    url = f'https://apexranked.com/?page={page}'
    print(url)
    page = page + 1
however, if you click on the page numbers, the URL doesn't include a https://apexranked.com/?page=number, as it does on https://www.mlb.com/stats/?page=2. How would I go about accessing and getting the links from all pages if the page doesn't include ?page=number after the link?
The page is not reloading when you click on page 2. Instead, it fires a GET request to the website's backend.
The request is sent to: https://apexranked.com/wp-admin/admin-ajax.php
In addition, several parameters are passed directly in the URL:
?action=get_player_data&page=3&total_pages=195&_=1657230896643
Parameters:
action: as the endpoint can serve several purposes, you must indicate the action to perform. Surely a mandatory parameter; don't omit it.
page: indicates the requested page (i.e. the index you're iterating over).
total_pages: indicates the total number of pages (maybe it can be omitted; otherwise you can scrape it from the main page).
_: this one corresponds to a Unix timestamp (in milliseconds); same idea as above, try omitting it and see what happens. Otherwise you can get a Unix timestamp quite easily with time.time().
Once you get a response, it yields rendered HTML; maybe try setting the Accept: application/json field in the request headers to get JSON, but that's just a detail.
All this information wrapped up:
import requests
import time

url = "https://apexranked.com/wp-admin/admin-ajax.php"

# Issued from a previous scraping on the main page
total_pages = 195

params = {
    "total_pages": total_pages,
    "_": round(time.time() * 1000),
    "action": "get_player_data"
}

# Make sure to include all mandatory fields
headers = {
    ...
}

for k in range(1, total_pages + 1):
    params['page'] = k
    res = requests.get(url, headers=headers, params=params)
    # Make your thing :)
I don't know exactly what you mean, but if, for example, you want to get the raw text, you can do it with requests:
import requests

page = 1
# A loop that will keep going until the page is not found.
while requests.get(f"https://apexranked.com/?page={page}").status_code != 404:
    # scrape content, e.g. the whole page
    link = f"https://apexranked.com/?page={page}"
    page = page + 1
You can also then add each link to an array with nameOfArray.append(link).
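For example, a minimal sketch that collects the page URLs in a list (the variable name links is just illustrative):

import requests

links = []
page = 1
while requests.get(f"https://apexranked.com/?page={page}").status_code != 404:
    link = f"https://apexranked.com/?page={page}"
    links.append(link)  # collect every page URL for later processing
    page = page + 1

print(len(links), 'pages collected')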
I am trying to loop through a REST API and fetch the complete data set.
import json
import requests

url = 'https://apiurl.com/api/1.1/json/tickets?page=1'
auth = (api_key, auth_code)
res = requests.get(url, auth=auth)
data = json.loads(res.content)
The above returns the data for page 1, and I am able to do it for all other pages, page by page, by specifying the page number in the URL. I am not sure how to find the total number of pages so that I can run a for loop over all pages in the API feed.
I was able to get the number of pages using the below code:
res = requests.get(url, auth=auth)
data = res.json()

# Follow the "next" links from the response headers until the last page
while 'next' in res.links.keys():
    res = requests.get(res.links['next']['url'], auth=auth)
    data.extend(res.json())

page_count = data['page_info']['page_count']  # <<-- This returns the max page count
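Once the page count is known, the loop the question asks for is straightforward. A minimal sketch, assuming the same ?page= query parameter and that page_count and auth were obtained as above:

import json
import requests

all_pages = []
for page in range(1, page_count + 1):
    url = f'https://apiurl.com/api/1.1/json/tickets?page={page}'
    res = requests.get(url, auth=auth)
    data = json.loads(res.content)
    # accumulate each page's payload; the exact record key depends on the API's response shape
    all_pages.append(data)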
I am writing a scraper to get the complete list of movies available on hungama.com.
I am requesting the URL "http://www.hungama.com/all/hungama-picks-54/4470/" to get the response.
When you go to this URL it shows 12 movies on the screen, but as you scroll down the movie count keeps increasing via auto reload.
I am parsing the HTML source page with the code below:
response.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
but I only get 12 items, whereas there are more movie items. How can I get all the movies available? Please help.
While scrolling down the content of that page, if you take a good look at the XHR tab in the Network category within the dev tools, you can see that it produces URLs with a pagination feature attached, like http://www.hungama.com/all/hungama-picks-54/3632/2/. So, changing the lines as I did below, you can get all the content from that page.
import requests
from scrapy import Selector

page = 1
URL = "http://www.hungama.com/all/hungama-picks-54/3632/"

while True:
    page += 1
    res = requests.get(URL)
    sel = Selector(text=res.text)
    container = sel.css(".leftbox")
    if len(container) <= 0:
        break
    for item in container:
        title = item.css("#pajax_a::text").extract_first()
        year = item.css(".subttl::text").extract_first()
        print(title, year)
    next_page = "http://www.hungama.com/all/hungama-picks-54/3632/{}/"
    URL = next_page.format(page)
Btw, the URL you provided above is not working; the one I've supplied is active now. However, I think you've understood the logic.
There seems to be an AJAX request, used as a lazy-load feature, with the URL http://www.hungama.com/all/hungama-picks-54/4470/2/?ajax_call=1&_country=IN, which fetches the movies.
In the above URL, change the 2 to 3 (http://www.hungama.com/all/hungama-picks-54/4470/3/?ajax_call=1&_country=IN) and so on to get the details of the next movies.
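A minimal sketch of that idea, assuming the endpoint keeps returning HTML fragments and eventually comes back empty (the CSS selector is the one from the question):

import requests
from scrapy import Selector

page = 1
while True:
    url = f"http://www.hungama.com/all/hungama-picks-54/4470/{page}/?ajax_call=1&_country=IN"
    res = requests.get(url)
    sel = Selector(text=res.text)
    titles = sel.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
    if not titles:
        break  # no more movies returned; stop paginating
    for title in titles:
        print(title)
    page += 1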
Is there a way to iterate through a page's archives where the format is
'http://base_url/page=#' - where # is the 2nd through nth page number?
Ideally I'd like to deploy my scraper on every successive page after 'base_url'.
Is there a function or for loop in Python where the base_url would be iterated through, like:
page = i in range(nth)
base_url ='http://base_url/page={}'
e.g. http://www.businessinsider.com/?page=3 vs. http://www.businessinsider.com/
You can just request each page like so:
# python 2
# from urllib2 import urlopen
# python 3
from urllib.request import urlopen

base_url = "http://example.com/"

# request page 1 through 10
n = 10
for i in range(1, n + 1):
    if i == 1:
        # handle the first page, which has no ?page= suffix
        response = urlopen(base_url)
    else:
        response = urlopen(base_url + "?page=%d" % i)
    data = response.read()
    # handle data here
EDIT: urlopen() returns an HTTPResponse or addinfourl object (depending on your Python version) - you need to call .read() on that to get the string of data. (I've updated my example code above, too).