Why won't Python requests pagination work?

I'm trying to use pagination to request multiple pages of rental listings from Zillow; otherwise I'm limited to the first page only. However, my code seems to load the first page even when I specify other pages manually.
# Rent
import requests
from bs4 import BeautifulSoup as soup
import json

url = 'https://www.zillow.com/torrance-ca/rentals'
params = {
    'q': {"pagination": {"currentPage": 1}, "isMapVisible": False, "filterState": {"fore": {"value": False}, "mf": {"value": False}, "ah": {"value": True}, "auc": {"value": False}, "nc": {"value": False}, "fr": {"value": True}, "land": {"value": False}, "manu": {"value": False}, "fsbo": {"value": False}, "cmsn": {"value": False}, "fsba": {"value": False}}, "isListVisible": True}
}
headers = {
    # headers were copied from the Network tab in Chrome's developer tools
}

html = requests.get(url=url, headers=headers, params=params)
html.status_code

bsobj = soup(html.content, 'lxml')

for script in bsobj.find_all('script'):
    inner_text_with_string = str(script.string)
    if inner_text_with_string[:18] == '<!--{"queryState":':
        my_query = inner_text_with_string

my_query = my_query.strip('><!-')
data = json.loads(my_query)
data = data['cat1']['searchResults']['listResults']
print(data)
This returns about 40 listings. However, if I change "pagination":{"currentPage": 1} to "pagination":{"currentPage": 2}, it returns the same listings! It's as if the pagination parameter isn't recognized.
I believe these are the correct parameters, as I took them straight from the URL query string and used http://urlprettyprint.com/ to make it readable.
Any thoughts on what I'm doing wrong?

Using the params argument with requests sends the wrong data here; you can confirm this by printing response.url. What I would do is build the query string with urllib.parse.urlencode:
from urllib.parse import urlencode
...
html = requests.get(url=f"{url}?{urlencode(params)}", headers=headers)
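Note that urlencode on its own will still serialize the nested dict with Python's repr (single quotes, True/False), so the value for q likely needs to be serialized as JSON first. A minimal sketch, assuming the site expects the query object as a JSON string in the q parameter:

import json
from urllib.parse import urlencode

# Serialize the nested query object as JSON before URL-encoding it;
# otherwise Python's dict repr (single quotes, True/False) is sent.
query_string = urlencode({'q': json.dumps(params['q'])})
html = requests.get(f"{url}?{query_string}", headers=headers)
print(html.url)  # check what was actually requested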

Related

Python Requests Post login

Yes, I know I'm green. I'm trying to learn how to POST into websites, but I can't seem to pick the right fields to pass into the POST request.
Below you'll see the HTML for the site that I'm trying to grab everything from:
[screenshot of the site's HTML]
I've tried the following code to log into the website for Greetly, but I've been having a hell of a time. I'm sure the values I'm passing must have the wrong keys, but I can't figure out what I'm doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://app.greetly.com'
urlVisitorLog = 'https://app.greetly.com/locations/00001/check_in_records'
values = {
'user[email]':'email',
'user[password]':'password'
}
c = requests.Session()
results = c.get(url)
soup = BeautifulSoup(results.content, 'html.parser')
key = soup.find(name="authenticity_token")
authenticity_token = key['value']
values["authenticity_token"] = authenticity_token
c.post(urlVisitorLog, headers= values)
r = c.get(urlVisitorLog)
soup2 = BeautifulSoup(r.content, 'html.parser')
Also, once I get past the username and password, I noticed the authenticity token isn't bound to a specific id, but I need to be logged in before I can parse the page and see where it is.
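For reference, a likely fix for the snippet above (an untested sketch, reusing the same variable names): BeautifulSoup's find(name=...) matches the tag name rather than the name attribute, and the form values belong in data=, not headers=:

key = soup.find("input", attrs={"name": "authenticity_token"})  # find the hidden input by its name attribute
values["authenticity_token"] = key["value"]
c.post(urlVisitorLog, data=values)  # send the credentials as form data, not as headers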

API - Web Scrape

How do I get access to this API?
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, but I can't get it to work; I keep getting a 403 status code.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me in the browser, but I'm unable to grab them.
Later I'll use these categories to iterate over the products API.
[screenshot: category data in the API response]
Note: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests
headers = {
'sm-token': '{"IdLoja":2691,"IdRede":884}',
'User-Agent': 'Mozilla/5.0',
'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
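With those headers the request should come back with a 200 instead of a 403. The shape of the JSON payload isn't documented here, so it's worth inspecting it before relying on any particular keys, e.g.:

print(r.status_code)   # expect 200 rather than 403
menu = r.json()
# peek at the structure before picking out categories
print(type(menu), str(menu)[:200])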
Not sure exactly what your issue is here. But if you want to see the content of the response, and not just the 200/400 status, you need to add .content to your print.
E.g.
# Create a session
s = requests.Session()

# Example connection variables, probably not required for your use case.
setCookieUrl = 'https://www...'
headersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request
p = s.get(setCookieUrl, json=bodyJson, headers=headersJson)
print(p)            # prints the response status, e.g. <Response [200]>
# print(p.headers)
# print(p.content)  # prints the content of the response
# print(s.cookies)
I'm also new here haha, but besides this requests library, you'll also need another one like beautiful soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install and import it, you can continue from the request you already have to get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, so you can pull out your data with CSS selectors like this:
site_data = soup.select('selector')
site_data is a list of everything matching that selector, so a simple for loop appending items to a list would suffice (as an example, getting the links for each book on a bookstore site).
For example, if I was trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right-click and press 'Inspect'), you can see the page's code. Go to the HTML, find the data you want, right-click it and select Copy -> Copy selector. That makes it really easy to get the data you want on that site.
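For instance (the selector below is only a hypothetical placeholder; paste in whatever DevTools copied for you):

# hypothetical selector copied via Copy -> Copy selector in DevTools
copied_selector = "#main > div.menu > ul > li:nth-child(1) > a"
element = soup.select_one(copied_selector)
print(element.get_text(strip=True) if element else "not found")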
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Trigger data response from .aspx page

from bs4 import BeautifulSoup
from pprint import pprint
import requests
url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'
s = requests.Session()
pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')
viewstategenerator = soup.find("input", attrs = {'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs = {'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs = {'id': '__EVENTVALIDATION'})['value']
eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":'null',"scroll":{"horizontal":'true',"vertical":'true'}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":'true',"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":'true',"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":'true'},"gridOptions":{"fitToPageWidth":'true',"printHeadersOnEveryPage":'true'},"chartOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":'true'},"gaugeOptions":{"autoArrangeContent":'true'},"cardOptions":{"autoArrangeContent":'true'},"mapOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":'true',"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}
postdata = {'__EVENTTARGET': eventtarget,
'__EVENTARGUMENT': eventargument,
'__VIEWSTATE': viewstate,
'__VIEWSTATEGENERATOR': viewstategenerator,
'__EVENTVALIDATION': eventvalidation,
'DXScript': DXScript,
'DXCss': DXCss
}
datareq = s.post(url, data = postdata)
print(datareq.text)
I'm trying to scrape data from this .aspx webpage. The page loads the data dynamically via javascript so scraping directly with requests/BeautifulSoup won't work.
By looking at the network traffic I can see that when you click the export (Exportar a) button for an element, select a type of export (excel, csv) then confirm a POST request is made to the page. It returns a base64 encoded string of the data I need. As far as I can tell there is no way to make a GET request for the file directly as it is only generated when requested.
What I'm trying to do is copy the POST request which triggers the csv response. So first I scrape for __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. __EVENTTARGET, DXCss and DXScript look to be fixed. __EVENTARGUMENT is copied directly from the POST request.
My code returns a server application error. I'm thinking the problem is either a) wrong __EVENTARGUMENT (maybe part dynamic rather than fixed?), b) not really understanding how .aspx pages work or c) what I'm trying to do isn't possible with these tools.
I did look at using selenium to trigger the data export but I couldn't see a way to capture the server response.
I was able to get help from someone who knows more about .aspx pages than I do. Here is a link to the GitHub gist that provides the solution:
https://gist.github.com/jarek/d73c672d8dd4ddb48d80bffc4d8038ba
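For anyone hitting the same wall: once the POST succeeds, the body is reportedly a base64-encoded string of the export, so decoding it is straightforward (a minimal sketch, assuming the whole response body is the encoded CSV):

import base64

# assumes datareq.text is the base64-encoded CSV returned by the export POST
csv_bytes = base64.b64decode(datareq.text)
print(csv_bytes.decode("utf-8"))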

Scraping Excel files from a website using Python when the _doPostBack link URL is hidden

For the last few days I have been trying to scrape the following website (link pasted below), which has a few Excel files and PDFs available in a table. I am able to do it for the home page successfully. There are 59 pages in total from which these Excel/PDF files have to be scraped. On most websites I have seen so far there is a query parameter in the URL which changes as you move from one page to another. In this case there is a _doPostBack function, which is probably why the URL stays the same on every page you go to. I have looked at multiple solutions and posts suggesting to look at the parameters of the POST call and use them, but I am not able to make sense of those parameters (this is the first time I am scraping a website).
Can someone please suggest a resource that can help me write code to move from one page to another using Python? The details are as follows:
Website link - http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx
My current code, which extracts the CAP Excel sheet from the home page (this works perfectly and is provided just for reference):
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import re
import urllib

Base = "http://accord.fairfactories.org/ffcweb/Web"
html = urlopen("http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx")
bs = BeautifulSoup(html)
name = bs.findAll("td", {"class": "column_style_right column_style_left"})

i = 1
for link in bs.findAll("a", {"id": re.compile("CAP(?!\w)")}):
    if 'href' in link.attrs:
        name = str(i) + ".xlsx"
        a = link.attrs['href']
        b = a.strip("..")
        c = Base + b
        urlretrieve(c, name)
        i = i + 1
Please let me know if I have missed anything while providing the information, and please don't downvote me, or I won't be able to ask any further questions.
For .aspx sites, you need to look for hidden fields like __EVENTTARGET, __EVENTVALIDATION etc. and post those parameters with each request. The following gets all the pages using requests with bs4:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # python 3: from urllib.parse import urljoin

# All the keys need values set bar __EVENTTARGET, that stays the same.
data = {
    "__EVENTTARGET": "gvFlex",
    "__VIEWSTATE": "",
    "__VIEWSTATEGENERATOR": "",
    "__VIEWSTATEENCRYPTED": "",
    "__EVENTVALIDATION": ""}


def validate(soup, data):
    for k in data:
        # update post values in data.
        if k != "__EVENTTARGET":
            data[k] = soup.select_one("#{}".format(k))["value"]


def get_all_excel():
    base = "http://accord.fairfactories.org/ffcweb/Web"
    url = "http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx"
    with requests.Session() as s:
        # Add a user agent for each subsequent request.
        s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
        r = s.get(url)
        bs = BeautifulSoup(r.content, "lxml")
        # get links from initial page.
        for xcl in bs.select("a[id*=CAP]"):
            yield urljoin(base, xcl["href"])
        # need to re-validate the post data in our dict for each request.
        validate(bs, data)
        last = bs.select_one("a[href*=Page$Last]")
        i = 2
        # keep going until the last page button is not visible
        while last:
            # Increase the counter to set the target to the next page
            data["__EVENTARGUMENT"] = "Page${}".format(i)
            r = s.post(url, data=data)
            bs = BeautifulSoup(r.content, "lxml")
            for xcl in bs.select("a[id*=CAP]"):
                yield urljoin(base, xcl["href"])
            last = bs.select_one("a[href*=Page$Last]")
            # again re-validate for next request
            validate(bs, data)
            i += 1


for x in get_all_excel():
    print(x)
If we run it on the first three pages, you can see we get the data you want:
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9965
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9552
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10650
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10086
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10905
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10840
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9229
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11310
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9178
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9614
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9734
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10063
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10871
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9468
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9799
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9278
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12252
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9342
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9966
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11595
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9652
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10271
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10365
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10087
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9967
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11740
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12375
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11643
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10952
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12013
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9810
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10953
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10038
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9664
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12256
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9262
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9210
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9968
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9811
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11610
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9455
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11899
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9766
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10088
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10366
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9393
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9813
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11795
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9814
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12187
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10954
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9556
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11709
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9676
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10251
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10602
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10089
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9908
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10358
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9469
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11333
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9238
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9816
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9817
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10736
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10622
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9394
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9818
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10592
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9395
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11271
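If you want to actually save the files rather than just print the links (as your original snippet did with urlretrieve), a minimal sketch built on top of the generator above; note the files may not all be .xlsx, so you might prefer to take the extension from the response headers instead:

import requests

for i, link in enumerate(get_all_excel(), start=1):
    resp = requests.get(link)
    with open("{}.xlsx".format(i), "wb") as f:
        f.write(resp.content)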

Website form login using Python urllib2

I've been trying to learn to use the urllib2 package in Python. I tried to log in as a student (the left form) on a signup page for maths students: http://reg.maths.lth.se/. I inspected the code (using Firebug), and the left form should obviously be submitted via POST with a key called pnr whose value should be a string 10 characters long (the last part can perhaps not be seen from the HTML code, but it is basically my social security number, so I know how long it should be). Note that the action in the header for the relevant POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried the following (with a fake pnr in the example below; I used my real number in my own code).
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, what gets printed is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I can add any key/value pairs to the values dictionary without producing an error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you inspect what request is sent to the server when you enter the number and submit the form, you will notice that it is a POST request with pnr and _token parameters:
You are missing the _token parameter which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser, which would ease the form submission. You may also parse the HTML yourself with an HTML parser like BeautifulSoup, extract the token, and send it via urllib2 or requests:
import requests
from bs4 import BeautifulSoup

PNR = "00000000"
url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"

with requests.Session() as session:
    # extract token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]

    # submit form
    session.post(login_url, data={
        "_token": token,
        "pnr": PNR
    })

    # navigate to the main page again (should be logged in)
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title)
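To confirm the login actually worked, it can help to check the follow-up page for something that only appears when authenticated (the "logout" marker below is just a guess; use whatever the real logged-in page shows):

# crude check: look for a string that only a logged-in page would contain
if "logout" in response.text.lower():
    print("Looks like the login succeeded")
else:
    print("Still on the public page")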
