Python POST requests - how to extract the HTML of the request destination

I'm scraping mortgage data from the official mortgage registry. The problem is that I can't extract the HTML of a particular document. Everything happens via POST: I have all of the data required to build the POST request, but when I print request.url it still shows me the welcome screen page, when it should retrieve the HTML of the particular document. All the data, like the mortgage number and the current page, are listed in dev tools > Network > Form Data, so I bet it must be possible. I'm quite new to web programming in Python, so I will appreciate any help.
My code:
import requests
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}
r = requests.post('https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW', data=data)
print(r.url)
print(r.content)

You are getting the welcome screen because you aren't sending all the requests required to view the next page.
Go to Chrome > Network tab, and you will see that when you click the submit/search button, a bunch of other GET requests are sent to different URLs after that first POST request.
You need to replicate that in your script. Depending on the website, it can be tough to get the right response, so you could also consider using Selenium.
That said, it's not impossible to do this with requests. You need to send the POST request, and all the GET requests that follow, in the same session:
session = requests.Session()

data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}

# Form fields go in the POST body, so use data= rather than params=
session.post(URL, headers=headers, data=data)

# Then replay the follow-up GET requests in the same session
session.get(URL_1, headers=headers)
session.get(URL_2, headers=headers)
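Putting it together for this particular site, a minimal sketch (the follow-up URL below is hypothetical; copy the real one(s) from DevTools > Network after clicking search):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # mimic a normal browser
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}

with requests.Session() as session:
    # The POST stores the search server-side and sets the session cookies
    session.post(
        'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW',
        headers=headers, data=data)

    # Hypothetical follow-up request; replace with the GET(s) you see in DevTools
    r = session.get(
        'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/pokazWydruk',
        headers=headers)

print(r.url)
print(r.text[:500])  # first chunk of the document HTML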

Related

webscraper no longer retrieving data - can still access website via browser

I'm new to webscraping and have been trying for fun to scrape a boxing website.
My code below was working on the first attempt, but when I tried to re-run it, it no longer retrieved the link data.
I can still access the website from my browser, so not sure what the error is!
Appreciate any pointers.
import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

os.system('cls')

heavy = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
pages = set()

def get_links(page_url):
    print("running crawler...")
    global pages
    req = Request(heavy, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req)
    bs = BeautifulSoup(html.read(), 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/en/box-pro/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                new_page = link.attrs['href']
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links('')
print("crawling done.")
If you inspect html.read(), you will find that the page now returns a login form. It might be that a detection system has picked up your bot and is trying to prevent (or at least make it harder for) you to scrape.
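A quick way to confirm this (a minimal sketch; checking for the word 'login' is just a heuristic):
from urllib.request import urlopen, Request

url = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8', errors='replace')

# If the anti-bot system kicked in, this prints True: you got a login page
# instead of the ratings table
print('login' in html.lower())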
As an engineer at WebScrapingAPI, I've tested your URL using our API and it passes each time (it returns the data, not the login page). That's because we've implemented a number of detection evasion features, including an IP rotation system. By sending the request from another IP with a completely different browser fingerprint, the targeted website 'thinks' it's another person and passes on the information. If you want to test it yourself, here is a script you can use:
import requests
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
}
response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
If you want to build your own scraper, I suggest you implement some of the techniques in this article. You might also want to actually create an account on the targeted website, log in using the credentials, collect the cookies, and pass them to your request.
In order to collect the cookies:
Navigate to the login screen
Open developer tools in your browser (Network tab)
Log in and check the login request that appears
(Note that in my capture there is also a failed attempt, because I didn't use real credentials to log in)
To pass the cookies to your request, simply add them as a header to your req. Example: req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie': 'myCookie=lovely'}). Also, try to use the same User-Agent as the original request (the one made when you logged in); it can be found in the same login request from which you picked up the cookies.
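Putting that together for the crawler above (the cookie value is a placeholder; copy the real cookie string from the captured login request):
from urllib.request import urlopen, Request

url = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
headers = {
    'User-Agent': 'Mozilla/5.0',              # match the browser's User-Agent
    'Cookie': 'PHPSESSID=<YOUR_SESSION_ID>',  # placeholder: copy from DevTools
}
req = Request(url, headers=headers)
html = urlopen(req).read()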

How to Login and Scrape Websites with Python?

I understand there are similar questions out there; however, I couldn't make this code work. Does anyone know how to log in and scrape the data from this website?
from bs4 import BeautifulSoup
import requests
# Start the session
session = requests.Session()
# Create the payload
payload = {
    'login': <USERNAME>,
    'password': <PASSWORD>
}
# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)
# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost every site; the best way to learn how to do it is to use your browser's request inspector (Firefox) and look at how the site behaves when you try to log in.
For your website, when you click the login button, a POST request is sent to https://www.beeradvocate.com/community/login/login; with a little trial and error you should be able to replicate it.
Make sure you match the content-type and request headers (specifically cookies, in case you need auth tokens).
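As a minimal sketch of that approach (the field names login and password are taken from the question's payload; verify them against the captured request, since the form may expect different keys):
from bs4 import BeautifulSoup
import requests

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}

# Field names taken from the question's payload; confirm them in the
# captured login request before relying on them
payload = {
    'login': '<USERNAME>',
    'password': '<PASSWORD>',
}

# Endpoint observed in the request inspector when clicking the login button
session.post('https://www.beeradvocate.com/community/login/login',
             data=payload, headers=headers)

# The session now carries the auth cookies for subsequent requests
r = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find('div', class_='titleBar'))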

View headers and body of a POST request made by Python Script

In my application, I have my API that is in localhost:8000/api/v0.1/save_with_post.
I've also made a Python Script in order to do a Post Request on such Api.
### My script
import requests

url = 'http://localhost:8000/api/v0.1/save_with_post'  # requests needs the http:// scheme
myobj = {'key': 'value'}
x = requests.post(url, data=myobj)
Is it possible to view headers and body of the request in Chrome rather than debugging my application code?
You want Postman.
With Postman you can either generate a request to your service from Postman itself, or set up Postman as a proxy so you can see the requests that your API client is generating and the responses from the server.
If you want to view the response headers from the post request, have you tried:
>>> x.headers
Or you could add headers yourself to your POST request, like so:
h = {"Content-Type": "application/xml"}  # plus any other headers you need
x = requests.post(url, data=myobj, headers=h)
Well, I don't know if there is a way to view the request directly in Chrome DevTools (I don't think there is). However, there are two alternatives for seeing the request body and response:
1 - Use Selenium with a Chrome webdriver
This will allow you to run Chrome automated by Python. You can then open a test page and run JavaScript in it to do your POST request. See these for more info on how to do it:
1. https://selenium-python.readthedocs.io/getting-started.html
2. Getting the return value of Javascript code in Selenium
You will need the selenium-requests library to use the requests library with Selenium: https://pypi.org/project/selenium-requests/
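A minimal sketch of selenium-requests (assumes Chrome and a matching chromedriver are installed; the URL is the one from the question):
from seleniumrequests import Chrome

driver = Chrome()
# request() mirrors the requests API, but the call is issued through the
# real browser, so you can watch it in DevTools while it runs
response = driver.request('POST', 'http://localhost:8000/api/v0.1/save_with_post',
                          data={'key': 'value'})
print(response.status_code)
print(response.text)
driver.quit()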
2 - Use Wireshark
This program will show you all the traffic going over your network card, so you can monitor all the requests going back and forth. However, since Wireshark captures everything your network card sends or receives, it may be hard to pick out the specific request you want.
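To narrow the capture down, you can apply a display filter such as http.request.method == "POST", which shows only HTTP POST requests. Note that this won't decode HTTPS traffic unless you also configure TLS key logging.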

Enabling cookies in python HTTP POST request

So I am trying to write a script that submits a form containing two fields, a username and a password, in a POST request, but the site responds with:
"This system requires the use of HTTP cookies to verify authorization information. Our system has detected that your browser has disabled HTTP cookies, or does not support them."
EDIT: With the new, modified code below I believe I can successfully log in to the page. The only problem is that when I print the page's HTML to the terminal, it only shows an html element and a head element containing the URL of the page; I've inspected the actual HTML of the page when I log in through the browser, and a lot is missing. Does anyone know why this might be?
import requests

url = "https://someurl"
payload = {
    'username': 'myname',
    'password': '1234'
}
headers = {
    'User-Agent': 'Mozilla/5.0'
}

session = requests.Session()
page = session.post(url, data=payload, headers=headers)  # pass the headers along
Without the precise URL it is very hard to give you an answer.
Many web pages are built dynamically through JavaScript calls. The execution of the JavaScript creates the DOM that is rendered. If that's the case for the site you are looking at, you will only get the raw HTML response with Python, not the rendered DOM. You need something that actually executes the JS to get the final DOM, for example SlimerJS.
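SlimerJS is one option; as a sketch of the same idea with Selenium (used elsewhere on this page; the URL is the placeholder from the question):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://someurl')   # placeholder URL from the question
html = driver.page_source       # the DOM after JavaScript has run
driver.quit()
print(html)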

Python web scraping requests follow redirect

I'm trying to scrape a web site with the requests module.
Using Chrome and inspect elements, I go to the URL, fill in a form, and click the continue button. Chrome's inspect elements (Network > documents) shows what Chrome sent with the POST. It also shows multiple cookies. The site redirects to a URL containing, among other things, a session ID.
To simulate this, I try using requests. I take the form data from inspect elements and reformat it into a dictionary. I use requests.Session to include the cookies.
import requests

form_data = 'currentCalForm=dep&currentCodeForm=&tripType=oneWay&searchCategory=award&originAirport=JFK&flightParams.flightDateParams.travelMonth=5&flightParams.flightDateParams.travelDay=14&flightParams.flightDateParams.searchTime=040001&destinationAirport=LHR&returnDate.travelMonth=-1000&returnDate.travelDay=-1000&adultPassengerCount=2&adultPassengerCount=1&serviceclass=coach&searchTypeMode=matrix&awardDatesFlexible=true&originAlternateAirportDistance=0&destinationAlternateAirportDistance=0&discountCode=&flightSearch=award&dateChanged=false&fromSearchPage=true&advancedSearchOpened=false&numberOfFlightsToDisplay=10&searchCategory=&aairpassSearchType=false&moreOptionsIndicator=oneWay&seniorPassengerCount=0&youngAdultPassengerCount=0&childPassengerCount=0&infantPassengerCount=0&passengerCount=2'.split('&')

payload = {}
for item in form_data:
    key, value = item.split('=')
    if value:
        payload[key] = value

with requests.Session() as s:
    r = s.post('https://www.aa.com/homePage.do', params=payload, allow_redirects=True)
    print(r.headers)
    print(r.history)
    print(r.url)
    print(r.status_code)
    with open('x.htm', 'wb') as f:
        f.write(r.text.encode('utf8'))
requests, however, does not appear to follow the redirect: history is empty, and the URL appears to contain the data I sent rather than where the site redirected. x.htm shows a web page, but it does not contain the info I expected.
From http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history I expected r.url to contain the redirected url and r.history to contain an http response code.
What am I doing wrong?
OK, what you're doing seems to be wrong. I'm not sure how you decided to send a POST to https://www.aa.com/homePage.do, but that page takes a GET and doesn't accept the params you're sending. When you click search, your browser sends this POST: https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do;jsessionid=XXXXXXXXXXXXXXXXXXX with these parameters:
currentCalForm=dep
currentCodeFrom=
tripType=roundTrip
originAirport=LAX
flightParams.flightDateParams.travelMonth=10
flightParams.flightDateParams.travelDay=24
flightParams.flightDateParams.searchTime=040001
destinationAirport=JFK
returnDate.travelMonth=10
returnDate.travelDay=31
returnDate.searchTime=400001
adultPassengerCount=1
adultPassengerCount=1
childPassengerCount=0
hotelRoomCount=1
serviceclass=coach
searchTypeMode=matrix
awardDatesFlexible=true
originAlternateAirportDistance=0
destinationAlternateAirportDistance=0
discountCode=
flightSearch=revenue
dateChanged=false
fromSearchPage=true
advancedSearchOpened=false
numberOfFlightsToDisplay=10
searchCategory=
aairpassSearchType=false
moreOptionsIndicator=
seniorPassengerCount=0
youngAdultPassengerCount=0
infantPassengerCount=0
passengerCount=1
This will then give you HTML back. Pretty much, you have to send all the requests the browser sends; it might be easier for you to do this with Selenium.
I found this using HttpFox, which is probably similar to Chrome's Network tab.
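As a rough sketch of replaying that POST in a session (a sketch only: the parameter set is abridged from the list above, and the site may require cookies from an initial GET plus the jsessionid in the URL):
import requests

payload = {
    'currentCalForm': 'dep',
    'tripType': 'roundTrip',
    'originAirport': 'LAX',
    'destinationAirport': 'JFK',
    'flightParams.flightDateParams.travelMonth': '10',
    'flightParams.flightDateParams.travelDay': '24',
    'flightParams.flightDateParams.searchTime': '040001',
    # ...plus the remaining fields from the list above
}

with requests.Session() as s:
    # An initial GET picks up the session cookies (and the jsessionid)
    s.get('https://www.americanairlines.co.uk/')
    r = s.post('https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do',
               data=payload)
    print(r.url)
    print(r.status_code)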
