I am trying to get the download link from a web page. The page sends a request using JavaScript, and the server responds by sending a PDF directly, which the browser downloads automatically.
My first approach was to use Selenium to get the information:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# Path to chromedriver, then open the URL
path = "/Users/my_user/Desktop/chromedriver"
browser = webdriver.Chrome(path)
browser.get("https://www.holzwickede.de/amtsblatt/index.php")
# Accept the cookie banner
WebDriverWait(browser, 15).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='cc_btn_accept_all']"))).click()
# Element to click
elem = browser.find_element(By.XPATH, "//div[@id='content']/div[7]/table//form[@name='gazette_52430']/a[@href='#gazette_52430']")
elem.click()
print(browser.current_url)
The result was the current URL, which is just the same page with a fragment appended, because the click sends the request straight to the server:
https://www.holzwickede.de/amtsblatt/index.php#gazette_52430
After this unsuccessful attempt, I tried to capture the requests the browser made (the `requests` attribute used here is the one Selenium Wire adds to the driver).
# Access requests via the `requests` attribute
for request in browser.requests:  # Captures all requests in chronological order
    if request.response:
        print(
            request.path,
            request.response.status_code,
            request.response.headers,
            request.body,
            "\n"
        )
The output still does not contain the underlying link that the PDF comes from.
Does anyone have an idea what I can do?
Thanks in advance.
I found the answer. The page submits a form with a POST request, so we have to extract the request headers and the form parameters. Once you know the parameters the form sends, you can replay the request yourself and print the resulting link to the console.
response = requests.get(url, params={'key1': 'value1', 'key2': 'value2'})
print (response.url)
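A minimal sketch of replaying such a form submission, using only the standard library so the request can be inspected without touching the network. The URL is the page from the question, but the field names `gazette_id` and `action` are hypothetical placeholders; the real names have to be read from the browser's network tab.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Page from the question; the form is assumed to post back to this URL.
FORM_URL = "https://www.holzwickede.de/amtsblatt/index.php"


def build_form_post(url, fields):
    """Build (but do not send) a POST request carrying form fields in its body."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical field names and values, for illustration only.
request = build_form_post(FORM_URL, {"gazette_id": "52430", "action": "download"})

print(request.get_method())  # the HTTP verb that will be used
print(request.full_url)      # the URL stays free of query parameters
print(request.data)          # the form fields travel in the request body
```

Sending it with `urllib.request.urlopen(request)`, or the equivalent `requests.post(FORM_URL, data=fields)`, is the step that actually contacts the server. If the server answers the POST with the PDF itself rather than with a redirect, the file is in the response body and there may be no separate link to print.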
This answer also resolves a related question: Capture AJAX response with selenium python
Cheers!
Related
I'm working with Playwright. I would like to get the response body (HTML) from network events instead of waiting for the DOM to load the data in the browser and then parsing the elements. The current workflow looks something like this:
Playwright opens headless chromium
Opens first page with captcha (no data)
Solves captcha and redirects to the page with data
Sometimes a lot of data is returned and the page takes quite a while to load in the browser, even though all the data has already been received on the client side in network events. My question: is it possible to read the network events in Playwright instead of waiting for all the elements to load?
I found the Network Events documentation and was able to get the HTML, but the handler fires for every request instead of just the single one I want.
I'm using Playwright simply for navigation, form submitting, and to get website HTML.
Use a condition instead of the print call. For example, you could check whether the response body contains some key:
from playwright.sync_api import sync_playwright
def run(playwright):
    chromium = playwright.chromium
    browser = chromium.launch()
    page = browser.new_page()
    # Subscribe to "request" and "response" events.
    page.on("request", lambda request: print(">>", request.method, request.url))
    page.on("response", lambda response: print("<<", response.status, response.url))
    page.goto("https://example.com")
    browser.close()
with sync_playwright() as playwright:
    run(playwright)
For example:
page.on("response", lambda response: print(response.url) if key in response.text() else None)
Python has an equivalent of waitForResponse too (page.expect_response in the sync API), and you could use that.
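The filtering idea above can be sketched as a small handler factory that keeps only the responses a predicate accepts. This is an illustration under assumptions: `FakeResponse` is a stand-in so the snippet runs without a browser, and the `/data` URL fragment is a hypothetical marker for the request you care about.

```python
def make_response_collector(predicate):
    """Return (handler, matches); handler stores only responses the predicate accepts."""
    matches = []

    def handler(response):
        if predicate(response):
            matches.append(response)

    return handler, matches


def url_contains(fragment):
    """Predicate factory: accept responses whose URL contains the given fragment."""
    return lambda response: fragment in response.url


class FakeResponse:
    """Stand-in for a Playwright response, used only to demonstrate the handler."""

    def __init__(self, url, status=200):
        self.url = url
        self.status = status


handler, matches = make_response_collector(url_contains("/data"))
for url in (
    "https://example.com/",
    "https://example.com/data?page=1",
    "https://example.com/app.js",
):
    handler(FakeResponse(url))

print([response.url for response in matches])
```

With a real page you would register the handler via `page.on("response", handler)` before navigating and then read `matches[0].text()`. The same predicate can also be passed to `page.expect_response(...)`, which waits for the first matching response instead of collecting all of them.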
I'm trying to scrape data about my band's upcoming shows from our agent's web service (such as venue capacity, venue address, set length, set start time ...).
With Python 3.6 and Selenium I've successfully logged in to the site, scraped a bunch of data from the main page, and opened the deal sheet, which is a PDF-like ASPX page. From there I'm unable to scrape the deal sheet. I've successfully switched the Selenium driver to the deal sheet. But when I inspect that page, none of the content is there, just a list of JavaScript scripts.
I tried...
innerHTML = driver.execute_script("return document.body.innerHTML")
...but this yields the same list of scripts rather than the PDF content I can see in the browser.
I've tried the solution suggested here: Python scraping pdf from URL
But the HTML that solution returns is for the login page, not the deal sheet. My problem is different because the PDF is protected by a password.
You won't be able to read the PDF file using the Selenium Python API bindings. The solution would be:
Download the file from the web page using the requests library. Since you need to be logged in, I expect you will have to fetch the cookies from the browser session via driver.get_cookies() and add them to the request that downloads the PDF file.
Once you have downloaded the file, you will be able to read its content using, for instance, PyPDF2.
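A minimal sketch of the cookie hand-off described in the first step. `driver.get_cookies()` returns a list of dicts with at least `name` and `value` keys; the sample cookies below are made up so the snippet runs without a browser.

```python
def selenium_cookies_to_dict(cookies):
    """Convert the list of dicts returned by driver.get_cookies() to {name: value}."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookie_header(cookies):
    """Render the same cookies as the value of a Cookie request header."""
    pairs = selenium_cookies_to_dict(cookies)
    return "; ".join(f"{name}={value}" for name, value in pairs.items())


# Sample data in the shape Selenium returns; names and values are invented.
sample = [
    {"name": "ASP.NET_SessionId", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "auth", "value": "xyz", "domain": "example.com", "path": "/"},
]

cookie_dict = selenium_cookies_to_dict(sample)
print(cookie_dict)
print(cookie_header(sample))
```

The resulting dict can be passed straight to `requests.get(pdf_url, cookies=cookie_dict)`, where `pdf_url` is whatever address the PDF is served from.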
This 3-part solution works for me:
Part 1 (Get the URL for the password protected PDF)
# with selenium
driver.find_element_by_xpath('xpath To The PDF Link').click()
# wait for the new window to load
sleep(6)
# switch to the new window that just popped up
driver.switch_to.window(driver.window_handles[1])
# get the URL to the PDF
plugin = driver.find_element_by_css_selector("#plugin")
url = plugin.get_attribute("src")
The element with the url might be different on your page. Michael Kennedy also suggested #embed and #content.
Part 2 (Create a persistent session with python requests, as described here: How to "log in" to a website using Python's Requests module? . And download the PDF.)
import requests
# Fill in your details here to be posted to the login form.
# Your parameter names are probably different. You can find them by inspecting the login page.
payload = {
    'logOnCode': username,
    'passWord': password
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as session:
    session.post(logonURL, data=payload)
    # An authorized request.
    f = session.get(url)  # this is the protected url
    with open('c:/yourFilename.pdf', 'wb') as pdf_file:
        pdf_file.write(f.content)
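Since a failed login typically returns the login page's HTML instead of the document, it can help to check the downloaded bytes before saving them. The sketch below relies only on the fact that PDF files start with the marker `%PDF-`; the helper name `looks_like_pdf` and the sample byte strings are illustrative.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF marker, ignoring leading whitespace."""
    return content.lstrip()[:5] == b"%PDF-"


# Illustrative payloads: the start of a PDF file and a typical HTML login page.
pdf_bytes = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
html_bytes = b"<!DOCTYPE html><html><body>Please log in</body></html>"

print(looks_like_pdf(pdf_bytes))
print(looks_like_pdf(html_bytes))
```

In the download step you would call `looks_like_pdf(f.content)` and only write the file when it returns True; a False result suggests the session was not authenticated.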
Part 3 (Scrape the PDF with PyPDF2 as suggested by Dmitri T)
I'm using Python library requests for this, but I can't seem to be able to log in to this website.
The url is https://www.bet365affiliates.com/ui/pages/affiliates/, and I've been trying post requests to https://www.bet365affiliates.com/Members/CMSitePages/SiteLogin.aspx?lng=1 with the data of "ctl00$MasterHeaderPlaceHolder$ctl00$passwordTextbox", "ctl00$MasterHeaderPlaceHolder$ctl00$userNameTextbox", etc, but I never seem to be able to get logged in.
Could someone more experienced check the page's source code and tell me what I am missing here?
The solution could be the following. Note that you could do this without Selenium: first request the main affiliate page, then extract all the required values (which I gather here via XPath) from the response data. I just didn't have enough time to write it fully with requests.
To gather the information from the response data you could use an XML tree library. With the same XPath expressions, you could easily find all the required values.
import os
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
Password = 'YOURPASS'
Username = 'YOURUSERNAME'
browser = webdriver.Chrome(os.getcwd() + "/" + "Chromedriver.exe")
browser.get('https://www.bet365affiliates.com/ui/pages/affiliates/Affiliates.aspx')
# Read the values of the hidden form fields (the payload needs strings, not WebElements)
VIEWSTATE = browser.find_element(By.XPATH, '//*[@id="__VIEWSTATE"]').get_attribute('value')
SESSIONID = browser.find_element(By.XPATH, '//*[@id="CMSessionId"]').get_attribute('value')
PREVPAG = browser.find_element(By.XPATH, '//*[@id="__PREVIOUSPAGE"]').get_attribute('value')
EVENTVALIDATION = browser.find_element(By.XPATH, '//*[@id="__EVENTVALIDATION"]').get_attribute('value')
# Copy the browser cookies into a requests session
cookies = browser.get_cookies()
session = requests.session()
for cookie in cookies:
    print(cookie['name'])
    print(cookie['value'])
    session.cookies.set(cookie['name'], cookie['value'])
payload = {'ctl00_AjaxScriptManager_HiddenField': '',
           '__EVENTTARGET': 'ctl00$MasterHeaderPlaceHolder$ctl00$goButton',
           '__EVENTARGUMENT': '',
           '__VIEWSTATE': VIEWSTATE,
           '__PREVIOUSPAGE': PREVPAG,
           '__EVENTVALIDATION': EVENTVALIDATION,
           'txtPassword': Password,
           'txtUserName': Username,
           'CMSessionId': SESSIONID,
           'returnURL': '/ui/pages/affiliates/Affiliates.aspx',
           'ctl00$MasterHeaderPlaceHolder$ctl00$userNameTextbox': Username,
           'ctl00$MasterHeaderPlaceHolder$ctl00$passwordTextbox': Password,
           'ctl00$MasterHeaderPlaceHolder$ctl00$tempPasswordTextbox': 'Password'}
session.post('https://www.bet365affiliates.com/Members/CMSitePages/SiteLogin.aspx?lng=1', data=payload)
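The requests-only variant mentioned above needs the hidden form values taken from the HTML of the first response rather than from a browser. Here is a rough sketch of that extraction step. It swaps the XPath approach for the standard library's `html.parser`, and the sample markup and its values are invented for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


def hidden_fields(html):
    """Return {name: value} for all hidden inputs found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Invented sample markup in the shape of an ASP.NET login form.
SAMPLE = """
<form method="post" action="SiteLogin.aspx?lng=1">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAg" />
  <input type="text" name="txtUserName" />
</form>
"""

fields = hidden_fields(SAMPLE)
print(fields)
```

In a real run, `hidden_fields(session.get(page_url).text)` would supply the `__VIEWSTATE`-style entries, which you then merge with the user name and password before posting with the same session.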
Did you inspect the HTTP request the browser uses to log you in?
You should replicate it.
FB
I have the following script:
import requests
import cookielib
jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME':'myusername',
'PASSWORD':'mypassword'
}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But the output of print page.text shows that the site is trying to send me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I am not sure how to log in to a JavaScript page. Thanks for the help!
Firstly you're posting to the wrong page. If you view the HTML from your link you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
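To illustrate the point about the form's action: a relative `action` is resolved against the URL of the page that contains the form, which the standard library can do directly. Both values below are taken from the question and the form tag shown above.

```python
from urllib.parse import urljoin

# URL of the page that contains the login form (from the question).
LOGIN_PAGE = "http://www.whispernumber.com/signIn.jsp?source=calendar.jsp"
# Value of the form's action attribute (from the HTML shown above).
FORM_ACTION = "ValidatePassword.jsp"

# Resolve the relative action the same way a browser would.
post_url = urljoin(LOGIN_PAGE, FORM_ACTION)
print(post_url)
```

The credentials would then be posted to `post_url` through a single `requests.Session()`, so that any cookie set by the login response is sent along with the later request for the calendar page.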
Requests isn't a web browser; it is an HTTP client that simply fetches the raw text of the page and does not run any JavaScript. You will want to use something like Selenium or another headless browser to log in to a site programmatically.
I'm trying to scrape a web site with the requests module.
Using chrome and inspect elements, I go to the url, fill in a form and click the continue button. Chrome's inspect elements (network documents) shows what chrome sent with post. It also shows multiple cookies. The site redirects to a url with among other things a session ID.
To simulate this, I try using requests. I take the form data from inspect elements and reformat it to a dictionary. I use requests.session to include the cookies.
import requests
form_data = 'currentCalForm=dep&currentCodeForm=&tripType=oneWay&searchCategory=award&originAirport=JFK&flightParams.flightDateParams.travelMonth=5&flightParams.flightDateParams.travelDay=14&flightParams.flightDateParams.searchTime=040001&destinationAirport=LHR&returnDate.travelMonth=-1000&returnDate.travelDay=-1000&adultPassengerCount=2&adultPassengerCount=1&serviceclass=coach&searchTypeMode=matrix&awardDatesFlexible=true&originAlternateAirportDistance=0&destinationAlternateAirportDistance=0&discountCode=&flightSearch=award&dateChanged=false&fromSearchPage=true&advancedSearchOpened=false&numberOfFlightsToDisplay=10&searchCategory=&aairpassSearchType=false&moreOptionsIndicator=oneWay&seniorPassengerCount=0&youngAdultPassengerCount=0&childPassengerCount=0&infantPassengerCount=0&passengerCount=2'.split('&')
payload = {}
for item in form_data:
    key, value = item.split('=')
    if value:
        payload[key] = value
with requests.session() as s:
    r = s.post('https://www.aa.com/homePage.do', params = payload, allow_redirects=True)
    print r.headers
    print r.history
    print r.url
    print r.status_code
    with open('x.htm', 'wb') as f:
        f.write(r.text.encode('utf8'))
requests, however, does not appear to follow the redirect. history is empty and the url appears to be the data I sent rather than what the site returned. x.htm shows a web page, but does not contain the info I expected.
From http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history I expected r.url to contain the redirected url and r.history to contain an http response code.
What am I doing wrong?
What you are doing seems to be wrong. I am not sure why you decided to send a POST to https://www.aa.com/homePage.do, but that endpoint seems to expect a GET and doesn't take the params you send. When you click search, your browser sends this POST: https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do;jsessionid=XXXXXXXXXXXXXXXXXXX with these parameters:
currentCalForm=dep
currentCodeFrom=
tripType=roundTrip
originAirport=LAX
flightParams.flightDateParams.travelMonth=10
flightParams.flightDateParams.travelDay=24
flightParams.flightDateParams.searchTime=040001
destinationAirport=JFK
returnDate.travelMonth=10
returnDate.travelDay=31
returnDate.searchTime=400001
adultPassengerCount=1
adultPassengerCount=1
childPassengerCount=0
hotelRoomCount=1
serviceclass=coach
searchTypeMode=matrix
awardDatesFlexible=true
originAlternateAirportDistance=0
destinationAlternateAirportDistance=0
discountCode=
flightSearch=revenue
dateChanged=false
fromSearchPage=true
advancedSearchOpened=false
numberOfFlightsToDisplay=10
searchCategory=
aairpassSearchType=false
moreOptionsIndicator=
seniorPassengerCount=0
youngAdultPassengerCount=0
infantPassengerCount=0
passengerCount=1
This will then give you HTML back. You pretty much have to send every request the browser sends, so it might be easier for you to do it with Selenium.
I found this using HttpFox; it is probably similar to Chrome's network tab.
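One way to reproduce the browser's parameters faithfully is to parse the captured form string into a list of pairs rather than a dict, because a dict silently drops repeated keys such as `adultPassengerCount` and the question's loop also discards empty values. A small sketch with the standard library, using a shortened version of the string from the question:

```python
from urllib.parse import parse_qsl, urlencode

# Shortened form string; note the repeated key and the empty values.
raw = (
    "currentCalForm=dep&currentCodeForm=&tripType=oneWay"
    "&adultPassengerCount=2&adultPassengerCount=1&discountCode="
)

# A list of (key, value) tuples keeps order, duplicates and blank values.
pairs = parse_qsl(raw, keep_blank_values=True)
as_dict = dict(pairs)  # loses one of the two adultPassengerCount entries

print(len(pairs), len(as_dict))
print(as_dict["adultPassengerCount"])
print(urlencode(pairs) == raw)  # the pair list re-encodes to the original string
```

requests accepts such a list of tuples as well, so `session.post(url, data=pairs)` sends them in the request body. Passing them as `params=` instead appends them to the query string, which would explain why `r.url` in the question showed the data that was sent.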