Download file from dynamic url by selenium & phantomjs - python

I'm trying to write a web crawler that downloads a CSV file from a dynamic URL.
The URL looks like http://aaa/bbb.mcv/Download?path=xxxx.csv
When I put this URL into my Chrome browser, the download starts immediately and the page doesn't change.
I can't even find any request in the developer tools.
I've tried two ways to get the file:
Put the URL into Selenium:
driver.get(url)
Try to get the file with the requests library:
requests.get(url)
Neither of them worked...
Any advice?
Output of the two approaches:
With Selenium, I took a screenshot and the page doesn't seem to change (just like in Chrome).
With requests, I printed out the data I got and it looks like an HTML file.
When I open it in a browser, it is a login page.

import requests

url = '...'
save_location = '...'

session = requests.session()
response = session.get(url)
with open(save_location, 'wb') as t:
    for chunk in response.iter_content(1024):
        t.write(chunk)

Thanks for everyone's help!
I finally found the problem:
I log in to the website with Selenium, but I download the file with requests, and the requests session doesn't have any authentication information!
So my solution is to get the cookies from Selenium first, then pass them to requests.
Here is my code:
cookies = driver.get_cookies()  # Selenium WebDriver
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
response = s.get(url)
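For completeness, a minimal sketch of the whole flow (log in with Selenium, copy the cookies, then stream the CSV to disk with requests) might look like this; the download URL and save path are placeholders:

import requests
from selenium import webdriver

url = 'http://aaa/bbb.mcv/Download?path=xxxx.csv'  # placeholder download URL
save_location = 'download.csv'                     # placeholder save path

driver = webdriver.PhantomJS()
# ... perform the login steps with the driver here ...

# Copy the authenticated cookies from Selenium into a requests session
s = requests.Session()
for cookie in driver.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'])

# Stream the file to disk in chunks
response = s.get(url, stream=True)
with open(save_location, 'wb') as f:
    for chunk in response.iter_content(1024):
        f.write(chunk)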

Related

How to Login and Scrape Websites with Python?

I understand there are similar questions out there; however, I couldn't make this code work. Does anyone know how to log in and scrape the data from this website?
from bs4 import BeautifulSoup
import requests

# Start the session
session = requests.Session()

# Create the payload
payload = {'login': <USERNAME>,
           'password': <PASSWORD>
           }

# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)

# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost every site; the best way to figure it out is to use your browser's request inspector (Firefox, for example) and look at how the site behaves when you try to log in.
For your website, clicking the login button sends a POST request to https://www.beeradvocate.com/community/login/login; with a little trial and error you should be able to replicate it.
Make sure you match the content-type and request headers (specifically cookies, in case you need auth tokens).
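For illustration, a sketch of that approach might look like the following; the header values are assumptions and should be copied from the inspector, and the credentials are placeholders:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Visit the login page first so the session collects its cookies
session.get("https://www.beeradvocate.com/community/login")

payload = {'login': 'my_username',      # placeholder credentials
           'password': 'my_password'}

# These header values are assumptions; copy the real ones from the inspector
headers = {'Content-Type': 'application/x-www-form-urlencoded',
           'Referer': 'https://www.beeradvocate.com/community/login'}

# Post to the URL the browser actually uses, not the page URL itself
session.post("https://www.beeradvocate.com/community/login/login",
             data=payload, headers=headers)

# The session now carries the auth cookies for subsequent requests
page = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
print(BeautifulSoup(page.text, 'html.parser').find('div', class_='titleBar'))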

Save a PDF of a webpage that requires login

import requests
import pdfkit

# start a session
s = requests.Session()
data = {'username': 'name', 'password': 'pass'}

# POST request with cookies
s.post('https://www.facebook.com/login.php', data=data)
url = 'https://www.facebook.com'

# navigate to page with cookies set
options = {'cookie': s.cookies.items(), 'javascript-delay': 1000}
pdfkit.from_url(url, 'file.pdf', options=options)
I'm trying to automate the process of saving a login-protected webpage as a PDF by setting the cookies and navigating to the page using requests. Is there a better way to tackle this/something I'm doing wrong?
The portal sends the login and password under different field names and also sends hidden values which can change with every request. It posts to a different URL than login.php, and it can check headers to block bots/scripts.
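If you want to stay with requests, a rough sketch of that idea (fetch the form first, copy its hidden fields and real action URL, then post everything) might look like this; the login URL and field names are taken from the Selenium example below and are not verified:

from bs4 import BeautifulSoup
import requests

session = requests.Session()

# Fetch the login page and locate the form (URL is an assumption)
login_page = session.get('https://www.facebook.com/login.php')
form = BeautifulSoup(login_page.text, 'html.parser').find('form')

# Start the payload with every hidden field the form already contains
payload = {i['name']: i.get('value', '')
           for i in form.find_all('input', type='hidden') if i.get('name')}
payload['email'] = 'your_login'       # visible field names from the form
payload['pass'] = 'your_password'

# Post to the form's real action URL, not to login.php itself
action = form.get('action', '')
session.post(requests.compat.urljoin(login_page.url, action), data=payload)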
It can be easier with Selenium, which controls a real browser, so you can take a screenshot or grab the HTML to generate the PDF.
import selenium.webdriver
import pdfkit
#import time
driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get('https://www.facebook.com/login.php')
#time.sleep(1)
driver.find_element_by_id('email').send_keys('your_login')
driver.find_element_by_id('pass').send_keys('your_password')
driver.find_element_by_id('loginbutton').click()
#time.sleep(2)
driver.save_screenshot('output.png') # only visible part
#print(driver.page_source)
pdfkit.from_string(driver.page_source, 'file.pdf')
Maybe with the "PhantomJS" driver or the PIL/Pillow module you could get the full page as a screenshot.
See generate-full-page-screenshot-in-chrome
With wkhtmltopdf, you can do something like this from the command line:
wkhtmltopdf --cookie-jar cookies.txt https://example.com/loginform.html --post 'user_id' 'my_id' --post 'user_pass' 'my_pass' --post 'submit_btn' 'submit' throw_away.pdf
wkhtmltopdf --cookie-jar cookies.txt https://example.com/securepage.html keep_this_one.pdf

Render HTTP Response (HTML content) in Selenium WebDriver (browser)

I am using the Requests module to send GET and POST requests to websites and then processing their responses. If Response.text meets a certain criterion, I want it to be opened in a browser. To do this, I am currently using the selenium package and re-sending the request to the webpage via the Selenium webdriver. However, this feels inefficient, since I have already obtained the response once. Is there a way to render the obtained Response object directly in the browser opened via Selenium?
EDIT
A hacky way I could think of is to write response.text to a temporary file and open that in the browser. Please let me know if there is a better way to do it than this.
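For reference, a minimal sketch of that temporary-file workaround (assuming Chrome as the driver) could be:

import tempfile

import requests
from selenium import webdriver

response = requests.get("http://stackoverflow.com/")

# Write the already-fetched HTML to a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(response.text)
    path = f.name

# Point the browser at the local file instead of re-requesting the page
driver = webdriver.Chrome()
driver.get('file://' + path)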
To directly render some HTML with Selenium, you can use the data scheme with the get method:
from selenium import webdriver
import requests

content = requests.get("http://stackoverflow.com/").text  # use .text, not .content, to get a str
driver = webdriver.Chrome()
driver.get("data:text/html;charset=utf-8," + content)
Or you could write the page with a piece of script:
from selenium import webdriver
import requests

content = requests.get("http://stackoverflow.com/").text  # .text so a str is passed to the script
driver = webdriver.Chrome()
driver.execute_script("""
    document.location = 'about:blank';
    document.open();
    document.write(arguments[0]);
    document.close();
""", content)

How to get content-type from selenium page_source

I know the content-type can be obtained with:
response = urllib2.urlopen(url)
content_type = response.info().getheader('Content-type')
Now I need to execute JS code, so I chose Selenium with PhantomJS to fetch the web page.
driver = webdriver.PhantomJS()
driver.get(url)
source = driver.page_source
How can I get the content-type without downloading the web page twice? I know I could save response.read() as an HTML file and have the driver render the local file without downloading it again, but that's too slow. Any suggestions?
Selenium does not expose the response headers, but you can just send a HEAD request with requests:
import requests
print(requests.head(url).headers["Content-Type"])
You can also use httplib2, urllib2, etc.; there are numerous answers here showing how to request the headers with various libraries.
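For example, a rough standard-library equivalent (Python 3's urllib.request, assuming url is already defined) might be:

from urllib.request import Request, urlopen

# Send a HEAD request so only the headers are transferred, not the body
req = Request(url, method='HEAD')
with urlopen(req) as response:
    print(response.headers.get('Content-Type'))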

Login to jsp website using Requests

I have the following script:
import requests
import cookielib

jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME': 'myusername',
           'PASSWORD': 'mypassword'
           }
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But print page.text shows that the site is trying to redirect me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I'm not sure how to log in to a JSP page. Thanks for the help!
Firstly, you're posting to the wrong page. If you view the HTML from your link you'll see the form is declared as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
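A minimal sketch of that flow with a requests.Session (so the cookie persists between requests) might look like this; the ValidatePassword.jsp URL is inferred from the form action above and may need adjusting:

import requests

session = requests.Session()

# Visit the sign-in page first so the session picks up any initial cookies
session.get('http://www.whispernumber.com/signIn.jsp?source=calendar.jsp')

# Post the credentials to the form's real action URL
acc_pwd = {'USERNAME': 'myusername',
           'PASSWORD': 'mypassword'}
session.post('http://www.whispernumber.com/ValidatePassword.jsp', data=acc_pwd)

# The session cookie should now authenticate subsequent page requests
page = session.get('http://www.whispernumber.com/calendar.jsp?day=20150129')
print(page.text)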
Requests isn't a web browser; it is an HTTP client, and it simply grabs the raw text from the page. You are going to want to use something like Selenium or another headless browser to programmatically log in to a site.
