Download images while I browse a site - python

Goal: Collect all the images from a site as I browse.
I've tried:
requests and wget don't work even with cookies set and all headers changed to mimic Firefox.
Firefox cache has the images, but they all have a random string as the name. I need logical names to sort them.
selenium-wire is very close to working. When I do driver.get(), driver.requests gives me all the requests as expected, which can then be saved. The problem is that when I click buttons on the site, the new requests are not added to driver.requests. I tried:
import time
from urllib.parse import urlparse
from seleniumwire import webdriver

driver = webdriver.Firefox()
driver.get("url")
while True:
    time.sleep(1)
    # browse site
    for request in driver.requests:
        if request.response:
            if "image/jpeg" in request.response.headers['Content-Type']:
                # name the file after the last segment of the URL path
                filename = urlparse(request.url).path.split('/')[-1] or 'image.jpg'
                with open(filename, 'wb') as f:
                    f.write(request.response.body)
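One possible workaround, sketched under the assumption that a recent selenium-wire (4.x) is available: its documented response_interceptor hook fires for every captured response, including ones triggered by clicking around the site, so each JPEG can be saved the moment it arrives instead of being polled from driver.requests. The downloads directory and the filename scheme are illustrative choices, not part of the original question.
import os
from urllib.parse import urlparse
from seleniumwire import webdriver

os.makedirs('downloads', exist_ok=True)

def save_jpegs(request, response):
    # called by selenium-wire for every response it captures
    if 'image/jpeg' in (response.headers.get('Content-Type') or ''):
        # derive a "logical" name from the URL path
        name = os.path.basename(urlparse(request.url).path) or 'image.jpg'
        with open(os.path.join('downloads', name), 'wb') as f:
            f.write(response.body)

driver = webdriver.Firefox()
driver.response_interceptor = save_jpegs
driver.get("url")
# browse manually; captured JPEGs are written to ./downloads as they arrive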

Related

Selenium webdriver does not open the correct url, rather it opens a blank page

I am using selenium webdriver to try to scrape information from realestate.com.au; here is my code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup

path = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'
url = 'https://www.realestate.com.au/buy'
url2 = 'https://www.realestate.com.au/property-house-nsw-castle+hill-134181706'

webdriver = Chrome(path)
webdriver.get(url)
soup = BeautifulSoup(webdriver.page_source, 'html.parser')
print(soup)
It works fine with url, but when I try to do the same with url2, it opens a blank page, and when I checked the console I got the following:
"Failed to load resource: the server responded with a status of 429 ()
about:blank:1 Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME
149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint:1 Failed to load resource: the server responded with a status of 404 ()"
While on url, I also tried searching for something, which likewise leads to a blank page, just like url2.
It looks like the www.realestate.com.au website is using an Akamai security tool.
A quick DNS lookup shows that www.realestate.com.au resolves to dualstack.realestate.com.au.edgekey.net.
They are most likely using the Bot Manager product (https://www.akamai.com/us/en/products/security/bot-manager.jsp). I have encountered this on another website recently.
Typically rotating user agents and IP addresses (ideally using residential
proxies) should do the trick. You want to load up the site with a "fresh" browser profile each time. You should also check out https://github.com/67-6f-64/akamai-sensor-data-bypass
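As a rough, untested sketch of the "fresh profile plus rotated user agent" idea (the user-agent string and proxy address below are placeholders, not recommendations):
from selenium import webdriver

# chromedriver launches a brand-new, throwaway profile by default;
# override the user agent explicitly (placeholder string below)
options = webdriver.ChromeOptions()
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...')
# optionally route through a residential proxy (placeholder address)
# options.add_argument('--proxy-server=http://proxy.example.com:8080')

driver = webdriver.Chrome(options=options)
driver.get('https://www.realestate.com.au/buy')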
I think you should try adding driver.implicitly_wait(10) before your get line, as this will add an implicit wait in case the page loads too slowly for the driver to pull the site. You should also consider trying the Firefox webdriver, since this issue appears to only affect Chromium-based browsers.
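For reference, a minimal sketch of that suggestion, switching to the Firefox webdriver and setting the implicit wait before the get call:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear
driver.get('https://www.realestate.com.au/property-house-nsw-castle+hill-134181706')
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup)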

Downloading/Exporting sites Search Results by using Export button in Python

So I'm trying to use Python to scrape data from the following website (with a sample query): https://par.nsf.gov/search/fulltext:NASA%20NOAA%20coral
However, instead of scraping the search results, I realized it would be easier to click the Save Results as "CSV" link programmatically and work with that CSV data, since it would free me from having to page through all the search results.
I inspected the CSV link element and found it calls an exportSearch('csv') function.
By typing the function's name into the console I found that the CSV link just sets window.location.href to: https://par.nsf.gov/export/format:csv/fulltext:NASA%20NOAA%20coral
If I follow that link in the same browser, a save prompt opens with a CSV to save.
My issue starts when I try to replicate this process in Python. If I call the export link directly using the Requests library, the response is empty.
url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
response = requests.get(url)
print("Response: ", len(response.content))
Can someone show me what I'm missing? I don't know how to first establish search results on the website's server that I can then access for export using Python.
I believe the link to download the CSV is here:
https://par.nsf.gov/export/format:csv//term:your_search_term
where your_search_term is URL-encoded.
In your case, the link is: https://par.nsf.gov/export/format:csv//filter-results:F/term:NASA%20NOAA%20coral
You can use the snippet below to download the file in Python using the urllib library.
import urllib.parse
import urllib.request

# Include your search term
url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
urllib.request.urlretrieve(url)
# You can also specify the path where you want the file saved, e.g. (illustrative filename)
urllib.request.urlretrieve(url, '/Users/Downloads/results.csv')
# Run the command below only if you face an SSL certificate error; it generally occurs for Mac users with Python 3.6.
# Run this in a Jupyter notebook:
!open /Applications/Python\ 3.6/Install\ Certificates.command
# On a terminal, just run:
open /Applications/Python\ 3.6/Install\ Certificates.command
Similarly, you can also use wget to fetch the file:
import urllib.parse
import wget

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
wget.download(url)
Turns out I was missing some cookies that didn't come up when you do a simple requests GET (e.g. WT_FPC).
To get around this I used Selenium's webdriver to do an initial GET request, and used the cookies from that request in the POST request that downloads the CSV data.
import urllib.parse
import requests
from selenium import webdriver

chrome_path = "path to chrome driver"

with requests.Session() as session:
    url = "https://par.nsf.gov/search/fulltext:" + urllib.parse.quote("NASA NOAA coral")

    # GET fetches the website plus the needed cookies
    browser = webdriver.Chrome(executable_path=chrome_path)
    browser.get(url)

    ## Session is set with the webdriver's cookies
    request_cookies_browser = browser.get_cookies()
    [session.cookies.set(c['name'], c['value']) for c in request_cookies_browser]

    url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
    response = session.post(url)

    ## No longer empty
    print(response.content.decode('utf-8'))

Save a PDF of a webpage that requires login

import requests
import pdfkit
# start a session
s = requests.Session()
data = {'username': 'name', 'password': 'pass'}
# POST request with cookies
s.post('https://www.facebook.com/login.php', data=data)
url = 'https://www.facebook.com'
# navigate to page with cookies set
options = {'cookie': s.cookies.items(), 'javascript-delay': 1000}
pdfkit.from_url(url, 'file.pdf', options=options)
I'm trying to automate the process of saving a login-protected webpage as a PDF by setting the cookies and navigating to the page using requests. Is there a better way to tackle this/something I'm doing wrong?
The portal sends the login and password under different field names, and it also sends hidden values that can change with every request. It posts to a different URL than login.php, and it may check headers to block bots/scripts.
It can be easier with Selenium, which controls a real browser: you can take a screenshot or grab the HTML to generate the PDF.
import selenium.webdriver
import pdfkit
#import time
driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get('https://www.facebook.com/login.php')
#time.sleep(1)
driver.find_element_by_id('email').send_keys('your_login')
driver.find_element_by_id('pass').send_keys('your_password')
driver.find_element_by_id('loginbutton').click()
#time.sleep(2)
driver.save_screenshot('output.png') # only visible part
#print(driver.page_source)
pdfkit.from_string(driver.page_source, 'file.pdf')
Using the "PhantomJS" driver or the PIL/Pillow module you might be able to capture the full page as a screenshot.
See generate-full-page-screenshot-in-chrome
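As a hedged aside (not part of the original answer): newer Selenium releases (4.x) expose a full-page screenshot call for Firefox/geckodriver, which avoids stitching viewport screenshots together:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.example.com')
# Firefox-only API in Selenium 4.x: captures the whole page, not just the visible part
driver.save_full_page_screenshot('full_page.png')
driver.quit()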
With wkhtmltopdf, you can do something like this from command line:
wkhtmltopdf --cookie-jar cookies.txt https://example.com/loginform.html --post 'user_id' 'my_id' --post 'user_pass' 'my_pass' --post 'submit_btn' 'submit' throw_away.pdf
wkhtmltopdf --cookie-jar cookies.txt https://example.com/securepage.html keep_this_one.pdf

Download file from dynamic url by selenium & phantomjs

I'm trying to write a web crawler that downloads a CSV file from a dynamic URL.
The URL is like http://aaa/bbb.mcv/Download?path=xxxx.csv
When I put this URL into my Chrome browser, the download just starts immediately and the page doesn't change.
I can't even find the request in the developer tools.
I've tried two ways to get the file:
put the URL in selenium
driver.get(url)
try to get the file with the requests lib
requests.get(url)
Neither worked...
Any advice?
Output of the two ways:
I took a screenshot and it seems the page doesn't change (just like in Chrome).
I printed out the data I get and it looks like an HTML file.
Opening it in the browser, it is a login page.
import requests

url = '...'
save_location = '...'

session = requests.session()
response = session.get(url)
with open(save_location, 'wb') as t:
    for chunk in response.iter_content(1024):
        t.write(chunk)
Thanks for everyone's help!
I finally found the problem...
I log in to the website with Selenium, but I use requests to download the file.
So the requests session doesn't have any authentication information!
My solution is to get the cookies from Selenium first,
then pass them to requests!
Here is my code:
import requests

cookies = driver.get_cookies()  # selenium webdriver
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
response = s.get(url)

How can I download a file on a click event using selenium?

I am working with Python and Selenium. I want to download a file from a click event using Selenium. I wrote the following code.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.close()
I want to download both files from the links named "Export Data" on the given URL. How can I achieve this, since it works with a click event only?
Find the link using find_element(s)_by_*, then call the click() method.
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Added profile manipulation code to prevent download dialog.
I'll admit this solution is a little more "hacky" than the Firefox Profile saveToDisk alternative, but it works across both Chrome and Firefox, and doesn't rely on a browser-specific feature which could change at any time. And if nothing else, maybe this will give someone a little different perspective on how to solve future challenges.
Prerequisites: Ensure you have selenium and pyvirtualdisplay installed...
Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay
The Magic
import pyvirtualdisplay
import selenium
import selenium.webdriver
import time
import base64
import json
root_url = 'https://www.google.com'
download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
print('Opening virtual display')
display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
display.start()
print('\tDone')
print('Opening web browser')
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
print('\tDone')
print('Retrieving initial web page')
driver.get(root_url)
print('\tDone')
print('Injecting retrieval code into web page')
driver.execute_script("""
window.file_contents = null;
var xhr = new XMLHttpRequest();
xhr.responseType = 'blob';
xhr.onload = function() {
var reader = new FileReader();
reader.onloadend = function() {
window.file_contents = reader.result;
};
reader.readAsDataURL(xhr.response);
};
xhr.open('GET', %(download_url)s);
xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
'download_url': json.dumps(download_url),
})
print('Looping until file is retrieved')
downloaded_file = None
while downloaded_file is None:
    # Returns the retrieved file base64 encoded (perfect for downloading binary)
    downloaded_file = driver.execute_script('return (window.file_contents !== null ? window.file_contents.split(\',\')[1] : null);')
    print(downloaded_file)
    if not downloaded_file:
        print('\tNot downloaded, waiting...')
        time.sleep(0.5)
print('\tDone')
print('Writing file to disk')
fp = open('google-logo.png', 'wb')
fp.write(base64.b64decode(downloaded_file))
fp.close()
print('\tDone')
driver.close() # close web browser, or it'll persist after python exits.
display.popen.kill() # close virtual display, or it'll persist after python exits.
Explanation
We first load a URL on the domain we're targeting a file download from. This allows us to perform an AJAX request on that domain without running into cross-site scripting issues.
Next, we inject some JavaScript into the DOM which fires off an AJAX request. Once the AJAX request returns a response, we take the response and load it into a FileReader object. From there we can extract the base64 encoded content of the file by calling readAsDataURL(). We then take the base64 encoded content and attach it to window, a globally accessible variable.
Finally, because the AJAX request is asynchronous, we enter a Python while loop waiting for the content to be appended to the window. Once it's appended, we decode the base64 content retrieved from the window and save it to a file.
This solution should work across all modern browsers supported by Selenium, and works for both text and binary content, across all MIME types.
Alternate Approach
While I haven't tested this, Selenium does afford you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM and use the binding of that element as the trigger to retrieve the downloaded file.
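A rough, untested sketch of that alternate approach, using a hypothetical marker element with id download-done: the injected JavaScript appends the marker once the XHR finishes, and Selenium's explicit wait replaces the Python polling loop.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# In the injected JS, after reader.onloadend fires, one could add:
#   var marker = document.createElement('div');
#   marker.id = 'download-done';   // hypothetical marker id
#   document.body.appendChild(marker);

# Then, on the Python side, wait for that marker instead of sleeping in a loop:
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'download-done'))
)
downloaded_file = driver.execute_script('return window.file_contents.split(",")[1];')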
In Chrome, what I do is download the files by clicking on the links, then open the chrome://downloads page and retrieve the downloaded files list from the shadow DOM like this:
docs = document
.querySelector('downloads-manager')
.shadowRoot.querySelector('#downloads-list')
.getElementsByTagName('downloads-item')
This solution is restricted to Chrome; the data also contains information like the file path and download date. (Note: this snippet is JavaScript, not Python.)
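For completeness, a hedged sketch of driving that same JavaScript from Python via execute_script; the shadow-DOM selectors (including #file-link) follow commonly posted snippets and may change between Chrome versions:
driver.get('chrome://downloads')
downloaded_files = driver.execute_script("""
    return Array.from(
        document.querySelector('downloads-manager')
            .shadowRoot.querySelector('#downloads-list')
            .getElementsByTagName('downloads-item')
    ).map(item => item.shadowRoot.querySelector('#file-link').text);
""")
print(downloaded_files)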
Here is the full working code. You can use it to fill in the username, password, and other fields. To get the names of the fields appearing on the webpage, use inspect element. Elements (username, password, or the button to click) can be located by class or by name.
from selenium import webdriver

# Using Chrome to access the web
options = webdriver.ChromeOptions()
# Set the download path via Chrome preferences
options.add_experimental_option('prefs', {'download.default_directory': 'C:/Test'})
driver = webdriver.Chrome(options=options)

# Open the website
try:
    driver.get('xxxx')  # Your website address
    password_box = driver.find_element_by_name('password')
    password_box.send_keys('xxxx')  # Password
    download_button = driver.find_element_by_class_name('link_w_pass')
    download_button.click()
    driver.quit()
except:
    driver.quit()
    print("Faulty URL")
