Automate print/save web page as PDF in Chrome - Python 3.6

I am trying to write a script that automatically re-saves read-only PDFs through Chrome's print functionality, producing another PDF in the same folder. This removes the 'read-only' restriction. However, I am not sure where to specify my own destination folder; when I run the script, it saves the file directly to the Downloads folder.
Full props to https://stackoverflow.com/users/1432614/ross-smith-ii for the code below.
Any help will be very much appreciated.
import json
from selenium import webdriver
downloadPath = r'mypath\downloadPdf\\'
appState = {
"recentDestinations": [
{
"id": "Save as PDF",
"origin": "local"
}
],
"selectedDestinationId": "Save as PDF",
"version": 2
}
profile = {'printing.print_preview_sticky_settings.appState':json.dumps(appState)}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
chrome_options.add_argument('--kiosk-printing')
driver = webdriver.Chrome(options=chrome_options)  # 'chrome_options=' is the deprecated spelling of this keyword
pdfPath = r'mypath\protected.pdf'
driver.get(pdfPath)
driver.execute_script('window.print();')

OK, I think I figured out the solution. Just replace the profile line with the following:
profile = {'printing.print_preview_sticky_settings.appState':json.dumps(appState),'savefile.default_directory':downloadPath}
It's still not ideal, as you cannot specify the new file name, but it works for now.
If anyone has a better solution, please do post it here. Thanks
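One possible workaround for the file name, as a minimal sketch only: let Chrome save the PDF under whatever name it picks, then rename it afterwards. The helper names snapshot_pdfs and rename_new_pdf are hypothetical (not part of Selenium), and the sketch assumes Chrome writes a file ending in .pdf into the folder given by 'savefile.default_directory' and has finished writing it by the time it is renamed.

```python
import time
from pathlib import Path


def snapshot_pdfs(download_dir):
    """Return the set of PDF paths currently in download_dir."""
    return set(Path(download_dir).glob("*.pdf"))


def rename_new_pdf(download_dir, before, new_name, timeout=30.0, poll=0.5):
    """Wait for a PDF that was not in `before`, then rename it to new_name.

    Hypothetical helper: `before` is a snapshot taken before printing.
    """
    deadline = time.monotonic() + timeout
    while True:
        new_files = snapshot_pdfs(download_dir) - before
        if new_files:
            # If several new PDFs appeared, take the most recently modified one.
            newest = max(new_files, key=lambda p: p.stat().st_mtime)
            target = newest.with_name(new_name)
            newest.replace(target)  # overwrites an existing file of that name
            return target
        if time.monotonic() >= deadline:
            raise TimeoutError("no new PDF appeared in %s" % download_dir)
        time.sleep(poll)


# Intended use with the script above (assumption, not tested against Chrome):
#   before = snapshot_pdfs(downloadPath)
#   driver.execute_script('window.print();')
#   rename_new_pdf(downloadPath, before, 'unlocked.pdf')
```

Taking the snapshot before calling window.print() avoids picking up a PDF that was already in the folder.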

Related

How to set a custom directory to output selenium saved pdf files

I am using this answer to download and save webpages as a pdf.
Although the method works, it is saving the output files in a different directory.
What could I add to the code to specify a custom output directory?
In addition to that, is there a way to change the output file name?
This is the code so far:
import json
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
settings = {"recentDestinations": [{"id": "Save as PDF", "origin": "local", "account": ""}], "selectedDestinationId": "Save as PDF", "version": 2}
prefs = {'printing.print_preview_sticky_settings.appState': json.dumps(settings)}
chrome_options.add_experimental_option('prefs', prefs)
chrome_options.add_argument('--kiosk-printing')
You can provide default directory in your preferences by adding 'savefile.default_directory' with a path to your folder, like this:
prefs = {
'printing.print_preview_sticky_settings.appState': json.dumps(settings),
'savefile.default_directory': '/path_to_folder/'
}
(Also, the answer here has an example of how the path should look for Windows)
As for saving the file with a custom name, I couldn't find an easy solution myself, but I used this example from SO when I was building something similar. I hope this helps. Good luck!
This is how you can specify your download directory. This is a Java solution.
String downloadPath = System.getProperty("user.dir") + "\\tempDownloadedFiles";
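Since the question is in Python, here is a rough Python equivalent of the Java line, as a sketch only. Java's user.dir is the current working directory, so os.getcwd() plays the same role. The function name build_pdf_prefs is hypothetical; the preference keys are the ones already used in the question and the first answer.

```python
import json
import os


def build_pdf_prefs(folder_name="tempDownloadedFiles"):
    """Build Chrome prefs that save printed PDFs under the working directory.

    Hypothetical helper; folder_name mirrors the Java example.
    """
    save_dir = os.path.join(os.getcwd(), folder_name)
    settings = {
        "recentDestinations": [
            {"id": "Save as PDF", "origin": "local", "account": ""}
        ],
        "selectedDestinationId": "Save as PDF",
        "version": 2,
    }
    return {
        "printing.print_preview_sticky_settings.appState": json.dumps(settings),
        "savefile.default_directory": save_dir,
    }


# Intended use with the code from the question:
#   chrome_options.add_experimental_option('prefs', build_pdf_prefs())
```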

Printing with Selenium (Chrome): Select Pages

I've been using a recipe that I've cobbled together from online snippets in order to get Selenium to print the currently-visited webpage:
appState = {
"recentDestinations": [
{
"id": "Save as PDF",
"origin": "local",
"account": ""
}
],
"selectedDestinationId": "Save as PDF",
"version": 2,
"isCssBackgroundEnabled": True
}
downloadPath=r'C:\temp'
profile = {'printing.print_preview_sticky_settings.appState':json.dumps(appState),
'savefile.default_directory':downloadPath}
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', profile)
options.add_argument('--kiosk-printing')
options.add_argument('--enable-print-browser')
What I am struggling with now is figuring out a way to print only selected pages (e.g., only the first page). I know I can brute-force this with a PDF editor (see "How to delete pages from pdf file using Python?"), but I was curious whether anyone knew of a more elegant approach here.
Thank you very much.
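One avenue that might be worth trying, sketched under assumptions: instead of going through the print dialog, call the DevTools command Page.printToPDF, which accepts a pageRanges parameter (for example '1' or '1-3, 5') and a printBackground flag. Selenium's Chrome driver exposes DevTools commands through execute_cdp_cmd. The helper name print_pages_to_pdf is hypothetical, and this assumes a Chrome and Selenium combination where that command is available (in older Chrome versions it reportedly worked only in headless mode).

```python
import base64
from pathlib import Path


def print_pages_to_pdf(driver, out_path, page_ranges="1"):
    """Save the current page as a PDF containing only page_ranges.

    Hypothetical helper. `driver` is assumed to be a Chrome WebDriver
    that provides execute_cdp_cmd.
    """
    params = {
        "printBackground": True,     # similar intent to isCssBackgroundEnabled
        "pageRanges": page_ranges,   # e.g. "1" or "1-3, 5"
    }
    result = driver.execute_cdp_cmd("Page.printToPDF", params)
    out = Path(out_path)
    # The command returns the PDF as a base64 string under the "data" key.
    out.write_bytes(base64.b64decode(result["data"]))
    return out


# Intended use (assumption, depends on the Chrome version):
#   driver.get('https://www.python.org/')
#   print_pages_to_pdf(driver, r'C:\temp\first_page.pdf', page_ranges='1')
```

A side effect of this route is that the output path and file name are chosen by the script rather than by Chrome.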

I cannot get Chrome to default to saving as a PDF when using Selenium

I'm trying to save some web pages to PDF using Python, Selenium, and Chrome, and I can't get the printer to default to Chrome's built-in "save as PDF" option.
I have found examples of how to do this in various places online, including in questions people have asked on Stack Overflow, but the way they all implement it doesn't work for me. I'm not sure whether something has changed in more recent versions of Chrome or whether I'm doing something wrong (for example, here is a page that has these settings: Missing elements when using selenium chrome driver to automatically 'Save as PDF').
I only included the default download location change in this code to verify it's accepting any changes at all - if you download one of the Python installs from that page, it will download to the new location and not to the standard download folder, so Chrome seems to be accepting these changes.
The problem appears to be the option "selectedDestinationId", which doesn't seem to do anything.
from selenium import webdriver
import time
import json
chrome_options = webdriver.ChromeOptions()
app_state = {
'recentDestinations': [{
'id': 'Save as PDF',
'origin': 'local'
}],
'selectedDestinationId': 'Save as PDF',
'version': 2
}
prefs = {
'printing.print_preview_sticky_settings.appState': json.dumps(app_state),
'download.default_directory': 'c:\\temp\\seleniumtesting\\'
}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(executable_path='C:\\temp\\seleniumtesting\\chromedriver.exe', options=chrome_options)
driver.get('https://www.python.org/downloads/release/python-373/')
time.sleep(25)
driver.close()
After the page launches, hitting ctrl+p brings up the printing page, but it defaults to the default printer. If I bring up the same page in my standard Chrome installation, it defaults to printing to PDF. I want to get to the point where I can add kiosk printing and then call window.print(), but as of now all that does is send it to the actual paper printer.
Thanks for any help anyone can offer. I'm stumped, and at this point it probably would have been faster to just save all of these manually.
It seems that if you have network printers configured, they load after the dialog opens and override your selected destination.
There is a preference, "printing.default_destination_selection_rules", which seems to resolve this.
prefs = {
"printing.print_preview_sticky_settings.appState": json.dumps(app_state),
"download.default_directory": "c:\\temp\\seleniumtesting\\",
"printing.default_destination_selection_rules": {
"kind": "local",
"namePattern": "Save as PDF",
},
}
https://chromium.googlesource.com/chromium/src/+/master/chrome/common/pref_names.cc#1318
https://www.chromium.org/administrators/policy-list-3#DefaultPrinterSelection

Using data parameters in Selenium Python

Very novice level question. What's the easiest way of opening several websites one by one after reading site names from an external file.
In the example below, I want to replace the web URL with values read from a file, and the screenshot file name in the same way.
Example script:
from selenium import webdriver
driver = webdriver.Ie(...driverpath)
driver.get("https://www.facebook.com/")
driver.get_screenshot_as_file("facebook.png")
driver.quit()
Try this, I have used json for storing the websites, a simple text file will do as well
import json
from selenium.webdriver import Chrome
with open('path to json file', encoding='utf-8') as s:
    data = json.loads(s.read())
for site in data['sites']:
    driver = Chrome('path to chrome driver')
    driver.get(data['sites'][site])
    driver.get_screenshot_as_file(site + '.png')
    driver.close()
json file
{
"sites": {
"facebook": "https://www.facebook.com/",
"google": "https://www.google.com/",
"wikipedia": "https://www.wikipedia.org/"
}
}
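For the plain text file variant mentioned above, a minimal sketch could look like this. It assumes one full URL (including the scheme) per line; the function name read_sites and the way the screenshot name is derived from the host name are illustrative choices, not something from the answer.

```python
from urllib.parse import urlparse


def read_sites(path):
    """Yield (url, screenshot_name) pairs from a text file, one URL per line.

    Hypothetical helper: blank lines and lines starting with '#' are skipped.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            url = line.strip()
            if not url or url.startswith("#"):
                continue
            host = urlparse(url).netloc
            if host.startswith("www."):
                host = host[4:]
            # "facebook.com" becomes "facebook_com.png"
            yield url, host.replace(".", "_") + ".png"


# Intended use with a driver created as in the answer above:
#   for url, shot in read_sites('sites.txt'):
#       driver.get(url)
#       driver.get_screenshot_as_file(shot)
```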

How can I download a file on a click event using selenium?

I am working on python and selenium. I want to download file from clicking event using selenium. I wrote following code.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.close()
I want to download both files from links with name "Export Data" from given url. How can I achieve it as it works with click event only?
Find the link using find_element(s)_by_*, then call click method.
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Added profile manipulation code to prevent download dialog.
I'll admit this solution is a little more "hacky" than the Firefox Profile saveToDisk alternative, but it works across both Chrome and Firefox, and doesn't rely on a browser-specific feature which could change at any time. And if nothing else, maybe this will give someone a little different perspective on how to solve future challenges.
Prerequisites: Ensure you have selenium and pyvirtualdisplay installed...
Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay
The Magic
import pyvirtualdisplay
import selenium
import selenium.webdriver
import time
import base64
import json
root_url = 'https://www.google.com'
download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
print('Opening virtual display')
display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
display.start()
print('\tDone')
print('Opening web browser')
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
print('\tDone')
print('Retrieving initial web page')
driver.get(root_url)
print('\tDone')
print('Injecting retrieval code into web page')
driver.execute_script("""
window.file_contents = null;
var xhr = new XMLHttpRequest();
xhr.responseType = 'blob';
xhr.onload = function() {
var reader = new FileReader();
reader.onloadend = function() {
window.file_contents = reader.result;
};
reader.readAsDataURL(xhr.response);
};
xhr.open('GET', %(download_url)s);
xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
'download_url': json.dumps(download_url),
})
print('Looping until file is retrieved')
downloaded_file = None
while downloaded_file is None:
    # Returns the file retrieved base64 encoded (perfect for downloading binary)
    downloaded_file = driver.execute_script('return (window.file_contents !== null ? window.file_contents.split(\',\')[1] : null);')
    print(downloaded_file)
    if not downloaded_file:
        print('\tNot downloaded, waiting...')
        time.sleep(0.5)
print('\tDone')
print('Writing file to disk')
fp = open('google-logo.png', 'wb')
fp.write(base64.b64decode(downloaded_file))
fp.close()
print('\tDone')
driver.close() # close web browser, or it'll persist after python exits.
display.popen.kill() # close virtual display, or it'll persist after python exits.
Explanation
We first load a URL on the domain we're targeting a file download from. This allows us to perform an AJAX request on that domain without running into cross-origin (same-origin policy) restrictions.
Next, we inject some JavaScript into the page which fires off an AJAX request. Once the request returns a response, we load the response into a FileReader object. From there we can extract the base64-encoded content of the file by calling readAsDataURL(). We then store the base64-encoded content on window, a globally accessible object.
Finally, because the AJAX request is asynchronous, we enter a Python while loop waiting for the content to be appended to the window. Once it's appended, we decode the base64 content retrieved from the window and save it to a file.
This solution should work across all modern browsers supported by Selenium, and works whether text or binary, and across all mime types.
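One caveat with the loop as written is that it never ends if the request fails, because window.file_contents then stays null. A minimal sketch of the same polling with a timeout follows. The function name wait_for_file_contents and its timeout and poll parameters are hypothetical; `driver` only needs an execute_script method.

```python
import base64
import time


def wait_for_file_contents(driver, timeout=30.0, poll=0.5):
    """Poll window.file_contents until it is set, then return decoded bytes.

    Hypothetical helper: raises TimeoutError instead of looping forever.
    """
    script = (
        "return (window.file_contents !== null"
        " ? window.file_contents.split(',')[1] : null);"
    )
    deadline = time.monotonic() + timeout
    while True:
        encoded = driver.execute_script(script)
        if encoded is not None:
            # The data URL payload after the comma is base64 encoded.
            return base64.b64decode(encoded)
        if time.monotonic() >= deadline:
            raise TimeoutError("file was not retrieved within the timeout")
        time.sleep(poll)


# Intended use after the injection step above:
#   with open('google-logo.png', 'wb') as fp:
#       fp.write(wait_for_file_contents(driver))
```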
Alternate Approach
While I haven't tested this, Selenium does afford you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM and use the binding of that element as the trigger to retrieve the downloaded file.
In Chrome, what I do is download the files by clicking on the links, then open the chrome://downloads page and retrieve the list of downloaded files from the shadow DOM like this:
docs = document
.querySelector('downloads-manager')
.shadowRoot.querySelector('#downloads-list')
.getElementsByTagName('downloads-item')
This solution is restricted to Chrome. The data also contains information such as the file path and download date. (Note: this code is JavaScript and may not be correct Python syntax.)
Here is the full working code. You can use web scraping to enter the username, password, and other fields. To get the field names that appear on the webpage, use Inspect Element. An element (username, password, or the button to click) can be located by class or by name.
from selenium import webdriver
# Using Chrome to access web
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {"download.default_directory": "C:/Test"}) # Set the download path (a preference, not a command-line argument)
driver = webdriver.Chrome(options=options)
# Open the website
try:
    driver.get('xxxx') # Your Website Address
    password_box = driver.find_element_by_name('password')
    password_box.send_keys('xxxx') #Password
    download_button = driver.find_element_by_class_name('link_w_pass')
    download_button.click()
    driver.quit()
except Exception:
    driver.quit()
    print("Faulty URL")
