Python Webscraper Download PDF in Firefox - python

I am programming a Python Webscraper which needs to be able to click on a download button and save a PDF to a location that is defined through an XML-File.
The problematic part of my code is the following:
profile = webdriver.FirefoxProfile()
download_Path = items.get(key = 'dir') # Get download path from XML.
if not os.path.exists(download_Path):
    os.makedirs(download_Path)
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference("browser.download.panel.shown", False)
profile.set_preference("browser.download.manager.useWindow", False)
profile.set_preference("webdriver_enable_native_events", False)
profile.set_preference("browser.helperApps.neverAsk.openFile", "application/pdf;")
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf;")
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.dir", download_Path)
profile.update_preferences()
driver = webdriver.Firefox(executable_path = DriverPath, options = options, firefox_profile = profile)
Almost everything works fine: the download directory is changed as intended, so profile.set_preference works, but the other preferences do not seem to take effect. I have been searching for a while now and, as you can see, I tried different options so that the browser doesn't ask whether to open the file or where to save it, and just saves it in the given directory.

I solved it myself. The answer is that you have to configure the PDF reader that is integrated in Firefox ("PDF.js") separately with the following line:
profile.set_preference("pdfjs.disabled", True)
That's it; the rest works as intended.
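For reference, here is a minimal sketch of the whole configuration on a newer Selenium, where preferences are set on an Options object instead of a FirefoxProfile. The helper names build_download_prefs, make_firefox_options and start_firefox are made up for this example, and it assumes Selenium 4 with geckodriver available on the PATH; treat it as an illustration rather than the exact code from the question.

```python
import os


def build_download_prefs(download_dir, mime_types=("application/pdf",)):
    """Return the Firefox preferences for silent downloads as a plain dict."""
    return {
        # 2 means "use the directory given in browser.download.dir".
        "browser.download.folderList": 2,
        "browser.download.dir": os.path.abspath(download_dir),
        # Comma-separated MIME types that are saved without a dialog.
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
        # Turn off the built-in PDF viewer so PDFs are downloaded, not shown.
        "pdfjs.disabled": True,
    }


def make_firefox_options(prefs):
    """Copy the preferences onto a Selenium Firefox Options object."""
    # Imported here so the dict helper above works without Selenium installed.
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in prefs.items():
        options.set_preference(name, value)
    return options


def start_firefox(download_dir):
    """Create the download directory and start Firefox with the preferences."""
    from selenium import webdriver

    prefs = build_download_prefs(download_dir)
    os.makedirs(prefs["browser.download.dir"], exist_ok=True)
    return webdriver.Firefox(options=make_firefox_options(prefs))
```

Keeping the preferences in a dict makes it easy to print them or compare them with about:config when a dialog still shows up.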

Related

Firefox preference update is not applied with robot framework

I am trying to turn off the Firefox download dialog. I used this piece of Python code, which uses the Selenium library. It should make the file download directly into the given path without asking any further questions.
from selenium import webdriver
def disable_download_dialog(path):
    fp = webdriver.FirefoxProfile()
    fp.set_preference("browser.download.folderList", 2)
    fp.set_preference("browser.download.manager.showWhenStarting", False)
    fp.set_preference("browser.download.dir", path)
    fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
    fp.update_preferences()
    return fp.path
Then I call this function in my RF test like this:
${ff_profile_path}=    disable download dialog    ${EXECDIR}\\path\\to\\my\\folder
and then open the browser like this:
Open Browser    ${url}    ${browser}    ff_profile_dir=${ff_profile_path}
From the test run I can see that the download window is still displayed. The path to the folder where I want the downloaded file to go is shown in the test logs like this:
D:\\path\\to\\the\\folder\\named\\Downloads
The Firefox profile really is updated and saved in the Temp folder, but it looks like it is not loaded and therefore not used for my test. The path to the Firefox profile looks like this:
C:\Users\surname~1.name\AppData\Local\Temp\tmp83d29mnz
Of course, a new profile is created on every run, which is not an issue. It would be nice, though, if I could also set the path for the Firefox profile created by the Python function.
So the questions here are:
Why is the download dialog still shown when I have disabled it?
Can the Firefox profile be saved in a folder that I define?
Ok, so I found out, what was the missing piece.
I added these two lines of code into the python function
fp.set_preference("browser.helperApps.alwaysAsk.force", False)
fp.set_preference("pdfjs.disabled", True)
So the final version of the function looks like this:
def disable_download_dialog(path):
    from selenium import webdriver
    fp = webdriver.FirefoxProfile()
    fp.set_preference("browser.download.folderList", 2)
    fp.set_preference("browser.download.manager.showWhenStarting", False)
    fp.set_preference("browser.download.dir", path)
    fp.set_preference("browser.helperApps.alwaysAsk.force", False)
    fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
    fp.set_preference("pdfjs.disabled", True)
    fp.update_preferences()
    return fp.path
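The second question (saving the profile in a folder of your own) is not answered above. One possible approach, which is an assumption and not something confirmed in this thread, is to copy the temporary profile directory to a fixed location once update_preferences() has written it, and to hand that copy to ff_profile_dir. The helper name persist_profile is made up for this sketch.

```python
import os
import shutil


def persist_profile(temp_profile_path, target_dir):
    """Copy a generated Firefox profile directory to target_dir and return it."""
    os.makedirs(target_dir, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) lets repeated runs overwrite the old copy.
    shutil.copytree(temp_profile_path, target_dir, dirs_exist_ok=True)
    return target_dir
```

Inside disable_download_dialog, the last line would then become something like return persist_profile(fp.path, target_dir), with target_dir being a folder you choose. Selenium may still work on its own temporary copy of the profile when the browser starts, so this only gives you a stable place to inspect or reuse the settings.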

Why does Selenium still ask me to configure Saves when I have it set in Python already?

I'm not really a Python user, but I'm using some code that I found online to download a file. The relevant part of the code is:
urlpage = 'https://www150.statcan.gc.ca/n1/tbl/csv/' + '10100127' + '-eng.zip'
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\downloads')
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-gzip")
driver = webdriver.Firefox()
driver.get(urlpage)
From what I can see, this should just download the file to the downloads folder on my D: drive. Yet when I run the code, the webpage opens and asks whether I would like to view or download the file. Is there anything wrong with the code, or am I doing something wrong?
Not sure if it's important information, but I'm using PyCharm as my IDE
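Here is the script that you should use. It registers application/zip (the URL points to a .zip file) and passes the profile to webdriver.Firefox, which the code in the question does not do. With only this preference set, the file is saved in the system default downloads folder.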
FF_options = webdriver.FirefoxProfile()
FF_options.set_preference("browser.helperApps.neverAsk.saveToDisk","application/zip")
driver= webdriver.Firefox(firefox_profile=FF_options)
If you want to save the downloaded file in a specific location, add the prefs below before creating the driver.
import os
# Change the path here; as written, the file is saved in the current working
# directory, i.e. the directory the script is run from.
FF_options.set_preference("browser.download.dir", os.getcwd())
FF_options.set_preference("browser.download.folderList", 2)

How to download file using remote Firefox webdriver?

I've tried to adapt several existing solutions (1, 2) to the remote Firefox webdriver running in a selenium/standalone-firefox Docker container:
options = Options()
options.set_preference('browser.download.dir', '/src/app/output')
options.set_preference('browser.download.folderList', 2)
options.set_preference('browser.download.manager.showWhenStarting', False)
options.set_preference('browser.helperApps.alwaysAsk.force', False)
options.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')
options.set_preference('pdfjs.disabled', True)
options.set_preference('pdfjs.enabledCache.state', False)
options.set_preference('plugin.disable_full_page_plugin_for_types', False)
cls.driver = webdriver.Remote(
    command_executor='http://selenium:4444/wd/hub',
    desired_capabilities={'browserName': 'firefox', 'acceptInsecureCerts': True},
    options=options
)
Navigating and clicking the relevant download button works fine, but the file never appears in the download directory. I've verified everything I can think of:
The user in the Selenium container can create files in /src/app/output and those files are visible in the host OS.
I can download the file successfully using my desktop browser.
The response content type is application/pdf.
What am I missing?
It turned out other changes done while researching this were resulting in the server returning a text/plain document rather than a PDF file. For reference, this is the simplest set of options I could get to work:
options.set_preference('browser.download.dir', DOWNLOAD_DIRECTORY)
options.set_preference('browser.download.folderList', 2)
options.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')
options.set_preference('pdfjs.disabled', True)
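Since the cause here was the server sending text/plain instead of application/pdf, it can help to check the Content-Type header outside the browser before touching the preferences. The sketch below uses only the standard library; the function names media_type and fetch_media_type are made up for this example, and it assumes the download URL can be reached with a plain HEAD request (no login, no POST), which may not hold for every site.

```python
from urllib.request import Request, urlopen


def media_type(content_type_header):
    """Return the bare media type, e.g. 'application/pdf', from a header value."""
    # Drop parameters such as '; charset=utf-8' and normalise the case.
    return content_type_header.split(";", 1)[0].strip().lower()


def fetch_media_type(url, timeout=10):
    """Send a HEAD request and return the media type the server reports."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return media_type(response.headers.get("Content-Type", ""))
```

If the value returned by fetch_media_type is not listed in browser.helperApps.neverAsk.saveToDisk, Firefox has no matching rule for the response and the silent download will not happen.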

Python: How can I get Firefox preferences to neverask.saveToDisk for a .eml file?

I am trying to set the preferences on my Firefox browser to never ask to save to disk when downloading a .eml file.
def setUp(self):
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', os.path.join(os.path.expanduser("~"), "Downloads\\"))
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv,message/rfc822')
    self.driver = webdriver.Firefox(profile)
    self.base_url = baseurl
    self.verificationErrors = []
    self.accept_next_alert = True
    self.driver.implicitly_wait(3)
With this code I am able to download a .csv without having the saveToDisk pop-up appear in Firefox, however this will not work with .eml despite having the 'message/rfc822' MIME-type set.
Can anyone help explain whether I am using an incorrect MIME type to set the preferences for .eml files? Or is there something else I need to do in order to download .eml files without any pop-ups being displayed?
It seems that you have the right content type. Still, you can verify the content type first and take it from there:
from mimetypes import MimeTypes
from urllib.request import pathname2url
mime = MimeTypes()
# Raw string, so that \t and \f are not read as escape sequences.
url = pathname2url(r'path\to\filesample.eml')
mime_type = mime.guess_type(url)
print(mime_type)

Downloading multiple files using Selenium click()?

Using Firefox/Python/Selenium-- I am able to use click() on a file link on a webpage to download it, and the file downloads to my Downloads folder as expected.
However, when I add more lines to click() on more than 1 link, the script no longer runs as expected. Instead of the files being downloaded, they are all opening in separate browser windows, which all close after the script completes.
Is this by design or is there a way around it or a better way to download multiple files on a webpage?
This is the website in question: https://www.treasury.gov/about/organizational-structure/ig/Pages/igdeskbook.aspx
I am trying to download the links to the Introduction and all parts of Volumes 1-4.
I have a dictionary of the locators:
IgDeskbookPageMap = dict(IgDeskbookBannerXpath = "//div[contains(text(), 'The Inspector General Deskbook')]",
                         IgDeskbookIntroId = "anch_202",
                         IgDeskbookVol1Part1Id = "anch_203",
                         IgDeskbookVol1Part2Id = "anch_204",
                         IgDeskbookVol1Part3Id = "anch_205",
                         IgDeskbookVol1Part4Id = "anch_206",
                         IgDeskbookVol2Id = "anch_207",
                         IgDeskbookVol3Id = "anch_208",
                         IgDeskbookVol4Part1Id = "anch_209",
                         IgDeskbookVol4Part2Id = "anch_210",
                         IgDeskbookVol4Part3Id = "anch_211"
                         )
This is the method:
def click(self, waitTime, locatorMode, Locator):
    self.wait_until_element_clickable(waitTime, locatorMode, Locator).click()
These are the click() calls (there are more than 3; I am truncating here for space):
self.click(10,
"id",
IgDeskbookPageMap['IgDeskbookIntroId']
)
self.click(10,
"id",
IgDeskbookPageMap['IgDeskbookVol1Part1Id']
)
self.click(10,
"id",
IgDeskbookPageMap['IgDeskbookVol1Part2Id']
)
I added the following code for launching Firefox and now the download behavior works as expected when clicking on each file:
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.helperApps.alwaysAsk.force', False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf,application/x-pdf')
profile.set_preference("plugin.disable_full_page_plugin_for_types", "application/pdf")
profile.set_preference("pdfjs.disabled", True)
self.driver = webdriver.Firefox(profile)
One way to download multiple files that open in separate tabs is to follow these steps, in whichever language you use:
for (all such links):
    click() the PDF link
    findElement the download element
    click() the download link
    close the tab
    switch back to the last tab  // should ideally be completed with the previous step
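A rough Python version of that loop, combined with a wait for the downloads to finish, could look like the sketch below. It assumes the profile already saves PDFs without a dialog, so the "download element" step is left out (its locator is not given in this thread). It also assumes that the download directory starts out empty and that Firefox marks unfinished downloads with a ".part" file. The names wait_for_downloads and download_all are made up for this example.

```python
import os
import time


def wait_for_downloads(directory, expected_count, timeout=60.0, poll=0.5):
    """Block until directory holds expected_count finished files.

    Assumes unfinished Firefox downloads show up as '*.part' files.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        finished = [n for n in names if not n.endswith(".part")]
        in_progress = len(names) - len(finished)
        if len(finished) >= expected_count and in_progress == 0:
            return sorted(finished)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"downloads incomplete after {timeout} s: {names}")
        time.sleep(poll)


def download_all(driver, link_ids, download_dir):
    """Click every link, close any tab it opened, then wait for the files."""
    main_window = driver.current_window_handle
    for link_id in link_ids:
        driver.find_element("id", link_id).click()
        # Close tabs opened by the click and return to the main window.
        for handle in driver.window_handles:
            if handle != main_window:
                driver.switch_to.window(handle)
                driver.close()
        driver.switch_to.window(main_window)
    return wait_for_downloads(download_dir, expected_count=len(link_ids))
```

With the dictionary from the question, link_ids would be the anch_... values, for example IgDeskbookPageMap['IgDeskbookIntroId']. Waiting before quitting the driver matters because closing the browser can cut off downloads that are still running.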
