I want Firefox, driven by Selenium for Python, to download the "Master data (Download, XLSX)" Excel file from this Frankfurt Stock Exchange webpage.
The problem: I can't get Firefox to download the file without asking where to save it first.
Let me first point out that the URL I'm trying to get the Excel file from is really a blob URL:
http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx
Perhaps the blob is causing my problem? Or perhaps the problem is in my MIME handling?
from selenium import webdriver

profile_dir = "path/to/ff_profile"
dl_dir = "path/to/dl/folder"

ff_profile = webdriver.FirefoxProfile(profile_dir)
ff_profile.set_preference("browser.download.folderList", 2)  # 2 = use the custom dir below
ff_profile.set_preference("browser.download.manager.showWhenStarting", False)
ff_profile.set_preference("browser.download.dir", dl_dir)
ff_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")

driver = webdriver.Firefox(ff_profile)

url = "http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs"
driver.get(url)

dl_link = driver.find_element_by_partial_link_text("Master data")
dl_link.click()
The actual mime-type to be used in this case is:
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
How do I know that? Here is what I've done:
opened Firefox manually and navigated to the target site
when downloading the file, checked the checkbox to save these kind of files automatically
went to Help -> Troubleshooting Information and navigated to the "Profile Folder"
in the profile folder, found and opened mimetypes.rdf
inside mimetypes.rdf, found the record/resource corresponding to the Excel file I had just downloaded
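Given that, the fix is presumably to include that exact MIME type in the neverAsk.saveToDisk preference. One subtlety worth checking: Firefox compares the served Content-Type against each comma-separated entry literally, so the spaces after the commas in the preference string above can prevent a match. A minimal sketch of building the value (the exact type list is my assumption):

```python
# MIME types Firefox should save to disk without prompting; the XLSX type
# discovered in mimetypes.rdf is included explicitly. The preference value
# must be a comma-separated string WITHOUT spaces around the commas.
MIME_TYPES = [
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel",
    "application/octet-stream",
]

def saveToDisk_value(mime_types):
    """Join MIME types into the exact string Firefox expects."""
    return ",".join(m.strip() for m in mime_types)

# applied to the profile from the code above like so:
# ff_profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
#                           saveToDisk_value(MIME_TYPES))
```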
Related
I'm not really a Python user, but I'm using some code I found online to download a file. Part of the code is:
urlpage = 'https://www150.statcan.gc.ca/n1/tbl/csv/' + '10100127' + '-eng.zip'
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\downloads')
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-gzip")
driver = webdriver.Firefox()
driver.get(urlpage)
From what I can see, this should just download the file to the downloads folder on my D: drive, yet when I run the code, the webpage opens and then asks whether I would like to view or download the file. Is there anything wrong with the code, or am I doing something wrong?
Not sure if it's important information, but I'm using PyCharm as my IDE.
Here is the script you should use; it will save the file to the system's default downloads folder.
FF_options = webdriver.FirefoxProfile()
FF_options.set_preference("browser.helperApps.neverAsk.saveToDisk","application/zip")
driver = webdriver.Firefox(firefox_profile=FF_options)
If you want to save the downloaded file to a specific location, add the preferences below before creating the driver.
# change the path here; as written, files are saved to the working directory
# (the location of your script)
FF_options.set_preference("browser.download.dir", os.getcwd())
FF_options.set_preference("browser.download.folderList",2)
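Putting the pieces together: the preferences have to be set on the profile before the driver is created, and the profile must actually be passed in (the question's code builds a profile but then calls webdriver.Firefox() without it, which is why the save dialog still appears). A sketch, with the download directory and MIME types as assumptions:

```python
def firefox_download_prefs(download_dir):
    """Preferences that let Firefox save the zip silently into download_dir."""
    return {
        "browser.download.folderList": 2,  # 2 = use the custom dir below
        "browser.download.manager.showWhenStarting": False,
        "browser.download.dir": download_dir,
        # the statcan file is a zip; keep x-gzip as a fallback in case the
        # server labels it differently
        "browser.helperApps.neverAsk.saveToDisk": "application/zip,application/x-gzip",
    }

# applied like so (note the profile is passed to the driver):
#
#   profile = webdriver.FirefoxProfile()
#   for name, value in firefox_download_prefs(r"D:\downloads").items():
#       profile.set_preference(name, value)
#   driver = webdriver.Firefox(firefox_profile=profile)
#   driver.get(urlpage)
```

Using a raw string (r"D:\downloads") also sidesteps backslash-escape surprises in Windows paths.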
I am having trouble downloading a txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (scroll down to see Download: txt, xls and xml).
My goal is to create a scraper that goes to the linked page, clicks the txt link (for example), and saves the downloaded file.
Main problems that I am not sure how to solve:
The file doesn't have a static link I can call directly; the link is created with JS based on the selected filters and file type.
When I use the requests library for Python and call the link with all the headers, it just redirects me to https://www.ceps.cz/en/all-data .
Approaches tried:
Using a scraper such as ParseHub to grab the link didn't work as intended, though it came closest to what I wanted.
Using the requests library to call the link with the same headers the XHR request uses for downloading the file, but it just redirects me to https://www.ceps.cz/en/all-data .
If you can propose a solution for this task, thank you in advance. :-)
You can download this data to a directory of your choice with Selenium; you just need to specify the directory to which the data will be saved. In what follows below, I'll save the txt data to my desktop:
from selenium import webdriver
download_dir = '/Users/doug/Desktop/'
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory' : download_dir}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://www.ceps.cz/en/all-data')
container = driver.find_element_by_class_name('download-graph-data')
button = container.find_element_by_tag_name('li')
button.click()
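One practical wrinkle: button.click() returns immediately, before the browser has finished writing the file. A small helper that polls the download directory until a finished file appears (the .crdownload suffix is Chrome's convention for in-progress downloads; adjust for other browsers):

```python
import os
import time

def wait_for_download(download_dir, timeout=30, poll=0.5):
    """Return the path of the first finished file to appear in download_dir.

    Chrome names in-progress downloads with a .crdownload suffix, so any
    other file is treated as complete. Raises TimeoutError on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        done = [f for f in os.listdir(download_dir)
                if not f.endswith('.crdownload')]
        if done:
            return os.path.join(download_dir, done[0])
        time.sleep(poll)
    raise TimeoutError('no download appeared in %s' % download_dir)
```

Called as wait_for_download(download_dir) right after button.click(), this blocks until the txt file has landed.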
You could do it like so:
import requests

txt_format = 'txt'
xls_format = 'xls'  # open in binary mode
xml_format = 'xml'  # open in binary mode

def download(file_type):
    url = f'https://www.ceps.cz/download-data/?format={file_type}'
    response = requests.get(url)
    if file_type == txt_format:
        # text format: write the decoded text
        with open(f'file.{file_type}', 'w') as file:
            file.write(response.text)
    else:
        # binary formats: write the raw bytes
        with open(f'file.{file_type}', 'wb') as file:
            file.write(response.content)

download(txt_format)
My scripts run on Python 3.6, Selenium 2.48 and Firefox 41 (I can't upgrade; these are company machines).
I want to download some XML files from a website using Python and Selenium Webdriver. I use a Firefox profile to avoid the dialog frame and save the file in a specific location.
profile = webdriver.firefox.firefox_profile.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.panel.shown", False)
profile.set_preference("browser.download.dir", dloadPath)
profile.set_preference("browser.helperApps.neverAsk.openFile","application/xml,text/xml")
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/xml,text/xml")
browser = webdriver.Firefox(firefox_profile=profile)
The program finds all the downloadable links (tested: works):
links = []
elements = browser.find_elements_by_xpath("//a[contains(@href, 'reception/')]")
for elem in elements:
    href = elem.get_attribute("href")
    links.append(href)
return links
To download the file, I use get() from Selenium:
browser.get(fileUrl)
The files I'm looking for have a very specific URL, which means I can't use Requests or urllib (2 or 3): I need to log in to the website and navigate through it, and I can't do that with those modules.
The URL is like:
https://www.example.com/cft/cft/reception/filename.xml?user=xxxxxxxx&password=xxxxxxxx
Here is the HTML link text:
filename.xml
With my script I can access the website and navigate through it, but when I get the file URL, the dialog frame pops up, for no reason I can find.
The script works very well on other websites, I think the problem is the url.
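One thing I plan to check: the dialog only appears when the served Content-Type is not in the neverAsk list, and servers often label dynamically generated files as application/octet-stream rather than application/xml. A quick way to see what is actually sent (assuming the server answers HEAD requests; switch to GET and discard the body if it does not, and include the user/password query parameters):

```python
import urllib.request

def served_content_type(url):
    """Ask the server (via a HEAD request) what Content-Type it sends for a URL."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Content-Type", "")
```

If the result is something like application/octet-stream, adding that type to browser.helperApps.neverAsk.saveToDisk should make the dialog go away.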
Thanks for your help
I was told to do a cookie audit of our front-facing sites. We own a lot of domains, so I'm really not going to manually dig through each one extracting the cookies. I decided to go with Selenium. This works up to the point where I want to grab third-party cookies.
Currently (python) I can do
driver.get_cookies()
For all the cookies that are set from my domain, but this doesn't give me any Google, Twitter, Vimeo, or other 3rd party cookies
I have tried modifying the cookie permissions in the Firefox driver, but it doesn't help. Does anyone know how I can get hold of them?
Your question has been answered on StackOverflow here
Step 1: You need to download and install "Get All Cookies in XML" extension for Firefox from here (don't forget to restart Firefox after installing the extension).
Step 2: Execute this Python code to have Selenium's FirefoxWebDriver save all cookies to an XML file and then read this file:
from xml.dom import minidom
from selenium import webdriver
import os
import time

def determine_default_profile_dir():
    """
    Returns the path of Firefox's default profile directory
    @return: directory_path
    """
    appdata_location = os.getenv('APPDATA')
    profiles_path = appdata_location + "/Mozilla/Firefox/Profiles/"
    dirs_files_list = os.listdir(profiles_path)
    default_profile_dir = ""
    for item_name in dirs_files_list:
        if item_name.endswith(".default"):
            default_profile_dir = profiles_path + item_name
    assert default_profile_dir, "did not find Firefox default profile directory"
    return default_profile_dir

#load firefox with the default profile, so that the "Get All Cookies in XML" addon is enabled
default_firefox_profile = webdriver.FirefoxProfile(determine_default_profile_dir())
driver = webdriver.Firefox(default_firefox_profile)

#trigger Firefox to save the value of all cookies into an xml file in the Firefox profile directory
driver.get("chrome://getallcookies/content/getAllCookies.xul")

#wait for a bit to give Firefox time to write all the cookies to the file
time.sleep(40)

#the cookies file will not be saved into the directory with the default profile, but into a temp directory
current_profile_dir = driver.profile.profile_dir
cookie_file_path = current_profile_dir + "/cookie.xml"
print "Reading cookie data from cookie file: " + cookie_file_path

#load the cookies file and do what you need with it
cookie_file = open(cookie_file_path, 'r')
xmldoc = minidom.parse(cookie_file)
cookie_file.close()
driver.close()

#process all cookies in the xmldoc object
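For the last step, a sketch of pulling name/value pairs out of the parsed document. This assumes the addon writes each cookie as a <cookie> element with name and value attributes; inspect the generated cookie.xml to confirm, since the addon's exact output format may differ:

```python
from xml.dom import minidom

def cookie_pairs(xmldoc):
    """Extract (name, value) pairs from the parsed cookie XML document.

    Assumes each cookie is a <cookie> element carrying 'name' and 'value'
    attributes -- adjust the tag/attribute names to match the real file.
    """
    pairs = []
    for node in xmldoc.getElementsByTagName("cookie"):
        pairs.append((node.getAttribute("name"), node.getAttribute("value")))
    return pairs
```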
Selenium can only get the cookies of the current domain:
getCookies
java.util.Set<Cookie> getCookies()
Get all the cookies for the current domain. This is the equivalent of
calling "document.cookie" and parsing the result
Anyway, I heard somebody used a Firefox plugin that was able to save all the cookies in XML. As far as I know, it is your best option.
Yes, I don't believe Selenium allows you to interact with cookies other than your current domain.
If you know the domains in question, then you could navigate to that domain but I assume this is unlikely.
It would be a massive security risk if you could access cookies cross-site.
You can get any cookie out of the browser's SQLite database file in the profile folder.
I added a more complete answer here:
Selenium 2 get all cookies on domain
Tools: Ubuntu, Python, Selenium, Firefox
I am trying to automate the downloading of image files from a subscription web site. I do not have access to the server other than through my paid subscription. To avoid having to click a button for each file download, I decided to automate it using Python, Selenium, and Firefox. (I have been using these three together for the first time for two days now. I also know very little about cookies.)
I am interested in downloading the following three formats, in order of preference: ['EPS', 'PNG', 'JPG']. A button for each format is available on the web site.
I have managed to have success in automating the downloading of the 'PNG' and 'JPG' files to disk by setting the Firefox preferences by hand as suggested in this post: python webcrawler downloading files
However, when the file is in an 'EPS' format, the "You have chosen to save" dialog box still pops open in the Firefox window.
As you can see from my code, I have set the preferences to save 'EPS' files to disk. (Again, 'JPG' and 'PNG' files are saved as expected.)
from selenium import webdriver
profile = webdriver.firefox.firefox_profile.FirefoxProfile()
profile.set_preference("browser.download.folderList", 1)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
'image/jpeg,image/png,application/postscript,'
'application/eps,application/x-eps,image/x-eps,'
'image/eps')
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference("plugin.disable_full_page_plugin_for_types",
"application/eps,application/x-eps,image/x-eps,"
"image/eps")
profile.set_preference(
"general.useragent.override",
"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0)"
" Gecko/20100101 Firefox/26.0")
driver = webdriver.Firefox(firefox_profile=profile)
#I then log in and begin automated clicking to download files. 'JPG' and 'PNG' files are
#saved to disk as expected. The 'EPS' files present a save dialog box in Firefox.
I tried installing an extension for Firefox called "download-statusbar" that claims to negate any save dialog box from appearing. The extension loads in the Selenium Firefox browser, but it doesn't function. (A lot of reviews say the extension is broken despite the developers' insistence that it does function.) It isn't working for me anyway so I gave up on it.
I added this to the Firefox profile in that attempt:
#The extension loads, but it doesn't function.
download_statusbar = ('/home/$USER/Downloads/'
                      'download_statusbar_fixed-1.2.00-fx.xpi')
profile.add_extension(download_statusbar)
From reading other stackoverflow.com posts, I decided to see if I could download the file via the url with urllib2. As I understand how this would work, I would need to add cookies to the headers in order to authenticate the downloading of the 'EPS' file via a url.
I am unfamiliar with this technique, but here is the code I tried to use to download the file directly. It failed with a '403 Forbidden' response despite my attempts to set cookies in the urllib2 opener.
import urllib2
import cookielib
import logging
import sys

cookie_jar = cookielib.LWPCookieJar()
handlers = [
    urllib2.HTTPHandler(),
    urllib2.HTTPSHandler(),
]
[h.set_http_debuglevel(1) for h in handlers]
handlers.append(urllib2.HTTPCookieProcessor(cookie_jar))

#using selenium driver cookies, returns a list of dictionaries
cookies = driver.get_cookies()

opener = urllib2.build_opener(*handlers)
opener.addheaders = [(
    'User-agent',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) '
    'Gecko/20100101 Firefox/26.0'
)]

logger = logging.getLogger("cookielib")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

for item in cookies:
    opener.addheaders.append(('Cookie', '{}={}'.format(
        item['name'], item['value']
    )))
    logger.info('{}={}'.format(item['name'], item['value']))

response = opener.open('http://path/to/file.eps')
#Fails with a 403 Forbidden response
Any thoughts or suggestions? Am I missing something easy or do I need to give up hope on an automated download of the EPS files? Thanks in advance.
Thank you to @unutbu for helping me solve this. I just didn't understand the anatomy of a file download; I understand it a little better now.
I ended up installing an extension called "Live HTTP Headers" on Firefox to examine the headers sent by the server. As it turned out, the 'EPS' files were sent with a 'Content-Type' of 'application/octet-stream'.
Now the EPS files are saved to disk as expected. I modified the Firefox preferences to the following:
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
'image/jpeg,image/png,'
'application/octet-stream')