How to download a file from a page using Python

I am having trouble downloading the txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (scroll down and you see Download: txt, xls and xml).
My goal is to create a scraper that goes to the linked page, clicks the txt link for example, and saves the downloaded file.
The main problems that I am not sure how to solve:
The file doesn't have a real link that I can call to download it; the link is created with JS based on the filters and file type.
When I use the requests library for Python and call the link with all the headers, it just redirects me to https://www.ceps.cz/en/all-data .
Approaches tried:
Using a scraper such as ParseHub to download the link didn't work as intended, but this scraper was the closest to what I wanted.
Used the requests library to connect to the link using the headers that the XHR request uses for downloading the file, but it just redirects me to https://www.ceps.cz/en/all-data .
If you could propose some solution for this task, thank you in advance. :-)

You can download this data to a directory of your choice with Selenium; you just need to specify the directory to which the data will be saved. In the example below, I save the txt data to my desktop:
from selenium import webdriver

download_dir = '/Users/doug/Desktop/'

# Tell Chrome to save downloads to download_dir instead of the default folder
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory': download_dir}
chrome_options.add_experimental_option('prefs', prefs)

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://www.ceps.cz/en/all-data')

# The txt link is the first <li> inside the download-graph-data container
container = driver.find_element_by_class_name('download-graph-data')
button = container.find_element_by_tag_name('li')
button.click()
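Note that button.click() returns immediately while Chrome keeps downloading in the background, so the script may exit before the file lands. A small helper that polls the download directory until the file appears (a sketch; the filename pattern is an assumption, since the actual name depends on the filters chosen on the page):

```python
import glob
import os
import time

def wait_for_download(directory, pattern='*.txt', timeout=30):
    """Poll `directory` until a file matching `pattern` appears and no
    partial .crdownload files remain, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        done = glob.glob(os.path.join(directory, pattern))
        partial = glob.glob(os.path.join(directory, '*.crdownload'))
        if done and not partial:
            return done[0]
        time.sleep(0.5)
    raise TimeoutError(f'No file matching {pattern} in {directory}')
```

After the click, you would call wait_for_download(download_dir) before quitting the driver.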

You can do it like so:
import requests

txt_format = 'txt'
xls_format = 'xls'  # open output in binary mode
xml_format = 'xml'  # open output in binary mode

def download(file_type):
    url = f'https://www.ceps.cz/download-data/?format={file_type}'
    response = requests.get(url)
    if file_type == txt_format:
        with open(f'file.{file_type}', 'w') as file:
            file.write(response.text)
    else:
        with open(f'file.{file_type}', 'wb') as file:
            file.write(response.content)

download(txt_format)
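Since the "link created with JS" is just a query string on a fixed endpoint, the URL can also be composed with urllib.parse instead of hard-coded. A sketch, assuming only the format parameter shown above (any additional filter parameters the page sends are not reproduced here):

```python
from urllib.parse import urlencode

BASE = 'https://www.ceps.cz/download-data/'

def build_download_url(file_format):
    """Compose the download URL for a given file format ('txt', 'xls', 'xml')."""
    return BASE + '?' + urlencode({'format': file_format})
```

This keeps the query-string encoding correct if you later add filter parameters with special characters.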

Related

Download PDF as file object without downloading the file with Chrome and Selenium in Python

I am trying to download a PDF with Chromium and Selenium in Python. I am able to download it if I save it as a file, but is it possible to get it as a string, as I would if it were downloaded with requests? E.g. import requests; requests.get(url).content?
My current code is
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.africau.edu/images/default/sample.pdf")
This opens the pdf, but does not download. I can also do
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option(
    "prefs",
    {
        "download.default_directory": "/home/username/Downloads/",  # Change default directory for downloads
        "download.prompt_for_download": False,  # To auto download the file
        "download.directory_upgrade": True,
        "plugins.always_open_pdf_externally": True,  # It will not show PDF directly in chrome
    },
)
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("http://www.africau.edu/images/default/sample.pdf")
This works, but it downloads a PDF-file to my downloads folder, which means I would need to read it to get the PDF in code.
How can I get the pdf without needing to download it as a file first?
I'm not entirely clear on the question, but if you want to read the file content directly, you can do:
import requests

file_url = "http://codex.cs.yale.edu/avi/db-book/db4/slide-dir/ch1-2.pdf"
r = requests.get(file_url, stream=True)

# Process the response in chunks without writing it to disk
for chunk in r.iter_content(chunk_size=1024):
    # do something with each chunk
    pass
OR
You can download the file with requests.get and then write it to disk:
import requests

file_url = "http://codex.cs.yale.edu/avi/db-book/db4/slide-dir/ch1-2.pdf"
r = requests.get(file_url, stream=True)

# Open a file to write to; Python will create it if it doesn't exist
with open("python.pdf", "wb") as pdf:
    for chunk in r.iter_content(chunk_size=1024):
        # writing one chunk at a time to a pdf file
        if chunk:
            pdf.write(chunk)
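Since the question asks for the PDF as a string rather than as a file, the same chunk loop can feed an in-memory buffer instead of open(). A sketch; collect_chunks is a hypothetical helper, not part of requests:

```python
import io

def collect_chunks(chunks):
    """Accumulate an iterable of byte chunks into one bytes object,
    skipping empty keep-alive chunks."""
    buf = io.BytesIO()
    for chunk in chunks:
        if chunk:
            buf.write(chunk)
    return buf.getvalue()

# Usage with requests (network call, shown for illustration only):
# r = requests.get(file_url, stream=True)
# pdf_bytes = collect_chunks(r.iter_content(chunk_size=1024))
```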
From what you described, I believe the following may achieve your goal:
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("http://www.africau.edu/images/default/sample.pdf")
# Select all text and copy it using hotkeys
ActionChains(driver).key_down(Keys.CONTROL).send_keys('a').send_keys('c').key_up(Keys.CONTROL).perform()
Then, your desired text is on the clipboard and you can paste it where you like via
ActionChains(driver).key_down(Keys.CONTROL).send_keys('v').key_up(Keys.CONTROL).perform()
You mention "this opens the pdf, but does not download". However, in order to decode a PDF inside a viewer, it must first have been downloaded 100% to the device's file storage (disk, chip or otherwise), then fully decoded into text and pixels, then displayed as pixels only (the text may sit in the browser's temp store, to be deleted on disconnect).
Most Python-based systems include PDF tools such as poppler-utils, so from an OS shell you can simply run the equivalent of
curl -o "%tmp%\temp.pdf" RemoteURL && pdftotext [options] "%tmp%\temp.pdf" filename.txt
Using the example from the question, it is easier not to view the text on the console but to let it default to a sample.txt file, using
curl -O http://www.africau.edu/images/default/sample.pdf && pdftotext -nopgbrk -enc UTF-8 -layout sample.pdf

Save SVG File from wikipedia as SVG in python

Goal:
save SVG from wikipedia
Requirements:
Needs to be automated
Currently I am using Selenium to get some other information, and I tried to use a Python script like this to extract the SVG, but the extracted SVG file gives an error when rendering.
Edit:
The same error occurs when using requests; maybe it has something to do with the file uploaded to Wikipedia?
Error code: (screenshot of the SVG rendering error)
It renders part of the SVG later: (screenshot of the partially rendered SVG)
How it should look: (map of Oslo, zoomed out; see the Wikipedia file)
Code imageEctractSingle.py:
from selenium import webdriver

DRIVER_PATH = 'chromedriver.exe'
link = 'https://upload.wikimedia.org/wikipedia/commons/7/75/NO_0301_Oslo.svg'

driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(link)
image = driver.page_source

# Creates an SVG file
f = open("kart/oslo.svg", "w")
f.write(image)
f.close()

driver.close()
Original article that I get the file link from, by running through the table in the article.
Any ideas on how to extract this image? I know Chrome has a built-in "save as" function; how can I access that through Selenium? Or does there exist a tool for saving SVG files from Selenium?
Thanks in advance for any help :D
It's not Selenium, but I got it working with requests; you shouldn't need Selenium for something this simple unless you are doing more alongside it:
import requests

def write_text(data: str, path: str):
    with open(path, 'w') as file:
        file.write(data)

url = 'https://upload.wikimedia.org/wikipedia/commons/7/75/NO_0301_Oslo.svg'
svg = requests.get(url).text
write_text(svg, './NO_0301_Oslo.svg')
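One likely reason the Selenium version produced a broken file is that driver.page_source returns the browser's HTML-wrapped DOM rather than the raw SVG bytes. A quick sanity check on whatever text you saved (a sketch; looks_like_svg is a hypothetical helper):

```python
def looks_like_svg(text: str) -> bool:
    """Heuristic: a standalone SVG file starts with an XML declaration
    or an <svg> root element, not an <html> wrapper."""
    head = text.lstrip()[:200].lower()
    return head.startswith('<?xml') or head.startswith('<svg')
```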

Python Selenium: Firefox neverAsk.saveToDisk when downloading from Blob URL

I wish to have Firefox using selenium for Python to download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage.
The problem: I can't get Firefox to download the file without asking where to save it first.
Let me first point out that the URL I'm trying to get the Excel file from, is really a Blob URL:
http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx
Perhaps the Blob is causing my problem? Or, perhaps the problem is in my MIME handling?
from selenium import webdriver
profile_dir = "path/to/ff_profile"
dl_dir = "path/to/dl/folder"
ff_profile = webdriver.FirefoxProfile(profile_dir)
ff_profile.set_preference("browser.download.folderList", 2)
ff_profile.set_preference("browser.download.manager.showWhenStarting", False)
ff_profile.set_preference("browser.download.dir", dl_dir)
ff_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
driver = webdriver.Firefox(ff_profile)
url = "http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs"
driver.get(url)
dl_link = driver.find_element_by_partial_link_text("Master data")
dl_link.click()
The actual mime-type to be used in this case is:
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
How do I know that? Here is what I've done:
opened Firefox manually and navigated to the target site
when downloading the file, checked the checkbox to save these kinds of files automatically
went to Help -> Troubleshooting Information and navigated to the "Profile Folder"
in the profile folder, found and opened mimetypes.rdf
inside mimetypes.rdf, found the record/resource corresponding to the Excel file I'd recently downloaded
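Putting that together, the saveToDisk preference just needs the xlsx MIME type added to the comma-separated list. A sketch of building the list (which extra types to keep is up to you):

```python
# MIME type for .xlsx files, found in mimetypes.rdf as described above
XLSX_MIME = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

mime_types = [
    'text/plain',
    'text/csv',
    'application/vnd.ms-excel',
    'application/octet-stream',
    XLSX_MIME,
]
never_ask = ','.join(mime_types)  # comma-separated, no spaces

# ff_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', never_ask)
```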

Download files using Python 3.4 from Google Patents

I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html
(I am aware that this amounts to a large amount of data.) I would like to save all files for one year in directories [year], so 1976 for all the (weekly) files in 1976. I would like to save them to the directory that my Python script is in.
I've tried using the urllib.request package, but I could only get far enough to retrieve the HTML text, not to "click" on a file to download it.
import urllib.request
url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
savename = 'google_patent_urltext'
urllib.request.urlretrieve(url, savename )
Thank you very much for help.
As I understand it, you are looking for a way to simulate left-clicking on a file link and downloading it automatically. If so, you can use Selenium.
Something like:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\\')  # choose folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'application/octet-stream')

driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015')
filename = driver.find_element_by_xpath('//a[contains(text(),"ipg150106.zip")]')  # use a loop to list all zip files
filename.click()
UPDATED: the 'application/octet-stream' MIME type should be used for the zip files instead of "application/zip". Now it should work :)
The HTML you are downloading is the page of links. You need to parse the HTML to find all the download links. You could use a library like Beautiful Soup to do this.
However, the page is very regularly structured, so you could use a regular expression to get all the download links:
import re
import urllib.request

# read() returns bytes, so decode before matching; use a non-greedy match
html = urllib.request.urlopen(url).read().decode('utf-8')
links = re.findall('<a href="(.*?)">', html)
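Building on that regex, a short sketch that filters the matched hrefs down to just the .zip files, so the non-zip navigation links on the page are dropped:

```python
import re

def zip_links(html: str):
    """Return all hrefs ending in .zip; the non-greedy match ensures
    each anchor yields exactly one href."""
    return [href for href in re.findall(r'<a href="(.*?)"', html)
            if href.endswith('.zip')]
```

From there you could group the links by year (e.g. by the year anchors on the page) and save each group into its own directory.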

How can I download a file on a click event using selenium?

I am working with Python and Selenium. I want to download a file via a click event using Selenium. I wrote the following code:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.close()
I want to download both files from the links named "Export Data" on the given URL. How can I achieve this, given that it works with a click event only?
Find the link using find_element(s)_by_*, then call click method.
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Added profile manipulation code to prevent download dialog.
I'll admit this solution is a little more "hacky" than the Firefox Profile saveToDisk alternative, but it works across both Chrome and Firefox, and doesn't rely on a browser-specific feature which could change at any time. And if nothing else, maybe this will give someone a little different perspective on how to solve future challenges.
Prerequisites: Ensure you have selenium and pyvirtualdisplay installed...
Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay
The Magic
import pyvirtualdisplay
import selenium
import selenium.webdriver
import time
import base64
import json
root_url = 'https://www.google.com'
download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
print('Opening virtual display')
display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
display.start()
print('\tDone')
print('Opening web browser')
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
print('\tDone')
print('Retrieving initial web page')
driver.get(root_url)
print('\tDone')
print('Injecting retrieval code into web page')
driver.execute_script("""
    window.file_contents = null;
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'blob';
    xhr.onload = function() {
        var reader = new FileReader();
        reader.onloadend = function() {
            window.file_contents = reader.result;
        };
        reader.readAsDataURL(xhr.response);
    };
    xhr.open('GET', %(download_url)s);
    xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
    'download_url': json.dumps(download_url),
})
print('Looping until file is retrieved')
downloaded_file = None
while downloaded_file is None:
    # Returns the file retrieved base64 encoded (perfect for downloading binary)
    downloaded_file = driver.execute_script('return (window.file_contents !== null ? window.file_contents.split(\',\')[1] : null);')
    print(downloaded_file)
    if not downloaded_file:
        print('\tNot downloaded, waiting...')
        time.sleep(0.5)
print('\tDone')
print('Writing file to disk')
fp = open('google-logo.png', 'wb')
fp.write(base64.b64decode(downloaded_file))
fp.close()
print('\tDone')
driver.close()  # close web browser, or it'll persist after python exits.
display.popen.kill()  # close virtual display, or it'll persist after python exits.
Explanation
We first load a URL on the domain we're targeting a file download from. This allows us to perform an AJAX request on that domain, without running into cross-site scripting issues.
Next, we inject some JavaScript into the DOM which fires off an AJAX request. Once the AJAX request returns a response, we take the response and load it into a FileReader object. From there we can extract the base64-encoded content of the file by calling readAsDataURL(). We then take the base64-encoded content and append it to window, a globally accessible variable.
Finally, because the AJAX request is asynchronous, we enter a Python while loop waiting for the content to be appended to the window. Once it's appended, we decode the base64 content retrieved from the window and save it to a file.
This solution should work across all modern browsers supported by Selenium, and works whether text or binary, and across all mime types.
Alternate Approach
While I haven't tested this, Selenium does afford you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM and use the binding of that element as the trigger to retrieve the downloaded file.
In Chrome, what I do is download the files by clicking on the links, then open the chrome://downloads page and retrieve the downloaded files list from the shadow DOM like this:
docs = document
    .querySelector('downloads-manager')
    .shadowRoot.querySelector('#downloads-list')
    .getElementsByTagName('downloads-item')
This solution is restricted to Chrome, and the data also contains information like the file path and download date. (Note this snippet is JavaScript, not Python; you would run it through driver.execute_script.)
Here is the full working code. You can use web scraping to enter the username, password and other fields. To get the names of the fields appearing on the webpage, use inspect element. An element (username, password or click button) can be located through its class or name.
from selenium import webdriver

# Using Chrome to access web
options = webdriver.ChromeOptions()
# download.default_directory is a preference, not a command-line argument
prefs = {'download.default_directory': 'C:/Test'}  # Set the download path
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(options=options)

# Open the website
try:
    driver.get('xxxx')  # Your website address
    password_box = driver.find_element_by_name('password')
    password_box.send_keys('xxxx')  # Password
    download_button = driver.find_element_by_class_name('link_w_pass')
    download_button.click()
    driver.quit()
except:
    driver.quit()
    print("Faulty URL")
