I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html
(I am aware that this amounts to a large amount of data.) I would like to save all files for a given year in a directory named [year], so 1976 for all the (weekly) files from 1976, under the directory that my Python script is in.
I've tried using the urllib.request package, but I could only get far enough to fetch the page's HTML text; I couldn't work out how to "click" on a file to download it.
import urllib.request
url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
savename = 'google_patent_urltext'
urllib.request.urlretrieve(url, savename)
Thank you very much for help.
As I understand it, you are looking for a command that simulates left-clicking a file so that it downloads automatically. If so, you can use Selenium.
Something like:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
profile = FirefoxProfile()
profile.set_preference("browser.download.folderList",2)
profile.set_preference("browser.download.manager.showWhenStarting",False)
profile.set_preference("browser.download.dir", 'D:\\') #choose folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",'application/octet-stream')
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015')
filename = driver.find_element_by_xpath('//a[contains(text(),"ipg150106.zip")]') #use loop to list all zip files
filename.click()
UPDATED! The 'application/octet-stream' MIME type should be used instead of "application/zip". Now it should work. :)
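If you want every archive on the page rather than a single named file, a minimal sketch (assuming the download profile above is already configured, and that every download link's href ends in ".zip") would loop over the matching anchors instead:

# Sketch only: click every link whose href contains ".zip"
zip_links = driver.find_elements_by_xpath('//a[contains(@href, ".zip")]')
for link in zip_links:
    link.click()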
The HTML you are downloading is the page of links. You need to parse the HTML to find all the download links. You could use a library like Beautiful Soup to do this.
However, the page is very regularly structured so you could use a regular expression to get all the download links:
import re
html = urllib.request.urlopen(url).read().decode('utf-8')
links = re.findall(r'<a href="(.*?)"', html)
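Building on that, here is a rough sketch of the whole workflow from the question: keep only the .zip links and save each file into a directory named after its year. It assumes the links are absolute URLs and that the four-digit year appears somewhere in each link, which you would need to verify against the actual page:

import os
import re
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
html = urllib.request.urlopen(url).read().decode('utf-8')
zip_links = re.findall(r'<a href="(.*?\.zip)"', html)

for link in zip_links:
    # Assumption: the year (e.g. 1976) can be found in the link itself
    match = re.search(r'(19|20)\d{2}', link)
    year = match.group(0) if match else 'unknown'
    os.makedirs(year, exist_ok=True)
    filename = os.path.join(year, link.rsplit('/', 1)[-1])
    urllib.request.urlretrieve(link, filename)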
Related
I am trying to download this zip file. I used Selenium and requests, but neither of them works and I don't know why.
Thank you for your advice.
from selenium import webdriver
import requests
url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
driver = webdriver.Chrome('drivers\chromedriver.exe')
driver.get(url)
requests.get(url)
requests.get() downloads the response body into memory; you need to write it to a file explicitly using open().
Example:
import requests
url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
filename = 'c:/users/user/downloads/csv.zip'
filebody = requests.get(url)
with open(filename, 'wb') as file:
    file.write(filebody.content)
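For a large zip you may not want to hold the whole body in memory; requests can stream the response and write it to disk in chunks instead, roughly like this:

import requests

url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
filename = 'csv.zip'

# stream=True avoids loading the entire file into memory at once
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open(filename, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)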
First of all, you don't need requests to download a file (in this case, at least). Since I don't know the errors you are getting, I would suggest double-checking the path to your chromedriver.exe; you should also escape the backslash (or use a raw string):
driver = webdriver.Chrome('drivers\\chromedriver.exe')
I tried your code (after entering the location of chromedriver on my computer) and it worked: I was able to download the file.
I am currently trying to figure out how I can take a list of links, have Python run through all of them, and save each one as a PDF. (I'm not a Python expert.)
I found a Python package called "pdfkit" which is quite good, but how do I set it up so that it follows my URL list and saves each PDF under a different name?
import pdfkit
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
pdfkit.from_url('http://google.com', 'MyPDF.pdf', configuration=config)
This is my current code. Let's say I have a list of 10 webpages that I want to save as 10 different PDF files; how do I set that up?
Another issue is that I need to log in to the page in order to scrape the information from the links; how would you implement that?
Best Regards,
Answer for the first question:
import pdfkit
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
url_list = [
    ['http://google.com', 'google.com.pdf'],
    ['http://facebook.com', 'facebook.com.pdf'],
    ['http://yahoo.com', 'yahoo.com.pdf'],
]

for url, filename in url_list:
    pdfkit.from_url(url, filename, configuration=config)
To answer the second question: you can use the requests module's session feature to log in first and then pass the cookies to pdfkit to download the page. See Create PDF of a https webpage which requires login using pdfkit.
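A rough sketch of that approach (the login URL, form field names, and page URL below are placeholders you would need to adapt to the real site):

import pdfkit
import requests

config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")

# Placeholder login request: adjust the URL and field names for the real site
session = requests.Session()
session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

# Hand the session cookies to wkhtmltopdf through pdfkit's repeatable 'cookie' option
cookies = list(session.cookies.items())
pdfkit.from_url('https://example.com/protected-page', 'protected.pdf',
                configuration=config, options={'cookie': cookies})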
import selenium.webdriver
import pdfkit
import time
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
driver = selenium.webdriver.Chrome()
driver.get('https://www.linkedin.com/')
time.sleep(1)
driver.find_element_by_id('login-email').send_keys('username')
driver.find_element_by_id('login-password').send_keys('password')
driver.find_element_by_id('login-submit').click()
time.sleep(2)
driver.save_screenshot('output.png') # only visible part
print(driver.page_source)
pdfkit.from_string(driver.page_source, 'file.pdf')
I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:
import time
from selenium import webdriver
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)
# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)
At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.
content = driver.page_source
with open('webpage.html', 'w') as f:
    f.write(content)
I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box and does not provide a way to click it (as two commenters point out).
Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?
UPDATE - Follow-up question to James' answer
So I ran James' code to generate a page.html (and associated files) and compared it to the HTML file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See the attached screenshot (manually saved page on the left, script-saved page with the extra formatting text shown on the right).
This is especially surprising to me because the raw HTML of the page saved by James' script seems to indicate those fields should still be hidden. See, e.g., the HTML below, which appears the same in both files, yet the text at issue only shows up in the browser when rendering the page saved by James' script:
<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST.
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>
Any idea why this is happening?
As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so instead you could use an external automation library like pyautogui.
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.
Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
Full code, in which I also added conditional waits to improve speed:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)
# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.
Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.
This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.
from selenium import webdriver
import chromedriver_binary
from lxml import html
import requests
import os
driver = webdriver.Chrome()
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
base = 'https://blast.ncbi.nlm.nih.gov/'
driver.get(URL)
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
content = driver.page_source
# write the page content
os.mkdir('page')
with open('page/page.html', 'w') as fp:
    fp.write(content)
# download the referenced files to the same path as in the html
sess = requests.Session()
sess.get(base) # sets cookies
# parse html
h = html.fromstring(content)
# get css/js files loaded in the head
for hr in h.xpath('head//@href'):
    if not hr.startswith('http'):
        local_path = 'page/' + hr
        hr = base + hr
        res = sess.get(hr)
        if not os.path.exists(os.path.dirname(local_path)):
            os.makedirs(os.path.dirname(local_path))
        with open(local_path, 'wb') as fp:
            fp.write(res.content)
# get image/js files from the body. skip anything loaded from outside sources
for src in h.xpath('//@src'):
    if not src or src.startswith('http'):
        continue
    local_path = 'page/' + src
    print(local_path)
    src = base + src
    res = sess.get(src)
    if not os.path.exists(os.path.dirname(local_path)):
        os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
        fp.write(res.content)
You should have a folder called page with a file called page.html in it with the content you are after.
Inspired by FThompson's answer above, I came up with the following tool, which can download the full/complete HTML for a given page URL (see: https://github.com/markfront/SinglePageFullHtml).
UPDATE - following up on Max's suggestion, the steps to use the tool are below:
Clone the project, then run maven to build:
$> git clone https://github.com/markfront/SinglePageFullHtml.git
$> cd ~/git/SinglePageFullHtml
$> mvn clean compile package
Find the generated jar file in target folder: SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar
Run the jar in command line like:
$> java -jar ./target/SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar <page_url>
The result file name will have the prefix "FP", followed by the hash code of the page URL, and the file extension ".html". It will be saved in the system temp folder ("/tmp", which you can get via System.getProperty("java.io.tmpdir")); if it's not there, look in your home directory (System.getProperty("user.home")).
The result file will be a big fat self-contained html file that includes everything (css, javascript, images, etc.) referred to by the original html source.
I'd advise you to try SikuliX, an image-based automation tool for operating any widget in the OS. It supports Python syntax, runs from the command line, and may be the simplest way to solve your problem.
All you need to do is give it a screenshot of the target element and call the SikuliX script from your Python automation script (with os.system("xxxx") or subprocess).
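As a rough illustration (the jar name and script path below are placeholders for whatever you actually have installed), calling a SikuliX script from Python could look like:

import subprocess

# Placeholder paths: point these at your SikuliX jar and your .sikuli script
subprocess.run(['java', '-jar', 'sikulixide.jar', '-r', 'download_file.sikuli'], check=True)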
I am having trouble downloading a txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (when you scroll down you see Download: txt, xls and xml).
My goal is to create a scraper that goes to the linked page, clicks the txt link (for example), and saves the downloaded file.
The main problems that I am not sure how to solve:
The file doesn't have a real link that I can call to download it; the link is generated with JS based on the filters and the file type.
When I use the requests library for Python and call the link with all the headers, it just redirects me to https://www.ceps.cz/en/all-data.
Approaches tried:
Using a scraper such as ParseHub to download the link didn't work as intended, but this scraper was the closest to what I wanted to get.
Used the requests library to connect to the link with the headers that the XHR request uses for downloading the file, but it just redirects me to https://www.ceps.cz/en/all-data.
If you could propose some solution for this task, thank you in advance. :-)
You can download this data to a directory of your choice with Selenium; you just need to specify the directory to which the data will be saved. In what follows below, I'll save the txt data to my desktop:
from selenium import webdriver
download_dir = '/Users/doug/Desktop/'
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory' : download_dir}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://www.ceps.cz/en/all-data')
container = driver.find_element_by_class_name('download-graph-data')
button = container.find_element_by_tag_name('li')
button.click()
You can do it like so:
import requests
txt_format = 'txt'
xls_format = 'xls'  # open in binary mode
xml_format = 'xml'  # open in binary mode

def download(file_type):
    url = f'https://www.ceps.cz/download-data/?format={file_type}'
    response = requests.get(url)
    if file_type == txt_format:
        with open(f'file.{file_type}', 'w') as file:
            file.write(response.text)
    else:
        with open(f'file.{file_type}', 'wb') as file:
            file.write(response.content)

download(txt_format)
I coded a small Python script to download a picture from a website using selenium:
from selenium import webdriver
import urllib.request
class FirefoxTest:
    def firefoxTest(self):
        self.driver = webdriver.Firefox()
        self.driver.get("http://www.sitew.com")
        self.r = self.driver.find_element_by_tag_name('img')
        self.uri = self.r.get_attribute("src")
        self.g = urllib.request.urlopen(self.uri)
        with open("begueradj.png", 'wb') as self.f:
            self.f.write(self.g.read())

if __name__ == '__main__':
    FT = FirefoxTest()
    FT.firefoxTest()
How can I modify my code in order to:
download all the pictures on the webpage?
not name the images I download, and keep their default names instead?
You need to switch to find_elements_by_tag_name(). To keep each image's default name, derive the filename from the last segment of its src URL and pass it to urllib.request.urlretrieve():
images = self.driver.find_elements_by_tag_name('img')
for image in images:
    src = image.get_attribute("src")
    if src:
        # keep the image's default name: the last segment of the URL path
        filename = src.split('/')[-1].split('?')[0]
        urllib.request.urlretrieve(src, filename)
You can use the Ruby gem Nokogiri to open the web page and download the images via their XPath.
require 'open-uri'
require 'nokogiri'
require 'net/http'

# The host and path below are placeholders; open a connection and stream the
# response body to disk in chunks.
Net::HTTP.start('www.example.com') do |http|
  f = File.open('sample.flv', 'wb')
  begin
    http.request_get('/sample.flv') do |resp|
      resp.read_body do |segment|
        f.write(segment)
      end
    end
  ensure
    f.close
  end
end