Download folder in python - python

I am trying to download this zip file. I used Selenium and requests, but neither of them works and I don't know why.
Thank you for your advice.
from selenium import webdriver
import requests
url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
driver = webdriver.Chrome('drivers\chromedriver.exe')
driver.get(url)
requests.get(url)

requests.get() downloads the entity into memory. This needs to be explicitly written to a file using open.
Example:
import requests
url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
filename = 'c:/users/user/downloads/csv.zip'
filebody = requests.get(url)
filebody.raise_for_status()  # stop on an HTTP error instead of saving an error page
with open(filename, 'wb') as f:
    f.write(filebody.content)
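For a large archive it may be wasteful to hold the whole body in memory before writing it. Below is a minimal sketch that streams the response to disk in chunks. It uses only the standard library (urllib.request instead of requests), and the name download_file and the chunk size are my own choices, not something from the question. With requests, the rough equivalent is requests.get(url, stream=True) together with iter_content().

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest, chunk_size=64 * 1024):
    """Stream the resource at `url` into the file `dest` and return its path.

    `download_file` is a hypothetical helper name used for illustration.
    """
    dest = Path(dest)
    # urlopen returns a file-like response; copyfileobj copies it
    # chunk by chunk, so the full body is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out_file:
        shutil.copyfileobj(response, out_file, chunk_size)
    return dest
```

It would be called as download_file('http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip', 'csv.zip'), assuming the server answers a plain GET.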

First of all, you don't need requests to download a file (in this case at least). Since I don't know which errors you are getting, I suggest double-checking the path to your chromedriver.exe. You should also escape the backslashes:
driver = webdriver.Chrome('drivers\\chromedriver.exe')
I tried your code (while entering the location of chromedriver on my computer) and it worked - I was able to download the file.
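To illustrate why escaping matters, here is a small sketch. The file name new_driver.exe is made up for the example. In 'drivers\chromedriver.exe' the sequence \c is not a recognised escape, so Python keeps the backslash (and newer versions warn about it), which is probably why the original path still worked. A name starting with n or t would not be so lucky:

```python
# "new_driver.exe" is a hypothetical file name chosen so that the
# path contains \n, which Python treats as a newline escape.
plain = 'drivers\new_driver.exe'      # \n becomes a newline character
escaped = 'drivers\\new_driver.exe'   # backslash written explicitly as \\
raw = r'drivers\new_driver.exe'       # raw string keeps backslashes as typed

print(repr(plain))     # shows the embedded newline
print(escaped == raw)  # both spell the same path
```

Either the escaped or the raw form is safe; which one to use is a matter of taste.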

Related

How to capture the request url of mp3 file on the audiobook website?

website:
https://www.ting22.com/ting/659-2.html
I'd like to get some audiobooks from the website above. In other words, I want to download the MP3 files of the audiobook from 659-2.html to 659-1724.html.
Using the F12 tools, under [Network]->[Media], I can see the request URL of the MP3 file, but I don't know how to get that URL with a script.
Here are some specs of what I'm using:
System: Windows 7 x64
Python: 3.7.0
Update:
For example, using the F12 tools, I can see that the file's URL is "http://audio.xmcdn.com/group58/M03/8D/07/wKgLc1zNaabhA__WAEJyyPUT5k4509.mp3"
My question is how to get the URL of the MP3 file in code, not how to download the file.
Which library should I use?
Thank you.
UPDATE
Well, that is a bit more complicated, because the requests package won't return the .mp3 source, so you need to use Selenium. Here is a tested solution:
from selenium import webdriver  # pip install selenium
import urllib3
import shutil
import os
# Create the output folder in the working directory if it does not exist yet.
if not os.path.exists(os.getcwd()+'/mp3_folder'):
    os.mkdir(os.getcwd()+'/mp3_folder')
def downloadFile(url=None):
    # Use the last path segment of the URL as the local file name.
    filename = url.split('/')[-1]
    c = urllib3.PoolManager()
    # preload_content=False streams the body instead of reading it all into memory.
    with c.request('GET', url, preload_content=False) as resp, open('mp3_folder/'+filename, 'wb') as out_file:
        shutil.copyfileobj(resp, out_file)
        resp.release_conn()
driver = webdriver.Chrome('chromedriver.exe')  # download chromedriver from here and place it near the script: https://chromedriver.storage.googleapis.com/72.0.3626.7/chromedriver_win32.zip
for i in range(2, 1725):
    try:
        driver.get('https://www.ting22.com/ting/659-%s.html' % i)
        # Selenium 4 removed find_element_by_id; there, use driver.find_element(By.ID, 'mySource')
        # with "from selenium.webdriver.common.by import By".
        src = driver.find_element_by_id('mySource').get_attribute('src')
        downloadFile(src)
        print(src)
    except Exception as exc:
        print(exc)
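One detail in the code above: url.split('/')[-1] would keep a query string if the media URL ever carried one. A slightly more defensive sketch, using only the standard library, is shown below. The name filename_from_url and the fallback name download.bin are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default='download.bin'):
    """Return the last path segment of `url`, ignoring query and fragment.

    `filename_from_url` and `default` are illustrative names, not part
    of the original answer.
    """
    path = urlsplit(url).path               # drops ?query and #fragment
    name = posixpath.basename(unquote(path))
    return name or default                  # e.g. URL ends with "/"


print(filename_from_url(
    'http://audio.xmcdn.com/group58/M03/8D/07/wKgLc1zNaabhA__WAEJyyPUT5k4509.mp3'
))
```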

Getting HTML file from a webpage that is already opened in a browser in python 3

I have been searching for an answer to this, but so far I haven't found quite what I was looking for. I can open a webpage via Python's webbrowser module, but I want to know how to download the HTML of the page that Python has asked the browser (Firefox in this case) to open. Certain webpages have sections that I cannot fully access without a browser extension (MetaMask), because they also require a login from within that extension, which happens automatically if I open Firefox normally or with the webbrowser module. This is why requesting the HTML for a URL directly from Python, with code such as this, doesn't work:
import requests
url = 'https://www.google.com/'
r = requests.get(url)
r.text
from urllib.request import urlopen
with urlopen(url) as f:
    html = f.read()
The only solution I have so far is to open the webpage with the webbrowser module and then use the pyautogui module to make my PC press Ctrl+S (the Firefox hotkey for saving the HTML of the current page) and then press Enter.
import webbrowser
import pyautogui
import time
def get_html():
    url = 'https://example.com/'
    webbrowser.open_new(url)  # Open webpage in default browser (Firefox)
    time.sleep(1.2)
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.press('enter')
get_html()
However, I was wondering if there is a more sophisticated and efficient way that doesn't involve simulated key pressing with pyautogui.
Can you try the following:
import requests
url = 'https://www.google.com/'
r = requests.get(url)
with open('page.html', 'w') as outfile:
    outfile.write(r.text)
If the above solution doesn't work, you can use selenium library to open a browser:
import time
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)
with open('page.html', 'w') as f:
    f.write(driver.page_source)
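A note on saving the page: open('page.html', 'w') uses the platform's default text encoding, which on Windows is often a legacy code page, so page source containing characters outside it may raise UnicodeEncodeError. A small sketch that pins the encoding to UTF-8 is below. The helper name save_html is hypothetical, and html_text stands for whatever r.text or driver.page_source returned.

```python
from pathlib import Path


def save_html(html_text, path='page.html'):
    """Write `html_text` to `path` as UTF-8 and return the path.

    `save_html` is an illustrative helper, not a library function.
    """
    path = Path(path)
    # An explicit encoding avoids depending on the OS default code page.
    path.write_text(html_text, encoding='utf-8')
    return path
```

With Selenium it would be called as save_html(driver.page_source).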

Run URL to download a file with python

I'm working on a program that downloads data from a series of URLs, like this:
https://server/api/getsensordetails.xml?id=sensorID&username=user&password=password
The program goes through a list of about 2,500 IDs and requests the URL for each one. I tried to do it with the following code:
import webbrowser
webbrowser.open(url)
However, this code opens each URL in the browser and asks me to confirm the download. I need it to simply download the files without opening a browser, and certainly without asking for confirmation.
Thanks for everything.
You can use the Requests library.
import requests
print('Beginning file download with requests')
url = 'http://PathToFile.jpg'
r = requests.get(url)
with open('pathOfFileToReceiveDownload.jpg', 'wb') as f:
    f.write(r.content)
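For the list of sensor IDs in the question, the URLs could be built in a loop before fetching each one as shown above. The sketch below assumes the endpoint takes id, username and password as query parameters, as the question's URL suggests. build_sensor_url and the two sample IDs are placeholders I made up.

```python
from urllib.parse import urlencode


def build_sensor_url(base_url, sensor_id, username, password):
    """Return `base_url` with the sensor query string appended.

    Hypothetical helper; the parameter names mirror the question's URL.
    """
    # urlencode escapes characters such as & or spaces in the values.
    query = urlencode({'id': sensor_id, 'username': username, 'password': password})
    return f'{base_url}?{query}'


sensor_ids = ['1001', '1002']  # placeholder IDs, not real sensors
urls = [
    build_sensor_url('https://server/api/getsensordetails.xml', sensor_id, 'user', 'password')
    for sensor_id in sensor_ids
]
print(urls[0])
```

Each entry in urls can then be passed to requests.get() and the response written to its own file, for example one named after the sensor ID.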

Download files using Python 3.4 from Google Patents

I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html
(I am aware that this amounts to a large amount of data.) I would like to save all files for one year in directories [year], so 1976 for all the (weekly) files in 1976. I would like to save them to the directory that my Python script is in.
I've tried using the urllib.request package, but I only got as far as retrieving the HTML text; I could not work out how to "click" on a file to download it.
import urllib.request
url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
savename = 'google_patent_urltext'
urllib.request.urlretrieve(url, savename)
Thank you very much for help.
As I understand it, you are looking for a command that simulates left-clicking on a file and downloads it automatically. If so, you can use Selenium.
Something like:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\\')  # choose folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'application/octet-stream')
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015')
filename = driver.find_element_by_xpath('//a[contains(text(),"ipg150106.zip")]') #use loop to list all zip files
filename.click()
UPDATED! The MIME type 'application/octet-stream' should be used for the zip files instead of "application/zip". Now it should work :)
The HTML you are downloading is the page of links. You need to parse the HTML to find all the download links. You could use a library like Beautiful Soup to do this.
However, the page is very regularly structured, so you could use a regular expression to get all the download links:
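import re
import urllib.request
# read() returns bytes in Python 3, so decode before matching a str pattern
html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
# [^"]* stops at the closing quote; a greedy .* could run on to a later "> on the same line
links = re.findall(r'<a href="([^"]*)">', html)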
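If the markup turns out to be less regular than expected, the standard library's HTML parser is a reasonable middle ground between a regular expression and Beautiful Soup. The sketch below collects only links ending in .zip and resolves relative ones. The class name ZipLinkParser, the sample HTML and the example.com URLs are all made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that end in .zip."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.lower().endswith('.zip'):
            # urljoin leaves absolute URLs alone and resolves relative ones.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical sample standing in for the downloaded page.
sample_html = '''
<h3 id="2015">2015</h3>
<a href="http://example.com/2015/ipg150106.zip">ipg150106.zip</a>
<a href="about.html">About</a>
<a href="files/ipg150113.zip">ipg150113.zip</a>
'''
parser = ZipLinkParser('http://example.com/patents/index.html')
parser.feed(sample_html)
print(parser.links)
```

Each collected link could then be saved with urllib.request.urlretrieve(link, filename) into a per-year directory.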

Python web scraping gives wrong source code

I want to extract some data from Amazon (link in the following code).
Here is my code:
import urllib2
url="http://www.amazon.com/s/ref=sr_nr_n_11?rh=n%3A283155%2Cn%3A%2144258011%2Cn%3A2205237011%2Cp_n_feature_browse-bin%3A2656020011%2Cn%3A173507&bbn=2205237011&sort=titlerank&ie=UTF8&qid=1393984161&rnid=1000"
webpage=urllib2.urlopen(url).read()
doc=open("test.html","w")
doc.write(webpage)
doc.close()
When I open test.html, the content of the page is different from what the website shows in the browser.
The page involves JavaScript execution.
urllib2.urlopen(..).read() simply reads the content served at the URL without running any scripts, so the two differ.
To get the same content, you need to use a library that can handle JavaScript.
For example, the following code uses Selenium:
from selenium import webdriver
url = 'http://www.amazon.com/s/ref=sr_nr_n_11?...161&rnid=1000'
driver = webdriver.Firefox()
driver.get(url)
with open('test.html', 'wb') as f:  # binary mode, because encode() returns bytes
    f.write(driver.page_source.encode('utf-8'))
driver.quit()
To complete falsetru's answer:
Another solution is to use python-ghost, which is based on Qt. It is much heavier to install, so I recommend Selenium too.
Using Firefox opens a browser window when the script runs. To keep it out of your way, use PhantomJS:
apt-get install nodejs # you get npm, the Node Package Manager
npm install -g phantomjs # install globally
[…]
driver = webdriver.PhantomJS()
