Using urllib2 in Python 2.7.4, I can readily download an Excel file:
output_file = 'excel.xls'
url = 'http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
This results in the expected file that I can use as I wish.
However, trying to download just an HTML file gives me an empty file:
output_file = 'webpage.html'
url = 'http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
I had the same results using urllib. There must be something simple I'm missing or don't understand. How do I download an HTML file from a URL? Why doesn't my code work?
If you want to download a file or simply save a webpage, you can use urlretrieve (from the urllib library) instead of read and write:
import urllib
urllib.urlretrieve("http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html","doc.html")
# urllib.urlretrieve(url, filename)
If you need to set a timeout you have to put it at the start of your file:
import socket
socket.setdefaulttimeout(25)
#seconds
I'm also running Python 2.7.4, on OS X 10.9, and the code works well for me.
So I think there may be some other problem preventing it from working. Can you open "http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls" in your browser?
This may not directly answer the question, but if you're working with HTTP and have sufficient privileges to install Python packages, I'd really recommend doing this with requests. There's a related answer here: https://stackoverflow.com/a/13137873/45698
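For completeness, a minimal sketch of that approach, assuming requests is installed (the URL and filename in the commented-out call are the ones from the question):

```python
import requests

def save_page(url, path):
    """Fetch a URL and save the raw bytes to a file."""
    resp = requests.get(url, timeout=25)
    resp.raise_for_status()          # fail loudly instead of silently writing an empty file
    with open(path, "wb") as fh:     # the context manager guarantees the file is flushed and closed
        fh.write(resp.content)
    return len(resp.content)

# save_page("http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html", "webpage.html")
```

The raise_for_status() call is the main advantage here: if the server answers with an error page, you find out immediately instead of ending up with an empty or bogus file.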
I am writing a program to automatically download a Wikipedia page as pdf, using their embedded tool.
I was able to fix a problem where I wasn't able to retrieve the data from a submit button. The new problem is that I can download the file (I used open() and also urllib.request.urlretrieve()), but I'm not able to open it manually afterwards.
It seems like the file is being corrupted while being downloaded (I think it's an encoding failure), which makes the PDF impossible to open (unsupported data type or corrupted file).
This is the code I use:
import requests
import urllib.request

base_url = 'https://de.wikipedia.org/wiki/'

def createURL(base):
    title = 'Rektifikation (Verfahrenstechnik)'
    name = title.replace(" ", "_")
    url = (base + name).replace('(', '%28').replace(')', '%29')
    print(url)
    getPDF(url, title)

def getPDF(url, title):
    r = requests.get(url, allow_redirects=True)
    open('{}.pdf'.format(title), 'wb').write(r.content)
    urllib.request.urlretrieve('https://de.wikipedia.org/w/index.php?title=Spezial:ElectronPdf&page=Rektifikation+%28Verfahrenstechnik%29&action=show-download-screen', '{}_vl.pdf'.format(title))

createURL(base_url)
I hardcoded most of the stuff for debugging reasons. Feel free to help me improve the code, but note that this isn't my main purpose, please.
My question now is: what could I possibly do to stop the file from getting corrupted (i.e. encode it the right way)?
This is the link (click the button) I'm trying to download from.
Note: It's an instant download link with a redirect.
Thanks for your help, if you need any more information just ask me.
EDIT: Opening the PDF via Word lets me see that the data (texts, paragraphs, ...) is available. So the PDF contains the data and the download itself seems to be successful.
The sizes of the downloaded files differ, maybe someone can have a look at this problem too:
open: 58KB
urllib: 18KB
manually: 239KB
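One low-tech way to check whether the smaller downloads are real PDFs, or an HTML interstitial page saved under a .pdf name, is to look at the file's magic bytes: every valid PDF starts with %PDF-. A minimal sketch (the sample byte strings are made up for illustration):

```python
def looks_like_pdf(first_bytes):
    """Check the PDF magic number; every valid PDF file starts with b'%PDF-'."""
    return first_bytes.startswith(b'%PDF-')

# Hypothetical examples: a genuine PDF header vs. an HTML page saved as .pdf
print(looks_like_pdf(b'%PDF-1.5\n%\xe2\xe3\xcf\xd3'))   # True
print(looks_like_pdf(b'<!DOCTYPE html><html>'))         # False
```

If the 58 KB and 18 KB files turn out to start with HTML rather than %PDF-, the script is saving the download *page* (or a redirect target) instead of the PDF itself.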
iTunes has an API that lets me download a file with information about an app.
When I type https://itunes.apple.com/search?term=yelp&country=us&entity=software into a browser, it prompts me to download a file.
Is there a way in Python to download that file and read its contents into a string variable?
Thanks
To download the contents of the file you can simply do:
import requests
string_var = requests.get("https://itunes.apple.com/search?term=yelp&country=us&entity=software").content
Seeing that the response is JSON, you would probably want to add:
import json
resp_dict = json.loads(string_var)
This will give you a dictionary to work with.
These are just small snippets; I used Python 2.7.
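To illustrate what the parsed result looks like, here is a sketch using a hand-made sample string that mimics the shape of the iTunes Search API response (the real response carries many more fields; this trimmed-down sample is hypothetical):

```python
import json

# Hypothetical, trimmed-down sample of an iTunes Search API response body
sample = '{"resultCount": 1, "results": [{"trackName": "Yelp", "price": 0.0}]}'

resp_dict = json.loads(sample)
print(resp_dict["resultCount"])              # 1
print(resp_dict["results"][0]["trackName"])  # Yelp
```

From there you can index into the dictionary like any other: the interesting data lives under the "results" list.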
In Python, by using an HTML parser, is it possible to get the document.lastModified property of a web page? I'm trying to retrieve the date at which the webpage/document was last modified by the owner.
A somewhat related question, "I am downloading a file using Python urllib2. How do I check how large the file size is?", suggests that the following (untested) code should work:
import urllib2
req = urllib2.urlopen("http://example.com/file.zip")
last_modified = req.info().getheader('last-modified')
You might want to add a default value as the second parameter to getheader(), in case the header isn't set. Note that the value is a date string (e.g. "Tue, 15 Nov 1994 12:45:26 GMT"), not a number, so don't wrap it in int().
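Once you have the header value, the standard library can turn it into a datetime object. A sketch in Python 3 naming (the header string below is a made-up example in the standard HTTP date format):

```python
from email.utils import parsedate_to_datetime

# Hypothetical Last-Modified value in the standard HTTP date format
last_modified = "Wed, 21 Oct 2015 07:28:00 GMT"

dt = parsedate_to_datetime(last_modified)
print(dt.year, dt.month, dt.day)   # 2015 10 21
```

Having a real datetime lets you compare against a cached copy's timestamp instead of doing string comparisons.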
You can also look for a last-modified date in the HTML code, most notably in the meta-tags. The htmldate module does just that.
Here is how it could work:
1. Install the package:
pip install -U htmldate   # or pip3/pipenv, your choice
2. Retrieve a web page, parse it and output the date:
from htmldate import find_date
find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
(disclaimer: I'm the author)
I wrote a simple Python script with urllib to retrieve a file. Everything works if the file in the address has an extension, let's say .txt, .exe or .csv. But I got a URL with an ending like this: action=download&user=kff&projectname=15&date=today. As I understand it, it queries the data from an SQL database and dumps it in CSV format in the browser. When I use this link directly in the browser it works perfectly. When I try to do it with Python it gives me a file of 0 KB. Here is the code:
import os
import urllib
pp = os.path.join("path", "fileName")
urllib.urlretrieve ("urlLink", pp)
Should I use wget instead? Any other ideas would be helpful. Thanks
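A common reason a URL works in the browser but yields an empty file from a script is that the server rejects the default Python user agent (or expects cookies from a logged-in session). Before reaching for wget, it may be worth sending a browser-like User-Agent header. A sketch in Python 3 terms (urllib2 became urllib.request there; the host in the URL is a placeholder, only the query string comes from the question):

```python
import urllib.request

url = "http://example.com/export?action=download&user=kff&projectname=15&date=today"

# Build a request that identifies itself as a browser rather than Python-urllib
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))   # Mozilla/5.0

# data = urllib.request.urlopen(req).read()   # then write `data` to the target file
```

If that still returns 0 bytes, the next thing to check is whether the browser sends authentication cookies that the script lacks.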
I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:
import sys
import urllib2_file
import urllib2
URL='http://aquate.us/upload.php'
d = [('uploaded', open(sys.argv[1:]))]
req = urllib2.Request(URL, d)
u = urllib2.urlopen(req)
print u.read()
I've placed this .py file in my My Documents directory and placed a shortcut to it in my Send To folder (the shortcut URL is ).
When I right click a file, choose Send To, and select Aquate (my python), it opens a command prompt for a split second and then closes it. Nothing gets uploaded.
I knew there was probably an error going on so I typed the code into CL python, line by line.
When I ran the u=urllib2.urlopen(req) line, I didn't get an error;
instead, the cursor simply started blinking on a new line beneath that line. I waited a couple of minutes to see if something would happen but it just stayed like that. To get it to stop, I had to press ctrl+break.
What's up with this script?
Thanks in advance!
[Edit]
Forgot to mention -- when I ran the script without the request data (the file) it ran like a charm. Is it a problem with urllib2_file?
[edit 2]:
import MultipartPostHandler, urllib2, cookielib, sys
import win32clipboard as w
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),MultipartPostHandler.MultipartPostHandler)
params = {"uploaded" : open("c:/cfoot.js") }
a=opener.open("http://www.aquate.us/upload.php", params)
text = a.read()
w.OpenClipboard()
w.EmptyClipboard()
w.SetClipboardText(text)
w.CloseClipboard()
That code works like a charm if you run it through the command line.
If you're using Python 2.5 or newer, urllib2_file is both unnecessary and unsupported, so check which version you're using (and perhaps upgrade).
If you're using Python 2.3 or 2.4 (the only versions supported by urllib2_file), try running the sample code and see if you have the same problem. If so, there is likely something wrong either with your Python or urllib2_file installation.
EDIT:
Also, you don't seem to be using either of urllib2_file's two supported formats for POST data, and open(sys.argv[1:]) passes a list where open() expects a single filename; you want sys.argv[1]. Try using one of the following two lines instead:
d = ['uploaded', open(sys.argv[1])]
## --OR-- ##
d = {'uploaded': open(sys.argv[1])}
First, there's a third way to run Python programs.
From cmd.exe, type python myprogram.py. You get a nice log. You don't have to type stuff one line at a time.
Second, check the urllib2 documentation. You'll need to look at urllib, also.
A Request requires a URL and a urlencoded buffer of data:
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
You need to encode your data.
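For illustration, here is roughly what that encoding step looks like (Python 3 moved urlencode into urllib.parse; the field names below are made up):

```python
from urllib.parse import urlencode  # in Python 2 this lives at urllib.urlencode

# A sequence of 2-tuples (a mapping works too) of hypothetical form fields
form = [("uploaded", "cfoot.js"), ("user", "demo")]

body = urlencode(form)
print(body)   # uploaded=cfoot.js&user=demo
```

Note that a plain urlencoded body can only carry the file's *name*, not its contents; an actual file upload needs a multipart/form-data body, which is the job of a multipart handler rather than urlencode.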
If you're still on Python 2.5, what worked for me was to download the code here:
http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html
and save it as MultipartPostHandler.py
then use:
import urllib2, MultipartPostHandler
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
or if you need cookies:
import urllib2, MultipartPostHandler, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})