Establishing the name associated with an HTML link - python

I wish to download the files associated with a set of links in an HTML document.
A link might appear like this:
<a href="d?kjdfer87">
But when I click on it in my browser, I get the following file downloaded:
file2.txt
The following will download the file via python:
import urllib.request
opener = urllib.request.build_opener()
r = opener.open("unknown.txt")
r.read()
but how do I establish that the file was actually called file2.txt?

Check the Content-Disposition header on the response. It can suggest a filename. With urllib.request I believe this would be in r.info()['Content-Disposition'] (under Python 2's urllib2 it was r.info().dict['content-disposition']).

It's actually this simple:
r.info().get_filename()
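A minimal sketch of how that could be wrapped up, assuming Python 3, where the headers object returned by r.info() is an email.message.Message subclass. The helper name pick_filename and the fallback to the last URL path segment are assumptions for illustration, not part of the answers above.

```python
import posixpath
from email.message import Message
from urllib.parse import urlsplit


def pick_filename(headers, url):
    """Return the name suggested by Content-Disposition, else the last URL path segment."""
    # get_filename() returns None when the header or its filename parameter is absent
    suggested = headers.get_filename()
    if suggested:
        return suggested
    return posixpath.basename(urlsplit(url).path)


# Offline illustration: build a headers object by hand instead of fetching a URL
headers = Message()
headers["Content-Disposition"] = 'attachment; filename="file2.txt"'
print(pick_filename(headers, "http://example.com/d?kjdfer87"))
```

With a real response you would pass r.info() (or r.headers) as the first argument. The server-supplied name is only a suggestion, so sanitize it before using it as a local path.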

I'm not sure why you think you need the name. You should request it in exactly the same way as the browser does, i.e. with the value in the href.

The Content-Disposition header in the HTTP response is what specifies that the response should be downloaded with a specific filename.
See:
How to encode the filename parameter of Content-Disposition header in HTTP?

Related

How to retrieve an attached gzipped JSON file with Python {requests,urllib3,mechanize ...}

I have an existing application that uses PyCurl to download gzipped JSON data via a REST type interface. This works well but is too slow for the desired use.
I'm trying to get an equivalent solution going that can use connection pooling. I have a simple example working with requests, but I don't see how to retrieve the attached gzipped JSON file that the returned header says is there.
My current sample code:
#!/usr/bin/python
import requests

headers = {"Authorization": "XXX thisworksIgeta200Response",
           "Content-type": "application/json",
           "Accept": "application/json"}
# stream=True leaves the body unread so it can be pulled from r.raw
r = requests.get("https://longickyGUIDyURL.noname.com", headers=headers, verify=False, stream=True)
# decode_content=True undoes the gzip Content-Encoding while reading
data = r.raw.read(decode_content=True)
print(data)
This produces an HTML page, not the JSON output I want. The relevant returned headers look like this:
'content-disposition': 'attachment; filename="9d5c3c68-0e88-4b2d-88b9-94534b6cb80d"',
'content-encoding': 'gzip',
So: requests or urllib3 (tried this a bit but don't see many examples or much documentation) or something else?
Any guidance or recommendations would be most welcome!
"The Content-Disposition response-header field has been proposed as a means for the origin server to suggest a default filename if the user requests that the content is saved to a file." (RFC 2616)
The filename in the header is no more than a suggestion for what the browser should save it as. There is no other file there. The content you got back is all there is. The content-encoding: gzip header means that the content of the page was gzip-encoded for transit, but the requests module will have decoded that for you.
So, if it's HTML and you were expecting JSON, you probably have the wrong URL.
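To make the point about content-encoding concrete, here is a rough sketch of what "gzip-encoded for transit" means. requests already does this for you when you read r.content or call r.json(), so the function below is only an illustration, and its name decode_json_body is made up for this example.

```python
import gzip
import json


def decode_json_body(raw_body, content_encoding=None):
    """Undo a gzip transfer encoding (if any) and parse the body as JSON."""
    if (content_encoding or "").lower() == "gzip":
        # The gzip layer only wraps the bytes on the wire; it is not a separate file
        raw_body = gzip.decompress(raw_body)
    return json.loads(raw_body.decode("utf-8"))


# Offline illustration: compress a small JSON document and decode it again
wire_bytes = gzip.compress(json.dumps({"status": "ok"}).encode("utf-8"))
print(decode_json_body(wire_bytes, "gzip"))
```

If the decoded body is HTML rather than JSON, json.loads raises a ValueError, which points to the URL (or the request headers) rather than to a missing attachment.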

Python scraping ASP site with embedded PDF viewer

First post here, any help would be greatly appreciated :).
I'm trying to scrape from a website with an embedded pdf viewer. As far as I can tell, there is no way to directly download the PDF file.
The browser displays the pdf as multiple PNG image files. The problem is that the png files aren't directly accessible either. They are rendered from the original pdf and then displayed.
And the URL with the heading stripped out is in the codeblock.
The original URL to the pdf viewer (I'm using the second URL), and the link to render the pdf are included in the code.
My strategy here is to pull the viewstate and eventvalidation using urllib, then use wget to download all files from the site. This method does work without post data (page 1). I am getting the rest of the parameters from Fiddler (a sniffing tool).
But when I use post data to specify the page, I get 405 errors like these when trying to download the image files. However, it downloads the actual html page without a problem, just none of the png files that go along with it. Here is an example of the wget errors.
HTTP request sent, awaiting response... 405 Method Not Allowed
2014-03-27 17:09:38 ERROR 405: Method Not Allowed.
Since I can't access the image file link directly, I thought grabbing the entire page with wget would be my best bet. If anyone knows some better alternatives, please let me know :). The post data seems to work at least partially since the downloaded html file is set to the page I specified in parameters.
According to fiddler, the site automatically does a get request for the image file. I'm not quite sure how to emulate this however.
Any help is appreciated, thanks for your time!
import os
import re
import urllib
import urllib2
# Assumption: bs4 is installed; with BeautifulSoup 3 use "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup

imglink = 'http://201.150.36.178/consultaexpedientes/render/2132495e-863c-4b96-8135-ea7357ff41511.png'
origurl = 'http://201.150.36.178/consultaexpedientes/sistemas/boletines/wfBoletinVisor.aspx?tomo=1&numero=9760&fecha=14/03/2014%2012:40:00'
url = 'http://201.150.36.178/consultaexpedientes/usercontrol/Default.aspx?name=e%3a%5cBoletinesPdf%5c2014%5c3%5cBol_9760.pdf%7c0'
f = urllib2.urlopen(url)
html = f.read()
soup = BeautifulSoup(html)
# Pull the hidden ASP.NET form fields (viewstate first, event validation second)
eventargs = soup.findAll(attrs={'type':'hidden'})
reValue = re.compile(r'value=\"(.*)\"', re.DOTALL)
viewstate = re.findall(reValue, str(eventargs[0]))[0]
validation = re.findall(reValue, str(eventargs[1]))[0]
# Form fields captured with Fiddler; page 6 is requested here
params = urllib.urlencode({'__VIEWSTATE':viewstate,
                           '__EVENTVALIDATION':validation,
                           'PDFViewer1$PageNumberTextBox':6,
                           'PDFViewer1_BookmarkPanelScrollX':0,
                           'PDFViewer1_BookmarkPanelScrollY':0,
                           'PDFViewer1_ImagePanelScrollX' : 0,
                           'PDFViewer1_ImagePanelScrollY' : 0,
                           'PDFViewer1$HiddenPageNumber':6,
                           'PDFViewer1$HiddenAplicaMarcaAgua':0,
                           'PDFViewer1$HiddenBrowserWidth':1920,
                           'PDFViewer1$HiddenBrowserHeight':670,
                           'PDFViewer1$HiddenPageNav':''})
# Hand the POST data to wget and ask it to fetch the page plus its requisites
command = '/usr/bin/wget -E -H -k -K -p --post-data=\"%s' % params + '\" ' + url
print(command)
os.system(command)

Opening Local File Works with urllib but not with urllib2

I'm trying to open a local file using urllib2. How can I go about doing this? When I try the following line with urllib:
resp = urllib.urlopen(url)
it works correctly, but when I switch it to:
resp = urllib2.urlopen(url)
I get:
ValueError: unknown url type: /path/to/file
where that file definitely does exist.
Thanks!
Just put "file://" in front of the path
>>> import urllib2
>>> urllib2.urlopen("file:///etc/debian_version").read()
'wheezy/sid\n'
With urllib.urlopen, if the URL parameter does not have a scheme identifier, it opens a local file, but urllib2 doesn't behave like this.
So urllib2.urlopen can't process a bare path.
It's always good to include the 'file://' scheme identifier in the url parameter for both calls.
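For reference, a small sketch of the same idea in Python 3, where urllib2 became urllib.request. It assumes you start from a filesystem path and lets pathlib build the file:// URL; the helper name read_local_file is only illustrative.

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_file(path):
    """Read a local file through urlopen by turning the path into a file:// URL."""
    # as_uri() needs an absolute path, so resolve it first
    uri = Path(path).resolve().as_uri()
    with urlopen(uri) as resp:
        return resp.read()
```

Calling read_local_file("/etc/debian_version") would then behave like the urllib2 example above and return the file's bytes.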
I had the same issue and actually, I just realized that if you download the source of the page and then open it in Chrome, the browser will show you the exact local path in the URL bar. Good luck!

How to capture redirected url in python

I have created a page on my site, http://shedez.com/test.html. This page redirects users to a jpg on my server.
I want to copy this image to my local drive using a python script. I want the python script to go to the main url first, then get the destination url of the picture,
and then copy the image. As of now the destination url is hardcoded, but in future it will be dynamic, because I will be using geocoding to find the city via ip and then redirect my users to the picture of the day from their city.
== my present script ===
import urllib2, os
req = urllib2.urlopen("http://shedez.com/test.html")
final_link = req.info()
print req.info()
def get_image(remote, local):
    imgData = urllib2.urlopen(final_link).read()
    output = open(local,'wb')
    output.write(imgData)
    output.close()
    return local
fn = os.path.join(self.tmp, 'bells.jpg')
firstimg = get_image(final_link, fn)
It doesn't seem to be header redirection. This is the body of the url -
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<html>\n<head>\n<title>Your Page Title</title>\n<meta http-equiv="REFRESH" content="0;url=http://2.bp.blogspot.com/-hF8PH92aYT0/TnBxwuDdcwI/AAAAAAAAHMo/71umGutZhBY/s1600/Professional%2BBusiness%2BCard%2BDesign%2B1.jpg"></HEAD>\n<BODY>\nOptional page text here.\n</BODY>\n</HTML>
You can easily fetch the content with urllib or requests and parse the HTML with BeautifulSoup or lxml to get the image url from the meta tag.
You seem to be using an HTML http-equiv redirect. To have Python handle the redirect transparently, send an HTTP 302 response (with a Location header) from the server instead. Otherwise, you'll have to parse the HTML and follow the redirect manually, or use something like mechanize.
As the answers mention: either redirect to the image itself, or parse out the url from the html.
Concerning the former, redirecting, if you're using nginx or HAproxy server side you can set the X-Accel-Redirect to the image's uri, and it will be served appropriately. See http://wiki.nginx.org/X-accel for more info.
The urllib2 urlopen function follows 3XX HTTP redirect status codes by default. But in your case you are using an HTML meta-tag redirect, for which you will have to use what Bibhas is proposing.
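The "parse the HTML" route suggested in these answers could look roughly like this. It is a sketch in Python 3 using only the standard library's html.parser rather than BeautifulSoup or lxml; the class and function names are made up for illustration, and it assumes the refresh content has the usual "seconds;url=..." shape.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "0;url=http://host/picture.jpg"
        _, _, rest = (attrs.get("content") or "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html, base_url):
    """Return the absolute meta-refresh URL found in html, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    # Resolve relative targets against the page that contained the tag
    return urljoin(base_url, parser.target)
```

You would fetch the page body first, pass it to find_meta_refresh along with the page's own URL, and then open the returned URL to download the image.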

Download a URL only if it is a HTML Webpage

I want to write a python script which downloads a web page only if it contains HTML. I know that the content-type header will be used. Please suggest some way to do it, as I am unable to find a way to get the headers before the file download.
Use http.client to send a HEAD request to the URL. This will return only the headers for the resource. Then you can look at the content-type header and see if it is text/html. If it is, send a GET request to the URL to get the body.
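A rough sketch of that HEAD-then-check step, assuming Python 3. The function names are illustrative, and this version does not follow redirects or handle servers that reject HEAD, so treat it as a starting point only.

```python
import http.client
from urllib.parse import urlsplit


def is_html(content_type):
    """True if a Content-Type header value names text/html (parameters ignored)."""
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None if absent."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().getheader("Content-Type")
    finally:
        conn.close()
```

The download itself would then be guarded with something like: if is_html(head_content_type(url)), go ahead and issue the GET.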
