how to download in python

how to download in python - python

I am writing a script which will run on my server. Its purpose is to download the document. If any person hit the particular url he/she should be able to download the document. I am using urllib.urlretrieve but it download document on the server side not on the client. How to download in python at client side?

If the script runs on your server, its purpose is to serve a document, not to download it (the latter would be the urllib solution).
Depending on your needs you can:
Set up static file serving with e.g. Apache
Make the script execute on a certain URL (e.g. with mod_wsgi), then the script should set the Content-Type (provides document type such as "text/plain") and Content-Disposition (provides download filename) headers and send the document data
As your question is not more specific, this answer can't be either.

Set the appropriate Content-type header, then send the file contents.

If the document is on your server and your intention is that the user should be able to download this file, couldn't you just serve the url to that resource as a hyperlink in your HTML code. Sorry if I have been obtuse but this seems the most logical step given your explanation.

You might want to take a look at the SocketServer module.

Related

Send PDF file from Django website to LogicalDOC

I'm developing my Django website since about 2 months and I begin to get a good global result with my own functions.
But, now I have to start a very hard part (to my mind) and I need some advices, ideas before to do that.
My Django website creates some PDF files from HTML templates with Django variables. Up to now, I'm saving PDF files directly on my Desktop (in a specific folder) but it's completely unsecured.
So, I installed another web application which is named LogicalDoc in order to save PDF file directly on this application. PDF files are created and sent to LogicalDoc.
LogicalDoc owns 2 API : SOAP and REST (http://wiki.logicaldoc.com/rest/#/) and I know that Django could communicate with REST method.
I'm reading this part of Django documentation too in order to understand How I can process : https://docs.djangoproject.com/en/dev/topics/http/file-uploads/
I made a scheme in order to understand what I'm exposing :
Then, I write a script which makes some things :
When the PDF file is created, I create a folder inside LogicalDoc which takes for example the following name : lastname_firstname_birthday
Two possibilities : If the folder exists,I don't create a new folder, else I create it.
Once it's done, I send the PDF file directly inside the folder by comparing PDF name with folder name to do that
I have some questions about this process :
Firstly, is it possible to make this kind of things ?
Is it hard to do that ?
What kind of advices could you give me ?
Thank you so much !
PS : If you need some part of my script, mainly PDF creating part, I can post it just after my question ;)

An idea is pretty simple, however it always requires some practice.
I strongly advice you to use REST api and forget about SOAP as the only thing it can bring to you - is 'pain' :)
If we check documentation, document/create it gives next information.
Endpoint we have to communicate with.
[protocol]://[server]:[port]/document/create
HTTP method to use - POST
List of parameters to provide with your request: body,
document, content
Even more, you can test API by clicking on "Try it out" button and check requests in "Network" tab of your browser (if you open Developer Tools)
I am not sure what kind of metadata do you have to provide in 'document' parameter but what I know you can easy get an idea of what should be done by testing it and putting XML or JSON data into 'document' parameter.
Content is an array of bytes transferred to the server (which would be your file).
To sum up, a request to 'document/create' uri will be simple
body = { 'headers': {},'object': {},}
document = "<note>data</note>"
content=open('report.xls', 'rb') #r - reading, b - binary
r = requests.post('http://logicaldoc/document/create', body=body, document=document, content=content)
Please keep in mind that file transferring requests take time and sometimes you may get timeout exception. Your code will stop and will be waiting for response, so it may be a good idea to get some practice with asyncio or celery. Just keep in mind those kind of possible issues.

How to download a file pushed to a browser using python?

I want to download a zip file using python.
With this type of url,
http://server.com/file.zip
this is quite simple by using urllib2.urlopen and writing it in a local file.
But in my case I have this type of url:
http://server.com/customer/somedata/download?id=121&m=zip,
the download is launched after a form validation.
It could be useful to precise that in my case I want to deploy it on heroku, so I can't use spynner that is built with C++. This download is launched after a scraping that uses scrapy.
From a browser the download works well, I get a good zip file with its name. Using python I just get html and header data...
Is there any way to get a file from this type of url in python ?

This Site is serving JavaScript which then invokes the download.
You have no choice but to: a) evaluate the JavaScript in a simulated Browser environment or b) parse manually what the JS does, and re-implement that in python. e.g. string extraction of the URL and download key, possibly invoking an AJAX request, and finally download the file
I generally recommend Mechanize for webpage related automation, but it cannot deal with JavaScript either, so I guess you can stick with Scrapy if you want to go for plan b).

When you do the download in the browser, open up the network tab of the developer console and record what HTTP method (probably POST), the POST parameters, the cookie, and everything else that is part of the validation; then use a library to replicate that.

Grabbing a .jsp generated PNG in Python

I am trying to grab a PNG image which is being dynamically generated with JSP in a web service.
I have tried visiting the web page it is contained in and grabbing the image src attribute; but the link leads to a .jsp file. Reading the response with urllib2 just shows a lot of gibberish.
I also need to do this while logged into the web service in question, using mechanize. This seems to exclude the option of grabbing a screenshot with webkit2png or similar.
Thanks for any suggestions.

If you use urllib correctly (for example, making sure your User-Agent resembles a browser etc), the "gibberish" you get back is the actual file, so you just need to write it out to disk (open the file with "wb" for writing in binary mode) and re-read it with some image-manipulation library if you need to play with it. Or you can use urlretrieve to save it directly on the filesystem.
If that's a jsp, chances are that it takes parameters, which might be appended by the browser via javascript before the request is done; you should look at the real request your browser makes, before trying to reproduce it. You can do that with the Chrome Developer Tools, Firefox LiveHTTPHeaders, etc etc.
I do hope you're not trying to break a captcha.

Using python (urllib) to download a file, how to get the real filename?

So I finally managed to get my script to login to a website and download a file... however, in some instances I will have a url like "http://www.test.com/index.php?act=Attach&type=post&id=3345". Firefox finds the filename ok... so I should be able to.
I am unable to find the "Content-Disposition" header via something like remotefile.info()['Content-Disposition']
Also, remotefile.geturl() returns the same url.
What am I missing? How do I get the actual filename? I would prefer using the built-in libraries.

It is the task of the remote server/Service to provide the content-disposition header.
There is nothing you can do unless the remote server/service is under your own control..

Using Python to download a document that's not explicitly referenced in a URL

I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way in general to tell if a URL has a pdf/doc etc. file that it's linking to if it's not doing so explicitly (e.g. www.domain.com/file.pdf)? Is there a way to get Python to snag that file?
Edit:
Thanks for replies, several of which suggest downloading the file to see if it's of the correct type. Only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an html file with an href containing that same url.

There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.

In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies, this tutorial might help)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server is does not "want" to serve the pdf, because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but it would be a strange way of doing that. So my guess is that it is somewhere using a "session cookie", and in the case you haven't got one yet, keeps on trying to redirect.

As has been said there is no way to tell content type from URL. But if you don't mind getting the headers for every URL you can do this:
obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
# we have pdf file, download whole
...
This way you won't have to download each URL just it's headers. It's still not exactly saving network traffic, but you won't get better than that.
Also you should use mime-types instead of my crude find('pdf').

No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what he gives you when you request a certain URL.

Check the mimetype with the urllib.info() function. This might not be 100% accurate, it really depends on what the site returns as a Content-Type header. If it's well behaved it'll return the proper mime type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.

You can't see it from the url directly. You could try to only download the header of the HTTP response and look for the Content-Type header. However, you have to trust the server on this - it could respond with a wrong Content-Type header not matching the data provided in the body.

Detect the file type in Python 3.x and webapp with url to the file which couldn't have an extension or a fake extension. You should install python-magic, using
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import urllib
import magic
from urllib.request import urlopen
url = "http://...url to the file ..."
request = urllib.request.Request(url)
response = urlopen(request)
mime_type = magic.from_buffer(response.read())
print(mime_type)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.