Upload PDF using requests in Python

I am trying to upload a PDF file to an API using requests in Python.
In Postman, for example, I can select "binary" as the body type and upload the file. How can I do this in Python? (payload="")
The Content-Type needs to be application/pdf.
The PDF needs to be downloaded from a URL.
Thanks in advance.
I tried to use PyPDF to read the file from the URL.

First, define the headers that your upload request will send. Then fetch the PDF from the remote URL and keep its content in memory. After that, issue a POST request with those headers to upload the document.
import requests

# Fetch the PDF bytes, then POST them with a PDF content type
headers = {'content-type': 'application/pdf'}
pdf = requests.get('http://mypdfurl').content
requests.post('http://myserver/upload', headers=headers, data=pdf)
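If you also want to fail loudly when either step goes wrong, a minimal variation (using the same placeholder URLs as above) checks both status codes:

import requests

# Placeholder URLs from the answer above
download = requests.get('http://mypdfurl')
download.raise_for_status()  # stop here if the PDF could not be fetched

upload = requests.post('http://myserver/upload',
                       headers={'content-type': 'application/pdf'},
                       data=download.content)
upload.raise_for_status()  # stop here if the API rejected the upload
print(upload.status_code)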

Related

urllib.request.urlretrieve returns corrupt file (How to handle this kind of url?)

I want to download about 1000 PDF files from a web page.
Then I encountered this awkward PDF URL format.
Neither requests.get() nor urllib.request.urlretrieve() works for me.
A usual PDF URL looks like:
https://webpage.com/this_file.pdf
But this url is like :
https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9000001&fileSn=1&wrtFileTy=01
So the URL doesn't contain .pdf, and if you click on it you can download the file, but using Python's urllib you get a corrupt file.
At first I thought it was being redirected to some other URL,
so I used requests.get(url, allow_redirects=True);
the result is the same URL as before.
import urllib.request

filename = './novel/pdf1.pdf'
url = 'https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9031938&fileSn=1&wrtFileTy=01'
urllib.request.urlretrieve(url, filename)

This code downloads a corrupt PDF file.
I solved it using the content field of the response object.
import requests

filename = './novels/pdf1.pdf'
url = . . .
# Fetch the file and write the raw response bytes to disk
response = requests.get(url)
with open(filename, 'wb') as f:
    f.write(response.content)
Referred to this Q&A: Download and save PDF file with Python requests module

Python - Download File Returned by Web Form Submission

* Updated to clarify information from responses *
My IT organization set up a website that lets us submit a set of parameters to a web form and click a "Submit" button; it then generates a .txt file of users provisioned to the specified applications, which (at least with my current Chrome settings) is automatically saved to the download folder.
In order to automate this process and get an updated list of users each week, I've been trying to write a Python script that uses urllib (plus urllib2, requests, etc.) to submit the form and then grab the .txt file that gets downloaded.
When I try running the code below...
import urllib, urllib2

url = 'my url'
values = {'param1': 'response1',
          'param2': 'response2',
          'param3': 'response3'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
data = response.read()
...it doesn't throw any errors, but I don't get any response either. I've checked all the likely paths that the file would download to and can't find anything.
And if I add something like...
with open('response.txt', 'w') as f:
    f.write(data)
...then it just writes the source HTML for the page to the file; it doesn't actually grab the file generated by the query I'm essentially posting through the form.
Any help here would be greatly appreciated!
You haven't saved the response to a file.
with open('response.txt', 'w') as f:
    f.write(data)
That will save a file called response.txt to the directory you have run the script from. If you just want to check the contents of the response you can use:
print(data)
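For reference, a minimal sketch of the same form submission using the requests library (the URL and parameter names are placeholders from the question). The generated file, if the server returns it directly, arrives as the response body and can be written straight to disk:

import requests

# Placeholder URL and form fields from the question
url = 'my url'
values = {'param1': 'response1',
          'param2': 'response2',
          'param3': 'response3'}

# Submit the form; whatever the server returns is the response body
response = requests.post(url, data=values)
response.raise_for_status()

# Save the returned content to a local file
with open('users.txt', 'wb') as f:
    f.write(response.content)

If the body is still the page's HTML, the form probably depends on a session cookie or additional hidden fields; a requests.Session() can carry those across requests.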

how to generate a download link for a file present in amazon S3 using python boto?

How do I generate a download link for a file present in Amazon S3 using Python boto? I tried key.generate_url(), but it opens the .txt file in the browser instead of downloading it.
When creating the URL, you should specify a "content disposition" response header:
headers = {'response-content-disposition': 'attachment; filename="your-filename.txt"'}
url = key.generate_url(expires_in=600, response_headers=headers)
When the URL is used, it will cause S3 to return a Content-Disposition header that will indicate to the browser that it should download the file instead of display it directly.
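A fuller sketch of the same idea with boto 2 (the bucket name and key name here are placeholders):

import boto

# Placeholder bucket and object names
conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')
key = bucket.get_key('your-filename.txt')

# Ask S3 to send a Content-Disposition header when this URL is used,
# so the browser downloads the object instead of rendering it
headers = {'response-content-disposition': 'attachment; filename="your-filename.txt"'}
url = key.generate_url(expires_in=600, response_headers=headers)
print(url)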

How to retrieve an attached gzipped JSON file with Python {requests,urllib4,mechanize ...}

I have an existing application that uses PyCurl to download gzipped JSON data via a REST type interface. This works well but is too slow for the desired use.
I'm trying to get an equivalent solution going that can use connection pooling. I have a simple example working with requests, but I don't see how to retrieve the attached gzipped JSON file that the returned header says is there.
My current sample code:
#!/usr/bin/python
import requests

headers = {"Authorization": "XXX thisworksIgeta200Response",
           "Content-type": "application/json",
           "Accept": "application/json"}

r = requests.get("https://longickyGUIDyURL.noname.com", headers=headers, verify=False, stream=True)
data = r.raw.read(decode_content=True)
print data
This produces an HTML page, not the JSON output I want. The relevant returned headers look like this:
'content-disposition': 'attachment; filename="9d5c3c68-0e88-4b2d-88b9-94534b6cb80d"
'content-encoding': 'gzip',
So: requests or urllib4 (tried this a bit but don't see many examples or much documentation) or something else?
Any guidance or recommendations would be most welcome!
The Content-Disposition response-header field has been proposed as a means for the origin server to suggest a default filename if the user requests that the content is saved to a file (RFC 2616).
The filename in the header is no more than a suggestion for what the browser should save it as. There is no other file there. The content you got back is all there is. The content-encoding: gzip header means that the content of the page was gzip-encoded for transit, but the requests module will have decoded that for you.
So, if it's HTML and you were expecting JSON, you probably have the wrong URL.
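Assuming the correct endpoint does return JSON, a minimal sketch with requests (same placeholder URL and token as the question) is enough; requests transparently undoes the gzip content-encoding, so there is no separate "attached file" to extract:

import requests

# Placeholder URL and token from the question
headers = {"Authorization": "XXX thisworksIgeta200Response",
           "Accept": "application/json"}

r = requests.get("https://longickyGUIDyURL.noname.com", headers=headers, verify=False)
r.raise_for_status()

# requests has already decompressed the gzip-encoded body,
# so it can be parsed as JSON directly
data = r.json()
print(data)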

Download a URL only if it is a HTML Webpage

I want to write a Python script that downloads a web page only if the page is HTML. I know the Content-Type header will be used. Please suggest a way to do it, as I am unable to find a way to get the headers before downloading the file.
Use http.client to send a HEAD request to the URL. This returns only the headers for the resource, so you can look at the Content-Type header and see if it is text/html. If it is, send a GET request to the same URL to fetch the body.
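A minimal sketch of that approach (the URL is a placeholder, and this assumes an HTTPS site with no redirects):

import http.client
from urllib.parse import urlparse

url = 'https://example.com/some/page'  # placeholder URL
parsed = urlparse(url)
conn = http.client.HTTPSConnection(parsed.netloc)

# HEAD returns only the headers, so nothing is downloaded yet
conn.request('HEAD', parsed.path or '/')
head = conn.getresponse()
content_type = head.getheader('Content-Type', '')
head.read()  # drain the (empty) body so the connection can be reused

if 'text/html' in content_type:
    # Only now fetch the full page with a GET request
    conn.request('GET', parsed.path or '/')
    body = conn.getresponse().read()
    with open('page.html', 'wb') as f:
        f.write(body)

conn.close()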
