I would like to download a file over HTTP using urllib3.
I have managed to do this with the following code:
import urllib3

url = 'http://url_to_a_file'
connection_pool = urllib3.PoolManager()
resp = connection_pool.request('GET', url)
f = open(filename, 'wb')
f.write(resp.data)
f.close()
resp.release_conn()
But I was wondering what the proper way of doing this is.
For example, will it work well for big files, and if not, what should I do to make this code more fault tolerant and scalable?
Note: it is important to me to use the urllib3 library (not urllib2, for example) because I want my code to be thread safe.
Your code snippet is close. Two things worth noting:
If you're using resp.data, it will consume the entire response and return the connection (you don't need to call resp.release_conn() manually). This is fine if you're cool with holding the data in memory.
You could use resp.read(amt) which will stream the response, but the connection will need to be returned via resp.release_conn().
This would look something like...
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', url, preload_content=False)

with open(path, 'wb') as out:
    while True:
        data = r.read(chunk_size)
        if not data:
            break
        out.write(data)

r.release_conn()
The documentation might be a bit lacking on this scenario. If anyone is interested in making a pull-request to improve the urllib3 documentation, that would be greatly appreciated. :)
The most correct way to do this is probably to get a file-like object that represents the HTTP response and copy it to a real file using shutil.copyfileobj as below:
import shutil
import urllib3

url = 'http://url_to_a_file'
c = urllib3.PoolManager()

with c.request('GET', url, preload_content=False) as resp, open(filename, 'wb') as out_file:
    shutil.copyfileobj(resp, out_file)

resp.release_conn()  # not 100% sure this is required though
The easiest way with urllib3 is to let shutil manage the chunked copying for you.
import urllib3
import shutil

http = urllib3.PoolManager()

with open(filename, 'wb') as out:
    r = http.request('GET', url, preload_content=False)
    shutil.copyfileobj(r, out)
I want to cycle through pages until the API returns None, but I'm not certain how to achieve this with a cURL API. I also want to combine all results at the end into one file. I achieved this with a noob method of repeating variables, but that is obviously inefficient. I tried looking for existing answers to similar questions but wasn't able to get anything to work for my case.
The API has no headers, by the way, just in case that matters for any reason.
I also tried downloading pycurl, but the pip install appears to be broken and I'm not experienced enough to install it manually from a file; I'm sure this can be achieved with requests anyway.
import requests
import json
url = 'https://API.API.io/?page='
username = 'API key'
password = ''
params1={"page":"1","per_page":"500"}
params2={"page":"2","per_page":"500"}
params3={"page":"3","per_page":"500"}
r1=requests.get(url,params=params1,auth=(username,password))
r2=requests.get(url,params=params2,auth=(username,password))
r3=requests.get(url,params=params3,auth=(username,password))
rj1=r1.json()
rj2=r2.json()
rj3=r3.json()
writeFile = open('file.json','w',encoding='utf-8')
json.dump(
rj1+
rj2+
rj3,
writeFile)
writeFile.close()
You can use a for-loop to get the responses from the various pages. Also, use with open(...) when opening the file for writing:
import json
import requests

url = "https://API.API.io/?page="
username = "API key"
password = ""

params = {"page": 1, "per_page": 500}

all_data = []
for params["page"] in range(1, 4):  # <--- this will get page 1, 2 and 3
    r = requests.get(url, params=params, auth=(username, password))
    all_data.extend(r.json())

with open("file.json", "w", encoding="utf-8") as f_out:
    json.dump(all_data, f_out, indent=4)
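The question also mentions stopping once the API returns nothing. If the number of pages isn't known in advance, a hedged variation is to keep requesting pages until one comes back empty; this sketch assumes the endpoint returns an empty list (or null) once you go past the last page:

import json
import requests

url = "https://API.API.io/?page="
username = "API key"
password = ""

all_data = []
page = 1
while True:
    r = requests.get(url, params={"page": page, "per_page": 500},
                     auth=(username, password))
    chunk = r.json()
    if not chunk:  # assumption: an empty list / None marks the last page
        break
    all_data.extend(chunk)
    page += 1

with open("file.json", "w", encoding="utf-8") as f_out:
    json.dump(all_data, f_out, indent=4)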
In the code below I am able to get each request and save the responses to a file. A 2000 line search took over 12 hours to complete. How can I speed this process up? Would implementing something like asyncio work?
import requests

with open('file.txt', 'r') as f:
    urls = f.readlines()

for url in urls:
    try:
        data = requests.get(url)
    except:
        print(url + " failed")
        continue  # moves on to the next url as nothing to write to file
    with open('file_complete.txt', 'a+') as f:  # change to mode "a+" to append
        f.write(data.text + "\n")
There's a library which I've used for a similar use case. It's called faster-than-requests; you can pass the URLs as a list and let it do the rest.
Depending on the type of response you get from the URLs, you could change the method. Here is an example of saving the response body:
import faster_than_requests as requests
result = requests.get2str2(["https://github.com", "https://facebook.com"], threads = True)
Use a Session so that all your requests are made via a single TCP connection, rather than having to reopen a new connection for each URL.
import requests

with open('file.txt', 'r') as f, \
     open('file_complete.txt', 'a') as out, \
     requests.Session() as s:
    for url in f:
        url = url.strip()  # drop the trailing newline read from the file
        try:
            data = s.get(url)
        except Exception:
            print(f'{url} failed')
            continue
        print(data.text, file=out)
Here, I open file_complete.txt before the loop and leave it open, but the overhead of reopening the file each time is likely small, especially compared to the time it actually takes for get to complete.
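Since the question also asks about asyncio: as a hedged aside, plain thread-based concurrency is often the simplest win for an I/O-bound loop like this. Below is a minimal sketch with concurrent.futures; the fetch helper, the 30-second timeout, and max_workers=10 are illustrative choices, and each thread makes its own connection rather than sharing the Session above.

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # return the body text, or None if the request failed
    try:
        return requests.get(url, timeout=30).text
    except Exception:
        print(f'{url} failed')
        return None

with open('file.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

# map() keeps the results in the same order as the input urls
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))

with open('file_complete.txt', 'a') as out:
    for text in results:
        if text is not None:
            out.write(text + "\n")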
Besides the libraries and multi-threading, another possibility is to make the requests without TLS − that is, using http:// endpoints rather than https://.
This will skip the TLS handshake (a few extra round trips between you and the server) for each of your calls.
Over thousands of calls, the effect can add up.
Of course, you'll be exposing yourself to the possibility that you might be communicating with someone pretending to be the intended server.
You'll also be exposing your traffic, so that everyone along the way can read it, like a postcard. Email has this same security vulnerability btw.
I have this code:
import shutil
import urllib.request
import zipfile

url = "http://wwww.some-file.com/my-file.zip"
file_name = url.split('/')[-1]

with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

with zipfile.ZipFile(file_name) as zf:
    zf.extractall()
When trying the code I receive the following error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
I have been trying to combine solutions from here and here with no luck. Can anyone help me? Thanks
I have used this solution in my project when I needed to download .gz files; maybe it will work for you.
from urllib.request import Request as urllibRequest
from urllib.request import urlopen

request = urllibRequest(url)

with open(file_name, 'wb') as output:
    output.write(urlopen(request).read())
I think the remote server is causing this issue. The server expects certain request headers, and if they are missing it responds with a redirect. As you mentioned, it is described here.
Try adding the user headers.
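A minimal sketch of that suggestion, reusing the code from the question (the 'Mozilla/5.0' User-Agent value is just an illustrative header; the server is not confirmed to require that exact string):

import shutil
import urllib.request
import zipfile

url = "http://wwww.some-file.com/my-file.zip"
file_name = url.split('/')[-1]

# build a Request with an explicit User-Agent so the server does not see a header-less client
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with urllib.request.urlopen(req) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

with zipfile.ZipFile(file_name) as zf:
    zf.extractall()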
https://developers.pipedrive.com/docs/api/v1/#!/Files/post_files
It doesn't show a request example, and I can't send the POST request via Python.
My error is: "No files provided"
Maybe someone has an example for this request?
My code:
import requests

with open('qwerty.csv', 'rb') as f:
    r = requests.post('https://api.pipedrive.com/v1/files',
                      params={'api_token': 'MY_TOKEN'}, files={'file': f})
Try to decouple the operation.
files = {'file': open('qwerty.csv', 'rb')}
r = requests.post('https://api.pipedrive.com/v1/files',
                  params={'api_token': 'MY_TOKEN'}, files=files)
Well, I was dumb.
All you need is to inspect the request in the Chrome developer tools and play with it for some time.
import requests

files = {'file': ('FILE_NAME', open('fgsfds.jpg', 'rb'), 'CONTENT_TYPE')}
r = requests.post('https://api.pipedrive.com/v1/files',
                  params={'api_token': 'TOKEN'},
                  data={'file_type': 'img', 'deal_id': DEAL_ID}, files=files)
Update
Recently used this endpoint (Feb 2021). Turns out the endpoint doesn't accept the 'file_type' parameter anymore.
Is it possible to download a large file in chunks using httplib2? I am downloading files from a Google API, and in order to use the credentials from the Google OAuth2WebServerFlow, I am bound to use httplib2.
At the moment I am doing:
flow = OAuth2WebServerFlow(
    client_id=XXXX,
    client_secret=XXXX,
    scope=XYZ,
    redirect_uri=XYZ
)
credentials = flow.step2_exchange(oauth_code)

http = httplib2.Http()
http = credentials.authorize(http)

resp, content = http.request(url, "GET")
with open(file_name, 'wb') as fw:
    fw.write(content)
But the content variable can be more than 500 MB.
Is there any way of reading the response in chunks?
You could consider streaming_httplib2, a fork of httplib2 with exactly that change in behaviour.
in order to use the credentials from the google OAuth2WebServerFlow, I am bound to use httplib2.
If you need features that aren't available in httplib2, it's worth looking at how much work it would be to get your credential handling working with another HTTP library. It may be a good longer-term investment. (e.g. How to download large file in python with requests.py?.)
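For reference, the requests approach from that link streams the body instead of holding it all in memory. A rough sketch, assuming you can pull an access token out of the credentials object and attach it as a header yourself (the 1 MiB chunk size is arbitrary):

import requests

headers = {'Authorization': 'Bearer ' + credentials.access_token}

# stream=True defers downloading the body until we iterate over it
with requests.get(url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(file_name, 'wb') as fw:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            fw.write(chunk)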
About reading the response in chunks (this works with httplib and should also work with httplib2):
import httplib

conn = httplib.HTTPConnection("google.com")
conn.request("GET", "/")
r1 = conn.getresponse()
try:
    print r1.fp.next()
    print r1.fp.next()
except:
    print "Exception handled!"
Note: next() may raise a StopIteration exception, so you need to handle it.
You can avoid calling next() explicitly like this:
F = open("file.html", "w")
for n in r1.fp:
    F.write(n)
    F.flush()
You can apply oauth2client.client.Credentials to a urllib2 request.
First, obtain the credentials object. In your case, you're using:
credentials = flow.step2_exchange(oauth_code)
Now, use that object to get the auth headers and add them to the urllib2 request:
req = urllib2.Request(url)

auth_headers = {}
credentials.apply(auth_headers)
for k, v in auth_headers.iteritems():
    req.add_header(k, v)

resp = urllib2.urlopen(req)
Now resp is a file-like object that you can use to read the contents of the URL, for example in chunks as sketched below.
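A minimal sketch of copying that response to disk in pieces rather than reading it all at once (file_name and the 1 MiB length are illustrative choices):

import shutil

# resp is the file-like object returned by urllib2.urlopen(req) above;
# copy it to disk in 1 MiB pieces instead of loading ~500 MB into memory
with open(file_name, 'wb') as out_file:
    shutil.copyfileobj(resp, out_file, length=1024 * 1024)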