I'm trying to create a multithreaded downloader using Python. Let's say I have a link to a video of size 100MB and I want to download it using 5 threads, with each thread downloading 20MB simultaneously. For that to happen I have to divide the initial response into 5 parts which represent different parts of the file (like this: 0-20MB, 20-40MB, 40-60MB, 60-80MB, 80-100MB). I searched and found that HTTP Range headers might help.
Here's the sample code:
from urllib.request import urlopen, Request

url = 'some video url'  # placeholder for the actual video URL
header = {'Range': 'bytes=%d-%d' % (5000, 10000)}  # trying to capture all the bytes between the 5000th and 10000th byte
req = Request(url, headers=header)
res = urlopen(req)
r = res.read()
But the above code is reading the whole video instead of the bytes I wanted, and it clearly isn't working. So is there any way to read a specified range of bytes from any part of the video instead of reading from the start? Please try to explain in simple words.
But the above code is reading the whole video instead of the bytes I wanted and it clearly isn't working.
The core problem is that the default request uses the HTTP GET method, which pulls down the entire file all at once.
This can be fixed by adding request.get_method = lambda: 'HEAD'. This uses the HTTP HEAD method to fetch the Content-Length and to verify that range requests are supported.
Here is a working example of chunked requests. Just change the url to your url of interest:
from urllib.request import urlopen, Request
url = 'http://www.jython.org' # This is an example. Use your own url here.
n = 5
request = Request(url)
request.get_method = lambda : 'HEAD'
r = urlopen(request)
# Verify that the server supports Range requests
assert r.headers.get('Accept-Ranges', '') == 'bytes', 'Range requests not supported'
# Compute chunk size using a double negation for ceiling division
total_size = int(r.headers.get('Content-Length'))
chunk_size = -(-total_size // n)
# Showing chunked downloads. This should be run in multiple threads.
chunks = []
for i in range(n):
    start = i * chunk_size
    end = start + chunk_size - 1  # byte ranges are inclusive
    headers = dict(Range='bytes=%d-%d' % (start, end))
    request = Request(url, headers=headers)
    chunk = urlopen(request).read()
    chunks.append(chunk)
The separate requests in the for-loop can be done in parallel using threads or processes. This will give a nice speed-up when run in an environment with multiple physical connections to the internet. But if you only have one physical connection, that is likely to be the bottleneck, so parallel requests won't help as much as expected.
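Here is a minimal sketch of that parallel version using a thread pool (the download_chunk helper and the concurrent.futures usage are my own additions, reusing the n, url and chunk_size values from the example above):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen, Request

def download_chunk(url, start, end):
    # One Range request per chunk; byte ranges are inclusive.
    headers = {'Range': 'bytes=%d-%d' % (start, end)}
    return urlopen(Request(url, headers=headers)).read()

with ThreadPoolExecutor(max_workers=n) as executor:
    futures = [executor.submit(download_chunk, url,
                               i * chunk_size,
                               i * chunk_size + chunk_size - 1)
               for i in range(n)]
    chunks = [f.result() for f in futures]  # results come back in submission order

data = b''.join(chunks)  # reassemble the file in order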
So I am trying to stream the chunks of data returned from the SQL database. The chunks do seem to be streamed; however, when I hit the endpoint, it shows the response only at the very end, when the request is completed, instead of showing the streamed data chunk by chunk. I know there are already questions about this, but adding a mimetype doesn't seem to work for me. Any help is highly appreciated! I have the following code:
from flask import Response, stream_with_context

def generate_chunks():
    result = _get_query_service(repo_url, True).stream_query(qry)
    chunk_counter = 0
    while True:
        chunk = result.fetchmany(5)
        chunk_counter += 1
        if not chunk:
            break
        for value in chunk:
            yield str(value)  # yield each row, not the whole chunk again
return Response(stream_with_context(generate_chunks()), content_type='application/json', status=200)
Actually it was a small thing. The above code works.
But tools like Postman and Insomnia do not support streaming data.
If you want to see your data streamed in action, use curl or Python's requests library.
For curl, you need to add the --no-buffer option to see the streamed data.
curl --no-buffer -v http://localhost:8082/healthy
For Python requests, you need to add stream=True. Example:
r = requests.post('http://localhost:8082/stream_query', json=dc, stream=True)
r.encoding = 'utf-8'
for line in r.iter_content(chunk_size=10):  # prints the streamed data in chunks
    print(line)
I have been working on some code that will grab emergency incident information from a service called PulsePoint. It works with software built into computer-controlled dispatch centers.
This is an app that empowers citizen heroes who are CPR-trained to help before a first responder arrives on scene. I'm merely using it to get other emergency incidents.
I reverse-engineered their app, as they have no documentation on how to make your own requests. Because of this I have knowingly left in the API key and auth info, because it's in plain text in the Android manifest file.
I will definitely make a Python module eventually for interfacing with this service; for now it's just messy.
Anyhow, sorry for that long boring intro.
My real question is: how can I simplify this function so that it looks and runs a bit cleaner, while making a timed request and returning a JSON object that can be accessed through subscripts?
import requests, time, json

def getjsonobject(agency):
    startsecond = time.strftime("%S")
    url = REDACTED
    body = []
    currentagency = requests.get(url=url, verify=False, stream=True,
                                 auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                                 timeout=13)
    for chunk in currentagency.iter_content(1024):
        body.append(chunk)
        if int(startsecond) + 5 < int(time.strftime("%S")):  # Shitty internet proof, with timeout above
            raise Exception("Server sent too much data")
    jsonstringforagency = str(b''.join(body))[2:][:-1]  # Removes the characters that wrap the response body so that the next line doesn't error
    currentagencyjson = json.loads(jsonstringforagency)  # Loads response as decodable JSON
    return currentagencyjson

currentincidents = getjsonobject("lafdw")
for inci in currentincidents["incidents"]["active"]:
    print(inci["FullDisplayAddress"])
Requests handles acquiring the body data, checking for JSON, and parsing the JSON for you automatically, and since you're passing the timeout argument I don't think you need separate timeout handling. Requests also handles constructing the URL for GET requests, so you can put your query information into a dictionary, which is much nicer. Combining those changes and removing unused imports gives you this:
import requests

params = dict(both=1,
              minimal=1,
              apikey=REDACTED)
url = REDACTED

def getjsonobject(agency):
    myParams = dict(params, agency=agency)
    return requests.get(url, verify=False, params=myParams, stream=True,
                        auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                        timeout=13).json()
Which gives the same output for me.
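For example, calling it the same way the question does (the agency id and the field names are taken directly from the question):

currentincidents = getjsonobject("lafdw")
for inci in currentincidents["incidents"]["active"]:
    print(inci["FullDisplayAddress"])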
I am crawling the web using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
The problem is that I may stumble upon a URL that is a download of a really large file, and I am not interested in downloading it.
I found this question - Link - and it suggests using urllib and urlopen. I don't want to contact the server twice.
I want to limit the file size to 25MB.
Is there a way I can do this with urllib3?
If the server supplies a Content-Length header, then you can use that to determine if you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.
To do this, you'll need to make sure that you're not preloading the full response.
from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

# Maximum amount we want to read
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
Using the poster.encode module, this works when I post a whole file to Solr:
f = open(filePath, 'rb')
datagen, headers = multipart_encode({'file': f})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
However, for some files I'd like to send only the first n bytes of the files. I thought this would work:
f = open(filePath, 'rb')
mp = MultipartParam('file', fileobj=f, filesize=f)
datagen, headers = multipart_encode({'file': mp})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
...but I get a timed-out request (and the odd thing is that I then have to restart Apache before requests to my web2py app work again). I get an 'HTTP 400 content missing' error from urlopen() when I leave off the filesize argument. Am I just using MultipartParam incorrectly?
(The point of all this is that I'm using Solr to extract text content and metadata from files. For video and audio files, I'd like to get away with sending just the first 100-300k or so, as presumably the relevant data's all in the file headers.)
The reason you're having trouble is that MIME encoding introduces sentinels in the POST if you don't specify the file size - that means you have to use chunked transfer encoding so that the web server knows when to stop reading the file. But that's the other problem: if you stop sending a MIME-encoded POST to a server mid-stream, it'll just sit there waiting for the block to finish. Chunked transfer encoding and multipart MIME encoding are both dead serious when it comes to message segment sizes.
If you only want to send 100-300k of data, then only read that much; that way every POST you make to the server will terminate at the byte you want and at the byte the web server is expecting.
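As a rough sketch of that (assuming the same poster/urllib2 setup as in the question; the n value, the io.BytesIO buffer and the placeholder filename are my own additions), read just the prefix into memory and give MultipartParam an exact filesize so no chunked transfer encoding is needed:

import io
import urllib2
from poster.encode import multipart_encode, MultipartParam

n = 300 * 1024  # send only the first ~300k (the ballpark mentioned in the question)
with open(filePath, 'rb') as f:
    head = f.read(n)  # read just the prefix of the file

mp = MultipartParam('file', fileobj=io.BytesIO(head), filesize=len(head),
                    filename='truncated.bin')  # placeholder filename
datagen, headers = multipart_encode([mp])
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json',
                          datagen, headers)
extracted = urllib2.urlopen(request).read()

Because filesize matches exactly what the fileobj will yield, the multipart body has a known length and the server never waits for bytes that aren't coming.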
Here is a Python script that loads a URL and captures the response time:
import urllib2
import time
opener = urllib2.build_opener()
request = urllib2.Request('http://example.com')
start = time.time()
resp = opener.open(request)
resp.read()
ttlb = time.time() - start
Since my timer is wrapped around the whole request/response (including read()), this will give me the TTLB (time to last byte).
I would also like to get the TTFB (time to first byte), but am not sure where to start/stop my timing. Is urllib2 granular enough for me to add TTFB timers? If so, where would they go?
You should use pycurl, not urllib2.
Install pycurl:
You can use pip / easy_install, or install it from source.
easy_install pycurl
You may need to be a superuser.
Usage:
import pycurl
import sys
import json

WEB_SITES = sys.argv[1]

def main():
    c = pycurl.Curl()
    c.setopt(pycurl.URL, WEB_SITES)  # set url
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.perform()  # execute the request
    dns_time = c.getinfo(pycurl.NAMELOOKUP_TIME)  # DNS time
    conn_time = c.getinfo(pycurl.CONNECT_TIME)  # TCP/IP 3-way handshake time
    starttransfer_time = c.getinfo(pycurl.STARTTRANSFER_TIME)  # time-to-first-byte time
    total_time = c.getinfo(pycurl.TOTAL_TIME)  # total request time
    c.close()
    data = json.dumps({'dns_time': dns_time,
                       'conn_time': conn_time,
                       'starttransfer_time': starttransfer_time,
                       'total_time': total_time})
    return data

if __name__ == "__main__":
    print main()
Using your current open / read pair there's only one other timing point possible - between the two.
The open() call should be responsible for actually sending the HTTP request, and should (AFAIK) return as soon as that has been sent, ready for your application to actually read the response via read().
Technically it's probably the case that a long server response would make your application block on the call to read(), in which case this isn't TTFB.
However if the amount of data is small then there won't be much difference between TTFB and TTLB anyway. For a large amount of data, just measure how long it takes for read() to return the first smallest possible chunk.
By default, the implementation of HTTP opening in urllib2 has no callbacks when read is performed. The OOTB opener for the HTTP protocol is urllib2.HTTPHandler, which uses httplib.HTTPResponse to do the actual reading via a socket.
In theory, you could write your own subclasses of HTTPResponse and HTTPHandler, and install it as the default opener into urllib2 using install_opener. This would be non-trivial, but not excruciatingly so if you basically copy and paste the current HTTPResponse implementation from the standard library and tweak the begin() method in there to perform some processing or callback when reading from the socket begins.
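A rough sketch of that idea (my own, not from urllib2 itself): rather than copying the whole class, subclass HTTPResponse and record a timestamp as soon as begin() has read the status line and headers off the socket, which is a reasonable TTFB marker. The class names and the TTFB dict are hypothetical, and this wires the handler in with build_opener rather than install_opener:

import time
import httplib
import urllib2

TTFB = {}  # crude place to stash the timestamp

class TimedHTTPResponse(httplib.HTTPResponse):
    def begin(self):
        # begin() is where the status line and headers are first read
        # from the socket, so record the time as soon as it returns.
        httplib.HTTPResponse.begin(self)
        TTFB['time'] = time.time()

class TimedHTTPConnection(httplib.HTTPConnection):
    response_class = TimedHTTPResponse

class TimedHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(TimedHTTPConnection, req)

opener = urllib2.build_opener(TimedHTTPHandler())
start = time.time()
resp = opener.open('http://example.com')
print('TTFB: %.3fs' % (TTFB['time'] - start))
resp.read()
print('TTLB: %.3fs' % (time.time() - start))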
To get a good approximation you have to do read(1) and measure the time.
It works pretty well for me.
The only thing you should keep in mind: Python might load more than one byte on a call to read(1), depending on its internal buffers. But I think most tools will be similarly inaccurate.
import urllib2
import time
opener = urllib2.build_opener()
request = urllib2.Request('http://example.com')
start = time.time()
resp = opener.open(request)
# read one byte
resp.read(1)
ttfb = time.time() - start
# read the rest
resp.read()
ttlb = time.time() - start