Here is a Python script that loads a URL and captures the response time:
import urllib2
import time
opener = urllib2.build_opener()
request = urllib2.Request('http://example.com')
start = time.time()
resp = opener.open(request)
resp.read()
ttlb = time.time() - start
Since my timer is wrapped around the whole request/response (including read()), this will give me the TTLB (time to last byte).
I would also like to get the TTFB (time to first byte), but am not sure where to start/stop my timing. Is urllib2 granular enough for me to add TTFB timers? If so, where would they go?
You should use pycurl, not urllib2.
Install pycurl:
You can use pip / easy_install, or install it from source.
easy_install pyCurl
You may need to be a superuser (e.g. run the command with sudo).
Usage:
import pycurl
import sys
import json

WEB_SITES = sys.argv[1]

def main():
    c = pycurl.Curl()
    c.setopt(pycurl.URL, WEB_SITES)                    # set the URL
    c.setopt(pycurl.FOLLOWLOCATION, 1)                 # follow redirects
    c.setopt(pycurl.WRITEFUNCTION, lambda data: None)  # discard the body so it doesn't pollute stdout
    c.perform()                                        # execute the request

    dns_time = c.getinfo(pycurl.NAMELOOKUP_TIME)               # DNS lookup time
    conn_time = c.getinfo(pycurl.CONNECT_TIME)                 # TCP/IP 3-way handshake time
    starttransfer_time = c.getinfo(pycurl.STARTTRANSFER_TIME)  # time to first byte
    total_time = c.getinfo(pycurl.TOTAL_TIME)                  # total request time
    c.close()

    data = json.dumps({'dns_time': dns_time,
                       'conn_time': conn_time,
                       'starttransfer_time': starttransfer_time,
                       'total_time': total_time})
    return data

if __name__ == "__main__":
    print(main())
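Run it with the target URL as the first command-line argument, for example python measure.py http://example.com (the script filename is just a placeholder); it prints the four timings as a JSON object.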
Using your current open / read pair there's only one other timing point possible - between the two.
The open() call should be responsible for actually sending the HTTP request, and should (AFAIK) return as soon as that has been sent, ready for your application to actually read the response via read().
Technically it's probably the case that a slow server response would leave your application blocking on the call to read(), in which case the time measured at that point isn't a true TTFB.
However, if the amount of data is small, there won't be much difference between TTFB and TTLB anyway. For a large amount of data, just measure how long it takes read() to return its first, smallest possible chunk.
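For illustration, here is a minimal sketch of that extra timing point, a timestamp taken between open() and read(); whether it counts as a true TTFB is subject to the caveats above:

import urllib2
import time

opener = urllib2.build_opener()
request = urllib2.Request('http://example.com')

start = time.time()
resp = opener.open(request)
mid_point = time.time() - start   # the "between the two" timing point
resp.read()
ttlb = time.time() - start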
By default, the urllib2 implementation of HTTP opening has no callbacks when a read is performed. The out-of-the-box opener for the HTTP protocol is urllib2.HTTPHandler, which uses httplib.HTTPResponse to do the actual reading via a socket.
In theory, you could write your own subclasses of HTTPResponse and HTTPHandler and install them as the default opener into urllib2 using install_opener. This would be non-trivial, but not excruciatingly so if you basically copy and paste the current HTTPResponse implementation from the standard library and tweak the begin() method to perform some processing or a callback when reading from the socket begins.
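As a rough sketch of that idea (assuming Python 2's httplib/urllib2; the class names and timestamp attributes below are made up for this example), the hook point is begin(), which runs when the response starts being read from the socket:

import time
import httplib
import urllib2

class TimedHTTPResponse(httplib.HTTPResponse):
    def begin(self):
        self.read_started = time.time()   # reading from the socket begins here
        httplib.HTTPResponse.begin(self)  # parses the status line and headers
        self.headers_done = time.time()

class TimedHTTPConnection(httplib.HTTPConnection):
    response_class = TimedHTTPResponse    # make httplib use the timed response

class TimedHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(TimedHTTPConnection, req)

urllib2.install_opener(urllib2.build_opener(TimedHTTPHandler))

Comparing read_started / headers_done against a timestamp taken just before the request is sent gives an approximation of TTFB; getting at those attributes through the object urllib2 hands back takes some extra plumbing, which is part of why this is non-trivial.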
To get a good approximation you have to do read(1) and measure the time.
It works pretty well for me.
The only thing you should keep in mind: Python might load more than one byte on a call to read(1), depending on its internal buffers, but I think most tools will be similarly inaccurate.
import urllib2
import time
opener = urllib2.build_opener()
request = urllib2.Request('http://example.com')
start = time.time()
resp = opener.open(request)
# read one byte
resp.read(1)
ttfb = time.time() - start
# read the rest
resp.read()
ttlb = time.time() - start
Related
In the code below I am able to make each request and save the responses to a file. A 2,000-line list of URLs took over 12 hours to complete. How can I speed this process up? Would implementing something like asyncio work?
import requests

with open('file.txt', 'r') as f:
    urls = f.readlines()

for url in urls:
    try:
        data = requests.get(url)
    except Exception:
        print(url + " failed")
        continue  # move on to the next url, as there is nothing to write to the file

    with open('file_complete.txt', 'a+') as f:  # mode "a+" appends to the file
        f.write(data.text + "\n")
There's a library I've used for a similar use case. It's called faster-than-requests; you can pass it the URLs as a list and let it do the rest.
Depending on the response type you get from the URL, you can change the method. Here is an example of saving the response body:
import faster_than_requests as requests
result = requests.get2str2(["https://github.com", "https://facebook.com"], threads = True)
Use a Session so that all your requests are made via a single TCP connection, rather than opening a new connection for each URL.
import requests

with open('file.txt', 'r') as f, \
     open('file_complete.txt', 'a') as out, \
     requests.Session() as s:

    for url in f:
        url = url.strip()  # drop the trailing newline
        try:
            data = s.get(url)
        except Exception:
            print(f'{url} failed')
            continue
        print(data.text, file=out)
Here, I open file_complete.txt before the loop and leave it open, but the overhead of reopening the file each time is likely small, especially compared to the time it actually takes for get to complete.
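Since the question also asks about doing this asynchronously, here is a rough sketch (not part of the answer above) that combines a shared Session with a thread pool from concurrent.futures. The worker count and timeout are arbitrary, and note that requests does not formally document Session as thread-safe, although this pattern is widely used for simple GETs:

import concurrent.futures
import requests

def fetch(session, url):
    # Hypothetical helper: returns (url, body) or (url, None) on failure
    try:
        return url, session.get(url, timeout=30).text
    except requests.RequestException:
        return url, None

with open('file.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with requests.Session() as s, \
        concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool, \
        open('file_complete.txt', 'a') as out:
    for url, body in pool.map(lambda u: fetch(s, u), urls):
        if body is None:
            print(url + ' failed')
        else:
            out.write(body + '\n')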
Besides libraries and multi-threading, another possibility is to make the requests without TLS, that is, using http:// endpoints rather than https://.
This will skip the TLS handshake (a few round trips between you and the server) for each of your calls.
Over thousands of calls, the effect can add up.
Of course, you'll be exposing yourself to the possibility that you might be communicating with someone pretending to be the intended server.
You'll also be exposing your traffic, so everyone along the way can read it, like a postcard. (Email has the same weakness, by the way.)
I am crawling the web using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
The problem is that I may stumble upon a URL that is a download of a really large file, and I am not interested in downloading it.
I found this question - Link - and it suggests using urllib and urlopen. I don't want to contact the server twice.
I want to limit the file size to 25MB.
Is there a way I can do this with urllib3?
If the server supplies a Content-Length header, then you can use that to determine if you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.
To do this, you'll need to make sure that you're not preloading the full response.
from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

# Maximum amount we want to read
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
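For the 25MB limit mentioned in the question, max_bytes would simply be set to 25 * 1024 * 1024 (or 25000000, depending on which definition of a megabyte you use).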
I'm trying to create a multithreaded downloader using Python. Let's say I have a link to a video of size 100MB and I want to download it using 5 threads, with each thread downloading 20MB simultaneously. For that to happen I have to divide the initial response into 5 parts representing different parts of the file (like this: 0-20MB, 20-40MB, 40-60MB, 60-80MB, 80-100MB). I searched and found that HTTP Range headers might help.
Here's the sample code
from urllib.request import urlopen,Request
url= some video url
header = {'Range': 'bytes=%d-%d' % (5000, 10000)}  # trying to capture all the bytes between the 5000th and 10000th byte
req=Request(url,headers=header)
res=urlopen(req)
r=res.read()
But the above code is reading the whole video instead of the bytes I wanted, and it clearly isn't working. So is there any way to read a specified range of bytes from any part of the video instead of reading from the start? Please try to explain in simple words.
But the above code is reading the whole video instead of the bytes I
wanted and it clearly isn't working.
The core problem is that the default request uses the HTTP GET method, which pulls down the entire file all at once.
This can be fixed by adding request.get_method = lambda: 'HEAD'. This uses the HTTP HEAD method to fetch the Content-Length and to verify that range requests are supported.
Here is a working example of chunked requests. Just change the url to your url of interest:
from urllib.request import urlopen, Request

url = 'http://www.jython.org'  # This is an example. Use your own url here.
n = 5

request = Request(url)
request.get_method = lambda: 'HEAD'
r = urlopen(request)

# Verify that the server supports Range requests
assert r.headers.get('Accept-Ranges', '') == 'bytes', 'Range requests not supported'

# Compute chunk size using a double negation for ceiling division
total_size = int(r.headers.get('Content-Length'))
chunk_size = -(-total_size // n)

# Showing chunked downloads. This should be run in multiple threads.
chunks = []
for i in range(n):
    start = i * chunk_size
    end = start + chunk_size - 1  # Byte ranges are inclusive
    headers = dict(Range='bytes=%d-%d' % (start, end))
    request = Request(url, headers=headers)
    chunk = urlopen(request).read()
    chunks.append(chunk)
The separate requests in the for-loop can be done in parallel using threads or processes. This will give a nice speed-up when run in an environment with multiple physical connections to the internet. But if you only have one physical connection, that is likely to be the bottleneck, so parallel requests won't help as much as expected.
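As a rough illustration of that parallelization (it is not part of the answer above), the loop can be handed to a thread pool from the standard library; this sketch reuses url, n, total_size and chunk_size from the example:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen, Request

def fetch_chunk(i):
    start = i * chunk_size
    end = min(start + chunk_size - 1, total_size - 1)  # byte ranges are inclusive
    req = Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    return urlopen(req).read()

# Download the n chunks concurrently, preserving their order
with ThreadPoolExecutor(max_workers=n) as pool:
    chunks = list(pool.map(fetch_chunk, range(n)))

data = b''.join(chunks)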
I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request)
The API places a limit of 1 request per second.
I would like to accomplish this on a laptop with a 100MB/sec internet connection.
Currently, I am accomplishing this (synchronously & too slowly) by:
- pre-computing all of the encoded URLs I want to scrape
- using Python 3's requests library to request each URL and save the resulting JSON one by one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
# for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'w') as outfile:
        json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straightforward way to accomplish this asynchronously while still respecting the 1-request-per-second limit? I have read about grequests (no longer maintained?), twisted, asyncio, etc., but do not have enough experience to know whether one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hangup is, how can I do an overall rate-limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grab a page and break the rate limit, then go to sleep simultaneously).
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues

URLS = ["https://baconipsum.com/api/?type=meat",
        "https://baconipsum.com/api/?type=filler",
        "https://baconipsum.com/api/?type=meat-and-filler",
        "https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]
concurrency = 2

def handle_request(response):
    if response.code == 200:
        with open("FOO" + '.txt', "wb") as thisfile:  # fix filenames to avoid overwrite
            thisfile.write(response.body)

@gen.coroutine
def request_and_save_url(url):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
        print('fetched {0}'.format(url))
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])

@gen.coroutine
def main():
    q = queues.Queue()
    tstart = datetime.now()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url(worker_id):
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return
            # print('fetching {0}'.format(current_url))
            print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now() - tstart).seconds))
            fetching.add(current_url)
            yield request_and_save_url(current_url)
            fetched.add(current_url)
        finally:
            q.task_done()

    @gen.coroutine
    def worker(worker_id):
        while True:
            yield fetch_url(worker_id)

    # Fill a queue of URLs to scrape
    for url in URLS:
        q.put(url)

    # Start workers, then wait for the work queue to be empty.
    for ii in range(concurrency):
        worker(ii)
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in {0} seconds, fetched {1} URLs.'.format(
        datetime.now() - tstart, len(fetched)))

if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'wb') as outfile:  # binary mode, since .content is bytes
        outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
Tornado has a very powerful asynchronous HTTP client. Here's some basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado

URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()

def handle_request(response):
    if response.code == 200:
        with open('json_output.txt', 'ab') as outfile:  # append the raw bytes of the body
            outfile.write(response.body)

@tornado.gen.coroutine
def queue_requests():
    results = []
    for url in URLS:
        nxt = tornado.gen.sleep(1)   # 1 request per second
        res = http_client.fetch(url, handle_request)
        results.append(res)
        yield nxt
    yield results                    # wait for all requests to finish
    loop.add_callback(loop.stop)

loop.add_callback(queue_requests)
loop.start()
This is a straightforward approach that may lead to too many connections to the remote server. You may have to address that by using a sliding window while queuing the requests (see the sketch below).
In case of request timeouts or if specific headers are required, feel free to read the docs.
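As a rough sketch of such a sliding window (my own assumption, not code from this answer), tornado.locks.Semaphore (available in Tornado 4.2+) can cap the number of requests in flight while the one-second sleep keeps the launch rate; the limit of 10 is arbitrary, http_client and handle_request are the names from the snippet above, and error handling is omitted:

from tornado import gen, locks

in_flight = locks.Semaphore(10)  # arbitrary cap on concurrent requests

@gen.coroutine
def fetch_limited(url):
    # Wait for a free slot, fetch, then release the slot when the block exits
    with (yield in_flight.acquire()):
        response = yield http_client.fetch(url)
        handle_request(response)

In queue_requests you would then append fetch_limited(url) to results instead of calling http_client.fetch directly.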
What is the best way to request data from a server continuously in Python? I've tried urllib3, but for some reason the Python script stops after a while. I am also trying urllib2 (see the code below), but I notice there's sometimes a huge delay (which did not happen as frequently with urllib3), and the response does not arrive every 0.5 seconds (sometimes it's every 6 seconds). What can I do to solve this?
import socket
import urllib2
import time

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

while True:
    try:
        # this call to urllib2.urlopen now uses the default timeout
        # we have set in the socket module
        req = urllib2.Request('https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week')
        response = urllib2.urlopen(req)
        r = response.read()
        req2 = urllib2.Request('http://market.bitvc.com/futures/ticker_btc_week.js')
        response2 = urllib2.urlopen(req2)
        r2 = response2.read()
    except:
        continue
    print r + str(time.time())
    print r2 + str(time.time())
    time.sleep(0.5)
I think I found the problem: I needed to keep an open HTTP session. That way I get the data more continuously. What's the best way to do this? I did http = requests.Session() and am using requests now.
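For reference, here is a minimal sketch of that polling loop rewritten around requests.Session (an assumption of how the Session would be wired in, not code from the question); the URLs, the 10-second timeout and the 0.5-second sleep are carried over from the original code, and the keep-alive connection is reused between iterations:

import time
import requests

session = requests.Session()  # the underlying TCP/TLS connection is kept alive and reused

while True:
    try:
        r = session.get('https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week', timeout=10)
        r2 = session.get('http://market.bitvc.com/futures/ticker_btc_week.js', timeout=10)
    except requests.RequestException:
        continue
    print(r.text + str(time.time()))
    print(r2.text + str(time.time()))
    time.sleep(0.5)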