I am trying to send HTTP POST requests to a web service in parallel. To be specific, I want to load test the web service under concurrent requests and measure its response time. My plan is to use threads and urllib2: each thread carries out one HTTP request via urllib2.
Here is how I did it:
import sys
import time
import urllib2 as u2
from threading import Thread

def run_job():
    try:
        req = u2.Request(
            url = **the web service url**,
            data = **data send to the web service**,
            headers = **http header web service requires**,
        )
        opener = u2.build_opener()
        u2.install_opener(opener)
        start_time = time.time()
        response = u2.urlopen(req, timeout=60)
        end_time = time.time()
        html = response.read()
        code = response.code
        response.close()
        if code == 200:
            print end_time - start_time
        else:
            print -1
    except Exception, e:
        print -2

if __name__ == "__main__":
    N = 1
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    threads = []
    for i in range(N):  # note: range(1, N) would start only N - 1 threads
        t = Thread(target=run_job, args=())
        threads.append(t)
    [x.start() for x in threads]
    [x.join() for x in threads]
In the meantime, I use Fiddler2 to capture the outgoing requests. Fiddler is a tool for composing and sending HTTP requests, and it can also capture the HTTP requests going through the host.
Looking at Fiddler2, the requests are sent out one by one instead of all together at once, which is what I expected to happen. If what Fiddler shows is correct (the requests really do go out one by one), then to my knowledge there must be some queue the requests are waiting in. Could someone shed some light on what is happening behind the scenes? And if possible, how do I make the requests truly parallel?
Also, I put two timestamps before and after the request. If requests wait in a queue after urllib2.urlopen is executed, then the delta of the two timestamps includes the time spent waiting in that queue. Is it possible to be more precise, that is, to measure the time between the request being sent and the response being received?
Many Thanks,
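A minimal sketch (assuming Python 2 to match the urllib2 code above; the URL and POST body are placeholders) of one way to make the threads fire closer together and to time only the urlopen call: each thread blocks on a shared threading.Event, and the main thread releases them all at once after every thread has started.
import time
import urllib2
from threading import Thread, Event

go = Event()  # every worker blocks on this until the main thread fires it

def run_job(url, data):
    go.wait()  # wait for the simultaneous "go" signal
    start_time = time.time()
    try:
        response = urllib2.urlopen(urllib2.Request(url, data), timeout=60)
        response.read()
        response.close()
        print(time.time() - start_time)
    except Exception:
        print(-2)

if __name__ == "__main__":
    url = "http://example.com/service"  # placeholder
    data = "key=value"                  # placeholder POST body
    threads = [Thread(target=run_job, args=(url, data)) for _ in range(10)]
    for t in threads:
        t.start()
    go.set()  # release all threads at (nearly) the same instant
    for t in threads:
        t.join()
The measured interval still includes DNS resolution and connection setup; isolating just the time between the request leaving and the response arriving would need lower-level instrumentation, e.g. the per-session timers a capture tool like Fiddler reports.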
I have a link, e.g. www.someurl.com/api/getdata?password=..., and when I open it in a web browser it sends a constantly updating document of text. I'd like to make an identical connection in Python, and dump this data to a file live as it's received. I've tried using requests.Session(), but since the stream of data never ends (and dropping it would lose data), the get request also never ends.
import requests
s = requests.Session()
x = s.get("www.someurl.com/api/getdata?password=...") #never terminates
What's the proper way to do this?
I found the answer I was looking for here: Python Requests Stream Data from API
Full implementation:
import requests

url = "www.someurl.com/api/getdata?password=..."
s = requests.Session()
with open('file.txt', 'a') as fp:
    with s.get(url, stream=True) as resp:
        for line in resp.iter_lines(chunk_size=1):
            fp.write(line.decode("utf-8") + "\n")  # iter_lines yields bytes and strips the newline
Note that chunk_size=1 is necessary for the loop to react immediately to each new complete message, rather than waiting for an internal buffer to fill before iterating over the lines. I believe chunk_size=None is meant to do this, but it doesn't work for me.
You can keep making GET requests to the URL:
import requests
import time

url = "www.someurl.com/api/getdata?password=..."
sess = requests.session()
while True:
    req = sess.get(url)
    time.sleep(10)
This will terminate the request after 1 second:
import multiprocessing
import time
import requests

data = None

def get_from_url(x):
    s = requests.Session()
    # note: this runs in a child process, so assigning to the module-level
    # `data` here is not visible in the parent process
    data = s.get("www.someurl.com/api/getdata?password=...")

if __name__ == '__main__':
    while True:
        p = multiprocessing.Process(target=get_from_url, name="get_from_url", args=(1,))
        p.start()
        # Wait 1 second for the get request
        time.sleep(1)
        p.terminate()
        p.join()
        # do something with the data
        print(data)  # or smth else
Good day. The problem I am facing is that I want to check whether my website is up or not. Here is some sample pseudocode:
Check(website.com)
if checking_time > 10 seconds:
    print "No response Received"
else:
    print "Site is up"
I already tried the code below, but it is not working:
try:
    response = urllib.urlopen("http://insurance.contactnumbersph.com").getcode()
    time.sleep(5)
    if response == "" or response == "403":
        print "No response"
    else:
        print "ok"
If the website is not up and running, you will get a connection refused error and no status code is returned at all. So you can catch the error in Python with simple try: and except: blocks:
import requests

URL = 'http://some-url-where-there-is-no-server'

try:
    resp = requests.get(URL)
except Exception as e:
    # handle here
    print(e)  # for example
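If the goal is the 10-second check from the pseudocode in the question, a minimal sketch of one way to do it (the URL is a placeholder) is to pass a timeout to requests.get and treat a timeout or connection error as "no response":
import requests

URL = 'http://website.com'  # placeholder

try:
    resp = requests.get(URL, timeout=10)  # give up after 10 seconds
    if resp.status_code == 200:
        print("Site is up")
    else:
        print("Got status %d" % resp.status_code)
except requests.exceptions.RequestException:
    # covers timeouts, connection errors, etc.
    print("No response received")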
You can also check repeatedly, up to 10 times at one-second intervals: if a request raises an exception, wait a second and try again.
import time
import requests

URL = 'http://some-url'
canCheck = False
counts = 0
gotConnected = False

while counts < 10:
    try:
        resp = requests.get(URL)
        gotConnected = True
        break
    except Exception as e:
        counts += 1
        time.sleep(1)
The result will be available in the gotConnected flag, which you can use later to take the appropriate action.
Note that the timeout passed around by urllib applies to the "wrong thing": each individual network operation (e.g. hostname resolution, socket connection, sending headers, reading a few bytes of the headers, reading a few more bytes of the response) gets this same timeout applied. Hence passing a timeout of 10 seconds could allow a large response to keep arriving for hours.
If you want to stick to built-in Python code, it would be nice to use a thread to do this, but it doesn't seem to be possible to cancel a running thread cleanly. An async library like trio would allow better timeout and cancellation handling, but we can make do by using the multiprocessing module instead:
from urllib.request import Request, urlopen
from multiprocessing import Process
from time import perf_counter

def _http_ping(url):
    req = Request(url, method='HEAD')
    print(f'trying {url!r}')
    start = perf_counter()
    res = urlopen(req)
    secs = perf_counter() - start
    print(f'response {url!r} of {res.status} after {secs*1000:.2f}ms')
    res.close()

def http_ping(url, timeout):
    proc = Process(target=_http_ping, args=(url,))
    try:
        proc.start()
        proc.join(timeout)
        success = not proc.is_alive()
    finally:
        proc.terminate()
        proc.join()
        proc.close()
    return success
You can use https://httpbin.org/ to test this, e.g.:
http_ping('https://httpbin.org/delay/2', 1)
should print out a "trying" message but not a "response" message; you can adjust the delay time and the timeout to explore how this behaves.
Note that this spins up a new process for each request, but as long as you're doing fewer than a thousand pings a second it should be OK.
I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request)
The API places a limit of 1 request per second.
I would like to accomplish this on a laptop with a 100MB/sec internet connection.
Currently, I am accomplishing this (synchronously & too slowly) by:
- pre-computing all of the (encoded) URLs I want to scrape
- using Python 3's requests library to request each URL and save the resulting JSON one by one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
# for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'w') as outfile:
        json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straightforward way to accomplish this asynchronously while respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), twisted, asyncio, etc., but do not have enough experience to know whether one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hang-up is: how can I enforce an overall rate limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers "wake up" and request at the same time. When I tested it, all the workers grab a page and break the rate limit, then go to sleep simultaneously.)
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues

URLS = ["https://baconipsum.com/api/?type=meat",
        "https://baconipsum.com/api/?type=filler",
        "https://baconipsum.com/api/?type=meat-and-filler",
        "https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]
concurrency = 2

def handle_request(response):
    if response.code == 200:
        with open("FOO"+'.txt', "wb") as thisfile:  # fix filenames to avoid overwrite
            thisfile.write(response.body)

@gen.coroutine
def request_and_save_url(url):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
        print('fetched {0}'.format(url))
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])

@gen.coroutine
def main():
    q = queues.Queue()
    tstart = datetime.now()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url(worker_id):
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return
            # print('fetching {0}'.format(current_url))
            print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now()-tstart).seconds))
            fetching.add(current_url)
            yield request_and_save_url(current_url)
            fetched.add(current_url)
        finally:
            q.task_done()

    @gen.coroutine
    def worker(worker_id):
        while True:
            yield fetch_url(worker_id)

    # Fill a queue of URL's to scrape
    list = [q.put(url) for url in URLS]  # this does not make a list...it just puts all the URLS into the Queue

    # Start workers, then wait for the work Queue to be empty.
    for ii in range(concurrency):
        worker(ii)
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in {0} seconds, fetched {1} URLs.'.format(
        datetime.now() - tstart, len(fetched)))

if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
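One way to get the global one-request-per-second limit asked about above is a small shared limiter: a single Tornado lock plus a "last request started" timestamp that every worker consults right before it fetches. Below is a minimal, self-contained sketch of that idea; it assumes Tornado 4.2+ for tornado.locks, the fetch itself is replaced by a gen.sleep stand-in, and the worker function is made up. In the code above, the limiter.wait() call would go immediately before yield request_and_save_url(current_url) inside fetch_url.
import time
from tornado import gen, ioloop, locks

class RateLimiter(object):
    """Allow at most one wait() return per `interval` seconds across all callers."""
    def __init__(self, interval):
        self.interval = interval
        self.lock = locks.Lock()
        self.last = 0.0

    @gen.coroutine
    def wait(self):
        with (yield self.lock.acquire()):
            delay = self.last + self.interval - time.time()
            if delay > 0:
                yield gen.sleep(delay)
            self.last = time.time()

limiter = RateLimiter(1.0)  # one request per second, shared by all workers

@gen.coroutine
def fake_worker(worker_id, n):
    for i in range(n):
        yield limiter.wait()  # this call goes right before the real fetch
        print("worker {0} starts request {1} at {2:.2f}".format(worker_id, i, time.time()))
        yield gen.sleep(2)    # stand-in for the actual fetch

@gen.coroutine
def main():
    yield [fake_worker(w, 3) for w in range(2)]  # concurrency = 2, as in the question

if __name__ == '__main__':
    ioloop.IOLoop.current().run_sync(main)
Because the lock serializes the waiters, no two workers can start a fetch within the same one-second window, regardless of how many of them are idle at once.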
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'wb') as outfile:  # .content is bytes, so open in binary mode
        outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
Tornado has a very powerful asynchronous client. Here's some basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado.gen
import tornado.ioloop

URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()

def handle_request(response):
    if response.code == 200:
        with open('json_output.txt', 'ab') as outfile:  # response.body is bytes, so append in binary mode
            outfile.write(response.body)

@tornado.gen.coroutine
def queue_requests():
    results = []
    for url in URLS:
        nxt = tornado.gen.sleep(1)  # 1 request per second
        res = http_client.fetch(url, handle_request)
        results.append(res)
        yield nxt
    yield results  # wait for all requests to finish
    loop.add_callback(loop.stop)

loop.add_callback(queue_requests)
loop.start()
This is a straightforward approach that may lead to too many simultaneous connections to the remote server. You may have to resolve that using a sliding window while queuing the requests.
If you need request timeouts or specific headers, feel free to read the docs.
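The "sliding window" mentioned above is essentially a cap on the number of in-flight requests. A minimal sketch of one way to bolt it onto the code above (assuming Tornado 4.2+ for tornado.locks; fetch_limited and MAX_IN_FLIGHT are made-up names) is to guard every fetch with a semaphore:
from tornado import gen, locks
from tornado.httpclient import AsyncHTTPClient

MAX_IN_FLIGHT = 20                    # assumed cap, tune to taste
sem = locks.Semaphore(MAX_IN_FLIGHT)  # sliding window over concurrent requests
http_client = AsyncHTTPClient()

@gen.coroutine
def fetch_limited(url):
    yield sem.acquire()               # wait for a free slot in the window
    try:
        response = yield http_client.fetch(url)
        raise gen.Return(response)
    finally:
        sem.release()                 # free the slot so the next request can start
In queue_requests above, res = http_client.fetch(url, handle_request) would then become res = fetch_limited(url) (calling handle_request on the returned response yourself), so at most MAX_IN_FLIGHT requests are outstanding at any moment while the one-second spacing between launches is preserved.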
I wrote a simple server and it runs well.
So I want to write some code that makes many POST requests to my server simultaneously, to simulate a stress test. I use Python.
Suppose the URL of my server is http://myserver.com.
file1.jpg and file2.jpg are the files needed to be uploaded to the server.
Here is my testing code. I use threading and urllib2.
async_posts.py
from Queue import Queue
from threading import Thread
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2, sys, time

num_thread = 4
queue = Queue(2*num_thread)

def make_post(url):
    register_openers()
    data = {"file1": open("path/to/file1.jpg", "rb"), "file2": open("path/to/file2.jpg", "rb")}
    datagen, headers = multipart_encode(data)
    request = urllib2.Request(url, datagen, headers)

    start = time.time()
    res = urllib2.urlopen(request)
    end = time.time()

    return res.code, end - start  # Return the status code and duration of this request.

def daemon():
    while True:
        url = queue.get()
        status, duration = make_post(url)
        print status, duration
        queue.task_done()

for _ in range(num_thread):
    thd = Thread(target = daemon)
    thd.daemon = True
    thd.start()

try:
    urls = ["http://myserver.com"] * num_thread
    for url in urls:
        queue.put(url)
    queue.join()
except KeyboardInterrupt:
    sys.exit(1)
When num_thread is small (e.g. 4), my code runs smoothly. But as I switch num_thread to a slightly larger number, say 10, all the threads break down and keep throwing httplib.BadStatusLine errors.
I don't know why my code goes wrong. Or maybe there is a better way to do this?
As a reference, my server is written in Python using Flask and gunicorn.
Thanks in advance.
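For what it's worth, here is a minimal sketch of an alternative load generator, not a diagnosis of the BadStatusLine error, using Python 3's concurrent.futures and requests instead of poster/urllib2 (the URL, file paths, and counts are placeholders). It reads the files once up front and re-sends the in-memory bytes on every request, which avoids reopening file handles in every worker.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://myserver.com"
NUM_WORKERS = 10
NUM_REQUESTS = 40

# Read the payload once; requests re-sends the in-memory bytes on every call.
with open("path/to/file1.jpg", "rb") as f1, open("path/to/file2.jpg", "rb") as f2:
    FILES = {"file1": ("file1.jpg", f1.read()), "file2": ("file2.jpg", f2.read())}

def make_post(_):
    start = time.time()
    resp = requests.post(URL, files=FILES)
    return resp.status_code, time.time() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        for status, duration in pool.map(make_post, range(NUM_REQUESTS)):
            print(status, duration)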
What is the best way to request data from a server constantly in Python? I've tried urllib3, but for some reason the Python script stops after a while. I am also trying urllib2 (see the code below), but I notice there is sometimes a huge delay (that did not happen as frequently with urllib3) and the responses do not arrive every 0.5 seconds (sometimes it's every 6 seconds). What can I do to solve this?
import socket
import urllib2
import time

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

while True:
    try:
        # this call to urllib2.urlopen now uses the default timeout
        # we have set in the socket module
        req = urllib2.Request('https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week')
        response = urllib2.urlopen(req)
        r = response.read()
        req2 = urllib2.Request('http://market.bitvc.com/futures/ticker_btc_week.js')
        response2 = urllib2.urlopen(req2)
        r2 = response2.read()
    except:
        continue
    print r + str(time.time())
    print r2 + str(time.time())
    time.sleep(0.5)
I think I found the problem: I needed to keep an open HTTP session, so that I get the data more continuously. What's the best way of doing this? For now I created the session with http = requests.Session() and am using requests.
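A minimal sketch of the polling loop rewritten around a persistent requests.Session (same two exchange URLs as above; the 10-second timeout and 0.5-second interval are kept from the original code):
import time
import requests

URLS = [
    'https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week',
    'http://market.bitvc.com/futures/ticker_btc_week.js',
]

session = requests.Session()  # reuses the underlying TCP connections between requests

while True:
    for url in URLS:
        try:
            resp = session.get(url, timeout=10)
            print(resp.text, time.time())
        except requests.exceptions.RequestException:
            # skip this round on timeouts / connection errors and try again
            continue
    time.sleep(0.5)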