Currently I'm trying to scrape a site, but the site doesn't allow more than 100 requests per TCP connection. So I tried to create multiple connections in the pool for my requests. I tried the following code. Shouldn't it create 15 connections?
from urllib3 import HTTPConnectionPool

for i in range(15):
    pool = HTTPConnectionPool('ajax.googleapis.com', maxsize=15)
    for j in range(15):
        resp = pool.request('GET', '/ajax/services/search/web')
    print(pool.num_connections)
pool.num_connections always prints 1.
The issue is that the requests are made synchronously, one after the other.
For this reason the pool always reuses the same connection, with no need to create any others.
If we instead run the code using threads, multiple requests are issued concurrently.
In that case pool.num_connections will be greater than 1:
from concurrent.futures.thread import ThreadPoolExecutor
from urllib3 import HTTPConnectionPool

pool = HTTPConnectionPool('ajax.googleapis.com', maxsize=15)


def send_request(_):
    pool.request('GET', '/ajax/services/search/web')
    print(pool.num_connections)


with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(send_request, range(5))
If you need to close sockets every 100 requests, then you'll need to do that manually. Here's an example which closes all sockets every 5 requests:
import urllib3

urllib3.add_stderr_logger()  # This lets you see when new connections are made

http = urllib3.PoolManager()
url = 'http://ajax.googleapis.com/ajax/services/search/web'

for j in range(15):
    resp = http.request('GET', url)
    if j % 5 == 0:
        # Reset the PoolManager's connections.
        # This might be overkill if you need more granular control per-host.
        http.clear()
You could do something similar using HTTPConnectionPool and doing .close() on it before replacing it with a fresh one. I prefer to use PoolManager when possible (there is generally no downside).
If you'd like to get super granular with connections, you can manually take connections out of an HTTPConnectionPool using pool._get_conn() and calling .close() on them.
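For example, here's a rough sketch of that manual approach. Note that this is my own illustration relying on the private _get_conn()/_put_conn() helpers, so treat it as a sketch rather than official API:

from urllib3 import HTTPConnectionPool

pool = HTTPConnectionPool('ajax.googleapis.com', maxsize=1)

for j in range(15):
    resp = pool.request('GET', '/ajax/services/search/web')
    if j % 5 == 4:
        # Pull the underlying connection out of the pool and close it.
        # The pool notices the dropped connection and opens a fresh one
        # on the next request.
        conn = pool._get_conn()
        conn.close()
        pool._put_conn(conn)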
Related
I want to send a GET request in parallel a specified number of times, say 100. How can I achieve this using JMeter or Python?
I tried the bzm parallel executor, but that didn't work out.
import requests
import threading

totalRequests = 0
numberOfThreads = 10
threads = [0] * numberOfThreads


def worker(thread):
    r = requests.get("url")
    threads[thread] = 0  # free thread


while totalRequests < 100:
    for thread in range(numberOfThreads):
        if threads[thread] == 0:
            threads[thread] = 1  # occupy thread
            t = threading.Thread(target=worker, args=(thread,))
            t.start()
            totalRequests += 1
In JMeter:
Add a Thread Group to your Test Plan and configure the required number of threads (virtual users).
Add an HTTP Request sampler as a child of the Thread Group and specify the protocol, host, port, path and parameters.
If you're not certain how to configure the HTTP Request sampler properly, you can just record the request using your browser and JMeter's HTTP(S) Test Script Recorder or the JMeter Chrome Extension.
For Python, the right choice would be the Locust framework, as I believe you're interested in metrics like response times, latencies and so on. The official website is down at the moment,
so in the meantime you can check https://readthedocs.org/projects/locust/
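As a rough illustration, a minimal Locust file could look like the sketch below. The host and path are placeholders, and the options may differ slightly between Locust versions, so check the docs for the version you install:

from locust import HttpUser, constant, task


class ApiUser(HttpUser):
    # Placeholder target; point this at the host you actually want to test
    host = "http://example.com"
    wait_time = constant(0)

    @task
    def get_endpoint(self):
        # Each simulated user repeatedly issues this GET request
        self.client.get("/")

You would then run something like locust -f locustfile.py --headless -u 100 -r 100 to simulate 100 concurrent users, and Locust reports response times and failure counts for you.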
Using a loop to make multiple requests to various websites, how is it possible to do this with a proxy in urllib3?
The code reads in a tuple of URLs and uses a for loop to connect to each site; however, it currently does not connect past the first URL in the tuple. There is a proxy in place as well.
from urllib3 import ProxyManager

list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']

for i in list:
    http = ProxyManager("PROXY-PROXY")
    http_get = http.request('GET', i, preload_content=False).read().decode()
I have removed the URLs and proxy information from the above code. The first URL in the tuple runs fine, but after that nothing else happens, it just waits. I have tried the clear() method to reset the connection each time through the loop.
Unfortunately urllib3 is synchronous and blocks. You could use it with threads, but that is a hassle and usually leads to more problems. The main approach these days is to use an asynchronous networking library; Twisted and asyncio (perhaps with aiohttp) are the popular packages.
I'll provide an example using the trio framework and asks:
import asks
import trio

asks.init('trio')

path_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']

results = []


async def grabber(path):
    r = await s.get(path)
    results.append(r)


async def main(path_list):
    async with trio.open_nursery() as n:
        for path in path_list:
            # newer trio releases spell this n.start_soon(grabber, path)
            n.spawn(grabber, path)


s = asks.Session()
trio.run(main, path_list)
Using threads is not really that much of a hassle since Python 3.2, when concurrent.futures was added:
from urllib3 import ProxyManager
from concurrent.futures import ThreadPoolExecutor, wait

url_list: list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
thread_pool: ThreadPoolExecutor = ThreadPoolExecutor(max_workers=min(len(url_list), 20))
tasks = []

for url in url_list:
    # bind the current url as a default argument so each task keeps its own copy
    def send_request(this_url: str = url) -> str:
        # could this assignment be removed from the loop?
        # I'd have to read the docs for ProxyManager but probably
        http: ProxyManager = ProxyManager("PROXY-PROXY")
        return http.request('GET', this_url, preload_content=False).read().decode()

    tasks.append(thread_pool.submit(send_request))

wait(tasks)
all_responses: list = [task.result() for task in tasks]
Later Python versions offer an event loop via asyncio. Issues I've had with asyncio are usually related to the portability of libraries (e.g. aiohttp via pydantic), many of which are not pure Python and have external libc dependencies. This can be an issue if you have to support a lot of Docker apps which might use musl libc (Alpine) or glibc (everyone else).
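For completeness, a minimal asyncio/aiohttp sketch of the same fan-out might look like the following. The URLs and the proxy address are placeholders carried over from the question, and plain aiohttp expects an HTTP proxy, so adjust to your setup:

import asyncio

import aiohttp

url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # "http://PROXY-PROXY" stands in for your real proxy address
    async with session.get(url, proxy="http://PROXY-PROXY") as resp:
        return await resp.text()


async def main() -> list:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in url_list))


all_responses = asyncio.run(main())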
I have what I would think is a pretty common use case for Gevent. I need a UDP server that listens for requests, and based on the request submits a POST to an external web service. The external web service essentially only allows one request at a time.
I would like to have an asynchronous UDP server so that data can be immediately retrieved and stored so that I don't miss any requests (this part is easy with the DatagramServer gevent provides). Then I need some way to send requests to the external web service serially, but in such a way that it doesn't ruin the async of the UDP server.
I first tried monkey patching everything, and what I ended up with was a quick solution, but one in which my requests to the external web service were not rate-limited in any way, which resulted in errors.
It seems like what I need is a single non-blocking worker to send requests to the external web service in serial while the UDP server adds tasks to the queue from which the non-blocking worker is working.
What I need is information on running a gevent server with additional greenlets for other tasks (especially with a queue). I've been using the serve_forever function of the DatagramServer and think that I'll need to use the start method instead, but haven't found much information on how it would fit together.
Thanks,
EDIT
The answer worked very well. I've adapted the UDP server example code with the answer from #mguijarr to produce a working example for my use case:
from __future__ import print_function
from gevent.server import DatagramServer
import gevent.queue
import gevent.monkey
import urllib

gevent.monkey.patch_all()

n = 0


def process_request(q):
    while True:
        request = q.get()
        print(request)
        print(urllib.urlopen('https://test.com').read())


class EchoServer(DatagramServer):
    __q = gevent.queue.Queue()
    __request_processing_greenlet = gevent.spawn(process_request, __q)

    def handle(self, data, address):
        print('%s: got %r' % (address[0], data))
        global n
        n += 1
        print(n)
        self.__q.put(n)
        self.socket.sendto('Received %s bytes' % len(data), address)


if __name__ == '__main__':
    print('Receiving datagrams on :9000')
    EchoServer(':9000').serve_forever()
Here is how I would do it:
Write a function taking a "queue" object as argument; this function will continuously process items from the queue. Each item is supposed to be a request for the web service.
This function could be a module-level function, not part of your DatagramServer instance:
def process_requests(q):
    while True:
        request = q.get()
        # do your magic with 'request'
        ...
Then, in your DatagramServer, run this function within a greenlet (like a background task):
self.__q = gevent.queue.Queue()
self.__request_processing_greenlet = gevent.spawn(process_requests, self.__q)
When you receive a UDP request in your DatagramServer instance, push it to the queue:
self.__q.put(request)
This should do what you want. You still call 'serve_forever' on DatagramServer, no problem.
I'm currently testing something with threading and a worker pool; I create 400 threads which download a total of 5000 URLs. The problem is that some of the 400 threads are "freezing": when looking at my processes I see that around 15 threads in every run freeze, and after a while they eventually close one by one.
My question is whether there is a way to have some sort of 'timer' / 'counter' that kills a thread if it isn't finished after x seconds.
# download2.py - Download many URLs using multiple threads.
import os
import urllib2
import workerpool
import datetime
from threading import Timer


class DownloadJob(workerpool.Job):
    "Job for downloading a given URL."
    def __init__(self, url):
        self.url = url  # The url we'll need to download when the job runs

    def run(self):
        try:
            url = urllib2.urlopen(self.url).read()
        except:
            pass


# Initialize a pool, 400 threads in this case
pool = workerpool.WorkerPool(size=400)

# Loop over urls.txt and create a job to download the URL on each line
print datetime.datetime.now()
for url in open("urls.txt"):
    job = DownloadJob(url.strip())
    pool.put(job)

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
print datetime.datetime.now()
The problem is that some of the 400 threads are "freezing"...
That's most likely because of this line...
url = urllib2.urlopen(self.url).read()
By default, Python will wait forever for a remote server to respond, so if one of your URLs points to a server which is ignoring the SYN packet, or is otherwise just really slow, the thread could potentially be blocked forever.
You can use the timeout parameter of urlopen() to set a limit on how long the thread will wait for the remote host to respond...
url = urllib2.urlopen(self.url, timeout=5).read() # Time out after 5 seconds
...or you can set it globally instead with socket.setdefaulttimeout() by putting these lines at the top of your code...
import socket
socket.setdefaulttimeout(5) # Time out after 5 seconds
urlopen accepts a timeout value; that would be the best way to handle it, I think.
But I agree with the commenter that 400 threads is probably way too many.
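Putting both suggestions together, a tweak to the job from the question might look like this sketch (keeping the question's Python 2 style; the 5-second timeout and the pool size of 50 are arbitrary values to adjust for your workload):

import urllib2
import workerpool


class DownloadJob(workerpool.Job):
    "Job for downloading a given URL, giving up after 5 seconds."
    def __init__(self, url):
        self.url = url

    def run(self):
        try:
            # timeout makes urlopen raise instead of blocking forever
            data = urllib2.urlopen(self.url, timeout=5).read()
        except Exception:
            pass


# A smaller pool is usually enough and is gentler on the remote hosts
pool = workerpool.WorkerPool(size=50)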
I have the following code, which creates an HTTPConnectionPool using the Twisted Python framework, and an Agent for HTTP requests:
self.pool = HTTPConnectionPool(reactor, persistent=True)
self.pool.retryAutomatically = False
self.pool.maxPersistentPerHost = 1
self.agent = Agent(reactor, pool=self.pool)
then I create requests to connect to a local server:
d = self.agent.request(
    "GET",
    url,
    Headers({"Host": ["localhost:8333"]}),
    None)
The problem is: the local server sometimes behaves incorrectly when multiple simultaneous requests are made, so I would like to limit the number of simultaneous requests to 1.
The additional requests should be queued until the pending request completes.
I've tried with self.pool.maxPersistentPerHost = 1 but it doesn't work.
Does twisted.web.client.Agent with HTTPConnectionPool support limiting the maximum number of connections per host, or do I have to implement a request FIFO queue myself?
The reason setting maxPersistentPerHost to 1 didn't help is that maxPersistentPerHost controls the maximum number of persistent connections to cache per host. It does not prevent additional connections from being opened to service new requests; it only causes them to be closed immediately after a response is received if the maximum number of cached connections has already been reached.
You can enforce serialization in a number of ways. One way to have a "FIFO queue" is with twisted.internet.defer.DeferredLock. Use it together with Agent like this:
lock = DeferredLock()
d1 = lock.run(agent.request, url, ...)
d2 = lock.run(agent.request, url, ...)
The second request will not run until after the first has completed.
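For illustration, a slightly fuller sketch of that approach (assuming the same local server on port 8333 as in the question; the paths here are placeholders, and the reactor is assumed to be started elsewhere with reactor.run()):

from twisted.internet import reactor
from twisted.internet.defer import DeferredLock
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers

pool = HTTPConnectionPool(reactor, persistent=True)
agent = Agent(reactor, pool=pool)
lock = DeferredLock()


def fetch(url):
    # lock.run() queues the call: each request is issued only after the
    # Deferred returned by the previous agent.request has fired.
    return lock.run(agent.request, b"GET", url,
                    Headers({b"Host": [b"localhost:8333"]}), None)


d1 = fetch(b"http://localhost:8333/first")
d2 = fetch(b"http://localhost:8333/second")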