Using a loop to make requests to several websites, how can this be done through a proxy in urllib3?
The code reads a list of URLs and uses a for loop to connect to each site; however, it currently does not connect past the first URL in the list. There is a proxy in place as well.
list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
for i in list:
    http = ProxyManager("PROXY-PROXY")
    http_get = http.request('GET', i, preload_content=False).read().decode()
I have removed the URLs and proxy information from the code above. The first URL in the list runs fine, but after that nothing else happens; the script just waits. I have tried the clear() method to reset the connection on each pass through the loop.
Unfortunately, urllib3 is synchronous and blocks. You could use it with threads, but that is a hassle and usually leads to more problems. The main approach these days is to use an asynchronous networking library. Twisted and asyncio (with aiohttp, perhaps) are the popular choices.
I'll provide an example using the trio framework and asks:
import asks
import trio

asks.init('trio')

path_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
results = []

async def grabber(path):
    r = await s.get(path)
    results.append(r)

async def main(path_list):
    async with trio.open_nursery() as n:
        for path in path_list:
            # start_soon takes the function and its arguments separately
            n.start_soon(grabber, path)

s = asks.Session()
trio.run(main, path_list)
Using threads is not really that much of a hassle since Python 3.2, when concurrent.futures was added:
from urllib3 import ProxyManager
from concurrent.futures import ThreadPoolExecutor, wait

url_list: list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
thread_pool: ThreadPoolExecutor = ThreadPoolExecutor(max_workers=min(len(url_list), 20))

def send_request(this_url: str) -> str:
    # the URL is passed in as an argument so each task keeps its own value;
    # reading the loop variable inside the function would bind it late
    # could this assignment be shared between tasks?
    # I'd have to read the docs for ProxyManager but probably
    http: ProxyManager = ProxyManager("PROXY-PROXY")
    return http.request('GET', this_url, preload_content=False).read().decode()

tasks = [thread_pool.submit(send_request, url) for url in url_list]
wait(tasks)
all_responses: list = [task.result() for task in tasks]
Later versions offer an event loop via asyncio. The issues I've had with asyncio are usually related to the portability of libraries (e.g. aiohttp via pydantic), most of which are not pure Python and have external libc dependencies. This can be an issue if you have to support a lot of Docker apps, which might use musl libc (Alpine) or glibc (everyone else).
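If the portability of async HTTP libraries is a concern, one option is to keep the blocking client and drive it from the asyncio event loop through the default thread pool. The following is only a minimal sketch using the standard library: blocking_fetch is a hypothetical stand-in for the urllib3 call (it sleeps instead of touching the network), and the URLs are placeholders.

```python
import asyncio
import time

def blocking_fetch(url: str) -> str:
    # Hypothetical stand-in for a blocking call such as
    # ProxyManager(...).request('GET', url); it sleeps to mimic latency.
    time.sleep(0.2)
    return f"body of {url}"

async def fetch_all(urls):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    futures = [loop.run_in_executor(None, blocking_fetch, u) for u in urls]
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*futures)

urls = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
bodies = asyncio.run(fetch_all(urls))
```

The blocking calls still occupy one thread each, so this is the thread pool approach with an asyncio front end rather than true non-blocking I/O.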
I'd like to be able to trigger a long-running python script via a web request, in bare-bones fashion. Also, I'd like to be able to trigger other copies of the script with different parameters while initial copies are still running.
I've looked at flask, aiohttp, and queueing possibilities. Flask and aiohttp seem to have the least overhead to set up. I plan on executing the existing python script via subprocess.run (however, I did consider refactoring the script into libraries that could be used in the web response function).
With aiohttp, I'm trying something like:
ingestion_service.py:
import time
from aiohttp import web
from pprint import pprint

routes = web.RouteTableDef()

@routes.get("/ingest_pipeline")
async def test_ingest_pipeline(request):
    '''
    Get the job_conf specified from the request and activate the script
    '''
    # subprocess.run the command with lookup of job conf file
    response = web.Response(text="Received data ingestion request")
    await response.prepare(request)
    await response.write_eof()
    # eventually this would be subprocess.run call
    time.sleep(80)
    return response

def init_func(argv):
    app = web.Application()
    app.add_routes(routes)
    return app
But though the initial request returns immediately, subsequent requests block until the initial request is complete. I'm running a server via:
python -m aiohttp.web -H localhost -P 8080 ingestion_service:init_func
I know that multithreading and concurrency may provide better solutions than asyncio. In this case, I'm not looking for a robust solution, just something that will allow me to run multiple scripts at once via http request, ideally with minimal memory costs.
OK, there were a couple of issues with what I was doing. Namely, time.sleep() is blocking, so asyncio.sleep() should be used. However, since I'm interested in spawning a subprocess, I can use asyncio.subprocess to do that in a non-blocking fashion.
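As a rough illustration of the non-blocking subprocess approach, the sketch below starts three child processes at once and waits for all of them. It is not the ingestion service itself: run_job is a hypothetical helper, and sys.executable running a one-second sleep stands in for the real script.

```python
import asyncio
import sys
import time

async def run_job(seconds: float) -> int:
    # The child process here is a placeholder for the long-running script.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", f"import time; time.sleep({seconds})"
    )
    # Waiting on the process yields to the event loop instead of blocking it.
    return await proc.wait()

async def run_all():
    start = time.monotonic()
    codes = await asyncio.gather(run_job(1.0), run_job(1.0), run_job(1.0))
    return codes, time.monotonic() - start

return_codes, elapsed = asyncio.run(run_all())
```

Because the three jobs overlap, the total elapsed time should be close to that of a single job rather than the sum of all three.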
nb:
asyncio: run one function threaded with multiple requests from websocket clients
https://docs.python.org/3/library/asyncio-subprocess.html.
Using these helps, but there's still an issue with the web handler terminating the subprocess. Luckily, there's a solution here:
https://docs.aiohttp.org/en/stable/web_advanced.html
aiojobs has a decorator "atomic" that will protect the process until it is complete. So, code along these lines will function:
from aiojobs.aiohttp import setup, atomic
import asyncio
import os
from aiohttp import web

@atomic
async def ingest_pipeline(request):
    # be careful what you pass through to shell, lest you
    # give away the keys to the kingdom
    shell_command = "[your command here]"
    response_text = f"running {shell_command}"
    response_code = 200
    response = web.Response(text=response_text, status=response_code)
    await response.prepare(request)
    await response.write_eof()
    ingestion_process = await asyncio.create_subprocess_shell(shell_command,
                                                              stdout=asyncio.subprocess.PIPE,
                                                              stderr=asyncio.subprocess.PIPE)
    stdout, stderr = await ingestion_process.communicate()
    return response

def init_func(argv):
    app = web.Application()
    setup(app)
    app.router.add_get('/ingest_pipeline', ingest_pipeline)
    return app
This is very bare bones, but might help others looking for a quick skeleton for a temporary internal solution.
I am trying to perform async HTTP requests using the requests library in Python. I found that the latest version of the library does not directly support async requests. To achieve this, they provide the requests-threads library, which uses Twisted to handle asynchronicity. I tried modifying the provided examples to use callbacks instead of await/yield, but the callbacks are not being called.
My sample code is:
session = AsyncSession(n=10)

def processResponse(response):
    print(response)

def main():
    a = session.get('https://reqres.in/api/users')
    a.addCallbacks(processResponse, processResponse)
    time.sleep(5)
The requests-threads library: https://github.com/requests/requests-threads
I suspect the callbacks are not called because you aren't running Twisted's eventloop (known as the reactor). Remove your sleep function and replace it with reactor.run().
from twisted.internet import reactor
# ...

def main():
    a = session.get('https://reqres.in/api/users')
    a.addCallbacks(processResponse, processResponse)
    #time.sleep(5)  # never use blocking functions like this w/ Twisted
    reactor.run()
The catch is that Twisted's reactor cannot be restarted, so once you stop the event loop (i.e. reactor.stop()), an exception will be raised when reactor.run() is executed again. In other words, your script/app will only "run once". To circumvent this issue, I suggest you use crochet. Here's a quick example adapted from a similar one in requests-threads:
import crochet
crochet.setup()
print('setup')

from twisted.internet.defer import inlineCallbacks
from requests_threads import AsyncSession

session = AsyncSession(n=100)

@crochet.run_in_reactor
@inlineCallbacks
def main(reactor):
    responses = []
    for i in range(10):
        responses.append(session.get('http://httpbin.org/get'))
    for response in responses:
        r = yield response
        print(r)

if __name__ == '__main__':
    event = main(None)
    event.wait()
And just as an FYI, requests-threads is not for production systems and is subject to significant change (as of Oct 2017). The end goal of this project is to design an awaitable design pattern for requests in the future. If you need production-ready concurrent requests, consider grequests or treq.
I think the only mistake here is that you forgot to run the reactor/event loop.
The following code works for me:
from twisted.internet import reactor
from requests_threads import AsyncSession

session = AsyncSession(n=10)

def processResponse(response):
    print(response)

a = session.get('https://reqres.in/api/users')
a.addCallbacks(processResponse, processResponse)
reactor.run()
I'm currently trying to scrape a site, but the site doesn't allow more than 100 requests over one TCP connection. So I tried to create multiple connection pools for the requests, using the following code. Shouldn't it create 15 connection pools?
from urllib3 import HTTPConnectionPool
for i in range(15):
    pool = HTTPConnectionPool('ajax.googleapis.com', maxsize=15)
    for j in range(15):
        resp = pool.request('GET', '/ajax/services/search/web')
    pool.num_connections
pool.num_connections always prints 1.
The issue is that the requests are made synchronously one after the other.
For this reason the pool will always use the same connection with no need to create any others.
Now let's say we run the code using threads, multiple requests will be issued concurrently.
In this case pool.num_connections will be greater than 1:
from concurrent.futures import ThreadPoolExecutor
from urllib3 import HTTPConnectionPool

pool = HTTPConnectionPool('ajax.googleapis.com', maxsize=15)

def send_request(_):
    pool.request('GET', '/ajax/services/search/web')
    print(pool.num_connections)

with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(send_request, range(5))
If you need to close sockets every 100 requests, then you'll need to do that manually. Here's an example which closes all sockets every 5 requests:
import urllib3
urllib3.add_stderr_logger()  # This lets you see when new connections are made

http = urllib3.PoolManager()
url = 'http://ajax.googleapis.com/ajax/services/search/web'
for j in range(15):
    resp = http.request('GET', url)
    if j % 5 == 0:
        # Reset the PoolManager's connections.
        # This might be overkill if you need more granular control per-host.
        http.clear()
You could do something similar using HTTPConnectionPool and doing .close() on it before replacing it with a fresh one. I prefer to use PoolManager when possible (there is generally no downside).
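The close-and-replace idea can be sketched as a small wrapper. RotatingPool and make_pool are hypothetical names invented for this illustration, not part of urllib3; the wrapper only assumes that the pool object has request() and close() methods, which HTTPConnectionPool does.

```python
from typing import Any, Callable

class RotatingPool:
    """Hypothetical helper: swaps in a fresh pool after `limit` requests."""

    def __init__(self, make_pool: Callable[[], Any], limit: int = 100) -> None:
        self._make_pool = make_pool
        self._limit = limit
        self._pool = make_pool()
        self._count = 0

    def request(self, method: str, url: str, **kwargs: Any) -> Any:
        if self._count >= self._limit:
            # Close the old pool's sockets before replacing it.
            self._pool.close()
            self._pool = self._make_pool()
            self._count = 0
        self._count += 1
        return self._pool.request(method, url, **kwargs)
```

With urllib3, the factory could be something like lambda: HTTPConnectionPool('ajax.googleapis.com', maxsize=15), after which requests go through RotatingPool.request() in place of pool.request().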
If you'd like to get super granular with connections, you can manually take connections out of an HTTPConnectionPool using pool._get_conn() and .close()'ing it.
In my current application I use Tornado's AsyncHTTPClient to make requests to a web site.
The flow is complex: processing the response to one request results in another request.
Specifically, I download an article, then analyze it and download the images mentioned in it.
What bothers me is that, while my log clearly shows the message indicating that .fetch() on the photo URL has been issued, no actual HTTP request is made, as sniffed in Wireshark.
I tried tinkering with max_client_count and the Curl/Simple HTTP clients, but the behavior is always the same: until all articles are downloaded, no photo requests are actually issued. How can I change this?
Update: some pseudocode
@VictorSergienko I am on Linux, so by default, I guess, the EPoll version is used. The whole system is too complicated but it boils down to:
@gen.coroutine
def fetch_and_process(self, url, callback):
    body = yield self.async_client.fetch(url)
    res = yield callback(body)
    return res

@gen.coroutine
def process_articles(self, urls):
    wait_ids = []
    for url in urls:
        # Enqueue but don't wait for one
        IOLoop.current().add_callback(self.fetch_and_process(url, self.process_article))
        wait_ids.append(yield gen.Callback(key=url))
    # wait for all tasks to finish
    yield wait_ids

@gen.coroutine
def process_article(self, body):
    photo_url = self.extract_photo_url_from_page(body)
    do_some_stuff()
    print('I gonna download that photo ' + photo_url)
    yield self.download_photo(photo_url)

@gen.coroutine
def download_photo(self, photo_url):
    body = yield self.async_client.fetch(photo_url)
    with open(self.construct_filename(photo_url)) as f:
        f.write(body)
And when it prints "I gonna download that photo", no actual request is made!
Instead, it keeps downloading more articles and enqueueing more photos until all articles are downloaded; only THEN are all the photos requested in bulk.
AsyncHTTPClient has a queue, which you are filling up immediately in process_articles ("Enqueue but don't wait for one"). By the time the first article is processed its photos will go at the end of the queue after all the other articles.
If you used yield self.fetch_and_process instead of add_callback in process_articles, you would alternate between articles and their photos, but you could only be downloading one thing at a time. To maintain a balance between articles and photos while still downloading more than one thing at a time, consider using the toro package for synchronization primitives. The example in http://toro.readthedocs.org/en/stable/examples/web_spider_example.html is similar to your use case.
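To make the balancing idea concrete, here is a rough sketch that uses asyncio's queue and a fixed number of workers in place of Tornado and toro, so it is a swapped-in technique and not the code from the linked example. fake_fetch is a hypothetical stand-in for the HTTP fetch, and the URLs are placeholders. Each worker fetches an article and then its photo before taking the next article, so photos are interleaved with articles while several downloads stay in flight.

```python
import asyncio

log = []

async def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP fetch; real code would call an async client here.
    await asyncio.sleep(0.01)
    log.append(url)
    return url

async def worker(queue: asyncio.Queue) -> None:
    while True:
        article_url = await queue.get()
        try:
            await fake_fetch(article_url)
            # The photo is fetched before this worker takes another article.
            await fake_fetch(article_url + "/photo")
        finally:
            queue.task_done()

async def crawl(urls, concurrency: int = 2) -> None:
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue)) for _ in range(concurrency)]
    await queue.join()  # returns once every queued article is fully processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(crawl([f"article{i}" for i in range(6)]))
```

The concurrency argument caps how many downloads run at once, which plays the role of the client's request queue limit in the original setup.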
I have what I would think is a pretty common use case for Gevent. I need a UDP server that listens for requests, and based on the request submits a POST to an external web service. The external web service essentially only allows one request at a time.
I would like to have an asynchronous UDP server so that data can be immediately retrieved and stored so that I don't miss any requests (this part is easy with the DatagramServer gevent provides). Then I need some way to send requests to the external web service serially, but in such a way that it doesn't ruin the async of the UDP server.
I first tried monkey patching everything and what I ended up with was a quick solution, but one in which my requests to the external web service were not rate limited in any way and which resulted in errors.
It seems like what I need is a single non-blocking worker to send requests to the external web service in serial while the UDP server adds tasks to the queue from which the non-blocking worker is working.
What I need is information on running a gevent server with additional greenlets for other tasks (especially with a queue). I've been using the serve_forever function of the DatagramServer and think that I'll need to use the start method instead, but haven't found much information on how it would fit together.
Thanks,
EDIT
The answer worked very well. I've adapted the UDP server example code with the answer from @mguijarr to produce a working example for my use case:
from __future__ import print_function
from gevent.server import DatagramServer
import gevent.queue
import gevent.monkey
import urllib

gevent.monkey.patch_all()

n = 0

def process_request(q):
    while True:
        request = q.get()
        print(request)
        print(urllib.urlopen('https://test.com').read())

class EchoServer(DatagramServer):
    __q = gevent.queue.Queue()
    __request_processing_greenlet = gevent.spawn(process_request, __q)

    def handle(self, data, address):
        print('%s: got %r' % (address[0], data))
        global n
        n += 1
        print(n)
        self.__q.put(n)
        self.socket.sendto('Received %s bytes' % len(data), address)

if __name__ == '__main__':
    print('Receiving datagrams on :9000')
    EchoServer(':9000').serve_forever()
Here is how I would do it:
Write a function taking a "queue" object as argument; this function will continuously process items from the queue. Each item is supposed to be a request for the web service.
This function could be a module-level function, not part of your DatagramServer instance:
def process_requests(q):
    while True:
        request = q.get()
        # do your magic with 'request'
        ...
In your DatagramServer, run the function within a greenlet (like a background task):
self.__q = gevent.queue.Queue()
self.__request_processing_greenlet = gevent.spawn(process_requests, self.__q)
When you receive the UDP request in your DatagramServer instance, push the request onto the queue:
self.__q.put(request)
This should do what you want. You still call 'serve_forever' on DatagramServer, no problem.