I'm currently working on a scraper where I'm trying to figure out how I can assign proxies that are available to use, meaning that if I use 5 threads and thread-1 uses proxy A, no other thread should be able to access proxy A; each thread should pick randomly from the remaining available proxy pool.
import random
import time
from threading import Thread

import requests

# proxies are stored as host:port; the scheme is added when the proxy is set
list_op_proxy = [
    "test.io:12345",
    "test.io:123456",
    "test.io:1234567",
    "test.io:12345678"
]

session = requests.Session()


def handler(name):
    while True:
        try:
            session.proxies = {
                'https': f'http://{random.choice(list_op_proxy)}'
            }
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made!")
            time.sleep(random.randint(5, 10))
        except requests.exceptions.RequestException as err:
            print(f"Error! Let's try again! {err}")
            continue
        except Exception as err:
            print(f"Error! Let's debug! {err}")
            raise


for i in range(5):
    Thread(target=handler, args=(f'Thread {i}',)).start()
How can I make each thread pick only a proxy that is currently available, "lock" that proxy so no other thread can use it, and release it back to the pool once the request is finished?
One way to go about this would be to use a global shared list that either holds the currently active proxies, or from which proxies are removed and re-added once the request is finished. You do not have to worry much about concurrent access to the list itself, since CPython's GIL makes the individual list operations thread-safe.
proxy = random.choice(list_op_proxy)
list_op_proxy.remove(proxy)
session.proxies = {
    'https': f'http://{proxy}'
}
# ... do request
list_op_proxy.append(proxy)
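A minimal sketch of how this could look inside the handler from the question (the explicit lock around the choose-and-remove step is an addition here, so that two threads cannot pick the same proxy at the same moment):

import threading

pool_lock = threading.Lock()


def handler(name):
    while True:
        # pick a free proxy and remove it from the shared pool in one step
        with pool_lock:
            proxy = random.choice(list_op_proxy)
            list_op_proxy.remove(proxy)
        try:
            session.proxies = {'https': f'http://{proxy}'}
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made!")
            time.sleep(random.randint(5, 10))
        finally:
            # release the proxy back to the pool, even if the request failed
            with pool_lock:
                list_op_proxy.append(proxy)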
You could also do this with a queue, just getting and putting proxies, which is a cleaner and more efficient way to achieve the same thing.
Using a Proxy Queue
Another option is to put the proxies into a queue and get() a proxy before each query, removing it from the available proxies, and then put() it back after the request has finished. This is a more efficient version of the above-mentioned list approach.
First we need to initialize the proxy queue.
import queue

proxy_q = queue.Queue()
for proxy in proxies:  # proxies: e.g. the list_op_proxy list from the question
    proxy_q.put(proxy)
Within the handler we then get a proxy from the queue before performing the request, perform the request, and put the proxy back into the queue.
We use block=True so that the queue blocks the thread if no proxy is currently available. Otherwise the thread would terminate with a queue.Empty exception once all proxies are in use and a new one needs to be acquired.
def handler(name):
    global proxy_q
    while True:
        proxy = proxy_q.get(block=True)  # we want blocking behaviour
        # ... do request
        proxy_q.put(proxy)
        # ... response handling can be done after the proxy has been put back,
        #     so the proxy is not blocked longer than required
        # do not forget to define a break condition
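A minimal runnable version of this queue-based handler might look like the following sketch; the URL, sleep, and proxy strings are taken from the question, while the fixed number of requests per thread is just an assumed break condition for illustration:

import queue
import random
import time
from threading import Thread

import requests

proxy_q = queue.Queue()
for proxy in ["test.io:12345", "test.io:123456", "test.io:1234567"]:
    proxy_q.put(proxy)


def handler(name, requests_to_make=3):
    session = requests.Session()
    for _ in range(requests_to_make):  # break condition: a fixed number of requests
        proxy = proxy_q.get(block=True)  # blocks until a proxy is free
        try:
            session.proxies = {'https': f'http://{proxy}'}
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made!")
        finally:
            proxy_q.put(proxy)  # release the proxy for the other threads
        time.sleep(random.randint(5, 10))


threads = [Thread(target=handler, args=(f'Thread {i}',)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()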
Using Queue and Multiprocessing
First you would initialize the manager and put all your data into the queue and initialize another structure for collecting your results (here we initialize a shared list).
import multiprocessing

manager = multiprocessing.Manager()
q = manager.Queue()
for e in entities:  # entities: whatever items/URLs you want to scrape
    q.put(e)
print(q.qsize())

results = manager.list()
Then you initialize the scraping processes:
processes = []
for proxy in proxies:
    processes.append(multiprocessing.Process(
        target=scrape_function,
        args=(q, results, proxy),
        daemon=True))
And then start each of them:
for w in processes:
    w.start()
Lastly, you join every process to ensure that the main process does not terminate before the subprocesses are finished:
for w in processes:
    w.join()
Inside the scrape_function you then simply get one item at a time and perform the request. Called with block=False, the queue's get() raises a queue.Empty error when the queue is empty, so we use an infinite while loop with a break condition that catches the exception.
def scrape_function(q, results, proxy):
    session = requests.Session()
    session.proxies = {
        'https': f'http://{proxy}'
    }
    while True:
        try:
            request_uri = q.get(block=False)
            with session.get(request_uri) as response:
                print(f"{proxy} - Yay request made!")
                # store whatever part of the response you need
                results.append(response.text)
            time.sleep(random.randint(5, 10))
        except queue.Empty:
            break
The results of each query are appended to the results list, which is also shared among the different processes.
Related
I am fetching JSON data from a website (https://www.nseindia.com/) using the Python requests library. The website only returns data if the correct cookies are provided, so I use a Selenium webdriver to get the cookies and then fetch data for 850 different stocks.
Now, my code is written so that if the cookies are wrong, Selenium should open again and get new cookie values. But the problem is that when I use concurrent.futures, the tasks run very fast (due to the concurrency) and a new driver is opened for each symbol until new cookies are found. My code is as below:
Initially, I get the cookies:
cookie_dict = get_cookies()
for cookie in cookie_dict:
    if cookie == "bm_sv" or cookie == "nsit" or cookie == "nseappid":
        session.cookies.set(cookie, cookie_dict[cookie])
This function is used in the ThreadPoolExecutor:
def final(u):
    try:
        data = session.get(u, headers=headers).json()
        print(data['data'][0]['CH_SYMBOL'])
        list_done.append(data['data'][0]['CH_SYMBOL'])
    except:
        print("Error")
        cookie_dict = get_cookies()
        for cookie in cookie_dict:
            if cookie == "bm_sv" or cookie == "nsit" or cookie == "nseappid":
                session.cookies.set(cookie, cookie_dict[cookie])
        data = session.get(u, headers=headers).json()
        print(data['data'][0]['CH_SYMBOL'])
        list_done.append(data['data'][0]['CH_SYMBOL'])
As you can see, if an exception is raised, the code should get new cookies. But as I mentioned above, the other futures keep running for the other stocks and keep hitting exceptions until the new cookies are found.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(final, urls)
Is there a way I can change my code, or some built-in facility in futures, so that the workers pause until I get new cookies and continue running only once the new cookies have been set?
You could use the wait() function of concurrent.futures to monitor for the first exception. First, you have to change final so that it raises instead of swallowing the exception. You should also not append to a shared list inside this function; return the value instead:
def final(u):
    data = session.get(u, headers=headers).json()
    print(data['data'][0]['CH_SYMBOL'])
    return data['data'][0]['CH_SYMBOL']
Second, instead of using .map() you submit work and store futures:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(final, u) for u in urls]
Then you use wait() to wait for an exception to occur. Either there is an exception and you handle it before continuing, or there is no error and you unwrap the futures:
from concurrent.futures import wait, FIRST_EXCEPTION

completed, not_done = wait(futures, return_when=FIRST_EXCEPTION)

if not_done:
    # An exception occurred before all futures completed;
    # handle it here.
    ...

# Here, handle the results of the futures:
# `completed` contains both finished and failed futures,
# `not_done` contains pending futures.
Note that you should check whether each future succeeded or failed before unwrapping it, and you should resubmit the futures that failed. Also, the wait() function does not cancel pending futures, so the not_done futures are still progressing.
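Putting the pieces together, a sketch of the full loop could look like this. get_cookies, session, headers, and urls come from the question; the retry strategy is only an illustration and assumes the failing URLs eventually succeed once fresh cookies are set:

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_EXCEPTION


def refresh_cookies():
    cookie_dict = get_cookies()
    for cookie in cookie_dict:
        if cookie in ("bm_sv", "nsit", "nseappid"):
            session.cookies.set(cookie, cookie_dict[cookie])


results = []
pending = list(urls)

with ThreadPoolExecutor(max_workers=5) as executor:
    while pending:
        future_to_url = {executor.submit(final, u): u for u in pending}
        done, not_done = wait(future_to_url, return_when=FIRST_EXCEPTION)
        # wait() does not cancel the still-running futures, so collect them
        # too; .exception() blocks until each future has actually finished
        pending = []
        for future in list(done) + list(not_done):
            if future.exception() is not None:
                pending.append(future_to_url[future])  # retry with new cookies
            else:
                results.append(future.result())
        if pending:
            refresh_cookies()  # fetch fresh cookies once per round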
I want to make repeated requests to a server that will return with some tasks. The response from the server will be a dictionary with a list of functions that need to be called. For example:
{
    "tasks": [
        {
            "function": "HelloWorld",
            "id": 1212
        },
        {
            "function": "GoodbyeWorld",
            "id": 1222
        }
    ]
}
NOTE: I'm dummying it down.
For each of these tasks, I will run the specified function using multiprocessing. Here is an example of my code:
r = requests.get('https://localhost:5000', auth=('user', 'pass'))
data = r.json()

if len(data["tasks"]) > 0:
    manager = multiprocessing.Manager()
    for task in data["tasks"]:
        if task["function"] == "HelloWorld":
            helloObj = HelloWorldClass()
            hello = multiprocessing.Process(target=helloObj.helloWorld)
            hello.start()
            hello.join()
        elif task["function"] == "GoodbyeWorld":
            byeObj = GoodbyeWorldClass()
            bye = multiprocessing.Process(target=byeObj.byeWorld)
            bye.start()
            bye.join()
The problem is, I want to make repeated requests and fill the data["tasks"] array while the other processes are running. If I throw everything into some while loop, it only makes a new request after all the processes from the initial response are done (when join() has been reached for every process).
Can anyone help me to make repeated requests and fill the array continuously? Please let me know if I need to make any clarifications.
If I understood you correctly, you need something like this:
import time
from multiprocessing import Process

import requests

from task import FunctionFactory


def get_tasks():
    resp = requests.get('https://localhost:5000', auth=('user', 'pass'))
    data = resp.json()
    return data['tasks']


if __name__ == '__main__':
    procs = {}

    for _ in range(10):
        tasks = get_tasks()
        if not tasks:
            time.sleep(5)
            continue

        for task in tasks:
            if task['id'] in procs:
                # This task has already been submitted for execution.
                continue

            func = FunctionFactory.build(task['function'])
            proc = Process(target=func)
            proc.start()
            procs[task['id']] = proc

    # Waiting for all the submitted tasks to finish.
    for proc in procs.values():
        proc.join()
Here, the function get_tasks is used to request a list of dictionaries with id and function keys from the server. In the main section, there is a procs dictionary that maps id to running process instances which execute functions built by a FunctionFactory using received tasks' function names. In the case there is already a running task with the same id, it gets ignored.
With this approach, you can request tasks as often as needed (here, 10 requests are used in a for loop) and start processes to execute them in parallel. In the end, you just wait for all the submitted tasks to finish.
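The FunctionFactory itself is not shown above; a minimal hypothetical task.py that maps the function names from the server response to callables could look like this:

# task.py - hypothetical factory used by the snippet above
class FunctionFactory:
    @staticmethod
    def build(name):
        def hello_world():
            print("Hello, world!")

        def goodbye_world():
            print("Goodbye, world!")

        # map the function names received from the server to callables
        registry = {
            "HelloWorld": hello_world,
            "GoodbyeWorld": goodbye_world,
        }
        return registry[name]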
You have a bug in your program: you should call the joins after you have created all the processes. join() blocks until the process has finished -- in your case, before you start the next one, which practically makes your whole program run sequentially.
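In other words, a sketch of the corrected structure (using the names from the question): start everything first, and only then join:

processes = []
for task in data["tasks"]:
    if task["function"] == "HelloWorld":
        processes.append(multiprocessing.Process(target=HelloWorldClass().helloWorld))
    elif task["function"] == "GoodbyeWorld":
        processes.append(multiprocessing.Process(target=GoodbyeWorldClass().byeWorld))

for proc in processes:
    proc.start()  # all processes now run in parallel

for proc in processes:
    proc.join()  # block only after everything has been started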
What do I mean by "deterministic time"? For example, AWS offers a service called AWS Lambda. A process started as a Lambda function has a time limit; after that, the Lambda function stops executing and the task is assumed to have finished with an error. An example task: sending data to an HTTP endpoint. Depending on the network connection to the endpoint, or other factors, sending the data can take a long time. If I need to send the same data to many endpoints, the full processing time is one request time multiplied by the number of endpoints, which increases the chance that the Lambda function will be stopped before all the data has been sent to all endpoints.
To solve this, I need to send the data to the different endpoints in parallel, using threads.
The problem with threads is that a started thread can't be stopped. If an HTTP request takes more time than the Lambda time limit allows, the Lambda function is aborted and returns an error. So I need to use a timeout on the HTTP request to abort it if it takes longer than expected.
If the HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so it is not lost. The time needed to save unprocessed data can be predicted, because I control the storage where the data is saved.
The last part that consumes time is the loop where threads are scheduled via executor.submit(). If there is only one endpoint, or a small number of them, the time consumed is small and there is no need to control it. But if I am dealing with many endpoints, I have to take it into account.
So basically, the full time consists of:
scheduling threads
HTTP request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
import time
from functools import partial

import requests

start = time.time()


def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}


def get_data(n):
    return {"wait": n}


def done_cb(a, b, future):
    pass  # save unprocessed data


def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5

    for i in range(1):
        future = executor.submit(send_data, {"wait": 10})
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)

        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break

    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass


main()
I was wondering how I could use the asyncio library instead of threads to do the same thing.
import asyncio
import time
from functools import partial

import requests

start = time.time()


def send_data(data):
    ...  # same as in the previous snippet


def get_data(n):
    return {"wait": n}


def done_callback(a, b, future):
    pass  # save unprocessed data


def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()

    for i in range(1):
        event_data = get_data(1)
        future = loop.run_in_executor(None, send_data, event_data)
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)

        if time.time() - start_appending > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break

    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )


_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async, I use run_in_executor() to create a future object. The main problem I have is that done_callback() is not executed when the thread started by the executor has done its job, but only once the futures are "processed" by the asyncio.wait() expression.
Basically, I am looking for a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, instead of waiting for the asyncio.wait() expression before done_callback() is called. If you have other ideas for writing Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
One other question: if a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I don't have a result, and I have to use functools.partial() to pass additional data into done_callback that lets me understand which data the callback was called for. If the passed data is small, this is not a problem; if the data is big, I have to put it into an array/list/dictionary and pass only its index into the callback, or pass the full data into the callback.
Can I somehow get access to the variable that was passed to the future/thread from a done_callback() that was triggered on a cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
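A minimal sketch of that approach with aiohttp, assuming a made-up list of endpoint URLs and placeholder timeout values; where to save the unprocessed data is left as a comment:

import asyncio

import aiohttp


async def send_data(session, url, data):
    try:
        # the whole request is bounded by the timeout and can also be cancelled
        timeout = aiohttp.ClientTimeout(total=0.5)
        async with session.post(url, json=data, timeout=timeout) as resp:
            if resp.status == 200:
                return {'status': 'ok', 'url': url}
            return {'status': 'error', 'url': url, 'msg': await resp.text()}
    except (asyncio.TimeoutError, aiohttp.ClientError):
        # save the unprocessed data here so it is not lost
        return {'status': 'error', 'url': url, 'msg': 'timeout'}


async def main():
    endpoints = ['http://127.0.0.1:5000/endpoint']  # made-up endpoint list
    data = {'wait': 10}
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(send_data(session, url, data))
                 for url in endpoints]
        try:
            # bound the total runtime so we stay within the Lambda time limit
            results = await asyncio.wait_for(asyncio.gather(*tasks), timeout=1)
            print(results)
        except asyncio.TimeoutError:
            # wait_for cancels the gathered tasks on timeout; save the
            # unprocessed data for the endpoints that did not finish
            pass


asyncio.run(main())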
This is my main function. If I receive a new offer, I need to check the payment; I have a HandleNewOffer() function for that. But the problem with this code happens if there are 2 (or more) offers at the same time: one of the buyers has to wait until the other transaction is closed. So is it possible to spawn a new process with the HandleNewOffer() function and kill it when it's done, so that several transactions can be handled at the same time? Thank you in advance.
def handler():
    try:
        conn = k.call('GET', '/api/').json()  # connect
        response = conn.call('GET', '/api/notifications/').json()
        notifications = response['data']
        for notification in notifications:
            if notification['contact']:
                # need to dynamically start a new process per notification
                HandleNewOffer(notification)
    except Exception as err:
        error = 'Error'
        Send(error)
I'd recommend using the pool-of-workers pattern here to limit the number of concurrent calls to HandleNewOffer.
The concurrent.futures module offers ready-made implementations of the above-mentioned pattern.
from concurrent.futures import ProcessPoolExecutor


def handler():
    with ProcessPoolExecutor() as pool:
        try:
            conn = k.call('GET', '/api/').json()  # connect
            response = conn.call('GET', '/api/notifications/').json()
            # collect the notifications to process into a list
            notifications = [n for n in response['data'] if n['contact']]
            # send the list of notifications to the concurrent workers
            results = pool.map(HandleNewOffer, notifications)
            # iterate over the list of results from every HandleNewOffer call
            for result in results:
                print(result)
        except Exception as err:
            error = 'Error'
            Send(error)
This logic will handle as many offers in parallel as your computer has CPU cores.
It says urllib2 is not defined:
NameError: global name 'urllib2' is not defined
I am using Python 2.7. Can anyone help me fix this? It's not my code, but it still needs some help.
And, if possible, could you make some edits so that it uses less CPU and has a better execution time? I am just a beginner at sending HTTP requests.
from grab import Grab, GrabError
from Tkinter import *
from tkFileDialog import *
import requests
from urllib2 import urlopen
""" A multithreaded proxy checker
Given a file containing proxies, per line, in the form of ip:port, will attempt
to establish a connection through each proxy to a provided URL. Duration of
connection attempts is governed by a passed in timeout value. Additionally,
spins off a number of daemon threads to speed up processing using a passed in
threads parameter. Proxies that passed the test are written out to a file
called results.txt
Usage:
goodproxy.py [-h] -file FILE -url URL [-timeout TIMEOUT] [-threads THREADS]
Parameters:
-file -- filename containing a list of ip:port per line
-url -- URL to test connections against
-timeout -- attempt time before marking that proxy as bad (default 1.0)
-threads -- number of threads to spin off (default 16)
Functions:
get_proxy_list_size -- returns the current size of the proxy holdingQueue
test_proxy -- does the actual connecting to the URL via a proxy
main -- creates daemon threads, write results to a file
"""
import argparse
import queue
import socket
import sys
import threading
import time
def get_proxy_list_size(proxy_list):
    """ Return the current Queue size holding a list of proxy ip:ports """
    return proxy_list.qsize()
def test_proxy(url, url_timeout, proxy_list, lock, good_proxies, bad_proxies):
    """ Attempt to establish a connection to a passed in URL through a proxy.

    This function is used in a daemon thread and will loop continuously while
    waiting for available proxies in the proxy_list. Once proxy_list contains
    a proxy, this function will extract that proxy. This action automatically
    locks the queue until this thread is done with it. Builds a urllib.request
    opener and configures it with the proxy. Attempts to open the URL and if
    successful then saves the good proxy into the good_proxies list. If an
    exception is thrown, writes the bad proxy to the bad_proxies list. The call
    to task_done() at the end unlocks the queue for further processing.
    """
    while True:
        # take an item from the proxy list queue; get() auto locks the
        # queue for use by this thread
        proxy_ip = proxy_list.get()

        # configure urllib.request to use proxy
        proxy = urllib.request.ProxyHandler({'http': proxy_ip})
        opener = urllib.request.build_opener(proxy)
        urllib.request.install_opener(opener)

        # some sites block frequent querying from generic headers
        request = urllib.request.Request(
            url, headers={'User-Agent': 'Proxy Tester'})
        try:
            # attempt to establish a connection
            urllib.request.urlopen(request, timeout=float(url_timeout))

            # if all went well save the good proxy to the list
            with lock:
                good_proxies.append(proxy_ip)
        except (urllib.request.URLError,
                urllib.request.HTTPError,
                socket.error):
            # handle any error related to connectivity (timeouts, refused
            # connections, HTTPError, URLError, etc)
            with lock:
                bad_proxies.append(proxy_ip)
        finally:
            proxy_list.task_done()  # release the queue
def main(argv):
    """ Main Function

    Uses argparse to process input parameters. File and URL are required while
    the timeout and thread values are optional. Uses threading to create a
    number of daemon threads each of which monitors a Queue for available
    proxies to test. Once the Queue begins populating, the waiting daemon
    threads will start picking up the proxies and testing them. Successful
    results are written out to a results.txt file.
    """
    proxy_list = queue.Queue()  # Hold a list of proxy ip:ports
    lock = threading.Lock()  # locks good_proxies, bad_proxies lists
    good_proxies = []  # proxies that passed connectivity tests
    bad_proxies = []  # proxies that failed connectivity tests

    # Process input parameters
    parser = argparse.ArgumentParser(description='Proxy Checker')
    parser.add_argument(
        '-file', help='a text file with a list of proxy:port per line',
        required=True)
    parser.add_argument(
        '-url', help='URL for connection attempts', required=True)
    parser.add_argument(
        '-timeout',
        type=float, help='timeout in seconds (defaults to 1)', default=1)
    parser.add_argument(
        '-threads', type=int, help='number of threads (defaults to 16)',
        default=16)
    args = parser.parse_args(argv)

    # setup daemons ^._.^
    for _ in range(args.threads):
        worker = threading.Thread(
            target=test_proxy,
            args=(
                args.url,
                args.timeout,
                proxy_list,
                lock,
                good_proxies,
                bad_proxies))
        worker.setDaemon(True)
        worker.start()

    start = time.time()

    # load a list of proxies from the proxy file
    with open(args.file) as proxyfile:
        for line in proxyfile:
            proxy_list.put(line.strip())

    # block main thread until the proxy list queue becomes empty
    proxy_list.join()

    # save results to file
    with open("result.txt", 'w') as result_file:
        result_file.write('\n'.join(good_proxies))

    # some metrics
    print("Runtime: {0:.2f}s".format(time.time() - start))
if __name__ == "__main__":
    main(sys.argv[1:])