Threading in Python doesn't happen in parallel

I'm doing data scraping calls with urllib2, and each one takes around 1 second to complete. I wanted to test whether I could split the URL-call loop across multiple threads, each working on a different offset.
I'm doing this now with my update_items() method, where the first and second parameters are the offset and the limit for the loop:
import threading
t1 = threading.Thread(target=trade.update_items(1, 100))
t2 = threading.Thread(target=trade.update_items(101, 200))
t3 = threading.Thread(target=trade.update_items(201, 300))
t1.start()
t2.start()
t3.start()
#t1.join()
#t2.join()
#t3.join()
As the code shows, I tried commenting out the join() calls to avoid waiting on the threads, but it seems I've got the idea of this library wrong. I inserted print() calls into the update_items() method, and funnily enough it shows that everything is still looping serially, not all 3 threads running in parallel as I wanted.
My normal scraping run takes about 5 hours to complete, and it's only very small pieces of data, but the HTTP call always takes some time. I want to multi-thread this task to shorten the fetching to around 30-45 minutes.

To fetch multiple URLs in parallel, limited to 20 connections at a time:
import urllib2
from multiprocessing.dummy import Pool

def generate_urls():  # generate some dummy urls
    for i in range(100):
        yield 'http://example.com?param=%d' % i

def get_url(url):
    try:
        return url, urllib2.urlopen(url).read(), None
    except EnvironmentError as e:
        return url, None, e

pool = Pool(20)  # limit number of concurrent connections
for url, result, error in pool.imap_unordered(get_url, generate_urls()):
    if error is None:
        print result,
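Since urllib2 only exists on Python 2, a rough Python 3 equivalent of the same thread-pool technique (urllib.request instead of urllib2, everything else unchanged) might look like this:

from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing.Pool API
from urllib.request import urlopen

def generate_urls():  # generate some dummy urls
    for i in range(100):
        yield 'http://example.com?param=%d' % i

def get_url(url):
    try:
        return url, urlopen(url).read(), None
    except OSError as e:  # EnvironmentError is an alias of OSError on Python 3
        return url, None, e

pool = Pool(20)  # limit number of concurrent connections
for url, result, error in pool.imap_unordered(get_url, generate_urls()):
    if error is None:
        print(result)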

Paul Seeb has correctly diagnosed your issue.
You are calling trade.update_items, and then passing the result to the threading.Thread constructor. Thus, you get serial behavior: your threads don't do any work, and the creation of each one is delayed until the update_items call returns.
The correct form is threading.Thread(target=trade.update_items, args=(1, 100)) for the first line, and similarly for the later ones. This passes the update_items function as the thread's entry point and (1, 100) as its positional arguments.
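Applied to the snippet in the question (trade and update_items come from there), the corrected version would look roughly like this:

import threading

# Pass the function and its arguments separately so each thread runs
# update_items itself, instead of calling it up front in the main thread.
t1 = threading.Thread(target=trade.update_items, args=(1, 100))
t2 = threading.Thread(target=trade.update_items, args=(101, 200))
t3 = threading.Thread(target=trade.update_items, args=(201, 300))

for t in (t1, t2, t3):
    t.start()
for t in (t1, t2, t3):
    t.join()  # waits for all three to finish; they still run concurrently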

Related

When running two functions simultaneously how to return the first result and use it for further processes

So I have two webscrapers that collect data from two different sources. I am running them both simultaneously to collect a specific piece of data (e.g. covid numbers).
When one of the functions finds data I want to use that data without waiting for the other one to finish.
So far I have tried the multiprocessing Pool module, returning the results with get(), but by definition I have to wait for both get() calls to finish before I can continue with my code. My goal is to keep the code as simple and as short as possible.
My webscraper functions can be run with arguments and return a result if found. It is also possible to modify them.
This is the code I have so far, which waits for both get() calls to finish:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
from twitter import post_tweet

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        r1 = pool.apply_async(main_1, ('www.website1.com', 'June'))
        r2 = pool.apply_async(main_2, ())
        data = r1.get()
        data2 = r2.get()
        post_tweet("New data is {}".format(data))
        post_tweet("New data is {}".format(data2))
From here I have seen that threading might be a better option, since web scraping involves a lot of waiting and only a little parsing, but I am not sure how I would implement it.
I think the solution is fairly easy, but I have been searching and trying different things all day without much success, so I think I will just ask here. (I only started programming 2 months ago.)
As always, there are many ways to accomplish this task.
You have already mentioned using a Queue:
from multiprocessing import Process, Queue
from scraper1 import main_1
from scraper2 import main_2

def simple_worker(target, args, ret_q):
    ret_q.put(target(*args))  # mp.Queue has its own mutex, so we don't need to worry about concurrent read/write

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=simple_worker, args=(main_1, ('www.website1.com', 'June'), q))
    p2 = Process(target=simple_worker, args=(main_2, ('www.website2.com', 'July'), q))
    p1.start()
    p2.start()

    first_result = q.get()
    do_stuff(first_result)

    # Don't forget to get() the second result before you quit. It's not a good idea to
    # leave things in a Queue and just assume it will be properly cleaned up at exit.
    second_result = q.get()

    p1.join()
    p2.join()
You could also still use a Pool by using imap_unordered and just taking the first result:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2

def simple_worker2(args):
    target, arglist = args  # unpack args
    return target(*arglist)

if __name__ == "__main__":
    tasks = ((main_1, ('www.website1.com', 'June')),
             (main_2, ('www.website2.com', 'July')))

    # The Pool context manager handles worker cleanup (your target function may,
    # however, be interrupted at any point if the pool exits before a task is complete).
    with Pool() as p:
        for result in p.imap_unordered(simple_worker2, tasks, chunksize=1):
            do_stuff(result)
            break  # don't bother with further results
I've seen people use queues in such cases: create one and pass it to both parsers so that they put their results into the queue instead of returning them. Then do a blocking pop on the queue to retrieve the first available result.
I have seen that threading might be a better option
Almost true but not quite. I'd say that asyncio and async-based libraries are much better than both threading and multiprocessing when we're talking about code with a lot of blocking I/O. If it's applicable in your case, I'd recommend rewriting both your parsers in async.
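A minimal sketch of that async approach, assuming both scrapers can be rewritten as coroutines; main_1_async and main_2_async are hypothetical async versions of the scrapers, the sleeps stand in for the real network I/O, and print stands in for post_tweet from the question:

import asyncio

async def main_1_async(url, month):
    await asyncio.sleep(2)  # placeholder for the real async scraping
    return "data from scraper 1"

async def main_2_async(url, month):
    await asyncio.sleep(1)  # placeholder for the real async scraping
    return "data from scraper 2"

async def first_result():
    tasks = [
        asyncio.create_task(main_1_async('www.website1.com', 'June')),
        asyncio.create_task(main_2_async('www.website2.com', 'July')),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()           # stop the slower scraper
    return done.pop().result()  # data from whichever finished first

if __name__ == "__main__":
    data = asyncio.run(first_result())
    print("New data is {}".format(data))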

Multithreading / Multiprocessing with a for-loop in Python3

I have this task which is sort of I/O bound and CPU bound at the same time.
Basically I am getting a list of queries from a user, searching each of them on Google (via the custom search API), storing each query's results in a .txt file, and storing all the results in a results.txt file.
I was thinking that maybe parallelism might be an advantage here.
My whole task is wrapped with an Object which has 2 member fields which I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume that it is because of my shared resources).
i.e:
class MyObject(object):
    _my_list = []
    _my_dict = {}
_my_dict contains key:value pairs of "query_name":list().
_my_list is a list of queries to search in google. It is safe to assume that it is not written into.
For each query : I search it on google, grab the top results and store it in _my_dict
I want to do this in parallel. I thought that threading might help, but it seems to slow the work down.
This is how I attempted to do it (this is the method which does the entire job for a single query):
def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """
This is the method which is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()
The above execution does not work, I get corrupted results...
When I use multithreading however, the results are fine, but very slow:
def find_articles(self):
    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and starting a thread per query hits a bottleneck). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as many network resources as possible without a bottleneck, which is why the number of threads actively making requests at any one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function is a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the right thread-pool size to avoid a bottleneck, but overall I've found this to work well.
import threading

class MultiThread:
    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
        func : function
            Callback function to multi-thread
        list_data : list
            List of data to multi-thread over
        thread_cap : int
            Maximum number of threads in the pool
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data
        self._index_lock = threading.Lock()  # guards current_index against concurrent increments

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool += [thread]
            thread.start()

    def _wrapper(self):
        while not self.complete:
            # Claim the next index under the lock so two threads never take the same item
            # (or run past the end of the list).
            with self._index_lock:
                if self.current_index < self.total_index:
                    self.current_index += 1
                    index = self.current_index
                else:
                    self.complete = True
                    return
            self.func(self.list_data[index])

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()
import requests  # , time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.sessions.session()

def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)

# start_time = time.time()
_my_list = ["examplequery" + str(n) for n in range(100)]

mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()

# output queries to file

# print("Time:{:2f}".format(time.time()-start_time))
You could also open the file and output whatever you need to as you go, or output data at the end. Obviously, my replica here isn't exactly what you need, but it's a solid boilerplate with a lightweight function I made that will greatly reduce the time it takes. It uses a thread pool to call a callback to a default function that takes a single parameter (the query).
In my test here, it completed cycling 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before I find the bottleneck.
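If you'd rather not manage the threads yourself, roughly the same capped-concurrency idea can also be sketched with the standard library's concurrent.futures.ThreadPoolExecutor, assuming the same dummy queries as above:

import concurrent.futures
import requests

base_url = "https://www.google.com/search?q="
session = requests.Session()
results = {}

def fetch(query):
    r = session.get(base_url + query)
    return query, r.text

queries = ["examplequery" + str(n) for n in range(100)]

# max_workers plays the same role as thread_cap above
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    for query, text in executor.map(fetch, queries):
        results[query] = text  # collected in the main thread, so no shared-state races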

Store the results of a multiprocessing queue in python

I'm trying to store the results of multiple API requests using multiprocessing queue as the API can't handle more than 5 connections at once.
I found part of a solution in How to use multiprocessing with requests module?
import multiprocessing
import queue

def worker(input_queue, stop_event):
    while not stop_event.is_set():
        try:
            # Check if any request has arrived in the input queue. If not,
            # loop back and try again.
            request = input_queue.get(True, 1)
            input_queue.task_done()
        except queue.Empty:
            continue
        print('Started working on:', request)
        api_request_function(request)  # make request using a function I wrote
        print('Stopped working on:', request)

def master(api_requests):
    input_queue = multiprocessing.JoinableQueue()
    stop_event = multiprocessing.Event()
    workers = []

    # Create workers.
    for i in range(3):
        p = multiprocessing.Process(target=worker,
                                    args=(input_queue, stop_event))
        workers.append(p)
        p.start()

    # Distribute work.
    for requests in api_requests:
        input_queue.put(requests)

    # Wait for the queue to be consumed.
    input_queue.join()

    # Ask the workers to quit.
    stop_event.set()

    # Wait for workers to quit.
    for w in workers:
        w.join()

    print('Done')
I've looked at the documentation of threading and pooling but I'm missing a step. The above runs and all requests get a 200 status code, which is great. But how do I store the results of the requests for later use?
Thanks for your help
Shan
I believe you have to make a Queue. The code can be a little tricky; you need to read up on the multiprocessing module. In general, with multiprocessing, all the variables are copied for each worker, so you can't do something like appending to a global variable: the worker's copy gets modified while the original stays untouched. There are a few functions that already automatically incorporate workers, queues, and return values. Personally, I try to write my functions to work with mp.map, like below:
import multiprocessing

def worker(*args, **kwargs):
    # do stuff
    return 'thing'

output = multiprocessing.Pool().map(worker, [1, 2, 3, 4, 5])
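Adapted to the question, a minimal sketch might look like the following; api_request_function and the request list here are placeholders standing in for the ones in the question, and the pool size caps the number of simultaneous connections at 5:

import multiprocessing

def api_request_function(request):
    # placeholder for the question's real request function
    return "response for {}".format(request)

def worker(request):
    return request, api_request_function(request)  # return the result instead of just printing it

if __name__ == '__main__':
    api_requests = ['req1', 'req2', 'req3']          # placeholder request list
    with multiprocessing.Pool(processes=5) as pool:  # at most 5 connections at once
        results = pool.map(worker, api_requests)     # list of (request, response) pairs
    for request, response in results:
        print(request, response)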

Make threads in python join automatically when they get to the end of the function

I have a piece of code implemented which can be seen below:
while True:
    t = threading.Thread(target=myfunction, args=(param1, param2))
    t.start()
    t.join()
    time.sleep(1)
My function contains a get request to an API like so:
def myfunction(param1, param2):
    x = requests.get(url=param1, params=param2)
    print("Status code: {}".format(x.status_code))
The reason I have the time.sleep() line in the while True loop is to avoid the API rate limiting me, as I am only allowed one request every second. However, the timer only starts once the t thread has joined. This means that if the request to the API takes 0.2 seconds, I am only sending a request every 1.2 seconds, rather than every second.
Is there any way to make the t thread join automatically once it has reached the end of the myfunction function?
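A minimal sketch of one workaround, assuming the rate limit only cares about when each request starts: skip the per-iteration join() so the sleep alone paces the loop. param1 and param2 here are placeholders for the question's URL and query parameters.

import threading
import time
import requests

param1 = "https://example.com/api"  # placeholder URL
param2 = {"key": "value"}           # placeholder query parameters

def myfunction(param1, param2):
    x = requests.get(url=param1, params=param2)
    print("Status code: {}".format(x.status_code))

threads = []
while True:
    t = threading.Thread(target=myfunction, args=(param1, param2))
    t.start()          # don't block on join(); the request finishes in the background
    threads.append(t)  # keep references in case you want to join them all at the end
    time.sleep(1)      # one new request per second, regardless of how long each takes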

Multiprocessing hanging with requests.get

I have been working with a very simple bit of code, but the behavior is very strange. I am trying to send a request to a webpage using requests.get, but if the request takes longer than a few seconds, I would like to kill the process. I am following the response from the accepted answer here, but changing the function body to include the request. My code is below:
import multiprocessing as mp, requests
import time  # needed for time.sleep(3) below

def get_page(_r):
    _rs = requests.get('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas').text
    _r.put(_rs)

q = mp.Queue()
p = mp.Process(target=get_page, args=(q,))
p.start()
time.sleep(3)
p.terminate()
p.join()

try:
    result = q.get(False)
    print(result)
except:
    print('failed')
The code above simply hangs when I run it. However, when I run
requests.get('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas').text
independently, a response is returned in under two seconds. Therefore, the main code should print the page's HTML; however, it just stalls. Oddly, when I put an infinite loop in get_page:
def get_page(_r):
    while True:
        pass
    _r.put('You will not see this')
the process is indeed terminated after three seconds. Therefore, I am certain the behavior has to do with requests. How could this be? I discovered a similar question here, but I am not using async. Could the issue have to do with monkey patching, as I am using requests along with time and multiprocessing? Any suggestions or comments would be appreciated. Thank you!
I am using:
Python 3.7.0
requests 2.21.0
Edit: @Hitobat pointed out that the timeout parameter of requests can be used instead. This does indeed work; however, I would appreciate any other ideas about why the request is failing with multiprocessing.
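For reference, a minimal sketch of the timeout approach mentioned in the edit, using the timeout parameter of requests.get:

import requests

url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
try:
    result = requests.get(url, timeout=3).text  # give up if the server stays silent for 3 seconds
    print(result)
except requests.exceptions.Timeout:
    print('failed')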
I have reproduced your scenario and I have to refute the mentioned supposition "I am certain the behavior has to do with requests".
requests.get(...) returns the response as expected.
Let's see how the process goes with some debug points:
import multiprocessing as mp, requests
import time

def get_page(_r):
    _rs = requests.get('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas').text
    print('--- response header', _rs[:17])
    _r.put(_rs)

q = mp.Queue()
p = mp.Process(target=get_page, args=(q,))
p.start()
time.sleep(3)
p.terminate()
p.join()

try:
    print('--- get data from queue of size', q.qsize())
    result = q.get(False)
    print(result)
except Exception as ex:
    print('failed', str(ex))
The output:
--- response header
<!DOCTYPE html>
--- get data from queue of size 1
As we can see, the response is there and the process even advanced to the try block statements, but it hangs/stops at the q.get() statement when trying to extract data from the queue.
Therefore we may conclude that the queue is likely to be corrupted.
And we have a corresponding warning in the multiprocessing library documentation (Pipes and Queues section):
Warning
If a process is killed using Process.terminate() or os.kill() while
it is trying to use a Queue, then the data in the queue is likely to
become corrupted. This may cause any other process to get an exception
when it tries to use the queue later on.
Looks like this is that kind of case.
How can we handle this issue?
A known workaround is to use mp.Manager().Queue() (which adds an intermediate proxying level) instead of mp.Queue:
...
q = mp.Manager().Queue()
p = mp.Process(target=get_page, args=(q,))
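Applied to the code above, a minimal sketch of that workaround:

import multiprocessing as mp
import time
import requests

def get_page(_r):
    _rs = requests.get('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas').text
    _r.put(_rs)

if __name__ == '__main__':
    q = mp.Manager().Queue()  # proxied queue survives terminate() of the worker
    p = mp.Process(target=get_page, args=(q,))
    p.start()
    time.sleep(3)
    p.terminate()
    p.join()
    try:
        result = q.get(False)
        print(result[:100])
    except Exception as ex:
        print('failed', str(ex))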
