Parallelize processing of a huge list

I need to do a lot of web scraping from domains stored in a .txt file (about 50 MB in size).
I want to do it multi-threaded, so I am loading a number of entries into a Python list and processing each one in its own thread.
Example:
biglist = ['google.com', 'facebook.com', 'apple.com']
# one thread per domain; each thread receives the domain it should fetch
threads = [threading.Thread(target=fetch_url, args=(domain,))
           for domain in biglist]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
It works but it seems to me that it's not very efficient, as there is a lot of memory usage and it takes a lot of time to complete.
What better ways are there to achieve what I'm doing?

Don't use lists/threads, but a queue/processes instead.
If you know Redis, I suggest RQ (http://python-rq.org). I'm doing the same thing and it works quite nicely.
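The queue idea can also be tried with the standard library alone before bringing in Redis. What follows is only a minimal sketch, assuming a fixed number of worker threads reading from a bounded `queue.Queue`; `fetch_url` here is a hypothetical stand-in that makes no network request, and `process_domains`, `num_workers` and `max_pending` are names invented for the illustration.

```python
import queue
import threading

def fetch_url(domain):
    # Hypothetical stand-in for the real scraping call (no network access).
    return f"fetched {domain}"

def worker(tasks, results):
    # Pull domains until the None sentinel arrives.
    while True:
        domain = tasks.get()
        try:
            if domain is None:
                break
            try:
                results.append((domain, fetch_url(domain)))
            except Exception as exc:
                # Record the failure instead of letting the worker die.
                results.append((domain, exc))
        finally:
            tasks.task_done()

def process_domains(domains, num_workers=4, max_pending=100):
    tasks = queue.Queue(maxsize=max_pending)  # bounded: put() blocks when full
    results = []
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for domain in domains:  # may be a lazy iterator, e.g. stripped lines of a file
        tasks.put(domain)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Because the queue is bounded and `domains` can be any iterator, the input does not have to be loaded into one big list first, and the number of threads stays fixed no matter how many domains there are.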

You are using the wrong library. threading.Thread does not really benefit from multiple processors as it is blocked by the Global Interpreter Lock.
From the documentation of the threading module:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
I suggest you use a process pool from the multiprocessing module and map() to compute the results in parallel. In that case it doesn't make sense to use more processes than you have processors.
From the documentation of the multiprocessing module:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
Example:
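from multiprocessing import Pool
number_of_processors = 3
data = range(10)
def func(x):
    print("processing", x)
    return x * x
# the guard keeps child processes from re-running the pool setup on import
if __name__ == "__main__":
    with Pool(number_of_processors) as pool:
        ret = pool.map(func, data)
    print(ret)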
The output is:
$ python test.py
processing 0
processing 1
processing 3
processing 2
processing 4
processing 5
processing 6
processing 7
processing 8
processing 9
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
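The threading documentation quoted above also says that threads remain appropriate for I/O-bound work, which is what web scraping mostly is. As a hedged sketch of that alternative, the same map() pattern can be run on a thread pool from `concurrent.futures`. `fake_fetch` is a hypothetical placeholder that sleeps instead of making a request, and `fetch_all` and `max_workers=8` are arbitrary choices for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(domain):
    # Hypothetical stand-in for a network request; sleeping releases the GIL
    # much like waiting on a socket does.
    time.sleep(0.05)
    return domain, len(domain)

def fetch_all(domains, max_workers=8):
    # A fixed pool of threads is reused for all domains.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() returns the results in input order.
        return list(executor.map(fake_fetch, domains))

if __name__ == "__main__":
    print(fetch_all(["google.com", "facebook.com", "apple.com"]))
```

Unlike the process pool, the number of workers here is not tied to the number of processors, since the threads spend most of their time waiting.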

Adding to moooeeeep's answer.
There is another way of handling a lot of connections in one thread, without spawning expensive processes. Gevent has an API similar to multiprocessing/multithreading.
Docs and tutorials:
http://www.gevent.org/intro.html
http://sdiehl.github.io/gevent-tutorial/
There is also a Python framework for scraping URLs:
http://scrapy.org/
Example from the docs, fetching URLs synchronously and asynchronously:
import gevent.monkey
gevent.monkey.patch_socket()
import gevent
import urllib2
import simplejson as json
def fetch(pid):
    response = urllib2.urlopen('http://json-time.appspot.com/time.json')
    result = response.read()
    json_result = json.loads(result)
    datetime = json_result['datetime']
    print('Process %s: %s' % (pid, datetime))
    return json_result['datetime']
def synchronous():
    for i in range(1,10):
        fetch(i)
def asynchronous():
    threads = []
    for i in range(1,10):
        threads.append(gevent.spawn(fetch, i))
    gevent.joinall(threads)
print('Synchronous:')
synchronous()
print('Asynchronous:')
asynchronous()

Related

How do I write a robust multithreading map function?

I can apply this function successfully to parallelize function across a list:
def multiprocessing_run_func_with_list(func_now, list_now, threadingcount=8):
    from multiprocessing import Pool, TimeoutError
    with Pool(processes=threadingcount) as pool:
        results = pool.map(func_now, list_now)
    return results
However, when I switch to threading, the Jupyter notebook just dies when running the following. (It used to run well; I am not sure why it no longer does.)
def multithreading_run_func_with_list(func_now, list_now, threadingcount=8):
    from multiprocessing.pool import ThreadPool
    if threadingcount is not None:
        threads = ThreadPool(threadingcount)  # Initialize the desired number of threads
    else:
        threads = ThreadPool(len(list_now))  # Initialize the desired number of threads
    # results = threads.map(func_now, list_now)
    results = threads.starmap(func_now, [(j,) for j in list_now])
    return results
Is there an error in this code, or a better way to call such multithreading?

Python restricts multi-threading to avoid memory conflicts (it does so using the GIL). I would recommend you just use multi-processing and batching to make use of all your cores instead of trying to multi-thread - multi-threading and python just do not mix at the most fundamental level of the language.
Good Luck!
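For completeness, here is a hedged sketch of a thread-based map that avoids two things visible in the question's version: the ThreadPool is never closed, and with threadingcount=None it starts one thread per list item. This is not claimed to fix the notebook crash, whose cause is not known from the post. `thread_map` and `max_workers` are names made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def thread_map(func, items, max_workers=8):
    items = list(items)
    if not items:
        return []
    # Never start more threads than there are items, and never more than the cap.
    workers = min(max_workers, len(items))
    # The context manager shuts the pool down even if func raises.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Results come back in input order; an exception raised by func
        # is re-raised here when its result is consumed.
        return list(executor.map(func, items))
```

Usage would look like `thread_map(my_func, my_list, max_workers=8)`, mirroring the signature of the functions in the question.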

Multithreading inside Multiprocessing in Python

I am using the concurrent.futures module to do multiprocessing and multithreading. I am running it on an 8-core machine with 16 GB RAM and an Intel i7 8th-gen processor. I tried this on Python 3.7.2 and also on Python 3.8.2.
import concurrent.futures
import time
# takes a list and multiplies each element by 2
def double_value(x):
    y = []
    for elem in x:
        y.append(2 * elem)
    return y
# multiplies a single element by 2
def double_single_value(x):
    return 2 * x
# define a
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)
# runs multiple threads and multiplies each element by 2
def get_double_value(x):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(double_single_value, x)
    return list(results)
The code shown below ran in 115 seconds. It uses only multiprocessing. CPU utilization for this piece of code is 100%.
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(double_value, a)
print(time.time()-t)
The code below took more than 9 minutes and consumed all the RAM of the system, after which the system killed all the processes. CPU utilization during this piece of code also did not reach 100% (~85%).
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(get_double_value, a)
print(time.time()-t)
I really want to understand:
1) Why is the code that first splits the work across multiple processes and then runs multiple threads inside each one not faster than the code that uses only multiprocessing?
(I have gone through many posts that describe multiprocessing and multi-threading, and the crux I took away is that multi-threading is for I/O-bound work and multiprocessing is for CPU-bound work.)
2) Is there a better way of doing multi-threading inside multiprocessing to get maximum utilization of the allotted cores (or CPUs)?
3) Why did that last piece of code consume all the RAM? Was it due to multi-threading?

You can mix concurrency with parallelism.
Why? You may have valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I combine different approaches to print 16000 lines (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time
async def print_info(value):
    await asyncio.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )
async def await_async_logic(values):
    await asyncio.gather(
        *(
            print_info(value)
            for value in values
        )
    )
def run_async_logic(values):
    asyncio.run(await_async_logic(values))
def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            run_async_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)
multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time
def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )
def multithreading_logic(values):
    threads = []
    for value in values:
        threads.append(threading.Thread(target=print_info, args=(value,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)
multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process
def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )
def multithreading_logic(values):
    with thread.ThreadPoolExecutor() as multithreading_executor:
        multithreading_executor.map(
            print_info,
            values,
        )
def multiprocessing_executor():
    start = time.time()
    with process.ProcessPoolExecutor() as multiprocessing_executor:
        multiprocessing_executor.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)
multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
To make this comment, I modified the test so that it only makes 1600 prints (modifying the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
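The note above about swapping await asyncio.sleep(1) for time.sleep(1) shows what happens when a blocking call runs directly inside a coroutine: the event loop stalls and the tasks run one after another. One common way around this, sketched here under the assumption of Python 3.9 or later (where `asyncio.to_thread` is available), is to hand the blocking call to a worker thread. `blocking_work`, `run_all` and `main` are hypothetical names, and the short sleep merely stands in for a blocking library call.

```python
import asyncio
import time

def blocking_work(value):
    # Stand-in for a blocking call that has no async equivalent.
    time.sleep(0.05)
    return value * 2

async def run_all(values):
    # Each blocking call runs in the default thread pool, so the event loop
    # stays free to schedule the other tasks in the meantime.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_work, value) for value in values)
    )

def main(values):
    return asyncio.run(run_all(values))
```

The degree of overlap is then limited by the size of the default thread pool rather than by the event loop, so this is closer to the threading variants above than to pure asyncio.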

As you say: "I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes".
You need to figure out whether your program is I/O-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random, or all of them at the same time, usually only makes things worse.

Using threading in pure Python for CPU-bound problems is a bad approach, regardless of whether you also use multiprocessing. Try to redesign your app to use only multiprocessing, or use third-party libraries such as Dask.

I believe you figured it out, but I wanted to answer. Your function double_single_value is clearly CPU-bound; it has nothing to do with I/O. For CPU-bound tasks, using multiple threads is worse than using a single thread, because the GIL does not allow Python code to actually run on several threads at once, so you effectively end up running on a single thread. In addition, a thread may be switched out before it finishes a task, and when it is scheduled again its state has to be loaded back onto the CPU, which makes things even slower.

Based on your code, most of the work is computation, so multiprocessing is the recommended way to solve your problem: it is CPU-bound and NOT I/O-bound (I/O-bound means things like sending requests to websites and waiting for the server's response, or writing to and reading from disk). This is true for Python as far as I know. The Python GIL (Global Interpreter Lock) will make your code run slowly, because it is a mutex (a lock) that allows only one thread at a time to take control of the Python interpreter, so you get concurrency but not parallelism. Threading is perfectly fine for I/O-bound tasks, where it can beat multiprocessing in execution time, but in your case I would encourage you to use multiprocessing, because each Python process gets its own interpreter and memory space, so the GIL won't be a problem.
I am not so sure about combining multithreading with multiprocessing, but as far as I know it can cause inconsistencies in the processed results: you need more boilerplate code for data synchronization if you want the processes to communicate (IPC), and kernel-level threads are somewhat unpredictable (and thus inconsistent at times), since they are scheduled by the OS and can be preempted at any time because of time sharing. I won't stop you from writing that code, but be really sure of what you are doing. Who knows, you might come up with a solution to it one day.

Simple multithread for loop in Python

I searched everywhere and I don't find any simple example of iterating a loop with multithreading.
For example, how can I multithread this loop?
for item in range(0, 1000):
    print(item)
Is there any way to cut it in like 4 threads, so each thread has 250 iterations?

Easiest way is with multiprocessing.dummy (which uses threads instead of processes) and a Pool
import multiprocessing.dummy as mp
def do_print(s):
    print(s)
if __name__ == "__main__":
    p = mp.Pool(4)
    p.map(do_print, range(0, 10))  # range(0, 1000) if you want to replicate your example
    p.close()
    p.join()
Maybe you want to try real multiprocessing, too if you want to better utilize multiple CPUs but there are several caveats and guidelines to follow then.
Possibly other methods of Pool would better suit your needs - depending on what you are actually trying to do.

You'll have to do the splitting manually:
import threading
def ThFun(start, stop):
    for item in range(start, stop):
        print(item)
# start one thread per block of 100 items
for n in range(0, 1000, 100):
    stop = n + 100 if n + 100 <= 1000 else 1000
    threading.Thread(target=ThFun, args=(n, stop)).start()
This code uses multithreading, which means that everything will be run within a single Python process (i.e. only one Python interpreter will be launched).
Multiprocessing, discussed in the other answer, means running some code in several Python interpreters (in several processes, not threads). This may make use of all the CPU cores available, so this is useful when you're focusing on the speed of your code (print a ton of numbers until the terminal hates you!), not simply on parallel processing. [1]
[1] multiprocessing.dummy turns out to be a wrapper around the threading module. multiprocessing and multiprocessing.dummy have the same interface, but the first module does parallel processing using processes, while the latter uses threads.
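The question asked specifically for 4 threads with 250 iterations each. As a rough sketch of that manual split, the helper below cuts a range into a given number of contiguous blocks and starts one thread per block. `split_range` and `run_in_threads` are hypothetical names introduced for the illustration, and the worker passed in is assumed to take a (start, stop) pair, like ThFun above.

```python
import threading

def split_range(start, stop, parts):
    # Divide [start, stop) into `parts` contiguous blocks whose sizes
    # differ by at most one item.
    base, extra = divmod(stop - start, parts)
    bounds = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds

def run_in_threads(func, start, stop, parts=4):
    # One thread per block; func is called as func(block_start, block_stop).
    threads = [threading.Thread(target=func, args=(lo, hi))
               for lo, hi in split_range(start, stop, parts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every block has been processed
```

With the values from the question, `split_range(0, 1000, 4)` gives the blocks (0, 250), (250, 500), (500, 750) and (750, 1000).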

Since Python 3.2, the concurrent.futures standard library module provides primitives to concurrently map a function across iterables. Since map and for are closely related, this makes it easy to convert a for loop into a multi-threaded/multi-processed loop:
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    executor.map(print, range(0, 1000))

multiprocessing.pool context and load balancing

I've encountered some unexpected behaviour of the python multiprocessing Pool class.
Here are my questions:
1) When does Pool create its context, which is later used for serialization? The example below runs fine as long as the Pool object is created after the Container definition. If you swap the Pool initializations, a serialization error occurs. In my production code I would like to initialize Pool well before defining the container class. Is it possible to refresh the Pool "context" or to achieve this in another way?
2) Does Pool have its own load balancing mechanism, and if so, how does it work?
If I run a similar example on my i7 machine with a pool of 8 processes, I get the following results:
- For a light evaluation function, Pool favours using only one process for computation. It creates 8 processes as requested, but most of the time only one is used (I printed the pid from inside and also see this in htop).
- For a heavy evaluation function, the behaviour is as expected. It uses all 8 processes equally.
3) When using Pool I always see 4 more processes than I requested (i.e. for Pool(processes=2) I see 6 new processes). What is their role?
I use Linux with Python 2.7.2
from multiprocessing import Pool
from datetime import datetime
POWER = 10
def eval_power(container):
    for power in xrange(2, POWER):
        container.val **= power
    return container
#processes = Pool(processes=2)
class Container(object):
    def __init__(self, value):
        self.val = value
processes = Pool(processes=2)
if __name__ == "__main__":
    cont = [Container(foo) for foo in xrange(20)]
    then = datetime.now()
    processes.map(eval_power, cont)
    now = datetime.now()
    print "Eval time:", now - then
EDIT - TO BAKURIU
1) I was afraid that that's the case.
2) I don't understand what the Linux scheduler has to do with Python assigning computations to processes. My situation can be illustrated by the example below:
from multiprocessing import Pool
from os import getpid
from collections import Counter
def light_func(ind):
    return getpid()
def heavy_func(ind):
    for foo in xrange(1000000):
        ind += foo
    return getpid()
if __name__ == "__main__":
    list_ = range(100)
    pool = Pool(4)
    l_func = pool.map(light_func, list_)
    h_func = pool.map(heavy_func, list_)
    print "light func:", Counter(l_func)
    print "heavy func:", Counter(h_func)
On my i5 machine (4 threads) I get the following results:
light func: Counter({2967: 100})
heavy func: Counter({2969: 28, 2967: 28, 2968: 23, 2970: 21})
It seems that the situation is as I've described it. However I still don't understand why python does it this way. My guess would be that it tries to minimise communication expenses, but still the mechanism which it uses for load balancing is unknown. The documentation isn't very helpful either, the multiprocessing module is very poorly documented.
3) If I run the above code I get 4 more processes as described before. The screen comes from htop: http://i.stack.imgur.com/PldmM.png

1) The Pool object creates the subprocesses during the call to __init__, hence you must define Container before creating the Pool. By the way, I wouldn't put all the code in a single file; I would use a module to implement the Container and other utilities, and write a small file that launches the main program.
2) The Pool does exactly what is described in the documentation. In particular, it has no control over the scheduling of the processes, so what you see is what Linux's scheduler thinks is right. Small computations take so little time that the scheduler doesn't bother parallelizing them (this probably gives better performance due to core affinity, etc.).
3) Could you show this with an example and what you see in the task manager? I think they may be the processes that handle the queue inside the Pool, but I'm not sure. On my machine I can see only the main process plus the two subprocesses.
Update on point 2:
The Pool object simply puts the tasks into a queue, and the child processes get the arguments from this queue. If a process takes almost no time to execute an item, then the Linux scheduler lets that process run longer (hence consuming more items from the queue). If the execution takes a long time, the scheduler switches processes, and thus the other child processes are also executed.
In your case a single process is consuming all the items because the computation takes so little time that it has already finished all of them before the other child processes are ready.
As I said, Pool doesn't do anything about balancing the work of the subprocesses. It's simply a queue and a bunch of workers: the pool puts items in the queue and the processes get the items and compute the results. AFAIK the only thing it does to control the queue is put a certain number of tasks into a single item in the queue (see the documentation), but there is no guarantee about which process will grab which task. Everything else is left to the OS.
On my machine the results are less extreme. Two processes get about twice as many calls as the other two for the light computation, while for the heavy one all of them process more or less the same number of items. On different OSes and/or hardware we would probably obtain different results again.
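The answer mentions that Pool puts a certain number of tasks into a single queue item; this is the chunksize argument of Pool.map. The sketch below reproduces the default chunk size calculation as I understand it from CPython's multiprocessing/pool.py. Treat it as an assumption about an implementation detail that may differ between versions, not as documented behaviour. `default_chunksize` and `number_of_chunks` are hypothetical helper names.

```python
import math

def default_chunksize(num_items, num_workers):
    # Assumed to mirror Pool.map when chunksize is None: aim for roughly
    # four chunks per worker, rounding the chunk size up.
    if num_items == 0:
        return 0
    chunksize, extra = divmod(num_items, num_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

def number_of_chunks(num_items, num_workers):
    # How many queue items the task list is cut into under that assumption.
    size = default_chunksize(num_items, num_workers)
    return math.ceil(num_items / size) if size else 0
```

Under this assumption, the 100 items and 4 workers from the question would be cut into chunks of 7 items, i.e. 15 queue items. For light_func a worker can finish a chunk almost instantly and grab the next one, which is consistent with one process ending up handling everything.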

python twisted threading

Hi, can you please tell me how to run different functions in different threads using a thread pool in Twisted? Say
I have a list of ids x=[1,2,3,4], where 1, 2, ... etc. are ids (I got them from a database, and each one refers to a Python script somewhere on disk).
What I want to do is
traverse the list x and run every script in a different thread until they have all completed.
Thanks Calderone, your code helped me a lot.
I have a few doubts. For example, I can resize the thread pool this way:
from twisted.internet import reactor
reactor.suggestThreadPoolSize(30)
Say all 30 available threads are busy and there are still some ids in the list (dict or tuple).
1- In this situation, will all ids be traversed? I mean, as soon as a thread is free, will the next tool (id) be assigned to the freed thread?
2- There are also some cases where one tool must be executed before a second tool, and one tool's output will be used by another tool. How will that be managed with Twisted threads?

Threads in Twisted are primarily used via twisted.internet.threads.deferToThread. Alternatively, there's a new interface which is slightly more flexible, twisted.internet.threads.deferToThreadPool. Either way, the answer is roughly the same, though. Iterate over your data and use one of these functions to dispatch it to a thread. You get back a Deferred from either which will tell you what the result is, when it is available.
from twisted.internet.threads import deferToThread
from twisted.internet.defer import gatherResults
from twisted.internet import reactor
def double(n):
    return n * 2
data = [1, 2, 3, 4]
results = []
# dispatch each item to a thread; each call returns a Deferred
for datum in data:
    results.append(deferToThread(double, datum))
# fires once all the Deferreds have fired
d = gatherResults(results)
def displayResults(results):
    print('Doubled data:', results)
d.addCallback(displayResults)
d.addCallback(lambda ignored: reactor.stop())
reactor.run()
You can read more about threading in Twisted in the threading howto.
