Simultaneous parsing, python

Simultaneous parsing, python - python

I have a python program that sequentially parses through 30,000+ files.
Is there a way I could break this up into multiple threads (is this the correct terminology?)
and parse through chunks of that file at the same time. Say having 30 algorithms parsing through 1000 files each.

This is easy.
You can create 30 threads explicitly and give each of them 1000 filenames.
But, even simpler, you can create a pool of 30 threads and have them service a thread with 30000 filenames. That gives you automatic load balancing—if some of the files are much bigger than others, you won't have one thread finishing when another one's only 10% done.
The concurrent.futures module gives you a nice way to execute tasks in parallel (including passing arguments to the tasks and receiving results, or even exceptions if you want). If you're using Python 2.x or 3.1, you will need to install the backport futures. Then you just do this:
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
results = executor.map(parse_file, filenames)
Now, 30 workers is probably way too many. You'll overwhelm the hard drive and its drivers and end up having most of your threads waiting for the disk to seek. But a small number may be worth doing. And it's ridiculously easy to tweak max_workers and test the timing and see where the sweet spot is for your system.
If your code is doing more CPU work than I/O work—that is, it spends more time parsing strings and building complicated structures and the like than it does reading from the disk—then threads won't help, at least in CPython, because of the Global Interpreter Lock. But you can solve that by using processes.
From a code point of view, this is trivial: just change ThreadPoolExecutor to ProcessPoolExecutor.
However, if you're returning large or complex data structures, the time spent serializing them across the process boundary may eat into, or even overwhelm, your savings. If that's the case, you can sometimes improve things by batching up larger jobs:
def parse_files(filenames):
return [parse_file(filename) for filename in filenames]
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
results = executor.map(parse_files, grouper(10, filenames))
But sometimes you probably need to drop to a lower level and use the multiprocessing module, which has features like inter-process memory sharing.
If you can't/don't want to use futures, 2.6+ have multiprocessing.Pool for a plain processor pool, and a thread pool with the same interface under the name multiprocessing.ThreadPool (not documented) or multiprocessing.dummy.Pool (documented but ugly).
In a trivial case like this, there's really no difference between a plain pool and an executor. And, as mentioned above, in very complicated cases, multiprocessing lets you get under the hood. In the middle, futures is often simpler. But it's worth learning both.

Related

Python multiprocessing - reassigning jobs dynamically from pool - without async?

So I have a batch of 1000 tasks that I assign using parmap/python multiprocessing module to 8 cores (dual xeon machine 16 physical cores). Currently this runs using synchronized.
The issue is that usually 1 of the cores lags well behind the other cores and still has several jobs/tasks to complete after all the other cores finished their work. This may be related to core speed (older computer) but more likely due to some of the tasks being more difficult than others - so the 1 core that gets the slightly more difficult jobs gets laggy...
I'm a little confused here - but is this what asynch parallelization does? I've tried using it before, but because this step is part of a very large processing step - it wasn't clear how to create a barrier to force the program to wait until all async processes are done.
Any advice/links to similar questions/answers are appreciated.
[EDIT] To clarify, the processes are ok to run independently, they all save data to disk and do not share variables.

parmap author here
By default, both in multiprocessing and in parmap, tasks are divided in chunks and chunks are sent to each multiprocessing process (see the multiprocessing documentation). The reason behind this is that sending tasks individually to a process would introduce significant computational overhead in many situations. The overhead is reduced if several tasks are sent at once, in chunks.
The number of tasks on each chunk is controlled with chunksize in multiprocessing (and pm_chunksize in parmap). By default, chunksize is computed as "number of tasks"/(4*"pool size"), rounded up (see the multiprocessing source code). So for your case, 1000/(4*4) = 62.5 -> 63 tasks per chunk.
If, as in your case, many computationally expensive tasks fall into the same chunk, that chunk will take a long time to finish.
One "cheap and easy" way to workaround this is to pass a smaller chunksize value. Note that using the extreme chunksize=1 may introduce undesired larger cpu overhead.
A proper queuing system as suggested in other answers is a better solution on the long term, but maybe an overkill for a one-time problem.

You really need to look at creating microservices and using a queue pool. For instance, you could put a list of jobs in celery or redis, and then have the microservices pull from the queue one at a time and process the job. Once done they pull the next item and so forth. That way your load is distributed based on readiness, and not based on a preset list.
http://www.celeryproject.org/
https://www.fullstackpython.com/task-queues.html

Why do we blame GIL if CPU can execute one process (light weight) at a time? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Fastest way to handle threading with callback in Python

I'm playing around with some measurement instruments through PyVisa (Python wrapper for VISA). Specifically I need to read the measurement value from four instruments, similar to this:
current1 = instrument1.ask("READ?")
current2 = instrument2.ask("READ?")
current3 = instrument3.ask("READ?")
current4 = instrument4.ask("READ?")
For my application, speed is a must. Individually, I can get between 50 and 200 measurements per second from the four instruments, but unfortunately my current code evaluates the four instruments serially.
From what I've read, there's some options with Threading and Multiprocessing in Python, but it's not obvious what the best, and fastest, option for me.
Best case scenario I will be spawning ~4x50 threads per second, so overhead is a bit of a concern.
The task is in no way CPU intensive, it is simply a matter of waiting for the readout from the instrument.
Any advice on what the proper course of action might be?
Many thanks in advance!

Try using the multiprocessing module in python. Because of the way the python interpreter is written, the threading module for python actually slows down your application when using multiple threads on a computer with multiple processors. This is due to something called the Global Interpreter Lock (GIL) which only allows one python thread per process to be run at a time. This makes things like memory management and concurrency easier for the interpreter.
I would suggest trying to create a process for each instrument (producer process) and send the results to a shared queue that gets processed by a single process (consumer process).
I would also limit the number of processes that you create to the number of processors on your computer. Creating too many processes introduces a lot of overhead.
If you really feel like you need the 200 threads you suggested in your post, I would program this in Java or C++.

I'm not sure if this fits your requirements I'm not too familiar with pyVisa, however have you looked into pypy's stackless examples the speed increase is pretty impressive, and you're able to spawn almost infinite amount of 'threadlets' without much hindrance to your system / program.
pypy is kinda limited in terms of supporting various packages but looks like pyVisa isn't one of them.
https://pypi.python.org/pypi/PyVISA

Can standard C Python has more than one thread running at the same time? [duplicate]

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

How to maximize performance in Python when doing many I/O bound operations?

I have a situation where I'm downloading a lot of files. Right now everything runs on one main Python thread, and downloads as many as 3000 files every few minutes. The problem is that the time it takes to do this is too long. I realize Python has no true multi-threading, but is there a better way of doing this? I was thinking of launching multiple threads since the I/O bound operations should not require access to the global interpreter lock, but perhaps I misunderstand that concept.

Multithreading is just fine for the specific purpose of speeding up I/O on the net (although asynchronous programming would give even greater performance). CPython's multithreading is quite "true" (native OS threads) -- what you're probably thinking of is the GIL, the global interpreter lock that stops different threads from simultaneously running Python code. But all the I/O primitives give up the GIL while they're waiting for system calls to complete, so the GIL is not relevant to I/O performance!
For asynchronous programming, the most powerful framework around is twisted, but it can take a while to get the hang of it if you're never done such programming. It would probably be simpler for you to get extra I/O performance via the use of a pool of threads.

Could always take a look at multiprocessing.

is there a better way of doing this?
Yes
I was thinking of launching multiple threads since the I/O bound operations
Don't.
At the OS level, all the threads in a process are sharing a limited set of I/O resources.
If you want real speed, spawn as many heavyweight OS processes as your platform will tolerate. The OS is really, really good about balancing I/O workloads among processes. Make the OS sort this out.
Folks will say that spawning 3000 processes is bad, and they're right. You probably only want to spawn a few hundred at a time.
What you really want is the following.
A shared message queue in which the 3000 URI's are queued up.
A few hundred workers which are all reading from the queue.
Each worker gets a URI from the queue and gets the file.
The workers can stay running. When the queue's empty, they'll just sit there, waiting for work.
"every few minutes" you dump the 3000 URI's into the queue to make the workers start working.
This will tie up every resource on your processor, and it's quite trivial. Each worker is only a few lines of code. Loading the queue is a special "manager" that's just a few lines of code, also.

Gevent is perfect for this.
Gevent's use of Greenlets (lightweight coroutines in the same python process) offer you asynchronous operations without compromising code readability or introducing abstract 'reactor' concepts into your mix.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.