I am writing a script that uses sockets to accept many file transfers simultaneously from many computers on a subnet (around 40 JPEG files in total). I want to use multithreading or multiprocessing to make the transfer as fast as possible.
I'm wondering if this type of image transfer is limited by the CPU - and therefore I should use multiprocessing - or if multithreading will be just as good here.
I would also be curious as to what types of activities are limited by the CPU and require multiprocessing, and which are better suited for multithreading.
If the following assumptions are true:
Your script is simply receiving data from the network and writing that data to disk (more or less) verbatim, i.e. it isn't doing any expensive processing on the data
Your script is running on a modern CPU with typical modern networking hardware (e.g. gigabit Ethernet or slower)
Your script's download routines are not grossly inefficient (e.g. you are receiving reasonably-sized chunks of data and not just 1 byte at a time or something silly like that)
... then it's unlikely that your download rate will be CPU-limited. More likely the bottleneck will be either network bandwidth or disk I/O bandwidth.
In any case, since AFAICT your use-case is embarrassingly parallel (i.e. the various downloads never have to communicate or interact with each other, they just each do their own thing independently), it's unlikely that using multithreading vs multiprocessing will make much difference in terms of performance. Of course, the only way to be certain is to try it both ways and measure the throughput each way.
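As a rough illustration of the threaded variant, here is a minimal sketch, assuming each sender opens one TCP connection per file and closes it when the file is complete. The names receive_to_file, serve and make_path, and the 64 KiB chunk size, are made up for this example and are not from the question.

```python
import socket
import threading

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune it by measuring


def receive_to_file(conn, path, chunk_size=CHUNK_SIZE):
    """Copy everything the peer sends on `conn` into `path`; return the byte count."""
    total = 0
    with conn, open(path, "wb") as out:
        while True:
            chunk = conn.recv(chunk_size)
            if not chunk:  # empty read: the sender closed the connection
                break
            out.write(chunk)
            total += len(chunk)
    return total


def serve(listener, make_path, expected):
    """Accept `expected` connections and receive each one in its own thread."""
    threads = []
    for index in range(expected):
        conn, _addr = listener.accept()
        thread = threading.Thread(
            target=receive_to_file, args=(conn, make_path(index))
        )
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
```

The threads spend most of their time blocked in recv() or write(), which is why the GIL is unlikely to be the limiting factor for this kind of workload.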
Short answer:
Generally, it really depends on your workload. If you're serious about performance, please provide details: for example, whether you store the images to disk, whether image sizes exceed 1 GB, and so on.
Note: Again, generally, if it is not mission-critical, both ways are acceptable, since we can easily switch between multithreaded and multiprocess implementations using threading.Thread and multiprocessing.Process.
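To show how small the switch can be, here is a sketch that takes the worker class as a parameter. run_workers and save_copy are hypothetical helpers written for this example; the only real API relied on is that threading.Thread and multiprocessing.Process both accept target= and args= and both provide start() and join().

```python
import multiprocessing
import threading


def run_workers(worker_cls, target, arg_list):
    """Start one worker per argument tuple and wait for all of them.

    `worker_cls` is either threading.Thread or multiprocessing.Process.
    """
    workers = [worker_cls(target=target, args=args) for args in arg_list]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return len(workers)


def save_copy(data, path):
    # Hypothetical stand-in for "receive one image and store it".
    with open(path, "wb") as out:
        out.write(data)
```

Calling run_workers(threading.Thread, save_copy, jobs) or run_workers(multiprocessing.Process, save_copy, jobs) then differs in one argument only. With processes, the call should sit under an if __name__ == "__main__": guard on platforms that spawn a fresh interpreter for each worker.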
some more comments
It seems that not CPU but IO will be the bottleneck.
For multiprocess vs. multithread, there may be a performance difference due to the GIL and/or your implementation. You may implement both ways and try. BTW, IMHO it won't differ much. I think that async IO vs. blocking IO will have a greater impact.
Unless your file transfer is extremely slow (slower than writing the data to disk), multithreading/multiprocessing isn't going to help. By file transfer I mean downloading images and writing them to the local computer with a single HDD.
Using multithreading or multiprocessing when transferring data from several computers with separate disks can definitely improve overall download performance, simply because data on several physical disks can be read in parallel. The problem arises when you try to save these images to your local drive.
You have just a single local HDD (unless you use a disk array), and a single HDD, like most hardware devices, can perform only one I/O operation at a time. So trying to write several images to disk at the same time won't improve overall performance; it can even hamper it.
Just imagine 40 already-downloaded images being written to a single mechanical HDD, with its single head, at different locations (different physical files), especially if the disk is fragmented. This can even slow down the whole process, because the HDD wastes time moving its magnetic head from one position to another (drives can partially mitigate this by reordering I/O operations to limit head movement).
On the other hand if you do some preprocessing with these images that is CPU intensive and just then you are going to save them to disk, multithreading can be really helpful.
As for which is preferred: on modern OSs there is no significant difference between multithreading and multiprocessing (spawning multiple processes). OSs like Linux or Windows schedule threads, not processes, based on process and thread priorities, so there is not a big difference between 40 single-threaded processes and a single process containing 40 threads. Using multiple processes normally consumes more memory, because the OS has to allocate some extra memory (not much) for every process, but in terms of speed the difference between multithreading and multiprocessing is not significant. There are other important questions to consider when choosing a method: will the downloads share some data, such as a common GUI (multithreading is easier to use)? Are the files so big that 40 transfers could exhaust the virtual address space of a single process (use multiprocessing)?
Generally:
Multithreading - easier to use in a single application, because all threads share the virtual address space of a single process and can easily communicate with each other. On the other hand, a single process has a limited virtual address space (less than 4GB on a 32-bit computer).
Multiprocessing - harder to use in a single application (it needs inter-process communication), but more scalable and more robust (if a file transfer process crashes, only a single file transfer fails), with more virtual address space to use.
Related
Since this is a design question I don't think code would do it justice.
I'm making a program to process and log a high bandwidth stream of data and concurrently trying to observe that data live. (like a production line and an inspector watching the production line)
I want to distribute the load between cores on my computer to leave room for future functionality but I can't figure out if I can use multiprocessing for this. It seems most examples of multiprocessing all have initial data sets and outputs and don't need to be actively communicating throughout their lifetime.
Am I able to use multiprocessing to actively communicate between processes or is that a bad idea and I should stick with multithreading?
If you expect the computation needed to process the full load of signal processing exceeds the capacity of a single core, you should consider spreading the load over multiple cores using multi-processing.
If you expect the computation needed to fit on a single core, but you expect many slow or blocking I/O operations to hold up the work, you should consider multi-threading.
If overall performance is an issue, you should reconsider writing the actual solution in pure regular Python and instead look for an implementation in another language, or in a version of Python that gets you a solution closer to the hardware. You can of course still come up with a result that would be usable from a regular Python program.
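On the question of whether processes can keep talking to each other while they run: multiprocessing.Queue allows that. Below is a minimal sketch, assuming the samples are picklable; process_stream, run_pipeline, the STOP sentinel and the "multiply by 2" step are placeholders invented for this example.

```python
import multiprocessing as mp

STOP = None  # sentinel marking the end of the stream


def process_stream(inbox, outbox):
    """Consume samples until STOP arrives, forwarding a result for each one."""
    while True:
        sample = inbox.get()
        if sample is STOP:
            outbox.put(STOP)
            break
        outbox.put(sample * 2)  # placeholder for the real processing


def run_pipeline(samples):
    """Feed `samples` to a worker process and collect its results as they arrive."""
    inbox, outbox = mp.Queue(), mp.Queue()
    worker = mp.Process(target=process_stream, args=(inbox, outbox))
    worker.start()
    for sample in samples:
        inbox.put(sample)
    inbox.put(STOP)
    results = []
    while True:
        item = outbox.get()
        if item is STOP:
            break
        results.append(item)
    worker.join()
    return results
```

run_pipeline should be called under an if __name__ == "__main__": guard. Every item crossing a queue is pickled and copied, so for a high-bandwidth stream it is probably worth sending larger batches rather than single samples, and measuring.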
Multiprocessing and multithreading are two different ways of executing code.
In multiprocessing, you essentially just throw more raw computing power at it (hence the name), as you're using more CPUs.
https://en.wikipedia.org/wiki/Multiprocessing#:~:text=Multiprocessing%20is%20the%20use%20of,to%20allocate%20tasks%20between%20them.
Multithreading is when the CPU basically breaks the code up into more "threads" of code to perform multiple tasks simultaneously.
https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)#:~:text=In%20computer%20architecture%2C%20multithreading%20is,supported%20by%20the%20operating%20system.
For Python, which is normally linear (i.e. it runs from line 1 down to the end of the code in order), threading is better suited to network-bound work and multiprocessing to CPU-bound work.
https://timber.io/blog/multiprocessing-vs-multithreading-in-python-what-you-need-to-know/
So that's the difference, how the code turns out and how the system will operate will dictate whether you use multiprocessing or multithreading.
I am trying to understand whether my way of using multiprocessing.Pool is efficient. The method that I would like to run in parallel is a script that reads a certain file, does a calculation, and then saves the results to a different file. My code looks something like this:
from multiprocessing import Pool, TimeoutError
import deepdish.io as dd

def savefile(a, b, t, c, g, e, d):
    print(a)
    # Save the values to a file named after `a`.
    dd.save(str(a), {'b': b, 't': t, 'c': c, 'g': g, 'e': e, 'd': d})

def run_many_calcs():
    num_processors = 6
    print("Num processors - %d" % num_processors)
    pool = Pool(processes=num_processors)  # start 6 worker processes
    for a in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 't', 'y', 'e', 'r', 'w']:
        # apply() blocks until this call has finished
        pool.apply(savefile, args=(a, 4, 5, 6, 7, 8, 1))
How can I see that immediately after one process is finished in one of the processors it continues to the next file?
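One thing worth noting is that Pool.apply blocks until its result is ready, so a loop of apply calls runs the tasks one after another. A possible way to watch workers pick up the next task as soon as they are free is imap_unordered, sketched below. The work function is a hypothetical stand-in for savefile (it only sleeps), and the pool_factory parameter is added here so the same code can be tried with multiprocessing.pool.ThreadPool.

```python
import multiprocessing
import os
import time


def work(name):
    """Hypothetical stand-in for savefile: report which process ran the task."""
    time.sleep(0.01)  # pretend to compute and save
    return name, os.getpid()


def run_many(names, num_workers=6, pool_factory=multiprocessing.Pool):
    """Run `work` over `names`, printing each result as soon as it is ready."""
    finished = []
    pool = pool_factory(num_workers)
    try:
        # imap_unordered yields results in completion order, not input order.
        for name, pid in pool.imap_unordered(work, names):
            print("task %s finished in process %d" % (name, pid))
            finished.append((name, pid))
    finally:
        pool.close()
        pool.join()
    return finished
```

With a real process pool (and the call placed under an if __name__ == "__main__": guard), the printed process IDs show which worker handled each file.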
When considering performance of any program, you have to work out if the performance is bound by I/O (memory, disk, network, whatever) or Compute (core count, core speed, etc).
If I/O is the bottleneck, there's no point having multiple processes, a faster CPU, etc.
If the computation is taking up all the time, then it is worth investing in multiple processes, etc. "Computation time" is often diagnosed as being the problem, but on closer investigation turns out to be limited by the computer's memory bus speed, not the clock rate of the cores. In such circumstances adding multiple processes can make things worse...
Check
You can check yours by doing some performance profiling of your code (there's bound to be a whole load of profiling tools out there for Python).
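For instance, the standard library's cProfile and pstats modules can do this. The sketch below wraps a call and returns a text report; profile_call is a helper name made up for this example.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # show the top 10 entries
    return result, buffer.getvalue()
```

If the top entries of the report are read or write calls, that points towards I/O rather than computation.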
My Guess
Most of the time these days it's I/O that's the bottleneck. If you don't want to profile your code, betting on a faster SSD is likely the best initial approach.
Unsolvable Computer Science Problem
The architectural features of modern CPUs (L1, L2, L3 cache, QPI, hyperthreads) are all symptoms of the underlying problem in computer design; cores are too quick for the I/O we can wrap around them.
For example, the time taken to transfer 1 byte from SDRAM to the core is exceedingly slow in comparison to the core speed. One just has to hope that the L3, L2 and L1 cache subsystems have correctly predicted the need for that 1 byte and have already fetched it ahead of time. If not, there's a big delay; that's where hyperthreading can help the overall performance of the computer's other processes (they can nip in and get some work done), but does absolutely nothing for the stalled program.
Data fetched from files or networks is very slow indeed.
File System Caching
In your case it sounds like you have 1 single input file; that will at least get cached in RAM by the OS (provided it's not too big).
You may be tempted to read it into memory yourself; I wouldn't bother. If it's large you would be allocating a large amount of memory to hold it, and if that's too big for the RAM in the machine the OS will swap some of that RAM out to the virtual memory page file anyway, and you're worse off than before. If it's small enough there's a good chance the OS will cache the whole thing for you anyway, saving you the bother.
Written files are also cached, up to a point. Ultimately there's nothing you can do if "total process time" is taken to mean that all the data is written to disk; you'd be having to wait for the disk to complete writing no matter what you did in memory and what the OS cached.
The OS's filesystem cache might give an initial impression that file writing has completed (the OS will get on with consolidating the data on the actual drive shortly), but successive runs of the same program will get blocked once that write cache is full.
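A rough way to see the effect of the write cache is to time the same write with and without forcing the data to the device. This is only a sketch; timed_write is a name invented here, and the numbers it returns will vary a lot between machines and runs.

```python
import os
import time


def timed_write(path, data, sync):
    """Write `data` to `path`; return elapsed seconds, optionally forcing it to disk."""
    start = time.perf_counter()
    with open(path, "wb") as out:
        out.write(data)
        if sync:
            out.flush()             # push Python's buffer to the OS
            os.fsync(out.fileno())  # ask the OS to push its cache to the device
    return time.perf_counter() - start
```

Comparing timed_write(path, data, sync=False) with timed_write(path, data, sync=True) for a large buffer gives an idea of how much of the apparent speed comes from caching.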
If you do profile your code, be sure to run it for a long time (or repeatedly), to make sure that the measurements made by the profiler reveal the true underlying performance of the computer. If the results show that most of the time is in file.Read() or file.Write()...
I'm currently involved in a Python project that involves handling massive amounts of data. In this, I have to print massive amounts of data to files. They are always one-liners, but sometimes consisting of millions of digits.
The actual mathematical operations in Python only take seconds, minutes at most. Printing them to a file takes up to several hours; which I don't always have.
Is there any way of speeding up the I/O?
From what I figure, the number is stored in the RAM (Or at least I assume so, it's the only thing which would take up 11GB of RAM), but Python does not print it to a text file immediately. Is there a way to dump that information -- if it is the number -- to a file? I've tried Task Manager's Dump, which gave me a 22GB dump file (Yes, you read that right), and it doesn't look like there's what I was looking for in there, albeit it wasn't very clear.
If it makes a difference, I have Python 3.5.1 (Anaconda and Spyder), Windows 8.1 x64 and 16GB RAM.
By the way, I do run Garbage Collect (gc module) inside the script, and I delete variables that are not needed, so those 11GB aren't just junk.
If you are indeed I/O bound by the time it takes to write the file, multi-threading with a pool of threads may help. Of course, there is a limit to that, but at least, it would allow you to issue non-blocking file writes.
Multithreading could speed it up: have "printer" threads, each with its own in-memory queue, that you hand the data to and that do the actual writing.
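A minimal sketch of that idea, assuming each item is one line of text: the main thread puts lines on a queue and a background thread writes them. writer_loop, start_writer and the DONE sentinel are names made up for this example.

```python
import queue
import threading

DONE = object()  # sentinel telling the writer to stop


def writer_loop(path, lines):
    """Drain `lines` and write each item to `path` until DONE arrives."""
    with open(path, "w") as out:
        while True:
            item = lines.get()
            if item is DONE:
                break
            out.write(item + "\n")


def start_writer(path, maxsize=100):
    """Start a background writer thread; returns (queue, thread)."""
    lines = queue.Queue(maxsize=maxsize)  # bounded, so memory use stays limited
    thread = threading.Thread(target=writer_loop, args=(path, lines))
    thread.start()
    return lines, thread
```

This only overlaps computation with writing. If the disk itself is the bottleneck, the queue fills up and the producer ends up waiting anyway.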
This is perhaps more of a system design point, but evaluate whether you need to write everything to the file at all. Consider creating various levels of logging so that a release mode could run faster (if that makes sense in your context).
Use HDF5 file format
The problem is, you have to write a lot of data.
HDF5 is a format that is very efficient in size and can be accessed by various tools.
Be prepared for a few challenges:
there are multiple python packages for HDF5, you will have to find the one which fits your needs
installation is not always very simple (but there might be Windows installation binary)
expect a bit of study to understand the data structures to be stored.
it will occasionally need some CPU cycles: typically you write a lot of data quickly, and at some moment it has to be flushed to disk. At that moment it starts compressing the data, which can take a few seconds. See GIL for IO bounded thread in C extension (HDF5)
Anyway, I think it is very likely you will manage, and apart from faster writes to the files you will also gain smaller files, which are simpler to handle.
Not sure this is the best title for this question but here goes.
Through python/Qt I started multiple processes of an executable. Each process is writing a large file (~20GB) to disk in chunks. I am finding that the first process to start is always the last to finish and continues on much, much longer than the other processes (despite having the same amount of data to write).
Performance monitors show that the process is still using the expected amount of RAM (~1GB), but the disk activity from the process has slowed to a trickle.
Why would this happen? It is as though the first process started somehow gets its disk access 'blocked' by the other processes and then doesn't recover after the other processes have finished...
Would the OS (windows) be causing this? What can I do to alleviate this?
Parallelism (of any kind) only results in a speedup if you actually have the resources to solve the problem faster.
Before thinking of optimizing your program, you should carefully analyze what's causing it to run (subjectively) slow - the bottleneck.
While I know nothing about what sort of bottleneck your program has, the fact that it writes a large quantity of data to disk is a good hint that it may be I/O bound.
When a program is I/O bound, the conventional single-machine parallelization techniques (threading, multiple processes) are worse than useless - they actually hurt performance, especially if you're dealing with a spinning disk. This happens because once you have more than one process accessing the disk at different places, the hard drive head has to seek between those.
The I/O scheduler of your OS can have a great impact on how much slower performance becomes once you have multiple processes accessing I/O, and on how processes are allotted disk accesses. You may consider switching your OS, but only if those multiple processes are needed in the first place.
With that being said, what can you do to get better (I/O) performance?
Get better disks (or an SSD)
Get more disks (one per process)
Get more machines
There are no guarantees as to fairness of I/O scheduling. What you're describing seems rather simple: the I/O scheduler, whether intentionally or not, gives a boost to new processes. Since your disk is tapped out, the order in which the processes finish is not under your control. You're most likely wasting a lot of disk bandwidth on seeks, due to parallel access from multiple processes.
TL;DR: Your expectation is unfounded. When I/O, and specifically the virtual memory system, is saturated, anything can happen. And so it does.
Why does the multiprocessing package for python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string.
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing data via serialization: "domain sockets". This should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle that works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
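To get a feel for what the serialization step involves, here is a small sketch that round-trips an object through pickle and also reports the encoded sizes for pickle and JSON. roundtrip_sizes is a helper invented for this example, and the sizes depend heavily on the data, so treat them as something to measure rather than a rule.

```python
import json
import pickle


def roundtrip_sizes(obj):
    """Serialize obj with pickle and JSON; return (restored, pickle_len, json_len)."""
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)  # what the receiving process would do
    text = json.dumps(obj)
    return restored, len(blob), len(text.encode("utf-8"))
```

multiprocessing performs the dumps/loads pair for you whenever arguments or results cross a process boundary.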
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
EDIT: @Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory
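For what it's worth, newer Python versions (3.8 and later) ship a multiprocessing.shared_memory module. The sketch below assumes that module is available; publish and read_by_name are helper names made up here. It copies bytes into a named block and reads them back by name, which is what a second process would do.

```python
from multiprocessing import shared_memory


def publish(data):
    """Copy `data` (non-empty bytes) into a new shared memory block; return the block."""
    block = shared_memory.SharedMemory(create=True, size=len(data))
    block.buf[:len(data)] = data
    return block


def read_by_name(name, length):
    """Attach to an existing block by name; return a copy of its first `length` bytes."""
    block = shared_memory.SharedMemory(name=name)
    try:
        return bytes(block.buf[:length])
    finally:
        block.close()
```

The creator passes block.name and the length to the other process, and eventually calls block.close() and block.unlink() to release the memory. The length is passed explicitly because the block may be larger than the size requested.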