Will pexpect.spawn automatically run processes using multicores? - python

I want to use Python to write a script that runs parallel processes.
I have learned that pexpect.spawn, multiprocessing.Process and concurrent.futures can all be used for parallel processing. I'm currently using pexpect.spawn, but I'm not sure whether pexpect.spawn will automatically use all the available cores.
This is what the top command returned when I used pexpect.spawn to start several workers. Are those processes running on different cores? How can I tell from the output? And why is the number of processes reported by top much larger than the number of cores on my computer?
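Here is roughly how I start the workers (a simplified sketch; ./worker.sh is a placeholder for the real command):
import pexpect

# Each pexpect.spawn() call starts a separate OS child process; which core each
# child runs on is up to the OS scheduler, not pexpect.
# "./worker.sh" is a hypothetical placeholder for the actual worker command.
workers = [pexpect.spawn("./worker.sh", timeout=None) for _ in range(4)]

for w in workers:
    w.expect(pexpect.EOF)      # wait for each child to finish
    print(w.before.decode())   # output produced before EOF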

Related

Does running two scripts in parallel on a dual-core CPU decrease the speed as compared to running them serially?

I was wondering whether running two scripts on a dual-core CPU (in parallel at the same time) decreases the speed of the code compared to running them serially (at different times or on two different CPUs). What factors should be taken into account when trying to answer this question?
No; assuming the processes can run independently (neither one is waiting on the other one to move forward), they will run faster in parallel than in serial on a multicore system. For example, here's a little script called busy.py:
for i in range(400000000):
    pass
Running this once, on its own:
$ time python busy.py
real 0m6.875s
user 0m6.871s
sys 0m0.004s
Running it twice, in serial:
$ time (python busy.py; python busy.py)
real 0m14.702s
user 0m14.701s
sys 0m0.001s
Running it twice, in parallel (on a multicore system - relying on the OS to assign different cores):
$ time (python busy.py & python busy.py)
real 0m7.849s
user 0m7.843s
sys 0m0.004s
If we run python busy.py & python busy.py on a single-core system, or simulate one with taskset, we get results that look more like the serial case than the parallel case:
$ time (taskset 1 python busy.py & taskset 1 python busy.py)
real 0m13.057s
user 0m13.035s
sys 0m0.013s
There is some variance in these numbers because, as @tdelaney mentioned, there are other applications competing for my cores and context switches are occurring. (You can see how many context switches occurred with /usr/bin/time -v.)
Nonetheless, they get the idea across: running twice in serial necessarily takes twice as long as running once, as does running twice in "parallel" (context-switching) on a single core. Running twice in parallel on two separate cores takes only about as long as running once, because the two processes really can run at the same time. (Assuming they are not waiting on each other, competing for the same resource, etc.)
This is why the multiprocessing module is useful for parallelizable tasks in Python. (threading can also be useful, if the tasks are I/O-bound rather than CPU-bound.)
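For example, a minimal multiprocessing sketch of the same experiment (not one of the original timing runs) looks like this: two CPU-bound workers run in separate processes, so a multicore machine can execute them at the same time.
from multiprocessing import Process

def busy():
    # same CPU-bound loop as busy.py
    for i in range(400000000):
        pass

if __name__ == "__main__":
    procs = [Process(target=busy) for _ in range(2)]
    for p in procs:
        p.start()   # the OS is free to place each process on its own core
    for p in procs:
        p.join()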
How Dual Cores Work When Running Scripts
Running two separate scripts
If you run one script and then another on a dual-core CPU, whether or not your scripts end up on separate cores depends on your operating system.
Running two separate threads
On a dual-core CPU, your threads can actually run faster than on a single core.

Speeding up the launch of processes when using multiprocessing on Windows

I have a machine learning application in Python, and I'm using the multiprocessing module to parallelize some of the work (specifically feature computation).
Now, multiprocessing works differently on Unix variants and on Windows:
Unix (mac/linux): fork/forkserver/spawn
Windows: spawn
Why multiprocessing.Process behave differently on windows and linux for global object and function arguments
Because spawn is used on Windows, launching multiprocessing processes is really slow: all the modules are loaded from scratch for each new process.
Is there a way to speed up the creation of the extra processes on Windows? (Using threads instead of multiple processes is not an option.)
Instead of creating multiple new processes each time, I highly suggest using concurrent.futures.ProcessPoolExecutor and leaving the executor open in the background.
That way, you don't create a new process each time; you leave the workers open in the background and pass them work using the module's functions or via queues and pipes.
Bottom line - Don't create new processes each time. Leave them open and pass work.
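For example, here is a minimal sketch of that pattern (the function name and inputs are placeholders, not the asker's real feature code): the pool is created once, and work is repeatedly submitted to the already-running workers.
from concurrent.futures import ProcessPoolExecutor

def compute_features(batch):
    # hypothetical stand-in for the real feature computation
    return sum(batch)

if __name__ == "__main__":
    # the worker processes are created once and stay alive between submissions
    with ProcessPoolExecutor(max_workers=4) as executor:
        for batch in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
            print(executor.submit(compute_features, batch).result())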

Utilizing 100% Network in Python

I'm currently trying to grab files from multiple servers simultaneously in the fastest way possible. I've written a Python script that uses Paramiko to SCP all the files to my computer. I thought threading would do the trick, but when I run the script I can see only 60% of my network being utilized. Even when I change the thread count from 50 to 100 there doesn't seem to be any difference. I discovered that the threads are actually all running on the same core and the GIL prevents them from working simultaneously; the threads are just switching between each other really quickly.
I moved on to trying multiprocessing. From what I've read, you should only spawn as many Processes as you have cores; 4 in my case. When I run it, my CPU is maxed out at 99% usage and my network is at 0%, though it seems that files are slowly being transferred...
My question is: how do I utilize 100% of my network to maximize my download speed and decrease download time? Spawn processes that each spawn threads?
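Here is a rough sketch of the "processes that spawn threads" idea I have in mind (all names are placeholders; download() stands in for the real Paramiko/SCP transfer):
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def download(server):
    # placeholder for the real Paramiko/SCP transfer
    return "fetched files from %s" % server

def worker(servers):
    # threads are fine inside each process because the copies are I/O-bound
    with ThreadPoolExecutor(max_workers=25) as pool:
        return list(pool.map(download, servers))

if __name__ == "__main__":
    groups = [["server1", "server2"], ["server3", "server4"]]   # one group per process
    with ProcessPoolExecutor(max_workers=2) as pool:
        for result in pool.map(worker, groups):
            print(result)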

Python multiprocessing: calling mpi executable and distributing over cores

I am trying to use python's multiprocessing approach to speed up a program I am working on.
The python code runs in serial, but has calls to an mpi executable. It is these calls that I would like to do in parallel as they are independent of one another.
For each step the python script takes I have a set of calculations that must be done by the mpi program.
For example, if I am running over 24 cores, I would like the python script to call 3 instances of the mpi executable each running on 8 of the cores. Each time one mpi executable run ends, another instance is started until all members of a queue are finished.
I am just getting started with multiprocessing, and I am fairly certain this is possible, but I am not sure how to go about it. I can set up a queue and have multiple processes start; it's adding the next set of calculations to the queue and starting them that is the issue.
If some kind soul could give me some pointers, or some example code, I'd be most obliged!
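To make the setup concrete, here is a rough sketch of what I imagine (the mpirun command line and the list of cases are made-up examples): a pool of 3 workers, each launching one 8-core MPI run at a time, with the pool handing each worker the next calculation as soon as its previous one finishes.
import subprocess
from multiprocessing import Pool

def run_mpi(case):
    # each call occupies 8 of the 24 cores; the command line is a made-up example
    return subprocess.call(["mpirun", "-np", "8", "./mpi_prog", case])

if __name__ == "__main__":
    cases = ["case_%d" % i for i in range(12)]   # the queue of calculations
    with Pool(processes=3) as pool:              # 3 MPI instances at a time
        exit_codes = pool.map(run_mpi, cases)
    print(exit_codes)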

Python: How to Run multiple programs on same interpreter

How do I start an always-on Python interpreter on a server?
If bash starts multiple Python programs, how can I run them all on just one interpreter?
And how can I start a new interpreter after tracking the number of bash requests; say, after X requests to Python programs, a new interpreter should start.
EDIT: Not a copy of https://stackoverflow.com/questions/16372590/should-i-run-1000-python-scripts-at-once?rq=1
Requests may come pouring in sequentially
You cannot have new Python programs started through bash run on the same interpreter; each program will always have its own. If you want to limit the number of Python programs running, the best approach would be to have a Python daemon process running on your server; instead of creating a new program through bash on each request, you would signal the daemon process to create a thread to handle the task.
To run a program forever in Python:
while True:
    do_work()
You could look at spawning a thread for each incoming request. Look at the threading.Thread class.
from threading import Thread
task = Thread(target=do_work, args=())
task.start()
You probably want to take a look at http://docs.python.org/3/library/threading.html and http://docs.python.org/3/library/multiprocessing.html. threading is more lightweight but only allows one thread to execute at a time (meaning it won't take advantage of multicore/hyperthreaded systems), while multiprocessing allows for true simultaneous execution; however, multiprocessing can be less lightweight than threading on systems without lightweight subprocesses, and it may not be necessary if the threads/processes spend most of their time doing I/O requests.
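For example, a minimal sketch combining the two ideas above (illustrative only; the queue, do_work, and the requests are placeholders): a long-running loop pulls incoming requests from a queue and hands each one to its own thread.
from queue import Queue
from threading import Thread
import time

requests = Queue()

def do_work(request):
    print("handling", request)

def daemon_loop():
    # the always-on part: one loop, one thread per incoming request
    while True:
        request = requests.get()                        # blocks until a request arrives
        Thread(target=do_work, args=(request,)).start()

Thread(target=daemon_loop, daemon=True).start()
requests.put("request-1")    # a bash wrapper would enqueue requests here
requests.put("request-2")
time.sleep(0.5)              # give the demo threads a moment to run before exiting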
