Multiprocessing efficiency confusion - python

I'm running a Python job that uses the multiprocessing package, and here's the issue. When I run it with 3 worker processes on my dual-core, hyper-threaded laptop, I hit 100% CPU usage on each core with no problem. I also have a 6-core, hyper-threaded workstation, and when I run the same script on that machine each core barely breaks 30%. Can someone explain why this is? I thought it might be I/O, but if that were the case my laptop shouldn't be hitting 100% either, right?
Code below with short explanation:
MultiprocessingPoolWithState is a custom class that fires up N_Workers workers and gives each of them a copy of a dataframe (so the df isn't shipped over the wire to each worker for every operation). tups is a list of tuples used as the slicing criteria for each operation that process_data() does.
Here's an example of the code:
import multiprocessing as mp

config = dict()
N_Workers = mp.cpu_count() - 1  # leave one logical core free

def process_data(tup):
    global config
    df = config['df']  # the dataframe copy held by this worker
    id1 = tup[0]
    id2 = tup[1]
    df_want = df.loc[(df.col1 == id1) & (df.col2 == id2)]
    """ DO STUFF """
    return series_i_want  # produced inside DO STUFF

pool = MultiprocessingPoolWithState(n=N_Workers, state=df)
results = pool.map(process_data, tups)
I'm not sure what other details anyone would need, so I'll add what I can (I can't give the custom class as it's not mine but a co-worker's). The main thing is that my laptop maxes out CPU usage but my workstation doesn't.
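I can't share the actual class, but from what I understand it follows the usual "pool with per-worker state" pattern: a Pool initializer hands each worker its own copy of the dataframe once, instead of pickling it into every task. A rough sketch of that pattern (my guess at the structure, with made-up helper names, not the real code):

import multiprocessing as mp

config = {}  # module-level dict; process_data() reads config['df']

def _init_worker(df):
    # Runs once in each worker process: every worker keeps its own copy of df,
    # so the dataframe is not re-sent for every task submitted to the pool.
    config['df'] = df

class MultiprocessingPoolWithState(object):
    """Sketch only; the real class from the question is not shown."""

    def __init__(self, n, state):
        self._pool = mp.Pool(processes=n,
                             initializer=_init_worker,
                             initargs=(state,))

    def map(self, func, iterable):
        return self._pool.map(func, iterable)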

For those who might be curious, I think I've figured it out (although this answer won't be highly technical). Within the """ DO STUFF """ section I call statsmodels.tsa.x13.x13_arima_analysis(), a Python wrapper for X-13ARIMA-SEATS, the seasonal adjustment program the US Census Bureau publishes for seasonally adjusting time series (like sales records). I only had one copy of the x13.exe that the wrapper calls. On my laptop (Windows 10) I think the OS created copies of the .exe automatically for each process, but on our server (Windows Server 2012 R2) each process was waiting in line for its own turn at the .exe - so on the server I had an I/O bottleneck that my laptop handled automatically. The solution was simple: give each process its own copy of the .exe at a unique path, and boom, the program is 300 times faster than before.
I don't understand exactly how the two operating systems handle multiple processes calling the same .exe, but that's my theory, and it seemed to be confirmed once I added the process-dependent paths.
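Roughly, the fix looked like the following sketch (the paths and helper name are made up, and it assumes statsmodels' x12path argument, which points the wrapper at the folder holding a specific binary):

import os
import shutil
from statsmodels.tsa.x13 import x13_arima_analysis

X13_SOURCE = r"C:\x13\x13as.exe"  # made-up location of the single original binary

def x13_for_this_process(series):
    # Give each worker process its own copy of the binary so the workers
    # stop queuing up behind one shared .exe.
    worker_dir = r"C:\x13\worker_{0}".format(os.getpid())
    if not os.path.isdir(worker_dir):
        os.makedirs(worker_dir)
        shutil.copy(X13_SOURCE, worker_dir)
    return x13_arima_analysis(series, x12path=worker_dir)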

Related

Why do two CPUs work when I am only working with one process?

I'm running a script like
from time import time

def test0():
    start = time()
    for i in range(int(1e8)):
        i += 1
    print(time() - start)
When I run this on my machine which has 4 CPUs I get the following trace for CPU usage using the Ubuntu 20.04 system monitor.
In the image you can see I ran two experiments separated by some time. But in each experiment, the activity of 2 of the CPUs peaks. Why?
This seems normal to me. The process that is running your Python code is not, at least by default, pinned to a specific core. This means the process can be switched between different cores, which is what is happening in this case. The spikes are not simultaneous, which indicates that the process was switched from one core to another.
On Linux, you can observe this using
watch -tdn0.5 ps -mo pid,tid,%cpu,psr -p 172810
where 172810 is the PID of the Python process (which you can get, for example, from the output of top).
If you want to pin the process to a particular core, you can use psutil in your code.
import psutil
p = psutil.Process()
p.cpu_affinity([0]) # pinning the process to cpu 0
Now you should see only one core spiking (but avoid doing this unless you have a good reason for it).
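As a quick sanity check, cpu_affinity() called with no arguments returns the current affinity list, so you can confirm the pin took effect:

import psutil

p = psutil.Process()
p.cpu_affinity([0])      # pin the process to cpu 0
print(p.cpu_affinity())  # expect [0]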

How to run an optimal number of instances of a single python script on linux

I have a script that performs an independent task on about 1200 different files. It loops through each file and checks whether it has already been completed or is in progress; if it hasn't been done and isn't being actively worked on (which it wouldn't be if the script isn't being run in parallel), it performs a task on the file. This follows the general outline below:
myScript.py:
for file in directory:
    fileStatus = getFileStatus(file)
    if fileStatus != 'Complete' and fileStatus != 'inProgress':
        setFileStatus(file, 'inProgress')
        doTask(file)
        setFileStatus(file, 'Complete')
doTask() takes 20-40 minutes on my machine and will arc from minimal RAM requirements at the beginning to about 8GB toward the middle, and back down to minimal requirements at the end. Depending on the file this will occur over a variable amount of time.
I would like to run this script in parallel with itself so that all tasks are completed in the least amount of time possible, using the maximum amount of my machine's resources. Assuming (in ignorance) the limiting resource is RAM (of which my machine has 64GB), and that the scripts will all have peak RAM consumption at the same time, I could mimic the response to this question in a manner such as:
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
However, I imagine I could fit more in depending on where each process is in its execution.
Is there a way to dynamically determine how many resources I have available and accordingly create, destroy or pause instances of this script so that the machine is working at maximum efficiency with respect to time? I would like to avoid making changes to myScript and instead call it from another which would handle the creating, destroying and pausing.
GNU Parallel is built for doing stuff like:
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
It also has some features to do resource limitation. Finding the optimal number is, however, really hard given that:
Each job runs for 20-40 minutes (if this was fixed, it would be easier)
Has a RAM usage envelope like a mountain (if it stayed at the same level all through the run, it would be easier)
If the 64 GB RAM is the limiting resource, then it is always safe to run 8 jobs:
cat filelist | parallel -j8 python myScript.py
If you have plenty of CPU power and are willing to risk wasting some, then you can start a job whenever there are 8 GB of memory free and the last job was started more than 5 minutes (300 seconds) ago (assuming jobs reach their peak memory usage within 3-5 minutes). GNU Parallel will kill the newest job and put it back on the queue if free memory goes below 4 GB:
cat filelist | parallel -j0 --memfree 8G --delay 300 python myScript.py
Update:
Thanks for clarifying further. However, with the requirements and approach you just mentioned, you are going to end up reinventing multithreading. I suggest you avoid the multiple script calls and keep all control inside your loop(s) (like the one in my original response). You are probably looking to query the memory usage of the processes (like this). One particular component that might help you here is setting the priority of individual tasks (mentioned here). You may find this link particularly useful for scheduling task priorities. In fact, I recommend the threading2 package here, since it has built-in features for priority control.
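For the memory-usage querying part, a bare-bones sketch with psutil (the thresholds and function names here are placeholders, not from the links above) would be something like:

import psutil

GB = 1024 ** 3

def enough_memory_to_start(min_free_gb=8):
    # System-wide available memory, checked before launching another task.
    return psutil.virtual_memory().available >= min_free_gb * GB

def task_memory_gb(pid):
    # Resident memory of one running task, in GB.
    return psutil.Process(pid).memory_info().rss / float(GB)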
Original Response:
Since you have roughly identified which parts require how much memory, you may employ multithreading pretty easily.
import threading

thread1 = threading.Thread(target=process1, args=(yourArg1,))  # process1 takes 1 GB
thread2 = threading.Thread(target=process2, args=(yourArg1,))  # process2 takes 1 GB
threadList1 = [thread1, thread2]

thread3 = threading.Thread(target=process3, args=(yourArg1,))  # process3 takes 0.5 GB
thread4 = threading.Thread(target=process4, args=(yourArg1,))  # process4 takes 0.5 GB
threadList2 = [thread3, thread4]

# Batch 1:
for thread in threadList1:
    thread.start()
for thread in threadList1:
    thread.join()

# Batch 2:
for thread in threadList2:
    thread.start()
for thread in threadList2:
    thread.join()

Python subprocesses randomly drop to 0% CPU usage, causing the process to "hang up"

I run several Python subprocesses to migrate data to S3. I noticed that my Python subprocesses often drop to 0% CPU usage, and this condition lasts for more than a minute. This significantly decreases the performance of the migration process.
Here is a pic of the subprocess:
The subprocess does these things:
Query all tables from a database.
Spawn a subprocess for each table.
for table in tables:
    print "Spawn process to process {0} table".format(table)
    process = multiprocessing.Process(name="Process " + table,
                                      target=target_def,
                                      args=(args, table))
    process.daemon = True
    process.start()
    processes.append(process)

for process in processes:
    process.join()
Query data from the database using LIMIT and OFFSET. I used the PyMySQL library to query the data (see the sketch after these steps).
Transform the returned data into another structure. construct_structure_def() is a function that transforms a row into another format.
buffer_string = []
for i, row_file in enumerate(row_files):
    if i == num_of_rows:
        buffer_string.append(json.dumps(construct_structure_def(row_file)))
    else:
        buffer_string.append(json.dumps(construct_structure_def(row_file)) + "\n")
content = ''.join(buffer_string)
Write the transformed data into a file and compress it using gzip.
with gzip.open(file_path, 'wb') as outfile:
    outfile.write(content)
return file_name
Upload the file to S3.
Repeat steps 3-6 until there are no more rows to fetch.
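The query step itself looks roughly like this sketch (the connection details, table handling and batch size are placeholders, not my real values):

import pymysql

def fetch_batches(table, batch_size=10000):
    # Placeholder connection parameters.
    conn = pymysql.connect(host="db-host", user="user", password="secret", db="mydb")
    try:
        offset = 0
        while True:
            with conn.cursor() as cursor:
                # The table name cannot be bound as a parameter, so it is
                # interpolated here; LIMIT/OFFSET are bound normally.
                cursor.execute(
                    "SELECT * FROM {0} LIMIT %s OFFSET %s".format(table),
                    (batch_size, offset))
                rows = cursor.fetchall()
            if not rows:
                break            # stop when no more rows are returned
            yield rows           # each batch then goes through steps 4-6
            offset += batch_size
    finally:
        conn.close()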
To speed things up, I create a subprocess for each table using the built-in multiprocessing.Process.
I ran my script in a virtual machine. Following are the specs:
Processor: Intel(R) Xeon(R) CPU E5-2690 @ 2.90 GHz (2 processors)
Virtual processors: 4
Installed RAM: 32 GB
OS: Windows Enterprise Edition.
I saw a post here saying that one of the main possibilities is a memory/IO limitation, so I tried running just one subprocess to test that theory, but to no avail.
Any idea why this is happening? Let me know if you guys need more information.
Thank you in advance!
It turns out the culprit was the query I ran. The query took a long time to return results, which left the Python script idle, hence the zero percent CPU usage.
Your VM is a Windows machine; I'm more of a Linux person, so I'd love it if someone would back me up here.
I think the daemon is the problem here.
I've read about daemon processes and especially about TSR.
The first line in TSR says:
Regarding computers, a terminate and stay resident program (commonly referred to by the initialism TSR) is a computer program that uses a system call in DOS operating systems to return control of the computer to the operating system, as though the program has quit, but stays resident in computer memory so it can be reactivated by a hardware or software interrupt.
As I understand it, making the process a daemon (or a TSR in your case) makes it dormant until a syscall wakes it up, which I don't think is the case here (correct me if I'm wrong).

Running scipy.odeint through multiple cores

I'm working in Jupyter (Anaconda) with Python 2.7.
I'm trying to get an odeint function I wrote to run multiple times; however, it takes an incredible amount of time.
While trying to figure out how to decrease the run time, I realized that when I ran it, it only took up about 12% of my CPU.
I operate off of an Intel Core i7-3740QM @ 2.70 GHz:
https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-3740QM+%40+2.70GHz&id=1481
So I'm assuming this is because Python's GIL is causing my script to run on only one core.
After some research on parallel processing in Python, I thought I found the answer by using this code:
import sys
import multiprocessing as mp
import numpy as np

Altitude = np.array([[550], [500], [450], [400], [350], [300]])

if __name__ == "__main__":
    processes = 4
    p = mp.Pool(processes)
    mp_solutions = p.map(calc, Altitude)  # calc is the odeint function I wrote (not shown here)
This doesn't seem to work though. Once I run it, Jupyter just stays constantly busy. My first thought was that the computation was just heavy and taking a long time, but then I looked at my CPU usage, and although there were multiple Python processes, none of them were using any CPU.
I can't figure out what the reasoning for this is. I found this post as well and tried using their code but it simply did the same thing:
Multiple scipy.integrate.ode instances
Any help would be much appreciated.
Thanks!

Why is this dictionary-updating code slow? How can I improve its efficiency?

I'm trying to reduce the processor time consumed by a python application, and after profiling it, I've found a small chunk of code consuming more processor time than it should:
class Stats(DumpableObject):
    members_offsets = [
        ('blkio_delay_total', 40),
        ('swapin_delay_total', 56),
        ('read_bytes', 248),
        ('write_bytes', 256),
        ('cancelled_write_bytes', 264)
    ]

    # [...other code here...]

    def accumulate(self, other_stats, destination, coeff=1):
        """Update destination from operator(self, other_stats)"""
        dd = destination.__dict__
        sd = self.__dict__
        od = other_stats.__dict__
        for member, offset in Stats.members_offsets:
            dd[member] = sd[member] + coeff * od[member]
Why is this so expensive? How can I improve the efficiency of this code?
Context:
One of my favourite Linux tools, iotop, uses far more processor time than I think is appropriate for a monitoring tool - quickly consuming minutes of processor time; using the built-in --profile option, total function calls approached 4 million over a run of only 20 seconds. I've observed similar behaviour on other systems, across reboots, and on multiple kernels. pycallgraph highlighted accumulate as one of a few time-consuming functions.
After studying the code for a full week, I think that dictionaries are the best choice of data structure here, as updating a large number of threads requires many lookups, but I don't understand why this code is expensive. Extensive searching failed to enlighten me. I don't understand the curses, socket, and struct libraries well enough to ask a self-contained question. I'm not asking for code as lightweight as pure C is in i7z.
I'd post images & other data, but I don't have the reputation.
The iotop git repository: http://repo.or.cz/w/iotop.git/tree (The code in question is in data.py, beginning line 73)
The system in question runs Ubuntu 13.04 on an Intel E6420 with 2 GB of RAM. Kernel 3.8.0-35-generic.
(I wish that Guillaume Chazarain had written more docstrings!)
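A standalone micro-benchmark (stand-in objects, not the real iotop classes) can separate the per-call cost of accumulate from the sheer call volume:

import timeit

class FakeStats(object):
    members_offsets = [
        ('blkio_delay_total', 40),
        ('swapin_delay_total', 56),
        ('read_bytes', 248),
        ('write_bytes', 256),
        ('cancelled_write_bytes', 264),
    ]

    def __init__(self):
        # Give every tracked member a dummy value.
        for member, _ in self.members_offsets:
            setattr(self, member, 1)

    def accumulate(self, other_stats, destination, coeff=1):
        dd = destination.__dict__
        sd = self.__dict__
        od = other_stats.__dict__
        for member, offset in FakeStats.members_offsets:
            dd[member] = sd[member] + coeff * od[member]

a, b, dest = FakeStats(), FakeStats(), FakeStats()
n = 1000000
per_call = timeit.timeit(lambda: a.accumulate(b, dest), number=n) / n
print("accumulate: %.2f microseconds per call" % (per_call * 1e6))

If the per-call time comes out at only a few microseconds, the profiler numbers mostly reflect how often accumulate is called rather than anything pathological inside the loop.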
