How to get accurate process CPU and memory usage with Python?

I am trying to create a process monitor, but I am unable to get results that match Windows Task Manager.
I have been using psutil, which seems to work fine for overall CPU and memory usage but doesn't appear to be very accurate for a single process: memory usage is always higher than Task Manager, and the CPU value is erratic.
I set the process once on initialise with self.process = psutil.Process(self.pid) and then call the method below once a second. The process in Task Manager runs at a constant 5.4% CPU and 130 MB RAM, however the code below produces:
CPU: 12.5375
Memory 156459008
CPU: 0.0
Memory 156459008
CPU: 0.0
Memory 156459008
CPU: 0.0
Memory 156459008
CPU: 12.5375
Memory 156459008
CPU: 0.0
Memory 156459008
CPU: 0.0
Memory 156459008
Example code:
def process_info(self):
    # I am calling this method twice because I read the first time gets ignored?
    ignore_cpu = self.process.cpu_percent(interval=None) / psutil.cpu_count()
    time.sleep(0.1)
    process_cpu = self.process.cpu_percent(interval=None) / psutil.cpu_count()
    # I also tried the below code but it was much worse than above
    # for j in range(10):
    #     if j == 0:
    #         test_list = []
    #     p_cpu = self.process.cpu_percent(interval=0.1) / psutil.cpu_count()
    #     test_list.append(p_cpu)
    # process_cpu = (sum(test_list)) / len(test_list)
    # Memory is about 25 MB higher than Task Manager
    process_memory = self.process.memory_info().rss
    print(f"CPU: {process_cpu}")
    print(f"Memory: {process_memory}")
Am I using psutil incorrectly, or is there a more accurate way to grab this data?

I think you are misinterpreting the output of cpu_percent(). From the psutil documentation:
Return a float representing the process CPU utilization as a percentage. When interval is > 0.0 it compares process times to system CPU times elapsed before and after the interval (blocking). When interval is 0.0 or None it compares process times to system CPU times elapsed since the last call, returning immediately. That means that the first time it is called it returns a meaningless 0.0 value, which only establishes the baseline for subsequent calls. In this case it is recommended for accuracy that the function be called with at least 0.1 seconds between calls.
So, if you call it with interval=0.1, it returns the percentage of time the process was running in the last 0.1 seconds.
If you call it with interval=None, it returns the percentage of time the process was running since the last time you called it.
So, if you call it twice in a row with interval=None, the first call only establishes the baseline (and should be ignored), and the second call returns the percentage of time the process was running between the two calls.
In short, if you want the percentage of time the process was running in the last 0.1 seconds, call it with interval=0.1.
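For instance, a minimal sampling loop along those lines might look like this (a rough sketch, not your exact class; the pid and the one-second interval are just placeholders):
import time
import psutil

proc = psutil.Process(1234)           # hypothetical pid
proc.cpu_percent(interval=None)       # first call only establishes the baseline

while True:
    time.sleep(1)                     # sample roughly once per second
    cpu = proc.cpu_percent(interval=None) / psutil.cpu_count()
    rss = proc.memory_info().rss
    print(f"CPU: {cpu:.1f}%  RSS: {rss / 1024 ** 2:.1f} MiB")
Dividing by psutil.cpu_count() normalises the per-process value to the whole machine, which is closer to what Task Manager displays.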

Like @Axeltherabbit mentioned in his comment, Task Manager aggregates all processes under a given name, whereas psutil reports on an individual process. psutil is correct; Task Manager over-scopes and grabs all processes with that name.
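If you want a number comparable to Task Manager, you could sum over every process with the same name, roughly like this (a sketch; the function name and the example process name are just illustrative):
import psutil

def usage_by_name(name):
    total_cpu, total_rss = 0.0, 0
    for proc in psutil.process_iter(["name"]):
        try:
            if proc.info["name"] == name:
                total_cpu += proc.cpu_percent(interval=0.1) / psutil.cpu_count()
                total_rss += proc.memory_info().rss
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    return total_cpu, total_rss

cpu, rss = usage_by_name("chrome.exe")
print(f"CPU: {cpu:.1f}%  RSS: {rss / 1024 ** 2:.1f} MiB")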

This should be more accurate, I think:
import psutil

for process in [psutil.Process(pid) for pid in psutil.pids()]:
    try:
        process_name = process.name()
        process_mem = process.memory_percent()
        process_cpu = process.cpu_percent(interval=0.5)
    except psutil.NoSuchProcess as e:
        print(e.pid, "killed before analysis")
    else:
        print("Name:", process_name)
        print("CPU%:", process_cpu)
        print("MEM%:", process_mem)

Related

Problems with loading RAM to 100%

I am conducting an experiment to load RAM to 100% on macOS. I stumbled upon the method described here: https://unix.stackexchange.com/a/99365
I decided to do the same and wrote the two programs presented below. While executing the first program, the system reports that the process takes 120 GB, but the memory usage graph stays stable. When executing the second program, a warning pops up almost immediately that the system does not have enough resources. The second program creates ten parallel processes that increase memory consumption in approximately the same way.
First program:
import time

def load_ram(vm, timer):
    x = (vm * 1024 * 1024 * 1024 * 8) * (0,)
    begin_time = time.time()
    while time.time() - begin_time < timer:
        pass
    print("end")
[Screenshot: memory occupied by the first program]
Second program:
import os

def load_ram(vm, timer):
    file_sh = open("bash_file.sh", "w")
    str_to_bash = """
VM=%d;
for i in {1..10};
do
    python -c "x=($VM*1024*1024*1024*8)*(0,); import time; time.sleep(%d)" & echo "started" $i ;
done""" % (int(vm), int(timer))
    file_sh.write(str_to_bash)
    file_sh.close()
    os.system("bash bash_file.sh")
[Screenshot: memory occupied by the second program]
[Screenshot: memory occupied by the second program, plus the system warning]
Parameters: vm = 16, timer = 30.
In the first program, the reported memory usage is about 128 gigabytes (after that, a kill message pops up in the terminal and the process stops). The second takes up more than 160 gigabytes, as shown in the picture, and all ten of its processes keep running. The warning that the system is low on resources is displayed even if each process takes up only 10 gigabytes (that is, 100 gigabytes in total).
According to the situation described, two questions arise:
Why, with the same memory consumption (about 120 gigabytes), does the system act in the first case as if the process does not exist, while in the second case it immediately complains under the same load?
Where does the figure of 120 gigabytes come from if my computer only has 16 gigabytes of RAM?
Thank you for your attention!

Why does per-process overhead constantly increase for multiprocessing?

On a 6-core CPU with 12 logical CPUs, I was counting up to a really high number in a for loop, several times.
To speed things up I used multiprocessing. I was expecting something like:
number of processes <= number of CPUs: roughly the same time
number of processes = number of CPUs + 1: time doubled
What I found instead was a continuous increase in time. I'm confused.
The code was:
#!/usr/bin/python
from multiprocessing import Process, Queue
import random
from timeit import default_timer as timer

def rand_val():
    num = []
    for i in range(200000000):
        num = random.random()
    print('done')

def main():
    for iii in range(15):
        processes = [Process(target=rand_val) for _ in range(iii)]
        start = timer()
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        end = timer()
        print(f'elapsed time: {end - start}')
        print('for ' + str(iii))
        print('')

if __name__ == "__main__":
    main()
    print('done')
result:
elapsed time: 14.9477102 for 1
elapsed time: 15.4961154 for 2
elapsed time: 16.9633134 for 3
elapsed time: 18.723183399999996 for 4
elapsed time: 21.568377299999995 for 5
elapsed time: 24.126758499999994 for 6
elapsed time: 29.142095499999996 for 7
elapsed time: 33.175509300000016 for 8
.
.
.
elapsed time: 44.629786800000005 for 11
elapsed time: 46.22480710000002 for 12
elapsed time: 50.44349420000003 for 13
elapsed time: 54.61919949999998 for 14
There are two wrong assumptions you make:
Processes are not free. Merely adding processes adds overhead to the program.
Processes do not own CPUs. A CPU interleaves execution of several processes.
The first point is why you see some overhead even though there are fewer processes than CPUs. Note that your system usually has several background processes running, so the notion of "fewer processes than CPUs" is not clear-cut for a single application.
The second point is why you see the execution time increase gradually when there are more processes than CPUs. Any OS running mainline Python does preemptive multitasking of processes; roughly, this means a process does not block a CPU until it is done, but is paused regularly so that other processes can run.
In effect, this means that several processes can run on one CPU at once. Since the CPU can still only do a fixed amount of work per time, all processes take longer to complete.
I don't understand what you are trying to achieve.
You are taking the same work and running it X times, where X is the number of processes in your loop. You should be taking the work and dividing it by X, then sending a chunk to each worker, as in the sketch below.
Anyway, as regards what you are observing: you are also seeing the time it takes to spawn and tear down the separate processes. Python isn't quick at starting new processes.
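A rough illustration of that idea, assuming a fixed total amount of work split evenly across the workers (the names and numbers here are purely illustrative):
from multiprocessing import Pool
from timeit import default_timer as timer
import random

TOTAL = 200_000_000  # fixed total amount of work

def rand_chunk(n):
    # each worker does only its share of the total
    for _ in range(n):
        random.random()

if __name__ == "__main__":
    for workers in range(1, 13):
        start = timer()
        with Pool(workers) as pool:
            pool.map(rand_chunk, [TOTAL // workers] * workers)
        print(f"{workers} workers: {timer() - start:.2f}s")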
Your test is faulty.
Imagine this: it takes one day for one farmer to work 10 km² of farmland using a single tractor. Why would you expect two farmers working twice as much farmland, using two tractors, to take less time?
You have 6 CPU cores: your village has 6 tractors, but nobody has money to buy private tractors. As the number of workers (processes) in the village increases, the number of tractors remains the same, so everyone has to share the limited number of tractors.
In an ideal world, two farmers doing twice the work with two tractors would take exactly the same time as one farmer doing one portion of the work, but in real computers the machine has other work to do even when it seems idle: there is task switching, the OS kernel has to run and monitor hardware devices, memory caches need to be flushed and invalidated between CPU cores, your browser needs to run, the village elder is holding a meeting to decide who gets the tractors and when, etc.
As the number of workers grows beyond the number of tractors, farmers don't just hog the tractors for themselves. Instead, they make an arrangement to pass the tractors around every three hours or so, so the seventh farmer doesn't have to wait two days for their share of tractor time. However, there's a cost to moving tractors between farmlands, just as there are costs for a CPU to switch between processes: switch too frequently and the CPU isn't actually doing useful work; switch too infrequently and you get resource starvation, as some jobs take too long to start being worked on.
A more sensible test would be to keep the size of farmland constant and just increase the number of farmers. In your code, that would correspond to this change:
def rand_val(num_workers):
    num = []
    # keep the total work constant by giving each worker an equal share;
    # integer division is needed because range() requires an int
    for i in range(200000000 // num_workers):
        num = random.random()
    print('done')

def main():
    for iii in range(15):
        # pass the worker count as an argument (a lambda would not be picklable everywhere)
        processes = [Process(target=rand_val, args=(iii,)) for _ in range(iii)]
        ...

Python most accurate method to measure time (ms)

I need to measure the time certain parts of my code take. While executing my code on a powerful server, I get 10 different results.
I tried comparing time measured with time.time(), time.perf_counter(), time.perf_counter_ns(), time.process_time() and time.process_time_ns().
import time

for _ in range(10):
    start = time.perf_counter()
    i = 0
    while i < 100000:
        i = i + 1
    time.sleep(1)
    end = time.perf_counter()
    print(end - start)
I'm expecting that, when executing the same code 10 times, the results will be essentially the same (with a resolution of at least 1 ms), e.g. 1.041XX every time, and not spread between 1.030 s and 1.046 s.
When executing my code on a 16-CPU, 32 GB memory server, I receive this result:
1.045549364
1.030857833
1.0466020120000001
1.0309665050000003
1.0464690349999994
1.046397238
1.0309525370000001
1.0312070380000007
1.0307592159999999
1.046095523
I'm expecting the result to be something like:
1.041549364
1.041857833
1.0416020120000001
1.0419665050000003
1.0414690349999994
1.041397238
1.0419525370000001
1.0412070380000007
1.0417592159999999
1.041095523
Your expectations are wrong. If you want to measure the average time a piece of code takes, use the timeit module. It executes your code many times and lets you average over the runs.
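For instance, a minimal timeit sketch along those lines (the statement and the numbers are just illustrative):
import timeit

stmt = """
i = 0
while i < 100000:
    i = i + 1
"""

# run the statement 1000 times, repeat the whole measurement 5 times,
# and report the best per-execution time
runs = timeit.repeat(stmt, number=1000, repeat=5)
print(min(runs) / 1000)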
The reason your code has different runtimes lies in your code:
time.sleep(1) # ensures (3.5+) _at least_ 1000ms are waited, won't be less, might be more
You are calling it in a tight loop, resulting in accumulated differences:
Quote from time.sleep(..) documentation:
Suspend execution of the calling thread for the given number of seconds. The argument may be a floating point number to indicate a more precise sleep time. The actual suspension time may be less than that requested because any caught signal will terminate the sleep() following execution of that signal’s catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the scheduling of other activity in the system.
Changed in version 3.5: The function now sleeps at least secs even if the sleep is interrupted by a signal, except if the signal handler raises an exception (see PEP 475 for the rationale).
Emphasis mine.
Executing the same code does not take the same time at each loop iteration because of the system's scheduling (the system puts your process on hold to run another process, then comes back to it...).
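As an illustration of that point (my own sketch, not part of the answer above): time.process_time() counts only the CPU time of the current process, so the sleep and the time spent running other processes do not show up in it, while the wall-clock measurement still varies.
import time

for _ in range(10):
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    i = 0
    while i < 100000:
        i = i + 1
    time.sleep(1)
    print("wall:", time.perf_counter() - wall_start,
          "cpu:", time.process_time() - cpu_start)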

Why is a ThreadPoolExecutor with one worker still faster than normal execution?

I'm using this library, Tomorrow, which in turn uses the ThreadPoolExecutor from the standard library to allow for async function calls.
Calling the decorator @tomorrow.threads(1) spins up a ThreadPoolExecutor with 1 worker.
Question
Why is it faster to execute a function using 1 thread worker over just calling it as is (e.g. normally)?
Why is it slower to execute the same code with 10 thread workers in place of just 1, or even None?
Demo code
imports excluded
def openSync(path: str):
    for row in open(path):
        for _ in row:
            pass

@tomorrow.threads(1)
def openAsync1(path: str):
    openSync(path)

@tomorrow.threads(10)
def openAsync10(path: str):
    openSync(path)

def openAll(paths: list):
    def do(func: callable) -> float:
        t = time.time()
        [func(p) for p in paths]
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))
Note: The data folder contains 18 files, each 700 lines of random text.
Output
0 workers: 0.0120 seconds
1 worker: 0.0009 seconds
10 workers: 0.0535 seconds
What I've tested
I've run the code more than a couple dozen times, with different programs running in the background (I ran a bunch yesterday, and a couple today). The numbers change, of course, but the order is always the same (i.e. 1 is fastest, then 0, then 10).
I've also tried changing the order of execution (e.g. moving the do calls around) in order to eliminate caching as a factor, but still the same.
It turns out that executing in the order 10, 1, None results in a different ranking (1 is fastest, then 10, then 0) compared to every other permutation. The result shows that whichever do call is executed last is considerably slower than it would have been had it been executed first or in the middle.
Results (after receiving the solution from @Dunes)
0 workers: 0.0122 seconds
1 worker: 0.0214 seconds
10 workers: 0.0296 seconds
When you call one of your async functions it returns a "futures" object (an instance of tomorrow.Tomorrow in this case). This allows you to submit all your jobs without having to wait for them to finish. However, you never actually wait for the jobs to finish. So all do(openAsync1) does is time how long it takes to set up all the jobs (which should be very fast). For a more accurate test you need to do something like:
def openAll(paths: list):
    def do(func: callable) -> float:
        t = time.time()
        # do all jobs if openSync, else start all jobs if openAsync
        results = [func(p) for p in paths]
        # if openAsync, the following waits until all jobs are finished
        if func is not openSync:
            for r in results:
                r._wait()
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))
Using additional threads in Python generally slows things down. This is because of the global interpreter lock, which means only one thread can ever be active, regardless of the number of cores the CPU has.
However, things are complicated by the fact that your job is IO bound. More worker threads might speed things up, because a single thread might spend more time waiting for the hard drive to respond than is lost on context switching between the various threads in the multi-threaded variant.
Side note: even though neither openAsync1 nor openAsync10 waits for jobs to complete, do(openAsync10) is probably slower because it requires more synchronisation between threads when submitting a new job.
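If you want to reproduce the corrected measurement without the Tomorrow library, a rough equivalent using concurrent.futures directly could look like this (a sketch; the data/ layout is taken from the question, the function names are my own):
import glob
import time
from concurrent.futures import ThreadPoolExecutor

def open_sync(path):
    for row in open(path):
        for _ in row:
            pass

def timed_run(workers, paths):
    t = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(open_sync, p) for p in paths]
        for f in futures:
            f.result()  # block until each job is actually done
    return time.time() - t

paths = glob.glob("data/*")
for workers in (1, 10):
    print(workers, "workers:", timed_run(workers, paths))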

Why does my loop require more memory on each iteration?

I am trying to reduce the memory requirements of my Python 3 code. Right now each iteration of the for loop requires more memory than the last one.
I wrote a small piece of code that has the same behaviour as my project:
import numpy as np
from multiprocessing import Pool
from itertools import repeat

def simulation(steps, y):  # the function that starts the parallel execution of f()
    pool = Pool(processes=8, maxtasksperchild=int(steps/8))
    results = pool.starmap(f, zip(range(steps), repeat(y)), chunksize=int(steps/8))
    pool.close()
    return results

def f(steps, y):  # steps is used as a counter. My code doesn't need it.
    a, b = np.random.random(2)
    return y*a, y*b

def main():
    steps = 2**20  # amount of times a random sample is taken
    y = np.ones(5)  # dummy variable to show that the next iteration depends on the previous one
    total_results = np.zeros((0,2))
    for i in range(5):
        results = simulation(steps, y[i-1])
        y[i] = results[0][0]
        total_results = np.vstack((total_results, results))
    print(total_results, y)

if __name__ == "__main__":
    main()
For each iteration of the for loop the threads in simulation() each have a memory usage equal to the total memory used by my code.
Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?
Ideally I would want my code to only copy the memory it requires to execute f() while I can save the results in memory.
Though the script does use quite a bit of memory even with the "smaller" example values, the answer to
Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?
is that it does, in a way, clone the environment by forking a new process, but if copy-on-write semantics are available, no actual physical memory needs to be copied until it is written to. For example, on this system
% uname -a
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
COW seems to be available and in use, but this may not be the case on other systems. On Windows it is strictly different, as a new Python interpreter is executed from an .exe instead of forking. Since you mention using htop, you're on some flavour of UNIX or a UNIX-like system, and you get COW semantics.
For each iteration of the for loop the processes in simulation() each have a memory usage equal to the total memory used by my code.
The spawned processes will display almost identical RSS values, but this can be misleading: as long as no writes occur, they mostly occupy the same physical memory, mapped into multiple processes. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". This submitting happens over IPC, and submitted data will be copied. In your example the IPC and the 2**20 function calls also dominate the CPU usage. Replacing the mapping with a single vectorized multiplication in simulation took the script's runtime from around 150 s to 0.66 s on this machine.
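For reference, the vectorized replacement mentioned above could look roughly like this (my sketch, not the answerer's exact code):
import numpy as np

def simulation_vectorized(steps, y):
    # draw all samples at once instead of dispatching 2**20 tiny jobs over IPC
    samples = np.random.random((steps, 2))
    return y * samples  # scale both columns by the scalar y in one operation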
We can observe COW with a (somewhat) simplified example that allocates a large array and passes it to a spawned process for read-only processing:
import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil

def read_arr(arr, done, stop):
    with done:
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)

def main():
    # Create a large array
    print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    A = np.random.random(2**28)
    print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    done = Condition()
    stop = Event()
    p = Process(target=read_arr, args=(A, done, stop))
    with done:
        p.start()
        done.wait()
    print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    stop.set()
    p.join()

if __name__ == '__main__':
    main()
Output on this machine:
% python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...
Now if we replace the function read_arr with a function that modifies the array:
def mutate_arr(arr, done, stop):
    with done:
        arr[::4096] = 1
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)
the results are quite different:
Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...
The for-loop does indeed require more memory after each iteration, but that's to be expected: it stacks total_results from the mapping, so it has to allocate space for a new array that holds both the old results and the new ones, and then free the now unused array of old results.
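If that growth matters, one hypothetical way around it is to preallocate the final array once and fill it slice by slice (a sketch; the random draw stands in for simulation(...)):
import numpy as np

steps, iterations = 2**20, 5
total_results = np.empty((iterations * steps, 2))
for i in range(iterations):
    results = np.random.random((steps, 2))          # stand-in for simulation(steps, y)
    total_results[i * steps:(i + 1) * steps] = results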
Maybe you should know the difference between a thread and a process in an operating system; see What is the difference between a process and a thread.
In the for loop there are processes, not threads. Threads share the address space of the process that created them; processes have their own address space.
You can print the process id with os.getpid().
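A tiny illustration of that point (my own sketch): each Pool worker runs in its own process, with its own pid and its own address space.
import os
from multiprocessing import Pool

def whoami(i):
    # runs in a worker process, so this pid differs from the parent's
    return i, os.getpid()

if __name__ == "__main__":
    print("parent pid:", os.getpid())
    with Pool(4) as pool:
        print(pool.map(whoami, range(4)))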
