Extremely simple Python program taking up 100% of CPU

I have this program that checks the amount of CPU being used by the current Python process.
import os
import psutil

p = psutil.Process(os.getpid())
counter = 0
while True:
    if counter % 1000 == 0:
        print(p.cpu_percent())
    counter += 1
The output is:
98.79987719751766
98.79981257674615
98.79975442677997
98.80031017770553
98.80022615662917
98.80020675841527
98.80027781367056
98.80038116157328
98.80055555555509
98.80054906013777
98.8006523704943
98.80072337402265
98.80081374321833
98.80092993219198
98.80030995738038
98.79962549234794
98.79963842975158
98.79916715088079
98.79930277598402
98.7993480085206
98.79921895171654
98.799154456851
As seen from the output, this program is taking up 100% of my CPU, and I'm having a tough time understanding why. I noticed that adding a time.sleep(0.25) causes the CPU usage to drop to zero.
In my actual application, I have a similar while loop and can't afford to have a sleep in it, since it is reading a video from OpenCV and needs to stay real-time.
import cv2

cap = cv2.VideoCapture("video.mp4")
while True:
    success, frame = cap.retrieve()
This program takes the same amount of CPU as the first program I wrote, but this one decodes video!
If someone could explain a bit more that'd be awesome.

Your original loop is doing something as fast as it can. Any program doing purely CPU-bound work, with no significant blocking operations involved, will happily consume the whole CPU; it just does whatever it's doing faster if the CPU is faster.
Your particular something is mostly just incrementing a value and dividing it by 1000 over and over, which is inefficient, but making it more efficient, e.g. by changing the loop to:
import os
import psutil

p = psutil.Process(os.getpid())
while True:
    for i in range(1000): pass
    print(p.cpu_percent())
removing all the division work and making the counting more efficient (range does the work at the C layer) would just mean you run the 1000 iterations faster and print cpu_percent more often (it might slightly reduce CPU usage, but only because you might print enough that the output buffer fills faster than it can be drained for display, so your program occasionally blocks on I/O).
Point is, if you tell the computer to do something forever as fast as it can, it will. So don't. Use a sleep; even a small one (time.sleep(0.001)) would make a huge difference.
For your video capture scenario, it seems like that's what you want in the first place. If the video source is producing enough output to occupy your whole CPU, so be it; if it isn't, and your code blocks when it gets ahead, that's great. But you don't want to slow processing just for the sake of lower CPU usage if it means falling behind or dropping frames.
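If you ever do want to cap CPU usage while reading from a file (where the decoder never blocks, so the loop spins as fast as it can decode), one option is to pace the loop to the video's frame rate instead of sleeping a fixed amount. A rough sketch, assuming the file reports a usable frame rate via cv2.CAP_PROP_FPS (the 30 fps fallback is an arbitrary choice):

import time
import cv2

cap = cv2.VideoCapture("video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # some containers report 0; fall back to an assumed 30 fps
frame_interval = 1.0 / fps

next_deadline = time.monotonic()
while True:
    success, frame = cap.read()           # grab + decode the next frame
    if not success:
        break                             # end of file or read error
    # ... process frame here ...
    next_deadline += frame_interval
    delay = next_deadline - time.monotonic()
    if delay > 0:
        time.sleep(delay)                 # only sleep when ahead of schedule

Because it only sleeps when it is ahead of the file's frame rate, this never slows you down when decoding is already the bottleneck.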

I hope you are doing great!
When you do:
while True:
    if counter % 1000 == 0:
        print(p.cpu_percent())
    counter += 1
You are actually asking your computer to work constantly.
It is going to increment counter as fast as possible and will display the cpu_percent every time counter is a multiple of 1000.
This means your program constantly feeds the CPU with instructions for incrementing that counter.
When you use sleep, you basically tell the OS (operating system) that your code shouldn't execute anything new before the sleep time has passed. Your code will then not flood the CPU with instructions.
Sleep suspends execution for the given number of seconds.
In this case it is better to use a sleep than the counter.
import os
import psutil
import time

p = psutil.Process(os.getpid())
while True:
    time.sleep(1)
    print(p.cpu_percent())
I hope this helps.
Have a lovely day,
G

You might want to add a time.sleep(numberOfSeconds) to your loop if you don't want it to use 100% CPU all the time and it's only checking for a certain condition over and over.
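A minimal sketch of that pattern, with a hypothetical condition_is_met() standing in for whatever the loop is polling:

import time

def condition_is_met():
    # hypothetical check; replace with whatever your loop is waiting for
    return False

while not condition_is_met():
    time.sleep(0.1)   # yield the CPU between checks instead of spinning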

Related

Why is this multiprocessing program running more slowly than its non-concurrent version?

I have made a program that sums a list by dividing it into subparts and using multiprocessing in Python. My code is the following:
from concurrent.futures import ProcessPoolExecutor, as_completed
import random
import time

def dummyFun(l):
    s = 0
    for i in range(0, len(l)):
        s = s + l[i]
    return s

def sumaSec(v):
    start = time.time()
    sT = 0
    for k in range(0, len(v), 10):
        vc = v[k:k+10]
        print("vector ", vc)
        for item in vc:
            sT = sT + item
        print("sequential sum result ", sT)
        sT = 0
    start1 = time.time()
    print("sequential version time ", start1-start)

def main():
    workers = 5
    vector = random.sample(range(1, 101), 100)
    print(vector)
    sumaSec(vector)
    dim = 10
    sT = 0
    for k in range(0, len(vector), dim):
        vc = vector[k:k+dim]
        print(vc)
        for item in vc:
            sT = sT + item
        print("sub list result ", sT)
        sT = 0
    chunks = (vector[k:k+dim] for k in range(0, len(vector), 10))
    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(dummyFun, chunk) for chunk in chunks]
        for future in as_completed(futures):
            print(future.result())
    start1 = time.time()
    print(start1-start)

if __name__ == "__main__":
    main()
The problem is that for the sequential version I got a time of:
0.0009753704071044922
while for the concurrent version my time is:
0.10629010200500488
And when I reduce the number of workers to 2 my time is:
0.08622884750366211
Why is this happening?
The length of your vector is only 100. That is a very small amount of work, so the fixed cost of starting the process pool is the most significant part of the runtime. For this reason, parallelism is most beneficial when there is a lot of work to do. Try a larger vector, like a length of 1 million.
The second problem is that you have each worker do a tiny amount of work: a chunk of size 10. Again, that means the cost of starting a task cannot be amortized over so little work. Use larger chunks. For example, instead of 10 use int(len(vector)/(workers*10)).
Also note that you're creating 5 processes. For a CPU-bound task like this one, you ideally want to use the same number of processes as you have physical CPU cores. Either pass the number of cores your system has, or use max_workers=None (the default value), in which case ProcessPoolExecutor defaults to the number of processors on the machine. If you use too few processes you're leaving performance on the table; if you use too many, the CPU will have to switch between them and your performance may suffer.
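A sketch of those suggestions combined (a much larger input, larger chunks, and a pool sized to the machine); the exact numbers are illustrative, and dummyFun is reduced to the built-in sum:

import os
import random
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def dummyFun(l):
    return sum(l)

def main():
    vector = [random.randint(1, 100) for _ in range(1000000)]   # enough work to amortize pool start-up
    workers = os.cpu_count() or 1                               # size the pool to the machine
    chunk = max(1, len(vector) // (workers * 10))               # larger chunks, far fewer tasks
    chunks = (vector[k:k+chunk] for k in range(0, len(vector), chunk))

    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(dummyFun, c) for c in chunks]
        total = sum(f.result() for f in as_completed(futures))
    print(total, time.time() - start)

if __name__ == "__main__":
    main()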
Your chunking creates far too many tasks.
Creating too many tasks still incurs that per-task overhead even when your workers are already created.
Maybe this post can help you in your search:
How to parallel sum a loop using multiprocessing in Python

Is there anything in Python 2.7 akin to Go's `time.Tick` or Netty's `HashedWheelTimer`?

I write a lot of code that relies on precise periodic method calls. I've been using Python's futures library to submit calls onto the runtime's thread pool and sleeping between calls in a loop:
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count
import time

executor = ThreadPoolExecutor(max_workers=cpu_count())

def remote_call():
    # make a synchronous bunch of HTTP requests
    pass

def loop():
    while True:
        # do work here
        executor.submit(remote_call)
        time.sleep(60*5)
However, I've noticed that this implementation introduces some drift over a long run (e.g. after running this code for about 10 hours I noticed about 7 seconds of drift). For my work I need this to run on the exact second, and millisecond precision would be even better. Some folks have pointed me to asyncio ("Fire and forget" python async/await), but I have not been able to get this working in Python 2.7.
I'm not looking for a hack. What I really want is something akin to Go's time.Tick or Netty's HashedWheelTimer.
Nothing like that comes with Python. You'd need to manually adjust your sleep times to account for time spent working.
You could fold that into an iterator, much like the channel of Go's time.Tick:
import itertools
import time
import timeit

def tick(interval, initial_wait=False):
    # time.perf_counter would probably be more appropriate on Python 3
    start = timeit.default_timer()
    if not initial_wait:
        # yield immediately instead of sleeping
        yield
    for i in itertools.count(1):
        # never pass sleep a negative duration if a slow loop body put us behind schedule
        time.sleep(max(0.0, start + i*interval - timeit.default_timer()))
        yield
for _ in tick(300):
    # Will execute every 5 minutes, accounting for time spent in the loop body.
    do_stuff()
Note that the above ticker starts ticking when you start iterating, rather than when you call tick, which matters if you try to start a ticker and save it for later. Also, it doesn't send the time, and it won't drop ticks if the receiver is slow. You can adjust all that on your own if you want.
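If you do want the Go-like behaviour of receiving a value with each tick, a small variant along the same lines (a sketch, with the same caveats) could yield the scheduled tick time instead:

import itertools
import time
import timeit

def tick_with_time(interval):
    # Yields the scheduled time of each tick, in timeit.default_timer()'s timebase,
    # roughly mirroring how Go's time.Tick sends a time value on its channel.
    start = timeit.default_timer()
    for i in itertools.count(1):
        deadline = start + i * interval
        time.sleep(max(0.0, deadline - timeit.default_timer()))
        yield deadline

for scheduled in tick_with_time(300):
    do_stuff()   # runs every 5 minutes; `scheduled` is the tick's target time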

Python: kill the application if subprocess.check_output waits too long to receive the output

I have implemented a cache-oblivious algorithm and have shown with the PAPI library that the L1/L2/L3 misses are very low. However, I would also like to see how the algorithm behaves if I reduce the available RAM and force the algorithm to start using the swap space on disk. Since the algorithm is cache-oblivious, I should expect much better scaling to disk compared to other non-cache-oblivious algorithms for the same problem.
The problem, however, is that it is very hard to predict how badly the algorithms will perform once they hit the disk; a small increase in the input size might dramatically change the time it takes for the algorithm to finish running. So if you have many algorithms that you want to test, and one takes forever to finish, the experiment becomes useless (I could of course sit and monitor the experiment and kill it with ctrl+c, but I really need to sleep).
Let's say the algorithms are A, B and C. I use a different Python script for each algorithm. For varying input size n, I use subprocess.check_output to call the executable of the implementation. This executable returns some statistics that I then process and store in a suitable format that I can later use with R, for example, to make some nice plots.
This is an example code for algorithm A:
import subprocess
import sys

f1 = open('data.stats', 'w+', 1)

minLeafs = 200000
maxLeafs = 2000000
step = 200000
iterations = 10
ns = range(minLeafs, maxLeafs+1, step)
incr = 0
f1.write('n\tp\talg\ttime\n')
for n in ns:
    i = 0
    for p in ps:   # ps (the list of parameter values) is defined elsewhere in the real script
        for it in range(0, iterations):
            resA = subprocess.check_output(['/usr/bin/time', '-v', './A', str(n)],
                                           stderr=subprocess.STDOUT,
                                           universal_newlines=True)  # decode to str so it can be written below
            # do something with resA
            f1.write(resA + '\n')
            incr = incr + 1
            print(incr/(len(ns)*iterations)*100.0, '%', end="\r")
        i = i + 1
My question is, can I somehow kill the script if subprocess.check_output takes too long to receive an answer? The best thing for me would be to define a cutoff, like 10 minutes, so that if subprocess.check_output hasn't received anything by then, the entire script is killed.
If you're using Python 3 (and the format of your call to print suggests you might be), then check_output actually already has a timeout argument that might be useful to you: https://docs.python.org/3.6/library/subprocess.html#subprocess.check_output
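A minimal sketch of how that could slot into the inner loop of the script above (the 600-second cutoff is just the 10-minute example from the question; n comes from the surrounding loop):

try:
    resA = subprocess.check_output(['/usr/bin/time', '-v', './A', str(n)],
                                   stderr=subprocess.STDOUT,
                                   universal_newlines=True,
                                   timeout=600)   # give up after 10 minutes
except subprocess.TimeoutExpired:
    # check_output kills the child before raising; decide here whether to skip this n or abort everything
    raise SystemExit('algorithm A exceeded the 10 minute cutoff')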

Multiprocessing Pool in Python - Only single CPU is utilized

Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
import multiprocessing

def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))   # bar() is defined elsewhere in the real code
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?
How can I resolve this problem?
Minimal, complete, verifiable example
To replicate my problem, I have created this example: it's a simple n-gram generation from a string problem.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random

def f(x):
    return x

def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000  # 100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
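One rough way to see how big the parameter-passing cost is, is to time just the serialization of a single task's payload, since that is work the main process has to do for every task it submits. A sketch with an illustrative payload size:

import pickle
import time

# Build something shaped like one ngrams() result: a long list of small word lists.
words = [str(i) for i in range(200000)]
payload = [words[i:i+5] for i in range(len(words) - 4)]

start = time.time()
data = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
print('%d bytes pickled in %.3f seconds' % (len(data), time.time() - start))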
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly; just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem (a sketch follows these observations).
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
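A sketch of the explicit-batching idea from the first observation; process_batch and the batch size are illustrative placeholders, but the point is that each task carries many items, so the pickling and queue traffic are paid once per batch rather than once per item:

import multiprocessing

def process_batch(batch):
    # Per-item work for a whole batch runs inside the worker.
    return [len(item) for item in batch]   # placeholder work

def batches(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i+size]

if __name__ == '__main__':
    data = [str(i) for i in range(1000000)]
    pool = multiprocessing.Pool()
    results = []
    for batch_result in pool.imap_unordered(process_batch, batches(data, 10000)):
        results.extend(batch_result)
    pool.close()
    pool.join()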
The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
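A sketch of the first option, where each task computes its own n-grams inside the worker; num is shrunk so the example runs quickly, and work() just returns a count as placeholder output:

import multiprocessing
import random
import time

def ngrams(input_tmp, n):
    words = input_tmp.split()
    n = min(n, len(words))
    return [words[i:i+n] for i in range(len(words) - n + 1)]

def work(args):
    # Runs in a worker process: the heavy ngrams() call now happens here, not in the parent.
    rand_str, n = args
    return len(ngrams(rand_str, n))

if __name__ == '__main__':
    num = 100000   # much smaller than the question's value, just for illustration
    rand_list = random.sample(range(1000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)

    pool = multiprocessing.Pool()
    start = time.time()
    results = pool.map(work, [(rand_str, n) for n in range(1, 100)])
    pool.close()
    pool.join()
    print('Total time taken: ' + str(time.time() - start))

Note that rand_str is still pickled and sent to the workers for every task, so for a truly huge string you would also want one of the data-sharing approaches from the previous answer.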

Control the speed of a loop

I'm currently reading physics at university, and I'm learning Python as a little hobby.
To practise both at the same time, I figured I'd write a little "physics engine" that calculates the movement of an object based on x, y and z coordinates. I'm only going to output the movement as text (at least for now!), but I want the position updates to be real-time.
To do that I need to update the position of an object, let's say a hundred times a second, and print it back to the screen. So every 10 ms the program prints the current position.
So if the calculations take 2 ms, then the loop must wait 8 ms before it prints and recalculates the next position.
What's the best way of constructing a loop like that, and is 100 times a second a reasonable frequency, or would you go slower, like 25 times/sec?
The basic way to wait in Python is to import time and use time.sleep. Then the question is, how long to sleep? That depends on how you want to handle cases where your loop misses the desired timing. The following implementation tries to catch up to the target interval if it misses.
import time
import random

def doTimeConsumingStep(N):
    """
    This represents the computational part of your simulation.
    For the sake of illustration, I've set it up so that it takes a random
    amount of time which is occasionally longer than the interval you want.
    """
    r = random.random()
    computationTime = N * (r + 0.2)
    print("...computing for %f seconds..." % (computationTime,))
    time.sleep(computationTime)

def timerTest(N=1):
    repsCompleted = 0
    beginningOfTime = time.time()  # wall-clock reference point
    start = time.time()
    goAgainAt = start + N
    while 1:
        print("Loop #%d at time %f" % (repsCompleted, time.time() - beginningOfTime))
        repsCompleted += 1
        doTimeConsumingStep(N)
        # If we missed our interval, iterate immediately and increment the target time
        if time.time() > goAgainAt:
            print("Oops, missed an iteration")
            goAgainAt += N
            continue
        # Otherwise, wait for next interval
        timeToSleep = goAgainAt - time.time()
        goAgainAt += N
        time.sleep(timeToSleep)

if __name__ == "__main__":
    timerTest()
Note that you will miss your desired timing on a normal OS, so things like this are necessary. Note that even with asynchronous frameworks like tulip and twisted you can't guarantee timing on a normal operating system.
Since you cannot know in advance how long each iteration will take, you need some sort of event-driven loop. A possible solution would be using the twisted module, which is based on the reactor pattern.
from twisted.internet import task
from twisted.internet import reactor

delay = 0.1

def work():
    print "called"

l = task.LoopingCall(work)
l.start(delay)
reactor.run()
However, as has been noted, don't expect a true real-time responsiveness.
A word of warning: you cannot expect real-time behaviour on a non-realtime system. The sleep family of calls guarantees at least the given delay, but may well delay you for more.
Therefore, once you return from sleep, query the current time and schedule the next wake-up into the "future", accounting for the time the calculations took.
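A minimal sketch of that idea, with deadlines computed from a fixed schedule so that oversleeping on one iteration does not accumulate into drift:

import time

interval = 0.01                       # 10 ms target period
next_deadline = time.time() + interval
while True:
    # ... do the physics update and print the position here ...
    next_deadline += interval         # schedule relative to the plan, not to "now"
    remaining = next_deadline - time.time()
    if remaining > 0:
        time.sleep(remaining)         # sleep may overshoot a little; the fixed schedule absorbs it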
