Python numpy.fft very slow (10x slower) when run in subprocess

I've found that numpy.fft.fft (and its variants) is very slow when run in the background. Here is an example of what I'm talking about:
import numpy as np
import multiprocessing as mproc
import time
import sys

# the producer function, which will run in the background and produce data
def Producer(dataQ):
    numFrames = 5
    n = 0
    while n < numFrames:
        data = np.random.rand(3000, 200)
        dataQ.put(data)   # send the data to the consumer
        time.sleep(0.1)   # sleep for 0.1 second, so we don't overload the CPU
        n += 1

# the consumer function, which will run in the background and consume data from the producer
def Consumer(dataQ):
    while True:
        data = dataQ.get()
        t1 = time.time()
        fftdata = np.fft.rfft(data, n=3000*5)
        tDiff = time.time() - t1
        print("Elapsed time is %0.3f" % tDiff)
        time.sleep(0.01)
        sys.stdout.flush()

# the main program; the if __name__ == '__main__': guard ensures this code runs
# only when the program is started directly by the user (not when the module is
# re-imported by the child processes)
if __name__ == '__main__':
    data = np.random.rand(3000, 200)
    t1 = time.time()
    fftdata = np.fft.rfft(data, n=3000*5, axis=0)
    tDiff = time.time() - t1
    print("Elapsed time is %0.3f" % tDiff)

    # generate a queue for transferring data between the producer and the consumer
    dataQ = mproc.Queue(4)

    # start up the processes
    producerProcess = mproc.Process(target=Producer, args=[dataQ], daemon=False)
    consumerProcess = mproc.Process(target=Consumer, args=[dataQ], daemon=False)
    print("starting up processes")
    producerProcess.start()
    consumerProcess.start()

    time.sleep(10)   # let the program run for 10 seconds

    producerProcess.terminate()
    consumerProcess.terminate()
The output it produces on my machine:
Elapsed time is 0.079
starting up processes
Elapsed time is 0.859
Elapsed time is 0.861
Elapsed time is 0.878
Elapsed time is 0.863
Elapsed time is 0.758
As you can see, it is roughly 10x slower when run in the background, and I can't figure out why. The time.sleep() calls should ensure that the other processes (the main process and the producer process) aren't doing anything while the FFT is being computed, so it should be able to use all the cores. I've checked CPU utilization through the Windows Task Manager and it sits around 25% when numpy.fft.fft is being called heavily, in both the single-process and multiprocess cases.
Anyone have an idea what's going on?

The main problem is that your fft call in the background process is:
fftdata = np.fft.rfft(data, n=3000*5)
rather than:
fftdata = np.fft.rfft(data, n=3000*5, axis=0)
which for me made all the difference.
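To see why the axis matters here (a quick timing sketch, not from the original post): with the default axis=-1, rfft runs 3000 transforms of length 15000 (each row of length 200 zero-padded up to n), whereas with axis=0 it only runs 200 such transforms along the columns:
import time
import numpy as np

data = np.random.rand(3000, 200)

t1 = time.time()
np.fft.rfft(data, n=3000*5)           # default axis=-1: 3000 padded transforms
print("last axis:  %0.3f s" % (time.time() - t1))

t1 = time.time()
np.fft.rfft(data, n=3000*5, axis=0)   # axis=0: only 200 padded transforms
print("first axis: %0.3f s" % (time.time() - t1))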
There are a few other things worth noting. Rather than having the time.sleep() calls everywhere, why not just let the processor take care of this itself? Furthermore, rather than suspending the main process, you can use
consumerProcess.join()
and then have the producer process run dataQ.put(None) once it has finished loading the data, and break out of the loop in the consumer process, i.e.:
def Consumer(dataQ):
    while True:
        data = dataQ.get()
        if data is None:
            break
        ...
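Putting those pieces together, a minimal sketch of the sentinel pattern (reusing the question's Producer/Consumer names, with the timing code left out):
import numpy as np
import multiprocessing as mproc

def Producer(dataQ):
    for _ in range(5):
        dataQ.put(np.random.rand(3000, 200))
    dataQ.put(None)                    # sentinel: tells the consumer to stop

def Consumer(dataQ):
    while True:
        data = dataQ.get()
        if data is None:               # sentinel received, exit the loop
            break
        np.fft.rfft(data, n=3000*5, axis=0)

if __name__ == '__main__':
    dataQ = mproc.Queue(4)
    producerProcess = mproc.Process(target=Producer, args=[dataQ])
    consumerProcess = mproc.Process(target=Consumer, args=[dataQ])
    producerProcess.start()
    consumerProcess.start()
    producerProcess.join()             # no sleep() or terminate() needed
    consumerProcess.join()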

Related

multiprocessing is always worse than single process no matter how many

I am playing around with multiprocessing in Python 3 to try and understand how it works and when it's good to use it.
I am basing my examples on this question, which is really old (2012).
My computer is a Windows, 4 physical cores, 8 logical cores.
First: not segmented data
First I try to brute-force compute numpy.sin for a million values. The million values are a single chunk, not segmented.
import time
import numpy
from multiprocessing import Pool

# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"

def numpy_sin(value):
    return numpy.sin(value)

a = numpy.arange(1000000)

if __name__ == '__main__':
    pool = Pool(processes=8)

    start = time.time()
    result = numpy.sin(a)
    end = time.time()
    print('Singled threaded {}'.format(end - start))

    start = time.time()
    result = pool.map(numpy_sin, a)
    pool.close()
    pool.join()
    end = time.time()
    print('Multithreaded {}'.format(end - start))
And I get that, no matter the number of processes, the 'Multithreaded' case always takes 10 times or so as long as the 'Singled threaded' one. In the task manager, I see that not all the CPUs are maxed out, and the total CPU usage goes between 18% and 31%.
So I try something else.
Second: segmented data
I try to split the original 1 million computations into 10 batches of 100,000 each. Then I try again with 10 million computations in 10 batches of 1 million each.
import time
import numpy
from multiprocessing import Pool

# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"

def numpy_sin(value):
    return numpy.sin(value)

p = 3
s = 1000000
a = [numpy.arange(s) for _ in range(10)]

if __name__ == '__main__':
    print('processes = {}'.format(p))
    print('size = {}'.format(s))

    start = time.time()
    result = numpy.sin(a)
    end = time.time()
    print('Singled threaded {}'.format(end - start))

    pool = Pool(processes=p)
    start = time.time()
    result = pool.map(numpy_sin, a)
    pool.close()
    pool.join()
    end = time.time()
    print('Multithreaded {}'.format(end - start))
I ran this last piece of code for different numbers of processes p and different list lengths s, 100000 and 1000000.
At least now the Task Manager shows the CPU maxed out at 100% usage.
I get the following results for the elapsed times (plotted with multiprocessing in orange and the single process in blue; the chart is not reproduced here).
So multiprocessing never wins over the single process.
Why??
Numpy (via the BLAS library it is linked against) can change the CPU affinity of the parent process so that it only runs on one core. On Linux you can call os.system("taskset -p 0xff %d" % os.getpid()) after you import numpy to reset the CPU affinity so that all cores are used.
See this question for more details
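For concreteness, a minimal sketch of where that call would go (taskset is a Linux utility, so this workaround does not apply on Windows):
import os
import numpy as np   # importing numpy is what may pin the process to one core

# reset the CPU affinity of this process so that all cores may be used (Linux only)
os.system("taskset -p 0xff %d" % os.getpid())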
A single CPU core can really only do one thing at a time. When multi-threading or multi-processing, the computer is really just switching back and forth between tasks quickly. With the provided problem, the computer could either perform the calculation 1,000,000 times, or split the work between a couple of "workers" and perform 100,000 for each of 10 "workers".
Multi-processing shines not when computing something straight out, since the computer has to take time to create multiple processes, but while waiting for something. The main example I've heard is web scraping. If a program requested data from a list of websites and waited for each server to send data before requesting data from the next, the program would have to sit idle for a couple of seconds. If instead the computer used multiprocessing/threading to send all the requests first and wait concurrently, the total running time is much shorter.
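As a toy illustration of that point (a sketch only; time.sleep stands in for waiting on a remote server, and the URLs are made up):
import time
from multiprocessing.pool import ThreadPool

def fake_request(url):
    time.sleep(1.0)                    # pretend we are waiting for the server
    return url

urls = ['site%d' % i for i in range(8)]

start = time.time()
results = [fake_request(u) for u in urls]         # sequential: roughly 8 s
print('sequential: %.1f s' % (time.time() - start))

start = time.time()
pool = ThreadPool(8)
results = pool.map(fake_request, urls)            # concurrent waits: roughly 1 s
pool.close()
pool.join()
print('threaded:   %.1f s' % (time.time() - start))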

raspberry pi 3 multiprocessing queue synchronization between 2 processes

I have written a simple piece of code using the multiprocessing library to create an extra process apart from the main code (2 processes in total). I wrote this code on W7 Professional x64 through Anaconda-Spyder v3.2.4 and it works almost as I want, except that when I run it the memory consumption of my second process (not the main one) keeps increasing until it reaches the total capacity and the computer gets stuck and freezes (you can see this in the Windows Task Manager).
"""
Example to print data from a function using multiprocessing library
Created on Thu Jan 30 12:07:49 2018
author: Kevin Machado Gamboa
Contct: ing.kevin#hotmail.com
"""
from time import time
import numpy as np
from multiprocessing import Process, Queue, Event
t0=time()
def ppg_parameters(hr, minR, ampR, minIR, ampIR, t):
HR = float(hr)
f= HR * (1/60)
# Spo2 Red signal function
sR = minR + ampR * (0.05*np.sin(2*np.pi*t*3*f)
+ 0.4*np.sin(2*np.pi*t*f) + 0.25*np.sin(2*np.pi*t*2*f+45))
# Spo2 InfraRed signal function
sIR = minIR + ampIR * (0.05*np.sin(2*np.pi*t*3*f)
+ 0.4*np.sin(2*np.pi*t*f) + 0.25*np.sin(2*np.pi*t*2*f+45))
return sR, sIR
def loop(q):
    """
    generates the values of the function ppg_parameters
    """
    hr = 60
    ampR = 1.0814   # amplitude for Red signal
    minR = 0.0      # displacement from zero for Red signal
    ampIR = 1.12    # amplitude for InfraRed signal
    minIR = 0.7     # displacement from zero for InfraRed signal
    # infinite loop to generate the signal
    while True:
        t = time() - t0
        y = ppg_parameters(hr, minR, ampR, minIR, ampIR, t)
        q.put([t, y[0], y[1]])

if __name__ == "__main__":
    _exit = Event()
    q = Queue()
    p = Process(target=loop, args=(q,))
    p.start()
    # starts the main process
    while q.qsize() != 1:
        try:
            data = q.get(True, 2)   # takes each data item from the queue
            print(data[0], data[1], data[2])
        except KeyboardInterrupt:
            p.terminate()
            p.join()
            print('supposed to stop')
            break
Why is this happening? Perhaps it is the while loop in my 2nd process? I don't know; I haven't seen this issue anywhere.
Moreover, if I run the same code on my RPi 3 Model B, there is a point where it raises an error saying "the queue is empty", as if the main process were running faster than process two.
Any guess as to why this is happening, or any suggestion or link, would be helpful.
Thanks
It looks like inside your infinite loop you are adding to the queue, and I'm guessing that you are adding data faster than the other process can take it off the queue.
You could check the queue size periodically from inside the infinite loop and if it is over a certain amount (say 500 items), then you could sleep for a few seconds and then check again.
https://docs.python.org/2/library/queue.html#Queue.Queue.qsize
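A minimal sketch of that idea (the 500-item threshold and 2-second sleep are arbitrary numbers; alternatively, creating the queue with Queue(maxsize=500) makes q.put() block automatically once the queue is full):
import time
from multiprocessing import Process, Queue

MAX_BACKLOG = 500

def loop(q):
    """Producer that throttles itself when the consumer falls behind."""
    while True:
        sample = [time.time(), 0.0, 0.0]   # placeholder for [t, sR, sIR]
        if q.qsize() > MAX_BACKLOG:        # queue is filling faster than it drains
            time.sleep(2)                  # let the main process catch up
            continue
        q.put(sample)

if __name__ == "__main__":
    q = Queue()
    p = Process(target=loop, args=(q,))
    p.start()
    try:
        while True:
            data = q.get(True, 2)
            print(data[0], data[1], data[2])
    except KeyboardInterrupt:
        p.terminate()
        p.join()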

Python, Raspberry pi, call a task every 10 milliseconds precisely

I'm currently trying to have a function called every 10ms to acquire data from a sensor.
Basically I was triggering the callback from a GPIO interrupt, but I changed my sensor and the one I'm currently using doesn't have an INT pin to drive the callback.
So my goal is to have the same behavior but with an internal interrupt generated by a timer.
I tried this from this topic
import threading
import time

def work():
    threading.Timer(0.25, work).start()
    print(time.time())
    print("stackoverflow")

work()
But when I run it I can see that the timer is not really precise and it drifts over time, as you can see:
1494418413.1584847
stackoverflow
1494418413.1686869
stackoverflow
1494418413.1788757
stackoverflow
1494418413.1890721
stackoverflow
1494418413.1992736
stackoverflow
1494418413.2094712
stackoverflow
1494418413.2196639
stackoverflow
1494418413.2298684
stackoverflow
1494418413.2400634
stackoverflow
1494418413.2502584
stackoverflow
1494418413.2604961
stackoverflow
1494418413.270702
stackoverflow
1494418413.2808678
stackoverflow
1494418413.2910736
stackoverflow
1494418413.301277
stackoverflow
So the timer is deviating by about 0.2 milliseconds every 10 milliseconds, which is quite a big bias after a few seconds.
I know that Python is not really made for "real-time", but I think there should be a way to do it.
If someone has already had to handle time constraints with Python, I would be glad to have some advice.
Thanks.
This code works on my laptop - it logs the delta between the target and actual times - the main thing is to minimise what is done in the work() function, because e.g. printing and scrolling the screen can take a long time.
Key thing is to start the next timer based on difference between the time when that call is made and the target.
I slowed down the interval to 0.1s so it is easier to see the jitter, which on my Win7 x64 can exceed 10ms, which would cause problems with passing a negative value to the Timer() call :-o
This logs 100 samples, then prints them - if you redirect to a .csv file you can load into Excel to display graphs.
from multiprocessing import Queue
import threading
import time

# this accumulates a record of the difference between the target and actual times
actualdeltas = []

INTERVAL = 0.1

def work(queue, target):
    # first thing to do is record the jitter - the difference between target and actual time
    actualdeltas.append(time.clock() - target + INTERVAL)
    # t0 = time.clock()
    # print("Current time\t" + str(time.clock()))
    # print("Target\t" + str(target))
    # print("Delay\t" + str(target - time.clock()))
    # print()
    # t0 = time.clock()
    if len(actualdeltas) > 100:
        # print the accumulated deltas then exit
        for d in actualdeltas:
            print(d)
        return
    threading.Timer(target - time.clock(), work, [queue, target + INTERVAL]).start()

myQueue = Queue()

target = time.clock() + INTERVAL
work(myQueue, target)
Typical output (i.e. don't rely on millisecond timing on Windows in Python):
0.00947008617187
0.0029628920052
0.0121824719378
0.00582923077099
0.00131316206917
0.0105631524709
0.00437298744466
-0.000251418553351
0.00897956530515
0.0028528821332
0.0118192949105
0.00546301269675
0.0145723546788
0.00910063698529
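One detail worth guarding against, given the jitter mentioned above: if the jitter ever exceeds the interval, target - time.clock() goes negative. A small defensive tweak (my addition, not part of the code above) is to clamp the delay:
import threading
import time

INTERVAL = 0.1

def work(queue, target):
    # ... do the real work here ...
    delay = target - time.clock()
    # never pass a negative delay to Timer(); the schedule catches up on later ticks
    threading.Timer(max(0.0, delay), work, [queue, target + INTERVAL]).start()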
I tried your solution but I got strange results.
Here is my code:
from multiprocessing import Queue
import threading
import time

def work(queue, target):
    t0 = time.clock()
    print("Target\t" + str(target))
    print("Current time\t" + str(t0))
    print("Delay\t" + str(target - t0))
    print()
    threading.Timer(target - t0, work, [queue, target + 0.01]).start()

myQueue = Queue()

target = time.clock() + 0.01
work(myQueue, target)
And here is the output
Target 0.054099
Current time 0.044101
Delay 0.009998
Target 0.064099
Current time 0.045622
Delay 0.018477
Target 0.074099
Current time 0.046161
Delay 0.027937999999999998
Target 0.084099
Current time 0.0465
Delay 0.037598999999999994
Target 0.09409899999999999
Current time 0.046877
Delay 0.047221999999999986
Target 0.10409899999999998
Current time 0.047211
Delay 0.05688799999999998
Target 0.11409899999999998
Current time 0.047606
Delay 0.06649299999999997
So we can see that the target increases by 10 ms each time and, for the first loop, the delay for the timer seems to be right. The point is that instead of starting again at current_time + delay, it starts again at 0.045622, which represents a delay of 0.001521 instead of 0.01000.
Did I miss something? My code seems to follow your logic, doesn't it?
Working example for #Chupo_cro
Here is my working example
from multiprocessing import Queue
import RPi.GPIO as GPIO
import threading
import time
import os

INTERVAL = 0.01
ledState = True

GPIO.setmode(GPIO.BCM)
GPIO.setup(2, GPIO.OUT, initial=GPIO.LOW)

def work(queue, target):
    global ledState
    try:
        threading.Timer(target - time.time(), work, [queue, target + INTERVAL]).start()
        GPIO.output(2, ledState)
        ledState = not ledState
    except KeyboardInterrupt:
        GPIO.cleanup()

try:
    myQueue = Queue()
    target = time.time() + INTERVAL
    work(myQueue, target)
except KeyboardInterrupt:
    GPIO.cleanup()

Implement sub-millisecond processing in python without busy wait

How would I implement processing of an array with millisecond precision using Python under Linux (running on a single-core Raspberry Pi)?
I am trying to parse information from a MIDI file, which has been preprocessed into an array; each millisecond I check whether the array has entries at the current timestamp and trigger some functions if it does.
Currently I am using time.time() and employ busy waiting (as concluded here). This eats up all the CPU, therefore I am looking for a better solution.
# iterate through all milliseconds
for current_ms in xrange(0, last+1):
    start = time()
    # check if events are to be processed
    try:
        events = allEvents[current_ms]
        # iterate over all events for this millisecond
        for event in events:
            # check if event contains note information
            if 'note' in event:
                # check if mapping to pin exists
                if event['note'] in mapping:
                    pin = mapping[event['note']]
                    # check if event contains on/off information
                    if 'mode' in event:
                        if event['mode'] == 0:
                            pin_off(pin)
                        elif event['mode'] == 1:
                            pin_on(pin)
                        else:
                            debug("unknown mode in event:" + event)
                else:
                    debug("no mapping for note:" + event['note'])
    except:
        pass
    end = time()
    # fill the rest of the millisecond
    while (end - start) < (1.0/1000.0):
        end = time()
where last is the millisecond of the last event (known from preprocessing)
This is not a question about time() vs clock(); it's more about sleep vs busy wait.
I can't really sleep in the "fill rest of millisecond" loop, because of the too low accuracy of sleep(). If I were to use ctypes, how would I go about it correctly?
Is there some Timer library which would call a callback each millisecond reliably?
My current implementation is on GitHub. With this approach I get a skew of around 4 or 5 ms on the drum_sample, which is 3.7 s total (with mock, so no real hardware attached). On a 30.7 s sample, the skew is around 32 ms (so it's at least not linear!).
I have tried using time.sleep() and nanosleep() via ctypes with the following code
import time
import timeit
import ctypes

libc = ctypes.CDLL('libc.so.6')

class Timespec(ctypes.Structure):
    """ timespec struct for nanosleep, see:
        http://linux.die.net/man/2/nanosleep """
    _fields_ = [('tv_sec', ctypes.c_long),
                ('tv_nsec', ctypes.c_long)]

libc.nanosleep.argtypes = [ctypes.POINTER(Timespec),
                           ctypes.POINTER(Timespec)]
nanosleep_req = Timespec()
nanosleep_rem = Timespec()

def nsleep(us):
    # print('nsleep: {0:.9f}'.format(us))
    """ Delay microseconds with libc nanosleep() using ctypes. """
    if us >= 1000000:
        sec = us/1000000
        us %= 1000000
    else:
        sec = 0
    nanosleep_req.tv_sec = sec
    nanosleep_req.tv_nsec = int(us * 1000)
    libc.nanosleep(nanosleep_req, nanosleep_rem)

LOOPS = 10000

def do_sleep(min_sleep):
    # print('try: {0:.9f}'.format(min_sleep))
    total = 0.0
    for i in xrange(0, LOOPS):
        start = timeit.default_timer()
        nsleep(min_sleep*1000*1000)
        # time.sleep(min_sleep)
        end = timeit.default_timer()
        total += end - start
    return (total / LOOPS)

iterations = 5
iteration = 1
min_sleep = 0.001
result = None
while True:
    result = do_sleep(min_sleep)
    # print('res: {0:.9f}'.format(result))
    if result > 1.5 * min_sleep:
        if iteration > iterations:
            break
        else:
            min_sleep = result
            iteration += 1
    else:
        min_sleep /= 2.0

print('FIN: {0:.9f}'.format(result))
The result on my i5 is
FIN: 0.000165443
while on the RPi it is
FIN: 0.000578617
which suggests a sleep period of about 0.1 or 0.5 milliseconds; given the jitter (it tends to sleep longer), that at most helps me reduce the load a little bit.
One possible solution, using the sched module:
import sched
import time

def f(t0):
    print 'Time elapsed since t0:', time.time() - t0

s = sched.scheduler(time.time, time.sleep)
t0 = time.time()   # reference time used by the scheduled calls
for i in range(10):
    s.enterabs(t0 + 10 + i, 0, f, (t0,))
s.run()
Result:
Time elapsed since t0: 10.0058200359
Time elapsed since t0: 11.0022959709
Time elapsed since t0: 12.0017120838
Time elapsed since t0: 13.0022599697
Time elapsed since t0: 14.0022521019
Time elapsed since t0: 15.0015859604
Time elapsed since t0: 16.0023040771
Time elapsed since t0: 17.0023028851
Time elapsed since t0: 18.0023078918
Time elapsed since t0: 19.002286911
Apart from some constant offset of about 2 milliseconds (which you could calibrate away), the jitter seems to be on the order of 1 or 2 milliseconds (as reported by time.time itself). Not sure if that is good enough for your application.
If you need to do some useful work in the meantime, you should look into multi-threading or multi-processing.
Note: a standard Linux distribution that runs on a RPi is not a hard real-time operating system. Also, Python can show non-deterministic timing, e.g. when it starts a garbage collection. So your code might run fine with low jitter most of the time, but you might have occasional 'hiccups' where there is a bit of delay.
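If you do need the main thread free for other work in the meantime, one option is to run the scheduler in a background thread (a Python 3 sketch along the lines of the example above, with a shorter 1-second spacing):
import sched
import threading
import time

def f(t0):
    print('Time elapsed since t0:', time.time() - t0)

s = sched.scheduler(time.time, time.sleep)
t0 = time.time()
for i in range(10):
    s.enterabs(t0 + 1 + i, 0, f, (t0,))

# run the scheduler in a daemon thread so the main thread stays free
worker = threading.Thread(target=s.run, daemon=True)
worker.start()

# ... the main thread can do other useful work here ...
worker.join()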

Printing time in Python multiprocessing script returns negative time elapsed

Running it on Ubuntu 14 with Python 2.7.6
I simplified the script to show my problem:
import time
import multiprocessing

data = range(1, 3)
start_time = time.clock()

def lol():
    for i in data:
        print time.clock() - start_time, "lol seconds"

def worker(n):
    print time.clock() - start_time, "multiprocesor seconds"

def mp_handler():
    p = multiprocessing.Pool(1)
    p.map(worker, data)

if __name__ == '__main__':
    lol()
    mp_handler()
And the output:
8e-06 lol seconds
6.9e-05 lol seconds
-0.030019 multiprocesor seconds
-0.029907 multiprocesor seconds
Process finished with exit code 0
Using time.time() gives non-negative values (as noted here: Timer shows negative time elapsed),
but I'm curious what the problem is with time.clock() in Python multiprocessing and reading time from the CPU.
multiprocessing spawns new processes, and time.clock() on Linux has the same meaning as C's clock():
The value returned is the CPU time used so far as a clock_t;
So the values returned by clock restart from 0 when a process starts. However, your code uses the parent process's start_time to determine the time spent in the child process, which is obviously incorrect if the child process's CPU time resets.
The clock() function makes sense only when handling one process, because its return value is the CPU time spent by that process. Child processes are not taken into account.
The time() function on the other hand uses a system-wide clock, and thus can be used even between different processes (although it is not monotonic, so it might return wrong results if somebody changes the system time during the events).
Forking a running Python instance is probably faster than starting a new one from scratch, hence start_time is almost always bigger than the value returned by time.clock() in the child.
Take into account that the parent process also had to read your file on disk, perform the imports which may require reading other .py files, searching directories etc.
The forked child processes don't have to do all that.
Example code that shows that the return value of time.clock() resets to 0:
from __future__ import print_function

import time
import multiprocessing

data = range(1, 3)
start_time = time.clock()

def lol():
    for i in data:
        t = time.clock()
        print('t: ', t, end='\t')
        print(t - start_time, "lol seconds")

def worker(n):
    t = time.clock()
    print('t: ', t, end='\t')
    print(t - start_time, "multiprocesor seconds")

def mp_handler():
    p = multiprocessing.Pool(1)
    p.map(worker, data)

if __name__ == '__main__':
    print('start_time', start_time)
    lol()
    mp_handler()
Result:
$python ./testing.py
start_time 0.020721
t: 0.020779 5.8e-05 lol seconds
t: 0.020804 8.3e-05 lol seconds
t: 0.001036 -0.019685 multiprocesor seconds
t: 0.001166 -0.019555 multiprocesor seconds
Note how t is monotonic for the lol case, while it goes back to 0.001 in the other case.
To add a concise Python 3 example to Bakuriu's excellent answer above, you can use the following method to get a global timer independent of the subprocesses:
import multiprocessing as mp
import time

# create iterable
iterable = range(4)

# adds three to the given element
def add_3(num):
    a = num + 3
    return a

# multiprocessing attempt
def main():
    pool = mp.Pool(2)
    results = pool.map(add_3, iterable)
    return results

if __name__ == "__main__":  # required not to spawn deviant children
    start = time.time()
    results = main()
    print(list(results))
    elapsed = time.time() - start
    print("\n", "time elapsed is :", elapsed)
Note that if we had used time.process_time() instead of time.time(), we would get an undesired result.
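For example, a small sketch of the difference: time.process_time() counts only the CPU time of the calling process, so while the pool workers do the work (and the parent mostly waits), the parent's process time barely moves:
import multiprocessing as mp
import time

def slow_square(x):
    time.sleep(0.5)                    # simulate work done in the worker process
    return x * x

if __name__ == "__main__":
    wall = time.time()
    cpu = time.process_time()

    with mp.Pool(2) as pool:
        print(pool.map(slow_square, range(4)))

    print("wall clock elapsed:", time.time() - wall)         # roughly 1 second
    print("parent CPU elapsed:", time.process_time() - cpu)  # close to zero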
