Numpy matrix inverse appears to use multiple threads - python

So I have this really simple code:
import numpy as np
import scipy as sp
import scipy.linalg  # scipy's submodules must be imported explicitly

mat = np.identity(4)
for i in range(100000):
    np.linalg.inv(mat)
for i in range(100000):
    sp.linalg.inv(mat)
Now, the first bit, where the inverse is done through numpy, for some reason launches 3 additional threads (so 4 total, including the main one), and collectively they consume roughly three of my available cores, causing the fans on my computer to go wild.
The second bit, where I use SciPy, has no noticeable impact on CPU use and there is only one thread, the main thread. It runs only about 20% slower than the numpy loop.
Does anyone have any idea what's going on? Is numpy doing threading in the background? Why is it so inefficient?

I faced the same issue; the fix was to set export OPENBLAS_NUM_THREADS=1 and then run the Python script in the same terminal.
My issue was that a simple code block consisting of np.linalg.inv() was consuming more than 50% of CPU. After setting the OPENBLAS_NUM_THREADS variable, CPU usage dropped to around 3% and the total execution time decreased as well. I read somewhere that this is an issue with the OpenBLAS library (which is what the numpy.linalg.inv function uses). Hope this helps!
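For reference, here is a minimal sketch of two ways to apply this limit from inside Python; the second variant assumes the third-party threadpoolctl package is installed:
# Option 1: cap OpenBLAS threads via the environment. This must happen
# before numpy is imported, otherwise OpenBLAS has already spawned its pool.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

# Option 2 (assumes the threadpoolctl package): limit BLAS threads
# only for a specific block of code.
from threadpoolctl import threadpool_limits

mat = np.identity(4)
with threadpool_limits(limits=1, user_api="blas"):
    for _ in range(100000):
        np.linalg.inv(mat)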


Detect memory swapping in Python

How to detect when OS starts swapping some resources of the running process to disk?
I came here from basically the same question. The psutil library is obviously great and provides a lot of information, yet I don't know how to use it to solve the problem.
I created a simple test script.
import psutil
import os
import numpy as np
MAX = 45000
ar = []
pr_mem = []
swap_mem = []
virt_mem = []
process = psutil.Process()
for i in range(MAX):
    ar.append(np.zeros(100_000))
    pr_mem.append(process.memory_info())
    swap_mem.append(psutil.swap_memory())
    virt_mem.append(psutil.virtual_memory())
Then, I plotted the course of those statistics.
import matplotlib.pyplot as plt
plt.figure(figsize=(16,12))
plt.plot([x.rss for x in pr_mem], label='Resident Set Size')
plt.plot([x.vms for x in pr_mem], label='Virtual Memory Size')
plt.plot([x.available for x in virt_mem], label='Available Memory')
plt.plot([x.total for x in swap_mem], label='Total Swap Memory')
plt.plot([x.used for x in swap_mem], label='Used Swap Memory')
plt.plot([x.free for x in swap_mem], label='Free Swap Memory')
plt.legend(loc='best')
plt.show()
I cannot see how I can use the information about swap memory to detect swapping of my process.
The 'Used Swap Memory' value is fairly meaningless, as it is high from the very first moment (it counts global consumption), when the data of my process are obviously not yet swapped.
It seems best to look at the difference between the 'Virtual Memory Size' and the 'Resident Set Size' of the process: if VMS greatly exceeds RSS, it is a sign that the data are not in RAM but on disk.
However, a problem with sudden explosions of VMS has been described here, which makes it irrelevant in some cases, if I understand correctly.
Another approach is to watch 'Available Memory' and make sure it does not drop below a certain threshold, as the psutil documentation suggests. But it seems complicated to set the threshold properly: the documentation suggests (in a small snippet) 100 MB, but on my machine it is more like 500 MB.
So the question stands. How to detect when OS starts swapping some resources of the running process to disk?
I work on Windows, but the solution needs to be cross-platform (or at least as cross-platform as possible).
Any suggestion, advice or useful link is welcomed. Thank you!
For context, I am writing a library which needs to manage its memory consumption.
I believe that, with knowledge of the program logic, my manual swapping (serializing data to disk) can work better (faster) than OS swapping. More importantly, the OS swap space is limited, so sometimes manual swapping, which does not use the OS swap space, is necessary.
In order to start the manual swapping at the right time, it is crucial to know when OS swapping starts.
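For what it's worth, here is a minimal heuristic sketch combining the two ideas above; the 500 MB threshold is an assumption that has to be tuned per machine, and the per-process swap counter is only available on Linux:
import sys
import psutil

AVAILABLE_THRESHOLD = 500 * 1024 * 1024  # bytes; tune for your machine

def process_may_be_swapping(process):
    # Cross-platform heuristic: system-wide available memory is running low.
    if psutil.virtual_memory().available < AVAILABLE_THRESHOLD:
        return True
    # Linux only: memory_full_info().swap is the amount of this process's
    # memory currently swapped out to disk.
    if sys.platform.startswith("linux"):
        return process.memory_full_info().swap > 0
    return False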

Parallelize matplotlib.animation when running .to_html5_video()

I was wondering whether there is a straightforward way to parallelize the method animation.FuncAnimation.to_html5_video(), which takes a lot of time. While running, it only uses one of my CPU cores at any given time, and I am guessing the process should be parallelizable. Any ideas to make this work without digging too much?
I need this to plot some time-evolving curves from an ODE system, but it takes a lot of time.
PS: When I say parallelize, I mean e.g. using a library like multiprocessing.
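There is no built-in parallel mode for to_html5_video(), but one rough workaround sketch (not a drop-in replacement) is to render the frames in a multiprocessing pool and stitch them together with the external ffmpeg tool; this assumes ffmpeg is on your PATH:
import multiprocessing as mp
import subprocess

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe in worker processes
import matplotlib.pyplot as plt
import numpy as np

def render_frame(i):
    # Placeholder plot; replace with the ODE curves for frame i.
    fig, ax = plt.subplots()
    x = np.linspace(0, 2 * np.pi, 200)
    ax.plot(x, np.sin(x + 0.1 * i))
    fig.savefig("frame_%04d.png" % i)
    plt.close(fig)

if __name__ == "__main__":
    with mp.Pool() as pool:
        pool.map(render_frame, range(100))
    subprocess.run(["ffmpeg", "-y", "-i", "frame_%04d.png", "out.mp4"], check=True)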

Can time.clock() be heavily affected by the state of the system?

This is a rather general question:
I am having issues that the same operation measured by time.clock() takes longer now than it used to.
At first I had some very similar measurements:
1954 s
1948 s
1948 s
Then one somewhat different measurement:
1999 s
And another, even more different:
2207 s
And now that I am repeating the measurements, it seems to get slower and slower.
I am not summing over the measurements after rounding or doing other weird manipulations.
Do you have some ideas whether this could be affected by how busy the server is, the clock speed or any other variable parameters? I was hoping that using time.clock() instead of time.time() would mostly sort these out...
The OS is Ubuntu 18.04.1 LTS.
The operations are run in separate screen sessions.
The operations do not involve hard-disk access.
The operations are mostly numpy operations that are not distributed. So this is actually mainly C code being executed.
EDIT: This might be relevant: the measurements from time.time() and time.clock() are very similar in all cases. That is, time.time() measurements are always just slightly longer than time.clock() ones. So unless I have missed something, the cause has almost exactly the same effect on time.clock() as on time.time().
EDIT: I do not think that my question has been answered. Another reason I can think of is that garbage collection contributes to CPU usage and runs more frequently when RAM is full or about to be full.
Mainly, I am looking for an alternative measure that gives the same number for the same operations, meaning my algorithm executed from the same start state. Is there a simple way to count FLOPS or something similar?
The issue seems to be related to Python and Ubuntu.
Try the following:
Check that you have the latest stable build of the Python version you're using (Link 1).
Check the process list, and also see which CPU core your Python executable is running on.
Check the thread priority state on the CPU (Link 2, Link 3).
Note:
Time may vary due to process switching, threading, other OS resource management, and application code execution (this cannot be controlled).
Suggestions:
It could be because of your system's build; try running your code on another machine or in a virtual machine.
Read Up:
Link 4
Link 5
Good Luck.
~ Dean Van Geunen
As a result of repeatedly running the same algorithm at different 'system states', I would summarize that the answer to the question is:
Yes, time.clock() can be heavily affected by the state of the system.
Of course, this holds all the more for time.time().
The general reasons could be that
The same Python code does not always result in the same commands being sent to the CPU; that is, the commands depend not only on the code and the start state, but also on the system state (e.g. garbage collection).
The system might interfere with the commands sent from Python, resulting in additional CPU usage (e.g. from core switching) that is still counted by time.clock().
The divergence can be very large, in my case around 50%.
It is not clear which are the specific reasons, nor how much each of them contributes to the problem.
It is still to be tested whether timeit helps with some or all of the above points. However, timeit is meant for benchmarking and might not be suitable during normal processing: it turns off garbage collection and does not give access to the return values of the timed function.
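As a side note, time.clock() was removed in Python 3.8; here is a minimal sketch of the modern replacements, which separate CPU time from wall-clock time:
import time
import numpy as np

a = np.random.randn(2000, 2000)

t_cpu = time.process_time()   # CPU time of this process only
t_wall = time.perf_counter()  # wall-clock time
np.linalg.inv(a)
print("CPU time: ", time.process_time() - t_cpu)
print("Wall time:", time.perf_counter() - t_wall)

# Note: process_time() sums CPU time over all threads, so with a
# multithreaded BLAS it can exceed the wall-clock time.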

Fastest way to compute matrix dot product

I compute the dot product as follows:
import numpy as np
A = np.random.randn(80000, 3000)
B = np.random.randn(3000, 50)
C = np.dot(A, B)
Running this script takes about 9 seconds:
Mac#MacBook-Pro:~/python_dot_product$ time python dot.py
real 0m9.042s
user 0m10.927s
sys 0m0.911s
Could I do any better?
Does numpy already use the ideal balance for the cores?
The last two answers to this SO question should be helpful.
The last one pointed me to SciPy documentation, which includes this quote:
"[np.dot(A,B) is evaluated using BLAS, which] will normally be a
library carefully tuned to run as fast as possible on your hardware by
taking advantage of cache memory and assembler implementation. But
many architectures now have a BLAS that also takes advantage of a
multicore machine. If your numpy/scipy is compiled using one of these,
then dot() will be computed in parallel (if this is faster) without
you doing anything."
So it sounds like it depends on your specific hardware and SciPy compilation. Sometimes np.dot(A,B) will utilize your multiple cores/processors, sometimes it might not.
To find out which case is yours, I suggest running your toy example (with larger matrices) while you have your system monitor open, so you can see whether just one CPU spikes in activity, or if multiple ones do.
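To check this without a system monitor, here is a minimal sketch; the threadpool_info() call assumes the third-party threadpoolctl package is installed:
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries numpy was built against

from threadpoolctl import threadpool_info
for pool in threadpool_info():
    # each entry describes one native thread pool, e.g. OpenBLAS or MKL
    print(pool["user_api"], pool["num_threads"])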

Python randomly drops to 0% CPU usage, causing the code to "hang up", when handling large numpy arrays?

I have been running some code, a part of which loads in a large 1D numpy array from a binary file, and then alters the array using the numpy.where() method.
Here is an example of the operations performed in the code:
import numpy as np
num = 2048
threshold = 0.5
with open(file, 'rb') as f:  # 'file' holds the path to the 32 GB binary file
    arr = np.fromfile(f, dtype=np.float32, count=num**3)
arr *= threshold
arr = np.where(arr >= 1.0, 1.0, arr)
vol_avg = np.sum(arr)/(num**3)
# both arr and vol_avg needed later
I have run this many times (on a free machine, i.e. no other inhibiting CPU or memory usage) with no issue. But recently I have noticed that sometimes the code hangs for an extended period of time, making the runtime an order of magnitude longer. On these occasions I have been monitoring %CPU and memory usage (using gnome system monitor), and found that python's CPU usage drops to 0%.
Using basic prints in between the above operations to debug, it seems to be arbitrary which operation causes the pausing (open(), np.fromfile(), and np.where() have each separately caused a hang on a random run). It is as if I am being throttled randomly, because on other runs there are no hangs.
I have considered things like garbage collection or this question, but I cannot see any obvious relation to my problem (for example keystrokes have no effect).
Further notes: the binary file is 32GB, the machine (running Linux) has 256GB memory. I am running this code remotely, via an ssh session.
EDIT: This may be incidental, but I have noticed that there are no hang ups if I run the code after the machine has just been rebooted. It seems they begin to happen after a couple of runs, or at least other usage of the system.
np.where is creating a copy there and assigning it back into arr. So we could save memory by avoiding that copying step, like so:
vol_avg = (np.sum(arr) - (arr[arr >= 1.0] - 1.0).sum())/(num**3)
We use boolean indexing to select the elements that are greater than 1.0, take their offsets from 1.0, sum those up, and subtract the result from the total sum. Hopefully the number of such exceeding elements is small, so this won't incur any more noticeable memory requirement. I am assuming this hanging issue with large arrays is memory-related.
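Alternatively, if arr itself must be clipped, a minimal in-place sketch using np.minimum avoids allocating a second 32 GB array:
# Clip values above 1.0 in place; out=arr reuses the existing buffer.
np.minimum(arr, 1.0, out=arr)
vol_avg = arr.sum() / num**3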
The drops in CPU usage were unrelated to python or numpy, but were indeed a result of reading from a shared disk, and network I/O was the real culprit. For such large arrays, reading into memory can be a major bottleneck.
Did you click in or select text in the console window? That can "hang" the process: the console enters "QuickEdit Mode" and pauses output. Pressing any key resumes the process.
