Why Might Python Be Running Slow - python

I am trying to run a program I created that uses a neural network to predict stock prices, and I am running it for a number of different stocks. I am running the exact same code on both my desktop and my laptop.
At first I was running the code only on my desktop, and it was running very slowly; I assumed that was just because of the number of calculations the neural network has to make. Later I also started running the code on my laptop so I could run it for two stocks at the same time.
The code runs much faster on my laptop (I would estimate about 20x faster), even though my desktop has a much better processor, GPU, etc. I am also using the same size data set for each run.
I added the lines of code:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
so that Python should use my processor and not my graphics processor; I am not sure whether that makes a difference.
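For reference, a quick sanity check (a minimal sketch, assuming the model is built with TensorFlow, which the question does not state) to confirm the GPU really is hidden after setting the environment variables above:
import tensorflow as tf  # imported after the environment variables above are set
print(tf.config.list_physical_devices("GPU"))  # expected to be an empty list
print(tf.config.list_physical_devices("CPU"))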
Any idea why this might be?

Related

Limit number of cores used on server for tensorflow 2 and keras

I am trying to run a Python script that trains several neural networks using TensorFlow and Keras. The problem is that I cannot restrict the number of cores used on the server, even though this works on my local desktop.
The basic structure is that I have defined a function run_net that runs the neural net. This function is called with different parameters in parallel using joblib (see below). Additionally, I have tried running the function iteratively with different parameters, which didn't solve the problem.
from joblib import Parallel, delayed
Parallel(n_jobs=1, backend="multiprocessing")(
    delayed(run_net)(params) for params in param_list)  # param_list stands in for the different parameter sets
If I run this on my local Windows desktop, everything works fine. However, if I run the same script on our institute's server with 48 cores and check CPU usage with the htop command, all cores are used. I have already tried setting n_jobs in joblib's Parallel to 1, and it looks like CPU usage goes to 100% once the TensorFlow models are trained.
I have already searched for different solutions, and the main one I found is the one below. I define it before running the parallel jobs shown above. I also tried placing the code below before every fit or predict call of the model.
import tensorflow as tf
K = tf.compat.v1.keras.backend  # Keras backend via the TF1 compatibility API
NUM_PARALLEL_EXEC_UNITS = 5
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
    inter_op_parallelism_threads=2,
    device_count={"CPU": NUM_PARALLEL_EXEC_UNITS},
)
session = tf.compat.v1.Session(config=config)
K.set_session(session)
At this point I am quite lost and have no idea how to make TensorFlow and/or Keras use a limited number of cores, as the server I am using is shared across the institute.
The server is running Linux, but I don't know which exact distribution/version it is. I am very new to running code on a server.
These are the versions I am using:
python == 3.10.8
tensorflow == 2.10.0
keras == 2.10.0
If you need any other information, I am happy to provide that.
Edit 1
Neither the answer suggested in this thread works, nor does using only these commands:
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
After trying some things, I have found a solution to my problem. With the following code, I can restrict the number of CPUs used:
import os
os.environ["OMP_NUM_THREADS"] = "5"  # ideally set before TensorFlow is imported/initialised
import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
Note that I have no idea how many CPUs will be used in the end; I noticed that it isn't five cores being used but more. As I don't really care about the exact number of cores, just that I don't use all of them, I am fine with this solution for now. If anybody knows how to calculate the number of cores used from the information provided above, let me know.

multiprocessing pool map incredibly slow running a Monte Carlo ensemble of python simulations

I have a Python code that runs a 2D diffusion simulation for a set of parameters. I need to run the code many times, O(1000), like a Monte Carlo approach, using different parameter settings each time. To do this more quickly, I want to use all the cores on my machine (or cluster), so that each core runs one instance of the code.
In the past I have done this successfully for serial Fortran codes by writing a Python wrapper that used multiprocessing map (or starmap in the case of multiple arguments) to call the Fortran code in an ensemble of simulations. It works very nicely in that you loop over the 1000 simulations, and the Python wrapper farms out a new integration to a core as soon as it becomes free after completing a previous integration.
However, when I set this up to run multiple instances of my Python (instead of Fortran) code, I find it is incredibly slow, much slower than simply running the code 1000 times in serial on a single core. Using the system monitor I see that only one core is working at a time, and it never goes above 10-20% load, while of course I expected to see N cores running near 100% (as is the case when I farm out Fortran jobs).
I thought it might be a write issue, so I checked the code carefully to ensure that all plotting is switched off; in fact there is no file/disk access at all, and I now merely have one print statement at the end to print out a final diagnostic.
The structure of my code is like this:
I have the main Python code in toy_diffusion_2d.py, which takes a single argument, a dictionary containing the run parameters:
def main(arg):
    loop over timesteps:
        calculate the simulation over a large grid
    print the result statistic
And then I wrote a "wrapper" script, where I import the main simulation code and try to run it in parallel:
from multiprocessing import Pool, cpu_count
import toy_diffusion_2d
# dummy list of arguments
par1 = [1, 2, 3]
par2 = [4, 5, 6]
# make a list of dictionaries to loop over, 3x3=9 simulations in all
arglist = [{"par1": p1, "par2": p2} for p1 in par1 for p2 in par2]
ncore = min(len(arglist), int(cpu_count()))
with Pool(processes=ncore) as p:
    p.map(toy_diffusion_2d.main, arglist)
The above is a shorter paraphrased example, my actual codes are longer, so I have placed them here:
Main code: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_2d.py
You can run this with the default values like this:
python3 toy_diffusion_2d.py
Wrapper script: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_loop.py
You can run a 4 member ensemble like this:
python3 toy_diffusion_loop.py --diffK=[33000,37500] --tau_sub=[20,30]
(Note that the final statistic is slightly different each run, even with the same values, as the model is stochastic; it is a version of the stochastic Allen-Cahn equations, in case anyone is interested, but uses a stupid explicit solver on the diffusion term.)
As I said, this second, parallel code works, but it is really slow... as if something were constantly gating it.
I also tried using starmap, but that made no difference; it is almost as if the desktop only allows one Python interpreter to run at a time...? I have spent hours on this, and I'm almost at the point of rewriting the code in Fortran. I'm sure I'm just doing something really stupid that prevents parallel execution.
EDIT(1): this problem is occurring on
4.15.0-112-generic x86_64 GNU/Linux, with Python 3.6.9
In response to the comments: in fact, I also find it runs fine on my Mac laptop...
EDIT(2): so it seems my question was a bit of a duplicate of several other postings, apologies! As well as the useful links provided by Pavel, I also found this page very helpful: Importing scipy breaks multiprocessing support in Python. I'll edit the solution into the accepted answer below.
The code sample you provide works just fine on my macOS Catalina 10.15.6. I can guess you're using some Linux distribution, where, according to this answer, it can be the case that the numpy import meddles with core affinity because numpy is linked against the OpenBLAS library.
If your Unix supports the scheduler interface, something like this will work:
>>> import os
>>> from multiprocessing import cpu_count
>>> os.sched_setaffinity(0, set(range(cpu_count())))
Another question with a good explanation of this problem can be found here, and the suggested solution is this:
os.system('taskset -cp 0-%d %s' % (ncore, os.getpid()))
inserted right before the multiprocessing call.
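Putting the pieces together, a minimal sketch of how the affinity reset would slot into a wrapper like toy_diffusion_loop.py (reusing the dummy parameter lists from the question) could look like this:
import os
from multiprocessing import Pool, cpu_count
import toy_diffusion_2d  # importing numpy/scipy here may shrink the core affinity via OpenBLAS
ncore = cpu_count()
os.sched_setaffinity(0, set(range(ncore)))  # restore access to all cores before creating the Pool
par1 = [1, 2, 3]
par2 = [4, 5, 6]
arglist = [{"par1": p1, "par2": p2} for p1 in par1 for p2 in par2]
with Pool(processes=min(len(arglist), ncore)) as p:
    p.map(toy_diffusion_2d.main, arglist)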

Python killed on GCP

I have been working on a comparison between running deep learning code on a local machine and on Google Cloud Platform.
The code is for a recurrent neural network, and it ran perfectly well on my local machine.
But in the GCP Cloud Shell, when I try to run my Python file, it shows "Killed":
userID#projectID:~$ python rnn.py
Killed
Is it because I am out of memory? (I tried running it line by line, and the second time I assigned large data to a variable, it got stuck.)
My code is somewhat like this:
import numpy as np
imdb = np.load('imdb_word_emb.npz')
X_train = imdb['X_train']
X_test = imdb['X_test']
On the third line, the machine got stuck and showed "Killed".
I tried changing the order of the second and third lines, and it still got stuck at the third line.
My training data is a (25000, 80, 128) array, and so is my testing data. The data set works perfectly well on my local machine, so I am sure there is no problem with the data set itself.
Or is it because of some other reason?
It would be awesome if someone who knows how to solve this, or even just a few keywords, could tell me how to deal with it. Thank you :D
The error you are getting is because Cloud Shell is not intended for computationally or network-intensive processes; see Cloud Shell limitations.
I understand you want to compare your local machine with Google Cloud Platform. As stated in the public docs:
"When you start Cloud Shell, it provisions a g1-small Google Compute
Engine"
A g1-small machine type has 1.70 GB of RAM and a shared physical core. Keeping this in mind, and also that Cloud Shell is limited as stated before, your local machine is likely more powerful than Cloud Shell, so you would not see any improvement.
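A rough back-of-the-envelope check makes this concrete (assuming float32 entries, which is an assumption on my part):
import numpy as np
shape = (25000, 80, 128)
gb_per_array = np.prod(shape) * 4 / 1024**3            # 4 bytes per float32 element
print(f"{gb_per_array:.2f} GB per array")               # ~0.95 GB
print(f"{2 * gb_per_array:.2f} GB for X_train + X_test")  # ~1.91 GB, more than 1.70 GB of RAM
Under that assumption, just loading both arrays already exceeds the memory of a g1-small instance, which would explain the process being killed.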
I recommend creating a Compute Engine instance with a different machine type; you can use a custom machine type to set the number of cores and the amount of RAM you want. I assume you want to benefit from running your workload faster on Google Compute Engine, so you can choose a machine type with more resources than your local machine and compare how much it improves.

Running scipy.odeint through multiple cores

I'm working in Jupyter (Anaconda) with Python 2.7.
I'm trying to get an odeint function I wrote to run multiple times; however, it takes an incredibly long time.
While trying to figure out how to decrease the run time, I realized that when I ran it, it only took up about 12% of my CPU.
I operate off of an Intel Core i7-3740QM @ 2.70GHz:
https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-3740QM+%40+2.70GHz&id=1481
So, I'm assuming this is because of Python's GIL causing my script to run on only one core.
After some research on parallel processing in Python, I thought I found the answer by using this code:
import sys
import multiprocessing as mp
import numpy as np
Altitude = np.array([[550], [500], [450], [400], [350], [300]])
if __name__ == "__main__":
    processes = 4
    p = mp.Pool(processes)
    mp_solutions = p.map(calc, Altitude)  # calc is my odeint-based function defined elsewhere
This doesn't seem to work, though. Once I run it, Jupyter just stays constantly busy. My first thought was that the computation was just heavy, so it was taking a long time, but then I looked at my CPU usage and, although there were multiple Python processes, none of them were using any CPU.
I can't figure out the reason for this. I found this post as well and tried using their code, but it simply did the same thing:
Multiple scipy.integrate.ode instances
Any help would be much appreciated.
Thanks!
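For reference, a minimal, self-contained sketch of the pattern being attempted, with a toy calc function standing in for the real odeint model (which is not shown in the question):
import numpy as np
import multiprocessing as mp
from scipy.integrate import odeint
def calc(altitude):
    # toy ODE: exponential decay with a rate that depends on altitude
    k = 1.0 / float(altitude[0])
    t = np.linspace(0.0, 100.0, 1000)
    sol = odeint(lambda y, t: -k * y, 1.0, t)
    return sol[-1, 0]
if __name__ == "__main__":
    Altitude = np.array([[550], [500], [450], [400], [350], [300]])
    p = mp.Pool(processes=4)
    mp_solutions = p.map(calc, Altitude)
    p.close()
    p.join()
    print(mp_solutions)
One common pitfall worth checking: when the worker function is defined inside a Jupyter notebook (rather than an importable module), multiprocessing workers on some platforms cannot find it and sit idle, which matches the symptom described.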

Different time taken by Python script every time it is run?

I am working on an OpenCV-based Python project and am trying to develop a program that takes less time to execute. To test this, I wrote a small "hello world" Python program and measured the time it takes to run. I ran it many times, and every time it gives me a different run time.
Can you explain why a simple program takes a different amount of time to execute each run?
I need my program to be independent of other system processes.
Python gets different amounts of system resources depending upon what else the CPU is doing at the time. If you're playing Skyrim with the highest graphics levels at the time, then your script will run slower than if no other programs were open. But even if your task bar is empty, there may be invisible background processes confounding things.
If you're not already using it, consider using timeit. It performs multiple runs of your program in order to smooth out bad runs caused by a busy OS.
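A minimal sketch of that approach (the statement being timed here is just a stand-in for your own code):
import timeit
# repeat() runs the statement in several batches; the minimum of the batch times
# is the measurement least distorted by other processes competing for the CPU
times = timeit.repeat("'-'.join(str(n) for n in range(100))", number=10000, repeat=5)
print(min(times))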
If you absolutely insist on requiring your program to run in the same amount of time every time, you'll need to use an OS that doesn't support multitasking. For example, DOS.
