Python running slower than before some system change

I have two computers at home for my Python projects. One is a PC with an Intel i5-7400 CPU and the other is a laptop with an Intel i7-10750H CPU. Presumably the laptop should run the same Python code faster than the PC. This was the case before I made some changes to the laptop in an attempt to leverage its Nvidia GPU for training DNN models.
I followed the instructions from the TensorFlow GPU support webpage to upgrade the Nvidia GPU driver and install the CUDA toolkit and cuDNN in the recommended versions. After the installation, I created a new conda environment and installed the latest TensorFlow. With all this, I could detect my GPU with tf.config.list_physical_devices() and run some test code on the GPU. However, performance did not improve; even worse, the laptop became noticeably slower running the same code on its CPU. I tested the following simple code on both machines:
from datetime import datetime
import numpy as np

t0 = datetime.now()
for i in range(1000):
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    c = np.matmul(a, b)
t1 = datetime.now()
print(t1 - t0)
The PC ran it in 32s, but the laptop needed 45s. I tried a few things to resolve this, including uninstalling the CUDA toolkit/cuDNN and reinstalling Anaconda (I tried different Anaconda versions), but the issue remains. Does anyone have any insight into why this happens and what to try in order to address it? Many thanks.
Update: I notice that the same Python code uses about 30% CPU when running on the PC's Intel i5-7400 but uses over 90% CPU, and is slower, on the laptop's Intel i7-10750H. Is this normal?
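For reference, the GPU detection check mentioned above was essentially the following (a minimal sketch, assuming TensorFlow 2.x in the new conda environment):

import tensorflow as tf

# An empty GPU list here would mean the driver/CUDA/cuDNN stack is not being picked up.
print(tf.config.list_physical_devices())
print(tf.config.list_physical_devices('GPU'))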

This is probably not your main problem, but is your laptop running on battery? Laptops can reduce performance to extend battery life.

There are many reasons to consider. Firstly, the code you are running doesn't use the GPU at all; it all comes down to how the CPU handles throttling. As part of thermal management, a laptop CPU's power limit is constantly throttled. For a workload like this, the CPU runs the code faster than the GPU would anyway. So your laptop may be reaching its thermal limits and throttling down for most of the time it takes to run the program, while your PC's CPU can hold off that throttling and finish a bit faster.
Also, try benchmarking your code properly. A wonderful set of instructions is here:
https://stackoverflow.com/a/1593034/15358800
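For example, a more controlled way to time the matrix-multiply loop from the question (a minimal sketch using the standard timeit module, not the code from the linked answer):

import timeit
import numpy as np

def matmul_once():
    # Same workload as the question: two random 1000x1000 matrices multiplied.
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    return np.matmul(a, b)

# Repeating the measurement and taking the best run reduces the influence
# of thermal throttling and background load.
times = timeit.repeat(matmul_once, number=100, repeat=3)
print(min(times), 'seconds for 100 iterations (best of 3)')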

Related

Why do I get CUDA out of memory when running PyTorch model [with enough GPU memory]?

I am asking this question because I am successfully training a segmentation network on my GTX 2070 laptop with 8GB of VRAM, yet with exactly the same code and exactly the same software libraries installed on my desktop PC with a GTX 1080 Ti, it still throws out of memory.
Why does this happen, considering that:
The same Windows 10 + CUDA 10.1 + cuDNN 7.6.5.32 + Nvidia driver 418.96 (which comes along with CUDA 10.1) is installed on both the laptop and the PC.
Training with TensorFlow 2.3 runs smoothly on the GPU on my PC, yet it fails to allocate memory for training only with PyTorch.
PyTorch recognises the GPU (prints GTX 1080 TI) via the command : print(torch.cuda.get_device_name(0))
PyTorch allocates memory when running this command: torch.rand(20000, 20000).cuda() #allocated 1.5GB of VRAM.
What is the solution to this?
Most people (even in the thread below) jump to suggest that decreasing the batch_size will solve this problem. In fact, it does not in this case. For example, it would have been illogical for a network to train on 8GB of VRAM yet fail to train on 11GB of VRAM, considering that no other applications were consuming video memory on the system with 11GB of VRAM and that exactly the same configuration was installed and used.
The reason this happened in my case was that, when using the DataLoader object, I set a very high value (12) for the num_workers parameter. Decreasing this value to 4 solved the problem in my case.
In fact, although it sits at the bottom of the thread, the answer provided by Yurasyk at https://github.com/pytorch/pytorch/issues/16417#issuecomment-599137646 pointed me in the right direction.
Solution: Decrease the number of workers in the PyTorch DataLoader. Although I do not exactly understand why this solution works, I assume it is related to the threads spawned behind the scenes for data fetching; on some processors, such an error may appear.
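For illustration, the change amounts to something like this (a minimal sketch with a placeholder dataset, not the asker's actual segmentation setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Placeholder data standing in for the real segmentation dataset.
    dataset = TensorDataset(torch.randn(100, 3, 64, 64),
                            torch.randint(0, 2, (100,)))
    # num_workers=12 triggered the OOM described above; 4 resolved it.
    loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
    for images, labels in loader:
        pass  # the actual training step would go here

if __name__ == '__main__':
    # The guard is required on Windows when num_workers > 0.
    main()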

Tensorflow CPU RAM allocation

I experience an incredibly high amount of (CPU) RAM usage with TensorFlow, even though nearly every variable is allocated on the GPU device and all computation runs there. Even then, RAM usage exceeds VRAM usage by a factor of at least 2. I'm trying to understand why, so as to see whether it can be remedied or whether it's inevitable.
Question
So my main question is: does TensorFlow allocate and maintain a copy of all GPU variables in (CPU) RAM? If yes, what is allocated when (in which phase, see below)? And why is it useful to allocate this in CPU memory?
More info
I have 3 phases in which I see RAM increasing dramatically.
First, defining the graph (I append VGG-19 with quite large loss functions that iterate over many translated activation maps) adds 2GB to RAM usage.
Second, defining the optimizer (I use Adam) adds 250MB.
Third, initializing global variables adds 750MB.
And then it remains stable and runs very fast (all on GPU).
(The amounts mentioned here are for tiny input images of size 8x8x3 with a batch size of 1. If I go above 1x16x16x3, the process gets killed because it overflows my 8GB RAM + 6GB swap limit.)
Note that I recorded variable placement with tf.ConfigProto(log_device_placement=True), and GPU usage with tf.RunMetadata and visualization in TensorBoard.
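For reference, the placement logging and run metadata described above look roughly like this in TF 1.x (a minimal sketch with a toy graph, not the VGG-19 setup):

import tensorflow as tf

# Toy graph standing in for the real model.
a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

config = tf.ConfigProto(log_device_placement=True)
run_metadata = tf.RunMetadata()

with tf.Session(config=config) as sess:
    sess.run(c,
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    # run_metadata can then be written out and inspected in TensorBoard.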
Thank you for any help.
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution: Linux Ubuntu 17.10
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.7
Python version: 3.6.3
GCC/Compiler version (if compiling from source): 6.4.0
CUDA/cuDNN version: 9.0
GPU model and memory: NVidia GeForce Titan Xp

plt.plot() causes blue screen on Windows

This code causes a blue screen on Windows on my computer:
import matplotlib.pyplot as plt
plt.plot(range(10), range(10))  # This is the line that causes the crash
WhoCrashed tells me this:
This was probably caused by the following module: nt_wrong_symbols.sys
(nt_wrong_symbols) Bugcheck code: 0x124 (0x0, 0xFFFFB60A6AF4D028,
0xB2000000, 0x70005) Error: WHEA_UNCORRECTABLE_ERROR
Here is a link to the full Minidump
What I have done:
Fully tested the CPU with a CPU-Z stress test
Fully tested the RAM with memtest86+
Tested the GPU with Assassin's creed origin in full ultra
Tested the same code on Ubuntu (double boot) : works fine
This led me to believe it is a Windows-specific error.
Hardware configuration:
i9-7940X
GTX 1080 Ti
64GB RAM @ 2400MHz
Software:
Windows 10, fresh install (I've always had this issue)
Python 2.7 installed through Anaconda (I tested the code with Jupyter and IPython with the same results)
Windows and graphics drivers up to date
This is the only thing that causes a blue screen on my computer, and I'm out of ideas on how to solve it; any advice would be greatly appreciated.
NOTE: I asked this question here as it appears to be matplotlib-related; I hope this is the right place.
EDIT: Correction: it does not happen every time, but about 95% of the time.
I updated the BIOS and it seems to work now. The i9-7940X is very recent (Q3'17); my old BIOS version was supposed to support it but was released before the CPU (06/17), so that might have been the issue.
I'll post again if blue screens come back.
I had the same problem on an Alienware Area 51 machine and fixed it by disabling the processor's Hyper-Threading in the BIOS configuration. I also had a similar crashing issue on another machine running Ubuntu when trying to use multithreading.
In conclusion, Matplotlib and multithreading don't get along well.

How to change the processing unit during run time (from GPU to CPU)?

In the context of training deep neural networks, training is faster when the GPU is used as the processing unit.
This is done by configuring cuDNN optimizations and setting the processing unit in the environment variables with the following line (Python 2.7 and Keras on Windows):
os.environ["THEANO_FLAGS"] = "floatX=float32,device=gpu,optimizer_including=cudnn,gpuarray.preallocate=0.8,dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic,dnn.include_path=e:/toolkits.win/cuda-8.0.61/include,dnn.library_path=e:/toolkits.win/cuda-8.0.61/lib/x64"
The output is then:
Using gpu device 0: TITAN Xp (CNMeM is disabled, cuDNN 5110)
The problem is that GPU memory is limited compared to RAM (12GB and 128GB, respectively), and training is only one phase of the whole flow. Therefore I want to switch back to the CPU once training is completed.
I've tried the following line, but it has no effect:
os.environ["THEANO_FLAGS"] = "floatX=float32,device=cpu"
My questions are:
Is it possible to change from GPU to CPU and vice-versa during runtime? (technically)
If yes, how can I do it programmatically in Python? (2.7, Windows, and Keras with Theano backend).
Yes, this is possible, at least for the TensorFlow backend. You just have to also import tensorflow and put your code inside the following with blocks:
with tf.device('/cpu:0'):
    your code
with tf.device('/gpu:0'):
    your code
I am unsure whether this also works for the Theano backend. However, switching from one backend to the other is just a matter of setting a flag beforehand, so it should not cause too much trouble.
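To make this concrete, a minimal TF 1.x-style sketch of pinning operations to specific devices (the matmuls here just stand in for "your code"):

import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.random_normal([500, 500])
    b = tf.random_normal([500, 500])
    cpu_result = tf.matmul(a, b)

with tf.device('/gpu:0'):
    gpu_result = tf.matmul(a, b)

# allow_soft_placement falls back to the CPU if a GPU op cannot be placed.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run([cpu_result, gpu_result]))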

tensorflow: works well in one-GPU computer but Out of Memory in two-GPU computer

Before I installed the second graphics card, I had successfully trained a VGG-11 model using TensorFlow on a single GPU with 6GB of memory. But I got an OOM error after I installed the second graphics card and ran the same code (allow_growth=True and no tf.device() used).
My understanding is that my second card (GPU:0) has 8GB of memory, and that TF uses "device:gpu:0" by default for computation when I do not use tf.device() to specify a device. The memory should be enough, because 8GB > 6GB.
Then I used CUDA_VISIBLE_DEVICES=0 to hide one of the cards and ran the same code. TF worked successfully.
What is the problem?
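For comparison, the workaround described above can also be expressed in code (a minimal TF 1.x sketch; the device index is the one from the question):

# Hide the second GPU before TensorFlow initializes; this must run
# before importing tensorflow.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, visible_device_list restricts devices after import:
# config.gpu_options.visible_device_list = '0'

with tf.Session(config=config) as sess:
    pass  # build and run the VGG-11 graph here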
