I am trying to train a neural network on a subset of the GPUs that I have access to.
The problem is that I want to use only two GPUs, so I added the following line to my ~/.bashrc:
export CUDA_VISIBLE_DEVICES=6,7
However, when I watch the GPUs, I find that my program is mainly using other GPUs (4 and 5) at 100% memory utilization, and is using the visible GPUs only at very low memory utilization (9% at most).
I have already closed and reopened all the terminals so that ~/.bashrc gets re-sourced, but the problem still persists!
Any help is appreciated!
P.S.: I already ran the same program successfully on just two GPUs on a different server, so memory is not the constraint here.
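As a point of reference, here is a minimal sketch of how the variable can be scoped to the training process itself rather than to the shell profile, assuming a TensorFlow script (the script name train.py and the device list are illustrative); setting it per process rules out another shell or environment overriding the value from ~/.bashrc:

    # Sketch: restrict this process to GPUs 6 and 7.  The variable must be set
    # before TensorFlow (or any library that initializes CUDA) is imported.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

    import tensorflow as tf  # TensorFlow now sees only two devices, renumbered 0 and 1

    # Equivalent one-off shell form, without touching ~/.bashrc:
    #   CUDA_VISIBLE_DEVICES=6,7 python train.py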
Recently I discovered something rather strange in a project I have been working on for quite a while. The model is rather conventional: a convnet with a few fully connected layers. For data loading I use the tf.data API, but the same thing happened with the queue-based code I had before porting to tf.data. A few hours after training begins, CPU usage rises to very high levels, 1500-2000% as reported by the htop utility, whereas at the beginning of training everything is fine and the main process shows only about 200% CPU usage. Attached is a screenshot of the htop output; another worrying thing is that all the child processes also show a pretty high CPU load.
I am using tensorflow-gpu version 1.11, running on an NVIDIA Tesla V100. I am pretty sure that the model runs on the GPU and not on the CPU: nvidia-smi shows the GPU at about 70% utilization.
Obviously, I cannot ask for an exact cause of this, and it would be difficult to strip the problem down to a reproducible test case. However, maybe you could point me to some debugging techniques that are applicable in such a case.
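One debugging angle worth sketching, assuming TensorFlow 1.x as stated above: cap the CPU thread pools explicitly, so that any remaining runaway CPU load can be attributed to the tf.data input pipeline (e.g. its num_parallel_calls or prefetch workers) rather than to op-level parallelism. The thread counts below are arbitrary placeholders:

    # Sketch: limit TensorFlow 1.x CPU thread pools for a debugging run.
    import tensorflow as tf

    config = tf.ConfigProto(
        intra_op_parallelism_threads=4,  # threads used inside a single op
        inter_op_parallelism_threads=2,  # threads used to run independent ops
    )

    with tf.Session(config=config) as sess:
        # ... build and run the training loop as usual ...
        pass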
I'm writing a Jupyter notebook for deep learning training, and I would like to display the GPU memory usage while the network is training (the output of watch nvidia-smi, for example).
I tried putting the training run in one cell and nvidia-smi in another, but obviously the latter only runs once the first is done, which is pretty useless. Is it possible to run these cells in parallel?
Thanks in advance for help.
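One workaround that is sometimes used, sketched below, is to poll nvidia-smi from a background thread inside the notebook instead of from a separate cell; the 10-second interval is an arbitrary choice:

    # Sketch: print GPU memory usage periodically from a background thread
    # while a long-running training cell executes.
    import subprocess
    import threading
    import time

    def log_gpu_memory(interval_s, stop_event):
        while not stop_event.is_set():
            out = subprocess.check_output(
                ["nvidia-smi",
                 "--query-gpu=index,memory.used,memory.total",
                 "--format=csv,noheader"]
            )
            print(out.decode().strip())
            time.sleep(interval_s)

    stop = threading.Event()
    threading.Thread(target=log_gpu_memory, args=(10, stop), daemon=True).start()

    # ... run the training cell ...
    # stop.set()  # call once training has finished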
I made a very interesting observation while trying to tune my recommendation engine on a Tesla K80 housed on Google Cloud Platform. Unfortunately I have been unsuccessful at finding any literature which might point me in the right direction. Here is my predicament ...
I run the same code for fitting a fully connected model from a Python script and from a Jupyter notebook. What is surprising is that, with the same hyper-parameters (batch size etc.), the code runs faster in the Jupyter notebook kernel and uses more GPU memory than when I fit the model from the Python script invoked from the shell.
Since I want my code to run in the least possible time, and Jupyter often ends up closing the connection when unattended due to a web-socket time-out, is there any way I can increase the amount of GPU memory a Python process can use? Open to any alternative way of making this work. Thanks in advance.
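For what it's worth, assuming the model is built on TensorFlow 1.x (the question does not say which framework is used), the amount of GPU memory a single Python process claims is governed by GPUOptions; a rough sketch, where the 0.9 fraction is just an illustrative value:

    # Sketch: control how much GPU memory this process reserves (TF 1.x).
    import tensorflow as tf

    gpu_options = tf.GPUOptions(
        per_process_gpu_memory_fraction=0.9,  # reserve up to 90% of the card
        allow_growth=False,                   # claim it up front instead of growing lazily
    )
    config = tf.ConfigProto(gpu_options=gpu_options)
    sess = tf.Session(config=config)

    # If Keras is used, install this session as the backend session:
    # tf.keras.backend.set_session(sess)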
Running big data in Jupyter Notebooks with Sci-Kit Learn
I usually try to segment my issues into specific component questions, so bear with me here! We are attempting to model some fairly sparse health conditions against pretty ordinary demographics. We have access to a lot of data, hundreds of millions of records, and we would like to get up to 20 million records into a Random Forest classifier. I have been told that scikit-learn should be able to handle this. We are running into issues, the sort that don't generate tracebacks, just dead processes.
I recognize that this is not much to go on, but what I'm looking for is any advice on how to scale, and/or debug, this process of scaling up.
So we want to run truly pretty big data through Jupyter notebooks and scikit-learn, primarily Random Forest.
We are on an 8 GB Core i7 notebook running Linux 14.04, with Jupyter on Python 3.5 and the notebook running Python 2.7 code via a conda env. (FWIW, we store the data in Google BigQuery and use the pandas BigQuery connector, which lets us run the same code both on local machines and in the cloud.)
I can get a 100,000-record dataset consumed and a model built, with a bunch of diagnostic reports and charts that we have built into the notebook, no problem. The following scenarios all use the same code and the same versions of the respective conda environments, but several different pieces of hardware, in an attempt to scale up our process.
When I expand the dataset to 1,000,000 records, I get about 95% of the way into the data load from BigQuery and then see a lot of pegged CPU activity and what seems to be memory-swap activity (viewing it in the Linux process monitor). The entire machine seems to freeze, load progress from BigQuery stops for an extended period of time, and the browser indicates that it has lost the connection to the kernel, sometimes with a failed-status message and sometimes with the broken-link icon in the Jupyter notebook. Amazingly, it eventually completes normally, and the output does get completely rendered in the notebook.
I tried putting some memory-profiling code in using psutil, and interestingly I saw no change as the DataFrame went through several processing steps, including a fairly substantial get_dummies step, which expands the categorical variables into separate columns (and would presumably expand memory usage). When I did this, after an extended time the memory-usage stats printed out, but the model diagnostics never rendered in the browser (as they had previously); the terminal window indicated a web-socket timeout and the browser froze. I finally terminated the notebook terminal.
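For reference, a minimal sketch of the kind of psutil check described above; the labels and checkpoint locations are illustrative:

    # Sketch: print the resident set size of the current process at a few
    # checkpoints, e.g. before and after the get_dummies expansion.
    import os
    import psutil

    proc = psutil.Process(os.getpid())

    def log_rss(label):
        rss_gb = proc.memory_info().rss / 1024.0 ** 3
        print("{}: {:.2f} GB resident".format(label, rss_gb))

    log_rss("after BigQuery load")
    # df = pd.get_dummies(df)   # the categorical expansion mentioned above
    log_rss("after get_dummies")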
Interestingly, my colleague can process 3,000,000 records successfully on his 16 GB MacBook.
When we try to process 4,000,000 records on a 30 GB VM on Google Compute Engine, we get an error indicating "Mem error". So even a huge machine does not let us complete.
I realize we could use serially built models and/or Spark, but is there a way to stick with scikit-learn and Random Forest and get past the memory issue?
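Not a full answer, but one memory-side lever worth sketching, assuming a pandas DataFrame named df with a label column named target (both placeholders): shrink the dtypes before fitting, since 64-bit columns take twice the memory of 32-bit ones and tree models rarely need the extra precision.

    # Sketch: reduce the in-memory footprint of the feature matrix before
    # handing it to the forest.  `df` and "target" are placeholder names.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    X = pd.get_dummies(df.drop("target", axis=1))

    # Downcast 64-bit numeric columns to 32-bit.
    for col in X.select_dtypes(include=["float64"]).columns:
        X[col] = X[col].astype(np.float32)
    for col in X.select_dtypes(include=["int64"]).columns:
        X[col] = X[col].astype(np.int32)

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(X.values, df["target"].values)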
I have a Python application that uses TensorFlow, and I'm running it inside a Docker container. When running locally, memory usage stays well under 4 GB of RAM, but some large files are being written and processed. When TensorFlow reaches the point of creating its first checkpoint file, I get the following exception:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
My model is complex, so this file may be above 1 GB, but my data is images and I've already downloaded about 30 GB of data just to begin running the model, so I don't know whether it's just chance that it keeps happening here or whether this file really is too large. I'm only loading a small batch of images into memory per training epoch, so I'm trying to keep RAM usage low. My VirtualBox config looks like so:
The error appears to come from C++, so I assume it originates inside TensorFlow. Has anyone seen anything like this, or does anyone know what I can change? I feel that there is enough RAM allocated, but maybe my disk access isn't configured correctly?
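One quick check that might be informative, assuming psutil is installed inside the container: print how much RAM the process actually sees, since the VirtualBox/Docker memory limit, not the host's total memory, is what the failed allocation runs up against.

    # Sketch: report the RAM visible from inside the container / VM.
    import psutil

    vm = psutil.virtual_memory()
    print("total RAM visible: %.1f GB" % (vm.total / 1024.0 ** 3))
    print("available RAM:     %.1f GB" % (vm.available / 1024.0 ** 3))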
Most likely not enough RAM. 4 GB is very small for training with TensorFlow (particularly in Python with a default install). Your model size may well be below 1 GB, but during training, and especially when writing out a checkpoint, TensorFlow will temporarily allocate more memory for buffering, and that's probably where you're getting an OOM error. If you normally run fine with 4 GB until checkpointing, 8 GB should be fine.