Keras model uses only one GPU all the time

Keras model uses only one GPU all the time - python

I am trying to train a CNN model on AWS EC2 p3.16xlarge instance which has 8 GPUs. When I use the batch size of 500, even though the system has 8 GPUs, only one GPU is utilized all the time. When I increased the batch size to 1000, it uses only GPU and really slows compared to 500 case. If I increase the batch size to 2000, then a memory overflow occurs. How can I fix this issue?
I am using tensorflow backend. GPU utilization is as below,
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:17.0 Off | 0 |
| N/A 47C P0 69W / 300W | 15646MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:18.0 Off | 0 |
| N/A 44C P0 59W / 300W | 502MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:19.0 Off | 0 |
| N/A 45C P0 61W / 300W | 502MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1A.0 Off | 0 |
| N/A 47C P0 64W / 300W | 502MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 48C P0 62W / 300W | 502MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 46C P0 61W / 300W | 502MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 46C P0 65W / 300W | 502MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 46C P0 63W / 300W | 502MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15745 C python3 15635MiB |
| 1 15745 C python3 491MiB |
| 2 15745 C python3 491MiB |
| 3 15745 C python3 491MiB |
| 4 15745 C python3 491MiB |
| 5 15745 C python3 491MiB |
| 6 15745 C python3 491MiB |
| 7 15745 C python3 491MiB |
+-----------------------------------------------------------------------------+

You are probably looking for multiple_gpu_model. You can see that in the keras documentation.
You can just take your model and do parallel_model = multi_gpu_model(model, gpus=n_gpus).
Next time don't forget to include a minimal working exemple.

Related

Running gpt-neo on GPU

The following code runs the EleutherAI/gpt-neo-1.3B model. The model runs on CPUs, but I don't understand why it does not use my GPU. Did I missed something?
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
prompt = ("What is the capital of France?")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=50 )
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print (gen_text)
By the way, here is the output of the nvidia-smi command
Thu Feb 16 14:58:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:73:00.0 On | N/A |
| 30% 31C P8 34W / 350W | 814MiB / 24576MiB | 22% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:A6:00.0 Off | Off |
| 30% 31C P8 16W / 230W | 8MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3484 G /usr/lib/xorg/Xorg 378MiB |
| 0 N/A N/A 3660 G /usr/bin/gnome-shell 62MiB |
| 0 N/A N/A 4364 G ...662097787256072160,131072 225MiB |
| 0 N/A N/A 37532 G ...6/usr/lib/firefox/firefox 142MiB |
| 1 N/A N/A 3484 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+

How to make sure PyTorch has deallocated GPU memory?

Say we have a function like this:
def trn_l(totall_lc, totall_lw, totall_li, totall_lr):
self.model_large.cuda()
self.model_large.train()
self.optimizer_large.zero_grad()
for fb in range(self.fake_batch):
val_x, val_y = next(self.valid_loader)
val_x, val_y = val_x.cuda(), val_y.cuda()
logits_main, emsemble_logits_main = self.model_large(val_x)
cel = self.criterion(logits_main, val_y)
loss_weight = cel / (self.fake_batch)
loss_weight.backward(retain_graph=False)
cel = cel.cpu().detach()
emsemble_logits_main = emsemble_logits_main.cpu().detach()
totall_lw += float(loss_weight.item())
val_x = val_x.cpu().detach()
val_y = val_y.cpu().detach()
loss_weight = loss_weight.cpu().detach()
self._clip_grad_norm(self.model_large)
self.optimizer_large.step()
self.model_large.train(mode=False)
self.model_large = self.model_large.cpu()
return totall_lc, totall_lw, totall_li, totall_lr
On the first call, it allocates 8GB of GPU memory. On the next call, no new memory gets allocated, yet 8GBs are still occupied. I want to have after it is called and the produced first result to have 0 allocated GPU memory or as low as possible.
What I have tried: do retain_graph=False and .cpu().detach() everywhere - no positive effects.
Memory snapshot before
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| Active memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 51200 KB | 51200 KB | 51200 KB | 0 B |
| from large pool | 20480 KB | 20480 KB | 20480 KB | 0 B |
| from small pool | 30720 KB | 30720 KB | 30720 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 18100 KB | 20926 KB | 56892 KB | 38792 KB |
| from large pool | 17408 KB | 18944 KB | 18944 KB | 1536 KB |
| from small pool | 692 KB | 2047 KB | 37948 KB | 37256 KB |
|---------------------------------------------------------------------------|
| Allocations | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| Active allocs | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 16 | 16 | 16 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 15 | 15 | 15 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 3 | 30 | 262 | 259 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 2 | 29 | 261 | 259 |
|===========================================================================|
And after calliing function and
torch.cuda.empty_cache()
torch.cuda.synchronize()
We get:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| Active memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 8818 MB | 9906 MB | 19618 MB | 10800 MB |
| from large pool | 8784 MB | 9874 MB | 19584 MB | 10800 MB |
| from small pool | 34 MB | 34 MB | 34 MB | 0 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 5427 KB | 3850 MB | 207855 MB | 207850 MB |
| from large pool | 0 KB | 3850 MB | 207494 MB | 207494 MB |
| from small pool | 5427 KB | 5 MB | 360 MB | 355 MB |
|---------------------------------------------------------------------------|
| Allocations | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| Active allocs | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 226 | 226 | 410 | 184 |
| from large pool | 209 | 209 | 393 | 184 |
| from small pool | 17 | 17 | 17 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 46 | 358 | 12284 | 12238 |
| from large pool | 0 | 212 | 7845 | 7845 |
| from small pool | 46 | 279 | 4439 | 4393 |
|===========================================================================|

I don't think the other answer is correct. Allocation and deallocation definitely happens during runtime, the thing to note is that the CPU code runs asynchronously from the GPU code, so you need to wait for any deallocation to happen if you want to reserve more memory after it. Take a look at this:
import torch
a = torch.zeros(100,100,100).cuda()
print(torch.cuda.memory_allocated())
del a
torch.cuda.synchronize()
print(torch.cuda.memory_allocated())
Outputs
4000256
0
So you should del the tensors you don't need and call torch.cuda.synchronize() to make sure that the deallocation goes through before your CPU code continues to run.
In your specific case, after your function trn_l returns, any variables that were local to that function, and do not have references elsewhere, will be deallocated along with the corresponding GPU tensors. All you need to do is wait for this to happen by calling torch.cuda.synchronize() after the function call.

So, Pytorch does not allocate and deallocate memory from GPU in training time.
From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:
PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don’t reflect the true memory usage. See Memory management for more details about GPU memory management.
If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].
You can call torch.cuda.empty_cache() to free all unused memory (however, that is not really good practice as memory re-allocation is time consuming). Docs of empty_cace() : https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache

Why Python stops using GPU and switches to CPU in runtime

I have been using this:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
in order to run on GPU. It has been working properly since today.
The problem now is that, in the middle of the runtime, my program stops using GPU and switches to CPU, so it becomes too slow.
Any idea on why is that happening?
Output at the beggining of the execution for nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 On | 00000000:01:00.0 On | N/A |
| 0% 42C P8 14W / 200W | 363MiB / 4039MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c On | 00000000:05:00.0 Off | 0 |
| 35% 74C P0 136W / 235W | 11011MiB / 11441MiB | 94% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1037 G /usr/lib/xorg/Xorg 20MiB |
| 0 1150 G /usr/bin/gnome-shell 12MiB |
| 0 7430 G /usr/lib/xorg/Xorg 166MiB |
| 0 7560 G /usr/bin/gnome-shell 158MiB |
| 1 13772 C python3 10998MiB |
+-----------------------------------------------------------------------------+
And then, when it begins to run too slowly:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 On | 00000000:01:00.0 On | N/A |
| 0% 42C P8 14W / 200W | 363MiB / 4039MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c On | 00000000:05:00.0 Off | 0 |
| 35% 69C P0 63W / 235W | 11011MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1037 G /usr/lib/xorg/Xorg 20MiB |
| 0 1150 G /usr/bin/gnome-shell 12MiB |
| 0 7430 G /usr/lib/xorg/Xorg 166MiB |
| 0 7560 G /usr/bin/gnome-shell 158MiB |
| 1 13772 C python3 10998MiB |
+-----------------------------------------------------------------------------+

Python sqlite3 db: Structuring database properly. Skip first column when writing to db. Avoid columns with 1-cell data only

I have a python script that writes data to sqlite3 Database. It has 26 columns, but 15 of those columns only need 1 cell data.
Instead of having 15 columns each having 1 cell of data I would like to gather them into a single (or two) columns only. I don't know how to do that with Sqlite3. I was thinking maybe loop-print all the settings into the first one or two columns, and then from there when writing to the database tell sqlite3 to ignore/skip the first few columns. Is that possible?
Later on I load database into db browser for sqlite and export to csv and then load them into excel/google doc. I was hoping to avoid having to do a lot of copy/paste after importing to excel/google doc by structuring it properly from the beginning.
Current database
| 1| Time | Type | Price | Amount | Gain | Market | Option 1 | Acc | Setting a | Setting b |
| 2|-----------|-------|----------|---------|----------|-----------|----------|---------|-----------|------------|
| 3| 22:12:15 | Buy | 660.33 | 0.0130 | 8.58429 | Market 1 | 0.00085 | DD_23 | 0.00233 | 5 |
| 4| 22:12:15 | Sell | 659.58 | 0.0070 | 4.61706 | | | | | |
| 5| 19:36:08 | Buy | 670.00 | 0.0082 | 5.49400 | | | | | |
| 6| 19:36:08 | Sell | 670.33 | 0.0058 | 3.88791 | | | | | |
| 7| 19:36:08 | Buy | 671.23 | 0.0060 | 4.02738 | | | | | |
| 8| 13:01:41 | Sell | 667.15 | 0.0015 | 1.00073 | | | | | |
| 9| 13:01:41 | Buy | 667.10 | 0.0185 | 12.3414 | | | | | |
|10| 07:14:36 | Sell | 657.55 | 0.0107 | 7.03579 | | | | | |
|11| 07:14:36 | Buy | 657.08 | 0.0005 | 0.32854 | | | | | |
|12| 07:14:36 | Sell | 656.59 | 0.0088 | 5.77799 | | | | | |
Desired solution 1: Single column solution
| 1| Script info | Time | Type | Price | Amount | Gain |
| 2|-------------|-----------|-------|----------|---------|----------|
| 3| Market | 22:12:15 | Buy | 660.33 | 0.0130 | 8.58429 |
| 4| Market 1 | 22:12:15 | Sell | 659.58 | 0.0070 | 4.61706 |
| 5| Option 1 | 19:36:08 | Buy | 670.00 | 0.0082 | 5.49400 |
| 6| 0.00085 | 19:36:08 | Sell | 670.33 | 0.0058 | 3.88791 |
| 7| Acc | 19:36:08 | Buy | 671.23 | 0.0060 | 4.02738 |
| 8| DD_23 | 13:01:41 | Sell | 667.15 | 0.0015 | 1.00073 |
| 9| Setting a | 13:01:41 | Buy | 667.10 | 0.0185 | 12.3414 |
|10| 0.00233 | 07:14:36 | Sell | 657.55 | 0.0107 | 7.03579 |
|11| Setting b | 07:14:36 | Buy | 657.08 | 0.0005 | 0.32854 |
|12| 5 | 07:14:36 | Sell | 656.59 | 0.0088 | 5.77799 |
Desired solution 2: Double column solution
| 1| Script info | Script settings | Time | Type | Price | Amount | Gain |
| 2|-------------|-----------------|-----------|-------|----------|---------|----------|
| 3| Market | Market 1 | 22:12:15 | Buy | 660.33 | 0.0130 | 8.58429 |
| 4| Option 1 | 0.00085 | 22:12:15 | Sell | 659.58 | 0.0070 | 4.61706 |
| 5| Acc | DD_23 | 19:36:08 | Buy | 670.00 | 0.0082 | 5.49400 |
| 6| Setting a | 0.00233 | 19:36:08 | Sell | 670.33 | 0.0058 | 3.88791 |
| 7| Setting b | 5 | 19:36:08 | Buy | 671.23 | 0.0060 | 4.02738 |
| 8| | | 13:01:41 | Sell | 667.15 | 0.0015 | 1.00073 |
| 9| | | 13:01:41 | Buy | 667.10 | 0.0185 | 12.3414 |
|10| | | 07:14:36 | Sell | 657.55 | 0.0107 | 7.03579 |
|11| | | 07:14:36 | Buy | 657.08 | 0.0005 | 0.32854 |
|12| | | 07:14:36 | Sell | 656.59 | 0.0088 | 5.77799 |

EDIT [to explore the idea mentioned in the comments]:
Instead of having 15 columns/values to describe the selection group perhaps you could:
Create a string in python do describe the selection group. Something like "Setting 1: value 1, Setting 2: value 2,.......Setting 15: value15"
Replace the 15 columns with one TEXT column
Insert the text for every row in the selection group.
I'm not sure what problem you are trying to solve. This
... I load database into db browser for sqlite and export to csv and then
load them into excel/google doc.
indicates to me that it's a manual process.
This
when writing to the database tell sqlite3 to ignore/skip the first few
columns.
indicates to me that you only want these columns in the export.
Time | Type | Price | Amount | Gain
Couldn't you write a query SELECT time,type,price,amount,gain from thetable then choose "Export to CSV" from the little icon to right of the results panel?
Or maybe easier, if you are going to discard the "1-cell data", don't load them in the first place. Or load them into a different table (on the python side).
Or perhaps I don't understand the problem :)

Tensorflow: CUDA_ERROR_OUT_OF_MEMORY tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized

When I training a VGG16 NN with GPU using TensorFlow, it always show me CUDA_ERROR_OUT_OF_MEMORY and always stops with the error tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
I searched the internet with those message and got some tips:
set config.gpu_options.allow_growth to True.
set config.gpu_options.per_process_gpu_memory_fraction to a smaller fraction like 0.6.
set smaller batch size.
But these tips don't work, the process runs just like nothing changed.
Here is my hardware:
GPU: NVIDIA GTX 1060
Memory: 3G + 4G(shared memory)
I monitored the usage of GPU using nvidia-smi, and below is the detail.
Before Running:
Thu Apr 19 14:21:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 50C P8 7W / N/A | 587MiB / 3072MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 8244 C+G ...6)\Youdao\YoudaoNote\YNoteCefRender.exe N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
Process begin:
Thu Apr 19 14:24:23 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 48C P2 28W / N/A | 1133MiB / 3072MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14404 C ...ools\Anaconda3\envs\py36_tfg\python.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
After 10 steps:
Thu Apr 19 14:30:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 64C P2 31W / N/A | 2595MiB / 3072MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14404 C ...ools\Anaconda3\envs\py36_tfg\python.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
After 60 steps:
some message showed, but can still run
2018-04-19 14:33:56.384528: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 2147483648 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2018-04-19 14:33:56.423080: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 1932735232 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2018-04-19 14:33:56.474281: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 1739461632 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
Thu Apr 19 14:36:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 63C P2 33W / N/A | 2602MiB / 3072MiB | 43% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14404 C ...ools\Anaconda3\envs\py36_tfg\python.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
After 170 steps:
About eight hundreds lines message showed, then the process stopped with errors
About eight hundreds lines:
2018-04-19 14:49:35.688274: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
Stopped with some errors:
Traceback (most recent call last):
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1039, in _do_call
return fn(*args)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1021, in _run_fn
status, run_metadata)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\contextlib.py", line 88, in __exit__
next(self.gen)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: input/input/div/_79 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_111_input/input/div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "vgg16_train_and_test.py", line 212, in <module>
train()
File "vgg16_train_and_test.py", line 124, in train
coord.join(threads)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\training\coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\training\queue_runner_impl.py", line 234, in _run
sess.run(enqueue_op)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 778, in run
run_metadata_ptr)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: input/input/div/_79 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_111_input/input/div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Keras model uses only one GPU all the time - python

You are probably looking for multiple_gpu_model. You can see that in the keras documentation. You can just take your model and do parallel_model = multi_gpu_model(model, gpus=n_gpus). Next time don't forget to include a minimal working exemple.

Related

Running gpt-neo on GPU

How to make sure PyTorch has deallocated GPU memory?

Why Python stops using GPU and switches to CPU in runtime

Python sqlite3 db: Structuring database properly. Skip first column when writing to db. Avoid columns with 1-cell data only

Tensorflow: CUDA_ERROR_OUT_OF_MEMORY tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized

Categories

Resources