Error Running Stable Diffusion from the command line in Windows

Error Running Stable Diffusion from the command line in Windows - python

I installed Stable Diffusion v1.4 by following the instructions described in https://www.howtogeek.com/830179/how-to-run-stable-diffusion-on-your-pc-to-generate-ai-images/#autotoc_anchor_2
My machine heavily exceeds the min reqs to run Stable Diffusion:
Windows 11 Pro
11th Gen Intel i7 # 2.30GHz
Latest NVIDIA GeForce GPU
16GB Memory
1TB SSD
Yet, I get an error when trying to run the test prompt
python scripts/txt2img.py --prompt "a close-up portrait of a cat by pablo picasso, vivid, abstract art, colorful, vibrant" --plms --n_iter 5 --n_samples 1
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 8.00 GiB total capacity; 6.13 GiB already allocated; 0 bytes free; 6.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Reading a post by Marco Ramos it seems like it relates to the number of workers in PyTorch
Strange Cuda out of Memory behavior in Pytorch
How do I change the number of workers while running Stable Diffusion? And why is it throwing this error if my machine still has lots of memory? Has anyone encountered this same issue while running Stable Diffusion?

I had the same issue, it's because you're using a non-optimized version of Stable-Diffusion. You have to download basujindal's branch of it, which allows it use much less ram by sacrificing the precision, this is the branch - https://github.com/basujindal/stable-diffusion
Everything else in that guide stays the same just clone from this version. It allow you to even push past 512x512 default resolution, you can use 756x512 to get rectangular images for example (but the results may vary since it was trained on a 512 square set).
the new prompt becomes python optimizedSD/optimized_txt2img.py --prompt "blue orange" --H 756 --W 512
Also another note: as of a few days ago an even faster and more optimized version was released by neonsecret (https://github.com/basujindal/stable-diffusion), however I'm having issues installing it, so can't really recommend it but you can try it as well and see if it works for you.

In addition to the optimized version by basujindal, the additional tags following the prompt allows the model to run properly on a machine with NVIDIA or AMD 8+GB GPU.
So the new prompt would look like this
>> python optimizedSD/optimized_txt2img.py --prompt "a close-up portrait of a cat by pablo picasso, vivid, abstract art, colorful, vibrant" --H 512 --W 512 --seed 27 --n_iter 2 --n_samples 10 --ddim_steps 50

Related

pytorch profiler nvidia cuda permission error (microsoft wsl): CUPTI_ERROR_NOT_INITIALIZED, CUPTI_ERROR_INSUFFICIENT_PRIVILEGES

I am trying to run a profiling script for pytorch on MS WSL 2.0 Ubuntu 20.04.
WSL is on the newest version (wsl --update). I am running the stable conda pytorch cuda 11.3 version from the pytorch website with pytorch 1.11. My GPU is a GTX 1650 Ti.
I can run my script fine and it finishes without error, but when I try to profile it using pytorch's bottleneck profiling tool python -m torch.utils.bottleneck run.py
it first throws this warning when starting the autograd profiler:
Running your script with the autograd profiler...
WARNING:2022-06-01 13:37:49 513:513 init.cpp:129] function status failed with error CUPTI_ERROR_NOT_INITIALIZED (15)
WARNING:2022-06-01 13:37:49 513:513 init.cpp:130] CUPTI initialization failed - CUDA profiler activities will be missing
Then, if I run for a small number of epochs, the script finishes fine, and it shows also the cuda profiling stats (even though it says profiler activities will be missing). But when I do a longer run, I get the message Killed after the script runs "through" the autograd profiler. The command dmesg gives this output at the end:
[ 1224.321233] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=python,pid=295,uid=1000
[ 1224.321421] Out of memory: Killed process 295 (python) total-vm:55369308kB, anon-rss:15107852kB, file-rss:0kB, shmem-rss:353072kB, UID:1000 pgtables:39908kB oom_score_adj:0
[ 1224.746786] oom_reaper: reaped process 295 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:353936kB
So, when using the profiler, there seems to be a memory error (it might not necessarily be related to the above CUPTI warning). Is this related to the profiler somehow saving too much data in-mem? Then, it might be a common problem that occurs for too long runs, right?
The cuda warning CUPTI_ERROR_NOT_INITIALIZED indicates that my CUPTI (short for "CUDA Profiling Tools Interface") is not running. I read in another post that this might be related to me running a newer version of CUPTI that is not backcompatible with the older version of CUDA 11.3. As cupti is not included in the cudatoolkit on conda by default, the system is probably trying to use / locate the cupti, but does not find it / cannot use it.
I'd appreciate any help for this issue. It would be quite nice to see a longer profiling run, in order to determine the bottlenecks / expensive operations in my pytorch code.
Thanks!

Python running slower than before some system change

I have two computers at home for my python projects. One is a PC with an Intel i5-7400 CPU and the other is a laptop with an Intel i7-10750H CPU. Presumably the laptop is faster running the same python code than the PC. This was the case before I made some changes to the laptop in an attempt to leverage its Nvida GPU for training DNN model.
I followed the instruction from Tensorflow GPU support webpage to upgrade Nvida GPU driver, install Cuda toolkit and cuDNN with the recommanded version. After the installation, I created a new conda environment and installed latest tensorflow. With all this I could detect my GPU with tf.config.list_physical_devices() and run some test code on the GPU. However, the performance was not lifted and even worse, the laptop became noticeably slower running the same code on its CPU. I tested the following simple code on both machines:
from datetime import datetime
import numpy as np
t0 = datetime.now()
for i in range(1000):
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = np.matmul(a, b)
t1 = datetime.now()
print(t1 - t0)
The PC ran it in 32s but the laptop needed 45s. I tried a few things to resolve this, including uninstalling Cuda toolkit/cuDNN and reinstalling anaconda (tried different anaconda versions). But the issue still remains. Anyone has any insights about why this happens and what to try in order to address it? Many thanks.
Update: I notice that the same python code uses about 30% CPU if running on the PC's intel i5-7400 CPU but uses over 90% CPU and is slower if on the laptop intel i7-10750H CPU. Is it normal?

This is probably not your main problem however, your laptop run with battery? The laptop can decrease performance for saving battery life

There are many reasons to consider. Firstly, The code you are ruining
doesn't use GPU at all. It's all about CPU's holding the throttle.
Basically, as a part of "Thermal management" Laptops CPU power limit
throttle is constantly controlled. Fact is in CPU's runs code faster
than GPU's. So, maybe your laptop reaching thermal limitations and
throttles down to slower for most of the time it takes to run the
program. Maybe your PC CPU is able to withhold that throttle so it's
finishing bit faster.
Once check Benchmarking your code. A wonderful instructions return here
https://stackoverflow.com/a/1593034/15358800

Why do I get CUDA out of memory when running PyTorch model [with enough GPU memory]?

I am asking this question because I am successfully training a segmentation network on my GTX 2070 on laptop with 8GB VRAM and I use exactly the same code and exactly the same software libraries installed on my desktop PC with a GTX 1080TI and it still throws out of memory.
Why does this happen, considering that:
The same Windows 10 + CUDA 10.1 + CUDNN 7.6.5.32 + Nvidia Driver 418.96 (comes along with CUDA 10.1) are both on laptop and on PC.
The fact that training with TensorFlow 2.3 runs smoothly on the GPU on my PC, yet it fails allocating memory for training only with PyTorch.
PyTorch recognises the GPU (prints GTX 1080 TI) via the command : print(torch.cuda.get_device_name(0))
PyTorch allocates memory when running this command: torch.rand(20000, 20000).cuda() #allocated 1.5GB of VRAM.
What is the solution to this?

Most of the people (even in the thread below) jump to suggest that decreasing the batch_size will solve this problem. In fact, it does not in this case. For example, it would have been illogical for a network to train on 8GB VRAM and yet to fail to train on 11GB VRAM, considering that there were no other applications consuming video memory on the system with 11GB VRAM and the exact same configuration is installed and used.
The reason why this happened in my case was that, when using the DataLoader object, I set a very high (12) value for the workers parameter. Decreasing this value to 4 in my case solved the problem.
In fact, although at the bottom of the thread, the answer provided by Yurasyk at https://github.com/pytorch/pytorch/issues/16417#issuecomment-599137646 pointed me in the right direction.
Solution: Decrease the number of workers in the PyTorch DataLoader. Although I do not exactly understand why this solution works, I assume it is related to the threads spawned behind the scenes for data fetching; it may be the case that, on some processors, such an error appears.

Tensorflow CPU RAM allocation

I experience an incredibly high amount of (CPU) RAM usage with Tensorflow while about every variable is allocated on the GPU device, and all computation runs there. Even then, RAM usage exceeds the VRAM usage by a factor of 2 at least. I'm trying to understand why so as to see if it can be remedied or if it's inevitable.
Question
So my main question is: Does Tensorflow allocate and maintain a copy of all GPU variables on (CPU) RAM? If yes, what is allocated when (in which phase, see below)? And why is it useful to allocate this in CPU memory?
More info
I have 3 phases in which I see RAM increasing dramatically.
First, when defining the graph (I append VGG-19 with quite large loss functions that iterate over many translated activation maps). This adds 2 GB to RAM usage.
Second, defining the optimizer (I use Adam) adds 250MB.
Initialize global variables adds 750MB.
And then it remains stable and runs very fast (all on GPU).
(Amounts of data mentioned here are when I input tiny images of size 8x8x3, batch size of 1. If I do more than 1x16x16x3, the process gets killed because it overflows my 8GB RAM+6GB swap limit).
Note that I recorded variable placement with tf.ConfigProto(log_device_placement=True), and GPU usage using tf.RunMetadata and visualization on tensorboard.
Thank you for any help.
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution: Linux Ubuntu 17.10
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.7
Python version: 3.6.3
GCC/Compiler version (if compiling from source): 6.4.0
CUDA/cuDNN version: 9.0
GPU model and memory: NVidia GeForce Titan Xp

tensorflow: works well in one-GPU computer but Out of Memory in two-GPU computer

Before I installed the second graphics card, I had successfully trained a model which is VGG-11 architecture using tensorflow and a single GPU that has 6GB memory. But I got OOM error when I installed the second graphics card and ran the same code(allow_growth=True and no tf.device() used).
My understanding is that my second card(GPU:0) has 8GB memory and TF would use my "device:gpu:0" as default to do computation when I did not use tf.device() to specify any device. And the memory should be enough because 8GB > 6GB.
Then I tried to use CUDA_VISIBLE_DEVICES=0 to block one and ran the same code. TF worked successfully.
What is the problem?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.