modprobe: FATAL: Module nvidia-uvm not found in directory /lib/modules/ - python

I ran into a problem recently after successfully installing and testing TensorFlow compiled with GPU support.
After rebooting the machine, I got the following error message when I tried to run a TensorFlow program:
...
('Extracting', 'MNIST_data/t10k-labels-idx1-ubyte.gz')
modprobe: FATAL: Module nvidia-uvm not found in directory /lib/modules/4.4.0-34-generic
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:140] kernel driver does not appear to be running on this host (caffe-desktop): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/gpu_init.cc:92] No GPU devices available on machine.
(0, 114710.45)
(1, 95368.891)
...
(98, 56776.922)
(99, 57289.672)
Code: https://github.com/llSourcell/autoencoder_demo
Question: Why would restarting an Ubuntu 16.04 machine break TensorFlow?

I actually solved my own problem and wanted to share the solution which worked for me.
The magic Google search was:
"modprobe: FATAL: Module nvidia-uvm not found in directory /lib/modules/"
Which led me to the following answer on askubuntu:
https://askubuntu.com/a/496146
That answer's author, Sneetsher, did a really good job of explaining so if the link doesn't 404 I would start there.
Cliff Notes
Diagnosis: I suspected that Ubuntu may have installed a kernel update when I rebooted.
Solution: Reinstalling the NVIDIA driver fixed the error.
Problem: NVIDIA drivers cannot be installed with X server running
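The diagnosis above can be checked from Python before reinstalling anything. The following is a minimal sketch, not part of the original answer: it looks for NVIDIA kernel modules built for the running kernel and checks whether /proc/driver/nvidia/version exists, the two things the error output complains about. The function name nvidia_driver_status and its parameters are hypothetical, and the sketch assumes module files are named nvidia*.ko (possibly with a compression suffix).

```python
import os
import platform


def nvidia_driver_status(modules_root="/lib/modules",
                         proc_version="/proc/driver/nvidia/version",
                         kernel_release=None):
    """Report whether NVIDIA modules exist for a kernel (hypothetical helper)."""
    # platform.release() returns the same string as `uname -r`.
    release = kernel_release or platform.release()
    module_dir = os.path.join(modules_root, release)

    found = []
    # os.walk yields nothing if module_dir does not exist.
    for dirpath, _dirnames, filenames in os.walk(module_dir):
        for name in filenames:
            # Assumption: module files look like nvidia*.ko, nvidia*.ko.xz, ...
            if name.startswith("nvidia") and ".ko" in name:
                found.append(os.path.join(dirpath, name))

    return {
        "kernel": release,
        "module_dir_exists": os.path.isdir(module_dir),
        "nvidia_modules": sorted(found),
        # This file is only present while the NVIDIA kernel driver is loaded.
        "driver_loaded": os.path.exists(proc_version),
    }


if __name__ == "__main__":
    for key, value in nvidia_driver_status().items():
        print(key, "=", value)
```

An empty nvidia_modules list for the running kernel, together with driver_loaded being False, would be consistent with a kernel update having left the driver without matching modules.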
Two different ways to reinstall the NVIDIA driver
1) Keyboard and monitor:
Paraphrasing the askubuntu answer:
a) Switch to a text-only console (Ctrl+Alt+F1, or any of F1 to F6).
b) Build the driver modules for the current kernel (the one that was just installed): sudo ./<DRIVER>.run -K
Credit: "Sneetsher", https://askubuntu.com/a/496146
I don't have a keyboard or monitor attached to this PC so here's the "headless" approach I actually used:
2) Over SSH:
Following this guide to reboot to console:
http://ubuntuhandbook.org/index.php/2014/01/boot-into-text-console-ubuntu-linux-14-04/
$ sudo cp -n /etc/default/grub /etc/default/grub.orig
$ sudo nano /etc/default/grub
$ sudo update-grub
In nano, before running update-grub, edit the grub file according to the link above (3 changes):
Comment out the line GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" by adding # at the beginning, which disables the Ubuntu purple screen.
Change GRUB_CMDLINE_LINUX="" to GRUB_CMDLINE_LINUX="text", which makes Ubuntu boot directly into text mode.
Uncomment the line #GRUB_TERMINAL=console by removing the # at the beginning, which puts the GRUB menu into real black-and-white text mode (without a background image).
UPDATE: If running Ubuntu 16.04:
$ sudo systemctl set-default multi-user.target
Reboot into console
$ sudo shutdown -r now
$ sudo service lightdm stop
$ sudo ./<DRIVER>.run
Follow the NVIDIA driver installer prompts, then restore the original grub configuration and reboot:
$ sudo mv /etc/default/grub /etc/default/grub.textonly
$ sudo mv /etc/default/grub.orig /etc/default/grub
$ sudo update-grub
$ sudo shutdown -r now
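For reference, the three grub edits from the procedure above can be expressed as a small text transformation. This is only a sketch, assuming the file uses the stock Ubuntu variable names quoted in the steps. The function name to_text_mode is made up for illustration, and it returns the new text rather than writing to /etc/default/grub, so the backup and update-grub steps still have to be run by hand.

```python
import re


def to_text_mode(grub_text):
    """Apply the three text-mode edits to the contents of a grub defaults file."""
    out = []
    for line in grub_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("GRUB_CMDLINE_LINUX_DEFAULT="):
            # Edit 1: comment out the "quiet splash" line.
            out.append("#" + line)
        elif stripped.startswith("GRUB_CMDLINE_LINUX="):
            # Edit 2: boot into text mode. Note: this overwrites any
            # existing value, as in the manual steps ("" -> "text").
            out.append('GRUB_CMDLINE_LINUX="text"')
        elif re.match(r"#\s*GRUB_TERMINAL=console", stripped):
            # Edit 3: uncomment the console terminal line.
            out.append("GRUB_TERMINAL=console")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        'GRUB_DEFAULT=0\n'
        'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
        'GRUB_CMDLINE_LINUX=""\n'
        '#GRUB_TERMINAL=console\n'
    )
    print(to_text_mode(sample))
```

Keeping the transformation separate from the file write makes it easy to compare the result against the original before touching the real file.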
Results (what things look like now that the GPU is successfully detected)
...
('Extracting', 'MNIST_data/t10k-labels-idx1-ubyte.gz')
I tensorflow/core/common_runtime/gpu/gpu_init.cc:118] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.342
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.88GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:138] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:148] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:868] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
(0, 113040.92)
(1, 94895.867)
...

One simple solution to "Problem: NVIDIA drivers cannot be installed with X server running":
1) Access the Ubuntu machine from another computer using SSH.
2) Disconnect the screen (display device) from the Ubuntu computer.
3) Reboot the computer using sudo reboot, then access it again over SSH.

Related

Vagrant in Ubuntu 20.04 LTS

vagrant up
==> vagrant: A new version of Vagrant is available: 2.2.14 (installed version: 2.2.9)!
==> vagrant: To upgrade visit:Vagrant Link
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'ubuntu/bioni64' could not be found. Attempting to find and install...
default: Box Provider: virtualbox
default: Box Version: ~> 201901314.0.0
Vagrant is currently configured to create VirtualBox synced folders with
the SharedFoldersEnableSymlinksCreate option enabled. If the Vagrant
guest is not trusted, you may want to disable this option. For more
information on this option, please refer to the VirtualBox manual:
https://www.virtualbox.org/manual/ch04.html#sharedfolders
This option can be disabled globally with an environment variable:
VAGRANT_DISABLE_VBOXSYMLINKCREATE=1
or on a per folder basis within the Vagrantfile:
config.vm.synced_folder '/host/path', '/guest/path',
SharedFoldersEnableSymlinksCreate: false
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
default: Adapter 1: nat
==> default: Forwarding ports...
default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
There was an error while executing VBoxManage, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.
Command: ["startvm", "2ae67e96-d553-415e-9624-c5381293436e", "--type", "headless"]
Stderr: VBoxManage: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component ConsoleWrap, interface IConsole
Enable virtualization (VT-x) in your BIOS settings. That should solve the problem.

How can I make Python3.6, Red Hat Software Collection, persist after a reboot/logout/login?

I am trying to enable the rh-python36 software collection automatically after a reboot so I can avoid calling "scl enable" all the time.
After unzipping and installing the package:
yum install -y tmp/rpms/*
I created a new file "python36.sh" under /etc/profile.d with the following script:
#!/bin/bash
source /opt/rh/rh-python36/enable
export X_SCLS="`scl enable rh-python36 'echo $X_SCLS'`"
After restarting or rebooting the instance, I am getting: No such file or directoryenable
I am using CentOS release 6.10 (Final).
If you have root privileges, add the following line to the .bash_profile file in the root user's home directory:
source /opt/rh/rh-python36/enable
Try this:
#!/bin/bash
source scl_source enable rh-python36
Reference Doc: https://access.redhat.com/solutions/527703
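Once the profile script is in place, a quick way to confirm that a new login shell actually picked up the collection is to ask the interpreter where it lives. The check below is a minimal sketch. The helper name is_scl_python is hypothetical, and the /opt/rh/rh-python36 prefix is taken from the enable script path used above.

```python
import sys


def is_scl_python(executable=None, version=None, prefix="/opt/rh/rh-python36"):
    """Return True if the interpreter looks like the rh-python36 one."""
    executable = sys.executable if executable is None else executable
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    # The interpreter must live under the collection prefix and be Python 3.6.
    return executable.startswith(prefix + "/") and version == (3, 6)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("rh-python36 active:", is_scl_python())
```

Running this with plain python right after logging in shows whether the collection was enabled without calling scl enable first.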

Running Windows Server Core in Docker Container

My Linux containers run like a charm, but switching to Windows Server in my Docker container is driving me crazy!
My Dockerfile doesn't build, although it is as simple as my Linux Dockerfiles:
FROM microsoft/windowsservercore
#Install Chocolatey
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iwr https://chocolatey.org/install.ps1 -UseBasicParsing | iex)"
ENV PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin
#Install python
RUN choco install -fy python2
RUN refreshenv
ENV PYTHONIOINPUT=UTF-8
RUN pip install -y scipy
Sometimes I was able to install Chocolatey, but then installing scipy via pip failed. Curiously, starting 5 minutes ago, even the installation of Chocolatey fails:
iwr : The remote name could not be resolved: 'chocolatey.org'
At line:1 char:2
+ (iwr https://chocolatey.org/install.ps1 -UseBasicParsing | iex)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand
Here are some specs on my Docker for Windows Installation:
Containers: 2
Running: 0
Paused: 0
Stopped: 2
Images: 3
Server Version: 1.13.0
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: l2bridge l2tunnel nat null overlay transparent
Swarm: inactive
Default Isolation: hyperv
Kernel Version: 10.0 14393 (14393.693.amd64fre.rs1_release.1612
Operating System: Windows 10 Education
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 7.903 GiB
Name: xxxx
ID: deleted
Docker Root Dir: C:\ProgramData\Docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: -1
Goroutines: 18
System Time: 2017-01-31T16:14:36.3753129+01:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Any ideas?
I was unable to get refreshenv to work, so I used multiple PowerShell sessions instead. I've included this in case it is useful to someone in the future.
#Install Chocolatey, Python and the Python package manager; each PowerShell session reloads the PATH from the previous step
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "iwr https://chocolatey.org/install.ps1 -UseBasicParsing | iex"
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "choco install -y python3"

port upgrade outdated fails in Yosemite

I am trying to upgrade my MacPorts installation. I ran sudo port -d selfupdate successfully after updating my Xcode command line tools, but the problem persists. How can I successfully upgrade my outdated ports? I am running OS X Yosemite.
$ sudo port upgrade outdated
---> Fetching distfiles for python25
Error: python25 is not supported on Yosemite or later.
Error: org.macports.fetch for port python25 returned: unsupported platform
Please see the log file for port python25 for details:
/opt/local/var/macports/logs/_opt_local_var_macports_sources_rsync.macports.org_release_tarballs_ports_lang_python25/python25/main.log
Error: Unable to upgrade port: 1
To report a bug, follow the instructions in the guide:
http://guide.macports.org/#project.tickets
logfile:
$ cat /opt/local/var/macports/logs/_opt_local_var_macports_sources_rsync.macports.org_release_tarballs_ports_lang_python25/python25/main.log
version:1
:debug:main Executing org.macports.main (python25)
:debug:main changing euid/egid - current euid: 0 - current egid: 0
:debug:main egid changed to: 501
:debug:main euid changed to: 502
:debug:fetch fetch phase started at Fri Jan 9 22:13:21 PST 2015
:notice:fetch ---> Fetching distfiles for python25
:debug:fetch Executing proc-pre-org.macports.fetch-fetch-0
:error:fetch python25 is not supported on Yosemite or later.
:error:fetch org.macports.fetch for port python25 returned: unsupported platform
:debug:fetch Error code: NONE
:debug:fetch Backtrace: unsupported platform
while executing
"proc-pre-org.macports.fetch-fetch-0 org.macports.fetch"
("eval" body line 1)
invoked from within
"eval $pre $targetname"
:info:fetch Warning: targets not executed for python25: org.macports.destroot org.macports.fetch org.macports.checksum org.macports.extract org.macports.patch org.macports.configure org.macports.build
:notice:fetch Please see the log file for port python25 for details:
/opt/local/var/macports/logs/_opt_local_var_macports_sources_rsync.macports.org_release_tarballs_ports_lang_python25/python25/main.log

Python multiprocessing: Permission denied

I'm getting an error when trying to execute a Python program that uses the multiprocessing package:
File "/usr/local/lib/python2.6/multiprocessing/__init__.py", line 178, in RLock
return RLock()
File "/usr/local/lib/python2.6/multiprocessing/synchronize.py", line 142, in __init__
SemLock.__init__(self, RECURSIVE_MUTEX, 1, 1)
File "/usr/local/lib/python2.6/multiprocessing/synchronize.py", line 49, in __init__
sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 13] Permission denied
It looks like the user doesn't have permission to access shared memory. When executing with root privileges it works fine.
Is there any way to run it as a normal user (not root)?
The Python version is 2.6.2, the OS is Linux 2.6.18 (CentOS release 5.4), and it's a VPS machine.
For POSIX semaphores to work, the users need r/w access to shared memory (/dev/shm).
Check the permissions to /dev/shm. On my laptop (Ubuntu) it looks like this:
$ ls -ld /dev/shm
drwxrwxrwt 2 root root 40 2010-01-05 20:34 shm
To permanently set the correct permissions (even after a reboot), add the following to your /etc/fstab:
none /dev/shm tmpfs rw,nosuid,nodev,noexec 0 0
Haven't tried this, just copied from a forum post.
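To see whether /dev/shm is the culprit before editing /etc/fstab, the permission bits can be inspected from Python. This is a rough sketch rather than a tested fix: the helper names shm_report and can_create_lock are invented for illustration, and it assumes Python 3 (stat.filemode is not available in the Python 2.6 from the question).

```python
import os
import stat


def shm_report(path="/dev/shm"):
    """Summarise the permission bits of the shared-memory directory."""
    mode = os.stat(path).st_mode
    return {
        "mode": stat.filemode(mode),               # ls -l style string
        "is_dir": stat.S_ISDIR(mode),
        "world_writable": bool(mode & stat.S_IWOTH),
        "sticky": bool(mode & stat.S_ISVTX),
        # Whether the current user may create entries in the directory.
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }


def can_create_lock():
    """Try the same call that fails in the traceback above."""
    try:
        import multiprocessing
        multiprocessing.RLock()
        return True
    except (OSError, ImportError):
        return False


if __name__ == "__main__":
    print(shm_report())
    print("RLock works:", can_create_lock())
```

A directory listed as drwxrwxrwt by ls corresponds to world_writable and sticky both being True in this report.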
On my OVH VPS Classic, this error was caused by a symlink loop between /dev/shm and /run/shm: both were symlinks pointing to each other.
So as root here is what I did:
# rm /dev/shm
# mkdir /dev/shm
# chmod 777 /dev/shm
# nano /etc/fstab
Then I modified the shm line from:
none /dev/shm tmpfs rw 0 0
To:
none /dev/shm tmpfs rw,nosuid,nodev,noexec 0 0
Restarted the server... And that fixed the problem!
Alternatively you can mount shm manually:
# mount /dev/shm
Hope this helps :-)
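The symlink loop described in this answer can be detected before removing anything. Below is a small sketch under the assumption that following os.readlink hop by hop is enough to expose the loop. follow_symlinks is a hypothetical helper name, not a standard library function.

```python
import os


def follow_symlinks(path):
    """Follow a chain of symlinks; return (chain, looped)."""
    seen = []
    current = path
    # os.path.islink uses lstat, so it works even when the link is broken
    # or part of a loop.
    while os.path.islink(current):
        if current in seen:
            # We came back to a link we already visited: a loop.
            return seen + [current], True
        seen.append(current)
        target = os.readlink(current)
        # Relative targets are resolved against the link's own directory.
        current = os.path.normpath(
            os.path.join(os.path.dirname(current), target))
    return seen + [current], False


if __name__ == "__main__":
    chain, looped = follow_symlinks("/dev/shm")
    print(" -> ".join(chain))
    print("loop detected:", looped)
```

If /dev/shm and /run/shm point at each other, the chain returns to its starting path and looped is True; on a healthy system /dev/shm is either a real directory or a link that ends at one.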
One simple solution that doesn't require rebooting is:
sudo chmod 777 /dev/shm
That solved my problem.
I tried all the recommendations related to chmod and shm, but in my case the solution was to install Jupyter Notebook through conda instead of snap.
Using Anaconda Navigator:
1) In the base environment, launch the navigator:
$ anaconda-navigator
2) Create a new conda environment with the CREATE button in the navigator.
3) Select the new environment with your mouse.
4) Install "notebook" from Anaconda Navigator into the new environment.
Using the command line:
1) Create a new conda environment (named "my_new_env" here):
$ conda create --name my_new_env
2) Activate my_new_env:
$ conda activate my_new_env
3) Install Jupyter Notebook:
$ conda install jupyter-core (OR $ conda install notebook)
In summary, don't use snap to install Jupyter Notebook.
