I started an instance with a TPU by following this quick start tutorial using the ctpu up command, and I was able to run the MNIST tutorial successfully. I then logged out of Cloud Shell and logged into my VM connected to the TPU over SSH as explained here. When I run the MNIST tutorial again I'm getting
RuntimeError: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s).
When I run ctpu ls, I get
# Flock Name Status
0: my-tpu(*) running
The ctpu status command gives
Your cluster is running!
Compute Engine VM: RUNNING
Cloud TPU: RUNNING
Am I missing something basic here?
ctpu passes the TPU name to the Compute Engine VM as an environment variable (TPU_NAME), but gcloud doesn't.
Specify your TPU explicitly: use --tpu=my-tpu instead of --tpu=$TPU_NAME
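For example, from the VM you SSH'd into yourself, either export the variable or pass the name directly (a rough sketch; mnist_tpu.py and its flags stand in for whatever script the tutorial has you run):
export TPU_NAME=my-tpu
python mnist_tpu.py --tpu=$TPU_NAME
or simply:
python mnist_tpu.py --tpu=my-tpu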
I created an Anaconda environment on an Azure cloud compute instance for running a time series model.
After creating the environment and installing all the required libraries, I ran the code in a Jupyter notebook on Azure to find the best parameters for my Facebook Prophet model.
I am getting a TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by excessive memory usage causing the operating system to kill the worker.
The exit codes of workers are {SIGSEGV(-11)}
I was trying to run the above code to get the best parameters for my Facebook Prophet model.
So currently I have one computer running macOS and another computer with a GPU running Windows. What's the best way to send data (in this case, 2 images or links to images) from the Mac to the Windows computer, train a model with TensorFlow on the Windows computer, then send the output of the model (another image) back to the Mac? They are both on the same network.
I've found a couple of solutions so far, but I'm not sure which one is optimal:
Using Pyro
Installing Linux on the Windows machine and then SSH'ing into it like a server
I'm new to setting up anything server-related, so any tips would be great!
Personally I like:
Connecting to the machine with the GPU via ssh
Open a tmux session or something similar so that everything printed to the terminal is not lost when I disconnect from the SSH session. To create the session (after you have installed tmux), the command is: tmux new -s my_session
Launch the training inside that tmux session.
Detach from the session with Ctrl+b and then d.
Disconnect from SSH and reconnect from time to time to check on the training. After reconnecting, you just have to attach to the session with tmux attach -t my_session to see what the training printed while you were away.
When the training is finished I copy the output folder containing images, logs, etc. back to my other machine over SSH (e.g. with scp).
Surely there are more automated solutions, but this one works the best for me.
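Put together, the whole workflow looks roughly like this (a sketch; the host name, session name, script, and paths are placeholders):
ssh user@gpu-machine                 # connect to the machine with the GPU
tmux new -s my_session               # start a persistent session
python train.py                      # launch the training inside tmux
# detach with Ctrl+b then d, log out, and later reattach with:
tmux attach -t my_session
# once training is done, pull the results back from the Mac:
scp -r user@gpu-machine:~/output ./output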
I have followed around 5 tutorials on using NSSM to run a Python script residing on a network drive. It creates the service and I can edit the service, but when I start the service I get the following error:
Unexpected status SERVICE_STOPPED in response to START control
When I try starting the Windows 10 service from the Services console, I get the following error:
Services error during start
I have changed the path to the Python script from the full network path to a mapped network path, and that did not change anything.
I have also tried using Task Scheduler, which only worked intermittently, and I also tried the pywin32 method as posted here, but it fails to start the service as well.
I figured it out. My remote VM does not allow the Local System account to be set as the service's Log On account; it required me to add an admin user name and password instead. It's running now.
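If you prefer to set this from the command line rather than the Services GUI, NSSM can change the log-on account with something like the following (a sketch; the service name, account, and password are placeholders, and the domain may just be the machine name for a local account):
nssm set MyPythonService ObjectName MYDOMAIN\adminuser MyPassword
nssm restart MyPythonService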
I'm new to cloud computing in general, and I've started a free trial with Amazon Web Services, in hopes of using their EC2 servers to run some code for Kaggle competitions. I'm currently working on running a test Python script that does some image processing and tests a linear classifier (I don't suspect these details are relevant to my problem, but wanted to provide some context).
Here are the steps I've gone through to run my script on an EC2 instance:
Log in to AWS and start the EC2 instance where I've installed the relevant Python modules for my tasks (e.g. the Anaconda distribution). As a side note, all my data and the script I want to run are in the same directory on this server instance.
SSH to my EC2 instance from my laptop, and cd to the directory with my script.
Run screen so the program can keep running in the background.
Run the script via python program.py and detach from the screen session (Ctrl+A, D).
Keep the EC2 instance running, but exit the SSH session connecting my laptop to the server.
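In other words, the session looks roughly like this (a sketch; the key, host, directory, and script name are placeholders):
ssh -i my-key.pem user@my-ec2-instance   # connect to the instance
cd ~/kaggle-project                      # directory with the data and script
screen                                   # start a screen session
python program.py                        # run the script inside screen
# detach with Ctrl+A then D, exit the SSH session, and later reattach with:
screen -r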
I've followed these steps a number of times, which results in either (a) "Broken pipe" errors, or (b) an error where the connection appears to "hang". In the case of (b), I've attempted to disconnect from the SSH session and reconnect to the server; however, I am unable to do so due to an error stating "connection has been reset by peer".
I'm not sure if I need to configure something differently on the EC2 instance, or if I need to specify different options when connecting to the server via SSH. Any help here would be appreciated. Thanks for reading.
EDIT: I've been successful in running some example scripts using scikit-learn by setting up an IPython notebook, launching it with nohup, and running the code in a notebook cell. However, when trying to do the same with my Kaggle competition code, the same "hanging" issue happens, and the connection appears to be dropped, causing the code to stop running. The image dataset I'm running the code on in the second case is quite a bit larger than the dataset processed by the example code in the first case. Not sure if dataset size alone is causing the issue, or how to solve this.
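For reference, the notebook was launched roughly like this (a sketch; depending on the install the command may be ipython notebook or jupyter notebook, and the port is arbitrary):
nohup jupyter notebook --no-browser --port=8888 > notebook.log 2>&1 &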
I'm using Python/NumbaPro to use my CUDA-compliant GPU on a Windows box. I use Cygwin as my shell, and from within a Cygwin console it has no problems finding my CUDA device. I test with the simple command
numbapro.check_cuda()
But when I'm connecting to the box over OpenSSH (as part of my Cygwin setup), I get the following error:
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init:
Call to cuInit results in CUDA_ERROR_NO_DEVICE:
How do I fix this?
The primary cause of this is Windows service session 0 isolation. When you run any application via a service which runs in session 0 (sshd, or Windows Remote Desktop, for example), the machine's native display driver is unavailable. For CUDA applications, this means that you get a no-device-available error at runtime, because the sshd you use to log in is running as a service and there is no available CUDA driver.
There are a few workarounds:
Run the sshd as a process rather than a service.
If you have a compatible GPU, use the TCC driver rather than the GPU display driver.
On the secondary problem, the Python runtime error you are seeing comes from the multiprocessing module. From this question it appears that the root cause is probably the NUMBER_OF_PROCESSORS environment variable not being set. You can use one of the workarounds in that thread to get around that problem.
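A rough sketch of what those workarounds look like from a Cygwin shell (the commands are illustrative, not exact: the GPU index and core count are placeholders, and switching to TCC requires admin rights, a reboot, and a GPU that supports TCC, and it disables display output on that card):
/usr/sbin/sshd -D &              # run sshd as an ordinary process in your own session instead of as a service
nvidia-smi -i 0 -dm 1            # switch GPU 0 from the WDDM display driver model to TCC
export NUMBER_OF_PROCESSORS=8    # set the variable multiprocessing expects if it is missing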