I created an Anaconda environment on an Azure cloud compute instance for running a time series model.
After creating the environment and installing all the required libraries, I ran code in a Jupyter notebook on Azure to find the best parameters for my Facebook Prophet model.
I am getting a TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by excessive memory usage causing the operating system to kill the worker.
The exit codes of workers are {SIGSEGV(-11)}
I was trying to run the above code to get the best parameters for Facebook Prophet.
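For context, the parameter search follows the usual pattern from the Prophet documentation, roughly along the lines of the sketch below (the grid values and the dataframe are illustrative placeholders, not my exact notebook code):

    import itertools
    import numpy as np
    import pandas as pd
    from fbprophet import Prophet  # newer releases: from prophet import Prophet
    from fbprophet.diagnostics import cross_validation, performance_metrics

    # Placeholder training data with the 'ds' (date) and 'y' (value) columns Prophet expects.
    df = pd.DataFrame({
        'ds': pd.date_range('2020-01-01', periods=730, freq='D'),
        'y': np.random.default_rng(0).normal(size=730).cumsum(),
    })

    # Illustrative grid; the real search may cover other parameters as well.
    param_grid = {
        'changepoint_prior_scale': [0.001, 0.01, 0.1, 0.5],
        'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0],
    }
    all_params = [dict(zip(param_grid.keys(), v))
                  for v in itertools.product(*param_grid.values())]

    rmses = []
    for params in all_params:
        m = Prophet(**params).fit(df)
        df_cv = cross_validation(m, initial='365 days', period='30 days', horizon='90 days')
        df_p = performance_metrics(df_cv, rolling_window=1)
        rmses.append(df_p['rmse'].values[0])

    best_params = all_params[int(np.argmin(rmses))]
    print(best_params)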
Here is my installed environment:
CUDA 10.0
nvcc
Ubuntu 20.04
Python 3.7
TensorFlow 2.0 (GPU)
When I simply run style transfer code like the code provided here
https://www.tensorflow.org/tutorials/generative/style_transfer
it runs well, and the GPU memory is freed once the process finishes. But when I use Django-Q to run it, the GPU memory is kept even after my style transfer task is done.
I have tried the following methods to see whether the GPU memory can be released while keeping Django-Q running:
1. I manually ran kill -9 and restarted Django-Q from the command line; this worked.
2. I ran the commands from method 1 programmatically; this released the GPU memory, but later style transfer jobs could no longer reach the GPU.
3. I restarted the server and used a script to initialize Django-Q again; this worked, but it takes time.
4. I used TensorFlow's session-clearing method (sketched below), but it did not allow my later style transfer jobs to utilize the GPU.
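For reference, the session-clearing attempt in method 4 was roughly along the lines of the wrapper below (a minimal sketch; the function name and arguments are illustrative, not my exact task code):

    import gc
    import tensorflow as tf

    def run_and_release(task_fn, *args, **kwargs):
        # Run a GPU task (e.g. the style transfer job), then try to drop
        # TensorFlow's cached graph/session state.
        result = task_fn(*args, **kwargs)
        tf.keras.backend.clear_session()   # clears the Keras graph/session state
        gc.collect()                       # release leftover Python references
        return result

As far as I understand, TensorFlow normally holds on to the GPU memory it has allocated for the lifetime of the process, which may be why clearing the session inside a long-lived Django-Q worker does not give the memory back to the OS.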
I have the following questions, and it would be greatly appreciated if they could be answered as well:
1. Why, when I run the commands programmatically, can my later style transfer jobs not utilize the GPU?
2. How can I release the GPU memory and still use the GPU for later jobs while keeping Django-Q running, without restarting the server? (This is my main question.)
Thanks for your time.
I set up a new Ubuntu instance on AWS EC2 and SSH into it using a private key pair. I installed Python, Jupyter, PySpark, and all the necessary modules, and then start a Jupyter notebook inside tmux.
My main aim is simply to run PySpark on an AWS instance (using Jupyter). Unfortunately, I keep running into problems with the stability of the Jupyter notebook and the connection to the instance. After the notebook has been running for some time (sometimes 5 minutes, other times 2+ hours), it ends up "disconnecting": the kernel disconnects and does not process any further calls. At that point, I cannot SSH into the instance (it just hangs on a blank screen).
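For context, the notebook code is ordinary PySpark; a minimal local-mode sketch of the kind of session involved (illustrative, not the exact code) looks like:

    from pyspark.sql import SparkSession

    # Local-mode session using all cores on the instance; the app name is arbitrary.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("ec2-notebook-test")
             .getOrCreate())

    df = spark.range(1_000_000)   # placeholder workload
    print(df.count())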
I tried running the same setup on GCP but ran into the same symptoms.
Is there something basic that I am missing?
Why would I not be able to SSH into the instance?
Is it possible that the Ubuntu server is crashing?
I started an instance with a TPU by following this quick start tutorial using the ctpu up command, and I was able to run the MNIST tutorial successfully. I then logged out of Cloud Shell and logged into my VM connected to the TPU using the SSH console, as explained here. When I run the MNIST tutorial again, I get:
RuntimeError: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s).
When I run ctpu ls, I get
# Flock Name Status
0: my-tpu(*) running
The ctpu status command gives:
Your cluster is running!
Compute Engine VM: RUNNING
Cloud TPU: RUNNING
Am I missing something basic here?
ctpu passes the TPU name to the Compute Engine VM as an environment variable (TPU_NAME), but gcloud doesn't.
Specify your TPU explicitly: use --tpu=my-tpu instead of --tpu=$TPU_NAME
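For illustration, this is roughly how the TPU gets resolved on the script side with the TensorFlow 2.x API (the fallback name here is just an example):

    import os
    import tensorflow as tf

    # ctpu exports TPU_NAME on the VM; under a plain gcloud/SSH session it is unset,
    # so pass the TPU name explicitly (e.g. --tpu=my-tpu) or fall back to it here.
    tpu_name = os.environ.get('TPU_NAME', 'my-tpu')

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    print('TPU devices:', tf.config.list_logical_devices('TPU'))

With an empty or wrong name, the job cannot locate the TPU, which is what the "Cannot find any TPU cores" error above is complaining about.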
I'm new to cloud computing in general, and I've started a free trial with Amazon's Web Services, in hopes of using their EC2 servers to run some code for Kaggle competitions. I'm currently working on running a test Python script for doing some image processing and testing a linear classifier (I don't suspect these details are relevant to my problem, but wanted to provide some context).
Here are the steps I've gone through to run my script on an EC2 instance:
1. Log in to AWS and start the EC2 instance where I've installed the relevant Python modules for my tasks (e.g. the Anaconda distribution). As a side note, all my data and the script I want to run are in the same directory on this server instance.
2. SSH to my EC2 instance from my laptop and cd to the directory with my script.
3. Run screen so the program can run in the background.
4. Run the script via python program.py and detach from the screen session (Ctrl+A, D).
5. Keep the EC2 instance running, but exit the SSH session connecting my laptop to the server.
I've followed these steps a number of times, and they result in either (a) "Broken Pipe" errors or (b) an error where the connection appears to "hang". In case (b), I've attempted to disconnect from the SSH session and reconnect to the server, but I am unable to do so due to an error stating "connection has been reset by peer".
I'm not sure if I need to configure something differently on the EC2 instance, or if I need to specify different options when connecting to the server via SSH. Any help here would be appreciated. Thanks for reading.
EDIT: I've been successful in running some example scripts using scikit-learn by setting up an IPython notebook, launching it with nohup, and running the code in a notebook cell. However, when trying to do the same with my Kaggle competition code, the same "hanging" issue happens, and the connection appears to be dropped, causing the code to stop running. The image dataset I'm running the code on in the second case is quite a bit larger than the dataset processed by the example code in the first case. I'm not sure whether dataset size alone is causing the issue, or how to solve this.
I'm playing around with some Python deep learning packages (Theano/Lasagne/Keras). I've been running them on the CPU of my laptop, which takes a very long time to train the models.
For a while I was also using Amazon GPU instances, with an IPython notebook server running, which obviously ran much faster for full runs, but was pretty expensive to use for prototyping.
Is there any way to set things up that would let me prototype in IPython on my local machine, and then, when I have a large model to train, spin up a GPU instance, do all the processing/training on it, and then shut the instance down?
Is a setup like this possible, or does anyone have any suggestions to combine the convenience of the local machine with temporary processing on AWS?
My thoughts so far were along the lines of:
1. Prototype in a local IPython notebook.
2. Set up a cell to run a long process from start to finish.
3. Use boto to start up an EC2 instance, then SSH into the instance using boto's sshclient_from_instance:

    from boto.manage.cmdshell import sshclient_from_instance

    ssh_client = sshclient_from_instance(instance,
                                         key_path='<path to SSH keyfile>',
                                         user_name='ec2-user')

4. Get the contents of the cell I've set up using the solution here (say the script is in cell 13), then execute that script using:

    ssh_client.run('python -c "' + _i13 + '"')

5. Shut down the instance using boto.
This just seems a bit convoluted; is there a proper way to do this?
So when it comes to EC2, you don't have to shut down (terminate) the instance every time. The beauty of AWS is that you can stop and start your instance as you use it, and only pay for the time it is up and running. Also, you can always try your code on a smaller and cheaper instance, and if it's too slow for your liking you can just scale up to a larger instance.
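If you want to script that stop/start cycle, it is only a couple of calls with boto3 (the current AWS SDK for Python); the region and instance ID below are placeholders:

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')   # placeholder region
    instance_id = 'i-0123456789abcdef0'                  # placeholder instance ID

    # Bring the GPU instance up for a training run.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])

    # ... SSH in and run the training job here ...

    # Stop (not terminate) the instance afterwards so the EBS volume and installed
    # environment persist; you stop paying for compute while it is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])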