I need to create multiple EBS volumes and put some data there using python+boto3.
Overall my flow is:
Create volumes.
Wait until they reach the available state.
Attach one volume.
Wait until it reaches the attached (in-use) state.
List NVMe devices and identify which one corresponds to the volume (see the sketch after this list). <-- The issue happens here.
Mount NVMe device.
Copy data.
Unmount and detach the volume.
... next volume.
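For reference, here is a minimal sketch of steps 5-7, assuming a Nitro-based instance (such as t3a) where the EBS volume ID is embedded in the NVMe controller's serial number; the helper name and device name are illustrative:

import glob
import boto3

ec2 = boto3.client('ec2', region_name='us-east-2')

def attach_and_find_device(volume_id, instance_id):
    # The requested device name is only advisory on Nitro instances.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device='/dev/sdf')
    ec2.get_waiter('volume_in_use').wait(VolumeIds=[volume_id])
    # On Nitro instances the sysfs serial file holds the volume ID
    # without the dash, e.g. 'vol0123456789abcdef0'.
    target = volume_id.replace('-', '')
    for serial_path in glob.glob('/sys/block/nvme*/device/serial'):
        with open(serial_path) as f:
            if f.read().strip() == target:
                return '/dev/' + serial_path.split('/')[3]
    return None  # attached, but no block device visible yet (the failure described below)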
Most of the time it works fine. Volumes attach correctly and are linked to NVMe devices (like /dev/nvme2p1). But at some point Linux doesn't create a block device for the volume: the volume state is attached, but nvme list doesn't show it.
If I re-attach such a volume with boto3, or manually in the AWS console, it gets a block device.
It happens in us-east-2 region but not in ap-south-1, for example.
I tried attaching/detaching in both single-threaded and multi-threaded mode.
In multi-threaded mode I used separate boto3 clients per thread and a lock so volumes attach sequentially. I also tried different boto3 versions and waiting some time after attaching. Nothing helped.
My environment:
EC2 instance: t3a.small, AMI: ubuntu 20.04.2.
python2 (yes, we are still using it).
botocore==1.12.253
boto3==1.9.199
Has anyone faced the same problem?
I am just trying to follow the MultiWorkerMirroredStrategy example in the TensorFlow docs.
Training succeeds on localhost, which is a single node.
However, training fails on a cluster with two nodes.
I have tried disabling the firewall, but it didn't solve the problem.
Here is the main.py. (I run the same code on node 1 and node 2, except for the tf_config variable: I set node 1's tf_config as tf_config['task']['index']=0, and node 2's as tf_config['task']['index']=1.)
main.py
Any help appreciated. Thanks.
I see that you don't have an error code, but I think I can infer where the issue could be arising, since your code should work. I will test on my kubernetes once I get a chance (I have a node down atm).
The most likely issue: you are using json.dumps() to set the environment variable. In many settings you should instead be using:
tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
TASK_INDEX = tf_config['task']['index']
That should clear up any issues with exposed ports and IP configurations.
-It sounds like you are running this from a notebook, since you are not running the same code for main.py (in one main.py you set the index to 1 and in the other 0). Either way, that is not what you are doing here: you are setting the index to 1 and 0, but you are not getting back only the index, you are getting back the full cluster spec with the index you set it to. If the environment variable is set by your cluster, you need to read back the TF_CONFIG that was set and use json.loads() to turn it into your tf_config; then you will get ONLY the replica index for that node.
If you are using a notebook, it needs to be connected to the cluster environment; otherwise you are setting a local environment variable on your machine, not in the containers on the cluster. Consider using Kubeflow to manage this.
You can either launch from the notebook after setting up your cluster configuration op, or build a TFJob spec as a YAML that defines the node specs, then launch the pods using that spec.
Either way, the cluster needs to actually have that configuration. You should be able to load the environment in the cluster such that each node is ASSIGNED an index, and you then get that index from THAT node's replica ID, the one you set when you launched the nodes and specified with a YAML or JSON dictionary. A locally set environment variable running within the local container means nothing to the actual cluster if the replica-index:{num} on kubernetes does not match the environment variable in the container; that is assigned when the pod is launched.
-Try making a function that returns the index of each worker, to test whether it matches the replica-index on your kubernetes dashboard or from kubectl. Make sure the function prints it out so you can see it in the pod logs; this will help with debugging (a small sketch follows at the end of this list).
-Look at the pod logs and see if the pods are connecting to the server and are using a communication spec compatible with your cluster: gRPC/etc. You are not setting a communication strategy, but it should be able to find one automatically in most cases (just check in case).
-If you are able to launch pods, make sure you are terminating them before trying again. Again, Kubeflow is going to make things much easier for you once you get the hang of their Python pipeline SDK. You can launch functions as containers, and you can build an op that cleans up by terminating old pods.
-You should consider having your main.py and any other supporting modules loaded into an image in a repository, such as Docker Hub, so that the containers can load the image. With MultiWorker strategy, each machine needs to have the same data for it to be sharded properly. Again, check your pod logs to see whether it cannot shard the data.
-Are you running on a local machine with different GPUs? If so, you should be using MirroredStrategy, NOT MultiWorkerMirroredStrategy.
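As a starting point for the debugging function suggested above, here is a minimal sketch, assuming TF_CONFIG is set in each pod's environment (the function name and prints are illustrative):

import json
import os

import tensorflow as tf

def report_worker_identity():
    # Read back whatever TF_CONFIG the cluster actually set for this pod.
    tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
    task = tf_config.get('task', {})
    print('cluster spec:', tf_config.get('cluster'))
    print('this worker is %s #%s' % (task.get('type'), task.get('index')))
    return task.get('index')

report_worker_identity()  # check pod logs against the kubernetes replica index
strategy = tf.distribute.MultiWorkerMirroredStrategy()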
We only have 4 GPU devices, and we have more than 4 users running CUDA programs, so before I run my program I want to check which device is not busy; otherwise memory allocation will fail. But I haven't found a function to get this information. I know that when we want to use a device we call "cudaSetDevice()", so there must be an identifier for each device, and "nvidia-smi" can get more detail, including which process is using which device and how much memory it uses. Who can help me?
The values for cudaSetDevice start at 0 and then increase monotonically for each additional device. Alternatively you can set the environment variable CUDA_VISIBLE_DEVICES to select which device to use. (see https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/).
To get information about what is using the device you need to use the driver API: http://docs.nvidia.com/cuda/cuda-driver-api/index.html
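For example, here is a minimal sketch using the NVML Python bindings (the pynvml package), which expose the same driver-level counters nvidia-smi reads; the idle-selection policy is just an illustration:

from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo,
                    nvmlDeviceGetComputeRunningProcesses)

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        mem = nvmlDeviceGetMemoryInfo(handle)
        procs = nvmlDeviceGetComputeRunningProcesses(handle)
        # A device with no compute processes and plenty of free memory is a
        # reasonable candidate for cudaSetDevice() / CUDA_VISIBLE_DEVICES.
        print('GPU %d: %d processes, %d MiB free'
              % (i, len(procs), mem.free // (1024 * 1024)))
finally:
    nvmlShutdown()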
I have a t2.medium instance on EC2, and it had a 75GB gp2 volume (a general-purpose SSD). After changing to a 110GB gp2 volume, the whole machine is really slow.
My Python script used to take something like 40 to 60 seconds to uncompress some zip files, and now it's taking 3 to 5 minutes.
If a multithreaded version of this script is running, it takes forever.
Any idea why this happened, or how to solve it?
Windows is running there.
When you "resized" the disk volume what you really did was create a new larger EBS volume from a snapshot of the old volume. The new EBS volume becomes available immediately but you have to go through an "initialization" process to get it to load all the data. The first time you access a particular block of data on the new volume it will be slow. Subsequent attempts to access that block of data will occur at the fast speed that you would expect. You can read more about this here.
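To pay that first-touch cost up front, the usual approach is to read every block of the new volume once (on Linux typically with dd or fio); the same idea in Python, with a purely hypothetical device path, looks like:

CHUNK = 1024 * 1024  # read 1 MiB at a time

# '/dev/nvme1n1' is a placeholder; use the raw device for your volume,
# and run with enough privileges to read it.
with open('/dev/nvme1n1', 'rb') as dev:
    while dev.read(CHUNK):
        pass  # touching each block once forces it to be loaded from the snapshot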
Using this answer, we are able to query all the USB devices connected at a precise moment.
I have a Python program running on Linux (Debian or Raspbian) that does a specific task, but I also want this program to listen for new USB devices being connected, and when that happens, trigger a specific action.
I'm thinking about spawning a new thread that does:
while True:
    list_USB_devices()  # using https://stackoverflow.com/a/8265634/1422096
    see_if_new_devices_in_this_list()
    time.sleep(2)  # wait 2 seconds
but I don't find this solution very elegant.
What's a cleaner solution to detect in the background of a Python program if a new USB device is connected?
Example application for my program: listen for a new USB-MIDI keyboard/device being connected, and if so, attach it with rtmidi-python. "Plug and play!"
Look into the gio library (part of glib). You can attach watches and connect callbacks when devices are created, so you don't have to poll at all. Set a watch on the devices directory, look for file creation, and filter out the files you're not interested in.
You can probably also look at the 'udev' system itself, and write a rule to execute something when a new USB device appears.
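Along the udev lines, here is a minimal sketch using the pyudev package, which subscribes to udev events over netlink instead of polling (the subsystem filter and the action taken are illustrative):

import time

import pyudev

context = pyudev.Context()
monitor = pyudev.Monitor.from_netlink(context)
monitor.filter_by(subsystem='usb')

def on_event(device):
    if device.action == 'add':
        print('USB device connected: %s' % device.device_path)
        # trigger your specific action here, e.g. attach a MIDI device

observer = pyudev.MonitorObserver(monitor, callback=on_event)
observer.start()  # the observer runs in a background thread

while True:  # the main program keeps doing its own task
    time.sleep(1)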
I want to set up Jenkins to
1) pull our source code from our repository,
2) compile and build it
3) run the tests on an embedded device
Steps 1 & 2 are quite easy and straightforward with Jenkins.
As for step 3:
we have hundreds of those devices, in various versions, and I'm looking for a utility (preferably in Python) that can handle the availability of hardware devices/resources,
in such a manner that one of the steps will be told which device is available and run the tests on it.
What I have found is that the best thing to do is have something like Jenkins (or, if you're using the enterprise option, ElectricCommander) manage a resource 'pool'. The pool is essentially virtual devices, but they have a property such that you can call into a Python script with either an IP address or a serial port and communicate with your devices.
I used this for automated embedded testing on radios. The Python script managed a whole host of tests, and Commander would go ahead and choose a single-step resource from the pool; that resource had an IP, which would be passed into the Python script. The test would then perform all the tests, and the stdout would get stored in Commander/Jenkins. I also set properties to track the pass/fail count as the test was executing.
// the main resource gets a single-step item from the pool; in the main resource I wrote a tiny script that asked whether the item pulled from the pool had resource name == "Bench1" .. "BenchX" etc.
basically:
if resource.name == "BENCH1":
    python myscript.py --com COM3 --baud 9600
...
etc.
The really great feature of doing it this way is that if you have to disconnect a device, you don't need to deliver script changes; you simply mark the Commander/Jenkins resource as disabled, and the main 'project' can still pull from what remains in your resource pool.
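To make the pool idea concrete, here is a minimal sketch in plain Python; the DevicePool class, the device fields, and myscript.py's flags are illustrative, not an existing API:

import subprocess
import threading

class DevicePool(object):
    def __init__(self, devices):
        self._devices = list(devices)
        self._lock = threading.Lock()

    def acquire(self):
        # Return an enabled, idle device, or None if all are busy.
        with self._lock:
            for dev in self._devices:
                if dev.get('enabled', True) and not dev.get('busy'):
                    dev['busy'] = True
                    return dev
        return None

    def release(self, dev):
        with self._lock:
            dev['busy'] = False

pool = DevicePool([{'name': 'BENCH1', 'port': 'COM3', 'enabled': True},
                   {'name': 'BENCH2', 'port': 'COM4', 'enabled': False}])  # disabled, as above

dev = pool.acquire()
if dev is not None:
    try:
        subprocess.call(['python', 'myscript.py', '--com', dev['port'], '--baud', '9600'])
    finally:
        pool.release(dev)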