I am trying to run a distributed application with the PyTorch distributed trainer. I thought I would first try the example they provide, found here. I set up two AWS EC2 instances and configured them according to the description in the link, but when I try to run the code I get two different errors. In the first terminal window for node0 I get the error message: RuntimeError: Address already in use
In the other three windows I get the same error message:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled system error
I followed the code in the link, and I also terminated the instances and redid the setup, but it didn't help.
This is using Python 3.6 with the nightly build and CUDA 9.0. I tried changing MASTER_ADDR to the IP of node0 on both nodes, as well as using the same MASTER_PORT (which is an available, unused port). However, I still get the same error message.
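For context, this is roughly how the environment-variable rendezvous in that example is wired up; a minimal sketch, assuming the NCCL backend, and with the private IP, port, helper name and world size all being placeholder values rather than the linked script itself:

import os
import torch.distributed as dist

def init_process(rank, world_size, backend="nccl"):
    # Every process on both nodes must see the same rendezvous address:
    # the private IP of node0 and a free port on it.
    os.environ["MASTER_ADDR"] = "172.31.0.10"   # hypothetical private IP of node0
    os.environ["MASTER_PORT"] = "29500"         # any free, unused port on node0
    dist.init_process_group(backend, rank=rank, world_size=world_size)

# e.g. on node0: init_process(rank=0, world_size=4)
#      on node1: init_process(rank=2, world_size=4)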
After getting this running, my goal is to adjust this StyleGAN implementation so that I can train it across multiple GPUs on two different nodes.
So after a lot of failed attempts I found out what the problem is. Note that this solution applies to AWS Deep Learning instances.
After creating the two instances I had to adjust the security group. Add two rules: the first rule should be ALL_TCP, with the source set to the private IPs of the leader. The second rule should be the same (ALL_TCP), but with the source set to the private IPs of the slave node.
Previously, I had the security rule set as Type: SSH, which only opens a single port (22). For some reason I was not able to use this port to let the nodes communicate. After changing these settings the code worked fine, and I was also able to run the example with the settings mentioned above.
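If you prefer to script the security group change instead of using the console, the same two rules can be added with boto3; a rough sketch, where the region, security group ID and private IPs are hypothetical values you would replace with your own:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # hypothetical region

def allow_all_tcp_from(group_id, peer_private_ip):
    # Equivalent of an ALL_TCP rule with the peer node's private IP as the source.
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "IpRanges": [{"CidrIp": f"{peer_private_ip}/32"}],
        }],
    )

# Hypothetical values: run once for each node's private IP.
allow_all_tcp_from("sg-0123456789abcdef0", "172.31.0.10")   # leader
allow_all_tcp_from("sg-0123456789abcdef0", "172.31.0.11")   # slave node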
I have a Deployment Manager script as follows:
cluster.py creates a Kubernetes cluster. When the script was run only for the cluster creation, it succeeded, so cluster.py itself has no issues creating a K8s cluster.
cluster.py also exposes outputs:
A small snippet of the cluster.py is as follows:
outputs.append({
    'name': 'v1endpoint',
    'value': type_name + type_suffix})
return {'resources': resources, 'outputs': outputs}
If I try to access the exposed output inside the dmnginxservice resource below as $(ref.dmcluster.v1endpoint), I get a "resource not found" error.
imports:
- path: cluster.py
- path: nodeport.py
resources:
- name: dmcluster
  type: cluster.py
  properties:
    zone: us-central1-a
- name: dmnginxservice
  type: nodeport.py
  properties:
    cluster: $(ref.dmcluster.v1endpoint)
    image: gcr.io/pr1/nginx:latest
    port: 342
    nodeport: 32123
ERROR: (gcloud.deployment-manager.deployments.create) Error in Operation [operation-1519960432614-566655da89a70-a2f917ad-69eab05a]: errors:
- code: CONDITION_NOT_MET
message: Referenced resource yaml%dmcluster could not be found. At resource
gke-cluster-dmnginxservice.
I tried to reproduce a similar implementation and I have been able to deploy it with no issues, using your very same syntax for the output.
I deployed 2 VMs and a new network. I will post my code; maybe you will find some useful hints concerning the outputs.
The first VM passes the name of the second VM as an output, and uses a reference to the network.
The second VM takes the name from a property that is populated from the output of the first VM.
The network, thanks to the references, is the first resource to be created.
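A trimmed-down sketch of the first template, to make the output/reference wiring concrete; the resource and property names (vm-1, network-1, secondVmName, zone) are illustrative placeholders, not the exact code I deployed:

# vm1.py - simplified Deployment Manager template (placeholder names)
def GenerateConfig(context):
    """First VM: references the network and exposes the second VM's name."""
    second_vm_name = context.env['deployment'] + '-vm-2'

    resources = [{
        'name': 'vm-1',
        'type': 'compute.v1.instance',
        'properties': {
            'zone': context.properties['zone'],
            # This reference is why the network gets created first.
            'networkInterfaces': [{
                'network': '$(ref.network-1.selfLink)',
            }],
            # machineType, disks, etc. omitted for brevity
        },
    }]

    outputs = [{
        'name': 'secondVmName',
        'value': second_vm_name,
    }]

    return {'resources': resources, 'outputs': outputs}

The top-level config then passes $(ref.vm-1.secondVmName) into the second VM's properties, exactly like your $(ref.dmcluster.v1endpoint).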
Keep in mind that:
This can get tricky because the order of creation for resources is important; you cannot add virtual machine instances to a network that does not exist, or attach non-existent persistent disks. Furthermore, by default, Deployment Manager creates all resources in parallel, so there is no guarantee that dependent resources are created in the correct order.
I will skip the parts that are the same. If you provide your code I could try to help you debug it, but from the error code it seems that Deployment Manager is not aware that the first resource has been created; from the information provided it is not clear why.
Moreover, if I were you I would try explicitly declaring that dmnginxservice depends on dmcluster, making use of the metadata. In this way you can double-check whether it is actually waiting for the first resource.
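If you emit the service resource from a Python template, the explicit dependency could look roughly like this (a sketch only, not your actual template; the same metadata block can equally be added to the dmnginxservice entry in your top-level config):

# Inside the template that emits the service resource (hypothetical sketch):
resources = [{
    'name': 'dmnginxservice',
    'type': 'nodeport.py',
    'metadata': {
        # Explicit dependency: DM will not create this resource
        # until dmcluster has been created successfully.
        'dependsOn': ['dmcluster'],
    },
    'properties': {
        'cluster': '$(ref.dmcluster.v1endpoint)',
        'image': 'gcr.io/pr1/nginx:latest',
        'port': 342,
        'nodeport': 32123,
    },
}]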
UPDATE
I have been able to reproduce the bug with a simpler configuration. Depending on how I reference the variables, the behaviour is different, and for some reason the property gets expanded to $(ref.yaml%vm-1.paolo). It seems that the combination of project and cluster references causes trouble.
# 'name': context.properties["debug"],                           # WORKING
# 'name': context.env["project"],                                # WORKING
'name': context.properties["debug"] + context.env["project"],    # NOT WORKING
You can check the configuration here, if you need it.
I'm trying to get a list of the virtual machines registered in my vCenter, by their name. The problem is that I have many VMs (~5K), and I am doing this a lot of times (on the order of 1000 requests per hour).
The SDKs I'm using cause a lot of traffic (1-2 MB/request):
pysphere: asks for all VMs and filters on the client side.
pyVmomi: needs to use recursion to list all VMs (I saw SI.content.searchIndex.FindByDnsName in reboot_vm.py, but my machines' DNS configuration is not correct).
Looking into the SOAP documentation didn't help (I got to RetrievePropertiesEx.objectSet, but it doesn't look like it filters anything), and the new REST API (v6.5) didn't help either (since I need to get the VM's "datastore path", and all I can get is the name).
Have you tried using a property collector, as in pyVmomi.vmodl.query.PropertyCollector.FilterSpec?
The pyVmomi Community Samples contain examples of using this API, such as
https://github.com/vmware/pyvmomi-community-samples/blob/master/samples/filter_vms.py.
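A rough sketch of that approach, assuming an already-authenticated ServiceInstance si (this follows the pattern of the linked sample, trimmed so that only the name property of each VM is requested from the server):

from pyVmomi import vim, vmodl

def list_vm_names(si):
    """Return the names of all VMs using a single PropertyCollector call."""
    content = si.RetrieveContent()

    # View over every VirtualMachine in the inventory.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)

    # Start the traversal at the container view.
    obj_spec = vmodl.query.PropertyCollector.ObjectSpec()
    obj_spec.obj = view
    obj_spec.skip = True

    traversal_spec = vmodl.query.PropertyCollector.TraversalSpec()
    traversal_spec.name = 'traverseEntities'
    traversal_spec.path = 'view'
    traversal_spec.skip = False
    traversal_spec.type = view.__class__
    obj_spec.selectSet = [traversal_spec]

    # Only ask the server for the 'name' property.
    prop_spec = vmodl.query.PropertyCollector.PropertySpec()
    prop_spec.type = vim.VirtualMachine
    prop_spec.pathSet = ['name']

    filter_spec = vmodl.query.PropertyCollector.FilterSpec()
    filter_spec.objectSet = [obj_spec]
    filter_spec.propSet = [prop_spec]

    result = content.propertyCollector.RetrieveContents([filter_spec])
    view.Destroy()
    return [prop.val for obj in result for prop in obj.propSet]

Because only the name property travels over the wire, the response should be far smaller than pulling whole VM objects.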
vAPI is based on REST; moreover, it makes network requests every time, so it will be slow. Another way is to use the VCDB, where all the VMs are stored (there is such a database, but I cannot help you much more here); see https://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.installclassic.doc_41/install/prep_db/c_preparing_the_vmware_infrastructure_databases.html or https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025914
We are using multiple DataStax Cassandra Cluster instances (6) to connect to Cassandra from Python. We pool these connections to perform some operations. Each operation is independent of the others.
It works fine for a small number of operations, but once I try to scale up I get the following errors:
NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.ption('Pool is shutdown',)})
and sometimes the following warning:
WARNING Heartbeat failed for connection (140414695068880) to 127.0.0.1
I tried changing some cluster object parameters but it did not help.
Following is the keyspace configuration in Cassandra that I am using:
'class': 'SimpleStrategy',
'replication_factor': '1'
I am using the latest versions of Cassandra and the DataStax Python driver. There is only one Cassandra node.
EDIT: More details:
The multiple cluster instances are in different processes (created using the Python multiprocessing module), one cluster instance per process. Let's call these processes Cassandra-Processes (CP). There are a bunch of other processes that do some computation and occasionally need to look up the Cassandra DB and write to it. The current design is that each of these processes is mapped to one CP, and all DB reads/writes the process needs are done via this mapped CP. What exactly is to be read/written is passed into a queue (again from the multiprocessing library), which the mapped CP reads.
We observe that this setup runs for quite some time, and then Cassandra suddenly begins erroring out.
It's unclear why you're using six cluster instances against a single Cassandra node. Generally, you should use one Cluster instance per application (per remote cluster). You can read about general design considerations for Cassandra drivers here.
If you're looking to "scale" with regards to throughput, you might consider using multiprocessing. I discuss this in a blog post here.
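A minimal sketch of that pattern, assuming a single local node and placeholder names (mykeyspace, mytable, id, value): each worker process creates its own Cluster/Session after the fork, and that one Cluster serves all work done inside the process.

import multiprocessing as mp
from cassandra.cluster import Cluster

_session = None  # one Session per worker process

def init_worker():
    # Create the Cluster/Session after the process is forked; driver
    # connections must not be shared across a fork.
    global _session
    cluster = Cluster(['127.0.0.1'])
    _session = cluster.connect('mykeyspace')   # hypothetical keyspace

def do_operation(key):
    row = _session.execute(
        "SELECT value FROM mytable WHERE id = %s", (key,)).one()
    return row.value if row else None

if __name__ == '__main__':
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(do_operation, range(100)))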
Follow-on:
Two things can be inferred from the information we have so far:
The application is pushing more concurrent requests than your connection pool is configured to handle. I say this because "Pool is shutdown" only occurs when a request is waiting for a connection/stream to become available. You can tune connection pooling to make more connections available initially using cluster settings. However, if your "cluster" (the server node) is overwhelmed, you won't gain much there.
Your connection is being shut down. This exception only happens when the node is suddenly marked down. In a single-node setup this is most likely because of a connection error. Look for clues in the server log, or in the driver debug log if you're capturing that.
We probably need to know more about your execution model to help more. Is it possible you're running unfettered async requests without occasionally waiting for them to complete?
Remote diagnosis is hard without knowing anything about your specific topology, setup, and system configuration. This, however, looks much like a configuration problem, or even an issue in the Python driver. If you google your error message you will find multiple topics on DataStax's Jira describing this or similar problems; I would check that the Python driver is up to date.
What would help in the first place is to see in detail what you are trying to do, how your cluster is configured, and so on.
I have a node in my Neo4j database, which I retrieve using the find_one() method of the py2neo interface.
profile = graph.find_one('Facebook','fb_id', fb_id)
profile['nb_friends'] = nb_friends # a list of posts
profile.push()
The above statement works fine when updating the local Neo4j database, but not when using a remote Neo4j server (nothing is changed).
However, if I run a raw Cypher query it works both locally and remotely.
graph.cypher.execute('MATCH (n:Facebook {fb_id:{ID}}) SET n.nb_friends = {FR} RETURN n',{'ID':fb_id,'FR':nb_friends})
Any idea why this happens and how it could be fixed?
Note: the only modification I've made to the server configuration is to disable authentication.
I would be surprised if the local/remote aspect is directly significant here. Py2neo does not know or care where the server is located and does not take a different code path for localhost.
I'd suggest making sure that you are using the same version on both servers, that your connection URIs both have the same path (which should end in a trailing slash) and that the data is similar on both.
You may have also hit this bug:
https://github.com/nigelsmall/py2neo/issues/395
Empty lists are not supported as property values and py2neo has a bug that makes push fail silently when this is attempted. An exception will be raised for this in the next release.
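If that is what's happening here, a simple workaround is to avoid assigning an empty list at all; a small sketch along the lines of the code in the question (nb_friends is the question's variable, the guard is the only addition):

profile = graph.find_one('Facebook', 'fb_id', fb_id)

if nb_friends:                      # skip empty lists: they are not valid property values
    profile['nb_friends'] = nb_friends
    profile.push()
else:
    # Either leave the property unset, or store a sentinel value the schema allows.
    pass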
I know this is an old question, but I was having a similar issue with py2neo 3 (graph.push() was failing silently). It turns out that I was using an old version of Neo4j (2.1.7) which I had installed by accident. Try downloading a newer version and trying again.
Bear with me. This is my first post...
The Tor project has recently introduced Stem as a loadable Python module. I've been playing around with it to see if it's a viable tool. My results have been mixed.
I'm trying to enable a hidden service configuration within the controller (which is supposed to act as though it came directly from the torrc file). It always fails on me. Here's a quick example of what I try:
#!/usr/bin/env python
from stem.control import Controller
controller = Controller.from_port(port = 9051)
controller.authenticate()
controller.set_options({'HIDDENSERVICEDIR':'/tmp/hiddenservice/','HIDDENSERVICEPORT':'1234 127.0.0.1:1234'})
...which returns an error:
InvalidRequest Traceback (most recent call last)
/home/user/my/folder/<ipython-input-5-3921e9b46181> in <module>()
/usr/local/lib/python2.7/dist-packages/stem/control.pyc in set_options(self, params, reset)
1618 raise stem.InvalidRequest(response.code, response.message)
1619 elif response.code in ("513", "553"):
-> 1620 raise stem.InvalidRequest(response.code, response.message)
1621 else:
1622 raise stem.ProtocolError("Returned unexpected status code: %s" % response.code)
InvalidRequest: Unacceptable option value: Failed to configure rendezvous options. See logs
...and the following in /var/log/tor/log:
Aug 1 10:10:05.000 [warn] HiddenServicePort with no preceding HiddenServiceDir directive
Aug 1 10:10:05.000 [warn] Controller gave us config lines that didn't validate: Failed to configure rendezvous options. See logs for details.
I've tried this with Stem's "set_options" as seen above, and in two separate commands with "set_conf". With "set_conf" I can set the HiddenServiceDir, but it still fails in the same way when setting the port, making me think I have a fundamental misunderstanding of Tor.
I checked my circuits and it doesn't seem to matter whether I have one with a hidden service rendezvous point; it keeps failing. I'd prefer to keep things pythonic, temporary and clean, and not have a hacked-up bash script that rewrites the torrc before restarting tor. (In a perfect world, I'd rather not write to a hidden service directory at all, but tor hasn't implemented that yet.)
I try to be as cross-platform as possible, but I'm running Linux with Tor 2.3.25...
So who has ideas of why Stem won't let me make a hidden service?
Thanks for pointing this out to me via our bug tracker. Answering this here. :)
The set_options() docs say...
The params can optionally be a list of key/value tuples, though the only reason this type of argument would be useful is for hidden service configuration (those options are order dependent).
The issue here is that Tor's hidden service options behave in a slightly different fashion from all the rest of its config options. Tor expects a 'HiddenServiceDir' followed by the properties associated with that hidden service (it's order dependent). This is because a single tor instance can provide multiple hidden services.
Please change your call from...
controller.set_options({'HIDDENSERVICEDIR':'/tmp/hiddenservice/','HIDDENSERVICEPORT':'1234 127.0.0.1:1234'})
... to be a list of tuples instead...
controller.set_options([('HiddenServiceDir', '/tmp/hiddenservice/'), ('HiddenServicePort', '1234 127.0.0.1:1234')])
Hope this helps! -Damian