I have a simple two-node cluster (master on one node, workers on both). I tried using:
python disco/util/distrfiles.py bigtxt /etc/nodes > bigtxt.chunks
to distribute the files, which worked fine.
I expected this to mean that the spawned processes would only operate on local data, but at times they seem to be trying to access data on the other machine.
Instead, I copied the data directory in full. Everything worked fine until the reduce phase, when I received the error:
CommError: Unable to access resource (http://host:8989/host/8b/sup#4f6:d2f6:34b3b/map-index.txt):
It seems like the resource is expected to be accessed directly via HTTP, but I don't think this is happening correctly. Are files supposed to be passed back and forth over HTTP? Do I need a distributed filesystem for multi-node MapReduce?
I am just trying to follow the MultiWorkerMirroredStrategy example in the TensorFlow docs.
I succeeded in training on localhost, which is a single node. However, training fails on the cluster, which has two nodes.
I have tried disabling the firewall, but that didn't solve the problem.
Here is main.py. (I run the same code on node 1 and node 2, except for the tf_config variable: on node 1 I set tf_config['task']['index'] = 0, and on node 2 I set tf_config['task']['index'] = 1.)
main.py
Any help appreciated. Thanks.
I see that you don't have an error message to share, but since your code should work, I think I can infer where the issue could be arising. I will test on my Kubernetes cluster once I get a chance (I have a node down at the moment).
The most likely issue: you are using json.dumps() to set the environment variable. In most setups you should instead be reading it back with:
tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
task_index = tf_config['task']['index']
That should clear up any issues with exposed ports and IP configuration.
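For concreteness, a minimal sketch of that pattern (not your actual main.py; the print is just for illustration), assuming the cluster itself sets TF_CONFIG on each node:
import json
import os
import tensorflow as tf

# Read back the TF_CONFIG the cluster set for this node, instead of
# hard-coding a per-node value with json.dumps().
tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
task_index = tf_config.get('task', {}).get('index')
print('This replica sees task index:', task_index)

# MultiWorkerMirroredStrategy reads TF_CONFIG itself when it is constructed.
strategy = tf.distribute.MultiWorkerMirroredStrategy()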
-It sounds like the method you are using is in a notebook, since you are not running the same code for main.py: in one main.py you set the index to 1 and in the other to 0. Either way, that is not what you want here. You are setting the index to 1 and 0, but you are not getting back only the index; you are writing out the full cluster spec with whatever index you set it to. If the environment variable is not set by your cluster, you will need to get back the TF_CONFIG that was set for that node and use json.loads to turn it into your tf_config; then you will be getting ONLY the replica index for that node.
If you are using a notebook, it needs to be connected to the cluster environment; otherwise you are setting a local environment variable on your machine, not on the containers in the cluster. Consider using Kubeflow to manage this.
You can either launch from the notebook after setting up your cluster configuration op, or build a TFJob spec as YAML that defines the node specs and then launch the pods from that spec.
Either way, the cluster needs to actually have that configuration. You should be able to load the environment in the cluster so that each node is ASSIGNED an index, and you then read back the index from THAT node's replica ID, which you set when you launched the nodes and specified with a YAML or JSON dictionary. A locally set environment variable inside the local container means nothing to the actual cluster if it does not match the replica index on Kubernetes, which is assigned when the pod is launched.
-Try making a function that returns each worker's index, to test whether it matches the replica index shown on your Kubernetes dashboard or by kubectl. Make sure the function prints it out so you can see it in the pod logs; this will help with debugging (see the sketch after this list).
-Look at the pod logs and check whether the pods are connecting to the server and are using a communication implementation compatible with your cluster (gRPC, etc.). You are not setting a communication strategy explicitly, but in most cases it should be found automatically (just check in case).
-If you are able to launch pods, make sure you are terminating them before trying again. Also, Kubeflow is going to make things much easier for you once you get the hang of its Python pipeline SDK: you can launch functions as containers, and you can build an op that cleans up by terminating old pods.
-You should consider having your main.py and any other supporting modules built into an image in a repository such as Docker Hub, so that the containers can pull the image. With the multi-worker strategy, each machine needs to have the same data for it to be sharded properly. Again, check your pod logs to see whether it is failing to shard the data.
-Are you running on a local machine with different GPUs? If so, you should be using MirroredStrategy, NOT MultiWorkerMirroredStrategy.
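Here is the kind of index-printing helper suggested above (a sketch; the function name is mine), which you could drop into main.py so the index shows up in the pod logs:
import json
import os

def report_task_index():
    # Print the replica index this container actually sees, so it can be
    # compared against the replica index on the Kubernetes dashboard / kubectl.
    tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
    task = tf_config.get('task', {})
    print('TF_CONFIG task type:', task.get('type'), 'index:', task.get('index'), flush=True)
    return task.get('index')

report_task_index()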
I know the question title is weird!
I have two virtual machines. The first one has limited resources, while the second one has plenty of resources, like a normal machine. The first machine will receive a signal from an external device. This signal triggers a Python interpreter to execute a script. The script is big, and the first machine does not have enough resources to execute it.
I can copy the script to the second machine and run it there, but I can't make the second machine receive the external signal. Is there a way to make the interpreter on the first machine, once the external signal is received, call the interpreter on the second machine so that the second machine executes the script using its own resources? Please check the attached image.
Assume that the connection is established between the two machines, they can see each other, and the second machine has a copy of the script. I just need the commands that pass the execution to the second machine and make it use its own resources.
You should look into a microservice architecture to do this.
You can achieve this either by using Flask and sending HTTP requests between the machines, or with something like Nameko, which will allow you to create a "bridge" between machines and call functions across them (which seems to be what you are more interested in). Example for Nameko:
Machine 2 (executor of resource-intensive script):
from nameko.rpc import rpc

class Stuff(object):
    name = "Stuff"  # service name used by the RPC proxy below

    @rpc
    def example(self):
        return "Function running on Machine 2."
You would run the above service with the Nameko command-line runner (nameko run), as detailed in the docs.
Machine 1:
from nameko.standalone.rpc import ClusterRpcProxy

# This is the AMQP broker that machine 2 would be using.
config = {
    'AMQP_URI': "pyamqp://guest:guest@localhost"  # point this at machine 2's broker
}

with ClusterRpcProxy(config) as cluster_rpc:
    cluster_rpc.Stuff.example()  # "Function running on Machine 2."
More info is in the Nameko documentation.
Hmm, there are many approaches to this problem.
If you want a Python-only solution, you can check out dispy (http://dispy.sourceforge.net/) or Dask (https://dask.org/).
If you want a robust solution (what I use on my home computing cluster, but in my opinion overkill for your problem), you can use SLURM. SLURM is basically a way to string multiple computers together into a "supercomputer": https://slurm.schedmd.com/documentation.html
For a semi-quick, hacky solution, you can write a microservice. Essentially, your "weak" computer receives the signal and then sends an HTTP request to your "strong" computer. The strong computer contains the actual program, computes the result, and passes it back to the "weak" computer.
Flask is an easy and lightweight solution for this.
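A rough sketch of that Flask approach (endpoint name, host name, and port are placeholders I picked for illustration, not anything your setup requires):
# strong.py - runs on the "strong" computer and wraps the heavy script.
from flask import Flask, jsonify

app = Flask(__name__)

def heavy_computation():
    # call into your big script here
    return "done"

@app.route("/run", methods=["POST"])
def run():
    return jsonify({"result": heavy_computation()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

# weak.py - runs on the "weak" computer when the external signal arrives.
import requests

response = requests.post("http://strong-machine:5000/run")
print(response.json()["result"])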
All of these solutions require some kind of networking. At a minimum, the computers need to be on the same LAN or both be reachable over the internet.
There are many other approaches not mentioned here. For example, you could export an NFS (network file system) share, have one computer put a file in the shared folder, and have the other computer perform the work on that file. I'm sure there are plenty of other contrived ways to accomplish this task :). I'd be happy to expand on a particular method if you want.
This is similar to a few questions on the internet, but this code works for a while instead of returning an error instantly, which suggests to me that it is maybe not just a hostfile error?
I am running code that spawns multiple MPI processes and then enters a loop, within which it sends some data with bcast and scatter and then gathers data back from those processes. This runs the algorithm and saves the data. It then disconnects from the spawned communicator and creates another set of spawns on the next loop iteration. This works for a few minutes, then after around 300 files it spits this out:
[T7810:10898] [[50329,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 758
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error.
More information may be available above.
I am testing this on a local machine (single node); the end deployment will have multiple nodes that each spawn their own MPI processes within that node. I am trying to figure out whether this is an issue with testing everything on my local machine that will go away on the HPC system, or a more serious error.
How can I debug this? Is there a way to print out what MPI is trying to do as it runs, or to monitor MPI, such as a verbose mode?
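For reference, here is a stripped-down sketch of the loop structure described above (file names and process counts are placeholders; the real code does more work per iteration):
# driver.py - repeatedly spawn workers, exchange data, then disconnect.
import sys
from mpi4py import MPI

for i in range(1000):
    comm = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'], maxprocs=4)
    comm.bcast({'iteration': i}, root=MPI.ROOT)   # send work to the spawned group
    results = comm.gather(None, root=MPI.ROOT)    # collect their results
    comm.Disconnect()                             # drop the spawned communicator

# worker.py (placeholder) - each spawned process talks back to the parent.
from mpi4py import MPI

parent = MPI.Comm.Get_parent()
data = parent.bcast(None, root=0)
parent.gather({'rank': parent.Get_rank(), 'seen': data}, root=0)
parent.Disconnect()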
Since mpi4py is so close to MPI (logically, if not in terms of lines of code), one way to debug this is to write the C version of your program and see if the problem persists. When you report this bug to Open MPI, they are going to want a small C test case anyway.
I'm working on a Python project that requires some file transfers. One side of the connection is highly available (RHEL 6) and always online, but the other side (Windows 7) goes on and off, and the connection window is not guaranteed. Files are transferred in both directions, with sizes between 10 MB and 2 GB.
Is it possible to resume a file transfer with Paramiko instead of transferring the entire file from the beginning?
I would like to use rsync, but one side is Windows, and I would like to avoid cwRsync and DeltaCopy.
Paramiko doesn't offer an out-of-the-box 'resume' function. However, Syncrify, DeltaCopy's big successor, has retry built in, and if the backup connection goes down the server waits up to six hours for a reconnect. It is pretty trusty, easy to use, and does data diffing by default.
paramiko.sftp_client.SFTPClient provides an open method, which works much like Python's built-in open function.
You can use this to open both a local and remote file, and manually transfer data from one to the other, all the while recording how much data has been transferred. When the connection is interrupted, you should be able to pick up right where you left off (assuming that neither file has been changed by a 3rd party) by using the seek method.
Keep in mind that a naive implementation of this is likely to be slower than paramiko's get and put functions.
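A rough sketch of that approach for resuming a download (host, credentials, and paths are placeholders; the upload direction is the mirror image):
import os
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('rhel-host', username='user', password='secret')  # placeholder credentials
sftp = ssh.open_sftp()

remote_path = '/data/big.bin'  # placeholder paths
local_path = 'big.bin'

# Resume from however much we already have locally.
offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0

remote = sftp.open(remote_path, 'rb')
remote.seek(offset)
with open(local_path, 'ab') as local:
    while True:
        chunk = remote.read(32768)
        if not chunk:
            break
        local.write(chunk)
remote.close()

sftp.close()
ssh.close()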
Firstly, I am new to Python.
Now my question goes like this: I have a callback script running on a remote machine which sends some data and runs a script on my local machine, which processes that data and writes it to a file. Another local script of mine then needs to process the file's entries one by one and delete them from the file once done.
The problem is that the file may be updated continuously. How do I synchronize the work so that it doesn't mess up my file?
Also, please suggest whether the same work can be done in some better way.
I would suggest you look into named pipes or sockets, which seem better suited to your purpose than a file, if it's really just between those two applications and you have control over the source code of both.
For example, on Unix, you could create a pipe like this (see os.mkfifo):
import os
os.mkfifo("/some/unique/path")
And then access it like a file:
dest = open("/some/unique/path", "w") # on the sending side
src = open("/some/unique/path", "r") # on the reading side
The data will be queued between your processes. It's really first-in, first-out, but it behaves like a file (mostly).
If you cannot go for named pipes like this, I'd suggest using IP sockets over localhost from the socket module, preferably DGRAM sockets, as you do not need to do any connection handling there. You seem to know how to do networking already.
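A minimal sketch of the DGRAM variant over localhost (the port number is an arbitrary choice):
# receiver.py - the script that processes the data
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(('127.0.0.1', 50007))
while True:
    payload, _ = receiver.recvfrom(65535)
    print('received:', payload)  # process one unit of work here

# sender.py - the callback-driven script, running as its own process
import socket

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b'one unit of work', ('127.0.0.1', 50007))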
I would suggest using a database whose transactions allow for concurrent processing.
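For example, a sketch with the standard-library sqlite3 module (the table, file, and handler names are mine): the receiving script inserts rows as data arrives, and the processing script consumes and deletes them inside a transaction, so the two never trample each other's writes.
import sqlite3

conn = sqlite3.connect('workqueue.db', timeout=30)
conn.execute('CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, payload TEXT)')
conn.commit()

def handle(payload):
    print('processing', payload)  # placeholder for your real processing

# Writer script: add a row for each piece of incoming data.
conn.execute('INSERT INTO items (payload) VALUES (?)', ('some data',))
conn.commit()

# Reader script: process the rows, then delete them in the same transaction.
rows = conn.execute('SELECT id, payload FROM items').fetchall()
for row_id, payload in rows:
    handle(payload)
    conn.execute('DELETE FROM items WHERE id = ?', (row_id,))
conn.commit()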