How to copy simultaneously from multiple nodes using Fabric? - python

I have just started using Fabric and it looks like a very useful tool. I am able to write a tiny script to run some commands in parallel on my Amazon EC2 hosts, something like this:

from fabric.api import parallel, sudo

@parallel
def runs_in_parallel():
    sudo("sudo rm -rf /usr/lib/jvm/j2sdk1.6-oracle")
Also, I have written another script to copy all the Hadoop logs from all EC2 nodes to my local machine. This script creates a folder named with a timestamp; within that, one folder per node, named with the node's IP address; and then copies that node's logs into its IP-named folder. E.g.:
2014-04-22-15-52-55
    50.17.94.170
        hadoop-logs
    54.204.157.86
        hadoop-logs
    54.205.86.22
        hadoop-logs
Now I want to do this copy task using Fabric so that I can copy the logs in parallel, to save time. I thought I could easily do it the way I did in my first code snippet, but that won't help, since that runs commands on the remote server. I have no clue as of now how to do this. Any help is much appreciated.

You could likely use the get() command to handle pulling down files. You'd want to make them into tarballs, and have them pulled into unique filenames on your client to keep the parallel gets from clobbering one another.
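For example, here is a minimal sketch of that approach with Fabric 1.x, using the example IPs from the question and assuming the logs live under /var/log/hadoop on each node (adjust the remote paths to your setup):

from fabric.api import env, get, parallel, run

env.hosts = ['50.17.94.170', '54.204.157.86', '54.205.86.22']

@parallel
def fetch_logs():
    # Pack the logs into a tarball on the remote side first.
    run("tar czf /tmp/hadoop-logs.tar.gz -C /var/log/hadoop .")
    # Fabric expands %(host)s in the local path, so each node's tarball
    # lands in a uniquely named local file and the parallel gets don't
    # clobber each other.
    get("/tmp/hadoop-logs.tar.gz", "./%(host)s-hadoop-logs.tar.gz")

Each tarball can then be unpacked locally into a per-host folder, mirroring the layout you already create.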

Related

What's the optimal way to store image data temporarily in a containerized website?

I'm currently working on a website where I want the user to upload one or more images; my Flask backend will make some changes to these pictures and then return them to the front end.
Where should I save these images temporarily, especially if there is more than one user on the website at the same time (I'm planning on containerizing the website)? Is it safe to save the images in the website's folder, or do I need e.g. a database for that?
You should use a database, or external object storage like Amazon S3.
I say this for a couple of reasons:
Accidents do happen. Say the client does an HTTP POST, gets a URL back, and does an HTTP GET to retrieve the result. But in the meantime, the container restarts (because the system crashed; your cloud instance got terminated; you restarted the container to upgrade its image; the application failed); the container-temporary filesystem will get lost.
A worker can run in a separate container. It's very reasonable to structure this application as a front-end Web server, that pushes messages into a job queue, and then a back-end worker picks up messages out of that queue to process the images. The main server and the worker will have separate container-local filesystems.
You might want to scale up parts of this. You can easily run multiple containers from the same image; they'll each have separate container-local filesystems, and you won't directly control which replica a request goes to, so every container needs access to the same underlying storage.
...and it might not be on the same host. In particular, cluster technologies like Kubernetes or Docker Swarm make it reasonably straightforward to run container-based applications spread across multiple systems; sharing files between hosts isn't straightforward, even in these environments. (Most of the Kubernetes Volume types that are easy to get aren't usable across multiple hosts, unless you set up a separate NFS server.)
That set of constraints would imply trying to avoid even named volumes as much as you can. It makes sense to use volumes for the underlying storage for your database, and it can make sense to use Docker bind mounts to inject configuration files or get log files out, but ideally your container doesn't really use its local filesystem at all and doesn't care how many copies of itself are running.
(Do not rely on Docker's behavior of populating a named volume on first use. There are three big problems with it: it is on first use only, so if you update the underlying image, the volume won't get updated; it only works with Docker named volumes and not other options like bind-mounts; and it only works in Docker proper and not in Kubernetes.)
Other decisions are possible given other sets of constraints. If you're absolutely sure you will never ever want to run this application spread across multiple nodes, Docker volumes or bind mounts might make sense. I'd still avoid the container-temporary filesystem.
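As a rough illustration of the object-storage route, here is a minimal sketch of a Flask upload handler that streams the image straight to S3 with boto3; the bucket name and key scheme are placeholders, and AWS credentials are assumed to be configured in the environment:

import uuid

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-image-staging-bucket"  # hypothetical bucket name

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["image"]
    # A unique key per upload keeps concurrent users from colliding.
    key = "uploads/%s-%s" % (uuid.uuid4().hex, f.filename)
    # Stream the file object straight to S3; nothing is written to the
    # container's local filesystem.
    s3.upload_fileobj(f, BUCKET, key)
    return jsonify(key=key), 201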

Best way to copy multiple files to multiple remote servers

I'm developing a service which has to copy multiple files from a central node to remote servers.
The problem is that each time the service is executed there are new servers and new files to dispatch to them. That is, in each execution I know which files have to be copied to each server and into which directory.
Obviously, this information changes dynamically, so I would like to automate the task. I tried to come up with a solution using Ansible, FTP and SCP from Python.
With Ansible, I think it would be very difficult to automate every scp task for each execution.
SCP is OK, but I would need to build each scp command in Python in order to launch it.
FTP is overkill for this problem because there are not many files to dispatch to a single server.
Is there any better solution than the ones I have thought about?
In case you send the same file (or files) to different destinations (which can be organized as sets), you could benefit from tools such as dsh or parallel-scp.
Whether this makes sense depends on your use case.
From the Parallel-SSH documentation:
from __future__ import print_function
from pssh.pssh_client import ParallelSSHClient

hosts = ['myhost1', 'myhost2']
client = ParallelSSHClient(hosts)
output = client.run_command('uname')
for host, host_output in output.items():
    for line in host_output.stdout:
        print(line)
How about rsync?
Example:
rsync -rave ssh <source> <destination>
where source or destination could be user@IP:/path/to/dest.
It does incremental transfers, and you can cron it or trigger a small script whenever anything changes.
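If you end up driving scp/rsync from Python anyway, a minimal sketch along these lines may be enough; the dispatch dictionary, user name and paths below are hypothetical, and passwordless SSH keys are assumed:

import subprocess

# Hypothetical mapping produced for this execution:
# server -> list of (local_file, remote_directory) pairs.
dispatch = {
    "server1.example.com": [("build/app.tar.gz", "/opt/app/")],
    "server2.example.com": [("conf/nginx.conf", "/etc/nginx/")],
}

procs = []
for server, transfers in dispatch.items():
    for local_file, remote_dir in transfers:
        cmd = ["rsync", "-av", "-e", "ssh", local_file,
               "deploy@%s:%s" % (server, remote_dir)]
        # One rsync per transfer, launched concurrently.
        procs.append(subprocess.Popen(cmd))

for proc in procs:
    proc.wait()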

How to use the get method with multiple hosts in fabric v2?

I have a fabfile that runs several tasks on the hosts. This results in the creation of a file result.txt on each of the hosts.
Now I want to get all these files locally. This is what I tried:
from invoke import task

@task
def getresult(ctx):
    ctx.get('result.txt')
I run with:
fab -H host1,host2,host3 getresult
In the end, I have only one file result.txt in my local machine (it seems to be the copy from the last host of the command line). I would like to get all the files.
Is there a way to do this with fabric v2? I did not find anything in the documentation. It seems that this was possible in fabric v1, but I am not sure for the v2.
In Fabric v2, the get API signature is:
get(remote, local=None, preserve_mode=True)
So when you don't specify the name with which it has to be stored locally, it uses the same name it has in the remote location. Hence it is being overwritten for each host it executes on, and you're ending up with the last one.
One way to fix this would be to mention the local file name and add a random suffix or prefix to it. Something like this.
from invoke import task
import random

@task
def getresult(ctx):
    ctx.get('result.txt', 'result%s.txt' % int(random.random() * 100))
This way, each time it executes, it stores the file with a unique name.
You can even add the host name to the file, if you can find a way to use it inside the method.
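One way to get at the host name inside the task: in Fabric v2 the Connection object passed to the task exposes the host it is bound to, so (as a minimal sketch) you can fold it into the local file name:

from fabric import task

@task
def getresult(c):
    # Connection.host is the host this task is currently running against,
    # so each download gets its own local file name.
    c.get('result.txt', 'result-%s.txt' % c.host)

Invoked as fab -H host1,host2,host3 getresult, this should leave one result-<host>.txt per host.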

AWS ETL with python scripts

I am trying to create a basic ETL on the AWS platform, using Python.
In an S3 bucket (let's call it "A") I have lots of raw log files, gzipped.
What I would like to do is have them periodically (via Data Pipeline) unzipped and processed by a Python script that reformats the structure of every line, and have the output written to another S3 bucket ("B"), preferably as gzips corresponding to the same gzips they originated from in A, but that's not mandatory.
I wrote the Python script, which does what it needs to do: it receives each line from stdin and writes to stdout (or to stderr if a line isn't valid; in that case I'd like the line to be written to another bucket, "C").
I was fiddling around with Data Pipeline and tried to run a shell command job and also a Hive job to drive the Python script.
The EMR cluster was created, ran and finished with no failures or errors, but also no logs were created, and I can't understand what is wrong.
In addition, I'd like the original logs to be removed after they are processed and written to the destination (or erroneous-logs) bucket.
Does anyone have experience with such a configuration? Any words of advice?
The first thing you want to do is turn 'termination protection' on for the EMR cluster as soon as it is launched by Data Pipeline (this can be scripted too).
Then you can log on to the 'Master instance'. This is under the 'Hardware' pane in the EMR cluster details (you can also search the EC2 console by cluster id).
You also have to define a 'key' so that you can SSH to the master.
Once you log on to the master, you can look under /mnt/var/log/hadoop/steps/ for logs, or under /mnt/var/lib/hadoop/.. for actual artifacts. You can browse HDFS using the HDFS utilities.
The logs (if they are written to stdout or stderr) are already moved to S3. If you want to move additional files, you have to write a script and run it using 'script-runner'. You can copy a large number of files using 's3distcp'.
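For reference, here is a minimal sketch of the kind of per-line filter the question describes, reading from stdin and separating valid output (stdout) from invalid lines (stderr); reformat_line() is a hypothetical placeholder for the real reformatting logic:

import sys

def reformat_line(line):
    # Hypothetical placeholder: require a tab-separated record and
    # re-emit it comma-separated.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("malformed record")
    return ",".join(fields)

for line in sys.stdin:
    try:
        sys.stdout.write(reformat_line(line) + "\n")
    except ValueError:
        # Invalid lines go to stderr, which the step/pipeline can route
        # towards the "C" bucket.
        sys.stderr.write(line)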

Python: Verifying NFS authentication

Folks,
I believe there are two questions I have: one python specific and the other NFS.
The basic point is that my program gets the 'username', 'uid', NFS server IP and exported_path as input from the user. It then has to verify that the NFS exported path is readable/writable by this user/uid.
My program is running as root on the local machine. The straightforward approach is to 'useradd' a user with the given username and uid, mount the NFS exported path (mount runs as root) on some temporary mount point, and then execute 'su username -c touch /mnt_pt/tempfile'. If the username and uid input are correct (and the NFS server is set up correctly), this touch will succeed in creating tempfile in the remote NFS directory. That is the goal.
Now the two questions are:
(i) Is there a simpler way to do this than creating a new unix user, mounting and touching a file to verify the NFS permissions?
(ii) If this is what needs to be done, I wonder if there are any Python modules/packages that would help me execute 'useradd'- and 'userdel'-related commands? I currently intend to use the respective binaries (/usr/sbin/useradd etc.) and invoke them via subprocess.Popen, capturing the output.
Thank you for any insight.
(i) You could do something more arcane, but short of actually touching the file you probably aren't going to be testing exactly what you need to test, so I think I'd probably do it the way you suggest.
(ii) You might want to check out the Python pwd module if you want to verify user existence or the like, but you'll probably need to leverage the useradd/userdel programs themselves to do the dirty work.
You might want to consider leveraging sudo for your program so the entire thing doesn't have to run as root, it seems like a pretty risky proposition.
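As a minimal sketch of (ii), assuming the program runs as root (or via sudo): the pwd module checks whether the account already exists, and the useradd/userdel binaries do the dirty work through subprocess:

import pwd
import subprocess

def ensure_user(username, uid):
    """Create the user with the requested uid unless it already exists."""
    try:
        pwd.getpwnam(username)  # raises KeyError if the user is absent
    except KeyError:
        subprocess.check_call(["/usr/sbin/useradd", "-u", str(uid), username])

def remove_user(username):
    subprocess.check_call(["/usr/sbin/userdel", username])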
There is a Python suite for testing NFS server functionality:
git://git.linux-nfs.org/projects/bfields/pynfs.git
While it's written for NFSv4, you can adapt it for v3 as well.
