I have a DAG that imports data from a source to a server. From there, I am looking to download that file from the server to the Windows network. I would like to keep this part in Airflow for automation purposes. Does anyone know how to do this in Airflow? I am not sure whether to use the os package, the shutil package, or maybe there is a different approach.
I think you're saying you're looking for a way to get files from a cloud server to a Windows shared drive or onto a computer on the Windows network. These are some options I've seen used:
Use a service like Google Drive, Dropbox, Box, or S3 to simulate a synced folder between the cloud machine and a machine on the Windows network.
Call a bash command to SCP the files to the windows server or a worker in the network. This could work in the opposite direction too.
Add the files to a git repository and have a worker in the windows network sync the repository to a shared location. This option is only good in very specific cases. It has the benefit that you can track changes and restore old states (if the data is in CSV or another text format), but it's not great for large files or binary files.
Use rsync to transfer the files to a worker on the Windows network that has the shared location mounted, then move the files into the synced directory with Python or bash.
Mount the network drive to the server and use Python or bash to move the files there.
All of these should be possible with Airflow, either by using Python (shutil) or a bash script to move the files into the right directory for some other process to pick up, or by calling a bash sub-process to perform the direct transfer via scp or commit the data via git. You will have to find out what's possible with your firewall and network settings. Some of these require coordinating tasks on the Windows side (the git option, for example, would need some kind of cron job or Task Scheduler task to pull the repository and keep the files up to date).
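As a rough illustration of the mounted-drive approach (the last option above), a minimal Airflow sketch might look like the following. The mount point, file paths, DAG name, and schedule are placeholders, and the imports assume Airflow 2-style packages; adjust for your version.

# A minimal sketch, assuming the Windows share is already mounted at /mnt/windows_share
# on the Airflow worker; every path and name here is a placeholder.
import shutil
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def copy_file_to_share():
    # Copy the exported file into the mounted share for the Windows side to pick up.
    shutil.copy("/data/exports/output.csv", "/mnt/windows_share/incoming/output.csv")

with DAG(
    dag_id="copy_to_windows_share",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="copy_file_to_share",
        python_callable=copy_file_to_share,
    )

The same idea works with a BashOperator calling scp or rsync instead of shutil, if a direct transfer suits your network better.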
Any ideas how I can automatically send some files (mainly TensorFlow models) after training in Google AI Platform to another compute instance or my local machine? I would like to run something like this in my trainer, for instance: os.system('scp -r ./file1 user@host:/path/to/folder'). Of course I don't need to use scp; it's just an example. Is there such a possibility in Google? There is no problem transferring files from the job to Google Cloud Storage, e.g. os.system('gsutil cp ./example_file gs://my_bucket/path/'). However, when I try, for example, os.system('gcloud compute scp ./example_file my_instance:/path/') to transfer data from my AI Platform job to another instance, I get "Your platform does not support SSH". Any ideas how I can do this?
UPDATE
Maybe there is a way to automatically download all the files from a chosen Google Cloud Storage folder? I would, for instance, upload data from my job instance to that Google Cloud Storage folder, and my other instance would automatically detect changes and download all the new files.
UPDATE 2
I found gsutil rsync, but I am not sure whether it can run constantly in the background. At this point the only solution that comes to my mind is to use a cron job and run gsutil rsync every 10 minutes, for example. But that doesn't seem like an optimal solution. Maybe there is a built-in tool or another, better idea?
The rsync command makes the contents under the destination the same as the contents under the source, by copying any missing files/objects (and those whose data has changed), and, if the -d option is specified, deleting any extra files/objects. The source must specify a directory, bucket, or bucket subdirectory. However, the command does not run in the background.
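As a rough sketch of the cron-style approach from the question, you could also just wrap gsutil rsync in a small Python loop on the receiving instance. The bucket name, local directory, and interval below are placeholders:

# Rough sketch: poll a Google Cloud Storage folder with gsutil rsync every 10 minutes.
# The bucket name and local directory are placeholders.
import subprocess
import time

while True:
    # -r recurses into subdirectories; add -d only if deletions should be mirrored too.
    subprocess.run(
        ["gsutil", "rsync", "-r", "gs://my_bucket/path/", "/home/user/outputs/"],
        check=False,
    )
    time.sleep(600)  # wait 10 minutes between syncs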
Remember that the Notebook you're using is in fact a VM running JupyterLab. Based on that, you could run the rsync command once TensorFlow has finished creating the files and sync them with a directory on another instance by trying something like:
import os
# Push the finished output to the other instance over SSH with rsync.
os.system("rsync -avrz Tensorflow/outputs/filename root@ip:Tensorflow/outputs/file")
I suggest you take a look at the rsync documentation to see all the options available for that command.
What is the best method to grab files from a Windows shared folder on the same network?
Typically, I am extracting data from SFTPs, SalesForce, or database tables, but there are a few cases where end-users need to upload a file to a shared folder that I have to retrieve. My process up to now has been to have a script running on a Windows machine which just grabs any new/changed files and loads them to an SFTP, but that is not ideal. I can't monitor it in my Airflow UI, I need to change my password on that machine physically, mapped network drives seem to break, etc.
Is there a better method? I'd rather the ETL server handle all of this stuff.
Airflow is installed on remote Linux server (same network)
Windows folders are just standard UNC paths where people have access based on their NT ID. These users are saving files which I need to retrieve. The users are non-technical and did not want WinSCP installed so the data could be shared through an SFTP instead, or even a SharePoint site (where I could use Shareplum, I think).
I would like to avoid mounting these folders and instead use Python scripts to simply copy the files I need as per an Airflow schedule
Best if I can save my NT ID and password within an Airflow connection to access it with a conn_id
If I'm understanding the question correctly, you have a shared folder mounted on your local machine — not the Windows server where your Airflow install is running. Is it possible to access the shared folder on the server instead?
I think a file sensor would work for your use case.
If you could auto-sync the shared folder to a cloud file store like S3, then you could use the standard S3KeySensor and S3PrefixSensor that are commonly used. I think this would simplify your solution, as you wouldn't have to be concerned with whether the machine(s) the tasks run on have access to the folder (a minimal sensor sketch follows the links below).
Here are two examples of software that syncs a local folder on Windows to S3. Note that I haven't used either of them personally.
https://www.cloudberrylab.com/blog/how-to-sync-local-folder-with-amazon-s3-bucket-with-cloudberry-s3-explorer/
https://s3browser.com/amazon-s3-folder-sync.aspx
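If you go the S3 route, a minimal sensor sketch could look like the following. The bucket, key pattern, connection ID, and DAG settings are placeholders, and the S3KeySensor import path varies between Airflow/provider versions:

# Minimal sketch: wait for a file synced from the shared folder to appear in S3.
# The bucket, key pattern, and connection IDs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="wait_for_shared_folder_upload",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-synced-bucket",
        bucket_key="incoming/*.csv",
        wildcard_match=True,
        aws_conn_id="aws_default",
        poke_interval=300,
        timeout=60 * 60,
    )

A downstream task could then pull the matched key with S3Hook and load it wherever the ETL needs it.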
That said, I do think using FTPHook.retrieve_file is a reasonable solution if you can't have your files in cloud storage.
I do not have access to the admin account in Windows 7. Is there a way to install RabbitMQ and its required Erlang without admin privileges? In some portable way?
I need to use it in my Python Celery project.
Thanks!
It is possible. Here's how I've done it:
You need to create a portable Erlang and acquire RabbitMQ server files.
You can install regular Erlang on another computer, then copy the whole installation directory to the computer with the limited account. You can use local documents or AppData, e.g. C:\Users\Limited_Account\AppData\erl5.10.4
(If you don't have any access to another computer, you can extract the setup file with 7-Zip but it'll be troublesome to fix paths.)
Modify erl.ini in the bin folder with the new paths. (By default erl.ini uses Unix line endings, so it might appear as a single line.)
[erlang]
Bindir=C:\\Users\\Limited_Account\\AppData\\erl5.10.4\\erts-5.10.4\\bin
Progname=erl
Rootdir=C:\\Users\\Limited_Account\\AppData\\erl5.10.4
See if bin\erl.exe opens the Erlang shell. If you see a crash dump, the path might not be correct. If the Visual C++ Redistributable files were not installed before, it will nag you about msvcr100.dll; you would need to copy those manually as well, but I don't recommend that.
Download the zip version of RabbitMQ server from https://www.rabbitmq.com/install-windows-manual.html and extract it.
Set the %ERLANG_HOME% variable. You can type set ERLANG_HOME=C:\Users\Limited_Account\AppData\erl5.10.4 on the command line. Alternatively, you can add this line to every .bat in the sbin folder.
Now you can use the management scripts in the sbin folder. For example, you can use rabbitmq_server-3.2.4\sbin\rabbitmq-server.bat to start the RabbitMQ Server. Obviously, starting as a service is not an option because you are not an admin.
For further information, see: https://www.rabbitmq.com/install-windows-manual.html
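Once the server is started from sbin, your Python Celery project can point at it like any other RabbitMQ broker. A minimal sketch, assuming the default guest/guest credentials on localhost:

# Minimal Celery sketch pointing at the locally started RabbitMQ server.
# guest/guest are the RabbitMQ defaults; adjust if you changed them.
from celery import Celery

app = Celery("tasks", broker="amqp://guest:guest@localhost:5672//")

@app.task
def add(x, y):
    return x + y

Then start a worker against it (for example celery -A tasks worker on recent Celery versions) after rabbitmq-server.bat is up.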
I'm currently running Python on a Linux machine and have a Windows XP guest running in VirtualBox.
I want to access the shared folder on the XP machine. I tried the following command but always get the same error.
d = os.listdir(r"\\remoteip\share")
OSError: [Errno 2] No such file or directory
The shared folder on XP was created by making a new folder in the Shared Documents folder, and I'm able to ping the machines.
Windows sharing is implemented using the SMB protocol. Windows Explorer and most Linux file managers (like Nautilus) make it transparent to the user, so it is easy to do common file operations on files/folders shared through SMB.
However, Linux (and thus Python running on top of it) does not add this abstraction by default at the file system level (though you can mount an SMB share as part of your file system).
So, in the end, to access those files you can:
mount your share using mount -t cifs (see the man page or Google for details) and then access your share from Python as a normal folder (to my mind this is a rather kludgy solution)
use a library that deals specifically with SMB, like pysmb (here is the relevant docs section), and do your file operations with its help; see the sketch below.
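For example, a minimal pysmb sketch for listing and downloading a file from the XP share; the credentials, machine names, share name, and paths below are placeholders:

# Minimal pysmb sketch: list and download a file from the XP share.
# Credentials, machine names, the share name, and paths are placeholders.
from smb.SMBConnection import SMBConnection

conn = SMBConnection("username", "password", "my_linux_box", "XP_MACHINE_NAME", use_ntlm_v2=True)
conn.connect("remoteip", 139)  # port 445 with is_direct_tcp=True is also common

for shared_file in conn.listPath("SharedDocs", "/"):
    print(shared_file.filename)

with open("local_copy.txt", "wb") as fh:
    conn.retrieveFile("SharedDocs", "/some_file.txt", fh)

conn.close()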
Hope this will help.
I have two servers, and one updates with a DNSBL of 100k domains every 15 minutes. I want to process these domains through a Python script with information from Safebrowsing, Siteadvisor, and other services. Unfortunately, the server with the DNSBL is rather slow. Is there a way I can transfer the files over from the other server with SSH in Python?
If it's just files (and directories) you are transferring, why not just use rsync over SSH (in a bash script, perhaps)? It's a proven, mature method.
Or you could mount the remote filesystem (over ssh) into your own filesystem using sshfs (fuse) and then use something like pyrobocopy (implementing a basic version of rsync functionality in Python) to transfer the files.
If you don't need incremental copying, you could go the simple route: mount the remote filesystem using sshfs (link above) and then use shutil.copytree to copy the correct directory.
Or yet another option: implement it using the paramiko Python ssh module.
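For instance, a minimal paramiko sketch that pulls the file over SFTP; the host, credentials, and paths are placeholders:

# Minimal paramiko sketch: download the DNSBL file from the other server over SFTP.
# The host, credentials, and paths are placeholders.
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("other-server", username="user", password="secret")

sftp = ssh.open_sftp()
sftp.get("/var/lib/dnsbl/domains.txt", "domains.txt")  # remote path -> local path
sftp.close()
ssh.close()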
There is a module called pexpect which is pretty nice.
It allows you to script ssh, telnet, etc. sessions. It also supports FTP, which might be handy for transferring files.
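For example, a rough pexpect sketch driving scp; the host, paths, and password are placeholders (key-based authentication would avoid the prompt entirely):

# Rough sketch: drive scp interactively with pexpect; all values are placeholders.
import pexpect

child = pexpect.spawn("scp user@other-server:/var/lib/dnsbl/domains.txt .")
child.expect("password:")
child.sendline("secret")
child.expect(pexpect.EOF)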