Airflow - retrieve files from Windows shared folders? - python

What is the best method to grab files from a Windows shared folder on the same network?
Typically, I am extracting data from SFTPs, Salesforce, or database tables, but there are a few cases where end users need to upload a file to a shared folder that I have to retrieve. My process up to now has been to have a script running on a Windows machine which just grabs any new/changed files and loads them to an SFTP server, but that is not ideal: I can't monitor it in my Airflow UI, I have to change my password on that machine in person, mapped network drives seem to break, etc.
Is there a better method? I'd rather the ETL server handle all of this stuff.
Airflow is installed on remote Linux server (same network)
Windows folders are just standard UNC paths where people have access based on their NT ID. These users are saving files which I need to retrieve. The users are non-technical and did not want WinSCP installed so they could share the data through an SFTP instead, or even a SharePoint site (where I could use Shareplum, I think).
I would like to avoid mounting these folders and instead use Python scripts to simply copy the files I need on an Airflow schedule.
Ideally, I could save my NT ID and password within an Airflow connection and access it with a conn_id.

If I'm understanding the question correctly, you have a shared folder mounted on your local machine — not the Windows server where your Airflow install is running. Is it possible to access the shared folder on the server instead?
I think a file sensor would work for your use case.
If you could auto-sync the shared folder to a cloud file store like S3, you could then use the standard S3KeySensor and S3PrefixSensor that are commonly used. I think this would simplify your solution, as you wouldn't have to be concerned with whether the machine(s) the tasks are running on have access to the folder.
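A minimal sketch of the sensor side, assuming the folder is synced to a bucket named my-bucket under a shared/ prefix (both placeholders) and that the Amazon provider package is installed; import paths differ between Airflow versions:

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="wait_for_shared_file",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-bucket",        # placeholder: bucket the shared folder is synced to
        bucket_key="shared/*.csv",      # placeholder key pattern
        wildcard_match=True,            # treat bucket_key as a wildcard
        aws_conn_id="aws_default",
        poke_interval=300,              # re-check every 5 minutes
    )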
Here are two examples of software that syncs a local folder on Windows to S3. Note that I haven't used either of them personally.
https://www.cloudberrylab.com/blog/how-to-sync-local-folder-with-amazon-s3-bucket-with-cloudberry-s3-explorer/
https://s3browser.com/amazon-s3-folder-sync.aspx
That said, I do think using FTPHook.retrieve_file is a reasonable solution if you can't have your files in cloud storage.
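A minimal sketch of that approach, assuming a connection named shared_folder_ftp (a placeholder) has been defined in the Airflow UI so the credentials stay out of the code:

from airflow.providers.ftp.hooks.ftp import FTPHook

def fetch_uploaded_file():
    # "shared_folder_ftp" is a placeholder conn_id; credentials live in the Airflow connection
    hook = FTPHook(ftp_conn_id="shared_folder_ftp")
    hook.retrieve_file("/incoming/report.csv", "/tmp/report.csv")  # placeholder remote/local paths

If the files land on an SFTP server instead, SFTPHook exposes a retrieve_file method with the same shape; wrap the callable in a PythonOperator to put it on a schedule.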

Related

How can I transfer files from a Google AI Platform training job to another compute instance or my local machine?

Any ideas how I can automatically send some files (mainly Tensorflow models) after training in Google AI Platform to another compute instance or my local machine? I would like to run something like this in my trainer: os.system('scp -r ./file1 user@host:/path/to/folder'). Of course I don’t need to use scp, it’s just an example. Is there such a possibility in Google? There is no problem transferring files from the job to Google Cloud Storage like this: os.system('gsutil cp ./example_file gs://my_bucket/path/'). However, when I try for example os.system('gcloud compute scp ./example_file my_instance:/path/') to transfer data from my AI Platform job to another instance, I get "Your platform does not support SSH". Any ideas how I can do this?
UPDATE
Maybe there is a possibility to automatically download all the files from a chosen Google Cloud Storage folder? I would, for instance, upload data from my job instance to the Google Cloud Storage folder, and my other instance would automatically detect changes and download all the new files.
UPDATE2
I found gsutil rsync, but I am not sure whether it can be constantly running in the background. At this point the only solution that comes to my mind is to use a cron job in the background and run gsutil rsync, for example, every 10 minutes. But it doesn't seem to be an optimal solution. Maybe there is a built-in tool or another better idea?
The gsutil rsync command makes the contents under the destination the same as the contents under the source, by copying any missing files/objects (or those whose data has changed) and (if the -d option is specified) deleting any extra files/objects. The source must specify a directory, bucket, or bucket subdirectory. But the command does not run in the background.
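For reference, a typical invocation on the receiving instance could look like this (the bucket path and local directory are placeholders):

import os
# -m parallelizes the copy; add -d if you also want local extras deleted
os.system("gsutil -m rsync -r gs://my_bucket/path /local/dir")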
Remember that the Notebook you're using is in fact a VM running JupyterLab. Based on that, you could run the rsync command once Tensorflow has finished creating the files and sync them with a directory on another instance, with something like:
import os
# copy the model output to the other instance over SSH (rsync must be available on both ends)
os.system("rsync -avrz Tensorflow/outputs/filename root@ip:Tensorflow/outputs/file")
I suggest you take a look at the rsync documentation to see all the options available for that command.

Airflow: How to download file from Linux to Windows via smbclient

I have a DAG that imports data from a source to a server. From there, I am looking to download that file from the server to the Windows network. I would like to keep this part in Airflow for automation purposes. Does anyone know how to do this in Airflow? I am not sure whether to use the os package, the shutil package, or maybe there is a different approach.
I think you're saying you're looking for a way to get files from a cloud server to a Windows shared drive or onto a computer in the Windows network. These are some options I've seen used:
Use a service like Google Drive, Dropbox, Box, or S3 to simulate a synced folder on the cloud machine and on a machine in the Windows network.
Call a bash command to SCP the files to the Windows server or a worker in the network. This could work in the opposite direction too.
Add the files to a git repository and have a worker in the windows network sync the repository to a shared location. This option is only good in very specific cases. It has the benefit that you can track changes and restore old states (if the data is in CSV or another text format), but it's not great for large files or binary files.
Use rsync to transfer the files to a worker in the Windows network which has the shared location mounted, and move the files to the synced dir with Python or bash.
Mount the network drive to the server and use Python or bash to move the files there.
All of these should be possible with Airflow, either by using Python (shutil) or a bash script to transfer the files to the right directory for some other process to pick up, or by calling a bash sub-process to perform the direct transfer via SCP or commit the data via git. You will have to find out what's possible with your firewall and network settings. Some of these would require coordinating tasks on the Windows side (the git option, for example, would require some kind of cron job or task scheduler to pull the repository to keep the files up to date).
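A rough sketch of the SCP and mounted-drive variants in an Airflow 2 DAG (host names, paths, and the mount point are placeholders; the scp call assumes key-based auth to the target machine):

from datetime import datetime
import shutil

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def copy_to_mounted_share():
    # assumes the Windows share is already mounted at /mnt/shared on the worker
    shutil.copy("/data/exports/report.csv", "/mnt/shared/reports/")

with DAG(
    dag_id="push_file_to_windows_network",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    copy_via_scp = BashOperator(
        task_id="copy_via_scp",
        bash_command="scp /data/exports/report.csv svc_user@winhost:/inbound/",
    )
    copy_to_share = PythonOperator(
        task_id="copy_to_share",
        python_callable=copy_to_mounted_share,
    )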

Script that can automatically download new data from the server to my local backup

I have an application running on a Linux server and I need to create a local backup of its data.
However, new data is added to the application every hour, and I want to keep my local backup in sync with the server's data.
I want to write a script (shell or Python) that can automatically download newly added data from the Linux server to my local machine backup. But I am new to the Linux environment and don't know how to write a shell script to achieve this.
What is the best way of achieving this, and what would the script look like?
rsync -r fits your use case, and it's a single-line command.
rsync -r source destination
or with whatever options you need for your specific case.
So you don't need a Python script for that, but you can still write one and have it run the command above.
Moreover, if you want the Python script to do it automatically, you may check Python's sched (event scheduler) module.
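Assuming "event scheduler" refers to the built-in sched module, a minimal sketch that reruns rsync every hour (source and destination are placeholders, and the rsync call assumes key-based SSH auth to the server):

import sched
import subprocess
import time

scheduler = sched.scheduler(time.time, time.sleep)

def sync(interval=3600):
    # run the rsync shown above, then schedule the next run
    subprocess.run(["rsync", "-r", "user@server:/app/data/", "/local/backup/"], check=True)
    scheduler.enter(interval, 1, sync, (interval,))

scheduler.enter(0, 1, sync)
scheduler.run()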
This depends on where and how your data is stored on the Linux server, but you could write a network application which pushes the data to a client and the client saves the data on the local machine. You can use sockets for that.
If the data is available via an HTTP server and you know how to write RESTful APIs, you could use that as well and make a task run on your local machine every hour which calls the REST API and handles its (JSON) data. Keep in mind that you need to secure the API if the server is exposed to the Internet and not just the same LAN.
You could also write a small application which downloads the files from the server over FTP every hour (if you want to back up files stored on the filesystem). You will need to know the exact path of the file(s) to do this, though.
All the solutions above are meant for Python programming. Using a shell script is possible, but a little more complicated. I would use Python for this kind of task, as you have a lot of network-related libraries available (ftp, socket, http clients, simple http servers, WSGI libraries, etc.)
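For the FTP option above, a minimal sketch with the standard-library ftplib (host, credentials, and paths are placeholders):

from ftplib import FTP

def download_backup():
    with FTP("server.example.com") as ftp:
        ftp.login("backup_user", "secret")  # better: read credentials from env vars or a config file
        with open("/local/backup/data.db", "wb") as f:
            ftp.retrbinary("RETR /app/data/data.db", f.write)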

Store and retrieve text online/cloud using python APIs

I am writing a Python application which reads/parses a file of this kind:
myalerts.ini
value1=1
value2=3
value3=10
value4=15
Currently I store this file in the local filesystem. If I need to change this file, I need to have physical access to that computer.
I want to move this file to the cloud so that I can change it from anywhere (another computer or my phone).
If this application is running on some machine, I should be able to change this file in the cloud, and the application running on another machine which I don't have physical access to will be able to read the updated file.
Notes,
I am new to both Python and AWS.
I am currently running it on my local Mac/Linux machine and planning to deploy it on AWS.
There are many options!
Amazon S3: This is the simplest option. Each computer could download the file at regular intervals or just before it runs a process. If the file is big, the app could instead check whether the file has changed before downloading.
Amazon Elastic File System (EFS): If your applications are running on multiple Amazon EC2 instances, EFS provides a shared file system that can be mounted on each instance.
Amazon DynamoDB: A NoSQL database instead of a file. Much faster than parsing a file, but less convenient for updating values — you'd need to write a program to update values, e.g. from the command line.
AWS Systems Manager Parameter Store: A managed service for storing parameters. Applications (anywhere on the Internet) can request and update parameters. A great way to configure cloud-based applications!
If you are looking for minimal change and you want it accessible from anywhere on the Internet, Amazon S3 is the easiest choice.
Whichever way you go, you'll use the boto3 AWS SDK for Python.
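For the S3 route, the download side is only a few lines with boto3 (bucket name and key are placeholders; credentials come from the usual AWS configuration):

import boto3

s3 = boto3.client("s3")
# fetch the latest copy before the application parses it; head_object's ETag
# can be compared first if you only want to download when the file changed
s3.download_file("my-config-bucket", "config/myalerts.ini", "myalerts.ini")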

How can I access another server with Python?

I have two servers, and one updates with a DNSBL of 100k domains every 15 minutes. I want to process these domains through a Python script with information from Safebrowsing, Siteadvisor, and other services. Unfortunately, the server with the DNSBL is rather slow. Is there a way I can transfer the files over from the other server with SSH in Python?
If it's just files (and directories) you are transferring, why not just use rsync over ssh (in a bash script, perhaps)? It's a proven, mature method.
Or you could mount the remote filesystem (over ssh) into your own filesystem using sshfs (fuse) and then use something like pyrobocopy (implementing a basic version of rsync functionality in Python) to transfer the files.
If you don't need incremental copying, you could go the simple route: mount the remote filesystem using sshfs (as above) and then use shutil.copytree to copy the correct directory.
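A rough sketch of that simple route, assuming the remote filesystem is already mounted (for example with sshfs user@host:/var/lib/dnsbl /mnt/remote) and Python 3.8+ for dirs_exist_ok:

import shutil

# copy the whole directory from the sshfs mount to a local working copy
shutil.copytree("/mnt/remote/domains", "/local/domains", dirs_exist_ok=True)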
Or yet another option: implement it using the paramiko Python ssh module.
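A minimal paramiko sketch (host, credentials, and paths are placeholders; key-based auth would be preferable to a hard-coded password):

import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # or load a known_hosts file instead
client.connect("dnsbl-server.example.com", username="user", password="secret")

sftp = client.open_sftp()
sftp.get("/var/lib/dnsbl/domains.txt", "/tmp/domains.txt")  # remote path, local path
sftp.close()
client.close()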
There is a module called pexpect which is pretty nice.
This allows you to ssh, telnet, etc. It also supports FTP, which might be handy for transferring files.
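A hedged sketch of driving scp with pexpect (host, paths, and the exact password prompt are assumptions; key-based auth would avoid the prompt entirely):

import pexpect

child = pexpect.spawn("scp user@dnsbl-server:/var/lib/dnsbl/domains.txt /tmp/domains.txt")
child.expect("password:")   # assumes password auth; the prompt text may differ
child.sendline("secret")    # placeholder password
child.expect(pexpect.EOF)   # wait for the transfer to finish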
