Transfer file from S3 to Windows server - python

I have just been introduced to Python (PySpark) and have a requirement to achieve the following steps:
1. Extract data from a Hive table (on EMR) into a CSV file on AWS S3.
2. Transfer the CSV file created on S3 (the EMR cluster runs Spark on YARN) to a remote Windows server, at a certain folder path.
3. Once the file has been transferred, trigger a batch file that exists on the Windows server at a certain folder path.
4. The Windows batch script, when triggered, updates/enriches the transferred file with additional information, so transfer/copy the updated CSV file back to S3.
5. The final step is to load the updated file into a Hive table once it is back on S3.
I have already figured out how to extract the data from the table into a CSV file on S3, and also how to load the file into the table. However, I am struggling to get a bearing on how to perform the file transfer/copy between the servers and, most importantly, how to trigger the Windows batch script on the remote machine.
Could someone please point me in the right direction and hint at where I should start? I searched the internet but couldn't find a concrete answer. I understand that I have to use the Boto3 library to interact with S3; however, if there is another established solution, please share it with me (code snippets, articles, etc.), along with any specific configuration I might have to incorporate to achieve the result.
Thanks
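One possible approach, sketched below under stated assumptions, is to download the CSV from S3 with boto3, push it to the Windows server over SFTP with paramiko (this assumes the Windows server runs an SSH/SFTP service such as OpenSSH for Windows), and trigger the batch file over the same SSH connection. All bucket names, hostnames, credentials and paths are placeholders.

    # Minimal sketch: S3 -> Windows transfer, remote batch trigger, and upload back.
    # Assumes the Windows server exposes SSH/SFTP (e.g. OpenSSH for Windows);
    # every name, path and credential below is a placeholder.
    import boto3
    import paramiko

    S3_BUCKET = "my-bucket"                  # placeholder bucket
    S3_KEY = "exports/data.csv"              # placeholder object key
    LOCAL_FILE = "/tmp/data.csv"             # staging file on the EMR node
    WIN_HOST = "windows-server.example.com"  # placeholder hostname
    WIN_USER = "svc_account"                 # placeholder credentials
    WIN_PASSWORD = "********"
    WIN_REMOTE_PATH = "C:/inbound/data.csv"  # target folder path on Windows
    WIN_BATCH = r"C:\scripts\enrich.bat"     # batch file to trigger

    # 1. Pull the CSV produced on S3 down to the EMR node.
    s3 = boto3.client("s3")
    s3.download_file(S3_BUCKET, S3_KEY, LOCAL_FILE)

    # 2. Copy it to the Windows server over SFTP, then run the batch file over SSH.
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(WIN_HOST, username=WIN_USER, password=WIN_PASSWORD)
    try:
        sftp = ssh.open_sftp()
        sftp.put(LOCAL_FILE, WIN_REMOTE_PATH)

        # 3. Trigger the batch file and wait for it to finish.
        stdin, stdout, stderr = ssh.exec_command(f'cmd /c "{WIN_BATCH}"')
        print("batch exit status:", stdout.channel.recv_exit_status())

        # 4. Pull the enriched file back down.
        sftp.get(WIN_REMOTE_PATH, LOCAL_FILE)
        sftp.close()
    finally:
        ssh.close()

    # 5. Upload the enriched file back to S3 so it can be loaded into Hive.
    s3.upload_file(LOCAL_FILE, S3_BUCKET, "enriched/data.csv")

If the Windows server only exposes WinRM rather than SSH, the pywinrm library can be used to run the batch file instead; the boto3 part stays the same.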

Related

Python HTTPS Requests: Get a file from GCP and send it to another location?

I have a script that scans a local folder and uploads some of the files to an SQL Server through a POST request. I would love to modify it to take files from a GCP bucket instead of from local storage. However, I have no experience with GCP and I am having difficulty finding documentation for what I am trying to do. I have a few questions for anyone who has tried anything like this before:
1. Is a GET request the best way to copy GCP bucket files to a different location? That is, is there a way to put my script directly into GCP and just use the POST request I already have, referencing the bucket instead of a folder?
2. If a GET request is the best way, does anyone know of a good resource for learning about HTTPS requests with GCP? (I'm not sure how to create the GET request or what information Google would need.)
3. After my GET request (if this is the best way), do the files necessarily have to be downloaded to my computer before the POST request to the SQL server, or is there a way to upload the files without downloading them first?
If you want to replace your local storage with Cloud Storage, there are several things to know:
The most transparent option, if you use a Linux-compliant OS, is GCSFuse. It lets you mount a Cloud Storage bucket in a local directory and work with it as if it were local storage. However, GCSFuse is a wrapper that turns system calls into HTTP calls; latency, features and performance are absolutely not the same.
When you search for a file on Cloud Storage, you can only search by prefix, not by suffix (looking for a particular extension such as .sql or .csv is not possible).
You must download the file content locally before sending it to your database, unless your database has a module/extension able to read data from a URL or directly from Cloud Storage (which most likely doesn't exist). A sketch of this download-then-POST flow is shown after this answer.
gsutil is the best tool for handling Cloud Storage files.
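For the download-then-POST step, a minimal sketch using the google-cloud-storage and requests libraries might look like the following; the bucket name, prefix and endpoint URL are placeholders, and the suffix filter is done client-side because of the prefix-only listing mentioned above.

    # Minimal sketch: list objects in a GCS bucket, download them locally,
    # then POST each file to the existing upload endpoint.
    # Bucket name, prefix and URL are placeholders.
    import os

    import requests
    from google.cloud import storage

    BUCKET_NAME = "my-bucket"                  # placeholder
    PREFIX = "exports/"                        # listing can only filter by prefix
    UPLOAD_URL = "https://example.com/upload"  # placeholder endpoint

    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS for auth
    for blob in client.list_blobs(BUCKET_NAME, prefix=PREFIX):
        # Suffix filtering (e.g. only .csv files) has to be done client-side.
        if not blob.name.endswith(".csv"):
            continue

        local_path = os.path.join("/tmp", os.path.basename(blob.name))
        blob.download_to_filename(local_path)  # download before POSTing

        with open(local_path, "rb") as fh:
            resp = requests.post(UPLOAD_URL, files={"file": fh})
            resp.raise_for_status()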

Heroku - CSV Files and TXT Logfiles

I want to deploy a Python bot to Heroku.
The bot writes all logging data to TXT files and also exports a CSV file to the filesystem; the CSV stores data that is needed for the next run of the bot and also makes it possible to track the bot's past performance.
Since I know it is not possible to store any files persistently on a Heroku dyno, the question is: how/where should I store the data?
A database for the data in the CSV file is not suitable for me, because I sometimes have to edit the file between two runs, and doing that via a database would be too much effort for me.
Any suggestions?
You need to save the file(s) on external storage like S3, Dropbox, GitHub, etc.
Check Files on Heroku to see the options and examples.
You can decide to read/write the files directly from the storage (i.e. no local copy), or keep the files locally while making sure they are saved to the storage at some point (e.g. every 10 minutes, or before every restart).
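As an illustration of the 'no local copy' option, here is a minimal boto3 sketch that reads the CSV straight from S3 at the start of a run and writes it back at the end; the bucket and key names are placeholders.

    # Minimal sketch: keep the bot's CSV state in S3 instead of the dyno filesystem.
    # Bucket and key names are placeholders.
    import csv
    import io

    import boto3

    BUCKET = "my-bot-state"     # placeholder bucket
    KEY = "state/bot_data.csv"  # placeholder key

    s3 = boto3.client("s3")

    def load_rows():
        """Fetch the CSV from S3 and return it as a list of dict rows."""
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        text = obj["Body"].read().decode("utf-8")
        return list(csv.DictReader(io.StringIO(text)))

    def save_rows(rows, fieldnames):
        """Serialize the rows back to CSV and overwrite the S3 object."""
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue().encode("utf-8"))

Because the object on S3 stays a plain CSV, it can still be downloaded, edited by hand between runs, and re-uploaded.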

Move files from one AWS S3 location to another AWS S3 location using the NiFi ExecuteScript processor

I'm getting files into an S3 location on a weekly basis, and after processing I need to move these files to another S3 location to archive them. I have Cloudera NiFi hosted on AWS. I can't use PutS3Object + DeleteS3Object processors at the end of the flow, because in this NiFi process I'm decompressing the file and adding an additional column (re-compressing the file and dropping the column again would hurt performance). I need a Python/Groovy script to move files from one S3 location to another. Is there any other way to do this?
"I need a Python/Groovy script to move files from one S3 location to another. Is there any other way to do this?"
No, you don't. You can use the record processors or a script to update the files and push them to S3. We pull, mutate and re-upload data like this all the time without having to control the upload with a script.
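That said, if a script turns out to be unavoidable (for example from an ExecuteScript or ExecuteStreamCommand processor), an S3 "move" is just a copy followed by a delete. A minimal boto3 sketch, with placeholder bucket and prefix names, could look like this:

    # Minimal sketch: "move" objects between S3 locations with boto3.
    # S3 has no native move, so it is a copy followed by a delete.
    # Bucket names and the prefix are placeholders.
    import boto3

    SRC_BUCKET = "incoming-bucket"  # placeholder
    DST_BUCKET = "archive-bucket"   # placeholder
    PREFIX = "weekly/"              # placeholder

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=DST_BUCKET,
                Key=key,
                CopySource={"Bucket": SRC_BUCKET, "Key": key},
            )
            s3.delete_object(Bucket=SRC_BUCKET, Key=key)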

Fetching .zip files from database

I have a database which contains my released projects in .zip format. The problem is that, after a certain period of time, I have to download those .zip files from my DB and mail them to customers. It's a manual process, so is there any way or framework available in Python that can automate this? Any leads or suggestions would be very helpful. I can handle the mail-sending part; the main thing I am asking about is automating the process of fetching the files from the DB.
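For the fetching part, a minimal sketch is shown below. Purely for illustration it assumes a table releases(customer_email, file_name, archive) and uses the standard-library sqlite3 module; swap in the DB-API driver for your actual database.

    # Minimal sketch: fetch .zip files stored as BLOBs from a database and
    # write them to disk, ready to be attached to an email.
    # The sqlite3 module and the releases(customer_email, file_name, archive)
    # table are assumptions purely for illustration.
    import sqlite3
    from pathlib import Path

    OUT_DIR = Path("outgoing")
    OUT_DIR.mkdir(exist_ok=True)

    conn = sqlite3.connect("releases.db")  # placeholder database
    cursor = conn.cursor()
    cursor.execute("SELECT customer_email, file_name, archive FROM releases")

    for customer_email, file_name, blob in cursor.fetchall():
        zip_path = OUT_DIR / file_name
        zip_path.write_bytes(blob)  # dump the BLOB back into a .zip file
        # ...hand zip_path and customer_email to the existing mailing code...
        print(f"fetched {zip_path} for {customer_email}")

    conn.close()

The periodic part can then be handled by cron, Windows Task Scheduler, or a library such as APScheduler.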

AWS ETL with python scripts

I am trying to create a basic ETL on the AWS platform, using Python.
In an S3 bucket (let's call it "A") I have lots of raw log files, gzipped.
What I would like is to have them periodically (hence Data Pipeline) unzipped, processed by a Python script which reformats the structure of every line, and output to another S3 bucket ("B"), preferably as gzips of the same log files that originated in the same gzip in A, but that's not mandatory.
I wrote the Python script, which does what it needs to do: it receives each line from stdin and outputs it to stdout (or to stderr if a line isn't valid; in that case, I'd like it to be written to another bucket, "C"). A minimal sketch of such a filter is shown below.
I was fiddling around with Data Pipeline and tried to run a shell command job and also a Hive job sequenced with the Python script.
The EMR cluster was created, ran and finished with no failures or errors, but also no logs were created, and I can't understand what is wrong.
In addition, I'd like the original logs to be removed after they are processed and written to the destination or erroneous-logs buckets.
Does anyone have experience with such a configuration? Any words of advice?
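For reference, the per-line filter mentioned above can be as small as the following sketch; is_valid() and reformat() are placeholders for the real validation and restructuring logic.

    #!/usr/bin/env python
    # Minimal sketch of the per-line filter: read lines from stdin, write
    # reformatted lines to stdout and invalid lines to stderr.
    # is_valid() and reformat() are placeholders for the real logic.
    import sys

    def is_valid(line):
        return bool(line.strip())    # placeholder validation

    def reformat(line):
        return line.strip().upper()  # placeholder restructuring

    for line in sys.stdin:
        if is_valid(line):
            sys.stdout.write(reformat(line) + "\n")
        else:
            sys.stderr.write(line)   # ends up in the "erroneous" stream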
The first thing you want to do is turn on 'termination protection' on the EMR cluster as soon as it is launched by Data Pipeline (this can be scripted too).
Then you can log on to the 'Master instance'. This is listed under the 'Hardware' pane in the EMR cluster details (you can also search in the EC2 console by cluster id).
You also have to define a 'key' so that you can SSH to the master.
Once you log on to the master, you can look under /mnt/var/log/hadoop/steps/ for logs, or under /mnt/var/lib/hadoop/.. for the actual artifacts. You can browse HDFS using the HDFS utilities.
The logs (if they are written to stdout or stderr) are already moved to S3. If you want to move additional files, you have to write a script and run it using 'script-runner'; a sketch follows below. You can copy a large number of files using 's3distcp'.
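A script run through script-runner to push such additional files to S3 could be as simple as this boto3 sketch; the local directory, bucket and key prefix are placeholders.

    # Minimal sketch: upload extra files from the EMR master node to S3,
    # e.g. run as an EMR step via script-runner.
    # The local directory, bucket and key prefix are placeholders.
    import os

    import boto3

    LOCAL_DIR = "/mnt/var/log/myjob"  # placeholder directory on the master
    BUCKET = "my-log-bucket"          # placeholder bucket
    PREFIX = "pipeline-logs/"         # placeholder key prefix

    s3 = boto3.client("s3")
    for root, _dirs, files in os.walk(LOCAL_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            key = PREFIX + os.path.relpath(local_path, LOCAL_DIR)
            s3.upload_file(local_path, BUCKET, key)
            print(f"uploaded {local_path} -> s3://{BUCKET}/{key}")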
