AWS ETL with Python scripts

I am trying to create a basic ETL pipeline on the AWS platform, using Python.
In an S3 bucket (let's call it "A") I have lots of raw log files, gzipped.
What I would like is to have them periodically (via Data Pipeline) unzipped and processed by a Python script that reformats the structure of every line, then written to another S3 bucket ("B"), preferably as gzips grouping the same log files that originated from the same gzip in A, but that's not mandatory.
I wrote the Python script, which does what it needs to do: it receives each line from stdin and outputs to stdout (or to stderr if a line isn't valid; in that case, I'd like it to be written to another bucket, "C"). A sketch of this kind of filter is included below.
I was fiddling around with Data Pipeline and tried to run a shell command activity and also a Hive activity to run the Python script in sequence.
The EMR cluster was created, ran, and finished with no failures or errors, but also no logs were created, and I can't understand what is wrong.
In addition, I'd like the original logs to be removed after they have been processed and written to the destination or erroneous-logs bucket.
Does anyone have any experience with such a configuration, or words of advice?
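For context, a minimal sketch of the kind of line filter described above; the validation and reformatting rules here are placeholders, not the actual logic:

#!/usr/bin/env python
# Read raw log lines from stdin, write reformatted lines to stdout,
# and send lines that fail validation to stderr.
import sys

def reformat(line):
    # placeholder logic: split on whitespace and re-join with tabs
    fields = line.rstrip("\n").split()
    if len(fields) < 3:
        raise ValueError("unexpected number of fields")
    return "\t".join(fields)

for line in sys.stdin:
    try:
        sys.stdout.write(reformat(line) + "\n")
    except ValueError:
        sys.stderr.write(line)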

The first thing you want to do is turn on 'termination protection' on the EMR cluster as soon as it is launched by Data Pipeline (this can be scripted too).
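For example, a minimal boto3 sketch of scripting that step; the cluster id and region are placeholders:

# Turn on termination protection for a running EMR cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.set_termination_protection(
    JobFlowIds=["j-XXXXXXXXXXXXX"],  # placeholder cluster id
    TerminationProtected=True,
)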
Then you can log on to the master instance. It is listed under the 'Hardware' pane of the EMR cluster details (you can also search the EC2 console by cluster id).
You also have to define a key pair so that you can SSH to the master.
Once you log on to the master, you can look under /mnt/var/log/hadoop/steps/ for logs, or under /mnt/var/lib/hadoop/.. for the actual artifacts. You can browse HDFS using the HDFS utilities.
The logs (if they are written to stdout or stderr) are already moved to S3. If you want to move additional files, you have to write a script and run it using 'script-runner'. You can copy a large number of files using 's3distcp'.
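One hedged way to script that copy is to submit an s3-dist-cp step to the cluster through boto3's add_job_flow_steps with command-runner.jar; the cluster id and the source/destination paths below are placeholders:

# Add an s3-dist-cp step to a running EMR cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "copy processed logs to S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "hdfs:///processed/",
                     "--dest", "s3://bucket-b/processed/"],
        },
    }],
)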


Transfer file from S3 to Windows server

I have just been introduced to Python (PySpark). I have a requirement to achieve the following steps:
Extract data from a Hive table (on EMR) into a CSV file on AWS S3.
Transfer the CSV file created on S3 (the EMR cluster runs Spark on YARN) to a remote Windows server (to a certain folder path).
Once the file has been transferred, trigger a batch file that exists on the Windows server at a certain folder path.
The Windows batch script, when triggered, updates/enriches the transferred file with additional information, so transfer/copy the updated CSV file back to S3.
The final step is to load the updated file into a Hive table once it is transferred back to S3.
I have already figured out how to extract the data from the table into a CSV file on S3 and how to load the file into the table. However, I am struggling to get a bearing on how to perform the file transfer/copy between the servers and, most importantly, how to trigger the Windows batch script on the remote machine.
Could someone please point me in the right direction and hint at where I should start? I searched the internet but couldn't get a concrete answer. I understand that I have to use the Boto3 library to interact with S3, but if there is any other established solution, please share it with me (code snippets, articles, etc.), along with any specific configuration I might have to incorporate to achieve the result.
Thanks
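For what it's worth, the S3 side of this can be sketched with boto3 roughly as follows; the bucket names, keys, and local paths are placeholders, and the transfer to the Windows server and the batch-file trigger are not covered here:

import boto3

s3 = boto3.client("s3")

# pull the CSV that the Hive/Spark job wrote to S3
s3.download_file("my-data-bucket", "exports/report.csv", "/tmp/report.csv")

# ... transfer /tmp/report.csv to the Windows server and run the batch file ...

# push the enriched CSV back so it can be loaded into Hive
s3.upload_file("/tmp/report_enriched.csv", "my-data-bucket", "imports/report_enriched.csv")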

I want to trigger a Python script using a Cloud Function whenever a specified file is created on Google Cloud Storage

One CSV file is uploaded to Cloud Storage every day around 02:00, but sometimes, due to a job failure or a system crash, the upload happens very late. So I want to create a Cloud Function that can trigger my Python BigQuery load script whenever the file is uploaded to storage.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks a detailed description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
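A minimal sketch of such a storage-triggered function, assuming the background-function signature (event, context) and a BigQuery load of the uploaded CSV; the dataset and table names are placeholders:

from google.cloud import bigquery

def load_on_upload(event, context):
    # fires on each object finalize event in the bucket
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config).result()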
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to handle this check for the file's existence, download the object, manipulate it, and then do something with it.
This endpoint can act as a custom HTTP-triggered function: once it receives a GET (or POST) request (which could come from a simple curl request, a visit from the browser, a Pub/Sub event, or even another Cloud Function), it downloads the object into the /tmp dir, processes it, and then does something with it.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you will not have cold starts or risk the request timing out before the job is done.
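A bare-bones sketch of what that GAE endpoint could look like; the route name and the processing body are placeholders:

from flask import Flask

app = Flask(__name__)

@app.route("/process", methods=["GET", "POST"])
def process():
    # download the object into /tmp here, manipulate it, then do something with it
    return "done", 200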
The brutal/overkill way: Cloud Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run will scale down to zero when there's no usage, and handle the other minor things that apply to building any application on Cloud Run.
For all the above approaches, some additional things you might want to handle are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that's where both GAE and CF store temporary files. Cloud Run is a bit different here, but let's not get deep into it, as it's overkill by itself.
However, keep in mind that if your file is large you might run into high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ..., as it will also make sure not to keep files open.
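A hedged sketch of that pattern with the google-cloud-storage client; the bucket and object names are placeholders:

import os
from google.cloud import storage

client = storage.Client()
blob = client.bucket("sale_bucket").blob("seller_data_2021_01_01")  # placeholder names
local_path = "/tmp/seller_data.csv"
blob.download_to_filename(local_path)

try:
    with open(local_path) as f:
        for line in f:
            pass  # process each line here
finally:
    os.remove(local_path)  # always clean /tmp when done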
b) Downloading the latest object in the bucket:
This is a bit tricky and it needs some extra custom code. There are many ways to achieve it, but the one I use (always, though, paying close attention to memory usage) is this: upon creation of the object I upload to the bucket, I get the current time and use a regex to transform it into something like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket listing. If yes, do nothing; if not, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's preferable.
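A rough sketch of that listing trick; it assumes the naming scheme above sorts in ascending order, and the bucket name and "results_" prefix are placeholders:

import os
from google.cloud import storage

client = storage.Client()
blobs = sorted(client.list_blobs("sale_bucket"), key=lambda b: b.name)
latest = blobs[-1]  # ascending order, so the last element is the newest

local_path = os.path.join("/tmp", latest.name)
if not os.path.exists(local_path):
    for old in os.listdir("/tmp"):
        if old.startswith("results_"):          # drop the stale copy
            os.remove(os.path.join("/tmp", old))
    latest.download_to_filename(local_path)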

Concurrent file upload/download and running background processes

I want to create a minimal webpage where concurrent users can upload a file and I can process the file (which is expected to take some hours) and email back to the user later on.
Since I am hosting this on AWS, I was thinking of invoking some background process once I receive the file, so that even if the user closes the browser window, the processing keeps taking place and I am able to send the results after a few hours, all through some pre-written scripts.
Can you please help me with the logistics of how I should do this?
Here's how it might look (hosting-agnostic):
A user uploads a file on the web server
The file is saved in a storage that can be accessed later by the background jobs
Some metadata (location in the storage, user's email etc) about the file is saved in a DB/message broker
Background jobs tracking the DB/message broker pick up the metadata, start handling the file (this is why it needs to be accessible to them, per point 2), and notify the user
More specifically, in the case of Python/Django + AWS you might use the following stack:
Let's assume you're using Python + Django
You can save the uploaded files in a private AWS S3 bucket
Some metadata might be saved in the DB, or you can use celery + AWS SQS, or AWS SQS directly, or bring up something like RabbitMQ or Redis (+ pub/sub)
Have Python code handling the job - this depends on what you opt for in point 3. The only requirement is that it can pull data from your S3 bucket. After the job is done, notify the user via AWS SES
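A hedged sketch of the celery variant: the web view only uploads the file to S3 and enqueues a task; the worker pulls the file, processes it, and mails the user via SES. The broker URL, bucket, and addresses are placeholders:

import boto3
from celery import Celery

app = Celery("tasks", broker="sqs://")  # celery can use SQS as its broker

@app.task
def process_upload(bucket, key, user_email):
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, "/tmp/upload.dat")
    # ... hours of processing on /tmp/upload.dat ...
    ses = boto3.client("ses", region_name="us-east-1")
    ses.send_email(
        Source="noreply@example.com",
        Destination={"ToAddresses": [user_email]},
        Message={
            "Subject": {"Data": "Your file is ready"},
            "Body": {"Text": {"Data": "Processing finished."}},
        },
    )

Whether you go through celery at all, or talk to SQS directly, depends on what you picked in point 3.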
The simplest single-server setup that doesn't require any intermediate components:
A Python script that simply saves the file in a folder and gives it a name like someuser#yahoo.com-f9619ff-8b86-d011-b42d-00cf4fc964ff
A cron job looking for any files in this folder, which handles the files it finds and notifies the user. Note that if you need multiple background jobs running in parallel, you'll have to complicate the scheme slightly to avoid race conditions (i.e. rename the file being processed so that only a single job handles it)
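A hedged sketch of that single-server variant: a script run from cron claims a file by renaming it (so concurrent runs don't pick the same one), then processes it and notifies the user encoded in the filename. The folder path and naming convention are placeholders:

import os

UPLOAD_DIR = "/var/uploads"  # placeholder path

for name in os.listdir(UPLOAD_DIR):
    if name.endswith(".processing"):
        continue
    src = os.path.join(UPLOAD_DIR, name)
    claimed = src + ".processing"
    try:
        os.rename(src, claimed)  # atomic on the same filesystem, so only one job wins
    except OSError:
        continue  # another job got there first
    email = name.split("-", 1)[0]  # filename starts with the user's address
    # ... process `claimed`, then email the user and remove the file ...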
In a prod app you'll likely need something in between depending on your needs

Generating and serving files created by a Python script on Heroku (Node/Express server)

I'm working on a site that collects textual input from a number of users and then gives all that input to a user with another role after a set amount of time. The user who is given access to all the input needs to be able to export all the text as a word document.
Right now, on my local machine, a button on the page makes a DB call for all the text and uses the fs npm module to write the correct set of input to a raw text document in a format the Python script can understand. I then use the docx module in Python to read the text and write the formatted input into the Word document, saving it into the public directory of my server. I can navigate to it manually after that.
I can automate it locally by writing a simple cron job that waits for the contents of the raw text file to change, firing the Python program when that happens and having the link to the Word doc appear after some timeout.
My question is: how would I get this to work on my Heroku site? Simply having Python isn't enough, because I need to install the docx module with pip. Beyond that, I still need a scheduled check for the raw text file to change in order to fire the Python script. Can this be accomplished through the Procfile or some Heroku add-ons? Is there a better way to accomplish the desired behavior of button click -> document creation -> serve the file? I'd love to know your thoughts.
You have a few different issues to look at: 1) enabling both Python and Node, then 2) correct use of the filesystem on Heroku, and 3) ways to schedule the work.
For #1, you need to enable multiple build packs to get both Node.js and Python stacks in place. See https://devcenter.heroku.com/articles/using-multiple-buildpacks-for-an-app.
For #2, you need to send the files to a storage service of some kind (e.g., Amazon S3) - the filesystem for your dyno is ephemeral, and anything written there will disappear after a dyno restart (which happens every 24 hours no matter what); a short sketch follows below.
For #3, the simplest solution is probably the Heroku Scheduler add-on, which acts like a rudimentary cron. Remember, you don't have low-level OS access, so you need to use the Heroku-provided equivalent.
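For #2, a minimal boto3 sketch of pushing the generated document to S3 and handing the browser a download link; the bucket name, key, and expiry are placeholders, and credentials are assumed to come from config vars:

import boto3

s3 = boto3.client("s3")
s3.upload_file("/tmp/report.docx", "my-app-exports", "documents/report.docx")

# a time-limited download link you can return to the browser
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-app-exports", "Key": "documents/report.docx"},
    ExpiresIn=3600,
)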

How to copy simultaneously from multiple nodes using Fabric?

I have just started using Fabric; it looks like a very useful tool. I am able to write a tiny script to run some commands in parallel on my Amazon EC2 hosts, something like this:
from fabric.api import parallel, sudo

@parallel
def runs_in_parallel():
    sudo("rm -rf /usr/lib/jvm/j2sdk1.6-oracle")
Also, I have written another script to copy all the Hadoop logs from all EC2 nodes to my local machine. This script creates a folder named with a timestamp, within that one folder per node named after its IP address, and then copies that node's logs into the IP-address-named folder. E.g.:
2014-04-22-15-52-55/
    50.17.94.170/
        hadoop-logs
    54.204.157.86/
        hadoop-logs
    54.205.86.22/
        hadoop-logs
Now I want to do this copy task using Fabric so that I can copy the logs in parallel, to save time. I thought I could easily do it the way I did in my first code snippet, but that won't help, as that runs commands on the remote server. I have no clue as of now how to do this. Any help is much appreciated.
You could likely use the get() command to handle pulling down files. You'd want to make them into tarballs and have them pulled into unique filenames on your client to keep the gets from clobbering one another.
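A hedged sketch of that idea with Fabric 1.x: tar the logs on each node, then pull the tarball in parallel; %(host)s in the local path keeps each node's copy in its own folder so parallel gets don't clobber one another. The remote and local paths are placeholders:

from fabric.api import parallel, run, get

@parallel
def fetch_hadoop_logs():
    # bundle the logs on the remote node first
    run("tar -czf /tmp/hadoop-logs.tar.gz -C /mnt/var/log hadoop")
    # pull the tarball into a per-host local directory
    get("/tmp/hadoop-logs.tar.gz", "logs/%(host)s/hadoop-logs.tar.gz")

You would then run it with something like fab -H host1,host2 fetch_hadoop_logs.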
