retrieving s3 path from payload inside AWS glue pythonshell job - python

I have a pythonshell job inside AWS Glue that needs to download a file from an S3 path. The S3 path is variable, so it will come to the Glue job as a payload in the start_job_run call, like below:
import boto3
glue = boto3.client('glue')
payload = {'s3_target_file': s3_TARGET_FILE_PATH,
           's3_test_file': s3_TEST_FILE_PATH}
job_def = dict(
    JobName=MY_GLUE_PYTHONSHELL_JOB,
    Arguments=payload,
    WorkerType='Standard',
    NumberOfWorkers=2,
)
response = glue.start_job_run(**job_def)
My question is, how do I retrieve those s3 paths from the payload inside AWS Glue pythonshell job that comes through boto3? Is there any sort of handler we need to write similar to AWS Lambda?
Please suggest.

Check the documentation. All you need is there.
You can use the getResolvedOptions as follows:
import sys
from awsglue.utils import getResolvedOptions
args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day partition key is: ", args['day_partition_key'])
print("and the day partition value is: ", args['day_partition_value'])
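One detail that is easy to miss: getResolvedOptions looks up each name on the command line with a leading `--`, so the keys in the Arguments passed to start_job_run need that prefix (for example `--s3_target_file`). The following is a minimal sketch under that assumption, not a definitive implementation. The helper names to_glue_arguments, split_s3_uri and read_job_args, and the example path, are hypothetical.

```python
import sys
from urllib.parse import urlparse


def to_glue_arguments(payload):
    """Caller side: prefix every key with '--' for start_job_run(Arguments=...)."""
    return {"--" + key.lstrip("-"): str(value) for key, value in payload.items()}


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def read_job_args(argv=None):
    """Job side: resolve the two paths inside the Python shell job."""
    # awsglue is only importable inside the Glue runtime, hence the local import.
    from awsglue.utils import getResolvedOptions
    return getResolvedOptions(argv or sys.argv, ["s3_target_file", "s3_test_file"])


# Caller side (hypothetical path): this dict is what goes into Arguments=...
arguments = to_glue_arguments({"s3_target_file": "s3://my-bucket/in/target.csv"})
# Job side, once the argument value has been resolved:
bucket, key = split_s3_uri("s3://my-bucket/in/target.csv")
```

Inside the job, the resulting bucket and key can be handed to the boto3 S3 client's download_file. No Lambda-style handler is needed, because a Python shell job simply runs the script from top to bottom.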


add delay for a task until particular file is moved from bucket

I am new to Airflow. I need to check whether a file generated by my DAG (e.g. sample.txt) has been moved out of a bucket. In my case, the generated file is moved away from the bucket when another system picks it up, after which the output file no longer exists in the bucket. It might take a few minutes for the file to be removed.
How can I add a task to the same DAG that waits/retries until the file has been moved away from the bucket, and only then proceeds with the next task?
Is there an operator that satisfies these criteria? Please shed some light on how to proceed.
You can create a custom sensor based on the current GCSObjectExistenceSensor.
The modification is simple:
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
class GCSObjectNotExistenceSensor(GCSObjectExistenceSensor):
    def poke(self, context: dict) -> bool:
        self.log.info('Sensor checks if : %s, %s does not exist', self.bucket, self.object)
        hook = GCSHook(
            gcp_conn_id=self.google_cloud_conn_id,
            delegate_to=self.delegate_to,
            impersonation_chain=self.impersonation_chain,
        )
        # Succeed only once the object is no longer in the bucket
        return not hook.exists(self.bucket, self.object)
Then use the sensor GCSObjectNotExistenceSensor in your code like:
gcs_object_does_not_exists = GCSObjectNotExistenceSensor(
    bucket=BUCKET_1,
    object=PATH_TO__FILE,
    mode='poke',
    task_id="gcs_object_does_not_exists_task",
)
The sensor will not let the pipeline continue until the object PATH_TO__FILE is removed.
You can use Airflow's PythonOperator to achieve this. Make the Python callable poll GCS repeatedly to check whether the file has been removed, and return from the function once it has.
import time
from airflow.operators.python_operator import PythonOperator
from google.cloud import storage
import google.auth
def check_file_in_gcs():
    credentials, project = google.auth.default()
    storage_client = storage.Client('your_Project_id', credentials=credentials)
    name = 'sample.txt'
    bucket_name = 'Your_Bucket_name'
    bucket = storage_client.bucket(bucket_name)
    while True:
        stats = storage.Blob(bucket=bucket, name=name).exists(storage_client)
        if not stats:
            print("Returning as file is removed!!!!")
            return
        # Pause between checks so the loop does not hammer the GCS API
        time.sleep(30)
check_gcs_file_removal = PythonOperator(
    task_id='check_gcs_file_removal',
    python_callable=check_file_in_gcs,
    # op_kwargs={'params': xyz},
    # Pass the bucket name and other details if needed by uncommenting the line above
    dag=dag
)
You might need to install Python packages for the Google Cloud libraries to work. The list below is taken from my virtualenv; I am not sure exactly which of these are required.
google-api-core==1.16.0
google-api-python-client==1.8.0
google-auth==1.12.0
google-auth-httplib2==0.0.3
google-auth-oauthlib==0.4.1
google-cloud-core==1.3.0
google-cloud-storage==1.27.0
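Whichever approach is used, it helps to put an upper bound on the wait, so that a file that is never picked up fails the task instead of blocking it forever. Airflow sensors expose poke_interval and timeout parameters for this. For a plain Python callable, the same idea can be sketched as below. The function name wait_until_gone and its parameters are hypothetical, and the exists argument stands in for a real check such as Blob.exists.

```python
import time


def wait_until_gone(exists, poke_interval=30.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll exists() until it returns False; raise TimeoutError after timeout seconds."""
    deadline = clock() + timeout
    while exists():
        if clock() >= deadline:
            raise TimeoutError("object still present after {} seconds".format(timeout))
        sleep(poke_interval)
    return True


# Stand-in for a real existence check: the object is present twice, then gone.
answers = iter([True, True, False])
pauses = []
result = wait_until_gone(lambda: next(answers), poke_interval=5, sleep=pauses.append)
```

The sleep and clock parameters are only there so that the helper can be exercised without actually waiting; in a DAG the defaults would be used.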

Downloading only AWS S3 object file names and image URL in CSV Format

I have files hosted in an AWS S3 bucket. I need just the URLs of all objects in the bucket, in a CSV file.
Please suggest
You can get all S3 object URLs by using the AWS SDK for S3. First, read all items in the bucket. The Java code below shows the logic, which you can port to Python:
ListObjectsRequest listObjects = ListObjectsRequest
    .builder()
    .bucket(bucketName)
    .build();
ListObjectsResponse res = s3.listObjects(listObjects);
List<S3Object> objects = res.contents();
for (ListIterator iterVals = objects.listIterator(); iterVals.hasNext(); ) {
    S3Object myValue = (S3Object) iterVals.next();
    System.out.print("\n The name of the key is " + myValue.key());
}
Then iterate through the list and get each key as shown above. For each object, you can get the URL with logic similar to this (again shown in Java):
GetUrlRequest request = GetUrlRequest.builder()
    .bucket(bucketName)
    .key(keyName)
    .build();
URL url = s3.utilities().getUrl(request);
System.out.println("The URL for " + keyName + " is " + url.toString());
Put each URL value into a collection and then write the collection out to a CSV. That is how you achieve your use case.
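A rough Python port of the same steps might look like the sketch below. It assumes boto3 for the listing and builds virtual-hosted-style URLs by hand. The function names, the bucket name, the region and the sample keys are all hypothetical. Such URLs are only directly downloadable when the objects are publicly readable.

```python
import csv
import io
from urllib.parse import quote


def object_url(bucket, key, region="us-east-1"):
    """Build a virtual-hosted-style URL; the key is percent-encoded, '/' is kept."""
    return "https://{}.s3.{}.amazonaws.com/{}".format(bucket, region, quote(key, safe="/"))


def list_keys(bucket):
    """Yield every key in the bucket, following pagination."""
    import boto3  # imported here so the URL/CSV helpers work without boto3
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def write_urls_csv(out, bucket, keys, region="us-east-1"):
    """Write one 'key,url' row per object to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(["key", "url"])
    for key in keys:
        writer.writerow([key, object_url(bucket, key, region)])


# Hypothetical bucket and keys; with AWS access, pass list_keys("my-bucket") instead.
buffer = io.StringIO()
write_urls_csv(buffer, "my-bucket", ["images/cat 1.png", "a.txt"])
csv_text = buffer.getvalue()
```

To write to disk instead of a buffer, open the target with open("urls.csv", "w", newline="") and pass that file object as out.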

Triggering AWS Lambda function from Airflow

I have created a function in AWS lambda which looks like this:
import boto3
import numpy as np
import pandas as pd
import s3fs
from io import StringIO
def test(event=None, context=None):
    # creating a pandas dataframe from an api
    # placing 2 csv files in S3 bucket
This function queries an external API and places 2 CSV files in an S3 bucket. I want to trigger this function from Airflow, and I have found this code:
import boto3, json, typing
def invokeLambdaFunction(*, functionName:str=None, payload:typing.Mapping[str, str]=None):
    if functionName is None:
        raise Exception('ERROR: functionName parameter cannot be NULL')
    payloadStr = json.dumps(payload)
    payloadBytesArr = bytes(payloadStr, encoding='utf8')
    client = boto3.client('lambda')
    response = client.invoke(
        FunctionName=functionName,  # use the argument, not the undefined name `test`
        InvocationType="RequestResponse",
        Payload=payloadBytesArr
    )
    return response
if __name__ == '__main__':
    payloadObj = {"something" : "1111111-222222-333333-bba8-1111111"}
    response = invokeLambdaFunction(functionName='test', payload=payloadObj)
    print(f'response:{response}')
But as I understand it, this code snippet does not connect to S3. Is this the right approach to trigger an AWS Lambda function from Airflow, or is there a better way?
I would advise using the AwsLambdaHook:
https://airflow.apache.org/docs/stable/_api/airflow/contrib/hooks/aws_lambda_hook/index.html#module-airflow.contrib.hooks.aws_lambda_hook
And you can check a test showing its usage to trigger a lambda function:
https://github.com/apache/airflow/blob/master/tests/providers/amazon/aws/hooks/test_lambda_function.py
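If the hook is not available in a given Airflow version, calling boto3 from a PythonOperator is a workable fallback. The sketch below is one possible shape for the callable, not the hook's implementation. The name invoke_lambda is hypothetical, and FakeLambdaClient only stands in for boto3.client('lambda') so that the example runs without AWS access. The invoking code does not need to touch S3 itself, because the Lambda function writes the files under its own execution role.

```python
import io
import json


def invoke_lambda(client, function_name, payload=None):
    """Invoke a Lambda function synchronously and return its decoded JSON result."""
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload or {}).encode("utf-8"),
    )
    body = response["Payload"].read()
    # FunctionError is only present when the function itself raised an error.
    if response.get("FunctionError"):
        raise RuntimeError("Lambda {} failed: {}".format(function_name, body.decode("utf-8")))
    return json.loads(body) if body else None


class FakeLambdaClient:
    """Stand-in for boto3.client('lambda'); echoes back what it received."""

    def invoke(self, FunctionName, InvocationType, Payload):
        echoed = {"function": FunctionName, "received": json.loads(Payload)}
        return {"StatusCode": 200, "Payload": io.BytesIO(json.dumps(echoed).encode("utf-8"))}


result = invoke_lambda(FakeLambdaClient(), "test", {"something": "1111111"})
```

In a DAG, the python_callable would pass boto3.client('lambda') instead of the fake client. Raising on FunctionError makes the Airflow task fail when the Lambda function fails, rather than reporting success on a 200 status code.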

Querying Athena tables in AWS Glue Python Shell

Python Shell jobs were introduced in AWS Glue. They mentioned:
You can now use Python shell jobs, for example, to submit SQL queries to services such as ... Amazon Athena ...
Ok. We have an example to read data from Athena tables here:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()
# TODO query all persons
However, it uses Spark instead of Python Shell. The libraries that are normally available with the Spark job type are missing there, and I get an error:
ModuleNotFoundError: No module named 'awsglue.transforms'
How can I rewrite the code above to make it executable in the Python Shell job type?
The thing is, Python Shell type has its own limited set of built-in libraries.
I only managed to achieve my goal using Boto 3 to query data and Pandas to read it into a dataframe.
Here is the code snippet:
import time
import boto3
import pandas as pd
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
athena_client = boto3.client(service_name='athena', region_name='us-east-1')
bucket_name = 'bucket-with-csv'
print('Working bucket: {}'.format(bucket_name))
def run_query(client, query):
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={ 'Database': 'sample-db' },
        ResultConfiguration={ 'OutputLocation': 's3://{}/fromglue/'.format(bucket_name) },
    )
    return response
def validate_query(client, query_id):
    resp = ["FAILED", "SUCCEEDED", "CANCELLED"]
    response = client.get_query_execution(QueryExecutionId=query_id)
    # wait until query finishes
    while response["QueryExecution"]["Status"]["State"] not in resp:
        time.sleep(1)  # brief pause between status checks
        response = client.get_query_execution(QueryExecutionId=query_id)
    return response["QueryExecution"]["Status"]["State"]
def read(query):
    print('start query: {}\n'.format(query))
    qe = run_query(athena_client, query)
    qstate = validate_query(athena_client, qe["QueryExecutionId"])
    print('query state: {}\n'.format(qstate))
    # Athena writes the result as <QueryExecutionId>.csv under the output location
    file_name = "fromglue/{}.csv".format(qe["QueryExecutionId"])
    obj = s3_client.get_object(Bucket=bucket_name, Key=file_name)
    return pd.read_csv(obj['Body'])
time_entries_df = read('SELECT * FROM sample-table')
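A variant of the polling step that adds an overall timeout and surfaces the failure reason could look like the sketch below. The name wait_for_query and its parameters are hypothetical, and FakeAthenaClient only imitates the shape of the get_query_execution response so that the example runs without AWS access.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_id, poll_seconds=1.0, timeout=300.0, sleep=time.sleep):
    """Poll get_query_execution until the query reaches a terminal state."""
    waited = 0.0
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            if status["State"] != "SUCCEEDED":
                reason = status.get("StateChangeReason", "no reason given")
                raise RuntimeError("query {} {}: {}".format(query_id, status["State"], reason))
            return status["State"]
        if waited >= timeout:
            raise TimeoutError("query {} still running after {}s".format(query_id, timeout))
        sleep(poll_seconds)
        waited += poll_seconds


class FakeAthenaClient:
    """Stand-in for boto3.client('athena'); returns the given states in order."""

    def __init__(self, states):
        self._states = iter(states)

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": next(self._states)}}}


pauses = []
final_state = wait_for_query(
    FakeAthenaClient(["QUEUED", "RUNNING", "SUCCEEDED"]), "abc", sleep=pauses.append
)
```

Raising on FAILED or CANCELLED avoids trying to read a result file that was never written.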
SparkContext won't be available in a Glue Python Shell job, so you need to rely on Boto3 and Pandas to retrieve the data. However, querying Athena with boto3 and polling the execution ID to check whether the query has finished adds a lot of overhead.
Recently awslabs released a new package called AWS Data Wrangler. It extends the power of the Pandas library to AWS, making it easy to interact with Athena and many other AWS services.
Reference links:
https://github.com/awslabs/aws-data-wrangler
https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/006%20-%20Amazon%20Athena.ipynb
Note: The AWS Data Wrangler library won't be available by default inside a Glue Python shell job. To include it, follow the instructions at the following link:
https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs
I have been using Glue for a few months, and I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
data_frame = spark.read.format("com.databricks.spark.csv")\
    .option("header","true")\
    .load(<CSVs THAT IS USING FOR ATHENA - STRING>)

Amazon AWS - S3 to ElasticSearch (Python Lambda)

I'd like to copy data from an S3 directory to the Amazon ElasticSearch service. I've tried following the guide, but unfortunately the part I'm looking for is missing. I don't know what the Lambda function itself should look like (all the guide says about this is: "Place your application source code in the eslambda folder."). I'd like ES to auto-index the files.
Currently I'm trying
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = urllib.unquote_plus(record['s3']['object']['key'])
    index_name = event.get('index_name', key.split('/')[0])
    object = s3_client.Object(bucket, key)
    data = object.get()['Body'].read()
    helpers.bulk(es, data, chunk_size=100)
But I get a massive error stating
elasticsearch.exceptions.RequestError: TransportError(400, u'action_request_validation_exception', u'Validation Failed: 1: index is missing;2: type is missing;3: index is missing;4: type is missing;5: index is missing;6: type is missing;7: ...
Could anyone explain to me, how can I set things up so that my data gets moved from S3 to ES where it gets auto-mapped and auto-indexed? Apparently it's possible, as mentioned in the reference here and here.
While mapping can automatically be assigned in Elasticsearch, the indexes are not automatically generated. You have to specify the index name and type in the POST request. If that index does not exist, then Elasticsearch will create the index automatically.
Based on your error, it looks like you're not passing through an index and type.
For example, here is a simple POST request that adds a record to the index MyIndex with type MyType; it first creates the index and type if they do not already exist.
curl -XPOST 'example.com:9200/MyIndex/MyType/' \
    -d '{"name":"john", "tags" : ["red", "blue"]}'
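Applied to the bulk helper, this means helpers.bulk should receive an iterable of action documents that each name their target index, rather than the raw bytes read from S3. The sketch below assumes, purely for illustration, that the S3 object is CSV text. The function name csv_to_actions, the sample data and the index and type names are hypothetical.

```python
import csv
import io


def csv_to_actions(text, index_name, doc_type=None):
    """Yield one bulk action per CSV row, each carrying its target index."""
    for row in csv.DictReader(io.StringIO(text)):
        action = {"_index": index_name, "_source": dict(row)}
        if doc_type is not None:
            # Only older Elasticsearch versions use mapping types.
            action["_type"] = doc_type
        yield action


sample = "name,color\njohn,red\njane,blue\n"
actions = list(csv_to_actions(sample, "myindex", doc_type="mytype"))
# helpers.bulk(es, actions, chunk_size=100) would then send them to the cluster.
```

In the Lambda handler, the text would come from decoding the object body, and index_name could be derived from the key as in the question.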
I wrote a script to download a csv file from S3 and then transfer the data to ES.
1. Made an S3 client using boto3 and downloaded the file from S3.
2. Made an ES client to connect to Elasticsearch.
3. Opened the csv file and used the helpers module from elasticsearch to insert the csv file contents into Elasticsearch.
main.py
import boto3
from elasticsearch import helpers, Elasticsearch
import csv
import os
from config import *
#S3
Downloaded_Filename=os.path.basename(Prefix)
s3 = boto3.client('s3', aws_access_key_id=awsaccesskey,aws_secret_access_key=awssecretkey,region_name=awsregion)
s3.download_file(Bucket,Prefix,Downloaded_Filename)
#ES
ES_index = Downloaded_Filename.split(".")[0]
ES_client = Elasticsearch([ES_host],http_auth=(ES_user, ES_password),port=ES_port)
#S3 to ES
with open(Downloaded_Filename) as f:
    reader = csv.DictReader(f)
    helpers.bulk(ES_client, reader, index=ES_index, doc_type='my-type')
config.py
awsaccesskey = ""
awssecretkey = ""
awsregion = "us-east-1"
Bucket=""
Prefix=''
ES_host = "localhost"
ES_port = "9200"
ES_user = "elastic"
ES_password = "changeme"
