I have a lambda function that looks like this:
client = boto3.client('glue')
glueJobName = "Python Glue Script"
inputtedData = "A1B2C3D4E5"
school = "testing"

def lambda_handler(event, context):
    response = client.start_job_run(JobName=glueJobName, Arguments={'--inputtedData': inputtedData, '--school': school})
    return response
This starts my Glue job, which runs a script. However, I want to pass the arguments 'inputtedData' and 'school' to that script, so that when the script starts, these variables are fed into my syncData function like this:
def syncAttendance(inputtedData, school):
    schoolName = school
    Data = inputtedData
    print(schoolName, Data)

syncData(inputtedData, school)
How do I receive these variables in the glue script?
You need to use the getResolvedOptions function as follows:
import sys
from awsglue.utils import getResolvedOptions
options = ['inputtedData', 'school']
args = getResolvedOptions(sys.argv, options)
syncData(args['inputtedData'], args['school'])
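For this to line up, note that the Lambda side passes the argument keys with a leading --, while getResolvedOptions looks them up without the prefix. A minimal sketch of the calling side, reusing the names from the question:

import boto3

client = boto3.client('glue')

def lambda_handler(event, context):
    # Glue job arguments are keyed with a leading "--";
    # getResolvedOptions later resolves them without the prefix.
    return client.start_job_run(
        JobName="Python Glue Script",
        Arguments={'--inputtedData': 'A1B2C3D4E5', '--school': 'testing'},
    )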
I'm building a CloudFormation deployment that includes a Lambda function written in Python 3.9. However, when I build the function, it will not allow me to keep the single quotes. This hasn't been an issue for most of the script, as I simply import json and the double quotes (") work fine, but one section requires the single quotes.
Here is the code:
import boto3
import json

def lambda_handler(event, context):
    client = client_obj()
    associated = associated_list(client)
    response = client.list_resolver_query_log_configs(
        MaxResults=1,
    )
    config = response['ResolverQueryLogConfigs'][0]['Id']
    ec2 = boto3.client('ec2')
    vpc = ec2.describe_vpcs()
    vpcs = vpc['Vpcs']
    for v in vpcs:
        if v['VpcId'] not in associated:
            client.associate_resolver_query_log_config(
                ResolverQueryLogConfigId=f"{config}",
                ResourceId=f"{v['VpcId']}"
            )
        else:
            print(f"{v['VpcId']} is already linked.")

def client_obj():
    client = boto3.client('route53resolver')
    return client

def associated_list(client_object):
    associated = list()
    assoc = client_object.list_resolver_query_log_config_associations()
    for element in assoc['ResolverQueryLogConfigAssociations']:
        associated.append(element['ResourceId'])
    return associated
Any section that includes f"{v['VpcId']}" requires the single quotes inside the [] for the script to run properly. Since YAML requires the script to be encapsulated in single quotes for packaging, how can I fix this?
Example in YAML from another script:
CreateIAMUser:
  Type: 'AWS::Lambda::Function'
  Properties:
    Code:
      ZipFile: !Join
        - |+

        - - import boto3
          - 'import json'
          - 'from botocore.exceptions import ClientError'
          - ''
          - ''
          - 'def lambda_handler(event, context):'
          - '    iam_client = boto3.client("iam")'
          - ''
          - '    account_id = boto3.client("sts").get_caller_identity()["Account"]'
          - ''
I imagine I could re-arrange the script to avoid this, but I would like to use this opportunity to learn something new if possible.
Not sure what you are trying to do, but usually you just use a pipe (a literal block scalar) in YAML for that:
Code:
  ZipFile: |
    import boto3
    import json

    def lambda_handler(event, context):
        client = client_obj()
        associated = associated_list(client)
        response = client.list_resolver_query_log_configs(
            MaxResults=1,
        )
        config = response['ResolverQueryLogConfigs'][0]['Id']
        ec2 = boto3.client('ec2')
        vpc = ec2.describe_vpcs()
        vpcs = vpc['Vpcs']
        for v in vpcs:
            if v['VpcId'] not in associated:
                client.associate_resolver_query_log_config(
                    ResolverQueryLogConfigId=f"{config}",
                    ResourceId=f"{v['VpcId']}"
                )
            else:
                print(f"{v['VpcId']} is already linked.")

    def client_obj():
        client = boto3.client('route53resolver')
        return client

    def associated_list(client_object):
        associated = list()
        assoc = client_object.list_resolver_query_log_config_associations()
        for element in assoc['ResolverQueryLogConfigAssociations']:
            associated.append(element['ResourceId'])
        return associated
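This works because a literal block scalar (|) hands the embedded script to CloudFormation verbatim, so the single quotes inside f"{v['VpcId']}" survive untouched. A minimal sketch, assuming you have PyYAML installed locally, that illustrates the parsing:

import yaml  # assumption: PyYAML, used here only to demonstrate how | is parsed

doc = yaml.safe_load("""
Code:
  ZipFile: |
    config = response['ResolverQueryLogConfigs'][0]['Id']
    print(f"{config} is already linked.")
""")
print(doc["Code"]["ZipFile"])  # the single quotes come back exactly as written

And if you ever do need a single-quoted YAML scalar that contains a single quote, you escape it by doubling it ('').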
I have created a function in AWS lambda which looks like this:
import boto3
import numpy as np
import pandas as pd
import s3fs
from io import StringIO
def test(event=None, context=None):
    # creating a pandas dataframe from an api
    # placing 2 csv files in S3 bucket
This function queries an external API and places 2 CSV files in an S3 bucket. I want to trigger this function from Airflow, and I have found this code:
import boto3, json, typing

def invokeLambdaFunction(*, functionName:str=None, payload:typing.Mapping[str, str]=None):
    if functionName == None:
        raise Exception('ERROR: functionName parameter cannot be NULL')
    payloadStr = json.dumps(payload)
    payloadBytesArr = bytes(payloadStr, encoding='utf8')
    client = boto3.client('lambda')
    response = client.invoke(
        FunctionName=functionName,
        InvocationType="RequestResponse",
        Payload=payloadBytesArr
    )
    return response

if __name__ == '__main__':
    payloadObj = {"something" : "1111111-222222-333333-bba8-1111111"}
    response = invokeLambdaFunction(functionName='test', payload=payloadObj)
    print(f'response:{response}')
But as I understand it, this code snippet does not connect to S3. Is this the right approach to trigger an AWS Lambda function from Airflow, or is there a better way?
I would advise using the AwsLambdaHook:
https://airflow.apache.org/docs/stable/_api/airflow/contrib/hooks/aws_lambda_hook/index.html#module-airflow.contrib.hooks.aws_lambda_hook
And you can check a test showing how it is used to trigger a Lambda function:
https://github.com/apache/airflow/blob/master/tests/providers/amazon/aws/hooks/test_lambda_function.py
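A minimal sketch, assuming the Airflow 1.10.x contrib hook linked above (in newer releases it lives in the Amazon provider package), of calling it from a PythonOperator callable:

import json
from airflow.contrib.hooks.aws_lambda_hook import AwsLambdaHook

def trigger_lambda(**context):
    # Assumption: "test" is the name of your deployed Lambda function.
    hook = AwsLambdaHook(function_name='test',
                         region_name='us-east-1',          # adjust to your region
                         invocation_type='RequestResponse')
    response = hook.invoke_lambda(payload=json.dumps(
        {"something": "1111111-222222-333333-bba8-1111111"}))
    print(response)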
Python shell jobs were introduced in AWS Glue. The announcement mentioned:
You can now use Python shell jobs, for example, to submit SQL queries to services such as ... Amazon Athena ...
OK. We have an example that reads data from Athena tables here:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()
# TODO query all persons
However, it uses Spark instead of Python shell. The libraries that are normally available with the Spark job type are not present here, and I get this error:
ModuleNotFoundError: No module named 'awsglue.transforms'
How can I rewrite the code above to make it executable in the Python Shell job type?
The thing is, the Python shell job type has its own limited set of built-in libraries.
I only managed to achieve my goal using Boto 3 to query data and Pandas to read it into a dataframe.
Here is the code snippet:
import time

import boto3
import pandas as pd

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
athena_client = boto3.client(service_name='athena', region_name='us-east-1')

bucket_name = 'bucket-with-csv'
print('Working bucket: {}'.format(bucket_name))

def run_query(client, query):
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={ 'Database': 'sample-db' },
        ResultConfiguration={ 'OutputLocation': 's3://{}/fromglue/'.format(bucket_name) },
    )
    return response

def validate_query(client, query_id):
    terminal_states = ["FAILED", "SUCCEEDED", "CANCELLED"]
    response = client.get_query_execution(QueryExecutionId=query_id)
    # wait until the query reaches a terminal state
    while response["QueryExecution"]["Status"]["State"] not in terminal_states:
        time.sleep(1)  # avoid hammering the API while polling
        response = client.get_query_execution(QueryExecutionId=query_id)
    return response["QueryExecution"]["Status"]["State"]

def read(query):
    print('start query: {}\n'.format(query))
    qe = run_query(athena_client, query)
    qstate = validate_query(athena_client, qe["QueryExecutionId"])
    print('query state: {}\n'.format(qstate))
    file_name = "fromglue/{}.csv".format(qe["QueryExecutionId"])
    obj = s3_client.get_object(Bucket=bucket_name, Key=file_name)
    return pd.read_csv(obj['Body'])

time_entries_df = read('SELECT * FROM sample-table')
SparkContext won't be available in Glue Python shell. Hence you need to depend on Boto3 and Pandas to handle the data retrieval. But it adds a lot of overhead to query Athena using Boto3 and poll the QueryExecutionId to check whether the query execution has finished.
Recently awslabs released a new package called AWS Data Wrangler. It extends the power of the Pandas library to AWS, making it easy to interact with Athena and a lot of other AWS services.
Reference links:
https://github.com/awslabs/aws-data-wrangler
https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/006%20-%20Amazon%20Athena.ipynb
Note: the AWS Data Wrangler library won't be available by default inside the Glue Python shell. To include it in the Python shell, follow the instructions at the following link:
https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs
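Once it is wired into the job, a minimal sketch (the database and table names below are placeholders) of reading an Athena query straight into a Pandas dataframe:

import awswrangler as wr

# Assumption: "sample-db" / "sample_table" are your Athena database and table.
df = wr.athena.read_sql_query("SELECT * FROM sample_table", database="sample-db")
print(df.head())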
I have been using Glue for a few months; I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
data_frame = spark.read.format("com.databricks.spark.csv")\
    .option("header", "true")\
    .load(<CSVs THAT IS USING FOR ATHENA - STRING>)
I'm using Gcloud Composer as my Airflow. When I try to use Jinja in my HQL code, it does not translate it correctly.
I know that the HiveOperator has a Jinja translator as I'm used to it, but the DataProcHiveOperator doesn't.
I've tried to use hiveconf directly in my HQL files, but when setting those values for my partition (i.e. INSERT INTO TABLE abc PARTITION (ds = ${hiveconf:ds})), it doesn't work.
I have also added the following to my HQL file:
SET ds=to_date(current_timestamp());
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
But it didn't work, as Hive transforms the formula above into a STRING.
So my idea was to combine both operators to have the Jinja translator working, but when I do that I get the following error: ERROR - submit() takes from 3 to 4 positional arguments but 5 were given.
I'm not very familiar with Python coding and any help would be great; see the code below for the operator I'm trying to build.
Header of the Python File (please note that the file contains other Operators not mentioned in this question):
import ntpath
import os
import re
import time
import uuid
from datetime import timedelta
from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.version import version
from googleapiclient.errors import HttpError
from airflow.utils import timezone
from airflow.utils.operator_helpers import context_to_airflow_vars
Modified DataProcHiveOperator:
class DataProcHiveOperator(BaseOperator):
    template_fields = ['query', 'variables', 'job_name', 'cluster_name', 'dataproc_jars']
    template_ext = ('.q',)
    ui_color = '#0273d4'

    @apply_defaults
    def __init__(
            self,
            query=None,
            query_uri=None,
            hiveconfs=None,
            hiveconf_jinja_translate=False,
            variables=None,
            job_name='{{task.task_id}}_{{ds_nodash}}',
            cluster_name='cluster-1',
            dataproc_hive_properties=None,
            dataproc_hive_jars=None,
            gcp_conn_id='google_cloud_default',
            delegate_to=None,
            region='global',
            job_error_states=['ERROR'],
            *args,
            **kwargs):
        super(DataProcHiveOperator, self).__init__(*args, **kwargs)
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.query = query
        self.query_uri = query_uri
        self.hiveconfs = hiveconfs or {}
        self.hiveconf_jinja_translate = hiveconf_jinja_translate
        self.variables = variables
        self.job_name = job_name
        self.cluster_name = cluster_name
        self.dataproc_properties = dataproc_hive_properties
        self.dataproc_jars = dataproc_hive_jars
        self.region = region
        self.job_error_states = job_error_states

    def prepare_template(self):
        if self.hiveconf_jinja_translate:
            self.query_uri = re.sub(
                r"(\$\{(hiveconf:)?([ a-zA-Z0-9_]*)\})", r"{{ \g<3> }}", self.query_uri)

    def execute(self, context):
        hook = DataProcHook(gcp_conn_id=self.gcp_conn_id,
                            delegate_to=self.delegate_to)
        job = hook.create_job_template(self.task_id, self.cluster_name, "hiveJob",
                                       self.dataproc_properties)

        if self.query is None:
            job.add_query_uri(self.query_uri)
        else:
            job.add_query(self.query)

        if self.hiveconf_jinja_translate:
            self.hiveconfs = context_to_airflow_vars(context)
        else:
            self.hiveconfs.update(context_to_airflow_vars(context))

        job.add_variables(self.variables)
        job.add_jar_file_uris(self.dataproc_jars)
        job.set_job_name(self.job_name)

        job_to_submit = job.build()
        self.dataproc_job_id = job_to_submit["job"]["reference"]["jobId"]

        hook.submit(hook.project_id, job_to_submit, self.region, self.job_error_states)
I would like to be able to use Jinja templating inside my HQL code to allow partition automation on my data pipeline.
P.S.: I'll use the Jinja templating mostly for the partition datestamp.
Does anyone know what the error message I'm getting means, and can you help me solve it?
ERROR - submit() takes from 3 to 4 positional arguments but 5 were given
Thank you!
It is because of the 5th argument, job_error_states, which exists only in master and not in the current stable release (1.10.1).
Source Code for 1.10.1 -> https://github.com/apache/incubator-airflow/blob/76a5fc4d2eb3c214ca25406f03b4a0c5d7250f71/airflow/contrib/hooks/gcp_dataproc_hook.py#L219
So remove that parameter and it should work.
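In other words, the last line of execute() becomes (a sketch against the 1.10.1 hook, whose submit() only accepts project_id, job and region):

# DataProcHook.submit() in Airflow 1.10.1 does not know about job_error_states yet,
# so drop that argument:
hook.submit(hook.project_id, job_to_submit, self.region)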
I have a simple Lambda function that is located under the following endpoint:
https://******.execute-api.eu-west-2.amazonaws.com/lambda/add?x=1&y=2
AWS Chalice was used for adding simple endpoints here.
@app.route('/{exp}', methods=['GET'])
def add(exp):
    app.log.debug("Received GET request...")
    request = app.current_request
    app.log.debug(app.current_request.json_body)
    x = request.query_params['x']
    y = request.query_params['y']
    if exp == 'add':
        app.log.debug("Received ADD command...")
        result = int(x) + int(y)
        return {'add': result}
Basically, it checks if the path is equal to add and sums two values from query_params.
Now, I am trying to invoke this lambda in another lambda.
My question:
How can I pass the path and query_params to my original Lambda function using the boto3 Lambda client?
What I have tried so far:
I added two lines to the policy.json file that allow me to invoke the original function.
I saw a lot of similar questions on Stack Overflow, but most of them pass the payload as JSON.
@app.route('/')
def index():
    lambda_client = boto3.client('lambda')
    invoke_response = lambda_client.invoke(
        FunctionName="function-name",
        InvocationType="RequestResponse"
    )
    app.log.debug(invoke_response['Payload'].read())
Thank you in advance!
Maybe you can add the following code to your add function, so it accepts payloads too:
@app.route('/{exp}', methods=['GET'])
def add(exp, *args, **kwargs):
    if isinstance(exp, dict):
        # exp is the event here
        request = exp.get('request', {'x': 0, 'y': 0})
        exp = exp.get('exp', 'add')
I'm going to write a general example, and you can easily modify it to match your needs. In your case the data dictionary would have request and exp keys, and you need to find your Lambda function's ARN.
AWS Documentation Lambda.invoke
Let's assume from now on we have 2 Lambdas named "master" and "slave". master will call slave.
At the moment there are 3 types of invocations:
RequestResponse (Default): master calls and waits for slave response
Event: Async, master calls and forgets
DryRun: Do some verification before running
I'll stick with #1, RequestResponse:
Slave:
def lambda_handler(event, context):
    result = {}
    result['event'] = event
    result['result'] = "It's ok"
    return result
And its ARN is something like arn:aws:lambda:us-east-1:xxxxxxxxxxxxxx:function:slave
In the example, slave is just an echo function.
Now, the master needs the necessary role permissions to call it, and the ARN or name. Then you can write something like this:
import boto3
from datetime import datetime
import json

client = boto3.client('lambda')

def lambda_handler(event, context):
    arn = 'arn:aws:lambda:us-east-1:xxxxxxxxxxxxxx:function:slave'
    data = {'my_dict': {'one': 1, 'two': 2}, 'my_list': [1, 2, 3], 'my_date': datetime.now().isoformat()}
    response = client.invoke(FunctionName=arn,
                             InvocationType='RequestResponse',
                             Payload=json.dumps(data))
    result = json.loads(response.get('Payload').read())
    return result
Usually you would get the ARN with something like os.environ.get('slave_arn').
All data from/to lambdas must be JSON serializable.
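Applied to the original question, the index() route could forward the path and query parameters in the payload, following the 'exp'/'request' convention from the first answer (the function name below is a placeholder):

import json
import boto3

lambda_client = boto3.client('lambda')

# app is the existing Chalice app from the question.
@app.route('/')
def index():
    invoke_response = lambda_client.invoke(
        FunctionName="function-name",                 # the add Lambda's name or ARN
        InvocationType="RequestResponse",
        Payload=json.dumps({'exp': 'add', 'request': {'x': 1, 'y': 2}}),
    )
    return json.loads(invoke_response['Payload'].read())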