I have created a function in AWS Lambda which looks like this:
import boto3
import numpy as np
import pandas as pd
import s3fs
from io import StringIO

def test(event=None, context=None):
    # creating a pandas dataframe from an api
    # placing 2 csv files in S3 bucket
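For context, the body might look roughly like this. This is only a sketch: the API endpoint, bucket name, and object key are placeholders, and the real function writes two CSV files rather than one.

import boto3
import pandas as pd
from io import StringIO

def test(event=None, context=None):
    # Pull data from the API into a DataFrame (placeholder endpoint)
    df = pd.read_json("https://api.example.com/data")

    # Serialise the DataFrame to CSV in memory and upload it to S3
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    boto3.client("s3").put_object(
        Bucket="my-bucket",        # placeholder bucket
        Key="exports/data.csv",    # placeholder key
        Body=csv_buffer.getvalue(),
    )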
This function queries an external API and places two CSV files in an S3 bucket. I want to trigger this function from Airflow, and I have found this code:
import boto3, json, typing

def invokeLambdaFunction(*, functionName: str = None, payload: typing.Mapping[str, str] = None):
    if functionName is None:
        raise Exception('ERROR: functionName parameter cannot be NULL')
    payloadStr = json.dumps(payload)
    payloadBytesArr = bytes(payloadStr, encoding='utf8')
    client = boto3.client('lambda')
    response = client.invoke(
        FunctionName=functionName,
        InvocationType="RequestResponse",
        Payload=payloadBytesArr
    )
    return response

if __name__ == '__main__':
    payloadObj = {"something": "1111111-222222-333333-bba8-1111111"}
    response = invokeLambdaFunction(functionName='test', payload=payloadObj)
    print(f'response: {response}')
But as I understand it, this code snippet does not connect to S3. Is this the right approach to trigger an AWS Lambda function from Airflow, or is there a better way?
I would advise using the AwsLambdaHook:
https://airflow.apache.org/docs/stable/_api/airflow/contrib/hooks/aws_lambda_hook/index.html#module-airflow.contrib.hooks.aws_lambda_hook
And you can check a test showing its usage to trigger a lambda function:
https://github.com/apache/airflow/blob/master/tests/providers/amazon/aws/hooks/test_lambda_function.py
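For reference, a minimal sketch of how the hook could be wired into a DAG with a PythonOperator (the DAG id, task id, and connection id are placeholders, and the exact import path and signature depend on your Airflow version, so double-check against the docs linked above):

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.hooks.aws_lambda_hook import AwsLambdaHook

def invoke_lambda():
    hook = AwsLambdaHook(
        function_name='test',               # the Lambda defined above
        invocation_type='RequestResponse',
        log_type='None',
        aws_conn_id='aws_default',          # placeholder Airflow connection
    )
    payload = json.dumps({"something": "1111111-222222-333333-bba8-1111111"})
    response = hook.invoke_lambda(payload=payload)
    print('response: {}'.format(response))

with DAG('trigger_lambda_dag', start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id='invoke_lambda', python_callable=invoke_lambda)

In newer Airflow versions the same hook lives in the Amazon provider package (airflow.providers.amazon.aws.hooks.lambda_function), so adjust the import accordingly.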
I'm trying to build my first cloud function. It's a function that should get data from an API, transform it into a DataFrame, and push it to BigQuery. I've set the cloud function up with an HTTP trigger using validate_http as the entry point. The problem is that it states the function is working, but it doesn't actually write anything. It's a similar problem to the one discussed here: Passing data from http api to bigquery using google cloud function python
import pandas as pd
import json
import requests
from pandas.io import gbq
import pandas_gbq
import gcsfs

# function 1: Responding to and validating any HTTP request
def validate_http(request):
    request_json = request.get_json()
    if request.args:
        get_api_data()
        return f'Data pull complete'
    elif request_json:
        get_api_data()
        return f'Data pull complete'
    else:
        get_api_data()
        return f'Data pull complete'
# function 2: Get data and transform
def get_api_data():
    import pandas as pd
    import requests
    import json

    # Setting up variables with tokens
    base_url = "https://"
    token = "&token="
    token2 = "&token="
    fields = "&fields=date,id,shippingAddress,items"
    date_filter = "&filter=date in '2022-01-22'"
    data_limit = "&limit=99999999"

    # Performing API call on request with variables
    def main_requests(base_url, token, fields, date_filter, data_limit):
        req = requests.get(base_url + token + fields + date_filter + data_limit)
        return req.json()

    # Making API call and storing in data
    data = main_requests(base_url, token, fields, date_filter, data_limit)

    # Transforming the data
    df = pd.json_normalize(data['orders']).explode('items').reset_index(drop=True)
    items = df['items'].agg(pd.Series)[['id', 'itemNumber', 'colorNumber', 'amount', 'size', 'quantity', 'quantityReturned']]
    df = df.drop(columns=['items', 'shippingAddress.id', 'shippingAddress.housenumber', 'shippingAddress.housenumberExtension', 'shippingAddress.address2', 'shippingAddress.name', 'shippingAddress.companyName', 'shippingAddress.street', 'shippingAddress.postalcode', 'shippingAddress.city', 'shippingAddress.county', 'shippingAddress.countryId', 'shippingAddress.email', 'shippingAddress.phone'])
    df = df.rename(columns={
        'date': 'Date',
        'shippingAddress.countryIso': 'Country',
        'id': 'order_id'
    })
    df = pd.concat([df, items], axis=1, join='inner')

    # Push data function
    bq_load('Return_data_api', df)

# function 3: Convert to bigquery table
def bq_load(key, value):
    project_name = '375215'
    dataset_name = 'Returns'
    table_name = key
    value.to_gbq(destination_table='{}.{}'.format(dataset_name, table_name), project_id=project_name, if_exists='replace')
The problem is that the script doesn't write to BigQuery and doesn't return any error. I know that the get_api_data() function is working, since I tested it locally and it does seem to be able to write to BigQuery. Using Cloud Functions I can't seem to trigger this function and make it write data to BigQuery.
There are a couple of things wrong with the code; fixing them should set you right.
You have list data, so store it as a CSV file (in preference to JSON).
This would mean updating (and probably renaming) the JsonArrayStore class and its methods to work with CSV.
Once you have completed the above and written well-formed CSV, you can proceed to this:
Reading the CSV in the del_btn method would then look like this:
import csv
import tkinter as tk

class ToDoGUI(tk.Tk):
    ...
    # methods
    ...
    def del_btn(self):
        a = JsonArrayStore('test1.csv')
        # read to list
        with open('test1.csv') as csvfile:
            reader = csv.reader(csvfile)
            data = list(reader)
        print(data)
Good work, you have a lot to do; if you get stuck further, please post again.
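For the writing side, a rough sketch of what the CSV-based replacement for JsonArrayStore could look like (the class and method names here are assumptions, not your actual code):

import csv

class CsvArrayStore:
    def __init__(self, filename):
        self.filename = filename

    def save(self, rows):
        # rows is expected to be a list of lists, one inner list per record
        with open(self.filename, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows(rows)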
I have a Lambda that puts some data into a Kinesis stream. That works perfectly fine; I can see the data written to Kinesis in the CloudWatch logs.
Now I am writing another Lambda that reads from that Kinesis stream when some data is available on it. I am able to get the shard iterator, but when I pass it to get_records, it says the parameter is invalid.
import boto3
import json
import time
from pprint import pprint
import base64

kinesis_client = boto3.client('kinesis')

def lambda_handler(event, context):
    response = kinesis_client.describe_stream(StreamName='teststream')
    my_shard_id = response['StreamDescription']['Shards'][0]['ShardId']
    # print(my_shard_id)

    shard_iterator = kinesis_client.get_shard_iterator(StreamName='teststream',
                                                       ShardId=my_shard_id,
                                                       ShardIteratorType='LATEST')
    # pprint(shard_iterator)

    # I am able to print the shard iterator, but when I do get_records it says Invalid type
    # for parameter ShardIterator, value: {'ShardIterator':
    record_response = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=10)
    # pprint(record_response)
This is the error I get when get_records is called.
The boto3 get_shard_iterator() call returns a JSON object:
{
    'ShardIterator': 'string'
}
Therefore, use:
shard_iterator_response = kinesis_client.get_shard_iterator(
    StreamName='teststream',
    ShardId=my_shard_id,
    ShardIteratorType='LATEST'
)

shard_iterator = shard_iterator_response['ShardIterator']
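With the iterator string extracted, the get_records call from the question then works unchanged, for example:

record_response = kinesis_client.get_records(
    ShardIterator=shard_iterator,
    Limit=10
)
for record in record_response.get('Records', []):
    print(record['Data'])  # boto3 returns each record payload as bytes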
I'd like to get all archives from a specific directory of an S3 bucket, like the following:
import boto3

def get_files_from_s3(bucket_name, s3_prefix):
    files = []
    s3_resource = boto3.resource("s3")
    bucket = s3_resource.Bucket(bucket_name)
    response = bucket.objects.filter(Prefix=s3_prefix)
    for obj in response:
        if obj.key.endswith('.zip'):
            # get all archives
            files.append(obj.key)
    return files
My question is about testing it: I'd like to mock the list of objects in the response so that I can iterate over it. Here is what I tried:
from unittest.mock import patch
from dataclasses import dataclass

@dataclass
class MockZip:
    key = 'file.zip'

@patch('module.boto3')
def test_get_files_from_s3(self, mock_boto3):
    bucket = mock_boto3.resource('s3').Bucket(self.bucket_name)
    response = bucket.objects.filter(Prefix=S3_PREFIX)
    response.return_value = [MockZip()]
    files = module.get_files_from_s3(BUCKET_NAME, S3_PREFIX)
    self.assertEqual(['file.zip'], files)
I get an assertion error like this: E AssertionError: ['file.zip'] != []
Does anyone have a better approach? I used a dataclass, but I don't think that is the problem; I guess I get an empty list because the mocked response is not iterable. So how can I mock it to be a list of mock objects instead of just a MagicMock type?
Thanks
You could use moto, an open-source library specifically built to mock boto3 calls. It allows you to work directly with boto3, without having to worry about setting up mocks manually.
The test function that you're currently using would look like this:
import os

import boto3
import pytest
from moto import mock_s3

@pytest.fixture(scope='function')
def aws_credentials():
    """Mocked AWS Credentials, to ensure we're not touching AWS directly"""
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'

@mock_s3
def test_get_files_from_s3(self, aws_credentials):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(self.bucket_name)
    # Create the bucket first, as we're interacting with an empty mocked 'AWS account'
    bucket.create()

    # Create some example files that are representative of what the S3 bucket would look like in production
    client = boto3.client('s3', region_name='us-east-1')
    client.put_object(Bucket=self.bucket_name, Key="file.zip", Body="...")
    client.put_object(Bucket=self.bucket_name, Key="file.nonzip", Body="...")

    # Retrieve the files again using whatever logic
    files = module.get_files_from_s3(BUCKET_NAME, S3_PREFIX)
    self.assertEqual(['file.zip'], files)
Full documentation for Moto can be found here:
http://docs.getmoto.org/en/latest/index.html
Disclaimer: I am a maintainer for Moto.
I've got code that downloads a file from an S3 bucket using boto3.
# foo.py
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('mybucket').download_file(src_f, dest_f)
I'd now like to write a unit test for dl() using pytest, mocking the interaction with AWS with the stubber available in botocore.
import os

import boto3
import pytest
from botocore.stub import Stubber, ANY

from foo import dl

@pytest.fixture
def s3_client():
    yield boto3.client("s3")

def test_dl(s3_client):
    with Stubber(s3_client) as stubber:
        params = {"Bucket": ANY, "Key": ANY}
        response = {"Body": "lorem"}
        stubber.add_response(SOME_OBJ, response, params)

        dl('bucket_file.txt', 'tmp/bucket_file.txt')
        assert os.path.isfile('tmp/bucket_file.txt')
I'm not sure about the right approach for this. How do I add bucket_file.txt to the stubbed response? What object do I need to add_response() to (shown as SOME_OBJ)?
Have you considered using moto?
Your code could look the same way as it is right now:
# foo.py
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('mybucket').download_file(src_f, dest_f)
and the test:
import boto3
import os
from moto import mock_s3

from foo import dl

@mock_s3
def test_dl():
    s3 = boto3.client('s3', region_name='us-east-1')
    # We need to create the bucket since this is all in Moto's 'virtual' AWS account
    s3.create_bucket(Bucket='mybucket')
    s3.put_object(Bucket='mybucket', Key='bucket_file.txt', Body='')

    dl('bucket_file.txt', 'bucket_file.txt')
    assert os.path.isfile('bucket_file.txt')
The intention of the code becomes a bit more obvious, since you simply work with S3 as usual, except that there is no real S3 behind the method calls.
Python Shell jobs were introduced in AWS Glue. They mentioned:
You can now use Python shell jobs, for example, to submit SQL queries to services such as ... Amazon Athena ...
Ok. We have an example to read data from Athena tables here:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json")

print("Count: ", persons.count())
persons.printSchema()
# TODO query all persons
However, it uses Spark instead of Python Shell. The libraries that are normally available with the Spark job type are not present, and I get an error:
ModuleNotFoundError: No module named 'awsglue.transforms'
How can I rewrite the code above to make it executable in the Python Shell job type?
The thing is, the Python Shell job type has its own limited set of built-in libraries.
I only managed to achieve my goal using Boto 3 to query data and Pandas to read it into a dataframe.
Here is the code snippet:
import boto3
import pandas as pd

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
athena_client = boto3.client(service_name='athena', region_name='us-east-1')

bucket_name = 'bucket-with-csv'
print('Working bucket: {}'.format(bucket_name))

def run_query(client, query):
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'sample-db'},
        ResultConfiguration={'OutputLocation': 's3://{}/fromglue/'.format(bucket_name)},
    )
    return response

def validate_query(client, query_id):
    resp = ["FAILED", "SUCCEEDED", "CANCELLED"]
    response = client.get_query_execution(QueryExecutionId=query_id)
    # wait until query finishes
    while response["QueryExecution"]["Status"]["State"] not in resp:
        response = client.get_query_execution(QueryExecutionId=query_id)
    return response["QueryExecution"]["Status"]["State"]

def read(query):
    print('start query: {}\n'.format(query))
    qe = run_query(athena_client, query)
    qstate = validate_query(athena_client, qe["QueryExecutionId"])
    print('query state: {}\n'.format(qstate))

    file_name = "fromglue/{}.csv".format(qe["QueryExecutionId"])
    obj = s3_client.get_object(Bucket=bucket_name, Key=file_name)
    return pd.read_csv(obj['Body'])

time_entries_df = read('SELECT * FROM sample-table')
SparkContext won't be available in Glue Python Shell, so you need to depend on Boto3 and Pandas to handle the data retrieval. But querying Athena with boto3 and polling the query execution ID to check whether the query has finished adds a lot of overhead.
Recently, awslabs released a new package called AWS Data Wrangler. It extends the power of the Pandas library to AWS, making it easy to interact with Athena and a lot of other AWS services.
Reference links:
https://github.com/awslabs/aws-data-wrangler
https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/006%20-%20Amazon%20Athena.ipynb
Note: the AWS Data Wrangler library won't be available by default inside the Glue Python Shell. To include it in a Python Shell job, follow the instructions in the following link:
https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs
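With a recent version of the library installed, reading an Athena query result into a Pandas DataFrame looks roughly like this (the database and table names are placeholders; check the docs above for the exact API of the version you deploy):

import awswrangler as wr

# Runs the query on Athena and loads the result into a DataFrame,
# replacing the manual start_query_execution / polling / get_object flow above.
df = wr.athena.read_sql_query(
    'SELECT * FROM sample_table',
    database='sample-db',
)
print(df.head())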
I have been using Glue for a few months; I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

data_frame = spark.read.format("com.databricks.spark.csv")\
    .option("header", "true")\
    .load(<CSVs THAT IS USING FOR ATHENA - STRING>)