reading .csv from zip file in lambda from aws s3 bucket - python

I have a zip file in an S3 bucket. I read that zip file in memory (without extracting it to disk) and dump the data from each .csv into tables in my database. But while dumping the tables, I get foreign key constraint errors because the program dumps the CSVs in the order in which they arrive, so some tables are dumped before the tables they depend on. For example, I have two tables, 'dealer_data' and 'billing_data'; the structure of both tables is:
1) dealer_data
dealer_id : primary key
country
pincode
address
create_date
2) billing_data
bill_id : primary key
dealer_id : foreign key (references dealer_data.dealer_id)
bill_amount
bill_date
In the zip file, billing_data comes ahead of dealer_data, so I get a 'foreign key constraint' error. As a workaround I turned off foreign key checks while making the connection to the database. Is there any other way I can dump the tables into my database in the correct order?
Can I hold the tables in memory for a while and dump them later in the order I want? (A sketch of that idea follows the code below.)
My code goes like this:
import io
import json
import zipfile

import boto3
import pandas as pd
from sqlalchemy.exc import SQLAlchemyError

import database   # project-local module that exposes db_connection()
import helpers    # project-local module (logging, zip file name, ...)

# bucket_name is defined elsewhere in the project
def etl_job():
    data = json.load(open('path_to_json'))
    logger = helpers.setup_logging()
    s3_client = boto3.client('s3', aws_access_key_id=data['aws_access_key_id'],
                             aws_secret_access_key=data['aws_secret_access_key'])
    s3_resource = boto3.resource('s3', aws_access_key_id=data['aws_access_key_id'],
                                 aws_secret_access_key=data['aws_secret_access_key'])
    keys = []
    resp = s3_client.list_objects_v2(Bucket=bucket_name)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    for key in keys:
        names = key.split("/")
        obj = s3_resource.Bucket(bucket_name).Object(helpers.zip_file_name())
        buffer = io.BytesIO(obj.get()["Body"].read())
        zip_file = zipfile.ZipFile(buffer, 'r')
        logger.info("Name of csv in zip file :%s", zip_file.namelist())
        logs = ""
        dataframe = pd.DataFrame()
        for name_of_zipfile in zip_file.namelist():
            zip_open = pd.read_csv(zip_file.open(name_of_zipfile))
            zip_open = zip_open.dropna()
            table_name = name_of_zipfile.replace('.csv', '')
            try:
                zip_open.to_sql(name=table_name, con=database.db_connection(),
                                if_exists='append', index=False)
            except SQLAlchemyError as sqlalchemy_error:
                print(sqlalchemy_error)
                database.db_connection().execute('SET FOREIGN_KEY_CHECKS=1;')
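One way to do what the question above asks, sketched below under the assumption that the table names and their dependency order are known up front: read every CSV from the zip into a dataframe first, then write the frames in a hand-maintained order (parents before children). LOAD_ORDER is a hypothetical list, and the sketch reuses zip_file, pd and database from the code above.

LOAD_ORDER = ['dealer_data', 'billing_data']   # hypothetical, hand-maintained: parents first

# Buffer every CSV in memory as a dataframe, keyed by table name.
frames = {}
for name_of_zipfile in zip_file.namelist():
    table_name = name_of_zipfile.replace('.csv', '')
    frames[table_name] = pd.read_csv(zip_file.open(name_of_zipfile)).dropna()

# Write known tables in dependency order first, then anything not listed.
ordered = [t for t in LOAD_ORDER if t in frames]
ordered += [t for t in frames if t not in LOAD_ORDER]

connection = database.db_connection()
for table_name in ordered:
    frames[table_name].to_sql(name=table_name, con=connection,
                              if_exists='append', index=False)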

Related

Python boto3, list contents of specific dir in bucket, limit depth

This is the same as this question, but I also want to limit the depth returned.
Currently, all answers return all the objects after the specified prefix. I want to see just what's in the current hierarchy level.
Current code that returns everything:
self._session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
self._s3 = self._session.resource("s3")
bucket = self._s3.Bucket(bucket_name)
detections_contents = bucket.objects.filter(Prefix=prefix)
for object_summary in detections_contents:
    print(object_summary.key)
How can I see only the files and folders directly under prefix? How can I go n levels deep?
I could parse everything locally, but that is clearly not what I am looking for here.
There is no definite way to do this with list objects without getting all the objects under the prefix.
But there is a way using S3 Select, which uses an SQL-like query format, to go n levels deep and get the file content as well as the object keys.
If you are fine with writing SQL, then use this.
Reference doc
import boto3
import json

s3 = boto3.client('s3')

bucket_name = 'my-bucket'
prefix = 'my-directory/subdirectory/'

input_serialization = {
    'CompressionType': 'NONE',
    'JSON': {
        'Type': 'LINES'
    }
}
output_serialization = {
    'JSON': {}
}

# Set the SQL expression to select the key field for all objects in the subdirectory
expression = 'SELECT s.key FROM S3Object s WHERE s.key LIKE \'' + prefix + '%\''

response = s3.select_object_content(
    Bucket=bucket_name,
    ExpressionType='SQL',
    Expression=expression,
    InputSerialization=input_serialization,
    OutputSerialization=output_serialization
)

# The response contains a Payload field with the selected data
payload = response['Payload']
for event in payload:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        # Each output record is a JSON object on its own line with a "key" field
        for line in records.strip().splitlines():
            item = json.loads(line)
            print(item['key'])
There is no built-in way with the Boto3 or S3 APIs to do this. You'll need some version of processing each level and asking in turn for a list of objects at that level:
import boto3

s3 = boto3.client('s3')

bucket_name = 'my-bucket'   # bucket to list
max_depth = 2

paginator = s3.get_paginator('list_objects_v2')

# Track all prefixes to show as a list of (depth, prefix) tuples
common_prefixes = [(0, "")]

while len(common_prefixes) > 0:
    # Pull out the next prefix to show
    current_depth, current_prefix = common_prefixes.pop(0)

    # Loop through all of the items using a paginator to handle common prefixes with more
    # than a thousand items
    for page in paginator.paginate(Bucket=bucket_name, Prefix=current_prefix, Delimiter='/'):
        for cur in page.get("CommonPrefixes", []):
            # Show each common prefix, here just using a format like the AWS CLI does
            print(" " * 27 + f"PRE {cur['Prefix']}")
            if current_depth < max_depth:
                # This is below the max depth we want to show, so
                # add it to the list to be shown
                common_prefixes.append((current_depth + 1, cur['Prefix']))
        for cur in page.get("Contents", []):
            # Show each item sharing this common prefix using a format like the AWS CLI
            print(f"{cur['LastModified'].strftime('%Y-%m-%d %H:%M:%S')}{cur['Size']:11d} {cur['Key']}")

Storing SQL output as python variables and storing in text files

I'm pretty much new to Python and SQL and I am trying some coding tasks. I have a SQL query in the format below, where I return a set of values using Python and SQL. What I would like to do with Python is take the "X as User_Name" column and write it to a text file in my local Linux directory (for example, a file called Usernames.txt).
query = """\
Select
X as User_Name,
Y,
Z
FROM
tbl1
WHERE ...
AND ... """
In the snippet below I attempt to write this to the text file, but it does not seem to work for me:
cursor = connection.cursor()
....
fo = open('/localDrive/Usernames.txt', 'a')
for row in cursor:
    rows = list(row)
    fo.write(rows[0] + '\n')   # write one username per line
....
fo.close()
The issue is that sometimes more than one row is returned, so I need to store all the usernames in that text file. I'd then like to be able to check against this text file and not return the SQL output if the "X as User_Name" value already exists in the text file (Usernames.txt). This is something I'm not sure how to do.
Just use pickle to save your data, with dictionaries and sets to compare them. Pickle can save and load Python objects with no parsing required.
If you want human-readable output as well, just print the objects to the screen or to a file.
e.g. (untested)
import pickle
from pathlib import Path

pickle_path = Path("data.pickle")
fields = ('field_1', 'field_2', 'field_3')


def add_fields(data_list):
    # Return a list of dictionaries
    return [dict(zip(fields, row)) for row in data_list]


def get_unique_values(dict_list, key):
    # Return a set of key field values
    return set(dl[key] for dl in dict_list)


def get_data_subset(dict_list, key_field, keys):
    # Return records where key_field contains values in keys
    return [dl for dl in dict_list if dl[key_field] in keys]


# ...
# Create DB connection etc.
# ...
cursor = connection.cursor()
cursor.execute(query)
results = cursor.fetchall()

# De-serialise the local data if it exists
if pickle_path.exists():
    with pickle_path.open("rb") as pp:
        prev_results = pickle.load(pp)
else:
    prev_results = []

results = add_fields(results)

keys = get_unique_values(results, 'field_1')
prev_keys = get_unique_values(prev_results, 'field_1')

# All keys
all_keys = keys | prev_keys
# In both sets
existing_keys = keys & prev_keys
# Just in prev
deleted_keys = prev_keys - keys
# Just the new values in keys
new_keys = keys - prev_keys

# Example: handle deleted data
temp_dl = []
for row in prev_results:
    if row['field_1'] not in deleted_keys:
        temp_dl.append(row)
prev_results = temp_dl

# Example: handle new keys
new_data = get_data_subset(results, 'field_1', new_keys)
prev_results.extend(new_data)

# Serialise the local data
if pickle_path.exists():
    pickle_path.unlink()
with pickle_path.open("wb") as pp:
    pickle.dump(prev_results, pp)

if len(new_data):
    print("New records added")
    for row in new_data:
        print(row)
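To tie this back to the Usernames.txt requirement in the question, a short sketch (assuming field_1 holds the username and reusing new_keys from the code above) of appending only the newly seen names to the text file:

# Append only usernames that were not already known (new_keys from above).
with open('/localDrive/Usernames.txt', 'a') as fo:
    for username in sorted(new_keys):
        fo.write(username + '\n')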

How to query an external csv file uploaded through StringIO in S3 in Snowflake with the right format?

I've written these Python methods in my custom Airflow operator to convert a dictionary to a dataframe, then to a StringIO object, and upload it to S3 as a CSV file without saving it locally.
def execute(self, context):
    s3_hook = S3Hook(aws_conn_id=self.s3_conn_id)
    retailer, d1 = context['task_instance'].xcom_pull(self.data_source)
    self._upload_file(d1, retailer, s3_hook)

def _upload_to_s3(self, df, s3_hook):
    csv_buffer = StringIO()
    df.to_csv(csv_buffer)
    s3_hook.load_string(string_data=csv_buffer.getvalue(),
                        key=self.s3_key,
                        bucket_name=self.s3_bucket,
                        replace=True)

def _upload_file(self, d, retailer, s3_hook):
    self.s3_key = f"S3_STAGING/{retailer}/{retailer}_summary.csv"
    df = pd.DataFrame.from_dict(d, orient="index")
    df.index.name = 'product_code'
    self._upload_to_s3(df, s3_hook)
The DAG runs and uploads the file successfully, and the file looks normal when using S3 query on it. But when I try to query it in Snowflake:
select t.$1 as description,
       t.$2 as parent_company
from @S3_STAGING/S3_STAGING/sample/sample_summary.csv as t
All columns are concatenated into one for some reason. Is there any way to fix this?
Can you check whether you defined a specific field_delimiter for the stage? To be sure, you can create a file format and use it:
create file format myformat type = 'csv';
select t.$1 as description,
       t.$2 as parent_company
from @S3_STAGING/S3_STAGING/sample/sample_summary.csv (file_format => 'myformat') t;
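On the Python side it can also help to check the exact CSV text that gets uploaded, since df.to_csv(csv_buffer) writes a header row plus the index column. A minimal sketch with made-up data mirroring _upload_file(), just to inspect the delimiter and quoting before pointing a file format at the file:

import csv
from io import StringIO

import pandas as pd

# Made-up data shaped like the dict passed to _upload_file().
d = {"A1": {"description": "widget", "parent_company": "Acme"}}
df = pd.DataFrame.from_dict(d, orient="index")
df.index.name = "product_code"

csv_buffer = StringIO()
# Be explicit about separator, header and quoting so the file matches
# what the Snowflake stage / file format expects.
df.to_csv(csv_buffer, sep=",", header=True, index=True, quoting=csv.QUOTE_MINIMAL)

# The exact text that would be uploaded to S3:
print(csv_buffer.getvalue())
# product_code,description,parent_company
# A1,widget,Acme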

snowflake external table from a list of jsons

I've got a file.json in my S3 bucket; it contains a list of JSON objects. For instance, when I download it and parse it with Python's json.load I get a list:
[{'k': 'calendar#event'}, {'k': 'calendar#event'}]
loading it into an external table works:
create external table if not exists TEST_111
with location = @TESt111
auto_refresh = true
file_format = (type = json);
but instead of getting a table with 2 rows, I get one row with a list in it.
Any ideas?
If the value is provided as an array, then STRIP_OUTER_ARRAY could be used:
create external table if not exists TEST_111
with location = @TESt111
auto_refresh = true
file_format = (type = json, STRIP_OUTER_ARRAY = TRUE);
Additionally, if the JSON keys are known in advance, they could be exposed as columns directly in the external table's definition:
create external table if not exists TEST_111
(
    filename TEXT AS (metadata$filename)
    ,k TEXT AS (value:"k"::TEXT)
)
with location = @TESt111
auto_refresh = true
file_format = (type = json, STRIP_OUTER_ARRAY = TRUE);
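As a quick local illustration (plain Python, not Snowflake) of what STRIP_OUTER_ARRAY does: the outer array is removed and each element becomes one row, which is also the shape json.load reports for the file:

import json

# The file body: a JSON array holding two objects.
raw = '[{"k": "calendar#event"}, {"k": "calendar#event"}]'

rows = json.loads(raw)   # -> [{'k': 'calendar#event'}, {'k': 'calendar#event'}]
for row in rows:         # with STRIP_OUTER_ARRAY = TRUE, each element is its own row
    print(row["k"])      # the value the k column (value:"k"::TEXT) would expose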

Failed to create for each activity in azure data factory using python

I am trying to create Azure Data Factory pipelines and resources using Python. I was successful with certain ADF activities like Lookup and Copy, but the problem I am facing here is that when I try to copy a few tables from SQL to Blob using a ForEach activity, it throws the error below.
How would you create activities inside a ForEach activity? Any input is greatly appreciated. Thanks!
Ref: https://learn.microsoft.com/en-us/python/api/azure-mgmt-datafactory/azure.mgmt.datafactory.models.foreachactivity?view=azure-python
Error Message
TypeError: 'CopyActivity' object is not iterable
Code Block
## Lookup Activity
ls_sql_name = 'ls_' + project_name + '_' + src_svr_type + '_dev'
linked_service_name = LinkedServiceReference(reference_name=ls_sql_name)
lkp_act_name = 'Get Table Names'
sql_reader_query = "SELECT top 3 name from sys.tables where name like '%dim'"
source = SqlSource(sql_reader_query=sql_reader_query)
dataset = {"referenceName": "ds_sql_Dim_input", "type": "DatasetReference"}
LookupActivity_ = LookupActivity(name=lkp_act_name, linked_service_name=linked_service_name,
                                 source=source, dataset=dataset, first_row_only=False)

## Copy Activity
ds_name = 'ds_sql_dim_input'       # these datasets already created
dsOut_name = 'ds_blob_dim_output'  # these datasets already created
copy_act_name = 'Copy SQL to Blob(parquet)'
sql_reader_query = {"value": "@item().name", "type": "Expression"}
sql_source = SqlSource(sql_reader_query=sql_reader_query)
blob_sink = ParquetSink()
dsin_ref = DatasetReference(reference_name=ds_name)
dsOut_ref = DatasetReference(reference_name=dsOut_name)
copy_activity = CopyActivity(name=copy_act_name, inputs=[dsin_ref], outputs=[dsOut_ref],
                             source=sql_source, sink=blob_sink)

## For Each Activity
pl_name = 'pl_Test'
items = {"value": "@activity('Get Table Names').output.value", "type": "Expression"}
dependsOn = [{"activity": "Get Table Names", "dependencyConditions": ["Succeeded"]}]
ForEachActivity_ = ForEachActivity(name='Copy tables in loop', items=items,
                                   depends_on=dependsOn, activities=copy_activity)

params_for_pipeline = {}
p_obj = PipelineResource(activities=[LookupActivity_, ForEachActivity_], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, pl_name, p_obj)
Activities needs to be a list of Activity objects, and you are passing a single one. Try creating a list, add the copy activity to it, and then pass that list in the activities parameter.
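A minimal sketch of that fix, reusing the names from the question's code (only the activities argument changes):

# Wrap the single CopyActivity in a list; ForEachActivity expects a list of activities.
ForEachActivity_ = ForEachActivity(name='Copy tables in loop',
                                   items=items,
                                   depends_on=dependsOn,
                                   activities=[copy_activity])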
