I've been working on a project that uses a fairly simple data pipeline to clean and transform raw CSV files into processed data, using Python 3.8 and AWS Lambda to create various subsets that are sent to their respective S3 buckets. The Lambda function is triggered by uploading a raw CSV file to an intake S3 bucket.
However, I would also like to send some of that processed data directly to QuickSight for ingestion from that same Lambda function, for visual inspection, and that's where I'm currently stuck.
Here is the portion of the function (imports omitted) that handles the CSV processing and the upload to S3; this is the output I'd like ingested directly into QuickSight:
def featureengineering(event, context):
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    s3_file_name = event['Records'][0]['s3']['object']['key']
    read_file = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

    # turn the CSV into a dataframe in AWS Lambda
    s3_data = io.BytesIO(read_file.get('Body').read())
    df = pd.read_csv(s3_data, encoding="ISO-8859-1")

    # replace erroneous zero values with NaN (missing), which is more accurate,
    # and create a new column with just three stages for simplification
    df[['Column_A', 'Column_B']] = df[['Column_A', 'Column_B']].replace(0, np.nan)

    # apply the 'newstage' function for feature engineering
    df['NewColumn'] = df.Stage.apply(newstage)

    df1 = df
    csv_buffer1 = io.StringIO()
    df1.to_csv(csv_buffer1)
    s3_resource.Object(bucket1, csv_file_1).put(Body=csv_buffer1.getvalue())  # upload df1 to S3
So df1 is sent to its S3 bucket (which works fine), but I'd also like it ingested directly into QuickSight as an automated SPICE refresh.
In digging around I found a similar question with an answer:
import boto3

client = boto3.client('quicksight')
response = client.create_ingestion(
    DataSetId='<dataset-id>',
    IngestionId='<ingestion-id>',
    AwsAccountId='<aws-account-id>'
)
but the hang-up I'm having is the DataSetId or, more generally: how do I turn the pandas DataFrame df1 in the Lambda function into something the CreateIngestion API can accept, so that the most recent processed data is automatically sent to QuickSight as a SPICE refresh?
You should first create a QuickSight Dataset; quoting from the docs:
A dataset identifies the specific data in a data source that you want to use. For example, the data source might be a table if you are connecting to a database data source. It might be a file if you are connecting to an Amazon S3 data source.
Once you have saved your DataFrame on S3 (either as a csv or parquet file), you can create a Quicksight Dataset that sources data from it.
You can do so either via Console or programmatically (probably what you’re looking for).
Finally, once you have the Dataset ID you can reference it in other Quicksight API calls.
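To make that concrete: the DataSetId does not come from the DataFrame itself. You create the dataset once, pointing it at the S3 location the Lambda writes to, and afterwards each create_ingestion call refreshes SPICE from whatever is currently in S3. Below is a minimal sketch of the one-time setup, assuming an S3 data source already exists; the account ID, ARN, table key, and column types are placeholders, not values from the original.

```python
def build_dataset_params(dataset_id, columns,
                         aws_account_id="123456789012",  # placeholder
                         data_source_arn="arn:aws:quicksight:us-east-1:123456789012:datasource/processed-csv"):  # placeholder
    """Build the request body for quicksight create_data_set (S3-backed, SPICE)."""
    input_columns = [{"Name": name, "Type": qs_type} for name, qs_type in columns.items()]
    return {
        "AwsAccountId": aws_account_id,
        "DataSetId": dataset_id,
        "Name": dataset_id,
        "ImportMode": "SPICE",  # load into SPICE so ingestions refresh it
        "PhysicalTableMap": {
            "processed-table": {  # arbitrary key for this physical table
                "S3Source": {
                    "DataSourceArn": data_source_arn,
                    "InputColumns": input_columns,
                }
            }
        },
    }

params = build_dataset_params("processed-data",
                              {"Column_A": "DECIMAL", "Column_B": "DECIMAL"})

# One-time setup, then a per-run SPICE refresh from the Lambda:
# import boto3, uuid
# qs = boto3.client("quicksight")
# qs.create_data_set(**params)
# qs.create_ingestion(DataSetId="processed-data",
#                     IngestionId=str(uuid.uuid4()),
#                     AwsAccountId="123456789012")
```

The ingestion then pulls the latest file from S3, so writing df1 to the bucket and calling create_ingestion is the whole loop.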
I have a web-scraper application that is scheduled to run 5 times a day. The process runs in Python on my personal notebook.
The output is a dataframe with no more than 20k rows. This dataframe is appended to a compressed file holding all the historical data: a .csv.gzip of ~110 MB, growing by 0.5 MB each day, which is the input to a dashboard in Power BI.
The problem is that every time the script runs, it has to
unzip → load the whole file into memory → append the new rows → save (overwrite)
and that's not very efficient.
It seems the best approach would be a format that allows appending the latest data without reading the whole file.
We are now migrating the application to Azure and have to adapt our architecture: we are using Azure Functions to run the web scraper, and Azure Blob as storage.
Is there a more viable architecture for this job (appending each new extraction to a historical file) than using gzip?
I am assuming that an SQL Database would be more expensive, so I am giving Blob one last chance to work this out at low cost.
Update
The code below works locally. It appends new data to the historical gzip without loading the whole file.
df.to_csv(gzip_filename, encoding='utf-8', compression='gzip', mode='a')
The code below does not work on Azure: it overwrites the historical data with the new data.
container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
output = df.to_csv(index=False, encoding='utf-8', compression='gzip', mode='a')
container_client.upload_blob(gzip_filename, output, overwrite=True, encoding='utf-8')
I noticed you used overwrite=True, which means:
If overwrite=True is used, the old append blob is removed and a new one is generated. False is the default value.
If overwrite=False is specified while the data already exists, no error is raised and the data is appended to the existing blob.
So try changing overwrite to False, as follows:
container_client.upload_blob(gzip_filename, output, overwrite=False, encoding='utf-8')
Refer here for more information
I found a way following #Dhanuka Jayasinghe's tip: using Append Blobs.
The code below works for me. It appends the latest lines without having to read the whole file.
from azure.storage.blob import ContainerClient, BlobType

# establish the connection to the container in Blob Storage
container_client = ContainerClient.from_connection_string(conn_str='your_conn_str', container_name='your_container_name')
output = df.to_csv(index=False, encoding='utf-8', header=None)  # header=None if your file already has headers
# save to the blob, specifying the blob_type
container_client.upload_blob("output_filename.csv", output, encoding='utf-8', blob_type=BlobType.AppendBlob)
Reference:
Microsoft Documentation about 3 blob types (block, page and append)
Stackoverflow question with an explanation for appendblobs
I have a table on DynamoDB with about 50,000 records. I want to pull the whole table to my local computer using the AWS CLI, but I know there is a 1 MB limit.
What I've done so far:
Get the data in JSON format using aws dynamodb scan --table-name my_table_name > output.json
Then convert the JSON output to a dataframe using the Python code below.
import json
import pandas as pd

with open("output.json", encoding='utf-8-sig', errors='ignore') as json_data:
    data = json.load(json_data, strict=False)
df = pd.DataFrame(data["Items"])
After, necessary preprocessing steps...
Is there a way I can pull the whole table?
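For what it's worth, the 1 MB limit applies per Scan request, and the results are paginated, so the whole table can be pulled by following the pagination. A sketch using a boto3 paginator (the table name is the one from the question; the optional client argument just lets you pass in a preconfigured client):

```python
import pandas as pd

def scan_full_table(table_name, client=None):
    """Scan an entire DynamoDB table, following pagination past the 1 MB per-call limit."""
    if client is None:
        import boto3  # only needed when no client is injected
        client = boto3.client("dynamodb")
    items = []
    # the paginator re-issues Scan with the last evaluated key until the table is exhausted
    for page in client.get_paginator("scan").paginate(TableName=table_name):
        items.extend(page["Items"])
    return pd.DataFrame(items)

# df = scan_full_table("my_table_name")
```

The same preprocessing steps then apply to the resulting dataframe, without the intermediate output.json file.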
I have the task of transforming and consolidating millions of single JSON files into big CSV files.
The operation would be very simple using a copy activity and mapping the schemas; I have already tested it. The problem is that a massive number of the files have bad JSON format.
I know what the error is, and the fix is very simple too. I figured I could use a Python Databricks activity to fix the string and then pass the output to a copy activity that consolidates the records into a big CSV file.
I have something like this in mind, but I'm not sure it is the proper way to address the task, and I don't know how to use the output of the Copy Activity in the Databricks activity.
It sounds like you want to transform a large number of single JSON files using Azure Data Factory, but as #KamilNowinski said, Azure Data Factory does not currently support that. However, since you are already using Azure Databricks, it is easier to write a simple Python script that does the same thing. A workaround is to use the Azure Storage SDK and the pandas package directly, in a few steps, on Azure Databricks.
Assuming these JSON files are all in one container of Azure Blob Storage, you need to list them via list_blob_names and generate their URLs with a SAS token for pandas' read_json function, as in the code below.
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta
account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'
service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_container_shared_access_signature(container_name, permission=ContainerPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1),)
blob_names = service.list_blob_names(container_name)
blob_urls_with_token = (f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}" for blob_name in blob_names)
#print(list(blob_urls_with_token))
Then you can read these JSON files directly from the blobs via the read_json function to create their pandas dataframes.

import pandas as pd

for blob_url_with_token in blob_urls_with_token:
    df = pd.read_json(blob_url_with_token)
If you want to merge them into one big CSV file, you can first merge them into one big dataframe via the pandas functions listed under Combining / joining / merging, such as concat.
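That merge step can be sketched as follows; the small frames here are stand-ins for the ones read from the blobs (note that DataFrame.append was removed in pandas 2.0, so concat is the durable choice):

```python
import pandas as pd

# stand-ins for the per-file frames produced by read_json
frames = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]

# concatenate all per-file frames into one big frame, renumbering the index
big_df = pd.concat(frames, ignore_index=True)
```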
Writing a dataframe to a CSV file is then easy with the to_csv function. Alternatively, you can convert the pandas dataframe to a PySpark dataframe on Azure Databricks, as in the code below.
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
Whatever you want to do next is then simple. And if you want to schedule the script as a notebook in Azure Databricks, you can refer to the official document Jobs to run Spark jobs.
Hope it helps.
Copy the JSON files to storage (e.g. Blob) and access the storage from Databricks. Then you can fix the files using Python, and even transform them to the required format, with the cluster running.
So, in the Copy Data activity, copy the files to Blob if you don't have them there yet.
I have a data pipeline that is something like this : Kinesis Firehose -> S3
When I use the Glue crawler to create an Athena table over this data, the table only reads some of the actual rows. The data in the file looks like this:
{row1}{row2}{row3}{row4}\n
{row5}{row6}{row7}
If I modify the data to have a newline after each row, the Athena table reads the data properly. What I am wondering is how other people have solved this problem.
The solution I am considering is to write a Python Lambda function that takes care of the \n newline delimiter between rows. Is there a better way to do this?
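One common shape for that Lambda (a sketch of the approach described above, not code from the thread) is a Firehose data-transformation function that guarantees every record ends with a newline before delivery to S3, so each JSON object lands on its own line:

```python
import base64

def lambda_handler(event, context):
    """Firehose transformation: ensure every record ends with a newline."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        if not payload.endswith("\n"):
            payload += "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # tells Firehose the record was transformed successfully
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Attached as the delivery stream's transformation function, this fixes the data before it reaches S3, so the crawler and Athena see one row per line.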
Does anyone know how to get a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files will arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run crawler if files are located in the same place. For example, if your data folder is s3://bucket/data/<files> then you can add new files to it and run ETL job - new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files> then you need either to run a crawler or execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register new partitions in Glue Catalog before starting Glue ETL job.
When data is loaded into a DynamicFrame or Spark DataFrame you can apply filters to use only the data you need. If you still want to work with file names, you can add the file name as a column using the input_file_name Spark function and then filter on it:

from pyspark.sql.functions import col, input_file_name

df = df.withColumn("filename", input_file_name()) \
    .where(col("filename") == "your-filename")
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue.
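To illustrate that last point (database and table names here are hypothetical), a pushdown predicate restricts the Glue job to the partitions you need, so only one day's files are ever read:

```python
def day_predicate(year, month, day):
    """Build a Glue pushdown predicate for one day's partition.

    Assumes the table is partitioned by year/month/day string columns."""
    return f"year='{year}' and month='{month:02d}' and day='{day:02d}'"

predicate = day_predicate(2021, 6, 15)

# In the Glue job (placeholder database/table names):
# frame = glue_context.create_dynamic_frame.from_catalog(
#     database="my_db",
#     table_name="data",
#     push_down_predicate=predicate,
# )
```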