I have a data pipeline that looks something like this: Kinesis Firehose -> S3
When I use the Glue crawler to create an Athena table over this data, the table only reads some of the actual rows. The data in the file looks like this:
{row1}{row2}{row3}{row4}\n
{row5}{row6}{row7}
If I modify the data to have a newline after each row, the Athena table reads the data properly. What I am wondering is how other people have solved this problem.
The solution I am considering is to write a Python Lambda function that takes care of adding the \n newline delimiter between the rows. Is there a better way to do this?
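For reference, a minimal sketch of what such a function could look like as a Kinesis Firehose data transformation Lambda, assuming each incoming record is a single JSON object (the details here are illustrative, not a tested implementation):

import base64

def lambda_handler(event, context):
    # Firehose data transformation: append a newline to every record so that
    # each JSON object lands on its own line in the S3 output.
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        if not payload.endswith('\n'):
            payload += '\n'
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8')
        })
    return {'records': output}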
Suppose I have a list of APIs like the following...
https://developer.genesys.cloud/devapps/api-explorer#get-api-v2-alerting-alerts-active
https://developer.genesys.cloud/devapps/api-explorer#get-api-v2-alerting-interactionstats-rules
https://developer.genesys.cloud/devapps/api-explorer#get-api-v2-analytics-conversations-details
I want to read these APIs one by one and store the data in Snowflake using pandas and SQLAlchemy.
Do you have any ideas for reading the APIs one by one in my Python script? The plan is to:
- Read the APIs one by one from a file.
- Load the data into a Snowflake table directly (see the sketch below).
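A rough sketch of that flow with requests, pandas, and SQLAlchemy, assuming the file holds the actual REST endpoint URLs (the api-explorer links above are documentation pages), that the endpoints return JSON, and using placeholder Snowflake connection details:

import pandas as pd
import requests
from sqlalchemy import create_engine

# Placeholder connection string; see the snowflake-sqlalchemy docs for the exact format.
engine = create_engine(
    "snowflake://<user>:<password>@<account>/<database>/<schema>?warehouse=<warehouse>"
)

# One API URL per line in the file.
with open("api_list.txt") as f:
    api_urls = [line.strip() for line in f if line.strip()]

headers = {"Authorization": "Bearer <token>"}  # assumed auth scheme

for url in api_urls:
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    # Flatten the JSON payload into a DataFrame; the real payload shape may need
    # json_normalize(record_path=...) depending on the endpoint.
    df = pd.json_normalize(resp.json())
    table_name = url.rstrip("/").split("/")[-1]  # crude table name derived from the URL
    df.to_sql(table_name, engine, if_exists="append", index=False)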
I have a table in DynamoDB with about 50,000 records. I want to pull this whole table to my local computer using the AWS CLI, but I know there is a 1 MB limit per scan.
What I've done so far
Get the data in JSON format using aws dynamodb scan --table-name my_table_name > output.json
Then convert the JSON output to a DataFrame using the Python code below.
import json
import pandas as pd

# Load the scan output and turn the "Items" list into a DataFrame
with open("output.json", encoding='utf-8-sig', errors='ignore') as json_data:
    data = json.load(json_data, strict=False)

df = pd.DataFrame(data["Items"])
After that come the necessary preprocessing steps...
Is there a way I can pull the whole table?
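One way around the 1 MB page limit is to paginate the scan from Python instead of relying on a single CLI call. A minimal sketch with boto3 (the table name is a placeholder):

import boto3
import pandas as pd

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my_table_name')

# Keep scanning until LastEvaluatedKey is no longer returned,
# accumulating the items from every 1 MB page.
items = []
response = table.scan()
items.extend(response['Items'])
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])

df = pd.DataFrame(items)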
I would like to run a daily ingestion job that takes a CSV file from blob storage and integrates it into a PostgreSQL database. I am constrained to using Python. Which solution would you recommend for building/hosting my ETL solution?
Have a nice day :)
Additional information:
The CSV file is 1.35 GB, with shape (1292532, 54).
I will push only 12 of the 54 columns to the database.
You can try to use Azure Data Factory to achieve this. Create a Copy Data activity whose source is your CSV and whose sink is the PostgreSQL database. In the Mapping settings, just select the columns you need. Finally, create a schedule trigger to run it daily.
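If a pure-Python route is preferred instead, given the constraint mentioned in the question, the load step itself could be done with pandas and SQLAlchemy. A minimal sketch, with placeholder connection string, file name, and column names, that streams the large file in chunks and pushes only the needed columns:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and column list
engine = create_engine("postgresql+psycopg2://<user>:<password>@<host>:5432/<database>")
needed_columns = ["col_1", "col_2", "col_3"]  # stand-ins for the 12 columns actually pushed

# Stream the 1.35 GB file in chunks so it never has to fit in memory at once
for chunk in pd.read_csv("daily_export.csv", usecols=needed_columns, chunksize=100_000):
    chunk.to_sql("target_table", engine, if_exists="append", index=False)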
I've been working on a project that uses a fairly simple data pipeline to clean and transform raw csv files into processed data, using Python 3.8 and Lambda to create various subsets which are sent to their respective S3 buckets. The Lambda function is triggered by uploading a raw csv file to an intake S3 bucket, which initiates the process.
However, I would also like to send some of that processed data directly to Quicksight for ingestion from that same Lambda function, for visual inspection, and that's where I'm currently stuck.
Here is a portion of the function (leaving out the imports) with just the csv processing and uploading to S3; this is the portion I would like directly ingested into Quicksight:
def featureengineering(event, context):
    # S3 bucket and key of the raw CSV upload that triggered the function
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    s3_file_name = event['Records'][0]['s3']['object']['key']
    read_file = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

    # turning the CSV into a dataframe in AWS Lambda
    s3_data = io.BytesIO(read_file.get('Body').read())
    df = pd.read_csv(s3_data, encoding="ISO-8859-1")

    # replacing erroneous zero values with NaN (missing), which is more accurate,
    # and creating a new column with just three stages for simplification
    df[['Column_A', 'Column_B']] = df[['Column_A', 'Column_B']].replace(0, np.nan)

    # applying the 'newstage' function for feature engineering
    df['NewColumn'] = df.Stage.apply(newstage)

    df1 = df
    csv_buffer1 = io.StringIO()  # in-memory buffer for the processed CSV
    df1.to_csv(csv_buffer1)
    s3_resource.Object(bucket1, csv_file_1).put(Body=csv_buffer1.getvalue())  # uploading df1 to S3
So df1 is sent to its S3 bucket (which works fine), but I'd like it directly ingested into Quicksight as an automated SPICE refresh as well.
In digging around I did find a similar question with an answer:
import boto3
import time
import sys
client = boto3.client('quicksight')
response = client.create_ingestion(
    DataSetId='<dataset-id>',
    IngestionId='<ingestion-id>',
    AwsAccountId='<aws-account-id>'
)
but the hang-up I'm having is the DataSetId or, more generally: how do I turn the pandas DataFrame df1 in the Lambda function into something the CreateIngestion API can accept and automatically send to Quicksight as an automated SPICE refresh of the most recent processed data?
You should first create a Quicksight Dataset, quoting from the docs:
A dataset identifies the specific data in a data source that you want to use. For example, the data source might be a table if you are connecting to a database data source. It might be a file if you are connecting to an Amazon S3 data source.
Once you have saved your DataFrame on S3 (either as a csv or parquet file), you can create a Quicksight Dataset that sources data from it.
You can do so either via Console or programmatically (probably what you’re looking for).
Finally, once you have the Dataset ID you can reference it in other Quicksight API calls.
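Putting that together with the snippet from the question: once the dataset exists and points at the S3 location the Lambda writes to, the function can trigger a SPICE refresh right after uploading the new CSV. A minimal sketch (account ID and dataset ID are placeholders):

import uuid
import boto3

quicksight = boto3.client('quicksight')

# Trigger a SPICE refresh of the dataset that sources the CSV the Lambda just wrote.
# Each ingestion needs a unique IngestionId, so a fresh UUID is used here.
response = quicksight.create_ingestion(
    AwsAccountId='<aws-account-id>',
    DataSetId='<dataset-id>',
    IngestionId=str(uuid.uuid4())
)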
Basically I want a SQL connection to a csv file in an S3 bucket using Amazon Athena. I also do not know any information other than that the first row will give the names of the headers. Does anyone know a solution to this?
You have at least two ways of doing this. One is to examine a few rows of the file to detect the data types, then write a CREATE TABLE SQL statement as shown in the Athena docs.
If you know you are getting only strings and numbers (for example) and if you know all the columns will have values, it can be relatively easy to build it that way. But if types can be more flexible or columns can be empty, building a robust solution from scratch might be tricky.
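For illustration, a hand-written DDL for a headered CSV can also be submitted from Python through the Athena API; this is a minimal sketch with placeholder database, bucket, column names, and types:

import boto3

athena = boto3.client('athena')

# Hand-written DDL for a CSV with a header row; the columns and types are guesses
# you would adjust after inspecting a few rows of the file.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_csv_table (
  id      bigint,
  name    string,
  amount  double
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-prefix/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'}
)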
So the second option would be to use the AWS Glue Catalog to define a crawler, which does exactly what I told you above, but automatically. It also creates the metadata you need in Athena, so you don't need to write the CREATE TABLE statement.
As a bonus, you can use that automatically catalogued data not only from Athena, but also from Redshift and EMR. And if you keep adding new files to the same bucket (every day, every hour, every week...) you can tell the crawler to run again and rediscover the data in case the schema has evolved.
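If you prefer to set the crawler up from code rather than the console, a minimal sketch with boto3 (crawler name, IAM role, database, and S3 path are placeholders):

import boto3

glue = boto3.client('glue')

# Create a crawler that catalogs everything under the S3 prefix into a Glue database,
# then run it; the resulting table is immediately queryable from Athena.
glue.create_crawler(
    Name='csv-crawler',
    Role='arn:aws:iam::<account-id>:role/<glue-service-role>',
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/my-prefix/'}]}
)
glue.start_crawler(Name='csv-crawler')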