Load large flat-file data into a warehouse from S3 - python

This is more of a design and architecture question, so please bear with me.
Requirement - Let's say we have two types of flat files (csv, xls or txt), one for each of the two DB tables below.
Doctor
name
degree
...
Patient
name
doctorId
age
...
Each file contains the data for its respective table (volume: 2-3 million rows per file).
We have to load the data from these two files into the Doctor and Patient tables of the warehouse, after validations such as null values, foreign keys, duplicates in Doctor, etc.
If any invalid data is identified, I need to attach the reasons (e.g. null value, duplicate value) so that I can evaluate the invalid records.
Note that the expectation is to load 1 million records in roughly 1-2 minutes.
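To illustrate the kind of row-level validation I mean, here is a rough sketch (pandas; the column names come from the tables above, everything else is a placeholder, and the real job would run inside Glue):

```python
import pandas as pd

def validate_patients(patients: pd.DataFrame, doctors: pd.DataFrame) -> pd.DataFrame:
    """Return the patient rows with a 'reasons' column listing every failed check."""
    df = patients.reset_index(drop=True).copy()
    reasons = [[] for _ in range(len(df))]

    # null checks on required columns
    for col in ("name", "doctorId"):
        for i in df.index[df[col].isna()]:
            reasons[i].append(f"null value in {col}")

    # foreign-key check: doctorId must exist in the Doctor file
    known = set(doctors["id"].dropna())   # assumes the Doctor file carries an 'id' column
    for i in df.index[~df["doctorId"].isin(known)]:
        reasons[i].append("unknown doctorId")

    df["reasons"] = ["; ".join(r) for r in reasons]
    return df

# result = validate_patients(pd.read_csv("patient.csv"), pd.read_csv("doctor.csv"))
# invalid = result[result["reasons"] != ""]   # goes to a review file with reasons attached
```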
My designed workflow (so far)
After reading several articles and blog posts, I am inclined to go with AWS Glue & DataBrew for the source-to-target ETL, along with custom validations.
Please find my design and architecture below. Suggest and guide me on it.
Is there any scope for parallel or partition-based processing to speed up validation and loading? Your help will really help me and others who run into this kind of case.
Thanks a ton.

I'm not sure if I am answering exactly what you are asking, but your architecture should follow these guidelines:
1) Files land in an S3 raw bucket.
2) A Lambda function is triggered once a file is put into the S3 bucket (see the Lambda sketch below).
3) The Lambda trigger invokes a Step Functions state machine which contains the following steps:
3.1) Data governance (AWS Deequ) runs all validations.
3.2) Transformations are performed.
4) Processed data moves to your processed bucket, where you keep data for reconciliation and other processes.
5) Finally, the data moves to your production bucket, which holds only the required processed data, not everything.
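A minimal sketch of step 2, assuming the state machine ARN is configured as an environment variable on the Lambda (bucket layout and names are placeholders):

```python
import json
import os
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """S3 put event -> start one Step Functions execution per uploaded file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],   # placeholder, set on the Lambda
            # one execution per object; the input tells the pipeline which file to process
            input=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"statusCode": 200}
```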
Note:
Partitioning helps achieve parallel processing in Glue.
Your Lambda and ETL logic should be stateless, so a rerun will not corrupt your data.
All synchronous calls should use proper retries with exponential backoff and jitter (see the sketch below).
Log every step into a DynamoDB table so you can analyse the logs and use them for reconciliation.
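For the retry point, a small generic sketch of exponential backoff with full jitter (pure Python; the wrapped call and limits are placeholders):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a synchronous call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:                 # narrow this to the SDK's throttling errors in real code
            if attempt == max_attempts:
                raise
            # full jitter: sleep a random amount up to the exponential cap
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))

# example: call_with_backoff(lambda: sfn.start_execution(...))
```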

Related

How to populate an AWS Timestream DB?

I am trying to use AWS Timestream to store data with timestamps (in Python using boto3).
The data I need to store corresponds to prices over time of different tokens. Each record has 3 fields: token_address, timestamp, price. I have around 100M records (with timestamps from 2019 to now).
I have all the data in a CSV and I would like to populate the DB with it, but I can't find a way to do this in the documentation, as I am limited to 100 records per WriteRecords call according to the quotas. The only optimization proposed in the documentation is writing batches of records with common attributes, but in my case the records don't share the same values (they all have the same structure but not the same values, so I cannot define common_attributes as they do in the example).
So is there a way to populate a Timestream DB without writing records in batches of 100?
I asked AWS support, here is the answer:
Unfortunately, "Records per WriteRecords API request" is a non-configurable limit. This limitation is already noted by the development team.
However, to get any additional insights to help with your load, I have reached out to my internal team. I will get back to you as soon as I have an update from the team.
EDIT:
I had a new answer from AWS support:
Team, suggested that a new feature called batch load is being released tentatively at the end of February (2023). This feature will allow the customer to ingest data from CSV files directly into Timestream in bulk.
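Until batch load is available, the practical option is to chunk writes to 100 records per WriteRecords call and run several of them in parallel. A hedged sketch, with database and table names as placeholders:

```python
import csv
import boto3

ts = boto3.client("timestream-write")

def to_record(row):
    """Map one CSV row (token_address, timestamp, price) to a Timestream record."""
    return {
        "Dimensions": [{"Name": "token_address", "Value": row["token_address"]}],
        "MeasureName": "price",
        "MeasureValue": str(row["price"]),
        "MeasureValueType": "DOUBLE",
        "Time": str(int(row["timestamp"])),   # assumes epoch seconds in the CSV
        "TimeUnit": "SECONDS",
    }

def load_csv(path, database="prices_db", table="token_prices"):   # placeholder names
    batch = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            batch.append(to_record(row))
            if len(batch) == 100:                       # WriteRecords hard limit
                ts.write_records(DatabaseName=database, TableName=table, Records=batch)
                batch = []
    if batch:
        ts.write_records(DatabaseName=database, TableName=table, Records=batch)
```

Running several processes or threads over disjoint slices of the CSV lets you parallelise these calls and improve overall throughput.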

Export BigQuery table to GCS bucket into multiple folders/files corresponding to clusters

Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the BigQuery Python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (a column in the BigQuery table), and then create the multiple files.
The table is pretty huge and won't fit in RAM (it could, but that's not the idea); it is 1+ TB. I cannot simply loop over it with Python.
I have read quite a lot of documentation and gone through the parameters, but I can't find a clean solution. Did I miss something, or is there no Google solution?
My plan B is to use Apache Beam and Dataflow, but I don't have the skills yet, and I would like to avoid that solution as much as possible for simplicity and maintenance.
You have 2 solutions:
Create 1 export query per aggregation. If you have 100 nomenclature values, query the table 100 times and export the data to the target directory (see the sketch below). The issue is the cost: you will pay for 100 scans of the table.
You can use Apache Beam to extract the data and sort it. Then, with a dynamic destination, you will be able to create all the GCS paths that you want. The issue is that it requires Apache Beam skills to achieve it.
There is an extra solution, similar to the 2nd one, but using Spark, and especially serverless Spark. If you have more skills in Spark than in Apache Beam, it could be more efficient.
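A hedged sketch of the first option using EXPORT DATA statements from the Python client (project, dataset, table, bucket and column names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table = "my_project.my_dataset.my_table"        # placeholder
bucket = "my_export_bucket"                     # placeholder

# one export per nomenclature value, each into its own "folder" (GCS prefix)
values = [row["nomenclature"]
          for row in client.query(f"SELECT DISTINCT nomenclature FROM `{table}`").result()]

for value in values:
    sql = f"""
    EXPORT DATA OPTIONS(
      uri='gs://{bucket}/{value}/part-*.csv',
      format='CSV', overwrite=true, header=true
    ) AS
    SELECT * FROM `{table}` WHERE nomenclature = '{value}'
    """   # assumes nomenclature values are plain strings safe to inline
    client.query(sql).result()   # each statement scans the table (hence the cost caveat above)
```

If the table is clustered on the nomenclature column, each filtered scan should read less than the full table, which softens the cost issue somewhat.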

Reading a single CSV file from an S3 bucket using Amazon Athena and querying it

Basically I want a SQL connection to a CSV file in an S3 bucket using Amazon Athena. I also do not know any information other than that the first row gives the names of the headers. Does anyone know any solution to this?
You have at least two ways of doing this. One is to examine a few rows of the file to detect the data types, then write a CREATE TABLE SQL statement as seen in the Athena docs.
If you know you are getting only strings and numbers (for example) and if you know all the columns will have values, it can be relatively easy to build it that way. But if types can be more flexible or columns can be empty, building a robust solution from scratch might be tricky.
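A minimal sketch of that first option, run through boto3 (bucket, prefix, database and column definitions are placeholders you would adapt after inspecting the file):

```python
import boto3

athena = boto3.client("athena")

# columns guessed by inspecting the first rows of the CSV;
# OpenCSVSerde reads everything as strings, so cast in your queries if needed
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_csv_table (
  id string,
  name string,
  amount string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 's3://my-bucket/csv-folder/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_db"},                       # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```

Note that LOCATION points at the folder that contains the file, not at the file itself.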
So the second option would be to use the AWS Glue Catalog to define a crawler, which does exactly what I described above, but automatically. It also creates the metadata you need in Athena, so you don't need to write the CREATE TABLE statement.
As a bonus, you can use that automatically catalogued data not only from Athena, but also from Redshift and EMR. And if you keep adding new files to the same bucket (every day, every hour, every week...), you can tell the crawler to run again and rediscover the data in case the schema has evolved.
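For that second option, a hedged sketch of defining and running such a crawler with boto3 (crawler name, IAM role and S3 path are placeholders):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv_bucket_crawler",                                   # placeholder
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",     # placeholder role with S3 access
    DatabaseName="my_db",                                        # Glue/Athena database to populate
    Targets={"S3Targets": [{"Path": "s3://my-bucket/csv-folder/"}]},
)
glue.start_crawler(Name="csv_bucket_crawler")
# once the crawler finishes, the table is queryable from Athena (and Redshift Spectrum / EMR)
```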

Get the data from a REST API and store it in HDFS/HBase

I'm new to Big Data. I learned that HDFS is more for storing structured data and HBase is for storing unstructured data. I have a REST API from which I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better to load the data into, HDFS or HBase? Also, can you please direct me to some tutorial for doing this? I came across this Tutorial with Streaming Data, but I'm not sure if it fits my use case.
It would be of great help if you could guide me to a particular resource/technology to solve this issue.
There are several questions you have to think about:
Do you want to work with batch files or streaming? It depends on the rate at which your REST API will be requested.
For storage, there is not just HDFS and HBase; you have a lot of other solutions such as Cassandra, MongoDB, Neo4j. It all depends on how you want to use the data (random access vs full scan, updates with versioning vs writing new lines, concurrent access). For example, HBase is good for random access, Neo4j for graph storage, ... If you are receiving JSON files, MongoDB can be a good choice as it stores objects as documents.
What is the size of your data?
Here is a good article on the questions to think about when you start a big data project: documentation
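If you go the batch route, a minimal sketch of pulling JSON from the API and landing it in HDFS as dated files (this assumes WebHDFS is enabled and uses the hdfs Python package; the endpoint URL, namenode address and paths are placeholders):

```python
import json
from datetime import datetime, timezone

import requests
from hdfs import InsecureClient   # pip install hdfs; talks to WebHDFS

hdfs_client = InsecureClient("http://namenode:9870", user="hadoop")   # placeholder namenode

def pull_and_store(api_url="https://api.example.com/data"):           # placeholder endpoint
    """Fetch one batch of JSON from the REST API and write it to HDFS as a dated file."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    records = response.json()

    path = f"/data/raw/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
    with hdfs_client.write(path, encoding="utf-8") as writer:
        for record in records:                # newline-delimited JSON, easy to process later
            writer.write(json.dumps(record) + "\n")

# run it from cron / a scheduler at whatever rate the API allows
```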

Schema-free solution to BigQuery load job

Background
I studied and found that BigQuery doesn't accept schemas defined by online tools (which have different formats, even though the meaning is the same).
So my problem is that I want to load data (where the number of columns keeps varying and increasing dynamically) into a table which has a fixed schema.
Thoughts
What I could do as a workaround is:
First, check whether the data being loaded has extra fields.
If it has, a schema mismatch will occur, so first create a temporary table in BQ and load the data into that table using the "autodetect" parameter, which gives me a schema (in a format which BQ accepts for schema files).
Now I can download this schema file and use it to update my existing table in BQ, then load the appropriate data into it.
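A sketch of that workaround with the Python client (all table IDs and the source URI are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
temp_table = "my_project.staging.temp_autodetect"       # placeholder
target_table = "my_project.prod.my_table"               # placeholder
source_uri = "gs://my-bucket/new_data.json"             # placeholder, newline-delimited JSON

# 1) load into a temp table with schema autodetection
job = client.load_table_from_uri(
    source_uri,
    temp_table,
    job_config=bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
job.result()

# 2) read the detected schema and add any new fields to the existing table
detected = client.get_table(temp_table).schema
target = client.get_table(target_table)
existing = {field.name for field in target.schema}
target.schema = list(target.schema) + [f for f in detected if f.name not in existing]
client.update_table(target, ["schema"])    # only additive schema changes are allowed here
```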
Suggestion
Any thoughts on this? If there is a better approach, please share.
We are in the process of releasing a new feature that can update the schema of the destination table within a load/query job. With autodetect and the new feature you can directly load the new data to the existing table, and the schema will be updated as part of the load job. Please stay tuned. The current ETA is 2 weeks.
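Assuming the feature surfaces as the schema_update_options that the Python client exposes, a sketch of such a load would look roughly like this (table ID and source URI are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

job = client.load_table_from_uri(
    "gs://my-bucket/new_data.json",                  # placeholder source
    "my_project.prod.my_table",                      # placeholder destination
    job_config=bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        # let the load job add any new columns it detects to the destination schema
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    ),
)
job.result()
```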
