The goal is to compare the data in the source with the data after it has been cleaned, modelled, and ingested into the data warehouse, and to send an alert in case of a mismatch.
Is this possible with dbt Cloud, or should I use Python?
If by "source" you mean a different database, before it is ingested into your data warehouse, I'd recommend using datadiff.
Once the data is in your warehouse, you can use dbt to compare two different tables (let's say in your raw or source schema and your final modeled schema).
There are quite a few tests in dbt-utils for this. You may also be interested in dbt-expectations if you need more powerful or complex tests.
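If you do end up reaching for Python instead, a minimal reconciliation sketch could look like the one below. It assumes both the raw/source table and the final modeled table are reachable from the same warehouse connection; the table names, the Postgres-style DSN, and the alert webhook URL are hypothetical placeholders.

```python
# A minimal reconciliation sketch, assuming both tables live in the same
# warehouse and are reachable over a standard DB-API connection. Table names,
# the DSN and the alert webhook URL are hypothetical placeholders.
import psycopg2  # assumption: a Postgres-compatible warehouse; swap for your driver
import requests

SOURCE_TABLE = "raw.orders"          # hypothetical
TARGET_TABLE = "analytics.orders"    # hypothetical
WEBHOOK_URL = "https://hooks.example.com/alerts"  # hypothetical

def row_count(cursor, table: str) -> int:
    # Count rows in one table; count both tables at roughly the same time.
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]

def main() -> None:
    conn = psycopg2.connect("dbname=warehouse user=etl")  # hypothetical DSN
    try:
        cur = conn.cursor()
        source_rows = row_count(cur, SOURCE_TABLE)
        target_rows = row_count(cur, TARGET_TABLE)
        if source_rows != target_rows:
            # Send a simple alert when the counts do not match.
            requests.post(WEBHOOK_URL, json={
                "text": f"Row count mismatch: {SOURCE_TABLE}={source_rows}, "
                        f"{TARGET_TABLE}={target_rows}"
            })
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```

In practice you would extend this with column-level checksums, but the dbt-utils tests above give you the same checks without maintaining custom code.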
This is more of a design and architecture question, so please bear with me.
Requirement - Let's consider we have two types of flat files (csv, xls, or txt) for the two DB tables below.
Doctor
name
degree
...
Patient
name
doctorId
age
...
Each file contains the data of its respective table (volume: 2-3 million records per file).
We have to load the data from these two files into the Doctor and Patient tables of the warehouse, after validations such as null values, foreign keys, duplicates in Doctor, etc.
If any invalid data is identified, I need to attach the reasons (e.g. null value, duplicate value) so that I can evaluate the invalid data.
Note that the expectation is to load 1 million records in roughly 1-2 minutes.
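Purely as an illustration of the "attach a reason to each invalid row" requirement, here is a rough pandas sketch. File paths, the Doctor `id` column, and the reject-file names are hypothetical; at 2-3 million rows per file the same checks would normally run inside Glue/Spark rather than a single pandas process.

```python
# A rough sketch of row-level validation that attaches a reason to each
# invalid record. Column names follow the Doctor/Patient layout above;
# file paths and the Doctor 'id' column are hypothetical assumptions.
import pandas as pd

doctors = pd.read_csv("doctor.csv")    # hypothetical path
patients = pd.read_csv("patient.csv")  # hypothetical path

def validate_doctors(df):
    df = df.copy()
    df["invalid_reason"] = None
    # Null check on the name column.
    df.loc[df["name"].isna(), "invalid_reason"] = "null name"
    # Duplicate check (keeps the first occurrence as valid).
    dupes = df.duplicated(subset=["name", "degree"], keep="first")
    df.loc[dupes, "invalid_reason"] = "duplicate doctor"
    return df

def validate_patients(df, valid_doctor_ids):
    df = df.copy()
    df["invalid_reason"] = None
    df.loc[df["name"].isna(), "invalid_reason"] = "null name"
    # Foreign-key check: doctorId must reference a valid doctor.
    df.loc[~df["doctorId"].isin(valid_doctor_ids), "invalid_reason"] = "unknown doctorId"
    return df

doctors = validate_doctors(doctors)
patients = validate_patients(
    patients, set(doctors.loc[doctors["invalid_reason"].isna(), "id"])
)

# Rows with a reason go to reject files for evaluation; the rest get loaded.
doctors[doctors["invalid_reason"].notna()].to_csv("doctor_rejects.csv", index=False)
patients[patients["invalid_reason"].notna()].to_csv("patient_rejects.csv", index=False)
```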
My Designed workflow (so far)
After reading several articles and blogs, I have decided to go with AWS Glue & DataBrew for my source-to-target ETL, along with custom validations.
Please find my design and architecture below. Suggest and guide me on it.
Is there any scope for parallel or partition-based processing to validate and load the data quickly? Your help will really help me and others who run into this type of case.
Thanks a ton.
I'm not sure if you're asking the same thing, but your architecture should follow these guidelines:
Files land in an S3 raw bucket.
A Lambda function is triggered once a file is put in the S3 bucket.
The Lambda trigger invokes a Step Functions state machine (a minimal trigger sketch follows this list) that contains the following steps:
3.1) Data governance (AWS Deequ) runs all the validations
3.2) Perform the transformations
Move the processed data to your processing bucket, where you keep the data needed for reconciliation and other processes.
Finally, the data moves to your production bucket, where you keep only the required processed data, not everything.
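A minimal sketch of steps 2-3 above (an S3-triggered Lambda that starts the Step Functions state machine) might look like this; the state machine ARN is a hypothetical placeholder:

```python
# A minimal sketch of steps 2-3: a Lambda function triggered by an S3
# "object created" event that starts the Step Functions state machine.
# The state machine ARN is a hypothetical placeholder.
import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:etl-validation"  # hypothetical

def handler(event, context):
    # One S3 event can contain several records; start one execution per file.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"status": "started"}
```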
Note:
Partitioning helps achieve parallel processing in Glue.
Your Lambda and ETL logic should be stateless so that a rerun will not corrupt your data.
All synchronous calls should use proper retries with exponential backoff and jitter (see the sketch after these notes).
Log every step into a DynamoDB table so you can analyze the logs and use them for reconciliation.
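To illustrate the retry note, a generic helper with exponential backoff and full jitter could be as small as this (the wrapped call and the tuning values are placeholders; AWS SDK clients can also be configured to retry natively):

```python
# A generic retry helper with exponential backoff and full jitter.
# The wrapped call and the tuning values are hypothetical placeholders.
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount between 0 and the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))

# Example usage with a hypothetical synchronous call:
# call_with_backoff(lambda: glue_client.get_job_run(JobName="my-job", RunId=run_id))
```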
Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the bigquery python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (it is within a bigquery table column), and then create the multiple files.
The table is pretty huge and won't fit in RAM (it could, but that's not the idea); it is 1+ TB. I cannot naively loop over it with Python.
I have read quite a lot of documentation and gone through the parameters, but I can't find a clean solution. Did I miss something, or is there no Google solution?
My plan B is to use Apache Beam and Dataflow, but I don't have the skills yet, and I would like to avoid this solution as much as possible for simplicity and maintenance.
You have 2 solutions:
Create one export query per aggregation. If you have 100 nomenclature values, query the table 100 times and export the data to the target directory (a sketch of this follows below). The issue is the cost: you will pay for processing the table 100 times.
You can use Apache Beam to extract and partition the data. Then, with a dynamic destination, you will be able to create all the GCS paths that you want. The issue is that it requires Apache Beam skills to achieve it.
There is an extra solution, similar to the 2nd one, using Spark, and especially serverless Spark. If you have more skill in Spark than in Apache Beam, it could be more efficient.
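For the first solution, a rough sketch using the BigQuery Python client and one EXPORT DATA statement per nomenclature value could look like this (project, dataset, table, column, and bucket names are hypothetical placeholders):

```python
# A rough sketch of solution 1: one EXPORT DATA statement per nomenclature
# value, run from the BigQuery Python client. Project, dataset, table, column
# and bucket names are hypothetical; each query is billed separately, which
# is the cost issue mentioned above.
from google.cloud import bigquery

client = bigquery.Client()

# Fetch the distinct nomenclature values first.
values = [
    row.nomenclature
    for row in client.query(
        "SELECT DISTINCT nomenclature FROM `my-project.my_dataset.my_table`"
    ).result()
]

for value in values:
    export_sql = f"""
    EXPORT DATA OPTIONS(
      uri='gs://my-bucket/{value}/export-*.csv',
      format='CSV',
      overwrite=true,
      header=true
    ) AS
    SELECT * FROM `my-project.my_dataset.my_table`
    WHERE nomenclature = '{value}'
    """
    client.query(export_sql).result()  # waits for the export to finish
```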
I have a BigQuery data warehouse containing all the data from a MongoDB database; the data is synced once a day.
I would like to add a column to one of my tables; that column is a cleaned + lemmatized version of another column (of type string). I can't do that with dbt because I need to use the Python library spaCy. How could I run such a transformation on my table without having to pull all the data locally and send 10M UPDATEs to BigQuery? Is there some GCP tool to run Python functions against BigQuery, like Dataflow or something like that?
And more generally, how do you transform data when tools like dbt are not enough?
Thanks for your help!
You can try Dataflow batch processing for your requirement, since Dataflow is a fully managed service that can run a transformation on your table without downloading the data locally, and the spaCy library can be used within Dataflow pipelines. Although BigQuery and Dataflow are managed services that can process large amounts of data, it is always a best practice to split larger NLP jobs into smaller ones, as discussed here.
Note - Since you want to add a column that is a lemmatized and cleaned version of an existing column, it would be better to write the result to a new destination table.
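A rough Beam/Dataflow sketch of that pipeline (read the source table, lemmatize one column with spaCy in a DoFn, write to a new destination table) is shown below; the table names, the column name, and the spaCy model are hypothetical, and the destination table is assumed to exist already with the extra column:

```python
# A rough sketch of a Beam/Dataflow batch pipeline: read the source table,
# lemmatize one string column with spaCy, write the rows to a NEW destination
# table. Table names, the column name ("text") and the spaCy model are
# hypothetical; the model must be installed on the workers (requirements file
# or custom container), and the destination table is assumed to exist already.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class LemmatizeFn(beam.DoFn):
    def setup(self):
        # Load the model once per worker, not once per element.
        import spacy
        self.nlp = spacy.load("en_core_web_sm")

    def process(self, row):
        doc = self.nlp(row.get("text") or "")
        out = dict(row)
        out["text_lemmatized"] = " ".join(
            tok.lemma_ for tok in doc if not tok.is_punct
        )
        yield out

def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner, --project, etc. on the CLI
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(
                query="SELECT * FROM `my-project.my_dataset.my_table`",
                use_standard_sql=True,
            )
            | "Lemmatize" >> beam.ParDo(LemmatizeFn())
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table_lemmatized",
                # Destination table is assumed to exist with the extra column,
                # so no schema is passed here.
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )

if __name__ == "__main__":
    run()
```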
I am working on a project where real-time data is stored in MongoDB. I need to analyze that data and calculate new parameters from the ones already present in the database.
The analysis may vary from basic statistics to predicting values from previous ones.
I came across two approaches: MongoDB queries using pymongo, or an ETL tool.
Please help me to find the best approach for this.
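For context on the first approach, here is only a small sketch of what the pymongo route can look like for basic statistics (database, collection, and field names are hypothetical; $merge requires MongoDB 4.2+):

```python
# A small sketch of the pymongo approach: compute basic statistics with an
# aggregation pipeline and write the derived parameters to another collection.
# Database, collection and field names are hypothetical; $merge needs MongoDB 4.2+.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
db = client["sensors"]

pipeline = [
    # Group readings per device and compute simple statistics.
    {"$group": {
        "_id": "$deviceId",
        "avg_value": {"$avg": "$value"},
        "max_value": {"$max": "$value"},
        "count": {"$sum": 1},
    }},
    # Persist the computed parameters into a separate collection.
    {"$merge": {"into": "device_stats", "whenMatched": "replace"}},
]

db["readings"].aggregate(pipeline)
```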
I'm new to big data. I learned that HDFS is more for storing structured data and HBase is for storing unstructured data. I have a REST API from which I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better to load the data into, HDFS or HBase? Also, can you please direct me to some tutorial for doing this? I came across this Tutorial with Streaming Data, but I'm not sure if it will fit my use case.
It would be of great help if you could guide me to a particular resource/technology to solve this issue.
There are several questions you have to think about:
Do you want to work with batch files or streaming? It depends on the rate at which your REST API will be called.
For storage, there is not just HDFS and HBase; you have a lot of other solutions, such as Cassandra, MongoDB, and Neo4j. It all depends on how you want to use the data (random access vs. full scan, updates with versioning vs. appending new lines, concurrent access). For example, HBase is good for random access, Neo4j for graph storage, etc. If you are receiving JSON files, MongoDB can be a good choice, as it stores objects as documents (a minimal ingest sketch follows at the end of this answer).
What is the size of your data?
Here is a good article on the questions to think about when you start a big data project.
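To illustrate the JSON/document-store option mentioned above, a minimal batch-ingest sketch (the API URL, database, and collection names are hypothetical placeholders) could be:

```python
# A minimal batch-ingest sketch for the JSON/document-store option: pull JSON
# from the REST API and store the documents in MongoDB. The API URL, database
# and collection names are hypothetical placeholders.
import requests
from pymongo import MongoClient

API_URL = "https://api.example.com/records"        # hypothetical
client = MongoClient("mongodb://localhost:27017")  # hypothetical
collection = client["warehouse"]["raw_records"]

def ingest_batch():
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    documents = response.json()  # assumes the API returns a JSON array
    if documents:
        collection.insert_many(documents)

if __name__ == "__main__":
    ingest_batch()
```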