I have multiple DBT models that read from the same source database, and despite having written documentation (descriptions, tests, etc.) for the source database, none of the column descriptions carry over to the models.
I'd like the descriptions I wrote for each of the source columns to be copied to their respective columns in each model.
Is this possible?
Serving docs with descriptions written for the columns in the source table produced a properly documented source, but none of the documentation appeared on the models based on that source.
I was hoping I wouldn't have to make duplicate descriptions for the columns reused between multiple models
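For what it's worth, dbt's built-in mechanism for reusing a description is a docs block ({% docs my_column %}...{% enddocs %} in a .md file, referenced from both the source and the model YAML as description: '{{ doc("my_column") }}'), which avoids duplication without any copying. As a one-off alternative, a small script can copy source column descriptions into a model schema file; a minimal sketch, assuming conventional file locations and PyYAML:

    # One-off sketch: copy column descriptions from a dbt sources file into a
    # model schema file. File names and YAML layout are assumptions about the
    # project; adjust the paths to match yours.
    import yaml

    with open("models/sources.yml") as f:          # assumed location
        sources = yaml.safe_load(f)

    # Build {column_name: description} from every documented source table.
    descriptions = {}
    for source in sources.get("sources", []):
        for table in source.get("tables", []):
            for column in table.get("columns", []):
                if column.get("description"):
                    descriptions[column["name"]] = column["description"]

    with open("models/schema.yml") as f:           # assumed location
        schema = yaml.safe_load(f)

    # Fill in model columns that lack a description of their own.
    for model in schema.get("models", []):
        for column in model.get("columns", []):
            if not column.get("description") and column["name"] in descriptions:
                column["description"] = descriptions[column["name"]]

    with open("models/schema.yml", "w") as f:
        yaml.safe_dump(schema, f, sort_keys=False)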
EDIT 28/04/2021
No luck with my question so far ;) Does nobody else deal with this problem?
I want to develop an app for automatic data quality monitoring (DQM) based on different queries:
missing data
outliers
inconsistent data within the same model
inconsistent data between models
missing forms
Users should be able to 'customize' their own queries in a structured settings file (an Excel file loaded into a Django model, or directly in a Django model via a UI).
The DQM app should read this settings file and run the queries, storing the results in a model.
Users can then review the list of query results and make corrections in the database to resolve issues and improve data quality.
I looked for a Django package that already does this but could not find any, so I would appreciate some help with the design.
I have designed the settings file as below:
I've read about data quality with Pandas, but found nothing that covers all the data quality queries mentioned above.
Nevertheless, Pandas could be used to read the Excel settings file into a dataframe.
Then I need to construct queries based on the settings file. I thought of two options (see the sketch after this list):
use raw SQL: build SQL statements by concatenating strings with data read from the dataframe, and pass each statement to raw()
use Django querysets, unpacking a dictionary to provide keyword arguments: qs = mymodel.objects.filter(**kwargs)
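A minimal sketch of the second option, with a hypothetical Patient model and a settings file whose rows have model/field/lookup/value columns (all of these names are placeholders, not an established package API):

    # Sketch of option 2: build querysets from a settings file read with pandas.
    # The app, model, and settings columns below are hypothetical.
    import pandas as pd

    from myapp.models import Patient   # hypothetical Django app and model

    MODELS = {"patient": Patient}      # names used in the settings file -> models

    # Each settings row describes one check, e.g.
    # model="patient", field="weight", lookup="isnull", value=True
    settings = pd.read_excel("dqm_settings.xlsx")

    def run_check(row):
        model = MODELS[row["model"]]
        kwargs = {f"{row['field']}__{row['lookup']}": row["value"]}
        return model.objects.filter(**kwargs)   # e.g. weight__isnull=True

    # Offending records per check; persist these in a results model for review.
    results = {i: run_check(row) for i, row in settings.iterrows()}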
Is there a cleaner way to achieve data quality?
I have financial statement data on thousands of different companies. Some of the companies have data only for 2019, but for some I have decade-long data. Each company's financial statement has its own table, structured as follows (the first row holds the column names):
    lineitem    2019    2018    2017
    2           1000     800     600
    3206         700     300    -200
    56            50     100     100
    200         1200      90     700
This structure is preferred over a flatter lineitem-year-amount layout, since one query gives me the correct output structure for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records; 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table which holds the company ID, company name, and table name.

I am able to get the data into the database and run queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and not be very readable. I like the potential of using an ORM in Django or SQLAlchemy.

The ORM in SQLAlchemy seems to want me to know the name of the table I am about to create and how many columns it should have, but I don't know that, since I have a script that parses a CSV data dump which includes the company ID and financial statement data for however many years the company has operated. Also, one year later I will have to update each table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try things out much in practice due to this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow for a solution, but not found any solved questions (which is really surprising, since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy, given the structure I plan to have it fit into? How can I have the selected table(s) (based on company ID or company name) be object(s) in the ORM just like any other object, allowing me to select the data I want at the granularity level I want?
Ideally there is a solution to this in Django, but since I haven't found anything I suspect there is not or that how I have structured the database is insanity.
You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application: for example, at a certain point in the application's lifetime, new attributes become required for the data. Never because there is more data, which simply translates to new data rows in one or more tables.
So first you decide about a proper schema for database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized to third normal form (3NF).
You really have to understand this; I haven't read the linked material in full, just skimmed it, but I assume it is correct. This is fundamental database knowledge you cannot escape. Once you learn it properly, and with practice, it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
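As an illustration (a sketch only; the model and field names are mine, not a prescribed design), a normalized version of the financial statement schema above could look like this in Django:

    from django.db import models

    class Company(models.Model):
        name = models.CharField(max_length=255)

    class LineItem(models.Model):
        # The mapping table with 10,000+ records.
        code = models.IntegerField(unique=True)    # e.g. 3206
        label = models.CharField(max_length=255)   # e.g. "Debt to credit institutions"

    class StatementEntry(models.Model):
        company = models.ForeignKey(Company, on_delete=models.CASCADE)
        line_item = models.ForeignKey(LineItem, on_delete=models.PROTECT)
        year = models.PositiveSmallIntegerField()
        amount = models.DecimalField(max_digits=18, decimal_places=2)

        class Meta:
            unique_together = [("company", "line_item", "year")]

A new year of data is then just new StatementEntry rows, never a schema change, and the original wide lineitem-by-year layout is a single pivot away at query time.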
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to give data some massaging before writing the proper data rows into the existing tables.
I have data originating from different flat .csv files that I have uploaded to Azure Blob Storage. With Azure Data Factory, I have created a SQL database containing all the tables from the different files. All data sources contain the same underlying data, but use slightly different naming conventions.
The data levels in my data sources are:
Brand House
Brand Group
Product Name
Size
I would like to create one unique mapping convention (unique ID on the lowest hierarchy level) that can link all data sources together. The goal is to have a unique ID on Size level that is created in each table.
Currently I am thinking of writing a script for this in Python that looks at the string values in the different tables and creates a unique ID for each hierarchy level in the data. This script would then be run with Azure Databricks to create all the IDs. This approach requires me to look at all the different options on each hierarchy level and think of a smart naming convention.
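A minimal sketch of that script idea, assuming pandas and deterministic, hash-based surrogate keys built from the normalized hierarchy path (the column and file names are placeholders):

    # Derive a stable surrogate ID per hierarchy level by hashing normalized
    # string values, so identical values map to the same ID across sources.
    import hashlib

    import pandas as pd

    LEVELS = ["brand_house", "brand_group", "product_name", "size"]  # assumed names

    def normalize(value) -> str:
        # Lower-case and collapse whitespace so trivially different spellings match.
        return " ".join(str(value).lower().split())

    def surrogate_id(df: pd.DataFrame, depth: int) -> pd.Series:
        # Hash the normalized path down to the requested hierarchy level.
        path = df[LEVELS[:depth]].apply(
            lambda row: "|".join(normalize(v) for v in row), axis=1
        )
        return path.map(lambda s: hashlib.sha1(s.encode()).hexdigest()[:16])

    df = pd.read_csv("source_a.csv")               # hypothetical source table
    df["size_id"] = surrogate_id(df, depth=4)      # unique ID on Size level

Note the limitation: hashing only links rows whose normalized strings match exactly, so genuinely different naming conventions across sources still need a curated mapping table or a fuzzy-matching step first.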
Is there any built-in functionality in Azure Data Factory or other smart tools that can help me with this problem? The approach I have described above requires quite some manual effort and I would like to leverage any best practices here.
I am starting to get involved with CKAN. So far I have done some of the tutorials, and currently I am installing some of the available extensions.
Does anybody know if there is an extension for customizing dataset metadata fields according to differences between data sources?
For example:
Uploading text files or documents like PDFs: I want only 5 specific metadata fields to be requested.
Uploading CSV files with coordinate fields (georeferenced): I want 10 requested metadata fields. These could be different fields from the PDF ones.
In fact, I would like to add a new page where the user could first specify the type of the data source, and then the application could request only those fields that are necessary.
I have seen in the tutorial how to customize a schema with some extra metadata fields, but I don't know how to work with different metadata schemas. This extension could also be useful for customizing dataset fields.
But does someone have any idea how to have different schemas depending on the type of a dataset?
Thanks for helping me :)
Jordi.
I think with the ckan-scheming extension you get everything you want.
As you can see in their documentation, you can specify different schemas according to your needs:
Camel
Standard dataset
Feel free to create your own, customized schema, with exactly the fields that you need.
Once you have your schema (in fact you want to create two different ones, one for the text files and one for the georeferenced CSVs), you can simply use the generated form to enter those specific types of datasets.
The important bit here is that you specify a new type of dataset in the schema, e.g. {"dataset_type": "my-custom-text-dataset"}. If everything is configured as it should be, you can find and add your datasets here: http://my-ckan-instance.com/my-custom-text-dataset
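For reference, a scheming dataset schema is a JSON (or YAML) file along these lines; the example below is abridged and hypothetical, so check the extension's documentation for the exact presets available in your version:

    {
        "scheming_version": 1,
        "dataset_type": "my-custom-text-dataset",
        "dataset_fields": [
            {"field_name": "title", "label": "Title", "preset": "title"},
            {"field_name": "name", "label": "URL", "preset": "dataset_slug"},
            {"field_name": "document_language", "label": "Document language",
             "form_placeholder": "e.g. en"}
        ],
        "resource_fields": [
            {"field_name": "url", "label": "File", "preset": "resource_url_upload"}
        ]
    }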
We have a test tool that produces test results in a directory structure as follows:
test-results/${TEST-ID}/exec_${YYYYMMDD}_${HHMMSS}/
Inside each exec folder, there are several files like CSVs, HTML reports, charts, etc. The structure is always the same, and for simplicity's sake we don't use a database.
Now I would like to use Django to build a simple website for displaying these test results. Think of a reporting website, with some basic functionality like comparing test executions against each other.
From reading The Tutorial, I understand that in a Django app I should define my data in models.py using classes that extend django.db.models.Model, and later work with the API (e.g. object.save(), object.delete(), etc.) while the framework takes care of the database operations.
My data is a set of test results, which lives on a file system, not in a database.
That said, I would like to keep the data abstraction in models.py (i.e. to keep the MVC abstraction). The Django app only needs to read data, e.g.:
TestResult.objects.all() would load all TestResults from the test-results directory
TestResult.objects.filter(test_id=1) would return all TestResults for TEST-ID 1
and so on.
Updating data is not necessary; the app only reads data from the file system and displays it.
Can I achieve this behaviour using Django?
My current assumption is that I have to write the abstraction layer somewhere (extend the Model class and override certain methods?), but I'm not sure this is the best/correct approach.
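One workable approach is to skip django.db.models.Model entirely and write a plain Python class with a small, manager-like API that scans the directory tree. A sketch, assuming the layout above (class and method names are illustrative):

    from dataclasses import dataclass
    from datetime import datetime
    from pathlib import Path

    RESULTS_ROOT = Path("test-results")    # adjust to the deployment

    @dataclass(frozen=True)
    class TestResult:
        test_id: str
        executed_at: datetime
        path: Path                         # exec folder with CSVs, reports, etc.

        @classmethod
        def all(cls):
            # Layout: test-results/${TEST-ID}/exec_${YYYYMMDD}_${HHMMSS}/
            for exec_dir in sorted(RESULTS_ROOT.glob("*/exec_*_*")):
                _, date, time = exec_dir.name.split("_")
                yield cls(
                    test_id=exec_dir.parent.name,
                    executed_at=datetime.strptime(date + time, "%Y%m%d%H%M%S"),
                    path=exec_dir,
                )

        @classmethod
        def filter(cls, test_id):
            return [r for r in cls.all() if r.test_id == str(test_id)]

Nothing in Django's views or templates requires objects to come from the ORM, so these objects can be rendered like any queryset result; models.Model only buys you database-backed features (migrations, admin, real querysets) that a read-only file-system source doesn't need.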