Smart ways to map data from different data sources creating unique ids

I have data originating from different flat .csv files that I have uploaded to Azure Blob Storage. With Azure Data Factory, I have created a SQL database containing all the tables from the different files. All data sources contain the same underlying data, but use slightly different naming conventions.
The data levels in my data sources are:
Brand House
Brand Group
Product Name
Size
I would like to create one unique mapping convention (unique ID on the lowest hierarchy level) that can link all data sources together. The goal is to have a unique ID on Size level that is created in each table.
Currently I am thinking of writing a Python script that looks at the string values in the different tables and creates a unique ID for each hierarchy level in the data. This script would then be run with Azure Databricks to create all the IDs. This approach requires me to look at all the different options on each hierarchy level and think of a smart naming convention.
Is there any built-in functionality in Azure Data Factory or other smart tools that can help me with this problem? The approach I have described above requires quite some manual effort and I would like to leverage any best practices here.
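The script idea described above could be sketched roughly as follows. This is only a minimal illustration, assuming the naming differences between sources are superficial (case, punctuation, spacing, accents); real synonyms or abbreviations would still need a hand-maintained alias table. The helper names normalize and make_id and the sample brand values are hypothetical.

```python
import hashlib
import re
import unicodedata


def normalize(value: str) -> str:
    """Reduce a label to lowercase ASCII letters and digits only."""
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def make_id(*levels: str) -> str:
    """Build a deterministic ID from the hierarchy path down to one level."""
    key = "|".join(normalize(level) for level in levels)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


# Two hypothetical sources that spell the same product differently.
source_a = ("Acme Brands", "Acme Cola", "Cola Zero", "500 ml")
source_b = ("ACME BRANDS", "acme-cola", "Cola  Zero", "500ML")

size_id_a = make_id(*source_a)
size_id_b = make_id(*source_b)

# IDs for the higher levels come from a prefix of the same path.
brand_house_id = make_id(source_a[0])
brand_group_id = make_id(*source_a[:2])
```

Because the ID is derived only from the normalized values, the same function can be applied to each table independently and the resulting Size-level IDs will match wherever the labels differ only in formatting.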

Related

How to change the structure of tables in multiple data dictionaries across an Advantage Database Server?

We have tables with the same structure in multiple data dictionaries on Advantage Database Server version 10. If we change the structure of a table in one dictionary, we have to make the same change manually in all the other data dictionaries. Is there any way to automate this with a program or tool, so that the changes can be applied across multiple data dictionaries in a single process? Please help on this subject.

How can I have a database with thousands of tables with varying number of columns that are all of the same class in Django / SQLAlchemy ORM?

I have financial statement data on thousands of different companies. Some companies have data only for 2019, while for others I have a decade of data. Each company's financial statement has its own table, structured as follows (the first row holds the column names):
lineitem    2019    2018    2017
2           1000     800     600
3206         700     300    -200
56            50     100     100
200         1200      90     700
This structure is preferred over a flatter structure like lineitem-year-amount, since one query gives me the output in the correct layout for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records. 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table, which has the company ID, company name, and table name. I am able to get the data into the database and make queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and not be very readable. I like the potential of using the ORM in Django or SQLAlchemy. The ORM in SQLAlchemy seems to want me to know the name of the table I am about to create and how many columns to create, but I don't know that in advance, since I have a script that parses a data dump in CSV, which includes the company ID and financial statement data for the number of years the company has operated. Also, one year later I will have to update the table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try them out much in practice because of this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow for a solution, but have not found any solved questions (which is really surprising, since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy given the structure I plan to fit it into? How can I have the selected table(s) (based on company ID or company name) be object(s) in the ORM just like any other object, allowing me to select the data I want at the granularity level I want?
Ideally there is a solution to this in Django, but since I haven't found anything, I suspect there is not, or that the way I have structured the database is insanity.

You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application, for example because at a certain point in the application lifetime new attributes become required for the data. You do not alter it because there is more data, which simply translates to new data rows in one or more tables.
So first you decide about a proper schema for database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized to third normal form.
You really have to understand this. I have only skimmed it rather than read it in full, but I assume it is correct. This is fundamental database knowledge you cannot escape. After learning it properly, and with practice, it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to give data some massaging before writing the proper data rows into the existing tables.
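As a rough illustration of the advice above, the per-company wide tables can be replaced by one fixed schema where each year is a row rather than a column. The sketch below uses the standard-library sqlite3 module (which the question already uses); the table names, column names, the company name and the statement helper are assumptions made for the example, while the figures come from the sample table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE line_item (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
    CREATE TABLE statement_value (
        company_id   INTEGER NOT NULL REFERENCES company(id),
        line_item_id INTEGER NOT NULL REFERENCES line_item(id),
        year         INTEGER NOT NULL,
        amount       REAL    NOT NULL,
        PRIMARY KEY (company_id, line_item_id, year)
    );
    """
)

conn.execute("INSERT INTO company VALUES (1, 'Example Corp')")  # hypothetical company
conn.execute("INSERT INTO line_item VALUES (3206, 'Debt to credit institutions')")
conn.execute("INSERT INTO line_item VALUES (56, 'Line item 56')")  # placeholder text

# A new year of data is just more rows; the schema never changes.
conn.executemany(
    "INSERT INTO statement_value VALUES (?, ?, ?, ?)",
    [
        (1, 3206, 2019, 700), (1, 3206, 2018, 300), (1, 3206, 2017, -200),
        (1, 56, 2019, 50), (1, 56, 2018, 100), (1, 56, 2017, 100),
    ],
)


def statement(conn, company_id):
    """Return (years, {line_item_id: {year: amount}}) for one company."""
    years = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT year FROM statement_value "
            "WHERE company_id = ? ORDER BY year DESC",
            (company_id,),
        )
    ]
    table = {}
    for line_item_id, year, amount in conn.execute(
        "SELECT line_item_id, year, amount FROM statement_value "
        "WHERE company_id = ?",
        (company_id,),
    ):
        table.setdefault(line_item_id, {})[year] = amount
    return years, table


years, table = statement(conn, 1)
```

The wide "one column per year" layout is then produced when reading the data, not stored in the schema, and the same three tables map directly onto three ordinary ORM models.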

CKAN - Different datasets

I am starting to get involved with CKAN. Until now I have done some of the tutorials and currently I am installing some of the available extensions.
Does anybody know of an extension for customizing dataset metadata fields according to the differences between data sources?
For example:
Uploading text files or documents like PDF: I want only 5 specific metadata fields to be requested.
Uploading CSV files with coordinate fields (georeferenced): I want 10 metadata fields to be requested. These could be different fields from the PDF ones.
In fact, I would like to add a new page where the user first specifies the type of data source, and the application then requests only the fields that are needed for it.
I have seen in the tutorial how to customize a schema with some extra metadata fields, but I don't know how to work with different metadata schemas. This extension could also be useful for customizing dataset fields.
But does anyone have an idea of how to have different schemas depending on the type of dataset?
Thanks for helping me :)
Jordi.

I think the ckanext-scheming extension gives you everything you want.
As you can see in their documentation, you can specify different schemas according to your needs:
Camel
Standard dataset
Feel free to create your own, customized schema, with exactly the fields that you need.
Once you have your schema (in fact you want to create two different ones, one for the text files and one for the georeferenced CSVs), you can simply use the generated form to enter those specific types of datasets.
The important bit here is that you specify a new type of dataset in the schema, e.g. {"dataset_type": "my-custom-text-dataset"}. If everything is configured as it should be, you can find and add your datasets here: http://my-ckan-instance.com/my-custom-text-dataset
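To make the idea of two schemas concrete, here is a small sketch that generates two schema files, one per dataset type. The top-level keys (scheming_version, dataset_type, dataset_fields, resource_fields) and the per-field keys (field_name, label) follow the ckanext-scheming schema format as I understand it; the dataset type for the CSV case, the field names and the make_schema helper are made up for illustration, so check them against the extension's documentation before use.

```python
import json


def make_schema(dataset_type, about, field_names):
    """Build a schema dict with one entry per requested metadata field."""
    return {
        "scheming_version": 2,
        "dataset_type": dataset_type,
        "about": about,
        "dataset_fields": [
            {"field_name": name, "label": name.replace("_", " ").title()}
            for name in field_names
        ],
        "resource_fields": [
            {"field_name": "url", "label": "URL"},
            {"field_name": "name", "label": "Name"},
        ],
    }


# Hypothetical field lists: a short one for documents, a longer one for geo CSVs.
text_schema = make_schema(
    "my-custom-text-dataset",
    "Text files and PDF documents",
    ["title", "name", "notes", "author", "language"],
)
geo_schema = make_schema(
    "my-custom-geo-dataset",
    "Georeferenced CSV files",
    ["title", "name", "notes", "author", "language",
     "coordinate_system", "latitude_column", "longitude_column",
     "spatial_extent", "collection_date"],
)

text_schema_json = json.dumps(text_schema, indent=2)
geo_schema_json = json.dumps(geo_schema, indent=2)
```

Each JSON string would be saved to its own file and both files listed in the scheming.dataset_schemas setting of the CKAN configuration, so that each dataset type gets its own form.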

Storing pandas DataFrames in SQLAlchemy models

I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
1. Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
2. Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
3. Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators.
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?

Go with the JSON and PostgreSQL solution. I am on a pandas project that started with pickles on the file system, loading the data into a class object for processing with pandas. However, as the data became large, we experimented with SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
Have fun! Pandas rocks!
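For reference, the JSON route boils down to a round trip like the one below. This is a minimal sketch that covers only the serialization step; the sample columns are made up, and storing the resulting string in a PostgreSQL json column through a SQLAlchemy model is left out.

```python
import io

import pandas as pd

# Hypothetical uploaded data.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima"], "visits": [3, 5], "score": [1.5, 2.25]}
)

# "split" keeps column order and the index as separate keys in the JSON.
payload = df.to_json(orient="split")

# Later, in another view or worker: rebuild the DataFrame from the stored text.
restored = pd.read_json(io.StringIO(payload), orient="split")
```

The payload string is what would be written to the JSON column next to the metadata (upload time, user ID). Note that dtypes are inferred again on load, so columns such as dates may need explicit handling.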

How to make table partitions?

I am not very familiar with databases, and so I do not know how to partition a table using SQLAlchemy.
Your help would be greatly appreciated.

There are two kinds of partitioning: Vertical Partitioning and Horizontal Partitioning.
From the docs:
Vertical Partitioning
Vertical partitioning places different kinds of objects, or different tables, across multiple databases:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
# User and Account are mapped classes defined elsewhere
engine1 = create_engine('postgresql://db1')
engine2 = create_engine('postgresql://db2')
Session = sessionmaker(twophase=True)
# bind User operations to engine 1, Account operations to engine 2
Session.configure(binds={User:engine1, Account:engine2})
session = Session()
Horizontal Partitioning
Horizontal partitioning partitions the rows of a single table (or a set of tables) across multiple databases.
See the “sharding” example in attribute_shard.py
Just ask if you need more information on those, preferably providing more information about what you want to do.

It's quite an advanced subject for somebody not familiar with databases, but try Essential SQLAlchemy (you can read the key parts on Google Book Search -- p 122 to 124; the example on p. 125-126 is not freely readable online, so you'd have to purchase the book or read it on commercial services such as O'Reilly's Safari -- maybe on a free trial -- if you want to read the example).
Perhaps you can get better answers if you mention whether you're talking about vertical or horizontal partitioning, why you need partitioning, and what underlying database engines you are considering for the purpose.

Automatic partitioning is a very database-engine-specific concept, and SQLAlchemy doesn't provide any generic tools to manage partitioning, mostly because such tools wouldn't provide anything really useful while being another API to learn. If you want to do database-level partitioning, create the tables with custom Oracle DDL statements (see the Oracle documentation for how to create partitioned tables and migrate data to them). You can use a partitioned table in SQLAlchemy just like you would use a normal table; you just need the table declaration so that SQLAlchemy knows what to query. You can reflect the definition from the database, or just duplicate the table declaration in SQLAlchemy code.
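The "create with raw DDL, then reflect" workflow might look like the sketch below. It assumes SQLAlchemy 1.4 or later and uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; SQLite has no partitioning, so the CREATE TABLE here is plain, whereas on Oracle it would carry the engine-specific PARTITION BY clause. The events table and its columns are hypothetical.

```python
from sqlalchemy import MetaData, Table, create_engine, select, text

engine = create_engine("sqlite://")  # stand-in for the real database URL

# Table creation is done with hand-written DDL, outside SQLAlchemy's schema tools.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, created_on TEXT NOT NULL, payload TEXT)"
    ))
    conn.execute(text(
        "INSERT INTO events (created_on, payload) "
        "VALUES ('2023-12-31', 'old'), ('2024-01-15', 'new')"
    ))

# Reflect the existing table so SQLAlchemy knows its columns.
metadata = MetaData()
events = Table("events", metadata, autoload_with=engine)

# Query it like any other table; filtering on the date column is what lets a
# partition-aware database skip partitions it does not need.
with engine.connect() as conn:
    recent = conn.execute(
        select(events.c.payload).where(events.c.created_on >= "2024-01-01")
    ).scalars().all()
```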
Very large datasets are usually time-based, with older data becoming read-only or read-mostly and queries usually only look at data from a time interval. If that describes your data, you should probably partition your data using the date field.
There's also application-level partitioning, or sharding, where your application splits data across different database instances. This isn't all that popular in the Oracle world due to the exorbitant pricing models. If you do want to use sharding, look at the SQLAlchemy documentation and examples to see how SQLAlchemy can support you, but be aware that application-level sharding will affect how you need to build your application code.
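To show why sharding leaks into application code, here is a deliberately simple sketch that does not use SQLAlchemy's own sharding support: every read and write first has to pick a database based on the row's key. It uses two in-memory SQLite databases from the standard library as stand-ins for separate instances; the users table and all function names are hypothetical.

```python
import sqlite3
import zlib

SHARD_COUNT = 2

# Each connection stands in for a separate database instance.
shards = [sqlite3.connect(":memory:") for _ in range(SHARD_COUNT)]
for db in shards:
    db.execute("CREATE TABLE users (name TEXT PRIMARY KEY, email TEXT)")


def shard_for(key: str) -> sqlite3.Connection:
    """Pick a shard from a stable hash of the key (crc32 is stable across runs)."""
    return shards[zlib.crc32(key.encode("utf-8")) % SHARD_COUNT]


def save_user(name: str, email: str) -> None:
    db = shard_for(name)
    db.execute("INSERT INTO users VALUES (?, ?)", (name, email))
    db.commit()


def load_user(name: str):
    row = shard_for(name).execute(
        "SELECT email FROM users WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def count_users() -> int:
    """Cross-shard queries must visit every shard and combine the results."""
    return sum(db.execute("SELECT COUNT(*) FROM users").fetchone()[0] for db in shards)


for person in ["alice", "bob", "carol", "dave"]:
    save_user(person, f"{person}@example.com")
```

Lookups by key stay cheap, but anything that spans keys (counts, joins, reports) has to be assembled in the application, which is the trade-off mentioned above.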
