Generic data handling in Python


The situation
While reading the Bible (as context) I'd like to record certain dependencies, e.g. between people and locations. I'm choosing Python to handle this varied data because it is easy to extend. Currently I'm creating many feature vectors, independent of each other and containing various information, to serve as the database.
In the end I'd like to type in a keyword to search this whole database, which should return everything related to it. Something as simple as
results = database(key)
What I'm looking for
Unfortunately I'm not experienced with the different ways of handling a database, and I hope you can help me find an appropriate option.
Are there possibilities that can be used out of the box or do I need to create all logic by myself?

This is a little vague so I'll try to handle the People and Location bit of it to help you get started.
One possibility is to build a SQLite database. (The sqlite3 library + documentation is relatively friendly). Also here's a nice tutorial on getting started with SQLite.
To start, you can create two entity tables:
People: contains details about every person in the Bible.
Locations: contains details about every location in the Bible.
You can then create two relationship tables that reference people and locations (as Foreign Keys). For example, one of these relationship tables might be
People_Visited_Locations: contains information about where each person visited in their lifetime. The schema might look something like this:
| person (Foreign Key)| location (Foreign Key) | year |
Remember that Foreign Key refers to an entry in another table. In our case, person is an existing unique ID from your entity table People, location is an existing unique ID from your entity table Locations, and year could be the year that person went to that location.
Then, to fetch every place that some person (say, Adam) visited, you can create a Select statement that returns all entries in People_Visited_Locations with Adam as person.
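A minimal sketch of the tables and the Select described above, using Python's built-in sqlite3 module. The id/name columns, the helper names add_visit and locations_visited, and the sample rows are assumptions made for illustration; year is left empty because the real values would come from your own data.

```python
import sqlite3

# In-memory database for illustration; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
CREATE TABLE People    (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE Locations (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE People_Visited_Locations (
    person   INTEGER NOT NULL REFERENCES People(id),
    location INTEGER NOT NULL REFERENCES Locations(id),
    year     INTEGER              -- NULL when the year is unknown
);
""")

# Illustrative sample rows (assumed, not a real dataset).
conn.executemany("INSERT INTO People (name) VALUES (?)",
                 [("Adam",), ("Abraham",)])
conn.executemany("INSERT INTO Locations (name) VALUES (?)",
                 [("Eden",), ("Ur",), ("Canaan",)])


def add_visit(person, location, year=None):
    """Link an existing person to an existing location by name.

    Inserts nothing if either name is missing from its entity table.
    """
    conn.execute(
        """INSERT INTO People_Visited_Locations (person, location, year)
           SELECT p.id, l.id, ?
           FROM People p, Locations l
           WHERE p.name = ? AND l.name = ?""",
        (year, person, location),
    )


def locations_visited(person):
    """Return the names of all locations the given person visited."""
    rows = conn.execute(
        """SELECT l.name
           FROM People_Visited_Locations v
           JOIN People p    ON p.id = v.person
           JOIN Locations l ON l.id = v.location
           WHERE p.name = ?
           ORDER BY l.name""",
        (person,),
    )
    return [name for (name,) in rows]


add_visit("Adam", "Eden")
add_visit("Abraham", "Ur")
add_visit("Abraham", "Canaan")

print(locations_visited("Adam"))     # ['Eden']
print(locations_visited("Abraham"))  # ['Canaan', 'Ur']
```

A keyword search like results = database(key) could then be built from a handful of such queries, one per relationship table.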
I think the key (pun intended) takeaway is how relationship tables can help you map relationships between entities.
Hope this helps get you started :)

Related

How can I have a database with thousands of tables with varying number of columns that are all of the same class in Django / SQLAlchemy ORM?

I have financial statement data on thousands of different companies. For some of the companies I have data only for 2019, but for others I have a decade of data. Each company's financial statement has its own table, structured as follows, with the column names in the first row:
| lineitem | 2019 | 2018 | 2017 |
| 2 | 1000 | 800 | 600 |
| 3206 | 700 | 300 | -200 |
| 56 | 50 | 100 | 100 |
| 200 | 1200 | 90 | 700 |
This structure is preferred over a flatter structure like lineitem-year-amount, since one query gives me the correct layout for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records; 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table which holds the company ID, company name, and table name. I am able to get the data into the database and run queries using sqlite3 in Python, but advanced queries are a challenge at times, not to mention that they can take a lot of time and are not very readable. I like the potential of using an ORM such as Django's or SQLAlchemy's. The SQLAlchemy ORM seems to want me to know in advance the name of the table I am about to create and how many columns it has, but I don't know that, since I have a script that parses a CSV data dump which includes the company ID and financial statement data for however many years the company has operated. Also, one year later I will have to update the table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try them out much in practice because of this initial problem, which I have to solve before my project can succeed. I have googled and googled, and checked Stack Overflow for a solution, but have not found any solved questions (which is really surprising since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy given the structure I plan to fit it into? How can I have the selected table(s) (based on company ID or company name) be object(s) in the ORM just like any other object, allowing me to select the data I want at the granularity I want?
Ideally there is a solution to this in Django, but since I haven't found anything I suspect there is not or that how I have structured the database is insanity.

You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again unless you decide that the schema has to be altered to support different, usually additional, functionality in the application, for example because new attributes become required for the data at some point in the application's lifetime. You do not alter it because there is more data, which simply translates to new rows in one or more tables.
So first you decide about a proper schema for database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized to third normal form.
You really have to understand this. Haven't read it, just skimmed over but I assume it is correct. This is fundamental database knowledge you cannot escape. After learning it right and with practice it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to give data some massaging before writing the proper data rows into the existing tables.
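As a rough sketch of what this means for the financial statement data in the question: one fixed table holds one row per (company, lineitem, year), and the wide per-year layout is produced at query time. The table and column names (company, statement_line, amount), the company name, and the 2020 figure are assumptions for illustration; the other amounts are taken from the question's example. The sketch uses the sqlite3 module, but the same schema maps directly onto a Django or SQLAlchemy model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE statement_line (
    company_id INTEGER NOT NULL REFERENCES company(id),
    lineitem   INTEGER NOT NULL,   -- key into the lineitem mapping table
    year       INTEGER NOT NULL,
    amount     INTEGER NOT NULL,
    PRIMARY KEY (company_id, lineitem, year)
);
""")

conn.execute("INSERT INTO company (id, name) VALUES (1, 'Example Co')")
conn.executemany(
    "INSERT INTO statement_line VALUES (?, ?, ?, ?)",
    [
        (1, 2, 2019, 1000), (1, 2, 2018, 800), (1, 2, 2017, 600),
        (1, 3206, 2019, 700), (1, 3206, 2018, 300), (1, 3206, 2017, -200),
    ],
)


def statement(company_id):
    """Pivot the long rows into {lineitem: {year: amount}}."""
    table = {}
    query = ("SELECT lineitem, year, amount FROM statement_line "
             "WHERE company_id = ?")
    for lineitem, year, amount in conn.execute(query, (company_id,)):
        table.setdefault(lineitem, {})[year] = amount
    return table


# A new year of data is just more rows; the schema stays untouched.
# (The 2020 amount is a made-up placeholder.)
conn.execute("INSERT INTO statement_line VALUES (1, 2, 2020, 1100)")

print(statement(1)[3206][2017])  # -200
print(sorted(statement(1)[2]))   # [2017, 2018, 2019, 2020]
```

With this layout a company with one year of data and a company with ten years of data live in the same table, and no table names have to be generated by the import script.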

Association objects with history for relationships using ORM

A common type of relationship in schemas is this: a joiner table has a datetime element and is meant to store history about relationships between the rows of two other tables over time. These relationships are one-to-one or one-to-many even though we're using an association table which usually implies many-to-many. At any given point in time only one mapping, the latest at that point in time, is valid. For example:
Tables:
Computers: [id, name, description]
Locations: [id, name, address]
ComputerLocations: [id, computers_id, locations_id, timestamp]
A Computers object can only belong to one Locations object at a time (and Locations can have many Computers), but we store the history in the table. Rows in ComputerLocations aren't deleted, only superseded by new rows at query-time. Perhaps in the future some prune-type event will remove older rows as their usefulness is reduced.
What I'm looking to do is model this in SQLAlchemy, specifically in the ORM, so that a Computers class has the following properties:
A new Computer can be created without (independently of) a location (this makes sense because the location table is separate)
A new Location can be created without (independently of) a computer
If a Computer has a location it must be a member of Locations (foreign key constraint)
When updating an existing Computers object's location, a new row will be added to ComputerLocations with a datetime of NOW()
When creating a new Computers object with a location, a new row will be added to ComputerLocations with a datetime of NOW()
Everything should be atomic (i.e. fail if a new Computer is created but the row associating it to a location can't be created)
Is there a specific design pattern or a concrete method in the SQLAlchemy ORM to accomplish this? The documentation has a section on non-traditional mappings that includes mapping a class against multiple tables and against arbitrary selects, so this looks promising. Further, there was another question on Stack Overflow that mentioned vertical tables. Due to my relative inexperience with SQLAlchemy I cannot yet synthesize this information into a robust and elegant solution. Any help would be greatly appreciated.
I'm using MySQL but a solution should be general enough for any database through the SQLAlchemy dialects system.
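To make the requirements above concrete, here is a minimal sketch of the "latest row wins" history table at the SQL level, using the sqlite3 module rather than the SQLAlchemy ORM. It is not an ORM mapping, only an illustration of the behaviour such a mapping would need to reproduce. The helper names (create_computer, move_computer, current_location) and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the foreign key constraints
conn.executescript("""
CREATE TABLE Computers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Locations (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE ComputerLocations (
    id           INTEGER PRIMARY KEY,
    computers_id INTEGER NOT NULL REFERENCES Computers(id),
    locations_id INTEGER NOT NULL REFERENCES Locations(id),
    timestamp    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")


def _append_location(computer_id, location_id):
    # History is append-only: a move is a new row, never an UPDATE.
    conn.execute(
        "INSERT INTO ComputerLocations (computers_id, locations_id) "
        "VALUES (?, ?)",
        (computer_id, location_id),
    )


def create_computer(name, location_id=None):
    """Create a computer, optionally with a location, in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute("INSERT INTO Computers (name) VALUES (?)", (name,))
        computer_id = cur.lastrowid
        if location_id is not None:
            _append_location(computer_id, location_id)
    return computer_id


def move_computer(computer_id, location_id):
    with conn:
        _append_location(computer_id, location_id)


def current_location(computer_id):
    """Return the name of the most recent location, or None."""
    row = conn.execute(
        """SELECT l.name
           FROM ComputerLocations cl
           JOIN Locations l ON l.id = cl.locations_id
           WHERE cl.computers_id = ?
           ORDER BY cl.timestamp DESC, cl.id DESC  -- id breaks same-second ties
           LIMIT 1""",
        (computer_id,),
    ).fetchone()
    return row[0] if row else None


with conn:
    conn.executemany("INSERT INTO Locations (name) VALUES (?)",
                     [("Lab",), ("Office",)])

pc = create_computer("pc-1", location_id=1)
move_computer(pc, 2)
print(current_location(pc))  # Office
```

Creating a computer with a location id that does not exist raises sqlite3.IntegrityError, and the transaction rolls back, so no Computers row is left behind.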

MongoEngine: EmbeddedDocument v/s. ReferenceField

EmbeddedDocument lets you store a document inside another document, while ReferenceField just stores a reference to it. But they achieve a similar goal. Do they have specific use cases?
PS:
There's already a question on SO, but no good answers.

The answer to this really depends on what you intend to do with the data you are storing in MongoDB. It is important to remember that a ReferenceField will point to a document in another collection in MongoDB, whereas an EmbeddedDocument is stored in the same document in the same collection.
Consider this schema:
Person
-> name
-> address
Address
-> street
-> city
-> country
If you expect every person to have only one address and each address to only be associated with one person (a one-to-one relationship) and you are generally going to query the database for one or more Person documents then the Person.address field should be EmbeddedDocumentField.
If you expect every person to have more than one address but each address will only be associated to one person (a one-to-many relationship) and you will still mainly query for a Person then you can use an EmbeddedDocumentListField.
If you expect every person to have more than one address and each address will be associated with many people (a many-to-many relationship) you probably should use ReferenceField.
However, even if the relationship is one-to-one or one-to-many, if the Address is a part of your data model that is of interest in its own right, then it may be advantageous to store it in its own collection because that makes aggregation and indexing easier.
One other point to consider is that, unless you turn it off, mongoengine will de-reference every ReferenceField when you retrieve a document. This might introduce performance penalties with lots of ReferenceFields or with references to very large documents.
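The difference in stored shape can be pictured with plain Python dicts standing in for MongoDB collections. This is only an illustration of the two layouts and of the extra lookup a reference costs; it does not use the mongoengine API, and the sample values and the city_of helper are made up.

```python
# A dict keyed by _id stands in for the separate "address" collection.
addresses = {
    "a1": {"_id": "a1", "street": "1 Example St",
           "city": "Springfield", "country": "US"},
}

# Embedded: the address is stored inside the person document itself.
person_embedded = {
    "_id": "p1",
    "name": "Ann",
    "address": {"street": "1 Example St",
                "city": "Springfield", "country": "US"},
}

# Referenced: the person document stores only the address _id.
person_referenced = {"_id": "p2", "name": "Bob", "address": "a1"}


def city_of(person, address_collection):
    """Return (city, number of extra lookups needed to reach it)."""
    address = person["address"]
    lookups = 0
    if not isinstance(address, dict):
        # A reference has to be followed into the other collection.
        address = address_collection[address]
        lookups = 1
    return address["city"], lookups


print(city_of(person_embedded, addresses))    # ('Springfield', 0)
print(city_of(person_referenced, addresses))  # ('Springfield', 1)
```

The embedded form answers the query from a single document, while the referenced form lets many people share one address at the price of an additional fetch per reference.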

It's really about the schema design of your collections in MongoDB. Generally it depends on factors such as the cardinality of the relationship, the way the data is accessed, and the size of the documents. It's explained well, with some examples, on the official MongoDB blog, and I recommend you take a look at it.

Django design patterns - models with ForeignKey references to multiple classes

I'm working through the design of a Django inventory tracking application, and have hit a snag in the model layout. I have a list of inventoried objects (Assets), which can either exist in a Warehouse or in a Shipment. I want to store different lists of attributes for the two types of locations, e.g.:
For Warehouses, I want to store the address, manager, etc.
For Shipments, I want to store the carrier, tracking number, etc.
Since each Warehouse and Shipment can contain multiple Assets, but each Asset can only be in one place at a time, adding a ForeignKey relationship to the Asset model seems like the way to go. However, since Warehouse and Shipment objects have different data models, I'm not certain how to best do this.
One obvious (and somewhat ugly) solution is to create a Location model which includes all of the Shipment and Warehouse attributes and an is_warehouse Boolean attribute, but this strikes me as a bit of a kludge. Are there any cleaner approaches to solving this sort of problem (Or are there any non-Django Python libraries which might be better suited to the problem?)

What about having a generic foreign key on Assets?

I think it's perfectly reasonable to create a "through" table such as location, which associates an asset, a content (foreign key) and a content_type (warehouse or shipment). And you could set a unique constraint on the asset_fk so that it can only exist in one location at a time.
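A minimal sketch of such a "through" table at the SQL level, using the sqlite3 module instead of Django models. The column names (asset_id, content_type, object_id), the helper names place and where_is, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE asset     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE warehouse (id INTEGER PRIMARY KEY, address TEXT, manager TEXT);
CREATE TABLE shipment  (id INTEGER PRIMARY KEY, carrier TEXT,
                        tracking_number TEXT);
CREATE TABLE location (
    -- UNIQUE: an asset can be in only one place at a time
    asset_id     INTEGER NOT NULL UNIQUE REFERENCES asset(id),
    content_type TEXT NOT NULL
                 CHECK (content_type IN ('warehouse', 'shipment')),
    object_id    INTEGER NOT NULL  -- id in the table named by content_type
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO asset (id, name) VALUES (1, 'pallet-1')")
conn.execute("INSERT INTO warehouse (id, address, manager) "
             "VALUES (1, 'Dock 4', 'A. Manager')")
conn.execute("INSERT INTO shipment (id, carrier, tracking_number) "
             "VALUES (1, 'ExampleCarrier', 'TRACK123')")
conn.commit()


def place(asset_id, content_type, object_id):
    """Put an asset somewhere, replacing its previous location row."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO location "
            "(asset_id, content_type, object_id) VALUES (?, ?, ?)",
            (asset_id, content_type, object_id),
        )


def where_is(asset_id):
    """Return (content_type, object_id) for the asset, or None."""
    return conn.execute(
        "SELECT content_type, object_id FROM location WHERE asset_id = ?",
        (asset_id,),
    ).fetchone()


place(1, "warehouse", 1)
place(1, "shipment", 1)  # moving the asset replaces the old row
print(where_is(1))       # ('shipment', 1)
```

Note that object_id is not a real foreign key here: the database cannot check it against two different tables, so that check has to live in application code. The same is true of a generic foreign key in Django.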

Corresponding Objects to Tables in Database Design

Say I have an object which is composed of multiple pieces of information: rating, like, comment. Let's call this object a preference. Each preference would be associated with a user. That is, each user has many preferences, but each preference has only one user.
In what ways would it be better for my preference object to be structured into the design of the database, for example, as a table with columns rating, like, comment, and a foreign id key pointing to a user? A user's preference may or may not contain a rating, like, or comment, and if they don't, the entry for that specific column would be left blank.
And in what ways would it be better for my preference object to be instead assembled outside of the design of the database, by collecting each piece it needs from several tables, a table each for rating, like, and comment, and each table having a column pointing to a foreign id key of a user? If the user lacks a rating, like, or comment, that table would simply not have an entry for that user.
Specifically I will be using python and sqlalchemy to accomplish this.

You might want to look into the Entity-Attribute-Value model:
http://weblogs.sqlteam.com/davidm/articles/12117.aspx
