Since mongo doesn't have a schema, does that mean that we won't have to do migrations when we change the models?
What does the migration process look like with a non-relational db?
I think this is a really good question, but the answers are going to be a little scattered based on the libs you're using and your expectations for a "migration".
Let's take a look at some common migration actions:
Add a field: Mongo makes this very easy. Just add a field and you're done.
Delete a field: In theory, you're not actually tied to your schema, so "deletion" here is relative. If you remove the "property" and no longer load the field, then it doesn't really matter if that field is in the data. So if you don't care about "cleaning up" the database, then removing a field doesn't affect the database. If you do care about cleaning the DB, you'll basically need to run a giant for loop against the DB.
Modify a field name: This is also a difficult problem. When you rename a field, "where" are you renaming it? If you want the DB to reflect the new field name, then you basically have to execute a giant for loop on the DB. To be safe you probably have to "add" the new field, then push code, then "unset" the old field. (A rough pymongo sketch of all three actions follows this list.)
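As an illustration only, with made-up database, collection and field names:

    from pymongo import MongoClient

    db = MongoClient()["mydb"]        # hypothetical database name
    products = db["products"]         # hypothetical collection

    # Add a field: existing documents simply lack it until you backfill.
    products.update_many({}, {"$set": {"discount": 0}})

    # Delete a field: only needed if you care about cleaning up the data.
    products.update_many({}, {"$unset": {"legacy_code": ""}})

    # Rename a field: rewrites every document that still has the old name.
    products.update_many({}, {"$rename": {"qty": "quantity"}})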
Some Wrinkles
However, the concept of a field name gets a little skewed once an ActiveRecord-style object is involved. An ActiveRecord object is effectively providing a mapping of object properties to actual database fields.
In a typical RDBMS the "size" of a field name is not really relevant. In Mongo, however, the field name is stored inside every document, so it actually occupies data space, and this makes a big difference in terms of performance.
Now, if you're using some form of "data object" like ActiveRecord, why would you attempt to store full field names in the data? The DB should probably be storing all fields in alphabetical order with a map on the Object side. So a Document could have 8 fields/properties, the DB names would be "a", "b" ... "h", but the Object names would be readable stuff like "Name", "Price", "Quantity".
The reason I bring this up is that it adds yet another wrinkle to Modify a field name. If you're implementing a mapping then modifying a field name doesn't really cause a migration at all.
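As one possible illustration of such a mapping in Python, MongoEngine's db_field option lets the code keep readable attribute names while the documents store short ones (the model and names below are made up):

    from mongoengine import Document, StringField, IntField

    class Product(Document):
        # Readable attribute names in code, one-letter field names on disk.
        name = StringField(db_field="a")
        price = IntField(db_field="b")
        quantity = IntField(db_field="c")

Renaming quantity in the code then only changes the mapping; the stored field name stays "c", so no data has to be rewritten.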
Some more Wrinkles
If you do want to implement a migration on a deletion, then you'll have to do so after a deploy. You'll also have to recognize that you won't save any current disk space when you do so.
Mongo pre-allocates space and it doesn't really "give it back" unless you do a DB repair. So if you delete a bunch of fields on documents, those documents still occupy the same space on disk. If the documents are later moved, then you may reclaim space, however documents only move when they grow.
If you remove a large field from lots of documents, you'll want to run a repair or check out the newer in-place compact command.
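For example, compact is issued per collection, roughly like this (the database and collection names are made up):

    from pymongo import MongoClient

    db = MongoClient()["mydb"]
    # Rewrites and defragments the collection's data and indexes in place.
    db.command("compact", "products")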
There is no silver bullet. Adding or removing fields is easier with a non-relational DB (just stop using the unneeded fields, or start using new ones); renaming a field is easier with a traditional DB (in a schemaless DB a rename usually means rewriting a lot of data); and general data migration is roughly on par, depending on the task.
What does the migration process look like with a non-relational db?
It depends on whether you need to update all the existing data or not.
In many cases you may not need to touch the old data at all, such as when adding a new optional field. If that field also has a default value, you may not need to update the old documents either, as long as your application can handle a missing field correctly. However, if you want to build an index on the new field to be able to search/filter/sort, you will need to backfill the default value into the old documents.
Something like field renaming (trivial in a relational db, because you only need to update the catalog and not touch any data) is a major undertaking in MongoDB (you need to rewrite all documents).
If you need to update the existing data, you usually have to write a migration function that iterates over all the documents and updates them one by one (although this work can be split up and run in parallel); a sketch follows below. For large data sets this can take a lot of time (and space), and you may be missing transactional safety: a migration that crashes half-way through leaves you with partially migrated data.
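A migration function of that kind might look roughly like this with pymongo; all names here are made up:

    from pymongo import MongoClient

    db = MongoClient()["mydb"]
    users = db["users"]

    # Backfill a default into old documents and drop an obsolete field,
    # one document at a time; the filter lets a crashed run be resumed.
    for doc in users.find({"status": {"$exists": False}}):
        users.update_one(
            {"_id": doc["_id"]},
            {"$set": {"status": "active"}, "$unset": {"old_flag": ""}},
        )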
Related
Suppose you have some fixed data in Django, for example ten rows and five columns.
Is it better to create a database table for it and read it from the database, or is that not a good idea and is it better to create a dictionary and read the data from the dictionary?
In terms of speed, logic, and so on.
If the database is not a good choice, should I write the data as a dictionary inside a Django view, or in a text file, or in an Excel file?
Whichever method is better, please explain why.
It depends upon the application, but if there is doubt, create a model for it and put it in the database. Here's why I say that:
If your data needs to be changed, or if you want to view it, you can easily do so in the Django Admin app.
If your applications contains models which relate to this data, you can use a foreign key to reference it, rather than replicating it or using references that aren't enforced by the database.
It makes it much easier to do queries on your whole database if everything is in the database. For example, let's say that you have a table of "houses" and each house has a "color".. but you've stored the list of color names in a dictionary outside the database. Now you want a list of houses that are "Bright Blue". First you have to look in your dictionary to find the id of the color "Bright Blue", then you have to do your database lookup using the id you found. It takes something that would normally be a very simple one-line query in Django and makes it much harder.
By the same logic, if you wanted a list of houses along with their color, this would be a very simple query if done entirely in the database but is extra work if you keep some data elsewhere.
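For instance, with the colors in the database the lookups stay one-liners; the models here are just a sketch of that house/color example:

    from django.db import models

    class Color(models.Model):
        name = models.CharField(max_length=50)

    class House(models.Model):
        color = models.ForeignKey(Color, on_delete=models.CASCADE)

    # In a view or the shell:
    # one line instead of a dictionary lookup followed by a second query.
    bright_blue_houses = House.objects.filter(color__name="Bright Blue")

    # Houses together with their colors, fetched as a single joined query.
    houses_with_colors = House.objects.select_related("color")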
I have financial statement data on thousands of different companies. Some of the companies have data only for 2019, but for some I have a decade of data. Each company's financial statement has its own table, structured as follows (column headers in the first row):
    lineitem    2019    2018    2017
    2           1000     800     600
    3206         700     300    -200
    56            50     100     100
    200         1200      90     700
This structure is preferred over a flatter structure like lineitem-year-amount, since one query gives me the correct layout of the output for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records; 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table which holds the company ID, company name, and table name. I am able to get the data into the database and run queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and not be very readable.
I like the potential of using an ORM such as Django's or SQLAlchemy's. The ORM in SQLAlchemy seems to want me to know the name of the table I am about to create and how many columns it will have, but I don't know that in advance, since I have a script that parses a data dump in CSV which includes the company ID and financial statement data for however many years the company has operated. Also, one year later I will have to update each table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try them out much in practice because of this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow for a solution, but have not found any solved questions (which is really surprising, since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy given the structure I plan to have it fit into? How can I have the selected table(s) (based on company ID or company name) be object(s) in the ORM just like any other object, allowing me to select the data I want at the granularity level I want?
Ideally there is a solution to this in Django, but since I haven't found anything I suspect there is not or that how I have structured the database is insanity.
You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application, because, for example, at a certain point in the application's lifetime new attributes become required for the data. Not because there is more data, which simply translates to new data rows in one or more tables.
So first you decide on a proper schema for the database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized to third normal form.
You really have to understand this. I haven't read it in full, just skimmed it, but I assume it is correct. This is fundamental database knowledge you cannot escape. After learning it properly, and with practice, it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to give data some massaging before writing the proper data rows into the existing tables.
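Applied to the financial-statement question above, a normalized schema could look roughly like this in Django; the model and field names are assumptions, not a prescription:

    from django.db import models

    class Company(models.Model):
        name = models.CharField(max_length=200)

    class LineItem(models.Model):
        code = models.IntegerField(unique=True)          # e.g. 3206
        description = models.CharField(max_length=200)   # e.g. "Debt to credit institutions"

    class FinancialFact(models.Model):
        company = models.ForeignKey(Company, on_delete=models.CASCADE)
        line_item = models.ForeignKey(LineItem, on_delete=models.CASCADE)
        year = models.IntegerField()
        amount = models.DecimalField(max_digits=18, decimal_places=2)

        class Meta:
            unique_together = [("company", "line_item", "year")]

A new year of data is then just more rows in FinancialFact; the tables themselves never change.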
Can you take form data and change the database schema? Is it a good idea? Is there a downside to many migrations from a 'default' database?
I want users to be able to add / remove tables, columns, and rows. Making schema changes requires migrations, so adding that functionality would require writing a view that takes form data and passes it to a function that then uses Flask-Migrate.
If I manage to build this, don't migrations build the required separate scripts and everything that goes along with that each time something is added or removed? Is that practical for something like this, where 10 or 20 new tables might be added to the starting database?
If I allow users to add columns to a table, it will have to modify the table's class. Is that possible, or a safe idea? If not, I'd appreciate it if someone could help me out, and at least get me pointed in the right direction.
In a typical web application, the deployed database does not change its schema at runtime. The schema is only changed during an upgrade, and only the developers make these changes. Operations that users perform on the application can add, remove or modify rows, but not modify the tables or columns themselves.
If you need to offer your users a way to add flexible data structures, then you should design your database schema in a way that this is possible. For example, if you wanted your users to add custom key/value pairs, you could have a table with columns user_id, key_name and value. You may also want to investigate if a schema-less database fits your needs better.
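A rough sketch of that key/value idea with Flask-SQLAlchemy (the table, column and foreign-key names are made up, including the assumption of an existing user table):

    from flask_sqlalchemy import SQLAlchemy

    db = SQLAlchemy()

    class UserData(db.Model):
        # Users "add fields" by adding rows here; the schema itself never changes.
        id = db.Column(db.Integer, primary_key=True)
        user_id = db.Column(db.Integer, db.ForeignKey("user.id"), nullable=False)
        key_name = db.Column(db.String(100), nullable=False)
        value = db.Column(db.Text)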
I'm designing a bulk import tool from an old system into a new Django based system.
I'd like to retain all of the current IDs of the objects (they are just 5-digit strings). Due to the design of the current system, there are lots of references between these objects.
To import, I can see two possible techniques: import a known object and carefully recurse through its relationships, making sure to import things in the right order and only set relationships once I know they exist
... or start at item 00001, set the foreign keys to values I know will exist, and just grab everything in order, knowing that once we get to item 99999 all the relationships will exist.
So is there a way to set a foreign key to the ID of an item that doesn't exist yet but will, even if only for imports?
To add further complexity, not all of these relationships are straightforward foreignkeys, some are ManyToMany relationships as well.
To be able to handle any database that Django supports and not have to deal with peculiarities of the backend, I'd export the old database in the format that Django loaddata can read, and then give this exported file to loaddata. This command has no issue importing the type of structure you are talking about.
Creating the file that loaddata will read could be done by writing your own converter that reads the old database and dumps an appropriate file. However, a way which might be easier would be to create a throwaway Django project with models that have the same structure as the tables in the old database, point the Django project to the old database, and use dumpdata to create the file. If table details between the old database and the new database have changed, you'd still have to modify the file but at least some of the conversion work would have already been done.
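The fixture loaddata reads is just a serialized list of objects keyed by primary key; assuming the 5-digit string IDs are the primary keys, a converter could emit something along these lines (app, model and field names are made up):

    import json

    # Hypothetical rows exported from the old system. "parent" points at an item
    # that only appears later in the file; loaddata copes with that because it
    # loads the whole fixture with constraint checks relaxed.
    fixture = [
        {"model": "catalog.item", "pk": "00001",
         "fields": {"name": "First item", "parent": "00002"}},
        {"model": "catalog.item", "pk": "00002",
         "fields": {"name": "Second item", "parent": None}},
    ]

    with open("old_data.json", "w") as f:
        json.dump(fixture, f, indent=2)

After that, python manage.py loaddata old_data.json pulls it into the new database.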
A more direct way would be to bypass Django completely to do the import in SQL but turn off foreign key constraints for the time of the import. For MySQL this would be done by setting foreign_key_checks to 0 for the time of the import, and then back to 1 when done. For SQLite this would be done by using PRAGMA foreign_keys = OFF; and then ON when done.
Postgresql does not allow just turning off these constraints but Django creates foreign key constraints as DEFERRABLE INITIALLY DEFERRED, which means that the constraint is not checked until the end of a transaction. So initiating a transaction, importing and then committing should work. If something prevents this, then you have to drop the constraint before importing and add it back afterwards.
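If you go that raw-SQL route from within Django, a sketch of the per-backend handling could look like this; run_import and its statements argument are hypothetical helpers, not part of any API:

    from django.db import connection, transaction

    def run_import(statements):
        # "statements" is a hypothetical list of raw INSERTs from the old system.
        if connection.vendor == "mysql":
            with connection.cursor() as cursor:
                cursor.execute("SET foreign_key_checks = 0;")
                for sql in statements:
                    cursor.execute(sql)
                cursor.execute("SET foreign_key_checks = 1;")
        elif connection.vendor == "postgresql":
            # Django's DEFERRABLE INITIALLY DEFERRED constraints are only
            # checked at commit time, so one enclosing transaction is enough.
            with transaction.atomic(), connection.cursor() as cursor:
                for sql in statements:
                    cursor.execute(sql)
        # SQLite would use PRAGMA foreign_keys = OFF / ON, noting that the
        # pragma only takes effect outside a transaction.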
Sounds like you need a database migration tool like South, the standard for Django. Worth noting that Django 1.7 Beta 1 was released recently, and it provides built-in migrations.
I apologize if this has been asked already, or if this is answered somewhere else.
Anyways, I'm working on a project that, in short, stores image metadata and then allows the user to search said metadata (which resembles a long list of key-value pairs). This wouldn't be too big of an issue if the metadata were standardized. However, the problem is that for any given image in the database, there can be any number of key/value pairs in its metadata, and there is no standard list of what the keys are.
Basically, I need to find a way to store a dictionary for each model, but with arbitrary key/value pairs. And I need to be able to query them. And the organization I'm working for is planning on uploading thousands of images to this program, so it has to query reasonably fast.
I have one model in my database, an image model, with a filefield.
So, I'm in between two options, and I could really use some help from people with more experience on choosing the best one (or any other solutions that would work better)
1. Using a traditional relational database like MySQL, and creating a separate model with a foreign key to the image model, a key field, and a value field. Then, when I need to query the data, I'll ask for every instance of this separate table that relates to an image, and then query those rows for the key/value combination I need.
2. Using something like MongoDB, with django-toolbox and its DictField, to store the metadata. Then, when I need to query, I'll access the dict and search it for the key/value combination I need.
While I feel like 1 would be much better in terms of query time, each image may have up to 40 key/values of metadata, and that makes me worry about that separate "dictionary" table growing far too large if there's thousands of images.
Any advice would be much appreciated. Thanks!
What's the type of the metadata? Are both key and value strings? I'll assume that's the case.
The scale of your dataset matters. If you will have up to thousands of images and each image has up to 40 key-value pairs, then in option 1 the separate table would have at most about 400k records. That's no problem for a modern database, as long as the machine and the DB settings are reasonable. One thing to take care of is a composite index on the fields of that table. In the Django ORM, it would be something like:
    from django.db import models

    class ImageMeta(models.Model):
        image = models.ForeignKey('Image')
        key = models.CharField(max_length=XXXX)
        value = models.CharField(max_length=XXXX)

        class Meta:
            index_together = [["image", "key", "value"]]  # Django 1.5 and above
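Querying then goes through the reverse relation; keeping both conditions in a single filter() call makes them match the same metadata row. The key and value here are made up:

    # Assuming the Image model from the question plus the ImageMeta model above:
    matches = Image.objects.filter(
        imagemeta__key="camera",
        imagemeta__value="Canon EOS 5D",
    )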
In a Django project you've got 4 alternatives for this kind of problem, in no particular order:
using PostgreSQL, you can use the hstore field type, which is essentially a string key/value dictionary stored in a column. It's not very helpful in terms of querying it, but it does its job of saving your data.
using Django-NoRel with mongodb you get the ListField field type that does the same thing and can be queried just like anything in mongo. (option 2)
using Django-eav to create an entity attribute value store with your data. Elegant solution but painfully slow queries. (option 1)
storing your data as a JSON string in a long enough TextField and creating your own functions for serializing and deserializing the data, without expecting to be able to query it (a minimal sketch of this follows below).
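As a minimal sketch of that last option (field names are made up, and as said, you give up querying on the keys):

    import json

    from django.db import models

    class Image(models.Model):
        file = models.FileField(upload_to="images/")      # the existing filefield
        metadata_json = models.TextField(default="{}")    # the serialized dict

        @property
        def metadata(self):
            return json.loads(self.metadata_json)

        @metadata.setter
        def metadata(self, value):
            self.metadata_json = json.dumps(value)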
In my own experience, if you by any chance need to query over the data, your option two is by far the best choice. EAV in Django, without composite keys, is painful.