I'm looking for a way to make a relationship use only some rows of a table, instead of the whole table.
I was thinking about using a view instead of the original table as the base for the relationship, where the view contains only a filtered set of rows from the original table.
Example
I want to collect statistical information for songs (rating, play-count, last time played) from a couple of players over different devices and from different users. I use Python and SQLAlchemy.
I came up with the following table layout (you can take a look at the code for the objects and the SQL for the tables):
The problem
For each Song object I can easily get all the associated Stats objects. But most of the time I only want some of them. I only want those Stats for a Song that relate to one or more Commits, e.g. only the data from all commits by a given user, or from one particular commit.
So I need some way to filter the stats data on the song object.
I'm not quite sure how to achieve this.
Should I use some custom query? But where would I place it, and how? Also: while this might give me the data I want (select path, rating... over the three joined tables or something along those lines), I wouldn't get objects back.
I was thinking about using a view on the stats table, containing only the lines matching the given commits. The view would have to be created dynamically though, so that different commits can be filtered. And then use this view as the base for a relationship from songs to stats. But I have no idea how to do it.
So: Any ideas on how to solve this problem?
Or how to solve this another way?
You could try something like this:
from sqlalchemy.orm import object_session
# defined inside of your Song class
def stats_by_commit(self, commit):
    # this could also be implemented as a join
    return object_session(self).query(Stat)\
        .filter(Stat.song_id == self.song_id, Stat.commit_id == commit.commit_id)
Usage:
commit = db_session.query(Commit).filter_by(id=1).one()
for song in db_session.query(Song).filter_by(path='some_song_path'):
    for stat in song.stats_by_commit(commit):
        print(stat.rating)
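If you want the stats for all commits by a given user rather than for a single commit, the same idea can be written as a join. This is only a sketch and assumes your Commit table has a user_id column; adjust the names to your actual schema:

# sketch of the join variant mentioned in the comment above; assumes
# Commit has a user_id column (an assumption, not from the original schema)
def stats_by_user(self, user_id):
    return object_session(self).query(Stat)\
        .join(Commit, Stat.commit_id == Commit.commit_id)\
        .filter(Stat.song_id == self.song_id, Commit.user_id == user_id)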
Suppose you have some fixed data in Django, for example ten rows and five columns.
Is it better to create a database table for it and read the data from the database, or is that not a good idea and it is better to create a dictionary and read the data from the dictionary?
In terms of speed and logic and ...
If the database is not a good choice, should I write the data as a dictionary in a Django view, or in a text file, or in an Excel file?
Whichever method is better, please explain why.
It depends upon the application, but if there is doubt, create a model for it and put it in the database. And here's why I say that:
If your data needs to be changed, or if you want to view it, you can easily do so in the Django Admin app.
If your application contains models which relate to this data, you can use a foreign key to reference it, rather than replicating it or using references that aren't enforced by the database.
It makes it much easier to do queries on your whole database if everything is in the database. For example, let's say that you have a table of "houses" and each house has a "color".. but you've stored the list of color names in a dictionary outside the database. Now you want a list of houses that are "Bright Blue". First you have to look in your dictionary to find the id of the color "Bright Blue", then you have to do your database lookup using the id you found. It takes something that would normally be a very simple one-line query in Django and makes it much harder.
By the same logic, if you wanted a list of houses along with their color, this would be a very simple query if done entirely in the database but is extra work if you keep some data elsewhere.
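For example, with the houses and colors above stored as related models, both lookups become single ORM queries. This is just an illustrative sketch with hypothetical House and Color models:

# hypothetical models for the houses/colors example above
from django.db import models

class Color(models.Model):
    name = models.CharField(max_length=50)

class House(models.Model):
    color = models.ForeignKey(Color, on_delete=models.CASCADE)

# with everything in the database, finding "Bright Blue" houses is one line:
bright_blue_houses = House.objects.filter(color__name="Bright Blue")

# and listing houses together with their color is a single joined query:
houses_with_colors = House.objects.select_related("color")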
I have financial statement data on thousands of different companies. Some of the companies have data only for 2019, but for some I have decade-long data. Each company's financial statement has its own table, structured as follows:
lineitem   2019   2018   2017
2          1000    800    600
3206        700    300   -200
56           50    100    100
200        1200     90    700
This structure is preferred over a flatter structure like lineitem-year-amount, since one query gives me the correct structure of the output for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records; 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table which holds the company ID, company name, and table name. I am able to get the data into the database and make queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and not be very readable. I like the potential of using the ORM in Django or SQLAlchemy. The ORM in SQLAlchemy seems to want me to know the name of the table I am about to create and how many columns to create, but I don't know that, since I have a script that parses a data dump in CSV which includes the company ID and financial statement data for however many years the company has operated. Also, one year later I will have to update the table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try it out much in practice due to this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow for a solution, but not found any solved questions (which is really surprising since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy given the structure I plan to have it fit into? How can I have the selected table(s) (based on company ID or company name) be objects in the ORM just like any other object, allowing me to select the data I want at the granularity I want?
Ideally there is a solution to this in Django, but since I haven't found anything I suspect there is not or that how I have structured the database is insanity.
You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application; for example, because at a certain point in the application's lifetime new attributes become required for the data. Not because there is more data, which simply translates to new data rows in one or more tables.
So first you decide about a proper schema for database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized to third normal form.
You really have to understand this. I haven't read it closely, just skimmed it, but I assume it is correct. This is fundamental database knowledge you cannot escape. After learning it properly, and with practice, it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to give data some massaging before writing the proper data rows into the existing tables.
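As a rough illustration of what a normalized schema for this data could look like, here is a sketch using Django models; the names (Company, LineItem, FinancialFact) are assumptions for illustration, not prescribed by the question:

from django.db import models

class Company(models.Model):
    name = models.CharField(max_length=200)

class LineItem(models.Model):
    # the mapping table with over 10,000 records
    code = models.IntegerField(unique=True)          # e.g. 3206
    description = models.CharField(max_length=200)   # e.g. "Debt to credit institutions"

class FinancialFact(models.Model):
    company = models.ForeignKey(Company, on_delete=models.CASCADE)
    line_item = models.ForeignKey(LineItem, on_delete=models.CASCADE)
    year = models.IntegerField()
    amount = models.DecimalField(max_digits=18, decimal_places=2)

    class Meta:
        unique_together = ("company", "line_item", "year")

With a layout like this, an additional year of data is just more rows in FinancialFact, and the wide year-per-column view from the question becomes the result of a query (a pivot) rather than part of the schema.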
I have a Django app that has a model (Person) defined, and I also have some DBs (each containing a table Appointment) that do not have any models defined (they are not meant to be connected to the Django app).
I need to move some data from the Appointment table over to Person, such that the Person table mirrors all the information in the Appointment table. It is this way because there are multiple independent DBs like Appointment that need to be copied into the Person table (so I do not want to make any architectural changes to how this is set up).
Here is what I do now:
res = sourcedb.fetchall() # from Appointment Table
for myrecord in res:
    try:
        existingrecord = Person.objects.filter(vendorid = myrecord[12], office = myoffice)[0]
    except:
        existingrecord = Person(vendorid = myrecord[12], office = myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approach:
1. bulk_create: Cannot use this because I have to update sometimes.
2. delete all and then bulk_create: There is a dependency on the Person model from other things, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: cannot do this because the Person table's PK is different from the Appointment table's PK (primary key). The Appointment PK is copied into the Person table. If there were a way to check on two duplicate keys, I think this approach would work.
A few ideas:
EDIT: See Trewq's comment on this and create indexes on your tables first of all…
Wrap it all in a transaction using with transaction.atomic():, as by default Django will create a new transaction per save() call which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (should not be one with 20k records), fetch all data first from the appointment table and then fetch all existing Person objects using a single SELECT query instead of one per record
Use bulk_create even if some of them are updates. This will still issue UPDATE queries for your updates, but will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); save the inserts into a Python list for a later bulk_create instead of saving them directly (see the sketch after this list).
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this; a UNIQUE key would suffice. Keys can span multiple columns, see Django's Meta.unique_together model option.
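Put together, ideas 1–3 could look roughly like this. It is only a sketch based on the question's code; myoffice, vendorid as the matching key and the myrecord[...] indexes are assumptions carried over from it:

from django.db import transaction

res = sourcedb.fetchall()  # from the Appointment table

# one SELECT for all existing Person rows, keyed by vendorid
existing = {p.vendorid: p for p in Person.objects.filter(office=myoffice)}

to_create = []
with transaction.atomic():  # one transaction instead of one per save()
    for myrecord in res:
        person = existing.get(myrecord[12])
        if person is None:
            person = Person(vendorid=myrecord[12], office=myoffice)
            to_create.append(person)  # collect new rows for a later bulk_create
        person.firstname = myrecord[0]
        person.midname = myrecord[1]
        person.lastname = myrecord[2]
        person.address1 = myrecord[3]
        person.address2 = myrecord[4]
        if person.pk is not None:  # existing row: still one UPDATE per record
            person.save()
    Person.objects.bulk_create(to_create)  # new rows inserted in one or a few queries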
I have an SQLAlchemy ORM class, linked to MySQL, which works great at saving the data I need down to the underlying table. However, I would like to also save the identical data to a second archive table.
Here's some pseudocode to try and explain what I mean:
my_data = Data() #An ORM Class
my_data.name = "foo"
#This saves just to the 'data' table
session.add(my_data)
#This will save it to the identical 'backup_data' table
my_data_archive = my_data
my_data_archive.__tablename__ = 'backup_data'
session.add(my_data_archive)
#And commits them both
session.commit()
Just a heads up, I am not interested in mapping a class to a JOIN, as in: http://www.sqlalchemy.org/docs/05/mappers.html#mapping-a-class-against-multiple-tables
I list some options below. I would go for the DB trigger if you do not need to work on those objects in your model.
use a database trigger to do this job for you
create a SessionExtension which will create copy-objects and add them to the session (usually on before_flush); see the sketch after this list. Edit-1: You can take the versioning example from SA as a basis; that code does even more than you need.
see SA Versioning example which will not only give you a copy of the object, but the whole version history, which might be what you wish for
see Reverse mapping from a table to a model in SQLAlchemy question, where the proposed solution is described in the blogpost.
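For the SessionExtension idea, here is a minimal sketch using SQLAlchemy's event API (the modern replacement for SessionExtension hooks). Data is the question's model; DataArchive is an assumed identical model mapped to the backup_data table:

from sqlalchemy import event
from sqlalchemy.orm import Session

@event.listens_for(Session, "before_flush")
def copy_new_data_to_archive(session, flush_context, instances):
    # for every new Data object about to be flushed, add an archive copy
    for obj in list(session.new):
        if isinstance(obj, Data):
            session.add(DataArchive(name=obj.name))  # copy the relevant columns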
Create 2 identical models: one mapped to main table and another mapped to archive table. Create a MapperExtension with redefined method after_insert() (depending on your demands you might also need after_update() and after_delete()). This method should copy data from main model to archive and add it to the session. There are some tricks to copy all columns and many-to-many relations automagically.
Note, that you'll have to flush() session twice to store both objects since unit of work is computed before mapper extension adds new object to the session. You can redefine Session.flush() to take care of this problem. Also auto-incremented fields are assigned when the object is flushed, so you'll have to delay copying if you need them too.
This is one possible scenario that has been proven to work. I'd like to know if there is a better way.
I am trying to design a tagging system with a model like this:
Tag:
    content = CharField
    creator = ForeignKey
    used = IntegerField
It is a many-to-many relationship between tags and what's been tagged.
Every time I insert a record into the association table,
Tag.used is incremented by one, and decremented by one in case of deletion.
Tag.used is maintained because I want to speed up answering the question 'How many times has this tag been used?'.
However, this obviously slows insertion down.
Please tell me how to improve this design.
Thanks in advance.
http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html
If your database supports materialized indexed views then you might want to create one for this. You can get a large performance boost for frequently run queries that aggregate data, which I think you have here.
Your view would be based on a query like:
SELECT TagID, COUNT(*)
FROM YourTable
GROUP BY TagID
The aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.
I don't think it's a good idea to denormalize your data like that.
I think a more elegant solution is to use Django aggregation to track how many times the tag has been used: http://docs.djangoproject.com/en/dev/topics/db/aggregation/
You could attach the used count to your tag object by calling something like this:
from django.db.models import Count

my_tag = Tag.objects.annotate(used=Count('post'))[0]
and then accessing it like this:
my_tag.used
assuming that you have a Post model class that has a ManyToMany field to your Tag class
You can order the Tags by the named annotated field if needed:
Tag.objects.annotate(used=Count('post')).order_by('-used')