I have various models of which I would like to keep track and collect statistical data.
The problem is how to store the changes throughout time.
I thought of various alternatives:
Storing a log in a TextField, opening and updating it every time the model is saved.
Alternatively pickle a list and store it in a TextField.
Save logs on hard drive.
What are your suggestions?
Don't reinvent the wheel.. Use django-reversion for logging changes.
I'd break statistics off into a separate model though.
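For illustration, wiring a model up to django-reversion looks roughly like this -- a minimal, untested sketch with an invented Experiment model; check the django-reversion docs for the full setup (INSTALLED_APPS, migrations, etc.):

    # models.py -- register the model so its saves can be versioned
    import reversion
    from django.db import models

    @reversion.register()
    class Experiment(models.Model):      # hypothetical model
        name = models.CharField(max_length=100)
        score = models.FloatField(default=0)

    # elsewhere -- wrap changes in a revision so they are recorded
    def update_score(experiment, new_score, user=None):
        with reversion.create_revision():
            experiment.score = new_score
            experiment.save()
            if user is not None:
                reversion.set_user(user)
            reversion.set_comment("score updated")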
Quoth my elementary chemistry teacher: "If you don't write it down, it didn't happen", therefore save logs in a file.
Since the log information is disjoint from your application data (it's meta-data, actually), keep them separate. You could log to a database table but it should be distinct from your model.
Text pickle data is difficult for humans to read, binary pickle data even more so; log in an easily parsed format and the data can be imported into analysis software easily.
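For example, a rough sketch of that idea using Django's post_save signal and the standard logging module -- the Measurement model, logger name, and file path are all invented, so adapt them to your project:

    import json
    import logging

    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from myapp.models import Measurement   # hypothetical model

    logger = logging.getLogger("model_changes")
    handler = logging.FileHandler("model_changes.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    @receiver(post_save, sender=Measurement)
    def log_change(sender, instance, created, **kwargs):
        # One JSON object per line: trivial to parse later for statistics.
        logger.info(json.dumps({
            "model": sender.__name__,
            "pk": instance.pk,
            "created": created,
        }))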
I've had a similar situation in which we were supposed to keep a history of changes. We also needed an audit trail to track who made each change, and the ability to revert. In our case storing the history in the database seemed more logical. However, considering that you have statistical data and it's going to be large, a separate file-based approach might be better for you.
In any case, you should use a generic mechanism to log the changes on your models rather than coding each model individually.
Take a look at this: http://www.djangosnippets.org/snippets/1052/
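The core of such a generic mechanism is just a signal handler connected without a sender -- this is not the linked snippet, only a bare sketch of the idea:

    from django.db.models.signals import post_save

    def log_model_change(sender, instance, created, **kwargs):
        # Runs for every model in the project because no sender is given.
        # Replace the print with whatever storage you settle on.
        print("%s pk=%s %s" % (sender.__name__, instance.pk,
                               "created" if created else "updated"))

    post_save.connect(log_model_change)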
I have a model with hundreds of properties. The properties can be of different types (integer, strings, uploaded files, ...). I would like to implement this complex model step by step, starting with the most important properties. I can think of two options:
Define the properties as regular model fields
Define a separate model to hold each property separately, and link it to the main model with a ForeignKey
I have not found any suggestions on how to handle models with lots of properties with django. What are the advantages / drawbacks of both approaches?
You definitely should not define your properties as ForeignKeys. Every time you need a full model, your database server will have to make hundreds of JOINs, therefore ruining your performance.
If your properties are needed almost every time you access the model, you should keep them in the same model. If not, you could make a separate Properties model and link it to your original model via OneToOneField.
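A minimal sketch of that split, with invented model and field names:

    from django.db import models

    class Product(models.Model):           # hypothetical main model
        name = models.CharField(max_length=100)
        # ...the handful of properties you need on almost every access

    class ProductDetails(models.Model):    # rarely-needed properties live here
        product = models.OneToOneField(Product, on_delete=models.CASCADE,
                                       related_name="details")
        spec_sheet = models.FileField(upload_to="specs/", blank=True)
        weight_grams = models.IntegerField(null=True, blank=True)

Access the extra data only when you need it (product.details.weight_grams), or pull it in one query with select_related("details").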
I personally had such an experience. We had to build a hotel recommendation engine, and we were using Drupal back then. As Drupal stores every custom property in a separate MySQL table, we quickly realised we should switch frameworks, because every single query crashed our production servers (20+ JOINs are a deadly thing for MySQL). BTW, we ended up using a custom solution based on ElasticSearch, which handles hundreds of fields just fine.
Update: If you're lucky enough to be using a recent version of PostgreSQL, you could leverage the JSONField storage to pack all your fields to a single model field. Note, though, that you'll have to implement a validation scheme by yourself.
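A sketch of what that looks like on Django 3.1+ (on older versions the field lived in django.contrib.postgres.fields); the Product model and keys are invented:

    from django.db import models

    class Product(models.Model):
        name = models.CharField(max_length=100)
        attributes = models.JSONField(default=dict, blank=True)

    # Key lookups still work in queries, e.g.:
    # Product.objects.filter(attributes__color="red")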
First off, if hundreds of properties really are a hard customer requirement, I feel your pain and wish you the best! That said, if it isn't truly required, you should first look at changing the design: there should rarely be a need for hundreds of properties on a single object, and it usually points to a need for an array, inheritance, or separate classes, etc.
Going forward, you're going to need to make heavy use of values and values_list to return only the properties you actually need from the database, since performance will otherwise be severely crippled (a sketch appears at the end of this answer).
Since you can't do anything about the model itself, you should try to address your performance issues from the design side of things. The single responsibility principle should feature heavily in your website, which means you'll only ever need a few values returned from the model at a time. That way it really won't make much difference which option you choose, since what is returned will be very limited.
Filter where you can, and use ordering sparingly.
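A hedged sketch of the values / values_list idea (model and field names are invented):

    # Pull back only the columns a given view actually needs,
    # instead of instantiating full objects with hundreds of fields.
    rows = Product.objects.filter(active=True).values("id", "name", "price")
    names = Product.objects.values_list("name", flat=True)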
You could group them into a few separate models, linked by OneToOneFields to the main model. That would "namespace" your data, and namespaces are "one honking great idea".
I set fixtures in my Django project to populate my database. This works well but has a serious limit: you can't create lots of stuff.
In theory, you can put in as many elements as you want, but since you need to write them one by one, it's impractical to have 20 000 items in your db.
I need a tool that would fill in the primary keys itself and generate random typed data to fill the fixtures (e.g. emails, integers in a range, dates in a range, phone numbers). Another nice feature would be the ability to set functional rules for the data generation.
Does anyone know of a way (a library, ...) to do this in a Django project?
I took a look at https://github.com/joke2k/faker - the tool itself seems good, but no integration with Django.
Otherwise, I guess I could write it myself using Faker (since writing a fixture file just consists of JSON generation; a rough sketch of that idea is below), but I don't like to reinvent the wheel :)
Thanks.
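Edit: a rough, untested sketch of the do-it-yourself Faker idea mentioned above -- app label, model name, and fields are all made up:

    import json
    from faker import Faker

    fake = Faker()

    fixture = []
    for pk in range(1, 20001):
        fixture.append({
            "model": "myapp.customer",   # hypothetical app/model label
            "pk": pk,
            "fields": {
                "email": fake.email(),
                "phone": fake.phone_number(),
                "joined": fake.date_between(start_date="-5y", end_date="today").isoformat(),
            },
        })

    with open("myapp/fixtures/customers.json", "w") as f:
        json.dump(fixture, f, indent=2)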
Factory Boy: https://factoryboy.readthedocs.org
It's a fixtures replacement that works really well for unit testing or otherwise making fixture data. You can write classes that hook into your models and generate populated model instances and you can construct them to save to the database, or not.
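A small sketch of what a factory can look like -- the Customer model and its fields are invented:

    import factory
    from myapp.models import Customer    # hypothetical model

    class CustomerFactory(factory.django.DjangoModelFactory):
        class Meta:
            model = Customer

        email = factory.Faker("email")
        phone = factory.Faker("phone_number")
        joined = factory.Faker("date_between", start_date="-5y", end_date="today")

    # build() gives unsaved instances, create()/create_batch() hit the database:
    customers = CustomerFactory.create_batch(1000)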
I have data that is best represented by a tree. Serializing the structure makes the most sense, because I don't want to sort it every time, and it would allow me to make persistent modifications to the data.
On the other hand, this tree is going to be accessed from different processes on different machines, so I'm worried about the details of reading and writing. Basic searches didn't yield very much on the topic.
If two users simultaneously attempt to revive the tree and read from it, can they both be served at once, or does one arbitrarily happen first?
If two users have the tree open (assuming they can) and one makes an edit, does the other see the change implemented? (I assume they don't because they each received what amounts to a copy of the original data.)
If two users alter the object and close it at the same time, again, does one come first, or is an attempt made to make both changes simultaneously?
I was thinking of making a queue of changes to be applied to the tree, and then having the tree execute them in the order of submission. I thought I would ask what my problems are before trying to solve any of them.
Without trying it out I'm fairly sure the answer is:
They can both be served at once, however, if one user is reading while the other is writing the reading user may get strange results.
Probably not. Once the tree has been read from the file into memory, the second user will not see edits made by the first. If the second user hasn't read the tree from the file yet, they will pick up the change when they do.
Both changes will be made simultaneously and the file will likely be corrupted.
Also, you mentioned shelve. From the shelve documentation:
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
Personally, at this point, you may want to look into using a simple key-value store like Redis with some kind of optimistic locking.
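For what it's worth, a sketch of optimistic locking with redis-py's WATCH/MULTI -- the key name and pickled-tree representation are assumptions, not something from your code:

    import pickle
    import redis

    r = redis.Redis()

    def update_tree(key, mutate):
        # Optimistic locking: retry if someone else wrote the key
        # between our read and our write.
        with r.pipeline() as pipe:
            while True:
                try:
                    pipe.watch(key)
                    raw = pipe.get(key)
                    tree = pickle.loads(raw) if raw else {}
                    mutate(tree)                  # apply the caller's edit
                    pipe.multi()
                    pipe.set(key, pickle.dumps(tree))
                    pipe.execute()
                    return tree
                except redis.WatchError:
                    continue                      # lost the race; try again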
You might try klepto, which provides a dictionary interface to a SQL database (using sqlalchemy under the covers). If you choose to persist your data to MySQL, PostgreSQL, or another available database (aside from sqlite), then two or more people -- or two threads/processes -- can access the data simultaneously, and the database will manage the concurrent read-writes. Using klepto with a database backend performs under concurrent access just as well as if you were accessing the database directly.

If you don't want a database backend, klepto can write to disk as well. There is some potential for conflict when writing to disk, even though klepto uses a "copy-on-write, then replace" strategy that minimizes concurrency conflicts when working with files on disk. When working with a file (or directory) backend, your issues 1-2-3 are still handled by that strategy.

Additionally, klepto can use an in-memory caching layer that enables fast access, where loads/dumps from the on-disk (or database) backend are done either on demand or when the in-memory cache reaches a user-determined size.
To be specific: (1) both are served at the same time. (2) if one user makes an edit, the other user sees the change -- however that change may be 'delayed' if the second user is using an in-memory caching layer. (3) multiple simultaneous writes are not a problem, due to klepto letting NFS or the sql database handle the "copy-on-write, then replace" changes.
The dictionary interface for klepto.archives is also available in decorator form, providing LRU caching (and LFU and others), so if you have a function that generates/accesses the data, hooking up the archive is really easy -- you get memoization with an on-disk or database backend.
With klepto, you can pick from several different serialization methods to encode your data. You can have klepto cast data to a string, use a hashing algorithm (like md5), or use a pickler (like json, pickle, or dill).
You can get klepto here: https://github.com/uqfoundation/klepto
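If I remember the interface correctly, basic usage of a file-backed archive looks something like this -- treat it as a sketch and defer to the klepto docs for the exact API:

    from klepto.archives import file_archive

    # A dictionary-like archive backed by a file on disk.
    d = file_archive("tree.pkl", serialized=True, cached=True)
    d["root"] = {"left": {}, "right": {}}
    d.dump()     # flush the in-memory cache to disk

    d2 = file_archive("tree.pkl", serialized=True, cached=True)
    d2.load()    # pull the on-disk contents into the local cache
    print(d2["root"])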
I am writing a reusable django application for returning json result for jquery ui autocomplete.
Currently I am storing the class/function for getting the result in a dictionary, with a unique key for each class/function.
When a request comes in, I select the corresponding class/function from the dict and return its output.
My question is whether the above is best practice, or whether there are other tricks to obtain the same result.
Sample GIST : https://gist.github.com/ajumell/5483685
You seem to be talking about a form of memoization.
This is OK, as long as you don't rely on that result being in the dictionary. This is because the memory will be local to each process, and you can't guarantee subsequent requests being handled by the same process. But if you have a fallback where you generate the result, this is a perfectly good optimization.
That's a very general question. It primarily depends on the infrastructure of your code: the way your classes and models are defined and the dynamics of the application.
Second, it is important to take into account the resources of the server where your application is running: how much memory and how much disk space you have available, so you can judge what would work better for the application.
Last but not least, it's important to consider how much work it takes to put all these resources in memory. Memory is volatile, so if your application restarts you'll have to instantiate all the classes again, and maybe that is too much work.
To sum up, keeping objects that are queried often in memory is a very good optimization (that's what caching is all about), but you have to take all of the above into account.
Storing a series of functions in a dictionary and conditionally selecting one based on the request is a perfectly acceptable way to handle it.
If you would like a more specific answer it would be very helpful to post your actual code. And secondly, this might be better suited to codereview.stackexchange
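To make that concrete, a common shape for such a registry is sketched below -- the City model, view, and parameter names are invented, not taken from the gist:

    from django.http import JsonResponse
    from myapp.models import City          # hypothetical model

    AUTOCOMPLETE_SOURCES = {}

    def register(key):
        def wrapper(func):
            AUTOCOMPLETE_SOURCES[key] = func
            return func
        return wrapper

    @register("cities")
    def city_results(term):
        return City.objects.filter(name__istartswith=term).values_list("name", flat=True)

    def autocomplete(request):
        source = AUTOCOMPLETE_SOURCES.get(request.GET.get("source"))
        if source is None:
            return JsonResponse([], safe=False)   # unknown key: empty result
        return JsonResponse(list(source(request.GET.get("term", ""))), safe=False)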
I am building an application to distribute to fellow academics. The application will take three parameters that the user submits and output a list of dates and codes related to those events. I have been building this using a dictionary and intended to build the application so that the dictionary loaded from a pickle file when the application called for it. The parameters supplied by the user will be used to lookup the needed output.
I selected this structure because I have gotten pretty comfortable with dictionaries and pickle files and I see this going out the door with the smallest learning curve on my part. There might be as many as two million keys in the dictionary. I have been satisfied with the performance on my machine with a reasonable subset. I have already thought through how to break the dictionary apart if I have any performance concerns when the whole thing is put together. I am not really that worried about the amount of disk space on their machines, as we are working with terabyte storage values.
Having said all of that I have been poking around in the docs and am wondering if I need to invest some time to learn and implement an alternative data storage file. The only reason I can think of is if there is an alternative that could increase the lookup speed by a factor of three to five or more.
The standard shelve module will give you a persistent dictionary that is stored in a dbm-style database. Provided that your keys are strings and your values are picklable (since you're using pickle already, this must be true), this could be a better solution than simply storing the entire dictionary in a single pickle.
Example:
>>> import shelve
>>> d = shelve.open('mydb')
>>> d['key1'] = 12345
>>> d['key2'] = ['any', 'picklable', 'value']
>>> print(d['key1'])
12345
>>> d.close()
I'd also recommend Durus, but that requires some extra learning on your part. It'll let you create a PersistentDictionary. From memory, keys can be any pickleable object.
To get fast lookups, use the standard Python dbm module (see http://docs.python.org/library/dbm.html) to build your database file and do lookups in it. The dbm file format may not be cross-platform, so you may want to distribute your data in pickle, repr, JSON, YAML, or XML format and build the dbm database when the user first runs your program.
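With Python 3's dbm module that could look roughly like this (key format and record contents are invented):

    import dbm
    import pickle

    # Build the database file once, before distribution or on first run:
    db = dbm.open("events", "c")              # "c" = create if missing
    db["2015-06-01"] = pickle.dumps({"code": "A17", "note": "example record"})
    db.close()

    # Lookups later are a single key access, no full-dictionary load:
    db = dbm.open("events", "r")
    record = pickle.loads(db["2015-06-01"])
    db.close()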
How much memory can your application reasonably use? Is this going to be running on each user's desktop, or will there just be one deployment somewhere?
A python dictionary in memory can certainly cope with two million keys. You say that you've got a subset of the data; do you have the whole lot? Maybe you should throw the full dataset at it and see whether it copes.
I just tested creating a two million record dictionary; the total memory usage for the process came in at about 200MB. If speed is your primary concern and you've got the RAM to spare, you're probably not going to do better than an in-memory python dictionary.
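If you want to repeat that kind of check on your own machine, something like this sketch will do (watch the process in top/Task Manager for the real memory figure, since sys.getsizeof only counts the dict structure itself):

    import random
    import string
    import sys

    # Build a roughly 2-million-key dictionary of short random strings.
    d = {"".join(random.choices(string.ascii_lowercase, k=12)): i
         for i in range(2_000_000)}

    print(len(d), "keys")
    print(sys.getsizeof(d) / 1024 / 1024, "MB for the dict structure alone")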
See this solution at SourceForge, esp. the "endnotes" documentation:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
Here are three things you can try:
Compress the pickled dictionary with zlib, e.g. zlib.compress(pickle.dumps(d)) (on Python 2, pickle.dumps(d).encode("zlib") does the same thing).
Make your own serializing format (shouldn't be too hard).
Load the data into a sqlite database (a rough sketch is below).
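A sketch of the SQLite option, keeping one row per key so a lookup never loads the whole dictionary (table and key names are invented):

    import pickle
    import sqlite3

    lookup_table = {"2015-06-01": {"code": "A17"}}   # stand-in for your real dict

    conn = sqlite3.connect("events.db")
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")
    conn.executemany(
        "INSERT OR REPLACE INTO kv VALUES (?, ?)",
        ((k, pickle.dumps(v)) for k, v in lookup_table.items()),
    )
    conn.commit()

    row = conn.execute("SELECT value FROM kv WHERE key = ?", ("2015-06-01",)).fetchone()
    value = pickle.loads(row[0]) if row else None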