This is more of an efficiency question. My django web page is working fine, in the sense that I don't get any errors, but it is very slow. That being said, I don't know where else I would ask this, other than here, so here goes:
I am developing a sales dashboard. In doing so, I am accessing the same data over and over and I would like to speed things up.
For example, one of my metrics is number of opportunities won. This accesses my Opportunities model, sorts out the opportunities won within the last X days and reports it.
Another metric is neglected opportunities. That is, opportunities that are still reported as being worked on, but on which there has been no activity for Y days. This metric also accesses my Opportunities model.
I read here that querysets are lazy, which, if I understand this concept correctly, would mean that my actual database is accessed only at the very end. Normally this would be an ideal situation, as all of the filters are in place and the queryset only accesses a minimal amount of information.
Currently, I have a separate function for each metric. So, for the examples above, I have compile_won_opportunities and compile_neglected_opportunities. Each function starts with something like:
won_opportunities_query = Opportunities.objects.all()
and then I filter it down from there. If I am reading the documentation correctly, this means that I am accessing the same database many, many times.
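To make that concrete, one of these functions looks roughly like this (field names like status and close_date are simplified placeholders):

from datetime import timedelta

from django.utils import timezone

from .models import Opportunities  # the model described above


def compile_won_opportunities(days=30):
    # Each metric function builds its own queryset like this, so every
    # metric ends up issuing at least one separate query against the table.
    cutoff = timezone.now() - timedelta(days=days)
    won_opportunities_query = Opportunities.objects.all()
    return won_opportunities_query.filter(
        status="won",
        close_date__gte=cutoff,
    ).count()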
There is a noticeable lag when my web page loads. In an attempt to find out what is causing the lag, I commented out different sections of code. When I comment out the code that accesses my database for each function, my web page loads immediately. My initial thought was to access my database in my calling function:
opportunities_query = Opportunities.objects.all()
and then pass that query to each function that uses it. My rationale was that the database would only be accessed one time, but apparently django doesn't work that way, as it made no obvious difference in my page load time. So, after my very long-winded explanation, how can I speed up my page load time?
https://pypi.org/project/django-debug-toolbar/
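In case it helps, the toolbar setup is only a few lines. The exact settings names can vary between versions of django-debug-toolbar, so treat this as an approximate sketch rather than the canonical instructions:

# settings.py (approximate; check the docs for your version)
INSTALLED_APPS = [
    # ... your apps ...
    "debug_toolbar",
]
MIDDLEWARE = [
    "debug_toolbar.middleware.DebugToolbarMiddleware",
    # ... the rest of your middleware ...
]
INTERNAL_IPS = ["127.0.0.1"]

# urls.py
from django.urls import include, path
urlpatterns = [
    # ... your existing patterns ...
    path("__debug__/", include("debug_toolbar.urls")),
]

Once it's running, the SQL panel shows every query a page issues, which should make the repeated Opportunities queries obvious.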
Btw, go with this one https://docs.djangoproject.com/en/2.2/ref/models/querysets/#select-related
I'm looking for a solution for this intended application.
I maintain a database of performance of a list of companies in our investment portfolio. The database is a collection of the company's daily performance, going back a few months. The format of the table is as follows (simplified version):
ID, Date, Investment_Name, Market_Share, %_change
1, 1/1/2000, SPX, 0.1, 0.05
2, 1/1/2000, INDU, 0.2, -0.01
...
101, 1/2/2000, SPX, 0.1, 0.03
102, 1/2/2000, INDU, 0.2, 0.03
...
The data is currently maintained in Access, and I have queries set up for myself to see things like daily return for the whole portfolio, historical series for selected securities, etc. Basically anything I can think of that I can make a query for.
Now I'm trying to create something for other people to access this data. I want it to be an application (hosted on our local network) that gives people an idea of our company's current performance. The simplest thing I could think of is to make a browser app that queries the database and gives a visualization of the data. The user can have the freedom to select the things they want to look at, in both table and graph form. Maybe make a few modules that display the "top 5" and "bottom 5" in each asset class. A more advanced feature would be making "what if" scenarios, changing security composition and things like that.
My experience in development is quite limited. I know Python to a workable extent, and have used things like pandas and matplotlib for local graphing. I'm wondering if there's a lightweight way to set up a system like this using some other Python modules? Or is there a web framework already available for this type of task? I'm sure this is a common task in many organizations and I'd love to know the easiest way to accomplish this.
Thanks a lot for your time.
There are various open-source BI tools; the major players are:
http://www.pentaho.com/
http://www.actuate.com/products/open-source-bi/
http://community.jaspersoft.com/
It's not just about a tool, though: do you want to keep the data in Access, or do you want to put it in something like MySQL or PostgreSQL? How will you automate the ETL (see also http://www.talend.com), document the definitions, etc.?
Building it yourself is cooler, but also not that easy. You could build a RESTful API on pandas (running on uWSGI/Flask, for instance) and then put an HTML5 library like d3.js (http://d3js.org) on top of it. But that's a really large effort, especially for a one-man team. I would look at Pentaho if I were you ;)
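If you do go the build-it-yourself route, here is a bare-bones sketch of that pandas-plus-Flask idea, assuming the Access data gets exported to a CSV with the columns shown in the question; the file name, route and query parameter are made up:

import pandas as pd
from flask import Flask, Response, request

app = Flask(__name__)
df = pd.read_csv("performance.csv", parse_dates=["Date"])

@app.route("/api/returns")
def returns():
    # e.g. /api/returns?name=SPX returns that security's history as JSON
    name = request.args.get("name")
    rows = df[df["Investment_Name"] == name] if name else df
    return Response(rows.to_json(orient="records"), mimetype="application/json")

if __name__ == "__main__":
    app.run(debug=True)

A d3.js (or even a simpler chart library) front end would then consume /api/returns and render the tables and graphs.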
Probably the easiest way is to load the data into a shared database (using any version of SQL, in your case SQL Server most likely as you're starting in Access) and then slapping something like Spotfire (http://spotfire.tibco.com/) or Tableau (http://www.tableausoftware.com/) on top of that. I've used Tableau and it's pretty slick, you can pre-can some reports as well as allow your users to manipulate the data in real time. It also has a web interface that I've never used.
Both are a bit pricy though so I don't know if that will work for you. Good luck!
Not necessarily exactly what you're wanting, but it may accomplish most of the same goals, depending on security concerns, on-site requirements, etc.
Google has a beta product they've incorporated into their Google Drive product called Fusion Tables. It is basically a simple, streamlined, cloud hosted database engine. It supports SQL based updating via url or manual updating within the interface. Furthermore, it has a number of visualization capabilities built into it. Depending on how secure you need it to be, and/or if you already have google apps for business, that might be an option to look into.
You could just set up a new fusion table (get a Google account, sign in, go to Google Drive, click Create, and there should be a table or fusion table or some such as a create option). You could then create the fields and manually add the data, or just import the existing table you already have. Then edit the sharing properties to give others, as necessary, read-only or read-and-write permissions to the table and the various visualizations you can create from it. There are some charting and graphing capabilities built in, though I haven't played around with that side too much myself, as I've mainly used its integrated mapping capabilities.
Again, not perfect solution for what you were asking for, but thought it might help, so figured I'd share.
I am thinking about creating an open source data management web application for various types of data.
A privileged user must be able to
add new entity types (for example a 'user' or a 'family')
add new properties to entity types (for example 'gender' to 'user')
remove/modify entities and properties
These will be common tasks for the privileged user. He will do this through the web interface of the application. In the end, all data must be searchable and sortable by all types of users of the application. Two questions trouble me:
a) How should the data be stored in the database? Should I dynamically add/remove database tables and/or columns during runtime?
I am no database expert. I am stuck on the idea that, in terms of relational databases, the application has to be able to dynamically add/remove tables (entities) and/or columns (properties) at runtime, and I don't like this idea. Likewise, I am wondering whether such dynamic data should be handled in a NoSQL database.
Anyway, I believe that this kind of problem has an intelligent canonical solution, which I just have not found or thought of so far. What is the best approach for this kind of dynamic data management?
b) How to implement this in Python using an ORM or NoSQL?
If you recommend using a relational database model, then I would like to use SQLAlchemy. However, I don't see how to dynamically create tables/columns with an ORM at runtime. This is one of the reasons why I hope that there is a much better approach than creating tables and columns during runtime. Is the recommended database model efficiently implementable with SQLAlchemy?
If you recommend using a NoSQL database, which one? I like using Redis -- can you imagine an efficient implementation based on Redis?
Thanks for your suggestions!
Edit in response to some comments:
The idea is that all instances ("rows") of a certain entity ("table") share the same set of properties/attributes ("columns"). However, it will be perfectly valid if certain instances have an empty value for certain properties/attributes.
Basically, users will search the data through a simple form on a website. They query for e.g. all instances of an entity E with property P having a value V higher than T. The result can be sorted by the value of any property.
The datasets won't become too large. Hence, I think even the stupidest approach would still lead to a working system. However, I am an enthusiast: I'd like to apply modern and appropriate technology, and I'd like to be aware of theoretical bottlenecks. I want to use this project to gather experience in designing a "Pythonic", state-of-the-art, scalable, and reliable web application.
I see that the first comments tend to recommend a NoSQL approach. Although I really like Redis, it looks like it would be stupid not to take advantage of the document/collection model of Mongo/Couch. I've been looking into MongoDB and mongoengine for Python. By doing so, am I taking steps in the right direction?
Edit 2 in response to some answers/comments:
From most of your answers, I conclude that the dynamic creation/deletion of tables and columns in the relational picture is not the way to go. This already is valuable information. Also, one opinion is that the whole idea of the dynamic modification of entities and properties could be bad design.
As exactly this dynamic nature is meant to be the main purpose/feature of the application, I am not giving up on it. From a theoretical point of view, I accept that performing operations on a dynamic data model must necessarily be slower than performing operations on a static data model. This is totally fine.
Expressed in an abstract way, the application needs to manage
the data layout, i.e. a "dynamic list" of valid entity types and a "dynamic list" of properties for each valid entity type
the data itself
I am looking for an intelligent and efficient way to implement this. From your answers, it looks like NoSQL is the way to go here, which is another important conclusion.
The SQL or NoSQL choice is not your problem. You need to read a little more about database design in general. As you said, you're not a database expert (and you don't need to be), but you absolutely must study the RDBMS paradigm a little more. It's a common mistake for amateur enthusiasts to choose a NoSQL solution; sometimes NoSQL is a good solution, but most of the time it is not.
Take for example MongoDB, which you mentioned (and which is one of the good NoSQL solutions I've tried). Schema-less, right? Err... not exactly. You see, when something is schema-less it means no constraints, no validation, etc. But your application's models/entities can't stand on thin air! Surely there will be some constraints and validation logic which you will implement in your software layer. So I give you MongoKit! I will just quote this tiny bit from the project's description:
MongoKit brings structured schema and validation layer on top of the great pymongo driver
Hmmm... unstructured became structured.
At least we don't have SQL, right? Yeah, we don't. We have a different query language, which is of course inferior to SQL. At least you don't need to resort to map/reduce for basic queries (see CouchDB).
Don't get me wrong, NoSQL (and especially MongoDB) has its purpose, but most of the time these technologies are used for the wrong reasons.
Also, if you care about serious persistence and data integrity, forget about NoSQL solutions. All these technologies are too experimental to keep your serious data. By researching a bit who (except Google/Amazon) uses NoSQL solutions and for what exactly, you will find that almost no one uses them for keeping their important data. They mostly use them for logging, messages and real-time data; basically anything to off-load some burden from their SQL db storage.
Redis, in my opinion, is probably the only project that is going to survive the NoSQL explosion unscathed. Maybe because it doesn't advertise itself as NoSQL but as a key-value store, which is exactly what it is, and a pretty damn good one! They also seem serious about persistence. It is a Swiss Army knife, but not a good solution for replacing your RDBMS entirely.
I am sorry, I said too much :)
So here is my suggestion:
1) Study the RDBMS model a bit.
2) Django is a good framework if most of your project is going to use an RDBMS.
3) PostgreSQL rocks! Also keep in mind that version 9.2 will bring native JSON support. You could dump all your 'dynamic' properties in there and use a secondary storage/engine to perform queries (map/reduce) on said properties. Have your cake and eat it too!
4) For serious search capabilities, consider specialized engines like Solr.
EDIT: 6 April 2013
5) django-ext-hstore gives you access to the PostgreSQL hstore type. It's similar to a Python dictionary and you can perform queries on it, with the limitation that you can't have nested dictionaries as values. Also, the value of a key can only be a string.
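A rough sketch of what an hstore-backed model tends to look like in Django; I'm assuming an API along the lines of the django_hstore package here, so the import path and field classes may differ from what django-ext-hstore actually exposes:

from django.db import models
from django_hstore import hstore

class Entity(models.Model):
    kind = models.CharField(max_length=50)   # e.g. "house"
    data = hstore.DictionaryField()          # dynamic string -> string properties
    objects = hstore.HStoreManager()

# Querying on a dynamic key (values are strings, per the limitation above):
# Entity.objects.filter(kind="house", data__contains={"floors": "5"})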
Have fun
Update in response to OP's comment
0) Consider the application 'contains data' and has already been used for a while
I am not sure if you mean that it contains data in a legacy DBMS, or if you are just trying to say "imagine that the DB is not empty and consider the following points...". In the former case it seems like a migration issue (a completely different question); in the latter, well, OK.
1) Admin deletes entity "family" and all related data
Why should someone completely eliminate an entity (table)? Either your application has to do with families, houses, etc. or it doesn't. Deleting instances (rows) of families is understandable, of course.
2) Admin creates entity "house"
Same as #1. If you introduce a brand new entity into your app, then most probably it will encapsulate semantics and business logic for which new code must be written. This happens to all applications as they evolve through time, and it of course warrants the creation of a new table, or maybe ALTERing an existing one. But this process is not part of the functionality of your application; i.e. it happens rarely, and it is a migration/refactoring issue.
3) Admin adds properties "floors", "age", ..
Why? Don't we know beforehand that a House has floors? That a User has a gender?
Adding and removing, dynamically, this type of attributes is not a feature, but a design flaw. It is part of the analysis/design phase to identify your entities and their respective properties.
4) Privileged user adds some houses.
Yes, he is adding an instance (row) to the existing entity (table) House.
5) User searches for all houses with at least five floors cheaper than $100
A perfectly valid query which can be achieved with either SQL or NoSQL solution.
In django it would be something along those lines:
House.objects.filter(floors__gte=5, price__lt=100)
provided that House has the attributes floors and price. But if you need to do text-based queries, then neither SQL nor NoSQL will be satisfying enough, because you don't want to implement faceting or stemming on your own! You will use one of the already discussed solutions (Solr, ElasticSearch, etc.).
Some more general notes:
The examples you gave about Houses, Users and their properties do not warrant any dynamic schema. Maybe you simplified your example just to make your point, but you talk about adding/removing entities (tables) as if they were rows in a db. Entities are supposed to be a big deal in an application. They define the purpose of your application and its functionality. As such, they can't change every minute.
Also you said:
The idea is that all instances ("rows") of a certain entity ("table") share the same set of properties/attributes ("columns"). However, it will be perfectly valid if certain instances have an empty value for certain properties/attributes.
This seems like a common case where an attribute has null=True.
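i.e. something along these lines (the model and field names are only illustrative):

from django.db import models

class User(models.Model):
    name = models.CharField(max_length=100)
    # A property that may simply be left empty for some instances
    gender = models.CharField(max_length=10, null=True, blank=True)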
And as a final note, I would like to suggest that you try both approaches (SQL and NoSQL), since it doesn't seem like your career depends on this project. It will be a beneficial experience, as you will understand first-hand the cons and pros of each approach, or even how to "blend" these approaches together.
What you're asking about is a common requirement in many systems -- how to extend a core data model to handle user-defined data. That's a popular requirement for packaged software (where it is typically handled one way) and open-source software (where it is handled another way).
The earlier advice to learn more about RDBMS design generally can't hurt. What I will add to that is, don't fall into the trap of re-implementing a relational database in your own application-specific data model! I have seen this done many times, usually in packaged software. Not wanting to expose the core data model (or permission to alter it) to end users, the developer creates a generic data structure and an app interface that allows the end user to define entities, fields etc. but not using the RDBMS facilities. That's usually a mistake because it's hard to be nearly as thorough or bug-free as what a seasoned RDBMS can just do for you, and it can take a lot of time. It's tempting but IMHO not a good idea.
Assuming the data model changes are global (shared by all users once admin has made them), the way I would approach this problem would be to create an app interface to sit between the admin user and the RDBMS, and apply whatever rules you need to apply to the data model changes, but then pass the final changes to the RDBMS. So for example, you may have rules that say entity names need to follow a certain format, new entities are allowed to have foreign keys to existing tables but must always use the DELETE CASCADE rule, fields can only be of certain data types, all fields must have default values etc. You could have a very simple screen asking the user to provide entity name, field names & defaults etc. and then generate the SQL code (inclusive of all your rules) to make these changes to your database.
Some common rules & how you would address them would be things like:
-- if a field is not null and has a default value, and there are already existing records in the table before that field was added by the admin, update the existing records to have the default value while creating the field (multiple steps: add the field allowing null; update all existing records; alter the table to enforce not null with the default); otherwise you wouldn't be able to use a field-level integrity rule
-- new tables must have a distinct naming pattern so you can continue to distinguish your core data model from the user-extended data model, i.e. core and user-defined have different RDBMS owners (dbo. vs. user.) or prefixes (none for core, __ for user-defined) or somesuch.
-- it is OK to add fields to tables that are in the core data model (as long as they tolerate nulls or have a default), and it is OK for admin to delete fields that admin added to core data model tables, but admin cannot delete fields that were defined as part of the core data model.
In other words -- use the power of the RDBMS to define the tables and manage the data, but in order to ensure whatever conventions or rules you need will always be applied, do this by building an app-to-DB admin function, instead of giving the admin user direct DB access.
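To make the "generate the SQL code inclusive of your rules" idea concrete, here is a hedged, PostgreSQL-flavoured sketch of the multi-step not-null-with-default rule from the list above; the table and field names are made up and the quoting is deliberately naive, so treat it as an illustration rather than a safe implementation:

def add_not_null_field(table, field, sql_type, default):
    """Generate the three-step DDL described above (illustration only).

    Real code would validate the identifiers and quote the default properly.
    """
    literal = "'%s'" % default  # naive quoting, fine for a sketch only
    return [
        # 1) add the column allowing NULLs
        "ALTER TABLE %s ADD COLUMN %s %s;" % (table, field, sql_type),
        # 2) backfill existing rows with the default value
        "UPDATE %s SET %s = %s;" % (table, field, literal),
        # 3) enforce the default and NOT NULL going forward
        "ALTER TABLE %s ALTER COLUMN %s SET DEFAULT %s;" % (table, field, literal),
        "ALTER TABLE %s ALTER COLUMN %s SET NOT NULL;" % (table, field),
    ]

# The admin screen collects these values and the app runs the generated
# statements against the database:
for statement in add_not_null_field("user_house", "floors", "integer", 1):
    print(statement)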
If you really wanted to do this via the DB layer only, you could probably achieve the same by creating a bunch of stored procedures and triggers that would implement the same logic (and who knows, maybe you would do that anyway for your app). That's probably more of a question of how comfortable are your admin users working in the DB tier vs. via an intermediary app.
So to answer your questions directly:
(1) Yes, add tables and columns at run time, but think about the rules you will need to have to ensure your app can work even once user-defined data is added, and choose a way to enforce those rules (via app or via DB / stored procs or whatever) when you process the table & field changes.
(2) This issue isn't strongly affected by your choice of SQL vs. NoSQL engine. In every case, you have a core data model and an extended data model. If you can design your app to respond to a dynamic data model (e.g. add new fields to screens when fields are added to a DB table or whatever) then your app will respond nicely to changes in both the core and user-defined data model. That's an interesting challenge but not much affected by choice of DB implementation style.
Good luck!
The persistence engine for your model objects (RDBMS, NoSQL, etc.) may not matter that much. The technology you're looking for is an index to search for and find your objects.
I think you need to find your objects using their schema. So, if the schema is defined dynamically and persisted in a database, you can build dynamic search forms, etc. Some kind of reference from the entity and attributes to the real objects is needed.
Have a look at the Entity-Attribute-Value (EAV) pattern. This can be implemented over SQLAlchemy, using an RDBMS as a means to store your schema and data "vertically" and relate them.
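For instance, a minimal EAV sketch with SQLAlchemy might look like the following; the class and table names are assumptions, and every value is stored as text for simplicity:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship
# (on older SQLAlchemy: from sqlalchemy.ext.declarative import declarative_base)

Base = declarative_base()

class EntityType(Base):            # e.g. "user", "family", "house"
    __tablename__ = "entity_type"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)

class PropertyDef(Base):           # e.g. "gender" for "user"
    __tablename__ = "property_def"
    id = Column(Integer, primary_key=True)
    entity_type_id = Column(ForeignKey("entity_type.id"), nullable=False)
    name = Column(String, nullable=False)

class Entity(Base):                # one row per instance
    __tablename__ = "entity"
    id = Column(Integer, primary_key=True)
    entity_type_id = Column(ForeignKey("entity_type.id"), nullable=False)
    values = relationship("PropertyValue", back_populates="entity")

class PropertyValue(Base):         # the "vertical" value store
    __tablename__ = "property_value"
    id = Column(Integer, primary_key=True)
    entity_id = Column(ForeignKey("entity.id"), nullable=False)
    property_def_id = Column(ForeignKey("property_def.id"), nullable=False)
    value = Column(String)         # stored as text; cast on the way out
    entity = relationship("Entity", back_populates="values")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

Adding a property to an entity type is then just an insert into property_def, and a query like "all instances of E where P > T" becomes a join over these tables with a cast on value.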
You're entering the field of Semantic Web programming; maybe you should read at least the first chapter of this book:
Programming The Semantic Web
It tells the whole story of your problem: from rigid schemas to dynamic schemas, implemented first as a key-value store and later improved to graph persistence over a relational model.
My opinion is that the best implementations of this could be achieved nowadays with graph databases; a very good example of a current implementation is Berkeley DB (some LDAP implementations use Berkeley DB as the underlying engine for exactly this indexing problem).
Once in a graph model, you could do some kind of "inference" on the graph, making the DB appear to have some kind of "intelligence". An example of this is given in the book.
So, if you conceptualize your entities as "documents", then this whole problem maps onto a NoSQL solution pretty well. As commented, you'll need some kind of model layer that sits on top of your document store and performs tasks like validation, and perhaps enforces (or encourages) some kind of schema, because there's no implicit backend requirement that entities in the same collection (the parallel of a table) share a schema.
Allowing privileged users to change your schema concept (as opposed to just adding fields to individual documents - that's easy to support) will pose a little bit of a challenge - you'll have to handle migrating the existing data to match the new schema automatically.
Reading your edits, Mongo supports the kind of searching/ordering you're looking for, and will give you the support for "empty cells" (documents lacking a particular key) that you need.
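A hedged pymongo sketch of exactly that kind of search/sort; the database, collection, and field names are hypothetical:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
houses = client["mydata"]["houses"]

# All houses with at least five floors, sorted by price; documents that
# simply lack the "floors" key are not matched by the $gte comparison.
cursor = houses.find({"floors": {"$gte": 5}}).sort("price", ASCENDING)
for doc in cursor:
    print(doc.get("price"), doc.get("floors"))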
If I were you (and I happen to be working on a similar, but simpler, product at the moment), I'd stick with Mongo and look into a lightweight web framework like Flask to provide the front-end. You'll be on your own to provide the model, but you won't be fighting against a framework's implicit modeling choices.
I am coding a psychology experiment in Python. I need to store user information and scores somewhere, and I need it to work as a web application (and be secure).
Don't know much about this - I'm considering XML databases, BerkeleyDB, SQLite, an OpenOffice spreadsheet, and I'm very interested in the Python "shelve" library.
(most of my info is coming from this thread: http://developers.slashdot.org/story/08/05/20/2150246/FOSS-Flat-File-Database)
DATA: I figure that I'm going to have maximally 1000 users. For each user I've got to store...
Username / Pass
User detail fields (for a simple profile)
User scores on the exercise (two datapoints per trial: a score (correct/incorrect/timeout) and an associated number from 0.1 to 1.0 that I need to record)
Metadata about the trials (when, who, etc.)
Results of data analysis for user
VERY rough estimate, each user generates 100 trials / day. So maximum of 10k datapoints / day. It needs to run that way for about 3 months, so about 1m datapoints. Safety multiplier 2x gives me a target of a database that can handle 2m datapoints.
((note: I could either store trial response data as individual data points, or group trials into Python list objects of varying length (user "sessions"). The latter would dramatically bring down the number database entries, though not the amount of data. Does it matter? How?))
I want a solution that will work (at least) until I get to this 1000 users level. If my program is popular beyond that level, I'm alright with doing some work modding in a beefier DB. Also reiterating that it must be easily deployable as a web application.
Beyond those basic requirements, I just want the easiest thing that will make this work. I'm pretty green.
Thanks for reading
Tr3y
SQLite can certainly handle that amount of data. It has a very large user base, with a few very well-known users, on all the major platforms; it's fast and light, and there are awesome GUI clients that allow you to browse and extract/filter data with a few clicks.
SQLite won't scale indefinitely, of course, but severe performance problems begin only when simultaneous inserts are needed, which I would guess is a problem appearing several orders of magnitude beyond your projected load.
I've been using it for a few years now, and I have never had a problem with it (although for larger sites I use MySQL). Personally I find that "Small. Fast. Reliable. Choose any three." (the tagline on SQLite's site) is quite accurate.
As for ease of use... the SQLite3 bindings (site temporarily down) are part of the Python standard library. Here you can find a small tutorial. Interestingly enough, simplicity is a design criterion for SQLite. From here:
Many people like SQLite because it is small and fast. But those qualities are just happy accidents. Users also find that SQLite is very reliable. Reliability is a consequence of simplicity. With less complication, there is less to go wrong. So, yes, SQLite is small, fast, and reliable, but first and foremost, SQLite strives to be simple.
There's a pretty spot-on discussion of when to use SQLite here. My favorite line is this:
Another way to look at SQLite is this: SQLite is not designed to replace Oracle. It is designed to replace fopen().
It seems to me that for your needs, SQLite is perfect. Indeed, it seems to me very possible that you will never need anything else:
With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes).
It doesn't sound like you'll have that much data at any point.
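For what it's worth, here is a minimal sqlite3 sketch for the trial data described above; the table layout and column names are assumptions, not a recommended schema:

import sqlite3

conn = sqlite3.connect("experiment.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS trials (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        ts       TEXT NOT NULL,   -- ISO timestamp of the trial
        outcome  TEXT NOT NULL,   -- 'correct' / 'incorrect' / 'timeout'
        score    REAL NOT NULL    -- the 0.1-1.0 value per trial
    )
""")
conn.execute(
    "INSERT INTO trials (username, ts, outcome, score) VALUES (?, ?, ?, ?)",
    ("alice", "2000-01-01T12:00:00", "correct", 0.8),
)
conn.commit()

# A per-user summary for the analysis step
for row in conn.execute(
    "SELECT username, COUNT(*), AVG(score) FROM trials GROUP BY username"
):
    print(row)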
I would consider MongoDB. It's very easy to get started, and is built for multi-user setups (unlike SQLite).
It also has a much simpler model. Instead of futzing around with tables and fields, you simply take all the data in your form and stuff it in the database. Even if your form changes (oops, forgot a field) you won't need to change MongoDB.
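A hedged sketch of that "just stuff the form in" approach with pymongo; the database, collection, and field names are hypothetical:

from pymongo import MongoClient

users = MongoClient()["experiment"]["users"]

form_data = {"username": "alice", "age": 30}   # whatever the form produced
users.insert_one(form_data)

# If the form later grows an extra field, nothing in the database
# needs to change:
users.insert_one({"username": "bob", "age": 25, "handedness": "left"})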
I'm curious about how others have approached the problem of maintaining and synchronizing database changes across many (10+) developers without a DBA? What I mean, basically, is that if someone wants to make a change to the database, what are some strategies to doing that? (i.e. I've created a 'Car' model and now I want to apply the appropriate DDL to the database, etc..)
We're primarily a Python shop and our ORM is SQLAlchemy. Previously, we had written our models so that the database schema was created from the ORM, but we recently ditched this because:
We couldn't track changes using the ORM
The state of the ORM wasn't in sync with the database (e.g. lots of differences primarily related to indexes and unique constraints)
There was no way to audit database changes unless the developer documented the database change via email to the team.
Our solution to this problem was to basically have a "gatekeeper" individual who checks every change into the database and applies all accepted database changes to an accepted_db_changes.sql file, whereby the developers who need to make any database changes put their requests into a proposed_db_changes.sql file. We check this file in, and, when it's updated, we all apply the change to our personal database on our development machine. We don't create indexes or constraints on the models, they are applied explicitly on the database.
I would like to know what are some strategies to maintain database schemas and if ours seems reasonable.
Thanks!
The solution is rather administrative than technical :)
The general rule is easy: there should only be tree-like dependencies in the project:
- There should always be a single master source of the schema, stored together with the project source code in version control.
- Everything affected by a change in the master source should be automatically re-generated every time the master source is updated, with no manual intervention ever allowed. If the automatic generation does not work, fix either the master source or the generator; don't manually update the generated code.
- All re-generations should be performed by the same person who updated the master source, and all changes, including the master source change, should be treated as a single transaction (a single source-control commit, a single build/deployment for every affected environment, including DB updates).
Enforced consistently, this gives a reliable result.
There are essentially four possible choices of master source:
1) DB metadata: the sources are generated after a DB update by a tool connecting to the live DB
2) Source code: a tool generates the SQL schema from sources annotated in a special way, and the SQL is then run on the DB
3) DDL: both the SQL schema and the source code are generated by a tool
4) Some other description (say, a text file read by a special Perl script that generates both the SQL schema and the source code)
1, 2 and 3 are equally good, provided that the tool you need exists and is not overly expensive.
4 is a universal approach, but it should be applied from the very beginning of the project and carries an overhead of a couple of thousand lines of code in a strange language to maintain.
Have you tried the SQLAlchemy Migrate tools?
They are specifically designed to auto-migrate your database design changes.
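Roughly what a change script looks like with sqlalchemy-migrate's changeset API; the table and column names are hypothetical and the details are approximate, so check the project's tutorial for the exact template:

from sqlalchemy import Column, MetaData, String, Table
from migrate import *  # per the migrate docs, adds .create()/.drop() to Column

def upgrade(migrate_engine):
    meta = MetaData(bind=migrate_engine)
    car = Table("car", meta, autoload=True)
    color = Column("color", String(32))
    color.create(car)          # emits the ALTER TABLE for you

def downgrade(migrate_engine):
    meta = MetaData(bind=migrate_engine)
    car = Table("car", meta, autoload=True)
    car.c.color.drop()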
So am I correct in assuming you are designing your db directly on the physical db? I used to do this many years ago, but the quality of the resulting db was pretty poor. If you use a modelling tool (personally I think Sybase PowerDesigner is still best of breed, but look around), everybody can make changes to the model and just sync their local dbs as required (it will also pick up documentation tasks). So, re bobah's post, the master is the PowerDesigner model rather than a physical db.
Is your accepted_db_changes.sql file one humongous list of change scripts? I'm not sure I like the idea of changing the file name, etc. Given that the difference between two db versions is a sequential list of alter scripts, how about a model like:
Ver1 (folder)
Change 1-1.sql
Change 1-2.sql
Change 1-3.sql
Ver2 (folder)
Change 2-1.sql
Change 2-2.sql
Change 2-3.sql
Where each change (new file) is reviewed before committing.
A general rule should be to make a conscious effort to automate as much of the db deployment in your dev environments as possible; we have definitely got a respectable ROI on this work. You can use tools like Redgate to generate your DDL (it has an API, though I'm not sure if it works with SQLAlchemy). IMO, DB changes should be trivial; if you are finding they are blocking, then look at what you can automate.
You might find the book Refactoring Databases helpful, as it contains general strategies for managing databases, not just how to refactor them.
The author's system expects that every developer will have their own copy of the database, as well as some general test database used before deploying to production. Your situation is one of the easier ones the book describes, as you don't have a number of separate applications using the database (although you do need someone who knows how to describe database migrations). The biggest thing is to be able to build the database from information in source control and to have changes described by small migrations (see #WoLpH's answer) rather than just making the change in the database. You will also find things easier if you have at least ORM <-> database tests to check that they are still in sync.