I am not very familiar with databases, and so I do not know how to partition a table using SQLAlchemy.
Your help would be greatly appreciated.
There are two kinds of partitioning: Vertical Partitioning and Horizontal Partitioning.
From the docs:
Vertical Partitioning
Vertical partitioning places different kinds of objects, or different tables, across multiple databases:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# User and Account are your mapped classes
engine1 = create_engine('postgresql://db1')
engine2 = create_engine('postgresql://db2')
Session = sessionmaker(twophase=True)
# bind User operations to engine 1, Account operations to engine 2
Session.configure(binds={User: engine1, Account: engine2})
session = Session()
Horizontal Partitioning
Horizontal partitioning partitions the rows of a single table (or a set of tables) across multiple databases.
See the “sharding” example in attribute_shard.py.
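As a rough idea of what that example sets up, here is a minimal sketch based on the 1.x-era horizontal sharding API (the shard names, connection URLs and the region attribute are placeholders I've made up, and newer SQLAlchemy releases rename some of these hooks):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.horizontal_shard import ShardedSession

# two placeholder shard databases
shards = {
    'north': create_engine('postgresql://localhost/db_north'),
    'south': create_engine('postgresql://localhost/db_south'),
}

def shard_chooser(mapper, instance, clause=None):
    # pick the shard a new object is written to; `region` is a
    # hypothetical attribute on your mapped class
    if instance is None:
        return 'north'
    return 'north' if instance.region == 'north' else 'south'

def id_chooser(query, ident):
    # given a primary key, list the shards to search
    return ['north', 'south']

def query_chooser(query):
    # given a query, list the shards it should run against
    return ['north', 'south']

Session = sessionmaker(
    class_=ShardedSession,
    shards=shards,
    shard_chooser=shard_chooser,
    id_chooser=id_chooser,
    query_chooser=query_chooser,
)
session = Session()

Every mapped class used with such a session goes through these hooks transparently, so the choosers, not the models, encode your sharding logic.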
Just ask if you need more information on those, preferably providing more information about what you want to do.
It's quite an advanced subject for somebody not familiar with databases, but try Essential SQLAlchemy. You can read the key parts on Google Book Search (pp. 122 to 124); the example on pp. 125-126 is not freely readable online, so you'd have to purchase the book or read it on a commercial service such as O'Reilly's Safari (perhaps on a free trial) if you want to read the example.
Perhaps you can get better answers if you mention whether you're talking about vertical or horizontal partitioning, why you need partitioning, and what underlying database engines you are considering for the purpose.
Automatic partitioning is a very database-engine-specific concept, and SQLAlchemy doesn't provide any generic tools to manage partitioning, mostly because such tools wouldn't add much value while being yet another API to learn. If you want to do database-level partitioning, issue the CREATE TABLE statements yourself using custom Oracle DDL (see the Oracle documentation on how to create partitioned tables and migrate data to them). You can then use a partitioned table in SQLAlchemy just like a normal table; you only need the table declaration so that SQLAlchemy knows what to query. You can reflect the definition from the database, or duplicate the table declaration in your SQLAlchemy code.
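For instance, once the partitioned table exists in the database, using it from SQLAlchemy via reflection might look roughly like this (a sketch in the 1.4+ style; the connection URL, table name and columns are placeholders, not anything from the original setup):

from datetime import date

from sqlalchemy import MetaData, Table, create_engine, select

engine = create_engine('oracle://user:password@tns_alias')  # placeholder URL
metadata = MetaData()

# reflect the partitioned table that was created with native DDL;
# to SQLAlchemy it is just another table
measurements = Table('measurements', metadata, autoload_with=engine)

with engine.connect() as conn:
    recent = conn.execute(
        select(measurements).where(measurements.c.sample_date >= date(2019, 1, 1))
    ).fetchall()

From SQLAlchemy's point of view there is nothing partition-specific here; the database decides which partitions the query actually touches.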
Very large datasets are usually time-based, with older data becoming read-only or read-mostly and queries usually looking only at data from a time interval. If that describes your data, you should probably partition it using the date field.
There's also application-level partitioning, or sharding, where you use your application to split data across different database instances. This isn't all that popular in the Oracle world due to the exorbitant pricing model. If you do want to use sharding, then look at the SQLAlchemy documentation and examples to see how SQLAlchemy can support you in it, but be aware that application-level sharding will affect how you need to build your application code.
Disclaimer: I am still pretty new to Django and am no veteran.
I am in the midst of building the "next generation" of a software package I built 10 years ago. The original software was built using CodeIgniter and the LAMP stack. The current software still works great, but it's just time to move on; the tech is now old. I have been looking at Django to write the new software in, but I have concerns about using the ORM and about the models file getting out of control.
So here's my situation: each client must have their own database, no exceptions, due to data confidentiality and contracts. Each database mainly stores weather forecast data. There is a skeleton database that is currently used to set up a client; it does have tables that are common across all clients. What I am concerned about are the forecast data tables I have to create dynamically. Each forecast table is unique, with the exception of the first four columns, which are used for referencing/indexing and for recording when the data was added. The rest of the columns are forecast values with a real/float datatype. There could be anything from 12 forecast data columns to over 365. Across all clients, there are hundreds of different/unique forecast tables.
I am trying to wrap my head around how I can use the ORM without ending up with hundreds of model classes in models.py. Even if I made a subdirectory with a models.py per client, I'd still have tons of model classes to deal with.
I have been reading up on how the ORM works for Django, but I haven't found anything (yet) out there that helps with my kind of situation. It's not the norm.
Without getting any more long-winded about this: should I skip the ORM because of all these complexities, or is there some stable way to deal with this besides going with raw SQL queries and stored procedures to get some performance gains?
Things to note: I did thorough benchmarking between MySQL and Postgres and will be using Postgres for the new project. I also tested using a Postgres array column versus having a column for each forecast value, hoping this would help with the potential model bloat issue. To my surprise, having a column for each forecast value gave faster queries than storing everything in an array column, so array storage is not a viable option for my data.
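One pattern sometimes used for this kind of situation, rather than hand-writing hundreds of classes, is building the model classes at runtime with type(). A minimal sketch, where the app label, column names and naming scheme are my own assumptions:

from django.db import models

def make_forecast_model(table_name, value_columns):
    """Build a Django model class at runtime for one client's forecast table.

    table_name and value_columns come from your own metadata; every name
    used here is illustrative.
    """
    attrs = {
        '__module__': 'forecasts.models',
        # the four common reference/indexing columns described above
        'location_id': models.IntegerField(db_index=True),
        'issued_at': models.DateTimeField(db_index=True),
        'valid_at': models.DateTimeField(db_index=True),
        'created_at': models.DateTimeField(auto_now_add=True),
        'Meta': type('Meta', (), {
            'db_table': table_name,
            'app_label': 'forecasts',
            'managed': False,  # the tables are created outside Django
        }),
    }
    # one float column per forecast value (anywhere from 12 to 365+)
    for name in value_columns:
        attrs[name] = models.FloatField(null=True)
    return type('Forecast_%s' % table_name, (models.Model,), attrs)

A class built this way can be queried like any other model (e.g. with .objects.using('some_client_alias') if each client database has its own DATABASES alias), with the caveat that creating a model with the same class name twice will upset Django's app registry.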
I have financial statement data on thousands of different companies. Some of the companies have data only for 2019, but for some I have a decade of data. Each company's financial statement has its own table, structured as follows (the first row is the column headers):
lineitem   2019   2018   2017
2          1000    800    600
3206        700    300   -200
56           50    100    100
200        1200     90    700
This structure is preferred over a flatter lineitem-year-amount layout because one query gives me output already shaped like a financial statement table. lineitem is a foreign key to the primary key of a mapping table with over 10,000 records; 3206, for example, can mean "Debt to credit institutions". I also have a companyIndex table which holds the company ID, company name, and table name. I am able to get the data into the database and run queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and be hard to read. I like the potential of using the ORM in Django or SQLAlchemy. The SQLAlchemy ORM seems to want me to know the name of the table I am about to create and how many columns it will have, but I don't know that in advance, since I have a script that parses a CSV data dump containing the company ID and financial statement data for however many years the company has operated. Also, one year later I will have to update each table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try much in practice because of this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow, but have not found any solved question that matches (which is really surprising, since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy, given the structure I plan to fit it into? How can I have the selected table(s) (based on company ID or company name) be objects in the ORM just like any other object, allowing me to select the data I want at the granularity level I want?
Ideally there is a solution to this in Django, but since I haven't found anything I suspect there is not or that how I have structured the database is insanity.
You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application, because, for example, at a certain point in the application's lifetime, new attributes become required for the data. Not because there is more data, which simply translates into new rows in one or more tables.
So first you decide on a proper schema for the database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized up to 3rd normal form.
You really have to understand this; any standard write-up on database normalization covers it. This is fundamental database knowledge you cannot escape. After learning it properly, and with practice, it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, an initial import procedure, or later data import operations may need to massage the data before writing proper rows into the existing tables.
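To make that concrete for the financial-statement case above, a normalized schema could look roughly like this Django sketch (model and field names are my own assumptions, not a prescription):

from django.db import models

class Company(models.Model):
    name = models.CharField(max_length=255)

class LineItem(models.Model):
    # the mapping table with 10,000+ records, e.g. 3206 -> "Debt to credit institutions"
    code = models.IntegerField(unique=True)
    description = models.CharField(max_length=255)

class FinancialFact(models.Model):
    company = models.ForeignKey(Company, on_delete=models.CASCADE)
    line_item = models.ForeignKey(LineItem, on_delete=models.CASCADE)
    year = models.IntegerField()
    amount = models.DecimalField(max_digits=18, decimal_places=2)

    class Meta:
        unique_together = [('company', 'line_item', 'year')]

An extra year of data is then just more FinancialFact rows, and the wide statement layout shown in the question becomes a pivot you build at query time rather than a table schema.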
Currently I have 2 databases
Primary (default) database, containing everything such as users, posts etc. (this one runs on PostgreSQL via psycopg2)
Secondary (geo) database, containing only geo data (this one runs PostGIS 1.5)
Django and PG do not support cross-database relations, for good reasons; I already know that. But I split my databases up because I fear that the geo database is optimised for geo data, and that if I mix all the data in one database the overall performance will suffer. Plus, I don't even know if I can have everything, geo and normal data, in one database.
But I want to relate data from the primary(1) database to the secondary(2) database.
Is this approach reasonable or is it completely wrong to split it up?
If you are concerned about performance, it is obvious that the data needs to be in one database.
Spatial isn't special, it's just another data type. I don't see why enabling PostGIS would compromise the performance of a database. However, it doesn't hurt to test this with a copy of the primary database, particularly for production environments.
If you are concerned that enabling PostGIS will add hundreds of functions to the "public" schema, you can make a "postgis" schema and put the extension there (the PostGIS documentation has the details). However, I'm not sure how GeoDjango will cope with this setup.
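For what it's worth, one way to set that up (a sketch assuming PostGIS 2.0+ packaged as an extension and a database role that is allowed to run CREATE EXTENSION) is a migration that creates the schema and installs the extension into it:

# sketch: a Django migration that keeps PostGIS out of the "public" schema
from django.db import migrations

class Migration(migrations.Migration):

    dependencies = []

    operations = [
        migrations.RunSQL(
            "CREATE SCHEMA IF NOT EXISTS postgis; "
            "CREATE EXTENSION IF NOT EXISTS postgis WITH SCHEMA postgis;"
        ),
    ]

# the schema then has to be on the search path, e.g. in settings.py:
# DATABASES['default']['OPTIONS'] = {'options': '-c search_path=public,postgis'}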
I'm starting a Django project and need to shard multiple tables that are all likely to have too many rows. I've looked through threads here and elsewhere, and followed the Django multi-db documentation, but am still not sure how it all stitches together. My models have relationships that would be broken by sharding, so it seems like the options are to either drop the foreign keys or forgo sharding the respective models.
For argument's sake, consider the classic Author, Publisher and Book scenario, but throw in book copies and users that can own them. Say books and users had to be sharded. How would you approach that? A user may own a copy of a book that's not in the same database.
In general, what are the best practices you have used for routing and for the sharding itself? Did you use Django database routers, manually select a database inside commands based on your sharding logic, or override some parts of the ORM to achieve that?
I'm using PostgreSQL on Ubuntu, if it matters.
Many thanks.
In the past I've done something similar using PostgreSQL table partitioning; however, this merely splits a table up within the same database. It is helpful in reducing table search time, and it's also nice because you don't need to modify your Django code much (just make sure you perform queries using the fields the partition constraints are on).
But it's not sharding.
If you haven't seen it yet, you should check out Sharding Postgres with Instagram.
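For reference, database-level partitioning of the kind mentioned above can now be done with PostgreSQL's declarative syntax (11+ if you want a primary key on the partitioned table). A minimal sketch driven from a Django migration; the table and column names are illustrative, and note that this splits data within one database rather than sharding it:

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = []

    operations = [
        migrations.RunSQL("""
            CREATE TABLE book_copy (
                id         bigserial,
                book_id    bigint NOT NULL,
                owner_id   bigint NOT NULL,
                created_at date   NOT NULL,
                PRIMARY KEY (id, created_at)
            ) PARTITION BY RANGE (created_at);

            CREATE TABLE book_copy_2023 PARTITION OF book_copy
                FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
        """),
    ]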
I agree with @DanielRoseman. Also, how many rows is too many? If you are careful with indexing, you can handle a lot of rows with no performance problems. Keep your indexed values small (ints). I've got tables in excess of 400 million rows that produce sub-second responses, even when joining with other many-million-row tables.
It might make more sense to break the user model up into multiple tables, so that the user object has a core of commonly used things and the "profile" info lives elsewhere (the standard Django setup). Copies would be a small table referencing books, which hold the bulk of the data. Considering how much RAM you can put into a DB server these days, sharding before you have to seems wrong.
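A minimal sketch of that core-user-plus-profile split, using Django's usual one-to-one pattern (the field names and the Book model are assumptions):

from django.conf import settings
from django.db import models

class Profile(models.Model):
    # rarely-used details live here, away from the hot auth_user table
    user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    bio = models.TextField(blank=True)
    shipping_address = models.CharField(max_length=255, blank=True)

class Copy(models.Model):
    # a small table referencing books, which carry the bulk of the data
    book = models.ForeignKey('Book', on_delete=models.CASCADE)   # assumes a Book model in this app
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)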
I've been learning Python by building a web app on Google App Engine over the past five or six months. I also just finished taking a databases class this semester, where I learned about views and their benefits.
Is there an equivalent with the GAE datastore using python?
Read-only views (the most common type) are basically queries against one or more tables to present the illusion of new tables. If you took a college-level database course, you probably learned about relational databases, and I'm guessing you're looking for something like relational views.
The short answer is No.
The GAE datastore is non-relational. It doesn't have tables. It's essentially a very large distributed hash table that uses composite keys to present the (very useful) illusion of Entities, which are easy at first glance to mistake for rows in a relational database.
The longer answer depends on what you'd do if you had a view.
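For example, if what you'd want from a "view" is just a canned, reusable query, the old google.appengine.ext.db API of that era lets you wrap one in a plain function (a sketch; the Book model and its properties are made up):

from google.appengine.ext import db

class Book(db.Model):
    title = db.StringProperty()
    author = db.StringProperty()
    published = db.DateProperty()

def recent_books_by(author, limit=20):
    # behaves like a read-only "view": the same query, reusable everywhere
    return Book.all().filter('author =', author).order('-published').fetch(limit)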
First of all, to answer your question: with normal GAE, i.e. the non-relational GAE datastore, you won't have such things as views.
Since you are probably starting with relational SQL in school, I would suggest switching to the relational, SQL-based GAE offering at http://code.google.com/apis/sql/ and http://code.google.com/apis/sql/docs/before_you_begin.html#enroll (I am not sure whether it's available right away or whether you need to wait for approval to use an instance, but register right away).
Web-based applications are increasingly using emerging non-relational DBs, and you would benefit from studying them as well; that way you can also understand GAE's non-relational datastore better. As a starting point, see http://en.wikipedia.org/wiki/NoSQL, and then you have many more to explore, the most famous ones being MongoDB, Amazon SimpleDB, etc.