Tool to validate migrated data from RDBMS to mongoDB

Tool to validate migrated data from RDBMS to mongoDB - python

I need to create a python tool to validate the completeness and accuracy of migrated data from aRDBMS DB to MongoDB. It should support at least 30 tables and 1 million records (in the source).
I am aware that consistency checkers, for example, MD5 checksums, can be used to validate the same, but since I am new to python, can anybody please share their views if they have done a similar activity in the past (or any hint would be appreciated).
Thanks.

Related

Django ORM And Multiple Dynamic Databases

Disclaimer: I am still pretty new to Django and am no veteran.
I am in the midst of building the "next generation" of a software package I built 10 years ago. The original software was built using CodeIgniter and the LAMP stack. The current software still works great, but it's just time to move on. The tech is now old. I have been looking at Django to write the new software in, but I have concerns using the ORM and the models file getting out of control.
So here's my situation, each client must have their own database. No exceptions due to data confidentiality and contracts. Each database mainly stores weather forecast data. There is a skeleton database that is currently used to setup a client. This skeleton does have tables that are common across all clients. What I am concerned about are the forecast data tables I have to dynamically create. Each forecast table is unique and different with the exception of the first four columns in the table being used for referencing/indexing and letting you know when the data was added. The rest of the columns are forecast values in a real/float datatype. There could be anything from 12 forecast data columns to over 365. Between all clients, there are hundreds of different/unique forecast tables.
I am trying to wrap my head around how I can use the ORM without having hundreds of methods in model.py. Even if I made a subdirectory and then a "model.py" for each client, I'd still have tons of model methods to deal with.
I have been reading up on how the ORM works for Django, but I haven't found anything (yet) out there that helps with my kind of situation. It's not the norm.
Without getting any more long winded about this, should I skip the ORM because of all these complexities or is there some stable way to deal with this besides going with SQL queries and stored procedures to get some performance gains?
Things to note: I did thorough benchmarking between MySQL and Postgres and will be using Postgres for the new project. I did test the option of using an array column vs having a column for each forecast value in Postgres hoping this would help with the potential modeling bloat issue. To my surprise, having a column for each forecast value provided faster querying than storing everything in an array column. So array storage is not a viable option for my data.

Python MongoDB Multiple Database Design Query

I was hoping someone with more in-depth knowledge of MongoDB could provide some feedback on the implementation for my database requirements.
I am currently implementing a web application using Flask and MongoDB for my client who's business has multiple physical locations. They, as the Managing Director, want to be able to access information for each of the locations individually but they want to restrict the managers of each location to just their information.
I can envisage two different solutions to this but am not familiar enough with the underlying technology/design of MongoDB to fully appreciate the pros and cons of each implementation.
For each solution I have a user model that contains which location(s) a user has access to.
Solution 1 would involve each document having a field to record which location the object belonged to and then including that as a filter in every query. This seems the simplest to implement but would result in a single large database, could there be scaling issues?
Solution 2 involves having a separate user database, and an individual database for each location. When a request is made the application would have to establish from the user database which location they are accessing and connect to that database. This at first appears more complex to implement but would result in smaller isolated databases which may make it easier to scale but possibly harder to manage.
I would greatly appreciate any feedback on my solutions or any other solutions I have not considered.
Many thanks in advance.

What's faster: temporary SQL tables or Python dicts for session data?

Have some programming background, but in the process of both learning Python and making a web app, and I'm a long-time lurker but first-time poster on Stack Overflow, so please bear with me.
I know that SQLite (or another database, seems like PostgreSQL is popular) is the way to store data between sessions. But what's the most efficient way to store large amounts of data during a session?
I'm building a script to identify the strongest groups of employees to work on various projects in a company. I have received one SQLite database per department containing employee data including skill sets, achievements, performance, and pay.
My script currently runs one SQL query on each database in response to an initial query by the user, pulling all the potentially-relevant employees and their data. It stores all of that data in a list of Python dicts so the end-user can mix-and-match relevant people.
I see two other options: I could still run the comprehensive initial queries but instead of storing it in Python dicts, dump it all into SQLite temporary tables; my guess is that this would save some space and computing because I wouldn't have to store all the joins with each record. Or I could just load employee name and column/row references, which would save a lot of joins on the first pass, then pull the data on the fly from the original databases as the user requests additional data, storing little if any data in Python data structures.
What's going to be the most efficient? Or, at least, what is the most common/proper way of handling large amounts of data during a session?
Thanks in advance!

Aren't you over-optimizing? You don't need the best solution, you need a solution which is good enough.
Implement the simplest one, using dicts; it has a fair chance to be adequate. If you test it and then find it inadequate, try SQLite or Mongo (both have downsides) and see if it suits you better. But I suspect that buying more RAM instead would be the most cost-effective solution in your case.
(Not-a-real-answer disclaimer applies.)

Sharding a Django Project

I'm starting a Django project and need to shard multiple tables that are likely to all be of too many rows. I've looked through threads here and elsewhere, and followed the Django multi-db documentation, but am still not sure how that all stitches together. My models have relationships that would be broken by sharding, so it seems like the options are to either drop the foreign keys of forgo sharding the respective models.
For argument's sake, consider the classic Authot, Publisher and Book scenario, but throw in book copies and users that can own them. Say books and users had to be sharded. How would you approach that? A user may own a copy of a book that's not in the same database.
In general, what are the best practices you have used for routing and the sharding itself? Did you use Django database routers, manually selected a database inside commands based on your sharding logic, or overridden some parts of the ORM to achive that?
I'm using PostgreSQL on Ubuntu, if it matters.
Many thanks.

In the past I've done something similar using Postgresql Table Partitioning, however this merely splits a table up in the same DB. This is helpful in reducing table search time. This is also nice because you don't need to modify your django code much. (Make sure you perform queries with the fields you're using for constraints).
But it's not sharding.
If you haven't seen it yet, you should check out Sharding Postgres with Instagram.

I agree with #DanielRoseman. Also, how many is too many rows. If you are careful with indexing, you can handle a lot of rows with no performance problems. Keep your indexed values small (ints). I've got tables in excess of 400 million rows that produce sub-second responses even when joining with other many million row tables.
It might make more sense to break user up into multiple tables so that the user object has a core of commonly used things and then the "profile" info lives elsewhere (std Django setup). Copies would be a small table referencing books which has the bulk of the data. Considering how much ram you can put into a DB server these days, sharding before you have too seems wrong.

how to make table partitions?

I am not very familiar with databases, and so I do not know how to partition a table using SQLAlchemy.
Your help would be greatly appreciated.

There are two kinds of partitioning: Vertical Partitioning and Horizontal Partitioning.
From the docs:
Vertical Partitioning
Vertical partitioning places different
kinds of objects, or different tables,
across multiple databases:
engine1 = create_engine('postgres://db1')
engine2 = create_engine('postgres://db2')
Session = sessionmaker(twophase=True)
# bind User operations to engine 1, Account operations to engine 2
Session.configure(binds={User:engine1, Account:engine2})
session = Session()
Horizontal Partitioning
Horizontal partitioning partitions the
rows of a single table (or a set of
tables) across multiple databases.
See the “sharding” example in
attribute_shard.py
Just ask if you need more information on those, preferably providing more information about what you want to do.

It's quite an advanced subject for somebody not familiar with databases, but try Essential SQLAlchemy (you can read the key parts on Google Book Search -- p 122 to 124; the example on p. 125-126 is not freely readable online, so you'd have to purchase the book or read it on commercial services such as O'Reilly's Safari -- maybe on a free trial -- if you want to read the example).
Perhaps you can get better answers if you mention whether you're talking about vertical or horizontal partitioning, why you need partitioning, and what underlying database engines you are considering for the purpose.

Automatic partitioning is a very database engine specific concept and SQLAlchemy doesn't provide any generic tools to manage partitioning. Mostly because it wouldn't provide anything really useful while being another API to learn. If you want to do database level partitioning then do the CREATE TABLE statements using custom Oracle DDL statements (see Oracle documentation how to create partitioned tables and migrate data to them). You can use a partitioned table in SQLAlchemy just like you would use a normal table, you just need the table declaration so that SQLAlchemy knows what to query. You can reflect the definition from the database, or just duplicate the table declaration in SQLAlchemy code.
Very large datasets are usually time-based, with older data becoming read-only or read-mostly and queries usually only look at data from a time interval. If that describes your data, you should probably partition your data using the date field.
There's also application level partitioning, or sharding, where you use your application to split data across different database instances. This isn't all that popular in the Oracle world due to the exorbitant pricing models. If you do want to use sharding, then look at SQLAlchemy documentation and examples for that, for how SQLAlchemy can support you in that, but be aware that application level sharding will affect how you need to build your application code.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.