I'm working on a problem that involves multiple database instances, each with different table structures. The problem is that between these tables there are lots and lots of duplicates, and I need a way to efficiently find them, report them, and possibly eliminate them.
E.g. I have two tables. The first table, CustomerData, has the fields:
_countId, customerFID, customerName, customerAddress, _someRandomFlags
and I have another table, CustomerData2 (built later) with the fields:
_countId, customerFID, customerFirstName, customerLocation, _someOtherRandomFlags.
Between the two tables above, I know for a fact that customerName and customerFirstName were used to store the same data, and similarly customerLocation and customerAddress were also used to store the same data.
Let's say some of the sales team have been using CustomerData, and others have been using CustomerData2. I'd like to have a scalable way of detecting the redundancies between the tables and reporting them. It can be assumed with some amount of surety that customerFID in both tables is consistent and refers to the same customer.
One solution I could think of was to create a customerData class in Python, map the records in the two tables to this class, compute a hash/signature for the required fields of the objects (customerName, customerLocation/Address), and store them in a signature table, which has the columns:
sourceTableName, entityType (customerData), identifyingKey (customerFID), signature
and then, for each entityType, I look for duplicate signatures for each customerFID.
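A minimal sketch of what I mean (assuming the rows from both tables have already been fetched and mapped to a common dict shape; the normalisation is just an example):

from collections import defaultdict
import hashlib

def signature(*fields):
    # normalise (lower-case, strip whitespace) so trivial formatting differences don't hide duplicates
    normalised = "|".join((f or "").strip().lower() for f in fields)
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

# rows from each source table, already mapped to a common (customerFID, name, location) dict shape
sources = {"CustomerData": customer_data_rows, "CustomerData2": customer_data2_rows}

seen = defaultdict(list)   # (customerFID, signature) -> list of source tables
for table, rows in sources.items():
    for r in rows:
        seen[(r["customerFID"], signature(r["name"], r["location"]))].append(table)

# a key that appears under more than one source table is a cross-table duplicate to report
duplicates = {key: tables for key, tables in seen.items() if len(tables) > 1}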
In reality, I'm working with huge sets of biomedical data with lots and lots of columns. They were created by different people (and sadly with no standard nomenclature or structure), and duplicate data has been stored in them.
EDIT:
For simplicity's sake, I can move all the database instances to a single server instance.
If I didn't care about performance, I'd use a high-level, practical approach: use Django (or SQLAlchemy or ...) to build your desired models (your tables) and fetch the data to compare. Then use an algorithm for efficiently identifying duplicates (... from lists or dicts, it depends on "how" you hold your data). To boost performance you may try to "enhance" your app with the multiprocessing module or consider a map-reduce solution.
Just started out with TinkerPop and JanusGraph, and I'm trying to figure this out based on the documentation.
I have three datasets, each containing about 20 million rows (CSV files).
There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
After having everything in a graph, I'd like to of course use some basic Gremlin to see how well the model works.
But first I need a way to get the data into Janusgraph.
Possibly there exist scripts for this.
But otherwise, is it perhaps something to be written in Python, to open a CSV file, get each row of a variable X, and add this as a vertex/edge/etc. ...?
Or am I completely misinterpreting JanusGraph/TinkerPop?
Thanks for any help in advance.
EDIT:
Say I have a few files, each of which contains a few million rows representing people, and several variables representing different metrics. A first example could look like this:
              metric_1   metric_2   metric_3   ..
person_1      a          e          i
person_2      b          f          j
person_3      c          g          k
person_4      d          h          l
..
Should I translate this to files with nodes that are, in the first place, made up of just the values [a,..., l]?
(and later perhaps more elaborate sets of properties)
And are [a,..., l] then indexed?
The 'Modern' graph here seems to have an index (numbers 1,...,12 for all the nodes and edges, independent of their overlapping label/category). Should each measurement be indexed separately and then linked to the given person_x to which it belongs?
Apologies for these probably straightforward questions, but I'm fairly new to this.
Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version about 2 years ago, and it's still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph itself: different users have very different data, different formats, and different graph models (some mostly need one vertex with one edge, e.g. child-mother; others deal with one vertex with many edges, e.g. user followers). Last but definitely not least, the very nature of the tool deals with large data sets, not to mention that the underlying storage and index databases mostly come preconfigured to replicate massively (i.e. you might be thinking 20m rows, but you actually end up inserting 60m or 80m entries).
All said, I've had moderate success in bulk loading some tens of millions of records in decent timeframes (again it will be painful, but here are the general steps):
Provide IDs when creating graph elements. If importing from e.g. MySQL, think of perhaps combining the table name with the id value to create unique IDs, e.g. users1, tweets2.
Don't specify a schema up front. This is because JanusGraph will need to ensure the data conforms on each insert.
Don't specify indexes up front. This is related to the above but really deserves its own entry: bulk insert first, index later.
Please, please, please, be aware of the underlying database's features for bulk inserts and activate them, i.e. read up on the Cassandra, ScyllaDB, Big Table, etc. docs, especially on replication and indexing.
After all the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate ids), and consider some form of parallelizing the insert requests, e.g. some kind of map-reduce system (see the sketch after this list).
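As a rough illustration only (not a recipe), batched inserts from a CSV via gremlinpython might look like the sketch below; the Gremlin Server endpoint, CSV columns, vertex label and batch size are all assumptions, and the composite id is kept as an ordinary property:

import csv
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")   # placeholder endpoint
g = traversal().withRemote(conn)

BATCH = 100   # the insert rate is exactly the trial-and-error knob mentioned below
t, count = g, 0
with open("users.csv", newline="") as f:
    for row in csv.DictReader(f):
        # composite unique id (table name + source primary key), stored as a plain property
        t = t.addV("user").property("uid", "users" + row["id"]).property("name", row["name"])
        count += 1
        if count % BATCH == 0:
            t.iterate()   # one round trip for a whole batch of addV steps
            t = g
if count % BATCH != 0:
    t.iterate()
conn.close()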
I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite some trial and error, for example with the bulk insert rate: too low is bad (e.g. 10 per second) while too high is equally bad (e.g. 10k per second), and it almost always depends on your data, so it's a case-by-case basis and I can't recommend where you should start.
All said and done, give it a real go; bulk loading is the hardest part in my opinion, and the struggles are well worth the new dimension it gives your application.
All the best!
JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows you to quickly get up and running by starting Cassandra and Elasticsearch (it also starts a gremlin-server, but we won't use it):
cd /path/to/janus
bin/janusgraph.sh start
Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console
bin/gremlin.sh -e scripts/load_data.script
An efficient way to load the data is to split it into two files:
nodes.csv: one line per node with all attributes
links.csv: one line per link with source_id and target_id and all the links attributes
This might require some data preparation steps.
Here is an example script
The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.
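If you prefer to stay in Python rather than Groovy, the same id-mapping trick looks roughly like this with gremlinpython (file layout, labels and property names here are just placeholders):

import csv
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(DriverRemoteConnection("ws://localhost:8182/gremlin", "g"))

id_map = {}   # your id -> the id JanusGraph generated for that node
with open("nodes.csv", newline="") as f:
    for row in csv.DictReader(f):
        v = g.addV("node").property("ext_id", row["id"]).next()
        id_map[row["id"]] = v.id

with open("links.csv", newline="") as f:
    for row in csv.DictReader(f):
        # both endpoints are looked up by their remembered internal ids, so no index lookups are needed
        g.V(id_map[row["source_id"]]).addE("link").to(__.V(id_map[row["target_id"]])).iterate()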
Even if it is not mandatory, I strongly recommend you create an explicit schema for your graph before loading any data. Here is an example script
I'm using neo4j to contain temporary datasets from different source systems. My data consists of a few parent objects which each contain ~4-7 layers of child objects of varying types. Total object count per dataset varies between 2,000 and 1.5 million. I'm using the python py2neo library, which has had good performance both during the data creation phase, and for passing through cypher queries for reporting.
I'd like to isolate datasets from unrelated systems for querying and purging purposes, but I'm worried about performance. I have a few ideas, but it's not clear to me which are the most likely to be viable.
The easiest to implement (for my code) would be a top-level "project" object. That project object would then have a few direct children (via a relationship) and many indirect children. I'm worried that when I want to filter by project, I'll have to use a variable-length relationship match like MATCH (pr:project)<-[:IN_PROJECT*7]-(c:child_object) to cover the distance, which seems very expensive query-wise.
I could also make a direct relationship between the project object and every other object in the project. MATCH (pr:project)<-[:IN_PROJECT]-(c:child_object) This should be easier for writing queries, but I don't know what might happen when I have a single object with potentially millions of relationships.
Finally, I could set a project-id property on every single object in the dataset. MATCH (c:child_object {project-id:"A1B2C3"}) It seems to be a wasteful solution, but I think it might be better performance wise in the graph DB model.
Apologies if I mangled the sample Cypher queries / neo4j terminology. I set aside this project for 6 weeks, and I'm a little rusty.
If you have a finite set of datasets, you should consider using a dedicated label to specify the data source. In Neo4j's property graph data model, a node is allowed to have multiple labels.
MATCH (c:child_object:DataSourceA)
Labels are always indexed, so performance should be better than that of your proposals 1-3. I also think this is a more elegant solution -- however, it will get tricky if you do not know the number of data sets up front. In the latter case, you might use something like
MATCH (c:child_object)
WHERE 'DataSourceA' IN labels(c)
But this is more like a "full table scan", so performance-wise, you'll be better off using your approach 3 and building an index on project-id.
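For instance, with py2neo (the library mentioned in the question; connection details, labels and property names below are placeholders), that could look roughly like:

from py2neo import Graph, Node

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))   # placeholder credentials

# give every node of a dataset an extra label naming its source system
child = Node("child_object", "DataSourceA", project_id="A1B2C3")
graph.create(child)

# query by label, as suggested above
count = graph.run("MATCH (c:child_object:DataSourceA) RETURN count(c)").evaluate()

# fallback for approach 3: index the project id property
# (Neo4j 3.x syntax; 4.x+ uses CREATE INDEX FOR (c:child_object) ON (c.project_id))
graph.run("CREATE INDEX ON :child_object(project_id)")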
I have a very large dataset - millions of records - that I want to store in Python. I might be running on 32-bit machines so I want to keep the dataset down in the hundreds-of-MB range and not ballooning much larger than that.
These records represent an M:M relationship - two IDs (foo and bar) and some simple metadata like timestamps (baz).
Some foos have nearly all bars in them, and some bars have nearly all foos. But there are many bars that have almost no foos and many foos that have almost no bars.
If this were a relational database, a M:M relationship would be modelled as a table with a compound key. You can of course search on either component key individually comfortably.
If you store the rows in a hashtable, however, you need to maintain three hashtables as the compound key is hashed and you can't search on the component keys with it.
If you have some kind of sorted index, you can abuse lexical sorting to iterate over the first key of the compound key, and you need a second index for the other key; but it's less obvious to me which actual data structure in the standard Python collections this equates to.
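Roughly, the sorted-index idea I have in mind would be something like this (sketch only, toy data):

import bisect

all_rows = [(2, 10, 1236), (1, 11, 1235), (1, 10, 1234)]   # (foo, bar, baz) tuples
rows = sorted(all_rows)                  # lexically sorted by foo, then bar
foo_keys = [r[0] for r in rows]          # parallel list of just the foo components

def rows_for_foo(foo):
    lo = bisect.bisect_left(foo_keys, foo)
    hi = bisect.bisect_right(foo_keys, foo)
    return rows[lo:hi]                   # contiguous slice: every row for this foo

print(rows_for_foo(1))                   # [(1, 10, 1234), (1, 11, 1235)]
# ...but a second structure (e.g. a dict of bar -> foos) would still be needed for the other direction.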
I am considering a dict of foo where each value is automatically moved from tuple (a single row) to list (of row tuples) to dict depending on some thresholds, and another dict of bar where each is a single foo, or a list of foo.
Are there more efficient - speedwise and spacewise - ways of doing this? Any kind of numpy for indices or something?
(I want to store them in Python because I am having performance problems with databases - both SQL and NoSQL varieties. You end up being IPC-, memcpy- and serialisation-bound. That is another story; however, the key point is that I want to move the data into the application rather than get recommendations to move it out of the application ;) )
Have you considered using a NoSQL database that runs in memory, such as Redis? Redis supports a decent amount of familiar data structures.
I realize you don't want to move outside of the application, but not reinventing the wheel can save time and quite frankly it may be more efficient.
If you need to query the data in a flexible way and maintain various relationships, I would suggest looking further into using a database, of which there are many options. How about using an in-memory database, like sqlite (using ":memory:" as the file)? You're not really moving the data "outside" of your program, and you will have much more flexibility than with multi-layered dicts.
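A minimal sketch of the in-memory sqlite idea, reusing the foo/bar/baz naming from the question:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mm (foo INTEGER, bar INTEGER, baz INTEGER, PRIMARY KEY (foo, bar))")
con.execute("CREATE INDEX mm_bar ON mm (bar)")   # second index so lookups by bar are also cheap

con.executemany("INSERT INTO mm VALUES (?, ?, ?)",
                [(1, 10, 1234), (1, 11, 1235), (2, 10, 1236)])

# search comfortably on either component of the compound key
bars_for_foo = con.execute("SELECT bar, baz FROM mm WHERE foo = ?", (1,)).fetchall()
foos_for_bar = con.execute("SELECT foo, baz FROM mm WHERE bar = ?", (10,)).fetchall()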
Redis is also an interesting alternative, as it has other data-structures to play with, rather than using a relational model with SQL.
What you describe sounds like a sparse matrix, where the foos are along one axis and the bars along the other one. Each non-empty cell represents a relationship between one foo and one bar, and contains the "simple metadata" you describe.
There are efficient sparse matrix packages for Python (scipy.sparse, PySparse) you should look at. I found these two just by Googling "python sparse matrix".
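For instance, a sketch with scipy.sparse, assuming the foos and bars have first been mapped to integer indices and the metadata fits into a single numeric value such as a timestamp:

import numpy as np
from scipy.sparse import dok_matrix

n_foo, n_bar = 1_000_000, 500_000              # only non-empty cells cost memory
m = dok_matrix((n_foo, n_bar), dtype=np.int64)

m[12, 34] = 1_300_000_000                      # e.g. a unix timestamp for this (foo, bar) pair
m[12, 99] = 1_300_000_050

bars_for_foo_12 = m[12, :].nonzero()[1]        # all bars related to foo 12
foos_for_bar_34 = m[:, 34].nonzero()[0]        # all foos related to bar 34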
As to using a database, you claim that you've had performance problems. I'd like to suggest that you may not have chosen an optimal representation, but without more details on what your access patterns look like, and what database schema you used, it's awfully hard for anybody to contribute useful help. You might consider editing your post to provide more information.
NoSQL systems like Redis don't provide M:M tables.
In the end, a Python dict keyed by pairs holding the values, plus a dict of the set of pairings for each term, was the best I could come up with:
class MM:
    def __init__(self):
        self._a = {}    # set of Bs for each A
        self._b = {}    # set of As for each B
        self._ab = {}   # value (metadata) keyed by the (A, B) pair
I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.
The problem:
I have expensive (hours to days to calculate) and big (GBs) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (built with the right specification values), or construct new data as needed. Although it shouldn't matter, I typically start with 'value-added', somewhat heterogeneous molecular biology info, for example genomes with genes and proteins annotated by other processes by other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: restricting with different initial data (e.g. using different organisms), by using different parameter values in the same inferences, or by using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it if appropriate, and for general scientific integrity.
My efforts in general:
I design my python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (eg a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing if their shape is a subset of the global parameter state.
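Conceptually, the reuse test is just a subset check, e.g. (toy parameter names and values):

global_state = {"organism": "E. coli", "model": "hmm", "threshold": 0.05, "seed": 42}
shape = {"organism": "E. coli", "threshold": 0.05}   # def_specs plus values defining one data att

reusable = all(global_state.get(k) == v for k, v in shape.items())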
Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.
So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes' attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite, I find that I need to strictly control attribute access to my builder classes to prevent arbitrary info from influencing the data atts. Fortunately, Python is making this easy with descriptors.
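A stripped-down sketch of the descriptor idea (class and attribute names here are invented, not my actual code):

class TrackedAtt:
    # descriptor that pushes an attribute's def_specs up to whoever reads it
    def __init__(self, name, def_specs):
        self.private_name = "_" + name
        self.def_specs = set(def_specs)

    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        obj.shape.update(self.def_specs)        # the caller's shape grows with this dependency
        return getattr(obj, self.private_name)

class GenomeBuilder:
    alignment = TrackedAtt("alignment", {"organism", "aligner", "evalue_cutoff"})
    def __init__(self):
        self.shape = set()                      # accumulated def_specs of everything accessed

b = GenomeBuilder()
b.alignment = "result of an expensive step"
_ = b.alignment
print(b.shape)                                  # {'organism', 'aligner', 'evalue_cutoff'} (any order)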
I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. data whose shape is a subset of the current parameter state) already exists. In my rewrite I am moving from MySQL via the great SQLAlchemy to an object db (ZODB or CouchDB?), because the table for each class has to be altered when additional def_specs are discovered, which is a pain, and because some of the def_specs are Python lists or dicts, which are a pain to translate to SQL.
I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out of the box solution but a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints at how to go about looking or better describing the problem. To me it seems a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.
I don't have specific python-related suggestions for you, but here are a few thoughts:
You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.
You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/
ZODB was not designed to handle massive data; it is mostly for web-based applications, and in any case it is a flat-file based database.
I recommend you try PyTables, a Python library for handling HDF5 files, a format used in astronomy and physics to store results from big calculations and simulations. It can be used as a hierarchical database and also has an efficient way to pickle Python objects. By the way, the author of PyTables explained that ZODB was too slow for what he needed to do, and I can confirm that. If you are interested in HDF5, there is also another library, h5py.
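For example, with h5py the hierarchical layout plus per-dataset parameters could look roughly like this (file layout and attribute names invented for illustration):

import numpy as np
import h5py

with h5py.File("results.h5", "w") as f:
    grp = f.create_group("ecoli/alignment")
    dset = grp.create_dataset("scores", data=np.random.rand(1000))
    # store the defining parameters next to the data they describe
    dset.attrs["organism"] = "E. coli"
    dset.attrs["evalue_cutoff"] = 1e-5

with h5py.File("results.h5", "r") as f:
    d = f["ecoli/alignment/scores"]
    print(dict(d.attrs), d[:10])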
As a tool for managing the versioning of the different calculations you have, you can have a try at sumatra, which is something like an extension to git/trac but designed for simulations.
You should ask this question on biostar, you will find better answers there.
Are there techniques for comparing the same data stored in different schemas? The situation is something like this: I have a db with schema A, and it stores data for a feature in, say, 5 tables. Schema A -> Schema B conversion is done during an upgrade process. During the upgrade, some transformation logic is applied and the data is stored in 7 tables in schema B.
What I'm after is some way to verify data integrity; basically, I would have to compare the different schemas while factoring in the transformation logic. Short of writing some custom T-SQL sprocs to compare the data, is there an alternate method? I'm leaning towards Python to automate this; are there any Python modules that would help me out?
To better illustrate my question, the following diagram is a rough picture of one of the many data sets I would need to compare. Properties 1, 2, 3, and 4 are migrated from the source schema to the destination, but they are spread across different tables.
Table1Src                         Table1Dest
 |                                 |
 --ID(Primary Key)                 --ID(Primary Key)
 --Property1                       --Property1
 --Property2                       --Property5
 --Property3                       --Property6

Table2Src                         Table2Dest
 |                                 |
 --ID(Foreign Key->Table1Src)      --ID(Foreign Key->Table1Dest)
 --Property4                       --Property2
                                   --Property3

                                  Table3Dest
                                   |
                                   --ID(Foreign Key->Table1Dest)
                                   --Property4
                                   --Property7
Make "views" on both the schemas that translate to the same buisness representation of data. Export these views to flat files and then you can use any plain vanilla file diff utility to compare and point out differences.
Basically, you should create object representations for both schema versions, and then compare objects. This is best done if they all fit into memory simultaneously; if not, you need to iterate over all objects in one representation, fetch the corresponding object in the other representation, compare them, and then do the same vice versa.
The difficult part may be to obtain object representations; you can see whether SQLAlchemy can be used conveniently for your tables. SQLAlchemy is, in principle, capable of mapping existing schema definitions onto objects.
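A rough sketch of that idea with SQLAlchemy reflection, using the table names from the diagram above (connection strings and the join/column mapping stand in for the real transformation logic, and only Properties 1-3 are shown):

from sqlalchemy import create_engine, MetaData, select

src = create_engine("mssql+pyodbc://...")    # placeholder connection strings
dst = create_engine("mssql+pyodbc://...")

src_md, dst_md = MetaData(), MetaData()
src_md.reflect(bind=src)                     # map the existing schema definitions onto Table objects
dst_md.reflect(bind=dst)

def normalised_src(conn):
    t = src_md.tables["Table1Src"]
    for row in conn.execute(select(t.c.ID, t.c.Property1, t.c.Property2, t.c.Property3)):
        yield row.ID, (row.Property1, row.Property2, row.Property3)

def normalised_dst(conn):
    t1, t2 = dst_md.tables["Table1Dest"], dst_md.tables["Table2Dest"]
    q = select(t1.c.ID, t1.c.Property1, t2.c.Property2, t2.c.Property3).join(t2, t1.c.ID == t2.c.ID)
    for row in conn.execute(q):
        yield row.ID, (row.Property1, row.Property2, row.Property3)

with src.connect() as cs, dst.connect() as cd:
    source = dict(normalised_src(cs))
    for key, values in normalised_dst(cd):
        if source.get(key) != values:
            print("mismatch for ID", key, source.get(key), values)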
I've used SQLAlchemy successfully for migration between one schema and another - that's a similar process to comparison (as indicated by Martin v. Löwis), especially if you use an .equals(other) method.