I have a very large dataset - millions of records - that I want to store in Python. I might be running on 32-bit machines so I want to keep the dataset down in the hundreds-of-MB range and not ballooning much larger than that.
These records represent a M:M relationship: two IDs (foo and bar) and some simple metadata like timestamps (baz).
Some foo have nearly all bar in them, and some bar have nearly all foo. But there are many bar that have almost no foo and many foo that have almost no bar.
If this were a relational database, a M:M relationship would be modelled as a table with a compound key. You can of course search on either component key individually comfortably.
If you store the rows in a hashtable, however, you need to maintain three hashtables as the compound key is hashed and you can't search on the component keys with it.
If you have some kind of sorted index, you can abuse lexical sorting to iterate the first key in the compound key, and need a second index for the other key; but it's less obvious to me what actual data structure in the standard Python collections this equates to.
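For concreteness, the sorted-index idea might look roughly like this with the standard bisect module. This is only a sketch that assumes integer IDs; the names pairs and bars_for are made up for illustration, and a second list sorted as (bar, foo) would be needed for the other direction.

```python
import bisect

# Sorted list of (foo, bar) pairs; tuples sort lexicographically,
# so all rows for one foo sit next to each other.
pairs = sorted([(1, 10), (1, 20), (2, 10), (3, 30)])

def bars_for(foo):
    """Return every bar paired with foo, located by two binary searches."""
    # (foo,) sorts before every (foo, bar), and (foo + 1,) sorts after them.
    lo = bisect.bisect_left(pairs, (foo,))
    hi = bisect.bisect_left(pairs, (foo + 1,))
    return [bar for _, bar in pairs[lo:hi]]

print(bars_for(1))
```

Lookups are O(log n), but inserting into a plain sorted list is O(n), so this only suits data that is loaded once and then mostly read.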
I am considering a dict of foo where each value is automatically moved from tuple (a single row) to list (of row tuples) to dict depending on some thresholds, and another dict of bar where each is a single foo, or a list of foo.
Are there more efficient - speedwise and spacewise - ways of doing this? Any kind of numpy for indices or something?
(I want to store them in Python because I am having performance problems with databases - both SQL and NoSQL varieties. You end up being IPC memcpy and serialisation-bound. That is another story; however the key point is that I want to move the data into the application rather than get recommendations to move it out of the application ;) )
Have you considered using a NoSQL database that runs in memory, such as Redis? Redis supports a decent number of familiar data structures.
I realize you don't want to move outside of the application, but not reinventing the wheel can save time and quite frankly it may be more efficient.
If you need to query the data in a flexible way, and maintain various relationships, I would suggest looking further into using a database, of which there are many options. How about using an in-memory database, like sqlite (using ":memory:" as the file)? You're not really moving the data "outside" of your program, and you will have much more flexibility than with multi-layered dicts.
Redis is also an interesting alternative, as it has other data-structures to play with, rather than using a relational model with SQL.
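As a rough illustration of the in-memory sqlite suggestion, here is a minimal sketch using the standard sqlite3 module. The table name rel, the index name rel_bar, and the sample rows are assumptions made up for the example; the columns follow the foo/bar/baz naming from the question.

```python
import sqlite3

# ":memory:" keeps the whole database inside this process.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rel (foo INTEGER, bar INTEGER, baz INTEGER, "
    "PRIMARY KEY (foo, bar))"
)
# The compound primary key already serves lookups by foo;
# a second index serves lookups by bar.
conn.execute("CREATE INDEX rel_bar ON rel (bar)")

conn.executemany(
    "INSERT INTO rel VALUES (?, ?, ?)",
    [(1, 10, 1000), (1, 20, 1001), (2, 10, 1002)],
)

bars_of_1 = [row[0] for row in conn.execute(
    "SELECT bar FROM rel WHERE foo = ? ORDER BY bar", (1,))]
foos_of_10 = [row[0] for row in conn.execute(
    "SELECT foo FROM rel WHERE bar = ? ORDER BY foo", (10,))]
print(bars_of_1, foos_of_10)
```

Whether this fits the memory budget in the question would have to be measured on real data.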
What you describe sounds like a sparse matrix, where the foos are along one axis and the bars along the other one. Each non-empty cell represents a relationship between one foo and one bar, and contains the "simple metadata" you describe.
There are efficient sparse matrix packages for Python (scipy.sparse, PySparse) you should look at. I found these two just by Googling "python sparse matrix".
As to using a database, you claim that you've had performance problems. I'd like to suggest that you may not have chosen an optimal representation, but without more details on what your access patterns look like, and what database schema you used, it's awfully hard for anybody to contribute useful help. You might consider editing your post to provide more information.
NoSQL systems like redis don't provide MM tables.
In the end, a python dict keyed by pairs holding the values, and a dict of the set of pairings for each term was the best I could come up with.
class MM:
    def __init__(self):
        self._a = {}   # Bs for each A
        self._b = {}   # As for each B
        self._ab = {}  # value for each (A, B) pair
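The class above is cut off after the constructor. One way the rest might look is sketched below; the method names add, get, bs_for and as_for are hypothetical and not from the original post.

```python
class MM:
    """Many-to-many store: a dict keyed by pairs plus a set index per side."""

    def __init__(self):
        self._a = {}   # Bs for each A
        self._b = {}   # As for each B
        self._ab = {}  # value for each (A, B) pair

    def add(self, a, b, value):
        # Keep both per-side indexes in step with the pair dict.
        self._a.setdefault(a, set()).add(b)
        self._b.setdefault(b, set()).add(a)
        self._ab[(a, b)] = value

    def get(self, a, b):
        return self._ab[(a, b)]

    def bs_for(self, a):
        return self._a.get(a, set())

    def as_for(self, b):
        return self._b.get(b, set())


mm = MM()
mm.add("foo1", "bar1", 1000)
mm.add("foo1", "bar2", 1001)
mm.add("foo2", "bar1", 1002)
print(sorted(mm.bs_for("foo1")), sorted(mm.as_for("bar1")))
```

Each pair is stored three times (once per index and once in the pair dict), so the memory cost per record is noticeably higher than the raw data.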
I'm using neo4j to contain temporary datasets from different source systems. My data consists of a few parent objects which each contain ~4-7 layers of child objects of varying types. Total object count per dataset varies between 2,000 and 1.5 million. I'm using the python py2neo library, which has had good performance both during the data creation phase, and for passing through cypher queries for reporting.
I'd like to isolate datasets from unrelated systems for querying and purging purposes, but I'm worried about performance. I have a few ideas, but it's not clear to me which are the most likely to be viable.
The easiest to implement (for my code) would be a top-level "project" object. That project object would then have a few direct children (via a relationship) and many indirect children. I'm worried that when I want to filter by project, I'll have to use a relationship match with a wildcard distance, e.g. MATCH (pr:project)<-[:IN_PROJECT*7]-(c:child_object), which seems to be very expensive query-wise.
I could also make a direct relationship between the project object and every other object in the project. MATCH (pr:project)<-[:IN_PROJECT]-(c:child_object) This should be easier for writing queries, but I don't know what might happen when I have a single object with potentially millions of relationships.
Finally, I could set a project-id property on every single object in the dataset. MATCH (c:child_object {project-id:"A1B2C3"}) It seems to be a wasteful solution, but I think it might be better performance wise in the graph DB model.
Apologies if I mangled the sample Cypher queries / neo4j terminology. I set aside this project for 6 weeks, and I'm a little rusty.
If you have a finite set of datasets, you should consider using a dedicated label to specify the data source. In Neo4j's property graph data model, a node is allowed to have multiple labels.
MATCH (c:child_object:DataSourceA)
Labels are always indexed, so performance should be better than that of your proposals 1-3. I also think this is a more elegant solution -- however, it will get tricky if you do not know the number of data sets up front. In the latter case, you might use something like
MATCH (c:child_object)
WHERE 'DataSourceA' IN labels(c)
But this is more like a "full table scan", so performance-wise, you'll be better off using your approach 3 and building an index on project-id.
I'm going to store on the order of 10,000 securities X 300 date pairs X 2 Types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or Faster? Assume that I'll be generally looking up knowing a list of security IDs and the 2 dates plus type. If there is a big efficiency gain by tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity or will you do other things, e.g:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, plain dict is probably not the best idea; you may want relational database, either the built-in sqlite3 module or a third party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
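To make the key-construction points concrete, here is a small sketch. The security IDs, dates and values are invented sample data.

```python
# Joined string keys can collide when a component contains the separator.
string_key_1 = "_".join(["AB_C", "D"])
string_key_2 = "_".join(["AB", "C_D"])
strings_collide = string_key_1 == string_key_2

# The same components as tuples stay distinct and need no parsing.
tuple_key_1 = ("AB_C", "D")
tuple_key_2 = ("AB", "C_D")
tuples_collide = tuple_key_1 == tuple_key_2

# Nested dicts: outer key is the security ID, inner key is the rest.
# setdefault creates the inner dict the first time an ID is seen.
nested = {}
nested.setdefault("SEC1", {})[("2020-01-01", "2020-02-01", "A")] = 1.5
nested.setdefault("SEC1", {})[("2020-01-01", "2020-02-01", "B")] = 1.6
nested.setdefault("SEC2", {})[("2020-01-01", "2020-02-01", "A")] = 2.5

# Iterating everything for one security touches only that inner dict.
sec1_entries = nested["SEC1"]
print(strings_collide, tuples_collide, len(sec1_entries))
```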
If you want to benchmark, I'd suggest populating a dict with sample data, then use the timeit module (or ipython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't lookup the same key each time (using itertools.cycle to repeat a few hundred keys would work better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.
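A minimal sketch of such a benchmark follows. The key layout, the sample dates and the 200-key population are assumptions; substitute something closer to the real data before trusting the numbers.

```python
import itertools
import timeit

sec_ids = range(200)
tuple_cache = {(s, "2020-01-01", "2020-02-01", "A"): s for s in sec_ids}
string_cache = {f"{s}_2020-01-01_2020-02-01_A": s for s in sec_ids}

# Cycle through many IDs so the same key is not looked up every time.
ids = itertools.cycle(sec_ids)

def lookup_tuple():
    s = next(ids)
    # The key is rebuilt on each call, as it would be in real use.
    return tuple_cache[(s, "2020-01-01", "2020-02-01", "A")]

def lookup_string():
    s = next(ids)
    return string_cache[f"{s}_2020-01-01_2020-02-01_A"]

tuple_time = timeit.timeit(lookup_tuple, number=20_000)
string_time = timeit.timeit(lookup_string, number=20_000)
print(f"tuple keys:  {tuple_time:.4f}s")
print(f"string keys: {string_time:.4f}s")
```

Both timings include the cost of building the key, which is usually what matters if keys are assembled at lookup time.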
I have approximately 12GB of tab-separated data in a very simple format:
mainIdentifier, altIdentifierType, altIdentifierText
MainIdentifier is not a unique row identifier - only the whole combination of the 3 columns is unique. My main use-case is looking up corresponding entries going from mainIdentifier or going from two different types of alternative identifiers.
From what I can glean, I would need to construct a lookup index for each entry direction to make it fast. However, given the simplicity of the task, I do not really need the index pointing to the record - the index itself is the answer.
I've tried sqlite3 in python but as expected, the result is not as fast as I would've liked. I am now considering just storing the two lists and moving around in a binary-search fashion; however, I do not want to re-invent the wheel - is there any existing solution for this?
Also, I intend to run this as a REST-enabled service, so it's not feasible for the lookup table to be stored in memory in any fashion.
I'm working on a problem that involves multiple database instances, each with different table structures. The problem is, between these tables, there are lots and lots of duplicates, and i need a way to efficiently find them, report them, and possibly eliminate them.
Eg. I have two tables, the first table, CustomerData with the fields:
_countId, customerFID, customerName, customerAddress, _someRandomFlags
and I have another table, CustomerData2 (built later) with the fields:
_countId, customerFID, customerFirstName, customerLocation, _someOtherRandomFlags.
Between the two tables above, I know for a fact that customerName and customerFirstName were used to store the same data, and similarly customerLocation and customerAddress were also used to store the same data.
Let's say some of the sales team have been using CustomerData, and others have been using CustomerData2. I'd like to have a scalable way of detecting the redundancies between the tables and reporting them. It can be assumed with some amount of surety that customerFID in both tables is consistent, and refers to the same customer.
One solution I could think of was to create a customerData class in python, map the records in the two tables to this class, compute a hash/signature for the required fields of each object (customerName, customerLocation/Address), and store them in a signature table, which has the columns:
sourceTableName, entityType (customerData), identifyingKey (customerFID), signature
and then for each entityType, I look for duplicate signatures for each customerFID
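Roughly, I imagine the signature step looking like the sketch below. The sample rows are invented, and the normalisation (stripping whitespace and lower-casing) is an assumption about what should count as "the same data".

```python
import hashlib
from collections import defaultdict

def signature(name, location):
    """Hash the normalised fields that identify a record."""
    # A separator that should not occur in the data keeps the fields apart.
    normalized = "\x1f".join(part.strip().lower() for part in (name, location))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# (sourceTableName, customerFID, name, location/address), sample data only.
rows = [
    ("CustomerData", 1, "Alice", "12 Main St"),
    ("CustomerData2", 1, "alice ", "12 main st"),
    ("CustomerData2", 2, "Bob", "9 Elm Rd"),
]

# Group source tables by (customerFID, signature).
seen = defaultdict(list)
for table, fid, name, location in rows:
    seen[(fid, signature(name, location))].append(table)

# Any group with more than one entry is a duplicate across tables.
duplicates = {key: tables for key, tables in seen.items() if len(tables) > 1}
for (fid, sig), tables in duplicates.items():
    print(fid, tables)
```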
In reality, I'm working with huge sets of biomedical data, with lots and lots of columns. They were created by different people (and sadly with no standard nomenclature or structure) and have duplicate data stored in them.
EDIT:
For simplicity sake, I can move all the database instances to a single server instance.
If I didn't care about performance, I'd use a high-level practical approach. Use Django (or SQLAlchemy or...) to build your desired models (your tables) and fetch the data to compare. Then use an algorithm for efficiently identifying duplicates (from lists or dicts, depending on how you hold your data). To boost performance you may try to "enhance" your app with the multiprocessing module or consider a map-reduce solution.
I'm currently writing a python application that will take a directory of text files and parse them into custom python objects based on the attributes specified in the text file. As part of my application, I compare the currently loaded object data set to a previous dataset (same format) and scan it for possible duplicates, conflicts, updates, etc. However, since there can be ~10,000+ objects at a time, I'm not really sure how to approach this.
I'm currently storing the previous data set in a DB as it's being used by another web app. As of now, my python application loads the 'proposed' dataset into memory (creating the rule objects), and then I store those objects in a dictionary (problem #1). Then when it comes time to compare, I use a combination of SQL queries and failed inserts to determine new/existing and existing but updated entries (problem #2).
This is hackish and terrible at best. I'm looking for some advice on restructuring the application and handling the object storage/comparisons.
You can fake what Git does and load the entire set as basically a single file and parse from there. The biggest issue is that dictionaries are not ordered so your comparisons will not always be 1:1. A list of tuples will give you 1:1 comparisons. If a lot has changed this will be difficult.
Here is a basic flow for how you can do this.
Start with both tuple lists at index 0.
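Compare a hash of each tuple: hashlib.sha1(str(tuple1).encode()).hexdigest() == hashlib.sha1(str(tuple2).encode()).hexdigest() (compare the digests, not the hash objects themselves, and note that sha1 needs bytes rather than str)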
If they are equal, record the matching indexes and add 1 to each index and compare again
If they are unequal, search each side for a match and record the matching indexes
If there are no matches, you can assume there is an insert/update/delete happening and come back to it later
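The steps above can be sketched as follows. This is one possible reading of the flow, not a definitive implementation: the function names digest and match_rows are made up, and the rule of resynchronising on whichever side has the nearer match is an assumption.

```python
import hashlib

def digest(row):
    """Hex digest of a tuple's string form."""
    return hashlib.sha1(str(row).encode("utf-8")).hexdigest()

def match_rows(old, new):
    """Walk both lists, recording matched index pairs and leftovers."""
    old_h = [digest(r) for r in old]
    new_h = [digest(r) for r in new]
    matches, unmatched_old, unmatched_new = [], [], []
    i = j = 0
    while i < len(old_h) and j < len(new_h):
        if old_h[i] == new_h[j]:
            matches.append((i, j))
            i += 1
            j += 1
            continue
        # Unequal: search ahead on each side for the other side's row.
        try:
            k = new_h.index(old_h[i], j + 1)
        except ValueError:
            k = None
        try:
            m = old_h.index(new_h[j], i + 1)
        except ValueError:
            m = None
        if k is not None and (m is None or k - j <= m - i):
            unmatched_new.extend(range(j, k))  # rows skipped in new
            j = k
        elif m is not None:
            unmatched_old.extend(range(i, m))  # rows skipped in old
            i = m
        else:
            # No match on either side: set both aside for later.
            unmatched_old.append(i)
            unmatched_new.append(j)
            i += 1
            j += 1
    unmatched_old.extend(range(i, len(old_h)))
    unmatched_new.extend(range(j, len(new_h)))
    return matches, unmatched_old, unmatched_new

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("a", 1), ("x", 9), ("b", 2), ("c", 4)]
matches, unmatched_old, unmatched_new = match_rows(old, new)
print(matches, unmatched_old, unmatched_new)
```

The forward searches make this quadratic in the worst case, which is tolerable for around 10,000 rows but worth knowing about.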
You can map your matching items as reference points to do further investigation into the ones that did not match. This technique can be applied at each level you drill down. You will end up with a map of what is different down to the individual values.
The nice thing is each of the slices that you create can be compared in parallel since they will not correspond to each other... unless you are moving things from one file to another.
Then again, it may be easier to use a diff library to compare the two data sets. Might as well not reinvent the wheel; even if it might be a really shiny wheel.
Check out http://docs.python.org/library/difflib.html
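For example, difflib.SequenceMatcher works on any sequences of hashable items, so it can compare two lists of tuples directly. A minimal sketch with invented sample rows:

```python
import difflib

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("a", 1), ("x", 9), ("b", 2), ("c", 4)]

# autojunk=False turns off the heuristic that treats very common
# items as junk in long sequences.
matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)

# Each opcode is (tag, i1, i2, j1, j2): old[i1:i2] relates to new[j1:j2].
changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
for tag, i1, i2, j1, j2 in changes:
    print(tag, old[i1:i2], new[j1:j2])
```

The tags are "insert", "delete" and "replace", which map fairly directly onto new, removed and updated entries.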