I have a MongoDB collection with close to 100,000 records, and each record has around 5,000 keys. A lot of this data is empty. How can I find (and maybe visually represent) this emptiness in the data?
In other words, I would like to analyze the types of values stored under each key. What would be the right approach for this?
You could take a look at MongoDB aggregation strategies. Check out $group.
From how you described your problem, I could see using an accumulator over the number of keys in each record.
As an example, with the appropriate thresholds and transformations, such an operation could return the records grouped by their number of keys (or an array containing only the number of keys for each record).
Such an approach would also allow you to perform some data analysis over the keys used in each record.
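For instance, here is a minimal PyMongo sketch (the database and collection names are hypothetical, and a MongoDB version supporting $objectToArray is assumed) that counts the keys per document and groups on that count:

from pymongo import MongoClient

coll = MongoClient()["mydb"]["records"]  # hypothetical database/collection names

pipeline = [
    # Turn each document into an array of {k, v} pairs and count them
    {"$project": {"numKeys": {"$size": {"$objectToArray": "$$ROOT"}}}},
    # Group documents by their key count to see the distribution
    {"$group": {"_id": "$numKeys", "docs": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
]

for bucket in coll.aggregate(pipeline):
    print(bucket)  # one line per distinct key count

Plotting that distribution (key count versus number of documents) gives a quick visual picture of how sparsely the records are populated.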
I'm trying to make an application using the Python driver for Cassandra in which I want to store car objects in the database.
I've been reading a lot about Cassandra partition keys, clustering keys, indexes etc. but I can't figure out what would be the best schema for my situation.
Situation: the 'Car' objects contain 20 fields, of which I want to be able to query on car_uuid (unique per car), factory_uuid (which factory made the car) and customer_uuid (who bought the car; this value is never null).
Based on what I read about partition keys, a high-cardinality partition key is preferred, so it would make sense to choose car_uuid, since each car has a unique car_uuid. However, if I then want to query for all cars with a certain factory_uuid, I get an error saying I need to enable ALLOW FILTERING, because car_uuid is the first partition key and must always be specified when querying on a secondary partition key (if I have understood it correctly).
Most of my queries will either be getting all cars made by a certain factory (factory_uuid) or retrieving information about a single car (using car_uuid).
So I'm thinking of setting factory_uuid as the first partition key, since, for my particular application, I know I'll always have the factory_uuid when querying for a specific car_uuid. However, the main partition key then no longer has high cardinality, as there are only so many factories, so in theory my data will no longer be properly distributed. This raises some questions:
To what extent will I notice the impact of using a low(er)-cardinality primary partition key?
In some posts people suggested using lookup tables. Would this be a solution for my situation, and would it be better to use car_uuid in the main table and factory_uuid in the lookup table as the primary partition keys? (My intuition says that one of these tables will then still have the low-cardinality partition key problem; is this true? A rough sketch of this two-table layout is included below.)
Just to be sure: as far as I understand, indexes are not desirable for factory_uuid, since, when retrieving all cars for a certain factory_uuid, Cassandra would go over all nodes to check for matching results, which would have major performance implications for such a frequent query.
Any suggestions or input is welcome!
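Not an answer as such, but to make the two-table (lookup table) idea from the question concrete, here is a minimal sketch using the DataStax Python driver; the keyspace and table names are hypothetical and the columns are trimmed to the three query fields:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("car_keyspace")  # hypothetical keyspace

# Main table: one partition per factory, cars clustered by car_uuid.
# "All cars of a factory" is then a single-partition read, and a single
# car is fetched with (factory_uuid, car_uuid).
session.execute("""
    CREATE TABLE IF NOT EXISTS cars_by_factory (
        factory_uuid  uuid,
        car_uuid      uuid,
        customer_uuid uuid,
        PRIMARY KEY ((factory_uuid), car_uuid)
    )
""")

# Lookup table: resolves a bare car_uuid to its factory for the case
# where the factory is not known up front.
session.execute("""
    CREATE TABLE IF NOT EXISTS factory_by_car (
        car_uuid     uuid PRIMARY KEY,
        factory_uuid uuid
    )
""")

Whether the low cardinality of factory_uuid is a problem then depends mainly on how many factories there are and how large each factory's partition is allowed to grow.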
I'm looking to maintain a (Postgres) SQL database collecting data from third parties. As most of the data is static, while I get a full dump every day, I want to store only the data that is new. I.e., every day I get 100K records with, say, 300 columns, and 95K rows will be the same as the day before. In order to do this efficiently, I was thinking of inserting a hash of each record (coming from a Pandas dataframe or a Python dict) alongside the data. Some other data is stored as well, like when it was loaded into the database. Then, prior to inserting data into the database, I could hash the incoming data and easily verify that the record is not yet in the database, instead of having to check all 300 columns.
My question: which hash function should I pick, given that I'm in Python and prefer a very fast and solid solution that requires little coding on my side while handling all kinds of data (ints, floats, strings, datetimes, etc.)?
1. Python's hash is unsuited, as it changes with every session (as the approach in "Create hash value for each row of data with selected columns in dataframe in python pandas" does).
2. md5 or sha1 are cryptographic hashes. I don't need the crypto part, as this is not for security. They might be a bit slow as well, and I had some trouble with strings, as these require encoding.
3. Is a solution like CRC good enough?
For options 2 and 3, if you recommend them, how can I implement this for arbitrary dicts and pandas rows? I have had little success keeping this simple; for instance, for strings I needed to explicitly define the encoding, and the order of the fields in a record should not change the hash.
Edit: I just realized that it might be tricky to depend on Python for this; if I change programming languages I might end up with different hashes. Tying it to the database seems the more sensible choice.
Have you tried pandas.util.hash_pandas_object?
Not sure how efficient this is, but maybe you could use it like this:
row_hashes = pd.util.hash_pandas_object(df, index=False)  # one uint64 hash per row
This will at least get you a pandas Series of hashes for each row in the df.
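For the arbitrary-dict case from the question (options 2 and 3), here is a minimal sketch of an order-independent, explicitly encoded hash; md5 is used purely as a fast fingerprint, not for security:

import hashlib
import json

def record_hash(record: dict) -> str:
    # sort_keys makes the hash independent of field order;
    # default=str serializes datetimes and other non-JSON types.
    payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

# The same record with its keys in a different order hashes identically.
assert record_hash({"id": 1, "price": 9.99}) == record_hash({"price": 9.99, "id": 1})

A pandas row can be fed into the same function via row.to_dict(), although the hash_pandas_object approach above is likely faster for whole dataframes.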
I am currently facing the problem of having to frequently access a large but simple data set on a smallish (700 MHz) device in real time. The data set contains around 400,000 mappings from abbreviations to abbreviated words, e.g. "frgm" to "fragment". Reading will happen frequently while the device is used and should not take more than 15-20 ms.
My first attempt was to utilize SQLite in order to create a simple database which merely contains a single table in which two strings constitute a record:
CREATE TABLE WordMappings (key text, word text)
This table is created once and although alterations are possible, only read-access is time critical.
Following this guide, my SELECT statement looks as follows:
def databaseQuery(self, query_string):
    # Parameterized query: avoids quoting problems and SQL injection
    self.cursor.execute("SELECT word FROM WordMappings WHERE key=? LIMIT 1;", (query_string,))
    result = self.cursor.fetchone()
    return result[0]
However, using this code on a test database with 20,000 abbreviations, I am unable to fetch data faster than ~60 ms, which is far too slow.
Any suggestions on how to improve performance using SQLite, or would another approach yield more promising results?
You can speed up lookups on the key column by creating an index for it:
CREATE INDEX key_index ON WordMappings(key);
To check whether a query uses an index or scans the entire table, use EXPLAIN QUERY PLAN.
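For example, a short sqlite3 sketch (the database file name is hypothetical):

import sqlite3

cur = sqlite3.connect("words.db").cursor()
cur.execute("CREATE INDEX IF NOT EXISTS key_index ON WordMappings(key);")

# With the index, the reported plan contains "SEARCH ... USING INDEX key_index";
# without it, you would see a full scan of WordMappings instead.
cur.execute("EXPLAIN QUERY PLAN SELECT word FROM WordMappings WHERE key = ? LIMIT 1;", ("frgm",))
print(cur.fetchall())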
A long time ago I tried to use SQLite for sequential data and it was not fast enough for my needs. At the time, I was comparing it against an existing in-house binary format, which I ended up using.
I have not personally used it, but a friend uses PyTables for large time-series data; it may be worth looking into.
It turns out that defining a primary key speeds up individual queries by an order of magnitude.
Individual queries on a test table with 400,000 randomly created entries (10/20 characters long) took no longer than 5ms which satisfies the requirements.
The table is now created as follows:
CREATE TABLE WordMappings (key text PRIMARY KEY, word text)
A primary key is used because
It is implicitly unique, which is a property of the abbreviations stored
It cannot be NULL, so the key of every row is guaranteed to be present; in our case, if a key were NULL, the data would be corrupt.
Other users have suggested using an index; however, indexes are not necessarily unique, and according to the accepted answer to this question they unnecessarily slow down update/insert/delete performance. Nevertheless, using an index may well increase performance too; this has, however, not been tested by the original author.
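A quick way to reproduce the effect with synthetic data (an in-memory database and generated keys, purely for illustration):

import sqlite3
import time

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE WordMappings (key TEXT PRIMARY KEY, word TEXT)")
cur.executemany("INSERT INTO WordMappings VALUES (?, ?)",
                ((f"key{i}", f"word{i}") for i in range(400_000)))

start = time.perf_counter()
cur.execute("SELECT word FROM WordMappings WHERE key = ? LIMIT 1;", ("key123456",))
print(cur.fetchone(), f"({(time.perf_counter() - start) * 1000:.2f} ms)")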
I'm not sure if this has been answered before; I didn't find anything in a quick search.
My table is built in a random order, but thereafter it is modified very rarely. I do frequent selects from the table and in each select I need to order the query by the same column. Now is there a way to sort a table permanently by a column so that it does not need to be done again for each select?
You can add an index sorted by the column you want. The data will be presorted according to that index.
You can have one place where you define the ordering, and re-use that for every query:
def base_query(session, what_for):
    return session.query(what_for).order_by(what_for.rank_or_whatever)
Expand that as needed, then for all but very complex queries you can use that like so:
some_query = base_query(session(), Employee).filter(Employee.feet > 3)
The resulting query will be ordered by Employee.rank_or_whatever. If you are always querying for the same thing, you won't have to pass it as an argument, of course.
EDIT: If you could somehow define a "permanent" order on your table that the engine observes without being given an ORDER BY, that would be an implementation detail of whichever RDBMS you use, and merely a convenience. Internally it makes no sense for a DBMS to be coerced into how it stores the data, since retrieving the data in a specific order is easily and efficiently accomplished by using an INDEX; forcing a specific storage order would probably decrease overall performance.
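Putting the index suggestion and the query helper together, a self-contained sketch (the model and column names are hypothetical; SQLAlchemy 1.4+ is assumed):

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Employee(Base):  # hypothetical model matching the snippet above
    __tablename__ = "employees"
    id = Column(Integer, primary_key=True)
    feet = Column(Integer)
    rank_or_whatever = Column(Integer, index=True)  # index on the ORDER BY column

engine = create_engine("sqlite://")  # in-memory database for the example
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def base_query(session, what_for):
    return session.query(what_for).order_by(what_for.rank_or_whatever)

some_query = base_query(Session(), Employee).filter(Employee.feet > 3)
print(some_query.all())  # ordered by rank_or_whatever, empty in this example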
I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into with all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments, they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]], you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use a single big table to keep all 500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will compress away, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
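A minimal PyTables sketch of that idea (the column names and sizes are only illustrative; Blosc is enabled through a Filters object):

import numpy as np
import tables as tb

class Result(tb.IsDescription):
    stringArg1 = tb.StringCol(16)
    stringArg2 = tb.StringCol(16)
    stringArg3 = tb.StringCol(16)
    stringArg4 = tb.StringCol(16)
    intArg1 = tb.Int32Col()
    results = tb.Float64Col(shape=(12,))

filters = tb.Filters(complevel=5, complib="blosc")  # on-the-fly compression

with tb.open_file("simulations.h5", mode="w") as h5:
    table = h5.create_table("/", "results", Result, filters=filters)
    row = table.row
    row["stringArg1"], row["stringArg2"] = "caseA", "caseB"
    row["stringArg3"], row["stringArg4"] = "caseC", "caseD"
    row["intArg1"] = 42
    row["results"] = np.zeros(12)
    row.append()
    table.flush()
    # In-kernel queries avoid loading the whole table into memory:
    print(table.read_where("intArg1 == 42"))

Repeated argument strings compress very well, which is what keeps the storage overhead of the flat layout small.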
Is there a reason the basic 6-table approach doesn't apply?
i.e. Tables 1-5 would be single-column tables defining the valid values for each of the fields, and then the final table would be a 5-column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string arguments as you describe, the 6th table could just consist of 3 columns (string1, string2, int1), and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
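A rough sqlite3 sketch of that layout (table and column names are made up; only one result column is shown):

import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Tables 1-5 hold the valid values for each argument; the 6th table holds
# only the combinations that actually exist, plus the float results
# (result1 shown here; result2..result12 would follow the same pattern).
cur.executescript("""
    CREATE TABLE string1_vals (val TEXT PRIMARY KEY);
    CREATE TABLE string2_vals (val TEXT PRIMARY KEY);
    CREATE TABLE string3_vals (val TEXT PRIMARY KEY);
    CREATE TABLE string4_vals (val TEXT PRIMARY KEY);
    CREATE TABLE int1_vals    (val INTEGER PRIMARY KEY);
    CREATE TABLE results (
        string1 TEXT, string2 TEXT, string3 TEXT, string4 TEXT, int1 INTEGER,
        result1 REAL,
        PRIMARY KEY (string1, string2, string3, string4, int1)
    );
""")

# The 3-column variant: drop string3/string4 from the results table and
# expand them on demand with a Cartesian (cross) join against their value
# tables, e.g.
#   SELECT r.string1, r.string2, s3.val, s4.val, r.int1
#   FROM results r CROSS JOIN string3_vals s3 CROSS JOIN string4_vals s4;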
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into detail on solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
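As a small sketch of what that can look like for this kind of record (the field names and sizes are only illustrative):

import numpy as np

# Structured dtype: five argument fields plus twelve float results per entry.
dtype = np.dtype([
    ("stringArg1", "U16"), ("stringArg2", "U16"),
    ("stringArg3", "U16"), ("stringArg4", "U16"),
    ("intArg1", np.int32), ("results", np.float64, (12,)),
])

data = np.zeros(3, dtype=dtype)                         # three placeholder rows
data[0] = ("a1", "b1", "c1", "d1", 7, np.arange(12.0))

np.save("simulation_data.npy", data)                    # simple file storage...
loaded = np.load("simulation_data.npy")                 # ...and later access
print(loaded[loaded["intArg1"] == 7])                   # boolean-mask query on a field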