I store several properties of objects in hashsets. Among other things, something like "creation date". There are several hashsets in the db.
So, my question is, how can I find all objects older than, for example, a week? Can you suggest an algorithm faster than O(n) (the naive implementation)?
Thanks,
Oles
My initial thought would be to store the data elsewhere, like a relational database, or possibly use a zset.
If you had continuous data (meaning it was consistently set at regular time intervals), then you could store the hash key as the member and the date (as an integer timestamp) as the score. Then you could do a zrank for a particular date, and use zrevrange to query from the first rank to the rank you get back.
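For instance, a rough sketch of that idea with redis-py, using zrangebyscore for the range query (the zset name and the helper functions below are assumptions, not from the original post):

# Sketch: keep a sorted set scored by creation timestamp alongside the hashes.
import time
import redis

r = redis.Redis()

def track(obj_key, created_ts):
    # Add (or update) the object's key with its creation time as the score.
    r.zadd("objects:by_creation", {obj_key: created_ts})

def older_than_a_week():
    cutoff = time.time() - 7 * 24 * 3600
    # Range query by score: O(log N + M) instead of scanning every hash.
    return r.zrangebyscore("objects:by_creation", "-inf", cutoff)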
I'm trying to make an application using the Python driver for Cassandra in which I want to store car objects in the database.
I've been reading a lot about Cassandra partition keys, clustering keys, indexes etc. but I can't figure out what would be the best schema for my situation.
Situation: the 'Car' objects contain 20 fields, and I want to be able to query on three of them: car_uuid (unique per car), factory_uuid (which factory made the car) and customer_uuid (who bought the car; this value is never null).
Based on what I have read about partition keys, a high-cardinality partition key is preferred, so it would make sense to choose car_uuid, since each car has a unique car_uuid. However, if I then want to query for all cars with a certain factory_uuid, I get an error telling me to enable ALLOW FILTERING, because car_uuid is the first partition key and must always be specified when querying on the second partition key column (if I have understood it correctly).
Most of my queries will either be getting all cars made by a certain factory (factory_uuid), or retrieving information about a single car (using car_uuid).
So I'm thinking of making factory_uuid the first partition key, since, for my particular application, I know I'll always have the factory_uuid when querying for a specific car_uuid. However, the main partition key then no longer has high cardinality, as there are only so many factories, so in theory my data will no longer be properly distributed. This raises some questions:
1. To what extent will I notice the impact of using a low(er)-cardinality partition key?
2. In some posts people suggested using lookup tables (roughly as in the sketch after these questions). Would this be a solution for my situation, and which would be better: using car_uuid in the main table and factory_uuid in the lookup table as the partition keys? (My intuition says that one of these tables will then still have the low-cardinality partition key problem; is this true?)
3. Just to be sure: as far as I understand, a secondary index on factory_uuid is not desirable, since, when fetching all cars for a certain factory_uuid, Cassandra would have to go over all nodes to check for matching results, which would have major performance implications for such a frequent query.
Any suggestions or input would be welcome!
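For reference, a rough sketch of one of the two-table layouts I'm considering (the factory_uuid-partitioned variant), written with the Python driver; the keyspace name, contact point and column types are assumptions:

# Sketch only: one table partitioned by factory for the "all cars of a factory"
# query, plus a lookup table keyed by car_uuid for direct lookups.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("cars_ks")  # keyspace assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS cars_by_factory (
        factory_uuid uuid,
        car_uuid uuid,
        customer_uuid uuid,
        PRIMARY KEY ((factory_uuid), car_uuid)
    )""")  # the other car fields are omitted here

session.execute("""
    CREATE TABLE IF NOT EXISTS factory_by_car (
        car_uuid uuid PRIMARY KEY,
        factory_uuid uuid
    )""")

some_factory, some_car = uuid.uuid4(), uuid.uuid4()

# All cars made by one factory: a single-partition query.
rows = session.execute(
    "SELECT * FROM cars_by_factory WHERE factory_uuid = %s", [some_factory])

# A single car when only car_uuid is known: resolve the factory first.
hit = session.execute(
    "SELECT factory_uuid FROM factory_by_car WHERE car_uuid = %s", [some_car]).one()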
I'm looking to maintain a (Postgres) SQL database collecting data from third parties. I get a full dump every day, but most of the data is static, so I want to store only the data that is new. I.e., every day I get 100K new records with say 300 columns, and 95K rows will be the same as the day before. To do this efficiently, I was thinking of inserting a hash of each record (which comes in as a Pandas dataframe or a Python dict) alongside the data. Some other data is stored as well, like when it was loaded into the database. Then, prior to inserting data, I could hash the incoming records and easily verify which ones are not yet in the database, instead of having to compare all 300 columns.
My question: which hash function should I pick, given that I'm in Python and prefer a very fast and solid solution that requires little coding on my side while being able to handle all kinds of data (ints, floats, strings, datetimes, etc.)?
1. Python's built-in hash is unsuited, as it changes for every session (as, for example, Create hash value for each row of data with selected columns in dataframe in python pandas does).
2. md5 or sha1 are cryptographic hashes. I don't need the crypto part, as this is not for security. They might be a bit slow as well, and I had some trouble with strings, as these require encoding.
3. Is a solution like CRC good enough?
For 2 and 3, if you recommend one of them: how can I implement it for arbitrary dicts and pandas rows? I have had little success in keeping this simple. For instance, for strings I needed to explicitly define the encoding, and the order of the fields in a record should not change the hash.
Edit: I just realized that it might be tricky to depend on Python for this: if I change programming language, I might end up with different hashes. Tying the hash to the database seems the more sensible choice.
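One language-agnostic option (a sketch, not a recommendation of a specific algorithm): serialize each record to a canonical form and hash that, so field order and Python's per-session hash randomization don't matter. The md5 below could be swapped for zlib.crc32 or hashlib.sha256.

# Sketch: canonical JSON + a digest; default=str turns datetimes and other
# non-JSON types into their string representation.
import hashlib
import json

def record_hash(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True, default=str, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

row = {"id": 1, "price": 9.99, "loaded": "2023-01-01T00:00:00"}
assert record_hash(row) == record_hash(dict(reversed(list(row.items()))))

If you'd rather tie the hash to the database, the same canonical-text idea can be reproduced in Postgres itself (e.g. its built-in md5() over the columns concatenated in a fixed order), which avoids depending on Python at all.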
Have you tried pandas.util.hash_pandas_object?
Not sure how efficient this is, but you could use it like this:

import pandas as pd
row_hashes = pd.util.hash_pandas_object(df, index=False)

This gives you a pandas Series with one uint64 hash per row of the df; index=False makes the hash depend only on the row's values, not on its index.
I have a MongoDB collection with close to 100,000 records, and each record has around 5,000 keys. A lot of them are empty. How can I find (and maybe visually represent) this emptiness in the data?
In other words, I would like to analyze the type of values in each key. What would be the right approach for this?
You could take a look at MongoDB's aggregation framework; check out $group in particular.
From how you described your problem, I could totally see an accumulator over the number of keys in each record.
As an example, with the appropriate thresholds and transformations, such an operation could return the records grouped by number of keys (or simply an array of the number of keys for each record).
Such an approach could also allow you to perform some data analysis over the keys used in each record.
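A minimal sketch of that idea with PyMongo (the database/collection names are placeholders, and $objectToArray needs MongoDB 3.4.4 or later): count each document's top-level keys, then group documents by that count.

# Sketch: bucket documents by how many keys they actually contain.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycollection"]  # names are placeholders

pipeline = [
    {"$project": {"numKeys": {"$size": {"$objectToArray": "$$ROOT"}}}},
    {"$group": {"_id": "$numKeys", "docs": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
]

for bucket in coll.aggregate(pipeline):
    print(f"{bucket['docs']} documents have {bucket['_id']} keys")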
I am currently facing the problem of having to frequently access a large but simple data set on a smallish (700 MHz) device in real time. The data set contains around 400,000 mappings from abbreviations to abbreviated words, e.g. "frgm" to "fragment". Reading will happen frequently while the device is in use and should not take more than 15-20 ms.
My first attempt was to use SQLite to create a simple database containing just a single table in which two strings make up a record:
CREATE TABLE WordMappings (key text, word text)
This table is created once, and although alterations are possible, only read access is time-critical.
Following this guide, my SELECT statement looks as follows:
def databaseQuery(self, query_string):
    # Parameterized query: avoids quoting/SQL-injection issues and lets SQLite reuse the plan.
    self.cursor.execute("SELECT word FROM WordMappings WHERE key = ? LIMIT 1;", (query_string,))
    result = self.cursor.fetchone()
    return result[0] if result else None
However, using this code on a test database with 20,000 abbreviations, I am unable to fetch data quicker than ~60 ms, which is far too slow.
Any suggestions on how to improve performance using SQLite, or would another approach yield more promising results?
You can speed up lookups on the key column by creating an index for it:
CREATE INDEX key_index ON WordMappings(key);
To check whether a query uses an index or scans the entire table, use EXPLAIN QUERY PLAN.
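For example, a quick check from Python (the database path is a placeholder):

# Sketch: ask SQLite how it will execute the lookup.
import sqlite3

con = sqlite3.connect("words.db")  # placeholder path
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT word FROM WordMappings WHERE key = ?", ("frgm",)
).fetchall()
print(plan)  # the detail column should say SEARCH ... USING INDEX, not SCAN of the whole table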
A long time ago I tried to use SQLite for sequential data and it was not fast enough for my needs. At the time, I was comparing it against an existing in-house binary format, which I ended up using.
I have not personally used it, but a friend uses PyTables for large time-series data; it may be worth looking into.
It turns out that defining a primary key speeds up individual queries by an order of magnitude.
Individual queries on a test table with 400,000 randomly created entries (10/20 characters long) took no longer than 5ms which satisfies the requirements.
The table is now created as follows:
CREATE TABLE WordMappings (key text PRIMARY KEY, word text)
A primary key is used because:
It is implicitly unique, which is a property of the abbreviations stored.
It cannot be NULL, so every row is guaranteed to have a key; in our case, a row without a key would mean the data is corrupt.
Other users have suggested using an index; however, an index is not necessarily unique, and according to the accepted answer to this question, it unnecessarily slows down update/insert/delete performance. Nevertheless, using an index may well improve read performance too; this has, however, not been tested by the original author.
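For completeness, a small self-contained sketch of the primary-key variant (the data and any timing it prints are illustrative, not the original author's measurements):

# Sketch: build the table with a PRIMARY KEY and time one lookup.
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE WordMappings (key text PRIMARY KEY, word text)")
con.executemany(
    "INSERT INTO WordMappings VALUES (?, ?)",
    [(f"k{i}", f"word{i}") for i in range(400_000)],
)

t0 = time.perf_counter()
row = con.execute(
    "SELECT word FROM WordMappings WHERE key = ? LIMIT 1", ("k123456",)
).fetchone()
print(row, f"{(time.perf_counter() - t0) * 1000:.2f} ms")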
I have a very large dataset - millions of records - that I want to store in Python. I might be running on 32-bit machines, so I want to keep the dataset in the hundreds-of-MB range and not let it balloon much larger than that.
These records represent an M:M relationship: two IDs (foo and bar) and some simple metadata such as timestamps (baz).
Some foos are paired with nearly all bars, and some bars with nearly all foos. But there are also many bars with almost no foos and many foos with almost no bars.
If this were a relational database, a M:M relationship would be modelled as a table with a compound key, and you could of course search comfortably on either component key individually.
If you store the rows in a hashtable, however, you need to maintain three hashtables, because the compound key is hashed as a whole and you can't search on its component keys.
If you have some kind of sorted index, you can abuse lexical sorting to iterate over the first key of the compound key, but you need a second index for the other key; it's less obvious to me, though, what actual data structure in the standard Python collections this corresponds to.
I am considering a dict of foo where each value is automatically promoted from a tuple (a single row) to a list (of row tuples) to a dict, depending on some thresholds, plus another dict of bar where each value is a single foo or a list of foos.
Are there more efficient ways - speed-wise and space-wise - of doing this? Some kind of numpy for the indices, or something similar?
(I want to store the data in Python because I am having performance problems with databases - both SQL and NoSQL varieties: you end up bound by IPC, memcpy and serialisation. That is another story; the key point is that I want to move the data into the application rather than get recommendations to move it out of the application ;) )
Have you considered using a NoSQL database that runs in memory, such as Redis? Redis supports a decent number of familiar data structures.
I realize you don't want to move outside of the application, but not reinventing the wheel can save time, and quite frankly it may be more efficient.
If you need to query the data in a flexible way and maintain various relationships, I would suggest looking further into using a database, of which there are many options. How about an in-memory database, like sqlite (using ":memory:" as the file name)? You're not really moving the data "outside" of your program, and you will have much more flexibility than with multi-layered dicts.
Redis is also an interesting alternative, as it has other data-structures to play with, rather than using a relational model with SQL.
What you describe sounds like a sparse matrix, where the foos are along one axis and the bars along the other one. Each non-empty cell represents a relationship between one foo and one bar, and contains the "simple metadata" you describe.
There are efficient sparse matrix packages for Python (scipy.sparse, PySparse) you should look at. I found these two just by Googling "python sparse matrix".
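For instance, a minimal sketch with scipy.sparse, assuming the foo and bar IDs can be mapped to integer indices (note that a stored value of 0 would be treated as an empty cell):

# Sketch: (foo, bar) -> timestamp as a sparse matrix; the CSR and CSC views
# give the two directions of the M:M lookup.
import numpy as np
from scipy.sparse import dok_matrix

n_foo, n_bar = 100_000, 100_000
baz = dok_matrix((n_foo, n_bar), dtype=np.int64)  # cheap incremental construction
baz[3, 7] = 1_700_000_000
baz[3, 9] = 1_700_000_500
baz[8, 7] = 1_700_001_000

csr = baz.tocsr()  # fast row slicing: all bars for a given foo
csc = baz.tocsc()  # fast column slicing: all foos for a given bar
print(csr[3].nonzero()[1])     # bars related to foo 3 -> [7 9]
print(csc[:, 7].nonzero()[0])  # foos related to bar 7 -> [3 8]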
As to using a database, you claim that you've had performance problems. I'd like to suggest that you may not have chosen an optimal representation, but without more details on what your access patterns look like, and what database schema you used, it's awfully hard for anybody to contribute useful help. You might consider editing your post to provide more information.
NoSQL systems like Redis don't provide M:M tables.
In the end, the best I could come up with was a Python dict keyed by (foo, bar) pairs holding the values, plus a dict per side mapping each term to its set of pairings:
class MM:
    def __init__(self):
        self._a = {}   # Bs for each A
        self._b = {}   # As for each B
        self._ab = {}  # value (metadata) for each (A, B) pair
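A self-contained sketch of how those three dicts might be populated and queried; the add method and example values below are an illustration of the description above, not the original author's full implementation:

# Illustration only: three dicts, one per lookup direction.
class MMSketch:
    def __init__(self):
        self._a = {}   # A -> set of Bs
        self._b = {}   # B -> set of As
        self._ab = {}  # (A, B) -> metadata value

    def add(self, a, b, value):
        self._a.setdefault(a, set()).add(b)
        self._b.setdefault(b, set()).add(a)
        self._ab[(a, b)] = value

mm = MMSketch()
mm.add("foo1", "bar9", 1_700_000_000)
mm.add("foo1", "bar2", 1_700_000_500)
print(mm._a["foo1"])             # all bars paired with foo1
print(mm._b["bar9"])             # all foos paired with bar9
print(mm._ab[("foo1", "bar2")])  # metadata for one specific pairing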