LevelDB for 100s of millions of entries - Python

What are the top factors to consider when tuning inserts for a LevelDB store?
I'm inserting 500M+ records in the form:
key="rs1234576543" very predictable structure. rs<1+ digits>
value="1,20000,A,C" string can be much longer but usually ~ 40 chars
keys are unique
key insert order is random
into a LevelDB store using the Python plyvel bindings, and I see a dramatic drop in speed as the number of records grows. I guess this is expected, but are there tuning measures I could look at to make it scale better?
Example code:
import plyvel

BATCHSIZE = 1000000
db = plyvel.DB('/tmp/lvldbSNP151/', create_if_missing=True)
wb = db.write_batch()
# items not in any key order; plyvel expects bytes keys and values
for i, (key, value) in enumerate(DBSNPfile, start=1):
    wb.put(key, value)
    if i % BATCHSIZE == 0:
        wb.write()
wb.write()
I've tried various batch sizes, which helps a bit, but I'm hoping there's something else I've missed. For example, can knowing the max length of a key (or value) be leveraged?

(Plyvel author here.)
LevelDB keeps all database items in sorted order. Since you are writing in a random order, this basically means that all parts of the database get rewritten all the time since LevelDB has to merge SSTs (this happens in the background). Once your database gets larger, and you keep adding more items to it, this results in a reduced write throughput.
I suspect that performance will not degrade as badly if you have better locality of your writes.
Other ideas that may be worth trying out are:
increase the write_buffer_size
increase the max_file_size
experiment with a larger block_size
use .write_batch(sync=False)
The above can all be used from Python using extra keyword arguments to plyvel.DB and to the .write_batch() method. See the API docs for details.
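For reference, this is roughly how those options are passed through plyvel (a sketch; the specific numbers are illustrative, not recommendations):

import plyvel

db = plyvel.DB(
    '/tmp/lvldbSNP151/',
    create_if_missing=True,
    write_buffer_size=256 * 1024 * 1024,  # LevelDB's default is 4 MB
    max_file_size=64 * 1024 * 1024,
    block_size=16 * 1024,
)
wb = db.write_batch(sync=False)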

Related

Accumulate the grouped sum of values across trillions of values

I have a data reduction issue that is proving to be very difficult to solve.
Essentially, I have a program that calculates incremental values (floating point) for pairs of keys from a set of about 60 million keys total. The program will generate values for about 53 trillion pairs 'relatively' quickly (simply iterating through the values would take about three days 🤣). Not every pair of keys will occur, and many pairs will come up many times. There is no reasonable way to have the pairs come up in a particular order. What I need is a way to find the sum of the values generated for each pair of keys.
For data that would fit in memory, this is a very simple problem. In python it would look something like:
from collections import Counter

res = Counter()
for key1, key2, val in data_generator():
    res[(key1, key2)] += val
The problem, of course, is that a mapping like that won't fit in memory. So I'm looking for a way to do this efficiently with a mix of on-disk and in-memory processing.
So far I've tried:
A postgresql table with upserts (ON CONFLICT UPDATE). This turned out to be far, far too slow.
A hybrid of in-memory dictionaries in python that write to a RocksDB or LMDB key value store when they get too big. Though these DBs are much faster than postgresql for this kind of task, the time to complete is still on the order of months.
At this point, I'm hoping someone has a better approach that I could try. Is there a way to break this problem up into smaller parts? Is there a standard MapReduce approach to this kind of problem?
Any tips or pointers would be greatly appreciated. Thanks!
Edit:
The computer I'm using has 64GB of RAM, 96 cores (most of my work is very parallelizable), and terabytes of HDD (and some SSD) storage.
It's hard to estimate the total number of key pairs that will be in the reduced result, but it will certainly be at least in the hundreds of billions.
As Frank Yellin observes, there's a one-round MapReduce algorithm. The mapper produces key-value pairs with key key1,key2 and value val. The MapReduce framework groups these pairs by key (the shuffle). The reducer sums the values.
In order to control the memory usage, MapReduce writes the intermediate data to disk. Traditionally there are n files, and all of the pairs with key key1,key2 go to file hash((key1,key2)) mod n. There is a tension here: n should be large enough that each file can be handled by an in-memory map, but if n is too large, then the file system falls over. Back of the envelope math suggests that n might be between 1e4 and 1e5 for you. Hopefully the OS will use RAM to buffer the file writes for you, but make sure that you're maxing out your disk throughput or else you may have to implement buffering yourself. (There also might be a suitable framework, but you don't have to write much code for a single machine.)
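A minimal sketch of that scheme on a single machine, assuming int64 keys, a data_generator() standing in for the real pipeline, and a raised per-process open-file limit (e.g. ulimit -n); the fan-out N_PARTITIONS is exactly the tension described above:

import struct
from collections import Counter

N_PARTITIONS = 10000  # tune so each partition's reduced map fits in RAM

def partition_phase():
    # phase 1: hash-partition the (key1, key2, val) stream into on-disk files
    files = [open('part_%05d.bin' % i, 'wb') for i in range(N_PARTITIONS)]
    rec = struct.Struct('<qqd')  # two int64 keys, one float64 value
    for key1, key2, val in data_generator():
        f = files[hash((key1, key2)) % N_PARTITIONS]
        f.write(rec.pack(key1, key2, val))
    for f in files:
        f.close()

def reduce_partition(i):
    # phase 2: each partition is small enough to sum in memory
    rec = struct.Struct('<qqd')
    sums = Counter()
    with open('part_%05d.bin' % i, 'rb') as f:
        while True:
            chunk = f.read(rec.size * 65536)
            if not chunk:
                break
            for key1, key2, val in rec.iter_unpack(chunk):
                sums[(key1, key2)] += val
    return sums  # write out or post-process as needed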
I agree with user3386109 that you're going to need a Really Big Disk. If you can regenerate the input multiple times, you can trade time for space by making k passes that each save only a 1/k fraction of the files.
I'm concerned that the running time of this MapReduce will be too large relative to the mean time between failures. MapReduce is traditionally distributed for fault tolerance as much as parallelism.
If there's anything you can tell us about how the input arises, and what you're planning to do with the output, we might be able to give you better advice.

Real-time access to simple but large data set with Python

I am currently facing the problem of having to frequently access a large but simple data set on a smallish (700 MHz) device in real time. The data set contains around 400,000 mappings from abbreviations to abbreviated words, e.g. "frgm" to "fragment". Reading will happen frequently when the device is used and should not require more than 15-20 ms.
My first attempt was to utilize SQLite in order to create a simple database which merely contains a single table where each row consists of two strings:
CREATE TABLE WordMappings (key text, word text)
This table is created once and although alterations are possible, only read-access is time critical.
Following this guide, my SELECT statement looks as follows:
def databaseQuery(self, query_string):
    # parameterized query: lets SQLite handle quoting of the key value
    self.cursor.execute("SELECT word FROM WordMappings WHERE key = ? LIMIT 1;",
                        (query_string,))
    result = self.cursor.fetchone()
    return result[0]
However, using this code on a test database with 20,000 abbreviations, I am unable to fetch data quicker than ~60 ms, which is far too slow.
Any suggestions on how to improve performance using SQLite or would another approach yield more promising results?
You can speed up lookups on the key column by creating an index for it:
CREATE INDEX key_index ON WordMappings(key);
To check whether a query uses an index or scans the entire table, use EXPLAIN QUERY PLAN.
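From Python's sqlite3 that check might look as follows (a sketch; the database path and lookup key are made up):

import sqlite3

conn = sqlite3.connect('words.db')
cur = conn.cursor()
rows = cur.execute(
    "EXPLAIN QUERY PLAN SELECT word FROM WordMappings WHERE key = ? LIMIT 1;",
    ("frgm",),
)
for row in rows:
    print(row)  # expect something like: SEARCH WordMappings USING INDEX key_index (key=?)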
A long time ago I tried to use SQLite for sequential data and it was not fast enough for my needs. At the time, I was comparing it against an existing in-house binary format, which I ended up using.
I have not used it personally, but a friend uses PyTables for large time-series data; maybe it's worth looking into.
It turns out that defining a primary key speeds up individual queries by an order of magnitude.
Individual queries on a test table with 400,000 randomly created entries (10/20 characters long) took no longer than 5ms which satisfies the requirements.
The table is now created as follows:
CREATE TABLE WordMappings (key text PRIMARY KEY, word text)
A primary key is used because
It is implicitly unique, which is a property of the abbreviations stored
It cannot be NULL, so every row is guaranteed to have a key; in our case, a NULL key would indicate a corrupt database
Other users have suggested using an index. However, an index is not necessarily unique and, according to the accepted answer to this question, it unnecessarily slows down update/insert/delete performance. Nevertheless, using an index might also have improved performance; this was not tested by the original author.
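For anyone who wants to reproduce this kind of measurement, a rough timing sketch (the database path and lookup key are hypothetical):

import sqlite3
import time

conn = sqlite3.connect('words.db')
cur = conn.cursor()

start = time.perf_counter()
cur.execute("SELECT word FROM WordMappings WHERE key = ? LIMIT 1;", ("frgm",))
result = cur.fetchone()
print(result, "in", (time.perf_counter() - start) * 1000, "ms")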

How many times string appears in another string

I have a large static binary (10GB) that doesn't change.
I want to be able to take as input small strings (15 bytes or lower each) and then to determine which string is the least frequent.
I understand that without actually searching the whole binary I won't be able to determine this exactly, so I know it will be an approximation.
Building a tree/hash table isn't feasible since it would require about 256^15 bytes, which is a lot.
I have about 100GB of disk space and 8GB RAM which will be dedicated into this task, but I can't seem to find any way to accomplish this task without actually going over the file.
I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many many times.
Any ideas?
Thanks!
Daniel.
(BTW: if it matters, I'm using Python)
Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune entries that don't appear anymore. I wouldn't call it an "approximation"; rather, it gives "upper bounds", with a guarantee of detecting strings that don't appear at all.
So, say you can build all 4-tuples.
Then to count occurrences of "ABCDEF" you'd take the minimum of count(ABCD), count(BCDE), count(CDEF). If any of those is zero, the string is guaranteed not to appear. If the minimum is one, the string appears at most once (but maybe not at all).
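A sketch of that idea in Python, using 4-grams (counting every 4-gram of a 10GB file in a plain Counter needs a lot of memory, so treat the table type as a placeholder for whatever storage you can afford):

from collections import Counter

N = 4  # gram length

def build_ngram_counts(path):
    counts = Counter()
    with open(path, 'rb') as f:
        prev = b''
        while True:
            chunk = f.read(1 << 20)
            if not chunk:
                break
            data = prev + chunk
            for i in range(len(data) - N + 1):
                counts[data[i:i + N]] += 1
            prev = data[-(N - 1):]  # overlap so grams spanning chunk boundaries are counted once
    return counts

def upper_bound(counts, query):
    # the true occurrence count of `query` is at most the minimum count over
    # its 4-grams; zero means the string definitely does not appear
    return min(counts[query[i:i + N]] for i in range(len(query) - N + 1))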
Because you have a large static string that does not change you could distinguish one-time work preprocessing this string which never has to be repeated from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.
If you can find a machine with an order of magnitude or so more internal storage, you could build a suffix array - an array of offsets into the stream, in sorted order of the suffixes starting at those offsets. This could be stored in external storage for queries, and you could use it with binary search to find the first and last positions in sorted order where your query string appears. Obviously the distance between the two gives you the number of occurrences, and a binary search will need about 34 chops for 16 GB of data (assuming 16 GB is 2^34 bytes), so each query should cost about 68 disk seeks.
It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for the one-time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by a 5-byte offset number, then sort these 20-byte records with an external sort. This gives you 50GB of index into the string, in sorted order, to put into external storage and answer queries with.
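A hypothetical query routine over that externally sorted index (20-byte records: a 15-byte prefix followed by a 5-byte offset; binary search by seeking into the index file):

import os

REC = 20      # record size on disk
PREFIX = 15   # bytes of string stored per record

def count_occurrences(index_path, query):  # query: bytes, at most 15 long
    with open(index_path, 'rb') as f:
        n = os.fstat(f.fileno()).st_size // REC

        def prefix_at(i):
            f.seek(i * REC)
            return f.read(PREFIX)

        # first record whose prefix >= query
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if prefix_at(mid) < query:
                lo = mid + 1
            else:
                hi = mid
        first = lo

        # first record whose prefix does not start with query
        hi = n
        while lo < hi:
            mid = (lo + hi) // 2
            if prefix_at(mid)[:len(query)] <= query:
                lo = mid + 1
            else:
                hi = mid
        return lo - first  # number of occurrences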
If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an Aho-Corasick tree (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) from them. This takes time linear in the total size of the queries. Then you can stream the 10GB of data past it in time proportional to the sum of the size of that data and the number of times any string finds a match.
Since you are looking for which string is least frequent, and are willing to accept an approximate solution, you could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.
The idea would be to go through all of the possible query sizes and make substrings out of them. For example, if the queries will be between 3 and 100 bytes long, then it would cost roughly N * (sum of i from i = 3 to i = 100) bytes of substrings. Then add each substring, one by one, to the first Bloom filter that doesn't already contain it, creating a new Bloom filter with the same hash functions if needed. To answer a query, go through each filter and check whether the query exists within it; add 1 to the count for every filter that contains it.
You'll need to balance the false positive rate as well as the number of filters. If the false positive rate gets too high on one of the filters it isn't useful; likewise, it's bad if you have trillions of Bloom filters (quite possible if you keep one filter per substring). There are a couple of ways these issues can be dealt with.
To reduce the number of filters:
Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
Randomly merge filters until there are only so many left. Ideally avoiding merging a filter too often as it increases the false positive rate. Practically speaking, you probably have too many to do this without making use of the scalable version (see below), as it'll probably be hard enough to manage the false positive rate.
It also may not be a bad idea to avoid a greedy approach when adding to a Bloom filter: be rather selective about which filter something is added to.
You might end up having to implement scalable bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so should work well.
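A sketch of the series-of-filters idea (the Bloom filter is hand-rolled for self-containment; sizes and hash counts are illustrative, and in practice you'd want the merging/scalable refinements discussed above):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 27, n_hashes=5):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(bytes([i]) + item, digest_size=8).digest()
            yield int.from_bytes(digest, 'little') % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

filters = []  # filters[k] holds substrings seen at least k+1 times

def observe(substring):
    # add to the first filter that doesn't already contain the substring
    for f in filters:
        if substring not in f:
            f.add(substring)
            return
    new = BloomFilter()
    new.add(substring)
    filters.append(new)

def approx_count(substring):
    # approximate count = number of filters that (appear to) contain it
    return sum(1 for f in filters if substring in f)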

redis - Using Hashes

I'm implementing a social stream and a notification system for my web application by using redis. I'm new to redis and I have some doubts about hashes and their efficiency.
I've read this awesome Instagram post
and I planned to implement their similar solution for minimal storage.
As mentioned in their blog, they did like this
To take advantage of the hash type, we bucket all our Media IDs into buckets of 1000 (we just take the ID, divide by 1000 and discard the remainder). That determines which key we fall into; next, within the hash that lives at that key, the Media ID is the lookup key within the hash, and the user ID is the value. An example, given a Media ID of 1155315, which means it falls into bucket 1155 (1155315 / 1000 = 1155):
HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
> "939"
So instead of having 1000 separate keys they are storing them in one hash with a thousand lookup keys. My doubt is: why can't we increase the number of lookup keys per hash even further?
For example, a Media ID of 1155315 would fall into mediabucket:115 by dividing it by 10000,
or even greater than that.
Why are they settling on one hash bucket with 1000 lookup keys? Why can't they have one hash bucket with 100000 lookup keys? Is that related to efficiency?
I need your suggestion for implementing the efficient method in my web application.
P.S. Please don't say that Stack Overflow is not for asking for suggestions; I don't know where else to find help.
Thanks!
Yes, it's related to efficiency.
We asked the always-helpful Pieter Noordhuis, one of Redis’ core developers, for input, and he suggested we use Redis hashes. Hashes in Redis are dictionaries that are can be encoded in memory very efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures the maximum number of entries a hash can have while still being encoded efficiently. We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity. For more details, you can check out the zipmap source file.
Small hashes are encoded in a special way (zipmaps) that is memory-efficient but makes operations O(N) instead of O(1). So, with one zipmap of 100k fields instead of 100 zipmaps of 1k fields each, you gain no memory benefits, but all your operations get 100 times slower.
Basically, they want the number of values stored in a single hash to not exceed 1000. Probably, they set up their Redis instance configuration to work nicely with this number (they set hash-zipmap-max-entries).
Every time a hash exceeds the specified number of elements or element size it will be converted into a real hash table, and the memory saving will be lost.
-- http://redis.io/topics/memory-optimization
As I understand, your question is "why exactly 1000 and not more?" Well, it's because they had to choose between space efficiency and speed. Space-efficient representation has operation complexity O(N), not O(1) as normal hashes - it is N times slower, but takes less memory.
They tested different values and found that 1000 is a good compromise solution - takes not much space, yet still fast enough.
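For completeness, the bucketing scheme itself is only a few lines with the redis-py client (a sketch; bucket size 1000 and the mediabucket key prefix follow the Instagram example, and the hash encoding threshold - hash-zipmap-max-entries in old Redis versions, hash-max-ziplist-entries/hash-max-listpack-entries in newer ones - must be at least the bucket size for the memory savings to apply):

import redis

r = redis.Redis()
BUCKET_SIZE = 1000

def store_media(media_id, user_id):
    bucket = media_id // BUCKET_SIZE
    r.hset('mediabucket:%d' % bucket, media_id, user_id)

def lookup_media(media_id):
    bucket = media_id // BUCKET_SIZE
    return r.hget('mediabucket:%d' % bucket, media_id)

store_media(1155315, 939)
print(lookup_media(1155315))  # b'939'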

BST or Hash Table?

I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.
Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportionate to n - however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average of all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from e.g. 1000000 to 1500000 elements, that insert will take a lot of time, but now you've bought yourself 500000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:
n = n * 2;
bucket = realloc(bucket, sizeof(bucket[0]) * n);  /* array of per-bucket pointers */
for (i = 0, j = n / 2; j < n; i++, j++) {
    bucket[j] = bucket[i];  /* each new bucket points back at its old counterpart */
}
"library in C that I hope to make available as a Python module"
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault
scrunching each input row:
for i, field in enumerate(fields):
    fields[i] = string_sharer(field, field)
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".
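A tiny self-contained check of the sharing behaviour (sample rows are made up):

shared_table = {}
string_sharer = shared_table.setdefault

row1 = ['ACME Corp', '2024-01-02']
row2 = ['ACME Corp', '2024-01-03']
for row in (row1, row2):
    for i, field in enumerate(row):
        row[i] = string_sharer(field, field)

assert row1[0] is row2[0]        # the repeated value is stored only once
assert len(shared_table) == 3    # three distinct strings in total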
