I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.
Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time proportional to n; however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average over all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from e.g. 1,000,000 to 1,500,000 slots, that insert will take a lot of time, but now you've bought yourself 500,000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
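If it helps to see the amortized argument concretely, here is a minimal Python sketch (a toy counting model, not real hash table code) that tallies the total work done under geometric resizing and compares it to the number of inserts:

# Toy model of geometric resizing (illustration only): count how much total
# work n inserts cost when the table doubles whenever it fills up.
def insertion_work(n, growth_factor=2):
    capacity, size, work = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            work += size                 # move/rehash every existing element
            capacity *= growth_factor    # grow by a constant factor
        work += 1                        # the insert itself
        size += 1
    return work

for n in (10, 1000, 1000000):
    print(n, insertion_work(n) / n)      # ratio stays bounded (< 3 for doubling)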
You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:
n = n * 2;
/* grow the bucket array; note sizeof(bucket[0]), not sizeof(bucket) */
bucket = realloc(bucket, sizeof(bucket[0]) * n);
/* each new bucket initially points back at its counterpart in the old half
   and is rehashed lazily on the next insert or lookup */
for (i = 0, j = n / 2; j < n; i++, j++) {
    bucket[j] = bucket[i];
}
library in C that I hope to make
available as a Python module
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault
scrunching each input row:
for i, field in enumerate(fields):
    fields[i] = string_sharer(field, field)
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".
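As a rough way to check that, you could measure each column's uniqueness ratio before deciding what to scrunch; a quick sketch (the file name and CSV layout are assumptions):

import csv
from collections import defaultdict

totals = defaultdict(int)
uniques = defaultdict(set)

with open("sales.csv", newline="") as f:     # hypothetical input file
    for row in csv.reader(f):
        for i, field in enumerate(row):
            totals[i] += 1
            uniques[i].add(field)

for i in sorted(totals):
    ratio = len(uniques[i]) / totals[i]
    print("column %d: %.1f%% unique" % (i, 100 * ratio))  # near 100% => skip scrunching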
I am processing the human genome and have ~10 million SNPs (identified by a "SNP_ID") for a single patient. I have two reference TSVs; each row contains a SNP_ID and a floating point number (as well as lots of other metadata), all in ASCII format. These reference TSVs are 300-500 GB in size.
I need to filter the 10 million SNPs based on criteria contained in the TSVs. In other words, find the row with the SNP_ID, look up the floating point number, and decide whether the value is above a threshold.
My thought is to store the SNPs in a Python set, then do a scan over each TSV, doing a lookup to see if the row in the TSV matches any object in the set. Do you think this is a reasonable approach, or will lookups in a set with 10 million items be very slow? I have hundreds of patients this needs to be done for, so it shouldn't take more than an hour or two to process.
Your data size is large enough that you should not be working with data structures in memory. Instead, consider using a relational database system. You can start with sqlite, which comes bundled with Python.
This SO answer has details about how to load a TSV into sqlite.
After your set of SNPs and your reference TSVs are in sqlite, you can filter the SNPs with a simple SQL query such as:
SELECT t1.SNP_ID
FROM snps t1
LEFT JOIN ref_tsv t2
    ON t1.SNP_ID = t2.SNP_ID
WHERE t2.value >= {your_threshold};
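For completeness, a minimal sqlite3 sketch of the whole flow; the table and column names follow the query above, while the file paths and the TSV column order are assumptions:

import csv
import sqlite3

conn = sqlite3.connect("snps.db")
conn.execute("CREATE TABLE IF NOT EXISTS snps (SNP_ID TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE IF NOT EXISTS ref_tsv (SNP_ID TEXT, value REAL)")

# Load the patient's SNP IDs (assumed: one ID per line).
with open("patient_snps.txt") as f:
    conn.executemany("INSERT OR IGNORE INTO snps VALUES (?)",
                     ((line.strip(),) for line in f))

# Stream the reference TSV; assumed: SNP_ID in column 0, the float in column 1.
with open("reference.tsv", newline="") as f:
    rows = ((r[0], float(r[1])) for r in csv.reader(f, delimiter="\t"))
    conn.executemany("INSERT INTO ref_tsv VALUES (?, ?)", rows)

conn.execute("CREATE INDEX IF NOT EXISTS idx_ref ON ref_tsv (SNP_ID)")
conn.commit()

threshold = 0.5   # your_threshold
passing = conn.execute(
    "SELECT t1.SNP_ID FROM snps t1 "
    "LEFT JOIN ref_tsv t2 ON t1.SNP_ID = t2.SNP_ID "
    "WHERE t2.value >= ?", (threshold,)).fetchall()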
OK, here's what I would do in your case.
500 GB of metadata is a lot, so let's look at how we can reduce that amount.
Your idea to make a set() of SNP_IDs is good. Read all your SNP data and build a set of SNP_IDs; it will definitely fit into memory.
Then read the TSV data, and for every row check whether its SNP_ID is in your set; if it is, save the SNP_ID and the floating point number and discard the rest (see the sketch below). You will have 10M records at most, because one patient has only that many SNPs.
Do your magic.
Start over with the next patient.
It would be nice to put all the data on a fast SSD, just in case.
Something else to try: if you discard the metadata and keep only the SNP_ID and the float, you may be able to reduce the TSV size to just a few gigabytes. Then you could easily fit it into memory and make things much faster.
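A sketch of that single pass (the file formats and column positions are assumptions):

import csv

# Assumed: one SNP_ID per line in the patient file.
with open("patient_snps.txt") as f:
    wanted = {line.strip() for line in f}

kept = {}   # SNP_ID -> floating point value
with open("reference.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        snp_id = row[0]                    # assumed column order
        if snp_id in wanted:
            kept[snp_id] = float(row[1])   # keep only what we need

threshold = 0.5
passing = [snp for snp, value in kept.items() if value >= threshold]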
What are the top factors to consider when tuning inserts for a LevelDB store?
I'm inserting 500M+ records in the form:
key="rs1234576543" very predictable structure. rs<1+ digits>
value="1,20000,A,C" string can be much longer but usually ~ 40 chars
keys are unique
key insert order is random
into a LevelDB store using the Python plyvel library, and I see a dramatic drop in speed as the number of records grows. I guess this is expected, but are there tuning measures I could look at to make it scale better?
Example code:
import plyvel

BATCHSIZE = 1000000

db = plyvel.DB('/tmp/lvldbSNP151/', create_if_missing=True)
wb = db.write_batch()
# items not in any key order
for i, (key, value) in enumerate(DBSNPfile):
    wb.put(key, value)
    if i % BATCHSIZE == 0:
        wb.write()
wb.write()
I've tried various batch sizes, which helps a bit, but I'm hoping there's something else I've missed. For example, can knowing the max length of a key (or value) be leveraged?
(Plyvel author here.)
LevelDB keeps all database items in sorted order. Since you are writing in a random order, this basically means that all parts of the database get rewritten all the time since LevelDB has to merge SSTs (this happens in the background). Once your database gets larger, and you keep adding more items to it, this results in a reduced write throughput.
I suspect that performance will not degrade as badly if you have better locality of your writes.
Other ideas that may be worth trying out are:
increase the write_buffer_size
increase the max_file_size
experiment with a larger block_size
use .write_batch(sync=False)
The above can all be used from Python using extra keyword arguments to plyvel.DB and to the .write_batch() method. See the api docs for details.
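A hedged sketch of how those knobs can be passed through plyvel (the specific values are just starting points to experiment with, not recommendations):

import plyvel

db = plyvel.DB(
    '/tmp/lvldbSNP151/',
    create_if_missing=True,
    write_buffer_size=64 * 1024 * 1024,   # larger memtable before flushing
    max_file_size=16 * 1024 * 1024,       # larger SST files
    block_size=16 * 1024,                 # larger blocks within each SST
)

wb = db.write_batch(sync=False)           # asynchronous writes
for i, (key, value) in enumerate(DBSNPfile):   # DBSNPfile as in the question
    wb.put(key, value)
    if i % 1000000 == 0:
        wb.write()
wb.write()

Pre-sorting the input by key (even in large chunks) before inserting should also give better write locality, per the first point above.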
I'm teaching myself data structures through this Python book, and I'd appreciate it if someone could correct me if I'm wrong, since a hash set seems to be extremely similar to a hash map.
Implementation:
A HashSet is a list [] or array where each index points to the head of a linked list.
So hash(some_item) --> key, then list[key], and then we add to the head of that linked list. This occurs in O(1) time.
When removing a value from the linked list, in Python we replace it with a placeholder because hash sets are not allowed to have Null/None values, correct?
When the list[] gets over a certain % of load/fullness, we copy it over to another, larger list.
Regarding Time Complexity Confusion:
So one question is: why is average search/access O(1) if there can be a linked list of N items at a given index?
Wouldn't the average case be that the search item is in the middle of its indexed linked list, so it should be O(n/2) -> O(n)?
Also, when removing an item, if we are replacing it with a placeholder value, isn't this considered a waste of memory if the placeholder is never used?
And finally, what is the difference between this and a HashMap, other than that HashMaps can have nulls and HashMaps are key/value while HashSets are just values?
For your first question - why is the average time complexity of a lookup O(1)? - this statement is in general only true if you have a good hash function. An ideal hash function is one that causes a nice spread on its elements. In particular, hash functions are usually chosen so that the probability that any two elements collide is low. Under this assumption, it's possible to formally prove that the expected number of elements to check is O(1). If you search online for "universal family of hash functions," you'll probably find some good proofs of this result.
As for using placeholders - there are several different ways to implement a hash table. The approach you're using is called "closed addressing" or "hashing with chaining," and in that approach there's little reason to use placeholders. However, other hashing strategies exist as well. One common family of approaches is called "open addressing" (the most famous of which is linear probing hashing), and in those setups placeholder elements are necessary to avoid false negative lookups. Searching online for more details on this will likely give you a good explanation about why.
As for how this differs from HashMap, the HashMap is just one possible implementation of a map abstraction backed by a hash table. Java's HashMap does support nulls, while other approaches don't.
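To make the open-addressing point above concrete, here is a minimal linear-probing sketch (illustration only: fixed capacity, no resizing, no duplicate handling past tombstones) showing why a deletion leaves a placeholder rather than an empty slot:

EMPTY, DELETED = object(), object()   # DELETED is the placeholder ("tombstone")

class ProbingSet:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, item):
        n = len(self.slots)
        start = hash(item) % n
        for k in range(n):            # visit each slot at most once
            yield (start + k) % n

    def add(self, item):
        for i in self._probe(item):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED or self.slots[i] == item:
                self.slots[i] = item
                return

    def __contains__(self, item):
        for i in self._probe(item):
            if self.slots[i] is EMPTY:             # never-used slot: stop searching
                return False
            if self.slots[i] is not DELETED and self.slots[i] == item:
                return True
        return False

    def discard(self, item):
        for i in self._probe(item):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] is not DELETED and self.slots[i] == item:
                # Marking the slot EMPTY instead would cut the probe chain and
                # make later items in the chain unfindable (false negatives).
                self.slots[i] = DELETED
                return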
The lookup time wouldn't be O(n) because not all items need to be searched, it also depends on the number of buckets. More buckets would decrease the probability of a collision and reduce the chain length.
The number of buckets can be kept as a constant factor of the number of entries by resizing the hash table as needed. Along with a hash function that evenly distributes the values, this keeps the expected chain length bounded, giving constant time lookups.
The hash tables used by hashmaps and hashsets are the same except they store different values. A hashset will contain references to a single value, and a hashmap will contain references to a key and a value. Hashsets can be implemented by delegating to a hashmap where the keys and values are the same.
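A tiny sketch of that delegation idea (Python's real set is implemented in C; this is only to show the relationship):

class DictBackedSet:
    def __init__(self, items=()):
        self._d = {x: x for x in items}   # key and value are the same object

    def add(self, x):
        self._d[x] = x

    def discard(self, x):
        self._d.pop(x, None)

    def __contains__(self, x):
        return x in self._d

s = DictBackedSet(["a", "b"])
s.add("c")
print("b" in s, "z" in s)   # True False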
A lot has been written here about open hash tables, but some fundamental points are missed.
Practical implementations generally have O(1) lookup and delete because they guarantee buckets won't contain more than a fixed number of items (the load factor). But this means they can only achieve amortized O(1) time for insert because the table needs to be reorganized periodically as it grows.
(Some may opt to reorganize on delete as well, to shrink the table when the load factor reaches some bottom threshold, but this only affects space, not asymptotic run time.)
Reorganization means increasing (or decreasing) the number of buckets and re-assigning all elements into their new bucket locations. There are schemes, e.g. extensible hashing, to make this a bit cheaper. But in general it means touching each element in the table.
Reorganization, then, is O(n). How can insert be O(1) when any given one may incur this cost? The secret is amortization and the power of powers. When the table is grown, it must be grown by a factor greater than one, two being most common. If the table starts with 1 bucket and doubles each time the load factor reaches F, then the cost of N reorganizations is
F + 2F + 4F + 8F ... (2^(N-1))F = (2^N - 1)F
At this point the table contains (2^(N-1))F elements, the number in the table during the last reorganization. I.e. we have done (2^(N-1))F inserts, and the total cost of reorganization is as shown on the right. The interesting part is the average cost per element in the table (or insert, take your pick):
(2^N - 1)F / (2^(N-1))F  ~=  2^N / 2^(N-1)  =  2
That's where the amortized O(1) comes from.
One additional point is that for modern processors, linked lists aren't a great idea for the bucket lists. With 8-byte pointers, the overhead is meaningful. More importantly, heap-allocated nodes in a single list will almost never be contiguous in memory. Traversing such a list kills cache performance, which can slow things down by orders of magnitude.
Arrays (with an integer count for number of data-containing elements) are likely to work out better. If the load factor is small enough, just allocate an array equal in size to the load factor at the time the first element is inserted in the bucket. Otherwise, grow these element arrays by factors the same way as the bucket array! Everything will still amortize to O(1).
To delete an item from such a bucket, don't mark it deleted. Just copy the last array element to the location of the deleted one and decrement the element count. Of course this won't work if you allow external pointers into the hash buckets, but that's a bad idea anyway.
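A small Python sketch of this bucket layout (arrays as buckets, factor-of-two growth, swap-with-last delete); it is illustrative only, since the point about cache locality really applies to a C/C++ implementation:

class ArrayBucketSet:
    def __init__(self, load_factor=4):
        self.load_factor = load_factor
        self.buckets = [[] for _ in range(8)]
        self.count = 0

    def _bucket(self, item):
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item):
        b = self._bucket(item)
        if item not in b:
            b.append(item)
            self.count += 1
            if self.count > self.load_factor * len(self.buckets):
                self._grow()

    def _grow(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]   # double the bucket count
        for b in old:
            for item in b:                                 # touch every element once: O(n)
                self._bucket(item).append(item)

    def __contains__(self, item):
        return item in self._bucket(item)

    def discard(self, item):
        b = self._bucket(item)
        if item in b:
            b[b.index(item)] = b[-1]   # overwrite with the last element...
            b.pop()                    # ...and shrink the array: no placeholder needed
            self.count -= 1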
I am in the planning phase of building a simulation and need ideas on how to represent data, based on memory and speed considerations.
At each time-step, the simulation process creates 10^3 to 10^4 new data records, looks at each new or existing record (there are 10^6 to 10^8 of them), and then either deletes or modifies it.
Each record has 3-10 simple fields, each either an integer or a string of several ASCII characters. In addition, each record has 1-5 other fields, each a variable-length list containing integers. A typical record weighs 100-500 bytes.
The modify-or-delete process works like this: For this record, compute a function whose arguments are the values of some of this record's fields, and the values of these fields of another record. Depending on the results, the process prepares to delete or modify its fields in some way.
Then repeat for each other record. Then move to the next record and repeat. When all records have been processed, the simulation is ready to move to the next time-step.
Just before moving to the next time-step, apply all the deletions and modifications as prepared.
The more records allowed, the better the simulation. If all records are in RAM, the downside is a limit on simulation size and presumably the upside is speed. The simulation doesn't need to be realtime, but obviously I don't want it too slow.
To represent each record in memory, I know of these options: a list or dict (with some lists nested in it), or a class instance. To store away all the records and continue the simulation another day, the options in order of decreasing familiarity to me are: a csv file where each line is a record, or just put all records in RAM then put them into a file (perhaps using pickle), or use some sort of database.
I've learned Python basics plus some concepts like generators but haven't learned database, haven't tried pickling, and obviously need to learn more. If possible, I'd avoid multiple computers because I have only 1, and concurrency because it looks too scary.
What would you advise about how to represent records in memory, and about how to store away the simulated system?
If we take your worst case, 10**8 records and 500 bytes per record, that's on the order of 50 GB of RAM, so it's worth designing in some flexibility and assuming not all records will always be resident in RAM. You could make an abstraction class that hides the details of where the records are.
class Record(object):
    def __init__(self, x, y, z):
        pass   # code goes here

def get_record(id):
    pass   # code goes here
Instead of a module-level get_record() you could give the class a __getitem__() method, and then your class will act like a list, but might be going out to a database, or referencing a RAM cache, or whatever. Just use integers as the ID values. Then if you change your mind about the persistence store (switch from database to pickle or whatever) the actual code won't change.
You could also try just making a really huge swapfile and letting the virtual memory system handle shuffling records in and out of actual RAM. This is easy to try. It does not have any easy way to interrupt a calculation and save the state.
You could represent each record as a tuple, even a named tuple. I believe a tuple would have the lowest overhead of any "container" object in Python. (A named tuple just stores the names once in one place, so it's low overhead also.)
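One way to combine those suggestions, sketched with hypothetical field names and a stubbed-out persistence fallback:

from collections import namedtuple

Record = namedtuple("Record", ["x", "y", "z"])   # hypothetical fields

class RecordStore:
    """Acts like a sequence of records; where they actually live is hidden."""

    def __init__(self):
        self._cache = {}   # id -> Record currently held in RAM

    def __getitem__(self, record_id):
        try:
            return self._cache[record_id]
        except KeyError:
            return self._load(record_id)   # fall back to database/pickle/...

    def __setitem__(self, record_id, record):
        self._cache[record_id] = record

    def _load(self, record_id):
        # Stub: replace with a database or pickle lookup.
        raise KeyError(record_id)

store = RecordStore()
store[1] = Record(10, "abc", [1, 2, 3])
print(store[1].x)   # 10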
I'm implementing a social stream and a notification system for my web application by using redis. I'm new to redis and I have some doubts about hashes and their efficiency.
I've read this awesome Instagram post
and I plan to implement a similar solution for minimal storage.
As mentioned in their blog, they did it like this:
To take advantage of the hash type, we bucket all our Media IDs into buckets of 1000 (we just take the ID, divide by 1000 and discard the remainder). That determines which key we fall into; next, within the hash that lives at that key, the Media ID is the lookup key within the hash, and the user ID is the value. An example, given a Media ID of 1155315, which means it falls into bucket 1155 (1155315 / 1000 = 1155):
HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
> "939"
So instead of having 1000 separate keys, they store them in one hash with a thousand lookup keys. My doubt is: why can't we make the number of lookup keys even larger?
For example, a Media ID of 1155315 would fall into mediabucket:115 if we divided by 10000, or by something even greater than that.
Why are they settling on one hash bucket with 1000 lookup keys? Why can't they have one hash bucket with 100000 lookup keys? Is that related to efficiency?
I need your suggestion for implementing the efficient method in my web application.
P.S. Please don't say that Stack Overflow is not for asking for suggestions; I don't know where else to find help.
Thanks!
Yes, it's related to efficiency.
We asked the always-helpful Pieter Noordhuis, one of Redis’ core developers, for input, and he suggested we use Redis hashes. Hashes in Redis are dictionaries that can be encoded in memory very efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures the maximum number of entries a hash can have while still being encoded efficiently. We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity. For more details, you can check out the zipmap source file.
Small hashes are encoded in a special way (zipmaps), that is memory efficient, but makes operations O(N) instead of O(1). So, with one zipmap with 100k fields instead of 100 zipmaps with 1k fields you gain no memory benefits, but all your operations get 100 times slower.
Basically, they want the number of values stored in a single hash to not exceed 1000. Probably, they set up their Redis instance configuration to work nicely with this number (they set hash-zipmap-max-entries).
Every time a hash will exceed the number of elements or element size specified it will be converted into a real hash table, and the memory saving will be lost.
-- http://redis.io/topics/memory-optimization
As I understand, your question is "why exactly 1000 and not more?" Well, it's because they had to choose between space efficiency and speed. Space-efficient representation has operation complexity O(N), not O(1) as normal hashes - it is N times slower, but takes less memory.
They tested different values and found that 1000 is a good compromise solution - takes not much space, yet still fast enough.
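For your own implementation, the same bucketing can be written with redis-py roughly like this (connection details and key names are assumptions; the bucket size should match your hash-max-*-entries setting):

import redis

r = redis.Redis()          # assumed local instance
BUCKET_SIZE = 1000         # keep each hash under the compact-encoding threshold

def set_media_owner(media_id, user_id):
    bucket = media_id // BUCKET_SIZE
    r.hset("mediabucket:%d" % bucket, str(media_id), str(user_id))

def get_media_owner(media_id):
    bucket = media_id // BUCKET_SIZE
    return r.hget("mediabucket:%d" % bucket, str(media_id))

set_media_owner(1155315, 939)
print(get_media_owner(1155315))   # b'939'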