Accumulate the grouped sum of values across trillions of values - python

I have a data reduction issue that is proving to be very difficult to solve.
Essentially, I have a program that calculates incremental values (floating point) for pairs of keys from a set of about 60 million keys total. The program will generate values for about 53 trillion pairs 'relatively' quickly (simply iterating through the values would take about three days 🤣). Not every pair of keys will occur, and many pairs will come up many times. There is no reasonable way to have the pairs come up in a particular order. What I need is a way to find the sum of the values generated for each pair of keys.
For data that would fit in memory, this is a very simple problem. In python it would look something like:
from collections import Counter
res = Counter()
for key1, key2, val in data_generator():
    res[(key1, key2)] += val
The problem, of course, is that a mapping like that won't fit in memory. So I'm looking for a way to do this efficiently with a mix of on-disk and in-memory processing.
So far I've tried:
A PostgreSQL table with upserts (INSERT ... ON CONFLICT DO UPDATE). This turned out to be far, far too slow.
A hybrid of in-memory dictionaries in Python that write to a RocksDB or LMDB key-value store when they get too big. Though these DBs are much faster than PostgreSQL for this kind of task, the time to complete is still on the order of months.
At this point, I'm hoping someone has a better approach that I could try. Is there a way to break this problem up into smaller parts? Is there a standard MapReduce approach to this kind of problem?
Any tips or pointers would be greatly appreciated. Thanks!
Edit:
The computer I'm using has 64GB of RAM, 96 cores (most of my work is very parallelizable), and terabytes of HDD (and some SSD) storage.
It's hard to estimate the total number of key pairs that will be in the reduced result, but it will certainly be at least in the hundreds of billions.

As Frank Yellin observes, there's a one-round MapReduce algorithm. The mapper produces key-value pairs with key key1,key2 and value val. The MapReduce framework groups these pairs by key (the shuffle). The reducer sums the values.
In order to control the memory usage, MapReduce writes the intermediate data to disk. Traditionally there are n files, and all of the pairs with key key1,key2 go to file hash((key1,key2)) mod n. There is a tension here: n should be large enough that each file can be handled by an in-memory map, but if n is too large, then the file system falls over. Back of the envelope math suggests that n might be between 1e4 and 1e5 for you. Hopefully the OS will use RAM to buffer the file writes for you, but make sure that you're maxing out your disk throughput or else you may have to implement buffering yourself. (There also might be a suitable framework, but you don't have to write much code for a single machine.)
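For concreteness, here is a minimal single-machine sketch of that partition-then-reduce scheme, reusing your data_generator(). The partition count, scratch directory, and text format are placeholders to tune; a binary format would shrink the intermediate files considerably.

import os
from collections import Counter

N_PARTITIONS = 20_000        # tune so each partition's summed result fits in RAM
PART_DIR = "partitions"      # placeholder: a scratch directory on the big disk
os.makedirs(PART_DIR, exist_ok=True)

# Map + shuffle: append each record to the file chosen by hashing its key pair.
# Built-in hash() is randomized per process, so substitute a stable hash if the
# map phase spans several processes. At this partition count you may also need
# to raise the open-file limit (ulimit -n) or do your own write buffering.
parts = [open(os.path.join(PART_DIR, f"part_{p}.txt"), "w") for p in range(N_PARTITIONS)]
for key1, key2, val in data_generator():
    parts[hash((key1, key2)) % N_PARTITIONS].write(f"{key1}\t{key2}\t{val}\n")
for f in parts:
    f.close()

# Reduce: each partition now fits in memory, so sum it with a Counter.
for p in range(N_PARTITIONS):
    sums = Counter()
    with open(os.path.join(PART_DIR, f"part_{p}.txt")) as f:
        for line in f:
            key1, key2, val = line.split("\t")
            sums[(key1, key2)] += float(val)
    # ... write `sums` out to wherever the final result should live ...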
I agree with user3386109 that you're going to need a Really Big Disk. If you can regenerate the input multiple times, you can trade time for space by making k passes that each save only a 1/k fraction of the files.
I'm concerned that the running time of this MapReduce will be too large relative to the mean time between failures. MapReduce is traditionally distributed for fault tolerance as much as parallelism.
If there's anything you can tell us about how the input arises, and what you're planning to do with the output, we might be able to give you better advice.

Related

More optimized way to do itertools.combinations

I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these into a data frame, to later run a function that compares each of the unique set of IDs.
(70_000 * 69_999) / 2 ≈ 2.4 billion - that is not such a large number as to be uncomputable in a few hours (update: I ran a dry run on itertools.combinations(range(70000), 2) and it took less than 70 seconds on a 2017-era i7 @ 3 GHz, naively using a single core). But if you are trying to keep all of this data in memory at once, it won't fit - and if your system is configured to swap memory to disk before raising a MemoryError, this may slow the program down by 2 or more orders of magnitude, and that is where your problem comes from.
itertools.combinations does the right thing in this respect, and there is no need to replace it with something else: it will yield one combination at a time. What you do with the result, however, does change things: if you are streaming the combinations to a file and not keeping them in memory, it should be fine, and then it is just computational time, which you can't speed up anyway.
If, on the other hand, you are collecting the combinations into a list or other data structure: there is your problem - don't do it.
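For example, here is a minimal sketch of the streaming approach (the file name and format are arbitrary): each combination goes straight to disk as it is produced, so memory use stays flat no matter how many combinations there are.

import csv
import itertools

ids = range(70_000)          # stand-in for the real list of ~70,000 IDs

with open("combinations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for a, b in itertools.combinations(ids, 2):
        writer.writerow([a, b])   # streamed straight to disk, nothing accumulated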
Now, going a step further than your question: since these combinations are checkable and predictable, maybe generating them all is not the right approach at all. You don't give details on how they are to be used, but if they are consumed in a reactive or lazy way, you might end up with an instantaneous workflow instead.
Your RAM will fill up. You can counter this with gc.collect() or by emptying the results, but the found results have to be saved to disk in between.
You could try something similar to the code below. I would create individual file names or save the results into a database, since the result file will be several GB in size. Additionally, the range of the inner loop can probably be halved (e.g. starting at i) so that each unordered pair is generated only once.
import gc

new_set = set()
for i in range(70000):
    new_set.add(i)
print(new_set)

combined_set = set()
for i in range(len(new_set)):
    print(i)
    if i % 300 == 0:
        # periodically flush the accumulated pairs to disk and free the memory
        with open("results", "a") as f:
            f.write(str(combined_set))
        combined_set = set()
        gc.collect()
    for b in range(len(new_set)):
        combined_set.add((i, b))

LevelDB for 100s of millions of entries

What are the top factors to consider when tuning inserts for a LevelDB store?
I'm inserting 500M+ records in the form:
key="rs1234576543" very predictable structure. rs<1+ digits>
value="1,20000,A,C" string can be much longer but usually ~ 40 chars
keys are unique
key insert order is random
into a LevelDB store using the Python plyvel library, and I see a dramatic drop in speed as the number of records grows. I guess this is expected, but are there tuning measures I could look at to make it scale better?
Example code:
import plyvel

BATCHSIZE = 1000000
db = plyvel.DB('/tmp/lvldbSNP151/', create_if_missing=True)
wb = db.write_batch()
# items not in any key order
for i, (key, value) in enumerate(DBSNPfile):
    wb.put(key, value)
    if i % BATCHSIZE == 0:
        wb.write()
wb.write()
I've tried various batch sizes, which helps a bit, but I'm hoping there's something else I've missed. For example, can knowing the max length of a key (or value) be leveraged?
(Plyvel author here.)
LevelDB keeps all database items in sorted order. Since you are writing in a random order, this basically means that all parts of the database get rewritten all the time since LevelDB has to merge SSTs (this happens in the background). Once your database gets larger, and you keep adding more items to it, this results in a reduced write throughput.
I suspect that performance will not degrade as badly if you have better locality of your writes.
Other ideas that may be worth trying out are:
increase the write_buffer_size
increase the max_file_size
experiment with a larger block_size
use .write_batch(sync=False)
The above can all be used from Python by passing extra keyword arguments to plyvel.DB and to the .write_batch() method; see the API docs for details. For example:
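As a rough sketch (the numbers below are only illustrative starting points, not recommendations; tune them for your data and hardware):

import plyvel

db = plyvel.DB(
    '/tmp/lvldbSNP151/',
    create_if_missing=True,
    write_buffer_size=64 * 1024 * 1024,   # much larger than the 4 MB default
    max_file_size=64 * 1024 * 1024,
    block_size=16 * 1024,                 # larger than the 4 KB default
)
wb = db.write_batch(sync=False)           # trade durability for throughput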

Python: slow nested for loop

I need to find an optimal selection of media, based on certain constraints. I am doing it with FOUR nested for loops, and since it takes about O(n^4) iterations, it is slow. I have been trying to make it faster, but it is still damn slow. My variables can be as high as a couple of thousand.
Here is a small example of what I am trying to do:
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = []
for i in range(max_disks):
    for j in range(max_ssds):
        for k in range(max_tapes):
            for l in range(max_BR):
                # just for example; the actual program does processing here,
                # like checking bandwidth and cost constraints and choosing
                # the allocation based on that
                allocations.append((i, j, k, l))
It wasn't slow for up to hundreds of each media type but would slow down for thousands.
The other way I tried is:
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = [(i,j,k,l) for i in range(max_disks) for j in range(max_ssds) for k in range(max_tapes) for l in range(max_BR)]
This way it is slow even for such small numbers.
Two questions:
Why is the second one slow even for small numbers?
How can I make my program work for big numbers (in thousands)?
Here is the version with itertools.product
import itertools

max_disks = 500
max_ssds = 100
max_tapes = 100
max_BR = 100
# allocations = []
for i, j, k, l in itertools.product(range(max_disks), range(max_ssds),
                                    range(max_tapes), range(max_BR)):
    pass
It takes 19.8 seconds to finish with these numbers.
From the comments, I got that you're working on a problem that can be rewritten as an ILP. You have several constraints, and need to find a (near) optimal solution.
Now, ILPs are quite difficult to solve, and brute-forcing them quickly becomes intractable (as you've already witnessed). This is why there are several really clever algorithms used in the industry that truly work magic.
For Python, there are quite a few interfaces that hook up to modern solvers; for more details, see e.g. this SO post. You could also consider using an optimizer, like SciPy optimize, but those generally don't do integer programming.
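As a rough sketch of what that looks like, here is a tiny model using PuLP (one such interface). The unit costs, bandwidths, and requirements are made up and stand in for your real constraints:

import pulp

# Hypothetical per-unit figures; substitute your real cost/bandwidth model.
prob = pulp.LpProblem("media_allocation", pulp.LpMinimize)
disks = pulp.LpVariable("disks", lowBound=0, upBound=2000, cat="Integer")
ssds  = pulp.LpVariable("ssds", lowBound=0, upBound=2000, cat="Integer")
tapes = pulp.LpVariable("tapes", lowBound=0, upBound=2000, cat="Integer")
br    = pulp.LpVariable("blu_rays", lowBound=0, upBound=2000, cat="Integer")

# Objective: minimize total cost (made-up unit prices).
prob += 50 * disks + 120 * ssds + 25 * tapes + 5 * br

# Constraints: meet made-up bandwidth and capacity requirements.
prob += 150 * disks + 500 * ssds + 100 * tapes + 10 * br >= 10_000
prob += 4 * disks + 1 * ssds + 8 * tapes + 0.05 * br >= 100

prob.solve()
print(pulp.LpStatus[prob.status])
for v in prob.variables():
    print(v.name, "=", v.varValue)

A solver explores this search space far more cleverly than four nested loops ever could.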
Doing any operation in Python a trillion times is going to be slow. However, that's not all you're doing. By attempting to store all the trillion items in a single list you are storing lots of data in memory and manipulating it in a way that creates a lot of work for the computer to swap memory in and out once it no longer fits in RAM.
The way Python lists work is that they allocate some amount of memory to store the items in the list. When you fill up the list and it needs more room, Python allocates a larger block (CPython over-allocates in proportion to the current size) and copies all the old entries into the new storage space. This is fine as long as everything fits in memory: even though it has to copy the whole list each time it expands the storage, it does so less and less frequently because the allocation grows geometrically. The problem comes when it runs out of memory and has to swap unused memory out to disk. The next time it tries to resize the list, it has to reload from disk all the entries that have been swapped out, and then swap them back out again to make room to write the new entries. This creates lots of slow disk operations that get in the way of your task and slow it down even more.
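You can watch the over-allocation happen with sys.getsizeof; the exact growth pattern is a CPython implementation detail, but the jumps in allocated size are clearly visible:

import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(2_000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # a reallocation just happened: capacity grew and existing items were copied
        print(f"len={len(lst):5d}  allocated={size} bytes")
        last_size = size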
Do you really need to store every item in a list? What are you going to do with them when you're done? You could perhaps write them out to disk as you're going instead of accumulating them in a giant list, though if you have a trillion of them, that's still a very large amount of data! Or perhaps you're filtering most of them out? That will help.
All that said, without seeing the actual program itself, it's hard to know if you have a hope of completing this work by an exhaustive search. Can all the variables be on the thousands scale at once? Do you really need to consider every combination of these variables? When max_disks==2000, do you really need to distinguish the results for i=1731 from i=1732? For example, perhaps you could consider values of i 1,2,3,4,5,10,20,30,40,50,100,200,300,500,1000,2000? Or perhaps there's a mathematical solution instead? Are you just counting items?
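If that coarser grid is acceptable, a sketch might look like the following (the candidate lists are just examples); it shrinks the search from billions of combinations to under ten thousand:

import itertools

# Hand-picked coarse candidates instead of every value up to the maximum.
disk_candidates = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 100, 200, 300, 500, 1000, 2000]
ssd_candidates  = disk_candidates
tape_candidates = [1, 2, 5, 10, 50, 100]
br_candidates   = [1, 2, 5, 10, 50, 100]

best = None
for i, j, k, l in itertools.product(disk_candidates, ssd_candidates,
                                    tape_candidates, br_candidates):
    # ... check bandwidth and cost constraints here and keep the best allocation ...
    pass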

How to optimise code for cartesian product of HUGE lists in python

I have only been programming for a few months, but I have done the research and attempted this code.
I currently have 2 files. The first contains ±3 million pairs of protein IDs (strings).
The second contains an enumerated list of each protein, with a unique number assigned to it for each feature it contains: i.e. if proteinA contains 3 features, it will appear as proteinA_1, proteinA_2, proteinA_3. Some proteins can have up to 3000 features.
I want a list of pairs of feature interactions.
My code so far:
import csv, itertools, gzip
from collections import Counter

# opens and reads/writes files using csv and gzip
# (cfile1 and cfile2 are the csv readers; cout is the csv writer)

# 1. Count how many features each protein has in the second file.
cnt = Counter()
for row in cfile1:
    cnt[row[0]] += 1

# 2. Consider pairs of interacting proteins
for row in cfile2:
    p1 = row[0]; p2 = row[1]
    # 3.1. if both proteins have no features, just write the pair to the new file
    if cnt[p1] == 0 and cnt[p2] == 0:
        cout.writerow([p1, p2])
    # 3.2. if one protein has no features but the other does, write e.g.
    #      (p1_1,p2) (p1_2,p2) (p1_3,p2) ... (p1_k,p2)
    elif cnt[p1] != 0 and cnt[p2] == 0:
        x = cnt[p1]
        for i in range(1, x + 1):
            p1n = p1 + "_%d" % (i)
            cout.writerow([p1n, p2])
    elif cnt[p1] == 0 and cnt[p2] != 0:
        x = cnt[p2]
        for i in range(1, x + 1):
            p2n = p2 + "_%d" % (i)
            cout.writerow([p1, p2n])
    # 3.3. if both proteins have features, create lists of the enumerated proteins,
    #      then take the cartesian product of the two lists to get all possible
    #      feature-feature interactions
    elif (cnt[p1] != 0) and (cnt[p2] != 0):
        x = cnt[p1]; y = cnt[p2]
        xprots = []; yprots = []
        for i in range(1, x + 1):
            p1n = p1 + "_%d" % (i)
            xprots.append(p1n)
        for i in range(1, y + 1):
            p2n = p2 + "_%d" % (i)
            yprots.append(p2n)
        for i in itertools.product(xprots, yprots):
            cout.writerow([i[0], i[1]])
The code seems to be working correctly, but it's taken about 18 hours to get through the first 150,000 pairs. So far there are nearly 2 billion interactions in the output file.
Is there any way, other than maybe cutting out some of the features, to speed this up? Any tips would be greatly appreciated!
Thanks in advance
It sounds like the problem is inherent in what you are trying to do, which will take an extraordinarily long time even with the fastest, most optimized low-level C program.
it's taken about 18 hours to get through the first 150,000 pairs. So far there are nearly 2 billion interactions in the output file.
Let's look at the numbers. You say there are 3 million protein pairs, and each protein can have up to 3000 features. So the total number of lines in the output will be roughly (3 million) * (3000)^2, which is 27 trillion. It looks like each line will contain at least 10 characters (bytes), so we are talking about approximately 270 terabytes of output.
I doubt your disk is even large enough to store such a file. You need to rethink what you are trying to do; even a 1000x speedup in this code will not change the size of the output, and your program cannot finish any faster than it takes simply to write that much data. If you really need all that output, your problem may be better suited to parallel computing on a supercomputer or cluster, which will require specialized programming depending on the architecture.
I think there is little room left for optimization in the Python code itself. I would recommend trying to write an extension in C or C++, which can raise the performance of the computation by an order of magnitude.
Writing Python extensions is a well-documented process and works very well for computation-intensive applications.
The documentation site on this issue is:
https://docs.python.org/2/extending/
Hope it helps!

BST or Hash Table?

I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.
Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportional to n; however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average over all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from, say, 1,000,000 to 1,500,000 elements, that insert will take a lot of time, but you have now bought yourself 500,000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
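A quick back-of-the-envelope check of that claim (a toy model, not real hash-table code): count the element copies a table makes over n inserts when it doubles whenever it fills; the total stays below 2n, i.e. O(1) amortized per insert.

def count_copies(n):
    """Toy model: a table that doubles its capacity whenever it fills up."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size        # resizing copies every existing element
            capacity *= 2
        size += 1
    return copies

n = 1_000_000
print(f"{count_copies(n)} copies for {n} inserts")   # 1048575 copies, i.e. < 2n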
You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:
n = n * 2;
bucket = realloc(bucket, sizeof(bucket[0]) * n);   /* grow the bucket array */
for (i = 0, j = n / 2; j < n; i++, j++) {
    bucket[j] = bucket[i];   /* new buckets point back at their old counterparts */
}
library in C that I hope to make available as a Python module
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault
scrunching each input row:
for i, field in enumerate(fields):
    fields[i] = string_sharer(field, field)
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".
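As a quick sanity check (with made-up rows): on CPython, strings parsed at runtime are distinct objects, but after scrunching, equal values point at a single shared string object.

shared_table = {}
string_sharer = shared_table.setdefault

# Strings built at runtime (e.g. parsed from a file) are distinct objects...
raw = "Acme Corp,2024-01-01\nAcme Corp,2024-01-02"
rows = [line.split(",") for line in raw.splitlines()]
assert rows[0][0] is not rows[1][0]

# ...but after scrunching, equal values share one object.
for fields in rows:
    for i, field in enumerate(fields):
        fields[i] = string_sharer(field, field)
assert rows[0][0] is rows[1][0]   # both rows now reference the same "Acme Corp"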
