Efficiency with very large numpy arrays - python

I'm working with some very large arrays. An issue that I'm dealing with of course is running out of RAM to work with, but even before that my code is running slowly so that, even if I had infinite RAM, it would still take way too long. I'll give a bit of my code to show what I'm trying to do:
#samplez is a 3 million element 1-D array
#zfit is a 10,000 x 500 2-D array
b = np.arange(len(zfit))

for x in samplez:
    a = x - zfit
    mask = np.ma.masked_array(a)
    mask[a <= 0] = np.ma.masked
    index = mask.argmin(axis=1)
    # These past 4 lines give me an index array of the smallest positive number
    # in x - zfit
    d = zfit[b, index]
    e = zfit[b, index+1]
    f = (x - d) / (e - d)
    # f is the calculation I am after
    if x == samplez[0]:
        g = f
        index_stack = index
    else:
        g = np.vstack((g, f))
        index_stack = np.vstack((index_stack, index))
I need to use g and index_stack, each of which is a 3 million x 10,000 2-D array, in a further calculation. Each iteration of this loop takes almost 1 second, so 3 million seconds total, which is way too long.
Is there anything I can do to make this calculation run much faster? I've tried to think of a way to do without the for loop, but the only way I can imagine is making 3 million copies of zfit, which is unfeasible.
And is there some way I can work with these arrays without keeping everything in RAM? I'm a beginner, and everything I've found searching about this is either irrelevant or something I can't understand. Thanks in advance.

It is good to know that the smallest positive number will never show up at the end of a row.
In samplez there are 1 million unique values, but in zfit each row can have at most 500 unique values, and the entire zfit can have as many as 50 million unique values. The algorithm can be greatly sped up if the number of 'find the smallest positive number > each_element_in_samplez' calculations can be greatly reduced. Doing all 5e13 comparisons is probably overkill, and careful planning will get rid of a large proportion of them. That will depend heavily on your actual underlying mathematics.
Without knowing it, there are still some small things that can be done: 1) there are not that many possible values of (e-d), so that subtraction can be taken out of the loop; 2) the loop can be replaced by map. These two small fixes, on my machine, result in about a 22% speed-up.
def function_map(samplez, zfit):
    diff = zfit[:,:-1] - zfit[:,1:]
    def _fuc1(x):
        a = x - zfit
        mask = np.ma.masked_array(a)
        mask[a <= 0] = np.ma.masked
        index = mask.argmin(axis=1)
        d = zfit[:,index]
        f = (x-d)/diff[:,index] #constraint: smallest value never at the very end.
        return (index, f)
    result = map(_fuc1, samplez)
    return (np.array([item[1] for item in result]),
            np.array([item[0] for item in result]))
Next: masked_array can be avoided completely (which should bring significant improvement). samplez needs to be sorted as well.
>>> x1 = arange(50)
>>> x2 = random.random(size=(20, 10))*120
>>> x2 = sort(x2, axis=1) #just to make sure the last elements of each col > largest val in x1
>>> x3 = x2*1
>>> f1 = lambda: function_map2(x1, x3)
>>> f0 = lambda: function_map(x1, x2)
>>> def function_map2(samplez, zfit):
...     _diff = diff(zfit, axis=1)
...     _zfit = zfit*1
...     def _fuc1(x):
...         _zfit[_zfit < x] = (+inf)
...         index = nanargmin(_zfit, axis=1)
...         d = zfit[:,index]
...         f = (x-d)/_diff[:,index] #constraint: smallest value never at the very end.
...         return (index, f)
...     result = map(_fuc1, samplez)
...     return (np.array([item[1] for item in result]),
...             np.array([item[0] for item in result]))
>>> import timeit
>>> t1=timeit.Timer('f1()', 'from __main__ import f1')
>>> t0=timeit.Timer('f0()', 'from __main__ import f0')
>>> t0.timeit(5)
0.09083795547485352
>>> t1.timeit(5)
0.05301499366760254
>>> t0.timeit(50)
0.8838210105895996
>>> t1.timeit(50)
0.5063929557800293
>>> t0.timeit(500)
8.900799036026001
>>> t1.timeit(500)
4.614129018783569
So, that is another 50% speed-up.
masked_array is avoided, which saves some RAM. I can't think of anything else to reduce RAM usage; it may be necessary to process samplez in parts. Also, depending on the data and the required precision, using float16 or float32 instead of the default float64 can save you a lot of RAM.
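To make "processing samplez in parts" concrete, here is a minimal sketch of a chunked driver, assuming function_map2 above is in scope (under Python 3 its map(...) call would need to be wrapped in list(...)); the chunk size and the np.save file names are only illustrative:

import numpy as np

def process_in_chunks(samplez, zfit, chunk_size=10000, out_prefix='part'):
    # Sort once so every chunk handed to function_map2 is itself sorted.
    samplez = np.sort(samplez).astype(np.float32)  # float32 halves RAM vs the default float64
    zfit = zfit.astype(np.float32)
    for i in range(0, len(samplez), chunk_size):
        f_part, index_part = function_map2(samplez[i:i + chunk_size], zfit)
        # Write each piece to disk so the full 3-million-row results never
        # have to sit in memory at once.
        np.save('%s_f_%d.npy' % (out_prefix, i // chunk_size), f_part)
        np.save('%s_index_%d.npy' % (out_prefix, i // chunk_size), index_part)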

Related

Python similarity on sets of strings via Pandas crashes memory. How can I make it work?

I'm struggling to get my Python code to run, as I always run out of memory.
I have a data frame with a key column and a features column; each features entry is a set of at most 10 strings, none of which contain spaces. In this example I have about 70k rows:
key features
0 String A {'Thisisastring', 'Thisisanothersentence', ... 'Maximumof10Strings'}
1 String B {'Hellothere', 'Woop', ... 'Maxiningoutat10Strings'}
2 String C {'Yessir', 'Stackovervlowisawesome', ... 'Maximumof10Strings'}
...
70000 String XY {'Aintnostring', 'Maybeitis', ... 'pleasehelpme'}
...
Now what I want to do is compare each feature set with all the other feature sets and get their similarity. The similarity score is simple: if 5 of the 10 strings are the same, the score should be 0.5, and so on:
def similarity_score(a, b):
    c = a.intersection(b)
    return 2 * float(len(c)) / (len(a) + len(b))
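For example (with illustrative set contents), two 10-element sets that share 5 strings score 0.5, as described above:

a = {'s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10'}
b = {'s1', 's2', 's3', 's4', 's5', 't6', 't7', 't8', 't9', 't10'}
print(similarity_score(a, b))  # 2 * 5 / (10 + 10) = 0.5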
This is the current code; at the end I want a matrix, so that I can easily cluster the keys together based on a similarity-score threshold:
base_pd = original_pd['features']
i = base_pd.apply(frozenset, 1).to_frame()
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop('foo', 1)
k.columns = ['A', 'B']
fnc = np.vectorize(similarity_score)
y = fnc(k['A'], k['B']).reshape(len(base_pd), len(base_pd))
keys = original_pd['key'].to_list()
df = pd.DataFrame(data=y, index=keys, columns=keys)
The issue, though, is that this wipes out my memory and uses more than 25 GB quite early on. Obviously that is partly due to the sheer amount of data at 70k rows, but I may need to handle even more rows, so I need to find a solution.
I've already tried to work around it a bit with NumPy, but I'm not getting anywhere.
How could I make this more efficient? The features have to be strings originally; I could obviously change them to hashes or the like, but even then I am a bit lost.
Best and thanks in advance,
Lukas
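One way to keep the memory bounded (a rough sketch of one possible direction, not taken from this thread): skip the full 70k x 70k cross-join and dense matrix, and keep only the (i, j, score) pairs that pass the clustering threshold. The function and variable names below are illustrative; this stays memory-light, but it still performs O(n^2) pairwise comparisons in time.

def similar_pairs(feature_sets, threshold=0.5):
    # feature_sets: a plain list of Python sets, e.g. original_pd['features'].tolist()
    sizes = [len(s) for s in feature_sets]
    n = len(feature_sets)
    hits = []
    for i in range(n):
        a = feature_sets[i]
        for j in range(i + 1, n):
            c = len(a & feature_sets[j])
            score = 2.0 * c / (sizes[i] + sizes[j])
            if score >= threshold:
                hits.append((i, j, score))
    return hits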

Efficient Way to Repeatedly Split Large NumPy Array and Record Middle

I have a large NumPy array nodes = np.arange(100_000_000) and I need to rearrange this array by:
1. Record and then remove the middle value of the array
2. Split the array into its left half and right half
3. Repeat steps 1-2 for each half
4. Stop when all values are exhausted
So, for a smaller input example nodes = np.arange(10), the output would be:
[5 2 8 1 4 7 9 0 3 6]
This was accomplished by naively doing:
import numpy as np

def split(node, out):
    mid = len(node) // 2
    out.append(node[mid])
    return node[:mid], node[mid+1:]

def reorder(a):
    nodes = [a.tolist()]
    out = []
    while nodes:
        tmp = []
        for node in nodes:
            for n in split(node, out):
                if n:
                    tmp.append(n)
        nodes = tmp
    return np.array(out)

if __name__ == "__main__":
    nodes = np.arange(10)
    print(reorder(nodes))
However, this is way too slow for nodes = np.arange(100_000_000) and so I am looking for a much faster solution.
You can vectorize your function with Numpy by working on groups of slices.
Here is an implementation:
import numpy as np

# Similar to [e for tmp in zip(a, b) for e in tmp],
# but on Numpy arrays and much faster
def interleave(a, b):
    assert len(a) == len(b)
    return np.column_stack((a, b)).reshape(len(a) * 2)

# n is the length of the input range (len(a) in your example)
def fast_reorder(n):
    if n == 0:
        return np.empty(0, dtype=np.int32)
    startSlices = np.array([0], dtype=np.int32)
    endSlices = np.array([n], dtype=np.int32)
    allMidSlices = np.empty(n, dtype=np.int32)  # Similar to "out" in your implementation
    midInsertCount = 0                          # Actual size of allMidSlices
    # Generate a bunch of middle values as long as there are valid slices to split
    while midInsertCount < n:
        # Generate the new mid/left/right slices
        midSlices = (endSlices + startSlices) // 2
        # Computing the next slices is not needed for the last step
        if midInsertCount + len(midSlices) < n:
            # Generate the next slices (possibly with invalid ones)
            newStartSlices = interleave(startSlices, midSlices+1)
            newEndSlices = interleave(midSlices, endSlices)
            # Discard invalid slices
            isValidSlices = newStartSlices < newEndSlices
            startSlices = newStartSlices[isValidSlices]
            endSlices = newEndSlices[isValidSlices]
        # Fast appending
        allMidSlices[midInsertCount:midInsertCount+len(midSlices)] = midSlices
        midInsertCount += len(midSlices)
    return allMidSlices[0:midInsertCount]
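As a quick sanity check, the vectorized version reproduces the expected ordering for the np.arange(10) example from the question:

print(fast_reorder(10))
# [5 2 8 1 4 7 9 0 3 6]  -- same as reorder(np.arange(10)) above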
On my machine, this is 89 times faster than your scalar implementation with the input np.arange(100_000_000), dropping from 2min35 to 1.75s. It also consumes far less memory (roughly 3 to 4 times less). Note that if you want even faster code, then you probably need to use a native language like C or C++.
Edit:
The question has been updated to use a much smaller input array, so I leave the text below for historical reasons. It was likely a typo in the first place, but we often get accustomed to computers working with insanely large numbers, and when memory is involved they can become a real problem.
There is already a NumPy-based solution submitted by someone else that I think fits the bill.
Your code requires an insane amount of RAM just to hold 100 billion 64-bit integers. Do you have 800 GB of RAM? You then convert the NumPy array to a list, which will be substantially larger than the array (each packed 64-bit int in the NumPy array becomes a much less memory-efficient Python int object, and the list holds a pointer to each of those objects). Then you make a lot of slices of the list, which will not duplicate the data but will duplicate the pointers to the data and use even more RAM. You also append the result values to a list one value at a time. Lists are generally fast for appending, but at such an extreme size this will not only be slow, the way the list is allocated is likely to be extremely wasteful of RAM and contribute to major problems (I believe they double in size when they reach a certain level of fullness, so you will end up allocating more RAM than you need and doing many allocations and likely copies). What kind of machine are you running this on? There are ways to improve your code, but unless you're running it on a supercomputer I don't know that you're ever going to finish that calculation. I only have 32 GB of RAM, and I'm not going to even try to create a 100-billion-element int64 NumPy array, as I don't want to use up SSD write life on a mass of virtual memory.
As for improving your code: stick to NumPy arrays and don't convert to a Python list, which will greatly increase the RAM you need. Preallocate a NumPy array to put the answer in. Then you need a new algorithm. Anything recursive, or recursive-like (i.e., a loop splitting the input), will require tracking a lot of state; your nodes list is going to be extraordinarily large and again use a lot of RAM. You could use len(a) to mark values that have been removed and scan through the entire array each time to figure out what to do next, but that trades RAM for a tremendous amount of searching through a gigantic array. I feel like there is an algorithm that cuts numbers from each end and places them in the output while tracking only the beginning and end, but I haven't figured it out, at least not yet.
I also think there is a simpler algorithm where you just track the number of splits you've done instead of building a giant list of slices and keeping it all in memory. Take the middle of the left half, then the middle of the right, then count up one; when you take the middle of the left half's left half, you know you have to jump to the right half, then the count is one, so you jump over to the original right half's left half, and so on. Based on the depth into the halves and the length of the input, you should be able to jump around without scanning or tracking all of those slices, though I haven't been able to dedicate much time to thinking this through.
With a problem of this nature, if you really need to push the limits, you should consider using C/C++, so you can be as efficient as possible with RAM usage, and because you're doing an insane number of tiny operations that don't map well to Python's performance.

Why does np.add.at() return the wrong answer for large arrays?

I have a large data set, statistic, with statistic.shape = (1E10,), that I want to effectively bin (sum) into an array of zeros, out = np.zeros(1E10). Each entry in statistic has a corresponding index, idx, which tells me which out bin it belongs in. The indices are not unique, so I cannot use out[idx] += statistic, since that only counts the first time a particular index is encountered. Therefore I'm using np.add.at(out, idx, statistic). My problem is that for very large arrays, np.add.at() returns the wrong answer.
Below is an example script that shows this behaviour. The function check_add() should return 1.
import numpy as np

def check_add(N):
    N = int(N)
    out = np.zeros(N)
    np.add.at(out, np.arange(N), np.ones(N))
    return np.sum(out)/N

n_arr = [1E3, 1E5, 1E8, 1E10]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
This example returns for me:
N = 1000.0 (log(N) = 3.0); output ratio is 1.0
N = 100000.0 (log(N) = 5.0); output ratio is 1.0
N = 100000000.0 (log(N) = 8.0); output ratio is 1.0
N = 10000000000.0 (log(N) = 10.0); output ratio is 0.1410065408
Can someone explain to me why the function fails for N=1E10?
This is an old bug, NumPy issue 13286. ufunc.at was using a too-small variable for the loop counter. It got fixed a while ago, so update your NumPy. (The fix is present in 1.16.3 and up.)
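If updating NumPy is not an option, np.bincount is a common alternative for this kind of indexed summation; here is a minimal sketch with made-up toy values for idx and statistic (it assumes idx holds non-negative integers):

import numpy as np

idx = np.array([0, 2, 2, 3])                  # illustrative indices, not unique
statistic = np.array([1.0, 0.5, 0.25, 2.0])   # illustrative values
N = 5

out = np.bincount(idx, weights=statistic, minlength=N)
print(out)  # sums per bin: [1.0, 0.0, 0.75, 2.0, 0.0]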
You're overflowing int32:
1E10 % (np.iinfo(np.int32).max - np.iinfo(np.int32).min + 1) # + 1 for 0
Out[]: 1410065408
There's your weird number (googling that number actually got me here, which is how I figured this out).
Now, what's happening in your function is a bit weirder. Per the documentation of ufunc.at, you should just be accumulate-adding the 1 values at the indices lower than np.iinfo(np.int32).max and at the negative indices above np.iinfo(np.int32).min - but it seems to be 1) working backwards and 2) stopping when it gets to the last overflow. Without digging into the C code I couldn't tell you why, but it's probably a good thing it does: had it done things that way, your function would have failed silently with the "correct" mean while corrupting your results (putting 2s or 3s at those indices and 0 in the middle).
It is most likely due to integer precision indeed. If you play around with the NumPy data type (e.g. constrain it to an unsigned value between 0-255 by setting uint8), you will see that the ratios start declining already for the second array. I do not have enough memory to test it, but setting all dtypes to uint64, as below, should help:
def check_add(N):
    N = int(N)
    out = np.zeros(N, dtype='uint64')
    np.add.at(out, np.arange(N, dtype='uint64'), 1)
    return np.sum(out)/N
To understand the behavior, I recommend setting dtype='uint8' and checking the behavior for smaller N. What happens is that np.arange creates ascending integers for the vector elements until it reaches the integer limit, then wraps around to 0 and counts up again. So for smaller N you still get the correct sum (although your out vector contains many elements > 1 in positions 0:limit and only zeros beyond the limit). If, however, you choose N large enough, the elements in your out vector themselves start exceeding the integer limit and wrap around to 0, and as soon as that happens your sum is vastly off. To double-check: the uint8 limit is 255 (256 integers) and 256^2 = 65536, so with N = 65536 and dtype='uint8', check_add(65536) will return 0.
import numpy as np

def check_add(N):
    N = int(N)
    out = np.zeros(N, dtype='uint8')
    np.add.at(out, np.arange(N, dtype='uint8'), 1)
    return np.sum(out)/N

n_arr = [1E1, 1E3, 1E5, 65536, 1E7]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
Also note that you don't need the np.ones vector; you can simply replace it with 1 if all you care about is uniformly incrementing everything by 1.
Guessing, as I couldn't run it, but could the problem be that you are exceeding the maximum integer value in Python for the last option, i.e. exceeding 2147483647?
Use a long integer type instead, as per below.
Reference: https://docs.python.org/2.0/ref/integers.html
Hope this helps. Please let me know if it works.

Python: Number ranges that are extremely large?

import time
from random import shuffle

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.
To get a random permutation of the range [0, n) in a memory-efficient manner, you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
If you need only a small fraction of the values from the range, e.g., to get k random values from the range [0, n):
import random
from functools import partial

def sample(n, k):
    # assume n is much larger than k
    randbelow = partial(random.randrange, n)
    # from random.py
    result = [None] * k
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow()
        while j in selected:
            j = randbelow()
        selected_add(j)
        result[i] = j
    return result

print(sample(10**100, 10))
print(sample(10**100, 10))
If you don't need the full list of numbers (and if you are generating billions, it's hard to imagine why you would need them all), you might be better off taking a random.sample of your number range rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random

def get10kRandomNumbers(maximum):
    pop = range(1, maximum+1)  # this is memory efficient in Python 3
    sample = random.sample(pop, 10000)
    return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.
An important point to note is that it will be impossible for a computer to hold the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than typical RAM sizes (it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integers, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (say, smaller than 0.5 billion), then the list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard array module:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)
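As a rough sanity check of the footprint (assuming the common case where the 'I' typecode is 4 bytes on your platform):

print numbers.itemsize                 # typically 4 bytes per item
print numbers.itemsize * len(numbers)  # ~400000000 bytes (~400 MB) for 10**8 items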

Python: Conditional elements in array

A question from a complete Python novice.
I have a column array where I need to force certain values to zero depending on a conditional statement applied to another array. I have found two solutions, which both provide the correct answer. But they are both quite time consuming for the larger arrays I typically need (>1E6 elements) - also I suspect that it is poor programming technique. The two versions are:
from numpy import zeros, abs, multiply, array, reshape

def testA(y, f, FC1, FC2):
    c = zeros((len(f), 1))
    for n in xrange(len(f)):
        if abs(f[n,0]) >= FC1 and abs(f[n,0]) <= FC2:
            c[n,0] = 1.
    w = multiply(c, y)
    return w

def testB(y, f, FC1, FC2):
    z = [(abs(f[n,0]) >= FC1 and abs(f[n,0]) <= FC2) for n in xrange(len(f))]
    z = multiply(array(z, dtype=float).reshape(len(f), 1), y)
    return z
The input arrays are column arrays as this matches the post processing to be done. The test can be done like:
>>> from numpy import arange
>>> from numpy.random import normal as randn
>>> fs, N = 1.E3, 2**22
>>> f = fs/N*arange(N).reshape((N,1))
>>> x = randn(size=(N,1))
>>> w1 = testA(x,f,200.,550.)
>>> z1 = testB(x,f,200.,550.)
On my laptop testA takes 18.7 seconds and testB takes 19.3 - both for N=2**22. In testB I also tried to include "z = [None]*len(f)" to preallocate as suggested in another thread but this doesn't really make any difference.
I have two questions, which I hope to have the same answer:
What is the "correct" Python solution to this problem?
Is there anything I can do to get the answer faster?
I have deliberately not spent any time on compiled Python, for example - I wanted to have some working code first, and hopefully something that is good Python style. I hope to get the execution time for N=2**22 below two seconds or so. This particular operation will be used many times, so the execution time does matter.
I apologize in advance if the question is stupid - I haven't been able to find an answer in the overwhelming amount of not-always-easily-accessible Python documentation or in another thread.
Use a boolean array to index into y:
def testC(y, f, FC1, FC2):
    f2 = abs(f)
    idx = (f2 >= FC1) & (f2 <= FC2)
    y[~idx] = 0
    return y
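For example, with the test arrays from the question (note that testC zeroes y in place, hence the copy), the result matches testA:
>>> w2 = testC(x.copy(), f, 200., 550.)
>>> (w2 == w1).all()
True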
All of the following are slower than HYRY's solution (testC above) by a large factor:
How about
( x[1] if FC1<=abs(x[0])<=FC2 else 0 for x in itertools.izip(f,x) )
If you need to do random access (very slow)
[ x[1] if FC1<=abs(x[0])<=FC2 else 0 for x in itertools.izip(f,x) ]
or you can also use map
map(lambda x: x[1] if FC1<=abs(x[0])<=FC2 else 0, itertools.izip(f,x))
or using vectorize (faster than A and B but much much slower than C)
b1v = np.vectorize(lambda a,b: a if 200<=abs(b)<=550 else 0)
b1 = b1v(f,x)
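If you want to measure the difference on your own machine, the three variants can be timed side by side with timeit (a sketch, Python 2 style to match the code above; actual numbers will vary):

import timeit

setup = 'from __main__ import testA, testB, testC, x, f'
for stmt in ('testA(x.copy(), f, 200., 550.)',
             'testB(x.copy(), f, 200., 550.)',
             'testC(x.copy(), f, 200., 550.)'):
    # x.copy() keeps testC's in-place zeroing from affecting later runs
    print stmt, ':', timeit.timeit(stmt, setup=setup, number=1)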
