python arrays, huge memory consumption - python

I have a huge file in which the first row is a string and the other rows contain integers. The number of columns is variable and depends on the row.
I have one global list where I save my intermediate results. arr_of_arr is a list of lists of floats. The length of arr_of_arr is about 5000. Each element (again a list) has a length from 100,000 to 10,000,000. The maximum length of the elements might vary, so I cannot extend each element in advance when I create arr_of_arr.
After I have processed the whole file, I artificially pad the shorter elements with zeros, compute the mean over the elements of the global list, and plot it. max_size_arr is the length of the longest element (I compute it while iterating over the lines in the file).
arr = [x+[0]*(max_size_arr - len(x)) for x in arr_of_arr]
I need to calculate the means across arrays element wise.
For example,
[[1,2,3],[4,5,6],[0,2,10]] would result in [5/3,9/3,19/3] (mean of the first elements across arrays, mean of the second elements across arrays, etc.)
arr = np.mean(np.array(arr),axis=0)
However, this results in huge memory consumption (about 100GB according to the cluster information). What would be a good data structure to reduce memory consumption? Would numpy arrays be lighter than normal Python lists?

I think the huge memory consumption comes from holding the whole padded array in memory at once.
Why not use slices combined with numpy arrays? That way you can process your data in batches. Pass the batch size (1000 or 10000 arrays) to a function, calculate the means, and write the results to a dict or a file, recording each slice together with its means.
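A minimal sketch of that idea, assuming the rows can be consumed one at a time (for example while reading the file). Only a running sum as long as the longest row seen so far is kept in memory, instead of a padded matrix of about 5000 x max_size_arr values. The function name padded_column_mean is hypothetical; dividing by the number of rows reproduces the zero-padding behaviour from the question.

```python
import numpy as np

def padded_column_mean(rows):
    """Element-wise mean across rows, treating missing trailing values as 0."""
    total = np.zeros(0, dtype=np.float64)  # running sum per position
    n_rows = 0
    for row in rows:
        values = np.asarray(row, dtype=np.float64)
        if values.size > total.size:
            # Grow the accumulator when a longer row shows up.
            total = np.concatenate([total, np.zeros(values.size - total.size)])
        total[:values.size] += values
        n_rows += 1
    if n_rows == 0:
        return total
    return total / n_rows

result = padded_column_mean([[1, 2, 3], [4, 5, 6], [0, 2, 10]])
```

Because rows is only iterated once, it could just as well be a generator that parses one line of the file at a time.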

Have you tried using the Numba package? It reduces computation time and memory usage with standard numpy arrays.
http://numba.pydata.org

If the lines vary widely in the number of values, I'd stick with a list of lists, as long as is practical. numpy arrays are best when the 'row' lengths are the same.
To illustrate with a small example:
In [453]: list_of_lists=[[1,2,3],[4,5,6,7,8,9],[0],[1,2]]
In [454]: list_of_lists
Out[454]: [[1, 2, 3], [4, 5, 6, 7, 8, 9], [0], [1, 2]]
In [455]: [len(x) for x in list_of_lists]
Out[455]: [3, 6, 1, 2]
In [456]: [sum(x) for x in list_of_lists]
Out[456]: [6, 39, 0, 3]
In [458]: [sum(x)/float(len(x)) for x in list_of_lists]
Out[458]: [2.0, 6.5, 0.0, 1.5]
With your array approach to taking the mean, I get different numbers - because of all the padding 0s. Is that intentional?
In [460]: max_len=6
In [461]: arr=[x+[0]*(max_len-len(x)) for x in list_of_lists]
In [462]: arr
Out[462]:
[[1, 2, 3, 0, 0, 0],
[4, 5, 6, 7, 8, 9],
[0, 0, 0, 0, 0, 0],
[1, 2, 0, 0, 0, 0]]
mean along columns?
In [463]: np.mean(np.array(arr),axis=0)
Out[463]: array([ 1.5 , 2.25, 2.25, 1.75, 2. , 2.25])
mean along rows
In [476]: np.mean(np.array(arr),axis=1)
Out[476]: array([ 1. , 6.5, 0. , 0.5])
list mean with max length:
In [477]: [sum(x)/float(max_len) for x in list_of_lists]
Out[477]: [1.0, 6.5, 0.0, 0.5]
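If the padding zeros are not intentional, one alternative is to divide each position by the number of rows that actually have a value there. The following is only a sketch under that assumption; the function name column_mean_without_padding is made up for illustration.

```python
import numpy as np

def column_mean_without_padding(rows):
    """Mean per position, counting only the rows long enough to reach it."""
    max_len = max((len(row) for row in rows), default=0)
    sums = np.zeros(max_len)
    counts = np.zeros(max_len, dtype=np.int64)
    for row in rows:
        n = len(row)
        sums[:n] += row    # add this row's values to the leading positions
        counts[:n] += 1    # remember how many rows contributed to each position
    # Every position below max_len is covered by at least one row, so counts >= 1.
    return sums / counts

list_of_lists = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [0], [1, 2]]
means = column_mean_without_padding(list_of_lists)
```

For the small example this gives 1.5 for the first position (four rows contribute) but 4.5 for the third (only two rows contribute), whereas the padded version gives 2.25 there.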

Related

Concatenate nested list of array with partial empty sublist

The objective is to concatenate a nested list of arrays (i.e., list_arr). However, some of the sublists within list_arr are of length zero.
Simply using np.array or np.asarray on the list_arr does not produce the intended result.
import numpy as np
ncondition=2
nnodes=30
nsbj=6
np.random.seed(0)
# Example of nested list list_arr
list_arr=[[[np.concatenate([[idx_sbj],[ncondi],[nepoch] ,np.random.rand(nnodes)]) for nepoch in range(np.random.randint(5))] \
for ncondi in range(ncondition)] for idx_sbj in range(nsbj)]
The following does not produce the expected concatenate output
test1=np.asarray(list_arr)
test2=np.array(list_arr)
test3= np.vstack(list_arr)
The expected output is an array of shape (15,33)
OK, my curiosity got the better of me.
Make an object dtype array from the list:
In [280]: arr=np.array(list_arr,object)
In [281]: arr.shape
Out[281]: (6, 2)
All elements of this array are lists, with len:
In [282]: np.frompyfunc(len,1,1)(arr)
Out[282]:
array([[4, 1],
[0, 2],
[0, 2],
[0, 0],
[2, 3],
[1, 0]], dtype=object)
Looking at specific sublists. One has two empty lists
In [283]: list_arr[3]
Out[283]: [[], []]
others have one empty list, either first or second:
In [284]: list_arr[-1]
Out[284]:
[[array([5. , 0. , 0. , 0.3681024 , 0.3127533 ,
0.80183615, 0.07044719, 0.68357296, 0.38072924, 0.63393096,
...])],
[]]
and some have lists of differing numbers of arrays:
If I add up the numbers in [282] I get 15, so that must be where you get the (15,33). And presumably all the arrays have the same length.
The outer layer of nesting isn't relevant, so we can ravel and remove it.
In [295]: alist = arr.ravel().tolist()
then filter out the empty lists, and apply vstack to the remaining:
In [296]: alist = [np.vstack(x) for x in alist if x]
In [297]: len(alist)
Out[297]: 7
and one more vstack to join those:
In [298]: arr2 = np.vstack(alist)
In [299]: arr2.shape
Out[299]: (15, 33)
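The same flattening can also be sketched without the object array, by walking the three nesting levels in a single comprehension; empty sublists then simply contribute nothing. The data below is a small hypothetical stand-in for list_arr (three subjects, two conditions, arrays of length 33), not the original random example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for list_arr: subjects -> conditions -> list of 1-D arrays.
epochs_per_cell = [(4, 1), (0, 2), (0, 0)]
nested = [[[rng.random(33) for _ in range(n)] for n in cell]
          for cell in epochs_per_cell]

# Flatten all three levels; empty condition lists are skipped automatically.
rows = [row for subject in nested for condition in subject for row in condition]

# np.vstack raises on an empty list, so guard that case explicitly.
stacked = np.vstack(rows) if rows else np.empty((0, 33))
print(stacked.shape)  # (7, 33)
```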

Index values to a vector with numpy in python

I have created a vector of zeros called Qc_vector (18 rows x 1 column).
I have created another vector called s_vector (6 rows x 1 column) that is generated each time by a for loop within the range ingreso_datos, that is, for this example it is generated 5 times.
I have also created a list called indices that is generated for each iteration of the loop, these indices tell me the row number to which I should index the values from s_vector to Qc_vector
PROBLEM
When trying to do this I get the following error: ValueError: shape mismatch: value array of shape (6,) could not be broadcast to indexing result of shape (6,1)
For element 6 of the matrix ingreso_datos, the indices are: [1,2,3,4,5,6]
For the end of the loop, that is, for element number 5, s_vector looks like this:
[image: s_vector for element 5]
[image: Qc_vector indexed, how it should look]
import numpy as np
# Element 1(i) 2(i) 3(i) 1(j) 2(j) 3(j) x(i) y(i) x(j) y(j) | W(kg/m) Axis(kg/m)
# [Col0] [Col1] [Col2] [Col3] [Col4] [Col5] [Col6] [Col7] [Col8] [Col9] [Col10] | [Col11] [Col12]
ingreso_datos = [[ 1, 13, 14, 15, 7, 8, 9, 0, 0, 0, 2.5, 0, 0],
[ 2, 16, 17, 18, 10, 11, 12, 4.5, 0, 4.5, 2.5, 0, 0],
[ 3, 7, 8, 9, 1, 2, 3, 4.5, 0, 4.5, 2.5, 0, 0],
[ 4, 10, 11, 12, 4, 5, 6, 4.5, 0, 4.5, 2.5, 0, 0],
[ 5, 7, 8, 9, 10, 11, 12, 4.5, 0, 4.5, 2.5, -2200, 0]]
Qc_vector = np.zeros((18,1)) # Vector of zeros
for i in range(len(ingreso_datos)):
    indices = []
    indices.append([ingreso_datos[i][0], ingreso_datos[i][1], ingreso_datos[i][2], ingreso_datos[i][3],
                    ingreso_datos[i][4], ingreso_datos[i][5], ingreso_datos[i][6]])
    for row in indices:
        indices = np.array(row[1:])
    L = np.sqrt((ingreso_datos[i][9]-ingreso_datos[i][7])**2+(ingreso_datos[i][10]-ingreso_datos[i][8])**2)
    lx = (ingreso_datos[i][9]-ingreso_datos[i][7])/L
    ly = (ingreso_datos[i][10]-ingreso_datos[i][8])/L
    w = ingreso_datos[i][11]
    ad = ingreso_datos[i][12]
    s_vector = np.array([ad*L/2, w*L/2, (w*L**2)/12, ad*L/2, w*L/2, (-w*L**2)/12]) # s_vector
    Qc_vector[np.ix_(indices)] = s_vector # Indexing (this line raises the ValueError)
Qc_vector is (18,1).
indices.append([ingreso_datos[i][0], ingreso_datos[i][1], ingreso_datos[i][2], ingreso_datos[i][3], ingreso_datos[i][4], ingreso_datos[i][5], ingreso_datos[i][6]])
or simply (if ingreso_datos is a NumPy array rather than a list of lists):
indices = [ingreso_datos[i,[0,1,2,3,4,5,6]]]
followed by:
for row in indices:
indices = np.array(row[1:])
which is just
ingreso_datos[i,[1,2,3,4,5,6]]
s_vector is a 6 element array, shape (6,)
In:
Qc_vector[np.ix_(indices)] = s_vector
you don't need ix_. In my previous answer I suggested:
master_matrix[np.ix_(indices,indices)] = little_matrix
as a way of doing the indexing for all rows, not just one at a time.
I think your assignment can be simplified to
Qc_vector[indices, 0] = s_vector
That way there's a shape (6,) array on both sides. Alternatively, define Qc_vector with shape (18,) rather than (18,1).
I have a feeling you are still trying to write this code by copying other people's code, without understanding what is happening, or why they suggest things.
A quick fix if you don't want to bother too much would be to use numpy.reshape().
This way you can manage the shape mismatch.
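To make the two suggestions concrete, here is a small sketch with made-up values for s_vector. It assumes the indices in ingreso_datos are 1-based degree-of-freedom numbers (they run from 1 to 18), so 1 is subtracted to get valid positions in an 18-row array; that offset is an assumption, not something stated in the question.

```python
import numpy as np

# Hypothetical values, only to illustrate the shapes involved.
s_vector = np.array([0.0, -1.0, 2.0, 0.0, -1.0, -2.0])   # shape (6,)
indices = np.array([7, 8, 9, 10, 11, 12]) - 1            # assumed 1-based -> 0-based

# Option 1: index the single column explicitly, so both sides have shape (6,).
Qc_vector = np.zeros((18, 1))
Qc_vector[indices, 0] = s_vector

# Option 2: reshape the right-hand side to (6, 1) to match Qc_vector[indices].
Qc_alt = np.zeros((18, 1))
Qc_alt[indices] = s_vector.reshape(-1, 1)

print(np.array_equal(Qc_vector, Qc_alt))  # True
```

Note that plain assignment overwrites whatever an earlier iteration stored at the same indices; whether that or accumulation is wanted depends on the intent of the loop.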

Getting a column index in numpy

I'm pretty new to NumPy and I'm looking for a way to get the index of a current column I'm iterating over in a matrix.
import numpy as np
#sum of elements in each column
def p_b(mtrx):
    b = []
    for c in mtrx.T:
        summ = 0
        for i in c:
            summ += i
        b.append(summ)
    return b
#return modified matrix where each element is equal to itself divided by
#the sum of the current column in the original matrix
def a_div_b(mtrx):
    for c in mtrx:
        for i in c:
            #change i to be i/p_b(mtrx)[index_of_a_current_column]
            pass
    return mtrx
For the input ([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) the result would be
([[1/12, 2/12, 3/12], [4/15, 5/15, 6/15], [7/18, 8/18, 9/18]]).
Any ideas about how I can achieve that?
You don't need those functions and loops to do that; they will not be efficient. When using numpy, go for vectorized operations whenever possible (in most cases it is). numpy's broadcasting rules let you perform mathematical operations between arrays of different dimensions, when the shapes are compatible, so that you can use vectorization, which is much more efficient than Python loops.
In your case, say that your array arr is:
arr = np.arange(1, 10)
arr.shape = (3, 3)
#arr is:
>>> arr
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
you can achieve the desired result with:
res = (arr.T / arr.sum(axis=0)).T
>>> res
array([[0.08333333, 0.16666667, 0.25 ],
[0.26666667, 0.33333333, 0.4 ],
[0.38888889, 0.44444444, 0.5 ]])
numpy's sum lets you sum an array along a given axis when the axis parameter is given. axis=0 is the first axis (the rows), so summing along it collapses the rows and yields one sum per column: [12, 15, 18] here.
.T gives the transposed matrix. You need to transpose to perform the division on the correct axis and then transpose back.
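The double transpose can also be written with an explicit broadcast, which may be easier to read. This sketch reproduces the result expected in the question (row i divided by the sum of column i) and, for comparison, the variant where every element is divided by the sum of its own column; the variable names are only illustrative.

```python
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)
col_sums = arr.sum(axis=0)          # array([12, 15, 18])

# Same as (arr.T / col_sums).T: a (3, 1) column broadcasts across each row.
res = arr / col_sums[:, None]

# If instead each element should be divided by the sum of its own column,
# the (3,) vector broadcasts along the last axis without any reshaping.
by_column = arr / col_sums
```

After the second variant every column of by_column sums to 1.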

How to efficiently sum and mean 2D NumPy arrays by id?

I have a 2d array a and a 1d array b. I want to compute the sum of rows in array a grouped by each id in b. For example:
import numpy as np
a = np.array([[1,2,3],[2,3,4],[4,5,6]])
b = np.array([0,1,0])
count = len(b)
ls = list(set(b))
res = np.zeros((len(ls),a.shape[1]))
for i in ls:
    res[i] = np.array([a[x] for x in range(0,count) if b[x] == i]).sum(axis=0)
print(res)
I got the printed result as:
[[ 5. 7. 9.]
[ 2. 3. 4.]]
What I want to do is, since the 1st and 3rd elements of b are 0, I perform a[0]+a[2], which is [5, 7, 9] as one row of the results. Similarly, the 2nd element of b is 1, so that I perform a[1], which is [2, 3, 4] as another row of the results.
But it seems my implementation is quite slow for large array. Is there any better implementation?
I know there is a bincount function in numpy. But it seems only supports 1d array.
Thank you all for helping me!
The numpy_indexed package (disclaimer: I am its author) was made to address problems exactly of this kind in an efficiently vectorized and general manner:
import numpy_indexed as npi
unique_b, mean_a = npi.group_by(b).mean(a)
Note that this solution is general in the sense that it provides a rich set of standard reduction functions (sum, min, mean, median, argmin, and so on), axis keywords if you need to work with different axes, and also the ability to group by more complicated things than just positive integer arrays, such as the elements of multidimensional arrays of arbitrary dtype.
import numpy_indexed as npi
# this caches the complicated O(NlogN) part of the operations
groups = npi.group_by(b)
# all these subsequent operations have the same low vectorized O(N) cost
unique_b, mean_a = groups.mean(a)
unique_b, sum_a = groups.sum(a)
unique_b, min_a = groups.min(a)
Approach #1
You can use np.add.at, which works for ndarrays of generic dimensions, unlike np.bincount that expects only 1D arrays -
np.add.at(res, b, a)
Sample run -
In [40]: a
Out[40]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [41]: b
Out[41]: array([0, 1, 0])
In [45]: res = np.zeros((b.max()+1, a.shape[1]), dtype=a.dtype)
In [46]: np.add.at(res, b, a)
In [47]: res
Out[47]:
array([[5, 7, 9],
[2, 3, 4]])
To compute mean values, we need to use np.bincount to get the counts per label/tag and then divide with those along each row, like so -
In [49]: res/np.bincount(b)[:,None].astype(float)
Out[49]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
Generalizing to handle b that are not necessarily in sequence from 0, we could make it generic and put in a nice little function to handle summations and averages in a cleaner way, like so -
def groupby_addat(a, b, out="sum"):
    unqb, tags, counts = np.unique(b, return_inverse=1, return_counts=1)
    res = np.zeros((tags.max()+1, a.shape[1]), dtype=a.dtype)
    np.add.at(res, tags, a)
    if out=="mean":
        return unqb, res/counts[:,None].astype(float)
    elif out=="sum":
        return unqb, res
    else:
        print("Invalid output")
        return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [204]: b_ids, means = groupby_addat(a, b, out="mean")
In [205]: b_ids
Out[205]: array([ 5, 10])
In [206]: means
Out[206]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
Approach #2
We could also make use of np.add.reduceat, which might be more performant -
def groupby_addreduceat(a, b, out="sum"):
    sidx = b.argsort()
    sb = b[sidx]
    spt_idx =np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1])+1, [sb.size]))
    sums = np.add.reduceat(a[sidx],spt_idx[:-1])
    if out=="mean":
        counts = spt_idx[1:] - spt_idx[:-1]
        return sb[spt_idx[:-1]], sums/counts[:,None].astype(float)
    elif out=="sum":
        return sb[spt_idx[:-1]], sums
    else:
        print("Invalid output")
        return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [207]: b_ids, means = groupby_addreduceat(a, b, out="mean")
In [208]: b_ids
Out[208]: array([ 5, 10])
In [209]: means
Out[209]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
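Although np.bincount only accepts 1D input, it can still be applied one column at a time through its weights argument. The following is a sketch of that variant, not a benchmarked recommendation; groupby_bincount is a hypothetical name, and the Python-level loop over columns makes it most suitable when a has many rows but few columns.

```python
import numpy as np

def groupby_bincount(a, b):
    """Return (unique ids, per-group sums, per-group means) of the rows of a."""
    unique_ids, tags = np.unique(np.ravel(b), return_inverse=True)
    n_groups = unique_ids.size
    # One weighted bincount per column; each call returns a float64 vector.
    sums = np.column_stack([
        np.bincount(tags, weights=a[:, j], minlength=n_groups)
        for j in range(a.shape[1])
    ])
    counts = np.bincount(tags, minlength=n_groups)
    return unique_ids, sums, sums / counts[:, None]

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([5, 10, 5])
ids, sums, means = groupby_bincount(a, b)
```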

averaging matrix efficiently

in Python, given an n x p matrix, e.g. 4 x 4, how can I return a matrix that's 4 x 2 that simply averages the first two columns and the last two columns for all 4 rows of the matrix?
e.g. given:
a = array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
return a matrix that has the average of a[:, 0] and a[:, 1] and the average of a[:, 2] and a[:, 3].
I want this to work for an arbitrary n x p matrix, assuming that p is evenly divisible by the number of columns I am averaging over.
let me clarify: for each row, I want to take the average of the first two columns, then the average of the last two columns. So it would be:
(1 + 2) / 2, (3 + 4) / 2 <- row 1 of new matrix
(5 + 6) / 2, (7 + 8) / 2 <- row 2 of new matrix, etc.
which should yield a 4 by 2 matrix rather than 4 x 4.
thanks.
How about using some math? You can define a matrix M = [[0.5,0],[0.5,0],[0,0.5],[0,0.5]] so that A*M is what you want.
from numpy import array, matrix
A = array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
M = matrix([[0.5,0],
[0.5,0],
[0,0.5],
[0,0.5]])
print(A*M)
Generating M is pretty simple too, entries are 1/n or zero.
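One possible way to generate M for an arbitrary number of columns p and group width is a Kronecker product of an identity matrix with a column of 1/group entries. This is only a sketch: averaging_matrix is a hypothetical helper, and it uses plain arrays with the @ operator rather than numpy.matrix.

```python
import numpy as np

def averaging_matrix(p, group):
    """Build the (p, p // group) matrix that averages consecutive column groups."""
    if p % group:
        raise ValueError("p must be divisible by the group width")
    # Each identity entry expands into a (group, 1) block filled with 1/group.
    return np.kron(np.eye(p // group), np.full((group, 1), 1.0 / group))

A = np.arange(1, 17).reshape(4, 4)
M = averaging_matrix(A.shape[1], 2)
result = A @ M
```

For the 4 x 4 example, M comes out as [[0.5, 0], [0.5, 0], [0, 0.5], [0, 0.5]], the same matrix as written by hand above.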
reshape - get mean - reshape
>>> a.reshape(-1, a.shape[1]//2).mean(1).reshape(a.shape[0],-1)
array([[ 1.5, 3.5],
[ 5.5, 7.5],
[ 9.5, 11.5],
[ 13.5, 15.5]])
This should work for any array with an even number of columns, and reshape does not copy the data when the array is contiguous.
It's a bit unclear what should happen for matrices with n > 4, but this code will do what you want:
import numpy as N
a = N.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]], dtype=float)
avg = N.vstack((N.average(a[:,0:2], axis=1), N.average(a[:,2:4], axis=1))).T
This yields avg =
array([[ 1.5, 3.5],
[ 5.5, 7.5],
[ 9.5, 11.5],
[ 13.5, 15.5]])
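Here's a way to do it. You only need to change groupsize to make it work with other sizes like you said, though I'm not fully sure what you want. Note that the second argument of np.hsplit is the number of equal-width sections, so groupsize here is the number of column groups, not the number of columns per group.
groupsize = 2
out = np.hstack([np.mean(x,axis=1,keepdims=True) for x in np.hsplit(a,groupsize)])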
yields
array([[ 1.5, 3.5],
[ 5.5, 7.5],
[ 9.5, 11.5],
[ 13.5, 15.5]])
for out. Hopefully it gives you some ideas on how to do exactly what it is that you want to do. You can make groupsize dependent on the dimensions of a for instance.
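For a group width other than half the columns, the reshape idea can be generalized by adding a third axis and averaging over it. A minimal sketch, assuming the number of columns is divisible by the group width; group_column_mean is a made-up name.

```python
import numpy as np

def group_column_mean(a, group):
    """Average every `group` consecutive columns of a 2-D array."""
    a = np.asarray(a)
    n, p = a.shape
    if p % group:
        raise ValueError("number of columns must be divisible by group")
    # (n, p) -> (n, p // group, group), then average within each group.
    return a.reshape(n, p // group, group).mean(axis=2)

a = np.arange(1, 17).reshape(4, 4)
out = group_column_mean(a, 2)
wide = group_column_mean(np.arange(12).reshape(2, 6), 3)
```

Here out has shape (4, 2) and wide has shape (2, 2), one column per group.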
