How to stack uneven numpy arrays? - python

how can I stack the elements from the same respective index from each array in a list of arrays?
arrays = [np.array([1,2,3,4,5]),
np.array([6,7,8,9]),
np.array([11,22,33,44,55]),
np.array([2,4])]
output = [[1,6,11,2],
[2,7,22,4],
[3,8,33],
[4,9,44],
[5,55]]
arrays is a list of arrays of uneven lengths. The output has a first array (don't mind if it's a list too) that contains all possible index 0s from each array. The next array within output contains all possible index 1s and so on...
Closest thing I can find (but requires same shape arrays) is:
a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
np.stack((a, b), axis=-1)
# which gives
array([[1, 2],
[2, 3],
[3, 4]])
Thanks.

This gets you close. You can't really have a 2D sparse array as shown in your example output.
import numpy as np
arrays = [np.array([1,2,3,4,5]),
np.array([6,7,8,9]),
np.array([11,22,33,44,55]),
np.array([2,4])]
maxx = max(x.shape[0] for x in arrays)
for x in arrays:
x.resize(maxx,refcheck=False)
output = np.stack(arrays, axis=1)
print(output)
C:\tmp>python x.py
[[ 1 6 11 2]
[ 2 7 22 4]
[ 3 8 33 0]
[ 4 9 44 0]
[ 5 0 55 0]]

You could just wrap it in a DataFrame first:
arr = pd.DataFrame(arrays).values.T
Output:
array([[ 1., 6., 11., 2.],
[ 2., 7., 22., 4.],
[ 3., 8., 33., nan],
[ 4., 9., 44., nan],
[ 5., nan, 55., nan]])
Though if you really want it with different sizes, go with:
arr = [x.dropna().values for _, x in pd.DataFrame(arrays).iteritems()]
Output:
[array([ 1, 6, 11, 2]),
array([ 2, 7, 22, 4]),
array([ 3., 8., 33.]),
array([ 4., 9., 44.]),
array([ 5., 55.])]

Related

find infinity values and replace with maximum per vector in a numpy array

Suppose I have the following array with shape (3, 5) :
array = np.array([[1, 2, 3, inf, 5],
[10, 9, 8, 7, 6],
[4, inf, 2, 6, inf]])
Now I want to find the infinity values per vector and replace them with the maximum of that vector, with a lower limit of 1.
So the output for this example shoud be:
array_solved = np.array([[1, 2, 3, 5, 5],
[10, 9, 8, 7, 6],
[4, 6, 2, 6, 6]])
I could do this by looping over every vector of the array and apply:
idx_inf = np.isinf(array_vector)
max_value = np.max(np.append(array_vector[~idx_inf], 1.0))
array_vector[idx_inf] = max_value
But I guess there is a faster way.
Anyone an idea?
One way is to first convert infs to NaNs with np.isinf masking and then NaNs to max values of rows with np.nanmax:
array[np.isinf(array)] = np.nan
array[np.isnan(array)] = np.nanmax(array, axis=1)
to get
>>> array
array([[ 1., 2., 3., 5., 5.],
[10., 9., 8., 7., 6.],
[ 4., 10., 2., 6., 6.]])
import numpy as np
array = np.array([[1, 2, 3, np.inf, 5],
[10, 9, 8, 7, 6],
[4, np.inf, 2, 6, np.inf]])
n, m = array.shape
array[np.isinf(array)] = -np.inf
mx_array = np.repeat(np.max(array, axis=1), m).reshape(n, m)
ind = np.where(np.isinf(array))
array[ind] = mx_array[ind]
Output array:
array([[ 1., 2., 3., 5., 5.],
[10., 9., 8., 7., 6.],
[ 4., 6., 2., 6., 6.]])

Unique entries in columns of a 2D numpy array

I have an array of integers:
import numpy as np
demo = np.array([[1, 2, 3],
[1, 5, 3],
[4, 5, 6],
[7, 8, 9],
[4, 2, 3],
[4, 2, 12],
[10, 11, 13]])
And I want an array of unique values in the columns, padded with something if necessary (e.g. nan):
[[1, 4, 7, 10, nan],
[2, 5, 8, 11, nan],
[3, 6, 9, 12, 13]]
It does work when I iterate over the transposed array and use a boolean_indexing solution from a previous question. But I was hoping there would be a built-in method:
solution = []
for row in np.unique(demo.T, axis=1):
solution.append(np.unique(row))
def boolean_indexing(v, fillval=np.nan):
lens = np.array([len(item) for item in v])
mask = lens[:,None] > np.arange(lens.max())
out = np.full(mask.shape,fillval)
out[mask] = np.concatenate(v)
return out
print(boolean_indexing(solution))
AFAIK, there are no builtin solution for that. That being said, your solution seems a bit complex to me. You could create an array with initialized values and fill it with a simple loop (since you already use loops anyway).
solution = [np.unique(row) for row in np.unique(demo.T, axis=1)]
result = np.full((len(solution), max(map(len, solution))), np.nan)
for i,arr in enumerate(solution):
result[i][:len(arr)] = arr
If you want to avoid the loop you could do:
demo = demo.astype(np.float32) # nan only works on floats
sort = np.sort(demo, axis=0)
diff = np.diff(sort, axis=0)
np.place(sort[1:], diff == 0, np.nan)
sort.sort(axis=0)
edge = np.argmax(sort, axis=0).max()
result = sort[:edge]
print(result.T)
Output:
array([[ 1., 4., 7., 10., nan],
[ 2., 5., 8., 11., nan],
[ 3., 6., 9., 12., 13.]], dtype=float32)
Not sure if this is any faster than the solution given by Jérôme.
EDIT
A slightly better solution
demo = demo.astype(np.float32)
sort = np.sort(demo, axis=0)
mask = np.full(sort.shape, False, dtype=bool)
np.equal(sort[1:], sort[:-1], out=mask[1:])
np.place(sort, mask, np.nan)
edge = (~mask).sum(0).max()
result = np.sort(sort, axis=0)[:edge]
print(result.T)
Output:
array([[ 1., 4., 7., 10., nan],
[ 2., 5., 8., 11., nan],
[ 3., 6., 9., 12., 13.]], dtype=float32)

Which dim to use on tf.metrics.mean_cosine_distance?

I'm confused about which dim refers to which actual dimension in Tensorflow in general, but concretely, when using tf.metrics.mean_cosine_distance
Given
x = [
[1, 2, 3, 4, 5],
[0, 2, 3, 4, 5],
]
I'd like to calculate the distance column-wise. In other words, which dimension resolves to (pseudo code):
mean([
cosine_distance(x[0][0], x[1][0]),
cosine_distance(x[0][1], x[1][1]),
cosine_distance(x[0][2], x[1][2]),
cosine_distance(x[0][3], x[1][3]),
cosine_distance(x[0][4], x[1][4]),
])
It is along dim 0 for your input x. It's intuitive to see this once you construct your input x as a numpy array.
In [49]: x_arr = np.array(x, dtype=np.float32)
In [50]: x_arr
Out[50]:
array([[ 1., 2., 3., 4., 5.],
[ 0., 2., 3., 4., 5.]], dtype=float32)
# compute (mean) cosine distance between `x[0]` & `x[1]`
# where `x[0]` can be considered as `labels`
# while `x[1]` can be considered as `predictions`
In [51]: cosine_dist_axis0 = tf.metrics.mean_cosine_distance(x_arr[0], x_arr[1], 0)
This dim corresponds to the name axis in NumPy terminology. For example, a simple sum operation can be done along axis 0 like:
In [52]: x_arr
Out[52]:
array([[ 1., 2., 3., 4., 5.],
[ 0., 2., 3., 4., 5.]], dtype=float32)
In [53]: np.sum(x_arr, axis=0)
Out[53]: array([ 1., 4., 6., 8., 10.], dtype=float32)
When you compute the tf.metrics.mean_cosine_distance, you're essentially computing the cosine distance between the vectors labels and predictions along dim 0 (and then taking mean) if your inputs are of shape (n, ) where n is the length of each vector (i.e. number of entries in labels/prediction)
But, if you're passing the labels and predictions as a column vector, then the tf.metrics.mean_cosine_distance has to be calculated along dim 1
Example:
If your input label and prediction are column vectors,
# if your `label` is a column vector
In [66]: (x_arr[0])[:, None]
Out[66]:
array([[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ 5.]], dtype=float32)
# if your `prediction` is a column vector
In [67]: (x_arr[1])[:, None]
Out[67]:
array([[ 0.],
[ 2.],
[ 3.],
[ 4.],
[ 5.]], dtype=float32)
Then, tf.metrics.mean_cosine_distance has to computed along dim 1
# inputs
In [68]: labels = (x_arr[0])[:, None]
In [69]: predictions = (x_arr[1])[:, None]
# compute mean cosine distance between them
In [70]: cosine_dist_dim1 = tf.metrics.mean_cosine_distance(labels, predictions, 1)
This tf.metrics.mean_cosine_distance is more or less doing the same thing as scipy.spatial.distance.cosine but it also takes mean.
For your example case:
In [77]: x
Out[77]: [[1, 2, 3, 4, 5], [0, 2, 3, 4, 5]]
In [78]: import scipy
In [79]: scipy.spatial.distance.cosine(x[0], x[1])
Out[79]: 0.009132

Python: properly iterating through a dictionary of numpy arrays

Given the following numpy arrays:
import numpy
a=numpy.array([[1,1,1],[1,1,1],[1,1,1]])
b=numpy.array([[2,2,2],[2,2,2],[2,2,2]])
c=numpy.array([[3,3,3],[3,3,3],[3,3,3]])
and this dictionary containing them all:
mydict={0:a,1:b,2:c}
What is the most efficient way of iterating through mydict so to compute the average numpy array that has (1+2+3)/3=2 as values?
My attempt fails as I am giving it too many values to unpack. It is also extremely inefficient as it has an O(n^3) time complexity:
aver=numpy.empty([a.shape[0],a.shape[1]])
for c,v in mydict.values():
for i in range(0,a.shape[0]):
for j in range(0,a.shape[1]):
aver[i][j]=mydict[c][i][j] #<-too many values to unpack
The final result should be:
In[17]: aver
Out[17]:
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
EDIT
I am not looking for an average value for each numpy array. I am looking for an average value for each element of my colleciton of numpy arrays. This is a minimal example, but the real thing I am working on has over 120,000 elements per array, and for the same position the values change from array to array.
I think you're making this harder than it needs to be. Either sum them and divide by the number of terms:
In [42]: v = mydict.values()
In [43]: sum(v) / len(v)
Out[43]:
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
Or stack them into one big array -- which it sounds like is the format they probably should have been in to start with -- and take the mean over the stacked axis:
In [44]: np.array(list(v)).mean(axis=0)
Out[44]:
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
You really shouldn't be using a dict of numpy.arrays. Just use a multi-dimensional array:
>>> bigarray = numpy.array([arr.tolist() for arr in mydict.values()])
>>> bigarray
array([[[1, 1, 1],
[1, 1, 1],
[1, 1, 1]],
[[2, 2, 2],
[2, 2, 2],
[2, 2, 2]],
[[3, 3, 3],
[3, 3, 3],
[3, 3, 3]]])
>>> bigarray.mean(axis=0)
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
>>>
You should modify your code to not even work with a dict. Especially not a dict with integer keys...

Simple way of stacking arrays with index offset

I have a number of time series, each containing measurements across weeks of the year, but not all of them start and end on the same weeks. I know the offsets, that is I know in what weeks each one starts and ends. Now I would like to combine them into a matrix respecting the inherent offsets, such that all values will align with the correct week numbers.
If the horizontal direction contains the series and vertical direction represents the weeks, given two series a and b, where values correspond to week numbers:
a = np.array([[1,2,3,4,5,6]])
b = np.array([[0,1,2,3,4,5]])
I want to know if is it possible to combine them, e.g. using some method that takes an offset argument in a fashion like combine((a, b), axis=0, offset=-1), such that the resulting array (lets call it c) looks like this:
print c
[[NaN 1 2 3 4 5 6 ]
[0 1 2 3 4 5 NaN]]
What more is, since the time series are enormous, I must stream them through my program, and therefore cannot know all offsets at the same time. I thought of using Pandas because it has nice indexing, but I felt there had to be a simpler way, since the essence of what I'm trying to do is super simple.
Update:
This seems to work
def offset_stack(a, b, offset=0):
if offset < 0:
a = np.insert(a, [0] * abs(offset), np.nan)
b = np.append(b, [np.nan] * abs(offset))
if offset > 0:
a = np.append(a, [np.nan] * abs(offset))
b = np.insert(b, [0] * abs(offset), np.nan)
return np.concatenate(([a],[b]), axis=0)
You can do in numpy:
def f(a, b, n):
v = np.empty(abs(n))*np.nan
if np.sign(n)==-1:
return np.vstack((np.append(a,v), np.append(v,b)))
elif np.sign(n)==1:
return np.vstack((np.append(v,a), np.append(b,v)))
else:
return np.vstack((a,b))
#In [148]: a = np.array([23, 13, 4, 12, 4, 4])
#In [149]: b = np.array([4, 12, 3, 41, 45, 6])
#In [150]: f(a,b,-2)
#Out[150]:
#array([[ 23., 13., 4., 12., 4., 4., nan, nan],
# [ nan, nan, 4., 12., 3., 41., 45., 6.]])
#In [151]: f(a,b,2)
#Out[151]:
#array([[ nan, nan, 23., 13., 4., 12., 4., 4.],
# [ 4., 12., 3., 41., 45., 6., nan, nan]])
#In [152]: f(a,b,0)
#Out[152]:
#array([[23, 13, 4, 12, 4, 4],
# [ 4, 12, 3, 41, 45, 6]])
There is a real simple way to accomplish this.
You basically want to pad and then stack your arrays and for both there are numpy functions:
numpy.lib.pad() aka offset
a = np.array([[1,2,3,4,5,6]], dtype=np.float_) # float because NaN is a float value!
b = np.array([[0,1,2,3,4,5]], dtype=np.float_)
from numpy.lib import pad
print(pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan))
# [[ nan 1. 2. 3. 4. 5. 6.]]
print(pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan))
# [[ 0., 1., 2., 3., 4., 5., nan]]
The ((0,0)(1,0)) means just no padding in the first axis (top/bottom) and only pad one element left and no element on the right. So you have to tweak these if you want more/less shift.
numpy.vstack() aka stack along axis=0
import numpy as np
a_padded = pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan)
b_padded = pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan)
np.vstack([a_padded, b_padded])
# array([[ nan, 1., 2., 3., 4., 5., 6.],
# [ 0., 1., 2., 3., 4., 5., nan]])
Your function:
Combining these two would be very easy and is easy to extend:
from numpy.lib import pad
import numpy as np
def offset_stack(a, b, axis=0, offsets=(0, 1)):
if (len(offsets) != a.ndim) or (a.ndim != b.ndim):
raise ValueError('Offsets and dimensions of the arrays do not match.')
offset1 = [(0, -offset) if offset < 0 else (offset, 0) for offset in offsets]
offset2 = [(-offset, 0) if offset < 0 else (0, offset) for offset in offsets]
a_padded = pad(a, offset1, mode='constant', constant_values=np.nan)
b_padded = pad(b, offset2, mode='constant', constant_values=np.nan)
return np.concatenate([a_padded, b_padded], axis=axis)
offset_stack(a, b)
This function works for generalized offsets in arbitary dimensions and can stack in arbitary dimensions. It doesn't work in the same way as the original since you pad the second dimension just passing in offset=1 would pad in the first dimension. But if you keep track of the dimensions of your arrays it should work fine.
For example:
offset_stack(a, b, offsets=(1,2))
array([[ nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan, nan],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
or for 3d arrays:
a = np.array([1,2,3], dtype=np.float_)[None, :, None] # makes it 3d
b = np.array([0,1,2], dtype=np.float_)[None, :, None] # makes it 3d
offset_stack(a, b, offsets=(0,1,0), axis=2)
array([[[ nan, 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., nan]]])
pad and concatenate (and the various stack and inserts) create a target array of the right size, and fill values from the input arrays. So we can do the same, and potentially do it faster.
Just for example using your 2 arrays and the 1 step offset:
In [283]: a = np.array([[1,2,3,4,5,6]])
In [284]: b = np.array([[0,1,2,3,4,5]])
create the target array, and fill it with the pad value. np.nan is a float (even though a is int):
In [285]: m=a.shape[0]+b.shape[0]
In [286]: n=a.shape[1]+1
In [287]: c=np.zeros((m,n),float)
In [288]: c.fill(np.nan)
Now just copy values into the right places on the target. More arrays and offsets will require some generalization here.
In [289]: c[:a.shape[0],1:]=a
In [290]: c[-b.shape[0]:,:-1]=b
In [291]: c
Out[291]:
array([[ nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan]])

Categories