Element wise mean of numpy arrays of different sizes - python

There is a csv file I'm reading in which I'm focusing on col3, where the rows hold lists of different lengths. Initially the column was read as type str, but that was fixed using pd.eval:
df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})
row e.g. [0, 100, -200, 300, -150...]
There are many rows of different lengths, and I want to calculate the element-wise average, for which I have followed this solution.
I first ran into a numpy VisibleDeprecationWarning, which I fixed using this.
But for the last step of the solution, using np.nanmean, I'm running into a new error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
My code looks like this so far:
import pandas as pd
import numpy as np
import itertools
df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})
datafile = df[(df['col1'] == 'Red') & (df['col2'] == Name) & ((df['col4'] == 'EX') | (df['col5'] == 'EX'))]
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
ar = np.array(list(itertools.zip_longest(df['col3'], fillvalue=np.nan)))
print(ar)
np.nanmean(ar,axis=1)
The arrays print as a ragged object-dtype array, and the error points to the last line.
I can see the error relates to the arrays being of type object, but I'm not sure how to fix it.

Make a ragged array:
In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]:
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
array([0., 0., 0.])], dtype=object)
Note the shape and dtype.
Try to use mean on it:
In [25]: np.mean(arr)
Traceback (most recent call last):
Input In [25] in <cell line: 1>
np.mean(arr)
File <__array_function__ internals>:180 in mean
File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
return _methods._mean(a, axis=axis, dtype=dtype,
File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,)
Applying mean to each element array works:
In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]
Trying to use zip_longest:
In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]:
[(array([0, 1, 2, 3, 4]),),
(array([1., 1., 1., 1., 1.]),),
(array([0., 0., 0.]),)]
No change. We can use it by unpacking the arr - but it has padded the arrays in the wrong way:
In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]
zip_longest can be used to pad lists, but it takes more thought than this.
If we make an array from that list:
In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]:
array([[ 0., 1., 0.],
[ 1., 1., 0.],
[ 2., 1., 0.],
[ 3., 1., nan],
[ 4., 1., nan]])
and transpose it, we can take the nanmean:
In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]:
array([[ 0., 1., 2., 3., 4.],
[ 1., 1., 1., 1., 1.],
[ 0., 0., 0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])
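Putting this back into the question's code, the fix is the missing * unpacking — a minimal sketch (assuming col3 holds lists of numbers after the pd.eval converter):
import itertools
import numpy as np

# *-unpack the Series so zip_longest pads across rows; the result has shape
# (longest_row, n_rows), so axis=1 averages across rows at each position,
# which is the element-wise mean the question asks for.
ar = np.array(list(itertools.zip_longest(*df['col3'], fillvalue=np.nan)), dtype=float)
elementwise_mean = np.nanmean(ar, axis=1)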

Related

Trying to understand signature in numpy.vectorize

I am trying to understand the signature functionality in numpy.vectorize. I have some examples, but they did not help much with my understanding.
>>> import scipy.stats
>>> pearsonr = np.vectorize(scipy.stats.pearsonr, signature='(n),(n)->(),()')
>>> pearsonr([[0, 1, 2, 3]], [[1, 2, 3, 4], [4, 3, 2, 1]])
(array([ 1., -1.]), array([ 0., 0.]))
>>> convolve = np.vectorize(np.convolve, signature='(n),(m)->(k)')
>>> convolve(np.eye(4), [1, 2, 1])
array([[1., 2., 1., 0., 0., 0.],
[0., 1., 2., 1., 0., 0.],
[0., 0., 1., 2., 1., 0.],
[0., 0., 0., 1., 2., 1.]])
>>> import numpy as np
>>> qr = np.vectorize(np.linalg.qr, signature='(m,n)->(m,k),(k,n)')
>>> qr(np.random.normal(size=(1, 3, 2)))
(array([[-0.31622777, -0.9486833 ],
[-0.9486833 , 0.31622777]]),
array([[-3.16227766, -4.42718872, -5.69209979],
[ 0. , -0.63245553, -1.26491106]]))
>>> import scipy
>>> logm = np.vectorize(scipy.linalg.logm, signature='(m,m)->(m,m)')
>>> logm(np.random.normal(size=(1, 3, 2)))
array([[[ 1.08226288, -2.29544602],
[ 2.12599894, -1.26335203]]])
Can someone please explain the functionality and syntax of the signatures
signature='(n),(n)->(),()'
signature='(n),(m)->(k)'
signature='(m,n)->(m,k),(k,n)'
signature='(m,m)->(m,m)'
used in the aforementioned examples? If we didn't use the signatures, how would the examples have been implemented in a simpler, more naive way?
Any help is highly appreciated.
The aforementioned examples can be found here and here.
I think the explanation would be clearer if we knew the 'signature' of the individual functions - what they expect, and what they produce. But I can make some deductions from the code you show.
>>> pearsonr = np.vectorize(scipy.stats.pearsonr, signature='(n),(n)->(),()')
>>> pearsonr([[0, 1, 2, 3]], [[1, 2, 3, 4], [4, 3, 2, 1]])
(array([ 1., -1.]), array([ 0., 0.]))
This is called with (4,) and (2,4) arrays (well, lists that become such arrays). They broadcast together to (2,4). The stats function is then called twice, once for each row of the pair, getting two (4,) arrays each time and returning 2 scalar values (the correlation coefficient and the p-value).
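Without the signature, a naive equivalent would broadcast the inputs and loop explicitly — a rough sketch of what vectorize does under the hood:
import numpy as np
import scipy.stats

x = np.array([[0, 1, 2, 3]])
y = np.array([[1, 2, 3, 4], [4, 3, 2, 1]])
xb, yb = np.broadcast_arrays(x, y)              # both become (2, 4)
results = [scipy.stats.pearsonr(a, b) for a, b in zip(xb, yb)]
r = np.array([res[0] for res in results])       # correlation coefficients
p = np.array([res[1] for res in results])       # p-values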
>>> convolve = np.vectorize(np.convolve, signature='(n),(m)->(k)')
>>> convolve(np.eye(4), [1, 2, 1])
array([[1., 2., 1., 0., 0., 0.],
[0., 1., 2., 1., 0., 0.],
[0., 0., 1., 2., 1., 0.],
[0., 0., 0., 1., 2., 1.]])
This is called with (4,4) and (3,) arrays. I think convolve gets called 4 times, once for each row of the eye, getting the same [1,2,1] each time. The result is a 4-row array (with 6 columns, determined by convolve itself, not vectorize).
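Again, the naive equivalent is a plain Python loop over the rows, stacking the results:
import numpy as np

# one convolve call per row of the identity matrix, same kernel each time
res = np.array([np.convolve(row, [1, 2, 1]) for row in np.eye(4)])
# res.shape == (4, 6), matching the vectorized output above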
>>> import numpy as np
>>> qr = np.vectorize(np.linalg.qr, signature='(m,n)->(m,k),(k,n)')
>>> qr(np.random.normal(size=(1, 3, 2)))
(array([[-0.31622777, -0.9486833 ],
[-0.9486833 , 0.31622777]]),
array([[-3.16227766, -4.42718872, -5.69209979],
[ 0. , -0.63245553, -1.26491106]]))
Signature: np.linalg.qr(a, mode='reduced')
a : array_like, shape (M, N)
'reduced' : returns q, r with dimensions (M, K), (K, N) (default)
The vectorize signature just repeats the information in the docs.
a is a (1,3,2) shape array, so qr is called once (over the 1st dimension) with a (3,2) array. The result is 2 arrays of (m,k) and (k,n) shapes. When I run it I get an added size-1 dimension: (1,3,2) and (1,2,2). Different numbers because of random:
In [120]: qr = np.vectorize(np.linalg.qr, signature='(m,n)->(m,k),(k,n)')
...: qr(np.random.normal(size=(1, 3,2)))
Out[120]:
(array([[[-0.61362528, 0.09161174],
[ 0.63682861, -0.52978942],
[-0.46681188, -0.84316692]]]),
array([[[-0.65301725, -1.00494992],
[ 0. , 0.8068886 ]]]))
>>> import scipy
>>> logm = np.vectorize(scipy.linalg.logm, signature='(m,m)->(m,m)')
>>> logm(np.random.normal(size=(1, 3, 2)))
array([[[ 1.08226288, -2.29544602],
[ 2.12599894, -1.26335203]]])
scipy.linalg.logm expects a square array, and returns one of the same shape.
Calling logm with a (1,3,2) produces an error, because (3,2) is not a square array:
ValueError: inconsistent size for core dimension 'm': 2 vs 3
Calling scipy.linalg.logm directly produces the same error, worded differently:
linalg.logm(np.random.normal(size=(3, 2)))
ValueError: expected square array_like input
When I say the function is called twice, or something like that, I'm ignoring the test call that's used to determine the return dtype.
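As an aside, that test call can be avoided by passing otypes, which specifies the output dtype(s) up front — for example with a simple scalar function:
import numpy as np

# With otypes given, vectorize does not need a trial call to infer the dtype.
halve = np.vectorize(lambda x: x * 0.5, otypes=[float])
halve(np.arange(4))   # array([0. , 0.5, 1. , 1.5])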

Pythonic way to access the shifted version of numpy array?

What is a pythonic way to access the shifted version, either right or left, of a numpy array? A clear example:
a = np.array([1.0, 2.0, 3.0, 4.0])
Is there a way to access:
a_shifted_1_left = np.array([2.0, 3.0, 4.0, 1.0])
from the numpy library?
You are looking for np.roll -
np.roll(a,-1) # shifted left
np.roll(a,1) # shifted right
Sample run -
In [28]: a
Out[28]: array([ 1., 2., 3., 4.])
In [29]: np.roll(a,-1) # shifted left
Out[29]: array([ 2., 3., 4., 1.])
In [30]: np.roll(a,1) # shifted right
Out[30]: array([ 4., 1., 2., 3.])
If you want more shifts, just go np.roll(a,-2) and np.roll(a,2) and so on.
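For multidimensional arrays, np.roll also takes an axis argument - e.g. shifting each row of a 2D array right by one:
In [31]: b = np.arange(6).reshape(2,3)
In [32]: np.roll(b, 1, axis=1)
Out[32]:
array([[2, 0, 1],
       [5, 3, 4]])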

Simple way of stacking arrays with index offset

I have a number of time series, each containing measurements across weeks of the year, but not all of them start and end on the same weeks. I know the offsets, that is, I know in which weeks each one starts and ends. Now I would like to combine them into a matrix respecting the inherent offsets, such that all values align with the correct week numbers.
If the horizontal direction contains the series and vertical direction represents the weeks, given two series a and b, where values correspond to week numbers:
a = np.array([[1,2,3,4,5,6]])
b = np.array([[0,1,2,3,4,5]])
I want to know if it is possible to combine them, e.g. using some method that takes an offset argument in a fashion like combine((a, b), axis=0, offset=-1), such that the resulting array (let's call it c) looks like this:
print c
[[NaN 1 2 3 4 5 6 ]
[0 1 2 3 4 5 NaN]]
What's more, since the time series are enormous, I must stream them through my program and therefore cannot know all offsets at the same time. I thought of using Pandas because it has nice indexing, but I felt there had to be a simpler way, since the essence of what I'm trying to do is super simple.
Update:
This seems to work
def offset_stack(a, b, offset=0):
    if offset < 0:
        a = np.insert(a, [0] * abs(offset), np.nan)
        b = np.append(b, [np.nan] * abs(offset))
    if offset > 0:
        a = np.append(a, [np.nan] * abs(offset))
        b = np.insert(b, [0] * abs(offset), np.nan)
    return np.concatenate(([a], [b]), axis=0)
You can do it in numpy:
def f(a, b, n):
    v = np.empty(abs(n)) * np.nan
    if np.sign(n) == -1:
        return np.vstack((np.append(a, v), np.append(v, b)))
    elif np.sign(n) == 1:
        return np.vstack((np.append(v, a), np.append(b, v)))
    else:
        return np.vstack((a, b))
#In [148]: a = np.array([23, 13, 4, 12, 4, 4])
#In [149]: b = np.array([4, 12, 3, 41, 45, 6])
#In [150]: f(a,b,-2)
#Out[150]:
#array([[ 23., 13., 4., 12., 4., 4., nan, nan],
# [ nan, nan, 4., 12., 3., 41., 45., 6.]])
#In [151]: f(a,b,2)
#Out[151]:
#array([[ nan, nan, 23., 13., 4., 12., 4., 4.],
# [ 4., 12., 3., 41., 45., 6., nan, nan]])
#In [152]: f(a,b,0)
#Out[152]:
#array([[23, 13, 4, 12, 4, 4],
# [ 4, 12, 3, 41, 45, 6]])
There is a really simple way to accomplish this.
You basically want to pad and then stack your arrays, and there are numpy functions for both:
numpy.lib.pad() aka offset
a = np.array([[1,2,3,4,5,6]], dtype=np.float_) # float because NaN is a float value!
b = np.array([[0,1,2,3,4,5]], dtype=np.float_)
from numpy.lib import pad
print(pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan))
# [[ nan 1. 2. 3. 4. 5. 6.]]
print(pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan))
# [[ 0., 1., 2., 3., 4., 5., nan]]
The ((0,0),(1,0)) means no padding in the first axis (top/bottom), and, in the second axis, one element of padding on the left and none on the right. So you have to tweak these if you want more/less shift.
numpy.vstack() aka stack along axis=0
import numpy as np
a_padded = pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan)
b_padded = pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan)
np.vstack([a_padded, b_padded])
# array([[ nan, 1., 2., 3., 4., 5., 6.],
# [ 0., 1., 2., 3., 4., 5., nan]])
Your function:
Combining these two would be very easy and is easy to extend:
from numpy.lib import pad
import numpy as np
def offset_stack(a, b, axis=0, offsets=(0, 1)):
    if (len(offsets) != a.ndim) or (a.ndim != b.ndim):
        raise ValueError('Offsets and dimensions of the arrays do not match.')
    offset1 = [(0, -offset) if offset < 0 else (offset, 0) for offset in offsets]
    offset2 = [(-offset, 0) if offset < 0 else (0, offset) for offset in offsets]
    a_padded = pad(a, offset1, mode='constant', constant_values=np.nan)
    b_padded = pad(b, offset2, mode='constant', constant_values=np.nan)
    return np.concatenate([a_padded, b_padded], axis=axis)
offset_stack(a, b)
This function works for generalized offsets in arbitrary dimensions and can stack along an arbitrary axis. It doesn't behave quite like the original, which padded the second dimension: here the offsets are given per dimension, so a leading offset of 1 pads the first dimension. But if you keep track of the dimensions of your arrays it should work fine.
For example:
offset_stack(a, b, offsets=(1,2))
array([[ nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan, nan],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
or for 3d arrays:
a = np.array([1,2,3], dtype=np.float_)[None, :, None] # makes it 3d
b = np.array([0,1,2], dtype=np.float_)[None, :, None] # makes it 3d
offset_stack(a, b, offsets=(0,1,0), axis=2)
array([[[ nan, 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., nan]]])
pad and concatenate (and the various stack and insert functions) create a target array of the right size and fill in values from the input arrays. So we can do the same ourselves, and potentially do it faster.
Just for example using your 2 arrays and the 1 step offset:
In [283]: a = np.array([[1,2,3,4,5,6]])
In [284]: b = np.array([[0,1,2,3,4,5]])
create the target array, and fill it with the pad value. np.nan is a float (even though a is int):
In [285]: m=a.shape[0]+b.shape[0]
In [286]: n=a.shape[1]+1
In [287]: c=np.zeros((m,n),float)
In [288]: c.fill(np.nan)
Now just copy values into the right places on the target. More arrays and offsets will require some generalization here.
In [289]: c[:a.shape[0],1:]=a
In [290]: c[-b.shape[0]:,:-1]=b
In [291]: c
Out[291]:
array([[ nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan]])
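To generalize to more arrays and offsets, a minimal sketch along the same lines (stack_with_offsets is a hypothetical helper, assuming 1-D series and non-negative start offsets):
import numpy as np

def stack_with_offsets(series, offsets):
    # total width is set by the series that extends furthest right
    width = max(off + len(s) for s, off in zip(series, offsets))
    out = np.full((len(series), width), np.nan)
    for i, (s, off) in enumerate(zip(series, offsets)):
        out[i, off:off + len(s)] = s
    return out

stack_with_offsets([np.arange(1, 7), np.arange(6)], [1, 0])
# array([[nan,  1.,  2.,  3.,  4.,  5.,  6.],
#        [ 0.,  1.,  2.,  3.,  4.,  5., nan]])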

Split NumPy array according to values in the array (a condition)

I have an array:
arr = [(1,1,1), (1,1,2), (1,1,3), (1,1,4)...(35,1,22),(35,1,23)]
I want to split my array according to the third value in each ordered pair. I want each third value of 1 to be the start
of a new array. The results should be:
[(1,1,1), (1,1,2),...(1,1,35)][(1,2,1), (1,2,2),...(1,2,46)]
and so on. I know numpy.split should do the trick but I'm lost as to how to write the condition for the split.
Here's a quick idea, working with a 1d array. It can be easily extended to work with your 2d array:
In [385]: x=np.arange(10)
In [386]: I=np.where(x%3==0)
In [387]: I
Out[387]: (array([0, 3, 6, 9]),)
In [389]: np.split(x,I[0])
Out[389]:
[array([], dtype=float64),
array([0, 1, 2]),
array([3, 4, 5]),
array([6, 7, 8]),
array([9])]
The key is to use where to find the indices where you want split to act.
For a 2d arr:
First make a sample 2d array, with something interesting in the 3rd column:
In [390]: arr=np.ones((10,3))
In [391]: arr[:,2]=np.arange(10)
In [392]: arr
Out[392]:
array([[ 1., 1., 0.],
[ 1., 1., 1.],
...
[ 1., 1., 9.]])
Then use the same where and boolean to find indexes to split on:
In [393]: I=np.where(arr[:,2]%3==0)
In [395]: np.split(arr,I[0])
Out[395]:
[array([], dtype=float64),
array([[ 1., 1., 0.],
[ 1., 1., 1.],
[ 1., 1., 2.]]),
array([[ 1., 1., 3.],
[ 1., 1., 4.],
[ 1., 1., 5.]]),
array([[ 1., 1., 6.],
[ 1., 1., 7.],
[ 1., 1., 8.]]),
array([[ 1., 1., 9.]])]
I cannot think of any numpy functions or tricks to do this. A simple solution using a for loop would be -
In [48]: arr = [(1,1,1), (1,1,2), (1,1,3), (1,1,4),(1,2,1),(1,2,2),(1,2,3),(1,3,1),(1,3,2),(1,3,3),(1,3,4),(1,3,5)]
In [49]: result = []
In [50]: for i in arr:
   ....:     if i[2] == 1:
   ....:         tempres = []
   ....:         result.append(tempres)
   ....:     tempres.append(i)
   ....:
In [51]: result
Out[51]:
[[(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 1, 4)],
[(1, 2, 1), (1, 2, 2), (1, 2, 3)],
[(1, 3, 1), (1, 3, 2), (1, 3, 3), (1, 3, 4), (1, 3, 5)]]
From looking at the documentation, it seems like specifying the indices at which to split will work best. For your specific example, the following works if arr is already a 2-dimensional numpy array:
np.split(arr, np.where(arr[:,2] == 1)[0])
arr[:,2] returns the 3rd entry in each row: the colon says to take every row, and the 2 selects the 3rd column, which is the 3rd component.
We then use np.where to return all the places where the 3rd coordinate is a 1. We have to do np.where()[0] to get at the array of locations directly.
We then plug in the indices we've found where the 3rd coordinate is 1 to np.split which splits at the desired locations.
Note that because the first entry has a 1 in the 3rd coordinate it will split before the first entry. This gives us one extra "split" array which is empty.
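If that empty leading array is a nuisance, it's simple to drop afterwards — a minimal sketch (assuming arr is the 2-D array from above):
parts = np.split(arr, np.where(arr[:, 2] == 1)[0])
parts = [p for p in parts if p.size > 0]   # drop the empty leading piece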

Efficient way to add a singleton dimension to a NumPy vector so that slice assignments work

In NumPy, how can you efficiently make a 1-D object into a 2-D object where the singleton dimension is inferred from the current object (i.e. a list should go to either a 1xlength or lengthx1 vector)?
# This comes from some other, unchangeable code that reads data files.
my_list = [1,2,3,4]
# What I want to do:
my_numpy_array[some_index,:] = numpy.asarray(my_list)
# The above doesn't work because of a broadcast error, so:
my_numpy_array[some_index,:] = numpy.reshape(numpy.asarray(my_list),(1,len(my_list)))
# How to do the above without the call to reshape?
# Is there a way to directly convert a list, or vector, that doesn't have a
# second dimension, into a 1 by length "array" (but really it's still a vector)?
In the most general case, the easiest way to add extra dimensions to an array is by using the keyword None when indexing at the position to add the extra dimension. For example
my_array = numpy.array([1,2,3,4])
my_array[None, :] # shape 1x4
my_array[:, None] # shape 4x1
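np.newaxis is simply an alias for None, so the same indexing can be written more explicitly:
my_array[np.newaxis, :] # shape 1x4, identical to my_array[None, :]
my_array[:, np.newaxis] # shape 4x1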
Why not simply add square brackets?
>>> my_list
[1, 2, 3, 4]
>>> numpy.asarray([my_list])
array([[1, 2, 3, 4]])
>>> numpy.asarray([my_list]).shape
(1, 4)
.. wait, on second thought, why is your slice assignment failing? It shouldn't:
>>> my_list = [1,2,3,4]
>>> d = numpy.ones((3,4))
>>> d
array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
>>> d[0,:] = my_list
>>> d[1,:] = numpy.asarray(my_list)
>>> d[2,:] = numpy.asarray([my_list])
>>> d
array([[ 1., 2., 3., 4.],
[ 1., 2., 3., 4.],
[ 1., 2., 3., 4.]])
even:
>>> d[1,:] = (3*numpy.asarray(my_list)).T
>>> d
array([[ 1., 2., 3., 4.],
[ 3., 6., 9., 12.],
[ 1., 2., 3., 4.]])
import numpy as np
a = np.random.random(10)
sel = np.atleast_2d(a)[idx]  # np.atleast_2d gives shape (1, 10); idx is whatever index you need
What about expand_dims?
np.expand_dims(np.array([1,2,3,4]), 0)
has shape (1,4) while
np.expand_dims(np.array([1,2,3,4]), 1)
has shape (4,1).
You can always use dstack() to replicate your array:
import numpy
my_list = numpy.array([1,2,3,4])
my_list_2D = numpy.dstack((my_list, my_list))  # shape (1, 4, 2): two copies stacked depth-wise
