There is a CSV file I'm reading in which I'm focusing on col3, where the rows hold lists of different lengths. The column was initially read as type str, which I fixed using pd.eval:
df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})
A row looks like, e.g., [0, 100, -200, 300, -150...]
There are many rows of different sizes, and I want to calculate the element-wise average, for which I followed this solution.
I first ran into the NumPy VisibleDeprecationWarning, which I fixed using this.
But for the last step of the solution, using np.nanmean, I'm running into a new error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
My code looks like this so far:
import pandas as pd
import numpy as np
import itertools
df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})
datafile = df[(df['col1'] == 'Red') & (df['col2'] == Name) & ((df['col4'] == 'EX') | (df['col5'] == 'EX'))]
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
ar = np.array(list(itertools.zip_longest(df['col3'], fillvalue=np.nan)))
print(ar)
np.nanmean(ar,axis=1)
the arrays print like this
And the error points to the last line.
From the error I can see it's pointing towards the arrays being of type object, but I'm not sure how to fix it.
Make a ragged array:
In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]:
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
array([0., 0., 0.])], dtype=object)
Note the shape and dtype.
Try to use mean on it:
In [25]: np.mean(arr)
Traceback (most recent call last):
Input In [25] in <cell line: 1>
np.mean(arr)
File <__array_function__ internals>:180 in mean
File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
return _methods._mean(a, axis=axis, dtype=dtype,
File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,)
Applying mean to each element array works:
In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]
Trying to use zip_longest:
In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]:
[(array([0, 1, 2, 3, 4]),),
(array([1., 1., 1., 1., 1.]),),
(array([0., 0., 0.]),)]
No change. We can use it by unpacking arr - but it has padded the arrays in the wrong way:
In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]
zip_longest can be used to pad lists, but it takes more thought than this.
If we make an array from that list:
In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]:
array([[ 0., 1., 0.],
[ 1., 1., 0.],
[ 2., 1., 0.],
[ 3., 1., nan],
[ 4., 1., nan]])
and transpose it, we can take the nanmean:
In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]:
array([[ 0., 1., 2., 3., 4.],
[ 1., 1., 1., 1., 1.],
[ 0., 0., 0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])
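Applied back to the question's code, the likely missing piece is the * that unpacks the column into zip_longest (a sketch, assuming df['col3'] holds lists/arrays of differing lengths after the pd.eval converter):
import itertools
import numpy as np
# shape (max_len, n_rows): each row is one position across all of the original rows
ar = np.array(list(itertools.zip_longest(*df['col3'], fillvalue=np.nan)))
elementwise_avg = np.nanmean(ar, axis=1)   # mean per position, ignoring the NaN padding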
Since collections.Counter is so slow, I am pursuing a faster method of summing mapped values in Python 2.7. It seems like a simple concept and I'm kind of disappointed in the built-in Counter method.
Basically, I need to be able to take arrays like this:
array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.]])
And then "add" them so they look like this:
array([[ 0., 5.],
[ 1., 1.],
[ 2., 7.],
[ 3., 1.]])
If there isn't a good way to do this quickly and efficiently, I'm open to any other ideas that will allow me to do something similar to this, and I'm open to modules other than Numpy.
Thanks!
Edit: Ready for some speedtests?
Intel win 64bit machine. All of the following values are in seconds; 20000 loops.
collections.Counter results:
2.131000, 2.125000, 2.125000
Divakar's union1d + masking results:
1.641000, 1.633000, 1.625000
Divakar's union1d + indexing results:
0.625000, 0.625000, 0.641000
Histogram results:
1.844000, 1.938000, 1.858000
Pandas results:
16.659000, 16.686000, 16.885000
Conclusions: union1d + indexing wins, the array size is too small for Pandas to be effective, and the histogram approach blew my mind with its simplicity but I'm guessing it takes too much overhead to create. All of the responses I received were very good, though. This is what I used to get the numbers. Thanks again!
Edit: And it should be mentioned that using Counter1.update(Counter2.elements()) is terrible despite doing the same exact thing (65.671000 sec).
Later Edit: I've been thinking about this a lot, and I've come to realize that, with Numpy, it might be more effective to zero-fill each array so that the first column isn't even needed, since we can just use the index; that would also make it much easier to add multiple arrays together and apply other functions. Additionally, Pandas makes more sense than Numpy since there would be no need to zero-fill, and it would definitely be more effective with large data sets (however, Numpy has the advantage of being compatible on more platforms, like GAE, if that matters at all). Lastly, the answer I accepted was definitely the best answer for the exact question I asked (adding the two arrays in the way I showed), but I think what I needed was a change in perspective.
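For reference, a rough sketch of what the Counter-based approach being benchmarked might look like (an assumption on my part; the actual benchmark code is only linked above):
from collections import Counter
import numpy as np

a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])

def counter_add(a, b):
    # column 0 as keys, column 1 as counts; Counter addition sums matching keys
    c = Counter(dict(zip(a[:, 0], a[:, 1]))) + Counter(dict(zip(b[:, 0], b[:, 1])))
    return np.array(sorted(c.items()))

counter_add(a, b)
# array([[0., 5.],
#        [1., 1.],
#        [2., 7.],
#        [3., 1.]])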
Here's one approach with np.union1d and masking -
def app1(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    out = np.zeros((len(c0),2))
    out[:,0] = c0
    mask1 = np.in1d(c0,a[:,0])
    out[mask1,1] = a[:,1]
    mask2 = np.in1d(c0,b[:,0])
    out[mask2,1] += b[:,1]
    return out
Sample run -
In [174]: a
Out[174]:
array([[ 0., 2.],
[ 12., 2.],
[ 23., 1.]])
In [175]: b
Out[175]:
array([[ 0., 3.],
[ 1., 1.],
[ 12., 5.]])
In [176]: app1(a,b)
Out[176]:
array([[ 0., 5.],
[ 1., 1.],
[ 12., 7.],
[ 23., 1.]])
Here's another with np.union1d and indexing -
def app2(a,b):
    n = np.maximum(a[:,0].max(), b[:,0].max())+1
    c0 = np.union1d(a[:,0],b[:,0])
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]
    out0[b[:,0].astype(int),1] += b[:,1]
    out = out0[c0.astype(int)]
    out[:,0] = c0
    return out
For the case where all indices are covered by the first column values in a and b -
def app2_specific(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    n = c0[-1]+1
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]
    out0[b[:,0].astype(int),1] += b[:,1]
    out0[:,0] = c0
    return out0
Sample run -
In [234]: a
Out[234]:
array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
In [235]: b
Out[235]:
array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.]])
In [236]: app2_specific(a,b)
Out[236]:
array([[ 0., 5.],
[ 1., 1.],
[ 2., 7.],
[ 3., 1.]])
If you know the number of fields, use np.bincount.
c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)  # bincount needs integer bin labels
out = np.vstack([np.arange(numFields), counts]).T
This works if you're getting all your data at once. Make a list of your arrays and vstack them. If you're getting data chunks sequentially, you can use np.add.at to do the same thing.
out = np.zeros((numFields, 2))
out[:, 0] = np.arange(numFields)
np.add.at(out[:, 1], a[:, 0].astype(int), a[:, 1])   # indices must be integers
np.add.at(out[:, 1], b[:, 0].astype(int), b[:, 1])
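A quick check of the bincount route on the question's sample arrays (my own run-through, assuming numFields = 4):
import numpy as np

a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])
numFields = 4

c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)
out = np.vstack([np.arange(numFields), counts]).T
# array([[0., 5.],
#        [1., 1.],
#        [2., 7.],
#        [3., 1.]])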
You can use a basic histogram; this will deal with gaps, too. You can filter out zero-count entries if need be.
import numpy as np
x = np.array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
y = np.array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.],
[ 5., 3.]])
c, w = np.vstack((x,y)).T
h, b = np.histogram(c, weights=w, bins=np.arange(c.min(), c.max()+2))
r = np.vstack((b[:-1], h)).T
print(r)
# [[ 0. 5.]
# [ 1. 1.]
# [ 2. 7.]
# [ 3. 1.]
# [ 4. 0.]
# [ 5. 3.]]
r_nonzero = r[r[:,1]!=0]
Pandas has functions that do exactly what you intend:
import pandas as pd
pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)
Edit: If you actually need the data back in numpy format, just do
array_res = result.reset_index(name=1).values
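As a sanity check on the question's sample arrays (my own run-through; depending on your pandas version you may want a sort_index() to get the keys back in order):
import numpy as np
import pandas as pd

a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])

pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1).sort_index()
result.reset_index(name=1).values
# array([[0., 5.],
#        [1., 1.],
#        [2., 7.],
#        [3., 1.]])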
This is a quintessential grouping problem, which numpy_indexed (disclaimer: I am its author) was created to solve elegantly and efficiently:
import numpy_indexed as npi
C = np.concatenate([A, B], axis=0)
labels, sums = npi.group_by(C[:, 0]).sum(C[:, 1])
Note: it's cleaner to maintain your label arrays as a separate int array; floats are finicky when it comes to labeling things, with positive and negative zeros, and printed values not conveying the full binary state. Better to use ints for that.
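If you then want the result back in the question's two-column layout, stacking the labels and sums should do it (a small addition of mine, not part of the original answer):
out = np.column_stack([labels, sums])   # rows of [label, summed value]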
I have a number of time series, each containing measurements across weeks of the year, but not all of them start and end on the same weeks. I know the offsets, that is I know in what weeks each one starts and ends. Now I would like to combine them into a matrix respecting the inherent offsets, such that all values will align with the correct week numbers.
If the horizontal direction contains the series and vertical direction represents the weeks, given two series a and b, where values correspond to week numbers:
a = np.array([[1,2,3,4,5,6]])
b = np.array([[0,1,2,3,4,5]])
I want to know if it is possible to combine them, e.g. using some method that takes an offset argument in a fashion like combine((a, b), axis=0, offset=-1), such that the resulting array (let's call it c) looks like this:
print c
[[NaN 1 2 3 4 5 6 ]
[0 1 2 3 4 5 NaN]]
What's more, since the time series are enormous, I must stream them through my program and therefore cannot know all the offsets at the same time. I thought of using Pandas because it has nice indexing, but I felt there had to be a simpler way, since the essence of what I'm trying to do is super simple.
Update:
This seems to work
def offset_stack(a, b, offset=0):
    if offset < 0:
        a = np.insert(a, [0] * abs(offset), np.nan)
        b = np.append(b, [np.nan] * abs(offset))
    if offset > 0:
        a = np.append(a, [np.nan] * abs(offset))
        b = np.insert(b, [0] * abs(offset), np.nan)
    return np.concatenate(([a],[b]), axis=0)
You can do it in numpy:
def f(a, b, n):
    v = np.empty(abs(n))*np.nan
    if np.sign(n)==-1:
        return np.vstack((np.append(a,v), np.append(v,b)))
    elif np.sign(n)==1:
        return np.vstack((np.append(v,a), np.append(b,v)))
    else:
        return np.vstack((a,b))
#In [148]: a = np.array([23, 13, 4, 12, 4, 4])
#In [149]: b = np.array([4, 12, 3, 41, 45, 6])
#In [150]: f(a,b,-2)
#Out[150]:
#array([[ 23., 13., 4., 12., 4., 4., nan, nan],
# [ nan, nan, 4., 12., 3., 41., 45., 6.]])
#In [151]: f(a,b,2)
#Out[151]:
#array([[ nan, nan, 23., 13., 4., 12., 4., 4.],
# [ 4., 12., 3., 41., 45., 6., nan, nan]])
#In [152]: f(a,b,0)
#Out[152]:
#array([[23, 13, 4, 12, 4, 4],
# [ 4, 12, 3, 41, 45, 6]])
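Applied to the question's arrays, note that the sign convention is the reverse of the question's offset=-1 example; here a positive n shifts a to the right (my own check, not part of the original answer):
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([0, 1, 2, 3, 4, 5])
f(a, b, 1)
# array([[nan,  1.,  2.,  3.,  4.,  5.,  6.],
#        [ 0.,  1.,  2.,  3.,  4.,  5., nan]])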
There is a really simple way to accomplish this.
You basically want to pad and then stack your arrays, and there are numpy functions for both:
numpy.lib.pad() aka offset
a = np.array([[1,2,3,4,5,6]], dtype=np.float_) # float because NaN is a float value!
b = np.array([[0,1,2,3,4,5]], dtype=np.float_)
from numpy.lib import pad
print(pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan))
# [[ nan 1. 2. 3. 4. 5. 6.]]
print(pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan))
# [[ 0. 1. 2. 3. 4. 5. nan]]
The ((0,0),(1,0)) means: no padding in the first axis (top/bottom), and in the second axis pad one element on the left and none on the right. You have to tweak these if you want more/less shift.
numpy.vstack() aka stack along axis=0
import numpy as np
a_padded = pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan)
b_padded = pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan)
np.vstack([a_padded, b_padded])
# array([[ nan, 1., 2., 3., 4., 5., 6.],
# [ 0., 1., 2., 3., 4., 5., nan]])
Your function:
Combining these two is straightforward and easy to extend:
from numpy.lib import pad
import numpy as np
def offset_stack(a, b, axis=0, offsets=(0, 1)):
    if (len(offsets) != a.ndim) or (a.ndim != b.ndim):
        raise ValueError('Offsets and dimensions of the arrays do not match.')
    offset1 = [(0, -offset) if offset < 0 else (offset, 0) for offset in offsets]
    offset2 = [(-offset, 0) if offset < 0 else (0, offset) for offset in offsets]
    a_padded = pad(a, offset1, mode='constant', constant_values=np.nan)
    b_padded = pad(b, offset2, mode='constant', constant_values=np.nan)
    return np.concatenate([a_padded, b_padded], axis=axis)
offset_stack(a, b)
This function works for generalized offsets in arbitrary dimensions and can stack along arbitrary dimensions. It doesn't work in the same way as the original, which shifts along the second dimension; here the first entry of offsets pads the first dimension, so offsets=(1, 0) would shift along rows instead. But if you keep track of the dimensions of your arrays it should work fine.
For example:
offset_stack(a, b, offsets=(1,2))
array([[ nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan, nan],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
or for 3d arrays:
a = np.array([1,2,3], dtype=np.float_)[None, :, None] # makes it 3d
b = np.array([0,1,2], dtype=np.float_)[None, :, None] # makes it 3d
offset_stack(a, b, offsets=(0,1,0), axis=2)
array([[[ nan, 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., nan]]])
pad and concatenate (and the various stack and inserts) create a target array of the right size, and fill values from the input arrays. So we can do the same, and potentially do it faster.
Just for example using your 2 arrays and the 1 step offset:
In [283]: a = np.array([[1,2,3,4,5,6]])
In [284]: b = np.array([[0,1,2,3,4,5]])
create the target array, and fill it with the pad value. np.nan is a float (even though a is int):
In [285]: m=a.shape[0]+b.shape[0]
In [286]: n=a.shape[1]+1
In [287]: c=np.zeros((m,n),float)
In [288]: c.fill(np.nan)
Now just copy values into the right places on the target. More arrays and offsets will require some generalization here.
In [289]: c[:a.shape[0],1:]=a
In [290]: c[-b.shape[0]:,:-1]=b
In [291]: c
Out[291]:
array([[ nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan]])
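To generalize this along the lines hinted at above ('more arrays and offsets will require some generalization'), here is a rough sketch of my own, assuming 1-D series and a known start offset for each:
import numpy as np

def stack_with_offsets(series, starts):
    # one NaN-filled target wide enough for the furthest-right series,
    # then copy each series into its own row at its start offset
    width = max(s + len(arr) for arr, s in zip(series, starts))
    out = np.full((len(series), width), np.nan)
    for row, (arr, s) in enumerate(zip(series, starts)):
        out[row, s:s + len(arr)] = arr
    return out

stack_with_offsets([np.array([1, 2, 3, 4, 5, 6]), np.array([0, 1, 2, 3, 4, 5])], [1, 0])
# array([[nan,  1.,  2.,  3.,  4.,  5.,  6.],
#        [ 0.,  1.,  2.,  3.,  4.,  5., nan]])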
I have n equal length arrays whose transpose corresponds to the coordinates in an n dimensional parameter space:
x = np.array([800,800,800,800,900,900,900,900,900,1000,1000,1000,1000,1000])
y = np.array([4.5,5.0,4.5,5.0,4.5,5.0,5.5,5.0,5.5,4.5,5.0,5.5,5.0,5.5])
z = np.array([2,2,4,4,2,2,4,4,4,2,2,4,4,4])
Each coordinate in parameter space also has a value:
v = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14])
I want to interpolate between the grid points to get the v value at given arbitrary xyz coordinate, e.g. [934,5.1,3.3].
I've been trying to use scipy.RegularGridInterpolator, which takes (x,y,z) as the first argument, but I can't figure out how to construct the second argument of the values at each point.
Any suggestions would be greatly appreciated! Thanks!
Your input would fit better with LinearNDInterpolator or NearestNDInterpolator:
from scipy.interpolate import LinearNDInterpolator
ex = LinearNDInterpolator((x, y, z), v)
ex((800, 4.5, 2))
#array(1.0)
ex([[800, 4.5, 2], [800, 4.5, 3]])
#array([ 1., 2.])
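Equivalently, and perhaps clearer about the expected shape, the coordinates can be stacked into one (npoints, ndims) array first; this is a small variation of mine, with the output at the question's arbitrary point omitted since it depends on the triangulation:
pts = np.column_stack([x, y, z])     # shape (14, 3)
ex2 = LinearNDInterpolator(pts, v)
ex2([934, 5.1, 3.3])                 # the arbitrary point from the question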
To use RegularGridInterpolator you need to define v as a regular array. For example, assume that:
x = np.array([800., 900., 1000.])
y = np.array([4.5, 5.0, 5.5, 6.0])
z = np.array([2., 4.])
The array v could be something like:
v = np.array([[[ 1., 2.],
[ 1., 2.],
[ 1., 2.],
[ 1., 2.]],
[[10., 20.],
[10., 20.],
[10., 20.],
[10., 20.]],
[[100., 200.],
[100., 200.],
[100., 200.],
[100., 200.]]])
And then you would be able to interpolate:
from scipy.interpolate import RegularGridInterpolator
rgi = RegularGridInterpolator((x, y, z), v)
rgi((850., 4.5, 3.))
#array(8.25)
rgi([[850., 4.5, 3.], [800, 4.5, 3]])
#array([ 8.25, 1.5 ])
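As for building the values array from the question's flat x, y, z, v arrays: that only really works if the points form a complete regular grid. A rough sketch of mine (the question's 14 points do not cover every combination, so the missing ones end up as NaN and the scattered-data interpolators above remain the better fit):
xu, xi = np.unique(x, return_inverse=True)
yu, yi = np.unique(y, return_inverse=True)
zu, zi = np.unique(z, return_inverse=True)
values = np.full((len(xu), len(yu), len(zu)), np.nan)   # NaN marks missing grid points
values[xi, yi, zi] = v
rgi = RegularGridInterpolator((xu, yu, zu), values)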
In NumPy, how can you efficiently make a 1-D object into a 2-D object where the singleton dimension is inferred from the current object (i.e. a list should go to either a 1xlength or lengthx1 vector)?
# This comes from some other, unchangeable code that reads data files.
my_list = [1,2,3,4]
# What I want to do:
my_numpy_array[some_index,:] = numpy.asarray(my_list)
# The above doesn't work because of a broadcast error, so:
my_numpy_array[some_index,:] = numpy.reshape(numpy.asarray(my_list),(1,len(my_list)))
# How to do the above without the call to reshape?
# Is there a way to directly convert a list, or vector, that doesn't have a
# second dimension, into a 1 by length "array" (but really it's still a vector)?
In the most general case, the easiest way to add extra dimensions to an array is by using the keyword None when indexing at the position to add the extra dimension. For example
my_array = numpy.array([1,2,3,4])
my_array[None, :] # shape 1x4
my_array[:, None] # shape 4x1
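Applied to the question's slice assignment (using the question's own placeholder names), that would look like:
my_numpy_array[some_index, :] = numpy.asarray(my_list)[None, :]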
Why not simply add square brackets?
>>> my_list
[1, 2, 3, 4]
>>> numpy.asarray([my_list])
array([[1, 2, 3, 4]])
>>> numpy.asarray([my_list]).shape
(1, 4)
.. wait, on second thought, why is your slice assignment failing? It shouldn't:
>>> my_list = [1,2,3,4]
>>> d = numpy.ones((3,4))
>>> d
array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
>>> d[0,:] = my_list
>>> d[1,:] = numpy.asarray(my_list)
>>> d[2,:] = numpy.asarray([my_list])
>>> d
array([[ 1., 2., 3., 4.],
[ 1., 2., 3., 4.],
[ 1., 2., 3., 4.]])
even:
>>> d[1,:] = (3*numpy.asarray(my_list)).T
>>> d
array([[ 1., 2., 3., 4.],
[ 3., 6., 9., 12.],
[ 1., 2., 3., 4.]])
import numpy as np
a = np.random.random(10)
sel = np.atleast_2d(a)[idx]
What about expand_dims?
np.expand_dims(np.array([1,2,3,4]), 0)
has shape (1,4) while
np.expand_dims(np.array([1,2,3,4]), 1)
has shape (4,1).
You can always use dstack() to replicate your array:
import numpy
my_list = numpy.array([1,2,3,4])
my_list_2D = numpy.dstack((my_list, my_list))   # shape (1, 4, 2)