removing bad data pairs in numpy array

removing bad data pairs in numpy array - python

I am working with a large array of data, but every so often I wind up with a nan instead of a value. I need to remove these somehow. Here is an example of my dataset
1 2
3 4
nan 5
6 7
8 nan
9 10
and I would to remove the bad data to become:
1 2
3 4
6 7
9 10

If you're just using numpy, use logical indexing:
import numpy as np
x = np.array([[ 1., 2.],
[ 3., 4.],
[ np.nan, 5.],
[ 6., 7.],
[ 8., np.nan],
[ 9., 10.]])
# find which rows contain nans
ix = np.any(np.isnan(x), axis=1)
# remove them
x = x[~ix]
Which gives:
array([[ 1., 2.],
[ 3., 4.],
[ 6., 7.],
[ 9., 10.]])
This will work for arrays of any number of columns: if a row contains a NaN in at least one column, it is removed.
Alternatively, if you're using pandas, simply use dropna:
import pandas as pd
df = pd.DataFrame(x)
df.dropna()

You can do:
my_numpy_arr = my_numpy_arr[(my_numpy_arr==my_numpy_arr).all(1)]

Related

Numpy version of Pandas dropna with subset - trying to remove rows if the last column in my array contains NaN

I'm trying to remove rows from a numpy array if a certain column (ie. the last column in my array) contains NaN. NaN values in other columns are acceptable, just not the last column.
I know this is possible by converting to a pandas dataframe and using df.dropna(subset=['lastcolumn']). I am wondering if it is possible to do this in numpy since converting to Pandas and using dropna is quite slow.

Using np.isnan() works but need to specify which column can't have NaN:
a = np.array([[1,np.nan,3], [4,5,np.nan], [7,8,9]])
print(a)
[[1.0000 nan 3.0000]
[4.0000 5.0000 nan]
[7.0000 8.0000 9.0000]]
b = a[~np.isnan(a[:,2:3]).any(axis=1)]
print(b)
[[1.0000 nan 3.0000]
[7.0000 8.0000 9.0000]]

Something like this might work:
In [1856]: import numpy as np
In [1857]: a = np.array([[1,2,3], [4,5,np.nan], [7,8,9]])
In [1858]: a
Out[1858]:
array([[ 1., 2., 3.],
[ 4., 5., nan],
[ 7., 8., 9.]])
In [1859]: a[~np.isnan(a).any(axis=1)]
Out[1859]:
array([[1., 2., 3.],
[7., 8., 9.]])
EDIT
If NaN needs to be removed from specific column only, you need this:
In [1870]: a[~np.isnan(a[:, 1:2]).any(axis=1)]
Out[1870]:
array([[ 4., 5., nan],
[ 7., 8., 9.]])
This will remove NaN from first two columns only.

Simple way of stacking arrays with index offset

I have a number of time series, each containing measurements across weeks of the year, but not all of them start and end on the same weeks. I know the offsets, that is I know in what weeks each one starts and ends. Now I would like to combine them into a matrix respecting the inherent offsets, such that all values will align with the correct week numbers.
If the horizontal direction contains the series and vertical direction represents the weeks, given two series a and b, where values correspond to week numbers:
a = np.array([[1,2,3,4,5,6]])
b = np.array([[0,1,2,3,4,5]])
I want to know if is it possible to combine them, e.g. using some method that takes an offset argument in a fashion like combine((a, b), axis=0, offset=-1), such that the resulting array (lets call it c) looks like this:
print c
[[NaN 1 2 3 4 5 6 ]
[0 1 2 3 4 5 NaN]]
What more is, since the time series are enormous, I must stream them through my program, and therefore cannot know all offsets at the same time. I thought of using Pandas because it has nice indexing, but I felt there had to be a simpler way, since the essence of what I'm trying to do is super simple.
Update:
This seems to work
def offset_stack(a, b, offset=0):
if offset < 0:
a = np.insert(a, [0] * abs(offset), np.nan)
b = np.append(b, [np.nan] * abs(offset))
if offset > 0:
a = np.append(a, [np.nan] * abs(offset))
b = np.insert(b, [0] * abs(offset), np.nan)
return np.concatenate(([a],[b]), axis=0)

You can do in numpy:
def f(a, b, n):
v = np.empty(abs(n))*np.nan
if np.sign(n)==-1:
return np.vstack((np.append(a,v), np.append(v,b)))
elif np.sign(n)==1:
return np.vstack((np.append(v,a), np.append(b,v)))
else:
return np.vstack((a,b))
#In [148]: a = np.array([23, 13, 4, 12, 4, 4])
#In [149]: b = np.array([4, 12, 3, 41, 45, 6])
#In [150]: f(a,b,-2)
#Out[150]:
#array([[ 23., 13., 4., 12., 4., 4., nan, nan],
# [ nan, nan, 4., 12., 3., 41., 45., 6.]])
#In [151]: f(a,b,2)
#Out[151]:
#array([[ nan, nan, 23., 13., 4., 12., 4., 4.],
# [ 4., 12., 3., 41., 45., 6., nan, nan]])
#In [152]: f(a,b,0)
#Out[152]:
#array([[23, 13, 4, 12, 4, 4],
# [ 4, 12, 3, 41, 45, 6]])

There is a real simple way to accomplish this.
You basically want to pad and then stack your arrays and for both there are numpy functions:
numpy.lib.pad() aka offset
a = np.array([[1,2,3,4,5,6]], dtype=np.float_) # float because NaN is a float value!
b = np.array([[0,1,2,3,4,5]], dtype=np.float_)
from numpy.lib import pad
print(pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan))
# [[ nan 1. 2. 3. 4. 5. 6.]]
print(pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan))
# [[ 0., 1., 2., 3., 4., 5., nan]]
The ((0,0)(1,0)) means just no padding in the first axis (top/bottom) and only pad one element left and no element on the right. So you have to tweak these if you want more/less shift.
numpy.vstack() aka stack along axis=0
import numpy as np
a_padded = pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan)
b_padded = pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan)
np.vstack([a_padded, b_padded])
# array([[ nan, 1., 2., 3., 4., 5., 6.],
# [ 0., 1., 2., 3., 4., 5., nan]])
Your function:
Combining these two would be very easy and is easy to extend:
from numpy.lib import pad
import numpy as np
def offset_stack(a, b, axis=0, offsets=(0, 1)):
if (len(offsets) != a.ndim) or (a.ndim != b.ndim):
raise ValueError('Offsets and dimensions of the arrays do not match.')
offset1 = [(0, -offset) if offset < 0 else (offset, 0) for offset in offsets]
offset2 = [(-offset, 0) if offset < 0 else (0, offset) for offset in offsets]
a_padded = pad(a, offset1, mode='constant', constant_values=np.nan)
b_padded = pad(b, offset2, mode='constant', constant_values=np.nan)
return np.concatenate([a_padded, b_padded], axis=axis)
offset_stack(a, b)
This function works for generalized offsets in arbitary dimensions and can stack in arbitary dimensions. It doesn't work in the same way as the original since you pad the second dimension just passing in offset=1 would pad in the first dimension. But if you keep track of the dimensions of your arrays it should work fine.
For example:
offset_stack(a, b, offsets=(1,2))
array([[ nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan, nan],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
or for 3d arrays:
a = np.array([1,2,3], dtype=np.float_)[None, :, None] # makes it 3d
b = np.array([0,1,2], dtype=np.float_)[None, :, None] # makes it 3d
offset_stack(a, b, offsets=(0,1,0), axis=2)
array([[[ nan, 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., nan]]])

pad and concatenate (and the various stack and inserts) create a target array of the right size, and fill values from the input arrays. So we can do the same, and potentially do it faster.
Just for example using your 2 arrays and the 1 step offset:
In [283]: a = np.array([[1,2,3,4,5,6]])
In [284]: b = np.array([[0,1,2,3,4,5]])
create the target array, and fill it with the pad value. np.nan is a float (even though a is int):
In [285]: m=a.shape[0]+b.shape[0]
In [286]: n=a.shape[1]+1
In [287]: c=np.zeros((m,n),float)
In [288]: c.fill(np.nan)
Now just copy values into the right places on the target. More arrays and offsets will require some generalization here.
In [289]: c[:a.shape[0],1:]=a
In [290]: c[-b.shape[0]:,:-1]=b
In [291]: c
Out[291]:
array([[ nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan]])

optimizing python file related to heapq.nlargest and extend using loop

my objective is to find a few (=3 in this example) largest values in one list, fourire, identify positions in the list, and obtain corresponding (position_wise) values in the other list, freq, so the print out should be like
2. 27.
9. 25.
4. 22.
the attached python is working fine....sort of.
** note that i am dealing with numpy array so index() is not working....
is there way to improve the followings?
import heapq
freq = [ 2., 8., 1., 6., 9., 3., 6., 9., 4., 8., 12.]
fourire = [ 27., 3., 2., 7., 4., 9., 10., 25., 22., 5., 3.]
out = heapq.nlargest(3, enumerate(fourire), key=lambda x:x[1])
elem_fourire = []
elem_freq = []
for i in range(len(out)):
(key, value) = out[i]
elem_freq.extend([freq[key]])
elem_fourire.extend([value])
for i in range(len(out)):
print elem_freq[i], elem_fourire[i]

import numpy as np
fourire = np.array(fourire)
freq = np.array(freq)
ix = fourire.argsort(kind='heapsort')[-3:][::-1]
for a, b in zip(freq[ix],fourire[ix]):
print a, b
prints
2.0 27.0
9.0 25.0
4.0 22.0
If you want to use heapq instead of numpy, a slight modification of your code above yields:
ix = heapq.nlargest(3,range(len(freq)),key=lambda x: fourire[x])
for x in ix:
print freq[x], fourire[x]
results in the same output

Finding items in one array based upon a second array

I have two arrays A and B:
A=array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B=array([[ 1., 1., 2.],
[ 3., 2., 1.]])
Anywhere there is a "1" in B I want to sum the same row and column locations in A.
So for example for this one the answer would be 5+5+9=10
I would want this to continue for 2,3....n (all unique values in B)
So for the 2's... it would be 9+5=14 and for the 3's it would be 8
I found the unique values by using:
numpy.unique(B)
I realize this make take multiple steps but I can't really wrap my head around using the index matrix to sum those locations in another matrix.

For each unique value x, you can do
A[B == x].sum()
Example:
>>> A[B == 1.0].sum()
19.0

I thinknumpy.bincount is what you want. If B is an array of small integers like in you example you can do something like this:
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1, 1, 2],
[ 3, 2, 1]])
print numpy.bincount(B.ravel(), weights=A.ravel())
# [ 0. 19. 14. 8.]
or if B has anything but small integers you can do something like this
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1., 1., 2.],
[ 3., 2., 1.]])
uniqB, inverse = numpy.unique(B, return_inverse=True)
print uniqB, numpy.bincount(inverse, weights=A.ravel())
# [ 1. 2. 3.] [ 19. 14. 8.]

[(val, np.sum(A[B==val])) for val in np.unique(B)] gives you a list of tuples where the first element is one of the unique values in B, and the second element is the sum of elements in A where the corresponding value in B is that value.
>>> [(val, np.sum(A[B==val])) for val in np.unique(B)]
[(1.0, 19.0), (2.0, 14.0), (3.0, 8.0)]
The key is that you can use A[B==val] to access items in A at positions where B equals val.
Edit: If you just want the sums, just do [np.sum(A[B==val]) for val in np.unique(B)].

I'd use numpy masked arrays. These are standard numpy arrays with a mask associated with them blocking off certain values. The process is pretty straight forward, create a masked array using
numpy.ma.masked_array(data, mask)
where mask is generated by using a masked function
mask = numpy.ma.masked_not_equal(B, 1).mask
and data is A
for i in numpy.unique(B):
print numpy.ma.masked_array(A, numpy.ma.masked_not_equal(B, i).mask).sum()
19.0
14.0
8.0

i found old question here
one of the answer
def sum_by_group(values, groups):
order = np.argsort(groups)
groups = groups[order]
values = values[order]
values.cumsum(out=values)
index = np.ones(len(groups), 'bool')
index[:-1] = groups[1:] != groups[:-1]
values = values[index]
groups = groups[index]
values[1:] = values[1:] - values[:-1]
return values, groups
in your case, you can flatten your array
aflat = A.flatten()
bflat = B.flatten()
sum_by_group(aflat, bflat)

assign index dependant value to each index in numpy array

I want to center multi-dimensional data in a n x m matrix (<class 'numpy.matrixlib.defmatrix.matrix'>), let's say X . I defined a new array ones(645), lets say centVector to produce the mean for every row in matrix X. And now I want to iterate every row in X, compute the mean and assign this value to the corresponding index in centVector. Isn't this possible in a single row in scipy/numpy? I am not used to this language and think about something like:
centVector = ones(645)
for key, val in X:
centVector[key] = centVector[key] * (val.sum/val.size)
Afterwards I just need to subtract the mean in every Row:
X = X - centVector
How can I simplify this?
EDIT: And besides, the above code is not actually working - for a key-value loop I need something like enumerate(X). And I am not sure if X - centVector is returning the proper solution.

First, some example data:
>>> import numpy as np
>>> X = np.matrix(np.arange(25).reshape((5,5)))
>>> print X
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
numpy conveniently has a mean function. By default however, it'll give you the mean over all the values in the array. Since you want the mean of each row, you need to specify the axis of the operation:
>>> np.mean(X, axis=1)
matrix([[ 2.],
[ 7.],
[ 12.],
[ 17.],
[ 22.]])
Note that axis=1 says: find the mean along the columns (for each row), where 0 = rows and 1 = columns (and so on). Now, you can subtract this mean from your X, as you did originally.
Unsolicited advice
Usually, it's best to avoid the matrix class (see docs). If you remove the np.matrix call from the example data, then you get a normal numpy array.
Unfortunately, in this particular case, using an array slightly complicates things because np.mean will return a 1D array:
>>> X = np.arange(25).reshape((5,5))
>>> r_means = np.mean(X, axis=1)
>>> print r_means
[ 2. 7. 12. 17. 22.]
If you try to subtract this from X, r_means gets broadcast to a row vector, instead of a column vector:
>>> X - r_means
array([[ -2., -6., -10., -14., -18.],
[ 3., -1., -5., -9., -13.],
[ 8., 4., 0., -4., -8.],
[ 13., 9., 5., 1., -3.],
[ 18., 14., 10., 6., 2.]])
So, you'll have to reshape the 1D array into an N x 1 column vector:
>>> X - r_means.reshape((-1, 1))
array([[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.]])
The -1 passed to reshape tells numpy to figure out this dimension based on the original array shape and the rest of the dimensions of the new array. Alternatively, you could have reshaped the array using r_means[:, np.newaxis].

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

removing bad data pairs in numpy array - python

I am working with a large array of data, but every so often I wind up with a nan instead of a value. I need to remove these somehow. Here is an example of my dataset 1 2 3 4 nan 5 6 7 8 nan 9 10 and I would to remove the bad data to become: 1 2 3 4 6 7 9 10

You can do: my_numpy_arr = my_numpy_arr[(my_numpy_arr==my_numpy_arr).all(1)]

Related

Numpy version of Pandas dropna with subset - trying to remove rows if the last column in my array contains NaN

Simple way of stacking arrays with index offset

optimizing python file related to heapq.nlargest and extend using loop

Finding items in one array based upon a second array

assign index dependant value to each index in numpy array

Categories

Resources