Replace values in numpy 2D array based on pandas dataframe - python

>>> arr
array([[ 0., 10.,  0., ...,  0.,  0.,  0.],
       [ 0.,  4.,  0., ...,  6.,  0.,  9.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  2.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  3.,  0.]])
In the numpy array above, I would like to replace every value that matches a value in the country_codes column of the dataframe df_A with the corresponding value from its continent_codes column. df_A looks like:
   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5
Right now, I loop through the dataframe and replace using numpy indexing notation. Given that iterrows() tends to be slow, is there a more direct/vectorized way to do this?
for index, row in self.df_A.iterrows():
    arr[arr == row['country_codes']] = row['continent_codes']

Approach #1 : One vectorized approach using np.searchsorted and np.in1d would be as listed below. Note that np.searchsorted assumes oldval is sorted, which holds for the country_codes column here -
# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])
# Mask of elements to be changed
mask = np.in1d(arr,oldval)
# Indices for each match from oldval in arr
idx = np.searchsorted(oldval,arr.ravel()[mask])
# Using the mask put selective elements from continent_codes column into arr
arr.ravel()[mask] = newval[idx]
Sample run -
>>> arr   # Original 2D array
array([[23,  4, 23,  5,  8],
       [ 3,  6,  8,  5, 11],
       [16, 24, 15,  4, 10],
       [ 4, 16, 10,  8,  1]])
>>> df
   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5
>>> oldval = np.array(df['country_codes'])
>>> newval = np.array(df['continent_codes'])
>>> mask = np.in1d(arr,oldval)
>>> idx = np.searchsorted(oldval,arr.ravel()[mask])
>>> arr.ravel()[mask] = newval[idx]
>>> mask.reshape(arr.shape)   # Mask array depicting which elements were updated
array([[False,  True, False, False,  True],
       [False, False,  True, False, False],
       [ True,  True, False,  True, False],
       [ True,  True, False,  True, False]], dtype=bool)
>>> arr   # Updated 2D array
array([[23,  4, 23,  5,  3],
       [ 3,  6,  3,  5, 11],
       [ 6,  5, 15,  4, 10],
       [ 4,  6, 10,  3,  1]])
Approach #2 : As a variant, you can also create the mask by comparing np.searchsorted(oldval,arr,'left') with np.searchsorted(oldval,arr,'right'), as discussed in the solutions for this question, and then reuse the 'left' indices when putting values into arr, for a more efficient solution, like so -
# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])
# Left and right indices for each match from oldval in arr
left_idx = np.searchsorted(oldval,arr,'left')
right_idx = np.searchsorted(oldval,arr,'right')
# Mask of elements to be changed
mask = left_idx!=right_idx
# Using the mask put selective elements from continent_codes column into arr
arr[mask] = newval[left_idx[mask]]
Runtime tests and verify outputs
Function definitions -
def original_app(arr,df):
    for index, row in df.iterrows():
        arr[arr == row['country_codes']] = row['continent_codes']

def vectorized_app1(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    mask = np.in1d(arr,oldval)
    idx = np.searchsorted(oldval,arr.ravel()[mask])
    arr.ravel()[mask] = newval[idx]

def vectorized_app2(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    left_idx = np.searchsorted(oldval,arr,'left')
    right_idx = np.searchsorted(oldval,arr,'right')
    mask = left_idx != right_idx
    arr[mask] = newval[left_idx[mask]]
Verify outputs -
In [195]: # Input array
...: arr = np.random.randint(0,100000,(1000,1000))
...:
...: # Setup input dataframe
...: N = 1000
...: oldvals = np.unique(np.random.randint(0,100000,N))
...: newvals = np.random.randint(0,9,(oldvals.size))
...: df=pd.DataFrame({'country_codes':oldvals,'continent_codes':newvals})
...: df = df.reindex(sorted(df.columns)[::-1], axis=1)
...:
...: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:
In [196]: # Verify outputs
...: original_app(arrc1,df)
...: vectorized_app1(arrc2,df)
...: vectorized_app2(arrc3,df)
...:
In [197]: np.allclose(arrc1,arrc2)
Out[197]: True
In [198]: np.allclose(arrc1,arrc3)
Out[198]: True
Timings -
In [199]: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:
In [200]: %timeit original_app(arrc1,df)
1 loops, best of 3: 2.79 s per loop
In [201]: %timeit vectorized_app1(arrc2,df)
1 loops, best of 3: 360 ms per loop
In [202]: %timeit vectorized_app2(arrc3,df)
1 loops, best of 3: 213 ms per loop
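As an aside, on newer NumPy versions np.isin is the recommended replacement for np.in1d, and it returns a mask in arr's 2D shape directly, so Approach #1 can skip one ravel step. A minimal sketch of that variant (assuming NumPy >= 1.13 for np.isin and pandas >= 0.24 for to_numpy()):
oldval = df['country_codes'].to_numpy()
newval = df['continent_codes'].to_numpy()
# Boolean mask with arr's 2D shape; same contents as np.in1d(arr, oldval) reshaped
mask = np.isin(arr, oldval)
# Replace matched elements via the same sorted-lookup idea as Approach #1
arr[mask] = newval[np.searchsorted(oldval, arr[mask])]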

With this data as an example, with at most N countries,
N = 10**5
values = np.random.randint(0, N, (1000, 1000))
exemple = {'country': np.arange(N//2), 'continent': np.random.randint(1, 5, N//2)}
df = pd.DataFrame.from_dict(exemple)
You can just do:
v = np.arange(N)
v[df['country']] = df['continent']
v.take(values, out=values)
Probably not optimal, but efficient (about 20 ms).
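One caveat worth making explicit: this lookup-table approach assumes every value in values is a valid index into v, i.e. all values lie in [0, N). A minimal sketch wrapping the idea with that assumption checked (the function name remap_with_lut is mine, not from the answer):
import numpy as np

def remap_with_lut(values, country, continent, N):
    # Identity table: codes with no mapping stay unchanged
    lut = np.arange(N)
    lut[country] = continent
    # All codes must be valid non-negative indices into the table
    assert values.min() >= 0 and values.max() < N
    # One fancy-indexing pass replaces every element at once
    return lut[values]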


numpy where condition on index

I have a numpy 1D array of numbers representing columns, e.g. [0,0,2,1],
and a matrix, e.g.:
[[1,1,1],
 [1,1,1],
 [1,1,1],
 [1,1,1]]
Now I want to change the values in the matrix to 0 where the column index is bigger than the value given in the 1D array:
[[1,0,0],
 [1,0,0],
 [1,1,1],
 [1,1,0]]
How can I achieve this? I think I need a condition based on the index, not on the value.
Explanation:
The first row of the matrix has indices (0,0), (0,1), (0,2), where the second index is the column. For these positions the 1D array gives the value 0; the column indices 1 and 2 are bigger than 0, so only position (0,0) is left unchanged.
Assuming a is the 2D array and v the 1D vector, you can create a mask of the same size and use numpy.where:
x,y = a.shape
np.where(np.tile(np.arange(y), (x,1)) <= v[:,None], a, 0)
Input:
a = np.array([[1,1,1],
              [1,1,1],
              [1,1,1],
              [1,1,1]])
v = np.array([0,0,2,1])
Output:
array([[1, 0, 0],
       [1, 0, 0],
       [1, 1, 1],
       [1, 1, 0]])
Intermediates:
>>> np.tile(np.arange(y), (x,1))
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]
>>> np.tile(np.arange(y), (x,1)) <= v[:,None]
[[ True False False]
 [ True False False]
 [ True  True  True]
 [ True  True False]]
Construct a 2D array whose elements are the corresponding column index, and then mask the elements greater than the corresponding value of the 1D array.
Taking advantage of broadcasting, you can do:
>>> arr = np.ones((4,3))
>>> arr
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
>>> col_thr_idx = np.array([0,0,2,1])
>>> col_thr_idx
array([0, 0, 2, 1])
>>> mask = np.arange(arr.shape[1])[None,:] > col_thr_idx[:,None]
>>> mask
array([[False,  True,  True],
       [False,  True,  True],
       [False, False, False],
       [False, False,  True]])
>>> arr[mask] = 0
>>> arr
array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 1., 1.],
       [1., 1., 0.]])

Numpy replace specific rows and columns of one array with specific rows and columns of another array

I am trying to replace specific rows and columns of a Numpy array as given below.
The values of arrays a and b are initially:
a = [[1 1 1 1]
     [1 1 1 1]
     [1 1 1 1]]
b = [[2 3 4 5]
     [6 7 8 9]
     [0 2 3 4]]
Now, based on a certain probability, I need to perform elementwise replacing of a with the values of b (say, after generating a random number, r, between 0 and 1 for each element, I will replace the element of a with that of b if r > 0.8).
How can I use numpy/scipy to do this in Python with high performance?
With masking. We first generate a matrix of random numbers with the same dimensions as a, and check which entries are larger than 0.8:
mask = np.random.random(a.shape) > 0.8
Now we can assign the values of b where mask is True to the corresponding indices of a:
a[mask] = b[mask]
For example:
>>> a
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
>>> b
array([[2, 3, 4, 5],
       [6, 7, 8, 9],
       [0, 2, 3, 4]])
>>> mask = np.random.random(a.shape) > 0.8
>>> mask
array([[ True, False, False, False],
       [ True, False, False, False],
       [False, False, False, False]])
>>> a[mask] = b[mask]
>>> a
array([[2., 1., 1., 1.],
       [6., 1., 1., 1.],
       [1., 1., 1., 1.]])
So wherever the mask is True (since 0.8 is rather high, we expect on average only 0.2 × 12 = 2.4 such values), we assign the corresponding value of b.
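As an aside, on NumPy >= 1.17 the same idea can be written with the newer Generator API; a minimal sketch (the seed is arbitrary, used only to make the run reproducible):
import numpy as np

rng = np.random.default_rng(seed=0)
a = np.ones((3, 4))
b = np.array([[2, 3, 4, 5],
              [6, 7, 8, 9],
              [0, 2, 3, 4]])
# Same masking idea: draw uniform [0, 1) samples, take b where they exceed 0.8
mask = rng.random(a.shape) > 0.8
a[mask] = b[mask]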

mask only where consecutive nans exceeds x

I was answering a question about pandas' interpolate method. The OP wanted to interpolate only where the number of consecutive np.nans was one. The limit=1 option for interpolate will interpolate the first np.nan and stop there. The OP wanted to be able to tell that there were in fact more than one np.nan and not even bother with the first one.
I boiled this down to just executing the interpolation as is and masking the consecutive np.nans after the fact.
The question is: what is a generalized solution that takes a 1-d array a and an integer x and produces a boolean mask with False in the positions of x or more consecutive np.nans?
Consider the 1-d array a
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
I'd expect that for x = 2 the mask would look like this
# assume 1 for True and 0 for False
# a is [  1. nan nan nan  1. nan  1.  1. nan nan  1.  1.]
# mask [  1.  0.  0.  0.  1.  1.  1.  1.  0.  0.  1.  1.]
#                              ^
#                              |
# Notice that this is not masked because there is only one np.nan
I'd expect that for x = 3 the mask would look like this
# assume 1 for True and 0 for False
# a is [  1. nan nan nan  1. nan  1.  1. nan nan  1.  1.]
# mask [  1.  0.  0.  0.  1.  1.  1.  1.  1.  1.  1.  1.]
#                              ^           ^   ^
#                              |           |   |
# Notice that these are not masked because there are fewer than 3 consecutive np.nans
I look forward to learning from others' ideas ;-)
I really like numba for such easy-to-grasp but hard-to-"numpyfy" problems! Even though that package might be a bit too heavy a dependency for most libraries, it allows you to write such "python"-like functions without losing too much speed:
import numpy as np
import numba as nb
import math

@nb.njit
def mask_nan_if_consecutive(arr, limit):  # I'm not good at function names :(
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            # If we just reached the limit we need to backtrack,
            # otherwise just mask current.
            if cnt == limit:
                for subidx in range(idx - limit + 1, idx + 1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0
    return result
At least if you've worked with pure Python this should be quite easy to understand, and it works:
>>> a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
>>> mask_nan_if_consecutive(a, 1)
array([ 1., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 1.])
>>> mask_nan_if_consecutive(a, 2)
array([ 1., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1.])
>>> mask_nan_if_consecutive(a, 3)
array([ 1., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.])
>>> mask_nan_if_consecutive(a, 4)
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
But the really nice thing about the @nb.njit decorator is that this function will be fast:
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
i = 2
res1 = mask_nan_if_consecutive(a, i)
res2 = mask_knans(a, i)
np.testing.assert_array_equal(res1, res2)
%timeit mask_nan_if_consecutive(a, i) # 100000 loops, best of 3: 6.03 µs per loop
%timeit mask_knans(a, i) # 1000 loops, best of 3: 302 µs per loop
So for short arrays this is approximately 50 times faster, and even though the difference gets smaller it's still faster for longer arrays:
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1]*100000)
i = 2
%timeit mask_nan_if_consecutive(a, i) # 10 loops, best of 3: 20.9 ms per loop
%timeit mask_knans(a, i) # 10 loops, best of 3: 154 ms per loop
I created this generalized solution:
import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]
    # I will stride n. I want to pad with 1 less False than
    # the required number of np.nan's
    n = np.append(np.isnan(a), [False] * (x - 1))
    # prepare the mask and fill it with True
    m = np.empty(k, bool)
    m.fill(True)
    # stride n into a number of columns equal to
    # the required number of np.nan's to mask;
    # this is essentially a rolling `all` operation on isnull.
    # np.where finds the indices where we successfully start
    # x consecutive np.nan's;
    # reshape with `[:, None]` in preparation for broadcasting
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
    # since I prepped with `[:, None]`, when I add `np.arange(x)`
    # I'm including the subsequent indices where the remaining
    # x - 1 np.nan's are
    i = i + np.arange(x)
    # I use `pd.unique` because it doesn't sort and I don't need to sort
    i = pd.unique(i[i < k])
    m[i] = False
    return m
w/o comments
import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]
    n = np.append(np.isnan(a), [False] * (x - 1))
    m = np.empty(k, bool)
    m.fill(True)
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
    i = i + np.arange(x)
    i = pd.unique(i[i < k])
    m[i] = False
    return m
demo
mask_knans(a, 2)
[ True False False False  True  True  True  True False False  True  True]
mask_knans(a, 3)
[ True False False False  True  True  True  True  True  True  True  True]

Numpy indexing 3-dimensional array into 2-dimensional array

I have a three-dimensional array of the following structure:
x = np.array([[[1,2],
               [3,4]],
              [[5,6],
               [7,8]]], dtype=np.double)
Additionally, I have an index array
idx = np.array([[0,1],[1,3]], dtype=int)
Each row of idx gives the row/column indices at which the corresponding sub-array of x (along axis 0) is placed into a two-dimensional array K that is initialized as
K = np.zeros((4,4), dtype=np.double)
I would like to use fancy indexing/broadcasting to perform the indexing without a for loop. I currently do it this way:
for i, id in enumerate(idx):
    idx_grid = np.ix_(id,id)
    K[idx_grid] += x[i]
Such that the result is:
>>> K
array([[ 1.,  2.,  0.,  0.],
       [ 3.,  9.,  0.,  6.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  7.,  0.,  8.]])
Is this possible to do with fancy indexing?
Here's one alternative way. With x, idx and K defined as in your question:
indices = (idx[:,None] + K.shape[1]*idx).ravel('f')
np.add.at(K.ravel(), indices, x.ravel())
Then we have:
>>> K
array([[ 1.,  2.,  0.,  0.],
       [ 3.,  9.,  0.,  6.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  7.,  0.,  8.]])
To perform unbuffered in-place addition on NumPy arrays you need to use np.add.at (to avoid using += in a for loop).
However, it's slightly problematic to pass a list of 2D index arrays, and corresponding arrays to add at these indices, to np.add.at. This is because the function interprets these lists of arrays as higher-dimensional arrays and IndexErrors are raised.
It's much simpler to pass in 1D arrays. You can temporarily ravel K and x to give you a 1D array of zeros and a 1D array of values to add to those zeros. The only fiddly part is constructing a corresponding 1D array of indices from idx at which to add the values. This can be done via broadcasting with arithmetical operators and then ravelling, as shown above.
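For the sample x and idx above (with K.shape[1] == 4), the intermediate 1D index array works out to:
>>> indices = (idx[:,None] + K.shape[1]*idx).ravel('f')
>>> indices
array([ 0,  1,  4,  5,  5,  7, 13, 15])
Note that the flat index 5 (i.e. K[1,1]) appears twice, which is exactly why the unbuffered np.add.at is needed rather than a plain fancy-indexed assignment.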
The intended operation is one of an accumulation of values from x into places indexed by idx. You could think of those idx places as bins of histogram data and the x values as the weights that you need to accumulate for those bins. Now, to perform such a binning operation, np.bincount could be used. Here's one such implementation with it -
# Get size info of expected output
N = idx.max()+1
# Extend idx to cover two axes, equivalent to `np.ix_`
idx1 = idx[:,None,:] + N*idx[:,:,None]
# "Accumulate" values from x into places indexed by idx1
K = np.bincount(idx1.ravel(),x.ravel()).reshape(N,N)
Runtime tests -
1) Create inputs:
In [361]: # Create x and idx, with idx having unique elements in each row of idx,
...: # as otherwise the intended operation is not clear
...:
...: nrows = 100
...: max_idx = 100
...: ncols_idx = 2
...:
...: x = np.random.rand(nrows,ncols_idx,ncols_idx)
...: idx = np.random.randint(0,max_idx,(nrows,ncols_idx))
...:
...: valid_mask = ~np.any(np.diff(np.sort(idx,axis=1),axis=1)==0,axis=1)
...:
...: x = x[valid_mask]
...: idx = idx[valid_mask]
...:
2) Define functions:
In [362]: # Define the original and proposed (bincount based) approaches
...:
...: def org_approach(x,idx):
...:     N = idx.max()+1
...:     K = np.zeros((N,N), dtype=np.double)
...:     for i, id in enumerate(idx):
...:         idx_grid = np.ix_(id,id)
...:         K[idx_grid] += x[i]
...:     return K
...:
...:
...: def bincount_approach(x,idx):
...:     N = idx.max()+1
...:     idx1 = idx[:,None,:] + N*idx[:,:,None]
...:     return np.bincount(idx1.ravel(),x.ravel()).reshape(N,N)
...:
3) Finally time them:
In [363]: %timeit org_approach(x,idx)
100 loops, best of 3: 2.13 ms per loop
In [364]: %timeit bincount_approach(x,idx)
10000 loops, best of 3: 32 µs per loop
I do not think it is efficiently possible, since you have += in the loop. This means you would have to "blow up" your array idx by one dimension and reduce it again by utilizing np.sum(x[...], axis=...).
A minor optimization would be:
import numpy as np

xx = np.array([[[1, 2],
                [3, 4]],
               [[5, 6],
                [7, 8]]], dtype=np.double)
idx = np.array([[0, 1], [1, 3]], dtype=int)
K0, K1 = np.zeros((4, 4), dtype=np.double), np.zeros((4, 4), dtype=np.double)

for k, i in enumerate(idx):
    idx_grid = np.ix_(i, i)
    K0[idx_grid] += xx[k]

for x, i in zip(xx, idx):
    K1[np.ix_(i, i)] += x

print("K1 == K0:", np.allclose(K1, K0))  # prints: K1 == K0: True
PS: Do not use id as a variable name, since it shadows Python's built-in id() function.

Map arrays with duplicate indexes?

Assume three arrays in numpy:
a = np.zeros(5)
b = np.array([3,3,3,0,0])
c = np.array([1,5,10,50,100])
b can now be used as an index for a and c. For example:
In [142]: c[b]
Out[142]: array([50, 50, 50, 1, 1])
Is there any way to add up the values connected to the duplicate indexes with this kind of slicing? With
a[b] = c
Only the last values are stored:
array([ 100., 0., 0., 10., 0.])
I would like something like this:
a[b] += c
which would give
array([ 150., 0., 0., 16., 0.])
I'm mapping very large vectors onto 2D matrices and would really like to avoid loops...
The += operator for NumPy arrays simply doesn't work the way you are hoping, and I'm not aware of a way of making it work that way. As a work-around I suggest using numpy.bincount():
>>> numpy.bincount(b, c)
array([ 150., 0., 0., 16.])
Just append zeros as needed.
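Alternatively, instead of appending zeros by hand, you can let bincount pad the result for you via its minlength argument:
>>> numpy.bincount(b, weights=c, minlength=len(a))
array([ 150.,    0.,    0.,   16.,    0.])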
You could do something like:
def sum_unique(label, weight):
    order = np.lexsort(label.T)
    label = label[order]
    weight = weight[order]
    unique = np.ones(len(label), 'bool')
    unique[:-1] = (label[1:] != label[:-1]).any(-1)
    totals = weight.cumsum()
    totals = totals[unique]
    totals[1:] = totals[1:] - totals[:-1]
    return label[unique], totals
And use it like this:
In [110]: coord = np.random.randint(0, 3, (10, 2))
In [111]: coord
Out[111]:
array([[0, 2],
       [0, 2],
       [2, 1],
       [1, 2],
       [1, 0],
       [0, 2],
       [0, 0],
       [2, 1],
       [1, 2],
       [1, 2]])
In [112]: weights = np.ones(10)
In [113]: uniq_coord, sums = sum_unique(coord, weights)
In [114]: uniq_coord
Out[114]:
array([[0, 0],
       [1, 0],
       [2, 1],
       [0, 2],
       [1, 2]])
In [115]: sums
Out[115]: array([ 1., 1., 2., 3., 3.])
In [116]: a = np.zeros((3,3))
In [117]: x, y = uniq_coord.T
In [118]: a[x, y] = sums
In [119]: a
Out[119]:
array([[ 1.,  0.,  3.],
       [ 1.,  0.,  3.],
       [ 0.,  2.,  0.]])
I just thought of this, it might be easier:
In [120]: flat_coord = np.ravel_multi_index(coord.T, (3,3))
In [121]: sums = np.bincount(flat_coord, weights)
In [122]: a = np.zeros((3,3))
In [123]: a.flat[:len(sums)] = sums
In [124]: a
Out[124]:
array([[ 1.,  0.,  3.],
       [ 1.,  0.,  3.],
       [ 0.,  2.,  0.]])
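As a side note postdating these answers: NumPy 1.8+ ships np.add.at, which performs exactly the unbuffered accumulation the question asks for:
>>> a = np.zeros(5)
>>> np.add.at(a, b, c)
>>> a
array([ 150.,    0.,    0.,   16.,    0.])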
