Retrieve indexes of multiple values with NumPy in a vectorized way - python

To get the index corresponding to the value 99 in a NumPy array, we do:
mynumpy = np.array([5,6,9,2,99,3,88,4,7])
np.where(mynumpy==99)
What if I want to get the indexes corresponding to the values 99, 55, 6, 3 and 7? Obviously, this is possible with a simple loop, but I'm looking for a more vectorized solution. I know NumPy is very powerful, so I think something like that might exist.
Desired output:
searched_values=np.array([99,55,6,3,7])
np.where(searched_values in mynumpy)
[(4),(),(1),(5),(8)]

Here's one approach with np.searchsorted -
def find_indexes(ar, searched_values, invalid_val=-1):
    sidx = ar.argsort()
    pidx = np.searchsorted(ar, searched_values, sorter=sidx)
    pidx[pidx==len(ar)] = 0
    idx = sidx[pidx]
    idx[ar[idx] != searched_values] = invalid_val
    return idx
Sample run -
In [29]: find_indexes(mynumpy, searched_values, invalid_val=-1)
Out[29]: array([ 4, -1, 1, 5, 8])
For a generic invalid value specifier, we could use np.where -
def find_indexes_v2(ar, searched_values, invalid_val=-1):
    sidx = ar.argsort()
    pidx = np.searchsorted(ar, searched_values, sorter=sidx)
    pidx[pidx==len(ar)] = 0
    idx = sidx[pidx]
    return np.where(ar[idx] == searched_values, idx, invalid_val)
Sample run -
In [35]: find_indexes_v2(mynumpy, searched_values, invalid_val=None)
Out[35]: array([4, None, 1, 5, 8], dtype=object)
# For list output
In [36]: find_indexes_v2(mynumpy, searched_values, invalid_val=None).tolist()
Out[36]: [4, None, 1, 5, 8]
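As a cross-check on the searchsorted approach above, the same lookup can be sketched with a broadcasted comparison. This is an illustrative alternative rather than part of the original answer: the helper name find_indexes_broadcast is hypothetical, and it assumes the values in ar are unique (with duplicates it reports the first match). It builds a boolean matrix of shape len(ar) x len(searched_values), so it only suits modest input sizes.

```python
import numpy as np

def find_indexes_broadcast(ar, searched_values, invalid_val=-1):
    # Hypothetical helper: compare every element of ar with every searched value.
    ar = np.asarray(ar)
    searched_values = np.asarray(searched_values)
    matches = ar[:, None] == searched_values[None, :]
    # argmax gives the first True row per column (0 when the column has no True),
    # so columns without any match are masked with invalid_val.
    first_hit = matches.argmax(axis=0)
    return np.where(matches.any(axis=0), first_hit, invalid_val)

mynumpy = np.array([5, 6, 9, 2, 99, 3, 88, 4, 7])
searched_values = np.array([99, 55, 6, 3, 7])
result = find_indexes_broadcast(mynumpy, searched_values)
print(result)
```

For large inputs the sort-based version is likely the better choice, since it avoids the quadratic-size intermediate matrix.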

Related

Numpy: How to subtract every other element in array

I have the following numpy array
u = np.array([a1,b1,a2,b2...,an,bn])
where I would like to subtract the a and b elements from each other and end up with a numpy array:
u_result = np.array([(a2-a1),(b2-b1),(a3-a2),(b3-b2),....,(an-a_(n-1)),(bn-b_(n-1))])
How can I do this without too much array splitting and for loops? I'm using this in a larger loop so ideally, I would like to do this efficiently (and learn something new)
(I hope the indexing of the resulting array is clear)
Or simply perform a subtraction:
u = np.array([3, 2, 5, 3, 7, 8, 12, 28])
u[2:] - u[:-2]
Output:
array([ 2, 1, 2, 5, 5, 20])
You can use ravel to rearrange the result in the order of your original vector.
Short answer:
u_r = np.ravel([np.diff(u[::2]),
np.diff(u[1::2])], 'F')
Here is a longer and more detailed explanation:
1) Separate a from b in u; this can be achieved by indexing.
2) Take the differences of a and of b; you can use np.diff to keep the code simple.
3) Ravel the differenced values back together.
#------- Create u---------------
import numpy as np
a_aux = np.array([50,49,47,43,39,34,28])
b_aux = np.array([1,2,3,4,5,6,7])
u = np.ravel([a_aux,b_aux],'F')
print(u)
#-------------------------------
#1)
# get a as elements with index 0, 2, 4 ....
a = u[::2]
b = u[1::2] #get b as 1,3,5,....
#2)
#differentiate
ad = np.diff(a)
bd = np.diff(b)
#3)
#ravel, interleaving one element from each array
u_result = np.ravel([ad,bd],'F')
print(u_result)
You can try it this way. First, split the a and b elements using array[::2] and array[1::2]. Then subtract a from b (array[1::2] - array[::2]).
import numpy as np
array = np.array([7,8,9,6,5,2])
u_result = np.array(array[1::2] - array[::2] )
print(u_result)
Looks like you need to use np.roll:
shift = 2
u = np.array([1, 11, 2, 12, 3, 13, 4, 14])
shifted_u = np.roll(u, -shift)
(shifted_u - u)[:-shift]
Returns:
array([1, 1, 1, 1, 1, 1])
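The slicing and ravel/diff answers above can be compared side by side, together with one more variant. The following is a minimal sketch, assuming u has even length and alternates a and b values as in the question; the reshape-based variant is an addition for illustration, not one of the original answers.

```python
import numpy as np

u = np.array([3, 2, 5, 3, 7, 8, 12, 28])

# Slicing: each element minus the one two positions earlier.
by_slicing = u[2:] - u[:-2]

# Reshape to rows of (a_i, b_i) and difference consecutive rows.
by_reshape = np.diff(u.reshape(-1, 2), axis=0).ravel()

# Split, difference each series, then interleave column-wise ('F' order).
by_ravel = np.ravel([np.diff(u[::2]), np.diff(u[1::2])], 'F')

print(by_slicing)
print(np.array_equal(by_slicing, by_reshape), np.array_equal(by_slicing, by_ravel))
```

All three produce the same array here; the slicing form needs no intermediate arrays beyond the two views.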

Update values in numpy array with other values in Python

Given the following array:
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
[[1 2 3]
[4 5 6]
[7 8 9]]
How can I replace certain values with other values?
bad_vals = [4, 2, 6]
update_vals = [11, 1, 8]
I currently use:
for idx, v in enumerate(bad_vals):
a[a==v] = update_vals[idx]
Which gives:
[[ 1 1 3]
[11 5 8]
[ 7 8 9]]
But it is rather slow for large arrays with many values to be replaced. Is there any good alternative?
The input array can be changed to anything (list of list/tuples) if this might be necessary to access certain speedy black magic.
EDIT:
Based on the great answers from @Divakar and @charlysotelo, I did a quick comparison for my real use-case data using the benchit package. My input data array has a ratio of roughly 100:1 (rows:columns), and the length of the array of replacement values is on the order of 3 x the number of rows.
Functions:
# current approach
def enumerate_values(a, bad_vals, update_vals):
    for idx, v in enumerate(bad_vals):
        a[a==v] = update_vals[idx]
    return a
# provided solution @Divakar
def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out
# provided solution @charlysotelo
def vectorize_values(a, bad_vals, update_vals):
    bad_to_good_map = {}
    for idx, bad_val in enumerate(bad_vals):
        bad_to_good_map[bad_val] = update_vals[idx]
    f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
    a = f(a)
    return a
# define benchit input functions
import benchit
funcs = [enumerate_values, map_values, vectorize_values]
# define benchit input variables to bench against
in_ = {
    n: (
        np.random.randint(0,n*10,(n,int(n * 0.01))), # array
        np.random.choice(n*10, n*3,replace=False), # bad_vals
        np.random.choice(n*10, n*3) # update_vals
    )
    for n in [300, 1000, 3000, 10000, 30000]
}
# do the bench
# note: timing the slow approaches (my own function here) takes a while
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, grid=False)
Here's one way based on the hinted mapping array method for positive numbers -
def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out
Sample run -
In [94]: a
Out[94]:
array([[1, 2, 1],
[4, 5, 6],
[7, 1, 1]])
In [95]: bad_vals
Out[95]: [4, 2, 6]
In [96]: update_vals
Out[96]: [11, 1, 8]
In [97]: map_values(a, bad_vals, update_vals)
Out[97]:
array([[ 1, 1, 1],
[11, 5, 8],
[ 7, 1, 1]])
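The mapping-array method above is described as being for positive numbers, because values are used directly as indices. For negative, very large, or otherwise unsuitable values, one possible alternative is a sorted lookup with np.searchsorted. The sketch below is not from the original answers; the function name replace_values_sorted is hypothetical, and it assumes bad_vals contains no duplicates. Note that, like the mapping array, it replaces all values simultaneously, whereas the loop in the question applies replacements one after another.

```python
import numpy as np

def replace_values_sorted(a, bad_vals, update_vals):
    # Hypothetical helper: look up each element of a in the sorted bad values.
    a = np.asarray(a)
    bad_vals = np.asarray(bad_vals)
    update_vals = np.asarray(update_vals)
    order = np.argsort(bad_vals)
    sorted_bad = bad_vals[order]
    sorted_upd = update_vals[order]
    # Insertion position of every element; same shape as a.
    pos = np.searchsorted(sorted_bad, a)
    # Positions past the end cannot match, so point them at a valid slot.
    pos[pos == len(sorted_bad)] = 0
    hit = sorted_bad[pos] == a
    # Take the replacement where the lookup matched, else keep the original.
    return np.where(hit, sorted_upd[pos], a)

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
out = replace_values_sorted(a, [4, 2, 6], [11, 1, 8])
print(out)
```

The memory cost depends on the number of bad values rather than on the largest value in the array, which is the main trade-off against the mapping-array approach.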
Benchmarking
# Original soln
def replacevals(a, bad_vals, update_vals):
    out = a.copy()
    for idx, v in enumerate(bad_vals):
        out[out==v] = update_vals[idx]
    return out
The given sample had a 2D input of shape n x n with n values to be replaced. Let's set up input datasets with the same structure.
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
import benchit
funcs = [replacevals, map_values]
in_ = {n:(np.random.randint(0,n*10,(n,n)),np.random.choice(n*10,n,replace=False),np.random.choice(n*10,n)) for n in [3,10,100,1000,2000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, save='timings.png')
Plot :
This really depends on the size of your array, and the size of your mappings from bad to good integers.
For a larger number of bad-to-good mappings, the method below is better:
import numpy as np
import time
ARRAY_ROWS = 10000
ARRAY_COLS = 1000
NUM_MAPPINGS = 10000
bad_vals = np.random.rand(NUM_MAPPINGS)
update_vals = np.random.rand(NUM_MAPPINGS)
bad_to_good_map = {}
for idx, bad_val in enumerate(bad_vals):
    bad_to_good_map[bad_val] = update_vals[idx]
# np.vectorize with mapping
# Takes about 4 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
print(time.time())
a = f(a)
print(time.time())
# Your way
# Takes about 60 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
print(time.time())
for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]
print(time.time())
Running the code above, the np.vectorize(lambda) way took less than 4 seconds to finish, whereas your way took almost 60 seconds. However, with NUM_MAPPINGS set to 100, your method takes less than a second for me, which is faster than the 2 seconds for the np.vectorize way.

How to create a list from a dictionary that contains numpy array as value

I have a dictionary like this
{0: array([-6139.66579119, -8102.82498701, -8424.43378713, -8699.96492463,
-9411.35741859]),
1: array([ -7679.11144698, -16699.49166421, -3057.05148494, -10657.0539235 ,
-3091.04936367]),
2: array([ -7316.47405724, -15367.98445067, -6660.88963907, -9634.54357714,
-6667.05832509]),
3: array([-7609.14675848, -9894.14708559, -4040.51364199, -8661.16152946,
-4363.71589143]),
4: array([-5068.85919923, -6691.36104136, -6659.66791024, -6666.66570889,
-5365.35153533]),
5: array([ -8341.96211464, -13495.42783124, -4782.52084352, -10355.98002 ,
-5424.48813488]),
6: array([ -7740.36341878, -16165.48430318, -5169.42471878, -12369.79859385,
-5807.66380805]),
7: array([-10645.12432969, -5465.30533986, -6756.65159092, -4146.34937333,
-6765.69595854]),
8: array([ -7765.04423986, -11679.3889257 , -4218.9629257 , -6565.64225892,
-4538.09199979]),
9: array([-5869.18259848, -7809.21110907, -3272.33611955, -3881.64743889,
-3275.54657818])}
What I want to do is:
Compare the first value in each array (in this case -6139, -7679, ...), find the max value (-5068), then return its key, 4, in a list.
Compare the second value in each array (-8102, -16699, ...), find the max, and append its key to the list.
How can I do that?
My code is like this:
def predict(trainingData, testData):
    pred = {}
    maxLabel = None
    prediction=[]
    maxValue = -9999999999
    pred = postProb(trainingData, testData)
    for key, value in pred.items():
        for i in range(value.shape[0]):
            for j in range(10):
                if pred[j][i] > maxValue:
                    maxValue = pred[key][i]
                    maxLabel = key
            prediction.append(maxLabel)
    return prediction
pred is the dictionary. It seems that the first loop is not necessary but I need it to get through the elements in the dictionary
You can use the NumPy array's argmax method to get what you want (my_dict is your dictionary).
np.array(list(my_dict.values())).argmax(axis=0)
Out: array([4, 7, 1, 9, 1])
This works only if your keys are consecutive integers starting at 0, as in your example. If you want a more foolproof method, you could use pandas.
import pandas as pd
df = pd.DataFrame(my_dict)
my_list = list(df.idxmax(axis=1))
print(my_list)
Out: [4, 7, 1, 9, 1]
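If pandas is not available, the argmax result can also be mapped back to arbitrary keys with plain NumPy. This is a minimal sketch under assumptions: the dictionary scores and its values are made up for illustration, and all arrays are assumed to have the same length.

```python
import numpy as np

# Hypothetical dictionary with non-consecutive keys, for illustration only.
scores = {
    10: np.array([-3.0, -9.0, -1.0]),
    25: np.array([-2.0, -8.0, -5.0]),
    40: np.array([-7.0, -4.0, -6.0]),
}

keys = np.array(list(scores.keys()))
stacked = np.stack(list(scores.values()))   # shape: (n_keys, n_values)
best_keys = keys[stacked.argmax(axis=0)]    # key holding the max in each column
print(best_keys.tolist())
```

Indexing the key array with the argmax positions is what makes this independent of the keys being 0, 1, 2, and so on.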

numpy.searchsorted for multiple instances of the same entry - python

I have the following variables:
import numpy as np
gens = np.array([2, 1, 2, 1, 0, 1, 2, 1, 2])
p = [0,1]
I want to return the entries of gens that match each element of p.
So ideally I would like it to return:
result = [[4],[1,3,5,7],[0,2,6,8]]
#[[where matched 0], [where matched 1], [the rest]]
--
My attempts so far only work with one variable:
indx = gens.argsort()
res = np.searchsorted(gens[indx], [0])
indx[res] #gives 4, which is the position of 0
But when I try with
indx = gens.argsort()
res = np.searchsorted(gens[indx], [1])
indx[res] #gives 1, which is the position of the first 1.
So:
How can I search for an entry that has multiple occurrences?
How can I search for multiple entries, each of which has multiple occurrences?
You can use np.where
>>> np.where(gens == p[0])[0]
array([4])
>>> np.where(gens == p[1])[0]
array([1, 3, 5, 7])
>>> np.where((gens != p[0]) & (gens != p[1]))[0]
array([0, 2, 6, 8])
Or np.in1d and np.nonzero
>>> np.nonzero(np.in1d(gens, p[0]))[0]
>>> np.nonzero(np.in1d(gens, p[1]))[0]
>>> np.nonzero(~np.in1d(gens, p))[0]
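Since the question asks about np.searchsorted specifically, here is a hedged sketch of how the sort-based route could return every occurrence of each searched value. It is not from the original answer: it uses a stable argsort so that positions within each group stay in ascending order, and np.isin for the remaining entries.

```python
import numpy as np

gens = np.array([2, 1, 2, 1, 0, 1, 2, 1, 2])
p = [0, 1]

# Stable sort keeps the original positions in ascending order within each value.
order = np.argsort(gens, kind="stable")
sorted_gens = gens[order]

# For each searched value, the run of equal entries is [left, right).
left = np.searchsorted(sorted_gens, p, side="left")
right = np.searchsorted(sorted_gens, p, side="right")
matches = [order[l:r].tolist() for l, r in zip(left, right)]

# Everything that matches none of the searched values.
rest = np.nonzero(~np.isin(gens, p))[0].tolist()

result = matches + [rest]
print(result)
```

A value that does not occur in gens simply yields an empty list, because left and right coincide for it.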

Get index as int in pandas dataframe

I have a pandas dataframe which is indexed by strings. Let's say my index looks like df.index = ['AA','AB','AC',...] and I want to access df.loc['AC':'AE'], which works well.
Is there any way to get the position of these indices, giving me ['AC':'AE'] => [2,3,4]? I know there is df.index.get_loc('AC') => 2 but this works only for single values and not for lists.
Use:
df = pd.DataFrame({'a': [5,6,7,8, 10]}, index=['AA','AB','AC','AD','AE'])
pos = list(range(df.index.get_loc('AC'), df.index.get_loc('AE') + 1))
print (pos)
[2, 3, 4]
Another solution with Index.searchsorted:
pos = list(range(df.index.searchsorted('AC'), df.index.searchsorted('AE') + 1))
print (pos)
[2, 3, 4]
a = df.index.searchsorted(['AC', 'AE'])
pos = list(range(a[0], a[1] + 1))
print (pos)
[2, 3, 4]
You can define a function to extract the integer range:
df = pd.DataFrame(np.arange(7), index=['AA','AB','AC','AD','AE','AF','AG'])
def return_index(df, a, b):
    col_map = df.index.get_loc
    return np.arange(col_map(a), col_map(b)+1)
res = return_index(df, 'AC', 'AE')
print(res)
[2 3 4]
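Two built-in Index methods may cover the same ground as the helpers above. The sketch below assumes a unique index (get_indexer requires one) and that both slice endpoints exist in it; it is an illustration, not part of the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 6, 7, 8, 10]}, index=['AA', 'AB', 'AC', 'AD', 'AE'])

# Positions for an explicit list of labels (-1 marks a label that is absent).
pos_list = df.index.get_indexer(['AC', 'AD', 'AE'])

# Positions for a label range, inclusive of both ends like df.loc['AC':'AE'].
label_slice = df.index.slice_indexer('AC', 'AE')
pos_range = np.arange(len(df.index))[label_slice]

print(pos_list.tolist(), pos_range.tolist())
```

get_indexer answers the "list of labels" case directly, while slice_indexer mirrors the label-slice semantics of df.loc.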
