Vectorization of lookup - python

I have an array of values and want to map each value to one from another array. The mapped value is the largest entry that is lower than or equal to the value (I assume one always exists).
For example, from the values [6, 15, 4, 12, 10, 5] and the lookup table [4, 6, 7, 8, 10, 12] I would print:
6 is between 6 and 7
15 is between 12 and None
4 is between 4 and 6
12 is between 12 and None
10 is between 10 and 12
5 is between 4 and 6
I do this like this:
import numpy as np
def last_smallest(values, limits):
    count = values.shape[0]
    value = np.zeros(count, dtype='int')
    for i in range(count):
        found = np.where(limits <= values[i])
        value[i] = found[-1][-1]
    return value
lookup_table = np.array([4, 6, 7, 8, 10, 12])
samples = np.array([6, 15, 4, 12, 10, 5])
result = last_smallest(samples, lookup_table)
for i, value in enumerate(samples):
    index = result[i]
    high = lookup_table[index+1] if index < lookup_table.shape[0] - 1 else None
    print(f'{value} is between {lookup_table[index]} and {high}')
This works, but the last_smallest function is not elegant. I've tried to vectorize it, without success.
Is it possible to replace result = last_smallest(samples, lookup_table) with pure NumPy array operations?

np.digitize can be used here:
lookup_table = np.array([4, 6, 7, 8, 10, 12])
samples = np.array([6, 15, 4, 12, 10, 5])
res = np.digitize(samples, lookup_table)
lookup_table = np.append(lookup_table, None) # you might want to change this line
for sample, idx in zip(samples, res):
    print(f'{sample} is between {lookup_table[idx-1]} and {lookup_table[idx]}')
Output:
6 is between 6 and 7
15 is between 12 and None
4 is between 4 and 6
12 is between 12 and None
10 is between 10 and 12
5 is between 4 and 6
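If only the index array is needed (the drop-in replacement for last_smallest), np.searchsorted is another option. The following is a minimal sketch, assuming the lookup table is sorted in increasing order; the names index and low are illustrative only.

```python
import numpy as np

lookup_table = np.array([4, 6, 7, 8, 10, 12])
samples = np.array([6, 15, 4, 12, 10, 5])

# side='right' returns the insertion point after any equal entry, so
# subtracting 1 gives the index of the largest limit <= sample.
index = np.searchsorted(lookup_table, samples, side='right') - 1
low = lookup_table[index]

print(index)  # same indices that last_smallest computes
print(low)
```

For an increasing table, np.digitize(samples, lookup_table) - 1 should give the same indices.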

Related

How to set some values with an interval in a vector to be another vector

I have a vector of size (1,16), for example x = [1,2,3,4,...,16], and another vector y = [1,2,3,4] of size (1,4).
I want to set every fourth value of x to the values of y, that is, the equivalent of x(1:4:16) = y. How can I do that in Python?
The expected output is x = [1 2 3 4 2 6 7 8 3 10 11 12 4 14 15 16].
Try using slice assignment:
x[::len(y)] = y
And now:
print(x)
Will give:
[1, 2, 3, 4, 2, 6, 7, 8, 3, 10, 11, 12, 4, 14, 15, 16]
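The same assignment works on NumPy arrays. This is a small sketch, assuming len(x) is an exact multiple of len(y); the variable step is introduced here for illustration. In the example above the step happens to equal len(y) because 16 / 4 = 4, but in general the step is len(x) // len(y).

```python
import numpy as np

x = np.arange(1, 17)          # 1..16
y = np.array([1, 2, 3, 4])

# Distance between the positions that receive values from y.
step = len(x) // len(y)

# Strided slice assignment: positions 0, 4, 8, 12 are overwritten.
x[::step] = y
print(x)
```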

Is it okay to use lambda in this case?

I'm trying to figure out a way to loop over a pandas DataFrame to generate a new key.
Here's an example of the dataframe:
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
And now I want to create a new key that takes the first and last numbers of df["range"] and fills the list with the numbers in between (i.e., the first one will be [1 2 3 4 5 6 7 8 9 10]).
So far I think that I should be using something like this, but I could be completely wrong:
df["total"] = df["range"].map(lambda x: #and here I should append all the "x" that are between df["range"][0] and df["range"][1]
Here's an example of the result that I'm looking for:
  pdb  beg  end  range  total
0   a    1   10   1 10  1 2 3 4 5 6 7 8 9 10
1   b    2   11   2 11  2 3 4 5 6 7 8 9 10 11
I could use some help with the lambda function; I get really confused by the syntax.
Try with apply
df['new'] = df.apply(lambda x : list(range(x['beg'],x['end']+1)),axis=1)
Out[423]:
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1 [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
dtype: object
This should work:
df['total'] = df['range'].apply(lambda x: [n for n in range(x[0], x[1]+1)])
As per your output, you need
In [18]: df['new'] = df.apply(lambda x : " ".join(list(map(str,range(x['beg'],x['end']+1)))),axis=1)
In [19]: df
Out[19]:
  pdb  beg  end    range                    new
0   a    1   10  [1, 10]   1 2 3 4 5 6 7 8 9 10
1   b    2   11  [2, 11]  2 3 4 5 6 7 8 9 10 11
If you want to use iterrows then you can do it in the loop itself as follows:
Code:
import pandas as pd
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
    df['total'] = [range(*x) for x in zip(df['beg'], df['end'])]
Output:
  pdb  beg  end    range                         total
0   a    1   10  [1, 10]   (1, 2, 3, 4, 5, 6, 7, 8, 9)
1   b    2   11  [2, 11]  (2, 3, 4, 5, 6, 7, 8, 9, 10)
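Note that range(*x) stops one short of end, which is why the output above ends at 9 and 10. A minimal sketch of a variant that includes the end value and needs no iterrows loop, assuming beg and end hold integers:

```python
import pandas as pd

df = pd.DataFrame({"pdb": ["a", "b"], "beg": [1, 2], "end": [10, 11]})

# Build each list directly from the beg/end columns;
# the +1 makes the end value inclusive.
df["total"] = [list(range(b, e + 1)) for b, e in zip(df["beg"], df["end"])]
print(df)
```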

non fixed rolling window

I am looking to implement a rolling window on a list, but instead of a fixed window length, I would like to provide a list of window lengths, one per element:
Something like this:
l1 = [5, 3, 8, 2, 10, 12, 13, 15, 22, 28]
l2 = [1, 2, 2, 2, 3, 4, 2, 3, 5, 3]
get_custom_roling( l1, l2, np.average)
and the result would be:
[5, 4, 5.5, 5, 6.67, ....]
6.67 is calculated as average of 3 elements 10, 2, 8.
I implemented a slow solution, and any idea to make it quicker is welcome :):
import numpy as np
import pandas as pd
def get_the_list(end_point, number_points):
    """
    example: get_the_list(6, 3) ==> [6, 5, 4]
    example: get_the_list(9, 5) ==> [9, 8, 7, 6, 5]
    """
    if np.isnan(number_points):
        return []
    number_points = int( number_points)
    return list(range(end_point, end_point - number_points, -1 ))
def get_idx(s):
    ss = list(enumerate(s) )
    sss = (get_the_list(*elem) for elem in ss )
    return sss
def get_custom_roling(s, ss, funct):
    # s is expected to be a pandas Series (s.index is used below)
    output_get_idx = get_idx(ss)
    agg_stuff = [s[elem] for elem in output_get_idx]
    res_agg_stuff = [ funct(elem) for elem in agg_stuff ]
    res_agg_stuff = pd.Series(data=res_agg_stuff, index = s.index)
    return res_agg_stuff
pandas supports custom rolling windows through a custom indexer, which lets you vary the window size per row.
Simple explanation: the start and end arrays hold the index bounds used to slice your data for each window.
#start = [0 0 1 2 2 2 5 5 4 7]
#end = [1 2 3 4 5 6 7 8 9 10]
Arguments passed to get_window_bounds are given by BaseIndexer.
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple
class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None,
                          step: Optional[int] = None,  # also passed by newer pandas versions
                          ) -> Tuple[np.ndarray, np.ndarray]:
        # Window i covers rows start[i]:end[i] (end is exclusive).
        end = np.arange(1, num_values+1, dtype=np.int64)
        start = end - np.array(self.custom_name_whatever, dtype=np.int64)
        return start, end
df = pd.DataFrame({"l1": [5, 3, 8, 2, 10, 12, 13, 15, 22, 28],
                   "l2": [1, 2, 2, 2, 3, 4, 2, 3, 5, 3]})
indexer = CustomIndexer(custom_name_whatever=df.l2)
df["variable_mean"] = df.l1.rolling(indexer).mean()
print(df)
Outputs:
   l1  l2  variable_mean
0   5   1       5.000000
1   3   2       4.000000
2   8   2       5.500000
3   2   2       5.000000
4  10   3       6.666667
5  12   4       8.000000
6  13   2      12.500000
7  15   3      13.333333
8  22   5      14.400000
9  28   3      21.666667
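For aggregations that can be written as a difference of prefix sums (mean or sum), the same start and end bounds can also be used without pandas. This is a sketch only, assuming every window length is at least 1; the names csum and variable_mean are illustrative.

```python
import numpy as np

l1 = np.array([5, 3, 8, 2, 10, 12, 13, 15, 22, 28], dtype=float)
l2 = np.array([1, 2, 2, 2, 3, 4, 2, 3, 5, 3])

# Window i covers l1[start[i]:end[i]]; clip start so it never goes below 0.
end = np.arange(1, len(l1) + 1)
start = np.maximum(end - l2, 0)

# csum[k] is the sum of the first k elements, so each window sum
# is a difference of two prefix sums.
csum = np.concatenate(([0.0], np.cumsum(l1)))
variable_mean = (csum[end] - csum[start]) / (end - start)
print(variable_mean)
```

This does not generalize to arbitrary functions such as a median; for those, the indexer approach above is the more flexible choice.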

What is the most efficient way of writing [1:20,25:30] in python

In MATLAB, we can write a list from 1 to 30 excluding 21-24 using [1:20,25:30].
What is the most efficient way of doing that in Python?
A second question: is there an efficient way to remove one element from a list, or one column from an ndarray, in Python? Is it the same as in MATLAB, simply setting A[:,1]=[] ?
In MATLAB/Octave
[1:20,25:30]
Three things happen: 1:20 and 25:30 each generate a matrix, and the [ ] joins them into one matrix.
>> [1:20,25:30]
ans =
Columns 1 through 16:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Columns 17 through 26:
17 18 19 20 25 26 27 28 29 30
>> A = 1:20;
>> B = 25:30;
>> [A, B]
ans =
Columns 1 through 16:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Columns 17 through 26:
17 18 19 20 25 26 27 28 29 30
The equivalent in numpy:
In [193]: A = np.arange(1,21);
In [194]: B = np.arange(25,31);
In [195]: np.concatenate((A,B))
Out[195]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 25, 26, 27, 28, 29, 30])
Other functions do the same thing (np.block, np.hstack, np.r_, etc.), but they all end up using concatenate. concatenate is the basic numpy function for joining arrays along one dimension or another.
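Of these, np.r_ is probably the closest in spelling to the MATLAB expression. A short sketch, keeping in mind that Python slices exclude the stop value:

```python
import numpy as np

# np.r_ turns slice notation into concatenated ranges; as with
# np.arange, the stop is exclusive, so 1:21 covers 1..20.
idx = np.r_[1:21, 25:31]
print(idx)
```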
In Python, you can remove elements from a list with a similar syntax:
In [201]: alist = list(range(10))
In [202]: alist
Out[202]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [203]: alist[3:6] = []
In [204]: alist
Out[204]: [0, 1, 2, 6, 7, 8, 9]
But that does not work with numpy arrays. They are fixed in size. The best you can do is create a new array without the selected parts. There is a np.delete that does it for you, but it's a convenience rather than a speed tool.
In [205]: arr = np.arange(10)
In [207]: np.delete(arr, slice(3,6))
Out[207]: array([0, 1, 2, 6, 7, 8, 9])
delete does various things depending on the removal object. I think in this case it will copy slices to a new array:
In [208]: res = np.zeros(10-3, arr.dtype)
In [209]: res[:3]=arr[:3]
In [210]: res[3:]=arr[6:]
In [211]: res
Out[211]: array([0, 1, 2, 6, 7, 8, 9])
or maybe just:
In [212]: np.concatenate([arr[:3], arr[6:]])
Out[212]: array([0, 1, 2, 6, 7, 8, 9])
Especially if the removal values are a list, rather than a slice, delete uses a mask:
In [213]: mask = np.ones(arr.shape, dtype=bool)
In [214]: mask[3:6]=0
In [215]: mask
Out[215]:
array([ True, True, True, False, False, False, True, True, True,
True])
In [216]: arr[mask]
Out[216]: array([0, 1, 2, 6, 7, 8, 9])
MATLAB may do some of these things faster by moving more of the action to compiled code. But the logic will, I expect, be similar.
The most efficient way in Python is only one character longer if you use a function with a one letter name:
l(1,20,25,30)
You need to define the function l somewhere in your library of utility routines:
def l(*args):
    pairs = args[:]
    res = []
    while len(pairs) > 1:
        res += range(pairs[0], pairs[1]+1)
        pairs = pairs[2:]
    assert len(pairs) == 0
    return res
The assert is there to make sure you hand in an even number of arguments.
The copy of args to pairs is there to make sure you don't accidentally modify a variable handed in to l, as in
mypairs = [1,20,25,30]
print(l(*mypairs))

Intersect multiple 2D np arrays for determining zones

Using this small reproducible example, I've so far been unable to generate a new integer array from 3 arrays that contains unique groupings across all three input arrays.
The arrays are related to topographic properties:
import numpy as np
asp = np.array([8,1,1,2,7,8,2,3,7,6,4,3,6,5,5,4]).reshape((4,4)) #aspect
slp = np.array([9,10,10,9,9,12,12,9,10,11,11,9,9,9,9,9]).reshape((4,4)) #slope
elv = np.array([13,14,14,13,14,15,16,14,14,15,16,14,13,14,14,13]).reshape((4,4)) #elevation
The idea is that the geographic contours are broken into 3 different properties using GIS routines:
1-8 for aspect (1=north facing, 2=northeast facing, etc.)
9-12 for slope (9=gentle slope...12=steepest slope)
13-16 for elevation (13=lowest elevations...16=highest elevations)
The small graphic below attempts to depict the kind of result I'm after (array shown in lower left). Note, the "answer" given in the graphic is but one possible answer. I'm not concerned about the final arrangement of integers in the resulting array so long as the final array contains an integer at each row/column index that identifies unique groupings.
For example, the array indexes at [0,1] and [0,2] have the same aspect, slope, and elevation and therefore receive the same integer identifier in the resulting array.
Does numpy have a built in routine for this kind of thing?
Each location in the grid is associated with a tuple composed of one value from
asp, slp and elv. For example, the upper left corner has tuple (8,9,13).
We would like to map this tuple to a number which uniquely identifies this tuple.
One way to do that would be to think of (8,9,13) as the index into the 3D array
np.arange(9*13*17).reshape(9,13,17). This particular array was chosen
to accommodate the largest values in asp, slp and elv:
In [107]: asp.max()+1
Out[107]: 9
In [108]: slp.max()+1
Out[108]: 13
In [110]: elv.max()+1
Out[110]: 17
Now we can map the tuple (8,9,13) to the number 1934:
In [113]: x = np.arange(9*13*17).reshape(9,13,17)
In [114]: x[8,9,13]
Out[114]: 1934
If we do this for each location in the grid, then we get a unique number for each location.
We could end right here, letting these unique numbers serve as labels.
Or, we can generate smaller integer labels (starting at 0 and increasing by 1)
by using np.unique with
return_inverse=True:
uniqs, labels = np.unique(vals, return_inverse=True)
labels = labels.reshape(vals.shape)
So, for example,
import numpy as np
asp = np.array([8,1,1,2,7,8,2,3,7,6,4,3,6,5,5,4]).reshape((4,4)) #aspect
slp = np.array([9,10,10,9,9,12,12,9,10,11,11,9,9,9,9,9]).reshape((4,4)) #slope
elv = np.array([13,14,14,13,14,15,16,14,14,15,16,14,13,14,14,13]).reshape((4,4)) #elevation
x = np.arange(9*13*17).reshape(9,13,17)
vals = x[asp, slp, elv]
uniqs, labels = np.unique(vals, return_inverse=True)
labels = labels.reshape(vals.shape)
yields
array([[11, 0, 0, 1],
[ 9, 12, 2, 3],
[10, 8, 5, 3],
[ 7, 6, 6, 4]])
The above method works fine as long as the values in asp, slp and elv are small integers. If the integers were too large, the product of their maximums could overflow the maximum allowable value one can pass to np.arange. Moreover, generating such a large array would be inefficient.
If the values were floats, then they could not be interpreted as indices into the 3D array x.
So to address these problems, use np.unique to convert the values in asp, slp and elv to unique integer labels first:
indices = [ np.unique(arr, return_inverse=True)[1].reshape(arr.shape) for arr in [asp, slp, elv] ]
M = np.array([item.max()+1 for item in indices])
x = np.arange(M.prod()).reshape(M)
vals = x[tuple(indices)]  # a tuple is needed for multi-axis indexing in recent NumPy
uniqs, labels = np.unique(vals, return_inverse=True)
labels = labels.reshape(vals.shape)
which yields the same result as shown above, but works even if asp, slp, elv were floats and/or large integers.
Finally, we can avoid the generation of np.arange:
x = np.arange(M.prod()).reshape(M)
vals = x[tuple(indices)]
by computing vals as a product of indices and strides:
M = np.r_[1, M[:-1]]
strides = M.cumprod()
indices = np.stack(indices, axis=-1)
vals = (indices * strides).sum(axis=-1)
So putting it all together:
import numpy as np
asp = np.array([8,1,1,2,7,8,2,3,7,6,4,3,6,5,5,4]).reshape((4,4)) #aspect
slp = np.array([9,10,10,9,9,12,12,9,10,11,11,9,9,9,9,9]).reshape((4,4)) #slope
elv = np.array([13,14,14,13,14,15,16,14,14,15,16,14,13,14,14,13]).reshape((4,4)) #elevation
def find_labels(*arrs):
    indices = [np.unique(arr, return_inverse=True)[1] for arr in arrs]
    M = np.array([item.max()+1 for item in indices])
    M = np.r_[1, M[:-1]]
    strides = M.cumprod()
    indices = np.stack(indices, axis=-1)
    vals = (indices * strides).sum(axis=-1)
    uniqs, labels = np.unique(vals, return_inverse=True)
    labels = labels.reshape(arrs[0].shape)
    return labels
print(find_labels(asp, slp, elv))
# [[ 3 7 7 0]
# [ 6 10 12 4]
# [ 8 9 11 4]
# [ 2 5 5 1]]
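A more compact variant of the same idea lets np.unique compare whole (asp, slp, elv) rows directly. This is a sketch, assuming a NumPy version in which np.unique accepts the axis argument; the names stacked and inverse are illustrative.

```python
import numpy as np

asp = np.array([8,1,1,2,7,8,2,3,7,6,4,3,6,5,5,4]).reshape((4,4))  # aspect
slp = np.array([9,10,10,9,9,12,12,9,10,11,11,9,9,9,9,9]).reshape((4,4))  # slope
elv = np.array([13,14,14,13,14,15,16,14,14,15,16,14,13,14,14,13]).reshape((4,4))  # elevation

# One row per grid cell, one column per property.
stacked = np.stack([asp, slp, elv], axis=-1).reshape(-1, 3)

# axis=0 treats each row as a single item; the inverse maps every
# cell to the index of its unique (asp, slp, elv) row.
_, inverse = np.unique(stacked, axis=0, return_inverse=True)
labels = inverse.reshape(asp.shape)
print(labels)
```

Because the unique rows are sorted lexicographically, the labels follow the ordering of the first np.arange-based result above rather than the stride-based one.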
This can be done using numpy.unique() and then a mapping like:
Code:
combined = 10000 * asp + 100 * slp + elv
unique = dict(((v, i + 1) for i, v in enumerate(np.unique(combined))))
combined_unique = np.vectorize(unique.get)(combined)
Test Code:
import numpy as np
asp = np.array([8, 1, 1, 2, 7, 8, 2, 3, 7, 6, 4, 3, 6, 5, 5, 4]).reshape((4, 4)) # aspect
slp = np.array([9, 10, 10, 9, 9, 12, 12, 9, 10, 11, 11, 9, 9, 9, 9, 9]).reshape((4, 4)) # slope
elv = np.array([13, 14, 14, 13, 14, 15, 16, 14, 14, 15, 16, 14, 13, 14, 14, 13]).reshape((4, 4))
combined = 10000 * asp + 100 * slp + elv
unique = dict(((v, i + 1) for i, v in enumerate(np.unique(combined))))
combined_unique = np.vectorize(unique.get)(combined)
print(combined_unique)
Results:
[[12 1 1 2]
[10 13 3 4]
[11 9 6 4]
[ 8 7 7 5]]
This seems like a similar problem to labeling unique regions in an image. This is a function I've written to do this, though you would first need to stack your three arrays into a single 3D array.
import numpy
import scipy.ndimage
def labelPix(pix):
    height, width, _ = pix.shape
    pixRows = numpy.reshape(pix, (height * width, 3))
    unique, counts = numpy.unique(pixRows, return_counts = True, axis = 0)
    unique = [list(elem) for elem in unique]
    labeledPix = numpy.zeros((height, width), dtype = int)
    offset = 0
    for index, zoneArray in enumerate(unique):
        index += offset
        zone = list(zoneArray)
        # cells whose three values all match this combination
        zoneArea = (pix == zone).all(-1)
        # split the matching cells into connected regions
        elementsArray, numElements = scipy.ndimage.label(zoneArea)
        elementsArray[elementsArray!=0] += offset
        labeledPix[elementsArray!=0] = elementsArray[elementsArray!=0]
        offset += numElements
    return labeledPix
This will label unique 3-value combinations, while also assigning separate labels to zones which have the same 3-value combination, but are not in contact with one another.
asp = numpy.array([8,1,1,2,7,8,2,3,7,6,4,3,6,5,5,4]).reshape((4,4)) #aspect
slp = numpy.array([9,10,10,9,9,12,12,9,10,11,11,9,9,9,9,9]).reshape((4,4)) #slope
elv = numpy.array([13,14,14,13,14,15,16,14,14,15,16,14,13,14,14,13]).reshape((4,4)) #elevation
pix = numpy.zeros((4,4,3))
pix[:,:,0] = asp
pix[:,:,1] = slp
pix[:,:,2] = elv
print(labelPix(pix))
returns:
[[ 0 1 1 2]
[10 12 3 4]
[11 9 6 4]
[ 8 7 7 5]]
Here's a plain Python technique using itertools.groupby. It requires the input to be 1D lists, but that shouldn't be a major issue. The strategy is to zip the lists together, along with an index number, then sort the resulting columns. We then group identical columns together, ignoring the index number when comparing columns. Then we gather the index numbers from each group, and use them to build the final output list.
from itertools import groupby
def show(label, seq):
    print(label, ' '.join(['{:2}'.format(u) for u in seq]))
asp = [8, 1, 1, 2, 7, 8, 2, 3, 7, 6, 4, 3, 6, 5, 5, 4]
slp = [9, 10, 10, 9, 9, 12, 12, 9, 10, 11, 11, 9, 9, 9, 9, 9]
elv = [13, 14, 14, 13, 14, 15, 16, 14, 14, 15, 16, 14, 13, 14, 14, 13]
size = len(asp)
a = sorted(zip(asp, slp, elv, range(size)))
groups = sorted([u[-1] for u in g] for _, g in groupby(a, key=lambda t:t[:-1]))
final = [0] * size
for i, g in enumerate(groups, 1):
    for j in g:
        final[j] = i
show('asp', asp)
show('slp', slp)
show('elv', elv)
show('out', final)
output
asp 8 1 1 2 7 8 2 3 7 6 4 3 6 5 5 4
slp 9 10 10 9 9 12 12 9 10 11 11 9 9 9 9 9
elv 13 14 14 13 14 15 16 14 14 15 16 14 13 14 14 13
out 1 2 2 3 4 5 6 7 8 9 10 7 11 12 12 13
There's no need to do that second sort, we could just use a plain list comp
groups = [[u[-1] for u in g] for _, g in groupby(a, key=lambda t:t[:-1])]
or generator expression
groups = ([u[-1] for u in g] for _, g in groupby(a, key=lambda t:t[:-1]))
I only did it so that my output matches the output in the question.
Here's one way to solve this problem using a dictionary based lookup.
from collections import defaultdict
import itertools
group_dict = defaultdict(list)
idx_count = 0
for a, s, e in np.nditer((asp, slp, elv)):
    asp_tuple = (a.tolist(), s.tolist(), e.tolist())
    if asp_tuple not in group_dict:
        group_dict[asp_tuple] = [idx_count+1]
        idx_count += 1
    else:
        group_dict[asp_tuple].append(group_dict[asp_tuple][-1])
list1d = list(itertools.chain(*list(group_dict.values())))
np.array(list1d).reshape(4, 4)
# result
array([[ 1, 2, 2, 3],
[ 4, 5, 6, 7],
[ 7, 8, 9, 10],
[11, 12, 12, 13]])
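One caveat: chaining group_dict.values() lists the labels grouped by tuple, in order of first appearance, so a combination that recurs later in the grid (here (3, 9, 14) at [1,3] and [2,3]) no longer lines up with its position. A minimal sketch of a position-preserving variant of the same dictionary idea follows; the names seen and labels are illustrative.

```python
import numpy as np

asp = np.array([8,1,1,2,7,8,2,3,7,6,4,3,6,5,5,4]).reshape((4,4))  # aspect
slp = np.array([9,10,10,9,9,12,12,9,10,11,11,9,9,9,9,9]).reshape((4,4))  # slope
elv = np.array([13,14,14,13,14,15,16,14,14,15,16,14,13,14,14,13]).reshape((4,4))  # elevation

labels = np.zeros(asp.shape, dtype=int)
seen = {}  # maps (asp, slp, elv) tuple -> label
for pos in np.ndindex(asp.shape):
    key = (int(asp[pos]), int(slp[pos]), int(elv[pos]))
    # A new combination gets the next unused label; a known one reuses its label.
    labels[pos] = seen.setdefault(key, len(seen) + 1)
print(labels)
```

Writing each label straight into its grid position keeps cells [1,3] and [2,3] on the same label.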
