How to update a matrix based on multiple maximum values per row? - python

I am a newbie to Python. I have an NxN matrix and I want to know the maximum value of each row. Next, I want to nullify (set to zero) all values other than this maximum. If a row contains multiple maximum values, all of them should be preserved.
Using a DataFrame, I tried to get the maximum of each row. Then I tried to get the indices of these max values. The code is given below.
import pandas as pd

matrix = [(22, 16, 23),
          (12, 6, 43),
          (24, 67, 11),
          (87, 9, 11),
          (66, 36, 66)]
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))
maxValuesObj = dfObj.max(axis=1)
maxValueIndexObj = dfObj.idxmax(axis=1)
The above code doesn't consider multiple maximum values. Only the first occurrence is returned.
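For example, for row e = (66, 36, 66), both x and z hold the maximum, yet the code above reports only the first hit:
print(maxValueIndexObj['e'])  # 'x', the tie in column z is lost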
Also, I am stuck on how to update the matrix accordingly. My expected output is:
matrix = [(0, 0, 23),
          (0, 0, 43),
          (0, 67, 0),
          (87, 0, 0),
          (66, 0, 66)]
Can you please help me sort this out?

Using df.where():
dfObj.where(dfObj.eq(dfObj.max(1), axis=0), 0)
    x   y   z
a   0   0  23
b   0   0  43
c   0  67   0
d  87   0   0
e  66   0  66
For a NumPy array instead of a DataFrame, call .values on the result:
dfObj.where(dfObj.eq(dfObj.max(1), axis=0), 0).values
Or, better, use to_numpy():
dfObj.where(dfObj.eq(dfObj.max(1), axis=0), 0).to_numpy()
Or use np.where:
np.where(dfObj.eq(dfObj.max(1), axis=0), dfObj, 0)
array([[ 0,  0, 23],
       [ 0,  0, 43],
       [ 0, 67,  0],
       [87,  0,  0],
       [66,  0, 66]], dtype=int64)
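If the goal is to overwrite dfObj itself rather than just display the result, a minimal sketch (assuming you want to keep working with the DataFrame) is simply to assign it back:
dfObj = dfObj.where(dfObj.eq(dfObj.max(1), axis=0), 0)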

I'll show how to do it with Python built-ins instead of pandas, since you're new to Python and should know how to do it outside of pandas (and the pandas syntax isn't as clean).
matrix = [(22, 16, 23),
          (12, 6, 43),
          (24, 67, 11),
          (87, 9, 11),
          (66, 36, 66)]

new_matrix = []
for row in matrix:
    row_max = max(row)
    new_row = tuple(element if element == row_max else 0 for element in row)
    new_matrix.append(new_row)
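Printing new_matrix (a quick check, assuming the matrix above) reproduces the expected output:
print(new_matrix)
# [(0, 0, 23), (0, 0, 43), (0, 67, 0), (87, 0, 0), (66, 0, 66)]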

You can do this with a short for loop pretty easily:
import numpy as np
matrix = np.array([(22, 16, 23), (12, 6, 43), (24, 67, 11), (87, 9, 11), (66, 36, 66)])
for i in range(len(matrix)):
    matrix[i] = [x if x == max(matrix[i]) else 0 for x in matrix[i]]
print(matrix)
output:
[[ 0  0 23]
 [ 0  0 43]
 [ 0 67  0]
 [87  0  0]
 [66  0 66]]
I would also use NumPy, not pandas, for matrices.
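If you want to drop the Python-level loop entirely, a vectorized sketch (assuming the same matrix array as above, before the loop modifies it in place) compares each element with its row maximum, so ties are preserved automatically:
# keepdims keeps the row maxima as a column vector, so the comparison broadcasts across each row
result = np.where(matrix == matrix.max(axis=1, keepdims=True), matrix, 0)
print(result)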

This isn't the most performant solution, but you can write a function for the row operation then apply it to each row:
def max_row(row):
    row.loc[row != row.max()] = 0
    return row

dfObj.apply(max_row, axis=1)
Out[17]:
    x   y   z
a   0   0  23
b   0   0  43
c   0  67   0
d  87   0   0
e  66   0  66
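Note that apply returns a new DataFrame, so assign the result back if you want to keep it (a small usage sketch, assuming the dfObj from the question):
dfObj = dfObj.apply(max_row, axis=1)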

Related

How can I replace pd intervals with integers in python

How can I replace pd intervals with integers
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
output:
   age age_bands
0   43  (40, 50]
1   76  (70, 80]
2   27  (20, 30]
3    8   (0, 10]
4   57  (50, 60]
5   32  (30, 40]
6   12  (10, 20]
7   22  (20, 30]
Now I want to add another column that replaces the bands with a single number (int), but I could not get it to work.
For example, this did not work:
df['age_code']= df['age_bands'].replace({'(40, 50]':4})
How can I get a column that looks like this?
  age_bands  age_code
0  (40, 50]         4
1  (70, 80]         7
2  (20, 30]         2
3   (0, 10]         0
4  (50, 60]         5
5  (30, 40]         3
6  (10, 20]         1
7  (20, 30]         2
Assuming you want the first digit of every interval, you can use apply to achieve this as follows:
df["age_code"] = df["age_bands"].apply(lambda band: str(band)[1])
However, note that this may not be very efficient for a large DataFrame.
To convert the column values to an integer dtype, you can use pd.to_numeric:
df["age_code"] = pd.to_numeric(df['age_code'])
As the column contains pd.Interval objects, use their left property:
df['age_code'] = df['age_bands'].apply(lambda interval: interval.left // 10)
You can do that by simply adding a second pd.cut and defining the labels argument.
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands'] = pd.cut(df['age'], bins=age_band, ordered=True)
# This is the part of the code you need to add
age_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df['age_code'] = pd.cut(df['age'], bins=age_band, labels=age_labels, ordered=True)
>>> print(df)
You can create a dictionary of bins and map it to the age_bands column:
bins_sorted = sorted(pd.cut(df['age'], bins=age_band, ordered=True).unique())
bins_dict = {key: idx for idx, key in enumerate(bins_sorted)}
df['age_code'] = df.age_bands.map(bins_dict).astype(int)
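One caveat (my reading of the intended output): the dictionary is built only from intervals that actually occur in the data, so an empty band shifts the later codes; here (70, 80] would map to 6, not 7, because (60, 70] never appears. A sketch that builds the dictionary from the full set of bin edges avoids that:
# Sketch: enumerate every band defined by age_band, not just the ones present in df
full_bins = pd.IntervalIndex.from_breaks(age_band)
bins_dict = {interval: idx for idx, interval in enumerate(full_bins)}
df['age_code'] = df['age_bands'].map(bins_dict).astype(int)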

How to develop an algorithm in Python to make specific pairs out of values in a numpy array

I want to automatically create some pairs based on data stored as numpy arrays. The numbers in my first array are the numbers of some lines. I want to connect the lines and create surfaces using the created pairs. This is the array of lines:
import numpy as np
line_no = np.arange(17, 25)
These lines run in two perpendicular directions. I uploaded a fig to show it (they are drawn in blue and red). I know where the direction of my lines changes and call it sep.
sep = 20
Another piece of data that should be usable is the number of points creating the lines. I call it rep.
rep = np.array([3, 3, 1])
Then I used the following code to achieve my goal, but it is not correct:
start = line_no[0]
N_x = len(rep) - 1
N_y = max(rep) - 1
grid = np.zeros((N_y + 1, N_x + 1, 2))
kxs = [0] + [min(rep[i], rep[i+1]) for i in range(len(rep)-1)]
T_x = sum(kxs)
T_y = sum(rep) - len(rep)
T_total = T_x + T_y
lines = np.arange(start, start + T_total)
lines_before = 0
for i in range(N_x):
    for j in range(N_y, -1, -1):
        if j >= kxs[i+1]:
            continue
        grid[j, i, 0] = lines[lines_before]
        lines_before += 1
for i in range(N_x + 1):
    for j in range(N_y - 1, -1, -1):
        if j < rep[i] - 1:
            grid[j, i, 1] = lines[lines_before]
            lines_before += 1
joints = np.array([])
for i in range(N_x - 1):
    for j in range(N_y - 1):
        square = np.append(grid[j:j+2, i, 0], grid[j, i:i+2, 1])
        if all(square):
            new_joints = square
            joints = np.append(new_joints, joints)
In my fig I have two scenarios: A (rep = np.array([3,3,1])) and B (rep = np.array([1,3,3])). For A I want to have the following pairs:
17, 21, 18, 23
18, 22, 19, 24
And for B:
18, 21, 19, 23
19, 22, 20, 24
In reality the distribution of my lines can change. For example, in scenario A the last line does not create any surface, and in B the first one is not part of any surface; in some cases I may have several lines that are not part of any surface. For example, I may have another red line below line number 21 which does not make any surface. Thanks for paying attention to my problem. I do appreciate any help in advance.
More complicated cases are shown below. In scenario C I have:
line_no = np.arange(17, 42)
sep = 29
rep = np.array([5, 4, 4, 2, 2, 1])
In scenario D I have:
line_no = np.arange(17, 33)
sep = 24
rep = np.array([1, 3, 4, 4])
Sorry, but I couldn't work through your implementation. A tip for next time: please try to comment your code; it helps.
Anyway, here is a somewhat readable implementation that gets the job done. But I advise you to check more scenarios to verify the script's validity before drawing any conclusions.
import numpy as np

line_no = np.arange(17, 25)
sep = 20  # this information is redundant for the problem
nodes = np.array([1, 3, 4, 4])

# for a generalised implementation hlines start from 0 and vlines start where hlines end
# the offset parameter can be used to change the origin or start number of hlines and hence changes the vlines also
offset = 17

# calculate the number of horizontal lines and vertical lines in sequence
hlines = np.array([min(a, b) for a, b in zip(nodes[:-1], nodes[1:])])
# vlines = np.array([max(a, b) - 1 for a, b in zip(nodes[:-1], nodes[1:])])
vlines = nodes - 1
print(f"hlines: {hlines}, vlines: {vlines}")
# nodes = np.array([3, 3, 1]) ---> hlines: [3, 1], vlines: [2, 2]
# nodes = np.array([1, 3, 3]) ---> hlines: [1, 3], vlines: [2, 2]

hlines_no = list(range(sum(hlines)))
vlines_no = list(range(sum(hlines), sum(hlines) + sum(vlines)))
print(f"hlines numbers: {hlines_no}, vlines numbers: {vlines_no}")
# nodes = np.array([3, 3, 1]) ---> hlines numbers: [0, 1, 2, 3], vlines numbers: [4, 5, 6, 7]
# nodes = np.array([1, 3, 3]) ---> hlines numbers: [0, 1, 2, 3], vlines numbers: [4, 5, 6, 7]

cells = []  # to store complete cell tuples
hidx = 0  # to keep track of the horizontal lines index
vidx = 0  # to keep track of the vertical lines index
previous_cells = 0
current_cells = 0
for LN, RN in zip(nodes[:-1], nodes[1:]):
    # if either the left or right side node is equal to 1, only one horizontal line exists
    # and the horizontal index is updated
    if LN == 1 or RN == 1:
        hidx += 1
    else:
        # to handle just a blank vertical line
        if LN - RN == 1:
            vidx += 1
        # iterate 'cell' number of times
        # the number of cells is always 1 less than the minimum of the left and right side nodes
        current_cells = min(LN, RN) - 1
        if previous_cells != 0 and previous_cells > current_cells:
            vidx += previous_cells - current_cells
        for C in range(current_cells):
            cell = (offset + hlines_no[hidx],
                    offset + vlines_no[vidx],
                    offset + hlines_no[hidx + 1],
                    offset + vlines_no[vidx + current_cells])
            hidx += 1
            vidx += 1
            cells.append(cell)
        # skip the last horizontal line in a column
        hidx += 1
        previous_cells = min(LN, RN) - 1
print(cells)
Results
# nodes = np.array([3, 3, 1]) ---> [(17, 21, 18, 23), (18, 22, 19, 24)]
# nodes = np.array([1, 3, 3]) ---> [(18, 21, 19, 23), (19, 22, 20, 24)]
# nodes = np.array([5,4,4,2,2,1]) ---> [(17, 31, 18, 34),
# (18, 32, 19, 35),
# (19, 33, 20, 36),
# (21, 34, 22, 37),
# (22, 35, 23, 38),
# (23, 36, 24, 39),
# (25, 39, 26, 40),
# (27, 40, 28, 41)]
# nodes = np.array([1,3,4,4]) ---> [(18, 25, 19, 27),
# (19, 26, 20, 28),
# (21, 27, 22, 30),
# (22, 28, 23, 31),
# (23, 29, 24, 32)]
Edit: Updated the code to account for the special case scenarios
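If you need the result as a flat numpy array similar to the joints variable in the question (my assumption about the desired format), the list of tuples converts directly:
# each tuple becomes a row of four line numbers; ravel() flattens it if a 1-D array is preferred
joints = np.array(cells).ravel()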

Pandas - column median applied on lambda function

Given the dataset:
import pandas as pd

matrix = [(222, 34, 23),
          (333, 31, 11),
          (444, 16, 21),
          (555, 32, 22),
          (666, 33, 27),
          (777, 35, 11)]
dfObj = pd.DataFrame(matrix, columns=list('abc'))
I want to apply the formula (value - column median) ^ 2. I am trying to do this with lambda and functions, but I am not succeeding; the issue is getting the column median.
value = each cell.
How could I apply that function?
Edit
dfObj['d'] = dfObj['c'].apply(lambda x: math.pow(x, 2) / 10)  # requires import math
Is this what you need?
dfObj.div(dfObj.median())**2
Out[116]:
          a         b         c
0  0.197531  1.094438  1.144402
1  0.444444  0.909822  0.261763
2  0.790123  0.242367  0.954029
3  1.234568  0.969467  1.047052
4  1.777778  1.031006  1.577069
5  2.419753  1.159763  0.261763
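If you want the literal (value - column median) ** 2 rather than a ratio, a minimal sketch (assuming the same dfObj) subtracts the column medians before squaring:
# median() returns one value per column; sub() broadcasts it across the rows
result = dfObj.sub(dfObj.median()) ** 2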

extract all vertical slices from numpy array

I want to extract a complete slice from a 3D numpy array using ndenumerate or something similar.
import numpy as np
arr = np.random.rand(4, 3, 3)
I want to extract all possible arr[:, x, y] where x and y range from 0 to 2.
ndindex is a convenient way of generating the indices corresponding to a shape:
In [33]: arr = np.arange(36).reshape(4,3,3)
In [34]: for xy in np.ndindex((3,3)):
    ...:     print(xy, arr[:,xy[0],xy[1]])
    ...:
(0, 0) [ 0  9 18 27]
(0, 1) [ 1 10 19 28]
(0, 2) [ 2 11 20 29]
(1, 0) [ 3 12 21 30]
(1, 1) [ 4 13 22 31]
(1, 2) [ 5 14 23 32]
(2, 0) [ 6 15 24 33]
(2, 1) [ 7 16 25 34]
(2, 2) [ 8 17 26 35]
It uses nditer, but doesn't have any speed advantages over a nested pair of for loops.
In [35]: for x in range(3):
    ...:     for y in range(3):
    ...:         print((x,y), arr[:,x,y])
ndenumerate uses arr.flat as the iterator, but applied to a 2D slice it does the same thing:
In [38]: for xy, _ in np.ndenumerate(arr[0,:,:]):
    ...:     print(xy, arr[:,xy[0],xy[1]])
It iterates on the elements of a 3x3 subarray and, as with ndindex, generates the indices. The element itself won't be the size-4 array that you want, so I ignored it.
A different approach is to flatten the later axes, transpose, and then just iterate on the (new) first axis:
In [43]: list(arr.reshape(4,-1).T)
Out[43]:
[array([ 0, 9, 18, 27]),
array([ 1, 10, 19, 28]),
array([ 2, 11, 20, 29]),
array([ 3, 12, 21, 30]),
array([ 4, 13, 22, 31]),
array([ 5, 14, 23, 32]),
array([ 6, 15, 24, 33]),
array([ 7, 16, 25, 34]),
array([ 8, 17, 26, 35])]
or with the print as before:
In [45]: for a in arr.reshape(4,-1).T:print(a)
Why not just
[arr[:, x, y] for x in range(3) for y in range(3)]
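If you don't want to hard-code the trailing dimensions, a small variation (my sketch, assuming the same arr) reads them from the array's shape:
# Same list comprehension, sized from arr itself
slices = [arr[:, x, y] for x in range(arr.shape[1]) for y in range(arr.shape[2])]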

How to subset a `numpy.ndarray` where another one is max along some axis?

In python/numpy, how can I subset a multidimensional array where another one, of the same shape, is maximum along some axis (e.g. the first one)?
Suppose I have two 3*2*4 arrays, a and b. I want to obtain a 2*4 array containing the values of b at the locations where a has its maximal values along the first axis.
import numpy as np
np.random.seed(7)
a = np.random.rand(3*2*4).reshape((3,2,4))
b = np.random.rand(3*2*4).reshape((3,2,4))
print a
#[[[ 0.07630829 0.77991879 0.43840923 0.72346518]
# [ 0.97798951 0.53849587 0.50112046 0.07205113]]
#
# [[ 0.26843898 0.4998825 0.67923 0.80373904]
# [ 0.38094113 0.06593635 0.2881456 0.90959353]]
#
# [[ 0.21338535 0.45212396 0.93120602 0.02489923]
# [ 0.60054892 0.9501295 0.23030288 0.54848992]]]
print a.argmax(axis=0) #(I would like b at these locations along axis0)
#[[1 0 2 1]
# [0 2 0 1]]
I can do this really ugly manual subsetting:
index = zip(a.argmax(axis=0).flatten(),
            [0]*a.shape[2] + [1]*a.shape[2],  # a.shape[2] = 4 here
            range(a.shape[2]) + range(a.shape[2]))
# [(1, 0, 0), (0, 0, 1), (2, 0, 2), (1, 0, 3),
#  (0, 1, 0), (2, 1, 1), (0, 1, 2), (1, 1, 3)]
Which would allow me to obtain my desired result:
b_where_a_is_max_along0 = np.array([b[i] for i in index]).reshape(2,4)
# For verification:
print a.max(axis=0) == np.array([a[i] for i in index]).reshape(2,4)
#[[ True True True True]
# [ True True True True]]
What is the smart, numpy way to achieve this? Thanks :)
Use advanced-indexing -
m, n = a.shape[1:]
b_out = b[a.argmax(0), np.arange(m)[:,None], np.arange(n)]
Sample run -
Set up the input array a and get its argmax along the first axis -
In [185]: a = np.random.randint(11,99,(3,2,4))
In [186]: idx = a.argmax(0)
In [187]: idx
Out[187]:
array([[0, 2, 1, 2],
       [0, 1, 2, 0]])

In [188]: a
Out[188]:
array([[[49*, 58, 13, 69],   # * are the max positions
        [94*, 28, 55, 86*]],

       [[34, 17, 57*, 50],
        [48, 73*, 22, 80]],

       [[19, 89*, 42, 71*],
        [24, 12, 66*, 82]]])
Verify results with b -
In [193]: b
Out[193]:
array([[[18*, 72, 35, 51],   # Mark * at the same positions in b
        [74*, 57, 50, 84*]], # and verify

       [[58, 92, 53*, 65],
        [51, 95*, 43, 94]],

       [[85, 23*, 13, 17*],
        [17, 64, 35*, 91]]])
In [194]: b[a.argmax(0),np.arange(2)[:,None],np.arange(4)]
Out[194]:
array([[18, 23, 53, 17],
       [74, 95, 35, 84]])
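If you prefer not to build the index grids by hand, np.take_along_axis (a sketch of an alternative, not taken from the answers above) does the gather for you:
# argmax gives shape (2, 4); adding a leading axis lets take_along_axis gather along axis 0
idx = a.argmax(axis=0)
b_out = np.take_along_axis(b, idx[None, ...], axis=0)[0]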
You could use np.ogrid:
>>> x = np.random.random((2,3,4))
>>> x
array([[[ 0.87412737,  0.11069105,  0.86951092,  0.74895912],
        [ 0.48237622,  0.67502597,  0.11935148,  0.44133397],
        [ 0.65169681,  0.21843482,  0.52877862,  0.72662927]],

       [[ 0.48979028,  0.97103611,  0.36459645,  0.80723839],
        [ 0.90467511,  0.79118429,  0.31371856,  0.99443492],
        [ 0.96329039,  0.59534491,  0.15071331,  0.52409446]]])
>>> y = np.argmax(x, axis=1)
>>> y
array([[0, 1, 0, 0],
       [2, 0, 0, 1]])
>>> i, j = np.ogrid[:2,:4]
>>> x[i, y, j]
array([[ 0.87412737,  0.67502597,  0.86951092,  0.74895912],
       [ 0.96329039,  0.97103611,  0.36459645,  0.99443492]])
