groupby common values in two columns - python

I need to extract a common max value from pairs of rows that have common values in two columns.
The commonality is between values in columns A and B. Rows 0 and 1 are common, 2 and 3, and 4 is on its own.
import pandas as pd

f = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
                 columns=['A', 'B', 'Value'])
f
A B Value
0 1 2 30
1 2 1 20
2 2 6 15
3 6 2 70
4 7 10 35
The goal is to extract max values, so the end result is:
f_final = pd.DataFrame([[1, 2, 30, 30], [2, 1, 20, 30], [2, 6, 15, 70],
                        [6, 2, 70, 70], [7, 10, 35, 35]],
                       columns=['A', 'B', 'Value', 'Max'])
f_final
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
I could do this if there is a way to assign a common, non-repeating key:
f_key = pd.DataFrame([[1, 1, 2, 30], [1, 2, 1, 20], [2, 2, 6, 15],
                      [2, 6, 2, 70], [3, 7, 10, 35]],
                     columns=['key', 'A', 'B', 'Value'])
f_key
key A B Value
0 1 1 2 30
1 1 2 1 20
2 2 2 6 15
3 2 6 2 70
4 3 7 10 35
Following up with the groupby and transform:
f_key['Max'] = f_key.groupby(['key'])['Value'].transform('max')
f_key.drop(columns='key', inplace=True)
f_key
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
Question 1:
How would one assign this common key?
Question 2:
Is there a better way of doing this, skipping the common key step?
Cheers...

You could sort the values in columns A and B so that for each row the value in A is less than or equal to the value in B. Once the values are ordered, you can apply groupby-transform-max as usual:
import pandas as pd
df = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
                  columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
print(df)
yields
A B Value Max
0 1 2 30 30
1 1 2 20 30
2 2 6 15 70
3 2 6 70 70
4 7 10 35 35
The above method will still work even if the values in A and B are strings. For example,
df = pd.DataFrame([['ab', 'ac', 30], ['ac', 'ab', 20],
                   ['cb', 'ca', 15], ['ca', 'cb', 70],
                   ['ff', 'zz', 35]], columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
yields
In [267]: df
Out[267]:
A B Value Max
0 ab ac 30 30
1 ab ac 20 30
2 ca cb 15 70
3 ca cb 70 70
4 ff zz 35 35
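As for Question 2 — the key never has to materialise as a column. groupby accepts any Series aligned with the frame's index, so here is a minimal sketch of the idea (same data as above, building an order-insensitive key on the fly):

import numpy as np
import pandas as pd

f = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
                 columns=['A', 'B', 'Value'])

# Sort each (A, B) pair row-wise so (1, 2) and (2, 1) produce the same key,
# then group by that key directly; A and B themselves stay untouched.
key = pd.Series([tuple(row) for row in np.sort(f[['A', 'B']].to_numpy(), axis=1)],
                index=f.index)
f['Max'] = f.groupby(key)['Value'].transform('max')

This also answers Question 1: the tuples in key play the role of the non-repeating common key from f_key, assigned without needing a drop afterwards.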

Related

How to create new columns with the top 3 maximum values in each row from specific columns in a python df?

I have this df:
import pandas as pd

data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'A': [20, 21, 19, 18],
    'B': [3, 6, 2, 1],
    'C': [6, 14, 5, 17],
    'D': [2, 10, 9, 98]
}
people = pd.DataFrame(data)
people["max_1"]=people[['A','B','C','D']].max(axis=1)
people
So I've added a new column - max_1 for the maximum value in each row from columns A, B, C, and D.
My question is how can I create new columns (max_2 and max_3) for the 2nd highest value and for the third highest value?
Additional question - is it possible to add another condition on top of it? For example, find the maximum values but only when the names are 'Tom'/'nick'/'krish' -> otherwise, set 0 for those rows.
Thanks in advance.
A solution with apply and nlargest.
import pandas as pd
data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'A': [20, 21, 19, 18],
    'B': [3, 6, 2, 1],
    'C': [6, 14, 5, 17],
    'D': [2, 10, 9, 98]
}
people = pd.DataFrame(data)
# Solution
# Set Name as the index so it does not interfere with the numeric operations.
people = people.set_index("Name")
# To select specific columns
# columns = ["A", "C", "D"]
# people = people[columns]
# Apply nlargest to each row.
# Not efficient because we use apply, but the upside is that there is not much code.
top3 = people.apply(lambda x: pd.Series(x.nlargest(3).values), axis=1)
people[["N1", "N2", "N3"]] = top3
Result
A B C D N1 N2 N3
Name
Tom 20 3 6 2 20 6 3
nick 21 6 14 10 21 14 10
krish 19 2 5 9 19 9 5
jack 18 1 17 98 98 18 17
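The additional condition from the question can be layered onto the same result. A small sketch that zeroes out the rows whose name is not in the wanted set (remember Name is the index at this point):

# Keep the top-3 values only for selected names, 0 elsewhere.
keep = people.index.isin(['Tom', 'nick', 'krish'])
top3.loc[~keep] = 0
people[['N1', 'N2', 'N3']] = top3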
n = 3
idx = [f'max_{i}' for i in range(1, 1 + n)]
df = people.iloc[:, 1:].apply(lambda x: x.nlargest(n).set_axis(idx), axis=1)
people.join(df)
result:
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 19 9 5
3 jack 18 1 17 98 98 18 17
Change n to whatever you need.
Use:
import numpy as np

#number of columns
N = 3
#columns names
cols = ['A','B','C','D']
#new columns names
new = [f'max_{i+1}' for i in range(N)]
#condition for test membership
mask = people['Name'].isin(['Tom','nick'])
#new columns filled 0
people[new] = 0
#for filtered rows get top N values
people.loc[mask, new] = np.sort(people.loc[mask, cols].to_numpy(), axis=1)[:, -N:][:, ::-1]
print (people)
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 0 0 0
3 jack 18 1 17 98 0 0 0
Solution with numpy.where and broadcasting:
N = 3
cols = ['A','B','C','D']
new = [f'max_{i+1}' for i in range(N)]
mask = people['Name'].isin(['Tom','nick'])
people[new] = np.where(mask.to_numpy()[:, None],
                       np.sort(people[cols].to_numpy(), axis=1)[:, -N:][:, ::-1],
                       0)
print (people)
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 0 0 0
3 jack 18 1 17 98 0 0 0
You can do:
# Sort each row ascending once, then index from the end.
sorted_rows = np.sort(people[['A', 'B', 'C', 'D']].to_numpy(), axis=1)
# to get max_2
people['max_2'] = sorted_rows[:, -2]
# to get max_3
people['max_3'] = sorted_rows[:, -3]
Use
import numpy as np
people[['max_1','max_2','max_3']] = \
people[['A','B','C','D']].apply(lambda x: -np.sort(-x), axis=1, raw=True).iloc[:, 0:3]
people
# Out:
# Name A B C D max_1 max_2 max_3
# 0 Tom 20 3 3 2 20 3 3
# 1 nick 21 6 14 10 21 14 10
# 2 krish 19 2 5 9 19 9 5
# 3 jack 18 1 17 98 98 18 17
Note that I changed the data a bit to show what happens in the case of duplicate values:
# data = {
# 'Name': ['Tom', 'nick', 'krish', 'jack'],
# 'A': [20, 21, 19, 18],
# 'B': [3, 6, 2, 1],
# 'C': [3, 14, 5, 17],
# 'D': [2, 10, 9, 98]
# }
# people = pd.DataFrame(data)
people
# Out:
# Name A B C D
# 0 Tom 20 3 3 2
# 1 nick 21 6 14 10
# 2 krish 19 2 5 9
# 3 jack 18 1 17 98
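If the frame is wide, fully sorting every row does more work than needed. NumPy's partition pulls out the top n values per row in linear time, and only those n values then need ordering — a sketch of that variant, assuming the original numeric columns of people:

import numpy as np

n = 3
cols = ['A', 'B', 'C', 'D']
vals = people[cols].to_numpy()
# np.partition places the top n values in the last n slots (unordered);
# sorting just those slots and reversing puts them in descending order.
top = np.sort(np.partition(vals, -n, axis=1)[:, -n:], axis=1)[:, ::-1]
people[[f'max_{i+1}' for i in range(n)]] = top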

Shaping a Pandas DataFrame (multiple columns into 2)

I have a dataframe similar to the one below and need it reshaped as per the expected output.
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col2': [1, 3, 5, 7, 9, 11],
    'col3': [2, 4, 6, 8, 10, 12]
})
col1 col2 col3
0 A 1 2
1 A 3 4
2 A 5 6
3 B 7 8
4 B 9 10
5 B 11 12
Expected Output
df_expected = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6],
    'B': [7, 8, 9, 10, 11, 12]
})
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
So far I have tried stack, unstack & pivot without getting the desired result.
Thanks for your help!
# One-liner: collect each group's col2 and col3 into lists, then let .sum()
# concatenate them. Note this appends all of col2 before col3 within each group,
# so column A comes out as [1, 3, 5, 2, 4, 6] rather than the interleaved order.
pd.DataFrame(df.groupby('col1').agg(list).T.sum().to_dict())
Use Numpy to reshape the data, then package it back up into a dataframe.
import numpy as np

cols = (df['col2'], df['col3'])
# Pair the two columns row by row, then reshape to one row per group
# (this relies on the groups being contiguous and equal-sized).
data = np.stack(cols, axis=1).reshape(len(cols), len(df))
dft = pd.DataFrame(data, index=df['col1'].unique()).T
print(dft)
Result
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
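If the row-wise interleaving of col2 and col3 matters (the expected output lists 1, 2, 3, ... rather than 1, 3, 5, 2, 4, 6), another route is to stack the two value columns and pivot on a per-group counter. A sketch:

# Interleave col2/col3 per row, then pivot the groups into columns.
s = df.set_index('col1')[['col2', 'col3']].stack().reset_index(name='val')
s['pos'] = s.groupby('col1').cumcount()
out = (s.pivot(index='pos', columns='col1', values='val')
        .rename_axis(None, axis=1).rename_axis(None))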

Doing an almost incomplete pivot table operation in pandas

I have a data frame like the following:
import random
import pandas as pd

values = random.sample(range(1, 101), 15)
df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
                   'n': [100, 100, 100, 'reference', 'reference', 'reference',
                         500, 500, 500, 100, 100, 100,
                         'reference', 'reference', 'reference'],
                   'value': values})
The values labeled as 'reference' in the n column are reference values, which I will eventually plot against. To help with this, I need to make a data frame that has the reference values in a different column, so columns = ['x', 'n', 'value', 'value_reference']
value_reference is the reference value for all rows that share the same x, whatever their n. Therefore, I want to make a data frame like the following:
desired_df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 4, 4, 4], 'n': [100, 100, 100, 500, 500, 500, 100, 100, 100], 'value': [values[i] for i in [0, 1, 2, 6, 7, 8, 9, 10, 11]], 'value_reference':[values[i] for i in [3, 4, 5, 3, 4, 5, 12, 13, 14]]})
I got the result here by hard coding exactly what I want to make a reproducible example. However, I am looking for the correct way of doing this operation.
How can this be done?
Thanks,
Jack
One way might be this:
df["tick"] = df.groupby(["x", "n"]).cumcount()
numbers = df.loc[df["n"] != "reference"]
ref = df.loc[df["n"] == "reference"]
ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
out = numbers.merge(ref).drop("tick", axis=1)
out = out.sort_values(["x", "n"])
which gives me
In [282]: out
Out[282]:
x n value reference
0 3 100 6 67
2 3 100 9 29
4 3 100 34 51
1 3 500 42 67
3 3 500 36 29
5 3 500 12 51
6 4 100 74 5
7 4 100 48 37
8 4 100 7 70
Step by step, first we add a tick column so we know which row of value matches with which row of reference:
In [290]: df
Out[290]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
3 3 reference 67 0
4 3 reference 29 1
5 3 reference 51 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
12 4 reference 5 0
13 4 reference 37 1
14 4 reference 70 2
Then we separate out the value and reference parts of the table:
In [291]: numbers = df.loc[df["n"] != "reference"]
...: ref = df.loc[df["n"] == "reference"]
...: ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
...:
...:
In [292]: numbers
Out[292]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
In [293]: ref
Out[293]:
x reference tick
3 3 67 0
4 3 29 1
5 3 51 2
12 4 5 0
13 4 37 1
14 4 70 2
and then we merge, where the merge will align on the shared columns, which are "x" and "tick". A sort to clean things up and we're done.
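The merge can also be avoided entirely: mask everything except the reference rows and broadcast the surviving values within each (x, tick) group. A sketch of the same idea with transform:

# Broadcast each group's reference value to its sibling rows, no merge needed.
df['tick'] = df.groupby(['x', 'n']).cumcount()
ref_only = df['value'].where(df['n'] == 'reference')  # NaN off the reference rows
df['value_reference'] = ref_only.groupby([df['x'], df['tick']]).transform('first')
out = df.loc[df['n'] != 'reference'].drop(columns='tick')

transform('first') skips NaN, so every row picks up the single reference value of its (x, tick) group.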

Lookup values of one Pandas dataframe in another

I have two dataframes, and I want to do a lookup much like a Vlookup in excel.
df_orig.head()
A
0 3
1 4
2 6
3 7
4 8
df_new
Combined Length Group_name
0 [8, 9, 112, 114, 134, 135] 6 Group 1
1 [15, 16, 17, 18, 19, 20] 6 Group 2
2 [15, 16, 17, 18, 19] 5 Group 3
3 [16, 17, 18, 19, 20] 5 Group 4
4 [15, 16, 17, 18] 4 Group 5
5 [8, 9, 112, 114] 4 Group 6
6 [18, 19, 20] 3 Group 7
7 [28, 29, 30] 3 Group 8
8 [21, 22] 2 Group 9
9 [28, 29] 2 Group 10
10 [26, 27] 2 Group 11
11 [24, 25] 2 Group 12
12 [3, 4] 2 Group 13
13 [6, 7] 2 Group 14
14 [11, 14] 2 Group 15
15 [12, 13] 2 Group 16
16 [0, 1] 2 Group 17
How can I add the values in df_new["Group_name"] to df_orig["A"]?
The "Group_name" must be based on the lookup of the values from df_orig["A"] in df_new["Combined"].
So it would look like:
df_orig.head()
A Looked_up
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
Thank you!
Two steps: unnest + merge
# Unnest: flatten all the lists in Combined and repeat each Group_name once per element.
df = pd.DataFrame({'Combined': df.Combined.sum(),
                   'Group_name': df['Group_name'].repeat(df.Length)})
# Keep the first group each value appears in, rename to match df_orig, then merge.
df_orig.merge(df.groupby('Combined').head(1).rename(columns={'Combined': 'A'}))
Out[77]:
A Group_name
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
Here is one way which mimics a vlookup. Minimal example below.
import pandas as pd
df_origin = pd.DataFrame({'A': [3, 11, 0, 12, 6]})
df_new = pd.DataFrame({'Combined': [[3, 4, 5], [6, 7], [11, 14, 20],
                                    [12, 13], [3, 1], [0, 4]],
                       'Group_name': ['Group 13', 'Group 14', 'Group 15',
                                      'Group 16', 'Group 17', 'Group 18']})
df_new['ID'] = list(zip(*df_new['Combined'].tolist()))[0]
df_origin['Group_name'] = df_origin['A'].map(df_new.drop_duplicates('ID')\
.set_index('ID')['Group_name'])
Result
A Group_name
0 3 Group 13
1 11 Group 15
2 0 Group 18
3 12 Group 16
4 6 Group 14
Explanation
Extract the first element of lists in df_new['Combined'] via zip.
Use drop_duplicates and then create a series mapping ID to Group_name.
Finally, use pd.Series.map to map df_origin['A'] to Group_name via this series.
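On pandas 0.25+ the unnest step is built in as explode, which shortens the whole lookup. A sketch that keeps the first group a value appears in (matching the expected output above, where 8 resolves to Group 1):

# Flatten the list column, keep the first match per value, then map.
lookup = (df_new.explode('Combined')
                .drop_duplicates('Combined')
                .set_index('Combined')['Group_name'])
df_orig['Looked_up'] = df_orig['A'].map(lookup)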

Drop specific MultiIndex columns in a pandas dataframe

Suppose one has a dataframe created as such:
import pandas as pd

tdata = {('A', 50): [1, 2, 3, 4],
         ('A', 55): [5, 6, 7, 8],
         ('B', 10): [10, 20, 30, 40],
         ('B', 20): [50, 60, 70, 80],
         ('B', 50): [2, 4, 6, 8],
         ('B', 55): [10, 12, 14, 16]}
tdf = pd.DataFrame(tdata, index=range(0,4))
    A       B
   50 55  10  20 50  55
0   1  5  10  50  2  10
1   2  6  20  60  4  12
2   3  7  30  70  6  14
3   4  8  40  80  8  16
How would one drop specific columns, say ('B', 10) and ('B', 20) from the dataframe?
Is there a way to drop the columns in one command such as tdf.drop(['B', [10,20]])? Note, I know that my example of the command is by no means close to what it should be, but I hope that it gets the gist across.
Is there a way to drop the columns through some logical expression? For example, say I want to drop columns having the sublevel indices less than 50 (again, the 10, 20 columns). Can I do some general command that would encompass column 'A', even though the 10,20 sublevel indices don't exist or must I specifically reference column 'B'?
You can use drop with a list of tuples:
print (tdf.drop([('B',10), ('B',20)], axis=1))
    A      B
   50 55  50  55
0   1  5   2  10
1   2  6   4  12
2   3  7   6  14
3   4  8   8  16
To remove columns by level:
mask = tdf.columns.get_level_values(1) >= 50
print (mask)
[ True True False False True True]
print (tdf.loc[:, mask])
    A      B
   50 55  50  55
0   1  5   2  10
1   2  6   4  12
2   3  7   6  14
3   4  8   8  16
If you need to remove by level, it is possible to specify only one level:
print (tdf.drop([50,55], axis=1, level=1))
    B
   10  20
0  10  50
1  20  60
2  30  70
3  40  80
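The two ideas combine naturally if you prefer dropping to selecting: build the boolean mask on the second level and hand the matching columns to drop. A small sketch:

# Drop every column whose second-level value is below 50, across all top levels.
to_drop = tdf.columns[tdf.columns.get_level_values(1) < 50]
print (tdf.drop(columns=to_drop))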
