I have two dataframes, and I want to do a lookup much like VLOOKUP in Excel.
df_orig.head()
A
0 3
1 4
2 6
3 7
4 8
df_new
Combined Length Group_name
0 [8, 9, 112, 114, 134, 135] 6 Group 1
1 [15, 16, 17, 18, 19, 20] 6 Group 2
2 [15, 16, 17, 18, 19] 5 Group 3
3 [16, 17, 18, 19, 20] 5 Group 4
4 [15, 16, 17, 18] 4 Group 5
5 [8, 9, 112, 114] 4 Group 6
6 [18, 19, 20] 3 Group 7
7 [28, 29, 30] 3 Group 8
8 [21, 22] 2 Group 9
9 [28, 29] 2 Group 10
10 [26, 27] 2 Group 11
11 [24, 25] 2 Group 12
12 [3, 4] 2 Group 13
13 [6, 7] 2 Group 14
14 [11, 14] 2 Group 15
15 [12, 13] 2 Group 16
16 [0, 1] 2 Group 17
How can I add the values in df_new["Group_name"] to df_orig["A"]?
The "Group_name" must be based on the lookup of the values from df_orig["A"] in df_new["Combined"].
So it would look like:
df_orig.head()
A Looked_up
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
Thank you!
Two steps: ***unnest*** + merge
df = pd.DataFrame({'Combined': df_new.Combined.sum(),
                   'Group_name': df_new['Group_name'].repeat(df_new.Length)})
df_orig.merge(df.groupby('Combined').head(1).rename(columns={'Combined': 'A'}))
Out[77]:
A Group_name
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
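On pandas 0.25+, the unnest step can also be written with DataFrame.explode, which may read more clearly; a minimal sketch on made-up data mirroring the question:

```python
import pandas as pd

# Made-up minimal data in the shape of the question
df_orig = pd.DataFrame({'A': [3, 4, 6, 7, 8]})
df_new = pd.DataFrame({'Combined': [[3, 4], [6, 7], [8, 9, 112]],
                       'Group_name': ['Group 13', 'Group 14', 'Group 1']})

# explode gives each list element its own row, repeating Group_name alongside it
lookup = df_new.explode('Combined').rename(columns={'Combined': 'A'})
lookup['A'] = lookup['A'].astype(df_orig['A'].dtype)  # explode leaves object dtype
result = df_orig.merge(lookup, on='A', how='left')
print(result)
```

The left merge preserves df_orig's row order and leaves NaN for values that appear in no list.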
Here is one way which mimics a vlookup (note: it matches only on the first element of each list). Minimal example below.
import pandas as pd
df_origin = pd.DataFrame({'A': [3, 11, 0, 12, 6]})
df_new = pd.DataFrame({'Combined': [[3, 4, 5], [6, 7], [11, 14, 20],
[12, 13], [3, 1], [0, 4]],
'Group_name': ['Group 13', 'Group 14', 'Group 15',
'Group 16', 'Group 17', 'Group 18']})
df_new['ID'] = list(zip(*df_new['Combined'].tolist()))[0]
df_origin['Group_name'] = df_origin['A'].map(df_new.drop_duplicates('ID')\
.set_index('ID')['Group_name'])
Result
A Group_name
0 3 Group 13
1 11 Group 15
2 0 Group 18
3 12 Group 16
4 6 Group 14
Explanation
Extract the first element of lists in df_new['Combined'] via zip.
Use drop_duplicates and then create a series mapping ID to Group_name.
Finally, use pd.Series.map to map df_origin['A'] to Group_name via this series.
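If every element of each list should be a valid lookup key, not just the first, a flat dict built by comprehension also works; note that when a value appears in several lists, the later row wins. A sketch on made-up data:

```python
import pandas as pd

df_orig = pd.DataFrame({'A': [3, 4, 6, 7, 8]})
df_new = pd.DataFrame({'Combined': [[3, 4], [6, 7], [8, 9, 112]],
                       'Group_name': ['Group 13', 'Group 14', 'Group 1']})

# Flatten every list element into a {value: group} dict, then map
lookup = {v: g for combined, g in zip(df_new['Combined'], df_new['Group_name'])
          for v in combined}
df_orig['Looked_up'] = df_orig['A'].map(lookup)
print(df_orig)
```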
Related
I have this df:
data = {
'Name': ['Tom', 'nick', 'krish', 'jack'],
'A': [20, 21, 19, 18],
'B': [3, 6, 2, 1],
'C': [6, 14, 5, 17],
'D': [2, 10, 9, 98]
}
people = pd.DataFrame(data)
people["max_1"]=people[['A','B','C','D']].max(axis=1)
people
So I've added a new column - max_1 for the maximum value in each row from columns A, B, C, and D.
My question is how can I create new columns (max_2 and max_3) for the 2nd highest value and for the third highest value?
Additional question - is it possible to add another condition on top of it? For example, find the maximum values but only when the names are 'Tom'/'nick'/'krish' -> otherwise, set 0 for those rows.
Thanks in advance.
A solution with apply and nlargest.
import pandas as pd
data = {
'Name': ['Tom', 'nick', 'krish', 'jack'],
'A': [20, 21, 19, 18],
'B': [3, 6, 2, 1],
'C': [6, 14, 5, 17],
'D': [2, 10, 9, 98]
}
people = pd.DataFrame(data)
# Solution
# Set Name as the index so it does not interfere with the numeric operations.
people = people.set_index("Name")
# To select specific columns
# columns = ["A", "C", "D"]
# people = people[columns]
# Apply nlargest to each row.
# Not efficient because we use apply, but the upside is that there is not much code.
top3 = people.apply(lambda x: pd.Series(x.nlargest(3).values), axis=1)
people[["N1", "N2", "N3"]] = top3
Result
A B C D N1 N2 N3
Name
Tom 20 3 6 2 20 6 3
nick 21 6 14 10 21 14 10
krish 19 2 5 9 19 9 5
jack 18 1 17 98 98 18 17
n = 3
idx = [f'max_{i}' for i in range(1, 1 + n)]
df = people.iloc[:, 1:].apply(lambda x: x.nlargest(n).set_axis(idx), axis=1)
people.join(df)
result:
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 19 9 5
3 jack 18 1 17 98 98 18 17
Change n to the number of top values you want.
Use:
import numpy as np

#number of columns
N = 3
#columns names
cols = ['A','B','C','D']
#new columns names
new = [f'max_{i+1}' for i in range(N)]
#condition for test membership
mask = people['Name'].isin(['Tom','nick'])
#new columns filled 0
people[new] = 0
#for filtered rows get top N values
people.loc[mask, new] = np.sort(people.loc[mask, cols].to_numpy(), axis=1)[:, -N:][:, ::-1]
print (people)
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 0 0 0
3 jack 18 1 17 98 0 0 0
Solution with numpy.where and broadcasting:
N = 3
cols = ['A','B','C','D']
new = [f'max_{i+1}' for i in range(N)]
mask = people['Name'].isin(['Tom','nick'])
people[new] = np.where(mask.to_numpy()[:, None],
np.sort(people[cols].to_numpy(), axis=1)[:, -N:][:, ::-1],
0)
print (people)
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 0 0 0
3 jack 18 1 17 98 0 0 0
You can do:
# np.sort sorts each row ascending, so with 4 columns
# index 2 is the 2nd highest and index 1 is the 3rd highest
sorted_vals = np.sort(people[['A', 'B', 'C', 'D']].to_numpy(), axis=1)
people['max_2'] = sorted_vals[:, 2]
people['max_3'] = sorted_vals[:, 1]
Use
import numpy as np
people[['max_1','max_2','max_3']] = \
people[['A','B','C','D']].apply(lambda x: -np.sort(-x), axis=1, raw=True).iloc[:, 0:3]
people
# Out:
# Name A B C D max_1 max_2 max_3
# 0 Tom 20 3 3 2 20 3 3
# 1 nick 21 6 14 10 21 14 10
# 2 krish 19 2 5 9 19 9 5
# 3 jack 18 1 17 98 98 18 17
Note that I changed the data a bit to show what happens in case of duplicate values
# data = {
# 'Name': ['Tom', 'nick', 'krish', 'jack'],
# 'A': [20, 21, 19, 18],
# 'B': [3, 6, 2, 1],
# 'C': [3, 14, 5, 17],
# 'D': [2, 10, 9, 98]
# }
# people = pd.DataFrame(data)
people
# Out:
# Name A B C D
# 0 Tom 20 3 3 2
# 1 nick 21 6 14 10
# 2 krish 19 2 5 9
# 3 jack 18 1 17 98
I have a similar dataframe to the one below and need it reshaped to the expected output.
df = pd.DataFrame({
'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
'col2': [1, 3, 5, 7, 9, 11],
'col3': [2, 4, 6, 8, 10, 12]
})
col1 col2 col3
0 A 1 2
1 A 3 4
2 A 5 6
3 B 7 8
4 B 9 10
5 B 11 12
Expected Output
df_expected = pd.DataFrame({
'A': [1, 2, 3, 4, 5, 6],
'B': [7, 8, 9, 10, 11, 12]
})
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
So far I have tried stack, unstack & pivot without getting the desired result.
Thanks for your help!
pd.DataFrame({k: g[['col2', 'col3']].to_numpy().ravel() for k, g in df.groupby('col1')})
Use NumPy to reshape the data, then package it back up into a DataFrame.
cols = (df['col2'],df['col3'])
data = np.stack(cols,axis=1).reshape(len(cols),len(df))
dft = pd.DataFrame(data, index=df['col1'].unique()).T
print(dft)
Result
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
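A pandas-only alternative: stack emits values row by row (col2 then col3 within each original row), which is exactly the interleaved order the expected output needs. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'col2': [1, 3, 5, 7, 9, 11],
                   'col3': [2, 4, 6, 8, 10, 12]})

# Keep col1 in the index, stack col2/col3 into one long Series,
# then pull out each group's values in order
s = df.set_index('col1', append=True)[['col2', 'col3']].stack()
out = pd.DataFrame({g: s.xs(g, level='col1').to_numpy()
                    for g in df['col1'].unique()})
print(out)
```

This assumes both groups contribute the same number of values, so the resulting columns line up.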
I want to reshape the data so that the values in the index column become the columns
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired Data frame:
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
df1 = pd.DataFrame({
'Cat' : ['V','W', 'X', 'Y', 'Z'],
'Gender_Male' : [5, 15, 11, 22, 8],
'Gender_Female' : [4, 12, 15, 18, 7],
'Location_london': [4,12, 16, 21, 7],
'Location_North' : [2, 7, 4, 9, 4],
'Location_South' : [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
V W X Y Z
Gender Male 5 15 11 22 8
Female 4 12 15 18 7
Location london 4 12 16 21 7
North 2 7 4 9 4
South 3 8 6 9 4
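The two steps above can also be chained with set_axis, and the resulting MultiIndex makes it easy to pull out one variable's block; a sketch on the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4]
}).set_index('Cat')

# Transpose, then replace the flat index with a split MultiIndex in one chain
dft = df1.T.set_axis(df1.columns.str.split('_', expand=True), axis=0)

# Selecting one variable's rows via the first index level
gender = dft.loc['Gender']
```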
Is there a way to randomly permute the columns of a matrix? I tried np.random.permutation but the result is not what I need.
What I would like is to randomly change the positions of the columns of the matrix, without changing the position of the values within each column.
E.g.
starting matrix:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
Resulting matrix
11 6 1 16
12 7 2 17
13 8 3 18
14 9 4 19
15 10 5 20
You could shuffle the transposed array:
q = np.array([1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20])
q = q.reshape((5,4))
print(q)
# [[ 1 6 11 16]
# [ 2 7 12 17]
# [ 3 8 13 18]
# [ 4 9 14 19]
# [ 5 10 15 20]]
np.random.shuffle(np.transpose(q))
print(q)
# [[ 1 16 6 11]
# [ 2 17 7 12]
# [ 3 18 8 13]
# [ 4 19 9 14]
# [ 5 20 10 15]]
Another option, which generalizes to any axis, uses fancy indexing with a random permutation:
q = np.array([1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20])
q = q.reshape((5,4))
q = q[:, np.random.permutation(q.shape[1])]
print(q)
# [[ 6 11 16 1]
# [ 7 12 17 2]
# [ 8 13 18 3]
# [ 9 14 19 4]
# [10 15 20 5]]
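With the newer numpy.random.Generator API, the same column permutation can be seeded for reproducibility; a sketch:

```python
import numpy as np

# Seeded Generator makes the shuffle reproducible (seed value is arbitrary)
rng = np.random.default_rng(seed=42)
q = np.arange(1, 21).reshape(5, 4, order='F')  # columns [1..5], [6..10], [11..15], [16..20]
shuffled = q[:, rng.permutation(q.shape[1])]
print(shuffled)
```

Each column of the result is an intact column of the original, just in a new position.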
I need to extract a common max value from pairs of rows that have common values in two columns.
The commonality is between values in columns A and B: rows 0 and 1 form a pair, rows 2 and 3 form a pair, and row 4 is on its own.
f = DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]], columns=['A', 'B', 'Value'])
f
A B Value
0 1 2 30
1 2 1 20
2 2 6 15
3 6 2 70
4 7 10 35
The goal is to extract max values, so the end result is:
f_final = DataFrame([[1, 2, 30, 30], [2, 1, 20, 30], [2, 6, 15, 70], [6, 2, 70, 70], [7, 10, 35, 35]], columns=['A', 'B', 'Value', 'Max'])
f_final
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
I could do this if there is a way to assign a common, non-repeating key:
f_key = DataFrame([[1, 1, 2, 30], [1, 2, 1, 20], [2, 2, 6, 15], [2, 6, 2, 70], [3, 7, 10, 35]], columns=['key', 'A', 'B', 'Value'])
f_key
key A B Value
0 1 1 2 30
1 1 2 1 20
2 2 2 6 15
3 2 6 2 70
4 3 7 10 35
Following up with the groupby and transform:
f_key['Max'] = f_key.groupby('key')['Value'].transform('max')
f_key.drop(columns='key', inplace=True)
f_key
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
Question 1:
How would one assign this common key?
Question 2:
Is there a better way of doing this, skipping the common-key step?
Cheers...
You could sort the values in columns A and B so that for each row the value in A is less than or equal to the value in B. Once the values are ordered, then you could apply groupby-transform-max as usual:
import pandas as pd
df = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
print(df)
yields
A B Value Max
0 1 2 30 30
1 1 2 20 30
2 2 6 15 70
3 2 6 70 70
4 7 10 35 35
The above method will still work even if the values in A and B are strings. For example,
df = pd.DataFrame([['ab', 'ac', 30], ['ac', 'ab', 20],
['cb', 'ca', 15], ['ca', 'cb', 70],
['ff', 'zz', 35]], columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
yields
In [267]: df
Out[267]:
A B Value Max
0 ab ac 30 30
1 ab ac 20 30
2 ca cb 15 70
3 ca cb 70 70
4 ff zz 35 35
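To answer Question 1 directly: for numeric columns, an order-independent pair key can also be built with np.minimum/np.maximum and ngroup, without mutating A and B; a sketch:

```python
import numpy as np
import pandas as pd

f = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
                 columns=['A', 'B', 'Value'])

# Order the pair elementwise so (1, 2) and (2, 1) get the same key
lo = np.minimum(f['A'], f['B'])
hi = np.maximum(f['A'], f['B'])

# Question 1: a common, non-repeating key per unordered pair
f['key'] = f.groupby([lo, hi]).ngroup()
# Question 2: skip the key entirely and transform directly
f['Max'] = f.groupby([lo, hi])['Value'].transform('max')
print(f)
```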