I have two dataframes, and I want to do a lookup much like VLOOKUP in Excel.
df_orig.head()
A
0 3
1 4
2 6
3 7
4 8
df_new
Combined Length Group_name
0 [8, 9, 112, 114, 134, 135] 6 Group 1
1 [15, 16, 17, 18, 19, 20] 6 Group 2
2 [15, 16, 17, 18, 19] 5 Group 3
3 [16, 17, 18, 19, 20] 5 Group 4
4 [15, 16, 17, 18] 4 Group 5
5 [8, 9, 112, 114] 4 Group 6
6 [18, 19, 20] 3 Group 7
7 [28, 29, 30] 3 Group 8
8 [21, 22] 2 Group 9
9 [28, 29] 2 Group 10
10 [26, 27] 2 Group 11
11 [24, 25] 2 Group 12
12 [3, 4] 2 Group 13
13 [6, 7] 2 Group 14
14 [11, 14] 2 Group 15
15 [12, 13] 2 Group 16
16 [0, 1] 2 Group 17
How can I add the values in df_new["Group_name"] to df_orig["A"]?
The "Group_name" must be based on the lookup of the values from df_orig["A"] in df_new["Combined"].
So it would look like:
df_orig.head()
A Looked_up
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
Thank you!
Two steps: ***unnest*** + merge
df = pd.DataFrame({'Combined': df_new.Combined.sum(),
                   'Group_name': df_new['Group_name'].repeat(df_new.Length)})
df_orig.merge(df.groupby('Combined').head(1).rename(columns={'Combined': 'A'}))
Out[77]:
A Group_name
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
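On pandas 0.25+, the unnest step can also be written with DataFrame.explode, which may read more clearly; a minimal sketch on made-up data mirroring the question:

```python
import pandas as pd

# Made-up minimal data in the shape of the question
df_orig = pd.DataFrame({'A': [3, 4, 6, 7, 8]})
df_new = pd.DataFrame({'Combined': [[3, 4], [6, 7], [8, 9, 112]],
                       'Group_name': ['Group 13', 'Group 14', 'Group 1']})

# explode gives each list element its own row, repeating Group_name alongside it
lookup = df_new.explode('Combined').rename(columns={'Combined': 'A'})
lookup['A'] = lookup['A'].astype(df_orig['A'].dtype)  # explode leaves object dtype
result = df_orig.merge(lookup, on='A', how='left')
print(result)
```

The left merge preserves df_orig's row order and leaves NaN for values that appear in no list.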
Here is one way which mimics a vlookup (note: it matches only on the first element of each list). Minimal example below.
import pandas as pd
df_origin = pd.DataFrame({'A': [3, 11, 0, 12, 6]})
df_new = pd.DataFrame({'Combined': [[3, 4, 5], [6, 7], [11, 14, 20],
[12, 13], [3, 1], [0, 4]],
'Group_name': ['Group 13', 'Group 14', 'Group 15',
'Group 16', 'Group 17', 'Group 18']})
df_new['ID'] = list(zip(*df_new['Combined'].tolist()))[0]
df_origin['Group_name'] = df_origin['A'].map(df_new.drop_duplicates('ID')\
.set_index('ID')['Group_name'])
Result
A Group_name
0 3 Group 13
1 11 Group 15
2 0 Group 18
3 12 Group 16
4 6 Group 14
Explanation
Extract the first element of lists in df_new['Combined'] via zip.
Use drop_duplicates and then create a series mapping ID to Group_name.
Finally, use pd.Series.map to map df_origin['A'] to Group_name via this series.
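If every element of each list should be a valid lookup key, not just the first, a flat dict built by comprehension also works; note that when a value appears in several lists, the later row wins. A sketch on made-up data:

```python
import pandas as pd

df_orig = pd.DataFrame({'A': [3, 4, 6, 7, 8]})
df_new = pd.DataFrame({'Combined': [[3, 4], [6, 7], [8, 9, 112]],
                       'Group_name': ['Group 13', 'Group 14', 'Group 1']})

# Flatten every list element into a {value: group} dict, then map
lookup = {v: g for combined, g in zip(df_new['Combined'], df_new['Group_name'])
          for v in combined}
df_orig['Looked_up'] = df_orig['A'].map(lookup)
print(df_orig)
```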
Related
I have this df:
data = {
'Name': ['Tom', 'nick', 'krish', 'jack'],
'A': [20, 21, 19, 18],
'B': [3, 6, 2, 1],
'C': [6, 14, 5, 17],
'D': [2, 10, 9, 98]
}
people = pd.DataFrame(data)
people["max_1"]=people[['A','B','C','D']].max(axis=1)
people
So I've added a new column - max_1 for the maximum value in each row from columns A, B, C, and D.
My question is how can I create new columns (max_2 and max_3) for the 2nd highest value and for the third highest value?
Additional question - is it possible to add another condition on top of it? For example, find the maximum values but only when the names are 'Tom'/'nick'/'krish' -> otherwise, set 0 for those rows.
Thanks in advance.
A solution with apply and nlargest.
import pandas as pd
data = {
'Name': ['Tom', 'nick', 'krish', 'jack'],
'A': [20, 21, 19, 18],
'B': [3, 6, 2, 1],
'C': [6, 14, 5, 17],
'D': [2, 10, 9, 98]
}
people = pd.DataFrame(data)
# Solution
# Set Name as the index so it does not interfere with the numeric operations.
people = people.set_index("Name")
# To select specific columns
# columns = ["A", "C", "D"]
# people = people[columns]
# Apply nlargest to each row.
# Not efficient because we use apply, but the upside is that there is not much code.
top3 = people.apply(lambda x: pd.Series(x.nlargest(3).values), axis=1)
people[["N1", "N2", "N3"]] = top3
Result
A B C D N1 N2 N3
Name
Tom 20 3 6 2 20 6 3
nick 21 6 14 10 21 14 10
krish 19 2 5 9 19 9 5
jack 18 1 17 98 98 18 17
n = 3
idx = [f'max_{i}' for i in range(1, 1 + n)]
df = people.iloc[:, 1:].apply(lambda x: x.nlargest(n).set_axis(idx), axis=1)
people.join(df)
result:
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 19 9 5
3 jack 18 1 17 98 98 18 17
Change n to the number of top values you want.
Use:
import numpy as np

#number of columns
N = 3
#columns names
cols = ['A','B','C','D']
#new columns names
new = [f'max_{i+1}' for i in range(N)]
#condition for test membership
mask = people['Name'].isin(['Tom','nick'])
#new columns filled 0
people[new] = 0
#for filtered rows get top N values
people.loc[mask, new] = np.sort(people.loc[mask, cols].to_numpy(), axis=1)[:, -N:][:, ::-1]
print (people)
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 0 0 0
3 jack 18 1 17 98 0 0 0
Solution with numpy.where and broadcasting:
N = 3
cols = ['A','B','C','D']
new = [f'max_{i+1}' for i in range(N)]
mask = people['Name'].isin(['Tom','nick'])
people[new] = np.where(mask.to_numpy()[:, None],
np.sort(people[cols].to_numpy(), axis=1)[:, -N:][:, ::-1],
0)
print (people)
Name A B C D max_1 max_2 max_3
0 Tom 20 3 6 2 20 6 3
1 nick 21 6 14 10 21 14 10
2 krish 19 2 5 9 0 0 0
3 jack 18 1 17 98 0 0 0
You can do:
# np.sort sorts each row ascending, so with 4 columns
# index 2 is the 2nd highest and index 1 is the 3rd highest
sorted_vals = np.sort(people[['A', 'B', 'C', 'D']].to_numpy(), axis=1)
people['max_2'] = sorted_vals[:, 2]
people['max_3'] = sorted_vals[:, 1]
Use
import numpy as np
people[['max_1','max_2','max_3']] = \
people[['A','B','C','D']].apply(lambda x: -np.sort(-x), axis=1, raw=True).iloc[:, 0:3]
people
# Out:
# Name A B C D max_1 max_2 max_3
# 0 Tom 20 3 3 2 20 3 3
# 1 nick 21 6 14 10 21 14 10
# 2 krish 19 2 5 9 19 9 5
# 3 jack 18 1 17 98 98 18 17
Note that I changed the data a bit to show what happens in case of duplicate values
# data = {
# 'Name': ['Tom', 'nick', 'krish', 'jack'],
# 'A': [20, 21, 19, 18],
# 'B': [3, 6, 2, 1],
# 'C': [3, 14, 5, 17],
# 'D': [2, 10, 9, 98]
# }
# people = pd.DataFrame(data)
people
# Out:
# Name A B C D
# 0 Tom 20 3 3 2
# 1 nick 21 6 14 10
# 2 krish 19 2 5 9
# 3 jack 18 1 17 98
I have a similar dataframe to the one below and need it reshaped to the expected output.
df = pd.DataFrame({
'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
'col2': [1, 3, 5, 7, 9, 11],
'col3': [2, 4, 6, 8, 10, 12]
})
col1 col2 col3
0 A 1 2
1 A 3 4
2 A 5 6
3 B 7 8
4 B 9 10
5 B 11 12
Expected Output
df_expected = pd.DataFrame({
'A': [1, 2, 3, 4, 5, 6],
'B': [7, 8, 9, 10, 11, 12]
})
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
So far I have tried stack, unstack & pivot without getting the desired result.
Thanks for your help!
pd.DataFrame({k: g[['col2', 'col3']].to_numpy().ravel() for k, g in df.groupby('col1')})
Use NumPy to reshape the data, then package it back up into a DataFrame.
cols = (df['col2'],df['col3'])
data = np.stack(cols,axis=1).reshape(len(cols),len(df))
dft = pd.DataFrame(data, index=df['col1'].unique()).T
print(dft)
Result
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
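A pandas-only alternative: stack emits values row by row (col2 then col3 within each original row), which is exactly the interleaved order the expected output needs. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'col2': [1, 3, 5, 7, 9, 11],
                   'col3': [2, 4, 6, 8, 10, 12]})

# Keep col1 in the index, stack col2/col3 into one long Series,
# then pull out each group's values in order
s = df.set_index('col1', append=True)[['col2', 'col3']].stack()
out = pd.DataFrame({g: s.xs(g, level='col1').to_numpy()
                    for g in df['col1'].unique()})
print(out)
```

This assumes both groups contribute the same number of values, so the resulting columns line up.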
I want to reshape the data so that the values in the index column become the columns
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired Data frame:
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
df1 = pd.DataFrame({
'Cat' : ['V','W', 'X', 'Y', 'Z'],
'Gender_Male' : [5, 15, 11, 22, 8],
'Gender_Female' : [4, 12, 15, 18, 7],
'Location_london': [4,12, 16, 21, 7],
'Location_North' : [2, 7, 4, 9, 4],
'Location_South' : [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
V W X Y Z
Gender Male 5 15 11 22 8
Female 4 12 15 18 7
Location london 4 12 16 21 7
North 2 7 4 9 4
South 3 8 6 9 4
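The two steps above can also be chained with set_axis, and the resulting MultiIndex makes it easy to pull out one variable's block; a sketch on the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4]
}).set_index('Cat')

# Transpose, then replace the flat index with a split MultiIndex in one chain
dft = df1.T.set_axis(df1.columns.str.split('_', expand=True), axis=0)

# Selecting one variable's rows via the first index level
gender = dft.loc['Gender']
```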
Is there a way to randomly permute the columns of a matrix? I tried np.random.permutation but the result is not what I need.
What I would like is to randomly change the positions of the columns of the matrix, without changing the position of the values within each column.
E.g.
starting matrix:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
Resulting matrix
11 6 1 16
12 7 2 17
13 8 3 18
14 9 4 19
15 10 5 20
You could shuffle the transposed array:
q = np.array([1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20])
q = q.reshape((5,4))
print(q)
# [[ 1 6 11 16]
# [ 2 7 12 17]
# [ 3 8 13 18]
# [ 4 9 14 19]
# [ 5 10 15 20]]
np.random.shuffle(np.transpose(q))
print(q)
# [[ 1 16 6 11]
# [ 2 17 7 12]
# [ 3 18 8 13]
# [ 4 19 9 14]
# [ 5 20 10 15]]
Another option, which generalizes to any axis, uses fancy indexing with a random permutation:
q = np.array([1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20])
q = q.reshape((5,4))
q = q[:, np.random.permutation(q.shape[1])]
print(q)
# [[ 6 11 16 1]
# [ 7 12 17 2]
# [ 8 13 18 3]
# [ 9 14 19 4]
# [10 15 20 5]]
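With the newer numpy.random.Generator API, the same column permutation can be seeded for reproducibility; a sketch:

```python
import numpy as np

# Seeded Generator makes the shuffle reproducible (seed value is arbitrary)
rng = np.random.default_rng(seed=42)
q = np.arange(1, 21).reshape(5, 4, order='F')  # columns [1..5], [6..10], [11..15], [16..20]
shuffled = q[:, rng.permutation(q.shape[1])]
print(shuffled)
```

Each column of the result is an intact column of the original, just in a new position.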
I need to extract a common max value from pairs of rows that have common values in two columns.
The commonality is between values in columns A and B: rows 0 and 1 form a pair, rows 2 and 3 form a pair, and row 4 is on its own.
f = DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]], columns=['A', 'B', 'Value'])
f
A B Value
0 1 2 30
1 2 1 20
2 2 6 15
3 6 2 70
4 7 10 35
The goal is to extract max values, so the end result is:
f_final = DataFrame([[1, 2, 30, 30], [2, 1, 20, 30], [2, 6, 15, 70], [6, 2, 70, 70], [7, 10, 35, 35]], columns=['A', 'B', 'Value', 'Max'])
f_final
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
I could do this if there is a way to assign a common, non-repeating key:
f_key = DataFrame([[1, 1, 2, 30], [1, 2, 1, 20], [2, 2, 6, 15], [2, 6, 2, 70], [3, 7, 10, 35]], columns=['key', 'A', 'B', 'Value'])
f_key
key A B Value
0 1 1 2 30
1 1 2 1 20
2 2 2 6 15
3 2 6 2 70
4 3 7 10 35
Following up with the groupby and transform:
f_key['Max'] = f_key.groupby('key')['Value'].transform('max')
f_key.drop(columns='key', inplace=True)
f_key
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
Question 1:
How would one assign this common key?
Question 2:
Is there a better way of doing this, skipping the common-key step?
Cheers...
You could sort the values in columns A and B so that for each row the value in A is less than or equal to the value in B. Once the values are ordered, then you could apply groupby-transform-max as usual:
import pandas as pd
df = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
print(df)
yields
A B Value Max
0 1 2 30 30
1 1 2 20 30
2 2 6 15 70
3 2 6 70 70
4 7 10 35 35
The above method will still work even if the values in A and B are strings. For example,
df = pd.DataFrame([['ab', 'ac', 30], ['ac', 'ab', 20],
['cb', 'ca', 15], ['ca', 'cb', 70],
['ff', 'zz', 35]], columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
yields
In [267]: df
Out[267]:
A B Value Max
0 ab ac 30 30
1 ab ac 20 30
2 ca cb 15 70
3 ca cb 70 70
4 ff zz 35 35
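To answer Question 1 directly: for numeric columns, an order-independent pair key can also be built with np.minimum/np.maximum and ngroup, without mutating A and B; a sketch:

```python
import numpy as np
import pandas as pd

f = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
                 columns=['A', 'B', 'Value'])

# Order the pair elementwise so (1, 2) and (2, 1) get the same key
lo = np.minimum(f['A'], f['B'])
hi = np.maximum(f['A'], f['B'])

# Question 1: a common, non-repeating key per unordered pair
f['key'] = f.groupby([lo, hi]).ngroup()
# Question 2: skip the key entirely and transform directly
f['Max'] = f.groupby([lo, hi])['Value'].transform('max')
print(f)
```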