Pandas column values between values from another dataframe column - python

I have two pandas data-frames as follows:
import pandas as pd
import numpy as np
import string
size = 5
student_names = [''.join(np.random.choice(list(string.ascii_lowercase), size=4)) for i in range(size)]
marks = list(np.random.randint(50, high=100, size=size))
df1 = pd.DataFrame({'Student Names': student_names, 'Total': marks})
grade_leters = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D',
'D-', 'F']
grade_minimum_value = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 0]
df2 = pd.DataFrame({'Grade Letters': grade_leters, 'Minimums': grade_minimum_value})
df1
Student Names Total
0 cjpv 83
1 iywm 98
2 jhhb 87
3 qwau 70
4 ppai 82
df2
Grade Letters Minimums
0 A+ 95
1 A 90
2 A- 85
3 B+ 80
4 B 75
5 B- 70
6 C+ 65
7 C 60
8 C- 55
9 D+ 50
10 D 45
11 D- 40
12 F 0
I want to give the grade letter as a new column to df1. For example, student cjpv having a total mark of 83 will receive a grade letter of B+, since 83 is between 80 (inclusive) and 85 (exclusive).
The desired output is as follows.
Student Names Total Grade
0 cjpv 83 B+
1 iywm 98 A+
2 jhhb 87 A-
3 qwau 70 B-
4 ppai 82 B+
Thanks in advance. My apologies if a similar question already exists; I could not find one after a long search.

Use cut, taking the bins and labels dynamically from the df2 columns (sorted by Minimums so the bin edges increase). right=False makes the bins left-closed, so each minimum is included in its own grade:
np.random.seed(123)
size = 20
student_names = [''.join(np.random.choice(list(string.ascii_lowercase), size=4)) for i in range(size)]
marks = list(np.random.randint(50, high=100, size=size))
df1 = pd.DataFrame({'Student Names': student_names, 'Total': marks})
grade_leters = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D',
'D-', 'F']
grade_minimum_value = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 0]
df2 = pd.DataFrame({'Grade Letters': grade_leters, 'Minimums': grade_minimum_value})
df2 = df2.sort_values('Minimums')
df1['new'] = pd.cut(df1['Total'],
                    bins=df2['Minimums'].tolist() + [np.inf],
                    labels=df2['Grade Letters'],
                    right=False)
print(df1)
Student Names Total new
0 nccg 70 B-
1 rtkz 99 A+
2 wbar 62 C
3 pjao 68 C+
4 apzt 67 C+
5 oeaq 51 D+
6 erxd 94 A
7 cuhc 91 A
8 upyq 98 A+
9 hjdu 77 B
10 gbvw 99 A+
11 cbmi 72 B-
12 dkfa 53 D+
13 lckw 53 D+
14 nsep 61 C
15 lmug 71 B-
16 ntqg 75 B
17 ouhl 89 A-
18 whbl 91 A
19 fxzs 84 B+
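As a minimal sketch of the same idea on the five students from the question, the snippet below wraps the cut call in a helper. The name assign_grades and the variable names students and grades are hypothetical, chosen only for illustration; the bins, labels and right=False are as in the answer above. The extra boundary check shows that a total equal to a minimum falls into the higher grade.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({
    "Grade Letters": ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-",
                      "D+", "D", "D-", "F"],
    "Minimums": [95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 0],
})
students = pd.DataFrame({
    "Student Names": ["cjpv", "iywm", "jhhb", "qwau", "ppai"],
    "Total": [83, 98, 87, 70, 82],
})


def assign_grades(totals, scale):
    """Hypothetical helper: map totals to grade letters with left-closed bins."""
    ordered = scale.sort_values("Minimums")            # bin edges must increase
    edges = ordered["Minimums"].tolist() + [np.inf]    # open-ended top bin
    labels = ordered["Grade Letters"].tolist()
    return pd.cut(totals, bins=edges, labels=labels, right=False)


students["Grade"] = assign_grades(students["Total"], grades)
print(students)

# Totals that sit exactly on, or just below, a minimum.
boundary = assign_grades(pd.Series([79, 80, 85]), grades)
print(boundary.tolist())
```

The result is a categorical column; converting it with astype(str) gives plain strings if that is preferred.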
As @Henry Yik commented, it is also possible to use merge_asof (both frames must be sorted by their key columns):
df1 = pd.merge_asof(df1.sort_values('Total'), df2, left_on='Total', right_on='Minimums')
print (df1)
Student Names Total new Grade Letters Minimums
0 oeaq 51 D+ D+ 50
1 lckw 53 D+ D+ 50
2 dkfa 53 D+ D+ 50
3 nsep 61 C C 60
4 wbar 62 C C 60
5 apzt 67 C+ C+ 65
6 pjao 68 C+ C+ 65
7 nccg 70 B- B- 70
8 lmug 71 B- B- 70
9 cbmi 72 B- B- 70
10 ntqg 75 B B 75
11 hjdu 77 B B 75
12 fxzs 84 B+ B+ 80
13 ouhl 89 A- A- 85
14 whbl 91 A A 90
15 cuhc 91 A A 90
16 erxd 94 A A 90
17 upyq 98 A+ A+ 95
18 rtkz 99 A+ A+ 95
19 gbvw 99 A+ A+ 95
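Note that the merge_asof result above is ordered by Total and carries the Minimums column along. The following is a sketch, under the assumption that the original row order and only the grade column are wanted, of one way to restore both; the temporary index column produced by reset_index is the only addition to the approach above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Student Names": ["cjpv", "iywm", "jhhb", "qwau", "ppai"],
    "Total": [83, 98, 87, 70, 82],
})
df2 = pd.DataFrame({
    "Grade Letters": ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-",
                      "D+", "D", "D-", "F"],
    "Minimums": [95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 0],
})

# merge_asof needs both sides sorted on their keys; keep the original
# row labels in an "index" column so the order can be restored afterwards.
merged = pd.merge_asof(
    df1.reset_index().sort_values("Total"),
    df2.sort_values("Minimums"),
    left_on="Total",
    right_on="Minimums",   # default direction="backward": largest minimum <= Total
)

result = (
    merged.set_index("index")
    .sort_index()
    .rename(columns={"Grade Letters": "Grade"})
    [["Student Names", "Total", "Grade"]]
)
print(result)
```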

Related

Count how many times a pair of values in one pandas dataframe appears in another

I have a pandas dataframe df1 that looks like this:
import pandas as pd
d = {'node1': [47, 24, 19, 77, 24, 19, 77, 24, 56, 92, 32, 77], 'node2': [24, 19, 77, 24, 19, 77, 24, 19, 92, 32, 77, 24], 'user': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']}
df1 = pd.DataFrame(data=d)
df1
node1 node2 user
47 24 A
24 19 A
19 77 A
77 24 A
24 19 A
19 77 B
77 24 B
24 19 B
56 92 C
92 32 C
32 77 C
77 24 C
And a second pandas dataframe df2 that looks like this:
d2 = {'way_id': [4, 3, 1, 8, 5, 2, 7, 9, 6, 10], 'source': [24, 19, 84, 47, 19, 16, 77, 56, 32, 92], 'target': [19, 43, 67, 24, 77, 29, 24, 92, 77, 32]}
df2 = pd.DataFrame(data=d2)
df2
way_id source target
4 24 19
3 19 43
1 84 67
8 47 24
5 19 77
2 16 29
7 77 24
9 56 92
6 32 77
10 92 32
In a new dataframe I would like to count how often the value pairs per row in the columns node1 and node2 in df1 occur in the rows of the source and target columns in df2. The order is relevant, but also the corresponding user should be added to a new column. That's why the desired output should be like this:
way_id source target count user
4 24 19 2 A
3 19 43 0 A
1 84 67 0 A
8 47 24 1 A
5 19 77 1 A
2 16 29 0 A
7 77 24 1 A
9 56 92 0 A
6 32 77 0 A
10 92 32 0 A
4 24 19 1 B
3 19 43 0 B
1 84 67 0 B
8 47 24 0 B
5 19 77 1 B
2 16 29 0 B
7 77 24 1 B
9 56 92 0 B
6 32 77 0 B
10 92 32 0 B
4 24 19 0 C
3 19 43 0 C
1 84 67 0 C
8 47 24 0 C
5 19 77 0 C
2 16 29 0 C
7 77 24 1 C
9 56 92 1 C
6 32 77 1 C
10 92 32 1 C
If the direction of a pair does not matter, duplicate df1 with the node columns swapped, then merge with df2 and count:
(pd.concat([df1.rename(columns={'node1':'source','node2':'target'}),
            df1.rename(columns={'node2':'source','node1':'target'})]
          )
   .merge(df2, on=['source','target'], how='outer')
   .groupby(['source','target','user'], as_index=False)['way_id'].count()
)
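The question states that the order of a pair is relevant and expects a row for every user and way, including zero counts. The sketch below is an alternative, not the answerer's method: it counts the ordered pairs per user, builds every user/way combination with a cross join, and fills the missing counts with 0. It assumes pandas 1.2 or later for how="cross"; the names counts, grid and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "node1": [47, 24, 19, 77, 24, 19, 77, 24, 56, 92, 32, 77],
    "node2": [24, 19, 77, 24, 19, 77, 24, 19, 92, 32, 77, 24],
    "user": ["A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
})
df2 = pd.DataFrame({
    "way_id": [4, 3, 1, 8, 5, 2, 7, 9, 6, 10],
    "source": [24, 19, 84, 47, 19, 16, 77, 56, 32, 92],
    "target": [19, 43, 67, 24, 77, 29, 24, 92, 77, 32],
})

# Count each ordered (node1, node2) pair per user.
counts = (
    df1.groupby(["user", "node1", "node2"])
    .size()
    .reset_index(name="count")
    .rename(columns={"node1": "source", "node2": "target"})
)

# Every (user, way) combination, so that unmatched ways appear with count 0.
grid = df1[["user"]].drop_duplicates().merge(df2, how="cross")

result = (
    grid.merge(counts, on=["user", "source", "target"], how="left")
    .fillna({"count": 0})
    .astype({"count": int})
    [["way_id", "source", "target", "count", "user"]]
)
print(result)
```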

Compare each row of Pandas df1 with every row within df2 and return string value from closest matching column

I have two data frames.
df1 includes 4 men and 4 women with their weight and height (inches).
#df1
John, 236, 76
Jack, 204, 74
Jim, 156, 71
Jared, 182, 72
Suzy, 119, 60
Sally, 149, 66
Sharon, 169, 65
Sammy, 182, 75
df2 includes 4 men and 4 women with their weight and height (inches).
#df2
Aaron, 285, 77
Abe, 236, 75
Alex, 178, 72
Adam, 195, 71
Mary, 148, 66
Maylee, 155, 66
Marilyn, 199, 65
Madison, 160, 73
What I am trying to do is have men from df1 be compared to men from df2 to see who they are most like based on height and weight. Just subtract weight from weight and height from height and return an absolute value for each man in df2. More specifically, return the name of the man most similar.
So in this case John's closest match is Abe so in a new column
df1['doppelganger'] = "Abe".
I'm a beginner hobbyist so even pointing me in the right direction would be helpful. I've been looking through stack overflow for about five hours trying to figure out how to go about something like this.
First, men and women need to be distinguished; here a new column g is added that repeats m four times and then f four times. DataFrame.merge on that column then pairs every row of df1 with every row of df2 in the same group, and new columns hold the absolute differences, with the last column being their sum. Sorting by three columns with DataFrame.sort_values puts the closest match first, so DataFrame.drop_duplicates on A and g keeps only the best row per person:
df = (df1.assign(g = ['m']*4 + ['f']*4)
         .merge(df2.assign(g = ['m']*4 + ['f']*4), on='g', how='outer', suffixes=('','_'))
         .assign(dif1 = lambda x: x['B'].sub(x['B_']).abs(),
                 dif2 = lambda x: x['C'].sub(x['C_']).abs(),
                 sumdiff = lambda x: x['dif1'] + x['dif2'])
         .sort_values(['A', 'g','sumdiff'])
         .drop_duplicates(['A','g'])
         .sort_index()
         .rename(columns={'A_':'doppelganger'})
      )
print(df)
A B C g doppelganger B_ C_ dif1 dif2 sumdiff
1 John 236 76 m Abe 236 75 0 1 1
7 Jack 204 74 m Adam 195 71 9 3 12
10 Jim 156 71 m Alex 178 72 22 1 23
14 Jared 182 72 m Alex 178 72 4 0 4
16 Suzy 119 60 f Mary 148 66 29 6 35
20 Sally 149 66 f Mary 148 66 1 0 1
25 Sharon 169 65 f Maylee 155 66 14 1 15
31 Sammy 182 75 f Madison 160 73 22 2 24
Input DataFrames:
print (df1)
A B C
0 John 236 76
1 Jack 204 74
2 Jim 156 71
3 Jared 182 72
4 Suzy 119 60
5 Sally 149 66
6 Sharon 169 65
7 Sammy 182 75
print (df2)
A B C
0 Aaron 285 77
1 Abe 236 75
2 Alex 178 72
3 Adam 195 71
4 Mary 148 66
5 Maylee 155 66
6 Marilyn 199 65
7 Madison 160 73
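For reference, a compact self-contained sketch of the same idea, using groupby with idxmin instead of sorting and dropping duplicates. It assumes, as the answer does, that the columns are named A, B and C and that the first four rows of each frame are men and the last four are women; the names pairs, best and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["John", "Jack", "Jim", "Jared", "Suzy", "Sally", "Sharon", "Sammy"],
    "B": [236, 204, 156, 182, 119, 149, 169, 182],   # weight
    "C": [76, 74, 71, 72, 60, 66, 65, 75],           # height in inches
})
df2 = pd.DataFrame({
    "A": ["Aaron", "Abe", "Alex", "Adam", "Mary", "Maylee", "Marilyn", "Madison"],
    "B": [285, 236, 178, 195, 148, 155, 199, 160],
    "C": [77, 75, 72, 71, 66, 66, 65, 73],
})
group = ["m"] * 4 + ["f"] * 4   # assumption: 4 men followed by 4 women

# Pair every person in df1 with every person of the same group in df2.
pairs = df1.assign(g=group).merge(df2.assign(g=group), on="g", suffixes=("", "_"))
pairs["sumdiff"] = (pairs["B"] - pairs["B_"]).abs() + (pairs["C"] - pairs["C_"]).abs()

# For each person in df1, keep the pairing with the smallest total difference.
best = pairs.loc[pairs.groupby("A")["sumdiff"].idxmin(), ["A", "A_"]]
result = df1.merge(best.rename(columns={"A_": "doppelganger"}), on="A")
print(result)
```

Summing the raw weight and height differences lets weight dominate, because its values span a wider range; whether to rescale the columns first is a design choice the question leaves open.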

Append columns from a DataFrame to a list

Is it possible to append columns from a dataframe into an empty list?
Example of a random df is produced:
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))
The output is:
A B C D
0 25 27 34 77
1 85 62 39 49
2 90 51 2 97
3 39 19 86 59
4 33 79 64 73
5 36 66 29 78
6 22 27 84 41
7 0 26 22 22
8 44 57 29 37
9 0 31 96 90
If I had an empty list or lists, could you append the columns row by row? So A, C go to one list and B, D to another. An example output would be:
empty_list = [[],[]]
empty_list[0] = [[25,34],
                 [85,39],
                 [90,2],
                 [39,86],
                 [33,64],
                 [36,29],
                 [22,84],
                 [0,22],
                 [44,29],
                 [0,96]]
Or would you have to go through and convert each column to a list with df['A'].tolist() and then go through an append by row?
Try this
d=df[['A','C']]
d.values.tolist()
Output
[[0, 93], [58, 14], [79, 18], [40, 26], [91, 14], [25, 18], [22, 25], [35, 99], [12, 82], [48, 72]]
So the solution would be :
empty_list = [[],[]]
empty_list[0]=df[['A','C']].values.tolist()
empty_list[1]=df[['B','D']].values.tolist()
My df was :
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))
df
A B C D
0 0 60 93 94
1 58 52 14 33
2 79 84 18 1
3 40 21 26 32
4 91 19 14 8
5 25 34 18 68
6 22 37 25 10
7 35 58 99 80
8 12 38 82 8
9 48 56 72 66

How to shuffle groups of rows of a Pandas dataframe?

Let's assume I have a dataframe df:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(12,4))
print(df)
0 1 2 3
0 71 64 84 20
1 48 60 83 61
2 48 78 71 46
3 65 88 66 77
4 71 22 42 58
5 66 76 64 80
6 67 28 74 87
7 32 90 55 78
8 80 42 52 14
9 54 76 73 17
10 32 89 42 36
11 85 78 61 12
How do I shuffle the rows of df three-by-three, i.e., how do I randomly shuffle the first three rows (0, 1, 2) with either the second (3, 4, 5), third (6, 7, 8) or fourth (9, 10, 11) group? This could be a possible outcome:
print(df)
0 1 2 3
3 65 88 66 77
4 71 22 42 58
5 66 76 64 80
9 54 76 73 17
10 32 89 42 36
11 85 78 61 12
6 67 28 74 87
7 32 90 55 78
8 80 42 52 14
0 71 64 84 20
1 48 60 83 61
2 48 78 71 46
Thus, the new order has the second group of 3 rows from original dataframe, then the last one, then the third one and finally the first group.
You can reshape the values into a 3D array by splitting the first axis in two, with the second of the new axes having length 3 (the group length). np.random.shuffle then shuffles in place along the first axis only; since that axis has one entry per group, whole groups are moved while the rows inside each group keep their order:
np.random.shuffle(df.values.reshape(-1,3,df.shape[1]))
Explanation
To give it a bit of explanation, let's use np.random.permutation to generate those random indices along the first axis and then index into the 3D array version.
1] Input df :
In [199]: df
Out[199]:
0 1 2 3
0 71 64 84 20
1 48 60 83 61
2 48 78 71 46
3 65 88 66 77
4 71 22 42 58
5 66 76 64 80
6 67 28 74 87
7 32 90 55 78
8 80 42 52 14
9 54 76 73 17
10 32 89 42 36
11 85 78 61 12
2] Get 3D array version :
In [200]: arr_3D = df.values.reshape(-1,3,df.shape[1])
In [201]: arr_3D
Out[201]:
array([[[71, 64, 84, 20],
[48, 60, 83, 61],
[48, 78, 71, 46]],
[[65, 88, 66, 77],
[71, 22, 42, 58],
[66, 76, 64, 80]],
[[67, 28, 74, 87],
[32, 90, 55, 78],
[80, 42, 52, 14]],
[[54, 76, 73, 17],
[32, 89, 42, 36],
[85, 78, 61, 12]]])
3] Get shuffling indices and index into the first axis of 3D version :
In [202]: shuffle_idx = np.random.permutation(arr_3D.shape[0])
In [203]: shuffle_idx
Out[203]: array([0, 3, 1, 2])
In [204]: arr_3D[shuffle_idx]
Out[204]:
array([[[71, 64, 84, 20],
[48, 60, 83, 61],
[48, 78, 71, 46]],
[[54, 76, 73, 17],
[32, 89, 42, 36],
[85, 78, 61, 12]],
[[65, 88, 66, 77],
[71, 22, 42, 58],
[66, 76, 64, 80]],
[[67, 28, 74, 87],
[32, 90, 55, 78],
[80, 42, 52, 14]]])
Then, we are assigning these values back to input dataframe.
With np.random.shuffle, we are just doing everything in-place and hiding away the work needed to explicitly generate shuffling indices and assigning back.
Sample run -
In [181]: df = pd.DataFrame(np.random.randint(11,99,(12,4)))
In [182]: df
Out[182]:
0 1 2 3
0 82 49 80 20
1 19 97 74 81
2 62 20 97 19
3 36 31 14 41
4 27 86 28 58
5 38 68 24 83
6 85 11 25 88
7 21 31 53 19
8 38 45 14 72
9 74 63 40 94
10 69 85 53 81
11 97 96 28 29
In [183]: np.random.shuffle(df.values.reshape(-1,3,df.shape[1]))
In [184]: df
Out[184]:
0 1 2 3
0 85 11 25 88
1 21 31 53 19
2 38 45 14 72
3 82 49 80 20
4 19 97 74 81
5 62 20 97 19
6 36 31 14 41
7 27 86 28 58
8 38 68 24 83
9 74 63 40 94
10 69 85 53 81
11 97 96 28 29
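The in-place version relies on df.values exposing the frame's own data; when df.values is a copy (for example with mixed column dtypes), shuffling it leaves df unchanged. The following sketch spells out the permutation-based explanation above without modifying df: it permutes the group numbers, expands them to row positions and selects with iloc. The seeded generator and the names group_len, order and row_pos are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)              # seeded only to make the run repeatable
df = pd.DataFrame(np.arange(48).reshape(12, 4))

group_len = 3
n_groups = len(df) // group_len             # assumes len(df) is a multiple of group_len

# Random order of the groups, e.g. a permutation of [0, 1, 2, 3].
order = rng.permutation(n_groups)

# Expand each group number to the positions of its consecutive rows.
row_pos = (order[:, None] * group_len + np.arange(group_len)).ravel()

shuffled = df.iloc[row_pos]                 # new frame; df itself is untouched
print(shuffled)
```

Because the original index labels are kept, the moved groups remain easy to identify in the output.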
Similar solution to @Divakar's, probably simpler as I directly shuffle the index of the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame([np.arange(0, 12)]*4).T
len_group = 3
index_list = np.array(df.index)
np.random.shuffle(np.reshape(index_list, (-1, len_group)))
shuffled_df = df.loc[index_list, :]
Sample output:
shuffled_df
Out[82]:
0 1 2 3
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
This is doing the same as the other two answers, but using integer division to create a group column.
nrows_df = len(df)
nrows_group = 3
shuffled = (
    df
    .assign(group_var=df.index // nrows_group)
    .set_index("group_var")
    # integer division: np.random.permutation needs an int, not a float
    .loc[np.random.permutation(nrows_df // nrows_group)]
)

Filter pandas DataFrame through list of dicts

I have a DataFrame of arbitrary length, with X columns (let's say 10):
>>> names = ['var_' + str(x) for x in range(1, 11)]
>>> names
['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7', 'var_8', 'var_9', 'var_10']
>>> df = pd.DataFrame(np.random.randint(100, size=(10,10)), columns = names)
>>> df
var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10
0 39 49 6 39 16 41 8 86 23 52
1 6 16 21 20 81 97 83 25 56 73
2 72 97 43 50 10 46 22 75 7 18
3 20 35 69 59 14 24 57 31 47 20
4 39 93 45 80 74 87 83 50 52 67
5 93 75 83 67 40 46 79 11 31 95
6 75 76 57 82 69 98 74 75 93 13
7 35 19 28 67 39 23 72 16 63 67
8 93 87 52 25 63 29 46 64 78 12
9 81 43 4 90 88 64 1 83 26 22
Now I want to filter this DataFrame row-wise using a list of dicts:
>>> test_dict_1 = {'var_1': 89, 'var_2': 12, 'var_3': 34}
>>> test_dict_2 = {'var_7': 3, 'var_2': 11, 'var_4': 19, 'var_1': 9}
>>> test_dict_3 = {'var_3': 31}
>>> filter = [test_dict_1, test_dict_2, test_dict_3]
The result (a dict? a DataFrame? several DataFrames?) should contain only those rows that pass at least one of the filters (i.e. every variable in the filter has the same value in the row). Besides that, I of course need to know which filters passed.
I'm quite new to pandas, so I'm a bit confused about whether I can do it without "for" loops. Any solutions, please?
I know about chained conditions like df[(df.A == 1) & (df.D == 6)], but is it somehow possible to apply several different filters?
The final goal is to have every row flagged with the filters it passed, without loops.
I'm not sure I understand correctly, but if you want to filter your dataframe by a few criteria from a dictionary, you could do something like this:
In [107]: df
Out[107]:
var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10
0 45 36 84 24 86 26 44 6 44 15
1 72 16 67 75 87 89 8 68 32 49
2 9 49 0 4 77 75 65 9 45 70
test_dict_1 = {'var_1': 72, 'var_2': 16, 'var_3': 67}
cond = True
for var in test_dict_1.keys():
    cond = cond & (df[var] == test_dict_1[var])
df = df.loc[cond]
then you'll get :
In [109]: df
Out[109]:
var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10
1 72 16 67 75 87 89 8 68 32 49
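To flag every row with each filter in a list, as the question asks, the single-dict approach can be extended as in the sketch below. It still loops over the filters, but each row-wise comparison is vectorized. The small frame, the filter values and the filter_N column names are made up for illustration; eq aligns the Series built from a dict with the selected columns.

```python
import pandas as pd

df = pd.DataFrame({
    "var_1": [45, 72, 9],
    "var_2": [36, 16, 49],
    "var_3": [84, 67, 0],
})
filters = [
    {"var_1": 72, "var_2": 16, "var_3": 67},   # matches row 1
    {"var_3": 0},                              # matches row 2
    {"var_1": 1},                              # matches nothing
]

# One boolean column per filter: True where every key/value pair matches the row.
flags = pd.DataFrame({
    f"filter_{i}": df[list(cond)].eq(pd.Series(cond)).all(axis=1)
    for i, cond in enumerate(filters, start=1)
})

matched = df[flags.any(axis=1)]   # rows passing at least one filter
print(flags)
print(matched)
```

Joining flags back onto df (for example with df.join(flags)) keeps the per-filter information next to the data.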
