Performing operations on grouped rows in python

I have a dataframe where the pic_code value may repeat. If it repeats, I want to set the variable "keep" to "t" for the row whose "weight" is closest to its "mpe_wgt".
For example, the second pic_code has "keep" set to "t" since it has the "weight" closest to its corresponding "mpe_wgt". My code results in "keep" staying 'f' for all rows and "diff" staying 100 for all rows.
df['keep'] = 'f'
df['diff'] = 100

def cln_df(data):
    if pd.unique(data['mpe_wgt']).shape == (1,):
        data['keep'][0:1] = 't'
    elif pd.unique(data['mpe_wgt']).shape != (1,):
        data['diff'] = abs(data['weight'] - (data['mpe_wgt'] / 100))
        data['keep'][data['diff'] == min(data['diff'])] = 't'
    return data

df = df.groupby('pic_code').apply(cln_df)
df before
pic_code weight mpe_wgt keep diff
1234 45 34 f 100
1234 32 23 f 100
45344 54 35 f 100
234 76 98 f 100
234 65 12 f 100
df output should be
pic_code weight mpe_wgt keep diff
1234 45 34 f 11
1234 32 23 t 9
45344 54 35 t 100
234 76 98 t 22
234 65 12 f 53
I'm fairly new to python so please keep the solutions as simple as possible. I really want to make my method work so please don't get too fancy. Thanks in advance for your help.
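For reference, here is a minimal repair of the posted method (a sketch, with two assumptions: the /100 scaling of mpe_wgt was unintended, since the expected output uses plain differences, and the chained assignments are replaced with .loc, whose silent failure is the likely reason "keep" never changed):
def cln_df(data):
    # absolute difference for every row in the group
    # (assumption: no /100 scaling, matching the expected output)
    data['diff'] = (data['weight'] - data['mpe_wgt']).abs()
    # mark the row(s) whose difference is the group minimum;
    # .loc avoids the chained-assignment copy problem
    data.loc[data['diff'] == data['diff'].min(), 'keep'] = 't'
    return data

df = df.groupby('pic_code').apply(cln_df)
(Unlike the expected output above, this also fills "diff" for single-row groups such as 45344.)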

This is one way. Note I am using Boolean values True / False in place of strings "t" and "f". This is just good practice.
Note that all the below operations are vectorised, while groupby.apply with a custom function certainly is not.
Setup
print(df)
pic_code weight mpe_wgt
0 1234 45 34
1 1234 32 23
2 45344 54 35
3 234 76 98
4 234 65 12
Solution
# calculate difference
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
# sort by pic_code, then by diff
df = df.sort_values(['pic_code', 'diff'])
# define keep column as True only for non-duplicates by pic_code
df['keep'] = ~df.duplicated('pic_code')
Result
print(df)
pic_code weight mpe_wgt diff keep
3 234 76 98 22 True
4 234 65 12 53 False
1 1234 32 23 9 True
0 1234 45 34 11 False
2 45344 54 35 19 True
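If you need the original row order and the 't'/'f' strings from the question, one possible final step:
# optional: back to the original order and the question's 't'/'f' labels
df = df.sort_index()
df['keep'] = df['keep'].map({True: 't', False: 'f'})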

Use:
df['keep'] = df.assign(closest=(df['mpe_wgt'] - df['weight']).abs())\
               .sort_values('closest').duplicated(subset=['pic_code'])\
               .replace({True: 'f', False: 't'})
Output:
pic_code weight mpe_wgt keep
0 1234 45 34 f
1 1234 32 23 t
2 45344 54 35 t
3 234 76 98 t
4 234 65 12 f

Maybe you can try cumcount
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
df['keep'] = df.sort_values('diff').groupby('pic_code').cumcount().eq(0)
df
pic_code weight mpe_wgt diff keep
0 1234 45 34 11 False
1 1234 32 23 9 True
2 45344 54 35 19 True
3 234 76 98 22 True
4 234 65 12 53 False

Using eval and assign to execute logic similar to the other answers.
m = dict(zip([False, True], 'tf'))
f = lambda d: d.sort_values('diff').duplicated('pic_code').map(m)
df.eval('diff=abs(weight - mpe_wgt)').assign(keep=f)
pic_code weight mpe_wgt keep diff
0 1234 45 34 f 11.0
1 1234 32 23 t 9.0
2 45344 54 35 t 19.0
3 234 76 98 t 22.0
4 234 65 12 f 53.0

Related

Using Python, update the maximum value in each dataframe row with the sum of [column with maximum value] and [column named Threshold]

Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 100 110 5
21 60 70 80 55 57 8
32 12 43 57 87 98 9
41 99 23 45 65 78 12
This is the demo dataframe.
Here I want to choose the maximum for each row from three countries (US, INDIA, GERMANY), add the Threshold value to that maximum, and update it in the dataframe.
Let's take an example:
max[US,INDIA,GERMANY] = max[US,INDIA,GERMANY] + threshold
After performing this, the dataframe will be updated as below:
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 105 110 5
21 60 78 80 55 57 8
32 12 43 57 96 98 9
41 111 23 45 65 78 12
I tried to achieve this using a for loop, but it takes too long to execute:
df_max = df_final[['US','INDIA','GERMANY']].idxmax(axis=1)
for ind in df_final.index:
    column = df_max[ind]
    df_final[column][ind] = df_final[column][ind] + df_final['Threshold'][ind]
Please help me with this. Looking forward to a good solution. Thanks in advance!
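Setup (the demo frame from the question, reconstructed here for reference):
import pandas as pd
import numpy as np

df_final = pd.DataFrame({
    'Day': [11, 21, 32, 41],
    'US': [40, 60, 12, 99],
    'INDIA': [30, 70, 43, 23],
    'JAPAN': [20, 80, 57, 45],
    'GERMANY': [100, 55, 87, 65],
    'AUSTRALIA': [110, 57, 98, 78],
    'Threshold': [5, 8, 9, 12],
})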
The first solution compares the maximal value per row with all values of the filtered columns, then multiplies the mask by Threshold and adds it to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                                 .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
Or use numpy: get the column names by idxmax, compare against an array built from the list cols, multiply, and add to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
The solutions differ if there are multiple maximum values per row: the first solution adds the threshold to all maxima, the second only to the first maximum.
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 100 20 100 110 5 <- changed data: duplicated max of 100
1 21 60 70 80 55 57 8
2 32 12 43 57 87 98 9
3 41 99 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                                 .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 100 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12

Pandas.DataFrame: How to sort rows by the largest value in each row

I have a dataframe as in the figure (the result of a word2vec analysis). I need to sort the rows in descending order by the largest value in each row, so that the order of the rows after sorting is as indicated by the red numbers in the image.
Thanks,
Michael
Find the max on axis=1 and sort this series of maxes, then reindex using its index.
Sample df
A B C D E F
0 95 86 29 38 79 18
1 15 8 34 46 71 50
2 29 9 78 97 83 45
3 88 25 17 83 78 77
4 40 82 3 0 78 38
df_final = df.reindex(df.max(1).sort_values(ascending=False).index)
Out[675]:
A B C D E F
2 29 9 78 97 83 45
0 95 86 29 38 79 18
3 88 25 17 83 78 77
4 40 82 3 0 78 38
1 15 8 34 46 71 50
You can use .max(axis=1) to find the row-wise max and then use .argsort() to return the integer indices that would sort the Series values. Finally, use .loc to arrange the rows in the desired sequence:
df.loc[df.max(axis=1).argsort()[::-1]]
([::-1] added for descending order. Remove it for ascending order)
Input:
1 2 3 4
0 0.32 -1.09 -0.040000 0.600062
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
Output:
1 2 3 4
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
0 0.32 -1.09 -0.040000 0.600062
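One caveat (not from the original answer): argsort returns integer positions, so df.loc[...] only lines up because the demo uses the default RangeIndex. With an arbitrary index, a positional variant is safer:
# sort positions by the row-wise max, descending, then select by position
df.iloc[df.max(axis=1).to_numpy().argsort()[::-1]]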

Pandas Dataframe Lookup using partial column name

Hi, I'm trying to look up a value from selected columns using a value from my dataframe. My lookup value needs to identify which column name it matches out of the selected columns; for example, below I only want to consider columns ending in JT in my vlookup.
Example of dataframe:
Plan1_JT  Plan2_JT  Plan3_JT  Plan1_T  Plan2_T  JT
89        67        25        67       90       Plan1
9         45        7         6        5        Plan3
45        3         2         6        23       Plan1
Outcome:
Plan1_JT  Plan2_JT  Plan3_JT  Plan1_T  Plan2_T  JT     Plan_JT
89        67        25        67       90       Plan1  89
9         45        7         6        5        Plan3  7
45        3         2         6        23       Plan1  45
Example code:
df2['Plan_JT'].astype(str)=df2.loc[:,('Plan1_JT','Plan2_JT','Plan3_JT')].str.contains.iloc[1:5]
Solution for old pandas versions with DataFrame.lookup:
df['new'] = df.lookup(df.index, df['JT'] + '_JT')
print (df)
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT new
0 89 67 25 67 90 Plan1 89
1 9 45 7 6 5 Plan3 7
2 45 3 2 6 23 Plan1 45
And for recent versions, with DataFrame.melt:
melt = df.melt('JT', ignore_index=False)
df['new'] = melt.loc[melt['JT'] + '_JT' == melt['variable'], 'value']
print (df)
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT new
0 89 67 25 67 90 Plan1 89
1 9 45 7 6 5 Plan3 7
2 45 3 2 6 23 Plan1 45
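Note that DataFrame.lookup was removed in pandas 2.0. A numpy-based equivalent (along the lines of the replacement suggested in the pandas deprecation notes) might look like:
import numpy as np

# factorize the per-row target column names, then pick one value
# per row with integer fancy indexing
idx, cols = pd.factorize(df['JT'] + '_JT')
df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]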

Find the difference between the max value and 2nd highest value within a subset of pandas columns

I have a fairly large dataframe:
   A   B   C   D
0  17  36  45  54
1  18  23  17  17
2  74  47   8  46
3  48  38  96  83
I am trying to create a new column that is ((max value of the columns) - (2nd highest value)) / (2nd highest value).
In this example it would look something like:
   A   B   C   D  Diff
0  17  36  45  54  .20
1  18  23  17  17  .28
2  74  47   8  46  .57
3  48  38  96  83  .16
I've tried df['diff'] = df.loc[:, 'A': 'D'].max(axis=1) - df.iloc[:df.index.get_loc(df.loc[:, 'A': 'D'].idxmax(axis=1))] / ...
but even that part of the formula returns an error, never mind including the final division. I'm sure there must be an easier way to go about this.
Edit: Additionally, I am also trying to get the difference between the max value and the column that immediately precedes the max value. I know this is a somewhat different question, but I would appreciate any insight. Thank you!
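Setup for the answers below (reconstructing the question's frame):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [17, 18, 74, 48],
                   'B': [36, 23, 47, 38],
                   'C': [45, 17, 8, 96],
                   'D': [54, 17, 46, 83]})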
One way using pandas.Series.nlargest with pct_change:
df["Diff"] = df.apply(lambda x: x.nlargest(2).pct_change(-1)[0], axis=1)
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
One way is to apply a udf:
def get_pct(x):
    xmax2, xmax = x.sort_values().tail(2)
    return (xmax - xmax2) / xmax2

df['Diff'] = df.apply(get_pct, axis=1)
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
We can also make use of numpy sort and np.diff:
arr = np.sort(df,axis=1)[:,-2:]
df['Diff'] = np.diff(arr,axis=1)[:,0]/arr[:,0]
print(df)
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
Let us try to get the second max value with mask:
Max = df.max(1)
secMax = df.mask(df.eq(Max,0)).max(1)
df['Diff'] = (Max - secMax)/secMax
df
Out[69]:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
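As in an earlier question on this page, the answers can diverge when a row's maximum is tied. A hypothetical row (not from the question) illustrates this:
tie = pd.DataFrame({'A': [50], 'B': [50], 'C': [10], 'D': [5]})
# nlargest keeps both 50s, so Diff = (50 - 50) / 50 = 0.0
# the mask approach masks *every* occurrence of the max, so the
# "second max" becomes 10 and Diff = (50 - 10) / 10 = 4.0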

Adding a column to a dataframe through a mapping between 2 dataframes in Python?

I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
            'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
            'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to the location_id. If the values match, then the island_id for this location should be appended to a new column in df_location.
(Note: I don't want to remove any duplicated id; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
                   'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
                   'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
                   'island_id': [10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample from the dataframe that I have. What I actually have is a dataframe of 13,000,000 rows and 4 columns. How can this be achieved in an efficient way? Is there a pythonic way to do it? I tried using for loops, but it takes too long and still didn't work. I would really appreciate it if someone could give me a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
         left_on="location_id",
         right_on="location",
         how="left").drop("location", axis=1)
The result would be (note location_id 12 on row 3, which has no matching island):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
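A possible alternative for a frame of this size (a sketch, not part of the answer above): build a flat location-to-island lookup once and use Series.map, which avoids the merge entirely:
# one island_id per location, indexed by the exploded location values
island_map = (df_islands.explode('list_of_locations')
                        .set_index('list_of_locations')['island_id'])
# map each location_id through the lookup Series (NaN if unmatched)
df_location['island_id'] = df_location['location_id'].map(island_map)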
