Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 100 110 5
21 60 70 80 55 57 8
32 12 43 57 87 98 9
41 99 23 45 65 78 12
This is the demo dataframe. For each row I want to find the maximum across three countries (US, INDIA, GERMANY), add the Threshold value to that maximum, and write the result back into the dataframe.
For example:
max(US, INDIA, GERMANY) = max(US, INDIA, GERMANY) + Threshold
After this operation the dataframe should look like:
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 105 110 5
21 60 78 80 55 57 8
32 12 43 57 96 98 9
41 111 23 45 65 78 12
I tried to achieve this using a for loop, but it takes too long to execute:
df_max = df_final[['US','INDIA','GERMANY']].idxmax(axis=1)
for ind in df_final.index:
    column = df_max[ind]
    df_final[column][ind] = df_final[column][ind] + df_final['Threshold'][ind]
Please help me with this. Thanks in advance!
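For anyone reproducing this, a minimal sketch that rebuilds the demo frame from the table above (plain integer columns assumed):
import pandas as pd

df_final = pd.DataFrame({'Day': [11, 21, 32, 41],
                         'US': [40, 60, 12, 99],
                         'INDIA': [30, 70, 43, 23],
                         'JAPAN': [20, 80, 57, 45],
                         'GERMANY': [100, 55, 87, 65],
                         'AUSTRALIA': [110, 57, 98, 78],
                         'Threshold': [5, 8, 9, 12]})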
The first solution compares the maximal value per row with all values of the filtered columns, then multiplies the resulting boolean mask by Threshold and adds it to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                   .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
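To see the intermediate step, the comparison that builds the mask puts one True per row at the maximum (shown here for the original data; True counts as 1 when multiplied by Threshold):
print (df_final[cols].eq(df_final[cols].max(axis=1), axis=0))
      US  INDIA  GERMANY
0  False  False     True
1  False   True    False
2  False  False     True
3   True  False    False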
Or use NumPy: get the column names with idxmax, compare them against an array built from cols, multiply by Threshold, and add to the original columns:
import numpy as np

cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
The solutions differ if there are multiple maximum values in a row: the first solution adds the threshold to every maximum, while the second adds it only to the first maximum (idxmax returns the first occurrence).
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 100 20 100 110 5 <- changed data: duplicated maximum 100
1 21 60 70 80 55 57 8
2 32 12 43 57 87 98 9
3 41 99 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                   .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 100 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
I have a dataframe as in the figure (the result of a word2vec analysis). I need to sort the rows in descending order by the largest value in each row, so that the order of the rows after sorting matches the red numbers in the image.
Thanks
Michael
Find the max along axis=1, sort this Series of maxima in descending order, and reindex using the resulting index.
Sample df
A B C D E F
0 95 86 29 38 79 18
1 15 8 34 46 71 50
2 29 9 78 97 83 45
3 88 25 17 83 78 77
4 40 82 3 0 78 38
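A sketch to rebuild this sample, so the one-liner below can be run as-is (values taken from the table):
import pandas as pd

df = pd.DataFrame({'A': [95, 15, 29, 88, 40],
                   'B': [86, 8, 9, 25, 82],
                   'C': [29, 34, 78, 17, 3],
                   'D': [38, 46, 97, 83, 0],
                   'E': [79, 71, 83, 78, 78],
                   'F': [18, 50, 45, 77, 38]})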
df_final = df.reindex(df.max(axis=1).sort_values(ascending=False).index)
Out[675]:
A B C D E F
2 29 9 78 97 83 45
0 95 86 29 38 79 18
3 88 25 17 83 78 77
4 40 82 3 0 78 38
1 15 8 34 46 71 50
You can use .max(axis=1) to find the row-wise max and then use .argsort() to return the integer indices that would sort the Series values. Finally, use .loc to arrange the rows in the desired sequence:
df.loc[df.max(axis=1).argsort()[::-1]]
([::-1] is added for descending order; remove it for ascending order.)
Input:
1 2 3 4
0 0.32 -1.09 -0.040000 0.600062
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
Output:
1 2 3 4
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
0 0.32 -1.09 -0.040000 0.600062
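Note that .argsort() returns integer positions, so .loc only works here because the index is the default RangeIndex; with an arbitrary index, .iloc is the safer choice:
df.iloc[df.max(axis=1).argsort()[::-1]]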
Hi, I'm trying to look up a value from selected columns using a value from my DataFrame. The lookup value needs to identify which column name it matches among the selected columns; for example, below I only want to consider columns ending in _JT in my vlookup.
Example of dataframe:
Plan1_JT  Plan2_JT  Plan3_JT  Plan1_T  Plan2_T  JT
89        67        25        67       90       Plan1
9         45        7         6        5        Plan3
45        3         2         6        23       Plan1
Outcome:
Plan1_JT  Plan2_JT  Plan3_JT  Plan1_T  Plan2_T  JT     Plan_JT
89        67        25        67       90       Plan1  89
9         45        7         6        5        Plan3  7
45        3         2         6        23       Plan1  45
Example code (my attempt, which does not work):
df2['Plan_JT'].astype(str)=df2.loc[:,('Plan1_JT','Plan2_JT','Plan3_JT')].str.contains.iloc[1:5]
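For reference, a sketch that rebuilds the example frame above (integer plan columns assumed):
import pandas as pd

df = pd.DataFrame({'Plan1_JT': [89, 9, 45],
                   'Plan2_JT': [67, 45, 3],
                   'Plan3_JT': [25, 7, 2],
                   'Plan1_T': [67, 6, 6],
                   'Plan2_T': [90, 5, 23],
                   'JT': ['Plan1', 'Plan3', 'Plan1']})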
A solution for older pandas versions, with DataFrame.lookup (deprecated since pandas 1.2 and removed in 2.0):
df['new'] = df.lookup(df.index, df['JT'] + '_JT')
print (df)
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT new
0 89 67 25 67 90 Plan1 89
1 9 45 7 6 5 Plan3 7
2 45 3 2 6 23 Plan1 45
And for recent versions, with DataFrame.melt:
melt = df.melt('JT', ignore_index=False)
df['new'] = melt.loc[melt['JT'] + '_JT' == melt['variable'], 'value']
print (df)
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT new
0 89 67 25 67 90 Plan1 89
1 9 45 7 6 5 Plan3 7
2 45 3 2 6 23 Plan1 45
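For pandas versions where DataFrame.lookup no longer exists, a NumPy fancy-indexing sketch that reproduces it (assuming every JT value has a matching _JT column; get_indexer returns -1 for misses):
import numpy as np

idx = df.columns.get_indexer(df['JT'] + '_JT')   # column position for each row
df['new'] = df.to_numpy()[np.arange(len(df)), idx]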
I have a fairly large dataframe:
   A   B   C   D
0  17  36  45  54
1  18  23  17  17
2  74  47   8  46
3  48  38  96  83
I am trying to create a new column that is ((max value of the columns) - (2nd highest value)) / (2nd highest value).
In this example it would look something like:
   A   B   C   D  Diff
0  17  36  45  54   .20
1  18  23  17  17   .28
2  74  47   8  46   .57
3  48  38  96  83   .16
I've tried df['diff'] = df.loc[:, 'A': 'D'].max(axis=1) - df.iloc[:df.index.get_loc(df.loc[:, 'A': 'D'].idxmax(axis=1))] / ...
but even that part of the formula returns an error, never mind including the final division. I'm sure there must be an easier way of going about this.
Edit: I am also trying to get the difference between the max value and the column that immediately precedes the max value. I know this is a somewhat different question, but I would appreciate any insight. Thank you!
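For reference, a sketch that rebuilds the example frame used by the answers below:
import pandas as pd

df = pd.DataFrame({'A': [17, 18, 74, 48],
                   'B': [36, 23, 47, 38],
                   'C': [45, 17, 8, 96],
                   'D': [54, 17, 46, 83]})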
One way using pandas.Series.nlargest with pct_change:
df["Diff"] = df.apply(lambda x: x.nlargest(2).pct_change(-1)[0], axis=1)
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
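To see why this works: nlargest(2) returns the two largest values in descending order, and pct_change(-1) compares each element with the next one, so the first entry is (max - 2nd) / 2nd. A quick check on row 0:
import pandas as pd

row = pd.Series([17, 36, 45, 54])      # row 0 from the example
print(row.nlargest(2))                 # 54, 45 (descending)
print(row.nlargest(2).pct_change(-1))  # 0.2, NaN -> (54 - 45) / 45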
Another way is to apply a user-defined function row-wise:
def get_pct(x):
    xmax2, xmax = x.sort_values().tail(2)
    return (xmax - xmax2) / xmax2

df['Diff'] = df.apply(get_pct, axis=1)
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
We can also make use of NumPy's sort and np.diff; np.sort sorts each row in ascending order, so the last two columns hold the second max and the max:
arr = np.sort(df, axis=1)[:, -2:]
df['Diff'] = np.diff(arr, axis=1)[:, 0] / arr[:, 0]
print(df)
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
Let us try getting the second max value with mask:
Max = df.max(axis=1)
secMax = df.mask(df.eq(Max, axis=0)).max(axis=1)
df['Diff'] = (Max - secMax) / secMax
df
Out[69]:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
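For intuition, df.eq(Max, axis=0) flags each row's maximum and mask replaces those cells with NaN, so the second max(axis=1) picks up the second-largest value. The masked frame (a sketch, assuming only the original four columns):
print(df.mask(df.eq(df.max(axis=1), axis=0)))
      A     B     C     D
0  17.0  36.0  45.0   NaN
1  18.0   NaN  17.0  17.0
2   NaN  47.0   8.0  46.0
3  48.0  38.0   NaN  83.0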
I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
import pandas as pd

location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
            'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
            'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to location_id: if a location_id appears in an island's list_of_locations, that island_id should go into a new column in df_location.
(Note: I don't want to remove any duplicated ids; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
                   'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
                   'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
                   'island_id': [10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample of the dataframe I have; the real dataframe has about 13,000,000 rows and 4 columns. How can this be achieved in an efficient way? Is there a pythonic way to do it? I tried using for loops, but it takes too long and still didn't work. I would really appreciate it if someone could give me a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
    location_id  temperature_value  humidity_value  island_id
0             1                 20              60         10
1             1                 21              61         10
2             1                 22              62         10
3             1                 23              63         10
4             2                 24              64         20
5             2                 25              65         20
6             2                 27              66         20
7             3                 28              67         20
8             3                 29              68         20
9             3                 30              69         20
10            4                 31              70         30
11            5                 32              71         30
12            6                 33              72         40
13            7                 34              73         40
14            8                 35              74         40
15            9                 36              75         50
16           10                 37              76         60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
         left_on="location_id",
         right_on="location",
         how="left").drop("location", axis=1)
The result would be (note location-id 12 on row 3):
   location_id  temperature_value  humidity_value  island_id
0            1                 20              60       10.0
1            1                 21              61       10.0
2            1                 22              62       10.0
3           12                 23              63        NaN
4            2                 24              64       20.0
5            2                 25              65       20.0
6            2                 27              66       20.0
...
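Alternatively, a Series.map sketch that keeps df_location's row order and avoids the merge entirely (equivalent to the left join above, with NaN for unmatched locations):
island_lookup = df_islands.explode('list_of_locations').set_index('list_of_locations')['island_id']
df_location['island_id'] = df_location['location_id'].map(island_lookup)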