How to subset pandas dataframe columns with idxmax output? - python

I have a pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,40,size=(10,4)), columns=range(4), index = range(10))
df.head()
    0   1   2   3
0  27  10  13  21
1  25  12  23   8
2   2  24  24  34
3  10  11  11  10
4   0  15   0  27
I'm using the idxmax function to get, for each row, the label of the column that contains the maximum value.
df_max = df.idxmax(1)
df_max.head()
0    0
1    0
2    3
3    1
4    3
How can I use df_max along with df, to create a time-series of values corresponding to the maximum value in each row of df? This is the output I want:
0    27
1    25
2    34
3    11
4    27
5    37
6    35
7    32
8    20
9    38
I know I can achieve this using df.max(1), but I want to know how to arrive at this same output by using df_max, since I want to be able to apply df_max to other matrices (not df) which share the same columns and indices as df (but not the same values).

You may try df.lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a NumPy-based replacement follows below):
df.lookup(df_max.index, df_max)
Out[628]: array([27, 25, 34, 11, 27], dtype=int64)
If you want a Series/DataFrame, pass the output to the Series/DataFrame constructor:
pd.Series(df.lookup(df_max.index, df_max), index=df_max.index)
Out[630]:
0    27
1    25
2    34
3    11
4    27
dtype: int64
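Since lookup is gone in pandas 2.x, here is a minimal replacement sketch using plain NumPy indexing (same df and df_max as above). Because only the index and columns of df_max matter, the same pattern works on any other frame that shares df's index and columns:
import numpy as np

rows = np.arange(len(df))
cols = df.columns.get_indexer(df_max)  # positional index of each row's max column
result = pd.Series(df.to_numpy()[rows, cols], index=df_max.index)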

Related

Extract indices of a grouped elements in Pandas

The objective is to extract the index numbers of randomly selected rows within each group in Pandas.
Specifically, given a df
    nval
0      4
1      4
2      0
...
23     0
24     4
...
29     4
30     4
31     0
I would like to extract 5 random indices for each of the values 0 and 4.
For example, the 5 randomly selected indices for 0 could be
3, 11, 15, 16, 22
and for 4 they could be
6, 9, 7, 29, 27
Currently, the code below achieves this objective:
import numpy as np
import pandas as pd

np.random.seed(0)
dval = [4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,
        4,0,0,0,0,4,0,4,4,4,4,4,0]
df = pd.DataFrame(dict(nval=dval))
cgroup = 5
df = df.reset_index()
all_df = []
for idx in [0, 4]:
    x = df[df['nval'] == idx].reset_index(drop=True)
    ids = np.random.choice(len(x), size=cgroup, replace=False).tolist()
    all_df.append(x.iloc[ids].reset_index(drop=True))
df = pd.concat(all_df).reset_index(drop=True).sort_values(by=['index'])
sel_index = df[['index']]
Which produced:
   index
0      3
1      6
2      7
3      9
4     11
5     15
6     16
7     22
8     27
9     29
However, I wonder whether there is a more compact way of doing this using pandas or numpy?
How about this:
import numpy as np
import pandas as pd

np.random.seed(0)
dval = [4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,4,0,0,0,0,4,0,4,4,4,4,4,0]
df = pd.DataFrame(dict(nval=dval))
df2 = df.groupby('nval').sample(5).reset_index()
print(df2)
output:
   index  nval
0     31     0
1     22     0
2     14     0
3      8     0
4     17     0
5     29     4
6     13     4
7      1     4
8     19     4
9     12     4
IIUC, you can use
pd.DataFrame({'index': df.groupby('nval').sample(5).index.sort_values()})
I'd just keep the result as an index, so it simplifies to
df.groupby('nval').sample(5).index.sort_values()
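A side note (my addition, not from the answers above): GroupBy.sample exists from pandas 1.1 onward, and it accepts random_state if you need the draw to be reproducible:
df.groupby('nval').sample(5, random_state=0).index.sort_values()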

How to remove columns after any row has a NaN value in Python pandas dataframe

Toy example code
Let's say I have following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[11,21,31], "B":[12,22,32], "C":[np.nan,23,33], "D":[np.nan,24,34], "E":[15,25,35]})
Which would return:
>>> df
    A   B     C     D   E
0  11  12   NaN   NaN  15
1  21  22  23.0  24.0  25
2  31  32  33.0  34.0  35
Remove all columns with NaN values
I know how to remove every column that has a NaN value in any row, like this:
out1 = df.dropna(axis=1, how="any")
Which returns:
>>> out1
    A   B   E
0  11  12  15
1  21  22  25
2  31  32  35
Expected output
However, what I want is to drop the first column that contains a NaN and every column after it. For the toy example the expected output would be:
    A   B
0  11  12
1  21  22
2  31  32
Question
How can I remove all columns from the first NaN-containing column onward in a pandas DataFrame?
What I would do:
check every element for being null/not null
cumulative sum every row across the columns
check any for every column, across the rows
use that result as an indexer:
df.loc[:, ~df.isna().cumsum(axis=1).any(axis=0)]
This gives me:
    A   B
0  11  12
1  21  22
2  31  32
I could find a way as follows to get the expected output:
colFirstNaN = df.isna().any(axis=0).idxmax()  # first column that has a NaN in any row
indexColLastValue = df.columns.tolist().index(colFirstNaN) - 1
ColLastValue = df.columns[indexColLastValue]
out2 = df.loc[:, :ColLastValue]
The output is then:
>>> out2
    A   B
0  11  12
1  21  22
2  31  32
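A slightly shorter variant of the same idea (my addition, not from either answer): taking cummax over the per-column NaN mask flags the first NaN-containing column and everything after it, so negating the flags keeps only the clean prefix:
out3 = df.loc[:, ~df.isna().any(axis=0).cummax()]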

Pandas groupby and pivot

I have the following pandas data frame
   id category  counts_mean
0   8        a           23
1   8        b           22
2   8        c           23
3   8        d           30
4   9        a           40
5   9        b           22
6   9        c           11
7   9        d           10
...
I want to group by id and pivot the category values into columns, to get something like this:
   id   a   b   c   d
0   8  23  22  23  30
1   9  40  22  11  10
I tried different things with groupby and pivot, but I'm not sure what the aggregation argument for the groupby should be...
Instead of combining groupby and pivot, you just need the pivot function: set its parameters (index, columns, values) to reshape your DataFrame.
import pandas as pd

# Create the DataFrame
data = {
    'id': [8, 8, 8, 8, 9, 9, 9, 9],
    'category': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'],
    'counts_mean': [23, 22, 23, 30, 40, 22, 11, 10],
}
df = pd.DataFrame(data)

# Use pivot to reshape the DataFrame
df_reshaped = df.pivot(index='id', columns='category', values='counts_mean')
print(df_reshaped)
output:
category   a   b   c   d
id
8         23  22  23  30
9         40  22  11  10
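If you want id back as an ordinary column, exactly matching the asked-for layout, a small follow-up (my addition):
df_flat = df_reshaped.reset_index().rename_axis(columns=None)
print(df_flat)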

pandas - exponentially weighted moving average - similar to excel

Consider a dataframe of 10 rows with two columns A and B, as follows:
    A  B
0  21  6
1  87  0
2  87  0
3  25  0
4  25  0
5  14  0
6  79  0
7  70  0
8  54  0
9  35  0
In Excel I can fill column B, excluding the first row, by setting each value to the average of the previous row's A and B (the original question illustrated this with an Excel screenshot, not reproduced here).
How can I do this in pandas?
Here is what I've tried:
import pandas as pd

# copy the dataframe above, then read_clipboard populates df
df = pd.read_clipboard()
for i in range(1, len(df)):
    df.loc[i, 'B'] = df[['A', 'B']].loc[i - 1].mean()
This gives me the desired result, matching Excel. But is there a better pandas way to do it? I've tried expanding and rolling, but neither produced the desired result.
You have an exponentially weighted moving average, rather than a simple moving average. That's why pd.DataFrame.rolling didn't work. You might be looking for pd.DataFrame.ewm instead. With com=1 the smoothing factor is alpha = 1/(1 + com) = 0.5, and with adjust=False the recurrence is y[t] = 0.5*x[t] + 0.5*y[t-1], which is exactly the Excel rule B[i] = (A[i-1] + B[i-1]) / 2.
Starting from
df
Out[399]:
    A  B
0  21  6
1  87  0
2  87  0
3  25  0
4  25  0
5  14  0
6  79  0
7  70  0
8  54  0
9  35  0
df['B'] = df["A"].shift().fillna(df["B"]).ewm(com=1, adjust=False).mean()
df
Out[401]:
    A          B
0  21   6.000000
1  87  13.500000
2  87  50.250000
3  25  68.625000
4  25  46.812500
5  14  35.906250
6  79  24.953125
7  70  51.976562
8  54  60.988281
9  35  57.494141
Even on just ten rows, doing it this way is about 10x faster under %timeit (959 microseconds, down from 10.3 ms for the loop). On 100 rows the gap grows to roughly 100x (1.1 ms versus 110 ms).
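As a quick sanity check (my addition), the shifted-and-filled input is [6, 21, 87, 87, 25, 25, 14, 79, 70, 54]; unrolling the recurrence by hand reproduces the B column above:
import pandas as pd

x = pd.Series([6, 21, 87, 87, 25, 25, 14, 79, 70, 54], dtype=float)
manual = [x.iloc[0]]  # first value passes through unchanged
for v in x.iloc[1:]:
    manual.append(0.5 * v + 0.5 * manual[-1])
print(pd.Series(manual))  # 6.0, 13.5, 50.25, 68.625, ... matching ewm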

Compare two pandas dataframe with different size

I have one massive pandas dataframe with this structure:
df1:
    A   B
0   0  12
1   0  15
2   0  17
3   0  18
4   1  45
5   1  78
6   1  96
7   1  32
8   2  45
9   2  78
10  2  44
11  2  10
And a second one, smaller like this:
df2
   G   H
0  0  15
1  1  45
2  2  31
I want to add a column to my first dataframe following this rule: df1.C = df2.H where df1.A == df2.G.
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or numpy way to do it.
If you only want to match rows that exist in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
   Name Special ability
0  Sara   Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
     Name  Age
0    Sara    4
1  Gustaf   12
2  Patrik   11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
This can also be done with more than one matching column. In the following example nothing merges at all, because Sara and Patrik have different ages in the two frames:
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
     Name Special ability  Age
0    Sara   Walk on water   12
1  Patrik       FireBalls   83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
     Name  Age
0    Sara    4
1  Gustaf   12
2  Patrik   11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
     Name  Age Special ability
0    Sara    4             NaN
1  Gustaf   12             NaN
2  Patrik   11             NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This gives you a dataframe with four columns (A, B, G and H), since merge keeps both key columns when their names differ. Dropping the redundant key and renaming then gives you the column names you want:
df = df.drop(columns="G").rename(columns={"H": "C"})
You can use map with a Series created by set_index (this requires the G values in df2 to be unique, since they become the lookup index):
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print(df1)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31
Or merge with drop and rename (parenthesized so the chain can span lines):
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print(df)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply as df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to work on the underlying arrays via .values for performance, as done above. Note that searchsorted assumes df2.G is sorted and that every value of df1.A actually occurs in df2.G.
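If G might be unsorted, a small hedged variation (my addition) is to sort it first and permute H the same way:
order = np.argsort(df2.G.values)
g_sorted = df2.G.values[order]
h_sorted = df2.H.values[order]
df1['C'] = h_sorted[np.searchsorted(g_sorted, df1.A.values)]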
