Pandas groupby function - python

Suppose I have the data set below in a dataframe, df:
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Date': ['1-Jan', '2-Jan', '3-Jan', '1-Jan', '2-Jan', '3-Jan'],
                   'VAL': [45, 23, 54, 65, 76, 23]})
I am trying to insert a column, say 'new_col', that calculates the percent change in VAL that is grouped by ID. So, for example, I would want the percent change from 45 to 23, 23 to 54, and then restart for ID 'B'. The below code works but it calculates the percent change regardless of ID.
df['new_col'] = (df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)
I tried adding the group by function in front of it but I am still getting an error:
df['new_col'] = df.groupby('ID')[(df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)]

You can't just stick your expression in brackets onto the groupby like that. What you need to do is use apply to apply a function that calculates what you want. What you want can be calculated more simply using the diff method:
>>> df.groupby('ID')['VAL'].apply(lambda g: g.diff()/g.shift())
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
As DSM notes in a comment, in this case you can do it directly with the pct_change method:
>>> df.groupby('ID')['VAL'].pct_change()
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
However, it is good to be aware of how to do it with apply because you'll need to do things that way if you want to do a more complex operation on the groups (i.e., an operation for which there is no predefined one-shot method).
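For example, here is a minimal sketch of such a custom per-group calculation, reusing the df from the question (the change-relative-to-first-value metric is purely illustrative, not something asked for above):
# change of each value relative to the first value of its group
rel = df.groupby('ID')['VAL'].apply(lambda g: g / g.iloc[0] - 1)
If the result should go back into a column, groupby(...).transform with the same lambda keeps the original index, so it aligns without any extra work:
df['rel_col'] = df.groupby('ID')['VAL'].transform(lambda g: g / g.iloc[0] - 1)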

Related

How do I replace a string-value in a specific column using method chaining?

I have a pandas data frame, where some string values are "NA". I want to replace these values in a specific column (i.e. the 'strCol' in the example below) using method chaining.
How do I do this? (I googled quite a bit without success even though this should be easy?! ...)
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['val1', 'val2', 'NA', 'val3']})
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})  # method chain example operation 1
    .astype({'intCol': float})                       # method chain example operation 2
    # .where(df['strCol'] == 'NA', pd.NA)  # how to replace the string 'NA' here? this does not work ...
)
df
You can try replace instead of where:
df.replace({'strCol':{'NA':pd.NA}})
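In the context of the question's chain, that could look something like this (just a sketch, reusing the rename/astype steps from the example):
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})
    .astype({'intCol': float})
    .replace({'strCol': {'NA': pd.NA}})  # replace the string 'NA' only in 'strCol'
)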
Use a lambda in the where clause to evaluate the chained dataframe:
df = (df.rename(columns={'A': 'intCol', 'B': 'strCol'})
        .astype({'intCol': float})
        .where(lambda x: x['strCol'] == 'NA', pd.NA))
Output:
>>> df
intCol strCol
0 NaN <NA>
1 NaN <NA>
2 3.0 NA
3 NaN <NA>
Many methods like where, mask, groupby, apply can take a callable or a function so you can pass a lambda function.
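As a small sketch of that idea, assign (another method that accepts callables) can also do the replacement inside the chain without referring to the pre-chain df:
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})
    .astype({'intCol': float})
    .assign(strCol=lambda x: x['strCol'].replace('NA', pd.NA))
)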
pandas.DataFrame.where does the following:
Replace values where the condition is False.
So the condition must not hold where you want to make the replacement. Simple example:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9]})
df2 = df.where(df.x%2==0,-1)
print(df2)
gives output
x
0 -1
1 2
2 -1
3 4
4 -1
5 6
6 -1
7 8
8 -1
Observe that the odd values were replaced by -1, whilst the even values, for which the condition holds, were kept.
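The complementary method is mask, which replaces values where the condition is True; with the same condition it would replace the even values instead, e.g.
df3 = df.mask(df.x % 2 == 0, -1)  # even values become -1, odd values are kept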

Pandas Groupby with lambda gives some NANs

I have a DF where I'd like to create a new column with the difference of 2 other column values.
name rate avg_rate
A 10 3
B 6 5
C 4 3
I wrote this code to calculate the difference:
result = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate)
df['rate_diff']=result.reset_index(drop=True)
df.tail(3)
But I notice that some of the values calculated are NANs. What is the best way to handle this?
The output I am getting:
name rate avg_rate rate_diff
A 10 3 NAN
B 6 5 NAN
C 4 3 NAN
If you want to use groupby and apply, then the following should work:
res = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate).reset_index().set_index('level_1')
df = pd.merge(df,res,on=['name'],left_index = True, right_index=True).rename({0:'rate_diff'},axis=1)
However, as @sacuL suggested in the comments, you don't need groupby to calculate the difference: the two columns are already side by side, so you can simply subtract them, and a groupby-apply is overkill for this simple task.
df["rate_diff"] = df.rate - df.avg_rate

Pandas Groupby and apply method with custom function

I built the following function with the aim of estimating an optimal exponential moving average of a pandas' DataFrame column.
from scipy import optimize
from sklearn.metrics import mean_squared_error
import pandas as pd
## Function that finds best alpha and uses it to create ewma
def find_best_ewma(series, eps=10e-5):
    def f(alpha):
        ewm = series.shift().ewm(alpha=alpha, adjust=False).mean()
        return mean_squared_error(series, ewm.fillna(0))
    result = optimize.minimize(f, .3, bounds=[(0 + eps, 1 - eps)])
    return series.shift().ewm(alpha=result.x, adjust=False).mean()
Now I want to apply this function to each of the groups created using pandas-groupby on the following test df:
## test
data1 data2 key1 key2
0 -0.018442 -1.564270 a x
1 -0.038490 -1.504290 b x
2 0.953920 -0.283246 a x
3 -0.231322 -0.223326 b y
4 -0.741380 1.458798 c z
5 -0.856434 0.443335 d y
6 -1.416564 1.196244 c z
To do so, I tried the following two ways:
## First way
test.groupby(["key1","key2"])["data1"].apply(find_best_ewma)
## Output
0 NaN
1 NaN
2 -0.018442
3 NaN
4 NaN
5 NaN
6 -0.741380
Name: data1, dtype: float64
## Second way
test.groupby(["key1","key2"]).apply(lambda g: find_best_ewma(g["data1"]))
## Output
key1 key2
a x 0 NaN
2 -0.018442
b x 1 NaN
y 3 NaN
c z 4 NaN
6 -0.741380
d y 5 NaN
Name: data1, dtype: float64
Both ways produce a pandas.core.series.Series but ONLY the second way provides the expected hierarchical index.
I do not understand why the first way does not produce the hierarchical index and instead returns the original dataframe index. Could you please explain me why this happens?
What am I missing?
Thanks in advance for your help.
The first way creates a pandas.core.groupby.DataFrameGroupBy object, which becomes a pandas.core.groupby.SeriesGroupBy object once you select a specific column from it; it is this object that the 'apply' method is called on, hence a Series with the original index is returned.
test.groupby(["key1","key2"])["data1"]#.apply(find_best_ewma)
<pandas.core.groupby.SeriesGroupBy object at 0x7fce51fac790>
In the second way the object remains a DataFrameGroupBy. The lambda you pass to 'apply' selects the column inside each group, so 'find_best_ewma' still operates on that column, but 'apply' itself is called on the DataFrameGroupBy; pandas therefore prepends the group keys to the result's index, which is why the hierarchical index is still present.
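A quick way to see the difference is to inspect the intermediate objects (the exact repr depends on the pandas version):
grouped = test.groupby(["key1", "key2"])
print(type(grouped))           # ...DataFrameGroupBy
print(type(grouped["data1"]))  # ...SeriesGroupBy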

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried a for-loop over each value of the dataframe, which was taking too much time.
Then I used data_new=data.subtract(data), which was meant to subtract all the values of the dataframe from itself so that I could make all the non-null values 0.
But an error occurred as the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull with casting boolean to int by astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
a b
0 NaN 1.0
1 4.0 NaN
2 NaN 3.0
print (df.notnull())
a b
0 False True
1 True False
2 False True
print ((df.notnull()).astype('int'))
a b
0 0 1
1 1 0
2 0 1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else 0)
where col2 is the new column. This should also work if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt

# create dataframe with randomly placed NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)

trials = np.arange(100)

d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)

# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()

d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
The code above does not work for me; the code below does.
df[~df.isnull()] = 1 # not nan
df[df.isnull()] = 0 # nan
This is with pandas 0.25.3.
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
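Depending on the pandas version, writing into that temporary slice may also raise a SettingWithCopyWarning; a sketch of an alternative that assigns the subset back in a single step:
change_col = ['a', 'b']
df[change_col] = df[change_col].fillna('xxx')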
Try this one:
df.notnull().mul(1)
Here is a suggestion for a particular column: if a row in that column is NaN, replace it with 0; if it has a value, replace it with 1.
The line below will fill the NaNs in your column with 0:
df.YourColumnName.fillna(0, inplace=True)
Now the rest (the not-NaN part) can be replaced with 1 by the code below:
df["YourColumnName"] = df["YourColumnName"].apply(lambda x: 1 if x != 0 else 0)
The same can be applied to the whole dataframe by not specifying the column name, as sketched below.
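A sketch of the whole-dataframe version of the same idea (note the assumption that the data contains no genuine zeros, since those would also end up as 0; on recent pandas versions applymap is being replaced by DataFrame.map):
df = df.fillna(0).applymap(lambda x: 1 if x != 0 else 0)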
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps - substitute all not-NaN values and then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line will replace all not-NaN values with 1.
dataframe.fillna(0) - this line will replace all NaNs with 0.
Side note: if you take a look at the pandas documentation, .where replaces all values where the condition is False - this is the important thing. That is why we use inversion to create the mask ~dataframe.notna(), by which .where() will replace values.
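Put together as a single chain (a sketch of the two steps above):
result = dataframe.where(~dataframe.notna(), 1).fillna(0)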

How to apply different aggregation functions to same column by using pandas Groupby

It is clear that when doing
data.groupby(['A','B']).mean()
we get something multi-indexed by levels 'A' and 'B' with one column containing the mean of each group.
How could I get the count() and std() simultaneously, so the result looks like a dataframe:
A B mean count std
The following should work:
data.groupby(['A','B']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])
Basically, calling agg and passing a list of functions will generate multiple columns with those functions applied.
Example:
In [12]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])
Out[12]:
a
mean std count
b
0 -0.769198 0.158049 2
1 0.247708 0.743606 2
2 -0.312705 NaN 1
You can also pass the strings of the method names; the common ones work (some of the more obscure ones don't, I can't remember which), but in this case they work fine. Thanks to @ajcr for the suggestion:
In [16]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg(['mean', 'std', 'count'])
Out[16]:
a
mean std count
b
0 -1.037301 0.790498 2
1 -0.495549 0.748858 2
2 -0.644818 NaN 1
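On pandas 0.25 and later, named aggregation is another option for applying different functions to the same column while choosing the output column names directly; a small sketch with the same df:
df.groupby('b').agg(a_mean=('a', 'mean'),
                    a_std=('a', 'std'),
                    a_count=('a', 'count'))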
