Efficient way to fill missing values using groupby - python

I have a dataframe of a million rows. The dataframe includes the columns ID, FDAT, and LACT. For each ID there may be multiple FDAT and LACT values. The FDAT should be the same for each LACT for that ID. Occasionally there is a missing FDAT, which I want to fill with the matching FDAT for that ID and LACT.
Example data:
ID FDAT LACT
1 1/1/2020 1
1 1/1/2020 1
1 1/1/2021 2
1 NA 2
1 1/1/2021 2
1 1/1/2022 3
In this example the NA should be 1/1/2021
I am using the following code to do this. Unfortunately it is very slow. I only want to fill the missing values; I do not want to change any of the non-null FDAT entries.
df.sort_values(["ID",'DATE'], inplace=True)
df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="bfill")
I was looking for code that would do the same thing but run faster.

Here is some vectorized code to handle this. It processes 1 million rows in under a second.
def fillna_fdat(df):
    a = df.set_index(['ID', 'LACT'])['FDAT']
    b = a.dropna()
    return df.assign(
        FDAT=a.fillna(b[~b.index.duplicated(keep='first')]).to_numpy()
    )
Applied to your example input data:
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1],
    'FDAT': [
        '1/1/2020', '1/1/2020', '1/1/2021', float('nan'),
        '1/1/2021', '1/1/2022'],
    'LACT': [1, 1, 2, 2, 2, 3],
})
>>> fillna_fdat(df)
ID FDAT LACT
0 1 1/1/2020 1
1 1 1/1/2020 1
2 1 1/1/2021 2
3 1 1/1/2021 2
4 1 1/1/2021 2
5 1 1/1/2022 3
Explanation
The basic idea is to make a clean mapping of (ID, LACT): FDAT. To do that efficiently, we use a version of df where the index is made of [ID, LACT]:
a = df.set_index(['ID', 'LACT'])['FDAT']
>>> a
ID LACT
1 1 1/1/2020
1 1/1/2020
2 1/1/2021
2 NaN
2 1/1/2021
3 1/1/2022
We drop NaN values, and duplicated indices:
b = a.dropna()
c = b[~b.index.duplicated(keep='first')]
>>> c
ID LACT
1 1 1/1/2020
2 1/1/2021
3 1/1/2022
Now, we can replace all NaNs in a with the values from c at the same (ID, LACT) index:
d = a.fillna(c)
>>> d
ID LACT
1 1 1/1/2020
1 1/1/2020
2 1/1/2021
2 1/1/2021 <-- this was filled from c.loc[(1, 2)]
2 1/1/2021
3 1/1/2022
At this point, we just want those values, which are in the same order as in the original df, and to ignore the index as we replace df['FDAT'] with them (hence the .to_numpy() part). To leave the original df unmodified (I strongly resent any code that changes my inputs unless explicitly stated), we derive a new df with the idiom df.assign(FDAT=...) and return that. Putting it all together gives the function above.
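For clarity, here is the same final step written in two lines, a sketch reusing the a and c defined above:
filled = a.fillna(c).to_numpy()   # filled FDAT values, still in df's original row order
out = df.assign(FDAT=filled)      # returns a new DataFrame; the input df is left untouched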
Other observations
Note that other columns, if any, are preserved. To show this, and to measure performance, let's write a generator of random DataFrames:
import numpy as np
import pandas as pd

def gen(n, k=None):
    nhalf = n // 2
    k = n // 3 if k is None else k
    df = pd.DataFrame({
        'ID': np.random.randint(0, k, nhalf),
        'FDAT': [f'1/1/{y}' for y in np.random.randint(2010, 2012+k, nhalf)],
        'LACT': np.random.randint(0, k, nhalf),
    })
    df = pd.concat([
        df,
        df.assign(FDAT=np.nan),
    ]).sample(frac=1).reset_index(drop=True).assign(
        other=np.random.uniform(size=2*nhalf)
    )
    return df
Small example:
np.random.seed(0) # reproducible example
df = gen(10)
>>> df
ID FDAT LACT other
0 0 1/1/2010 2 0.957155
1 1 1/1/2014 0 0.140351
2 1 1/1/2010 2 0.870087
3 1 NaN 1 0.473608
4 0 NaN 2 0.800911
5 0 1/1/2012 2 0.520477
6 1 NaN 2 0.678880
7 1 NaN 0 0.720633
8 0 NaN 2 0.582020
9 1 1/1/2014 1 0.537373
>>> fillna_fdat(df)
ID FDAT LACT other
0 0 1/1/2010 2 0.957155
1 1 1/1/2014 0 0.140351
2 1 1/1/2010 2 0.870087
3 1 1/1/2014 1 0.473608
4 0 1/1/2010 2 0.800911
5 0 1/1/2012 2 0.520477
6 1 1/1/2010 2 0.678880
7 1 1/1/2014 0 0.720633
8 0 1/1/2010 2 0.582020
9 1 1/1/2014 1 0.537373
Speed
np.random.seed(0)
df = gen(1_000_000)
%timeit fillna_fdat(df)
# 806 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Under a second for 1 million rows.
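For comparison, a shorter alternative is to fill only the NaNs with the first non-null FDAT of each (ID, LACT) group via groupby/transform. This is a sketch I have not benchmarked against the function above:
# a sketch, not benchmarked: fill NaNs with each group's first non-null FDAT
out = df.assign(
    FDAT=df['FDAT'].fillna(df.groupby(['ID', 'LACT'])['FDAT'].transform('first'))
)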

Below I give you a much faster alternative, together with your original code and the computation times:
import pandas as pd
data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}
df = pd.DataFrame(data)
import time
start_time = time.time()
df.sort_values(["ID", "FDAT", "LACT"], inplace=True)
df["FDAT"] = df.groupby(["ID", "LACT"])["FDAT"].transform(lambda x: x.fillna(method="ffill"))
print(df)
end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")
returning:
ID FDAT LACT
0 1 1/1/2020 1
1 1 1/1/2020 1
2 1 1/1/2021 2
4 1 1/1/2021 2
5 1 1/1/2022 3
3 1 1/1/2021 2
Execution time: 0.008013486862182617 seconds
while your solution:
import pandas as pd
data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}
df = pd.DataFrame(data)
import time
start_time = time.time()
df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="bfill")
print(df)
end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")
returns:
ID FDAT LACT
0 1 1/1/2020 1
1 1 1/1/2020 1
2 1 1/1/2021 2
3 1 1/1/2021 2
4 1 1/1/2021 2
5 1 1/1/2022 3
Execution time: 0.011833429336547852 seconds
So, using transform together with ffill is approximately 1.5 times faster. Note that sort_values() is excluded from the timing of your code example, so I'd reckon the method I suggest could be up to 2.5 times faster.
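One caveat: fillna(method="ffill") emits a deprecation warning in recent pandas versions. A sketch of the same idea using the dedicated GroupBy.ffill instead (it still relies on the NaN rows sorting after the non-null ones, as above):
df.sort_values(["ID", "FDAT", "LACT"], inplace=True)
df["FDAT"] = df.groupby(["ID", "LACT"])["FDAT"].ffill()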

Related

How do I group into different dates based on change in another column values in Pandas

I have data that looks like this:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'DATE': ['1/1/2015', '1/2/2015', '1/3/2015', '1/4/2015', '1/5/2015', '1/6/2015', '1/7/2015', '1/8/2015',
                            '1/9/2016', '1/2/2015', '1/3/2015', '1/4/2015', '1/5/2015', '1/6/2015', '1/7/2015'],
                   'CD': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'C', 'A', 'A', 'A', 'A', 'A', 'A']})
What I would like to do is group by ID and CD and get the start and stop dates for each change. I tried using groupby and the agg function, but it groups all the A rows together even though they need to be separated, since there is a B in between two runs of A.
df1 = df.groupby(['ID', 'CD'])
df1 = df1.agg(
    Start_Date=('DATE', 'min'),
    End_Date=('DATE', 'max')
).reset_index()
What I get is a single row per (ID, CD) pair, with the two runs of A collapsed together. I was hoping someone could help me get the result I need: one row per consecutive run of the same CD within each ID, with its start and end dates.
Make a grouper for the consecutive runs of CD:
grouper = df['CD'].ne(df['CD'].shift(1)).cumsum()
grouper:
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 5
14 5
Name: CD, dtype: int32
Then use groupby with the grouper:
df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
output:
min max
ID CD
1 A 1/1/2015 1/4/2015
B 1/5/2015 1/6/2015
A 1/7/2015 1/8/2015
C 1/9/2016 1/9/2016
2 A 1/2/2015 1/7/2015
Change the column names, use reset_index, and move CD to the end for your desired output:
(df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
.set_axis(['Start_Date', 'End_Date'], axis=1)
.reset_index()
.assign(CD=lambda x: x.pop('CD')))
result
ID Start_Date End_Date CD
0 1 1/1/2015 1/4/2015 A
1 1 1/5/2015 1/6/2015 B
2 1 1/7/2015 1/8/2015 A
3 1 1/9/2016 1/9/2016 C
4 2 1/2/2015 1/7/2015 A
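As a variant, roughly the same result can be produced directly with named aggregation instead of renaming afterwards. A sketch reusing the same grouper; only the column order differs:
(df.groupby(['ID', grouper, 'CD'])
   .agg(Start_Date=('DATE', 'min'), End_Date=('DATE', 'max'))
   .droplevel(1)
   .reset_index())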

How can I merge columns `Year`, `Month`, and `Day` into one column of months?

import pandas as pd
data = {'Subject': ['A', 'B', 'C', 'D'],
        'Year': [1, 0, 0, 2],
        'Month': [5, 2, 8, 8],
        'Day': [3, 22, 5, 12]}
df = pd.DataFrame(data)
print(df)
My example gives the resulting dataframe:
Subject Year Month Day
0 A 1 5 3
1 B 0 2 22
2 C 0 8 5
3 D 2 8 12
I would like it to look like this (note: the numbers are rounded, so they are not 100% accurate):
Subject Months
0 A 17
1 B 3
2 C 8
3 D 32
Assuming the Gregorian calendar:
365.2425 days/year
30.436875 days/month.
day_year = 365.2425
day_month = 30.436875
df['Days'] = df.Year.mul(day_year) + df.Month.mul(day_month) + df.Day
# You could also skip this step and just do:
# df['Months'] = (df.Year.mul(day_year) + df.Month.mul(day_month) + df.Day).div(day_month)
df['Months'] = df.Days.div(day_month)
print(df.round(2))
Output:
Subject Year Month Day Days Months
0 A 1 5 3 520.43 17.10
1 B 0 2 22 82.87 2.72
2 C 0 8 5 248.50 8.16
3 D 2 8 12 985.98 32.39
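If you only want the two requested columns, here is a minimal sketch reusing the Months column computed above and rounding to whole months:
out = df[['Subject']].assign(Months=df['Months'].round().astype(int))
print(out)
This reproduces the rounded values from the question (17, 3, 8, 32).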

Calculate rate of positive values by group

I'm working with a Pandas DataFrame having the following structure:
import pandas as pd

df = pd.DataFrame({'brand': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'target': [0, 1, 0, 1, 0, 1],
                   'freq': [5600, 220, 5700, 90, 5000, 100]})

print(df)
brand target freq
0 A 0 5600
1 A 1 220
2 B 0 5700
3 B 1 90
4 C 0 5000
5 C 1 100
For each brand, I would like to calculate the ratio of positive targets, e.g. for brand A, the percentage of positive target is 220/(220+5600) = 0.0378.
My resulting DataFrame should look like the following:
brand target freq ratio
0 A 0 5600 0.0378
1 A 1 220 0.0378
2 B 0 5700 0.0156
3 B 1 90 0.0156
4 C 0 5000 0.0196
5 C 1 100 0.0196
I know that I should group my DataFrame by brand and then apply some function to each group (since I want to keep all rows in my final result I think I should use transform here). I tested a couple of things but without any success. Any help is appreciated.
First sort by brand and target so that the target == 1 row is the last row of each group, then divide its freq by the group total inside GroupBy.transform with a lambda function:
df = df.sort_values(['brand','target'])
df['ratio'] = df.groupby('brand')['freq'].transform(lambda x: x.iat[-1] / x.sum())
print (df)
brand target freq ratio
0 A 0 5600 0.037801
1 A 1 220 0.037801
2 B 0 5700 0.015544
3 B 1 90 0.015544
4 C 0 5000 0.019608
5 C 1 100 0.019608
Or divide the Series created by GroupBy.last and GroupBy.sum:
df = df.sort_values(['brand','target'])
g = df.groupby('brand')['freq']
df['ratio'] = g.transform('last').div(g.transform('sum'))
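If you would rather not depend on sorting at all, here is a sketch that sums only the target == 1 frequencies per brand and divides by the brand totals:
pos = df['freq'].where(df['target'].eq(1), 0)          # keep freq only where target == 1
df['ratio'] = (pos.groupby(df['brand']).transform('sum')
               / df.groupby('brand')['freq'].transform('sum'))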

Pandas - Keeping groups having at least two different codes

I'm working with a DataFrame having the following structure:
import pandas as pd
df = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4],
                   'brand': ['A', 'B', 'X', 'A', 'B', 'C', 'X', 'B', 'C', 'X', 'A', 'B'],
                   'code': [2185, 2185, 0, 1410, 1390, 1390, 0, 3670, 4870, 0, 2000, 0]})
print(df)
group brand code
0 1 A 2185
1 1 B 2185
2 1 X 0
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
10 4 A 2000
11 4 B 0
My goal is to view only the groups having at least two different codes. Missing codes, labelled with 0s, should not be taken into consideration in the filtering criterion. For example, even though the two records from group 4 have different codes, we don't keep this group in the final DataFrame since one of the codes is missing.
The resulting DataFrame on the above example should look like this:
group brand code
1 2 A 1410
2 2 B 1390
3 2 C 1390
4 2 X 0
5 3 B 3670
6 3 C 4870
7 3 X 0
I didn't manage to do much with this problem. I think that the first step should be to create a mask to remove the records with a missing (0) code. Something like:
mask = df['code'].eq(0)
df = df[~mask]
print(df)
group brand code
0 1 A 2185
1 1 B 2185
3 2 A 1410
4 2 B 1390
5 2 C 1390
7 3 B 3670
8 3 C 4870
10 4 A 2000
And now only keep the groups having at least two different codes, but I don't know how to work this out in Python. Also, this method would remove the records with a missing code from my final DataFrame, which I don't want. I want to have a view of the full group.
Any additional help would be appreciated.
This is transform():
mask = (df.groupby('group')['code']
          .transform(lambda x: x.mask(x==0)  # mask out the 0 values
                                .nunique()   # count the nunique
                     )
          .gt(1)
        )
df[mask]
Output:
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
Option 2: Similar idea, but without the lambda function:
mask = (df['code'].mask(df['code']==0)  # mask out the 0 values
          .groupby(df['group'])         # groupby
          .transform('nunique')         # count uniques
          .gt(1)                        # at least 2
        )
We can also use groupby.filter:
df.groupby('group').filter(lambda x: x.code.mask(x.code.eq(0)).nunique()>1)
or, probably faster than the previous one:
import numpy as np

(df.assign(code=df['code'].replace(0, np.nan))
   .groupby('group')
   .filter(lambda x: x.code.nunique() > 1)
   .fillna({'code': 0}))
Output
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
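Another option, which avoids calling filter with a Python lambda and should scale well (not benchmarked here), is to compute the qualifying groups once and select them with isin. A sketch:
counts = df.loc[df['code'].ne(0)].groupby('group')['code'].nunique()
df[df['group'].isin(counts[counts.gt(1)].index)]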

How do I convert a DataFrame from long to wide format, aggregating a column's values by counts

My setup is as follows
import numpy as np
import pandas as pd
df = pd.DataFrame({'user_id':[1, 1, 1, 2, 3, 3], 'action':['b', 'b', 'c', 'a', 'c', 'd']})
df
action user_id
0 b 1
1 b 1
2 c 1
3 a 2
4 c 3
5 d 3
What is the best way to generate a dataframe from this where there is one row for each unique user_id, one column for each unique action and the column values are the count of each action per user_id?
I've tried
df.groupby(['user_id', 'action']).size().unstack('action')
action a b c d
user_id
1 NaN 2 1 NaN
2 1 NaN NaN NaN
3 NaN NaN 1 1
which comes close, but this seems to make user_id the index which is not what I want (I think). Maybe there's a better way involving pivot, pivot_table or even get_dummies?
You could use pd.crosstab:
In [37]: pd.crosstab(index=[df['user_id']], columns=[df['action']])
Out[37]:
action a b c d
user_id
1 0 2 1 0
2 1 0 0 0
3 0 0 1 1
Having user_id as the index seems appropriate to me, but if you'd like to drop the user_id, you could use reset_index:
In [39]: pd.crosstab(index=[df['user_id']], columns=[df['action']]).reset_index(drop=True)
Out[39]:
action a b c d
0 0 2 1 0
1 1 0 0 0
2 0 0 1 1
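Since you mentioned pivot_table and get_dummies, both can produce the same table. A sketch of each (add .reset_index() if you do not want user_id as the index):
# pivot_table, counting rows per (user_id, action) pair
df.pivot_table(index='user_id', columns='action', aggfunc='size', fill_value=0)

# get_dummies + groupby sum
pd.get_dummies(df['action']).groupby(df['user_id']).sum()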
