Pandas: Sum Previous N Rows by Group

I want to sum the prior N periods of data for each group. I have seen how to do each individually (sum by group, or sum prior N periods), but can't figure out a clean way to do both together.
I'm currently doing the following:
import pandas as pd
sample_data = {'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
               'clicks': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(sample_data)
df['clicks.1'] = df.groupby(['user'])['clicks'].shift(1)
df['clicks.2'] = df.groupby(['user'])['clicks'].shift(2)
df['clicks.3'] = df.groupby(['user'])['clicks'].shift(3)
df['total_clicks_prior3'] = df[['clicks.1','clicks.2', 'clicks.3']].sum(axis=1)
I don't want the 3 intermediate lagged columns, I just want the sum of those, so my desired output is:
>>> df[['clicks','user','total_clicks_prior3']]
clicks user total_clicks_prior3
0 0 a NaN
1 1 a 0.0
2 2 a 1.0
3 3 a 3.0
4 4 a 6.0
5 5 b NaN
6 6 b 5.0
7 7 b 11.0
8 8 b 18.0
9 9 b 21.0
Note: I could obviously drop the 3 columns after creating them, but given that I will be creating multiple columns of different numbers of lagged periods, I feel like there has to be an easier way.

This is groupby + rolling + shift
df.groupby('user')['clicks'].rolling(3, min_periods=1).sum().groupby(level=0).shift()
user
a 0 NaN
1 0.0
2 1.0
3 3.0
4 6.0
b 5 NaN
6 5.0
7 11.0
8 18.0
9 21.0
Name: clicks, dtype: float64
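If you want that back on the original frame as a column (as in the desired output), the result carries a MultiIndex with the group level on the outside, so that level has to be dropped before assigning. A minimal sketch, assuming the default integer index from the sample data:
import pandas as pd

sample_data = {'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
               'clicks': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(sample_data)

# rolling sum over the current and previous 2 rows within each user,
# then shift within each group so only the prior 3 rows are counted
rolled = df.groupby('user')['clicks'].rolling(3, min_periods=1).sum()
shifted = rolled.groupby(level=0).shift()

# drop the 'user' level so the index lines up with df again
df['total_clicks_prior3'] = shifted.reset_index(level=0, drop=True)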

If you have a solution that works for each group, you can use apply to apply it to the groupby object. For instance, you linked to a question that has df['A'].rolling(min_periods=1, window=11).sum() as an answer. If that does what you want on the subgroups, you can do
df.groupby('user').apply(lambda x: x['clicks'].rolling(min_periods=1, window=11).sum())
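Since the question mentions creating several columns for different numbers of lagged periods, one option (a sketch, reusing the df and the rolling-then-shift pattern from above; the window sizes and column names are just illustrative) is to loop over the window sizes:
for n in (3, 5, 7):  # illustrative window sizes
    summed = df.groupby('user')['clicks'].rolling(n, min_periods=1).sum()
    df[f'total_clicks_prior{n}'] = summed.groupby(level=0).shift().reset_index(level=0, drop=True)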

Related

How to pass the value of previous row to the dataframe apply function?

I have the following pandas dataframe and would like to build a new column 'c' that is the sum of column 'b' and the previous value of column 'a'. Shifting column 'a' makes this possible. However, I would like to know how I can access the previous values of column 'a' inside the apply() function.
l1 = [1,2,3,4,5]
l2 = [3,2,5,4,6]
df = pd.DataFrame(data=l1, columns=['a'])
df['b'] = l2
df['shifted'] = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['shifted'] + row['b'], axis=1)
print(df)
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
I appreciate your help.
Edit: this is a dummy example. I need to use the apply function because I'm passing another function to it which uses previous rows of some columns and checks some condition.
First, let's make it clear that you do not need apply for this simple operation, so I'll treat it as a dummy stand-in for a more complex function.
Assuming non-duplicate indices, you can generate a shifted Series and reference it in apply using the name attribute:
s = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['b'] + s[row.name], axis=1)
output:
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
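For the real use case described in the edit (a function that uses previous rows of some columns and checks a condition), the same name-based lookup generalizes. This is only a sketch under the same non-duplicate-index assumption, and the condition is purely illustrative:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 2, 5, 4, 6]})

# pre-compute whatever shifted views the function needs
prev_a = df['a'].shift(1)
prev_b = df['b'].shift(1)

def my_rule(row):
    # hypothetical condition: only add the previous 'a' when the previous 'b' was larger than the current one
    if pd.notna(prev_b[row.name]) and prev_b[row.name] > row['b']:
        return row['b'] + prev_a[row.name]
    return row['b']

df['c'] = df.apply(my_rule, axis=1)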

Pandas fill row values using previous period

One additional note to frame the problem better: in the actual data set there is also a column called store, and the table can be grouped by store, date & product. When I tried the pivot solution and the cartesian product solution they did not work. Is there a solution that works for 3 grouping columns? Also, the table has millions of rows.
Assuming a data frame with the following format:
d = {'product': ['a', 'b', 'c', 'a', 'b'],
     'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
Product c is no longer present on the date 2020-6-7, but I want to be able to calculate things like the percent change or difference in the amount of each product.
For example: df['diff'] = df.groupby('product')['amount'].diff()
But in order for this to work and show, for example, that the difference for c is -3 and -100%, c would need to be present on the next date with the amount set to 0.
This is the result I am looking for:
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
5 c 0 2020-6-7
Please note this is just a snipped data frame, in reality there might be many date periods, I am only looking to fill in the product and amount in the first date after it has been removed, not all dates after.
What is the best way to go about this?
Let us try pivot, then unstack:
out = df.pivot(index='product', columns='date', values='amount').fillna(0).unstack().reset_index(name='amount')
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
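With the gaps filled in, the per-product difference and percent change the question asks about can be computed on the completed frame. A minimal sketch, reusing the out frame from above and assuming the rows sort correctly by the string dates (converting date to datetime first is safer):
out = out.sort_values(['product', 'date'])
out['diff'] = out.groupby('product')['amount'].diff()                    # e.g. -3.0 for product c
out['pct_change'] = out.groupby('product')['amount'].pct_change() * 100  # e.g. -100.0 for product c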
You could use the complete function from pyjanitor to explicitly expose the missing values and combine with fillna to fill the missing values with 0:
# pip install pyjanitor
# import janitor
df.complete(['date', 'product']).fillna(0)
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
Another way is to create a cartesian product of your products & dates, then join that to your main dataframe to get the missing values.
#df['date'] = pd.to_datetime(df['date'])
#ensure you have a proper datetime object.
s = pd.merge(df[['product']].drop_duplicates().assign(ky=-1),
             df[['date']].drop_duplicates().assign(ky=-1),
             on=['ky']).drop('ky', axis=1)
df1 = pd.merge(df, s,
               on=['product', 'date'],
               how='outer').fillna(0)
print(df1)
product amount date
0 a 1.0 2020-06-06
1 b 2.0 2020-06-06
2 c 3.0 2020-06-06
3 a 5.0 2020-06-07
4 b 2.0 2020-06-07
5 c 0.0 2020-06-07
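For the follow-up about three grouping columns (store, date & product), the same idea extends by reindexing against the full cartesian product of the unique key values. This is only a sketch: the store column and its values are assumed here, and like the answers above it fills every missing combination rather than only the first date after a product disappears:
import pandas as pd

# hypothetical frame with a 'store' column in addition to product and date
df3 = pd.DataFrame({'store':   ['s1', 's1', 's1', 's1'],
                    'product': ['a', 'c', 'a', 'b'],
                    'date':    ['2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7'],
                    'amount':  [1, 3, 5, 2]})

# full index of every store/date/product combination
full_index = pd.MultiIndex.from_product(
    [df3['store'].unique(), df3['date'].unique(), df3['product'].unique()],
    names=['store', 'date', 'product'])

out3 = (df3.set_index(['store', 'date', 'product'])
           .reindex(full_index, fill_value=0)
           .reset_index())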

How to combine duplicate rows in python pandas

I have a data frame similar to the one listed below. For some reason, each team is listed twice, one listing corresponding to each column.
import pandas as pd
import numpy as np
d = {'Team': ['1', '2', '3', '1', '2', '3'],
     'Points for': [5, 10, 15, np.nan, np.nan, np.nan],
     'Points against': [np.nan, np.nan, np.nan, 3, 6, 9]}
df = pd.DataFrame(data=d)
Team Points for Points against
0 1 5 Nan
1 2 10 Nan
2 3 15 Nan
3 1 Nan 3
4 2 Nan 6
5 3 Nan 9
How can I just combine rows of duplicate team names so that there are no missing values? This is what I would like:
Team Points for Points against
0 1 5 3
1 2 10 6
2 3 15 9
I have been trying to figure it out with pandas, but can't seem to get it. Thanks!
I made changes to your code, replacing string 'Nan' with numpy's nan.
One solution is to melt the data, drop the null entries, and pivot back to wide from long:
df = (df
      .melt('Team')
      .dropna()
      .pivot(index='Team', columns='variable', values='value')
      .reset_index()
      .rename_axis(None, axis='columns')
      .astype(int)
      )
df
Team Points against Points for
0 1 3 5
1 2 6 10
2 3 9 15
One way is using groupby plus first():
df = df.replace("Nan", np.nan)
new_df = df.groupby("Team").first()
print(new_df)
Output:
Points for Points against
Team
1 5.0 3.0
2 10.0 6.0
3 15.0 9.0
You need to groupby the unique identifiers. If there is also a game ID or date or something like that, you might need to group on that as well.
df.groupby('Team').agg({'Points for': 'max', 'Points against': 'max'})
pd.pivot_table(df, values = ['Points for','Points against'],index=['Team'], aggfunc=np.sum)[['Points for','Points against']]
Output
Points for Points against
Team
1 5.0 3.0
2 10.0 6.0
3 15.0 9.0
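If you want Team back as a regular column with integer dtypes, as in the desired output, a small variation on the groupby answer (a sketch, not the only way) is:
import numpy as np
import pandas as pd

d = {'Team': ['1', '2', '3', '1', '2', '3'],
     'Points for': [5, 10, 15, np.nan, np.nan, np.nan],
     'Points against': [np.nan, np.nan, np.nan, 3, 6, 9]}
df = pd.DataFrame(data=d)

# first() takes the first non-null value per column within each team
new_df = (df.groupby('Team', as_index=False)
            .first()
            .astype({'Points for': int, 'Points against': int}))
print(new_df)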

How to find which columns contain any NaN value in Pandas dataframe

Given a pandas dataframe containing possible NaN values scattered here and there:
Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?
UPDATE: using Pandas 0.22.0
Newer Pandas versions have new methods 'DataFrame.isna()' and 'DataFrame.notna()'
In [71]: df
Out[71]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [72]: df.isna().any()
Out[72]:
a True
b True
c False
dtype: bool
as list of columns:
In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']
to select those columns (containing at least one NaN value):
In [73]: df.loc[:, df.isna().any()]
Out[73]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
OLD answer:
Try to use isnull():
In [97]: df
Out[97]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [98]: pd.isnull(df).sum() > 0
Out[98]:
a True
b True
c False
dtype: bool
or, as #root proposed, a clearer version:
In [5]: df.isnull().any()
Out[5]:
a True
b True
c False
dtype: bool
In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']
to select a subset - all columns containing at least one NaN value:
In [31]: df.loc[:, df.isnull().any()]
Out[31]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
You can use df.isnull().sum(). It shows all columns and the total NaNs of each feature.
I had a problem where I had too many columns to visually inspect on the screen, so a short list comprehension that filters and returns the offending columns is:
nan_cols = [i for i in df.columns if df[i].isnull().any()]
if that's helpful to anyone
Adding to that, if you want to filter out columns having more NaN values than a threshold, say 85%, then use:
nan_cols85 = [i for i in df.columns if df[i].isnull().sum() > 0.85 * len(df)]
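To actually drop those columns you can pass that list to drop, or, roughly equivalently, use dropna with a thresh (a sketch; the 85% cutoff is just the example threshold from above):
# drop the columns collected above
df = df.drop(columns=nan_cols85)

# or keep only columns that have at least 15% non-null values
df = df.dropna(axis=1, thresh=int(0.15 * len(df)))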
This worked for me:
1. To get the columns having at least 1 null value (column names):
data.columns[data.isnull().any()]
2. To get those columns together with their null counts:
data[data.columns[data.isnull().any()]].isnull().sum()
3. (Optional) To get the null counts as a percentage:
data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
In datasets with a large number of columns, it's even better to see how many columns contain null values and how many don't.
print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))
print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))
print("Total no. of columns in the dataframe")
print(len(df.columns))
For example, my dataframe contained 82 columns, of which 19 contained at least one null value.
Further, you can also automatically remove columns and rows depending on which has more null values.
Here is the code which does this:
df = df.drop(df.columns[df.isna().sum() > len(df.columns)], axis=1)
df = df.dropna(axis=0).reset_index(drop=True)
Note: the above code removes all of your null values. If you want to keep the null values, handle them before this step.
df.columns[df.isnull().any()].tolist()
It will return the names of the columns that contain null values.
I know this is a very well-answered question but I wanted to add a slight adjustment. This answer only returns columns containing nulls, and also still shows the count of the nulls.
As 1-liner:
pd.isnull(df).sum()[pd.isnull(df).sum() > 0]
Description
Count nulls in each column
null_count_ser = pd.isnull(df).sum()
True|False series describing if that column had nulls
is_null_ser = null_count_ser > 0
Use the T|F series to filter out those without
null_count_ser[is_null_ser]
Example Output
name 5
phone 187
age 644
I use these three lines of code to print out the column names which contain at least one null value:
for column in dataframe:
    if dataframe[column].isnull().any():
        print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))
This is one of the methods:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan], 'c': [np.nan, 2, np.nan], 'd': [np.nan, np.nan, np.nan]})
print(pd.isnull(df).sum())
Both of these should work:
df.isnull().sum()
df.isna().sum()
The DataFrame methods isna() and isnull() are completely identical.
Note: an empty string '' is not considered NA, so isna() returns False for it.
df.isna() returns True for NaN values and False for the rest. So, doing:
df.isna().any()
will return True for any column having a NaN and False for the rest.
To see just the columns containing NaNs and just the rows containing NaNs:
isnulldf = df.isnull()
columns_containing_nulls = isnulldf.columns[isnulldf.any()]
rows_containing_nulls = df[isnulldf[columns_containing_nulls].any(axis='columns')].index
only_nulls_df = df[columns_containing_nulls].loc[rows_containing_nulls]
print(only_nulls_df)
features_with_na = [feature for feature in dataframe.columns if dataframe[feature].isnull().sum() > 0]
for feature in features_with_na:
    print(feature, np.round(dataframe[feature].isnull().mean() * 100, 4), '% missing values')
print(features_with_na)
This will print the percentage of missing values for every column in the dataframe that has any.
The following works if you want to find the columns containing NaN values and get a list of the column names.
na_names = df.isnull().any()
list(na_names.where(na_names == True).dropna().index)
If you want to find columns whose values are all NaNs, you can replace any with all.
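For example, a minimal sketch of that all-NaN variant, reusing the df from above:
all_na_names = df.isnull().all()
list(all_na_names.where(all_na_names == True).dropna().index)
# or, more directly
df.columns[df.isnull().all()].tolist()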
