Count rolling window unique values with duplicate dates pandas - python

If I have a pandas DataFrame like this
date
person_active
22/2
John
22/2
Marie
22/2
Mark
23/2
John
24/2
Mark
24/2
Marie
how do I count in a rolling window based on time the unique values in person_active, for example: 2 days rolling window, so it ends up like this:
date
person_active
people_active
22/2
John
3
22/2
Marie
3
22/2
Mark
3
23/2
John
3
24/2
Mark
3
24/2
Marie
3
The main issue here is that I have duplicate entries on date for each person so a simple df.rolling('2d',on='date').count() won't do the job.
EDIT: Please consider implementation in a big dataset and how the time to compute will scale, the solution needs to be ideally applicable in a real-world environment so if it takes too long to compute it's not that useful.

IIUC, try:
#convert to datetime if needed
df["date"] = pd.to_datetime(df["date"], format="%d/%m")
#convert string name to categorical codes for numerical aggegation
df["people"] = pd.Categorical(df["person_active"]).codes
#compute the rolling unique count
df["people_active"] = (df.rolling("2D", on="date")["people"]
.agg(lambda x: x.nunique())
.groupby(df["date"])
.transform("max")
)
#drop the unneccessary column
df = df.drop("people", axis=1)
>>> df
date person_active people_active
0 1900-02-22 John 3.0
1 1900-02-22 Marie 3.0
2 1900-02-22 Mark 3.0
3 1900-02-23 John 3.0
4 1900-02-24 Mark 3.0
5 1900-02-24 Marie 3.0

Group by date, count unique values and then you're good to go:
df.groupby('date').nunique().rolling('2d').sum()

Related

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client
Salesperson
Amount
Salesperson
Amount2
1
John
1000
Bob
400
2
John
1000
Richard
200
3
Bob
0
John
300
4
Richard
500
Tom
500
And I just need to create some sort of "sumif" statement (the one from excel) that will add the amount each salesperson is due. I don't know how to iterate over each row, but I want to have it so that it adds the values in "Amount" and "Amount2" for each one of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person
Total Amount
John
2300
Bob
400
Richard
700
Tom
500
There can be multiple ways of solving this. One option is to use Pandas Concat to join required columns and use groupby
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'}), df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3':'Salesperson','Amount3':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson','Amount'], i='Client',j='num', suffix='\s?\d+').reset_index(drop = 1)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you required output
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index="Client",
names_to=".value",
names_pattern=r"([a-zA-Z]+).*",
)
.groupby("Salesperson", as_index = False)
.Amount
.sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only those parts of the column that match it as headers. The columns have a pattern (They start with a text - either Salesperson or Amount - and either have a number at the end ( or not). This pattern is captured in names_pattern. .value is paired with the regex in the brackets, those outside do not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.

Dynamic filtering/ masking in Pandas

I have a pandas data frame containing employee information like this:
df=pd.DataFrame({
'Id':[1,2,3,4],
'Name':['Joe','Henry','Sam','Max'],
'Salary':[70000,80000,60000,90000],
'ManagerId':[3,4,np.nan,np.nan]
})
Id Name Salary ManagerId
0 1 Joe 70000 3.0
1 2 Henry 80000 4.0
2 3 Sam 60000 NaN
3 4 Max 90000 NaN
What I need to do is to filter employees having their salary exceed his manager's (in this case Joe since his salary is larger than his manager, Sam).
0 1 Joe 70000 3.0
Because of the relation between Id and Manager Id, I think I can use loops as the last resort, but that seems to be really manual and looks ugly too. I wonder that if I can do this with masking. As a beginner, I can only do simple masking where the condition is static so far, like to filter employees that have salary exceeding 60000. But in this case, the condition for each employee is different from each other.
I have no idea what this technique is called so I just made up the title.
Thanks for any help.
Idea is match ManagerID by Salary by Id, so possible compare for greater and filter:
df = df[df['Salary'].gt(df['ManagerID'].map(df.set_index(['Id'])['Salary']))]
print (df)
Id Name Salary ManagerID
0 1 Joe 70000 3.0
Details:
print (df['ManagerID'].map(df.set_index(['Id'])['Salary']))
0 60000.0
1 90000.0
2 NaN
3 NaN
Name: ManagerID, dtype: float64

How do I use pandas to count the number of times a name and type occur together within a 60 period from the first instance?

My dataframe is this:
Date Name Type Description Number
2020-07-24 John Doe Type1 NaN NaN
2020-08-10 Jo Doehn Type1 NaN NaN
2020-08-15 John Doe Type1 NaN NaN
2020-09-10 John Doe Type2 NaN NaN
2020-11-24 John Doe Type1 NaN NaN
I want the Number column to have the instance number with the 60 day period. So for entry 1, the Number should just be 1 since it's the first instance - same with entry 2 since it's a different name. Entry 3 however, should have 2 in the Number column since it's the second instance of John Doe and Type 1 in the 60 day period starting 7/24 (the first instance date). Entry 4 would be 1 as well since the Type is different. Entry 5 would also be 1 since it's outside the 60 day period from 7/24. However, any entries after this with John Doe, Type 1 would have a new 60 day period starting 11/24.
Sorry, I know this is a pretty loaded question with a lot of aspects to it, but I'm trying to get up to speed on dataframes again and I'm not sure where to begin.
As a starting point, you could create a pivot table. (The assign statement just creates a temporary column of ones, to support counting.) In the example below, each row is a date, and each column is a (name, type) pair.
Then, use the resample function (to get one row for every calendar day), and the rolling function (to sum the numbers in the 60-day window).
x = (df.assign(temp = 1)
.pivot_table(index='date',
columns=['name', 'type'],
values='temp',
aggfunc='count',
fill_value=0)
)
x.resample('1d').count().rolling(60).sum()
Can you post sample data in text format (for copy/paste)?

Pandas Expanding Mean with Group By and before current row date

I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017','10'],
['John', '2/2/2017','15'],
['John', '2/2/2017','20'],
['John', '3/3/2017','30'],
['Sue', '1/1/2017','10'],
['Sue', '2/2/2017','15'],
['Sue', '3/2/2017','20'],
['Sue', '3/3/2017','7'],
['Sue', '4/4/2017','20']
],
columns=['Customer', 'Deposit_Date','DPD'])
. What is the best way to calculate the PreviousMean column in the screen shot below?
The column is the year to date average of DPD for that customer. I.e. Includes all DPDs up to but not including rows that match the current deposit date. If no previous records existed then it's null or 0.
Screenshot:
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
instead of grouping & expanding the mean, filter the dataframe on the conditions, and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all row in the dataframe:
df['PreviousMean'] = df.apply(
lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
Here's one way to exclude repeated days from mean calculation:
# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() == 1
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# apply pd.expanding_mean
df['CumMean'] = df.groupby(['Customer Name'])['DPD2'].apply(lambda x: pd.expanding_mean(x))
# drop helper series
df = df.drop('DPD2', 1)
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
Ok here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer & deposit date level containing a shifted mean. To calculate this mean you have to calculate the sum and the count first.
s=df.groupby(['Customer Name','Deposit_Date'],as_index=False)[['DPD']].agg(['count','sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
s['DPD_CumSum']=s.groupby(['Customer Name'])[['DPD sum']].cumsum()
s['DPD_CumCount']=s.groupby(['Customer Name'])[['DPD count']].cumsum()
s['DPD_CumMean']=s['DPD_CumSum']/ s['DPD_CumCount']
s['DPD_PrevMean']=s.groupby(['Customer Name'])['DPD_CumMean'].shift(1)
df=df.merge(s[['Customer Name','Deposit_Date','DPD_PrevMean']],how='left',on=['Customer Name','Deposit_Date'])

Is there an "ungroup by" operation opposite to .groupby in pandas?

Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': pd.np.mean})
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string in to pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
It turns out that pd.groupby() returns an object with the original data stored in obj. So ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
def name_join(list_names, concat='-'):
return concat.join(list_names)
print('create dataframe\n')
df = pandas.DataFrame({'name':['john', 'jason', 'jane', 'jack', 'james'], 'age':[1,36,32,26,30], 'family':[1,1,1,2,2]})
df.index.name='indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby, one way is to do DataFrame.groupby.filter(lambda x:True), this gets back to the original DataFrame.

Categories