Calculate percentage of a value's occurrence - python

I am using this dataframe:
Car make | Driver's Gender
Ford | m
GMC | m
GMC | f
Ferrari | f
I would like to calculate the percentage of each make's male drivers.
Car make | Male drivers
Ford | 100
GMC | 50
Ferrari | 0
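For reference, the input frame above can be rebuilt with a minimal sketch (column names assumed exactly as shown):
import pandas as pd

df = pd.DataFrame({'Car make': ['Ford', 'GMC', 'GMC', 'Ferrari'],
                   "Driver's Gender": ['m', 'm', 'f', 'f']})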

Compare the second column to 'm' and then aggregate the mean:
df1 = (df["Driver's Gender"].eq('m')
         .groupby(df['Car make'], sort=False)
         .mean()
         .mul(100)
         .reset_index(name='Male drivers'))
print (df1)
Car make Male drivers
0 Ford 100.0
1 GMC 50.0
2 Ferrari 0.0
Another idea with crosstab and normalize parameter:
df2 = pd.crosstab(df['Car make'], df["Driver's Gender"], normalize=0).mul(100)
print (df2)
Driver's Gender f m
Car make
Ferrari 100.0 0.0
Ford 0.0 100.0
GMC 50.0 50.0
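If only the male-driver percentage is wanted in the exact shape of the question, the 'm' column of the crosstab can be pulled out and renamed (an optional extra step):
df3 = df2['m'].rename('Male drivers').reset_index()
print (df3)
  Car make  Male drivers
0  Ferrari           0.0
1     Ford         100.0
2      GMC          50.0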

Here are a few approaches:
Quick and dirty: convert "m" to 100 and "f" to 0 and take the mean
df["Male drivers"] = df["Driver's Gender"].apply(lambda x: 100 if x=="m" else 0)
male_freq = df.groupby("Car make").mean(numeric_only=True)
Using groupby and a manual frequency calculation
male_freq = df.groupby("Car make").agg(lambda x: 100*sum(x == "m") / len(x))
Using groupby and value_counts
def get_male_frequency(series):
    val_counts = series.value_counts(normalize=True)
    return 100 * val_counts.get("m", 0)

male_freq = df.groupby("Car make").agg(get_male_frequency)
Or a more general version of the same:
def get_frequency(value_of_interest):
    def _get_frequency(series):
        val_counts = series.value_counts(normalize=True)
        return 100 * val_counts.get(value_of_interest, 0)
    return _get_frequency

x = df.groupby("Car make").agg(get_frequency("m"))
They all output the following:
Driver's Gender
Car make
Ferrari 0.0
Ford 100.0
GMC 50.0
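To match the column naming in the question exactly, an optional rename/reset_index step can be tacked onto any of the results above (assuming the male_freq variable from those snippets):
male_freq = (male_freq.reset_index()
                      .rename(columns={"Driver's Gender": 'Male drivers'}))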

How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?

I need to apply a function that splits multiple numbers from the fields of a dataframe.
In this dataframe there are all the kids' measurements needed for a school: Name, Height, Weight, Unique Code, and their dream career.
The name consists only of alphabetic characters, but some kids might have both a first and a middle name (e.g. Vivien Ester).
The height is known to be >= 100 cm for every child.
The weight is known to be < 70 kg for every child.
The unique code can be any number, but it is always associated with the letters 'AX'. The 'AX' may be attached to the number (e.g. 7771AX) or separated from it by a space (e.g. 100 AX).
Every kid has their dream career.
These fields can appear in any order, but they always follow the rules above. However, for some kids some measurements may be missing (e.g. the height, the unique code, both, or all of them).
So the dataframe is this:
data = { 'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor', 'Fashion Designer', 'Teacher', 'Architect'],
'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1', 'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20', 'Oliver 40.5', 'Julien 35.1 678 AX 111.1', 'Bee 20.0 100.80 88AX']
}
df = pd.DataFrame (data, columns = ['Dream Career','Measurements'])
And it looks like this:
Dream Career Measurements
0 Scientist Rachel 24.3 100.25 100 AX
1 Astronaut 100.5 Tuuli 30.1
2 Software Engineer Michael 28.0 7771AX 113.75
3 Doctor Vivien Ester 40AX 115.20
4 Fashion Designer Oliver 40.5
5 Teacher Julien 35.1 678 AX 111.1
6 Architect Bee 20.0 100.80 88AX
I am trying to split all of these measurements into different columns, based on the rules above.
So the final dataframe should look like this:
Dream Career Names Weight Height Unique Code
0 Scientist Rachel 24.3 100.25 100AX
1 Astronaut Tuuli 30.1 100.50 NaN
2 Software Engineer Michael 28.0 113.75 7771AX
3 Doctor Vivien Ester NaN 115.20 40AX
4 Fashion Designer Oliver 40.5 NaN NaN
5 Teacher Julien 35.1 111.10 678AX
6 Architect Bee 20.0 100.80 88AX
I tried this code and it works very well, but only on single strings. I need to do this within the dataframe and still keep every kid's associated dream career (so the order is not lost).
import re
import numpy as np

num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'

def get_weight_height(s):
    nums = re.findall(num_rx, s)
    height = np.nan
    weight = np.nan
    if len(nums) == 0:
        height = np.nan
        weight = np.nan
    elif len(nums) == 1:
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = np.nan
        else:
            weight = nums[0]
            height = np.nan
    elif len(nums) == 2:
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = nums[1]
        else:
            height = nums[1]
            weight = nums[0]
    return height, weight

class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def extract_measurements(string, substring_name):
    height = np.nan
    weight = np.nan
    unique_code = np.nan
    name = ''
    if hasNumbers(string):
        num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
        nums = re.findall(num_rx, string)
        if substring_name in string:
            # pull out the number that is followed by the class code, e.g. '100 AX'
            special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
            if special_match:
                unique_code = special_match.group()
                string = string.replace(unique_code, '')
                unique_code = unique_code + substring_name
            if 2 <= len(nums) <= 3:
                height, weight = get_weight_height(string)
        else:
            height, weight = get_weight_height(string)
        name = " ".join(re.findall("[a-zA-Z]+", string))
        name = name.replace(substring_name, '')
    return format(float(height), '.2f'), float(weight), unique_code, name
And I apply it like this:
string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)
The results are very good.
I tried to apply the function to the dataframe, but I don't know how. I tried the following, inspired by this and this and this... but they are all different from my problem:
df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small'])
I cannot figure out how to apply it to my dataframe. Please help me.
I am at the very beginning, so I highly appreciate any help!
Use apply over rows (axis=1) with result_type='expand'. Then rename the columns and concat to the original df:
pd.concat([df, (df.apply(lambda row: extract_measurements(row['Measurements'], class_code['Small']), axis=1, result_type='expand')
                  .rename(columns={0: 'height', 1: 'weight', 2: 'unique_code', 3: 'name'})
                )], axis=1)
output:
Dream Career Measurements height weight unique_code name
-- ----------------- -------------------------- -------- -------- ------------- ------------
0 Scientist Rachel 24.3 100.25 100 AX 100 100 100AX Rachel
1 Astronaut 100.5 Tuuli 30.1 100 100 nan Tuuli
2 Software Engineer Michael 28.0 7771AX 113.75 100 100 7771AX Michael
3 Doctor Vivien Ester 40AX 115.20 100 100 40AX Vivien Ester
4 Fashion Designer Oliver 40.5 100 100 nan Oliver
5 Teacher Julien 35.1 678 AX 111.1 100 100 678AX Julien
6 Architect Bee 20.0 100.80 88AX 100 100 88AX Bee
(note: I stubbed the get_weight_height(string) function to always return 100, 100, because your code did not include it)
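If that concatenated frame is assigned to a variable, say out (a hypothetical name), the raw Measurements column can be dropped and the columns ordered as in the question:
out = out.drop(columns='Measurements')[['Dream Career', 'name', 'weight', 'height', 'unique_code']]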
@piterbarg's answer seems efficient given the original functions, but the functions seem verbose to me. I'm sure there's a simpler solution than what I'm doing, but what I have below replaces the functions in the OP with, I think, the same results.
First changing the column names to snake case for ease:
df = pd.DataFrame({
'dream_career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
'Fashion Designer', 'Teacher', 'Architect'],
'measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
'Bee 20.0 100.80 88AX']
})
First, the strings in .measurements are turned into lists. From here on, list comprehensions will be applied to each list to filter values.
df.measurements = df.measurements.str.split()
0 [Rachel, 24.3, 100.25, 100, AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678, AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
The second step is filtering out the 'AX' from .measurements and appending 'AX' to all integers. This assumes this example is totally reproducible and all the height/weight measurements are floats, but a different differentiator could be used if this isn't the case.
df.measurements = df.measurements.apply(
    lambda val_list: [val for val in val_list if val != 'AX']
).apply(
    lambda val_list: [str(val) + 'AX' if val.isnumeric() else val
                      for val in val_list]
)
0 [Rachel, 24.3, 100.25, 100AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
.name and .unique_code are pretty easy to grab. With .unique_code I had to apply a second lambda function to insert NaNs. If there are missing values for .name in the original df the same thing will need to be done there. For cases of multiple names, these are joined together separated with a space.
df['name'] = df.measurements.apply(
    lambda val_list: ' '.join([val for val in val_list if val.isalpha()])
)
df['unique_code'] = df.measurements.apply(
    lambda val_list: [val for val in val_list if 'AX' in val]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
For height and weight I needed to create a column of numerics first and work off that. In cases where there are missing values I'm having to come back around to deal with those.
import re
df['numerics'] = df.measurements.apply(
    lambda val_list: [float(val) for val in val_list
                      if not re.search('[a-zA-Z]', val)]
)
df['height'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val >= 100]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
df['weight'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val < 70]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
Finally, .measurements and .numerics are dropped, and the df should be ready to go.
df = df.drop(columns=['measurements', 'numerics'])
dream_career name unique_code height weight
0 Scientist Rachel 100AX 100.25 24.3
1 Astronaut Tuuli NaN 100.50 30.1
2 Software Engineer Michael 7771AX 113.75 28.0
3 Doctor Vivien Ester 40AX 115.20 NaN
4 Fashion Designer Oliver NaN NaN 40.5
5 Teacher Julien 678AX 111.10 35.1
6 Architect Bee 88AX 100.80 20.0

Pandas - Groupby and aggregate over multiple columns

I am trying to aggregate values in a groupby over multiple columns. I come from the R/dplyr world and what I want is usually achievable in a single line using group_by/summarize. I am trying to find an equivalently elegant way of achieving this using pandas.
Consider the below Input Dataset. I would like to aggregate by state and calculate the column v1 as v1 = sum(n1)/sum(d1) by state.
The r-code for this using dplyr is as follows:
input %>% group_by(state) %>%
summarise(v1=sum(n1)/sum(d1),
v2=sum(n2)/sum(d2))
Is there an elegant way of doing this in Python? I found a slightly verbose way of getting what I want in a Stack Overflow answer here.
Copying over modified python-code from the link
In [14]: s = mn.groupby('state', as_index=False).sum()
In [15]: s['v1'] = s['n1'] / s['d1']
In [16]: s['v2'] = s['n2'] / s['d2']
In [17]: s[['state', 'v1', 'v2']]
INPUT DATASET
state n1 n2 d1 d2
CA 100 1000 1 2
FL 200 2000 2 4
CA 300 3000 3 6
AL 400 4000 4 8
FL 500 5000 5 2
NY 600 6000 6 4
CA 700 7000 7 6
OUTPUT
state v1 v2
AL 100 500.000000
CA 100 785.714286
FL 100 1166.666667
NY 100 1500.000000
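For reference, the input above can be rebuilt as a dataframe with a short sketch (using the name mn that the snippets in this thread assume):
import pandas as pd

mn = pd.DataFrame({'state': ['CA', 'FL', 'CA', 'AL', 'FL', 'NY', 'CA'],
                   'n1': [100, 200, 300, 400, 500, 600, 700],
                   'n2': [1000, 2000, 3000, 4000, 5000, 6000, 7000],
                   'd1': [1, 2, 3, 4, 5, 6, 7],
                   'd2': [2, 4, 6, 8, 2, 4, 6]})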
One possible solution with DataFrame.assign and DataFrame.reindex:
df = (mn.groupby('state', as_index=False)
        .sum()
        .assign(v1 = lambda x: x['n1'] / x['d1'], v2 = lambda x: x['n2'] / x['d2'])
        .reindex(['state', 'v1', 'v2'], axis=1))
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
And another with GroupBy.apply and custom lambda function:
df = (mn.groupby('state')
        .apply(lambda x: x[['n1','n2']].sum() / x[['d1','d2']].sum().values)
        .reset_index()
        .rename(columns={'n1':'v1', 'n2':'v2'})
      )
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
Another solution:
def func(x):
    u = x.sum()
    return pd.Series({'v1': u['n1'] / u['d1'],
                      'v2': u['n2'] / u['d2']})

df.groupby('state').apply(func)
Output:
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Here is the equivalent of what you did in R:
>>> from datar.all import f, tribble, group_by, summarise, sum
>>>
>>> input = tribble(
... f.state, f.n1, f.n2, f.d1, f.d2,
... "CA", 100, 1000, 1, 2,
... "FL", 200, 2000, 2, 4,
... "CA", 300, 3000, 3, 6,
... "AL", 400, 4000, 4, 8,
... "FL", 500, 5000, 5, 2,
... "NY", 600, 6000, 6, 4,
... "CA", 700, 7000, 7, 6,
... )
>>>
>>> input >> group_by(f.state) >> \
... summarise(v1=sum(f.n1)/sum(f.d1),
... v2=sum(f.n2)/sum(f.d2))
state v1 v2
<object> <float64> <float64>
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
I am the author of the datar package.
Another option is with the pipe function, where the groupby object is reusable:
(df.groupby('state')
   .pipe(lambda df: pd.DataFrame({'v1': df.n1.sum() / df.d1.sum(),
                                  'v2': df.n2.sum() / df.d2.sum()})
   )
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Another option would be to convert the columns into a MultiIndex before grouping:
temp = df.set_index('state')
temp.columns = temp.columns.str.split(r'(\d)', expand=True).droplevel(-1)
(temp.groupby('state')
     .sum()
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Yet another way, still with the MultiIndex option, while avoiding a groupby:
# keep the index, necessary for unstacking later
temp = df.set_index('state', append=True)
# convert the columns to a MultiIndex
temp.columns = temp.columns.map(tuple)
# this works because the index is unique
(temp.unstack('state')
     .sum()
     .unstack([0,1])
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000

Python: how to groupby a given percentile?

I have a dataframe df
df
User City Job Age
0 A x Unemployed 33
1 B x Student 18
2 C x Unemployed 27
3 D y Data Scientist 28
4 E y Unemployed 45
5 F y Student 18
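For reproducibility, this frame can be built with a small sketch (columns assumed exactly as shown):
import pandas as pd

df = pd.DataFrame({'User': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'City': ['x', 'x', 'x', 'y', 'y', 'y'],
                   'Job': ['Unemployed', 'Student', 'Unemployed',
                           'Data Scientist', 'Unemployed', 'Student'],
                   'Age': [33, 18, 27, 28, 45, 18]})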
I want to groupby the City and do some stats. If I have to compute the mean, I can do the following:
tmp = df.groupby(['City']).mean()
I would like to do same by a specific quantile. Is it possible?
def q1(x):
    return x.quantile(0.25)

def q2(x):
    return x.quantile(0.75)

fc = {'Age': [q1, q2]}
temp = df.groupby('City').agg(fc)
temp
Age
q1 q2
City
x 22.5 30.0
y 23.0 36.5
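A possible variant of the same idea uses named aggregation (available in pandas 0.25+), which avoids defining the separate q1/q2 helpers and returns the quantiles under flat column names:
temp = df.groupby('City')['Age'].agg(q1=lambda x: x.quantile(0.25),
                                     q2=lambda x: x.quantile(0.75))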
I believe you need DataFrameGroupBy.quantile:
tmp = df.groupby('City')['Age'].quantile(0.4)
print (tmp)
City
x 25.2
y 26.0
Name: Age, dtype: float64
tmp = df.groupby('City')['Age'].quantile([0.25, 0.75]).unstack().add_prefix('q')
print (tmp)
q0.25 q0.75
City
x 22.5 30.0
y 23.0 36.5
I am using describe
df.groupby('City')['Age'].describe()[['25%','75%']]
Out[542]:
25% 75%
City
x 22.5 30.0
y 23.0 36.5
You can use:
df.groupby('City')['Age'].apply(lambda x: np.percentile(x,[25,75])).reset_index().rename(columns={'Age':'25%, 75%'})
City 25%, 75%
0 x [22.5, 30.0]
1 y [23.0, 36.5]
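If separate columns are preferred over a list in a single cell, the percentiles can be expanded with pd.Series and unstack (a small optional tweak of the snippet above, assuming numpy is imported as np):
df.groupby('City')['Age'].apply(
    lambda x: pd.Series(np.percentile(x, [25, 75]), index=['25%', '75%'])
).unstack()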

Pythonic / Panda Way to Create Function to Groupby

I am fairly new to programming & am looking for a more pythonic way to implement some code. Here is dummy data:
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',
freq='D'), 10000)})
I have lots of transactional data like this that I perform various groupbys on. My current solution is to make a master groupby like this:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
From there, I perform various groupbys using .groupby(level=) function to aggregate the information in the way I'm looking for. I usually make a summary at each level. In addition, I create sub-totals at each level using some variation of the below code.
y = master.groupby(level=[0,1,2]).sum()
y.index = pd.MultiIndex.from_arrays([
y.index.get_level_values(0),
y.index.get_level_values(1),
y.index.get_level_values(2) + ' Total',
len(y.index)*['']
])
y1 = master.groupby(level=[0,1]).sum()
y1.index = pd.MultiIndex.from_arrays([
y1.index.get_level_values(0),
y1.index.get_level_values(1)+ ' Total',
len(y1.index)*[''],
len(y1.index)*['']
])
y2 = master.groupby(level=[0]).sum()
y2.index = pd.MultiIndex.from_arrays([
y2.index.get_level_values(0)+ ' Total',
len(y2.index)*[''],
len(y2.index)*[''],
len(y2.index)*['']
])
pd.concat([master,y,y1,y2]).sort_index()\
    .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])\
    .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3])- 1)\
    .dropna(how='all')
This is just an example - I may perform the same exercise, but perform the groupby in a different order. For example - next I may want to group by 'Category', 'Product', then 'Customer', so I'd have to do:
master.groupby(level=[1,3,0]).sum()
Then I will have to repeat the whole exercise for sub-totals like above. I also frequently change the time period - could be year-ending a specific month, could be year to date, could be by quarter, etc.
From what I've learned so far in programming (which is minimal, clearly!), you should look to write a function any time you repeat code. Obviously I am repeating code over & over again in this example.
Is there a way to construct a function where you can provide the levels to Groupby, along with the time frame, all while creating a function for sub-totaling each level as well?
Thanks in advance for any guidance on this. It is very much appreciated.
For a DRY-er solution, consider generalizing your current method into a defined function that filters the original data frame by date range and runs the aggregations, receiving the groupby levels and date range (the latter optional) as parameters:
Method
def multiple_agg(mylevels, start_date='2016-01-01', end_date='2018-12-31'):
    filter_df = df[df['Date'].between(start_date, end_date)]
    master = (filter_df.groupby(['Customer', 'Category', 'Sub-Category', 'Product',
                                 pd.Grouper(key='Date', freq='A')])['Units_Sold']
                       .sum()
                       .unstack()
              )
    y = master.groupby(level=mylevels[:-1]).sum()
    y.index = pd.MultiIndex.from_arrays([
        y.index.get_level_values(0),
        y.index.get_level_values(1),
        y.index.get_level_values(2) + ' Total',
        len(y.index)*['']
    ])
    y1 = master.groupby(level=mylevels[0:2]).sum()
    y1.index = pd.MultiIndex.from_arrays([
        y1.index.get_level_values(0),
        y1.index.get_level_values(1) + ' Total',
        len(y1.index)*[''],
        len(y1.index)*['']
    ])
    y2 = master.groupby(level=mylevels[0]).sum()
    y2.index = pd.MultiIndex.from_arrays([
        y2.index.get_level_values(0) + ' Total',
        len(y2.index)*[''],
        len(y2.index)*[''],
        len(y2.index)*['']
    ])
    final_df = (pd.concat([master, y, y1, y2])
                  .sort_index()
                  .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])
                  .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3]) - 1)
                  .dropna(how='all')
                  .reorder_levels(mylevels)
               )
    return final_df
Aggregation Runs (of different levels and date ranges)
agg_df1 = multiple_agg([0,1,2,3])
agg_df2 = multiple_agg([1,3,0,2], '2016-01-01', '2017-12-31')
agg_df3 = multiple_agg([2,3,1,0], start_date='2017-01-01', end_date='2018-12-31')
Testing (final_df being OP's pd.concat() output)
# EQUALITY TESTING OF FIRST 10 ROWS
print(final_df.head(10).eq(agg_df1.head(10)))
# Date 2016-12-31 00:00:00 2017-12-31 00:00:00 2018-12-31 00:00:00 Diff Diff_Perc
# Customer Category Sub-Category Product
# 45mhn4PU1O Group A X Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# X Total True True True True True
# Y Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# Y Total True True True True True
# Z Product 1 True True True True True
# Product 2 True True True True True
I think you can do it using sum with the level parameter:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
s1 = master.sum(level=[0,1,2]).assign(Product='Total').set_index('Product',append=True)
s2 = master.sum(level=[0,1])
# Wanted to use assign method but because of the hyphen in the column name you can't.
# Also use the Z in front for sorting purposes
s2['Sub-Category'] = 'ZTotal'
s2['Product'] = ''
s2 = s2.set_index(['Sub-Category','Product'], append=True)
s3 = master.sum(level=[0])
s3['Category'] = 'Total'
s3['Sub-Category'] = ''
s3['Product'] = ''
s3 = s3.set_index(['Category','Sub-Category','Product'], append=True)
master_new = pd.concat([master,s1,s2,s3]).sort_index()
master_new
Output:
Date 2016-12-31 2017-12-31 2018-12-31
Customer Category Sub-Category Product
30XWmt1jm0 Group A X Product 1 651.0 341.0 453.0
Product 2 267.0 445.0 117.0
Product 3 186.0 280.0 352.0
Total 1104.0 1066.0 922.0
Y Product 1 426.0 417.0 670.0
Product 2 362.0 210.0 380.0
Product 3 232.0 290.0 430.0
Total 1020.0 917.0 1480.0
Z Product 1 196.0 212.0 703.0
Product 2 277.0 340.0 579.0
Product 3 416.0 392.0 259.0
Total 889.0 944.0 1541.0
ZTotal 3013.0 2927.0 3943.0
Group B X Product 1 356.0 230.0 407.0
Product 2 402.0 370.0 590.0
Product 3 262.0 381.0 377.0
Total 1020.0 981.0 1374.0
Y Product 1 575.0 314.0 643.0
Product 2 557.0 375.0 411.0
Product 3 344.0 246.0 280.0
Total 1476.0 935.0 1334.0
Z Product 1 278.0 152.0 392.0
Product 2 149.0 596.0 303.0
Product 3 234.0 505.0 521.0
Total 661.0 1253.0 1216.0
ZTotal 3157.0 3169.0 3924.0
Total 6170.0 6096.0 7867.0
3U2anYOD6o Group A X Product 1 214.0 443.0 195.0
Product 2 170.0 220.0 423.0
Product 3 111.0 469.0 369.0
... ... ... ...
somc22Y2Hi Group B Z Total 906.0 1063.0 680.0
ZTotal 3070.0 3751.0 2736.0
Total 6435.0 7187.0 6474.0
zRZq6MSKuS Group A X Product 1 421.0 182.0 387.0
Product 2 359.0 287.0 331.0
Product 3 232.0 394.0 279.0
Total 1012.0 863.0 997.0
Y Product 1 245.0 366.0 111.0
Product 2 377.0 148.0 239.0
Product 3 372.0 219.0 310.0
Total 994.0 733.0 660.0
Z Product 1 280.0 363.0 354.0
Product 2 384.0 604.0 178.0
Product 3 219.0 462.0 366.0
Total 883.0 1429.0 898.0
ZTotal 2889.0 3025.0 2555.0
Group B X Product 1 466.0 413.0 187.0
Product 2 502.0 370.0 368.0
Product 3 745.0 480.0 318.0
Total 1713.0 1263.0 873.0
Y Product 1 218.0 226.0 385.0
Product 2 123.0 382.0 570.0
Product 3 173.0 572.0 327.0
Total 514.0 1180.0 1282.0
Z Product 1 480.0 317.0 604.0
Product 2 256.0 215.0 572.0
Product 3 463.0 50.0 349.0
Total 1199.0 582.0 1525.0
ZTotal 3426.0 3025.0 3680.0
Total 6315.0 6050.0 6235.0
[675 rows x 3 columns]
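As a rough generalization of the idea above (a sketch, not part of either answer; the helper name is hypothetical), the subtotal construction can be wrapped in a loop over index-level prefixes so it works for any number of levels:
def with_subtotals(master):
    pieces = [master]
    n = master.index.nlevels
    for depth in range(1, n):
        s = master.groupby(level=list(range(depth))).sum()
        # rebuild a full-width index: keep the outer levels, mark the last kept
        # level as a 'Total' row, and pad the collapsed levels with blanks
        arrays = [s.index.get_level_values(i) for i in range(depth - 1)]
        arrays.append(s.index.get_level_values(depth - 1) + ' Total')
        arrays += [len(s) * [''] for _ in range(n - depth)]
        s.index = pd.MultiIndex.from_arrays(arrays)
        pieces.append(s)
    return pd.concat(pieces).sort_index()

master_with_totals = with_subtotals(master)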

Efficient pandas rolling aggregation over date range by group - Python 2.7 Windows - Pandas 0.19.2

I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually, I want to be able to add conditions, i.e. evaluating a 'type' field, but I'm not there just yet. I've written something that gets the job done, but I feel there could be a more direct way of getting to the desired result.
My pandas data frame currently looks like this, with the desired output shown in the last column 'rolling_sales_180'.
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
My current solution and environment are below. I've been modeling my solution on this R Q&A on Stack Overflow: Efficient way to perform running total in the last 365 day window.
import pandas as pd
import numpy as np
def trans_date_to_dist_matrix(date_col):  # used to create a distance matrix
    x = date_col.tolist()
    y = date_col.tolist()
    data = []
    for i in x:
        tmp = []
        for j in y:
            tmp.append(abs((i - j).days))
        data.append(tmp)
    del tmp
    return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)

def lower_tri(x_col, date_col, win):  # x_col = column user wants a rolling sum of, date_col = dates, win = time window
    dm = trans_date_to_dist_matrix(date_col=date_col)  # dm = distance matrix
    dm = dm.where(dm <= win)  # find all elements of the distance matrix that are less than window(time)
    lt = dm.where(np.tril(np.ones(dm.shape)).astype(np.bool))  # lt = lower tri of distance matrix so we get only future dates
    lt[lt >= 0.0] = 1.0  # cleans up our lower tri so that we can sum events that happen on the day we are evaluating
    lt = lt.fillna(0)  # replaces NaN with 0's for multiplication
    return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()

def flatten(x):
    try:
        n = [v for sl in x for v in sl]
        return [v for sl in n for v in sl]
    except:
        return [v for sl in x for v in sl]

data = [
    ['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
    ['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
    ['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]
list_of_vals = []
dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
    lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))
new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data
print dates_df
Your time and feedback are appreciated.
Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:
def get_rolling_amount(grp, freq):
    return grp.rolling(freq, on='date')['amount'].sum()

df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')
The resulting output:
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
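On more recent pandas versions, a roughly equivalent spelling (a sketch, not tested against 0.19.2) is to set the date as the index and use groupby().rolling() directly:
rolled = (df.set_index('date')
            .groupby('name')['amount']
            .rolling('180D')
            .sum()
            .reset_index(name='rolling_sales_180'))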
