Pandas - Groupby and aggregate over multiple columns - python

I am trying to aggregate values in a groupby over multiple columns. I come from the R/dplyr world and what I want is usually achievable in a single line using group_by/summarize. I am trying to find an equivalently elegant way of achieving this using pandas.
Consider the input dataset below. I would like to aggregate by state and calculate the columns v1 and v2 as v1 = sum(n1)/sum(d1) and v2 = sum(n2)/sum(d2) by state.
The R code for this using dplyr is as follows:
input %>% group_by(state) %>%
summarise(v1=sum(n1)/sum(d1),
v2=sum(n2)/sum(d2))
Is there an elegant way of doing this in Python? I found a slightly verbose way of getting what I want in a Stack Overflow answer here.
Copying over the modified Python code from the link:
In [14]: s = mn.groupby('state', as_index=False).sum()
In [15]: s['v1'] = s['n1'] / s['d1']
In [16]: s['v2'] = s['n2'] / s['d2']
In [17]: s[['state', 'v1', 'v2']]
INPUT DATASET
state n1 n2 d1 d2
CA 100 1000 1 2
FL 200 2000 2 4
CA 300 3000 3 6
AL 400 4000 4 8
FL 500 5000 5 2
NY 600 6000 6 4
CA 700 7000 7 6
OUTPUT
state v1 v2
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
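For reference, the snippets below operate on a DataFrame named mn built from the input dataset above; a minimal sketch to reproduce it:
import pandas as pd

# Reconstruction of the input dataset; the name mn matches the snippets below.
mn = pd.DataFrame({
    'state': ['CA', 'FL', 'CA', 'AL', 'FL', 'NY', 'CA'],
    'n1': [100, 200, 300, 400, 500, 600, 700],
    'n2': [1000, 2000, 3000, 4000, 5000, 6000, 7000],
    'd1': [1, 2, 3, 4, 5, 6, 7],
    'd2': [2, 4, 6, 8, 2, 4, 6],
})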

One possible solution with DataFrame.assign and DataFrame.reindex:
df = (mn.groupby('state', as_index=False)
        .sum()
        .assign(v1=lambda x: x['n1'] / x['d1'], v2=lambda x: x['n2'] / x['d2'])
        .reindex(['state', 'v1', 'v2'], axis=1))
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
And another with GroupBy.apply and a custom lambda function:
df = (mn.groupby('state')
        .apply(lambda x: x[['n1','n2']].sum() / x[['d1','d2']].sum().values)
        .reset_index()
        .rename(columns={'n1':'v1', 'n2':'v2'})
      )
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000

Another solution:
def func(x):
    u = x.sum()
    return pd.Series({'v1': u['n1'] / u['d1'],
                      'v2': u['n2'] / u['d2']})

df.groupby('state').apply(func)
Output:
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000

Here is the equivalent of what you did in R:
>>> from datar.all import f, tribble, group_by, summarise, sum
>>>
>>> input = tribble(
... f.state, f.n1, f.n2, f.d1, f.d2,
... "CA", 100, 1000, 1, 2,
... "FL", 200, 2000, 2, 4,
... "CA", 300, 3000, 3, 6,
... "AL", 400, 4000, 4, 8,
... "FL", 500, 5000, 5, 2,
... "NY", 600, 6000, 6, 4,
... "CA", 700, 7000, 7, 6,
... )
>>>
>>> input >> group_by(f.state) >> \
... summarise(v1=sum(f.n1)/sum(f.d1),
... v2=sum(f.n2)/sum(f.d2))
state v1 v2
<object> <float64> <float64>
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
I am the author of the datar package.

Another option is with the pipe function, where the groupby object is reusable:
(df.groupby('state')
   .pipe(lambda df: pd.DataFrame({'v1': df.n1.sum() / df.d1.sum(),
                                  'v2': df.n2.sum() / df.d2.sum()})
   )
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
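Since the groupby object is reusable, an equivalent spelling (a minimal sketch, assuming the same df as above) stores it in a variable and aggregates from it twice:
# Reuse one groupby object for both ratios instead of piping it.
g = df.groupby('state')
out = pd.DataFrame({'v1': g['n1'].sum() / g['d1'].sum(),
                    'v2': g['n2'].sum() / g['d2'].sum()})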
Another option would be to convert the columns into a MultiIndex before grouping:
temp = df.set_index('state')
temp.columns = temp.columns.str.split(r'(\d)', expand=True).droplevel(-1)
(temp.groupby('state')
     .sum()
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Yet another way, still with the MultiIndex option, while avoiding a groupby:
# keep the index, necessary for unstacking later
temp = df.set_index('state', append=True)
# convert the columns to a MultiIndex
temp.columns = temp.columns.map(tuple)
# this works because the index is unique
(temp.unstack('state')
     .sum()
     .unstack([0, 1])
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000

Related

Calculate percentage of a value's occurrence

I am using this dataframe:
Car make | Driver's Gender
Ford | m
GMC | m
GMC | f
Ferrari | f
I would like to calculate the percentage of each make's male drivers.
Car make | Male drivers
Ford | 100
GMC | 50
Ferrari | 0
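A minimal sketch of the example DataFrame the answers below assume (column names taken from the tables above):
import pandas as pd

df = pd.DataFrame({'Car make': ['Ford', 'GMC', 'GMC', 'Ferrari'],
                   "Driver's Gender": ['m', 'm', 'f', 'f']})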
Compare the second column to 'm' and then aggregate the mean:
df1 = (df["Driver's Gender"].eq('m')
.groupby(df['Car make'], sort=False)
.mean()
.mul(100)
.reset_index(name='Male drivers'))
print (df1)
Car make Male drivers
0 Ford 100.0
1 GMC 50.0
2 Ferrari 0.0
Another idea with crosstab and normalize parameter:
df2 = pd.crosstab(df['Car make'], df["Driver's Gender"], normalize=0).mul(100)
print (df2)
Driver's Gender f m
Car make
Ferrari 100.0 0.0
Ford 0.0 100.0
GMC 50.0 50.0
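To match the desired single-column output, keep only the 'm' column of the crosstab (a small follow-up, not part of the original answer):
# Select the male percentage and give it the requested column name.
out = df2['m'].rename('Male drivers').reset_index()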
Here are a few approaches:
Quick and dirty: convert "m" to 100 and "f" to 0, then take the mean
df["Male drivers"] = df["Driver's gender"].apply(lambda x: 100 if x=="m" else 0)
male_freq = df.groupby("Car make").mean(numeric_only=True)
Using groupby and a manual frequency calculation
male_freq = df.groupby("Car make").agg(lambda x: 100*sum(x == "m") / len(x))
Using groupby and value_counts
def get_male_frequency(series):
    val_counts = series.value_counts(normalize=True)
    return 100 * val_counts.get("m", 0)

male_freq = df.groupby("Car make").agg(get_male_frequency)
Or a more general version of the same:
def get_frequency(value_of_interest):
    def _get_frequency(series):
        val_counts = series.value_counts(normalize=True)
        return 100 * val_counts.get(value_of_interest, 0)
    return _get_frequency

x = df.groupby("Car make").agg(get_frequency("m"))
They all output the following:
Driver's Gender
Car make
Ferrari 0.0
Ford 100.0
GMC 50.0

Python : Remodeling the presentation data from a pandas Dataframe / group duplicates

Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the Name column and regroups them with their different accounts, and if there is a currency "Ccy", adds a new column next to it with the associated currencies.
So something like this:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start, so I simplified the problem to work step by step. I tried to regroup the duplicates by name with a list, however it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len):
    if df.iloc[i, 0] not in new_data:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in new_data)))
        new_data.append([df.iloc[i, 0], df.iloc[i, 1]])
    else:
        new_data[str(df.iloc[i, 0])].append(df.iloc[i, 1])
Then I thought it would be easier to use a dictionary, so I tried this loop, but there is an error, and maybe it is not the best way to reach the expected final result:
from collections import defaultdict
dico=defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len):
    if df.iloc[i, 0] not in dico:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in dico)))
        dico[str(df.iloc[i, 0])] = df.iloc[i, 1]
        print(dico)
    else:
        dico[df.iloc[i, 0]].append(df.iloc[i, 1])
Does anyone have an idea how to start, or how to write the code if it is simple?
Thank you.
Use GroupBy.cumcount for a counter, reshape with DataFrame.set_index and DataFrame.unstack, and finally flatten the column names:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print (df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying NumPy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
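An equivalent, non-mutating variant of the same idea (a sketch, assuming the same df) uses np.sort, which returns a row-sorted copy instead of sorting the array in place:
import numpy as np

# np.sort returns a new array sorted along axis=1; assign it back to the columns
df[["A", "B"]] = np.sort(df[["A", "B"]].values, axis=1)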
A final note. I just applied @AndyHayden's NumPy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is ... wow, what an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, compared to the pandas lambda solution also provided by @AndyHayden, which takes about 20 seconds to perform the sort. The dataset is 58,000+ rows; the NumPy solution returns the sort instantly.
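A rough way to reproduce that comparison (a sketch only; it assumes the flights DataFrame from the question, and timings will vary by machine):
import time

# NumPy-based row sort
start = time.time()
arr = flights[["ORG_AIR", "DEST_AIR"]].values
arr.sort(axis=1)
numpy_seconds = time.time() - start

# apply/lambda-based row sort
start = time.time()
_ = flights[["ORG_AIR", "DEST_AIR"]].apply(
    lambda x: pd.Series(sorted(x), index=["ORG_AIR", "DEST_AIR"]), axis=1)
apply_seconds = time.time() - start

print(numpy_seconds, apply_seconds)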

Standardize Unit of measurements in dataframe

Consider I have the following dataframe
d = {'quantity': [100, 910, 500, 50, 0.5, 22.5, 1300, 600, 20], 'uom': ['KG', 'GM', 'KG', 'KG', 'GM', 'MT', 'GM', 'GM', 'MT']}
df = pd.DataFrame(data=d)
df
My dataframe is like this:
quantity uom
0 100.0 KG
1 910.0 GM
2 500.0 KG
3 50.0 KG
4 0.5 GM
5 22.5 MT
6 1300.0 GM
7 600.0 GM
8 20.0 MT
Now I want to use a single UOM for all the data. For that I have the following code:
listy = []
listy.append(list(df['quantity']))
listy.append(list(df['uom']))
for index, x in enumerate(listy[0]):
    if listy[1][index] == 'MT':
        listy[0][index] = '{:1.4f}'.format(x * 1000)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'LBS':
        listy[0][index] = '{:1.4f}'.format(x * 0.453592)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'GM':
        listy[0][index] = '{:1.4f}'.format(x * 0.001)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'MG':
        listy[0][index] = '{:1.4f}'.format(x * 0.000001)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'KG':
        listy[0][index] = '{:1.4f}'.format(x * 1)
        listy[1][index] = 'KG'
df['quantity'] = listy[0]
df['uom'] = listy[1]
df
quantity uom
0 100.0000 KG
1 0.9100 KG
2 500.0000 KG
3 50.0000 KG
4 0.0005 KG
5 22500.0000 KG
6 1.3000 KG
7 0.6000 KG
8 20000.0000 KG
But if we have a really large dataframe, I don't think looping through it would be a good way of doing this.
Can I do something similar in a better way?
I was also trying a list comprehension but couldn't make it work.
Map using a dict and multiply the values, i.e.
vals = {'MT':1000, 'LBS':0.453592, 'GM': 0.001, 'MG':0.000001, 'KG':1}
df['new'] = df['quantity']*df['uom'].map(vals)
quantity uom new
0 100.0 KG 100.0000
1 910.0 GM 0.9100
2 500.0 KG 500.0000
3 50.0 KG 50.0000
4 0.5 GM 0.0005
5 22.5 MT 22500.0000
6 1300.0 GM 1.3000
7 600.0 GM 0.6000
8 20.0 MT 20000.0000
If you want to add 'KG' as a column value, then use df['new_unit'] = 'KG'.
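One caveat worth noting (an addition, not part of the original answer): any uom value missing from the dict maps to NaN, so the multiplication silently produces NaN for unknown units. A quick guard, assuming the same df and vals:
# Flag units that have no conversion factor before multiplying.
unknown = df.loc[~df['uom'].isin(list(vals)), 'uom'].unique()
if len(unknown):
    raise ValueError('Unmapped units: {}'.format(unknown))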
You can use apply on rows by specifying the axis parameter. Like this:
uom_map = {
    'KG': 1,
    'GM': .001,
    'MT': 1000,
    'LBS': 0.453592,
    'MG': .000001,
}

def to_kg(row):
    quantity, uom = row.quantity, row.uom
    multiplier = uom_map[uom]
    return quantity * multiplier

df['quantity_kg'] = df.apply(to_kg, axis=1)

Efficient pandas rolling aggregation over date range by group - Python 2.7 Windows - Pandas 0.19.2

I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually, I want to be able to add conditions, i.e. evaluating a 'type' field, but I'm not there just yet. I've written something to get the job done, but I feel there could be a more direct way of getting to the desired result.
My pandas data frame currently looks like this, with the desired output being put in the last column 'rolling_sales_180'.
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
My current solution and environment are below. I've been modeling my solution on this R Q&A on Stack Overflow: Efficient way to perform running total in the last 365 day window.
import pandas as pd
import numpy as np
def trans_date_to_dist_matrix(date_col):  # used to create a distance matrix
    x = date_col.tolist()
    y = date_col.tolist()
    data = []
    for i in x:
        tmp = []
        for j in y:
            tmp.append(abs((i - j).days))
        data.append(tmp)
    del tmp
    return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)

def lower_tri(x_col, date_col, win):  # x_col = column user wants a rolling sum of, date_col = dates, win = time window
    dm = trans_date_to_dist_matrix(date_col=date_col)  # dm = distance matrix
    dm = dm.where(dm <= win)  # find all elements of the distance matrix that are less than window(time)
    lt = dm.where(np.tril(np.ones(dm.shape)).astype(np.bool))  # lt = lower tri of distance matrix so we get only future dates
    lt[lt >= 0.0] = 1.0  # cleans up our lower tri so that we can sum events that happen on the day we are evaluating
    lt = lt.fillna(0)  # replaces NaN with 0's for multiplication
    return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()

def flatten(x):
    try:
        n = [v for sl in x for v in sl]
        return [v for sl in n for v in sl]
    except:
        return [v for sl in x for v in sl]
data = [
['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]
list_of_vals = []
dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))
new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data
print dates_df
Your time and feedback are appreciated.
Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:
def get_rolling_amount(grp, freq):
    return grp.rolling(freq, on='date')['amount'].sum()

df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')
The resulting output:
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
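One caveat (based on pandas' documented behaviour for offset-based windows, so treat it as an assumption to verify on your version): rolling with a frequency string requires the on column to be monotonically increasing within each group, so sorting first is a safe habit:
# Sort by group and date so each group's 'date' column is monotonic
# before applying the time-based rolling window.
df = df.sort_values(['name', 'date'])
df['rolling_sales_180'] = (df.groupby('name', as_index=False, group_keys=False)
                             .apply(get_rolling_amount, '180D'))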
