This is obviously simple, but as a numpy newbie I'm getting stuck.
I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.
I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
This returns:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.
Update 2022-03
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Thanks to this comment by Paul Rougieux for surfacing it.
Original Answer (2014)
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
Returns:
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
(This solution is inspired from this article https://pbpython.com/pandas_transform.html)
I find the following solution to be the simplest (and probably the fastest), using transformation:
Transformation: While aggregation must return a reduced version of the
data, transformation can return some transformed version of the full
data to recombine. For such a transformation, the output is the same
shape as the input.
So using transformation, the solution is a one-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
state office_id sales %
0 AZ 2 195197 9.844309
1 AZ 4 877890 44.274352
2 AZ 6 909754 45.881339
3 CA 1 614752 50.415708
4 CA 3 395340 32.421767
5 CA 5 209274 17.162525
6 CO 1 549430 42.659629
7 CO 3 457514 35.522956
8 CO 5 280995 21.817415
9 WA 2 828238 35.696929
10 WA 4 719366 31.004563
11 WA 6 772590 33.298509
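For anyone unsure what "same shape as the input" buys you here, a minimal sketch with made-up numbers (the figures and the pct column name are my own, not from the question) comparing the aggregated and the transformed result:

```python
import pandas as pd

# Made-up sales figures, chosen so the shares are easy to check by hand.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA", "WA"],
    "office_id": [1, 2, 1, 2, 3],
    "sales": [100, 300, 50, 50, 100],
})

# Aggregation: one value per state (2 rows).
state_totals = df.groupby("state")["sales"].sum()

# Transformation: the state total repeated for every original row (5 rows),
# aligned to df's index, so it can be divided into df["sales"] directly.
row_totals = df.groupby("state")["sales"].transform("sum")

df["pct"] = 100 * df["sales"] / row_totals
print(df)
```

Because row_totals shares df's index, no merge or second lookup is needed before dividing.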
You need to make a second groupby object that groups by the states, and then use the div method:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
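To check the broadcasting by hand, here is a small sketch of the same div call on invented numbers (the data is my own; the variable names follow the answer above):

```python
import pandas as pd

# Invented data: two states with two offices each.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

# Per-office totals (MultiIndex: state, office_id) and per-state totals.
state_office = df.groupby(["state", "office_id"]).agg({"sales": "sum"})
state = df.groupby("state").agg({"sales": "sum"})

# level="state" matches each office row to its state's total.
pct = state_office.div(state, level="state") * 100
print(pct)
```

Each state's percentages should add up to 100, which is a quick sanity check on real data too.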
For conciseness I'd use the SeriesGroupBy:
In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
In [12]: c
Out[12]:
state office_id
AZ 2 925105
4 592852
6 362198
CA 1 819164
3 743055
5 292885
CO 1 525994
3 338378
5 490335
WA 2 623380
4 441560
6 451428
Name: count, dtype: int64
In [13]: c / c.groupby(level=0).sum()
Out[13]:
state office_id
AZ 2 0.492037
4 0.315321
6 0.192643
CA 1 0.441573
3 0.400546
5 0.157881
CO 1 0.388271
3 0.249779
5 0.361949
WA 2 0.411101
4 0.291196
6 0.297703
Name: count, dtype: float64
For multiple groups you have to use transform (using Radical's df):
In [21]: c = df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")
In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1 Group 2 Final Group
AAHQ BOSC OWON 0.331006
TLAM 0.668994
MQVF BWSI 0.288961
FXZM 0.711039
ODWV NFCH 0.262395
...
Name: count, dtype: float64
This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).
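A tiny self-contained version of the multi-level case, with invented group labels (g1, g2, leaf) standing in for Radical's columns:

```python
import pandas as pd

# Invented three-level grouping; "value" plays the role of the numbers
# to be turned into shares.
df = pd.DataFrame({
    "g1": ["A", "A", "A", "B"],
    "g2": ["x", "x", "y", "x"],
    "leaf": ["p", "q", "p", "p"],
    "value": [1, 3, 5, 2],
})

c = df.groupby(["g1", "g2", "leaf"])["value"].sum()

# transform("sum") keeps the full three-level index of c, so the division
# lines up row by row within each (g1, g2) pair.
share = c / c.groupby(level=[0, 1]).transform("sum")
print(share)
```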
I think this needs benchmarking. Using OP's original DataFrame,
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
0th Caner
NEW Pandas Transform looks much faster.
df['sales'] / df.groupby('state')['sales'].transform('sum')
1.32 ms ± 352 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
1st Andy Hayden
As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.
c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()
3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
2nd Paul H
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
3rd exp1orer
This is the slowest answer as it calculates x.sum() for each x in level 0.
For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).
Here is the modification,
(
df.groupby(['state', 'office_id'])
.agg({'sales': 'sum'})
.groupby(level=0)
.apply(lambda x: 100 * x / float(x.sum()))
)
10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up and, on a larger dataset with high-cardinality groupbys, it is going to make a massive difference.
Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,
import string
import numpy as np
import pandas as pd
np.random.seed(0)
groups = [
''.join(i) for i in zip(
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
)
]
df = pd.DataFrame({'state': groups * 400,
'office_id': list(range(1, 601)) * 20000,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)] * 1000000
})
Using Caner's,
0.791 s ± 19.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
Using Andy's,
2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
and exp1orer
19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
So now we see a roughly 10x speed-up on large, high-cardinality datasets with Andy's answer, but a very impressive speed-up of more than 20x with Caner's.
Be sure to UV these three answers if you UV this one!!
Edit: added Caner benchmark
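If you want to rerun a comparison like this yourself, a rough sketch could look like the following. The frame size, the helper names via_transform and via_level_sum, and the repeat count are arbitrary choices of mine; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 10 states x 100 offices, one row per (state, office_id) pair.
df = pd.DataFrame({
    "state": np.repeat(list("ABCDEFGHIJ"), 100),
    "office_id": np.tile(np.arange(100), 10),
    "sales": rng.integers(100_000, 999_999, size=1000),
})


def via_transform(frame):
    # Caner's approach: divide by the per-row state total.
    return frame["sales"] / frame.groupby("state")["sales"].transform("sum")


def via_level_sum(frame):
    # Andy Hayden's approach: aggregate first, then divide by level-0 sums.
    c = frame.groupby(["state", "office_id"])["sales"].sum()
    return c / c.groupby(level=0).sum()


# Check that both give the same ratios before comparing speed.
a = via_transform(df)
a.index = pd.MultiIndex.from_frame(df[["state", "office_id"]])
a = a.sort_index()
b = via_level_sum(df).sort_index()
same = np.allclose(a.to_numpy(), b.to_numpy())
print("results agree:", same)

for fn in (via_transform, via_level_sum):
    seconds = timeit.timeit(lambda: fn(df), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```

Verifying that the candidates agree first avoids benchmarking two functions that silently compute different things.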
I realize there are already good answers here.
I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.
It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).
The following snippet fulfills these criteria:
df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())
Note that if you're still using Python 2, you'll have to replace the x in the denominator of the lambda term by float(x).
I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:
Create the test dataframe with 50,000 unique groups
import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)
# This is the total number of groups to be created
NumberOfGroups = 50000
# Create a lot of groups (random strings of 4 letters)
# Integer division (//) keeps range() happy on Python 3 as well
Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//10)]*10
Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
'Group 2': Group2,
'Final Group': FinalGroup,
'Numbers I want as percents': NumbersForPercents})
When grouped it looks like:
Numbers I want as percents
Group 1 Group 2 Final Group
AAAH AQYR RMCH 847
XDCL 182
DQGO ALVF 132
AVPH 894
OVGH NVOO 650
VKQP 857
VNLY HYFW 884
MOYH 469
XOOC GIDS 168
HTOY 544
AACE HNXU RAXK 243
YZNK 750
NOYI NYGC 399
ZYCI 614
QKGK CRLF 520
UXNA 970
TXAR MLNB 356
NMFJ 904
VQYG NPON 504
QPKQ 948
...
[50000 rows x 1 columns]
Array method of finding percentage:
# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
This method takes about 0.15 seconds
Top answer method (using lambda function):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
This method takes about 21 seconds to produce the same result.
The result:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group
0 AAAH AQYR RMCH 847 82.312925
1 AAAH AQYR XDCL 182 17.687075
2 AAAH DQGO ALVF 132 12.865497
3 AAAH DQGO AVPH 894 87.134503
4 AAAH OVGH NVOO 650 43.132050
5 AAAH OVGH VKQP 857 56.867950
6 AAAH VNLY HYFW 884 65.336290
7 AAAH VNLY MOYH 469 34.663710
8 AAAH XOOC GIDS 168 23.595506
9 AAAH XOOC HTOY 544 76.404494
The most elegant way to find percentages across columns or index is to use pd.crosstab.
Sample Data
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
The output dataframe is like this
print(df)
state office_id sales
0 CA 1 764505
1 WA 2 313980
2 CO 3 558645
3 AZ 4 883433
4 CA 5 301244
5 WA 6 752009
6 CO 1 457208
7 AZ 2 259657
8 CA 3 584471
9 WA 4 122358
10 CO 5 721845
11 AZ 6 136928
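Just specify the index, columns and the values to aggregate. The normalize keyword computes each value's share across the index or the columns, depending on its argument. Note that it returns fractions of 1, so with the '{:.2f}%' format used here a printed 0.20% actually means 20%; use '{:.2%}' to print true percentages.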
result = pd.crosstab(index=df['state'],
columns=df['office_id'],
values=df['sales'],
aggfunc='sum',
normalize='index').applymap('{:.2f}%'.format)
print(result)
office_id 1 2 3 4 5 6
state
AZ 0.00% 0.20% 0.00% 0.69% 0.00% 0.11%
CA 0.46% 0.00% 0.35% 0.00% 0.18% 0.00%
CO 0.26% 0.00% 0.32% 0.00% 0.42% 0.00%
WA 0.00% 0.26% 0.00% 0.10% 0.00% 0.63%
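Since normalize returns fractions of 1, a hedged variant that prints true percentages could use the '{:.2%}' format instead. The data below is invented so the shares are easy to verify:

```python
import pandas as pd

# Invented data: each state's offices should sum to 100%.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

# normalize="index" divides every row by its row total (fractions of 1).
shares = pd.crosstab(index=df["state"],
                     columns=df["office_id"],
                     values=df["sales"],
                     aggfunc="sum",
                     normalize="index")

# "{:.2%}" multiplies by 100 and appends the percent sign.
formatted = shares.apply(lambda col: col.map("{:.2%}".format))
print(formatted)
```

Keeping the numeric shares and the formatted strings in separate variables means the numbers stay usable for further calculation.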
You can sum the whole DataFrame and divide by the state total:
# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
df
Returns
office_id sales state sales_ratio
0 1 405711 CA 0.193319
1 2 535829 WA 0.347072
2 3 217952 CO 0.198743
3 4 252315 AZ 0.192500
4 5 982371 CA 0.468094
5 6 459783 WA 0.297815
6 1 404137 CO 0.368519
7 2 222579 AZ 0.169814
8 3 710581 CA 0.338587
9 4 548242 WA 0.355113
10 5 474564 CO 0.432739
11 6 835831 AZ 0.637686
But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:
df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
TypeError: unsupported operand type(s) for /: 'str' and 'str'
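One way around this, sketched here with invented data, is to select the sales column before calling transform, so the non-numeric columns never take part in the sum or the division:

```python
import pandas as pd

# Invented data with a string office_id, the case that triggers the TypeError.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": ["a1", "a2", "b1", "b2"],
    "sales": [100, 300, 50, 150],
})

# Only the numeric "sales" column is grouped, summed and divided.
df["sales_ratio"] = df["sales"] / df.groupby("state")["sales"].transform("sum")
print(df)
```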
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
result = df.groupby(['state', 'office_id'])['sales'].sum().rename("weightage").groupby(level=0).transform(lambda x: x/x.sum())
result.reset_index()
Output:
state office_id weightage
0 AZ 2 0.169814
1 AZ 4 0.192500
2 AZ 6 0.637686
3 CA 1 0.193319
4 CA 3 0.338587
5 CA 5 0.468094
6 CO 1 0.368519
7 CO 3 0.198743
8 CO 5 0.432739
9 WA 2 0.347072
10 WA 4 0.355113
11 WA 6 0.297815
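I think this would do the trick in 1 line (note the second groupby on the state level, so that each office is divided by its state's total rather than by the grand total):
df.groupby(['state', 'office_id']).sum().groupby(level=0).transform(lambda x: x/x.sum()*100)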
A simple way I have used is to merge the results of the two groupbys and then do a simple division.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])
state office_id sales_x sales_y sales_ratio
0 AZ 2 222579 1310725 16.981365
1 AZ 4 252315 1310725 19.250033
2 AZ 6 835831 1310725 63.768601
3 CA 1 405711 2098663 19.331879
4 CA 3 710581 2098663 33.858747
5 CA 5 982371 2098663 46.809373
6 CO 1 404137 1096653 36.851857
7 CO 3 217952 1096653 19.874290
8 CO 5 474564 1096653 43.273852
9 WA 2 535829 1543854 34.707233
10 WA 4 548242 1543854 35.511259
11 WA 6 459783 1543854 29.781508
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()
Returns:
sales
state office_id
AZ 2 54.587910
4 33.009225
6 12.402865
CA 1 32.046582
3 44.937684
5 23.015735
CO 1 21.099989
3 31.848658
5 47.051353
WA 2 43.882790
4 10.265275
6 45.851935
As someone who is also learning pandas, I found the other answers a bit implicit, because pandas hides most of the work behind the scenes, namely by automatically matching up column and index names. This code should be equivalent to a step-by-step version of @exp1orer's accepted answer.
With the df, I'll call it by the alias state_office_sales:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
state_total_sales is state_office_sales grouped by total sums in index level 0 (leftmost).
In: state_total_sales = state_office_sales.groupby(level=0).sum()
state_total_sales
Out:
sales
state
AZ 2448009
CA 2832270
CO 1495486
WA 595859
Because the two dataframes share an index-name and a column-name pandas will find the appropriate locations through shared indexes like:
In: state_office_sales / state_total_sales
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 0.288022
3 0.322169
5 0.389809
CO 1 0.206684
3 0.357891
5 0.435425
WA 2 0.321689
4 0.346325
6 0.331986
To illustrate this even better, here is a partial total with an XX entry that has no equivalent. Pandas matches locations based on index and column names; rows of state_office_sales with no matching total become NaN, and the unmatched XX total is dropped:
In: partial_total = pd.DataFrame(
data = {'sales' : [2448009, 595859, 99999]},
index = ['AZ', 'WA', 'XX' ]
)
partial_total.index.name = 'state'
Out:
sales
state
AZ 2448009
WA 595859
XX 99999
In: state_office_sales / partial_total
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 NaN
3 NaN
5 NaN
CO 1 NaN
3 NaN
5 NaN
WA 2 0.321689
4 0.346325
6 0.331986
This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has no index name.
In: missing_index_totals = state_total_sales.rename_axis("")
missing_index_totals
Out:
sales
AZ 2448009
CA 2832270
CO 1495486
WA 595859
In: state_office_sales / missing_index_totals
Out: ValueError: cannot join with no overlapping index names
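A compact, runnable sketch of the alignment behaviour described above, with invented numbers so the ratios can be checked by hand (variable names follow this answer; the data does not):

```python
import pandas as pd

# Invented per-office sales with a (state, office_id) MultiIndex.
state_office_sales = pd.DataFrame(
    {"sales": [100, 300, 50, 150]},
    index=pd.MultiIndex.from_tuples(
        [("AZ", 2), ("AZ", 4), ("CA", 1), ("CA", 3)],
        names=["state", "office_id"],
    ),
)

# Totals per state; the index is named "state", matching level 0 above.
state_total_sales = state_office_sales.groupby(level="state").sum()

# Division aligns on the shared index name "state" and column name "sales".
ratio = state_office_sales / state_total_sales

# With a total for AZ only, the CA rows have nothing to align with.
partial_total = state_total_sales.loc[["AZ"]]
partial = state_office_sales / partial_total

print(ratio)
print(partial)
```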
df.groupby('state').office_id.value_counts(normalize = True)
I used the value_counts method, but it returns proportions like 0.70 and 0.30 rather than percentages like 70 and 30.
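If percentages are wanted, multiplying the result by 100 is enough. Keep in mind that value_counts counts rows rather than summing sales, so this sketch (with invented data) gives each office's share of rows within its state:

```python
import pandas as pd

# Invented data: state A has three rows for office 1 and one for office 2.
df = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B"],
    "office_id": [1, 1, 1, 2, 3],
})

# normalize=True yields fractions of 1; .mul(100) scales them to percent.
pct = df.groupby("state")["office_id"].value_counts(normalize=True).mul(100)
print(pct)
```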
One-line solution:
df.join(
df.groupby('state').agg(state_total=('sales', 'sum')),
on='state'
).eval('sales / state_total')
This returns a Series of per-office ratios, which can be used on its own or assigned to the original DataFrame.
Related
I have a csv file that contains some columns. The columns of interest have multiple json objects in a single row. it looks something like this:
IN: df=pd.read_csv('filename.tsv',sep='\t')
IN: df
OUT: name RSN model version dt si2 si3 pi1 wi20 wi28 li1 ci1 ai1 ai2 ai3 ad1 wi19 wi27 wan2 wan1 li3 li2 li5 li4 li7 li6 li9 li8 wi22 wi21 wi24 wi23 wi26 wi25 wi30 wi29 wi14 wi13 wi16 wi15 wi17 wi18
0 DE1 RSN JCO4032 R2.15 12-03-21 06:53:32:155 14 46 831 5 149 2 0 NaN NaN NaN NaN 0 0 218419 553198 1754335 32208167 18594 28750 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN
1 DE1 RSN JCO4032 R2.15 12-03-21 06:54:04:343 14 46 863 5 149 2 0 NaN NaN NaN NaN 0 0 9063 209 99335 1941734 1084 1598 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN
2 DE1 RSN JCO4032 R2.15 12-03-21 07:04:07:579 13 46 1469 5 149 2 0 NaN NaN NaN NaN 0 0 152680 18355 1656295 29541773 17201 25804 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN
IN: df.wi17
OUT:
35 NaN
36 NaN
37 [{"mac":"2xx01:xxF","rssi":-60,"txrate...
38 [{"mac":"20:4xx:1F","rssi":-60,"txrate...
39 NaN
Name: wi17, dtype: object
IN: df.wi17[37]
OUT: '[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]'
I converted this column of strings and NaNs to a column of list of dictionaries using json.loads.
def parser2(d):
    if d!=d:
        return np.nan
    else:
        return json.loads(d)
df.wi17 = df.wi17.apply(parser2)
I am looking for a graceful solution to explode these dictionaries and group them on the basis of a unique "mac", which in turn should be grouped by a unique 'RSN' in the original df.
It should look something like this:
... RSN .... mac rssi txrate max_txrate txbytes rxbytes nxn ...
... RSNFDXXXKDF ... 2A:xxxx:sd 30 34 50 2323 34323 1x1 ...
... RSNFDXXXKDF ... 2A:xxxx:sd 50 84 70 20 2334343 1x1 ...
... RSNFDXXXKDF ... 3B:yyyy:sd 45 48 47 40 2334 2x2 ...
... RSNFDXXXKDF ... Nan Nan Nan Nan Nan Nan Nan ...
... ADKNCCJXKDF ... AA:yyyy:sd 45 48 47 40 2334 2x2 ...
Any suggestions?
Let's use df.explode(), df.apply() + pd.Series() on the data of column wi17 and then pd.concat :
df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1),
df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
axis=1).reset_index()
Here, df.explode('wi17') first splits each list of dictionaries so that every row of df2 holds a single dictionary. Then df2.apply(lambda x: pd.Series(x.wi17), axis=1) expands each dictionary with pd.Series, turning its keys into column names and its values into column values.
Demo Run
Test data construction
data = {'name': ['DE1', 'DE2', 'DE3'], 'RSN': ['RSNJCO4032', 'RSNJCO4033', 'RSNJCO4034']}
df = pd.DataFrame(data)
df['wi17'] = ['[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]', '[{"mac":"40:17xx:1F","rssi":-62,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"F8xx:B0","rssi":-58,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]', '[{"mac":"60:07xx:1F","rssi":-64,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"A8xx:C0","rssi":-61,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]']
import json
def parser2(d):
    if d!=d:
        return np.nan
    else:
        return json.loads(d)
df.wi17 = df.wi17.apply(parser2)
print(df)
name RSN wi17
0 DE1 RSNJCO4032 [{'mac': '20:47xx:1F', 'rssi': -60, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'E8xx:A0', 'rssi': -57, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
1 DE2 RSNJCO4033 [{'mac': '40:17xx:1F', 'rssi': -62, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'F8xx:B0', 'rssi': -58, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
2 DE3 RSNJCO4034 [{'mac': '60:07xx:1F', 'rssi': -64, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'A8xx:C0', 'rssi': -61, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
Run the new code
df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1),
df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
axis=1).reset_index()
print(df3)
Output:
name RSN mac rssi txrate max_txrate txbytes rxbytes nxn
0 DE1 RSNJCO4032 20:47xx:1F -60 72.0 72.0 0 0 1x1
1 DE1 RSNJCO4032 E8xx:A0 -57 72.0 72.0 1414810891 808725830 1x1
2 DE2 RSNJCO4033 40:17xx:1F -62 72.0 72.0 0 0 1x1
3 DE2 RSNJCO4033 F8xx:B0 -58 72.0 72.0 1414810891 808725830 1x1
4 DE3 RSNJCO4034 60:07xx:1F -64 72.0 72.0 0 0 1x1
5 DE3 RSNJCO4034 A8xx:C0 -61 72.0 72.0 1414810891 808725830 1x1
Edit
For better performance (execution time), you can try replacing the .apply() call with list(map(...)) as follows:
df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1),
pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
axis=1).reset_index()
Edit 2
The execution time is further improved. Benchmarking shows execution more than 20x faster when using pd.json_normalize() to expand the JSON structure into a new pandas DataFrame, which is then merged into the original DataFrame:
df2 = df.explode('wi17')
df2['wi17'] = df2['wi17'].fillna({i: {} for i in df2.index})  # as suggested by @TwerkingPanda to handle NaN entries.
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True),
pd.json_normalize(df2['wi17'])],
axis=1).reset_index()
Benchmarking system performance with 30,000 rows (each row with 2 json, hence total 60,000 json)
df1 = pd.concat([df] * 10000, ignore_index=True)
df2 = df1.explode('wi17')
(1) Benchmark for original version using .apply() with pd.Series()
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1),
df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
axis=1).reset_index()
21.9 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(2) Benchmark for revised version using list(map(...)) with pd.Series() and pd.DataFrame():
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1),
pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
axis=1).reset_index()
20.6 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(3) Benchmark for revised version using pd.json_normalize():
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True),
pd.json_normalize(df2['wi17'])],
axis=1).reset_index()
999 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Edit 3
This version further improves on system performance (execution time) and at the same time allows keeping the original series index for OP to ensure the data integrity.
As the json objects are in one level without nesting, we can make use of the more efficient DataFrame constructor pd.DataFrame() to expand the json fields into columns, as follows:
df3 = pd.concat([df2.drop('wi17', axis=1),
pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
axis=1)
Benchmark for version using pd.DataFrame():
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1),
pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
axis=1)
116 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This version is more than 8x faster than the version using pd.json_normalize() and more than 180x faster than the version using df.apply() + pd.Series(). Moreover, we have the flexibility to retain the index of the original dataframe on the expanded columns by using the index= parameter of pd.DataFrame().
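One caveat worth sketching: mixing NaN with dicts in the list passed to pd.DataFrame() is not something I would rely on, so rows whose wi17 was NaN are replaced by empty dicts first (the same idea as the fillna step in Edit 2). The column values and the records helper list below are made up for illustration:

```python
import json

import numpy as np
import pandas as pd

# Invented two-row frame: one JSON string with two objects, one NaN.
df = pd.DataFrame({
    "RSN": ["R1", "R2"],
    "wi17": ['[{"mac": "aa", "rssi": -60}, {"mac": "bb", "rssi": -57}]', np.nan],
})

# Parse the JSON strings; leave NaN untouched.
df["wi17"] = df["wi17"].apply(lambda s: json.loads(s) if isinstance(s, str) else s)

# One dict (or NaN) per row; the original index is repeated for exploded rows.
df2 = df.explode("wi17")

# Replace non-dict entries (NaN) with {} so every row yields a record.
records = [d if isinstance(d, dict) else {} for d in df2["wi17"]]
expanded = pd.DataFrame(records, index=df2.index)

df3 = pd.concat([df2.drop(columns="wi17"), expanded], axis=1)
print(df3)

# Grouping by RSN and mac, as asked; rows with a NaN mac drop out by default.
grouped = df3.groupby(["RSN", "mac"])["rssi"].mean()
print(grouped)
```

The empty dicts turn into all-NaN rows, so the R2 row survives in df3 even though it has no mac.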
The task is to use pandas to remove any invalid names and any candidate without both predictions. In the data frame, some candidate names appear twice, once for each of two prediction dates, while others appear only once. So I want to drop the candidates who have only one prediction date.
I am trying to use the groupby and filter functions to drop the candidates' names that don't meet both conditions: ('forecast_date'== '2018-08-11') AND ('forecast_date'=='2018-11-06')
Here is my code:
election_sub=election_sub.dropna(subset=['candidate'])
election_sub.groupby('candidate')
grouped.filter(lambda x: (x['forecast_date']== '2018-08-11')&(x['forecast_date']=='2018-11-06'))
Here is the dataframe:
Use:
#data to DataFrame
url = 'https://raw.githubusercontent.com/fivethirtyeight/checking-our-work-data/master/us_house_elections.csv'
election_sub = pd.read_csv(url, parse_dates=['election_date','forecast_date'])
#filter out `NaN`s
election_sub=election_sub.dropna(subset=['candidate'])
#filter rows for match one OR another datetime
df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
#get number of unique datetimes per groups
s = df.groupby('candidate')['forecast_date'].nunique()
#filter candidates only with both datetimes, like condition AND
cand = s.index[s.eq(2)].unique()
print (cand)
Index(['A. Donald McEachin', 'Aaron Andrus', 'Aaron Swisher',
'Abby Finkenauer', 'Abigail Spanberger', 'Adam B. Schiff',
'Adam Kinzinger', 'Adam Smith', 'Adrian Smith', 'Adriano Espaillat',
...
'William Lacy Clay', 'William Tanoos', 'William Timmons',
'Willie Billups', 'Xochitl Torres Small', 'Young Kim', 'Yvette Clarke',
'Yvette Herrell', 'Yvonne Hayes Hinson', 'Zoe Lofgren'],
dtype='object', name='candidate', length=960)
#filter original data by candidates
df = election_sub[election_sub['candidate'].isin(cand)]
Your approach also works if you test whether each condition is True for at least one row of the group. Each .any() call returns a scalar, so the two results are combined with Python's and instead of &:
grouped = election_sub.groupby('candidate')
df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())
print(df)
year office state district special election_date forecast_date \
0 2018 House WY 1.0 False 2018-11-06 2018-11-06
1 2018 House WY 1.0 False 2018-11-06 2018-11-06
2 2018 House WY 1.0 False 2018-11-06 2018-11-06
3 2018 House WY 1.0 False 2018-11-06 2018-11-06
4 2018 House WY 1.0 False 2018-11-06 2018-11-06
... ... ... ... ... ... ... ...
282688 2018 House AK 1.0 False 2018-11-06 2018-08-01
282689 2018 House AK 1.0 False 2018-11-06 2018-08-01
282690 2018 House AK 1.0 False 2018-11-06 2018-08-01
282691 2018 House AK 1.0 False 2018-11-06 2018-08-01
282692 2018 House AK 1.0 False 2018-11-06 2018-08-01
forecast_type party candidate projected_voteshare \
0 lite D Greg Hunter 33.29836
1 lite R Liz Cheney 61.18835
2 deluxe D Greg Hunter 31.37998
3 deluxe R Liz Cheney 63.10673
4 classic D Greg Hunter 31.33293
... ... ... ... ...
282688 lite R Don Young 50.74973
282689 deluxe D Alyse S. Galvin 41.49152
282690 deluxe R Don Young 51.96705
282691 classic D Alyse S. Galvin 44.10701
282692 classic R Don Young 49.35155
actual_voteshare probwin probwin_outcome
0 NaN 0.00134 0
1 NaN 0.99866 1
2 NaN 0.00020 0
3 NaN 0.99980 1
4 NaN 0.00032 0
... ... ... ...
282688 NaN 0.76900 1
282689 NaN 0.12776 0
282690 NaN 0.87224 1
282691 NaN 0.28146 0
282692 NaN 0.71854 1
[282240 rows x 14 columns]
EDIT:
Performance of both solutions is different:
In [41]: %%timeit
...: df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
...: #get number of unique datetimes per groups
...: s = df.groupby('candidate')['forecast_date'].nunique()
...: #filter candidates only with both datetimes, like condition AND
...: cand = s.index[s.eq(2)].unique()
...:
...: #filter original data by candidates
...: df = election_sub[election_sub['candidate'].isin(cand)]
...:
61.3 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %%timeit
...: grouped = election_sub.groupby('candidate')
...: df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())
...:
1.07 s ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
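A toy-sized sketch of both approaches, on a hypothetical three-candidate frame (names and dates invented for illustration), may make the AND logic easier to see:

```python
import pandas as pd

# Invented data: only Ann has a forecast on both dates.
df = pd.DataFrame({
    "candidate": ["Ann", "Ann", "Bob", "Cy", "Cy"],
    "forecast_date": pd.to_datetime([
        "2018-08-11", "2018-11-06", "2018-08-11", "2018-11-06", "2018-11-06",
    ]),
})
wanted = pd.to_datetime(["2018-08-11", "2018-11-06"])

# Approach 1: count distinct wanted dates per candidate, keep those with both.
n_dates = (df[df["forecast_date"].isin(wanted)]
           .groupby("candidate")["forecast_date"]
           .nunique())
keep = n_dates.index[n_dates.eq(len(wanted))]
by_count = df[df["candidate"].isin(keep)]

# Approach 2: groupby.filter with one scalar test per date, joined by `and`.
by_filter = df.groupby("candidate").filter(
    lambda g: g["forecast_date"].eq(wanted[0]).any()
    and g["forecast_date"].eq(wanted[1]).any()
)

print(by_count)
print(by_filter)
```

Cy has two rows but only one distinct date, which is why nunique is used instead of a plain row count.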
Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column name are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note that the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note. I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … Wow. What an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, whereas the pandas lambda solution also provided by @AndyHayden takes about 20 seconds to perform the sort. The dataset has 58,000+ rows. The numpy solution returns the sort instantly.
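For reference, the same idea can be sketched without overwriting the original columns: sort each row of the two-column array with np.sort and rebuild a DataFrame with explicit column names, so the later groupby has AIR1/AIR2 to work with. The four-row frame below is made-up sample data, not the book's flights.csv.

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for flights.csv (assumption).
flights = pd.DataFrame({
    "ORG_AIR": ["ATL", "IAH", "ATL", "LAX"],
    "DEST_AIR": ["IAH", "ATL", "IAH", "SLC"],
})

# Sort each row alphabetically; np.sort returns a new array and
# leaves the original DataFrame untouched.
sorted_pairs = np.sort(flights[["ORG_AIR", "DEST_AIR"]].to_numpy(), axis=1)

# Rebuild a DataFrame with explicit, generic column names.
flights_sort = pd.DataFrame(sorted_pairs, columns=["AIR1", "AIR2"],
                            index=flights.index)

# Count flights between each pair of cities, regardless of direction.
flights_ct2 = flights_sort.groupby(["AIR1", "AIR2"]).size()
print(flights_ct2)
```

Because the column names are set when the DataFrame is rebuilt, no rename step is needed afterwards.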
This is obviously simple, but as a numpy newbie I'm getting stuck.
I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.
I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
This returns:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.
Update 2022-03
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Thanks to this comment by Paul Rougieux for surfacing it.
Original Answer (2014)
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
Returns:
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
(This solution is inspired from this article https://pbpython.com/pandas_transform.html)
I find the following solution to be the simplest (and probably the fastest), using transformation:
Transformation: While aggregation must return a reduced version of the
data, transformation can return some transformed version of the full
data to recombine. For such a transformation, the output is the same
shape as the input.
So using transformation, the solution is a one-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
state office_id sales %
0 AZ 2 195197 9.844309
1 AZ 4 877890 44.274352
2 AZ 6 909754 45.881339
3 CA 1 614752 50.415708
4 CA 3 395340 32.421767
5 CA 5 209274 17.162525
6 CO 1 549430 42.659629
7 CO 3 457514 35.522956
8 CO 5 280995 21.817415
9 WA 2 828238 35.696929
10 WA 4 719366 31.004563
11 WA 6 772590 33.298509
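A quick way to convince yourself that transform does what the quote says is to check that the result has one value per input row and that the percentages add up to 100 within each state. This is only a small sketch with made-up, deterministic numbers rather than the random sales above.

```python
import pandas as pd

# Small hand-made dataset (assumption) so the result is easy to verify.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA", "WA"],
    "office_id": [1, 2, 1, 2, 3],
    "sales": [100, 300, 50, 50, 100],
})

# transform('sum') returns one value per input row: the total of that row's state.
state_total = df.groupby("state")["sales"].transform("sum")
df["%"] = 100 * df["sales"] / state_total

# Sanity check: percentages within each state sum to 100.
check = df.groupby("state")["%"].sum()
print(df)
print(check)
```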
You need to make a second groupby object that groups by the states, and then use the div method:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
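To see the broadcasting more concretely, here is a minimal sketch using the Series form of div (Series.div also accepts level) on made-up numbers. Each (state, office_id) entry is divided by the total of the matching state.

```python
import pandas as pd

# Hand-made data (assumption) with easy-to-check totals: CA = 400, WA = 200.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

state_office = df.groupby(["state", "office_id"])["sales"].sum()  # MultiIndex
state = df.groupby("state")["sales"].sum()                        # indexed by state

# level='state' matches each MultiIndex entry to the state total with the same label.
pct = state_office.div(state, level="state") * 100
print(pct)
```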
For conciseness I'd use the SeriesGroupBy:
In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
In [12]: c
Out[12]:
state office_id
AZ 2 925105
4 592852
6 362198
CA 1 819164
3 743055
5 292885
CO 1 525994
3 338378
5 490335
WA 2 623380
4 441560
6 451428
Name: count, dtype: int64
In [13]: c / c.groupby(level=0).sum()
Out[13]:
state office_id
AZ 2 0.492037
4 0.315321
6 0.192643
CA 1 0.441573
3 0.400546
5 0.157881
CO 1 0.388271
3 0.249779
5 0.361949
WA 2 0.411101
4 0.291196
6 0.297703
Name: count, dtype: float64
For multiple groups you have to use transform (using Radical's df):
In [21]: c = df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")
In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1 Group 2 Final Group
AAHQ BOSC OWON 0.331006
TLAM 0.668994
MQVF BWSI 0.288961
FXZM 0.711039
ODWV NFCH 0.262395
...
Name: count, dtype: float64
This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).
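As a rough illustration of why transform is convenient here: it returns the parent-group total on the same three-level index as c, so the division lines up element by element. The data and the "value" column name below are made up for the sketch.

```python
import pandas as pd

# Hand-made three-level data (assumption).
df = pd.DataFrame({
    "Group 1": ["A", "A", "A", "A", "B", "B"],
    "Group 2": ["x", "x", "y", "y", "x", "x"],
    "Final Group": ["p", "q", "p", "q", "p", "q"],
    "value": [1, 3, 2, 2, 5, 15],
})

c = df.groupby(["Group 1", "Group 2", "Final Group"])["value"].sum()

# Total of each (Group 1, Group 2) pair, repeated for every row of that pair.
parent_total = c.groupby(level=[0, 1]).transform("sum")
share = c / parent_total
print(share)
```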
I think this needs benchmarking. Using OP's original DataFrame,
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
0th Caner
NEW: pandas transform looks much faster.
df['sales'] / df.groupby('state')['sales'].transform('sum')
1.32 ms ± 352 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
1st Andy Hayden
As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.
c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()
3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
2nd Paul H
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
3rd exp1orer
This is the slowest answer as it calculates x.sum() for each x in level 0.
For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).
Here is the modification,
(
df.groupby(['state', 'office_id'])
.agg({'sales': 'sum'})
.groupby(level=0)
.apply(lambda x: 100 * x / float(x.sum()))
)
10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
So no one is going to care about 6 ms on a small dataset. However, this is a 3x speedup, and on a larger dataset with high-cardinality groupbys this is going to make a massive difference.
Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,
import string
import numpy as np
import pandas as pd
np.random.seed(0)
groups = [
''.join(i) for i in zip(
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
)
]
df = pd.DataFrame({'state': groups * 400,
'office_id': list(range(1, 601)) * 20000,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)] * 1000000
})
Using Caner's,
0.791 s ± 19.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
Using Andy's,
2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
and exp1orer
19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
So now we see roughly a 10x speedup on large, high-cardinality datasets with Andy's, and a very impressive 20x-plus speedup with Caner's.
Be sure to UV these three answers if you UV this one!!
Edit: added Caner benchmark
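If you want to reproduce this kind of comparison outside IPython's %timeit, a minimal sketch with the standard-library timeit module could look like the following. The function names and the 1,000-row random frame are my own choices for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "state": rng.choice(["CA", "WA", "CO", "AZ"], size=1_000),
    "sales": rng.integers(100_000, 999_999, size=1_000),
})


def via_builtin_sum(frame):
    # Vectorised: built-in 'sum' transform, then one division.
    return frame["sales"] / frame.groupby("state")["sales"].transform("sum")


def via_lambda(frame):
    # Same result, but the lambda is called once per group in Python.
    return frame.groupby("state")["sales"].transform(lambda x: x / x.sum())


# Make sure both variants agree before comparing their speed.
assert np.allclose(via_builtin_sum(df), via_lambda(df))

for func in (via_builtin_sum, via_lambda):
    seconds = timeit.timeit(lambda: func(df), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```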
I realize there are already good answers here.
I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.
It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).
The following snippet fulfills these criteria:
df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())
Note that if you're still using Python 2, you'll have to replace the x in the denominator of the lambda term by float(x).
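To make the "more than one grouping level" case concrete, here is a small sketch. The country column and the numbers are invented for illustration; the only change from the snippet above is the longer list of keys.

```python
import pandas as pd

# Hypothetical data with an extra 'country' level (assumption).
df = pd.DataFrame({
    "country": ["US", "US", "US", "CA", "CA"],
    "state": ["WA", "WA", "CA", "BC", "BC"],
    "sales": [10, 30, 20, 5, 15],
})

# Share of each row within its (country, state) group, added as a new column.
df["sales_ratio"] = (
    df.groupby(["country", "state"])["sales"].transform(lambda x: x / x.sum())
)
print(df)
```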
I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:
Create the test dataframe with 50,000 unique groups
import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)
# This is the total number of groups to be created
NumberOfGroups = 50000
# Create a lot of groups (random strings of 4 letters)
Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 10)]*10
Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
'Group 2': Group2,
'Final Group': FinalGroup,
'Numbers I want as percents': NumbersForPercents})
When grouped it looks like:
Numbers I want as percents
Group 1 Group 2 Final Group
AAAH AQYR RMCH 847
XDCL 182
DQGO ALVF 132
AVPH 894
OVGH NVOO 650
VKQP 857
VNLY HYFW 884
MOYH 469
XOOC GIDS 168
HTOY 544
AACE HNXU RAXK 243
YZNK 750
NOYI NYGC 399
ZYCI 614
QKGK CRLF 520
UXNA 970
TXAR MLNB 356
NMFJ 904
VQYG NPON 504
QPKQ 948
...
[50000 rows x 1 columns]
Array method of finding percentage:
# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
This method takes about 0.15 seconds
Top answer method (using lambda function):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
This method takes about ~21 seconds to produce the same result.
The result:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group
0 AAAH AQYR RMCH 847 82.312925
1 AAAH AQYR XDCL 182 17.687075
2 AAAH DQGO ALVF 132 12.865497
3 AAAH DQGO AVPH 894 87.134503
4 AAAH OVGH NVOO 650 43.132050
5 AAAH OVGH VKQP 857 56.867950
6 AAAH VNLY HYFW 884 65.336290
7 AAAH VNLY MOYH 469 34.663710
8 AAAH XOOC GIDS 168 23.595506
9 AAAH XOOC HTOY 544 76.404494
The most elegant way to find percentages across columns or index is to use pd.crosstab.
Sample Data
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
The output dataframe is like this
print(df)
state office_id sales
0 CA 1 764505
1 WA 2 313980
2 CO 3 558645
3 AZ 4 883433
4 CA 5 301244
5 WA 6 752009
6 CO 1 457208
7 AZ 2 259657
8 CA 3 584471
9 WA 4 122358
10 CO 5 721845
11 AZ 6 136928
Just specify the index, columns and the values to aggregate. The normalize keyword will calculate % across index or columns depending upon the context.
# normalize returns fractions between 0 and 1; '{:.0%}' multiplies by 100 when formatting
result = pd.crosstab(index=df['state'],
columns=df['office_id'],
values=df['sales'],
aggfunc='sum',
normalize='index').applymap('{:.0%}'.format)
print(result)
office_id 1 2 3 4 5 6
state
AZ 0% 20% 0% 69% 0% 11%
CA 46% 0% 35% 0% 18% 0%
CO 26% 0% 32% 0% 42% 0%
WA 0% 26% 0% 10% 0% 63%
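One thing worth checking with normalize='index' is that it yields fractions of each row total, so every row should sum to 1. A minimal sketch on made-up data:

```python
import pandas as pd

# Hand-made data (assumption): CA total 400, WA total 200.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 3],
    "sales": [100, 300, 50, 150],
})

# Each cell is that office's share of its state's total sales (a fraction, not a percent).
shares = pd.crosstab(index=df["state"],
                     columns=df["office_id"],
                     values=df["sales"],
                     aggfunc="sum",
                     normalize="index")
print(shares)
print(shares.sum(axis=1))
```

Multiply by 100 (or use a percent format specifier) if you want to display the values as percentages.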
You can sum the whole DataFrame and divide by the state total:
# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
df
Returns
office_id sales state sales_ratio
0 1 405711 CA 0.193319
1 2 535829 WA 0.347072
2 3 217952 CO 0.198743
3 4 252315 AZ 0.192500
4 5 982371 CA 0.468094
5 6 459783 WA 0.297815
6 1 404137 CO 0.368519
7 2 222579 AZ 0.169814
8 3 710581 CA 0.338587
9 4 548242 WA 0.355113
10 5 474564 CO 0.432739
11 6 835831 AZ 0.637686
But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:
df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
TypeError: unsupported operand type(s) for /: 'str' and 'str'
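One way around this, sketched below on made-up data, is to select the sales column before transforming so that non-numeric columns never take part in the division. This is the same pattern as the transform one-liner in other answers, not a change to the approach above.

```python
import pandas as pd

# Hand-made data (assumption); office_id is deliberately a string column.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": ["1", "2", "1", "2"],
    "sales": [100, 300, 50, 150],
})

# Only the numeric 'sales' column is summed and divided.
df["sales_ratio"] = df["sales"] / df.groupby("state")["sales"].transform("sum")
print(df)
```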
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Keep the result; the original df is left unchanged
weights = df.groupby(['state', 'office_id'])['sales'].sum().rename("weightage").groupby(level=0).transform(lambda x: x/x.sum())
weights.reset_index()
Output:
state office_id weightage
0 AZ 2 0.169814
1 AZ 4 0.192500
2 AZ 6 0.637686
3 CA 1 0.193319
4 CA 3 0.338587
5 CA 5 0.468094
6 CO 1 0.368519
7 CO 3 0.198743
8 CO 5 0.432739
9 WA 2 0.347072
10 WA 4 0.355113
11 WA 6 0.297815
I think this would do the trick in 1 line:
df.groupby(['state', 'office_id']).sum().groupby(level=0).transform(lambda x: x/np.sum(x)*100)
A simple way I have used is to merge the results of the two groupbys and then do a simple division.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])
state office_id sales_x sales_y sales_ratio
0 AZ 2 222579 1310725 16.981365
1 AZ 4 252315 1310725 19.250033
2 AZ 6 835831 1310725 63.768601
3 CA 1 405711 2098663 19.331879
4 CA 3 710581 2098663 33.858747
5 CA 5 982371 2098663 46.809373
6 CO 1 404137 1096653 36.851857
7 CO 3 217952 1096653 19.874290
8 CO 5 474564 1096653 43.273852
9 WA 2 535829 1543854 34.707233
10 WA 4 548242 1543854 35.511259
11 WA 6 459783 1543854 29.781508
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()
Returns:
sales
state office_id
AZ 2 54.587910
4 33.009225
6 12.402865
CA 1 32.046582
3 44.937684
5 23.015735
CO 1 21.099989
3 31.848658
5 47.051353
WA 2 43.882790
4 10.265275
6 45.851935
As someone who is also learning pandas, I found the other answers a bit implicit, as pandas hides most of the work behind the scenes, namely in how the operation automatically matches up column and index names. This code should be equivalent to a step-by-step version of @exp1orer's accepted answer.
With the df, I'll call it by the alias state_office_sales:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
state_total_sales is state_office_sales summed over index level 0 (the leftmost level).
In: state_total_sales = state_office_sales.groupby(level=0).sum()
state_total_sales
Out:
sales
state
AZ 2448009
CA 2832270
CO 1495486
WA 595859
Because the two dataframes share an index name and a column name, pandas will find the matching locations through the shared index, like so:
In: state_office_sales / state_total_sales
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 0.288022
3 0.322169
5 0.389809
CO 1 0.206684
3 0.357891
5 0.435425
WA 2 0.321689
4 0.346325
6 0.331986
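To illustrate this even better, here is a partial total with an XX entry that has no equivalent. Pandas will match the location based on index and column names. Where there is no overlap, the result is NaN (as for CA and CO below), and XX, which has no counterpart in state_office_sales, does not appear in the result: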
In: partial_total = pd.DataFrame(
data = {'sales' : [2448009, 595859, 99999]},
index = ['AZ', 'WA', 'XX' ]
)
partial_total.index.name = 'state'
Out:
sales
state
AZ 2448009
WA 595859
XX 99999
In: state_office_sales / partial_total
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 NaN
3 NaN
5 NaN
CO 1 NaN
3 NaN
5 NaN
WA 2 0.321689
4 0.346325
6 0.331986
This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has no index name.
In: missing_index_totals = state_total_sales.rename_axis("")
missing_index_totals
Out:
sales
AZ 2448009
CA 2832270
CO 1495486
WA 595859
In: state_office_sales / missing_index_totals
Out: ValueError: cannot join with no overlapping index names
df.groupby('state').office_id.value_counts(normalize = True)
I used the value_counts method, but it returns fractions such as 0.70 and 0.30 rather than percentages such as 70 and 30.
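If percentages are wanted, multiplying the normalized counts by 100 is enough. Note that value_counts counts rows per office rather than summing sales, so this sketch (on made-up data) gives the share of rows, not the share of sales.

```python
import pandas as pd

# Hand-made data (assumption): CA has three rows for office 1 and one for office 2.
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "WA", "WA"],
    "office_id": [1, 1, 1, 2, 1, 2],
})

# Fractions per state, scaled to percentages.
pct = df.groupby("state")["office_id"].value_counts(normalize=True).mul(100)
print(pct)
```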
One-line solution:
df.join(
df.groupby('state').agg(state_total=('sales', 'sum')),
on='state'
).eval('sales / state_total')
This returns a Series of per-office ratios -- it can be used on its own or assigned to the original DataFrame.
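For example, assigning the result back as a new column might look like this. It is a sketch on made-up data; the sales_ratio column name is my own choice.

```python
import pandas as pd

# Hand-made data (assumption): CA total 400, WA total 200.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

# Per-state totals, indexed by state, with a named aggregation.
state_totals = df.groupby("state").agg(state_total=("sales", "sum"))

# join aligns the totals to each row by state; eval computes the ratio.
df["sales_ratio"] = df.join(state_totals, on="state").eval("sales / state_total")
print(df)
```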
When I use the pandas value_counts method, I get the data below:
new_df['mark'].value_counts()
1 1349110
2 1606640
3 175629
4 790062
5 330978
How can I get the percentage for each row like this?
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
I need to divide each row by the sum of these data.
np.random.seed([3,1415])
s = pd.Series(np.random.choice(list('ABCDEFGHIJ'), 1000, p=np.arange(1, 11) / 55.))
s.value_counts()
I 176
J 167
H 136
F 128
G 111
E 85
D 83
C 52
B 38
A 24
dtype: int64
As percent
s.value_counts(normalize=True)
I 0.176
J 0.167
H 0.136
F 0.128
G 0.111
E 0.085
D 0.083
C 0.052
B 0.038
A 0.024
dtype: float64
counts = s.value_counts()
percent = counts / counts.sum()
fmt = '{:.1%}'.format
pd.DataFrame({'counts': counts, 'per': percent.map(fmt)})
counts per
I 176 17.6%
J 167 16.7%
H 136 13.6%
F 128 12.8%
G 111 11.1%
E 85 8.5%
D 83 8.3%
C 52 5.2%
B 38 3.8%
A 24 2.4%
I think you need:
#if output is Series, convert it to DataFrame
df = df.rename('a').to_frame()
df['per'] = (df.a * 100 / df.a.sum()).round(1).astype(str) + '%'
print (df)
a per
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
Timings:
It seems faster to use sum than to call value_counts twice:
In [184]: %timeit (jez(s))
10 loops, best of 3: 38.9 ms per loop
In [185]: %timeit (pir(s))
10 loops, best of 3: 76 ms per loop
Code for timings:
np.random.seed([3,1415])
s = pd.Series(np.random.choice(list('ABCDEFGHIJ'), 1000, p=np.arange(1, 11) / 55.))
s = pd.concat([s]*1000)#.reset_index(drop=True)
def jez(s):
df = s.value_counts()
df = df.rename('a').to_frame()
df['per'] = (df.a * 100 / df.a.sum()).round(1).astype(str) + '%'
return df
def pir(s):
return pd.DataFrame({'a':s.value_counts(),
'per':s.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})
print (jez(s))
print (pir(s))
Here's a snippet that I think is more Pythonic than what is proposed above:
def aspercent(column,decimals=2):
assert decimals >= 0
return (round(column*100,decimals).astype(str) + "%")
aspercent(df['mark'].value_counts(normalize=True),decimals=1)
This will output:
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
This also lets you adjust the number of decimals.
Create two series, first one with absolute values and a second one with percentages, and concatenate them:
import pandas as pd
d = {'mark': ['pos', 'pos', 'pos', 'pos', 'pos',
'neg', 'neg', 'neg', 'neg',
'neutral', 'neutral' ]}
df = pd.DataFrame(d)
absolute = df['mark'].value_counts(normalize=False)
absolute.name = 'value'
percentage = df['mark'].value_counts(normalize=True)
percentage.name = '%'
percentage = (percentage*100).round(2)
pd.concat([absolute, percentage], axis=1)
Output:
value %
pos 5 45.45
neg 4 36.36
neutral 2 18.18