count common entries between two string variables via Python

I would greatly appreciate someone's help with counting the number of matching state names from two columns in my csv file. For instance, consider the first 7 observations from the columns State_born_in and State_lives_in:
State_born_in State_lives_in
New York Florida
Massachusetts Massachusetts
Florida Massachusetts
Illinois Illinois 
Iowa Texas
New Hampshire Massachusetts
California California
Basically I want to count the number of people who live in the same state they were born in, and then the percentage of all people for whom that is true. So in the example above I would have count = 2, since two people (the Massachusetts and California rows) live in the same state they were born in. And if I wanted the percentage I would just divide 2 by the number of observations. I'm still relatively new to using pandas, but this is what I've tried so far:
import pandas as pd
df = pd.read_csv("uscitizens.csv")
counts = df[(df['State_born_in'] == df['State_lives_in'])]; counts
percentage = counts/len(df['State_born_in'])
Moreover, how would I do this on a dataset that has over 2 million observations? I would greatly appreciate anyone's help.

You can first use boolean indexing and then simply divide the length of the filtered DataFrame by the length of the original (taking the length of the index is the fastest way to get it):
print (df)
State_born_in State_lives_in
0 New York Florida
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
4 Florida Massachusetts
5 Illinois Illinois
6 Iowa Texas
7 New Hampshire Massachusetts
8 California California
same = df[(df['State_born_in'] == df['State_lives_in'])]
print (same)
State_born_in State_lives_in
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
5 Illinois Illinois
8 California California
counts = len(same.index)
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.55555555555556
Timings:
In [21]: %timeit len(same.index)
The slowest run took 18.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 546 ns per loop
In [22]: %timeit same.shape[0]
The slowest run took 21.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop
In [23]: %timeit len(same['State_born_in'])
The slowest run took 46.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
Faster solution:
same = (df['State_born_in'] == df['State_lives_in'])
print (same)
0 False
1 True
2 True
3 True
4 False
5 True
6 False
7 False
8 True
dtype: bool
counts = same.sum()
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.5555555556
Timings on a 2M-row DataFrame:
#[2000000 rows x 2 columns]
df = pd.concat([df]*200000).reset_index(drop=True)
#print (df)
In [127]: %timeit (100 * (df['State_born_in'] == df['State_lives_in']).sum()/len(df.index))
1 loop, best of 3: 444 ms per loop
In [128]: %timeit (100 * len(df[(df['State_born_in'] == df['State_lives_in'])].index)/len(df.index))
1 loop, best of 3: 472 ms per loop
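Side note (not part of the original answer): since the mean of a boolean Series is the fraction of True values, the whole computation collapses into one line. A minimal sketch, assuming the same column names as above:
# The comparison yields a boolean Series; its mean is the share of matching rows.
percentage = 100 * (df['State_born_in'] == df['State_lives_in']).mean()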

Are you expecting this?
counts = df[ df['State_born_in'] == df['State_lives_in'] ].groupby('State_born_in').agg(['count']).sum()
counts / len(df['State_born_in'])

Related

Improve performance of a nested apply in pandas

I have a pandas DataFrame names and a Series illegal_words.
# names - ca. 250k rows
name
0 MISS ELFRIEDA ALPERT
1 DALE VON PETTY
2 MOHAMMAD IBN MASILLAH
3 YELENA THE MORRIS
4 MR. SHENNA DEMOSS
...
# illegal_words - ca. 2k rows
0 MISS
1 VON
2 THE
...
I want to remove any illegal word from names.
I'm calling a loop in a loop:
import re
for word in illegal_words:
    names['name'] = names['name'].apply(lambda x: re.sub(word, '', x))
Output:
# names
name
0 ELFRIEDA ALPERT
1 DALE PETTY
2 MOHAMMAD MASILLAH
3 YELENA MORRIS
4 SHENNA DEMOSS
...
(double spaces are not much of an issue)
And it works, but... this is very slow. The re.sub() method is called 2'000 * 250'000 = 500'000'000 times!
What can I do to speed it up?
(since the illegal_words is very project-specific, I cannot use any external package)
Can you try this?
illegal_words = ['MISS', 'VON', 'THE']
out = names['name'].str.replace(fr"({'|'.join(illegal_words)}) ", '', regex=True)
>>> out
0 ELFRIEDA ALPERT
1 DALE PETTY
2 MOHAMMAD IBN MASILLAH
3 YELENA MORRIS
4 MR. SHENNA DEMOSS
Name: name, dtype: object
Performance
For a random list of 2,500 words and 250,000 records:
%timeit names['name'].str.replace(fr"({'|'.join(illegal_words)}) ", '', regex=True)
130 ms ± 870 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
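One caveat worth adding (my note, not part of the original answer): if the illegal words could contain regex metacharacters, or could appear as prefixes of longer legitimate words, it may be safer to escape them and anchor the pattern on word boundaries. A sketch under those assumptions, using the names DataFrame from the question:
import re
# re.escape guards against metacharacters; \b keeps 'MISS' from matching inside e.g. 'MISSY'.
pattern = r'\b(?:' + '|'.join(map(re.escape, illegal_words)) + r')\b ?'
out = names['name'].str.replace(pattern, '', regex=True).str.strip()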

pandas getting most frequent names from a column which has list of names

My dataframe is like this:
star_rating actors_list
0 9.3 [u'Tim Robbins', u'Morgan Freeman']
1 9.2 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 [u'Al Pacino', u'Robert De Niro']
3 9.0 [u'Christian Bale', u'Heath Ledger']
4 8.9 [u'John Travolta', u'Uma Thurman']
I want to extract the most frequent names in the actors_list column. I found this code; do you have a better suggestion, especially for big data?
import pandas as pd
df= pd.read_table (r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv',sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
Expected output (for this data):
robert de niro 13
tom hanks 12
clint eastwood 11
johnny depp 10
al pacino 10
james stewart 9
By my tests, it would be much faster to do the regex cleanup after counting.
from itertools import chain
import re
p = re.compile("""^u['"](.*)['"]$""")
ser = pd.Series(list(chain.from_iterable(
x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]
ser.head()
Robert De Niro 18
Brad Pitt 14
Clint Eastwood 14
Tom Hanks 14
Al Pacino 13
dtype: int64
It's always better to go for plain Python than depending on pandas here, since pandas consumes a huge amount of memory if the lists are large.
If the longest list has 1000 elements, then every shorter list gets padded with NaNs when you use expand=True, which is a waste of memory. Try this instead.
df = pd.concat([df]*1000) # For the sake of large df.
%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop
%%timeit
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop
%%timeit
words = {}
for i in df['actors_list']:
    for w in i:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
100 loops, best of 3: 5.44 ms per loop
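For what it's worth, the same manual count can be written with collections.Counter, which is at least as concise and typically just as fast. A sketch, assuming actors_list already holds real Python lists (as produced by the str.split step above):
from collections import Counter
from itertools import chain
# Flatten the per-movie lists and count every name in a single pass.
counts = Counter(chain.from_iterable(df['actors_list']))
counts.most_common(5)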
I will use ast to convert the list-like strings into real lists:
import ast
df.actors_list=df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
Benchmarking with this code I got a chart (not reproduced here) in which coldspeed's code is wen2(), Dark's code is wen4(), my code is wen1(), and W-B's code is wen3().
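A related option (my addition, not benchmarked above): once the strings have been parsed into real lists with ast.literal_eval, Series.explode (pandas >= 0.25) gives the same counts without building an intermediate DataFrame:
# Each list element becomes its own row; value_counts then tallies the names.
df.actors_list.explode().value_counts().head()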

How can I only sum the first element of each group in the first GroupBy column when there are multiple GroupBy columns? [duplicate]

This is obviously simple, but as a numpy newbie I'm getting stuck.
I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.
I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
This returns:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.
Update 2022-03
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Thanks to this comment by Paul Rougieux for surfacing it.
Original Answer (2014)
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
Returns:
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
(This solution is inspired by this article: https://pbpython.com/pandas_transform.html)
I find the following solution to be the simplest (and probably the fastest), using transformation:
Transformation: While aggregation must return a reduced version of the
data, transformation can return some transformed version of the full
data to recombine. For such a transformation, the output is the same
shape as the input.
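To make the shape difference concrete, here is a small sketch using the df defined above:
# Aggregation reduces: one value per state (length 4).
df.groupby('state')['sales'].sum()
# Transformation preserves shape: one value per original row (length 12), aligned with df.
df.groupby('state')['sales'].transform('sum')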
So using transformation, the solution is a one-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
state office_id sales %
0 AZ 2 195197 9.844309
1 AZ 4 877890 44.274352
2 AZ 6 909754 45.881339
3 CA 1 614752 50.415708
4 CA 3 395340 32.421767
5 CA 5 209274 17.162525
6 CO 1 549430 42.659629
7 CO 3 457514 35.522956
8 CO 5 280995 21.817415
9 WA 2 828238 35.696929
10 WA 4 719366 31.004563
11 WA 6 772590 33.298509
You need to make a second groupby object that groups by the states, and then use the div method:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
sales
state office_id
AZ 2 16.981365
4 19.250033
6 63.768601
CA 1 19.331879
3 33.858747
5 46.809373
CO 1 36.851857
3 19.874290
5 43.273852
WA 2 34.707233
4 35.511259
6 29.781508
The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
For conciseness I'd use the SeriesGroupBy:
In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
In [12]: c
Out[12]:
state office_id
AZ 2 925105
4 592852
6 362198
CA 1 819164
3 743055
5 292885
CO 1 525994
3 338378
5 490335
WA 2 623380
4 441560
6 451428
Name: count, dtype: int64
In [13]: c / c.groupby(level=0).sum()
Out[13]:
state office_id
AZ 2 0.492037
4 0.315321
6 0.192643
CA 1 0.441573
3 0.400546
5 0.157881
CO 1 0.388271
3 0.249779
5 0.361949
WA 2 0.411101
4 0.291196
6 0.297703
Name: count, dtype: float64
For multiple groups you have to use transform (using Radical's df):
In [21]: c = df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")
In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1 Group 2 Final Group
AAHQ BOSC OWON 0.331006
TLAM 0.668994
MQVF BWSI 0.288961
FXZM 0.711039
ODWV NFCH 0.262395
...
Name: count, dtype: float64
This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).
I think this needs benchmarking. Using OP's original DataFrame,
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
0th Caner
NEW Pandas transform looks much faster.
df['sales'] / df.groupby('state')['sales'].transform('sum')
1.32 ms ± 352 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
1st Andy Hayden
As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.
c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()
3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
2nd Paul H
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100
4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
3rd exp1orer
This is the slowest answer as it calculates x.sum() for each x in level 0.
For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).
Here is the modification,
(
df.groupby(['state', 'office_id'])
.agg({'sales': 'sum'})
.groupby(level=0)
.apply(lambda x: 100 * x / float(x.sum()))
)
10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)
So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed up, and on a larger dataset with high-cardinality groupbys it is going to make a massive difference.
Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,
import string
import numpy as np
import pandas as pd
np.random.seed(0)
groups = [
''.join(i) for i in zip(
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
)
]
df = pd.DataFrame({'state': groups * 400,
'office_id': list(range(1, 601)) * 20000,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)] * 1000000
})
Using Caner's,
0.791 s ± 19.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
Using Andy's,
2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
and exp1orer
19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)
So now we see a 10x speed up on large, high-cardinality datasets with Andy's, but a very impressive 20x speed up with Caner's.
Be sure to upvote these three answers if you upvote this one!!
Edit: added Caner benchmark
I realize there are already good answers here.
I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.
It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).
The following snippet fulfills these criteria:
df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())
Note that if you're still using Python 2, you'll have to replace the x in the denominator of the lambda term by float(x).
I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation, so now it's super fast! Below is the example code:
Create the test dataframe with 50,000 unique groups
import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)
# This is the total number of groups to be created
NumberOfGroups = 50000
# Create a lot of groups (random strings of 4 letters)
Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 10)] * 10
Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 2)] * 2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
'Group 2': Group2,
'Final Group': FinalGroup,
'Numbers I want as percents': NumbersForPercents})
When grouped it looks like:
Numbers I want as percents
Group 1 Group 2 Final Group
AAAH AQYR RMCH 847
XDCL 182
DQGO ALVF 132
AVPH 894
OVGH NVOO 650
VKQP 857
VNLY HYFW 884
MOYH 469
XOOC GIDS 168
HTOY 544
AACE HNXU RAXK 243
YZNK 750
NOYI NYGC 399
ZYCI 614
QKGK CRLF 520
UXNA 970
TXAR MLNB 356
NMFJ 904
VQYG NPON 504
QPKQ 948
...
[50000 rows x 1 columns]
Array method of finding percentage:
# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values over each (Group 1, Group 2) pair, append "_Sum" to its column name, and turn it back into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
This method takes about 0.15 seconds.
Top answer method (using lambda function):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
This method takes about 21 seconds to produce the same result.
The result:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group
0 AAAH AQYR RMCH 847 82.312925
1 AAAH AQYR XDCL 182 17.687075
2 AAAH DQGO ALVF 132 12.865497
3 AAAH DQGO AVPH 894 87.134503
4 AAAH OVGH NVOO 650 43.132050
5 AAAH OVGH VKQP 857 56.867950
6 AAAH VNLY HYFW 884 65.336290
7 AAAH VNLY MOYH 469 34.663710
8 AAAH XOOC GIDS 168 23.595506
9 AAAH XOOC HTOY 544 76.404494
The most elegant way to find percentages across columns or index is to use pd.crosstab.
Sample Data
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
The output dataframe is like this
print(df)
state office_id sales
0 CA 1 764505
1 WA 2 313980
2 CO 3 558645
3 AZ 4 883433
4 CA 5 301244
5 WA 6 752009
6 CO 1 457208
7 AZ 2 259657
8 CA 3 584471
9 WA 4 122358
10 CO 5 721845
11 AZ 6 136928
Just specify the index, columns and the values to aggregate. The normalize keyword will calculate % across index or columns depending upon the context.
result = pd.crosstab(index=df['state'],
columns=df['office_id'],
values=df['sales'],
aggfunc='sum',
normalize='index').applymap('{:.2f}%'.format)
print(result)
office_id 1 2 3 4 5 6
state
AZ 0.00% 0.20% 0.00% 0.69% 0.00% 0.11%
CA 0.46% 0.00% 0.35% 0.00% 0.18% 0.00%
CO 0.26% 0.00% 0.32% 0.00% 0.42% 0.00%
WA 0.00% 0.26% 0.00% 0.10% 0.00% 0.63%
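Note (my addition): normalize='index' returns row fractions that sum to 1, so the values above are fractions with a literal '%' appended by the format string. If true percentages are wanted, scale by 100 before formatting, for example:
result = (100 * pd.crosstab(index=df['state'],
                            columns=df['office_id'],
                            values=df['sales'],
                            aggfunc='sum',
                            normalize='index')).round(2)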
You can sum the whole DataFrame and divide by the state total:
# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
df
Returns
office_id sales state sales_ratio
0 1 405711 CA 0.193319
1 2 535829 WA 0.347072
2 3 217952 CO 0.198743
3 4 252315 AZ 0.192500
4 5 982371 CA 0.468094
5 6 459783 WA 0.297815
6 1 404137 CO 0.368519
7 2 222579 AZ 0.169814
8 3 710581 CA 0.338587
9 4 548242 WA 0.355113
10 5 474564 CO 0.432739
11 6 835831 AZ 0.637686
But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is a string instead, you get an error:
df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
TypeError: unsupported operand type(s) for /: 'str' and 'str'
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
df = df.groupby(['state', 'office_id'])['sales'].sum().rename("weightage").groupby(level=0).transform(lambda x: x/x.sum())
df = df.reset_index()
Output:
state office_id weightage
0 AZ 2 0.169814
1 AZ 4 0.192500
2 AZ 6 0.637686
3 CA 1 0.193319
4 CA 3 0.338587
5 CA 5 0.468094
6 CO 1 0.368519
7 CO 3 0.198743
8 CO 5 0.432739
9 WA 2 0.347072
10 WA 4 0.355113
11 WA 6 0.297815
I think this does the trick in one line (the inner groupby(level=0) makes each row divide by its state total rather than the grand total):
df.groupby(['state', 'office_id']).sum().groupby(level=0).transform(lambda x: x/np.sum(x)*100)
A simple way I have used is a merge after the two groupbys, then doing a simple division.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])
state office_id sales_x sales_y sales_ratio
0 AZ 2 222579 1310725 16.981365
1 AZ 4 252315 1310725 19.250033
2 AZ 6 835831 1310725 63.768601
3 CA 1 405711 2098663 19.331879
4 CA 3 710581 2098663 33.858747
5 CA 5 982371 2098663 46.809373
6 CO 1 404137 1096653 36.851857
7 CO 3 217952 1096653 19.874290
8 CO 5 474564 1096653 43.273852
9 WA 2 535829 1543854 34.707233
10 WA 4 548242 1543854 35.511259
11 WA 6 459783 1543854 29.781508
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)]})
grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()
Returns:
sales
state office_id
AZ 2 54.587910
4 33.009225
6 12.402865
CA 1 32.046582
3 44.937684
5 23.015735
CO 1 21.099989
3 31.848658
5 47.051353
WA 2 43.882790
4 10.265275
6 45.851935
As someone who is also learning pandas, I found the other answers a bit implicit, as pandas hides most of the work behind the scenes, namely in how the operation automatically matches up column and index names. This code should be equivalent to a step-by-step version of exp1orer's accepted answer.
Starting from the grouped df, which I'll refer to by the alias state_office_sales:
sales
state office_id
AZ 2 839507
4 373917
6 347225
CA 1 798585
3 890850
5 454423
CO 1 819975
3 202969
5 614011
WA 2 163942
4 369858
6 959285
state_total_sales is state_office_sales summed over index level 0 (the leftmost level).
In: state_total_sales = df.groupby(level=0).sum()
state_total_sales
Out:
sales
state
AZ 2448009
CA 2832270
CO 1495486
WA 595859
Because the two dataframes share an index name and a column name, pandas will line up the appropriate locations through the shared indexes:
In: state_office_sales / state_total_sales
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 0.288022
3 0.322169
5 0.389809
CO 1 0.206684
3 0.357891
5 0.435425
WA 2 0.321689
4 0.346325
6 0.331986
To illustrate this even better, here is a partial total with an XX state that has no equivalent. Pandas will match locations based on index and column names; where there is no overlap, it produces NaN:
In: partial_total = pd.DataFrame(
data = {'sales' : [2448009, 595859, 99999]},
index = ['AZ', 'WA', 'XX' ]
)
partial_total.index.name = 'state'
Out:
sales
state
AZ 2448009
WA 595859
XX 99999
In: state_office_sales / partial_total
Out:
sales
state office_id
AZ 2 0.448640
4 0.125865
6 0.425496
CA 1 NaN
3 NaN
5 NaN
CO 1 NaN
3 NaN
5 NaN
WA 2 0.321689
4 0.346325
6 0.331986
This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has no index name.
In: missing_index_totals = state_total_sales.rename_axis("")
missing_index_totals
Out:
sales
AZ 2448009
CA 2832270
CO 1495486
WA 595859
In: state_office_sales / missing_index_totals
Out: ValueError: cannot join with no overlapping index names
df.groupby('state').office_id.value_counts(normalize = True)
This uses the value_counts method, but it returns proportions like 0.70 and 0.30, not percentages like 70 and 30.
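If percentages are preferred, the result can simply be scaled (note that this, like the line above, counts office_id occurrences per state rather than summing sales):
100 * df.groupby('state').office_id.value_counts(normalize=True)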
One-line solution:
df.join(
df.groupby('state').agg(state_total=('sales', 'sum')),
on='state'
).eval('sales / state_total')
This returns a Series of per-office ratios -- it can be used on its own or assigned back to the original DataFrame.
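For example, to keep the result as a new column (the column name pct here is just for illustration):
df['pct'] = 100 * df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')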

optimize a string query with pandas on large data

I have a dataframe data, which has close to 4 million rows. It is a list of cities of the world. I need to query by city name as fast as possible.
The best I found so far takes 346 ms, by setting the city name as the index:
d2=data.set_index("city",inplace=False)
timeit d2.loc[['PARIS']]
1 loop, best of 3: 346 ms per loop
This is still much too slow. I wonder if I could achieve a faster query with groupby (and how to write such a query). Each city has around 10 rows in the dataframe (duplicate cities).
I searched for several days and could not find a clear solution on the internet. Thank you.
Setup
df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(100000)]*10,columns=['city','value'])
Baseline
df2 = df.set_index('city')
%timeit df2.loc[['Paris9999']]
10 loops, best of 3: 45.6 ms per loop
Solution
Use a lookup dict and then iloc:
idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
%timeit df.iloc[idx_dict['Paris9999']]
1000 loops, best of 3: 432 µs per loop
It seems this approach is almost 100 times faster than the baseline.
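(My note: a similar lookup table can be built directly from GroupBy.indices, which maps each city to the positional row indices of its group and therefore plugs straight into iloc; with the default RangeIndex used here it matches the dict above.)
idx_dict = df.groupby('city').indices
df.iloc[idx_dict['Paris9999']]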
Comparing to other approaches:
%timeit df2[df2.index.values=="Paris9999"]
100 loops, best of 3: 16.7 ms per loop
%timeit full_array_based(df2, "Paris9999")
10 loops, best of 3: 19.6 ms per loop
Working with the array data underlying the index, comparing against the needed value and then using the mask from the comparison might be one option when looking for performance. A sample case should make things clear.
1) Input dataframes :
In [591]: df
Out[591]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [592]: d2 = df.set_index("city",inplace=False)
In [593]: d2
Out[593]:
population
city
Delhi 1000
Paris 56
NY 89
Paris 36
Delhi 300
Paris 52
Paris 34
Delhi 40
NY 89
Delhi 450
2) Indexing with .loc :
In [594]: d2.loc[['Paris']]
Out[594]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
3) Use mask based indexing :
In [595]: d2[d2.index.values=="Paris"]
Out[595]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
4) Finally timings :
In [596]: %timeit d2.loc[['Paris']]
1000 loops, best of 3: 475 µs per loop
In [597]: %timeit d2[d2.index.values=="Paris"]
10000 loops, best of 3: 156 µs per loop
Further boost
Going further with using array data, we can extract the entire input dataframe as array and index into it. Thus, an implementation using that philosophy would look something like this -
def full_array_based(d2, indexval):
    df0 = pd.DataFrame(d2.values[d2.index.values==indexval])
    df0.index = [indexval]*df0.shape[0]
    df0.columns = d2.columns
    return df0
Sample run and timings -
In [635]: full_array_based(d2, "Paris")
Out[635]:
population
Paris 56
Paris 36
Paris 52
Paris 34
In [636]: %timeit full_array_based(d2, "Paris")
10000 loops, best of 3: 146 µs per loop
If we are allowed to pre-process and set up a dictionary that can be indexed for city-string-based data extraction from the input dataframe, here's one solution using NumPy to do so -
def indexed_dict_numpy(df):
    cs = df.city.values.astype(str)
    sidx = cs.argsort()
    scs = cs[sidx]
    idx = np.concatenate(( [0], np.flatnonzero(scs[1:] != scs[:-1])+1, [cs.size]))
    return {n:sidx[i:j] for n,i,j in zip(cs[sidx[idx[:-1]]], idx[:-1], idx[1:])}
Sample run -
In [10]: df
Out[10]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [11]: dict1 = indexed_dict_numpy(df)
In [12]: df.iloc[dict1['Paris']]
Out[12]:
city population
1 Paris 56
3 Paris 36
5 Paris 52
6 Paris 34
Runtime test against Allen's solution to set up a similar dictionary with 4 million rows -
In [43]: # Setup 4 miliion rows of df
...: df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(400000)]*10,\
...: columns=['city','value'])
...: np.random.shuffle(df.values)
...:
In [44]: %timeit df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
1 loops, best of 3: 2.01 s per loop
In [45]: %timeit indexed_dict_numpy(df)
1 loops, best of 3: 1.15 s per loop

