pandas slicing multiindex dataframe - python

I want to slice a multi-index pandas DataFrame.
Here is the code to build my test data:
import pandas as pd
testdf = {
'Name': {
0: 'H', 1: 'H', 2: 'H', 3: 'H', 4: 'H'}, 'Division': {
0: 'C', 1: 'C', 2: 'C', 3: 'C', 4: 'C'}, 'EmployeeId': {
0: 14, 1: 14, 2: 14, 3: 14, 4: 14}, 'Amt1': {
0: 124.39, 1: 186.78, 2: 127.94, 3: 258.35000000000002, 4: 284.77999999999997}, 'Amt2': {
0: 30.0, 1: 30.0, 2: 30.0, 3: 30.0, 4: 60.0}, 'Employer': {
0: 'Z', 1: 'Z', 2: 'Z', 3: 'Z', 4: 'Z'}, 'PersonId': {
0: 14, 1: 14, 2: 14, 3: 14, 4: 15}, 'Provider': {
0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B'}, 'Year': {
0: 2012, 1: 2012, 2: 2013, 3: 2013, 4: 2012}}
testdf = pd.DataFrame(testdf)
testdf
grouper_keys = [
'Employer',
'Year',
'Division',
'Name',
'EmployeeId',
'PersonId']
testdf2 = pd.pivot_table(data=testdf,
values='Amt1',
index=grouper_keys,
columns='Provider',
fill_value=None,
margins=False,
dropna=True,
aggfunc=('sum', 'count'),
)
print(testdf2)
gives:
Now I can get only sum for A or B using
testdf2.loc[:, slice(None, ('sum', 'A'))]
which gives
How can I get both sum and count for only A or only B?

Use xs to take a cross-section on the Provider column level:
testdf2.xs('A', axis=1, level=1)
Or keep the column level with drop_level=False
testdf2.xs('A', axis=1, level=1, drop_level=False)
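To see what the two xs calls return without building the full pivot table, here is a minimal sketch on a small frame with two-level columns. The names cols and demo and all the values are made up for illustration; they are not the question's data.

```python
import pandas as pd

# Hypothetical two-level columns: aggregate name on level 0, Provider on level 1.
cols = pd.MultiIndex.from_product([["count", "sum"], ["A", "B"]],
                                  names=[None, "Provider"])
demo = pd.DataFrame([[1, 2, 10.0, 20.0], [3, 4, 30.0, 40.0]], columns=cols)

# Cross-section on the Provider level: that level is dropped from the result,
# leaving flat 'count' and 'sum' columns for provider A.
only_a = demo.xs("A", axis=1, level=1)

# Same selection, but the Provider level stays in the column index.
only_a_kept = demo.xs("A", axis=1, level=1, drop_level=False)

print(only_a)
print(only_a_kept)
```

Keeping the level is handy when the result is later concatenated with the selection for another provider, since the column labels stay unambiguous.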

You can use:
idx = pd.IndexSlice
df = testdf2.loc[:, idx[['sum', 'count'], 'A']]
print (df)
                                                   sum count
Provider                                             A     A
Employer Year Division Name EmployeeId PersonId
Z        2012 C        H    14         14       311.17   2.0
                                       15          NaN   NaN
         2013 C        H    14         14       386.29   2.0
Another solution:
df = testdf2.loc[:, (slice('sum','count'), ['A'])]
print (df)
                                                   sum count
Provider                                             A     A
Employer Year Division Name EmployeeId PersonId
Z        2012 C        H    14         14       311.17   2.0
                                       15          NaN   NaN
         2013 C        H    14         14       386.29   2.0
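The same idea can be tried on a tiny frame. This is only a sketch: cols, demo, a_all and picked are hypothetical names and the numbers are invented. It shows that pd.IndexSlice accepts a bare colon for "every label on this level", so the aggregate names do not have to be listed.

```python
import pandas as pd

# Hypothetical two-level columns standing in for the pivot table's columns.
cols = pd.MultiIndex.from_product([["count", "sum"], ["A", "B"]],
                                  names=[None, "Provider"])
demo = pd.DataFrame([[1, 2, 10.0, 20.0], [3, 4, 30.0, 40.0]], columns=cols)

idx = pd.IndexSlice

# ':' on the first level keeps every aggregate; 'A' picks the provider.
a_all = demo.loc[:, idx[:, "A"]]

# Listing the aggregates explicitly, as in the answer above.
picked = demo.loc[:, idx[["sum", "count"], "A"]]

print(a_all)
print(picked)
```

Unlike xs with its default drop_level=True, both selections keep the two column levels.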

Related

Pandas group_by multiple columns with conditional count

I have a dataframe, for instance:
df = pd.DataFrame({'Host': {0: 'N',
1: 'B',
2: 'N',
3: 'N',
4: 'N',
5: 'V',
6: 'B'},
'Registration': {0: 'Registered',
1: 'MR',
2: 'Registered',
3: 'Registered',
4: '',
5: 'Registered',
6: 'Registered'},
'Val': {0: 'N',
1: 'B',
2: 'N',
3: 'N',
4: '',
5: 'V',
6: 'B'},
'Sum': {0: 100.0,
1: 0.0,
2: 300.0,
3: 150.0,
4: 0.0,
5: 0.0,
6: 20.0}})
I want to get the count, for each Host. Something like:
df.groupby("Host").count()
"""
Host Registration Val Sum
B 2 2 2
N 4 4 4
V 1 1 1
"""
But I want it conditional as a function of each column. For example, I want to count in Sum, only those rows that have more than 0.0, and in the others the ones that are not empty. So my expected output would be:
"""
Host Registration Val Sum
B 2 2 1
N 3 3 3
V 1 1 0
"""
Not sure how to do that. My best attempt has been:
df.groupby("Host").agg({'Registration': lambda x: (x != "").count(),
'Val':lambda x: (x != "").count(),
'Sum': lambda x: (x != 0).count()})
But this produces the same output as df.groupby("Host").count()
Any suggestion?
First, a fix for your solution: count() counts every non-missing value, including False, so to count the True values use sum instead:
df = df.groupby("Host").agg({'Registration': lambda x: (x != "").sum(),
'Val':lambda x: (x != "").sum(),
'Sum': lambda x: (x != 0).sum()})
print (df)
Registration Val Sum
Host
B 2 2 1.0
N 3 3 3.0
V 1 1 0.0
Improved solution: create the boolean columns first, then aggregate with sum:
df = df.assign(Registration = df['Registration'].ne(""),
Val = df['Val'].ne(""),
Sum = df['Sum'].ne(0)).groupby("Host").sum()
print (df)
Registration Val Sum
Host
B 2 2 1
N 3 3 3
V 1 1 0
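The question asks for Sum values "more than 0.0", while the answers test for "not equal to 0". The two agree on this data, but if negative amounts could occur, a strict comparison may be closer to the intent. A minimal sketch under that assumption follows; the names flags and counts are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Host": ["N", "B", "N", "N", "N", "V", "B"],
    "Registration": ["Registered", "MR", "Registered", "Registered", "",
                     "Registered", "Registered"],
    "Val": ["N", "B", "N", "N", "", "V", "B"],
    "Sum": [100.0, 0.0, 300.0, 150.0, 0.0, 0.0, 20.0],
})

# One boolean flag per column: True where the row should be counted.
flags = pd.DataFrame({
    "Host": df["Host"],
    "Registration": df["Registration"].ne(""),  # non-empty string
    "Val": df["Val"].ne(""),                    # non-empty string
    "Sum": df["Sum"].gt(0),                     # strictly positive (assumption)
})

# Summing booleans counts the True values within each Host.
counts = flags.groupby("Host").sum()
print(counts)
```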

How to pivot my dataframe to exist on a single row only

I'm trying to pivot my dataframe so that there is a single row and a single cell for each summary X metric comparison. I have tried pivoting this, but can't figure out a sensible index column.
Here is my current output.
Does anyone know how to achieve my expected output?
To reproduce:
import pandas as pd
df = pd.DataFrame({'summary': {0: 'mean',
1: 'stddev',
2: 'mean',
3: 'stddev',
4: 'mean',
5: 'stddev'},
'metric': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'C'},
'value': {0: '2.0',
1: '1.5811388300841898',
2: '0.4',
3: '0.5477225575051661',
4: None,
5: None}})
Remove missing values by DataFrame.dropna, join columns together, convert to index and transpose by DataFrame.T:
df = df.dropna(subset=['value'])
df['g'] = df['summary'] + '_' + df['metric']
df = df.set_index('g')[['value']].T.reset_index(drop=True).rename_axis(None, axis=1)
print (df)
mean_A stddev_A mean_B stddev_B
0 2.0 1.5811388300841898 0.4 0.5477225575051661
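A variant of the same approach that avoids adding a helper column to a filtered frame: build the labels as a separate Series and construct the one-row frame directly. This is a sketch; kept, labels and wide are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "summary": ["mean", "stddev", "mean", "stddev", "mean", "stddev"],
    "metric": ["A", "A", "B", "B", "C", "C"],
    "value": ["2.0", "1.5811388300841898", "0.4", "0.5477225575051661",
              None, None],
})

# Drop the rows without a value (metric C here).
kept = df.dropna(subset=["value"])

# One label per remaining row, e.g. 'mean_A'.
labels = kept["summary"] + "_" + kept["metric"]

# A single row whose columns are the combined labels.
wide = pd.DataFrame([kept["value"].tolist()], columns=labels.tolist())
print(wide)
```

The values stay strings, as in the question's data; converting them with pd.to_numeric before building wide would be a separate step.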

Python Pandas group by mean() for a certain count of rows

I need a group-by mean() that uses only the first 2 values of each category. How do I define that?
My df looks like this:
category value
-> a 2
-> a 5
a 4
a 8
-> b 6
-> b 3
b 1
-> c 2
-> c 2
c 7
Reading only the arrowed rows, the output should look like:
category mean
a 3.5
b 4.5
c 2
How can I do this?
I tried the following, but I do not know where to specify that only the first 2 observations from each category should be used:
output = df.groupby(['category'])['value'].mean().reset_index()
Your help is appreciated, thanks in advance.
You can also do this via groupby() and agg():
out=df.groupby('category',as_index=False)['value'].agg(lambda x:x.head(2).mean())
Try apply on each group of values and use head(2) to just get the first 2 values then mean:
import pandas as pd
df = pd.DataFrame({
'category': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b',
6: 'b', 7: 'c', 8: 'c', 9: 'c'},
'value': {0: 2, 1: 5, 2: 4, 3: 8, 4: 6, 5: 3, 6: 1, 7: 2,
8: 2, 9: 7}
})
output = df.groupby('category', as_index=False)['value'] \
.apply(lambda a: a.head(2).mean())
print(output)
output:
category value
0 a 3.5
1 b 4.5
2 c 2.0
Or create a boolean index to filter df with:
m = df.groupby('category').cumcount().lt(2)
output = df[m].groupby('category')['value'].mean().reset_index()
print(output)
category value
0 a 3.5
1 b 4.5
2 c 2.0
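Another way to express the same filter is GroupBy.head, which returns the first n rows of every group in their original order. A short sketch (first_two is a name introduced here):

```python
import pandas as pd

df = pd.DataFrame({
    "category": list("aaaabbbccc"),
    "value": [2, 5, 4, 8, 6, 3, 1, 2, 2, 7],
})

# Keep only the first two rows of each category.
first_two = df.groupby("category").head(2)

# Then take an ordinary group mean over what is left.
output = first_two.groupby("category", as_index=False)["value"].mean()
print(output)
```

Because the filtering happens once up front, the aggregation itself is a plain built-in mean rather than a Python lambda called per group.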

pd.groupby() intermediary sums

on Python 3.6, pandas 1.1.2
I am trying to make intermediary sums. I could certainly use the sum(level) but this is not elegant nor optimal and I am wondering if there would be a better way.
For example:
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'}, 'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}}).groupby(by=['level_0', 'level_1', 'level_2']).sum()
Gives me:
                        value
level_0 level_1 level_2
a       aa      aaa         5
                aab         2
        bb      bba         3
b       aa      aaa         5
                aab         9
                aac         2
        cc      cca         2
                ccb         9
c       bb      bba         1
                bbb         9
        cc      cca         9
                ccb         5
                ccc         5
        dd      dda         5
                ddb         5
                ddc         3
Now, I would like to be able to get the subtotal for each level_0 and level_1, such as below:
There you are:
import pandas as pd
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'},
'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}})
gb1 = df.groupby(by=['level_0', 'level_1', 'level_2']).sum().reset_index()
gb2 = df.groupby(by=['level_0', 'level_1']).sum().reset_index()
gb3 = df.groupby(by=['level_0']).sum().reset_index()
gb2['level_2'] = ''
gb3['level_1'] = ''
gb3['level_2'] = ''
gb_all = pd.concat((gb1, gb2, gb3), axis=0)
gb_all.sort_values(['level_0', 'level_1', 'level_2'], inplace=True)
gb_all.reset_index(inplace=True, drop=True)
print(gb_all)
Output:
   level_0 level_1 level_2  value
0        a                     10
1        a      aa              7
2        a      aa     aaa      5
3        a      aa     aab      2
4        a      bb              3
5        a      bb     bba      3
6        b                     27
7        b      aa             16
8        b      aa     aaa      5
9        b      aa     aab      9
10       b      aa     aac      2
11       b      cc             11
12       b      cc     cca      2
13       b      cc     ccb      9
14       c                     42
15       c      bb             10
16       c      bb     bba      1
17       c      bb     bbb      9
18       c      cc             19
19       c      cc     cca      9
20       c      cc     ccb      5
21       c      cc     ccc      5
22       c      dd             13
23       c      dd     dda      5
24       c      dd     ddb      5
25       c      dd     ddc      3
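The three group-bys above follow one pattern: sum over every prefix of the key list and blank out the keys that were not grouped on. That pattern can be sketched as a loop. The helper with_subtotals is hypothetical, not a pandas function, and the small frame is a trimmed-down stand-in for the question's data.

```python
import pandas as pd

def with_subtotals(frame, keys, value):
    """Sum `value` for every prefix of `keys`; ungrouped keys become ''."""
    parts = []
    for depth in range(len(keys), 0, -1):
        part = frame.groupby(keys[:depth], as_index=False)[value].sum()
        for missing in keys[depth:]:
            part[missing] = ""          # marks a subtotal row
        parts.append(part[keys + [value]])
    out = pd.concat(parts, ignore_index=True)
    # '' sorts before any label, so each subtotal lands above its details.
    return out.sort_values(keys).reset_index(drop=True)

small = pd.DataFrame({
    "level_0": ["a", "a", "a", "b"],
    "level_1": ["aa", "aa", "bb", "aa"],
    "level_2": ["aaa", "aab", "bba", "aaa"],
    "value": [5, 2, 3, 5],
})
result = with_subtotals(small, ["level_0", "level_1", "level_2"], "value")
print(result)
```

Selecting the value column before summing keeps the string key columns out of the aggregation, which matters on pandas versions that no longer silently drop non-numeric columns.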

Apply function to a MultiIndex dataframe with pandas/python

I have the following DataFrame that I wish to apply some date range calculations to. I want to select rows in the data frame where the date difference between samples for unique persons (from sample_date) is less than 8 weeks and keep the row with the oldest date (i.e. the first sample).
Here is an example dataset. The actual dataset can exceed 200,000 records.
labno name sex dob id location sample_date
1 John A M 12/07/1969 12345 A 12/05/2112
2 John B M 10/01/1964 54321 B 6/12/2010
3 James M 30/08/1958 87878 A 30/04/2012
4 James M 30/08/1958 45454 B 29/04/2012
5 Peter M 12/05/1935 33322 C 15/07/2011
6 John A M 12/07/1969 12345 A 14/05/2012
7 Peter M 12/05/1935 33322 A 23/03/2011
8 Jack M 5/12/1921 65655 B 15/08/2011
9 Jill F 6/08/1986 65459 A 16/02/2012
10 Julie F 4/03/1992 41211 C 15/09/2011
11 Angela F 1/10/1977 12345 A 23/10/2006
12 Mark A M 1/06/1955 56465 C 4/04/2011
13 Mark A M 1/06/1955 45456 C 3/04/2011
14 Mark B M 9/12/1984 55544 A 13/09/2012
15 Mark B M 9/12/1984 55544 A 1/01/2012
Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.
I normally use R for this procedure: I generate a list of dataframes based on the name/dob combination and sort each dataframe by sample_date. I then use a list apply function to check the difference in date between the first and last index within each dataframe, and return the oldest row if it is less than 8 weeks from the most recent date. It takes forever.
I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.
Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).
{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3:
'30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935',
7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977',
11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14:
'9/12/1984'}, 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454,
4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211,
10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7:
8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A',
6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C',
13: 'A', 14: 'A'}, 'name': {0: 'John A', 1: 'John B', 2:
'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7:
'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A',
12: 'Mark A', 13: 'Mark B', 14: 'Mark B'}, 'sample_date': {0:
'12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012',
4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7:
'15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10:
'23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13:
'13/09/2012', 14: '1/01/2012'}, 'sex': {0: 'M', 1: 'M', 2: 'M',
3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F',
10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
I think what you might be looking for is
import numpy as np

def differ(df):
    # assumes sample_date has already been converted to datetime64
    delta = df.sample_date.diff().abs()  # only care about magnitude
    cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
    return df[cond].max()

delta = df.groupby(['dob', 'name']).apply(differ)
Depending on whether or not you want to keep people who don't have more than 1 sample you can call delta.dropna(how='all') to remove them.
Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64/datetime64 for numpy < 1.7.
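For a large dataset it may be worth avoiding apply altogether. The sketch below takes one reading of the question as an assumption: a person qualifies when they have more than one sample and the span from their first to their last sample is under 8 weeks, and the earliest row is then kept. It uses a handful of rows from the question's data, and samples, by_person, span and close_pairs are names introduced here.

```python
import pandas as pd

samples = pd.DataFrame({
    "labno": [3, 4, 14, 15, 9],
    "name": ["James", "James", "Mark B", "Mark B", "Jill"],
    "dob": ["30/08/1958", "30/08/1958", "9/12/1984", "9/12/1984", "6/08/1986"],
    "sample_date": ["30/04/2012", "29/04/2012", "13/09/2012", "1/01/2012",
                    "16/02/2012"],
})

# Dates in the question are day/month/year, so state the format explicitly.
samples["sample_date"] = pd.to_datetime(samples["sample_date"],
                                        format="%d/%m/%Y")

by_person = samples.groupby(["name", "dob"])["sample_date"]
first = by_person.transform("min")          # earliest sample per person
span = by_person.transform("max") - first   # first-to-last span per person
n_samples = by_person.transform("count")    # samples per person

# Earliest row of every person whose samples all fall within 8 weeks.
keep = (n_samples > 1) & (span < pd.Timedelta(weeks=8)) \
    & samples["sample_date"].eq(first)
close_pairs = samples[keep]
print(close_pairs)
```

Every step is a vectorized group transform, so no Python function is called per person. If the intended rule compares consecutive samples rather than first and last, the span line would need to change accordingly.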
