Using Python 3.6, pandas 1.1.2.
I am trying to compute intermediate sums (subtotals). I could certainly use sum(level=...), but that is neither elegant nor optimal, and I am wondering whether there is a better way.
For example:
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'}, 'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}}).groupby(by=['level_0', 'level_1', 'level_2']).sum()
Gives me:
value
level_0 level_1 level_2
a aa aaa 5
aab 2
bb bba 3
b aa aaa 5
aab 9
aac 2
cc cca 2
ccb 9
c bb bba 1
bbb 9
cc cca 9
ccb 5
ccc 5
dd dda 5
ddb 5
ddc 3
Now I would like to get the subtotals for each level_0 and level_1, as below:
There you are:
import pandas as pd
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'},
'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}})
gb1 = df.groupby(by=['level_0', 'level_1', 'level_2']).sum().reset_index()
gb2 = df.groupby(by=['level_0', 'level_1']).sum().reset_index()
gb3 = df.groupby(by=['level_0']).sum().reset_index()
gb2['level_2'] = ''
gb3['level_1'] = ''
gb3['level_2'] = ''
gb_all = pd.concat((gb1, gb2, gb3), axis=0)
gb_all.sort_values(['level_0', 'level_1', 'level_2'], inplace=True)
gb_all.reset_index(inplace=True, drop=True)
print(gb_all)
Output:
level_0 level_1 level_2 value
0 a 10
1 a aa 7
2 a aa aaa 5
3 a aa aab 2
4 a bb 3
5 a bb bba 3
6 b 27
7 b aa 16
8 b aa aaa 5
9 b aa aab 9
10 b aa aac 2
11 b cc 11
12 b cc cca 2
13 b cc ccb 9
14 c 42
15 c bb 10
16 c bb bba 1
17 c bb bbb 9
18 c cc 19
19 c cc cca 9
20 c cc ccb 5
21 c cc ccc 5
22 c dd 13
23 c dd dda 5
24 c dd ddb 5
25 c dd ddc 3
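The three groupbys in the answer differ only in how many levels they keep, so the same idea can be written as a loop over level prefixes. Here is a self-contained sketch of that generalization, using a trimmed-down version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'level_0': ['a', 'a', 'a', 'b'],
    'level_1': ['aa', 'aa', 'bb', 'aa'],
    'level_2': ['aaa', 'aab', 'bba', 'aaa'],
    'value': [5, 2, 3, 5],
})

levels = ['level_0', 'level_1', 'level_2']
parts = []
for i in range(1, len(levels) + 1):
    # aggregate on the first i levels only
    gb = df.groupby(levels[:i], as_index=False)['value'].sum()
    for missing in levels[i:]:
        gb[missing] = ''  # blank out the levels not used at this depth
    parts.append(gb)

gb_all = (pd.concat(parts)
            .sort_values(levels)   # '' sorts first, so subtotals lead each group
            .reset_index(drop=True))
print(gb_all)
```

Because the empty string sorts before any label, each subtotal row lands just above its detail rows, matching the layout in the answer's output.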
I'm trying to convert the following dataframe into a JSON file:
id email surveyname question answer
1 lol#gmail s 1 apple
1 lol#gmail s 3 apple/juice
1 lol#gmail s 2 apple-pie
1 lol#gmail s 4 apple-pie
1 lol#gmail s 5 apple|pie|yes
1 lol#gmail s 6 apple
1 lol#gmail s 8 apple
1 lol#gmail s 7 apple
1 lol#gmail s 9 apple
1 lol#gmail s 12 apple
1 lol#gmail s 11 apple
1 lol#gmail s 10 apple_sauce
2 ll#gmail s 1 orange
2 ll#gmail s 3 juice
.
.
To:
{
"df":[
{
"id":"1",
"email:"lol#gmail"
"surveyname":"s",
"1":"apple",
"2":"apple-pie",
"3":"apple/juice",
"4":"apple-pie",
"5":"apple|pie|yes",
"6":"apple",
"7":"apple",
"8":"apple",
"9":"apple",
"10":"apple_sauce",
"11":"apple",
"12":"apple"
},
{
"id": "vid",
"email:"llgmail"
"surveyname: "s"
"1":"orange",
"2":"", # empty
"3":"juice",
.
.
.
}
]
}
It should map all the ids in the df and skip the numbers if they're empty.
Below is a sample of the df I used above. If the whole df for id = 2 needs to be constructed, please let me know and I can edit it in. Note, however, that some entries in the actual df don't have complete values.
d = {'id': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 2,
13: 2},
'email': {0: 'lol#gmail',
1: 'lol#gmail',
2: 'lol#gmail',
3: 'lol#gmail',
4: 'lol#gmail',
5: 'lol#gmail',
6: 'lol#gmail',
7: 'lol#gmail',
8: 'lol#gmail',
9: 'lol#gmail',
10: 'lol#gmail',
11: 'lol#gmail',
12: 'll#gmail',
13: 'll#gmail'},
'surveyname': {0: 's',
1: 's',
2: 's',
3: 's',
4: 's',
5: 's',
6: 's',
7: 's',
8: 's',
9: 's',
10: 's',
11: 's',
12: 's',
13: 's'},
'question': {0: 1,
1: 3,
2: 2,
3: 4,
4: 5,
5: 6,
6: 8,
7: 7,
8: 9,
9: 12,
10: 11,
11: 10,
12: 1,
13: 3},
'answer': {0: 'apple',
1: 'apple/juice',
2: 'apple-pie',
3: 'apple-pie',
4: 'apple|pie|yes',
5: 'apple',
6: 'apple',
7: 'apple',
8: 'apple',
9: 'apple',
10: 'apple',
11: 'apple_sauce',
12: 'orange',
13: 'juice'}}
df = pd.DataFrame.from_dict(d)
You can pivot the dataframe before exporting to JSON (note this needs numpy imported as np):
import numpy as np

(
    df.pivot_table(
        index=["id", "email", "surveyname"],
        columns="question",
        values="answer",
        aggfunc="first",
    )
    .reindex(columns=np.arange(1, 13))
    .fillna("")
    .reset_index()
    .to_json("data.json", orient="records")
)
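To see what this chain produces without writing a file, the same steps can be run on a trimmed-down frame and serialized to a string instead; the column names and values below are taken from the question's sample:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2],
    'email': ['lol#gmail', 'lol#gmail', 'll#gmail'],
    'surveyname': ['s', 's', 's'],
    'question': [1, 3, 1],
    'answer': ['apple', 'apple/juice', 'orange'],
})

wide = (
    df.pivot_table(index=['id', 'email', 'surveyname'],
                   columns='question', values='answer', aggfunc='first')
      .reindex(columns=np.arange(1, 4))  # force questions 1..3, even if unanswered
      .fillna('')
      .reset_index()
)
records = json.loads(wide.to_json(orient='records'))
```

Each record carries the id/email/surveyname keys plus one key per question number, with "" for questions that person never answered.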
I have a dataframe consisting of two columns. Column A consists of strings, column B of numbers. Column A has duplicates that I want to remove; however, I only want to retain the duplicates that have the highest number in column B. This is an example of what my dataframe looks like:
columnA | columnB
---------------------
a | 1
a | 2
b | 2
b | 1
What I want is this:
columnA | columnB
---------------------
a | 2
b | 2
Using drop_duplicates()
You can sort your dataframe in descending order on 'columnB', then use drop_duplicates() on 'columnA', keeping the first occurrence:
df.sort_values(by='columnB', ascending=False).drop_duplicates('columnA', keep='first')
columnA columnB
13 d 555
27 h 6
16 f 6
6 c 3
1 a 2
2 b 2
15 e 1
Sample data (slightly larger than your sample):
df.to_dict()
{'columnA': {0: 'a',
1: 'a',
2: 'b',
3: 'b',
4: 'c',
5: 'c',
6: 'c',
7: 'd',
8: 'd',
9: 'd',
10: 'd',
11: 'd',
12: 'd',
13: 'd',
14: 'e',
15: 'e',
16: 'f',
17: 'f',
18: 'f',
19: 'f',
20: 'f',
21: 'f',
22: 'h',
23: 'h',
24: 'h',
25: 'h',
26: 'h',
27: 'h'},
'columnB': {0: 1,
1: 2,
2: 2,
3: 1,
4: 1,
5: 2,
6: 3,
7: 33,
8: 223,
9: 3,
10: 2,
11: 1,
12: 3,
13: 555,
14: 1,
15: 1,
16: 6,
17: 5,
18: 4,
19: 3,
20: 2,
21: 1,
22: 1,
23: 2,
24: 3,
25: 4,
26: 5,
27: 6}}
Just group by 'columnA' and take the max of 'columnB':
df.groupby('columnA').max()
Grouping the dataframe by columnA and taking only the max of columnB can also help, and it leaves the original dataframe untouched:
df.groupby('columnA')['columnB'].max()
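Note that both groupby variants return only the grouped maxima. If you also need the other columns of the winning row, one option (a sketch, with a hypothetical extra column 'other') is to select the full rows via idxmax:

```python
import pandas as pd

df = pd.DataFrame({'columnA': ['a', 'a', 'b', 'b'],
                   'columnB': [1, 2, 2, 1],
                   'other': ['w', 'x', 'y', 'z']})

# row label of the max columnB within each columnA group,
# then pull those complete rows out of the original frame
best = df.loc[df.groupby('columnA')['columnB'].idxmax()]
```

This keeps every column of the row holding the maximum, which neither df.groupby(...).max() per column nor the single-column max gives you.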
For example, I have two DataFrames; the first is the one I want to select rows from, the second contains the criteria for selection.
df1 = pd.DataFrame({'chr': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7},
0: {0: 55241686,
1: 55242415,
2: 55248986,
3: 55259412,
4: 55260459,
5: 55266410,
6: 55268009},
1: {0: 55241736,
1: 55242513,
2: 55249171,
3: 55259567,
4: 55260534,
5: 55266556,
6: 55268064}})
df1
df2 = pd.DataFrame({'chr': {0: 7,
1: 7,
2: 7,
3: 7,
4: 7,
5: 7,
6: 7,
7: 7,
8: 7,
9: 7,
10: 7,
11: 7,
12: 7,
13: 7,
14: 7,
15: 7,
16: 7,
17: 7,
18: 7,
19: 7},
's': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'e': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'ref': {0: 'T',
1: 'T',
2: 'A',
3: 'G',
4: 'C',
5: 'G',
6: 'G',
7: 'A',
8: 'G',
9: 'G',
10: 'C',
11: 'G',
12: 'C',
13: 'G',
14: 'G',
15: 'G',
16: 'G',
17: 'G',
18: 'C',
19: 'C'},
'alt': {0: 'C',
1: 'G',
2: 'C',
3: 'A',
4: 'T',
5: 'A',
6: 'A',
7: 'G',
8: 'A',
9: 'A',
10: 'T',
11: 'A',
12: 'G',
13: 'A',
14: 'C',
15: 'A',
16: 'C',
17: 'A',
18: 'G',
19: 'T'}})
df2 shown here is only a small part of the data.
df2
What I want to achieve is:
for each row in df1 (row_df1), check whether it matches some row in df2 (row_df2), where a match means row_df1['chr'] == row_df2['chr'] & row_df1[0] >= row_df2['s'] & row_df1[1] <= row_df2['e'].
In brief: if the value falls into one of the intervals constructed from df2['s'] and df2['e'], return it.
I believe the best approach for you is to merge both dataframes first on a common column, in your case 'chr'. As I understand it, you want all 'chr' from df1 which exist in df2, so you can do:
merged_df = df1.merge(df2, on='chr', how='left')
In merge you can pass indicator=True, which creates a new column called '_merge' indicating the source of each row.
Once the data is merged, a simple boolean condition selects the rows you need (note the second comparison uses <=, matching the condition stated in the question):
merged_df.loc[(merged_df[0] >= merged_df['s']) & (merged_df[1] <= merged_df['e'])]
Or you could add the result as a new column, using apply, etc.
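Putting the merge and the filter together on a trimmed-down version of the sample data. Note the direction of the inequalities here: in the sample data df2's s and e are single positions (s == e) while df1's columns 0 and 1 form the intervals, so the filter tests that a df2 position lies inside a df1 interval:

```python
import pandas as pd

df1 = pd.DataFrame({'chr': [7, 7],
                    0: [55241686, 55242415],
                    1: [55241736, 55242513]})
df2 = pd.DataFrame({'chr': [7, 7, 7],
                    's': [55241646, 55241690, 55242454],
                    'e': [55241646, 55241690, 55242454]})

# every df1 interval paired with every df2 position on the same chr
merged = df1.merge(df2, on='chr', how='left')
# keep pairs where the df2 position falls inside the df1 interval
hits = merged.loc[(merged['s'] >= merged[0]) & (merged['e'] <= merged[1])]
```

Be aware that the merge on 'chr' alone produces a full cross join per chromosome, so on large inputs the intermediate frame can get big.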
I want to slice a multi-index pandas dataframe
here is the code to obtain my test data:
import pandas as pd
testdf = {
'Name': {
0: 'H', 1: 'H', 2: 'H', 3: 'H', 4: 'H'}, 'Division': {
0: 'C', 1: 'C', 2: 'C', 3: 'C', 4: 'C'}, 'EmployeeId': {
0: 14, 1: 14, 2: 14, 3: 14, 4: 14}, 'Amt1': {
0: 124.39, 1: 186.78, 2: 127.94, 3: 258.35000000000002, 4: 284.77999999999997}, 'Amt2': {
0: 30.0, 1: 30.0, 2: 30.0, 3: 30.0, 4: 60.0}, 'Employer': {
0: 'Z', 1: 'Z', 2: 'Z', 3: 'Z', 4: 'Z'}, 'PersonId': {
0: 14, 1: 14, 2: 14, 3: 14, 4: 15}, 'Provider': {
0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B'}, 'Year': {
0: 2012, 1: 2012, 2: 2013, 3: 2013, 4: 2012}}
testdf = pd.DataFrame(testdf)
testdf
grouper_keys = [
'Employer',
'Year',
'Division',
'Name',
'EmployeeId',
'PersonId']
testdf2 = pd.pivot_table(data=testdf,
values='Amt1',
index=grouper_keys,
columns='Provider',
fill_value=None,
margins=False,
dropna=True,
aggfunc=('sum', 'count'),
)
print(testdf2)
gives:
Now I can get only sum for A or B using
testdf2.loc[:, slice(None, ('sum', 'A'))]
which gives
How can I get both sum and count for only A or B?
Use xs for cross section
testdf2.xs('A', axis=1, level=1)
Or keep the column level with drop_level=False
testdf2.xs('A', axis=1, level=1, drop_level=False)
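A small self-contained illustration of xs on two-level columns (the numbers are made up, not taken from the pivot above):

```python
import pandas as pd

cols = pd.MultiIndex.from_product([['sum', 'count'], ['A', 'B']],
                                  names=[None, 'Provider'])
df = pd.DataFrame([[311.17, 100.0, 2.0, 1.0],
                   [386.29, 200.0, 2.0, 1.0]], columns=cols)

only_a = df.xs('A', axis=1, level='Provider')                # Provider level dropped
only_a_keep = df.xs('A', axis=1, level='Provider', drop_level=False)
```

With drop_level=False the result keeps the ('sum', 'A') / ('count', 'A') column tuples, which is handy if you later concatenate slices back together.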
You can use:
idx = pd.IndexSlice
df = testdf2.loc[:, idx[['sum', 'count'], 'A']]
print (df)
sum count
Provider A A
Employer Year Division Name EmployeeId PersonId
Z 2012 C H 14 14 311.17 2.0
15 NaN NaN
2013 C H 14 14 386.29 2.0
Another solution:
df = testdf2.loc[:, (slice('sum','count'), ['A'])]
print (df)
sum count
Provider A A
Employer Year Division Name EmployeeId PersonId
Z 2012 C H 14 14 311.17 2.0
15 NaN NaN
2013 C H 14 14 386.29 2.0
I have the following DataFrame to which I wish to apply some date-range calculations. I want to select the rows where the difference between sample dates (sample_date) for a unique person is less than 8 weeks, keeping the row with the oldest date (i.e. the first sample).
Here is an example dataset. The actual dataset can exceed 200,000 records.
labno name sex dob id location sample_date
1 John A M 12/07/1969 12345 A 12/05/2112
2 John B M 10/01/1964 54321 B 6/12/2010
3 James M 30/08/1958 87878 A 30/04/2012
4 James M 30/08/1958 45454 B 29/04/2012
5 Peter M 12/05/1935 33322 C 15/07/2011
6 John A M 12/07/1969 12345 A 14/05/2012
7 Peter M 12/05/1935 33322 A 23/03/2011
8 Jack M 5/12/1921 65655 B 15/08/2011
9 Jill F 6/08/1986 65459 A 16/02/2012
10 Julie F 4/03/1992 41211 C 15/09/2011
11 Angela F 1/10/1977 12345 A 23/10/2006
12 Mark A M 1/06/1955 56465 C 4/04/2011
13 Mark A M 1/06/1955 45456 C 3/04/2011
14 Mark B M 9/12/1984 55544 A 13/09/2012
15 Mark B M 9/12/1984 55544 A 1/01/2012
Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.
I normally use R for this procedure: I generate a list of dataframes based on the name/dob combination, sort each dataframe by sample_date, then use a list-apply function to compare the dates of the first and last rows of each dataframe and return the oldest if it is less than 8 weeks before the most recent. It takes forever.
I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.
Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).
{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3:
'30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935',
7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977',
11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14:
'9/12/1984'}, 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454,
4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211,
10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7:
8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A',
6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C',
13: 'A', 14: 'A'}, 'name': {0: 'John A', 1: 'John B', 2:
'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7:
'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A',
12: 'Mark A', 13: 'Mark B', 14: 'Mark B'}, 'sample_date': {0:
'12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012',
4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7:
'15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10:
'23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13:
'13/09/2012', 14: '1/01/2012'}, 'sex': {0: 'M', 1: 'M', 2: 'M',
3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F',
10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
I think what you might be looking for is (note sample_date must be an actual datetime column, not strings; your dates are day-first):
import numpy as np
import pandas as pd

df['sample_date'] = pd.to_datetime(df['sample_date'], dayfirst=True)

def differ(df):
    delta = df.sample_date.diff().abs()  # only care about magnitude
    cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
    return df[cond].max()

delta = df.groupby(['dob', 'name']).apply(differ)
Depending on whether or not you want to keep people who don't have more than 1 sample you can call delta.dropna(how='all') to remove them.
Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64/datetime64 for numpy < 1.7.
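A runnable sketch of the overall idea on a made-up subset of the question's data, keeping the earlier sample of any pair taken less than 8 weeks apart (this uses pd.Timedelta rather than np.timedelta64, and parses the day-first dates with dayfirst=True):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John A', 'John A', 'Peter', 'Peter', 'Jill'],
    'dob': ['12/07/1969', '12/07/1969', '12/05/1935', '12/05/1935', '6/08/1986'],
    'sample_date': ['14/05/2012', '28/05/2012',
                    '23/03/2011', '15/07/2011', '16/02/2012'],
})
df['sample_date'] = pd.to_datetime(df['sample_date'], dayfirst=True)

def close_pairs(g):
    g = g.sort_values('sample_date')
    delta = g['sample_date'].diff()
    # a row is kept when the *next* sample for this person follows
    # in under 8 weeks (NaT comparisons are False, so edges drop out)
    keep = delta.shift(-1) < pd.Timedelta(weeks=8)
    return g[keep]

oldest = df.groupby(['name', 'dob'], group_keys=False).apply(close_pairs)
```

Here John A's two samples are two weeks apart, so the earlier one (14/05/2012) is kept; Peter's are about 16 weeks apart and Jill has a single sample, so neither contributes a row.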