I want to display the result of a single-value aggregation with two group-bys as a table, such that
df.groupby(['colA', 'colB']).size()
would yield:
B1 B2 B3 B4
A1 s11 s12 s13 ..
A2 s21 s22 s23 ..
A3 s31 s32 s33 ..
A4 .. .. .. s44
What's a quick and easy way of doing this?
EDIT: here's an example. I have the logins of all users, and I want to display the number of logins (i.e. rows) for each user and day:
Day,User
1,John
1,John
1,Ben
1,Sarah
2,Ben
2,Sarah
2,Sarah
Should yield:
D\U John Ben Sarah
1 2 1 1
2 0 1 2
Use:
df.groupby(['colA', 'colB']).size().unstack()
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.transpose([np.random.choice(['A1','A2','A3'], size=10),
                                np.random.choice(['B1','B2','B3'], size=10)]),
                  columns=['A','B'])
df
A B
0 A1 B3
1 A2 B1
2 A3 B3
3 A3 B1
4 A2 B2
5 A3 B3
6 A1 B3
7 A1 B2
8 A3 B1
9 A3 B3
Now:
df.groupby(['A','B']).size().unstack()
B B1 B2 B3
A
A1 NaN 1.0 2.0
A2 1.0 1.0 NaN
A3 2.0 NaN 3.0
Update, now that your post includes sample data:
df.groupby(['Day','User']).size().unstack().fillna(0)
User Ben John Sarah
Day
1 1.0 2.0 1.0
2 1.0 0.0 2.0
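For completeness, here is a minimal, self-contained sketch that reproduces the login example from the question (the io/read_csv setup and variable names are illustrative, not part of the original answer):

import io
import pandas as pd

data = """Day,User
1,John
1,John
1,Ben
1,Sarah
2,Ben
2,Sarah
2,Sarah"""

df = pd.read_csv(io.StringIO(data))

# one row per login: count rows per (Day, User) pair, pivot User into columns,
# fill the missing (Day, User) combinations with 0 and keep integer counts
logins = df.groupby(['Day', 'User']).size().unstack().fillna(0).astype(int)
print(logins)
# User  Ben  John  Sarah
# Day
# 1       1     2      1
# 2       1     0      2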
I have a df like this:
A B C D E F
2 a1 a2 a3 a4 100
2 a1 b2 c3 a4 100 # note
2 b1 b2 b3 b4 100
2 c1 c2 c3 c4 100
1 a1 a2 a3 a4 120
2 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
I want to compare the values of column F between rows where columns B-E match, based on column A, like so:
A B C D E F diff
2 a1 a2 a3 a4 100 120/100
2 a1 b2 c3 a4 100 # note 150/100
2 b1 b2 b3 b4 100 130/100
2 c1 c2 c3 c4 100 110/100
1 a1 a2 a3 a4 120 80/120
1 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
Since the first row (where A is 2) has the same B-E values as the row where A is 1, I compute 120/100.
What I've tried:
df.groupby(['B', 'C', 'D', 'E']) - this groups the data, but I don't know how I could apply the logic of comparing each group's F against the value for the previous A. Or maybe there is a simpler way of achieving this.
Use DataFrameGroupBy.shift with Series.div:
df['d'] = df.groupby(['B', 'C', 'D', 'E'])['F'].shift(-1).div(df['F'])
print (df)
A B C D E F d
0 2 a1 a2 a3 a4 100 1.200000
1 2 a1 b2 c3 a4 100 1.500000
2 2 b1 b2 b3 b4 100 1.300000
3 2 c1 c2 c3 c4 100 1.100000
4 1 a1 a2 a3 a4 120 0.666667
5 2 a1 b2 c3 a4 150 NaN
6 1 b1 b2 b3 b4 130 NaN
7 1 c1 c2 c3 c4 110 NaN
8 0 a1 a2 a3 a4 80 NaN
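To make the mechanics explicit, here is a self-contained reconstruction of the question's frame (values copied from the post) with the same one-liner: within each (B, C, D, E) group, shift(-1) pulls the F value from the next occurrence of that combination, i.e. the row from the next A group, and dividing it by the current row's F gives the ratio in column d.

import pandas as pd

df = pd.DataFrame({
    'A': [2, 2, 2, 2, 1, 2, 1, 1, 0],
    'B': ['a1', 'a1', 'b1', 'c1', 'a1', 'a1', 'b1', 'c1', 'a1'],
    'C': ['a2', 'b2', 'b2', 'c2', 'a2', 'b2', 'b2', 'c2', 'a2'],
    'D': ['a3', 'c3', 'b3', 'c3', 'a3', 'c3', 'b3', 'c3', 'a3'],
    'E': ['a4', 'a4', 'b4', 'c4', 'a4', 'a4', 'b4', 'c4', 'a4'],
    'F': [100, 100, 100, 100, 120, 150, 130, 110, 80],
})

# next F value within the same (B, C, D, E) group, divided by the current F
df['d'] = df.groupby(['B', 'C', 'D', 'E'])['F'].shift(-1).div(df['F'])
print(df)   # reproduces the table above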
I have a DataFrame in which I need to group by A, then show a count of instances of B separated into B1 and B2, and finally the percentage of those instances that are > 0.1. My data looks like this:
A B C
id
118 a1 B1 0
119 a1 B1 0
120 a1 B1 101.1
121 a1 B1 106.67
122 a1 B2 103.33
237 a1 B2 100
To get the first two counts, I did:
df = pd.DataFrame(df.groupby(['A', 'B'])['B'].aggregate('count')).unstack(level=1)
which gets the first part right:
B
B B1 B2
A
a1 4 2
a2 7 9
a3 9 17
a4 8 8
a5 7 8
But then, when I try to get the percentage of the count that is > 0:
prcnt_complete = df[['A', 'B', 'C']]
prcnt_complete['passed'] = prcnt_complete['C'].apply(lambda x: (float(x) > 1))
prcnt_complete = prcnt_complete.groupby(['A', 'B', 'passed']).count()
I get weird values that make no sense; sometimes the True and False counts don't even add up to the group total. I'm trying to understand what I'm doing wrong in the order of operations so that I can make sense of it.
The result I'm looking for is something like this:
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
a2 7 9 7 6
a3 9 17 9 5
You can do:
(df['C'].gt(1).groupby([df['A'], df['B']])
    .agg(['size', 'sum'])
    .rename(columns={'size': 'B', 'sum': 'passed'})
    .unstack('B')
)
Output (from sample data):
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
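As a sanity check, here is the same idea written out on the six sample rows from the question (index values copied from the post): size counts all rows in each (A, B) group and sum counts how many of the boolean C > 1 values are True, which is why both fit in a single agg call.

import pandas as pd

df = pd.DataFrame({
    'A': ['a1'] * 6,
    'B': ['B1', 'B1', 'B1', 'B1', 'B2', 'B2'],
    'C': [0, 0, 101.1, 106.67, 103.33, 100],
}, index=[118, 119, 120, 121, 122, 237])

out = (df['C'].gt(1)                    # boolean Series: did the row pass?
         .groupby([df['A'], df['B']])   # group the booleans by the original A and B
         .agg(['size', 'sum'])          # size = rows per group, sum = number of True values
         .rename(columns={'size': 'B', 'sum': 'passed'})
         .unstack('B'))
print(out)   # B1: 4 rows / 2 passed, B2: 2 rows / 2 passed, matching the output above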
While working on your problem, I also wanted to see if I could get the average percentage for B (while ignoring zeros). I was able to accomplish this as well while getting the counts.
DataFrame for this exercise:
A B C
0 a1 B1 0.00
1 a1 B1 0.00
2 a1 B1 98.87
3 a1 B1 101.10
4 a1 B2 106.67
5 a1 B2 103.00
6 a2 B1 0.00
7 a2 B1 0.00
8 a2 B1 33.00
9 a2 B1 100.00
10 a2 B2 80.00
11 a3 B1 90.00
12 a3 B2 99.00
Average while excluding the zeros
For this I had to add .replace(0, np.nan) before the groupby call.
import pandas as pd
import numpy as np

A = ['a1','a1','a1','a1','a1','a1','a2','a2','a2','a2','a2','a3','a3']
B = ['B1','B1','B1','B1','B2','B2','B1','B1','B1','B1','B2','B1','B2']
C = [0,0,98.87,101.1,106.67,103,0,0,33,100,80,90,99]
df = pd.DataFrame({'A':A,'B':B,'C':C})

df = pd.DataFrame(df.replace(0, np.nan)
                    .groupby(['A', 'B'])
                    .agg({'B': 'size', 'C': ['count', 'mean']})
                    .rename(columns={'size': 'Count', 'count': 'Passed', 'mean': 'Avg Score'})
                  ).unstack(level=1)
df.columns = df.columns.droplevel(0)
Count Passed Avg Score
B B1 B2 B1 B2 B1 B2
A
a1 4 2 2 2 99.985 104.835
a2 4 1 2 1 66.500 80.000
a3 1 1 1 1 90.000 99.000
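As a variant, on pandas 0.25 or newer the same table can be built with named aggregation, which avoids the dict-of-lists agg and the column rename. This is only a sketch of that idea (it reuses the A, B, C lists defined above and labels the mean column Avg_Score rather than 'Avg Score'), not part of the original answer.

import pandas as pd
import numpy as np

raw = pd.DataFrame({'A': A, 'B': B, 'C': C})     # same lists as in the code above

out = (raw.replace(0, np.nan)
          .groupby(['A', 'B'])
          .agg(Count=('C', 'size'),       # rows per (A, B) group, NaNs included
               Passed=('C', 'count'),     # non-NaN (i.e. originally non-zero) values
               Avg_Score=('C', 'mean'))   # mean of the non-zero values
          .unstack(level=1))
print(out)   # same Count / Passed / average values as the table above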
I want to remove rows where my target value is null more than 25% of the time, with the 25% condition evaluated per group of another column. Alternatively I could set a threshold as a maximum acceptable count of NaNs, again based on that other column's value.
My goal is to impute values when a group (based on that other column) has enough observations, and to remove the observations when the threshold is not met.
My dataframe is much larger, but it's something like this - suppose 50% of the 'T' values for rows where 'aid' is 'a3' are null:
import pandas as pd

df = pd.DataFrame([[1,'a1','c1', 111],
                   [2,'a2','c3', 222],
                   [3,'a3','c3'],
                   [4,'a1','c5', 444],
                   [5,'a3','c4'],
                   [6,'a3','c5', 666],
                   [7,'a3','c3', 777]], columns=['pid','aid','cid','T'])
df
pid aid cid T
0 1 a1 c1 111.0
1 2 a2 c3 222.0
2 3 a3 c3 NaN
3 4 a1 c5 444.0
4 5 a3 c4 NaN
5 6 a3 c5 666.0
6 7 a3 c3 777.0
I've tried
df.dropna(thresh=0.25*(df['aid'].value_counts()), axis = 1)
My desired output at a threshold of 25% is:
pid aid cid T
0 1 a1 c1 111.0
1 2 a2 c3 222.0
3 4 a1 c5 444.0
5 6 a3 c5 666.0
6 7 a3 c3 777.0
At a threshold of 51% my dataframe would be unchanged:
pid aid cid T
0 1 a1 c1 111.0
1 2 a2 c3 222.0
2 3 a3 c3 NaN
3 4 a1 c5 444.0
4 5 a3 c4 NaN
5 6 a3 c5 666.0
6 7 a3 c3 777.0
Any advice would be appreciated.
You can use transform:
s = df['T'].isnull().groupby(df['aid']).transform('mean')
n = 0.25
df.loc[(s <= n) | df['T'].notnull()]
Out[39]:
pid aid cid T
0 1 a1 c1 111.0
1 2 a2 c3 222.0
3 4 a1 c5 444.0
5 6 a3 c5 666.0
6 7 a3 c3 777.0
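For reference, here is a self-contained sketch of this answer rebuilt from the question's data; transform('mean') broadcasts each aid group's null fraction back to every row, so the mask can combine the group-level threshold with the per-row null check.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'pid': [1, 2, 3, 4, 5, 6, 7],
    'aid': ['a1', 'a2', 'a3', 'a1', 'a3', 'a3', 'a3'],
    'cid': ['c1', 'c3', 'c3', 'c5', 'c4', 'c5', 'c3'],
    'T':   [111, 222, np.nan, 444, np.nan, 666, 777],
})

# fraction of null T values within each aid group, broadcast to every row
null_frac = df['T'].isnull().groupby(df['aid']).transform('mean')

n = 0.25
print(df[(null_frac <= n) | df['T'].notnull()])   # drops the two null 'a3' rows; with n = 0.51 nothing is dropped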
I'd do
thresh = .50
if len(df.query("aid=='a3' and T != T").index) / len(df.index) > thresh:
    df = df.dropna(subset=['T'])
or if you dislike query syntax,
thresh = .50
if len(df[(df['aid'] == 'a3') & (df['T'].isna())].index) / len(df.index) > thresh:
    df = df.dropna(subset=['T'])
the maxcount version:
maxcount = 2
if len(df[(df['aid'] == 'a3') & (df['T'].isna())].index) > maxcount:
    df = df.dropna(subset=['T'])
EDIT: Since I don't have enough rep to comment on WeNYoBen's answer, here's the maxcount version of their response, with more pythonic variable names:
aid_var_null_ct = df['T'].isnull().groupby(df['aid']).transform('sum')
thresh = 1
df.loc[(aid_var_null_ct <= thresh) | (df['T'].notnull()),]
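A quick check of this variant, reusing the df built in the sketch above (this assumes that df is still in scope): the 'a3' group has two null T values, which exceeds thresh = 1, so the same two rows are dropped as in the percentage-based version.

# per-aid count of null T values, broadcast back to every row: a1 -> 0, a2 -> 0, a3 -> 2
aid_var_null_ct = df['T'].isnull().groupby(df['aid']).transform('sum')
thresh = 1
print(df.loc[(aid_var_null_ct <= thresh) | df['T'].notnull()])   # keeps pid 1, 2, 4, 6 and 7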
I have a DataFrame like this (sample):
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V2 B2 Clearing C2 104457.22
5 V2 B2 Invoice C2 -400073.56
6 V2 B2 Payment C2 297856.45
7 V3 B3 Clearing C3 1989462.95
8 V3 B3 CreditMemo C3 538.95
9 V3 B3 CustomerPayment_Difference C3 2112329.00
10 V3 B3 Invoice C3 -4066485.69
11 V4 B4 Clearing C4 -123946.13
12 V4 B4 CreditMemo C4 127624.66
13 V4 B4 Accounting C4 424774.52
14 V4 B4 Invoice C4 -40446521.41
15 V4 B4 Payment C4 44441419.95
I want to reshape this DataFrame like below:
A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
V1 B1 C1 NaN 1538884.46 NaN 13537679.7
V2 B2 C2 NaN 104457.22 NaN NaN
V3 B3 C3 NaN 1989462.95 538.95 2112329.0
V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
So far I have tried to use pivot:
df.pivot(index='A',columns='C', values='E').reset_index()
It gives a result like below:
C A Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 NaN 1538884.46 NaN 13537679.7
1 V2 NaN 104457.22 NaN NaN
2 V3 NaN 1989462.95 538.95 2112329.0
3 V4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
In the above table the B and D columns are left out; I need those columns as well.
I have provided this sample data for simplicity, but in the future the data could also look like this:
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B2 Clearing C1 88.9 <- new row
5 V1 B2 Clearing C2 79.9 <- new row
In this situation my code will throw a duplicate-index error.
To fix these two problems I need to specify A, B, D as the index.
I need code similar to this:
df.pivot(index=['A','B','D'],columns='C', values='E').reset_index()
This code throws an error.
How can I solve this? How can I provide multiple columns as the index in a pandas pivot?
I think you need:
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
C A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 B1 C1 NaN 1538884.46 NaN 13537679.7
1 V2 B2 C2 NaN 104457.22 NaN NaN
2 V3 B3 C3 NaN 1989462.95 538.95 2112329.0
3 V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
Another solution is to use pivot_table:
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
But it aggregates if there are duplicates in the A, B, C, D columns. The first solution raises an error if there are duplicates:
print (df)
A B C D E
0 V1 B1 Clearing C1 3000.00 <-V1,B1,Clearing,C1
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B1 Clearing C1 1000.00 <-V1,B1,Clearing,C1
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
ValueError: Index contains duplicate entries, cannot reshape
But pivot_table aggregates:
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
print (df)
C Clearing CustomerPayment_Difference Invoice PaymentDifference
A B D
V1 B1 C1 2000.0 13537679.7 -15771005.81 0.0
So the question is: is it a good idea to always use pivot_table?
In my opinion it depends on whether you need to care about duplicates - if you use pivot or set_index + unstack you get an error, so you know about the duplicates, but pivot_table always aggregates, so you never find out about them.
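A small sketch of that trade-off on the duplicate example above; passing an explicit aggfunc at least makes the aggregation visible, and on pandas 1.1 or newer pivot itself accepts a list of index columns and still raises on duplicates (the 'sum' choice here is only illustrative, not from the original answer).

import pandas as pd

df = pd.DataFrame({
    'A': ['V1'] * 5,
    'B': ['B1'] * 5,
    'C': ['Clearing', 'CustomerPayment_Difference', 'Invoice', 'PaymentDifference', 'Clearing'],
    'D': ['C1'] * 5,
    'E': [3000.00, 13537679.70, -15771005.81, 0.00, 1000.00],
})

# the aggregation is now explicit: the two duplicate Clearing rows are summed, not silently averaged
print(df.pivot_table(index=['A', 'B', 'D'], columns='C', values='E', aggfunc='sum'))

# on pandas >= 1.1, pivot takes a list of index columns and still errors on duplicates
try:
    df.pivot(index=['A', 'B', 'D'], columns='C', values='E')
except ValueError as err:
    print(err)   # Index contains duplicate entries, cannot reshape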
I have two dataframes that look like this:
df1=
A B
1 A1 B1
2 A2 B2
3 A3 B3
df2 =
A C
4 A4 C4
5 A5 C5
I would like to append df2 to df1, like so:
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 NaN
5 A5 NaN
(Note: I've edited the dataframes so that not all the columns in df1 are necessarily in df2)
Whether I use concat or append, the resulting dataframe would have a column called "C" with the first three rows filled with NaN. I just want to keep the two original columns of df1, with the new values appended. Is there a way to concatenate the dataframes without having to drop the extra column afterwards?
You can first filter the columns to append by selecting a subset:
print (df2[['A']])
A
4 A4
5 A5
print (pd.concat([df1, df2[['A']]]))
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 NaN
5 A5 NaN
print (df1.append(df2[['A']]))
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 NaN
5 A5 NaN
If df2 also had a B column (as in the question before the edit), you could keep both:
print (df2[['A','B']])
A B
4 A4 B4
5 A5 B5
print (pd.concat([df1, df2[['A','B']]]))
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
5 A5 B5
Or:
print (df1.append(df2[['A','B']]))
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
5 A5 B5
EDIT by comment:
If df1 and df2 have different columns, use the intersection:
print (df1)
A B D
1 A1 B1 R
2 A2 B2 T
3 A3 B3 E
print (df2)
A B C
4 A4 B4 C4
5 A5 B5 C5
print (df1.columns.intersection(df2.columns))
Index(['A', 'B'], dtype='object')
print (pd.concat([df1, df2[df1.columns.intersection(df2.columns)]]))
A B D
1 A1 B1 R
2 A2 B2 T
3 A3 B3 E
4 A4 B4 NaN
5 A5 B5 NaN
Actually the solution is in an obscure corner of this page. Here's the code to use:
pd.concat([df1,df2],join_axes=[df1.columns])
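Note that join_axes was deprecated in pandas 0.25 and removed in pandas 1.0, and DataFrame.append was removed in pandas 2.0. On current pandas you can get the same result by reindexing to df1's columns; a minimal sketch:

import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'], 'B': ['B1', 'B2', 'B3']}, index=[1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5'], 'C': ['C4', 'C5']}, index=[4, 5])

# keep only df1's columns; appended rows get NaN where df2 has no matching column
out = pd.concat([df1, df2]).reindex(columns=df1.columns)
print(out)
#     A    B
# 1  A1   B1
# 2  A2   B2
# 3  A3   B3
# 4  A4  NaN
# 5  A5  NaN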