Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
The example pandas DataFrame has three columns: User, Code, and Subtotal.
import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal'])
I'd like to group on User and Code and get a subtotal for each combination of User and Code.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?
Preferred output
Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
You can use unstack with stack:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Another solution is to reindex by a MultiIndex created with MultiIndex.from_product:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print (mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['User', 'Code'])
print (df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
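The reindex approach above can be wrapped in a small reusable function. This is a minimal sketch, not part of the original answer: the helper name sum_all_combinations is hypothetical, and it builds the full index from the unique values of the key columns rather than from df.index.levels, which gives the same result for this example.

```python
import pandas as pd

example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'],
)


def sum_all_combinations(frame, keys, value):
    """Hypothetical helper: group-sum `value` over every combination of `keys`."""
    totals = frame.groupby(keys)[value].sum()
    # Cartesian product of the values seen in each key column.
    full_index = pd.MultiIndex.from_product(
        [frame[key].unique() for key in keys], names=keys
    )
    # Combinations missing from the data are added with a subtotal of 0.
    filled = totals.reindex(full_index, fill_value=0).sort_index()
    return filled.reset_index(name=value)


result = sum_all_combinations(example_df, ['User', 'Code'], 'Subtotal')
print(result)
```

The printed frame should contain six rows, including the User=='c', Code==2 row with a Subtotal of 0.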
Related
Hi there, I would like to join all strings within a group using Python datatable, in order to avoid pandas. Below is the pandas code I am currently using, which I would like to replicate in datatable.
Does anyone know how to do it? Thank you very much!
from datatable import dt, f, by
df = dt.Frame(group1=[1, 1, 1, 2, 2, 2], group2=[1, 1, 2, 2, 2, 3], text=['a', 'b', 'c', 'd', 'e', 'f'])
df = df.to_pandas()
df2 = df.groupby(['group1', 'group2'])['text'].apply(' '.join).reset_index() # replicate this with datatable
df:
group1 group2 text
0 1 1 a
1 1 1 b
2 1 2 c
3 2 2 d
4 2 2 e
5 2 3 f
df2
group1 group2 text
0 1 1 a b
1 1 2 c
2 2 2 d e
3 2 3 f
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
output:
A B
0 1 1
1 2 1
2 3 1
yf.nunique(axis=0)
output:
A 3
B 1
yf.nunique(axis=1)
output:
0 1
1 2
2 2
Could you please explain how axis=0 and axis=1 work? With axis=0, why are A=2, B=1 ignored? I wonder whether nunique takes the index into account as well?
DataFrame.nunique counts the number of distinct values per column (axis=0) or per row (axis=1). The index itself is not counted.
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print (yf)
A B
0 1 1
1 2 1
2 3 1
print (yf.nunique(axis=0))
A 3
B 1
dtype: int64
print (yf.nunique(axis=1))
0 1
1 2
2 2
dtype: int64
It means:
With axis=0, A is 3 because there are 3 unique values in column A.
With axis=1, 0 is 1 because there is 1 unique value in row 0.
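To make the two directions concrete, the counts can be reproduced by hand with plain Python sets. This is only an illustrative sketch; the names per_column and per_row are made up for the example.

```python
import pandas as pd

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})

# axis=0: walk down each column and count its distinct values.
per_column = {col: len(set(yf[col])) for col in yf.columns}

# axis=1: walk across each row and count its distinct values.
per_row = {idx: len(set(row)) for idx, row in yf.iterrows()}

print(per_column)  # {'A': 3, 'B': 1}
print(per_row)     # {0: 1, 1: 2, 2: 2}
```

Row 0 holds the values 1 and 1, so it has a single distinct value, while rows 1 and 2 hold two different values each.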
I have a pandas dataframe like this:
df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
A B C D total
0 2 1 0 1 4
1 3 2 1 0 6
I'm trying to perform a rowwise calculation and create a new column with the result. The calculation is to divide each column ABCD by the total, square it, and sum it up rowwise. This should be the result (0 if total is 0):
A B C D total result
0 2 1 0 1 4 0.375
1 3 2 1 0 6 0.389
This is what I've tried so far, but it always returns 0:
df['result'] = df[['A', 'B', 'C', 'D']].apply(lambda x: ((x/df['total'])**2).sum(), axis=1)
I guess the problem is df['total'] in the lambda function, because if I replace this by a number it works fine. I don't know how to work around this though. Appreciate any suggestions.
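A combination of div, pow and sum can solve this. Dropping the total column by name is more robust than a regex filter such as "[^total]", which is a character class and only happens to work for these column names:
df["result"] = df.drop(columns="total").div(df.total, axis=0).pow(2).sum(axis=1)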
df
A B C D total result
0 2 1 0 1 4 0.375000
1 3 2 1 0 6 0.388889
You could also do:
df['result'] = (df.loc[:, "A": 'D'].divide(df.total, axis=0) ** 2).sum(axis=1)
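As for why the original apply call returns 0: with axis=1 each x is a single row labelled A to D, while df['total'] is a whole column labelled 0 and 1. pandas aligns the two on their labels, no label matches, every quotient is NaN, and sum() skips NaN. The sketch below demonstrates this and shows a row-wise variant that reads total from the same row; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
cols = ['A', 'B', 'C', 'D']

# What the lambda in the question computes for the first row:
# a row (labels A..D) divided by a column (labels 0..1).
first_row = df.loc[0, cols]
misaligned = first_row / df['total']
all_nan = bool(misaligned.isna().all())
print(all_nan)

# Row-wise variant: apply over full rows so 'total' is a scalar from that row.
result = df.apply(lambda row: ((row[cols] / row['total']) ** 2).sum(), axis=1)
print(result)
```

The vectorised div/pow/sum versions above are usually preferable for larger frames, since apply with axis=1 calls the lambda once per row.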
I ran into this earlier today when creating pivot tables after categorizing a column of values using pd.cut. When creating the pivot tables, I found that the resulting index was incorrect. This was not an issue when using groupby instead, or after converting the category column to a different dtype.
Simplified example:
df = pd.DataFrame({'l1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b']
, 'g1': [1, 1, 2, 2, 1, 1, 1, 2, 2, 2]
, 'vals': [3, 1, 3, 1, 3, 2, 2, 3, 2, 2]})
df['l2'] = pd.cut(df.vals, bins=[0, 2, 4], labels=['l', 'h'])
df = df[['l1', 'l2', 'g1', 'vals']]
Using groupby:
df.groupby(['l1', 'l2', 'g1']).vals.agg(('sum', 'count')).unstack()[['count', 'sum']]
      count    sum
g1        1  2   1  2
l1 l2
a  l      1  1   1  1
   h      1  1   3  3
b  l      2  2   4  4
   h      1  1   3  3
Using pd.pivot_table:
pd.pivot_table(df, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
       vals
      count    sum
g1        1  2   1  2
l1 l2
a  h      1  1   1  1
   l      1  1   3  3
b  h      2  2   4  4
   l      1  1   3  3
Using pd.pivot_table after converting the l2 column to str dtype:
df2 = df.copy()
df2['l2'] = df2.l2.astype(str)
pd.pivot_table(df2, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
       vals
      count    sum
g1        1  2   1  2
l1 l2
a  h      1  1   3  3
   l      1  1   1  1
b  h      1  1   3  3
   l      2  2   4  4
The order in the last example is reversed, but the values are correct (in contrast to the middle example, where the order is reversed and the values are incorrect).
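One way to guard against this kind of mismatch is to cross-check the pivot table against the equivalent groupby result after converting the categorical column to strings. The sketch below is an assumption-laden sanity check, not a fix for the underlying behaviour: it only compares the sums, and it passes values='vals' and a single aggfunc so that both results have the same shape.

```python
import pandas as pd

df = pd.DataFrame({
    'l1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
    'g1': [1, 1, 2, 2, 1, 1, 1, 2, 2, 2],
    'vals': [3, 1, 3, 1, 3, 2, 2, 3, 2, 2],
})
# Convert the categorical labels to plain strings before aggregating.
df['l2'] = pd.cut(df.vals, bins=[0, 2, 4], labels=['l', 'h']).astype(str)

grouped = df.groupby(['l1', 'l2', 'g1']).vals.sum().unstack()
pivoted = pd.pivot_table(df, index=['l1', 'l2'], columns='g1',
                         values='vals', aggfunc='sum')

# Both results should carry the same row labels and the same sums.
same_index = grouped.index.equals(pivoted.index)
same_values = bool((grouped.values == pivoted.values).all())
print(same_index, same_values)
```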
I have been trying to rearrange my dataframe to use it as input for a factorplot. The raw data would look like this:
A B C D
1 0 1 2 "T"
2 1 2 3 "F"
3 2 1 0 "F"
4 1 0 2 "T"
...
My question is how can I rearrange it into this form:
col val val2
1 A 0 "T"
1 B 1 "T"
1 C 2 "T"
2 A 1 "F"
...
I was trying:
df = DF.cumsum(axis=0).stack().reset_index(name="val")
However, this produces only one value column, not two. Thanks for your support.
I would use melt, and you can sort the result however you like:
pd.melt(df.reset_index(),id_vars=['index','D'], value_vars=['A','B','C']).sort_values(by='index')
Out[40]:
index D variable value
0 1 T A 0
4 1 T B 1
8 1 T C 2
1 2 F A 1
5 2 F B 2
9 2 F C 3
2 3 F A 2
6 3 F B 1
10 3 F C 0
3 4 T A 1
7 4 T B 0
11 4 T C 2
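Then you can rename the columns as you like. Here df is the melted result from above, and the mapping follows the layout asked for in the question (col holds the original column name, val the value, val2 the D flag):
df.set_index('index').rename(columns={'variable': 'col', 'value': 'val', 'D': 'val2'})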
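Put together as one self-contained chain, the melt approach might look like the sketch below. It assumes the four sample rows from the question and uses melt's var_name and value_name parameters to name the columns directly; the name long_df is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [0, 1, 2, 1], 'B': [1, 2, 1, 0], 'C': [2, 3, 0, 2], 'D': ['T', 'F', 'F', 'T']},
    index=[1, 2, 3, 4],
)

long_df = (
    df.reset_index()
      # One output row per (original row, column A/B/C) pair; D is carried along.
      .melt(id_vars=['index', 'D'], value_vars=['A', 'B', 'C'],
            var_name='col', value_name='val')
      .sort_values(['index', 'col'])
      .set_index('index')
      .rename(columns={'D': 'val2'})
      .rename_axis(None)
      [['col', 'val', 'val2']]
)
print(long_df)
```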
Consider your dataframe df:
df = pd.DataFrame([
[0, 1, 2, 'T'],
[1, 2, 3, 'F'],
[2, 1, 0, 'F'],
[1, 0, 2, 'T'],
], [1, 2, 3, 4], list('ABCD'))
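Solution (axis is passed by keyword, since newer pandas versions no longer accept it positionally in rename_axis):
df.set_index('D', append=True) \
.rename_axis(['col'], axis=1) \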
.rename_axis([None, 'val2']) \
.stack().to_frame('val') \
.reset_index(['col', 'val2']) \
[['col', 'val', 'val2']]