Group pandas dataframe with multiple columns and create distribution - python

I have a dataframe as below:
data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1], ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns = ['Name', 'effect'])
  Name  effect
0    A       1
1    A       0
2    A       1
3    B       0
4    B       1
5    C       1
6    C       1
7    C       1
After doing a groupby I get:
temp_df.groupby(['Name','effect']).size().reset_index(name='count')
  Name  effect  count
0    A       0      1
1    A       1      2
2    B       0      1
3    B       1      1
4    C       1      3
But I need my result to look like this:
Name  e0  e1
A      1   2
B      1   1
C      0   3

You can cross-tabulate with crosstab(). To add e to the column names, chain add_prefix():
pd.crosstab(temp_df.Name, temp_df.effect).add_prefix('e')
# effect  e0  e1
# Name
# A        1   2
# B        1   1
# C        0   3
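If you want Name back as a regular column, matching the flat shape of the desired output, the crosstab result can be tidied a little further (a small sketch building on the answer above):

```python
import pandas as pd

data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1],
        ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns=['Name', 'effect'])

out = (pd.crosstab(temp_df.Name, temp_df.effect)
       .add_prefix('e')
       .rename_axis(columns=None)  # drop the 'effect' columns label
       .reset_index())             # turn the Name index into a column
print(out)
#   Name  e0  e1
# 0    A   1   2
# 1    B   1   1
# 2    C   0   3
```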

Groupby with value counts and unstack:
out = temp_df.groupby("Name")['effect'].value_counts().unstack(fill_value=0)
out = out.add_prefix(out.columns.name).rename_axis(columns=None).reset_index()
print(out)
  Name  effect0  effect1
0    A        1        2
1    B        1        1
2    C        0        3

You can use .pivot_table():
print(
    temp_df.assign(tmp=temp_df["effect"])
    .pivot_table(
        index="Name",
        columns="effect",
        values="tmp",
        aggfunc="count",
        fill_value=0,
    )
    .add_prefix("e")
    .reset_index()
)
Prints:
effect Name  e0  e1
0         A   1   2
1         B   1   1
2         C   0   3

data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1], ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns = ['Name', 'e0'])
print(temp_df)
temp_df.groupby(['Name','e0']).size().reset_index(name='e1')

How do I exclude a column from pandas pd.get_dummies? [duplicate]

This question already has answers here:
How to create dummies for certain columns with pandas.get_dummies() (4 answers)
Closed 1 year ago.
I have a dataset with 82 columns, and would like to turn all column values into dummy variables using pd.get_dummies except for the first column "business_id".
How can I define the pd.get_dummies function to only work on the other 81 columns?
You can exclude columns based on location by slicing df.columns:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})
df = pd.get_dummies(df, columns=df.columns[1:])
# For Display
print(df.to_string(index=False))
Output:
A  B_a  B_b  B_c  C_1  C_2  C_3  D_4  D_5  D_6
a    0    1    0    1    0    0    1    0    0
b    1    0    0    0    1    0    0    1    0
a    0    0    1    0    0    1    0    0    1
For a more general solution, you can filter out particular columns programmatically using filter over your df.columns.
Put whatever column names you want to exclude in columns_to_exclude.
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})
columns_to_exclude = ['B', 'D']
df = pd.get_dummies(df, columns=filter(
    lambda i: i not in columns_to_exclude,
    df.columns))
# For Display
print(df.to_string(index=False))
Output:
A  B  C  D  A_a  A_b  C_1  C_2  C_3
a  b  1  4    1    0    1    0    0
b  a  2  5    0    1    0    1    0
a  c  3  6    1    0    0    0    1
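As a further option (a sketch, not part of the answer above), `Index.difference` can drop the excluded labels directly; note it returns the remaining labels in sorted order:

```python
import pandas as pd

# Hypothetical frame standing in for the 82-column dataset
df = pd.DataFrame({'business_id': [10, 11, 12],
                   'B': ['b', 'a', 'c'],
                   'C': ['x', 'y', 'x']})

# Index.difference drops the given labels and returns the rest (sorted),
# so only 'B' and 'C' get dummy-encoded here
out = pd.get_dummies(df, columns=df.columns.difference(['business_id']))
print(out.columns.tolist())
```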

How to encode when you have multiple categories in a column

My data frame looks like this:
[image: pandas data frame with multiple categorical variables for a user]
I made sure there are no duplicates in it. I want to encode it, and I want my final output like this [image not shown]. I tried using pd.get_dummies directly but I am not getting the desired result.
Can anyone help me with this?
IIUC, your user is empty and everything is on name. If that's the case, you can
pd.pivot_table(df, index=df.name.str[0], columns=df.name.str[1:].values, aggfunc='count').fillna(0)
You can split each row in name using r'(\d+)' to separate digits from letters, and use pd.crosstab:
d = pd.DataFrame(df.name.str.split(r'(\d+)').values.tolist())
pd.crosstab(columns=d[2], index=d[1], values=d[1], aggfunc='count')
You could try the str accessor's get_dummies with a groupby on the user column:
df.name.str.get_dummies().groupby(df.user).sum()
Example
Given your sample DataFrame
df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
                   'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})
df_dummies = df.name.str.get_dummies().groupby(df.user).sum()
print(df_dummies)
[out]
      a  b  c  d  e  f
user
1     1  1  1  1  0  0
2     1  1  1  1  0  0
3     0  0  0  1  1  1
Assuming the following dataframe:
   user name
0     1    a
1     1    b
2     1    c
3     1    d
4     2    a
5     2    b
6     2    c
7     3    d
8     3    e
9     3    f
You could groupby user and then use get_dummies:
import pandas as pd
# create data-frame
data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
        [2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])
# group and get_dummies
grouped = df.groupby('user')['name'].apply(lambda x: '|'.join(x))
print(grouped.str.get_dummies())
Output
      a  b  c  d  e  f
user
1     1  1  1  1  0  0
2     1  1  1  0  0  0
3     0  0  0  1  1  1
As a side-note, you can do it all in one line:
result = df.groupby('user')['name'].apply(lambda x: '|'.join(x)).str.get_dummies()

Different results from pandas groupby and pivot_table when dtype is categorical

I ran into this earlier today when creating pivot tables after categorizing a column of values using pd.cut. When creating the pivot tables I found that the resulting index was incorrect. This was not an issue when using groupby instead, or after converting the category column to a different dtype.
Simplified example:
df = pd.DataFrame({'l1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
                   'g1': [1, 1, 2, 2, 1, 1, 1, 2, 2, 2],
                   'vals': [3, 1, 3, 1, 3, 2, 2, 3, 2, 2]})
df['l2'] = pd.cut(df.vals, bins=[0, 2, 4], labels=['l', 'h'])
df = df[['l1', 'l2', 'g1', 'vals']]
Using groupby:
df.groupby(['l1', 'l2', 'g1']).vals.agg(('sum', 'count')).unstack()[['count', 'sum']]
      count     sum
g1        1  2    1  2
l1 l2
a  l      1  1    1  1
   h      1  1    3  3
b  l      2  2    4  4
   h      1  1    3  3
Using pd.pivot_table:
pd.pivot_table(df, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
       vals
      count     sum
g1        1  2    1  2
l1 l2
a  h      1  1    1  1
   l      1  1    3  3
b  h      2  2    4  4
   l      1  1    3  3
Using pd.pivot_table after converting the l2 column to str dtype:
df2 = df.copy()
df2['l2'] = df2.l2.astype(str)
pd.pivot_table(df2, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
       vals
      count     sum
g1        1  2    1  2
l1 l2
a  h      1  1    3  3
   l      1  1    1  1
b  h      1  1    3  3
   l      2  2    4  4
The order in the last example is reversed, but the values are correct (in contrast to the middle example, where the order is reversed and the values are incorrect).
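For what it's worth, the ordering difference on its own follows from the category order pd.cut assigns: categories keep the order of the labels argument, while the str copy sorts lexicographically (the wrong values in the middle example are a separate matter, apparently a bug in the pandas version used). A quick check:

```python
import pandas as pd

vals = pd.Series([3, 1, 3, 1, 3, 2, 2, 3, 2, 2])
l2 = pd.cut(vals, bins=[0, 2, 4], labels=['l', 'h'])

# Categories keep the order given in `labels`, so 'l' sorts before 'h'
print(list(l2.cat.categories))          # ['l', 'h']
# As plain strings, lexicographic order puts 'h' first
print(sorted(l2.astype(str).unique()))  # ['h', 'l']
```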

Create a unique indicator to join two datasets in pandas/python

How can I combine four columns in a dataframe in pandas/python to create a unique indicator and do a left join?
Is this even the best way to do what I am trying to accomplish?
example: make a unique indicator (col5)
then setup a join with another dataframe using the same logic
col1   col2  col3   col4  col5
apple  pear  mango  tea   applepearmangotea
then do a join something like
pd.merge(df1, df2, how='left', on='col5')
This problem is the same whether its 4 columns or 2. You don't need to create a unique combined key. You just need to merge on multiple columns.
Consider the two dataframes d1 and d2. They share two columns in common.
d1 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [1, 1, 'g', 'h']
], columns=list('ABCD'))

d2 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [2, 0, 'g', 'h']
], columns=list('ABEF'))
d1
   A  B  C  D
0  0  0  a  b
1  0  1  c  d
2  1  0  e  f
3  1  1  g  h
d2
   A  B  E  F
0  0  0  a  b
1  0  1  c  d
2  1  0  e  f
3  2  0  g  h
We can perform the equivalent of a left join using pd.DataFrame.merge:
d1.merge(d2, 'left')
   A  B  C  D    E    F
0  0  0  a  b    a    b
1  0  1  c  d    c    d
2  1  0  e  f    e    f
3  1  1  g  h  NaN  NaN
We can also be explicit about the columns:
d1.merge(d2, 'left', on=['A', 'B'])
   A  B  C  D    E    F
0  0  0  a  b    a    b
1  0  1  c  d    c    d
2  1  0  e  f    e    f
3  1  1  g  h  NaN  NaN
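To verify which rows of a left join actually matched, merge's indicator parameter adds a _merge column; a minimal sketch using the same d1 and d2:

```python
import pandas as pd

d1 = pd.DataFrame([[0, 0, 'a', 'b'], [0, 1, 'c', 'd'],
                   [1, 0, 'e', 'f'], [1, 1, 'g', 'h']],
                  columns=list('ABCD'))
d2 = pd.DataFrame([[0, 0, 'a', 'b'], [0, 1, 'c', 'd'],
                   [1, 0, 'e', 'f'], [2, 0, 'g', 'h']],
                  columns=list('ABEF'))

# indicator=True records where each row came from
merged = d1.merge(d2, how='left', on=['A', 'B'], indicator=True)
print(merged['_merge'].tolist())
# ['both', 'both', 'both', 'left_only']
```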

Including missing combinations of values in a pandas groupby aggregation

Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
Example pandas DataFrame has three columns, User, Code, and Subtotal:
import pandas as pd
example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1],
     ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'])
I'd like to group on User and Code and get a subtotal for each combination of User and Code.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?
Preferred output
Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
You can use unstack with stack:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
                .unstack(fill_value=0)
                .stack()
                .reset_index(name='Subtotal'))
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
Another solution: reindex with a MultiIndex created by from_product:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User', 'Code'])
print(mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['User', 'Code'])
print(df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
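A further variant, sketched here as an assumption rather than taken from the answers above (it requires a pandas version with groupby's observed parameter): cast the key columns to categorical, and groupby with observed=False emits every combination of categories, with sum() filling the empty groups with 0:

```python
import pandas as pd

example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1],
     ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'])

# observed=False keeps unobserved category combinations, such as ('c', 2)
out = (example_df
       .astype({'User': 'category', 'Code': 'category'})
       .groupby(['User', 'Code'], observed=False)['Subtotal']
       .sum()
       .reset_index())
print(out)
```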
