I have a dataframe as below:
data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1], ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns = ['Name', 'effect'])
Name effect
0 A 1
1 A 0
2 A 1
3 B 0
4 B 1
5 C 1
6 C 1
7 C 1
After doing a groupby I'm getting:
temp_df.groupby(['Name','effect']).size().reset_index(name='count')
Name effect count
0 A 0 1
1 A 1 2
2 B 0 1
3 B 1 1
4 C 1 3
But I need my result to look like this:
Name  e0  e1
A      1   2
B      1   1
C      0   3
You can cross-tabulate with crosstab(). To add e to the column names, chain add_prefix():
pd.crosstab(temp_df.Name, temp_df.effect).add_prefix('e')
# effect e0 e1
# Name
# A 1 2
# B 1 1
# C 0 3
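If you also want totals, crosstab takes margins=True, which appends an All row and column (a sketch on the same temp_df; note the All label picks up the prefix too):

```python
import pandas as pd

data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1],
        ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns=['Name', 'effect'])

# margins=True adds an "All" row/column holding the totals
out = pd.crosstab(temp_df.Name, temp_df.effect, margins=True).add_prefix('e')
print(out)
```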
Groupby with value_counts and unstack:
out = temp_df.groupby("Name")['effect'].value_counts().unstack(fill_value=0)
out = out.add_prefix(out.columns.name).rename_axis(columns=None).reset_index()
print(out)
Name effect0 effect1
0 A 1 2
1 B 1 1
2 C 0 3
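A related sketch, if you prefer to avoid unstack: run get_dummies on effect first, then sum the dummies per Name (this assumes the default integer column labels 0/1, which add_prefix turns into e0/e1):

```python
import pandas as pd

data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1],
        ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns=['Name', 'effect'])

# One dummy column per effect value; summing per group counts occurrences
out = (pd.get_dummies(temp_df.effect)
         .groupby(temp_df.Name).sum()
         .add_prefix('e')
         .reset_index())
print(out)
```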
You can use .pivot_table():
print(
temp_df.assign(tmp=temp_df["effect"])
.pivot_table(
index="Name",
columns="effect",
values="tmp",
aggfunc="count",
fill_value=0,
)
.add_prefix("e")
.reset_index()
)
Prints:
effect Name e0 e1
0 A 1 2
1 B 1 1
2 C 0 3
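The temporary tmp column can be skipped: pivot_table also accepts aggfunc="size", which just counts the rows in each Name/effect cell (a small variant of the above):

```python
import pandas as pd

data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1],
        ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns=['Name', 'effect'])

# aggfunc="size" counts rows per cell, so no values column is needed
out = (temp_df.pivot_table(index="Name", columns="effect",
                           aggfunc="size", fill_value=0)
       .add_prefix("e")
       .reset_index())
print(out)
```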
data = [['A', 1], ['A', 0], ['A', 1], ['B', 0], ['B', 1], ['C', 1], ['C', 1], ['C', 1]]
temp_df = pd.DataFrame(data, columns=['Name', 'e0'])
print(temp_df)
temp_df.groupby(['Name', 'e0']).size().reset_index(name='e1')
I have a dataset with 82 columns, and would like to turn all column values into dummy variables using pd.get_dummies except for the first column "business_id".
How can I define the pd.get_dummies function to only work on the other 81 columns?
You can exclude columns based on location by slicing df.columns:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'a', 'c'],
'C': [1, 2, 3],
'D': [4, 5, 6]})
df = pd.get_dummies(df, columns=df.columns[1:])
# For Display
print(df.to_string(index=False))
Output:
A  B_a  B_b  B_c  C_1  C_2  C_3  D_4  D_5  D_6
a    0    1    0    1    0    0    1    0    0
b    1    0    0    0    1    0    0    1    0
a    0    0    1    0    0    1    0    0    1
For a more general solution, you can filter out particular columns programmatically using filter over your df.columns.
Put whatever column names you want to exclude in columns_to_exclude.
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'a', 'c'],
'C': [1, 2, 3],
'D': [4, 5, 6]})
columns_to_exclude = ['B', 'D']
df = pd.get_dummies(df, columns=list(filter(
    lambda i: i not in columns_to_exclude,
    df.columns)))
# For Display
print(df.to_string(index=False))
Note the filter is wrapped in list(): a bare filter iterator can be consumed on its first pass inside get_dummies, which leaves the original A and C columns in place alongside their dummies instead of dropping them.
Output:
B  D  A_a  A_b  C_1  C_2  C_3
b  4    1    0    1    0    0
a  5    0    1    0    1    0
c  6    1    0    0    0    1
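Another way to build the column list is a plain list comprehension; keep below is a hypothetical list of the columns to leave un-encoded (for the question's case it would be keep = ['business_id']):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

# Encode every column except the ones named in keep
keep = ['A']
df = pd.get_dummies(df, columns=[c for c in df.columns if c not in keep])
print(df.to_string(index=False))
```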
My data frame looks like this:
[image: pandas data frame with multiple categorical variables for a user]
I made sure there are no duplicates in it. I want to encode it, and I want my final output like this.
I tried using pandas get_dummies directly but I am not getting the desired result. Can anyone help me with this?
IIUC, your user column is empty and everything is in name. If that's the case, you can:
pd.pivot_table(df, index=df.name.str[0], columns=df.name.str[1:].values, aggfunc='count').fillna(0)
You can split each row in name using r'(\d+)' to separate digits from letters, and use pd.crosstab:
d = pd.DataFrame(df.name.str.split(r'(\d+)').values.tolist())
pd.crosstab(columns=d[2], index=d[1], values=d[1], aggfunc='count')
You could try the str accessor get_dummies, grouped by the user column:
df.name.str.get_dummies().groupby(df.user).sum()
Example
Given your sample DataFrame
df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})
df_dummies = df.name.str.get_dummies().groupby(df.user).sum()
print(df_dummies)
[out]
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 1 0 0
3 0 0 0 1 1 1
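On this sample, pd.crosstab builds the same table in one call; since each (user, name) pair occurs once, the counts are already 0/1 indicators (a sketch, not a drop-in if your data has duplicate pairs):

```python
import pandas as pd

df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
                   'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})

# Count of each name per user; unique pairs make this a dummy matrix
out = pd.crosstab(df.user, df.name)
print(out)
```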
Assuming the following dataframe:
user name
0 1 a
1 1 b
2 1 c
3 1 d
4 2 a
5 2 b
6 2 c
7 3 d
8 3 e
9 3 f
You could groupby user and then use get_dummies:
import pandas as pd
# create data-frame
data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
[2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])
# group and get_dummies
grouped = df.groupby('user')['name'].apply(lambda x: '|'.join(x))
print(grouped.str.get_dummies())
Output
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 0 0 0
3 0 0 0 1 1 1
As a side-note, you can do it all in one line:
result = df.groupby('user')['name'].apply(lambda x: '|'.join(x)).str.get_dummies()
I ran into this earlier today when creating pivot tables after categorizing a column of values using pd.cut. When creating the pivot tables, I was finding that the resulting index was incorrect. This was not an issue when using groupby instead, or after converting the category column to a different dtype.
Simplified example:
df = pd.DataFrame({'l1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b']
, 'g1': [1, 1, 2, 2, 1, 1, 1, 2, 2, 2]
, 'vals': [3, 1, 3, 1, 3, 2, 2, 3, 2, 2]})
df['l2'] = pd.cut(df.vals, bins=[0, 2, 4], labels=['l', 'h'])
df = df[['l1', 'l2', 'g1', 'vals']]
Using groupby:
df.groupby(['l1', 'l2', 'g1']).vals.agg(('sum', 'count')).unstack()[['count', 'sum']]
      count    sum
g1     1  2    1  2
l1 l2
a  l   1  1    1  1
   h   1  1    3  3
b  l   2  2    4  4
   h   1  1    3  3
Using pd.pivot_table:
pd.pivot_table(df, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
      vals
      count    sum
g1     1  2    1  2
l1 l2
a  h   1  1    1  1
   l   1  1    3  3
b  h   2  2    4  4
   l   1  1    3  3
Using pd.pivot_table after converting the l2 column to str dtype:
df2 = df.copy()
df2['l2'] = df2.l2.astype(str)
pd.pivot_table(df2, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
      vals
      count    sum
g1     1  2    1  2
l1 l2
a  h   1  1    3  3
   l   1  1    1  1
b  h   1  1    3  3
   l   2  2    4  4
The order in the last example is reversed, but the values are correct (in contrast to the middle example, where the order is reversed and the values are incorrect).
How can I combine four columns in a dataframe in pandas/python to create a unique indicator and do a left join?
Is this even the best way to do what I am trying to accomplish?
Example: make a unique indicator (col5), then set up a join with another dataframe using the same logic.
col1 col2 col3 col4 col5
apple pear mango tea applepearmangotea
then do a join something like
pd.merge(df1, df2, how='left', on='col5')
This problem is the same whether it's 4 columns or 2. You don't need to create a unique combined key; you just need to merge on multiple columns.
Consider the two dataframes d1 and d2. They share two columns in common.
d1 = pd.DataFrame([
[0, 0, 'a', 'b'],
[0, 1, 'c', 'd'],
[1, 0, 'e', 'f'],
[1, 1, 'g', 'h']
], columns=list('ABCD'))
d2 = pd.DataFrame([
[0, 0, 'a', 'b'],
[0, 1, 'c', 'd'],
[1, 0, 'e', 'f'],
[2, 0, 'g', 'h']
], columns=list('ABEF'))
d1
A B C D
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 1 1 g h
d2
A B E F
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 2 0 g h
We can perform the equivalent of a left join using pd.DataFrame.merge:
d1.merge(d2, how='left')
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
We can be explicit with the columns:
d1.merge(d2, how='left', on=['A', 'B'])
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
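As a sanity check on joins like this, merge's indicator parameter adds a _merge column showing which side each row came from (a sketch on the same d1/d2):

```python
import pandas as pd

d1 = pd.DataFrame([[0, 0, 'a', 'b'], [0, 1, 'c', 'd'],
                   [1, 0, 'e', 'f'], [1, 1, 'g', 'h']],
                  columns=list('ABCD'))
d2 = pd.DataFrame([[0, 0, 'a', 'b'], [0, 1, 'c', 'd'],
                   [1, 0, 'e', 'f'], [2, 0, 'g', 'h']],
                  columns=list('ABEF'))

# _merge is 'both' for matched rows and 'left_only' for unmatched ones
out = d1.merge(d2, how='left', on=['A', 'B'], indicator=True)
print(out)
```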
Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
Example pandas DataFrame has three columns, User, Code, and Subtotal:
import pandas as pd
example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1],
     ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'])
I'd like to group on User and Code and get a subtotal for each combination of User and Code.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?
Preferred output
Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
You can use unstack with stack:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Another solution with reindex by MultiIndex created from_product:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print (mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['User', 'Code'])
print (df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
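One more variant (a sketch relying on categorical groupby semantics): cast the key columns to category dtype; with observed=False the groupby emits every combination of levels, and the sum over the empty (c, 2) group comes out as 0.

```python
import pandas as pd

example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1],
                           ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
                          columns=['User', 'Code', 'Subtotal'])

# Categorical keys + observed=False keep unobserved combinations
out = (example_df.astype({'User': 'category', 'Code': 'category'})
       .groupby(['User', 'Code'], observed=False).Subtotal.sum()
       .reset_index())
print(out)
```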