Groupby and aggregate using lambda functions - python

I am trying to groupby-aggregate a dataframe using lambda functions that are created programmatically, so that I can simulate a one-hot encoding of the categories present in a column.
Dataframe:
df = pd.DataFrame(np.array([[10, 'A'], [10, 'B'], [20, 'A'], [30, 'B']]),
                  columns=['ID', 'category'])
ID category
10 A
10 B
20 A
30 B
Expected result:
ID A B
10 1 1
20 1 0
30 0 1
What I am trying:
one_hot_columns = ['A','B']
lambdas = [lambda x: 1 if x.eq(column).any() else 0 for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
Result:
ID A B
10 1 1
20 0 0
30 1 1
But the above is not quite the expected result, and I'm not sure what I am doing wrong.
I know I could do this with get_dummies, but programmatically generated lambdas are more convenient for automation, and they also let me guarantee the order of the output columns.
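(A likely cause, noted here for future readers: Python closures bind the loop variable by reference, so by the time the lambdas run they all compare against the last value, 'B'. A sketch of one common fix, binding the current value as a default argument; the final renaming assumes a pandas version that auto-names list-of-lambda aggregations:)
one_hot_columns = ['A', 'B']
# Bind the loop variable as a default argument so each lambda keeps
# its own value of `column` instead of sharing the final one.
lambdas = [lambda x, column=column: 1 if x.eq(column).any() else 0
           for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
df_g.columns = one_hot_columns  # agg otherwise names the columns <lambda_0>, <lambda_1>, ...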

Use crosstab:
pd.crosstab(df.ID, df['category']).reset_index()
Output:
category ID A B
0 10 1 1
1 20 1 0
2 30 0 1
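If you also need to guarantee the column order mentioned in the question, reindexing the crosstab should do it (a sketch; one_hot_columns is taken from the question, and fill_value=0 covers any category absent from the data):
out = (pd.crosstab(df.ID, df['category'])
         .reindex(columns=one_hot_columns, fill_value=0)
         .reset_index())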

You can use pd.get_dummies with GroupBy.sum:
In [4331]: res = pd.get_dummies(df, columns=['category']).groupby('ID', as_index=False).sum()
In [4332]: res
Out[4332]:
ID category_A category_B
0 10 1 1
1 20 1 0
2 30 0 1
Or use pd.concat with pd.get_dummies:
In [4329]: res = pd.concat([df, pd.get_dummies(df.category)], axis=1).groupby('ID', as_index=False).sum()
In [4330]: res
Out[4330]:
ID A B
0 10 1 1
1 20 1 0
2 30 0 1
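One caveat, assuming an ID can repeat a category: .sum() would then produce counts greater than 1 rather than 0/1 indicators. Taking .max() instead keeps the one-hot semantics; a sketch:
res = (pd.get_dummies(df, columns=['category'])
         .groupby('ID', as_index=False)
         .max())  # max stays 0/1 even when a category repeats within an ID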

Related

Are there any pythonic way to find class counts for pandas dataframe in given condition? [duplicate]

Convert Ratings into Columns and Columns as rows in Pandas [duplicate]

pd.get_dummies() with separator and counts

I have data that looks like:
index stringColumn
0 A_B_B_B_C_C_D
1 A_B_C_D
2 B_C_D_E_F
3 A_E_F_F_F
I need to vectorize this stringColumn with counts, ending up with:
index A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
Therefore I need to do both counting and splitting. Pandas' str.get_dummies() lets me split the string with the sep='_' argument, but it does not count repeated values; pd.get_dummies() does the counting but does not accept a separator.
My data is huge, so I am looking for vectorized solutions rather than for loops.
You can use Series.str.split with get_dummies and sum:
df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
         .sum(level=0, axis=1))
Or count values per row with value_counts, replace missing values with DataFrame.fillna, and convert to integers:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .apply(pd.value_counts, axis=1)
         .fillna(0)
         .astype(int))
Or use collections.Counter; performance should be very good:
from collections import Counter

df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']])
         .fillna(0)
         .astype(int))
Or reshape with DataFrame.stack and count with SeriesGroupBy.value_counts:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .stack()
         .groupby(level=0)
         .value_counts()
         .unstack(fill_value=0))
print (df1)
A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
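Another vectorized route, not from the original answers but sketched here under the assumption of pandas >= 0.25 (where Series.explode exists): explode the split lists and let crosstab do the counting per original row.
s = df['stringColumn'].str.split('_').explode()
# explode repeats the original index, so crosstab counts per input row
df1 = pd.crosstab(s.index, s)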

Dummy variables when not all categories are present across multiple features & data sets

I want to ask an extension of this question, which talks about adding a label for missing classes so that the dummy columns are encoded correctly.
Is there a way to do this automatically across multiple data sets and have the labels synchronized between them (i.e. for test and training sets with the same columns, but with different classes of data represented in each)?
E.g.:
Suppose I have the following two dataframes:
df1 = pd.DataFrame.from_items([('col1', list('abc')), ('col2', list('123'))])
df2 = pd.DataFrame.from_items([('col1', list('bcd')), ('col2', list('234'))])
df1
col1 col2
1 a 1
2 b 2
3 c 3
df2
col1 col2
1 b 2
2 c 3
3 d 4
I want to have:
df1
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1 1 0 0 0 1 0 0 0
2 0 1 0 0 0 1 0 0
3 0 0 1 0 0 0 1 0
df2
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1 0 1 0 0 0 1 0 0
2 0 0 1 0 0 0 1 0
3 0 0 0 1 0 0 0 1
WITHOUT having to specify in advance that
col1_labels = ['a', 'b', 'c', 'd'], col2_labels = ['1', '2', '3', '4']
And can I do this systematically for many columns all at once? I'm imagining a function that, when passed two or more dataframes (assuming the columns are the same for all):
reads which columns in the pandas dataframe are categories
figures out what the overall labels are
and then provides the category labels to each column
Does that seem right? Is there a better way?
I think you need to reindex by the union of all columns, assuming both DataFrames have the same categorical column names:
print (df1)
df1
1 a
2 b
3 c
print (df2)
df1
1 b
2 c
3 d
df1 = pd.get_dummies(df1)
df2 = pd.get_dummies(df2)
union = df1.columns | df2.columns
df1 = df1.reindex(columns=union, fill_value=0)
df2 = df2.reindex(columns=union, fill_value=0)
print (df1)
df1_a df1_b df1_c df1_d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
print (df2)
df1_a df1_b df1_c df1_d
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
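To run this across many DataFrames and many columns at once, a small helper can compute the union once and reindex each frame (a sketch, not from the original answer; it uses Index.union, since the | operator on Indexes is discouraged in newer pandas):
def sync_dummies(*frames):
    # one-hot encode each frame, then align all of them
    # to the union of the resulting dummy columns
    encoded = [pd.get_dummies(f) for f in frames]
    union = encoded[0].columns
    for e in encoded[1:]:
        union = union.union(e.columns)
    return [e.reindex(columns=union, fill_value=0) for e in encoded]

df1, df2 = sync_dummies(df1, df2)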

How to use groupby and apply with DataFrames to set all values in a group column to 1 if one of the column values is 1?

I have a DataFrame with the following structure:
I want to transform the DataFrame so that for every unique user_id, if a column contains a 1 in any of that user's rows, then the whole column should contain 1s for that user_id. Assume that I don't know all of the column names in advance. Based on the above input, the output would be:
I have the following code (please excuse how verbose it is):
df = df.groupby('user_id').apply(self.transform_columns)

def transform_columns(self, x):
    x.apply(self.transform)

def transform(self, x):
    if 1 in x:
        for element in x:
            element = 1
        var = x
At the point of the transform function, x is definitely a series. For some reason this code is returning an empty DataFrame. By the way, if you also know a way of excluding certain columns from the transformation (e.g. user_id), that would be great. Please help.
I'm going to explain how I transformed the data into the initial input state because, after attempting Jezrael's answer, I am getting a KeyError on the 'user_id' column (which definitely exists in the df). The initial state of the data was as below:
I transformed it to the state shown in the first image in the question with the following code:
df2 = self.add_support_columns(df)
df = df.join(df2)

def add_support_columns(self, df):
    df['pivot_column'] = df.apply(self.get_col_name, axis=1)
    df['flag'] = 1
    df = df.pivot(index='user_id', columns='pivot_column')['flag']
    df.reset_index(inplace=True)
    df = df.fillna(0)
    return df
You can use set_index + groupby + transform with any, then reset_index:
It works because any treats the 1s as True, so if a group contains at least one 1 the transform returns 1, otherwise 0.
df = pd.DataFrame({
    'user_id': [33, 33, 33, 33, 22, 22],
    'q1': [1, 0, 0, 0, 0, 0],
    'q2': [0, 0, 0, 0, 1, 0],
    'q3': [0, 1, 0, 0, 0, 1],
})
df = df.reindex_axis(['user_id', 'q1', 'q2', 'q3'], 1)
print (df)
user_id q1 q2 q3
0 33 1 0 0
1 33 0 0 1
2 33 0 0 0
3 33 0 0 0
4 22 0 1 0
5 22 0 0 1
df = (df.set_index('user_id')
        .groupby('user_id')  # or groupby(level=0)
        .transform(lambda x: 1 if x.any() else 0)
        .reset_index())
print (df)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
Solution with join:
df = df[['user_id']].join(df.groupby('user_id').transform(lambda x: 1 if x.any() else 0))
print (df)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
EDIT:
A more dynamic solution with difference + reindex_axis:
# select only some columns
cols = ['q1', 'q2']
# all other columns are left untransformed
cols2 = df.columns.difference(cols)
df1 = df[cols2].join(df.groupby('user_id')[cols].transform(lambda x: 1 if x.any() else 0))
# if the same column order as the original is needed
df1 = df1.reindex_axis(df.columns, axis=1)
print (df1)
user_id q1 q2 q3
0 33 1 0 0
1 33 1 0 1
2 33 1 0 0
3 33 1 0 0
4 22 0 1 0
5 22 0 1 1
The logic can also be inverted:
# select only the columns that are not transformed
cols = ['user_id']
# all other columns are transformed
cols2 = df.columns.difference(cols)
df1 = df[cols].join(df.groupby('user_id')[cols2].transform(lambda x: 1 if x.any() else 0))
df1 = df1.reindex_axis(df.columns, axis=1)
print (df1)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
EDIT:
A more efficient solution returns only a boolean mask and then converts it to int:
df1 = df.groupby('user_id').transform('any').astype(int)
Timings:
In [170]: %timeit (df.groupby('user_id').transform(lambda x: 1 if x.any() else 0))
1 loop, best of 3: 514 ms per loop
In [171]: %timeit (df.groupby('user_id').transform('any').astype(int))
10 loops, best of 3: 84 ms per loop
Sample for timings:
np.random.seed(123)
N = 1000
df = pd.DataFrame(np.random.choice([0, 1], size=(N, 3)),
                  index=np.random.randint(1000, size=N))
df.index.name = 'user_id'
df = df.add_prefix('q').reset_index()
#print (df)
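As an aside not in the original answer: reindex_axis was removed in later pandas versions, so on a current install reindex(columns=...) is the replacement, while the efficient groupby path works unchanged. A sketch:
# reindex_axis(cols, 1) becomes reindex(columns=cols) in newer pandas
df = df.reindex(columns=['user_id', 'q1', 'q2', 'q3'])
df1 = df[['user_id']].join(df.groupby('user_id').transform('any').astype(int))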
