I am trying to groupby-aggregate a dataframe using lambda functions that are created programmatically, so that I can simulate a one-hot encoding of the categories present in a column.
Dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[10, 'A'], [10, 'B'], [20, 'A'], [30, 'B']]),
                  columns=['ID', 'category'])
ID category
10 A
10 B
20 A
30 B
Expected result:
ID A B
10 1 1
20 1 0
30 0 1
What I am trying:
one_hot_columns = ['A','B']
lambdas = [lambda x: 1 if x.eq(column).any() else 0 for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
Result:
ID A B
10 1 1
20 0 0
30 1 1
But the above is not quite the expected result. Not sure what I am doing wrong.
I know I could do this with get_dummies, but using lambdas is more convenient for automation, and it lets me ensure the order of the output columns.
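The lambdas themselves are the problem: Python closures bind variables late, so each lambda looks up `column` when it is called, by which time the loop has finished and `column` is 'B'. Both lambdas therefore test for 'B', which produces exactly the wrong result shown above. Binding the current value as a default argument keeps the programmatic approach working (a sketch; depending on the pandas version, agg may label the lambda columns generically, hence the rename):
one_hot_columns = ['A', 'B']

# Capture the current value of `column` as a default argument so each
# lambda tests its own category instead of the loop's final value.
lambdas = [lambda x, c=column: 1 if x.eq(c).any() else 0
           for column in one_hot_columns]

df_g = df.groupby('ID').category.agg(lambdas)
df_g.columns = one_hot_columns  # restore the intended column names/order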
Use crosstab:
pd.crosstab(df.ID, df['category']).reset_index()
Output:
category ID A B
0 10 1 1
1 20 1 0
2 30 0 1
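Since the question also wants control over the order of the output columns, the crosstab result can be reindexed against the predefined list (a sketch reusing `one_hot_columns` from the question; `fill_value=0` covers categories that never occur in the data):
out = (pd.crosstab(df.ID, df['category'])
         .reindex(columns=one_hot_columns, fill_value=0)
         .reset_index())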
You can use pd.get_dummies with GroupBy.sum:
In [4331]: res = pd.get_dummies(df, columns=['category']).groupby('ID', as_index=False).sum()
In [4332]: res
Out[4332]:
ID category_A category_B
0 10 1 1
1 20 1 0
2 30 0 1
Or, use pd.concat with pd.get_dummies:
In [4329]: res = pd.concat([df, pd.get_dummies(df.category)], axis=1).groupby('ID', as_index=False).sum()
In [4330]: res
Out[4330]:
ID A B
0 10 1 1
1 20 1 0
2 30 0 1
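One caveat for the sum-based approaches: if a category repeats within the same ID, the sums become counts greater than 1 rather than 0/1 indicators. Assuming a pure indicator is wanted, capping at 1 restores it (a sketch):
res = (pd.get_dummies(df, columns=['category'])
         .groupby('ID', as_index=False).sum())
res.iloc[:, 1:] = res.iloc[:, 1:].clip(upper=1)  # cap counts at 1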
Related
I have a dataframe where I want to convert each row into a diagonal dataframe and concatenate all the resulting dataframes into one large dataframe.
Input:
            a  b  c
2021-11-06  1  2  3
2021-11-07  4  5  6
Desired output:
               a  b  c
Date
2021-11-06  a  1  0  0
            b  0  2  0
            c  0  0  3
2021-11-07  a  4  0  0
            b  0  5  0
            c  0  0  6
I tried using apply on each row of the original dataframe.
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'],
                    index=pd.date_range('2021-11-06', '2021-11-07'))

def convert_dataframe(ser):
    df_ser = pd.DataFrame(0.0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser

data.apply(lambda x: convert_dataframe(x), axis=1)
However, the output is not the multi-index dataframe that I expected.
The output is instead a single index dataframe where each row is a reference to the diagonal dataframe returned.
Any help is much appreciated. Thanks in advance.
Use MultiIndex.droplevel to remove the first level of the MultiIndex, and call the function in GroupBy.apply after DataFrame.stack:
def convert_dataframe(ser):
    ser = ser.droplevel(0)
    df_ser = pd.DataFrame(0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser

data = data.stack().groupby(level=0).apply(convert_dataframe)
print(data)
               a  b  c
2021-11-06  a  1  0  0
            b  0  2  0
            c  0  0  3
2021-11-07  a  4  0  0
            b  0  5  0
            c  0  0  6
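The result carries a two-level index (date, original column), so a single day's diagonal block can be pulled out of the first level (a usage sketch; pd.Timestamp avoids string-vs-datetime ambiguity):
block = data.loc[pd.Timestamp('2021-11-06')]
print(block)
#    a  b  c
# a  1  0  0
# b  0  2  0
# c  0  0  3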
I have data that looks like:
index stringColumn
0 A_B_B_B_C_C_D
1 A_B_C_D
2 B_C_D_E_F
3 A_E_F_F_F
I need to vectorize this stringColumn with counts, ending up with:
index A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
Therefore I need to do both counting and splitting. The pandas str.get_dummies() method lets me split the string via its sep='_' argument, but it does not count repeated values. pd.get_dummies() does the counting, but it does not accept a separator.
My data is huge, so I am looking for vectorized solutions rather than for loops.
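For reference, the sample frame used in the answers below can be built like this (a sketch; the column name is taken from the question):
import pandas as pd

df = pd.DataFrame({'stringColumn': ['A_B_B_B_C_C_D', 'A_B_C_D',
                                    'B_C_D_E_F', 'A_E_F_F_F']})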
You can use Series.str.split with get_dummies and sum:
# note: sum(level=..., axis=1) is removed in recent pandas versions;
# .T.groupby(level=0).sum().T is an equivalent there
df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
         .sum(level=0, axis=1))
Or count values per row with value_counts, replace missing values with DataFrame.fillna, and convert to integers:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .apply(pd.value_counts, axis=1)
         .fillna(0)
         .astype(int))
Or use collections.Counter; performance should be very good:
from collections import Counter
df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']])
         .fillna(0)
         .astype(int))
Or reshape with DataFrame.stack and count with SeriesGroupBy.value_counts:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .stack()
         .groupby(level=0)
         .value_counts()
         .unstack(fill_value=0))
print(df1)
A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
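If the counts should end up alongside the original column, any of the variants can be joined back on the index (a usage sketch):
out = df.join(df1)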
I have a DataFrame with the following structure:
I want to transform the DataFrame so that for every unique user_id, if a column contains a 1, then the whole column should contain 1s for that user_id. Assume that I don't know all of the column names in advance. Based on the above input, the output would be:
I have the following code (please excuse how verbose it is):
df = df.groupby('user_id').apply(self.transform_columns)

def transform_columns(self, x):
    x.apply(self.transform)

def transform(self, x):
    if 1 in x:
        for element in x:
            element = 1
    var = x
At the point of the transform function, x is definitely a series. For some reason this code is returning an empty DataFrame. Btw if you also know a way of excluding certain columns from the transformation (e.g. user_id) that would be great. Please help.
I'm going to explain how I transformed the data into the initial state for the input, as after attempting Jezrael's answer, I am getting a KeyError on the 'user_id' column (which definitely exists in the df). The initial state of the data was as below:
I transformed it to the state shown in the first image in the question with the following code:
df2 = self.add_support_columns(df)
df = df.join(df2)
def add_support_columns(self, df):
    df['pivot_column'] = df.apply(self.get_col_name, axis=1)
    df['flag'] = 1
    df = df.pivot(index='user_id', columns='pivot_column')['flag']
    df.reset_index(inplace=True)
    df = df.fillna(0)
    return df
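Before the groupby solutions, two bugs in the original attempt are worth spelling out: `if 1 in x:` on a Series tests the index labels, not the values, and `transform_columns` never returns anything, so `groupby(...).apply(...)` collects None for every group, which is why the result comes back empty (rebinding `element` inside the loop also leaves the Series untouched). A minimal corrected sketch of the same per-column idea, reusing the question's function names (the groupby solutions below are still preferable):
def transform(col):
    if col.eq(1).any():              # test the values, not the index labels
        return pd.Series(1, index=col.index)
    return col

def transform_columns(group):
    return group.apply(transform)    # without this return, apply() sees None

df = df.groupby('user_id', group_keys=False).apply(transform_columns)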
You can use set_index + groupby + transform with any, then reset_index:
It works because any treats 1s as True: if a column within a group contains at least one 1, the whole group gets 1 in that column, otherwise 0.
df = pd.DataFrame({
    'user_id' : [33,33,33,33,22,22],
    'q1' : [1,0,0,0,0,0],
    'q2' : [0,0,0,0,1,0],
    'q3' : [0,1,0,0,0,1],
})
# reindex_axis was removed from pandas; reindex(..., axis=1) is the replacement
df = df.reindex(['user_id','q1','q2','q3'], axis=1)
print(df)
user_id q1 q2 q3
0 33 1 0 0
1 33 0 0 1
2 33 0 0 0
3 33 0 0 0
4 22 0 1 0
5 22 0 0 1
df = (df.set_index('user_id')
        .groupby('user_id')  # or groupby(level=0)
        .transform(lambda x: 1 if x.any() else 0)
        .reset_index())
print(df)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
Solution with join:
df = df[['user_id']].join(df.groupby('user_id').transform(lambda x: 1 if x.any() else 0))
print (df)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
EDIT:
A more dynamic solution with difference + reindex:
# select only some columns
cols = ['q1','q2']
# all other columns are left untransformed
cols2 = df.columns.difference(cols)
df1 = df[cols2].join(df.groupby('user_id')[cols].transform(lambda x: 1 if x.any() else 0))
# if the same column order as the original is needed
df1 = df1.reindex(df.columns, axis=1)
print(df1)
user_id q1 q2 q3
0 33 1 0 0
1 33 1 0 1
2 33 1 0 0
3 33 1 0 0
4 22 0 1 0
5 22 0 1 1
The logic can also be inverted:
# select only the columns which are not transformed
cols = ['user_id']
# all other columns are transformed
cols2 = df.columns.difference(cols)
df1 = df[cols].join(df.groupby('user_id')[cols2].transform(lambda x: 1 if x.any() else 0))
df1 = df1.reindex(df.columns, axis=1)
print(df1)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
EDIT:
A more efficient solution returns only a boolean mask and then converts it to int; passing 'any' by name lets pandas use its cythonized groupby implementation instead of calling a Python lambda for every group:
df1 = df.groupby('user_id').transform('any').astype(int)
Timings:
In [170]: %timeit (df.groupby('user_id').transform(lambda x: 1 if x.any() else 0))
1 loop, best of 3: 514 ms per loop
In [171]: %timeit (df.groupby('user_id').transform('any').astype(int))
10 loops, best of 3: 84 ms per loop
Sample for timings:
np.random.seed(123)
N = 1000
df = pd.DataFrame(np.random.choice([0,1], size=(N, 3)),
                  index=np.random.randint(1000, size=N))
df.index.name = 'user_id'
df = df.add_prefix('q').reset_index()
#print (df)