I have data that looks like:
index stringColumn
0 A_B_B_B_C_C_D
1 A_B_C_D
2 B_C_D_E_F
3 A_E_F_F_F
I need to vectorize this stringColumn with counts, ending up with:
index A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
Therefore I need to do both counting and splitting. The pandas str.get_dummies() function lets me split the string with the sep='_' argument, but it does not count repeated values. pd.get_dummies() does the counting, but it does not accept a separator.
My data is huge so I am looking for vectorized solutions, rather than for loops.
You can use Series.str.split with get_dummies and sum:
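# sum(level=...) was removed in pandas 2.0, so group the transposed frame by column label instead
df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
         .T.groupby(level=0).sum().T)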
Or count values per rows by value_counts, replace missing values by DataFrame.fillna and convert to integers:
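# Series.value_counts is used here because the top-level pd.value_counts is deprecated in recent pandas
df1 = (df['stringColumn'].str.split('_', expand=True)
         .apply(lambda row: row.value_counts(), axis=1)
         .fillna(0)
         .astype(int))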
Or use collections.Counter, performance should be very good:
from collections import Counter
df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']])
.fillna(0)
.astype(int))
Or reshape by DataFrame.stack and count by SeriesGroupBy.value_counts:
df1 = (df['stringColumn'].str.split('_', expand=True)
.stack()
.groupby(level=0)
.value_counts()
.unstack(fill_value=0))
print (df1)
A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
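For a quick check of the approaches above, here is a minimal self-contained sketch of the Counter variant run on the sample from the question. The explicit column sort is my own addition, not part of the answer, so that the column order does not depend on which token happens to appear first.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'stringColumn': ['A_B_B_B_C_C_D', 'A_B_C_D',
                                    'B_C_D_E_F', 'A_E_F_F_F']})

# One Counter per row; tokens missing from a row become NaN, then 0.
counts = (pd.DataFrame([Counter(s.split('_')) for s in df['stringColumn']])
            .fillna(0)
            .astype(int)
            .sort_index(axis=1))  # assumption: alphabetical column order is wanted
print(counts)
```

Note that this still loops in Python over the rows, so on very large data it is worth timing it against the other variants rather than assuming it is fastest.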
I am trying to groupby-aggregate a dataframe using lambda functions that are created programmatically, so that I can simulate a one-hot encoder of the categories present in a column.
Dataframe:
df = pd.DataFrame(np.array([[10, 'A'], [10, 'B'], [20, 'A'],[30,'B']]),
columns=['ID', 'category'])
ID category
10 A
10 B
20 A
30 B
Expected result:
ID A B
10 1 1
20 1 0
30 0 1
What I am trying:
one_hot_columns = ['A','B']
lambdas = [lambda x: 1 if x.eq(column).any() else 0 for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
Result:
ID A B
10 1 1
20 0 0
30 1 1
But the above is not quite the expected result. Not sure what I am doing wrong.
I know I could do this with get_dummies, but using lambdas is more convenient for automation. Also, I can ensure the order of the output columns.
Use crosstab:
pd.crosstab(df.ID, df['category']).reset_index()
Output:
category ID A B
0 10 1 1
1 20 1 0
2 30 0 1
You can use pd.get_dummies with Groupby.sum:
In [4331]: res = pd.get_dummies(df, columns=['category']).groupby('ID', as_index=False).sum()
In [4332]: res
Out[4332]:
ID category_A category_B
0 10 1 1
1 20 1 0
2 30 0 1
OR, use pd.concat with pd.get_dummies:
In [4329]: res = pd.concat([df, pd.get_dummies(df.category)], axis=1).groupby('ID', as_index=False).sum()
In [4330]: res
Out[4330]:
ID A B
0 10 1 1
1 20 1 0
2 30 0 1
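As for why the lambda attempt in the question fills both columns with the same values: this looks like Python's late-binding closure behaviour. Every lambda built in the comprehension looks up column only when it is called, by which point the loop has finished and column is 'B'. Below is a minimal sketch of the usual fix, binding the current value as a default argument. The integer IDs and the dict-of-Series assembly are my own simplifications, not taken from the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 10, 20, 30],
                   'category': ['A', 'B', 'A', 'B']})
one_hot_columns = ['A', 'B']

# Late binding: both lambdas see the final value of `column` ('B').
buggy = [lambda x: x == column for column in one_hot_columns]
print([f('A') for f in buggy])   # both compare against 'B'

# Fix: freeze the current value with a default argument.
lambdas = {column: (lambda x, column=column: int(x.eq(column).any()))
           for column in one_hot_columns}

grouped = df.groupby('ID')['category']
# Aggregate once per category; the dict order fixes the output column order.
df_g = pd.DataFrame({name: grouped.agg(fn) for name, fn in lambdas.items()})
print(df_g)
```

This keeps the lambda-based automation and the guaranteed column order the question asks for, at the cost of one groupby aggregation per category.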
I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e]. Then as rows a flag indicating if the value is at that particular row.
The matrix should look like this:
Data a b c d e
a,b,c 1 1 1 0 0
a,c,d 1 0 1 1 0
d,e 0 0 0 1 1
a,e 1 0 0 0 1
a,b,c,d,e 1 1 1 1 1
To separate column Data what I did is:
df['Data'].str.split(',', expand = True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, then call value_counts on each row with axis = 1.
Finally, fill the missing values with zero using fillna and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
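A minimal self-contained sketch of this answer on the sample data, in case it helps to see the result attached to the original column. The join step is my own addition. Keep in mind that str.get_dummies gives 0/1 flags, so a token repeated within one row would still show as 1.

```python
import pandas as pd

df = pd.DataFrame({'Data': ['a,b,c', 'a,c,d', 'd,e', 'a,e', 'a,b,c,d,e']})

# 0/1 indicator per token; repeated tokens in a row still give 1, not a count.
flags = df['Data'].str.get_dummies(sep=',')

# Keep the original column next to the indicator columns.
result = df.join(flags)
print(result)
```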
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
.explode('data_list')
.pivot_table(index='Data',
columns='data_list',
aggfunc=lambda x: 1,
fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
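The same split-then-explode idea can also be sketched with pd.crosstab instead of pivot_table. This is a swapped-in alternative of mine, not the answer's code. Like the pivot above, it uses the original string as the row label, so rows come out sorted and identical strings would be merged into one row.

```python
import pandas as pd

df = pd.DataFrame({'Data': ['a,b,c', 'a,c,d', 'd,e', 'a,e', 'a,b,c,d,e']})

# One row per (original string, token) pair.
exploded = (df.assign(data_list=df['Data'].str.split(','))
              .explode('data_list'))

# crosstab counts each pair; clip turns the counts into 0/1 flags.
flags_by_row = pd.crosstab(exploded['Data'], exploded['data_list']).clip(upper=1)
print(flags_by_row)
```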
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
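One caveat with the loop above: str.count counts substrings, which is fine for single-character keys but would overcount if one key were contained in another. A hedged sketch of a token-based variant follows; the multi-character sample data is hypothetical and only there to show the difference.

```python
import pandas as pd

# Hypothetical data where the key "b" is a substring of "ab" and "abc".
df = pd.DataFrame({"Data": ["ab,b", "b,abc"]})

# Split once, then count whole tokens rather than substrings.
tokens = df["Data"].str.split(",")
for k in ["ab", "b"]:
    # k=k binds the current key so each lambda uses its own value.
    df[k] = tokens.apply(lambda toks, k=k: toks.count(k))
print(df)
```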
Why can't I chain the get_dummies() function?
import pandas as pd
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
.get_dummies()
)
This works fine:
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
)
df = pd.get_dummies(df)
DataFrame.pipe can be helpful in chaining methods or function calls which are not natively attached to the DataFrame, like pd.get_dummies:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
Or with lambda:
df = (
df.drop(columns=['sepal_length'])
.pipe(lambda current_df: pd.get_dummies(current_df))
)
Sample DataFrame:
df = pd.DataFrame({'sepal_length': 1, 'a': list('ABACC'), 'b': list('ACCAB')})
df:
sepal_length a b
0 1 A A
1 1 B C
2 1 A C
3 1 C A
4 1 C B
Sample Output:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
df:
a_A a_B a_C b_A b_B b_C
0 1 0 0 1 0 0
1 0 1 0 0 0 1
2 1 0 0 0 0 1
3 0 0 1 1 0 0
4 0 0 1 0 1 0
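Since pipe forwards extra positional and keyword arguments to the function it calls, options for pd.get_dummies can be passed without a lambda. A small sketch on the same sample frame; using drop_first here is just my choice of option for illustration.

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': 1, 'a': list('ABACC'), 'b': list('ACCAB')})

encoded = (
    df.drop(columns=['sepal_length'])
      # pipe passes the frame as the first argument and forwards the keyword
      .pipe(pd.get_dummies, drop_first=True)
)
print(encoded.columns.tolist())
```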
You can't chain pd.get_dummies() because it is a top-level pandas function, not a pd.DataFrame method. However, assuming:
You have a single column left after you drop your columns in the previous step in the chain.
Your column is a string column dtype.
... you can use pd.Series.str.get_dummies() which is a series level method.
### Dummy Dataframe
# A B
# 0 1 x
# 1 2 y
# 2 3 z
pd.read_csv(path).drop(columns=['A'])['B'].str.get_dummies()
x y z
0 1 0 0
1 0 1 0
2 0 0 1
NOTE: Make sure that before you call the get_dummies() method, the data type of the object is series. In this case, I fetch column ['B'] to do that, which kinda makes the previous pd.DataFrame.drop() method unnecessary and useless :)
But this is only for example's sake.
Hi, I have a pandas dataframe df containing categorical variables.
df=pandas.DataFrame(data=[['male','blue'],['female','brown'],
['male','black']],columns=['gender','eyes'])
df
Out[16]:
gender eyes
0 male blue
1 female brown
2 male black
Using the function get_dummies, I get the following dataframe:
df_dummies = pandas.get_dummies(df)
df_dummies
Out[18]:
gender_female gender_male eyes_black eyes_blue eyes_brown
0 0 1 0 1 0
1 1 0 0 0 1
2 0 1 1 0 0
However, the columns gender_female and gender_male contain the same information, because the original column could only take two values. Is there a (smart) way to keep only one of the 2 columns?
UPDATED
The use of
df_dummies = pandas.get_dummies(df,drop_first=True)
Would give me
df_dummies
Out[21]:
gender_male eyes_blue eyes_brown
0 1 1 0
1 0 0 1
2 1 0 0
but I would like to drop a column only for the variables that originally had just 2 possible values.
The desired result should be:
df_dummies
Out[18]:
gender_male eyes_black eyes_blue eyes_brown
0 1 0 1 0
1 0 0 0 1
2 1 1 0 0
Yes, you can use the argument drop_first:
drop_first=True
From the documentation:
pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
To have all dummy columns for eyes, and one for gender, use this:
df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
df = pd.get_dummies(df,drop_first=True)
Output:
eyes_black eyes_blue eyes_brown gender_male
0 0 1 0 1
1 0 0 1 0
2 1 0 0 1
More general:
gender eyes heigh
0 male blue tall
1 female brown short
2 male black average
for i in df.columns:
    if len(df.groupby([i]).size()) > 2:
        df = pd.get_dummies(df, prefix=[i], columns=[i])
df = pd.get_dummies(df, drop_first=True)
Output:
eyes_black eyes_blue eyes_brown heigh_average heigh_short heigh_tall \
0 0 1 0 0 0 1
1 0 0 1 0 1 0
2 1 0 0 1 0 0
gender_male
0 1
1 0
2 1
You could use itertools.combinations to find all pairs of columns, then any potentially redundant pair of columns will be one where for every row one column is True and the other is False - i.e. an XOR:
import pandas as pd
from itertools import combinations
df = pd.DataFrame(data=[['male','blue'],['female','brown'],['male','black']],
columns=['gender','eyes'])
dummies = pd.get_dummies(df)
for c1, c2 in combinations(dummies.columns, 2):
    if all(dummies[c1] ^ dummies[c2]):
        print(c1,c2)
However, this also notices that in your examples all females have brown eyes, hence we get the following printed:
gender_female gender_male
gender_male eyes_brown
Alternatively, you can split the dataframe into the parts you want to apply drop_first=True to and the parts you don't, then concatenate them together.
df1 = df.iloc[:, 0:2]
df2 = df.iloc[:, 2:]
df1 = pd.get_dummies(df1 ,drop_first=True)
df = pd.concat([df1, df2], axis=1)
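The split can also be chosen automatically rather than by position. Here is a minimal sketch, assuming every column is categorical and that "binary" means exactly two distinct values; the names binary_cols, other_cols and parts are mine, not from the answer.

```python
import pandas as pd

df = pd.DataFrame(data=[['male', 'blue'], ['female', 'brown'], ['male', 'black']],
                  columns=['gender', 'eyes'])

# Columns with exactly two distinct values are redundant as two dummies.
binary_cols = [c for c in df.columns if df[c].nunique() == 2]
other_cols = [c for c in df.columns if c not in binary_cols]

parts = []
if binary_cols:
    parts.append(pd.get_dummies(df[binary_cols], drop_first=True))
if other_cols:
    parts.append(pd.get_dummies(df[other_cols]))

# astype(int) so the result is 0/1 regardless of the pandas default dummy dtype.
df_dummies = pd.concat(parts, axis=1).astype(int)
print(df_dummies)
```

On the question's data this gives gender_male plus all three eyes columns, which matches the desired result, although the binary columns come first rather than in their original positions.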