I have a dataset with multiple columns that I wish to one-hot encode. However, I don't want a separate encoding for each column, since the columns all describe the same set of items. What I want is one "set" of dummy variables that draws on all the columns. See my code for a better explanation.
Suppose my dataframe looks like this:
In [103]: dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})
In [104]: dum
Out[104]:
ch1 ch2 ch3
0 A B C
1 C G D
2 A F E
If I execute
pd.get_dummies(dum)
The output will be
ch1_A ch1_C ch2_B ch2_F ch2_G ch3_C ch3_D ch3_E
0 1 0 1 0 0 1 0 0
1 0 1 0 0 1 0 1 0
2 1 0 0 1 0 0 0 1
However, what I would like to obtain is something like this:
A B C D E F G
1 1 1 0 0 0 0
0 0 1 1 0 0 1
1 0 0 0 1 1 0
Instead of having multiple columns representing the encoding, e.g. ch1_A and ch1_C, I only wish to have one group (A, B, and so on) with value 1 when any of the values in the columns ch1, ch2, ch3 show up.
To clarify, in my original dataset, a single row won't contain the same value (A,B,C...) more than once; it will just appear on one of the columns.
Using stack and str.get_dummies
# sum(level=0) was removed in pandas 2.0; group on the row level instead
dum.stack().str.get_dummies().groupby(level=0).sum()
Out[938]:
A B C D E F G
0 1 1 1 0 0 0 0
1 0 0 1 1 0 0 1
2 1 0 0 0 1 1 0
You could use pd.crosstab to create a frequency table:
import pandas as pd
dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})
stacked = dum.stack()
index = stacked.index.get_level_values(0)
result = pd.crosstab(index=index, columns=stacked)
result.index.name = None
result.columns.name = None
print(result)
yields
A B C D E F G
0 1 1 1 0 0 0 0
1 0 0 1 1 0 0 1
2 1 0 0 0 1 1 0
Call it this way
x = pd.get_dummies(dum, prefix="", prefix_sep="")
And then print using
print(x.to_string(index=False))
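Note that with an empty prefix, indicator columns from different source columns end up with the same label whenever they share a value, so `C` appears twice here (once from ch1, once from ch3). Below is a minimal sketch of checking for that and collapsing the duplicates. The names `duplicated_labels` and `merged` are illustrative, and `dtype=int` is passed only so the result shows 0/1 rather than booleans.

```python
import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'],
                    'ch2': ['B', 'G', 'F'],
                    'ch3': ['C', 'D', 'E']})

# Empty prefix: one indicator column per (source column, value) pair,
# labelled by the value only.
x = pd.get_dummies(dum, prefix="", prefix_sep="", dtype=int)

# 'C' occurs in both ch1 and ch3, so its label shows up twice.
duplicated_labels = x.columns[x.columns.duplicated()].tolist()

# Collapse same-named columns: transpose, group the rows by label,
# take the max, and transpose back.
merged = x.T.groupby(level=0).max().T

print(duplicated_labels)
print(merged)
```

Taking the max rather than the sum keeps every entry at 0 or 1 even if a value were to repeat within a row.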
You can create dummies for separate columns and concat the results:
temp = pd.concat([pd.get_dummies(dum[col]) for col in dum], axis=1)
A C B F G C D E
0 1 0 1 0 0 1 0 0
1 0 1 0 0 1 0 1 0
2 1 0 0 1 0 0 0 1
# groupby(axis=1) is deprecated since pandas 2.1; transpose, group, transpose back
temp.T.groupby(level=0).sum().T
A B C D E F G
0 1 1 1 0 0 0 0
1 0 0 1 1 0 0 1
2 1 0 0 0 1 1 0
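The answers above all follow the same pattern: reshape to one value per row, encode, then aggregate back to the original rows. Here is a hedged sketch of that pattern wrapped in a helper. The function name `one_hot_across_columns` is hypothetical, and using `max` instead of `sum` is a choice made here so the output stays binary even if a value repeats within a row.

```python
import pandas as pd


def one_hot_across_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one 0/1 indicator column per distinct value found anywhere in df."""
    # Long Series indexed by (original row, original column).
    stacked = df.stack()
    # One indicator column per distinct value, still one row per cell.
    dummies = pd.get_dummies(stacked, dtype=int)
    # Aggregate the cells of each original row; max keeps the result binary.
    return dummies.groupby(level=0).max()


dum = pd.DataFrame({'ch1': ['A', 'C', 'A'],
                    'ch2': ['B', 'G', 'F'],
                    'ch3': ['C', 'D', 'E']})
result = one_hot_across_columns(dum)
print(result)
```

If counts are wanted instead of indicators, replacing `max()` with `sum()` gives the number of times each value occurs per row.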
Related
Let's say we have the following df:
id A B C D
123 1 1 0 0
456 0 1 1 0
786 1 0 0 0
The id column represents a unique client.
Columns A, B, C, and D represent a product. These columns' values are binary.
1 means the client has that product.
0 means the client doesn't have that product.
I want to create a matrix table of sorts that counts the number of combinations of products that exist for all users.
This would be the desired output, given the df provided above:
A B C D
A 2 1 0 0
B 0 2 1 0
C 0 1 1 0
D 0 0 1 0
import pandas as pd
df = pd.read_fwf('table.dat', infer_nrows=1001)
cols = ['A', 'B', 'C', 'D']
df2 = df[cols]
df2.T.dot(df2)
Result:
A B C D
A 2 1 0 0
B 1 2 1 0
C 0 1 1 0
D 0 0 0 0
I think you want a dot product:
df2 = df.set_index('id')
out = df2.T.dot(df2)
Output:
A B C D
A 2 1 0 0
B 1 2 1 0
C 0 1 1 0
D 0 0 0 0
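Why the dot product works: entry (i, j) of `df2.T.dot(df2)` is the sum over clients of (has product i) times (has product j), which is the number of clients holding both. The diagonal therefore counts the clients per product. A self-contained sketch with the sample data built in memory rather than read from a file (the variable names `products` and `co_occurrence` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': [123, 456, 786],
                   'A': [1, 0, 1],
                   'B': [1, 1, 0],
                   'C': [0, 1, 0],
                   'D': [0, 0, 0]})

# Keep only the binary product columns, indexed by client id.
products = df.set_index('id')

# (products x clients) . (clients x products) -> products x products
co_occurrence = products.T.dot(products)
print(co_occurrence)
```

The result is symmetric, because "has A and B" is the same set of clients as "has B and A".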
Below is the input data
Name A B C D E F G Total
Ray 1 2 2 0 0 0 0 5
Tom 0 0 0 2 1 0 0 3
Sam 0 0 0 0 0 3 1 4
Below is the intended output
Name A B C D E F G Total A:B:C D:E F:G
Ray 1 2 2 0 0 0 0 5 20:40:40 0:0 0:0
Tom 0 0 0 2 1 0 0 3 0:0:0 67:33 0:0
Sam 0 0 0 0 0 3 1 4 0:0:0 0:0 75:25
The idea is to list the column groups in cols, then loop over them: divide the selected columns by their row sum, replace NaNs with 0, multiply by 100, round, convert to integers, and finally join the values as strings:
# check column names
print(df.columns.tolist())
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Total']
cols = [('A','B','C'), ('D','E'), ('F','G')]
for c in cols:
    # for testing
    print(c)
    print(df.loc[:, c])
    df[f'{":".join(c)}'] = (df.loc[:, c]
                              .div(df.loc[:, c].sum(axis=1), axis=0)
                              .fillna(0)
                              .mul(100)
                              .round()
                              .astype(int)
                              .astype(str)
                              .agg(':'.join, axis=1))
print (df)
A B C D E F G Total A:B:C D:E F:G
0 1 2 2 0 0 0 0 5 20:40:40 0:0 0:0
1 0 0 0 2 1 0 0 3 0:0:0 67:33 0:0
2 0 0 0 0 0 3 1 4 0:0:0 0:0 75:25
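The loop above can also be written with the per-group calculation pulled into a helper, which makes each step easier to test. This is a sketch only: the helper name `ratio_string` is hypothetical, and the sample frame is rebuilt here from the question's table.

```python
import pandas as pd


def ratio_string(df, group):
    """Percentage share of each column in `group`, per row, joined with ':'."""
    part = df[list(group)]
    # Rows whose group sum is 0 give 0/0 = NaN, which is replaced by 0.
    share = part.div(part.sum(axis=1), axis=0).fillna(0)
    return share.mul(100).round().astype(int).astype(str).agg(':'.join, axis=1)


df = pd.DataFrame({'Name': ['Ray', 'Tom', 'Sam'],
                   'A': [1, 0, 0], 'B': [2, 0, 0], 'C': [2, 0, 0],
                   'D': [0, 2, 0], 'E': [0, 1, 0],
                   'F': [0, 0, 3], 'G': [0, 0, 1],
                   'Total': [5, 3, 4]})

for group in [('A', 'B', 'C'), ('D', 'E'), ('F', 'G')]:
    df[':'.join(group)] = ratio_string(df, group)

print(df)
```

Because each share is rounded independently, the parts of a ratio are not guaranteed to add up to 100 (three equal shares would give 33:33:33).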
I have a dataset that looks like this:
df = pd.DataFrame(data= [[0,0,1],[1,0,0],[0,1,0]], columns = ['A','B','C'])
A B C
0 0 0 1
1 1 0 0
2 0 1 0
I want to create a new column that holds, for each row, the name of the column that contains the 1:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Use dot:
df['value'] = df.values.dot(df.columns)
Output:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Using pd.DataFrame.idxmax:
df['value'] = df.idxmax(1)
print(df)
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
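One caveat with idxmax: for a row that contains no 1 at all, it still returns the first column label, since every column ties for the maximum. A minimal sketch that guards against this, assuming such rows should get a missing value instead (the extra all-zero row and the names `indicator_cols` and `has_one` are added here for illustration):

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]],
                  columns=['A', 'B', 'C'])

# Restrict to the indicator columns so the new column never feeds back in.
indicator_cols = ['A', 'B', 'C']

# True for rows that contain at least one 1.
has_one = df[indicator_cols].any(axis=1)

# Label of the first maximum per row; blank it out where the row is all zeros.
df['value'] = df[indicator_cols].idxmax(axis=1).where(has_one)
print(df)
```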
I have a column with categories (A, B, C, D) I want to turn into dummy variables. Problem is, this column can contain multiple categories per row, like this:
DF = pd.DataFrame({'Col':['A', 'A, B', 'A, C', 'B, C, D', 'D']})
Col
0 A
1 A, B
2 A, C
3 B, C, D
4 D
My thought at this point is to first split the variable into multiple fields using ',' as the delimiter, then dummy-code the results. Something like this:
DF2 = DF['Col'].str.split(', ', expand = True)
0 1 2
0 A None None
1 A B None
2 A C None
3 B C D
4 D None None
pd.get_dummies(DF2)
0_A 0_B 0_D 1_B 1_C 2_D
0 1 0 0 0 0 0
1 1 0 0 1 0 0
2 1 0 0 0 1 0
3 0 1 0 0 1 1
4 0 0 1 0 0 0
Finally, run some sort of loop across the columns to create a single set of dummy variables for A, B, C, and D. This can work, but gets quite tedious with many more variables/categories. Is there an easier way to achieve this?
Simplest way is
DF.Col.str.get_dummies(', ')
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
Slightly more complicated
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = DF.Col.values.astype(str)
# np.char.split is the public name for numpy.core.defchararray.split
d = mlb.fit_transform(np.char.split(s, ', '))
pd.DataFrame(d, columns=mlb.classes_)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
By using pd.crosstab
import pandas as pd
df = pd.DataFrame({'Col':['A', 'A,B', 'A,C', 'B,C,D', 'D']})
df.Col=df.Col.str.split(',')
df1=df.Col.apply(pd.Series).stack()
pd.crosstab(df1.index.get_level_values(0),df1)
Out[893]:
col_0 A B C D
row_0
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
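A variation on the same split-then-count idea, sketched with `Series.explode` in place of `apply(pd.Series).stack()`. It also strips whitespace, so it handles both 'A,B' and 'A, B'. The names `labels` and `dummies` are illustrative, and plain arrays are passed to `pd.crosstab` so that no index alignment is involved.

```python
import pandas as pd

df = pd.DataFrame({'Col': ['A', 'A, B', 'A, C', 'B, C, D', 'D']})

# One row per (original row, category); the original index is repeated.
labels = df['Col'].str.split(',').explode().str.strip()

# Count how often each category occurs for each original row.
dummies = pd.crosstab(labels.index.to_numpy(), labels.to_numpy())

# Drop the automatic 'row_0' / 'col_0' axis names for a cleaner printout.
dummies.index.name = None
dummies.columns.name = None
print(dummies)
```

Like the crosstab answer above, this returns counts, so a category listed twice in one cell would show up as 2 rather than 1.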
Example
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> s
0 a
1 b
2 c
3 a
dtype: object
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Now I would like to map a and b to a dummy variable, but nothing else. How can I do that?
What I tried
>>> pd.get_dummies(s, columns=['a', 'b'])
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
A simpler method is to just mask the resultant df with the cols of interest:
In[16]:
pd.get_dummies(s)[list('ab')]
Out[16]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
So this will sub-select the resultant dummies df with the cols of interest
If you would rather not compute dummy columns for values you are not interested in, you can first filter the rows down to the values of interest, but this requires reindexing with a fill_value afterwards (thanks to @jezrael for the suggestion):
In[20]:
pd.get_dummies(s[s.isin(list('ab'))]).reindex(s.index, fill_value=0)
Out[20]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Setting everything else to NaN is one option (note that this modifies s in place):
s[~((s == 'a') | (s == 'b'))] = float('nan')
which yields:
>>> pd.get_dummies(s)
a b
0 1 0
1 0 1
2 0 0
3 1 0
Another way
In [3907]: pd.DataFrame({c:s.eq(c).astype(int) for c in ['a', 'b']})
Out[3907]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Or use (s == c).astype(int) in place of s.eq(c).astype(int).
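Yet another option, sketched here as an alternative not taken from the answers above, is to convert the Series to a categorical dtype whose categories are exactly the values of interest. Values outside those categories become missing and get all-zero rows, and `s` itself is left unchanged. The name `wanted` is illustrative; `dtype=int` is passed because recent pandas versions return booleans from get_dummies by default.

```python
import pandas as pd

s = pd.Series(list('abca'))

# Only 'a' and 'b' are valid categories; 'c' becomes NaN after the cast.
wanted = pd.CategoricalDtype(categories=['a', 'b'])

# get_dummies emits one column per category, whether or not it occurs.
dummies = pd.get_dummies(s.astype(wanted), dtype=int)
print(dummies)
```

A side effect of this approach is that the columns are fixed by the category list, so a column for 'b' is still produced for data in which 'b' never occurs.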