I have a column with categories (A, B, C, D) that I want to turn into dummy variables. The problem is that this column can contain multiple categories per row, like this:
DF = pd.DataFrame({'Col':['A', 'A, B', 'A, C', 'B, C, D', 'D']})
Col
0 A
1 A, B
2 A, C
3 B, C, D
4 D
My thought at this point is to first split the variable into multiple fields using ', ' as the delimiter, then dummy-code the results. Something like this:
DF2 = DF['Col'].str.split(', ', expand = True)
0 1 2
0 A None None
1 A B None
2 A C None
3 B C D
4 D None None
pd.get_dummies(DF2)
0_A 0_B 0_D 1_B 1_C 2_D
0 1 0 0 0 0 0
1 1 0 0 1 0 0
2 1 0 0 0 1 0
3 0 1 0 0 1 1
4 0 0 1 0 0 0
Finally, run some sort of loop across the columns to create a single set of dummy variables for A, B, C, and D. This works, but it gets quite tedious with many more variables/categories. Is there an easier way to achieve this?
The simplest way is str.get_dummies:
DF.Col.str.get_dummies(', ')
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
Slightly more complicated, using scikit-learn:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = DF.Col.values.astype(str)
# np.char.split is the public name for numpy.core.defchararray.split
d = mlb.fit_transform(np.char.split(s, ', '))
pd.DataFrame(d, columns=mlb.classes_)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
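One caveat for the str.get_dummies approach above: the separator is matched literally, so ', ' will not split 'A,B' or 'A , C'. The sketch below is one possible way to handle that, assuming the spacing around the commas is inconsistent; the sample Series is made up for illustration and is not from the question.

```python
import pandas as pd

# Hypothetical input with inconsistent spacing around the commas.
s = pd.Series(['A', 'A,B', 'A , C', 'B, C,D', 'D'])

# Trim the ends, then collapse each comma (plus any surrounding
# whitespace) into '|', the default separator of str.get_dummies.
normalized = s.str.strip().str.replace(r'\s*,\s*', '|', regex=True)

dummies = normalized.str.get_dummies()
print(dummies)
```

Normalizing first keeps the get_dummies call itself unchanged, whatever the raw delimiter looks like.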
By using pd.crosstab
import pandas as pd
df = pd.DataFrame({'Col':['A', 'A,B', 'A,C', 'B,C,D', 'D']})
df['Col'] = df['Col'].str.split(',')
df1 = df['Col'].apply(pd.Series).stack()
pd.crosstab(df1.index.get_level_values(0), df1)
Out[893]:
col_0 A B C D
row_0
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
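A similar long-form reshape can probably be done with Series.explode instead of apply(pd.Series).stack(). This is only a sketch under the assumption that the column holds comma-separated strings with no extra spaces, as in the crosstab example; it swaps crosstab for get_dummies plus a groupby on the original row index.

```python
import pandas as pd

df = pd.DataFrame({'Col': ['A', 'A,B', 'A,C', 'B,C,D', 'D']})

# One row per (original row, category); the original index is repeated.
exploded = df['Col'].str.split(',').explode()

# Dummy-code the long form, then collapse back to one row per original
# index. max() keeps the result binary even if a category repeats in a row.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(dummies)
```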
Let's say we have the following df:
id A B C D
123 1 1 0 0
456 0 1 1 0
786 1 0 0 0
The id column represents a unique client.
Columns A, B, C, and D represent a product. These columns' values are binary.
1 means the client has that product.
0 means the client doesn't have that product.
I want to create a matrix table of sorts that counts the number of combinations of products that exist for all users.
This would be the desired output, given the df provided above:
A B C D
A 2 1 0 0
B 0 2 1 0
C 0 1 1 0
D 0 0 1 0
import pandas as pd
df = pd.read_fwf('table.dat', infer_nrows=1001)
cols = ['A', 'B', 'C', 'D']
df2 = df[cols]
df2.T.dot(df2)
Result:
A B C D
A 2 1 0 0
B 1 2 1 0
C 0 1 1 0
D 0 0 0 0
I think you want a dot product:
df2 = df.set_index('id')
out = df2.T.dot(df2)
Output:
A B C D
A 2 1 0 0
B 1 2 1 0
C 0 1 1 0
D 0 0 0 0
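For reference, here is a self-contained sketch of the same dot product, with the question's table built inline rather than read from a file (the names products and co_occurrence are just illustrative). Entry (i, j) of the result counts the clients that hold both product i and product j, so the diagonal holds the total number of clients per product.

```python
import pandas as pd

# The example table from the question, built inline.
df = pd.DataFrame({
    'id': [123, 456, 786],
    'A': [1, 0, 1],
    'B': [1, 1, 0],
    'C': [0, 1, 0],
    'D': [0, 0, 0],
})

products = df.set_index('id')

# (products x clients) dot (clients x products) -> products x products
co_occurrence = products.T.dot(products)
print(co_occurrence)
```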
I have this data for example:
A B C Class_label
0 1 1 B_C
1 1 1 A_B_C
0 0 1 C
How do I build this class label column, count the ones in each row, and display both in a pandas DataFrame?
Use DataFrame.assign to add the new columns: DataFrame.dot with the column names builds the labels, and sum counts the 1s. Both operate only on the numeric columns, selected with DataFrame.select_dtypes:
import numpy as np
df1 = df.select_dtypes(np.number)
df = df.assign(classifiedlabel=df1.dot(df1.columns + '_').str[:-1],
               countones=df1.sum(axis=1))
print(df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
If the classifiedlabel column already exists, the simplest option is to sum only the numeric columns:
df["countones"] = df.sum(axis=1, numeric_only=True)
print(df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
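If the dot trick for building the labels feels opaque, a more explicit row-wise join should give the same labels. This is a sketch, not the answerer's method: it assumes every column is a 0/1 indicator, and label_row is a hypothetical helper name. It will likely be slower than dot on large frames.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 1, 0, 1], [0, 0, 1, 0], [0, 1, 1, 0]],
    columns=list('ABCD'),
)

def label_row(row):
    # Join the names of the columns whose value is 1 in this row.
    return '_'.join(col for col, val in row.items() if val == 1)

labels = df.apply(label_row, axis=1)
counts = df.sum(axis=1)

result = df.assign(classifiedlabel=labels, countones=counts)
print(result)
```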
If the values are 1/0, you can use:
(
    df.assign(
        count=df.select_dtypes('number').sum(axis=1)
    )
)
Output:
A B C D classifiedlabel count
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
Try:
df["number_of_ones"] = (df == 1).astype(int).sum(axis=1)
print(df)
A B C D classifiedlabel number_of_ones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
I have a dataset that looks like this:
df = pd.DataFrame(data= [[0,0,1],[1,0,0],[0,1,0]], columns = ['A','B','C'])
A B C
0 0 0 1
1 1 0 0
2 0 1 0
I want to create a new column that holds, for each row, the name of the column that contains the 1:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Use dot:
df['value'] = df.values.dot(df.columns)
Output:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Using pd.DataFrame.idxmax:
df['value'] = df.idxmax(1)
print(df)
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
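One thing to watch with idxmax: for a row with no 1 at all, it still returns the first column, because every value ties for the maximum. The sketch below shows one possible guard, assuming such all-zero rows should become missing values; the third row of the sample data is changed from the question to show the case.

```python
import pandas as pd

# Same layout as the question, but the last row contains no 1.
df = pd.DataFrame([[0, 0, 1], [1, 0, 0], [0, 0, 0]], columns=['A', 'B', 'C'])

# True for rows that contain at least one 1.
has_one = df.eq(1).any(axis=1)

# Keep the idxmax label only where a 1 exists; other rows become NaN.
value = df.idxmax(axis=1).where(has_one)
print(value)
```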
Suppose now I have the following pandas data frame:
id text
1 A B C
2 B D
3 A D
And I want to get the following result:
id A B C D
1 1 1 1 0
2 0 1 0 1
3 1 0 0 1
I don't know how to describe this transformation. It looks like one-hot encoding, but I think it is something different.
Does anyone know how to do this transformation, and what it is called?
Something like str.get_dummies:
pd.concat([df['id'], df.text.str.get_dummies(sep=' ')], axis=1)
Out[249]:
id A B C D
0 1 1 1 1 0
1 2 0 1 0 1
2 3 1 0 0 1
One way is via str.get_dummies, after converting the spaces to its default '|' separator:
df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['A B C', 'B D', 'A D']})
df['text'] = df['text'].str.split(' ').str.join('|')
df = df.join(df['text'].str.get_dummies()).drop(columns='text')
# id A B C D
# 0 1 1 1 1 0
# 1 2 0 1 0 1
# 2 3 1 0 0 1
Example
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> s
0 a
1 b
2 c
3 a
dtype: object
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Now I would like to map a and b to a dummy variable, but nothing else. How can I do that?
What I tried
>>> pd.get_dummies(s, columns=['a', 'b'])
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
A simpler method is to select just the columns of interest from the resulting df:
In[16]:
pd.get_dummies(s)[list('ab')]
Out[16]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
This sub-selects the columns of interest from the resulting dummies df.
If you don't want to compute dummy columns for values you are not interested in, you can filter the rows first. The result then has to be reindexed with a fill_value to restore the dropped rows (thanks to @jezrael for the suggestion):
In[20]:
pd.get_dummies(s[s.isin(list('ab'))]).reindex(s.index, fill_value=0)
Out[20]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Setting everything else to nan is one option:
s[~((s == 'a') | (s == 'b'))] = float('nan')
which yields:
>>> pd.get_dummies(s)
a b
0 1 0
1 0 1
2 0 0
3 1 0
Another way
In [3907]: pd.DataFrame({c:s.eq(c).astype(int) for c in ['a', 'b']})
Out[3907]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Or use (s == c).astype(int) in place of s.eq(c).astype(int).
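Another option that may be worth considering is to declare the wanted values as the only categories before calling get_dummies. This is a sketch rather than one of the answers above: values outside the category list become NaN on conversion, so their rows come out as all zeros, and the original Series is left unmodified. The name wanted is just illustrative.

```python
import pandas as pd

s = pd.Series(list('abca'))

# Only 'a' and 'b' are valid categories; 'c' becomes NaN on conversion.
wanted = pd.CategoricalDtype(categories=['a', 'b'])

# get_dummies emits one column per declared category and ignores NaN.
dummies = pd.get_dummies(s.astype(wanted), dtype=int)
print(dummies)
```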