How do I convert text into multiple columns in Python?

Suppose now I have the following pandas data frame:
id  text
1   A B C
2   B D
3   A D
And I want to get the following result:
id  A  B  C  D
1   1  1  1  0
2   0  1  0  1
3   1  0  0  1
I don't know how to describe this transformation; it looks like one-hot encoding, but they should be different.
Does anyone know how to do this transformation, and what such a transformation is called?

Something like str.get_dummies:
pd.concat([df['id'], df.text.str.get_dummies(sep=' ')], axis=1)
Out[249]:
id A B C D
0 1 1 1 1 0
1 2 0 1 0 1
2 3 1 0 0 1

One way is via pd.get_dummies:
df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['A B C', 'B D', 'A D']})
df['text'] = df['text'].str.split(' ').str.join('|')
df = df.join(df['text'].str.get_dummies()).drop(columns='text')
#    id  A  B  C  D
# 0   1  1  1  1  0
# 1   2  0  1  0  1
# 2   3  1  0  0  1
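Both answers boil down to one-hot encoding a delimited string (sometimes called multi-label binarization). As a sketch of a third route, not taken from the answers above, you can also explode the split tokens and cross-tabulate them:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'text': ['A B C', 'B D', 'A D']})

# Split each string into tokens and give every token its own row,
# then count (id, token) pairs with a crosstab.
tokens = df.set_index('id')['text'].str.split().explode()
result = pd.crosstab(tokens.index, tokens).rename_axis(index='id', columns=None)
print(result.reset_index())
```

This scales to duplicated tokens as well (the crosstab would then hold counts greater than 1 rather than pure 0/1 indicators).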

Related

How do you count the common 1s in a pandas data frame?

I have this data, for example:
A  B  C  Class_label
0  1  1  B_C
1  1  1  A_B_C
0  0  1  C
How do you obtain the classified-label column, and how do you count the common 1s and display that as well, using a pandas DataFrame?
Use DataFrame.assign to add the new columns: DataFrame.dot with the column names builds the labels and sum counts the 1s, with only the numeric columns selected by DataFrame.select_dtypes:
df1 = df.select_dtypes(np.number)
df = df.assign(classifiedlabel = df1.dot(df1.columns + '_').str[:-1],
               countones = df1.sum(axis=1))
print (df)
   A  B  C  D classifiedlabel  countones
0  0  1  0  1             B_D          2
1  1  1  0  1           A_B_D          3
2  0  0  1  0               C          1
3  0  1  1  0             B_C          2
If the classifiedlabel column already exists, the simplest approach is sum alone, restricted to the numeric columns:
df["countones"] = df.sum(axis=1, numeric_only=True)
print (df)
   A  B  C  D classifiedlabel  countones
0  0  1  0  1             B_D          2
1  1  1  0  1           A_B_D          3
2  0  0  1  0               C          1
3  0  1  1  0             B_C          2
If the values are 1/0, you can use:
df.assign(count=df.select_dtypes('number').sum(axis=1))
Output:
   A  B  C  D classifiedlabel  count
0  0  1  0  1             B_D      2
1  1  1  0  1           A_B_D      3
2  0  0  1  0               C      1
3  0  1  1  0             B_C      2
Try:
df["number_of_ones"] = (df == 1).astype(int).sum(axis=1)
print(df)
   A  B  C  D classifiedlabel  number_of_ones
0  0  1  0  1             B_D               2
1  1  1  0  1           A_B_D               3
2  0  0  1  0               C               1
3  0  1  1  0             B_C               2
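The dot trick in the first answer works because multiplying a Python string by 0 or 1 yields '' or the string itself, so the matrix product concatenates the names of exactly the 1-columns. A self-contained sketch, with the example data spelled out (assumed from the output above):

```python
import numpy as np
import pandas as pd

# Example 0/1 frame matching the output shown above.
df = pd.DataFrame({'A': [0, 1, 0, 0], 'B': [1, 1, 0, 1],
                   'C': [0, 0, 1, 1], 'D': [1, 1, 0, 0]})

# 0 * 'A_' == '' and 1 * 'A_' == 'A_', so dot() against the column
# names joins the names of every column holding a 1; the trailing
# separator is trimmed with .str[:-1], and sum(axis=1) counts the 1s.
num = df.select_dtypes(np.number)
df = df.assign(classifiedlabel=num.dot(num.columns + '_').str[:-1],
               countones=num.sum(axis=1))
print(df)
```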

Add column from other data frame based on condition

I have two data frames:
df1 =
ID Num
a 0
b 0
c 1
d 1
And a second data frame:
df =
ID
a
a
b
b
c
c
d
I want to add a Num column to df with the following rule:
if a is 0 in df1, then every a in df should be 0, and so on.
Desired output:
df =
ID Num
a 0
a 0
b 0
b 0
c 1
c 1
d 1
I did it with an if condition, but that came out long and hard-coded.
Try this:
nummap = df1.set_index('ID').to_dict()['Num']
df['Num'] = df['ID'].map(nummap)
Output:
In [387]: df
Out[387]:
ID Num
0 a 0
1 a 0
2 b 0
3 b 0
4 c 1
5 c 1
6 d 1
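A small note on the answer above (a sketch, not part of the original answer): Series.map also accepts a Series directly, so the intermediate dict is optional; the ID-indexed Num column itself serves as the lookup table.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': list('abcd'), 'Num': [0, 0, 1, 1]})
df = pd.DataFrame({'ID': list('aabbccd')})

# map() with an ID-indexed Series performs the same lookup as the
# dict, without the to_dict() round trip.
df['Num'] = df['ID'].map(df1.set_index('ID')['Num'])
print(df)
```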
Let us try merge:
df=df.merge(df1)
ID Num
0 a 0
1 a 0
2 b 0
3 b 0
4 c 1
5 c 1
6 d 1
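One caveat about the merge answer: a bare df.merge(df1) is an inner join, so any ID in df that is missing from df1 would be dropped silently. A sketch of a defensive variant, under the same example data:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': list('abcd'), 'Num': [0, 0, 1, 1]})
df = pd.DataFrame({'ID': list('aabbccd')})

# how='left' keeps every row of df (and its order) even if an ID
# has no match in df1; unmatched lookups become NaN instead of
# disappearing from the result.
out = df.merge(df1, on='ID', how='left')
print(out)
```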

Cross addition in pandas

How do I apply a cross addition (OR) to my pandas dataframe, as below?
Input:
A B C D
0 0 1 0 1
Output:
A B C D
0 0 1 0 1
1 1 1 1 1
2 0 1 0 1
3 1 1 1 1
So far I can achieve this using:
cols = df.columns
n = len(cols)
df1 = pd.concat([df]*n, ignore_index=True).eq(1)
df2 = pd.concat([df.T]*n, axis=1, ignore_index=True).eq(1)
df2.columns = cols
df2 = df2.reset_index(drop=True)
print((df1 | df2).astype(int))
I think there is a much simpler way to handle this case.
You can use numpy's | operator, which broadcasts:
data = df.values
df = pd.DataFrame((data.T | data), columns=df.columns)
Or use np.logical_or:
df = pd.DataFrame(np.logical_or(data,data.T).astype(int), columns=df.columns)
print(df)
A B C D
0 0 1 0 1
1 1 1 1 1
2 0 1 0 1
3 1 1 1 1
Numpy solution:
First extract the first row into a 1-D array with iloc, then broadcast with a[:, None] to change its shape to Nx1:
a = df.iloc[0].values
df = pd.DataFrame(a | a[:, None], columns=df.columns)
print (df)
A B C D
0 0 1 0 1
1 1 1 1 1
2 0 1 0 1
3 1 1 1 1
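The same outer OR can also be written with numpy's ufunc .outer method, which makes the NxN intent explicit. A sketch under the same example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1]], columns=list('ABCD'))

# bitwise_or.outer ORs every element of the row against every other,
# producing the full NxN grid in a single call (no manual reshaping).
a = df.iloc[0].to_numpy()
out = pd.DataFrame(np.bitwise_or.outer(a, a), columns=df.columns)
print(out)
```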

How to categorize two categories in one dataframe in Pandas

I have one DataFrame with two categorical columns and 150 categories. A value in column A may not appear in column B. For example:
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
a['A'] = a['A'].astype('category')
a['B'] = a['B'].astype('category')
The output is
Out[217]:
A B
0 b c
1 b c
2 a c
3 b a
4 a a
And also
cat_columns = a.select_dtypes(['category']).columns
a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
a
The output is
Out[220]:
A B
0 1 1
1 1 1
2 0 1
3 1 0
4 0 0
My problem is that in column A, b is coded as 1, but in column B, c is coded as 1. However, I want something like this:
Out[220]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
where 2 stands for c.
Please note that I have 150 different labels.
Using pd.Categorical() you can specify a list of categories:
In [44]: cats = a[['A','B']].stack().sort_values().unique()
In [45]: cats
Out[45]: array(['a', 'b', 'c'], dtype=object)
In [46]: a['A'] = pd.Categorical(a['A'], categories=cats)
In [47]: a['B'] = pd.Categorical(a['B'], categories=cats)
In [48]: a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
In [49]: a
Out[49]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
We can use pd.factorize all at once:
pd.DataFrame(
    pd.factorize(a.values.ravel())[0].reshape(a.shape),
    a.index, a.columns
)
A B
0 0 1
1 0 1
2 2 1
3 0 2
4 2 2
Or, if you want to factorize by sorted category value, use the sort=True argument:
pd.DataFrame(
    pd.factorize(a.values.ravel(), True)[0].reshape(a.shape),
    a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
Or equivalently with np.unique:
pd.DataFrame(
    np.unique(a.values.ravel(), return_inverse=True)[1].reshape(a.shape),
    a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
If you are only interested in converting to categorical codes and being able to access the mapping via a dictionary, pd.factorize may be more convenient.
Algorithm for getting unique values across columns via @AlexRiley:
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
fact = dict(zip(*pd.factorize(pd.unique(a[['A', 'B']].values.ravel('K')))[::-1]))
b = a.applymap(fact.get)
Result:
A B
0 0 2
1 0 2
2 1 2
3 0 1
4 1 1
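A variant of the pd.Categorical answer (a sketch of the same idea): build one shared CategoricalDtype up front, so both columns are guaranteed to map identical labels to identical codes.

```python
import pandas as pd

a = pd.DataFrame({'A': list('bbaba'), 'B': list('cccaa')})

# A single CategoricalDtype applied to both columns pins the
# category order, so 'a' -> 0, 'b' -> 1, 'c' -> 2 everywhere.
shared = pd.CategoricalDtype(sorted(set(a['A']) | set(a['B'])))
codes = a.astype(shared).apply(lambda s: s.cat.codes)
print(codes)
```

With 150 categories the same two lines apply unchanged; only the union of observed labels grows.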

Pandas: Turn multiple variables into a single set of dummy variables

I have a column with categories (A, B, C, D) I want to turn into dummy variables. Problem is, this column can contain multiple categories per row, like this:
DF = pd.DataFrame({'Col':['A', 'A, B', 'A, C', 'B, C, D', 'D']})
Col
0 A
1 A, B
2 A, C
3 B, C, D
4 D
My thought at this point is to first split the variable into multiple fields using ',' as the delimiter, then dummy-code the results. Something like this:
DF2 = DF['Col'].str.split(', ', expand = True)
0 1 2
0 A None None
1 A B None
2 A C None
3 B C D
4 D None None
pd.get_dummies(DF2)
0_A 0_B 0_D 1_B 1_C 2_D
0 1 0 0 0 0 0
1 1 0 0 1 0 0
2 1 0 0 0 1 0
3 0 1 0 0 1 1
4 0 0 1 0 0 0
Finally, run some sort of loop across the columns to build a single set of dummy variables for A, B, C, and D. This can work, but it gets quite tedious with many more variables/categories. Is there an easier way to achieve this?
The simplest way is:
DF.Col.str.get_dummies(', ')
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
Slightly more complicated:
from sklearn.preprocessing import MultiLabelBinarizer
from numpy.core.defchararray import split
mlb = MultiLabelBinarizer()
s = DF.Col.values.astype(str)
d = mlb.fit_transform(split(s, ', '))
pd.DataFrame(d, columns=mlb.classes_)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
By using pd.crosstab:
import pandas as pd
df = pd.DataFrame({'Col': ['A', 'A,B', 'A,C', 'B,C,D', 'D']})
df.Col = df.Col.str.split(',')
df1 = df.Col.apply(pd.Series).stack()
pd.crosstab(df1.index.get_level_values(0), df1)
Out[893]:
col_0 A B C D
row_0
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
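On recent pandas versions, the apply(pd.Series).stack() step in the crosstab answer can be replaced by Series.explode (available since pandas 0.25). A sketch combining it with the same crosstab idea:

```python
import pandas as pd

DF = pd.DataFrame({'Col': ['A', 'A, B', 'A, C', 'B, C, D', 'D']})

# explode() yields one row per category, keeping the original row
# label as the index; crosstab then counts (row, category) pairs.
s = DF['Col'].str.split(', ').explode()
dummies = pd.crosstab(s.index, s).rename_axis(index=None, columns=None)
print(dummies)
```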
