Converting pandas column of comma-separated strings into dummy variables - python

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1

Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
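On the null values mentioned in the question's edit: as far as I can tell, str.get_dummies treats a missing entry as having no labels, so that row comes out as all zeros. A minimal sketch, where the Series `col` is made up for illustration and stands in for the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's column, with one missing value
col = pd.Series(['a', 'a,b,c', np.nan, 'd', 'c,d'])

# Split on commas and one-hot encode; the NaN row gets no labels
dummies = col.str.get_dummies(sep=',')
print(dummies)
```

If missing rows should instead be flagged explicitly, one option is to fill them with a placeholder label (for example col.fillna('missing')) before calling get_dummies.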
Edit: Updating the answer to address some questions.
Qn 1: Why does the series method get_dummies not accept the argument prefix=... while pandas.get_dummies() does?
Series.str.get_dummies is a series level method (as the name suggests!). We are one hot encoding values in one Series (or a DataFrame column) and hence there is no need to use prefix. pandas.get_dummies on the other hand can one hot encode multiple columns, in which case the prefix parameter works as an identifier of the original column.
If you want to apply prefix to str.get_dummies, you can always use DataFrame.add_prefix
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use pd.concat to merge the one hot encoded columns with the rest of the columns in the dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1

The str.get_dummies function does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
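The prefixing and the merge back into the original frame can be combined. A minimal sketch, reusing the example frame from the answer above; the names `dummies` and `result` are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'other': ['x', 'y', 'x', 'x', 'q'],
                   'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d']})

# One-hot encode the comma-separated column and tag each dummy with its source
dummies = df['col'].str.get_dummies(sep=',').add_prefix('col_')

# Drop the original column and attach the dummies by index
result = df.drop(columns='col').join(dummies)
print(result)
```

join aligns on the index, so this assumes the dummies keep the same index as df, which they do when built directly from df['col'].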

Related

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the character at index 5 of each column name (its 6th character, which is the numeric part), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However when I replace it with column name, the above code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
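Both multiplications rely on the multiplier lining up with the columns. A minimal sketch that makes the alignment explicit by building a Series keyed on the column names; `suffix` and `result` are illustrative names, and the approach assumes every column name ends in ".<integer>":

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# Map each column name to the integer after its last dot
suffix = pd.Series({c: int(c.rpartition('.')[-1]) for c in df.columns})

# Multiply column-wise; the Series index is matched against the column labels
result = df.mul(suffix, axis='columns')
print(result)
```

Because the match is by label rather than by position, this keeps working if the columns are reordered.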
You could use string multiplication on a boolean array to place the strings based on the condition, then use where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))

pandas.DataFrame and pandas.Series objects act differently for pandas.get_dummies()

I have a dataframe by the name train with a column 'quality'.
>>>train['quality'].unique()
array([5, 6, 7, 4, 8, 3], dtype=int64)
Now get_dummies with train[['quality']] gives
>>>pd.get_dummies(train[['quality']]).head()
quality
0 5
1 5
2 5
3 6
4 5
but with train['quality']
>>>pd.get_dummies(train['quality']).head()
3 4 5 6 7 8
0 0 0 1 0 0 0
1 0 0 1 0 0 0
2 0 0 1 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
The data types of train['quality'] and train[['quality']] are:
>>>print(type(train['quality']))
<class 'pandas.core.series.Series'>
>>>print(type(train[['quality']]))
<class 'pandas.core.frame.DataFrame'>
The get_dummies() doc states: data : array-like, Series, or DataFrame
So if I can pass either a Series or a DataFrame, why are the outputs different?
The pd.get_dummies documentation makes this pretty clear:
columns : list-like, default None
Column names in the DataFrame to be
encoded. If columns is None then all the columns with object or
category dtype will be converted.
So, the solution is to either specify a columns parameter, thus overriding the requirement for the column to be categorical/object type to begin with,
pd.get_dummies(df, columns=['quality'])
quality_5 quality_6
0 1 0
1 1 0
2 1 0
3 0 1
4 1 0
Or, convert the column to categorical.
pd.get_dummies(df[['quality']].astype('category'))
quality_5 quality_6
0 1 0
1 1 0
2 1 0
3 0 1
4 1 0
Data needs to be converted to categorical types for get_dummies to work. If a series is passed in, the conversion occurs automatically. As outlined in the documentation and by coldspeed, if a DataFrame is passed in, all object or category dtypes (series of these datatypes) are converted to categorical and will result in dummy columns. For example:
pandas.get_dummies(pandas.DataFrame(list("abcdabcd")))
0_a 0_b 0_c 0_d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
6 0 0 1 0
7 0 0 0 1
This works because the list of strings becomes a column of strings which are objects.
Perhaps a bit unintuitively, your integer-type column is not of type "object" and thus is not converted into categorical, so dummy columns are not returned and the original DataFrame is returned. Numerical types in pandas are distinct from objects. You can work around this by simply passing df[["quality"]].astype("category") since this will force the conversion of your integer column to categorical which will then return dummy columns.
EDIT: To expand a bit, one has to keep in mind that dummy variables are a construct for regression (or extensions of regression). If a Dataframe contains dtypes that are both numeric and objects, more often than not, the numeric types are intended to be used directly as inputs for the model. However, object types have no value in regression unless converted to dummy variables. Thus if someone were to pass to get_dummies a DataFrame with three numeric types and one object type, the one object type would be converted to a dummy variable. This is only the default behaviour if the columns parameter is left unspecified. The columns parameter exists in the case that the default behaviour does not suit your needs e.g. you do not want all object/categorical dtype columns converted, or you want a column of numeric dtype converted.
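The difference can be reproduced on a tiny frame. A minimal sketch, where `train` is a made-up five-row stand-in for the asker's data:

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame: one integer column
train = pd.DataFrame({'quality': [5, 5, 5, 6, 5]})

# DataFrame input, no columns argument: integer columns are left untouched
as_frame = pd.get_dummies(train[['quality']])

# Series input: always encoded, one column per distinct value
as_series = pd.get_dummies(train['quality'])

# DataFrame input with columns given: the named column is encoded regardless of dtype
forced = pd.get_dummies(train, columns=['quality'])

print(list(as_frame.columns), list(as_series.columns), list(forced.columns))
```

Depending on the pandas version, the dummy values may be booleans rather than 0/1 integers; the column layout is the same either way.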

Mode of each digit in binary strings of a pandas column

I would like to find the mode value of each digit in binary strings of a pandas column. Suppose I have the following data
df = pd.DataFrame({'categories':['A','B','C'],'values':['001','110','111']})
so my data look like this
categories values
A 001
B 110
C 111
If we consider the column "values" at the first digit (0, 1, 1) of A, B, and C respectively, the mode value is 1. If we do the same for other digits, my expected output should be 111.
I can find the mode of a particular column, so I could split each bit into a new column, find the mode of each, and concatenate the results to get the expected output. However, when the data has many more columns of binary strings, I'm not sure this is still a good approach. I'm looking for a more elegant way to do this. Any suggestions?
I think you can use apply with Series and list to convert the digits to columns, and then call mode:
print (df['values'].apply(lambda x: pd.Series(list(x))))
0 1 2
0 0 0 1
1 1 1 0
2 1 1 1
df1 = df['values'].apply(lambda x: pd.Series(list(x))).mode()
print (df1)
0 1 2
0 1 1 1
Finally, select the first row, convert it to a list and join:
print (''.join(df1.iloc[0].tolist()))
111
Another possible solution with list comprehension:
df = pd.DataFrame([list(x) for x in df['values']])
print (df)
0 1 2
0 0 0 1
1 1 1 0
2 1 1 1
If mode returns more than one row (because of ties), you can join each row with apply:
df = pd.DataFrame({'categories':['A','B','C', 'D'],'values':['001','110','111', '000']})
print (df)
categories values
0 A 001
1 B 110
2 C 111
3 D 000
print (pd.DataFrame([list(x) for x in df['values']]).mode())
0 1 2
0 0 0 0
1 1 1 1
df1 = pd.DataFrame([list(x) for x in df['values']]).mode().apply(''.join, axis=1)
print (df1)
0 000
1 111
dtype: object
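The steps above can be wrapped into a small helper. A minimal sketch; `digitwise_mode` is a hypothetical name, and it assumes all strings have the same length and that, on a tie, taking the first row returned by mode is acceptable:

```python
import pandas as pd

def digitwise_mode(values):
    """Return the per-position mode of equal-length digit strings."""
    # One column per character position
    digits = pd.DataFrame([list(v) for v in values])
    # mode() yields one row per tied mode, sorted; keep the first row
    return ''.join(digits.mode().iloc[0])

print(digitwise_mode(pd.Series(['001', '110', '111'])))
```

With the four-row example above ('001', '110', '111', '000') every position is tied, so this helper would return the first of the two rows, '000'.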

Copy pandas DataFrame row to multiple other rows

A simple and practical question, yet I can't find a solution.
The questions I looked at were the following:
Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame
The key difference between those and mine is that I need to insert not a single value, but a whole row.
My problem is this: I pick a row from a dataframe, say df1, so I have a Series.
I also have another dataframe, df2, from which I have selected multiple rows according to some criterion, and I want to replicate that Series into all of those rows.
df1:
Index/Col A B C
1 0 0 0
2 0 0 0
3 1 2 3
4 0 0 0
df2:
Index/Col A B C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
What I want to accomplish is inserting df1[3] into the rows df2[2] and df2[3], for example. So something like this non-working code:
series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series
returning
df2:
Index/Col A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
Use loc and pass a list of the index labels of interest; the : after the comma indicates that we want to set all column values. Then assign the row, but call the attribute .values on it so that it is a numpy array. Otherwise you get a ValueError from the shape mismatch, since you are overwriting 2 rows with a single row, and a Series will not align the way you want:
In [76]:
df2.loc[[2,3],:] = df1.loc[3].values
df2
Out[76]:
A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
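For a self-contained version, here is a minimal sketch that rebuilds the two frames from the question and selects the target rows with a boolean condition; it uses & instead of the and from the non-working code, and `rows` is an illustrative name:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 0], 'B': [0, 0, 2, 0], 'C': [0, 0, 3, 0]},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame(0, index=[1, 2, 3, 4], columns=['A', 'B', 'C'])

# Element-wise conditions must be combined with &, not the Python keyword and
rows = df2.index[(df2.index >= 2) & (df2.index <= 3)]

# Assign the source row as a plain array so it is broadcast to every selected row
df2.loc[rows, :] = df1.loc[3].to_numpy()
print(df2)
```

to_numpy() is used here in place of .values; for a row of plain integers the two return the same array.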
If you need to copy certain rows and columns from one dataframe into another, slice with loc:
df2 = df.loc[x:y, a:b]  # x and y are the row bounds, a and b are the column bounds to select

Pandas - Get dummies for only certain values

I have a Pandas series of 10000 rows, each populated with a single letter from A to Z.
However, I want to create dummy data frames for only A, B, and C, using Pandas get_dummies.
How do I go about doing that?
I don't want to get dummies for all the row values in the column and then select the specific columns, as the column contains other redundant data which eventually causes a Memory Error.
Try this:
# create mock dataframe
df = pd.DataFrame( {'alpha':['a','a','b','b','c','e','f','g']})
# use replace with a regex to set characters d-z to None
pd.get_dummies(df.replace({'[^a-c]':None},regex =True))
output:
alpha_a alpha_b alpha_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 0
6 0 0 0
7 0 0 0
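An alternative that avoids the regex is to blank out the unwanted values and fix the set of categories up front, so that only the wanted dummy columns are ever created. A minimal sketch under those assumptions; `keep`, `filtered` and `dummies` are illustrative names:

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g'], name='alpha')
keep = ['a', 'b', 'c']

# Values outside `keep` become NaN; the fixed categories guarantee one
# dummy column per wanted value even if it never occurs in the data
filtered = s.where(s.isin(keep)).astype(pd.CategoricalDtype(categories=keep))

# NaN rows get all zeros because dummy_na defaults to False
dummies = pd.get_dummies(filtered, prefix='alpha')
print(dummies)
```

Since columns for the other letters are never built, this should keep memory use well below encoding everything and then selecting columns, though I have not measured it on the 10000-row case.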
