pandas: convert a list in a column into separate columns - python

The DataFrame df has a column of lists:
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to one-hot encode that column:
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
columns=mlb.classes_,
index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted each list into a simple comma-separated string with the following code:
df['data_words_Joined'] = df.data_words.apply(','.join)
which turns the DataFrame into the following:
id data_words
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But this turns each whole string into a single column name instead of giving every word its own column:
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1

You can use explode (available from pandas 0.25) followed by pivot_table:
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
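A slightly shorter variant of the same idea, offered only as a sketch: after explode, pd.crosstab counts the (id, word) pairs directly. The reset_index call and the variable names df_e and one_hot are my own choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'data_words': [['salt', 'major', 'lab', 'water'],
                                  ['lab', 'plays', 'critical', 'salt'],
                                  ['water', 'success', 'major']]})

# One row per (id, word) pair; reset the index so it is unique again.
df_e = df.explode('data_words').reset_index(drop=True)

# Count how often each word occurs for each id (0 or 1 for this data).
one_hot = pd.crosstab(df_e['id'], df_e['data_words'])
print(one_hot)
```

If a word can appear more than once in a list, the cells hold counts rather than 0/1 flags.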

One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2 and borrowed input data from Celius Stingher's Answer.
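Regarding the ValueError in Attempt 1: it presumably means the real DataFrame already has a column named 'class' and one of the words is also 'class', so join sees two columns with the same name. A minimal sketch of one workaround is to prefix the new columns. The sample frame and the 'word_' prefix below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: an existing 'class' column plus a word list containing 'class'.
df = pd.DataFrame({
    'id': [1, 2],
    'class': ['a', 'b'],
    'data_words': [['salt', 'class'], ['lab', 'salt']],
})

dummies = df['data_words'].apply(','.join).str.get_dummies(sep=',')

# df.join(dummies) would raise "columns overlap but no suffix specified" here,
# because both frames contain a column called 'class'.
result = df.drop(columns='data_words').join(dummies.add_prefix('word_'))
print(result)
```

Passing lsuffix or rsuffix to join would also avoid the error, but a prefix keeps the one-hot columns easy to tell apart.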

Related

How to create by default two columns for every features (One Hot Encoding)?

My feature engineering runs over different documents. For some documents certain features do not exist, so the corresponding sublist contains only one repeated value, such as the third sublist [0,0,0,0,0]. One-hot encoding this sublist yields only one column, while the feature lists of other documents are transformed into two columns. Is there a way to tell the encoder to create two columns even if the list contains only a single value, and to insert the missing column in the right place? The main problem is that my feature DataFrames for different documents end up with different numbers of columns, which makes them not comparable.
import pandas as pd
feature = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]
df = pd.DataFrame(feature[0])
df_features_final = pd.get_dummies(df[0])
for feature in feature[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis=1, join='inner')
print(df_features_final)
The result is the following DataFrame. As the column titles show, the fifth column (a 0) is not followed by a 1 column:
0 1 0 1 0 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 1 0 1 0 1 1 0
1 1 0 0 1 1 1 0 0 1 1 0 0 1
2 0 1 0 1 1 0 1 1 0 0 1 1 0
3 1 0 1 0 1 0 1 0 1 0 1 1 0
4 1 0 0 1 1 0 1 0 1 0 1 1 0
I am not aware of this functionality in pandas, at least. But TensorFlow has:
tf.one_hot(
indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
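If staying in pandas is preferred, one possible sketch is to reindex each encoded block to a fixed set of columns, so an all-zero feature still gets an (empty) column for 1. The names features and encoded and the fixed category list [0, 1] are assumptions for this example.

```python
import pandas as pd

features = [[0, 0, 1, 0, 0], [1, 1, 1, 0, 1], [0, 0, 0, 0, 0], [1, 0, 1, 1, 1]]

# Force every block to have exactly the columns 0 and 1, in that order.
# A category that never occurs becomes a column filled with 0.
encoded = [
    pd.get_dummies(pd.Series(row)).reindex(columns=[0, 1], fill_value=0).astype(int)
    for row in features
]

df_features_final = pd.concat(encoded, axis=1)
print(df_features_final)
```

Every feature now contributes two columns, so frames built from different documents have the same width.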

Crosstab in Pandas?

I have a DataFrame in pandas like this:
df = pd.DataFrame({"price_range": [0,1,2,3,0,2], "blue":[0,0,1,0,1,1], "four_g":[0,0,0,1,0,1]})
I have a line like this: pd.crosstab(df['price_range'], df["blue"])
This shows how many "blue" 0s and 1s there are for each "price_range", but I want to expand the code to also show how many "four_g" 0s and 1s there are for each "price_range". How can I do that?
One way is to use 'melt':
df_out = df.melt('price_range')
pd.crosstab(df_out['price_range'], df_out['variable'], df_out['value'], aggfunc='sum')
Output:
variable blue four_g
price_range
0 1 0
1 0 0
2 2 1
3 0 1
Another way is to use groupby:
df.groupby('price_range')[['blue','four_g']].sum()
Output:
blue four_g
price_range
0 1 0
1 0 0
2 2 1
3 0 1
The simplest way is to run crosstab once per column in a list comprehension and combine the results with concat:
cols = ['blue', 'four_g']
df_out = pd.concat([pd.crosstab(df['price_range'], df[col])
                    for col in cols], keys=cols, axis=1)
Out[1116]:
blue four_g
blue 0 1 0 1
price_range
0 1 1 2 0
1 1 0 1 0
2 0 2 1 1
3 1 0 0 1
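For completeness, melt and crosstab can also be combined to get the 0/1 counts for every column in one call, by passing two column keys to crosstab. This is only a sketch; long_df and counts are names I picked.

```python
import pandas as pd

df = pd.DataFrame({"price_range": [0, 1, 2, 3, 0, 2],
                   "blue": [0, 0, 1, 0, 1, 1],
                   "four_g": [0, 0, 0, 1, 0, 1]})

# Long format: one row per (price_range, variable, value).
long_df = df.melt('price_range')

# Two-level columns: (variable, value) -> number of rows per price_range.
counts = pd.crosstab(long_df['price_range'],
                     [long_df['variable'], long_df['value']])
print(counts)
```

The result has the same shape as the concat version, with (variable, value) as the two column levels.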

creating conditions on np.where in Pandas based on value in current column

I have a dataframe in Pandas (subset below).
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1 IN & 200D_MA both =1, results 1
11/4/2013 0 1 1 PREVIOUS TEST ROW =1 & 200DM_A = 1, TEST ans=1
11/5/2013 0 1 1 PREVIOUS TEST ROW =1 & 200DM_A = 1, TEST ans=1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0 PREVIOUS TEST ROW =1 & 200DM_A = 0, TEST ans=0
This is easy to do in Excel, so I thought it would be easy to do in Python. I have this code using nested np.where calls:
df3['TEST'] = np.where( (df3['IN'] == 1) & (df3['200D_MA'] == 1),1,\
np.where( (df3['TEST'].shift(-1) == 1)\
& (df3['200D_MA'] == 1),1,0))
but it throws KeyError: 'IN', presumably because I am using a condition on a column that has not been created yet. Can anyone help me figure out how to do this?
It seems like you need a conditional ffill:
df['TEST']=df.loc[df.IN==1,'IN']
df.loc[df['200D_MA']==1,'TEST']=df.loc[df['200D_MA']==1,'TEST'].ffill()
df.fillna(0,inplace=True)
df.TEST=df.TEST.astype(int)
df
Out[349]:
DATE IN 200D_MA TEST
0 10/30/2013 0 1 0
1 10/31/2013 0 1 0
2 11/1/2013 1 1 1
3 11/4/2013 0 1 1
4 11/5/2013 0 1 1
5 11/6/2013 0 1 1
6 11/7/2013 0 1 1
7 11/8/2013 0 1 1
8 11/11/2013 0 0 0
I think you can use rolling to look at the previous row:
df['TEST'] = (df['IN 200D_MA'] & df['IN 200D_MA'].rolling(2).min().shift(1)).astype(int)
Output:
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1
11/4/2013 0 1 1
11/5/2013 0 1 1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0
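Another way to read the rule in the question ("TEST is 1 when IN and 200D_MA are both 1, and stays 1 while 200D_MA remains 1") is as a cumulative maximum within each unbroken run of 200D_MA values. The sketch below assumes exactly that reading. The helper names run_id and entry are mine, and the DATE column is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'IN':      [0, 0, 1, 0, 0, 0, 0, 0, 0],
    '200D_MA': [1, 1, 1, 1, 1, 1, 1, 1, 0],
})

# Label each run of consecutive equal 200D_MA values (1, 1, ..., 2, ...).
run_id = df['200D_MA'].ne(df['200D_MA'].shift()).cumsum()

# Entry signal: IN and 200D_MA are both 1 on the same row.
entry = (df['IN'].eq(1) & df['200D_MA'].eq(1)).astype(int)

# Within a run, TEST switches to 1 at the first entry and stays 1.
# Runs where 200D_MA is 0 never contain an entry, so they stay 0.
df['TEST'] = entry.groupby(run_id).cummax()
print(df)
```

Because every run starts fresh, a 1 is not carried across a row where 200D_MA drops to 0.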

How to transform a data frame with duplicated key for different value to a single key with values as column

I have the data shown below:
Given data frame (input)
I want to change this data into the following data frame:
Modified data frame (output)
I am using the pandas library for my work. I am new to pandas and Python. Can anyone please help me solve this problem using a pandas function such as pivot_table = pd.pivot_table(table, ......) or any other Python library?
Edit :
Sample Data
df=pd.DataFrame({'Acc':[1,2,4,2,1,3],'Event':list('ABCACA'),'exit':[0,1,1,1,0,0]})
EDIT:
I am sorry @jpp, here is an example for my question.
Let's say the given input looks like this:
pd.DataFrame({'Acc':[1,2,4,2,1,1,3],'Event':list('ABCACBA'),'exit':[1,1,1,1,0,0,0]})
I am expecting this kind of output:
# Acc A B C exit
# 1 1 1 1 1
# 2 1 1 0 1
# 4 0 0 1 1
# 3 1 0 0 0
IIUC
pd.concat([pd.crosstab(df.Acc,df.Event),df.groupby('Acc').exit.last()],axis=1)
Out[37]:
A B C exit
Acc
1 1 0 1 0
2 1 1 0 1
3 1 0 0 0
4 0 0 1 1
Data input
df=pd.DataFrame({'Acc':[1,2,4,2,1,3],'Event':list('ABCACA'),'exit':[0,1,1,1,0,0]})
Here is one solution:
import pandas as pd
df = pd.DataFrame({'Acc':[1,2,4,2,1,3],'Event':list('ABCACA'),'exit':[0,1,1,1,0,0]})
df2 = df.groupby(['Acc', 'exit'])['Event'].apply(list).reset_index()
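# groupby(level=0).sum() and drop(columns=...) replace sum(level=0) and drop('Event', 1),
# which no longer work in pandas 2.0 and later
df2 = df2.join(pd.get_dummies(df2['Event'].apply(pd.Series).stack()).groupby(level=0).sum())\
         .drop(columns='Event')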
# Acc exit A B C
# 0 1 0 1 0 1
# 1 2 1 1 1 0
# 2 3 0 1 0 0
# 3 4 1 0 0 1
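For the second sample in the question's edit, the expected exit for Acc 1 is 1 even though its last row has exit 0. One possible reading is that exit should be 1 if any row for that Acc has exit 1. The sketch below assumes that reading, and also assumes the event columns should be 0/1 flags rather than counts.

```python
import pandas as pd

df = pd.DataFrame({'Acc': [1, 2, 4, 2, 1, 1, 3],
                   'Event': list('ABCACBA'),
                   'exit': [1, 1, 1, 1, 0, 0, 0]})

# Event indicators per account; clip turns repeat counts into 0/1 flags.
events = pd.crosstab(df['Acc'], df['Event']).clip(upper=1)

# Assumption: an account counts as exited if any of its rows has exit == 1.
exit_flag = df.groupby('Acc')['exit'].max()

result = events.join(exit_flag)
print(result)
```

The rows come out sorted by Acc (1, 2, 3, 4) rather than in order of first appearance.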

Convert Dictionary to Pandas in Python

I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a DataFrame whose columns are the values in the dictionary and whose rows are its keys, as below?
The best option is #4 (on pandas 2.0 and later, sum(level=0) must be written as groupby(level=0).sum()):
pd.get_dummies(df.stack()).groupby(level=0).sum()
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
.set_index(0,append=True)['level_1']\
.unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
.pivot(index='level_0',columns='Values')['level_1']\
.notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out this short solution, the fastest so far; written with groupby(level=0).sum() because sum(level=0) was removed in pandas 2.0)
pd.get_dummies(df.stack()).groupby(level=0).sum()
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
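It is also possible to skip the wide from_dict frame and go from the dictionary to a long table first. This is just a sketch, assuming pandas 0.25 or later for explode. The column names 'key' and 'value' are my own.

```python
import pandas as pd

data_dict = {'1.160.139.117': ['712907', '742068'],
             '1.161.135.205': ['667386', '742068'],
             '1.162.51.21': ['326136', '663056', '742068']}

# One row per (key, value) pair.
long_df = (pd.Series(data_dict)
             .explode()
             .rename_axis('key')
             .reset_index(name='value'))

# Rows are the dictionary keys, columns are the distinct values.
indicator = pd.crosstab(long_df['key'], long_df['value'])
print(indicator)
```

This avoids the NaN padding that from_dict(orient='index') introduces when the lists have different lengths.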
