Getting "SettingWithCopyWarning" while performing one hot encoding with pandas - python

I get a SettingWithCopyWarning in pandas. I searched online, but none of the solutions I found work for me.
The input data is like this:
id genre
0 1 Drama, Romance
1 2 Action, Drama
2 3 Action, Comedy
3 4 Thriller
The expected outcome should be:
id Drama Romance Action Comedy Thriller
0 1 1 1 0 0 0
1 2 1 0 1 0 0
2 3 0 0 1 1 0
3 4 0 0 0 0 1
I want to take a subset of the input data, add some columns, modify the added columns, and return the subset. I DO NOT want to modify the original data; the subset should be a brand new DataFrame:
# the function to deal with the genre
def genre(data):
    subset = data[['id', 'genre']]
    for i, row in subset.iterrows():
        if isinstance(row['genre'], float):
            continue
        genreList = row['genre'].split(', ')
        for genre in genreList:
            if genre in list(subset):
                subset.loc[i][genre] = 1
            else:
                subset.loc[:][genre] = 0
                subset.loc[i][genre] = 1
    return subset
I tried several ways of building the subset, but none of them gets rid of the SettingWithCopyWarning:
subset = data[['A', 'B']].copy()
subset = data.filter(['A','B'], axis=1)
subset = pd.DataFrame(data[['A', 'B']])
subset = data.copy()
subset.drop(columns=['C','D'])
subset = pd.DataFrame({'id': list(data.id), 'genre': list(data.genre)})
Does anyone have any idea how to fix this? Or is there a way to suppress the warning?
Thanks in advance!!

Iteration is not needed, and neither is subsetting. You can use str.get_dummies.
df.drop(columns='genre').join(df['genre'].str.get_dummies(sep=', '))
id Action Comedy Drama Romance Thriller
0 1 0 0 1 1 0
1 2 1 0 1 0 0
2 3 1 1 0 0 0
3 4 0 0 0 0 1
The result is a new DataFrame; df itself is left unchanged, so assign the result to a new name (df2 = ...) to keep it.
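The warning in the question most likely comes from chained assignments such as subset.loc[i][genre] = 1, where subset.loc[i] returns a temporary object and the write may never reach subset. A minimal sketch of the same idea wrapped in a function follows; the function name encode_genres and the sample frame are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd


def encode_genres(data):
    """Return a new DataFrame with `id` plus one 0/1 column per genre."""
    # str.get_dummies builds a fresh frame, so `data` is never written to.
    dummies = data["genre"].str.get_dummies(sep=", ")
    return data[["id"]].join(dummies)


# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "genre": ["Drama, Romance", "Action, Drama", "Action, Comedy", "Thriller"],
})

encoded = encode_genres(df)
print(encoded)
```

Because nothing is assigned into a slice of df, there is no chained assignment for pandas to warn about, and df keeps its original two columns.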

Related

Create separate pandas dataframes based on a column and operate on them

I have the following code, which works, but surely there has to be a more efficient way to loop through these steps.
First, here's the data frame. You will see we have some tweets about some cereals, nothing fancy.
import pandas as pd
df = pd.DataFrame([['Cheerios', 'I love Cheerios they are the best'], ['FrostedFlakes', 'Frosted Flakes taste delicious'], ['FruityPebbles', 'Fruity Pebbles is a terrible cereal'], ['Cheerios', 'Honey Nut Cheerios are the greatest cereal'], ['FrostedFlakes', 'Frosted Flakes are grrrreat'], ['FruityPebbles', 'Fruity Pebbles are terrible']], columns=['Label', 'Tweet'])
Now, I create separate data frames for each value of the column "Label," i.e. a data frame for each cereal
cereals0 = df[df["Label"] == 'Cheerios']
cereals1 = df[df["Label"] == 'FrostedFlakes']
cereals2 = df[df["Label"] == 'FruityPebbles']
Now I split the text in the "Tweet" column for each data frame, then count those words, then sort the data frames by that count
cereals0 = cereals0.Tweet.str.split(expand=True).stack().value_counts().reset_index()
cereals1 = cereals1.Tweet.str.split(expand=True).stack().value_counts().reset_index()
cereals2 = cereals2.Tweet.str.split(expand=True).stack().value_counts().reset_index()
Finally I add labels to the columns
cereals0.columns = ['Word', 'Frequency']
cereals1.columns = ['Word', 'Frequency']
cereals2.columns = ['Word', 'Frequency']
Is there a way to do these three steps in a FOR loop so I can avoid copying and pasting?
I have tried something like
for cereal in df.Label.unique():
    cereal = df[df["Label"] == cereal].Tweet.str.split(expand=True).stack().value_counts().reset_index()
    cereal.columns = ['Word', 'Frequency']
But this gets me nothing.
Thank you!
Looking at your examples you probably want to look at .pivot_table or pd.crosstab:
df = df.assign(Tweet=df["Tweet"].str.split()).explode("Tweet")
print(pd.crosstab(df["Tweet"], df["Label"]))
Prints:
Label Cheerios FrostedFlakes FruityPebbles
Tweet
Cheerios 2 0 0
Flakes 0 2 0
Frosted 0 2 0
Fruity 0 0 2
Honey 1 0 0
I 1 0 0
Nut 1 0 0
Pebbles 0 0 2
a 0 0 1
are 2 1 1
best 1 0 0
cereal 1 0 1
delicious 0 1 0
greatest 1 0 0
grrrreat 0 1 0
is 0 0 1
love 1 0 0
taste 0 1 0
terrible 0 0 2
the 2 0 0
they 1 0 0
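If you do want one frame per cereal, as in the question, the loop there fails because the name cereal is rebound on every iteration and nothing is stored. A minimal sketch that keeps each result in a dictionary keyed by label is shown below; the name word_counts is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Cheerios", "I love Cheerios they are the best"],
        ["FrostedFlakes", "Frosted Flakes taste delicious"],
        ["FruityPebbles", "Fruity Pebbles is a terrible cereal"],
        ["Cheerios", "Honey Nut Cheerios are the greatest cereal"],
        ["FrostedFlakes", "Frosted Flakes are grrrreat"],
        ["FruityPebbles", "Fruity Pebbles are terrible"],
    ],
    columns=["Label", "Tweet"],
)

# One word-frequency frame per label, stored under the label itself.
word_counts = {}
for label, group in df.groupby("Label"):
    # Split each tweet into words, put one word per row, then count them.
    counts = group["Tweet"].str.split().explode().value_counts()
    word_counts[label] = counts.rename_axis("Word").reset_index(name="Frequency")

print(word_counts["Cheerios"])
```

Each frame is then available by name, for example word_counts["FrostedFlakes"], with value_counts already sorting the words by descending frequency.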

pandas: convert a list in a column into separate columns

dataframe df has a column
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to one-hot encode this column:
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
                          columns=mlb.classes_,
                          index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted the list into simple comma separated string with the following code
df['data_words_Joined'] = df.data_words.apply(','.join)
it makes the dataframe as following
id data_words
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But it turns each whole string into a single column name instead of giving each word its own column:
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1
You can try with explode followed by pivot_table
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2 and borrowed input data from Celius Stingher's Answer.
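Neither output above keeps id as an ordinary column the way the desired result in the question does. A minimal sketch that does, using explode followed by pd.crosstab, is shown below; the variable names exploded and one_hot are assumptions for illustration. Note that crosstab counts occurrences, so a word repeated within one list would produce a value above 1.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "data_words": [
        ["salt", "major", "lab", "water"],
        ["lab", "plays", "critical", "salt"],
        ["water", "success", "major"],
    ],
})

# One row per (id, word) pair.
exploded = df.explode("data_words")

# Count each word per id, then turn the id index back into a column.
one_hot = pd.crosstab(exploded["id"], exploded["data_words"]).reset_index()
one_hot.columns.name = None  # drop the leftover "data_words" axis label

print(one_hot)
```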

How to clean up columns with values '10-12' (represented in range) in pandas dataframe?

I have a car sales price dataset and I am trying to predict the sale price from the features of a car. One variable, 'Fuel Economy city', holds values like 10, 12, 10-12, 13-14, ... in a pandas dataframe. I need to convert it to a numeric type before I can apply a regression algorithm. I don't have domain knowledge about automobiles. Please help.
I tried removing the hyphen, but then 10-12 is treated as the four-digit value 1012, which I don't think is correct in this context.
You could try pd.get_dummies() which will make a separate column for the various ranges, marking each column True (1) or False (0). These can then be used in lieu of the ranges (which are considered categorical data.)
import pandas as pd
data = [[10,"blue", "Ford"], [12,"green", "Chevy"],["10-12","white", "Chrysler"],["13-14", "red", "Subaru"]]
df = pd.DataFrame(data, columns = ["Fuel Economy city", "Color", "Make"])
print(df)
df = pd.get_dummies(df)
print(df)
OUTPUT:
Fuel Economy city_10 Fuel Economy city_12 Fuel Economy city_10-12 \
0 1 0 0
1 0 1 0
2 0 0 1
3 0 0 0
Fuel Economy city_13-14 Color_blue Color_green Color_red Color_white \
0 0 1 0 0 0
1 0 0 1 0 0
2 0 0 0 0 1
3 1 0 0 1 0
Make_Chevy Make_Chrysler Make_Ford Make_Subaru
0 0 0 1 0
1 1 0 0 0
2 0 1 0 0
3 0 0 0 1
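If you would rather keep the column numeric than treat it as categorical, one alternative (not part of the answer above) is to replace each range with its midpoint. A minimal sketch follows, assuming every value is either a single non-negative number or two such numbers joined by a hyphen, and that a midpoint is an acceptable summary for your model; the helper range_to_midpoint and the new column name are hypothetical.

```python
import pandas as pd


def range_to_midpoint(value):
    """Turn 10, "12" or "10-12" into a float; a range becomes its midpoint."""
    # Assumes no negative numbers and no missing values.
    numbers = [float(part) for part in str(value).split("-")]
    return sum(numbers) / len(numbers)


df = pd.DataFrame({"Fuel Economy city": [10, 12, "10-12", "13-14"]})
df["fuel_economy_city_num"] = df["Fuel Economy city"].apply(range_to_midpoint)

print(df)
```

This keeps the ordering between values (13.5 is still larger than 11.0), which one-hot columns discard.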

Python - Attempting to create binary features from a column with lists of strings

It was hard for me to come up with a clear title, but an example should make things clearer.
Index C1
1 [dinner]
2 [brunch, food]
3 [dinner, fancy]
Now, I'd like to create a set of binary features for each of the unique values in this column.
The example above would turn into:
Index C1 dinner brunch fancy food
1 [dinner] 1 0 0 0
2 [brunch, food] 0 1 0 1
3 [dinner, fancy] 1 0 1 0
Any help would be much appreciated.
For a performant solution, I recommend creating a new DataFrame by listifying your column.
pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
brunch dinner fancy food
0 0 1 0 0
1 1 0 0 1
2 0 1 1 0
This is going to be so much faster than apply(pd.Series).
This works assuming no list contains the same value more than once (e.g., ['dinner', ..., 'dinner']). If some do, you'll need an extra groupby step:
(pd.get_dummies(
    pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
   .groupby(level=0, axis=1)
   .sum())
Well, if your data is like this, then what you're looking for isn't "binary" anymore.
Maybe using MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.C1),columns=mlb.classes_,index=df.Index).reset_index()
Out[970]:
Index brunch dinner fancy food
0 1 0 1 0 0
1 2 1 0 0 1
2 3 0 1 1 0

iterate through the rows of a pandas df to generate new df (depending on conditions)

I have a df with badminton scores. Each row holds one team's set in a game, and the columns hold the score after each point, like so:
0 0 1 1 2 3 4
0 1 2 3 3 4 4
I want to obtain only 0 and 1, with 1 marking a point scored, like so (to analyse whether there is any pattern in the points):
0 0 1 0 1 1 1
0 1 1 1 0 1 0
I was thinking of using df.itertuples(), iloc and conditions to write 1 to a new dataframe if next score = score + 1, or 0 if next score = score.
But I don't know how to iterate through the generated tuples or how to build the new df with the 0s and 1s in the right locations.
I hope that is clear. Thanks for your help.
Also, any suggestions on how to analyse the patterns after that?
You just need diff (if you need to convert it back, try cumsum):
df.diff(axis=1).fillna(0).astype(int)
Out[1382]:
1 2 3 4 5 6 7
0 0 0 1 0 1 1 1
1 0 1 1 1 0 1 0
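A minimal, self-contained sketch of the round trip, using the scores from the question; the variable names scores, points and restored are assumptions for illustration. The cumsum step only recovers the original scores because every set starts at 0.

```python
import pandas as pd

# Running score after each point, one row per set (data from the question).
scores = pd.DataFrame([
    [0, 0, 1, 1, 2, 3, 4],
    [0, 1, 2, 3, 3, 4, 4],
])

# Difference between neighbouring columns: 1 where a point was scored, else 0.
# The first column has no left neighbour, so its NaN is filled with 0.
points = scores.diff(axis=1).fillna(0).astype(int)

# Summing the 0/1 flags along each row rebuilds the running score.
restored = points.cumsum(axis=1)

print(points)
print(restored)
```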
