I have the following code, which works, but surely there has to be a more efficient way to loop through these steps.
First, here's the data frame. You will see we have some tweets about some cereals, nothing fancy.
import pandas as pd
df = pd.DataFrame([['Cheerios', 'I love Cheerios they are the best'], ['FrostedFlakes', 'Frosted Flakes taste delicious'], ['FruityPebbles', 'Fruity Pebbles is a terrible cereal'], ['Cheerios', 'Honey Nut Cheerios are the greatest cereal'], ['FrostedFlakes', 'Frosted Flakes are grrrreat'], ['FruityPebbles', 'Fruity Pebbles are terrible']], columns=['Label', 'Tweet'])
Now, I create separate data frames for each value of the column "Label," i.e. a data frame for each cereal
cereals0 = df[df["Label"] == 'Cheerios']
cereals1 = df[df["Label"] == 'FrostedFlakes']
cereals2 = df[df["Label"] == 'FruityPebbles']
Now I split the text in the "Tweet" column for each data frame, then count those words, then sort the data frames by that count
cereals0 = cereals0.Tweet.str.split(expand=True).stack().value_counts().reset_index()
cereals1 = cereals1.Tweet.str.split(expand=True).stack().value_counts().reset_index()
cereals2 = cereals2.Tweet.str.split(expand=True).stack().value_counts().reset_index()
Finally I add labels to the columns
cereals0.columns = ['Word', 'Frequency']
cereals1.columns = ['Word', 'Frequency']
cereals2.columns = ['Word', 'Frequency']
Is there a way to do these three steps in a FOR loop so I can avoid copying and pasting?
I have tried something like
for cereal in df.Label.unique():
    cereal = df[df["Label"] == cereal].Tweet.str.split(expand=True).stack().value_counts().reset_index()
    cereal.columns = ['Word', 'Frequency']
But this gets me nothing.
Thank you!
Looking at your examples you probably want to look at .pivot_table or pd.crosstab:
df = df.assign(Tweet=df["Tweet"].str.split()).explode("Tweet")
print(pd.crosstab(df["Tweet"], df["Label"]))
Prints:
Label       Cheerios  FrostedFlakes  FruityPebbles
Tweet
Cheerios           2              0              0
Flakes             0              2              0
Frosted            0              2              0
Fruity             0              0              2
Honey              1              0              0
I                  1              0              0
Nut                1              0              0
Pebbles            0              0              2
a                  0              0              1
are                2              1              1
best               1              0              0
cereal             1              0              1
delicious          0              1              0
greatest           1              0              0
grrrreat           0              1              0
is                 0              0              1
love               1              0              0
taste              0              1              0
terrible           0              0              2
the                2              0              0
they               1              0              0
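If you do want one frame per cereal rather than a single cross-tabulation, one possible sketch is to loop over groupby and keep each result in a dictionary. The loop in the question rebinds the loop variable on every pass, so at most the last result survives the loop. The names word_counts and counts are my own, not from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Cheerios", "I love Cheerios they are the best"],
        ["FrostedFlakes", "Frosted Flakes taste delicious"],
        ["FruityPebbles", "Fruity Pebbles is a terrible cereal"],
        ["Cheerios", "Honey Nut Cheerios are the greatest cereal"],
        ["FrostedFlakes", "Frosted Flakes are grrrreat"],
        ["FruityPebbles", "Fruity Pebbles are terrible"],
    ],
    columns=["Label", "Tweet"],
)

# word_counts is a hypothetical name: one frequency table per label
word_counts = {}
for label, group in df.groupby("Label"):
    # split each tweet into words, flatten to one word per row, then count
    counts = group["Tweet"].str.split().explode().value_counts()
    word_counts[label] = pd.DataFrame(
        {"Word": counts.index, "Frequency": counts.to_numpy()}
    )

print(word_counts["Cheerios"])
```

Each table is then available by name, for example word_counts["FrostedFlakes"], instead of through numbered variables such as cereals0.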
dataframe df has a column
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to one-hot encode the column:
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
                          columns=mlb.classes_,
                          index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted the list into simple comma separated string with the following code
df['data_words_Joined'] = df.data_words.apply(','.join)
it makes the dataframe as following
id data_words
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But it turns each whole string into a single column name instead of making each word its own column:
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1
You can try with explode followed by pivot_table
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
                   'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
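As a variation on the same idea, here is a minimal sketch that swaps pivot_table for pd.crosstab on the exploded frame. The names exploded and one_hot are mine, and the clip step is only an assumption to guard against a word appearing twice in one list.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "data_words": [
        ["salt", "major", "lab", "water"],
        ["lab", "plays", "critical", "salt"],
        ["water", "success", "major"],
    ],
})

# one row per (id, word) pair
exploded = df.explode("data_words")

# count each word per id; clip keeps the result 0/1 if a word repeats
one_hot = pd.crosstab(exploded["id"], exploded["data_words"]).clip(upper=1)
print(one_hot)
```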
One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2 and borrowed input data from Celius Stingher's Answer.
I am trying to combine a column of one data frame, df2, into another data frame, df, so that I can one-hot encode it and add it to the pipeline for df.
Question: What would be the proper order and method to combine them?
My Thoughts
Take the results of my topic-labeling data frame df2 from the column df2['Topic Label'], one-hot encode them, and then add them to the recommender as a fourth factor.
What I tried
I have two separate working data frames. I create df2 first to get the results of df2['Topic Label'], so that I can combine it with the other data frame, df. Both data frames work fully before this step.
The Code
Jupyter Notebook at github Full code
excel for data at github Dataset
df = pd.read_excel('dnd-dataframe.xlsx', sheet_name=0, usecols=['name', 'weapons','herotype','spells'])  # toy dataset
df.head(30)
dummies1 = df['weapons'].str.get_dummies(sep=',')
dummies2 = df['spells'].str.get_dummies(sep=',')
dummies3 = df['herotype'].str.get_dummies(sep=',')
dummies4 = df2['Topic Label'].str.get_dummies()
genre_data = pd.concat([df, dummies1, dummies2, dummies3] + [df2, dummies4], axis=1)
After some more reading and trying I figured it out.
First I took the data frame (df2) that I want to add to the existing one to build the final data frame.
I created the one-hot encoding (dummies) for the topics:
dummies4 = df2['Topic Label'].str.get_dummies()
dummies4.head()
adventure lucky musical pedigree religion survivor tricky
0 0 0 0 0 0 0 1
1 0 0 0 1 0 0 0
2 0 0 0 1 0 0 0
3 0 0 0 0 0 1 0
4 0 0 0 0 1 0 0
Then I concatenated dummies4 (built from df2) onto df, following the pandas merging guide:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
genre_data = pd.concat([df, dummies4], axis=1).reindex(df.index)
genre_data.head()
Finally, I built the dummies for the original df columns, this time as the last step instead of the first. The head() call above gives:
name herotype weapons spells adventure lucky musical pedigree religion survivor tricky
0 bam Bard Dagger, sling, club Transmutation, Enchantment 0 0 0 0 0 0 1
1 niem Sorcerer light crossbow, battleaxe Necromancy 0 0 0 1 0 0 0
2 aem Paladin Greataxe Abjuration, Conjuration 0 0 0 1 0 0 0
3 yaeks Rogue club, battleaxe Conjuration, Evocation, Transmutation 0 0 0 0 0 1 0
4 jeeks Druid Dagger, Greataxe Evocation, Transmutation, Necromancy 0 0 0 0 1 0 0
Now that both the recommender and the topic labels are one-hot encoded, I can run the hybrid, and I get a slightly more interesting result that takes the topic into consideration when sorting the results.
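To illustrate the ordering, here is a small sketch on made-up rows (the two-row frames below are my own toy data, not the dataset linked above). It relies on pd.concat with axis=1 aligning rows by index, so df and df2 are assumed to share the same row index. I also use sep=', ' rather than sep=',' on the assumption that the values are separated by a comma plus a space, which avoids column names with a leading space.

```python
import pandas as pd

# toy stand-ins for df and df2; both use the default 0..n-1 index
df = pd.DataFrame({
    "name": ["bam", "niem"],
    "weapons": ["Dagger, sling", "Greataxe"],
})
df2 = pd.DataFrame({"Topic Label": ["tricky", "pedigree"]})

# one indicator column per weapon and per topic label
weapon_dummies = df["weapons"].str.get_dummies(sep=", ")
topic_dummies = df2["Topic Label"].str.get_dummies()

# axis=1 places the frames side by side, matching rows by index
combined = pd.concat([df, weapon_dummies, topic_dummies], axis=1)
print(combined)
```

If the two frames do not share an index, for example after filtering one of them, the concat would produce NaN rows, so a reset_index(drop=True) on both beforehand may be needed.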
I have a car sales price dataset, and I am trying to predict the sales price from the features of a car. In my pandas dataframe there is a variable called 'Fuel Economy city' with values like 10, 12, 10-12, 13-14, and so on. I need to convert it to a numerical form so I can apply a regression algorithm. I don't have domain knowledge about automobiles. Please help.
I tried removing the hyphen, but then a range such as 10-12 is treated as the four-digit value 1012, which I don't think is correct in this context.
You could try pd.get_dummies() which will make a separate column for the various ranges, marking each column True (1) or False (0). These can then be used in lieu of the ranges (which are considered categorical data.)
import pandas as pd
data = [[10,"blue", "Ford"], [12,"green", "Chevy"],["10-12","white", "Chrysler"],["13-14", "red", "Subaru"]]
df = pd.DataFrame(data, columns = ["Fuel Economy city", "Color", "Make"])
print(df)
df = pd.get_dummies(df)
print(df)
OUTPUT:
Fuel Economy city_10 Fuel Economy city_12 Fuel Economy city_10-12 \
0 1 0 0
1 0 1 0
2 0 0 1
3 0 0 0
Fuel Economy city_13-14 Color_blue Color_green Color_red Color_white \
0 0 1 0 0 0
1 0 0 1 0 0
2 0 0 0 0 1
3 1 0 0 1 0
Make_Chevy Make_Chrysler Make_Ford Make_Subaru
0 0 0 1 0
1 1 0 0 0
2 0 1 0 0
3 0 0 0 1
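If you would rather keep the column numeric than treat each range as a category, one alternative sketch is to replace every range with its midpoint. This is my own suggestion rather than part of the approach above, and it assumes that a value like 10-12 means "between 10 and 12" and that a midpoint is a sensible summary for your model. The helper name range_to_midpoint is hypothetical.

```python
import pandas as pd

def range_to_midpoint(value):
    """Return the value as a float, or the midpoint of a 'low-high' range."""
    parts = [float(part) for part in str(value).split("-")]
    return sum(parts) / len(parts)

fuel = pd.Series([10, 12, "10-12", "13-14"], name="Fuel Economy city")

# plain numbers pass through unchanged; ranges become their midpoint
numeric = fuel.map(range_to_midpoint)
print(numeric.tolist())  # [10.0, 12.0, 11.0, 13.5]
```

This sketch does not handle missing values or negative numbers, so those would need separate treatment.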
Confusing title, let me explain. I have 2 dataframes like this:
The dataframe named df1 looks like this (with millions of rows in the original):
id text c1
1 Hello world how are you people 1
2 Hello people I am fine people 1
3 Good Morning people -1
4 Good Evening -1
Dataframe named df2 looks like this:
Word count Points Percentage
hello 2 2 100
world 1 1 100
how 1 1 100
are 1 1 100
you 1 1 100
people 3 1 33.33
I 1 1 100
am 1 1 100
fine 1 1 100
Good 2 -2 -100
Morning 1 -1 -100
Evening 1 -1 -100
df2 columns explanation:
count means the total number of times that word appeared in df1
points is points given to each word by some kind of algorithm
percentage = points/count*100
Now, I want to add 40 new columns in df1, according to the point & percentage. They will look like this:
perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2
perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1
Let me break it down. The column name contains 3 parts:
1.) perc is just a string prefix and means nothing.
2.) A number in the range -90 to +90. For example, -90 means the percentage is -90 in df2. If a word has a percentage value in the range 81-90, then there will be a value of 1 in that row in the column named perc_-80_xx, where xx is the third part.
3.) The third part is the count. I want two types of count, 1 and 2. Continuing the example from point 2: if the word count is in the range 0 to 1, the value 1 goes in the perc_-80_1 column; if the word count is 2 or more, the value 1 goes in the perc_-80_2 column.
I hope it is not too confusing.
Use:
import numpy as np
import pandas as pd

# df is assumed to be the long-format frame from the previous answer
# (one row per id and Word, with the c1 column); id is kept for matching
df2 = (df.drop_duplicates(['id','Word'])
         .groupby('Word', sort=False)
         .agg({'c1':['sum','size'], 'id':'first'})
      )
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum':'Points','c1size':'Totalcount','idfirst':'id'})
df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100
# truncate the percentage to a multiple of 10, e.g. 33.33 -> 30
s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2= np.where(df1['Totalcount'].isin([0,1]), '1', '2')
# create the column name by joining the parts
df2['new'] = 'perc_' + s1 + '_' +s2
# create indicator DataFrame, one row per id
df3 = (pd.get_dummies(df2[['id','new']].drop_duplicates().set_index('id'),
                      prefix='',
                      prefix_sep='')
         .groupby(level=0).max())
print (df3)
# reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
cols = pd.concat([c + '1', c + '2'])
# join to original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')
print (df)
id text c1 perc_-100_1 perc_-90_1 \
0 1 Hello world how are you people 1 0 0
1 2 Hello people I am fine people 1 0 0
2 3 Good Morning people -1 1 0
3 4 Good Evening -1 1 0
perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 ... perc_10_2 \
0 0 0 0 0 0 ... 0
1 0 0 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 \
0 0 1 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
perc_80_2 perc_90_2 perc_100_2
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
[4 rows x 45 columns]
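To make the naming step easier to follow in isolation, here is a minimal sketch that builds only the column label for three words from the df2 table in the question. The column names decade, bucket and column are my own, and I use count <= 1 for the first bucket, which is my reading of "0 to 1" in the question.

```python
import pandas as pd

# three rows taken from the df2 table in the question
df2 = pd.DataFrame({
    "Word": ["hello", "people", "Good"],
    "count": [2, 3, 2],
    "Points": [2, 1, -2],
})
df2["Percentage"] = df2["Points"] / df2["count"] * 100

# truncate toward zero to a multiple of 10, e.g. 33.33 -> 30
decade = (df2["Percentage"] / 10).astype(int) * 10

# bucket "1" for counts of 0 or 1, bucket "2" for counts of 2 or more
bucket = df2["count"].le(1).map({True: "1", False: "2"})

df2["column"] = "perc_" + decade.astype(str) + "_" + bucket
print(df2[["Word", "column"]])
```

Note that astype(int) truncates toward zero, so a percentage of -33.33 would land in perc_-30, not perc_-40. Whether that matches the intended ranges is something only the asker can confirm.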
I encountered the SettingWithCopyWarning in Python. I searched online but it seems that all the solutions do not work for me.
The input data is like this:
id genre
0 1 Drama, Romance
1 2 Action, Drama
2 3 Action, Comedy
3 4 Thriller
The expected outcome should be:
id Drama Romance Action Comedy Thriller
0 1 1 1 0 0 0
1 2 1 0 1 0 0
2 3 0 0 1 1 0
3 4 0 0 0 0 1
I want to get the subset of the input data, add some columns and modify the added column, and return the subset. Basically, I DO NOT want to modify the original data, I just want to get a subset, which should be a brand new dataframe :
# the function to deal with the genre
def genre(data):
    subset = data[['id', 'genre']]
    for i, row in subset.iterrows():
        if isinstance(row['genre'], float):
            continue
        genreList = row['genre'].split(', ')
        for genre in genreList:
            if genre in list(subset):
                subset.loc[i][genre] = 1
            else:
                subset.loc[:][genre] = 0
                subset.loc[i][genre] = 1
    return subset
I tried many ways, but none of them gets rid of the SettingWithCopyWarning:
subset = data[['A', 'B']].copy()
subset = data.filter(['A','B'], axis=1)
subset = pd.DataFrame(data[['A', 'B']])
subset = data.copy()
subset.drop(columns=['C','D'])
subset = pd.DataFrame({'id': list(data.id), 'genre': list(data.genre)})
Does anyone have any idea how to fix this? Or is there a way to suppress the warning?
Thanks in advance!!
Iteration is not needed, and neither is subsetting. You can use str.get_dummies.
df.drop(columns='genre').join(df['genre'].str.get_dummies(sep=', '))
id Action Comedy Drama Romance Thriller
0 1 0 0 1 1 0
1 2 1 0 1 0 0
2 3 1 1 0 0 0
3 4 0 0 0 0 1
The result is a new DataFrame, which you can assign to something else (df2 = ...).
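If you still want to keep the loop, the warning most likely comes from the chained indexing in subset.loc[i][genre] = 1, which assigns into a temporary object. A rough sketch of the loop with an explicit copy and a single .loc[row, column] assignment is below. The function name genre_indicators and the extra "title" column are made up for the example.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "genre": ["Drama, Romance", "Action, Drama", "Action, Comedy", "Thriller"],
    "title": ["a", "b", "c", "d"],  # hypothetical extra column
})

def genre_indicators(data):
    # explicit copy, so later writes never touch the original frame
    subset = data[["id", "genre"]].copy()
    for i, value in subset["genre"].items():
        if not isinstance(value, str):  # skip NaN entries
            continue
        for name in value.split(", "):
            if name not in subset.columns:
                subset[name] = 0  # new indicator column, all zeros
            subset.loc[i, name] = 1  # one indexing call, no chaining
    return subset.drop(columns="genre")

result = genre_indicators(data)
print(result)
```

The original data keeps its three columns, and result has the genre columns in order of first appearance.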