Combining one-hot encodings from two dataframes - python

I am trying to combine a column from one data frame, df2, into another data frame called df, so that I can one-hot encode it and add it to the pipeline built on df.
Question: What would be the proper order and method to combine them?
My Thoughts
Take the results of my topic-labeling data frame df2, specifically the column df2['Topic Label'], one-hot encode it, and then add it to the recommender as a fourth factor.
What I tried
I have two separate working data frames, so I build df2 first to get the df2['Topic Label'] results, then combine it with the other data frame, df. Both data frames work correctly on their own before this step.
The Code
Full code: Jupyter Notebook at github
Dataset: excel file at github
df = pd.read_excel('dnd-dataframe.xlsx', sheet_name=0, usecols=['name', 'weapons', 'herotype', 'spells'])  # toy dataset
df.head(30)
dummies1 = df['weapons'].str.get_dummies(sep=',')
dummies2 = df['spells'].str.get_dummies(sep=',')
dummies3 = df['herotype'].str.get_dummies(sep=',')
dummies4 = df2['Topic Label'].str.get_dummies()
genre_data = pd.concat([df, dummies1, dummies2, dummies3] + [df2, dummies4], axis=1)

After some more reading and trying, I figured it out.
First I took the data frame (df2) that I want to add to the existing one, and created the one-hot encoding (dummies) for the topic labels:
dummies4 = df2['Topic Label'].str.get_dummies()
dummies4.head()
adventure lucky musical pedigree religion survivor tricky
0 0 0 0 0 0 0 1
1 0 0 0 1 0 0 0
2 0 0 0 1 0 0 0
3 0 0 0 0 0 1 0
4 0 0 0 0 1 0 0
Then I concatenated dummies4 from df2 into df:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
genre_data = pd.concat([df, dummies4], axis=1).reindex(df.index)
genre_data.head()
And finally I ran get_dummies for the original df columns, this time last in the concat order instead of first:
name herotype weapons spells adventure lucky musical pedigree religion survivor tricky
0 bam Bard Dagger, sling, club Transmutation, Enchantment 0 0 0 0 0 0 1
1 niem Sorcerer light crossbow, battleaxe Necromancy 0 0 0 1 0 0 0
2 aem Paladin Greataxe Abjuration, Conjuration 0 0 0 1 0 0 0
3 yaeks Rogue club, battleaxe Conjuration, Evocation, Transmutation 0 0 0 0 0 1 0
4 jeeks Druid Dagger, Greataxe Evocation, Transmutation, Necromancy 0 0 0 0 1 0 0
Now that both the recommender features and the topic labels are one-hot encoded, I can run the hybrid and get a slightly more interesting result, one that takes the topic into consideration when sorting the results.
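For reference, the steps above can be collected into one self-contained sketch; the toy frames here (and the topic labels in df2) are made up to stand in for the real dataset:

```python
import pandas as pd

# stand-ins for the two working data frames from the question
df = pd.DataFrame({
    'name': ['bam', 'niem'],
    'herotype': ['Bard', 'Sorcerer'],
    'weapons': ['Dagger,sling,club', 'light crossbow,battleaxe'],
    'spells': ['Transmutation,Enchantment', 'Necromancy'],
})
df2 = pd.DataFrame({'Topic Label': ['tricky', 'pedigree']})

# one-hot encode the topic labels from df2
dummies4 = df2['Topic Label'].str.get_dummies()

# attach the topic dummies to df, aligned on df's index
genre_data = pd.concat([df, dummies4], axis=1).reindex(df.index)

# then one-hot encode the comma-separated columns of the original df
for col in ['weapons', 'spells', 'herotype']:
    genre_data = pd.concat([genre_data, df[col].str.get_dummies(sep=',')], axis=1)
```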

Related

Create separate pandas dataframes based on a column and operate on them

I have the following code, which works, but surely there has to be a more efficient way to loop through these steps.
First, here's the data frame. You will see we have some tweets about some cereals, nothing fancy.
import pandas as pd
df = pd.DataFrame([['Cheerios', 'I love Cheerios they are the best'], ['FrostedFlakes', 'Frosted Flakes taste delicious'], ['FruityPebbles', 'Fruity Pebbles is a terrible cereal'], ['Cheerios', 'Honey Nut Cheerios are the greatest cereal'], ['FrostedFlakes', 'Frosted Flakes are grrrreat'], ['FruityPebbles', 'Fruity Pebbles are terrible']], columns=['Label', 'Tweet'])
Now, I create separate data frames for each value of the column "Label", i.e. a data frame for each cereal:
cereals0 = df[df["Label"] == 'Cheerios']
cereals1 = df[df["Label"] == 'FrostedFlakes']
cereals2 = df[df["Label"] == 'FruityPebbles']
Now I split the text in the "Tweet" column for each data frame, then count those words, then sort the data frames by that count
cereals0 = cereals0.Tweet.str.split(expand=True).stack().value_counts().reset_index()
cereals1 = cereals1.Tweet.str.split(expand=True).stack().value_counts().reset_index()
cereals2 = cereals2.Tweet.str.split(expand=True).stack().value_counts().reset_index()
Finally I add labels to the columns
cereals0.columns = ['Word', 'Frequency']
cereals1.columns = ['Word', 'Frequency']
cereals2.columns = ['Word', 'Frequency']
Is there a way to do these three steps in a FOR loop so I can avoid copying and pasting?
I have tried something like
for cereal in df.Label.unique():
    cereal = df[df["Label"] == cereal].Tweet.str.split(expand=True).stack().value_counts().reset_index()
    cereal.columns = ['Word', 'Frequency']
But this gets me nothing.
Thank you!
Looking at your examples, you probably want to look at .pivot_table or pd.crosstab:
df = df.assign(Tweet=df["Tweet"].str.split()).explode("Tweet")
print(pd.crosstab(df["Tweet"], df["Label"]))
Prints:
Label Cheerios FrostedFlakes FruityPebbles
Tweet
Cheerios 2 0 0
Flakes 0 2 0
Frosted 0 2 0
Fruity 0 0 2
Honey 1 0 0
I 1 0 0
Nut 1 0 0
Pebbles 0 0 2
a 0 0 1
are 2 1 1
best 1 0 0
cereal 1 0 1
delicious 0 1 0
greatest 1 0 0
grrrreat 0 1 0
is 0 0 1
love 1 0 0
taste 0 1 0
terrible 0 0 2
the 2 0 0
they 1 0 0
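If you do want one word-count frame per cereal, the loop in the question almost works; the missing piece is storing each result under its label rather than rebinding a loop variable (a dict is a common choice). A minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame(
    [['Cheerios', 'I love Cheerios they are the best'],
     ['FrostedFlakes', 'Frosted Flakes taste delicious'],
     ['FruityPebbles', 'Fruity Pebbles is a terrible cereal'],
     ['Cheerios', 'Honey Nut Cheerios are the greatest cereal'],
     ['FrostedFlakes', 'Frosted Flakes are grrrreat'],
     ['FruityPebbles', 'Fruity Pebbles are terrible']],
    columns=['Label', 'Tweet'])

# one word-frequency frame per label, keyed by label
cereals = {}
for label in df['Label'].unique():
    counts = (df.loc[df['Label'] == label, 'Tweet']
                .str.split(expand=True).stack()
                .value_counts().reset_index())
    counts.columns = ['Word', 'Frequency']
    cereals[label] = counts
```

Then `cereals['Cheerios']` holds the sorted word counts for the Cheerios tweets.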

Converting objects into suitable numerical values

This is a salary dataset composed of the following columns:
Index(['work_year', 'experience_level', 'employment_type',
       'job_title', 'salary', 'salary_currency', 'salary_in_usd',
       'employee_residence', 'remote_ratio', 'company_location',
       'company_size'],
      dtype='object')
I want to look at the comparison between the features (experience_level, employment_type, job_title, salary_currency and remote_ratio) and the label (salary).
I have to make the feature engineering part, which includes converting experience level, employment type and salary currency to suitable numerical values.
How can that be done? What is the optimal solution in this case?
The three columns that have to be converted
You could use e.g. one-hot encoding to transform a dataframe like
index  experience_level  employment_type
0      EX                CT
1      EX                PT
2      MI                CT
3      MI                FT
4      EX                PT
to a dataframe like
index  experience_level_EN  experience_level_EX  experience_level_MI  experience_level_SE  employment_type_CT  employment_type_FL  employment_type_FT  employment_type_PT
0      0                    1                    0                    0                    1                   0                   0                   0
1      0                    1                    0                    0                    0                   0                   0                   1
2      0                    0                    1                    0                    1                   0                   0                   0
3      0                    0                    1                    0                    0                   0                   1                   0
4      0                    1                    0                    0                    0                   0                   0                   1
as follows:
from sklearn.preprocessing import LabelBinarizer

cat_cols = ['experience_level', 'employment_type']
df_encoded = df.drop(columns=cat_cols)
for col in cat_cols:
    encoder = LabelBinarizer().fit(df[col])
    cols = [f'{col}_{c}' for c in encoder.classes_]
    encoded = pd.DataFrame(encoder.transform(df[col]), columns=cols)
    df_encoded = pd.concat([encoded, df_encoded], axis=1)
You may also want to apply this to other columns such as company_location, employee_residence, etc., which I assume to be categorical too.
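If you prefer to stay within pandas, pd.get_dummies does the same kind of encoding in one call; a minimal sketch on the toy frame above (note it only emits columns for categories actually present in the data, so EN, SE and FL don't appear here):

```python
import pandas as pd

# toy frame matching the example above
df = pd.DataFrame({
    'experience_level': ['EX', 'EX', 'MI', 'MI', 'EX'],
    'employment_type':  ['CT', 'PT', 'CT', 'FT', 'PT'],
})

# one dummy column per observed category, per listed column
df_encoded = pd.get_dummies(df, columns=['experience_level', 'employment_type'])
```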

How to clean up columns with values '10-12' (represented in range) in pandas dataframe?

I have a car sales price dataset, where I am trying to predict the sales price given the features of a car. It has a variable called 'Fuel Economy city' whose values look like 10, 12, 10-12, 13-14, ... in a pandas dataframe. I need to convert this into numerical form to apply a regression algorithm. I don't have domain knowledge about automobiles. Please help.
I tried removing the hyphen, but then '10-12' is treated as the four-digit value 1012, which I don't think is correct in this context.
You could try pd.get_dummies() which will make a separate column for the various ranges, marking each column True (1) or False (0). These can then be used in lieu of the ranges (which are considered categorical data.)
import pandas as pd
data = [[10,"blue", "Ford"], [12,"green", "Chevy"],["10-12","white", "Chrysler"],["13-14", "red", "Subaru"]]
df = pd.DataFrame(data, columns = ["Fuel Economy city", "Color", "Make"])
print(df)
df = pd.get_dummies(df)
print(df)
OUTPUT:
  Fuel Economy city_10  Fuel Economy city_12  Fuel Economy city_10-12  \
0                    1                     0                        0
1                    0                     1                        0
2                    0                     0                        1
3                    0                     0                        0
  Fuel Economy city_13-14  Color_blue  Color_green  Color_red  Color_white  \
0                       0           1            0          0            0
1                       0           0            1          0            0
2                       0           0            0          0            1
3                       1           0            0          1            0
  Make_Chevy  Make_Chrysler  Make_Ford  Make_Subaru
0          0              0          1            0
1          1              0          0            0
2          0              1          0            0
3          0              0          0            1
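Alternatively, if you want the column to stay numeric for regression rather than become categorical, one option, sketched here under the assumption that a range like '10-12' can reasonably be summarized by its midpoint, is:

```python
import pandas as pd

df = pd.DataFrame({'Fuel Economy city': [10, 12, '10-12', '13-14']})

def range_to_midpoint(value):
    """Turn '10-12' into 11.0; pass plain numbers through as floats."""
    parts = str(value).split('-')
    return sum(float(p) for p in parts) / len(parts)

df['Fuel Economy city'] = df['Fuel Economy city'].map(range_to_midpoint)
```

This keeps a single numeric feature instead of one dummy column per distinct range.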

Python - Attempting to create binary features from a column with lists of strings

It was hard for me to come up with a clear title, but an example should make things clearer.
Index C1
1 [dinner]
2 [brunch, food]
3 [dinner, fancy]
Now, I'd like to create a set of binary features for each of the unique values in this column.
The example above would turn into:
Index C1 dinner brunch fancy food
1 [dinner] 1 0 0 0
2 [brunch, food] 0 1 0 1
3 [dinner, fancy] 1 0 1 0
Any help would be much appreciated.
For a performant solution, I recommend creating a new DataFrame by listifying your column.
pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
brunch dinner fancy food
0 0 1 0 0
1 1 0 0 1
2 0 1 1 0
This is going to be so much faster than apply(pd.Series).
This works assuming the lists don't contain repeats of the same value (e.g., ['dinner', ..., 'dinner']). If they do, then you'll need an extra groupby step:
(pd.get_dummies(
    pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
   .groupby(level=0, axis=1)
   .sum())
Well, if your data is like this, then what you're looking for isn't "binary" anymore.
Maybe using MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.C1), columns=mlb.classes_, index=df.Index).reset_index()
Out[970]:
Index brunch dinner fancy food
0 1 0 1 0 0
1 2 1 0 0 1
2 3 0 1 1 0
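Another pandas-only route, assuming each cell really is a list of strings, is to join each list into a delimited string and then use str.get_dummies, as in the first question above:

```python
import pandas as pd

df = pd.DataFrame({'C1': [['dinner'], ['brunch', 'food'], ['dinner', 'fancy']]},
                  index=[1, 2, 3])

# join the lists into 'brunch|food'-style strings, then dummify on the delimiter
dummies = df['C1'].apply('|'.join).str.get_dummies(sep='|')
result = df.join(dummies)
```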

Create a Model for Dummy Variables

Starting with a training data set for a variable var1 as:
var1
A
B
C
D
I want to create a model (let's call it dummy_model1) that would then transform the training data set to:
var1_A var1_B var1_C var1_D
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
This functionality (or similar) exists in, among others, the dummies package in R and get_dummies in Pandas, or even case statements in SQL.
I'd like to then be able to apply dummy_model1 to a new data set:
var1
C
7
#
A
and get the following output:
var1_A var1_B var1_C var1_D
0 0 1 0
0 0 0 0
0 0 0 0
1 0 0 0
I know I can do this in SQL with 'case' statements but would love to automate the process given I have ~2,000 variables. Also, the new data sets will almost always have "bad" data (e.g., 7 and # in the above example).
Somewhat language agnostic (as long as it's open source), but I would prefer Python or R. Please note the data is over 500 GB, so that limits some of my options. Thanks in advance.
Assuming var1 fits in memory on its own, here is a possible solution:
First, read in var1.
Next, use get_dummies to get all the "training" categories encoded as dummy variables. Store the column names as a list or an array.
Then, read in the first few rows of your training dataset to get the column names and store them as a list (or if you know these already you can skip this step).
Create a new list or array containing the dummy variable column names and the relevant other columns (this could just be every column in the dataset except var1). This will be the final columns encoding.
Then, read in your test data. Use get_dummies to encode var1 in your test data, knowing it may be missing categories or have extraneous categories. Then reindex the data to match the final columns encoding.
After reindexing, you will end up with a test dataset whose var1 dummies are consistent with your training var1.
To illustrate:
import pandas as pd
import numpy as np

training = pd.DataFrame({
    'var1': ['a', 'b', 'c'],
    'other_var': [4, 7, 3],
    'yet_another': [8, 0, 2]
})
print(training)
other_var var1 yet_another
0 4 a 8
1 7 b 0
2 3 c 2
test = pd.DataFrame({
    'var1': ['a', 'b', 'q'],
    'other_var': [9, 4, 2],
    'yet_another': [9, 1, 5]
})
print(test)
other_var var1 yet_another
0 9 a 9
1 4 b 1
2 2 q 5
var1_dummied = pd.get_dummies(training.var1, prefix='var1')
var_dummy_columns = var1_dummied.columns.values
print(var_dummy_columns)
array(['var1_a', 'var1_b', 'var1_c'], dtype=object)
final_encoding_columns = np.append(training.drop(['var1'], axis = 1).columns, var_dummy_columns)
print(final_encoding_columns)
array(['other_var', 'yet_another', 'var1_a', 'var1_b', 'var1_c'], dtype=object)
test_encoded = pd.get_dummies(test, columns=['var1'])
print(test_encoded)
other_var yet_another var1_a var1_b var1_q
0 9 9 1 0 0
1 4 1 0 1 0
2 2 5 0 0 1
test_encoded_reindexed = test_encoded.reindex(columns = final_encoding_columns, fill_value=0)
print(test_encoded_reindexed)
other_var yet_another var1_a var1_b var1_c
0 9 9 1 0 0
1 4 1 0 1 0
2 2 5 0 0 0
This should be what you want, based on the expected output in your question and the comments.
If the test data easily fits in memory, you can easily extend this to multiple variables. Just save and then update final_encoding_columns iteratively for each training variable you want to encode. Then pass all of those columns to the columns= argument when reindexing the test data. Reindex with your complete final_encoding_columns and you should be all set.
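A sketch of that multi-variable extension (the frames and column names here are invented for illustration), assuming the training categories fit in memory:

```python
import pandas as pd

training = pd.DataFrame({'var1': ['a', 'b', 'c'], 'var2': ['x', 'y', 'x'],
                         'other': [1, 2, 3]})
test = pd.DataFrame({'var1': ['a', 'q'], 'var2': ['y', 'z'],
                     'other': [9, 8]})

cat_cols = ['var1', 'var2']

# build the final column encoding from the training categories
final_cols = list(training.drop(columns=cat_cols).columns)
for col in cat_cols:
    final_cols += [f'{col}_{c}' for c in sorted(training[col].unique())]

# encode the test data and align it to the training encoding;
# unseen categories (like 'q' and 'z') are dropped, missing ones filled with 0
test_encoded = (pd.get_dummies(test, columns=cat_cols)
                  .reindex(columns=final_cols, fill_value=0))
```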
Just a try, in R:
# first set the variable to factor with levels specified
df$var1 <- factor(df$var1, levels = LETTERS[1:4])
model.matrix(data = df, ~var1-1)
# var1A var1B var1C var1D
#1 0 0 1 0
#4 1 0 0 0
# or even
sapply(LETTERS[1:4], function(x) as.numeric(x==df$var1))
# A B C D
#[1,] 0 0 1 0
#[2,] 0 0 0 0
#[3,] 0 0 0 0
#[4,] 1 0 0 0
