Create a Model for Dummy Variables - python

Starting with a training data set for a variable var1 as:
var1
A
B
C
D
I want to create a model (let's call it dummy_model1) that would then transform the training data set to:
var1_A var1_B var1_C var1_D
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
This functionality (or similar) exists in, among others, the dummies package in R and get_dummies in Pandas, or even case statements in SQL.
I'd like to then be able to apply dummy_model1 to a new data set:
var1
C
7
#
A
and get the following output:
var1_A var1_B var1_C var1_D
0 0 1 0
0 0 0 0
0 0 0 0
1 0 0 0
I know I can do this in SQL with 'case' statements but would love to automate the process given I have ~2,000 variables. Also, the new data sets will almost always have "bad" data (e.g., 7 and # in the above example).
Somewhat language agnostic (as long as it's open source), but I would prefer Python or R. Please note the data is over 500GB, so that limits some of my options. Thanks in advance.

Assuming var1 fits in memory on its own, here is a possible solution:
First, read in var1.
Next, use get_dummies to get all the "training" categories encoded as dummy variables. Store the column names as a list or an array.
Then, read in the first few rows of your training dataset to get the column names and store them as a list (or if you know these already you can skip this step).
Create a new list or array containing the dummy variable column names and the relevant other columns (this could just be every column in the dataset except var1). This will be the final columns encoding.
Then, read in your test data. Use get_dummies to encode var1 in your test data, knowing it may be missing categories or have extraneous categories. Then reindex the data to match the final columns encoding.
After reindexing, you will end up with a test dataset whose var1 dummies are consistent with your training var1.
To illustrate:
import pandas as pd
import numpy as np
training = pd.DataFrame({
    'other_var': [4, 7, 3],
    'var1': ['a', 'b', 'c'],
    'yet_another': [8, 0, 2]
})
print(training)
   other_var var1  yet_another
0          4    a            8
1          7    b            0
2          3    c            2
test = pd.DataFrame({
    'other_var': [9, 4, 2],
    'var1': ['a', 'b', 'q'],
    'yet_another': [9, 1, 5]
})
print(test)
   other_var var1  yet_another
0          9    a            9
1          4    b            1
2          2    q            5
var1_dummied = pd.get_dummies(training.var1, prefix='var1')
var_dummy_columns = var1_dummied.columns.values
print(var_dummy_columns)
['var1_a' 'var1_b' 'var1_c']
final_encoding_columns = np.append(training.drop(columns=['var1']).columns, var_dummy_columns)
print(final_encoding_columns)
['other_var' 'yet_another' 'var1_a' 'var1_b' 'var1_c']
test_encoded = pd.get_dummies(test, columns=['var1'])
print(test_encoded)
   other_var  yet_another  var1_a  var1_b  var1_q
0          9            9       1       0       0
1          4            1       0       1       0
2          2            5       0       0       1
test_encoded_reindexed = test_encoded.reindex(columns=final_encoding_columns, fill_value=0)
print(test_encoded_reindexed)
   other_var  yet_another  var1_a  var1_b  var1_c
0          9            9       1       0       0
1          4            1       0       1       0
2          2            5       0       0       0
This should be what you want, based on the expected output in your question and the comments.
If the test data fits in memory, extending this to multiple variables is straightforward: update final_encoding_columns iteratively for each training variable you want to encode, then reindex the test data against the complete final_encoding_columns and you should be all set.
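A condensed sketch of that multi-variable extension, on toy frames of my own (the column names here are made up):
import pandas as pd

# hypothetical training/test frames with two categorical columns
training = pd.DataFrame({'var1': ['a', 'b'], 'var2': ['x', 'y'], 'num': [1, 2]})
test = pd.DataFrame({'var1': ['b', 'q'], 'var2': ['x', 'z'], 'num': [3, 4]})

cat_cols = ['var1', 'var2']  # in practice, your ~2,000 variables
final_encoding_columns = pd.get_dummies(training, columns=cat_cols).columns

# unseen categories ('q', 'z') drop out; categories missing from test fill with 0
test_encoded = (pd.get_dummies(test, columns=cat_cols)
                .reindex(columns=final_encoding_columns, fill_value=0))
print(test_encoded)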

just a try:
# first set the variable to factor with levels specified
df$var1 <- factor(df$var1, levels = LETTERS[1:4])
model.matrix(data = df, ~var1-1)
#   var1A var1B var1C var1D
# 1     0     0     1     0
# 4     1     0     0     0
# note: rows 2 and 3 ("7" and "#") became NA and were dropped by model.matrix
# or even
sapply(LETTERS[1:4], function(x) as.numeric(x==df$var1))
# A B C D
#[1,] 0 0 1 0
#[2,] 0 0 0 0
#[3,] 0 0 0 0
#[4,] 1 0 0 0
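For completeness, a rough Python analogue of this fixed-levels trick using pd.Categorical (a sketch on the question's data; unknown values become NaN and encode as all-zero rows):
import pandas as pd

levels = ['A', 'B', 'C', 'D']  # the training levels
new = pd.Series(['C', '7', '#', 'A'], name='var1')

# '7' and '#' fall outside the fixed categories, so their rows are all zeros
encoded = pd.get_dummies(pd.Categorical(new, categories=levels), prefix='var1', dtype=int)
print(encoded)
#    var1_A  var1_B  var1_C  var1_D
# 0       0       0       1       0
# 1       0       0       0       0
# 2       0       0       0       0
# 3       1       0       0       0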

Related

How can I find differences between two dataframe rows?

I have two data frames that I merged together on a common ID. I am trying to uncover when values in each row for a matching ID are different.
I merged the files so that I have the table below. I think I might be able to approach this with a series of if statements, but the actual data file has hundreds of columns, which doesn't seem efficient at all. I'm trying to determine if there's an easier way to do this.
x Loan_ID Trade_Quantity_x Principal_x Interest_x Late_Fee_x Trade_Quantity_y Principal_y Interest_y Late_Fee_y
0 1 10 30 0 0 10 30 0 0
1 2 10 0 0 5 10 0 0 0
2 3 10 0 50 0 10 0 0 0
3 4 10 0 0 0 10 0 0 0
4 5 10 100 10 0 10 100 10 0
5 6 9 0 0 0 9 0 0 0
6 7 10 0 0 0 10 0 0 0
Expected output should be:
2. Late_Fee_y
3. Interest_y
I am assuming that what you are after is to compare two data frames of the same structure, i.e. having the same list of columns and the same number of rows, with rows identified by the values of a special Loan_ID column.
The goal is to list all "cells" that differ between the two frames, where a cell is located by its Loan_ID and column name.
May I suggest merging the two frames differently first, to get a long list of values, and then finding the differences by scanning the melted frames or by applying a filter?
Example data (think of id as Loan_ID)
import pandas as pd

x = {'id': [1, 2], 'A': [0, 1], 'B': [2, 3]}
y = {'id':[1,2],'A':[0,2],'B':[2,4]}
df_x = pd.DataFrame(x)
df_y = pd.DataFrame(y)
print(df_x)
print(df_y)
Melt both frames:
df_xm = pd.melt(df_x, id_vars=['id'])
df_xm['source']='x'
df_ym = pd.melt(df_y, id_vars=['id'])
df_ym['source']='y'
print(df_xm)
print(df_ym)
Assuming that both frames are sorted by id correspondingly:
for i in df_xm.index:
    if df_xm['value'][i] != df_ym['value'][i]:
        print(f"{df_xm['id'][i]},{df_xm['variable'][i]}")
Second method:
merged = df_xm.merge(df_ym, left_on= ['id','variable'], right_on=['id','variable'])
print(merged)
filter_diff = merged['value_x'] != merged['value_y']
print('differences:')
print(merged[ filter_diff ])
I'm sure this can be improved for efficiency, but this is my general idea of how to tackle the "difference between two table snapshots" problem with generic frame/table and filter operations.
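For what it's worth, newer pandas (1.1+) also ships DataFrame.compare, which does this cell-level diff directly once both frames are indexed by id (a minimal sketch on the same toy data):
import pandas as pd

df_x = pd.DataFrame({'id': [1, 2], 'A': [0, 1], 'B': [2, 3]})
df_y = pd.DataFrame({'id': [1, 2], 'A': [0, 2], 'B': [2, 4]})

# identical rows/columns are dropped; differing cells show self/other pairs
print(df_x.set_index('id').compare(df_y.set_index('id')))
#        A           B
#     self other self other
# id
# 2      1     2    3     4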

how to deal with numerical variables like branch_id or state_id?

There are a few columns like branch_id, state_id, or country_id. These aren't unique values for each row like id.
How to deal with such columns while working on a machine learning project?
I usually just convert them into nominal categories
train.branch_id = train.branch_id.astype('category')  # categories are unordered by default
You would need to LabelEncode or OneHotEncode them (usually the latter).
The simplest way of doing this is pandas.get_dummies.
Say you have a Series like the following:
import pandas as pd

s = pd.Series(list('abca'))
Output:
0 a
1 b
2 c
3 a
And then:
pd.get_dummies(s)
Output:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
This then goes into your train data set as separate features.
However, be careful of the dummy variable trap (perfect multicollinearity from keeping all levels) if you are building a regression model.
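If you need an encoding you can fit on training data and reapply to new data, scikit-learn's OneHotEncoder is a common alternative to get_dummies (a sketch on made-up data; sparse_output assumes scikit-learn 1.2+, older versions use sparse=):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'branch_id': [10, 20, 30]})
new = pd.DataFrame({'branch_id': [20, 99]})  # 99 was never seen in training

enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(train[['branch_id']])

# the unseen branch_id 99 encodes as an all-zero row
print(enc.transform(new[['branch_id']]))
# [[0. 1. 0.]
#  [0. 0. 0.]]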

pandas.DataFrame and pandas.Series objects act differently for pandas.get_dummies()

I have a dataframe named train with a column 'quality'.
>>>train['quality'].unique()
array([5, 6, 7, 4, 8, 3], dtype=int64)
Now get_dummies with train[['quality']] gives
>>>pd.get_dummies(train[['quality']]).head()
quality
0 5
1 5
2 5
3 6
4 5
but with train['quality']
>>>pd.get_dummies(train['quality']).head()
3 4 5 6 7 8
0 0 0 1 0 0 0
1 0 0 1 0 0 0
2 0 0 1 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
The data types of train[['quality']] and train['quality'] are:
>>>print(type(train['quality']))
<class 'pandas.core.series.Series'>
>>>print(type(train[['quality']]))
<class 'pandas.core.frame.DataFrame'>
the get_dummies() doc states: data : array-like, Series, or DataFrame
So if I can give in both a Series or DataFrame then why are the outputs different?
The pd.get_dummies documentation makes this pretty clear:
columns : list-like, default None
    Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
So, the solution is either to specify a columns parameter, thus overriding the requirement for the column to be categorical/object dtype to begin with (the df in the snippets below evidently holds only rows with quality values 5 and 6, hence just the two dummy columns),
pd.get_dummies(df, columns=['quality'])
quality_5 quality_6
0 1 0
1 1 0
2 1 0
3 0 1
4 1 0
Or, convert the column to categorical.
pd.get_dummies(df[['quality']].astype('category'))
quality_5 quality_6
0 1 0
1 1 0
2 1 0
3 0 1
4 1 0
Data needs to be of object or category dtype for get_dummies to produce dummy columns. If a Series is passed in, the conversion occurs automatically. As outlined in the documentation and by coldspeed, if a DataFrame is passed in, all columns of object or category dtype (series of these datatypes) are converted and will result in dummy columns. For example:
pandas.get_dummies(pandas.DataFrame(list("abcdabcd")))
0_a 0_b 0_c 0_d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
6 0 0 1 0
7 0 0 0 1
This works because the list of strings becomes a column of strings which are objects.
Perhaps a bit unintuitively, your integer-type column is not of type "object" and thus is not converted into categorical, so dummy columns are not returned and the original DataFrame is returned. Numerical types in pandas are distinct from objects. You can work around this by simply passing df[["quality"]].astype("category") since this will force the conversion of your integer column to categorical which will then return dummy columns.
EDIT: To expand a bit, keep in mind that dummy variables are a construct for regression (and extensions of regression). If a DataFrame contains both numeric and object dtypes, more often than not the numeric columns are intended to be used directly as model inputs, whereas object columns have no value in regression unless converted to dummy variables. Thus, if you pass get_dummies a DataFrame with three numeric columns and one object column, only the object column is converted to dummy variables. This is the default behaviour when the columns parameter is left unspecified. The columns parameter exists for when the default does not suit your needs, e.g. you do not want every object/categorical column converted, or you want a numeric column converted.
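To see that default in action, a quick toy example (frame of my own): the numeric column passes through untouched while the object column is dummied.
import pandas as pd

df = pd.DataFrame({'num': [1, 2, 3], 'cat': ['x', 'y', 'x']})

# with columns= unspecified, only the object-dtype column is encoded
print(pd.get_dummies(df, dtype=int))
#    num  cat_x  cat_y
# 0    1      1      0
# 1    2      0      1
# 2    3      1      0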

Converting pandas column of comma-separated strings into dummy variables

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, the final column count equals the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
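On the question's "additional twist" about null values: str.get_dummies simply yields an all-zero row for NaN, so no special handling should be needed (a quick check on made-up data):
import pandas as pd
import numpy as np

s = pd.Series(['a', 'a,b,c', np.nan, 'd'])

# the NaN row comes back as all zeros
print(s.str.get_dummies(sep=','))
#    a  b  c  d
# 0  1  0  0  0
# 1  1  1  1  0
# 2  0  0  0  0
# 3  0  0  0  1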
Edit: Updating the answer to address some questions.
Qn 1: Why is it that the series method get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it
Series.str.get_dummies is a Series-level method (as the name suggests!). We are one-hot encoding the values of a single Series (or DataFrame column), so there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter serves as an identifier of the original column.
If you want to apply prefix to str.get_dummies, you can always use DataFrame.add_prefix
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use pd.concat to merge the one-hot encoded columns with the rest of the columns in the dataframe.
import pandas as pd

df = pd.DataFrame({'other': ['x', 'y', 'x', 'x', 'q'], 'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies function does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')

Pandas - Get dummies for only certain values

I have a Pandas series of 10000 rows, each populated with a single letter from A to Z.
However, I want to create dummy columns for only A, B, and C, using Pandas get_dummies.
How do I go around doing that?
I don't want to get dummies for all the row values in the column and then select the specific columns, as the column contains other redundant data which eventually causes a Memory Error.
try this:
import pandas as pd

# create mock dataframe
df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})
# use replace with a regex to set letters outside a-c to None
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
output:
alpha_a alpha_b alpha_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 0
6 0 0 0
7 0 0 0
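If the regex feels indirect, an isin-based variant of the same idea may read more clearly (a sketch on the same mock data; values outside the keep list become NaN and get no dummy column):
import pandas as pd

df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})

keep = ['a', 'b', 'c']
# values not in keep become NaN, so get_dummies never builds columns for them
print(pd.get_dummies(df['alpha'].where(df['alpha'].isin(keep)), prefix='alpha', dtype=int))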
