I am trying to apply one-hot encoding to a pandas DataFrame, but I can't pass a categories argument. My idea is to control the correspondence between categories and the encoding, for example:
CATEGORIES = ['A','B','C']
Y = pd.get_dummies(data['Article_Topic_1']).values
For example, Y will be [0,0,1] for category 'A', but I would like to prescribe the encoding for 'A' to be [1,0,0].
If this is not possible, is there at least a way to find out which encoding corresponds to which original string?
I don't think you can do this with get_dummies() directly. But how about just reorganizing the result? If I got your question correctly, you want to reorder the columns of the one-hot-encoded data to match a prescribed ordering.
categories = ["A", "B", "C"]
Y = pd.get_dummies(data["Article_Topic_1"])
Y = Y[categories].values
Here is a function that checks some of the assumptions this solution relies on:
import pandas as pd

def get_dummies_for_coding(series, ordering):
    # Ordering must contain only values present in the series.
    assert len(set(ordering) - set(series.unique())) == 0
    # It's easier to work with a Series here, because pd.get_dummies()
    # adds string prefixes to column names for DataFrames, which makes
    # things a bit more complicated.
    assert isinstance(series, pd.Series)
    dummies = pd.get_dummies(series)
    dummies = dummies[ordering]
    return dummies.values
# Example
df = pd.DataFrame([["a", "foo"],
["a", "bar"],
["b", "bar"],
["a", "baz"],
["b", "bar"]],
columns=["colA", "colB"])
orderingA = ["b", "a"]
orderingB = ["baz", "bar", "foo"]
ret = get_dummies_for_coding(df["colA"], orderingA)
print(ret)
ret = get_dummies_for_coding(df["colB"], orderingB)
print(ret)
Maybe you can try using scikit-learn for the one-hot encoding? Its OneHotEncoder can take an explicit list of categories.
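A minimal sketch of that route, assuming scikit-learn is available (the DataFrame below is made up to mirror the question; sparse_output is the keyword in scikit-learn >= 1.2, older releases call it sparse):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

CATEGORIES = ['A', 'B', 'C']
data = pd.DataFrame({'Article_Topic_1': ['C', 'A', 'B', 'A']})  # hypothetical data

# categories fixes both the set and the order of the output columns,
# so 'A' always encodes as [1, 0, 0]
encoder = OneHotEncoder(categories=[CATEGORIES], sparse_output=False)
Y = encoder.fit_transform(data[['Article_Topic_1']])
print(Y)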
I'm trying to apply simple functions to groups in pandas. I have this dataframe which I can group by type:
df = pandas.DataFrame({"id": ["a", "b", "c", "d"], "v": [1,2,3,4], "type": ["X", "Y", "Y", "Y"]}).set_index("id")
df.groupby("type").mean() # gets the mean per type
I want to apply a function like np.log2 only to the groups before taking the mean of each group. This does not work since apply is element wise and type (as well as potentially other columns in df in a real scenario) is not numeric:
# fails
df.apply(np.log2).groupby("type").mean()
Is there a way to apply np.log2 only to the groups prior to taking the mean? I thought transform would be the answer, but the problem is that it returns a dataframe that does not have the original type column:
df.groupby("type").transform(np.log2)
          v
id
a  0.000000
b  1.000000
c  1.584963
d  2.000000
Variants like grouping and then applying do not work: df.groupby("type").apply(np.log2). What is the correct way to do this?
The problem is that np.log2 cannot deal with the non-numeric type column. Instead, you need to pass just your numeric column. You can do this as suggested in the comments, or define a lambda:
df.groupby('type').apply(lambda x: np.mean(np.log2(x['v'])))
As per comments, I would define a function:
df['w'] = [5, 6, 7, 8]

def foo(x):
    return x._get_numeric_data().apply(axis=0, func=np.log2).mean()

df.groupby('type').apply(foo)
# v w
# type
# X 0.000000 2.321928
# Y 1.528321 2.797439
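For completeness, here is a sketch of an equivalent approach without a helper function; it assumes only the numeric columns should be log-transformed:

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": ["a", "b", "c", "d"],
                   "v": [1, 2, 3, 4],
                   "type": ["X", "Y", "Y", "Y"]}).set_index("id")
df["w"] = [5, 6, 7, 8]

# apply log2 to the numeric columns only, reattach the grouping column,
# then group and average
numeric = df.select_dtypes(include="number")
print(np.log2(numeric).join(df["type"]).groupby("type").mean())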
For example, I'd like to assert that two PySpark DataFrames have the same data; however, just using == checks that they are the same object. Ideally I'd also like to be able to specify whether order matters or not.
I've tried writing a function that raises an AssertionError but that adds a lot of noise to the pytest output as it shows the traceback from that function.
The other thought I had was to mock the __eq__ method of the DataFrames but I'm not confident that's the right way to go.
Edit:
I considered just using a function that returns true or false instead of an operator, however that doesn't seem to work with pytest_assertrepr_compare. I'm not familiar enough with how that hook works so it's possible there is a way to use it with a function instead of an operator.
My current solution is to use a patch to override the DataFrame's __eq__ method. Here's an example with pandas, as it's faster to test with; the idea should apply to any object.
import pandas as pd
# use this import for Python 3
# from unittest.mock import patch
from mock import patch

def custom_df_compare(self, other):
    # Put logic for comparing df's here
    # Returning True for demonstration
    return True

@patch("pandas.DataFrame.__eq__", custom_df_compare)
def test_df_equal():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )
    assert df1 == df2
Haven't tried it yet but am planning on adding it as a fixture and using autouse to use it for all tests automatically.
In order to elegantly handle the "order matters" indicator, I'm playing with an approach similar to pytest.approx, which returns an object of a new class with its own __eq__, for example:
class SortedDF(object):
    "Indicates that the order of data matters when comparing to another df"

    def __init__(self, df):
        self.df = df

    def __eq__(self, other):
        # Put logic for comparing df's including order of data here
        # Returning True for demonstration purposes
        return True

def test_sorted_df():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )
    # Passes because SortedDF.__eq__ is used
    assert SortedDF(df1) == df2
    # Fails because df2's __eq__ method is used
    assert df2 == SortedDF(df2)
The minor issue I haven't been able to resolve is the failure of the second assert, assert df2 == SortedDF(df2). This order works fine with pytest.approx but doesn't here. I've tried reading up on the == operator but haven't been able to figure out how to fix the second case.
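A likely explanation for the asymmetry is Python's dispatch rule for ==: the left operand's __eq__ runs first, and the right operand's __eq__ is only tried if the left side returns NotImplemented. Plain numbers do return NotImplemented for unknown types, which is why pytest.approx works on either side, whereas a DataFrame instead attempts an element-wise comparison. A minimal sketch of the dispatch rule (Wrapper is a hypothetical stand-in for SortedDF):

class Wrapper:
    def __eq__(self, other):
        # stand-in for "compare, including order of data"
        return True

print(Wrapper() == 3)  # True: Wrapper.__eq__ runs first
print(3 == Wrapper())  # also True: int.__eq__ returns NotImplemented,
                       # so Python falls back to Wrapper.__eq__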
To do a raw comparison between the values of the DataFrames (must be exact order), you can do something like this:
import pandas as pd
from pyspark.sql import Row
df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
pd.testing.assert_frame_equal(df1.toPandas(), df2.toPandas())
If you want the comparison to ignore row order, you can first sort both pandas DataFrames by particular key columns using the following function:
def assert_frame_equal_with_sort(results, expected, keycolumns):
    # sort the columns alphabetically so both frames have the same column order
    results = results.reindex(sorted(results.columns), axis=1)
    expected = expected.reindex(sorted(expected.columns), axis=1)
    # sort the rows by the key columns so row order does not matter
    results_sorted = results.sort_values(by=keycolumns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=keycolumns).reset_index(drop=True)
    pd.testing.assert_frame_equal(results_sorted, expected_sorted)
df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=3, c=3), Row(a=1, b=2, c=3)])
assert_frame_equal_with_sort(df1.toPandas(), df2.toPandas(), ['b'])
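If you would rather stay on the Spark side, here is a sketch of an order-insensitive check using DataFrame.exceptAll (available since Spark 2.4); it treats the frames as multisets of rows:

def assert_spark_frames_equal_unordered(df1, df2):
    # empty set differences in both directions means both frames contain
    # exactly the same rows (including duplicates), regardless of order
    assert df1.exceptAll(df2).count() == 0
    assert df2.exceptAll(df1).count() == 0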
Just use the pandas.DataFrame.equals method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html
For example
assert df1.equals(df2)
assert can be used with anything that returns a boolean, so yes, you can write any custom comparison function to compare two objects, as long as that function returns a boolean. In this case, however, there is no need for a custom function because pandas already provides one.
You can use one of pytest's hooks, particularly pytest_assertrepr_compare. In there you can define what you want to compare and how; the docs are pretty good and come with examples. Best of luck. :)
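A minimal sketch of what that hook could look like in a conftest.py, assuming you only want a friendlier failure message for DataFrame comparisons:

# conftest.py
import pandas as pd

def pytest_assertrepr_compare(config, op, left, right):
    # only customize the report for DataFrame == DataFrame comparisons
    if op == "==" and isinstance(left, pd.DataFrame) and isinstance(right, pd.DataFrame):
        return [
            "DataFrame comparison failed:",
            "left shape: %s, right shape: %s" % (left.shape, right.shape),
        ]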
I have a pandas DataFrame with some categorical columns. Some of these contain non-integer values.
I currently want to apply several machine learning models to this data. For some models it is necessary to do normalization to get better results, for example converting categorical variables into dummy/indicator variables. Pandas has a function called get_dummies for that purpose. However, this function's output depends on the data: if I call get_dummies on training data and then call it again on test data, the columns obtained in the two cases can differ, because a categorical column in the test data may contain only a subset (or a different set) of the possible values seen in the training data.
Therefore, I am looking for other methods to do one-hot coding.
What are possible ways to do one hot encoding in python (pandas/sklearn)?
Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.
For encoding training data you can use fit_transform which will discover the category labels and create appropriate dummy variables.
import sklearn.preprocessing

label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data you can use the same set of categories using transform.
test_mat = label_binarizer.transform(test_df.Label)
In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import numpy as np
import pandas as pd

train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)

# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0

# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross validation so that your splits contain the relevant labels.
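A more compact way to express the same column alignment is a sketch with reindex, reusing train_df and test_df from the snippet above; it adds the train-only dummy columns filled with zeros and drops test-only columns in one call:

import pandas as pd

train = pd.get_dummies(train_df)
# align test to the training columns in one step
test = pd.get_dummies(test_df).reindex(columns=train.columns, fill_value=0)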
Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set contains only the three categories "a", "b", "c" as values. If get_dummies is used at this stage, only three features will be generated (A_a, A_b, A_c). Ideally, there should be another feature A_d as well, with all zeros. That can be achieved in the following way:
import pandas as pd
data = pd.DataFrame({"A" : ["a", "b", "c"]})
data["A"] = data["A"].astype("category", categories=["a", "b", "c", "d"])
mod_data = pd.get_dummies(data[["A"]])
print(mod_data)
The output being
   A_a  A_b  A_c  A_d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
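Note that the categories keyword of astype shown above was removed in later pandas releases; a sketch of the equivalent using CategoricalDtype (the dummy columns may come out as integers or booleans rather than the floats shown above):

import pandas as pd
from pandas.api.types import CategoricalDtype

data = pd.DataFrame({"A": ["a", "b", "c"]})
data["A"] = data["A"].astype(CategoricalDtype(categories=["a", "b", "c", "d"]))
print(pd.get_dummies(data[["A"]]))  # includes an all-zero A_d column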
For the text columns, you can try this
from sklearn.feature_extraction.text import CountVectorizer
data = ['he is good','he is bad','he is strong']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data)
To print the output:

for i in range(len(data)):
    print(vectors[i, :].toarray())
Output:
[[0 1 1 1 0]]
[[1 0 1 1 0]]
[[0 0 1 1 1]]
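To see which column corresponds to which word, you can continue the snippet above with the vectorizer's feature names (get_feature_names_out exists in scikit-learn 1.0+; older versions use get_feature_names):

print(vectorizer.get_feature_names_out())
# e.g. ['bad' 'good' 'he' 'is' 'strong'] -- the vocabulary in sorted order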
The pandas documentation of .loc clearly states:
.loc is strictly label based, will raise KeyError when the items are
not found, allowed inputs are:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label
of the index. This use is not an integer position along the index)
Contrary to that, this surprisingly works for pd.Series, not for pd.DataFrame:
import numpy as np
a = np.array([1,3,1,2])
import pandas as pd
s = pd.Series(a, index=["a", "b", "c", "d"])
s.loc["a"] # yields 1
s.loc[0] # should be strictly label-based, but it works and also yields 1
Do you know why?
Suppose I have a Pandas DataFrame like the below and I'm encoding categorical_1 for training in scikit-learn:
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
'categorical_1':['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'])
The values for 'categorical_1' are A, B, or C so I end up with 3 columns in dummy_values. However, categorical_1 can in reality take on values A, B, C, D, or E so there is no column represented for values D or E.
In R I would specify levels when factoring that column - is there a corresponding way to do this with Pandas or would I need to handle that manually?
In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.
First, if you want pandas to consider more values, simply add them to the list sent to the get_dummies method:
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
'categorical_1':['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'] + ['D','E'])
since in Python, + on lists works as concatenation, so
['A','B','C','B','B'] + ['D','E']
results in
['A', 'B', 'C', 'B', 'B', 'D', 'E']
In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.
From the machine learning perspective, it is quite redundant. This column is a categorical one, so the value 'D' means nothing to a model that has never seen it before. If you are encoding the features one-hot (which I assume, after seeing that you create a column for each value), it is enough to simply represent these 'D', 'E' values with
A B C
0 0 0
(I assume that you represent the value 'B' with 0 1 0, 'C' with 0 0 1, etc.)
because if there were no such values in the training set, then during testing no model will distinguish between being given the value 'D' or 'Elephant'.
The only reason for such an action would be the assumption that in the future you wish to add data with 'D' values and simply do not want to modify the code; then it is reasonable to do it now, even though it could make training a bit more complex (as you add a dimension that, for now, carries no knowledge), but it seems a small problem.
If you are not going to encode it one-hot, but rather want to use these values as a single categorical feature, then you would not need to create these "dummies" at all; instead, use a model which can work with such values, such as Naive Bayes, which can simply be trained with Laplace smoothing to work around values it has not seen.
I encountered the same problem as yours: how to unify the dummy categories between training data and testing data when using get_dummies() in pandas. I found a solution while exploring the House Prices competition on Kaggle: process the training data and testing data at the same time. Suppose you have two dataframes, df_train and df_test (neither containing the target column).
all_data = pd.concat([df_train,df_test], axis=0)
all_data = pd.get_dummies(all_data)
X_train = all_data[:df_train.shape[0]] # select the processed training data
X_test = all_data[-df_test.shape[0]:] # select the processed testing data
Hope it helps.
Isn't this a better answer?
import pandas as pd

data = pd.DataFrame({
    "values": [1, 2, 3, 4, 5, 6, 7],
    "categories": ["A", "A", "B", "B", "C", "C", "D"]
})

possibilities = ["A", "B", "C", "D", "E", "F"]
exists = data["categories"].tolist()
difference = pd.Series([item for item in possibilities if item not in exists])

# Series.append was removed in pandas 2.0, so use pd.concat to add the
# missing categories as extra rows
target = pd.concat([data["categories"], difference])
target = target.reset_index(drop=True)

dummies = pd.get_dummies(target)
# drop the rows that were only added to force the extra dummy columns
dummies = dummies.drop(dummies.index[list(range(len(dummies) - len(difference), len(dummies)))])
To handle the mismatch between the sets of categorical values in the train and test sets, I used:
import numpy as np
import pandas as pd

# a column of zeros with one entry per test row, used for categories that
# appear in the training data but not in the test data
length = test_categorical_data.shape[0]
empty_col = np.zeros(length)

test_categorical_data_processed = pd.DataFrame()
for col in train_categorical_data.columns:
    test_categorical_data_processed[col] = test_categorical_data.get(col, empty_col)