Unpack error while using sklearn ColumnTransformer - Python

I was trying to one-hot encode a dataframe for some testing.
I tried using the regular OneHotEncoder from sklearn, but it seemed to have issues with NaN values (NaN values that were not in the columns I wanted to encode).
From what I found, a solution is to use a ColumnTransformer, which can apply the encoding only to certain columns, something like the following:
ct = ColumnTransformer([(OneHotEncoder(categories = categories_list),['col1','col2','col3'])])
where categories_list is a list of all the categories present.
The problem is that when I try to apply this transformer to my dataframe, I always get a "not enough values to unpack" error.
I'm transforming like this:
ct.fit_transform(df_train_xgboost)
Any idea what I should do?
EDIT:
Some example Data
id | col1  | col2  | col3 | price | has_something
1  | blue  | car   | new  | 23781 | NaN
2  | green | truck | used | 24512 | 1
3  | red   | van   | new  | 44521 | 0
Some more code
categories_list = ['blue','green','red','car','truck','van','new','used']
df_train_xgboost = df_train
df_train_xgboost = df_train_xgboost.drop(columns_I_dont_want, axis=1)
df_train_xgboost = df_train_xgboost.fillna(value = {'col1': 0, 'col2': 0, 'col3': 0})
ct = ColumnTransformer([(OneHotEncoder(categories = categories_list),['col1','col2','col3'])])
print(df_train_xgboost.shape)
ct.fit_transform(df_train_xgboost)

First of all, the use of ColumnTransformer is not necessary here.
To make your code work, each transformer needs one more input argument, i.e. the "name" of the transformer.
Full example:
df
    col1   col2  col3
0   blue    car   new
1  green  truck  used
2    red    van   new
ct = ColumnTransformer([("onehot",OneHotEncoder(),[0,1,2])])
ct.fit_transform(df.values)
array([[1., 0., 0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1., 1., 0.]])
Now notice that you get the same output by only using OneHotEncoder:
o = OneHotEncoder()
o.fit_transform(df).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1., 1., 0.]])
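If you do want to restrict the encoding to col1/col2/col3 while keeping the remaining columns (id, price, has_something) untouched, a minimal sketch could look like the following. Note that categories must be a list of lists, one per encoded column; the category values are taken from the example data, and remainder='passthrough' is my addition, not something from the original code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one list of categories per encoded column (values taken from the example data)
categories_list = [['blue', 'green', 'red'],   # col1
                   ['car', 'truck', 'van'],    # col2
                   ['new', 'used']]            # col3

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(categories=categories_list), ['col1', 'col2', 'col3'])],
    remainder='passthrough')   # pass id, price, has_something through unchanged

encoded = ct.fit_transform(df_train_xgboost)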

Related

Creating many state vectors and saving them in a file

I want to create m matrices, each of which is an n x 1 numpy array. Each matrix should have only two nonzero entries, in two consecutive rows, and 0 everywhere else: the first matrix should have m[0,:] = m[1,:] = 1 with all other rows 0, and similarly the last matrix should have m[n-2,:] = m[n-1,:] = 1 with all other rows 0. So from one matrix to the next, the nonzero elements shift down by two rows. Finally, I would like to store them in a dictionary or in a file.
What would be a neat way to do this?
Is this what you're looking for?
In [2]: num_rows = 10 # should be divisible by 2
In [3]: np.repeat(np.eye(num_rows // 2), 2, axis=0)
Out[3]:
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]])
In terms of storage in a file, you can use np.save and np.load.
Note that the default data type for np.eye will be float64. If you expect your values to be small when you begin integrating or whatever you're planning on doing with your state vectors, I'd recommend setting the data type appropriately (like np.uint8 for positive integers < 256 for example).
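For the file part, a minimal sketch with np.save and np.load could look like this (the file name and the dictionary layout are my own choices):
import numpy as np

num_rows = 10  # should be divisible by 2
vectors = np.repeat(np.eye(num_rows // 2, dtype=np.uint8), 2, axis=0)

np.save("state_vectors.npy", vectors)   # write the whole stack to a binary .npy file
loaded = np.load("state_vectors.npy")   # read it back as an ndarray

# or keep each n x 1 column vector in a dictionary, as asked in the question
vector_dict = {i: loaded[:, i].reshape(-1, 1) for i in range(loaded.shape[1])}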

Is this data representation correct for One-Hot Encoding?

I am trying to encode the mushroom dataset here (https://www.kaggle.com/uciml/mushroom-classification) using One-Hot Encoding. Here is the code that I used (in Python) for the encoding:
from sklearn.preprocessing import OneHotEncoder
second_df = OneHotEncoder(handle_unknown='ignore').fit_transform(new_df)
print(second_df)
The result of my code is shown in this image, and it is quite confusing to me: Result for the encoding.
Is this result the right representation for my One-Hot Encoding? If not, what should I do to fix the code?
The output looks a bit unusual because OneHotEncoder returns a sparse matrix by default:
OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
Sparse output is interpreted as (row, col) non_zero_value, where all the unlisted coordinates are zero:
(0, 1) 1.0 # value 1.0 at row 0, col 1
(0, 7) 1.0 # value 1.0 at row 0, col 7
...
To get a dense array instead,
either set sparse=False:
OneHotEncoder(sparse=False).fit_transform(new_df)
or chain toarray:
OneHotEncoder().fit_transform(new_df).toarray()
Output:
array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.]])
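One caveat: in newer scikit-learn versions (1.2 and later, to the best of my knowledge) the sparse parameter was renamed to sparse_output, so if the sparse=False call above complains, the dense equivalent would be:
OneHotEncoder(sparse_output=False).fit_transform(new_df)   # newer sklearn API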

minimize runtime for numpy array manipulation

I have a 2-dimensional array with np.shape(input)=(a,b) that looks like
input=array[array_1[0,0,0,1,0,1,2,0,3,3,2,...,entry_b],...array_a[1,0,0,1,2,2,0,3,1,3,3,...,entry_b]]
Now I want to create an array with np.shape(output)=(a,b,b) in which every pair of entries that had the same value in the input gets the value 1, and 0 otherwise.
For example:
input=[[1,0,0,0,1,2]]
output=[array([[1., 0., 0., 0., 1., 0.],
               [0., 1., 1., 1., 0., 0.],
               [0., 1., 1., 1., 0., 0.],
               [0., 1., 1., 1., 0., 0.],
               [1., 0., 0., 0., 1., 0.],
               [0., 0., 0., 0., 0., 1.]])]
My code so far looks like this:
def get_matrix(svdata, padding_size):
    List = []
    for k in svdata:
        matrix = np.zeros((padding_size, padding_size))
        for l in range(padding_size):
            for m in range(padding_size):
                if k[l] == k[m]:
                    matrix[l][m] = 1
        List.append(matrix)
    return List
But it takes 2:30 min for an input array of shape (2000, 256). How can I become more efficient by using built-in numpy solutions?
res = input[:,:,None]==input[:,None,:]
This should give a boolean (a,b,b) array.
res = res.astype(int)
to get a 0/1 array
You're trying to create the array y where y[i,j,k] is 1 if input[i,j] == input[i, k]. At least that's what I think you're trying to do.
So y = input[:,:,None] == input[:,None,:] will give you a boolean array. You can then convert that to np.dtype('float64') using astype(...) if you want.
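A quick self-contained check of the broadcasting approach against the example from the question (variable names are mine):
import numpy as np

arr = np.array([[1, 0, 0, 0, 1, 2]])   # shape (a, b) = (1, 6)

# compare every entry of each row with every other entry of the same row
res = (arr[:, :, None] == arr[:, None, :]).astype(float)   # shape (1, 6, 6)

print(res[0])
# [[1. 0. 0. 0. 1. 0.]
#  [0. 1. 1. 1. 0. 0.]
#  [0. 1. 1. 1. 0. 0.]
#  [0. 1. 1. 1. 0. 0.]
#  [1. 0. 0. 0. 1. 0.]
#  [0. 0. 0. 0. 0. 1.]]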

How to find patterns between numerous causes and the result in python?

For each instance I have a set of problems and a result, like this:
df = pd.DataFrame({
    "problems": [[1,2,3], [1,2,4], [1,4,5], [3,4,5], [1,5,6]],
    "results": ["A", "A", "C", "C", "A"]
})
I want to find patterns in the relationship between the problems and the result.
My first thought was Association Rule Mining, but this is more for finding patterns within the problems (for example). I guess machine learning could help somehow, but I'm not interested in solely predicting the result; I'm interested in the patterns that lead to that prediction.
I would be interested in patterns like
Problem 1 causes result A
The combination of problems 4 and 5 cause result C
Any thoughts on that?
As I'd implement this in Python, hints to corresponding packages are welcome, too.
Thanks a lot!
I was curious and did some experimenting in tensorflow 2.0 with keras, based on the comment of Daniel Möller in this thread:
Update: make the order not matter anymore:
To make the order not matter anymore, we need to remove the order information from our dataset. To do this, we first convert it to a one-hot representation, then we take the max() over the position axis to squash it back to two dimensions:
x_no_order = tf.keras.utils.to_categorical(x)
This gives us a one-hot vector looking like this:
array([[[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0.]],

       [[0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0.]],

       [[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 1.]]], dtype=float32)
Taking the np.max() of that tensor gives us a vector that only knows which numbers occur, without any information about their position, looking like this:
x_no_order.max(axis=1)
array([[0., 1., 1., 1., 0., 0., 0.],
       [0., 1., 1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1.]], dtype=float32)
First, create the dataframe and the training data.
That's a multiclass classification task, so I use the tokenizer (there are for sure better approaches, since it's rather meant for text):
import tensorflow as tf
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "problems": [[1,2,3], [1,2,4], [1,4,5], [3,4,5], [1,5,6]],
    "results": ["A", "A", "C", "C", "A"]
})
x = df['problems']
y = df['results']
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(y)
y_train = tokenizer.texts_to_sequences(y)
x = np.array([np.array(i,dtype=np.int32) for i in x])
y_train = np.array(y_train, dtype=np.int32)
Then create the model
input_layer = tf.keras.layers.Input(shape=(3,))
dense_layer = tf.keras.layers.Dense(6)(input_layer)
dense_layer2 = tf.keras.layers.Dense(20)(dense_layer)
out_layer = tf.keras.layers.Dense(3, activation="softmax")(dense_layer2)
model = tf.keras.Model(inputs=[input_layer], outputs=[out_layer])
model.compile(optimizer="Nadam", loss="sparse_categorical_crossentropy",metrics=["accuracy"])
Train the model by fitting it
hist = model.fit(x,y_train, epochs=100)
Then, based on Daniel's comment, you take the sequence you want to test and mask out certain values to test their influence:
arr =np.reshape(np.array([1,2,3]), (1,3))
print(model.predict(arr))
arr =np.reshape(np.array([0,2,3]), (1,3))
print(model.predict(arr))
arr =np.reshape(np.array([1,0,3]), (1,3))
print(model.predict(arr))
arr =np.reshape(np.array([1,2,0]), (1,3))
print(model.predict(arr))
This will print the following result. Keep in mind that since y starts at one, the first value is a placeholder, so the second value stands for "A":
[[0.00441748 0.7981055 0.19747704]]
[[0.00103579 0.9863035 0.01266076]]
[[0.0031549 0.9953074 0.00153765]]
[[0.01631758 0.00633342 0.977349 ]]
There we can see that in the first case "A" is correctly predicted with 0.7981...
When we change the 3 in [1,2,3] to 0, giving [1,2,0], we see that the model suddenly predicts "C". So the influence of the 3 in position 3 is the biggest. Putting that in a function, you can use all the training data you have and build statistical metrics to analyze it further.
This is just a very simple approach, but keep in mind that this is a big research field called sensitivity analysis. You might want to have a deeper look at that topic if you are interested.
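A minimal sketch of what "putting that in a function" could look like, reusing the model and data defined above (the function name and the choice of 0 as the mask value are my assumptions):
def mask_influence(model, sample):
    # Predict on the original sample, then re-predict with each position
    # zeroed out and report how much the prediction changes.
    sample = np.reshape(np.array(sample, dtype=np.int32), (1, -1))
    base = model.predict(sample)[0]
    influences = []
    for i in range(sample.shape[1]):
        masked = sample.copy()
        masked[0, i] = 0                      # mask this position with 0
        influences.append(np.abs(model.predict(masked)[0] - base).sum())
    return base, influences

base_probs, influences = mask_influence(model, [1, 2, 3])
print(base_probs, influences)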

Vectorizing / Contrasting a Dataframe with Categorical Variables

Say I have a dataframe like the following:
      A      B
0   bar    one
1   bar  three
2  flux    six
3   bar  three
4   foo   five
5  flux    one
6   foo    two
I would like to apply dummy-coding contrasting on it so that I get:
   A  B
0  0  0
1  0  2
2  1  1
3  0  2
4  2  3
5  1  0
6  2  4
(i.e. mapping every unique value to a different integer, per column).
I have tried using scikit-learn's DictVectorizer, but I get:
> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer = DV( sparse = False )
> dict_to_vectorize = df.T.to_dict().values()
> df_vec = vectorizer.fit_transform(dict_to_vectorize )
> df_vec
array([[ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
This is because scikit-learn's DictVectorizer is designed to output one-of-K encoding. What I want is a simple-encoding instead (one column per variable).
How can I do this with scikit-learn and/or pandas? Aside from that, are there any other Python packages that help with general contrasting methods?
You could use pd.factorize:
In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]:
   A  B
0  0  0
1  0  1
2  1  2
3  0  1
4  2  3
5  1  0
6  2  4
The patsy package provides all the contrasts you'd need (and the ability to make more). [1] AFAIK, statsmodels is the only stats package that currently uses patsy's formula framework. [2, 3].
[1] https://patsy.readthedocs.org/en/latest/API-reference.html#handling-categorical-data
[2] http://statsmodels.sourceforge.net/devel/contrasts.html
[3] http://statsmodels.sourceforge.net/devel/example_formulas.html
Dummy encoding is what you get when you call DictVectorizer. The kind of integer encoding you want is actually different:
sklearn.preprocessing.LabelBinarizer or DictVectorizer gives dummy encoding (like pandas.get_dummies)
sklearn.preprocessing.LabelEncoder gives integer categorical encoding (like pandas.factorize)
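A short sketch of the LabelEncoder route mentioned above, applied per column (the loop is mine; note that LabelEncoder numbers categories alphabetically, so the integers differ from the pd.factorize output above, which numbers them in order of appearance):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
                   "B": ["one", "three", "six", "three", "five", "one", "two"]})

encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
print(encoded)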
