Say I have a dataframe like the following:
A B
0 bar one
1 bar three
2 flux six
3 bar three
4 foo five
5 flux one
6 foo two
I would like to apply dummy-coding contrasting on it so that I get:
A B
0 0 0
1 0 2
2 1 1
3 0 2
4 2 3
5 1 0
6 2 4
(i.e. mapping every unique value to a different integer, per column).
I have tried using scikit-learn's DictVectorizer, but I get:
> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer = DV( sparse = False )
> dict_to_vectorize = df.T.to_dict().values()
> df_vec = vectorizer.fit_transform(dict_to_vectorize )
> df_vec
array([[ 1., 0., 0., 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 1.]])
This is because scikit-learn's DictVectorizer is designed to output one-of-K encoding. What I want is a simple-encoding instead (one column per variable).
How can I do this with scikit-learn and/or pandas? Aside from that, are there any other Python packages that help with general contrasting methods?
You could use pd.factorize:
In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]:
A B
0 0 0
1 0 1
2 1 2
3 0 1
4 2 3
5 1 0
6 2 4
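As a small sketch of how this behaves on the example frame (rebuilt here by hand, so the `df` below is an assumption about how the data was created): `pd.factorize` numbers values in order of first appearance, `sort=True` numbers them in sorted order instead, and the returned uniques let you map the codes back.

```python
import pandas as pd

# Example frame from the question, reconstructed by hand.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two"],
})

# Codes in order of first appearance (same as the output above).
by_appearance = df.apply(lambda col: pd.factorize(col)[0])

# Codes assigned after sorting the unique values alphabetically.
by_sorted = df.apply(lambda col: pd.factorize(col, sort=True)[0])

# factorize also returns the unique values, so codes can be mapped back.
codes, uniques = pd.factorize(df["B"])
restored = uniques.take(codes)

print(by_appearance)
print(by_sorted)
```

Keeping `uniques` around is useful if you need to decode predictions or apply the same mapping later.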
The patsy package provides all the contrasts you'd need (and the ability to make more) [1]. AFAIK, statsmodels is the only stats package that currently uses patsy's formula framework [2, 3].
[1] https://patsy.readthedocs.org/en/latest/API-reference.html#handling-categorical-data
[2] http://statsmodels.sourceforge.net/devel/contrasts.html
[3] http://statsmodels.sourceforge.net/devel/example_formulas.html
Dummy encoding is what you get when you call DictVectorizer. The integer encoding you are asking for is a different thing:
sklearn.preprocessing.LabelBinarizer or DictVectorizer gives dummy encoding (as pandas.get_dummies)
sklearn.preprocessing.LabelEncoder gives integer categorical encoding (as pandas.factorize)
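To make the distinction concrete, here is a minimal sketch that applies both kinds of encoding to the example frame (the `df` is rebuilt by hand and the variable names are only illustrative). Note that LabelEncoder numbers the classes in sorted order, so the codes can differ from those of an unsorted pd.factorize.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two"],
})

# Integer encoding: one output column per variable.
# A fresh LabelEncoder is fitted for each column.
int_encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))

# Dummy (one-of-K) encoding: one output column per (variable, value) pair.
dummies = pd.get_dummies(df)

print(int_encoded)
print(dummies.columns.tolist())
```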
I was trying to one hot encode a dataframe for some testing.
I tried using the regular OneHotEncoder from sklearn, but it seemed to have issues with NaN values (NaN values that were not even present in the columns I wanted to encode).
From what I found, a solution is to use a ColumnTransformer, which can apply the encoding to only certain columns, something like the following:
ct = ColumnTransformer([(OneHotEncoder(categories = categories_list),['col1','col2','col3'])])
Here categories_list is a list of all the categories present.
The problem is that when I try to apply this transformer to my dataframe, I always get a "not enough values to unpack" error.
I'm transforming like this:
ct.fit_transform(df_train_xgboost)
Any idea what I should do?
EDIT:
Some example Data
id | col1 | col2 | col3 | price | has_something
1 blue car new 23781 NaN
2 green truck used 24512 1
3 red van new 44521 0
Some more code
categories_list = ['blue','green','red','car','truck','van','new','used']
df_train_xgboost = df_train
df_train_xgboost = df_train_xgboost.drop(columns_I_dont_want, axis=1)
df_train_xgboost = df_train_xgboost.fillna(value = {'col1': 0, 'col2': 0, 'col3': 0})
ct = ColumnTransformer([(OneHotEncoder(categories = categories_list),['col1','col2','col3'])])
print(df_train_xgboost.shape)
ct.fit_transform(df_train_xgboost)
First of all, the use of ColumnTransformer is not necessary.
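To make your code work you need one more argument: the "name" of the transformer. Each entry in the list must be a (name, transformer, columns) tuple, which is why a two-element tuple fails with "not enough values to unpack".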
Full example:
df
col1 col2 col3
0 blue car new
1 green truck used
2 red van new
ct = ColumnTransformer([("onehot",OneHotEncoder(),[0,1,2])])
ct.fit_transform(df.values)
array([[1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 0., 0., 1., 0., 0., 1.],
[0., 0., 1., 0., 0., 1., 1., 0.]])
Now notice that you get the same output by only using OneHotEncoder:
o = OneHotEncoder()
o.fit_transform(df).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 0., 0., 1., 0., 0., 1.],
[0., 0., 1., 0., 0., 1., 1., 0.]])
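If the frame also contains columns that should not be encoded, such as price and has_something in the question, ColumnTransformer does become useful. The following is a sketch under that assumption, with the sample rows typed in by hand: `remainder="passthrough"` keeps the untouched columns (NaN included), and `sparse_threshold=0` forces a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "col1": ["blue", "green", "red"],
    "col2": ["car", "truck", "van"],
    "col3": ["new", "used", "new"],
    "price": [23781, 24512, 44521],
    "has_something": [np.nan, 1, 0],
})

ct = ColumnTransformer(
    # Each entry is a (name, transformer, columns) tuple.
    [("onehot", OneHotEncoder(), ["col1", "col2", "col3"])],
    remainder="passthrough",  # keep price and has_something as they are
    sparse_threshold=0,       # always return a dense array
)
out = ct.fit_transform(df)
print(out.shape)
```

The first eight columns of `out` are the one-hot block, followed by the two passed-through columns.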
Given a matrix, such as:
1 0 0
0 1 1
1 1 0
I would like to expand each element to a "sub-matrix" of size AxA, e.g., 3x3, so that the result is:
1 1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1
1 1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0 0
What is the fastest way of doing it in Python using numpy (or PyTorch)?
What you're describing is the Kronecker product, so use np.kron. From its documentation:
Computes the Kronecker product, a composite array made of blocks of the second array scaled by the first.
x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
np.kron(x, np.ones((3, 3)))
array([[1., 1., 1., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 1., 1., 1., 1.],
[0., 0., 0., 1., 1., 1., 1., 1., 1.],
[0., 0., 0., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 0., 0., 0.]])
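Note that the result above is float because np.ones defaults to float64. As a sketch of two variants that keep the integer dtype (the name `A` for the block size is taken from the question, the other names are illustrative), you can either pass a matching dtype to np.ones or tile with np.repeat along both axes:

```python
import numpy as np

x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
A = 3  # side length of each expanded block

# Kronecker product with an integer block keeps the dtype of x.
via_kron = np.kron(x, np.ones((A, A), dtype=x.dtype))

# Same result by repeating every row A times, then every column A times.
via_repeat = np.repeat(np.repeat(x, A, axis=0), A, axis=1)

print(via_repeat)
```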
I want to make a 34x34 matrix consisting entirely of zeroes and ones. I have an array that lists the coordinates where all of the ones should go, but I don't know how to use it. The array looks like this:
0 1 1
0 2 1
0 3 1
1 1 1
where the first number in each row is the x coordinate, the second number in each row is the y coordinate, and the third number is the desired value (always 1).
I tried to create a blank matrix using Matrix = numpy.zeros((34, 34)) (the shape has to be passed as a tuple), but I don't know how to change the desired values all at once.
Any idea how to take a matrix and change multiple values at once?
This works:
a = np.array([[0,1,1],[0,2,1],[0,3,1],[1,1,1]])
m = np.zeros([5,5])
for i in range(len(a)):
m[a[i][0],a[i][1]] = a[i][2] # Or = 1 if that's always the case
And the m matrix is:
array([[ 0., 1., 1., 1., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
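The loop can also be replaced by a single indexed assignment. This is a sketch using NumPy integer-array indexing on the 34x34 size from the question; as in the loop above, it treats the first column of `a` as the row index and the second as the column index.

```python
import numpy as np

a = np.array([[0, 1, 1], [0, 2, 1], [0, 3, 1], [1, 1, 1]])
m = np.zeros((34, 34))

# Split the coordinate list into index arrays and values.
rows, cols, values = a[:, 0], a[:, 1], a[:, 2]

# Assign all listed positions at once.
m[rows, cols] = values

print(m[:2, :5])
```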
I'm struggling to create the following matrix in python:
| 1 -2 1 0 ... 0 |
| 0 1 -2 1 ... ... |
|... ... ... ... 0 |
| 0 ... 0 1 -2 1 |
I have the MATLAB code below, which seems to create this matrix (it comes from an article), but I cannot convert it to Python.
MATLAB code:
D2 = spdiags(ones(T-2,1)*[1 -2 1],[0:2],T-2,T);
T is the number of columns.
My Python attempt looks like this:
from numpy import ones, array, arange
from scipy.sparse import spdiags
D2 = spdiags( (ones((T-2,1))*array([1,-2,1])),arange(0,3),T-2,T)
This produces the following error:
ValueError: number of diagonals (327) does not match the number of offsets (3)
But if I transpose the matrix like this:
D2 = spdiags( (ones((T-2,1))*array([1,-2,1])).T,arange(0,3),T-2,T)
I get the following result:
matrix([[ 1., -2., 1., ..., 0., 0., 0.],
[ 0., 1., -2., ..., 0., 0., 0.],
[ 0., 0., 1., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 1., 0., 0.],
[ 0., 0., 0., ..., -2., 0., 0.],
[ 0., 0., 0., ..., 1., 0., 0.]])
Can anybody help me? Where am I going wrong?
Change this:
D2 = spdiags( (ones((T-2,1))*array([1,-2,1])).T,arange(0,3),T-2,T)
to this:
D2 = spdiags( (ones((T,1))*array([1,-2,1])).T,arange(0,3),T-2,T)
That is, the length of each row in the first argument (the array containing the diagonals) must equal the number of columns in the result.
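Here is a self-contained sketch of the corrected call with a small T so the result can be checked by eye (T = 6 is an arbitrary choice for illustration). It also shows scipy.sparse.diags, which broadcasts scalar diagonals when a shape is given and so avoids building the data array by hand.

```python
import numpy as np
from scipy.sparse import diags, spdiags

T = 6  # number of columns; small value chosen only for illustration

# One row per diagonal, each of length T (the number of columns).
data = (np.ones((T, 1)) * np.array([1, -2, 1])).T  # shape (3, T)
D2 = spdiags(data, np.arange(0, 3), T - 2, T)

# Alternative: let diags broadcast the three scalar values.
D2_alt = diags([1, -2, 1], [0, 1, 2], shape=(T - 2, T))

print(D2.toarray())
```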
Say I have a matrix in a numpy array in Python
In [3]: my_matrix
Out[3]:
array([[ 2., 2., 2., 2., 2., 2., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 2., 2., 2., 2., 0., 0., 0.,
0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2., 2., 2.,
2., 2., 2., 2., 2.]])
Is there a way to have Python/IPython print my array as:
[ 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 2 2 2 2 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 ]
? (~ similar to the way MATLAB does it)
Also, I have noticed that IPython does not use the full width of my terminal when printing numpy arrays. Other functions do (e.g. pprint.pprint). How can I change that?
Use numpy.set_printoptions. For increasing the line width:
np.set_printoptions(linewidth=150)
Replace 150 with whatever you need. To print the values as you asked (which I take to mean without the decimal point), cast to an integer type:
print(my_matrix.astype('i'))
If you have floating point values, you can also control the precision of printouts with the precision option:
np.set_printoptions(precision=3)
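NumPy has no built-in MATLAB-style output with semicolons between rows, as far as I know, so that part needs a small formatter of your own. The sketch below uses a hypothetical helper, `matlab_style`, which is not part of NumPy, and also shows the np.printoptions context manager for changing print settings only temporarily.

```python
import numpy as np

def matlab_style(arr):
    """Hypothetical helper: format a 2-D array as '[ a b; c d ]' with integer entries."""
    rows = [" ".join(str(int(v)) for v in row) for row in arr]
    return "[ " + ";\n  ".join(rows) + " ]"

my_matrix = np.array([[2.0, 2.0, 0.0], [0.0, 0.0, 2.0]])
text = matlab_style(my_matrix)
print(text)

# Widen the line width and set the precision for this block only.
with np.printoptions(linewidth=150, precision=3):
    print(my_matrix)
```

The helper truncates values with int(), so it is only suitable for arrays that hold whole numbers.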