one-hot encoding of some integers in sci-kit library

one-hot encoding of some integers in sci-kit library - python

I'm trying to do one-hot encoding of the first column with following code:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories = 'auto',sparse = False)
X[:,0] = enc.fit_transform(X[:,0]).toarray()
But I get Error which says I have to reshape my data. How can I one-hot encode the first column and then add it to the whole data ?

Your problem is you're passing to OHE a 1d array. Reshape is as a 2d and you're fine to go.
Proof
Suppose we happen to have some data that resembles yours:
np.random.seed(42)
X = np.c_[np.random.randint(0,3,10),np.random.randn(10),np.random.randn(10)]
X
array([[ 2. , 1.57921282, -1.01283112],
[ 0. , 0.76743473, 0.31424733],
[ 2. , -0.46947439, -0.90802408],
[ 2. , 0.54256004, -1.4123037 ],
[ 0. , -0.46341769, 1.46564877],
[ 0. , -0.46572975, -0.2257763 ],
[ 2. , 0.24196227, 0.0675282 ],
[ 1. , -1.91328024, -1.42474819],
[ 2. , -1.72491783, -0.54438272],
[ 2. , -0.56228753, 0.11092259]])
Then we can proceed as follows:
from sklearn.preprocessing import OneHotEncoder
oho = OneHotEncoder(sparse = False)
oho_enc = oho.fit_transform(X[:,0].reshape(-1,1)) # <--- you have a problem here
res = np.c_[oho_enc, X[:,1:]]
res
matrix([[ 0. , 0. , 1. , 1.57921282, -1.01283112],
[ 1. , 0. , 0. , 0.76743473, 0.31424733],
[ 0. , 0. , 1. , -0.46947439, -0.90802408],
[ 0. , 0. , 1. , 0.54256004, -1.4123037 ],
[ 1. , 0. , 0. , -0.46341769, 1.46564877],
[ 1. , 0. , 0. , -0.46572975, -0.2257763 ],
[ 0. , 0. , 1. , 0.24196227, 0.0675282 ],
[ 0. , 1. , 0. , -1.91328024, -1.42474819],
[ 0. , 0. , 1. , -1.72491783, -0.54438272],
[ 0. , 0. , 1. , -0.56228753, 0.11092259]])

Related

Taking multiple slices of numpy 1d array from given indices, copying result into 2d array

New to Python. Given in the code snippet below is a numpy 1d array called randomWalk. Given indices (which can be interpreted as start dates and end dates, both of which may vary from item to item), I want to do take multiple slices from that 1d array randomWalk and arrange the results in a 2d array of given shape.
I am trying to vectorize this. Was able to select the slices I wanted from the 1d array using np.r_, but failed to store these in the format I require for the output (a 2d array with rows representing items and columns representing time from min(startDates) to max(endDates).
Below is the (ugly) code that works.
import numpy as np
numItems = 20
numPeriods = 12
# Data
randomWalk = np.random.normal(loc = 0.0, scale = 0.05, size = (numPeriods,))
startDates = np.random.randint(low = 1, high = 5, size = numItems)
endDates = np.random.randint(low = 5, high = numPeriods + 1, size = numItems)
stochasticItems = np.random.choice([False, True], size=(numItems,), p = [0.9, 0.1])
# Result needs to be in this shape (code snippet is designed to capture that only
# a relatively small fraction of resultMatrix's elements will differ from unity)
resultMatrix = np.ones((numItems, numPeriods))
# Desired result (obtained via brute force)
for i in range(numItems):
if stochasticItems[i]:
resultMatrix[
i, startDates[i]:endDates[i]] = np.cumprod(randomWalk[startDates[i]:endDates[i]] + 1.0)

Inspired by #mozway 's answer, convert irregular slices into regular mask array:
>>> # build all arrays with np.random.seed(0)
>>> x = np.arange(numPeriods)
>>> mask = (startDates[:, None] <= x) & (endDates[:, None] > x)
>>> result = np.where(mask & stochasticItems[:, None], np.where(mask, randomWalk + 1, 1).cumprod(-1), 1)
>>> np.allclose(result, resultMatrix)
True
>>> result
array([[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1.0489369 , 1.16646468, 1.2753867 ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ]])

If the vectorization is the goal, so it is done by Pig answer, If it is not matter (as it is mentioned by the OP in the comments --> The aim is improvement in performance), so I suggest using numba library to accelerate the code. We can write np.cumprod equivalent numba code and accelerate it using numba no-python jit:
#nb.njit
def nb_cumprod(arr):
y = np.empty_like(arr)
y[0] = arr[0]
for i in range(1, arr.shape[0]):
y[i] = arr[i] * y[i-1]
return y
#nb.njit
def nb_(numItems, numPeriods, stochasticItems, startDates, endDates, randomWalk):
resultMatrix = np.ones((numItems, numPeriods))
for i in range(numItems):
if stochasticItems[i]:
resultMatrix[i, startDates[i]:endDates[i]] = nb_cumprod(randomWalk[startDates[i]:endDates[i]] + 1.0)
return resultMatrix
This code improved the code ~10 times faster than the OP in my some benchmarks.

How to determine the indices of rows for which the sum of certain columns are 0?

How can I determine the indices of rows for which the sum of certain columns are 0?
For example, in the following array:
array([[ 0.9200001, 1. , 0. , 0. , 0. ],
[ 1.8800001, 1. , 0. , 0. , 0. ],
[ 2.2100001, 1. , 0. , 0. , 0. ],
[ 3.3400001, 1. , 0. , 0. , 0. ],
[ 4.3100001, 1. , 0. , 0. , 0. ],
[ 5.5900001, 1. , 0. , 0. , 0. ],
[ 6.7500001, 1. , 0. , 0. , 0. ],
[ 7.8300001, 0. , 0. , 0. , 0. ],
[ 8.8500001, 1. , 0. , 0. , 0. ],
[ 9.1600001, 0. , 0. , 0. , 0. ],
[10.3900001, 0. , 0. , 1. , 1. ],
[13.5600001, 0. , 0. , 1. , 1. ]])
I'd like to get the indices of rows for which the sum of columns (1: ) is zero. In this case, it would be [7,9].
I already tried different combinations without success:
np.nonzero(sum(operations[:,1:]==0))
np.nonzero((operations[:,1:].sum()==0))
Sorry in advance, as I imagine this is a simple question, but I can't figure it out.

Since you want to sum over the columns, you need axis=1. If a is your array:
np.nonzero(a[:,1:].sum(axis=1)==0)

import numpy as np
a = np.array([[ 0.9200001, 1. , 0. , 0. , 0. ],
[ 1.8800001, 1. , 0. , 0. , 0. ],
[ 2.2100001, 1. , 0. , 0. , 0. ],
[ 3.3400001, 1. , 0. , 0. , 0. ],
[ 4.3100001, 1. , 0. , 0. , 0. ],
[ 5.5900001, 1. , 0. , 0. , 0. ],
[ 6.7500001, 1. , 0. , 0. , 0. ],
[ 7.8300001, 0. , 0. , 0. , 0. ],
[ 8.8500001, 1. , 0. , 0. , 0. ],
[ 9.1600001, 0. , 0. , 0. , 0. ],
[10.3900001, 0. , 0. , 1. , 1. ],
[13.5600001, 0. , 0. , 1. , 1. ]])
# slice the array and only use the columns which shall be summed:
b = a[:,1:]
# then sum the columns by axis 1
c = b.sum(axis=1)
# then get the indices
np.where(c==0)
#out: (array([7, 9], dtype=int64),)
or in one line:
print(np.where(a[:,1:].sum(axis=1)==0))

I used simple way:
import numpy as np
array=np.array([[ 0.9200001, 1. , 0. , 0. , 0. ],
[ 1.8800001, 1. , 0. , 0. , 0. ],
[ 2.2100001, 1. , 0. , 0. , 0. ],
[ 3.3400001, 1. , 0. , 0. , 0. ],
[ 4.3100001, 1. , 0. , 0. , 0. ],
[ 5.5900001, 1. , 0. , 0. , 0. ],
[ 6.7500001, 1. , 0. , 0. , 0. ],
[ 7.8300001, 0. , 0. , 0. , 0. ],
[ 8.8500001, 1. , 0. , 0. , 0. ],
[ 9.1600001, 0. , 0. , 0. , 0. ],
[10.3900001, 0. , 0. , 1. , 1. ],
[13.5600001, 0. , 0. , 1. , 1. ]])
sumrow = np.sum(array, axis=1)
for i in range(len(array)):
if array[i][0]==sumrow[i]:
print(i)

bool masking in numpy array matrices

I have following program
import numpy as np
arr = np.random.randn(3,4)
print(arr)
regArr = (arr > 0.8)
print (regArr)
print (arr[ regArr].reshape(arr.shape))
output:
[[ 0.37182134 1.4807685 0.11094223 0.34548185]
[ 0.14857641 -0.9159358 -0.37933393 -0.73946522]
[ 1.01842304 -0.06714827 -1.22557205 0.45600827]]
I am looking for output in arr where values greater than 0.8 should exist and other values to be zero.
I tried bool masking as shown above. But I am able to slove this. Kindly help

I'm not entirely sure what exactly you want to achieve, but this is what I did to filter.
arr = np.random.randn(3,4)
array([[-0.04790508, -0.71700005, 0.23204224, -0.36354634],
[ 0.48578236, 0.57983561, 0.79647091, -1.04972601],
[ 1.15067885, 0.98622772, -0.7004639 , -1.28243462]])
arr[arr < 0.8] = 0
array([[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[1.15067885, 0.98622772, 0. , 0. ]])
Thanks to user3053452, I have added one more solution which the original data will not be changed.
arr = np.random.randn(3,4)
array([[ 0.4297907 , 0.38100702, 0.30358291, -0.71137138],
[ 1.15180635, -1.21251676, 0.04333404, 1.81045931],
[ 0.17521058, -1.55604971, 1.1607159 , 0.23133528]])
new_arr = np.where(arr < 0.8, 0, arr)
array([[0. , 0. , 0. , 0. ],
[1.15180635, 0. , 0. , 1.81045931],
[0. , 0. , 1.1607159 , 0. ]])

Python: Generate a new array with more columns based on another array

I have this array:
I need create a new array like this:
I guess I need use a conditional, but I don't know how create an array with 7 columns, based on values of a 5 columns array.
If anyone could help me, I thank!

I'm going to assume you want to convert your last column into one hot concodings and then concat it to your original array. You can initialise an array of zeros, and then set the appropriate indices to 1. Finally concat the OHE array to your original.
MCVE:
print(arr)
array([[ -9.95, 15.27, 9.08, 1. ],
[ -6.81, 11.87, 8.38, 2. ],
[ -3.02, 11.08, -8.5 , 1. ],
[ -5.73, -2.29, -2.09, 2. ],
[ -7.01, -0.9 , 12.91, 2. ],
[-11.64, -10.3 , 2.09, 2. ],
[ 17.85, 13.7 , 2.14, 0. ],
[ 6.34, -9.49, -8.05, 2. ],
[ 18.62, -9.43, -1.02, 1. ],
[ -2.15, -23.65, -13.03, 1. ]])
c = arr[:, -1].astype(int)
ohe = np.zeros((c.shape[0], c.max() + 1))
ohe[np.arange(c.shape[0]), c] = 1
arr = np.hstack((arr[:, :-1], ohe))
print(arr)
array([[ -9.95, 15.27, 9.08, 0. , 1. , 0. ],
[ -6.81, 11.87, 8.38, 0. , 0. , 1. ],
[ -3.02, 11.08, -8.5 , 0. , 1. , 0. ],
[ -5.73, -2.29, -2.09, 0. , 0. , 1. ],
[ -7.01, -0.9 , 12.91, 0. , 0. , 1. ],
[-11.64, -10.3 , 2.09, 0. , 0. , 1. ],
[ 17.85, 13.7 , 2.14, 1. , 0. , 0. ],
[ 6.34, -9.49, -8.05, 0. , 0. , 1. ],
[ 18.62, -9.43, -1.02, 0. , 1. , 0. ],
[ -2.15, -23.65, -13.03, 0. , 1. , 0. ]])

One-line version of #COLDSPEED using the np.eye trick:
np.hstack([arr[:,:-1], np.eye(arr[:,-1].astype(int).max() + 1)[arr[:,-1].astype(int)]])

How to randomly select some non-zero elements from a numpy.ndarray?

I've implemented a matrix factorization model, say R = U*V, and now I would to train and test this model.
To this end, given a sparse matrix R (zero for missing value), I want to first hide some non-zero elements in the training and use these non-zero elements as test set later.
How can I randomly select some non-zero elements from a numpy.ndarray? Besides, I need to remember the index and column position of these selected elements to use these elements in testing.
for example:
In [2]: import numpy as np
In [4]: mtr = np.random.rand(10,10)
In [5]: mtr
Out[5]:
array([[ 0.92685787, 0.95496193, 0.76878455, 0.12304856, 0.13804963,
0.30867502, 0.60245974, 0.00797898, 0.1060602 , 0.98277982],
[ 0.88879888, 0.40209901, 0.35274404, 0.73097713, 0.56238248,
0.380625 , 0.16432029, 0.5383006 , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0.5991212 , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0.91442668, 0.72827097, 0.4511198 ],
[ 0.63979934, 0.33421621, 0.09218392, 0.71520048, 0.57100522,
0.37205284, 0.59726293, 0.58224992, 0.58690505, 0.4791199 ],
[ 0.35219557, 0.34954002, 0.93837312, 0.2745864 , 0.89569075,
0.81244084, 0.09661341, 0.80673646, 0.83756759, 0.7948081 ],
[ 0.09173706, 0.86250006, 0.22121994, 0.21097563, 0.55090202,
0.80954817, 0.97159981, 0.95888693, 0.43151554, 0.2265607 ],
[ 0.00723128, 0.95690539, 0.94214806, 0.01721733, 0.12552314,
0.65977765, 0.20845669, 0.44663729, 0.98392716, 0.36258081],
[ 0.65994805, 0.47697842, 0.35449045, 0.73937445, 0.68578224,
0.44278095, 0.86743906, 0.5126411 , 0.75683392, 0.73354572],
[ 0.4814301 , 0.92410622, 0.85267402, 0.44856078, 0.03887269,
0.48868498, 0.83618382, 0.49404473, 0.37328248, 0.18134919],
[ 0.63999748, 0.48718656, 0.54826717, 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0.70610649, 0.03213063, 0.88371607]])
In [6]: mtr = np.where(mtr>0.5, 0, mtr)
In [7]: %clear
In [8]: mtr
Out[8]:
array([[ 0. , 0. , 0. , 0.12304856, 0.13804963,
0.30867502, 0. , 0.00797898, 0.1060602 , 0. ],
[ 0. , 0.40209901, 0.35274404, 0. , 0. ,
0.380625 , 0.16432029, 0. , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0. , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0. , 0. , 0.4511198 ],
[ 0. , 0.33421621, 0.09218392, 0. , 0. ,
0.37205284, 0. , 0. , 0. , 0.4791199 ],
[ 0.35219557, 0.34954002, 0. , 0.2745864 , 0. ,
0. , 0.09661341, 0. , 0. , 0. ],
[ 0.09173706, 0. , 0.22121994, 0.21097563, 0. ,
0. , 0. , 0. , 0.43151554, 0.2265607 ],
[ 0.00723128, 0. , 0. , 0.01721733, 0.12552314,
0. , 0.20845669, 0.44663729, 0. , 0.36258081],
[ 0. , 0.47697842, 0.35449045, 0. , 0. ,
0.44278095, 0. , 0. , 0. , 0. ],
[ 0.4814301 , 0. , 0. , 0.44856078, 0.03887269,
0.48868498, 0. , 0.49404473, 0.37328248, 0.18134919],
[ 0. , 0.48718656, 0. , 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0. , 0.03213063, 0. ]])
Given such sparse ndarray, how can I select 20% of the non-zero elements and remember their position?

We'll use numpy.random.choice. First, we get arrays of the (i,j) indices where the data is nonzero:
i,j = np.nonzero(x)
Then we'll select 20% of these:
ix = np.random.choice(len(i), int(np.floor(0.2 * len(i))), replace=False)
Here ix is a list of random, unique indices, 20% the length of i and j (the length of i and j is the number of nonzero entries). To recover the indices, we do i[ix] and j[ix], so we can then select 20% of the nonzero entries of x by writing:
print x[i[ix], j[ix]]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

one-hot encoding of some integers in sci-kit library - python

Related

Taking multiple slices of numpy 1d array from given indices, copying result into 2d array

How to determine the indices of rows for which the sum of certain columns are 0?

bool masking in numpy array matrices

Python: Generate a new array with more columns based on another array

How to randomly select some non-zero elements from a numpy.ndarray?

Categories

Resources