New to Python. The code snippet below contains a numpy 1d array called randomWalk. Given indices (which can be interpreted as start dates and end dates, both of which may vary from item to item), I want to take multiple slices from that 1d array randomWalk and arrange the results in a 2d array of a given shape.
I am trying to vectorize this. I was able to select the slices I wanted from the 1d array using np.r_, but failed to store them in the format I require for the output (a 2d array with rows representing items and columns representing time from min(startDates) to max(endDates)).
Below is the (ugly) code that works.
import numpy as np
numItems = 20
numPeriods = 12
# Data
randomWalk = np.random.normal(loc = 0.0, scale = 0.05, size = (numPeriods,))
startDates = np.random.randint(low = 1, high = 5, size = numItems)
endDates = np.random.randint(low = 5, high = numPeriods + 1, size = numItems)
stochasticItems = np.random.choice([False, True], size=(numItems,), p = [0.9, 0.1])
# Result needs to be in this shape (code snippet is designed to capture that only
# a relatively small fraction of resultMatrix's elements will differ from unity)
resultMatrix = np.ones((numItems, numPeriods))
# Desired result (obtained via brute force)
for i in range(numItems):
    if stochasticItems[i]:
        resultMatrix[i, startDates[i]:endDates[i]] = np.cumprod(randomWalk[startDates[i]:endDates[i]] + 1.0)
Inspired by @mozway's answer, convert the irregular slices into a regular mask array:
>>> # build all arrays with np.random.seed(0)
>>> x = np.arange(numPeriods)
>>> mask = (startDates[:, None] <= x) & (endDates[:, None] > x)
>>> result = np.where(mask & stochasticItems[:, None], np.where(mask, randomWalk + 1, 1).cumprod(-1), 1)
>>> np.allclose(result, resultMatrix)
True
>>> result
array([[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1.0489369 , 1.16646468, 1.2753867 ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ],
[1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. ]])
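The mask approach above can be checked against the loop on freshly generated data. The following is a minimal sketch, not the answerer's exact code: the function names loop_reference and masked_cumprod, the snake_case variable names, and the use of np.random.default_rng are assumptions made for illustration, and the fraction of active rows is raised to about one half so that several rows are actually exercised.

```python
import numpy as np

def loop_reference(random_walk, start, end, active):
    # Brute-force version: one cumprod per active row
    out = np.ones((start.size, random_walk.size))
    for i in range(start.size):
        if active[i]:
            out[i, start[i]:end[i]] = np.cumprod(random_walk[start[i]:end[i]] + 1.0)
    return out

def masked_cumprod(random_walk, start, end, active):
    # True where column t lies inside the half-open window [start, end) of each row
    t = np.arange(random_walk.size)
    in_window = (start[:, None] <= t) & (t < end[:, None])
    # Factors outside the window are 1, so they do not affect the running product
    growth = np.where(in_window, random_walk + 1.0, 1.0)
    # Keep the running product only inside the window of active rows
    return np.where(in_window & active[:, None], growth.cumprod(axis=1), 1.0)

rng = np.random.default_rng(0)
num_items, num_periods = 20, 12
random_walk = rng.normal(0.0, 0.05, size=num_periods)
start = rng.integers(1, 5, size=num_items)
end = rng.integers(5, num_periods + 1, size=num_items)
active = rng.random(num_items) < 0.5  # assumption: ~50% active rows for the demo

expected = loop_reference(random_walk, start, end, active)
result = masked_cumprod(random_walk, start, end, active)
print(np.allclose(result, expected))
```

Note that the masked version computes a cumulative product for every row, including inactive ones, so it trades extra arithmetic for the absence of a Python loop.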
If vectorization is the goal, Pig's answer already achieves it. If vectorization itself does not matter (the OP mentions in the comments that the aim is better performance), I suggest using the numba library to accelerate the code. We can write a numba equivalent of np.cumprod and compile it with numba's no-python JIT:
import numba as nb
import numpy as np

@nb.njit
def nb_cumprod(arr):
    # Running product of a non-empty 1d array, equivalent to np.cumprod
    y = np.empty_like(arr)
    y[0] = arr[0]
    for i in range(1, arr.shape[0]):
        y[i] = arr[i] * y[i-1]
    return y

@nb.njit
def nb_(numItems, numPeriods, stochasticItems, startDates, endDates, randomWalk):
    resultMatrix = np.ones((numItems, numPeriods))
    for i in range(numItems):
        if stochasticItems[i]:
            resultMatrix[i, startDates[i]:endDates[i]] = nb_cumprod(randomWalk[startDates[i]:endDates[i]] + 1.0)
    return resultMatrix
In some of my benchmarks, this code ran about 10 times faster than the OP's loop.
How can I determine the indices of rows for which the sum of certain columns is 0?
For example, in the following array:
array([[ 0.9200001, 1. , 0. , 0. , 0. ],
[ 1.8800001, 1. , 0. , 0. , 0. ],
[ 2.2100001, 1. , 0. , 0. , 0. ],
[ 3.3400001, 1. , 0. , 0. , 0. ],
[ 4.3100001, 1. , 0. , 0. , 0. ],
[ 5.5900001, 1. , 0. , 0. , 0. ],
[ 6.7500001, 1. , 0. , 0. , 0. ],
[ 7.8300001, 0. , 0. , 0. , 0. ],
[ 8.8500001, 1. , 0. , 0. , 0. ],
[ 9.1600001, 0. , 0. , 0. , 0. ],
[10.3900001, 0. , 0. , 1. , 1. ],
[13.5600001, 0. , 0. , 1. , 1. ]])
I'd like to get the indices of rows for which the sum of columns (1: ) is zero. In this case, it would be [7,9].
I already tried different combinations without success:
np.nonzero(sum(operations[:,1:]==0))
np.nonzero((operations[:,1:].sum()==0))
Sorry in advance, as I imagine this is a simple question, but I can't figure it out.
Since you want to sum over the columns, you need axis=1. If a is your array:
np.nonzero(a[:,1:].sum(axis=1)==0)
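To illustrate why the attempts in the question fail and the axis=1 version works, here is a small sketch on a shortened version of the array (the four rows and the variable names are chosen only for illustration). It also shows np.flatnonzero, which returns a plain index array instead of a tuple, and an all-zero test that is an alternative if the columns could contain negative values that cancel out.

```python
import numpy as np

a = np.array([[0.92, 1.0, 0.0],
              [7.83, 0.0, 0.0],
              [8.85, 1.0, 0.0],
              [9.16, 0.0, 0.0]])

# Without axis, sum() collapses everything to a single scalar
total = a[:, 1:].sum()

# With axis=1, there is one sum per row
row_sums = a[:, 1:].sum(axis=1)

# Indices of rows whose selected columns sum to zero
idx_sum = np.flatnonzero(row_sums == 0)

# Alternative: rows whose selected columns are all exactly zero
idx_all = np.flatnonzero((a[:, 1:] == 0).all(axis=1))

print(total, row_sums, idx_sum, idx_all)
```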
import numpy as np
a = np.array([[ 0.9200001, 1. , 0. , 0. , 0. ],
[ 1.8800001, 1. , 0. , 0. , 0. ],
[ 2.2100001, 1. , 0. , 0. , 0. ],
[ 3.3400001, 1. , 0. , 0. , 0. ],
[ 4.3100001, 1. , 0. , 0. , 0. ],
[ 5.5900001, 1. , 0. , 0. , 0. ],
[ 6.7500001, 1. , 0. , 0. , 0. ],
[ 7.8300001, 0. , 0. , 0. , 0. ],
[ 8.8500001, 1. , 0. , 0. , 0. ],
[ 9.1600001, 0. , 0. , 0. , 0. ],
[10.3900001, 0. , 0. , 1. , 1. ],
[13.5600001, 0. , 0. , 1. , 1. ]])
# slice the array and only use the columns which shall be summed:
b = a[:,1:]
# then sum the columns by axis 1
c = b.sum(axis=1)
# then get the indices
np.where(c==0)
#out: (array([7, 9], dtype=int64),)
or in one line:
print(np.where(a[:,1:].sum(axis=1)==0))
I used a simple approach: if the sum of a whole row equals its first element, then the remaining columns sum to zero.
import numpy as np
array=np.array([[ 0.9200001, 1. , 0. , 0. , 0. ],
[ 1.8800001, 1. , 0. , 0. , 0. ],
[ 2.2100001, 1. , 0. , 0. , 0. ],
[ 3.3400001, 1. , 0. , 0. , 0. ],
[ 4.3100001, 1. , 0. , 0. , 0. ],
[ 5.5900001, 1. , 0. , 0. , 0. ],
[ 6.7500001, 1. , 0. , 0. , 0. ],
[ 7.8300001, 0. , 0. , 0. , 0. ],
[ 8.8500001, 1. , 0. , 0. , 0. ],
[ 9.1600001, 0. , 0. , 0. , 0. ],
[10.3900001, 0. , 0. , 1. , 1. ],
[13.5600001, 0. , 0. , 1. , 1. ]])
sumrow = np.sum(array, axis=1)
for i in range(len(array)):
    if array[i][0]==sumrow[i]:
        print(i)
I have the following program:
import numpy as np
arr = np.random.randn(3,4)
print(arr)
regArr = (arr > 0.8)
print (regArr)
print (arr[ regArr].reshape(arr.shape))
output:
[[ 0.37182134 1.4807685 0.11094223 0.34548185]
[ 0.14857641 -0.9159358 -0.37933393 -0.73946522]
[ 1.01842304 -0.06714827 -1.22557205 0.45600827]]
I am looking for output shaped like arr in which values greater than 0.8 are kept and all other values are set to zero.
I tried boolean masking as shown above, but I am not able to solve this. Kindly help.
I'm not entirely sure what exactly you want to achieve, but this is what I did to filter.
arr = np.random.randn(3,4)
array([[-0.04790508, -0.71700005, 0.23204224, -0.36354634],
[ 0.48578236, 0.57983561, 0.79647091, -1.04972601],
[ 1.15067885, 0.98622772, -0.7004639 , -1.28243462]])
arr[arr < 0.8] = 0
array([[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[1.15067885, 0.98622772, 0. , 0. ]])
Thanks to user3053452, I have added one more solution that leaves the original data unchanged.
arr = np.random.randn(3,4)
array([[ 0.4297907 , 0.38100702, 0.30358291, -0.71137138],
[ 1.15180635, -1.21251676, 0.04333404, 1.81045931],
[ 0.17521058, -1.55604971, 1.1607159 , 0.23133528]])
new_arr = np.where(arr < 0.8, 0, arr)
array([[0. , 0. , 0. , 0. ],
[1.15180635, 0. , 0. , 1.81045931],
[0. , 0. , 1.1607159 , 0. ]])
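As a brief sketch of why the reshape in the question cannot work: indexing with a boolean mask returns a 1d array holding only the selected elements, so it no longer has enough elements to be reshaped to arr.shape. The fixed sample values below are made up for illustration. The sketch uses the strict condition arr > 0.8 from the question, which differs from arr < 0.8 only for values exactly equal to 0.8.

```python
import numpy as np

arr = np.array([[0.37, 1.48, 0.11],
                [1.02, -0.07, 0.90]])
mask = arr > 0.8

# Boolean indexing flattens: only the True positions survive
selected = arr[mask]

# Non-destructive: keep values where the mask is True, zero elsewhere
kept = np.where(mask, arr, 0.0)

# In-place variant on a copy, zeroing everything outside the mask
in_place = arr.copy()
in_place[~mask] = 0.0

print(selected.shape, kept, sep="\n")
```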
I have this array:
I need to create a new array like this:
I guess I need to use a conditional, but I don't know how to create an array with 7 columns based on the values of a 5-column array.
Thanks in advance to anyone who can help!
I'm going to assume you want to convert your last column into one-hot encodings and then concatenate them to your original array. You can initialise an array of zeros and then set the appropriate indices to 1. Finally, concatenate the OHE array to your original.
MCVE:
print(arr)
array([[ -9.95, 15.27, 9.08, 1. ],
[ -6.81, 11.87, 8.38, 2. ],
[ -3.02, 11.08, -8.5 , 1. ],
[ -5.73, -2.29, -2.09, 2. ],
[ -7.01, -0.9 , 12.91, 2. ],
[-11.64, -10.3 , 2.09, 2. ],
[ 17.85, 13.7 , 2.14, 0. ],
[ 6.34, -9.49, -8.05, 2. ],
[ 18.62, -9.43, -1.02, 1. ],
[ -2.15, -23.65, -13.03, 1. ]])
c = arr[:, -1].astype(int)
ohe = np.zeros((c.shape[0], c.max() + 1))
ohe[np.arange(c.shape[0]), c] = 1
arr = np.hstack((arr[:, :-1], ohe))
print(arr)
array([[ -9.95, 15.27, 9.08, 0. , 1. , 0. ],
[ -6.81, 11.87, 8.38, 0. , 0. , 1. ],
[ -3.02, 11.08, -8.5 , 0. , 1. , 0. ],
[ -5.73, -2.29, -2.09, 0. , 0. , 1. ],
[ -7.01, -0.9 , 12.91, 0. , 0. , 1. ],
[-11.64, -10.3 , 2.09, 0. , 0. , 1. ],
[ 17.85, 13.7 , 2.14, 1. , 0. , 0. ],
[ 6.34, -9.49, -8.05, 0. , 0. , 1. ],
[ 18.62, -9.43, -1.02, 0. , 1. , 0. ],
[ -2.15, -23.65, -13.03, 0. , 1. , 0. ]])
One-line version of @COLDSPEED's answer using the np.eye trick:
np.hstack([arr[:,:-1], np.eye(arr[:,-1].astype(int).max() + 1)[arr[:,-1].astype(int)]])
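The np.eye trick works because row k of an identity matrix is exactly the one-hot vector for label k, so indexing the identity matrix with the label array produces one row per sample. A small sketch follows; the helper name append_one_hot is hypothetical, and it assumes the last column holds non-negative integer labels stored as floats.

```python
import numpy as np

def append_one_hot(arr):
    # Assumption: the last column contains integer class labels 0..k-1
    labels = arr[:, -1].astype(int)
    n_classes = labels.max() + 1
    # Row k of the identity matrix is the one-hot vector for label k
    one_hot = np.eye(n_classes)[labels]
    # Drop the label column and append the one-hot columns
    return np.hstack((arr[:, :-1], one_hot))

arr = np.array([[-9.95, 15.27, 9.08, 1.0],
                [17.85, 13.70, 2.14, 0.0],
                [6.34, -9.49, -8.05, 2.0]])
out = append_one_hot(arr)
print(out)
```

The number of one-hot columns is taken from the largest label present, so a class that never appears above the maximum gets no column; pass a fixed class count instead if that matters.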
I've implemented a matrix factorization model, say R = U*V, and now I would like to train and test this model.
To this end, given a sparse matrix R (zero for a missing value), I want to first hide some non-zero elements during training and use these non-zero elements as a test set later.
How can I randomly select some non-zero elements from a numpy.ndarray? Besides, I need to remember the row and column positions of the selected elements so that I can use them in testing.
for example:
In [2]: import numpy as np
In [4]: mtr = np.random.rand(10,10)
In [5]: mtr
Out[5]:
array([[ 0.92685787, 0.95496193, 0.76878455, 0.12304856, 0.13804963,
0.30867502, 0.60245974, 0.00797898, 0.1060602 , 0.98277982],
[ 0.88879888, 0.40209901, 0.35274404, 0.73097713, 0.56238248,
0.380625 , 0.16432029, 0.5383006 , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0.5991212 , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0.91442668, 0.72827097, 0.4511198 ],
[ 0.63979934, 0.33421621, 0.09218392, 0.71520048, 0.57100522,
0.37205284, 0.59726293, 0.58224992, 0.58690505, 0.4791199 ],
[ 0.35219557, 0.34954002, 0.93837312, 0.2745864 , 0.89569075,
0.81244084, 0.09661341, 0.80673646, 0.83756759, 0.7948081 ],
[ 0.09173706, 0.86250006, 0.22121994, 0.21097563, 0.55090202,
0.80954817, 0.97159981, 0.95888693, 0.43151554, 0.2265607 ],
[ 0.00723128, 0.95690539, 0.94214806, 0.01721733, 0.12552314,
0.65977765, 0.20845669, 0.44663729, 0.98392716, 0.36258081],
[ 0.65994805, 0.47697842, 0.35449045, 0.73937445, 0.68578224,
0.44278095, 0.86743906, 0.5126411 , 0.75683392, 0.73354572],
[ 0.4814301 , 0.92410622, 0.85267402, 0.44856078, 0.03887269,
0.48868498, 0.83618382, 0.49404473, 0.37328248, 0.18134919],
[ 0.63999748, 0.48718656, 0.54826717, 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0.70610649, 0.03213063, 0.88371607]])
In [6]: mtr = np.where(mtr>0.5, 0, mtr)
In [7]: %clear
In [8]: mtr
Out[8]:
array([[ 0. , 0. , 0. , 0.12304856, 0.13804963,
0.30867502, 0. , 0.00797898, 0.1060602 , 0. ],
[ 0. , 0.40209901, 0.35274404, 0. , 0. ,
0.380625 , 0.16432029, 0. , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0. , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0. , 0. , 0.4511198 ],
[ 0. , 0.33421621, 0.09218392, 0. , 0. ,
0.37205284, 0. , 0. , 0. , 0.4791199 ],
[ 0.35219557, 0.34954002, 0. , 0.2745864 , 0. ,
0. , 0.09661341, 0. , 0. , 0. ],
[ 0.09173706, 0. , 0.22121994, 0.21097563, 0. ,
0. , 0. , 0. , 0.43151554, 0.2265607 ],
[ 0.00723128, 0. , 0. , 0.01721733, 0.12552314,
0. , 0.20845669, 0.44663729, 0. , 0.36258081],
[ 0. , 0.47697842, 0.35449045, 0. , 0. ,
0.44278095, 0. , 0. , 0. , 0. ],
[ 0.4814301 , 0. , 0. , 0.44856078, 0.03887269,
0.48868498, 0. , 0.49404473, 0.37328248, 0.18134919],
[ 0. , 0.48718656, 0. , 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0. , 0.03213063, 0. ]])
Given such sparse ndarray, how can I select 20% of the non-zero elements and remember their position?
We'll use numpy.random.choice. First, we get arrays of the (i,j) indices where the data is nonzero (x is your array):
i,j = np.nonzero(x)
Then we'll select 20% of these:
ix = np.random.choice(len(i), int(np.floor(0.2 * len(i))), replace=False)
Here ix is an array of random, unique positions, 20% of the length of i and j (the length of i and j is the number of nonzero entries). To recover the indices, we do i[ix] and j[ix], so we can then select 20% of the nonzero entries of x by writing:
print(x[i[ix], j[ix]])
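Putting the steps above together for the train/test use case in the question, a minimal sketch might look like the following. The function name split_nonzero, its parameters, and the use of np.random.default_rng are assumptions for illustration; the idea of sampling positions from np.nonzero without replacement is the same as above.

```python
import numpy as np

def split_nonzero(R, test_fraction=0.2, seed=0):
    # Hypothetical helper: hide a fraction of the nonzero entries of R
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(R)
    n_test = int(np.floor(test_fraction * rows.size))
    # Sample positions into the list of nonzero entries, without repeats
    pick = rng.choice(rows.size, size=n_test, replace=False)
    test_rows, test_cols = rows[pick], cols[pick]
    # Remember the hidden values so they can be used for testing later
    test_vals = R[test_rows, test_cols].copy()
    # The training matrix has the hidden entries set to zero (missing)
    train = R.copy()
    train[test_rows, test_cols] = 0
    return train, test_rows, test_cols, test_vals

rng = np.random.default_rng(1)
R = rng.random((10, 10))
R = np.where(R > 0.5, 0, R)  # same sparsification as in the question

train, test_rows, test_cols, test_vals = split_nonzero(R)
print(test_vals.size, "of", np.count_nonzero(R), "nonzero entries held out")
```

The original matrix R is left untouched, so the held-out values can also be read back later as R[test_rows, test_cols].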