How to cut dataset into X and Y parameters - python

I have a time-series dataset that consists of evenly spaced timesteps and another parameter (say volume). I want to cut/split the dataset into X and Y parameters to train my ML model. I am looking for a logic/algorithm in Python that will be useful in tackling the simplified version below.
I have an array of even timesteps (1 timestep = 1day) ranging from 1 to 100:
array = [1,2,3,...,100]
I have also come up with two parameters: N and K. N is used to build the X parameter and K to build the Y parameter.
If N = 5, then on the first iteration X = [1,2,3,4,5], on the second iteration X = [2,3,4,5,6], on the third iteration X = [3,4,5,6,7] and so forth. So the length of X is equal to N. If N = 10, then on the first iteration X = [1,2,3,...,10], on the second iteration X = [2,3,4,...,11] and so forth.
The K parameter is the length of a geometric sequence of powers of two. For example: K = 5 means (1,2,4,8,16), K = 3 means (1,2,4) and K = 7 means (1,2,4,8,16,32,64). Y takes the last element of the X array at each iteration and adds to it each value of the geometric sequence. So the length of Y is equal to K: if K = 5 -> len(Y) = 5, if K = 3 -> len(Y) = 3 and so forth.
Example 1: N = 5, K = 5:
First step:
X = [1,2,3,4,5] and Y = [6,7,9,13,21]
because K = (1,2,4,8,16) and Y = [5+1, 5+2, 5+4, 5+8, 5+16], with 5 being the last element of the array X
Second step:
X = [2,3,4,5,6] and Y = [7,8,10,14,22]
because K = (1,2,4,8,16) and Y = [6+1, 6+2, 6+4, 6+8, 6+16], with 6 being the last element of the array X
Third step:
X = [3,4,5,6,7] and Y = [8,9,11,15,23]
because K = (1,2,4,8,16) and Y = [7+1, 7+2, 7+4, 7+8, 7+16], with 7 being the last element of the array X
**Other steps**
Last step:
X = [?,?,?,?,?]; Y = [?,?,?,?,100]
because K = (1,2,4,8,16) and 100 is the last element of the array
Example 2: N = 6, K = 3:
First step:
X = [1,2,3,4,5,6] and Y = [7,8,10], because K = (1,2,4) and Y = [6+1, 6+2, 6+4]
Second step:
X = [2,3,4,5,6,7] and Y = [8,9,11], because K = (1,2,4) and Y = [7+1, 7+2, 7+4]
Third step:
X = [3,4,5,6,7,8] and Y = [9,10,12], because K = (1,2,4) and Y = [8+1, 8+2, 8+4]
**Other steps**
Last step:
X = [91,92,93,94,95,96]; Y = [97,98,100], because K = (1,2,4) and 100 is the last element of the array
Edit:
I expect the function to look like:
def dataset_split(array, N, K):
It should return multiple X and Y arrays (basically chunks) built from the input array of 1 to 100: it should go over the steps described above and save the X and Y results as matrices or arrays. Based on Example 1 above, my X array after the first three steps will be
X = [[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]]
and my Y array after the first three steps will be
Y = [[6,7,9,13,21], [7,8,10,14,22], [8,9,11,15,23]]
The procedure should continue until the last element of the array is reached, which is 100 in this case.
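For reference, here is a minimal pure-Python sketch of the requested dataset_split (the variable names are mine; the answers below give vectorized versions):
def dataset_split(array, N, K):
    # geometric offsets: 1, 2, 4, ..., 2**(K-1)
    offsets = [2**i for i in range(K)]
    X, Y = [], []
    for start in range(len(array) - N + 1):
        x = list(array[start:start + N])   # sliding window of length N
        y = [x[-1] + o for o in offsets]   # last element plus each offset
        if y[-1] > array[-1]:              # stop once a target passes the end
            break
        X.append(x)
        Y.append(y)
    return X, Y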

A very simple way to get all the values of X is to create a sliding window view into array. You can do this directly with np.lib.stride_tricks.sliding_window_view:
import numpy as np

n = ...  # window length N
k = ...  # geometric sequence length K
x = np.lib.stride_tricks.sliding_window_view(array, n)
The geometric sequence K can be trivially generated with np.logspace:
K = np.logspace(0, k - 1, k, base=2)
OR
K = 2.0**np.arange(k)
Either way, you can pre-generate all of y by broadcasting K against the last element of each window:
y = x[:, -1:] + K
Now you have two arrays with all of the data you need:
>>> array = np.arange(1, 101)
>>> n = k = 5
>>> x = np.lib.stride_tricks.sliding_window_view(array, n)
>>> x
array([[  1,   2,   3,   4,   5],
       [  2,   3,   4,   5,   6],
       [  3,   4,   5,   6,   7],
       ...
       [ 94,  95,  96,  97,  98],
       [ 95,  96,  97,  98,  99],
       [ 96,  97,  98,  99, 100]])
>>> K = np.logspace(0, k - 1, k, base=2)
>>> K
array([ 1.,  2.,  4.,  8., 16.])
>>> y = x[:, -1:] + K
>>> y
array([[  6.,   7.,   9.,  13.,  21.],
       [  7.,   8.,  10.,  14.,  22.],
       [  8.,   9.,  11.,  15.,  23.],
       ...
       [ 99., 100., 102., 106., 114.],
       [100., 101., 103., 107., 115.],
       [101., 102., 104., 108., 116.]])
The nice thing about this approach is that you don't need to copy the original data of array to make x, and everything is fully vectorized. Whatever operation you are planning on doing can likely be performed in bulk using numpy functions.
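One thing this version does not do is enforce your stopping rule (the last Y value must not pass array[-1]). A sketch of the extra masking step, along the same lines as the answer below:
# keep only windows whose largest target stays inside the array
valid = x[:, -1] + K[-1] <= array[-1]
x, y = x[valid], y[valid]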

This satisfies your examples:
def split_dataset(array, N, K):
    k = 2**np.arange(K)
    # column stacking i-places shifted array for N columns
    X = np.c_[[np.roll(array, -i) for i in range(N)]].T[:-N+1]
    # masking rows that will go over the last value in array
    mask = X[:, -1] + k[-1] <= array[-1]
    X = X[mask]
    # adding k to the last column of X
    Y = X[:, -1].reshape(-1, 1) + k
    return X, Y

X, Y = split_dataset(array, 5, 5)
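As a quick sanity check, this is the output I would expect for the example input; it matches Example 1 and stops at the window whose last target is exactly 100:
>>> X, Y = split_dataset(np.arange(1, 101), 5, 5)
>>> X[:3]
array([[1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7]])
>>> Y[:3]
array([[ 6,  7,  9, 13, 21],
       [ 7,  8, 10, 14, 22],
       [ 8,  9, 11, 15, 23]])
>>> X[-1], Y[-1]
(array([80, 81, 82, 83, 84]), array([ 85,  86,  88,  92, 100]))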

Related

Working with 2 arrays with different lengths Numpy Python

Is there a way I could modify the function below so that it can handle arrays of different lengths? The length of the Numbers array is 7 and the length of Formating is 5. The code below checks whether each number in Numbers falls between two consecutive values of Formating and, if so, sums the values that fall in between. So for the first interval, since no element in Numbers is between 0 and 2, the result will be 0. The code was derived from: issue.
Code:
Numbers = np.array([3, 4, 5, 7, 8, 10, 20])
Formating = np.array([0, 2, 5, 12, 15])
x = np.sort(Numbers)
l = np.searchsorted(x, Formating, side='left')
mask = (Formating[:-1, None] <= Numbers) & (Numbers < Formating[1:, None])
N = Numbers[:, None].repeat(5, 1).T
result = np.ma.masked_array(N, ~mask)
result = result.filled(0)
result = np.sum(result, axis=1)
Expected output:
[ 0 7 30 0]
Here's an approach with bincount. Note that you have your x and l mixed up, and recall that you could/should use digitize:
# sort Formating (the bin edges), not Numbers
x = np.sort(Formating)
# find the interval each number falls into
l = np.digitize(Numbers, x)
# output:
np.bincount(l, weights=Numbers)
Out:
array([ 0.,  0.,  7., 30.,  0., 20.])
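If you want exactly the four in-between sums from your expected output, you can drop the below-range and above-range bins by padding and slicing the bincount (a sketch; digitize indices 1 through 4 correspond to the intervals [0,2), [2,5), [5,12) and [12,15)):
out = np.bincount(l, weights=Numbers, minlength=len(x) + 1)[1:len(x)]
# out: array([ 0.,  7., 30.,  0.])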

Faster way to create cost matrix

I am using the Hungarian Algorithm in scipy which takes as an input the cost matrix of two sets of points. This just means each element in array x is passed into function f with each element in array y. I currently implemented this with a nested for loop in python. Here is a basic example of what I do:
def f(a, b):
    return a * b

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])
cost_mat = np.zeros((x.shape[0], y.shape[0]))
for i in range(x.shape[0]):
    for j in range(y.shape[0]):
        cost_mat[i, j] = f(x[i], y[j])
print(cost_mat)
>> out:
[[1. 2. 3.]
 [2. 4. 6.]
 [3. 6. 9.]]
Is there a faster way to do this? For example, vectorizing it somehow?
Something like this works:
x = np.array([1, 2, 3], ndmin=2)
y = np.array([1, 2, 3], ndmin=2)
cost_mat = x * y.T
cost_mat is
array([[1, 2, 3],
       [2, 4, 6],
       [3, 6, 9]])
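The same idea works without ndmin by inserting the axes explicitly; for any f built from numpy ufuncs this broadcasts to the full matrix in one shot (a sketch):
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])
# x[:, None] has shape (3, 1) and y[None, :] has shape (1, 3),
# so the product broadcasts to the (3, 3) cost matrix
cost_mat = x[:, None] * y[None, :]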
Let's time both solutions with bigger arrays:
x = np.random.rand(10000, 1)
y = np.random.rand(10000, 1)
def f(a, b):
    return a * b
# Start timing here
cost_mat1 = np.zeros((x.shape[0], y.shape[0]))
for i in range(x.shape[0]):
    for j in range(y.shape[0]):
        cost_mat1[i, j] = f(x[i], y[j])
# Wall time: 2min 13s
Using the transpose is way faster:
# Start timing here
cost_mat2 = x * y.T
# Wall time: 395 ms
Finally, check that
np.array_equal(cost_mat1, cost_mat2)
returns True.
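Since the end goal is a cost matrix between two point sets for the Hungarian algorithm, it may also be worth noting that scipy can build distance-based cost matrices directly; a sketch, assuming Euclidean cost and hypothetical point arrays:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

pts_a = np.random.rand(100, 2)   # hypothetical 2-D point sets
pts_b = np.random.rand(100, 2)
cost = cdist(pts_a, pts_b)       # pairwise Euclidean distances
row_ind, col_ind = linear_sum_assignment(cost)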

Subtract the average of first and last value of each row from all values in the row

I have a numpy array that looks like this:
77.132 2.075 63.365 74.880
49.851 22.480 19.806 76.053
16.911 8.834 68.536 95.339
0.395 51.219 81.262 61.253
72.176 29.188 91.777 71.458
54.254 14.217 37.334 67.413
44.183 43.401 61.777 51.314
65.040 60.104 80.522 52.165
90.865 31.924 9.046 30.070
11.398 82.868 4.690 62.629
and what I'm trying to do is:
1. find the average of the first and last item in each row
2. subtract this average from each pixel in that row
3. repeat for each row
4. create a new image of subtracted pixels.
I've tried it using for loops but I can't get it working:
import numpy as np

# Create random arrays to simulate images
np.random.seed(10)
image = 100 * np.random.rand(10, 4)
no_disk_list = []
#for row in image:
#    left, right = row[0], row[-1]
#    average = (left + right) / 2.0
#    for i in row:
#        no_average = row[i] - average
#        print(average)
#        no_disk_list.append(no_average)
subtracted = np.ones_like(image)
height, width = image.shape
for row in image:
    left, right = image[0], image[-1]
    average = (left + right) / 2.0
    for element in row:
        subtracted[row, element] = image[row, element] - average
Both of the nested loops give an error:
File "C:/Users/Jeremy/Dropbox/Astro480/NEOWISE/subtract_disk.py", line 17, in <module>
no_disk_value = row[i] - disk_value
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
for the first loop and
File "C:/Users/Jeremy/Dropbox/Astro480/NEOWISE/subtract_pixels.py", line 23, in <module>
print(image[row, element])
IndexError: arrays used as indices must be of integer (or boolean) type
for the second. The questions here, here, and here are of limited use in my situation. Besides, I know that vectorization would be a better way to go, since the image I'll eventually be using has 1.3 million pixels. How can I get the loops working, or even better, vectorize the calculation?
If I understand the question correctly this will work:
subtracted = np.ones_like(image)
height, width = image.shape
for row_no, row in enumerate(image):  # keep the row number using enumerate
    left, right = row[0], row[-1]     # you need the first and last value of the ROW!
    average = (left + right) / 2.0
    # Also use enumerate in the inner loop
    for col_no, element in enumerate(row):
        subtracted[row_no, col_no] = element - average
You can even use broadcasting ("vectorization") to shorten this considerably:
subtracted = image - (image[:, [0]] + image[:, [-1]]) / 2
The image[:, [0]] is the first column, the image[:, [-1]] is the last column. By adding and dividing them by 2 you get a 2D array containing the averages of each row. The final step is subtracting this from the image, which is easy in this case because it will broadcast correctly.
Step-by-step:
>>> arr = np.arange(20).reshape(4, 5)
>>> arr
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
>>> arr[:, [0]]  # first column
array([[ 0],
       [ 5],
       [10],
       [15]])
>>> arr[:, [-1]]  # last column
array([[ 4],
       [ 9],
       [14],
       [19]])
>>> (arr[:, [0]] + arr[:, [-1]]) / 2  # average
array([[ 2.],
       [ 7.],
       [12.],
       [17.]])
>>> arr - (arr[:, [0]] + arr[:, [-1]]) / 2  # subtracted
array([[-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.]])
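An equivalent spelling uses keepdims so the row axis survives the mean; a sketch of the same broadcast:
# mean of first and last column per row, kept as shape (rows, 1)
subtracted = image - image[:, [0, -1]].mean(axis=1, keepdims=True)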

How do I access elements of a 2d array based on two where clauses?

I have the following ndarray:
[[ 3 271]
[ 4 271]
[375 271]
[ 3 216]
[375 216]
[ 0 0]
[ 0 546]
[378 546]
[378 0]
[ 1 182]
[ 2 181]
[376 181]
[377 182]
[377 544]
[376 545]]
Essentially a bunch of X,Y coordinates/points. I'd like to be able to select X,Y coordinates "near" a given point on both axes.
For example, given a target point of [3, 271], I'd first retrieve all other points at this Y location (271), to then be able to select rows -/+ 3 on the X axis. For the above, that should yield:
[ 3 271]
[ 4 271]
I've gotten as far as getting all rows with the same Y value like this:
index_on_y = points[:,1] == point[1]
shared_y = points[index_on_y]
This returns:
shared_y:
[[ 3 271]
[ 4 271]
[375 271]]
How do I now select all rows from this array where the X value (column 0) can be anything between 0-6? I've tried various combinations of slicing/indexing/np.where but I've not been able to get the desired result. The below is as far as I got, but I know it is incorrect; I just don't know what the right (and most efficient) way of doing it would be:
def nearby_contour_points_x(self, point, points, radius):
    index_on_y = points[:, 1] == point[1]  # correct
    shared_y = points[index_on_y]          # correct
    x_vals = shared_y[:, 0]                # not good?
    index_on_x = np.where(np.logical_or(x_vals <= (point[0] - radius), x_vals <= (point[0] + radius)))
    return shared_y[index_on_x]
Ideally, I wouldn't have to first group on one of the axes.
With a as the array in your example.
target = np.array([3, 271])
Subtract the target
diff = a - target
y (column one) must be the same as the target - this results in a boolean array of shape a.shape[0]:
y_rows = diff[:,1] == 0
x is within ±3 of the target - this results in a boolean array of shape a.shape[0]:
x_rows = np.logical_and(diff[:,0] >= -3, diff[:,0] <= 3)
Make a mask for boolean indexing - its shape will be (15,), i.e. a.shape[0], so it selects whole rows:
mask = np.logical_and(x_rows, y_rows)
>>> a[mask]
array([[  3, 271],
       [  4, 271]])
Foregoing the initial subtraction and a little more generalized:
x_rows = np.logical_and(a[:,0] >= target[0] - 3, a[:,0] <= target[0] + 3)
y_rows = a[:,1] == target[1]
mask = np.logical_and(x_rows, y_rows)
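Wrapped into the function shape from the question (a sketch; radius generalizes the ±3, and the self parameter is dropped for a standalone function):
import numpy as np

def nearby_contour_points_x(point, points, radius):
    y_rows = points[:, 1] == point[1]                   # same y as the target
    x_rows = np.abs(points[:, 0] - point[0]) <= radius  # x within +/- radius
    return points[y_rows & x_rows]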
Perhaps an abuse of isclose, but it works. Note that atol=3 also tolerates ±3 on the y column, which is harmless here because ys has already been filtered on y:
ys = np.array([[  3, 271],
               [  4, 271],
               [375, 271]])
np.compress(np.all(np.isclose(ys, [3, 271], rtol=0, atol=3), axis=1), ys, axis=0)
Out[273]:
array([[  3, 271],
       [  4, 271]])

assigning points to bins

What is a good way to bin numerical values into a certain range? For example, suppose I have a list of values and I want to bin them into N bins by their range. Right now, I do something like this:
from scipy import *
num_bins = 3  # number of bins to use
values = # some array of integers...
min_val = min(values) - 1
max_val = max(values) + 1
my_bins = linspace(min_val, max_val, num_bins)
# assign points to bins
for v in values:
    best_bin = min_index(abs(my_bins - v))
where min_index returns the index of the minimum value (i.e. argmin). The idea is that you can find the bin a point falls into by seeing which bin it has the smallest difference with.
But I think this has weird edge cases. What I am looking for is a good representation of bins, ideally ones that are half closed half open (so that there is no way of assigning one point to two bins), i.e.
bin1 = [x1, x2)
bin2 = [x2, x3)
bin3 = [x3, x4)
etc...
What is a good way to do this in Python, using numpy/scipy? I am only concerned here with binning integer values.
Thanks very much for your help.
numpy.histogram() does exactly what you want.
The function signature is:
numpy.histogram(a, bins=10, range=None, normed=False, weights=None, new=None)
We're mostly interested in a and bins. a is the input data that needs to be binned. bins can be a number of bins (your num_bins), or it can be a sequence of scalars, which denote bin edges (half open).
import numpy
values = numpy.arange(10, dtype=int)
bins = numpy.arange(-1, 11)
freq, bins = numpy.histogram(values, bins)
# freq is now [0 1 1 1 1 1 1 1 1 1 1]
# bins is unchanged
To quote the documentation:
All but the last (righthand-most) bin is half-open. In other words, if bins is:
[1, 2, 3, 4]
then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
Edit: You want to know the index in your bins of each element. For this, you can use numpy.digitize(). If your bins are going to be integral, you can use numpy.bincount() as well.
>>> values = numpy.random.randint(0, 20, 10)
>>> values
array([17, 14, 9, 7, 6, 9, 19, 4, 2, 19])
>>> bins = numpy.linspace(-1, 21, 23)
>>> bins
array([-1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.,
       12., 13., 14., 15., 16., 17., 18., 19., 20., 21.])
>>> pos = numpy.digitize(values, bins)
>>> pos
array([19, 16, 11, 9, 8, 11, 21, 6, 4, 21])
Since the interval is open on the upper limit, the indices are correct:
>>> (bins[pos-1] == values).all()
True
>>> import sys
>>> for n in range(len(values)):
...     sys.stdout.write("%g <= %g < %g\n"
...                      % (bins[pos[n]-1], values[n], bins[pos[n]]))
17 <= 17 < 18
14 <= 14 < 15
9 <= 9 < 10
7 <= 7 < 8
6 <= 6 < 7
9 <= 9 < 10
19 <= 19 < 20
4 <= 4 < 5
2 <= 2 < 3
19 <= 19 < 20
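For the integral case mentioned in the edit, np.bincount counts occurrences of each integer directly, so the bins are implicit; a sketch using the values above:
counts = np.bincount(values)  # counts[v] is how many times v occurs
# e.g. counts[9] == 2 and counts[19] == 2 for the sample values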
This is fairly straightforward in numpy using broadcasting--my example below is four lines of code (not counting first two lines to create bins and data points, which would of course ordinarily be supplied.)
import numpy as NP

# just creating 5 bins at random, each bin expressed as (x, y, z); this code
# is not limited by bin number or bin dimension
bins = NP.random.randint(10, 100, 15).reshape(5, 3)
# creating 30 random data points
data = NP.random.randint(10, 100, 90).reshape(30, 3)
# for each data point i want the nearest bin, but before i can generate a distance
# matrix, i need to 'conform' the array dimensions;
# 'broadcasting' is an excellent and concise way to do this
bins = bins[:, NP.newaxis, :]    # shape (5, 1, 3)
data2 = data[NP.newaxis, :, :]   # shape (1, 30, 3)
# now i can calculate the (5, 30) distance matrix
dist_matrix = NP.sqrt(NP.sum((data2 - bins)**2, axis=-1))
bin_assignments = NP.argmin(dist_matrix, axis=0)
'bin_assignments' is a 1d array of indices comprised of integer values from 0 to 4, corresponding to the five bins--the bin assignments for each of the 30 original points in the 'data' matrix above.
