Is there an efficient Python way to window the data?

I have time-series data with 45000 rows and 30 columns. I want to split this data into rolling windows of length 150 along the rows and compute some statistics for each window.
Basically, this means I first take rows 0 to 149 and compute the mean followed by the standard deviation for each column. I then take rows 1 to 150 and again compute the mean and standard deviation for each column separately. For each window I stack the vectors of means and standard deviations horizontally.
At the moment I have the following function:
import numpy as np

def statFeaturesWindowed(data):
    dataLen = data.shape[0]
    count = 0
    winSize = 150
    ftVect = np.empty((1, 60))
    while count + winSize <= dataLen:
        if not count % 1000:
            print("Row {} of {}".format(count, dataLen - winSize))
        dataWin = data[count:count + winSize, :]  # data window
        means = np.mean(dataWin, axis=0)          # mean of each column
        stDev = np.std(dataWin, axis=0)           # standard deviation of each column
        tempVect = np.hstack((means, stDev))      # 60-element feature vector for this window
        ftVect = np.vstack((ftVect, tempVect))
        count += 1
    ftVect = np.delete(ftVect, 0, axis=0)  # drop the placeholder row created by np.empty
    return ftVect
However, this code is super slow and gets slower as the output matrix grows. At the moment, running it over 45000 rows takes roughly 5 minutes. Is there a more efficient way to do this?
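For reference, one way to avoid the repeated np.vstack (which copies the whole output on every iteration) is to build all windows at once with numpy's sliding_window_view, available since NumPy 1.20, and reduce over the window axis. A minimal sketch, assuming the data is a plain float array that fits in memory:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def statFeaturesWindowedFast(data, winSize=150):
    # windows is a view of shape (nWindows, nColumns, winSize); no data is copied
    windows = sliding_window_view(data, winSize, axis=0)
    means = windows.mean(axis=-1)     # (nWindows, nColumns)
    stDev = windows.std(axis=-1)      # (nWindows, nColumns)
    return np.hstack((means, stDev))  # (nWindows, 2 * nColumns)

data = np.random.rand(45000, 30)
ftVect = statFeaturesWindowedFast(data)
print(ftVect.shape)  # (44851, 60)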

Related

Storing Values from One Array into Another Larger Array

I am trying to create a range of signals of different frequencies. I am finding it difficult to store amplitude vs time into another storage matrix for each frequency ranging from 0 to 50 Hz. For example, for a frequency of 20 Hz I want to store the amplitude vs time for that frequency, then for 21 Hz I want to store the amplitude vs time for that frequency, etc., until I have all of them in one large matrix. I am getting confused with the indexing and syntax at this point; any help welcome!
import numpy as np

max_freq = 50
s_frequency = np.arange(0, 51, 0.1)
fs = 200
time = np.arange(0, 5 - (1/fs), (1/fs))
x = np.empty((len(time)), dtype=np.float32)
i = 0
j = 0
full_array = np.empty((len(s_frequency), len(time), len(time)), dtype=np.float32)
amplitude = np.zeros(999)
for f1 in s_frequency:
    i = 0
    for t in time:
        amplitude[i] = np.sin(2*np.pi*f1*t)
        i = i + 1
    full_array[i] = ([time], [amplitude])
I have also tried the following:
import numpy as np

max_freq = 50
s_frequency = np.arange(0, 50.1, 0.1)
fs = 200
time = np.arange(0, 5 - (1-fs), (1/fs))
#full_array = np.sin(2*np.pi*np.outer(s_frequency,time))
full_array = np.empty((len(s_frequency), len(time), len(time)), dtype=np.float32)
for f1 in s_frequency:
    array = []
    for i, t in enumerate(time):
        amplitude = np.sin(2*np.pi*f1*t)
        array.insert(i, amplitude)
    full_array[i] = [time, array]
Not 100% sure what you're trying to do, but it seems like you want to initialize a 2-dimensional grid (i.e. a matrix) with one dimension for time and one for frequency. Here is what I would do:
import numpy as np
max_freq = 50
s_frequency = np.arange(0,51,0.1)
fs = 200
time = np.arange(0,5-(1/fs),(1/fs))
full_array = np.sin(2*np.pi*np.outer(s_frequency,time))
No explicit for-loops or index handling needed. np.outer() will give you a 2D grid (i.e. a matrix) of frequency versus time. What's left is to compute the sine of 2*pi times each grid value. Very conveniently, numpy functions accept arrays as input, so we can simply call np.sin(2*np.pi*np.outer(s_frequency, time)).
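As a quick sanity check of the shapes involved (a small usage sketch based on the arrays defined above):
import numpy as np

s_frequency = np.arange(0, 51, 0.1)
fs = 200
time = np.arange(0, 5 - (1/fs), (1/fs))
full_array = np.sin(2 * np.pi * np.outer(s_frequency, time))

print(full_array.shape)  # (510, 999): one row per frequency, one column per time sample
# Row k is the sine wave at frequency s_frequency[k]
print(np.allclose(full_array[200], np.sin(2 * np.pi * s_frequency[200] * time)))  # True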
I'm not sure what x and j are good for in your code, or why full_array should be 3-dimensional. Would you like to include a spatial component as well?
By the way, a construct like this:
i = 0
for t in time:
    amplitude[i] = np.sin(2*np.pi*f1*t)
    i = i + 1
can easily be avoided in Python, thanks to Python's built-in enumerate() function. It would then look like this:
for i, t in enumerate(time):
    amplitude[i] = np.sin(2*np.pi*f1*t)
which does essentially the same thing, but you don't have to explicitly create the index with i = 0 and manually increment it in every iteration with i = i + 1.

Which is the fastest method to calculate mean square error in a large image dataset?

I'm trying to calculate the mean square error in an image dataset (CIFAR-10). I have a numpy array of dimension 5*10000*32*32*3, which is, in words, 5 batches of 10000 images, each with dimensions of 32*32*3. These images belong to 10 categories. I have calculated the average image of each class, and now I'm trying to calculate the mean square error of each of the 50000 images with respect to the 10 average images. Here is the code:
for i in range(0, 5):
    for j in range(0, 10000):
        min_diff, min_class = float('inf'), 0
        for avg in class_avg:  # class_avg comprises the 10 average images
            temp = mse(avg[1], images[i][j])
            if temp < min_diff:
                min_diff = temp
                min_class = avg[0]
        train_pred[i][j] = min_class
Problem: is there any way to make it faster? Any numpy magic? Thank you.
You can use expand_dims and tile.
There are many ways of expanding the dimensions of an array; I will use one of them, indexing with something like [:,None,:], which adds a new axis in the middle.
Below is an example of how you can combine the two methods to fulfill your task:
import numpy as np

test = np.ones((5,100,32,32,3))       # batches of images
average = np.ones((10,32,32,3))       # the 10 average images
average = average[None,None,...]      # expand to (1,1,10,32,32,3)
test = test[:,:,None,...]             # insert an axis -> (5,100,1,32,32,3)
test = np.tile(test,(1,1,10,1,1,1))   # tile to (5,100,10,32,32,3)
print(test.shape, average.shape)
mse = ((test-average)**2).mean(axis=(3,4,5))   # MSE per image per class
class_idx = np.argmin(mse,axis=-1)             # closest average image for each image
UPDATE
The purpose of using expand_dims and tile is to avoid using a for-loop. However, the np.tile operation will create 10 replicates of the original array, which will definitely hurt performance if the array is large. To avoid np.tile, you can try the code below:
labels = np.empty((5,100,10))
average = np.ones((10,32,32,3))
average = average[None,...]
test = np.ones((5,100,32,32,3))
for ind in range(10):
    labels[...,ind] = ((test-average[:,ind,...])**2).mean(axis=(2,3,4))
labels = np.argmin(labels,axis=-1)
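If memory is still a concern, a further option (my own sketch, not part of the original answer) is to expand the squared difference as ||t||^2 - 2*t.a + ||a||^2, which reduces the work to one matrix multiplication and never materializes a (5, 100, 10, 32, 32, 3) intermediate:
import numpy as np

test = np.ones((5, 100, 32, 32, 3))   # batches of images
average = np.ones((10, 32, 32, 3))    # the 10 average images

P = 32 * 32 * 3                       # pixels per image
t = test.reshape(5, 100, P)           # flatten each image
a = average.reshape(10, P)

t_sq = (t ** 2).sum(axis=-1)[..., None]      # ||t||^2, shape (5, 100, 1)
a_sq = (a ** 2).sum(axis=-1)[None, None, :]  # ||a||^2, shape (1, 1, 10)
cross = t @ a.T                              # t . a,  shape (5, 100, 10)

mse = (t_sq - 2 * cross + a_sq) / P          # (5, 100, 10)
class_idx = np.argmin(mse, axis=-1)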

Peak detection in unevenly spaced timeseries

I'm working with a dataset containing measurements paired with a datetime, like:
datetime,value
2017-01-01 00:01:00,32.7
2017-01-01 00:03:00,37.8
2017-01-01 00:04:05,35.0
2017-01-01 00:05:37,101.1
2017-01-01 00:07:00,39.1
2017-01-01 00:09:00,38.9
I'm trying to detect and remove potential peaks that might appear, like the 2017-01-01 00:05:37,101.1 measurement.
Some things that I have found so far:
This dataset has a time spacing that goes from 15 seconds all the way to 25 minutes, making it super uneven;
The width of the peaks cannot be determined beforehand;
The height of the peaks clearly and significantly deviates from the other values;
Normalization of the time step should only occur after the removal of the outliers, since they would interfere with the results;
It's "impossible" to make it even due to other anomalies (e.g., negative values, flat lines); even without them, resampling would create wrong values because of the peaks;
find_peaks expects an evenly spaced timeseries, so the previous solution didn't work for the irregular timeseries we have;
In that issue I forgot to mention the critical point, which is the unevenly spaced timeseries.
I've searched everywhere and I couldn't find anything. The implementation is going to be in Python but I'm willing to dig around other languages to get the logic.
I've posted this code on GitHub for anyone who has this problem, or a similar one, in the future.
After a lot of trial and error I think I created something that works. Using what @user58697 told me, I managed to write code that detects every peak outside a threshold.
Using the logic they explained, if (flow[i+1] - flow[i]) / (time[i+1] - time[i]) > threshold, I've coded the following:
I started by reading the .csv and parsing the dates, followed by splitting the data into two numpy arrays:
import datetime

import numpy as np
import pandas as pd

dataset = pd.read_csv('https://raw.githubusercontent.com/MigasTigas/peak_removal/master/dataset_simple_example.csv', parse_dates=['date'])
dataset = dataset.sort_values(by=['date']).reset_index(drop=True).to_numpy()  # Sort and convert to numpy array

# Split into 2 arrays
values = [float(i[1]) for i in dataset]  # Flow values, as floats
values = np.array(values)
dates = [i[0].to_pydatetime() for i in dataset]
dates = np.array(dates)
Then I applied (flow[i+1] - flow[i]) / (time[i+1] - time[i]) to the whole dataset:
flow = np.diff(values)
time = np.diff(dates).tolist()
time = np.divide(time, np.power(10, 9))
slopes = np.divide(flow, time) # (flow[i+1] - flow[i]) / (time[i+1] - time[i])
slopes = np.insert(slopes, 0, 0, axis=0) # Since we "lose" the first index, this one is 0, just for alignments
And finally, to detect the peaks, we reduce the data to rolling windows of x seconds each. That way we can detect them easily:
# ROLLING WINDOW
size = len(dataset)
rolling_window = []
rolling_window_indexes = []
RW = []
RWi = []
window_size = 240  # Seconds

dates = [i.to_pydatetime() for i in dataset['date']]
dates = np.array(dates)

# Create the rolling windows
for line in range(size):
    limit_stamp = dates[line] + datetime.timedelta(seconds=window_size)
    for subline in range(line, size, 1):
        if dates[subline] <= limit_stamp:
            rolling_window.append(slopes[subline])  # Values of the slopes
            rolling_window_indexes.append(subline)  # Indexes of the respective values
        else:
            RW.append(rolling_window)
            if line != size:  # To prevent clearing the last rolling window
                rolling_window = []
            RWi.append(rolling_window_indexes)
            if line != size:
                rolling_window_indexes = []
            break
    else:
        # To get the last rolling window, since the loop ends without breaking
        RW.append(rolling_window)
        RWi.append(rolling_window_indexes)
After getting all rolling windows we start the fun:
t = 0.3  # Threshold
peaks = []

for index, rollWin in enumerate(RW):
    if rollWin[0] > t:                             # If the first value is greater than the threshold
        top = rollWin[0]                           # Set it as a possible peak
        bottom = np.min(rollWin)                   # Find the minimum of the window
        if bottom < -t:                            # If it is less than the negative threshold
            bottomIndex = int(np.argmin(rollWin))  # Find its index
            for peak in range(0, bottomIndex, 1):  # Append all points from the start of the window up to bottomIndex
                peaks.append(RWi[index][peak])
The idea behind this code is that every peak has a rising edge and a falling edge; if both exceed the stated threshold, it is treated as an outlier peak, along with all the points between them. The resulting plots, for both the simple example and the real dataset, are posted in the GitHub repository.
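As a possible follow-up (not part of the original answer), the collected indices can then be dropped from both series, for example:
peaks = np.unique(peaks).astype(int)   # indices may repeat across overlapping windows
clean_values = np.delete(values, peaks)
clean_dates = np.delete(dates, peaks)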

What is the meaning of normalization in machine learning language? Does it correspond to one sample?

I am dealing with a classification problem: I want to classify data into 2 classes. I generate 1000 samples at different temperatures ranging from 1 to 5, and I load the data using the following load_data function. Here "data" is a 2-dimensional array of shape (1000, 16): the rows correspond to the number of samples in "1.0.npy" (and similarly for the other temperature points), and 16 is the number of features. I picked the max and min values from each sample by applying a for loop, but I'm afraid my normalization is not correct, because I'm not sure what the normalization strategy in machine learning should be. Should I take np.amax of each sample, or np.amax over all 1000 samples contained in the 1.0.npy file? My goal is to normalize the data between 0 and 1.
import os

import numpy as np

def load_data():
    path = "./directory"
    files = sorted(os.listdir(path))  # {1.0.npy, 2.0.npy, ..., 5.0.npy}
    dictData = {}
    for df in sorted(files):
        print(df)
        data = np.load(os.path.join(path, df))
        a = data
        lis = []
        for i in range(len(data)):
            old_range = np.amax(a[i]) - np.amin(a[i])
            new_range = 1 - 0
            f = ((a[i] - np.amin(a[i])) / old_range) * new_range + 0
            lis.append(f)
After normalization I get the following result, where the first value of every sample is 0 and the last value is 1:
[0, ...., 1] #first sample
[0,.....,1] #second sample
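For illustration only (this is not from the question), here is a sketch of the two strategies being weighed: scaling each sample by its own min/max versus scaling each feature by the min/max taken over all samples in a file. The data array below is a random stand-in for one file such as 1.0.npy:
import numpy as np

data = np.random.rand(1000, 16)  # hypothetical stand-in for one file, e.g. 1.0.npy

# Strategy 1: per-sample min-max scaling (what the loop above does).
# Every row is squashed to [0, 1] independently, so each sample's own
# minimum becomes 0 and its own maximum becomes 1.
row_min = data.min(axis=1, keepdims=True)
row_max = data.max(axis=1, keepdims=True)
per_sample = (data - row_min) / (row_max - row_min)

# Strategy 2: per-feature min-max scaling over the whole file.
# Each of the 16 features is scaled with the min/max computed across all
# 1000 samples, which preserves relative differences between samples.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
per_feature = (data - col_min) / (col_max - col_min)

print(per_sample.min(axis=1)[:3], per_sample.max(axis=1)[:3])  # each row spans [0, 1]
print(per_feature.min(axis=0), per_feature.max(axis=0))        # each feature spans [0, 1]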

Improve performance of the np.irr function through vectorization

Is it possible to improve the performance of the np.irr function such that it can be applied to a 2-dimensional array of cash flows without using a for-loop, either through vectorizing the np.irr function or through an alternative algorithm?
The irr function in the numpy library calculates the periodically compounded rate of return that gives a net present value of 0 for an array of cash flows. This function can only be applied to a 1-dimensional array:
x = np.array([-100,50,50,50])
r = np.irr(x)
np.irr will not work against a 2-dimensional array of cash flows, such as:
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
where each row represents a series of cash flows, and columns represent time periods. Therefore a slow implementation would be to loop over each row and apply np.irr to individual rows:
out = []
for x in cfs:
    out.append(np.irr(x))
For large arrays, this is an optimization barrier. Looking at the source code of the np.irr function, I believe the main obstacle is vectorizing the np.roots function:
def irr(values):
    res = np.roots(values[::-1])
    mask = (res.imag == 0) & (res.real > 0)
    if res.size == 0:
        return np.nan
    res = res[mask].real
    # NPV(rate) = 0 can have more than one solution so we return
    # only the solution closest to zero.
    rate = 1.0/res - 1
    rate = rate.item(np.argmin(np.abs(rate)))
    return rate
I have found a similar implementation in R: Fast loan rate calculation for a big number of loans, but don't know how to port this into Python. Also, I don't consider np.apply_along_axis or np.vectorize to be solutions to this issue since my main concern is performance, and I understand both are wrappers for a for-loop.
Thanks!
Looking at the source of np.roots,
import inspect
print(inspect.getsource(np.roots))
We see that it works by finding the eigenvalues of the "companion matrix". It also does some special handling of coefficients that are zero. I really don't understand the mathematical background, but I do know that np.linalg.eigvals can calculate eigenvalues for multiple matrices in a vectorized way.
Merging it with the source of np.irr has resulted in the following "Frankencode":
def irr_vec(cfs):
    # Create companion matrix for every row in `cfs`
    M, N = cfs.shape
    A = np.zeros((M, (N-1)**2))
    A[:,N-1::N] = 1
    A = A.reshape((M,N-1,N-1))
    A[:,0,:] = cfs[:,-2::-1] / -cfs[:,-1:]  # slice [-1:] to keep dims
    # Calculate roots; `eigvals` is a gufunc
    res = np.linalg.eigvals(A)
    # Find the solution that makes the most sense...
    mask = (res.imag == 0) & (res.real > 0)
    res = np.ma.array(res.real, mask=~mask, fill_value=np.nan)
    rate = 1.0/res - 1
    idx = np.argmin(np.abs(rate), axis=1)
    irr = rate[np.arange(M), idx].filled()
    return irr
This does not handle zero coefficients and surely fails when any(cfs[:,-1] == 0). Also, some input argument checking wouldn't hurt. And maybe some other problems? But for the supplied example data it achieves what we wanted (at the cost of increased memory use):
In [487]: cfs = np.zeros((10000,4))
...: cfs[:,0] = -100
...: cfs[:,1:] = 50
In [488]: %timeit [np.irr(x) for x in cfs]
1 loops, best of 3: 2.96 s per loop
In [489]: %timeit irr_vec(cfs)
10 loops, best of 3: 77.8 ms per loop
If you have the special case of loans with a fixed payback amount (like in the question you linked) you may be able to do it faster using interpolation...
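To sketch that interpolation idea (my reading of it, not a port of the linked R code): for a loan with principal P followed by n equal payments c, the IRR solves the annuity equation (1 - (1+r)^-n)/r = P/c. The left-hand side is monotonically decreasing in r, so it can be tabulated once on a grid of rates and inverted for many loans at once with np.interp. All names below are hypothetical:
import numpy as np

def irr_fixed_payment(principal, payment, n_periods, r_grid=None):
    # IRR for cash flows of the form [-principal, payment, payment, ...] with
    # n_periods equal payments, inverted from a precomputed annuity table.
    if r_grid is None:
        r_grid = np.linspace(1e-6, 1.0, 10001)           # candidate rates
    annuity = (1 - (1 + r_grid) ** -n_periods) / r_grid  # A(r), decreasing in r
    target = principal / payment                         # IRR solves A(r) = P/c
    # np.interp needs increasing x-values, so flip the decreasing annuity curve
    return np.interp(target, annuity[::-1], r_grid[::-1])

# Example: -100 followed by three payments of 50, as in the question's rows
print(irr_fixed_payment(np.full(10000, 100.0), 50.0, 3)[:3])  # roughly 0.2337 each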
After I posted this question, I kept working on it and came up with a vectorized solution that uses a different algorithm:
def virr(cfs, precision = 0.005, rmin = 0, rmax1 = 0.3, rmax2 = 0.5):
    '''
    Vectorized IRR calculator. First calculate a 3D array of the discounted
    cash flows along cash flow series, time period, and discount rate. Sum over time to
    collapse to a 2D array which gives the NPV along a range of discount rates
    for each cash flow series. Next, find crossover where NPV is zero--corresponds
    to the lowest real IRR value. For performance, negative IRRs are not calculated
    -- returns "-1", and values are only calculated to an acceptable precision.

    IN:
        cfs - numpy 2d array - rows are cash flow series, cols are time periods
        precision - level of accuracy for the inner IRR band eg 0.005%
        rmin - lower bound of the inner IRR band eg 0%
        rmax1 - upper bound of the inner IRR band eg 30%
        rmax2 - upper bound of the outer IRR band. eg 50% Values in the outer
                band are calculated to 1% precision, IRRs outside the upper band
                return the rmax2 value
    OUT:
        r - numpy column array of IRRs for cash flow series
    '''
    if cfs.ndim == 1:
        cfs = cfs.reshape(1, len(cfs))

    # Range of time periods
    years = np.arange(0, cfs.shape[1])

    # Range of the discount rates
    rates_length1 = int((rmax1 - rmin) / precision) + 1
    rates_length2 = int((rmax2 - rmax1) / 0.01)
    rates = np.zeros((rates_length1 + rates_length2,))
    rates[:rates_length1] = np.linspace(0, 0.3, rates_length1)
    rates[rates_length1:] = np.linspace(0.31, 0.5, rates_length2)

    # Discount rate multiplier; rows are years, cols are rates
    drm = (1 + rates) ** -years[:, np.newaxis]

    # Calculate discounted cfs
    discounted_cfs = cfs[:, :, np.newaxis] * drm

    # Calculate NPV array by summing over discounted cashflows
    npv = discounted_cfs.sum(axis=1)

    ## Find where the NPV changes sign, implies an IRR solution
    signs = npv < 0

    # Find the pairwise differences in boolean values; when the sign crosses over, the
    # pairwise diff will be True
    crossovers = np.diff(signs, 1, 1)

    # Extract the irr from the first crossover for each row
    irr = np.min(np.ma.masked_equal(rates[1:] * crossovers, 0), 1)

    # Error handling: negative irrs are returned as "-1", IRRs greater than rmax2 are
    # returned as rmax2
    negative_irrs = cfs.sum(1) < 0
    r = np.where(negative_irrs, -1, irr)
    r = np.where(irr.mask * (negative_irrs == False), 0.5, r)
    return r
Performance:
import numpy as np
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
%timeit [np.irr(x) for x in cfs]
10 loops, best of 3: 1.06 s per loop
%timeit virr(cfs)
10 loops, best of 3: 29.5 ms per loop
pyxirr is super fast, and np.irr is deprecated, so I'd use this now:
https://pypi.org/project/pyxirr/
import numpy as np
import pandas as pd
import pyxirr

cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
df = pd.DataFrame(cfs).T
df.apply(pyxirr.irr)
