Manual normalization function taking too long to execute - python

I am trying to implement a normalization function manually rather than using the scikit learn's one. The reason is that, I need to define the maximum and minimum parameters manually and scikit learn doesn't allow that alteration.
I successfully implemented this to normalize the values between 0 and 1. But it is taking a very long time to run.
Question: Is there another efficient way I can do this? How can I make this execute faster.
Shown below is my code:
scaled_train_data = scale(train_data)
def scale(data):
for index, row in data.iterrows():
X_std = (data.loc[index, "Close"] - 10) / (2000 - 10)
data.loc[index, "Close"] = X_std
return data
2000 and 10 are the attributes that i defined manually rather than taking the minimum and the maximum value of the dataset.
Thank you in advance.

Use numpy's matrix.you can also set your min and max mannually.
import numpy as np
data = np.array(df)
_min = np.min(data, axis=0)
_max = np.max(data, axis=0)
normed_data = (data - _min) / (_max - _min)

Why loop? You can just use
train_data['close'] = (train_data['close'] - 10)/(2000 - 10)
to make use of vectorized numpy functions. Of course, you could also put this in a function, if you prefer.
Alternatively, if you want to rescale to a linear range, you could use http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html. The advantage of this is that you can save it and then rescale the test data in the same manner.

Related

fit_transform and inverse_transform on two different scripts

How to fit_transform & inverse_transform in separate scripts?
I first normalize numerical targets (integers) in a script.
Then, I use an other script to predict in real-time these numerical targets (regression).
fit_transform & inverse_transform functions def are in a third script.
scaler = MinMaxScaler(copy=True, feature_range=(0.,1.))
def normalize(array):
array = scaler.fit_transform(array).flatten()
return array
def inverse_norm(array):
array = scaler.inverse_transform(array).flatten()
return array
Naively, I just "inverse_transformed" the predicted values within my real-time script.
But predicted values were not in the range as the original numerical targets: these are little float numbers.
Thank you for your help.
In general I think you don't want to normalize you target variable but if you want to do so you can use a label encoder instead of a minmax scaler which is rather use to normalize features
I fixed myself the problem thanks to mattOrNothing.
Just write down the answer.
First script:
myNormalizedArray = (myArray - myArrayMin) / (myArrayMax - myArrayMin)
Second script:
myDenormalizedArray = myPredicedArray * (myArrayMax - myArrayMin) + myArrayMin
Where myArray & myPredicedArray are numpy arrays.

SVM with python and CPLEX, load the quadratic part of the objective function

''In general, it would get better performance creating batches of linear constraints rather than creating them one at a time. I just wondering if it states even with a huge problem.'' - The wise programmer.
To be clear, I have a (35k x 40) dataset, and I want to do SVM on it. I need to produce the Gramm matrix of this dataset, it is fine, but to pass the coefficient to CPLEX is a mess, it takes hours, here my code:
nn = 35000
XXt = np.random.rand(nn,nn) # the gramm matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg,xg])
quadraric_part = []
for ii in xrange(nn):
for indd in indici[ii][ii:]:
quadraric_part.append([indd[0],indd[1],temp[indd[0],indd[1]]])
The 'quadratic_part' is a list of the form [i,j,c_ij] where c_ij is the coefficient stored in temp. It will be passed to the function 'objective.set_quadratic_coefficients()' of the CPLEX Python API.
There is a wiser way to do that?
P.S. I have maybe a Memory problem, so It wold be better, instead store the whole list 'quadratic_part', call several times the function 'objective.set_quadratic_coefficients()'.... you know what I mean?!
Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library. Whereas, objective.set_quadratic_coefficients uses CPXXcopyqpsep.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex
nn = 5 # a small example size here
XXt = np.random.rand(nn,nn) # the gramm matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
# create symetric matrix
tempu = np.triu(temp) # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1] # copy upper into lower
ind = np.array([[x for x in range(nn)] for x in range(nn)])
qmat = []
for i in range(nn):
qmat.append([np.arange(nn), tempu[i]])
c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense so depending on the amount of memory you have, this technique may not scale. When it's possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.

Data mean by intervals in IDL

I did the following script to integrate (average) data by intervals in python:
# N = points to mean in the array
# data = original data
# data_mean = average data each N points
data_mean = np.array([np.mean(i) for i in np.array_split(data, len(data)/N)])
How could do that in IDL?
There are a "mean" function, but a "array_split-like"?
The array_split functionality is usually done via REFORM to create a two (or higher) dimensional array from a 1-dimensional array using the same values. So for example:
n = 20
data = randomu(seed, 100)
data = reform(data, 100 / n, n)
print, mean(data, dimension=2)
The IDL mean function is equivalent to the numpy mean function, and the IDL reform can be used similarly to the numpy array_split:
data_mean = mean(reform(data, n_elements(data) / N), dimension=2)
If you don't mind data ending up with different dimensions, you can greatly speed this up using the /overwrite keyword:
data_mean = mean(reform(data, n_elements(data) / N, /overwrite), dimension=2)
Finally, if you have a version of IDL before IDL 8.0, then you won't have the dimension keyword for the mean function. Use this (less elegant) pattern instead:
data_mean = total(reform(data, n_elements(data) / N), 2) / N
Note that this version with total also accepts the /nan keyword, so that it works even when some data are missing.

Efficiently get indices of histogram bins in Python

Short Question
I have a large 10000x10000 elements image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.
How do I extract the indices of each bin to efficiently perform my calculation using the bins values?
What I am looking for is a solution which avoids the bottleneck of having to select every time ind == j from my large array. Is there a way to obtain directly, in one go, the indices of the elements belonging to every bin?
Detailed Explanation
1. Straightforward Solution
One way to achieve what I need is to use code like the following (see e.g. THIS related answer), where I digitize my values and then have a j-loop selecting digitized indices equal to j like below
import numpy as np
# This function func() is just a placemark for a much more complicated function.
# I am aware that my problem could be easily sped up in the specific case of
# of the sum() function, but I am looking for a general solution to the problem.
def func(x):
y = np.sum(x)
return y
vals = np.random.random(1e8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = [func(vals[ind == j]) for j in range(1, nbins)]
This is not what I want as it selects every time ind == j from my large array. This makes this solution very inefficient and slow.
2. Using binned_statistics
The above approach turns out to be the same implemented in scipy.stats.binned_statistic, for the general case of a user-defined function. Using Scipy directly an identical output can be obtained with the following
import numpy as np
from scipy.stats import binned_statistics
vals = np.random.random(1e8)
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]
3. Using labeled_comprehension
Another Scipy alternative is to use scipy.ndimage.measurements.labeled_comprehension. Using that function, the above example would become
import numpy as np
from scipy.ndimage import labeled_comprehension
vals = np.random.random(1e8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = labeled_comprehension(vals, ind, np.arange(1, nbins), func, float, 0)
Unfortunately also this form is inefficient and in particular, it has no speed advantage over my original example.
4. Comparison with IDL language
To further clarify, what I am looking for is a functionality equivalent to the REVERSE_INDICES keyword in the HISTOGRAM function of the IDL language HERE. Can this very useful functionality be efficiently replicated in Python?
Specifically, using the IDL language the above example could be written as
vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)
for j=0, nbins-1 do begin
jbins = r[r[j]:r[j+1]-1] ; Selects indices of bin j
result[j] = func(vals[jbins])
endfor
The above IDL implementation is about 10 times faster than the Numpy one, due to the fact that the indices of the bins do not have to be selected for every bin. And the speed difference in favour of the IDL implementation increases with the number of bins.
I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster
import numpy as np
from scipy.sparse import csr_matrix
def binned_statistic(x, values, func, nbins, range):
'''The usage is nearly the same as scipy.stats.binned_statistic'''
N = len(values)
r0, r1 = range
digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
return [func(group) for group in np.split(S.data, S.indptr[1:-1])]
I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.
vals = np.random.random(1e4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
new_order = argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _,first_hit = unique(ind,return_index=True)
# faster way:
first_hit = searchsorted(ind,arange(1,nbins-1))
first_hit.sort()
#example of using the data:
for j in range(nbins-1):
#I am using a plotting function for your f, to show that they cluster
plot(ordered_vals[first_hit[j]:first_hit[j+1]],'o')
The figure shows that the bins are actually clusters as expected:
You can halve the computation time by sorting the array first, then use np.searchsorted.
vals = np.random.random(1e8)
vals.sort()
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
results = [func(vals[np.searchsorted(ind,j,side='left'):
np.searchsorted(ind,j,side='right')])
for j in range(1,nbins)]
Using 1e8 as my test case, I go from 34 seconds of computation to about 17.
One efficient solution is using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
npi.group_by(ind).split(vals)
Pandas has a very fast grouping code (I think it's written in C), so if you don't mind loading the library you could do that :
import pandas as pd
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').sum().values
or more generally :
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').agg(func).values
Although the latter is slower for standard aggregation functions
(like sum, mean, etc)

Moving average of an array in Python

I have an array where discreet sinewave values are recorded and stored. I want to find the max and min of the waveform. Since the sinewave data is recorded voltages using a DAQ, there will be some noise, so I want to do a weighted average. Assuming self.yArray contains my sinewave values, here is my code so far:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
for y in range (0,filtersize):
summation = sum(self.yArray[x+y])
ave = summation/filtersize
filterarray.append(ave)
My issue seems to be in the second for loop, where depending on my averaging window size (filtersize), I want to sum up the values in the window to take the average of them. I receive an error saying:
summation = sum(self.yArray[x+y])
TypeError: 'float' object is not iterable
I am an EE with very little experience in programming, so any help would be greatly appreciated!
The other answers correctly describe your error, but this type of problem really calls out for using numpy. Numpy will run faster, be more memory efficient, and is more expressive and convenient for this type of problem. Here's an example:
import numpy as np
import matplotlib.pyplot as plt
# make a sine wave with noise
times = np.arange(0, 10*np.pi, .01)
noise = .1*np.random.ranf(len(times))
wfm = np.sin(times) + noise
# smoothing it with a running average in one line using a convolution
# using a convolution, you could also easily smooth with other filters
# like a Gaussian, etc.
n_ave = 20
smoothed = np.convolve(wfm, np.ones(n_ave)/n_ave, mode='same')
plt.plot(times, wfm, times, -.5+smoothed)
plt.show()
If you don't want to use numpy, it should also be noted that there's a logical error in your program that results in the TypeError. The problem is that in the line
summation = sum(self.yArray[x+y])
you're using sum within the loop where your also calculating the sum. So either you need to use sum without the loop, or loop through the array and add up all the elements, but not both (and it's doing both, ie, applying sum to the indexed array element, that leads to the error in the first place). That is, here are two solutions:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
summation = sum(self.yArray[x:x+filtersize]) # sum over section of array
ave = summation/filtersize
filterarray.append(ave)
or
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
summation = 0.
for y in range (0,filtersize):
summation = self.yArray[x+y]
ave = summation/filtersize
filterarray.append(ave)
self.yArray[x+y] is returning a single item out of the self.yArray list. If you are trying to get a subset of the yArray, you can use the slice operator instead:
summation = sum(self.yArray[x:y])
to return an iterable that the sum builtin can use.
A bit more information about python slices can be found here (scroll down to the "Sequences" section): http://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy
You could use numpy, like:
import numpy
filtersize = 2
ysums = numpy.cumsum(numpy.array(self.yArray, dtype=float))
ylags = numpy.roll(ysums, filtersize)
ylags[0:filtersize] = 0.0
moving_avg = (ysums - ylags) / filtersize
Your original code attempts to call sum on the float value stored at yArray[x+y], where x+y is evaluating to some integer representing the index of that float value.
Try:
summation = sum(self.yArray[x:y])
Indeed numpy is the way to go. One of the nice features of python is list comprehensions, allowing you to do away with the typical nested for loop constructs. Here goes an example, for your particular problem...
import numpy as np
step=2
res=[np.sum(myarr[i:i+step],dtype=np.float)/step for i in range(len(myarr)-step+1)]

Categories