Custom binning in sklearn.preprocessing? - python

I have a list of continuous variables called size_array. I've been scaling them to [0, 1] like this:
max_abs_scaler = preprocessing.MinMaxScaler()
scaled = max_abs_scaler.fit_transform(size_array)
Is there a way to scale them to the range [-1, 1] so that the median (or a chosen percentile) maps to 0? My data is right skewed, so the values above the median are spread out a lot while the values below the median are not. I tried scaling them with this method:
def using_median(x):
    if x >= median:
        return (x - median) / (max - median)
    else:
        return (median - x) / (median - min)
But that didn't work. Is there any other way to do this with sklearn.preprocessing?

I would recommend using PowerTransformer(). It can work very well for skewed distributions.
Check out this example:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
pt = preprocessing.PowerTransformer()
X_lognormal = np.random.RandomState(616).lognormal(size=(300, 2))
_, ax = plt.subplots(1, 2, sharey=True)
ax[0].hist(X_lognormal)
ax[1].hist(pt.fit_transform(X_lognormal))
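That said, if you specifically need the median (or another percentile) mapped to 0 on a [-1, 1] range, the piecewise scaling from the question can be written directly with NumPy. A minimal sketch (plain NumPy, not an sklearn transformer; scale_around_median is just an illustrative name):
import numpy as np

def scale_around_median(x):
    # Values above the median land in (0, 1], values below in [-1, 0),
    # and the median itself maps to exactly 0
    x = np.asarray(x, dtype=float)
    med, lo, hi = np.median(x), x.min(), x.max()
    return np.where(x >= med, (x - med) / (hi - med), (x - med) / (med - lo))

size_array = np.random.lognormal(size=100)  # stand-in for your data
scaled = scale_around_median(size_array)
print(scaled.min(), np.median(scaled), scaled.max())  # approx. -1, 0, 1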

Related

Identifying extremes in pandas timeseries

I'm looking for a way to identify the local extremes in a pandas timeseries.
An MWE would be
import math
import matplotlib.pyplot as plt
import pandas as pd
sin_list = []
for i in range(200):
sin_list.append(math.sin(i / 10) + i / 100)
idx = pd.date_range('2018-01-01', periods=200, freq='H')
ts = pd.Series(sin_list, index=idx)
ts.plot(style='.')
plt.show()
In the resulting figure (not reproduced here), red lines mark the timestamps I want to identify. Note that there are, of course, finite steps in this series.
A possible solution could be to fit a curve to the data, differentiate it, and then identify the exact places where the gradient is 0. That seems like a lot of effort to program myself, and I assume such an implementation exists somewhere.
I developed a solution based on the .diff() method. The crucial parameter is the p factor of the get_percentile function. Since the finite number of samples means the gradient never exactly reaches 0, the selection has to be blurred a bit: the fewer values there are, the higher p must be. In my solution, 0.05 proved large enough to identify the extremes, yet small enough to locate them with reasonable accuracy.
This is the code:
import copy
import math
import matplotlib.pyplot as plt
import pandas as pd
def get_percentile(data: list, p: float):
    _data = copy.copy(data)
    _data.sort()
    return _data[math.floor(len(_data) * p) - 1]

sin_list = []
for i in range(200):
    sin_list.append(math.sin(i / 10) + i / 100)
idx = pd.date_range('2018-01-01', periods=200, freq='H')
ts = pd.Series(sin_list, index=idx)
gradient_ts = abs(ts.diff())
percentile = get_percentile(gradient_ts.values, p=0.05)
binary_ts = gradient_ts.where(gradient_ts > percentile, 1).where(gradient_ts < percentile, 0)
fig, ax = plt.subplots()
binary_ts.plot(drawstyle="steps", ax=ax)
ax.fill_between(binary_ts.index, binary_ts, facecolor='green', alpha=0.5, step='pre')
ts.plot(secondary_y=True, style='.')
plt.show()
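As an aside, since the series is smooth, the sampled local extremes can also be picked out directly with scipy.signal.argrelextrema (the same function that appears in the "Find local maximums in numpy array" answers below). A minimal sketch, assuming the ts built above:
import numpy as np
from scipy.signal import argrelextrema

values = ts.values
max_idx = argrelextrema(values, np.greater)[0]  # indices of local maxima
min_idx = argrelextrema(values, np.less)[0]     # indices of local minima
print(ts.index[max_idx])
print(ts.index[min_idx])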

sklearn's KMeans: Cluster centers and cluster means differ. Numerical Imprecision?

I've noticed that when using sklearn.cluster.KMeans to obtain clusters, the cluster centers, from the method .cluster_centers_, and computing means manually for each cluster don't seem to give exactly the same answer.
For small sample sizes the difference is very small and probably within float imprecision. But for larger samples:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
[[ 0.00217333]
[ 0.00223798]]
Doing the same thing for different sample sizes produces similar differences (plot omitted).
This seems too much to be floating point imprecision. Are cluster centers not means, or what's going on here?
I think it may be related to the tolerance of KMeans. The default value is 1e-4, so setting a lower value, e.g. tol=1e-8, gives:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2, tol=1e-8).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
                       0
label
0       9.99200722e-16
1       1.11022302e-16
Hope it helps.
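If you want to double-check without pandas, a minimal sketch of the same comparison in plain NumPy (assuming the fitted cluster and x_z from the snippet above):
import numpy as np

# Recompute each cluster's mean directly from the final label assignment
manual_centers = np.array([x_z[cluster.labels_ == k].mean(axis=0)
                           for k in range(cluster.n_clusters)])
print(np.abs(manual_centers - cluster.cluster_centers_))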

What's the difference between pandas ACF and statsmodel ACF?

I'm calculating the Autocorrelation Function for a stock's returns. To do so I tested two functions, the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf
ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)
ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)
ticker_data_acf_1 = acf(ticker_data)[1:32]
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1,32)]
test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Statsmodels Autocorr', 'Pandas Autocorr']
test_df.index += 1
test_df.plot(kind='bar')
What I noticed was that the values they produced weren't identical (plot omitted):
What accounts for this difference, and which values should be used?
The difference between the Pandas and statsmodels versions lies in the mean subtraction and the normalization / variance division:
autocorr does nothing more than pass subseries of the original series to np.corrcoef. Inside that method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient.
acf, in contrast, uses the overall sample mean and sample variance of the series to determine the correlation coefficient.
The differences may get smaller for longer time series, but they are quite big for short ones.
Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlab's xcorr (cross-correlation) of the series with its lagged self, rather than Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).
See this MWE for clarification:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
import matplotlib.pyplot as plt
plt.style.use("seaborn-colorblind")
def autocorr_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    # Subtract the subseries means
    sum_product = np.sum((y1 - np.mean(y1)) * (y2 - np.mean(y2)))
    # Normalize with the subseries stds
    return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))

def acf_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    sum_product = np.sum((y1 - np.mean(x)) * (y2 - np.mean(x)))
    # Normalize with var of whole series
    return sum_product / ((len(x) - lag) * np.var(x))
x = np.linspace(0,100,101)
results = {}
nlags=10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, unbiased=True, nlags=nlags-1)
pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()
Statsmodels uses np.correlate to optimize this, but this is basically how it works.
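For intuition, here is a rough sketch of that correlate-based route (a simplification matching the default biased normalization, not the exact statsmodels implementation):
import numpy as np

def acf_via_correlate(x, nlags):
    # Demean once with the overall sample mean, as acf does
    xm = x - np.mean(x)
    n = len(x)
    # The second half of the full correlation holds the lag-0, lag-1, ...
    # autocovariance sums; dividing by n gives the biased sample autocovariance
    acov = np.correlate(xm, xm, mode="full")[n - 1:] / n
    return acov[:nlags + 1] / acov[0]

x = np.linspace(0, 100, 101)
print(acf_via_correlate(x, nlags=9))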
As suggested in comments, the problem can be decreased, but not completely resolved, by supplying unbiased=True to the statsmodels function. Using a random input:
import statistics
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
DATA_LEN = 100
N_TESTS = 100
N_LAGS = 32
def test(unbiased):
    data = pd.Series(np.random.random(DATA_LEN))
    data_acf_1 = acf(data, unbiased=unbiased, nlags=N_LAGS)
    data_acf_2 = [data.autocorr(i) for i in range(N_LAGS + 1)]
    # return difference between results
    return sum(abs(data_acf_1 - data_acf_2))

for value in (False, True):
    diffs = [test(value) for _ in range(N_TESTS)]
    print(value, statistics.mean(diffs))
Output:
False 0.464562410987
True 0.0820847168593
In the following example, the Pandas autocorr() function gives the expected results but the statsmodels acf() function does not.
Consider the following series:
import pandas as pd
s = pd.Series(range(10))
We expect perfect correlation between this series and any of its lagged versions, and that is indeed what we get with the autocorr() function:
[ s.autocorr(lag=i) for i in range(10) ]
# [0.9999999999999999, 1.0, 1.0, 1.0, 1.0, 0.9999999999999999, 1.0, 1.0, 0.9999999999999999, nan]
But using acf() we get a different result:
from statsmodels.tsa.stattools import acf
acf(s)
# [ 1. 0.7 0.41212121 0.14848485 -0.07878788
# -0.25757576 -0.37575758 -0.42121212 -0.38181818 -0.24545455]
If we try acf with adjusted=True, the result is even more unexpected, because for some lags the result is less than -1 (note that a correlation has to be in [-1, 1]):
acf(s, adjusted=True) # 'unbiased' is deprecated and 'adjusted' should be used instead
# [ 1. 0.77777778 0.51515152 0.21212121 -0.13131313
# -0.51515152 -0.93939394 -1.4040404 -1.90909091 -2.45454545]

Find local maximums in numpy array

I am looking to find the peaks in some Gaussian-smoothed data that I have. I have looked at some of the available peak-detection methods, but they require an input range over which to search, and I want this to be more automated than that. These methods are also designed for non-smoothed data. As my data are already smoothed, I need a much simpler way of retrieving the peaks. My raw and smoothed data are shown in the graph below (not reproduced here).
Essentially, is there a pythonic way of retrieving the max values from the array of smoothed data such that an array like
a = [1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1]
would return:
r = [5,3,6]
There is a built-in function, argrelextrema, that gets this task done:
import numpy as np
from scipy.signal import argrelextrema
a = np.array([1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1])
# determine the indices of the local maxima
max_ind = argrelextrema(a, np.greater)
# get the actual values using these indices
r = a[max_ind] # array([5, 3, 6])
That gives you the desired output for r.
As of SciPy version 1.1, you can also use find_peaks. Below are two examples taken from the documentation itself.
Using the height argument, one can select all maxima above a certain threshold (in this example, all non-negative maxima). This can be very useful if one has to deal with a noisy baseline; if you want to find minima, just multiply your input by -1:
import matplotlib.pyplot as plt
from scipy.misc import electrocardiogram
from scipy.signal import find_peaks
import numpy as np
x = electrocardiogram()[2000:4000]
peaks, _ = find_peaks(x, height=0)
plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.plot(np.zeros_like(x), "--", color="gray")
plt.show()
Another extremely helpful argument is distance, which defines the minimum distance between two peaks:
peaks, _ = find_peaks(x, distance=150)
# difference between peaks is >= 150
print(np.diff(peaks))
# prints [186 180 177 171 177 169 167 164 158 162 172]
plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.show()
If your original data is noisy, then using statistical methods is preferable, as not all peaks are going to be significant. For your a array (as a NumPy array, so the boolean indexing works), a possible solution is to use double differentials:
peaks = a[1:-1][np.diff(np.diff(a)) < 0]
# peaks = array([5, 3, 6])
>>> import numpy as np
>>> from scipy.signal import argrelextrema
>>> a = np.array([1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1])
>>> argrelextrema(a, np.greater)
(array([ 4, 10, 17]),)
>>> a[argrelextrema(a, np.greater)]
array([5, 3, 6])
If your input represents a noisy distribution, you can try smoothing it with NumPy's convolve function first, as sketched below.
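A minimal sketch of that idea (the noisy signal here is made up for illustration):
import numpy as np
from scipy.signal import argrelextrema

noisy = np.sin(np.linspace(0, 6 * np.pi, 200)) + 0.3 * np.random.randn(200)
window = 9
# Moving-average smoothing via convolution with a normalized box kernel
smoothed = np.convolve(noisy, np.ones(window) / window, mode="same")
# order=10 requires each maximum to dominate 10 points on either side
print(argrelextrema(smoothed, np.greater, order=10)[0])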
If you can exclude maxima at the edges of the array, you can always check whether an element is bigger than each of its neighbors:
import numpy as np
array = np.array([1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1])
# Check that each inner element is bigger than both of its neighbors (edges excluded)
mask = (array[1:-1] > array[:-2]) & (array[1:-1] > array[2:])
# Values of the maxima
print(array[1:-1][mask])
# Locations of the maxima
print(np.arange(1, array.size - 1)[mask])

binning data in python with scipy/numpy

Is there a more efficient way to take an average of an array in prespecified bins? For example, I have an array of numbers and an array corresponding to bin start and end positions in that array, and I just want to take the mean in those bins. I have code that does it below, but I am wondering how it can be cut down and improved. Thanks.
from scipy import *
from numpy import *
def get_bin_mean(a, b_start, b_end):
    ind_upper = nonzero(a >= b_start)[0]
    a_upper = a[ind_upper]
    a_range = a_upper[nonzero(a_upper < b_end)[0]]
    mean_val = mean(a_range)
    return mean_val

data = rand(100)
bins = linspace(0, 1, 10)
binned_data = []
for n in range(0, len(bins) - 1):
    b_start = bins[n]
    b_end = bins[n + 1]
    binned_data.append(get_bin_mean(data, b_start, b_end))
print(binned_data)
It's probably faster and easier to use numpy.digitize():
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
An alternative to this is to use numpy.histogram():
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])
Try for yourself which one is faster... :)
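If you do want to time them, a quick sketch with timeit (numbers will depend on your machine):
import timeit

setup = """
import numpy
data = numpy.random.random(100000)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
"""
# Mean per bin via boolean masks on the digitized labels
print(timeit.timeit("[data[digitized == i].mean() for i in range(1, len(bins))]",
                    setup=setup, number=100))
# Mean per bin via two histogram calls (weighted sums divided by counts)
print(timeit.timeit("numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0]",
                    setup=setup, number=100))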
The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.
For the same example as in the previous answers, the Scipy solution would be
import numpy as np
from scipy.stats import binned_statistic
data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]
Not sure why this thread got necroed; but here is a 2014 approved answer, which should be far faster:
import numpy as np
data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins + 1, True).astype(int)
counts = np.diff(slices)
mean = np.add.reduceat(data, slices[:-1]) / counts
print(mean)
The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:
import numpy_indexed as npi
print(npi.group_by(np.digitize(data, bins)).mean(data))
This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)
I would add, also in answer to the question "find mean bin values using histogram2d python", that SciPy has a function specifically designed to compute a bidimensional binned statistic for one or more sets of data:
import numpy as np
from scipy.stats import binned_statistic_2d
x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic
The function scipy.stats.binned_statistic_dd is a generalization of this function for higher-dimensional datasets; a short sketch follows.
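A minimal sketch of the dd variant (the sample array just needs one row per point and one column per dimension):
import numpy as np
from scipy.stats import binned_statistic_dd

xyz = np.random.rand(100, 3)        # three coordinates per sample
values = np.random.rand(100)
result = binned_statistic_dd(xyz, values, statistic='mean', bins=5)
print(result.statistic.shape)       # (5, 5, 5) grid of bin means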
Another alternative is to use ufunc.at. This method applies the desired operation in place at specified indices.
We can get the bin position of each datapoint using the searchsorted method.
Then we use add.at to increment the histogram by 1 at the position given by each entry of bin_indexes.
import numpy as np

np.random.seed(1)
data = np.random.random(100) * 100
bins = np.linspace(0, 100, 10)
histogram = np.zeros_like(bins)
bin_indexes = np.searchsorted(bins, data)
np.add.at(histogram, bin_indexes, 1)
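The snippet above only accumulates counts. A small extension of the same idea (reusing data, bins, bin_indexes, and histogram from above) also accumulates per-bin sums, so the bin means the question asks for can be read off:
sums = np.zeros_like(bins)
np.add.at(sums, bin_indexes, data)      # per-bin sums, accumulated in place
with np.errstate(invalid='ignore'):
    bin_means = sums / histogram        # NaN for empty bins
print(bin_means)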
