Strange results when scaling data using scikit learn

I have an input dataset that has 4 time series with 288 values for 80 days. So the actual shape is (80,4,288). I would like to cluster the different days. I have 80 days and all of them have 4 time series: outside temperature, solar radiation, electrical demand, electricity prices. What I want is to group similar days into clusters with regard to these 4 time series combined. Days belonging to the same cluster should have similar time series.
Before clustering the days using k-means or Ward's method, I would like to scale them using scikit learn. For this I have to transform the data into a 2 dimensional shape array with the shape (80, 4*288) = (80, 1152), as the Standard Scaler of scikit learn does not accept 3-dimensional input. The Standard Scaler just standardizes features by removing the mean and scaling to unit variance.
Now I scale this data using scikit learn's standard scaler:
import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";")
scaler = StandardScaler()
data_Scaled = scaler.fit_transform(data_Unscaled)
np.savetxt("C:/Users/User1/Desktop/data_Scaled.csv", data_Scaled, delimiter=";")
When I now compare the unscaled and scaled data e.g. for the first day (1 row) and the 4th time series (columns 864 - 1152 in the csv file), the results look quite strange as you can see in the following figure:
As far as I see it, they are not in line with each other. For example in the timeslots between 111 and 201 the unscaled data does not change at all whereas the scaled data fluctuates. I can't explain that. Do you have any idea why this is happening and why they don't seem to be in line?
Here is the unscaled input data with shape (80,1152): https://filetransfer.io/data-package/CfbGV9Uk#link
and here the scaled output of the scaling with shape (80,1152): https://filetransfer.io/data-package/23dmFFCb#link

You have two issues here: scaling and clustering. As the question title refers to scaling, I'll handle that one in detail. The clustering issue is probably better suited for CrossValidated.
You don't say it, but it seems natural that all temperatures, be it on day 1 or day 80, are measured on the same scale. The same holds for the other three variables. So, for the purpose of scaling you essentially have four time series.
StandardScaler, like basically everything in sklearn, expects your observations to be organised in rows and variables in columns. It treats each column separately, subtracting its mean from all the values in the column and dividing the resulting values by their standard deviation.
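To see why the original (80, 1152) layout produces the effect described in the question, here is a minimal sketch with made-up numbers. The array X_wide and its size are illustrative assumptions, not the asker's data. Because each time slot (column) gets its own mean and standard deviation, a day that is perfectly flat in the raw data no longer looks flat after scaling:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 5 "days" (rows) x 6 time slots (columns) of one variable.
rng = np.random.default_rng(0)
X_wide = rng.normal(loc=10.0, scale=3.0, size=(5, 6))
X_wide[0, :] = 7.0  # day 0 is completely flat in the raw data

# Column-wise scaling: every time slot gets its own mean and standard deviation.
X_wide_scaled = StandardScaler().fit_transform(X_wide)

raw_range = X_wide[0].max() - X_wide[0].min()
scaled_range = X_wide_scaled[0].max() - X_wide_scaled[0].min()
print(raw_range)     # the raw day is constant
print(scaled_range)  # the scaled day fluctuates
```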
I reckon from your data that the first 288 entries in each row correspond to one variable, the next 288 to the second one etc. You need to reshape these data to form 288*80=23040 rows and 4 columns, one for each variable.
You apply StandardScaler on that array and reformat the data into the original shape, with 80 rows and 4*288=1152 columns. The code below should do the trick:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";", header=None)
X = data_Unscaled.to_numpy()
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
scaler = StandardScaler()
X_narrow_scaled = scaler.fit_transform(X_narrow)
X_scaled = np.array([X_narrow_scaled[i*288:(i+1)*288, :].T.ravel() for i in range(80)])
# Plot the original data:
i=3
j=0
plt.plot(X[j, i*288:(i+1)*288])
plt.title('TimeSeries_Unscaled')
plt.show()
# plot the scaled data:
plt.plot(X_scaled[j, i*288:(i+1)*288])
plt.title('TimeSeries_Scaled')
plt.show()
resulting in the following graphs:
The line
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
uses a list comprehension to generate the four columns of the long, narrow array X_narrow. Basically, it is just a shorthand for a for-loop over your four variables. It takes the first 288 columns of X and flattens them into a vector, which it then puts into the first column of X_narrow. Then it does the same for the next 288 columns, X[:, 288:576], and then for the third and the fourth block of the 288 observed values per day. This way, each column in X_narrow contains a long time series, spanning 80 days (and 288 observations per day), of exactly one of your variables (outside temperature, solar radiation, electrical demand, electricity prices).
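As a cross-check, the same rearrangement can also be written with reshape and transpose instead of list comprehensions. This is only an equivalent sketch on stand-in data (the array X below is generated with np.arange and the names n_days, n_vars, n_slots are introduced here for illustration); it assumes the same column layout as above, with 288 values per variable stored one block after another in each row:

```python
import numpy as np

n_days, n_vars, n_slots = 80, 4, 288

# Stand-in for the real data: each row is one day, laid out as
# [var0 slots 0..287 | var1 slots 0..287 | var2 ... | var3 ...].
X = np.arange(n_days * n_vars * n_slots, dtype=float).reshape(n_days, n_vars * n_slots)

# List-comprehension version from the answer.
X_narrow_loop = np.array([X[:, i*n_slots:(i+1)*n_slots].ravel() for i in range(n_vars)]).T

# Reshape-based version: (days, vars, slots) -> (days, slots, vars) -> (days*slots, vars).
X_narrow = X.reshape(n_days, n_vars, n_slots).transpose(0, 2, 1).reshape(-1, n_vars)

# Inverse: back to the wide layout with one row per day.
X_back = X_narrow.reshape(n_days, n_slots, n_vars).transpose(0, 2, 1).reshape(n_days, -1)

print(X_narrow.shape, np.array_equal(X_narrow, X_narrow_loop), np.array_equal(X_back, X))
```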
Now, you might try to cluster X_scaled using K-means, but I doubt it will work. You have just 80 points in a 1152-dimensional space, so the curse of dimensionality will almost certainly kick in. You'll most probably need to perform some kind of dimensionality reduction, but, as I noted above, that's a different question.
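For orientation only, one common way to reduce the dimensionality before clustering is PCA. The sketch below is an assumption about how that could look, not a recommendation tuned to this data: the input is random placeholder numbers, and both the number of components (10) and the number of clusters (4) are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder for the scaled (80, 1152) array; random numbers, not real data.
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(80, 1152))

# Project the 80 days onto a few principal components (10 is an arbitrary choice).
X_reduced = PCA(n_components=10).fit_transform(X_scaled)

# Cluster the days in the reduced space (4 clusters is an arbitrary choice).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape, labels.shape)
```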

Related

Correlation between two dataframes / curves

I have yield curve data for two currencies (Euro and U.S. Dollar). For each of these currencies I have 16 variables (16 maturities). I have generated, using some model, synthetic curves and I want to relate the curves of the two currencies. That is, what is the correlation between the two currencies? I am asking this question as my model should capture this correlation. For example, it wouldn't be great if a 4% euro curve is generated and at the same time a -4% level curve is generated for the dollar. How can I do this? I don't like a correlation matrix as this yields a 16x16 matrix per model (I have multiple). Any thoughts?! Could be very helpful.

Just build a new DataFrame out of the two other DataFrames with the Series you want and call corr?
It is hard to provide an exact solution (and I may not understand your problem fully), as you didn't provide any information on what the data or your current code looks like
import pandas as pd
# your code so far...
df_corr = pd.DataFrame()
df_corr['eur_curve'] = df_eur # assuming you have a df_eur with the curve and 16 "variables"
df_corr['usd_curve'] = df_usd # assuming you have a df_usd with the curve and 16 "variables"
corr = df_corr.corr()
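If a full 16x16 matrix is too much, one possible way to get a single number per model is to summarise each generated curve by its level (the average rate across the 16 maturities) and correlate the levels of the two currencies. The sketch below is only an illustration under that assumption. The data are synthetic, and the shapes (500 simulated curves per currency) and the names common, levels and level_corr are made up for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 500 simulated curves x 16 maturities per currency.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))  # shared driver so the two currencies co-move
df_eur = pd.DataFrame(0.02 + 0.01 * common + 0.002 * rng.normal(size=(500, 16)))
df_usd = pd.DataFrame(0.03 + 0.01 * common + 0.002 * rng.normal(size=(500, 16)))

# Summarise each curve by its level: the average rate across the 16 maturities.
levels = pd.DataFrame({
    "eur_level": df_eur.mean(axis=1),
    "usd_level": df_usd.mean(axis=1),
})

# One number per model instead of a 16x16 matrix.
level_corr = levels["eur_level"].corr(levels["usd_level"])
print(level_corr)
```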

How to determine multiple Periodicities present in Timeseries data?

My objective is to detect all kinds of seasonalities and their time periods that are present in a timeseries waveform.
I'm currently using the following dataset:
https://www.kaggle.com/rakannimer/air-passengers
At the moment, I've tried the following approaches:
1) Use of FFT:
import pandas as pd
import numpy as np
import scipy.fft
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
#https://www.kaggle.com/rakannimer/air-passengers
df=pd.read_csv('AirPassengers.csv')
df.head()
frequency_eval_max = 100
A_signal_rfft = scipy.fft.rfft(df['#Passengers'], n=frequency_eval_max)
n = np.shape(A_signal_rfft)[0] # np.size(t)
frequencies_rel = len(A_signal_rfft)/frequency_eval_max * np.linspace(0,1,int(n))
fig=plt.figure(3, figsize=(15,6))
plt.clf()
plt.plot(frequencies_rel, np.abs(A_signal_rfft), lw=1.0, c='paleturquoise')
plt.stem(frequencies_rel, np.abs(A_signal_rfft))
plt.xlabel("frequency")
plt.ylabel("amplitude")
This results in the following plot:
But it doesn't result in anything conclusive or comprehensible.
Ideally I wish to see the peaks representing daily, weekly, monthly and yearly seasonality.
Could anyone point out what I am doing wrong?
2) Autocorrelation:
from pandas.plotting import autocorrelation_plot
plt.rcParams.update({'figure.figsize':(10,6), 'figure.dpi':120})
autocorrelation_plot(df['#Passengers'].tolist())
After doing which I get a plot like the following:
But how do I read this plot and how can I derive the presence of the various seasonalities and their periods from this?
3) SLT Decomposition Algorithm
df.set_index('Month',inplace=True)
df.index=pd.to_datetime(df.index)
#drop null values
df.dropna(inplace=True)
df.plot()
result=seasonal_decompose(df['#Passengers'], model='multiplicative', period=12)
result.seasonal.plot()
This gives the following plot:
But here I can only see one kind of seasonality.
So how do we detect all the types of seasonalities and their time periods that are present using this method?
Hence, I've tried 3 different approaches but they seem either erroneous or incomplete.
Could anyone please help me out with the most effective approach (even apart from the ones I've tried) to detect all kinds of seasonalities and their time periods for any given timeseries data?

I still think a Fourier analysis is the way to go, it's just that the 0-frequency result is shadowing any insight.
The zero-frequency bin is essentially the sum of your data set (its average times the number of samples), and since all records are positive it is large, far from the typical zero-mean sinusoidal function you would analyze with Fourier transforms. So simply subtract the average of your dataset from your dataset before doing the FFT and see how it looks. This would also help with the autocorrelation technique.
Also, you MUST give units to your frequency values. Do not settle for the raw values from the FFT. Those are related to the sampling frequency and span of your dataset. Reason about it and adequately label the daily, weekly, monthly and annual frequencies in your chart.
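A minimal sketch of both suggestions (removing the mean and attaching units to the frequency axis) could look like this. It uses a synthetic monthly series with a 12-month cycle rather than the AirPassengers file, and it assumes one sample per month, so the frequencies come out in cycles per month:

```python
import numpy as np

# Synthetic monthly series (not the AirPassengers data): 12 years with a yearly cycle.
n_months = 144
t = np.arange(n_months)
signal = 100.0 + 10.0 * np.sin(2 * np.pi * t / 12)

# Remove the mean so the zero-frequency bin no longer dominates the plot.
centered = signal - signal.mean()

spectrum = np.abs(np.fft.rfft(centered))
# d=1.0 means one sample per month, so freqs are in cycles per month.
freqs = np.fft.rfftfreq(n_months, d=1.0)

peak = np.argmax(spectrum[1:]) + 1  # skip the zero-frequency bin
period_months = 1.0 / freqs[peak]
print(period_months)  # period of the strongest component, in months
```

With real data that also has a trend, the trend would typically have to be removed as well before the seasonal peaks stand out.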

Using FFT, you can get the fundamental frequency. You can then use a low-pass filter or just manually select the first n frequencies. These frequencies will correspond to the 'seasonalities'. Transform your filtered FFT back into the time domain and you can visualize the most basic underlying repetitions. You can easily calculate the time period of those repetitions and visualize it by individually plotting F0, F1, ... in the time domain.
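As a rough illustration of filtering in the frequency domain and transforming back, the sketch below keeps only the k strongest frequency bins (a variation on selecting the first n frequencies) and rebuilds the signal with the inverse FFT. The series is synthetic, and the periods (12 and 60 samples), the noise level and k=2 are assumptions chosen for the example:

```python
import numpy as np

# Synthetic series with two periodic components plus noise.
rng = np.random.default_rng(0)
n = 240
t = np.arange(n)
signal = 3.0 * np.sin(2 * np.pi * t / 12) + 1.0 * np.sin(2 * np.pi * t / 60)
noisy = signal + 0.2 * rng.normal(size=n)

# Spectrum of the mean-removed series.
spectrum = np.fft.rfft(noisy - noisy.mean())

# Keep only the k strongest bins and zero out everything else.
k = 2
strongest = np.argsort(np.abs(spectrum))[-k:]
filtered = np.zeros_like(spectrum)
filtered[strongest] = spectrum[strongest]

# Back to the time domain: the dominant repetitions without the noise.
reconstructed = np.fft.irfft(filtered, n=n)

# Period of each kept component, in samples (bin index b means b cycles per n samples).
periods = n / strongest
print(sorted(periods.tolist()))
```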

Clarification of axis used for min/max scaling for data I have

I am building a neural network with keras and need clarification for the pre-processing step.
I have a dataframe with 1-n rows (the features for the machine learning algorithm to learn from) and 1-n columns, each column being a sample.
My data is already correctly log-transformed and I simply need to squish it to between 0 and 1. I am using minmax_scale from sklearn and the processing of my data is as follows:
##transpose counts::rows samples and cols features (to correct format for NN)
counts = normCounts.transpose()
##scale counts
scaled = preprocessing.minmax_scale(counts, feature_range=(0,1))
I need clarification on which way around the dataframe needs to be. Reading the minmax documentation on sklearn says that the data are scaled along axis=0
does this mean:
featurecolumn1, featurecolumn2...
sample1 ->
sample2 ->
??
Basically, what I need to ensure is that low counts on the original dataframe are closer to the 0 and the upper end on the dataframe are closer to one... however, I am unsure now as to whether this should actually be the other way around... so not transposing the dataframe from the beginning:
sample1, sample2...
feature1 ->
feature2 ->
Would really appreciate clarification here!
Thank you.

Well, whenever we talk about scaling, we usually want to scale the features. So if your data is in a format like
feature 1 feature 2
sample 1
sample 2
Then you should apply
minmax_scale(data) # By default axis is 0
And if your data is in format like
sample 1 sample 2
feature 1
feature 2
Then you should apply
minmax_scale(data, axis=1)
With axis=0 it scales each column of the dataframe independently, and with axis=1 it scales each row independently. Whatever the wording in the documentation, that is how it behaves.
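A tiny example makes the two cases concrete. The matrix below is made up (3 samples, 2 features); scaling it with the default axis=0, or scaling its transpose with axis=1, maps each feature to the range 0 to 1 in the same way:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Toy matrix: 3 samples (rows) x 2 features (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 30.0],
                 [3.0, 50.0]])

# Samples in rows: the default axis=0 scales each column (feature).
by_feature = minmax_scale(data)

# Same data with features in rows: axis=1 scales each row (feature).
by_feature_T = minmax_scale(data.T, axis=1)

print(by_feature)
print(by_feature_T)
```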
EDIT
Reply to your comment -
You can do standard scaling, where the data values will be mean-centered with unit standard deviation
from sklearn.preprocessing import StandardScaler
scale = StandardScaler().fit(data)
scale.transform(data)

Python dask_ml linear regression Multiple constant columns detected error

I am using Python with dask to create a logistic regression model, in order to speed things up when training.
I have x that is the feature array (numpy array) and y that is a label vector.
edit:
The numpy arrays are: x_train, an (n*m) array of floats, and y_train, an (n*1) vector of integers that are the labels for training. Both work fine with sklearn's LogisticRegression.fit.
I tried to use this code to create a pandas df, then convert it to a dask ddf and train on it, as shown here
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd, sd["label"])
But getting an error
Could not find signature for add_intercept:
I found this issue on GitHub, which suggests using this code instead
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.values, sd["label"])
But I get this error
ValueError: Multiple constant columns detected!
How can I use dask to train a logistic regression over data originated from a numpy array?
Thanks.

You can bypass std verification by using
lr = LogisticRegression(solver_kwargs={"normalize":False})
Or you can use @Emptyless's code to get the faulty column_indices and then remove those columns from your array.

This does not seem like an issue with dask_ml. Looking at the source, the std is calculated using:
mean, std = da.compute(X.mean(axis=0), X.std(axis=0))
This means that for every column in your provided array, dask_ml calculates the standard deviation. If the standard deviation of one of those columns is equal to zero (np.where(std == 0)), that column has zero variation.
A column with zero variation carries no information for training, so it needs to be removed prior to training the model (in a data preparation / cleansing step).
You can quickly check which columns have no variation by checking the following:
import numpy as np
std = sd.std(axis=0).compute()  # sd is the dask DataFrame; compute() returns a pandas Series
column_indices = np.where(std == 0)
print(column_indices)
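Since the data originate from a numpy array, the same check and the removal step can also be done on that array before handing it to dask. The sketch below is only an illustration: the values in x_train are made up, and the names constant_cols and x_clean are introduced here.

```python
import numpy as np

# Toy feature matrix; columns 1 and 3 are constant.
x_train = np.array([[0.5, 1.0, 2.0, 7.0],
                    [1.5, 1.0, 4.0, 7.0],
                    [2.5, 1.0, 8.0, 7.0]])

# Standard deviation of every column; zero means the column never changes.
std = x_train.std(axis=0)
constant_cols = np.where(std == 0)[0]

# Drop the constant columns before training.
x_clean = np.delete(x_train, constant_cols, axis=1)
print(constant_cols, x_clean.shape)
```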

A little late to the party but here I go anyway. Hope future readers appreciate it. This answer is for the Multiple Columns error.
A Dask DataFrame is split up into many Pandas DataFrames. These are called partitions. If you set your npartitions to 1 it should have exactly the same effect as scikit-learn. If you set it to more partitions it splits the data into multiple DataFrames, but I found it changes the shape of the DataFrames, which in the end resulted in the Multiple Columns error. It also might cause an overflow warning. Unfortunately it is not in my interest to investigate the direct cause of this error. It might simply be because the DataFrame is too large or too small.
A source for partitioning
Below the errors for search engine indexing:
ValueError: Multiple constant columns detected!
RuntimeWarning: overflow encountered in exp return np.exp(A)

Python- np.mean() giving wrong means?

The issue
So I have 50 netCDF4 data files that contain decades of monthly temperature predictions on a global grid. I'm using np.mean() to make an ensemble average of all 50 data files together while preserving time length & spatial scale, but np.mean() gives me two different answers. The first time I run its block of code, it gives me a number that, when averaged over latitude & longitude & plotted against the individual runs, is slightly lower than what the ensemble mean should be. If I re-run the block, it gives me a different mean which looks correct.
The code
I can't copy every line here since it's long, but here's what I do for each run.
#Historical (1950-2020) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_195001-202012.nc") #Import data file
tash1 = ncin_1.variables['tas'][:] #extract tas (temperature) variable
ncin_1.close() #close to save memory
#Repeat for future (2021-2100) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_202101-210012.nc")
tasr1 = ncin_1.variables['tas'][:]
ncin_1.close()
#Concatenate historical & future files together to make one time series array
tas11 = np.concatenate((tash1,tasr1),axis=0)
#Subtract the 1950-1979 mean to obtain anomalies
tas11 = tas11 - np.mean(tas11[0:360],axis=0,dtype=np.float64)  # 30 years x 12 months = 360 months
And I repeat that 49 times more for other datasets. Each tas11, tas12, etc file has the shape (1812, 64, 128) corresponding to time length in months, latitude, and longitude.
To get the ensemble mean, I do the following.
#Move all tas data to one array
alltas = np.zeros((1812,64,128,51)) #years, lat, lon, members (no ensemble mean value yet)
alltas[:,:,:,0] = tas11
(...)
alltas[:,:,:,49] = tas50
#Calculate ensemble mean & fill into 51st slot in axis 3
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
When I check a coordinate & month, the ensemble mean is off from what it should be. Here's what a plot of globally averaged temperatures from 1950-2100 looks like with the first mean (with monthly values averaged into annual values). The black line is the ensemble mean & the colored lines are individual runs.
Obviously that deviated below the real ensemble mean. Here's what the plot looks like when I run alltas[:,:,:,50]=np.mean(alltas,axis=3,dtype=np.float64) a second time & keep everything else the same.
Much better.
The question
Why does np.mean() calculate the wrong value the first time? I tried specifying the data type as a float when using np.mean() like in this question- Wrong numpy mean value?
But it didn't work. Any way I can fix it so it works correctly the first time? I don't want this problem to occur on a calculation where it's not so easy to notice a math error.

In the line
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
the argument to mean should be alltas[:,:,:,:50]:
alltas[:,:,:,50] = np.mean(alltas[:,:,:,:50], axis=3, dtype=np.float64)
Otherwise you are including those final zeros in the calculation of the ensemble means.
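A scaled-down example shows both the error and why the second run looks almost right. The array below is a toy stand-in for alltas (2 points, 50 members that all equal 1.0, plus the empty 51st slot). On the first run the zero slot is averaged in; on the second run that slot already holds the first (slightly low) mean, so the result lands very close to the true value without being exactly correct:

```python
import numpy as np

# Toy version of alltas: 50 members that all equal 1.0, plus an empty 51st slot.
alltas = np.zeros((2, 51))
alltas[:, :50] = 1.0

# First run: the zero slot is included, so the mean is 50/51 instead of 1.
first = np.mean(alltas, axis=1)
alltas[:, 50] = first

# Second run: the slot now holds 50/51, so the mean is (50 + 50/51)/51 = 2600/2601.
second = np.mean(alltas, axis=1)

# Averaging only the 50 members gives the intended ensemble mean.
correct = np.mean(alltas[:, :50], axis=1)
print(first, second, correct)
```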
