Python dask_ml logistic regression: Multiple constant columns detected error

I am using Python with dask to create a logistic regression model, in order to speed up training.
I have x, the feature array (a numpy array), and y, a label vector.
Edit:
The numpy arrays are x_train, an (n*m) array of floats, and y_train, an (n*1) vector of integer labels for training. Both work fine with sklearn's LogisticRegression.fit.
I tried to use this code to create a pandas df, convert it to a dask ddf, and train on it as shown here:
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd, sd["label"])
But I get this error:
Could not find signature for add_intercept:
I found this issue on GitHub,
which suggests using this code instead:
from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.values, sd["label"])
But I get this error
ValueError: Multiple constant columns detected!
How can I use dask to train a logistic regression on data that originates from a numpy array?
Thanks.

You can bypass the std check by using
lr = LogisticRegression(solver_kwargs={"normalize":False})
Or you can use @Emptyless's code to get the faulty column_indices
and then remove those columns from your array.

This does not seem like an issue with dask_ml. Looking at the source, the std is calculated using:
mean, std = da.compute(X.mean(axis=0), X.std(axis=0))
This means that dask_ml calculates the standard deviation of every column in the array you provide. If the standard deviation of a column is equal to zero (np.where(std == 0)), that column has zero variation, i.e. it is constant.
A column with zero variation carries no information for training. As the error message says, the check fails when more than one such column is found, so these columns need to be removed before training the model (in a data preparation / cleansing step).
You can quickly check which columns have no variation by checking the following:
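import numpy as np
# sd is the dask DataFrame from the question; compute() materialises the lazy result
std = sd.std(axis=0).compute()
# positional indices of the columns whose standard deviation is zero
column_indices = np.where(std == 0)[0]
print(column_indices)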
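Once the constant columns are known, dropping them is a one-liner. The following is only a minimal sketch on a plain numpy array; the small x_train below is made-up data standing in for the real feature array from the question.

```python
import numpy as np

# Hypothetical stand-in for x_train: columns 1 and 3 are constant.
x_train = np.array([
    [0.5, 1.0, 2.0, 7.0],
    [1.5, 1.0, 3.0, 7.0],
    [2.5, 1.0, 5.0, 7.0],
])

std = x_train.std(axis=0)
constant_columns = np.where(std == 0)[0]  # positional indices of constant columns
x_clean = x_train[:, std > 0]             # keep only the columns that vary

print(constant_columns)
print(x_clean.shape)
```

The cleaned array can then be turned into a dask collection as before (for example with dask.array.from_array and a chunks argument of your choice) and passed to fit.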

A little late to the party, but here I go anyway. I hope future readers appreciate it. This answer is for the Multiple constant columns error.
A Dask DataFrame is split up into many pandas DataFrames, which are called partitions. If you set npartitions to 1, it should have exactly the same effect as scikit-learn. If you set more partitions, the data is split across multiple DataFrames, and I found that this changes the shape of the DataFrames, which in the end resulted in the Multiple constant columns error. It might also cause an overflow warning. Unfortunately I have not investigated the direct cause of this error. It might simply be because the DataFrame is too large or too small.
A source for partitioning
Below are the errors, for search engine indexing:
ValueError: Multiple constant columns detected!
RuntimeWarning: overflow encountered in exp return np.exp(A)

Related

Strange results when scaling data using scikit learn

I have an input dataset that has 4 time series with 288 values for 80 days. So the actual shape is (80,4,288). I would like to cluster different days. I have 80 days and all of them have 4 time series: outside temperature, solar radiation, electrical demand, electricity prices. What I want is to group similar days with regard to these 4 time series combined into clusters. Days belonging to the same cluster should have similar time series.
Before clustering the days using k-means or Ward's method, I would like to scale them using scikit-learn. For this I have to transform the data into a 2-dimensional array with the shape (80, 4*288) = (80, 1152), as the StandardScaler of scikit-learn does not accept 3-dimensional input. The StandardScaler just standardizes features by removing the mean and scaling to unit variance.
Now I scale this data using scikit-learn's StandardScaler:
import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";")
scaler = StandardScaler()
data_Scaled = scaler.fit_transform(data_Unscaled)
np.savetxt("C:/Users/User1/Desktop/data_Scaled.csv", data_Scaled, delimiter=";")
When I now compare the unscaled and scaled data e.g. for the first day (1 row) and the 4th time series (columns 864 - 1152 in the csv file), the results look quite strange as you can see in the following figure:
As far as I see it, they are not in line with each other. For example in the timeslots between 111 and 201 the unscaled data does not change at all whereas the scaled data fluctuates. I can't explain that. Do you have any idea why this is happening and why they don't seem to be in line?
Here is the unscaled input data with shape (80,1152): https://filetransfer.io/data-package/CfbGV9Uk#link
and here the scaled output of the scaling with shape (80,1152): https://filetransfer.io/data-package/23dmFFCb#link
You have two issues here: scaling and clustering. As the question title refers to scaling, I'll handle that one in detail. The clustering issue is probably better suited for CrossValidated.
You don't say it, but it seems natural that all temperatures, be it on day 1 or day 80, are measured on the same scale. The same holds for the other three variables. So, for the purpose of scaling, you essentially have four time series.
StandardScaler, like basically everything in sklearn, expects your observations to be organised in rows and variables in columns. It treats each column separately, subtracting its mean from all the values in the column and dividing the resulting values by their standard deviation.
I reckon from your data that the first 288 entries in each row correspond to one variable, the next 288 to the second one etc. You need to reshape these data to form 288*80=23040 rows and 4 columns, one for each variable.
You apply StandardScaler on that array and reformat the data into the original shape, with 80 rows and 4*288=1152 columns. The code below should do the trick:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";", header=None)
X = data_Unscaled.to_numpy()
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
scaler = StandardScaler()
X_narrow_scaled = scaler.fit_transform(X_narrow)
X_scaled = np.array([X_narrow_scaled[i*288:(i+1)*288, :].T.ravel() for i in range(80)])
# Plot the original data:
i=3
j=0
plt.plot(X[j, i*288:(i+1)*288])
plt.title('TimeSeries_Unscaled')
plt.show()
# plot the scaled data:
plt.plot(X_scaled[j, i*288:(i+1)*288])
plt.title('TimeSeries_Scaled')
plt.show()
resulting in the following graphs:
The line
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
uses a list comprehension to generate the four columns of the long, narrow array X_narrow. Basically, it is just shorthand for a for-loop over your four variables. It takes the first 288 columns of X and flattens them into a vector, which it then puts into the first column of X_narrow. Then it does the same for the next 288 columns, X[:, 288:576], and then for the third and the fourth block of the 288 observed values per day. This way, each column in X_narrow contains a long time series, spanning 80 days (and 288 observations per day), of exactly one of your variables (outside temperature, solar radiation, electrical demand, electricity prices).
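If the list comprehensions feel opaque, the same rearrangement can probably be written with reshape and transpose. The sketch below checks this on a tiny made-up array (3 days, 4 variables, 5 time steps instead of 80, 4, 288) and standardises the columns by hand with numpy rather than with StandardScaler, using the population standard deviation as StandardScaler does.

```python
import numpy as np

n_days, n_vars, n_steps = 3, 4, 5  # small stand-ins for 80, 4, 288
X = np.arange(n_days * n_vars * n_steps, dtype=float).reshape(n_days, n_vars * n_steps)

# Long, narrow layout built as in the answer: one column per variable.
X_narrow = np.array(
    [X[:, i * n_steps:(i + 1) * n_steps].ravel() for i in range(n_vars)]
).T

# The same layout via reshape/transpose: (day, var, step) -> (day, step, var).
X_narrow_alt = X.reshape(n_days, n_vars, n_steps).transpose(0, 2, 1).reshape(-1, n_vars)

# Standardise each column by hand (mean 0, population std 1).
X_narrow_scaled = (X_narrow - X_narrow.mean(axis=0)) / X_narrow.std(axis=0)

# Back to the wide layout: (day, step, var) -> (day, var, step).
X_scaled = (
    X_narrow_scaled.reshape(n_days, n_steps, n_vars)
    .transpose(0, 2, 1)
    .reshape(n_days, n_vars * n_steps)
)

print(np.array_equal(X_narrow, X_narrow_alt))
print(X_scaled.shape)
```

Working on a small array first makes it easy to confirm that each value ends up in the column of its own variable before running the real data through the scaler.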
Now, you might try to cluster X_scaled using K-means, but I doubt it will work. You have just 80 points in a 1152-dimensional space, so the curse of dimensionality will almost certainly kick in. You'll most probably need to perform some kind of dimensionality reduction, but, as I noted above, that's a different question.

Performing PCA and knowing which columns were retained [duplicate]

This question already has answers here:
Recovering features names of explained_variance_ratio_ in PCA with sklearn
When performing PCA on a dataset in Python, the explained_variance_ratio_ will show us the different variances for each feature in our dataset.
How do we know which column corresponds to which of the resulting variances?
Context: I'm working on a project and I need to know which components give us 90% of the variance with PCA so that we can perform stepwise feature selection later on.
from sklearn.decomposition import PCA
pcaObj = PCA(n_components=None)
X_train = pcaObj.fit_transform(X_train)
X_test = pcaObj.transform(X_test)
components_variance = pcaObj.explained_variance_ratio_
print(sum(components_variance))
print(components_variance)
Edit: I discovered a similar question: Recovering features names of explained_variance_ratio_ in PCA with sklearn
The answers there are richer and give more detailed explanations. I have marked this question as a duplicate but will leave this comment for the time being.
I believe you can get the values with:
pd.DataFrame(pcaObj.components_.T, index=X_train.columns)
If X_train is not a DataFrame but a numpy array, pass in the names of the features, in their original order, as a list:
pd.DataFrame(pcaObj.components_.T, index=['column_a','column_b','column_c'], columns =['PC-1', 'PC-2'])
# column_x are the names of the features
.components_ should return the values you need. We can place them in a pandas DataFrame, with column names.
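The pca.explained_variance_ratio_ attribute gives you an array with the fraction of the total variance explained by each principal component. Therefore, pca.explained_variance_ratio_[i] will give you the variance ratio of the (i+1)-th component.
I don't believe there is a way to match the variance with the 'name' of a column, because each principal component is a linear combination of all the original columns. Going through the variance array in a for loop and noting the index at which the cumulative sum reaches 90% will tell you how many components to keep; the loadings in components_ then show how strongly each original column contributes to them.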
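The loop over the variance array can also be written with a cumulative sum. This is only a sketch: the ratios below are made-up numbers standing in for pcaObj.explained_variance_ratio_.

```python
import numpy as np

# Made-up values standing in for pcaObj.explained_variance_ratio_
explained_variance_ratio = np.array([0.5, 0.3, 0.15, 0.05])

cumulative = np.cumsum(explained_variance_ratio)
# First position where the cumulative share reaches 90%, converted to a count.
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1

print(cumulative)
print(n_components_90)
```

As far as I know, scikit-learn's PCA also accepts a float between 0 and 1 for n_components (for example PCA(n_components=0.90)) and then picks the number of components needed to explain at least that share of the variance.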

Can the input values to the spectral clustering of scikit-learn be a negative value?

Let us say I have a df with 20 columns and 10K rows. Since the data has a wide range of values, I use the following code to normalize the data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled now contains both negative and positive values.
Now if I pass this normalized data frame to spectral clustering as follows,
from sklearn.cluster import SpectralClustering
spectral = SpectralClustering(n_clusters=k,
                              n_init=30,
                              affinity='nearest_neighbors',
                              random_state=cluster_seed,
                              assign_labels='kmeans')
clusters = spectral.fit_predict(df_scaled)
I will get the cluster labels.
Here is what confuses me: the official doc says that
"Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm."
Questions: Do the normalized negative values of df_scaled affect the clustering result?
OR
Does it depend on the affinity computation I am using, e.g. precomputed or rbf? If so, how can I use the normalized input values with SpectralClustering?
My understanding is that normalizing could improve the clustering results and is good for faster computation.
I appreciate any help or tips on how I can approach the problem.
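You are passing a data matrix, not a precomputed affinity matrix. The restriction quoted from the docs applies to the affinity (similarity) values, not to the input features, so negative values in df_scaled are fine.
The "nearest_neighbors" affinity uses a binary kernel, which is non-negative.
To better understand the inner workings, please have a look at the source code.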
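To make the point concrete, here is a hand-rolled sketch in plain numpy of a binary nearest-neighbour affinity. It is an illustration under my own assumptions (random data, k = 3, each point counted as its own neighbour, symmetrising by averaging with the transpose), not scikit-learn's actual implementation. The input contains negative values, yet every affinity entry is 0, 0.5 or 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))  # standardised-looking data, includes negative values
k = 3                        # hypothetical number of neighbours

# Pairwise Euclidean distances between all points.
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Binary connectivity: 1 if j is among the k closest points to i
# (each point counts as its own nearest neighbour here).
nearest = np.argsort(dist, axis=1)[:, :k]
connectivity = np.zeros_like(dist)
np.put_along_axis(connectivity, nearest, 1.0, axis=1)

# Symmetrise so the affinity matrix is usable as a graph.
affinity = 0.5 * (connectivity + connectivity.T)

print((X < 0).any())
print(affinity.min())
```

The affinity depends only on which points are close to each other, so shifting or standardising the features cannot make it negative.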

Data Preprocessing Python

I have a DataFrame in Python and I need to preprocess my data. Which is the best method to preprocess the data, knowing that some variables have a huge scale and others don't? The data doesn't have a huge deviance either. I tried the preprocessing.scale function and it works, but I'm not at all sure that it is the best way to prepare the data for machine learning algorithms.
There are various techniques for data preprocessing; you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assessing missing values, by computing their percentage per column
Computing the variance and removing variables with near-zero variance
Assessing the inter-variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
import logging
import pandas as pd

data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")
variance = data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(inplace=True)
#reordering columns
variance = variance[["feature_names","variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file+"_variance.csv", sep="|", index=False)
missing_values_percentage = data.isnull().sum()/data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names","missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file+"_missing_values.csv", sep="|", index=False)
correlation = data.corr()
correlation.to_csv(data_file+"_correlation.csv", sep="|")
The above would generate three files holding respectively, the variance, missing values percentage and correlation results.
Refer to this blog article for a hands on tutorial.
Always split your data into train and test sets to prevent overfitting.
If some of your features have a big scale and some don't, you should standardize the data. Make sure to fit the standardization on the train set only, so as not to cause overfitting.
You also have to look for missing data and replace or remove it.
If less than 0.5% of the data in a column is missing you can use 'dropna'; otherwise you have to replace it with something (zero, the mean, the previous value...).
You also have to check for outliers, for example with a boxplot.
Outliers are points that are significantly different from other data in the same group, and they can also affect your predictions in machine learning.
It is best to check for multicollinearity as well.
If some features are correlated with each other, we have multicollinearity, which can cause wrong predictions from the model.
Some of the columns might be categorical, in which case they should be converted to numerical before you use the data.
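A minimal sketch of the "standardize using the train set only" advice, written in plain numpy on made-up data (the two features, the 80/20 split and the seed are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: two features on very different scales.
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(5000, 800, 100)])

# Simple 80/20 split after shuffling the row indices.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# Statistics come from the train set only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ... and are then applied unchanged to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))
print(X_test_scaled.shape)
```

With scikit-learn the same pattern is scaler.fit on the train set followed by scaler.transform on both sets.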

having OneHotEncoder to manage unseen values at transform step

I am using sklearn.preprocessing.OneHotEncoder to encode categorical data of the form
A=array([[1,4,1],[0,3,2]])
B=array([[1,4,7],[0,3,2]])
Suppose I use A at the .fit(A) step and B at some point as new data to .transform(B). If B contains values unseen in A, doing so produces a feature out of bounds error. Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
ValueError: Feature out of bounds. Try setting n_values.
I understand I can change the feature bounds at .fit time. But if I am using A as training data, each time I get a new set B to predict, I would have to mess with my initial encoding.
Thanks.
Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
No, but it would be nice if OneHotEncoder did that, so I've opened an issue for this. For now, you'll just have to set n_values a bit higher.
This feature has now been added to OneHotEncoder. You can enable it by setting the parameter handle_unknown='ignore'.
For example:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
A = np.array([[1,4,1],[0,3,2]])
B = np.array([[1,4,7],[0,3,2]])
onehot = OneHotEncoder(handle_unknown='ignore')
A = onehot.fit_transform(A)
B = onehot.transform(B)
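To see what "all binaries set to zero" means for the unseen value 7, here is a hand-rolled numpy sketch of that behaviour. The helpers fit_categories and one_hot_ignore_unknown are hypothetical names written for this illustration; they are not part of scikit-learn and only mimic the idea.

```python
import numpy as np

def fit_categories(X):
    # Sorted unique values seen in each column of the training data.
    return [np.unique(X[:, j]) for j in range(X.shape[1])]

def one_hot_ignore_unknown(X, categories):
    blocks = []
    for j, cats in enumerate(categories):
        # Compare each value with the known categories of its column;
        # an unseen value matches none of them, so its block is all zeros.
        blocks.append((X[:, j][:, None] == cats[None, :]).astype(float))
    return np.hstack(blocks)

A = np.array([[1, 4, 1], [0, 3, 2]])
B = np.array([[1, 4, 7], [0, 3, 2]])

categories = fit_categories(A)
encoded_B = one_hot_ignore_unknown(B, categories)
print(encoded_B)
```

In the first row of encoded_B the last two entries (the block for the third column) are both zero, because 7 was never seen when fitting on A.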
