I have a data set with 10,000 rows. Each row has 248 values, and these values determine whether the row is labelled zero or one. I am trying to figure out why each row gets its label, so I want to plot the logistic regression line from
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr',fit_intercept=True).fit(X, Y)
so that I can see why the rows are classified the way they are. But I can't figure out how to do this: I can't use a scatter plot, since the x data has far more values per row than the label data.
My question is how would I go about plotting this.
I would suggest plotting the logistic regression using
import seaborn as sns
sns.regplot(x='variable', y='target', data=data, logistic=True)
But that takes a single input variable. Since you are trying to find correlations with a large number of inputs, I would look at feature importance first by running this
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
m.fit(X, y)
print(m.coef_)
The next steps would be applying PCA to either eliminate some features or condense them into fewer variables and running a correlation matrix.
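The steps above can be sketched roughly as follows. This is only an illustration on synthetic data (the real X and Y are not available here), and it assumes the features are standardized first so that coefficient magnitudes are comparable:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 500 rows, 20 features,
# where only features 3 and 7 actually drive the 0/1 label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.3 * rng.normal(size=500) > 0).astype(int)

# Standardize first so coefficient magnitudes are comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
coef = model.named_steps["logisticregression"].coef_.ravel()

# Rank features by absolute coefficient, largest first.
ranking = np.argsort(-np.abs(coef))
print("most influential features:", ranking[:5])

# Project onto two principal components; X_2d can then be drawn as a
# 2-D scatter plot coloured by y.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(X_2d.shape)
```

A large absolute coefficient only indicates influence within this linear model, so it is best treated as a starting point rather than a definitive importance measure.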
P.S. what does a zero or one represent?
I have the following dataset with 10 variables:
I want to identify clusters in this multidimensional dataset, so I tried the k-means clustering algorithm with the following code:
clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)
In order to plot the result I used PCA for dimensionality reduction:
reduced_data = PCA(n_components=2).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
And in the end I get the following result:
So I have following questions:
1.) This PCA plot looks really weird: it splits the whole dataset into two corners of the plot. Is that even correct, or did I code something wrong?
2.) Is there another algorithm for clustering multidimensional data? I looked at this, but I cannot find an appropriate algorithm for clustering multidimensional data... How do I even implement, e.g., Ward hierarchical clustering in Python for my dataset?
3.) Why should I use PCA for dimensionality reduction? Can I also use t-SNE? Is it better?
The problem is that you fit your PCA on your dataframe, but the dataframe contains the cluster labels. The 'clusters' column will probably contain most of the variation in your dataset, and therefore the information in the first PC will just coincide with the data['clusters'] column. Try to fit your PCA only on the distance columns:
data_reduced = PCA(n_components=2).fit_transform(data[['dist1', 'dist2', ..., 'dist10']])
You can fit hierarchical clustering with sklearn by using:
sklearn.cluster.AgglomerativeClustering()
You can use different distance metrics and linkages, such as 'ward'.
t-SNE is used to visualize multivariate data; the goal of this technique is not clustering.
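Putting these points together, a minimal sketch might look like this. The data are synthetic (two well-separated blobs with 10 feature columns, which is an assumption, since the original dataset is not shown), and the key point is that PCA is fitted on the feature columns only, never on the cluster labels:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in (assumption): 200 rows, 10 feature columns, two blobs.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 10)),
    rng.normal(5.0, 1.0, size=(100, 10)),
])

# Cluster on the features; the labels are kept in separate arrays.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(features)

# Fit PCA on the features only, then use the labels purely for colouring.
reduced = PCA(n_components=2).fit_transform(features)

# Cluster ids are arbitrary, so compare the two labelings up to a swap.
agreement = np.mean(kmeans_labels == ward_labels)
print("k-means vs. Ward agreement:", max(agreement, 1.0 - agreement))
```

The `reduced` array can then be passed to a scatter plot with `kmeans_labels` or `ward_labels` as the hue.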
I am relatively new to python. I am trying to do a multivariate linear regression and plot scatter plots and the line of best fit using one feature at a time.
This is my code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
And this is the graph that I'm getting:
I have tried searching a lot but to no avail. I wanted to understand why this is not showing a line of best-fit and why instead it is connecting all the points on the scatter plot.
Thank you!
Linear regression means that you are predicting the value linearly, which will always give you a best-fit line. Anything else is not possible. In your code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Use the right variables to plot the line, i.e.:
plt.plot(x_test,y_pred)
Plot the test inputs against the predictions that you get from them, i.e.:
y_pred=regr.predict(x_test)
Your model must also have been trained on the same features; otherwise you will get a straight line, but the results will be unexpected.
This is multivariate data, so you need to plot the pairwise lines (a scatter plot matrix):
http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/#:~:text=wiki%2F3d%2Dgraphics-,Create%20a%20scatter%20plot%20matrix,pairwise%20comparison%20of%20multivariate%20data.&text=Create%20a%20simple%20scatter%20plot%20matrix.
or fit the model on a single feature, which changes the model completely:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train[['lag_7']],y_train)
y_pred=regr.predict(x_test[['lag_7']])
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given (it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.)
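To get a straight line for a single feature, one option is to fit a 1-D least-squares line on that feature and draw it between its two endpoints, instead of connecting the multivariate predictions in row order. A minimal sketch with NumPy follows; `draw_fit` is a hypothetical helper name, and `ax` is assumed to be a matplotlib Axes created by the caller:

```python
import numpy as np


def draw_fit(ax, x, y):
    """Scatter the observations and overlay a 1-D least-squares line on `ax`.

    `x` and `y` are assumed to be aligned 1-D NumPy arrays of equal length.
    Returns the fitted (slope, intercept).
    """
    slope, intercept = np.polyfit(x, y, deg=1)   # 1-D least-squares fit
    x_line = np.array([x.min(), x.max()])        # two endpoints define the line
    y_line = slope * x_line + intercept
    ax.scatter(x, y, color="black")              # the raw observations
    ax.plot(x_line, y_line, color="blue", linewidth=3)
    return slope, intercept
```

With matplotlib it could be used as `fig, ax = plt.subplots()`, then `draw_fit(ax, x, y)`, then `plt.show()`. Because the line is drawn from only two sorted endpoints, the order of the rows in the data no longer matters.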
Concerning the more general question about how to do multivariate regression and plot the results, I have two remarks:
Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.
I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.
import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View
# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])
You did not provide your data, so I need to generate some:
func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])
Now, let us find the best-fitting line one feature at a time (1D linear regression):
linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result))
linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))
linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))
As you can see, in this example, none of the one-feature linear regressions are able to very accurately predict the output.
Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.
full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())
As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.
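For readers who prefer to stay with scikit-learn, the same comparison can be sketched without OpenTURNS. This swaps in `LinearRegression` for `LinearModelAlgorithm`, and the data are synthetic, generated from the same formula as above but with 200 points (an assumption made only for this illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following the formula used above (assumption: 200 points).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0e6, size=(200, 3))
y = 4.0 * X[:, 0] + 0.05 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(0.0, 2.0e6, size=200)

# One 1-D regression per feature, scored by in-sample R^2.
r2_single = [
    LinearRegression().fit(X[:, [j]], y).score(X[:, [j]], y) for j in range(3)
]

# Multivariate regression on all three features at once.
full = LinearRegression().fit(X, y)
r2_full = full.score(X, y)

# Fitted values: plotting y against y_fitted gives the actual vs. predicted view.
y_fitted = full.predict(X)

print("R^2 per single feature:", np.round(r2_single, 3))
print("R^2 with all features: ", round(r2_full, 3))
```

The in-sample R² of the full model can never be lower than that of any single-feature model, since the single-feature models are special cases of it.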
I am working on an anomaly detection project on call detail records for a telephone operator. I have prepared a sample of 10,000 observations and 80 dimensions, which represents all the observations for one day of traffic. The data look like this:
This is a small part of the whole dataset.
I decided to use the PyOD library, which offers many unsupervised learning algorithms, and to start with KNN:
from pyod.models.knn import KNN
knn= KNN(contamination= 0.1)
result = knn.fit_predict(conso)
Then, to visualize the result, I reduced the sample to 2 dimensions and displayed it as a scatter plot, with the observations that KNN predicted as inliers in blue and those predicted as outliers in red.
from sklearn.manifold import TSNE
result_f = TSNE(n_components = 2).fit_transform(df_final_2)
result_f = pd.DataFrame(result_f)
color= ['red' if row == 1 else 'blue' for row in result]
'df_final_2' is the DataFrame version of 'conso'.
Then I plot everything in the right colors:
import matplotlib.pyplot as plt
plt.scatter(result_f[0],result_f[1], s=1, c=color)
What bothers me in the graph is that the observations predicted as outliers do not really look like outliers: normally outliers sit at the extremities of the plot, not grouped with the normal behaviors. Even when I analyze these supposedly aberrant observations, they show normal behavior in the original dataset. I have tried other PyOD algorithms and modified the parameters of each one, but I got more or less the same result. I must have made a mistake somewhere, but I cannot find it.
Thanks.
There are several things to check:
When using KNN, LOF, and similar models that rely on distance measures, the data should first be standardized (using sklearn's StandardScaler).
t-SNE may not work in this case, and the dimensionality reduction could be off.
Maybe do not use fit_predict, but do this instead (use y_train_pred):
# train kNN detector
clf_name = 'KNN'
clf = KNN(contamination=0.1)
clf.fit(X)
# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_ # raw outlier scores
If none of these work, feel free to open an issue report on GitHub and we will investigate further.
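To illustrate why standardization matters for distance-based detectors, here is a small sketch. It does not use PyOD itself: it swaps in scikit-learn's `NearestNeighbors` and scores each point by the distance to its k-th nearest neighbour, in the spirit of a kNN detector. The data, the feature scales, and the number of planted outliers are all assumptions made for the illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 5 features on very different scales.
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
inliers = rng.normal(size=(200, 5)) * scales
# Five planted outliers: row i sits 10 standard deviations out in feature i.
outliers = np.eye(5) * 10.0 * scales
X = np.vstack([inliers, outliers])          # outliers are rows 200..204

# Standardize so that every feature contributes equally to the distances.
X_std = StandardScaler().fit_transform(X)

# Outlier score = distance to the k-th nearest neighbour.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_std)   # +1: each point is its own neighbour
distances, _ = nn.kneighbors(X_std)
scores = distances[:, -1]

# Flag the highest-scoring points (here: as many as were planted).
n_flagged = 5
flagged = np.argsort(-scores)[:n_flagged]
print("flagged rows:", sorted(flagged.tolist()))
```

Without the `StandardScaler` step, the distances would be dominated by the largest-scale feature, and outliers in the small-scale features would tend to go unnoticed.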
I'm trying to use an MLPRegressor from scikit-learn to do a non-linear regression on a set of 260 examples (X, Y). Each example is composed of 200 features for X and 1 feature for Y.
File containing X
File containing Y
The link between X and Y is not obvious if directly plotted together but if we plot x=log10(sum(X)) and y=log10(Y), the link between both is almost linear.
As a first approach, I tried to apply my neural network directly on X and Y without success.
I have read that scaling would improve regression. In my case, Y contains data in a very wide range of values (from 10e-12 to 10e-5). When computing the error, of course 10e-5 has much more weight than 10e-12, but I would like my neural network to approximate both correctly. When using a linear scaling, let's say preprocessing.MinMaxScaler from scikit-learn, 10e-8 ~ -0.99 and 10e-12 ~ -1, so I'm losing all the information in my target.
My question here is: what kind of scaling could I use to get consistent results?
The only solution I have found is to apply log10(Y), but of course the error is then increased exponentially.
The best I could get is with the code below:
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"]=(20,10)
freqIter=[]
for i in np.arange(0,0.2,0.001):
    freqIter.append([i,i+0.001])
#############################################################################
X = np.zeros((len(learningFiles),len(freqIter)))
Y = np.zeros(len(learningFiles))
# Import X: loadtxt()
# Import Y: loadtxt
maxy = np.amax(Y)
Y *= 1/maxy
Y = Y.reshape(-1, 1)
maxx = np.amax(X)
X *= 1/maxx
#############################################################################
reg = MLPRegressor(hidden_layer_sizes=(8,2), activation='tanh', solver='adam', alpha=0.0001, learning_rate='adaptive', max_iter=10000, verbose=False, tol = 1e-7)
reg.fit(X, Y)
#############################################################################
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],Y*maxy,label = 'INPUTS',color='blue')
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],reg.predict(X)*maxy,label='Predicted',color='red')
plt.grid()
plt.legend()
plt.show()
Result:
Thanks for your help.
You may want to look at a FunctionTransformer. The example given applies a logarithmic transformation as part of pre-processing. You can also do it for an arbitrary mathematical function.
I would also suggest trying a ReLU activation function if you scale logarithmically. After the transformation your data looks fairly linear, so it may converge a little faster, but that's just a hunch.
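As a rough sketch of the idea, a log transform can be applied to the input with `FunctionTransformer` and to the target with `TransformedTargetRegressor`, which also maps predictions back to the original scale. The data here are synthetic (a single positive feature and an exact power-law target), and `LinearRegression` stands in for the MLPRegressor only to keep the example small; both are assumptions, as is the helper name `exp10`:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def exp10(z):
    """Inverse of np.log10 (hypothetical helper for this sketch)."""
    return np.power(10.0, z)


# Synthetic data (assumption): one positive feature spanning six decades and a
# target that is an exact power law of it, so log10(Y) is linear in log10(s).
rng = np.random.default_rng(0)
s = 10.0 ** rng.uniform(-3.0, 3.0, size=(100, 1))
Y = 1e-9 * s.ravel() ** 1.2

model = TransformedTargetRegressor(
    # log10 on the input, then a linear model (stand-in for MLPRegressor)
    regressor=make_pipeline(FunctionTransformer(np.log10), LinearRegression()),
    func=np.log10,        # applied to Y before fitting
    inverse_func=exp10,   # applied to predictions to return to the original scale
)
model.fit(s, Y)
pred = model.predict(s)

# Relative error is the meaningful measure when targets span many decades.
rel_err = np.max(np.abs(pred / Y - 1.0))
print("max relative error:", rel_err)
```

Measuring the relative error rather than the absolute error treats small and large targets alike, which is the behaviour the question asks for.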
I've finally found something interesting that works well in my case.
First, I used a log scaling for Y. I think it is the most suitable scaling when the range of values is as wide as mine (from 10e-12 to 10e-5). The target is then between -5 and -12.
Secondly, my mistake when scaling X was to apply the same scaling to all features. Let's say my X contains 200 features: I was dividing by the max over all features of all examples. My solution is to scale feature1 by the max of feature1 across all examples, and then to repeat this for every feature. This gives me feature1 between 0 and 1 for all examples, instead of far less previously (feature1 could be between 0 and 0.0001 with my previous scaling).
I get better results. My main issue now is to select the correct parameters (number of layers, tolerance, ...), but that is another problem.
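The difference between the two scalings can be sketched in a few lines of NumPy. The array shapes match the question (260 examples, 200 features), but the values are synthetic and only meant to show the effect:

```python
import numpy as np

# Synthetic stand-in (assumption): features whose magnitudes span four decades.
rng = np.random.default_rng(0)
feature_scales = np.logspace(-4, 0, 200)
X = rng.uniform(0.1, 1.0, size=(260, 200)) * feature_scales
Y = 10.0 ** rng.uniform(-12.0, -5.0, size=260)

# Previous approach: one global maximum for the whole matrix.
X_global = X / X.max()

# New approach: each column is divided by its own maximum.
X_per_feature = X / X.max(axis=0)

# Log scaling of the target.
Y_log = np.log10(Y)

print("feature 0 max, global scaling:     ", X_global[:, 0].max())
print("feature 0 max, per-feature scaling:", X_per_feature[:, 0].max())
print("target range after log10:", Y_log.min(), Y_log.max())
```

For non-negative data, scikit-learn's `MaxAbsScaler` performs the same per-feature division and can be used inside a pipeline.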
I've been trying to implement a time series prediction tool using support vector regression in Python. I use the SVR module from scikit-learn for non-linear support vector regression. But I have a serious problem with predicting future events. The regression line fits the original function very well (on the known data), but as soon as I want to predict future steps, it returns the value from the last known step.
My code looks like this:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.svm import SVR
X = np.arange(0,100)
Y = np.sin(X)
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=1e5)
y_rbf = svr_rbf.fit(X[:-10, np.newaxis], Y[:-10]).predict(X[:, np.newaxis])
figure = plt.figure()
tick_plot = figure.add_subplot(1, 1, 1)
tick_plot.plot(X, Y, label='data', color='green', linestyle='-')
tick_plot.axvline(x=X[-10], alpha=0.2, color='gray')
tick_plot.plot(X, y_rbf, label='data', color='blue', linestyle='--')
plt.show()
Any ideas?
thanks in advance,
Tom
You are not really doing time-series prediction. You are trying to predict each element of Y from a single element of X, which means that you are just solving a standard kernelized regression problem.
Another problem is that when computing the RBF kernel over a range of vectors [[0],[1],[2],...], you will get a band of positive values along the diagonal of the kernel matrix, while values far from the diagonal will be close to zero. The test set portion of your kernel matrix is far from the diagonal and will therefore be very close to zero, which would cause all of the SVR predictions to be close to the bias term.
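This effect can be checked directly with NumPy. The sketch below computes the RBF kernel exp(-gamma * (a - b)^2) by hand for the inputs and the gamma=1e5 used in the question; `rbf_kernel_1d` is a hypothetical helper written for this illustration:

```python
import numpy as np


def rbf_kernel_1d(a, b, gamma):
    """RBF kernel matrix exp(-gamma * (a_i - b_j)**2) for 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-gamma * diff ** 2)


train = np.arange(0, 90, dtype=float)    # the known steps
test = np.arange(90, 100, dtype=float)   # the future steps
gamma = 1e5                              # value used in the question

K_train = rbf_kernel_1d(train, train, gamma)
K_test = rbf_kernel_1d(test, train, gamma)

# With such a large gamma the training kernel is the identity matrix,
# and every test point has zero similarity to every training point.
print("training kernel is identity:", np.allclose(K_train, np.eye(len(train))))
print("largest test/train kernel value:", K_test.max())
```

Since the prediction is a weighted sum of these kernel values plus the bias term, a test kernel of zeros leaves only the bias.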
For time series prediction I suggest building the training/test set as
x[0]=Y[0:K]; y[0]=Y[K]
x[1]=Y[1:K+1]; y[1]=Y[K+1]
...
that is, try to predict future elements of the sequence from a window of previous elements.
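A minimal sketch of this windowing, applied to the sine series from the question, could look like the following. The window length K=10, the helper name `make_windows`, and the SVR hyperparameters are assumptions chosen for illustration, not tuned values:

```python
import numpy as np
from sklearn.svm import SVR


def make_windows(series, k):
    """Build (x, y) where x[i] = series[i:i+k] and y[i] = series[i+k]."""
    x = np.array([series[i:i + k] for i in range(len(series) - k)])
    y = series[k:]
    return x, y


Y = np.sin(np.arange(0, 100))
K = 10                                   # window length (assumption)
x, y = make_windows(Y, K)                # x: (90, 10), y: (90,)

# Hold out the last 10 windows as the "future".
x_train, y_train = x[:-10], y[:-10]
x_test, y_test = x[-10:], y[-10:]

# Hyperparameters are illustrative only.
model = SVR(kernel="rbf", C=10.0, gamma="scale", epsilon=0.01)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print("mean absolute error on held-out steps:", np.mean(np.abs(y_pred - y_test)))
```

Note that these are one-step-ahead predictions: each test window is built from true past values. To forecast several steps into the future, each prediction would have to be appended to the window before predicting the next step.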