I have a large 2D dataset (.csv) with values from a pressure sensor.
The first value is the pressure, while the second records the time at which the measurement was taken.
Looking at the plot, I can see a cloud of points (due to noise) in which you can detect some linear parts (the "good working zone") and some non-linear zones.
I thought of using a RANSAC algorithm to detect the linear zones, but I'm not sure it's the best way.
With OpenCV I can isolate a linear path and it seems to work well, but my problem is transforming the 2D dataset into a "Mat": my sensor gives me 32-bit values and tests take days at a sub-second data rate, so the final 2D matrix is an enormous array of 0s and 1s!
So, in your opinion, what is the best way to detect linear patterns in a 2D dataset?
Edit:
Sending a real dataset is quite problematic because of its size (approx. 100 MB) and the time needed to run a test (days).
I can send a plot to show my problem.
As you can see, RANSAC apparently works well, but my fear is that a dataset like this one:
can cause erroneous results (the first linear part is not detected).
One idea is to split my dataset into parts, but that doesn't seem very efficient...
Is there a method to detect multiple linear zones with RANSAC?
P.S.
Here is some example Python code for RANSAC:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from skimage.measure import LineModelND, ransac
# x, y are vectors:
# x -> time values
# y -> pressure values
data = np.column_stack([x, y])
model = LineModelND()
model.estimate(data)
model_robust, inliers = ransac(data, LineModelND, min_samples=2, residual_threshold=0.01, max_trials=1000)
outliers = ~inliers  # boolean mask of the points RANSAC rejected
line_x = np.arange(x.min(), x.max()+1)
fig, ax = plt.subplots()
ax.plot(data[inliers, 0], data[inliers, 1], '.b', alpha=0.6,label='Linear Data')
ax.plot(data[outliers, 0], data[outliers, 1], '.r', alpha=0.6,label='Non Linear Data')
ax.legend(loc='lower right')
plt.show()
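Regarding multiple linear zones: one common approach is sequential RANSAC, i.e. fit one line, remove its inliers, and repeat on what is left. Below is a minimal NumPy-only sketch of that idea on synthetic data. The helper names (ransac_line, sequential_ransac), the threshold and the min_inliers cut-off are illustrative assumptions, not part of scikit-image. Note that it ignores the time ordering, so two separate zones lying on the same line would be merged.

```python
import numpy as np

def ransac_line(points, threshold, max_trials, rng):
    """Return (slope, intercept, inlier_mask) of the best line y = a*x + b found."""
    best_mask = np.zeros(len(points), dtype=bool)
    best_fit = (0.0, 0.0)
    for _ in range(max_trials):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue  # vertical candidate cannot be written as y = a*x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        residuals = np.abs(points[:, 1] - (slope * points[:, 0] + intercept))
        mask = residuals < threshold
        if mask.sum() > best_mask.sum():
            best_mask, best_fit = mask, (slope, intercept)
    return best_fit[0], best_fit[1], best_mask

def sequential_ransac(points, n_lines, threshold=0.05, max_trials=500,
                      min_inliers=20, seed=0):
    """Extract up to n_lines lines, removing the inliers of each one in turn."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(points))  # indices of points not yet assigned
    segments = []
    for _ in range(n_lines):
        if len(remaining) < 2:
            break
        slope, intercept, mask = ransac_line(points[remaining], threshold,
                                             max_trials, rng)
        if mask.sum() < min_inliers:
            break  # nothing line-like is left
        segments.append((slope, intercept, remaining[mask]))
        remaining = remaining[~mask]
    return segments

# Synthetic "pressure vs. time" signal with two linear zones (slopes 2 and -1)
t = np.linspace(0, 10, 400)
noise = np.random.default_rng(1).normal(0, 0.01, t.size)
y = np.where(t < 5, 2.0 * t, 10.0 - (t - 5)) + noise
segments = sequential_ransac(np.column_stack([t, y]), n_lines=2)
for slope, intercept, idx in segments:
    print(f"slope={slope:.3f}, intercept={intercept:.3f}, points={len(idx)}")
```

Each entry of segments holds the indices of the original samples assigned to that line, so the time range of every linear zone can be read off directly.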
Related
I have a dataset where every data sample consists of 10-20 2D coordinate points. The data is mostly clean, but occasionally there are falsely annotated points. For illustration, the cleanly annotated data would look like this:
either clustered in a small area or spread across a larger area. The outliers I'm trying to filter out look like this:
the outlier is away from the "correct" cluster.
I tried z-score filtering, but this approach falsely marked many annotations as outliers:
std_score = np.abs((points - points.mean(axis=0)) / (np.std(points, axis=0) + 0.01))
validity = np.all(std_score <= np.quantile(std_score, 0.95, axis=0), axis=1)
Is there a method designed to solve this problem?
This seems like a typical clustering problem, and if the data looks as you suggested, KMeans from scikit-learn should do the trick. Let's look at how we can do this.
First, I generate a data sample that might look somewhat like your data.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1) # For reproducibility
cluster_1 = np.random.normal(loc = [1,1], scale = [0.2,0.2], size = (20,2))
cluster_2 = np.random.normal(loc = [2,1], scale = [0.4,0.4], size = (5,2))
plt.scatter(cluster_1[:,0], cluster_1[:,1])
plt.scatter(cluster_2[:,0], cluster_2[:,1])
plt.show()
points = np.vstack([cluster_1, cluster_2])
This is what the data looks like.
Next, we run KMeans clustering.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2).fit(points)
We choose n_clusters as 2, assuming that there are 2 clusters in the dataset. After finding these clusters, let's look at them.
plt.scatter(points[kmeans.labels_==0][:,0], points[kmeans.labels_==0][:,1], label='cluster_1')
plt.scatter(points[kmeans.labels_==1][:,0], points[kmeans.labels_==1][:,1], label ='cluster_2')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], label = 'cluster_center')
plt.legend()
plt.show()
This will look like the image shown below.
This should solve your problem, but there are some things to keep in mind.
It will not be perfect every time.
It might be a problem if you don't have any outliers. This can be checked with silhouette scores.
It is difficult to know which cluster to discard. This can be decided by looking at the location of the cluster centers (the green points) or by picking the cluster with fewer points.
Endnote: You might lose some points, but the entire process would be automated. It depends on how much data you are willing to trade for the manual time saved.
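The two caveats above (no outliers at all, and which cluster to discard) can be sketched together as follows. This is only an illustration: the function name filter_outliers and the silhouette threshold of 0.5 are assumptions that would need tuning on real annotations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def filter_outliers(points, min_silhouette=0.5, random_state=0):
    """Return a boolean mask that is True for the points to keep."""
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(points)
    labels = kmeans.labels_
    # A low silhouette score suggests there is no clear second cluster,
    # i.e. probably no outliers: keep everything in that case.
    if silhouette_score(points, labels) < min_silhouette:
        return np.ones(len(points), dtype=bool)
    # Otherwise keep the cluster with more points and discard the smaller one.
    majority = np.argmax(np.bincount(labels))
    return labels == majority

rng = np.random.default_rng(1)
good = rng.normal(loc=[1, 1], scale=0.1, size=(15, 2))  # clean annotations
bad = np.array([[3.0, 3.0], [3.1, 2.9]])                # two far-away outliers
points = np.vstack([good, bad])

keep = filter_outliers(points)
print(keep.sum(), "points kept out of", len(points))
```

Discarding by cluster size is the simplest rule; comparing the cluster centers instead would be needed if outliers could outnumber the valid points.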
I have daily data like the one below as the actual result.
And on most days my daily prediction looks like the one below.
I found a mild trend: the Y-axis values of my prediction where the X-axis is between 0 and 3000 always need to be amplified much more.
How could I find a function that brings the prediction data closer to the result data? I imagine I could do it by Fourier transforming both arrays, but I guess there is a simpler way.
I assume what you want as output is a kind of 'best fit' correction function to match your prediction to your output. One straightforward approach would be to calculate the difference between your output and prediction and then apply a smoothing algorithm of your choice (e.g. a Savitzky-Golay filter) to obtain the correction function.
Minimal example below:
import numpy as np
from scipy.signal import savgol_filter
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
data = np.sin(x) + np.random.normal(0, 0.2, 100) # a noisy sine function
prediction = x # line with slope 1, a bad initial fit
deviation = prediction - data
fit = savgol_filter(deviation, window_length=71, polyorder=2)
plt.plot(x, prediction-fit)
plt.plot(x, data)
Play around with window_length and polyorder to find a suitable degree of smoothing for your dataset.
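If the deviation is similar from day to day, the smoothed correction learned on one day could be reused on the next. A small sketch under that assumption, with synthetic data (the helper learn_correction and the two "days" are made up for illustration):

```python
import numpy as np
from scipy.signal import savgol_filter

def learn_correction(prediction, data, window_length=71, polyorder=2):
    """Smooth the prediction error so it can be subtracted from later predictions."""
    return savgol_filter(prediction - data, window_length=window_length,
                         polyorder=polyorder)

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
day1 = np.sin(x) + rng.normal(0, 0.2, x.size)  # noisy result used for learning
day2 = np.sin(x) + rng.normal(0, 0.2, x.size)  # next day's noisy result
prediction = x                                  # same bad initial prediction

correction = learn_correction(prediction, day1)
corrected = prediction - correction

err_before = np.mean(np.abs(prediction - day2))
err_after = np.mean(np.abs(corrected - day2))
print(f"mean abs error before: {err_before:.2f}, after: {err_after:.2f}")
```

Evaluating on a different day than the one used for learning gives a more honest idea of how much the correction actually helps.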
I am trying to interpolate a cumulative distribution of e.g. i) number of people versus ii) number of owned cars, showing that e.g. the top 20% of people own much more than 20% of all cars; of course, 100% of people own 100% of cars. I also know that there are e.g. 100 million people and 200 million cars.
Now coming to my code:
#import libraries (more than required here)
import pandas as pd
from scipy import interpolate
from scipy.interpolate import interp1d
from sympy import symbols, solve, Eq
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
curve=pd.read_excel('inputs.xlsx',sheet_name='inputdata')
Input data: Curveplot (cumulated people (x) on the left // cumulated cars (y) on the right)
#Input data in list form (I am not sure how to interpolate from a list for the moment)
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
x, y = cumulatedpeople, cumulatedcars  # interp1d accepts plain lists
interpolation = interp1d(x, y, kind = 'cubic')
number_of_people_mn= 100000000
oneperson = 1 / number_of_people_mn
dataset = pd.DataFrame(range(number_of_people_mn + 1))
dataset.columns = ["nr_of_one_person"]
dataset.drop(dataset.index[:1], inplace=True)
#calculating the position of every single person on the cumulated x-axis (between 0 and 1)
dataset["cumulatedpeople"] = dataset["nr_of_one_person"] / number_of_people_mn
#finding the "cumulatedcars" to the "cumulatedpeople" via interpolation (between 0 and 1)
dataset["cumulatedcars"] = interpolation(dataset["cumulatedpeople"])
plt.plot(dataset["cumulatedpeople"], dataset["cumulatedcars"])
plt.legend(['Cubic interpolation'], loc = 'best')
plt.xlabel('Cumulated people')
plt.ylabel('Cumulated cars')
plt.title("People-to-car cumulated curve")
plt.show()
However, when looking at the actual plot, I get the following result, which is wrong: Cubic interpolation
In fact, the curve should look almost like the one from a linear interpolation of the exact same input data; however, linear interpolation is not accurate enough for my purpose: Linear interpolation
Is there a relevant step I am missing, or what would be the best way to get an accurate interpolation from the inputs that looks almost like the linear one?
Short answer: your code is doing the right thing, but the data is unsuitable for cubic interpolation.
Let me explain. Here is your code, simplified for clarity:
import numpy as np
from scipy.interpolate import interp1d
from matplotlib import pyplot as plt
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
interpolation = interp1d(cumulatedpeople, cumulatedcars, kind = 'cubic')
number_of_people_mn= 100#000000
cumppl = np.arange(number_of_people_mn + 1)/number_of_people_mn
cumcars = interpolation(cumppl)
plt.plot(cumppl, cumcars)
plt.plot(cumulatedpeople, cumulatedcars,'o')
plt.show()
Note the last couple of lines: I am plotting both the interpolated results and the input data on the same graph. Here is the result:
The orange dots are the original data; the blue line is the cubic interpolation. The interpolator passes through all the points, so technically it is doing the right thing.
Clearly, it is not doing what you want.
The reason for this strange behavior is mostly at the right end, where you have a few x-points that are very close together: the interpolator produces massive wiggles trying to fit very closely spaced points.
If I remove the two right-most points from the interpolator:
interpolation = interp1d(cumulatedpeople[:-2], cumulatedcars[:-2], kind = 'cubic')
it looks a bit more reasonable:
But one could still argue that linear interpolation is better. The wiggles are now at the left end, because the gaps between the initial x-points are too large.
The moral here is that cubic interpolation should really be used only if the gaps between x-points are roughly the same.
Your best bet here, I think, is to use something like curve_fit.
A related discussion can be found here.
Specifically, monotone interpolation as explained here yields good results on your data. Copying the relevant bits, you would replace the interpolator with
from scipy.interpolate import pchip
interpolation = pchip(cumulatedpeople, cumulatedcars)
and get a decent-looking fit:
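As a quick numerical sanity check (a sketch, not part of the original code), the snippet below uses PchipInterpolator, the class behind the pchip name, and verifies that the interpolated curve never decreases and still passes through the input points. The grid size and the variable names are arbitrary choices.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

cumulatedpeople = np.array([0, 0.453086, 0.772334, 0.950475,
                            0.978981, 0.999876, 0.999990, 1])
cumulatedcars = np.array([0, 0.016356, 0.126713, 0.410482,
                          0.554976, 0.950073, 0.984913, 1])

interpolation = PchipInterpolator(cumulatedpeople, cumulatedcars)

# Evaluate on a fine grid and check that the cumulative curve never decreases
grid = np.linspace(0, 1, 10001)
cars = interpolation(grid)
is_monotone = bool(np.all(np.diff(cars) >= -1e-12))

# Share of cars owned by the top 20% of people
share_top20 = 1.0 - float(interpolation(0.8))

print("monotone:", is_monotone)
print(f"top 20% of people own {share_top20:.1%} of the cars")
```

Because the input data is monotone, PCHIP keeps the interpolant monotone as well, which is what a cumulative distribution requires.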
I am relatively new to python. I am trying to do a multivariate linear regression and plot scatter plots and the line of best fit using one feature at a time.
This is my code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
And this is the graph that I'm getting:
I have tried searching a lot but to no avail. I wanted to understand why this is not showing a line of best-fit and why instead it is connecting all the points on the scatter plot.
Thank you!
Linear regression means that you are predicting the value linearly, which will always give you a best-fit line. Anything else is not possible. In your code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Use the right variables to plot the line, i.e.:
plt.plot(x_test,y_pred)
Plot the graph between the values that you supplied for testing and the predictions that you get from them, i.e.:
y_pred=regr.predict(x_test)
Also, your model must be trained on the same features; otherwise you will get a straight line, but the results will be unexpected.
This is multivariate data, so you need to plot the pairwise lines
http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/#:~:text=wiki%2F3d%2Dgraphics-,Create%20a%20scatter%20plot%20matrix,pairwise%20comparison%20of%20multivariate%20data.&text=Create%20a%20simple%20scatter%20plot%20matrix.
or change the model to one for linearly dependent data, which will change the model completely:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train[['lag_7']],y_train)  # scikit-learn expects a 2D feature array
y_pred=regr.predict(x_test[['lag_7']])
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given (it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.)
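To get a single straight line for one feature, one option is to fit a 1D line and draw it only between the smallest and the largest x value. A minimal sketch with NumPy on made-up data (the arrays lag_7 and sales below are stand-ins, since the real df is not available):

```python
import numpy as np

rng = np.random.default_rng(0)
lag_7 = rng.uniform(0, 100, 50)                     # stand-in for x_test['lag_7']
sales = 3.0 * lag_7 + 10.0 + rng.normal(0, 5, 50)   # stand-in for the target

# 1D least-squares fit: sales ~ slope * lag_7 + intercept
slope, intercept = np.polyfit(lag_7, sales, deg=1)

# Two end points are enough to draw a straight line
x_line = np.array([lag_7.min(), lag_7.max()])
y_line = slope * x_line + intercept

# With matplotlib, plt.scatter(lag_7, sales) followed by
# plt.plot(x_line, y_line) then shows the points and one straight segment.
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Sorting the x values before calling plt.plot would also avoid the zigzag, but the curve would still not be straight when y_pred depends on other features too.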
Concerning the more general question about how to do multivariate regression and plot the results, I have two remarks:
Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.
I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.
import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View
# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])
You did not provide your data, so I need to generate some:
func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])
Now, let us find the best-fitting line one feature at a time (1D linear regression):
linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result))
linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))
linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))
As you can see, in this example, none of the one-feature linear regressions is able to predict the output very accurately.
Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.
full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())
As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.
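The same comparison can be made numerically without OpenTURNS by computing the coefficient of determination R² of each model. The sketch below uses plain NumPy least squares on synthetic data generated from the same formula as above; the helper r_squared and the random seed are assumptions for illustration.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - residuals.var() / y.var()

rng = np.random.default_rng(0)
X = rng.uniform(0, 3.0e6, size=(30, 3))            # lag_7, rolling_mean, expanding_mean
y = 4 * X[:, 0] + 0.05 * X[:, 1] - 2 * X[:, 2] + rng.normal(0, 2.0e6, 30)

# One-feature regressions versus the full multivariate regression
r2_single = [r_squared(X[:, [k]], y) for k in range(3)]
r2_full = r_squared(X, y)

print("R^2 per single feature:", np.round(r2_single, 3))
print("R^2 with all features: ", round(r2_full, 3))
```

On the training data, the multivariate fit can never have a lower R² than any of the one-feature fits, because each one-feature model is a special case of it.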
I came up with a custom interpolation method for my problem and I'd like to ask if there are any risks in using it. I am not a math or programming expert, which is why I'd like some feedback :)
Story:
I was searching for a good curve-fit method for my data when I came up with an idea to interpolate the data.
I am mixing paints together and making reflectance measurements with a spectrophotometer when the film is dry. I would like to calculate the required proportions of white and colored paints to reach a certain lightness, regardless of any hue shift (e.g. black+white paints gives a bluish grey) or chroma loss (e.g. orange+white gives "pastel" yellowish orange, etc.)
I checked whether the Beer-Lambert law applies, but it does not. Pigment mixing behaves in a more complicated fashion than dye dilution. So I wanted to fit a curve to my data points (the process is explained here: Interpolation for color-mixing).
The first step was making a calibration curve: I tested the following ratios of colored vs. white paint mixed together:
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
This is the plot of my carefully prepared samples, measured with a spectrophotometer. The blue curve represents the full color (ratio = 1), the red curve represents the white paint (ratio = 0), and the black curves represent the mixed samples:
As a second step, I wanted to derive from this data a function that would compute a spectral curve for any ratio between 0 and 1. I tested several curve-fitting (fitting an exponential function) and interpolation (quadratic, cubic) methods, but the results were of poor quality.
For example, this is my reflectance data at 380nm for all the color samples:
This is the result of scipy.optimize.curve_fit using the function:
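import numpy as np
from scipy.optimize import curve_fit
# x: mixing ratios, y: reflectance at 380 nm
def func(x, a, b, c):
    return a * np.exp(-b * x) + c
popt, pcov = curve_fit(func, x, y)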
Then I came up with this idea: the logarithm of the spectral data gives a closer match to a straight line, and the logarithm of the logarithm of the data is almost a straight line, as demonstrated by this code and graph:
import numpy as np
import matplotlib.pyplot as plt
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
plt.plot(ratios, linear_approx)
plt.show()
What I did then was to interpolate the linear approximation and then convert the data back to the linear scale; this gave me a very nice interpolation of my data, much better than what I got before:
import numpy as np
import matplotlib.pyplot as plt
import scipy.interpolate
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
xnew = np.arange(100)/100.
cs = scipy.interpolate.spline(ratios, linear_approx, xnew, order=1)
cs = np.exp(np.exp(cs))
plt.plot(xnew,cs)
plt.plot(ratios, reflectance_at_380nm, 'ro')
plt.show()
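Note that scipy.interpolate.spline has been removed from recent SciPy releases, so the snippet above only runs on old versions. Since an order-1 spline is just piecewise-linear interpolation, the same idea can be sketched with np.interp. The function name loglog_interp is made up for this example, and the approach assumes that all reflectance values are greater than 1, otherwise log(log(x)) is undefined.

```python
import numpy as np

reflectance_at_380nm = np.array([5.319, 13.3875, 24.866, 35.958,
                                 47.1105, 56.2255, 65.232, 83.9295])
ratios = np.array([1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 0])

# np.interp needs increasing x values, so sort by ratio first
order = np.argsort(ratios)
ratios_sorted = ratios[order]
transformed = np.log(np.log(reflectance_at_380nm[order]))  # requires values > 1

def loglog_interp(r):
    """Interpolate linearly in log(log(reflectance)) space, then transform back."""
    return np.exp(np.exp(np.interp(r, ratios_sorted, transformed)))

xnew = np.arange(100) / 100.0
curve = loglog_interp(xnew)
print(loglog_interp(0.75))  # estimate between the 1/2 and 1 samples
```

The interpolated curve still passes exactly through the measured points, because the transformation and its inverse cancel at the sample positions.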
So my question for the experts is: how good is this interpolation method, and what are the risks of using it? Can it lead to wrong results?
Also: can this method be improved, or does it already exist, and if so, what is it called?
Thank you for reading.
This looks similar to the kernel method that is used for fitting regression lines or finding decision boundaries in classification problems.
The idea behind the kernel trick is that the data is transformed into another space (often higher-dimensional) where the data is linearly separable (for classification) or admits a linear fit (for regression). After the fitting is done, the inverse transformation can be applied. In your case, successive logarithms (log(log(x))) seem to be the transformation and successive exponentiations (exp(exp(x))) the inverse transformation.
I am not sure if there is a kernel that does exactly this, but the intuition is similar. Here is a Medium article explaining this for classification using SVM:
https://medium.com/#zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d
Since this kind of method is quite popular in machine learning, I doubt it will lead to wrong results if the fit is done properly (neither under-fit nor over-fit), and this needs to be judged by statistical testing.
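With only eight measurements, one simple way to run such a check is leave-one-out validation: drop each interior sample in turn, predict it from the remaining ones, and look at the error. The sketch below is only one possible check, not a full statistical test; the helper loglog_predict is a made-up name, and it assumes the piecewise-linear interpolation in log(log) space described in the question.

```python
import numpy as np

ratios = np.array([0, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1])
reflectance = np.array([83.9295, 65.232, 56.2255, 47.1105,
                        35.958, 24.866, 13.3875, 5.319])

def loglog_predict(x_train, y_train, x_new):
    """Linear interpolation in log(log(y)) space, transformed back afterwards."""
    return np.exp(np.exp(np.interp(x_new, x_train, np.log(np.log(y_train)))))

errors = []
for k in range(1, len(ratios) - 1):      # end points cannot be interpolated
    keep = np.arange(len(ratios)) != k   # leave sample k out
    predicted = loglog_predict(ratios[keep], reflectance[keep], ratios[k])
    errors.append(abs(predicted - reflectance[k]))
errors = np.array(errors)

print("leave-one-out absolute errors:", np.round(errors, 2))
print("mean absolute error:", round(errors.mean(), 2))
```

Running the same loop with another interpolation or fitting method gives a like-for-like comparison on the measured data.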