scikit learn LDA giving unexpected results - python

I am attempting to classify some data with the scikit-learn LDA classifier. I'm not entirely sure what to "expect" from it, but what I am getting is weird. It seems like a good opportunity to learn about either a shortcoming of the technique or a way in which I am applying it wrong. I understand that no line can completely separate this data, but there seem to be much "better" lines than the one it is finding. I'm just using the default options. Any thoughts on how to do this better? I'm using LDA because its running time is linear in the size of my dataset. Although I think a linear SVM has a similar complexity; perhaps it would be better for such data? I will update when I have tested other possibilities.
The picture: (light blue is what my LDA classifier predicts will be dark blue)
The code:
import numpy as np
from numpy import array
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import itertools
X = array([[ 0.23125754, 0.79170351],
[ 0.78021491, -0.24999486],
[ 0.00856446, 0.41452734],
[ 0.66381753, -0.09872504],
[-0.03178685, 0.04876317],
[ 0.65574645, -0.68214948],
[ 0.14290684, 0.38256002],
[ 0.05156987, 0.11094875],
[ 0.06843403, 0.19110019],
[ 0.24070898, -0.07403764],
[ 0.03184353, 0.4411446 ],
[ 0.58708124, -0.38838008],
[-0.00700369, 0.07540799],
[-0.01907816, 0.07641038],
[ 0.30778608, 0.30317186],
[ 0.55774143, -0.38017325],
[-0.00957214, -0.03303287],
[ 0.8410637 , 0.158594 ],
[-0.00294113, -0.00380608],
[ 0.26577841, 0.07833684],
[-0.32249375, 0.49290502],
[ 0.11313078, 0.35697211],
[ 0.41153679, -0.4471876 ],
[-0.00313315, 0.30065913],
[ 0.14344143, -0.19127107],
[ 0.04857767, 0.01339191],
[ 0.5865007 , 0.71209886],
[ 0.08157439, 0.40909955],
[ 0.72495202, 0.29583866],
[-0.09391461, 0.17976605],
[ 0.06149141, 0.79323099],
[ 0.52208024, -0.2877661 ],
[ 0.01992141, -0.00435266],
[ 0.68492617, -0.46981335],
[-0.00641231, 0.29699622],
[ 0.2369677 , 0.140319 ],
[ 0.6602586 , 0.11200433],
[ 0.25311836, -0.03085372],
[-0.0895014 , 0.45147252],
[-0.18485667, 0.43744524],
[ 0.94636701, 0.16534406],
[ 0.01887734, -0.07702135],
[ 0.91586801, 0.17693792],
[-0.18834833, 0.31944796],
[ 0.20468328, 0.07099982],
[-0.15506378, 0.94527383],
[-0.14560083, 0.72027034],
[-0.31037647, 0.81962815],
[ 0.01719756, -0.01802322],
[-0.08495304, 0.28148978],
[ 0.01487427, 0.07632112],
[ 0.65414479, 0.17391618],
[ 0.00626276, 0.01200355],
[ 0.43328095, -0.34016614],
[ 0.05728525, -0.05233956],
[ 0.61218382, 0.20922571],
[-0.69803697, 2.16018536],
[ 1.38616732, -1.86041621],
[-1.21724616, 2.72682759],
[-1.26584365, 1.80585403],
[ 1.67900048, -2.36561699],
[ 1.35537903, -1.60023078],
[-0.77289615, 2.67040114],
[ 1.62928969, -1.20851808],
[-0.95174264, 2.51515935],
[-1.61953649, 2.34420531],
[ 1.38580104, -1.9908369 ],
[ 1.53224512, -1.96537012]])
y = array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1.])
classifier = LDA()
classifier.fit(X, y)
# Evaluate the classifier on a 300x300 grid to visualize the decision regions.
xx = np.array(list(itertools.product(np.linspace(-4, 4, 300), np.linspace(-4, 4, 300))))
yy = classifier.predict(xx)
b_colors = ['salmon' if label == 0 else 'deepskyblue' for label in yy]  # predicted regions
p_colors = ['r' if label == 0 else 'b' for label in y]                  # true labels
plt.scatter(xx[:, 0], xx[:, 1], s=1, marker='o', edgecolor=b_colors, c=b_colors)
plt.scatter(X[:, 0], X[:, 1], marker='o', s=5, c=p_colors, edgecolor=p_colors)
plt.show()
UPDATE: Changing from sklearn.discriminant_analysis.LinearDiscriminantAnalysis to sklearn.svm.LinearSVC, also with the default options, gives the following picture:
I think using the zero-one loss instead of the hinge loss would help, but sklearn.svm.LinearSVC doesn't seem to allow custom loss functions.
UPDATE: The loss optimized by sklearn.svm.LinearSVC approaches the zero-one loss as the parameter C goes to infinity. Setting C = 1000 gives me what I was originally hoping for. I'm not posting this as an answer, because the original question was about LDA.
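For reference, a minimal sketch of that change (reusing X, y, xx and the plotting code above):
from sklearn.svm import LinearSVC

# A large C weights the data-fit term heavily relative to the regularizer.
classifier = LinearSVC(C=1000)
classifier.fit(X, y)
yy = classifier.predict(xx)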
picture:

LDA models each class as a Gaussian, so each class model is determined by the class's estimated mean vector and covariance matrix (in LDA the covariance is additionally assumed to be shared across classes, which is what makes the decision boundary linear).
Judging by eye alone, your blue and red classes have approximately the same mean and the same covariance, which means the two Gaussians will 'sit' on top of each other and the discrimination will be poor. It also means the separator (the blue-pink border) will be noisy, that is, it will change a lot between random samples of your data.
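You can check that claim numerically with the X and y arrays from the question:
for label in (0, 1):
    pts = X[y == label]
    print(label, pts.mean(axis=0))  # per-class mean
    print(np.cov(pts.T))            # per-class 2x2 covariance matrix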
Btw, your data is clearly not linearly separable, so every linear model will have a hard time discriminating it.
If you must use a linear model, try LDA with 3 classes instead of 2, such that the top-left blue blob is labeled '0', the bottom-right blue blob '1', and the red blob '2'. This way you will get a much better linear model. You can produce those labels by preprocessing the blue class with a clustering algorithm using K=2 clusters, as in the sketch below.
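A minimal sketch of that preprocessing, reusing X, y, xx and the LDA import from the question:
from sklearn.cluster import KMeans

# Split the original class 0 into its two blobs.
km = KMeans(n_clusters=2, random_state=0).fit(X[y == 0])
y3 = y.copy()
y3[y == 0] = km.labels_  # the two blue blobs become classes 0 and 1
y3[y == 1] = 2           # the red class becomes class 2

classifier = LDA()
classifier.fit(X, y3)
# Map predictions back to the original binary problem.
pred_binary = np.where(classifier.predict(xx) == 2, 1, 0)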

Related

CV2 Distance with projectPoints

Is it possible to calculate the distance between points which I created with cv2.projectPoints?
I have two ArUco markers, and from both markers I have created (with cv2.projectPoints) points which are at a specific distance from the marker. Now I want to know how far these points are from each other.
I know you can't give specific code without an MVP, and that's not necessary; I only need an idea of how to calculate this. It would be awesome if someone knows a cv2 function or a way to calculate these.
Thank you very much <3
Edit:
I generated the four identity matrices and inverted them. Code and result are below.
#T_point1_marker1 = np.linalg.inv(T_marker1_point1)
#T_marker1_cam = np.linalg.inv(T_cam_marker1)
T_point1_marker1 = np.array([
[ 1., 0., 0., -0.1 ],
[ 0., 1., 0., -0.05],
[ 0., 0., 1., 0. ],
[ 0., 0., 0., 1. ],
])
T_marker1_cam = np.array([
[ 1., 0., 0., 0.10809129],
[ 0., 1., 0., 0.03833054],
[ 0., 0., 1., -0.35931477],
[ 0., 0., 0., 1. ],
])
T_cam_marker2 = np.array([
[ 1., 0., 0., 0.09360527],
[ 0., 1., 0., -0.01229168],
[ 0., 0., 1., 0.36470099],
[ 0., 0., 0., 1. ],
])
T_marker2_point2 = np.array([
[ 1., 0., 0., 0.005],
[ 0., 1., 0., 0.1 ],
[ 0., 0., 1., 0. ],
[ 0., 0., 0., 1. ],
])
The thing I don't understand is this part:
T_point1_point2 = T_point1_marker1 @ T_marker1_cam @ T_cam_marker2 @ T_marker2_point2
How do I bring these four matrices together so I get T_point1_point2?
Thanks again :)
Since your graphic contains measurements of physical distance, rather than pixels, I'll assume you're asking about 3D, i.e. you want a 3D distance between those points...
You just need to define the poses of those points, relative to their markers. That is T_marker1_point1 and T_marker2_point2. Make those be pure translation, probably with Z=0 if these points are in each respective marker's plane. Literally make a 4x4 identity matrix, then stick your nominal (constructed) dimensions into the last column.
Then you need the marker poses relative to the camera, T_cam_marker1 and T_cam_marker2.
Finally you calculate
T_point1_point2 = T_point1_marker1 @ T_marker1_cam @ T_cam_marker2 @ T_marker2_point2
# where
# T_marker1_cam = np.linalg.inv(T_cam_marker1)
# and so on
The translation part of that pose matrix gives you the vector between those points; its norm is the distance. You can ignore the rotation component. That'd only give you the rotation between those markers, because your points were defined as poses with the same orientation as their respective markers. Yes, orientation is silly for points, but eh...
All of that is 4x4 matrices. Compose them from tvec, which goes in the last column, and rvec, turned into a 3x3 rotation matrix using cv.Rodrigues. Decompose a 4x4 matrix into rvec and tvec accordingly (Rodrigues goes both ways). A sketch follows.
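A minimal sketch of that composition, assuming rvec/tvec pairs from your marker pose estimation (pose_matrix is a hypothetical helper, not an OpenCV function):
import numpy as np
import cv2 as cv

def pose_matrix(rvec, tvec):
    # 4x4 pose: 3x3 rotation from Rodrigues, translation in the last column.
    T = np.eye(4)
    T[:3, :3] = cv.Rodrigues(np.asarray(rvec, dtype=float))[0]
    T[:3, 3] = np.asarray(tvec, dtype=float).ravel()
    return T

# e.g. T_cam_marker1 = pose_matrix(rvec1, tvec1), likewise for marker 2;
# T_marker1_point1 / T_marker2_point2 are the hand-built pure translations.
T_marker1_cam = np.linalg.inv(T_cam_marker1)
T_point1_point2 = T_point1_marker1 @ T_marker1_cam @ T_cam_marker2 @ T_marker2_point2

distance = np.linalg.norm(T_point1_point2[:3, 3])  # distance between the two points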

Get output function from facebook prophet

My understanding of Facebook Prophet is that it fits data with multiple components (some linear stuff, some Fourier stuff, some noise, maybe more?). I would like to get the function that it uses to make forecasts. I was able to get the parameters it fitted, but I have no idea what these parameters mean. I also can't seem to find proper documentation about this (https://facebook.github.io/prophet/docs/diagnostics.html is the only page I can find, and it doesn't say anything about the model.params attribute).
from prophet import Prophet
import pandas as pd

df = pd.DataFrame()
df['ds'] = data['fake_dates']  # `data` is my own DataFrame of dates and values
df['y'] = data['ma']
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=60)
forecast = m.predict(future)
print(m.params)
This gives the following output (run on some data):
{'k': array([[-3.95155406]]),
'm': array([[0.42781757]]),
'delta': array([[-3.29078537e-01, 2.78262436e+00, 3.31685505e+00,
4.57598611e+00, -1.62887827e+00, -6.87254136e+00,
-3.32557982e+00, 3.78801974e+00, 3.40895873e+00,
4.23882762e+00, -3.98031792e-06, -7.42245159e+00,
-2.37269115e+00, -1.61448980e+00, 1.14914927e+00,
1.11661060e+01, 4.18376686e-01, -5.93554379e+00,
-5.50573948e+00, 2.41918207e-07, 1.12679450e+00,
1.11730956e+01, -1.13463489e-07, -6.22587811e+00,
-3.22371819e+00]]),
'sigma_obs': array([[0.01471452]]),
'beta': array([[ 1.39353263e-02, -1.71326299e-01, 4.63865816e-03,
-9.30015605e-03, -5.67143463e-03, 7.82285614e-03,
-1.71196036e-03, 1.18696777e-03, -5.05833319e-04,
-1.70835807e-03, 6.47786761e-04, 2.56994786e-04,
-1.29550834e-05, -1.07983945e-04, 6.83514888e-04,
-7.02620836e-05, -2.25578924e-04, 6.36561710e-04,
1.64453201e-04, 5.05444195e-04, -2.89627488e-05,
5.18925939e-05, 4.12627053e-05, 1.93531548e-05,
1.57462000e-05, -9.68859475e-07]]),
'trend': array([[0.42781757, 0.42565115, 0.42348473, ..., 0.67689983, 0.6762072 ,
0.67551457]]),
'Y': array([[0.25557853, 0.2535342 , 0.2515422 , ..., 0.50456002, 0.50390483,
0.50321808]]),
'beta_m': array([[ 0., -0., 0., -0., -0., 0., -0., 0., -0., -0., 0., 0., -0.,
-0., 0., -0., -0., 0., 0., 0., -0., 0., 0., 0., 0., -0.]]),
'beta_a': array([[ 1.39353263e-02, -1.71326299e-01, 4.63865816e-03,
-9.30015605e-03, -5.67143463e-03, 7.82285614e-03,
-1.71196036e-03, 1.18696777e-03, -5.05833319e-04,
-1.70835807e-03, 6.47786761e-04, 2.56994786e-04,
-1.29550834e-05, -1.07983945e-04, 6.83514888e-04,
-7.02620836e-05, -2.25578924e-04, 6.36561710e-04,
1.64453201e-04, 5.05444195e-04, -2.89627488e-05,
5.18925939e-05, 4.12627053e-05, 1.93531548e-05,
1.57462000e-05, -9.68859475e-07]])}
I know that it should be possible to reconstruct the function that is used to forecast from this.
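For what it's worth, the parameters map onto the model described in the Prophet paper (Taylor & Letham, 2018): a piecewise-linear trend (k, m, delta) plus Fourier-based seasonality (beta), with observation noise sigma_obs. A hedged sketch of rebuilding the trend for default linear growth follows; it assumes attributes like m.changepoints_t and m.y_scale, which exist in recent Prophet versions but may differ across releases:
import numpy as np

k0 = m.params['k'][0, 0]      # base growth rate
m0 = m.params['m'][0, 0]      # trend offset (the parameter named 'm')
delta = m.params['delta'][0]  # rate adjustment at each changepoint
t_change = m.changepoints_t   # changepoint times on Prophet's [0, 1] time scale

def trend_scaled(t):
    # A[i, j] = 1 once time t[i] has passed changepoint j.
    A = (t[:, None] >= t_change[None, :]).astype(float)
    gamma = -t_change * delta  # offset corrections keeping the trend continuous
    return (k0 + A @ delta) * t + (m0 + A @ gamma)
The seasonal component is then X(t) @ beta, where X(t) is the Fourier feature matrix; everything lives on Prophet's internally scaled time and y axes, so multiply by m.y_scale to compare against the 'trend' entry above.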

Scipy UnivariateSpline exit code -1073741819 for some case

I use UnivariateSpline from the scipy module to fit data. It works for almost all cases except this one, which fails with a Process finished with exit code -1073741819 (0xC0000005) error. If I change the smoothing factor s to 0, it also works. Any suggestions to solve this problem would help.
Update1
My working environment is:
python 3.7
scipy 1.3.2
numpy 1.17.4
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import UnivariateSpline, InterpolatedUnivariateSpline
x = np.arange(78)
y = np.asarray([
0., 0., 0., 0., 0., 0.,
0., 0., 5.03989319, 4.03191455, 4.03191455, 3.02393591,
3.02393591, 2.01595727, 2.01595727, 1.00797864, 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.])
spl = UnivariateSpline(x, y, k=1, s=0.01)
knots = list(map(int, spl.get_knots()))
plt.plot(knots, y[knots], 'rx')
plt.plot(knots, y[knots], 'r-')
plt.plot(x, y, 'b-')
plt.show()
The combination of your s and k parameters is causing the issue.
According to the documentation, the number of knots is increased until the condition sum((w[i] * (y[i]-spl(x[i])))**2, axis=0) <= s is met. However, because you have a limited number of non-zero data points, you can only add so many meaningful knots to the data set, and because you are fitting a k=1 spline (as opposed to cubic, for example), the difference between the spline values and the data values never reaches the prescribed s value.
Your options are to increase k (I tested with k=3 and it worked) or to increase the s value for a less strict condition (anything above s=0.08 worked for me). Note that your code worked with s=0 because for that condition the algorithm does no smoothing at all and just interpolates between the points (which may be what you want). Both are sketched below.
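For example:
# Either of these avoided the crash in my tests:
spl_cubic = UnivariateSpline(x, y, k=3, s=0.01)  # a cubic spline can satisfy the condition
spl_loose = UnivariateSpline(x, y, k=1, s=0.09)  # a looser smoothing condition
# And s=0 disables smoothing entirely, giving a pure interpolation:
spl_interp = UnivariateSpline(x, y, k=1, s=0)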

Scikit-learn cross val score: too many indices for array

I have the following code
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score
#split the dataset for train and test
combnum['is_train'] = np.random.uniform(0, 1, len(combnum)) <= .75
train, test = combnum[combnum['is_train']==True], combnum[combnum['is_train']==False]
et = ExtraTreesClassifier(n_estimators=200, max_depth=None, min_samples_split=10, random_state=0)
labels = train[list(label_columns)].values
tlabels = test[list(label_columns)].values
features = train[list(columns)].values
tfeatures = test[list(columns)].values
et_score = cross_val_score(et, features, labels, n_jobs=-1)
print("{0} -> ET: {1})".format(label_columns, et_score))
Checking the shape of the arrays:
features.shape
Out[19]:(43069, 34)
And
labels.shape
Out[20]:(43069, 1)
and I'm getting:
IndexError: too many indices for array
and this relevant part of the traceback:
---> 22 et_score = cross_val_score(et, features, labels, n_jobs=-1)
I'm creating the data from Pandas dataframes, and I searched here and saw some references to possible errors arising this way, but I can't figure out how to correct it.
What the data arrays look like:
features
Out[21]:
array([[ 0., 1., 1., ..., 0., 0., 1.],
[ 0., 1., 1., ..., 0., 0., 1.],
[ 1., 1., 1., ..., 0., 0., 1.],
...,
[ 0., 0., 1., ..., 0., 0., 1.],
[ 0., 0., 1., ..., 0., 0., 1.],
[ 0., 0., 1., ..., 0., 0., 1.]])
labels
Out[22]:
array([[1],
[1],
[1],
...,
[1],
[1],
[1]])
When we do cross-validation in scikit-learn, the process requires labels of shape (R,) instead of (R, 1). Although they are the same thing to some extent, their indexing mechanisms are different. So in your case, just add:
c, r = labels.shape
labels = labels.reshape(c,)
before passing it to the cross-validation function.
It seems to be fixable if you specify the target labels as a single data column from Pandas. If the target has multiple columns, I get a similar error. For example try:
labels = train['Y']
Adding .ravel() to the y/labels variable passed into the function helped solve this problem for me with KNN as well.
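For example, with the variables from the question:
et_score = cross_val_score(et, features, labels.ravel(), n_jobs=-1)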
Try the target as a single column:
y = df['Survived']
instead of what I had used:
y = df[['Survived']]
The double brackets made the target y a DataFrame; it seems a Series is what's expected.
You might need to play with the dimensions a bit, e.g.
et_score = cross_val_score(et, features, labels, n_jobs=-1)[:,n]
or
et_score = cross_val_score(et, features, labels, n_jobs=-1)[n,:]
n being the dimension.

Convolution & Deconvolution using Scipy

I am trying to compute a deconvolution using Python. I have a signal, say f(t), which has been convolved with a window function, say g(t). Is there some direct way to compute the deconvolution so I can get back the original signal?
For instance f(t) = exp(-t**2/3), a Gaussian function,
and g(t) a trapezoidal function.
Thanks in advance for your kind suggestion.
Is this an analytical or numerical problem?
If it's numerical, use scipy.signal.deconvolve: http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.deconvolve.html
From the docs:
>>> import numpy as np
>>> from scipy import signal
>>> sig = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1,])
>>> filter = np.array([1,1,0])
>>> res = signal.convolve(sig, filter)
>>> signal.deconvolve(res, filter)
(array([ 0., 0., 0., 0., 0., 1., 1., 1., 1.]),
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]))
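Applied to your example, a minimal numeric sketch (the trapezoid here is just an illustrative window, not your exact g(t)):
import numpy as np
from scipy import signal

t = np.linspace(-5, 5, 201)
f = np.exp(-t**2 / 3)                    # the Gaussian signal
g = np.array([0.5, 1.0, 1.0, 1.0, 0.5])  # a small trapezoidal window
g = g / g.sum()                          # normalize so convolution preserves scale

res = signal.convolve(f, g)              # forward convolution
recovered, remainder = signal.deconvolve(res, g)
# recovered has the same length as f and matches it up to floating-point error.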
Otherwise, if you want an analytic solution, you might be using the wrong tool.
Additionally, just a tip for future googling: when you're talking about convolution, the action is usually/often "convolved", not "convoluted"; see https://english.stackexchange.com/questions/64046/convolve-vs-convolute
