Different PCA plots - Python

I was trying to learn PCA (using the iris dataset) with Python and I got some results, so I wanted to reproduce them in R to make sure they were right. When I checked, R gave me a diagram that is a mirror image of the Python one (flipped in the y axis), and some values have the opposite sign (Python: pcaData[140,1] = 0.1826089, R: pca$x[141,2] = -0.1826089; Python counts from zero).
The python code:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition as p
data=np.loadtxt("sample_data/iris.txt",delimiter=';',usecols=(0,1,2,3))
pca=p.PCA().fit(data)
pcaData=pca.transform(data)
plt.scatter(pcaData[:,0],pcaData[:,1])
print(pcaData[140,1])
My Python diagram
The R code:
data=read.csv("C:\\Users\\George\\Desktop\\iris.csv",sep=";",colClasses=c(NA, NA, NA,NA,"NULL"))
data=data[-151,]
pca=prcomp(data)
plot(pca$x[,1],pca$x[,2])
print(pca$x[141,2])
My R diagram
In a search I did on the internet, I found that the same thing happens.
The R diagram on the internet (source)
The Python diagram on the internet (source).
I was expecting them to be the same.
Is there something I do not understand well?
Thank you.

scikit-learn uses a pseudo-randomized method to determine an approximation of the singular value decomposition.
See https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.randomized_svd.html
Therefore, unless you can guarantee that the methods are the same and use the same random seed, you will not obtain exactly the same values for the principal components.
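As a side check (a sketch, not part of the original answer, assuming the only difference is the mirrored second component your example values suggest): flip the sign of one column of the Python scores and compare it against the R value.
flipped = pcaData.copy()
flipped[:, 1] *= -1      # mirror the second principal component
print(flipped[140, 1])   # should now print roughly -0.1826089, matching pca$x[141,2] in R
A sign flip of a principal component changes nothing about the explained variance; it only mirrors the plot.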

Related

Can we rank K-Means clusters or assign weights to certain clusters?

I am working on a K-Means clustering task and I am wondering whether there is a way to rank the clusters, or to assign specific weights to certain clusters. Here is my code.
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
import numpy as np
from scipy.cluster.vq import kmeans,vq
import pandas as pd
import pandas_datareader as dr
from math import sqrt
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
dataset = pd.read_csv('C:\\my_path\\analytics.csv')
data = np.asarray([np.asarray(dataset['Rating']), np.asarray(dataset['Maturity']),
                   np.asarray(dataset['Score']), np.asarray(dataset['Bin']), np.asarray(dataset['Price1']),
                   np.asarray(dataset['Price2']), np.asarray(dataset['Price3'])]).T
centroids,_ = kmeans(data,1000)
idx,_ = vq(data,centroids)
details = [(name,cluster) for name, cluster in zip(dataset.Cusip,idx)]
So I get my 'details', I look at it, and everything seems fine at this point. I end up with around 700 clusters. I'm just wondering if there is a way to rank-order these clusters, assuming 'Rating' is the most important feature. Or perhaps there is a way to assign a higher weight to 'Rating'. I'm not sure this makes 100% sense; I'm just thinking about the concept and wondering if there is an obvious solution, or maybe this is just nonsense. I can easily count the records in each cluster, but I don't think that has any significance. I googled this and didn't find anything useful.
One "cheat" trick would be to use the feature ratingtwice or three times, then it automatically gets more weight:
data = np.asarray([np.asarray(dataset['Rating']), np.asarray(dataset['Rating']), np.asarray(dataset['Maturity']),np.asarray(dataset['Score']),np.asarray(dataset['Bin']),np.asarray(dataset['Price1']),np.asarray(dataset['Price2']),np.asarray(dataset['Price3'])]).T
there are also adjustments of kmeans around, but they are not implemented in python.
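A related sketch (not from the original answer; the weight value here is an assumption): with Euclidean distances, duplicating a column is equivalent to multiplying it by sqrt(2), so you can give 'Rating' an arbitrary weight by scaling its column before clustering.
weight = 3.0                          # assumed importance of 'Rating'
weighted = data.astype(float).copy()
weighted[:, 0] *= np.sqrt(weight)     # column 0 of 'data' is 'Rating'
centroids, _ = kmeans(weighted, 1000)
idx, _ = vq(weighted, centroids)
This makes the trade-off explicit instead of being limited to integer duplication.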

Python: How to compute and store residuals in a for-loop regression

The title outlines my problem with the following script (please run it first and then read my final question).
The whole code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import datetime
tickers=['EXO.MI','LDO.MI']
end=datetime.date.today()
gap=datetime.timedelta(days=650)
start=end- gap
Bank=pdr.get_data_yahoo(tickers,start=start,end=end)
bank_matrix=Bank['Adj Close']
bank_matrix=bank_matrix.dropna()
exor=bank_matrix['EXO.MI']
leonardo=bank_matrix['LDO.MI']
Regressione=pd.DataFrame(data=np.zeros((len(exor),3)),columns=['Intercetta','Hedge','Residuals'],index=bank_matrix['EXO.MI'].index)
lookback=20
Hedge=[]
Intercetta=[]
Residuals=[]
for i in range(lookback,len(exor)):
    reg=LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],bank_matrix[['EXO.MI']][i-lookback+1:i])
    # Regressione.iloc[Regressione[i,'Hedge']]=reg.coef_[0]
    Hedge.append(reg.coef_[0])
    Intercetta.append(reg.intercept_)
    y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
    Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
Regressione=pd.DataFrame(list(zip(Intercetta,Hedge,Residuals)),columns=['Intercetta','Hedge','Residuals'])
Regressione.set_index(bank_matrix[['EXO.MI']].index[lookback:],inplace=True)
The code works, however I have two questions:
Is 'reg._residues' the real residual between Y (the actual value of 'EXO.MI') and the predicted y? I ask because the plot of the residuals was anything but normally distributed or stationary.
I'm going crazy here: how can I compute the everyday residuals inside the for loop?
I mean, I tried to:
take the difference between the real y values and reg.predict
do the manual computation: y_predicted = Intercetta + Hedge*bank_matrix[['LDO.MI']]
But Python always reports errors, and I honestly find it very hard to understand how Python handles this.
Thanks
It's still not 100% clear to me what you want to do here, but I hope this will get you somewhere.
First of all, your code runs fine if you just add import datetime at the beginning and drop the .to_numpy() call, i.e. replace
Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
with
Residuals.append(bank_matrix[['EXO.MI']][lookback:]-y_pred)
so that each entry of Residuals stays a DataFrame (the y_pred line is unchanged).
Then you can visually check your residuals for each sub-period using:
for df in Residuals:
    df.plot.hist()
Using Residuals[-3:] will plot the last three residual series of your calculations:
You can also easily run a Shapiro-Wilk test for normality for each of your residual series and append the results in a dataframe:
from scipy import stats
shapiro=[]
for df in Residuals[-3:]:
    shapiro.append(stats.shapiro(df[df.columns[0]].values))
df_shapiro = pd.DataFrame(shapiro)
df_shapiro[0] returns the W-statistic and df_shapiro[1] returns the p-values.
Take a closer look at the p-values using:
df_pVal=df_shapiro[1].to_frame()
df_pVal['alpha']=0.05
df_pVal.plot()
Take a look here for more information on how to use the test.
The question still remains what you're aiming to do here. A detailed explanation would be great. Until then, I hope my effort gets you a few steps further.
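If what you are after is one residual per day, here is a minimal sketch (an assumption about your goal, reusing bank_matrix, exor and lookback from your code): for each day i, predict day i's 'EXO.MI' from day i's 'LDO.MI' with the coefficients fitted on the preceding window, and store that single difference.
daily_residuals = []
for i in range(lookback, len(exor)):
    reg = LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],
                                 bank_matrix[['EXO.MI']][i-lookback+1:i])
    # predict only day i and keep the scalar error for that day
    y_hat_i = reg.predict(bank_matrix[['LDO.MI']].iloc[[i]])[0][0]
    daily_residuals.append(exor.iloc[i] - y_hat_i)
daily_residuals = pd.Series(daily_residuals, index=exor.index[lookback:])
This gives a single residual series you can plot or test for normality directly.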

Finding hot spots in Python with KernelDensity

I am facing the following problem:
I have a (large) sample of unevenly distributed points $(X_i,Y_i)$ in a 2D space. I would like to determine the local extrema of the density of the distribution.
Does KernelDensity allow estimating the density of the sample at a point outside the sample?
If yes, I cannot find the right syntax.
Here is an example:
import numpy as np
import pandas as pd
mean0=[0,0]
cov0=[[1,0],[0,1]]
mean1=[3,3]
cov1=[[1,0.2],[0.2,1]]
A=pd.DataFrame(np.vstack((np.random.multivariate_normal(mean0, cov0, 5000),np.random.multivariate_normal(mean1, cov1, 5000))))
A.columns=['X','Y']
A.describe()
from sklearn.neighbors import KernelDensity
kde = KernelDensity(bandwidth=0.04, metric='euclidean',
                    kernel='gaussian', algorithm='ball_tree')
kde.fit(A)
If I make this query
kde.score_samples([(0,0)])
I get a negative number, clearly not a density:
array([-2.88134574])
I don't know if this is the right approach. I would then like to use that function with an optimizer to find the local extrema. (Which library/function would you recommend?)
EDIT: yes, this is a log-density, not a density, so it can be negative.
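A small sketch of both steps (the starting point and the Nelder-Mead choice are assumptions, and it reuses the kde object fitted above): exponentiate score_samples to get an actual density, and minimize the negative log-density with SciPy to locate a local maximum (hot spot).
import numpy as np
from scipy.optimize import minimize
# density (not log-density) at arbitrary query points
print(np.exp(kde.score_samples(np.array([[0.0, 0.0], [3.0, 3.0]]))))
# search for a local mode by minimizing the negative log-density
neg_log_density = lambda p: -kde.score_samples(p.reshape(1, -1))[0]
result = minimize(neg_log_density, x0=np.array([2.5, 2.5]), method='Nelder-Mead')
print(result.x)    # approximate location of a local density maximum
Note that with bandwidth=0.04 the estimated surface is quite spiky, so the optimizer may stop at a tiny local bump; a larger bandwidth gives a smoother surface to optimize over.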

Mixture of Gaussians using scikit learn mixture

I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.
The data looks like this:
So here's how I cluster the data using R; it gives me 14 nicely separated clusters and is as easy as falling down stairs:
data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")
And here's what I do when I try it with Python:
from sklearn import mixture
import numpy as np
import os
from matplotlib import pyplot
os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv',"rb"),delimiter=",",skiprows=0)
gmm = mixture.GMM( n_components=14,n_iter=5000, covariance_type='full')
gmm.fit(data)
classes = gmm.predict(data)
pyplot.scatter(data[:,0], data[:,1], c=classes)
pyplot.show()
Which assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and increases linearly with increasing numbers of clusters. What am I doing wrong? Are there additional parameters I need to consider?
Is there a difference in the models used by Mclust and by sklearn.mixture?
But more importantly: what is the best way in sklearn to cluster my data?
The trick is to set GMM's min_covar. In this case I get good results from:
mixture.GMM(n_components=14, n_iter=5000, covariance_type='full', min_covar=0.0000001)
The large default value for min_covar assigns all points to one cluster.
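As a side note (an assumption about your scikit-learn version, not part of the original answer): in newer releases mixture.GMM has been replaced by GaussianMixture, n_iter by max_iter, and min_covar by reg_covar (a value added to the diagonal of the covariances), so a roughly equivalent call would be:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=14, max_iter=5000,
                      covariance_type='full', reg_covar=1e-7)
classes = gmm.fit(data).predict(data)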

Image transformation in OpenCV

This question is related to this question: How to remove convexity defects in sudoku square
I was trying to implement nikie's Mathematica answer in OpenCV-Python, but I am stuck at the final step of the procedure.
That is, I got all the intersection points in the square, like below:
Now, I want to transform this into a perfect square of size (450,450), as given below:
(Never mind the brightness difference of the two images.)
Question:
How can I do this in OpenCV-Python? I am using the cv2 version.
Apart from etarion's suggestion, you could also use the remap function. I wrote a quick script to show how you can do this. As you see, coding this is really easy in Python. This is the test image:
and this is the result after warping:
And here is the code:
import cv2
from scipy.interpolate import griddata
import numpy as np
grid_x, grid_y = np.mgrid[0:149:150j, 0:149:150j]
destination = np.array([[0,0], [0,49], [0,99], [0,149],
                        [49,0], [49,49], [49,99], [49,149],
                        [99,0], [99,49], [99,99], [99,149],
                        [149,0], [149,49], [149,99], [149,149]])
source = np.array([[22,22], [24,68], [26,116], [25,162],
                   [64,19], [65,64], [65,114], [64,159],
                   [107,16], [108,62], [108,111], [107,157],
                   [151,11], [151,58], [151,107], [151,156]])
grid_z = griddata(destination, source, (grid_x, grid_y), method='cubic')
map_x = np.append([], [ar[:,1] for ar in grid_z]).reshape(150,150)
map_y = np.append([], [ar[:,0] for ar in grid_z]).reshape(150,150)
map_x_32 = map_x.astype('float32')
map_y_32 = map_y.astype('float32')
orig = cv2.imread("tmp.png")
warped = cv2.remap(orig, map_x_32, map_y_32, cv2.INTER_CUBIC)
cv2.imwrite("warped.png", warped)
I suppose you can google and find out what griddata does. In short, it does interpolation, and here we use it to convert sparse mappings to dense mappings, since cv2.remap requires dense mappings. We just need to convert the values to float32, as OpenCV complains about the float64 type. Please let me know how it goes.
Update: If you don't want to rely on SciPy, one way is to implement the 2D interpolation function in your own code; for example, see the source code of griddata in SciPy, or a simpler one like this http://inasafe.readthedocs.org/en/latest/_modules/engine/interpolation2d.html which depends only on numpy. I'd still suggest using SciPy or another library for this, though I see why requiring only cv2 and numpy may be better for a case like this. I'd like to hear how your final code solves Sudokus.
If you have the source points and end points (you only need 4), you can plug them into cv2.getPerspectiveTransform and use the result in cv2.warpPerspective. That gives you a nice flat result.
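For completeness, a minimal sketch of that approach (the four corner coordinates and the file names are placeholders, given in (x, y) order; substitute the detected corners of your grid):
import cv2
import numpy as np
# four detected corners of the square: top-left, top-right, bottom-left, bottom-right
src = np.float32([[56, 65], [368, 52], [28, 387], [389, 390]])
# where those corners should land in the 450x450 output
dst = np.float32([[0, 0], [449, 0], [0, 449], [449, 449]])
M = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(cv2.imread("sudoku.png"), M, (450, 450))
cv2.imwrite("flat.png", warped)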
