Finding Hot Spots in Python with KernelDensity

I am facing the following problem:
I have a (large) sample of unevenly distributed points $(X_i, Y_i)$ in a 2D space. I would like to determine the local extrema of the density of the distribution.
Does the function KernelDensity allow estimating the density of the sample at a point outside the sample?
If yes, I cannot find the right syntax.
Here is an example:
import numpy as np
import pandas as pd

# Two bivariate normal clouds of 5000 points each
mean0 = [0, 0]
cov0 = [[1, 0], [0, 1]]
mean1 = [3, 3]
cov1 = [[1, 0.2], [0.2, 1]]
A = pd.DataFrame(np.vstack((np.random.multivariate_normal(mean0, cov0, 5000),
                            np.random.multivariate_normal(mean1, cov1, 5000))))
A.columns = ['X', 'Y']
A.describe()

from sklearn.neighbors import KernelDensity
kde = KernelDensity(bandwidth=0.04, metric='euclidean',
                    kernel='gaussian', algorithm='ball_tree')
kde.fit(A)
If I make this query
kde.score_samples([(0,0)])
I get a negative number, which is clearly not a density:
array([-2.88134574])
I don't know if this is the right approach. I would then like to use that function with an optimizer to find the local extrema. (Which library/function would you recommend?)
EDIT: yes, this is a log-density, not a density, so it can be a negative number.
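A minimal sketch of one possible approach, assuming scipy.optimize is acceptable: exponentiate the log-density returned by score_samples to get the density itself, and minimise the negative log-density to locate a local maximum (the starting point x0 below is an arbitrary choice for illustration):
from scipy.optimize import minimize

def neg_log_density(point):
    # score_samples expects a 2D array of shape (n_samples, n_features)
    return -kde.score_samples(np.asarray(point).reshape(1, -1))[0]

# score_samples returns the log-density, so exponentiate to recover the density
density_at_origin = np.exp(kde.score_samples([[0, 0]]))[0]

# A local maximum of the density is a local minimum of the negative log-density
result = minimize(neg_log_density, x0=[0.0, 0.0], method='Nelder-Mead')
print(density_at_origin, result.x)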

Related

Why does my Fourier-transformed Gaussian-like function not look like a Gaussian in Fourier space?

I have got the following function; the central peak is approximately Gaussian-like. I used the NumPy library for the FFT algorithm (from numpy.fft import fft as fourier, ifft as ifourier) and transformed my function into Fourier space with func_fourier = fourier(func). I expected a Gaussian-like function in Fourier space as well, but got this result when plotting plt.plot(np.abs(func_fourier)). I also don't know what x-values to plot against in Fourier space, since I have no information on the frequency spacing.
Any tips or ideas on what the reason might be, or how to interpret this result from numpy.fft?
As Cris Luengo mentioned in his comment, you want to use np.fft.fftshift() to shift the zero frequency to the centre of the spectrum. So using func_fourier = np.fft.fftshift(fourier(func)) and plotting plt.plot(np.abs(func_fourier)), you get the expected Gaussian-like transform:
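To also get a physically meaningful x-axis, a minimal sketch assuming a sample spacing dt (a value you need to supply; 0.01 below is only an illustration) could pair the shifted spectrum with np.fft.fftfreq:
import numpy as np
import matplotlib.pyplot as plt

dt = 0.01                                   # assumed sample spacing of func
freqs = np.fft.fftshift(np.fft.fftfreq(len(func), d=dt))
func_fourier = np.fft.fftshift(np.fft.fft(func))
plt.plot(freqs, np.abs(func_fourier))
plt.xlabel('frequency')
plt.show()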

Can we rank K-Means clusters or assign weights to certain clusters?

I am working on a K-Means clustering task and I am wondering if there is some way to rank the clusters, or maybe to assign specific weights to certain clusters. Is there a way to do this? Here is my code.
from pylab import plot, show
from numpy import vstack, array
from numpy.random import rand
import numpy as np
from scipy.cluster.vq import kmeans, vq
import pandas as pd
import pandas_datareader as dr
from math import sqrt
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

# Load the data and build the feature matrix
dataset = pd.read_csv('C:\\my_path\\analytics.csv')
data = np.asarray([np.asarray(dataset['Rating']),
                   np.asarray(dataset['Maturity']),
                   np.asarray(dataset['Score']),
                   np.asarray(dataset['Bin']),
                   np.asarray(dataset['Price1']),
                   np.asarray(dataset['Price2']),
                   np.asarray(dataset['Price3'])]).T

# Cluster with scipy's kmeans and assign each row to its nearest centroid
centroids, _ = kmeans(data, 1000)
idx, _ = vq(data, centroids)
details = [(name, cluster) for name, cluster in zip(dataset.Cusip, idx)]
So I get my 'details', I look at it, and everything seems fine at this point. I end up with around 700 clusters. I'm just wondering whether there is a way to rank-order these clusters, assuming 'Rating' is the most important feature. Or perhaps there is a way to assign a higher weight to 'Rating'. I'm not sure this makes 100% sense; I'm just thinking about the concept and wondering whether there is an obvious solution, or maybe this is just nonsense. I can easily count the records in each cluster, but I don't think that has any significance. I Googled this and didn't find anything useful.
One "cheat" trick would be to use the feature Rating twice or three times; it then automatically gets more weight:
data = np.asarray([np.asarray(dataset['Rating']),
                   np.asarray(dataset['Rating']),
                   np.asarray(dataset['Maturity']),
                   np.asarray(dataset['Score']),
                   np.asarray(dataset['Bin']),
                   np.asarray(dataset['Price1']),
                   np.asarray(dataset['Price2']),
                   np.asarray(dataset['Price3'])]).T
There are also adjusted versions of k-means around, but they are not implemented in Python.
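A minimal sketch of both ideas, reusing the question's dataset: scale the 'Rating' column by a chosen weight before clustering (a generalisation of the duplicate-column trick; the weight 2.0 is only an illustration), then rank the resulting clusters by the mean unweighted 'Rating' of their members:
rating_weight = 2.0                                   # illustrative weight
features = ['Rating', 'Maturity', 'Score', 'Bin', 'Price1', 'Price2', 'Price3']
weighted = dataset[features].astype(float)
weighted['Rating'] *= rating_weight

centroids, _ = kmeans(weighted.to_numpy(), 1000)
idx, _ = vq(weighted.to_numpy(), centroids)

# Rank clusters by the average (unweighted) Rating of their members
ranking = (dataset.assign(cluster=idx)
                  .groupby('cluster')['Rating']
                  .mean()
                  .sort_values(ascending=False))
print(ranking.head())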

Python: How to compute and store residuals in a for-loop regression

The title outlines my problem with the following script (please run it first and then read my final question).
Now the whole code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import datetime

tickers = ['EXO.MI', 'LDO.MI']
end = datetime.date.today()
gap = datetime.timedelta(days=650)
start = end - gap
Bank = pdr.get_data_yahoo(tickers, start=start, end=end)
bank_matrix = Bank['Adj Close']
bank_matrix = bank_matrix.dropna()
exor = bank_matrix['EXO.MI']
leonardo = bank_matrix['LDO.MI']

Regressione = pd.DataFrame(data=np.zeros((len(exor), 3)),
                           columns=['Intercetta', 'Hedge', 'Residuals'],
                           index=bank_matrix['EXO.MI'].index)
lookback = 20
Hedge = []
Intercetta = []
Residuals = []

# Rolling regression of EXO.MI on LDO.MI over the previous `lookback` days
for i in range(lookback, len(exor)):
    reg = LinearRegression().fit(bank_matrix[['LDO.MI']][i - lookback + 1:i],
                                 bank_matrix[['EXO.MI']][i - lookback + 1:i])
    # Regressione.iloc[Regressione[i,'Hedge']]=reg.coef_[0]
    Hedge.append(reg.coef_[0])
    Intercetta.append(reg.intercept_)
    y_pred = reg.predict(bank_matrix[['LDO.MI']][lookback:])
    Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy() - y_pred)

Regressione = pd.DataFrame(list(zip(Intercetta, Hedge, Residuals)),
                           columns=['Intercetta', 'Hedge', 'Residuals'])
Regressione.set_index(bank_matrix[['EXO.MI']].index[lookback:], inplace=True)
The code works; however, I have two questions:
Is reg._residues the real set of residuals between the actual y (the real values of 'EXO.MI') and the predicted y? I ask because the plot of the residuals was anything but normally distributed or stationary.
The second one is driving me crazy: how can I compute the everyday residuals inside the for loop?
I mean, I tried to:
take the difference between the real y values and reg.predict
do the manual computation: y_predicted = Intercetta + Hedge * bank_matrix[['LDO.MI']]
But Python always reports errors. I honestly find it very hard to understand how Python works for this...
Thanks
It's still not 100% clear to me what you want to do here, but I hope this will get you somewhere.
First of all, your code runs fine if you just add import datetime at the beginning, and replace
y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
with
y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
Residuals.append(bank_matrix[['EXO.MI']][lookback:]-y_pred)
Then you can visually check your residuals for each sub-period using:
for df in Residuals:
    df.plot.hist()
Using Residuals[-3:] will plot the last three residual series of your calculations:
You can also easily run a Shapiro-Wilk test for normality for each of your residual series and append the results in a dataframe:
from scipy import stats
shapiro=[]
for df in Residuals[-3:]:
    shapiro.append(stats.shapiro(df[df.columns[0]].values))
df_shapiro = pd.DataFrame(shapiro)
df_shapiro[0] returns the W-statistic and df_shapiro[1] returns the p-values.
Take a closer look at the p-values using:
df_pVal=df_shapiro[1].to_frame()
df_pVal['alpha']=0.05
df_pVal.plot()
Take a look here for more information on how to use the test.
The question still remains what you're aiming to do here. A detailed explanation would be great. Until then, I hope my effort gets you a few steps further.
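If by "everyday residuals" you mean a single residual per day, one possible reading is to fit on the preceding window and keep only the residual of day i itself. A minimal sketch under that assumption, reusing the question's bank_matrix, exor and lookback:
daily_residuals = []
for i in range(lookback, len(exor)):
    # Fit on the preceding window only
    reg = LinearRegression().fit(bank_matrix[['LDO.MI']][i - lookback:i],
                                 bank_matrix['EXO.MI'][i - lookback:i])
    # Predict the current day and store its residual
    y_hat = reg.predict(bank_matrix[['LDO.MI']][i:i + 1])[0]
    daily_residuals.append(bank_matrix['EXO.MI'].iloc[i] - y_hat)

daily_residuals = pd.Series(daily_residuals, index=exor.index[lookback:])
print(daily_residuals.tail())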

Different PCA plots

I was trying to learn PCA (using the iris dataset) with Python, and I got some results, so I wanted to check them in R to make sure they were good. When I compared the results, R gave me a mirror image of the Python diagram (flipped in the y axis), with the numeric sign reversed on some of the values (Python: pcaData[140,1] = 0.1826089, R: pca$x[141,2] = -0.1826089; Python counts from zero).
The python code:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition as p
data=np.loadtxt("sample_data/iris.txt",delimiter=';',usecols=(0,1,2,3))
pca=p.PCA().fit(data)
pcaData=pca.transform(data)
plt.scatter(pcaData[:,0],pcaData[:,1])
print(pcaData[140,1])
My python diagram
The R code:
data=read.csv("C:\\Users\\George\\Desktop\\iris.csv",sep=";",colClasses=c(NA, NA, NA,NA,"NULL"));data=data[-151,]
pca=prcomp(data)
plot(pca$x[,1],pca$x[,2])
print(pca$x[141,2])
My R diagram
Searching on the internet, I found that the same thing happens elsewhere.
The R diagram on the internet - Source
The Python diagram on the internet - Source.
I was expecting them to be the same.
Is there something I do not understand well?
Thank you.
Scikit-learn uses a pseudo-randomized method to determine an approximation of the singular value decomposition.
See https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.randomized_svd.html
Therefore, unless you can guarantee that the methods are the same and use the same random seed, you will not obtain exactly the same values for the principal components.
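Another common source of the mirrored plot is that the sign of each principal component is arbitrary: flipping a component together with its scores leaves the decomposition equally valid, so two implementations can legitimately disagree by a sign. A minimal sketch of one way to put the Python result on a fixed sign convention (largest-magnitude loading positive, an arbitrary choice), reusing pca and pcaData from the code above:
import numpy as np

def fix_signs(components, scores):
    # components: (n_components, n_features); scores: (n_samples, n_components)
    # Flip each component so that its largest-magnitude loading is positive
    lead = np.abs(components).argmax(axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), lead])
    return components * signs[:, None], scores * signs[None, :]

components_aligned, scores_aligned = fix_signs(pca.components_, pcaData)
print(scores_aligned[140, 1])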

Mixture of Gaussians using scikit learn mixture

I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.
The data looks like this:
So here's how I cluster the data using R, it gives me 14 nicely separated clusters and is easy as falling down stairs:
data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")
And here's what I do when I try it with Python:
from sklearn import mixture
import numpy as np
import os
from matplotlib import pyplot

os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv', "rb"), delimiter=",", skiprows=0)
gmm = mixture.GMM(n_components=14, n_iter=5000, covariance_type='full')
gmm.fit(data)
classes = gmm.predict(data)
pyplot.scatter(data[:, 0], data[:, 1], c=classes)
pyplot.show()
Which assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and that it increases linearly with the number of clusters. What am I doing wrong? Are there additional parameters I need to consider?
Is there a difference in the models used by Mclust and by sklearn.mixture?
But more important: what is the best way in sklearn to cluster my data?
The trick is to set GMM's min_covar. In this case I get good results from:
mixture.GMM(n_components=14, n_iter=5000, covariance_type='full', min_covar=0.0000001)
The large default value of min_covar is what assigns all points to one cluster.
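For reference, a minimal sketch of the same idea for newer scikit-learn versions, where mixture.GMM has been replaced by mixture.GaussianMixture; there, reg_covar (added to the diagonal of each covariance matrix) plays the analogous role, and the value used below is only an illustration:
from sklearn import mixture

# Keep the covariance regularisation small so narrow clusters are not merged
gmm = mixture.GaussianMixture(n_components=14, covariance_type='full',
                              max_iter=5000, reg_covar=1e-7)
gmm.fit(data)
classes = gmm.predict(data)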
