I am using the pysurvival library to model with the Cox proportional hazard model (CPH). Instead of getting the survival curves, I am interested in getting point predictions. In the library, the function predict_survival returns an array-like representing the prediction of the survival function which I assume that I can use to get the expected values - but I just cant find the right way.
Below I've attached a dummy example.
# Initializing the simulation model
sim = SimulationModel( survival_distribution = 'log-logistic',
risk_type = 'linear',
censored_parameter = 10.1,
alpha = 0.1, beta=3.2 )
# Generating N random samples
N = 200
dataset = sim.generate_data(num_samples = N, num_features = 4)
# Defining the features
features = sim.features
# Creating the X, T and E input
X, T, E = dataset[features], dataset['time'].values, dataset['event'].values
# Building the model
coxph = CoxPHModel()
coxph.fit(X,T,E, lr=0.5, l2_reg=1e-2, init_method='zeros')
When applying the function:
coxph.predict_survival(x=X)
It returns an array of arrays with the shape (200, 87) - why does it give exactly 87 values for every observation?
As far as I understand I should be able to get the expected value by taking the integral below the curve of the survival curve.
To do this I need to calculate the area under the curve which I think can be done using trapz in the numpy library, but I need to know how the spacing between the points are done.
As mentioned we can use the function predict_survival to get the estimated survival probability. Furthermore, by calling coxPH.times we obtain the time of every estimated survival probability, and thereby for example can plot the individual survival curve for every observation and calculate the area under the curve. By using the auc function from the sklearn.metrics library the following definition gives point predictions for the training and test set given a CPH model, the X_test data and X_train data:
## Get point predictions
def point_pred(model, X_test, X_train):
T_pred = []
T_pred_train = []
# Get survival curves
cph_pred = model.predict_survival(X_test)
cph_pred_train = model.predict_survival(X_train)
# get times of survival prediction
time = model.times
# test
for i in range(0,len(cph_pred)):
T_pred.append(auc(time,cph_pred[i]))
# train
for i in range(0,len(cph_pred_train)):
T_pred_train.append(auc(time,cph_pred_train[i]))
return T_pred, T_pred_train
Related
I'm doing an exercise and I'm stuck.
Here's what I have to do:
I've been given a function to implement which has 4 arguments.
def squared_errors(slope, intercept, surfaces, prices
And I tried with a friend to get that function to work but none of us found the solution.
Basically, I have been given a dataset and I have to make sure that our estimator line is the best possible one, we need to compute the Mean Squared Error between price and
predicted_price (slope * surface + intercept). The dataset is a vector of shape(1000,1).
for each row, we should evaluate the squared_error (predicted_price - price)**2
But my brain is just numb and I can't come to a solution, and help would be greatly appreciate. !
Given the slope and the intercept, for any given data point x you can get its prediction as slope*x+intercept (or generally as slope^T.X+intrecept when it is vetorized)
Now that we have the predictions, if we have the actual ground truth then we can measure how good/bad our predictions are using squared loss which is nothing but just the square root of the mean of the squared difference between the prediction and the corresponding actual ground truth.
Sample (documented inline)
import numpy as np
# Actual slope
slope = 2
# Actual intercept
intercept = 0
# Some data
X = np.random.rand(10,1)
# The ground truth
prices = slope*X + intercept
# Loss
def squared_errors(slope, intercept, surfaces, prices):
y_hat = slope*surfaces + intercept
return np.sqrt(np.mean((y_hat - prices)**2))
# prefect prediction
print (squared_errors(2, 0, X, prices))
# Non prefect prediction
print (squared_errors(2, 0.5, X, prices))
print (squared_errors(1, 0, X, prices))
Output:
0.0
0.5
0.6286343914881158
As you can see for the prefect prediction the error is 0 and non zero for the rest based on how far way on average the predictions are from the ground truth.
I have recently been working with gpflow, in-particular Gaussian process regression, to model a process for which I have access to approximated moments for each input. I have a vector of input values X of size (N,1) and a vector of responses Y of size (N,1). However, I also know, for each (x,y) pair, an approximation of the associated variance, skewness, kurtosis and so on for the particular y value.
From this, I know properties that inform me of appropriate likelihoods to use for each data point.
In the simplest case, I just assume all likelihoods are Gaussian, and specify the variance at each point. I've created a minimal example of my code by adapting the tutorial on: https://nbviewer.jupyter.org/github/GPflow/GPflow/blob/develop/doc/source/notebooks/advanced/varying_noise.ipynb#Demo-2:-grouped-noise-variances.
import numpy as np
import gpflow
def generate_data(N=100):
X = np.random.rand(N)[:, None] * 10 - 5 # Inputs, shape N x 1
F = 2.5 * np.sin(6 * X) + np.cos(3 * X) # Mean function values
groups = np.arange( 0, N, 1 ).reshape(-1,1)
NoiseVar = np.array([i/100.0 for i in range(N)])[groups]
Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar) # Noisy data
return X, Y, groups, NoiseVar
# Get data
X, Y, groups, NoiseVar = generate_data()
Y_data = np.hstack([Y, groups])
# Generate one likelihood per data-point
likelihood = gpflow.likelihoods.SwitchedLikelihood( [gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])])
# model construction (notice that num_latent is 1)
kern = gpflow.kernels.Matern52(input_dim=1, lengthscales=0.5)
model = gpflow.models.VGP(X, Y_data, kern=kern, likelihood=likelihood, num_latent=1)
# Specify the likelihood as non-trainable
model.likelihood.set_trainable(False)
# build the natural gradients optimiser
natgrad_optimizer = gpflow.training.NatGradOptimizer(gamma=1.)
natgrad_tensor = natgrad_optimizer.make_optimize_tensor(model, var_list=[(model.q_mu, model.q_sqrt)])
session = model.enquire_session()
session.run(natgrad_tensor)
# update the cache of the variational parameters in the current session
model.anchor(session)
# Stop Adam from optimising the variational parameters
model.q_mu.trainable = False
model.q_sqrt.trainable = False
# Create Adam tensor
adam_tensor = gpflow.train.AdamOptimizer(learning_rate=0.1).make_optimize_tensor(model)
for i in range(200):
session.run(natgrad_tensor)
session.run(adam_tensor)
# update the cache of the parameters in the current session
model.anchor(session)
print(model)
The above code works for a gaussian likelihood, and known variances. Inspecting my real data, I see that it is skewed very often and as a result, I want to use non-gaussian likelihoods to model it, but am unsure how to specify these other likelihood parameters given what I know.
So my question is: Given this setup, how can I adapt my code so far to include non-Gaussian likelihoods at each step, in-particular specifying and fixing their parameters based on my known variances, skewness, kurtosis and so on associated with each individual y value?
Firstly, you will need to choose which non-Gaussian likelihood you use. GPflow includes various ones in likelihoods.py. You then need to adapt the line
likelihood = gpflow.likelihoods.SwitchedLikelihood(
[gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])]
)
to give a list of your non-Gaussian likelihoods.
Which likelihood can take advantage of your skewness and kurtosis information is a statistical question. Depending on what you come up with, you may need to implement your own likelihood class, which can be done by inheriting from Likelihood. You should be able to follow some other examples from likelihoods.py.
I have some 2D data (GPS data) with clusters (stop locations) that I know resemble Gaussians with a characteristic standard deviation (proportional to the inherent noise of GPS samples). The figure below visualizes a sample that I expect has two such clusters. The image is 25 meters wide and 13 meters tall.
The sklearn module has a function sklearn.mixture.GaussianMixture which allows you to fit a mixture of Gaussians to data. The function has a parameter, covariance_type, that enables you to assume different things about the shape of the Gaussians. You can, for example, assume them to be uniform using the 'tied' argument.
However, it does not appear directly possible to assume the covariance matrices to remain constant. From the sklearn source code it seems trivial to make a modification that enables this but it feels a bit excessive to make a pull request with an update that allows this (also I don't want to accidentally add bugs in sklearn). Is there a better way to fit a mixture to data where the covariance matrix of each Gaussian is fixed?
I want to assume that the SD should remain constant at around 3 meters for each component, since that is roughly the noise level of my GPS samples.
It is simple enough to write your own implementation of EM algorithm. It would also give you a good intuition of the process. I assume that covariance is known and that prior probabilities of components are equal, and fit only means.
The class would look like this (in Python 3):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
class FixedCovMixture:
""" The model to estimate gaussian mixture with fixed covariance matrix. """
def __init__(self, n_components, cov, max_iter=100, random_state=None, tol=1e-10):
self.n_components = n_components
self.cov = cov
self.random_state = random_state
self.max_iter = max_iter
self.tol=tol
def fit(self, X):
# initialize the process:
np.random.seed(self.random_state)
n_obs, n_features = X.shape
self.mean_ = X[np.random.choice(n_obs, size=self.n_components)]
# make EM loop until convergence
i = 0
for i in range(self.max_iter):
new_centers = self.updated_centers(X)
if np.sum(np.abs(new_centers-self.mean_)) < self.tol:
break
else:
self.mean_ = new_centers
self.n_iter_ = i
def updated_centers(self, X):
""" A single iteration """
# E-step: estimate probability of each cluster given cluster centers
cluster_posterior = self.predict_proba(X)
# M-step: update cluster centers as weighted average of observations
weights = (cluster_posterior.T / cluster_posterior.sum(axis=1)).T
new_centers = np.dot(weights, X)
return new_centers
def predict_proba(self, X):
likelihood = np.stack([multivariate_normal.pdf(X, mean=center, cov=self.cov)
for center in self.mean_])
cluster_posterior = (likelihood / likelihood.sum(axis=0))
return cluster_posterior
def predict(self, X):
return np.argmax(self.predict_proba(X), axis=0)
On the data like yours, the model would converge quickly:
np.random.seed(1)
X = np.random.normal(size=(100,2), scale=3)
X[50:] += (10, 5)
model = FixedCovMixture(2, cov=[[3,0],[0,3]], random_state=1)
model.fit(X)
print(model.n_iter_, 'iterations')
print(model.mean_)
plt.scatter(X[:,0], X[:,1], s=10, c=model.predict(X))
plt.scatter(model.mean_[:,0], model.mean_[:,1], s=100, c='k')
plt.axis('equal')
plt.show();
and output
11 iterations
[[9.92301067 4.62282807]
[0.09413883 0.03527411]]
You can see that the estimated centers ((9.9, 4.6) and (0.09, 0.03)) are close to the true centers ((10, 5) and (0, 0)).
I think the best option would be to "roll your own" GMM model by defining a new scikit-learn class that inherits from GaussianMixture and overwrites the methods to get the behavior you want. This way you just have an implementation yourself and you don't have to change the scikit-learn code (and create a pull-request).
Another option that might work is to look at the Bayesian version of GMM in scikit-learn. You might be able to set the prior for the covariance matrix so that the covariance is fixed. It seems to use the Wishart distribution as a prior for the covariance. However I'm not familiar enough with this distribution to help you out more.
First, you can use spherical option, which will give you single variance value for each component. This way you can check yourself, and if the received values of variance are too different then something went wrong.
In a case you want to preset the variance, you problem degenerates to finding only best centers for your components. You can do it by using k-means, for example. If you don't know the number of the components, you may sweep over all logical values (like 1 to 20) and evaluate the decrement in fitting error. Or you can optimize your own EM function, to find the centers and the number of components simultaneously.
I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I can not loop through each cluster in order to compute the confusion matrix (according this question)
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
# compute contingency matrix (also called confusion matrix)
contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
# return purity
return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist
def purity_score(y_true, y_pred):
"""Purity score
Args:
y_true(np.ndarray): n*1 matrix Ground truth labels
y_pred(np.ndarray): n*1 matrix Predicted clusters
Returns:
float: Purity score
"""
# matrix which will hold the majority-voted labels
y_voted_labels = np.zeros(y_true.shape)
# Ordering labels
## Labels might be missing e.g with set like 0,2 where 1 is missing
## First find the unique labels, then map the labels to an ordered set
## 0,2 should become 0,1
labels = np.unique(y_true)
ordered_labels = np.arange(labels.shape[0])
for k in range(labels.shape[0]):
y_true[y_true==labels[k]] = ordered_labels[k]
# Update unique labels
labels = np.unique(y_true)
# We set the number of bins to be n_classes+2 so that
# we count the actual occurence of classes between two consecutive bins
# the bigger being excluded [bin_i, bin_i+1[
bins = np.concatenate((labels, [np.max(labels)+1]), axis=0)
for cluster in np.unique(y_pred):
hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
# Find the most present label in the cluster
winner = np.argmax(hist)
y_voted_labels[y_pred==cluster] = winner
return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g: having two equal clusters of size 50) will achieve purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
# compute contingency matrix (also called confusion matrix)
contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
# Find optimal one-to-one mapping between cluster labels and true labels
row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
# Return cluster accuracy
return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
I've been working on getting a hierarchical model of some psychophysical behavioral data up and running in pymc3. I'm incredibly impressed with things overall, but after trying to get up to speed with Theano and pymc3 I have a model that mostly works, however has a couple problems.
The code is built to fit a parameterized version of a Weibull to seven sets of data. Each trial is modeled as a binary Bernoulli outcome, while the thresholds (output of thact as the y values which are used to fit a Gaussian function for height, width, and elevation (a, c, and d on a typical Gaussian).
Using the parameterized Weibull seems to be working nicely, and is now hierarchical for the slope of the Weibull while the thresholds are fit separately for each chunk of data. However - the output I'm getting from k and y_est leads me to believe they may not be the correct size, and unlike the probability distributions, it doesn't look like I can specify shape (unless there's a theano way to do this that I haven't found - though from what I've read specifying shape in theano is tricky).
Ultimately, I'd like to use y_est to estimate the gaussian height or width, however the output right now results in an incredible mess that I think originates with size problems in y_est and k. Any help would be fantastic - the code below should simulate some data and is followed by the model. The model does a nice job fitting each individual threshold and getting the slopes, but falls apart when dealing with the rest.
Thanks for having a look - I'm super impressed with pymc3 so far!
EDIT: Okay, so the shape output by y_est.tag.test_value.shape looks like this
y_est.tag.test_value.shape
(101, 7)
k.tag.test_value.shape
(7,)
I think this is where I'm running into trouble, though it may just be poorly constructed on my part. k has the right shape (one k value per unique_xval). y_est is outputting an entire set of data (101x7) instead of a single estimate (one y_est per unique_xval) for each difficulty level. Is there some way to specify that y_est get specific subsets of df_y_vals to control this?
#Import necessary modules and define our weibull function
import numpy as np
import pylab as pl
from scipy.stats import bernoulli
#x stimulus intensity
#g chance (0.5 for 2AFC)
# m slope
# t threshold
# a performance level defining threshold
def weib(x,g,a,m,t):
k=-np.log(((1-a)/(1-g))**(1/t))
return 1- (1-g)*np.exp(- (k*x/t)**m);
#Output values from weibull function
xit=101
xvals=np.linspace(0.05,1,xit)
out_weib=weib(xvals, 0.5, 0.8, 3, 0.6)
#Okay, fitting the perfect output of a Weibull should be easy, contaminate with some noise
#Slope of 3, threshold of 0.6
#How about 5% noise!
noise=0.05*np.random.randn(np.size(out_weib))
out=out_weib+noise
#Let's make this more like a typical experiment -
#i.e. no percent correct, just one or zero
#Randomly pick based on the probability at each point whether they got the trial right or wrong
trial=np.zeros_like(out)
for i in np.arange(out.size):
p=out_weib[i]
trial[i] = bernoulli.rvs(p)
#Iterate for 6 sets of data, similar slope (from a normal dist), different thresh (output from gaussian)
#Gauss parameters=
true_gauss_height = 0.3
true_gauss_width = 0.01
true_gauss_elevation = 0.2
#What thresholds will we get then? 6 discrete points along that gaussian, from 0 to 180 degree mask
x_points=[0, 30, 60, 90, 120, 150, 180]
x_points=np.asarray(x_points)
gauss_points=true_gauss_height*np.exp(- ((x_points**2)/2*true_gauss_width**2))+true_gauss_elevation
import pymc as pm2
import pymc3 as pm
import pandas as pd
slopes=pm2.rnormal(3, 3, size=7)
out_weib=np.zeros([xvals.size,x_points.size])
for i in np.arange(x_points.size):
out_weib[:,i]=weib(xvals, 0.5, 0.8, slopes[i], gauss_points[i])
#Let's make this more like a typical experiment - i.e. no percent correct, just one or zero
#Randomly pick based on the probability at each point whether they got the trial right or wrong
trials=np.zeros_like(out_weib)
for i in np.arange(len(trials)):
for ii in np.arange(gauss_points.size):
p=out_weib[i,ii]
trials[i,ii] = bernoulli.rvs(p)
#Let's make that data into a DataFrame for pymc3
y_vals=np.tile(xvals, [7, 1])
df_correct = pd.DataFrame(trials, columns=x_points)
df_y_vals = pd.DataFrame(y_vals.T, columns=x_points)
unique_xvals=x_points
import theano as th
with pm.Model() as hierarchical_model:
# Hyperpriors for group node
mu_slope = pm.Normal('mu_slope', mu=3, sd=1)
sigma_slope = pm.Uniform('sigma_slope', lower=0.1, upper=2)
#Priors for the overall gaussian function - 3 params, the height of the gaussian
#Width, and elevation
gauss_width = pm.HalfNormal('gauss_width', sd=1)
gauss_elevation = pm.HalfNormal('gauss_elevation', sd=1)
slope = pm.Normal('slope', mu=mu_slope, sd=sigma_slope, shape=unique_xvals.size)
thresh=pm.Uniform('thresh', upper=1, lower=0.1, shape=unique_xvals.size)
k = -th.tensor.log(((1-0.8)/(1-0.5))**(1/thresh))
y_est=1-(1-0.5)*th.tensor.exp(-(k*df_y_vals/thresh)**slope)
#We want our model to predict either height or width...height would be easier.
#Our Gaussian function has y values estimated by y_est as the 82% thresholds
#and Xvals based on where each of those psychometrics were taken.
#height_est=pm.Deterministic('height_est', (y_est/(th.tensor.exp((-unique_xvals**2)/2*gauss_width)))+gauss_elevation)
height_est = pm.Deterministic('height_est', (y_est-gauss_elevation)*th.tensor.exp((unique_xvals**2)/2*gauss_width**2))
#Define likelihood as Bernoulli for each binary trial
likelihood = pm.Bernoulli('likelihood',p=y_est, shape=unique_xvals.size, observed=df_correct)
#Find start
start=pm.find_MAP()
step=pm.NUTS(state=start)
#Do MCMC
trace = pm.sample(5000, step, njobs=1, progressbar=True) # draw 5000 posterior samples using NUTS sampling
I'm not sure exactly what you want to do when you say "Is there some way to specify that y_est get specific subsets of df_y_vals to control this". Can you describe for each y_est value what values of df_y_vals are you supposed to use? What's the shape of df_y_vals? What's the shape of y_est supposed to be? (7,)?
I suspect what you want is to index into df_y_vals using numpy advanced indexing, which works the same in PyMC as in numpy. Its hard to say exactly without more information.