So I read that it is possible to fit AR models to EEG data and then use the AR coefficients as features for clustering or classifying data, e.g. Mohammadi et al., "Person identification by using AR model for EEG signals", 2006.
As a quality control step, and as an aid for explanation, I wanted to visually see the type of time series produced/simulated by the fitted model. This would also allow me to show the prototype model if I were doing k-means or something for classification.
However, all I seem to be able to produce is noise!
Any steps towards getting what I want would be more than welcome.
import numpy as np
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_process import arma_generate_sample

# `data` is the raw single-channel EEG recording (1-D array)
section1 = data[88000:91800]
section2 = data[0:8000]
section3 = data[143500:166000]
section1 -= np.mean(section1)
section2 -= np.mean(section2)
section3 -= np.mean(section3)
When plotted, the raw sections look reasonable (plot omitted). I then fit AR models and simulate from them:
maxOrder = 20
model_one = AR(section1).fit(maxOrder, ic = 'aic', trend = 'nc')
model_two = AR(section2).fit(maxOrder, ic = 'aic', trend = 'nc')
model_three = AR(section3).fit(maxOrder, ic = 'aic', trend = 'nc')
fake1 = arma_generate_sample(model_one.params,[1],1000, sigma = 1)
fake2 = arma_generate_sample(model_two.params,[1],1000,sigma = 1)
fake3 = arma_generate_sample(model_three.params,[1],1000,sigma = 1)
import matplotlib.pyplot as plt
fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
ax1.plot(fake1)
ax2.plot(fake2)
ax3.plot(fake3)
The standard simplest more-or-less-true thing to say about EEG data is that it has a 1/f or "pink" distribution. An interesting thing about 1/f signals is that they are non-stationary, and cannot be correctly modelled by an ARMA process of any order. (1/f means that low frequency fluctuations are arbitrarily large, which means that arbitrarily far apart points remain correlated, and the more data you have, the further apart the correlations you can detect -- the ACF never converges to anything finite. Also, it's important to realize that spectral content and ARMA-like processes are super super related, because a signal's auto-correlation function totally determines its spectral distribution, and vice-versa -- the two functions are Fourier transforms of each other.)
So basically this means that anything you do using basic time series statistics is going to be a huge theory-violating hack. It doesn't mean it won't work in practice to produce some useful classification features, but calibrate your expectations accordingly... it might well be that the results you're getting are exactly the same as Mohammadi et al got, and they just didn't bother to do any checking/reporting of goodness of fit.
There are ways to model 1/f noise more directly, e.g. via wavelets or fractionally integrated (ARFIMA-type) processes.
Depending on your data, you may also need to worry about deviations from the simple 1/f distribution: stuff like alpha (which produces a substantial bump in the spectral distribution at 10 Hz), artifacts like muscle noise, electrical line noise, and heart beat (which also cause substantial deviations from the simple 1/f spectrum -- muscle in particular produces very distinctive broad-band ~whitish noise), and eye blinks (which produce huge impulse deviations that aren't going to be well-modelled by any technique that assumes stationarity or works in the frequency domain).
There's more discussion (with references) of these issues in section 5.3 of my thesis, though in the context of doing ERP-like analyses rather than machine learning.
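To make the ACF/spectrum point concrete, here is a small numerical check on a toy AR(1) series (nothing EEG-specific) that the sample autocovariance and the periodogram are a Fourier-transform pair:
import numpy as np

rng = np.random.default_rng(0)
n = 4096
x = np.zeros(n)
for t in range(1, n):  # toy AR(1) series
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()
x -= x.mean()

# sample autocovariance computed directly in the time domain
acov_direct = np.array([np.sum(x[:n - k] * x[k:]) for k in range(50)]) / n

# the same quantity obtained as the inverse FFT of the (zero-padded) periodogram
acov_fft = np.fft.ifft(np.abs(np.fft.fft(x, 2 * n)) ** 2).real[:50] / n

print(np.allclose(acov_direct, acov_fft))  # True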
I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 means good, 0 means bad, and blank means the user has not rated that movie.
I want to cluster similar users based on their reviews, with the idea that users who rated similar movies as good might also rate a movie as good which was not rated by any user in the same cluster. I used the cosine similarity measure with k-means clustering. The csv file is shown below:
UserID   M1  M2  M3  ...  M200
user1     1   0   0
user2     0   1   1
user3     1   1   1
...
user100   1   0   1
The problem I am facing is that I don't know exactly how to find the optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering them with k-means and there is no issue with that, but I want to know the most stable or optimal number of clusters for this dataset.
I would appreciate some help.
Clustering is part of the unsupervised machine learning methods. Contrary to supervised methods, in unsupervised methods there is no straightforward way to determine the "best" model among a set of models that were trained on a certain dataset.
Nonetheless, there are some quantitative measures. Most of them are based on the concept of "how much more similar are the points within a cluster to each other than to points in different clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation. Take a look at all the techniques that do not require labels_true (i.e. all the unsupervised techniques).
Once you have a quantitative measure of the "goodness" of a certain clustering, you usually observe how this quantity evolves while changing the number of clusters; this approach is called the Elbow Method.
Here is some code that uses K-Means algorithm with all possible K values from 2 to 30, calculates various scores for each K value, and stores all scores in a DataFrame.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

seed_random = 1
fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
    # Perform clustering.
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=seed_random,
                    )
    labels_clusters = kmeans.fit_predict(X)
    # Insert fitted model and calculated cluster labels in dictionaries,
    # for further reference.
    fitted_kmeans[n_clusters] = kmeans
    labels_kmeans[n_clusters] = labels_clusters
    # Calculate various scores, and save them for further reference.
    silhouette = silhouette_score(X, labels_clusters)
    ch = calinski_harabasz_score(X, labels_clusters)
    db = davies_bouldin_score(X, labels_clusters)
    tmp_scores = {"n_clusters": n_clusters,
                  "silhouette_score": silhouette,
                  "calinski_harabasz_score": ch,
                  "davies_bouldin_score": db,
                  }
    df_scores.append(tmp_scores)

# Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)
This code assumes that all your numerical features are in a DataFrame X.
All clustering performance metrics are stored in df_scores DataFrame.
You can easily use the elbow method by plotting columns from df_scores; for instance, if you want to see the elbow graph of the Silhouette Score, you can use df_scores["silhouette_score"].plot().
It's pretty common to start with visualizing the data. Sometimes it is obvious graphically, that there are N classes/clusters. Other times you may be able to see if it's <5, <10, or <100 classes. It depends on your data really.
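As a rough illustration of that first step (a sketch only; it assumes the user-by-movie ratings sit in a numeric matrix with blanks already filled in or imputed), you could look at a 2-D PCA projection and see whether any grouping is visible:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# placeholder binary ratings matrix; replace with your own (blanks filled/imputed)
X = np.random.randint(0, 2, size=(100, 200)).astype(float)

coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()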
Another common approach is to use the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
The main caveat is that many clustering problems can yield seemingly optimal results if, for example, you use as many clusters as you have inputs: every input fits perfectly in its own cluster.
BIC/AIC penalize such high-dimensional solutions, based on the insight that simpler models are often better/more stable, i.e. they generalize better and overfit less.
From wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
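As one way to apply this idea in practice (a sketch, not the only option): BIC/AIC need a likelihood, so a common route is to fit Gaussian mixture models, a soft k-means-like clustering, over a range of K and keep the K with the lowest BIC. X is a placeholder feature matrix here:
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 5)  # placeholder data; replace with your own matrix

bic_per_k = {}
for k in range(2, 16):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_per_k[k] = gmm.bic(X)

best_k = min(bic_per_k, key=bic_per_k.get)  # lowest BIC wins
print(best_k, bic_per_k[best_k])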
You can use the Gini index as a metric, and then do a grid search based on this metric. Let me know if you have any other questions.
You could use the elbow method.
The basic idea of K-Means is to cluster the data points such that the total within-cluster sum of squares (a.k.a. WSS) is minimized. Hence you can vary k from 2 to n while also calculating the WSS at each value; plot the curve, find the location of the bend, and that can be considered the optimal number of clusters!
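A minimal sketch of that procedure, using the inertia_ attribute of scikit-learn's KMeans as the WSS (X is a placeholder feature matrix):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(100, 5)  # placeholder; replace with your own feature matrix

ks = range(2, 21)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    wss.append(km.inertia_)  # total within-cluster sum of squares

plt.plot(list(ks), wss, marker='o')
plt.xlabel('k')
plt.ylabel('WSS (inertia)')
plt.show()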
I am implementing a Kalman filter but I am getting a huge amount of noise in the result. The amount of noise I added, and the code for it, is:
y = np.arange(100)
y = y + 0.0065 * np.random.randn() # yaw rate
My resulting output (plot omitted) is still very noisy.
0.0065 is already very little noise. Is there any way to clean this noise up?
The Kalman filter is indeed one way to filter noise out of your observations. If well implemented and tuned to the dynamics of your problem, it can effectively retrieve accurate estimates in a Gaussian noise scenario (for different noise models the classical Kalman filter may not be suitable). As explained here, you need to tune your process and observation noise covariance matrices. Sometimes it's a process of trial and error. The more noisy you consider your observations to be, the more weight the filter puts on the previous knowledge. The more noisy you consider your process to be (highly dynamic), the less weight the filter puts on the previous knowledge.
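To make that trade-off concrete, here is a minimal scalar Kalman filter sketch; the ramp dynamics and the Q/R values are made-up illustrations, not tuned for your problem:
import numpy as np

rng = np.random.default_rng(0)
true_signal = np.arange(100, dtype=float)
measurements = true_signal + 0.0065 * rng.standard_normal(100)

Q = 1e-4         # process noise variance: how much the model dynamics are trusted
R = 0.0065 ** 2  # measurement noise variance: how noisy the sensor is believed to be

x_est, p_est = 0.0, 1.0  # initial state estimate and its variance
estimates = []
for z in measurements:
    # predict step: the assumed dynamics here are a simple +1 ramp per sample
    x_pred = x_est + 1.0
    p_pred = p_est + Q
    # update step: blend prediction and measurement via the Kalman gain
    k_gain = p_pred / (p_pred + R)
    x_est = x_pred + k_gain * (z - x_pred)
    p_est = (1.0 - k_gain) * p_pred
    estimates.append(x_est)
Making R larger relative to Q makes the filter lean more on its prediction (smoother but laggier); making Q larger makes it follow the raw measurements more closely.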
Suppose I have a paragraph:
Str_wrds ="Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs."
And have the following test_wrds,
Test_wrds = ['Power curve', 'data-driven','wind turbines']
I would like to select the sentence before and after whenever a Test_wrds entry is found in the paragraph, and list them as a separate string. For example, 'Power curve' appears first in the 1st sentence, but the 2nd sentence also contains 'power curve', so the output would be something like this:
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.
And likewise, I would like to slice sentences for 'data-driven' and 'wind turbines' and save them in separate strings.
How can I implement this using Python in a simple way?
So far I found code which basically removes the entire sentence whenever any of the Test_wrds is in it:
def remove_sentence(Str_wrds, Test_wrds):
    return ".".join((sentence for sentence in Str_wrds.split(".")
                     if Test_wrds not in sentence))
But I don't understand how to use this for my problem.
Update on the problem: basically, whenever a Test_wrds entry is present in the paragraph, I would like to slice that sentence as well as the one before and the one after it, and save them in a single string. So for the three Test_wrds I expect to get three strings, each covering the sentences containing that word. I attached a pdf with an example of the output I am looking for.
You could define a function something like this one:
def find_sentences( word, text ):
    sentences = text.split('.')
    findings = []
    for i in range(len(sentences)):
        if word.lower() in sentences[i].lower():
            if i == 0:
                findings.append( sentences[i+1]+'.' )
            elif i == len(sentences)-1:
                findings.append( sentences[i-1]+'.' )
            else:
                findings.append( sentences[i-1]+'.' + sentences[i+1]+'.' )
    return findings
This can then be called as
findings = find_sentences( 'Power curve', Str_wrds )
With some pretty printing
for finding in findings:
    print( finding +'\n')
We get the results
However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. Data-driven model accuracy is significantly affected by uncertainty.
The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.
The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines..
which I hope is what you were looking for :)
When you say,
I would like to select before and after 1 sentence whenever Test_wrds found it in a paragraph and list them as a separate string.
I guess you mean that all the sentences that contain one of the words in Test_wrds, plus the sentence before and the sentence after each of them, should be selected.
Function
def remove_sentence(Str_wrds: str, Test_wrds):
    # store all selected sentences
    all_selected_sentences = {}
    # initialize empty dictionary
    for k in Test_wrds:
        # one element for each occurrence
        all_selected_sentences[k] = [''] * Str_wrds.lower().count(k.lower())
    # list of sentences
    sentences = Str_wrds.split(".")
    word_counter = {}.fromkeys(Test_wrds, 0)
    for i, sentence in enumerate(sentences):
        for j, word in enumerate(Test_wrds):
            # case insensitive
            if word.lower() in sentence.lower():
                if i == 0:  # first sentence
                    chosen_sentences = sentences[0:2]
                elif i == len(sentences) - 1:  # last sentence
                    chosen_sentences = sentences[-2:]
                else:
                    chosen_sentences = sentences[i - 1:i + 2]
                # get which occurrence of the word is it
                k = word_counter[word]
                all_selected_sentences[word][k] += '.'.join(
                    [s for s in chosen_sentences
                     if s not in all_selected_sentences[word][k]]) + "."
                word_counter[word] += 1  # increment the word counter
    return all_selected_sentences
Running this
answer = remove_sentence(Str_wrds, Test_wrds)
print(answer)
with the provided values for Str_wrds and Test_wrds,
returns this output
{
'Power curve': [
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.',
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty.',
' The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.',
' The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
],
'data-driven': [
' However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.',
' Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression.',
' Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated.'
],
'wind turbines': [
' A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
]
}
Notes:
the function returns a dict of lists
every key is a word in Test_wrds, and each list element corresponds to one occurrence of that word.
for example, because the word 'power curve' occurs 4 times in the entire text, the value for 'power curve' in the output is a list of 4 elements.
I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end-gauge of actual (unknown) length at room temperature, and measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient, dT is the deviation from room temperature, and m is the measured quantity. The goal is to find the posterior distribution of length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty).
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()

xdata = np.random.normal(50.000215, 5.8e-6*np.sqrt(1000), 1000)

with basic_model:
    # prior distributions
    theta = pm.Normal('theta', mu=-.1, sd=.04)
    alpha = pm.Normal('alpha', mu=.0000115, sd=.0000012)
    length = pm.Normal('length', mu=50, sd=1)
    mumeas = length*(1+alpha*theta)

with basic_model:
    obs = pm.Normal('obs', mu=mumeas, sd=5.8e-6, observed=xdata)
    #yobs = Normal('yobs',)
    start = pm.find_MAP()
    #trace = pm.sample(2000, step=pm.Metropolis, start=start)
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=200000, step=step, start=start, njobs=1)

length_samples = trace['length']

fig, ax = plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn’t working. I've been trying for a while and it never converges to the expected solution given from the R code. I tried the default sampler (NUTS I think) as well as Metropolis but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
                        data = jags_data,
                        n.chains = 1,
                        n.adapt = 30000)

post_samples <- coda.samples(model = jags_code,
                             variable.names = c("l","mu","alpha","theta"),#,"ypred"),
                             n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
    l ~ dnorm(50, 1.0)
    alpha ~ dnorm(0.0000115, 694444444444)
    theta ~ dnorm(-0.1, 625)
    mu <- l*(1+alpha*theta)
    xbar ~ dnorm(mu, 29726516052)
}
(note: I think the dnorm distribution takes the precision 1/sigma^2 as its second argument, hence the weird-looking numbers)
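As a quick sanity check, those values do match the standard deviations used above (precision = 1/sd^2):
# precision = 1 / sd**2 for each prior / the measurement noise
print(1 / 0.04**2)     # 625.0        -> theta (dT) prior
print(1 / 1.2e-6**2)   # ~6.944e11    -> alpha prior
print(1 / 5.8e-6**2)   # ~2.973e10    -> xbar measurement noise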
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model. That can lead to a situation of highly correlated posterior parameters. Add to that the extreme scale imbalances, and it makes sense this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
    # tau === precision parameterization
    dT = pm.Normal('dT', mu=-0.1, tau=625)
    alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
    length = pm.Normal('length', mu=50.0, tau=1.0)

    mu = pm.Deterministic('mu', length*(1+alpha*dT))

    # only one input observation; tau indicates the 5.8 nm sd
    obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])

    trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
(Trace plots, forest plot, and pair plot omitted.)
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
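One quick way to do that check is to look at the sampler diagnostics (a minimal sketch; recent pymc3 versions report R-hat and effective sample sizes here):
# R-hat close to 1 and reasonably large effective sample sizes are what you
# want before trusting the strongly correlated length/dT estimates
with m0:
    print(pm.summary(trace))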
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.
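For what it's worth, one hypothetical starting point is a non-centered parameterization: sample unit-scale latent variables and transform them back to physical units. This is a sketch only, with the same priors and single summary observation as above; I have not verified how well NUTS handles it:
import pymc3 as pm

with pm.Model() as m1:
    # unit-scale latent variables
    dT_z = pm.Normal('dT_z', mu=0.0, sd=1.0)
    alpha_z = pm.Normal('alpha_z', mu=0.0, sd=1.0)
    length_z = pm.Normal('length_z', mu=0.0, sd=1.0)

    # transform back to the original units (same priors as before)
    dT = pm.Deterministic('dT', -0.1 + 0.04 * dT_z)
    alpha = pm.Deterministic('alpha', 1.15e-5 + 1.2e-6 * alpha_z)
    length = pm.Deterministic('length', 50.0 + 1.0 * length_z)

    mu = pm.Deterministic('mu', length * (1 + alpha * dT))
    obs = pm.Normal('obs', mu=mu, sd=5.8e-6, observed=[50.000215])

    trace = pm.sample(2000, tune=2000, target_accept=0.95)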
By my understanding of DBSCAN, it's possible for you to specify an epsilon of, say, 100 meters and — because DBSCAN takes into account density-reachability and not direct density-reachability when finding clusters — end up with a cluster in which the maximum distance between any two points is > 100 meters. In a more extreme possibility, it seems possible that you could set epsilon of 100 meters and end up with a cluster of 1 kilometer:
see [2][6] in this array of images from scikit learn for an example of when that might occur. (I'm more than willing to be told I'm a total idiot and am misunderstanding DBSCAN if that's what's happening here.)
Is there an algorithm that is density-based like DBSCAN but takes into account some kind of thresholding for the maximum distance between any two points in a cluster?
DBSCAN indeed does not impose a total size constraint on the cluster.
The epsilon value is best interpreted as the size of the gap separating two clusters (that may at most contain minpts-1 objects).
I believe you are in fact not even looking for clustering: clustering is the task of discovering structure in data. The structure can be simple (such as the one found by k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical clustering or DBSCAN).
You might be looking for vector quantization - reducing a data set to a smaller set of representatives - or set cover - finding the optimal cover for a given set - instead.
However, I also have the impression that you aren't really sure on what you need and why.
A strength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an optimally efficient algorithm to discover this structure.
Direct density reachability, however, doesn't define a useful (partitioning) structure: it just does not partition the data into disjoint partitions.
If you don't need this kind of strong structure (i.e. you don't do clustering as in "structure discovery", but just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the structure is not guaranteed to be optimal in any way, but will depend highly on the ordering of your data. If you then preprocess it appropriately, it can still be useful. Unless you are in a distributed setting, canopy preclustering, however, is at least as expensive as a full DBSCAN run. Due to the loose requirements (in particular, "clusters" may overlap, and objects are expected to belong to multiple "clusters"), it is easier to parallelize.
Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, the resulting clusters should all have the desired maximum distance between any two of their objects. The only problem is that hierarchical clustering usually is O(n^3), i.e. it doesn't scale to large data sets. DBSCAN runs in O(n log n) in good implementations (with index support).
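A minimal sketch of that last option with scipy (placeholder data and threshold; for geographic coordinates you would substitute an appropriate distance measure):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(500, 2)   # placeholder coordinates; replace with your data
max_dist = 0.1               # placeholder threshold, in the same units as X

dists = pdist(X)                       # pairwise distances (Euclidean here)
Z = linkage(dists, method='complete')  # complete-linkage dendrogram
# cutting at max_dist guarantees every pair within a cluster is <= max_dist apart
labels = fcluster(Z, t=max_dist, criterion='distance')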
I had the same problem and ended up solving it by using DBSCAN in combination with KMeans clustering: First I use DBSCAN to identify high density clusters and remove outliers, then I take any cluster larger than 250 Miles (in my case) and break it apart. Here's the code:
from sklearn.cluster import DBSCAN, KMeans
import mpu

clustering = DBSCAN(eps=0.3, min_samples=100).fit(load_geocodes[['lat', 'long']])
load_geocodes.loc[:, 'cluster'] = clustering.labels_

def calculate_cluster_size(lat, long):
    # haversine distance between the bounding-box corners, converted from km to miles
    left_top = (max(lat), min(long))
    right_bottom = (min(lat), max(long))
    distance = mpu.haversine_distance(left_top, right_bottom)*0.621371
    return distance

for c, df in load_geocodes.groupby('cluster'):
    if c == -1:
        continue  # don't do this for outliers
    distance = calculate_cluster_size(df['lat'], df['long'])
    print(distance)
    if distance > 250:
        # break clusters into more clusters until the maximum size of a cluster is less than 250 Miles
        max_distance = distance
        i = 2
        while max_distance > 250:
            kmeans = KMeans(n_clusters=i, random_state=0).fit(df[['lat', 'long']])
            df.loc[:, 'cl_temp'] = kmeans.labels_
            max_temp_cl_size = 0
            for temp_cl, temp_cl_df in df.groupby('cl_temp'):
                temp_cl_size = calculate_cluster_size(temp_cl_df['lat'], temp_cl_df['long'])
                if temp_cl_size > max_temp_cl_size:
                    max_temp_cl_size = temp_cl_size
            i += 1
            max_distance = max_temp_cl_size
        load_geocodes.loc[df.index, 'subcluster'] = kmeans.labels_