I'm relatively new to Python and Machine Learning, but I've been working on building out a predictive model for Mortgage prices. Where I'm struggling is using the K-Nearest Neighbor algorithm to create a feature.
Here's how I understand the mechanics of what I want to accomplish:
1. I have two data files: Mortgages Sold and Mortgages Listed.
2. In both data files I have the same features (including Lat/Long).
3. I want to create a column in Mortgages Listed that represents the median price of the most closely related homes in the immediate area.
4. I'll use the methodology listed in 3 to create columns for 1-3 months, 4-6 months, and 7-12 months.
5. Another column would be the trend of those three columns.
I've found something on KNN imputation, but that doesn't seem to be what I'm looking for.
How do I go about executing this idea? Are there resources that I may have missed that would help?
Any guidance would be appreciated. Thanks!
So, from what I understand, you want to fit the KNN Model using Mortgages Sold data to predict the prices for Mortgages Listed data.
This is a classical KNN problem: for each feature vector in the Listed data, you find the nearest feature vectors in the Sold data and then take the median of their prices.
Consider there are n rows in Sold data, and the feature vectors for each row are X1,X2, ..., Xn and the corresponding prices are P1, P2, ..., Pn
X_train = [X1, X2, ..., Xn]
y_train = [P1, P2, ..., Pn]
Note that each Xi is itself a feature vector representing the ith row.
For now, consider that you want 5 closest rows in Sold data for each row in Listed data. So, a KNN model parameter here which might need to be optimised later is:
NUMBER_OF_NEIGHBOURS = 5
Because price is a continuous value, a classifier such as KNeighborsClassifier is not the right estimator here; what you actually need are the indices of the closest Sold rows, which scikit-learn's unsupervised NearestNeighbors class gives you directly. The training code will look something like this:
import numpy as np
from sklearn.neighbors import NearestNeighbors
knn_model = NearestNeighbors(n_neighbors=NUMBER_OF_NEIGHBOURS)
knn_model.fit(X_train)
y_train = np.asarray(y_train)  # keep the Sold prices as an array for the median lookup
For prediction, consider there are m rows in Listed data, and the feature vectors for each row are F1, F2, ..., Fm. The corresponding median prices Z1, Z2, ..., Zm need to be determined.
X_test = [F1, F2, ..., Fm]
Note that the feature vectors in X_train and X_test should be produced by the same vectorizer/transformer, fitted on the Sold data and then reused on the Listed data. Read more about vectorizers here.
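Since your features are numeric (Lat/Long and the like), a shared StandardScaler is one common choice; here is a minimal sketch (the scaler itself is my assumption, not something from your post):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the Sold features only
X_test = scaler.transform(X_test)        # reuse the same fitted scaler on the Listed features
Fitting the transformer on the training data only avoids leaking information from the Listed data into the distance computation.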
The prediction code will look something like this:
distances, neighbour_indices = knn_model.kneighbors(X_test)
Each row of neighbour_indices contains (in this case) the indices of the 5 closest Sold rows, so the corresponding prices for the jth Listed row are:
y_train[neighbour_indices[j]] = [Pj1, Pj2, ..., Pj5]
For each jth row, the median price is then:
Zj = np.median(y_train[neighbour_indices[j]])
Hence, in that way, you can find the median price Zj for each row of the Listed data. In fact, you can compute all of them at once with Z = np.median(y_train[neighbour_indices], axis=1).
Now, coming to the parameter optimisation part: the main hyper-parameter in your KNN model is NUMBER_OF_NEIGHBOURS. You can find a good value by splitting the Sold data itself, say in an 80:20 ratio: fit on the 80% part and validate on the remaining 20% by comparing the KNN-median estimate against each held-out home's actual sold price. Once you are happy with the error, use that value of NUMBER_OF_NEIGHBOURS for the predictions on X_test.
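Here is a minimal sketch of that validation loop, assuming X_train and y_train are NumPy arrays of the Sold features and prices (mean absolute error is my choice of metric, not something prescribed above):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Hold out 20% of the Sold data for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

for k in [3, 5, 10, 20]:
    nn = NearestNeighbors(n_neighbors=k).fit(X_tr)
    _, idx = nn.kneighbors(X_val)             # indices of the k nearest Sold rows
    medians = np.median(y_tr[idx], axis=1)    # KNN-median estimate per validation row
    mae = np.mean(np.abs(medians - y_val))    # error against the actual sold prices
    print(k, mae)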
In the end, for month-wise analysis, you will need to create month-wise models. For example, M1 = Trained on 1-3 month Sold data, M2 = Trained on 4-6 month Sold data, M3 = Trained on 7-12 month Sold data, etc.
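A minimal sketch of that month-wise split, assuming the Sold data sits in a pandas DataFrame sold with sale_date and price columns (both column names are hypothetical) and that the features are everything else:
import pandas as pd
from sklearn.neighbors import NearestNeighbors

today = pd.Timestamp.today()
windows = {"1-3m": (1, 3), "4-6m": (4, 6), "7-12m": (7, 12)}
models = {}
for name, (lo, hi) in windows.items():
    # Keep only the homes sold inside this time window.
    mask = ((sold["sale_date"] >= today - pd.DateOffset(months=hi)) &
            (sold["sale_date"] <= today - pd.DateOffset(months=lo)))
    subset = sold[mask]
    features = subset.drop(columns=["price", "sale_date"])
    prices = subset["price"].to_numpy()
    models[name] = (NearestNeighbors(n_neighbors=NUMBER_OF_NEIGHBOURS).fit(features), prices)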
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
I want to use the BIC criterion to find the optimal number of clusters for GMM clustering. I plotted the BIC scores for cluster numbers 2 to 41 and got the attached curve. I have no idea how to interpret this; can someone help?
For reference, this is the code I used to do GMM clustering. It is applied to daily wind vector data over a region, totaling approximately 5,500 columns and 13,880 rows.
import numpy as np
import pandas
from sklearn.mixture import GaussianMixture

def gmm_clusters(df_std, dates):
    ks = range(2, 44, 3)
    bic_scores = []
    csv_files = []
    for k in ks:
        model = GaussianMixture(n_components=k,
                                n_init=1,
                                init_params='random',
                                covariance_type='full',
                                verbose=0,
                                random_state=123)
        fitted_model = model.fit(df_std)
        bic_score = fitted_model.bic(df_std)
        bic_scores.append(bic_score)
        labels = fitted_model.predict(df_std)
        print("Labels counts")
        print(np.bincount(labels))
        df_label = pandas.DataFrame(df_std)
        print("############ dataframe AFTER CLUSTERING ###############")
        df_dates = pandas.DataFrame(dates)
        df_dates.columns = ['Date']
        df_dates = df_dates.reset_index(drop=True)
        df_label = df_label.join(df_dates)
        df_label["Cluster"] = labels
        print(df_label)
        csv_file = "{0}_GMM_2_Countries_850hPa.csv".format(k)
        df_label.to_csv(csv_file)
        csv_files.append(csv_file)
    return ks, bic_scores, csv_files
Thank you!!
EDIT:
Using K-means on the same data, I get this elbow plot (plot of SSE):
This is fairly clear to interpret, indicating that 11 clusters is the optimum.
The first thing that springs to mind is to check the numbers of clusters below 10 with a step of 1, not 3. Maybe there is a dip in BIC you are missing there.
The second thing is to compare AIC against BIC. See here: https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
The third thing is that your dataset has 5,500 dimensions but only 13,880 points, i.e. fewer than 3 points per dimension. I would be surprised to find any clustering at all (which is what the BIC chart is indicating). You'd need to tell us more about the data, what each column means, and what kind of clustering you are looking for.
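A minimal sketch of the first two suggestions combined, assuming df_std is the standardized data passed to your function:
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

ks = range(2, 12)  # step of 1 in the low-k region
aic_scores, bic_scores = [], []
for k in ks:
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=123).fit(df_std)
    aic_scores.append(gmm.aic(df_std))
    bic_scores.append(gmm.bic(df_std))

plt.plot(ks, bic_scores, marker='o', label='BIC')
plt.plot(ks, aic_scores, marker='o', label='AIC')
plt.xlabel('number of components')
plt.legend()
plt.show()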
I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 means good, 0 means bad, and a blank means the user made no choice.
I want to cluster similar users based on their reviews, with the idea that users who rated similar movies as good might also rate as good a movie that was not rated by any user in the same cluster. I used the cosine similarity measure with k-means clustering. The CSV file is shown below:
UserID M1 M2 M3 ............... M200
user1 1 0 0
user2 0 1 1
user3 1 1 1
.
.
.
.
user100 1 0 1
The problem I am facing is that I don't know exactly how to find the optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering with k-means and there is no issue with that, but I want to know the most stable or optimal number of clusters for this dataset.
I would appreciate some help.
Clustering belongs to the unsupervised machine learning methods. Contrary to supervised methods, in unsupervised methods there is no straightforward way to determine the "best" model among a set of models trained on a certain dataset.
Nonetheless, there are some quantitative measures. Most of them are based on the question "how much more similar are the points within a cluster to each other than to the points in other clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation, in particular at all the techniques that do not require labels_true (i.e. the unsupervised ones).
Once you have a quantitative measure of the "goodness" of a clustering, you usually observe how this quantity evolves as the number of clusters changes; this approach is called the Elbow Method.
Here is some code that runs the K-Means algorithm with K values from 2 to 30, calculates various scores for each K, and stores all the scores in a DataFrame.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

seed_random = 1
fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
    # Perform clustering.
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=seed_random,
                    )
    labels_clusters = kmeans.fit_predict(X)
    # Insert fitted model and calculated cluster labels in dictionaries,
    # for further reference.
    fitted_kmeans[n_clusters] = kmeans
    labels_kmeans[n_clusters] = labels_clusters
    # Calculate various scores, and save them for further reference.
    silhouette = silhouette_score(X, labels_clusters)
    ch = calinski_harabasz_score(X, labels_clusters)
    db = davies_bouldin_score(X, labels_clusters)
    tmp_scores = {"n_clusters": n_clusters,
                  "silhouette_score": silhouette,
                  "calinski_harabasz_score": ch,
                  "davies_bouldin_score": db,
                  }
    df_scores.append(tmp_scores)

# Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)
This code assumes that all your numerical features are in a DataFrame X.
All clustering performance metrics are stored in the df_scores DataFrame.
You can easily apply the elbow method by plotting columns from df_scores; for instance, to see the elbow graph of the silhouette score, use df_scores["silhouette_score"].plot().
It's pretty common to start with visualizing the data. Sometimes it is graphically obvious that there are N classes/clusters. Other times you may at least be able to see whether it's <5, <10, or <100 classes. It really depends on your data.
Another common approach is to use the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
The main takeaway is that many clustering problems can achieve a seemingly optimal result if, for example, you allow as many clusters as you have inputs: every input then fits perfectly into its own cluster.
BIC/AIC penalize solutions with many parameters, following the insight that simpler models are often better/more stable, i.e. they generalize better and overfit less.
From wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
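Concretely, with L̂ the maximized likelihood, k the number of model parameters and n the number of data points, the standard definitions are:
AIC = 2k - 2·ln(L̂)
BIC = k·ln(n) - 2·ln(L̂)
Since ln(n) exceeds 2 as soon as n > 7, BIC punishes extra parameters more heavily than AIC, which is why it tends to select the simpler model.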
You can use the Gini index as a metric, and then do a grid search based on this metric. Tell me if you have any other questions.
You could use the elbow method.
The basic idea of K-Means is to cluster the data points so that the total within-cluster sum of squares (WSS) is minimized. Hence you can vary k from 2 to n, calculating the WSS at each value; plot the curve, find the location of the bend (the "elbow"), and that can be taken as the optimal number of clusters.
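A minimal sketch of that loop, assuming your user-movie ratings are in a DataFrame X with the blanks already filled (e.g. with 0):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
k_range = range(2, 21)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    wss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

plt.plot(k_range, wss, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('WSS')
plt.show()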
I have data in this format:
[0.266465 0.9203907 1.007363 ... 0. 0.09623989 0.39632136]
This is the value in the first row, first column.
This is the value in the first row, second column:
[0.9042176 1.135085 1.2988662 ... 0. 0.13614458 0.28000486]
I have 2200 such rows, and I want to train a classifier to identify whether two such sets of values are similar or not.
P.S.- These are extracted feature vector values.
If you assume the relation between the two extracted feature vectors to be linear, you could try using the Pearson correlation:
import numpy as np
from scipy.stats import pearsonr
list1 = np.random.random(100)
list2 = np.random.random(100)
pearsonr(list1, list2)
An example output is:
(0.0746901299996632, 0.4601843257734832)
where the first value is the correlation coefficient (about 7%) and the second is its p-value (if it is > 0.05, you fail to reject the null hypothesis that the correlation is zero at significance level alpha = 5%). If the vectors are strongly correlated, they are in that sense similar. More about the method here.
Also, I came across Normalized Cross-Correlation, which is used for identifying similarity between pictures (I'm not an expert, so please double-check this).
I am trying to build a content-based recommender system in python/pandas/numpy/sklearn.
Here are the matrices involved and their sizes:
X: n_customers * n_features (contains the features of each customer)
Y: n_customers * n_products (contains the scores given by each customer to each product)
Theta: n_features * n_products
The aim is to learn Theta in order to predict the score given by a customer to every product (X * Theta). Indeed, Y is a sparse matrix: a customer scores only a very small percentage of all the products. This is why Y contains a lot of NaN values.
Here is my problem:
This is a regression problem with many targets (here, target = product), but I want to do the regression only on the non-null values. Because the number of NaNs differs from one product to another, how can I vectorize that?
Assume there are 1000 products and 100 000 customers, each one having 20 features.
For each product I need to do the regression on the non-null values, so without vectorization I would need 1000 different regressors, each learning a Theta vector of length 20.
If possible, I would like to solve this problem with sklearn. Ridge regression, for example, handles multiple targets (Y as a matrix).
I hope it's clear enough.
Thank you for your help.
I believe you can use centered cosine similarity / Pearson correlation together with a collaborative filtering technique to achieve this.
Before computing the similarities, subtract each customer's mean score from their known scores, then fill the null fields (the entries with no score) with zero. After this centering, a zero corresponds to the customer's own average, and the cosine similarity of the centered rows behaves like a Pearson correlation, which gives more sensible recommendations.
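A minimal sketch of that centering step, assuming the scores are in a pandas DataFrame Y with customers as rows, products as columns and NaN for the missing scores:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Subtract each customer's mean score from their known scores,
# then fill the remaining NaNs with zero (i.e. with that customer's own average).
Y_centered = Y.sub(Y.mean(axis=1), axis=0).fillna(0)

# Cosine similarity on the centered matrix behaves like a Pearson correlation
# between customers; the most similar customers can then be used to predict missing scores.
customer_similarity = cosine_similarity(Y_centered)
Filling with zero after centering means a missing score counts as "average" rather than as a strong negative signal, which is the whole point of the centering.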