There is a nice example of linear regression in sklearn using a diabetes dataset.
I copied the notebook version and played with it a bit in Jupyterlab. Of course, it works just like the example. But I wondered what I was really seeing.
There is a chart with unlabeled axes.
I wondered what the label (dependent variable) was.
I wondered which of the 10 independent variables was being used.
So I played around with the nice features provided by ipython/jupyter:
diabetes.DESCR
Diabetes dataset
================
Notes
-----
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of
n = 442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attributes:
:Age:
:Sex:
:Body mass index:
:Average blood pressure:
:S1:
:S2:
:S3:
:S4:
:S5:
:S6:
Note: Each of these 10 feature variables have been mean centered and scaled by the standard
deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)'
From the Source URL, we are led to the original raw data which is a tab-separated unnormalized copy of the data. It also further explains what the "S" features were in the problem domain.
Interestingly, sex was one of [1,2] with a guess as to what they meant.
But my real question is whether there is a way within sklearn to
denormalize the data.
Is there a way to denormalize the coefficients and intercept so that one could
express the fit algebraically in the original units?
Or is this just a demonstration of linear regression?
There is no way to denormalize data without information about the data prior to normalization. Note, however, that the sklearn.preprocessing classes MinMaxScaler, StandardScaler, etc. do provide inverse_transform methods, so if the fitted scaler had been provided with the example, it would be easy to do. As it stands, as you say, this is just a regression demonstration.
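As a minimal sketch of the inverse_transform route, assuming you fit the scaler yourself on raw data: the synthetic features, the variable names, and the "true" coefficients below are illustrative assumptions, not the diabetes dataset (whose raw values would have to come from the file behind the Source URL). It also shows how coefficients fitted on standardized features can be rewritten in the original units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for raw (unnormalized) features; not the diabetes data.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[50.0, 25.0], scale=[10.0, 4.0], size=(200, 2))
y = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 7.0 + rng.normal(scale=0.1, size=200)

# Fit the scaler and keep it: it stores the per-column mean_ and scale_.
scaler = StandardScaler().fit(X_raw)
X_scaled = scaler.transform(X_raw)

# inverse_transform recovers the original values from the scaled ones.
X_back = scaler.inverse_transform(X_scaled)

# Fit on scaled features, then express the fit in original units, using
# y = b0 + sum_j w_j * (x_j - mean_j) / scale_j
model = LinearRegression().fit(X_scaled, y)
coef_raw = model.coef_ / scaler.scale_
intercept_raw = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)

print(coef_raw, intercept_raw)
```

The back-transformed coefficients should match what a regression on the raw features gives directly, which is one way to check the algebra.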
I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 for good, 0 for bad and blank if the annotator has no choice.
I want to cluster similar users based on their reviews with the idea that users who rated similar movies as good might also rate a movie as good which was not rated by any user in the same cluster. I used cosine similarity measure with k-means clustering. The csv file is shown below:
UserID   M1  M2  M3  ...............  M200
user1    1   0   0
user2    0   1   1
user3    1   1   1
...
user100  1   0   1
The problem I am facing is that I don't know how to find the optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering with k-means and there is no issue with that, but I want to know the most stable or optimal number of clusters for this dataset.
I would appreciate some help.
Clustering is an unsupervised machine learning method. Contrary to supervised methods, in unsupervised methods there is no straightforward way to determine the "best" model among a set of models that were trained on a certain dataset.
Nonetheless, there are some quantitative measures. Most of them are based on the question "how much more similar are the points in a cluster to each other than to the points in other clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation, in particular at all the techniques that do not require labels_true (i.e. the unsupervised techniques).
Once you have a quantitative measure of the "goodness" of a clustering, you usually observe how this quantity evolves as you change the number of clusters; this approach is called the Elbow Method.
Here is some code that uses K-Means algorithm with all possible K values from 2 to 30, calculates various scores for each K value, and stores all scores in a DataFrame.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score,
                             silhouette_score)

seed_random = 1
fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
    # Perform clustering.
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=seed_random,
                    )
    labels_clusters = kmeans.fit_predict(X)
    # Insert fitted model and calculated cluster labels in dictionaries,
    # for further reference.
    fitted_kmeans[n_clusters] = kmeans
    labels_kmeans[n_clusters] = labels_clusters
    # Calculate various scores, and save them for further reference.
    silhouette = silhouette_score(X, labels_clusters)
    ch = calinski_harabasz_score(X, labels_clusters)
    db = davies_bouldin_score(X, labels_clusters)
    tmp_scores = {"n_clusters": n_clusters,
                  "silhouette_score": silhouette,
                  "calinski_harabasz_score": ch,
                  "davies_bouldin_score": db,
                  }
    df_scores.append(tmp_scores)
# Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)
This code assumes that all your numerical features are in a DataFrame X.
All clustering performance metrics are stored in df_scores DataFrame.
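You can easily use the elbow method by plotting columns from df_scores; for instance, to see the graph of the Silhouette Score against the number of clusters, you can use df_scores["silhouette_score"].plot(). Keep in mind that higher is better for the Silhouette and Calinski-Harabasz scores, while lower is better for the Davies-Bouldin score.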
It's pretty common to start by visualizing the data. Sometimes it is graphically obvious that there are N classes/clusters. Other times you may only be able to see whether there are <5, <10, or <100 classes. It really depends on your data.
Another common approach is to use the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
The main takeaway is that many clustering problems yield a trivially "optimal" fit if you have as many clusters as you have inputs: every input fits perfectly in its own cluster.
BIC and AIC penalize such complex solutions, based on the insight that simpler models are often better and more stable, i.e. they generalize better and overfit less.
From wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
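K-means itself does not define a likelihood, so BIC and AIC are not available from KMeans directly. One common workaround, sketched here as an assumption and not as the method from the answer above, is to fit scikit-learn's GaussianMixture for each candidate number of components and compare its bic() and aic() values. The synthetic blobs and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three well-separated groups (illustrative only).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

bic_by_k = {}
aic_by_k = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X_demo)
    bic_by_k[k] = gmm.bic(X_demo)  # lower is better
    aic_by_k[k] = gmm.aic(X_demo)  # lower is better

# Candidate number of clusters with the lowest BIC.
best_k_bic = min(bic_by_k, key=bic_by_k.get)
print(best_k_bic)
```

For binary rating data a Gaussian mixture is only a rough model, so the selected value is best treated as one hint among several.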
You can use the Gini index as a metric, and then do a Grid Search based on this metric. Tell me if you have any other question.
You could use the elbow method.
K-Means clusters the data points so that the total within-cluster sum of squares (WSS) is minimized. Hence you can vary k from 2 to n, calculate the WSS for each k, and plot WSS against k. The location of the bend (the "elbow") in the curve can be taken as a reasonable number of clusters.
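A minimal sketch of the WSS computation, assuming scikit-learn's KMeans, whose fitted inertia_ attribute is the within-cluster sum of squared distances. The synthetic blobs stand in for the real feature matrix and the names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups (illustrative only).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

wss_by_k = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wss_by_k[k] = km.inertia_  # within-cluster sum of squares for this k

# WSS drops steeply up to the "true" k and flattens afterwards.
for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

Plotting wss_by_k (for example with matplotlib) gives the elbow graph; the bend is judged by eye, so it is worth cross-checking against the other scores.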
I want to study the relationship between car accidents and weather temperature.
So, I have a dataset of car accidents that has different attributes related to each accident, plus the weather temperature at the time the accident occurred, as follows:
To study this relationship, I want to formulate my hypotheses as follows:
H0: There is no relationship between hot weather (greater than 28
degrees Celsius) and the number of car accidents
H1: There is a relationship between hot weather (greater than 28
degrees Celsius) and the number of car accidents
I am not sure of how to calculate the p-value for the above hypothesis in python. I did the following:
import pandas as pd
from scipy.stats import ttest_ind
cd = pd.read_csv('Accidents.csv', parse_dates=['DATE'])
hot = cd[cd['Temperature Celsius']>28]
notHot = cd[cd['Temperature Celsius']<=28]
ttest_ind(hot['Temperature Celsius'], notHot['Temperature Celsius'])
How do I calculate the p-value for the above hypothesis? Is my implementation correct, i.e. is it enough to select the records that match my criteria and pass them to ttest_ind, or do I have to pass the whole dataset instead of "notHot"?
Or should I summarize the data using a different approach, such as the number of accidents in each month compared to the targeted weather temperature, and perform a different statistical test as follows:
I am a bit lost on how to choose the best statistical test and how to perform it. I am interested in the statistical significance of the effect of temperature on the number of accidents. So, is it better to calculate the statistical significance using the above hypothesis or using regression, for example?
Thank you very much.
The sample dataset is available in the following link:
https://drive.google.com/open?id=1WWtihWyUhL1m5Bp094SINTF14_icncnh
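For illustration only, here is a sketch of the "summarize first" idea from the question: count accidents per day, then compare the daily counts on hot and not-hot days. The data are synthetic, the built-in effect size is made up, and the choice of Welch's t-test is an assumption, not a recommendation derived from the real file. Note that the snippet in the question compares temperatures between the two groups, not accident counts, which is why a per-day count is built here.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for Accidents.csv: one row per accident.
rng = np.random.default_rng(0)
days = pd.date_range("2019-01-01", periods=365, freq="D")
daily_temp = rng.uniform(10, 40, size=len(days))
# Hypothetical effect: hot days get slightly more accidents on average.
# The +1 keeps every day in the table (real data may lack zero-accident days).
daily_n = rng.poisson(lam=np.where(daily_temp > 28, 6.0, 5.0)) + 1
accidents = pd.DataFrame({
    "DATE": days.repeat(daily_n),
    "Temperature Celsius": np.repeat(daily_temp, daily_n),
})

# Summarize to one row per day: accident count and that day's temperature.
per_day = accidents.groupby("DATE").agg(
    n_accidents=("Temperature Celsius", "size"),
    temperature=("Temperature Celsius", "mean"),
)

hot_counts = per_day.loc[per_day["temperature"] > 28, "n_accidents"]
not_hot_counts = per_day.loc[per_day["temperature"] <= 28, "n_accidents"]

# Welch's t-test on daily counts (does not assume equal variances).
result = ttest_ind(hot_counts, not_hot_counts, equal_var=False)
print(result.statistic, result.pvalue)
```

With real data, days without any accident do not appear in an accident-level table and would have to be added from a separate weather record; for count data, a rank-based test or a Poisson regression on temperature are alternatives worth considering.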
I'm very new to PCA.
I have 11 X variables for my model. These are the X variable labels
x = ['Day','Month', 'Year', 'Rolling Average','Holiday Effect', 'Day of the Week', 'Week of the Year', 'Weekend Effect', 'Last Day of the Month', "Quarter" ]
This is the graph I generated from the explained variance, with the principal components on the x axis.
[ 3.47567089e-01 1.72406623e-01 1.68663799e-01 8.86739892e-02
4.06427375e-02 2.75054035e-02 2.26578769e-02 5.72892368e-03
2.49272688e-03 6.37160140e-05]
I need to know whether I have a good selection of features, and how I can tell which feature contributes the most.
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X_norm)
scores = pca.explained_variance_
Although I do not know the dataset, I recommend that you scale your features before using PCA: PCA picks the directions of maximum variance, so unscaled features with large ranges would dominate. I assume X_norm in your code refers to the scaled data.
With PCA, the goal is to reduce dimensionality. We start with a feature space that includes all X variables in your case, and end up with a projection of that space, which is typically a different feature (sub)space.
In practice, when you have correlations between features, PCA can help you project that correlated information onto fewer dimensions.
Think about it this way: if I'm holding a sheet of paper covered in dots on my desk, do I need the third dimension to represent that dataset? Probably not, since all the dots are on the paper and can be represented in 2D space.
When you are deciding how many principal components to keep from your new feature space, you can look at the explained variance; it tells you how much information each principal component carries.
When I look at the values in your question, the first 6 add up to about 0.85, so if they are read as fractions of the total variance, ~85% of the variance can be attributed to the first 6 principal components. Note that explained_variance_ holds raw variances; explained_variance_ratio_ gives each component's share of the total directly.
You can also set n_components. For example, if you use n_components=2, your transformed dataset will have 2 features.
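The points above can be sketched as follows. The feature matrix is synthetic and the feature names are placeholders, so this only illustrates the API calls (explained_variance_ratio_, components_, n_components), not the asker's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix; column names are illustrative.
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=500),  # strongly correlated with f0
    base[:, 1],
    rng.normal(size=500),
])
X_norm = StandardScaler().fit_transform(X)

pca = PCA().fit(X_norm)
ratio = pca.explained_variance_ratio_  # each component's share, sums to 1
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# Loadings: rows are components, columns are the original features.
# The largest absolute loading shows which feature drives the first component.
top_feature_pc1 = feature_names[int(np.argmax(np.abs(pca.components_[0])))]

# Keeping only two components yields a two-column dataset.
X_2d = PCA(n_components=2).fit_transform(X_norm)
print(ratio, n_keep, top_feature_pc1, X_2d.shape)
```

Loadings describe how the original features combine into each component; they are a hint about contribution, not a feature-importance score for a downstream model.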
I have a pandas.dataframe with a column passengers with a range which may vary greatly depending on the function creating the dataframe.
The other columns are often more or less of constant ranges (they're economy indicators).
segments.head(2);
passengers gdp gdp_per_capita inflation unemployment \
Month
2002-01-01 11688 4461.087 31634.953 150.847 14.418
2002-02-01 9049 4142.153 29321.702 204.132 14.738
population
Month
2002-01-01 339.59
2002-02-01 343.32
My most valuable data is the number of passengers, so I do not want to transform it. However, the differences of scale of the other measures, which I want to use as predictors, make it difficult to track the variations (sometimes in tens of thousands, sometimes in decimals).
How could I standardize the range of all my columns to be consistent with the mean(passengers)?
There are different ways to approach this problem: you can write and apply a manual transformation function, or you can use an existing one, such as sklearn.preprocessing.StandardScaler.
StandardScaler will "Standardize features by removing the mean and scaling to unit variance". Starting from the standardized values, you can then shift the mean and rescale the variance to whatever target you need.
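A minimal sketch of standardizing only the predictor columns while leaving passengers untouched. The small frame below imitates the one in the question with made-up values; the column selection is an assumption about which columns count as predictors.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small frame shaped like the one in the question (values are illustrative).
segments = pd.DataFrame(
    {
        "passengers": [11688, 9049, 10230, 12011],
        "gdp": [4461.087, 4142.153, 4300.5, 4500.2],
        "inflation": [150.847, 204.132, 180.0, 160.3],
        "unemployment": [14.418, 14.738, 14.5, 14.2],
    },
    index=pd.to_datetime(["2002-01-01", "2002-02-01", "2002-03-01", "2002-04-01"]),
)

predictors = [c for c in segments.columns if c != "passengers"]

scaled = segments.copy()
# Standardize only the predictors; the passengers column is left as it is.
scaled[predictors] = StandardScaler().fit_transform(segments[predictors])

print(scaled.round(3))
```

After this step each predictor has mean 0 and unit variance, so their variations are on a comparable scale regardless of the original units.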
However, it looks to me like you are going to build a predictive model on that data. If so, the best approach is to test all hypotheses and keep what works best. My advice is:
Remove skew from passengers (if present). Log and log1p are the most common transforms, but depending on your data other transforms might be better. You should test arbitrary functions as well (the inverse, or 1/(X+1), for example) and use the best transform (skew closest to 0).
Test both scaled and non-scaled features. If the data is skewed, test both with and without the transform above.
If outliers are present, test both with and without them (outliers converted to borderline values, or outliers converted to np.nan). Make a boolean feature column identifying outliers for each feature, and test whether it's valuable information or just noise to the model.
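The skew and outlier checks in the list above could look roughly like this. The data are synthetic, and the 1.5 * IQR rule used to define "borderline values" is an assumption chosen for illustration; the advice above does not prescribe a particular outlier rule.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed passenger counts (illustrative only).
rng = np.random.default_rng(0)
passengers = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=1000))

# Compare skewness before and after candidate transforms; closest to 0 wins.
candidates = {
    "raw": passengers,
    "log1p": np.log1p(passengers),
    "inverse": 1.0 / (passengers + 1.0),
}
skews = {name: float(values.skew()) for name, values in candidates.items()}
best = min(skews, key=lambda name: abs(skews[name]))

# Flag outliers with the 1.5 * IQR rule, then clip them to the borderline values.
q1, q3 = passengers.quantile([0.25, 0.75])
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
is_outlier = (passengers < low) | (passengers > high)  # boolean feature column
clipped = passengers.clip(lower=low, upper=high)

print(skews, best, int(is_outlier.sum()))
```

Whether the transformed, clipped, or flagged version actually helps is something to check against the model's validation score, as the list suggests.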
Hope that helps,