How do I measure the accuracy of k-means? - python

from datetime import time
from numpy import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.read_csv("spam.csv")
feature_col = ['total']
X = df[feature_col]
y = df.target
clf_km = KMeans(n_clusters=1)
clf_km = clf_km.fit(X)
clf_km.cluster_centers_
clf_km.labels_
I am trying to implement k-means clustering, but I don't know how to plot the original classes and the clusters k-means produced. I want one scatter plot for the original data and another for the clustered part of the CSV file.
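Since k-means is unsupervised, "accuracy" usually means either an internal measure such as the silhouette score or, when true labels exist as they do here, agreement with those labels (e.g. the adjusted Rand index). A minimal sketch, assuming the spam.csv columns above; note the silhouette score requires at least two clusters, so n_clusters=2 is used instead of 1:
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Refit with two clusters (the silhouette score is undefined for a single cluster)
clf_km = KMeans(n_clusters=2).fit(X)

# Internal measure: silhouette score, in [-1, 1], higher is better
print("silhouette:", silhouette_score(X, clf_km.labels_))

# External measure: agreement between the clusters and the true labels
print("adjusted Rand index:", adjusted_rand_score(y, clf_km.labels_))

# One scatter colored by the original labels, one by the k-means labels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.scatter(df.index, df['total'], c=y)
ax1.set_title('Original labels')
ax2.scatter(df.index, df['total'], c=clf_km.labels_)
ax2.set_title('KMeans labels')
plt.show()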

How to plot a column separated by date into 12 monthly bars?

I have a dataframe containing hotel prices separated by the date of the listing, and
I would like to plot the median of those prices as monthly bars.
So I first want to group the dates into months, then calculate the median for each month, and then plot the medians in a bar chart.
Can you please show me how to do that? (Python beginner here.)
Thank you in advance.
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None # default='warn'
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
calendar_df = pd.read_csv("calendar.csv")
calendar_df.head()
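One approach is to parse the dates, group by calendar month, take the median, and plot the result as bars. A sketch assuming calendar.csv has a 'date' column and a 'price' column (column names guessed, adjust them to your file):
# Parse the dates, group the listings by month, and take the median price
calendar_df['date'] = pd.to_datetime(calendar_df['date'])
monthly_median = calendar_df.groupby(calendar_df['date'].dt.month)['price'].median()

# One bar per month
ax = monthly_median.plot(kind='bar')
ax.set_xlabel('month')
ax.set_ylabel('median price')
plt.show()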

Data mining for machine learning

I am starting out in data analysis and ran into a problem on a Kaggle exercise (the 'ENB.csv' file). I import my data, compute the correlations, and create a new column in my dataframe that totals my target variables:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import model_selection
from sklearn.model_selection import validation_curve
from sklearn import ensemble
from sklearn import svm
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import VotingClassifier
df = pd.read_csv('ENB.csv')
df.columns = ["relative_compactness", "surface_area", "wall_area", "roof_area", "overall_height", "orientation",
              "glazing_area", "glazing_area_dist", "heating_load", "cooling_load"]
df.head()
corr = df.corr(method='pearson')
plt.figure(figsize = (20,10))
sns.heatmap(df.corr(), annot=True, cmap='Greens');
df['total_charges'] = pd.Series([1]).astype(dtype=float)
df['total_charges'] = df['heating_load'] + df['cooling_load']
I have to create a new variable 'charge_classes' that splits the buildings into 4 distinct classes labeled 0, 1, 2, 3 according to the 3 quartile breakpoints of the newly created variable. I have searched but cannot find a solution; can someone help? Here is what I did:
charge_classes = pd.get_dummies(df['total_charges'])
charge_classes
You could use qcut:
df['charge_classes'] = pd.qcut(df['total_charges'], 4, labels=False)
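pd.qcut with 4 bins splits total_charges at its three quartile breakpoints, and labels=False returns the integer labels 0, 1, 2, 3 directly. A quick check, using the column created above, that the four classes came out roughly balanced:
# Each of the four classes should hold about a quarter of the buildings
print(df['charge_classes'].value_counts())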

Combine multiple probability distributions together into one distribution in python

I have an experiment that involves sensors, and I have around 5 data files containing sensor data in the time domain. For simplicity, let's say we concentrate on one sensor; I need to obtain the probability distributions for all the data files. I looked online and managed to find the best-fit distribution using the following link:
Fitting empirical distribution to theoretical ones with Scipy (Python)
In my case, it turns out that a normal distribution fits my data. So I have multiple distributions and would like to combine them all into one. What I did was average the probability densities: I summed the density values pointwise and divided by 5.
The averaging is done with the following code:
def average(l):
    llen = len(l)
    def divide(x):
        return x / llen
    return map(divide, map(sum, zip(*l)))
lt = []  # collect one pdf per data file
for _ in range(5):
    # read sensor data
    # obtain the probability distribution using the code in the first link
    # collect the list of pdf values:
    np_pdf = list(y_axis_pdf)
    lt.append(np_pdf)
Average_list = average(lt)
Average_list = list(Average_list)
However, I asked a couple of people and searched online, and they said that averaging is not the best way. So what would be the correct way to combine several probability distributions into one?
My second question is that I searched online and found this article:
How to Combine Independent Data Sets for the Same Quantity
How can I use the code from the first link with the method described in the article?
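One option, rather than averaging fitted densities, is to fit a single distribution to the pooled raw measurements. A minimal sketch, where samples is a hypothetical list holding the five raw sensor arrays (not a name from the code above):
import numpy as np
import scipy.stats as st

# Pool the raw measurements from all five files and fit one normal to them;
# `samples` is a hypothetical list of the five raw sensor arrays
pooled = np.concatenate(samples)
mu, sigma = st.norm.fit(pooled)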
Edit 1:
Based on a comment from @SeverinPappadeux, I edited my code; it is now the following:
# Combining all PDF files into one dataset:
pdf_data = [np_pdf_01, np_pdf_02, np_pdf_03, np_pdf_04, np_pdf_05]
pdf_dataframe_ini = pd.DataFrame(pdf_data)
pdf_dataframe = pd.DataFrame.transpose(pdf_dataframe_ini)
# Creating one PDF from the PDF dataset:
gmm = GMM(n_components=1)
gmm.fit(pdf_dataframe)
x_pdf_data = [x_axis_pdf_01, x_axis_pdf_02, x_axis_pdf_03, x_axis_pdf_04, x_axis_pdf_05]
x_pdf = average(x_pdf_data)
x_pdf = list(x_pdf)
x = np.linspace(np.min(x_pdf), np.max(x_pdf), len(x_pdf)).reshape(len(x_pdf), 1)
logprob = gmm.score_samples(x)
pdf = np.exp(logprob)
I keep on getting the following error:
logprob = gmm.score_samples(x)
ValueError: Expected the input data X have 10 features, but got 1 features
How can I solve this error and get the PDF plot for the combined PDFs?
Sources:
How can I plot the probability density function for a fitted Gaussian mixture model under scikit-learn?
Edit 2:
I tried to use multivariate_normal to combine several distributions; however, I got the following error message:
ValueError: shapes (5,2000) and (1,1) not aligned: 2000 (dim 1) != 1 (dim 0)
How would I solve this error? Find my code below:
Code:
import scipy.stats as st
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy.integrate import quad, simps, quad_vec, nquad
import winsound
from functools import reduce
from itertools import chain
from glob import glob
from collections import defaultdict, Counter
from sklearn.neighbors import KDTree
import pywt
import peakutils
import scipy
import os
from scipy import signal
from scipy.fftpack import fft, fftfreq, rfft, rfftfreq, dst, idst, dct, idct
from scipy.signal import find_peaks, find_peaks_cwt, argrelextrema, welch, lfilter, butter, savgol_filter, medfilt, freqz, filtfilt
from pylab import *
import glob
import sys
import re
from numpy import NaN, Inf, arange, isscalar, asarray, array
from scipy.stats import skew, kurtosis, median_absolute_deviation
import warnings
from scipy.stats import pearsonr, kendalltau, spearmanr, ppcc_max
import matplotlib.mlab as mlab
from statsmodels.graphics.tsaplots import plot_acf
from tsfresh.feature_extraction.feature_calculators import mean_abs_change as mac
from tsfresh.feature_extraction.feature_calculators import mean_change as mc
from tsfresh.feature_extraction.feature_calculators import mean_second_derivative_central as msdc
from tsfresh.feature_extraction.feature_calculators import absolute_sum_of_changes as asc
from pyAudioAnalysis.ShortTermFeatures import energy as stEnergy
import pymannkendall as mk_test
from sklearn.preprocessing import MinMaxScaler, Normalizer, normalize, StandardScaler
import time
from sklearn.decomposition import PCA, KernelPCA, SparsePCA, IncrementalPCA
import circle_fit as cf
from scipy import optimize
import functools
from math import sqrt, pi
from ellipse import LsqEllipse
from matplotlib.patches import Ellipse
from mlxtend.feature_extraction import PrincipalComponentAnalysis
from sklearn.pipeline import make_pipeline
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns  # data visualization library
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from scipy.stats import f
# from statsmodels import api as sm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import (KNeighborsClassifier, NeighborhoodComponentsAnalysis)
from sklearn.metrics import classification_report
from sklearn.cross_decomposition import PLSRegression
from sklearn.covariance import EmpiricalCovariance, MinCovDet
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
# import tflearn
# import tensorflow as tf
from statistics import mean
import seaborn
from sklearn import preprocessing, neighbors
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from scipy.stats import mstats, multivariate_normal
def normalizer(list_values):
    norm = [float(i) / sum(list_values) for i in list_values]
    return norm
lb = -10
ub = 10
domain = np.arange(lb, ub, .01)
domain_size = domain.shape[0]
print(domain_size)
dist_1 = st.norm.pdf(domain, 2,1)
dist_2 = st.norm.pdf(domain, 2.5,1.5)
dist_3 = st.norm.pdf(domain, 2.2,1.6)
dist_4 = st.norm.pdf(domain, 2.4,1.3)
dist_5 = st.norm.pdf(domain, 2.7,1.5)
# dist_1_norm = normalizer(dist_1)
# dist_2_norm = normalizer(dist_2)
# dist_3_norm = normalizer(dist_3)
# dist_4_norm = normalizer(dist_4)
# dist_5_norm = normalizer(dist_5)
dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.plot(domain, st.norm.pdf(domain, 2,1), 'r', label='Dist. 1')
plt.plot(domain, st.norm.pdf(domain, 2.5,1.5), 'g', label='Dist. 2')
plt.plot(domain, st.norm.pdf(domain, 2.2,1.6), 'b', label='Dist. 3')
plt.plot(domain, st.norm.pdf(domain, 2.4,1.3), 'y', label='Dist. 4')
plt.plot(domain, st.norm.pdf(domain, 2.7,1.5), 'c', label='Dist. 5')
# multivariate_normal.pdf expects sample points plus a mean and covariance,
# not a list of already-evaluated densities, so this line raises the ValueError:
graph = multivariate_normal.pdf(dists)
plt.plot(domain, graph, 'm', label='Combined Dist.')
plt.legend()
plt.show()
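multivariate_normal.pdf expects sample points together with a mean and covariance, not a list of already-evaluated densities, which is why the shapes do not align. If the goal is the conflation described in the linked article, i.e. the normalized pointwise product of the densities, a sketch reusing the domain and dists arrays defined above:
# Conflation: pointwise product of the five densities, renormalized to unit area
product = np.prod(dists, axis=0)
conflated = product / np.trapz(product, domain)
plt.plot(domain, conflated, 'm', label='Conflated Dist.')
plt.legend()
plt.show()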

How can I apply this clustering algorithm to my own data?

I'd like to replace the iris data with my own data. Please tell me what steps I should follow to do that.
Thanks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report
plt.rc('figure', figsize=(7, 4))
iris = datasets.load_iris()
X = scale(iris.data)
Y = pd.DataFrame(iris.target)
variable_name = iris.feature_names
X[0:10,]
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
Y.columns = ['Targets']
The import section will stay the same.
Let's assume you have a dataframe:
#read your dataframe(several types possible)
df = pd.read_csv('test.csv')
#you need to define a target variable (named target in my case) and the features X
Y = df['target']
X = df.drop(['target'], axis=1)
#here your k-means algorithm gets start
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
Let me add one more thing: what are you using k-means for? It is an unsupervised learning method, so you do not have a target variable.
Normally it should be:
df = pd.read_csv('test.csv')
#columns header you want to use
relevant_columns = ['A', 'B']
X = df[relevant_columns]
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report
# CHANGED CODE START
df = pd.read_excel('tmp.xlsx')
Y = df['target']
X = df.drop(['target'], axis=1)
# CHANGED CODE END
variable_name = X.columns
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
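If it helps, once the model is fitted, the per-row cluster assignments and the cluster centers are available on the estimator:
# Inspect the fitted model: one label per row, one center per cluster
print(clustering.labels_[:10])
print(clustering.cluster_centers_)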

Receiving "ValueError: x and y must be the same size" for x and y values. Any help would be appreciated

I am currently working on a machine learning problem for predicting the weather. While I was running my code in a Jupyter notebook I came across the above error, and I am not sure where I am going wrong, as the values for my data should both be in 2D arrays. Any help would be greatly appreciated. My notebook specifically mentions line 133
axes[row, col]. scatter(df2[feature], df2['meantempm'])
as the problem. If it helps, I am using https://stackabuse.com/using-machine-learning-to-predict-the-weather-part-2/ as my main resource for this.
import jupyter
import IPython
from IPython import get_ipython
from datetime import datetime
from datetime import timedelta
import time
from collections import namedtuple
import pandas as pd
import requests
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error
from sklearn.metrics import explained_variance_score, \
    mean_absolute_error, \
    median_absolute_error
import tensorflow as tf
df = pd.read_csv('end-part2_df.csv').set_index('date')
df.corr()[['meantempm']].sort_values('meantempm')
predictors = ['meantempm_1', 'meantempm_2', 'meantempm_3',
              'mintempm_1', 'mintempm_2', 'mintempm_2',
              'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
              'maxdewptm_1', 'maxdewptm_2', 'maxdewptm_3',
              'mindewptm_1', 'mindewptm_2', 'mindewptm_3',
              'maxtempm_1', 'maxtempm_2', 'maxtempm_3']
df2 = df[['meantempm'] + predictors]
get_ipython().run_line_magic('matplotlib','inline')
plt.rcParams['figure.figsize'] = [16, 22]
fig, axes = plt.subplots(nrows=6, ncols=3, sharey=True)
arr = np.array(predictors).reshape(6, 3)
for row, col_arr in enumerate(arr):
    for col, feature in enumerate(col_arr):
        axes[row, col].scatter(df2[feature], df2['meantempm'])
        if col == 0:
            axes[row, col].set(xlabel=feature, ylabel='meantempm')
        else:
            axes[row, col].set(xlabel=feature)
plt.show()
Your df2['mintempm_2'] selection is 2D, with shape (997, 2): your predictors array includes 'mintempm_2' twice, so df2 ends up with two columns of that name, and scatter receives a two-column x.
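Every other feature appears as _1, _2, _3, so the duplicate was presumably meant to be 'mintempm_3'; with that fixed, each df2[feature] is a single column again:
predictors = ['meantempm_1', 'meantempm_2', 'meantempm_3',
              'mintempm_1', 'mintempm_2', 'mintempm_3',  # was 'mintempm_2' twice
              'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
              'maxdewptm_1', 'maxdewptm_2', 'maxdewptm_3',
              'mindewptm_1', 'mindewptm_2', 'mindewptm_3',
              'maxtempm_1', 'maxtempm_2', 'maxtempm_3']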
