KMeans Clustering of CSV Data Set

I am trying to create a KMeans clustering model based on a csv data set that I have compiled. The data set is organized as such:
population longitude latitude
Atlanta, GA
Austin, TX
...
I tried just plotting the data, which isn't working: it produces a scatter plot where you can't see the axes or the data points, and I can't really tell if the KMeans algorithm is working.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
import pandas as pd
import csv
data = pd.read_csv("data.csv")
print (data.head())
plt.scatter(x=data['Population'].astype(bytes), y=data['Longitude'].astype(bytes), z=data["Latitude"].astype(bytes))
plt.xlim(0,1000000)
plt.ylim(0,5000)
plt.zlim(0,5000)
plt.xlabel('Population')
plt.ylabel('Longitude')
plt.zlabel('Latitude')
plt.title('KMeans Clustering for Population vs. Latitude and Longitude', fontsize = 10)
plt.show()
x = data.iloc[:,1:3] #selecting features
#Clustering
kmeans = KMeans(3)
kmeans.fit(x)
#Clustering Results
indentified_clusters = kmeans.fit_predict(x)
indentified_clusters
array([1,1,0.0,2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Population'],data_with_clusters['Longitude'],data_with_clusters['Latitude']c=data_with_clusters['Clusters'],cmap='rainbow')

Try the following. plt.scatter is two-dimensional (it takes no z argument, and there is no plt.zlim), so plot longitude against latitude and use the cluster label as the colour:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
# Load the data set
data = pd.read_csv("xxx")
print(data.head())
# Plot the raw points; the limits span the full range of valid coordinates
plt.scatter(data['Longitude'],data['Latitude'])
plt.xlim(-180,180)
plt.ylim(-90,90)
plt.show()
x = data.iloc[:,1:3] # first slice selects rows, second selects columns
x
kmeans = KMeans(3)
# fit_predict fits the model and returns one cluster label per row
identified_clusters = kmeans.fit_predict(x)
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_with_clusters['Clusters'],cmap='rainbow')
plt.show()
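If all three variables should appear in one figure, a 3D axes is one option. The following is only a sketch under stated assumptions: the rows are made-up stand-ins for data.csv, the helper name cluster_and_plot_3d is hypothetical, and the features are standardised before clustering because population would otherwise dominate the distance computation.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for data.csv; the values are illustrative only.
data = pd.DataFrame({
    'Population': [500000, 950000, 650000, 8400000, 2700000, 4000000],
    'Longitude': [-84.4, -97.7, -71.1, -74.0, -87.6, -118.2],
    'Latitude': [33.7, 30.3, 42.4, 40.7, 41.9, 34.1],
})


def cluster_and_plot_3d(data, n_clusters=3):
    """Cluster on three columns and draw them on a 3D axes (hypothetical helper)."""
    features = data[['Population', 'Longitude', 'Latitude']]
    # Standardise so that each column contributes comparably to the distances.
    scaled = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(scaled)

    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.scatter(features['Population'], features['Longitude'], features['Latitude'],
               c=labels, cmap='rainbow')
    ax.set_xlabel('Population')
    ax.set_ylabel('Longitude')
    ax.set_zlabel('Latitude')
    return labels, ax


labels, ax = cluster_and_plot_3d(data)
```

Calling plt.show() afterwards displays the figure. Note that the axis labels and limits are set on the axes object (ax.set_zlabel, ax.set_zlim), not on plt.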

Related

How to add anomaly points on the boxplot

I used the EllipticEnvelope method to find the anomalies in the iris dataset as below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
cols = iris.feature_names
X = pd.DataFrame(iris.data, columns=cols)
X.head()
from sklearn.preprocessing import StandardScaler
from sklearn.covariance import EllipticEnvelope
scaler = StandardScaler()
scaler.fit_transform(X)
cov = EllipticEnvelope(store_precision=True,
assume_centered=True,
support_fraction=None,
contamination=0.01,
random_state=0)
cov.fit(X)
X['Anomaly'] = cov.predict(X)
Now you can find the anomalies in the last column with the value -1.
X[X['Anomaly'] == -1]
Now I want to do a root cause analysis to find the source of the anomaly, so I want to plot the anomalies on the boxplot, for example as red dots. Is that possible? If yes, how can I add them?
X.boxplot(column=cols, grid=False, rot=45)
# code to plot anomalies on boxplot
plt.show()
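One possible approach, as a sketch rather than a definitive answer: DataFrame.boxplot draws the boxes at x positions 1, 2, 3, ... in column order, so the anomalous rows can be overlaid with a scatter call per column. The helper name boxplot_with_anomalies is hypothetical, and EllipticEnvelope is used here with its default settings apart from contamination and random_state.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_iris

iris = load_iris()
cols = iris.feature_names
X = pd.DataFrame(iris.data, columns=cols)

# Flag anomalies: fit_predict returns -1 for outliers and 1 for inliers.
cov = EllipticEnvelope(contamination=0.01, random_state=0)
X['Anomaly'] = cov.fit_predict(X[cols])
anomalies = X[X['Anomaly'] == -1]


def boxplot_with_anomalies(frame, anomalies, cols):
    """Draw a boxplot and mark the anomalous rows in red (hypothetical helper)."""
    ax = frame.boxplot(column=cols, grid=False, rot=45)
    # Boxes sit at x = 1, 2, ... in the order given by cols.
    for position, col in enumerate(cols, start=1):
        ax.scatter([position] * len(anomalies), anomalies[col],
                   color='red', zorder=3)
    return ax


ax = boxplot_with_anomalies(X, anomalies, cols)
```

Calling plt.show() afterwards displays the figure; each red dot shows where an anomalous row falls within the distribution of that feature.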

Data mining for machine learning

I am new to data analysis and have run into a problem with an exercise from Kaggle (file 'ENB.csv'). I import my data, compute the correlations, and create a new column in my dataframe that sums my target variables:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import model_selection
from sklearn.model_selection import validation_curve
from sklearn import ensemble
from sklearn import svm
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import VotingClassifier
df = pd.read_csv('ENB.csv')
df.columns= ["relative_compactness","surface_area","wall_area","roof_area","overall_height","orientaion",
"glazing_area","glazing_area_dist","heating_load","cooling_load"]
df.head()
corr =df.corr(method = 'pearson')
plt.figure(figsize = (20,10))
sns.heatmap(df.corr(), annot=True, cmap='Greens');
df['total_charges'] = pd.Series([1]).astype(dtype=float)
df['total_charges'] = df['heating_load'] + df['cooling_load']
I have to create a new variable 'charges_classes' that splits the buildings into 4 distinct classes, labelled 0, 1, 2 and 3, according to the 3 quartiles of the newly created variable. I have searched but cannot find a solution. Can someone help me? Here is what I did:
charge_classes = pd.get_dummies(df['total_charges'])
charge_classes

You could use qcut:
df['charge_classes'] = pd.qcut(df['total_charges'], 4, labels=False)
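To see what qcut does with q=4 and labels=False, here is a minimal sketch on a toy series (the values are made up for illustration): each value is assigned the integer index of the quartile bin it falls into.

```python
import pandas as pd

# Toy stand-in for df['total_charges']; the values are illustrative only.
totals = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name='total_charges')

# Four equal-frequency bins; labels=False returns the bin index 0..3.
classes = pd.qcut(totals, 4, labels=False)
print(classes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

By contrast, pd.get_dummies creates one indicator column per distinct value, which is why it does not produce the four classes.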

How to scatter plot 3 columns

Code is below
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns
df = pd.DataFrame(np.random.rand(10,3), columns=["A", "B","C"])
km = KMeans(n_clusters=3).fit(df)
df['cluster_id'] = km.labels_
test = {0:"Blue", 1:"Red", 2:"Green"}
#sns.scatterplot()
plt.show()
I am trying to plot the clusters without being constrained to one particular pair of x and y columns. The data may have any number of columns; I just want to plot the cluster graph.
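One way to plot clusters for any number of columns is to project the features onto two dimensions first. The sketch below uses PCA for the projection, which is a choice made here and not something the question specifies; the helper name plot_clusters_2d is hypothetical.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def plot_clusters_2d(df, n_clusters=3, random_state=0):
    """Cluster on all columns, then plot a 2D PCA projection (hypothetical helper)."""
    features = df.to_numpy()
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(features)
    # Reduce any number of columns to two principal components for plotting.
    projected = PCA(n_components=2).fit_transform(features)

    fig, ax = plt.subplots()
    ax.scatter(projected[:, 0], projected[:, 1], c=labels, cmap='rainbow')
    ax.set_xlabel('Principal component 1')
    ax.set_ylabel('Principal component 2')
    return labels, projected, ax


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=["A", "B", "C"])
labels, projected, ax = plot_clusters_2d(df)
```

Calling plt.show() afterwards displays the figure. The clustering itself runs on all columns; only the picture is reduced to two dimensions, so clusters that overlap in the plot may still be separated in the full feature space.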

I would like to know how I can apply this clustering algorithm to my own data, please?

I'd like to replace the iris data with my own data. Please tell me what steps to follow to do that.
Thanks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(7,4))
iris = datasets.load_iris()
X = scale(iris.data)
Y = pd.DataFrame(iris.target)
variable_name = iris.feature_names
X[0:10,]
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
iris_df = pd.DataFrame(iris.data)
iris_df.columns=['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
Y.columns = ['Targets']

The import section stays the same.
Let's assume you have a dataframe:
# read your dataframe (several file types are possible)
df = pd.read_csv('test.csv')
# define a target variable (named target in my case) and the features X
Y = df['target']
X = df.drop(['target'], axis=1)
# here the k-means algorithm starts
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
Let me add one more thing: what are you using k-means for? It is an unsupervised learning method, so there is no target variable. What are you trying to do?
Normally it should be:
df = pd.read_csv('test.csv')
# column headers you want to use
relevant_columns = ['A', 'B']
X = df[relevant_columns]
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
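The unsupervised variant can be sketched end to end as follows. The CSV content is made up and read from a string so that the example is self-contained; the column names A and B are the same placeholders as above, and standardising the features first is an optional step that mirrors the scale call in the question.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up CSV content standing in for test.csv.
csv_text = """A,B
1.0,1.0
1.2,0.9
0.8,1.1
10.0,10.0
10.2,9.8
9.9,10.1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Select the feature columns; there is no target column in unsupervised learning.
relevant_columns = ['A', 'B']
X = StandardScaler().fit_transform(df[relevant_columns])

# Fit k-means and attach the cluster label of each row to the dataframe.
clustering = KMeans(n_clusters=2, n_init=10, random_state=5)
df['cluster'] = clustering.fit_predict(X)
print(df)
```

With your own file, replace the io.StringIO object with the file path and pick n_clusters to suit the data.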

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report
# CHANGED CODE START
df = pd.read_excel('tmp.xlsx')
Y = df['target']
X = df.drop(['target'], axis=1)
# CHANGED CODE END
variable_name = X.columns
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)

Unable to get Regression line and the variance bounds in Seaborn pairplot

I am unable to get the regression line and the variance bounds around it when plotting seaborn.pairplot with kind="reg", as shown in the examples at http://seaborn.pydata.org/generated/seaborn.pairplot.html
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Preparing random dataFrame with two columns, viz., random x and lag-1 values
lst1 = list(np.random.rand(10000))
df = pd.DataFrame({'x1':lst1})
df['x2'] = df['x1'].shift(1)
df = df[df['x2'] > 0]
# Plotting now
pplot = sns.pairplot(df, kind="reg")
pplot.set(ylim=(min(df['x1']), max(df['x1'])))
pplot.set(xlim=(min(df['x1']), max(df['x1'])))
plt.show()

The regression line is there, you just don't see it, because it's hidden by the unnaturally high number of points in the plot.
So let's reduce the number of points and you'll see the regression as expected.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Preparing random dataFrame with two columns, viz., random x and lag-1 values
lst1 = list(np.random.rand(100))
df = pd.DataFrame({'x1':lst1})
df['x2'] = df['x1'].shift(1)
df = df[df['x2'] > 0]
# Plotting now
pplot = sns.pairplot(df, kind="reg")
plt.show()
