K-Means for outlier detection - Python

I have a dataset that consists of 13 columns and about 10 million rows. Part of my project is to use Isolation Forest, Elliptic Envelope, and K-Means to detect and remove outliers. I'm trying to use K-Means, but every time I run the code nothing happens to the CSV file. Am I doing something wrong?
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\5-FINAL2\\Final After Simple Filtering.csv')
KMEAN = KMeans(n_clusters=100)
df['anomaly'] = KMEAN.fit_predict(df)
df = df[df['anomaly'] != -1]
del df['anomaly']
df.to_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\K TEST.csv', index=False)
Thank you.
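One likely reason nothing changes: KMeans.fit_predict returns cluster labels from 0 to n_clusters - 1 and never -1; only estimators such as IsolationForest and EllipticEnvelope mark outliers with -1, so the filter df['anomaly'] != -1 keeps every row and the output CSV is identical to the input. Below is a minimal sketch of one common alternative, flagging points that lie far from their assigned centroid; it assumes all 13 columns are numeric, and the file names and the 95th-percentile cutoff are illustrative placeholders, not from the original post.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# 'input.csv' and 'output.csv' are placeholder file names.
df = pd.read_csv('input.csv')

kmeans = KMeans(n_clusters=100)
labels = kmeans.fit_predict(df)

# Distance of each row to the centroid of its assigned cluster.
distances = np.linalg.norm(df.to_numpy() - kmeans.cluster_centers_[labels], axis=1)

# Treat the farthest 5% of points as outliers; the cutoff is a judgment call.
threshold = np.quantile(distances, 0.95)
df_clean = df[distances <= threshold]
df_clean.to_csv('output.csv', index=False)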

Related

Grouping clusters based on one feature column

I have not clustered data in a while, and at the moment I have a massive list of accounts with their respective areas (or OUs in the table below).
I have used k-means and k-modes to try to cluster based on OU, meaning that I want the output to group the 17 OUs I have and cluster them based on the provided information. So far the output has clustered each record individually rather than each OU. Can someone help me figure out how to group the data and then cluster it? Below is a sample of the code used.
from kmodes.kmodes import KModes

# Building the model with 3 clusters
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
clusters = kmode.fit_predict(df)
clusters

# Insert the predicted cluster values into our original dataset.
df.insert(0, "Cluster", clusters, True)
df.head(10)
I don't have access to your data set, but below is a generic example of how to do clustering.
# Cluster analysis, or clustering, is an unsupervised machine learning task.
# It involves automatically discovering natural groupings in data. Unlike supervised
# learning (such as predictive modeling), clustering algorithms only interpret the
# input data and find natural groups, or clusters, in feature space.
import statsmodels.api as sm
import pandas as pd
from sklearn.cluster import KMeans
from matplotlib import pyplot

# Load the mtcars dataset from R's 'datasets' package.
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()

# Define the dataset: two numeric features (copy to avoid a SettingWithCopyWarning later).
X = df_cars[['mpg', 'hp']].copy()

# Define the model
model = KMeans(n_clusters=8)

# Fit the model
model.fit(X)

# Assign a cluster to each example
yhat = model.predict(X)
X['kmeans'] = yhat

# Plot the two features, colored by cluster assignment
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
pyplot.show()
See the link below for more details.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
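For the original question of clustering at the OU level rather than per record, one option (my suggestion, not from the post) is to aggregate the records per OU first and then cluster the aggregated table. A minimal sketch, assuming a column named 'OU' and numeric feature columns; all column names and values below are placeholders.
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical per-account data; 'OU', 'logins' and 'tickets' are placeholder names.
df = pd.DataFrame({
    'OU': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
    'logins': [10, 12, 50, 45, 5, 7],
    'tickets': [1, 2, 9, 8, 0, 1],
})

# Aggregate to one row per OU, then cluster those rows (17 OUs in the real data).
ou_features = df.groupby('OU').mean(numeric_only=True)
ou_features['Cluster'] = KMeans(n_clusters=2, n_init=10).fit_predict(ou_features)
print(ou_features)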

Silhouette score returns inconsistent number of samples

I am using scikit-learn's silhouette_score with hierarchical clustering. I am not from a data science or Python background; however, I do know some other languages and understand how hierarchical clustering works. I was told to use scikit-learn's silhouette_score to calculate the silhouette score. This code returns an error of
ValueError: Found input variables with inconsistent numbers of samples: [149, 150]
The data is a CSV file containing 151 rows, with the first row as the header. So in total there are 150 data points.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.metrics import silhouette_score
iris = pd.read_csv("Iris.csv")
# Iris hierarchical clustering
iris_df = iris.iloc[:, 1:5]
plt.figure(figsize=(10, 7))
plt.title("Iris Dendrograms Average Method")
link = linkage(iris_df, method='average')
dend = dendrogram(link)
plt.show()
clusters = fcluster(link, 3, criterion='maxclust')
print(silhouette_score(link, clusters))
You've got a problem here:
print(silhouette_score(link, clusters))
link is the linkage matrix produced by linkage(), which has n - 1 = 149 rows, while clusters holds one label per sample (150), hence the inconsistent-samples error.
Change it to pass the feature matrix (iris_df in your code) instead of the linkage matrix, and you're good to go:
print(silhouette_score(iris_df, clusters))
Please see docs for silhouette_score:
X: array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise.
Array of pairwise distances between samples, or a feature array.
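Putting it together with the variable names from the question, the end of the script would read roughly like this (a sketch, not part of the original answer):
clusters = fcluster(link, 3, criterion='maxclust')

# Score against the feature matrix (150 rows), not the linkage matrix (149 rows).
print(silhouette_score(iris_df, clusters))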

Issue with Scikit-learn data analysis

I am attempting to take a .dat file of about 90,000 data lines with two variables (wavelength and intensity) and apply sklearn's PCA to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using to analyze the data is below:
pca= PCA(n_components=2)
pca.fit(data)
print(pca.components_)
The error I get when I try to apply 2 PCA components to one of the data sets is:
ValueError: Datatype coercion is not allowed
Any help resolving this would be much appreciated.
I think in your case, the problem is the column name, especially [W/m**2/um/sr].
Also when using PCA, do not forget to rescale the input variables into "comparable" units using StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = pd.DataFrame({
    'wavelength [um]': [196.078431372549, 192.307692307692, 188.679245283019],
    'intensity [W/m**2/um/sr]': [1.108370393265022E-003, 1.163428008597600E-003, 1.223639983609668E-003],
})

scaler = StandardScaler(with_mean=True, with_std=True)
pca = PCA(n_components=2)
pca.fit(scaler.fit_transform(data))
print(pca.components_)
Worked well for me. Maybe you just need to specify:
data.columns = data.columns.astype(str)
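If the problem is reading the .dat file itself, here is a sketch of loading a whitespace-separated file with a name row and a units row before running PCA. 'spectrum.dat' is a placeholder file name, and skiprows=[1] assumes the units sit on the second line.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Skip the units row so both columns are parsed as floats.
data = pd.read_csv('spectrum.dat', sep=r'\s+', skiprows=[1])
data.columns = data.columns.astype(str)

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(data))
print(pca.components_)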

How to input Twitter data (CSV/TXT) into DBSCAN in Python?

Could someone guide me on how I could cluster Twitter data using DBSCAN in Python? I am totally new to DBSCAN. Also, how do I determine the eps value and the iloc or loc values?
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
def clusterEvaluate(cluster):
    count_cluster = np.bincount(cluster)
    count_cluster = np.argmax(count_cluster)
    same_clusters = np.count_nonzero(cluster == count_cluster) / np.size(cluster)
    return same_clusters

dataset = np.loadtxt('tweetdata.csv')  # not sure if this works
X = StandardScaler().fit_transform(dataset)
y_valid = dataset.iloc[:6].values()
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
y = dbscan.fit_predict(X)
cluster_labels = np.unique(y)
same_clusters = []
i = 0
for index in cluster_labels:
    cluster = y_valid[y == index]
    same_clusters.insert((i, clusterEvaluate(cluster)))
You need to choose an appropriate data representation and distance function for this. Furthermore, scalability will kill you.
I do not think it will work well. I have not seen anything that gives insightful results beyond counting frequent words in an unnecessarily complex fashion. Twitter data is hard: the messages are just too short. All the good approaches, like LDA, need much longer documents.
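That said, if you still want to try it, a common starting point for short texts is a TF-IDF representation with cosine distance. A minimal sketch follows; the file name 'tweetdata.csv' and the 'text' column are placeholders for the real layout, and eps is only a starting value that needs tuning (e.g. with a k-distance plot).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Placeholder file and column names; adjust to the real data layout.
tweets = pd.read_csv('tweetdata.csv')['text'].astype(str)

# TF-IDF turns each tweet into a sparse vector of word weights.
X = TfidfVectorizer(stop_words='english', min_df=2).fit_transform(tweets)

# Cosine distance handles sparse text vectors better than Euclidean.
labels = DBSCAN(eps=0.7, min_samples=5, metric='cosine').fit_predict(X)
print(pd.Series(labels).value_counts())  # -1 marks noise points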

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?

I have used the following code to convert the scikit-learn breast cancer dataset to a DataFrame, but I am not getting any output. I am very new to Python and not able to figure out what is wrong.
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = numpy.c_[cancer.data, cancer.target]
    columns = numpy.append(cancer.feature_names, ["target"])
    return pandas.DataFrame(data, columns=columns)

answer_one()
Use pandas
There was a great answer here: How to convert a Scikit-learn dataset to a Pandas dataset?
The keys in the Bunch object give you an idea of which data you want to make columns for.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = pd.Series(cancer.target)
The following code works
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = np.c_[cancer.data, cancer.target]
    columns = np.append(cancer.feature_names, ["target"])
    return pd.DataFrame(data, columns=columns)

answer_one()
The reason your code didn't work is that you refer to the numpy and pandas packages by their full names after importing them as np and pd, respectively.
However, I suggest doing the package imports at the beginning of the script, outside the function definition.
As of scikit-learn 0.23 you can do the following to get a DataFrame and save some keystrokes:
data = load_breast_cancer(as_frame=True)
df = data.frame  # as_frame=True returns a Bunch whose .frame attribute is the DataFrame
Alternatively, build the DataFrame manually inside answer_one():
    dataframe = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
    dataframe['target'] = cancer.target
    return dataframe
