Data mining for machine learning

Data mining for machine learning - python

I am new to data analysis and I am stuck on an exercise that uses a dataset from Kaggle (file 'ENB.csv'). I import my data, compute the correlations, and create a new column in my dataframe that sums my two target variables:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import model_selection
from sklearn.model_selection import validation_curve
from sklearn import ensemble
from sklearn import svm
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import VotingClassifier
df = pd.read_csv('ENB.csv')
df.columns= ["relative_compactness","surface_area","wall_area","roof_area","overall_height","orientation",
"glazing_area","glazing_area_dist","heating_load","cooling_load"]
df.head()
corr = df.corr(method='pearson')
plt.figure(figsize=(20, 10))
sns.heatmap(corr, annot=True, cmap='Greens')
# sum the two target variables into a single column
df['total_charges'] = df['heating_load'] + df['cooling_load']
I have to create a new variable 'charges_classes' that splits the buildings into 4 distinct classes labelled 0, 1, 2, 3 according to the 3 quantiles of the newly created variable. I have searched but I cannot find a solution. Can someone help? Here is what I tried:
charge_classes = pd.get_dummies(df['total_charges'])
charge_classes

You could use qcut:
df['charge_classes'] = pd.qcut(df['total_charges'], 4, labels=False)
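
As a minimal sketch of what this does, the snippet below applies qcut to made-up values standing in for 'total_charges' (the random data and the class_counts name are assumptions for illustration, not part of the exercise). With q=4 the cut points are the 25%, 50% and 75% quantiles, and labels=False returns the integer codes 0 to 3 instead of interval labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 'total_charges' column (made-up values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"total_charges": rng.uniform(10, 90, size=200)})

# q=4 cuts at the three quartiles; labels=False yields integer classes 0..3.
df["charge_classes"] = pd.qcut(df["total_charges"], q=4, labels=False)

# Each class should hold roughly a quarter of the rows.
class_counts = df["charge_classes"].value_counts().sort_index()
print(class_counts)
```

If many rows share the same value, the quantile edges can coincide and qcut raises an error by default; passing duplicates="drop" is one way around that, at the cost of fewer classes.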

Related

How to plot a column separated by date into 12 monthly bars?

I have a dataframe containing hotel prices separated by the date of the listing and
I would like to plot the median of those prices each in a monthly bar.
So I first want to group the dates by month, then calculate the median for each month, and then plot the medians in a bar chart.
Can you please show me how to do that? (Python beginner here)
Thank you in advance.
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None # default='warn'
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
calendar_df = pd.read_csv("calendar.csv")
calendar_df.head()
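
One possible approach, sketched under assumptions: the column names 'date' and 'price' and the sample rows below are hypothetical, since the layout of calendar.csv is not shown, and the prices are assumed to be numeric already. The idea is to parse the dates, group by the month number, take the median, and draw one bar per month.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for calendar.csv; 'date' and 'price' are assumed names.
calendar_df = pd.DataFrame({
    "date": ["2023-01-05", "2023-01-20", "2023-02-10", "2023-02-11", "2023-03-01"],
    "price": [100.0, 120.0, 90.0, 110.0, 150.0],
})

# Parse the strings into datetimes so the month can be extracted.
calendar_df["date"] = pd.to_datetime(calendar_df["date"])

# Group by month number (1..12) and take the median price of each group.
monthly_median = calendar_df.groupby(calendar_df["date"].dt.month)["price"].median()

# One bar per month that appears in the data.
fig, ax = plt.subplots()
ax.bar(monthly_median.index, monthly_median.values)
ax.set_xlabel("Month")
ax.set_ylabel("Median price")
ax.set_title("Median price per month")
# plt.show()  # uncomment to display the chart
```

If the price column is stored as text (for example with a currency symbol), it would need to be converted to a number before taking the median.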

KMeans Clustering of CSV Data Set

I am trying to create a KMeans clustering model based on a csv data set that I have compiled. The data set is organized as such:
population longitude latitude
Atlanta, GA
Austin, TX
...
I tried just plotting the data, which isn't working: it produces a scatter plot where you can't see the axes or the data points, and I can't really tell if the KMeans algorithm is working.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
import pandas as pd
import csv
data = pd.read_csv("data.csv")
print (data.head())
plt.scatter(x=data['Population'].astype(bytes), y=data['Longitude'].astype(bytes), z=data["Latitude"].astype(bytes))
plt.xlim(0,1000000)
plt.ylim(0,5000)
plt.zlim(0,5000)
plt.xlabel('Population')
plt.ylabel('Longitude')
plt.zlabel('Latitude')
plt.title('KMeans Clustering for Population vs. Latitude and Longitude', fontsize = 10)
plt.show()
x = data.iloc[:,1:3] #selecting features
#Clustering
kmeans = KMeans(3)
kmeans.fit(x)
#Clustering Results
identified_clusters = kmeans.fit_predict(x)
identified_clusters
# Output: array([1,1,0.0,2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Population'],data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_with_clusters['Clusters'],cmap='rainbow')

Try the following:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
# load the data set
data= pd.read_csv("xxx")
print(data.head())
plt.scatter(data['Longitude'],data['Latitude'])
plt.xlim(-180,180)
plt.ylim(-90,90)
plt.show()
x = data.iloc[:,1:3] # first slice selects rows, second selects columns
x
kmeans = KMeans(3)
# fit the model and get the cluster label of each row
identified_clusters = kmeans.fit_predict(x)
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_with_clusters['Clusters'],cmap='rainbow')
plt.show()
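
For a version that runs without the CSV file, here is a small sketch on made-up coordinates (the values are invented, and selecting the columns by name instead of by position is a choice of this sketch, not something from the answer). It shows the step that produces the cluster labels and attaches them to a copy of the data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up coordinates forming three loose groups (illustration only).
data = pd.DataFrame({
    "Longitude": [-84.4, -84.0, -83.9, -97.7, -97.5, -98.0, -122.4, -122.0, -121.9],
    "Latitude": [33.7, 33.9, 34.0, 30.3, 30.5, 29.9, 37.8, 37.5, 37.3],
})

# Select the features by name so the column order of the file does not matter.
x = data[["Longitude", "Latitude"]]

# fit_predict fits the model and returns one cluster label per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
identified_clusters = kmeans.fit_predict(x)

data_with_clusters = data.copy()
data_with_clusters["Clusters"] = identified_clusters
print(data_with_clusters)
```

The resulting 'Clusters' column can be passed as the c argument of plt.scatter, as in the plotting call above.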

select subset for regression

I have the following code that I want to use. Column 0 is the year (1950-2020) and the rest of the columns are months. I only want to use the data from 1979-2020 in my linear regression model.
Can you help me? I am quite a beginner in Python. Below is my code:
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
data1 = pd.read_csv (r'C:\Users\User-PC\sample.csv')
x1 = pd.DataFrame(data1,columns=['Year','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
#data2 = pd.read_csv (r'C:\Users\User-PC\sample2.csv', parse_dates=[0], index_col=0)
#x2 = pd.DataFrame(data2,columns=['Year','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.plot(x1['Year'], x1['Jan'], color='green')
plt.title('Model 1')
plt.xlabel('Year')
plt.ylabel('index')
plt.show()

You can filter your dataframe (x1 in the question) by year before applying linear regression:
new_df = x1[x1['Year'].between(1979, 2020, inclusive="both")]
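
To show the filter and the regression together, here is a minimal sketch on synthetic data (the values, and the choice to regress 'Jan' on 'Year', are assumptions for illustration; the question does not say which columns are the predictor and the response). It uses between with its default bounds, which include both end years.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for sample.csv: one row per year, one made-up month column.
years = np.arange(1950, 2021)
df = pd.DataFrame({"Year": years, "Jan": 0.02 * (years - 1950) + 1.0})

# Keep only 1979..2020; both end points are included by default.
subset = df[df["Year"].between(1979, 2020)]

# Fit on the filtered rows only. sklearn expects a 2-D feature table,
# hence the double brackets around 'Year'.
model = LinearRegression()
model.fit(subset[["Year"]], subset["Jan"])
print(len(subset), model.coef_[0])
```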

how i know the accuracy of the Kmeans?

from datetime import time
from numpy import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.read_csv( "spam.csv" )
feature_col=['total']
X=df[feature_col]
y=df.target
clf_km=KMeans(n_clusters=1)
clf_km=clf_km.fit(X)
clf_km.cluster_centers_
clf_km.labels_
I am trying to implement KMeans clustering, but I don't know how to plot the original classes and the new clusters created by KMeans. I want one scatter plot for the original classes from the CSV file and another for the new clusters.
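
On the "accuracy" part of the question: cluster numbers from KMeans are arbitrary, so comparing them one-to-one with the original labels is usually not meaningful. One common alternative, sketched here on synthetic data rather than spam.csv, is the adjusted Rand index from sklearn.metrics, which measures agreement between two labelings regardless of how the clusters are numbered. This is a swapped-in technique, not something the question's code already does, and the blob centers below are made up.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic, well-separated groups with known labels y (illustration only).
X, y = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Use as many clusters as there are known classes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Values near 1 mean the clusters match the known classes closely;
# values near 0 mean the agreement is about what chance would give.
ari = adjusted_rand_score(y, labels)
print(ari)
```

For the two plots, the same scatter call can be drawn twice, once with c=y and once with c=labels.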

I would like to know how i can apply this clustering algorithm on my own data please?

I'd like to replace the iris data with my own data. Please tell me what steps to follow to do that.
Thanks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(7,4))
iris = datasets.load_iris()
X = scale(iris.data)
Y = pd.DataFrame(iris.target)
variable_name = iris.feature_names
X[0:10,]
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
iris_df = pd.DataFrame(iris.data)
iris_df.columns=['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
Y.columns = ['Targets']

The import section will stay the same.
Let's assume you have a dataframe:
#read your dataframe (several file types are possible)
df = pd.read_csv('test.csv')
#you need to define a target variable (named target in my case) and the features X
Y = df['target']
X = df.drop(['target'], axis=1)
#here your k-means algorithm starts
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
Let me add one more thing: what are you using k-means for? It is an unsupervised learning method, so there is no target variable involved.
Normally it should be:
df = pd.read_csv('test.csv')
#columns header you want to use
relevant_columns = ['A', 'B']
X = df[relevant_columns]
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report
# CHANGED CODE START
df = pd.read_excel('tmp.xlsx')
Y = df['target']
X = df.drop(['target'], axis=1)
# CHANGED CODE END
variable_name = X.columns
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
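
The iris example scales the features before clustering, and the same step may be worth keeping for your own data. Below is a minimal sketch under assumptions: the in-memory dataframe replaces the file read, the column names 'A' and 'B' follow the earlier answer, and the values are made up. StandardScaler is used here in place of scale; both standardize each column to zero mean and unit variance.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data in place of pd.read_csv(...) or pd.read_excel(...).
df = pd.DataFrame({
    "A": [1.0, 1.2, 0.9, 8.0, 8.3, 7.9, 4.0, 4.2, 3.9],
    "B": [100, 110, 95, 900, 950, 880, 500, 520, 480],
})

# Columns to cluster on; no target column is needed for k-means.
relevant_columns = ["A", "B"]

# Standardize so that the larger-valued column B does not dominate the distances.
X = StandardScaler().fit_transform(df[relevant_columns])

clustering = KMeans(n_clusters=3, n_init=10, random_state=5)
df["cluster"] = clustering.fit_predict(X)
print(df)
```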
