Cannot get good accuracy from sklearn MLP classifier - python
I have been given several years' worth of data for ozone, NO, NO2 and CO to work with. The task is to use this data to predict the ozone value. Suppose I have data for the years 2015, 2016, 2018 and 2019; I need to predict the 2019 ozone values using the 2015, 2016 and 2018 data that I already have.
The data is recorded hourly and is organized month by month (the format is shown in the attached image).
What I have done: first I put all the years' data into one Excel file with four columns (NO, NO2, CO, O3), adding the data month by month. This is the master file that has been used (attached image).
I used Python. First the data had to be cleaned. Let me explain a bit: NO, NO2 and CO are precursors of ozone, which means that ozone formation depends on these gases, so the data has to be cleaned beforehand. The constraints were to remove any negative value and to drop that whole row (including the other columns), so if any of the O3, NO, NO2 or CO values is invalid the entire row is discarded and not counted. The data also contained some entries in string format, which had to be removed as well. All of that was done. Then I applied the MLPClassifier from scikit-learn. Here is the code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neural_network import MLPClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
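# non-numeric tokens found in the raw export; they are replaced with 0 below and those rows are removed by the later filters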
bugs = ['NOx', '* 43.3', '* 312', '11/19', '11/28', '06:00', '09/30', '09/04', '14:00', '06/25', '07:00', '06/02',
'17:00', '04/10', '04/17', '18:00', '02/26', '02/03', '01:00', '11/23', '15:00', '11/12', '24:00', '09/02',
'16:00', '09/28', '* 16.8', '* 121', '12:00', '06/24', '13:00', '06/26', 'Span', 'NoData', 'ppb', 'Zero',
'Samp<', 'RS232']
dataset = pd.read_excel("Testing.xlsx")
dataset = pd.DataFrame(dataset).replace(bugs, 0)
dataset.dropna(subset=["O3"], inplace=True)
dataset.dropna(subset=["NO"], inplace=True)
dataset.dropna(subset=["NO2"], inplace=True)
dataset.dropna(subset=["CO"], inplace=True)
dataset.drop(dataset[dataset['O3'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['O3'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['O3'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['CO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['CO'] > 4000].index, inplace=True)
dataset.drop(dataset[dataset['CO'] == 0].index, inplace=True)
dataset = dataset.reset_index()
dataset = dataset.drop(['index'], axis=1)
feat = dataset[["NO", "NO2", "CO"]].astype(int)
label = dataset[["O3"]].astype(int)
X_train, X_test, y_train, y_test = train_test_split(feat, label, test_size=0.1)
# X_train = dataset.iloc[0:9200, 0:3].values.astype(int)
# y_train = dataset.iloc[0:9200, 3].values.astype(int)
# X_test = dataset.iloc[9200:9393, 0:3].values.astype(int)
# y_test = dataset.iloc[9200:9393, 3].values.astype(int)
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test)  # apply the scaler fitted on the training data
def accuracy(confusion_matrix):
diagonal_sum = confusion_matrix.trace()
sum_of_all_elements = confusion_matrix.sum()
return diagonal_sum / sum_of_all_elements
classifier = MLPClassifier(hidden_layer_sizes=(250, 100, 10), max_iter=100000, activation='relu', solver='adam',
random_state=1)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
print(f"\n{X_test}\n ----> \nPredictions : \n{y_pred}\n{y_pred.shape}\n")
cm = confusion_matrix(y_pred, y_test)
print(f"\nAccuracy of MLP.Cl : {accuracy(cm)}\n")
print(accuracy_score(y_test, y_pred))
y_test = pd.DataFrame(y_test)
y_test = y_test.reset_index(0)
y_test = y_test.drop(['index'], axis=1)
y_test = y_test.head(100)
# y_test = y_test.drop([19,20],axis=0)
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred.shift(-1)
y_pred = y_pred.head(100)
# y_pred = y_pred.drop([19,20],axis=0)
plt.figure(figsize=(10, 5))
plt.plot(y_pred, color='r', label='PredictedO3')
plt.plot(y_test, color='g', label='OriginalO3')
plt.legend()
plt.show()
This is the code.
I am attaching the plot here.
Console output (PyDev console):
[[-0.53939794 -0.59019756 -0.53257553]
[ 2.55576818 0.45245455 -0.7648624 ]
[-0.36744427 0.73681421 -0.30028866]
...
[-0.59671583 -0.02147823 1.81678204]
[-0.25280849 0.73681421 1.31145621]
[-0.53939794 0.64202766 0.18466113]]
---->
Predictions :
[15 39 45 40 42 11 14 32 23 23 21 23 3 15 23 59 15 34 12 10 42 23 12 8
14 3 8 42 12 61 36 13 11 20 12 10 14 42 12 20 9 5 14 11 20 14 10 85
42 73 43 23 61 85 55 13 14 20 85 32 15 15 42 42 12 23 13 23 85 8 23 11
36 32 20 12 27 35 55 17 15 23 12 44 42 17 23 45 35 23 3 11 23 12 60 11
15 39 15 44 49 7 35 42 45 13 12 55 42 18 42 6 23 14 60 43 16 18 10 43
85 20 23 88 8 20 26 23 53 45 16 4 48 27 3 61 15 7 23 6 40 12 44 12
12 4 12 13 24 24 23 15 16 13 40 12 12 10 12 15 53 12 42 45 38 23 45 17
12 30 12 45 60 65 12 52 4 35 3 15 11 23 40 42 18 23 45 45 49 43 35 62
46 14 21 11 6 24 23 16 23 21 45 42 85 39 12 16 10 38 43 6 23 20 11 65
14 14 14 45 24 18 85 60 15 10 16 14 23 10 17 6 13 42 4 7 17 51 23 3
85 42 23 55 21 15 32 14 17 12 42 18 16 8 6 10 14 12 42 15 14 43 25 12
14 15 85 20 42 23 43 32 18 12 42 35 6 47 12 20 12 6 51 8 20 45 40 43
12 14 44 23 23 21 15 45 24 12 23 23 42 15 12 46 35 8 14 16 42 11 42 16
13 61 60 25 26 16 45 10 17 5 43 21 26 12 49 12 42 11 38 48 21 45 9 48
11 20 13 23 16 21 11 12 44 55 11 16 53 45 8 17 12 9 85 56 7 23 23 26
12 42 42 51 17 23 43 52 24 12 29 11 21 42 16 6 20 18 16 8 14 15 13 43
10 23 16 15 42 43 23 11 14 25 47 16 24 14 7 43 45 14 5 18 51 42 20 15
39 32 12 44 13 51 12 43 42 23 42 17 11 12 11 42 12 5 35 51 23 51 14 9
11 34 18 21 88 21 15 15 6 49 12 51 8 12 49 8 4 17 15 6 26 3 15 43
14 5 23 15 88 21 85 11 23 25 45 14 12 65 45 27 48 42 12 14 44 45 4 44
40 16 23 25 15 10 20 12 15 62 6 13 20 20 11 56 12 40 11 14 25 6 25 12
40 85 40 85 43 11 14 32 11 8 6 8 23 12 26 18 60 18 51 40 13 51 12 8
23 45 20 4 23 11 3 12 51 11 18 12 40 14 40 7 85 44 60 85 45 14 14 14
11 55 18 16 45 13 23 51 11 14 23 18 14 7 40 23 15 32 12 12 23 42 49 88
11 11 42 6 25 12 6 11 18 6 13 35 8 15 42 39 23 9 23 32 20 21 12 20
20 38 7 12 42 8 13 17 55 60 16 39 18 42 42 12 60 14 16 40 9 18 85 40
5 14 23 45 10 24 14 25 11 17 15 42 42 15 23 15 8 34 16 60 42 14 48 51
11 6 51 15 42 12 42 20 12 25 26 25 45 26 40 48 23 45 23 21 11 17 48 12
12 6 15 34 10 16 18 17 13 20 45 3 9 39 12 11 15 23 42 45 45 65 51 6
45 15 15 17 51 8 51 34 14 17 13 38 38 21 18 51 55 16 9 44 42 6 42 17
6 25 88 11 10 48 20 40 21 12 44 27 47 42 38 15 49 12 12 12 6 12 8 16
42 9 20 18 23 18 12 13 20 16 14 12 23 10 60 18 25 23 43 21 12 12 10 61
21 40 6 16 45 38 12 17 12 15 32 9 38 17 14 11 6 15 14 6 48 21 13 13
15 36 3 45 25 29 24 16 8 10 27 21 20 51 10 16 21 12 20 23 46 23 3 34
29 15 23 15 48 42 17 42 43 15 35 34 23 23 44 23 4 35 12 42 49 36 15 18
15 14 11 18 16 20 15 25 9 43 51 45 12 15 39 21 51 18 24 26 17 9 42 44
12 30 32 8 20 44 52 20 23 23 15 12 12 42 8 5 42 23 21 16 24 65 16 12
38 36 43 60 15 7 85 15 26 42 40 11 12 23 12 20 40 23 42 6 23 52 16 20
23 45 51 9 42 42 25 6 21 23 15 8 12 12 26 11 16 15 39 8 26 43 48 47
12 48 12 11]
(940,)
and
Accuracy of MLP.Cl : 0.0425531914893617
0.0425531914893617
I can't get the right results, or rather, the right predictions.
You are trying to predict a continuous value, which is a regression problem, not a classification one; consequently, MLPClassifier is the wrong model to apply here - the correct one being an MLPRegressor.
On top of this, accuracy is meaningful for classification problems only, and it is meaningless in regression ones, like yours here; so, after switching to the correct model, you should also use some other performance metric suitable for regression problems.
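For illustration, here is a minimal sketch of that switch, reusing the feat and label frames from the question; the layer sizes and the choice of MAE and R² as metrics are illustrative assumptions, not a tuned setup:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, r2_score
# same features/target as in the question, assumed already cleaned
X_train, X_test, y_train, y_test = train_test_split(feat, label, test_size=0.1, random_state=1)
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test)  # apply the scaler fitted on the training data
reg = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=10000, random_state=1)
reg.fit(X_train, y_train.values.ravel())
y_pred = reg.predict(X_test)
# regression metrics instead of accuracy / confusion matrix
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

Any regression metric (MAE, RMSE, R²) will do; the point is that accuracy and confusion matrices simply do not apply to continuous targets.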
Related
Pandas line plot without transposing the dataframe
I have a pandas dataframe which looks like this -

   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec     Marks
0   30   31   29   15   30   30   30   50   30   30   30   26  Student1
1   45   45   45   45   41   45   35   45   45   45   37   45  Student2
2   21   11   21   21   21   21   21   21   21   21   17   21  Student3
3   30   30   33   30   30   30   50   30   30   30   22   30  Student4
4   39   34   34   34   34   34   23   34   40   34   34   34  Student5
5   41   41   41   28   41   56   41   41   41   41   41   41  Student6

If I transpose the data like below, I am able to plot a line graph

   Marks  Student1  Student2  Student3  Student4  Student5  Student6
0    Jan        30        45        21        30        39        41
1    Feb        31        45        11        30        34        41
2    Mar        29        45        21        33        34        41
3    Apr        15        45        21        30        34        28
4    May        30        41        21        30        34        41
5    Jun        30        45        21        30        34        56
6    Jul        30        35        21        50        23        41
7    Aug        50        45        21        30        34        41
8    Sep        30        45        21        30        40        41
9    Oct        30        45        21        30        34        41
10   Nov        30        37        17        22        34        41
11   Dec        26        45        21        30        34        41

However, my original data is huge, and transposing it is taking too long. Is there some other way to achieve this? Please note - this is just a dummy dataframe I created for the sake of simplicity; my original data is quite complex and huge.
If your data is huge, you're probably not going to be able to see anything on the line plot anyway...

import matplotlib.pyplot as plt
import pandas as pd
from io import StringIO
import numpy as np

df = pd.read_table(StringIO("""
   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Marks
0   30   31   29   15   30   30   30   50   30   30   30   26  Student1
1   45   45   45   45   41   45   35   45   45   45   37   45  Student2
2   21   11   21   21   21   21   21   21   21   21   17   21  Student3
3   30   30   33   30   30   30   50   30   30   30   22   30  Student4
4   39   34   34   34   34   34   23   34   40   34   34   34  Student5
5   41   41   41   28   41   56   41   41   41   41   41   41  Student6"""), sep='\s+')

x = df.columns.tolist()[:-1]
y = df.iloc[:, :-1].values
for i, j in enumerate(y):
    plt.plot(x, j, label=df['Marks'].iloc[i])
plt.ylim(bottom=0)
plt.legend(loc='upper right')
Split several columns by "space" pandas
I want to split my data frame by "space" for all columns. I can do it for 1 column. How to apply it to the whole data? (maybe with a loop)

df =
    0      1      2      4
11 22  12 22  13 22  14 22
15 16  17 18  33 44  22 55
19 20  21 22  66 55  33 66
23 24  25 26  22 44  66 44

I am splitting it like:

df[0].str.split(' ', 1, expand=True)

Output is:

 0   1
11  22
15  16
19  20
23  24
You can stack and unstack:

df.stack().str.split(' ', expand=True).unstack()

Output:

        0               1
    0   1   2   4   0   1   2   4
0  11  12  13  14  22  22  22  22
1  15  17  33  22  16  18  44  55
2  19  21  66  33  20  22  55  66
3  23  25  22  66  24  26  44  44
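Since the question mentions a loop, a more explicit alternative is sketched below; the _a/_b suffixes are just arbitrary names for the two halves of each cell:

import pandas as pd

parts = []
for col in df.columns:
    split_col = df[col].str.split(' ', expand=True)   # split each "x y" cell into two columns
    split_col.columns = [f"{col}_a", f"{col}_b"]      # arbitrary labels for the two halves
    parts.append(split_col)
result = pd.concat(parts, axis=1)
print(result)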
Why does OpenCV matchShape() match "X" with "N"?
I am comparing the contours of letters and have several cases of unexpected results. The most confusing to me is how X and N are being identified as best matches. In the images below, yellow represents the unknown shape and blue represents candidate shapes. The white numbers are the result returned by cv.matchShapes using CONTOURS_MATCH_I3. (I've tried the other matching methods and just get similar odd results, but with a different set of letters.)

Below shows X matching N better than X.

Below shows N matching X better than N.

At the end of the post are the raw data, and below is a chart of the data. I can't come up with a rotation, scale, or skew to show that this is an optical illusion. I'm not suggesting there is an issue in matchShapes but rather an issue in my understanding of Hu moments. I'd appreciate if someone would take a moment (pun intended) and explain how cv.matchShapes is producing these results.

--- edited ----

The images below are the result of using poly-filled shapes. I am still baffled how these letters match better than the correct ones.

target_letter
33 23 32 24 30 24 28 26 28 30 29 31 29 32 31 34 31 35 33 37 33 38 36 41 36 42 38 44 38 47 35 50 35 51 33 53 33 54 30 57 30 58 28 60 28 61 27 62 27 67 29 69 34 69 38 65 38 64 40 62 40 61 42 59 42 58 46 54 47 54 49 56 49 57 51 59 51 60 53 62 53 63 56 66 56 67 58 69 63 69 65 67 65 60 63 58 63 57 60 54 60 53 58 51 58 50 55 47 55 44 57 42 57 41 61 37 61 36 64 33 64 32 65 31 65 25 64 24 62 24 61 23 60 24 58 24 55 27 55 28 52 31 52 32 50 34 50 35 47 38 45 36 45 35 41 31 41 30 40 29 40 28 38 26 38 25 37 24 35 24 34 23

candidateLetter N
10 3 9 4 7 4 6 5 5 5 5 6 4 7 4 9 3 10 3 44 4 45 4 47 6 49 12 49 13 48 13 47 14 46 14 23 15 22 17 24 17 25 21 29 21 30 24 33 24 34 27 37 27 38 31 42 31 43 34 46 34 47 35 48 36 48 37 49 43 49 45 47 45 6 43 4 38 4 36 6 36 8 35 9 35 27 36 28 36 29 34 31 33 30 33 29 31 27 31 26 27 22 27 21 24 18 24 17 21 14 21 13 18 10 18 9 13 4 11 4

candidateLetter X
10 2 9 3 7 3 6 4 6 6 5 7 5 8 6 9 6 11 8 13 8 14 10 16 10 17 14 21 14 22 16 24 16 25 13 28 13 29 10 32 10 33 7 36 7 37 5 39 5 40 4 41 4 46 6 48 11 48 15 44 15 43 17 41 17 40 19 38 19 37 21 35 21 34 23 32 26 35 26 36 28 38 28 39 30 41 30 42 33 45 33 46 35 48 40 48 42 46 42 39 40 37 40 36 37 33 37 32 34 29 34 28 32 26 32 23 34 21 34 20 37 17 37 16 41 12 41 11 42 10 42 4 41 3 39 3 38 2 37 3 35 3 32 6 32 7 29 10 29 11 27 13 27 14 24 17 21 14 21 13 18 10 18 9 17 8 17 7 15 5 15 4 14 3 12 3 11 2
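For anyone wanting to poke at this numerically, here is a rough sketch (my own addition, not from the original post) of how the raw coordinate lists above could be fed back into OpenCV to inspect the log-scaled Hu moments that matchShapes compares; those moments are invariant to translation, scale and rotation, so visually different letters can still end up with very similar signatures:

import cv2 as cv
import numpy as np

def to_contour(flat_coords):
    # turn a flat "x y x y ..." list into an OpenCV contour of shape (N, 1, 2)
    return np.array(flat_coords, dtype=np.int32).reshape(-1, 1, 2)

def log_hu(contour):
    # the same log-scaled Hu moments that cv.matchShapes compares internally
    hu = cv.HuMoments(cv.moments(contour)).ravel()
    return np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

# paste the full coordinate lists from the post in place of the ellipses
# target_x    = to_contour([33, 23, 32, 24, ...])
# candidate_n = to_contour([10, 3, 9, 4, ...])
# candidate_x = to_contour([10, 2, 9, 3, ...])
# print(log_hu(target_x), log_hu(candidate_n), log_hu(candidate_x))
# print(cv.matchShapes(target_x, candidate_n, cv.CONTOURS_MATCH_I3, 0))
# print(cv.matchShapes(target_x, candidate_x, cv.CONTOURS_MATCH_I3, 0))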
Grid of integers
I need to make a grid with the numbers generated by the code, but I'm not understanding how to align them in columns. Is there a parameter of print or something else that could help me out?

#main()
a=0
b=0
for i in range(1, 13):
    a=a+1
    print(" ")
    b=b+1
    for f in range(1,13):
        print(f*b, end=" ")

My output at the moment:
I would recommend using Python's f-strings:

for i in range(1, 13):
    print(''.join(f"{i*j: 4}" for j in range(1,13)))

Here's the output:

   1   2   3   4   5   6   7   8   9  10  11  12
   2   4   6   8  10  12  14  16  18  20  22  24
   3   6   9  12  15  18  21  24  27  30  33  36
   4   8  12  16  20  24  28  32  36  40  44  48
   5  10  15  20  25  30  35  40  45  50  55  60
   6  12  18  24  30  36  42  48  54  60  66  72
   7  14  21  28  35  42  49  56  63  70  77  84
   8  16  24  32  40  48  56  64  72  80  88  96
   9  18  27  36  45  54  63  72  81  90  99 108
  10  20  30  40  50  60  70  80  90 100 110 120
  11  22  33  44  55  66  77  88  99 110 121 132
  12  24  36  48  60  72  84  96 108 120 132 144

The most common form is to use almost any arbitrary expression within the curly braces. This can include dictionary values, function calls and so on. The above usage specifies formatting after the colon. The space before the 4 puts a space where the sign of a positive number would go, and the 4 indicates that the whole expression should take up at least 4 characters in total. For more info, check out the documentation.
Considering the width of each grid cell is stored as w, which for above snippet suffices as 4, a regularly spaced grid can be printed using

w = 4
a, b = 0, 0
for i in range(1, 13):
    a, b = a+1, b+1
    for f in range(1, 13):
        print(('{:'+str(w)+'}').format(f*b), end='')
    print('')

Its output is

   1   2   3   4   5   6   7   8   9  10  11  12
   2   4   6   8  10  12  14  16  18  20  22  24
   3   6   9  12  15  18  21  24  27  30  33  36
   4   8  12  16  20  24  28  32  36  40  44  48
   5  10  15  20  25  30  35  40  45  50  55  60
   6  12  18  24  30  36  42  48  54  60  66  72
   7  14  21  28  35  42  49  56  63  70  77  84
   8  16  24  32  40  48  56  64  72  80  88  96
   9  18  27  36  45  54  63  72  81  90  99 108
  10  20  30  40  50  60  70  80  90 100 110 120
  11  22  33  44  55  66  77  88  99 110 121 132
  12  24  36  48  60  72  84  96 108 120 132 144
You can reference keyword argument values passed to the str.format() method in the format string by name via {name}. Here's an example of doing that where the value referenced is computed (as opposed to being a constant):

mx = 12
w = len(str(mx*mx)) + 1
for b in range(1, mx+1):
    for f in range(1, mx+1):
        print(('{:{w}}').format(f*b, w=w), end='')
    print('')

Output:

   1   2   3   4   5   6   7   8   9  10  11  12
   2   4   6   8  10  12  14  16  18  20  22  24
   3   6   9  12  15  18  21  24  27  30  33  36
   4   8  12  16  20  24  28  32  36  40  44  48
   5  10  15  20  25  30  35  40  45  50  55  60
   6  12  18  24  30  36  42  48  54  60  66  72
   7  14  21  28  35  42  49  56  63  70  77  84
   8  16  24  32  40  48  56  64  72  80  88  96
   9  18  27  36  45  54  63  72  81  90  99 108
  10  20  30  40  50  60  70  80  90 100 110 120
  11  22  33  44  55  66  77  88  99 110 121 132
  12  24  36  48  60  72  84  96 108 120 132 144
Reading csv file with delimiter | using pandas
def main():
    l=[]
    for i in range(1981,2018):
        df = pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/"+ str(i)+"/Population.Heating.txt")
        print(df[12:])

I am trying to download and read the "CONUS" row in Population.Heating.txt from 1981 to 2017. My code seems to get the CONUS parts, but how can I actually read it like csv format with |? Thank you!
Try this:

def main():
    l=[]
    url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
    for i in range(1981,2018):
        df = pd.read_csv(url.format(i), sep='\|', skiprows=3, engine='python')
        print(df[12:])

Demo:

In [14]: url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"

In [15]: i = 2017

In [16]: df = pd.read_csv(url.format(i), sep='\|', skiprows=3, engine='python')

In [17]: df
Out[17]:
  Region  20170101  20170102  20170103  20170104  20170105  20170106  20170107  20170108  20170109  ...  20171222  20171223  \
0      1        30        36        31        25        37        39        47        51        55  ...        40        32
1      2        28        32        28        23        39        41        46        49        51  ...        31        25
2      3        34        30        26        43        52        58        57        54        44  ...        29        32
3      4        37        34        37        57        60        62        59        54        43  ...        39        45
4      5        15        11         9        10        20        21        27        36        33  ...        12         7
5      6        16         9         7        22        31        38        45        44        35  ...         9         9
6      7         8         5         9        23        23        34        37        32        17  ...         9        19
7      8        30        32        34        33        36        42        42        31        23  ...        36        33
8      9        25        25        24        23        22        25        23        15        17  ...        23        20
9  CONUS        24        23        21        26        33        38        40        39        34  ...        23        22

   20171224  20171225  20171226  20171227  20171228  20171229  20171230  20171231
0        32        34        43        53        59        59        57        59
1        30        33        43        49        54        53        50        55
2        41        47        58        62        60        54        54        60
3        47        55        61        64        57        54        62        68
4        12        20        21        22        27        26        24        29
5        22        33        31        35        37        33        32        39
6        19        24        23        28        28        23        19        27
7        34        30        32        29        26        24        27        30
8        18        17        17        15        13        11        12        15
9        26        30        34        37        38        35        34        40

[10 rows x 366 columns]
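If the goal is specifically the CONUS row from each year, a small follow-up sketch along these lines should work; the whitespace stripping is an assumption about how the labels come out of these files, so check it on one year first:

import pandas as pd

url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
conus_rows = []
for i in range(1981, 2018):
    df = pd.read_csv(url.format(i), sep='\|', skiprows=3, engine='python')
    df.columns = [str(c).strip() for c in df.columns]              # column labels may carry stray spaces
    conus = df[df['Region'].astype(str).str.strip() == 'CONUS']    # keep only the CONUS row
    conus_rows.append(conus)
# conus_rows now holds one single-row frame per year (columns differ per year, so keep them in a list)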
def main():
    l=[]
    for i in range(1981,2018):
        l.append(pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/"+ str(i)+"/Population.Heating.txt", sep='|', skiprows=3))

Files look like:

Product: Daily Heating Degree Days
Regions: Regions::CensusDivisions
Weights: Population
[... data ...]

so you need to skip 3 rows. Afterwards you have several 'df' in your list 'l' for further processing.