I have been given several years of hourly data for ozone (O3), NO, NO2 and CO. The task is to use this data to predict ozone. Suppose I have data for 2015, 2016, 2018 and 2019; I need to predict the 2019 ozone values using the 2015, 2016 and 2018 data.
The data is recorded hourly and is organised by month (image).
What I have done: first, I combined all the years into one Excel file with four columns (NO, NO2, CO, O3), adding the data month by month. This is the master file I used (attached image).
I used Python. First the data has to be cleaned. NO, NO2 and CO are precursors of ozone, meaning that ozone formation depends on these gases. The constraints were: remove any negative value, and if any of the O3, NO, NO2 or CO values in a row is invalid, drop the whole row. The data also contained some string entries, which had to be removed as well. That was all done. Then I applied MLPClassifier from scikit-learn. Here is my code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neural_network import MLPClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
bugs = ['NOx', '* 43.3', '* 312', '11/19', '11/28', '06:00', '09/30', '09/04', '14:00', '06/25', '07:00', '06/02',
'17:00', '04/10', '04/17', '18:00', '02/26', '02/03', '01:00', '11/23', '15:00', '11/12', '24:00', '09/02',
'16:00', '09/28', '* 16.8', '* 121', '12:00', '06/24', '13:00', '06/26', 'Span', 'NoData', 'ppb', 'Zero',
'Samp<', 'RS232']
dataset = pd.read_excel("Testing.xlsx")
dataset = pd.DataFrame(dataset).replace(bugs, 0)
dataset.dropna(subset=["O3"], inplace=True)
dataset.dropna(subset=["NO"], inplace=True)
dataset.dropna(subset=["NO2"], inplace=True)
dataset.dropna(subset=["CO"], inplace=True)
dataset.drop(dataset[dataset['O3'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['O3'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['O3'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['CO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['CO'] > 4000].index, inplace=True)
dataset.drop(dataset[dataset['CO'] == 0].index, inplace=True)
dataset = dataset.reset_index()
dataset = dataset.drop(['index'], axis=1)
feat = dataset[["NO", "NO2", "CO"]].astype(int)
label = dataset[["O3"]].astype(int)
X_train, X_test, y_train, y_test = train_test_split(feat, label, test_size=0.1)
# X_train = dataset.iloc[0:9200, 0:3].values.astype(int)
# y_train = dataset.iloc[0:9200, 3].values.astype(int)
# X_test = dataset.iloc[9200:9393, 0:3].values.astype(int)
# y_test = dataset.iloc[9200:9393, 3].values.astype(int)
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.fit_transform(X_test)
def accuracy(confusion_matrix):  # <--==
    diagonal_sum = confusion_matrix.trace()
    sum_of_all_elements = confusion_matrix.sum()
    return diagonal_sum / sum_of_all_elements
classifier = MLPClassifier(hidden_layer_sizes=(250, 100, 10), max_iter=100000, activation='relu', solver='adam',
                           random_state=1)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
print(f"\n{X_test}\n ----> \nPredictions : \n{y_pred}\n{y_pred.shape}\n")
cm = confusion_matrix(y_pred, y_test)
print(f"\nAccuracy of MLP.Cl : {accuracy(cm)}\n")
print(accuracy_score(y_test, y_pred))
y_test = pd.DataFrame(y_test)
y_test = y_test.reset_index(0)
y_test = y_test.drop(['index'], axis=1)
y_test = y_test.head(100)
# y_test = y_test.drop([19,20],axis=0)
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred.shift(-1)
y_pred = y_pred.head(100)
# y_pred = y_pred.drop([19,20],axis=0)
plt.figure(figsize=(10, 5))
plt.plot(y_pred, color='r', label='PredictedO3')
plt.plot(y_test, color='g', label='OriginalO3')
plt.legend()
plt.show()
This is the code.
The plot is attached here (image).
Console output (PyDev console):
[[-0.53939794 -0.59019756 -0.53257553]
[ 2.55576818 0.45245455 -0.7648624 ]
[-0.36744427 0.73681421 -0.30028866]
...
[-0.59671583 -0.02147823 1.81678204]
[-0.25280849 0.73681421 1.31145621]
[-0.53939794 0.64202766 0.18466113]]
---->
Predictions :
[15 39 45 40 42 11 14 32 23 23 21 23 3 15 23 59 15 34 12 10 42 23 12 8
14 3 8 42 12 61 36 13 11 20 12 10 14 42 12 20 9 5 14 11 20 14 10 85
42 73 43 23 61 85 55 13 14 20 85 32 15 15 42 42 12 23 13 23 85 8 23 11
36 32 20 12 27 35 55 17 15 23 12 44 42 17 23 45 35 23 3 11 23 12 60 11
15 39 15 44 49 7 35 42 45 13 12 55 42 18 42 6 23 14 60 43 16 18 10 43
85 20 23 88 8 20 26 23 53 45 16 4 48 27 3 61 15 7 23 6 40 12 44 12
12 4 12 13 24 24 23 15 16 13 40 12 12 10 12 15 53 12 42 45 38 23 45 17
12 30 12 45 60 65 12 52 4 35 3 15 11 23 40 42 18 23 45 45 49 43 35 62
46 14 21 11 6 24 23 16 23 21 45 42 85 39 12 16 10 38 43 6 23 20 11 65
14 14 14 45 24 18 85 60 15 10 16 14 23 10 17 6 13 42 4 7 17 51 23 3
85 42 23 55 21 15 32 14 17 12 42 18 16 8 6 10 14 12 42 15 14 43 25 12
14 15 85 20 42 23 43 32 18 12 42 35 6 47 12 20 12 6 51 8 20 45 40 43
12 14 44 23 23 21 15 45 24 12 23 23 42 15 12 46 35 8 14 16 42 11 42 16
13 61 60 25 26 16 45 10 17 5 43 21 26 12 49 12 42 11 38 48 21 45 9 48
11 20 13 23 16 21 11 12 44 55 11 16 53 45 8 17 12 9 85 56 7 23 23 26
12 42 42 51 17 23 43 52 24 12 29 11 21 42 16 6 20 18 16 8 14 15 13 43
10 23 16 15 42 43 23 11 14 25 47 16 24 14 7 43 45 14 5 18 51 42 20 15
39 32 12 44 13 51 12 43 42 23 42 17 11 12 11 42 12 5 35 51 23 51 14 9
11 34 18 21 88 21 15 15 6 49 12 51 8 12 49 8 4 17 15 6 26 3 15 43
14 5 23 15 88 21 85 11 23 25 45 14 12 65 45 27 48 42 12 14 44 45 4 44
40 16 23 25 15 10 20 12 15 62 6 13 20 20 11 56 12 40 11 14 25 6 25 12
40 85 40 85 43 11 14 32 11 8 6 8 23 12 26 18 60 18 51 40 13 51 12 8
23 45 20 4 23 11 3 12 51 11 18 12 40 14 40 7 85 44 60 85 45 14 14 14
11 55 18 16 45 13 23 51 11 14 23 18 14 7 40 23 15 32 12 12 23 42 49 88
11 11 42 6 25 12 6 11 18 6 13 35 8 15 42 39 23 9 23 32 20 21 12 20
20 38 7 12 42 8 13 17 55 60 16 39 18 42 42 12 60 14 16 40 9 18 85 40
5 14 23 45 10 24 14 25 11 17 15 42 42 15 23 15 8 34 16 60 42 14 48 51
11 6 51 15 42 12 42 20 12 25 26 25 45 26 40 48 23 45 23 21 11 17 48 12
12 6 15 34 10 16 18 17 13 20 45 3 9 39 12 11 15 23 42 45 45 65 51 6
45 15 15 17 51 8 51 34 14 17 13 38 38 21 18 51 55 16 9 44 42 6 42 17
6 25 88 11 10 48 20 40 21 12 44 27 47 42 38 15 49 12 12 12 6 12 8 16
42 9 20 18 23 18 12 13 20 16 14 12 23 10 60 18 25 23 43 21 12 12 10 61
21 40 6 16 45 38 12 17 12 15 32 9 38 17 14 11 6 15 14 6 48 21 13 13
15 36 3 45 25 29 24 16 8 10 27 21 20 51 10 16 21 12 20 23 46 23 3 34
29 15 23 15 48 42 17 42 43 15 35 34 23 23 44 23 4 35 12 42 49 36 15 18
15 14 11 18 16 20 15 25 9 43 51 45 12 15 39 21 51 18 24 26 17 9 42 44
12 30 32 8 20 44 52 20 23 23 15 12 12 42 8 5 42 23 21 16 24 65 16 12
38 36 43 60 15 7 85 15 26 42 40 11 12 23 12 20 40 23 42 6 23 52 16 20
23 45 51 9 42 42 25 6 21 23 15 8 12 12 26 11 16 15 39 8 26 43 48 47
12 48 12 11]
(940,)
and
Accuracy of MLP.Cl : 0.0425531914893617
0.0425531914893617
I can't get correct predictions.
You are trying to predict a continuous value, which is a regression problem, not a classification one; consequently, MLPClassifier is the wrong model to apply here - the correct one being an MLPRegressor.
On top of this, accuracy is meaningful for classification problems only, and it is meaningless in regression ones, like yours here; so, after switching to the correct model, you should also use some other performance metric suitable for regression problems.
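As a rough illustration of the switch described above, here is a minimal sketch using MLPRegressor with regression metrics. The data is synthetic (randomly generated stand-ins for NO, NO2, CO and O3, not real measurements), and the layer sizes and iteration count are arbitrary assumptions rather than tuned values. It also fits the scaler on the training set only and reuses it on the test set, instead of calling fit_transform twice.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned NO, NO2, CO -> O3 table (NOT real measurements).
rng = np.random.default_rng(0)
X = rng.uniform(1, 160, size=(500, 3))
y = 0.3 * X[:, 0] + 0.2 * X[:, 1] - 0.1 * X[:, 2] + rng.normal(0, 1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

# Fit the scaler on the training set only, then apply the same scaling to the test set.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hypothetical architecture; the target stays continuous (no cast to int).
regressor = MLPRegressor(hidden_layer_sizes=(50, 20), max_iter=2000, random_state=1)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Regression metrics instead of accuracy.
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}, R^2: {r2:.3f}")
```

Mean absolute error and R^2 are only two of the possible choices; mean squared error would work just as well here.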
Is there any efficient way to reshape a dataframe from:
(A1, A2, A3, B1, B2, B3, C1, C2, C3, TT, YY and ZZ are columns)
A1 A2 A3 B1 B2 B3 C1 C2 C3 TT YY ZZ
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
TO:
HH JJ KK TT YY ZZ
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
HH, JJ and KK are new columns obtained by vertically stacking the A, B and C column groups, while TT, YY and ZZ are repeated alongside each group:
A1 A2 A3 TT YY ZZ
B1 B2 B3 TT YY ZZ
C1 C2 C3 TT YY ZZ
Thanks for your help
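You can split the columns into groups and concatenate them:
import numpy as np
import pandas as pd
df = pd.read_clipboard()
# groups of three columns (A*, B*, C*), excluding the last three
ColSets = [df.columns[i:i+3] for i in np.arange(0, len(df.columns)-3, 3)]
# the trailing columns TT, YY, ZZ that are kept next to every group
LCols = df.columns[-3:]
# .T.reset_index(drop=True).T replaces the column labels with 0..5 so the pieces line up in concat
NewDf = pd.concat([df[ColSet].join(df[LCols]).T.reset_index(drop=True).T for ColSet in ColSets])
NewDf.columns = ['HH', 'JJ', 'KK', 'TT', 'YY', 'ZZ']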
Out:
HH JJ KK TT YY ZZ
0 11 22 33 23 24 25
1 11 22 33 23 24 25
2 11 22 33 23 24 25
3 11 22 33 23 24 25
4 11 22 33 23 24 25
5 11 22 33 23 24 25
0 44 55 66 23 24 25
1 44 55 66 23 24 25
2 44 55 66 23 24 25
3 44 55 66 23 24 25
4 44 55 66 23 24 25
5 44 55 66 23 24 25
0 77 88 99 23 24 25
1 77 88 99 23 24 25
2 77 88 99 23 24 25
3 77 88 99 23 24 25
4 77 88 99 23 24 25
5 77 88 99 23 24 25
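For anyone who wants to try this without the clipboard, here is a self-contained sketch of the same split-and-concatenate idea. It builds the sample frame inline, renames each group instead of transposing, and passes ignore_index=True so the result gets a fresh 0..17 index; the variable names (groups, keep, pieces) are just illustrative.

```python
import pandas as pd

# Rebuild the sample frame from the question.
cols = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3", "TT", "YY", "ZZ"]
row = [11, 22, 33, 44, 55, 66, 77, 88, 99, 23, 24, 25]
df = pd.DataFrame([row] * 6, columns=cols)

new_names = ["HH", "JJ", "KK"]
keep = ["TT", "YY", "ZZ"]
groups = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1", "C2", "C3"]]

# For each group, take its three columns plus the kept columns
# and rename the group columns to the common names.
pieces = [
    df[group + keep].rename(columns=dict(zip(group, new_names)))
    for group in groups
]

# Stack the pieces vertically with a continuous index.
new_df = pd.concat(pieces, ignore_index=True)
print(new_df)
```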
A bit longer than the previous solution:
from itertools import groupby
from operator import itemgetter
# extract columns ending with a digit
abc = df.filter(regex=r'\d$')
# group the column names by their first letter
cols = sorted(abc.columns, key=itemgetter(0))
filtered_columns = [list(g) for k, g in groupby(cols, key=itemgetter(0))]
# select each group, rename its columns and stack the pieces
abc_stack = pd.concat([abc.filter(col)
                          .set_axis(['HH', 'JJ', 'KK'], axis='columns')
                       for col in filtered_columns],
                      ignore_index=True)
# select the columns ending with a letter
tyz = df.filter(regex=r'[A-Z]$')
# repeat them so the frame has the same length as abc_stack
tyz_stack = pd.concat([tyz] * len(filtered_columns), ignore_index=True)
# combine both dataframes
res = pd.concat([abc_stack, tyz_stack], axis=1)
res
HH JJ KK TT YY ZZ
0 11 22 33 23 24 25
1 11 22 33 23 24 25
2 11 22 33 23 24 25
3 11 22 33 23 24 25
4 11 22 33 23 24 25
5 11 22 33 23 24 25
6 44 55 66 23 24 25
7 44 55 66 23 24 25
8 44 55 66 23 24 25
9 44 55 66 23 24 25
10 44 55 66 23 24 25
11 44 55 66 23 24 25
12 77 88 99 23 24 25
13 77 88 99 23 24 25
14 77 88 99 23 24 25
15 77 88 99 23 24 25
16 77 88 99 23 24 25
17 77 88 99 23 24 25
UPDATE : 2021-01-08
The reshaping can be abstracted with the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub.
The data you shared has a pattern (some columns end with 1, others with 2, and others with 3), and we can use this pattern to reshape the data:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
(df.pivot_longer(names_to=("HH", "JJ", "KK"),
                 names_pattern=("1$", "2$", "3$"),
                 index=("TT", "YY", "ZZ")
                 )
   .sort_index(axis="columns"))
Basically, it looks for columns that end with 1 and aggregates them into one column ("HH"), then does the same for 2 ("JJ") and 3 ("KK").
I want to split every column of my data frame on the space character. I can do it for one column. How can I apply it to the whole data frame (maybe with a loop)?
df =
      0      1      2      4
  11 22  12 22  13 22  14 22
  15 16  17 18  33 44  22 55
  19 20  21 22  66 55  33 66
  23 24  25 26  22 44  66 44
I am splitting a single column like this:
df[0].str.split(' ', n=1, expand=True)
Output is:
0 1
11 22
15 16
19 20
23 24
You can stack and unstack:
df.stack().str.split(' ', expand=True).unstack()
Output:
0 1
0 1 2 4 0 1 2 4
0 11 12 13 14 22 22 22 22
1 15 17 33 22 16 18 44 55
2 19 21 66 33 20 22 55 66
3 23 25 22 66 24 26 44 44
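Here is the same stack/split/unstack chain broken into steps on a frame rebuilt from the question, as a small sketch. The intermediate names (stacked, parts, wide) are only for illustration; note that the split pieces are still strings, so a cast is needed if you want numbers.

```python
import pandas as pd

# Sample frame from the question: every cell holds two space-separated values.
df = pd.DataFrame({
    0: ["11 22", "15 16", "19 20", "23 24"],
    1: ["12 22", "17 18", "21 22", "25 26"],
    2: ["13 22", "33 44", "66 55", "22 44"],
    4: ["14 22", "22 55", "33 66", "66 44"],
})

# 1) stack: one long Series indexed by (row, original column)
stacked = df.stack()

# 2) split every string; expand=True gives one column per piece (0 and 1)
parts = stacked.str.split(" ", expand=True)

# 3) unstack: move the original column labels back to the columns,
#    giving a column MultiIndex of (piece, original column)
wide = parts.unstack()
print(wide)

# The first pieces of all columns, converted from strings to integers.
first_parts = wide[0].astype(int)
print(first_parts)
```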
I am comparing the contours of letters and have several cases of unexpected results. The most confusing to me is how X and N are being identified as best matches.
In the images below, yellow represents the unknown shape and blue represents candidate shapes. The white numbers are the result returned by cv.matchShapes using CONTOURS_MATCH_I3. (I've tried the other matching methods and just get similar odd results but with a different set of letters.)
Below shows X matching N better than X
Below shows N matching X better than N
At the end of the post are the raw data, and below is a chart of the data.
I can't come up with a rotation, scale, or skew to show that this is an optical illusion. I'm not suggesting there is an issue in matchShapes but rather an issue in my understanding of Hu moments.
I'd appreciate if someone would take a moment (pun intended) and explain how cv.matchShapes is producing these results.
--- edited ----
The images below are the result of using poly-filled shapes. I am still baffled how these letters match better than the correct ones.
target_letter
33 23
32 24
30 24
28 26
28 30
29 31
29 32
31 34
31 35
33 37
33 38
36 41
36 42
38 44
38 47
35 50
35 51
33 53
33 54
30 57
30 58
28 60
28 61
27 62
27 67
29 69
34 69
38 65
38 64
40 62
40 61
42 59
42 58
46 54
47 54
49 56
49 57
51 59
51 60
53 62
53 63
56 66
56 67
58 69
63 69
65 67
65 60
63 58
63 57
60 54
60 53
58 51
58 50
55 47
55 44
57 42
57 41
61 37
61 36
64 33
64 32
65 31
65 25
64 24
62 24
61 23
60 24
58 24
55 27
55 28
52 31
52 32
50 34
50 35
47 38
45 36
45 35
41 31
41 30
40 29
40 28
38 26
38 25
37 24
35 24
34 23
candidateLetter N
10 3
9 4
7 4
6 5
5 5
5 6
4 7
4 9
3 10
3 44
4 45
4 47
6 49
12 49
13 48
13 47
14 46
14 23
15 22
17 24
17 25
21 29
21 30
24 33
24 34
27 37
27 38
31 42
31 43
34 46
34 47
35 48
36 48
37 49
43 49
45 47
45 6
43 4
38 4
36 6
36 8
35 9
35 27
36 28
36 29
34 31
33 30
33 29
31 27
31 26
27 22
27 21
24 18
24 17
21 14
21 13
18 10
18 9
13 4
11 4
candidateLetter X
10 2
9 3
7 3
6 4
6 6
5 7
5 8
6 9
6 11
8 13
8 14
10 16
10 17
14 21
14 22
16 24
16 25
13 28
13 29
10 32
10 33
7 36
7 37
5 39
5 40
4 41
4 46
6 48
11 48
15 44
15 43
17 41
17 40
19 38
19 37
21 35
21 34
23 32
26 35
26 36
28 38
28 39
30 41
30 42
33 45
33 46
35 48
40 48
42 46
42 39
40 37
40 36
37 33
37 32
34 29
34 28
32 26
32 23
34 21
34 20
37 17
37 16
41 12
41 11
42 10
42 4
41 3
39 3
38 2
37 3
35 3
32 6
32 7
29 10
29 11
27 13
27 14
24 17
21 14
21 13
18 10
18 9
17 8
17 7
15 5
15 4
14 3
12 3
11 2
def main():
    l=[]
    for i in range(1981,2018):
        df = pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/"+ str(i)+"/Population.Heating.txt")
        print(df[12:])
I am trying to download and read the "CONUS" row in Population.Heating.txt for each year from 1981 to 2017.
My code seems to get the CONUS part, but how can I actually parse the file as a CSV delimited by |?
Thank you!
Try this:
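def main():
    l=[]
    url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
    for i in range(1981,2018):
        # the first 3 lines of each file are a text header; the rest is |-delimited
        df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
        # after skipping the header the frame has 10 rows; CONUS is the last one
        print(df.tail(1))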
Demo:
In [14]: url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
In [15]: i = 2017
In [16]: df = pd.read_csv(url.format(i), sep='\|', skiprows=3, engine='python')
In [17]: df
Out[17]:
Region 20170101 20170102 20170103 20170104 20170105 20170106 20170107 20170108 20170109 ... 20171222 20171223 \
0 1 30 36 31 25 37 39 47 51 55 ... 40 32
1 2 28 32 28 23 39 41 46 49 51 ... 31 25
2 3 34 30 26 43 52 58 57 54 44 ... 29 32
3 4 37 34 37 57 60 62 59 54 43 ... 39 45
4 5 15 11 9 10 20 21 27 36 33 ... 12 7
5 6 16 9 7 22 31 38 45 44 35 ... 9 9
6 7 8 5 9 23 23 34 37 32 17 ... 9 19
7 8 30 32 34 33 36 42 42 31 23 ... 36 33
8 9 25 25 24 23 22 25 23 15 17 ... 23 20
9 CONUS 24 23 21 26 33 38 40 39 34 ... 23 22
20171224 20171225 20171226 20171227 20171228 20171229 20171230 20171231
0 32 34 43 53 59 59 57 59
1 30 33 43 49 54 53 50 55
2 41 47 58 62 60 54 54 60
3 47 55 61 64 57 54 62 68
4 12 20 21 22 27 26 24 29
5 22 33 31 35 37 33 32 39
6 19 24 23 28 28 23 19 27
7 34 30 32 29 26 24 27 30
8 18 17 17 15 13 11 12 15
9 26 30 34 37 38 35 34 40
[10 rows x 366 columns]
def main():
    l=[]
    for i in range(1981,2018):
        l.append(pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/" + str(i) + "/Population.Heating.txt",
                             sep='|', skiprows=3))
    return l
Files look like:
Product: Daily Heating Degree Days
Regions: Regions::CensusDivisions
Weights: Population
[... data ...]
So you need to skip the first 3 rows. Afterwards, the list 'l' holds one DataFrame per year for further processing.
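To check the parsing without hitting the FTP server, the same read_csv call can be tried on in-memory text. The sketch below uses a tiny made-up sample that only mimics the three header lines and the |-delimited layout; the values, the two years and the helper name read_conus are assumptions for illustration, not real data or part of any library.

```python
import io

import pandas as pd

HEADER = (
    "Product: Daily Heating Degree Days\n"
    "Regions: Regions::CensusDivisions\n"
    "Weights: Population\n"
)

# Hypothetical per-year file contents (made-up numbers, two days per year).
sample = {
    2016: HEADER + "Region|20160101|20160102\n1|30|36\n2|28|32\nCONUS|24|23\n",
    2017: HEADER + "Region|20170101|20170102\n1|31|35\n2|27|30\nCONUS|26|21\n",
}


def read_conus(text):
    """Parse one file's text and return its CONUS row as a Series indexed by date."""
    df = pd.read_csv(io.StringIO(text), sep="|", skiprows=3)
    row = df.loc[df["Region"] == "CONUS"].drop(columns="Region")
    return row.iloc[0]


# One Series per year, concatenated into a single long daily series.
conus = pd.concat([read_conus(text) for text in sample.values()])
print(conus)
```

With the real files you would pass the FTP URL to read_csv instead of a StringIO object; if the real header cells carry surrounding whitespace, the "Region" comparison would need a strip first.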