Split several columns by "space" pandas - python

I want to split every column of my data frame on spaces. I can do it for one column. How can I apply it to the whole data frame (maybe with a loop)?
df =
0 1 2 4
11 22 12 22 13 22 14 22
15 16 17 18 33 44 22 55
19 20 21 22 66 55 33 66
23 24 25 26 22 44 66 44
I am splitting one column like this:
df[0].str.split(' ', n=1, expand=True)
Output is:
0 1
11 22
15 16
19 20
23 24

You can stack and unstack:
df.stack().str.split(' ', expand=True).unstack()
Output:
0 1
0 1 2 4 0 1 2 4
0 11 12 13 14 22 22 22 22
1 15 17 33 22 16 18 44 55
2 19 21 66 33 20 22 55 66
3 23 25 22 66 24 26 44 44
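The stack/unstack chain above can be sketched end to end as follows. The frame is rebuilt from the values in the question; the names `wide` and `grouped` are illustrative, and the final reordering step is an optional extra that is not part of the original answer.

```python
import pandas as pd

# Sample frame rebuilt from the question: each cell holds two space-separated values
df = pd.DataFrame(
    {
        0: ["11 22", "15 16", "19 20", "23 24"],
        1: ["12 22", "17 18", "21 22", "25 26"],
        2: ["13 22", "33 44", "66 55", "22 44"],
        4: ["14 22", "22 55", "33 66", "66 44"],
    }
)

# stack() turns the frame into one long Series, str.split() splits every cell,
# and unstack() moves the original column labels back into the columns
wide = df.stack().str.split(" ", expand=True).unstack()

# Optional: put the two parts of each original column next to each other
grouped = wide.swaplevel(axis=1).sort_index(axis=1)

print(wide)
print(grouped)
```

In `wide` the outer column level is the part number (0 or 1) and the inner level is the original column label; `grouped` swaps the two levels so the parts of each original column sit side by side.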

Related

Get Max Value of a Row in subset of Column respecting a condition

I have a dataframe that looks like this:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5
1 37 14 17 29 31 34 32 31 21 17 18
2 12 13 12 16 30 33 37 32 32 15 42
3 40 16 29 31 36 32 30 19 16 15 12
4 12 14 12 28 28 30 29 27 16 18 33
5 12 13 16 17 28 32 33 30 29 17 35
I want to add a column that will be the Column_Name of the Maximum Value per Row.
I did that with:
df['MaxVal_Dist'] = df.idxmax(axis=1)
Which gives me this df:
FakeDist -5 -4 ... MaxVal_Dist
1 37 14 ... -5
2 12 13 ... 5
3 40 16 ... -5
4 12 14 ... 5
5 12 13 ... 5
But my real goal is to add a condition: I want the column name of the maximum value only among the columns where 'FakeDist' is between -2 and 2, to get the following result:
FakeDist -5 -4 ... MaxVal_Dist
1 37 14 ... 0
2 12 13 ... 1
3 40 16 ... -1
4 12 14 ... 0
5 12 13 ... 1
I tried to work out how to do this with df.apply but couldn't make it work.
I have a workaround in mind: store the subset of columns (from -2 to 2) in a new dataframe, compute the new column there, and then add that result column to my initial dataframe. It seems inelegant to me, though, and I am sure there is a better way.
I would be really glad to learn the elegant way to do this!
You can use boolean indexing with loc to filter the columns in the range -2 to 2, then use idxmax along axis=1:
c = df.columns.astype(int)
df['MaxVal_Dist'] = df.loc[:, (c >= -2) & (c <= 2)].idxmax(1)
Result:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
1 37 14 17 29 31 34 32 31 21 17 18 0
2 12 13 12 16 30 33 37 32 32 15 42 1
3 40 16 29 31 36 32 30 19 16 15 12 -1
4 12 14 12 28 28 30 29 27 16 18 33 0
5 12 13 16 17 28 32 33 30 29 17 35 1
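A self-contained sketch of the boolean-mask approach above, using the values from the question. It assumes `FakeDist` is the index and that the column labels are integers; the name `in_range` is illustrative.

```python
import pandas as pd

# Values copied from the question; FakeDist is assumed to be the index
df = pd.DataFrame(
    [
        [37, 14, 17, 29, 31, 34, 32, 31, 21, 17, 18],
        [12, 13, 12, 16, 30, 33, 37, 32, 32, 15, 42],
        [40, 16, 29, 31, 36, 32, 30, 19, 16, 15, 12],
        [12, 14, 12, 28, 28, 30, 29, 27, 16, 18, 33],
        [12, 13, 16, 17, 28, 32, 33, 30, 29, 17, 35],
    ],
    index=pd.Index([1, 2, 3, 4, 5], name="FakeDist"),
    columns=range(-5, 6),
)

# Boolean mask over the column labels: True only for -2..2
c = df.columns.astype(int)
in_range = (c >= -2) & (c <= 2)

# idxmax(axis=1) returns, per row, the label of the column holding the maximum
df["MaxVal_Dist"] = df.loc[:, in_range].idxmax(axis=1)
print(df)
```

The mask is computed before the new column is added, so `MaxVal_Dist` itself never takes part in the comparison.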
You can try List comprehension:
In [1159]: cols = [i for i in df.columns[1:] if -2 <= int(i) <= 2]
In [1161]: df['MaxVal_Dist'] = df[cols].idxmax(axis=1)
In [1162]: df
Out[1162]:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
0 1 37 14 17 29 31 34 32 31 21 17 18 0
1 2 12 13 12 16 30 33 37 32 32 15 42 1
2 3 40 16 29 31 36 32 30 19 16 15 12 -1
3 4 12 14 12 28 28 30 29 27 16 18 33 0
4 5 12 13 16 17 28 32 33 30 29 17 35 1

Cannot get good accuracy from sklearn MLP classifier

I have been given several years of data on ozone, NO, NO2 and CO to work on. The task is to use this data to predict the ozone value. Suppose I have data for 2015, 2016, 2018 and 2019; I need to predict the 2019 ozone values using the 2015, 2016 and 2018 data.
The data is recorded hourly and is organized by month.
What I have done: first, I put all the years' data into one Excel file with 4 columns (NO, NO2, CO, O3), adding the data month by month. This is the master file that was used.
I used Python. First the data had to be cleaned. To explain a bit: NO, NO2 and CO are precursors of ozone, meaning ozone formation depends on these gases. The constraint was to remove any negative value together with its whole row, so if any of the O3, NO, NO2 or CO values is invalid, the whole row is removed and not counted. The data also contained some string entries, which had to be removed as well. All of that was done. Then I applied the MLP classifier from scikit-learn. Here is the code I wrote:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neural_network import MLPClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
bugs = ['NOx', '* 43.3', '* 312', '11/19', '11/28', '06:00', '09/30', '09/04', '14:00', '06/25', '07:00', '06/02',
        '17:00', '04/10', '04/17', '18:00', '02/26', '02/03', '01:00', '11/23', '15:00', '11/12', '24:00', '09/02',
        '16:00', '09/28', '* 16.8', '* 121', '12:00', '06/24', '13:00', '06/26', 'Span', 'NoData', 'ppb', 'Zero',
        'Samp<', 'RS232']
dataset = pd.read_excel("Testing.xlsx")
dataset = pd.DataFrame(dataset).replace(bugs, 0)
dataset.dropna(subset=["O3"], inplace=True)
dataset.dropna(subset=["NO"], inplace=True)
dataset.dropna(subset=["NO2"], inplace=True)
dataset.dropna(subset=["CO"], inplace=True)
dataset.drop(dataset[dataset['O3'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['O3'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['O3'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['CO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['CO'] > 4000].index, inplace=True)
dataset.drop(dataset[dataset['CO'] == 0].index, inplace=True)
dataset = dataset.reset_index()
dataset = dataset.drop(['index'], axis=1)
feat = dataset[["NO", "NO2", "CO"]].astype(int)
label = dataset[["O3"]].astype(int)
X_train, X_test, y_train, y_test = train_test_split(feat, label, test_size=0.1)
# X_train = dataset.iloc[0:9200, 0:3].values.astype(int)
# y_train = dataset.iloc[0:9200, 3].values.astype(int)
# X_test = dataset.iloc[9200:9393, 0:3].values.astype(int)
# y_test = dataset.iloc[9200:9393, 3].values.astype(int)
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.fit_transform(X_test)
def accuracy(confusion_matrix): # <--==
    diagonal_sum = confusion_matrix.trace()
    sum_of_all_elements = confusion_matrix.sum()
    return diagonal_sum / sum_of_all_elements
classifier = MLPClassifier(hidden_layer_sizes=(250, 100, 10), max_iter=100000, activation='relu', solver='adam',
                           random_state=1)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
print(f"\n{X_test}\n ----> \nPredictions : \n{y_pred}\n{y_pred.shape}\n")
cm = confusion_matrix(y_pred, y_test)
print(f"\nAccuracy of MLP.Cl : {accuracy(cm)}\n")
print(accuracy_score(y_test, y_pred))
y_test = pd.DataFrame(y_test)
y_test = y_test.reset_index(0)
y_test = y_test.drop(['index'], axis=1)
y_test = y_test.head(100)
# y_test = y_test.drop([19,20],axis=0)
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred.shift(-1)
y_pred = y_pred.head(100)
# y_pred = y_pred.drop([19,20],axis=0)
plt.figure(figsize=(10, 5))
plt.plot(y_pred, color='r', label='PredictedO3')
plt.plot(y_test, color='g', label='OriginalO3')
plt.legend()
plt.show()
This is the code.
The plot was attached here.
Console output:
PyDev console:
[[-0.53939794 -0.59019756 -0.53257553]
[ 2.55576818 0.45245455 -0.7648624 ]
[-0.36744427 0.73681421 -0.30028866]
...
[-0.59671583 -0.02147823 1.81678204]
[-0.25280849 0.73681421 1.31145621]
[-0.53939794 0.64202766 0.18466113]]
---->
Predictions :
[15 39 45 40 42 11 14 32 23 23 21 23 3 15 23 59 15 34 12 10 42 23 12 8
14 3 8 42 12 61 36 13 11 20 12 10 14 42 12 20 9 5 14 11 20 14 10 85
42 73 43 23 61 85 55 13 14 20 85 32 15 15 42 42 12 23 13 23 85 8 23 11
36 32 20 12 27 35 55 17 15 23 12 44 42 17 23 45 35 23 3 11 23 12 60 11
15 39 15 44 49 7 35 42 45 13 12 55 42 18 42 6 23 14 60 43 16 18 10 43
85 20 23 88 8 20 26 23 53 45 16 4 48 27 3 61 15 7 23 6 40 12 44 12
12 4 12 13 24 24 23 15 16 13 40 12 12 10 12 15 53 12 42 45 38 23 45 17
12 30 12 45 60 65 12 52 4 35 3 15 11 23 40 42 18 23 45 45 49 43 35 62
46 14 21 11 6 24 23 16 23 21 45 42 85 39 12 16 10 38 43 6 23 20 11 65
14 14 14 45 24 18 85 60 15 10 16 14 23 10 17 6 13 42 4 7 17 51 23 3
85 42 23 55 21 15 32 14 17 12 42 18 16 8 6 10 14 12 42 15 14 43 25 12
14 15 85 20 42 23 43 32 18 12 42 35 6 47 12 20 12 6 51 8 20 45 40 43
12 14 44 23 23 21 15 45 24 12 23 23 42 15 12 46 35 8 14 16 42 11 42 16
13 61 60 25 26 16 45 10 17 5 43 21 26 12 49 12 42 11 38 48 21 45 9 48
11 20 13 23 16 21 11 12 44 55 11 16 53 45 8 17 12 9 85 56 7 23 23 26
12 42 42 51 17 23 43 52 24 12 29 11 21 42 16 6 20 18 16 8 14 15 13 43
10 23 16 15 42 43 23 11 14 25 47 16 24 14 7 43 45 14 5 18 51 42 20 15
39 32 12 44 13 51 12 43 42 23 42 17 11 12 11 42 12 5 35 51 23 51 14 9
11 34 18 21 88 21 15 15 6 49 12 51 8 12 49 8 4 17 15 6 26 3 15 43
14 5 23 15 88 21 85 11 23 25 45 14 12 65 45 27 48 42 12 14 44 45 4 44
40 16 23 25 15 10 20 12 15 62 6 13 20 20 11 56 12 40 11 14 25 6 25 12
40 85 40 85 43 11 14 32 11 8 6 8 23 12 26 18 60 18 51 40 13 51 12 8
23 45 20 4 23 11 3 12 51 11 18 12 40 14 40 7 85 44 60 85 45 14 14 14
11 55 18 16 45 13 23 51 11 14 23 18 14 7 40 23 15 32 12 12 23 42 49 88
11 11 42 6 25 12 6 11 18 6 13 35 8 15 42 39 23 9 23 32 20 21 12 20
20 38 7 12 42 8 13 17 55 60 16 39 18 42 42 12 60 14 16 40 9 18 85 40
5 14 23 45 10 24 14 25 11 17 15 42 42 15 23 15 8 34 16 60 42 14 48 51
11 6 51 15 42 12 42 20 12 25 26 25 45 26 40 48 23 45 23 21 11 17 48 12
12 6 15 34 10 16 18 17 13 20 45 3 9 39 12 11 15 23 42 45 45 65 51 6
45 15 15 17 51 8 51 34 14 17 13 38 38 21 18 51 55 16 9 44 42 6 42 17
6 25 88 11 10 48 20 40 21 12 44 27 47 42 38 15 49 12 12 12 6 12 8 16
42 9 20 18 23 18 12 13 20 16 14 12 23 10 60 18 25 23 43 21 12 12 10 61
21 40 6 16 45 38 12 17 12 15 32 9 38 17 14 11 6 15 14 6 48 21 13 13
15 36 3 45 25 29 24 16 8 10 27 21 20 51 10 16 21 12 20 23 46 23 3 34
29 15 23 15 48 42 17 42 43 15 35 34 23 23 44 23 4 35 12 42 49 36 15 18
15 14 11 18 16 20 15 25 9 43 51 45 12 15 39 21 51 18 24 26 17 9 42 44
12 30 32 8 20 44 52 20 23 23 15 12 12 42 8 5 42 23 21 16 24 65 16 12
38 36 43 60 15 7 85 15 26 42 40 11 12 23 12 20 40 23 42 6 23 52 16 20
23 45 51 9 42 42 25 6 21 23 15 8 12 12 26 11 16 15 39 8 26 43 48 47
12 48 12 11]
(940,)
and
Accuracy of MLP.Cl : 0.0425531914893617
0.0425531914893617
I can't get the right results, that is, the right predictions.
You are trying to predict a continuous value, which is a regression problem, not a classification one; consequently, MLPClassifier is the wrong model to apply here - the correct one being an MLPRegressor.
On top of this, accuracy is meaningful for classification problems only, and it is meaningless in regression ones, like yours here; so, after switching to the correct model, you should also use some other performance metric suitable for regression problems.
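A minimal sketch of what the switch to a regressor could look like. Everything about the data here is an assumption: the features and target are synthetic stand-ins for NO, NO2, CO and O3, and the layer sizes and iteration limit are placeholders, not tuned values. Mean absolute error and R² are used as two common regression metrics. The scaler is fitted on the training split only and then applied to the test split.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the asker's measurements): three features, one continuous target
rng = np.random.default_rng(0)
X = rng.uniform(1, 160, size=(500, 3))
y = 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Placeholder architecture; a regressor predicts continuous values directly
reg = MLPRegressor(hidden_layer_sizes=(50, 20), max_iter=2000, random_state=1)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Regression metrics instead of accuracy
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}  R^2: {r2:.3f}")
```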

Choosing the larger probability from a specific indexID

I have a database as follows:
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 clean 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 clean 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 clean 30
What I want to do is, for each repeated indexID, I would like to choose the entry that is of higher probability and mark that as clean and the other as dirty.
The output should look something like this:
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30
If you need a pandas solution, create a boolean mask by comparing the Probability column, using Series.ne (!=), with the maximum value per group. The per-group maximum comes from transform, because that returns a Series the same size as df:
mask = df['Probability'].ne(df.groupby('indexID')['Probability'].transform('max'))
df.loc[mask, 'userClean'] = 'dirty'
print (df)
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30
Detail:
print (df.groupby('indexID')['Probability'].transform('max'))
0 75
1 75
2 75
5 85
6 85
9 74
12 72
13 72
14 85
15 76
16 91
19 71
23 71
28 71
32 97
33 97
Name: Probability, dtype: int64
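The mask approach above can be reproduced with the sketch below. It rebuilds only the columns that matter for the logic (`indexID`, `userClean`, `Probability`) from the values in the question; the other columns are left out for brevity.

```python
import pandas as pd

# Values copied from the question; matchID and order are omitted for brevity
df = pd.DataFrame(
    {
        "indexID": [0, 0, 0, 3, 3, 4, 6, 6, 7, 9, 10, 13, 13, 13, 20, 20],
        "userClean": ["clean"] * 16,
        "Probability": [35, 75, 25, 40, 85, 74, 23, 72, 85, 76, 91, 27, 10, 71, 97, 30],
    }
)

# transform('max') broadcasts each group's maximum back to every row of that group
group_max = df.groupby("indexID")["Probability"].transform("max")

# Rows whose probability differs from their group's maximum are marked dirty
mask = df["Probability"].ne(group_max)
df.loc[mask, "userClean"] = "dirty"
print(df)
```

If two rows in the same group tie for the maximum, both stay clean with this mask.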
If you want to compare with the overall mean instead, using gt (>):
mask = df['Probability'].gt(df['Probability'].mean())
df.loc[mask, 'userClean'] = 'dirty'
print (df)
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 dirty 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 dirty 85
9 4 5 9 dirty 74
12 6 7 12 clean 23
13 6 8 13 dirty 72
14 7 8 14 dirty 85
15 9 10 15 dirty 76
16 10 11 16 dirty 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 dirty 71
32 20 21 32 dirty 97
33 20 22 33 clean 30

Reading csv file with delimiter | using pandas

def main():
    l=[]
    for i in range(1981,2018):
        df = pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/"+ str(i)+"/Population.Heating.txt")
        print(df[12:])
I am trying to download and read the "CONUS" row in Population.Heating.txt from 1981 to 2017.
My code seems to get the CONUS parts, but how can I actually read the file as CSV with | as the delimiter?
Thank you!
Try this:
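import pandas as pd

def main():
    url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
    for i in range(1981, 2018):
        # Skip the 3 header lines; the raw string avoids an invalid-escape warning for '\|'
        df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
        # Once the header is skipped the frame has only 10 rows, so df[12:] would be empty;
        # CONUS is the last row in the demo output below
        print(df.tail(1))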
Demo:
In [14]: url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
In [15]: i = 2017
In [16]: df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
In [17]: df
Out[17]:
Region 20170101 20170102 20170103 20170104 20170105 20170106 20170107 20170108 20170109 ... 20171222 20171223 \
0 1 30 36 31 25 37 39 47 51 55 ... 40 32
1 2 28 32 28 23 39 41 46 49 51 ... 31 25
2 3 34 30 26 43 52 58 57 54 44 ... 29 32
3 4 37 34 37 57 60 62 59 54 43 ... 39 45
4 5 15 11 9 10 20 21 27 36 33 ... 12 7
5 6 16 9 7 22 31 38 45 44 35 ... 9 9
6 7 8 5 9 23 23 34 37 32 17 ... 9 19
7 8 30 32 34 33 36 42 42 31 23 ... 36 33
8 9 25 25 24 23 22 25 23 15 17 ... 23 20
9 CONUS 24 23 21 26 33 38 40 39 34 ... 23 22
20171224 20171225 20171226 20171227 20171228 20171229 20171230 20171231
0 32 34 43 53 59 59 57 59
1 30 33 43 49 54 53 50 55
2 41 47 58 62 60 54 54 60
3 47 55 61 64 57 54 62 68
4 12 20 21 22 27 26 24 29
5 22 33 31 35 37 33 32 39
6 19 24 23 28 28 23 19 27
7 34 30 32 29 26 24 27 30
8 18 17 17 15 13 11 12 15
9 26 30 34 37 38 35 34 40
[10 rows x 366 columns]
def main():
    l=[]
    for i in range(1981,2018):
        l.append( pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/"+ str(i)+"/Population.Heating.txt"
                              , sep='|', skiprows=3))
Files look like:
Product: Daily Heating Degree Days
Regions: Regions::CensusDivisions
Weights: Population
[... data ...]
so you need to skip 3 rows. Afterwards the list 'l' holds one 'df' per year for further processing.
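To try the parsing without hitting the FTP server, here is a small offline sketch. The three header lines are the ones quoted above; the layout of the data lines (a `Region` column followed by date columns, no padding around `|`) is an assumption based on the demo output, and the sample is cut down to three rows and three dates.

```python
import io

import pandas as pd

# Offline stand-in for one file: 3 header lines, then pipe-delimited data.
# The exact layout of the data lines is assumed, not copied from the real file.
sample = """Product: Daily Heating Degree Days
Regions: Regions::CensusDivisions
Weights: Population
Region|20170101|20170102|20170103
1|30|36|31
2|28|32|28
CONUS|24|23|21
"""

# A single-character sep is taken literally, so '|' needs no escaping here
df = pd.read_csv(io.StringIO(sample), sep="|", skiprows=3)

# Select the CONUS row by label; strip() guards against padded values
conus = df[df["Region"].astype(str).str.strip() == "CONUS"]
print(conus)
```

Selecting by label rather than by row position keeps working if the number of regions in a file changes.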

How to add values by column into a Dataframe

I have a Dataframe with three columns store, hour, count. The problem I'm facing is some hours are missing for some stores and I want them to be 0.
This is what the dataframe looks like:
# store_id hour count
# 0 13 0 56
# 1 13 1 78
# 2 13 2 53
# 3 23 13 14
# 4 23 14 13
As you can see, the store with id 13 has no values for hours 3-23; similarly, store 23 has no values for many other hours.
I tried to solve this by creating a temporary dataframe with two columns, id and count, and performing a right outer join, but it didn't work.
If that is a typo and there are no duplicate hours per store, a solution is reindex with MultiIndex.from_product:
df = df.set_index(['store_id','hour'])
mux = pd.MultiIndex.from_product([df.index.levels[0], range(23)], names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
store_id hour count
0 13 0 56
1 13 1 78
2 13 2 53
3 13 3 0
4 13 4 0
5 13 5 0
6 13 6 0
7 13 7 0
8 13 8 0
9 13 9 0
10 13 10 0
11 13 11 0
12 13 12 0
13 13 13 0
14 13 14 0
15 13 15 0
16 13 16 0
17 13 17 0
18 13 18 0
19 13 19 0
20 13 20 0
21 13 21 0
22 13 22 0
23 23 0 0
24 23 1 0
25 23 2 0
26 23 3 0
27 23 4 0
28 23 5 0
29 23 6 0
30 23 7 0
31 23 8 0
32 23 9 0
33 23 10 0
34 23 11 0
35 23 12 0
36 23 13 14
37 23 14 13
38 23 15 0
39 23 16 0
40 23 17 0
41 23 18 0
42 23 19 0
43 23 20 0
44 23 21 0
45 23 22 0
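A runnable sketch of the reindex approach on the sample data from the question. It uses `range(24)` so that hours 0 through 23 are all covered (`range(23)` stops at hour 22); the names `indexed`, `full` and `filled` are illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "store_id": [13, 13, 13, 23, 23],
        "hour": [0, 1, 2, 13, 14],
        "count": [56, 78, 53, 14, 13],
    }
)

indexed = df.set_index(["store_id", "hour"])

# Every (store, hour) combination for hours 0..23
full = pd.MultiIndex.from_product(
    [indexed.index.levels[0], range(24)], names=indexed.index.names
)

# Existing pairs keep their count; pairs that were missing get 0
filled = indexed.reindex(full, fill_value=0).reset_index()
print(filled)
```

This relies on each (store_id, hour) pair appearing at most once, since reindex cannot be applied to an index with duplicates.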
Try this:
import pandas as pd

all_hours = set(range(24))
for sid in set(df['store_id']):
    # Hours that have no row yet for this store
    misshours = sorted(all_hours - set(df['hour'][df['store_id'] == sid]))
    nmiss = len(misshours)
    # Append zero-count rows for the missing hours
    df = pd.concat([df, pd.DataFrame({'store_id': nmiss * [sid], 'hour': misshours, 'count': nmiss * [0]})])
