I have a dataset with 8 mixed features (6 numeric and 2 categorical). Since the numeric values have different ranges, I will have to normalize the dataset as a whole to be able to perform further steps such as machine learning algorithms and dimensionality reduction (feature extraction).
My original dataset:
time v1 v2 v3 ... v7 v8
00:00:01 15435 0.7 13 ... High True
00:00:06 24356 3.6 23 ... High True
00:00:11 25567 8.3 82 ... LOW False
00:00:16 12345 5.4 110 ... LOW True
00:00:21 43246 1.7 93 ... High False
................................................
23:23:59 23456 3.8 45 ... LOW False
where v1 to v6 are numerical variables whose values lie in different ranges, as can be seen above. Moreover, v7 and v8 are categorical variables that each have only two possible values (for v7 {High, Low} and for v8 {True, False}).
I did label encoding for the categorical variables (v7 and v8), where High and True were encoded as 1 and Low and False were encoded as 0.
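For reference, this encoding can be reproduced with a simple mapping; a minimal sketch, assuming the frame is called df and that v7 and v8 are stored as strings:
df['v7'] = df['v7'].str.lower().map({'high': 1, 'low': 0})
df['v8'] = df['v8'].astype(str).str.lower().map({'true': 1, 'false': 0})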
The following illustrates how the dataset looks after the label encoding:
time v1 v2 v3 ... v7 v8
00:00:01 15435 0.7 13 ... 1 1
00:00:06 24356 3.6 23 ... 1 1
00:00:11 25567 8.3 82 ... 0 0
00:00:16 12345 5.4 110 ... 0 1
00:00:21 43246 1.7 93 ... 1 0
................................................
23:23:59 23456 3.8 45 ... 0 0
My question is as follows: it is easy to standardize the numerical features v1 to v6. However, I am not sure whether to standardize the categorical variables as well, and if so, what would be the best way to do it?
You can use UNIX time, for example:
import pandas as pd
import numpy as np
date = pd.date_range('1/1/2011', periods = 10, freq ='H')
df = pd.DataFrame({'date':date})
df['unix_time'] = df['date'].astype(np.int64) // 10**9
df
output:
date unix_time
0 2011-01-01 00:00:00 1293840000
1 2011-01-01 01:00:00 1293843600
2 2011-01-01 02:00:00 1293847200
3 2011-01-01 03:00:00 1293850800
4 2011-01-01 04:00:00 1293854400
5 2011-01-01 05:00:00 1293858000
6 2011-01-01 06:00:00 1293861600
7 2011-01-01 07:00:00 1293865200
8 2011-01-01 08:00:00 1293868800
9 2011-01-01 09:00:00 1293872400
Now your machine learning algorithms can compare dates, and you can also convert the values back:
pd.to_datetime(df['unix_time'], unit='s')
output:
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
Name: unix_time, dtype: datetime64[ns]
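If the time column contains only times of day, as in the question above, a similar trick is to convert each time to seconds since midnight. A minimal sketch, assuming the column holds strings such as '00:00:06':
df['seconds'] = pd.to_timedelta(df['time']).dt.total_seconds()  # e.g. '00:00:06' -> 6.0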
Normalization rescales the values to the range 0 to 1. Your categorical values are already in this range; you would only need to transform them if the cardinality were really high, so for now you can keep them as they are. I would also suggest normalizing your whole dataset. Then all the values will be in the same range and the algorithm will not erroneously learn anything by giving preference to a feature with higher numerical values. You can find both normalization and scaling in scikit-learn itself.
from sklearn import preprocessing
X=your_data
normalized_X = preprocessing.normalize(X)
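Note that preprocessing.normalize rescales each sample (row) to unit norm; if the goal is to rescale each feature to the 0 to 1 range described above, MinMaxScaler is the usual choice. A minimal sketch:
from sklearn.preprocessing import MinMaxScaler
X = your_data
scaled_X = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)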
Related question:
If I have a dataframe like this:
timestamp price
1596267946298 100.0
1596267946299 101.0
1596267946300 102.0
1596267948301 99.0
1596267948302 98.0
1596267949303 99.0
and I want to create the high, low and average during resampling:
I can duplicate the price column into a high and a low column, and then during resampling take the min, max and mean of the appropriate columns.
But I was wondering if there is a way to do this in one pass?
My expected output would be (let's assume resampling at 100 ms for this example):
timestamp price min mean max
1596267946298 100.0 100 100.5 101
1596267946299 101.0 100 100.5 101
1596267946300 102.0 98 99.5 102
1596267948301 99.0 98 99.5 102
1596267948302 98.0 98 99.5 102
1596267949303 99.0 98 99.5 102
You could do something like this:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
def custom_func(x):
    return x.iloc[-1], x.min(), x.max(), x.mean()

result = series.resample('3T').apply(custom_func)
print(pd.DataFrame(result.tolist(), columns=['resampled', 'min', 'max', 'mean'], index=result.index))
Before resampling
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
After resampling
resampled min max mean
2000-01-01 00:00:00 2 0 2 1.0
2000-01-01 00:03:00 5 3 5 4.0
2000-01-01 00:06:00 8 6 8 7.0
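To get closer to the expected output in the question (the per-window statistics repeated on every original row), one way is to aggregate per resample bucket and join the result back onto the rows. A sketch, assuming the timestamps are epoch milliseconds and 100 ms buckets (the exact window boundaries may differ from the example above):
df['bucket'] = pd.to_datetime(df['timestamp'], unit='ms').dt.floor('100ms')
stats = df.groupby('bucket')['price'].agg(['min', 'mean', 'max'])
df = df.join(stats, on='bucket').drop(columns='bucket')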
I am working on a regression problem where I have 12 sensor data (independent) columns and 1 output column, all sampled at 48 kHz. I have a total of 420 seconds of training data. In the test dataset, I have the 12 sensor data columns and need to predict the output.
So far, I have tried classical machine learning algorithms without considering the time feature. I am new to time series and not sure whether this is actually a time series forecasting problem.
I am not sure if I can treat this as a multivariate time series problem and try LSTMs/RNNs.
I have been following https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/#comment-442845 but am not able to understand how I can predict on the test data.
Do I need to append a new column to convert the test data from (length, 12) to (length, 13), and then predict row by row, feeding each output into the next iteration?
Also, is the above a correct approach for this kind of problem, or do I have to think about something else?
UPDATE
Updating my question based on the comments below.
Let's say my training data looks like the table below (headings updated just to explain better). I am training an LSTM network the same way as in the link above. I have created Y(t), Y(t-1), x1(t-1), x2(t-1), x3(t-1), x4(t-1), x5(t-1), x6(t-1) using the series_to_supervised function.
Y x1 x2 x3 x4 x5 x6
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2
Now, I have test data without the Y column.
As an example,
x1 x2 x3 x4 x5 x6
date
2010-01-02 00:00:00 -11 -6.0 1020.0 SE 1.79 0
2010-01-02 01:00:00 -12 -1.0 1020.0 SE 2.68 0
2010-01-02 02:00:00 -10 -4.0 1021.0 SE 3.57 0
2010-01-02 03:00:00 -7 -2.0 1022.0 SE 5.36 1
2010-01-02 04:00:00 -7 -5.0 1022.0 SE 6.25 2
What I have done: I have appended a fake Y column padded with zeros and replaced the first value with the mean of the training Y column. My idea is to use the value predicted at t-1 in the next prediction. I don't know how to do this easily, so I came up with the following logic.
Code snippet
# test_pd is a pandas DataFrame of size Nx6
# train_pd is a pandas DataFrame of size Nx5
test_pd['Y'] = 0
train_out_mean = train_pd[0].mean()
test_pd[0][0] = train_out_mean
test_pd = test_pd.values.reshape((test_pd.shape[0], 1, test_pd.shape[1]))
out_list = list()
out_list.append(train_out_mean)
for i in range(test_pd.shape[0]):
    y = loaded_model.predict(test_pd[i].reshape(1, test_pd.shape[1], test_pd.shape[2]))
    y = y[0]
    out_list.append(y)
    if (i + 1 >= test_pd.shape[0]):
        break
    test_pd[i + 1][0][0] = y
I have two follow-up questions.
Is the above approach theoretically correct for solving the problem?
If yes, is there any better way to predict on the test dataset?
I would consider starting with a simpler approach before going for more complex algorithms like an LSTM.
Here on Stack Overflow you should ask objective questions about code, so if you share some of your code here, we can try to help you.
Considering that you have a time series like this (the example from your link):
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
simpler approach: MLP Regressor
In a simpler approach, assuming you wanted to predict the pollution, you can build an MLPRegressor. During the training phase, you separate the data into 7 features (dew, temp, press, wnd_dir, wnd_spd, snow, rain) used to predict the pollution. Here is an example:
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn import metrics

# dataset is the pollution DataFrame shown above (from the linked post)
data = dataset.values
# integer encode wind direction
encoder = LabelEncoder()
data[:, 4] = encoder.fit_transform(data[:, 4])
# rescale all columns to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)
# first column (pollution) is the target, the remaining columns are the features
y, X = np.split(scaled, [1], axis=1)
mlp = MLPRegressor(learning_rate_init=0.001)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
mlp.fit(X_train, y_train)
y_prediction = mlp.predict(X_test)
print("R2 score:", metrics.r2_score(y_test, y_prediction))
Output:
R2 score: 0.30376681842945985
In an LSTM (you need 3D input: [samples, timesteps, features])
Now, suppose that some feature (wind, air pressure, etc.) at moment t-1, t-2 (1 hour, 2 hours ago) has some influence on moment t. So now you intend to treat your problem as a time series, capturing the evolution of, say, wind speed over some time. Now it makes sense to use an LSTM.
So, the function series_to_supervised (from the example in your link) will help you create these lagged features; a condensed sketch of it follows the argument list below.
The function series_to_supervised has 4 arguments:
data: Sequence of observations as a list or 2D NumPy array.
n_in: Number of lag observations as input (X). Values may be between [1..len(data)]
n_out: Number of observations as output (y). Values may be between [0..len(data)-1].
dropnan: Boolean whether or not to drop rows with NaN values
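A condensed sketch of what that function does (written from the description above; not a verbatim copy of the code in the linked post):
import pandas as pd

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    df = pd.DataFrame(data)
    cols, names = [], []
    # lagged inputs: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var{}(t-{})'.format(j + 1, i) for j in range(df.shape[1])]
    # current and future outputs: t, t+1, ..., t+n_out-1
    for i in range(n_out):
        cols.append(df.shift(-i))
        names += ['var{}(t{})'.format(j + 1, '' if i == 0 else '+{}'.format(i)) for j in range(df.shape[1])]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg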
So, suppose this series with only one feature X and the label y:
X y
2018-01-01 00:00:00 1 2
2018-01-01 01:00:00 2 3
2018-01-01 02:00:00 3 4
2018-01-01 03:00:00 4 5
2018-01-01 04:00:00 5 6
2018-01-01 05:00:00 6 7
2018-01-01 06:00:00 7 8
2018-01-01 07:00:00 8 9
2018-01-01 08:00:00 9 10
2018-01-01 09:00:00 10 11
Using series_to_supervised(df.values, n_in=2, n_out=1, dropnan=False) you will get something like this (I renamed the columns to make it easier to follow):
X(t-2) y(t-2) X(t-1) y(t-1) X(t) y(t)
2018-01-01 00:00:00 NaN NaN NaN NaN 1 2
2018-01-01 01:00:00 NaN NaN 1.0 2.0 2 3
2018-01-01 02:00:00 1.0 2.0 2.0 3.0 3 4
2018-01-01 03:00:00 2.0 3.0 3.0 4.0 4 5
2018-01-01 04:00:00 3.0 4.0 4.0 5.0 5 6
2018-01-01 05:00:00 4.0 5.0 5.0 6.0 6 7
2018-01-01 06:00:00 5.0 6.0 6.0 7.0 7 8
2018-01-01 07:00:00 6.0 7.0 7.0 8.0 8 9
2018-01-01 08:00:00 7.0 8.0 8.0 9.0 9 10
2018-01-01 09:00:00 8.0 9.0 9.0 10.0 10 11
So, in this approach we assume that we need at least two past records, X(t-2, t-1) and y(t-2, t-1), to predict the future value y(t).
Why do you need to do this? Now I think I can start answering your question. For an LSTM you need to transform your data from 2D into 3D.
So, after that you need to reshape the input to 3D [samples, timesteps, features] before using an LSTM. Transforming your data with this function is just the preparation.
Answering your question: you don't need to append just one column. You need to transform your data so that you have new features at t-n, ..., t-3, t-2, t-1 in order to predict some feature at time t.
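As a minimal sketch of that reshape step (the lag count and feature count below are assumptions for illustration, not values from the question):
import numpy as np

# 'supervised' is assumed to hold the lagged input columns followed by the
# current target in its last column (other columns already dropped)
n_lags, n_features = 1, 7  # e.g. one lag of the 7 input features
values = supervised.values
X, y = values[:, :n_lags * n_features], values[:, -1]
# reshape to the 3D [samples, timesteps, features] layout an LSTM expects
X = X.reshape((X.shape[0], n_lags, n_features))
print(X.shape)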
I recommend you follow the steps of the pollution case (the blog post you cited) first, before trying to adapt them to your own case.
I want to apply some statistics on records within a time window with an offset. My data looks something like this:
lon lat stat ... speed course head
ts ...
2016-09-30 22:00:33.272 5.41463 53.173161 15 ... 0.0 0.0 511
2016-09-30 22:01:42.879 5.41459 53.173180 15 ... 0.0 0.0 511
2016-09-30 22:02:42.879 5.41461 53.173161 15 ... 0.0 0.0 511
2016-09-30 22:03:44.051 5.41464 53.173168 15 ... 0.0 0.0 511
2016-09-30 22:04:53.013 5.41462 53.173141 15 ... 0.0 0.0 511
[5 rows x 7 columns]
I need the records within time windows of 600 seconds, with steps of 300 seconds. For example, these windows:
start end
2016-09-30 22:00:00.000 2016-09-30 22:10:00.000
2016-09-30 22:05:00.000 2016-09-30 22:15:00.000
2016-09-30 22:10:00.000 2016-09-30 22:20:00.000
I have looked at Pandas rolling to do this. But it seems like it does not have the option to add the offset which I described above. Am I overlooking something, or should I create a custom function for this?
What you want to achieve should be possible by combining DataFrame.resample with DataFrame.shift.
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
df = pd.DataFrame(series)
That will give you a basic time series (example taken from the DataFrame.resample API docs).
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Now resample by your step size (see DataFrame.resample).
sampled = df.resample('90s').sum()
This will give you non-overlapping windows of the step size.
2000-01-01 00:00:00 1
2000-01-01 00:01:30 2
2000-01-01 00:03:00 7
2000-01-01 00:04:30 5
2000-01-01 00:06:00 13
2000-01-01 00:07:30 8
Finally, shift the resampled df by one step and add it to the unshifted version. This works because the window size is exactly twice the step size.
sampled.shift(1, fill_value=0) + sampled
This will yield:
2000-01-01 00:00:00 1
2000-01-01 00:01:30 3
2000-01-01 00:03:00 9
2000-01-01 00:04:30 12
2000-01-01 00:06:00 18
2000-01-01 00:07:30 21
There may be a more elegant solution, but I hope this helps.
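Another option that also works for non-additive statistics (medians, for example) is to loop over the window start times explicitly. A rough sketch, assuming the frame from the question is indexed by ts and that the statistic of interest is the mean of the speed column:
window, step = pd.Timedelta('600s'), '300s'
starts = pd.date_range(df.index.min().floor(step), df.index.max(), freq=step)
stats = pd.Series({s: df.loc[(df.index >= s) & (df.index < s + window), 'speed'].mean() for s in starts})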
All, I'm a newbie to Python and am stuck on the problem below. I have a DF like this:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2013-01-02 93
5 2017-02-01 43
For the yearly gaps, i.e. 2012, 2014, 2015, and 2016, I'd like to fill in each missing year using the New Year's date for that year and the port_id from the previous year. Ideally, I'd like:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2012-01-01 76
5 2013-01-02 93
6 2014-01-01 93
7 2015-01-01 93
8 2016-01-01 93
9 2017-02-01 43
I tried multiple approaches but to no avail. Could some expert shed some light on how to make it work? Thanks much in advance!
You can use set.difference with range to find missing dates and then append a dataframe:
# convert to datetime if not already converted
df['asofdate'] = pd.to_datetime(df['asofdate'])
# calculate missing years
years = df['asofdate'].dt.year
missing = set(range(years.min(), years.max())) - set(years)
# append dataframe, sort and front-fill
df = df.append(pd.DataFrame({'asofdate': pd.to_datetime(list(missing), format='%Y')}))\
       .sort_values('asofdate')\
       .ffill()
print(df)
asofdate port_id
1 2010-01-01 76.0
2 2010-04-01 43.0
3 2011-02-01 76.0
1 2012-01-01 76.0
4 2013-01-02 93.0
2 2014-01-01 93.0
3 2015-01-01 93.0
0 2016-01-01 93.0
5 2017-02-01 43.0
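Note that DataFrame.append has been removed in recent pandas versions; an equivalent using pd.concat would look roughly like this:
missing_df = pd.DataFrame({'asofdate': pd.to_datetime([str(y) for y in missing], format='%Y')})
df = pd.concat([df, missing_df]).sort_values('asofdate').ffill()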
I would create a helper dataframe, containing all the year start dates, then filter out the ones where the years match what is in df, and finally merge them together:
# First make sure it is proper datetime
df['asofdate'] = pd.to_datetime(df.asofdate)
# Create your temporary dataframe of year start dates
helper = pd.DataFrame({'asofdate':pd.date_range(df.asofdate.min(), df.asofdate.max(), freq='YS')})
# Filter out the rows where the year is already in df
helper = helper[~helper.asofdate.dt.year.isin(df.asofdate.dt.year)]
# Merge back in to df, sort, and forward fill
new_df = df.merge(helper, how='outer').sort_values('asofdate').ffill()
>>> new_df
asofdate port_id
0 2010-01-01 76.0
1 2010-04-01 43.0
2 2011-02-01 76.0
5 2012-01-01 76.0
3 2013-01-02 93.0
6 2014-01-01 93.0
7 2015-01-01 93.0
8 2016-01-01 93.0
4 2017-02-01 43.0
I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days. But if there are fewer than 5 data points, return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var', window=30, method='median'):
    # window = number of past days
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) & (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return np.nan
        if method == 'median':
            return np.median(pastVar.values)
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return df
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data, there are gaps in time and possibly multiple data points in one day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, here is my current code's running speed:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run this on a large dataframe from time to time, so I want to optimize this code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time aware rolling. It can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling(
    '30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
You can try rolling_median, an O(N log(window)) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
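Note that pd.rolling_median was removed in later pandas versions; the equivalent through the rolling accessor would be the following (keeping in mind that window=30 counts 30 rows, not 30 calendar days):
df['var'].rolling(window=30, min_periods=5).median()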