Python : do a linear regression on time series


I would like to do a linear regression on a wave time series.
I have a dataframe with a date column (DateHeure) and a column with my wave height (it contains some NaN values). My problem is that I can't manage to plot the fit with the dates on the x-axis: either the plot fails or the line doesn't fit the data. I know the problem is in x, but I don't know how to fix it.
My current script:
CANDHIS =
DateHeure H13D
0 2017-01-01 00:00:00 1.7
1 2017-01-01 01:00:00 1.72
2 2017-01-01 02:00:00 2.04
3 2017-01-01 03:00:00 2.44
4 2017-01-01 04:00:00 nan
5 2017-01-01 05:00:00 2.51
6 2017-01-01 06:00:00 2.25
7 2017-01-01 07:00:00 2.28
8 2017-01-01 08:00:00 1.97
9 2017-01-01 09:00:00 1.95
10 2017-01-01 10:00:00 1.84
import numpy as np
import pandas as pd
import scipy.stats as sp
import matplotlib.pyplot as plt
CANDHIS.set_index('DateHeure', inplace=True)
y=np.array(CANDHIS['H13D'].dropna().values, dtype=float)
x=np.array(pd.to_datetime(CANDHIS["H13D"].dropna().index.values), dtype=float)
slope, intercept, r_value, p_value, std_err =sp.linregress(x,y)
xf = np.linspace(min(x),max(x),100)
xf1 = xf.copy()
xf1 = pd.to_datetime(xf1)
yf = (slope*xf)+intercept
print('r = ', r_value, '\n', 'p = ', p_value, '\n', 's = ', std_err)
f, ax = plt.subplots(1, 1)
ax.plot(xf1, yf,label='Linear fit', lw=3)
CANDHIS['H13D'].dropna().plot(marker='o', ls='')
plt.ylabel('Hauteurs significatives')
ax.legend()
I know this is probably a stupid question, but I'm still searching and I'm still lost.
Thanks
I would like to do a linear regression on a wave time series and keep the dates on the x-axis.
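One possible way to keep the dates on the x-axis is to regress on elapsed time rather than on raw timestamps, and then attach the fitted values back to the original dates. The sketch below is only an illustration under that assumption: it rebuilds the sample data from the question, uses np.polyfit in place of the linregress call, and the names candhis, valid, hours and fit are made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data shaped like the CANDHIS frame in the question
dates = pd.date_range("2017-01-01", periods=11, freq="h")
heights = [1.7, 1.72, 2.04, 2.44, np.nan, 2.51, 2.25, 2.28, 1.97, 1.95, 1.84]
candhis = pd.DataFrame({"H13D": heights}, index=dates)
candhis.index.name = "DateHeure"

# Drop the missing heights once, so x and y stay aligned
valid = candhis["H13D"].dropna()

# Use hours elapsed since the first observation as the regressor;
# this keeps x small instead of using nanoseconds since the epoch
hours = np.asarray((valid.index - valid.index[0]) / pd.Timedelta(hours=1), dtype=float)

# Degree-1 least-squares fit: returns slope (per hour) and intercept
slope, intercept = np.polyfit(hours, valid.to_numpy(), deg=1)

# Fitted values indexed by the original dates, ready to plot
fit = pd.Series(slope * hours + intercept, index=valid.index, name="linear_fit")
print(fit.head())
```

Because fit carries the same DatetimeIndex as the data, plotting valid and fit on the same axes (for example with their .plot() methods) should put dates on the x-axis without any manual conversion back from floats.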

Related

Getting the max value and the time the max value occurs for all periods in a pandas df

I have a pandas dataframe which looks like this:
Concentr 1 Concentr 2 Time
0 25.4 0.48 00:01:00
1 26.5 0.49 00:02:00
2 25.2 0.52 00:03:00
3 23.7 0.49 00:04:00
4 23.8 0.55 00:05:00
5 24.6 0.53 00:06:00
6 26.3 0.57 00:07:00
7 27.1 0.59 00:08:00
8 28.8 0.56 00:09:00
9 23.9 0.54 00:10:00
10 25.6 0.49 00:11:00
11 27.5 0.56 00:12:00
12 26.3 0.55 00:13:00
13 25.3 0.54 00:14:00
and I want to keep the max value of Concentr 1 in every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example I would like to have:
Concentr 1 Concentr 2 Time
0 26.5 0.49 00:02:00
1 28.8 0.56 00:09:00
2 27.5 0.56 00:12:00
My current approach would be i) to create an auxiliary variable with an ID for each 5-minute interval, e.g. 00:00 to 00:05 is interval 1, 00:05 to 00:10 is interval 2, etc., ii) to use the interval variable in a groupby to get the max Concentr 1 per interval, and iii) to merge back to the initial df on both the interval variable and Concentr 1, thus identifying the corresponding time.
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.

You can do a regular resample / groupby and use the idxmax method to get the index of the desired row in each group. Then use that to index your original data:
>>> df.loc[df.resample('5min', on='Time')['Concentr 1'].idxmax()]
Concentr 1 Concentr 2 Time
1 26.5 0.49 2021-10-09 00:02:00
8 28.8 0.56 2021-10-09 00:09:00
11 27.5 0.56 2021-10-09 00:12:00
This assumes your 'Time' column is datetime-like, which I did with pd.to_datetime. You can convert the time column back to strings with strftime. So in full:
# '5min' is the current spelling of the older '5T' frequency alias
df['Time'] = pd.to_datetime(df['Time'])
result = df.loc[df.resample('5min', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
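For a self-contained check of this idea, the sample data from the question can be rebuilt and grouped into 5-minute bins. This is only a sketch: it swaps in groupby with pd.Grouper for the resample call above (a closely related way to bin by time), and the explicit time format is an assumption about how the Time strings are written.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Concentr 1": [25.4, 26.5, 25.2, 23.7, 23.8, 24.6, 26.3,
                   27.1, 28.8, 23.9, 25.6, 27.5, 26.3, 25.3],
    "Concentr 2": [0.48, 0.49, 0.52, 0.49, 0.55, 0.53, 0.57,
                   0.59, 0.56, 0.54, 0.49, 0.56, 0.55, 0.54],
    "Time": [f"00:{m:02d}:00" for m in range(1, 15)],
})

# Parse the time strings so they can be binned by time
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S")

# For each 5-minute bin, get the row label where "Concentr 1" is largest
rows = df.groupby(pd.Grouper(key="Time", freq="5min"))["Concentr 1"].idxmax()

# Pull those rows out of the original frame and restore the time strings
result = df.loc[rows].copy()
result["Time"] = result["Time"].dt.strftime("%H:%M:%S")
print(result)
```

Using idxmax rather than max keeps the whole row, so the matching Concentr 2 value and time come along without a merge.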

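# assumes 'Time' has already been converted with pd.to_datetime
df = df.set_index('Time')
# for each 5-minute bin, get the Time label at which 'Concentr 1' is largest
idx = df.resample('5min')['Concentr 1'].agg(lambda s: s.idxmax())
df = df.loc[idx]
Then you would probably need to call reset_index() if you do not want Time to be your index.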

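You can also use this:
Group the rows in blocks of n=5 and filter the original df on the index of the maximum "Concentr 1" in each block (this groups by row position, so it assumes one row per minute and a default integer index):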
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00

How can I create extra columns during resampling a Pandas data frame?

If I have a dataframe like this:
timestamp price
1596267946298 100.0
1596267946299 101.0
1596267946300 102.0
1596267948301 99.0
1596267948302 98.0
1596267949303 99.0
and I want to create the high, low and average during resampling:
I can duplicate the price column into a high and a low column, and then during the resample apply min, max and mean to the appropriate columns.
But I was wondering if there is a way to do this in one pass?
My expected output would be (let's assume resampling at 100 ms for this example):
timestamp price min mean max
1596267946298 100.0 100 100.5 101
1596267946299 101.0 100 100.5 101
1596267946300 102.0 98 99.5 102
1596267948301 99.0 98 99.5 102
1596267948302 98.0 98 99.5 102
1596267949303 99.0 98 99.5 102

You could do something like this:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='min')
series = pd.Series(range(9), index=index)
def custom_func(x):
    # last value, minimum, maximum and mean of each resampling window
    return x.iloc[-1], x.min(), x.max(), x.mean()
result = series.resample('3min').apply(custom_func)
print(pd.DataFrame(result.tolist(), columns=['resampled', 'min', 'max', 'mean'], index=result.index))
Before resampling
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
After resampling
resampled min max mean
2000-01-01 00:00:00 2 0 2 1.0
2000-01-01 00:03:00 5 3 5 4.0
2000-01-01 00:06:00 8 6 8 7.0
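As a possible alternative to a custom function, several aggregations can be requested in a single resample pass by handing a list of names to agg. The sketch below applies that to the millisecond timestamps from the question. The column name time and the choice to drop empty 100 ms bins are assumptions made for the example.

```python
import pandas as pd

# Data from the question: epoch timestamps in milliseconds
df = pd.DataFrame({
    "timestamp": [1596267946298, 1596267946299, 1596267946300,
                  1596267948301, 1596267948302, 1596267949303],
    "price": [100.0, 101.0, 102.0, 99.0, 98.0, 99.0],
})

# Convert the integer timestamps to datetimes so they can be resampled
df["time"] = pd.to_datetime(df["timestamp"], unit="ms")

# One pass over 100 ms bins, computing three statistics at once;
# bins that contain no rows come back as NaN and are dropped here
stats = (
    df.resample("100ms", on="time")["price"]
      .agg(["min", "mean", "max"])
      .dropna()
)
print(stats)
```

The result has one row per non-empty bin and one column per statistic, so no duplicated price columns are needed.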

Multiple regression on Time Series sensor data

I am working on a regression problem where I have 12 sensor data columns (independent variables) and 1 output column, all sampled at 48 kHz. I have a total of 420 seconds of training data. In the test dataset, I have the 12 sensor data columns and need to predict the output.
So far, I have tried classical machine learning algorithms without considering the time feature. I am new to time series and not sure whether this is actually a time series forecasting problem.
I am not sure if I can treat this as a multivariate time series problem and try LSTMs/RNNs.
I have been following https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/#comment-442845 but I am not able to understand how I can predict on test data.
Do I need to append a new column to convert the test data from (length, 12) to (length, 13), then predict row by row and feed each output into the next iteration?
Also, is the above a correct approach to this kind of problem, or do I have to think about something else?
UPDATE
Updating my question based on the comments below.
Let's say my train data looks like the table below (I changed the headings just to explain better). I am training an LSTM network the same way as in the link above. I have created Y(t), Y(t-1), x1(t-1), x2(t-1), x3(t-1), x4(t-1), x5(t-1), x6(t-1) using the series_to_supervised function.
Y x1 x2 x3 x4 x5 x6
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2
Now, I have test data without Y column.
As an example,
x1 x2 x3 x4 x5 x6
date
2010-01-02 00:00:00 -11 -6.0 1020.0 SE 1.79 0
2010-01-02 01:00:00 -12 -1.0 1020.0 SE 2.68 0
2010-01-02 02:00:00 -10 -4.0 1021.0 SE 3.57 0
2010-01-02 03:00:00 -7 -2.0 1022.0 SE 5.36 1
2010-01-02 04:00:00 -7 -5.0 1022.0 SE 6.25 2
What I have done: I appended a fake Y column padded with 0 and replaced its first value with the mean of the train Y column. My idea is to use the value predicted at t-1 in the next prediction. I don't know how to do this easily, so I came up with the following logic.
Code snippet
# test_pd is a pandas DataFrame of size Nx6
# train_pd is a pandas DataFrame of size Nx5
test_pd['Y'] = 0
train_out_mean = train_pd[0].mean()
test_pd[0][0] = train_out_mean
test_pd = test_pd.values.reshape((test_pd.shape[0],1,test_pd.shape[1]))
out_list = list()
out_list.append(train_out_mean)
for i in range(test_pd.shape[0]):
    y = loaded_model.predict(test_pd[i].reshape(1,test_pd.shape[1],test_pd.shape[2]))
    y = y[0]
    out_list.append(y)
    if (i+1>=test_pd.shape[0]):
        break
    test_pd[i+1][0][0] = y
I have two follow-up questions:
Is the above approach theoretically correct for this problem?
If yes, is there a better way to predict on the test dataset?

I would consider starting with a simpler approach before moving to a more complex algorithm like an LSTM.
On Stack Overflow, questions should be about a concrete problem with code, so if you share some of your code here, we can try to help you.
Suppose you have a time series like this (the example from your link):
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
simpler approach: MLP Regressor
In a simpler approach, assuming you want to predict the pollution, you can build an MLP regressor. For training, you separate the data into 7 features (dew, temp, press, wnd_dir, wnd_spd, snow, rain) that are used to predict the pollution. Here is an example:
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn import metrics
# dataset is the DataFrame shown above (pollution first, then the 7 features)
data = dataset.values
# integer encode WIND direction
encoder = LabelEncoder()
data[:,4] = encoder.fit_transform(data[:,4])
# scale every column to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)
# the first column is the target, the remaining columns are the features
y, X = np.split(scaled,[1],axis=1)
mlp = MLPRegressor(learning_rate_init=0.001)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
mlp.fit(X_train,y_train.ravel())
y_prediction = mlp.predict(X_test)
print("R2 score:", metrics.r2_score(y_test, y_prediction))
Output of one run (the exact score depends on the random split and on the scaling):
R2 score: 0.30376681842945985
With an LSTM (you need 3D input: [samples, timesteps, features])
Now suppose that some feature (wind, air pressure, etc.) at time t-1 or t-2 (1 hour or 2 hours earlier) has some influence on time t. You then want to treat your problem as a time series, capturing how (for example) wind speed evolves over time. In that case it makes sense to use an LSTM.
The function series_to_supervised (from the example in your link) will help you create these new lagged features.
The function series_to_supervised has 4 arguments:
data: Sequence of observations as a list or 2D NumPy array.
n_in: Number of lag observations as input (X). Values may be between [1..len(data)]
n_out: Number of observations as output (y). Values may be between [0..len(data)-1].
dropnan: Boolean whether or not to drop rows with NaN values
So, suppose this series has only one feature X and the label y:
X y
2018-01-01 00:00:00 1 2
2018-01-01 01:00:00 2 3
2018-01-01 02:00:00 3 4
2018-01-01 03:00:00 4 5
2018-01-01 04:00:00 5 6
2018-01-01 05:00:00 6 7
2018-01-01 06:00:00 7 8
2018-01-01 07:00:00 8 9
2018-01-01 08:00:00 9 10
2018-01-01 09:00:00 10 11
Calling series_to_supervised(df.values, n_in=2, n_out=1, dropnan=False) gives something like this (I renamed the columns to make it easier to read):
X(t-2) y(t-2) X(t-1) y(t-1) X(t) y(t)
2018-01-01 00:00:00 NaN NaN NaN NaN 1 2
2018-01-01 01:00:00 NaN NaN 1.0 2.0 2 3
2018-01-01 02:00:00 1.0 2.0 2.0 3.0 3 4
2018-01-01 03:00:00 2.0 3.0 3.0 4.0 4 5
2018-01-01 04:00:00 3.0 4.0 4.0 5.0 5 6
2018-01-01 05:00:00 4.0 5.0 5.0 6.0 6 7
2018-01-01 06:00:00 5.0 6.0 6.0 7.0 7 8
2018-01-01 07:00:00 6.0 7.0 7.0 8.0 8 9
2018-01-01 08:00:00 7.0 8.0 8.0 9.0 9 10
2018-01-01 09:00:00 8.0 9.0 9.0 10.0 10 11
So, in this approach, we need at least the two previous records, X(t-2), X(t-1) and y(t-2), y(t-1), to predict y(t).
Why do you need to do this? This is where I start answering your question. For an LSTM, you need to transform your data from 2D into 3D.
So, after that, you need to reshape the input to 3D [samples, timesteps, features] before using an LSTM. Transforming your data with this function is just the preparation step.
To answer your question: you don't need to append just one column. You need to transform your data so that it has new features at t-n, ..., t-3, t-2, t-1 to predict a feature at t.
I recommend following the steps of the pollution example (the one you cited) on that blog first, before trying to adapt it to your case.
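The lagging and reshaping steps described above can be sketched with plain pandas and NumPy. This is not the blog's series_to_supervised function. make_lagged is a hypothetical helper written only for this illustration, and it assumes two lag steps and the small X/y series from the example.

```python
import numpy as np
import pandas as pd

def make_lagged(df, n_in=2):
    """Hypothetical helper: add columns for t-n_in ... t-1 next to the values at t."""
    frames = []
    for lag in range(n_in, 0, -1):
        shifted = df.shift(lag)
        shifted.columns = [f"{col}(t-{lag})" for col in df.columns]
        frames.append(shifted)
    current = df.copy()
    current.columns = [f"{col}(t)" for col in df.columns]
    frames.append(current)
    # The first n_in rows have no history, so they contain NaN and are dropped
    return pd.concat(frames, axis=1).dropna()

# The small example series used above: one feature X and the label y
index = pd.date_range("2018-01-01", periods=10, freq="h")
df = pd.DataFrame({"X": range(1, 11), "y": range(2, 12)}, index=index)

n_in = 2
lagged = make_lagged(df, n_in=n_in)

# Inputs are the lagged columns, the target is y at time t
lag_columns = [col for col in lagged.columns if "(t-" in col]
inputs = lagged[lag_columns].to_numpy(dtype=float)
target = lagged["y(t)"].to_numpy(dtype=float)

# Reshape 2D (samples, n_in * features) into 3D (samples, timesteps, features)
n_features = df.shape[1]
inputs_3d = inputs.reshape(len(inputs), n_in, n_features)
print(inputs_3d.shape, target.shape)
```

The reshape works because the lagged columns are ordered oldest step first, so each sample unfolds into one row per time step with X and y side by side.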

Plotting pandas DataFrame with matplotlib

Here is a sample of the code I am using which works perfectly well..
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Data
df=pd.DataFrame({'x': np.arange(10), 'y1': np.random.randn(10), 'y2': np.random.randn(10)+
range(1,11), 'y3': np.random.randn(10)+range(11,21) })
print(df)
# multiple line plot
plt.plot( 'x', 'y1', data=df, marker='o', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4)
plt.plot( 'x', 'y2', data=df, marker='', color='olive', linewidth=2)
plt.plot( 'x', 'y3', data=df, marker='', color='olive', linewidth=2, linestyle='dashed', label="y3")
plt.legend()
plt.show()
The values in column 'x' actually refer to a 10-hour period of the day, with 0 standing for 6 AM, 1 for 7 AM, and so on. Is there any way I could replace those values on the x-axis of my figure with the times, e.g. replace 0 with 6 AM?

It's always a good idea to store time or datetime information as Pandas datetime datatype.
In your example, if you only want to keep the time information:
df['time'] = (df.x + 6) * pd.Timedelta(1, unit='h')
Output
x y1 y2 y3 time
0 0 -0.523190 1.681115 11.194223 06:00:00
1 1 -1.050002 1.727412 13.360231 07:00:00
2 2 0.284060 4.909793 11.377206 08:00:00
3 3 0.960851 2.702884 14.054678 09:00:00
4 4 -0.392999 5.507870 15.594092 10:00:00
5 5 -0.999188 5.581492 15.942648 11:00:00
6 6 -0.555095 6.139786 17.808850 12:00:00
7 7 -0.074643 7.963490 18.486967 13:00:00
8 8 0.445099 7.301115 19.005115 14:00:00
9 9 -0.214138 9.194626 20.432349 15:00:00
If you have a starting date:
start_date='2018-07-29' # change this date appropriately
df['datetime'] = pd.to_datetime(start_date) + (df.x + 6) * pd.Timedelta(1, unit='h')
Output
x y1 y2 y3 time datetime
0 0 -0.523190 1.681115 11.194223 06:00:00 2018-07-29 06:00:00
1 1 -1.050002 1.727412 13.360231 07:00:00 2018-07-29 07:00:00
2 2 0.284060 4.909793 11.377206 08:00:00 2018-07-29 08:00:00
3 3 0.960851 2.702884 14.054678 09:00:00 2018-07-29 09:00:00
4 4 -0.392999 5.507870 15.594092 10:00:00 2018-07-29 10:00:00
5 5 -0.999188 5.581492 15.942648 11:00:00 2018-07-29 11:00:00
6 6 -0.555095 6.139786 17.808850 12:00:00 2018-07-29 12:00:00
7 7 -0.074643 7.963490 18.486967 13:00:00 2018-07-29 13:00:00
8 8 0.445099 7.301115 19.005115 14:00:00 2018-07-29 14:00:00
9 9 -0.214138 9.194626 20.432349 15:00:00 2018-07-29 15:00:00
Now the time / datetime columns have a special datatype:
print(df.dtypes)
Out[5]:
x int32
y1 float64
y2 float64
y3 float64
time timedelta64[ns]
datetime datetime64[ns]
dtype: object
These types have a lot of nice properties, including automatic string formatting, which you will find very useful in later parts of your project.
Finally, to plot using matplotlib:
# multiple line plot
plt.plot( df.datetime.dt.hour, df['y1'], marker='o', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4)
plt.plot( df.datetime.dt.hour, df['y2'], marker='', color='olive', linewidth=2)
plt.plot( df.datetime.dt.hour, df['y3'], marker='', color='olive', linewidth=2, linestyle='dashed', label="y3")
plt.legend()
plt.show()
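If the goal is tick labels such as "6 AM" rather than plain hour numbers, one option is to build the label strings from the datetime values and pass them to matplotlib as tick labels. The sketch below only builds the labels. The start date is arbitrary, the names times and labels are illustrative, and the AM/PM text produced by %p can depend on the locale.

```python
import numpy as np
import pandas as pd

# x = 0..9 stands for the hours 6 AM .. 3 PM, as in the question
x = np.arange(10)

# Turn each offset into a real timestamp (the date itself is arbitrary)
times = pd.Timestamp("2018-07-29 06:00") + pd.to_timedelta(x, unit="h")

# Format as 12-hour clock labels and drop the leading zero ("06 AM" -> "6 AM")
labels = [t.strftime("%I %p").lstrip("0") for t in times]
print(labels)
```

With these in hand, a call such as plt.xticks(x, labels) before plt.show() should replace the numeric ticks while the lines are still plotted against the original x column.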

How to do interpolation on datetime and float

I am doing 1d interpolation using scipy on time-series. My x-axis data is in datetime format and y axis is in float like:
3/15/2012 16:00:00 32.94
3/16/2012 16:00:00 32.95
3/19/2012 16:00:00 32.61
Now, during the slope calculation slope = (y_hi-y_lo) / (x_hi-x_lo), I am getting the error TypeError: unsupported operand type(s) for /: 'float' and 'datetime.timedelta', which is an obvious error. Can someone point me in the right direction on how to handle it?

Your issue is that you are trying to divide a float by a datetime.timedelta object which is, as you said, obviously throwing a TypeError.
You can convert datetime.timedelta objects to a float representing the total number of seconds within that timedelta using the datetime.timedelta.total_seconds() instance method.
In that case you would modify your code to something like:
slope_numerator = y_hi - y_lo
slope_denominator = (x_hi - x_lo).total_seconds()
slope = slope_numerator / slope_denominator
Note that this gives the slope per second. You could divide the denominator by 3600 or 86400 to get the slope per hour or per day, to suit your purposes.
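Put together, a linear interpolation between two dated points might look like the sketch below. The function name interpolate and its argument order are invented for the example, and the two points come from the sample data in the question.

```python
from datetime import datetime

def interpolate(x, x_lo, y_lo, x_hi, y_hi):
    """Linearly interpolate y at datetime x between two known points."""
    # Convert both time differences to seconds so the division is float / float
    span_seconds = (x_hi - x_lo).total_seconds()
    slope = (y_hi - y_lo) / span_seconds
    return y_lo + slope * (x - x_lo).total_seconds()

x_lo = datetime(2012, 3, 15, 16, 0, 0)
x_hi = datetime(2012, 3, 16, 16, 0, 0)

# Halfway between the two observations (12 hours after the first one)
y_mid = interpolate(datetime(2012, 3, 16, 4, 0, 0), x_lo, 32.94, x_hi, 32.95)
print(y_mid)
```

Keeping the conversion inside the function means callers can keep passing datetime objects and never handle timedelta arithmetic themselves.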

If you are working with time series data, the pandas package is an excellent option. Here's an example of upsampling daily data to hourly data via interpolation:
import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2011', periods=12, freq='D')
ts = pd.Series(np.arange(len(rng)), index=rng)
# upsample to hourly frequency, then fill the new rows by linear interpolation
resampled = ts.resample('h')
interp = resampled.interpolate()
In [5]: ts
Out[5]:
2011-01-01 0
2011-01-02 1
2011-01-03 2
2011-01-04 3
2011-01-05 4
2011-01-06 5
2011-01-07 6
2011-01-08 7
2011-01-09 8
2011-01-10 9
2011-01-11 10
2011-01-12 11
In [12]: interp.head()
Out[12]:
2011-01-01 00:00:00 0.000000
2011-01-01 01:00:00 0.041667
2011-01-01 02:00:00 0.083333
2011-01-01 03:00:00 0.125000
2011-01-01 04:00:00 0.166667
Freq: H, dtype: float64
In [13]: interp.tail()
Out[13]:
2011-01-11 20:00:00 10.833333
2011-01-11 21:00:00 10.875000
2011-01-11 22:00:00 10.916667
2011-01-11 23:00:00 10.958333
2011-01-12 00:00:00 11.000000
Freq: H, dtype: float64
