Tensorflow many-to-one RNN time series - python

I am trying to implement a many-to-one RNN in TensorFlow for time series data, similar to the example given at https://www.tensorflow.org/tutorials/structured_data/time_series. The data looks similar to the one below:
Time Latitude Longitude Speed Heading (deg)
0 20 20 5 180
1 19.9 20 5 180
2 19.8 20 5 180
3 19.7 20 5 180
Now my goal is to use the first 3 timesteps to predict the latitude of the next timestep. So my input would be
Latitude Longitude Speed Heading (deg)
20 20 5 180
19.9 20 5 180
19.8 20 5 180
and my output would be
19.7
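For reference, a minimal NumPy sketch of the windowing I have in mind (column order assumed to be [latitude, longitude, speed, heading]):
import numpy as np

def make_windows(data, lookback=3, target_col=0):
    # data: float array of shape (T, 4) -> [latitude, longitude, speed, heading]
    X = np.stack([data[i:i + lookback] for i in range(len(data) - lookback)])
    y = data[lookback:, target_col]  # latitude of the step right after each window
    return X, y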
My inputs may be "numbers", but they're really categorical. For example, headings of 359 deg and 1 deg are nearly identical. I have tried one-hot encoding each feature and concatenating the results to create a "four-hot" encoding of the data, but with little success.
How do you encode the features I have in a format that makes sense?

You can set boundaries for each of the areas. For example, if Latitude is less than 10, assign it to class 0; if it is between 10 and 20, to class 1; and if it is more than 20, to class 2.
You can do it by simply adding columns to your dataframe.
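A minimal pandas sketch of that binning (the boundaries, labels, and column names here are illustrative, not tuned to your data):
import pandas as pd

df = pd.DataFrame({'Latitude': [20.0, 19.9, 19.8, 19.7],
                   'Heading': [180, 180, 180, 359]})

# Bin latitude into the three classes described above.
df['lat_class'] = pd.cut(df['Latitude'], bins=[-90, 10, 20, 90], labels=[0, 1, 2])

# Headings can be binned into sectors the same way.
df['heading_class'] = pd.cut(df['Heading'], bins=[0, 90, 180, 270, 360],
                             labels=[0, 1, 2, 3], include_lowest=True)
print(df)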


How to calculate best-fit line for each row with NaN?

I have a dataset storing marathon segment splits (5K, 10K, ...) in seconds and identifiers (age, gender, country) as columns and individuals as rows. Each cell for a marathon segment split column may contain either a float (specifying the number of seconds required to reach the segment) or "NaN". A row may contain up to 4 NaN values. Here is some sample data:
Age M/F Country 5K 10K 15K 20K Half Official Time
2323 38 M CHI 1300.0 2568.0 3834.0 5107.0 5383.0 10727.0
2324 23 M USA 1286.0 2503.0 3729.0 4937.0 5194.0 10727.0
2325 36 M USA 1268.0 2519.0 3775.0 5036.0 5310.0 10727.0
2326 37 M POL 1234.0 2484.0 3723.0 4972.0 5244.0 10727.0
2327 32 M DEN NaN 2520.0 3782.0 5046.0 5319.0 10728.0
I intend to calculate a best fit line for marathon split times (using only the columns between "5K" to "Half") for each row with at least one NaN; from the best fit line for the row, I want to impute a data point to replace the NaN with.
From the sample data, I intend to calculate a best fit line for row 2327 only (using values 2520.0, 3782.0, 5046.0, and 5319.0). Using this best fit line for row 2327, I intend to replace the NaN 5K time with the predicted 5K time.
How can I calculate this best fit line for each row with NaN?
Thanks in advance.
I "extrapolated" a solution from here from 2015 https://stackoverflow.com/a/31340344/6366770 (pun intended). Extrapolation definition I am not sure if in 2021 pandas has reliable extrapolation methods, so you might have to use scipy or other libraries.
When doing the extrapolation, I excluded the "Half" column. That is because the distances 5K, 10K, 15K and 20K are evenly spaced: the x-values form a literal straight line if you exclude the half-marathon column. That does not mean expected running times are linear, of course; as you run a longer distance, your average pace per kilometer slows. But this "gets the job done" without getting into an incredibly complex calculation.
One more thing worth noting: if the first column were 1K instead of 5K, this method would fail, because it only works while the distances are evenly spaced. With a 1K column you would also have to use data from the other runners' rows, unless you made calculations based on the kilometer values in the column names themselves. Either way, this is an imperfect solution, but much better than pd.interpolate. I linked another potential solution in the comments of tdy's answer.
import scipy as sp
import scipy.interpolate  # ensure sp.interpolate is available
import pandas as pd

# Focus on the four numeric columns (5K-20K) and transpose the dataframe,
# since we are working horizontally across columns. The index must also be
# numeric, so we drop it here; the original shape is restored later on.
df_extrap = df.iloc[:, 4:8].T.reset_index(drop=True)
# Create a scipy interpolation function, to be called by a custom
# extrapolation function later on.
def scipy_interpolate_func(s):
    s_no_nan = s.dropna()
    return sp.interpolate.interp1d(s_no_nan.index.values, s_no_nan.values,
                                   kind='linear', bounds_error=False)
# Extrapolate linearly using the slope between the first and last fitted points.
def my_extrapolate_func(scipy_interpolate_func, new_x):
    x1, x2 = scipy_interpolate_func.x[0], scipy_interpolate_func.x[-1]
    y1, y2 = scipy_interpolate_func.y[0], scipy_interpolate_func.y[-1]
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (new_x - x1)
# Concatenate the extrapolated columns and transpose back to the initial
# shape so the result can be assigned to the original dataframe.
s_extrapolated = pd.concat([pd.Series(my_extrapolate_func(scipy_interpolate_func(df_extrap[s]),
                                                          df_extrap[s].index.values),
                                      index=df_extrap[s].index)
                            for s in df_extrap.columns], axis=1).T
cols = ['5K', '10K', '15K', '20K']
df[cols] = s_extrapolated
df
Out[1]:
   index  Age M/F Country      5K     10K     15K     20K    Half  Official Time
0   2323   38   M     CHI  1300.0  2569.0  3838.0  5107.0  5383.0        10727.0
1   2324   23   M     USA  1286.0  2503.0  3720.0  4937.0  5194.0        10727.0
2   2325   36   M     USA  1268.0  2524.0  3780.0  5036.0  5310.0        10727.0
3   2326   37   M     POL  1234.0  2480.0  3726.0  4972.0  5244.0        10727.0
4   2327   32   M     DEN  1257.0  2520.0  3783.0  5046.0  5319.0        10728.0
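If you would rather touch only the rows that actually contain NaN (as the question asks), and use the kilometer values implied by the column names as the x-axis, a per-row least-squares fit is a possible alternative; a sketch with np.polyfit:
import numpy as np

cols = ['5K', '10K', '15K', '20K']
x_all = np.array([5.0, 10.0, 15.0, 20.0])  # distances implied by the column names

# Fit a line through each affected row's observed splits, then fill the gaps.
for idx, row in df.loc[df[cols].isna().any(axis=1), cols].iterrows():
    mask = row.notna().values
    slope, intercept = np.polyfit(x_all[mask], row.values[mask].astype(float), 1)
    df.loc[idx, cols] = np.where(mask, row.values, slope * x_all + intercept)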

pandas: how to avoid iterating through rows when you need to verify the data sequentially?

I have a dataframe which is composed of a timestamp and two variables:
A pressure measurement, which varies sequentially and traces out each specific process batch;
A lab analysis, a single measurement representing each batch. The analysis always occurs at the end of the batch and remains a constant value until a new analysis is made. Caution: not every batch is analyzed, and I don't have a flag indicating when a batch started.
I need to create a dataframe which calculates, for each batch, the average, maximum and minimum pressure, and how long the batch took from start to end (a timedelta).
My idea was to loop through all analysis values from the end to the start; every time I find a new analysis value OR the pressure drops below a certain value (a characteristic of the process: all batches start with low pressure), I would take that as the batch start, both to calculate the timedelta and to define the interval for taking the pressure min, max, and average.
However, I know it is not efficient to loop through all dataframe rows (especially with a million of them), so, any ideas?
Dataset sample: https://cl1p.net/5sg45sf5 or https://wetransfer.com/downloads/321bc7dc2a02c6f713963518fdd9271b20201115195604/08c169
Edit: there is no ready-made indication of when a batch starts in the current data (as someone asked), but you can identify a batch by the following characteristics:
Every batch starts with pressure below 30, rising quickly (in less than one hour) up to 61.
It then stabilizes around 65 (the plateau can be anywhere between 61 and 70) and stays there for at least two and a half hours.
It ends with a pressure drop (faster than one hour) to a value below 30.
Then the cycle repeats.
Note: there can be smaller/shorter peaks between two valid batches, but these shall not be considered batches.
Thanks!
This solution assumes that the batches change when the value of lab analysis changes.
First, I'll plot those changes, so we can get an idea of how frequently the value changes:
df['lab analysis'].plot()
There are not many changes, so we just need to identify these:
df_shift = df.loc[df['lab analysis'].diff().ne(0) & df['lab analysis'].diff().notna()]
df_shift
time pressure lab analysis
2632 2020-09-15 19:52:00 356.155 59.7
3031 2020-09-16 02:31:00 423.267 59.4
3391 2020-09-16 08:31:00 496.583 59.3
4136 2020-09-16 20:56:00 625.494 59.4
4971 2020-09-17 10:51:00 469.114 59.2
5326 2020-09-17 16:46:00 546.989 58.9
5677 2020-09-17 22:37:00 53.730 59.0
6051 2020-09-18 04:51:00 573.789 59.2
6431 2020-09-18 11:11:00 547.015 58.7
8413 2020-09-19 20:13:00 27.852 58.5
10851 2020-09-21 12:51:00 570.747 58.9
15816 2020-09-24 23:36:00 553.846 58.7
Now we can run a loop from these few changes, categorize each batch, and then run the descriptive statistics:
index_shift = df_shift.index
i = 0
batch = 1
for shift in index_shift:
    df.loc[i:shift, 'batch number'] = batch
    batch = batch + 1
    i = shift
stats = df.groupby('batch number')['pressure'].describe()[['mean', 'min', 'max']]
Then compute the time difference for each batch and insert it into stats as well:
# Prepend the first row of df so the first batch's start time is included.
df_shift.loc[0] = df.iloc[0, :3]
df_shift.sort_index(inplace=True)
time_difference = [*df_shift['time'].diff()][1:]
stats['duration'] = time_difference
stats
mean min max duration
batch number
1.0 518.116150 24.995 671.315 1 days 19:52:00
2.0 508.353105 27.075 670.874 0 days 06:39:00
3.0 508.562450 26.715 671.156 0 days 06:00:00
4.0 486.795097 25.442 672.548 0 days 12:25:00
5.0 491.437620 24.234 671.611 0 days 13:55:00
6.0 515.473651 29.236 671.355 0 days 05:55:00
7.0 509.180860 25.566 670.714 0 days 05:51:00
8.0 490.876639 25.397 671.134 0 days 06:14:00
9.0 498.757555 24.973 670.445 0 days 06:20:00
10.0 497.000796 25.561 670.667 1 days 09:02:00
11.0 517.255608 26.107 669.476 1 days 16:38:00
12.0 404.859498 20.594 672.566 3 days 10:45:00
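If the loop over the shift points ever becomes a bottleneck, the batch labels can also be assigned without explicit iteration. A sketch under the same assumption (batches change when the lab analysis changes); note it also labels the trailing rows after the last change, which the loop above leaves unassigned:
# A new batch starts wherever 'lab analysis' changes value
# (the NaN produced by the first diff is excluded).
change = df['lab analysis'].diff().ne(0) & df['lab analysis'].diff().notna()
df['batch number'] = change.cumsum() + 1
stats = df.groupby('batch number')['pressure'].describe()[['mean', 'min', 'max']]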

How to combine static features with time series in forecasting

I tried to find a similar question and its answers but was not successful, so I'm asking a question that might have been asked before:
I'm working on a problem of predicting the cumulative water production of several water wells. The features I have are both time series (water rate and pump speed as functions of time) and static (depth of the well, latitude and longitude of the well, thickness of the water-bearing zone, etc.).
My input data for well #1 is shown below.
dynamic data:
water rate pump speed total produced water
2000-01-01 10 4 1120
2000-01-02 20 8 1140
2000-01-03 10 4 1150
2000-01-04 10 3 1160
2000-01-05 10 4 1170
static data:
depth of the well_1 = 100
latitude and longitude of the well_1 = x1, y1
thickness of the water bearing zone of well_1 = 3
My question is: how can an RNN model (LSTM, GRU, ...) be built so that it takes both dynamic and static features?
There are multiple options, and you will need to experiment to find which one is optimal for your case.
Option 1: You can treat your static features as fixed temporal data: repeat each static feature along the time dimension and let the LSTM handle the rest.
For example, your transformed data would look like this:
water rate pump speed total produced water depth_wall
2000-01-01 10 4 1120 100
2000-01-02 20 8 1140 100
2000-01-03 10 4 1150 100
2000-01-04 10 3 1160 100
2000-01-05 10 4 1170 100
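In pandas, this repetition is a one-line broadcast per static feature (df_well1 being a hypothetical per-well frame like the dynamic table above):
# Broadcasting a scalar repeats the static value down the time axis.
df_well1['depth_wall'] = 100
df_well1['thickness'] = 3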
Option 2: Designing multi-head networks.
TIME_SERIES_INPUT ------> LSTM -------\
*---> MERGE / Concatenate ---> [more layers]
STATIC_INPUTS --> [FC layer/ conv] ---/
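A minimal Keras sketch of this two-head layout (layer sizes and names are illustrative; 5 timesteps of 2 dynamic features and 4 static features assumed):
import tensorflow as tf
from tensorflow.keras import layers

ts_in = layers.Input(shape=(5, 2), name='time_series_input')  # water rate, pump speed
static_in = layers.Input(shape=(4,), name='static_input')     # depth, lat, lon, thickness

h = layers.LSTM(32)(ts_in)                           # temporal head
s = layers.Dense(16, activation='relu')(static_in)   # static head (FC layer)

merged = layers.concatenate([h, s])                  # the MERGE step in the diagram
out = layers.Dense(1, name='total_produced_water')(merged)

model = tf.keras.Model(inputs=[ts_in, static_in], outputs=out)
model.compile(optimizer='adam', loss='mse')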
Here is a paper explaining a combining strategy: https://arxiv.org/pdf/1712.08160.pdf
Here is another paper utilizing option 2: https://www.researchgate.net/publication/337159046_Classification_of_ECG_signals_by_dot_Residual_LSTM_Network_with_data_augmentation_for_anomaly_detection
Source code for paper 2: https://github.com/zabir-nabil/dot-res-lstm
LSTM_att, proposed in "Machine Learning Crop Yield Models Based on Meteorological Features and Comparison with a Process-Based Model", also seems to be a good option.
It uses the static features to compute attention weights that aggregate the hidden states of the time series, and it also provides a shortcut connection between each hidden state and the final state (similar to ResNet). It outperforms baseline LSTM models.
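A rough Keras sketch of that idea (static-conditioned dot-product attention over the LSTM hidden states; sizes are illustrative and this is not the paper's exact architecture):
import tensorflow as tf
from tensorflow.keras import layers

ts_in = layers.Input(shape=(5, 2), name='dynamic')      # e.g. water rate, pump speed
static_in = layers.Input(shape=(4,), name='static')     # e.g. depth, lat, lon, thickness

h = layers.LSTM(32, return_sequences=True)(ts_in)       # one hidden state per timestep
q = layers.Dense(32)(static_in)                         # static-derived attention query

scores = layers.Dot(axes=(2, 1))([h, q])                # (batch, timesteps) similarity
weights = layers.Softmax()(scores)                      # attention over timesteps
context = layers.Dot(axes=(1, 1))([h, weights])         # weighted sum of hidden states

out = layers.Dense(1)(context)
model = tf.keras.Model([ts_in, static_in], out)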

Plotting a segmentation plot

I have a data frame
Rfm Count %
0 111 88824 57.13
1 112 5462 3.51
2 121 32209 20.72
3 122 15155 9.75
4 211 5002 3.22
5 212 1002 0.64
6 221 3054 1.96
7 222 4778 3.07
How can I plot a graph like this?
Background - The numbers are the RFM scores.
R is Recency (number of days since the customer last ordered)
F is Frequency (number of jobs from the customer)
M is Monetary (how much the customer is paying)
The R, F and M scores are either 1 (bad) or 2 (good).
I would like to segment them into 4 Quadrants.
I would also like the size of the blob to be proportional to the percentage.
I.e. blob 111 (57%) will be much larger than blob 212 (0.64%).
I really want to get better at data visualization, please help a beginner out. I'm familiar with seaborn and matplotlib.
PS: Is it possible to add a third dimension to the plot? The third dimension would be the frequency.
Edit: The second image is a simple, static way of achieving my goal. Any input on doing it with matplotlib or seaborn, for a more interesting illustration?
Second image: https://i.stack.imgur.com/AuzEM.jpg
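A minimal matplotlib sketch along these lines, splitting the digits of the Rfm code into separate scores, sizing each blob by its percentage, and jittering by M so all eight codes stay visible (purely illustrative, not a polished segmentation plot):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Rfm': [111, 112, 121, 122, 211, 212, 221, 222],
                   'Count': [88824, 5462, 32209, 15155, 5002, 1002, 3054, 4778],
                   '%': [57.13, 3.51, 20.72, 9.75, 3.22, 0.64, 1.96, 3.07]})

# Split the three digits of the RFM code into separate scores.
digits = df['Rfm'].astype(str)
df['R'] = digits.str[0].astype(int)
df['F'] = digits.str[1].astype(int)
df['M'] = digits.str[2].astype(int)

# Offset x slightly by M so both M values at each (R, F) corner are visible.
x = df['R'] + (df['M'] - 1.5) * 0.2
fig, ax = plt.subplots()
ax.scatter(x, df['F'], s=df['%'] * 50, alpha=0.5)
for xi, yi, label in zip(x, df['F'], df['Rfm']):
    ax.annotate(str(label), (xi, yi), ha='center')

# Quadrant boundaries between scores 1 and 2.
ax.axvline(1.5, color='grey', linewidth=1)
ax.axhline(1.5, color='grey', linewidth=1)
ax.set(xlabel='R score', ylabel='F score', xticks=[1, 2], yticks=[1, 2])
plt.show()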

How to get indices of both lists of different lengths if there is a unique match between them, using Python 2.7

I want to know how to get the indices of both lists whenever there is a match between them. I came across enumerate and the zip function, but they only work when the two lists are the same length. Since my inputs are of different lengths, I want to get the indices of both lists.
Simulated Time (msec)  Simulated O/p  Actual Time (msec)  Actual O/p
0                      12.57          0                   12.55
50                     12.58          100                 12.56
100                    12.55          200                 12.60
150                    12.59          300                 12.45
200                    12.53          400                 12.59
250                    12.87          500                 12.78
300                    12.50          600                 12.57
350                    12.75          700                 12.66
400                    12.80          800                 12.78
......                 ......         .....               ......
Also, my simulated data is in a different file and is generated every 50 ms, a different rate from my actual data, so the simulated list is longer than the actual one. But every actual time is present in the simulated times. I want to get the indices of both lists. For example, Simulated Time 100 (i=2) matches index (j=1) of the actual time. If I get both indices i and j, I can then compare the corresponding simulated and actual outputs at that particular instant.
Lastly, I want to iterate until the end of the simulated time.
Please suggest how I can solve this.
If sim and act contain unique values, here is a way to do that, using the numpy set routine np.in1d:
sim=np.unique(np.random.randint(0,10,3))*10 #sample
act=np.unique(np.random.randint(0,10,5))*10 #sample
i=np.arange(len(sim))[np.in1d(sim,act)]
j=np.arange(len(act))[np.in1d(act,sim)]
print(sim,act,i,j)
#[40 50 70] [10 30 40 50] [0 1] [2 3]
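Applied to the question's data (arrays built from the table above), the matched outputs can then be compared directly:
import numpy as np

sim_time = np.array([0, 50, 100, 150, 200, 250, 300, 350, 400])
act_time = np.array([0, 100, 200, 300, 400])
sim_out = np.array([12.57, 12.58, 12.55, 12.59, 12.53, 12.87, 12.50, 12.75, 12.80])
act_out = np.array([12.55, 12.56, 12.60, 12.45, 12.59])

i = np.arange(len(sim_time))[np.in1d(sim_time, act_time)]  # [0 2 4 6 8]
j = np.arange(len(act_time))[np.in1d(act_time, sim_time)]  # [0 1 2 3 4]
print(sim_out[i] - act_out[j])  # deviation at each shared timestamp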
