Misaligned predictions and response column after cross-validation in H2O (Deep Learning) - python

I keep having a problem with a deep learning model. I have a model trained on the rrc data frame, and if I do:
rrc['preds'] = dp.cross_validation_holdout_predictions().as_data_frame().predict
the response column and the predictions come back misaligned. At the top of the data frame they are aligned, but at some point they drift apart, and the correlation between them is very poor because of this misalignment. I have been trying to fix this for over three days, but I have no idea how to do it.
I'm using H2O 3.10.4.5.
The model itself:
dp = H2ODeepLearningEstimator(activation="Tanh", hidden=[10, 10, 10], epochs=10000,
                              keep_cross_validation_predictions=True,
                              ignored_columns=['fn', 'pdb_id', 'pdb_id_chain', 'pdb_id_chain_source', 'source'])
dp.train(x=list(set(rrch.col_names) - set(['rmsd_all'])), y="rmsd_all", training_frame=rrch,
         fold_column="cv")
Edit: I think I found the problem (Cell #58): https://github.com/mmagnus/mmagnus.github.io/blob/master/mq-test.ipynb. Filtering with rrc3 = rrc3[rrc3.rmsd_all < 10] to remove the rows whose rmsd_all (the response column) value is higher than 10, and then converting with rrc3h = h2o.H2OFrame(rrc3), is what causes the problem. I'm not sure why, though. The dataset (40 MB): https://www.dropbox.com/s/1et38o3xx47jw1m/rasp_rnakb_cv2.csv?dl=0

Solved: rrc3.reset_index(inplace=True) will do the job!
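For reference, a minimal sketch of the fix in context. It only restates the filtering from the edit plus the reset_index fix; the comment about index alignment is my assumption about why it helps, and drop=True is optional (it merely avoids keeping the old index as an extra column):
import h2o

# Boolean filtering leaves gaps in the pandas index; a later assignment of a
# prediction Series aligns by index label, which can silently misalign rows.
rrc3 = rrc3[rrc3.rmsd_all < 10]
rrc3.reset_index(inplace=True, drop=True)  # renumber rows 0..n-1 before converting

rrc3h = h2o.H2OFrame(rrc3)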

Related

Predicting whether a customer will deposit in the next 30 days

I am trying to figure out what type of machine learning model to build for the following case:
I want to be able to predict whether a customer will deposit in the next 30 days, using past data:
# Data
import pandas as pd
client_id = [1 , 1, 1, 2, 2, 3]
deposit_amount = [10, 20, 30, 15, 45, 55]
deposit_date = ["2022-01-05", "2022-01-06", "2022-01-07", "2022-01-05", "2022-01-06", "2022-01-06"]
dat = pd.DataFrame([client_id, deposit_amount, deposit_date]).T
dat.columns = ["client_id", "deposit_amount", "deposit_date"]
dat
# Use ML algorithm
---
# Output of the ML Algorithm
model_prediction = pd.DataFrame([[1, 2, 3],[0, 1, 0]]).T
model_prediction.columns = ["client_id", "target"]
As you can see, there are multiple rows per customer, one for each time they deposit. Which machine learning algorithm do you suggest? Since I am trying to predict something time-related, is there a model that can deal with time-based data and output a binary column?
You can use whatever you want as a classification algorithm (SVC, RandomForest, LogisticRegression, etc.). The most important thing is to prepare your data correctly.
For example, the date variable is not useful as such, but the time delta between two deposits is interesting for you:
dat['deposit_date'] = pd.to_datetime(dat['deposit_date'])
dat['deposit_date'] = dat.groupby('client_id')['deposit_date'].diff().dt.days.fillna(0)
Once you have prepared your data, it is important to normalize/standardize it so that variables with large values do not dominate.
You can try several algorithms and compare them, and also try different hyperparameter settings for each model (watch out for overfitting).
The second most important thing is choosing the metric used to evaluate your model. Since this is a classification problem (more specifically, binary classification), you should use metrics such as F1-score, precision, and recall.
As you can see, the choice of algorithm is not very important compared to your analysis of the problem :-)
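To make that concrete, here is a minimal sketch of such a pipeline. It assumes you have already aggregated the raw deposits into one row per client (e.g. mean gap between deposits, total amount) with a binary target meaning "deposited within the next 30 days"; the synthetic data and LogisticRegression are placeholders, not a recommendation:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

# Stand-in for the real per-client feature table and 30-day target.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Standardize the features, then fit any classifier you like.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("F1:       ", f1_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))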

How to modify the Timeseries forecasting for weather prediction example to increase the number of predictions?

(And plot them all in the same figure).
I've been following the "Timeseries forecasting for weather prediction" code found here:
https://keras.io/examples/timeseries/timeseries_weather_forecasting/
The article says:
"The trained model above is now able to make predictions for 5 sets of values from validation set."
And it uses this code to get predictions and plot them:
def show_plot(plot_data, delta, title):
    labels = ["History", "True Future", "Model Prediction"]
    marker = [".-", "rx", "go"]
    time_steps = list(range(-(plot_data[0].shape[0]), 0))
    if delta:
        future = delta
    else:
        future = 0

    plt.title(title)
    for i, val in enumerate(plot_data):
        if i:
            plt.plot(future, plot_data[i], marker[i], markersize=10, label=labels[i])
        else:
            plt.plot(time_steps, plot_data[i].flatten(), marker[i], label=labels[i])
    plt.legend()
    plt.xlim([time_steps[0], (future + 5) * 2])
    plt.xlabel("Time-Step")
    plt.show()
    return


for x, y in dataset_val.take(5):
    show_plot(
        [x[0][:, 1].numpy(), y[0].numpy(), model.predict(x)[0]],
        12,
        "Single Step Prediction",
    )
On my computer, in order to downsample the series to 1 hour, instead of using "sampling_rate=6" I have directly modified the frequency of the input data and I'm using "sampling_rate=1".
Now, assuming the model was fitted properly, what do I need to modify if I want to get predictions for the next 500 intervals instead of just 5?
dataset_val.take(500)
Or something else?
The configuration at the beginning also says:
split_fraction = 0.715
train_split = int(split_fraction * int(df.shape[0]))
step = 6
past = 720
future = 72
learning_rate = 0.001
batch_size = 256
epochs = 10
What values do I need to use now for past and future, if my data has a frequency of 1 hour and I want to predict 500 points forward?
future = 500
past = ? (it seems to be the number of timestamps taken backwards for training)
What about delta? It's fixed to 12, but it seems to be the value for future.
According to the source (https://github.com/keras-team/keras-io/blob/master/examples/timeseries/timeseries_weather_forecasting.py), here is the model:
inputs = keras.layers.Input(shape=(inputs.shape[1], inputs.shape[2]))
lstm_out = keras.layers.LSTM(32)(inputs)
outputs = keras.layers.Dense(1)(lstm_out)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), loss="mse")
model.summary()
As you can see, it uses a Dense layer with 1 unit as the last layer. If you want, for example, 2 predictions per sample, you should use 2 units in that final Dense layer, and you should be careful about the shapes of (X_train, Y_train) and (X_validation, Y_validation), because by default each expected Y holds a single value, so you will probably have to convert it.
Simple example: the default shapes X:1, Y:1 change to X:1, Y:1,2.
The Y data should probably also be shifted by N, where N is exactly the number of units in the last (Dense) layer.
If you just want to predict over a larger time frame, you can resample your whole dataset to that coarser frequency. For example, our default time frame for the weather data is hourly; we can convert the data to daily (×24) and then predict daily, or do the same with ×30 and predict monthly, and so on.
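To illustrate the multi-output variant described above, here is a minimal sketch with toy shapes (it is not the example's original data pipeline): n_future stands for the number of steps predicted per sample, and Y_train is assumed to hold windows of the next n_future values instead of a single value:
import numpy as np
from tensorflow import keras

n_future = 2  # number of steps to predict per sample (the N above)

# Toy shapes for illustration: 1000 windows of 120 past steps with 7 features.
X_train = np.random.rand(1000, 120, 7).astype("float32")
# Each target is now a vector of the next n_future values instead of a scalar.
Y_train = np.random.rand(1000, n_future).astype("float32")

inputs = keras.layers.Input(shape=(X_train.shape[1], X_train.shape[2]))
lstm_out = keras.layers.LSTM(32)(inputs)
outputs = keras.layers.Dense(n_future)(lstm_out)  # n_future units instead of 1

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
model.fit(X_train, Y_train, epochs=1, batch_size=256)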

Understanding TimeSeriesDataSet in Pytorch-Forecasting

I have 913,000 rows of data:
data image
First, let me explain this data.
It is sales data for 10 stores and 50 items from 2013-01-01 to 2017-12-31.
I understand why the data has 913,000 rows (the period includes a leap year: 10 × 50 × 1,826 days = 913,000).
Anyway, I made my training set:
training = TimeSeriesDataSet(
    train_df[train_df.apply(lambda x: x['time_idx'] <= training_cutoff, axis=1)],
    time_idx="time_idx",
    target="sales",
    group_ids=["store", "item"],  # list of column names identifying a time series
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=["store", "item"],  # categorical variables that do not change over time (e.g. product length)
    time_varying_unknown_reals=["sales"],
)
Now
First question: as I understand it, the data param of TimeSeriesDataSet ends up holding the data minus the prediction horizon cut off by training_cutoff and minus the max_encoder_length reserved for prediction. Is this right? If not, please tell me how it actually works.
Second question: similarly, this is the output of the code above:
output image
Why is the length 863,500?
I calculated the length based on my understanding:
prediction horizon removed by training_cutoff: 20 × 50 × 10 = 10,000
max_encoder_length reserved for prediction: 60 × 50 × 10 = 30,000
Thus 913,000 − 40,000 = 873,000.
Where do the remaining 9,500 rows go?
I did my best with Googling; please tell me how this works.
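For reference, a small sketch that only reproduces the arithmetic in the question; the values max_prediction_length = 20 and max_encoder_length = 60 are assumptions inferred from the numbers above, not taken from the post's configuration:
n_stores, n_items, n_days = 10, 50, 1826  # 2013-01-01 .. 2017-12-31, incl. the 2016 leap day
max_prediction_length = 20                # assumed from 20 * 50 * 10 = 10,000
max_encoder_length = 60                   # assumed from 60 * 50 * 10 = 30,000

total_rows = n_stores * n_items * n_days                                      # 913,000
removed = (max_prediction_length + max_encoder_length) * n_stores * n_items   # 40,000
print(total_rows, total_rows - removed)  # 913000 873000, vs. the reported 863,500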

Supervised Time Series efficiency improvement

The data I have is recorded hourly over the past 4 months. I am building a time series model and I've tried several methods so far: ARIMA, LSTMs, and Prophet, but they can be quite slow for my task since I have to run the model on thousands of time series in different locations. So I thought it might be interesting to transform it into a supervised problem and use regression.
I extracted 4 features from the univariate time series and its time index, namely: dayofweek, hour, daily average, and hourly average. So at the moment I am using these 4 predictors but could possibly extract more (like beginning of the day, noon, etc.; any other suggestions here are also very welcome).
I've used XGBoost for the regression and here are parts of the code:
# XGB
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Functions needed
def convert_dates(x):
    x['date'] = pd.to_datetime(x['date'])
    #x['month'] = x['date'].dt.month
    #x['year'] = x['date'].dt.year
    x['dayofweek'] = x['date'].dt.dayofweek
    x['hour'] = x['date'].dt.hour
    #x['week_no'] = pd.to_numeric(x['date'].index.strftime("%V"))
    x.pop('date')
    return x

def add_avg(x):
    x['daily_avg'] = x.groupby(['dayofweek'])['y'].transform('mean')
    x['hourly_avg'] = x.groupby(['dayofweek', 'hour'])['y'].transform('mean')
    #x['monthly_avg'] = x.groupby(['month'])['y'].transform('mean')
    #x['weekly_avg'] = x.groupby(['week_no'])['y'].transform('mean')
    return x

xgb_mape_r2_dict = {}
I then run a for loop in which I select a location and build a model for it. Inside the loop I split the data into a training part and a test part. I knew there might be problems due to the Easter holidays in my country last week, because those are rare events, which is why I split the training and test data in this manner: I consider the data from the beginning of the year up to two weeks ago as training data, and the very next week after that as test data.
for j in range(10, 20):
    data = df_all.loc[df_all['Cell_Id'] == top_cells[j]]
    data.drop(['Cell_Id', 'WDay'], axis=1, inplace=True)
    data['date'] = data.index
    period = 168
    data_train = data.iloc[:-2*period, :]
    data_test = data.iloc[-2*period:-period, :]
    data_train = convert_dates(data_train)
    data_test = convert_dates(data_test)
    data_train.columns = ['y', 'dayofweek', 'hour']
    data_test.columns = ['y', 'dayofweek', 'hour']
    data_train = add_avg(data_train)
    daily_avg = data_train.groupby(['dayofweek'])['y'].mean().reset_index()
    hourly_avg = data_train.groupby(['dayofweek', 'hour'])['y'].mean().reset_index()
Now, for the test data I add the past averages, namely the 7 daily averages and the 168 hourly averages computed from the training data. This is actually the part that takes the longest to run, and I would like to improve its efficiency.
value_dict = {}
for k in range(168):
    value_dict[tuple(hourly_avg.iloc[k])[:2]] = tuple(hourly_avg.iloc[k])[2]
data_test['daily_avg'] = 0
data_test['hourly_avg'] = 0
for i in range(len(data_test)):
    data_test['daily_avg'][i] = daily_avg['y'][data_test['dayofweek'][i]]
    data_test['hourly_avg'][i] = value_dict[(data_test['dayofweek'][i], data_test['hour'][i])]
My current run time is 30 seconds for every iteration of the for loop, which is way too slow because of the inefficient way I add the averages to the test data. I would really appreciate it if anyone could point out how I could implement this bit faster.
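One way to avoid the row-by-row loop is to build lookup tables from the training-set averages once and map them onto the test frame; this is just a sketch using the column names from the code above (the Update at the end of this post shows another fix based on pd.merge):
# Build plain dicts from the per-location training averages...
daily_map = daily_avg.set_index('dayofweek')['y'].to_dict()
hourly_map = hourly_avg.set_index(['dayofweek', 'hour'])['y'].to_dict()

# ...then fill both columns without per-row DataFrame indexing.
data_test['daily_avg'] = data_test['dayofweek'].map(daily_map)
data_test['hourly_avg'] = [
    hourly_map[(d, h)] for d, h in zip(data_test['dayofweek'], data_test['hour'])
]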
I will also add the rest of my code and make some other observations as well:
x_train = data_train.drop('y', axis=1)
x_test = data_test.drop('y', axis=1)
y_train = data_train['y']
y_test = data_test['y']

def XGBmodel(x_train, x_test, y_train, y_test):
    matrix_train = xgb.DMatrix(x_train, label=y_train)
    matrix_test = xgb.DMatrix(x_test, label=y_test)
    model = xgb.train(params={'objective': 'reg:linear', 'eval_metric': 'mae'},
                      dtrain=matrix_train, num_boost_round=500,
                      early_stopping_rounds=20, evals=[(matrix_test, 'test')])
    return model

model = XGBmodel(x_train, x_test, y_train, y_test)
#submission = pd.DataFrame(x_pred.pop('id'))
y_pred = model.predict(xgb.DMatrix(x_test), ntree_limit=model.best_ntree_limit)
#submission['sales'] = y_pred
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
y_test.reset_index(inplace=True, drop=True)
compare_df = pd.concat([y_test, y_pred], axis=1)
compare_df.columns = ['Real', 'Predicted']
compare_df.plot()
mape = (np.abs((y_test['y'] - y_pred[0]) / y_test['y']).mean()) * 100
r2 = r2_score(y_test['y'], y_pred[0])
xgb_mape_r2_dict[top_cells[j]] = [mape, r2]
I've used both R-squared and MAPE as accuracy measures, although I'm not sure MAPE is still appropriate now that I've transformed the time series problem into a regression problem. Any thoughts on this?
Thank you very much for your time and consideration. Any help is very much appreciated.
Update: I have managed to fix the issue using pandas' merge. I first created two dataframes containing the daily averages and hourly averages from the training data and then merged these dataframes with the test data:
data_test = merge(data_test, daily_avg, ['dayofweek'], 'daily_avg')
data_test = merge(data_test, hourly_avg, ['dayofweek', 'hour'], 'hourly_avg')
data_test.columns = ['y', 'dayofweek', 'hour', 'daily_avg', 'hourly_avg']
where the merge function is defined as:
def merge(x, y, col, col_name):
    x = pd.merge(x, y, how='left', on=None, left_on=col, right_on=col,
                 left_index=False, right_index=False, sort=True,
                 copy=True, indicator=False, validate=None)
    x = x.rename(columns={'sales': col_name})
    return x
I can now run the model for 2000 locations per hour on a laptop with decent results but I will try to improve it while keeping it fast. Thank you very much once again.

Investigating feature importance and weight evolution in a given DL model

I apologize for a longer than usual intro, but it is important for the question:
I've recently been assigned to work on an existing project, which uses Keras+Tensorflow to create a Fully Connected Net.
Overall, the model has 3 fully connected layers with 500 neurons and 2 output classes. The first layer has 500 neurons, which are connected to 82 input features. The model is used in production and is retrained weekly, using that week's information generated by an external source.
The engineer who designed the model no longer works here, and I'm trying to reverse engineer the model and understand its behavior.
A couple of objectives I have defined for myself are:
Understand the feature selection process and feature importance.
Understand and control the weekly re-training process.
In order to try and answer both of them, I've implemented an experiment where I feed my code with two models: one from the previous week and the other from the current week:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from keras.models import model_from_json
path1 = 'C:/Model/20190114/'
path2 = 'C:/Model/20190107/'
model_name1 = '0_10.1'
model_name2 = '0_10.2'
models = [path1 + model_name1, path2 + model_name2]
features_cum_weight = {}
I then take each feature and try to sum all the weights (their absolute value) which connect it to the first hidden layer.
This way I create two vectors of 82 values:
for model_name in models:
    structure_filename = model_name + "_structure.json"
    weights_filename = model_name + "_weights.h5"
    with open(structure_filename, 'r') as model_json:
        model = model_from_json(model_json.read())
    model.load_weights(weights_filename)
    in_layer_weights = model.layers[0].get_weights()[0]
    in_layer_weights = abs(in_layer_weights)
    features_cum_weight[model_name] = in_layer_weights.sum(axis=1)
I then plot them, using MatplotLib:
# Plot the evolution of the input-neuron weights:
keys = list(features_cum_weight.keys())
weights_1 = features_cum_weight[keys[0]]
weights_2 = features_cum_weight[keys[1]]

fig, ax = plt.subplots(nrows=2, ncols=2)
width = 0.35  # the width of the bars
n_plots = 4
batch = int(np.ceil(len(weights_1) / n_plots))

for i in range(n_plots):
    start = i * (batch + 1)
    stop = min(len(weights_1), start + batch + 1)
    cur_w1 = weights_1[start:stop]
    cur_w2 = weights_2[start:stop]
    ind = np.arange(len(cur_w1))
    cur_ax = ax[i // 2][i % 2]
    cur_ax.bar(ind - width/2, cur_w1, width, color='SkyBlue', label='Current Model')
    cur_ax.bar(ind + width/2, cur_w2, width, color='IndianRed', label='Previous Model')
    cur_ax.set_ylabel('Sum of Weights')
    cur_ax.set_title('Sum of all weights connected by feature')
    cur_ax.set_xticks(ind)
    cur_ax.legend()
    cur_ax.set_ylim(0, 30)
plt.show()
Resulting in the following plot:
MatPlotLib plot
I then try to compare the two vectors (see the sketch after this list) to deduce:
If the vectors have changed drastically, there might have been some major change in the training data or some problem while retraining the model.
If some value is close to zero, the model might have recognized that feature as unimportant.
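A small sketch of that comparison, using the weights_1 / weights_2 vectors computed above; the 0.5 and 10% thresholds are arbitrary placeholders, not part of the original experiment:
import numpy as np

# Relative change of each feature's summed |weight| between the two models.
eps = 1e-8
rel_change = np.abs(weights_1 - weights_2) / (np.abs(weights_2) + eps)

drifted = np.where(rel_change > 0.5)[0]                      # features that changed "drastically"
near_zero = np.where(weights_1 < 0.1 * weights_1.mean())[0]  # features with a very small summed weight

print("Features whose summed weight changed a lot:", drifted)
print("Features with a near-zero summed weight:", near_zero)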
I want your opinion and insights on the following:
The overall approach to this experiment.
Advice on other ideas for reverse engineering a given model.
Insights on the output I provide here.
Thank you all, I am open to any suggestions and criticism!
This type of deduction is not entirely valid. The combination of the features is not linear. It is true that a weight which is strictly 0 does not matter, but the feature may be recombined in another way in a deeper layer.
The deduction would hold if your model were linear. In fact, this is how PCA works: it searches for linear relationships through the covariance matrix, and the eigenvalues indicate the importance of each component.
I think there are several ways to confirm your suspicions:
Eliminate the features that you think are not important, train again, and look at the result. If it is similar, your suspicions are correct.
Apply the current model: take an example to evaluate (we will call it a pivot), significantly change the features that you consider irrelevant, and create many such examples. Do this for several pivots. If the results are similar, that field probably does not matter. Example (I consider the first feature to be irrelevant):
import numpy as np

data = np.array([[0.5, 1, 0.5], [1, 2, 5]])
range_values = 50
new_data = []
for i in range(data.shape[0]):  # each row of `data` is used as a pivot
    sample = data[i]
    # We create new samples by perturbing only the supposedly irrelevant feature
    for _ in range(1000):
        noise = np.random.rand() * range_values
        new_sample = sample.copy()
        new_sample[0] += noise
        new_data.append(new_sample)
