Number of target values in one prediction - Python

I use Python's scikit-learn module to predict some values from a CSV file, using RandomForestRegressor. For example, I have 8 training features and 3 values to predict - which of the following code variants should I use? Should I pass all target values at once (A) or train a separate model for each (B)?
Variant A:
from numpy import genfromtxt
from sklearn.ensemble import RandomForestRegressor

# Reading the CSV file
dataset = genfromtxt(open('Data/for training.csv','r'), delimiter=',', dtype='f8')[1:]
# Target values to predict
target = [x[8:11] for x in dataset]
# Feature values to train on
train = [x[0:8] for x in dataset]
# Start training
rf = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf.fit(train, target)
Variant B:
# Reading the CSV file
dataset = genfromtxt(open('Data/for training.csv','r'), delimiter=',', dtype='f8')[1:]
# Target values to predict
target1 = [x[8] for x in dataset]
target2 = [x[9] for x in dataset]
target3 = [x[10] for x in dataset]
# Feature values to train on
train = [x[0:8] for x in dataset]
# Start training
rf1 = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf1.fit(train, target1)
rf2 = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf2.fit(train, target2)
rf3 = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf3.fit(train, target3)
Which version is correct?
Thanks in advance!

Both are possible, but they do different things.
Variant A learns a joint model for all entries of y, while Variant B learns independent models for the different entries of y. If there are meaningful relations between the entries of y that can be learned, the joint model should be more accurate.
As you are training on very little data and don't regularize, I imagine you are simply overfitting in the joint case. I am not entirely sure about the splitting criterion in the regression case, but it takes much longer for a leaf to be "pure" if the label space is three-dimensional than if it is just one-dimensional. So you will learn more complex models that are not warranted by the little data you have.
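For reference, here is a minimal sketch of the two approaches with current scikit-learn (synthetic data; compute_importances is omitted because that argument has since been removed):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 8)   # 8 features
Y = rng.rand(100, 3)   # 3 target columns

# Variant A: one joint model trained on all three targets at once
rf_joint = RandomForestRegressor(n_estimators=300, random_state=0)
rf_joint.fit(X, Y)
print(rf_joint.predict(X[:2]).shape)    # (2, 3)

# Variant B: three independent models, one per target column
rf_single = [RandomForestRegressor(n_estimators=300, random_state=0).fit(X, Y[:, j])
             for j in range(Y.shape[1])]
preds = np.column_stack([m.predict(X[:2]) for m in rf_single])
print(preds.shape)                      # (2, 3)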

"8 train values and 3 values" is probably best expressed as "8 features and 3 target variables" in usual machine learning parlance.
Both variants should work and yield the similar predictions as RandomForestRegressor has been made to support multi output regression.
The predictions won't be exactly the same as RandomForestRegressor is a non deterministic algorithm though. But on average the predictive quality of both approaches should be the same.
Edit: see Andreas answer instead.

Related

SVM classifier n_samples, n_splits problem sklearn Python

I'm trying to predict volatility one step ahead with an SVM model, based on an example from the O'Reilly book (Machine Learning for Financial Risk Management with Python). When I copy the example exactly (with S&P 500 data) it works well, but now I'm having trouble with this chunk of code on a particular fund's returns data:
# returns
r = np.array([nan, 0.0013933, 0.00118874, 0.00076462, 0.00168565,
              -0.00018507, -0.00390753, 0.00307275, -0.00351472])
# horizon
t = 252
# mean of returns
mu = r.mean()
# critical value
z = norm.ppf(0.95)
# realized volatility
vol = r.rolling(5).std()
vol = pd.DataFrame(vol)
vol.reset_index(drop=True, inplace=True)
# SVM GARCH
r_svm = r ** 2
r_svm = r_svm.reset_index()
# inputs X (returns and realized volatility)
X = pd.concat([vol, r_svm], axis=1, ignore_index=True)
X = X.dropna().copy()
X = X.reset_index()
X.drop([1, 'index'], axis=1, inplace=True)
# labels y realized volatility shifted 1 period onward
vol = vol.dropna().reset_index()
vol.drop('index', axis=1, inplace=True)
# linear kernel
svr_lin = SVR(kernel='linear')
# hyperparameters grid
para_grid = {'gamma': sp_rand(),
             'C': sp_rand(),
             'epsilon': sp_rand()}
# svm classifier (regression?)
clf = RandomizedSearchCV(svr_lin, para_grid)
clf.fit(X[:-1].dropna().values,
        vol[1:].values.reshape(-1,))
# prediction
n_vol = clf.predict(X.iloc[-1:])
The raised error is:
ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=3.
The code works with longer return series, so I assume the problem is the length of the array, but I can't figure out how to solve it. Can someone help me with that?
This error is raised because you use RandomizedSearchCV with its default cv parameter.
By default, RandomizedSearchCV runs 5-fold cross-validation to find the best hyperparameters for the model.
5-fold cross-validation means splitting your training data into 5 subsets and training 5 different models based on these splits.
It looks like you have fewer than 5 samples in your training set, so splitting your data into 5 folds isn't possible.
To fix the issue you should either add more data or decrease the number of folds for RandomizedSearchCV by passing the cv parameter:
clf = RandomizedSearchCV(svr_lin, para_grid, cv=2)
I'd recommend collecting more data, since 4 data points most likely won't be enough to make the model accurate or predictive.
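To make the fix concrete, here is a minimal, self-contained sketch (synthetic toy data rather than the asker's fund returns) that avoids the error by passing cv=2:
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

# Tiny toy problem: 4 samples, 2 features.
# With only 4 samples, the default cv=5 would raise the n_splits > n_samples error.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.4, 0.3]])
y = np.array([0.01, 0.02, 0.03, 0.04])

para_grid = {'gamma': sp_rand(), 'C': sp_rand(), 'epsilon': sp_rand()}

# cv=2 keeps the number of folds below the number of samples
clf = RandomizedSearchCV(SVR(kernel='linear'), para_grid, cv=2, n_iter=5, random_state=0)
clf.fit(X, y)
print(clf.best_params_)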

Problems with inverse_transform scaled predictions and y_test in multi-step, multi-variate LSTM

I have built a multi-step, multi-variate LSTM model to predict the target variable 5 days into the future with 5 days of look-back. The model runs smoothly (even though it still needs to be improved), but I cannot correctly invert the transformation once I get my predictions.
I have seen on the web that there are many ways to pre-process and transform data. I decided to follow these steps:
Data fetching and cleaning
df = yfinance.download(['^GSPC', '^GDAXI', 'CL=F', 'AAPL'], period='5y', interval='1d')['Adj Close'];
df.dropna(axis=0, inplace=True)
df.describe()
Data set table
Split the data set into train and test
size = int(len(df) * 0.80)
df_train = df.iloc[:size]
df_test = df.iloc[size:]
Scaled train and test sets separately with MinMaxScaler()
scaler = MinMaxScaler(feature_range=(0,1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)
Creation of 3D X and y time-series compatible with the LSTM model
I borrowed the following function from this article
def create_X_Y(ts: np.array, lag=1, n_ahead=1, target_index=0) -> tuple:
    """
    A method to create X and Y matrix from a time series array for the training of
    deep learning models
    """
    # Extracting the number of features that are passed from the array
    n_features = ts.shape[1]
    # Creating placeholder lists
    X, Y = [], []
    if len(ts) - lag <= 0:
        X.append(ts)
    else:
        for i in range(len(ts) - lag - n_ahead):
            Y.append(ts[(i + lag):(i + lag + n_ahead), target_index])
            X.append(ts[i:(i + lag)])
    X, Y = np.array(X), np.array(Y)
    # Reshaping the X array to an RNN input shape
    X = np.reshape(X, (X.shape[0], lag, n_features))
    return X, Y
#In this example let's assume that the first column (AAPL) is the target variable.
trainX,trainY = create_X_Y(df_train_sc,lag=5, n_ahead=5, target_index=0)
testX,testY = create_X_Y(df_test_sc,lag=5, n_ahead=5, target_index=0)
Model creation
def build_model(optimizer):
    grid_model = Sequential()
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True,
                        input_shape=(trainX.shape[1], trainX.shape[2])))
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True))
    grid_model.add(LSTM(64, activation='tanh'))
    grid_model.add(Dropout(0.2))
    grid_model.add(Dense(trainY.shape[1]))
    grid_model.compile(loss='mse', optimizer=optimizer)
    return grid_model

grid_model = KerasRegressor(build_fn=build_model, verbose=1, validation_data=(testX, testY))
parameters = {'batch_size': [12, 24],
              'epochs': [8, 30],
              'optimizer': ['adam', 'Adadelta']}
grid_search = GridSearchCV(estimator=grid_model,
                           param_grid=parameters,
                           cv=3)
grid_search = grid_search.fit(trainX, trainY)
grid_search.best_params_
my_model = grid_search.best_estimator_.model
Get predictions
yhat = my_model.predict(testX)
Invert transformation of predictions and actual values
Here my problems begin, because I am not sure which way to go. I have read many tutorials, but it seems that those authors prefer to apply MinMaxScaler() to the entire dataset before splitting the data into train and test. I do not agree with this, because otherwise the training data would be incorrectly scaled with information we should not use (i.e. the test set). So I followed my own approach, but I am stuck here.
I found this possible solution on another post, but it's not working for me:
# invert scaling for forecast
pred_scaler = MinMaxScaler(feature_range=(0, 1)).fit(df_test.values[:,0].reshape(-1, 1))
inv_yhat = pred_scaler.inverse_transform(yhat)
# invert scaling for actual
inv_y = pred_scaler.inverse_transform(testY)
In fact, when I double-check the last values of the target in my original data set, they don't match the inverse-scaled version of testY.
Can someone please help me on this? Many thanks in advance for your support!
A few things could be mentioned here. First, you cannot inverse-transform something the scaler did not see. This happens because you use two different scalers: the NN predicts values in the range of scaler 1, and there is no guarantee that these lie within the range of scaler 2 (fitted on the test data). Second, best practice is to fit your scaler on the training set and use that same scaler (transform only) on the test data as well; then you should be able to inverse-transform your test results. Third, if the scaling goes off because the test set has completely different values (as happens, e.g., with live streaming data), it is up to you to deal with it; for example, the min-max scaler will then produce values > 1.0.
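As a rough sketch of the second point (reusing the question's variable names; treating column 0 as the AAPL target is an assumption carried over from the question), one option is to keep the feature scaler fitted on the training set and add a second scaler fitted only on the training target column, so that predictions can be inverse-transformed:
from sklearn.preprocessing import MinMaxScaler

# Feature scaler: fit on the training set only, then reuse it on the test set
scaler = MinMaxScaler(feature_range=(0, 1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)

# Separate scaler for the target column only (assumed to be column 0, AAPL),
# also fitted on the training data, never on the test data
target_scaler = MinMaxScaler(feature_range=(0, 1))
target_scaler.fit(df_train.values[:, 0].reshape(-1, 1))

# ... build X/y and train the model as above, then invert the predictions:
# yhat and testY have shape (n_samples, n_ahead), so reshape to a single column first
inv_yhat = target_scaler.inverse_transform(yhat.reshape(-1, 1)).reshape(yhat.shape)
inv_y = target_scaler.inverse_transform(testY.reshape(-1, 1)).reshape(testY.shape)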

Have I made a mistake in the for loop of my Python code? Model accuracy is too high, so I'm double-checking

I am building a quant model that takes a bunch of features and predicts the performance of an index. The model is doing exceptionally well, which obviously makes me wonder if I am making some mistake.
I have looked at the underlying features that I am using to ensure there is no data leakage. So now my attention is turning towards my code. Below is the main body of code that I use for prediction.
Does anything look wrong in the looping or how I am predicting? Please let me know if you need any more information and I will share what I can share.
X -> Features used in model training and prediction
y -> Class variable (1,0)
n_record -> Number of records in the dataset
n_train -> Amount of data to use for training in the rolling window construct
model -> Ensemble model from sklearn
My training data is roughly 4,500 records. I use an n_train of 800 to train the first instance of the model, and then a rolling window of 800 records for training to predict the 801st instance (and so on). In that way I roll through time, leaving out very old data (keeping the model "current").
col_names = ['Pred', 'Actual', 'Pred Prob']  # Column names for prediction output dataframe

def Strategy(n_train):
    list_ans = []
    n_records = len(X)  # Number of records in X
    for i in range(n_train, n_records):
        # Rolling window: train on the previous n_train records and predict tomorrow's performance
        X_train, X_test, y_train, y_test = X[i-n_train:i], X[i:i+1], y[i-n_train:i], y[i:i+1]
        X_train = ss.fit_transform(X_train)
        X_test = ss.transform(X_test)
        model.fit(X_train, y_train)
        Pred = model.predict(X_test)
        Actual = y_test.values
        Prob = model.predict_proba(X_test)[:, 1]
        i_ans = [Pred.item(), Actual.item(), Prob.item()]
        resi = pd.Series(data=i_ans, index=col_names)
        list_ans.append(resi)
    return pd.DataFrame(list_ans)
What values do you expect for n_records and n_train? Keep in mind that n_train acts as the minimum value of the range, so the first n_train records are only ever used for training and are never predicted. I don't know if this is how it should be, but be careful: you may be skipping training data.
Apart from that, it looks good to my eyes!
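To make the point about the range concrete, a tiny sketch (with hypothetical n_train and n_records values matching the description above) prints the window boundaries, showing that each prediction at index i is trained only on records [i - n_train, i) and that records 0..n_train-1 are never predicted:
n_train, n_records = 800, 4500  # hypothetical values matching the description above

for i in (n_train, n_train + 1, n_records - 1):
    print(f"predict record {i}: trained on records [{i - n_train}, {i})")
# predict record 800: trained on records [0, 800)
# predict record 801: trained on records [1, 801)
# predict record 4499: trained on records [3699, 4499)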

Pseudo Labelling on Text Classification Python

I'm not good at machine learning. Can someone tell me how to do text classification with pseudo-labelling in Python? I don't know the right implementation; I have searched everywhere on the internet but couldn't find anything :'( I only found implementations for numeric datasets, not for text classification (vectorized text). So I wrote the code below, but I don't know whether it is correct or not. Am I doing it wrong? Please help me, I really need your help :'(
This is my dataset if you want to try it. I want to classify 'Label' from 'Content'.
My steps are:
Split the data: 0.75 unlabeled, 0.25 labeled
From the 0.25 labeled, split again: 0.75 labeled train and 0.25 labeled test
Make a vectorizer for the train, test and unlabeled datasets
Build the first model from the labeled train data, then label the unlabeled dataset
Concatenate the labeled train data with the unlabeled predictions that have probability > 0.99 (pseudo-labeled), and build the second model
Remove the pseudo-labeled instances from the unlabeled dataset
Predict the remaining unlabeled data with the second model, then iterate from step 3 until the probability of the predicted pseudo-labels is < 0.99.
This is my code:
Performing pseudo labelling on text classification
from sklearn.naive_bayes import MultinomialNB

# Initiate iteration counter
iterations = 0
# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []
# Assign value to initiate while loop
high_prob = [1]

# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:
    # Set the vector transformer (from data train)
    columnTransformer = ColumnTransformer([
        ('tfidf', TfidfVectorizer(stop_words=None, max_features=100000),
         'Content')
    ], remainder='drop')

    def transforms(series):
        before_vect = pd.DataFrame({'Content': series})
        vector_transformer = columnTransformer.fit(pd.DataFrame({'Content': X_train}))
        return vector_transformer.transform(before_vect)

    X_train_df = transforms(X_train)
    X_test_df = transforms(X_test)
    X_unlabeled_df = transforms(X_unlabeled)

    # Fit classifier and make train/test predictions
    nb = MultinomialNB()
    nb.fit(X_train_df, y_train)
    y_hat_train = nb.predict(X_train_df)
    y_hat_test = nb.predict(X_test_df)

    # Calculate and print iteration # and f1 scores, and store f1 scores
    train_f1 = f1_score(y_train, y_hat_train)
    test_f1 = f1_score(y_test, y_hat_test)
    print(f"Iteration {iterations}")
    print(f"Train f1: {train_f1}")
    print(f"Test f1: {test_f1}")
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)

    # Generate predictions and probabilities for unlabeled data
    print(f"Now predicting labels for unlabeled data...")
    pred_probs = nb.predict_proba(X_unlabeled_df)
    preds = nb.predict(X_unlabeled_df)
    prob_0 = pred_probs[:, 0]
    prob_1 = pred_probs[:, 1]

    # Store predictions and probabilities in dataframe
    df_pred_prob = pd.DataFrame([])
    df_pred_prob['preds'] = preds
    df_pred_prob['prob_0'] = prob_0
    df_pred_prob['prob_1'] = prob_1
    df_pred_prob.index = X_unlabeled.index

    # Separate predictions with > 99% probability
    high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
                           df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
                          axis=0)
    print(f"{len(high_prob)} high-probability predictions added to training data.")
    pseudo_labels.append(len(high_prob))

    # Add pseudo-labeled data to training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
    y_train = pd.concat([y_train, high_prob.preds])

    # Drop pseudo-labeled instances from unlabeled data
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")

    # Update iteration counter
    iterations += 1
I think I'm doing something wrong, because when I look at the F1 scores they are decreasing. Please help me guys :'( I'm stressed.
f1 scores image
=================EDIT=================
So I searched the literature, and I think I had misunderstood the concept of data splitting in pseudo-labelling.
I initially thought that the steps start by splitting the data into labeled and unlabeled sets, and then that labeled data is split into train and test.
But after more searching, I found in this journal that my steps were incorrect. The journal says that pseudo-labelling should start by splitting the data into train and test sets first, and then splitting that train set into labeled and unlabeled datasets.
According to the journal, the best results are reached by splitting the data into 90% train and 10% test sets, and then splitting that 90% train set into 20% labeled and 80% unlabeled data. The journal tries confidence thresholds from 0.7 to 0.9 as the acceptance boundary for pseudo-labels, and with that split proportion the best threshold value is 0.74. So I fixed my steps with the new proportion and the 0.74 threshold, and my F1 scores finally increased. Here are my steps:
Split the data: 0.9 train, 0.1 test sets (I labeled the test set, so I can measure the F1 scores)
From the 0.9 train set, split again: 0.2 labeled and 0.8 unlabeled data
Make a vectorizer for the X values of the labeled train, test and unlabeled training datasets
Build the first model from the labeled train data, then label the unlabeled training data, and measure the F1 score against the (already labeled) test set
Concatenate the labeled train data with the unlabeled predictions that have probability > 0.74 (the threshold from the journal); we call this new data pseudo-labeled, treated like the actual labels, and build the second model from the new train data
Remove the selected pseudo-labeled instances from the unlabeled dataset
Use the second model to predict the remaining unlabeled data, then iterate from step 3 until no predicted pseudo-label has probability > 0.74
The last model is then the final one.
My code is still the same; I just changed the split proportions, and my F1 scores finally increased over 4 iterations: my new f1 scores.
Am I doing something right? Thank you for all of your attention, guys. Thank you so much.
I'm not good at machine learning.
Overall I would say that you are quite good at machine learning: semi-supervised learning is an advanced type of problem, and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (sorry, I don't have time to analyze the code in detail). A few comments:
One thing which might be improvable is the 0.74 threshold: this value certainly depends on the data, so you could run your own experiment with different threshold values and select the one which works best for your data (a rough sketch of such a sweep follows below).
Preferably it would be better to keep a final test set aside and use a separate validation set during the iterations. This would avoid the risk of data leakage.
I'm not sure about the stop condition for the loop. It might be fine, but it could be worth trying other options:
Simply iterate a fixed number of times (for instance 10 times).
Base the stop condition on "no more F1-score improvement" (i.e. stabilization of the performance), but that's a bit more advanced.
It's pretty good anyway; my comments are just ideas if you want to improve further. Note that it's been a long time since I've worked with semi-supervised learning, so I'm not sure I remember everything very well ;)
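As a rough illustration of the threshold sweep (run_pseudo_labelling and the X/y names below are hypothetical placeholders standing in for the loop in the question; the sweep scores each threshold on a held-out validation set, not the final test set):
from sklearn.metrics import f1_score

# Hypothetical helper: assumed to wrap the pseudo-labelling loop above and
# return a classifier fitted with the given confidence threshold.
best_threshold, best_f1 = None, -1.0
for threshold in [0.70, 0.74, 0.80, 0.85, 0.90]:
    model = run_pseudo_labelling(X_labeled, y_labeled, X_unlabeled, threshold=threshold)
    val_f1 = f1_score(y_val, model.predict(X_val_df))
    print(f"threshold={threshold:.2f} -> validation F1={val_f1:.3f}")
    if val_f1 > best_f1:
        best_threshold, best_f1 = threshold, val_f1
print(f"Best threshold: {best_threshold} (F1={best_f1:.3f})")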

PCA applied to MFCC for feeding a GMM classifier (sklearn library)

I'm facing a (probably simple) problem where I have to reduce the dimensionality of my feature vectors using PCA. The main point of all of this is to create a classifier that predicts a sentence composed of phonemes. I train my model on hours of sentences pronounced by people (there are only 10 sentences); each sentence has a label composed of a set of phonemes (see below).
What I have done so far is the following:
import glob
import numpy as np
import scipy.io.wavfile as wav
import mdp
from sklearn import mixture
from features import mfcc

def extract_mfcc():
    X_train = []
    directory = test_audio_folder
    # Iterate through each .wav file and extract the mfcc
    for audio_file in glob.glob(directory):
        (rate, sig) = wav.read(audio_file)
        mfcc_feat = mfcc(sig, rate)
        X_train.append(mfcc_feat)
    return np.array(X_train)

def extract_labels():
    Y_train = []
    # here I have all the labels - each label is a sentence composed by a set of phonemes
    with open(labels_files) as f:
        for line in f:  # Ex: line = AH0 P IY1 S AH0 V K EY1 K
            Y_train.append(line)
    return np.array(Y_train)

def main():
    __X_train = extract_mfcc()
    Y_train = extract_labels()
    # Now, according to every paper I read, I need to reduce the dimensionality
    # of my mfcc vectors before feeding my Gaussian mixture model
    X_test = []
    for feat in __X_train:
        pca = mdp.pca(feat)
        X_test.append(pca)
    n_classes = 10  # I'm trying to predict only 10 sentences (each composed of the phonemes described above)
    gmm_classifier = mixture.GMM(n_components=n_classes, covariance_type='full')
    gmm_classifier.fit(X_train)  # error here! reason: each "pca" appended above has a different shape (same number of columns though)
How can I reduce the dimensionality and, at the same time, get the same shape for each PCA output that I extract?
I also tried something else: calling gmm_classifier.fit(...) within the for loop where I obtain the PCA vector (see code below). The fit() call works, but I'm not sure whether I'm actually training the GMM correctly or not.
n_classes = 10
gmm_classifier = mixture.GMM(n_components=n_classes, covariance_type='full')
X_test = []
for feat in __X_train:
    pca = mdp.pca(feat)
    gmm_classifier.fit(pca)  # in this way it works, but I'm not sure if the model is actually trained correctly
Thanks a lot
Regarding your last comment/question:
gmm_classifier.fit(pca)  # in this way it works, but I'm not sure if the model is actually trained correctly
Every time you call fit, the classifier forgets the previous information and is trained only on the last data it saw. Try appending the features inside the loop and then fitting once.
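A minimal sketch of that suggestion, keeping the deprecated mixture.GMM API from the question (in current scikit-learn this would be mixture.GaussianMixture):
# Stack all per-utterance PCA features into one matrix and fit the GMM once,
# instead of refitting it inside the loop (each call to fit discards the previous fit).
all_feats = []
for feat in __X_train:
    all_feats.append(mdp.pca(feat))
all_feats = np.vstack(all_feats)   # rows: frames from every utterance; same number of columns throughout

gmm_classifier = mixture.GMM(n_components=n_classes, covariance_type='full')
gmm_classifier.fit(all_feats)      # single fit on the stacked features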
