I have a Fan Speed (RPM) dataset of 192,405 values (train + test). I am training an ARIMA model, trying to forecast the remaining future values of the dataset, and comparing the results.
When I forecast over the test data I get a straight line for the predictions.
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima_model import ARIMA  # older statsmodels ARIMA API

# df holds the timestamp column 'Tarih' and the fan-speed column 'Value'
dfx = df[(df['Tarih'] > '2020-07-23') & (df['Tarih'] < '2020-10-23')]
X_train = dfx[:int(dfx.shape[0] * 0.8)]  # first 2 months
X_test = dfx[int(dfx.shape[0] * 0.8):]   # rest, about 1 month
model = ARIMA(X_train.Value, order=(4, 1, 4))
model_fit = model.fit(disp=0)
print(model_fit.summary())
test = X_test
train = X_train
What could I do now?
Your ARIMA model uses the last 4 observations to make a prediction. The first prediction is based on the last four known data points; the second on the first prediction and the last three known points; the third on the first two predictions and the last two known points, and so on. The fifth prediction is based entirely on predicted values, and the hundredth is based on predictions of predictions of predictions. Each prediction deviates slightly from the actual values, and these errors accumulate over time. This often leads to ARIMA simply predicting a straight line when you forecast over such a large horizon.
If your model uses the MA component, represented by the q parameter, it can only meaningfully predict q steps into the future. That means your model is only able to predict the next four data points; after that the forecast converges to a straight line. A common workaround is a rolling one-step-ahead forecast over the test period, sketched below.
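A minimal sketch of such a rolling forecast, assuming the X_train/X_test split from the question and using the newer statsmodels.tsa.arima.model.ARIMA API (re-fitting at every step is slow for a series this long, so in practice you may want results.append or a shorter window):
from statsmodels.tsa.arima.model import ARIMA

history = list(X_train.Value)
rolling_preds = []
for actual in X_test.Value:
    fit = ARIMA(history, order=(4, 1, 4)).fit()
    rolling_preds.append(fit.forecast(steps=1)[0])  # predict only one step ahead
    history.append(actual)  # condition the next forecast on the real observation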
I'm trying to predict what a 15-minute delay in flight departure does to the flight's arrival time. I have thousands of rows as well as several columns in a DF. Two of these columns are dep_delay and arr_delay for departure delay and arrival delay. I have built a simple LinearRegression model:
from sklearn.linear_model import LinearRegression

y = nyc['dep_delay'].values.reshape((-1, 1))
arr_dep_model = LinearRegression().fit(y, nyc['arr_delay'])
And now I'm trying to find out the predicted arrival delay if the flights departure was delayed 15 minutes. How would I use the model above to predict what the arrival delay would be?
My first thought was to use a for loop / if statement, but then I came across .predict() and now I'm even more confused. Does .predict work like a boolean, where I would use "if departure delay is equal to 15, then arrival delay equals y"? Or is it something like:
arr_dep_model.predict(y)?
When working with LinearRegression models in sklearn you perform inference with the predict() function. You also have to make sure the input you pass to the function has the correct shape: the same number of features, in the same 2-D layout, as the training data. You can learn more about the proper use of the predict function in the official documentation.
arr_dep_model.predict(yourInput)
This line outputs the value the model predicts for the corresponding input. You can also place it inside a for loop and traverse a set of input values; whether that is useful depends on your project and the data you are working with.
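For the specific question, a 15-minute delay is just a single input row. A minimal sketch, assuming the arr_dep_model fitted above with dep_delay as the only feature:
import numpy as np

# predict() expects a 2-D array shaped like the training input, i.e. (n_samples, 1)
delay = np.array([[15]])  # one sample, one feature: a 15-minute departure delay
predicted_arr_delay = arr_dep_model.predict(delay)
print(predicted_arr_delay[0])  # estimated arrival delay in minutes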
Check the code below for an example:
import pandas as pd
import random
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X_train = df[['x1','x2']][:-3].values #Training on top 7 rows
y_train = df['y'][:-3].values #Training on top 7 rows
X_test = df[['x1','x2']][-3:].values # Values on which prediction will happen - bottom 3 rows
regr = LinearRegression()
regr.fit(X_train, y_train)
regr.predict(X_test)
Notice that X_test, the data on which prediction happens, has the same shape (the same number of columns) as X_train: both have the two feature columns ['x1', 'x2'], and .values converts them to arrays. You can also create your own data (a two-column DataFrame in this example) and use it for prediction, since the third column is the one being predicted; a sketch of that follows below.
The output will be three values, one prediction for each of the three test rows.
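A minimal sketch of predicting on your own new rows, assuming the regr model fitted above (the values here are arbitrary):
import pandas as pd

# New data only needs the same two feature columns the model was trained on
new_rows = pd.DataFrame({'x1': [10, 55], 'x2': [20, 80]})
print(regr.predict(new_rows.values))  # one predicted y per row (roughly 0.5 * x2 here)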
I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import pandas as pd
import statsmodels.api as sm

# Step 1) Load data into a dataframe
df = pd.read_csv('my_data.csv')

# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df['dependent_variable']

# Step 3) Fit a linear regression using OLS
model = sm.OLS(y, X).fit()

predictions = model.predict(X)  # make predictions
predictions
I'm not sure what predictions is showing. Is it predicting the next x rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model on your data, which is most likely interpreted as an array. The predict method returns an array of fitted values given the trained model.
In other words, from statsmodels documentation:
Return linear predicted values from a design matrix.
It is similar to sklearn. After model = sm.OLS(y, X).fit() you have a fitted model; predictions = model.predict(X) does not predict the next x rows, it produces predictions from your X, the training dataset. The model fitted by ordinary least squares is a function of x, and the output is:
$$ \hat{y}=f(x) $$
If you want predictions for new X, you need to split X into training and testing datasets and pass the held-out part to predict, for example as sketched below.
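A minimal sketch of that split, reusing the column names from the question (the 80/20 split and the added intercept are assumptions for illustration):
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

df = pd.read_csv('my_data.csv')
X = sm.add_constant(df[['independent_variable']])  # add an intercept term (the original snippet omits one)
y = df['dependent_variable']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = sm.OLS(y_train, X_train).fit()
in_sample = model.predict(X_train)      # fitted values for the rows the model was trained on
out_of_sample = model.predict(X_test)   # predictions for rows the model has not seen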
Actually you are doing it wrong: you are predicting on the same X the model was fitted on. The predict method is meant to produce predictions for new inputs. After separating the dependent and independent variables, you can split the data into a train and a test part:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
This makes X_train 80% of the total data, containing only the independent variables, and you can pass X_test to the predict method and compare the result with Y_test to check how well the model is performing.
I am building a churn prediction model with logistic regression in python. My model accuracy is 0.47 and only predicts 0s. The realized y variable is actually 81 zeros and 92 ones.
The dataset I have has only a few features and 220 users (records). If I set a reference time, it is even less (about 123 records for the training set and 173 for the testing set). So I think the sample size is too small for logistic regression, but I still tried because this is just a sample test, so I only have this small dataset. (Theoretically there is more data.)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy: {:.2f}'.format(logreg.score(x_test, y_test)))
Even if I don't test the model, meaning I use the whole dataset to build it, when I predict future churn it still returns only 0s.
Is it that my sample size is too small, or is it because the accuracy is less than 0.5 that it just returns one value (0 here)? Or did I do something wrong in the code?
Thanks very much!
There are several potential causes for heavily biased predictions from a logistic regression model. For the benefit of a general audience, I will list the most common ones, even though some of them may not apply to your case.
(Skewed output distribution) Your training data has a biased, imbalanced label distribution. If your training set contains, for example, 1 positive and 100,000 negatives, the bias/intercept term of the regression will be very small, and after applying the link function the predicted probabilities can be practically zero.
(Sparsity) The feature space is large and your dataset is small, so the training data is sparse. Most new incoming data points then look unlike anything seen before. In the worst case, in which all features are categorical, unseen factor values produce zeros because the correct one-hot column cannot be identified.
(Skewed input distribution) The feature space is small and your dataset is dense around a small region. If that region happens to contain mostly zeros, the predictions will always be zero, even for future inputs. For example, suppose my data X has two columns, gender and age, most of my data points are 30-year-old males, and 80 out of 100 30-year-old males like ice cream in a 101-point dataset. The model will then predict that 30-year-old males like ice cream for future inputs, which, assuming a similar input distribution, will mostly be 30-year-old males.
You should check the distribution of scores using the predict_proba function, and check the distribution of the input features using something like a pairplot.
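A small diagnostics sketch along those lines, assuming the logreg, x_train/x_test and y_train/y_test objects from the question (the class_weight='balanced' retrain is one common follow-up if the scores cluster just below 0.5):
import numpy as np
from sklearn.linear_model import LogisticRegression

# Distribution of predicted probabilities for class 1: are they all just under 0.5?
probs = logreg.predict_proba(x_test)[:, 1]
print(np.percentile(probs, [5, 25, 50, 75, 95]))

# Re-weighting the classes can help when the labels or scores are imbalanced
weighted = LogisticRegression(class_weight='balanced', max_iter=1000)
weighted.fit(x_train, y_train)
print('accuracy with balanced class weights:', weighted.score(x_test, y_test))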
I am trying to use OneVsRestClassifier on my dataset. I extracted the features the model will be trained on and fitted a linear SVC on them. After fitting, when I try to predict on the same data the model was fitted on, I get all zeros. Is this because of an implementation issue, or because my feature extraction is not good enough? I think that since I am predicting on the same data the model was fitted on, I should get close to 100% accuracy, but instead my model predicts all zeros. Here is my code:
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# arrFinal contains all the features and the labels. The last 16 columns are labels,
# the features run from column 1 to 521, and the 17th column from the end is not used.
X = np.array(arrFinal[:, 1:-17])
X = X.astype(float)
Xtest = np.array(X)
Y = np.array(arrFinal[:, 522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear'))
clf.fit(X, Y)
ans = clf.predict(Xtest)
print(ans)
print("\n\n\n")
Is there something wrong with my implementation of OneVsRestClassifier?
After looking at your data, it appears the feature values may be too small relative to the C value. Try using sklearn.preprocessing.StandardScaler:
from sklearn.preprocessing import StandardScaler

X = np.array(arrFinal[:, 1:-17])
X = X.astype(float)
scaler = StandardScaler()
X = scaler.fit_transform(X)
Xtest = np.array(X)
Y = np.array(arrFinal[:, 522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear', C=100))
clf.fit(X, Y)
ans = clf.predict(Xtest)
print(ans)
print("\n\n\n")
From here, you should tune C using cross-validation, either with a learning curve or a grid search; a sketch follows below.
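A minimal grid-search sketch, assuming the scaled X and multilabel Y from above (the candidate C values and the modern sklearn.model_selection import path are assumptions):
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# 'estimator__C' reaches the C of the SVC wrapped inside OneVsRestClassifier
param_grid = {'estimator__C': [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(OneVsRestClassifier(SVC(kernel='linear')), param_grid, cv=3)
search.fit(X, Y)
print(search.best_params_, search.best_score_)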
I am working on a program where I have some data (labeled and unlabeled) and 2 different groups ("artritis" and "fibro"). I would like to obtain the classifier's accuracy and then classify the unlabeled data. My problem is that I am testing two classifiers (LDA and QDA). With the first one I obtain an accuracy of 81%, and when I classify the unlabeled data (39 objects) it classifies everything correctly. However, with QDA I obtain an accuracy of 93.74%, but when it classifies the unlabeled data (the same 39 objects) it labels 3 of them with the wrong group. Can someone help me find my errors?
My code:
#"listaTrain" has a list of dictionaries which are the labeled data and will be used for
# training and Cross-Validation
#"listaLabels" has a list of the train labels
#"listaClasificar" has a list of dictionaries which are the unlabeled data
# which I want to label
#"clasificador" is my classifier
X=vec.fit_transform(listaTrain) #I transform the dictionaries to
#a format that sklearn can use
X=preprocessing.scale(X.toarray()) #I scale the values
clasificador.fit(X, listaLabels) #I train the classifier with the train data and
# the train labels
n_samples = X.shape[0]
cv = cross_validation.ShuffleSplit(n_samples, n_iter=300, test_size=0.6, random_state=4)
#I make Cross-Validation dividing the X's data (40% for training and 60% for testing)
scores = cross_validation.cross_val_score(clasificador, X, listaLabels, cv=cv)
#I obtain the Cross-Validation accuracy
scores.mean() #I obtain the mean accuracy (here is where I obtain 81% and 93%)
testX=vec.transform(listaClasificar) #I transform the dictionaries to a
#format that sklearn can use
testX=preprocessing.scale(testX.toarray()) #I scale the values
predicted=clasificador.predict(testX) #I predict the labels of the unlabeled data