Unable to understand the prediction algorithm in pandas - python

I am new to machine learning. I am running some code that analyses a data set using pandas and quandl. The code runs fine and produces output, but I am unable to understand two lines of it. Here is the code:
import pandas as pd
import quandl
import math
df = quandl.get('WIKI/GOOGL')
df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume',]]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close'])/ df['Adj. Close']*100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close','HL_PCT','PCT_change','Adj. Volume']]
forecast_col = 'Adj. Close'
# fill the NaN values
df.fillna(-99999,inplace=True)
# I am unable to understand this line
forecast_out = int(math.ceil(0.02*len(df)))
# I am unable to understand this line
df['label'] = df[forecast_col].shift(-forecast_out)
df.dropna(inplace=True)
print(df.head())
I am unable to understand the purpose of 0.02 in the ceil call, why this code uses the shift function, and why it uses -forecast_out, because forecast_out is giving some different values. Also, we have already filled the NaN positions with some data, so why do we then drop NaN? Please help.

I was following the same tutorial and got stuck on the same problem. Here is how I figured it out:
math.ceil() rounds up to the nearest integer, for example:
math.ceil(4.5)
returns:
5
In the code, it is applied to:
(0.02*len(df))
len(df) is the number of rows in the dataset, which in this case is 3424:
print(len(df))
In other words, we have data for 3424 days. We want to forecast the future, but obviously not 3424 days ahead. Instead we look a short distance ahead, in our case 69 days (2% of the total data: 0.02 * 3424 = 68.48, which rounds up to 69), to see what the price will be at that point.
So to wrap this up :
forecast_out = int(math.ceil(0.02*len(df)))
Equals 69
Now we will use the variable forecast_out to determine the label:
df['label'] = df[forecast_col].shift(-forecast_out)
This shifts the column up by 69 rows, so the label in each row is the stock price 69 days later.
Here is the code with more details you can try to play around with it.
forecast_col ='Adj. Close'
df.fillna(-99999,inplace=True)
forecast_out=int(math.ceil(0.02*len(df)))
print ("Dataset= " + str(len(df)))
print ("Forecasting after how many days = " + str(forecast_out))
df['label']=df[forecast_col].shift(-forecast_out)
df.dropna(inplace=True)
print(df.tail())
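The question about dropna can be illustrated on a toy frame. This is only a sketch: the prices are made up, and the 0.25 fraction is chosen so the numbers stay small (the tutorial uses 0.02). It shows that shift(-forecast_out) itself creates new NaN values in the last forecast_out rows, because those rows have no future price yet, and that dropna removes exactly those rows.

```python
import math
import pandas as pd

# Made-up prices, only for illustration (not the WIKI/GOOGL data)
prices = pd.DataFrame(
    {"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]}
)

# 0.25 * 10 = 2.5, which math.ceil rounds up to 3
forecast_out = int(math.ceil(0.25 * len(prices)))

# A negative shift moves values up: row i receives the price from row i + 3
prices["label"] = prices["Adj. Close"].shift(-forecast_out)

# The last 3 rows have no price 3 rows later, so their label is NaN
n_missing = int(prices["label"].isna().sum())

# dropna removes the rows whose label is unknown
trimmed = prices.dropna()

print(forecast_out, n_missing, len(trimmed))
```

The earlier fillna(-99999) call only handles gaps that were already in the downloaded data; the NaN values removed here are the ones introduced by the shift.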

Often in machine learning, you'll have data samples, and each sample has features and a label (many APIs, such as scikit-learn, expect this). In your case, each sample is a row of your dataframe. The value to predict is the forecast_col. Since you're looking at stock data, you want to predict what will happen in the future. It's meaningless to "predict" what's happening now (you can just observe it). The forecast_out value is an arbitrary choice; in this case it says how far in advance you will predict the 'Adj. Close'.
The shift method aligns each row's observations with the future value to predict, which is stored in the 'label' column. Then with this dataframe you can easily use scikit-learn to fit a model.
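from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# features are the current observations; the target is the shifted 'label' column
lr.fit(df[['HL_PCT','PCT_change','Adj. Volume']], df['label'])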
This model will make predictions from the current observed values about what's going to happen forecast_out days from now.

Related

Own RSI calculation differs from Altcoin Tradingview RSI ... Does anyone know why?

I've tried several ways to get the same RSI as Tradingview. The funny thing is that my own calculated RSI matches, for example, the Bitcoin-related RSIs perfectly. But when I try to calculate the RSI for altcoins, it's different. I have tried different EMAs/RMAs, an Excel recreation and of course Python. I even tried cross RSIs (e.g. RSI = 0.6 RSI-XRP + 0.4 RSI-BTC), but never got the same result.
Does anyone know how Tradingview is calculating the AltCoin RSIs?
Thank you in advance,
Best regards,
Domi
The calculation of the RSI should be the same for any kind of data in Tradingview. In Pinescript the RSI can be calculated from scratch as follows:
pine_rsi(x, y) =>
    u = max(x - x[1], 0) // upward change
    d = max(x[1] - x, 0) // downward change
    rs = rma(u, y) / rma(d, y)
    res = 100 - 100 / (1 + rs)
    res
If the results differ, it might be due to rounding errors. Another cause might be the use of data different from that provided by Tradingview.
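For comparison outside Tradingview, the Pine function above can be approximated in pandas. This is a minimal sketch, not Tradingview's own code: the name rsi_wilder is hypothetical, and it assumes that rma(x, y) can be approximated by an exponentially weighted mean with alpha = 1/y. The start of the series is seeded differently in that approximation, so early values may not match exactly.

```python
import pandas as pd

def rsi_wilder(close: pd.Series, length: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (hypothetical helper, see assumptions above)."""
    delta = close.diff()
    up = delta.clip(lower=0)        # upward change, like max(x - x[1], 0)
    down = (-delta).clip(lower=0)   # downward change, like max(x[1] - x, 0)
    # Exponentially weighted mean with alpha = 1/length as a stand-in for rma
    avg_up = up.ewm(alpha=1 / length, adjust=False).mean()
    avg_down = down.ewm(alpha=1 / length, adjust=False).mean()
    rs = avg_up / avg_down
    return 100 - 100 / (1 + rs)

# Sanity check on monotone series: only gains gives 100, only losses gives 0
rising = rsi_wilder(pd.Series(range(1, 31), dtype=float))
falling = rsi_wilder(pd.Series(range(30, 0, -1), dtype=float))
print(rising.iloc[-1], falling.iloc[-1])
```

If a result from this sketch matches one exchange's candles but not Tradingview's chart, that would point towards a difference in the input data rather than in the formula.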
They are using a smoothed RSI formula. I checked it against the one year chart which uses daily bars.
import yfinance as yf
import talib as ta
# get data
ticker = yf.Ticker("BTC-USD")
period = '10y'
interval = '1d'
data = ticker.history(interval=interval, period=period)
df = data.reset_index()
df = df.rename(columns={"index": "Date"})
df['RSI_Ta'] = ta.RSI(df['Close'], timeperiod=14)
df
It is the same as Yahoo data somehow. Which is strange because I buy and sell as a market maker for higher and lower prices most every day.

Small dataset not working well with prophet

I am trying to use Prophet, the forecasting package by Facebook, and it works great with, say, 150 rows of data. But when I try to model with fewer than 100 rows, it gives me very weird predictions. When I do it in R, it gives me the same prediction for all dates, and when I do it in Python, it gives me very bad predictions.
My data is weekly from 2018 week 1 to 2019 week 40.
This is my code:
(python)
from fbprophet import Prophet  # in newer releases the package is named prophet
predictionSize = 6
# hold out the last predictionSize rows as the test set
new_train_df = data[:-predictionSize]
new_test_df = data[-predictionSize:]
m_new = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m_new.fit(new_train_df)
new_future = m_new.make_future_dataframe(periods=predictionSize, freq='W')
new_forecast = m_new.predict(new_future)
new_ypred = new_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(predictionSize)
Using this code gives me negative values for yhat.
My question is: are the predictions bad because the dataset is too small for Prophet?
Do let me know if you need any other information. The data has weekly seasonality and yearly seasonality.

How to save predicted regression values inside a for loop?

I'm trying to use statsmodels to run separate logistic regressions for each "group" in a pandas dataframe and save the predicted probabilities for each observations (row). Each "group" represents about 2500 respondents or observations; I would like to get the predicted probability for each respondent - similar to how with SPSS you can "save" predicted probabilities when running a logistic regression.
I've read what others have attempted, but nothing seems to work. I'm using SPSS to check that the looping operation in Python is working correctly - the predicted probabilities should be the same (SPSS has a split function which makes this really easy).
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
df = pd.read_csv('test_data.csv')
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df)
    print(est_result.summary())
    df['pred'] = pred
The model summaries are correct (est_result.summary()) and match SPSS exactly. However, the saved predicted values do not match at all. I cannot seem to understand how to get it to work correctly.
Any advice is appreciated.
I solved it in a really un-pythonic kind of way. I hope someone can improve this code. The probabilities now match exactly what SPSS produces when you split the file by group, and run individual regressions by group.
results = []
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df_slice)  # predict only for this group's rows
    results.append(pred)
    # print(est_result.summary())
n = len(df['Brand'].unique())
r = pd.DataFrame(results) # put the results into a dataframe
rt = r.T # transpose the dataframe
r_small = rt[rt.columns[-n:]] # remove all but the last n columns, n = number of categories
r_new = r_small.bfill(axis=1).iloc[:, 0] # merge the n columns and remove the NaNs
r_new # show us
df['predicted'] = r_new # combine the r_new array with the original dataframe
df # show us
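A shorter variant of the same idea is sketched below. It relies on the fact that predict on a slice returns a Series carrying that slice's row index, so the per-group predictions can be concatenated and aligned back to the full dataframe. The helper name predict_by_group and the synthetic data are assumptions for illustration only; the column names follow the question.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

def predict_by_group(df, group_col, formula):
    """Fit one logit model per group and return predictions aligned to df's index."""
    preds = []
    for _, df_slice in df.groupby(group_col):
        result = logit(formula, df_slice).fit(disp=0)  # disp=0 silences optimizer output
        preds.append(result.predict(df_slice))         # Series indexed like df_slice
    # Concatenate the per-group Series and restore the original row order
    return pd.concat(preds).reindex(df.index)

# Synthetic stand-in for test_data.csv (hypothetical data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Brand": np.repeat(["a", "b", "c"], 200),
    "var_1": rng.normal(size=600),
})
demo["binary"] = (rng.random(600) < 1 / (1 + np.exp(-demo["var_1"]))).astype(int)

demo["predicted"] = predict_by_group(demo, "Brand", "binary ~ var_1")
print(demo.head())
```

Because the alignment is done by index, this does not depend on the rows being sorted by group.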

Supervised Time Series efficiency improvement

The data that I have is hourly recorded over the past 4 months. I am building a time series model and I've tried several methods so far: Arima, LSTMs, Prophet but they can be quite slow for my task since I have to run the model on thousands of time series in different locations. So then I thought it might be interesting to transform it into a supervised problem and use regression.
I extracted 4 features from the univariate time series and its time index, namely: dayofweek, hour, daily average and hourly average. So at the moment I am using these 4 predictors, but I could possibly extract more (like beginning of the day, noon, etc.; if you have any other suggestions here, they are very welcome :) )
I've used XGBoost for the regression and here are parts of the code:
# XGB
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Functions needed
def convert_dates(x):
    # derive calendar features from the timestamp, then drop the raw column
    x['date'] = pd.to_datetime(x['date'])
    #x['month'] = x['date'].dt.month
    #x['year'] = x['date'].dt.year
    x['dayofweek'] = x['date'].dt.dayofweek
    x['hour'] = x['date'].dt.hour
    #x['week_no'] = pd.to_numeric(x['date'].index.strftime("%V"))
    x.pop('date')
    return x
def add_avg(x):
    # mean of y per weekday, and per (weekday, hour) pair
    x['daily_avg'] = x.groupby(['dayofweek'])['y'].transform('mean')
    x['hourly_avg'] = x.groupby(['dayofweek','hour'])['y'].transform('mean')
    #x['monthly_avg']=x.groupby(['month'])['y'].transform('mean')
    #x['weekly_avg']=x.groupby(['week_no'])['y'].transform('mean')
    return x
xgb_mape_r2_dict = {}
I then run a for loop in which I select a location and build the model for it. Here I split the data into a train and test part. I knew there might be problems due to the Easter holidays in my country last week because those are rare events so that is why I split the training and test data in that manner. So I actually consider the data from the beginning of the year up to two weeks ago as training data and the very next week after that as test data.
for j in range(10,20):
    data = df_all.loc[df_all['Cell_Id']==top_cells[j]]
    data.drop(['Cell_Id', 'WDay'], axis = 1, inplace = True)
    data['date'] = data.index
    period = 168
    data_train = data.iloc[:-2*period,:]
    data_test = data.iloc[-2*period:-period,:]
    data_train = convert_dates(data_train)
    data_test = convert_dates(data_test)
    data_train.columns = ['y', 'dayofweek', 'hour']
    data_test.columns = ['y', 'dayofweek', 'hour']
    data_train = add_avg(data_train)
    daily_avg = data_train.groupby(['dayofweek'])['y'].mean().reset_index()
    hourly_avg = data_train.groupby(['dayofweek', 'hour'])['y'].mean().reset_index()
Now, for the test data I add the past averages, namely the 7 daily averages from the past and the 168 hourly averages from the past as well. This is actually the part that takes the longest amount of time to run and I would like to improve its efficiency.
value_dict = {}
for k in range(168):
    value_dict[tuple(hourly_avg.iloc[k])[:2]] = tuple(hourly_avg.iloc[k])[2]
data_test['daily_avg'] = 0
data_test['hourly_avg'] = 0
for i in range(len(data_test)):
    data_test['daily_avg'][i] = daily_avg['y'][data_test['dayofweek'][i]]
    data_test['hourly_avg'][i] = value_dict[(data_test['dayofweek'][i], data_test['hour'][i])]
My current run time is 30 seconds for every iteration of the for loop, which is way too slow, because of the poor way I add the averages to the test data. I would really appreciate it if anyone could point out how I could implement this bit faster.
I will also add the rest of my code and make some other observations as well:
x_train = data_train.drop('y',axis=1)
x_test = data_test.drop('y',axis=1)
y_train = data_train['y']
y_test = data_test['y']
def XGBmodel(x_train,x_test,y_train,y_test):
    matrix_train = xgb.DMatrix(x_train,label=y_train)
    matrix_test = xgb.DMatrix(x_test,label=y_test)
    model = xgb.train(params={'objective':'reg:linear','eval_metric':'mae'},
                      dtrain=matrix_train,num_boost_round=500,
                      early_stopping_rounds=20,evals=[(matrix_test,'test')],)
    return model
model=XGBmodel(x_train,x_test,y_train,y_test)
#submission = pd.DataFrame(x_pred.pop('id'))
y_pred = model.predict(xgb.DMatrix(x_test), ntree_limit = model.best_ntree_limit)
#submission['sales']= y_pred
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
y_test.reset_index(inplace = True, drop = True)
compare_df = pd.concat([y_test, y_pred], axis = 1)
compare_df.columns = ['Real', 'Predicted']
compare_df.plot()
mape = (np.abs((y_test['y'] - y_pred[0])/y_test['y']).mean())*100
r2 = r2_score(y_test['y'], y_pred[0])
xgb_mape_r2_dict[top_cells[j]] = [mape,r2]
I've used both R-squared and MAPE as accuracy measures although I don't think MAPE is indicated anymore since I've transformed the time series problem into a regression problem. Any thoughts on your part on this subject?
Thank you very much for your time and consideration. Any help is very much appreciated.
Update: I have managed to fix the issue using pandas' merge. I first created two dataframes containing the daily averages and hourly averages from the training data and then merged these dataframes with the test data:
data_test = merge(data_test, daily_avg,['dayofweek'],'daily_avg')
data_test = merge(data_test, hourly_avg,['dayofweek','hour'],'hourly_avg')
data_test.columns = ['y', 'dayofweek', 'hour', 'daily_avg', 'hourly_avg']
where the merge function is defined as:
def merge(x,y,col,col_name):
    x = pd.merge(x, y, how='left', on=None, left_on=col, right_on=col,
                 left_index=False, right_index=False, sort=True,
                 copy=True, indicator=False, validate=None)
    x = x.rename(columns={'sales':col_name})
    return x
I can now run the model for 2000 locations per hour on a laptop with decent results but I will try to improve it while keeping it fast. Thank you very much once again.
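A compact variant of the merge approach is sketched below, under a few assumptions: the frames use the column names y, dayofweek and hour from the question, the helper name add_train_averages is hypothetical, and the toy numbers are made up. Naming the averaged columns before merging avoids having to rename columns by position afterwards, and a left merge without sort=True keeps the test rows in their original order.

```python
import pandas as pd

def add_train_averages(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Attach averages computed on the training data to the test rows."""
    daily = (train.groupby("dayofweek")["y"].mean()
                  .rename("daily_avg").reset_index())
    hourly = (train.groupby(["dayofweek", "hour"])["y"].mean()
                   .rename("hourly_avg").reset_index())
    # Left merges keep every test row, in the original row order
    out = test.merge(daily, on="dayofweek", how="left")
    out = out.merge(hourly, on=["dayofweek", "hour"], how="left")
    out.index = test.index  # merge resets the index, so restore it
    return out

# Toy data (made up): two weekdays, two hours each
train = pd.DataFrame({
    "y": [1.0, 3.0, 5.0, 7.0],
    "dayofweek": [0, 0, 1, 1],
    "hour": [0, 1, 0, 1],
})
test = pd.DataFrame({"y": [0.0, 0.0], "dayofweek": [1, 0], "hour": [1, 0]})

enriched = add_train_averages(train, test)
print(enriched)
```

A (dayofweek, hour) combination that never occurs in the training data would end up with NaN in hourly_avg, so it may be worth checking for missing values before fitting.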

How is 'yhat' (prediction) calculated in the fbprophet library?

I am using fbprophet for time-series predictions in Python and I am wondering how the yhat (prediction) column is calculated. I used the following code.
import quandl
import fbprophet
tesla = quandl.get('WIKI/TSLA')
tesla = tesla.reset_index()
tesla = tesla.rename(columns={'Date': 'ds', 'Adj. Close': 'y'})
tesla = tesla[['ds', 'y']]
prophet = fbprophet.Prophet()
prophet.fit(tesla)
future = prophet.make_future_dataframe(periods=365)
future = prophet.predict(future)
The future dataframe contains the following columns:
['ds', 'trend', 'trend_lower', 'trend_upper', 'yhat_lower', 'yhat_upper',
'seasonal', 'seasonal_lower', 'seasonal_upper', 'seasonalities',
'seasonalities_lower', 'seasonalities_upper', 'weekly', 'weekly_lower',
'weekly_upper', 'yearly', 'yearly_lower', 'yearly_upper', 'yhat']
I understand yhat is the prediction, but is it a combination of trend, seasonal, seasonalities, weekly, yearly or something else?
I have tried combining the trend, seasonal, seasonalities, weekly, yearly columns to see if they equal the yhat column, but they do not:
future['combination'] = future['trend'] + future['seasonal'] + future['weekly'] + future['yearly'] + future['seasonalities']
print(future[['combination', 'yhat']].head())
combination yhat
0 57.071956 27.681139
1 55.840545 27.337270
2 53.741200 26.704090
3 51.874192 26.148355
4 47.827763 25.065950
I have been unable to find an answer in the documentation, but apologize if I have simply missed it.
I am a beginner with fbprophet but would like to share my guess.
Even though I'm not sure it is correct, the following equation probably holds in the default setting of Prophet:
yhat = trend + seasonal
Hopefully, my answer will help you!
This seems to have changed now. Looking at the GitHub code, yhat is calculated as below:
yhat = trend * (1 + multiplicative_terms) + additive_terms
Links to this calculation in the fbprophet source code were given for both the Python and the R implementation.
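The formula can be checked by hand on a small frame. This is a sketch only: the component values are hypothetical, not real Prophet output, and the column names follow the formula quoted above. With all multiplicative terms at zero the result reduces to trend plus the additive terms, which matches the first answer's guess for the default setting.

```python
import pandas as pd

# Hypothetical component values, not real Prophet output
forecast = pd.DataFrame({
    "trend": [100.0, 102.0, 104.0],
    "additive_terms": [5.0, -3.0, 0.5],
    "multiplicative_terms": [0.0, 0.0, 0.0],
})

def combine(fc: pd.DataFrame) -> pd.Series:
    """Recombine components following the formula quoted from the source code."""
    return fc["trend"] * (1 + fc["multiplicative_terms"]) + fc["additive_terms"]

# Additive case: yhat equals trend + additive_terms
forecast["yhat"] = combine(forecast)

# Multiplicative case: the seasonal effect scales with the trend
mult = forecast.assign(additive_terms=0.0,
                       multiplicative_terms=[0.05, -0.03, 0.005])
mult["yhat"] = combine(mult)

print(forecast["yhat"].tolist())
print(mult["yhat"].round(2).tolist())
```

In the question's attempt, aggregate columns such as seasonal and seasonalities appear to already include weekly and yearly, so adding all of them to the trend would count the seasonal effect more than once. That reading is an assumption based on the column names, not something the answers confirm.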
