I have this plot.
Now I want to add a trend line to it. How do I do that?
The data looks like this:
I wanted to just plot how the median listing price in California has gone up over the years, so I did this:
# Get California data
state_ca = []
state_median_price = []
state_ca_month = []
for state, price, date in zip(data['ZipName'], data['Median Listing Price'], data['Month']):
    if ", CA" not in state:
        continue
    else:
        state_ca.append(state)
        state_median_price.append(price)
        state_ca_month.append(date)
Then I converted the string state_ca_month to datetime:
# Convert state_ca_month to datetime
state_ca_month = [datetime.strptime(x, '%m/%d/%Y %H:%M') for x in state_ca_month]
Then I plotted it:
# Plot trends
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(state_ca_month, state_median_price)
plt.show()
I thought of adding a trend line or some other kind of fitted line, but I am new to visualization. If anyone has any other suggestions, I would appreciate it.
Following the advice in the comments, I get this scatter plot:
I am wondering if I should further format the data to make a clearer plot to examine.
If by "trend line" you mean a literal line, then you probably want to fit a linear regression to your data. sklearn provides this functionality in python.
From the example hyperlinked above:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
To clarify, "the overall trend" is not a well-defined thing. Many times, by "trend", people mean a literal line that "fits" the data well. By "fits the data", in turn, we mean "predicts the data." Thus, the most common way to get a trend line is to pick a line that best predicts the data that you have observed. As it turns out, we even need to be clear about what we mean by "predicts". One way to do this (and a very common one) is by defining "best predicts" in such a way as to minimize the sum of the squares of all of the errors between the "trend line" and the observed data. This is called ordinary least squares linear regression, and is one of the simplest ways to obtain a "trend line". This is the algorithm implemented in sklearn.linear_model.LinearRegression.
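Applied to the data in the question, a minimal sketch might look like the following (my addition, assuming the state_ca_month and state_median_price lists built above; the dates are converted to ordinal numbers so the regression has a numeric feature):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Convert the datetimes to ordinal numbers so they can serve as a numeric feature
X = np.array([d.toordinal() for d in state_ca_month]).reshape(-1, 1)
y = np.array(state_median_price)
# Fit an ordinary least squares regression (the "trend line")
reg = LinearRegression()
reg.fit(X, y)
trend = reg.predict(X)
# Scatter the raw data and overlay the fitted trend line
plt.figure(figsize=(12, 6))
plt.scatter(state_ca_month, y, s=5, label='Median listing price')
plt.plot(state_ca_month, trend, color='red', linewidth=2, label='Trend line')
plt.legend()
plt.show()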
Related
I have a dataset with 4 years of sales and I am trying to forecast sales for the next five years. I've split the dataset into 36 months as the training set and 12 months as the test set. I have chosen Holt-Winters' method and written the following code to test the model.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.api import ExponentialSmoothing

holt_winter = ExponentialSmoothing(np.asarray(train_data['Sales']), seasonal_periods=12, trend='add', seasonal='add')
hw_fit = holt_winter.fit()
hw_forecast = hw_fit.forecast(len(test_data))
plt.figure(figsize=(16,8))
plt.plot(train_data.index, train_data['Sales'], "b.-", label='Train Data')
plt.plot(test_data.index, test_data['Sales'], "ro-", label='Original Test Data')
plt.plot(test_data.index, hw_forecast, "gx-", label='Holt_Winter Forecast Data')
plt.ylabel('Score', fontsize=16)
plt.xlabel('Time', fontsize=16)
plt.legend(loc='best')
plt.title('Holt Winters Forecast', fontsize=20)
plt.show()
The code seems to be working fine and is probably predicting the outcome of the test data set correctly. However, I'm struggling to figure out how to write the code if I want to predict sales for the next five years.
You could also try an ARIMA model; it usually gives better performance. The code below builds combinations of the ARIMA parameters (AR, the autoregressive order p; I, the differencing order d; and MA, the moving-average order q) and finds the best combination by minimizing the Akaike information criterion (AIC), which penalizes the maximum likelihood by the number of parameters (i.e. it finds the best likelihood with the smallest number of parameters):
from statsmodels.tsa.arima_model import ARIMA
import itertools

# Grid Search
p = d = q = range(0,3)                # p, d, and q can be either 0, 1, or 2
pdq = list(itertools.product(p,d,q))  # gets all possible combinations of p, d, and q
combs = {}                            # stores aic and order pairs
aics = []                             # stores aics

# Grid Search continued
for combination in pdq:
    try:
        model = ARIMA(train_data['Sales'], order=combination)  # create all possible models
        model = model.fit()
        combs.update({model.aic : combination})                # store combinations
        aics.append(model.aic)
    except:
        continue

best_aic = min(aics)
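To actually use the winning combination, a short follow-up sketch (my addition, assuming the train_data from your question; the three-value return is how the old arima_model API reports forecasts) could be:
# Refit the model with the best (p, d, q) found by the grid search
best_order = combs[best_aic]
best_model = ARIMA(train_data['Sales'], order=best_order).fit()
# Forecast the next five years = 60 monthly steps
forecast_values, stderr, conf_int = best_model.forecast(steps=60)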
hw_fit.predict(start, end)
will make predictions from step start to step end, with step 0 being the first value of the training data.
forecast makes out-of-sample predictions. So these two are equivalent:
hw_fit.forecast(steps)
hw_fit.predict(len(train_data), len(train_data)+steps-1)
So, since your model was trained with a monthly step, if you want to forecast n months after the training data, you can call the methods above with steps=n.
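For the original five-year question with monthly data, that means for example (my addition, reusing the hw_fit and train_data objects from the question):
# Five years of monthly steps
n_months = 5 * 12
# Out-of-sample forecast starting right after the training data
five_year_forecast = hw_fit.forecast(n_months)
# Equivalent call via predict, with explicit start/end steps
five_year_forecast_alt = hw_fit.predict(len(train_data), len(train_data) + n_months - 1)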
I want to implement the following with the two data sets and make predictions:
Pre-process the data:
Encode columns that contain text, normalize numerical columns, visualize the data.
Form a problem statement: select a model, train the model, evaluate the result.
www.kaggle.com/alfrednieves/notebooke0d8d2a5e2/edit
https://www.kaggle.com/questions-and-answers/217299
I need help with the solution.
import pandas as pd
import matplotlib.pyplot as plt
data2013=pd.read_csv('laser_incidents_2013.csv')
data2014=pd.read_csv('laser_incidents_2014.csv')
data2013.head()
data2014.head()
data2013.all
data2014.all
data2013.columns
data2014.columns
data2013 = data2013.rename(columns={'DATE':'D1', 'Time_(UTC)':'T1', 'Aircraft ID':'AI1', 'No. A/C':'N1',
                                    'ALT':'A1', 'MAJOR CITY':'MC1', 'COLOR':'C1', 'Injury Reported':'IR1',
                                    'CITY':'CI1', 'STATE':'S1'})  # CITY gets its own code so it does not clash with COLOR
data2014 = data2014.rename(columns={'DATE':'D2', 'Time_(UTC)':'T2', 'Aircraft ID':'AI2', 'No. A/C':'N2',
                                    'ALT':'A2', 'MAJOR CITY':'MC2', 'COLOR':'C2', 'Injury Reported':'IR2',
                                    'CITY':'CI2', 'STATE':'S2'})  # same for CITY in the 2014 data
# Let's first import the preprocessing module
from sklearn import preprocessing
# Now let's form one label encoder model per data set (a single encoder would be overwritten by the second fit)
le_2013 = preprocessing.LabelEncoder()
le_2014 = preprocessing.LabelEncoder()
# Now we feed the label column to each model
le_2013.fit(data2013['S1'])
le_2014.fit(data2014['S2'])
# Each model goes through its column and finds the unique labels (the classes that are there)
# The following lines print the labels found in each column
list(le_2013.classes_)
list(le_2014.classes_)
# The following lines convert the labels to arrays of numbers
le_2013.transform(data2013['S1'])
le_2014.transform(data2014['S2'])
data2013.head()
data2014.head()
#Visualize Data
import matplotlib.pyplot as plt
#plot not stressed class
plt.scatter(data['X1'], data['X2'],c=data['S1'])
plt.scatter(data['X1'][data['L']==0], data['X2'][data['L']==0],label='class1',color='blue')
plt.scatter(data['X1'][data['L']==1], data['X2'][data['L']==1],label='class2',color='red')
plt.scatter(data['X1'][data['L']==2], data['X2'][data['L']==2],label='class3',color='green')
plt.legend()
from sklearn.model_selection import train_test_split
X_training, X_testing, Y_training, Y_testing = train_test_split(data[['X1','X2','X3','X4']], data['L'], test_size=0.3)
plt.scatter(X_training['X1'][Y_training==0],X_training['X2'][Y_training==0],label='class1',color='blue')
plt.scatter(X_training['X1'][Y_training==1],X_training['X2'][Y_training==1],label='class2',color='red')
plt.scatter(X_training['X1'][Y_training==2],X_training['X2'][Y_training==2],label='class3',color='green')
plt.title('Training Data')
plt.legend()
plt.figure()
plt.scatter(X_testing['X1'][Y_testing==0],X_testing['X2'][Y_testing==0],label='class1',color='blue')
plt.scatter(X_testing['X1'][Y_testing==1],X_testing['X2'][Y_testing==1],label='class2',color='red')
plt.scatter(X_testing['X1'][Y_testing==2],X_testing['X2'][Y_testing==2],label='class3',color='green')
plt.title('Test Data')
plt.legend()
MDC Classifier
Use the code described in the lectures and classify the data using MDC.
Then count the number of points that are misclassified in the training data and the test data.
K-Nearest Neighbors Classifier
Use 5NN (5 nearest neighbors).
Then count the number of points that are misclassified in the training data and the test data.
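For the 5NN part, a minimal sketch (my addition, reusing the X_training/X_testing/Y_training/Y_testing split created above) could be:
from sklearn.neighbors import KNeighborsClassifier
# Fit a 5-nearest-neighbors classifier on the training split
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_training, Y_training)
# Count the misclassified points in the training and test data
train_errors = (knn.predict(X_training) != Y_training).sum()
test_errors = (knn.predict(X_testing) != Y_testing).sum()
print("Misclassified (train):", train_errors)
print("Misclassified (test):", test_errors)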
How to generate "lower" and "upper" predictions, not just "yhat"?
import statsmodels
from statsmodels.tsa.arima.model import ARIMA
assert statsmodels.__version__ == '0.12.0'
arima = ARIMA(df['value'], order=order)
model = arima.fit()
Now I can generate "yhat" predictions
yhat = model.forecast(123)
and get confidence intervals for model parameters (but not for predictions):
model.conf_int()
but how to generate yhat_lower and yhat_upper predictions?
In general, the forecast and predict methods only produce point predictions, while the get_forecast and get_prediction methods produce full results including prediction intervals.
In your example, you can do:
forecast = model.get_forecast(123)
yhat = forecast.predicted_mean
yhat_conf_int = forecast.conf_int(alpha=0.05)
If your data is a Pandas Series, then yhat_conf_int will be a DataFrame with two columns, lower <name> and upper <name>, where <name> is the name of the Pandas Series.
If your data is a numpy array (or Python list), then yhat_conf_int will be an (n_forecasts, 2) array, where the first column is the lower part of the interval and the second column is the upper part.
To generate prediction intervals as opposed to confidence intervals (a distinction you have neatly drawn, and which is also presented in Hyndman's blog post on the difference between prediction intervals and confidence intervals), you can follow the guidance available in this answer.
You could also try to compute bootstrapped prediction intervals, which is laid out in this answer.
Below is my attempt at implementing this (I'll update it when I get the chance to check it in more detail):
from typing import Union

import numpy as np
import pandas as pd


def bootstrap_prediction_interval(y_train: Union[list, pd.Series],
                                  y_fit: Union[list, pd.Series],
                                  y_pred_value: float,
                                  alpha: float = 0.05,
                                  nbootstrap: int = None,
                                  seed: int = None):
    """
    Bootstraps a prediction interval around an ARIMA model's predictions.

    Method presented clearly here:
    - https://stats.stackexchange.com/a/254321
    Also found through here, though less clearly:
    - https://otexts.com/fpp3/prediction-intervals.html
    Can consider this to be a time-series version of the following generalisation:
    - https://saattrupdan.github.io/2020-03-01-bootstrap-prediction/

    :param y_train: List or Series of training univariate time-series data.
    :param y_fit: List or Series of model fitted univariate time-series data.
    :param y_pred_value: Float of the model predicted univariate time-series you want to compute the P.I. for.
    :param alpha: float = 0.05, the prediction uncertainty.
    :param nbootstrap: integer, the number of bootstrap samples of the residual forecast error;
        defaults to sqrt(len(y_train)). Rules of thumb provided here:
        - https://stats.stackexchange.com/questions/86040/rule-of-thumb-for-number-of-bootstrap-samples
    :param seed: Integer to specify if you want deterministic sampling.
    :return: A list [`lower`, `pred`, `upper`] with `pred` being the prediction
        of the model and `lower` and `upper` constituting the lower and upper
        bounds of the prediction interval around `pred`, respectively.
    """
    # get number of samples
    n = len(y_train)

    # compute the forecast errors/residuals (as arrays, so plain lists also work)
    fe = np.asarray(y_train) - np.asarray(y_fit)

    # get percentile bounds
    percentile_lower = (alpha * 100) / 2
    percentile_higher = 100 - percentile_lower

    if nbootstrap is None:
        nbootstrap = np.sqrt(n).astype(int)
    if seed is None:
        rng = np.random.default_rng()
    else:
        rng = np.random.default_rng(seed)

    # bootstrap sample from the forecast errors
    error_bootstrap = []
    for _ in range(nbootstrap):
        idx = rng.integers(low=0, high=n)
        error_bootstrap.append(fe[idx])

    # get lower and higher percentiles of the sampled forecast errors
    fe_lower = np.percentile(a=error_bootstrap, q=percentile_lower)
    fe_higher = np.percentile(a=error_bootstrap, q=percentile_higher)

    # compute the prediction interval
    pi = [y_pred_value + fe_lower, y_pred_value, y_pred_value + fe_higher]
    return pi
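A quick usage sketch (my addition; the in-sample fitted values and the one-step forecast below are illustrative choices, with model being the fitted ARIMA result from the question):
# In-sample fitted values and a one-step-ahead out-of-sample forecast
y_fit = model.predict(start=0, end=len(df) - 1)
yhat_next = model.forecast(1)
lower, pred, upper = bootstrap_prediction_interval(
    y_train=df['value'],
    y_fit=y_fit,
    y_pred_value=float(yhat_next.iloc[0]),
    alpha=0.05,
    seed=42)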
With ARIMA you need to include seasonality and exogenous variables in the model yourself, while the SARIMA (seasonal ARIMA) or SARIMAX (which also handles exogenous factors) implementation gives confidence intervals through summary_frame:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
dta = sm.datasets.sunspots.load_pandas().data[['SUNACTIVITY']]
dta.index = pd.Index(pd.date_range("1700", end="2009", freq="A"))
print(dta)
print("init data:\n")
dta.plot(figsize=(12,4));
plt.show()
##print("SARIMAX(dta, order=(2,0,0), trend='c'):\n")
result = sm.tsa.SARIMAX(dta, order=(2,0,0), trend='c').fit(disp=False)
print(">>> result.params:\n", result.params, "\n")
##print("SARIMA_model.plot_diagnostics:\n")
result.plot_diagnostics(figsize=(15,12))
plt.show()
# summary stats of residuals
print(">>> residuals.describe:\n", result.resid.describe(), "\n")
# Out-of-sample forecasts are produced using the forecast or get_forecast methods from the results object
# The get_forecast method is more general, and also allows constructing confidence intervals.
fcast_res1 = result.get_forecast()
# specify that we want a confidence level of 90% (alpha=0.10)
print(">>> forecast summary at alpha=0.10:\n", fcast_res1.summary_frame(alpha=0.10), "\n")
# plot forecast
fig, ax = plt.subplots(figsize=(15, 5))
# Construct the forecasts
fcast = result.get_forecast('2010Q4').summary_frame()
print(fcast)
fcast['mean'].plot(ax=ax, style='k--')
ax.fill_between(fcast.index, fcast['mean_ci_lower'], fcast['mean_ci_upper'], color='k', alpha=0.1);
fig.tight_layout()
plt.show()
docs: "The forecast above may not look very impressive, as it is almost a straight line. This is because this is a very simple, univariate forecasting model. Nonetheless, keep in mind that these simple forecasting models can be extremely competitive"
P.S. As noted here: "you can use it in a non-seasonal way by setting the seasonal terms to zero."
I am training a Gaussian Process to learn the mapping between a set of coordinates x, y, z and some time series. In a nutshell, my question is about how to prevent my GP from overfitting, which I am facing to an odd degree.
Some details:
my training set is made of 1500 samples and my testing set of 500 samples; each sample has 20 time components;
I don't have a preference in terms of which kernel to use for the GP, and I would appreciate help in understanding which one could work better. Furthermore, I have very little experience with GPs in general, hence I am not sure how well I am doing with the hyperparameters. See below for how I set my length_scale: I set it this way following some advice, but I am wondering if it makes sense;
my coordinates are standardized (mean 0, std 1), but my time series are not;
I am training one Gaussian Process for each time component.
Here is my code:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic, ExpSineSquared, DotProduct, ConstantKernel)
# ----------------------------------------------------------------------
number_of_training_samples = 1500
number_of_testing_samples = 500
# read coordinates STANDARDIZED
coords_training_stand = np.loadtxt('coordinates_training_standardized.txt')
coords_testing_stand = np.loadtxt('coordinates_testing_standardized.txt')
# read time series TRAIN/TEST
timeseries_training = np.loadtxt('timeseries_training.txt')
timeseries_testing = np.loadtxt('timeseries_testing.txt')
number_of_time_components = np.shape(timeseries_training)[1] # 20
# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5, length_scale=np.ones(coords_training_stand.shape[1]))
gp = GaussianProcessRegressor(kernel=kernel)
# placeholder for predictions
pred_timeseries_training = np.zeros((np.shape(timeseries_training)))
pred_timeseries_testing = np.zeros((np.shape(timeseries_testing)))
for i in range(number_of_time_components):
    print("time component", i)
    gp.fit(coords_training_stand, timeseries_training[:,i])
    y_pred, sigma = gp.predict(coords_training_stand, return_std=True)
    y_pred_test, sigma_test = gp.predict(coords_testing_stand, return_std=True)
    pred_timeseries_training[:,i] = y_pred
    pred_timeseries_testing[:,i] = y_pred_test
# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(timeseries_training[100*i, :20], color='blue', label='Original train')
    ax[i].plot(pred_timeseries_training[100*i], color='black', label='GP pred train')
    ax[i].set_xlabel('Time components', fontsize='x-large')
    ax[i].set_ylabel('Amplitude', fontsize='x-large')
    ax[i].set_title('Time series n. {:}'.format(100*i+1), fontsize='x-large')
    ax[i].legend(fontsize='x-large')
plt.subplots_adjust(hspace=1)
plt.show()
plt.close()
# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(timeseries_testing[100*i, :20], color='blue', label='Original test')
    ax[i].plot(pred_timeseries_testing[100*i], color='black', label='GP pred test')
    ax[i].set_xlabel('Time components', fontsize='x-large')
    ax[i].set_ylabel('Amplitude', fontsize='x-large')
    ax[i].set_title('Time series n. {:}'.format(1500+100*i+1), fontsize='x-large')
    ax[i].legend(fontsize='x-large')
plt.subplots_adjust(hspace=1)
plt.show()
plt.close()
Here is the plot of a few samples from the TRAINING set and the corresponding GP predictions (one can't even see the blue lines, corresponding to the original samples, because they are perfectly covered by the GP's predictions):
Here is the plot of a few samples from the TESTING set and the corresponding GP predictions:
(Only in one case, sample 1801, is the prediction good.)
I think there is a very strong overfitting going on, and I would like to understand how to avoid it.
I don't think the problem is with the Gaussian Process itself but with the dataset.
How were the time series samples generated? And how did you divide the dataset into training and test sets?
If you took one big time series and then cut it into small sequences, there are not enough genuinely distinct examples for the model to learn from, and you can get big overfitting problems.
Explanation with an example:
I have one big time series t0, t1, t2, t3, ..., t99.
I make a training dataset of 80 samples with [t0,...,t19], [t1,...,t20], [t2,...,t21], ..., [t80,...,t99].
In this case all my samples are almost exactly the same, and that causes overfitting. And if the validation set is composed of random samples taken from this dataset, then I'll get very high validation accuracy because the model has already seen almost exactly the same thing in training. (I think that's what might have happened for example 1801 you gave.)
So make sure all the samples in your datasets are completely independent.
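A minimal sketch of a chronological, non-overlapping split (my addition, using a made-up series purely to illustrate the windowing):
import numpy as np
# A made-up long univariate series, only for illustration
t = np.arange(1000, dtype=float)
window = 20
# Non-overlapping windows: the samples share no time steps with each other
samples = np.array([t[i:i + window] for i in range(0, len(t) - window + 1, window)])
# Chronological split: every test window comes strictly after the training windows,
# so no test information leaks into training
n_train = int(0.75 * len(samples))
train_samples = samples[:n_train]
test_samples = samples[n_train:]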
I have two NumPy arrays, time and number of GET requests. I need to fit this data with a function so that I can make future predictions.
These data were extracted from a Cassandra table which stores the details of a log file. So basically the time format is epoch time and the training variable here is get_counts.
from cassandra.cluster import Cluster
from cassandra.query import panda_factory
import numpy as np
import matplotlib.pyplot as plt

session = Cluster(contact_points=['127.0.0.1'], port=9042).connect(keyspace='ASIA_KS')
session.row_factory = panda_factory
df = (session.execute("SELECT epoch_time, get_counts FROM ASIA_TRAFFIC")
      .sort(columns=['epoch_time', 'get_counts'], ascending=[1, 0]))

time = np.array([x[1] for x in enumerate(df['epoch_time'])])
get = np.array([x[1] for x in enumerate(df['get_counts'])])

plt.title('Trend')
plt.plot(time, get, 'o')
plt.show()
The data is as follows:
there are around 1000 pairs of data
time -> [1391193000 1391193060 1391193120 ..., 1391279280 1391279340 1391279400 1391279460]
get -> [577 380 430 ...,250 275 365 15]
Plot of the data (image not reproduced here).
Can someone please help me by providing a function so that I can properly fit the data? I am new to Python.
EDIT:
fit = np.polyfit(time, get, 3)
yp = np.poly1d(fit)
plt.plot(time, yp(time), 'r--', time, get, 'b.')
plt.xlabel('Time')
plt.ylabel('Number of Get requests')
plt.title('Trend')
plt.xlim([time[0]-10000, time[-1]+10000])
plt.ylim(0, 2000)
plt.show()
print(yp(time[1400]))
the fit curve looks like this:
https://drive.google.com/file/d/0B-r3Ym7u_hsKUTF1OFVqRWpEN2M/view?usp=sharing
However, at the later part of the curve the value of y becomes negative, which is wrong. The curve must change its slope back to positive somewhere in between.
Can anyone please suggest how to go about this?
Help will be much appreciated.
You could try:
time = np.array([x[1] for x in enumerate(df['epoch_time'])])
byte = np.array([x[1] for x in enumerate(df['byte_transfer'])])
fit = np.polyfit(time, byte, n)  # step up the n value here,
                                 # where n is the degree of the polynomial
yp = np.poly1d(fit)
print(yp)  # displays the function in cx^n + cx^(n-1) + ... + c format
plt.plot(time, yp(time), '-')
plt.xlabel('Time')
plt.ylabel('Bytes Transferred')
plt.title('Trend')
plt.plot(time, byte, 'o')
plt.show()
I'm new to NumPy and curve fitting as well, but this is how I've been attempting to do it.
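If the fitted polynomial going negative is the concern (as in the edit above), one option is to fit the polynomial to the logarithm of the counts and exponentiate the result, so the back-transformed predictions stay positive. A minimal sketch of that idea (my addition, not part of the original answer; it assumes the time and get arrays from the question and strictly positive counts):
import numpy as np
import matplotlib.pyplot as plt
# Fit a polynomial to log(counts); exponentiating keeps the predictions positive
log_fit = np.polyfit(time, np.log(get), 3)
log_poly = np.poly1d(log_fit)
def predict_requests(t):
    # Back-transform from log space to the original count scale
    return np.exp(log_poly(t))
plt.plot(time, get, 'b.', label='Observed')
plt.plot(time, predict_requests(time), 'r--', label='Log-space fit')
plt.xlabel('Time')
plt.ylabel('Number of GET requests')
plt.legend()
plt.show()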