Negative confidence interval in linear regression despite all positive values - python

I am getting a negative confidence interval for a linear regression plot even though all data points are positive. Why is this happening? I believe this negative confidence interval will also affect my R^2 score?
Code used is:
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x='Consumer Confidence Index_1', y='Sales (ALV sources)', data=df_mx2)
plt.show()
See graph pic here

One of the foundational assumptions of linear regression is that the residuals (the vertical distances between the data points and the line) are normally distributed about the line. In your case you have data on the right side and the left side with a big gap in the middle. As such, you should double-check that a linear regression is appropriate for your analysis.
That being said, rest easy: the negative confidence interval will NOT affect your R² value. R² is computed from the data points and the fitted line only; the confidence band plays no part in it.
The reason for the negative confidence interval has to do with the sparsity of data with x<42. If the three points on the right side were removed, the regression would have a positive slope intersecting the x axis around x=42. If that line were extended to x=30 or so it would be very negative. The data are therefore also compatible with this much steeper line, so at the confidence level you have set, the interval has to be wide enough to include it, which pushes its lower edge below zero.
This can be interpreted as the data providing very little predictive ability below x=42.
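To see how the band can dip below zero even when every observation is positive, here is a minimal sketch with made-up data (the arrays below are hypothetical, not the data from the question). It uses the textbook analytic formula for the confidence band of the fitted line rather than the bootstrap that seaborn uses, so the numbers would differ from the plot, but the behaviour is the same: the band widens as x moves away from the bulk of the data, while R² depends only on the residuals.

```python
import numpy as np
from scipy import stats

# Hypothetical data: all y values are positive, most x values are clustered
# on the right, with only a few sparse points on the left.
x = np.array([30.0, 33.0, 36.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0])
y = np.array([40.0, 5.0, 60.0, 10.0, 25.0, 35.0, 50.0, 55.0, 70.0, 80.0])

n = len(x)
slope, intercept = np.polyfit(x, y, 1)      # ordinary least squares line
residuals = y - (intercept + slope * x)

# R^2 uses only the residuals and the spread of y, not the confidence band.
r_squared = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Ingredients of the analytic standard error of the fitted line.
s = np.sqrt(np.sum(residuals**2) / (n - 2))
sxx = np.sum((x - x.mean())**2)

def band(x0, level=0.95):
    """Return (lower, upper) confidence limits of the fitted line at x0."""
    se = s * np.sqrt(1.0 / n + (x0 - x.mean())**2 / sxx)
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    centre = intercept + slope * x0
    return centre - t_crit * se, centre + t_crit * se

lo_far, hi_far = band(30.0)        # far from the bulk of the data
lo_mid, hi_mid = band(x.mean())    # at the centre of the data
print("R^2:", round(r_squared, 3))
print("band at x=30:", lo_far, hi_far)
print("band at mean x:", lo_mid, hi_mid)
```

In this example the lower limit at x=30 is negative while the lower limit at the mean of x is positive, even though no y value is below 5.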

Related

ACF and PACF plot has very small confidence level. How to interpret?

I'm rather new to programming in general, so do forgive me if the question is rather basic.
I'm trying to determine my p, d, q values for an ARIMA model and I've already conducted an adfuller test that determined that my time series is stationary. However, when I plot out my ACF and PACF plots, I get the following:
ACF plot
PACF plot
From what I've read about the p values, I'm supposed to pick the value where the line first crosses the confidence interval except I'm not sure why my confidence intervals for both are that small? Does this mean that my MA value should be 2 according to the PACF plot? Any help in interpreting the graphs would be appreciated!
My code:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig = plt.figure(figsize=(20, 12))
fig = plot_acf(train_set.dropna(), lags=10)
fig = plot_pacf(train_set.dropna(), lags=10)
The d component is used to make the data stationary by differencing. If the ADF test (and the KPSS test) shows that the data is stationary, you can probably set it to 0. However, keep in mind that you cannot trust these tests 100%.
The confidence interval indicates whether a correlation is statistically significant, meaning that it is very unlikely to be due to chance. All bars that extend beyond the confidence interval are “real” correlations that you can use for modeling.
There are many rules of thumb for interpreting these plots. I recommend the following:
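If the ACF trails off and the PACF cuts off after a few lags, use an AR model whose order p is the number of significant and strong lags in the PACF.
If the PACF trails off and the ACF cuts off after a few lags, use an MA model whose order q is the number of significant and strong lags in the ACF.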
You can also have a look here:
https://towardsdatascience.com/identifying-ar-and-ma-terms-using-acf-and-pacf-plots-in-time-series-forecasting-ccb9fd073db8
I guess you created the plots with statsmodels. In that case you should keep in mind that lag 0 (the first bar in the plots) is the correlation of the time series with itself; it will always be +1 and significant, so you can ignore this lag.
In your case, the ACF is trailing off, and the PACF has only one statistically significant and strong correlation, at the first lag. Perhaps you can also use lags 2, 3 and 4, but they are very weak. Best, of course, is to just try it out. Or you can use pmdarima’s auto_arima() function:
https://alkaline-ml.com/pmdarima/tips_and_tricks.html
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html
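On why the confidence band can look so narrow: under the hypothesis of no autocorrelation, the approximate 95% band for a sample autocorrelation is ±1.96/sqrt(N), so it shrinks as the series gets longer. The sketch below illustrates this with NumPy only, on a simulated AR(1) series. The series, its coefficient 0.7 and the helper sample_acf are made up for illustration, and statsmodels computes its plotted bands with its own options, so treat this as an approximation.

```python
import numpy as np

def sample_acf(series, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[lag:] * x[:len(x) - lag]) / denom
                     for lag in range(nlags + 1)])

rng = np.random.default_rng(0)
n = 500
noise = rng.normal(size=n)

# Simulated AR(1) series: series[t] = 0.7 * series[t-1] + noise[t]
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + noise[t]

acf = sample_acf(series, nlags=10)
band = 1.96 / np.sqrt(n)   # approximate 95% band under "no autocorrelation"

print("band half-width:", band)
for lag, value in enumerate(acf):
    flag = "significant" if abs(value) > band else ""
    print(lag, round(value, 3), flag)
```

With 500 observations the band is already below ±0.09; with a few thousand it becomes barely visible on the plot.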

What does the background area mean in seaborn regression plot?

What does the background in blue mean or determine in the regression plot when using seaborn? What determines its width at both ends?
According to the seaborn documentation, that area represents the confidence interval. You can set it through the ci parameter:
Size of the confidence interval for the regression estimate. This will
be drawn using translucent bands around the regression line. The
confidence interval is estimated using a bootstrap; for large
datasets, it may be advisable to avoid that computation by setting
this parameter to None
For the statistical meaning of a confidence interval, I suggest the Wikipedia definition:
The confidence interval represents values for the population parameter
for which the difference between the parameter and the observed
estimate is not statistically significant at the 10% level
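Strictly speaking, a 95% level (seaborn uses 95% as the default value) means that if the data were resampled and the line refitted many times, about 95% of the bands built this way would contain the true regression line. It does not mean that 95% of new observations fall inside the band; that would be a prediction interval, which is wider. In practice, the band shows how uncertain the fitted line is given the dispersion of the data. It is narrowest near the centre of the x values and widens toward both ends, where a small change in the slope moves the line the most.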
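As a rough illustration of how a bootstrapped band like this can be built, the sketch below resamples made-up data with replacement, refits a straight line each time, and takes percentiles of the fitted lines on a grid. This is not seaborn's actual implementation; the data, the number of resamples (1000) and the variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a noisy straight line.
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

grid = np.linspace(x.min(), x.max(), 21)   # where the band is evaluated
n_boot = 1000
lines = np.empty((n_boot, grid.size))

for b in range(n_boot):
    idx = rng.integers(0, n, size=n)            # resample rows with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    lines[b] = intercept + slope * grid

# 95% percentile band of the refitted lines at each grid point.
lower, upper = np.percentile(lines, [2.5, 97.5], axis=0)
width = upper - lower

print("width at left end  :", width[0])
print("width in the middle:", width[grid.size // 2])
print("width at right end :", width[-1])
```

The band comes out narrowest in the middle of the x range and wider at both ends, which is the shape seen in the seaborn plot.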

Trouble with visualizing components of fourier transform (python fft)

I am analyzing a time-series dataset that I am pretty sure can be broken down using fft. I want to develop a model to estimate the data using a sum of sin/cos but I am having trouble with the syntax to find the frequencies in python
Here is a graph of the data
data graph
And here's a link to the original data: https://drive.google.com/open?id=1mqZtQ-txdd_AFbKGBlbSL6903CK-_kXl
Most of the examples I have seen have multiple samples per second/time period, however the data in this set represent by-minute observations of some metric. Because of this, I've had trouble translating the answers online to this problem
Here's my naive first approach
import numpy as np
import matplotlib.pyplot as plt
from scipy import fftpack
X = fftpack.fft(data)
freqs = fftpack.fftfreq(len(data))
plt.plot(freqs, np.abs(X))
plt.show()
Instead of peaking at the major frequencies, my plot only has one peak at 0.
result
The FFT you posted has been shifted so that 0 is at the center. Data to the left of the center represents negative frequencies and to the right represents positive frequencies. If you zoom in and look more closely, I think you will see that there are two peaks close to the center that you are interpreting as a single peak at 0. Just looking at the positive side, the location of this peak will tell you which frequency is contributing significant signal power.
Like you said, your x-axis is probably incorrect. scipy.fftpack.fftfreq needs to know the time between samples (in seconds, I think) of your time-domain signal to correctly determine the bandwidth and create the x-axis array in Hz. This should do it:
dt = 60 # 60 seconds between samples
freqs = fftpack.fftfreq(len(data),dt)
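Putting it together, here is a minimal sketch on synthetic by-minute data (the signal is made up: a one-hour cycle plus a constant offset and some noise). It uses numpy.fft.rfft, which returns only the non-negative frequencies, and subtracts the mean first so that the zero-frequency term does not dominate the spectrum; both choices are assumptions of this sketch rather than part of the code above.

```python
import numpy as np

dt = 60.0                        # seconds between samples (one per minute)
n = 1440                         # one day of by-minute observations
t = np.arange(n) * dt

rng = np.random.default_rng(1)
# Synthetic signal: offset + one-hour cycle + a little noise.
data = 5.0 + 2.0 * np.sin(2.0 * np.pi * t / 3600.0) + rng.normal(scale=0.2, size=n)

# Remove the mean so the zero-frequency (DC) term does not swamp the spectrum.
spectrum = np.fft.rfft(data - data.mean())
freqs = np.fft.rfftfreq(n, d=dt)            # in Hz, non-negative only

peak = np.argmax(np.abs(spectrum))
peak_freq = freqs[peak]
print("dominant frequency (Hz):", peak_freq)
print("dominant period (minutes):", 1.0 / peak_freq / 60.0)
```

The amplitudes and frequencies of the largest peaks found this way are the starting point for a sum-of-sines model of the data.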

How to interpret the upper/lower bound of a datapoint with confidence intervals?

Given a list of values:
>>> from scipy import stats
>>> import numpy as np
>>> x = list(range(100))
Using student t-test, I can find the confidence interval of the distribution at the mean with an alpha of 0.1 (i.e. at 90% confidence) with:
def confidence_interval(alist, v, itv):
    return stats.t.interval(itv, df=len(alist)-1, loc=v, scale=stats.sem(alist))
x = list(range(100))
confidence_interval(x, np.mean(x), 0.1)
[out]:
(49.134501289005009, 49.865498710994991)
But if I were to find the confidence interval at every datapoint, e.g. for the value 10:
>>> confidence_interval(x, 10, 0.1)
(9.6345012890050086, 10.365498710994991)
How should the interval of the values be interpreted? Is it statistically/mathematically sound to interpret that at all?
Does it go something like:
At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991),
aka.
At 90% confidence, we can say that the data point falls at 10 +- 0.365...
So can we interpret the interval as some sort of a box plot of the datapoint?
In short
Your call gives the interval of confidence for the mean parameter of a normal law of unknown parameters of which you observed 100 observations with an average of 10 and a stdv of 29. It is furthermore not sound to interpret it, since your distribution is clearly not normal, and because 10 is not the observed mean.
TL;DR
There are a lot of misconceptions floating around about confidence intervals, most of which seemingly stem from a misunderstanding of what we are confident about. Since there is some confusion in your understanding of confidence intervals, maybe a broader explanation will give a deeper understanding of the concepts you are handling and, hopefully, rule out any source of error.
Clearing out misconceptions
Very briefly to set things up. We are in a situation where we want to estimate a parameter, or rather, we want to test a hypothesis for the value of a parameter parameterizing the distribution of a random variable. e.g: Let's say I have a normally distributed variable X with mean m and standard deviation sigma, and I want to test the hypothesis m=0.
What is a parametric test
This is a process for testing a hypothesis on a parameter of a random variable. Since we only have access to observations, which are concrete realizations of the random variable, it generally proceeds by computing a statistic of these realizations. A statistic is roughly a function of the realizations of a random variable. Let's call this function S; we can compute S on x_1,...,x_n, which are n realizations of X.
Therefore you understand that S(X) is a random variable as well with distribution, parameters and so on! The idea is that for standard tests, S(X) follows a very well known distribution for which values are tabulated. e.g: http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf
What is a confidence interval?
Given what we've just said, a definition for a confidence interval would be: the range of values for the tested parameter such that, if the observations had been generated from a distribution parametrized by a value in that range, they would not have been improbable.
In other words, a confidence interval gives an answer to the question: given the following observations x_1,...,x_n n realizations of X, can we confidently say that X's distribution is parametrized by such value. 90%, 95%, etc... asserts the level of confidence. Usually, external constraints fix this level (industrial norms for quality assessment, scientific norms e.g: for the discovery of new particles).
I think it is now intuitive to you that:
The higher the confidence level, the larger the confidence interval. e.g. for a confidence of 100% the confidence interval would range across all the possible values as soon as there is some uncertainty
For most tests, under conditions I won't describe, the more observations we have, the more we can restrain the confidence interval.
At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991)
It is wrong to say that, and it is the most common source of mistakes. A 90% confidence interval never means that the estimated parameter has a 90% chance of falling into that interval. Once the interval is computed, it either covers the parameter or it does not; it is not a matter of probability anymore. The 90% is an assessment of the reliability of the estimation procedure.
What is a student test?
Now let's come to your example and look at it in the light of what we've just said. You want to apply a Student test to your list of observations.
First: a Student test aims at testing a hypothesis of equality between the mean m of a normally distributed random variable with unknown standard deviation, and a certain value m_0.
The statistic associated with this test is t = (np.mean(x) - m_0)/(s/sqrt(n)) where x is your vector of observations, n the number of observations and s the empirical standard deviation. Unsurprisingly, under the tested hypothesis this follows a Student distribution with n-1 degrees of freedom.
Hence, what you want to do is:
compute this statistic for your sample, then compute the confidence interval associated with a Student distribution with this many degrees of freedom, this theoretical mean, and this confidence level
see if your computed t falls into that interval, which tells you if you can rule out the equality hypothesis with such level of confidence.
I wanted to give you an exercise but I think I've been lengthy enough.
To conclude on the use of scipy.stats.t.interval: you can use it in one of two ways. Either compute the t statistic yourself with the formula shown above and check whether t falls in the interval returned by interval(alpha, df), where alpha is the confidence level (e.g. 0.9 for 90%) and df is the number of observations minus one. Or call interval(alpha, df, loc=m, scale=s) directly, where m is your empirical mean and s the empirical standard deviation divided by sqrt(n). In that case, the returned interval will directly be the confidence interval for the mean.
So in your case your call gives the interval of confidence for the mean parameter of a normal law of unknown parameters of which you observed 100 observations with an average of 10 and a stdv of 29. It is furthermore not sound to interpret it, beside the error of interpretation I've already pointed out, since your distribution is clearly not normal, and because 10 is not the observed mean.
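The two uses of scipy.stats.t.interval described above can be sketched as follows. The sample here is simulated from a normal distribution (an assumption made so that the test applies), and m_0 = 0 is an arbitrary hypothesised mean. Note that the first argument of interval is the confidence level, so 0.90 is passed for a 90% interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=100)   # simulated, normally distributed
n = len(x)
m_0 = 0.0                                      # hypothesised mean (arbitrary)
level = 0.90

mean = np.mean(x)
sem = stats.sem(x)                             # s / sqrt(n), with s using ddof=1

# Way 1: compute t yourself and compare it with the central interval of the
# standard Student distribution with n - 1 degrees of freedom.
t_stat = (mean - m_0) / sem
t_low, t_high = stats.t.interval(level, n - 1)
reject = not (t_low <= t_stat <= t_high)

# Way 2: let loc/scale shift and scale the interval, which gives the
# confidence interval for the mean directly.
ci_low, ci_high = stats.t.interval(level, n - 1, loc=mean, scale=sem)

print("t statistic:", t_stat, "reject m = m_0:", reject)
print("90% CI for the mean:", ci_low, ci_high)
```

Both ways answer the same question: m_0 is rejected at this level exactly when it lies outside the interval returned by the second call.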
Resources
You can check out the following resources to go further.
Wikipedia links for quick reference and an elaborate overview
https://en.wikipedia.org/wiki/Confidence_interval
https://en.wikipedia.org/wiki/Student%27s_t-test
https://en.wikipedia.org/wiki/Student%27s_t-distribution
To go further
http://osp.mans.edu.eg/tmahdy/papers_of_month/0706_statistical.pdf
I haven't read it but the one below seems quite good.
https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/162/Handouts/StatsTests04.pdf
You should also check out p-values, you will find a lot of similarities and hopefully you understand them better after reading this post.
https://en.wikipedia.org/wiki/P-value#Definition_and_interpretation
Confidence intervals are hopelessly counter-intuitive. Especially for programmers, I dare say as a programmer.
Wikipedia uses a 90% confidence level to illustrate a possible interpretation:
Were this procedure to be repeated on numerous samples, the fraction of calculated confidence intervals (which would differ for each sample) that encompass the true population parameter would tend toward 90%.
In other words
The confidence interval provides information about a statistical parameter (such as the mean) of the population, estimated from a sample.
The interpretation of e.g. a 90% confidence interval would be: If you repeat the experiment an infinite number of times 90% of the resulting confidence intervals will contain the true parameter.
Assuming the code to compute the interval is correct (which I have not checked) you can use it to calculate the confidence interval of the mean (because of the t-distribution, which models the sample mean of a normally distributed population with unknown standard deviation).
For practical purposes it makes sense to pass in the sample mean. Otherwise you are saying "if I pretended my data had a sample mean of e.g. 10, the confidence interval of the mean would be [9.6, 10.4]".
The particular data passed into the confidence interval does not make sense either. Numbers increasing in a range from 0 to 99 are very unlikely to be drawn from a normal distribution.
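The repeated-sampling interpretation can be checked with a small simulation. The sketch below draws many samples from a normal population whose true mean is known (the population parameters, the sample size and the number of repetitions are arbitrary choices), builds a 90% interval from each sample, and counts how often the true mean is covered.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
true_mean, true_sd = 10.0, 3.0     # known only because this is a simulation
n, repetitions, level = 30, 4000, 0.90

samples = rng.normal(true_mean, true_sd, size=(repetitions, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

# Each individual interval either covers the true mean or it does not;
# the 90% describes the long-run fraction of intervals that do.
covered = (lower <= true_mean) & (true_mean <= upper)
coverage = covered.mean()
print("fraction of intervals covering the true mean:", coverage)
```

The printed fraction should land close to 0.90, which is what the confidence level promises about the procedure, not about any single interval.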

How to measure the accuracy of predictions using Python/Pandas?

I have used the Elo and Glicko rating systems along with the results for matches to generate ratings for players. Prior to each match, I can generate an expectation (a float between 0 and 1) for each player based on their respective ratings. I would like to test how accurate this expectation is, for two reasons:
To compare the different rating systems
To tune variables (such as kfactor in Elo) used to calculate ratings
There are a few differences from chess worth being aware of:
Possible results are wins (which I am treating as 1.0), losses (0.0), with the very occasional (<5%) draws (0.5 each). Each individual match is rated, not a series like in chess.
Players have fewer matches -- many have fewer than 10, few go over 25, max is 75
Thinking the appropriate function is "correlation", I have attempted creating a DataFrame containing the prediction in one column (a float between 0, 1) and the result in the other (1|0.5|0) and using corr(), but based on the output, I am not sure if this is correct.
If I create a DataFrame containing expectations and results for only the first player in a match (the results will always be 1.0 or 0.5 since due to my data source, losers are never displayed first), corr() returns very low: < 0.05. However, if I create a series which has two rows for each match and contains both the expectation and result for each player (or, alternatively, randomly choose which player to append, so results will be either 0, 0.5, or 1), the corr() is much higher: ~0.15 to 0.30. I don't understand why this would make a difference, which makes me wonder if I am either misusing the function or using the wrong function entirely.
If it helps, here is some real (not random) sample data: http://pastebin.com/eUzAdNij
An industry standard way to judge the accuracy of prediction is Receiver Operating Characteristic (ROC). You can create it from your data using sklearn and matplotlib with this code below.
ROC is a 2-D plot of true positive vs false positive rates. You want the line to be above diagonal, the higher the better. Area Under Curve (AUC) is a standard measure of accuracy: the larger the more accurate your classifier is.
import pandas as pd
# read data
df = pd.read_csv('sample_data.csv', header=None, names=['classifier','category'])
# remove values that are not 0 or 1 (two of those)
df = df.loc[(df.category==1.0) | (df.category==0.0),:]
# examine data frame
df.head()
from matplotlib import pyplot as plt
# add this magic if you're in a notebook
# %matplotlib inline
from sklearn.metrics import roc_curve, auc
# matplot figure
figure, ax1 = plt.subplots(figsize=(8,8))
# create ROC itself
fpr,tpr,_ = roc_curve(df.category,df.classifier)
# compute AUC
roc_auc = auc(fpr,tpr)
# plotting bells and whistles
ax1.plot(fpr,tpr, label='%s (area = %0.2f)' % ('Classifier',roc_auc))
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.0])
ax1.set_xlabel('False Positive Rate', fontsize=18)
ax1.set_ylabel('True Positive Rate', fontsize=18)
ax1.set_title("Receiver Operating Characteristic", fontsize=18)
plt.tick_params(axis='both', labelsize=18)
ax1.legend(loc="lower right", fontsize=14)
plt.grid(True)
figure.show()
From your data, you should get a plot like this one:
Actually, what you observe makes perfect sense. If there were no draws and you always showed the expectation of the winner in the first column, then there would be no correlation with the second column at all! Because no matter how big or small the expectation, the number in the second column is always 1.0, i.e. it does not depend on the number in the first column at all.
Due to the low percentage of draws (draws probably correlate with expectations around 0.5) you can still observe a small correlation.
Maybe the correlation is not the best measure for the accuracy of the predictions here.
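A quick simulation makes this concrete. The matches below are made up: each has an expectation p for one player, and that player wins with probability p (no draws, for simplicity). Listing only the winner gives a result column with no variance at all, whereas listing both players gives a clearly positive correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
n_matches = 5000

# Hypothetical expectations for player A; A wins with probability p.
p = rng.uniform(0.1, 0.9, size=n_matches)
a_wins = rng.random(n_matches) < p

# Winner-first layout: one row per match, always describing the winner.
winner_expectation = np.where(a_wins, p, 1.0 - p)
winner_result = np.ones(n_matches)
print("std of winner-first results:", winner_result.std())   # no variance

# Two rows per match: one for each player.
expectation = np.concatenate([p, 1.0 - p])
result = np.concatenate([a_wins, ~a_wins]).astype(float)
corr_both = np.corrcoef(expectation, result)[0, 1]
print("correlation with both players listed:", corr_both)
```

With a constant result column the correlation coefficient is undefined, so in the real winner-first data the few draws are the only thing that gives corr() anything to work with.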
One of the problems is that Elo does not predict the single result but the expected number of points. There is at least one unknown factor: the probability of a draw. You have to put additional knowledge about the probability of a draw into your models. This probability depends on the strength difference between the players: the bigger the difference, the smaller the chance of a draw. One could try the following approaches:
Mapping expected points onto expected results, e.g. 0...0.4 means a loss, 0.4...0.6 a draw and 0.6...1.0 a win, and seeing how many results are predicted correctly.
For a player and a bunch of games, the measure of accuracy would be |predicted_score-score|/number_of_games, averaged over the players. The smaller the difference, the better.
A kind of Bayesian approach: if for a game the predicted number of points is x, then the score of the predictor is x if the game was won and 1-x if the game was lost (maybe you have to skip the draws or score them as (1-x)*x*4, so that a prediction of 0.5 would have a score of 1). The overall score of the predictor over all games would be the product of the single-game scores. The bigger the score, the better.
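The three approaches can be sketched in a few lines. The predictions and results below are made-up numbers, all games are treated as belonging to a single player for the second measure, and the third score is accumulated as a sum of logarithms (a substitution for the plain product, to avoid underflow over many games). Draws are scored as 4*x*(1-x), so that a prediction of 0.5 scores 1.

```python
import numpy as np

# Hypothetical expectations (x) and actual results (1 win, 0.5 draw, 0 loss).
predicted = np.array([0.80, 0.30, 0.55, 0.90])
actual = np.array([1.0, 0.0, 0.5, 0.0])

# 1) Map expected points onto expected results and count the hits.
predicted_result = np.select(
    [predicted < 0.4, predicted <= 0.6],   # loss, draw, otherwise win
    [0.0, 0.5],
    default=1.0,
)
hit_rate = np.mean(predicted_result == actual)

# 2) |predicted_score - score| / number_of_games (here: one player only).
score_error = abs(predicted.sum() - actual.sum()) / len(actual)

# 3) Likelihood-style score: x for a win, 1 - x for a loss,
#    4 * x * (1 - x) for a draw; summed in log space.
per_game = np.where(
    actual == 1.0, predicted,
    np.where(actual == 0.0, 1.0 - predicted, 4.0 * predicted * (1.0 - predicted)),
)
log_score = np.sum(np.log(per_game))

print("hit rate:", hit_rate)
print("score error per game:", score_error)
print("product of per-game scores:", np.exp(log_score))
```

Running the same three measures on the Elo and Glicko expectations, or on Elo with different kfactor values, gives numbers that can be compared directly.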
