Linregress output seems incorrect - python

I plotted a scatter plot of my dataframe, which looks like this:
with this code:
from scipy import stats
import pandas as pd
import seaborn as sns
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
subset = df.iloc[:, 1:10080]  # column positions 1 through 10079
df['mean'] = subset.mean(axis=1)
df.plot(x='mean', y='Result', kind = 'scatter')
sns.lmplot(x='mean', y='Result', data=df, order=1)
I wanted to find the slope of the regression line using this code:
scipy.stats.mstats.linregress(Result, average)
but from the output it seems like the slope magnitude is too small:
LinregressResult(slope=-0.0001320534706614152, intercept=27.887336813241845, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=2.55977061451773e-05)
If I switch the positions of Result and average,
scipy.stats.mstats.linregress(average, Result)
it still doesn't look right, as the intercept seems too large:
LinregressResult(slope=-213.12489536011773, intercept=7138.48783135982, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=41.31287437069993)
Why is this happening? Do these output values need to be rescaled?

The signature for scipy.stats.mstats.linregress is linregress(x,y) so your second ordering, linregress(average, Result) is the one that is consistent with the way your graph is drawn. And on that graph, an intercept of 7138 doesn't seem unreasonable—are you getting confused by the fact that the x-axis limits you're showing don't go down to 0, where the intercept would actually happen?
In any case, your data really don't look like they follow a linear law, so the slope (or any parameter from a completely-misspecified model) will not actually tell you much. Are the x and y values all strictly positive? And is there a particular reason why x can never logically go below 25? The data-points certainly seem to be piling up against that vertical asymptote. If so, I would probably subtract 25 from x, then fit a linear model to logged data. In other words, do your plot and your linregress with x=numpy.log(average-25) and y=numpy.log(Result). EDIT: since you say x is temperature there’s no logical reason why x can’t go below 25 (it is meaningful to want to extrapolate below 25, for example—and even below 0). Therefore don’t subtract 25, and don’t log x. Just log y.
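As a minimal sketch of that suggestion (made-up stand-ins for the question's average temperature and Result columns; the log goes on y only, per the edit above):
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
average = rng.uniform(26, 34, 200)                               # hypothetical temperatures
Result = np.exp(8 - 0.2 * average) * rng.lognormal(0, 0.3, 200)  # hypothetical response
fit = stats.linregress(average, np.log(Result))                  # fit a line to log(y) vs x
print(fit.slope, fit.intercept, fit.rvalue)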
In your comments you talk about rescaling the slope, and eventually the suspicion emerges that you think this will give you a correlation coefficient. These are different things. The correlation coefficient is about the spread of the points around the line as well as slope. If what you want is correlation, look up the relevant tools using that keyword.
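If correlation is really what you're after, a minimal sketch (reusing the hypothetical average and Result arrays from the snippet above):
r, p = stats.pearsonr(average, Result)        # Pearson correlation coefficient and p-value
# equivalently: r = np.corrcoef(average, Result)[0, 1]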

Related

How do I get peak values back from fourier transform?

I suspect that there's something I'm missing in my understanding of the Fourier Transform, so I'm looking for some correction (if that's the case). How should I gather peak information from the first plot below?
The dataset is hourly data for 911 calls over the past 17 years (for a particular city).
I've removed the trend from my data, and am now removing the seasonality. When I run the Fourier transform, I get the following plot:
I believe the dataset does have some seasonality to it (looking at weekly data, I have this pattern):
How do I pick out the values of the peaks in the first plot? Presumably, for all of the "peaks" under, say, 5000 in the first plot, I can ignore including that seasonality in my final model, but only at some loss of accuracy, correct?
Here's the bit of code I'm working with, currently:
import numpy as np
import matplotlib.pyplot as plt
from scipy import fftpack
fft = fftpack.fft(calls_grouped_hour.detrended_residuals - calls_grouped_hour.detrended_residuals.mean())
plt.plot(1./(17*365)*np.arange(len(fft)), np.abs(fft))
plt.xlim([-.1, 23/2]);
EDIT:
After Mark Snider's initial answer, I have the following plot:
Here is my attempt at getting the peak values from the fft. Do I need to convert the values back using ifft first?
fft_x_y = np.stack((fft.real, fft.imag), -1)
peaks = []
for x, y in np.abs(fft_x_y):
    if y >= 0:
        peaks.append(x)
peaks = np.unique(peaks)
print('Length: ', len(peaks))
print('Peak values: ', '\n', np.sort(peaks))
threshold = 5000
fft[np.abs(fft)<threshold] = 0
This'll give you an fft that ignores everything except the peaks. And no, I wouldn't imagine that the "noise" represents actual seasonality. The peak at fft[0] doesn't represent seasonality, either - it's a multiple of the mean of the data, so if you plan on subtracting the ifft of the peaks I wouldn't include fft[0] either unless you want your data to be centered.
If you want just the peak values and not the full fft that you can invert, you can just do this:
peaks = [np.abs(value) for value in fft if np.abs(value)>threshold]
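If you also want to know which frequencies those peaks sit at, a minimal sketch that continues from the thresholding code above (it rebuilds the same frequency axis the question's plot uses, in cycles per day over 17 years):
freqs = 1. / (17 * 365) * np.arange(len(fft))
mask = np.abs(fft) > threshold
for f, m in zip(freqs[mask], np.abs(fft)[mask]):
    print('frequency:', f, 'cycles/day, magnitude:', m)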

How do you smoothen out values in an array (without polynomial equations)?

So basically I have some data and I need to find a way to smooth it out (so that the line produced from it is smooth and not jittery). When plotted, the data currently looks like this:
and what I want it to look like is this:
I tried using this numpy method to get the equation of the line, but it did not work for me: the graph repeats (there are multiple readings, so the graph rises, saturates, falls, and then repeats that cycle several times), so there isn't really a single equation that can represent it.
I also tried this, but it did not work for the same reason as above.
The graph is defined as such:
gx = [] #x is already taken so gx -> graphx
gy = [] #same as above
#Put in data
#Get nice data #[this is what I need help with]
#Plot nice data and original data
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
The method I think would be most applicable is taking the average of every 2 points and setting both points to that value, but this idea doesn't sit right with me, as potential values may be lost.
You could use an infinite horizon filter:
import numpy as np
import matplotlib.pyplot as plt
x = 0.85 # adjust x to use more or less of the previous value
k = np.sin(np.linspace(0.5,1.5,100))+np.random.normal(0,0.05,100)
filtered = np.zeros_like(k)
#filtered = newvalue*x+oldvalue*(1-x)
filtered[0]=k[0]
for i in range(1,len(k)):
    # uses x% of the previous filtered value and (1-x)% of the new value
    filtered[i] = filtered[i-1]*x + k[i]*(1-x)
plt.plot(k)
plt.plot(filtered)
plt.show()
I figured it out: by averaging every 4 results I was able to significantly smooth out the graph. Here is a demonstration:
Hope this helps whoever needs it.
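For anyone who wants to reproduce that, a minimal sketch of a 4-point moving average using np.convolve (the noisy signal here is just a made-up stand-in for the real readings):
import numpy as np
import matplotlib.pyplot as plt
gy = np.sin(np.linspace(0.5, 1.5, 200)) + np.random.normal(0, 0.05, 200)  # hypothetical jittery data
window = 4
smoothed = np.convolve(gy, np.ones(window) / window, mode='valid')
plt.plot(gy, label='original')
plt.plot(np.arange(window - 1, len(gy)), smoothed, label='4-point average')
plt.legend()
plt.show()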

How to perform linear/non-linear regression between two 2-D numpy arrays and visualize it with matplotlib?

First, I'd like to make clear that I need to perform regression on data relating a disease to a number of environmental factors for a particular large country, so I have a lot of data.
Now I have this data stored in tiff files and I'm reading them into numpy arrays through gdal. Each dataset is read into a numpy array of shape (54, 53). I have several such arrays for each dataset, and I need to perform regression between two such 2-D numpy arrays. The values in the arrays are Float64. Here's an example:
[[ 162.32145691 158.19345093 153.15704346 ..., 123.77481079 123.63883972 123.6770401 ]
[ 164.55152893 160.59266663 155.75968933 ..., 121.28504181 121.1164093 121.16275024] ...,
[ 321.38272095 329.53326416 338.85699463 ..., 193.69404602 192.50938416 191.42672729]]
Like DiseaseDataset vs EnvironmentFactor1, DiseaseDataset vs EnvironmentFactor2, etc. Since the relationship is rather unknown, arbitrary and complex, I want to plot these 2-D arrays first, but I could not find an appropriate way.
So, how do I plot the 2-D arrays in a scatter plot in matplotlib? I said scatter plot because it'd be easier for me to infer the relationship and move on to an appropriate regression model (linear, non-linear, logarithmic, etc.). I used the following code to plot the relationship row-wise between the two numpy arrays:
for i in range(54):
    plt.scatter(JanTemp[i], can02[i])
    plt.title('Disease vs Temperature')
    plt.ylabel('DiseaseCases')
    plt.xlabel('Temp')
    plt.show()
Here can02 is the response variable and JanTemp is the predictor variable. As expected I got 54 consecutive graphs, with both variables in the same color, which is frustrating (it's my first ever experience with matplotlib and I don't know how to give each variable its own color). Is there a better way to do it? If yes, please suggest one. I think it would involve 3-D visualization, but then how would I be able to infer anything from it? So please suggest a way to visualize in 2-D space that is better than the above.
Since I couldn't get much info from plots, I decided to begin with linear regression. I used scipy.stats.linregress similar to above iteratively for each row, in the following manner:
months = [JanTemp,FebTemp,MarTemp1,AprTemp,MayTemp,JunTemp,JulTemp,AugTemp,SepTemp,OctTemp,NovTemp,DecTemp]
for month in months:
    csum = 0
    pcsum = 0
    for i in range(54):
        slope, intercept, r_value, p_value, std_err = stats.linregress(month[i], can02[i])
        csum += r_value
        pcsum += (r_value**2)*100
    print "mean correlation coefficient is", csum/53
    print "The avg COD is", pcsum/53
Here JanTemp, FebTemp, etc. are each files of dimension (54, 53). For each file, I'm doing row vs row regression 53 times. This is also rather tedious. Is there a better way to do it, like a function or module?
The other method I was aware of was using Ordinary Least Squares (OLS) from the statsmodels.api module, in the following manner:
y = can02
x = JanTemp
X = sm.add_constant(x) #Adds a constant to the linear eq of regression
est = sm.OLS(y, X) #OLS performs the regression of predictor on response
est = est.fit() #fit object of OLS fits the model
est.summary() #Gives the summary of whole calculation
est.params #gives the coefficient of regression
But I get the following long error:
Traceback (most recent call last):
File "H:\Python\results.py", line 77, in <module>
est.summary() #Gives the summary of whole calculation
File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 1230, in summary
top_right = [('R-squared:', ["%#8.3f" % self.rsquared]),
File "C:\Python27\lib\site-packages\statsmodels\tools\decorators.py", line 95, in __get__
_cachedval = self.fget(obj)
File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 959, in rsquared
return 1 - self.ssr/self.centered_tss
File "C:\Python27\lib\site-packages\statsmodels\tools\decorators.py", line 95, in __get__
_cachedval = self.fget(obj)
File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 931, in ssr
return np.dot(wresid, wresid)
ValueError: matrices are not aligned
I didn't get how the matrices are not aligned. Anyway, sticking to my original question: is there any other way similar to this to perform regression, and how would I do it on 2-D arrays?
Thanks, I know I took a lot of your precious time with this long question, but I wanted to be clear. I've searched numerous questions on this site and on other websites, but I couldn't find an appropriate or related solution. Thanks.
Do you actually have 3-D data with axes location, parameter, and year? If so, there is very little that is geographical in this.
I do not think the problem is numpy at all; it is rather a question of how to analyze the data. (Tool-wise, you might be interested in pandas once you know what you want.)
There are some very sophisticated statistical methods for this type of work, but you may start with some simple concepts as you have done with the linear regression. First, you should separate the dependent variables (outcomes, e.g. diseases) and independent variables (e.g. temperatures) and look at one dependent variable at a time.
A simple example: take just one disease. For that you have the number of cases at N locations over M years. Then take all P environmental factors you have. Now you can calculate, at each location, the time series correlation between the disease and each of the P environmental factors. This results in P numbers for each of the N locations.
If you show this as an image (N rows, P columns), you may look for columns with high intensity. They represent disease-environmental factor pairs which seem to repeat in many locations. This is not a statistically rigorous method, but it gives a quick overview.
I am not giving too many code examples, as the statistical basis needs to be thought of before making any visualizations. The visualization part is then usually easier. Unfortunately, there is no simple visualization for the type of data you have.
But for the scatter plot, see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter. For example, to draw red markers instead of blue ones: scatter(x, y, c='r'). If you only want a single color per data series, you may also use plt.plot(x, y, 'r.') ('r' defines the color, '.' that we want separate data points).
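To make that overview concrete, a minimal sketch of the N-by-P correlation image described above, using made-up arrays (disease is N locations by M years; factors is P factors by N locations by M years):
import numpy as np
import matplotlib.pyplot as plt
N, M, P = 54, 17, 6                      # hypothetical counts of locations, years, factors
rng = np.random.RandomState(0)
disease = rng.rand(N, M)                 # disease cases per location per year (made up)
factors = rng.rand(P, N, M)              # environmental factors, same layout (made up)
corr = np.empty((N, P))
for n in range(N):
    for p in range(P):
        corr[n, p] = np.corrcoef(disease[n], factors[p, n])[0, 1]
plt.imshow(corr, aspect='auto', interpolation='nearest')
plt.xlabel('environmental factor')
plt.ylabel('location')
plt.colorbar(label='correlation')
plt.show()
Bright columns in this image would point at disease-factor pairs whose correlation repeats across many locations, which is the quick overview described above.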

Fourier smoothing of data set

I am following this link to do a smoothing of my data set.
The technique is based on the principle of removing the higher order terms of the Fourier Transform of the signal, and so obtaining a smoothed function.
This is part of my code:
N = len(y)
y = y.astype(float) # fix issue, see below
yfft = fft(y, N)
yfft[31:] = 0.0 # set higher harmonics to zero
y_smooth = fft(yfft, N)
ax.errorbar(phase, y, yerr = err, fmt='b.', capsize=0, elinewidth=1.0)
ax.plot(phase, y_smooth/30, color='black') #arbitrary normalization, see below
However some things do not work properly.
Indeed, you can check the resulting plot:
The blue points are my data, while the black line should be the smoothed curve.
First of all I had to convert my array of data y by following this discussion.
Second, I just normalized arbitrarily to compare the curve with data, since I don't know why the original curve had values much higher than the data points.
Most importantly, the curve looks like a mirror image ("specular") of the data points, and I don't know why this happens.
It would be great to have some advice, especially on the third point, and more generally on how to optimize the smoothing with this technique for my particular data set.
Your problem is probably due to the shifting that the standard FFT does. You can read about it here.
Your data is real, so you can take advantage of symmetries in the FT and use the special function np.fft.rfft
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(40)
y = np.log(x + 1) * np.exp(-x/8.) * x**2 + np.random.random(40) * 15
rft = np.fft.rfft(y)
rft[5:] = 0 # Note, rft.shape = 21
y_smooth = np.fft.irfft(rft)
plt.plot(x, y, label='Original')
plt.plot(x, y_smooth, label='Smoothed')
plt.legend(loc=0)
plt.show()
If you plot the absolute value of rft, you will see that there is almost no information in frequencies beyond 5, so that is why I choose that threshold (and a bit of playing around, too).
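A quick way to eyeball that threshold, reusing y from the snippet above (plot the spectrum before zeroing anything):
plt.plot(np.abs(np.fft.rfft(y)))   # magnitude spectrum; it drops off after the first few bins
plt.show()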
Here are the results:
From what I can gather you want to build a low pass filter by doing the following:
1. Move to the frequency domain (Fourier transform).
2. Remove undesired frequencies.
3. Move back to the time domain (inverse Fourier transform).
Looking at your code, instead of doing 3) you're just doing another Fourier transform. Instead, try doing an actual inverse Fourier transform to move back to the time domain:
y_smooth = ifft(yfft, N)
Have a look at scipy signal to see a bunch of already available filters.
(Edit: I'd be curious to see the results, do share!)
I would be very cautious in using this technique. By zeroing out frequency components of the FFT you are effectively constructing a brick wall filter in the frequency domain. This will result in convolution with a sinc in the time domain and likely distort the information you want to process. Look up "Gibbs phenomenon" for more information.
You're probably better off designing a low pass filter or using a simple N-point moving average (which is itself an LPF) to accomplish the smoothing.
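A minimal sketch of that moving-average alternative (assuming y is the 1-D signal from the question; the window length n is arbitrary here):
import numpy as np
def moving_average(y, n=5):
    # N-point moving average acting as a simple low-pass filter
    return np.convolve(y, np.ones(n) / n, mode='same')
# y_smooth = moving_average(y, n=5)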

(Python) Estimating regression parameter confidence intervals with scikits bootstrap

I've just started to try out a nice bootstrapping package available through scikits:
https://github.com/cgevans/scikits-bootstrap
but I've encountered a problem when trying to estimate confidence intervals for the correlation coefficient from linear regression. The confidence intervals returned lie completely outside the range of the original statistic.
Here is the code:
import numpy as np
from scipy import stats
import bootstrap as boot
np.random.seed(0)
x = np.arange(10)
y = 10 + 1.5*x + 2*np.random.randn(10)
r0 = stats.linregress(x, y)[2]
def my_function(y):
    return stats.linregress(x, y)[2]
ci = boot.ci(y, statfunction=my_function, alpha=0.05, n_samples=1000, method='pi')
This yields a result of ci = [-0.605, 0.644], but the original statistic is r0=0.894.
I've tried this in R and it seems to work fine there: the ci straddles r0 as expected.
Please help!
Could you provide your R code? I'd be interested in knowing how this is dealt with in R.
The problem here is that you're only passing y to boot.ci, but every time it runs my_function, it uses the entire, original x (note the lack of x input to my_function). Bootstrapping applies the statistic function to resampled data, so if you're applying your statistic function using the original x and a sample of y, you're going to have a nonsensical result. This is why the BCA method doesn't work at all, actually: it can't apply your statistic function to jackknife samples, which don't have the same number of elements.
Samples are taken along axis 0 (rows), so if you want to pass multiple 1D arrays to your statistic function, you can use multiple columns: xy = vstack((x,y)).T would work, and then use a statfunction that takes data from those columns:
def my_function(xysample):
    return stats.linregress(xysample[:,0], xysample[:,1])[2]
Alternatively, if you wanted to avoid messing with your data at all, you could define a function that operates on indexes, and then just pass indexes to boot.ci:
def my_function2(i):
    return stats.linregress(x[i], y[i])[2]
boot.ci(np.arange(len(x)), statfunction=my_function2, alpha=0.05, n_samples=1000, method='pi')
Note that in either of these cases, BCA works, so you may as well use method='bca' unless you really do want to use percentile intervals; BCA is pretty much always better.
I do realize that both of these methods are less than ideal. Honestly, I've never had a need to pass multiple arrays like this to my statfunction, and the majority of people are likely using mean as their statfunction. I think the best idea here may be to allow lists of arrays with equal size along axis 0 to be passed, e.g. boot.ci([x,y],...), and then sample all of those at the same time and pass them all to the statfunction as separate arguments. In that case, you could just have my_function(x,y). I'll see if I can do this, but if you can show me your R code, that would be great, as I'd like to see if there is a better way of dealing with this.
Update:
In the most recent version of scikits.bootstrap (v0.3.1), a tuple of arrays can be provided, and samples from them will be passed as separate arguments to statfunction. Additionally, statfunction can provide array output, and confidence intervals will be calculated for each point in the output. Thus, this is now very easy to do. The following will give confidence intervals for every output of linregress:
cis = boot.ci( (x,y), statfunction=stats.linregress )
cis[:,2] in this case will be the desired confidence interval.
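Putting that together with the question's setup, a minimal sketch (assuming scikits.bootstrap v0.3.1 or later, imported as boot the way the question does):
import numpy as np
from scipy import stats
import bootstrap as boot   # as in the question; newer installs may use `import scikits.bootstrap as boot`
np.random.seed(0)
x = np.arange(10)
y = 10 + 1.5*x + 2*np.random.randn(10)
# x and y are resampled together, and every output of linregress gets its own interval
cis = boot.ci((x, y), statfunction=stats.linregress)
print(cis[:, 2])   # confidence interval for the r-value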
