I used autocorrelation_plot to plot the autocorrelation of a straight line:
import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt
dr = pd.date_range(start='1984-01-01', end='1984-12-31')
df = pd.DataFrame(np.arange(len(dr)), index=dr, columns=["Values"])
autocorrelation_plot(df)
plt.show()
Then, I tried using autocorr() to calculate the autocorrelation with different lags:
for i in range(0,366):
    print(df['Values'].autocorr(lag=i))
The output is 1 (or 0.99) for every lag. But it is clear from the correlogram that the autocorrelation is a curve rather than a straight line fixed at 1.
Did I interpret the correlogram incorrectly or did I use the autocorr() function incorrectly?
You are using both functions correctly, but autocorrelation_plot calculates autocorrelations differently than autocorr() does. The following two posts explain more about the differences. Unfortunately, I don't know which way of calculating is the correct one:
What's the difference between pandas ACF and statsmodel ACF?
Why NUMPY correlate and corrcoef return different values and how to "normalize" a correlate in "full" mode?
If you need it, you can get the autocorrelations out of your autocorrelation plot as follows:
ax = autocorrelation_plot(df)
ax.lines[5].get_data()[1]
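The difference can be reproduced by hand. Below is a minimal sketch, assuming that autocorrelation_plot normalizes every lag by the mean and variance of the whole series, while autocorr(lag) computes the Pearson correlation between the series and its shifted copy over the overlapping part only. The helper name acf_global_mean is made up for this illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(366, dtype=float))  # the same straight line as above


def acf_global_mean(x, lag):
    """Autocorrelation normalized by the mean and variance of the full series."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    c0 = np.sum((x - mean) ** 2) / n
    return np.sum((x[: n - lag] - mean) * (x[lag:] - mean)) / n / c0


for lag in (1, 50, 200):
    pearson = values.autocorr(lag=lag)           # Pearson correlation of the overlapping parts
    full_series = acf_global_mean(values, lag)   # assumed correlogram-style estimate
    print(lag, round(pearson, 3), round(full_series, 3))
```

For a straight line, any shifted segment is perfectly linearly related to the original, so the Pearson version stays at 1, while the full-series version decays with the lag and eventually turns negative.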
I was playing around with the seaborn library for data visualization and trying to display a standard normal distribution. The basics in this case look something like:
import numpy as np
import seaborn as sns
n=1000
N= np.random.randn(n)
fig=sns.displot(N,kind="kde")
Which behaves as expected. My problem starts when I try to plot multiple distributions at the same time. I tried the brute-force approach, N2= np.random.randn(n//2) and fig=sns.displot((N,N2),kind="kde"), which returns two distributions (as wanted), but the one with the smaller sample size is significantly different (and flatter). Regardless of the sample size, a proper density plot (or histogram) should have the area under the curve equal to one, but this is clearly not the case.
Knowing that seaborn works with pandas Dataframes, I've tried with the more elaborate (and generally bad and inefficient, but I hope clear) code below to attempt again multiple distributions on the same graph:
import numpy as np
import seaborn as sns
import pandas as pd
n=10000
N_1= np.reshape(np.random.randn(n),(n,1))
N_2= np.reshape(np.random.randn(int(n/2)),(int(n/2),1))
N_3= np.reshape(np.random.randn(int(n/4)),(int(n/4),1))
A_1 = np.reshape(np.array(['n1' for _ in range(n)]),(n,1))
A_2 = np.reshape(np.array(['n2' for _ in range(int(n/2))]),(int(n/2),1))
A_3 = np.reshape(np.array(['n3' for _ in range(int(n/4))]),(int(n/4),1))
F_1=np.concatenate((N_1,A_1),1)
F_2=np.concatenate((N_2,A_2),1)
F_3=np.concatenate((N_3,A_3),1)
F= pd.DataFrame(data=np.concatenate((F_1,F_2,F_3),0),columns=["datar","cat"])
F["datar"]=F.datar.astype('float')
fig=sns.displot(F,x="datar",hue="cat",kind="kde")
Which again shows very different (almost scaled) distributions, confirming that the result in this case is not consistent with what I was expecting (namely, roughly overlapping distributions). Am I misunderstanding how this graph works? Is there a completely different approach to drawing multiple distributions on the same graph that I am missing?
Seaborn works happily with and without dataframes. Columns of dataframes get converted to numpy arrays in order to draw the plots.
sns.displot(..., kind="kde") refers to sns.kdeplot() which has a parameter common_norm defaulting to True. Setting it to False draws the curves independently.
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
n = 10000
N_1 = np.random.randn(n)
N_2 = np.random.randn(n // 2) + 2
N_3 = np.random.randn(n // 4) + 4
sns.displot((N_1, N_2, N_3), kind="kde", common_norm=False)
plt.show()
Note that for kdeplot, having common_norm default to True makes sense: with kdeplot you can also draw the curves with three separate calls, which are then automatically normalized independently. There is also a useful option, multiple (defaulting to 'layer'), which can be set to 'stack' or 'fill'.
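To see what common_norm changes, here is a rough numerical sketch that does not use seaborn at all. It assumes that with common_norm=True each curve is scaled by its share of the total number of observations, so the areas sum to one instead of each being one. The sample sizes mirror the question; the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = {
    "n1": rng.standard_normal(10000),
    "n2": rng.standard_normal(5000),
    "n3": rng.standard_normal(2500),
}
total = sum(len(s) for s in samples.values())

grid = np.linspace(-6, 6, 1201)
dx = grid[1] - grid[0]

area_independent = {}
area_common = {}
for name, s in samples.items():
    density = gaussian_kde(s)(grid)                    # each KDE integrates to about 1 on its own
    area_independent[name] = density.sum() * dx        # like common_norm=False
    area_common[name] = (density * len(s) / total).sum() * dx  # assumed common_norm=True scaling

print(area_independent)
print(area_common)
```

Under this assumption the smaller samples get proportionally smaller areas, which matches the "almost scaled" curves described in the question.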
Is there any way to find the best fitting line for a scatter plot if I don't know the relationship between the two axes (otherwise I could have used scipy.optimize)? My scatter plot looks something like this:
I would like to have a line like this:
I also need to get the points of the best fitting line for my further calculations.
for j in lat:
    l = 94*j
    i = l - 92
    for lines in itertools.islice(input_file, i, l):
        lines = lines.split()
        p.append(float(Decimal(lines[0])))
        vmr.append(float(Decimal(lines[3])))
plt.scatter(vmr, p)
You can use LOWESS (Locally Weighted Scatterplot Smoothing), a non-parametric regression method.
Statsmodels has an implementation here that you can use to fit your own smoother.
See this StackOverflow question on visualizing nonlinear relationships in scatter plots for an example using the Statsmodels implementation.
You could also use the implementation in the Seaborn visualization library's regplot() function with the keyword argument lowess=True. See the Seaborn documentation for details.
The following code is an example using Seaborn and the data from the StackOverflow question above:
import numpy as np
import seaborn as sns
sns.set_style("white")
x = np.arange(0,10,0.01)
ytrue = np.exp(-x/5.0) + 2*np.sin(x/3.0)
# add random errors with a normal distribution
y = ytrue + np.random.normal(size=len(x))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.regplot(x=x, y=y, lowess=True, color="black",
            line_kws={"color":"magenta", "linewidth":5})
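Since the question also asks for the points of the fitted line, here is a minimal sketch that calls the Statsmodels LOWESS function directly on the same synthetic data. The value frac=0.3 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.arange(0, 10, 0.01)
ytrue = np.exp(-x / 5.0) + 2 * np.sin(x / 3.0)
y = ytrue + rng.normal(size=len(x))

# Argument order is (endog, exog), i.e. y first and then x.
# frac is the fraction of the data used for each local fit.
smoothed = lowess(y, x, frac=0.3)

# By default the result is sorted by x and has two columns: x and the fitted y.
x_fit, y_fit = smoothed[:, 0], smoothed[:, 1]
print(smoothed[:5])
```

The arrays x_fit and y_fit can then be used for further calculations or drawn on top of the scatter plot with plt.plot(x_fit, y_fit).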
This probably isn't a matplotlib question, but I think you can do this kind of thing with pandas, using a rolling median.
smoothedData = dataSeries.rolling(10, center = True).median()
Actually you can do a rolling median with anything, but pandas has a built in function. Numpy may too.
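A self-contained sketch of the rolling median on made-up noisy data (the data and the window size of 10 are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
data_series = pd.Series(np.sin(x) + rng.normal(scale=0.3, size=x.size), index=x)

# Centered window of 10 points; the edges are NaN because the window is incomplete there.
smoothed = data_series.rolling(10, center=True).median()

# min_periods=1 keeps the edges by using however many points are available.
smoothed_full = data_series.rolling(10, center=True, min_periods=1).median()

print(smoothed.isna().sum(), smoothed_full.isna().sum())
```

The smoothed values keep the original index, so smoothed_full.index and smoothed_full.values give the points of the line.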
I need to make a plot of the following data, with the year_week on x-axis, the test_duration on the y-axis, and each operator as a different series. There may be multiple data points for the same operator in one week. I need to show standard deviation bands around each series.
data = pd.DataFrame({'year_week':[1601,1602,1603,1604,1604,1604],
'operator':['jones','jack','john','jones','jones','jack'],
'test_duration':[10,12,43,7,23,9]})
prints as:
I have looked at seaborn, matplotlib, and pandas, but I cannot find a solution.
It could be that you are looking for seaborn pointplot.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame({'year_week':[1601,1602,1603,1604,1604,1604],
'operator':['jones','jack','john','jones','jones','jack'],
'test_duration':[10,12,43,7,23,9]})
sns.pointplot(x="year_week", y="test_duration", hue="operator", data=data)
plt.show()
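Note that pointplot draws error bars rather than bands. If shaded standard deviation bands are what you want, one possible sketch is to aggregate with pandas and use matplotlib's fill_between. Treating the standard deviation of a single observation as 0 is my own assumption here.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({'year_week': [1601, 1602, 1603, 1604, 1604, 1604],
                     'operator': ['jones', 'jack', 'john', 'jones', 'jones', 'jack'],
                     'test_duration': [10, 12, 43, 7, 23, 9]})

# Mean and standard deviation per operator and week.
summary = (
    data.groupby(['operator', 'year_week'])['test_duration']
    .agg(['mean', 'std'])
    .reset_index()
)
summary['std'] = summary['std'].fillna(0)  # a single observation has no spread (assumption)

fig, ax = plt.subplots()
for operator, grp in summary.groupby('operator'):
    ax.plot(grp['year_week'], grp['mean'], marker='o', label=operator)
    ax.fill_between(grp['year_week'],
                    grp['mean'] - grp['std'],
                    grp['mean'] + grp['std'],
                    alpha=0.2)
ax.set_xlabel('year_week')
ax.set_ylabel('test_duration')
ax.legend()
```

Call plt.show() afterwards to display the figure.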
I have a Series in Python and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this result? (see Update below)
My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as depicted in the second plot below. I was hoping for a kde fit that is monotone decreasing based on a histogram, which is the first figure depicted. Below I've included my current code. Thanks in advance
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde
df[var].hist()
plt.show() # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys) # a pdf with kinks
Alternatively, is there a slick way to use
count, div = np.histogram(df[var])
and then scale the count array to apply kde() to it?
Update
Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning in this case using the default params in pandas.DataFrame.hist(). In the updated plot I used
df[var].hist(bins=100)
I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.
If you increase the bandwidth using the bw_method parameter, then the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to take advantage of the bw_method:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0,8,200)
plt.plot(xs,density1(xs), label='bw_method=None')
plt.plot(xs,density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()
yields
The problem was under-binning, as mentioned by cel (see comments above). It was clarifying to set bins=100 in pd.DataFrame.hist(), which defaults to bins=10.
See also:
http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
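On the alternative raised in the question (applying kde() to the output of np.histogram()): one possible sketch is to pass the bin centers as data and the counts as weights. This assumes a SciPy version in which gaussian_kde accepts a weights argument (1.2 or later), and the exponential sample is only a stand-in for df[var].

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=5000)  # stand-in for df[var]

count, div = np.histogram(values, bins=100)
centers = 0.5 * (div[:-1] + div[1:])

# Drop empty bins, then weight each bin center by its count.
mask = count > 0
density = gaussian_kde(centers[mask], weights=count[mask])

xs = np.linspace(0, values.max(), 500)
ys = density(xs)
```

This smooths the binned data instead of the raw values, so the result depends on both the number of bins and the bandwidth.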
How would you create a qq-plot using Python?
Assuming that you have a large set of measurements and are using some plotting function that takes XY-values as input. The function should plot the quantiles of the measurements against the corresponding quantiles of some distribution (normal, uniform...).
The resulting plot then lets us evaluate whether our measurements follow the assumed distribution or not.
http://en.wikipedia.org/wiki/Quantile-quantile_plot
Both R and Matlab provide ready-made functions for this, but I am wondering what the cleanest method for implementing it in Python would be.
Update: As folks have pointed out this answer is not correct. A probplot is different from a quantile-quantile plot. Please see those comments and other answers before you make an error in interpreting or conveying your distributions' relationship.
I think that scipy.stats.probplot will do what you want. See the documentation for more detail.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Using qqplot of statsmodels.api is another option:
Very basic example:
import numpy as np
import statsmodels.api as sm
import pylab
test = np.random.normal(0,1, 1000)
sm.qqplot(test, line='45')
pylab.show()
Documentation and more examples are here
If you need to do a QQ plot of one sample vs. another, statsmodels includes qqplot_2samples(). Like Ricky Robinson in a comment above, I think of this as a QQ plot, as opposed to a probability plot, which compares a sample against a theoretical distribution.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot_2samples.html
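For intuition, the points of a two-sample QQ plot can also be computed by hand with NumPy. This is only a sketch of the idea, not what qqplot_2samples() does internally, and the probability grid is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=20, scale=5, size=1000)
sample_b = rng.normal(loc=20, scale=5, size=400)  # the samples may differ in size

# Evaluate both samples at the same probabilities.
probs = np.linspace(0.01, 0.99, 99)
q_a = np.quantile(sample_a, probs)
q_b = np.quantile(sample_b, probs)

# Plotting q_a against q_b gives the QQ plot; points near the
# line y = x suggest that the two samples share a distribution.
print(np.column_stack([q_a, q_b])[:5])
```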
I came up with this. Maybe you can improve it. Especially the method of generating the quantiles of the distribution seems cumbersome to me.
You could replace np.random.normal with any other distribution from np.random to compare data against other distributions.
#!/bin/python
import numpy as np
measurements = np.random.normal(loc = 20, scale = 5, size=100000)
def qq_plot(data, sample_size):
    qq = np.ones([sample_size, 2])
    # shuffles data in place, so the first sample_size items form a random subsample
    np.random.shuffle(data)
    qq[:, 0] = np.sort(data[0:sample_size])
    qq[:, 1] = np.sort(np.random.normal(size = sample_size))
    return qq
print(qq_plot(measurements, 1000))
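One way to avoid drawing random numbers for the reference quantiles is to take them from the distribution's inverse CDF. A minimal sketch, assuming SciPy is available and using the plotting positions (i - 0.5) / n, which is one common convention among several; qq_points is a hypothetical helper name:

```python
import numpy as np
from scipy import stats


def qq_points(data, dist=stats.norm):
    """Return an (n, 2) array: theoretical quantiles and sorted data."""
    data = np.sort(np.asarray(data, dtype=float))  # sorts a copy, the input is untouched
    n = len(data)
    probs = (np.arange(1, n + 1) - 0.5) / n        # plotting positions in (0, 1)
    theoretical = dist.ppf(probs)                  # inverse CDF of the reference distribution
    return np.column_stack([theoretical, data])


rng = np.random.default_rng(0)
measurements = rng.normal(loc=20, scale=5, size=1000)
qq = qq_points(measurements)
print(qq[:5])
```

Any SciPy distribution object with a ppf method (for example stats.uniform or stats.expon) can be passed as dist.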
To add to the confusion around Q-Q plots and probability plots in the Python and R worlds, this is what the SciPy manual says:
"probplot generates a probability plot, which should not be confused
with a Q-Q or a P-P plot. Statsmodels has more extensive functionality
of this type, see statsmodels.api.ProbPlot."
If you try out scipy.stats.probplot, you'll see that indeed it compares a dataset to a theoretical distribution. Q-Q plots, OTOH, compare two datasets (samples).
R has functions qqnorm, qqplot and qqline. From the R help (Version 3.6.3):
qqnorm is a generic function the default method of which produces a
normal QQ plot of the values in y. qqline adds a line to a
“theoretical”, by default normal, quantile-quantile plot which passes
through the probs quantiles, by default the first and third quartiles.
qqplot produces a QQ plot of two datasets.
In short, R's qqnorm offers the same functionality that scipy.stats.probplot provides with the default setting dist="norm". But the fact that they called it qqnorm and that it's supposed to "produce a normal QQ plot" may easily confuse users.
Finally, a word of warning. These plots don't replace proper statistical testing and should be used for illustrative purposes only.
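As one example of such a test, SciPy offers the Shapiro-Wilk normality test. The sketch below uses made-up samples, and the choice of this particular test is mine, not something prescribed above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=20, scale=5, size=500)
uniform_sample = rng.uniform(size=500)

for name, sample in [("normal", normal_sample), ("uniform", uniform_sample)]:
    statistic, p_value = stats.shapiro(sample)
    # A small p-value is evidence against the hypothesis of normality.
    print(name, statistic, p_value)
```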
It exists now in the statsmodels package:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html
You can use bokeh
from bokeh.plotting import figure, show
from scipy.stats import probplot
# pd_series is the series you want to plot
series1 = probplot(pd_series, dist="norm")
p1 = figure(title="Normal QQ-Plot", background_fill_color="#E8DDCB")
p1.scatter(series1[0][0],series1[0][1], fill_color="red")
show(p1)
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Here probplot plots the measurements against the normal distribution, which is specified by dist="norm".
How big is your sample? Here is another option to test your data against any distribution, using the OpenTURNS library. In the example below, I generate a sample x of 1,000,000 numbers from a Uniform distribution and test it against a Normal distribution.
You can replace x by your data if you reshape it as x= [[x1], [x2], .., [xn]]
import openturns as ot
x = ot.Uniform().getSample(1000000)
g = ot.VisualTest.DrawQQplot(x, ot.Normal())
g
In my Jupyter Notebook, I see:
If you are writing a script, you can do it more properly
from openturns.viewer import View
import matplotlib.pyplot as plt
View(g)
plt.show()