Pandas Multiple Plotting - python

I have DataFrames that contain daily returns data for different indices. I am using the code below to plot the density of the returns distribution.
df.plot(kind='density', title='Returns Density Plot for '+ str(i))
In the same graph I want to plot the Normal Density curve with the same mean and standard deviation as the Index Returns so that I can see how much the Empirical PDF curve deviates from the Normal Distribution Curve.
What will be the easiest way to do this?
A sample Empirical PDF

I suppose you could do something like this, assuming your data frame has columns that contain the empirical density and the normal distribution values.
from matplotlib import pyplot as plt
import pandas as pd
df = pd.read_csv('somefile.csv')
density = df['Density']
norm_density = df['Normal Distribution']
fig = plt.figure(1)
# Label each curve so that plt.legend() can pick the names up
plt.plot(density, label='density')
plt.plot(norm_density, label='normal distribution')
# i is the index name from the question's loop
plt.title('Returns Density Plot for ' + str(i))
plt.legend()
plt.show()

I used something like this and it works:
import numpy as np
import pandas as pd
# mean and std are the mean and standard deviation of the index returns
df1 = pd.DataFrame(np.random.normal(loc=mean, scale=std, size=len(dic_2[i])))
ax = df.plot(kind='density', title='Returns Density Plot for ' + str(i), colormap='Reds_r')
df1.plot(ax=ax, kind='density', colormap='Blues_r')
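Drawing random normal samples gives a second kernel density estimate, which is itself noisy. A possible alternative is to draw the analytic normal PDF with scipy.stats.norm.pdf on the same axes. The following is a minimal sketch under assumptions: the returns Series is synthetic stand-in data (not from the question), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for one index's daily returns (assumption: the real
# data would come from a column of your own DataFrame).
rng = np.random.default_rng(0)
returns = pd.Series(rng.standard_t(df=4, size=1000) * 0.01,
                    name="empirical KDE")

mean, std = returns.mean(), returns.std()

# Empirical density (kernel density estimate) drawn by pandas
fig, ax = plt.subplots()
returns.plot(kind="density", ax=ax, title="Returns Density Plot")

# Analytic normal PDF with the same mean and standard deviation
grid = np.linspace(returns.min(), returns.max(), 400)
normal_pdf = stats.norm.pdf(grid, loc=mean, scale=std)
ax.plot(grid, normal_pdf, label="normal PDF")
ax.legend()
```

Call plt.show() afterwards (or run it in a notebook) to display the figure. Because the normal curve is computed rather than sampled, it does not change from run to run.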

Related

Finding the correlation between variables using python

I am trying to find the correlation of all the columns in this dataset excluding quality and then plot the frequency distribution of wine quality.
I am doing it the following way, but how do I remove quality?
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
df.corr()
It returns this output:
How can I graph the frequency distribution of wine quality with pandas?
I previously used R for correlation and it worked fine for me, but with this dataset I am learning to use pandas and Python:
winecor = cor(wine[-12])
hist(wine$quality)
So in R I get the following output and I am looking for the same in Python.
1. Histogram
# Import plotting library
import matplotlib.pyplot as plt
### Option 1 - histogram
# Bin edges 3..10 so that every integer score from 3 to 9 gets its own bin
plt.hist(df['quality'], bins=range(3, 11))
plt.show()
### Option 2 - bar plot (looks nicer)
# Get frequency per quality group
x = df.groupby('quality').size()
# Plot
plt.bar(x.index, x.values)
plt.show()
2. Correlation matrix
In order to get the correlation matrix of features, excluding quality:
# Option 1 - very similar to R
df.iloc[:, :-1].corr()
# Option 2 - more Pythonic
df.drop('quality', axis=1).corr()
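To see both steps without downloading the dataset, here is a small sketch on a synthetic frame. The column names and values are made up for illustration and only mimic the shape of the wine data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (assumption: illustrative columns only)
rng = np.random.default_rng(42)
wine = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, 200),
    "pH": rng.normal(3.2, 0.15, 200),
    "quality": rng.integers(3, 10, 200),  # integer scores 3..9
})

# Correlation matrix of every column except quality
features_corr = wine.drop(columns="quality").corr()

# Frequency of each quality score, ordered by score
quality_counts = wine["quality"].value_counts().sort_index()

print(features_corr)
print(quality_counts)
```

quality_counts can be passed straight to a bar plot, for example quality_counts.plot(kind="bar").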
You can plot histograms with:
import matplotlib.pyplot as plt
plt.hist(x=df['quality'], bins=30)
plt.show()
Read the docs of plt.hist() to better understand all of its parameters.

Get data array from a Seaborn pairplot

I have used the seaborn pairplot function and would like to extract a data array.
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
I want to get an array of the points I show below in black color:
Thanks.
Just this line:
data = iris[iris['species'] == 'setosa']['sepal_length']
You are interested in the blue line, so the 'setosa' species. In order to filter the iris dataframe, I create this filter:
iris['species'] == 'setosa'
which is a boolean array whose values are True if the corresponding row in the 'species' column of the iris dataframe is 'setosa', and False otherwise. With this line of code:
iris[iris['species'] == 'setosa']
I apply the filter to the dataframe, in order to extract only the rows associated with the 'setosa' species. Finally, I extract the 'sepal_length' column:
iris[iris['species'] == 'setosa']['sepal_length']
If I plot a KDE for this data array with this code:
data = iris[iris['species'] == 'setosa']['sepal_length']
sns.kdeplot(data)
I get:
which is the curve you are interested in from the plot above.
The values differ from those in the plot above because of the way the KDE is calculated.
I quote this reference:
The y-axis in a density plot is the probability density function for
the kernel density estimation. However, we need to be careful to
specify this is a probability density and not a probability. The
difference is the probability density is the probability per unit on
the x-axis. To convert to an actual probability, we need to find the
area under the curve for a specific interval on the x-axis. Somewhat
confusingly, because this is a probability density and not a
probability, the y-axis can take values greater than one. The only
requirement of the density plot is that the total area under the curve
integrates to one. I generally tend to think of the y-axis on a
density plot as a value only for relative comparisons between
different categories.
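The point made in the quote can be checked numerically. Below is a minimal sketch using scipy.stats.gaussian_kde on synthetic, tightly clustered data (an assumption chosen so that the peak exceeds one): the density goes above 1 while the area under the curve stays close to 1.

```python
import numpy as np
from scipy import stats

# Synthetic sample with a small spread (assumption: illustrative data only)
rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=0.1, size=500)

# Kernel density estimate evaluated on a grid wider than the data
kde = stats.gaussian_kde(sample)
grid = np.linspace(sample.min() - 1.0, sample.max() + 1.0, 2000)
density = kde(grid)

# Area under the curve via the trapezoidal rule
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

print(f"peak density: {density.max():.2f}")
print(f"area under curve: {area:.4f}")
```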

Beginner question: Python scatter plot with normal distribution not plotting

I have an array of random integers for which I have calculated the mean and std, the standard deviation. Next I have an array of random numbers within the normal distribution of this (mean, std).
I want to plot now a scatter plot of the normal distribution array using matplotlib. Can you please help?
Code:
random_array_a = np.random.randint(2,15,size=75) #random array from [2,15)
mean = np.mean(random_array_a)
std = np.std(random_array_a)
sample_norm_distrib = np.random.normal(mean,std,75)
The scatter plot needs x and y axis...but what should it be?
I think what you may want is a histogram of the normal distribution:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(sample_norm_distrib)
The closest thing you can do to visualise the distribution of your 1D output is a scatter plot where x and y are the same. This way you can see more accumulation of data in the high-probability areas. For example:
import numpy as np
import matplotlib.pyplot as plt
mean = 0
std = 1
sample_norm_distrib = np.random.normal(mean,std,7500)
plt.figure()
plt.scatter(sample_norm_distrib,sample_norm_distrib)
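If a scatter plot is specifically wanted, another option is to plot each draw against its position in the array, so that the x axis is simply the draw index. This is only a sketch under that assumption; the dashed line marking the mean is an illustrative extra.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
random_array_a = rng.integers(2, 15, size=75)  # random integers in [2, 15)
mean, std = random_array_a.mean(), random_array_a.std()
sample_norm_distrib = rng.normal(mean, std, 75)

# x is the position of each draw, y is the drawn value
x = np.arange(sample_norm_distrib.size)

fig, ax = plt.subplots()
ax.scatter(x, sample_norm_distrib)
ax.axhline(mean, linestyle="--", label="mean")
ax.set_xlabel("draw index")
ax.set_ylabel("sampled value")
ax.legend()
```

Points should cluster around the dashed line and thin out further away from it. Call plt.show() to display the figure.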

Excel-like Interpolation in Python

Plotting my data in excel as a scatter plot with smooth line and markers produces the type of figure I'm expecting. Image of Excel plots:
However, when trying to plot the data in matplotlib I'm running into some issues with interpolation. I'm using the interpolation package from SciPy and have tried a range of interpolation methods, including spline interpolation and BarycentricInterpolator as suggested previously. However, these plots are obviously very different from the Excel-produced plots:
I've tried different smoothing and k values for spline interpolation; while the curve changes, the root problem still exists.
How would I be able to produce a fitted curve similar to the excel-produced plots?
Thanks
The problem is that you interpolate the data on a linear scale but expect the outcome to look smooth on a logarithmic scale.
The idea would therefore be to perform the interpolation on a log scale by first transforming the data to its logarithm and then interpolating. You can then transform it back to a linear scale so that you can plot it on a log scale again.
from scipy.interpolate import interp1d, Akima1DInterpolator
import numpy as np
import matplotlib.pyplot as plt
x = np.array([0.02,0.2,2,20,200])
y = np.array([700,850,680,410, 700])
plt.plot(x,y, marker="o", ls="")
sx=np.log10(x)
xi_ = np.linspace(sx.min(),sx.max(), num=201)
xi = 10**(xi_)
f = interp1d(sx,y, kind="cubic")
yi = f(xi_)
plt.plot(xi,yi, label="cubic spline")
f2 = Akima1DInterpolator(sx, y)
yi2 = f2(xi_)
plt.plot(xi,yi2, label="Akima")
plt.gca().set_xscale("log")
plt.legend()
plt.show()
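The same idea can be wrapped in a small helper for reuse. This is a sketch only: the function name interpolate_on_log_x and its num parameter are hypothetical, and Akima interpolation is used simply because it appears in the code above.

```python
import numpy as np
from scipy.interpolate import Akima1DInterpolator

def interpolate_on_log_x(x, y, num=201):
    """Interpolate y over log10(x) and return points on the original x scale."""
    log_x = np.log10(x)
    log_grid = np.linspace(log_x.min(), log_x.max(), num)
    f = Akima1DInterpolator(log_x, y)
    # Convert the grid back to linear x values for plotting on a log axis
    return 10 ** log_grid, f(log_grid)

x = np.array([0.02, 0.2, 2, 20, 200])
y = np.array([700, 850, 680, 410, 700], dtype=float)
xi, yi = interpolate_on_log_x(x, y)
```

xi and yi can then be plotted with plt.plot(xi, yi) followed by plt.gca().set_xscale("log").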

Python equivalent for MATLAB's normplot?

Is there a python equivalent function similar to normplot from MATLAB?
Perhaps in matplotlib?
MATLAB syntax:
x = normrnd(10,1,25,1);
normplot(x)
Gives:
I have tried using matplotlib & numpy module to determine the probability/percentile of the values in array but the output plot y-axis scales are linear as compared to the plot from MATLAB.
import numpy as np
import matplotlib.pyplot as plt
data =[-11.83,-8.53,-2.86,-6.49,-7.53,-9.74,-9.44,-3.58,-6.68,-13.26,-4.52]
plot_percentiles = range(0, 110, 10)
x = np.percentile(data, plot_percentiles)
plt.plot(x, plot_percentiles, 'ro-')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()
Gives:
Else, how could the scales be adjusted as in the first plot?
Thanks.
A late answer, but I just came across the same problem and found a solution that is worth sharing, I guess.
As joris pointed out, the probplot function is an equivalent to normplot, but its axis is expressed in theoretical quantiles rather than cumulative probabilities (percentiles). scipy.stats also offers functions to convert between the two.
quantile -> cumulative probability
stats.'distribution function'.cdf(quantile_value)
cumulative probability -> quantile
stats.'distribution function'.ppf(probability_value)
for example:
stats.norm.ppf(probability)
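As a quick check of the two conversion functions, here is a minimal sketch (the probability 0.95 is just an illustrative value): ppf maps a cumulative probability to a quantile of the standard normal, and cdf maps that quantile back.

```python
from scipy import stats

p = 0.95                      # cumulative probability (the 95th percentile)
q = stats.norm.ppf(p)         # corresponding standard-normal quantile, about 1.645
p_back = stats.norm.cdf(q)    # converting back recovers the probability

print(q, p_back)
```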
To get an equivalent y-axis, like normplot, you can replace the cdf-ticks:
from scipy import stats
import matplotlib.pyplot as plt
nsample=500
#create list of random variables
x=stats.t.rvs(100, size=nsample)
# Calculate quantiles and least-square-fit curve
(quantiles, values), (slope, intercept, r) = stats.probplot(x, dist='norm')
#plot results
plt.plot(values, quantiles,'ob')
plt.plot(quantiles * slope + intercept, quantiles, 'r')
#define ticks
ticks_perc=[1, 5, 10, 20, 50, 80, 90, 95, 99]
#transform them from percentile to theoretical quantile
ticks_quan=[stats.norm.ppf(i/100.) for i in ticks_perc]
#assign new ticks
plt.yticks(ticks_quan,ticks_perc)
#show plot
plt.grid()
plt.show()
The result:
I'm fairly certain matplotlib doesn't provide anything like this.
It's possible to do, of course, but you'll have to either rescale your data and change your y axis ticks/labels to match, or, if you're planning on doing this often, perhaps code a new scale that can be applied to matplotlib axes, like in this example: http://matplotlib.sourceforge.net/examples/api/custom_scale_example.html.
Maybe you can use the probplot function of scipy (scipy.stats); this seems to me an equivalent of MATLAB's normplot:
Calculate quantiles for a probability
plot of sample data against a
specified theoretical distribution.
probplot optionally calculates a
best-fit line for the data and plots
the results using Matplotlib or a
given plot function.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html
But it does not solve your problem of the different y-axis scale.
Using matplotlib.pyplot.semilogy will get closer to the MATLAB output.
