Finding the correlation between variables using python - python

I am trying to find the correlation of all the columns in this dataset excluding quality, and then plot the frequency distribution of wine quality.
I am doing it the following way, but how do I remove quality?
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
df.corr()
It returns this output:
How can I graph the frequency distribution of wine quality with pandas?
I previously used R for correlation and it worked fine for me, but I am using this dataset to learn pandas and Python:
winecor = cor(wine[-12])
hist(wine$quality)
In R I get the following output, and I am looking for the same in Python.

1. Histogram
# Import plotting library
import matplotlib.pyplot as plt
### Option 1 - histogram
# Bin edges 3..10, so each quality score from 3 to 9 gets its own bin
plt.hist(df['quality'], bins=range(3, 11))
plt.show()
### Option 2 - bar plot (looks nicer)
# Get frequency per quality group
x = df.groupby('quality').size()
# Plot
plt.bar(x.index, x.values)
plt.show()
2. Correlation matrix
In order to get the correlation matrix of features, excluding quality:
# Option 1 - very similar to R
df.iloc[:, :-1].corr()
# Option 2 - more Pythonic
df.drop('quality', axis=1).corr()
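As a small self-contained check of the two parts above, the following sketch uses a made-up three-column frame (the alcohol and pH values and the quality scores are invented for illustration, not taken from the wine dataset). It drops quality before calling corr() and tabulates the frequency of each quality score with value_counts().

```python
import pandas as pd

# Hypothetical stand-in for the wine data; all values are invented.
df = pd.DataFrame({
    "alcohol": [8.8, 9.5, 10.1, 9.9, 12.0, 11.2],
    "pH": [3.00, 3.30, 3.26, 3.19, 3.10, 3.40],
    "quality": [6, 6, 5, 6, 7, 5],
})

# Correlation of every column except 'quality'.
features_corr = df.drop(columns="quality").corr()

# Frequency of each quality score, ordered by score.
quality_counts = df["quality"].value_counts().sort_index()

print(features_corr)
print(quality_counts)
```

Calling quality_counts.plot(kind="bar") would then draw the same kind of bar chart as Option 2.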

You can plot histograms with:
import matplotlib.pyplot as plt
plt.hist(x=df['quality'], bins=30)
plt.show()
Read the documentation for plt.hist() to better understand all of its parameters.
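Because quality only takes integer values, 30 equal-width bins leave most bars empty. One possible refinement, sketched here with invented scores rather than the real dataset, is to place the bin edges at half-integers so that each bar is centred on one score.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented integer scores standing in for df['quality'].
quality = np.array([3, 5, 5, 6, 6, 6, 7, 7, 8, 9])

# Edges at 2.5, 3.5, ..., 9.5: one bin per integer score.
edges = np.arange(quality.min() - 0.5, quality.max() + 1.5, 1.0)

# Same counts that the histogram will display.
counts, _ = np.histogram(quality, bins=edges)

fig, ax = plt.subplots()
ax.hist(quality, bins=edges, edgecolor="black")
ax.set_xticks(np.arange(quality.min(), quality.max() + 1))
ax.set_xlabel("quality")
ax.set_ylabel("count")
plt.show()
```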

Related

Plot existing covariance dataframe

I have computed a covariance matrix of 26 inputs in another program, and I have an existing table of the results. See image below:
What I want to do is enter the table as a pandas DataFrame and plot the matrix. I have seen the thread here: Plot correlation matrix using pandas. However, that example computed the covariance first and then plotted the resulting 'covariance' object. In my case, I want to plot the DataFrame itself so that it looks like the covariance matrix in the example.
Link to data: HERE.
IIUC, you can use seaborn.heatmap with annot=True:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.figure(figsize=(6, 4))
(
pd.read_excel("/tmp/Covariance Matrix.xlsx", header=None)
.pipe(lambda df: sns.heatmap(df.sample(10).sample(10, axis=1), annot=True, fmt=".1f"))
);
# for a sample of 10 rows / 10 columns
Output:
And, as suggested by stukituk in the comments, you can add cmap="coolwarm" for colors.
A clean option, in my opinion, comes from this other answer: How to plot only the lower triangle of a seaborn heatmap?
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline
df = pd.read_excel('Covariance Matrix.xlsx', header=None)
# Boolean mask that is True on and above the diagonal (the cells to hide)
mask = np.triu(np.ones_like(df, dtype=bool))
# Plot only the lower triangle by masking the upper one
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(df, ax=ax, fmt='.1g', annot=True, mask=mask)
plt.show()
hope this helps
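To see what such a mask does, here is a small sketch on an invented symmetric 4x4 matrix (not the asker's data). It builds a Boolean upper-triangle mask with NumPy and, as a stand-in for sns.heatmap, draws the remaining cells with matplotlib's imshow on a masked array. Passing k=1 to np.triu keeps the diagonal visible, which may be preferable for a covariance matrix because the diagonal holds the variances.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented symmetric matrix standing in for the covariance table.
cov = np.array([
    [4.0, 1.2, 0.0, -0.5],
    [1.2, 3.0, 0.8, 0.0],
    [0.0, 0.8, 2.0, 0.3],
    [-0.5, 0.0, 0.3, 1.0],
])

# True strictly above the diagonal (k=1), so the diagonal stays visible.
mask = np.triu(np.ones_like(cov, dtype=bool), k=1)

# Masked cells are left blank by imshow.
lower = np.ma.masked_where(mask, cov)

fig, ax = plt.subplots()
image = ax.imshow(lower, cmap="coolwarm")
fig.colorbar(image, ax=ax)
plt.show()
```

The same mask array can be passed as mask=mask to sns.heatmap.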

Creating whisker plots from grouped pandas Series

I have a dataset of values arriving in 5min timestamped intervals that I'm visualising grouped by hours of day, like this
I want to turn this into a whisker/box plot for the added information. However, the implementations of matplotlib, seaborn and pandas of this plot all want an array of raw data to compute the plot's contents themselves.
Is there a way to create whisker plots from pre-computed/grouped mean, median, std and quartiles? I would like to avoid reinventing the wheel with a comparatively inefficient grouping algorithm to build per-day datasets just for this.
This is some code to produce toy data and a version of the current plot.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# some toy data in a 15-day range
data = [1.5+np.sin(x)*5 for x in np.arange(0, 403.3, .1)]
s = pd.Series(data=data, index=pd.date_range('2019-01-01', '2019-01-15', freq='5min'))
s.groupby(s.index.hour).mean().plot(kind='bar')
plt.show()
Adding to @Quang Hoang's solution: you can use hlines() to display the median as well (here axis is the Axes object and wd is the bar width):
axis.bar(data.index, data['q75'] - data['q25'], bottom=data['q25'], width=wd)
axis.hlines(y=data['median'], xmin=data.index-wd/2, xmax=data.index+wd/2, color='black', linewidth=1)
I don't think there is anything for that. But you can create a whisker plot fairly simply with two plot commands:
# precomputed data:
data = (s.groupby(s.index.hour)
.agg(['mean','std','median',
lambda x: x.quantile(.25),
lambda x: x.quantile(.75)])
)
data.columns = ['mean','std','median','q25','q75']
# plot the whiskers with `errorbar` from `mean` and `std`
fig, ax = plt.subplots(figsize=(12,6))
ax.errorbar(data.index,data['mean'],
yerr=data['std']*1.96,
linestyle='none',
capsize=5
)
# plot the boxes with `bar` at bottoms from quantiles
ax.bar(data.index, data['q75']-data['q25'], bottom=data['q25'])
Output:
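Another option, not mentioned in the answers above, is matplotlib's Axes.bxp, which draws box-and-whisker plots from a list of precomputed statistics instead of raw data. The sketch below uses its own toy series and, as an assumption, takes the hourly minimum and maximum as the whisker ends; any other precomputed values could be substituted.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: one week of 5-minute samples (values are arbitrary).
index = pd.date_range("2019-01-01", periods=7 * 288, freq="5min")
s = pd.Series(1.5 + 5 * np.sin(np.arange(len(index)) * 0.1), index=index)

# Precompute the statistics per hour of day.
grouped = s.groupby(s.index.hour)
q1, med, q3 = grouped.quantile(0.25), grouped.median(), grouped.quantile(0.75)
low, high = grouped.min(), grouped.max()

# One dict per box, using the keys that Axes.bxp expects.
stats = [
    {
        "q1": float(q1.loc[h]),
        "med": float(med.loc[h]),
        "q3": float(q3.loc[h]),
        "whislo": float(low.loc[h]),   # assumption: whiskers at min/max
        "whishi": float(high.loc[h]),
        "fliers": [],
    }
    for h in med.index
]

fig, ax = plt.subplots(figsize=(12, 6))
artists = ax.bxp(stats, positions=list(med.index), showfliers=False)
ax.set_xlabel("hour of day")
plt.show()
```

Since only the dictionaries are passed to bxp, the raw samples are not needed at plotting time.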

Plot type for proportions (with hue)

In a classifieds website I maintain, I'm comparing classifieds that receive greater-than-median views vs classifieds that are below median in this criterion. I call the former "high performance" classifieds. Here's a simple countplot showing this:
The hue is simply the number of photos the classified had.
My question is - is there a plot type in seaborn or matplotlib which shows proportions instead of absolute counts?
I essentially want the same countplot, but with each bar as a % of the total items in that particular category. For example, notice that in the countplot, classifieds with 3 photos make up a much larger proportion of the high perf category. It takes a while to glean that information. If each bar's height was instead represented by its % contribution to its category, it'd be a much easier comparison. That's why I'm looking for what I'm looking for.
An illustrative example would be great.
Instead of trying to find a special-case plotting function that does exactly what you want, I would suggest keeping data generation and visualization separate. In the end, what you want is to plot a bar graph of some values, so the idea is to generate the data in such a way that they can easily be plotted.
To this end, you may crosstab the two columns in question and divide each row (or column) in the resulting table by its sum. This table can then easily be plotted using the pandas plotting wrapper.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
plt.rcParams["figure.figsize"] = 5.6, 7.0
n = 100
df = pd.DataFrame({"performance": np.random.choice([0,1], size=n, p=[0.7,0.3]),
"photo" : np.random.choice(range(4), size=n, p=[0.6,0.1,0.2,0.1]),
"someothervalue" : np.random.randn(n) })
fig, (ax,ax2, ax3) = plt.subplots(nrows=3)
freq = pd.crosstab(df["performance"],df["photo"])
freq.plot(kind="bar", ax=ax)
relative = freq.div(freq.sum(axis=1), axis=0)
relative.plot(kind="bar", ax=ax2)
relative = freq.div(freq.sum(axis=0), axis=1)
relative.plot(kind="bar", ax=ax3)
ax.set_title("countplot of absolute frequency")
ax2.set_title("barplot of relative frequency by performance")
ax3.set_title("barplot of relative frequency by photo")
for a in [ax, ax2, ax3]: a.legend(title="Photo", loc=6, bbox_to_anchor=(1.02,0.5))
plt.subplots_adjust(right=0.8,hspace=0.6)
plt.show()
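As a compact variant of the row-wise division above, pd.crosstab also accepts a normalize argument. The sketch below, on invented data with the same column names as the example, checks that normalize="index" gives the same table as dividing each row by its sum.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100
# Invented data using the same column names as the example above.
df = pd.DataFrame({
    "performance": rng.choice([0, 1], size=n, p=[0.7, 0.3]),
    "photo": rng.choice(np.arange(4), size=n, p=[0.6, 0.1, 0.2, 0.1]),
})

freq = pd.crosstab(df["performance"], df["photo"])

# Manual version: divide each row by its row total.
manual = freq.div(freq.sum(axis=1), axis=0)

# Built-in version: proportions within each performance group.
relative = pd.crosstab(df["performance"], df["photo"], normalize="index")

print(relative.round(2))
```

relative.plot(kind="bar") then gives the proportional bar chart, and normalize="columns" corresponds to the by-photo version.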

Pandas Multiple Plotting

I have DataFrames which contain daily returns data for different indices. I am using the code below to plot the density of the returns distribution.
df.plot(kind='density', title='Returns Density Plot for '+ str(i))
In the same graph I want to plot the Normal Density curve with the same mean and standard deviation as the Index Returns so that I can see how much the Empirical PDF curve deviates from the Normal Distribution Curve.
What will be the easiest way to do this?
A sample Empirical PDF
I suppose you could do something like this, assuming you have a data frame column which contains the normal distribution values.
from matplotlib import pyplot as plt
import pandas as pd
df = pd.read_csv('somefile.csv')  # placeholder file name
density = df['Density']
norm_density = df['Normal Distribution']
plt.figure(1)
# Draw both curves on the same axes and label them for the legend
plt.plot(density, label='density')
plt.plot(norm_density, label='normal distribution')
plt.title('Returns Density Plot for ' + str(i))  # i is the index name from the question
plt.legend()
plt.show()
I used something like this and it works
df1=pd.DataFrame(np.random.normal(loc=mean,scale=std,size=len(dic_2[i])))
ax=df.plot(kind='density', title='Returns Density Plot for '+ str(i),colormap='Reds_r')
df1.plot(ax=ax,kind='density',colormap='Blues_r')
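The second approach overlays a KDE of random normal draws, which adds sampling noise to the reference curve. A possible alternative, sketched here under assumptions, is to evaluate the normal density in closed form from the sample mean and standard deviation. The returns series is simulated for illustration, and a density-normalised histogram stands in for the empirical curve so that the block needs only NumPy and matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated daily returns (hypothetical; replace with a real column).
rng = np.random.default_rng(0)
returns = rng.standard_t(df=4, size=1000) * 0.01

mu, sigma = returns.mean(), returns.std(ddof=1)

# Normal density with the same mean and standard deviation.
x = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 501)
normal_pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

fig, ax = plt.subplots()
ax.hist(returns, bins=50, density=True, alpha=0.5, label="empirical")
ax.plot(x, normal_pdf, color="red", label="normal")
ax.set_title("Returns density vs. normal")
ax.legend()
plt.show()
```

With a DataFrame, df.plot(kind='density', ax=ax) could replace the histogram call to draw the KDE on the same axes; that call requires SciPy.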

Plot a histogram with normal curve and name the bins in seaborn

Hi all, I am trying to produce the following type of plot using seaborn with a different data set. The problem is that when a histogram is used, I cannot name the bins (like 2-2.5, 2.5-3, etc.), even though it provides kernel curves. Bar plots don't have a function to draw the normal curve like in the picture. The image seems to have been made with the SPSS statistical package, which I have little knowledge of.
Following is the closest thing I can get (I have attached the code)
df = pd.DataFrame({'cat': ['1-1.5', '1.5-2', '2-2.5','2.5-3','3-3.5','3.5-4','4-4.5','4.5-5'],'val': [0,0,1,7,7,33,17,10]})
ax = sns.barplot(y = 'val', x = 'cat',
data = df)
ax.set(xlabel='Categories', ylabel='Frequency')
plt.show()
The problem is, of course, that you don't have the original data, only data that has already been binned. One could reverse this binning and start with an array of raw data, then perform the histogramming again and use sns.distplot, which by default shows a KDE plot as well.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
cat = ['1-1.5', '1.5-2', '2-2.5','2.5-3','3-3.5','3.5-4','4-4.5','4.5-5']
val = [0,0,1,7,7,33,17,10]
data = []
for i in range(len(cat)):
    data.extend([1.25+i*0.5]*val[i])
bins = np.arange(1,5.5, 0.5)
ax = sns.distplot(data, bins=bins, hist_kws= dict(edgecolor="k"))
ax.set(xlabel='Categories', ylabel='Frequency')
ax.set_xticks(bins[:-1]+0.25)
ax.set_xticklabels(cat)
plt.show()
Use the bw keyword argument to the KDE function to set the smoothness of the curve, e.g. sns.distplot(data, bins=bins, kde_kws=dict(bw=0.5), hist_kws=dict(edgecolor="k")) with bw=0.5.
Also try bw=0.1, bw=0.25, bw=0.35 and bw=2 to see the differences.
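If the goal is the SPSS-style picture (binned counts with a normal curve), a version that avoids distplot, which has been deprecated since seaborn 0.11, can be sketched with matplotlib alone. This sketch rests on assumptions: the mean and standard deviation are estimated from the bin midpoints, which is only an approximation because the raw values are not available, and the normal curve is scaled by the sample size times the bin width so that it is comparable with counts.

```python
import numpy as np
import matplotlib.pyplot as plt

cat = ['1-1.5', '1.5-2', '2-2.5', '2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5']
val = np.array([0, 0, 1, 7, 7, 33, 17, 10])

bins = np.arange(1, 5.5, 0.5)          # bin edges 1.0, 1.5, ..., 5.0
centers = bins[:-1] + 0.25             # bin midpoints

# Approximate mean and standard deviation from the midpoints.
mean = np.average(centers, weights=val)
std = np.sqrt(np.average((centers - mean) ** 2, weights=val))

fig, ax = plt.subplots()
# A weighted histogram reproduces the binned counts without expanding the data.
counts, _, _ = ax.hist(centers, bins=bins, weights=val, edgecolor="k")

# Normal curve scaled to the count axis: N * bin width * pdf.
x = np.linspace(bins[0], bins[-1], 200)
pdf = np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
ax.plot(x, val.sum() * 0.5 * pdf, color="k")

ax.set(xlabel='Categories', ylabel='Frequency')
ax.set_xticks(centers)
ax.set_xticklabels(cat)
plt.show()
```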
