Displaying pair plot in Pandas data frame

I am trying to display a pair plot created with scatter_matrix from a pandas DataFrame. This is how the pair plot is created:
# Create dataframe from data in X_train
# Label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# Create a scatter matrix from the dataframe, color by y_train
grr = pd.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
I want the pair plot to look something like this:
I am using Python 3.6 with PyCharm, not Jupyter Notebook.

This code worked for me using Python 3.5.2:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets
iris_dataset = datasets.load_iris()
X = iris_dataset.data
Y = iris_dataset.target
iris_dataframe = pd.DataFrame(X, columns=iris_dataset.feature_names)
# Create a scatter matrix from the dataframe, color by Y
grr = pd.plotting.scatter_matrix(iris_dataframe, c=Y, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8)
For pandas versions earlier than 0.20.0, use pd.scatter_matrix instead (thanks to michael-szczepaniak for pointing out that this API has been deprecated):
grr = pd.scatter_matrix(iris_dataframe, c=Y, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8)
I had to remove the cmap=mglearn.cm3 argument because I could not get mglearn to work; there is a version mismatch with sklearn.
To save the image directly to a file instead of displaying it, use:
plt.savefig('foo.png')
Also remove the following line:
# %matplotlib inline
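For a plain script run from PyCharm or a terminal, the two steps above (drop the notebook magic, save the figure to a file) can be combined as in the following minimal sketch. It assumes the non-interactive Agg backend and uses a small random DataFrame as a stand-in for the iris data; the column names and the output file name pair_plot.png are made up for illustration.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 30 rows, 3 numeric features, 3 class labels
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])
labels = rng.integers(0, 3, size=30)

# scatter_matrix returns a 2-D array of Axes, one per pair of columns
axes = pd.plotting.scatter_matrix(frame, c=labels, figsize=(6, 6), marker="o",
                                  hist_kwds={"bins": 10}, s=30, alpha=0.8)

# Save the figure that owns those Axes instead of showing it
out = Path("pair_plot.png")  # hypothetical output file name
fig = axes[0, 0].figure
fig.savefig(out)
plt.close(fig)
```

To see the plot in a window instead, drop the matplotlib.use("Agg") line and call plt.show() in place of fig.savefig(out).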

Just an update to Vikash's excellent answer. The last two lines should now be:
grr = pd.plotting.scatter_matrix(iris_dataframe, c=Y, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8)
The scatter_matrix function has been moved to the pandas.plotting module, so the original call, while correct at the time, is now deprecated.
So the complete code would now be:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets
iris_dataset = datasets.load_iris()
X = iris_dataset.data
Y = iris_dataset.target
iris_dataframe = pd.DataFrame(X, columns=iris_dataset.feature_names)
# Create a scatter matrix from the dataframe, color by Y
grr = pd.plotting.scatter_matrix(iris_dataframe, c=Y, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8)

This is also possible using seaborn:
import seaborn as sns
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

I finally figured out how to do it in PyCharm.
Just import matplotlib.pyplot as plt and call plt.show() at the end:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris_dataset = load_iris()
X_train,X_test,Y_train,Y_test = train_test_split(iris_dataset['data'],iris_dataset['target'],random_state=0)
iris_dataframe = pd.DataFrame(X_train,columns=iris_dataset.feature_names)
grr = scatter_matrix(iris_dataframe,c = Y_train,figsize = (15,15),marker = 'o',
hist_kwds={'bins':20},s=60,alpha=.8,cmap = mglearn.cm3)
plt.show()
Then it works perfectly.

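First install mglearn:
pip install mglearn
Then import mglearn. The complete code looks like this:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import mglearn
import matplotlib.pyplot as plt
# Load the data and split it so that X_train and y_train are defined
iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# pd.scatter_matrix was moved to pd.plotting.scatter_matrix in pandas 0.20
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
plt.show()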

Related

How to create specific plots using Pandas and then store them as PNG files?

So I am trying to create histograms for each specific variable in my dataset and then save it as a PNG file.
My code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
x=combined_databook.groupby('x_1').hist()
x.figure.savefig("x.png")
I keep getting "AttributeError: 'Series' object has no attribute 'figure'"

Use matplotlib to create a figure and axis objects, then tell pandas which axes to plot on using the ax argument. Finally, use matplotlib (or the fig) to save the figure.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sample Data (3 groups, normally distributed)
df = pd.DataFrame({'gp': np.random.choice(list('abc'), 1000),
'data': np.random.normal(0, 1, 1000)})
fig, ax = plt.subplots()
df.groupby('gp').hist(ax=ax, ec='k', grid=False, bins=20, alpha=0.5)
fig.savefig('your_fig.png', dpi=200)
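If the goal is one PNG per group rather than all groups on shared axes, the same idea (create the figure yourself, hand its axes to pandas, then save the figure) can be applied in a loop. The sketch below is an assumption about what the question wants; the column names gp and data mirror the sample data above, and the hist_<group>.png file names are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: save to disk without opening windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample data: 3 groups, normally distributed values
rng = np.random.default_rng(0)
df = pd.DataFrame({"gp": rng.choice(list("abc"), 300),
                   "data": rng.normal(0, 1, 300)})

saved = []
for name, group in df.groupby("gp"):
    fig, ax = plt.subplots()
    # Draw this group's histogram on the axes we created
    group["data"].plot.hist(ax=ax, bins=20, ec="k")
    ax.set_title(f"group {name}")
    filename = f"hist_{name}.png"  # hypothetical naming scheme
    fig.savefig(filename, dpi=100)
    plt.close(fig)  # free the figure before moving to the next group
    saved.append(filename)
```

Closing each figure inside the loop keeps memory use flat when there are many groups.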

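Instead of using *.hist(), I would use matplotlib.pyplot.hist(). The example below shows the save pattern with a line plot; for a histogram, replace ax.plot(x, y, ...) with ax.hist(y).
Example:
import matplotlib.pyplot as plt
import numpy as np
y = [10, 20, 30, 40, 100, 200, 300, 400, 1000, 2000]
x = np.arange(10)
fig = plt.figure()
ax = plt.subplot(111)
ax.plot(x, y, label='y = Values')
plt.title('my plot')
ax.legend()
# Save before plt.show(), which blocks until the window is closed
fig.savefig('tada.png')
plt.show()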

How do I get the diagonal of sns.pairplot?

OK I am probably being thick, but how do I get just the graphs in the diagonal (top left to bottom right) in a nice row or 2x2 grid of:
import seaborn as sns; sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species", palette="husl")
TO CLARIFY: I just want these graphs I do not care whether pairplot or something else is used.

Doing this the seaborn-way would make use of a FacetGrid. For this we would need to convert the wide-form input to a long-form dataframe, such that every observation is a single row. This is done via pandas.melt.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
df = pd.melt(iris, iris.columns[-1], iris.columns[:-1])
g = sns.FacetGrid(df, col="variable", hue="species", col_wrap=2)
g.map(sns.kdeplot, "value", shade=True)
plt.show()

Why do you want to do that? The diagonal of the pairplot gives you the distplot of each feature. It is more effective to plot the individual distplots as subplots or overlay them on one set of axes. For example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import seaborn as sns
iris = load_iris()
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])
# Split the dataframe by target class
target_0 = iris.loc[iris['target'] == 0]
target_1 = iris.loc[iris['target'] == 1]
target_2 = iris.loc[iris['target'] == 2]
sns.distplot(target_0[['sepal length (cm)']], hist=False, rug=True)
sns.distplot(target_1[['sepal length (cm)']], hist=False, rug=True)
sns.distplot(target_2[['sepal length (cm)']], hist=False, rug=True)
# Recent seaborn versions do not expose sns.plt; use matplotlib directly
plt.show()
Read more here: python: distplot with multiple distributions

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
def hide_current_axis(*args, **kwds):
    plt.gca().set_visible(False)
g = sns.pairplot(iris, hue="species", palette="husl")
g.map_upper(hide_current_axis)
g.map_lower(hide_current_axis)

import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
plt.subplots(2, 2)
for i, col in enumerate(iris.columns[:4]):
    plt.subplot(2, 2, i+1)
    sns.kdeplot(iris.loc[iris['species'] == 'setosa', col], shade=True, label='setosa')
    sns.kdeplot(iris.loc[iris['species'] == 'versicolor', col], shade=True, label='versicolor')
    sns.kdeplot(iris.loc[iris['species'] == 'virginica', col], shade=True, label='virginica')
    plt.xlabel('cm')
    plt.title(col)
    if i == 1:
        plt.legend(loc='upper right')
    else:
        plt.legend().remove()
plt.subplot_tool() # Opens a widget which allows adjusting plot aesthetics

import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species", corner=True)

xticks don't show in matplotlib

I'm trying to plot data from a dictionary with matplotlib in Python 3.6 on macOS.
I want the keys of the dict to be shown as x-tick labels, but they are not showing.
My code is as below:
import pandas as pd
import glob
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
%matplotlib inline
figure(num=None, figsize=(500, 100), dpi=80, facecolor='w', edgecolor='k')
D = info_dict
x = list(D.keys())
y = list(D.values())
plt.bar(x,y)
plt.xticks(range(len(D)), list(D.values()), rotation='vertical')
plt.margins(0.2)
plt.subplots_adjust(bottom=0.15)
plt.show()
And the plotted one is like this:
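A minimal sketch of one way to get the keys onto the x axis, assuming info_dict maps string labels to numbers (the dictionary below is made-up sample data). It differs from the code above in two ways: the tick labels come from the keys rather than the values, and the figure is kept small, since figsize=(500, 100) at 80 dpi asks for a 40,000 by 8,000 pixel canvas on which normal-sized labels may be too small to see.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen rendering for a script
import matplotlib.pyplot as plt

info_dict = {"alpha": 3, "beta": 7, "gamma": 5}  # hypothetical sample data

fig, ax = plt.subplots(figsize=(6, 4), dpi=80)
positions = list(range(len(info_dict)))

# Draw bars at integer positions, then label those positions with the keys
ax.bar(positions, list(info_dict.values()))
ax.set_xticks(positions)
ax.set_xticklabels(list(info_dict.keys()), rotation="vertical")

fig.subplots_adjust(bottom=0.25)  # leave room for the vertical labels
fig.savefig("bars.png")

tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
```

Whether the oversized figure or the use of the values as labels is the actual cause in the original code is not certain; the sketch simply avoids both.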

Linear regression not applying to loglog scale with seaborn

I am currently trying to do a linear regression on a loglog plot using seaborn.
Currently it tries to do the linear regression on the normal scale even though the data is shown plotted in loglog scale.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pylab as plt
import seaborn as sns
x = np.arange(1, 10)
y = x**2.0
data = pd.DataFrame(data={'x': x, 'y': y})
f, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="log")
sns.regplot(x="x", y="y", data=data, ax=ax)
The only workaround I have found is to take the log of x and y before plotting, but then the axis scales are no longer as readable as in the code above.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pylab as plt
import seaborn as sns
x = np.arange(1, 10)
y = x**2.0
data = pd.DataFrame(data={'x': x, 'y': y})
data=np.log(data)
f, ax = plt.subplots(figsize=(7, 7))
sns.regplot(x="x", y="y", data=data)
Is there a way to keep the loglog scale from the first example of code but have the linear regression apply to the loglog scale and not to the normal scale?
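One possible way to keep log-scaled axes while fitting the line in log space is to do the fit by hand with numpy.polyfit and draw the result with matplotlib, rather than relying on sns.regplot. This is a sketch of that swapped-in approach, not a seaborn feature, and it assumes all x and y values are strictly positive. A straight line log(y) = m*log(x) + b in log-log space corresponds to the power law y = exp(b) * x**m on the original scale.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen rendering for a script
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 10)
y = x ** 2.0

# Fit log(y) = slope * log(x) + intercept
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

# Evaluate the fitted power law on the original (untransformed) scale
x_fit = np.linspace(x.min(), x.max(), 100)
y_fit = np.exp(intercept) * x_fit ** slope

fig, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="log")
ax.scatter(x, y, label="data")
ax.plot(x_fit, y_fit, label=f"fit: slope = {slope:.2f}")
ax.legend()
fig.savefig("loglog_fit.png")
```

Because the fit is drawn on log-scaled axes, it appears as a straight line and the tick labels stay in the original units. Unlike regplot, this sketch does not draw a confidence band.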

Dot-boxplots from DataFrames

Dataframes in Pandas have a boxplot method, but is there any way to create dot-boxplots in Pandas, or otherwise with seaborn?
By a dot-boxplot, I mean a boxplot that shows the actual data points (or a relevant sample of them) inside the plot, e.g. like the example below (obtained in R).

For a more precise answer related to OP's question (with Pandas):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({ "A":np.random.normal(0.8,0.2,20),
"B":np.random.normal(0.8,0.1,20),
"C":np.random.normal(0.9,0.1,20)} )
data.boxplot()
for i, d in enumerate(data):
    y = data[d]
    x = np.random.normal(i+1, 0.04, len(y))
    plt.plot(x, y, mfc=["orange", "blue", "yellow"][i], mec='k', ms=7, marker="o", linestyle="None")
plt.hlines(1,0,4,linestyle="--")
Old version (more generic), with matplotlib:
import numpy as np
import matplotlib.pyplot as plt
a = np.random.normal(0,2,1000)
b = np.random.normal(-2,7,100)
data = [a,b]
plt.boxplot(data) # Or you can use the boxplot from Pandas
for i in [1, 2]:
    y = data[i-1]
    x = np.random.normal(i, 0.02, len(y))
    plt.plot(x, y, 'r.', alpha=0.2)
Inspired by this tutorial.
Hope this helps!

This is possible with the stripplot function, available since seaborn 0.6. Here's an example:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.boxplot(x="day", y="total_bill", data=tips)
sns.stripplot(x="day", y="total_bill", data=tips,
size=4, jitter=True, edgecolor="gray")
