How to grid plot 2D categorical data

I have data that looks like:
Name X Y
A HIGH MID
B LOW LOW
C MID LOW
D HIGH MID
How can I plot this data in a 2-D diagram with a 3x3 grid, adding a random variation to the position of each data point so that the points, each labelled with its name, keep enough spacing between each other?
It should look somewhat like this:
Here is what I tried, but I don't know how to plot the values in between the grid lines rather than exactly on them, so that they do not overlap.
import pandas as pd
import matplotlib.pyplot as plt
### Mock Data ###
data = """A0,LOW,LOW
A,MID,MID
B,LOW,MID
C,MID,HIGH
D,LOW,MID
E,HIGH,HIGH"""
df = pd.DataFrame([x.split(',') for x in data.split('\n')])
df.columns = ['name','X','Y']
### Plotting ###
fig,axs = plt.subplots()
axs.scatter(df.X,df.Y,label=df.name)
axs.set_xlabel('X')
axs.set_ylabel('Y')
for i,p in enumerate(df.name):
    axs.annotate(p, (df.X[i],df.Y[i]))
axs.grid()
axs.set_axisbelow(True)
fig.tight_layout()
plt.show()
resulting:

You can control the positions directly and then change the tick labels on the axes. Your drawing leaves some issues open, such as which label to show when more than one point falls on the same location.
In any case, here is a possible solution:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
### Mock Data ###
data = """A0,LOW,LOW
A,MID,MID
B,LOW,MID
C,MID,HIGH
D,LOW,MID
E,HIGH,HIGH"""
df = pd.DataFrame([x.split(',') for x in data.split('\n')])
df.columns = ['name','X','Y']
pos = [0, 1, 2]
lbls = ["LOW", "MID", "HIGH"]
trans = {lbls[i]:pos[i] for i in range(len(pos))}
mat = np.zeros((3, 3), dtype="U10") # This is limited to 10 characters
xxs = []
yys = []
offset = 0.05
for i in range(df.shape[0]):
    xc, yc = trans[df.X[i]], trans[df.Y[i]]
    if mat[xc, yc]=="":
        mat[xc, yc] = df.name[i]
    else:
        mat[xc, yc] = mat[xc, yc] + ";" + df.name[i]
    xxs.append(xc)
    yys.append(yc)
fig,axs = plt.subplots()
axs.scatter(xxs, yys)
for i in range(df.shape[0]):
    name = mat[xxs[i], yys[i]]
    axs.text(xxs[i]+offset, yys[i]+offset, name)
axs.set_xticks(pos)
axs.set_xticklabels(lbls)
axs.set_yticks(pos)
axs.set_yticklabels(lbls)
for i in pos:
    axs.axhline(pos[i]-0.5, color="black")
    axs.axvline(pos[i]-0.5, color="black")
axs.set_xlim(-0.5, 2.5)
axs.set_ylim(-0.5, 2.5)
plt.show()
This results in the following image:
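The solution above joins the names of points that share a cell instead of spreading them out. If the random variation asked for in the question is wanted, one option is to draw each position at random inside its cell and reject candidates that land too close to points already placed. The following is a minimal sketch of that idea and not part of the original answer; the helper names jittered_positions and plot_grid and the parameters spread, min_dist and max_tries are made up for illustration.

```python
import math
import random

# Same category-to-position mapping as in the answer above.
LEVELS = {"LOW": 0, "MID": 1, "HIGH": 2}


def jittered_positions(rows, spread=0.35, min_dist=0.15, seed=0, max_tries=200):
    """Place every (name, x_category, y_category) row at a random spot inside its cell.

    spread   -- maximum offset from the cell centre (cells are 1 wide, so keep it < 0.5)
    min_dist -- smallest distance accepted between two points (hypothetical default)
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    placed = []
    for name, x_cat, y_cat in rows:
        cx, cy = LEVELS[x_cat], LEVELS[y_cat]
        for _ in range(max_tries):
            x = cx + rng.uniform(-spread, spread)
            y = cy + rng.uniform(-spread, spread)
            # Accept the candidate only if it keeps its distance from earlier points.
            if all(math.hypot(x - px, y - py) >= min_dist for _, px, py in placed):
                break
        # If no candidate was accepted within max_tries, the last one is used anyway.
        placed.append((name, x, y))
    return placed


def plot_grid(points):
    """Draw the jittered points on a 3x3 grid with category tick labels."""
    import matplotlib.pyplot as plt  # imported here so the placement code has no plotting dependency

    pos, lbls = list(LEVELS.values()), list(LEVELS.keys())
    fig, ax = plt.subplots()
    ax.scatter([x for _, x, _ in points], [y for _, _, y in points])
    for name, x, y in points:
        ax.annotate(name, (x, y), xytext=(4, 4), textcoords="offset points")
    ax.set_xticks(pos)
    ax.set_xticklabels(lbls)
    ax.set_yticks(pos)
    ax.set_yticklabels(lbls)
    for p in pos[1:]:
        # Cell borders sit halfway between neighbouring categories.
        ax.axhline(p - 0.5, color="black")
        ax.axvline(p - 0.5, color="black")
    ax.set_xlim(-0.5, 2.5)
    ax.set_ylim(-0.5, 2.5)
    return fig, ax


rows = [("A0", "LOW", "LOW"), ("A", "MID", "MID"), ("B", "LOW", "MID"),
        ("C", "MID", "HIGH"), ("D", "LOW", "MID"), ("E", "HIGH", "HIGH")]
points = jittered_positions(rows)
```

Calling plot_grid(points) and then matplotlib.pyplot.show() draws the figure. Because spread stays below 0.5, every point remains inside its own cell, and changing seed gives a different layout.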

Related

How to set x-axis size in seaborn Pairplot

I am using Seaborn to draw pairplots. The problem is that, for some variables, the x-axis is very short and the data points are crowded together, as can be seen below (first row of plots):
As you can see, the plots in the second row are fine.
This is the code I am using:
import math
import matplotlib.pyplot as plt
import seaborn as sns
y_name = 'y'
features = data.iloc[:, :-1]
features_names = features.columns
plot_size=7
num_plots_x=10 # No. of plots in every row
num_plots_y = math.ceil(len(features_names)/num_plots_x) # No. of plots in y direction
fig = plt.figure(figsize=(plot_size*num_plots_y, plot_size*num_plots_x), facecolor='white')
axes = [fig.add_subplot(num_plots_y,1,i+1) for i in range(num_plots_y)]
for i, ax in enumerate(axes):
    start_index = i * num_plots_x
    end_index = (i+1) * num_plots_x
    if end_index > len(features_names): end_index = len(features_names)
    sns.pairplot(x_vars=features_names[start_index:end_index], y_vars=y_name, data = data)
plt.savefig('figure.png')
Is there any way to set the size or scale of the x-axis?

Fitting a column from a pandas dataframe to animated histogram

I am attempting to produce an animated histogram from rows of a data frame I created. The code I am using to produce the histogram is below. It works with data = np.random.randn(1000), but when I replace that with data = df['GDP'] it outputs a non-animated histogram instead. I am trying to fit a column from a data frame into this code:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path
import matplotlib.animation as animation
fig, ax = plt.subplots()
# histogram our data with numpy
data = np.random.randn(1000)
n, bins = np.histogram(data, 100)
# get the corners of the rectangles for the histogram
left = np.array(bins[:-1])
right = np.array(bins[1:])
bottom = np.zeros(len(left))
top = bottom + n
nrects = len(left)
# here comes the tricky part -- we have to set up the vertex and path
# codes arrays using moveto, lineto and closepoly
# for each rect: 1 for the MOVETO, 3 for the LINETO, 1 for the
# CLOSEPOLY; the vert for the closepoly is ignored but we still need
# it to keep the codes aligned with the vertices
nverts = nrects*(1+3+1)
verts = np.zeros((nverts, 2))
codes = np.ones(nverts, int) * path.Path.LINETO
codes[0::5] = path.Path.MOVETO
codes[4::5] = path.Path.CLOSEPOLY
verts[0::5,0] = left
verts[0::5,1] = bottom
verts[1::5,0] = left
verts[1::5,1] = top
verts[2::5,0] = right
verts[2::5,1] = top
verts[3::5,0] = right
verts[3::5,1] = bottom
barpath = path.Path(verts, codes)
patch = patches.PathPatch(barpath, facecolor='green', edgecolor='yellow', alpha=0.5)
ax.add_patch(patch)
ax.set_xlim(left[0], right[-1])
ax.set_ylim(bottom.min(), top.max())
def animate(i):
    # simulate new data coming in
    data = np.random.randn(1000)
    n, bins = np.histogram(data, 100)
    top = bottom + n
    verts[1::5,1] = top
    verts[2::5,1] = top
ani = animation.FuncAnimation(fig, animate, 100, repeat=False)
from IPython.display import HTML
HTML(ani.to_jshtml())
To fit my own data, I am replacing:
# histogram our data with numpy
data = np.random.randn(1000)
and:
# simulate new data coming in
data = np.random.randn(1000)
with a column in my data frame that has 247 rows:
data = df['GDP']
The output is a histogram of my own data; however, it is not animated as it is with data = np.random.randn(1000).
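A likely explanation, offered as an assumption since the data frame itself is not shown: animate recomputes the histogram from the same fixed column on every frame, so the bar heights never change, whereas np.random.randn(1000) produces new values each time. One way to get motion is to feed the column in gradually, for example by histogramming only the rows revealed so far on each frame. The sketch below shows that idea with a stand-in array; gdp, frame_heights and rows_per_frame are hypothetical names.

```python
import numpy as np

# Stand-in for df['GDP'].to_numpy(); 247 values, as in the question.
rng = np.random.default_rng(0)
gdp = rng.normal(loc=100.0, scale=15.0, size=247)

# Fix the bin edges once, from the full column, so the bars keep their x positions.
n_bins = 100
_, bins = np.histogram(gdp, n_bins)


def frame_heights(i, rows_per_frame=5):
    """Bar heights for frame i: a histogram of the rows revealed so far."""
    stop = min(len(gdp), (i + 1) * rows_per_frame)
    counts, _ = np.histogram(gdp[:stop], bins)
    return counts
```

Inside animate(i), top = bottom + frame_heights(i) would then replace the two lines that regenerate and histogram the data. The y-limits are best taken from the histogram of the full column so that the tallest bar fits once all rows are shown.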

Python: Changing visual parameters of ptitprince repo derived from seaborn and matplotlib

I am using a github repository called ptitprince, which is derived from seaborn and matplotlib, to generate graphs.
For example, this is the code using the ptitprince repo:
# coding: utf8
import pandas as pd
import ptitprince as pt
import seaborn as sns
import os
import matplotlib.pyplot as plt
#sns.set(style="darkgrid")
#sns.set(style="whitegrid")
#sns.set_style("white")
sns.set(style="whitegrid",font_scale=2)
import matplotlib.collections as clt
df = pd.read_csv ("u118phag.csv", sep= ",")
df.head()
savefigs = True
figs_dir = 'figs'
if savefigs:
    # Make the figures folder if it doesn't yet exist
    if not os.path.isdir('figs'):
        os.makedirs('figs')
#automation
f, ax = plt.subplots(figsize=(4, 5))
#f.subplots_adjust(hspace=0,wspace=0)
dx = "Treatment"; dy = "score"; ort = "v"; pal = "Set2"; sigma = .2
ax=pt.RainCloud(x = dx, y = dy, data = df, palette = pal, bw = sigma,
                width_viol = .6, ax = ax, move=.2, offset=.1, orient = ort, pointplot = True)
f.show()
if savefigs:
    f.savefig('figs/figure20.png', bbox_inches='tight', dpi=500)
which generates the following graph
The raw code, which does not use pt.RainCloud, is as follows and produces the same graph as above:
# coding: utf8
import pandas as pd
import ptitprince as pt
import seaborn as sns
import os
import matplotlib.pyplot as plt
#sns.set(style="darkgrid")
#sns.set(style="whitegrid")
#sns.set_style("white")
sns.set(style="whitegrid",font_scale=2)
import matplotlib.collections as clt
df = pd.read_csv ("u118phag.csv", sep= ",")
df.head()
savefigs = True
figs_dir = 'figs'
if savefigs:
    # Make the figures folder if it doesn't yet exist
    if not os.path.isdir('figs'):
        os.makedirs('figs')
f, ax = plt.subplots(figsize=(7, 5))
dy="Treatment"; dx="score"; ort="h"; pal = sns.color_palette(n_colors=1)
#adding color
pal = "Set2"
f, ax = plt.subplots(figsize=(7, 5))
ax=pt.half_violinplot( x = dx, y = dy, data = df, palette = pal, bw = .2, cut = 0.,
                       scale = "area", width = .6, inner = None, orient = ort)
ax=sns.stripplot( x = dx, y = dy, data = df, palette = pal, edgecolor = "white",
                  size = 3, jitter = 1, zorder = 0, orient = ort)
ax=sns.boxplot( x = dx, y = dy, data = df, color = "black", width = .15, zorder = 10,\
                showcaps = True, boxprops = {'facecolor':'none', "zorder":10},\
                showfliers=True, whiskerprops = {'linewidth':2, "zorder":10},\
                saturation = 1, orient = ort)
if savefigs:
    f.savefig('figs/figure21.png', bbox_inches='tight', dpi=500)
Now, what I'm trying to do is to figure out how to modify the graph so that I can (1) move the plots closer together, so there is not so much white space between them, and (2) shift the x-axis to the right, so that I can make the distribution (violin) plot wider without it getting cut in half by the y-axis.
I have tried to play around with subplots_adjust() as you can see in the first box of code, but I receive an error. I cannot figure out how to appropriately use this function, or even if that will actually bring the different graphs closer together.
I also know that I can increase the distribution size by increasing the value width = .6, but if I increase it too much, the distribution plot begins to be cut off by the y-axis. I can't figure out whether I need to adjust the overall plot using plt.subplots, or whether I need to move each individual plot.
Any advice or recommendations on how to change the visuals of the graph? I've been staring at this for a while, and I can't figure out how to make seaborn/matplotlib play nicely with ptitprince.

You could try changing the interval of the x-axis that is shown with ax.set_xbound, passing a lower bound smaller than the one you currently have.
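As a rough illustration of that suggestion, the sketch below reads the current bounds with ax.get_xbound and moves only the lower one. The plotted points are placeholder data rather than the RainCloud plot, and the 0.5 shift is an arbitrary value chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch; drop this line for interactive use
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 5))
ax.plot([0, 1, 2], [3, 1, 2], "o")  # placeholder data, not the RainCloud plot

lower, upper = ax.get_xbound()      # bounds currently shown
ax.set_xbound(lower=lower - 0.5)    # extend the view to the left only; upper is kept
new_lower, new_upper = ax.get_xbound()
```

In the question's script the same call would go after pt.RainCloud(...) and before f.savefig(...), so that the extra room on the left is available to the wider half-violin.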

Removing Data Below A Line In A Scatterplot (Python)

So I had code that graphed a 2D histogram of my dataset. I plotted it like so:
histogram = plt.hist2d(fehsc, ofesc, bins=nbins, range=[[-1,.5],[0.225,0.4]])
I wanted to only look at data above a certain line though, so I added the following and it worked just fine:
counts = histogram[0]
xpos = histogram[1]
ypos = histogram[2]
image = histogram[3]
newcounts = counts #we're going to iterate over this
for i in range (nbins):
    xin = xpos[i]
    yin = ypos
    yline = m*xin + b
    reset = np.where(yin < yline) #anything less than yline we want to be 0
    #index = index[0:len(index)-1]
    countout = counts[i]
    countout[reset] = 0
    newcounts[i] = countout
However, I now need to draw a regression line through that cut region. Doing so is not possible (AFAIK) with plt.hist2d, so I'm using plt.scatter. The problem is that I don't know how to make that cut anymore, because I can't index the scatterplot.
I have this now:
plt.xlim(-1,.5)
plt.ylim(.225, .4)
scatter = plt.scatter(fehsc,ofesc, marker = ".")
and I only want to retain the data above some line:
xarr = np.arange(-1,0.5, 0.015)
yarr = m*xarr + b
plt.plot(xarr, yarr, color='r')
I've tried running the loop with some variations of the variables, but I don't understand how to get it to work.

You could define a mask for your data before you plot and then plot only the data points that meet your criteria. Below is an example where all data points above a certain line are plotted in green and all data points below the line are plotted in black.
from matplotlib import pyplot as plt
import numpy as np
#the scatterplot data
xvals = np.random.rand(100)
yvals = np.random.rand(100)
#the line
b = 0.1
m = 1
x = np.linspace(0,1,num=100)
y = m*x+b
mask = yvals > m*xvals+b
plt.scatter(xvals[mask],yvals[mask],color='g')
plt.scatter(xvals[~mask],yvals[~mask],color='k')
plt.plot(x,y,'r')
plt.show()
The result looks like this
Hope this helps.
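Since the question also asks for a regression line through the retained region, the same mask can select the points that go into the fit. This is a minimal sketch assuming an ordinary least-squares straight line is what is wanted; np.polyfit is used for the fit, and the random data only stands in for fehsc and ofesc.

```python
import numpy as np

# placeholder scatterplot data
rng = np.random.default_rng(0)
xvals = rng.random(100)
yvals = rng.random(100)

# the cut line, as in the example above
b = 0.1
m = 1
mask = yvals > m * xvals + b

# keep only the points above the line
x_keep, y_keep = xvals[mask], yvals[mask]

# least-squares straight line through the retained points;
# polyfit returns the coefficients from highest to lowest power
slope, intercept = np.polyfit(x_keep, y_keep, deg=1)
```

The fitted line can then be drawn over the scatterplot with plt.plot(x, slope*x + intercept), using any array x that spans the plotted range.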
EDIT:
If you want to create a 2D histogram, where the portion below the line is set to zero, you can do that by first generating the histogram using numpy (as an array) and then setting the values inside that array to zero, if the bins fall below the line. After that, you can plot the matrix using plt.pcolormesh:
from matplotlib import pyplot as plt
import numpy as np
#the scatterplot data
xvals = np.random.rand(1000)
yvals = np.random.rand(1000)
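histogram,xbins,ybins = np.histogram2d(xvals,yvals,bins=50)
#histogram2d puts x along the first axis; transpose so that rows correspond to y,
#which is the layout that np.meshgrid and plt.pcolormesh use below
histogram = histogram.T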
#computing the bin centers from the bin edges:
xcenters = 0.5*(xbins[:-1]+xbins[1:])
ycenters = 0.5*(ybins[:-1]+ybins[1:])
#the line
b = 0.1
m = 1
x = np.linspace(0,1,num=100)
y = m*x+b
#hiding the part of the histogram below the line
xmesh,ymesh = np.meshgrid(xcenters,ycenters)
mask = m*xmesh+b > ymesh
histogram[mask] = 0
#making the plot
mat = plt.pcolormesh(xcenters,ycenters,histogram)
line = plt.plot(x,y,'r')
plt.xlim([0,1])
plt.ylim([0,1])
plt.show()
The result would be something like this:

Scipy dendrogram with names

I'm using the example dendrogram from this post in my work but would also like to keep track of which row / column is from which piece of data.
I've edited the code to record the names of the data in a list called names, as shown below, and would like to print the names at the bottom and to the right of the distance matrix visualization. I've tried adding labels = names in the call to dendrogram, but this didn't help.
Does anyone know how to add labels to this?
import numpy as np
import pylab
import scipy.cluster.hierarchy as sch
# Generate random features and distance matrix.
x = np.random.rand(40)
D = np.zeros([40,40])
for i in range(40):
    for j in range(40):
        D[i,j] = abs(x[i] - x[j])
### new code
names = [ ]
for i in range(40):
    names.append( 'str%i'%( i ) )
    print(names[-1])
### end new code
# Compute and plot first dendrogram.
fig = pylab.figure(figsize=(8,8))
ax1 = fig.add_axes([0.09,0.1,0.2,0.6])
Y = sch.linkage(D, method='centroid')
Z1 = sch.dendrogram(Y, orientation='right')
ax1.set_xticks([])
ax1.set_yticks([])
# Compute and plot second dendrogram.
ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
Y = sch.linkage(D, method='single')
Z2 = sch.dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])
# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
idx2 = Z2['leaves']
D = D[idx1,:]
D = D[:,idx2]
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=pylab.cm.YlGnBu)
axmatrix.set_xticks([])
axmatrix.set_yticks([])
# Plot colorbar.
#axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
#pylab.colorbar(im, cax=axcolor)
fig.show()
fig.savefig('dendrogram.png')

The python package heatmapcluster (available on PyPI) that I wrote accepts (in fact, requires) labels.
Here's a simplified version of your script using heatmapcluster:
import numpy as np
import matplotlib.pyplot as plt
from heatmapcluster import heatmapcluster
# Generate random features and distance matrix.
x = np.random.rand(40)
D = np.abs(np.subtract.outer(x, x))
names = ['str%i' % i for i in range(len(x))]
h = heatmapcluster(D, names, names,
                   num_row_clusters=3, num_col_clusters=3,
                   label_fontsize=8,
                   xlabel_rotation=-75,
                   cmap=plt.cm.coolwarm,
                   show_colorbar=True,
                   top_dendrogram=True)
plt.show()
And here is the plot it generates:
(Note that, for a symmetric array like D, there is really no point in clustering both axes. By symmetry, they will generate the same dendrogram.)
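If adding a dependency is not an option, the names can also be carried through SciPy directly. The sketch below is an assumption about what the question needs rather than part of the answer above: it passes labels=names to dendrogram with no_plot=True and reads the leaf order back, so that the same order can be applied to the tick labels of the distance matrix. The variable ordered_names is a hypothetical name.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

rng = np.random.default_rng(0)
x = rng.random(40)
names = ["str%i" % i for i in range(len(x))]

# Cluster the 1-D features directly; each row of the 2-D array is one observation.
Y = sch.linkage(x[:, None], method="single")

# no_plot=True returns the layout without drawing, which is enough to inspect the label order.
Z = sch.dendrogram(Y, labels=names, no_plot=True)

leaf_order = Z["leaves"]                         # row indices in dendrogram order
ordered_names = [names[i] for i in leaf_order]   # same list as Z["ivl"]
```

In the question's script, ax1.set_yticks([]) and ax2.set_xticks([]) remove the tick labels that dendrogram draws, which may be why labels=names appeared to have no effect. For the matrix, calling axmatrix.set_xticks(range(len(names))) and axmatrix.set_xticklabels(...) with the names ordered by Z2['leaves'], and the y equivalents with the order from Z1['leaves'], in place of the empty tick lists would show the names along its edges.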
