Plot multiple bars for categorical data - python

I'm looking for a way to plot multiple bars per value in matplotlib. For numerical data, this can be achieved be adding an offset to the X data, as described for example here:
import numpy as np
import matplotlib.pyplot as plt
X = np.array([1,3,5])
Y = [1,2,3]
Z = [2,3,4]
plt.bar(X - 0.4, Y) # offset of -0.4
plt.bar(X + 0.4, Z) # offset of 0.4
plt.show()
plt.bar() (and ax.bar()) also handle categorical data automatically:
X = ['A','B','C']
Y = [1,2,3]
plt.bar(X, Y)
plt.show()
Here, it is obviously not possible to add an offset, as the categories are not directly associated with a value on the axis. I can manually assign numerical values to the categories and set labels on the x axis with plt.xticks():,
X = ['A','B','C']
Y = [1,2,3]
Z = [2,3,4]
_X = np.arange(len(X))
plt.bar(_X - 0.2, Y, 0.4)
plt.bar(_X + 0.2, Z, 0.4)
plt.xticks(_X, X) # set labels manually
plt.show()
However, I'm wondering if there is a more elegant way that makes use of the automatic category handling of bar(), especially if the number of categories and bars per category is not known in before (this causes some fiddling with the bar widths to avoid overlaps).

There is no automatic support of subcategories in matplotlib.
Placing bars with matplotlib
You may go the way of placing the bars numerically, like you propose yourself in the question. You can of course let the code manage the unknown number of subcategories.
import numpy as np
import matplotlib.pyplot as plt
X = ['A','B','C']
Y = [1,2,3]
Z = [2,3,4]
def subcategorybar(X, vals, width=0.8):
n = len(vals)
_X = np.arange(len(X))
for i in range(n):
plt.bar(_X - width/2. + i/float(n)*width, vals[i],
width=width/float(n), align="edge")
plt.xticks(_X, X)
subcategorybar(X, [Y,Z,Y])
plt.show()
Using pandas
You may also use pandas plotting wrapper, which does the work of figuring out the number of subcategories. It will plot one group per column of a dataframe.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
X = ['A','B','C']
Y = [1,2,3]
Z = [2,3,4]
df = pd.DataFrame(np.c_[Y,Z,Y], index=X)
df.plot.bar()
plt.show()

Related

Plotting probability density function in Python

I want to plot two probability density functions (pdf) based on values of a certain column in a dataframe. The first one for all the values that correspond to rows with target label = 0 and second one where target label = 1.
My attempt is below, but as you can see the curves do not look like a pdf (the max value is 0 and they are not confined to X axis in range 0-1 and 5-6. I assume I can get something close by playing around with bw factor, but I am looking for a one-liner that just figures out right params and plots a pdf(including figuiring out the right X-axis start/end to use). Is there any such built in function that does this. If not, would appreciate some pointers on how to build something like this.
#matplotloib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.neighbors import KernelDensity
values = np.random.rand(10)
values_shift5 = np.random.rand(10) + 5
df = pd.DataFrame({'values' : values, 'label' : np.zeros(10)})
df = pd.concat([df, pd.DataFrame({'values' : values_shift5, 'label' : np.ones(10)})])
kde_label_0 = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(df[df.label == 0]['values'].values.reshape(-1, 1))
kde_label_1 = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(df[df.label == 1]['values'].values.reshape(-1, 1))
X_plot = np.linspace(0, 10, 50).reshape(-1, 1)
log_density_0 = kde_label_0.score_samples(X_plot)
log_density_1 = kde_label_1.score_samples(X_plot)
plt.plot(X_plot, log_density_0, label='Label 0')
plt.plot(X_plot, log_density_1, label='Label 1')
plt.legend()
plt.show()

How to use pandas with matplotlib to create 3D plots

I am struggling a bit with the pandas transformations needed to make data render in 3D on matplot lib. The data I have is usually in columns of numbers (usually time and some value). So lets create some test data to illustrate.
import pandas as pd
pattern = ("....1...."
"....1...."
"..11111.."
".1133311."
"111393111"
".1133311."
"..11111.."
"....1...."
"....1....")
# create the data and coords
Zdata = list(map(lambda d:0 if d == '.' else int(d), pattern))
Zinverse = list(map(lambda d:1 if d == '.' else -int(d), pattern))
Xdata = [x for y in range(1,10) for x in range(1,10)]
Ydata = [y for y in range(1,10) for x in range(1,10)]
# pivot the data into columns
data = [d for d in zip(Xdata,Ydata,Zdata,Zinverse)]
# create the data frame
df = pd.DataFrame(data, columns=['X','Y','Z',"Zi"], index=zip(Xdata,Ydata))
df.head(5)
Edit: This block of data is demo data that would normally come from a query on a
database that may need more cleaning and transforms before plotting. In this case data is already aligned and there are no problems aside having one more column we don't need (Zi).
So the numbers in pattern are transferred into height data in the Z column of df ('Zi' being the inverse image) and with that as the data frame I've struggled to come up with this pivot method which is 3 separate operations. I wonder if that can be better.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cm
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
Xs = df.pivot(index='X', columns='Y', values='X').values
Ys = df.pivot(index='X', columns='Y', values='Y').values
Zs = df.pivot(index='X', columns='Y', values='Z').values
ax.plot_surface(Xs,Ys,Zs, cmap=cm.RdYlGn)
plt.show()
Although I have something working I feel there must be a better way than what I'm doing. On a big data set I would imagine doing 3 pivots is an expensive way to plot something. Is there a more efficient way to transform this data ?
I guess you can avoid some steps during the preparation of the data by not using pandas (but only numpy arrays) and by using some convenience fonctions provided by numpy such as linespace and meshgrid.
I rewrote your code to do so, trying to keep the same logic and the same variable names :
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
pattern = ("....1...."
"....1...."
"..11111.."
".1133311."
"111393111"
".1133311."
"..11111.."
"....1...."
"....1....")
# Extract the value according to your logic
Zdata = list(map(lambda d:0 if d == '.' else int(d), pattern))
# Assuming the pattern is always a square
size = int(len(Zdata) ** 0.5)
# Create a mesh grid for plotting the surface
Xdata = np.linspace(1, size, size)
Ydata = np.linspace(1, size, size)
Xs, Ys = np.meshgrid(Xdata, Ydata)
# Convert the Zdata to a numpy array with the appropriate shape
Zs = np.array(Zdata).reshape((size, size))
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot the surface
ax.plot_surface(Xs, Ys, Zs, cmap=cm.RdYlGn)
plt.show()

How to specify scatter plot point color in matplotlib pyplot? [duplicate]

This question already has answers here:
Matplotlib scatterplot; color as a function of a third variable
(3 answers)
Closed 3 years ago.
I have a dataset with a million of points like:
1.0,9.5,-0.3
2.3,4.8,0.7
8.1,3.6,0.0
3.9,1.4,-0.1
4.7,5.3,0.0
and PyPlot code like
import pandas
import matplotlib.pyplot as plt
headers = ['A','B','C']
df = pandas.read_csv('my_data.csv',names=headers)
df['x'] = df['A']
df['y'] = df['B']
# df['color'] = df['C']
plt.xlim(min(df['x'])/2, max(df['x'])*2)
plt.ylim(min(df['y'])/2, max(df['y'])*2)
plt.xlabel("A")
plt.ylabel("B")
plt.plot(df['x'], df['y'], 'o', ms = 0.2)
plt.show()
I can plot points according to first and second column, but all points have the same color. How to tell PyPlot to color points based on the value in third column?
You need to use plt.scatter() instead of plt.plot(). There's also no need to re-name the DataFrame columns, the first argument is the x values and the second is the y values. c = z will make the colors be determined by whatever the z values are. cmap will determine what the colors are. Here are the options plt.colorbar() will give you a colorbar reference for the colors plotted for z.
import pandas as pd
import matplotlib.pyplot as plt
import random
x = [random.randint(0,100) for x in range(1000)]
y = [random.randint(0,100) for y in range(1000)]
z = [random.randint(0,100) for z in range(1000)]
df = pd.DataFrame({'A': x, 'B':y, 'C':z})
plt.scatter(df['A'], df['B'], c = df['C'], cmap = 'rainbow')
plt.colorbar()
plt.show()
In your case changing
plt.plot(df['x'], df['y'], 'o', ms = 0.2)
to
plt.scatter(df['x'], df['y'], 'o',c = df['color'], ms = 0.2)
should work, assuming that df['color'] is the same length as the x and y variables.
As was pointed out in the comments, there is no (apparent) need to create new df columns.
This you could use this
import pandas
import matplotlib.pyplot as plt
headers = ['A','B','C']
df = pandas.read_csv('my_data.csv',names=headers)
plt.xlim(min(df['A'])/2, max(df['A'])*2)
plt.ylim(min(df['B'])/2, max(df['B'])*2)
plt.xlabel("A")
plt.ylabel("B")
plt.scatter(df['A'], df['B'], 'o', c = df['C'], ms = 0.2)
plt.show()
Edit:
If you really want to make sure that each point has a unique color, then you need to make sure that the c input also only contains unique values.
c = [i for i in range(0,len(df['C'])]
plt.plot(df['A'], df['B'], 'o', c = c, ms = 0.2)

How to adjust branch lengths of dendrogram in matplotlib (like in astrodendro)? [Python]

Here is my resulting plot below but I would like it to look like the truncated dendrograms in astrodendro such as this:
There is also a really cool looking dendrogram from this paper that I would like to recreate in matplotlib.
Below is the code for generating an iris data set with noise variables and plotting the dendrogram in matplotlib.
Does anyone know how to either: (1) truncate the branches like in the example figures; and/or (2) to use astrodendro with a custom linkage matrix and labels?
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import astrodendro
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial import distance
def iris_data(noise=None, palette="hls", desat=1):
# Iris dataset
X = pd.DataFrame(load_iris().data,
index = [*map(lambda x:f"iris_{x}", range(150))],
columns = [*map(lambda x: x.split(" (cm)")[0].replace(" ","_"), load_iris().feature_names)])
y = pd.Series(load_iris().target,
index = X.index,
name = "Species")
c = map_colors(y, mode=1, palette=palette, desat=desat)#y.map(lambda x:{0:"red",1:"green",2:"blue"}[x])
if noise is not None:
X_noise = pd.DataFrame(
np.random.RandomState(0).normal(size=(X.shape[0], noise)),
index=X_iris.index,
columns=[*map(lambda x:f"noise_{x}", range(noise))]
)
X = pd.concat([X, X_noise], axis=1)
return (X, y, c)
def dism2linkage(DF_dism, method="ward"):
"""
Input: A (m x m) dissimalrity Pandas DataFrame object where the diagonal is 0
Output: Hierarchical clustering encoded as a linkage matrix
Further reading:
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html
https://pypi.python.org/pypi/fastcluster
"""
#Linkage Matrix
Ar_dist = distance.squareform(DF_dism.as_matrix())
return linkage(Ar_dist,method=method)
# Get data
X_iris_with_noise, y_iris, c_iris = iris_data(50)
# Get distance matrix
df_dism = 1- X_iris_with_noise.corr().abs()
# Get linkage matrix
Z = dism2linkage(df_dism)
#Create dendrogram
with plt.style.context("seaborn-white"):
fig, ax = plt.subplots(figsize=(13,3))
D_dendro = dendrogram(
Z,
labels=df_dism.index,
color_threshold=3.5,
count_sort = "ascending",
#link_color_func=lambda k: colors[k]
ax=ax
)
ax.set_ylabel("Distance")
I'm not sure this really constitutes a practical answer, but it does allow you to generate dendrograms with truncated hanging lines. The trick is to generate the plot as normal, then manipulate the resulting matplotlib plot to recreate the lines.
I couldn't get your example to work locally, so I've just created a dummy dataset.
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
a = np.random.multivariate_normal([0, 10], [[3, 1], [1, 4]], size=[5,])
b = np.random.multivariate_normal([0, 10], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
dendrogram(Z, ax=ax)
The resulting plot is the usual long-arm dendrogram.
Now for the more interesting bit. A dendrogram is made up of a number of LineCollection objects (one for each colour). To update the lines we iterate through these, extracting the details about their constituent paths, modifying these to remove any lines reaching to a y of zero, and then recreating a LineCollection for these modified paths.
The updated path is then added to the axes, and the original is removed.
The one tricky part is determining what height to draw to instead of zero. Since we are iterating over each dendrograms path, we don't know which point came before — we basically have no idea where we are. However, we can exploit the fact that hanging lines hang vertically. Assuming there are no lines on the same x, we can look for the known other y values for a given x and use that as the basis for our new y when calculating. The downside is that in order to make sure we have this number, we have to pre-scan the data.
Note: If you can get dendrogram hanging lines on the same x, you would need to include the y and search for nearest y above this x to do this.
import numpy as np
from matplotlib.path import Path
from matplotlib.collections import LineCollection
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
dendrogram(Z, ax=ax);
for c in ax.collections[:]: # use [:] to get a copy, since we're adding to the same list
paths = []
for path in c.get_paths():
segments = []
y_at_x = {}
# Pre-pass over all elements, to find the lowest y value at each x value.
# we can use this to caculate where to cut our lines.
for n, seg in enumerate(path.iter_segments()):
x, y = seg[0]
# Don't store if the y is zero, or if it's higher than the current low.
if y > 0 and y < y_at_x.get(x, np.inf):
y_at_x[x] = y
for n, seg in enumerate(path.iter_segments()):
x, y = seg[0]
if y == 0:
# If we know the last y at this x, use it - 0.5, limit > 0
y = max(0, y_at_x.get(x, 0) - 0.5)
segments.append([x,y])
paths.append(segments)
lc = LineCollection(paths, colors=c.get_colors()) # Recreate a LineCollection with the same params
ax.add_collection(lc)
ax.collections.remove(c) # Remove the original LineCollection
The resulting dendrogram looks like this:

How to add axis offset in matplotlib plot?

I'm drawing several point plots in seaborn on the same graph. The x-axis is ordinal, not numerical; the ordinal values are the same for each point plot. I would like to shift each plot a bit to the side, the way pointplot(dodge=...) parameter does within multiple lines within a single plot, but in this case for multiple different plots drawn on top of each other. How can I do that?
Ideally, I'd like a technique that works for any matplotlib plot, not just seaborn specifically. Adding an offset to the data won't work easily, since the data is not numerical.
Example that shows the plots overlapping and making them hard to read (dodge within each plot works okay)
import pandas as pd
import seaborn as sns
df1 = pd.DataFrame({'x':list('ffffssss'), 'y':[1,2,3,4,5,6,7,8], 'h':list('abababab')})
df2 = df1.copy()
df2['y'] = df2['y']+0.5
sns.pointplot(data=df1, x='x', y='y', hue='h', ci='sd', errwidth=2, capsize=0.05, dodge=0.1, markers='<')
sns.pointplot(data=df2, x='x', y='y', hue='h', ci='sd', errwidth=2, capsize=0.05, dodge=0.1, markers='>')
I could use something other than seaborn, but the automatic confidence / error bars are very convenient so I'd prefer to stick with seaborn here.
Answering this for the most general case first.
A dodge can be implemented by shifting the artists in the figure by some amount. It might be useful to use points as units of that shift. E.g. you may want to shift your markers on the plot by 5 points.
This shift can be accomplished by adding a translation to the data transform of the artist. Here I propose a ScaledTranslation.
Now to keep this most general, one may write a function which takes the plotting method, the axes and the data as input, and in addition some dodge to apply, e.g.
draw_dodge(ax.errorbar, X, y, yerr =y/4., ax=ax, dodge=d, marker="d" )
The full functional code:
import matplotlib.pyplot as plt
from matplotlib import transforms
import numpy as np
import pandas as pd
def draw_dodge(*args, **kwargs):
func = args[0]
dodge = kwargs.pop("dodge", 0)
ax = kwargs.pop("ax", plt.gca())
trans = ax.transData + transforms.ScaledTranslation(dodge/72., 0,
ax.figure.dpi_scale_trans)
artist = func(*args[1:], **kwargs)
def iterate(artist):
if hasattr(artist, '__iter__'):
for obj in artist:
iterate(obj)
else:
artist.set_transform(trans)
iterate(artist)
return artist
X = ["a", "b"]
Y = np.array([[1,2],[2,2],[3,2],[1,4]])
Dodge = np.arange(len(Y),dtype=float)*10
Dodge -= Dodge.mean()
fig, ax = plt.subplots()
for y,d in zip(Y,Dodge):
draw_dodge(ax.errorbar, X, y, yerr =y/4., ax=ax, dodge=d, marker="d" )
ax.margins(x=0.4)
plt.show()
You may use this with ax.plot, ax.scatter etc. However not with any of the seaborn functions, because they don't return any useful artist to work with.
Now for the case in question, the remaining problem is to get the data in a useful format. One option would be the following.
df1 = pd.DataFrame({'x':list('ffffssss'),
'y':[1,2,3,4,5,6,7,8],
'h':list('abababab')})
df2 = df1.copy()
df2['y'] = df2['y']+0.5
N = len(np.unique(df1["x"].values))*len([df1,df2])
Dodge = np.linspace(-N,N,N)/N*10
fig, ax = plt.subplots()
k = 0
for df in [df1,df2]:
for (n, grp) in df.groupby("h"):
x = grp.groupby("x").mean()
std = grp.groupby("x").std()
draw_dodge(ax.errorbar, x.index, x.values,
yerr =std.values.flatten(), ax=ax,
dodge=Dodge[k], marker="o", label=n)
k+=1
ax.legend()
ax.margins(x=0.4)
plt.show()
You can use linspace to easily shift your graphs to where you want them to start and end. The function also makes it very easy to scale the graph so they would be visually the same width
import numpy as np
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.pyplot as plt
start_offset = 3
end_offset = start_offset
y1 = np.random.randint(0, 10, 20) ##y1 has 20 random ints from 0 to 10
y2 = np.random.randint(0, 10, 10) ##y2 has 10 random ints from 0 to 10
x1 = np.linspace(0, 20, y1.size) ##create a number of steps from 0 to 20 equal to y1 array size-1
x2 = np.linspace(0, 20, y2.size)
plt.plot(x1, y1)
plt.plot(x2, y2)
plt.show()

Categories