When I work with R's ggplot(), I can store my plots as p <- ggplot()... This makes it easier to add elements to a plot and inspect them before they are saved. Is there a way to do this in Python? For example, I would like to be able to add elements to the first figure, fig, after I plot the new z values.
What ggplot can do in R:
library(ggplot2)
p<-ggplot(mtcars,aes(x=factor(am),y=mpg))+geom_boxplot()
p+xlab("Transmission Type")+ylab("Miles Per Gallon (mpg)")
p
What I would like to do in Python.
import matplotlib.pyplot as plt
x = [1,2,3]
y = [2,4,1]
fig = plt.plot(x, y)
z = x*2
y = [2,4,1]
fig2 = plt.plot(z, y)
fig.show()
fig2.show()
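One way to get ggplot-like behaviour is to keep references to the Figure and Axes objects returned by plt.subplots(), rather than the return value of plt.plot() (which is a list of line objects, not a figure). The following is a minimal sketch under that assumption; the names ax and ax2 and the axis label text are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; remove for interactive use
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [2, 4, 1]

# Keep handles to the first figure and its axes, much like p <- ggplot(...)
fig, ax = plt.subplots()
ax.plot(x, y, label="x")

# Element-wise doubling (x * 2 on a list would repeat the list instead)
z = [2 * v for v in x]
fig2, ax2 = plt.subplots()
ax2.plot(z, y, label="z")

# Later, go back to the first figure and add more elements to it
ax.plot(z, y, label="z")
ax.set_xlabel("x or z")
ax.legend()

# fig.savefig("first.png") and fig2.savefig("second.png") would save each figure separately
```

Because every call goes through ax or ax2, it does not matter which figure is "current" at the time an element is added.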
I am plotting separate figures for each attribute and label for each data sample. Here is the illustration:
As illustrated in the last subplot (Label), my data contains seven numeric classes (0 to 6). I'd like to visualize these classes using different colors and a legend. Please note that I only want colors in the last subplot. How should I do that?
Here is the code of above plot:
x, y = test_data["x"], test_data["y"]
# determine the total number of plots
n, off = x.shape[1] + 1, 0
plt.rcParams["figure.figsize"] = (40, 15)
# plot all the attributes
for i in range(6):
    plt.subplot(n, 1, off + 1)
    plt.plot(x[:, off])
    plt.title('Attribute:' + str(i), y=0, loc='left')
    off += 1
# plot Labels
plt.subplot(n, 1, n)
plt.plot(y)
plt.title('Label', y=0, loc='left')
plt.savefig(save_file_name, bbox_inches="tight")
plt.close()
First, just to set up a similar dataset:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.random((100,6))
y = np.random.randint(0, 6, (100))
fig, axs = plt.subplots(6, figsize=(40,15))
We could use scatter() with c=y to give individual points a color per class:
for i in range(x.shape[-1]):
    axs[i].scatter(range(x.shape[0]), x[:,i], c=y)
Or we could mask the arrays we're plotting:
for i in range(x.shape[-1]):
    for j in np.unique(y):
        axs[i].plot(np.ma.masked_where(y!=j, x[:,i]), 'o')
Either way we get the same results:
Edit: Ah you've edited your question! You can do exactly the same thing for your last plot only, just modify my code above to take it out of the loop of subplots :)
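A sketch of what "taking it out of the loop" might look like: the attribute subplots stay as plain lines, and only the label subplot gets one scatter call per class, which also yields one legend entry per class. The random data, the tab10 colormap, and names such as label_ax are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 6))        # six attributes
y = rng.integers(0, 7, 100)     # class labels 0..6

n = x.shape[1] + 1
fig, axs = plt.subplots(n, 1, sharex=True, figsize=(10, 12))

# Attribute subplots: plain, uncolored lines
for i in range(x.shape[1]):
    axs[i].plot(x[:, i])
    axs[i].set_title(f"Attribute: {i}", y=0, loc="left")

# Label subplot only: one scatter call per class gives one legend entry per class
label_ax = axs[-1]
idx = np.arange(y.size)
cmap = plt.get_cmap("tab10")
for k in np.unique(y):
    mask = y == k
    label_ax.scatter(idx[mask], y[mask], color=cmap(int(k)), s=15, label=f"class {k}")
label_ax.set_title("Label", y=0, loc="left")
label_ax.legend(ncol=7, loc="upper center", bbox_to_anchor=(0.5, -0.2))
```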
As suggested, we imitate the matplotlib step function by creating a LineCollection to color the different line segments:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection
from matplotlib.patches import Patch
#random data generation
np.random.seed(12345)
number_of_categories=4
y = np.concatenate([np.repeat(np.random.randint(0, number_of_categories), np.random.randint(1, 30)) for _ in range(20)])
#check the results with less points
#y = y[:10]
x = y[None] * np.linspace(1, 5, 3)[:, None]
x += 2 * np.random.random(x.shape) - 1
#your initial plot
num_plots = x.shape[0] + 1
fig, axes = plt.subplots(num_plots, 1, sharex=True, figsize=(10, 8))
for i, ax in enumerate(axes.flat[:-1]):
    ax.plot(x[i,:])
#first we create the matplotlib step function with x-values as their midpoint
axes.flat[-1].step(np.arange(y.size), y, where="mid", color="lightgrey", zorder=-1)
#then we plot colored segments with shifted index simulating the step function
shifted_x = np.arange(y.size+1)-0.5
#and identify the step indexes
idx_steps, = np.nonzero(np.diff(y, prepend=np.inf, append=np.inf))
#create collection of plateau segments
colored_segments = np.zeros((idx_steps.size-1, 2, 2))
colored_segments[:, :, 0] = np.vstack((shifted_x[idx_steps[:-1]], shifted_x[idx_steps[1:]])).T
colored_segments[:, :, 1] = np.repeat(y[idx_steps[:-1]], 2).reshape(-1, 2)
#generate discrete color list
n_levels, idx_levels = np.unique(y[idx_steps[:-1]], return_inverse=True)
colorarr = np.asarray(plt.cm.tab10.colors[:n_levels.size])
#and plot the colored segments
lc_cs = LineCollection(colored_segments, colors=colorarr[idx_levels, :], lw=10)
lines_cs = axes.flat[-1].add_collection(lc_cs)
#scaling and legend generation
axes.flat[-1].set_ylim(n_levels.min()-0.5, n_levels.max()+0.5)
axes.flat[-1].legend([Patch(color=colorarr[i, :]) for i, _ in enumerate(n_levels)],
                     [f"cat {i}" for i in n_levels],
                     loc="upper center", bbox_to_anchor=(0.5, -0.15),
                     ncol=n_levels.size)
plt.show()
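The step detection in the code above can be hard to follow, so here is a small worked example of the same np.diff call on a tiny, made-up label array. Padding with np.inf on both sides guarantees that the first and last positions are reported as steps.

```python
import numpy as np

y = np.array([2, 2, 0, 0, 0, 3])   # toy labels: three plateaus

# Non-zero differences mark positions where a new plateau starts;
# the inf padding adds the start of the first and the end of the last plateau
idx_steps, = np.nonzero(np.diff(y, prepend=np.inf, append=np.inf))
print(idx_steps)                    # [0 2 5 6]

# Plateau edges sit half a sample before each step index
shifted_x = np.arange(y.size + 1) - 0.5
starts = shifted_x[idx_steps[:-1]]  # [-0.5  1.5  4.5]
ends = shifted_x[idx_steps[1:]]     # [ 1.5  4.5  5.5]
levels = y[idx_steps[:-1]]          # [2 0 3]
```

Each (start, end, level) triple corresponds to one horizontal segment in the LineCollection.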
Sample output:
Alternatively, you can use broken barh plots or color this axis or even all axes using axvspan.
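As a rough sketch of the axvspan variant, the background of every axis can be shaded once per run of identical labels. The toy data, the tab10 palette, and the names ax_sig and ax_lab are assumptions for illustration, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Patch

# Toy labels made of five runs, plus a noisy signal that follows them
y = np.repeat([0, 2, 1, 2, 3], [5, 3, 6, 2, 4])
rng = np.random.default_rng(1)
signal = y + rng.normal(scale=0.3, size=y.size)

# Start and stop index of every run of identical labels
change = np.flatnonzero(np.diff(y)) + 1
starts = np.concatenate(([0], change))
stops = np.concatenate((change, [y.size]))

palette = plt.get_cmap("tab10").colors
fig, (ax_sig, ax_lab) = plt.subplots(2, 1, sharex=True, figsize=(8, 4))
ax_sig.plot(signal)
ax_lab.step(np.arange(y.size), y, where="mid", color="black")

# Shade each run on both axes with the color of its class
for start, stop in zip(starts, stops):
    for ax in (ax_sig, ax_lab):
        ax.axvspan(start - 0.5, stop - 0.5, color=palette[int(y[start])], alpha=0.3)

# Proxy patches give one legend entry per class rather than one per span
classes = np.unique(y)
ax_lab.legend(handles=[Patch(color=palette[int(k)], alpha=0.3, label=f"class {k}") for k in classes],
              ncol=classes.size, loc="upper center", bbox_to_anchor=(0.5, -0.2))
```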
I am trying to draw a scatter point inside a boxplot in a subplot. When I do this for just one boxplot, it works: I can draw a specific point with a specific color inside the boxplot. The green ball (Image 1) represents a specific number in comparison with the boxplot values.
for columnName in data_num.columns:
    plt.figure(figsize=(2, 2), dpi=100)
    bp = data_num.boxplot(column=columnName, grid=False)
    y = S[columnName]
    x = columnName
    if y > data_num[columnName].describe().iloc[5]:
        plt.plot(1, y, 'r.', alpha=0.7,color='green',markersize=12)
        count_G = count_G + 1
    elif y < data_num[columnName].describe().iloc[5]:
        plt.plot(1, y, 'r.', alpha=0.7,color='red',markersize=12)
        count_L = count_L + 1
    else:
        plt.plot(1, y, 'r.', alpha=0.7,color='yellow',markersize=12)
        count_E = count_E + 1
Image 1 - Scatter + 1 boxplot
I can create a subplot with boxplots.
fig, axes = plt.subplots(6,10,figsize=(16,16)) # create figure and axes
fig.subplots_adjust(hspace=0.6, wspace=1)
for j,columnName in enumerate(list(data_num.columns.values)[:-1]):
    bp = data_num.boxplot(columnName,ax=axes.flatten()[j])
Image 2 - Subplots + Boxplots
But when I try to plot a specific number inside each boxplot, it overwrites the entire plot.
plt.subplot(6,10,j+1)
if y > data_num[columnName].describe().iloc[5]:
    plt.plot(1, y, 'r.', alpha=0.7,color='green',markersize=12)
    count_G = count_G + 1
elif y < data_num[columnName].describe().iloc[5]:
    plt.plot(1, y, 'r.', alpha=0.7,color='red',markersize=12)
    count_L = count_L + 1
else:
    plt.plot(1, y, 'r.', alpha=0.7,color='black',markersize=12)
    count_E = count_E + 1
Image 3 - Subplots + scatter
It is not completely clear what is going wrong. Probably the call to plt.subplot(6,10,j+1) is erasing some stuff. However, such a call is not necessary with the standard modern use of matplotlib, where the subplots are created via fig, axes = plt.subplots(). Be careful to use ax.plot() instead of plt.plot(). plt.plot() plots on the "current" ax, which can be a bit confusing when there are lots of subplots.
The sample code below first creates some toy data (hopefully similar to the data in the question). Then the boxplots and the individual dots are drawn in a loop. To avoid repetition, the counts and the colors are stored in dictionaries. As data_num[columnName].describe().iloc[5] seems to be the median, for readability the code directly calculates that median.
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
column_names = list('abcdef')
S = {c: np.random.randint(2, 6) for c in column_names}
data_num = pd.DataFrame({c: np.random.randint(np.random.randint(0, 3), np.random.randint(4, 8), 20)
                         for c in column_names})
colors = {'G': 'limegreen', 'E': 'gold', 'L': 'crimson'}
counts = {c: 0 for c in colors}
fig, axes = plt.subplots(1, 6, figsize=(12, 3), gridspec_kw={'hspace': 0.6, 'wspace': 1})
for columnName, ax in zip(data_num.columns, axes.flatten()):
    data_num.boxplot(column=columnName, grid=False, ax=ax)
    y = S[columnName] # in case S would be a dataframe with one row: y = S[columnName].values[0]
    data_median = data_num[columnName].median()
    classification = 'G' if y > data_median else 'L' if y < data_median else 'E'
    ax.plot(1, y, '.', alpha=0.9, color=colors[classification], markersize=12)
    counts[classification] += 1
print(counts)
plt.show()
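The assumption that describe().iloc[5] is the median can be checked quickly on a small series. The values below are made up; the point is only the position of the '50%' entry in the summary returned by pandas.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])   # toy data with an outlier
summary = s.describe()

# describe() returns: count, mean, std, min, 25%, 50%, 75%, max
print(list(summary.index))
print(summary.iloc[5], s.median())  # both are the median (3.0 here)
```

Using .median() directly, as the answer's code does, says the same thing without relying on the position of the entry.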
Background:
I have a list_of_x_and_y_list that contains x and y values which looks like:
[[(44800, 14888), (132000, 12500), (40554, 12900)], [(None, 193788), (101653, 78880), (3866, 160000)]]
I have another data_name_list ["data_a","data_b"] so that
"data_a" = [(44800, 14888), (132000, 12500), (40554, 12900)]
"data_b" = [(None, 193788), (101653, 78880), (3866, 160000)]
The len of list_of_x_and_y_list / or len of data_name_list is > 20.
Question:
How can I create a scatter plot for each item (being the same colour) in the data_name_list?
What I have tried:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax = plt.axes(facecolor='#FFFFFF')
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
print(list_of_x_and_y_list)
for x_and_y_list, data_name, color in zip(list_of_x_and_y_list, data_name_list, colors):
    for x_and_y in x_and_y_list:
        print(x_and_y)
        x, y = x_and_y
        ax.scatter(x, y, label=data_name, color=color) # "label=data_name" creates
                                                       # a huge list as a legend!
                                                       # :(
plt.title('Matplot scatter plot')
plt.legend(loc=2)
file_name = "3kstc.png"
fig.savefig(file_name, dpi=fig.dpi)
print("Generated: {}".format(file_name))
The Problem:
The legend appears to be a very long list, which I don't know how to rectify:
Relevant Research:
Matplotlib scatterplot
Scatter Plot
Scatter plot in Python using matplotlib
The reason you get a long, repeated list as a legend is that you are passing each point as a separate series; matplotlib does not automatically group your data by label.
A quick fix is to iterate over the list and zip together the x-values and the y-values of each series into two tuples, so that the x tuple contains all the x-values and the y tuple all the y-values.
Then you can feed these tuples to the ax.scatter method together with the label.
I felt that names such as list_of_x_and_y_list were unnecessarily long and complicated, so in my code I've used shorter names.
import matplotlib.pyplot as plt
data_series = [[(44800, 14888), (132000, 12500), (40554, 12900)],
               [(None, 193788), (101653, 78880), (3866, 160000)]]
data_names = ["data_a","data_b"]
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_facecolor('#FFFFFF')  # set the background on the existing axes instead of creating a second one
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
for data, data_name, color in zip(data_series, data_names, colors):
    x,y = zip(*data)
    ax.scatter(x, y, label=data_name, color=color)
plt.title('Matplot scatter plot')
plt.legend(loc=1)
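The key step above is zip(*data), which transposes a list of (x, y) pairs into one tuple of x-values and one tuple of y-values. A tiny example using the first series from the question:

```python
data = [(44800, 14888), (132000, 12500), (40554, 12900)]

# Unpacking with * passes each pair as a separate argument to zip,
# which then groups the first elements together and the second elements together
xs, ys = zip(*data)

print(xs)  # (44800, 132000, 40554)
print(ys)  # (14888, 12500, 12900)
```

With one scatter call per series instead of one per point, each label appears in the legend exactly once.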
To get only one legend entry per data_name, you should pass data_name as a label only once. The remaining calls should use label=None.
The simplest way to achieve this with the current code is to set data_name to None at the end of the inner loop:
from matplotlib import pyplot as plt
from random import randint
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_facecolor('#FFFFFF')
# create some random data, suppose the sublists have different lengths
list_of_x_and_y_list = [[(randint(1000, 4000), randint(2000, 5000)) for col in range(randint(2, 10))]
                        for row in range(10)]
data_name_list = list('abcdefghij')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
for x_and_y_list, data_name, color in zip(list_of_x_and_y_list, data_name_list, colors):
    for x_and_y in x_and_y_list :
        x, y = x_and_y
        ax.scatter(x, y, label=data_name, color=color)
        data_name = None
plt.legend(loc=2)
plt.show()
Some things can be simplified, making the code 'more pythonic', for example:
for x_and_y in x_and_y_list :
    x, y = x_and_y
can be written as:
for x, y in x_and_y_list:
Another issue is that, with a lot of data, calling scatter for every point could be rather slow. All the x and y values belonging to the same list can be plotted together, for example using list comprehensions:
for x_and_y_list, data_name, color in zip(list_of_x_and_y_list, data_name_list, colors):
    xs = [x for x, y in x_and_y_list]
    ys = [y for x, y in x_and_y_list]
    ax.scatter(xs, ys, label=data_name, color=color)
scatter can even accept a list of colors, one per point, but plotting all the points in one go wouldn't allow a separate label per data_name.
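If a single scatter call is still preferred, one possible workaround is to build the legend from proxy handles rather than from the plotted collection. This is only a sketch: the groups dictionary and its values are hypothetical sample data, and Line2D proxies are one of several ways to create legend entries by hand.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Hypothetical sample data: name -> list of (x, y) points
groups = {"data_a": [(1, 2), (2, 3)], "data_b": [(3, 1), (4, 4), (5, 2)]}
palette = plt.rcParams["axes.prop_cycle"].by_key()["color"]

# Flatten all points and remember one color per point
xs, ys, point_colors = [], [], []
for (name, points), color in zip(groups.items(), palette):
    for px, py in points:
        xs.append(px)
        ys.append(py)
        point_colors.append(color)

fig, ax = plt.subplots()
ax.scatter(xs, ys, c=point_colors)  # a single call for all points

# Proxy handles: invisible data, but the right marker, color, and label
handles = [Line2D([], [], marker="o", linestyle="", color=color, label=name)
           for name, color in zip(groups, palette)]
ax.legend(handles=handles, loc=2)
```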
Very often, numpy is used to store numerical data. This has some advantages, such as vectorization for quick calculations. With numpy the code would look like:
import numpy as np
for x_and_y_list, data_name, color in zip(list_of_x_and_y_list, data_name_list, colors):
    xys = np.array(x_and_y_list)
    ax.scatter(xys[:,0], xys[:,1], label=data_name, color=color)
I would like to generate a series of histograms like the one shown below:
The above visualization was done in TensorFlow, but I'd like to reproduce the same visualization in matplotlib.
EDIT:
Using plt.fill_between as suggested by @SpghttCd, I have the following code:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
colors = cm.OrRd_r(np.linspace(.2, .6, 10))
plt.figure()
x = np.arange(100)
for i in range(10):
    y = np.random.rand(100)
    plt.fill_between(x, y + 10-i, 10-i,
                     facecolor=colors[i],
                     edgecolor='w')
plt.show()
This works great, but is it possible to use histogram instead of a continuous curve?
EDIT:
joypy-based approach, as mentioned in october's comment:
import pandas as pd
import joypy
import numpy as np
from matplotlib import cm
df = pd.DataFrame()
for i in range(0, 400, 20):
    df[i] = np.random.normal(i/410*5, size=30)
joypy.joyplot(df, overlap=2, colormap=cm.OrRd_r, linecolor='w', linewidth=.5)
For finer control of colors, you can define a color gradient function which accepts a fractional index and start and stop color tuples:
def color_gradient(x=0.0, start=(0, 0, 0), stop=(1, 1, 1)):
    r = np.interp(x, [0, 1], [start[0], stop[0]])
    g = np.interp(x, [0, 1], [start[1], stop[1]])
    b = np.interp(x, [0, 1], [start[2], stop[2]])
    return (r, g, b)
Usage:
joypy.joyplot(df, overlap=2, colormap=lambda x: color_gradient(x, start=(.78, .25, .09), stop=(1.0, .64, .44)), linecolor='w', linewidth=.5)
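To see what the gradient function returns, it can be evaluated on its own at a few fractional positions. The function is repeated here so the snippet is self-contained; the start and stop tuples are the ones from the usage line above.

```python
import numpy as np

def color_gradient(x=0.0, start=(0, 0, 0), stop=(1, 1, 1)):
    # Linear interpolation of each RGB channel between start and stop
    r = np.interp(x, [0, 1], [start[0], stop[0]])
    g = np.interp(x, [0, 1], [start[1], stop[1]])
    b = np.interp(x, [0, 1], [start[2], stop[2]])
    return (r, g, b)

start, stop = (.78, .25, .09), (1.0, .64, .44)
first = color_gradient(0.0, start, stop)   # equals the start color
middle = color_gradient(0.5, start, stop)  # halfway: roughly (0.89, 0.445, 0.265)
last = color_gradient(1.0, start, stop)    # equals the stop color
```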
Examples with different start and stop tuples:
Original answer:
You could iterate over the data arrays you'd like to plot with plt.fill_between, setting the fill colors to some gradient and the line color to white:
Creating some sample data:
import numpy as np
t = np.linspace(-1.6, 1.6, 11)
y = np.cos(t)**2
y2 = lambda : y + np.random.random(len(y))/5-.1
plot the series:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
colors = cm.OrRd_r(np.linspace(.2, .6, 10))
plt.figure()
for i in range(10):
    plt.fill_between(t+i, y2()+10-i/10, 10-i/10, facecolor = colors[i], edgecolor='w')
If you want it tailored more closely to your example, you should perhaps consider providing some sample data.
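The fill_between idea can also be applied to binned data, which may be closer to the histogram look asked for in the question. The sketch below is one possible approach, not the method from the original screenshot: it assumes normally distributed sample data, shared bin edges, and uses fill_between's step="post" option to draw flat-topped bins.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

rng = np.random.default_rng(42)
n_series, n_bins = 10, 20
edges = np.linspace(-4, 8, n_bins + 1)   # shared bin edges (assumed range)
colors = cm.OrRd_r(np.linspace(.2, .6, n_series))

fig, ax = plt.subplots()
for i in range(n_series):
    samples = rng.normal(loc=i * 0.4, size=300)   # made-up data, shifted per series
    counts, _ = np.histogram(samples, bins=edges)
    heights = 1.5 * counts / counts.max()          # scale so neighbouring rows overlap a little
    base = n_series - i                            # each series gets its own baseline
    # step="post" holds heights[j] from edges[j] to edges[j+1];
    # the last height is repeated so the array matches the number of edges
    ax.fill_between(edges, base + np.append(heights, heights[-1]), base,
                    step="post", facecolor=colors[i], edgecolor="w", zorder=i)
```

Later series are drawn with a higher zorder so that rows nearer the bottom cover the ones behind them.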
EDIT:
As I commented below, I'm not quite sure whether I understand what you want, or whether it is the best approach for your task. Therefore, here is code that plots, besides the approach from your edit, two samples of how to present a bunch of histograms so that they are easier to compare:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.cm as cm
N = 10
np.random.seed(42)
colors=cm.OrRd_r(np.linspace(.2, .6, N))
fig1 = plt.figure()
x = np.arange(100)
for i in range(10):
    y = np.random.rand(100)
    plt.fill_between(x, y + 10-i, 10-i,
                     facecolor=colors[i],
                     edgecolor='w')
data = np.random.binomial(20, .3, (N, 100))
fig2, axs = plt.subplots(N, figsize=(10, 6))
for i, d in enumerate(data):
    axs[i].hist(d, range(20), color=colors[i], label=str(i))
fig2.legend(loc='upper center', ncol=5)
fig3, ax = plt.subplots(figsize=(10, 6))
ax.hist(data.T, range(20), color=colors, label=[str(i) for i in range(N)])
fig3.legend(loc='upper center', ncol=5)
This leads to the following plots:
your plot from your edit:
N histograms in N subplots:
N histograms side by side in one plot: