I am starting to render plots with matplotlib as I learn both Python and this interesting plotting library. I need help with a custom plot for a problem I am working on. Maybe there is already a built-in function for this.
Problem:
I am trying to draw a table (rectangle) as a plot with 96 individual cells (8 rows x 12 columns). Each alternate cell should get a specific color (like a chessboard, but with some color combination other than black/white), and each cell should show a value taken from a pandas DataFrame or a Python dictionary. The column and row labels should be shown along the sides.
Sample Data: http://pastebin.com/N4A7gWuH
I would like the plot to look something like this substituting the values in the cells from a numpy/pandas ds.
Sample Plot: http://picpaste.com/sample-E0DZaoXk.png
Appreciate your input.
PS: I also posted this on matplotlib's mailing list.
Basically, you can just use imshow or matshow.
However, I'm not quite clear what you mean.
If you want a chessboard with every "white" cell colored by some other vector, you could do something like this:
import matplotlib.pyplot as plt
import numpy as np
# Make a 9x9 grid...
nrows, ncols = 9,9
image = np.zeros(nrows*ncols)
# Set every other cell to a random number (this would be your data)
image[::2] = np.random.random(nrows*ncols //2 + 1)
# Reshape things into a 9x9 grid.
image = image.reshape((nrows, ncols))
row_labels = range(nrows)
col_labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
plt.matshow(image)
plt.xticks(range(ncols), col_labels)
plt.yticks(range(nrows), row_labels)
plt.show()
Obviously, this only works for grids with an odd number of rows and columns. For datasets with an even number of rows and columns, you can iterate over each row.
E.g.:
for i, (image_row, data_row) in enumerate(zip(image, data)):
    image_row[i%2::2] = data_row
However, the number of "data" cells in each row is going to be different, which is where I get confused by your problem definition.
By definition, a checkerboard pattern has a different number of "white" cells in each row.
Your data presumably (?) has the same number of values in each row. You need to define what you want to do. You can either truncate the data, or add an extra column.
Edit: I just realized that that's true only for odd-length numbers of columns.
Regardless, I'm still confused by your question.
Do you have a "full" grid of data and want to set a "checkerboard" pattern of values in the data grid to a different color, or do you want to "intersperse" your data with a "checkerboard" pattern of values plotted as some constant color?
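If it is the first reading (a full grid of data with alternating cells tinted differently), a boolean checkerboard mask is probably all you need. A minimal sketch, assuming the 8 x 12 grid from the question; `checker_mask` is a made-up helper name, not something matplotlib or NumPy provides:

```python
import numpy as np

def checker_mask(nrows, ncols):
    """Return a boolean array that is True where (row + col) is odd."""
    rows, cols = np.indices((nrows, ncols))
    return (rows + cols) % 2 == 1

mask = checker_mask(8, 12)
# With an even number of columns, every row has the same number of marked cells.
per_row = mask.sum(axis=1)
```

With an odd number of columns (e.g. `checker_mask(9, 9)`) the count alternates between rows, which is the mismatch described above. Something like `plt.matshow(mask.astype(int))` should be enough to look at the pattern.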
Update
It sounds like you want something more like a spreadsheet? Matplotlib isn't ideal for this, but you can do it.
Ideally, you'd just use plt.table, but in this case, it's easier to use matplotlib.table.Table directly:
import matplotlib.pyplot as plt
import numpy as np
import pandas
from matplotlib.table import Table
def main():
    data = pandas.DataFrame(np.random.random((12,8)),
                            columns=['A','B','C','D','E','F','G','H'])
    checkerboard_table(data)
    plt.show()

def checkerboard_table(data, fmt='{:.2f}', bkg_colors=['yellow', 'white']):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0,0,1,1])

    nrows, ncols = data.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    # Add cells
    for (i,j), val in np.ndenumerate(data):
        # Index either the first or second item of bkg_colors based on
        # a checker board pattern
        idx = [j % 2, (j + 1) % 2][i % 2]
        color = bkg_colors[idx]

        tb.add_cell(i, j, width, height, text=fmt.format(val),
                    loc='center', facecolor=color)

    # Row Labels...
    for i, label in enumerate(data.index):
        tb.add_cell(i, -1, width, height, text=label, loc='right',
                    edgecolor='none', facecolor='none')
    # Column Labels...
    for j, label in enumerate(data.columns):
        tb.add_cell(-1, j, width, height/2, text=label, loc='center',
                    edgecolor='none', facecolor='none')
    ax.add_table(tb)
    return fig

if __name__ == '__main__':
    main()
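A side note on the `idx` line: as far as I can tell it is equivalent to `(i + j) % 2`, which may be easier to read. A quick check over every cell of a 12 x 8 table (`checker_index` is just a name I made up to wrap the original expression):

```python
def checker_index(i, j):
    """Index into a two-color list, exactly as written in the table code."""
    return [j % 2, (j + 1) % 2][i % 2]

# Compare against the shorter form for every cell of a 12 x 8 table.
matches = all(checker_index(i, j) == (i + j) % 2
              for i in range(12) for j in range(8))
```

So `color = bkg_colors[(i + j) % 2]` should give the same coloring.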
Related
There is an easy way to generate word frequency plots with NLTK:
my_plot = nltk.FreqDist(some_dynamically_changing_list[0]).plot(20)
some_dynamically_changing_list contains a list of texts that I want to see the word frequency for. The resulting my_plot is an AxesSubplot object. I want to be able to take this object and directly insert it to a dynamically sized subplotted grid. The closest I've gotten to the solution after scouring SO and Google is this. However, I have an issue because I don't want to use plt's plot function. So for example I'd have a dynamic subplot grid:
import matplotlib.pyplot as plt
tot = len(some_dynamically_changing_list)
cols = 5
rows = tot // cols
if tot % cols != 0:
    rows += 1
position = range(1, tot + 1)
fig = plt.figure()
for k in range(tot):
    ax = fig.add_subplot(rows, cols, position[k])
    my_plot = nltk.FreqDist(some_dynamically_changing_list[k]).plot(20)
    ax.plot(my_plot) #This is where I have my issue.
Of course this won't work, as ax.plot expects data, not a plot. But I want to plug my my_plot into that specific slot. How can I do that (if I can)?
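One possible way around this is to skip `FreqDist.plot` and draw the counts yourself on the axes you already created. The sketch below uses `collections.Counter` as a stand-in (as far as I know `nltk.FreqDist` subclasses it, so `most_common` should behave the same way). `plot_word_freq` and the sample `texts` are made up for illustration:

```python
from collections import Counter

import matplotlib.pyplot as plt

def plot_word_freq(ax, words, n=20):
    """Draw the n most common words as a line plot on the given axes."""
    common = Counter(words).most_common(n)
    labels = [word for word, _ in common]
    counts = [count for _, count in common]
    ax.plot(range(len(counts)), counts)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    return ax

# Made-up stand-in for some_dynamically_changing_list
texts = [["a", "b", "a"], ["c", "c", "c", "d"], ["e"]]
cols = 2
rows = -(-len(texts) // cols)  # ceiling division
fig = plt.figure()
for k, words in enumerate(texts, start=1):
    plot_word_freq(fig.add_subplot(rows, cols, k), words)
```

Call `plt.show()` afterwards to display the grid. Because each subplot receives its own `ax`, nothing depends on which axes pyplot considers current.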
My dataset is in the form of :
Data[0] = [headValue,x0,x1,..xN]
Data[1] = [headValue_ya,ya0,ya1,..yaN]
Data[2] = [headValue_yb,yb0,yb1,..ybN]
...
Data[n] = [headvalue_yz,yz0,yz1,..yzN]
I want to plot f(y*) = x, so I can visualize all Lineplots in the same figure with different colors, each color determined by the headervalue_y*.
I also want to add a colorbar whose color matching the lines and therefore the header values, so we can link visually which header value leads to which behaviour.
Here is what I am aiming for :(Plot from Lacroix B, Letort G, Pitayu L, et al. Microtubule Dynamics Scale with Cell Size to Set Spindle Length and Assembly Timing. Dev Cell. 2018;45(4):496–511.e6. doi:10.1016/j.devcel.2018.04.022)
I have trouble adding the colorbar. I tried to extract N colors from a colormap (N is my number of different headValues, i.e. the number of columns minus 1) and then give each line plot the corresponding color. Here is my code to clarify:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
Data = [['Time',0,0.33,..200],[0.269,4,4.005,...11],[0.362,4,3.999,...16.21],...[0.347,4,3.84,...15.8]]
headValues = [0.269,0.362,0.335,0.323,0.161,0.338,0.341,0.428,0.245,0.305,0.305,0.314,0.299,0.395,0.32,0.437,0.203,0.41,0.392,0.347]
# the differents headValues_y* of each column here in a list but also in Data
# with headValue[0] = Data[1][0], headValue[1] = Data[2][0] ...
cmap = mpl.cm.get_cmap('rainbow') # I choose my colormap
rgba = [] # the color container
for value in headValues:
    rgba.append(cmap(value)) # so rgba will contain a different color for each headValue
fig, (ax,ax1) = plt.subplots(2,1) # creating my figure and two axes to put the Lines and the colorbar
c = 0 # index for my colors
for i in range(1, len(Data)):
    ax.plot( Data[0][1:], Data[i][1:] , color = rgba[c])
    # Data[0][1:] is x, Data[i][1:] is y, and the color associated with Data[i][0]
    c += 1
fig.colorbar(mpl.cm.ScalarMappable(cmap= mpl.colors.ListedColormap(rgba)), cax=ax1, orientation='horizontal')
# here I create my scalarMappable for my lineplot and with the previously selected colors 'rgba'
plt.show()
The current result:
How can I add the colorbar on the side or at the bottom of the first axis?
How can I properly add a scale to this colorbar corresponding to the different headValues?
How can I make the colorbar scale and colors match the different lines on the plot, with the link one color = one headValue?
I have tried to work with scatterplot which are more convenient to use with scalarMappable but no solution allows me to do all these things at once.
Here is a possible approach. As the 'headValues' aren't sorted, nor equally spaced and one is even used twice, it is not fully clear what the most-desired result would be.
Some remarks:
The standard way of creating a colorbar in matplotlib doesn't need a separate subplot. Matplotlib will shrink the existing plot a bit and put the colorbar next to it (or below it for a horizontal bar).
Converting the 'headValues' to a numpy array allows for compact code, e.g. writing rgba = cmap(headValues) directly calculates the complete array.
Calling cmap on unchanged values will map 0 to the lowest color and 1 to the highest color, so for values only between 0.16 and 0.44 they all will be mapped to quite similar colors. One approach is to create a norm to map 0.16 to the lowest color and 0.44 to the highest. In code: norm = plt.Normalize(headValues.min(), headValues.max()) and then calculate rgba = cmap(norm(headValues)).
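To make the effect of the norm concrete, here is a small sketch using only a handful of the values from the question (the variable names are mine): the smallest value is mapped to 0, the largest to 1, and everything else is spread linearly in between.

```python
import numpy as np
from matplotlib.colors import Normalize

# A few of the head values from the question
head_values = np.array([0.269, 0.362, 0.161, 0.437])
norm = Normalize(vmin=head_values.min(), vmax=head_values.max())
scaled = norm(head_values)  # values rescaled into the range [0, 1]
```

Passing `scaled` (or `norm(value)` for a single value) to the colormap then uses the full color range instead of a narrow slice of it.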
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
headValues = np.array([0.269, 0.362, 0.335, 0.323, 0.161, 0.338, 0.341, 0.428, 0.245, 0.305, 0.305, 0.314, 0.299, 0.395, 0.32, 0.437, 0.203, 0.41, 0.392, 0.347])
x = np.linspace(0, 200, 500)
# create Data similar to the data in the question
Data = [['Time'] + list(x)] + [[val] + list(np.sqrt(4 * x) * val + 4) for val in headValues]
headValues = np.array([d[0] for d in Data[1:]])
order = np.argsort(headValues)
inverse_order = np.argsort(order)
cmap = plt.get_cmap('rainbow')  # mpl.cm.get_cmap was removed in newer matplotlib versions
rgba = cmap(np.linspace(0, 1, len(headValues))) # evenly spaced colors
fig, ax = plt.subplots(1, 1)
for i in range(1, len(Data)):
    ax.plot(Data[0][1:], Data[i][1:], color=rgba[inverse_order[i-1]])
    # Data[0][1:] is x, Data[i][1:] is y, and the color associated with Data[i][0]
# ax=ax tells matplotlib which axes to take space from; recent versions require it here
cbar = fig.colorbar(mpl.cm.ScalarMappable(cmap=mpl.colors.ListedColormap(rgba)), ax=ax, orientation='vertical',
                    ticks=np.linspace(0, 1, len(rgba) * 2 + 1)[1::2])
cbar.set_ticklabels(headValues[order])
plt.show()
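The `order` / `inverse_order` pair in the code above may look cryptic. A tiny sketch with three made-up values shows what each one holds: `order` lists the indices that would sort the values, and applying `argsort` again gives the rank of each value, which is what selects the color.

```python
import numpy as np

values = np.array([0.30, 0.10, 0.20])  # made-up head values
order = np.argsort(values)             # indices that sort the values
inverse_order = np.argsort(order)      # rank of each value in sorted order
```

Here `order` is `[1, 2, 0]` and `inverse_order` is `[2, 0, 1]`, so the largest value (0.30) gets the last color and the smallest (0.10) gets the first.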
Alternatively, the colors can be assigned using their position in the colormap, but without creating a separate ListedColormap of color segments:
cmap = plt.get_cmap('rainbow')
norm = plt.Normalize(headValues.min(), headValues.max())
fig, ax = plt.subplots(1, 1)
for i in range(1, len(Data)):
    ax.plot(Data[0][1:], Data[i][1:], color=cmap(norm(Data[i][0])))
# ax=ax tells matplotlib which axes to take space from; recent versions require it here
cbar = fig.colorbar(mpl.cm.ScalarMappable(cmap=cmap, norm=norm), ax=ax)
To get ticks for each of the 'headValues', these ticks can be set explicitly. As putting a label for each tick will result in overlapping text, labels that are too close to other labels can be replaced by an empty string:
headValues.sort()
cbar2 = fig.colorbar(mpl.cm.ScalarMappable(cmap=cmap, norm=norm), ax=ax, ticks=headValues)
# blank out a label when the next value is closer than 0.007 (nxt avoids shadowing the builtin next)
cbar2.set_ticklabels([val if val < nxt - 0.007 else '' for val, nxt in zip(headValues[:-1], headValues[1:])]
                     + [headValues[-1]])
At the left the result of the first approach (colors in segments), at the right the alternative colorbars (color depending on value):
Currently, I'm working on an introductory paper on data manipulation and such; however... the CSV I'm working on has some things I wish to do a scatter graph on!
I want a scatter graph to show me the volume sold on certain items as well as their average price, differentiating all data according to their region (Through colours I assume).
So what I want is to know if I can add the region column as a quantitative value
or if there's a way to make this possible...
It's my first time using Python and I'm confused way too often
I'm not sure if this is what you mean, but here is some working code, assuming you have data in the format of [(country, volume, price), ...]. If not, you can change the inputs to the scatter method as needed.
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
n_countries = 50
# get the data into "countries", for example
countries = ...
# in this example: countries is [('BS', 21, 25), ('WZ', 98, 25), ...]
df = pd.DataFrame(countries)
# arbitrary method to get a color
def get_color(i, max_i):
    # plt.get_cmap also works on matplotlib versions where matplotlib.cm.get_cmap was removed
    cmap = plt.get_cmap('Spectral')
    return cmap(i/max_i)
# get the figure and axis - make a larger figure to fit more points
# add labels for metric names
def get_fig_ax():
    fig = plt.figure(figsize=(14,14))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel('volume')
    ax.set_ylabel('price')
    return fig, ax
# switch around the assignments depending on your data
def get_x_y_labels():
    x = df[1]
    y = df[2]
    labels = df[0]
    return x, y, labels
offset = 1 # offset just so annotations aren't on top of points
x, y, labels = get_x_y_labels()
fig, ax = get_fig_ax()
# add a point and annotation for each of the labels/regions
for i, region in enumerate(labels):
    ax.annotate(region, (x[i] + offset, y[i] + offset))
    # note that you must use "label" for "legend" to work
    ax.scatter(x[i], y[i], color=get_color(i, len(x)), label=region)
# Add the legend just outside of the plot.
# The .1, 0 at the end will put it outside
ax.legend(loc='upper right', bbox_to_anchor=(1, 1, .1, 0))
plt.show()
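On the part of the question about treating the region column as a quantitative value: one option is to map each region name to an integer code and hand those codes to `scatter` as colors. A minimal sketch, assuming hypothetical column names (`region`, `volume`, `avg_price`) and made-up rows, since the actual CSV isn't shown:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows; the column names are assumptions, not taken from the asker's CSV.
df = pd.DataFrame({
    "region": ["West", "East", "West", "South"],
    "volume": [120, 85, 140, 60],
    "avg_price": [1.10, 1.35, 1.05, 1.50],
})
# factorize assigns codes in order of first appearance: West -> 0, East -> 1, South -> 2
codes, regions = pd.factorize(df["region"])

fig, ax = plt.subplots()
points = ax.scatter(df["volume"], df["avg_price"], c=codes, cmap="viridis")
ax.set_xlabel("volume")
ax.set_ylabel("average price")
```

The codes only serve to pick colors; they carry no numeric meaning, so they shouldn't be used as an axis or in calculations.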
I have the following dataframe and would like to differentiate the minor decimal differences in each "step" with a different color scheme in a heatmap.
Sample data:
Sample Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8
A 64.847 54.821 20.897 39.733 23.257 74.942 75.945
B 64.885 54.767 20.828 39.613 23.093 74.963 75.928
C 65.036 54.772 20.939 39.835 23.283 74.944 75.871
D 64.869 54.740 21.039 39.889 23.322 74.925 75.894
E 64.911 54.730 20.858 39.608 23.101 74.956 75.930
F 64.838 54.749 20.707 39.394 22.984 74.929 75.941
G 64.887 54.781 20.948 39.748 23.238 74.957 75.909
H 64.903 54.720 20.783 39.540 23.028 74.898 75.911
I 64.875 54.761 20.911 39.695 23.082 74.897 75.866
J 64.839 54.717 20.692 39.377 22.853 74.849 75.939
K 64.857 54.736 20.934 39.699 23.130 74.880 75.903
L 64.754 54.746 20.777 39.536 22.991 74.877 75.902
M 64.798 54.811 20.963 39.824 23.187 74.886 75.895
An example of what I am looking for:
My first approach would be based on a figure with multiple subplots. The number of plots would equal the number of columns in your dataframe; the gap between the plots can be shrunk down to zero:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, gridspec_kw={'wspace': 0})
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c, cbar=False)
    if i>0:
        a.yaxis.set_ticks([])
Result:
Not sure if this will lead to a helpful or even self describing visualization of data, but that's your choice - perhaps this helps to start...
Supplemental:
Regarding adding the colorbars: of course you can. But - besides not knowing the background of your data and the purpose of the visualization - I'd like to add some thoughts on all that:
First: adding all those colorbars as a separate bunch of bars on one side of or below the heatmap is probably possible, but I find the data already quite hard to read, plus you already have all those annotations. It would mess everything up, I think.
Additionally: in the meantime @ImportanceOfBeingErnest provided such a beautiful solution on that topic that adding it here would not be too meaningful, in my opinion.
Second: if you really want to stick to the heatmap thing, perhaps splitting up and giving every column its colorbar would suit better:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, figsize=(10, 3))
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c)
    if i>0:
        a.yaxis.set_ticks([])
f.tight_layout()
However, all that said - I dare to doubt that this is the best visualization for your data. Of course, I don't know what you want to say, see or find with these plots, but that's the point: if the visualization type would fit to the needs, I guess I'd know (or at least could imagine).
Just for example:
A simple df.plot() results in
and I feel that this tells more about different characteristics of your columns within some tenths of a second than the heatmap.
Or are you explicitly after the differences from each column's mean?
(df - df.mean()).plot()
... or the distribution of each column around them?
(df - df.mean()).boxplot()
What I want to say: data visualization becomes powerful when a plot begins to tell something about the underlying data before you begin (or have) to explain anything...
I suppose the problem can be divided into several parts.
Getting several heatmaps with different colormaps into the same picture. This can be done by masking the complete array column-wise, plotting each masked array separately via imshow, and applying a different colormap to each. To visualize the concept:
Obtaining a variable number of distinct colormaps. Matplotlib provides a large number of colormaps; however, they are in general very different in luminosity and saturation. Here it seems desirable to have colormaps of differing hue, but otherwise the same saturation and luminosity.
An option is to create the colormaps on the fly, choosing n different (and equally spaced) hues, and create a colormap using the same saturation and luminosity.
Obtaining a distinct colorbar for each column. Since the values within columns might be on totally different scales, a colorbar for each column would be needed to know the values shown, e.g. in the first column the brightest color may correspond to a value of 1, while in the second column it may correspond to a value of 100. Several colorbars can be created inside the axes of a GridSpec which is placed next to the actual heatmap axes. The number of columns and rows of that gridspec would depend on the number of columns in the dataframe.
In total this may then look as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
def get_hsvcmap(i, N, rot=0.):
    nsc = 24
    chsv = mcolors.rgb_to_hsv(plt.cm.hsv(((np.arange(N)/N)+rot) % 1.)[i,:3])
    rhsv = mcolors.rgb_to_hsv(plt.cm.Reds(np.linspace(.2,1,nsc))[:,:3])
    arhsv = np.tile(chsv,nsc).reshape(nsc,3)
    arhsv[:,1:] = rhsv[:,1:]
    rgb = mcolors.hsv_to_rgb(arhsv)
    return mcolors.LinearSegmentedColormap.from_list("",rgb)

def columnwise_heatmap(array, ax=None, **kw):
    ax = ax or plt.gca()
    premask = np.tile(np.arange(array.shape[1]), array.shape[0]).reshape(array.shape)
    images = []
    for i in range(array.shape[1]):
        col = np.ma.array(array, mask = premask != i)
        im = ax.imshow(col, cmap=get_hsvcmap(i, array.shape[1], rot=0.5), **kw)
        images.append(im)
    return images
### Create some dataset
ind = list("ABCDEFGHIJKLM")
m = len(ind)
n = 8
df = pd.DataFrame(np.random.randn(m,n) + np.random.randint(20,70,n),
                  index=ind, columns=[f"Step {i}" for i in range(2,2+n)])
### Plot data
fig, ax = plt.subplots(figsize=(8,4.5))
ims = columnwise_heatmap(df.values, ax=ax, aspect="auto")
ax.set(xticks=np.arange(len(df.columns)), yticks=np.arange(len(df)),
       xticklabels=df.columns, yticklabels=df.index)
ax.tick_params(bottom=False, top=False,
               labelbottom=False, labeltop=True, left=False)
### Optionally add colorbars.
fig.subplots_adjust(left=0.06, right=0.65)
rows = 3
cols = len(df.columns) // rows + int(len(df.columns)%rows > 0)
gs = GridSpec(rows, cols)
gs.update(left=0.7, right=0.95, wspace=1, hspace=0.3)
for i, im in enumerate(ims):
    cax = fig.add_subplot(gs[i//cols, i % cols])
    fig.colorbar(im, cax = cax)
    cax.set_title(df.columns[i], fontsize=10)
plt.show()
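If the masking step in `columnwise_heatmap` is hard to picture, this small sketch applies the same `premask` construction to a 2 x 3 array (the array is made up) and keeps only the middle column visible:

```python
import numpy as np

array = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
# Column index of every cell, same shape as the data.
premask = np.tile(np.arange(array.shape[1]), array.shape[0]).reshape(array.shape)
# Keep only column 1 visible; every other cell is masked out.
col1 = np.ma.array(array, mask=premask != 1)
```

`imshow` leaves masked cells undrawn, so drawing one such masked array per column, each with its own colormap, builds up the full picture.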
I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using Matplotlib and Seaborn combination and with the help of a kind StackOverflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except, I don't know how to put the Header information in the plots. Here is my Python code that creates the Python Charts
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
    """ Main routine for the sampling histogram program """
    sns.set_style('whitegrid')
    markers_list = ["s", "o", "*", "^", "+"]
    # create the data dataframe as df_orig
    df_orig = pd.read_csv('lab_samples.csv')
    df_orig = df_orig.loc[df_orig.hra != -9999]
    hra_list_unique = df_orig.hra.unique().tolist()
    # create and subset df_hra_colors to match the actual hra colors in df_orig
    df_hra_colors = pd.read_csv('hra_lookup.csv')
    df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
    df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
    df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
    # hard coding the current_component to pc1 here, we will extend it by looping
    # through the list of components
    current_component = 'pc1'
    num_tests = 5
    df_columns = df_orig.columns.tolist()
    start_index = 5
    for test in range(num_tests):
        current_tests_list = df_columns[start_index:(start_index + num_tests)]
        # now create the sns distplots for each HRA color and overlay the tests
        i = 1
        for _, row in df_hra_colors.iterrows():
            plt.subplot(3, 3, i)
            select_columns = ['hra', current_component] + current_tests_list
            df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
            y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
            axs = sns.distplot(y_data, color=row['hex'],
                               hist_kws={"ec":"k"},
                               kde_kws={"color": "k", "lw": 0.5})
            data_x, data_y = axs.lines[0].get_data()
            axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
                     verticalalignment="top", transform=axs.transAxes)
            for current_test_index, current_test in enumerate(current_tests_list):
                # this_x defines the series of current_component(pc1,pc2,rhob) for this test
                # indicated by 1, corresponding R program calls this test_vector
                x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
                for this_x in x_series:
                    this_y = np.interp(this_x, data_x, data_y)
                    axs.plot([this_x], [this_y - current_test_index * 0.05],
                             markers_list[current_test_index], markersize = 3, color='black')
            axs.xaxis.label.set_visible(False)
            axs.xaxis.set_tick_params(labelsize=4)
            axs.yaxis.set_tick_params(labelsize=4)
            i = i + 1
        start_index = start_index + num_tests
    # plt.show()
    pp = PdfPages('plots.pdf')
    pp.savefig()
    pp.close()

def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

if __name__ == "__main__":
    main()
The pandas code works fine and does what it is supposed to. It is my lack of knowledge and experience with 'PdfPages' in Matplotlib that is the bottleneck. How can I show, in Python/Matplotlib/Seaborn, the header information that I can show in the corresponding R visualization? By header information, I mean what the R visualization has at the top before the histograms, i.e., 'pc1', MRP, XRD, ....
I can get their values easily from my program, e.g., current_component is 'pc1', etc. But I don't know how to format the plots with the Header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to the plot call, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center",bbox_to_anchor=(0.5,0.8),bbox_transform=plt.gcf().transFigure)
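Putting the title and the figure-level legend together, a minimal sketch could look like the following. It uses an empty 3 x 3 grid instead of your distplots; only the names MRP and XRD come from the question, and the marker assignment and spacing values are my guesses:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3)
fig.suptitle("pc1")

# MRP and XRD are named in the question; which marker belongs to which test is assumed.
markers = {"MRP": "s", "XRD": "o"}
for name, marker in markers.items():
    axes[0, 0].plot([0], [0], marker, color="black", markersize=3, label=name)

# Anchor the legend in figure coordinates so it sits above all subplots.
legend = axes[0, 0].legend(loc="lower center", bbox_to_anchor=(0.5, 0.8),
                           bbox_transform=fig.transFigure, ncol=len(markers))
fig.subplots_adjust(top=0.78)  # leave room for the title and the legend
```

In your loop you would label only the first marker drawn for each test, otherwise every plotted point adds another legend entry.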