I'm trying to display the topic extraction results of an LDA text analysis across several data sets in the form of a matplotlib subplot.
Here's where I'm at:
I think my issue is my unfamiliarity with matplotlib. I have done all my number crunching ahead of time so that I can focus on how to plot the data:
top_words_master = []
top_weights_master = []
for i in range(len(tf_list)):
    tf = tf_vectorizer.fit_transform(tf_list[i])
    lda.fit(tf)
    n_top_words = 20
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    top_features_ind = lda.components_[0].argsort()[: -n_top_words - 1 : -1]
    top_features = [tf_feature_names[i] for i in top_features_ind]
    weights = lda.components_[0][top_features_ind]
    top_words_master.append(top_features)
    top_weights_master.append(weights)
This gives me my words and my weights (the x axis values) to make my sub-plot matrix of row/bar charts.
My attempt to construct this via matplotlib:
fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
plt.subplots_adjust(hspace=0.5)
fig.suptitle("Topics in LDA Model", fontsize=18, y=0.95)
axes = axes.flatten()
for i in range(len(tf_list)):
    ax = axes[i]
    ax.barh(top_words_master[i], top_weights_master[i], height=0.7)
    ax.set_title(topic_map[f"Topic {i +1}"], fontdict={"fontsize": 30})
    ax.invert_yaxis()
    ax.tick_params(axis="both", which="major", labelsize=20)
    for j in "top right left".split():
        ax.spines[j].set_visible(False)
fig.suptitle("Topics in LDA Model", fontsize=40)
plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
plt.show()
However, it only showed one, the first one. For the remaining 6 data sets it just printed:
<Figure size 432x288 with 0 Axes> <Figure size 432x288 with 0 Axes> <Figure size 432x288 with 0 Axes> <Figure size 432x288 with 0 Axes> <Figure size 432x288 with 0 Axes>
Question
I've been at this for days. I feel I'm close, but this kind of result is really puzzling me, anyone have a solution or able to point me in the right direction?
As far as I understood from your question, your problem is to get the right indices for your subplots.
In your case, you have an array range(len(tf_list)) to index your data, some data (e.g. top_words_master[i]) to plot, and a figure with 10 subplots (rows=2,cols=5). For example, if you want to plot the 7th item (i=6) of your data, the indices of ax would be axes[1,1].
In order to get the correct indices for the subplot axes, you can use numpy.unravel_index. And, of course, you should not flatten your axes.
import matplotlib.pyplot as plt
import numpy as np
# dummy function
my_func = lambda x: np.random.random(x)
x_max = 100
# fig properties
rows = 2
cols = 5
fig, axes = plt.subplots(rows,cols,figsize=(30, 15), sharex=True)
for i in range(rows*cols):
    ax_i = np.unravel_index(i,(rows,cols))
    axes[ax_i[0],ax_i[1]].barh(np.arange(x_max),my_func(x_max), height=0.7)
plt.show()
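To make the index mapping concrete, here is a minimal sketch that needs only NumPy. The names grid and positions are illustrative, and a small array of labels stands in for the 2-D array of axes that plt.subplots(rows, cols) returns. It checks which (row, col) cell each flat data index lands in, and that the mapping follows row-major order.

```python
import numpy as np

rows, cols = 2, 5

# Stand-in for the 2-D axes array: each cell holds a label instead of an Axes.
grid = np.array([[f"ax[{r},{c}]" for c in range(cols)] for r in range(rows)])

# Map every flat data index to its (row, col) position in the grid.
positions = [tuple(int(k) for k in np.unravel_index(i, (rows, cols)))
             for i in range(rows * cols)]

for i, (r, c) in enumerate(positions):
    # Row-major order: grid[r, c] is the same cell as grid.flat[i].
    assert grid[r, c] == grid.flat[i]

print(positions[6])  # the 7th data set (i=6) lands in cell (1, 1)
```

The same position can therefore be used directly as axes[r, c] on the real axes array.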
You should create the figure first:
def top_word_comparison(axes, model, feature_names, n_top_words):
    # Draw one horizontal bar chart per topic on the given row of axes.
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]
        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(topic_map[f"Topic {topic_idx +1}"], fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
tf_list = [cm_array, xb_array]
# Create the figure once, before looping over the data sets.
fig, axes = plt.subplots(len(tf_list), 5, figsize=(30, 15), sharex=True)
fig.suptitle("Topics in LDA model", fontsize=40)
for i in range(len(tf_list)):
    tf = tf_vectorizer.fit_transform(tf_list[i])
    n_components = 1
    lda.fit(tf)
    n_top_words = 20
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    top_word_comparison(axes[i], lda, tf_feature_names, n_top_words)
plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
plt.show()
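For the case of one bar chart per data set, a minimal sketch could look like the following. It uses random stand-in words and weights in place of fitted LDA output; the names top_words, top_weights and n_sets, and the count of 7 data sets, are assumptions for illustration. Each data set is paired with one axes via zip, and the grid cells that are left over are hidden.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n_sets, n_top_words = 7, 20  # assumed: 7 data sets, 20 words each

# Stand-in data in place of the fitted LDA output.
top_words = [[f"word{s}_{w}" for w in range(n_top_words)] for s in range(n_sets)]
top_weights = [np.sort(rng.random(n_top_words))[::-1] for _ in range(n_sets)]

# Create the whole grid once, before the loop.
fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
fig.suptitle("Topics in LDA Model", fontsize=40)

# zip stops at the shortest sequence, so only n_sets axes are drawn on.
for idx, (ax, words, weights) in enumerate(zip(axes.flat, top_words, top_weights)):
    ax.barh(words, weights, height=0.7)
    ax.set_title(f"Data set {idx + 1}", fontsize=30)
    ax.invert_yaxis()  # largest weight at the top

# Hide the unused cells of the 2x5 grid.
for ax in axes.flat[n_sets:]:
    ax.set_visible(False)

fig.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
# In an interactive session, call plt.show() once here, after the loop.
```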
I'm using a for loop to plot 4 confusion matrices side by side:
palette = ['Blues', 'Reds', 'Greens', 'Oranges']
fig, axs = plt.subplots(1,4)
for ax in axs:
    for i in range(0,4):
        axs[i]=contingency_table(data['outputs'][i], predictions[i].round(), color_map = palette[i])
Running this plots something really weird but doesn't return any error:
The function I'm calling is defined:
def contingency_table(y_true, y_pred, color_map):
    # create the confusion matrix
    cnf_matrix = confusion_matrix(y_true, y_pred)
    # flatten() order is TN, FP, FN, TP; a false positive is a Type I error
    group_names = ['True Negative','Type I Error','Type II Error','True Positive']
    group_counts = ['{0:0.0f}'.format(value) for value in
                    cnf_matrix.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in
                         cnf_matrix.flatten()/np.sum(cnf_matrix)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    return sns.heatmap(cnf_matrix, annot=labels, fmt='', cmap=color_map)
And I intended on using it in the following way:
plt.figure(figsize = (8,6));
contingency_table(y1_val, pred, color_map = 'Blues');
Perhaps I need to change the function or my for loop but I don't know how. Any help would be great.
Your code is missing a lot of information, making it hard to point out corrections.
One misunderstanding is that axs[i]=contingency_table(...) would place a plot onto axs[i]. Matplotlib doesn't work like that. The ax needs to be passed as a parameter to contingency_table(....., ax=ax), which in turn can pass it to sns.heatmap(..., ax=ax).
Note that calling sns.heatmap without providing an ax will use matplotlib's "current ax". In this case, the "current ax" is the last subplot that was created (axs[3]). So the post's code draws the 4 heatmaps all on axs[3], creating 4 colorbars.
The outline of the functions would thus look like:
def contingency_table(y_true, y_pred, color_map, ax):
    ...
    sns.heatmap(cnf_matrix, annot=labels, fmt='', cmap=color_map, ax=ax)
    ...
color_maps = ['Blues', 'Reds', 'Greens', 'Oranges']
fig, axs = plt.subplots(1, 4)
for i, (ax, color_map) in enumerate(zip(axs, color_maps)):
    contingency_table(data['outputs'][i], predictions[i].round(), color_map=color_map, ax=ax)
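The "current ax" behaviour is easy to check. The sketch below is only an illustration under stated assumptions: it uses random 2x2 matrices, and a hypothetical helper, draw_matrix, calls ax.imshow in place of sns.heatmap so that it depends on nothing but matplotlib and NumPy. The pattern of accepting an ax and drawing on it is the same.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

def draw_matrix(matrix, color_map, ax):
    """Hypothetical stand-in for contingency_table: draw on the given ax."""
    ax.imshow(matrix, cmap=color_map)
    # Annotate each cell with its value, similar to annot= in sns.heatmap.
    for (row, col), value in np.ndenumerate(matrix):
        ax.text(col, row, str(value), ha="center", va="center")
    return ax

rng = np.random.default_rng(0)
matrices = [rng.integers(0, 100, size=(2, 2)) for _ in range(4)]
color_maps = ["Blues", "Reds", "Greens", "Oranges"]

fig, axs = plt.subplots(1, 4, figsize=(16, 4))

# Right after plt.subplots, the "current ax" is the last subplot created.
current_is_last = plt.gca() is axs[-1]

for ax, matrix, color_map in zip(axs, matrices, color_maps):
    draw_matrix(matrix, color_map, ax=ax)

# Each subplot now holds exactly one image.
images_per_ax = [len(ax.images) for ax in axs]
print(current_is_last, images_per_ax)
```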
I have a huge data set with 158 columns and 3.1 million rows. I'm trying to plot univariate distributions for that data set. The code is given below.
dtf = pd.read_csv('hackathon_train_data1.csv')
dtf.head()
dtf.columns
Output was:
Index(['visit_id', 'cod_order_nbr', 'cod_orig_ord_nbr', 'src_bu_id',
'int_ref_nbr', 'cod_orig_bu_id', 'cod_src_bu_id', 'onln_flg',
'sohf_ord_dt', 'cod_init',
...
'csat_guid_v42', 'visit_num', 'chat_drawer_rightrail_open',
'chat_unavailable', 'chat_portal', 'ishmximpressions', 'pagination_c40',
'chat_intent_flag', 'coupon_code_stp_v96', 'isbreadcrumbhit_flg'],
dtype='object', length=157)
Then I assigned one of the column names to y and plotted the graph. Column cod_flg has only 2 distinct values, 0 and 1.
y = "cod_flg"
ax = dtf[y].value_counts().sort_values().plot(kind="barh")
Output was:
Then I tried to refine it as,
totals= []
for i in ax.patches:
    totals.append(i.get_width())
total = sum(totals)
for i in ax.patches:
    ax.text(i.get_width()+.3, i.get_y()+.20,
            str(round((i.get_width()/total)*100, 2))+'%',
            fontsize=10, color='black')
ax.grid(axis="x")
plt.suptitle(y, fontsize=20)
plt.show()
It threw me this error:
Figure size 432x288 with 0 Axes
Do I need to modify this line? ax.text(i.get_width()+.3, i.get_y()+.20, str(round((i.get_width()/total)*100, 2))+'%', fontsize=10, color='black')
Try it without plt.show() if you are using Google Colab.
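The empty "Figure size 432x288 with 0 Axes" typically appears in notebooks when the plot and the annotations run in separate cells: the first figure has already been rendered and closed, so plt.suptitle opens a new, empty figure. A minimal sketch that keeps everything on one explicit figure, in a single cell, might look like this. The counts are made-up stand-ins for dtf[y].value_counts(), and the label placement is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt

y = "cod_flg"
# Made-up counts standing in for dtf[y].value_counts().sort_values()
labels, counts = ["1", "0"], [400_000, 2_700_000]

fig, ax = plt.subplots()
bars = ax.barh(labels, counts)

total = sum(bar.get_width() for bar in bars)
for bar in bars:
    share = 100 * bar.get_width() / total
    # Place the label just past the end of the bar, vertically centred.
    ax.text(bar.get_width(), bar.get_y() + bar.get_height() / 2,
            f" {share:.2f}%", va="center", fontsize=10, color="black")

ax.grid(axis="x")
fig.suptitle(y, fontsize=20)  # fig.suptitle targets this figure, not a new one
```

Using fig.suptitle and ax.text on the objects returned by plt.subplots avoids relying on pyplot's notion of the current figure.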
Motivation:
I'm trying to visualize a dataset of many n-dimensional vectors (let's say I have 10k vectors with n=300 dimensions). What I'd like to do is calculate a histogram for each of the n dimensions and plot it as a single line in a bins*n heatmap.
So far i've got this:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
# sample data:
vectors = np.random.randn(10000, 300) + np.random.randn(300)
def ndhist(vectors, bins=500):
    limits = (vectors.min(), vectors.max())
    hists = []
    dims = vectors.shape[1]
    for dim in range(dims):
        h, bins = np.histogram(vectors[:, dim], bins=bins, range=limits)
        hists.append(h)
    hists = np.array(hists)
    fig = plt.figure(figsize=(16, 9))
    sns.heatmap(hists)
    axes = fig.gca()
    axes.set(ylabel='dimensions', xlabel='values')
    print(dims)
    print(limits)
ndhist(vectors)
This generates the following output:
300
(-6.538069472429366, 6.52159540162285)
Problem / Question:
How can I change the axes ticks?
For the y-axis, I'd like to simply change this back to matplotlib's default, so it picks nice ticks like 0, 50, 100, ..., 250 (bonus points for 299 or 300).
For the x-axis, I'd like to convert the shown bin indices into the bin (left) boundaries. Then, as above, I'd like to change this back to matplotlib's default selection of some "nice" ticks like -5, -2.5, 0, 2.5, 5 (bonus points for also including the actual limits -6.538, 6.522).
Own solution attempts:
I've tried many things like the following already:
def ndhist_axlabels(vectors, bins=500):
    limits = (vectors.min(), vectors.max())
    hists = []
    dims = vectors.shape[1]
    for dim in range(dims):
        h, bins = np.histogram(vectors[:, dim], bins=bins, range=limits)
        hists.append(h)
    hists = np.array(hists)
    fig = plt.figure(figsize=(16, 9))
    sns.heatmap(hists, yticklabels=False, xticklabels=False)
    axes = fig.gca()
    axes.set(ylabel='dimensions', xlabel='values')
    #plt.xticks(np.linspace(*limits, len(bins)), bins)
    plt.xticks(range(len(bins)), bins)
    axes.xaxis.set_major_locator(matplotlib.ticker.AutoLocator())
    plt.yticks(range(dims+1), range(dims+1))
    axes.yaxis.set_major_locator(matplotlib.ticker.AutoLocator())
    print(dims)
    print(limits)
ndhist_axlabels(vectors)
As you can see however, the axes labels are pretty wrong. My guess is that the extent or limits are somewhere stored in the original axis, but lost when switching back to the AutoLocator. Would greatly appreciate a nudge in the right direction.
Maybe you're overthinking this. To plot image data, one can use imshow and get the ticking and formatting for free.
import numpy as np
from matplotlib import pyplot as plt
# sample data:
vectors = np.random.randn(10000, 300) + np.random.randn(300)
def ndhist(vectors, bins=500):
    limits = (vectors.min(), vectors.max())
    hists = []
    dims = vectors.shape[1]
    for dim in range(dims):
        h, _ = np.histogram(vectors[:, dim], bins=bins, range=limits)
        hists.append(h)
    hists = np.array(hists)
    fig, ax = plt.subplots(figsize=(16, 9))
    extent = [limits[0], limits[-1], hists.shape[0]-0.5, -0.5]
    im = ax.imshow(hists, extent=extent, aspect="auto")
    fig.colorbar(im)
    ax.set(ylabel='dimensions', xlabel='values')
ndhist(vectors)
plt.show()
If you read the docs, you will notice that the xticklabels/yticklabels arguments are overloaded: if you provide an integer instead of a list of labels, seaborn interprets the argument as an "every n" setting (xtickevery/ytickevery, so to speak) and draws only every n-th label. So in your case, seaborn.heatmap(hists, yticklabels=50) fixes your y-axis problem.
Regarding your xtick labels, I would simply provide them explicitly:
xtickevery = 50
xticklabels = ['{:.1f}'.format(b) if ii%xtickevery == 0 else '' for ii, b in enumerate(bins)]
sns.heatmap(hists, yticklabels=50, xticklabels=xticklabels)
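One detail to watch when building such labels: np.histogram returns bins + 1 edges for bins columns, so the left boundaries are all edges except the last one. A small sketch (with a smaller random sample than in the question, and illustrative names such as n_bins and left_edges) shows the bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 30))  # smaller stand-in sample
n_bins = 500
limits = (vectors.min(), vectors.max())

# With a fixed range, every dimension shares the same edges.
counts, edges = np.histogram(vectors[:, 0], bins=n_bins, range=limits)

left_edges = edges[:-1]  # one left boundary per heatmap column
xtickevery = 50
xticklabels = [f"{edge:.1f}" if i % xtickevery == 0 else ""
               for i, edge in enumerate(left_edges)]

print(len(counts), len(edges), len(xticklabels))  # 500 501 500
```

A label list built from left_edges has exactly one entry per heatmap column.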
Finally came up with a version that works for me for now and uses AutoLocator based on some simple linear mapping...
import time
def ndhist(vectors, bins=1000, title=None):
    t = time.time()
    limits = (vectors.min(), vectors.max())
    hists = []
    dims = vectors.shape[1]
    for dim in range(dims):
        h, bs = np.histogram(vectors[:, dim], bins=bins, range=limits)
        hists.append(h)
    hists = np.array(hists)
    fig = plt.figure(figsize=(16, 12))
    sns.heatmap(
        hists,
        yticklabels=50,
        xticklabels=False
    )
    axes = fig.gca()
    axes.set(
        ylabel=f'dimensions ({dims} total)',
        xlabel=f'values (min: {limits[0]:.4g}, max: {limits[1]:.4g}, {bins} bins)',
        title=title,
    )
    def val_to_idx(val):
        # calc (linearly interpolated) index loc for given val
        return bins*(val - limits[0])/(limits[1] - limits[0])
    xlabels = [round(l, 3) for l in limits] + [
        v for v in matplotlib.ticker.AutoLocator().tick_values(*limits)[1:-1]
    ]
    # drop auto-gen labels that might be too close to limits
    d = (xlabels[4] - xlabels[3])/3
    if (xlabels[1] - xlabels[-1]) < d:
        del xlabels[-1]
    if (xlabels[2] - xlabels[0]) < d:
        del xlabels[2]
    xticks = [val_to_idx(val) for val in xlabels]
    axes.set_xticks(xticks)
    axes.set_xticklabels([f'{l:.4g}' for l in xlabels])
    plt.show()
    print(f'histogram generated in {time.time() - t:.2f}s')
ndhist(np.random.randn(100000, 300), bins=1000, title='randn')
Thanks to Paul for his answer giving me the idea.
If there's an easier or more elegant solution, I'd still be interested though.
I was wondering how I can plot images side by side using matplotlib, for example something like this:
The closest I got is this:
This was produced by using this code:
f, axarr = plt.subplots(2,2)
axarr[0,0] = plt.imshow(image_datas[0])
axarr[0,1] = plt.imshow(image_datas[1])
axarr[1,0] = plt.imshow(image_datas[2])
axarr[1,1] = plt.imshow(image_datas[3])
But I can't seem to get the other images to show. I'm thinking that there must be a better way to do this, as I would imagine trying to manage the indexes would be a pain. I have looked through the documentation, although I have a feeling I may be looking at the wrong one. Would anyone be able to provide me with an example or point me in the right direction?
EDIT:
See the answer from @duhaime if you want a function to automatically determine the grid size.
The problem you face is that you try to assign the return of imshow (which is a matplotlib.image.AxesImage) to an existing axes object.
The correct way of plotting image data to the different axes in axarr would be
f, axarr = plt.subplots(2,2)
axarr[0,0].imshow(image_datas[0])
axarr[0,1].imshow(image_datas[1])
axarr[1,0].imshow(image_datas[2])
axarr[1,1].imshow(image_datas[3])
The concept is the same for all subplots, and in most cases the axes instance provides the same methods as the pyplot (plt) interface.
E.g. if ax is one of your subplot axes, for plotting a normal line plot you'd use ax.plot(..) instead of plt.plot(). This can actually be found exactly in the source from the page you link to.
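To see the difference concretely, the following sketch checks what imshow returns and where the images end up. Random arrays stand in for image_datas, and the two figures are created only for the comparison.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.image import AxesImage

rng = np.random.default_rng(0)
image_datas = [rng.random((8, 8)) for _ in range(4)]  # stand-in images

# pyplot interface: every call draws on the current axes (the last subplot).
fig_a, axarr_a = plt.subplots(2, 2)
returned = plt.imshow(image_datas[0])
plt.imshow(image_datas[1])
pyplot_counts = [len(ax.images) for ax in axarr_a.flat]

# Axes interface: each image is drawn on the axes it is called on.
fig_b, axarr_b = plt.subplots(2, 2)
for ax, img in zip(axarr_b.flat, image_datas):
    ax.imshow(img)
axes_counts = [len(ax.images) for ax in axarr_b.flat]

print(isinstance(returned, AxesImage), pyplot_counts, axes_counts)
```

In the first figure both images pile up on the last subplot, which matches the symptom of only one panel showing an image.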
One thing that I found quite helpful for showing all images:
_, axs = plt.subplots(n_row, n_col, figsize=(12, 12))
axs = axs.flatten()
for img, ax in zip(imgs, axs):
    ax.imshow(img)
plt.show()
You are plotting all your images on one axis. What you want is to get a handle for each axis individually and plot your images there. Like so:
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax1.imshow(...)
ax2 = fig.add_subplot(2,2,2)
ax2.imshow(...)
ax3 = fig.add_subplot(2,2,3)
ax3.imshow(...)
ax4 = fig.add_subplot(2,2,4)
ax4.imshow(...)
For more info have a look here: http://matplotlib.org/examples/pylab_examples/subplots_demo.html
For complex layouts, you should consider using gridspec: http://matplotlib.org/users/gridspec.html
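As a rough illustration of what gridspec allows, the sketch below places one large image next to two small ones. The 2x3 layout and the random images are arbitrary choices for the example, not something taken from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

fig = plt.figure(figsize=(9, 6))
gs = fig.add_gridspec(2, 3)  # 2 rows, 3 columns

ax_big = fig.add_subplot(gs[:, :2])     # spans both rows, first two columns
ax_top = fig.add_subplot(gs[0, 2])      # top-right cell
ax_bottom = fig.add_subplot(gs[1, 2])   # bottom-right cell

for ax in (ax_big, ax_top, ax_bottom):
    ax.imshow(rng.random((10, 10)))
    ax.axis("off")
```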
If the images are in an array and you want to iterate through each element and print it, you can write the code as follows:
plt.figure(figsize=(10,10)) # specifying the overall grid size
for i in range(25):
    plt.subplot(5,5,i+1)    # the number of images in the grid is 5*5 (25)
    plt.imshow(the_array[i])
plt.show()
Also note that I used subplot and not subplots. They are two different functions.
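A quick way to see how the two differ is to look at what each one returns. This is a small sketch with arbitrary grid sizes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

# plt.subplot (singular) adds ONE axes to the current figure and returns it.
plt.figure()
single = plt.subplot(5, 5, 1)

# plt.subplots (plural) creates a new figure AND the full grid of axes.
fig, grid = plt.subplots(5, 5)

print(isinstance(single, Axes), isinstance(fig, Figure), grid.shape)
```

With subplot you add cells one at a time inside a loop, while subplots hands back the whole array of axes up front.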
Below is a complete function show_image_list() that displays images side-by-side in a grid. You can invoke the function with different arguments.
Pass in a list of images, where each image is a Numpy array. It will create a grid with 2 columns by default. It will also infer if each image is color or grayscale.
list_images = [img, gradx, grady, mag_binary, dir_binary]
show_image_list(list_images, figsize=(10, 10))
Pass in a list of images, a list of titles for each image, and other arguments.
show_image_list(list_images=[img, gradx, grady, mag_binary, dir_binary],
list_titles=['original', 'gradx', 'grady', 'mag_binary', 'dir_binary'],
num_cols=3,
figsize=(20, 10),
grid=False,
title_fontsize=20)
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
def img_is_color(img):
    # A 3-D array counts as color only if its channels actually differ.
    if len(img.shape) == 3:
        # Check the color channels to see if they're all the same.
        c1, c2, c3 = img[:, : , 0], img[:, :, 1], img[:, :, 2]
        if (c1 == c2).all() and (c2 == c3).all():
            return False
        return True
    return False
def show_image_list(list_images, list_titles=None, list_cmaps=None, grid=True, num_cols=2, figsize=(20, 10), title_fontsize=30):
    '''
    Shows a grid of images, where each image is a Numpy array. The images can be either
    RGB or grayscale.
    Parameters:
    ----------
    list_images: list
        List of the images to be displayed.
    list_titles: list or None
        Optional list of titles to be shown for each image.
    list_cmaps: list or None
        Optional list of cmap values for each image. If None, then cmap will be
        automatically inferred.
    grid: boolean
        If True, show a grid over each image
    num_cols: int
        Number of columns to show.
    figsize: tuple of width, height
        Value to be passed to pyplot.subplots()
    title_fontsize: int
        Value to be passed to set_title().
    '''
    assert isinstance(list_images, list)
    assert len(list_images) > 0
    assert isinstance(list_images[0], np.ndarray)
    if list_titles is not None:
        assert isinstance(list_titles, list)
        assert len(list_images) == len(list_titles), '%d imgs != %d titles' % (len(list_images), len(list_titles))
    if list_cmaps is not None:
        assert isinstance(list_cmaps, list)
        assert len(list_images) == len(list_cmaps), '%d imgs != %d cmaps' % (len(list_images), len(list_cmaps))
    num_images = len(list_images)
    num_cols = min(num_images, num_cols)
    num_rows = int(num_images / num_cols) + (1 if num_images % num_cols != 0 else 0)
    # Create a grid of subplots.
    fig, axes = plt.subplots(num_rows, num_cols, figsize=figsize)
    # Create list of axes for easy iteration.
    if isinstance(axes, np.ndarray):
        list_axes = list(axes.flat)
    else:
        list_axes = [axes]
    for i in range(num_images):
        img = list_images[i]
        title = list_titles[i] if list_titles is not None else 'Image %d' % (i)
        cmap = list_cmaps[i] if list_cmaps is not None else (None if img_is_color(img) else 'gray')
        list_axes[i].imshow(img, cmap=cmap)
        list_axes[i].set_title(title, fontsize=title_fontsize)
        list_axes[i].grid(grid)
    # Hide the axes that have no image.
    for i in range(num_images, len(list_axes)):
        list_axes[i].set_visible(False)
    fig.tight_layout()
    plt.show()
As per matplotlib's suggestion for image grids:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
fig = plt.figure(figsize=(4., 4.))
grid = ImageGrid(fig, 111,  # similar to subplot(111)
                 nrows_ncols=(2, 2),  # creates 2x2 grid of axes
                 axes_pad=0.1,  # pad between axes in inch.
                 )
for ax, im in zip(grid, image_data):
    # Iterating over the grid returns the Axes.
    ax.imshow(im)
plt.show()
I end up at this url about once a week. For those who want a little function that just plots a grid of images without hassle, here we go:
import matplotlib.pyplot as plt
import numpy as np
def plot_image_grid(images, ncols=None, cmap='gray'):
    '''Plot a grid of images'''
    if not ncols:
        factors = [i for i in range(1, len(images)+1) if len(images) % i == 0]
        ncols = factors[len(factors) // 2] if len(factors) else len(images) // 4 + 1
    # round up so a partially filled last row still gets exactly one extra row
    nrows = int(len(images) / ncols) + int(len(images) % ncols > 0)
    imgs = [images[i] if len(images) > i else None for i in range(nrows * ncols)]
    f, axes = plt.subplots(nrows, ncols, figsize=(3*ncols, 2*nrows))
    axes = axes.flatten()[:len(imgs)]
    for img, ax in zip(imgs, axes.flatten()):
        if np.any(img):
            if len(img.shape) > 2 and img.shape[2] == 1:
                img = img.squeeze()
            ax.imshow(img, cmap=cmap)
# make 16 images with 60 height, 80 width, 3 color channels
images = np.random.rand(16, 60, 80, 3)
# plot them
plot_image_grid(images)
Sample code to visualize one random image from the dataset
import os
import random
import cv2
import matplotlib.pyplot as plt
def get_random_image(num):
    path=os.path.join("/content/gdrive/MyDrive/dataset/",images[num])
    image=cv2.imread(path)
    return image
Call the function
images=os.listdir("/content/gdrive/MyDrive/dataset")
# randrange excludes the upper bound, so the index is always valid
random_num=random.randrange(len(images))
img=get_random_image(random_num)
plt.figure(figsize=(8,8))
plt.imshow(cv2.cvtColor(img,cv2.COLOR_BGR2RGB))
Display cluster of random images from the given dataset
# Making a figure containing 16 images
lst=random.sample(range(0,len(images)), 16)
plt.figure(figsize=(12,12))
for index,value in enumerate(lst):
    img=get_random_image(value)
    img_resized=cv2.resize(img,(400,400))
    #print(path)
    plt.subplot(4,4,index+1)
    # OpenCV loads images as BGR; convert to RGB before displaying
    plt.imshow(cv2.cvtColor(img_resized,cv2.COLOR_BGR2RGB))
    plt.axis('off')
plt.tight_layout()
plt.subplots_adjust(wspace=0, hspace=0)
#plt.savefig(f"Images/{lst[0]}.png")
plt.show()
Plotting images present in a dataset
Here rand is a random index used to pick an image from the dataset, labels holds the integer class of every image, and labels_dict is a dictionary that maps each integer class to its name.
fig,ax = plt.subplots(5,5,figsize = (15,15))
ax = ax.ravel()
for i in range(25):
    rand = np.random.randint(0,len(image_dataset))
    image = image_dataset[rand]
    ax[i].imshow(image,cmap = 'gray')
    ax[i].set_title(labels_dict[labels[rand]])
plt.show()
I have a bar graph of 150 values. The code is:
rcParams.update({'figure.autolayout': True})
plt.figure(figsize=(14,9), dpi=600)
reso_names = [x[0] for x in resolution3]
reso_values = [x[1] for x in resolution3]
plt.bar(range(len(reso_values[0:20])), reso_values[0:20], align='center')
plt.xticks(range(len(reso_names[0:20])), list(reso_names[0:20]), rotation='vertical')
plt.margins(0.075)
plt.xlabel('Resolution Category Tier 3')
plt.ylabel('Volume')
plt.title('Resolution Category Tier 3 Volume', {'family' : 'Arial Black',
'weight' : 'bold',
'size' : 22})
plt.savefig('Reso3.pdf', format='pdf')
plt.show()
Since I want to break it down into sub-graphs of 20 each to maintain readability, I'm using the [0:20] slice on reso_names and reso_values (both lists).
However, the problem is that the scale is not maintained: each sub-graph has a very different scale, which hurts consistency. How can I set a scale that is maintained across all the graphs?
You can specify sharey=True to keep the y-scale the same in all subplots.
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randint(1, 10, 10)
y = np.random.randint(1, 100, 10)
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
# do simple plot here, replace barplot yourself
axes[0].plot(x)
axes[1].plot(y)
Or if you prefer to plot them separately, you can manually configure ax.set_ylim().
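For the 150-bar case, a rough sketch of the set_ylim route could look like this. resolution3 is replaced by random stand-in data, and chunk_size, y_max and the 5% headroom are arbitrary choices for illustration. Each chunk of 20 goes on its own figure but uses the same y-limits.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for resolution3: 150 (name, value) pairs.
resolution3 = [(f"cat{i}", int(v)) for i, v in enumerate(rng.integers(1, 1000, 150))]

chunk_size = 20
y_max = max(value for _, value in resolution3) * 1.05  # common upper limit

figures = []
for start in range(0, len(resolution3), chunk_size):
    chunk = resolution3[start:start + chunk_size]
    names = [name for name, _ in chunk]
    values = [value for _, value in chunk]

    fig, ax = plt.subplots(figsize=(14, 9))
    ax.bar(range(len(values)), values, align="center")
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation="vertical")
    ax.set_ylim(0, y_max)  # same scale on every figure
    ax.set_ylabel("Volume")
    figures.append(fig)
```

Computing the limit from the full data set, rather than per chunk, is what keeps the bars comparable across figures.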