Understanding default usage of `pylab.legend` - python

I am using pylab to produce this image:
where the legend is not what I wanted. The dots represent actual data points, the lines are made with polyfit. I would like the legend to contain either ten entries with the lines and dots merged together for each color or just the ten dot-lines.
The associated piece of code:
for i in range(start, start + size*chunks):
    colorVal = scalarMap.to_rgba(values[i])
    slc1, slc2 = start + i*size, start + (i+1)*size
    my_legend.append(" = ".join([self.dtypes[v1],
                                 "%.2f" % data[v1, slc1]]))
    jx = data[x, slc1:slc2]
    jy = data[y, slc1:slc2]
    p = np.polyfit(jx, jy, deg = 2)
    lx = np.linspace(jx[0], jx[-1], 1000)
    ly = p[0]*lx**2 + p[1]*lx + p[2]
    pl.plot(jx, jy, "o", color = colorVal)
    pl.plot(lx, ly, color = colorVal)
pl.xlabel(self.dtypes[x])
pl.ylabel(self.dtypes[y])
pl.title(title)
pl.axis(axis)
pl.legend(my_legend, loc = "upper left", shadow = True)
pl.grid("on")
pl.show()
I realize what the mistake is: I add ten points to the my_legend list, and the legend function of pylab is then reading from it until the list ends. Therefore, only half of them make it. However, I don't know how to fix it. Is there a way I can make the legend function only register one entry for each iteration of the loop?
Also, I would like the points listed in reverse order. I tried
pl.legend(my_legend[::-1])
but that didn't work.
Any ideas on these two issues?

The behavior of pylab.legend makes sense once you understand how it works. When you call pylab.legend(my_legend, ...), the list of label strings is associated with the first 10 lines drawn. The way you do it, the first 10 lines are the ones added in the first 5 loop iterations (one set of dots and one fitted curve per iteration).
To show just the dots you can do this:
for i in range(start, start + size*chunks):
    [...]
    label = " = ".join([self.dtypes[v1], "%.2f" %data[v1, slc1]])
    [...]
    pl.plot(jx, jy, "o", color = colorVal, label=label)
    pl.plot(lx, ly, color = colorVal)
[...]
pl.legend(loc = "upper left", shadow = True)
If you want the legend to show the lines instead, just move label=label to the other plot command.
An alternative approach is to create a mylines list (similar to my_legend), append to it the line returned by just one of the two plot commands in each iteration, and then call pl.legend(mylines, my_legend, ...). Note that pl.plot returns a list of lines, so append its first element.
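The handle-list approach can be sketched as follows. This is a minimal illustration with synthetic data, not the asker's arrays: the ten data sets, the label text, and the variable names are assumptions. It also shows why reversing only the label list does not help: the labels would still be matched with the first lines drawn, so the handles and labels have to be reversed together.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots()
handles, labels = [], []

for k in range(10):  # ten synthetic data sets stand in for the real chunks
    jx = np.linspace(0.0, 1.0, 8)
    jy = k + jx**2 + rng.normal(scale=0.02, size=jx.size)
    coeffs = np.polyfit(jx, jy, deg=2)
    lx = np.linspace(jx[0], jx[-1], 200)

    # plot() returns a list of Line2D objects; unpack the single line it created
    (dots,) = ax.plot(jx, jy, "o")
    ax.plot(lx, np.polyval(coeffs, lx), color=dots.get_color())

    handles.append(dots)             # one handle per iteration (the dots only)
    labels.append("set = %d" % k)    # made-up label text

# Reverse both lists together so each label stays with its own handle
legend = ax.legend(handles[::-1], labels[::-1], loc="upper left", shadow=True)
```

Appending the fitted curve instead of the dots would give line entries in the legend rather than marker entries.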

Related

How can I make this matplotlib "word cloud" graph better looking?

For the sake of learning, I'm working on this word cloud program, which counts the number of times each word appears in a text and prints the result as a kind of "word cloud" image.
The program works fine, but I would like to address a couple of aesthetic issues:
How do I remove the numbers on the x-axis and the y-axis?
Is it possible to remove the axes completely?
Sometimes one word is placed on top of another. Does anyone know how to keep the words from overlapping, so that they sit nicely next to each other instead?
The output is
... and...
I would like it to look something like this (or at least as close to it as possible)
The code in question is
filename = "adventure.txt"
infile = open(filename)
wordcounts = {}
for line in infile:
    words = line.split()
    for word in words:
        w = "".join([e for e in word if e.isalpha()])
        w = w.lower()
        if w in wordcounts:
            wordcounts[w] = wordcounts[w] + 1
        else:
            wordcounts[w] = 1
#Put all words in list and sort counts
words = list(wordcounts.keys())
words.sort(key=lambda x:wordcounts[x], reverse=True)
import matplotlib.pyplot as plt
import numpy as np
#Set maximum font size to 50
scale = 50/wordcounts[words[1]]
#Set up empty plot with limits on x-axis and y-axis
plt.axes(xlim=(0,100), ylim=(0,100) )
#Plot 50 most frequent words with size=frequency
N = min(len(words), 50)
for i in range(0,N):
    x = np.random.uniform(0,90)
    y = np.random.uniform(0,90)
    freq = wordcounts[words[i]]
    col =["r", "g", "b", "m", "c", "k"][i % 5]
    plt.text(x, y, words[i], fontsize=scale * freq, color=col)
plt.show()
All help is welcomed and appreciated.
Define the figure and an axes object without ticks, tick labels, or frame, keeping the 0-100 data limits:
fig = plt.figure(figsize = (10, 10), num = 1, clear = True)
ax = plt.subplot(1, 1, 1, xticks = [], yticks = [], frameon = False, xlim = (0, 100), ylim = (0, 100))
Remove this line, since the limits are now set on the subplot:
plt.axes(xlim=(0,100), ylim=(0,100) )
Concluding lines:
for i in range(0,N):
    x = np.random.uniform(0,90)
    y = np.random.uniform(0,90)
    freq = wordcounts[words[i]]
    col =["r", "g", "b", "m", "c", "k"][i % 5]
    ax.text(x, y, words[i], fontsize=scale * freq, color=col)
plt.show()
Making your plot look like the example you provided will take a lot of manual trial and error: you have to choose coordinates for each word and decide where it looks best. One tip is to draw the largest word last and the smallest word first, so that larger text is never covered by smaller text; since words is sorted from most to least frequent, that means looping in reverse, from i == N - 1 down to i == 0. You could also choose non-overlapping coordinates with enough distance that the words aren't too close to one another or, if you want the words to touch, limit how much they are allowed to overlap. Use a colormap or a shuffled list of RGB colors so that the colors are more "scattered" (rather than having all of the yellow text in the top right corner, for example). Maybe center the larger text and push the smaller text towards the periphery. The list goes on, but I think you get the idea.
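The non-overlapping placement mentioned above can be approximated by rejection sampling: draw a word at a random position, measure its bounding box, and retry if it collides with a word that is already placed. The sketch below is one possible way to do this, not part of the original answer; place_words, max_tries, and seed are hypothetical names, the word list and font sizes are made up, and it assumes a backend whose canvas provides get_renderer() (Agg does).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def place_words(ax, words, sizes, max_tries=200, seed=0):
    """Draw each word at a random spot, retrying while it overlaps an earlier word."""
    rng = np.random.default_rng(seed)
    renderer = ax.figure.canvas.get_renderer()
    boxes = []   # bounding boxes (display coordinates) of the words kept so far
    texts = []
    for word, size in zip(words, sizes):
        for _ in range(max_tries):
            x, y = rng.uniform(0, 90, size=2)
            t = ax.text(x, y, word, fontsize=size)
            box = t.get_window_extent(renderer)
            if not any(box.overlaps(other) for other in boxes):
                boxes.append(box)
                texts.append(t)
                break
            t.remove()  # collision: discard this attempt and try another spot
    return texts, boxes

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_xlim(0, 100)
ax.set_ylim(0, 100)
ax.set_xticks([])
ax.set_yticks([])
ax.set_frame_on(False)

# Largest words first, so they get the free space before the small ones
texts, boxes = place_words(ax, ["matplotlib", "legend", "axes", "plot", "text"],
                           [40, 30, 24, 18, 12])
```

A word that cannot be placed within max_tries attempts is simply skipped, and words may still run past the right edge of the axes; both are simplifications of this sketch.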

Change the scale of the graph image

I am trying to generate a graph and save it as an image in Python. Although the plotting of the values seems OK and I get my picture, the scale of the graph is badly off.
If you compare the correct graph from the tutorial example with my bad graph generated from a different dataset, the curves are cut off at the bottom too early: the y-axis should start just above the highest values, and I should also see the curves for the highest x-values (in my case around 10^3).
Honestly, I think the problem is the scale of the y-axis, but I do not know which parameters I should change to fix it. I tried playing with some numbers (see the script below), but without any good results.
This is the code for calculation and generation of the graph image:
import numpy as np
hic_data = load_hic_data_from_reads('/home/besy/Hi-C/MOREX/TCC35_parsedV2/TCC35_V2_interaction_filtered.tsv', resolution=100000)
min_diff = 1
max_diff = 500
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 12))
for cnum, c in enumerate(hic_data.chromosomes):
    if c in ['ChrUn']:
        continue
    dist_intr = []
    for diff in xrange(min_diff, min((max_diff, 1 + hic_data.chromosomes[c]))):
        beg, end = hic_data.section_pos[c]
        dist_intr.append([])
        for i in xrange(beg, end - diff):
            dist_intr[-1].append(hic_data[i, i + diff])
    mean_intrp = []
    for d in dist_intr:
        if len(d):
            mean_intrp.append(float(np.nansum(d)) / len(d))
        else:
            mean_intrp.append(0.0)
    xp, yp = range(min_diff, max_diff), mean_intrp
    x = []
    y = []
    for k in xrange(len(xp)):
        if yp[k]:
            x.append(xp[k])
            y.append(yp[k])
    l = plt.plot(x, y, '-', label=c, alpha=0.8)
    plt.hlines(mean_intrp[2], 3, 5.25 + np.exp(cnum / 4.3), color=l[0].get_color(),
               linestyle='--', alpha=0.5)
    plt.text(5.25 + np.exp(cnum / 4.3), mean_intrp[2], c, color=l[0].get_color())
    plt.plot(3, mean_intrp[2], '+', color=l[0].get_color())
plt.xscale('log')
plt.yscale('log')
plt.ylabel('number of interactions')
plt.xlabel('Distance between bins (in 100 kb bins)')
plt.grid()
plt.ylim(2, 250)
_ = plt.xlim(1, 110)
fig.savefig('/home/besy/Hi-C/MOREX/TCC35_V2_results/filtered/TCC35_V2_decay.png', dpi=fig.dpi)
I think the problem is the scale: I need the y-axis to start from 10^-1 (0.1). To change this I tried:
min_diff = 0.1
.
.
.
dist_intr = []
for diff in xrange(min_diff, min((max_diff, 0.1 + hic_data.chromosomes[c]))):
.
.
.
plt.ylim((0.1, 20))
But these values raise the error: "integer argument expected, got float"
I also tried playing with the max_diff, plt.ylim and plt.xlim parameters a little bit, but nothing changed much.
I would like to ask which parameter(s) I need to change, and how, to generate a correctly scaled image of the graph. Thank you in advance.
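One observation on the attempt above: the error message comes from xrange, which only accepts integers, so setting min_diff = 0.1 fails before anything is plotted. The visible window of the graph is controlled separately by plt.ylim and plt.xlim, which do accept floats. The following is a minimal sketch with a synthetic power-law curve; the data and the chosen limits are assumptions for illustration, not the Hi-C values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for on-screen use
import matplotlib.pyplot as plt
import numpy as np

# Bin distances stay integers, as range()/xrange() require
min_diff, max_diff = 1, 500
x = np.arange(min_diff, max_diff)
y = 40.0 * x ** -1.0  # synthetic decay curve standing in for mean_intrp

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(x, y, "-", label="synthetic")
ax.set_xscale("log")
ax.set_yscale("log")

# The visible window is chosen here and may use floats freely
ax.set_ylim(0.1, 250)
ax.set_xlim(1, 1000)

ax.set_xlabel("Distance between bins")
ax.set_ylabel("number of interactions")
ax.grid()
```

In the original script, the equivalent change would be to leave min_diff and max_diff as integers and adjust only the plt.ylim and plt.xlim calls.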

Why first figure in the list is not plotted, but at the end there is an empty plot?

I have a problem with matplotlib.
I need to prepare a figure consisting of all the plots from a list of files in a specified directory. The code below does that, but it omits the first path.
For example, if I need to prepare an image consisting of 14 subplots, only 13 are copied; the first is omitted, and there is an empty plot at the last position instead.
I have checked that the function reads all the paths, including the first one in the list.
If you can help and give me a hint about what I'm doing wrong, I will be grateful.
Best regards
def create_combo_plot(path_to_dir, list_of_png_abspath):
    name = path_to_dir.replace('_out', '')
    title = name
    if name.find('/') != -1:
        title = name.split('/')[-1]
    how_many_figures = len(list_of_png_abspath)
    combo_figure = plt.figure(2, figsize=(100,100))
    a = 4
    b = int(floor(how_many_figures/4.1)) + 1
    for i, l in enumerate(list_of_png_abspath):
        print l #I`ve checked, path is reached
        j = i + 1
        img=mpimg.imread(l)
        imgplot = plt.imshow(img, interpolation="nearest")
        plot = plt.subplot(b, a, j)
    combo_figure.suptitle(title, fontsize=100)
    combo_figure.savefig(path_to_dir +'/' + title + '.jpeg')
    plt.close(combo_figure)
Replace these two lines:
imgplot = plt.imshow(img, interpolation="nearest")
plot = plt.subplot(b, a, j)
with these:
sub = plt.subplot(b, a, j)
sub.imshow(img, interpolation="nearest")
The line:
imgplot = plt.imshow(img, interpolation="nearest")
draws the image into the currently active subplot. In your case that is the subplot created in the previous loop iteration here:
plot = plt.subplot(b, a, j)
Therefore, the images are shifted by one position: the first image never ends up in the grid, and the last subplot stays empty.
But if you create the subplot first:
sub = plt.subplot(b, a, j)
and later explicitly plot into it:
sub.imshow(img, interpolation="nearest")
you should see 14 plots.
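Put together, the corrected loop might look like the sketch below. It is an illustration under assumptions: synthetic arrays stand in for the PNG files read with mpimg.imread, the function name and figure size are made up, and the row count uses math.ceil rather than the question's floor expression.

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for on-screen use
import matplotlib.pyplot as plt
import numpy as np

def combo_figure(images, ncols=4):
    """Show every image in its own subplot of an ncols-wide grid."""
    nrows = math.ceil(len(images) / ncols)  # enough rows for all images
    fig = plt.figure(figsize=(3 * ncols, 3 * nrows))
    for j, img in enumerate(images, start=1):
        ax = fig.add_subplot(nrows, ncols, j)    # create the subplot first
        ax.imshow(img, interpolation="nearest")  # then draw into that axes
        ax.set_axis_off()
    return fig

# Synthetic 8x8 arrays stand in for the images loaded from disk
images = [np.full((8, 8), k, dtype=float) for k in range(14)]
fig = combo_figure(images)
```

Calling imshow on the axes object returned by add_subplot removes any dependence on which subplot happens to be active.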

How to add significance levels on bar graph using Python's Matplotlib?

I have written some code to graph some data in Python's Matplotlib.
The plot currently:
The code to produce this plot:
import numpy
import matplotlib.pyplot as plt
from textwrap import wrap
groups=['Control','30min','24hour']
cell_lysate_avg=[11887.42595, 4862.429689, 3414.337554]
cell_lysate_sd=[1956.212855, 494.8437915, 525.8556207]
cell_lysate_avg=[i/1000 for i in cell_lysate_avg]
cell_lysate_sd=[i/1000 for i in cell_lysate_sd]
media_avg=[14763.71106,8597.475539,6374.732852]
media_sd=[240.8983759, 167.005365, 256.1374017]
media_avg=[i/1000 for i in media_avg] #to get ng/ml
media_sd=[i/1000 for i in media_sd]
fig, ax = plt.subplots()
index = numpy.arange(len(groups)) #where to put the bars
bar_width=0.45
opacity = 0.5
error_config = {'ecolor': '0.3'}
cell_lysate_plt=plt.bar(index,cell_lysate_avg,bar_width,alpha=opacity,color='black',yerr=cell_lysate_sd,error_kw=error_config,label='Cell Lysates')
media_plt=plt.bar(index+bar_width,media_avg,bar_width,alpha=opacity,color='green',yerr=media_sd,error_kw=error_config,label='Media')
plt.xlabel('Groups',fontsize=15)
plt.ylabel('ng/ml',fontsize=15)
plt.title('\n'.join(wrap('Average Over Biological Repeats for TIMP1 ELISA (n=3)',45)),fontsize=15)
plt.xticks(index + bar_width, groups)
plt.legend()
ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)
I have calculated the various two-tailed t-tests associated with this data and I want to display the results using the standard scientific journal representation, i.e. a line connecting two bars with a star that represents a significance level of (say) p < 0.05. Can anybody tell me how to do this?
As far as I know there is no standard scientific journal representation for showing significance. The exact way you draw it is a matter of taste. This is probably the reason why matplotlib has no specific function for significance bars (at least to my knowledge). You could just do it manually. For example:
from matplotlib.markers import TICKDOWN
def significance_bar(start,end,height,displaystring,linewidth = 1.2,markersize = 8,boxpad =0.3,fontsize = 15,color = 'k'):
    # draw a line with downticks at the ends
    plt.plot([start,end],[height]*2,'-',color = color,lw=linewidth,marker = TICKDOWN,markeredgewidth=linewidth,markersize = markersize)
    # draw the text with a bounding box covering up the line
    plt.text(0.5*(start+end),height,displaystring,ha = 'center',va='center',bbox=dict(facecolor='1.', edgecolor='none',boxstyle='Square,pad='+str(boxpad)),size = fontsize)
pvals = [0.001,0.1,0.00001]
offset =1
for i,p in enumerate(pvals):
    if p>=0.05:
        displaystring = r'n.s.'
    elif p<0.0001:
        displaystring = r'***'
    elif p<0.001:
        displaystring = r'**'
    else:
        displaystring = r'*'
    height = offset + max(cell_lysate_avg[i],media_avg[i])
    # assumes left-aligned bars (the default before matplotlib 2.0); with center-aligned bars use [0, 1]
    bar_centers = index[i] + numpy.array([0.5,1.5])*bar_width
    significance_bar(bar_centers[0],bar_centers[1],height,displaystring)
Instead of the stars you could of course also explicitly write p<0.05 or something similar. You can then spend hours fiddling with the parameters until it looks just right.
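The choice between stars and an explicit p-value text can be factored into a small helper. This is only a sketch: significance_label and THRESHOLDS are hypothetical names, and the cut-offs are copied from the if/elif chain in the loop above rather than taken from any standard.

```python
# (cut-off, stars) pairs, checked from the strictest cut-off to the loosest
THRESHOLDS = [(0.0001, "***"), (0.001, "**"), (0.05, "*")]

def significance_label(p, stars=True):
    """Return stars, or the text 'p<cut-off', for the strictest cut-off that p passes."""
    for cutoff, mark in THRESHOLDS:
        if p < cutoff:
            return mark if stars else "p<%g" % cutoff
    return "n.s."  # not significant at the loosest cut-off

labels = [significance_label(p) for p in [0.001, 0.1, 0.00001]]
text_labels = [significance_label(p, stars=False) for p in [0.0005, 0.03]]
```

The returned string can be passed as displaystring to significance_bar in place of the inline if/elif chain.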

Adding a single label to the legend for a series of different data points plotted inside a designated bin in Python using matplotlib.pyplot.plot()

I have a script for plotting astronomical data of redmapping clusters from a csv file. I can read the data points and want to plot them in different colors depending on their redshift values: I am binning the dataset into 3 bins (0.1-0.2, 0.2-0.25, 0.25-0.31) based on the redshift.
The problem arises in my code after I determine which bin each data point belongs to: I want to have 3 labels in the legend corresponding to the red, green and blue data points, but this is not happening and I don't know why. I am using plot() instead of scatter() because I also have to draw the best fit of the data in the same figure, so everything needs to be in one figure.
import numpy as np
import matplotlib.pyplot as py
import csv
z = open("Sheet4CSV.csv","rU")
data = csv.reader(z)
x = []
y = []
ylow = []
yupp = []
xlow = []
xupp = []
redshift = []
for r in data:
    x.append(float(r[2]))
    y.append(float(r[5]))
    xlow.append(float(r[3]))
    xupp.append(float(r[4]))
    ylow.append(float(r[6]))
    yupp.append(float(r[7]))
    redshift.append(float(r[1]))
from operator import sub
xerr_l = map(sub,x,xlow)
xerr_u = map(sub,xupp,x)
yerr_l = map(sub,y,ylow)
yerr_u = map(sub,yupp,y)
py.xlabel("$Original\ Tx\ XCS\ pipeline\ Tx\ keV$")
py.ylabel("$Iterative\ Tx\ pipeline\ keV$")
py.xlim(0,12)
py.ylim(0,12)
py.title("Redmapper Clusters comparison of Tx pipelines")
ax1 = py.subplot(111)
##Problem starts here after the previous line##
for p in redshift:
    for i in xrange(84):
        p=redshift[i]
        if 0.1<=p<0.2:
            ax1.plot(x[i],y[i],color="b", marker='.', linestyle = " ")#, label = "$z < 0.2$")
            exit
        if 0.2<=p<0.25:
            ax1.plot(x[i],y[i],color="g", marker='.', linestyle = " ")#, label="$0.2 \leq z < 0.25$")
            exit
        if 0.25<=p<=0.3:
            ax1.plot(x[i],y[i],color="r", marker='.', linestyle = " ")#, label="$z \geq 0.25$")
            exit
##There seems nothing wrong after this point##
py.errorbar(x,y,yerr=[yerr_l,yerr_u],xerr=[xerr_l,xerr_u], fmt= " ",ecolor='magenta', label="Error bars")
cof = np.polyfit(x,y,1)
p = np.poly1d(cof)
l = np.linspace(0,12,100)
py.plot(l,p(l),"black",label="Best fit")
py.plot([0,15],[0,15],"black", linestyle="dotted", linewidth=2.0, label="line $y=x$")
py.grid()
box = ax1.get_position()
ax1.set_position([box.x1,box.y1,box.width, box.height])
py.legend(loc='center left',bbox_to_anchor=(1,0.5))
py.show()
In the first 'for' loop, I index every value 'p' in the list 'redshift' so that the bins can be created with 'if' statements. But if I add the labels that are commented out after each plot() call inside the 'if' statements, every data point 'i' plotted at (x[i], y[i]) gets its own label, and my legend ends up with 87 entries in total (including the 3 defined elsewhere in the code)!
I essentially need 1 label for each bin...
Please tell me what needs to be done after the bins are created and the plot() calls are made. Thanks in advance :-)
Sorry I cannot post my image here due to low reputation!
The data 'appended' for x, y and redshift lists from the csv file are as follows:
x=[5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547]
y=[5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677]
redshift = [0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19]
Working with numerical data like this, you should really consider using a numerical library, like numpy.
The problem in your code arises from processing each record (a coordinate (x,y) and the corresponding redshift value) one at a time. You are calling plot for each point, thereby creating a legend entry for each of those 84 data points. You should consider your "bins" as groups of data that belong to the same dataset and process them as such. You could use "logical masks" to distinguish between your "bins", as shown below.
It's also not clear why you call exit after each plotting action. A bare exit without parentheses is just an expression that names the built-in and does nothing.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547])
y = np.array([5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677])
redshift = np.array([0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19])
# Boolean masks select the points that belong to each redshift bin
bin3 = 0.25 <= redshift
bin2 = np.logical_and(0.2 <= redshift, redshift < 0.25)
bin1 = np.logical_and(0.1 <= redshift, redshift < 0.2)
labels = (r"$z < 0.2$", r"$0.2 \leq z < 0.25$", r"$z \geq 0.25$")
colors = ('b', 'g', 'r')  # blue, green, red, matching the colors in the question
# One plot call per bin, so the legend gets exactly one entry per bin
for mask, label, co in zip((bin1, bin2, bin3), labels, colors):
    plt.plot(x[mask], y[mask], color=co, ls='none', marker='o', label=label)
plt.legend()
plt.show()
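As an alternative to writing each mask by hand, the bin index of every point can be computed in one step with np.digitize. This is a sketch of a swapped-in technique, not the method used in the answer; the short redshift sample and the variable names are made up for illustration.

```python
import numpy as np

redshift = np.array([0.12, 0.2, 0.24, 0.25, 0.3])  # small made-up sample
edges = [0.2, 0.25]                                 # boundaries between the three bins
labels = (r"$z < 0.2$", r"$0.2 \leq z < 0.25$", r"$z \geq 0.25$")

# digitize returns 0 for z < 0.2, 1 for 0.2 <= z < 0.25, and 2 for z >= 0.25
bin_index = np.digitize(redshift, edges)

for k, label in enumerate(labels):
    mask = bin_index == k  # Boolean mask for bin k, usable as x[mask], y[mask]
    print(label, redshift[mask])
```

Each mask can then be passed to a single plot call per bin, exactly as in the loop above, so the legend still receives one entry per bin.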
