Making Dendrogram Bins Thicker in matplotlib? - python

I have a linkage matrix of about size 10,000 that I've plotted using scipy.cluster.hierarchy. The default rendering is poor, as expected given the size of the input, because the bins are far too narrow to discern any meaningful structure in the dendrogram. How can I force the bins to be further apart so I can see the data better? I realize this will require the image to be huge, but that's OK.
I'm aware of dendrogram's truncate functionality. I will likely end up using it, but I'd like to get a look at the full data in a presentation I can grok visually before I start truncating.
Here's the rendering as it appears now. Increasing the image size using figsize does not appear to help, nor does xtick.major.pad.
import pylab
import scipy.cluster.hierarchy as sch

fig = pylab.figure(figsize=(10, 10))
Z = sch.dendrogram(Y, leaf_rotation=90)  # Y is the linkage matrix
fig.show()
fig.savefig('dendrogram.jpg')
Thank you for your help in advance!

Related

How to make this chart easier to read?

I want to know how to give my x axis labels more room so that the team labels don't overlap. I'm sure it's just a matter of configuring the chart size.
My code:
plt.plot(prem_data.Team, prem_data.attack_scored,'o')
plt.plot(prem_data.Team, prem_data.defence_saves)
plt.xlabel("Team")
plt.ylabel("Attack goals scored & Defence tackles")
plt.legend(["attack scored", "defence saved"])
plt.show()
I can imagine two solutions, which are not mutually exclusive.
Directly alter the size of the font. This can be achieved via calling plt.rcParams.update({'font.size': <font_size>}), assuming that you have imported matplotlib.pyplot under the alias plt, as you have done in the source code provided. You would probably want to set the <font_size> to be small to prevent overlapping labels, but this would require some experimentation.
Increase the size of the figure. This can be done in a number of ways, but perhaps the simplest method you can implement with minimal edits to your current code would be to use the command plt.rcParams["figure.figsize"] = <fig_size> where <fig_size> is a tuple specifying the size of the figure in inches, such as (10, 5).
With some trial and error, you should be able to manipulate the size of the font and the figure to produce a plot with improved readability.
Note: The method for altering figure size I introduced above is not the most conventional way to go about this problem. Instead, it is much more common to use matplotlib.pyplot.figure or similar variants. For more information, I recommend that you check out this thread and the documentation.
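A minimal sketch of both suggestions combined, using the more conventional figsize argument instead of rcParams for the figure size. The team names and scores are made-up stand-ins for prem_data (which the question does not show), and the sizes and the label rotation are illustrative starting points only. The rotation is an extra not mentioned above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in data for prem_data
teams = [f"Team {i}" for i in range(20)]
attack_scored = [30 + (7 * i) % 25 for i in range(20)]
defence_saves = [20 + (11 * i) % 30 for i in range(20)]

plt.rcParams.update({"font.size": 8})      # smaller text everywhere
fig, ax = plt.subplots(figsize=(12, 5))    # wider figure; size is in inches

ax.plot(teams, attack_scored, "o", label="attack scored")
ax.plot(teams, defence_saves, label="defence saved")
ax.set_xlabel("Team")
ax.set_ylabel("Attack goals scored & Defence tackles")
ax.tick_params(axis="x", labelrotation=90)  # rotated labels need less horizontal room
ax.legend()
fig.tight_layout()
# Call fig.savefig(...) or switch to an interactive backend and plt.show() to view it.
```

Expect to adjust the numbers by trial and error for your own data.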

Python: how to plot points with little overlapping

I am using python to plot points. The plot shows relationship between area and the # of points of interest (POIs) in this area. I have 3000 area values and 3000 # of POI values.
Now the plot looks like this:
The problem is that, at lower left side, points are severely overlapping each other so it is hard to get enough information. Most areas are not that big and they don't have many POIs.
I want to make a plot with little overlapping. I am wondering whether I can use unevenly distributed axis or use histogram to make a beautiful plot. Can anyone help me?
I would suggest using a logarithmic scale for the y axis. You can either use pyplot.semilogy(...) or pyplot.yscale('log') (http://matplotlib.org/api/pyplot_api.html).
Note that points where area <= 0 will not be rendered.
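A minimal sketch of the logarithmic y axis, assuming synthetic, strictly positive data in place of the 3000 real values. The question does not say which quantity is on which axis, so the arrays are simply called x and y here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed data: many small values, a few large ones
x = rng.lognormal(mean=0.0, sigma=1.5, size=3000)
y = rng.lognormal(mean=1.0, sigma=1.5, size=3000)

fig, ax = plt.subplots()
ax.scatter(x, y, s=5)
ax.set_yscale("log")  # Axes-level equivalent of pyplot.yscale('log')
```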
I think we have two major choices here: first, adjusting this plot, and second, displaying your data in another type of plot.
For the first option, I would suggest clipping the boundaries. You have plenty of empty space around the borders. If you limit the plot to the boundaries of the data, your data would scale better. In addition, you may choose to plot the points with smaller dots, so that they overlap less.
The second option would be to display the data in a different view, such as a histogram. This might give better insight into the distribution of your data among different bins. But it would be a completely different type of view from the former plot.
I would suggest first adjusting the plot by limiting its boundaries to the data points, so that the plot area has enough space to scale the data, and trying histograms later. But as I mentioned, these are two different things and would give different insights about your data.
For adjusting you might try this:
x1,x2,y1,y2 = plt.axis()
plt.axis((x1,x2,y1,y2))
You would probably need to make minor adjustments to the axis variables before passing them back. Note that there are definitely better options than this, but it was the first thing that came to my mind.
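One way the adjustment could look, as a sketch with synthetic data: the limits are taken from the data itself rather than edited by hand, and the marker size s=4 is an arbitrary small value.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data crowded toward the lower left
x = rng.exponential(scale=2.0, size=3000)
y = rng.exponential(scale=5.0, size=3000)

fig, ax = plt.subplots()
ax.scatter(x, y, s=4)           # small markers overlap less
ax.set_xlim(x.min(), x.max())   # clip the horizontal range to the data
ax.set_ylim(y.min(), y.max())   # clip the vertical range to the data
```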

Matplotlib: Avoid congestion in X axis

I'm using this code to plot a cumulative frequency plot:
plot = ocum.plot(x='index', y='cdf', yticks=np.arange(0.0, 1.05, 0.1))
plot.set_xlabel("Data usage")
plot.set_ylabel("CDF")
fig = plot.get_figure()
fig.savefig("overall.png")
It appears as follows and is very crowded around the initial part. This is due to how my data is spread. How can I make it clearer? (uploading to postimg because I don't have enough reputation points)
http://postimg.org/image/ii5z4czld/
I hope that I understood what you want: give more space to the visualization of the "CDF" development for smaller "data usage" values, right? Typically, you would achieve this by changing your X axis scale from linear to logarithmic. Head over to Plot logarithmic axes with matplotlib in python for seeing different ways to achieve that. The simplest might be, in your case, to replace plot() with semilogx().
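A minimal sketch with plain matplotlib, assuming synthetic data usage values in place of the ocum frame. If the plot comes from pandas as in the question, the returned Axes also has set_xscale("log"), and DataFrame.plot accepts logx=True, so either of those may be the smaller change there.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data usage values spanning several orders of magnitude
usage = np.sort(rng.lognormal(mean=3.0, sigma=2.0, size=1000))
cdf = np.arange(1, usage.size + 1) / usage.size  # empirical CDF

fig, ax = plt.subplots()
ax.semilogx(usage, cdf)  # like plot(), but with a logarithmic x axis
ax.set_yticks(np.arange(0.0, 1.05, 0.1))
ax.set_xlabel("Data usage")
ax.set_ylabel("CDF")
```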

KDE is very slow with large data

When I try to make a scatter plot, colored by density, it takes forever.
Probably because the length of the data is quite big.
This is basically how I do it:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
xy = np.vstack([np.array(x_values),np.array(y_values)])
z = gaussian_kde(xy)(xy)
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='none')
As additional information:
>>> len(x_values)
809649
>>> len(y_values)
809649
Is there any other option that gives the same result but runs faster?
No, there are no good solutions.
Every point has to be processed and drawn as a circle, which will probably end up hidden by other points.
My tricks (note that these may slightly change the output):
Get the minimum and maximum and set the image to that size, so that the figure does not need to be redone.
Remove as much data as possible:
duplicate data
convert to a chosen precision (e.g. of floats) and remove the resulting duplicates. You can derive the precision from half the size of the dot (or from the resolution of the graph, if you want the original look).
Less data means more speed. Removing a point is far quicker than drawing it in a graph (where it will be overwritten anyway).
Heatmaps are often more interesting for huge data sets: they give more information. But in your case, I think you still have too much data.
Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (just 2000 points). In any case, that page also uses my first point.
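The "round to a chosen precision, then drop duplicates" trick could be sketched like this. The data is synthetic and decimals=1 is a hypothetical precision; in practice it would be derived from the marker size or the plot resolution as described above.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical stand-ins for the large x_values / y_values arrays
x_values = rng.normal(size=100_000)
y_values = rng.normal(size=100_000)

decimals = 1  # hypothetical precision; coarser values remove more points
points = np.column_stack([x_values, y_values])
# Points that coincide after rounding are kept only once
reduced = np.unique(np.round(points, decimals), axis=0)

print(len(points), "->", len(reduced))
```

The reduced array can then be passed to the density estimate and the scatter call in place of the full data, at the cost of slightly shifted point positions.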
I would suggest plotting a sample of the data.
If the sample is large enough you should get the same distribution.
Making sure the plot is relevant to the entire data set is also quite easy as you can simply take multiple samples and compare between them.
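A minimal sketch of the sampling idea, assuming synthetic data and a hypothetical sample size of 2000. The density is estimated from, and evaluated at, the sampled points only.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
# Hypothetical stand-ins for the large x_values / y_values arrays
x_values = rng.normal(size=50_000)
y_values = x_values + rng.normal(size=50_000)

sample_size = 2000  # hypothetical; increase until repeated samples look alike
idx = rng.choice(x_values.size, size=sample_size, replace=False)
xy = np.vstack([x_values[idx], y_values[idx]])
z = gaussian_kde(xy)(xy)  # density at the sampled points only

# The sample can then be drawn as in the question, e.g.
# plt.scatter(xy[0], xy[1], c=z, s=100, edgecolor='none')
```

Repeating this with different seeds and comparing the plots is one way to check that the sample is representative.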

Plotting an histogram in log log scale with identical bar thickness

I'm trying to plot input data in a histogram in log-log scale (to quickly see whether it could fit a power law), but I'm having trouble getting the output I want. I'm using Python, and more specifically the matplotlib/numpy libraries:
thebins = N.linspace(min_data.min(),min_data.max(),int(sys.argv[len(sys.argv)-1]))
thebins = N.log(thebins)
bar_min = plt.hist(min_data,bins=thebins,alpha=0.40,label=['Minimal Distance'],log=True)
min_data is my 1d data array, the two first lines are for creating the bins and then putting them in a log scale. The final line is for 'filling' the bins/histogram with log y scale.
The graphical output is:
It may seem fussy, but I'm not satisfied with having bins of different thickness; it seems to me that the data is harder to read, or can even be misread, because of that. Not all log-log histograms have bins of the same width, and I'm convinced it can be done within Python; do you have an idea of how to change my code to get there?
Thank you in advance ;)
Should have been a no-brainer: I only had to take the log of my data for the x axis, and then build the histogram with evenly spaced bins, passing the argument "log=True" for the y axis.
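A sketch of that fix, assuming synthetic heavy-tailed data in place of min_data and a hypothetical bin count of 30. Because the bins are evenly spaced over the logged values, every bar has the same width.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical strictly positive, heavy-tailed data
min_data = rng.pareto(a=2.0, size=10_000) + 1.0

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(
    np.log(min_data),   # log of the data gives the logarithmic x axis
    bins=30,            # evenly spaced bins in log space
    alpha=0.40,
    label="Minimal Distance",
    log=True,           # logarithmic y axis
)
ax.set_xlabel("log(distance)")
ax.legend()
```

Note that the x tick labels then show log values rather than the original distances.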
