Seaborn showing x-tick labels overlapping - python

I am trying to make a box plot that looks like this.
Now, there are a lot of tick marks that I do not need and that do not convey any additional information.
The code I am using is the following:
plot=sns.boxplot(y=MSE, x=Sim,
width=0.5,
palette='colorblind')
plot=sns.stripplot(y=MSE, x=Sim,
jitter=True,
marker='o',
alpha=0.15,
color='black')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.gca().invert_xaxis()
where MSE and Sim are two NumPy arrays of 400 elements each.
I reviewed some solutions that use locator_params and set_xticklabels. However, I want to know:
why this happens, and
is there a simple transformation of the MSE and Sim arrays that solves this?
I hope my questions are clear enough.
Thanks in advance.

I'm not sure what Sim contains. If it is an array of floats, the values are converted to categories before plotting, so every distinct value gets its own tick. Since the labels are not useful, one thing you can do is use a range of values that is as long as the y-values.
Even then, the labels still overlap heavily, because you are trying to fit 400 x-ticks onto the x-axis and the default font size is chosen to be readable. For example:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig,ax = plt.subplots(figsize=(15,6))
MSE = [np.random.normal(0,1,10) for i in range(100)]
Sim = np.arange(len(MSE))
g = sns.boxplot(y=MSE, x=Sim, width=0.5,palette='colorblind',ax=ax)
You can make the font size smaller so the labels don't overlap, but I guess they are then hardly readable:
Since, as you said, the labels are not useful in your case, you can show only every tenth tick:
ax.set(xticks=Sim[0::10])
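The tick-thinning idea above can be sketched without seaborn. This is only an illustration under assumptions: it uses matplotlib's own `ax.boxplot` as a stand-in for `sns.boxplot`, and the names `groups`, `positions` and `shown` are made up for the example. The same `set_xticks` call should apply to the Axes that seaborn draws on, since it places categories at integer positions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# 100 groups of 10 samples each, similar in shape to the answer's example data.
groups = [rng.normal(0, 1, 10) for _ in range(100)]
positions = np.arange(len(groups))

fig, ax = plt.subplots(figsize=(15, 6))
ax.boxplot(groups, positions=positions, widths=0.5)

# Keep only every tenth tick so the labels have room to breathe.
shown = positions[::10]
ax.set_xticks(shown)
ax.set_xticklabels([str(p) for p in shown])
plt.close(fig)
```

All 100 boxes are still drawn; only the number of labelled ticks is reduced.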

Related

Python: Histogram return wrong values for counts (EDIT: more general with example)

EDIT: I've found a general example where it doesn't work either!
I am trying to extract the data from a histogram, but the counts seem wrong. As example code:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.rand(1000000)
bins = np.arange(0,1,0.0001)
a,b,c = plt.hist(data,bins)
This gives me this rather messy histogram, and I've saved the counts as a and the bin edges as b. Now, plotting a against b, I should expect the same histogram, right? But that's not what I get:
plt.scatter(b[0:len(b)-1],a,s=2)
which gives me this, which doesn't match at all! Furthermore, when I try to find the maximum value of a, it gives me 144, which fits fine with the scatter plot, but not with the histogram plot.
If I count the numbers myself with the following code:
len(np.intersect1d(np.where(data>=b[np.argmax(a)]),np.where(data<b[np.argmax(a)+1])))
then it also gives me 144, in accordance with the values. So is the displayed histogram just wrong for some reason, and I should ignore it and just take the extracted data?
Old, unedited post:
For a physics course I am trying to bin my results in the following way:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit
plt.rc("font", family=["Helvetica", "Arial"])
plt.rc("axes", labelsize=18)
plt.rc("xtick", labelsize=16, top=True, direction="in")
plt.rc("ytick", labelsize=16, right=True, direction="in")
plt.rc("axes", titlesize=22)
plt.rc("legend", fontsize=16)
data_Ra = np.loadtxt('Ra226_cal2_ch001.txt',skiprows=5)
t_Ra = data_Ra[:,0]*10**-8 # time in seconds
channels_Ra = data_Ra[:,1]
channels_Ra = channels_Ra[np.where(channels_Ra>0)] # removing all the measurements at channel = 0
intervalspace = 2 #The intervals in which we count
bins=np.arange(0,4000,intervalspace)
counts, intervals , stuff = plt.hist(channels_Ra,bins)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.show()
Here, the histogram plot looks totally fine, with a max near 13000 counts. But when I then use np.max(counts), I am given about 24000, and when I try and just plot the values it gives me with:
plt.scatter(intervals[0:len(intervals)-1]+intervalspace/2,counts,s=1)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.title('Ra225')
plt.show()
it looks like this, which is totally different, and I can't figure out why. I expect the scatter plot to resemble the histogram, and while the peaks are located at the same x-values, the heights do not match.
This problem is in other large datasets as well.
I don't think I'm allowed to drop the txt file here, so I'm not sure how much more I can show, but any help will be appreciated!
I don't know why you interpret the results in that way.
If you look at the histogram plot, you will be able to see the maximum value of the y-axis is 25,000. That means that there are some values close to 25,000. This fact can be verified in the scatter plot.
Your scatter plot shows the actual values. It would be clearer if you described what your expected plot looks like.
If you want to discard some outlier points, you should filter the data before plotting it.
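One way to check that the values returned by plt.hist can be trusted is to compare them with np.histogram on the same data and bins. The sketch below uses synthetic data. The sample size, bin width and variable names are illustrative assumptions, not the asker's measurements.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)
bins = np.arange(0, 1, 0.001)

# plt.hist returns the counts, the bin edges and the drawn patches.
counts, edges, patches = plt.hist(data, bins)

# np.histogram computes the same binning without drawing anything.
np_counts, np_edges = np.histogram(data, bins)
counts_match = np.array_equal(counts, np_counts)

# Bin centres are a natural x-coordinate when re-plotting the counts.
centers = 0.5 * (edges[:-1] + edges[1:])
plt.close("all")
```

If the two sets of counts agree, any visual mismatch is more likely a rendering effect than a counting error. With thousands of very narrow bins, individual bars can be thinner than a pixel, so the drawn figure may not show every peak at its full height.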

How to ensure even spacing between labels on x axis of matplotlib graph?

I have been given data for which I need to plot a histogram, so I used the pandas hist() function and plotted it with matplotlib. The code runs on a remote server, so I cannot view the plot directly and save it as an image instead. Here is what the image looks like:
Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df_hist = pd.DataFrame(np.array(raw_data)).hist(bins=5)  # raw_data is the data supplied to me
plt.savefig('/path/to/file.png')
plt.close()
As you can see, the x-axis labels overlap. So I called plt.tight_layout(), like so:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df_hist = pd.DataFrame(np.array(raw_data)).hist(bins=5)
plt.tight_layout()
plt.savefig('/path/to/file.png')
plt.close()
There is some improvement now
But still the labels are too close. Is there a way to ensure the labels do not touch each other and there is fair spacing between them? Also I want to resize the image to make it smaller.
I checked the documentation here https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html but I am not sure which savefig parameter to use.
Since raw_data is not already a pandas DataFrame, there's no need to turn it into one just to plot it. Instead, you can plot directly with matplotlib.
There are many different ways to achieve what you'd like. I'll start by setting up some data which looks similar to yours:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gamma
raw_data = gamma.rvs(a=1, scale=1e6, size=100)
If we go ahead and use matplotlib to create the histogram we may find the xticks too close together:
fig, ax = plt.subplots(1, 1, figsize=[5, 3])
ax.hist(raw_data, bins=5)
fig.tight_layout()
The xticks are hard to read with all the zeros, regardless of spacing. So, one thing you may wish to do would be to use scientific formatting. This makes the x-axis much easier to interpret:
ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
Another option, without using scientific formatting would be to rotate the ticks (as mentioned in the comments):
ax.tick_params(axis='x', rotation=45)
fig.tight_layout()
Finally, you also mentioned altering the size of the image. Note that this is best done when the figure is initialised. You can set the size of the figure with the figsize argument. The following would create a figure 5" wide and 3" in height:
fig, ax = plt.subplots(1, 1, figsize=[5, 3])
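Putting these pieces together for the remote-server case in the question, a minimal sketch might look like the following. It assumes a headless backend and synthetic gamma-distributed data. It writes the PNG to an in-memory buffer, which is an illustrative choice, so replace `buffer` with a file path to save to disk.

```python
import io

import matplotlib
matplotlib.use("Agg")  # no display needed, e.g. on a remote server
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
raw_data = rng.gamma(shape=1.0, scale=1e6, size=100)  # stand-in for the real data

# The figure size is fixed at creation: 5 x 3 inches at 100 dots per inch.
fig, ax = plt.subplots(figsize=(5, 3), dpi=100)
ax.hist(raw_data, bins=5)

# Scientific notation removes the long runs of zeros from the x tick labels.
ax.ticklabel_format(style="sci", axis="x", scilimits=(0, 0))
fig.tight_layout()

width_px, height_px = fig.canvas.get_width_height()

buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```

The pixel size of the saved image is figsize multiplied by dpi, so shrinking either one gives a smaller file.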
I think the two best fixes were mentioned by Pam in the comments.
You can rotate the labels with
plt.xticks(rotation=45)
For more information, look here: Rotate axis text in python matplotlib
The real problem is too many zeros that don't provide any extra info. NumPy arrays are pretty easy to work with, so pd.DataFrame(np.array(raw_data)/1000).hist(bins=5) should remove three zeros from the x-axis tick labels. Then just add 'kilo' (thousands) to the x-axis label.
To change the size of the graph use rcParams.
from matplotlib import rcParams
rcParams['figure.figsize'] = 7, 5.75  # width, height in inches

Pyplot doesn't use the full space on 2D plots when setting equal ratio

I'm plotting some 2D fields using matplotlib and the fields have to be seen with equal aspect ratio. But when I set the aspect ratio I find that there are unnecessary blank spaces. Please consider the following example:
from matplotlib import pyplot as plt
import numpy as np
x=np.arange(100)
y=np.arange(100)
Y, X = np.meshgrid(y,x)
Z = X + Y
plt.contourf(X, Y, Z)
#plt.axes().set_aspect('equal', 'datalim')
plt.tight_layout()
plt.colorbar()
plt.grid()
plt.show()
If I run that command I get this figure:
However, let's say I uncomment the line that sets the equal aspect ratio, so that I include this:
plt.axes().set_aspect('equal', 'datalim')
I get the following output:
Which is a very poor use of space. I can't make the actual plot take better advantage of the figure space no matter how hard I try (I don't have that much knowledge of pyplot).
Is there a way to expand the actual data part of the equal-ratio plot so that I have less white space?
Thank you.
The issue you're having is caused by "datalim", which asks the axes to apply the usual limits you would expect from a normal line or scatter plot, e.g. the use of 5% margin on each side of the shown data.
I do not see any reason to use "datalim" here. So you may just leave it out,
plt.gca().set_aspect('equal')  # gca() targets the existing axes; in recent Matplotlib, plt.axes() adds a new one
and get a plot with equal aspect and no white space around.
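The same fix can be written with the object-oriented interface, which avoids any ambiguity about which axes is being modified. This is a sketch under the assumption that the data are the X, Y, Z arrays from the question. The names `fig`, `ax` and `filled` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(100)
y = np.arange(100)
Y, X = np.meshgrid(y, x)
Z = X + Y

fig, ax = plt.subplots()
filled = ax.contourf(X, Y, Z)

# Equal scaling on both axes; with the default adjustable ('box') the axes
# box is resized instead of the data limits being padded.
ax.set_aspect("equal")

fig.colorbar(filled, ax=ax)
ax.grid(True)
fig.tight_layout()
plt.close(fig)
```

With this data the x and y ranges are identical, so the axes box becomes square and hugs the filled contours.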

Python Pylab pcolor options for publication quality plots

I am trying to make DFT (discrete Fourier transform) plots using pcolor in Python. I have previously been using Mathematica 8.0 for this, but I find that its colorbar corresponds poorly to the data I am trying to represent. For instance, here is the data that I am plotting:
[[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]]
So, it's a lot of zeros and small numbers in a DFT matrix, i.e. a small quantity of high-frequency energy.
When I plot this using Mathematica, this is the result:
The color bar is off, so I thought I'd plot this with Python instead.
My Python code (which I hijacked from here) is:
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
#from pylab import pcolor, show, colorbar, xticks, yticks
from pylab import *
data = np.array([[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]], float)  # np.float was removed from recent NumPy releases
pcolor(data)
colorbar()
yticks(arange(0.5,10.5),range(0,10))
xticks(arange(0.5,10.5),range(0,10))
#show()
savefig('/home/mydir/foo.eps',figsize=(4,4),dpi=100)
And this python code plots as:
Now here is my question/list of questions:
I like how python plots this and would like to use this but...
How do I make all the "blue" which represents "0" go away like it does in my mathematica plot?
How do I rotate the plot to have the bright red spot in the top left corner?
The way I set the "dpi", is that correct?
Any useful references that I should use to strengthen my love for python?
I have looked through other questions on here and the user manual for numpy but found not much help.
I plan on publishing this data and it is rather important that I get all the bits and pieces right! :)
Edit:
Modified python code and resulting plot! What improvements would one suggest to this to make it publication worthy?
from numpy import corrcoef, sum, log, arange, save
from numpy.random import rand
from pylab import *
data = np.array([[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]], float)  # np.float was removed from recent NumPy releases
v1 = abs(data).max()
v2 = abs(data).min()
pcolor(data, cmap="binary")
colorbar()
#xlabel("X", fontsize=12, fontweight="bold")
#ylabel("Y", fontsize=12, fontweight="bold")
xticks(arange(0.5,10.5),range(0,10),fontsize=19)
yticks(arange(0.5,10.5),range(0,10),fontsize=19)
axis([0,7,0,7])
#show()
savefig('/home/mydir/Desktop/py_dft.eps',figsize=(4,4),dpi=600)
The following will get you closer to what you want:
import matplotlib.pyplot as plt
import numpy as np
# data is the 10x10 array defined in the question
plt.pcolor(data, cmap=plt.cm.OrRd)
plt.yticks(np.arange(0.5,10.5),range(0,10))
plt.xticks(np.arange(0.5,10.5),range(0,10))
plt.colorbar()
plt.gca().invert_yaxis()
plt.gca().set_aspect('equal')
plt.show()
The list of available colormaps by default is here. You'll need one that starts out white.
If none of those suits your needs, you can try generating your own. Start by looking at LinearSegmentedColormap.
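As a rough sketch of that last suggestion, a two-colour map that starts at white can be built with LinearSegmentedColormap.from_list, so that zeros blend into the page. The colormap name `white_red`, the small stand-in matrix and the in-memory buffer are assumptions made for illustration. The figure size is set when the figure is created, while dpi is passed to savefig.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap

# Small stand-in for the DFT matrix in the question (mostly zeros).
data = np.zeros((10, 10))
data[1, 1] = 0.351894
data[0, 2] = data[2, 0] = 0.10664
data[2, 2] = 0.178183

# Linear ramp from white (lowest value) to red (highest value).
white_red = LinearSegmentedColormap.from_list("white_red", ["white", "red"])

fig, ax = plt.subplots(figsize=(4, 4))
mesh = ax.pcolor(data, cmap=white_red, vmin=0.0)
fig.colorbar(mesh, ax=ax)
ax.invert_yaxis()       # row 0 at the top, so the peak sits in the top-left
ax.set_aspect("equal")  # square cells

buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=300)  # dpi belongs here, figsize does not
plt.close(fig)
```

Pinning `vmin=0.0` keeps exact zeros at the white end of the map even if the data later contain small negative values.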
Just for the record, in Mathematica 9.0:
GraphicsGrid@{{MatrixPlot[l,
ColorFunction -> (ColorData["TemperatureMap"][Rescale[#, {Min@l, Max@l}]] &),
ColorFunctionScaling -> False], BarLegend[{"TemperatureMap", {0, Max@l}}]}}

How to plot non-numeric data in Matplotlib

I wish to plot how my y-axis variable changes over time using Matplotlib. This is no problem for numeric data (continuous or discrete), but how should it be tackled for non-numeric data?
For example, if I wanted to visualise the times at which my car was stationary on the way to work, the x-axis would be time and the y-axis would take the values 'stationary' and 'moving' (a pretty useless example, I know).
The non-numeric data would need to be indexed somehow, but I don't know how to proceed. Any ideas?
Is this the type of thing you want? (If not, you might want to check out the matplotlib gallery page to give yourself some ideas, or maybe just draw a picture and post it.)
import matplotlib.pyplot as plt
data = [0]*5 + [1]*10 + [0]*3 +[1]*2
print(data)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(data)
ax.set_yticks((0, 1.))
ax.set_yticklabels(('stopped', 'moving'))
ax.set_ybound((-.2, 1.2))
ax.set_xlabel("time (minutes)")
plt.show()
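If the observations arrive as strings rather than 0/1 codes, one possible variation is to map each label to an integer level first and draw a step plot. The `states` list and the `levels` mapping below are made-up examples, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# One observation per minute, recorded as text labels.
states = ["stopped"] * 5 + ["moving"] * 10 + ["stopped"] * 3 + ["moving"] * 2

# Index the non-numeric values: each label gets an integer level.
levels = {"stopped": 0, "moving": 1}
codes = [levels[s] for s in states]

fig, ax = plt.subplots()
# A step plot holds each state until the next observation.
ax.step(range(len(codes)), codes, where="post")
ax.set_yticks(list(levels.values()))
ax.set_yticklabels(list(levels.keys()))
ax.set_ylim(-0.2, 1.2)
ax.set_xlabel("time (minutes)")
plt.close(fig)
```

Adding a third state only requires another entry in `levels`, and the tick labels follow automatically.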
