Scaled logarithmic binning in Python

I'm interested in plotting the probability distribution of a set of points that are distributed as a power law. Further, I would like to use logarithmic binning to smooth out the large fluctuations in the tail. If I just use logarithmic binning and plot it on a log-log scale, such as
pl.hist(MyList,log=True, bins=pl.logspace(0,3,50))
pl.xscale('log')
for example, then the problem is that the larger bins account for more points, i.e. the heights of my bins are not scaled by bin size.
Is there a way to use logarithmic binning and still have Python scale all the heights by the size of each bin? I know I can probably do this manually in some roundabout fashion, but it seems like this should be an existing feature, yet I can't seem to find it. If you think histograms are fundamentally a bad way to represent my data and you have a better idea, then I'd love to hear that too.
Thanks!

Matplotlib won't help you much if you have special requirements for your histograms. You can, however, easily create and manipulate a histogram with numpy.
import numpy as np
from matplotlib import pyplot as plt
# something random to plot
data = (np.random.random(10000) * 10)**3
# log-scaled bins
bins = np.logspace(0, 3, 50)
widths = bins[1:] - bins[:-1]
# calculate the histogram
hist, edges = np.histogram(data, bins=bins)
# normalize each count by its bin width
hist_norm = hist / widths
# plot it!
plt.bar(bins[:-1], hist_norm, widths)
plt.xscale('log')
plt.yscale('log')
Obviously, when you present your data in a non-standard way like this, you have to be careful to label your y-axis properly and to write an informative figure caption.
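If a normalization to unit area is acceptable, matplotlib can also do the division for you via density=True. A minimal sketch (note that density=True divides by the total number of points as well as by the bin width, so you get a probability density rather than counts per unit x):
import numpy as np
from matplotlib import pyplot as plt
data = (np.random.random(10000) * 10)**3
bins = np.logspace(0, 3, 50)
# density=True scales each count by 1 / (bin width * total count),
# so the histogram integrates to 1 over the binned range
plt.hist(data, bins=bins, density=True, log=True)
plt.xscale('log')
plt.show()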

Related

Seaborn showing x-tick labels overlapping

I am trying to make a box plot that looks like this.
Now, there are a lot of tick marks that I do not need and that truly do not show any additional information.
The code I am using is the following:
plot = sns.boxplot(y=MSE, x=Sim,
                   width=0.5,
                   palette='colorblind')
plot = sns.stripplot(y=MSE, x=Sim,
                     jitter=True,
                     marker='o',
                     alpha=0.15,
                     color='black')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.gca().invert_xaxis()
where MSE and Sim are two numpy arrays of 400 elements each.
I reviewed some solutions that use locator_params and set_xticklabels. However, I want to know:
why does this happen, and
is there a simple transformation of the MSE and Sim arrays that solves this?
I hope my questions are clear enough.
Thanks in advance.
I'm not very sure what you have as Sim; if it is an array of floats, the values are converted to categorical before plotting. Since the labels are not useful, one thing you can do is use a range of integers that is as long as the y-values.
Even with that, the labels still overlap a lot, because you are trying to fit 400 x-ticks onto the x-axis and the font size defaults to something readable. For example:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
fig, ax = plt.subplots(figsize=(15, 6))
MSE = [np.random.normal(0, 1, 10) for i in range(100)]
Sim = np.arange(len(MSE))
g = sns.boxplot(y=MSE, x=Sim, width=0.5, palette='colorblind', ax=ax)
You can set the font size to be smaller so the labels don't overlap, but I guess it's hardly readable:
So since, as you said, the labels are not useful in your case, you can do:
ax.set(xticks=Sim[0::10])
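For completeness, a small usage sketch combining the two ideas above (the specific label size and rotation are arbitrary choices, not from the original answer):
# keep every 10th categorical tick and shrink/rotate the remaining labels
ax.set(xticks=Sim[0::10])
ax.tick_params(axis='x', labelsize=8, labelrotation=90)
plt.tight_layout()
plt.show()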

Matplotlib Interpolation

I have a DataFrame df_data with two columns, 'Egy' and 'fx', that I plot in this way:
plot_1 = df_data.plot(x="Egy", y="fx", color="red", ax=ax1, linewidth=0.85)
plot_1.set_xscale('log')
plt.show()
But then I want to smooth this curve using spline like this:
from scipy.interpolate import spline
import numpy as np
x_new = np.linspace(df_data['Egy'].min(), df_data['Egy'].max(),500)
f = spline(df_data['Egy'], df_data['fx'],x_new)
plot_1 = ax1.plot(x_new, f, color="black", linewidth=0.85)
ax1.set_xscale('log')
plt.show()
And the plot I get is this (forget about the scatter blue points).
There are a lot of "peaks" in the smoothed curve, mainly at lower x. How can I smooth this curve properly?
When I follow busybear's suggestion to use np.logspace instead of np.linspace, I get the following picture, which is not very satisfactory either.
You have your x values linearly spaced with np.linspace, but your plot is log-scaled. You could try np.geomspace for your x values, or plot without the log scale.
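A sketch of what that could look like with current SciPy (scipy.interpolate.spline has been removed from recent releases; make_interp_spline is a present-day equivalent, and the synthetic df_data below merely stands in for your own sorted data):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline
# synthetic stand-in for df_data, sorted by 'Egy'
df_data = pd.DataFrame({'Egy': np.geomspace(1, 1e3, 30)})
df_data['fx'] = np.log(df_data['Egy']) + np.random.normal(0, 0.05, len(df_data))
fig, ax1 = plt.subplots()
# evaluation points spaced evenly in log-x, to match the log-scaled axis
x_new = np.geomspace(df_data['Egy'].min(), df_data['Egy'].max(), 500)
# cubic B-spline through the (strictly increasing) x values
spl = make_interp_spline(df_data['Egy'], df_data['fx'], k=3)
ax1.plot(df_data['Egy'], df_data['fx'], 'r.', markersize=4)
ax1.plot(x_new, spl(x_new), color="black", linewidth=0.85)
ax1.set_xscale('log')
plt.show()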
Using a spline will only work well for functions that are already smooth. What you need is to regularize the data first and interpolate afterwards; this will help smooth out the bumps. Regularization is an advanced topic, and it would not be appropriate to discuss it in detail here.
Update: for regularization using machine learning, you might look into the scikit-learn library for Python.
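One lightweight illustration of the regularize-then-interpolate idea (a sketch, not the machine-learning route mentioned above): a smoothing spline with a nonzero smoothing factor fits a regularized curve instead of passing through every noisy point.
import numpy as np
from scipy.interpolate import UnivariateSpline
# noisy synthetic data standing in for (Egy, fx)
x = np.geomspace(1, 1e3, 40)
y = np.log(x) + np.random.normal(0, 0.2, x.size)
# s > 0 trades exact interpolation for smoothness (a simple form of regularization)
smooth = UnivariateSpline(x, y, s=2.0)
x_new = np.geomspace(x.min(), x.max(), 500)
y_new = smooth(x_new)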

histogram on matplotlib for a

I am new here, although I have been reading answers to questions for quite a while. I have a problem, I have a seismic hazard curve looking roughly as follows:
I need to plot it like a histogram. That is what a hazard curve looks like; I would need to plot the median as a histogram.
I have tried to use plt.hist as follows:
n, bins, patches = plt.hist(x, 50, facecolor='green', alpha=0.75)
where x is my frequency data array:
x = [1.00E-02, 1.00E-03, 1.00E-04, 1.00E-05, 1.00E-06, 1.00E-07, 1.00E-08, 1.00E-09, 1.00E-10]
but it gives me back an empty image. I think it is because plt.hist is meant for plotting probability density functions (and mine is not a probability density function), but I am not sure if I am right. Can someone give me some pointers on how to do this?
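No answer is recorded in the thread, but one likely issue is that the values span eight orders of magnitude, so 50 equal-width linear bins cram nearly all of them against the far left of the axis. A sketch of one possible fix, using log-spaced bins and a log x-axis (the bin edges here are an arbitrary choice):
import numpy as np
import matplotlib.pyplot as plt
x = [1.00E-02, 1.00E-03, 1.00E-04, 1.00E-05, 1.00E-06,
     1.00E-07, 1.00E-08, 1.00E-09, 1.00E-10]
# log-spaced bin edges covering the full data range
bins = np.logspace(-10, -2, 9)
n, bins, patches = plt.hist(x, bins=bins, facecolor='green', alpha=0.75)
plt.xscale('log')
plt.show()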

Plot 2 histograms with different length of data points in one graph using matplotlib

I have two sets of data, one containing around 11 million data points and the other around 5000. I would like to plot them both on one histogram, but because of the difference in size I need to normalise the frequencies so I can plot them on the same figure. Below I have simulated what I have done with my data to be able to plot it; I have used normed=True.
from numpy.random import randn
import matplotlib.pyplot as plt
import random
datalist1 = []
for x in range(1, 50000):
    datalist1.append(random.uniform(1, 2))
datalist2=randn(5000000)
fig= plt.figure(1)
plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', normed=True)
plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',normed=True)
plt.xlabel("Value")
plt.ylabel("Normalised Frequency")
plt.legend()
plt.show()
Can you please tell me if this is a good way to get around this issue? I would like the tallest bar of each histogram to reach 1 (or 100%).
The normed=True setting normalizes the histogram to an area of 1, which gives it an interpretation as an estimate of a probability density function.
In short, it actually makes sense not to normalize on the peak but on the area.
But if you really want to normalize by height you can modify the polygon data of the histogram:
h = plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', normed=True)
# with histtype='stepfilled', the returned patch list holds a single Polygon
p = h[2][0]
# rescale the polygon's y-coordinates so its peak sits at 1
p.xy[:, 1] /= p.xy[:, 1].max()
h = plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',normed=True)
p = h[2][0]
p.xy[:, 1] /= p.xy[:, 1].max()
This solution feels a bit hackish, but at least it's quick and dirty :)
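A slightly less hackish alternative (a sketch, not part of the original answer): compute the counts with numpy first, divide by the peak, and draw the bars yourself.
import numpy as np
import matplotlib.pyplot as plt
datalist1 = np.random.uniform(1, 2, 50000)
datalist2 = np.random.randn(5000000)
for data, color, label in [(datalist1, 'b', 'theoretical'),
                           (datalist2, 'g', 'experimental')]:
    counts, edges = np.histogram(data, bins=20)
    heights = counts / counts.max()  # tallest bin becomes exactly 1
    plt.bar(edges[:-1], heights, width=np.diff(edges),
            align='edge', alpha=0.4, color=color, label=label)
plt.xlabel("Value")
plt.ylabel("Peak-normalised frequency")
plt.legend()
plt.show()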

Python - matplotlib axes limits approximate ticker location

When no axes limits are specified, matplotlib chooses default values as nice, round numbers below and above the minimum and maximum values in the list to be plotted.
Sometimes I have outliers in my data and I don't want them included when the axis limits are chosen. I can detect the outliers, but I don't want to actually delete them, just have them lie beyond the area of the plot. I have tried setting the axes to the minimum and maximum values in the list, excluding the outliers, but then those values lie exactly on the axes and the bounds of the plot do not line up with tick locations.
Is there a way to specify that the axes limits should be in a certain range, but let matplotlib choose an appropriate point?
For example, the following code produces a nice plot with the y-axis limits automatically set to (0.140,0.165):
from matplotlib import pyplot as plt
plt.plot([0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636])
plt.show()
After introducing an outlier in the data and setting the limits manually, the y-axis limits are set to slightly below 0.145 and slightly above 0.160 - not nearly as neat and tidy.
from matplotlib import pyplot as plt
plt.plot([0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 500000, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636])
plt.ylim(0.142921640661, 0.160337332636)
plt.show()
Is there any way to tell matplotlib either to ignore the outlier value when setting the limits, or to set the axes to 'just below 0.142921640661' and 'just above 0.160337332636' but let it decide the exact location? I can't simply round the numbers up and down, as my datasets occur on different orders of magnitude.
You could make your data a masked array:
from matplotlib import pyplot as plt
import numpy as np
data = [0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 500000, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636]
data = np.ma.array(data, mask=False)
data.mask = data>0.16
plt.plot(data)
plt.show()
unutbu actually gave me an idea that solves the problem. It's not the most efficient solution, so if anyone has any other ideas, I'm all ears.
EDIT: I was originally masking the data like unutbu said, but that doesn't actually set the axes right. I have to remove the outliers from the data.
After removing the outliers from the data, the remaining values can be plotted and the y-axis limits obtained. Then the data with the outliers can be plotted again, but setting the limits from the first plot.
from matplotlib import pyplot as plt
data = [0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 500000, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636]
cleanedData = remove_outliers(data) #Function defined by me elsewhere.
plt.plot(cleanedData)
ymin, ymax = plt.ylim()
plt.clf()
plt.plot(data)
plt.ylim(ymin,ymax)
plt.show()
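A variant of the same idea that avoids plotting twice (a sketch; the cutoff below is an arbitrary stand-in for the remove_outliers logic): compute the limits directly from the cleaned values and pad them by matplotlib's default 5% margin.
import numpy as np
from matplotlib import pyplot as plt
data = np.array([0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773,
                 500000, 0.146009766101, 0.147241517391, 0.147224266382,
                 0.151530932135, 0.158778411784, 0.160337332636])
clean = data[data < 1.0]  # hypothetical stand-in for remove_outliers(data)
pad = 0.05 * (clean.max() - clean.min())  # mimic matplotlib's default margins
plt.plot(data)
plt.ylim(clean.min() - pad, clean.max() + pad)
plt.show()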
