Histogram shows unlimited bins despite bin specification in matplotlib - python

I have some error data, and when I tried to plot a histogram of it, the bins came out very wide, as shown in the image below.
Below is the code:
import matplotlib.pyplot as plt
plt.figure()
plt.hist(error)
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
When I specify the bins explicitly, as we usually do (see the code below), I get the histogram shown below:
plt.figure()
plt.hist(error, bins=[-4,-3,-2,-1, 0,1, 2,3, 4,])
#plt.hist(error, bins = 6)
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
I would like the histogram to look clean, something like the example below (taken from Google), with clearly defined bins.
I tried seaborn's displot and it gave a nice plot, as shown below.
import seaborn as sns
sns.displot(error, bins=[-4,-3,-2,-1, 0,1, 2,3, 4,])
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
Why is matplotlib not able to make this plot? Did I miss anything, or do I need to set something in order to get the usual histogram plot? Please advise.

The matplotlib documentation for plt.hist() explains that the first parameter can be either a 1D array or a sequence of 1D arrays. The latter case also applies if you pass in a 2D array: each column is treated as a separate dataset, and every dataset gets its own bar, with cycling colors, in every bin.
This is what we see in your example: the x-axis ticks still correspond to the bin edges that were passed in, but for each bin there are many bars. So I'm assuming you passed in a multidimensional array.
To fix this, simply flatten your data before passing it to matplotlib, e.g. plt.hist(np.ravel(error), bins=bins).
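As a minimal sketch of that fix: the real `error` array is not shown in the question, so the 2D random array below (and its shape) is only an assumed stand-in.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed stand-in for the asker's data: a 2D array of errors
rng = np.random.default_rng(0)
error = rng.normal(size=(200, 50))

bins = [-4, -3, -2, -1, 0, 1, 2, 3, 4]

fig, ax = plt.subplots()
# np.ravel collapses the array to 1D, so hist draws a single bar per bin
counts, edges, _ = ax.hist(np.ravel(error), bins=bins)
ax.set_title("histogram of error")
ax.set_xlabel("error")
# call plt.show() to display the figure
```

With the flattened input, `counts` is a 1D array with one entry per bin; with the 2D input it would hold one row of counts per column of `error`.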

Related

Plots not visible when using a line plot

I am new to Python and I am trying to plot x against y (both contain a large number of values), but when I use plt.plot, no plot is visible in the output.
The code I have been using is:
for i in range(len(a)):
    plt.plot(a[i],b[i])
plt.figure()
plt.show()
When I tried a scatter plot:
for i in range(len(a)):
    plt.scatter(a[i],b[i])
plt.figure()
plt.show()
I cannot understand why the line plot is missing, and when I try seaborn it gives me the error ValueError: If using all scalar values, you must pass an index
import numpy as np
import matplotlib.pyplot as plt
a = np.linspace(0,5,100)
b = np.linspace(0,10,100)
plt.plot(a,b)
plt.show()
I think this answers your question. I have used sample values for a and b. Matplotlib line plots do not need to be built up in a loop.
A line is drawn between two points. If you plot single values one at a time, no line can be constructed.
Well, you might say "but I am plotting many points," which already contains part of the answer (points). Actually, plt.plot() draws line objects, so every time you call plot it creates a new one (no matter whether you call it on the same axes or on new ones). The reason you don't get lines is that each call plots only a single point. The reason you don't even see these points is that plot() does not mark the points with markers by default. If you add marker='o' to plot(), you will end up with the same figure as with scatter.
A scatter plot, on the other hand, is an unordered collection of points. Its characteristic is that there are no lines between the points, because they are usually not a sequence. Since nothing connects them, you can plot them all at once. By default they all have the same color, but you can pass a color vector to encode a third piece of information.
import matplotlib.pyplot as plt
import numpy as np
# create random data
a = np.random.rand(10)
b = np.random.rand(10)
# open figure + axes
fig,axs = plt.subplots(1,2)
# standard scatter-plot
axs[0].scatter(a,b)
axs[0].set_title("scatter plot")
# standard line-plot
axs[1].plot(a,b)
axs[1].set_title("line plot")
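To illustrate the point about markers, here is a small sketch with made-up data that compares the per-point loop from the question (with marker='o' added) against a single plot() call on the full arrays:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data, chosen only for illustration
a = np.linspace(0, 5, 20)
b = np.linspace(0, 10, 20)

fig, (ax_loop, ax_single) = plt.subplots(1, 2)

# One plot() call per point: each call creates its own line object with a
# single vertex, so nothing is visible unless a marker is requested.
for x, y in zip(a, b):
    ax_loop.plot(x, y, marker="o", color="C0")
ax_loop.set_title("loop with marker='o'")

# One plot() call with the full arrays: a single connected line.
ax_single.plot(a, b)
ax_single.set_title("single plot() call")
# call plt.show() to display the figure
```

The left axes ends up holding one line object per point, the right axes just one.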

Vary legend properties for different data in seaborn scatterplots

I have created a seaborn scatterplot for a dataset, where I set the size parameter to one column and the hue parameter to another. The hue column can only take five different values and is meant to help classify my data, while the size column has many more values because it represents actual numeric data. In the current data set, my hue values are only 0, 2, and 4, but with the "brief" legend option the legend labels do not match those values, which is very confusing. With the "full" legend option, the hue labels are correct, but there are far too many size labels. Therefore I would like to display the full legend for my hue parameter, but only a brief legend for the size parameter, because it has lots of unique values.
How the overcrowded "full" legend looks
The "brief" legend that is confusingly labeled
Edit: I have added some code that demonstrates the issue with a random dataset. To restate my question, I want the "shape" values to be shown in full in the legend, while the "size" values should be abbreviated (equivalent to the legend setting "brief").
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x_condition=np.arange(0,20,1)
y_condition=np.arange(0,20,1)
size=np.random.randint(0,200,20)
# I haven't made a random distribution here, because I wanted to make sure it contains at least one of each [0,2,4]
shape=[0,2,0,4]*5
df=pd.DataFrame({"x_condition":x_condition,"y_condition":y_condition,"size":size,"shape":shape})
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="brief")
plt.show()
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="full")
plt.show()
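One possible approach, offered only as a sketch and not taken from the thread: seaborn's "brief" legend samples evenly spaced values only for numeric variables, so casting the hue column to strings makes seaborn treat it as categorical and list every level, while the numeric size column stays abbreviated. The seeded random sizes below are an assumption made for reproducibility.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x_condition": np.arange(20),
    "y_condition": np.arange(20),
    "size": rng.integers(0, 200, 20),
    "shape": [0, 2, 0, 4] * 5,
})

# Strings are treated as categories, so every hue level gets a legend entry
df["shape"] = df["shape"].astype(str)

ax = sns.scatterplot(
    data=df, x="x_condition", y="y_condition",
    hue="shape", size="size", palette="coolwarm", legend="brief",
)
handles, labels = ax.get_legend_handles_labels()
# call plt.show() to display the figure
```

The trade-off is that the hue colors are then picked as discrete samples from the palette rather than mapped on a continuous scale.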

How to ensure even spacing between labels on x axis of matplotlib graph?

I have been given some data for which I need to produce a histogram, so I used the pandas hist() function and plotted it with matplotlib. The code runs on a remote server, so I cannot view the plot directly and I save it as an image instead. Here is what the image looks like:
Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df_hist = pd.DataFrame(np.array(raw_data)).hist(bins=5)  # raw_data is the data supplied to me
plt.savefig('/path/to/file.png')
plt.close()
As you can see, the x-axis labels are overlapping. So I used plt.tight_layout(), like so:
import matplotlib.pyplot as plt
df_hist = pd.DataFrame(np.array(raw_data)).hist(bins=5)
plt.tight_layout()
plt.savefig('/path/to/file.png')
plt.close()
There is some improvement now
But still the labels are too close. Is there a way to ensure the labels do not touch each other and there is fair spacing between them? Also I want to resize the image to make it smaller.
I checked the documentation here https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html but not sure which parameter to use for savefig.
Since raw_data is not already a pandas DataFrame, there's no need to turn it into one just to do the plotting. Instead, you can plot directly with matplotlib.
There are many different ways to achieve what you'd like. I'll start by setting up some data which looks similar to yours:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gamma
raw_data = gamma.rvs(a=1, scale=1e6, size=100)
If we go ahead and use matplotlib to create the histogram we may find the xticks too close together:
fig, ax = plt.subplots(1, 1, figsize=[5, 3])
ax.hist(raw_data, bins=5)
fig.tight_layout()
The xticks are hard to read with all the zeros, regardless of spacing. So, one thing you may wish to do would be to use scientific formatting. This makes the x-axis much easier to interpret:
ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
Another option, without using scientific formatting would be to rotate the ticks (as mentioned in the comments):
ax.tick_params(axis='x', rotation=45)
fig.tight_layout()
Finally, you also mentioned altering the size of the image. Note that this is best done when the figure is initialised. You can set the size of the figure with the figsize argument. The following would create a figure 5" wide and 3" in height:
fig, ax = plt.subplots(1, 1, figsize=[5, 3])
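Since the question also asks which savefig parameter controls the output size, here is a rough sketch of how figsize and dpi combine. It assumes gamma-distributed sample data from NumPy (instead of scipy) and writes to an in-memory buffer rather than the real file path.

```python
import io
import numpy as np
import matplotlib.pyplot as plt

# Assumed sample data resembling the question's large values
rng = np.random.default_rng(0)
raw_data = rng.gamma(shape=1.0, scale=1e6, size=100)

fig, ax = plt.subplots(1, 1, figsize=[5, 3])  # size in inches
ax.hist(raw_data, bins=5)
ax.ticklabel_format(style="sci", axis="x", scilimits=(0, 0))
fig.tight_layout()

# Pixel size = figsize * dpi, so 5 in x 3 in at 100 dpi gives 500 x 300 px
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=100)
png_bytes = buf.getvalue()
```

On the server you would pass the file path to savefig instead of the buffer; lowering dpi or figsize makes the saved image smaller.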
I think the two best fixes were mentioned by Pam in the comments.
You can rotate the labels with
plt.xticks(rotation=45)
For more information, look here: Rotate axis text in python matplotlib
The real problem is that there are too many zeros that don't provide any extra information. Numpy arrays are pretty easy to work with, so pd.DataFrame(np.array(raw_data)/1000).hist(bins=5) should strip three zeros from the tick labels on the value axis. Then just add "kilo" (or "thousands") to the axis label.
To change the size of the graph, use rcParams.
from matplotlib import rcParams
rcParams['figure.figsize'] = 7, 5.75  # width, height in inches
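Put together, the rescaling and rotation suggestions might look like the sketch below. The sample data and the axis label text are assumptions, and the figure size is passed directly rather than through rcParams.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed sample data with values in the millions
rng = np.random.default_rng(0)
raw_data = rng.gamma(shape=1.0, scale=1e6, size=100)

# Divide by 1000 so the tick labels lose three zeros
scaled = np.asarray(raw_data) / 1000

fig, ax = plt.subplots(figsize=(7, 5.75))
ax.hist(scaled, bins=5)
ax.set_xlabel("value (thousands)")      # say so in the axis label
ax.tick_params(axis="x", rotation=45)   # rotate labels to avoid overlap
fig.tight_layout()
```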

Pyplot colormap line by line

I'm just starting out with plotting in Python using the very nice pyplot. I want to show the evolution of two data series over time. Instead of a plain plot of each series as a function of time, I'd like a scatter plot of (data1, data2) where the time component is shown as a color gradient.
In my two-column file, time is given by the line number. It could either be written as a third column in the file, or pyplot could work out the line number on its own, if it is able to.
Can anyone help me do that?
Thanks a lot.
Nicolas
When plotting with matplotlib.pyplot.scatter, you can pass a third array via the keyword argument c. This array determines the colors of the scatter points. You then pick an appropriate colormap from matplotlib.cm and assign it with the cmap keyword argument.
This toy example creates two datasets, data1 and data2. It also creates an array colors of evenly spaced values between 0 and 1, with the same length as data1 and data2. It doesn't need to know the "line number"; it only needs the total number of data points, over which it spaces the colors equally.
I've also added a colorbar. You can remove this by removing the plt.colorbar() line.
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
N = 500
data1 = np.random.randn(N)
data2 = np.random.randn(N)
colors = np.linspace(0,1,N)
plt.scatter(data1, data2, c=colors, cmap=cm.Blues)
plt.colorbar()
plt.show()
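For the two-column file described in the question, a sketch along these lines may help. The file contents are invented and held in a StringIO object standing in for the real file, and the colormap choice is arbitrary.

```python
import io
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the two-column data file
fake_file = io.StringIO("0.1 1.0\n0.4 0.8\n0.9 0.5\n1.6 0.1\n")

# unpack=True returns one array per column
data1, data2 = np.loadtxt(fake_file, unpack=True)

# The line number (row index) plays the role of time
line_number = np.arange(len(data1))

fig, ax = plt.subplots()
sc = ax.scatter(data1, data2, c=line_number, cmap="viridis")
fig.colorbar(sc, ax=ax, label="line number")
# call plt.show() to display the figure
```

With a real file, pass its path to np.loadtxt instead of the StringIO object.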

Python - matplotlib axes limits approximate ticker location

When no axes limits are specified, matplotlib chooses default values as nice, round numbers below and above the minimum and maximum values in the list to be plotted.
Sometimes I have outliers in my data and I don't want them included when the axes are selected. I can detect the outliers, but I don't want to actually delete them, just have them be beyond the area of the plot. I have tried setting the axes to be the minimum and maximum value in the list not including the outliers, but that means that those values lie exactly on the axes, and the bounds of the plot do not line up with ticker points.
Is there a way to specify that the axes limits should be in a certain range, but let matplotlib choose an appropriate point?
For example, the following code produces a nice plot with the y-axis limits automatically set to (0.140,0.165):
from matplotlib import pyplot as plt
plt.plot([0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636])
plt.show()
After introducing an outlier in the data and setting the limits manually, the y-axis limits are set to slightly below 0.145 and slightly above 0.160 - not nearly as neat and tidy.
from matplotlib import pyplot as plt
plt.plot([0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 500000, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636])
plt.ylim(0.142921640661, 0.160337332636)
plt.show()
Is there any way to tell matplotlib to either ignore the outlier value when setting the limits, or set the axes to 'below 0.142921640661' and 'above 0.160337332636', but let it decide an appropriate location? I can't simply round the numbers up and down, as all my datasets occur on a different scale of magnitude.
You could make your data a masked array:
from matplotlib import pyplot as plt
import numpy as np
data = [0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 500000, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636]
data = np.ma.array(data, mask=False)
# mask only the outlier; a threshold of 0.16 would also hide the last valid point (0.1603...)
data.mask = data > 1
plt.plot(data)
plt.show()
unutbu actually gave me an idea that solves the problem. It's not the most efficient solution, so if anyone has any other ideas, I'm all ears.
EDIT: I was originally masking the data like unutbu said, but that doesn't actually set the axes right. I have to remove the outliers from the data.
After removing the outliers from the data, the remaining values can be plotted and the y-axis limits obtained. Then the data with the outliers can be plotted again, but setting the limits from the first plot.
from matplotlib import pyplot as plt
data = [0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773, 500000, 0.146009766101, 0.147241517391, 0.147224266382, 0.151530932135, 0.158778411784, 0.160337332636]
cleanedData = remove_outliers(data) #Function defined by me elsewhere.
plt.plot(cleanedData)
ymin, ymax = plt.ylim()
plt.clf()
plt.plot(data)
plt.ylim(ymin,ymax)
plt.show()
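The remove_outliers function is not shown in the thread. The sketch below substitutes a hypothetical IQR-based filter (remove_outliers_iqr, an assumption, not the author's function) and uses a throwaway axes to obtain the limits instead of clearing the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def remove_outliers_iqr(values, k=1.5):
    """Hypothetical helper: keep values within k * IQR of the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    keep = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values[keep]

data = [0.144490353418, 0.142921640661, 0.144511781706, 0.143587888773,
        500000, 0.146009766101, 0.147241517391, 0.147224266382,
        0.151530932135, 0.158778411784, 0.160337332636]
cleaned = remove_outliers_iqr(data)

# Let matplotlib autoscale a scratch axes on the cleaned data only
scratch_fig, scratch_ax = plt.subplots()
scratch_ax.plot(cleaned)
ymin, ymax = scratch_ax.get_ylim()
plt.close(scratch_fig)

# Plot the full data, outlier included, but reuse the autoscaled limits
fig, ax = plt.subplots()
ax.plot(data)
ax.set_ylim(ymin, ymax)
# call plt.show() to display the figure
```

The outlier is still part of the plotted line; it simply falls outside the visible range.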
