I am just learning some basics of Data Analysis.
I have a simple csv data file like the one below.
START,FIRST,SECOND,ITEM
1,100,200,A
2,100,200,B
2,100,300,C
2,200,300,D
3,200,100,E
3,200,100,F
3,200,100,G
3,200,100,H
3,200,100,I
3,200,100,J
I wrote this small program to read this csv file and then print a histogram using matplotlib for the three columns START, FIRST, and SECOND. I also print a scatter plot for FIRST vs SECOND columns.
#!/exp/anaconda3/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
file_name = 'junk.csv'
data = pd.read_csv(file_name)
print(data.describe())
plt.rcParams['axes.grid'] = True
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs[0, 0].hist(data['START'], 100, density=True, facecolor='g', alpha=0.8)
axs[1, 0].scatter(data['FIRST'], data['SECOND'], facecolor='violet')
axs[0, 1].hist(data['FIRST'], 100, density=True, facecolor='r', alpha=0.8)
axs[1, 1].hist(data['SECOND'], 100, density=True, facecolor='b', alpha=0.8)
plt.show()
What I do not understand is the histogram plots. Take, for example, the bottom right-hand image with blue bars in the attached picture: why does it not simply plot how many times the number 200 occurs, instead of showing that 200 occurs 0.10 times? How is that possible? The same goes for 300.
Can someone help me understand how matplotlib comes up with the Y-axis values? These values do not make sense to me.
Thank you.
Ruby Drew
Try density = False. The density parameter tells matplotlib whether you want it to normalise the heights or not such that it represents a probability density.
First note that a histogram is primarily meant to count continuous samples in small bins. For discrete data, the bins should be chosen carefully so that their boundaries fall in between the values. When you pass bins=N, matplotlib assumes a continuous distribution and subdivides the range from the smallest to the largest sample into N equally sized bins. For discrete data this can have unexpected side effects: a value that lies exactly on a bin boundary can end up in either of the two neighbouring bins.
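As a minimal sketch of choosing such in-between boundaries, assuming the discrete values are evenly spaced: the array below is the SECOND column from the question, and the names `values`, `step` and `edges` are only illustrative.

```python
import numpy as np

# SECOND column from the CSV in the question
second = np.array([200, 200, 300, 300, 100, 100, 100, 100, 100, 100])

values = np.unique(second)       # distinct values, sorted
step = np.diff(values).min()     # spacing between values (assumed constant)

# Put each bin edge halfway between two neighbouring values
edges = np.arange(values[0] - step / 2, values[-1] + step, step)

counts, _ = np.histogram(second, bins=edges)
print(edges, counts)
```

The resulting `edges` can be passed as the `bins` argument of `hist` in place of a bin count.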
With density=True, the heights of the bars are recalculated such that the total area of all bins sums to 1. For a continuous distribution with many samples, this resembles the probability density function and can be used to draw a kde plot with the same y-axis.
So, what's happening in the blue histogram:
100 bins are created between 100 and 300. Each bin is 2 wide.
3 bins get values: the bin 100-102 gets a count of 6, either the bin 198-200 or the bin 200-202 gets a count of 2, and the bin 298-300 also gets a count of 2.
The total count over all bins is now 10. As the bin width is 2, the counts need to be divided by (total_count * bin_width) = 20 to obtain a total area of 1, giving heights of 0.3, 0.1 and 0.1.
Indeed, the sum of height times width of the bars is 1: 0.3*2 + 0.1*2 + 0.1*2 = 1.
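The arithmetic can be checked with NumPy alone. This is only a sketch: it uses np.histogram as a stand-in for the binning that hist performs, with the SECOND column from the question typed in by hand.

```python
import numpy as np

# SECOND column from the CSV in the question
second = np.array([200, 200, 300, 300, 100, 100, 100, 100, 100, 100])

# Same settings as hist(data['SECOND'], 100, density=True)
heights, edges = np.histogram(second, bins=100, density=True)
widths = np.diff(edges)

nonzero = heights[heights > 0]     # heights of the bars that are actually visible
area = (heights * widths).sum()    # total area under all bars
print(nonzero, area)
```

With density=False the same call returns the raw counts instead of the normalised heights.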
Seaborn's histplot (added in version 0.11) has a discrete parameter to indicate that a distribution is discrete, and a stat parameter where you can choose between 'count', for bin heights showing the usual counts, and 'probability', for heights relative to their probability, mimicking a probability mass function. The blue histogram could be drawn as:
import seaborn as sns
sns.histplot(data, x='SECOND', discrete=True, stat='probability', facecolor='b', alpha=0.8, ax=axs[1, 1])
I'm trying to plot a colour map and I need to use mlab.specgram() to perform a cross-correlation of two functions in the frequency domain, so I can't use pyplot.specgram(). I used pyplot.imshow in order to plot a colour map of the cross-correlation, but as a result the axes are just index numbers rather than the actual time values corresponding to the power shown in the colour map.
I've tried to change the labels using xticks()/yticks() and the extent argument, but all it does is show me a small portion of my colour map instead of changing the labels.
Is there a way for me to change the scale of my axes to match the actual frequency and time?
For reference:
# My spectrograms:
spec_H1, freqs, t = mlab.specgram(H_filt, NFFT=NFFT, Fs=fs, noverlap=NOVL, mode=mode)
spec_L1, freqs, t = mlab.specgram(L_filt, NFFT=NFFT, Fs=fs, noverlap=NOVL, mode=mode)
# The cross-correlation:
X = np.real(spec_H1 * np.conj(spec_L1))
# The figure:
plt.figure(figsize=(10,10))
plt.imshow(abs(X), cmap = 'jet')
plt.colorbar()
plt.ylim(0,200)
plt.xlim(0,500)
As you can see, for example, the time axis (x) should run from 0 to 4 seconds, but it's sampled such that it runs from 0 to 500. How do I change this?
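One possible approach, sketched under assumptions: pass the t and freqs arrays returned by mlab.specgram to the extent argument of imshow, together with origin='lower' and aspect='auto'. The signals below are hypothetical stand-ins for H_filt and L_filt, and the values of fs, NFFT and NOVL are made up for illustration. Once extent is set, xlim and ylim have to be given in seconds and Hz rather than in array indices, which is probably why only a small portion of the map was visible before.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import mlab

# Hypothetical stand-ins for H_filt and L_filt: 4 s of a noisy 100 Hz tone
fs = 4096
rng = np.random.default_rng(0)
time = np.arange(0, 4, 1 / fs)
H_filt = np.sin(2 * np.pi * 100 * time) + 0.5 * rng.standard_normal(time.size)
L_filt = np.sin(2 * np.pi * 100 * time) + 0.5 * rng.standard_normal(time.size)

NFFT, NOVL = 256, 128  # assumed values
spec_H1, freqs, t = mlab.specgram(H_filt, NFFT=NFFT, Fs=fs, noverlap=NOVL, mode='complex')
spec_L1, freqs, t = mlab.specgram(L_filt, NFFT=NFFT, Fs=fs, noverlap=NOVL, mode='complex')
X = np.real(spec_H1 * np.conj(spec_L1))

fig, ax = plt.subplots(figsize=(10, 10))
im = ax.imshow(
    np.abs(X),
    cmap='jet',
    origin='lower',                              # row 0 (lowest frequency) at the bottom
    extent=[t[0], t[-1], freqs[0], freqs[-1]],   # map array indices to seconds and Hz
    aspect='auto',                               # do not force square pixels
)
fig.colorbar(im, ax=ax)
ax.set_xlim(0, 4)      # now in seconds, not in column indices
ax.set_ylim(0, 200)    # now in Hz, not in row indices
ax.set_xlabel('Time [s]')
ax.set_ylabel('Frequency [Hz]')
```

Note that t and freqs hold bin centres, so this extent is off by half a bin at each edge; for most plots that approximation is not noticeable. Call plt.show() to display the figure.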
In the pyplot document for scatter plot:
matplotlib.pyplot.scatter(x, y, s=20, c='b', marker='o', cmap=None, norm=None,
vmin=None, vmax=None, alpha=None, linewidths=None,
faceted=True, verts=None, hold=None, **kwargs)
The marker size
s:
size in points^2. It is a scalar or an array of the same length as x and y.
What kind of unit is points^2? What does it mean? Does s=100 mean 10 pixel x 10 pixel?
Basically I'm trying to make scatter plots with different marker sizes, and I want to figure out what the s number means.
This can be a somewhat confusing way of defining the size but you are basically specifying the area of the marker. This means, to double the width (or height) of the marker you need to increase s by a factor of 4. [because A = WH => (2W)(2H)=4A]
There is a reason, however, that the size of markers is defined in this way. Because of the scaling of area as the square of width, doubling the width actually appears to increase the size by more than a factor 2 (in fact it increases it by a factor of 4). To see this consider the following two examples and the output they produce.
import matplotlib.pyplot as plt
# doubling the width of markers
x = [0,2,4,6,8,10]
y = [0]*len(x)
s = [20*4**n for n in range(len(x))]
plt.scatter(x,y,s=s)
plt.show()
gives
Notice how the size increases very quickly. If instead we have
# doubling the area of markers
x = [0,2,4,6,8,10]
y = [0]*len(x)
s = [20*2**n for n in range(len(x))]
plt.scatter(x,y,s=s)
plt.show()
gives
Now the apparent size of the markers increases roughly linearly in an intuitive fashion.
As for the exact meaning of a 'point', it is fairly arbitrary for plotting purposes; you can just scale all of your sizes by a constant until they look reasonable.
Edit: (In response to comment from @Emma)
It's probably confusing wording on my part. The question asked about doubling the width of a circle, so in the first picture each circle (moving from left to right) has double the width of the previous one; its area therefore grows exponentially with base 4. Similarly, in the second example each circle has double the area of the last one, which gives an exponential with base 2.
However, it is in the second example (where we are scaling the area) that doubling the area appears to make the circle twice as big to the eye. Thus, if we want a circle to appear a factor of n bigger, we should increase the area by a factor of n, not the radius, so that the apparent size scales linearly with the area.
Edit to visualize the comment by @TomaszGandor:
This is what it looks like for different functions of the marker size:
import matplotlib.pyplot as plt
x = [0,2,4,6,8,10,12,14,16,18]
s_exp = [20*2**n for n in range(len(x))]
s_square = [20*n**2 for n in range(len(x))]
s_linear = [20*n for n in range(len(x))]
plt.scatter(x,[1]*len(x),s=s_exp, label='$s=2^n$', lw=1)
plt.scatter(x,[0]*len(x),s=s_square, label='$s=n^2$')
plt.scatter(x,[-1]*len(x),s=s_linear, label='$s=n$')
plt.ylim(-1.5,1.5)
plt.legend(loc='center left', bbox_to_anchor=(1.1, 0.5), labelspacing=3)
plt.show()
Because other answers here claim that s denotes the area of the marker, I'm adding this answer to clarify that this is not necessarily the case.
Size in points^2
The argument s in plt.scatter denotes the markersize**2. As the documentation says
s : scalar or array_like, shape (n, ), optional
size in points^2. Default is rcParams['lines.markersize'] ** 2.
This can be taken literally. In order to obtain a marker which is x points large, you need to square that number and give it to the s argument.
So the relationship between the markersize of a line plot and the scatter size argument is the square. In order to produce a scatter marker of the same size as a plot marker of size 10 points you would hence call scatter( .., s=100).
import matplotlib.pyplot as plt
fig,ax = plt.subplots()
ax.plot([0],[0], marker="o", markersize=10)
ax.plot([0.07,0.93],[0,0], linewidth=10)
ax.scatter([1],[0], s=100)
ax.plot([0],[1], marker="o", markersize=22)
ax.plot([0.14,0.86],[1,1], linewidth=22)
ax.scatter([1],[1], s=22**2)
plt.show()
Connection to "area"
So why do other answers and even the documentation speak about "area" when it comes to the s parameter?
Of course the units of points**2 are area units.
For the special case of a square marker, marker="s", the area of the marker is indeed directly the value of the s parameter.
For a circle, the area of the circle is area = pi/4*s.
For other markers there may not even be any obvious relation to the area of the marker.
In all cases however the area of the marker is proportional to the s parameter. This is the motivation to call it "area" even though in most cases it isn't really.
Specifying the size of the scatter markers in terms of a quantity that is proportional to the marker's area makes sense insofar as it is the area of the marker that is perceived when comparing different patches, rather than its side length or diameter. That is, doubling the underlying quantity should double the area of the marker.
What are points?
So far, the size of a scatter marker has been described in units of points. Points are often used in typography, where font sizes are specified in points. Line widths are also often specified in points. Matplotlib uses 72 points per inch (ppi), so 1 point is 1/72 inch.
It might be useful to be able to specify sizes in pixels instead of points. If the figure dpi is 72 as well, one point is one pixel. If the figure dpi is different (matplotlib default is fig.dpi=100),
1 point == fig.dpi/72. pixels
While the scatter marker's size in points would hence look different for different figure dpi, one could produce a 10 by 10 pixels^2 marker, which would always have the same number of pixels covered:
import matplotlib.pyplot as plt
for dpi in [72,100,144]:
    fig,ax = plt.subplots(figsize=(1.5,2), dpi=dpi)
    ax.set_title("fig.dpi={}".format(dpi))
    ax.set_ylim(-3,3)
    ax.set_xlim(-2,2)
    ax.scatter([0],[1], s=10**2,
               marker="s", linewidth=0, label="100 points^2")
    ax.scatter([1],[1], s=(10*72./fig.dpi)**2,
               marker="s", linewidth=0, label="100 pixels^2")
    ax.legend(loc=8,framealpha=1, fontsize=8)
    fig.savefig("fig{}.png".format(dpi), bbox_inches="tight")
plt.show()
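The pixel-to-points conversion used above can be wrapped in a small helper. This is only a sketch: scatter_size_for_pixels is a hypothetical name, not part of Matplotlib, and it assumes the 72 points per inch stated above.

```python
def scatter_size_for_pixels(pixels, dpi):
    """Return the scatter `s` value (points^2) for a marker `pixels` wide.

    Hypothetical helper, not a Matplotlib function; assumes 72 points per inch.
    """
    points = pixels * 72.0 / dpi   # 1 pixel == 72/dpi points
    return points ** 2


# s needed for a marker 10 pixels wide at a few figure resolutions
for dpi in (72, 100, 144):
    print(dpi, scatter_size_for_pixels(10, dpi))
```

The result would be passed as s, for example ax.scatter(x, y, s=scatter_size_for_pixels(10, fig.dpi), marker="s").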
If you are interested in a scatter in data units, check this answer.
You can use markersize to specify the size of the circle in plot method
import numpy as np
import matplotlib.pyplot as plt
x1 = np.random.randn(20)
x2 = np.random.randn(20)
plt.figure(1)
# you can specify the marker size two ways directly:
plt.plot(x1, 'bo', markersize=20) # blue circle with size 20
plt.plot(x2, 'ro', ms=10,) # ms is just an alias for markersize
plt.show()
From here
It is the area of the marker. I mean if you have s1 = 1000 and then s2 = 4000, the relation between the radius of each circle is: r_s2 = 2 * r_s1. See the following plot:
plt.scatter(2, 1, s=4000, c='r')
plt.scatter(2, 1, s=1000 ,c='b')
plt.scatter(2, 1, s=10, c='g')
I had the same doubt when I saw the post, so I did this example then I used a ruler on the screen to measure the radii.
I also attempted to use 'scatter' initially for this purpose. After quite a bit of wasted time - I settled on the following solution.
import matplotlib.pyplot as plt
input_list = [{'x':100,'y':200,'radius':50, 'color':(0.1,0.2,0.3)}]
output_list = []
for point in input_list:
    # the radius of a Circle patch is given in data units, unlike scatter's s
    output_list.append(plt.Circle((point['x'], point['y']), point['radius'], color=point['color'], fill=False))
ax = plt.gca()
ax.cla()
# recent Matplotlib versions no longer accept gca(aspect='equal')
ax.set_aspect('equal')
ax.set_xlim((0, 1000))
ax.set_ylim((0, 1000))
for circle in output_list:
    ax.add_artist(circle)
This is based on an answer to this question
If the size of the circles corresponds to the square of the parameter in s=parameter, then take the square root of each element before you append it to your size array, like this: s=[1, 1.414, 1.73, 2.0, 2.24]. The relative size increase is then the square root of the squared progression, which gives a linear progression.
Squaring each of these values gives back the original sequence: output=[1, 2, 3, 4, 5]. Try a list comprehension: s=[numpy.sqrt(i) for i in s]
I'm using matplotlib to plot a probability distribution which looks something like the sum of a Pareto distribution and a Gaussian with positive mean. In other words it has very large values near 0, a small local minimum at x = a, a local maximum at x = b > a, and a long right tail decaying to 0.
I'd like to set the y limits based only on those values to the right of the local minimum, i.e. cut off the left-most values so as to focus on the local maximum. I know I can do this with:
plt.plot(pdf)
plt.ylim((0, local_maximum))
However, this sets ymax to exactly the value of the local maximum, which makes the plot look ugly for two reasons:
the local maximum touches the top boundary of the plot, with no space above
the top of the y axis is not a round multiple of the tick spacing, so it's not clear what the maximum value is
Matplotlib's algorithm for choosing a default axis is pretty good, so my current hack is to plot twice: the first time I plot only the data above the local minimum for the purpose of choosing a good ylim, and the second time I plot all the data, as follows:
fig, ax = plt.subplots()
# first plot the data above the local minimum x=a, just to get a good ymax
plt.plot(pdf[a:])
ymin, ymax = plt.ylim()
# now plot all the data using the nice ymax
fig.clear()
plt.ylim((0, ymax))
plt.plot(pdf)
This gives me a good ymax that is a round multiple of the tick spacing and fits all the data up to the local maximum, with a bit of whitespace.
Is there a better way, that doesn't require plotting twice?
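One possible way to avoid the throwaway plot, sketched under assumptions: ask a tick locator from matplotlib.ticker for round tick positions covering the range of interest and use the last tick as ymax. The pdf array and the index a below are made up for illustration, and the 5% padding is an arbitrary choice that guarantees some space above the peak.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Hypothetical pdf: a power-law spike near 0 plus a Gaussian bump
x = np.linspace(0.01, 10, 1000)
pdf = 0.02 / x**1.5 + np.exp(-0.5 * (x - 5) ** 2) / np.sqrt(2 * np.pi)
a = 100  # assumed index of the local minimum

local_maximum = pdf[a:].max()

# Round tick positions covering [0, local_maximum plus 5% padding];
# the last tick lies at or above the padded maximum
locator = MaxNLocator(nbins='auto', steps=[1, 2, 2.5, 5, 10])
ticks = locator.tick_values(0, 1.05 * local_maximum)
ymax = ticks[-1]

fig, ax = plt.subplots()
ax.plot(pdf)
ax.set_ylim(0, ymax)
```

The locator is used here only as a calculator for round numbers, so nothing is drawn twice. The ticks Matplotlib finally draws may use a different spacing than the one computed here. Call plt.show() to display the figure.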
I have two sets of data, one containing around 11 million data points and the other around 5000. I would like to plot them both on one histogram, but because of the difference in size I need to normalise the frequency so I can plot them on the same figure. Below I have simulated what I have done with my data to be able to plot them. I have used normed=True (density=True in newer Matplotlib versions).
from numpy.random import randn
import matplotlib.pyplot as plt
import random
datalist1=[]
for x in range(1,50000):
    datalist1.append(random.uniform(1,2))
datalist2=randn(5000000)
fig= plt.figure(1)
# density=True replaces normed=True, which newer Matplotlib versions no longer accept
plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', density=True)
plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',density=True)
plt.xlabel("Value")
plt.ylabel("Normalised Frequency")
plt.legend()
plt.show()
Can you please tell me if this is a good way to get around this issue? I would like the tallest bar of each of the two histograms to have a height of 1 (or 100%).
The normed=True setting (density=True in newer Matplotlib versions) normalizes the histogram to an area of 1. That lets the histogram be interpreted as an estimate of a probability density function.
In short, it actually makes more sense to normalize on the area than on the peak.
But if you really want to normalize by height you can modify the polygon data of the histogram:
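# density=True replaces normed=True, which newer Matplotlib versions no longer accept
h = plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', density=True)
p = h[2][0]  # the polygon drawn for this histogram
p.xy[:,1] /= p.xy[:, 1].max()  # rescale its y coordinates so the peak is 1
h = plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',density=True)
p = h[2][0]
p.xy[:,1] /= p.xy[:, 1].max()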
This solution feels a bit hackish, but at least it's quick and dirty :)
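A less hackish alternative, as a sketch rather than a definitive recipe: compute the counts with np.histogram, divide by the largest count, and draw the result with bar. The sample sizes below are smaller than in the question to keep the example quick, and the axis label is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical samples, smaller than the ones in the question
datalist1 = rng.uniform(1, 2, size=5_000)
datalist2 = rng.standard_normal(500_000)

fig, ax = plt.subplots()
peaks = []
for data, color, label in [(datalist1, 'b', 'theoretical'),
                           (datalist2, 'g', 'experimental')]:
    counts, edges = np.histogram(data, bins=20)
    heights = counts / counts.max()        # tallest bin becomes exactly 1
    peaks.append(heights.max())
    # draw one bar per bin, starting at its left edge
    ax.bar(edges[:-1], heights, width=np.diff(edges), align='edge',
           color=color, alpha=0.4, label=label)

ax.set_xlabel("Value")
ax.set_ylabel("Frequency relative to peak")
ax.legend()
```

Because the heights are computed before plotting, the axis limits are autoscaled to the normalised values. Call plt.show() to display the figure.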