Radius of matplotlib scatter plot [duplicate] - python

In the pyplot document for scatter plot:
matplotlib.pyplot.scatter(x, y, s=20, c='b', marker='o', cmap=None, norm=None,
vmin=None, vmax=None, alpha=None, linewidths=None,
faceted=True, verts=None, hold=None, **kwargs)
The marker size
s:
size in points^2. It is a scalar or an array of the same length as x and y.
What kind of unit is points^2? What does it mean? Does s=100 mean 10 pixels x 10 pixels?
Basically I'm trying to make scatter plots with different marker sizes, and I want to figure out what the s number means.

This can be a somewhat confusing way of defining the size but you are basically specifying the area of the marker. This means, to double the width (or height) of the marker you need to increase s by a factor of 4. [because A = WH => (2W)(2H)=4A]
There is a reason, however, that the size of markers is defined in this way. Because of the scaling of area as the square of width, doubling the width actually appears to increase the size by more than a factor 2 (in fact it increases it by a factor of 4). To see this consider the following two examples and the output they produce.
import matplotlib.pyplot as plt
# doubling the width of markers
x = [0,2,4,6,8,10]
y = [0]*len(x)
s = [20*4**n for n in range(len(x))]
plt.scatter(x,y,s=s)
plt.show()
gives
Notice how the size increases very quickly. If instead we have
# doubling the area of markers
x = [0,2,4,6,8,10]
y = [0]*len(x)
s = [20*2**n for n in range(len(x))]
plt.scatter(x,y,s=s)
plt.show()
gives
Now the apparent size of the markers increases roughly linearly in an intuitive fashion.
As for the exact meaning of what a 'point' is, it is fairly arbitrary for plotting purposes, you can just scale all of your sizes by a constant until they look reasonable.
Edit: (In response to comment from @Emma)
It's probably confusing wording on my part. The question asked about doubling the width of a circle, so in the first picture each circle (as we move from left to right) has double the width of the previous one; for the area this is an exponential with base 4. Similarly, in the second example each circle has double the area of the previous one, which gives an exponential with base 2.
However, it is in the second example (where we are scaling area) that each circle appears twice as big to the eye as the previous one. Thus, if we want a circle to appear a factor of n bigger, we should increase the area by a factor of n, not the radius, so that the apparent size scales linearly with the area.
Edit to visualize the comment by @TomaszGandor:
This is what it looks like for different functions of the marker size:
x = [0,2,4,6,8,10,12,14,16,18]
s_exp = [20*2**n for n in range(len(x))]
s_square = [20*n**2 for n in range(len(x))]
s_linear = [20*n for n in range(len(x))]
plt.scatter(x,[1]*len(x),s=s_exp, label='$s=2^n$', lw=1)
plt.scatter(x,[0]*len(x),s=s_square, label='$s=n^2$')
plt.scatter(x,[-1]*len(x),s=s_linear, label='$s=n$')
plt.ylim(-1.5,1.5)
plt.legend(loc='center left', bbox_to_anchor=(1.1, 0.5), labelspacing=3)
plt.show()

Because other answers here claim that s denotes the area of the marker, I'm adding this answer to clarify that this is not necessarily the case.
Size in points^2
The argument s in plt.scatter denotes the markersize**2. As the documentation says
s : scalar or array_like, shape (n, ), optional
size in points^2. Default is rcParams['lines.markersize'] ** 2.
This can be taken literally. In order to obtain a marker which is x points large, you need to square that number and give it to the s argument.
So the relationship between the markersize of a line plot and the scatter size argument is the square. In order to produce a scatter marker of the same size as a plot marker of size 10 points you would hence call scatter( .., s=100).
import matplotlib.pyplot as plt
fig,ax = plt.subplots()
ax.plot([0],[0], marker="o", markersize=10)
ax.plot([0.07,0.93],[0,0], linewidth=10)
ax.scatter([1],[0], s=100)
ax.plot([0],[1], marker="o", markersize=22)
ax.plot([0.14,0.86],[1,1], linewidth=22)
ax.scatter([1],[1], s=22**2)
plt.show()
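As a minimal sketch of the relationship above, converting between a marker width in points and the s argument is just squaring or taking a square root. The helper names are hypothetical and not part of matplotlib:

```python
import math

def s_from_markersize(markersize_pt):
    """Scatter `s` that matches plot(..., markersize=markersize_pt)."""
    return markersize_pt ** 2

def markersize_from_s(s):
    """Marker width in points for a given scatter `s`."""
    return math.sqrt(s)

print(s_from_markersize(10))   # 100
print(markersize_from_s(484))  # 22.0
```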
Connection to "area"
So why do other answers and even the documentation speak about "area" when it comes to the s parameter?
Of course the units of points**2 are area units.
For the special case of a square marker, marker="s", the area of the marker is indeed directly the value of the s parameter.
For a circle, the area of the circle is area = pi/4*s.
For other markers there may not even be any obvious relation to the area of the marker.
In all cases however the area of the marker is proportional to the s parameter. This is the motivation to call it "area" even though in most cases it isn't really.
Specifying the size of the scatter markers in terms of a quantity proportional to the marker's area makes sense insofar as it is the area of the marker that is perceived when comparing different patches, rather than its side length or diameter. I.e. doubling the underlying quantity should double the area of the marker.
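The two area relations stated above can be checked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, it covers just the square and circle markers discussed here, and it ignores the marker edge width:

```python
import math

def marker_area(s, marker="o"):
    """Filled area in points^2 of a scatter marker with size `s`.

    Only the two cases from the text are handled:
    "s" (square) -> area = s, "o" (circle) -> area = pi/4 * s.
    """
    if marker == "s":
        return s
    if marker == "o":
        return math.pi / 4 * s
    raise ValueError("no simple area relation assumed for this marker")

# doubling s doubles the area, whatever the marker shape
print(marker_area(200, "o") / marker_area(100, "o"))
print(marker_area(200, "s") / marker_area(100, "s"))
```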
What are points?
So far the answer to what the size of a scatter marker means is given in units of points. Points are often used in typography, where fonts are specified in points. Also linewidths are often specified in points. The standard size of points in matplotlib is 72 points per inch (ppi) - 1 point is hence 1/72 inches.
It might be useful to be able to specify sizes in pixels instead of points. If the figure dpi is 72 as well, one point is one pixel. If the figure dpi is different (matplotlib default is fig.dpi=100),
1 point == fig.dpi/72. pixels
While the scatter marker's size in points would hence look different for different figure dpi, one could produce a 10 by 10 pixels^2 marker, which would always have the same number of pixels covered:
import matplotlib.pyplot as plt
for dpi in [72,100,144]:
    fig,ax = plt.subplots(figsize=(1.5,2), dpi=dpi)
    ax.set_title("fig.dpi={}".format(dpi))
    ax.set_ylim(-3,3)
    ax.set_xlim(-2,2)
    ax.scatter([0],[1], s=10**2,
               marker="s", linewidth=0, label="100 points^2")
    ax.scatter([1],[1], s=(10*72./fig.dpi)**2,
               marker="s", linewidth=0, label="100 pixels^2")
    ax.legend(loc=8,framealpha=1, fontsize=8)
    fig.savefig("fig{}.png".format(dpi), bbox_inches="tight")
plt.show()
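The point/pixel conversion used in this example can be pulled out into two small helpers. This is a sketch with hypothetical function names; it only restates 1 point == dpi/72 pixels:

```python
def points_to_pixels(points, dpi):
    """Length in pixels of `points` typographic points at the given dpi."""
    return points * dpi / 72.0

def s_for_pixel_width(pixels, dpi):
    """Scatter `s` (in points^2) for a marker that is `pixels` wide."""
    return (pixels * 72.0 / dpi) ** 2

print(points_to_pixels(72, 100))   # 100.0
print(s_for_pixel_width(10, 72))   # 100.0
print(s_for_pixel_width(10, 144))  # 25.0
```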
If you are interested in a scatter in data units, check this answer.

You can use markersize to specify the size of the circle in the plot method
import numpy as np
import matplotlib.pyplot as plt
x1 = np.random.randn(20)
x2 = np.random.randn(20)
plt.figure(1)
# you can specify the marker size two ways directly:
plt.plot(x1, 'bo', markersize=20) # blue circle with size 20
plt.plot(x2, 'ro', ms=10) # ms is just an alias for markersize
plt.show()
From here

It is the area of the marker. I mean if you have s1 = 1000 and then s2 = 4000, the relation between the radius of each circle is: r_s2 = 2 * r_s1. See the following plot:
import matplotlib.pyplot as plt
plt.scatter(2, 1, s=4000, c='r')
plt.scatter(2, 1, s=1000, c='b')
plt.scatter(2, 1, s=10, c='g')
plt.show()
I had the same doubt when I saw the post, so I did this example then I used a ruler on the screen to measure the radii.
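The ruler measurement can also be checked with arithmetic: if s is proportional to area, the radius scales with the square root of s. A small sketch (the function name is hypothetical):

```python
import math

def radius_ratio(s_small, s_large):
    """Ratio of marker radii for two scatter sizes, assuming s ~ area."""
    return math.sqrt(s_large / s_small)

print(radius_ratio(1000, 4000))  # 2.0
print(radius_ratio(10, 1000))    # 10.0
```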

I also attempted to use 'scatter' initially for this purpose. After quite a bit of wasted time - I settled on the following solution.
import matplotlib.pyplot as plt
input_list = [{'x':100,'y':200,'radius':50, 'color':(0.1,0.2,0.3)}]
output_list = []
for point in input_list:
    output_list.append(plt.Circle((point['x'], point['y']), point['radius'], color=point['color'], fill=False))
ax = plt.gca()  # recent matplotlib versions no longer accept keyword arguments in gca()
ax.cla()
ax.set_aspect('equal')
ax.set_xlim((0, 1000))
ax.set_ylim((0, 1000))
for circle in output_list:
    ax.add_artist(circle)
plt.show()
This is based on an answer to this question

If the size of the circles corresponds to the square of the parameter in s=parameter, then take the square root of each element you append to your size array, like this: s=[1, 1.414, 1.73, 2.0, 2.24], so that the relative size increase is the square root of the squared progression, which gives a linear progression.
Squaring each of these values as it is output to the plot gives back output=[1, 2, 3, 4, 5]. Try a list comprehension: s=[numpy.sqrt(i) for i in s]

Related

How to helpfully plot time series data in python

I have a 2-dimensional set of time series data containing about 1000 samples, i.e. I have a long list of 1000 elements where each entry is a list of two numbers.
It can be thought of as the position, in x and y coordinates, of a car, with a picture taken each second for 1000 seconds. When I plot this, as seen below, you get a decent idea of the trajectory, but it's unclear where the car starts or finishes, i.e. which direction it is traveling in. I was thinking about including arrows between each point, but I think this would get quite cluttered (maybe you know a way to overcome that issue?). I also thought of colouring each point with a spectrum that makes the passage of time clear, i.e. hotter points to colder points as time goes on. Any idea how to achieve this in matplotlib?
I believe both your ideas would work well, I just think you need to test which option works best for your case.
Option 1: arrows
To avoid a cluttered plot I believe you could plot arrows between only a selection of points to show the general direction of your trajectory. In my example below I only plot an arrow between points 1 and 2, 6 and 7, and so on. You might want to increase the spacing between the points to make this work for your long series. It is also possible to connect points that are separated by, say, 10 points to make the arrows more clearly visible.
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.linspace(0, 10, 100)
y = x
plt.figure()
# plot the data points
for i in range(len(x)):
    plt.plot(x[i], y[i], "ro")
# plot arrows between points 1 and 2, 6 and 7 and so on.
for i in range(0, len(x)-1, 5):
    plt.arrow(x[i], y[i], x[i+1] - x[i], y[i+1] - y[i], color = "black",zorder = 2, width = 0.05)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
This yields this plot.
Option 2: colors
You can generate any number of colors from a colormap, meaning you can make a list of 1000 sequential colors. This way you can plot each of your points in an increasingly warm color.
Example:
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.linspace(0, 10, 100)
y = x
# generate 100 (number of data points) colors from colormap
colors = [plt.get_cmap("coolwarm")(i) for i in np.linspace(0,1, len(x))]
plt.figure()
# plot the data points with the generated colors
for i in range(len(x)):
    plt.plot(x[i], y[i], color = colors[i], marker = "o")
plt.xlabel('x')
plt.ylabel('y')
plt.show()
This yields this figure, where the oldest data point is cool (blue) and the newest is red (warm).
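The same colouring can probably be obtained with a single scatter call by passing the time values to c= and naming a colormap, which also makes a colorbar available as a time legend. A minimal sketch, assuming the sample index stands in for time and using a made-up spiral trajectory:

```python
import numpy as np
import matplotlib.pyplot as plt

# example trajectory; time is simply the sample index (an assumption)
t = np.arange(100)
x = np.cos(t / 15.0) * t
y = np.sin(t / 15.0) * t

fig, ax = plt.subplots()
ax.plot(x, y, color="lightgray", zorder=1)  # faint line joining the samples
sc = ax.scatter(x, y, c=t, cmap="coolwarm", s=20, zorder=2)  # colour encodes time
cbar = fig.colorbar(sc, ax=ax)
cbar.set_label("sample index (time)")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Call plt.show() or fig.savefig(...) afterwards to display or store the figure. Compared with the loop, this creates one artist instead of one per point, which tends to matter for series with thousands of samples.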

How do I correctly implement contours of histograms with logscale binning in numpy/matplotlib

I am trying to plot contours of data that has been binned using numpy.histogram2d, except the bins are set using numpy.logspace (equal binning in log space).
Unfortunately, this results in a strange behavior that I can't seem to resolve: the placement of the contours does not match the location of the points in x/y. I plot both the 2d histogram of the data, and the contours, and they do not overlap.
It looks like what is actually happening is the contours are being placed on the physical location of the plot in linear space where I expect them to be placed in log space.
It's a strange phenomenon that I think can be best described by the following plots, using identical data but binned in different ways:
Here is a minimum working example to produce the logbinned data:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(loc=500, scale=100,size=10000)
y = np.random.normal(loc=600, scale=60, size=10000)
nbins = 50
bins = (np.logspace(np.log10(10),np.log10(1000),nbins),np.logspace(np.log10(10),np.log10(1000),nbins))
HH, xe, ye = np.histogram2d(x,y,bins=bins)
plt.hist2d(x,y,bins=bins,cmin=1);
grid = HH.transpose()
extent = np.array([xe.min(), xe.max(), ye.min(), ye.max()])
cs = plt.contourf(grid,2,extent=extent,extend='max',cmap='plasma',alpha=0.5,zorder=100)
plt.contour(grid,2,extent=extent,colors='k',zorder=100)
plt.yscale('log')
plt.xscale('log')
It's fairly clear what is happening -- the contour is getting misplaced due to the scaling of the bins. I'd like to be able to plot the histogram and the contour here together.
If anyone has an idea of how to resolve this, that would be very helpful - thanks!
This is your problem:
cs = plt.contourf(grid,2,extent=extent,...)
You are passing in a single 2d array specifying the values of the histograms, but you aren't passing the x and y coordinates these data correspond to. By only passing in extent there's no way for pyplot to do anything other than assume that the underlying grid is uniform, stretched out to fit extent.
So instead what you have to do is to define x and y components for each value in grid. You have to think a bit how to do this, because you have (n, n)-shaped data and (n+1,)-shaped edges to go with it. We should probably choose the center of each bin to associate a data point with. So we need to find the midpoint of each bin, and pass those arrays to contour[f].
Something like this:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()
size = 10000
x = rng.normal(loc=500, scale=100, size=size)
y = rng.normal(loc=600, scale=60, size=size)
nbins = 50
bins = (np.geomspace(10, 1000, nbins),) * 2
HH, xe, ye = np.histogram2d(x, y, bins=bins)
fig, ax = plt.subplots()
ax.hist2d(x, y, bins=bins, cmin=1)
grid = HH.transpose()
# compute bin midpoints
midpoints = (xe[1:] + xe[:-1])/2, (ye[1:] + ye[:-1])/2
cs = ax.contourf(*midpoints, grid, levels=2, extend='max', cmap='plasma', alpha=0.5, zorder=100)
ax.contour(*midpoints, grid, levels=2, colors='k', zorder=100)
# these are a red herring during debugging:
#ax.set_yscale('log')
#ax.set_xscale('log')
(I've cleaned up your code a bit.)
Alternatively, if you want to avoid having those white strips at the top and edge, you can keep your bin edges, and pad your grid with zeros:
grid_padded = np.pad(grid, [(0, 1)])
cs = ax.contourf(xe, ye, grid_padded, levels=2, extend='max', cmap='plasma', alpha=0.5, zorder=100)
ax.contour(xe, ye, grid_padded, levels=2, colors='k', zorder=100)
This gives us something like
This seems prettier, but if you think about your data this is less exact, because your data points are shifted with respect to the bin coordinates they correspond to. If you look closely you can see the contours being shifted with respect to the output of hist2d. You could fix this by generating geomspaces with one more final value which you only use for this final plotting step, and again use the midpoints of these edges (complete with a last auxiliary one).
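One variation that is not part of the answer above: because the bin edges are spaced evenly in log space, the geometric mean of neighbouring edges may be a more natural bin centre than the arithmetic midpoint once the axes are switched to log scale. A small sketch comparing the two, using a handful of edges for readability:

```python
import numpy as np

edges = np.geomspace(10, 1000, 5)          # log-spaced bin edges
arith_mid = (edges[1:] + edges[:-1]) / 2   # arithmetic midpoints, as used above
geo_mid = np.sqrt(edges[1:] * edges[:-1])  # geometric centres (alternative)

# in log10 space the geometric centres lie exactly halfway between the edges
log_edges = np.log10(edges)
log_centers = (log_edges[1:] + log_edges[:-1]) / 2
print(np.allclose(np.log10(geo_mid), log_centers))  # True
```

Either array of centres can be passed to contour or contourf in place of the midpoints; which one looks better is a judgment call that depends on whether the plot is viewed on linear or log axes.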

How to plot a circle for each point scatter plot while each has particular radius size

I have a pandas DataFrame holding a distance matrix, and I use PCA for dimensionality reduction. The DataFrame has a label and a size for each point.
How can I make each scattered point become a circle with a size dependent on the size from the DataFrame?
````
pca = PCA(n_components=2)
pca.fit(dist)
mds5 = pca.components_
fig = go.Figure()
fig.add_scatter(x = mds5[0],
                y = mds5[1],
                mode = 'markers+text',
                marker= dict(size = 8,
                             color= 'blue'
                             ),
                text= dist.columns.values,
                textposition='top right')
````
I need the scatter plot to look something like this example. However, when I add the size for each point as in related answers, I can't get the circles to overlap, and when they do overlap and I zoom in, they no longer overlap.
It sounds strange, but I need logic such that, if two circles overlap, the one with the smaller radius disappears. So:
- how do I keep the circle size the same, regardless of the zoom?
- how do I write logic in Python to remove the smaller overlapping circle?
I'm still not sure which PCA parameter you want to be reflected in the circle size, but you want to either
- use a scatter plot (i.e. ax.scatter()) whose s= reflects your chosen PCA parameter; this size will not (and should not) rescale when you rescale the figure, and it is not given in (x,y)-units
- use multiple plt.Circle((x,y), radius=radius, **kwargs) patches, whose radii are given in (x,y)-units; the point overlap is then consistent on rescale, but this will likely cause deformed points
The following animation will emphasise the issue at hand:
I suppose you want the plt.Circle-based solution, as it keeps the distance static, and then you need to "manually" calculate beforehand whether two points overlap and delete them "manually". You should be able to do this automatically via a comparison between point size (i.e. radius, your PCA parameter) and the Euclidean distance between your data points (i.e. np.sqrt(dx**2 + dy**2)).
To use Circles, you could e.g. define a shorthand function:
def my_circle_scatter(ax, x_array, y_array, radius=0.5, **kwargs):
    for x, y in zip(x_array, y_array):
        circle = plt.Circle((x,y), radius=radius, **kwargs)
        ax.add_patch(circle)
    return True
and then call it with optional parameters (i.e. the x- and y-coordinates, colors, and so on):
my_circle_scatter(ax, xs, ys, radius=0.2, alpha=.5, color='b')
Where I've used fig,ax=plt.subplots() to create the figure and subplot individually.
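The overlap check described above could be sketched as follows. The function name and the exact rule are assumptions: circles are visited from largest to smallest radius, and a circle is dropped if it overlaps any circle that has already been kept. Ties between equal radii are resolved arbitrarily.

```python
import numpy as np

def drop_smaller_overlapping(xs, ys, radii):
    """Return the indices of the circles to keep, largest radius first.

    Two circles overlap when the Euclidean distance between their
    centres is smaller than the sum of their radii.
    """
    xs, ys, radii = (np.asarray(a, dtype=float) for a in (xs, ys, radii))
    order = np.argsort(radii)[::-1]  # indices from largest to smallest radius
    kept = []
    for i in order:
        overlaps = any(
            np.hypot(xs[i] - xs[j], ys[i] - ys[j]) < radii[i] + radii[j]
            for j in kept
        )
        if not overlaps:
            kept.append(int(i))
    return kept

# circle 1 (radius 0.5) overlaps the larger circle 0 and is removed
print(drop_smaller_overlapping([0, 1, 5], [0, 0, 0], [1.0, 0.5, 2.0]))  # [2, 0]
```

The returned indices can then be used to select the coordinates and radii to draw as plt.Circle patches. The pairwise loop is quadratic in the number of points, which is fine for a few hundred circles but would need a spatial index for much larger sets.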

Python plt.contour colorbar

I am trying to do a plot of a seismic wave using plt.contour.
I have 3 arrays:
time (x-axis)
frequency (y-axis)
amplitude (z-axis)
These are my results so far:
The problem is that I want to change the scaling of the colorbar: I want a smooth gradation, without this white color where the amplitude is low. But I have not been able to do so, even though I spent a lot of time browsing the docs.
I read that plt.pcolormesh is not appropriate here (it only works here because I am in a special case), but this is what I want to get with regard to the colours and colorbar:
This is the code I wrote:
T = len(time[0])*(time[0][1] - time[0][0]) # multiply ampFFT with T to offset
Z = abs(ampFFT)*(T) # abbreviation
# freq = frequency, ampFFT = Fast Fourier Transform of the amplitude of the wave
# freq, amFFT and time have same dimensions: 40 x 1418 (40 steps of time discretization x steps to have the total time. 2D because it is easier to use)
maxFreq = abs(freq).max() # maximum frequency for plot boundaries
maxAmpFFT = abs(Z).max()/2 # maximum ampFFT for the colorbar boundaries, divided by 2 to scale better with the colors
minAmpFFT = abs(Z).min()
plt.figure(1)
plt.contour(time, freq, Z, vmin=minAmpFFT, vmax=maxAmpFFT)
plt.colorbar()
plt.ylim(0,maxFreq) # 0 to remove the negative frequencies useless here
plt.title("Amplitude intensity regarding to time and frequency")
plt.xlabel('time (in secondes)')
plt.ylabel('frequency (in Hz)')
plt.show()
Thank you for your attention!
NB : In case you were wondering about plt.pcolormesh: the plot is completely messed up when I choose to increase the time discretization (here I split the time in 40, but when I split the time in 1000 the plot is not correct, and I want to be able to split the time in smaller pieces).
EDIT: When I use plt.contourf instead of plt.contour I got this plot:
Which is not really convincing either. I understand why the yellow colour takes so much space (it is because I set a low vmax), but I don't understand why there is still white colour in my plot.
EDIT 2: My teacher plotted my data, and I have the correct data. The only problem that is left is the white background in my plot (and the deep blue on the left and right borders for no apparent reason when I use plt.contourf). Despite those problems, the highest amplitude is located around 0.5 Hz, which is in agreement with the work of my teacher.
He used gnuplot, but since I don't know gnuplot, I prefer to use python.
Solution/Workaround I found
Here is what I did to display my data the way contourf does, but without the display problems:
Explanation: for the surface, I took abs(freq) instead of just freq because I have negative frequencies.
This is because, when you calculate the frequencies of an FFT, the frequency axis repeats itself a second time, like this:
There are 2 ways of obtaining this frequency array:
- the frequencies are all positive, and the array spans 2 x the Nyquist frequency (so if you keep only half of the array, you have all of your wave, and it doesn't repeat itself).
- the frequencies start negative and go to positive, and the array also spans 2 x the Nyquist frequency (so if you remove the negative values you have all of your wave, and it doesn't repeat itself).
Python's fft.fftfreq uses the 2nd option. plot_surface doesn't work well with removing data from an array (for me it was still displayed). So I made the frequency values absolute and the problem disappeared.
fig = plt.figure(1, figsize=(18,15)) # figsize: increase plot size
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') is no longer accepted by recent matplotlib
surf = ax.plot_surface(time, abs(freq), Z, rstride=1, cstride=1, cmap=cm.magma, linewidth=0, antialiased=False, vmin=minAmpFFT, vmax=maxAmpFFT)
ax.set_zlim(0, maxAmpFFT)
ax.set_ylim(0, maxFreq)
ax.view_init(azim=90, elev=90) # change view to top view, with axis in the right direction
plt.title("Amplitude intensity (m/Hz^0.5) regarding to time and frequency")
plt.xlabel('x : time (in secondes)')
plt.ylabel('y : frequency (in Hz)')
# ax.yaxis._set_scale('log') # should be in log, but does not work
plt.gca().invert_xaxis() # invert x axis !! MUST BE AFTER X,Y,Z LIM
plt.gca().invert_yaxis() # invert y axis !! MUST BE AFTER X,Y,Z LIM
plt.colorbar(surf)
fig.tight_layout()
plt.show()
This is the plot I got:
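One thing that may be worth checking for the white regions mentioned in the question: contourf leaves areas unfilled when the data fall outside the range of the contour levels, unless extend is set. The sketch below is not the author's method. It uses synthetic stand-in arrays (hypothetical, not the seismic data) and passes explicit levels together with extend="both" so that every value receives a color:

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic stand-in for the time / frequency / amplitude arrays (hypothetical)
time, freq = np.meshgrid(np.linspace(0, 10, 60), np.linspace(0, 5, 40))
Z = np.exp(-((freq - 0.5) ** 2) / 0.1) * (1 + np.sin(time))

# explicit, evenly spaced levels up to half the maximum, as in the question
levels = np.linspace(Z.min(), Z.max() / 2, 21)

fig, ax = plt.subplots()
# extend="both" also colors values below the first and above the last level
cs = ax.contourf(time, freq, Z, levels=levels, cmap="magma", extend="both")
fig.colorbar(cs, ax=ax, label="amplitude")
ax.set_xlabel("time (s)")
ax.set_ylabel("frequency (Hz)")
```

Whether this removes the white areas in the original plot depends on their cause; NaN values in Z, for instance, would still be left blank.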

Scatterplot Contours In Matplotlib

I have a massive scatterplot (~100,000 points) that I'm generating in matplotlib. Each point has a location in this x/y space, and I'd like to generate contours containing certain percentiles of the total number of points.
Is there a function in matplotlib which will do this? I've looked into contour(), but I'd have to write my own function to work in this way.
Thanks!
Basically, you're wanting a density estimate of some sort. There are multiple ways to do this:
- Use a 2D histogram of some sort (e.g. matplotlib.pyplot.hist2d or matplotlib.pyplot.hexbin). (You could also display the results as contours--just use numpy.histogram2d and then contour the resulting array.)
- Make a kernel-density estimate (KDE) and contour the results. A KDE is essentially a smoothed histogram. Instead of a point falling into a particular bin, it adds a weight to surrounding bins (usually in the shape of a gaussian "bell curve").
Using a 2D histogram is simple and easy to understand, but fundamentally gives "blocky" results.
There are some wrinkles to doing the second one "correctly" (i.e. there's no one correct way). I won't go into the details here, but if you want to interpret the results statistically, you need to read up on it (particularly the bandwidth selection).
At any rate, here's an example of the differences. I'm going to plot each one similarly, so I won't use contours, but you could just as easily plot the 2D histogram or gaussian KDE using a contour plot:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
np.random.seed(1977)
# Generate 200 correlated x,y points
data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)
x, y = data.T
nbins = 20
fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, sharey=True)
axes[0, 0].set_title('Scatterplot')
axes[0, 0].plot(x, y, 'ko')
axes[0, 1].set_title('Hexbin plot')
axes[0, 1].hexbin(x, y, gridsize=nbins)
axes[1, 0].set_title('2D Histogram')
axes[1, 0].hist2d(x, y, bins=nbins)
# Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents
k = gaussian_kde(data.T)
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
axes[1, 1].set_title('Gaussian KDE')
axes[1, 1].pcolormesh(xi, yi, zi.reshape(xi.shape))
fig.tight_layout()
plt.show()
One caveat: With very large numbers of points, scipy.stats.gaussian_kde will become very slow. It's fairly easy to speed it up by making an approximation--just take the 2D histogram and blur it with a gaussian filter of the right radius and covariance. I can give an example if you'd like.
One other caveat: If you're doing this in a non-cartesian coordinate system, none of these methods apply! Getting density estimates on a spherical shell is a bit more complicated.
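The approximation mentioned in the first caveat could look roughly like this. It is a sketch, not the answerer's own example: the bin count and the smoothing width sigma_bins are arbitrary assumptions, and the isotropic filter used here ignores the covariance of the data.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1977)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 100_000)
x, y = data.T

# 2D histogram normalised to a density, then blurred with a gaussian kernel
H, xedges, yedges = np.histogram2d(x, y, bins=100, density=True)
sigma_bins = 2.0  # smoothing width in units of bins (arbitrary choice)
H_smooth = gaussian_filter(H, sigma=sigma_bins)

print(H_smooth.shape)
```

For plotting, note that np.histogram2d puts x along the first axis, so the array usually has to be transposed, e.g. pcolormesh(xedges, yedges, H_smooth.T). The cost is dominated by the histogram, so it stays cheap even for millions of points.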
I have the same question.
If you want to plot contours that contain some fraction of the points, you can use the following algorithm:
create a 2d histogram
h2, xedges, yedges = np.histogram2d(X, Y, bins = [30, 30])
h2 is now a 2d matrix whose entries are the number of points in each rectangle
hravel = np.sort(np.ravel(h2))[::-1] # counts of all rectangles, largest first
hcumsum = np.cumsum(hravel)
ugly hack:
for every entry of h2, store the cumulative fraction of points lying in rectangles whose count is equal to or greater than that of the current rectangle. The result goes into a separate array so that already-replaced values cannot be mistaken for counts.
hunique = np.unique(hravel)
hsum = np.sum(h2)
hfrac = np.zeros_like(h2)
for h in hunique:
    hfrac[h2 == h] = hcumsum[np.argwhere(hravel == h)[-1]]/hsum
now plot contours of hfrac; a contour at level p encloses roughly the fraction p of all points
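The same idea can be turned around to get contour levels for the histogram itself: sort the bin counts in descending order, accumulate them, and find the count at which the running total first reaches the requested fraction. A minimal sketch with a hypothetical function name:

```python
import numpy as np

def levels_enclosing(h, fractions):
    """Histogram levels whose contours enclose about the given fractions of points."""
    flat = np.sort(h.ravel())[::-1]      # bin counts, largest first
    cum = np.cumsum(flat) / flat.sum()   # fraction of points in the densest bins
    levels = []
    for f in fractions:
        idx = np.searchsorted(cum, f)    # first position where the fraction is reached
        levels.append(float(flat[min(idx, len(flat) - 1)]))
    return sorted(set(levels))           # contour expects increasing levels

h = np.array([[10.0, 5.0], [3.0, 2.0]])
print(levels_enclosing(h, [0.5, 0.75]))  # [5.0, 10.0]
```

The returned list can be passed as levels= to contour, applied to the transposed histogram together with bin centres. With coarse bins the enclosed fraction is only approximate, since whole bins are either inside or outside a level.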
