Seaborn distplot for data with high SD [duplicate] - python

In matplotlib, I can set the axis scaling using either pyplot.xscale() or Axes.set_xscale(). Both functions accept three different scales: 'linear' | 'log' | 'symlog'.
What is the difference between 'log' and 'symlog'? In a simple test I did, they both looked exactly the same.
I know the documentation says they accept different parameters, but I still don't understand the difference between them. Can someone please explain it? The answer will be the best if it has some sample code and graphics! (also: where does the name 'symlog' come from?)

I finally found some time to do some experiments in order to understand the difference between them. Here's what I discovered:
log only allows positive values, and lets you choose how to handle negative ones (mask or clip).
symlog means symmetrical log, and allows positive and negative values.
symlog lets you set a range around zero within which the plot will be linear instead of logarithmic.
I think everything will get a lot easier to understand with graphics and examples, so let's try them:
import numpy
from matplotlib import pyplot
# Enable interactive mode
pyplot.ion()
# Draw the grid lines
pyplot.grid(True)
# Numbers from -50 to 50, with 0.1 as step
xdomain = numpy.arange(-50, 50, 0.1)
# Plots a simple linear function 'f(x) = x'
pyplot.plot(xdomain, xdomain)
# Plots 'sin(x)'
pyplot.plot(xdomain, numpy.sin(xdomain))
# 'linear' is the default mode, so this next line is redundant:
pyplot.xscale('linear')
# How to treat negative values?
# 'mask' will treat negative values as invalid
# 'mask' is the default, so the next two lines are equivalent
pyplot.xscale('log')
pyplot.xscale('log', nonposx='mask')
# 'clip' will map all negative values to a very small positive one
pyplot.xscale('log', nonposx='clip')
# 'symlog' scaling, however, handles negative values nicely
pyplot.xscale('symlog')
# And you can even set a linear range around zero
pyplot.xscale('symlog', linthreshx=20)
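Note that the nonposx and linthreshx keyword names used above come from older matplotlib releases; in matplotlib 3.3 and later these per-axis keywords were renamed, so the equivalent calls would look roughly like this:
# Equivalents for matplotlib >= 3.3, where the keywords were renamed
pyplot.xscale('log', nonpositive='mask')
pyplot.xscale('log', nonpositive='clip')
pyplot.xscale('symlog', linthresh=20)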
Just for completeness, I've used the following code to save each figure:
# Default dpi is 80
pyplot.savefig('matplotlib_xscale_linear.png', dpi=50, bbox_inches='tight')
Remember you can change the figure size using:
fig = pyplot.gcf()
fig.set_size_inches([4., 3.])
# Default size: [8., 6.]
(If you are unsure about me answering my own question, read this)

symlog is like log but allows you to define a range of values near zero within which the plot is linear, to avoid having the plot go to infinity around zero.
From http://matplotlib.sourceforge.net/api/axes_api.html#matplotlib.axes.Axes.set_xscale
In a log graph, you can never have a zero value, and if you have a value that approaches zero, it will spike way off the bottom of your graph (infinitely downward), because "log(approaching zero)" gives "approaching negative infinity".
symlog helps in situations where you want a log graph, but the value may sometimes go down towards, or to, zero, and you still want to be able to show that on the graph in a meaningful way. If you need symlog, you'd know.
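As a minimal, self-contained illustration of that point (the data here is made up, and the linthresh keyword assumes matplotlib >= 3.3, where it replaced linthreshy), the following compares a 'log' and a 'symlog' y-axis on a series that actually reaches zero:
import numpy as np
import matplotlib.pyplot as plt

# Made-up data that decays towards, and finally reaches, zero
x = np.arange(50)
y = np.maximum(100 - 2.2 * x, 0)

fig, (ax_log, ax_symlog) = plt.subplots(1, 2, figsize=(8, 3))

ax_log.plot(x, y)
ax_log.set_yscale('log')                     # the zero samples cannot be shown here
ax_log.set_title("yscale='log'")

ax_symlog.plot(x, y)
ax_symlog.set_yscale('symlog', linthresh=1)  # linear within |y| < 1, so zero stays visible
ax_symlog.set_title("yscale='symlog'")

plt.tight_layout()
plt.show()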

Here's an example of behaviour when symlog is necessary:
Initial plot, not scaled. Notice how many dots cluster at x~0
ax = sns.scatterplot(x='Score', y='Total Amount Deposited', data=df, hue='Predicted Category')
Log scaled plot. Everything collapsed.
ax = sns.scatterplot(x='Score', y='Total Amount Deposited', data=df, hue='Predicted Category')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set(xlabel='Score, log', ylabel='Total Amount Deposited, log')
Why did it collapse? Because some of the values on the x-axis are very close or equal to 0.
Symlog scaled plot. Everything is as it should be.
ax = sns.scatterplot(x='Score', y='Total Amount Deposited', data=df, hue='Predicted Category')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set(xlabel='Score, symlog', ylabel='Total Amount Deposited, symlog')
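The DataFrame df from these snippets is not shown, so here is a self-contained sketch of the same idea using randomly generated stand-in data; the column names 'Score', 'Total Amount Deposited' and 'Predicted Category' are kept only to mirror the calls above:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Stand-in for the original DataFrame: many scores near 0, heavy-tailed amounts
df = pd.DataFrame({
    'Score': np.abs(rng.normal(0, 5, 500)),
    'Total Amount Deposited': rng.lognormal(8, 2, 500),
    'Predicted Category': rng.choice(['A', 'B', 'C'], 500),
})

ax = sns.scatterplot(x='Score', y='Total Amount Deposited', data=df, hue='Predicted Category')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set(xlabel='Score, symlog', ylabel='Total Amount Deposited, symlog')
plt.show()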

Related

Can matplotlib plot decreasing arrays?

I am processing some data collected in a driving simulator, and I needed to plot the velocity against the location. I managed to convert the velocity and location values into 2 numpy arrays. Due to the settings of the simulator, the location array is continuously decreasing. The sample array is [5712.114 5711.662 5711.209 ... 3185.806 3185.525 3185.243]. Similarly, the velocity array is also decreasing because we were testing the brake behavior. Example array: [27.134 27.134 27.134 ... 16.87 16.872 16.874].
So, when I plot these 2 arrays, what I should see should be a negatively sloped line, and both x and y axis should have decreasing numbers. I used the code below to plot them:
plotting_x = np.array(df["SubjectX"].iloc[start_index-2999:end_index+3000])
plotting_y = np.array(df["Velocity"].iloc[start_index-2999:end_index+3000])
plt.plot(plotting_x, plotting_y, "r")
What I saw is the graph attached here. Does anyone know what went wrong? Does Matplotlib not allow decreasing series? Thanks!
The problem is that by default matplotlib draws the x-axis increasing from left to right, so it maps the points following that rule. Try reversing it by doing:
ax = plt.gca()
ax.invert_xaxis()
After the plot call.
From what I understand, since both the position and the velocity are decreasing, there is nothing wrong with the plot: the first point is simply in the top-right corner and the last is in the bottom-left.
At first glance, I would also say that the position is always decreasing (the vehicle never jumps back) while the velocity has a more interesting behaviour.
You can check if this is the case plotting in two steps with two colours:
plotting_x = np.array(df["SubjectX"].iloc[start_index-2999:end_index])
plotting_y = np.array(df["Velocity"].iloc[start_index-2999:end_index])
plt.plot(plotting_x, plotting_y, "r", label="first")
and
plotting_x = np.array(df["SubjectX"].iloc[start_index:end_index+3000])
plotting_y = np.array(df["Velocity"].iloc[start_index:end_index+3000])
plt.plot(plotting_x, plotting_y, "b", label="second")
then:
plt.legend()
plt.show()
To get a more usual representation you can invert the axis, or use:
plotting_x = some_number - np.array(df["SubjectX"].iloc[start_index-2999:end_index+3000])
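For reference, here is a small self-contained sketch of the invert_xaxis() suggestion using made-up decreasing arrays (the real df from the question is not available):
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-ins for the decreasing location and velocity arrays
plotting_x = np.linspace(5712, 3185, 500)                       # location, decreasing
plotting_y = np.linspace(27.1, 16.9, 500) + np.random.normal(0, 0.05, 500)

plt.plot(plotting_x, plotting_y, "r")
plt.gca().invert_xaxis()   # show the larger (earlier) locations on the left
plt.xlabel("Location")
plt.ylabel("Velocity")
plt.show()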

Plot negative values on a log scale

I am doing some analysis to calculate the value of log_10(x), which is a negative number. I am now trying to plot these values; however, since the range of the answers is very large, I would like to use a logarithmic scale. If I simply use plt.yscale('log') I get a message telling me UserWarning: Data has no positive values, and therefore cannot be log-scaled. I also cannot supply the values of x to plt.plot, as the result of log_10(x) is so large and negative that the answer of x**(log_10(x)) is simply 0.
What might be the most straightforward way of plotting this data?
You can use
plt.yscale('symlog')
to set the scale to a symmetric log scale. This means that it will scale logarithmically on both sides of 0. Using only the negative part of the symlog scale works just fine.
Two alternatives to ImportanceOfBeingErnest's solution:
Plot -log_10(x) on a semilog y axis and set the y-label to display negative units
Plot -log_10(-log_10(x)) on a linear scale
However, in all cases (including the solution proposed by ImportanceOfBeingErnest), the interpretation is not straightforward since you are displaying or calculating the log of a log.
Finally, in order to return the value for x, you need to calculate 10**(log_10(x)) not x**(log_10(x))
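A sketch of both routes with made-up log_10(x) values (large negative numbers spanning several orders of magnitude): the symlog axis on the left, and the "plot -log_10(x) on a log axis" alternative on the right:
import numpy as np
import matplotlib.pyplot as plt

# Made-up log10(x) values, all negative and spanning many orders of magnitude
log10_x = -np.logspace(1, 6, 50)   # from -10 down to -1e6

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Option 1: plot the values directly on a symmetric log axis
ax1.plot(log10_x)
ax1.set_yscale('symlog')
ax1.set_ylabel('log10(x)')

# Option 2: plot the magnitude on an ordinary log axis and read the label as negative
ax2.semilogy(-log10_x)
ax2.set_ylabel('-log10(x)')

plt.tight_layout()
plt.show()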

Plotting high precision data as axis ticks or transforming it to different values?

I am trying to plot some data as a histogram in matplotlib, using high-precision values as x-axis ticks. The data is between 0 and 0.4, but most values are very close together, like:
0.05678, 0.05879, 0.125678, 0.129067
I used np.around() to round the tick values (and it produced ticks from 0 to 0.4, as it should), but it didn't work quite right for all of the data.
Here is an example of one histogram that looked somewhat right, and one that didn't: you can see there are points after 0.4, which is just not right.
Here is the code I used in Jupyter Notebook:
plt.hist(x=[advb_ratios, adj_ratios, verb_ratios], color=['r', 'y', 'b'], bins=10, label=['adverbs', 'adjectives', 'verbs'])
plt.xticks(np.around(ranks, 1))
plt.xlabel('Argument Rank')
plt.ylabel('Frequency')
plt.legend()
plt.show()
The code is the same for both histograms, only the x data I am plotting differs; all x values used are between 0 and 1.
So my questions are:
Is there a way to fix that and reflect my data as it is?
Is it better to give my rank values different labels that will separate them more from one another for example - 1,2,3,4 or I will lose the precision of my data and some useful info?
What is the general approach in such situations? Would a different graphic help? What?
I don't understand your problem: the fact that your data is between 0 and 0.4 should not influence the way it is displayed, and I don't see why you need to do anything else but call plt.hist().
In addition, you can pass an array to the bins argument to indicate which bin edges you want, so you could do something like this to force your bins to always have the same size:
import numpy as np
import matplotlib.pyplot as plt

# Fake data
x1 = np.random.normal(loc=0, scale=0.1, size=(1000,))
x2 = np.random.normal(loc=0.2, scale=0.1, size=(1000,))
x3 = np.random.normal(loc=0.4, scale=0.1, size=(1000,))
plt.hist([x1, x2, x3], bins=np.linspace(0, 0.4, 10))
plt.show()
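To also cover the plt.xticks(np.around(ranks, 1)) part of the question, one option is to put the ticks exactly on the bin edges, so nothing can appear past 0.4. This is only a sketch: advb_ratios, adj_ratios and verb_ratios are replaced by random stand-ins, since the real arrays are not shown:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Random stand-ins for the real ratio arrays (assumed to lie between 0 and 0.4)
advb_ratios = rng.uniform(0, 0.4, 200)
adj_ratios = rng.uniform(0, 0.4, 200)
verb_ratios = rng.uniform(0, 0.4, 200)

edges = np.linspace(0, 0.4, 9)                 # 8 equal-width bins between 0 and 0.4
plt.hist([advb_ratios, adj_ratios, verb_ratios], bins=edges,
         color=['r', 'y', 'b'], label=['adverbs', 'adjectives', 'verbs'])
plt.xticks(np.around(edges, 2))                # ticks exactly at the bin edges
plt.xlabel('Argument Rank')
plt.ylabel('Frequency')
plt.legend()
plt.show()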

matplotlib: get axis ratio of plot

I need to produce scatter plots for several 2D data sets automatically.
By default the aspect ratio is set with ax.set_aspect(aspect='equal'), which works most of the time because the x, y values are distributed more or less over a square region.
Sometimes though, I encounter a data set that, when plotted with the equal ratio, looks like this:
i.e., too narrow along one axis. For the above image, the axes are approximately 1:8.
In such a case, an aspect ratio of ax.set_aspect(aspect='auto') would result in a much better plot:
Now, I don't want to set aspect='auto' as my default for all data sets because using aspect='equal' is actually the correct way of displaying such a scatter plot.
I need to fall back to using ax.set_aspect(aspect='auto') only for cases such as the one above.
The question: is there a way to know beforehand if the aspect ratio of a plot will be too narrow if aspect='equal' is used? Something like getting the actual aspect ratio of the plotted data set.
This way, based on such a number, I can adjust the aspect ratio to something more sane looking (i.e.: auto or some other aspect ratio) instead of 'equal'.
Something like this ought to do,
aspect = (max(x) - min(x)) / (max(y) - min(y))
The axes method get_data_ratio gives the aspect ratio of the bounds of your data as displayed.¹
ax.get_data_ratio()
for example:
M = 4.0
ax.set_aspect('equal' if 1/M < ax.get_data_ratio() < M else 'auto')
¹This is the reciprocal of #farenorth's answer when the axes are zoomed right around the data, i.e., when max(y) == max(ax.get_ylim()) since it is calculated using the ranges in ax.get_ybound and ax.get_xbound.
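Putting the two suggestions together, here is a sketch that computes the data aspect ratio and falls back to 'auto' when it is too extreme; the data and the threshold M = 4 are made up for illustration:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up data set that is roughly eight times wider than it is tall
x = rng.normal(0, 8, 500)
y = rng.normal(0, 1, 500)

fig, ax = plt.subplots()
ax.scatter(x, y)

# Height-to-width ratio of the data: the same quantity ax.get_data_ratio()
# reports once the axis limits are fitted to the data
ratio = (y.max() - y.min()) / (x.max() - x.min())

M = 4.0   # arbitrary threshold for "too narrow"
ax.set_aspect('equal' if 1 / M < ratio < M else 'auto')
plt.show()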

