How to set y-scale when making a boxplot with dataframe

How to set y-scale when making a boxplot with dataframe - python

I have a column of data with a very large distribution and thus I log2-transform it before plotting and visualizing it. This works fine but I cannot seem to figure out how to set the y-scale to the exponential values of 2 (instead I have just the exponents themselves).
df['num_ratings_log2'] = df['num_ratings'] + 1
df['num_ratings_log2'] = np.log2(df['num_ratings_log2'])
df.boxplot(column = 'num_ratings_log2', figsize=(10,10))
As the scale, I would like to have 1 (2^0), 32 (2^5), 1024 (2^1) ... instead of 0, 5, 10 ...
I want everything else about the plot to stay the same. How can I achieve this?

Instead of taking the log of the data, you can create a normal boxplot and then set a log scale on the y-axis (ax.set_yscale('log'), or symlog to also represent zero). To get the ticks at powers of 2 (instead of powers of 10), use a LogLocator with base 2. A ScalarFormatter shows the values as regular numbers (instead of as powers such as 210). A NullLocator for the minor ticks suppresses undesired extra ticks.
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, LogLocator, NullLocator
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'num_ratings': (np.random.pareto(10, 10000) * 800).astype(int)})
ax = df.boxplot(column='num_ratings', figsize=(10, 10))
ax.set_yscale('symlog') # symlog also allows zero
# ax.yaxis.set_major_formatter(ScalarFormatter()) # show tick labels as regular numbers
ax.yaxis.set_major_formatter(lambda x, p: f'{int(x):,}')
ax.yaxis.set_minor_locator(NullLocator()) # remove minor ticks
plt.show()

Hope you are looking for below,
Code
ax = df.boxplot(column='num_ratings_log2', figsize=(20,10))
ymin = 0
ymax = 20
ax.set_ylim(2**ymin, 2**ymax)

Related

Manually draw log-spaced tick marks and labels in matplotlib

I frequently find myself working in log units for my plots, for example taking np.log10(x) of data before binning it or creating contour plots. The problem is, when I then want to make the plots presentable, the axes are in ugly log units, and the tick marks are evenly spaced.
If I let matplotlib do all the conversions, i.e. by setting ax.set_xaxis('log') then I get very nice looking axes, however I can't do that to my data since it is e.g. already binned in log units. I could manually change the tick labels, but that wouldn't make the tick spacing logarithmic. I suppose I could also go and manually specify the position of every minor tick such it had log spacing, but is that the only way to achieve this? That is a bit tedious so it would be nice if there is a better way.
For concreteness, here is a plot:
I want to have the tick labels as 10^x and 10^y (so '1' is '10', 2 is '100' etc.), and I want the minor ticks to be drawn as ax.set_xaxis('log') would draw them.
Edit: For further concreteness, suppose the plot is generated from an image, like this:
import matplotlib.pyplot as plt
import scipy.misc
img = scipy.misc.face()
x_range = [-5,3] # log10 units
y_range = [-55, -45] # log10 units
p = plt.imshow(img,extent=x_range+y_range)
plt.show()
and all we want to do is change the axes appearance as I have described.
Edit 2: Ok, ImportanceOfBeingErnest's answer is very clever but it is a bit more specific to images than I wanted. I have another example, of binned data this time. Perhaps their technique still works on this, though it is not clear to me if that is the case.
import numpy as np
import pandas as pd
import datashader as ds
from matplotlib import pyplot as plt
import scipy.stats as sps
v1 = sps.lognorm(loc=0, scale=3, s=0.8)
v2 = sps.lognorm(loc=0, scale=1, s=0.8)
x = np.log10(v1.rvs(100000))
y = np.log10(v2.rvs(100000))
x_range=[np.min(x),np.max(x)]
y_range=[np.min(y),np.max(y)]
df = pd.DataFrame.from_dict({"x": x, "y": y})
#------ Aggregate the data ------
cvs = ds.Canvas(plot_width=30, plot_height=30, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'x', 'y')
# Create contour plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contourf(agg, extent=x_range+y_range)
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()

The general answer to this question is probably given in this post:
Can I mimic a log scale of an axis in matplotlib without transforming the associated data?
However here an easy option might be to scale the content of the axes and then set the axes to a log scale.
A. image
You may plot your image on a logarithmic scale but make all pixels the same size in log units. Unfortunately imshow does not allow for such kind of image (any more), but one may use pcolormesh for that purpose.
import numpy as np
import matplotlib.pyplot as plt
import scipy.misc
img = scipy.misc.face()
extx = [-5,3] # log10 units
exty = [-45, -55] # log10 units
x = np.logspace(extx[0],extx[-1],img.shape[1]+1)
y = np.logspace(exty[0],exty[-1],img.shape[0]+1)
X,Y = np.meshgrid(x,y)
c = img.reshape((img.shape[0]*img.shape[1],img.shape[2]))/255.0
m = plt.pcolormesh(X,Y,X[:-1,:-1], color=c, linewidth=0)
m.set_array(None)
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.show()
B. contour
The same concept can be used for a contour plot.
import numpy as np
from matplotlib import pyplot as plt
x = np.linspace(-1.1,1.9)
y = np.linspace(-1.4,1.55)
X,Y = np.meshgrid(x,y)
agg = np.exp(-(X**2+Y**2)*2)
fig, ax = plt.subplots()
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
exp = lambda x: 10.**(np.array(x))
cf = ax.contourf(exp(X), exp(Y),agg, extent=exp([x.min(),x.max(),y.min(),y.max()]))
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()

Matplotlib: how to locate ticks and showing min and max of data

Good day,
I would like to dynamically locate my ticks and showing the min and max of the data (which is varying, thus I really can't harcode the conditions). I'm trying to use matplotlib.ticker functions and the best that I can find is MaxNLocator().. but unfortunately, it does not consider the limits of my dataset.
What would be the best approach to my problem?
Thanks!
pseudocode as follows:
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
data1 = range(5)
ax1 = plt.subplot(2,1,1)
ax1.plot(data1)
data2 = range(63)
ax2 = plt.subplot(2,1,2)
ax2.plot(data2)
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
ax2.xaxis.set_major_locator(MaxNLocator(integer=True))
plt.show()
and the output is:

Not sure about best approach, but one possible way to do this would be to create a list of numbers between your minimum and maximum using numpy.linspace(start, stop, num). The third argument passed to this lets you control the number of points generated. You can then round these numbers using a list comprehension, and then set the ticks using ax.set_xticks().
Note: This will produce unevenly distributed ticks in some cases, which may be unavoidable in your case
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import numpy as np
data1 = range(5)
ax1 = plt.subplot(2,1,1)
ax1.plot(data1)
data2 = range(63) # max of this is 62, not 63 as in the question
ax2 = plt.subplot(2,1,2)
ax2.plot(data2)
ticks1 = np.linspace(min(data1),max(data1),5)
ticks2 = np.linspace(min(data2),max(data2),5)
int_ticks1 = [round(i) for i in ticks1]
int_ticks2 = [round(i) for i in ticks2]
ax1.set_xticks(int_ticks1)
ax2.set_xticks(int_ticks2)
plt.show()
This gives:
Update: This will give a maximum numbers of ticks of 5, however if the data goes from say range(3) then the number of ticks will be less. I have updates the creating of int_ticks1 and int_ticks2 so that only unique values will be used to avoid repeated plotting of certain ticks if the range is small
Using the following data
data1 = range(3)
data2 = range(3063)
# below removes any duplicate ticks
int_ticks1 = list(set([int(round(i)) for i in ticks1]))
int_ticks2 = list(set([int(round(i)) for i in ticks2]))
This produces the following figure:

same scale of Y axis on differents figures

I try to plot different data with similar representations but slight different behaviours and different origins on several figures. So the min & max of the Y axis is different between each figure, but the scale too.
e.g. here are some extracts of my batch plotting :
Does it exists a simple way with matplotlib to constraint the same Y step on those different figures, in order to have an easy visual interpretation, while keeping an automatically determined Y min and Y max ?
In others words, I'd like to have the same metric spacing between each Y-tick

you could use a MultipleLocator from the ticker module on both axes to define the tick spacings:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
fig=plt.figure()
ax1=fig.add_subplot(211)
ax2=fig.add_subplot(212)
ax1.set_ylim(0,100)
ax2.set_ylim(40,70)
# set ticks every 10
tickspacing = 10
ax1.yaxis.set_major_locator(ticker.MultipleLocator(base=tickspacing))
ax2.yaxis.set_major_locator(ticker.MultipleLocator(base=tickspacing))
plt.show()
EDIT:
It seems like your desired behaviour was different to how I interpreted your question. Here is a function that will change the limits of the y axes to make sure ymax-ymin is the same for both subplots, using the larger of the two ylim ranges to change the smaller one.
import matplotlib.pyplot as plt
import numpy as np
fig=plt.figure()
ax1=fig.add_subplot(211)
ax2=fig.add_subplot(212)
ax1.set_ylim(40,50)
ax2.set_ylim(40,70)
def adjust_axes_limits(ax1,ax2):
yrange1 = np.ptp(ax1.get_ylim())
yrange2 = np.ptp(ax2.get_ylim())
def change_limits(ax,yr):
new_ymin = ax.get_ylim()[0] - yr/2.
new_ymax = ax.get_ylim()[1] + yr/2.
ax.set_ylim(new_ymin,new_ymax)
if yrange1 > yrange2:
change_limits(ax2,yrange1-yrange2)
elif yrange2 > yrange1:
change_limits(ax1,yrange2-yrange1)
else:
pass
adjust_axes_limits(ax1,ax2)
plt.show()
Note that the first subplot here has expanded from (40, 50) to (30, 60), to match the y range of the second subplot

The answer of Tom is pretty fine !
But I decided to use a simpler solution
I define an arbitrary yrange for all my plots e.g.
yrang = 0.003
and for each plot, I do :
ymin, ymax = ax.get_ylim()
ymid = np.mean([ymin,ymax])
ax.set_ylim([ymid - yrang/2 , ymid + yrang/2])
and possibly:
ax.yaxis.set_major_locator(ticker.MultipleLocator(base=0.005))

Pyplot: using percentage on x axis

I have a line chart based on a simple list of numbers. By default the x-axis is just the an increment of 1 for each value plotted. I would like to be a percentage instead but can't figure out how. So instead of having an x-axis from 0 to 5, it would go from 0% to 100% (but keeping reasonably spaced tick marks. Code below. Thanks!
from matplotlib import pyplot as plt
from mpl_toolkits.axes_grid.axislines import Subplot
data=[8,12,15,17,18,18.5]
fig=plt.figure(1,(7,4))
ax=Subplot(fig,111)
fig.add_subplot(ax)
plt.plot(data)

The code below will give you a simplified x-axis which is percentage based, it assumes that each of your values are spaces equally between 0% and 100%.
It creates a perc array which holds evenly-spaced percentages that can be used to plot with. It then adjusts the formatting for the x-axis so it includes a percentage sign using matplotlib.ticker.FormatStrFormatter. Unfortunately this uses the old-style string formatting, as opposed to the new style, the old style docs can be found here.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.ticker as mtick
data = [8,12,15,17,18,18.5]
perc = np.linspace(0,100,len(data))
fig = plt.figure(1, (7,4))
ax = fig.add_subplot(1,1,1)
ax.plot(perc, data)
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
xticks = mtick.FormatStrFormatter(fmt)
ax.xaxis.set_major_formatter(xticks)
plt.show()

This is a few months late, but I have created PR#6251 with matplotlib to add a new PercentFormatter class. With this class you can do as follows to set the axis:
import matplotlib.ticker as mtick
# Actual plotting code omitted
ax.xaxis.set_major_formatter(mtick.PercentFormatter(5.0))
This will display values from 0 to 5 on a scale of 0% to 100%. The formatter is similar in concept to what #Ffisegydd suggests doing except that it can take any arbitrary existing ticks into account.
PercentFormatter() accepts three arguments, max, decimals, and symbol. max allows you to set the value that corresponds to 100% on the axis (in your example, 5).
The other two parameters allow you to set the number of digits after the decimal point and the symbol. They default to None and '%', respectively. decimals=None will automatically set the number of decimal points based on how much of the axes you are showing.
Note that this formatter will use whatever ticks would normally be generated if you just plotted your data. It does not modify anything besides the strings that are output to the tick marks.
Update
PercentFormatter was accepted into Matplotlib in version 2.1.0.

Totally late in the day, but I wrote this and thought it could be of use:
def transformColToPercents(x, rnd, navalue):
# Returns a pandas series that can be put in a new dataframe column, where all values are scaled from 0-100%
# rnd = round(x)
# navalue = Nan== this
hv = x.max(axis=0)
lv = x.min(axis=0)
pp = pd.Series(((x-lv)*100)/(hv-lv)).round(rnd)
return pp.fillna(navalue)
df['new column'] = transformColToPercents(df['a'], 2, 0)

What is wrong with this matplotlib code?

I am trying to plot datetime on y axis and time on x-axis using a bar graph. I need to specify the heights in terms of datetime of y-axis and I am not sure how to do that.
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import datetime as dt
# Make a series of events 1 day apart
y = mpl.dates.drange(dt.datetime(2009,10,1),
dt.datetime(2010,1,15),
dt.timedelta(days=1))
# Vary the datetimes so that they occur at random times
# Remember, 1.0 is equivalent to 1 day in this case...
y += np.random.random(x.size)
# We can extract the time by using a modulo 1, and adding an arbitrary base date
times = y % 1 + int(y[0]) # (The int is so the y-axis starts at midnight...)
# I'm just plotting points here, but you could just as easily use a bar.
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(times, y, width = 10)
ax.yaxis_date()
fig.autofmt_ydate()
plt.show()

I think it could be your y range.
When I do your code I get a plot with the y axis ranging from about 11/Dec/2011 to 21/Dec/2011.
However, you've generated dates ranging from 1/10/2009 to 15/1/2010.
When I adjusted my y limits to take this into account it worked fine. I added this after the ax.bar.
plt.ylim( (min(y)-1,max(y)+1) )
Another reason the output is confusing is that since you've picked a width of 10, the bars are too wide and are actually obscuring each other.
Try use ax.plot(times,y,'ro') to see what I mean.
I produced the following plot using ax.bar(times,y,width=.1,alpha=.2) and ax.plot(times,y,'ro') to show you what I meant about bars overlapping each other:
And that's with a width of .1 for the bars, so if they had a width of 10 they'd be completely obscuring each other.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to set y-scale when making a boxplot with dataframe - python

Hope you are looking for below, Code ax = df.boxplot(column='num_ratings_log2', figsize=(20,10)) ymin = 0 ymax = 20 ax.set_ylim(2ymin, 2ymax)

Related

Manually draw log-spaced tick marks and labels in matplotlib

Matplotlib: how to locate ticks and showing min and max of data

same scale of Y axis on differents figures

Pyplot: using percentage on x axis

What is wrong with this matplotlib code?

Categories

Resources

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to set y-scale when making a boxplot with dataframe - python

Hope you are looking for below, Code ax = df.boxplot(column='num_ratings_log2', figsize=(20,10)) ymin = 0 ymax = 20 ax.set_ylim(2**ymin, 2**ymax)

Related

Manually draw log-spaced tick marks and labels in matplotlib

Matplotlib: how to locate ticks and showing min and max of data

same scale of Y axis on differents figures

Pyplot: using percentage on x axis

What is wrong with this matplotlib code?

Categories

Resources

Hope you are looking for below, Code ax = df.boxplot(column='num_ratings_log2', figsize=(20,10)) ymin = 0 ymax = 20 ax.set_ylim(2ymin, 2ymax)