Handling skewed data in a matplotlib bar chart


I'm creating a bar chart for a skewed data set using Python and matplotlib.
The graph is generated without any issue, but the bar for the skewed value takes up most of the chart and makes the other, non-skewed bars look small and negligible.
Below is the code used to generate the bar graph.
import numpy as np
import matplotlib.pyplot as plt

x = ["A", "B", "C", "D", "E", "F"]
y = [25, 11, 46, 895, 68, 5]
fig, ax = plt.subplots()
r1 = plt.barh(y=x,
              width=y,
              height=0.8)
#ht = [bar.get_width() for bar in r1.get_children()]
# y position and height of each bar, used to centre the value labels
r1y = np.asarray([bar.get_y() for bar in r1.get_children()])
r1h = np.asarray([bar.get_height() for bar in r1.get_children()])
for i in range(len(y)):  # label every bar
    plt.text(y[i], r1y[i] + r1h[i] / 2, '%s' % (y[i]), ha='left', va='center')
plt.xticks([0, 10, 100, 1000])
plt.show()
The code above creates a bar chart with 0, 10, 100 and 1000 as the xtick values, placed at distances proportional to their values.
While this is valid and expected behavior, a single skewed bar dominates the entire chart.
So, is it possible to place these xtick values equidistantly, so that the skewed data doesn't occupy most of the space in the final output?
In the expected output, the ranges 0-10 and 10-100 should together occupy about 66.6% of the space, and 100-1000 the remaining 33.3%.

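Try adding plt.xscale('log') so that the x-axis is spaced by powers of ten:
import numpy as np
import matplotlib.pyplot as plt

x = ["A", "B", "C", "D", "E", "F"]
y = [25, 11, 46, 895, 68, 5]
fig, ax = plt.subplots()
r1 = plt.barh(y=x,
              width=y,
              height=0.8)
# y position and height of each bar, used to centre the value labels
r1y = np.asarray([bar.get_y() for bar in r1.get_children()])
r1h = np.asarray([bar.get_height() for bar in r1.get_children()])
for i in range(len(y)):  # label every bar
    plt.text(y[i], r1y[i] + r1h[i] / 2, '%s' % (y[i]), ha='left', va='center')
plt.xscale('log')
plt.show()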
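A log axis cannot show 0, so if the 0 tick matters, a symmetric log scale is one possible alternative. The sketch below is not part of the original answer. It assumes Matplotlib 3.3 or later (where the keyword is spelled linthresh) and treats everything below 10 as linear. With linscale=1 the linear part is drawn about as wide as one decade, so 0, 10, 100 and 1000 come out close to evenly spaced. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E", "F"]
values = [25, 11, 46, 895, 68, 5]

fig, ax = plt.subplots()
bars = ax.barh(labels, values, height=0.8)

# Write each value at the right edge of its bar, vertically centred on the bar.
for bar, value in zip(bars, values):
    ax.text(value, bar.get_y() + bar.get_height() / 2, str(value),
            ha="left", va="center")

# Linear between 0 and 10, logarithmic above that threshold.
ax.set_xscale("symlog", linthresh=10, linscale=1)
ax.set_xticks([0, 10, 100, 1000])
ax.set_xticklabels(["0", "10", "100", "1000"])
```

Call plt.show() or fig.savefig(...) afterwards to see the result. The threshold of 10 is a choice for this data set, picked because the smallest tick of interest above 0 is 10.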

Related

Refining dataframe-based stacked bar plot in Python

I am trying to create a stacked bar chart using PyCharm.
I am using matplotlib to explore its full potential for simple data visualization.
My original code is for a grouped bar chart that displays cycle time for different teams. This information comes from a dataframe. The chart also includes an autolabeling function (i.e. the height of each bar = continuous variable).
I am trying to convert this grouped bar chart into a stacked bar chart. The code below needs to be improved because of two issues:
1. Labels for variables have too many decimals: this issue did not occur for the grouped bar chart. The csv file and the derived data frame weren't altered. I am struggling to understand if and where to use the round command. I guess the issue is related to the autolabeling function, because the datatype used is float (I need to see at least 1 decimal).
2. Data labels are displaced: as the autolabeling function was created for separate bars, the labels always matched the distance I wanted (based on the vertical offset). Unfortunately I have not figured out how to center the labels instead (see my example: the value for funnel time is at the height of squad time, and vice versa). Logically, the issue should be that the starting height of each variable is defined in advance (see rects3 in the code, value of bottom), but I don't know how to reflect this in my autolabeling function.
The question is: what exactly in the code must be altered in order to have the values of cycle time centered?
The code (my notes for you are marked with "NOTE"):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
'''PART 1 - Preprocess data -----------------------------------------------'''
#Directory or link of my CSV. This can be used also if you want to use API.
csv1 = r"C:\Users\AndreaPaviglianiti\Downloads\CT_Plot_DF.csv"
#Create and read dataframe. This is just to check the DF before plotting
df = pd.read_csv(csv1, sep=',', engine= 'python')
print(df, '\n')
#Extract columns as lists
squads = df['Squad_Name'].astype('str') #for our horizontal axis
funnel = df['Funnel_Time'].astype('float')
squadt = df['Squad_Time'].astype('float')
wait = df['Waiting_Time'].astype('float')
# NOTE: Here I tried to set the rounding but without success
'''PART 2 - Create the Bar Plot / Chart ----------------------------------'''
x = np.arange(len(squads)) #our labels on x will be the squads' names
width = 0.2 # the width of the bars. The bigger value, the larger bars
distance = 0.2
#Create objects that will be used as subplots (fig and ax).
#Each "rects" is the visualization of a yn value. first width is distance between X values,
# the second is the real width of bars.
fig, ax = plt.subplots()
rects1 = ax.bar(x, funnel, width, color='red', label='Funnel Time')
rects2 = ax.bar(x, squadt, width, color='green', bottom=funnel, label='Squad Time')
rects3 = ax.bar(x, wait, width, bottom=funnel+squadt, color='purple', label='Waiting Time')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Mean Cycle Time (h)')
ax.set_xlabel('\n Squads')
ax.set_title("Squad's Cycle Time Comparison in Dec-2020 \n (in mean Hours)")
ax.set_xticks(x)
ax.set_xticklabels(squads)
ax.legend()
fig.align_xlabels() #align labels to columns
# The function to display values above the bars
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width()/2, height),
                    xytext=(0, 3), # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
# NOTE: Here I tried to change xytext="center" but I get an error. Am I supposed to use
# coordinates only, or is there an alternative to change the position from the height to the center?
#We will label only the most recent information. To label both add to the code "autolabel(rects1)"
autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
fig.tight_layout()
'''PART 3 - Execute -------------------------------------------------------'''
plt.show()
Thank you for the help!
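One possible way to get both effects (this is a sketch, not an answer taken from the original thread): in a stacked chart each segment starts at rect.get_y() rather than at 0, so its vertical midpoint is rect.get_y() + rect.get_height() / 2, and a format spec such as {:.1f} limits the label to one decimal. The squad names and numbers below are hypothetical stand-ins for the CSV columns, and autolabel_center is a made-up name. The Agg backend line only makes the sketch runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in values for the CSV columns in the question.
squads = ["Alpha", "Beta", "Gamma"]
funnel = np.array([12.3456, 8.9012, 15.5])
squadt = np.array([30.27, 22.14, 41.0])
wait = np.array([5.5555, 9.78, 3.14])

x = np.arange(len(squads))
width = 0.5

fig, ax = plt.subplots()
rects1 = ax.bar(x, funnel, width, color="red", label="Funnel Time")
rects2 = ax.bar(x, squadt, width, bottom=funnel, color="green", label="Squad Time")
rects3 = ax.bar(x, wait, width, bottom=funnel + squadt, color="purple",
                label="Waiting Time")


def autolabel_center(rects):
    """Write each segment's height, to one decimal, at the segment's centre."""
    for rect in rects:
        height = rect.get_height()
        # get_y() is the bottom edge of the segment, so this is its midpoint.
        y_mid = rect.get_y() + height / 2
        ax.annotate(f"{height:.1f}",
                    xy=(rect.get_x() + rect.get_width() / 2, y_mid),
                    ha="center", va="center")


for rects in (rects1, rects2, rects3):
    autolabel_center(rects)

ax.set_xticks(x)
ax.set_xticklabels(squads)
ax.legend()
```

On Matplotlib 3.4 or later, ax.bar_label(rects, label_type="center", fmt="%.1f") should give a similar result without a custom function.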

Plotting a variable number of series and data forecasts onto subplots, including axis labels, formatting and line colours

I'm writing a program which gets data and then uses time series forecasting to predict data values for the next, say, 300 data points.
However, only data which fulfills a certain condition will be plotted, so there is no predefined number of subplots for the add_subplot() method. I'm aware of the plt.subplots() function, but something such as
fig, (ax1, ax2) = plt.subplots(1, 2)
implies that exactly two graphs will be plotted, whereas I need the number of graphs to vary, for example through a parameter.
Here is a simplified version of the current code which results in each plot being in separate windows:
fig = plt.figure()  # creates a figure instance for the final graph output
plots = 1  # indicates the total number of plots to plot, starting from 1
           # passed as a parameter to the add_subplot() function
for data in dataSet:
    forecast(data, fig, plots)
plt.figure(fig.number)
plt.show()
And the function:
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
from pandas import Series
from statsmodels.tsa.holtwinters import ExponentialSmoothing
def forecast(data, superFigure, plotNumber):
    index = range(0, len(data))
    plotData = Series(data, index)
    # fit the data values into a specific model:
    modelFit = ExponentialSmoothing(plotData, trend="add").fit()
    # forecast for the next 300 points:
    modelForecast = modelFit.forecast(300)
    if [condition]:
        # plot the original data points:
        points = plotData.plot(marker='x', color='black', label='Base Data')
        points.set_xlim(0, len(data) + 300)
        # plot the forecast in a different colour:
        modelForecast.plot(marker='x', ax=points, color='blue', label='Forecasted Data')
        plt.title("Plot Title")
        plt.xlabel("X Axis")
        plt.ylabel("Y Axis")
        # format the axes, adding thousand separator
        points.get_xaxis().set_major_formatter(
            tick.FuncFormatter(lambda x, p: format(int(x), ',')))
        points.get_yaxis().set_major_formatter(
            tick.FuncFormatter(lambda x, p: format(int(x), ',')))
        plt.legend()
        plt.show()
This produces multiple graphs such as this (actual labels have been cut out).
Unfortunately you have to close each graph before viewing the next one, and I want every graph to be visible on one page.
I tried changing the code within the "if [condition]" to:
if [condition]:
    points = plotData.plot(marker='x', color='black', label='Base Data')
    modelForecast.plot(marker='x', ax=points, color='blue', label='Forecasted Data')
    dataLine = plt.gca().get_lines()[0]
    forecastLine = plt.gca().get_lines()[1]
    # put all x and y values into single lists by concatenating them
    totalXData = [*dataLine.get_xdata(), *forecastLine.get_xdata()]
    totalYData = [*dataLine.get_ydata(), *forecastLine.get_ydata()]
    subset = superFigure.add_subplot(10, 10, plotNumber)
    for i in range(0, len(totalXData)):
        subset.plot(totalXData[i], totalYData[i])
    plotNumber += 1
These changes produce this exact graph which seems to have the other graphs squished in the top-left corner, and I get "MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance" warnings.
If I change "superFigure.add_subplot(10, 10, plotNumber)" to "superFigure.add_subplot(20, 20, plotNumber)" I also get "UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations".
I then tried to change it to:
if [condition]:
    fig, ax = plt.subplots()
    plotData.plot(marker='x', ax=ax, color='black', label='Base Data')
    modelForecast.plot(marker='x', ax=ax, color='blue', label='Forecasted Data')
    ax.set([...])
    ax.legend()
    plt.show()
which doesn't produce the desired output, presumably because it recreates the figure on each call of forecast(), unless a figure window can contain multiple figures.
I also sometimes get the following warning:
RuntimeWarning: More than 20 figures have been opened. Figures created
through the pyplot interface (matplotlib.pyplot.figure) are retained
until explicitly closed and may consume too much memory.
fig, ax = plt.subplots()
How can I create subplots which include all the formatting and are displayed in one window all together?
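One way to approach this, sketched under assumptions rather than taken from the thread: decide first which series pass the condition, then size the grid from that count with plt.subplots(nrows, ncols) and hand each series its own Axes. In the sketch, toy_forecast is a hypothetical stand-in for the ExponentialSmoothing model, the "positive trend" filter is a placeholder for the real condition, and data_set is made-up data. The Agg backend line only makes the sketch runnable without a display.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
import numpy as np


def toy_forecast(data, steps):
    """Hypothetical stand-in for the model: continue the overall linear trend."""
    slope = (data[-1] - data[0]) / (len(data) - 1)
    return data[-1] + slope * np.arange(1, steps + 1)


def thousands_formatter():
    """Return a new formatter that adds a thousands separator to tick labels."""
    return tick.FuncFormatter(lambda value, pos: format(int(value), ","))


def plot_forecast(ax, data, forecast):
    """Draw one series and its forecast on the Axes that is passed in."""
    n = len(data)
    ax.plot(np.arange(n), data, marker="x", color="black", label="Base Data")
    ax.plot(np.arange(n, n + len(forecast)), forecast, marker="x",
            color="blue", label="Forecasted Data")
    ax.set_xlim(0, n + len(forecast))
    ax.xaxis.set_major_formatter(thousands_formatter())
    ax.yaxis.set_major_formatter(thousands_formatter())
    ax.legend()


# Made-up data; the filter below is only a placeholder condition.
data_set = [np.linspace(0, 5000, 50), np.linspace(900, 100, 50),
            np.linspace(10, 8000, 50), np.linspace(200, 3000, 50)]
selected = [d for d in data_set if d[-1] > d[0]]

# Size the grid only once the number of plots is known.
ncols = 2
nrows = max(1, math.ceil(len(selected) / ncols))
fig, axes = plt.subplots(nrows, ncols, squeeze=False,
                         figsize=(6 * ncols, 4 * nrows))

for ax, data in zip(axes.flat, selected):
    plot_forecast(ax, data, toy_forecast(data, 300))

# Hide grid cells that were not needed.
for ax in axes.flat[len(selected):]:
    ax.set_visible(False)

fig.tight_layout()
```

A single plt.show() after this displays everything in one window. Passing the Axes into the plotting function is what keeps the formatting per subplot, since no pyplot "current axes" state is involved.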

How to change the positions of subplot titles and axis labels in Seaborn FacetGrid?

I am trying to draw a polar plot using Seaborn's FacetGrid, similar to what is detailed in seaborn's gallery.
I am using the following code:
sns.set(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1.25)
# Set up a grid of axes with a polar projection
g = sns.FacetGrid(df_total, col="Construct", hue="Run", col_wrap=5, subplot_kws=dict(projection='polar'), size=5, sharex=False, sharey=False, despine=False)
# Draw a scatterplot onto each axes in the grid
g.map(plt.plot, 'Rad', 'y axis label', marker=".", ms=3, ls='None').set_titles("{col_name}")
plt.savefig('./image.pdf')
Which with my data gives the following:
I want to keep this organisation of 5 plots per line.
The problem is that the title of each subplot overlaps with the tick values, and so does the y axis label.
Is there a way to prevent this behaviour? Can I somehow shift the titles slightly above their current position and can I shift the y axis labels slightly on the left of their current position?
Many thanks in advance!
EDIT:
This is not a duplicate of this SO question, as the problem there was that the title of one subplot overlapped with the axis label of another subplot.
Here my problem is that the title of one subplot overlaps with the tick labels of the same subplot, and similarly the axis label overlaps with the tick labels of the same subplot.
I would also like to add that I do not care whether they overlap in my Jupyter notebook (where the plot was created); however, I want the final saved image to have no overlap. Perhaps there is something I need to do to save the image in a slightly different way to avoid that, but I don't know what (I am only using plt.savefig to save it).
EDIT 2: If someone would like to reproduce the problem here is a minimal example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
sns.set(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1.5)
# Generate an example radial dataset
r = np.linspace(0, 10000, num=100)
df = pd.DataFrame({'label': r, 'slow': r, 'medium-slow': 1 * r, 'medium': 2 * r, 'medium-fast': 3 * r, 'fast': 4 * r})
# Convert the dataframe to long-form or "tidy" format
df = pd.melt(df, id_vars=['label'], var_name='speed', value_name='theta')
# Set up a grid of axes with a polar projection
g = sns.FacetGrid(df, col="speed", hue="speed",
                  subplot_kws=dict(projection='polar'), size=4.5, col_wrap=5,
                  sharex=False, sharey=False, despine=False)
# Draw a scatterplot onto each axes in the grid
g.map(plt.scatter, "theta", "label")
plt.savefig('./image.png')
plt.show()
Which gives the following image, in which the titles are not as bad as in my original problem (there is still some overlap), but the label on the left-hand side overlaps completely.

In order to move the title a bit higher you can set a new position:
ax.title.set_position([.5, 1.1])
In order to move the ylabel a little further left, you can add some padding:
ax.yaxis.labelpad = 25
To do this for the axes of the FacetGrid, you'd do:
for ax in g.axes:
    ax.title.set_position([.5, 1.1])
    ax.yaxis.labelpad = 25
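As a variation on the same idea, the padding can also be given in points through set_title(pad=...) and set_ylabel(labelpad=...). The following is only a sketch: it uses a plain row of polar axes as a stand-in for the FacetGrid (with seaborn, the loop would run over g.axes.flat), and the pad values 25 and 35 are arbitrary choices, not recommended settings. The Agg backend line only makes the sketch runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

theta = np.linspace(0, 2 * np.pi, 100)
radius = np.linspace(0, 10000, 100)

# Stand-in for the FacetGrid: one row of three polar axes.
fig, axes = plt.subplots(1, 3, subplot_kw=dict(projection="polar"),
                         figsize=(13.5, 4.5))

for ax, name in zip(axes.flat, ["slow", "medium", "fast"]):
    ax.scatter(theta, radius, s=4)
    # pad: distance between the top of the axes and the title, in points.
    ax.set_title(name, pad=25)
    # labelpad: extra spacing between the tick labels and the axis label, in points.
    ax.set_ylabel("label", labelpad=35)
```

When saving, fig.savefig("image.png", bbox_inches="tight") keeps labels that were pushed outward from being cropped.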

The answer provided by ImportanceOfBeingErnest in this SO question may help.

Matplotlib axis limits and text positions independent of dataset units

I'm trying to make plots that are formatted the same way despite coming from different datasets and I'm running into issues with getting consistent text positions and appropriate axis limits because the datasets are not scaled exactly the same. For example - say I generate the following elevation profile:
import matplotlib.pyplot as plt
import numpy as np
Distance=np.array([1000,3000,7000,15000,20000])
Elevation=np.array([100,200,350,800,400])
def MyPlot(X,Y):
    fig = plt.figure()
    ax = fig.add_subplot(111, aspect='equal')
    ax.plot(X,Y)
    fig.set_size_inches(fig.get_size_inches()*2)
    ax.set_ylim(min(Y)-50, max(Y)+500)
    ax.set_xlim(min(X)-50, max(X)+50)
    MaxPoint=X[np.argmax(Y)], max(Y)
    ax.scatter(MaxPoint[0], MaxPoint[1], s=10)
    ax.text(MaxPoint[0], MaxPoint[1]+100, s='Maximum = '+str(MaxPoint[1]), fontsize=8)
MyPlot(Distance,Elevation)
And then I have another dataset that's scaled differently:
Distance2=Distance*4
Elevation2=Elevation*5
MyPlot(Distance2,Elevation2)
Because a unit change is relatively much larger in the first dataset than in the second, the text and axis labels are not formatted as I'd like in the second plot. Is there a way to set the text position and axis limits so that they adapt to the relative scale of the dataset?

First off, for placing text with an offset such as this, you almost never want to use text. Instead, use annotate. The advantage is that you can give an offset of the text in points instead of data units.
Next, to reduce the density of tick locations, use ax.locator_params and change the nbins parameter. nbins controls the tick density. Tick locations will still be automatically chosen, but reducing nbins will reduce the maximum number of tick locations. If you do lower nbins, you may want to also change the numbers that matplotlib considers "even" when picking tick intervals. That way, you have more options to get the expected number of ticks.
Finally, to avoid manually setting limits with a set padding, consider using margins(some_percentage) to pad the extents by a percentage of the current limits.
Here is a complete example that combines all three:
import matplotlib.pyplot as plt
import numpy as np
distance=np.array([1000,3000,7000,15000,20000])
elevation=np.array([100,200,350,800,400])
def plot(x, y):
    fig, ax = plt.subplots(figsize=(8, 2))
    # Plot your data and place a marker at the peak location
    maxpoint=x[np.argmax(y)], max(y)
    ax.scatter(maxpoint[0], maxpoint[1], s=10)
    ax.plot(x, y)
    # Reduce the maximum number of ticks and give matplotlib more flexibility
    # in the tick intervals it can choose.
    # Essentially, this will more or less always have two ticks on the y-axis
    # and 4 on the x-axis
    ax.locator_params(axis='y', nbins=3, steps=range(1, 11))
    ax.locator_params(axis='x', nbins=5, steps=range(1, 11))
    # Annotate the peak location. The text will always be 5 points from the
    # data location.
    ax.annotate('Maximum = {:0.0f}'.format(maxpoint[1]), size=8,
                xy=maxpoint, xytext=(5, 5), textcoords='offset points')
    # Give ourselves lots of padding on the y-axis, less on the x
    ax.margins(x=0.01, y=0.3)
    ax.set_ylim(bottom=y.min())
    # Set the aspect of the plot to be equal and add some x/y labels
    ax.set(xlabel='Distance', ylabel='Elevation', aspect=1)
    plt.show()
plot(distance,elevation)
And if we change the data:
plot(distance * 4, elevation * 5)
Finally, you might consider placing the annotation just above the top of the axis, instead of offset from the point:
ax.annotate('Maximum = {:0.0f}'.format(maxpoint[1]), ha='center',
            size=8, xy=(maxpoint[0], 1), xytext=(0, 5),
            textcoords='offset points',
            xycoords=('data', 'axes fraction'))

Maybe you should use seaborn, whose default style draws plots without borders. I think it's a very good approach.
It will look like this:
You need to add the line import seaborn at the beginning of the script.

matplotlib: Creating two (stacked) subplots with SHARED X axis but SEPARATE Y axis values

I am using matplotlib 1.2.x and Python 2.6.5 on Ubuntu 10.0.4. I am trying to create a SINGLE plot that consists of a top plot and a bottom plot.
The X axis is the date of the time series. The top plot contains a candlestick plot of the data, and the bottom plot should consist of a bar type plot - with its own Y axis (also on the left - same as the top plot). These two plots should NOT OVERLAP.
Here is a snippet of what I have done so far.
datafile = r'/var/tmp/trz12.csv'
r = mlab.csv2rec(datafile, delimiter=',', names=('dt', 'op', 'hi', 'lo', 'cl', 'vol', 'oi'))
mask = (r["dt"] >= datetime.date(startdate)) & (r["dt"] <= datetime.date(enddate))
selected = r[mask]
plotdata = zip(date2num(selected['dt']), selected['op'], selected['cl'], selected['hi'], selected['lo'], selected['vol'], selected['oi'])
# Setup charting
mondays = WeekdayLocator(MONDAY) # major ticks on the mondays
alldays = DayLocator() # minor ticks on the days
weekFormatter = DateFormatter('%b %d') # Eg, Jan 12
dayFormatter = DateFormatter('%d') # Eg, 12
monthFormatter = DateFormatter('%b %y')
# every Nth month
months = MonthLocator(range(1,13), bymonthday=1, interval=1)
fig = pylab.figure()
fig.subplots_adjust(bottom=0.1)
ax = fig.add_subplot(111)
ax.xaxis.set_major_locator(months)#mondays
ax.xaxis.set_major_formatter(monthFormatter) #weekFormatter
ax.format_xdata = mdates.DateFormatter('%Y-%m-%d')
ax.format_ydata = price
ax.grid(True)
candlestick(ax, plotdata, width=0.5, colorup='g', colordown='r', alpha=0.85)
ax.xaxis_date()
ax.autoscale_view()
pylab.setp( pylab.gca().get_xticklabels(), rotation=45, horizontalalignment='right')
# Add volume data
# Note: the code below OVERWRITES the bottom part of the first plot
# it should be plotted UNDERNEATH the first plot - but somehow, that's not happening
fig.subplots_adjust(hspace=0.15)
ay = fig.add_subplot(212)
volumes = [ x[-2] for x in plotdata]
ay.bar(range(len(plotdata)), volumes, 0.05)
pylab.show()
I have managed to display the two plots using the code above; however, there are two problems with the bottom plot:
1. It COMPLETELY OVERWRITES the bottom part of the first (top) plot, almost as though the second plot was drawing on the same 'canvas' as the first plot. I can't see where or why that is happening.
2. It OVERWRITES the existing X axis with its own indices; the X axis values (dates) should be SHARED between the two plots.
What am I doing wrong in my code? Can someone spot what is causing the 2nd (bottom) plot to overwrite the first (top) plot, and how can I fix this?
Here is a screenshot of the plot created by the code above:
[[Edit]]
After modifying the code as suggested by hwlau, this is the new plot. It is better than the first in that the two plots are separate; however, the following issues remain:
1. The X axis should be SHARED by the two plots (i.e. the X axis should be shown only for the 2nd [bottom] plot).
2. The Y values for the 2nd plot seem to be formatted incorrectly.
I think these issues should be quite easy to resolve; however, my matplotlib fu is not great at the moment, as I have only recently started programming with matplotlib. Any help will be much appreciated.

There seem to be a couple of problems with your code:
1. If you were using figure.add_subplot with the full signature of subplot(nrows, ncols, plotNum), it may have been more apparent that your first plot was asking for 1 row and 1 column and the second plot was asking for 2 rows and 1 column. Hence your first plot fills the whole figure. Rather than fig.add_subplot(111) followed by fig.add_subplot(212), use fig.add_subplot(211) followed by fig.add_subplot(212).
2. Sharing an axis should be done in the add_subplot command using sharex=first_axis_instance.
I have put together an example which you should be able to run:
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import matplotlib.dates as mdates
import datetime as dt
n_pts = 10
dates = [dt.datetime.now() + dt.timedelta(days=i) for i in range(n_pts)]
ax1 = plt.subplot(2, 1, 1)
ax1.plot(dates, range(10))
ax2 = plt.subplot(2, 1, 2, sharex=ax1)
ax2.bar(dates, range(10, 20))
# Now format the x axis. This *MUST* be done after all sharex commands are run.
# put no more than 10 ticks on the date axis.
ax1.xaxis.set_major_locator(mticker.MaxNLocator(10))
# format the date in our own way.
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
# rotate the labels on both date axes
for label in ax1.xaxis.get_ticklabels():
    label.set_rotation(30)
for label in ax2.xaxis.get_ticklabels():
    label.set_rotation(30)
# tweak the subplot spacing to fit the rotated labels correctly
plt.subplots_adjust(hspace=0.35, bottom=0.125)
plt.show()
Hope that helps.

You should change this line:
ax = fig.add_subplot(111)
to
ax = fig.add_subplot(211)
The original command means that there is one row and one column, so the first plot occupies the whole figure. Your second graph, fig.add_subplot(212), then covers the lower part of the first graph.
Edit
If you don't want a gap between the two plots, use subplots_adjust() to change the spacing between the subplots.
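As a rough illustration of that last point (a sketch with placeholder data, not the asker's candlestick chart), setting hspace=0 removes the vertical gap between two stacked axes that share an x axis. The names ax_top and ax_bottom are made up for this example, and the Agg backend line only makes the sketch runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt

# Two stacked axes sharing one x axis; placeholder data on both.
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.plot(range(10), range(10))
ax_bottom.bar(range(10), range(10, 20))

# hspace is the height reserved between subplots, as a fraction of the
# average axes height; 0 makes the two plots touch.
fig.subplots_adjust(hspace=0)
```

With sharex=True, plt.subplots hides the x tick labels of the top plot, so only the bottom plot shows them.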

The example from @Pelson, simplified:
import matplotlib.pyplot as plt
import datetime as dt
#Two subplots that share one x axis
fig,ax=plt.subplots(2,sharex=True)
#plot data
n_pts = 10
dates = [dt.datetime.now() + dt.timedelta(days=i) for i in range(n_pts)]
ax[0].bar(dates, range(10, 20))
ax[1].plot(dates, range(10))
#rotate and format the dates on the x axis
fig.autofmt_xdate()
The subplots sharing an x-axis are created in one line, which is convenient when you want more than two subplots:
fig, ax = plt.subplots(number_of_subplots, sharex=True)
To format the dates correctly on the x axis, we can simply use fig.autofmt_xdate().
For additional information, see the shared axis demo and the date demo from the pylab examples.
This example ran on Python3, matplotlib 1.5.1

