Pandas bar chart with unequal groups - python

I have data with a hierarchical structure and want to create a plot with groups of bars.
import pandas as pd
data = [
['alpha', 'x', 1],
['alpha', 'y', 2],
['alpha', 'z', 2],
['beta', 'x', 3],
['beta', 'z', 4]]
df = pd.DataFrame(data, columns=['P','Q','R'])
df.pivot(index='P', columns='Q', values='R').plot.bar(rot=0)
This code produces:
How could I:
Eliminate the space for the missing bar, i.e. accommodate groups with different numbers of bars?
Make all the alphas blue and the betas orange, i.e. cycle the colors by group rather than within groups?

What if you create the plot "manually"? You can use loc to filter each group and then plot the groups on the same figure.
The space comes from the index values: for beta I add 1 to the index, which leaves one empty slot between the two groups. I then combine both sets of positions in xticks and simply use df['Q'] as the labels.
import matplotlib.pyplot as plt
alpha = df.loc[df['P']=='alpha']
beta = df.loc[df['P']=='beta']
plt.bar(data=alpha, x=alpha.index, height='R', label='alpha')
plt.bar(data=beta, x=beta.index+1, height='R', label='beta')  # +1 leaves a gap between the groups
plt.xticks(alpha.index.tolist() + list(beta.index+1), df['Q'].tolist())
plt.legend()

I am not sure how to get rid of the empty slots, but you can use the stacked parameter to avoid them in the output. And yes, you can pass a list of colors to the bar method, which will color the bars accordingly.
import pandas as pd
data = [
['alpha', 'x', 1],
['alpha', 'y', 2],
['alpha', 'z', 2],
['beta', 'x', 3],
['beta', 'z', 4]]
df = pd.DataFrame(data, columns=['P','Q','R'])
df.pivot(index='P',columns='Q',values='R').plot.bar(rot=0, stacked=True,color = ['blue', 'green', 'red'])
I hope it helps.

This is inspired by @MattR's answer, which showed me that plotting bars from scratch is not rocket science. Pandas groupby() seems to be a good tool for this.
In the end I prefer it without extra space between groups.
import matplotlib.pyplot as plt
labels = []
for g, grp in df.groupby('P'):
    # one bar call per group, so each group gets its own color
    plt.bar(grp.index, grp.R, label=g)
    labels.extend(grp.Q)
plt.xticks(df.index, labels)
plt.legend()
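Both manual approaches above rely on the DataFrame having a default 0..n-1 index with each group's rows adjacent. As a rough sketch under that same adjacency assumption, the bar positions can be computed explicitly so that the gap between groups becomes a parameter. The helper name bar_positions, its gap argument and the pos column are made up for this illustration; they are not part of pandas or matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    [["alpha", "x", 1], ["alpha", "y", 2], ["alpha", "z", 2],
     ["beta", "x", 3], ["beta", "z", 4]],
    columns=["P", "Q", "R"],
)


def bar_positions(frame, group_col, gap=1):
    """Hypothetical helper: one x position per row, with `gap` empty slots between groups.

    Assumes the rows of each group are adjacent in `frame`.
    """
    # ngroup() numbers the groups 0, 1, 2, ... in order of first appearance
    group_number = frame.groupby(group_col, sort=False).ngroup()
    return [row + gap * int(g) for row, g in enumerate(group_number)]


df["pos"] = bar_positions(df, "P", gap=1)

fig, ax = plt.subplots()
for name, grp in df.groupby("P", sort=False):
    ax.bar(grp["pos"], grp["R"], label=name)  # one bar call, hence one color, per group
ax.set_xticks(df["pos"])
ax.set_xticklabels(df["Q"])
ax.legend()
```

With gap=0 the groups sit directly next to each other, which matches the layout preferred in the last answer.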

Related

In seaborn, how can I group by a variable without using the "hue" argument?

In seaborn, is it possible to group observations based on a column without using the hue argument?
For example, how could I get these two lines to show up in the same colour, but as separate lines?
Code for generating this is below.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.DataFrame(
    {
        'group': ["group01", "group01", "group02", "group02"],
        'x': [1, 2, 3, 5],
        'y': [2, 4, 3, 5]
    }
)
sns.lineplot(df, x='x', y='y', hue='group')
plt.show()
This is straightforward to do in R's ggplot, by mapping the group variable to group, rather than to colour. For example, see this.
The reason I want to do this is that I want to show multiple overlaid plots all in the same colour. This helps to show variability across different datasets. The different colours that I would get with seaborn's hue are unnecessary and distracting, especially when there would be dozens of them. Here is the sort of plot I want to create:
seaborn.lineplot has a units parameter, which seems to be equivalent to ggplot's group:
units: vector or key in data
Grouping variable identifying sampling units. When used, a separate line will be drawn for each unit with appropriate semantics,
but no legend entry will be added. Useful for showing distribution of
experimental replicates when exact identities are not needed.
sns.lineplot(df, x='x', y='y', units='group')
Output:
Combining units and hue in a more complex example:
df = pd.DataFrame(
    {
        'group': ["group01", "group01", "group02", "group02", "group01", "group01"],
        'group2': ['A', 'A', 'A', 'A', 'B', 'B'],
        'x': [1, 2, 3, 5, 2, 4],
        'y': [2, 4, 3, 5, 3, 2]
    }
)
sns.lineplot(df, x='x', y='y', units='group', hue='group2')
Output:
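For readers who want to see what "one line per unit, all in the same color" amounts to, here is a rough plain-matplotlib approximation. It is only a sketch of the visible effect, not a description of how seaborn implements units, and the fixed color "C0" is an arbitrary choice made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["group01", "group01", "group02", "group02"],
        "x": [1, 2, 3, 5],
        "y": [2, 4, 3, 5],
    }
)

fig, ax = plt.subplots()
# Draw one separate line per group, but give every line the same color
# and no label, so no legend entries are created.
for _, unit in df.groupby("group"):
    ax.plot(unit["x"], unit["y"], color="C0")
ax.set_xlabel("x")
ax.set_ylabel("y")

lines = ax.get_lines()
```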

Set individual wedge hatching for pandas pie chart

I am trying to make pie charts where some of the wedges have hatching and some of them don't, based on their content. The data consists of questions and yes/no/in progress answers, as shown below in the MWE.
import pandas as pd
import matplotlib.pyplot as plt
raw_data = {'Q1': ['IP', 'IP', 'Y/IP', 'Y', 'IP'],
            'Q2': ['Y', 'Y', 'Y', 'Y', 'N/IP'],
            'Q3': ['N/A', 'IP', 'Y/IP', 'N', 'N']}
df = pd.DataFrame(raw_data, columns=['Q1', 'Q2', 'Q3'])
df = df.astype('string')
colors = {'Y': 'green',
          'Y/IP': 'greenyellow',
          'IP': 'orange',
          'N/IP': 'gold',
          'N': 'red',
          'N/A': 'grey'
          }
for i in df.columns:
    pie = df[i].value_counts().plot.pie(colors=[colors[v] for v in df[i].value_counts().keys()])
    fig = pie.get_figure()
    fig.savefig("D:/windows/"+i+"test.png")
    fig.clf()
However, instead of greenyellow and gold I am trying to make the wedges green with yellow hatching, and yellow with red hatching, like so (note the below image does not match the data from the MWE):
I had a look online and am aware I will likely have to split the pie(s) into individual wedges but can't seem to get that to work alongside the pandas value counts. Any help would be massively appreciated. Thanks!
This snippet shows how to add hatching in custom colors to a pie chart. You can take the result of pandas value_counts(), which is a Series, and use it with the snippet I have provided.
I have added the hatch color as the second element of each entry in the color dictionary:
import matplotlib.pyplot as plt
colors = {'Y': ['green', 'lime'],
          'IP': ['orange', 'red'],
          'N': ['red', 'cyan']}
labels = ['Y', 'N', 'IP']
wedges, _ = plt.pie(x=[1, 2, 3], labels=labels)
for pie_wedge in wedges:
    # the hatch is drawn in the wedge's edge color
    pie_wedge.set_edgecolor(colors[pie_wedge.get_label()][1])
    pie_wedge.set_facecolor(colors[pie_wedge.get_label()][0])
    pie_wedge.set_hatch('/')
plt.legend(wedges, labels, loc="best")
plt.show()
The result looks like so:
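The snippet above hatches every wedge. A possible way to connect it to value_counts() and hatch only some answers is sketched below, using column Q1 from the question. The styles dictionary, with None meaning "no hatching", is an assumption made for this example, and the color pairs simply follow the description in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

answers = pd.Series(["IP", "IP", "Y/IP", "Y", "IP"], name="Q1")

# Hypothetical mapping: answer -> (face color, hatch color or None for no hatching)
styles = {
    "Y": ("green", None),
    "Y/IP": ("green", "yellow"),
    "IP": ("orange", None),
    "N/IP": ("yellow", "red"),
    "N": ("red", None),
    "N/A": ("grey", None),
}

counts = answers.value_counts()
fig, ax = plt.subplots()
wedges = ax.pie(counts.values, labels=counts.index)[0]  # first element holds the wedges

for wedge, answer in zip(wedges, counts.index):
    face, hatch_color = styles[answer]
    wedge.set_facecolor(face)
    if hatch_color is not None:
        wedge.set_edgecolor(hatch_color)  # the hatch is drawn in the edge color
        wedge.set_hatch("//")

hatch_by_answer = {answer: wedge.get_hatch() for wedge, answer in zip(wedges, counts.index)}
```

Looping over df.columns and calling fig.savefig(...) per column, as in the question, would then produce one pie per question.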

Python: How can I get a bar chart overview showing distinct values of a data frame?

I get an overview of all distinct values from a data frame with this lambda function:
overview = df.apply(lambda col: col.unique())
Which returns the desired result like that:
ColA [1,2,3,...]
ColB [4,5,6,7,8,9...]
ColC [A,B,C]
... ...
How can I visualize this result using subplots / multiple bar plots?
My first attempt was to pass the object straight to the DataFrame plot method, which apparently does not work. So I tried to create a DataFrame out of the object:
obj_overview = df.apply(lambda col: col.unique())
overview = {}
for attr, value in obj_overview.items():
    overview[attr] = value
df = pd.DataFrame(overview)
The output is:
ValueError: arrays must all be same length
So I tried using a list instead:
overview = []
for attr, value in obj_overview.items():
    overview.append({attr: value})
df = pd.DataFrame(overview)
But the result is a cross-matrix, which has as many rows as columns and row n refers to column n. Which is wrong, too.
How can I get an overview using multiple bar charts / sub plots showing distinct values of a data frame?
There are in fact two possible goals I'd like to achieve:
There are multiple bar charts, where every chart represents one column in the original dataframe. The x-axis shows all distinct / unique values, the y-axis shows the number of occurrences of each of those values. This is the nice-to-have option. I know that my current approach cannot cover this. It is modelled on a similar tool that Alteryx, for example, offers:
This should be possible with my current approach: a single (stacked) bar chart shows all columns, where the x-axis shows every column and each bar contains all distinct values of that column.
Thanks!
Separate Plots via value_counts:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'ColA': [1, 2, 4, 4, 5],
'ColB': [4, 4, 6, 6, 6],
'ColC': ['A', 'C', 'C', 'E', 'E']})
for col in df:
    df[col].value_counts().sort_index().plot(kind='bar', rot=0, ylabel='count')
    plt.show()
(Output: three separate bar charts, one each for ColA, ColB and ColC.)
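Since the question asks for subplots, the same value_counts idea can also be drawn into a single figure with one axes per column. This is only a sketch; the figure size and the use of the column name as the title are arbitrary choices for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"ColA": [1, 2, 4, 4, 5],
                   "ColB": [4, 4, 6, 6, 6],
                   "ColC": ["A", "C", "C", "E", "E"]})

# One axes per column, side by side in a single figure
fig, axes = plt.subplots(1, len(df.columns), figsize=(9, 3))
for ax, col in zip(axes, df.columns):
    df[col].value_counts().sort_index().plot(kind="bar", rot=0, ax=ax, title=col)
    ax.set_ylabel("count")
fig.tight_layout()

cola_heights = [int(bar.get_height()) for bar in axes[0].patches]
```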
Single Stacked Plot via melt + crosstab:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'ColA': [1, 2, 4, 4, 5],
'ColB': [4, 4, 6, 6, 6],
'ColC': ['A', 'C', 'C', 'E', 'E']})
overview = df.melt()
overview = pd.crosstab(overview['variable'], overview['value'])
ax = overview.plot(kind='bar', stacked=True, rot=0, ylabel='count')
ax.legend(bbox_to_anchor=(1.2, 1))
plt.tight_layout()
plt.show()
This will give you one heatmap for all numerical columns and one for all alphabetical columns, where the colour represents the number of occurrences. It's a different way to plot the info as an alternative.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
col_dict = {
'A': [1,2,3],
'B': [3,4,4,4,5,5,6],
'C': ['A','B','C'],
'D': ['C', 'D', 'D']
}
num_cols = []
num_idx = []
letter_cols = []
letter_idx = []
for col in col_dict:
    if isinstance(col_dict[col][0], int):
        num_cols += col_dict[col]
        num_idx.append(col)
    else:
        letter_cols += col_dict[col]
        letter_idx.append(col)
num_cols = sorted(list(set(num_cols)))
letter_cols = sorted(list(set(letter_cols)))
num_df = pd.DataFrame(0, index=num_idx, columns=num_cols)
letter_df = pd.DataFrame(0, index=letter_idx, columns=letter_cols)
for col in col_dict:
    if isinstance(col_dict[col][0], int):
        for item in col_dict[col]:
            num_df.loc[col, item] += 1
    else:
        for item in col_dict[col]:
            letter_df.loc[col, item] += 1
print(num_df)
print(letter_df)
plt.set_cmap('inferno')
plt.pcolor(num_df)
plt.yticks(np.arange(0.5, len(num_df.index), 1), num_df.index)
plt.xticks(np.arange(0.5, len(num_df.columns), 1), num_df.columns)
plt.colorbar()
plt.xlabel('Counts')
plt.ylabel('Columns')
plt.title('Numerical occurrences')
plt.figure()
plt.pcolor(letter_df)
plt.yticks(np.arange(0.5, len(letter_df.index), 1), letter_df.index)
plt.xticks(np.arange(0.5, len(letter_df.columns), 1), letter_df.columns)
plt.colorbar()
plt.xlabel('Counts')
plt.ylabel('Columns')
plt.title('Alphabetical occurrences')
plt.show()
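The counting loops in this answer can probably be shortened with value_counts. The sketch below builds the same kind of count table for the numerical columns only; restricting col_dict to those two columns is a simplification made for this example.

```python
import pandas as pd

# Numerical columns only (a simplification for this sketch)
col_dict = {
    "A": [1, 2, 3],
    "B": [3, 4, 4, 4, 5, 5, 6],
}

counts = (
    pd.DataFrame({col: pd.Series(values).value_counts() for col, values in col_dict.items()})
    .T                    # one row per original column, one column per distinct value
    .fillna(0)            # values that never occur in a column get a count of 0
    .astype(int)
    .sort_index(axis=1)   # distinct values in ascending order
)
print(counts)
```

The resulting frame has the same layout as num_df above, so it could be passed to plt.pcolor in the same way.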

Stacked bar plot in python / plotly (express): grouping / ordering of bars

I have data in a dataframe that I want to plot with a stacked bar plot:
test_df = pd.DataFrame([[1, 5, 1, 'A'], [2, 10, 1, 'B'], [3, 3, 1, 'A']], columns = ('ID', 'Value', 'Bucket', 'Type'))
If I do the plot with Plotly Express, I get bars stacked on each other and correctly ordered (based on the index):
fig = px.bar(test_df, x='Bucket', y='Value', barmode='stack')
However, I want to color the data based on Type, hence I go for
fig = px.bar(test_df, x='Bucket', y='Value', barmode='stack', color='Type')
This works, except now the ordering is messed up, because all bars are now grouped by Type. I looked through the docs of Plotly Express and couldn't find a way to specify the ordering of the bars independently. Any tips on how to do this?
I found this one here, but the scenario is a bit different and the options mentioned there don't seem to help me:
How to disable plotly express from grouping bars based on color?
Edit: This goes in the right direction, though it uses Plotly graph_objects rather than Plotly Express:
import pandas as pd
import plotly.graph_objects as go
test_df = pd.DataFrame([[1, 5, 1, 'A', 'red'], [2, 10, 1, 'B', 'blue'], [3, 3, 1, 'A', 'red']], columns = ('ID', 'Value', 'Bucket', 'Type', 'Color'))
fig = go.Figure()
fig.add_trace(go.Bar(x=test_df["Bucket"], y=test_df["Value"], marker_color=test_df["Color"]))
Output:
Still, I'd prefer the Express version, because so many things are easier to handle there (Legend, Hover properties etc.).
The only way I can understand your question is that you don't want B to be stacked on top of A, but rather the opposite. If that's the case, then you can get what you want through:
fig.data = fig.data[::-1]
fig.layout.legend.traceorder = 'reversed'
Some details:
fig.data = fig.data[::-1] simply reverses the order that the traces appear in fig.data and ultimately in the plotted figure itself. This will however reverse the order of the legend as well. So without fig.layout.legend.traceorder = 'reversed' the result would be:
And so it follows that the complete work-around looks like this:
fig.data = fig.data[::-1]
fig.layout.legend.traceorder = 'reversed'
Complete code:
import pandas as pd
import plotly.express as px
test_df = pd.DataFrame([[1, 5, 1, 'A'], [2, 10, 1, 'B'], [3, 3, 1, 'A']], columns = ('ID', 'Value', 'Bucket', 'Type'))
fig = px.bar(test_df, x='Bucket', y='Value', barmode='stack', color='Type')
fig.data = fig.data[::-1]
fig.layout.legend.traceorder = 'reversed'
fig.show()
Ok, sorry for the long delay on this, but I finally got around to solving this.
My solution is possibly not the most straightforward one, but it does work.
The basic idea is to use graph_objects instead of express and then iterate over the dataframe and add each bar as a separate trace. This way, each trace can get a name that can be grouped in a certain way (which is not possible if adding all bars in a single trace, or at least I could not find a way).
Unfortunately, the ordering of the legend is messed up (if you have more than 2 buckets) and there is currently no way in plotly to sort it. But that's a minor thing.
The main thing that bothers me is that this could've been so much easier if plotly.express allowed for manual ordering of the bars by a certain column.
Maybe I'll submit that as a suggestion.
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "browser"
test_df = pd.DataFrame(
[[1, 5, 1, 'B'], [3, 3, 1, 'A'], [5, 10, 1, 'B'],
[2, 8, 2, 'B'], [4, 5, 2, 'A'], [6, 3, 2, 'A']],
columns = ('ID', 'Value', 'Bucket', 'Type'))
# add named colors to the dataframe based on type
test_df.loc[test_df['Type'] == 'A', 'Color'] = 'Crimson'
test_df.loc[test_df['Type'] == 'B', 'Color'] = 'ForestGreen'
# ensure that the dataframe is sorted by the values
test_df.sort_values('ID', inplace=True)
fig = go.Figure()
# it's tedious to iterate over each item, but only this way we can ensure that everything is correctly ordered and labelled
# Set up legend_show_dict to check if an item should be shown or not. This should be only done for the first occurrence to avoid duplication.
legend_show_dict = {}
for i, row in test_df.iterrows():
    if row['Type'] in legend_show_dict:
        legend_show = legend_show_dict[row['Type']]
    else:
        legend_show = True
        legend_show_dict[row['Type']] = False
    fig.add_trace(
        go.Bar(
            x=[row['Bucket']],
            y=[row['Value']],
            marker_color=row['Color'],
            name=row['Type'],
            legendgroup=row['Type'],
            showlegend=legend_show,
            hovertemplate="<br>".join([
                'ID: ' + str(row['ID']),
                'Value: ' + str(row['Value']),
                'Bucket: ' + str(row['Bucket']),
                'Type: ' + row['Type'],
            ])
        ))
fig.update_layout(
xaxis={'categoryorder': 'category ascending', 'title': 'Bucket'},
yaxis={'title': 'Value'},
legend={'traceorder': 'normal'}
)
fig.update_layout(barmode='stack', font_size=20)
fig.show()
This is what it should look like then:
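The "show the legend entry only for the first occurrence of each Type" bookkeeping could also be precomputed with pandas instead of a dictionary. The sketch below shows just that step; the ShowLegend column name is made up for this example.

```python
import pandas as pd

test_df = pd.DataFrame(
    [[1, 5, 1, "B"], [3, 3, 1, "A"], [5, 10, 1, "B"],
     [2, 8, 2, "B"], [4, 5, 2, "A"], [6, 3, 2, "A"]],
    columns=("ID", "Value", "Bucket", "Type"),
)
test_df = test_df.sort_values("ID")

# duplicated() is False for the first row of each Type and True for every repeat,
# so its negation marks exactly the rows whose trace should appear in the legend
test_df["ShowLegend"] = ~test_df["Type"].duplicated()
print(test_df[["ID", "Type", "ShowLegend"]])
```

Inside the loop, showlegend=bool(row['ShowLegend']) would then replace the legend_show_dict lookup.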

Plotting an array of events with respective colours

I'm pretty new to matplotlib and I'm trying to plot an array for time-series use, but I don't use a date, only the index for order. I have another array with a colour code for every entry in the previous array.
I'm trying to plot them similar to this but only in one line.
My data looks like:
array = ['event0', 'event1', 'event2', 'event0', 'event6', ..]
colours = ['r', 'g', 'b', 'r', 'y', ..]
import matplotlib.pyplot as plt
# input data
array = ['event0', 'event1', 'event2', 'event0', 'event6']
colours = ['r', 'g', 'b', 'r', 'y']
# for easier plotting, convert your data into numerical data:
method 1:
events of same type get same y-coordinate
array_numbers = [float(a.split('event')[1]) for a in array]
method 2:
all events get the same y-coordinate:
array_numbers = [1 for a in array]
plotting:
# create figure, and subplots
fig = plt.figure()
ax = plt.subplot(111)
# plot newly generated numbers against their index, using the colours specified
ax.scatter(range(len(array_numbers)), array_numbers, c=colours)
# create ticklabels:
y_tick_labels = ['sensor-{}'.format(a) for a in range(7)]
# set positions of ticks, and add names
ax.set_yticks(range(7))
ax.set_yticklabels(y_tick_labels)
You can use plt.scatter(x, y, c=colours)
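As a minimal sketch of that one-liner, the events can be placed on a single horizontal line with their index as the x position. Deriving the colors from a dictionary is an assumption made for this example; the question already has a ready-made colours list that could be passed directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt

events = ["event0", "event1", "event2", "event0", "event6"]

# Hypothetical mapping from event name to color
colour_of = {"event0": "r", "event1": "g", "event2": "b", "event6": "y"}
colours = [colour_of[event] for event in events]

fig, ax = plt.subplots(figsize=(6, 1.5))
# x is the position in the sequence, y is constant so all events share one line
points = ax.scatter(range(len(events)), [0] * len(events), c=colours, s=100)
ax.set_yticks([])
ax.set_xticks(range(len(events)))
ax.set_xticklabels(events)
```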
