I have 69 machines, each with 12 months of production data.
I plotted them all with groupby.plot() and got a long list of separate figures. How can I arrange them in a tight layout so I can view them all at once? I would like 7 plots per row, which means ceil(69/7) = 10 rows. Please help!
c1.groupby('System ID').plot(x='Month', y='Monthly Production',kind='bar',legend=True)
I thought I'd add an example using seaborn, since it makes wrapping plots into columns quite easy. I expect someone could provide a nicer answer, perhaps using pandas, and I hope they do.
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(1)
N = 2000
df = pd.DataFrame(np.random.randint(0,4, (N,7)))
df['system'] = np.random.randint(0, 69, N )
Which gives df as (a few sampled rows):
      0  1  2  3  4  5  6  system
674   1  2  3  1  0  0  0      15
1699  0  0  1  3  0  0  1       9
1282  0  0  0  0  1  0  2      47
1315  0  3  1  3  1  1  1      37
1210  1  1  0  3  1  3  1      11
Melting the data before plotting:
df_plot = df.melt(id_vars='system')
Which looks like this:
       system  variable  value
8756       23         4      2
5474       24         2      2
11242      12         5      2
7820       56         3      3
Then
sns.catplot(x = 'variable', y = 'value', col = 'system',
            hue = 'variable', dodge = False,
            col_wrap = 6, data = df_plot, kind = 'bar',
            errorbar = None)  # on seaborn older than 0.12, use ci = None instead
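For the pandas route mentioned above, here is a minimal sketch, assuming the data can be pivoted so that each system becomes one column. The frame c1 is a synthetic stand-in with the question's column names (the values are made up), and the 7-column grid comes from the question; the subplots and layout arguments of DataFrame.plot do the wrapping.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's c1 (assumption: values are random).
rng = np.random.default_rng(0)
n_systems, n_months = 69, 12
c1 = pd.DataFrame({
    "System ID": np.repeat(np.arange(n_systems), n_months),
    "Month": np.tile(np.arange(1, n_months + 1), n_systems),
    "Monthly Production": rng.integers(50, 150, n_systems * n_months),
})

ncols = 7
nrows = math.ceil(n_systems / ncols)  # 10 rows are needed for 69 systems

# Reshape to one row per month and one column per system.
wide = c1.pivot(index="Month", columns="System ID", values="Monthly Production")

# subplots=True draws each column on its own axes; layout sets the grid shape.
axes = wide.plot(kind="bar", subplots=True, layout=(nrows, ncols),
                 sharex=True, legend=False, figsize=(20, 15))
plt.tight_layout()
```

The grid has one more cell than there are systems; pandas leaves the unused cell blank.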
Here's my final answer.
import matplotlib.pyplot as plt

# CREATE the groupby first, because the ordering below depends on it
grouped = c1.groupby("System ID")
# Order the systems by the last production value in each group, largest first
ordered_systems = grouped['Monthly Production'].last().sort_values(ascending=False).index
first_month = c1['Month'].min()
last_month = c1['Month'].max()

# We can ask for ALL THE AXES and put them into axes
# (10 rows x 7 columns = 70 axes, one more than the 69 systems)
fig, axes = plt.subplots(nrows=10, ncols=7, sharex=True, sharey=False, figsize=(20,15))
axes_list = [item for sublist in axes for item in sublist]

# Now instead of looping through the groupby
# you LOOP through the ordered names
# and you use .get_group to get the right group
for system in ordered_systems:
    selection = grouped.get_group(system)
    ax = axes_list.pop(0)
    selection.plot(x='Month', y='Monthly Production', label=system, ax=ax, legend=False)
    selection.plot(x='Month', y='Monthly Usage', secondary_y=True, ax=ax, legend=False)
    ax.set_title(system)
    # tick_params expects booleans; the string 'off' is not treated as False
    ax.tick_params(
        which='both',
        bottom=False,
        left=False,
        right=False,
        top=False
    )
    ax.grid(linewidth=0.25)
    ax.set_xlim((first_month, last_month))
    ax.set_xlabel("")
    ax.set_xticks((first_month, last_month))
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Now use the matplotlib .remove() method to
# delete anything we didn't use
for ax in axes_list:
    ax.remove()
plt.subplots_adjust(hspace=1)
plt.tight_layout()
I am trying to change the color of each individual bar in my figure. The code I used is below. Instead of each bar taking the color that I set in c, there are several colors within each bar. I have included a screenshot of this. How can I fix this? Thank you all in advance!
Clusters is just a categorical variable with 5 groups, ranging from 0 to 4. I have included a second screenshot of the dataframe.
Essentially, I am trying to plot each cluster for economic ideology and social ideology so that I have a visual comparison of the 5 clusters over these two dimensions. Each cluster should be represented by one color. For example, cluster 0 should be red.
c = ['#bf1111', '#1c4975', '#278f36', '#47167a', '#de8314']
plt.subplot(1, 2, 1)
plt.bar(data = ANESdf_LatNEW, height = "EconIdeo",
x = "clusters", color = c)
plt.title('Economic Ideology')
plt.xticks([0, 1, 2, 3, 4])
plt.xlabel('Clusters')
plt.ylabel('')
plt.subplot(1, 2, 2)
plt.bar(data = ANESdf_LatNEW, height = "SocialIdeo",
x = "clusters", color = c)
plt.title('Social Ideology')
plt.xticks([0, 1, 2, 3, 4])
plt.xlabel('Clusters')
plt.ylabel('')
plt.show()
Bar graph here
Top 5 rows of dataframe
I have tried multiple ways of changing colors. For example, instead of having c, I had put in the colors directly at color = ... This did not work either.
Here is a script that does what you seem to be looking for based on your edits and comment.
Note that I do not assume that all clusters have the same size in this context; if that is the case, this approach can be simplified.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# sample dataframe
df = pd.DataFrame(
    {
        'EconIdeo':[1,2,3,4,3,5,7],
        'Clusters':[2,3,0,1,3,0,3]
    })
print(df)
# parameters: width for each cluster, colors for each cluster
# (if clusters are not sequential from zero, replace c with dictionary)
width = .75
c = ['#bf1111', '#1c4975', '#278f36', '#47167a', '#de8314']
# cast to float so the fractional offsets below can be stored in the column
df['xpos'] = df['Clusters'].astype(float)
df['width'] = width
df['color'] = ''
clusters = df['Clusters'].unique()
for k in clusters:
    where = (df['Clusters'] == k)
    n = where.sum()
    # centers of n equal slots spanning [-width/2, width/2] around the cluster
    df.loc[where,'xpos'] += np.linspace(-width/2,width/2,2*n+1)[1:-1:2]
    df.loc[where,'width'] /=n
    df.loc[where,'color'] = c[k]
plt.bar(data = df, height = "EconIdeo", x = 'xpos',
        width = 'width', color = 'color')
plt.xticks(clusters,clusters)
plt.show()
Resulting plot:
Input dataframe:
EconIdeo Clusters
0 1 2
1 2 3
2 3 0
3 4 1
4 3 3
5 5 0
6 7 3
Dataframe after script applies changes (to include plotting specifications)
EconIdeo Clusters xpos width color
0 1 2 2.0000 0.750 #278f36
1 2 3 2.7500 0.250 #47167a
2 3 0 -0.1875 0.375 #bf1111
3 4 1 1.0000 0.750 #1c4975
4 3 3 3.0000 0.250 #47167a
5 5 0 0.1875 0.375 #bf1111
6 7 3 3.2500 0.250 #47167a
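The multi-colored bars in the question arise because plt.bar draws one bar per dataframe row and cycles through the color list in row order, so several differently colored bars overlap at each cluster position. If a single bar per cluster is enough, a simpler sketch is to aggregate first and then look up one color per cluster. Using the mean as the aggregate is an assumption here; the question does not say which summary is wanted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "EconIdeo": [1, 2, 3, 4, 3, 5, 7],
    "Clusters": [2, 3, 0, 1, 3, 0, 3],
})
c = ['#bf1111', '#1c4975', '#278f36', '#47167a', '#de8314']

# Assumption: one bar per cluster showing the mean value of that cluster.
means = df.groupby("Clusters")["EconIdeo"].mean()

# One color per cluster label, in the same order as the bars.
bar_colors = [c[k] for k in means.index]

fig, ax = plt.subplots()
ax.bar(means.index, means.values, color=bar_colors)
ax.set_xticks(list(means.index))
ax.set_xlabel("Clusters")
ax.set_title("Economic Ideology (mean per cluster)")
```

Because the color list now has exactly one entry per bar, cluster 0 is always drawn in the first color, cluster 1 in the second, and so on.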
I tried to make a bar plot.
purchase_value:
   Buy_Coffee  Buy_ColdDrinks  Buy_Juices  Buy_Pastries  Buy_Sandwiches
0           0               1           0             1               0
1           1               0           0             0               0
2           1               0           0             0               1
3           1               0           0             0               0
4           1               0           0             0               1
5           1               0           0             0               0
plt.bar(purchase_value.index,
purchase_value.value_counts(),
width=0.5,
bottom=None,
align='center',
color=['lightsteelblue',
'cornflowerblue',
'royalblue',
'midnightblue',
'darkblue'])
plt.xticks(rotation='vertical')
plt.show()
But it raised this error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
When you call pandas.DataFrame.value_counts, you compute how often each unique combination of row values occurs.
Applied to the data you provided, this gives:
Buy_Coffee  Buy_ColdDrinks  Buy_Juices  Buy_Pastries  Buy_Sandwiches
1           0               0           0             0                 3
                                                      1                 2
0           1               0           1             0                 1
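The number of unique combinations (3 here) is generally not equal to the number of rows in your data (purchase_value.index has 6 entries), so plt.bar receives x and height arrays of different lengths and raises the shape mismatch error.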
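The mismatch can be checked directly. The sketch below rebuilds the table from the question by hand (instead of reading data/data.csv) and compares the two lengths that plt.bar was given; the variable names combo_counts and item_totals are only illustrative.

```python
import pandas as pd

# The six rows shown in the question.
purchase_value = pd.DataFrame({
    "Buy_Coffee":     [0, 1, 1, 1, 1, 1],
    "Buy_ColdDrinks": [1, 0, 0, 0, 0, 0],
    "Buy_Juices":     [0, 0, 0, 0, 0, 0],
    "Buy_Pastries":   [1, 0, 0, 0, 0, 0],
    "Buy_Sandwiches": [0, 0, 1, 0, 1, 0],
})

# One entry per unique combination of row values, sorted by frequency.
combo_counts = purchase_value.value_counts()
print(len(purchase_value.index), len(combo_counts))  # lengths differ: 6 vs 3

# Per-item totals have one entry per column, which is what a
# "how often was each item bought" bar chart needs.
item_totals = purchase_value.sum(axis=0)
print(item_totals)
```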
If I understand correctly what you want to plot, you should use:
import pandas as pd
import matplotlib.pyplot as plt
purchase_value = pd.read_csv(r'data/data.csv')
fig, ax = plt.subplots()
purchase_value.plot(kind = 'bar',
ax = ax,
stacked = True,
width=0.5,
bottom=0,
align='center',
color=['lightsteelblue',
'cornflowerblue',
'royalblue',
'midnightblue',
'darkblue'])
plt.show()
If you would like to draw the total of each item (the sum of each column), you should use:
fig, ax = plt.subplots()
purchase_value.sum(axis = 0).plot(kind = 'bar',
ax = ax,
stacked = True,
width=0.5,
bottom=0,
align='center',
color=['lightsteelblue',
'cornflowerblue',
'royalblue',
'midnightblue',
'darkblue'])
plt.tight_layout()
plt.show()
I have three-column data in a file named "sample1.dat" and a code that reads the columns and tries to plot the 3rd column against the 2nd column. I pick up parameter values from the 1st column elements as long as their values remain the same.
"sample1.dat" reads
0 1 1
0 2 4
0 3 9
0 4 16
0 5 25
0 6 36
1 1 1
1 2 8
1 3 27
1 4 64
1 5 125
1 6 216
2 1 1
2 2 16
2 3 81
2 4 256
2 5 625
2 6 1296
And my code:
import matplotlib.pyplot as plt
import numpy as np
data = np.loadtxt('sample1.dat')
x = data[:,0]
y = data[:,1]
z = data[:,2]
L = len(data)
col = ['r','g','b']
x0 = x[0]; j=0; jold=-1
for i in range(L):
    print('j, col[j]=',j, col[j])
    if x[i] == x0:
        print('y[i], z[i]=',y[i],z[i])
        if i==0 or j != jold: # j-index decides new or the same parameter
            label = 'parameter = {}'.format(x0)
        else:
            label = ''
        print('label =',label)
        plt.plot(y[i], z[i], color=col[j], marker='o', label=label)
    else:
        x0 = x[i] # Update when x-value changes,
                  # i.e. pick up the next parameter value
        i -= 1 # Shift back else we miss the 1st point for new x-value
        j += 1; jold = j
plt.legend()
plt.xlabel('2nd column')
plt.ylabel('3rd column')
plt.savefig('sample1.png')
plt.show()
The plot outcome:
One can clearly see that two issues persist:
1. The legend shows entries only for the first parameter, and repeats them, even though I tried to avoid the repetition in my code.
2. No connecting line is drawn, even though the legend entries show a line plus a marker.
How can I resolve these, or is there a smarter way of coding this?
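The first issue comes from the bookkeeping with j, jold and x0. Because jold is set equal to j immediately after j is incremented, the test j != jold is true only while the first parameter is being plotted, so that label is repeated and the later ones are never set. In addition, i -= 1 has no effect inside for i in range(L), so the first point of each new x-value is skipped. The code can be simplified by drawing all y,z for each x-value at once. NumPy allows selecting the y's corresponding to a given x0 as y[x == x0].
The second issue arises because each plt.plot call receives a single point, so there is nothing to connect, while the legend entry still shows the default line plus marker. Setting the linestyle explicitly, i.e. ls='', makes the legend match the markers-only plot.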
import matplotlib.pyplot as plt
import numpy as np
data = np.loadtxt('sample1.dat')
x = data[:, 0]
y = data[:, 1]
z = data[:, 2]
colors = ['r', 'g', 'b']
for x0, color in zip(np.unique(x), colors):
    plt.plot(y[x == x0], z[x == x0], color=color, marker='o', ls='', label=f'parameter = {x0:.0f}')
plt.legend()
plt.xlabel('2nd column')
plt.ylabel('3rd column')
plt.show()
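As a side note on the loop in the question: the line i -= 1 cannot rewind a Python for loop, because the loop variable is simply reassigned from the range iterator on the next pass. A tiny illustration (the list name visited is only for this demo):

```python
visited = []
for i in range(5):
    visited.append(i)
    if i == 2:
        i -= 1  # rebinds the local name only; the range iterator is unaffected

print(visited)  # [0, 1, 2, 3, 4] -- index 2 is not visited twice
```

A while loop with a manually managed index would be needed to step back, which is one more reason to prefer the mask-based selection shown above.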
An alternative approach would use the seaborn library, which does the selecting and coloring without a lot of intervention, for example:
import seaborn as sns
sns.scatterplot(x=y, y=z, hue=x, palette=['r', 'g', 'b'])
Seaborn can automatically add labels if the data is organized as a dictionary or a pandas dataframe:
data = {'first column': x.astype(int),
'second column': y,
'third column': z}
sns.scatterplot(data=data, x='second column', y='third column', hue='first column', palette=['r', 'g', 'b'])
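If the connecting lines from the question are actually wanted, a possible variation of the matplotlib loop is to pass the whole group to plot and keep the default linestyle. This is a sketch only: the arrays are generated inline so that they reproduce the values of sample1.dat without reading the file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Same values as sample1.dat: the third column equals y ** (x + 2).
x = np.repeat([0, 1, 2], 6)
y = np.tile(np.arange(1, 7), 3)
z = y ** (x + 2)

fig, ax = plt.subplots()
for x0, color in zip(np.unique(x), ['r', 'g', 'b']):
    mask = x == x0
    # Passing all points of a group in one call lets matplotlib connect
    # them with its default solid line, so the plot and legend agree.
    ax.plot(y[mask], z[mask], color=color, marker='o', label=f'parameter = {x0}')
ax.legend()
ax.set_xlabel('2nd column')
ax.set_ylabel('3rd column')
```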
You can get the result you want in a few lines by using pandas and seaborn.
If you add column names (for instance A, B, and C) to the data in the sample1.dat file as follows:
A B C
0 1 1
0 2 4
0 3 9
0 4 16
0 5 25
0 6 36
1 1 1
1 2 8
1 3 27
1 4 64
1 5 125
1 6 216
2 1 1
2 2 16
2 3 81
2 4 256
2 5 625
2 6 1296
You can then load your data in a pandas dataframe and plot it with seaborn:
import pandas as pd
import seaborn as sns
df=pd.read_fwf('sample1.dat')
col = ['r','g','b']
sns.scatterplot(data=df,x='B',y='C',hue='A',palette=col)
And the output gives:
I have some data grouped by two columns, with a count column:
Category Subcategory Count
0 1 1 10
1 1 2 15
2 1 3 16
3 2 1 2
4 2 2 8
5 2 3 12
6 3 1 33
7 3 3 23
8 4 2 3
9 5 1 2
I would like to plot a clustered column chart based on the above data.
Not all categories contain all subcategories, so for these the plot should show 0. I would like to show values as counts of subcategory within a category, as percentage of the category.
Here is an example chart that has 2 categories with multiple subcategories shown as separate clusters. I would like to achieve a similar result.
https://imge.to/i/AVUiY
Additional question: is it possible to get a break in the scale at the Y axis, so that the outlier columns (high values) become smaller, and the small values become more visible?
I hard coded a few things just to get right to the plotting, so first you will want to create what I have called "cat1"-"cat5" from your columns of data.
import numpy as np
import matplotlib.pyplot as plt
# data to plot
n_groups = 3 #number of subcategories
cat1 = (10,15,16)
cat2 = (2,8,12)
cat3 = (33,0,23)
cat4 = (0,3,0)
cat5 = (2,0,0)
# create plot
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.1
rects1 = plt.bar(index, cat1, bar_width, label='1')
rects2 = plt.bar(index + bar_width, cat2, bar_width, label='2')
rects3 = plt.bar(index + 2*bar_width, cat3, bar_width, label='3')
rects4 = plt.bar(index + 3*bar_width, cat4, bar_width, label='4')
rects5 = plt.bar(index + 4*bar_width, cat5, bar_width, label='5')
plt.xlabel('Subcategory')
plt.ylabel('Count')
plt.title('Count by Category')
plt.xticks(index + 2 * bar_width, ('1', '2', '3'))  # center each tick under the middle of its 5 bars
plt.legend()
plt.tight_layout()
plt.show()
To answer your second question check out the brokenaxes package: https://github.com/bendichter/brokenaxes
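The hard-coded tuples above can also be derived from the original table, including the zeros for missing subcategories and the percentages asked for in the question. A minimal sketch, assuming the data is in a dataframe with the three columns shown in the question (the names data, counts and percent are only illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "Category":    [1, 1, 1, 2, 2, 2, 3, 3, 4, 5],
    "Subcategory": [1, 2, 3, 1, 2, 3, 1, 3, 2, 1],
    "Count":       [10, 15, 16, 2, 8, 12, 33, 23, 3, 2],
})

# Rows are categories, columns are subcategories; missing pairs become 0.
counts = data.pivot(index="Category", columns="Subcategory", values="Count").fillna(0)

# Share of each subcategory within its category, in percent.
percent = counts.div(counts.sum(axis=1), axis=0) * 100

# One cluster per category, one bar per subcategory.
ax = percent.plot(kind="bar", width=0.8)
ax.set_ylabel("Share of category (%)")
plt.tight_layout()
```

Plotting percent.T instead would cluster by subcategory, which matches the layout of the hard-coded example.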
I know that you can use the mosaic plot from statsmodels, but it is a bit frustrating when your categories have some empty values (like here). I was wondering whether there is a solution with a plotting library like matplotlib or seaborn, which would be handier.
I think it would be a nice feature for seaborn, as contingency tables are frequently built with pandas. However, it seems that it won't be implemented anytime soon.
Finally, how can I make a mosaic plot with 3 dimensions and possibly empty categories?
Here is a generic mosaic plot (from wikipedia)
As nothing existed in python, here is the code I made. The last dimension should be of size 1 (i.e. a regular table) or 2 for now. Feel free to update the code to fix that, it might be unreadable with more than 3, though.
It's a bit long but it does the job. Example below.
There are a few options. Most are self-explanatory; the others are:
dic_color_row: a dictionary whose keys are the outermost index values (Index_1 in the example below) and whose values are colors; avoid black/gray colors
pad: the space between the bars of the plot
alpha_label: the 3rd dimension is distinguished by transparency (alpha) and rendered as dark grey / light grey in the legend; alpha_label lets you rename these legend entries (similar to col_labels or row_labels)
color_ylabel: whether to add a background color to the y-tick labels [True/False]
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def mosaic_plot(df, dic_color_row, row_labels=None, col_labels=None, alpha_label=None, top_label="Size",
                x_label=None, y_label=None, pad=0.01, color_ylabel=False, ax=None, order="Size"):
    """
    From a contingency table NxM, plot a mosaic plot with the values inside. There should be a double-index for rows
    e.g.
                      3   4   1   0   2  5
    Index_1 Index_2
    AA      C         0   0   0   2   3  0
            P         6   0   0  13   0  0
    BB      C         0   2   0   0   0  0
            P        45   1  10  10   1  0
    CC      C         0   6  35  15  29  0
            P         1   1   0   2   0  0
    DD      C         0  56   0   3   0  0
            P        30   4   2   0   1  9
    order: how columns are ordered, by default from the biggest to the smallest category. Possible values are
        - "Size" [default]
        - "Normal" : as the columns are ordered in the input df
        - list of column names to reorder the columns
    top_label: Size of each column. The label can be changed to adapt to your value.
               If `False`, nothing is displayed and the secondary legend is set on top instead of on right.
    """
    is_multi = len(df.index.names) == 2
    if ax is None:
        fig, ax = plt.subplots(1,1, figsize=(len(df.columns), len(df.index.get_level_values(0).unique())))

    # Column totals (largest first) and their share of the grand total (bar widths)
    size_col = df.sum().sort_values(ascending=False)
    prop_com = size_col.div(size_col.sum())

    if order == "Size":
        df = df[size_col.index.values]
    elif order == "Normal":
        prop_com = prop_com[df.columns]
        size_col = size_col[df.columns]
    else:
        df = df[order]
        prop_com = prop_com[order]
        size_col = size_col[order]

    if is_multi:
        inner_index = df.index.get_level_values(1).unique()
        # Share of the first inner level within each (outer level, column) cell
        prop_ii0 = (df.swaplevel().loc[inner_index[0]]/(df.swaplevel().loc[inner_index[0]]+df.swaplevel().loc[inner_index[1]])).fillna(0)
        alpha_ii = 0.5
        true_y_labels = df.index.levels[0]
    else:
        alpha_ii = 1
        true_y_labels = df.index

    Yt = (df.groupby(level=0).sum().iloc[:,0].div(df.groupby(level=0).sum().iloc[:,0].sum())+pad).cumsum() - pad
    Ytt = df.groupby(level=0).sum().iloc[:,0].div(df.groupby(level=0).sum().iloc[:,0].sum())

    x = 0
    # .items() replaces .iteritems(), which was removed in pandas 2.0
    for j in df.groupby(level=0).sum().items():
        bot = 0
        S = float(j[1].sum())
        for lab, k in j[1].items():
            bars = []
            ax.bar(x, k/S, width=prop_com[j[0]], bottom=bot, color=dic_color_row[lab], alpha=alpha_ii, lw=0, align="edge")
            if is_multi:
                ax.bar(x, k/S, width=prop_com[j[0]]*prop_ii0.loc[lab, j[0]], bottom=bot, color=dic_color_row[lab], lw=0, alpha=1, align="edge")
            bot += k/S + pad
        x += prop_com[j[0]] + pad

    ## Aesthetic of the plot and ticks
    # Y-axis
    if row_labels is None:
        row_labels = Yt.index
    ax.set_yticks(Yt - Ytt/2)
    ax.set_yticklabels(row_labels)

    ax.set_ylim(0, 1 + (len(j[1]) - 1) * pad)
    if y_label is None:
        y_label = df.index.names[0]
    ax.set_ylabel(y_label)

    # X-axis
    if col_labels is None:
        col_labels = prop_com.index
    xticks = (prop_com + pad).cumsum() - pad - prop_com/2.
    ax.set_xticks(xticks)
    ax.set_xticklabels(col_labels)
    ax.set_xlim(0, prop_com.sum() + pad * (len(prop_com)-1))

    if x_label is None:
        x_label = df.columns.name
    ax.set_xlabel(x_label)

    # Top label
    if top_label:
        ax2 = ax.twiny()
        ax2.set_xlim(*ax.get_xlim())
        ax2.set_xticks(xticks)
        ax2.set_xticklabels(size_col.values.astype(int))
        ax2.set_xlabel(top_label)
        ax2.tick_params(top=False, right=False, pad=0, length=0)

    # Ticks and axis settings
    ax.tick_params(top=False, right=False, pad=5)
    sns.despine(left=0, bottom=False, right=0, top=0, offset=3)

    # Legend
    if is_multi:
        if alpha_label is None:
            alpha_label = inner_index
        bars = [ax.bar(np.nan, np.nan, color="0.2", alpha=[1, 0.5][b]) for b in range(2)]
        if top_label:
            plt.legend(bars, alpha_label, loc='center left', bbox_to_anchor=(1, 0.5), ncol=1, )
        else:
            plt.legend(bars, alpha_label, loc="lower center", bbox_to_anchor=(0.5, 1), ncol=2)
    plt.tight_layout(rect=[0, 0, .9, 0.95])

    if color_ylabel:
        for tick, label in zip(ax.get_yticklabels(), true_y_labels):
            tick.set_bbox(dict( pad=5, facecolor=dic_color_row[label]))
            tick.set_color("w")
            tick.set_fontweight("bold")

    return ax
With a dataframe you get after a crosstabulation:
df
Index_1 Index_2 v w x y z
AA Q 0 0 0 2 3
AA P 6 0 0 13 0
BB Q 0 2 0 0 0
BB P 45 1 10 10 1
CC Q 0 6 0 15 9
CC P 0 1 0 2 0
DD Q 0 56 0 3 0
DD P 30 4 2 0 1
make sure that you have the 2 columns as index:
df.set_index(["Index_1", "Index_2"], inplace=True)
and then just call:
mosaic_plot(df,
{"AA":"r", "BB":"b", "CC":"y", "DD":"g"}, # dict of color, mandatory
x_label='My Category',
)
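To try the call above, the example table can be built by hand as in the sketch below. It also computes the two quantities that mosaic_plot derives first, the column totals and their proportions, which determine the column order (order="Size") and the bar widths. The values are copied from the table shown above; nothing else is assumed.

```python
import pandas as pd

# The crosstabulated example table, typed in by hand.
df = pd.DataFrame({
    "Index_1": ["AA", "AA", "BB", "BB", "CC", "CC", "DD", "DD"],
    "Index_2": ["Q", "P", "Q", "P", "Q", "P", "Q", "P"],
    "v": [0, 6, 0, 45, 0, 0, 0, 30],
    "w": [0, 0, 2, 1, 6, 1, 56, 4],
    "x": [0, 0, 0, 10, 0, 0, 0, 2],
    "y": [2, 13, 0, 10, 15, 2, 3, 0],
    "z": [3, 0, 0, 1, 9, 0, 0, 1],
})
df.set_index(["Index_1", "Index_2"], inplace=True)

# Same first steps as inside mosaic_plot: column totals, largest first,
# and each column's share of the grand total (used as the bar width).
size_col = df.sum().sort_values(ascending=False)
prop_com = size_col.div(size_col.sum())
print(size_col)
print(prop_com.round(3))
```

With this table the columns are drawn in the order v, w, y, z, x, and the totals 81, 70, 45, 14, 12 appear as the top labels.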
It's not perfect, but I hope it will help others.