Making bar plot of different clusters - python

I am currently learning K-means, so I am writing a Python program that groups similar texts into clusters.
I now have the results for two different clusters (the words are fictional, but everything else is the same).
print(dfs) = [
    features     score
0    America  0.577350
1        new  0.288675
2  president  0.288675
3      Biden  0.288675,
    features     score
0     Corona  0.593578
1   COVID-19  0.296789
2   research  0.296789
3     health  0.158114]
And dfs is the following type
type(dfs) = list
And the following:
type(dfs[0]) = pandas.core.frame.DataFrame
But how can I easily create bar plots for each cluster in dfs where you see the score attached to each word?
Thanks in advance!

Iterate over the dfs list to access the individual dataframes, then call df.plot.bar with x='features' and y='score' to draw the bar chart for that dataframe. Use the axis returned by the plot function to attach the score to each bar in the features column: iterate over the patches of the bar-plot axis and pass the x of each rectangle's anchor point and the height of the bar to the annotate function.
...
...
fig, axes = plt.subplots(1, len(dfs))
for num, df in enumerate(dfs):
    ax = df.plot.bar(x='features', y='score', ax=axes[num])
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.4f}', xy=(p.get_x() * 1.01, p.get_height() * 1.01))
    axes[num].tick_params(axis='x', labelrotation=30)
    axes[num].set_title(f'Dataframe #{num}')
plt.show()
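For a version that can be run end to end, the sketch below rebuilds dfs from the scores printed in the question and labels the bars with Axes.bar_label. That is a swapped-in alternative to the manual annotate loop and assumes matplotlib 3.4 or newer. The Agg backend line is only there so the sketch runs without a display; drop it for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive sessions
import matplotlib.pyplot as plt
import pandas as pd

# Sample data copied from the question's printed output
dfs = [
    pd.DataFrame({"features": ["America", "new", "president", "Biden"],
                  "score": [0.577350, 0.288675, 0.288675, 0.288675]}),
    pd.DataFrame({"features": ["Corona", "COVID-19", "research", "health"],
                  "score": [0.593578, 0.296789, 0.296789, 0.158114]}),
]

# squeeze=False keeps `axes` two-dimensional even when len(dfs) == 1
fig, axes = plt.subplots(1, len(dfs), figsize=(10, 4), squeeze=False)
for num, (ax, df) in enumerate(zip(axes[0], dfs)):
    df.plot.bar(x="features", y="score", ax=ax, legend=False)
    # The plot creates one BarContainer; label each of its bars with the bar height
    ax.bar_label(ax.containers[0], fmt="%.4f")
    ax.tick_params(axis="x", labelrotation=30)
    ax.set_title(f"Cluster #{num}")
fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to see the result
```

Using squeeze=False is a small design choice: without it, plt.subplots returns a single Axes object when dfs holds only one dataframe, and indexing it would fail.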

Related

Stacked bar plot in subplots using pandas .plot()

I created a hypothetical DataFrame containing 3 measurements for 20 experiments. Each experiment is associated with a Subject (3 possibilities).
import random
import numpy as np
import pandas as pd
random.seed(42)  # set seed
tuples = list(zip(range(20), random.choices(['Jean', 'Marc', 'Paul'], k=20)))  # index labels
index = pd.MultiIndex.from_tuples(tuples, names=['num_exp', 'Subject'])  # index
test = pd.DataFrame(np.random.randint(0, 100, size=(20, 3)), index=index, columns=['var1', 'var2', 'var3'])  # DataFrame
test.head()  # first lines
I succeeded in constructing stacked bar plots with the 3 measurements (each bar is an experiment) for each subject:
test.groupby('Subject').plot(kind='bar', stacked=True,legend=False) #plots
Now, I would like to put each plot (for each subject) in a subplot. If I use the "subplots" argument, it gives me the following :
test.groupby('Subject').plot(kind='bar', stacked=True,legend=False,subplots= True) #plot with subplot
It created a subplot for each measurement because they correspond to columns in my DataFrame.
I don't know how I could do otherwise because I need them as columns to create stacked bars.
So here is my question :
Is it possible to construct this kind of figure with stacked bar plots in subplots (ideally in an elegant way, without iterating) ?
Thanks in advance !
I solved my problem with a simple loop, using nothing other than pandas .plot().
Pandas .plot() has an ax parameter that accepts a matplotlib axes object.
So, starting from the list of distinct subjects :
subj= list(dict.fromkeys(test.index.get_level_values('Subject')))
I define my subplots :
fig, axs = plt.subplots(1, len(subj))
Then, I have to iterate for each subplot :
for a in range(len(subj)):
    test.loc[test.index.get_level_values('Subject') == subj[a]].unstack(level=1).plot(ax=axs[a], kind='bar', stacked=True, legend=False, xlabel='', fontsize=10)  # plot
    axs[a].set_title(subj[a], pad=0, fontsize=15)  # title
    axs[a].tick_params(axis='y', pad=0, size=1)  # yticks
And it works well!
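The same idea can also be written by iterating over the groupby object directly, which avoids the boolean mask and the unstack call. This is only a sketch under assumptions: the subjects are assigned in a fixed repeating pattern instead of with random.choices, and the values come from NumPy's default_rng, so the data differ from the question's.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive sessions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 20 experiments, 3 measurements, subjects in a repeating pattern
subjects = (["Jean", "Marc", "Paul"] * 7)[:20]
index = pd.MultiIndex.from_arrays([list(range(20)), subjects],
                                  names=["num_exp", "Subject"])
rng = np.random.default_rng(42)
test = pd.DataFrame(rng.integers(0, 100, size=(20, 3)),
                    index=index, columns=["var1", "var2", "var3"])

groups = test.groupby(level="Subject")
fig, axs = plt.subplots(1, groups.ngroups, figsize=(12, 4), sharey=True)
for ax, (subject, grp) in zip(axs, groups):
    # Drop the Subject level so the x tick labels show only the experiment number
    grp.droplevel("Subject").plot(kind="bar", stacked=True, ax=ax,
                                  legend=False, title=subject)

# Build one shared legend from the first subplot's handles
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels, loc="upper right")
```

A loop over the subplots is still needed; the groupby only removes the manual filtering per subject.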

How to create grouped and stacked bars

I have a very huge dataset with a lot of subsidiaries serving three customer groups in various countries, something like this (in reality there are much more subsidiaries and dates):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'subsidiary': ['EU','EU','EU','EU','EU','EU','EU','EU','EU','US','US','US','US','US','US','US','US','US'],'date': ['2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05'],'business': ['RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC','RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC'],'value': [500.36,600.45,700.55,750.66,950.89,1300.13,100.05,120.00,150.01,800.79,900.55,1000,3500.79,5000.36,4500.25,50.17,75.25,90.33]})
print(df)
I'd like to make an analysis per subsidiary by producing a stacked bar chart. To do this, I started by defining the x-axis to be the unique months and by defining a subset per business type in a country like this:
x=df['date'].drop_duplicates()
EUCORP = df[(df['subsidiary']=='EU') & (df['business']=='CORP')]
EURETAIL = df[(df['subsidiary']=='EU') & (df['business']=='RETAIL')]
EUPUBLIC = df[(df['subsidiary']=='EU') & (df['business']=='PUBLIC')]
I can then make a bar chart per business type:
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35)
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35)
However, if I try to stack all three together in one chart, I keep failing:
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35, bottom=EURETAIL)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35, bottom=EURETAIL+EUCORP)
plt.show()
I always receive the below error message:
ValueError: Missing category information for StrCategoryConverter; this might be caused by unintendedly mixing categorical and numeric data
ConversionError: Failed to convert value(s) to axis units: subsidiary date business value
0 EU 2019-03 RETAIL 500.36
1 EU 2019-04 RETAIL 600.45
2 EU 2019-05 RETAIL 700.55
I tried converting the months into the dateformat and/or indexing it, but it actually confused me further...
I would really appreciate any help/support on any of the following, as I have already spent a lot of hours trying to figure this out (I am still a Python noob, sorry):
How can I fix the error to create a stacked bar chart?
Assuming, the error can be fixed, is this the most efficient way to create the bar chart (e.g. do I really need to create three sub-dfs per subsidiary, or is there a more elegant way?)
Would it be possible to code an iteration, that produces a stacked bar chart by country, so that I don't need to create one per subsidiary?
As an FYI, stacked bars are not the best option, because they can make it difficult to compare bar values and can easily be misinterpreted. The purpose of a visualization is to present data in an easily understood format; make sure the message is clear. Side-by-side bars are often a better option.
Side-by-side stacked bars are difficult to construct manually; it's better to use a figure-level method like seaborn.catplot, which will create a single, easy-to-read data visualization.
Bar plot ticks are located by 0 indexed range (not datetimes), the dates are just labels, so it is not necessary to convert them to a datetime dtype.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
seaborn
import seaborn as sns
sns.catplot(kind='bar', data=df, col='subsidiary', x='date', y='value', hue='business')
Create grouped and stacked bars
See Stacked Bar Chart and Grouped bar chart with labels
The issue with the creation of the stacked bars in the OP is that bottom is being set to the entire dataframe for that group, instead of only the values that make up the bar height.
"Do I really need to create three sub-dfs per subsidiary?" Yes, a DataFrame is needed for every group, so 6 in this case.
Creating the data subsets can be automated using a dict-comprehension to unpack the .groupby object into a dict.
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])} to create a dict of DataFrames
Access the values like: data['EUCORP'].value
Automating the plot creation is more arduous, as can be seen x depends on how many groups of bars for each tick, and bottom depends on the values for each subsequent plot.
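Before the automated version, the bottom fix described above can be seen in isolation for the EU subsidiary. A minimal sketch, using only the EU rows of the question's data and a hypothetical helper named values: bottom receives the running sum of the value arrays, not the filtered DataFrames.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive sessions
import matplotlib.pyplot as plt
import pandas as pd

# EU rows only, copied from the question's DataFrame
df = pd.DataFrame({
    "subsidiary": ["EU"] * 9,
    "date": ["2019-03", "2019-04", "2019-05"] * 3,
    "business": ["RETAIL"] * 3 + ["CORP"] * 3 + ["PUBLIC"] * 3,
    "value": [500.36, 600.45, 700.55, 750.66, 950.89, 1300.13,
              100.05, 120.00, 150.01],
})

x = df["date"].drop_duplicates().to_numpy()

def values(business):
    # Hypothetical helper: plain NumPy heights, so sums are positional
    # instead of being aligned on the DataFrame index
    return df.loc[df["business"] == business, "value"].to_numpy()

retail, corp, public = values("RETAIL"), values("CORP"), values("PUBLIC")

fig, ax = plt.subplots()
ax.bar(x, retail, width=0.35, label="RETAIL")
ax.bar(x, corp, width=0.35, bottom=retail, label="CORP")
ax.bar(x, public, width=0.35, bottom=retail + corp, label="PUBLIC")
ax.legend()
```

Converting to NumPy matters here: adding two filtered pandas Series would align them on their original row labels, which do not overlap, and produce NaN.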
import numpy as np
import matplotlib.pyplot as plt
labels=df['date'].drop_duplicates() # set the dates as labels
x0 = np.arange(len(labels)) # create an array of values for the ticks that can perform arithmetic with width (w)
# create the data groups with a dict comprehension and groupby
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])}
# build the plots
subs = df.subsidiary.unique()
stacks = len(subs) # how many stacks in each group for a tick location
business = df.business.unique()
# set the width
w = 0.35
# this needs to be adjusted based on the number of stacks; each location needs to be split into the proper number of locations
x1 = [x0 - w/stacks, x0 + w/stacks]
fig, ax = plt.subplots()
for x, sub in zip(x1, subs):
    bottom = 0
    for bus in business:
        height = data[f'{sub}{bus}'].value.to_numpy()
        ax.bar(x=x, height=height, width=w, bottom=bottom)
        bottom += height
ax.set_xticks(x0)
_ = ax.set_xticklabels(labels)
As you can see, small values are difficult to discern, and using ax.set_yscale('log') does not work as expected with stacked bars (e.g. it does not make small values more readable).
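The x1 list in the code above is written for exactly two subsidiaries. One way to generalize it, sketched here under the assumption that every (subsidiary, business, date) combination is present, is to split a fixed slot around each tick into one sub-slot per subsidiary. The names slot, offsets, and wide are illustrative and not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive sessions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "subsidiary": ["EU"] * 9 + ["US"] * 9,
    "date": ["2019-03", "2019-04", "2019-05"] * 6,
    "business": (["RETAIL"] * 3 + ["CORP"] * 3 + ["PUBLIC"] * 3) * 2,
    "value": [500.36, 600.45, 700.55, 750.66, 950.89, 1300.13,
              100.05, 120.00, 150.01, 800.79, 900.55, 1000,
              3500.79, 5000.36, 4500.25, 50.17, 75.25, 90.33],
})

# Wide table: one row per date, columns keyed by (subsidiary, business)
wide = df.pivot_table(index="date", columns=["subsidiary", "business"],
                      values="value", aggfunc="sum")
subs = wide.columns.get_level_values("subsidiary").unique()
x0 = np.arange(len(wide.index))

slot = 0.8                    # total width reserved around each tick (assumed)
w = slot / len(subs)          # width of one stacked bar
# Centre the sub-slots on the tick: two subsidiaries give offsets -w/2 and +w/2
offsets = (np.arange(len(subs)) - (len(subs) - 1) / 2) * w

fig, ax = plt.subplots()
for offset, sub in zip(offsets, subs):
    bottom = np.zeros(len(x0))
    for bus in wide[sub].columns:
        height = wide[sub][bus].to_numpy()
        ax.bar(x0 + offset, height, width=w, bottom=bottom, label=f"{sub} {bus}")
        bottom = bottom + height
ax.set_xticks(x0)
ax.set_xticklabels(wide.index)
ax.legend(ncol=len(subs), fontsize="small")
```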
Create only stacked bars
As mentioned by @r-beginners, use .pivot, or .pivot_table, to reshape the dataframe to a wide form to create stacked bars where the x-axis is a tuple ('date', 'subsidiary').
Use .pivot if there are no repeat values for each category
Use .pivot_table, if there are repeat values that must be combined with aggfunc (e.g. 'sum', 'mean', etc.)
# reshape the dataframe
dfp = df.pivot(index=['date', 'subsidiary'], columns=['business'], values='value')
# plot stacked bars
dfp.plot(kind='bar', stacked=True, rot=0, figsize=(10, 4))
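The difference between the two reshaping methods shows up as soon as a category combination repeats. In this sketch the small frame and its duplicated row are invented for illustration: .pivot raises a ValueError, while .pivot_table combines the repeats with aggfunc.

```python
import pandas as pd

# Small hypothetical frame; the last row repeats the first row's keys
df = pd.DataFrame({
    "subsidiary": ["EU", "EU", "US", "EU"],
    "date": ["2019-03", "2019-03", "2019-03", "2019-03"],
    "business": ["RETAIL", "CORP", "RETAIL", "RETAIL"],
    "value": [500.0, 750.0, 800.0, 25.0],
})

try:
    df.pivot(index=["date", "subsidiary"], columns="business", values="value")
    pivot_failed = False
except ValueError:
    # Duplicate (date, subsidiary, business) entries cannot be reshaped
    pivot_failed = True

# pivot_table aggregates the repeats instead of failing
dfp = df.pivot_table(index=["date", "subsidiary"], columns="business",
                     values="value", aggfunc="sum")
print(dfp)
```

Combinations that never occur, such as US and CORP here, come out as NaN in the wide table.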

How to plot with x = column A, but color/hue = column B categorical vars

I have a pandas dataframe that I want to create a bar plot from using Seaborn. The problem is I want to use one of two categorical variables, say column A, in X-axis, but a different categorical column, say column B, to color the bars. Values in B can represent more than one value in A.
MajorCategories    name                              review_count
Food,Restaurants   Mon Ami Gabi                      8348
Food,Restaurants   Bacchanal Buffet                  8339
Restaurants        Wicked Spoon                      6708
Food,Restaurants   Hash House A Go Go                5763
Restaurants        Gordon Ramsay BurGR               5484
Restaurants        Secret Pizza                      4286
Restaurants        The Buffet at Bellagio            4227
Hotels & Travel    McCarran International Airport    3627
Restaurants        Yardbird Southern Table & Bar     3576
So, I would like my barplot to plot the bars with x='name' and y='review_count', and at the same time color them (hue) by MajorCategories. Is it possible in Seaborn without many lines of code?
Below are the links to the images I get in seaborn, and the one I am trying to get.
sns.catplot(x="review_count", y="name", kind="bar", data=plot_data, aspect= 1.5)
Plot I get using seaborn using the code above
Plot I am trying to achieve, this one is using ggplot2 in R
Try passing hue and set dodge=False:
sns.catplot(x="review_count", y="name", hue='MajorCategories',
            kind="bar", data=plot_data,
            dodge=False, aspect=1.5)
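To make explicit what hue with dodge=False produces (one full-width bar per name, colored by category), here is a rough matplotlib-only equivalent. It is a swapped-in illustration, not the seaborn call itself; the data are the rows from the question, and the color assignment from the default property cycle is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive sessions
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Patch

plot_data = pd.DataFrame({
    "MajorCategories": ["Food,Restaurants", "Food,Restaurants", "Restaurants",
                        "Food,Restaurants", "Restaurants", "Restaurants",
                        "Restaurants", "Hotels & Travel", "Restaurants"],
    "name": ["Mon Ami Gabi", "Bacchanal Buffet", "Wicked Spoon",
             "Hash House A Go Go", "Gordon Ramsay BurGR", "Secret Pizza",
             "The Buffet at Bellagio", "McCarran International Airport",
             "Yardbird Southern Table & Bar"],
    "review_count": [8348, 8339, 6708, 5763, 5484, 4286, 4227, 3627, 3576],
})

# Map each category to one color from the default property cycle
categories = plot_data["MajorCategories"].unique()
cycle = plt.rcParams["axes.prop_cycle"].by_key()["color"]
palette = {cat: cycle[i % len(cycle)] for i, cat in enumerate(categories)}

fig, ax = plt.subplots(figsize=(9, 5))
ax.barh(plot_data["name"], plot_data["review_count"],
        color=plot_data["MajorCategories"].map(palette).tolist())
ax.invert_yaxis()  # first row of the table at the top
ax.set_xlabel("review_count")
# Legend entries are built by hand, one patch per category
ax.legend(handles=[Patch(color=c, label=cat) for cat, c in palette.items()],
          title="MajorCategories")
fig.tight_layout()
```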

Plotting Pandas data as an array of bar chart does not honour sharex = True

I have a Pandas dataframe that contains a column containing 'year' data and a column containing 'count' data. There is also a column containing a 'category' variable. Not each category has data for each year. I would like to plot an array of bar charts, one above the other, using a common x axis (year). The code I've written almost works except the x axis is not common for all plots.
The code example is given below. Basically, the code creates an array of axes with sharex=True and then steps through each axis plotting the relevant data from the dataframe.
# Define dataframe
myDF = pd.DataFrame({'year':list(range(2000,2010))+list(range(2001,2008))+list(range(2005,2010)),
'category':['A']*10 + ['B']*7 + ['C']*5,
'count':[2,3,4,3,4,5,4,3,4,5,2,3,4,5,4,5,6,9,8,7,8,6]})
# Plot counts for individual categories in array of bar charts
fig, axarr = plt.subplots(3, figsize = (4,6), sharex = True)
for i in range(0,len(myDF['category'].unique())):
    myDF.loc[myDF['category'] == myDF['category'].unique()[i],['year','count']].plot(kind = 'bar',
        ax = axarr[i],
        x = 'year',
        y = 'count',
        legend = False,
        title = 'Category {0} bar chart'.format(myDF['category'].unique()[i]))
fig.subplots_adjust(hspace=0.5)
plt.show()
A screenshot of the outcome is given below:
I was expecting the Category A bars to extend from 2000 to 2009, Category B bars to extend from 2001 to 2007 and Category C bars to extend from 2005 to 2009. However, it seems that only the first 5 bars of each category are plotted regardless of the value on the x axis. Presumably, the reason only 5 bars are plotted is because the last category only had data for 5 years. A bigger problem is that the data plotted for the other categories is not plotted against the correct year. I've searched for solutions and tried various modifications but nothing seems to work.
Any suggestions to resolve this issue would be very welcome.
Try the following approach:
d = myDF.groupby(['year', 'category'])['count'].sum().unstack()
fig, axarr = plt.subplots(3, figsize = (4,6), sharex=True)
for i, cat in enumerate(d.columns):
    d[cat].plot(kind='bar', ax=axarr[i], title='Category {cat} bar chart'.format(cat=cat))
fig.subplots_adjust(hspace=0.5)
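The answer does not say why this helps. A likely explanation is that pandas bar plots place bars at positions 0 to n-1 and use the index only for tick labels, so the subplots line up only when every category has the same rows. The self-contained sketch below prints the intermediate table d to show that alignment; missing years appear as NaN.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive sessions
import matplotlib.pyplot as plt
import pandas as pd

myDF = pd.DataFrame({
    "year": list(range(2000, 2010)) + list(range(2001, 2008)) + list(range(2005, 2010)),
    "category": ["A"] * 10 + ["B"] * 7 + ["C"] * 5,
    "count": [2, 3, 4, 3, 4, 5, 4, 3, 4, 5, 2, 3, 4, 5, 4, 5, 6, 9, 8, 7, 8, 6],
})

# One row per year, one column per category; years without data become NaN
d = myDF.groupby(["year", "category"])["count"].sum().unstack()
print(d)

fig, axarr = plt.subplots(len(d.columns), figsize=(4, 6), sharex=True)
for ax, cat in zip(axarr, d.columns):
    # Every column has the same 10 rows, so bar positions match across subplots
    d[cat].plot(kind="bar", ax=ax, title=f"Category {cat} bar chart")
fig.subplots_adjust(hspace=0.5)
```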

Plot multiple lines from one data frame

I have all the data I want to plot in one pandas data frame, e.g.:
date flower_color flower_count
0 2017-08-01 blue 1
1 2017-08-01 red 2
2 2017-08-02 blue 5
3 2017-08-02 red 2
I need a few different lines on one plot: x-value should be the date from the first column and y-value should be flower_count, and the y-value should depend on the flower_color given in the second column.
How can I do that without filtering the original df and saving it as a new object first? My only idea was to create a data frame for only red flowers and then specifying it like:
figure.line(x="date", y="flower_count", source=red_flower_ds)
figure.line(x="date", y="flower_count", source=blue_flower_ds)
You can try this
fig, ax = plt.subplots()
for name, group in df.groupby('flower_color'):
    group.plot('date', y='flower_count', ax=ax, label=name)
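A variation that avoids the explicit loop is to reshape the frame to wide form and let pandas draw one line per column. A small sketch built from the question's sample rows, assuming each (date, flower_color) pair occurs only once; otherwise .pivot raises and .pivot_table with an aggregation would be needed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive sessions
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02", "2017-08-02"]),
    "flower_color": ["blue", "red", "blue", "red"],
    "flower_count": [1, 2, 5, 2],
})

# Wide form: one column per color, indexed by date
wide = df.pivot(index="date", columns="flower_color", values="flower_count")
ax = wide.plot(marker="o")  # one line per column, labelled with the color name
ax.set_ylabel("flower_count")
```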
If my understanding is right, you need a plot with two subplots. The X for both subplots are dates, and the Ys are the flower counts for each color?
In this case, you can employ the subplots in pandas visualization.
fig, axes = plt.subplots(2)
df[df.flower_color == 'blue'].plot(x='date', y='flower_count', ax=axes[0]).set_ylabel('blue')
df[df.flower_color == 'red'].plot(x='date', y='flower_count', ax=axes[1]).set_ylabel('red')
plt.show()
Hope it helps.
