I have a problem drawing a nested pie chart with Matplotlib in Python. I wrote some code for it, but I have an issue with the layout and the labels.
I'd like to draw a nested pie chart like this one, where the outer ring is SEX and the inner ring is ALIGN, each sized by its count.
Here is my dataframe:
                ALIGN                    SEX  count
2      Bad Characters        Male Characters   1542
5     Good Characters        Male Characters   1419
3     Good Characters      Female Characters    714
0      Bad Characters      Female Characters    419
8  Neutral Characters        Male Characters    254
6  Neutral Characters      Female Characters    138
1      Bad Characters  Genderless Characters      9
4     Good Characters  Genderless Characters      4
7  Neutral Characters  Genderless Characters      3
9  Reformed Criminals        Male Characters      2
Here is the code I use to draw the nested pie chart:
fig, ax = plt.subplots(figsize=(24,12))
size = 0.3
ax.pie(dc_df_ALIGN_SEX.groupby('SEX')['count'].sum(), radius=1,
labels=dc_df_ALIGN_SEX['SEX'].drop_duplicates(),
autopct='%1.1f%%',
wedgeprops=dict(width=size, edgecolor='w'))
ax.pie(dc_df_ALIGN_SEX['count'], radius=1-size, labels = dc_df_ALIGN_SEX["ALIGN"],
wedgeprops=dict(width=size, edgecolor='w'))
ax.set(aspect="equal", title='Pie plot with `ax.pie`')
plt.show()
How can I arrange the charts in a grid of 4 rows and 4 columns, with one chart per slot, and show the labels in a legend area?
Since the question has been changed, I'm posting a new answer.
First, I slightly simplified your DataFrame:
import pandas as pd
df = pd.DataFrame([['Bad', 'Male', 1542],
['Good', 'Male', 1419],
['Good', 'Female', 714],
['Bad', 'Female', 419],
['Neutral', 'Male', 254],
['Neutral', 'Female', 138],
['Bad', 'Genderless', 9],
['Good', 'Genderless', 4],
['Neutral', 'Genderless', 3],
['Reformed', 'Male', 2]])
df.columns = ['ALIGN', 'SEX', 'n']
For the numbers in the outer ring, we can use a simple groupby, as you did:
# Select the numeric column explicitly so the text column ALIGN is not summed as well
outer = df.groupby('SEX')[['n']].sum()
But for the numbers in the inner ring, we need to group by both categorical variables, which results in a MultiIndex:
inner = df.groupby(['SEX', 'ALIGN']).sum()
inner
                       n
SEX        ALIGN
Female     Bad       419
           Good      714
           Neutral   138
Genderless Bad         9
           Good        4
           Neutral     3
Male       Bad      1542
           Good     1419
           Neutral   254
           Reformed    2
We can extract the appropriate labels from the MultiIndex with its get_level_values() method:
inner_labels = inner.index.get_level_values(1)
Now you can turn the above values into one-dimensional arrays and plug them into your plot calls:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(24,12))
size = 0.3
ax.pie(outer.values.flatten(), radius=1,
labels=outer.index,
autopct='%1.1f%%',
wedgeprops=dict(width=size, edgecolor='w'))
ax.pie(inner.values.flatten(), radius=1-size,
labels = inner_labels,
wedgeprops=dict(width=size, edgecolor='w'))
ax.set(aspect="equal", title='Pie plot with `ax.pie`')
plt.show()
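Regarding the legend part of the question, here is a minimal sketch (not the original answer's code) that leaves the wedge labels off and builds one legend from the wedges that ax.pie returns. The "SEX / ALIGN" label format and the legend placement are my own choices, and it assumes both groupby results stay sorted by SEX so that the two rings line up.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    [['Bad', 'Male', 1542], ['Good', 'Male', 1419], ['Good', 'Female', 714],
     ['Bad', 'Female', 419], ['Neutral', 'Male', 254], ['Neutral', 'Female', 138],
     ['Bad', 'Genderless', 9], ['Good', 'Genderless', 4],
     ['Neutral', 'Genderless', 3], ['Reformed', 'Male', 2]],
    columns=['ALIGN', 'SEX', 'n'])

# Both results are sorted by SEX, so inner wedges sit under their outer wedge.
outer = df.groupby('SEX')['n'].sum()
inner = df.groupby(['SEX', 'ALIGN'])['n'].sum()

fig, ax = plt.subplots(figsize=(10, 6))
size = 0.3

# Without autopct, ax.pie returns (wedges, texts); the wedges feed the legend.
outer_wedges, _ = ax.pie(outer.values, radius=1,
                         wedgeprops=dict(width=size, edgecolor='w'))
inner_wedges, _ = ax.pie(inner.values, radius=1 - size,
                         wedgeprops=dict(width=size, edgecolor='w'))

# One legend entry per wedge; inner entries are spelled "SEX / ALIGN".
legend_labels = list(outer.index) + [f"{sex} / {align}" for sex, align in inner.index]
ax.legend(list(outer_wedges) + list(inner_wedges), legend_labels,
          loc='center left', bbox_to_anchor=(1.0, 0.5))
ax.set(aspect='equal')
```

Call plt.show() or fig.savefig(...) afterwards to see the result. Note that both rings draw from the same default color cycle, so you may want to pass colors= to each call to tell them apart.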
You define the function percentage_growth(l) in a way that supposes its argument l to be a list (or some other one-dimensional object). But then (to assign colors) you call this function on dc_df_ALIGN_SEX, which is apparently your DataFrame. So the function (in the first iteration of its loop) tries to evaluate dc_df_ALIGN_SEX[0], which throws the key error, because that is not a proper way to index the DataFrame.
Perhaps you want to do something like percentage_growth(dc_df_ALIGN_SEX['count']) instead?
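To illustrate the point, here is a small sketch. The body of percentage_growth below is a hypothetical stand-in (the real function is not shown here), and the three counts are made up. It only demonstrates that integer indexing on a DataFrame is a column lookup, while a plain list of one column behaves the way the function expects.

```python
import pandas as pd

def percentage_growth(values):
    """Hypothetical stand-in: percent change between consecutive items."""
    out = []
    for i in range(1, len(values)):
        prev = values[i - 1]          # on a DataFrame this looks up a COLUMN named 0
        out.append((values[i] - prev) / prev * 100)
    return out

# Made-up numbers, only the column name matches the question.
dc_df_ALIGN_SEX = pd.DataFrame({'count': [100, 150, 120]})

try:
    percentage_growth(dc_df_ALIGN_SEX)
    raised = False
except KeyError:
    raised = True                     # there is no column labelled 0

# Passing one column as a plain list gives the function its 1-D input.
growth = percentage_growth(dc_df_ALIGN_SEX['count'].tolist())
```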
My dataframe has more than 10 columns and each column has values like yes/no/na/not specified.
And I want to calculate the count of occurrences in each column and create stacked bar graph.
Below is the image that I need:
Yes, this is possible. But you'll need to re-format your data a little first.
Here's the dataset I'm using in this example. It has the labels in the columns, and 1000 random Yes, No or Maybe responses as values.
asthma boneitis diabetes pneumonia
0 No No Yes Maybe
1 No No No Yes
2 No No No No
3 Yes No No Maybe
4 Yes No No Maybe
.. ... ... ... ...
995 No No Yes No
996 Maybe Yes Yes Yes
997 No No No Yes
998 No No No No
999 No No Maybe No
In order to format the data correctly for the plot, do this:
df2 = df.stack().groupby(level=[1]).value_counts().unstack()
# Preferred order of stacked bar elements
stack_order = ['Yes', 'Maybe', 'No']
df2 = df2[stack_order]
At this point, the data looks like this:
Yes Maybe No
asthma 83 83 834
boneitis 174 173 653
diabetes 244 260 496
pneumonia 339 363 298
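The reshaping step can be checked on a tiny frame. This is a sketch with made-up responses; the fill_value=0 argument is my addition, to cover the case where some answer never occurs in a column (which would otherwise leave a NaN). The second computation is just a cross-check that counts each column separately.

```python
import pandas as pd

# Made-up responses; 'diabetes' deliberately has no 'Maybe'.
small = pd.DataFrame({
    'asthma':   ['No', 'No', 'Yes', 'Maybe'],
    'diabetes': ['Yes', 'No', 'No', 'No'],
})

# Same reshaping as above, with missing combinations filled with 0.
counts = (small.stack()
               .groupby(level=1)
               .value_counts()
               .unstack(fill_value=0))
counts = counts[['Yes', 'Maybe', 'No']]

# Cross-check: count each column on its own, then transpose.
alt = (small.apply(pd.Series.value_counts)
            .T.fillna(0).astype(int)[['Yes', 'Maybe', 'No']])
```

Both counts and alt have one row per original column and one column per answer, which is the shape that plot.bar(stacked=True) expects.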
Now you're ready to plot the data. Here's the code to do that:
df2.plot.bar(rot=0, stacked=True)
I'm using rot=0 to keep the tick labels horizontal (for bar charts, pandas rotates them vertically by default), and stacked=True to produce a stacked bar chart.
The plot looks like this:
Appendix
Code for generating test data set:
import pandas as pd
import numpy as np
categories = [
'asthma',
'boneitis',
'diabetes',
'pneumonia',
]
distribution = {
cat: (i + 1) / 12
for i, cat in enumerate(categories)
}
df = pd.DataFrame({
cat: np.random.choice(['Yes', 'Maybe', 'No'], size=1000, p=[prob, prob, 1 - 2 * prob])
for cat, prob in distribution.items()
})
I am drawing a point plot to show the relationship between "workclass", "sex", "occupation" and "Income exceeds 50K or not". However, the result is a mess: the legend entries are stuck together, Female and Male are both shown in blue in the legend, etc.
#Co-relate categorical features
grid = sns.FacetGrid(train, row='occupation', size=6, aspect=1.6)
grid.map(sns.pointplot, 'workclass', 'exceeds50K', 'sex', palette='deep', markers = ["o", "x"] )
grid.add_legend()
Please advise how to fit the size of the plot. Thanks!
It sounds like 'exceeds50k' is a categorical variable. Your y variable needs to be continuous for a point plot. So assuming this is your dataset:
import pandas as pd
import seaborn as sns
df =pd.read_csv("https://raw.githubusercontent.com/katreparitosh/Income-Predictor-Model/master/Database/adult.csv")
For the sake of the example, we collapse some categories before plotting:
df['native.country'] = [i if i == 'United-States' else 'others' for i in df['native.country'] ]
df['race'] = [i if i == 'White' else 'others' for i in df['race'] ]
df.head()
age workclass fnlwgt education education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country income
0 90 ? 77053 HS-grad 9 Widowed ? Not-in-family White Female 0 4356 40 United-States <=50K
1 82 Private 132870 HS-grad 9 Widowed Exec-managerial Not-in-family White Female 0 4356 18 United
If the y variable is categorical, you might want to use a count plot (a bar chart of counts) instead:
sns.catplot(hue='income',x='sex', palette='deep',data=df,
col='native.country',
row='race',kind='count',height=3,aspect=1.6)
If it is continuous, for example age, you can see it works:
grid = sns.FacetGrid(df, row='race', height=3, aspect=1.6)
grid.map(sns.pointplot, 'native.country', 'age', 'sex', palette='deep', markers = ["o", "x"] )
grid.add_legend()
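If you do want a point plot of the income variable itself, one option is to encode it as a 0/1 indicator first, so that its mean is the share of people above 50K. The sketch below uses a tiny made-up sample in the shape of the adult dataset; the column name exceeds50K is borrowed from the question, and the encoding rule is my assumption about what that column should hold.

```python
import pandas as pd

# Made-up rows, for illustration only.
sample = pd.DataFrame({
    'sex':    ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'race':   ['White', 'others', 'White', 'White', 'White', 'others'],
    'income': ['>50K', '<=50K', '<=50K', '>50K', '>50K', '<=50K'],
})

# 1 where income exceeds 50K, 0 otherwise; the mean of this column is a proportion.
sample['exceeds50K'] = (sample['income'] == '>50K').astype(int)

# Share of people above 50K per sex.
share = sample.groupby('sex')['exceeds50K'].mean()
```

The numeric exceeds50K column can then be passed as the y variable to sns.pointplot, which plots the mean per category by default.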
I come from an R ggplot2 background and am a bit confused by plotting in matplotlib.
Here is my dataframe:
languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})
language county count
0 en us 32
1 cs ch 432
2 es sp 43
3 pt br 55
4 hi in 6
5 en fr 23
6 es ar 455
7 es pr 23
Now I want to plot:
A stacked bar chart where the x axis shows the language and the y axis shows the count. The full bar height is the total count for that language, and the stacked segments show the countries for that language.
A side-by-side version with the same parameters, where the countries are shown next to each other instead of stacked.
Most examples plot directly from the dataframe with the pandas plot method, but I want to build the plot step by step in a script so that I have more control over it and can edit whatever I want, something like this script:
ind = np.arange(df.languages.nunique())
width = 0.35
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(ind, df.languages, width, color='r')
ax.bar(ind, df.count, width,bottom=df.languages, color='b')
ax.set_ylabel('Count')
ax.set_title('Score y language and country')
ax.set_xticks(ind, df.languages)
ax.set_yticks(np.arange(0, 81, 10))
ax.legend(labels=[df.countries])
plt.show()
By the way, here is my pandas pivot code for the same plot:
df.pivot(index = "Language", columns = "Country", values = "count").plot.bar(figsize=(15,10))
plt.xticks(rotation = 0,fontsize=18)
plt.xlabel('Language' )
plt.ylabel('Count ')
plt.legend(fontsize='large', ncol=2,handleheight=1.5)
plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})
modified = {}
modified['language'] = np.unique(df.language)
country_count = []
total_count = []
for x in modified['language']:
country_count.append(len(df[df['language']==x]))
total_count.append(df[df['language']==x]['count'].sum())
modified['country_count'] = country_count
modified['total_count'] = total_count
mod_df = pd.DataFrame(modified)
print(mod_df)
ind = mod_df.language
width = 0.35
p1 = plt.bar(ind,mod_df.total_count, width)
p2 = plt.bar(ind,mod_df.country_count, width,
bottom=mod_df.total_count)
plt.ylabel("Total count")
plt.xlabel("Languages")
plt.legend((p1[0], p2[0]), ('Total Count', 'Country Count'))
plt.show()
First, modify the dataframe into the one below.
language country_count total_count
0 cs 1 432
1 en 2 55
2 es 3 521
3 hi 1 6
4 pt 1 55
This is the plot:
As the value of country count is small, you cannot clearly see the stacked country count.
import seaborn as sns
import matplotlib.pyplot as plt
figure, axis = plt.subplots(1,1,figsize=(10,5))
sns.barplot(x="language",y="count",data=df,ci=None)#,hue='county')
axis.set_title('Score y language and country')
axis.set_ylabel('Count')
axis.set_xlabel("Language")
sns.countplot(x=df.language,data=df)
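For the step-by-step version the question asks for, here is a sketch that stacks one segment per country (rather than the number of countries) and also draws the side-by-side variant. It assumes that reading of the question; the bar widths, figure size, and variable names such as table, ax_stacked and ax_grouped are my own choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

languages = ['en', 'cs', 'es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us', 'ch', 'sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432, 43, 55, 6, 23, 455, 23]
df = pd.DataFrame({'language': languages, 'county': counties, 'count': count})

# Rows: languages, columns: countries, missing pairs filled with 0.
table = df.pivot_table(index='language', columns='county',
                       values='count', aggfunc='sum', fill_value=0)
ind = np.arange(len(table.index))

fig, (ax_stacked, ax_grouped) = plt.subplots(1, 2, figsize=(14, 5))

# Stacked: each country's segment starts where the previous one ended.
bottom = np.zeros(len(table.index))
for county in table.columns:
    heights = table[county].to_numpy()
    ax_stacked.bar(ind, heights, 0.6, bottom=bottom, label=county)
    bottom += heights

# Side by side: shift each country's bars by a fraction of the slot width.
width = 0.8 / len(table.columns)
for k, county in enumerate(table.columns):
    ax_grouped.bar(ind + k * width, table[county].to_numpy(), width, label=county)

ax_stacked.set_xticks(ind)
ax_stacked.set_xticklabels(table.index)
ax_grouped.set_xticks(ind + 0.4 - width / 2)   # centre of each group of bars
ax_grouped.set_xticklabels(table.index)
for ax in (ax_stacked, ax_grouped):
    ax.set_xlabel('Language')
    ax.set_ylabel('Count')
    ax.legend(title='county', ncol=2)
```

Because every country in this data has exactly one language, most bars in the side-by-side panel have zero height; with denser data the groups fill up.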
I have a data frame that looks like -
id  age_bucket  state              gender  duration  category1  is_active
1   (40, 70]    Jammu and Kashmir  m       123       ABB        1
2   (17, 24]    West Bengal        m       72        ABB        0
3   (40, 70]    Bihar              f       109       CA         0
4   (17, 24]    Bihar              f       52        CA         1
5   (24, 30]    MP                 m       23        ACC        1
6   (24, 30]    AP                 m       103       ACC        1
7   (30, 40]    West Bengal        f       182       GF         0
I want to create a bar plot showing how many people are active for each age_bucket and state (top 10). For gender and category1, I want to create a pie chart with the proportion of active people. The top of each bar should display the total count of active and inactive members, and similarly the pie chart should display percentages based on is_active.
How can I do this in Python using seaborn or matplotlib?
This is what I have done so far:
import seaborn as sns
%matplotlib inline
sns.barplot(x='age_bucket',y='is_active',data=df)
sns.barplot(x='category1',y='is_active',data=df)
It sounds like you want to count the observations rather than plot a value from a column along the y-axis. In seaborn, the function for this is countplot():
sns.countplot(x='age_bucket', hue='is_active', data=df)
Since the returned object is a matplotlib Axes, you could assign it to a variable (e.g. ax) and then use ax.annotate to place text in the figure manually:
ax = sns.countplot(x='age_bucket', hue='is_active', data=df)
ax.annotate('1 1', (0, 1), ha='center', va='bottom', fontsize=12)
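Rather than typing each annotation by hand, you can loop over the bars. The sketch below builds the bars with pandas (pd.crosstab plus plot.bar) so it runs without seaborn, using the seven rows from the question. The same loop is expected to work on the Axes returned by countplot, though that is an assumption worth checking against your seaborn version.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'age_bucket': ['(40, 70]', '(17, 24]', '(40, 70]', '(17, 24]',
                   '(24, 30]', '(24, 30]', '(30, 40]'],
    'is_active': [1, 0, 0, 1, 1, 1, 0],
})

# Count rows per (age_bucket, is_active) pair; one group of bars per bucket.
counts = pd.crosstab(df['age_bucket'], df['is_active'])
ax = counts.plot.bar(rot=0)

# Every bar is a Rectangle in ax.patches; write its height just above it.
for bar in ax.patches:
    height = bar.get_height()
    ax.annotate(f'{height:.0f}',
                (bar.get_x() + bar.get_width() / 2, height),
                ha='center', va='bottom', fontsize=12)
```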
Seaborn has no way of creating pie charts, so you would need to use matplotlib directly. However, it is often easier to tell counts and proportions from bar charts so I would generally recommend that you stick to those unless you have a specific constraint that forces you to use a pie chart.
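If you do need the pie chart, a minimal matplotlib sketch could look like this. It uses the gender and is_active values from the seven rows in the question; the title text is my own, and it assumes is_active only holds 0 or 1 so that its sum is a count of active people.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'gender': ['m', 'm', 'f', 'f', 'm', 'm', 'f'],
    'is_active': [1, 0, 0, 1, 1, 1, 0],
})

# Number of active people per gender (is_active is 0/1, so the sum is a count).
active = df.groupby('gender')['is_active'].sum()

fig, ax = plt.subplots()
# autopct prints each wedge's share of the total as a percentage.
ax.pie(active.values, labels=list(active.index), autopct='%1.1f%%')
ax.set(aspect='equal', title='Active members by gender')
```

Swapping 'gender' for 'category1' gives the second pie in the same way.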
I have two DataFrames that I am plotting as a stripplot. I am able to plot them pretty much as I wish, but I would like to know if it is possible to add the category labels for the "hue".
The plot currently looks like this:
However, I would like to add the labels of the categories (there are only two of them) to each "column" for each letter. So that it looks something like this:
The DataFrames look like this (although these are just edited snippets):
Case Letter Size Weight
0 upper A 20 bold
1 upper A 23 bold
2 lower A 61 bold
3 lower A 62 bold
4 upper A 78 bold
5 upper A 95 bold
6 upper B 23 bold
7 upper B 40 bold
8 lower B 47 bold
9 upper B 59 bold
10 upper B 61 bold
11 upper B 99 bold
12 lower C 23 bold
13 upper D 23 bold
14 upper D 66 bold
15 lower D 99 bold
16 upper E 5 bold
17 upper E 20 bold
18 upper E 21 bold
19 upper E 22 bold
...and...
Case Letter Size Weight
0 upper A 4 normal
1 upper A 6 normal
2 upper A 7 normal
3 upper A 8 normal
4 upper A 9 normal
5 upper A 12 normal
6 upper A 25 normal
7 upper A 26 normal
8 upper A 38 normal
9 upper A 42 normal
10 lower A 43 normal
11 lower A 57 normal
12 lower A 90 normal
13 upper B 4 normal
14 lower B 6 normal
15 upper B 8 normal
16 upper B 9 normal
17 upper B 12 normal
18 upper B 21 normal
19 lower B 25 normal
The relevant code I have is:
fig, ax = plt.subplots(figsize=(10, 7.5))
plt.tight_layout()
sns.stripplot(x=new_df_normal['Letter'], y=new_df_normal['Size'],
hue=new_df_normal['Case'], jitter=False, dodge=True,
size=8, ax=ax, marker='D',
palette={'upper': 'red', 'lower': 'red'})
plt.setp(ax.get_legend().get_texts(), fontsize='16') # for legend text
plt.setp(ax.get_legend().get_title(), fontsize='18') # for legend title
ax.set_xlabel("Letter", fontsize=20)
ax.set_ylabel("Size", fontsize=20)
ax.set_ylim(0, 105)
ax.tick_params(labelsize=20)
ax2 = ax.twinx()
sns.stripplot(x=new_df_bold['Letter'], y=new_df_bold['Size'],
hue=new_df_bold['Case'], jitter=False, dodge=True,
size=8, ax=ax2, marker='D',
palette={'upper': 'green', 'lower': 'green'})
ax.legend_.remove()
ax2.legend_.remove()
ax2.set_xlabel("", fontsize=20)
ax2.set_ylabel("", fontsize=20)
ax2.set_ylim(0, 105)
ax2.tick_params(labelsize=20)
Is it possible to add those category labels ("bold" and "normal") for each column?
With seaborn's scatterplot you have access to the style (or even size) parameter, but you might not end up with your intended layout. See the scatterplot documentation.
Alternatively, you could use catplot and play with rows and columns. See the seaborn documentation for catplot.
Unfortunately, seaborn does not natively provide what you are looking for: another level of nesting beyond the hue parameter in stripplot (see the stripplot documentation). Some open seaborn tickets might be related, e.g. this ticket. But I've come across similar feature requests in seaborn that were refused, see this ticket.
One last possibility is to dive into the matplotlib primitives to manipulate your seaborn diagram (since seaborn is built on top of matplotlib). Needless to say, it would require a lot of effort and might end up defeating the purpose of using seaborn in the first place ;)
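As a rough idea of what the matplotlib-only route could look like, here is a sketch that places "bold" and "normal" in their own sub-columns and labels them with minor ticks. It uses a handful of Size values from the two frames in the question, drops the upper/lower hue for brevity, and the offsets, colors and tick padding are arbitrary choices of mine.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few rows from each frame in the question, combined into one.
bold = pd.DataFrame({'Letter': ['A', 'A', 'B', 'B'], 'Size': [20, 61, 23, 47]})
normal = pd.DataFrame({'Letter': ['A', 'A', 'B', 'B'], 'Size': [4, 43, 4, 6]})
data = pd.concat([bold.assign(Weight='bold'), normal.assign(Weight='normal')],
                 ignore_index=True)

letters = sorted(data['Letter'].unique())
offsets = {'bold': -0.2, 'normal': 0.2}    # horizontal shift of each sub-column
colors = {'bold': 'green', 'normal': 'red'}

fig, ax = plt.subplots(figsize=(10, 7.5))
minor_pos, minor_lab = [], []
for i, letter in enumerate(letters):
    for weight, shift in offsets.items():
        rows = data[(data['Letter'] == letter) & (data['Weight'] == weight)]
        ax.scatter([i + shift] * len(rows), rows['Size'],
                   marker='D', color=colors[weight])
        minor_pos.append(i + shift)
        minor_lab.append(weight)

# Major ticks carry the letters (pushed down), minor ticks the category labels.
ax.set_xticks(range(len(letters)))
ax.set_xticklabels(letters)
ax.set_xticks(minor_pos, minor=True)
ax.set_xticklabels(minor_lab, minor=True)
ax.tick_params(axis='x', which='major', pad=20, length=0)
ax.set_ylim(0, 105)
```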
Setting dodge=True enables this:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.violinplot(x="day", y="total_bill", hue="smoker",
data=tips, palette="muted")
sns.stripplot(x="day", y="total_bill", hue="smoker",
data=tips, palette="muted", dodge=True)
EDIT:
And with the df provided by the OP:
df = pd.read_csv('./ongenz.tsv', sep='\t')
sns.stripplot(x=df['Letter'], y=df['Size'], data=df, hue=df['Case'], dodge=True)