I am drawing a point plot to show the relationship between "workclass", "sex", "occupation" and whether income exceeds 50K. However, the result is a mess: the legend entries are stuck together, Female and Male are both shown in blue in the legend, and so on.
#Co-relate categorical features
grid = sns.FacetGrid(train, row='occupation', size=6, aspect=1.6)
grid.map(sns.pointplot, 'workclass', 'exceeds50K', 'sex', palette='deep', markers = ["o", "x"] )
grid.add_legend()
Please advise how to fit the size of the plot. Thanks!
It sounds like 'exceeds50k' is a categorical variable. Your y variable needs to be continuous for a point plot. So assuming this is your dataset:
import pandas as pd
import seaborn as sns
df = pd.read_csv("https://raw.githubusercontent.com/katreparitosh/Income-Predictor-Model/master/Database/adult.csv")
To keep the example simple, we collapse some categories:
df['native.country'] = [i if i == 'United-States' else 'others' for i in df['native.country'] ]
df['race'] = [i if i == 'White' else 'others' for i in df['race'] ]
df.head()
age workclass fnlwgt education education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country income
0 90 ? 77053 HS-grad 9 Widowed ? Not-in-family White Female 0 4356 40 United-States <=50K
1 82 Private 132870 HS-grad 9 Widowed Exec-managerial Not-in-family White Female 0 4356 18 United
If the y variable is categorical, you might want to use a count plot (a bar plot of counts) instead:
sns.catplot(hue='income',x='sex', palette='deep',data=df,
col='native.country',
row='race',kind='count',height=3,aspect=1.6)
If it is continuous, for example age, you can see it works:
grid = sns.FacetGrid(df, row='race', height=3, aspect=1.6)
grid.map(sns.pointplot, 'native.country', 'age', 'sex', palette='deep', markers = ["o", "x"] )
grid.add_legend()
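If the goal is still to see how often income exceeds 50K in each group, one option is to encode the binary outcome as 0/1, so that its mean is a proportion a point plot can display. The following is only a minimal sketch: the tiny sample and the column name `over_50k` are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Tiny made-up sample standing in for the adult dataset (assumption).
sample = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "income": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Encode the categorical outcome as 0/1; `over_50k` is a hypothetical name.
sample["over_50k"] = (sample["income"] == ">50K").astype(int)

# The mean of a 0/1 column is the share of rows above 50K in each group.
share = sample.groupby("sex")["over_50k"].mean()
print(share)
```

The numeric column could then be passed as the y variable to sns.pointplot, whose default estimator is the mean.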
I was given a task where I'm supposed to plot one column's values against another column's values.
For further information, here's the code:
# TODO: Plot the Male employee first name on 'Y' axis while Male salary is on 'X' axis
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel("C:\\users\\HP\\Documents\\Datascience task\\Employee.xlsx")
print(data.head(5))
Output:
First Name Last Name Gender Age Experience (Years) Salary
0 Arnold Carter Male 21 10 8344
1 Arthur Farrell Male 20 7 6437
2 Richard Perry Male 28 3 8338
3 Ellia Thomas Female 26 4 8870
4 Jacob Kelly Male 21 4 548
How do I plot the 'First Name' column against the 'Salary' column for the first 5 rows where 'Gender' is Male?
First select the male rows, then extract the first name and salary for plotting.
The code below takes the first five male employees (assuming the DataFrame is named df) and converts their first names and salaries into the lists x and y.
x = list(df[df['Gender'] == "Male"][:5]['First Name'])
y = list(df[df['Gender'] == "Male"][:5]['Salary'])
print(x)
print(y)
Output:
['Arnold', 'Arthur', 'Richard', 'Jacob']
[8344, 6437, 8338, 548]
Note that there are only 4 male employees in the sample df.
Then we can draw whatever chart we need:
plt.bar(x, y, color = ['r', 'g', 'b', 'y']);
Output:
Seaborn can help as well:
import seaborn as sns
import matplotlib.pyplot as plt
sns.barplot( x=df[(df['Gender'] == "Male")]['First Name'][:5] , y = df[(df['Gender'] == "Male")]['Salary'][:5] )
plt.xlabel('First Names')
plt.ylabel('Salary')
plt.title('Barplot of Male Employees')
plt.show()
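Since the task asks for first names on the Y axis and salary on the X axis, a horizontal bar chart may be a closer fit. This is a minimal sketch, assuming only the five rows printed in the question (rebuilt by hand here) and the variable name `data` from the question's code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The five rows printed in the question, rebuilt by hand for illustration.
data = pd.DataFrame({
    "First Name": ["Arnold", "Arthur", "Richard", "Ellia", "Jacob"],
    "Gender": ["Male", "Male", "Male", "Female", "Male"],
    "Salary": [8344, 6437, 8338, 8870, 548],
})

# Filter once, then keep at most the first five male rows.
males = data.loc[data["Gender"] == "Male", ["First Name", "Salary"]].head(5)

fig, ax = plt.subplots()
# barh puts the categories (names) on the y axis and salary on the x axis.
ax.barh(males["First Name"], males["Salary"])
ax.set_xlabel("Salary")
ax.set_ylabel("First Name")
ax.set_title("Salary of male employees")
```

Call plt.show() afterwards to display the figure.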
I have two columns in my data frame:
winner opening_shortname
0 White Slav Defense
1 Black Nimzowitsch Defense
2 White King's Pawn Game
3 White Queen's Pawn Game
4 White Philidor Defense
... ... ...
20053 White Dutch Defense
20054 Black Queen's Pawn
20055 White Queen's Pawn Game
20056 White Pirc Defense
20057 Black Queen's Pawn Game
I want to create the plot below: the top 10 openings and the proportion (%) of each winner colour.
topk = 10
z = df.groupby(['opening_shortname', 'winner']).size().unstack()
ax = z.loc[z.sum(1).sort_values().tail(topk).index].plot.barh(color=['black', 'white'], edgecolor='black')
ax.xaxis.set_visible(False)
This sorts by prevalence of opening and limits to the top k (e.g. 10 in the OP's question). The "proportion (%)" mention in the question is ambiguous: the plot provided clearly shows decreasing totals from the top opening to the next ones, and the horizontal axis is removed.
Anyway, on the sample data you provided:
Assuming your dataframe is named df, you can groupby+count+unstack. Then sort on the sum and take the top 10 to plot:
df2 = (df.assign(count=1)
.groupby(['winner', 'opening_shortname'])
.count()
.unstack(level=0)
.droplevel(0, axis=1)
)
# plot part
idx = df2.sum(axis=1).sort_values().tail(10).index  # the 10 most frequent openings
(df2.div(df2.sum(axis=1), axis=0) # calculate the proportion
.fillna(0)
.loc[idx, ['White', 'Black']]
.plot.barh(color=['w', 'k'], edgecolor='k')
)
output:
First of all, you should re-shape your dataframe through:
df = df.groupby(by = ['opening_shortname', 'winner']).size().reset_index().rename(columns = {'opening_shortname': 'opening_shortname', 'winner': 'winner', 0: 'count'}).sort_values(['count', 'opening_shortname', 'winner'], ascending = False, ignore_index = True)
So you will get a dataframe like (fake data):
opening_shortname winner count
0 Queen's Pawn Game White 141
1 Queen's Pawn Game Black 132
2 Queen's Pawn White 57
3 Queen's Pawn Black 57
4 King's Pawn Game Black 57
5 Dutch Defense Black 53
6 Sicilian Defense White 51
7 Sicilian Defense Black 50
8 Nimzowitsch Defense White 46
9 Nimzowitsch Defense Black 45
10 Philidor Defense Black 44
11 Slav Defense White 43
12 Pirc Defense White 42
13 Slav Defense Black 39
14 Pirc Defense Black 38
15 King's Pawn Game White 38
16 Dutch Defense White 36
17 Philidor Defense White 31
Then you can plot your data, for example using seaborn.barplot:
sns.barplot(ax = ax, data = df, x = 'count', y = 'opening_shortname', hue = 'winner', palette = ['white', 'black'], edgecolor = 'black')
Complete Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r'data/data.csv')
df = df.groupby(by = ['opening_shortname', 'winner']).size().reset_index().rename(columns = {'opening_shortname': 'opening_shortname', 'winner': 'winner', 0: 'count'}).sort_values(['count', 'opening_shortname', 'winner'], ascending = False, ignore_index = True)
fig, ax = plt.subplots()
sns.barplot(ax = ax, data = df, x = 'count', y = 'opening_shortname', hue = 'winner', palette = ['white', 'black'], edgecolor = 'black')
plt.show()
If, in place of count, you want to plot the relative proportion, then you can add one line to the above code:
df['count'] = df['count']/df.groupby('opening_shortname')['count'].transform('sum')
Complete Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r'data/data.csv')
df = df.groupby(by = ['opening_shortname', 'winner']).size().reset_index().rename(columns = {'opening_shortname': 'opening_shortname', 'winner': 'winner', 0: 'count'}).sort_values(['count', 'opening_shortname', 'winner'], ascending = False, ignore_index = True)
df['count'] = df['count']/df.groupby('opening_shortname')['count'].transform('sum')
fig, ax = plt.subplots()
sns.barplot(ax = ax, data = df, x = 'count', y = 'opening_shortname', hue = 'winner', palette = ['white', 'black'], edgecolor = 'black')
plt.show()
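The per-opening proportions can also be sketched with a row-normalised cross-tabulation. This is not the method used in the answers above, just an alternative under stated assumptions: the games below are made up, and `top_k` is set to 2 instead of 10 to keep the toy example readable.

```python
import pandas as pd

# Made-up games standing in for the real data (assumption).
games = pd.DataFrame({
    "opening_shortname": ["Slav Defense", "Slav Defense", "Slav Defense",
                          "Pirc Defense", "Pirc Defense", "Dutch Defense"],
    "winner": ["White", "Black", "White", "Black", "Black", "White"],
})

top_k = 2  # the question uses 10

# Most frequent openings, most common first.
top = games["opening_shortname"].value_counts().head(top_k).index

# Row-normalised crosstab: each row sums to 1, i.e. winner share per opening.
share = pd.crosstab(games["opening_shortname"], games["winner"],
                    normalize="index").loc[top]
print(share)
```

Calling share.plot.barh(stacked=True) on the result would draw one bar per opening, split by winner colour.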
I have a problem drawing a nested pie chart with Matplotlib in Python. I wrote some code for it, but I have issues with the design and the labels.
I'd like to draw this kind of nested pie chart (from the outermost ring to the innermost: SEX, then ALIGN, each showing its counts).
Here is my dataframe:
ALIGN SEX count
2 Bad Characters Male Characters 1542
5 Good Characters Male Characters 1419
3 Good Characters Female Characters 714
0 Bad Characters Female Characters 419
8 Neutral Characters Male Characters 254
6 Neutral Characters Female Characters 138
1 Bad Characters Genderless Characters 9
4 Good Characters Genderless Characters 4
7 Neutral Characters Genderless Characters 3
9 Reformed Criminals Male Characters 2
Here is my code for the nested pie chart:
fig, ax = plt.subplots(figsize=(24,12))
size = 0.3
ax.pie(dc_df_ALIGN_SEX.groupby('SEX')['count'].sum(), radius=1,
labels=dc_df_ALIGN_SEX['SEX'].drop_duplicates(),
autopct='%1.1f%%',
wedgeprops=dict(width=size, edgecolor='w'))
ax.pie(dc_df_ALIGN_SEX['count'], radius=1-size, labels = dc_df_ALIGN_SEX["ALIGN"],
wedgeprops=dict(width=size, edgecolor='w'))
ax.set(aspect="equal", title='Pie plot with `ax.pie`')
plt.show()
How can I lay the charts out in a grid of 4 rows and 4 columns, one chart per slot, and show the labels in the legend area?
Since the question has been changed, I'm posting a new answer.
First, I slightly simplified your DataFrame:
import pandas as pd
df = pd.DataFrame([['Bad', 'Male', 1542],
['Good', 'Male', 1419],
['Good', 'Female', 714],
['Bad', 'Female', 419],
['Neutral', 'Male', 254],
['Neutral', 'Female', 138],
['Bad', 'Genderless', 9],
['Good', 'Genderless', 4],
['Neutral', 'Genderless', 3],
['Reformed', 'Male', 2]])
df.columns = ['ALIGN', 'SEX', 'n']
For the numbers in the outer ring, we can use a simple groupby, as you did:
outer = df.groupby('SEX')[['n']].sum()  # sum only the numeric column, not the ALIGN strings
But for the numbers in the inner ring, we need to group by both categorical variables, which results in a MultiIndex:
inner = df.groupby(['SEX', 'ALIGN']).sum()
inner
n
SEX ALIGN
Female Bad 419
Good 714
Neutral 138
Genderless Bad 9
Good 4
Neutral 3
Male Bad 1542
Good 1419
Neutral 254
Reformed 2
We can extract the appropriate labels from the MultiIndex with its get_level_values() method:
inner_labels = inner.index.get_level_values(1)
Now you can turn the above values into one-dimensional arrays and plug them into your plot calls:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(24,12))
size = 0.3
ax.pie(outer.values.flatten(), radius=1,
labels=outer.index,
autopct='%1.1f%%',
wedgeprops=dict(width=size, edgecolor='w'))
ax.pie(inner.values.flatten(), radius=1-size,
labels = inner_labels,
wedgeprops=dict(width=size, edgecolor='w'))
ax.set(aspect="equal", title='Pie plot with `ax.pie`')
plt.show()
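To move the labels into a legend rather than drawing them on the rings, the wedge handles returned by ax.pie can be passed to ax.legend. The sketch below assumes a subset of the rows from the question; the label format "SEX / ALIGN" is an illustrative choice, not something from the original post.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A subset of the rows from the question, with shortened labels.
df = pd.DataFrame({
    "ALIGN": ["Bad", "Good", "Good", "Bad", "Neutral", "Neutral"],
    "SEX": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "n": [1542, 1419, 714, 419, 254, 138],
})

outer = df.groupby("SEX")["n"].sum()
inner = df.groupby(["SEX", "ALIGN"])["n"].sum()

fig, ax = plt.subplots(figsize=(8, 8))
size = 0.3

# Without autopct, ax.pie returns (wedges, texts); no labels are passed,
# so nothing is written on the rings themselves.
outer_wedges, _ = ax.pie(outer.to_numpy(), radius=1,
                         wedgeprops=dict(width=size, edgecolor="w"))
inner_wedges, _ = ax.pie(inner.to_numpy(), radius=1 - size,
                         wedgeprops=dict(width=size, edgecolor="w"))

# Build the legend from the wedge handles instead of labelling the slices.
inner_names = [f"{sex} / {align}" for sex, align in inner.index]
ax.legend(list(outer_wedges) + list(inner_wedges),
          list(outer.index) + inner_names,
          loc="center left", bbox_to_anchor=(1, 0.5), title="SEX / ALIGN")
ax.set(aspect="equal")
```

Both rings draw from the same default colour cycle here, so passing an explicit colors list to each ax.pie call is probably needed to keep the legend unambiguous.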
You define the function percentage_growth(l) in a way that supposes its argument l to be a list (or some other one-dimensional object). But then (to assign colors) you call this function on dc_df_ALIGN_SEX, which is apparently your DataFrame. So the function (in the first iteration of its loop) tries to evaluate dc_df_ALIGN_SEX[0], which throws the key error, because that is not a proper way to index the DataFrame.
Perhaps you want to do something like percentage_growth(dc_df_ALIGN_SEX['count']) instead?
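The difference between indexing the DataFrame and indexing one of its columns can be seen in a small sketch. The frame below is a made-up two-row stand-in for dc_df_ALIGN_SEX, and percentage_growth itself is not reproduced because its definition is not shown here.

```python
import pandas as pd

# Two-row stand-in for the real DataFrame (assumption).
frame = pd.DataFrame({"ALIGN": ["Bad", "Good"], "count": [1542, 1419]})

# frame[0] looks for a COLUMN labelled 0, which does not exist here.
try:
    frame[0]
    raised = False
except KeyError:
    raised = True
print("frame[0] raised KeyError:", raised)

# Selecting the column first gives a one-dimensional Series.
counts = frame["count"]
first_value = counts.iloc[0]  # positional access to the first element
print(first_value)
```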
I am coming from an R ggplot2 background and am a bit confused by matplotlib plotting.
Here is my dataframe:
languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})
language county count
0 en us 32
1 cs ch 432
2 es sp 43
3 pt br 55
4 hi in 6
5 en fr 23
6 es ar 455
7 es pr 23
Now I want to plot:
A stacked bar chart where the x axis shows the language and the y axis shows the count: the total bar height is the total count for that language, and the stacked segments show the countries for that language.
A side-by-side bar chart with the same parameters, where the countries are shown next to each other instead of stacked.
Most examples plot directly from the dataframe with the pandas plot method, but I want to build the plot step by step in a script so that I have more control over it and can edit whatever I want, something like this script:
ind = np.arange(df.languages.nunique())
width = 0.35
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(ind, df.languages, width, color='r')
ax.bar(ind, df.count, width,bottom=df.languages, color='b')
ax.set_ylabel('Count')
ax.set_title('Score y language and country')
ax.set_xticks(ind, df.languages)
ax.set_yticks(np.arange(0, 81, 10))
ax.legend(labels=[df.countries])
plt.show()
By the way, here is my pandas pivot code for the same plot:
df.pivot(index = "Language", columns = "Country", values = "count").plot.bar(figsize=(15,10))
plt.xticks(rotation = 0,fontsize=18)
plt.xlabel('Language' )
plt.ylabel('Count ')
plt.legend(fontsize='large', ncol=2,handleheight=1.5)
plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})
modified = {}
modified['language'] = np.unique(df.language)
country_count = []
total_count = []
for x in modified['language']:
    country_count.append(len(df[df['language']==x]))
    total_count.append(df[df['language']==x]['count'].sum())
modified['country_count'] = country_count
modified['total_count'] = total_count
mod_df = pd.DataFrame(modified)
print(mod_df)
ind = mod_df.language
width = 0.35
p1 = plt.bar(ind,mod_df.total_count, width)
p2 = plt.bar(ind,mod_df.country_count, width,
bottom=mod_df.total_count)
plt.ylabel("Total count")
plt.xlabel("Languages")
plt.legend((p1[0], p2[0]), ('Total Count', 'Country Count'))
plt.show()
The code first reshapes the dataframe into the following:
language country_count total_count
0 cs 1 432
1 en 2 55
2 es 3 521
3 hi 1 6
4 pt 1 55
This is the plot:
As the value of country count is small, you cannot clearly see the stacked country count.
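If each stacked segment should instead represent one country's count, the bars can be built one country at a time while keeping a running bottom. This is a minimal sketch under that reading of the question; the variable name `wide` is illustrative, and the column keeps the question's spelling `county`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "language": ["en", "cs", "es", "pt", "hi", "en", "es", "es"],
    "county": ["us", "ch", "sp", "br", "in", "fr", "ar", "pr"],
    "count": [32, 432, 43, 55, 6, 23, 455, 23],
})

# One row per language, one column per country; missing pairs become 0.
wide = df.pivot_table(index="language", columns="county",
                      values="count", aggfunc="sum", fill_value=0)

fig, ax = plt.subplots()
bottom = np.zeros(len(wide))
for country in wide.columns:
    heights = wide[country].to_numpy()
    # Each country is drawn on top of the running total for its language.
    ax.bar(wide.index, heights, 0.35, bottom=bottom, label=country)
    bottom = bottom + heights

ax.set_xlabel("Language")
ax.set_ylabel("Count")
ax.legend(title="Country", ncol=2)
```

After the loop, bottom holds the total count per language, which is the full bar height. For the side-by-side variant, the same loop could shift the x positions by a per-country offset instead of raising the bottom.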
import seaborn as sns
import matplotlib.pyplot as plt
figure, axis = plt.subplots(1,1,figsize=(10,5))
sns.barplot(x="language",y="count",data=df,ci=None)#,hue='county')
axis.set_title('Score y language and country')
axis.set_ylabel('Count')
axis.set_xlabel("Language")
sns.countplot(x=df.language,data=df)
I have a data frame that looks like -
id age_bucket state gender duration category1 is_active
1 (40, 70] Jammu and Kashmir m 123 ABB 1
2 (17, 24] West Bengal m 72 ABB 0
3 (40, 70] Bihar f 109 CA 0
4 (17, 24] Bihar f 52 CA 1
5 (24, 30] MP m 23 ACC 1
6 (24, 30] AP m 103 ACC 1
7 (30, 40] West Bengal f 182 GF 0
I want to create a bar plot showing how many people are active for each age_bucket and state (top 10). For gender and category1 I want to create a pie chart with the proportion of active people. The top of each bar should display the total count of active and inactive members, and similarly the pie chart should display percentages based on is_active.
How to do it in python using seaborn or matplotlib?
I have done so far -
import seaborn as sns
%matplotlib inline
sns.barplot(x='age_bucket',y='is_active',data=df)
sns.barplot(x='category1',y='is_active',data=df)
It sounds like you want to count the observations rather than plot a value from a column along the y axis. In seaborn, the function for this is countplot():
sns.countplot(x='age_bucket', hue='is_active', data=df)
Since the returned object is a matplotlib Axes, you can assign it to a variable (e.g. ax) and then use ax.annotate to place text in the figure manually:
ax = sns.countplot(x='age_bucket', hue='is_active', data=df)
ax.annotate('1 1', (0, 1), ha='center', va='bottom', fontsize=12)
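Rather than typing each label by hand, the annotation can be looped over the bars. The sketch below swaps seaborn's countplot for a plain pandas bar plot so that it only needs pandas and matplotlib; the data are the seven rows from the question, reduced to the two columns used.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The seven rows shown in the question, reduced to the two columns used here.
df = pd.DataFrame({
    "age_bucket": ["(40, 70]", "(17, 24]", "(40, 70]", "(17, 24]",
                   "(24, 30]", "(24, 30]", "(30, 40]"],
    "is_active": [1, 0, 0, 1, 1, 1, 0],
})

# Count rows per age bucket and activity flag.
counts = pd.crosstab(df["age_bucket"], df["is_active"])

fig, ax = plt.subplots()
counts.plot.bar(ax=ax, rot=0)

# Write each bar's height just above it; skip empty bars.
for bar in ax.patches:
    height = bar.get_height()
    if height > 0:
        ax.annotate(f"{int(height)}",
                    (bar.get_x() + bar.get_width() / 2, height),
                    ha="center", va="bottom", fontsize=12)
ax.set_ylabel("count")
```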
Seaborn has no way of creating pie charts, so you would need to use matplotlib directly. However, it is often easier to tell counts and proportions from bar charts so I would generally recommend that you stick to those unless you have a specific constraint that forces you to use a pie chart.
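If a pie chart is required anyway, matplotlib's ax.pie can show the percentages through its autopct argument. This is a minimal sketch using the gender and is_active values from the seven rows in the question; the variable name `active` is illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# gender and is_active from the seven rows shown in the question.
df = pd.DataFrame({
    "gender": ["m", "m", "f", "f", "m", "m", "f"],
    "is_active": [1, 0, 0, 1, 1, 1, 0],
})

# Number of active members per gender.
active = df[df["is_active"] == 1].groupby("gender").size()

fig, ax = plt.subplots()
# autopct prints each slice's share of the total as a percentage.
ax.pie(active.to_numpy(), labels=list(active.index), autopct="%1.1f%%")
ax.set_title("Active members by gender")
```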