I have a dataset that looks something like this:
status    age_group
failure   18-25
failure   26-30
failure   18-25
success   41-50
... and so on
sns.countplot(y='status', hue='age_group', data=data)
When I countplot the full dataset I get this:
[Figure: dataset countplot hued by age_group]
My question is: how do I plot a graph that is adjusted by the number of occurrences of each age_group directly with seaborn? Without that adjustment the graph is really misleading; for example, the >60 age group appears the most simply because it has more people in it. I searched the documentation but could not find a built-in function for this case.
Thanks in advance.
The easiest way to show the proportions is via sns.histplot(..., multiple='fill'). To force an order for the age groups and the status, it helps to create ordered categories.
Here is some example code, tested with seaborn 0.11.1:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import numpy as np
import pandas as pd
data = pd.DataFrame({'status': np.random.choice(['Success', 'Failure'], 100, p=[.7, .3]),
'age_group': np.random.choice(['18-45', '45-60', '> 60'], 100, p=[.2, .3, .5])})
data['age_group'] = pd.Categorical(data['age_group'], ordered=True, categories=['18-45', '45-60', '> 60'])
data['status'] = pd.Categorical(data['status'], ordered=True, categories=['Failure', 'Success'])
ax = sns.histplot(y='age_group', hue='status', multiple='fill', data=data)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel('Percentage')
plt.show()
Now, to create the exact plot of the question, some pandas manipulations can build the required dataframe:
- count the values for each age group and status
- divide these by the total for each age group
Probably some shortcuts can be taken, but this is how I tried to juggle with pandas (edit from comment by @PatrickFitzGerald: using pd.crosstab()):
# df = data.groupby(['status', 'age_group']).agg(len).reset_index(level=0) \
# .pivot(columns='status').droplevel(level=0, axis=1)
# totals = df.sum(axis=1)
# df['Success'] /= totals
# df['Failure'] /= totals
df = pd.crosstab(data['age_group'], data['status'], normalize='index')
df1 = df.melt(var_name='status', value_name='percentage', ignore_index=False).reset_index()
ax = sns.barplot(y='status', x='percentage', hue='age_group', palette='rocket', data=df1)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel('Percentage')
ax.set_ylabel('')
plt.show()
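As a quick cross-check of the normalization step, the per-group proportions can be inspected without plotting. This is only a minimal sketch on a small made-up frame (the values are illustrative, not the asker's data); with normalize='index', each row of the crosstab should sum to 1.

```python
import pandas as pd

# Small made-up sample (illustrative values only)
data = pd.DataFrame({
    'status': ['failure', 'failure', 'failure', 'success', 'success', 'failure'],
    'age_group': ['18-25', '26-30', '18-25', '41-50', '18-25', '41-50'],
})

# Share of each status within every age group (each row sums to 1)
proportions = pd.crosstab(data['age_group'], data['status'], normalize='index')
row_sums = proportions.sum(axis=1)
print(proportions)
print(row_sums)
```

Because every age group is scaled by its own total, a large group no longer dominates the plot just by having more members.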
I have the Python code below, but its output is a chart like the one in the attachment, which is really messy. Can anybody tell me how to fix the issue and put the days in ascending order on the x-axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot( 'day', 'cases', data=df)
At first, I didn't convert the day to str, so it came out like this.
Because it had decimal numbers, I converted it to str. Now this happens.
What you got is typical of an unsorted dataset with many points per group.
As you did not provide an example, here is one:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1,21,size=100),
'cases': np.random.randint(0,50000,size=100),
})
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case, you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value (e.g. the mean):
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())
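For the ordering part of the question, one option is to keep 'day' numeric instead of casting it to str, since string values are placed on the axis in order of first appearance rather than in numeric order. The sketch below uses a small made-up frame (illustrative values only) and assumes the decimal days are whole numbers stored as floats.

```python
import pandas as pd

# Made-up sample with float-valued, unsorted days (illustrative only)
df = pd.DataFrame({'day': [3.0, 1.0, 2.0, 1.0, 3.0],
                   'cases': [30, 10, 20, 14, 36]})

# Cast to int (not str) so the axis stays numeric, then aggregate and sort
df['day'] = df['day'].astype(int)
daily = df.groupby('day', as_index=False)['cases'].mean().sort_values('day')
print(daily)
```

Passing the resulting frame to plt.plot('day', 'cases', data=daily) should then draw the days in ascending order.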
I want to have x-axis = 'brand', y-axis = 'count', and 2 series for 'online_order' (True & False).
How can I do this in Python (using Jupyter)?
Right now, my y-axis is on a scale of 0-1. I want the y-axis to scale automatically based on the values.
This is the result I am getting:
Since the plot code is not included, this is just a guess, but the plot was probably made with something like the following:
df.groupby(['brand', 'online_order'])['count'].size().unstack().plot.bar(legend=True)
The issue is that .size() does not return the values in 'count'; it is GroupBy.size, which computes the number of rows in each group, and here every group has exactly 1 row.
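To make the difference concrete, here is a minimal sketch comparing the group sizes with the values actually stored in 'count'. It uses the four test rows from this answer; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'],
                   'online_order': [False, True, True, False],
                   'count': [2122, 2047, 1640, 1604]})

grouped = df.groupby(['brand', 'online_order'])['count']
sizes = grouped.size().unstack()   # number of rows per group: 1 everywhere
totals = grouped.sum().unstack()   # the values stored in the 'count' column
print(sizes)
print(totals)
```

Plotting sizes gives bars that all have height 1, which would explain a y-axis running from 0 to 1, whereas plotting totals uses the real counts.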
Using seaborn
The easiest way to get the desired plot is using seaborn, which is a high-level API for matplotlib.
Use seaborn.barplot with hue='online_order'.
The dataframe does not need to be reshaped.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# test data
df = pd.DataFrame({'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'], 'online_order': [False, True, True, False], 'count': [2122, 2047, 1640, 1604]})
# plot
plt.figure(figsize=(7, 5))
sns.barplot(x='brand', y='count', hue='online_order', data=df)
Using pandas.DataFrame.pivot
.pivot changes the shape of the dataframe to accommodate the plot API
This option also uses pandas.DataFrame.plot.bar
df.pivot(index='brand', columns='online_order', values='count').plot.bar()
If the data is a csv file, you can import matplotlib and pandas to create a graph and view the data.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("file name here")
plt.bar(data['brand'], data['count'])  # data.count is a DataFrame method, so use bracket access
plt.xlabel("brand")
plt.ylabel("count")
I have the following dataset, code and plot:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
data = [['tom', 10,15], ['matt', 13,10]]
df3 = pd.DataFrame(data, columns = ['Name', 'Attempts','L4AverageAttempts'])
f,ax = plt.subplots(nrows=1,figsize=(16,9))
sns.barplot(x='Attempts',y='Name',data=df3)
plt.show()
How can I get a marker of some kind (dot, *, shape, etc.) to show that tom has averaged 15 (so he is below his average) and matt has averaged 10 (so he is above his average)? In other words, a marker based on the L4AverageAttempts value for each person.
I have looked into axvline, but that seems to draw a line at one fixed number rather than at a specific value for each y-axis category. Any help would be much appreciated, thanks!
You can simply plot a scatter plot on top of your bar plot, using L4AverageAttempts as the x value. seaborn.scatterplot works for this; make sure to set the zorder parameter so that the markers appear on top of the bars.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
data = [['tom', 10,15], ['matt', 13,10]]
df3 = pd.DataFrame(data, columns = ['Name', 'Attempts','L4AverageAttempts'])
f,ax = plt.subplots(nrows=1,figsize=(16,9))
sns.barplot(x='Attempts',y='Name',data=df3)
sns.scatterplot(x='L4AverageAttempts', y="Name", data=df3, zorder=10, color='k', edgecolor='k')
plt.show()
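The same overlay idea can also be sketched with plain matplotlib, in case seaborn is not wanted for this step. This is a hedged alternative rather than the method above: it assumes horizontal bars via ax.barh and a diamond marker, both of which are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = [['tom', 10, 15], ['matt', 13, 10]]
df3 = pd.DataFrame(data, columns=['Name', 'Attempts', 'L4AverageAttempts'])

fig, ax = plt.subplots(figsize=(8, 4))
# Horizontal bars for the current number of attempts
bars = ax.barh(df3['Name'], df3['Attempts'])
# One marker per person at that person's own average; zorder keeps it above the bars
markers = ax.scatter(df3['L4AverageAttempts'], df3['Name'],
                     color='k', marker='D', zorder=10)
ax.set_xlabel('Attempts')
```

In a script, call plt.show() or fig.savefig(...) afterwards to display or store the figure.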
I know that seaborn.countplot has the attribute order which can be set to determine the order of the categories. But what I would like to do is have the categories be in order of descending count. I know that I can accomplish this by computing the count manually (using a groupby operation on the original dataframe, etc.) but I am wondering if this functionality exists with seaborn.countplot. Surprisingly, I cannot find an answer to this question anywhere.
This functionality is not built into seaborn.countplot as far as I know - the order parameter only accepts a list of strings for the categories, and leaves the ordering logic to the user.
This is not hard to do with value_counts() provided you have a DataFrame though. For example,
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='darkgrid')
titanic = sns.load_dataset('titanic')
sns.countplot(x='class',
              data=titanic,
              order=titanic['class'].value_counts().index)
plt.show()
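The list passed to order can be inspected on its own before plotting. The sketch below uses a small made-up Series (illustrative values only) to show that value_counts() sorts by descending count by default, and that ascending=True gives the reverse ordering.

```python
import pandas as pd

# Made-up categorical column (illustrative only)
s = pd.Series(['Third', 'First', 'Third', 'Second', 'Third', 'First'])

descending = s.value_counts().index.tolist()               # most frequent first
ascending = s.value_counts(ascending=True).index.tolist()  # least frequent first
print(descending)
print(ascending)
```

Either list can be handed to the order parameter of countplot, depending on the direction wanted.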
Most often, a seaborn countplot is not really necessary. Just plot with pandas bar plot:
import seaborn as sns; sns.set(style='darkgrid')
import matplotlib.pyplot as plt
df = sns.load_dataset('titanic')
df['class'].value_counts().plot(kind="bar")
plt.show()
My issue is very specific, I guess, but I can't seem to find a proper solution, and I'm clueless about the error output that I get.
Anyway, I have a pandas dataframe loaded from an sqlite database.
data_frame = pd.read_sql_query(
    "SELECT (total_comb + total_comb_rc) as total_comb, p_val, w_length from {tn}".format(
        tn=table_name), conn)
With that loaded, I group the data by the 'w_length' value.
for i, group in data_frame.groupby('w_length'):
Now, I want to plot a scatter plot for each group with seaborn lmplot.
for i, group in data_frame.groupby('w_length'):
    sns.lmplot(x=group['total_comb'], y=group['p_val'],
               data=group,
               fit_reg=False)
    sns.despine()
    plt.savefig('test_scatter'+i+'.png', dpi=400)
But for some reason I'm getting this output:
'[ 6.95485628e-02 3.53641178e-01 3.46862200e+06 4.11684800e+06] not in index'
and no plot file.
I know I'm doing something wrong, but I can't seem to figure it out.
P.S. I know I can do something like this:
sns.lmplot(x='total_comb', y='p_val',
           data=data_frame,
           fit_reg=False,
           hue="w_length", x_jitter=.1, col="w_length", col_wrap=3, size=4)
but I also need the separate plots for each 'w_length'.
Thanks!!
Supposing the problem is not due to the data collection from the sql database, it's probably due to the fact that you call
sns.lmplot(x=group['total_comb'], y=group['p_val'], data=group)
instead of
sns.lmplot(x='total_comb', y='p_val', data=group)
Here is a working example, which produces two separate plots:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np; np.random.seed(42)
x = np.arange(24)
y = np.random.randint(1,10, len(x))
cat = np.random.choice(["A", "B"], size=len(x))
df = pd.DataFrame({"x": x, "y": y, "cat": cat})
for i, group in df.groupby('cat'):
    sns.lmplot(x="x", y="y", data=group, fit_reg=False)
    plt.savefig(__file__+str(i)+".png")
plt.show()
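One further detail worth checking when adapting this to the question's loop: if 'w_length' is numeric (an assumption, since the question does not say), then 'test_scatter'+i+'.png' would raise a TypeError because a str and a number cannot be concatenated. A minimal sketch of building the file names with an f-string, on a made-up frame that only mimics the question's column names:

```python
import pandas as pd

# Made-up frame with integer group keys, mimicking 'w_length' (illustrative only)
df = pd.DataFrame({'w_length': [5, 5, 7, 7],
                   'total_comb': [10, 20, 30, 40],
                   'p_val': [0.1, 0.2, 0.3, 0.4]})

filenames = []
for key, group in df.groupby('w_length'):
    # f-strings format numeric keys directly, unlike 'test_scatter' + key + '.png'
    filenames.append(f'test_scatter_{key}.png')
print(filenames)
```

Using str(i), as in the working example above, achieves the same thing.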