Plot HIST of a pandas DataframeGroupbySeries - python

I'm working with a small dataframe built from this Codecademy data.
I'm trying to print data to make a small analysis with the following code:
data = pd.read_csv('insurance.csv')
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
data.groupby('sex').region.hist()
The code returns a pandas series where the first element is subplot1 and the second subplot2.
The code plots them on the same figure, and I'm unable to plot them separately.

To produce a histogram for each column based on gender:
'children' and 'smoker' look different because the number is discrete with only 6 and 2 unique values, respectively.
data.groupby('sex').hist(layout=(1, 4), figsize=(12, 4), ec='k', grid=False) alone will produce the graph, but without an easy way to add a title.
Producing the correct visualization often involves reshaping the data for the plotting API.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.2, seaborn 0.11.2
import pandas as pd
# load data
data = pd.read_csv('insurance.csv')
# convert smoker from a string to int value; hist doesn't work on object type columns
data.smoker = data.smoker.map({'no': 0, 'yes': 1})
# group each column by sex; data.groupby(['sex', 'region']) is also an option
for gender, df in data.groupby('sex'):
    # plot a hist for each column
    axes = df.hist(layout=(1, 5), figsize=(15, 4), ec='k', grid=False)
    # extract the figure object from the array of axes
    fig = axes[0][0].get_figure()
    # add the gender as the title
    fig.suptitle(gender)
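The loop above relies on two pandas behaviours: mapping the string column to integers so it counts as numeric, and iterating over a groupby, which yields (key, sub-frame) pairs. A minimal sketch of just the data side, assuming a made-up miniature of the insurance table (the values are hypothetical, not taken from insurance.csv):

```python
import pandas as pd

# Hypothetical miniature of the insurance data (values are made up).
data = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "age": [19, 18, 28, 33],
    "smoker": ["yes", "no", "no", "no"],
})

# Map the string column to integers so it becomes a numeric column.
data["smoker"] = data["smoker"].map({"no": 0, "yes": 1})

# Iterating over a groupby yields (group key, sub-frame) pairs.
groups = {gender: df for gender, df in data.groupby("sex")}

for gender, df in groups.items():
    # Each sub-frame could be passed to df.hist() to get its own figure.
    print(gender, df.shape, list(df.select_dtypes("number").columns))
```

Because each sub-frame is plotted in its own call, each group ends up on a separate figure, which is what makes a per-group title possible.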
Regarding data.groupby('sex').region.hist() in the OP: this is a count plot, which shows counts of gender for each region; it is not a histogram.
pandas.crosstab by default computes a frequency table of the factors.
ax = pd.crosstab(data.region, data.sex).plot(kind='bar', rot=0)
ax.legend(title='gender', bbox_to_anchor=(1, 1.02), loc='upper left')
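To make the frequency table concrete, here is a small sketch on made-up rows (the five records below are hypothetical). It also shows an equivalent groupby formulation, included only for comparison:

```python
import pandas as pd

# Hypothetical rows; only the two categorical columns matter here.
data = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "female"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Frequency table: one row per region, one column per sex.
counts = pd.crosstab(data["region"], data["sex"])
print(counts)

# The same table via groupby: count rows per (region, sex) pair,
# then pivot the sex level into columns, filling missing pairs with 0.
same = data.groupby(["region", "sex"]).size().unstack(fill_value=0)
```

Calling counts.plot(kind='bar') on such a table draws one group of bars per region, which is the plot the crosstab line above produces.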
Use seaborn.displot
This requires converting the data from a wide to long format, which is done with pandas.DataFrame.melt
import pandas as pd
import seaborn as sns
data = pd.read_csv('insurance.csv')
data.smoker = data.smoker.map({'no': 0, 'yes': 1})
# convert the dataframe from a wide to long form
df = data.melt(id_vars=['sex', 'region'])
# plot
p = sns.displot(data=df, kind='hist', x='value', col='variable', row='region', hue='sex',
multiple='dodge', common_bins=False, facet_kws={'sharey': False, 'sharex': False})
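The wide-to-long step is easier to follow on a tiny frame. A minimal sketch, assuming two made-up rows with two numeric columns (the values are hypothetical):

```python
import pandas as pd

# Hypothetical wide-format rows: one column per measured variable.
data = pd.DataFrame({
    "sex": ["female", "male"],
    "region": ["southwest", "southeast"],
    "age": [19, 18],
    "bmi": [27.9, 33.77],
})

# Long format: the id columns are repeated, and every other column is
# stacked into a (variable, value) pair of columns.
long_df = data.melt(id_vars=["sex", "region"])
print(long_df)
```

The 'variable' column is what col='variable' facets on, and 'value' is what goes on the x axis.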

Related

Grouping variables to plot multilevel bar chart

Can you please help me with the following? I have a dataset with a variable, number of products (Prod), that takes discrete values from 1 to 3 (inclusive). Then I have a variable (Gender): 1 for males, 0 for females. I want to plot a multilevel bar chart where the x-axis shows the number of products (Prod) and the y-axis shows the total value of these products, grouped by Gender. I need to create a 'count' variable that counts how many observations of each 'Prod' are in each 'Gender' category. To group and plot the variables I use the following code (which does not work):
#Group the variables
grouped_gender['count'] = main_data.groupby(['Prod', 'Gender'])[['Prod']].count()
grouped_gender = pd.DataFrame(grouped_gender)
#Plot
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 7))
barplot2 = sns.barplot(
data=grouped_gender,
x='Prod',
y='count',
hue='Gender',
orient='v',
ax = axes,
ci=None,
dodge=False
)
Can you please help me to identify the problem?
Assuming you can put your DataFrame in a state similar to mine:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
grouped_gender = pd.DataFrame(
{
"Man": [50, 70],
"Woman": [90, 30]
},
index=["Product1", "Product2"]
)
grouped_gender.plot(kind="bar", stacked=True)
plt.title("Products sales")
plt.xlabel("Products")
plt.ylabel("Sales")
plt.show()
This produces a stacked bar chart with one bar per product.
Use countplot on the original dataset:
# sample dataset
df = sns.load_dataset('tips')
# `day` plays `Prod`, `sex` plays `Gender`
sns.countplot(x='day', hue='sex', data=df)
Note: if you want the data, not just the plot, use:
counts = pd.crosstab(df['day'], df['sex'])
# then to plot bar chart
# counts.plot.bar()
which gives you:
sex   Male  Female
day
Thur    30      32
Fri     10       9
Sat     59      28
Sun     58      18
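If a long-format 'count' column is wanted, as in the question, one possible way to build it is groupby(...).size() followed by reset_index. This is a sketch on made-up data; the column names Prod and Gender come from the question, the values are hypothetical. One likely reason the question's code fails is that it assigns into grouped_gender before that name exists, and that Prod and Gender end up in the index rather than as columns.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
main_data = pd.DataFrame({
    "Prod": [1, 1, 2, 3, 3, 3],
    "Gender": [1, 0, 1, 0, 0, 1],
})

# Count rows per (Prod, Gender) pair and move the group keys back
# into ordinary columns so a plotting function can refer to them by name.
grouped_gender = (
    main_data.groupby(["Prod", "Gender"])
    .size()
    .reset_index(name="count")
)
print(grouped_gender)
```

A frame in this shape has the x='Prod', y='count', hue='Gender' columns that the barplot call in the question expects.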

multi colored rel or line plot in seaborn

In the dataframe below, I have unique compIds, each of which can have multiple capei values and multiple dates. This is primarily a time series dataset.
         date      capei   compId
0      200401  25.123777  31946.0
1      200401  15.844910  29586.0
2      200401  20.524131  32507.0
3      200401  15.844910  29586.0
4      200401  15.844910  29586.0
...       ...        ...      ...
73226  202011   9.372320   2817.0
73227  202011   9.372320   2817.0
73228  202011  22.334842  28581.0
73229  202011  10.761727  31946.0
73230  202011  30.205348  15029.0
With the following visualization code I get the plot, but the lines all have the same color. I wanted different colors.
import seaborn as sns
a4_dims = (15, 5)
sns.set_palette("vlag")
# plot
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
sns.relplot(x="date", ax=ax, y="capei", style='compId', kind='line',data=fDf, palette=sns.color_palette("Spectral", as_cmap=True) )
It generates an image like this:
However, I am expecting a plot like this:
The compId in the generated picture (Figure 1) would play the role of Month in Figure 2.
Figure 2 is a screenshot from here.
How can I get a different color for each compId in Figure 1?
First of all, you should reformat your dataframe:
convert 'date' from str to datetime:
fDf['date'] = pd.to_datetime(fDf['date'], format='%Y%m')
convert 'compId' from float to str so that it can be used as a categorical variable (1):
fDf['compId'] = fDf['compId'].apply(str)
Now you can pass 'compId' to seaborn.relplot as the hue and/or style parameter, depending on your preferences.
Complete Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
fDf = pd.read_csv(r'data/data.csv')
fDf['date'] = pd.to_datetime(fDf['date'], format='%Y%m')
fDf['compId'] = fDf['compId'].apply(str)
sns.set_palette("vlag")
sns.set_style('ticks')
sns.relplot(x="date", y="capei", style='compId', hue='compId', kind='line',data=fDf, estimator=None)
plt.show()
(plot drawn with a fake-dataframe)
(1) This step may or may not be necessary; given the context, I suggest it is.
If you keep compId as a numerical type, then the hue in the plot will be proportional to the 'compId' value. This means 0.4 will have a color very different from 31946.0, but 31946.0 and 32507.0 will be practically indistinguishable by color.
If you convert compId to str, then the hue won't depend on the numerical value of compId, so the colors will be equally spaced among categories.
[Figures: the plot with fDf['compId'] as it is, and the plot with fDf['compId'].apply(str)]
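The two conversions can be checked without plotting anything. A minimal sketch on three rows copied from the sample above; casting through int before str is an extra assumption of this sketch (it drops the trailing ".0" and requires that compId has no missing values), not part of the answer:

```python
import pandas as pd

# Three rows taken from the sample shown in the question.
fDf = pd.DataFrame({
    "date": [200401, 200401, 202011],
    "capei": [25.123777, 15.844910, 9.372320],
    "compId": [31946.0, 29586.0, 2817.0],
})

# YYYYMM values become real timestamps (first day of each month).
fDf["date"] = pd.to_datetime(fDf["date"].astype(str), format="%Y%m")

# Categorical labels instead of floats; astype(int) is this sketch's
# addition to avoid labels such as "31946.0".
fDf["compId"] = fDf["compId"].astype(int).astype(str)

print(fDf.dtypes)
```

With string labels, each compId is treated as its own category, so every line gets a distinct color from the palette.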

How to put Seaborn FacetGrid on plt.subplots grid and change title or xlabel

I am trying to create a distplot/kdeplot with hue as the target for multiple columns in my dataset. I am doing this by using kdeplot with a FacetGrid and using the ax parameter to make it plot on my plt.subplots figure. However, I am unable to change the title of the individual plots. I have tried g.fig.suptitle and ax[0,0].set_title, but neither works.
df = sns.load_dataset('titanic')
fig, ax = plt.subplots(2,2,figsize=(10,5))
g = sns.FacetGrid(df, hue="survived")
g = g.map(sns.kdeplot, "age",ax=ax[0,0])
g.fig.suptitle('age')
h = g.map(sns.kdeplot, "fare",ax=ax[0,1])
h.fig.suptitle('fare')
ax[0,0].set_title ='age'
ax[1,0].set_xlabel='fare'
I am getting this image. It seems to be changing a plot at the bottom which I don't want
By the way, I don't have to use Seaborn, matplotlib is also fine
If you want two density plots, one for age and one for fare, you can reshape the data to long format and then apply FacetGrid:
import seaborn as sns
df = sns.load_dataset('titanic')
The long format looks like this; it allows you to split on the variable column:
df[['survived','age','fare']].melt(id_vars='survived')
   survived variable  value
0         0      age  22.00
1         1      age  38.00
2         1      age  26.00
3         1      age  35.00
4         0      age  35.00
And we plot:
g = sns.FacetGrid(df[['survived','age','fare']].melt(id_vars='survived'),
col="variable",sharex=False,sharey=False,hue="survived")
g.map(sns.kdeplot,"value")
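Separately from the FacetGrid approach, note that lines such as ax[0,0].set_title ='age' in the question assign a string over the method instead of calling it, so no title is set. Since the question says plain matplotlib is acceptable, here is a sketch of per-axes titles on a plt.subplots grid. It uses made-up random data and histograms rather than KDE curves to keep dependencies minimal; both are assumptions of this sketch, not the answer's method:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for two numeric columns split by a binary target.
columns = {
    "age": {0: rng.normal(30, 10, 200), 1: rng.normal(28, 12, 200)},
    "fare": {0: rng.exponential(20, 200), 1: rng.exponential(45, 200)},
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, groups) in zip(axes, columns.items()):
    for target, values in groups.items():
        # One semi-transparent density histogram per target value.
        ax.hist(values, bins=20, density=True, alpha=0.5,
                label=f"survived={target}")
    # Call the methods; `ax.set_title = name` would only overwrite them.
    ax.set_title(name)
    ax.set_xlabel(name)
    ax.legend()
fig.tight_layout()
```

Calling plt.show() afterwards displays the figure in an interactive session.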

Random empty spaces/bars in seaborn distribution plot

GOAL: I want to make a distribution function for registered dogs' ages in 2017 in Zurich from the 'Dogs of Zurich' dataset (Kaggle) (with Python). The variable I'm working with - 'GEBURTSJAHR_HUND' - gives the birth year for every registered dog as an int.
I have converted it to a 'dog_age' variable (= 2017 - birth_date) and want to plot the distribution function. See image below for sorted list of group size per age.
Size of dog age groups
PROBLEM: The problem I'm running into is that the x axis of my distribution plot has empty spaces/bars in it. Every age is shown on the graph, but between some of these ages there are empty bars.
Example: 1 and 2 are full bars, but between them is an empty space. Between 2 and 3 there is no empty space, but between 3 and 4 there is. It seems random which values have white space between them.
What my problematic distribution plot looks like at the moment
TRIED: I have previously tried three things to fix this.
1. plt.xticks(...)
   Unfortunately this only changed the aesthetics of the x axis.
2. Tried ax = sns.distplot followed by ax.xaxis ticker lines, but this did not have the expected result.
   ax.xaxis.set_major_locator(ticker.MultipleLocator())
   ax.xaxis.set_major_formatter(ticker.ScalarFormatter(0))
3. Maybe the problem is with the 'dog_age' variable?
   Used the original birth_date variable, but this had the same problem.
CODE:
dfnew = pd.read_csv(dog17_filepath,index_col='HALTER_ID')
dfnew.dropna(subset = ["ALTER"], inplace=True)
dfnew['dog_age'] = 2017 - dfnew['GEBURTSJAHR_HUND']
b = dfnew['dog_age']
sns.set_style("darkgrid")
plt.figure(figsize=(15,5))
sns.distplot(a=b,hist=True)
plt.xticks(np.arange(min(b), max(b)+1, 1))
plt.xlabel('Age Dog', fontsize=12)
plt.title('Distribution of age of dogs', fontsize=20)
plt.show()
Thanks in advance,
Arthur
The problem is that the age column is discrete: it only contains a short range of integers. By default, the histogram divides the range of values (treated as floats) into a fixed number of bins, which usually don't align well with those integers. To get an appropriate histogram, the bins need to be set explicitly, for example with a bin boundary at every half.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
dfnew = pd.read_csv('hundehalter.csv')
dfnew.dropna(subset=["ALTER"], inplace=True)
dfnew['dog_age'] = 2017 - dfnew['GEBURTSJAHR_HUND']
b = dfnew['dog_age'][(dfnew['dog_age'] >= 0) & (dfnew['dog_age'] <= 25)]
sns.set_style("darkgrid")
plt.figure(figsize=(15, 5))
sns.distplot(a=b, hist=True, bins=np.arange(min(b)-0.5, max(b)+1, 1))
plt.xticks(np.arange(min(b), max(b) + 1, 1))
plt.xlabel('Age Dog', fontsize=12)
plt.title('Distribution of age of dogs', fontsize=20)
plt.xlim(min(b), max(b) + 1)
plt.show()
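The binning effect described in this answer can be reproduced numerically with numpy.histogram, without any plotting. A minimal sketch on made-up ages; the choice of 30 bins is an arbitrary stand-in for whatever automatic rule the plotting function applies:

```python
import numpy as np

# Made-up discrete ages: every integer from 0 to 15 exactly once.
ages = np.arange(0, 16)

# More equal-width bins than distinct integers: some bins must be empty,
# which shows up as gaps between bars.
counts_auto, edges_auto = np.histogram(ages, bins=30)

# Bin edges at every half (-0.5, 0.5, ..., 15.5): one bin per integer.
edges_int = np.arange(ages.min() - 0.5, ages.max() + 1, 1)
counts_int, _ = np.histogram(ages, bins=edges_int)

print("empty bins with 30 equal-width bins:", int((counts_auto == 0).sum()))
print("counts with half-integer edges:", counts_int)
```

With half-integer edges each integer age is centred in its own bin, so no bar is split or left empty.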

Trying to plot a bar chart with age categories issue. Seaborn and Pandas df

Hi all, I have the following groups of data:
sumcosts = df.groupby('AgeGroup').Costs.sum()
print(sumcosts):
AgeGroup
18-25 536295.37
25-35 1784085.88
35-45 2395250.62
45-55 5483060.33
55-65 11652094.30
65-75 9633490.63
75+ 5186867.32
Name: Costs, dtype: float64
countoftrips = df.groupby('AgeGroup').Booking.nunique()
print(countoftrips):
AgeGroup
18-25 139
25-35 398
35-45 379
45-55 738
55-65 1417
65-75 995
75+ 545
Name: Booking, dtype: int64
To plot these, I used the following:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
sns.set()
fig, ax1 = plt.subplots()
sns.barplot(data=sumcosts, palette="rocket", ax=ax1)
ax2 = ax1.twinx()
sns.lineplot(data=countoftrips, palette="rocket", ax=ax2)
plt.show()
The output is this:
The line section looks correct, but the bar chart has obviously stopped at the first age bracket. Any ideas on how to correct this? I tried to define x='AgeGroup' and y='Costs' but then got errors, and this is the most progress I have made. Thanks very much!
Your barplot appears to be showing the sum of all costs, not just those of the 18-25 age group. The bar appears under the x-axis label for the 18-25 group only because of the positioning of the axis for the line plot, which makes it confusing.
I created a dummy data set of 1000 rows in a .csv to graph this example. My values are different, so the plots will look visually different, but everything else will work the same for you.
Jupyter Notebook Setup:
(images added to reflect outputs)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# Read in dataset 'df', showing the header
df = pd.read_csv('./data-raw.csv')
df.head()
Assuming you have no NaN values in your data ... otherwise you can use dropna() to remove them.
# Check if there are any NaN values in the dataframe
print('Number of NaN values in the columns of our DataFrame:\n', df.isnull().sum())
# Remove any rows that contain NaN values using dropna (as applicable)
df.dropna(axis=0, inplace=True)
Your sumcosts and countoftrips are not a requirement for creating your plots, and I believe they are the cause of your plotting error for the bar graph. I've included them here, but am not using them when creating the plot.
Plot Type:
It is also important to keep in mind that a bar plot shows only the mean (or another estimator, e.g. the standard deviation), but in many cases it may be more informative to show the distribution of values at each level of the categorical variable. In that case, other approaches such as a box or violin plot may be more appropriate.
Solution:
This is assuming you want to have the line and bar plot layered over each other, as in your example:
# This plot has both graphs on the axis you outlined in your code,
# I used the ci = None parameter to remove the confidence intervals to
# make the combined plot easier to read (optional)
fig, ax1 = plt.subplots()
sb.barplot(data = df, x = 'AgeGroup', y = 'Costs', ci = None,
ax = ax1, palette = 'rocket', order = ['18-25',
'25-35','35-45','45-55','55-65', '65-75', '75+']);
ax2 = ax1.twinx()
sb.lineplot(data = df, x = 'AgeGroup', y = 'Booking', ax = ax2, ci = None);
plt.xlabel('Age Group Ranges');
plt.show()
Here is an alternative you could try, also using subplot, but separating the two plots.
# Adjusting the plot size just to make it easier to read here:
plt.figure(figsize = [14, 4])
#Bar Chart on Left
plt.subplot(1, 2, 1) # 1 row, 2 cols, subplot 1
sb.barplot(data = df, x = 'AgeGroup', y = 'Costs', palette = 'rocket',
ci = 'sd', order = ['18-25', '25-35', '35-45',
'45-55','55-65', '65-75', '75+']);
plt.xlabel('Age Group Ranges')
plt.ylabel('Costs')
# Line Chart on Right
plt.subplot(1, 2, 2) # 1 row, 2 cols, subplot 2
sb.lineplot(data = df, x = 'AgeGroup', y = 'Booking', ci = None)
plt.xlabel('Age Group Ranges')
plt.ylabel('Bookings');
Hope you find this helpful!
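If the totals and unique booking counts from the question are what should be plotted (rather than the per-group mean that barplot computes from raw rows), one option is to aggregate first into a single frame with named columns. A minimal sketch on made-up rows; the column names AgeGroup, Costs and Booking come from the question, while total_costs, trips and the values are hypothetical:

```python
import pandas as pd

# Hypothetical raw rows using the column names from the question.
df = pd.DataFrame({
    "AgeGroup": ["18-25", "18-25", "25-35", "25-35", "25-35"],
    "Costs": [100.0, 250.0, 300.0, 50.0, 75.0],
    "Booking": ["B1", "B1", "B2", "B3", "B3"],
})

# One row per age group: summed costs and number of distinct bookings.
summary = (
    df.groupby("AgeGroup")
    .agg(total_costs=("Costs", "sum"), trips=("Booking", "nunique"))
    .reset_index()
)
print(summary)
```

A frame like this has explicit columns, so it can be passed to a plotting call with x='AgeGroup' and y='total_costs' (bars) or y='trips' (line) instead of handing over a bare Series.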
