Can you please help me with the following? I have a dataset with a variable for the number of products (Prod) that takes discrete values from 1 to 3 (inclusive), and a variable (Gender) that is 1 for males and 0 for females. I want to plot a grouped bar chart with the number of products (Prod) on the x-axis and, on the y-axis, the number of observations for each Prod value, grouped by Gender. To do this I need a 'count' variable that counts how many observations of each 'Prod' fall in each 'Gender' category. To group and plot the variables I use the following code (which does not work):
#Group the variables
grouped_gender['count'] = main_data.groupby(['Prod', 'Gender'])[['Prod']].count()
grouped_gender = pd.DataFrame(grouped_gender)
#Plot
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 7))
barplot2 = sns.barplot(
data=grouped_gender,
x='Prod',
y='count',
hue='Gender',
orient='v',
ax = axes,
ci=None,
dodge=False
)
Can you please help me to identify the problem?
Assuming you can put your DataFrame in a similar state as mine
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
grouped_gender = pd.DataFrame(
{
"Man": [50, 70],
"Woman": [90, 30]
},
index=["Product1", "Product2"]
)
grouped_gender.plot(kind="bar", stacked=True)
plt.title("Products sales")
plt.xlabel("Products")
plt.ylabel("Sales")
plt.show()
This produces the following result
Use countplot on the original dataset:
# sample dataset
df = sns.load_dataset('tips')
# `day` plays `Prod`, `sex` plays `Gender`
sns.countplot(x='day', hue='sex', data=df)
Output:
Note: if you want the data, not just the plot, use:
counts = pd.crosstab(df['day'], df['sex'])
# then to plot bar chart
# counts.plot.bar()
which gives you:
sex   Male  Female
day
Thur    30      32
Fri     10       9
Sat     59      28
Sun     58      18
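If you would rather stay close to the groupby approach from the question, here is a minimal sketch. As far as I can tell, the original code fails because it assigns into grouped_gender before that name exists, and because the group keys end up in the index rather than in columns that seaborn can look up. The small main_data frame below is a hypothetical stand-in for your data, and the Agg backend line is only an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for main_data from the question
main_data = pd.DataFrame({
    "Prod":   [1, 1, 2, 3, 3, 3, 2, 1],
    "Gender": [1, 0, 1, 1, 0, 1, 0, 1],
})

# size() counts the rows in each (Prod, Gender) group;
# reset_index(name="count") turns the group keys back into ordinary columns
grouped_gender = (
    main_data.groupby(["Prod", "Gender"])
    .size()
    .reset_index(name="count")
)

# Long-form data: one row per (Prod, Gender) pair, so x, y and hue are all columns
fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(data=grouped_gender, x="Prod", y="count", hue="Gender", ax=ax)
ax.set_ylabel("count")
```

Because grouped_gender is an ordinary long-form DataFrame, you can also inspect or export the counts before plotting.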
I want to create a multiple bar chart that shows the distribution of income according to the 'Marital_Status'
     ID  Year_of_Birth Highest_Qualification Marital_Status   Income  \
0  5524           1957            Graduation         Single  58138.0
1  2174           1954            Graduation         Single  46344.0
2  4141           1965            Graduation   Relationship  71613.0
3  6182           1984            Graduation   Relationship  26646.0
4  5324           1981                   PhD   Relationship  58293.0
...
I tried it with seaborn
sns.displot(df['Income'], bins = 20, x=bin, hue = "Marital_Status", kde = False, multiple="stack")
plt.show
But it didn't work.
Could not interpret value `Marital_Status` for parameter `hue`
This was the most promising idea I had, but I don't get it to work...
The problem is probably that df['Income'] is passed as the data argument of sns.displot. That is a single column, so seaborn cannot find a Marital_Status column to use for hue. Seaborn needs the whole DataFrame in order to plot the distribution of Income according to Marital_Status. Please check out the seaborn documentation for more details.
Your question can be solved using seaborn.histplot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
input_data = {"ID": [5524, 2174, 4141, 6182, 5324],
"Year_of_Birth":[1957, 1954, 1965, 1984, 1981],
"Highest_Qualification": ['Graduation'] * 4 + ['PhD'],
"Marital_Status":["Single"]*2 + ["Relationship"] * 3,
"Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0]}
df = pd.DataFrame(data=input_data)
sns.histplot(data=df, bins = 20, x='Income', hue = "Marital_Status", multiple="stack")
plt.show()
It can also be solved using seaborn.displot (call displot instead of histplot in the code above):
sns.displot(df, bins = 20, x='Income', hue = "Marital_Status", multiple="stack")
This gives:
I'm working with a small dataframe from this Codecademy dataset (insurance.csv).
I'm trying to print data to make a small analysis with the following code:
data = pd.read_csv('insurance.csv')
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
data.groupby('sex').region.hist()
The code returns a pandas series where the first element is subplot1 and the second subplot2.
The code plots them on the same figure, and I'm unable to plot them separately.
To produce a histogram for each column based on gender:
'children' and 'smoker' look different because the number is discrete with only 6 and 2 unique values, respectively.
data.groupby('sex').hist(layout=(1, 4), figsize=(12, 4), ec='k', grid=False) alone will produce the graph, but without an easy way to add a title.
Producing the correct visualization often involves reshaping the data for the plotting API.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.2, seaborn 0.11.2
import pandas as pd
# load data
data = pd.read_csv('insurance.csv')
# convert smoker from a string to int value; hist doesn't work on object type columns
data.smoker = data.smoker.map({'no': 0, 'yes': 1})
# group each column by sex; data.groupby(['sex', 'region']) is also an option
for gender, df in data.groupby('sex'):
    # plot a hist for each column
    axes = df.hist(layout=(1, 5), figsize=(15, 4), ec='k', grid=False)
    # extract the figure object from the array of axes
    fig = axes[0][0].get_figure()
    # add the gender as the title
    fig.suptitle(gender)
In regards to data.groupby('sex').region.hist() in the OP, this is a count plot, which shows counts of gender for each region; it is not a histogram.
pandas.crosstab by default computes a frequency table of the factors
ax = pd.crosstab(data.region, data.sex).plot(kind='bar', rot=0)
ax.legend(title='gender', bbox_to_anchor=(1, 1.02), loc='upper left')
Use seaborn.displot
This requires converting the data from a wide to long format, which is done with pandas.DataFrame.melt
import pandas as pd
import seaborn as sns
data = pd.read_csv('insurance.csv')
data.smoker = data.smoker.map({'no': 0, 'yes': 1})
# convert the dataframe from a wide to long form
df = data.melt(id_vars=['sex', 'region'])
# plot
p = sns.displot(data=df, kind='hist', x='value', col='variable', row='region', hue='sex',
multiple='dodge', common_bins=False, facet_kws={'sharey': False, 'sharex': False})
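To make the reshaping step concrete, here is a small sketch of what melt produces. It uses only the five sample rows shown in the question instead of the full insurance.csv, so the frame is a stand-in; the name long_df is my own.

```python
import pandas as pd

# The five sample rows shown in the question (stand-in for insurance.csv)
data = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})
data["smoker"] = data["smoker"].map({"no": 0, "yes": 1})

# Wide to long: sex and region stay as identifier columns, every other
# column is stacked into a (variable, value) pair
long_df = data.melt(id_vars=["sex", "region"])

print(long_df.head())
print(long_df["variable"].unique())
```

Each original row contributes one long-form row per measured column, which is what lets displot facet on col='variable'.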
I am trying to create a distplot/kdeplot with hue as the target for multiple columns in my dataset. I am doing this by using kdeplot with a FacetGrid and using the ax parameter to make it plot on my plt.subplots figure. However, I am unable to change the title of the individual plots. I have tried g.fig.suptitle and ax[0,0].set_title, but neither is working.
df = sns.load_dataset('titanic')
fig, ax = plt.subplots(2,2,figsize=(10,5))
g = sns.FacetGrid(df, hue="survived")
g = g.map(sns.kdeplot, "age",ax=ax[0,0])
g.fig.suptitle('age')
h = g.map(sns.kdeplot, "fare",ax=ax[0,1])
h.fig.suptitle('fare')
ax[0,0].set_title ='age'
ax[1,0].set_xlabel='fare'
I am getting this image. It seems to be changing a plot at the bottom which I don't want
By the way, I don't have to use Seaborn, matplotlib is also fine
If you want two density plots for age and fare, you can try to pivot it long and then apply the facetGrid:
import seaborn as sns
df = sns.load_dataset('titanic')
So the long format looks like this, it allows you to split on the variable column:
df[['survived','age','fare']].melt(id_vars='survived')
   survived variable  value
0         0      age  22.00
1         1      age  38.00
2         1      age  26.00
3         1      age  35.00
4         0      age  35.00
And we plot:
g = sns.FacetGrid(df[['survived','age','fare']].melt(id_vars='survived'),
col="variable",sharex=False,sharey=False,hue="survived")
g.map(sns.kdeplot,"value")
Hi all, I have the following groups of data:
sumcosts = df.groupby('AgeGroup').Costs.sum()
print(sumcosts)
AgeGroup
18-25 536295.37
25-35 1784085.88
35-45 2395250.62
45-55 5483060.33
55-65 11652094.30
65-75 9633490.63
75+ 5186867.32
Name: Costs, dtype: float64
countoftrips = df.groupby('AgeGroup').Booking.nunique()
print(countoftrips)
AgeGroup
18-25 139
25-35 398
35-45 379
45-55 738
55-65 1417
65-75 995
75+ 545
Name: Booking, dtype: int64
When trying to plot these, I used the following:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
sns.set()
fig, ax1 = plt.subplots()
sns.barplot(data=sumcosts, palette="rocket", ax=ax1)
ax2 = ax1.twinx()
sns.lineplot(data=countoftrips, palette="rocket", ax=ax2)
plt.show()
The output is this:
The line looks correct, but the bar chart has obviously stopped at the first age bracket. Any ideas on how to correct this? I tried to define x='Agegroup' and y='Costs' but then got errors, and this is the most progress I have made. Thanks very much!
Your barplot appears to be showing the sum of all costs, not just those of the 18-25 age group. This bar appears under the x-axis label for the 18-25 group only because of the positioning of the axis for the line plot, which makes it confusing.
I created a dummy dataset of 1000 rows in a .csv file to graph this example. My values are different, so the plots will look different, but everything else will work the same for you.
Jupyter Notebook Setup:
(images added to reflect outputs)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# Read in dataset 'df', showing the header
df = pd.read_csv('./data-raw.csv')
df.head()
Assuming you have no NaN values in your data ... otherwise you can use dropna() to remove them.
# Check if there are any NaN values in the df dataframe
print('Number of NaN values in the columns of our DataFrame:\n', df.isnull().sum())
# Remove any rows that contain NaN values using dropna (as applicable)
df.dropna(axis=0, inplace=True)
Your sumcosts and countoftrips are not required for creating your plots, and I believe they are the cause of your plotting error for the bar graph. I've included them here, but I am not using them when creating the plot.
Plot Type:
It is also important to keep in mind that a bar plot shows only the mean (or another estimator, e.g. the standard deviation) of the values. In many cases it may be more informative to show the distribution of values at each level of the categorical variable, and other approaches such as a box or violin plot may then be more appropriate.
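As a rough illustration of that last point, here is a sketch of a box plot per age group. The df_demo frame is randomly generated dummy data (an assumption, not the data from the question), and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sb

order = ['18-25', '25-35', '35-45', '45-55', '55-65', '65-75', '75+']

# Hypothetical dummy data: one row per trip with an age group and a cost
rng = np.random.default_rng(1)
df_demo = pd.DataFrame({
    "AgeGroup": rng.choice(order, size=300),
    "Costs": rng.gamma(2.0, 1500.0, size=300),
})

# One box per age group shows the spread of Costs, not just a single summary value
fig, ax = plt.subplots(figsize=(10, 4))
sb.boxplot(data=df_demo, x="AgeGroup", y="Costs", order=order, ax=ax)
ax.set_xlabel("Age Group Ranges")
```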
Solution:
This is assuming you want to have the line and bar plot layered over each other, as in your example:
# This plot has both graphs on the axis you outlined in your code,
# I used the ci = None parameter to remove the confidence intervals to
# make the combined plot easier to read (optional)
fig, ax1 = plt.subplots()
sb.barplot(data = df, x = 'AgeGroup', y = 'Costs', ci = None,
ax = ax1, palette = 'rocket', order = ['18-25',
'25-35','35-45','45-55','55-65', '65-75', '75+']);
ax2 = ax1.twinx()
sb.lineplot(data = df, x = 'AgeGroup', y = 'Booking', ax = ax2, ci = None);
plt.xlabel('Age Group Ranges');
plt.show()
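One caveat: seaborn's barplot and lineplot aggregate with the mean by default, so the layered plot above shows mean Costs and mean Booking per age group rather than the sums and unique counts from the question. If you want to plot the already aggregated sumcosts and countoftrips directly, here is a minimal sketch with plain matplotlib. The values are copied from the printed output in the question; the colors, axis labels and the Agg backend line are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

age_groups = ['18-25', '25-35', '35-45', '45-55', '55-65', '65-75', '75+']

# Aggregated values copied from the question's printed output
sumcosts = pd.Series(
    [536295.37, 1784085.88, 2395250.62, 5483060.33,
     11652094.30, 9633490.63, 5186867.32],
    index=age_groups, name="Costs",
)
countoftrips = pd.Series(
    [139, 398, 379, 738, 1417, 995, 545],
    index=age_groups, name="Booking",
)

fig, ax1 = plt.subplots()
# One bar per age group, using the index as the x positions
ax1.bar(sumcosts.index, sumcosts.values, color="tab:purple")
ax1.set_xlabel("Age Group Ranges")
ax1.set_ylabel("Total costs")

# Second y-axis sharing the same x-axis for the trip counts
ax2 = ax1.twinx()
ax2.plot(countoftrips.index, countoftrips.values, color="tab:orange", marker="o")
ax2.set_ylabel("Unique bookings")
```

Passing the index explicitly as x is what gives one bar per age group here.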
Here is an alternative you could try, also using subplot, but separating the two plots.
# Adjusting the plot size just to make it easier to read here:
plt.figure(figsize = [14, 4])
#Bar Chart on Left
plt.subplot(1, 2, 1) # 1 row, 2 cols, subplot 1
sb.barplot(data = df, x = 'AgeGroup', y = 'Costs', palette = 'rocket',
ci = 'sd', order = ['18-25', '25-35', '35-45',
'45-55','55-65', '65-75', '75+']);
plt.xlabel('Age Group Ranges')
plt.ylabel('Costs')
# Line Chart on Right
plt.subplot(1, 2, 2) # 1 row, 2 cols, subplot 2
sb.lineplot(data = df, x = 'AgeGroup', y = 'Booking', ci = None)
plt.xlabel('Age Group Ranges')
plt.ylabel('Bookings');
Hope you find this helpful!
I have a dataframe for example the following, where Material, A and B are all column headings:
   Material         A         B
0      Iron  20.30000  5.040409
1  Antimony   0.09200  0.019933
2  Chromium   1.70000  0.237762
3    Copper   8.10000  2.522951
I want a 2x2 grid of subplots, each a bar graph based on one of the 4 rows. The title of each subplot would be the material. Each subplot would have two bars, one for the value of A and one for B, and each bar would have a colour associated with A or B. Finally, it would also be nice to have a legend showing what each colour represents, i.e. A and B.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import matplotlib.style as style
sns.set_style("darkgrid")
fig, ax = plt.subplots(2,2)
#Enter for loop
I think a for loop would be the best way to do it, but simply cannot figure out the for loop. Thanks.
fig, axs = plt.subplots(2,2, constrained_layout=True)
for ax,(idx,row) in zip(axs.flat, df.iterrows()):
    row[['A','B']].plot.bar(ax=ax, color=['C0','C1'])
    ax.set_title(row['Material'])
proxy = ax.bar([0,0],[0,0], color=['C0','C1'])
fig.legend(proxy,['A','B'], bbox_to_anchor=(1,1), loc='upper right')
Note that the same result can be achieved using pandas only, but first you need to reshape your data
df2 = df.set_index('Material').T
>>
Material       Iron  Antimony  Chromium    Copper
A         20.300000  0.092000  1.700000  8.100000
B          5.040409  0.019933  0.237762  2.522951
df2.plot(kind='bar', subplots=True, layout=(2,2), legend=False, color=[['C0','C1']])
You can do it this way:
df = df.set_index('Material')
fig = plt.figure(figsize=(10,8))
for i, (name, row) in enumerate(df.iterrows()):
    ax = plt.subplot(2,2, i+1)
    ax.set_title(row.name)
    ax.get_xaxis().set_visible(False)
    df.iloc[i].plot.bar(color=['C0', 'C1'])
fig.legend(ax.bar([0,0],[0,0], color=['C0','C1']),['A','B'], loc=5)
plt.show()