I have a dataframe which looks like below:
df:
RY MAJ_CAT Value
2016 Cause Unknown 0.00227
2016 Vegetation 0.04217
2016 Vegetation 0.04393
2016 Vegetation 0.07878
2016 Defective Equip 0.00137
2018 Cause Unknown 0.00484
2018 Defective Equip 0.01546
2020 Defective Equip 0.05169
2020 Defective Equip 0.00515
2020 Cause Unknown 0.00050
I want to plot the distribution of Value for each of the given years, so I used seaborn's distplot with the following code:
year_2016 = df[df['RY']==2016]
year_2018 = df[df['RY']==2018]
year_2020 = df[df['RY']==2020]
sns.distplot(year_2016['value'].values, hist=False,rug=True)
sns.distplot(year_2018['value'].values, hist=False,rug=True)
sns.distplot(year_2020['value'].values, hist=False,rug=True)
Next, I want to plot the same per-year distribution of Value for each MAJ_CAT, so I decided to use seaborn's FacetGrid. Below is the code:
g = sns.FacetGrid(df,col='MAJ_CAT')
g = g.map(sns.distplot,df[df['RY']==2016]['value'].values, hist=False,rug=True))
g = g.map(sns.distplot,df[df['RY']==2018]['value'].values, hist=False,rug=True))
g = g.map(sns.distplot,df[df['RY']==2020]['value'].values, hist=False,rug=True))
However, when I run the above code, it throws the following error:
KeyError: "None of [Index([(0.00227, 0.04217, 0.043930000000000004, 0.07877999999999999, 0.00137, 0.0018800000000000002, 0.00202, 0.00627, 0.00101, 0.07167000000000001, 0.01965, 0.02775, 0.00298, 0.00337, 0.00088, 0.04049, 0.01957, 0.01012, 0.12065, 0.23699, 0.03639, 0.00137, 0.03244, 0.00441, 0.06748, 0.00035, 0.0066099999999999996, 0.00302, 0.015619999999999998, 0.01571, 0.0018399999999999998, 0.03425, 0.08046, 0.01695, 0.02416, 0.08975, 0.0018800000000000002, 0.14743, 0.06366000000000001, 0.04378, 0.043, 0.02997, 0.0001, 0.22799, 0.00611, 0.13960999999999998, 0.38871, 0.018430000000000002, 0.053239999999999996, 0.06702999999999999, 0.14103, 0.022719999999999997, 0.011890000000000001, 0.00186, 0.00049, 0.13947, 0.0067, 0.00503, 0.00242, 0.00137, 0.00266, 0.38638, 0.24068, 0.0165, 0.54847, 1.02545, 0.01889, 0.32750999999999997, 0.22526, 0.24516, 0.12791, 0.00063, 0.0005200000000000001, 0.00921, 0.07665, 0.00116, 0.01042, 0.27046, 0.03501, 0.03159, 0.46748999999999996, 0.022090000000000002, 2.2972799999999998, 0.69021, 0.22529000000000002, 0.00147, 0.1102, 0.03234, 0.05799, 0.11744, 0.00896, 0.09556, 0.03202, 0.01347, 0.00923, 0.0034200000000000003, 0.041530000000000004, 0.04848, 0.00062, 0.0031100000000000004, ...)], dtype='object')] are in the [columns]"
I am not sure where I am making the mistake. Could anyone please help me fix the issue?
Set up the dataframe
import pandas as pd
import numpy as np
import seaborn as sns
# setup dataframe of synthetic data
np.random.seed(365)
data = {'RY': np.random.choice([2016, 2018, 2020], size=400),
'MAJ_CAT': np.random.choice(['Cause Unknown', 'Vegetation', 'Defective Equip'], size=400),
'Value': np.random.random(size=400) }
df = pd.DataFrame(data)
Updated Answer
From seaborn v0.11
Use sns.displot with kind='kde' and rug=True
sns.displot is a figure-level interface for drawing distribution plots onto a FacetGrid.
Plotting all 'MAJ_CAT' together
sns.displot(data=df, x='Value', hue='RY', kind='kde', palette='tab10', rug=True)
Plotting 'MAJ_CAT' separately
sns.displot(data=df, col='MAJ_CAT', x='Value', hue='RY', kind='kde', palette='tab10', rug=True)
Original Answer
In seaborn v0.11, distplot is deprecated
distplot
Consolidate the original code to generate the distplot
for year in df.RY.unique():
values = df.Value[df.RY == year]
sns.distplot(values, hist=False, rug=True)
facetgrid
Pass the column name as a string to map, and add hue to FacetGrid
g = sns.FacetGrid(df, col='MAJ_CAT', hue='RY')
p1 = g.map(sns.distplot, 'Value', hist=False, rug=True).add_legend()
I want to create a multiple bar chart that shows the distribution of income according to the 'Marital_Status'
ID Year_of_Birth Highest_Qualification Marital_Status Income \
0 5524 1957 Graduation Single 58138.0
1 2174 1954 Graduation Single 46344.0
2 4141 1965 Graduation Relationship 71613.0
3 6182 1984 Graduation Relationship 26646.0
4 5324 1981 PhD Relationship 58293.0
...
I tried it with seaborn
sns.displot(df['Income'], bins = 20, x=bin, hue = "Marital_Status", kde = False, multiple="stack")
plt.show
But it didn't work.
Could not interpret value `Marital_Status` for parameter `hue`
This was the most promising idea I had, but I don't get it to work...
The problem is probably that df['Income'] is passed as the data argument of sns.displot. Seaborn needs access to the other columns to plot the distribution of Income according to Marital_Status, so pass the whole dataframe instead. Please check the seaborn documentation for more details.
Your question can be solved using seaborn.histplot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
input_data = {"ID": [5524, 2174, 4141, 6182, 5324],
"Year_of_Birth":[1957, 1954, 1965, 1984, 1981],
"Highest_Qualification": ['Graduation'] * 4 + ['PhD'],
"Marital_Status":["Single"]*2 + ["Relationship"] * 3,
"Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0]}
df = pd.DataFrame(data=input_data)
sns.histplot(data=df, bins = 20, x='Income', hue = "Marital_Status", multiple="stack")
plt.show()
It can be also solved using seaborn.displot (call displot instead of histplot in the above code) :
sns.displot(df, bins = 20, x='Income', hue = "Marital_Status", multiple="stack")
This gives a stacked histogram of Income, colored by Marital_Status.
I'm working with a small dataframe built from this Codecademy data:
I'm trying to print data to make a small analysis with the following code:
data = pd.read_csv('insurance.csv')
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
data.groupby('sex').region.hist()
The code returns a pandas series where the first element is subplot1 and the second subplot2.
The code plots them on the same figure, and I'm unable to plot them separately.
To produce a histogram for each column based on gender:
'children' and 'smoker' look different because the number is discrete with only 6 and 2 unique values, respectively.
data.groupby('sex').hist(layout=(1, 4), figsize=(12, 4), ec='k', grid=False) alone will produce the graph, but without an easy way to add a title.
Producing the correct visualization often involves reshaping the data for the plotting API.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.2, seaborn 0.11.2
import pandas as pd
# load data
data = pd.read_csv('insurance.csv')
# convert smoker from a string to int value; hist doesn't work on object type columns
data.smoker = data.smoker.map({'no': 0, 'yes': 1})
# group each column by sex; data.groupby(['sex', 'region']) is also an option
for gender, df in data.groupby('sex'):
    # plot a hist for each column
    axes = df.hist(layout=(1, 5), figsize=(15, 4), ec='k', grid=False)
    # extract the figure object from the array of axes
    fig = axes[0][0].get_figure()
    # add the gender as the title
    fig.suptitle(gender)
In regards to data.groupby('sex').region.hist() in the OP, this is a count plot, which shows counts of gender for each region; it is not a histogram.
pandas.crosstab by default computes a frequency table of the factors
ax = pd.crosstab(data.region, data.sex).plot(kind='bar', rot=0)
ax.legend(title='gender', bbox_to_anchor=(1, 1.02), loc='upper left')
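To see what the bar chart above is built from, here is a minimal sketch of the frequency table that pandas.crosstab produces. It uses only the five sample rows shown in the question (typed in by hand as an assumption, instead of reading insurance.csv), and compares the result with an equivalent groupby:

```python
import pandas as pd

# the five sample rows from the question, reduced to the two relevant columns
data = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
})

# rows are regions, columns are sexes, cells are counts of matching rows
counts = pd.crosstab(data.region, data.sex)
print(counts)

# the same table via groupby, for comparison
same = data.groupby(['region', 'sex']).size().unstack(fill_value=0)
```

Each cell is a count of rows, which is why the resulting bar chart is a count plot and not a histogram of a numeric variable.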
Use seaborn.displot
This requires converting the data from a wide to long format, which is done with pandas.DataFrame.melt
import pandas as pd
import seaborn as sns
data = pd.read_csv('insurance.csv')
data.smoker = data.smoker.map({'no': 0, 'yes': 1})
# convert the dataframe from a wide to long form
df = data.melt(id_vars=['sex', 'region'])
# plot
p = sns.displot(data=df, kind='hist', x='value', col='variable', row='region', hue='sex',
multiple='dodge', common_bins=False, facet_kws={'sharey': False, 'sharex': False})
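The wide-to-long reshaping is easier to follow on a tiny frame. A sketch with three hand-typed rows and only two value columns (an assumption made for brevity; the real data has more columns):

```python
import pandas as pd

wide = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0],
    'region': ['southwest', 'southeast', 'southeast'],
})

# id_vars are repeated for every row; every other column is stacked into
# a 'variable' column (the old column name) and a 'value' column
long = wide.melt(id_vars=['sex', 'region'])
print(long)
```

After melting, 'variable' can drive the col facets and 'value' becomes the single x variable, which is the shape displot expects in the call above.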
I am not sure where to start, I am trying to get a graph that looks like shown below:
The y_axis represents the sum of Quantity sold per Year and per Region as shown on the graph
This dataset can be used:
dataset = {'Year': [2019,2020,2020,2019,2019,2020,2017,2017,2018,2020,2018,2016],
'Quantity': [100,50,25,30,40,50,200,600,20,40,100,20],
'Regions': ['Europe','Asia','Africa','Africa','Other','Asia','Africa','Other','America','America','Europe','Europe']}
df = pd.DataFrame(data=dataset)
Also I am not sure if it's better to use matplotlib or seaborn in that case.
What I tried, but it doesn't seem to be working:
dfc_vol = dfc[['Year','Quantity','Regions']]
plt.rcParams['figure.figsize']=(10,6)
dfc_vol.plot(kind='bar', stacked=True)
plt.legend(bbox_to_anchor=(1.1,0.5))
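One option is to reshape the data with pivot_table first. Note that pivot_table averages by default, so aggfunc='sum' is needed to get the total Quantity per Year and Region:
df_pt = df.pivot_table(index='Year', columns='Regions', values='Quantity', aggfunc='sum')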
df_pt.plot.bar(stacked=True, figsize=(10, 6))
With seaborn >= 0.11.0:
sns.histplot(
data=df,
x="Year", hue="Regions", weights="Quantity",
multiple="stack", discrete=True, shrink=.9
)
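When reshaping with pivot_table it is worth checking which aggregation is applied, because the default is the mean and the question asks for sums. A short sketch on the dataset from the question; the names mean_pt and sum_pt are illustrative:

```python
import pandas as pd

dataset = {'Year': [2019, 2020, 2020, 2019, 2019, 2020, 2017, 2017, 2018, 2020, 2018, 2016],
           'Quantity': [100, 50, 25, 30, 40, 50, 200, 600, 20, 40, 100, 20],
           'Regions': ['Europe', 'Asia', 'Africa', 'Africa', 'Other', 'Asia', 'Africa',
                       'Other', 'America', 'America', 'Europe', 'Europe']}
df = pd.DataFrame(data=dataset)

# default aggregation: mean of Quantity per (Year, Region)
mean_pt = df.pivot_table(index='Year', columns='Regions', values='Quantity')

# explicit sum: total Quantity per (Year, Region)
sum_pt = df.pivot_table(index='Year', columns='Regions', values='Quantity', aggfunc='sum')

# Asia has two rows of 50 in 2020, so the two tables differ in that cell
print(mean_pt.loc[2020, 'Asia'], sum_pt.loc[2020, 'Asia'])
```

The two tables only differ where a Year and Region pair occurs more than once, so the discrepancy is easy to miss on small samples.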
For a simple time series:
import pandas as pd
df = pd.DataFrame({'dt':['2020-01-01', '2020-01-02', '2020-01-04', '2020-01-05', '2020-01-06'], 'foo':[1,2, 4,5,6]})
df['dt'] = pd.to_datetime(df.dt)
df['dt_label']= df['dt'].dt.strftime('%Y-%m-%d %a')
df = df.set_index('dt')
#display(df)
df['foo'].plot()
x =plt.xticks(ticks=df.reset_index().dt.values, labels=df.dt_label, rotation=90, horizontalalignment='right')
How can I highlight the x-axis labels for weekends?
edit
Pandas Plots: Separate color for weekends, pretty printing times on x axis
suggests:
def highlight_weekends(ax, timeseries):
    d = timeseries.dt
    ranges = timeseries[d.dayofweek >= 5].groupby(d.year * 100 + d.weekofyear).agg(['min', 'max'])
    for i, tmin, tmax in ranges.itertuples():
        ax.axvspan(tmin, tmax, facecolor='orange', edgecolor='none', alpha=0.1)
but applying it with
highlight_weekends(ax, df.reset_index().dt)
will not change the plot
I've extended your sample data a little so we can make sure that more than a single weekend is highlighted.
In this solution I create a column 'weekend', which is a column of bools indicating whether the corresponding date falls on a weekend.
We then loop over these values and make a call to ax.axvspan for each weekend.
import pandas as pd
import matplotlib.pyplot as plt
# Add a couple of extra dates to sample data
df = pd.DataFrame({'dt': ['2020-01-01',
'2020-01-02',
'2020-01-04',
'2020-01-05',
'2020-01-06',
'2020-01-07',
'2020-01-09',
'2020-01-10',
'2020-01-11',
'2020-01-12']})
# Fill in corresponding observations
df['foo'] = range(df.shape[0])
df['dt'] = pd.to_datetime(df.dt)
df['dt_label']= df['dt'].dt.strftime('%Y-%m-%d %a')
df = df.set_index('dt')
ax = df['foo'].plot()
plt.xticks(ticks=df.reset_index().dt.values,
labels=df.dt_label,
rotation=90,
horizontalalignment='right')
# Create an extra column which highlights whether or not a date occurs at the weekend
df['weekend'] = df['dt_label'].apply(lambda x: x.endswith(('Sat', 'Sun')))
# Loop over weekend pairs (Saturdays and Sundays), and highlight
for i in range(df['weekend'].sum() // 2):
    ax.axvspan(df[df['weekend']].index[2*i],
               df[df['weekend']].index[2*i+1],
               alpha=0.5)
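The loop above relies on every Saturday in the data being followed by its Sunday. A possible variation, sketched here as an assumption and not as part of the original answer, derives the weekend flag from DatetimeIndex.dayofweek (Monday is 0, so Saturday and Sunday are 5 and 6) and shades each weekend day on its own, which also covers a weekend that is only partly present in the data:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04', '2020-01-05',
                      '2020-01-06', '2020-01-07', '2020-01-09', '2020-01-10',
                      '2020-01-11', '2020-01-12'])
foo = pd.Series(range(len(idx)), index=idx, name='foo')

# boolean mask: True for Saturdays (5) and Sundays (6)
is_weekend = idx.dayofweek >= 5

fig, ax = plt.subplots()
ax.plot(foo.index, foo.values)

# shade one full day per weekend date, so pairing Sat/Sun is not required
one_day = pd.Timedelta(days=1)
for day in idx[is_weekend]:
    ax.axvspan(day, day + one_day, alpha=0.3)
```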
Here is a solution that uses the fill_between plotting function and the x-axis units so that weekends can be highlighted independently from the DatetimeIndex and the frequency of the data.
The x-axis limits are used to compute the range of time covered by the plot in terms of days, which is the unit used for matplotlib dates. Then a weekends mask is computed and passed to the where argument of the fill_between function. The masks are processed as right-exclusive so in this case, they must contain Mondays for the highlights to be drawn up to Mondays 00:00. Because plotting these highlights can alter the x-axis limits when weekends occur near the limits, the x-axis limits are set back to the original values after plotting.
Note that contrary to axvspan, the fill_between function needs the y1 and y2 arguments. For some reason, using the default y-axis limits leaves a small gap between the plot frame and the tops and bottoms of the weekend highlights. This issue is solved by running ax.set_ylim(*ax.get_ylim()) just after creating the plot.
Here is a complete example based on the provided sample code and using an extended dataset similar to the answer provided by jwalton:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import matplotlib.dates as mdates
# Create sample dataset
dt = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04', '2020-01-05',
'2020-01-06', '2020-01-07', '2020-01-09', '2020-01-10',
'2020-01-11', '2020-01-14'])
df = pd.DataFrame(dict(foo=range(len(dt))), index=dt)
# Draw pandas plot: setting x_compat=True converts the pandas x-axis units to
# matplotlib date units. This is not necessary for this particular example but
# it is necessary for all cases where the dataframe contains a continuous
# DatetimeIndex (for example ones created with pd.date_range) that uses a
# frequency other than daily
ax = df['foo'].plot(x_compat=True, figsize=(6,4), ylabel='foo')
ax.set_ylim(*ax.get_ylim()) # reset y limits to display highlights without gaps
# Highlight weekends based on the x-axis units
xmin, xmax = ax.get_xlim()
days = np.arange(np.floor(xmin), np.ceil(xmax)+2) # range of days in date units
weekends = [(dt.weekday()>=5)|(dt.weekday()==0) for dt in mdates.num2date(days)]
ax.fill_between(days, *ax.get_ylim(), where=weekends, facecolor='k', alpha=.1)
ax.set_xlim(xmin, xmax) # set limits back to default values
# Create and format x tick for each data point
plt.xticks(df.index.values, df.index.strftime('%d\n%a'), rotation=0, ha='center')
plt.title('Weekends are highlighted from SAT 00:00 to MON 00:00', pad=15, size=12);
You can find more examples of this solution in the answers I have posted here and here.
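The right-exclusive behaviour of the where mask is easier to see on a five-day window. A minimal sketch, assuming matplotlib date units (days as floats); the window from Friday 2020-01-03 to Tuesday 2020-01-07 and the y range of 0 to 1 are arbitrary choices for illustration:

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for scripted runs
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

# five consecutive days in matplotlib date units: Fri, Sat, Sun, Mon, Tue
start = mdates.date2num(datetime.datetime(2020, 1, 3))
days = start + np.arange(5)

# Monday (weekday 0) is included so the fill extends up to Monday 00:00
mask = [d.weekday() >= 5 or d.weekday() == 0 for d in mdates.num2date(days)]

fig, ax = plt.subplots()
ax.set_xlim(days[0], days[-1])
# only intervals whose two end points are both True get filled:
# Sat -> Sun and Sun -> Mon, that is Saturday 00:00 to Monday 00:00
ax.fill_between(days, 0, 1, where=mask, facecolor='k', alpha=.1)
```

Leaving Monday out of the mask would stop the shading at Sunday 00:00, which is why the complete example includes it.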
I am trying to plot stacked yearly line graphs by months.
I have a dataframe df_year as below:
Day Number of Bicycle Hires
2010-07-30 6897
2010-07-31 5564
2010-08-01 4303
2010-08-02 6642
2010-08-03 7966
with the index set to the date going from 2010 July to 2017 July
I want to plot a line graph for each year with the xaxis being months from Jan to Dec and only the total sum per month is plotted
I have achieved this by converting the dataframe to a pivot table as below:
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
This creates the pivot table as below which I can plot as show in the attached figure:
Number of Bicycle Hires 2010 2011 2012 2013 2014
1 NaN 403178.0 494325.0 565589.0 493870.0
2 NaN 398292.0 481826.0 516588.0 522940.0
3 NaN 556155.0 818209.0 504611.0 757864.0
4 NaN 673639.0 649473.0 658230.0 805571.0
5 NaN 722072.0 926952.0 749934.0 890709.0
plot showing yearly data with months on xaxis
The only problem is that the months show up as integers and I would like them to be shown as Jan, Feb .... Dec with each line representing one year. And I am unable to add a legend for each year.
I have tried the following code to achieve this:
dims = (15,5)
fig, ax = plt.subplots(figsize=dims)
ax.plot(pt)
months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b '%y")
ax.xaxis.set_major_locator(months) #adding this makes the month ints disapper
ax.xaxis.set_major_formatter(monthsFmt)
handles, labels = ax.get_legend_handles_labels() #legend is nowhere on the plot
ax.legend(handles, labels)
Please can anyone help me out with this, what am I doing incorrectly here?
Thanks!
There is nothing in your legend handles and labels. Furthermore, the DateFormatter is not returning the right values because the values you are formatting are not datetime objects.
You could set the index to the dates, drop the extra column level that the pivot creates (level 0 of the resulting MultiIndex), and then set explicit tick labels for the months together with the positions where they should appear on the x-axis, as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import datetime
# dummy data (Days)
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame(np.random.randint(100, 200, (dates_d.shape[0], 1)), columns=['Data'])
df_year.index = dates_d #set index
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
pt.columns = pt.columns.droplevel() # remove the double header (0) as pivot creates a multiindex.
ax = plt.figure().add_subplot(111)
ax.plot(pt)
ticklabels = [datetime.date(1900, item, 1).strftime('%b') for item in pt.index]
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(ticklabels) #add monthlabels to the xaxis
ax.legend(pt.columns.tolist(), loc='center left', bbox_to_anchor=(1, .5)) #add the column names as legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
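A possible variation on the same idea, sketched under the assumption that the value column is called 'Data' as in the dummy data above: naming the column through values= keeps the pivot's columns single-level, so no droplevel is needed, and DataFrame.plot then labels each line with its year automatically. The constant value of 1 per day is an illustrative choice that makes the monthly sums easy to verify:

```python
import calendar
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for scripted runs
import pandas as pd

# dummy daily data: one unit per day, so each monthly sum equals the days in that month
dates = pd.date_range('2010-01-01', '2011-12-31', freq='D')
df_year = pd.DataFrame({'Data': 1}, index=dates)

# values='Data' yields plain year columns instead of a two-level header
pt = df_year.pivot_table(values='Data', index=df_year.index.month,
                         columns=df_year.index.year, aggfunc='sum')

# one line per year; pandas adds a legend entry per column
ax = pt.plot(figsize=(10, 4))

# replace the integer month ticks with abbreviated month names
month_labels = [calendar.month_abbr[int(m)] for m in pt.index]
ax.set_xticks(list(pt.index))
ax.set_xticklabels(month_labels)
ax.legend(title='Year', loc='center left', bbox_to_anchor=(1, .5))
```

Note that calendar.month_abbr, like strftime('%b'), follows the active locale, so the labels are English only under an English or default locale.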