Getting labels for legend after graphing pivot of dataframe in pandas

I want my plot to show a legend in which each line is labeled with the column it came from. I did not separate the plt.plot() call from the pivot step, but I would like to know whether a legend is still possible. With the code below no legend shows up at all, and if I add label='DC':
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'), label='DC')
every legend entry gets that same string. If I pass df_EPErrorPercentByWeekAndDC['DC'] instead, each legend item shows only one letter of it. Here is the code I have:
print("### Graphing Error Rates by Week and DC EP ###")
# remove percent sign from percent in place
df_EPErrorPercentByWeekAndDC['Error Percent'] = df_EPErrorPercentByWeekAndDC['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation = 90)
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'))
plt.legend()
plt.savefig('EPErrorPercentByWeekAndDC.png', bbox_inches="tight", dpi=500)
plt.close()
I can't share any of the data, but it is in the format of a pivot table whose columns are named after states, and each column is full of percentages. The graph works fine, but the legend isn't there.

Either save the pivoted frame first, then pass its columns to plt.legend as the labels:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation=90)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
plt.plot(pivoted_df)
plt.legend(pivoted_df.columns, title=pivoted_df.columns.name)
plt.tight_layout()
plt.show()
Or use DataFrame.plot directly on the pivoted DataFrame which will handle the legend automatically:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
ax = pivoted_df.plot(rot=90)
plt.tight_layout()
plt.show()
Sample DataFrame and imports used:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'hellofresh delivery week': np.repeat(
pd.date_range('2021-05-01', periods=5, freq='W'), 2),
'DC': ['A', 'B'] * 5,
'Error Percent': pd.Series(np.random.randint(10, 100, 10), dtype=str) + '%'
})
df:
hellofresh delivery week DC Error Percent
0 2021-05-02 A 88%
1 2021-05-02 B 71%
2 2021-05-09 A 26%
3 2021-05-09 B 83%
4 2021-05-16 A 18%
5 2021-05-16 B 72%
6 2021-05-23 A 37%
7 2021-05-23 B 40%
8 2021-05-30 A 90%
9 2021-05-30 B 17%
pivoted_df:
DC A B
hellofresh delivery week
2021-05-02 88.0 71.0
2021-05-09 26.0 83.0
2021-05-16 18.0 72.0
2021-05-23 37.0 40.0
2021-05-30 90.0 17.0
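As a hedged sketch of a third route, each pivoted column can also be plotted in its own ax.plot call so that every line carries its own label. The small frame below is made-up illustration data (not the asker's), and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the pivoted frame: one column per DC, one row per week
pivoted = pd.DataFrame(
    {"A": [88.0, 26.0, 18.0], "B": [71.0, 83.0, 72.0]},
    index=pd.to_datetime(["2021-05-02", "2021-05-09", "2021-05-16"]),
)
pivoted.columns.name = "DC"

fig, ax = plt.subplots()
for dc in pivoted.columns:
    # one call per column, so each line gets the column name as its label
    ax.plot(pivoted.index, pivoted[dc], label=dc)
ax.legend(title=pivoted.columns.name)

# collect the legend texts to confirm one entry per column
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
plt.close(fig)
```

This is more verbose than DataFrame.plot, but it makes explicit where each legend entry comes from.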

Related

How to bar plot each row of a dataframe

The data frame looks like:
import pandas as pd
import numpy as np # used for the nan values
data = {'card_name': ['Diamonds', 'Clovers', 'HorseShoe'], '$20': [1000.0, 10.0, np.nan], '$25': [500.0, np.nan, 1873.0], '$30': [25, 213, 4657], '$40': [np.nan, 2199.0, np.nan], '$50': [1500.0, np.nan, 344.0], '$70': [np.nan, 43.0, 239.0], '$75': [30.0, 2.0, np.nan], '$100': [1.0, np.nan, 748.0]}
df = pd.DataFrame(data)
card_name $20 $25 $30 $40 $50 $70 $75 $100
0 Diamonds 1000 500 25 NaN 1500 NaN 30 1
1 Clovers 10 NaN 213 2199 NaN 43 2 NaN
2 HorseShoe NaN 1873 4657 NaN 344 239 NaN 748
The number under each dollar-amount column is how many prizes of that value there are on the corresponding card_name.
I'm trying to graph each card_name and show how many prizes there are for every column.
I'm using Python and Pandas with Matplotlib/Seaborn

The shape of the required dataframe depends on which plot API is being used to plot.
pandas and seaborn are both dependent upon matplotlib, but require a different shape to get the same result.
pandas
Set 'card_name' as the index, and then transpose the dataframe with .T.
Plot the dataframe directly with pandas.DataFrame.plot and kind='bar'. The index is plotted on the x-axis.
# set the index and transpose
dft = df.set_index('card_name').T
# display(dft)
card_name Diamonds Clovers HorseShoe
$20 1000.0 10.0 NaN
$25 500.0 NaN 1873.0
$30 25.0 213.0 4657.0
$40 NaN 2199.0 NaN
$50 1500.0 NaN 344.0
$70 NaN 43.0 239.0
$75 30.0 2.0 NaN
$100 1.0 NaN 748.0
# plot
dft.plot(kind='bar', rot=0)
seaborn
Convert the dataframe from a wide to long format using pandas.DataFrame.melt
Plot the data with seaborn.barplot, or with seaborn.catplot and kind='bar', then use hue= to specify the column to color by.
# convert the dataframe to long format
dfm = df.melt(id_vars='card_name')
# display(dfm.head())
card_name variable value
0 Diamonds $20 1000.0
1 Clovers $20 10.0
2 HorseShoe $20 NaN
3 Diamonds $25 500.0
4 Clovers $25 NaN
import seaborn as sns
ax = sns.barplot(data=dfm, x='variable', y='value', hue='card_name')
subplots
pandas
add the parameter subplots=True
# using the previously transformed dataframe dft
axes = dft.plot(kind='bar', rot=0, subplots=True, figsize=(6, 10))
seaborn
It's easier to use .catplot to get subplots by specifying the row= and/or col= parameter.
# using the previously transformed dataframe dfm
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', row='card_name', height=3, aspect=1.5)
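To make the difference between the two shapes concrete, here is a minimal sketch that builds both the transposed (wide) frame for pandas and the melted (long) frame for seaborn from a reduced copy of the question's data. Only three of the price columns are kept to keep it short; no plotting is done.

```python
import numpy as np
import pandas as pd

# Reduced copy of the question's data (three price columns only)
data = {
    "card_name": ["Diamonds", "Clovers", "HorseShoe"],
    "$20": [1000.0, 10.0, np.nan],
    "$25": [500.0, np.nan, 1873.0],
    "$30": [25, 213, 4657],
}
df = pd.DataFrame(data)

# Wide shape for DataFrame.plot: prices in the index, one column per card
dft = df.set_index("card_name").T

# Long shape for seaborn: one row per (card, price) pair
dfm = df.melt(id_vars="card_name")

print(dft.shape, dfm.shape)
```

The wide frame has one row per price and one column per card, while the long frame has one row per card and price combination, which is what hue= and row= operate on.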

Seaborn boxplot with grouped data into categories with count column

I ran into a problem when trying to plot my dataset with a seaborn boxplot. The dataset comes from the database already grouped, like this:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
I would like to create a box plot with region on the x-axis and age on the y-axis, weighted by the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that does not take the total column into account. One brute-force option is to repeat each row as many times as its total, but I don't like this solution.

sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Simply repeating values isn't necessarily a bad solution, but in this case the totals are very large. You could create a new dataframe with repeated data, but divide the 'total' column first to keep the number of rows manageable.
The sample data have all different regions, which makes creating a boxplot rather strange. The code below supposes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
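A rough sketch of that "outside seaborn" idea is shown below. It does not use statsmodels; instead it relies on a small hypothetical helper, weighted_quantile, written with plain NumPy, and draws the boxes with matplotlib's Axes.bxp. Whiskers are simply placed at the minimum and maximum age (an assumption made for brevity), and outliers are not computed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def weighted_quantile(values, weights, q):
    """Hypothetical helper: smallest value whose cumulative weight reaches q of the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, q * cum[-1], side="left")
    return values[min(idx, len(values) - 1)]


# Same sample data as in the answer above
df = pd.DataFrame({
    "region": ["STC"] * 5 + ["PHA"] * 5,
    "age": [2.0, 84.0, 55.0, 72.0, 86.0, 62.0, 41.0, 13.0, 18.0, 88.0],
    "total": [11024, 3904, 12944, 5592, 2168, 4600, 26568, 6928, 8296, 2408],
})

# One dict of box statistics per region, in the format Axes.bxp expects
stats = []
for region, grp in df.groupby("region", sort=False):
    q1, med, q3 = (weighted_quantile(grp["age"], grp["total"], q)
                   for q in (0.25, 0.5, 0.75))
    stats.append({"label": region, "q1": q1, "med": med, "q3": q3,
                  "whislo": grp["age"].min(), "whishi": grp["age"].max(),
                  "fliers": []})

fig, ax = plt.subplots()
ax.bxp(stats, showfliers=False)
ax.set_ylabel("age")
plt.close(fig)
```

Because the quantiles are computed from cumulative weights, no rows need to be repeated, so the size of the totals does not matter.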

Multi-year time series chart with shaded range in python

I have these charts that I've created in Excel from dataframes of a structure like such:
so that the chart can be created like this, stacking the 5-Year Range area on top of the Min range (no fill) so that the range area can be shaded. The min/max/range/avg columns all calculate off of 2016-2020.
I know that I can plot lines for multiple years on the same axis by using a date index and applying month labels, but is there a way to replicate the shading of this chart, more specifically if my dataframes are in a simple date index-value format, like so:
Quantity
1/1/2016 6
2/1/2016 4
3/1/2016 1
4/1/2016 10
5/1/2016 7
6/1/2016 10
7/1/2016 10
8/1/2016 2
9/1/2016 1
10/1/2016 2
11/1/2016 3
… …
1/1/2020 4
2/1/2020 8
3/1/2020 3
4/1/2020 5
5/1/2020 8
6/1/2020 6
7/1/2020 6
8/1/2020 7
9/1/2020 8
10/1/2020 5
11/1/2020 4
12/1/2020 3
1/1/2021 9
2/1/2021 7
3/1/2021 7
I haven't been able to find anything similar in the plot libraries.

Two step process
restructure DF so that years are columns, rows indexed by uniform date time
plot using matplotlib
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# straight date as index, quantity as column
d = pd.date_range("1-Jan-2016", "1-Mar-2021", freq="MS")
df = pd.DataFrame({"Quantity":np.random.randint(1, 10, len(d))}, index=d)
# re-structure as multi-index, make year column
# add calculated columns
dfg = (df.set_index(pd.MultiIndex.from_arrays([df.index.map(lambda d: dt.date(dt.date.today().year, d.month, d.day)),
df.index.year], names=["month","year"]))
.unstack("year")
.droplevel(0, axis=1)
.assign(min=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].min(axis=1),
max=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].max(axis=1),
avg=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].mean(axis=1).round(1),
)
)
fig, ax = plt.subplots(1, figsize=[14,4])
# now plot all the parts
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5y range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="r")
ax.plot(dfg.index, dfg[2021], label="2021", c="g")
ax.plot(dfg.index, dfg.avg, label="5 yr avg", c="y", ls=(0,(1,2)), lw=3)
# adjust axis
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.legend(loc = 'best')
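The restructuring step can also be written with a plain pivot, which some readers may find easier to follow than the MultiIndex version. The sketch below is an alternative under stated assumptions: the data are random (seeded) stand-ins, the month and year helper columns are introduced only for illustration, and the band is computed over years that have all 12 months.

```python
import numpy as np
import pandas as pd

# Random stand-in data: one value per month from Jan 2016 to Mar 2021
idx = pd.date_range("2016-01-01", "2021-03-01", freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Quantity": rng.integers(1, 10, len(idx))}, index=idx)

# Helper columns (illustrative names), then months as rows and years as columns
tmp = df.assign(month=df.index.month, year=df.index.year)
by_year = tmp.pivot(index="month", columns="year", values="Quantity")

# Only years with all 12 months contribute to the shaded band
complete = [year for year in by_year.columns if by_year[year].count() == 12]
lo = by_year[complete].min(axis=1)
hi = by_year[complete].max(axis=1)
avg = by_year[complete].mean(axis=1).round(1)
```

With these pieces, lo and hi can be passed to ax.fill_between in the same way as dfg["min"] and dfg["max"] above, with the month number on the x-axis instead of a date.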

I want to plot multiple variables from a dataframe using matplotlib but the final plot looks so weird

I have a dataframe containing stocks of multiple companies and a date column. I want to plot these stocks values on the y axis and date on the x axis in the same plot.
Each stock starts from a different value (for example amazon starts from $3103 whereas apple starts from $112)
When I do that my plot looks like this
Here's my plot code:
import matplotlib.pyplot as plt
%matplotlib inline
fig2 = plt.figure()
ax2 = fig2.add_subplot(111)
ax2.plot(plot_stocks['Amazon High'])
ax2.plot(plot_stocks['Apple High'])
ax2.plot(plot_stocks['Facebook High'])
ax2.plot(plot_stocks['Microsoft High'])
Here's a sample of the data:
Date Amazon High Apple High Facebook High Microsoft High
0 2020-12-04 $3198.21 $122.8608 $283.46 $215.38
1 2020-12-03 $3228.64 $123.78 $286.65 $216.3757
2 2020-12-02 $3232 $123.37 $291.78 $215.47
3 2020-12-01 $3248.95 $123.4693 $289.3 $217.32
4 2020-11-30 $3228.39 $120.97 $277.7 $214.76
5 2020-11-27 $3216.19 $117.49 $279.13 $216.27
6 2020-11-25 $3198 $116.75 $280.18 $215.29
7 2020-11-24 $3134.25 $115.85 $277.8199 $214.25
8 2020-11-23 $3139.745 $117.6202 $270.9471 $212.29
9 2020-11-20 $3132.89 $118.77 $273 $213.285
One thing I forgot to mention is that the High columns are strings, and I couldn't convert them even after I did this:
plot_stocks['Amazon High'] = plot_stocks['Amazon High'].replace(r'$', '')

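Your lines overlap strangely because the prices are strings, so they first need to be converted to numbers. After that, you can reshape the column labels into a multi-index with pd.MultiIndex.from_tuples().
With numeric data, DataFrame.plot gives you the output you are looking for: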
Pandas Setup:
plot_stocks['Date'] = pd.to_datetime(plot_stocks['Date'])
plot_stocks = plot_stocks.set_index('Date')
# strip the dollar signs, then convert every price column to float
plot_stocks = plot_stocks.replace(r'\$', '', regex=True).astype(float)
plot_stocks.columns = pd.MultiIndex.from_tuples([(col.split()[1], col.split()[0])
for col in plot_stocks.columns])
plot_stocks
Out[1]:
High
Amazon Apple Facebook Microsoft
Date
2020-12-04 3198.210 122.8608 283.4600 215.3800
2020-12-03 3228.640 123.7800 286.6500 216.3757
2020-12-02 3232.000 123.3700 291.7800 215.4700
2020-12-01 3248.950 123.4693 289.3000 217.3200
2020-11-30 3228.390 120.9700 277.7000 214.7600
2020-11-27 3216.190 117.4900 279.1300 216.2700
2020-11-25 3198.000 116.7500 280.1800 215.2900
2020-11-24 3134.250 115.8500 277.8199 214.2500
2020-11-23 3139.745 117.6202 270.9471 212.2900
2020-11-20 3132.890 118.7700 273.0000 213.2850
Matplotlib Code:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
company_list = [col[1] for col in plot_stocks.columns]
plot_stocks.plot(ax=ax, title=f'Daily Stock Prices ({(", ").join(company_list)})')
ax.legend(title='Companies', labels=company_list)
ax.set_ylabel('Stock Prices')
plt.show()
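As a side note on why the attempt in the question left the strings untouched: Series.replace without regex=True only matches whole cell values, whereas Series.str.replace works on substrings. A small sketch with two of the sample prices:

```python
import pandas as pd

prices = pd.Series(["$3198.21", "$122.8608"])

# Series.replace looks for cells that are exactly "$", so nothing changes here
unchanged = prices.replace("$", "")

# Series.str.replace removes the "$" substring; regex=False treats it literally
cleaned = prices.str.replace("$", "", regex=False).astype(float)

print(unchanged.tolist())
print(cleaned.tolist())
```

Note also that the result has to be assigned back to the column, as the question's code does, since neither method modifies the Series in place by default.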

Pandas (and seaborn) violinplot of state vs. year

I'm learning Pandas, (watching these helpful videos) and currently playing around with a UFO sighting table
import pandas as pd
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.head()
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
Now, I'd like to use Seaborn to make a violinplot of each state (on the x-axis) and the year (on the y-axis). Hence the plot shows the frequency density of sightings at any given year, in any given state.
If I use
ufo.State.value_counts()
I can get a Pandas Series of all the counts for each state. But how do I separate this data by year? I somehow need to get data with the ufo sightings per year per state?
Am I on the right track to create a Seaborn violinplot? Or going in completely the wrong direction?

The violinplot documentation includes the following example:
ax = sns.violinplot(x="day", y="total_bill", data=tips)
You assign the desired columns to the axes by passing their names to the x= and y= parameters. The tips data used in that example is structured as follows:
In [ ]: tips.head()
Out[ ]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
You want a violinplot with ufo.State on the x-axis and ufo.Year on the y-axis. Neither ufo.State.value_counts() nor a groupby is needed, because the ufo data already has one row per sighting, which is the long format that violinplot expects.
You can get the plot by passing the two columns directly to x= and y=. See the code below:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.head()
City Colors Reported Shape Reported State \
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
3 Abilene NaN DISK KS
4 New York Worlds Fair NaN LIGHT NY
Time Year
0 1930-06-01 22:00:00 1930
1 1930-06-30 20:00:00 1930
2 1931-02-15 14:00:00 1931
3 1931-06-01 13:00:00 1931
4 1933-04-18 19:00:00 1933
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.violinplot(x=ufo.State, y=ufo.Year)
# ax = sns.violinplot(x='State', y='Year', data=ufo)  # equivalent to the line above
plt.show()
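If the per-state, per-year counts asked about in the question are still wanted for their own sake (the violinplot does not need them), a groupby on both columns is one way to get them. The tiny frame below is made-up illustration data, not the real UFO table.

```python
import pandas as pd

# Made-up stand-in for the ufo table: one row per sighting
ufo = pd.DataFrame({
    "State": ["NY", "NY", "NJ", "NY"],
    "Year": [1930, 1930, 1930, 1931],
})

# Number of sightings for each (State, Year) pair
counts = ufo.groupby(["State", "Year"]).size()

print(counts)
```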
