How to bar plot each row of a dataframe - python

The data frame looks like:
import pandas as pd
import numpy as np # used for the nan values
data = {'card_name': ['Diamonds', 'Clovers', 'HorseShoe'], '$20': [1000.0, 10.0, np.nan], '$25': [500.0, np.nan, 1873.0], '$30': [25, 213, 4657], '$40': [np.nan, 2199.0, np.nan], '$50': [1500.0, np.nan, 344.0], '$70': [np.nan, 43.0, 239.0], '$75': [30.0, 2.0, np.nan], '$100': [1.0, np.nan, 748.0]}
df = pd.DataFrame(data)
card_name $20 $25 $30 $40 $50 $70 $75 $100
0 Diamonds 1000 500 25 NaN 1500 NaN 30 1
1 Clovers 10 NaN 213 2199 NaN 43 2 NaN
2 HorseShoe NaN 1873 4657 NaN 344 239 NaN 748
The figure under each dollar-sign column name is the number of prizes on the corresponding card_name.
I'm trying to graph each card_name and show how many prizes there are for each column.
I'm using Python and Pandas with Matplotlib/Seaborn

The shape of the required dataframe depends on which plot API is being used to plot.
pandas and seaborn are both dependent upon matplotlib, but require a different shape to get the same result.
pandas
Set 'card_name' as the index, and then transpose the dataframe with .T.
Plot the dataframe directly with pandas.DataFrame.plot and kind='bar'. The index is plotted on the x-axis, and each column is drawn as a separate color.
# set the index and transpose
dft = df.set_index('card_name').T
# display(dft)
card_name Diamonds Clovers HorseShoe
$20 1000.0 10.0 NaN
$25 500.0 NaN 1873.0
$30 25.0 213.0 4657.0
$40 NaN 2199.0 NaN
$50 1500.0 NaN 344.0
$70 NaN 43.0 239.0
$75 30.0 2.0 NaN
$100 1.0 NaN 748.0
# plot
dft.plot(kind='bar', rot=0)
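To make the effect of the transpose concrete, here is a minimal sketch that plots the frame both ways. The axis labels and the variable names ax_wide and ax_t are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# same sample data as in the question
data = {'card_name': ['Diamonds', 'Clovers', 'HorseShoe'],
        '$20': [1000.0, 10.0, np.nan], '$25': [500.0, np.nan, 1873.0],
        '$30': [25, 213, 4657], '$40': [np.nan, 2199.0, np.nan],
        '$50': [1500.0, np.nan, 344.0], '$70': [np.nan, 43.0, 239.0],
        '$75': [30.0, 2.0, np.nan], '$100': [1.0, np.nan, 748.0]}
df = pd.DataFrame(data)

# without the transpose: card names on the x-axis, one color per prize amount
ax_wide = df.set_index('card_name').plot(kind='bar', rot=0)
ax_wide.set(xlabel='card_name', ylabel='Number of prizes')

# with the transpose: prize amounts on the x-axis, one color per card
dft = df.set_index('card_name').T
ax_t = dft.plot(kind='bar', rot=0)
ax_t.set(xlabel='Prize amount', ylabel='Number of prizes')
```

Either orientation is valid; which one reads better depends on whether the comparison of interest is between cards or between prize amounts.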
seaborn
Convert the dataframe from a wide to long format using pandas.DataFrame.melt
Plot the data with seaborn.barplot, or with seaborn.catplot and kind='bar', then use hue= to specify the column to color by.
# convert the dataframe to long format
dfm = df.melt(id_vars='card_name')
# display(dfm.head())
card_name variable value
0 Diamonds $20 1000.0
1 Clovers $20 10.0
2 HorseShoe $20 NaN
3 Diamonds $25 500.0
4 Clovers $25 NaN
# plot (requires seaborn)
import seaborn as sns
ax = sns.barplot(data=dfm, x='variable', y='value', hue='card_name')
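The default column names 'variable' and 'value' can be replaced while melting. The sketch below assumes the names 'prize' and 'count' (my choice, not from the question) and optionally drops the missing combinations.

```python
import numpy as np
import pandas as pd

data = {'card_name': ['Diamonds', 'Clovers', 'HorseShoe'],
        '$20': [1000.0, 10.0, np.nan], '$25': [500.0, np.nan, 1873.0],
        '$30': [25, 213, 4657], '$40': [np.nan, 2199.0, np.nan],
        '$50': [1500.0, np.nan, 344.0], '$70': [np.nan, 43.0, 239.0],
        '$75': [30.0, 2.0, np.nan], '$100': [1.0, np.nan, 748.0]}
df = pd.DataFrame(data)

# wide to long, with descriptive names instead of 'variable' / 'value'
dfm = df.melt(id_vars='card_name', var_name='prize', value_name='count')

# optional: keep only the card / prize combinations that actually exist
dfm = dfm.dropna(subset=['count']).reset_index(drop=True)
```

With these names the plotting call would use x='prize' and y='count' instead of x='variable' and y='value'.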
subplots
pandas
Add the parameter subplots=True.
# using the previously transformed dataframe dft
axes = dft.plot(kind='bar', rot=0, subplots=True, figsize=(6, 10))
seaborn
It's easier to use .catplot to get subplots by specifying the row= and/or col= parameter.
# using the previously transformed dataframe dfm
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', row='card_name', height=3, aspect=1.5)
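If neither API fits, one subplot per row can also be built directly with matplotlib. This is only a sketch under the assumption that missing prizes should be left out of each subplot; the figure size and labels are arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = {'card_name': ['Diamonds', 'Clovers', 'HorseShoe'],
        '$20': [1000.0, 10.0, np.nan], '$25': [500.0, np.nan, 1873.0],
        '$30': [25, 213, 4657], '$40': [np.nan, 2199.0, np.nan],
        '$50': [1500.0, np.nan, 344.0], '$70': [np.nan, 43.0, 239.0],
        '$75': [30.0, 2.0, np.nan], '$100': [1.0, np.nan, 748.0]}
df = pd.DataFrame(data)

wide = df.set_index('card_name')
fig, axes = plt.subplots(nrows=len(wide), figsize=(6, 9))

# one subplot per dataframe row
for ax, (card, row) in zip(axes, wide.iterrows()):
    present = row.dropna()              # skip prizes this card does not have
    ax.bar(present.index, present.values)
    ax.set_title(card)
    ax.set_ylabel('Number of prizes')

fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to display or save the figure.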

Related

Seaborn boxplot with grouped data into categories with count column

I run into a problem when trying to plot my dataset with a seaborn boxplot. I've got a dataset received grouped from database like:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
And I would like to create a box plot with the region on an x-scale, age on a y-scale, based on the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that doesn't take the total column into account. One option is to repeat each row 'total' times, but I don't like this solution.
sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Repeating values isn't necessarily a bad solution, but in this case the numbers are very large. You could create a new dataframe with repeated data, but divide the 'total' column by a constant to keep the number of rows manageable.
The sample data have all different regions, which makes creating a boxplot rather strange. The code below supposes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
# sep=r'\s+' splits on whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv(StringIO(df_str), sep=r'\s+')
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
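As a rough sketch of that alternative, the box statistics can be computed by hand and drawn with matplotlib's Axes.bxp. Note that this swaps in a plain NumPy helper instead of statsmodels' DescrStatsW: weighted_quantile is a hypothetical function that uses a simple inverse-CDF definition (no interpolation), and the whiskers are assumed to span the full range rather than using an outlier rule.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'region': ['STC'] * 5 + ['PHA'] * 5,
    'age': [2.0, 84.0, 55.0, 72.0, 86.0, 62.0, 41.0, 13.0, 18.0, 88.0],
    'total': [11024, 3904, 12944, 5592, 2168, 4600, 26568, 6928, 8296, 2408],
})

def weighted_quantile(values, weights, q):
    """Hypothetical helper: smallest value whose cumulative weight share reaches q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_share = np.cumsum(weights) / weights.sum()
    idx = min(np.searchsorted(cum_share, q, side='left'), len(values) - 1)
    return float(values[idx])

box_stats = []
for region, grp in df.groupby('region', sort=False):
    ages, totals = grp['age'], grp['total']
    box_stats.append({
        'label': region,
        'q1': weighted_quantile(ages, totals, 0.25),
        'med': weighted_quantile(ages, totals, 0.50),
        'q3': weighted_quantile(ages, totals, 0.75),
        'whislo': float(ages.min()),   # assumption: whiskers cover the full range
        'whishi': float(ages.max()),
        'fliers': [],                  # outliers are not computed in this sketch
    })

fig, ax = plt.subplots(figsize=(6, 4))
ax.bxp(box_stats, showfliers=False)
ax.set(xlabel='region', ylabel='age')
```

Because no rows are repeated, this scales to large totals; the trade-off is that the quantile definition and whisker rule are now your own responsibility.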

How to color bars based on a separate pandas column

I need to plot a bar chart and color the bars according to the "Attribute" column of my dataframe
x axis = Shares
y axis = Price
fig, ax = plt.subplots()
ax.barh(df['Share'],df['Price'], align='center')
ax.set_xlabel('Shares')
ax.set_ylabel('Price')
ax.set_title('Bar Chart & Colors')
plt.show()
Thanks for your help !
There are two easy ways to plot the bars with separate colors for 'Attribute'
Transform the dataframe with .pivot and then plot with pandas.DataFrame.plot and specify kind='barh' for a horizontal bar plot
The index will be the x-axis if using kind='bar', and will be the y-axis if using kind='barh'
The columns of the transformed dataframe will each be plotted with a separate color.
pandas uses matplotlib as the default plotting backend.
Use seaborn.barplot with hue='Attribute' and orient='h'. This option works with the dataframe in a long format, as shown in the OP.
seaborn is a high-level API for matplotlib
Tested with pandas 1.3.0, seaborn 0.11.1, and matplotlib 3.4.2
Imports and DataFrame
import pandas as pd
import seaborn as sns
# test dataframe
data = {'Price': [110, 105, 119, 102, 111, 117, 110, 110], 'Share': [110, -50, 22, 79, 29, -2, 130, 140], 'Attribute': ['A', 'B', 'C', 'D', 'A', 'B', 'B', 'C']}
df = pd.DataFrame(data)
Price Share Attribute
0 110 110 A
1 105 -50 B
2 119 22 C
3 102 79 D
4 111 29 A
5 117 -2 B
6 110 130 B
7 110 140 C
pandas.DataFrame.plot
# transform the dataframe with .pivot
dfp = df.pivot(index='Price', columns='Attribute', values='Share')
Attribute A B C D
Price
102 NaN NaN NaN 79.0
105 NaN -50.0 NaN NaN
110 110.0 130.0 140.0 NaN
111 29.0 NaN NaN NaN
117 NaN -2.0 NaN NaN
119 NaN NaN 22.0 NaN
# plot
ax = dfp.plot(kind='barh', title='Bar Chart of Colors', figsize=(6, 4))
ax.set(xlabel='Shares')
ax.legend(title='Attribute', bbox_to_anchor=(1, 1), loc='upper left')
ax.grid(axis='x')
with stacked=True
ax = dfp.plot(kind='barh', stacked=True, title='Bar Chart of Colors', figsize=(6, 4))
seaborn.barplot
Note that the order of the y-axis values is reversed compared to the previous plot
ax = sns.barplot(data=df, x='Share', y='Price', hue='Attribute', orient='h')
ax.set(xlabel='Shares', title='Bar Chart of Colors')
ax.legend(title='Attribute', bbox_to_anchor=(1, 1), loc='upper left')
ax.grid(axis='x')
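A third possibility, closer to the ax.barh call in the question, is to pass one color per bar. The sketch below assumes a hand-written color_map (hypothetical; any valid matplotlib colors would do) and draws one bar per row, using the Price values only as tick labels.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

data = {'Price': [110, 105, 119, 102, 111, 117, 110, 110],
        'Share': [110, -50, 22, 79, 29, -2, 130, 140],
        'Attribute': ['A', 'B', 'C', 'D', 'A', 'B', 'B', 'C']}
df = pd.DataFrame(data)

# hypothetical mapping from attribute to color
color_map = {'A': 'tab:blue', 'B': 'tab:orange', 'C': 'tab:green', 'D': 'tab:red'}
bar_colors = df['Attribute'].map(color_map).tolist()

fig, ax = plt.subplots(figsize=(6, 4))
positions = list(range(len(df)))          # one bar per dataframe row
ax.barh(positions, df['Share'], color=bar_colors, align='center')
ax.set_yticks(positions)
ax.set_yticklabels(df['Price'].astype(str).tolist())
ax.set(xlabel='Shares', ylabel='Price', title='Bar Chart & Colors')

# barh does not create per-color legend entries, so build them by hand
handles = [Patch(color=color, label=attr) for attr, color in color_map.items()]
ax.legend(handles=handles, title='Attribute', bbox_to_anchor=(1, 1), loc='upper left')
```

Unlike the pivot approach, rows that share a Price (110 appears three times) stay as separate bars here.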

Getting labels for legend after graphing pivot of dataframe in pandas

I am trying to have my plot show a legend where the column each value came from would have a label. I did not separate the plt.plot() from the pivot step but want to know if it is still possible to have a legend. One does not show up at all and if I add
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'), label='DC')
it just uses that string as every label. If I add df_EPErrorPercentByWeekAndDC['DC'] instead, it shows one letter of it per legend item. Here is the code I have:
print("### Graphing Error Rates by Week and DC EP ###")
# remove percent sign from percent in place
df_EPErrorPercentByWeekAndDC['Error Percent'] = df_EPErrorPercentByWeekAndDC['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation = 90)
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'))
plt.legend()
plt.savefig('EPErrorPercentByWeekAndDC.png', bbox_inches="tight", dpi=500)
plt.close()
I can't share any of the data, but it is in the format of a pivot table with state names as columns, and each column is full of percentages. The graph works fine, but the legend isn't there.
Either save the pivoted frame then specify the legend as the columns:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation=90)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
plt.plot(pivoted_df)
plt.legend(pivoted_df.columns, title=pivoted_df.columns.name)
plt.tight_layout()
plt.show()
Or use DataFrame.plot directly on the pivoted DataFrame which will handle the legend automatically:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
ax = pivoted_df.plot(rot=90)
plt.tight_layout()
plt.show()
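For context on why label='DC' repeats the same text: plotting a 2D block of data creates one line per column, and a single label string is applied to all of them. A minimal sketch of labelling each line individually follows; the week numbers and percentages are made-up illustrative values, not the asker's data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# small illustrative frame (values are invented for the example)
df = pd.DataFrame({
    'hellofresh delivery week': [18, 18, 19, 19, 20, 20],
    'DC': ['A', 'B'] * 3,
    'Error Percent': ['88%', '71%', '26%', '83%', '18%', '72%'],
})
df['Error Percent'] = df['Error Percent'].str.rstrip('%').astype(float)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
                      values='Error Percent')

fig, ax = plt.subplots()
# one Line2D is returned per column of the 2D array
lines = ax.plot(pivoted_df.index.to_numpy(), pivoted_df.to_numpy())
for line, dc in zip(lines, pivoted_df.columns):
    line.set_label(dc)                    # give each line its own column name
ax.legend(title=pivoted_df.columns.name)
```

This gives the same legend as passing pivoted_df.columns to plt.legend, but keeps each label attached to its line.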
Sample DataFrame and imports used:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'hellofresh delivery week': np.repeat(
pd.date_range('2021-05-01', periods=5, freq='W'), 2),
'DC': ['A', 'B'] * 5,
'Error Percent': pd.Series(np.random.randint(10, 100, 10), dtype=str) + '%'
})
df:
hellofresh delivery week DC Error Percent
0 2021-05-02 A 88%
1 2021-05-02 B 71%
2 2021-05-09 A 26%
3 2021-05-09 B 83%
4 2021-05-16 A 18%
5 2021-05-16 B 72%
6 2021-05-23 A 37%
7 2021-05-23 B 40%
8 2021-05-30 A 90%
9 2021-05-30 B 17%
pivoted_df:
DC A B
hellofresh delivery week
2021-05-02 88.0 71.0
2021-05-09 26.0 83.0
2021-05-16 18.0 72.0
2021-05-23 37.0 40.0
2021-05-30 90.0 17.0

How to make multiple scatter subplots with sharing one-axis?

date name amount
1 harry 100
1 joe 20
2 harry 50
3 joe 60
3 lee 25
4 lee 60
4 harry 200
4 joe 90
I am trying to share the 'date' axis (x-axis) across plots for 432 different names. The image was too large to show.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec=pd.read_csv('december.csv')
sns.lmplot(x='date', y='amount',
data= dec, fit_reg=False, hue='name', legend=True, palette='Set1')
This code gives one graph with 432 hues, but I want 432 graphs. How can I do that?
Using the same code you wrote, but instead of putting hue='name', you put col='name' and it should give you the expected behavior:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec = pd.DataFrame(
[
[1,'harry',100],
[1,'joe',20],
[2,'harry',50],
[3,'joe',60],
[3,'lee',25],
[4,'lee',60],
[4,'harry',200],
[4,'joe',90],
],
columns=['date','name','amount'],
)
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
If you want to break the rows, you can define a column wrapper with col_wrap (number of plots per row):
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
col_wrap=1,
legend=True,
palette='Set1',
)
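The same layout can also be sketched without seaborn, using plt.subplots with sharex=True and one axes per group. The figure size here is an arbitrary choice, and with hundreds of names the figure would likely need to be split into several pages.

```python
import pandas as pd
import matplotlib.pyplot as plt

dec = pd.DataFrame(
    [[1, 'harry', 100], [1, 'joe', 20], [2, 'harry', 50], [3, 'joe', 60],
     [3, 'lee', 25], [4, 'lee', 60], [4, 'harry', 200], [4, 'joe', 90]],
    columns=['date', 'name', 'amount'],
)

groups = dec.groupby('name')
# squeeze=False keeps axes 2D even if there is only one name
fig, axes = plt.subplots(nrows=groups.ngroups, sharex=True,
                         figsize=(5, 2 * groups.ngroups), squeeze=False)

for ax, (name, grp) in zip(axes[:, 0], groups):
    ax.scatter(grp['date'], grp['amount'])
    ax.set_title(f'name = {name}')
    ax.set_ylabel('amount')

axes[-1, 0].set_xlabel('date')            # shared x-axis: label only the bottom plot
fig.tight_layout()
```

Because the x-axis is shared, every subplot spans the full date range even when a name has no record on some dates.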
EDIT: using the groupby() method, you can easily get aggregates such as number of dots per plot and total amount per group.
The main idea is to group the records in the dec dataframe by name (as was implicitly done in the plot above).
Continuing on the code above, you can have a preview of the groupby operation using the describe method:
dec.groupby('name').describe()
Out[2]:
date amount
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
name
harry 3.0 2.333333 1.527525 1.0 1.50 2.0 3.00 4.0 3.0 116.666667 76.376262 50.0 75.00 100.0 150.00 200.0
joe 3.0 2.666667 1.527525 1.0 2.00 3.0 3.50 4.0 3.0 56.666667 35.118846 20.0 40.00 60.0 75.00 90.0
lee 2.0 3.500000 0.707107 3.0 3.25 3.5 3.75 4.0 2.0 42.500000 24.748737 25.0 33.75 42.5 51.25 60.0
Using the pandas groupby method, we group records by 'name' and pick any column (here: 'amount') to get the count (the count is the same for every column, as it counts each occurrence of each different 'name'):
counts = dec.groupby('name')['amount'].count()
counts
Out[3]:
name
harry 3
joe 3
lee 2
Name: amount, dtype: int64
To get the total amount, we do the same, we pick the 'amount' column and call the sum() method instead of the count() method:
total_amounts = dec.groupby('name')['amount'].sum()
total_amounts
Out[4]:
name
harry 350
joe 170
lee 85
Name: amount, dtype: int64
We now have two series indexed by 'name' containing the information we want: counts and total_amounts.
We're gonna use these two series to build a title for each subplot:
plot = sns.lmplot(
x='date',
y='amount',
data=dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
# set a custom title on each subplot
for name in plot.axes_dict:
    subplot_title = f'name = {name}, number of dots = {counts[name]}, total amount = {total_amounts[name]}'
    plot.axes_dict[name].set_title(subplot_title)
plot.fig
Each subplot title now shows the name, the number of dots, and the total amount.
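As a variation, the count and the sum can be computed in a single groupby call with named aggregation. This is a sketch; the output column names n_points and total, and the titles dict, are my own choices.

```python
import pandas as pd

dec = pd.DataFrame(
    [[1, 'harry', 100], [1, 'joe', 20], [2, 'harry', 50], [3, 'joe', 60],
     [3, 'lee', 25], [4, 'lee', 60], [4, 'harry', 200], [4, 'joe', 90]],
    columns=['date', 'name', 'amount'],
)

# one row per name, with both aggregates side by side
summary = dec.groupby('name')['amount'].agg(n_points='count', total='sum')

# build the title text for every name from the summary table
titles = {
    name: f"name = {name}, number of dots = {row['n_points']}, total amount = {row['total']}"
    for name, row in summary.iterrows()
}
```

The titles dict can then be used in the loop over plot.axes_dict in place of the two separate series.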

Plot datetime.date / time series in a pandas dataframe

I created a pandas dataframe from some value counts on particular calendar dates. Here is how I did it:
time_series = pd.DataFrame(df['Operation Date'].value_counts().reset_index())
time_series.columns = ['date', 'count']
Basically, it has two columns: the first, "date", holds datetime.date objects, and the second, "count", holds integer values. Now, I'd like to plot a scatter or a KDE to represent how the value changes over the calendar days.
But when I try:
time_series.plot(kind='kde')
plt.show()
I get a plot where the x-axis is from -50 to 150 as if it is parsing the datetime.date objects as integers somehow. Also, it is yielding two identical plots rather than just one.
Any idea how I can plot them and see the calendars day along the x-axis?
Are you sure you have datetime values? I just tried this and it worked fine:
df:
date count
7 2012-06-11 16:51:32 1.0
3 2012-09-28 08:05:14 12.0
19 2012-10-01 18:01:47 4.0
2 2012-10-03 15:18:23 29.0
6 2012-12-22 19:50:43 4.0
1 2013-02-19 19:54:03 28.0
9 2013-02-28 16:08:40 17.0
12 2013-03-12 08:42:55 6.0
4 2013-04-04 05:27:27 6.0
17 2013-04-18 09:40:37 29.0
11 2013-05-17 16:34:51 22.0
5 2013-07-07 14:32:59 16.0
14 2013-10-22 06:56:29 13.0
13 2014-01-16 23:08:46 20.0
15 2014-02-25 00:49:26 10.0
18 2014-03-19 15:58:38 25.0
0 2014-03-31 05:53:28 16.0
16 2014-04-01 09:59:32 27.0
8 2014-04-27 12:07:41 17.0
10 2014-09-20 04:42:39 21.0
df = df.sort_values('date', ascending=True)
plt.plot(df['date'], df['count'])
plt.xticks(rotation='vertical')
EDIT:
if you want a scatter plot you can:
plt.plot(df['date'], df['count'], '*')
plt.xticks(rotation='vertical')
If the column is datetime dtype (not object), then you can call plot() directly on the dataframe. You don't need to sort by date either, it's done behind the scenes if x-axis is datetime.
df['date'] = pd.to_datetime(df['date'])
df.plot(x='date', y='count', kind='scatter', rot='vertical');
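To illustrate the dtype point with the setup from the question, here is a small sketch. The sample dates and the variable names raw and dtype_before are made up; the idea is that a column of datetime.date objects has object dtype until pd.to_datetime converts it.

```python
import datetime
import pandas as pd

# illustrative stand-in for df['Operation Date']
raw = pd.Series([datetime.date(2013, 2, 19), datetime.date(2012, 10, 3),
                 datetime.date(2013, 2, 19), datetime.date(2012, 9, 28)],
                name='Operation Date')

time_series = raw.value_counts().reset_index()
time_series.columns = ['date', 'count']

# datetime.date objects are stored as generic Python objects
dtype_before = time_series['date'].dtype

# convert to a real datetime dtype so pandas/matplotlib treat it as a time axis
time_series['date'] = pd.to_datetime(time_series['date'])
time_series = time_series.sort_values('date').reset_index(drop=True)
```

After the conversion, time_series.plot(x='date', y='count', kind='scatter') should place calendar dates on the x-axis.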
You can also pass many arguments to make the plot nicer (add titles, change figsize and fontsize, rotate ticklabels, set subplots axis etc.) See the docs for full list of possible arguments.
df.plot(x='date', y='count', kind='line', rot=45, legend=None,
title='Count across time', xlabel='', fontsize=10, figsize=(12,4));
You can even use another column to color scatter plots. In the example below, the months are used to assign color. Tip: To get the full list of possible colormaps, pass any gibberish string to colormap and the error message will show you the full list.
df.plot(x='date', y='count', kind='scatter', rot=90, c=df['date'].dt.month, colormap='tab20', sharex=False);
