Plotting grouped data in same plot using Pandas - python

In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?

Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:

import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())

Another approach would be using seaborn module. This would plot the two density estimates on the same axes without specifying a variable to hold the axes as follows (using some data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "indices":idx, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.

There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted, (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
duration waiting kind
0 3.600 79 long
1 1.800 54 short
2 3.333 74 long
3 2.283 62 short
4 4.533 85 long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot

Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
classes = list(df.class.unique())
for c in classes:
df2 = data.loc[data['class'] == c]
df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()

Related

Dynamic pandas subplots with matplotlib

I need help creating subplots in matplotlib dynamically from a pandas dataframe.
The data I am using is from data.word.
I have already created the viz but the plots have been created manually.
The reason why I need it dynamically is because I am going to apply a filter dynamically (in Power BI) and i need the graph to adjust to the filter.
This is what i have so far:
I imported the data and got it in the shape i need:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as
# read file from makeover monday year 2018 week 48
df = pd.read_csv(r'C:\Users\Ruth Pozuelo\Documents\python_examples\data\2018w48.csv', usecols=["city", "category","item", "cost"], index_col=False, decimal=",")
df.head()
this is the table:
I then apply the filter that will come from Power BI dynamically:
df = df[df.category=='Party night']
and then I count the number of plots based on the number of items I get after I apply the filter:
itemCount = df['item'].nunique() #number of plots
If I then plot the subplots:
fig, ax = plt.subplots( nrows=1, ncols=itemCount ,figsize=(30,10), sharey=True)
I get the skeleton:
So far so good!
But now i am suck on how to feed the x axis to the loop to generate the subcategories. I am trying something like below, but nothing works.
#for i, ax in enumerate(axes.flatten()):
# ax.plot(??,cityValues, marker='o',markersize=25, lw=0, color="green") # The top-left axes
As I already have the code for the look and feel of the chart, annotations,ect, I would love to be able to use the plt.subplots method and I prefer not use seaborn if possible.
Any ideas on how to get his working?
Thanks in advance!
The data was presented to us and we used it as the basis for our code. I prepared a list of columns and a list of coloring and looped through them. axes.rabel() is more memory efficient than axes.fatten(). This is because the list contains an object for each subplot, allowing for centralized configuration.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
url='https://raw.githubusercontent.com/Curbal-Data-Labs/Matplotlib-Labs/master/2018w48.csv'
dataset = pd.read_csv(url)
dataset.drop_duplicates(['city','item'], inplace=True)
dataset.pivot_table(index='city', columns='item', values='cost', aggfunc='sum', margins = True).sort_values('All', ascending=True).drop('All', axis=1)
df = dataset.pivot_table(index='city', columns='item', values='cost', aggfunc='sum', margins = True).sort_values('All', ascending=True).drop('All', axis=1).sort_values('All', ascending=False, axis=1).drop('All').reset_index()
# comma replace
for c in df.columns[1:]:
df[c] = df[c].str.replace(',','.').astype(float)
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(30,10), sharey=True)
colors = ['green','blue','red','black','brown']
col_names = ['Dinner','Drinks at Dinner','2 Longdrinks','Club entry','Cinema entry']
for i, (ax,col,c) in enumerate(zip(axes.ravel(), col_names, colors)):
ax.plot(df.loc[:,col], df['city'], marker='o', markersize=25, lw=0, color=c)
ax.set_title(col)
for i,j in zip(df[col], df['city']):
ax.annotate('$'+str(i), xy=(i, j), xytext=(i-4,j), color="white", fontsize=8)
ax.set_xticks([])
ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
ax.grid(True, axis='y', linestyle='solid', linewidth=2)
ax.grid(True, axis='x', linestyle='solid', linewidth=0.2)
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
ax.set_xlim(xmin=0, xmax=160)
ax.xaxis.set_major_formatter('${x:1.0f}')
ax.tick_params(labelsize=8, top=False, left=False)
plt.show()
Working Example below. I used seaborn to plot the bars but the idea is the same you can loop through the facets and increase a count. Starting from -1 so that your first count = 0, and use this as the axis label.
import seaborn as sns
fig, ax = plt.subplots( nrows=1, ncols=itemCount ,figsize=(30,10), sharey=True)
df['Cost'] = df['Cost'].astype(float)
count = -1
variables = df['Item'].unique()
fig, axs = plt.subplots(1,itemCount , figsize=(25,70), sharex=False, sharey= False)
for var in variables:
count += 1
sns.barplot(ax=axs[count],data=df, x='Cost', y='City')

Matrix of scatterplots by month-year

My data is in a dataframe of two columns: y and x. The data refers to the past few years. Dummy data is below:
np.random.seed(167)
rng = pd.date_range('2017-04-03', periods=365*3)
df = pd.DataFrame(
{"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)]),
"x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)])
}, index=rng
)
In first attempt, I plotted a scatterplot with Seaborn using the following code:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
fig, ax = plt.subplots(figsize=figsize)
ax.set_title(title)
sns.scatterplot(data=data,
x=data['x'],
y=data['y'])
plot_scatter(data=df, title='dummy title', figsize=(10,7))
However, I would like to generate a 4x3 matrix including 12 scatterplots, one for each month with year as hue. I thought I could create a third column in my dataframe that tells me the year and I tried the following:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
fig, ax = plt.subplots(figsize=figsize)
ax.set_title(title)
sns.scatterplot(data=data,
x=data['x'],
y=data['y'],
hue=data.iloc[:, 2])
df['year'] = df.index.year
plot_scatter(data=df, title='dummy title', figsize=(10,7))
While this allows me to see the years, it still shows all the data in the same scatterplot instead of creating multiple scatterplots, one for each month, so it's not offering the level of detail I need.
I could slice the data by month and build a for loop that plots one scatterplot per month but I actually want a matrix where all the scatterplots use similar axis scales. Does anyone know an efficient way to achieve that?
To create multiple subplots at once, seaborn introduces figure-level functions. The col= argument indicates which column of the dataframe should be used to identify the subplots. col_wrap= can be used to tell how many subplots go next to each other before starting an additional row.
Note that you shouldn't create a figure, as the function creates its own new figure. It uses the height= and aspect= arguments to tell the size of the individual subplots.
The code below uses a sns.relplot() on the months. An extra column for the months is created; it is made categorical to fix an order.
To remove the month= in the title, you can loop through the generated axes (a recent seaborn version is needed for axes_dict). With sns.set(font_scale=...) you can change the default sizes of all texts.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(167)
dates = pd.date_range('2017-04-03', periods=365 * 3, freq='D')
df = pd.DataFrame({"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)]),
"x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)])
}, index=dates)
df['year'] = df.index.year
month_names = pd.date_range('2017-01-01', periods=12, freq='M').strftime('%B')
df['month'] = pd.Categorical.from_codes(df.index.month - 1, month_names)
sns.set(font_scale=1.7)
g = sns.relplot(kind='scatter', data=df, x='x', y='y', hue='year', col='month', col_wrap=4, height=4, aspect=1)
# optionally remove the `month=` in the title
for name, ax in g.axes_dict.items():
ax.set_title(name)
plt.setp(g.axes, xlabel='', ylabel='') # remove all x and y labels
g.axes[-2].set_xlabel('x', loc='left') # set an x label at the left of the second to last subplot
g.axes[4].set_ylabel('y') # set a y label to 5th subplot
plt.subplots_adjust(left=0.06, bottom=0.06) # set some more spacing at the left and bottom
plt.show()

How do I plot multiple lines within the same graph and each represents targeted group by using matplotlib? [duplicate]

In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach would be using seaborn module. This would plot the two density estimates on the same axes without specifying a variable to hold the axes as follows (using some data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "indices":idx, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted, (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
duration waiting kind
0 3.600 79 long
1 1.800 54 short
2 3.333 74 long
3 2.283 62 short
4 4.533 85 long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
classes = list(df.class.unique())
for c in classes:
df2 = data.loc[data['class'] == c]
df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()

Plotting two subplots in one figure

I have two PCA plots: one for training data and testing test. Using seaborn, I'd like to combine those two and plot like subplots.
sns.FacetGrid(finalDf_test, hue="L", height=6).map(plt.scatter, 'PC1_test', 'PC2_test').add_legend()
sns.FacetGrid(finalDf_train, hue="L", height=6).map(plt.scatter, 'PC1_train', 'PC2_train').add_legend()
Can someone help on that?
FacetGrid is a figure-level function that creates one or more subplots, depending on its col= and row= parameters. In this case, only one subplot is created.
As FacetGrid works on only one dataframe, you could concatenate your dataframes, introducing a new column to diferentiate test and train. Also, the "PC1" and "PC2" columns of both dataframes should get the same name.
An easier approach is to use matplotlib to create the figure and then call sns.scatterplot(...., ax=...) for each of the subplots.
It would look like:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# create some dummy data
l = np.random.randint(0,2,500)
p1 = np.random.rand(500)*10
p2 = p1 + np.random.randn(500) + l
finalDf_test = pd.DataFrame({'PC1_test': p1[:100], 'PC2_test': p2[:100], 'L':l[:100] })
finalDf_train = pd.DataFrame({'PC1_train': p1[100:], 'PC2_train': p2[100:], 'L':l[100:] })
sns.set()
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6), sharex=True, sharey=True)
sns.scatterplot(data=finalDf_test, x='PC1_test', y='PC2_test', hue='L', ax=ax1)
sns.scatterplot(data=finalDf_train, x='PC1_train', y='PC2_train', hue='L', ax=ax2)
plt.show()
Concatenating the dataframes could look as follows:
sns.set()
finalDf_total = pd.concat({'test': finalDf_test.rename(columns={'PC1_test': 'PC1', 'PC2_test': 'PC2' }),
'train':finalDf_train.rename(columns={'PC1_train': 'PC1', 'PC2_train': 'PC2' })})
finalDf_total.index.rename(['origin', None], inplace=True) # rename the first index column to "origin"
finalDf_total.reset_index(level=0, inplace=True) # convert the first index to a regular column
sns.FacetGrid(finalDf_total, hue='L', height=6, col='origin').map(plt.scatter, 'PC1', 'PC2').add_legend()
plt.show()
The same combined dataframe could also be used for example in lmplot:
sns.lmplot(data=finalDf_total, x='PC1', y='PC2', hue='L', height=6, col='origin')

Seaborn countplot with second axis with ordered data

I am trying to create a countplot with a lineplot over it as practice for some data visualisation I will be doing in work. I am looking at the kickstarter data on kaggle Link here
I run a countplot with a hue on the state of the project (successful, failed, canceled) and both of these are ordered
filter_list = ['failed', 'successful', 'canceled']
df2 = df[df.state.isin(filter_list)]
fig = plt.gcf()
fig.set_size_inches( 16, 10)
sns.countplot(x='main_category', hue='state', data=df2, order = df2['main_category'].value_counts().index,
hue_order = df2['state'].value_counts().index)
This comes out as follows:
I then create my second axis and add a lineplot
fig, ax = plt.subplots()
fig.set_size_inches( 16, 10)
ax = sns.countplot(x='main_category', hue='state', data=df, ax=ax, order = df2['main_category'].value_counts().index,
hue_order = df2['state'].value_counts().index)
ax2 = ax.twinx()
sns.lineplot(x='main_category', y='backers', data=df2, ax =ax2)
But this changes the column labels as seen below:
It appears that the data is the same its just the order of columns is different. I am still learning so my code may be inefficent or some of it redundant but any help would be appreciated. The only other things are how df is created which is as follows:
import pandas as pd
import numpy as np
import seaborn as sns; sns.set(style="white", color_codes=True)
import matplotlib.pyplot as plt
df = pd.read_csv('ks.csv')
df = df.drop(['ID'], axis = 1)
df.head()
I don't think lineplot is what you are looking for. lineplot is supposed to be used with numeric data, not categorical. I'm even surprised this worked at all.
I think you are looking for pointplot instead
filter_list = ['failed', 'successful', 'canceled']
df2 = df[df.state.isin(filter_list)]
order = df2['main_category'].value_counts().index
fig = plt.figure()
ax1 = sns.countplot(x='main_category', hue='state', data=df2, order=order,
hue_order=filter_list)
ax2 = ax1.twinx()
sns.pointplot(x='main_category', y='backers', data=df2, ax=ax2, order=order)
Note that used like that, pointplot will show the average number of backers across categories. If that's not what you want, you can pass another aggregation function using the estimator= paramater
eg
sns.pointplot(x='main_category', y='backers', data=df2, ax=ax2, order=order, estimator=np.sum)

Categories