Seaborn countplot with second axis with ordered data - python

I am trying to create a countplot with a lineplot over it as practice for some data visualisation I will be doing in work. I am looking at the kickstarter data on kaggle Link here
I run a countplot with a hue on the state of the project (successful, failed, canceled) and both of these are ordered
filter_list = ['failed', 'successful', 'canceled']
df2 = df[df.state.isin(filter_list)]
fig = plt.gcf()
fig.set_size_inches( 16, 10)
sns.countplot(x='main_category', hue='state', data=df2, order = df2['main_category'].value_counts().index,
hue_order = df2['state'].value_counts().index)
This comes out as follows:
I then create my second axis and add a lineplot
fig, ax = plt.subplots()
fig.set_size_inches( 16, 10)
ax = sns.countplot(x='main_category', hue='state', data=df, ax=ax, order = df2['main_category'].value_counts().index,
hue_order = df2['state'].value_counts().index)
ax2 = ax.twinx()
sns.lineplot(x='main_category', y='backers', data=df2, ax =ax2)
But this changes the column labels as seen below:
It appears that the data is the same its just the order of columns is different. I am still learning so my code may be inefficent or some of it redundant but any help would be appreciated. The only other things are how df is created which is as follows:
import pandas as pd
import numpy as np
import seaborn as sns; sns.set(style="white", color_codes=True)
import matplotlib.pyplot as plt
df = pd.read_csv('ks.csv')
df = df.drop(['ID'], axis = 1)
df.head()

I don't think lineplot is what you are looking for. lineplot is supposed to be used with numeric data, not categorical. I'm even surprised this worked at all.
I think you are looking for pointplot instead
filter_list = ['failed', 'successful', 'canceled']
df2 = df[df.state.isin(filter_list)]
order = df2['main_category'].value_counts().index
fig = plt.figure()
ax1 = sns.countplot(x='main_category', hue='state', data=df2, order=order,
hue_order=filter_list)
ax2 = ax1.twinx()
sns.pointplot(x='main_category', y='backers', data=df2, ax=ax2, order=order)
Note that used like that, pointplot will show the average number of backers across categories. If that's not what you want, you can pass another aggregation function using the estimator= paramater
eg
sns.pointplot(x='main_category', y='backers', data=df2, ax=ax2, order=order, estimator=np.sum)

Related

Dynamic pandas subplots with matplotlib

I need help creating subplots in matplotlib dynamically from a pandas dataframe.
The data I am using is from data.word.
I have already created the viz but the plots have been created manually.
The reason why I need it dynamically is because I am going to apply a filter dynamically (in Power BI) and i need the graph to adjust to the filter.
This is what i have so far:
I imported the data and got it in the shape i need:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as
# read file from makeover monday year 2018 week 48
df = pd.read_csv(r'C:\Users\Ruth Pozuelo\Documents\python_examples\data\2018w48.csv', usecols=["city", "category","item", "cost"], index_col=False, decimal=",")
df.head()
this is the table:
I then apply the filter that will come from Power BI dynamically:
df = df[df.category=='Party night']
and then I count the number of plots based on the number of items I get after I apply the filter:
itemCount = df['item'].nunique() #number of plots
If I then plot the subplots:
fig, ax = plt.subplots( nrows=1, ncols=itemCount ,figsize=(30,10), sharey=True)
I get the skeleton:
So far so good!
But now i am suck on how to feed the x axis to the loop to generate the subcategories. I am trying something like below, but nothing works.
#for i, ax in enumerate(axes.flatten()):
# ax.plot(??,cityValues, marker='o',markersize=25, lw=0, color="green") # The top-left axes
As I already have the code for the look and feel of the chart, annotations,ect, I would love to be able to use the plt.subplots method and I prefer not use seaborn if possible.
Any ideas on how to get his working?
Thanks in advance!
The data was presented to us and we used it as the basis for our code. I prepared a list of columns and a list of coloring and looped through them. axes.rabel() is more memory efficient than axes.fatten(). This is because the list contains an object for each subplot, allowing for centralized configuration.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
url='https://raw.githubusercontent.com/Curbal-Data-Labs/Matplotlib-Labs/master/2018w48.csv'
dataset = pd.read_csv(url)
dataset.drop_duplicates(['city','item'], inplace=True)
dataset.pivot_table(index='city', columns='item', values='cost', aggfunc='sum', margins = True).sort_values('All', ascending=True).drop('All', axis=1)
df = dataset.pivot_table(index='city', columns='item', values='cost', aggfunc='sum', margins = True).sort_values('All', ascending=True).drop('All', axis=1).sort_values('All', ascending=False, axis=1).drop('All').reset_index()
# comma replace
for c in df.columns[1:]:
df[c] = df[c].str.replace(',','.').astype(float)
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(30,10), sharey=True)
colors = ['green','blue','red','black','brown']
col_names = ['Dinner','Drinks at Dinner','2 Longdrinks','Club entry','Cinema entry']
for i, (ax,col,c) in enumerate(zip(axes.ravel(), col_names, colors)):
ax.plot(df.loc[:,col], df['city'], marker='o', markersize=25, lw=0, color=c)
ax.set_title(col)
for i,j in zip(df[col], df['city']):
ax.annotate('$'+str(i), xy=(i, j), xytext=(i-4,j), color="white", fontsize=8)
ax.set_xticks([])
ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
ax.grid(True, axis='y', linestyle='solid', linewidth=2)
ax.grid(True, axis='x', linestyle='solid', linewidth=0.2)
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
ax.set_xlim(xmin=0, xmax=160)
ax.xaxis.set_major_formatter('${x:1.0f}')
ax.tick_params(labelsize=8, top=False, left=False)
plt.show()
Working Example below. I used seaborn to plot the bars but the idea is the same you can loop through the facets and increase a count. Starting from -1 so that your first count = 0, and use this as the axis label.
import seaborn as sns
fig, ax = plt.subplots( nrows=1, ncols=itemCount ,figsize=(30,10), sharey=True)
df['Cost'] = df['Cost'].astype(float)
count = -1
variables = df['Item'].unique()
fig, axs = plt.subplots(1,itemCount , figsize=(25,70), sharex=False, sharey= False)
for var in variables:
count += 1
sns.barplot(ax=axs[count],data=df, x='Cost', y='City')

How to sync color between Seaborn and pandas pie plot

I am struggling with syncing colors between [seaborn.countplot] and [pandas.DataFrame.plot] pie plot.
I found a similar question on SO, but it does not work with pie chart as it throws an error:
TypeError: pie() got an unexpected keyword argument 'color'
I searched on the documentation sites, but all I could find is that I can set a colormap and palette, which was also not in sync in the end:
Result of using the same colormap and palette
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('https://andybek.com/pandas-sat')
cat_vars = ['Borough', 'SAT Section']
for var in list(cat_vars):
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
df[var].value_counts().plot(kind='pie', autopct=lambda v: f'{v:.2f}%', ax=ax[0])
cplot = sns.countplot(data=df, x=var, ax=ax[1])
for patch in cplot.patches:
cplot.annotate(
format(patch.get_height()),
(
patch.get_x() + patch.get_width() / 2,
patch.get_height()
)
)
plt.show()
Illustration of the problem
As you can see, colors are not in sync with labels.
I added the argument order to the sns.countplot(). This would change how seaborn selects the values and as a consequence the colours between both plots will mach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('https://andybek.com/pandas-sat')
cat_vars = ['Borough', 'SAT Section']
for var in list(cat_vars):
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
df[var].value_counts().plot(kind='pie', autopct=lambda v: f'{v:.2f}%', ax=ax[0])
cplot = sns.countplot(data=df, x=var, ax=ax[1],
order=df[var].value_counts().index)
for patch in cplot.patches:
cplot.annotate(
format(patch.get_height()),
(
patch.get_x() + patch.get_width() / 2,
patch.get_height()
)
)
plt.show()
Explanation: Colors are selected by order. So, if the columns in the sns.countplot have a different order than the other plot, both plots will have different columns for the same label.
Using default colors
Using the same dataframe for the pie plot and for the seaborn plot might help. As the values are already counted for the pie plot, that same dataframe could be plotted directly as a bar plot. That way, the order of the values stays the same.
Note that seaborn by default makes the colors a bit less saturated. To get the same colors as in the pie plot, you can use saturation=1 (default is .75). To add text above the bars, the latest matplotlib versions have a new function bar_label.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv('https://andybek.com/pandas-sat')
cat_vars = ['Borough', 'SAT Section']
for var in list(cat_vars):
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
counts_df = df[var].value_counts()
counts_df.plot(kind='pie', autopct=lambda v: f'{v:.2f}%', ax=ax[0])
sns.barplot(x=counts_df.index, y=counts_df.values, saturation=1, ax=ax[1])
ax[1].bar_label(ax[1].containers[0])
#Customized colors
If you want to use a customized list of colors, you can use the colors= keyword in pie() and palette= in seaborn.
To make things fit better, you can replace spaces by newlines (so "Staten Island" will use two lines). plt.tight_layout() will rearrange spacings to make titles and texts fit nicely into the figure.

How do I plot multiple lines within the same graph and each represents targeted group by using matplotlib? [duplicate]

In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach would be using seaborn module. This would plot the two density estimates on the same axes without specifying a variable to hold the axes as follows (using some data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "indices":idx, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted, (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
duration waiting kind
0 3.600 79 long
1 1.800 54 short
2 3.333 74 long
3 2.283 62 short
4 4.533 85 long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
classes = list(df.class.unique())
for c in classes:
df2 = data.loc[data['class'] == c]
df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()

Plotting two subplots in one figure

I have two PCA plots: one for training data and testing test. Using seaborn, I'd like to combine those two and plot like subplots.
sns.FacetGrid(finalDf_test, hue="L", height=6).map(plt.scatter, 'PC1_test', 'PC2_test').add_legend()
sns.FacetGrid(finalDf_train, hue="L", height=6).map(plt.scatter, 'PC1_train', 'PC2_train').add_legend()
Can someone help on that?
FacetGrid is a figure-level function that creates one or more subplots, depending on its col= and row= parameters. In this case, only one subplot is created.
As FacetGrid works on only one dataframe, you could concatenate your dataframes, introducing a new column to diferentiate test and train. Also, the "PC1" and "PC2" columns of both dataframes should get the same name.
An easier approach is to use matplotlib to create the figure and then call sns.scatterplot(...., ax=...) for each of the subplots.
It would look like:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# create some dummy data
l = np.random.randint(0,2,500)
p1 = np.random.rand(500)*10
p2 = p1 + np.random.randn(500) + l
finalDf_test = pd.DataFrame({'PC1_test': p1[:100], 'PC2_test': p2[:100], 'L':l[:100] })
finalDf_train = pd.DataFrame({'PC1_train': p1[100:], 'PC2_train': p2[100:], 'L':l[100:] })
sns.set()
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6), sharex=True, sharey=True)
sns.scatterplot(data=finalDf_test, x='PC1_test', y='PC2_test', hue='L', ax=ax1)
sns.scatterplot(data=finalDf_train, x='PC1_train', y='PC2_train', hue='L', ax=ax2)
plt.show()
Concatenating the dataframes could look as follows:
sns.set()
finalDf_total = pd.concat({'test': finalDf_test.rename(columns={'PC1_test': 'PC1', 'PC2_test': 'PC2' }),
'train':finalDf_train.rename(columns={'PC1_train': 'PC1', 'PC2_train': 'PC2' })})
finalDf_total.index.rename(['origin', None], inplace=True) # rename the first index column to "origin"
finalDf_total.reset_index(level=0, inplace=True) # convert the first index to a regular column
sns.FacetGrid(finalDf_total, hue='L', height=6, col='origin').map(plt.scatter, 'PC1', 'PC2').add_legend()
plt.show()
The same combined dataframe could also be used for example in lmplot:
sns.lmplot(data=finalDf_total, x='PC1', y='PC2', hue='L', height=6, col='origin')

Plotting grouped data in same plot using Pandas

In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach would be using seaborn module. This would plot the two density estimates on the same axes without specifying a variable to hold the axes as follows (using some data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "indices":idx, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted, (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
duration waiting kind
0 3.600 79 long
1 1.800 54 short
2 3.333 74 long
3 2.283 62 short
4 4.533 85 long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
classes = list(df.class.unique())
for c in classes:
df2 = data.loc[data['class'] == c]
df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()

Categories