Plot all categories in boxplot or violinplot - python

I have a matplotlib figure with a few violinplots on it (although this question would apply to any similar plot or other dataframe situation, not just violinplots). I currently run my code and it spits out the figure, with one violinplot per category. The code looks something like the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(data=np.random.randint(low=0, high=1001, size=(100, 1)),
                  columns=['row0'])
df['r0_range'] = 'temp'  # create a new column 'r0_range', give it a preliminary value
# make assignments depending on value of row0
df.loc[df['row0'] <= 250, 'r0_range'] = '[0,250]'
df.loc[df['row0'] > 250, 'r0_range'] = '(250,500]'
df.loc[df['row0'] > 500, 'r0_range'] = '(500,750]'
df.loc[df['row0'] > 750, 'r0_range'] = '(750,1000]'
fig1, ax1 = plt.subplots(1,1)
ax1 = sns.violinplot(data=df, x='r0_range', y='row0', inner=None, ax=ax1)
Which pops out the following:
I want to include on my figure a fifth violinplot that represents all of the data in all of the categories. Is there an elegant way to do that without having to copy the row0 data into new rows of the dataframe?

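Perhaps something like this will do what you are looking for:
df = pd.DataFrame(data=np.random.randint(0, 1001, 100), columns=['row0'])
g = df.groupby(pd.cut(df['row0'], [0, 250, 500, 750, 1000]))
# add one column per bin; rows outside the bin stay NaN
for name, data in g.groups.items():
    df[str(name)] = df.loc[data, 'row0']
# wide-form plot: one violin for 'row0' (all data) plus one per bin column
# ax1 is the Axes created in the question
sns.violinplot(data=df, inner=None, ax=ax1)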
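If a long-form frame is acceptable, another option is to append a relabelled copy of the data under an extra category. This is only a sketch under that assumption: it does copy the row0 values once (which the question hoped to avoid), and the label "all", the seeded generator and the name long_df are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

# Dummy data, as in the question (seeded so the example is repeatable)
rng = np.random.default_rng(0)
df = pd.DataFrame({"row0": rng.integers(0, 1001, size=100)})

# Bin row0 into the four ranges; include_lowest=True keeps the value 0
bins = [0, 250, 500, 750, 1000]
labels = ["[0,250]", "(250,500]", "(500,750]", "(750,1000]"]
df["r0_range"] = pd.cut(df["row0"], bins=bins, labels=labels,
                        include_lowest=True).astype(str)

# Append a second copy of every row under the hypothetical category "all"
long_df = pd.concat([df, df.assign(r0_range="all")], ignore_index=True)
```

long_df can then be passed to sns.violinplot(data=long_df, x='r0_range', y='row0', inner=None, ax=ax1), optionally with order=labels + ['all'] to fix the left-to-right order.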

Improving time series subplots with Matplotlib Python

I am trying to make subplots from multiple columns of a pandas dataframe. The following code more or less works, but I would like to improve it by moving all the legends outside the plots (to the right) and adding the est_fmc variable to each plot.
L = new_df_honeysuckle[["Avg_1h_srf_mc", "Avg_1h_prof_mc", "Avg_10h_fuel_stick", "Avg_100h_debri_mc", "Avg_Daviesia_mc",
"Avg_Euclaypt_mc", "obs_fmc_average", "obs_fmc_max", "est_fmc"]].resample("1M").mean().interpolate().plot(figsize=(10,15),
subplots=True, linewidth = 3, yticks = (0, 50, 100, 150, 200))
plt.legend(loc='center left', markerscale=6, bbox_to_anchor=(1, 0.4))
Any help is highly appreciated.
Since the plotting function of pandas does not allow for fine control, it is easiest to create the subplots with matplotlib and fill them in a loop. It was unclear whether you wanted to add the 'est_fmc' line or annotate it, so I added the line. For annotations, see this.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
import numpy as np
import itertools
columns = ["Avg_1h_srf_mc", "Avg_1h_prof_mc", "Avg_10h_fuel_stick", "Avg_100h_debri_mc", "Avg_Daviesia_mc", "Avg_Euclaypt_mc", "obs_fmc_average", "obs_fmc_max", 'est_fmc']
# dummy data: 37 monthly timestamps with random values for every column
date_rng = pd.date_range('2017-01-01', periods=37, freq='MS')
df = pd.DataFrame({'date': pd.to_datetime(date_rng)})
for col in columns:
    tmp = np.random.randint(0, 200, (37,))
    df = pd.concat([df, pd.Series(tmp, name=col, index=df.index)], axis=1)
# one subplot per column except 'est_fmc', which is drawn on every subplot
fig, axs = plt.subplots(len(columns[:-1]), 1, figsize=(10, 15), sharex=True)
fig.subplots_adjust(hspace=0.5)
colors = mcolors.TABLEAU_COLORS
for i, (col, cname) in enumerate(zip(columns[:-1], itertools.islice(colors.keys(), 9))):
    axs[i].plot(df['date'], df[col], label=col, color=cname)
    axs[i].plot(df['date'], df['est_fmc'], label='est_fmc', color='tab:olive')
    axs[i].set_yticks([0, 50, 100, 150, 200])
    axs[i].grid()
    # place the legend outside the axes, to the right
    axs[i].legend(loc='upper left', bbox_to_anchor=(1.02, 1.0))
plt.show()
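The question's resample("1M").mean().interpolate() step can be applied to the real data before the plotting loop. A minimal sketch on made-up daily data follows; the frame raw, its values, and the month-start alias "MS" (used here instead of "1M" because the month-end aliases differ between pandas versions) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily observations covering January to April 2017
idx = pd.date_range("2017-01-01", periods=120, freq="D")
raw = pd.DataFrame({"obs_fmc_average": np.arange(120, dtype=float)}, index=idx)

# Pretend no measurements were taken in February
raw.loc[raw.index.month == 2, "obs_fmc_average"] = np.nan

# Monthly means; the empty month comes out as NaN and is then filled linearly
monthly = raw.resample("MS").mean().interpolate()
```

The resulting monthly frame has one row per month and no gaps, so its index and columns can be plotted in the loop in place of the dummy 'date' column and random values.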

Plotting two subplots in one figure

I have two PCA plots: one for the training data and one for the test data. Using seaborn, I'd like to combine the two and plot them as subplots.
sns.FacetGrid(finalDf_test, hue="L", height=6).map(plt.scatter, 'PC1_test', 'PC2_test').add_legend()
sns.FacetGrid(finalDf_train, hue="L", height=6).map(plt.scatter, 'PC1_train', 'PC2_train').add_legend()
Can someone help with that?
FacetGrid is a figure-level function that creates one or more subplots, depending on its col= and row= parameters. In this case, only one subplot is created.
As FacetGrid works on only one dataframe, you could concatenate your dataframes, introducing a new column to differentiate test and train. Also, the "PC1" and "PC2" columns of both dataframes should get the same name.
An easier approach is to use matplotlib to create the figure and then call sns.scatterplot(...., ax=...) for each of the subplots.
It would look like:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# create some dummy data
l = np.random.randint(0,2,500)
p1 = np.random.rand(500)*10
p2 = p1 + np.random.randn(500) + l
finalDf_test = pd.DataFrame({'PC1_test': p1[:100], 'PC2_test': p2[:100], 'L':l[:100] })
finalDf_train = pd.DataFrame({'PC1_train': p1[100:], 'PC2_train': p2[100:], 'L':l[100:] })
sns.set()
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6), sharex=True, sharey=True)
sns.scatterplot(data=finalDf_test, x='PC1_test', y='PC2_test', hue='L', ax=ax1)
sns.scatterplot(data=finalDf_train, x='PC1_train', y='PC2_train', hue='L', ax=ax2)
plt.show()
Concatenating the dataframes could look as follows:
sns.set()
finalDf_total = pd.concat({'test': finalDf_test.rename(columns={'PC1_test': 'PC1', 'PC2_test': 'PC2' }),
'train':finalDf_train.rename(columns={'PC1_train': 'PC1', 'PC2_train': 'PC2' })})
finalDf_total.index.rename(['origin', None], inplace=True) # rename the first index column to "origin"
finalDf_total.reset_index(level=0, inplace=True) # convert the first index to a regular column
sns.FacetGrid(finalDf_total, hue='L', height=6, col='origin').map(plt.scatter, 'PC1', 'PC2').add_legend()
plt.show()
The same combined dataframe could also be used for example in lmplot:
sns.lmplot(data=finalDf_total, x='PC1', y='PC2', hue='L', height=6, col='origin')
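As a variation on the concatenation step, the 'origin' column can also be added with assign before concatenating, which avoids the index renaming. This is just a sketch with made-up data; the intermediate names test and train and the seeded generator are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Dummy frames with the same column layout as in the question
rng = np.random.default_rng(0)
finalDf_test = pd.DataFrame({"PC1_test": rng.random(100),
                             "PC2_test": rng.random(100),
                             "L": rng.integers(0, 2, 100)})
finalDf_train = pd.DataFrame({"PC1_train": rng.random(400),
                              "PC2_train": rng.random(400),
                              "L": rng.integers(0, 2, 400)})

# Give both frames the same column names and tag each row with its origin
test = finalDf_test.rename(columns={"PC1_test": "PC1", "PC2_test": "PC2"}).assign(origin="test")
train = finalDf_train.rename(columns={"PC1_train": "PC1", "PC2_train": "PC2"}).assign(origin="train")

# Stack them into one long frame usable with col='origin'
finalDf_total = pd.concat([test, train], ignore_index=True)
```

The resulting finalDf_total has the columns PC1, PC2, L and origin, so it can be used with FacetGrid or lmplot and col='origin' in the same way.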

Modifying Seaborn Violinplot to show means not median

I am currently displaying some data with Seaborn / Pandas. I'm looking to overlay the mean of each category (x=ks2), but can't figure out how to do this with Seaborn.
I can remove the inner="box", but I want to replace it with a marker for the mean of each category.
Ideally, the calculated means would then be linked to each other...
Any pointers greatly received.
Cheers
Science.csv has 9k+ entries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set(style="whitegrid", palette="pastel", color_codes=True)
# Load the dataset
# df = pd.read_csv("science.csv") << loaded from csv
df = pd.DataFrame({'ks2': [1, 1, 2,3,3,4],
'science': [40, 50, 34,20,0,44]})
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="ks2", y="science", data=df, split=True,
inner="box",linewidth=2)
sns.despine(left=True)
plt.savefig('plot.png')
Try this. First import mean from NumPy:
from numpy import mean
Then overlay sns.pointplot with estimator=mean:
sns.pointplot(x='ks2', y='science', data=df, estimator=mean)
Then play with the linestyles.
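The same overlay can be sketched without pointplot: compute the per-category means with pandas and draw them as a connected line over the violins. Note that this version uses matplotlib's own Axes.violinplot rather than seaborn, and the sample data (three values per category) is made up, so treat it as an illustration of the idea rather than a drop-in replacement.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with a few values per ks2 category
df = pd.DataFrame({"ks2": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   "science": [40, 50, 45, 34, 30, 38, 20, 0, 10, 44, 48, 40]})

# groupby sorts the keys, so means and groups share the same order
means = df.groupby("ks2")["science"].mean()
groups = [values.to_numpy() for _, values in df.groupby("ks2")["science"]]
positions = list(range(len(groups)))

fig, ax = plt.subplots()
ax.violinplot(groups, positions=positions, showextrema=False)
# one marker per category mean, joined by a line
ax.plot(positions, means.to_numpy(), marker="o", color="k", label="mean")
ax.set_xticks(positions)
ax.set_xticklabels([str(k) for k in means.index])
ax.set_xlabel("ks2")
ax.set_ylabel("science")
ax.legend()
```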

Seaborn countplot with second axis with ordered data

I am trying to create a countplot with a lineplot over it as practice for some data visualisation I will be doing at work. I am looking at the Kickstarter data on Kaggle.
I run a countplot with a hue on the state of the project (successful, failed, canceled) and both of these are ordered
filter_list = ['failed', 'successful', 'canceled']
df2 = df[df.state.isin(filter_list)]
fig = plt.gcf()
fig.set_size_inches( 16, 10)
sns.countplot(x='main_category', hue='state', data=df2, order = df2['main_category'].value_counts().index,
hue_order = df2['state'].value_counts().index)
This comes out as follows:
I then create my second axis and add a lineplot
fig, ax = plt.subplots()
fig.set_size_inches( 16, 10)
ax = sns.countplot(x='main_category', hue='state', data=df, ax=ax, order = df2['main_category'].value_counts().index,
hue_order = df2['state'].value_counts().index)
ax2 = ax.twinx()
sns.lineplot(x='main_category', y='backers', data=df2, ax =ax2)
But this changes the column labels as seen below:
It appears that the data is the same; it's just that the order of the columns is different. I am still learning, so my code may be inefficient or partly redundant, but any help would be appreciated. The only other thing is how df is created, which is as follows:
import pandas as pd
import numpy as np
import seaborn as sns; sns.set(style="white", color_codes=True)
import matplotlib.pyplot as plt
df = pd.read_csv('ks.csv')
df = df.drop(['ID'], axis = 1)
df.head()
I don't think lineplot is what you are looking for. lineplot is supposed to be used with numeric data, not categorical. I'm even surprised this worked at all.
I think you are looking for pointplot instead
filter_list = ['failed', 'successful', 'canceled']
df2 = df[df.state.isin(filter_list)]
order = df2['main_category'].value_counts().index
fig = plt.figure()
ax1 = sns.countplot(x='main_category', hue='state', data=df2, order=order,
hue_order=filter_list)
ax2 = ax1.twinx()
sns.pointplot(x='main_category', y='backers', data=df2, ax=ax2, order=order)
Note that, used like that, pointplot will show the average number of backers within each category. If that's not what you want, you can pass another aggregation function using the estimator= parameter, e.g.:
sns.pointplot(x='main_category', y='backers', data=df2, ax=ax2, order=order, estimator=np.sum)
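To check which values the second axis will show, the aggregation can be reproduced with pandas and put in the same order as the bars. A small sketch with made-up rows follows (the category names and backer counts are invented; only the column names come from the question):

```python
import pandas as pd

# Made-up stand-in for the filtered Kickstarter frame df2
df2 = pd.DataFrame({
    "main_category": ["Film", "Film", "Film", "Music", "Music", "Games"],
    "state": ["failed", "successful", "failed", "successful", "canceled", "failed"],
    "backers": [10, 20, 30, 5, 15, 100],
})

# Same ordering as used for the countplot: most frequent category first
order = df2["main_category"].value_counts().index

# What pointplot draws by default (mean) and with estimator=np.sum (sum),
# reindexed so the values line up with the bar order
per_category = df2.groupby("main_category")["backers"]
mean_backers = per_category.mean().reindex(order)
total_backers = per_category.sum().reindex(order)
```

Passing the same order to both countplot and pointplot is what keeps the x tick labels of the two axes aligned.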

Subplot of Subplots Matplotlib / Seaborn

I am trying to create a grid of subplots. Each subplot will look like the one on this site:
https://python-graph-gallery.com/24-histogram-with-a-boxplot-on-top-seaborn/
If I have 10 different sets of this style of plot, I want to arrange them in a 5x2 grid, for example.
I have read through the Matplotlib documentation and cannot seem to figure out how to do it. I can loop over the subplots and output each one, but I cannot arrange them into rows and columns.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)),columns=list('ABCDEFGHIJ'))
for c in df:
    # Cut the window in 2 parts
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
                                        gridspec_kw={"height_ratios": (.15, .85)},
                                        figsize=(10, 10))
    # Add a graph in each part
    sns.boxplot(x=df[c], ax=ax_box)
    ax_hist.hist(df[c])
    # Remove x axis name for the boxplot
    ax_box.set(xlabel='')
    plt.show()
The result should take the plots from this loop and arrange them in a grid of rows and columns, in this case 5x2.
You have 10 columns, each of which creates 2 subplots: a box plot and a histogram. So you need a total of 20 subplots. You can do this by creating a grid of 2 rows and 10 columns.
Complete answer (adjust the figsize and height_ratios to taste):
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
f, axes = plt.subplots(2, 10, sharex=True, gridspec_kw={"height_ratios": (.35, .35)},
                       figsize=(12, 5))
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)),columns=list('ABCDEFGHIJ'))
for i, c in enumerate(df):
    # box plot in the top row, histogram of the same column in the bottom row
    sns.boxplot(x=df[c], ax=axes[0, i])
    axes[1, i].hist(df[c])
plt.tight_layout()
plt.show()
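If the 5x2 arrangement from the question is wanted, with each cell holding its own box plot above a histogram, nested grids are one way to get there. The following is a sketch only: it uses matplotlib's Figure.add_gridspec and SubplotSpec.subgridspec, draws the box plot with plain matplotlib instead of seaborn, and the figure size and spacing values are arbitrary assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)),
                  columns=list("ABCDEFGHIJ"))

fig = plt.figure(figsize=(10, 16))
# Outer 5x2 grid: one cell per dataframe column
outer = fig.add_gridspec(5, 2, hspace=0.4, wspace=0.2)

for i, c in enumerate(df.columns):
    row, col = divmod(i, 2)
    # Inner 2x1 grid inside the cell: thin box plot on top, histogram below
    inner = outer[row, col].subgridspec(2, 1, height_ratios=[0.15, 0.85],
                                        hspace=0.05)
    ax_box = fig.add_subplot(inner[0])
    ax_hist = fig.add_subplot(inner[1], sharex=ax_box)
    ax_box.boxplot(df[c].to_numpy(), vert=False)
    ax_hist.hist(df[c].to_numpy())
    ax_box.set_title(c)
    ax_box.set_yticks([])
    ax_box.tick_params(labelbottom=False)  # x labels only on the histogram
```

Each box plot shares its x axis only with the histogram directly below it, so the ten pairs keep independent ranges.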
