I'm currently displaying some data with Seaborn / Pandas and would like to overlay the mean of each category (x=ks2), but I can't figure out how to do this with Seaborn.
I can remove the inner="box", but I want to replace it with a marker for the mean of each category.
Ideally, I'd then link each mean calculated...
Any pointers greatly received.
Cheers
Science.csv has 9k+ entries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set(style="whitegrid", palette="pastel", color_codes=True)
# Load the dataset
# df = pd.read_csv("science.csv") << loaded from csv
df = pd.DataFrame({'ks2': [1, 1, 2,3,3,4],
'science': [40, 50, 34,20,0,44]})
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="ks2", y="science", data=df, split=True,
inner="box",linewidth=2)
sns.despine(left=True)
plt.savefig('plot.png')
Try importing mean from numpy:
from numpy import mean
Then overlay sns.pointplot with estimator=mean:
sns.pointplot(x='ks2', y='science', data=df, estimator=mean)
After that, play with linestyles.
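A minimal sketch of the overlay described above, assuming a small made-up dataset in place of science.csv (three scores per ks2 level, chosen so the group means are easy to check). The pointplot is drawn on the same axes as the violins; the color and linestyle are only illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data standing in for science.csv (assumption, not the real file)
df = pd.DataFrame({
    "ks2":     [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "science": [40, 50, 45, 34, 30, 38, 20, 0, 10, 44, 48, 40],
})

# Violins without the inner box, since the mean marker replaces it
ax = sns.violinplot(x="ks2", y="science", data=df, inner=None, color="lightblue")

# Overlay one marker per category at the mean; pointplot also joins the markers with a line
sns.pointplot(x="ks2", y="science", data=df, estimator=np.mean,
              color="k", linestyles="--", ax=ax)

# Cross-check: the per-category means computed directly with pandas
means = df.groupby("ks2")["science"].mean()
print(means)
```

Because both calls treat ks2 as categorical, the markers line up with the violins. pointplot also draws error bars by default, which can be switched off with the keyword appropriate to the installed seaborn version.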
I have a dataframe and I'm using seaborn pairplot to plot one target column vs the rest of the columns.
Code is below,
import seaborn as sns
import matplotlib.pyplot as plt
tgt_var = 'AB'
var_lst = ['A','GH','DL','GT','MS']
pp = sns.pairplot(data=df,
y_vars=[tgt_var],
x_vars=var_lst)
pp.fig.set_figheight(6)
pp.fig.set_figwidth(20)
The var_lst is not a static list, I just provided an example.
What I need is to plot tgt_var on Y axis and each var_lst on x axis.
I'm able to do this with above code, but I also want to use log scale on X axis only if the var_lst item is 'GH' or 'MS', for the rest normal scale. Is there any way to achieve this?
Iterate pp.axes.flat and set xscale="log" if the xlabel matches "GH" or "MS":
log_columns = ["GH", "MS"]
for ax in pp.axes.flat:
    if ax.get_xlabel() in log_columns:
        ax.set(xscale="log")
Full example with the iris dataset where the petal columns are xscale="log":
import seaborn as sns
df = sns.load_dataset("iris")
pp = sns.pairplot(df)
log_columns = ["petal_length", "petal_width"]
for ax in pp.axes.flat:
    if ax.get_xlabel() in log_columns:
        ax.set(xscale="log")
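The same loop can be sketched for the single-row layout from the question (one target on the y axis, a variable list on the x axis). The DataFrame here is random, strictly positive data invented for illustration, and the column names are simply the ones from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

tgt_var = "AB"
var_lst = ["A", "GH", "DL", "GT", "MS"]  # may be any list; not assumed static
log_columns = {"GH", "MS"}               # columns that should get a log x axis

# Synthetic, strictly positive data (assumption), since a log scale cannot show values <= 0
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 1000, size=(50, len(var_lst) + 1)),
                  columns=[tgt_var] + var_lst)

pp = sns.pairplot(data=df, y_vars=[tgt_var], x_vars=var_lst)

# Each axes carries its variable name as the x label, so match on that
for ax in pp.axes.flat:
    if ax.get_xlabel() in log_columns:
        ax.set_xscale("log")

# Record the resulting scale per variable for inspection
x_scales = {ax.get_xlabel(): ax.get_xscale() for ax in pp.axes.flat}
print(x_scales)
```

Matching on the axis label rather than on a position keeps the loop valid when var_lst changes.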
I would like to change this from a line of regression to a curve. Also to have the line reach either side of the graph. Here is my code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = {'Days': [5, 10, 15, 20],
'Impact': [33.7561, 30.6281, 29.5748, 29.0482]
}
a = pd.DataFrame (data, columns = ['Days','Impact'])
print (a)
ax = sns.barplot(data=a, x='Days', y='Impact', color='lightblue' )
# put bars in background:
for c in ax.patches:
    c.set_zorder(0)
# plot regplot with numbers 0,..,len(a) as x value
ax = sns.regplot(x=np.arange(0,len(a)), y=a['Impact'], marker="+")
sns.despine(offset=10, trim=False)
ax.set_ylabel("")
ax.set_xticklabels(['5', '10','15','20'])
plt.show()
Alternatively, I would prefer to do it in matplotlib as a scatter plot instead of bar chart. Here is an example in excel, but ideally to have the curve extend beyond the outside markers at least a little.
Can anyone help?
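One possible way to get a curve that extends past the outer markers, sketched in plain matplotlib with the data from the question. The second-degree polynomial and the margin of 2.5 days on either side are assumptions made for illustration; another curve family may suit the data better.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

days = np.array([5, 10, 15, 20], dtype=float)
impact = np.array([33.7561, 30.6281, 29.5748, 29.0482])

# Assumption: a quadratic is a reasonable shape for these four points
coeffs = np.polyfit(days, impact, deg=2)

# Evaluate the fit on a range slightly wider than the data so the curve
# extends beyond the first and last markers
margin = 2.5
x_fit = np.linspace(days.min() - margin, days.max() + margin, 200)
y_fit = np.polyval(coeffs, x_fit)

fig, ax = plt.subplots()
ax.scatter(days, impact, marker="+", s=80, label="data")
ax.plot(x_fit, y_fit, color="tab:orange", label="quadratic fit")
ax.set_xticks(days)
ax.set_xlabel("Days")
ax.set_ylabel("Impact")
ax.legend()
```

Plotting against the real Days values, rather than against positions 0..3 as in the bar chart version, keeps the x axis numeric, so the curve can be evaluated anywhere along it.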
Here I am trying to separate the data by whether the passenger is male or not, plotting Age on the x-axis and Fare on the y-axis. I want the legend to show two labels, male and female, each with its respective color. Can anyone help me do this?
Code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv')
df['male']=df['Sex']=='male'
sc1= plt.scatter(df['Age'],df['Fare'],c=df['male'])
plt.legend()
plt.show()
You could use the seaborn library which builds on top of matplotlib to perform the exact task you require. You can scatterplot 'Age' vs 'Fare' and colour code it by 'Sex' by just passing the hue parameter in sns.scatterplot, as follows:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
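# No need to call plt.legend; seaborn generates the labels and legend
# automatically. df is the DataFrame loaded in the question.
# Pass the variables as keywords: recent seaborn versions reject positional x and y.
sns.scatterplot(data=df, x='Age', y='Fare', hue='Sex')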
plt.show()
Seaborn handles the color mapping and the legend for you, so this takes less code than the pure matplotlib version.
You can install seaborn from PyPI using pip install seaborn.
For details, see the seaborn documentation.
The PathCollection.legend_elements method can be used to control how many legend entries are created and how they are labeled.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv')
df['male'] = df['Sex']=='male'
sc1= plt.scatter(df['Age'], df['Fare'], c=df['male'])
# legend_elements() orders its handles by value: False (female) first, then True (male)
plt.legend(handles=sc1.legend_elements()[0], labels=['female', 'male'])
plt.show()
For reference, see the Matplotlib "Legend guide" and the "Scatter plots with a legend" example.
This can be achieved by splitting the data into two separate DataFrames and setting a label for each of them.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv')
subset1 = df[(df['Sex'] == 'male')]
subset2 = df[(df['Sex'] != 'male')]
plt.scatter(subset1['Age'], subset1['Fare'], label = 'Male')
plt.scatter(subset2['Age'], subset2['Fare'], label = 'Female')
plt.legend()
plt.show()
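The two-subset approach generalizes to any number of categories by looping over a groupby, sketched here on a handful of made-up rows rather than the Titanic CSV (the values are invented; only the column names follow the question).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Small made-up sample standing in for the Titanic data (assumption)
df = pd.DataFrame({
    "Age":  [22, 38, 26, 35, 54, 2],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
    "Sex":  ["male", "female", "female", "female", "male", "male"],
})

fig, ax = plt.subplots()

# One scatter call per category; each call gets its own color and legend label
for sex, group in df.groupby("Sex"):
    ax.scatter(group["Age"], group["Fare"], label=sex)

ax.set_xlabel("Age")
ax.set_ylabel("Fare")
legend = ax.legend(title="Sex")

# The legend entries follow the sorted group keys
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

Since each label comes straight from the group key, the legend text cannot drift out of step with the colors.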
I have the following dataset, code and plot:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
data = [['tom', 10,15], ['matt', 13,10]]
df3 = pd.DataFrame(data, columns = ['Name', 'Attempts','L4AverageAttempts'])
f,ax = plt.subplots(nrows=1,figsize=(16,9))
sns.barplot(x='Attempts',y='Name',data=df3)
plt.show()
How can I get a marker of some kind (dot, *, shape, etc.) to show that tom has averaged 15 (so he is below his average) and matt has averaged 10 (so he is above his average)? That is, a marker based on the L4AverageAttempts value for each person.
I have looked into axvline but that seems to be only a set number rather than a specific value for each y axis category. Any help would be much appreciated! thanks!
You can simply plot a scatter plot on top of your bar plot, using L4AverageAttempts as the x value. seaborn.scatterplot works for this; make sure to set the zorder parameter so that the markers appear on top of the bars.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
data = [['tom', 10,15], ['matt', 13,10]]
df3 = pd.DataFrame(data, columns = ['Name', 'Attempts','L4AverageAttempts'])
f,ax = plt.subplots(nrows=1,figsize=(16,9))
sns.barplot(x='Attempts',y='Name',data=df3)
sns.scatterplot(x='L4AverageAttempts', y="Name", data=df3, zorder=10, color='k', edgecolor='k')
plt.show()
I'm plotting time series data using seaborn lineplot (https://seaborn.pydata.org/generated/seaborn.lineplot.html), and plotting the median instead of mean. Example code:
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator = np.median, data=fmri)
I want the error bands to show the interquartile range as opposed to the confidence interval. I know I can use ci = "sd" for standard deviation, but is there a simple way to add the IQR instead? I cannot figure it out.
Thank you!
I don't know if this can be done with seaborn alone, but here's one way to do it with matplotlib while keeping the seaborn style. The describe() method conveniently provides summary statistics for a DataFrame, including the quartiles, which we can use to plot the medians with interquartile ranges.
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
fmri_stats = fmri.groupby(['timepoint']).describe()
x = fmri_stats.index
medians = fmri_stats[('signal', '50%')]
medians.name = 'signal'
quartiles1 = fmri_stats[('signal', '25%')]
quartiles3 = fmri_stats[('signal', '75%')]
# Pass x and y as keywords: recent seaborn versions reject positional data arguments
ax = sns.lineplot(x=x, y=medians)
ax.fill_between(x, quartiles1, quartiles3, alpha=0.3);
You can calculate the median within lineplot as you have done, set ci to None, and fill in the band using ax.fill_between(). (From seaborn 0.12 on, ci is deprecated in favor of errorbar, so use errorbar=None there.)
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator = np.median,
data=fmri,ci=None)
bounds = fmri.groupby('timepoint')['signal'].quantile((0.25,0.75)).unstack()
ax.fill_between(x=bounds.index,y1=bounds.iloc[:,0],y2=bounds.iloc[:,1],alpha=0.1)
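Both answers above come down to computing per-timepoint quartiles and shading between them. A minimal sketch of that step using only pandas and matplotlib, on synthetic data invented for illustration (a noisy sine with 30 samples per timepoint, not the fmri dataset):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the fmri data (assumption)
rng = np.random.default_rng(0)
timepoints = np.repeat(np.arange(10), 30)
df = pd.DataFrame({
    "timepoint": timepoints,
    "signal": np.sin(timepoints / 2) + rng.normal(scale=0.3, size=timepoints.size),
})

# One row per timepoint, one column per quantile (0.25, 0.5, 0.75)
stats = df.groupby("timepoint")["signal"].quantile([0.25, 0.5, 0.75]).unstack()

fig, ax = plt.subplots()
ax.plot(stats.index, stats[0.5], label="median")
# Shade the interquartile range: between the 25th and 75th percentiles
ax.fill_between(stats.index, stats[0.25], stats[0.75], alpha=0.3, label="IQR")
ax.set_xlabel("timepoint")
ax.set_ylabel("signal")
ax.legend()
```

Computing the median and both quartiles in a single quantile call keeps them on the same index, so the line and the band are guaranteed to align.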
Since version 0.12 of seaborn, this is possible directly through the errorbar parameter of lineplot; see the seaborn documentation.
pip install --upgrade seaborn
The estimator parameter takes the name of a pandas method (such as 'median' or 'mean') or a callable, and determines the central value that is plotted.
The errorbar parameter controls how the spread is drawn and accepts a string, a (string, number) tuple, or a callable. To draw the median and fill the area between the first and third quartiles, you would need these parameters:
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(data=fmri, x="timepoint", y="signal", estimator=np.median,
errorbar=lambda x: (np.quantile(x, 0.25), np.quantile(x, 0.75)))
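You can now! Pass these to lineplot:
estimator="median", errorbar=("pi", 50)
Here ("pi", 50) is a percentile interval covering the middle 50% of the data, i.e. the interquartile range.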
https://seaborn.pydata.org/tutorial/error_bars