Using Seaborn Catplot scatterplot creates a numerically unordered y-axis - python

Using this dataset, I tried to make a categorical scatterplot with the two types (WES + RNA-Seq and WES) on the x-axis and a small set of evenly spaced, numerically ordered values on the y-axis. I got the x-axis the way I intended, but the y-axis shows every single value in the Mutation Count column as its own tick. On top of that, the ticks follow the order in which the values appear in the dataset (descending), so the axis isn't numerically ordered either. How can I fix these aspects of the graph?
The code and its output are shown below:
import seaborn as sns
g = sns.catplot(x="Sequencing Type", y="Mutation Count", hue="Sequencing Type", data=tets, height=16, aspect=0.8)
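One likely cause, assuming the Mutation Count column was read in as strings rather than numbers, is that seaborn treats each distinct string as its own category and lists them in order of appearance. The sketch below uses a small made-up DataFrame in place of the asker's `tets` (the values are hypothetical) and converts the column with `pandas.to_numeric` before plotting:

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the asker's DataFrame `tets`; the counts are
# stored as strings, as can happen after reading a file with mixed content.
tets = pd.DataFrame({
    "Sequencing Type": ["WES", "WES + RNA-Seq", "WES", "WES + RNA-Seq", "WES"],
    "Mutation Count": ["410", "120", "57", "35", "8"],
})

# Convert the strings to numbers so the y-axis becomes a continuous scale.
# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
tets["Mutation Count"] = pd.to_numeric(tets["Mutation Count"], errors="coerce")

g = sns.catplot(x="Sequencing Type", y="Mutation Count",
                hue="Sequencing Type", data=tets, height=6, aspect=0.8)
```

Checking `tets.dtypes` before plotting shows whether the conversion is needed at all; if the column is already `int64` or `float64`, the cause lies elsewhere.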

Related

Seaborn countplot not displaying correct frequencies

I am trying to create a simple count plot using Seaborn of the frequencies of 3 different categories. The plot I get shows completely incorrect counts.
I am using the count plot method:
sns.countplot(data = r.reset_index(), x = 'cat')
gives me
This is the DataFrame I'm trying to plot:
        count
high       38
low        64
medium     30
I want the graph to display the correct counts for each category: high, medium, and low.
A countplot is going to count each occurrence of your x variable -- in this case, one observation per level.
From the API page for countplot:
Show the counts of observations in each categorical bin using bars.
A count plot can be thought of as a histogram across a categorical,
instead of quantitative, variable. The basic API and options are
identical to those for barplot(), so you can compare counts across
nested variables.
You want a simple barplot:
sns.barplot(data=r.reset_index(), x='cat', y='count')
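A self-contained sketch of the barplot approach, rebuilding the question's table by hand. The index name `cat` is an assumption inferred from the asker's `countplot` call; if the index is unnamed, `reset_index()` produces a column called `index` instead.

```python
import pandas as pd
import seaborn as sns

# Pre-aggregated counts from the question. The index name "cat" is assumed.
r = pd.DataFrame(
    {"count": [38, 64, 30]},
    index=pd.Index(["high", "low", "medium"], name="cat"),
)

tidy = r.reset_index()  # columns: "cat" and "count"

# barplot draws the stored values directly, one bar per category,
# whereas countplot would count rows (one row per category here).
ax = sns.barplot(data=tidy, x="cat", y="count")
```

Because each category has exactly one row, the bar heights equal the stored counts.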

Fixing axis spacing (ticks) in Bokeh scatter plots

I'm generating scatter plots with Bokeh with differing numbers of Y values for each X value. When Bokeh generates the plot, it automatically pads the x-axis spacing based on the number of values plotted. I would like all values on the x-axis to be spaced evenly, regardless of the number of individual data points. I've looked into manually setting the ticks, but it looks like I would have to set the spacing myself with that approach (i.e., specify the exact positions). I would like the spacing to be set evenly and automatically, as it is when plotting single x,y value pairs. Can this be done?
Here is an example showing the behavior.
import pandas
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
days =['Mon','Mon','Mon', 'Tues', 'Tues', 'Weds','Weds','Weds','Weds']
vals = [1,3,5,2,3,6,3,2,4]
df = pandas.DataFrame({'Day': days, 'Values':vals})
source = ColumnDataSource(df)
p = figure(x_range=df['Day'].tolist())
p.circle(x='Day', y='Values', source=source)
show(p)
You are passing a list of strings as the range. This creates a categorical axis. However, the list of categories for the range is expected to be unique, with no duplicates. You are passing a list with duplicate values. This is actually invalid usage, and the result is undefined behavior. You should pass a unique list of categorical factors, in the order you want them to appear, for the range.
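As a minimal sketch of building such a list from the question's data, `dict.fromkeys` removes duplicates while keeping first-appearance order. The variable name `factors` is only illustrative.

```python
days = ['Mon', 'Mon', 'Mon', 'Tues', 'Tues', 'Weds', 'Weds', 'Weds', 'Weds']

# dict.fromkeys keeps only the first occurrence of each key, in insertion
# order, so duplicates are dropped without reordering the days.
factors = list(dict.fromkeys(days))
print(factors)  # ['Mon', 'Tues', 'Weds']

# In the question's code, the range would then be created with:
#     p = figure(x_range=factors)
```

A plain `set(days)` would also remove duplicates, but it does not guarantee the order of the categories on the axis.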

How do I create a multiline plot using seaborn?

I am trying out Seaborn to make my plot visually better than matplotlib. I have a dataset which has a column 'Year' which I want to plot on the X-axis and 4 Columns say A,B,C,D on the Y-axis using different coloured lines. I was trying to do this using the sns.lineplot method but it allows for only one variable on the X-axis and one on the Y-axis. I tried doing this
sns.lineplot(data_preproc['Year'],data_preproc['A'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['B'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['C'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['D'], err_style=None)
But this way I don't get a legend in the plot to show which coloured line corresponds to what. I tried checking the documentation but couldn't find a proper way to do this.
Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from its "wide format" (one column per measurement type) into long format (one column for all measurement values, one column to indicate the type) is pandas.melt. Given a data_preproc structured like yours, filled with random values:
import numpy as np
import pandas as pd
import seaborn as sns

num_rows = 20
years = list(range(1990, 1990 + num_rows))
data_preproc = pd.DataFrame({
    'Year': years,
    'A': np.random.randn(num_rows).cumsum(),
    'B': np.random.randn(num_rows).cumsum(),
    'C': np.random.randn(num_rows).cumsum(),
    'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with
sns.lineplot(x='Year', y='value', hue='variable',
data=pd.melt(data_preproc, ['Year']))
(Note that 'value' and 'variable' are the default column names returned by melt, and can be adapted to your liking.)
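To make the reshaping concrete, here is a small sketch of what `pandas.melt` returns for a two-row frame, using custom column names in place of the defaults. The names `series` and `reading` and the numbers are made up for illustration.

```python
import pandas as pd

# Tiny wide-format frame: one column per measurement type.
wide = pd.DataFrame({
    "Year": [1990, 1991],
    "A": [0.1, 0.2],
    "B": [1.0, 1.1],
})

# Long format: one row per (Year, measurement type) pair.
# var_name and value_name replace the default 'variable' and 'value'.
long_df = pd.melt(wide, id_vars=["Year"], var_name="series", value_name="reading")
print(long_df)
```

The result has four rows (two years times two measurement columns) and the columns `Year`, `series`, and `reading`, which map onto `x`, `hue`, and `y` in `sns.lineplot`.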
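This:
sns.lineplot(data=data_preproc.set_index('Year'))
will do what you want. With wide-form data, seaborn uses the index for the x-axis and draws one labelled line per column, so Year has to be the index rather than a column.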
See the documentation:
sns.lineplot(x="Year", y="signal", hue="label", data=data_preproc)
You probably need to re-organize your dataframe in a suitable way so that there is one column for the x data, one for the y data, and one which holds the label for the data point.
You can also just use matplotlib.pyplot. If you import seaborn, much of the improved design is also used for "regular" matplotlib plots. Seaborn is really "just" a collection of methods which conveniently feed data and plot parameters to matplotlib.

Discrepancy between Seaborn plotted mean and calculated mean. (Python/Pandas)

Hello there wonderful people of StackOverflow!
I have been getting to grips with Python and was starting to feel pretty confident that I knew what I was doing until this doozy came up:
I am plotting and comparing two subselections of a dataframe where "Type" = "area" and "angle". Seaborn plots a boxplot of these and marks the mean, but when I calculate the mean using .mean() it gives a different answer. Here's the code:
plotdata = df[df['Type'].isin(['A','B'])]
g = sns.violinplot(x="Type", y="value", data=plotdata, inner="quartile")
plt.ylim(-4, 4) # This is to zoom on the plot to make the 0 line clearer
This is the resulting plot, note how the means are ~-0.1 and ~1.5
But when I calculate them with:
print(df_long[df_long['charttype'].isin(['area'])]['error'].mean())
print(df_long[df_long['charttype'].isin(['angle'])]['error'].mean())
It returns:
0.014542483333332705
-2.024809368191722
So my question is, why don't these numbers match?
Total misunderstanding of basic statistics was the problem!
Box plots (which are inside the seaborn violin plots) plot the interquartile range and the MEDIAN, whereas I was later calculating the MEAN.
Just needed to sleep on it and hey presto all becomes clear.
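A small sketch with made-up numbers shows how far the two statistics can drift apart when a distribution has a long tail, which is why the line drawn inside the plot need not match `.mean()`:

```python
import pandas as pd

# Hypothetical errors: mostly near zero, with one large outlier.
errors = pd.Series([-2, -1, 0, 1, 2, 30])

print(errors.median())  # 0.5, the middle of the sorted values
print(errors.mean())    # 5.0, pulled upward by the outlier
```

Comparing `.median()` against the plot, rather than `.mean()`, is the like-for-like check.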

Compare 1 independent vs many dependent variables using seaborn pairplot in an horizontal plot

The pairplot function from seaborn lets you plot pairwise relationships in a dataset.
According to the documentation (highlight added):
By default, this function will create a grid of Axes such that each variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.
It is also possible to show a subset of variables or plot different variables on the rows and columns.
I could find only one example of subsetting different variables for rows and columns, here (it's the 6th plot under the Plotting pairwise relationships with PairGrid and pairplot() section). As you can see, it's plotting many independent variables (x_vars) against the same single dependent variable (y_vars) and the results are pretty nice.
I'm trying to do the same plotting a single independent variable against many dependent ones.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
ages = np.random.gamma(6,3, size=50)
data = pd.DataFrame({"age": ages,
"weight": 80*ages**2/(ages**2+10**2)*np.random.normal(1,0.2,size=ages.shape),
"height": 1.80*ages**5/(ages**5+12**5)*np.random.normal(1,0.2,size=ages.shape),
"happiness": (1-ages*0.01*np.random.normal(1,0.3,size=ages.shape))})
pp = sns.pairplot(data=data,
x_vars=['age'],
y_vars=['weight', 'height', 'happiness'])
The problem is that the subplots get arranged vertically, and I couldn't find a way to change it.
I know the tiling would then be less neat, since the y-axis would have to be labeled on every subplot. I also know I could build the plots by hand with something like this:
fig, axes = plt.subplots(ncols=3)
for i, yvar in enumerate(['weight', 'height', 'happiness']):
    axes[i].scatter(data['age'], data[yvar])
Still, I'm learning to use seaborn and I find its interface very convenient, so I wonder if there's a way. Also, this example is pretty easy, but for more complex datasets seaborn handles many more things for you that would quickly make the raw-matplotlib approach much more complex (hue, to start).
You can achieve what it seems you are looking for by swapping the variable names passed to the x_vars and y_vars parameters. So revisiting the sns.pairplot portion of your code:
pp = sns.pairplot(data=data,
y_vars=['age'],
x_vars=['weight', 'height', 'happiness'])
Note that all I've done here is swap x_vars for y_vars. The plots should now be displayed horizontally:
The x-axis will now be unique to each plot with a common y-axis determined by the age column.
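If age should stay on the x-axis, one alternative (a different technique from the swap above, not part of the original answer) is to melt the data to long format and facet by column with `sns.relplot`. This is a sketch under the assumption that independent y-axes per panel are acceptable; the names `measure` and `reading` are made up.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Same synthetic data as in the question, with a seeded generator.
rng = np.random.default_rng(0)
ages = rng.gamma(6, 3, size=50)
data = pd.DataFrame({
    "age": ages,
    "weight": 80 * ages**2 / (ages**2 + 10**2) * rng.normal(1, 0.2, size=ages.shape),
    "height": 1.80 * ages**5 / (ages**5 + 12**5) * rng.normal(1, 0.2, size=ages.shape),
    "happiness": 1 - ages * 0.01 * rng.normal(1, 0.3, size=ages.shape),
})

# One row per (age, measure) pair, so each dependent variable can be a facet.
long_df = data.melt(id_vars="age", var_name="measure", value_name="reading")

# One column of the grid per dependent variable; sharey=False gives each
# panel its own y-scale, since the three measures have different units.
g = sns.relplot(data=long_df, x="age", y="reading", col="measure",
                kind="scatter", facet_kws={"sharey": False})
```

The grid has one row and three columns, and `hue` can still be passed to `relplot` as with other seaborn functions.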
