ggplot summarise mean value of categorical variable on y axis - python

I am trying to replicate in R a Python plot that I found in this Kaggle notebook: Titanic Data Science Solutions
This is the Python code that generates the plot; the dataset used can be found here:
import seaborn as sns
...
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
Here is the resulting plot.
The Survived column takes the values 0 and 1 (did not survive or survived), and the y-axis displays its mean per Pclass. When searching for a way to calculate the mean using ggplot2, I usually find the stat_summary() function. The best I could do was this:
library(dplyr)
library(ggplot2)
...
train_df %>%
ggplot(aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
stat_summary(fun.y = mean, geom = "line") +
facet_grid(Embarked ~ .)
The output can be found here.
There are some issues:
There seems to be an empty facet, maybe from NA's in Embarked?
The points don't align with the line
The lines are different than those in the Python plot
I think I also haven't fully grasped the layering concept of ggplot. I would like to separate the geom = "line" in the stat_summary() function and rather add it as a + geom_line().

There is actually an empty level (i.e. "") in train_df$Embarked. You can filter that out before plotting.
train_df <- read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
train_df <- subset(train_df, Embarked != "")
ggplot(train_df, aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
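stat_summary(fun.data = 'mean_cl_boot') +  # bootstrapped CI of the mean; needs the Hmisc package
geom_line(stat = 'summary', fun = mean) +  # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)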
facet_grid(Embarked ~ .)
You can replicate the Python plot by drawing confidence intervals using stat_summary. Although your lines with stat_summary were great, I've rewritten that layer as a geom_line call, as you asked.
Note that your ggplot code doesn't draw any points, so I can't answer that part, but you were probably drawing the raw values, which are just many 0s and 1s.
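To check the numbers behind either plot, the summary can be reproduced by hand. The following is a minimal Python sketch, not the seaborn or ggplot2 implementation: the rows are made up (they are not the real Titanic data), and group_means and bootstrap_ci are hypothetical helper names. It drops rows with an empty Embarked value, averages the 0/1 Survived column per group, and estimates a percentile bootstrap interval, which is similar in spirit to the intervals that mean_cl_boot draws.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (Embarked, Sex, Pclass, Survived). Not the real Titanic data.
rows = [
    ("S", "female", 1, 1), ("S", "female", 1, 1), ("S", "female", 1, 0),
    ("S", "male", 1, 0), ("S", "male", 1, 1), ("S", "male", 1, 0),
    ("", "female", 1, 1),  # empty Embarked value, dropped below
]


def group_means(rows):
    """Mean of the 0/1 Survived column per (Embarked, Sex, Pclass) group."""
    groups = defaultdict(list)
    for embarked, sex, pclass, survived in rows:
        if embarked == "":
            continue  # same idea as subset(train_df, Embarked != "")
        groups[(embarked, sex, pclass)].append(survived)
    return {key: mean(values) for key, values in groups.items()}


def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean (assumed settings)."""
    rng = random.Random(seed)
    n = len(values)
    boot = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = boot[int((1 - level) / 2 * n_boot)]
    upper = boot[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


means = group_means(rows)
ci_low, ci_high = bootstrap_ci([1, 1, 0])
```

Because Survived is coded 0/1, each group mean is simply the share of survivors in that group, which is why the plotted values fall between 0 and 1.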

Related

How to plot specific features on SHAP summary plots?

I am currently trying to plot a set of specific features on a SHAP summary plot. However, I am struggling to find the code necessary to do so.
When looking at the source code on GitHub, the summary_plot function does seem to have a 'features' parameter. However, this does not seem to be the solution to my problem.
Could anybody help me plot a specific set of features, or is this not a viable option in the current code of SHAP?
A possible, albeit hacky, solution is to slice both arrays yourself, for example to draw a summary plot for the single feature at column index 5 (the sixth column):
shap.summary_plot(shap_values[:,5:6], X.iloc[:, 5:6])
I rebuild the SHAP value matrix so that the plot includes the features you want, using the code below.
shap_values = explainer.shap_values(samples)[1]
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(
list(zip(samples.columns, vals)),
columns=["col_name", "feature_importance_vals"],
)
feature_importance.sort_values(
by=["feature_importance_vals"], ascending=False, inplace=True
)
feature_importance['rank'] = feature_importance['feature_importance_vals'].rank(method='max',ascending=False)
# columns_to_show is the list of feature names you want to see (define it beforehand)
missing_features = [
    i
    for i in columns_to_show
    if i not in feature_importance["col_name"][:20].tolist()
]
missing_index = []
for i in missing_features:
    missing_index.append(samples.columns.tolist().index(i))
missing_features_new = []
rename_col = {}
for i in missing_features:
    rank = int(feature_importance[feature_importance['col_name']==i]['rank'].values[0])
    missing_features_new.append('rank:'+str(rank)+' - '+i)
    rename_col[i] = 'rank:'+str(rank)+' - '+i
column_names = feature_importance["col_name"][:20].values.tolist() + missing_features_new
feature_index = feature_importance.index[:20].tolist() + missing_index
shap.summary_plot(
shap_values[:, feature_index].reshape(
samples.shape[0], len(feature_index)
),
samples.rename(columns=rename_col)[column_names],
max_display=len(feature_index),
)
To plot only one feature, look up the position of that feature in the list of columns:
i = X.columns.tolist().index('your_feature_name_here')
shap.summary_plot(shap_values[1][:,i:i+1], X.iloc[:, i:i+1])
To plot your selected features:
your_feature_list = ['your_feature_1','your_feature_2','your_feature_3']
your_feature_indices = [X.columns.tolist().index(x) for x in your_feature_list]
shap.summary_plot(shap_values[1][:,your_feature_indices], X.iloc[:, your_feature_indices])
Feel free to change "your_feature_indices" to a shorter variable name.
Change shap_values[1] to shap_values if you are not doing binary classification.
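The answers above all rely on the same step: turning feature names into column positions and then slicing the SHAP values and the feature matrix with the same positions. Here is a minimal sketch of that step using plain Python lists, with made-up feature names and values; column_positions and take_columns are hypothetical helpers, not part of SHAP or pandas.

```python
def column_positions(all_columns, wanted):
    """Return the column position of each wanted feature name, in the order given."""
    lookup = {name: pos for pos, name in enumerate(all_columns)}
    missing = [name for name in wanted if name not in lookup]
    if missing:
        raise KeyError(f"unknown features: {missing}")
    return [lookup[name] for name in wanted]


def take_columns(matrix, positions):
    """Keep only the given column positions of a row-major 2D list."""
    return [[row[p] for p in positions] for row in matrix]


# Hypothetical feature names and SHAP-like values (2 samples x 4 features).
columns = ["age", "income", "height", "weight"]
shap_like = [
    [0.1, -0.2, 0.3, 0.0],
    [0.4, 0.5, -0.6, 0.7],
]

positions = column_positions(columns, ["weight", "income"])
subset = take_columns(shap_like, positions)
```

The lookup has to run over the column names, not the row labels, and the same positions must be applied to both inputs so that each plotted column still matches its feature name.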

plot scatter dots with normalised dot sizes?

Can I ask how to plot the figure? The size of each dot should correspond to its proportion at a particular time.
Can arrows or continuous lines also be used to show the trend instead of discrete dots? The width of the arrow/line would correspond to its proportion at a particular time. It should also handle missing data, e.g. by leaving the position of missing data blank or by using a very thin arrow/line for it.
Both Python and R are fine for me.
Raw data:
Time;Value A;Value A proportion;Value B;Value B proportion
1;5;90%;12;10%
2;7;80%;43;20%
3;7;80%;83;20%
4;8;70%;44;30%
5;10;80%;65;20%
An example of the plot is like this, but I am happy for other dot patterns.
library(ggplot2)
library(reshape2)
myDF <- read.table("~/Desktop/test.txt",header=TRUE,sep=";")
# remove "%"
myDF <- data.frame(lapply(myDF, function(x) as.numeric(sub("%", "", x))) )
meltVar <- melt(myDF,id.vars = c("Time"),measure.vars = c("Value.A","Value.B"))
meltpropr <- melt(myDF,id.vars = c("Time"),measure.vars = c("Value.A.proportion","Value.B.proportion"))
newDF <- as.data.frame(cbind(meltVar,meltpropr[,"value"]))
names(newDF) <- c("Time","variable","value","prop")
ggplot(newDF,aes(x=Time, y=value)) + geom_point(aes(colour=variable , shape=variable, size = prop))
You can play with aes and theme to get the figure as you like.
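Since Python was also acceptable to the asker, the reshaping done by the R code can be sketched in Python as well. This is a minimal standard-library sketch under stated assumptions: to_long and marker_area are hypothetical helper names, and the maximum marker area of 300 is an arbitrary choice. It parses the raw table from the question, strips the "%" signs, and produces one record per time and series, with a dot size proportional to the proportion.

```python
import csv
import io

RAW = """Time;Value A;Value A proportion;Value B;Value B proportion
1;5;90%;12;10%
2;7;80%;43;20%
3;7;80%;83;20%
4;8;70%;44;30%
5;10;80%;65;20%
"""


def to_long(raw, series=("Value A", "Value B")):
    """Return one record per (time, series), with the proportion as a fraction."""
    records = []
    for row in csv.DictReader(io.StringIO(raw), delimiter=";"):
        for name in series:
            records.append({
                "time": int(row["Time"]),
                "variable": name,
                "value": float(row[name]),
                "prop": float(row[name + " proportion"].rstrip("%")) / 100,
            })
    return records


def marker_area(prop, max_area=300.0):
    """Scale a proportion in [0, 1] linearly to a marker area (max_area is arbitrary)."""
    return prop * max_area


long_rows = to_long(RAW)
sizes = [marker_area(r["prop"]) for r in long_rows]
```

With matplotlib, for instance, the sizes list could be passed as the s argument of a scatter call, one call per series.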

How to make the confidence interval (error bands) show on seaborn lineplot

I'm trying to create a plot of classification accuracy for three ML models, depending on the number of features used from the data (the number of features used is from 1 to 75, ranked according to a feature selection method). I did 100 iterations of calculating the accuracy output for each model and for each "# of features used". Below is what my data looks like (clsf from 0 to 2, timepoint from 1 to 75):
data
I am then calling the seaborn function as shown in documentation files.
sns.lineplot(x= "timepoint", y="acc", hue="clsf", data=ttest_df, ci= "sd", err_style = "band")
The plot comes out like this:
plot
I wanted there to be confidence intervals for each point on the x-axis, and don't know why it is not working. I have 100 y values for each x value, so I don't see why it cannot calculate/show it.
You could try plotting your data set with Seaborn's pointplot function instead. It is designed to show point estimates together with an indication of the uncertainty around each point. By default pointplot connects the values with a line. This is fine if the categorical variable is ordinal in nature, but it can be a good idea to remove the line via linestyles = "" for nominal data. (I used join = False in my example.)
I tried to recreate your notebook to give a visual, but wasn't able to get the confidence interval in my plot exactly as you describe. I hope this is helpful for you.
import seaborn as sb  # the answer uses the alias sb rather than sns
sb.set(style="darkgrid")
sb.pointplot(x = 'timepoint', y = 'acc', hue = 'clsf',
data = ttest_df, ci = 'sd', palette = 'magma',
join = False);
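One way to check whether a band or error bar should be visible at all is to compute the spread per x value directly. The sketch below uses made-up accuracies and a hypothetical helper named band; it is not seaborn's implementation, and it assumes a sample standard deviation. If the spread within each timepoint is zero or tiny compared with the y-axis range, or if the data frame has already been averaged down to one row per timepoint and classifier, the band collapses onto the line. I cannot tell from the question which of these applies.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (timepoint, acc) pairs for one classifier, three repeats per timepoint.
observations = [
    (1, 0.5), (1, 0.6), (1, 0.7),
    (2, 0.8), (2, 0.8), (2, 0.8),
]


def band(observations):
    """Return {x: (mean - sd, mean, mean + sd)} from repeated y values per x."""
    grouped = defaultdict(list)
    for x, y in observations:
        grouped[x].append(y)
    result = {}
    for x, ys in sorted(grouped.items()):
        centre = mean(ys)
        # stdev needs at least two values; a single value has no spread to draw
        spread = stdev(ys) if len(ys) > 1 else 0.0
        result[x] = (centre - spread, centre, centre + spread)
    return result


bands = band(observations)
```

In this toy data, timepoint 1 has a visible band of width 0.1 on each side, while timepoint 2 has none because all three repeats are identical.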

R Plot Multiple Graph Function with For Loop

Apologies in advance, I've made a bit of a hash of this one. I have a relatively big data set which looks like this:
Herein lies the problem. I've been creating GLMs from which I take the estimates of the confounding variables and use them to jig the abline (if you don't know what I mean here, basically I need to calculate my line of best fit, not just shove it through the average points). This is all fine and dandy, as I wrote a line of code which works this out for me. Sadly though, I have 19 of these graphs to produce, one for each row, and need to do this for six data sets.
My attempts to automate this process have been painful and depressing thus far. If anyone thinks being a biologist means cuddling pandas they are sadly wrong. I've got the code to take in variables and produce a graph one at a time, but haven't had any luck for producing them all on one frame.
Imagine roughly this, but with 19 graphs on it. That's the dream right now.
Unfortunately, your data is not reproducible but I think the following can be adapted.
Working with several objects like that can get very messy. This is where using a list can be very helpful. You only need your x, y and intercept in the my_list object. You can then plot all your charts using layout and a loop.
my_list <- list()
for(i in 1:19){
x <- runif(10)
y <- rnorm(10)
intercept <- lm(y~x)$coefficients[1]
name <- paste('plot_',i,sep='')
tmp <- list(x=x, y=y, intercept=intercept)
my_list[[name]] <- tmp
}
layout(matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE))
for(j in 1:length(my_list)) {
plot(x=my_list[[j]]$x, y=my_list[[j]]$y, main=attributes(my_list[j])$names,xlab="x-label",ylab="y-label")
abline(h=my_list[[j]]$intercept)
}
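Note that abline(h = ...) draws a horizontal line at the given height; to draw a fitted line in base R you would pass both coefficients, as in abline(a = intercept, b = slope). The bookkeeping behind the loop (one fit per data set, plus a grid large enough for all panels) can be sketched in Python as follows. This is an illustration with made-up data and an ordinary least-squares fit, not the GLM-adjusted line from the question; fit_line and grid_shape are hypothetical helpers.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def grid_shape(n_plots, n_cols):
    """Smallest number of rows that fits n_plots panels in n_cols columns."""
    return math.ceil(n_plots / n_cols), n_cols


# Made-up data: 19 data sets that all follow y = 2x + (1 + i).
datasets = {
    f"plot_{i}": ([0.0, 1.0, 2.0, 3.0], [1.0 + i, 3.0 + i, 5.0 + i, 7.0 + i])
    for i in range(1, 20)
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in datasets.items()}
rows, cols = grid_shape(len(fits), 5)
```

Keeping the slope and intercept together per data set means the plotting loop only has to look up one entry per panel.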
Just wanted to post the ggplot2 version of what you're trying to do to see if that might work for you as well.
I also show an example of fitting lines for multiple classes within each facet (depending on how complicated the analysis is you're conducting).
First install ggplot2 if you don't have it already:
# install.packages('ggplot2')
library(ggplot2)
Here I am just setting up some dummy data using the built-in iris dataset. I'm essentially trying to simulate having 19 distinct datasets.
set.seed(1776)
samples <- list()
num_datasets <- 19
datasets <- vector("list", num_datasets)  # pre-allocate an empty list with 19 slots
# dynamically create some samples
for(i in 1:num_datasets) {
samples[[i]] <- sample(1:nrow(iris), 20)
}
# dynamically assign to many data sets (keep 2 numeric columns plus Species)
for(i in 1:num_datasets) {
datasets[[i]] <- cbind(iris[samples[[i]], c('Petal.Length', 'Petal.Width', 'Species')], dataset_id = i)
# assign(paste0("dataset_", i), iris[samples[[i]], c('Petal.Length', 'Petal.Width')])
}
do.call is a bit tricky, but it takes in two arguments, a function, and a list of arguments to apply to that function. So I'm using rbind() on all of the distinct datasets within my datasets object (which is a list of datasets).
combined_data <- do.call(rbind, datasets)
First plot is one big scatter plot to show the data.
# all data
ggplot(data=combined_data, aes(x=Petal.Length, y=Petal.Width)) +
geom_point(alpha = 0.2) +
ggtitle("All data")
Next is 19 individual "facets" of plots all on the same scale and in the same graphing window.
# all data faceted by dataset_id
ggplot(data=combined_data, aes(x=Petal.Length, y=Petal.Width)) +
geom_point(alpha = 0.5) +
ggtitle("All data faceted by dataset") +
facet_wrap(~ dataset_id) +
geom_smooth(method='lm', se = F)
plot of facets with best fit lines
Finally, the data plotted in facets again, but colored by the species of the iris flower and each species has its own line of best fit.
# all data faceted by dataset_id, colored by species
ggplot(data=combined_data, aes(x=Petal.Length, y=Petal.Width, color = Species)) +
geom_point(alpha = 0.5) +
ggtitle("All data faceted by dataset with best fit lines per species") +
facet_wrap(~ dataset_id) +
geom_smooth(method='lm', se = F)
plots of facets with best fit within categories
I see you mentioned you had your own precalculated best fit line, but I think this conceptually might get you closer to where you need to be?
Cheers!

Plotting a discrete variable over time (scarf plot)

I have time series data from a repeated-measures eyetracking experiment.
The dataset consists of a number of respondents, and for each respondent there are 48 trials.
The data set has a variable ('saccade'), which holds the transitions between eye fixations, and a variable ('time'), which ranges from 0 to 1 within each trial. The transitions are classified into three different categories ('ver', 'hor' and 'diag').
Here is a script that will create a small example data set in python (one participant and two trials):
import numpy as np
import pandas as pd
saccade1 = np.array(['diag','hor','ver','hor','diag','ver','hor','diag','diag',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','hor','hor','diag',
'diag','ver','ver','ver','ver'])
time1 = np.array(range(len(saccade1)))/float(len(saccade1)-1)
trial1 = [1]*len(time1)
saccade2 = np.array(['diag','ver','hor','diag','diag','diag','hor','ver','hor',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','diag',
'diag','hor','hor','diag','diag','ver','ver','ver','ver','hor','diag','diag'])
time2 = np.array(range(len(saccade2)))/float(len(saccade2)-1)
trial2 = [2]*len(time2)
saccade = np.append(saccade1,saccade2)
time = np.append(time1,time2)
trial = np.append(trial1,trial2)
subject = [1]*len(time)
df = pd.DataFrame(index=range(len(subject)))
df['subject'] = subject
df['saccade'] = saccade
df['trial'] = trial
df['time'] = time
Alternatively I have made a csv-file with the same data which can be downloaded here
I would like to be able to make a so-called scarf plot to visualize the sequence of transitions over time, but I have no clue how to make these plots.
I would like plots (for each participant separately) where time is on the x-axis and trial is on the y-axis. For each trial I would like the transitions represented as colored "stacked" bars.
The only example I have of these kinds of plots are in the book "Eye Tracking - A comprehensive guide to methods and measures" (fig. 6.8b) link
Can anyone tell/help me in doing this?
(I can work with either Python or R, preferably Python.)
Here is a solution in R using ggplot2. You need to recode time into a new variable, time2, that holds the time elapsed since the previous observation instead of the cumulative time.
library(ggplot2)
dataset <- read.csv("~/Downloads/example_data_for_scarf.csv")
dataset$trial <- factor(dataset$trial)
dataset$saccade <- factor(dataset$saccade)
dataset$time2 <- c(0, diff(dataset$time))
dataset$time2[dataset$time == 0] <- 0
ggplot(dataset, aes(x = trial, y = time2, fill = saccade)) +
geom_bar(stat = "identity") +
coord_flip()
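Since Python was the preferred language in the question, the same recoding step can be sketched there too. This is a minimal sketch with a hypothetical helper, scarf_segments, and made-up time stamps: for one trial it turns the ordered time stamps into (start, width, label) bars, where each bar carries the label of the observation at its right edge, mirroring the diff() call in the R code.

```python
def scarf_segments(times, labels):
    """Turn the ordered time stamps of one trial into (start, width, label) bars."""
    segments = []
    for previous, current, label in zip(times, times[1:], labels[1:]):
        # the bar spans from the previous time stamp up to the current one
        segments.append((previous, current - previous, label))
    return segments


# Made-up trial with four observations.
times = [0.0, 0.25, 0.5, 1.0]
labels = ["diag", "hor", "ver", "hor"]

segments = scarf_segments(times, labels)
total_width = sum(width for _, width, _ in segments)
```

With matplotlib, the (start, width) pairs of each trial could be drawn with Axes.broken_barh, one row per trial and one color per saccade category.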
