Apologies in advance, I've made a bit of a hash of this one. I have a relatively big data set which looks like this:
Herein lies the problem. I've been creating GLMs, from which I take the estimates of the confounding variables and adjust the abline accordingly (if you don't know what I mean here, basically I need to calculate my line of best fit, not just shove it through the average points). This is all fine and dandy, as I wrote a line of code which works this out for me. Sadly, though, I have 19 of these graphs to produce - 1 for each row - and need to do this for six data sets.
My attempts to automate this process have been painful and depressing so far. If anyone thinks being a biologist means cuddling pandas, they are sadly mistaken. I've got the code to take in variables and produce a graph one at a time, but I haven't had any luck producing them all in one frame.
![Imagine roughly this, but with 19 graphs on it - that's the dream right now][2]
Unfortunately, your data is not reproducible, but I think the following can be adapted.
Working with several objects like that can get very messy. This is where using a list can be very helpful: you only need your x, y and intercept in the my_list object. You can then plot all your charts using layout() and a loop.
my_list <- list()
for (i in 1:19) {
  x <- runif(10)
  y <- rnorm(10)
  intercept <- lm(y ~ x)$coefficients[1]
  name <- paste('plot_', i, sep = '')
  my_list[[name]] <- list(x = x, y = y, intercept = intercept)
}
# 4 x 5 grid of plotting panels, filled row by row
layout(matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE))
for (j in seq_along(my_list)) {
  plot(x = my_list[[j]]$x, y = my_list[[j]]$y,
       main = names(my_list)[j], xlab = "x-label", ylab = "y-label")
  abline(h = my_list[[j]]$intercept)  # horizontal line at the stored intercept
}
Just wanted to post the ggplot2 version of what you're trying to do to see if that might work for you as well.
I also show an example of fitting lines for multiple classes within each facet (depending on how complicated the analysis is you're conducting).
First install ggplot2 if you don't have it already:
# install.packages('ggplot2')
library(ggplot2)
Here I am just setting up some dummy data using the built-in iris dataset. I'm essentially trying to simulate having 19 distinct datasets.
set.seed(1776)
samples <- list()
num_datasets <- 19
datasets <- vector("list", num_datasets)  # preallocate a list of the right length
# dynamically create some samples
for (i in 1:num_datasets) {
  samples[[i]] <- sample(1:nrow(iris), 20)
}
# dynamically assign to many data sets (keep the 2 numeric columns plus Species)
for (i in 1:num_datasets) {
  datasets[[i]] <- cbind(iris[samples[[i]], c('Petal.Length', 'Petal.Width', 'Species')], dataset_id = i)
  # assign(paste0("dataset_", i), iris[samples[[i]], c('Petal.Length', 'Petal.Width')])
}
do.call() is a bit tricky, but it takes two arguments: a function, and a list of arguments to apply that function to. So I'm using rbind() on all of the distinct datasets within my datasets object (which is a list of data frames).
combined_data <- do.call(rbind, datasets)
First plot is one big scatter plot to show the data.
# all data
ggplot(data = combined_data, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(alpha = 0.2) +
  ggtitle("All data")
Next is 19 individual "facets" of plots all on the same scale and in the same graphing window.
# all data faceted by dataset_id
ggplot(data = combined_data, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(alpha = 0.5) +
  ggtitle("All data faceted by dataset") +
  facet_wrap(~ dataset_id) +
  geom_smooth(method = 'lm', se = FALSE)
plot of facets with best fit lines
Finally, the data plotted in facets again, but colored by the species of the iris flower and each species has its own line of best fit.
# all data faceted by dataset_id, coloured by Species
ggplot(data = combined_data, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(alpha = 0.5) +
  ggtitle("All data faceted by dataset with best fit lines per species") +
  facet_wrap(~ dataset_id) +
  geom_smooth(method = 'lm', se = FALSE)
plots of facets with best fit within categories
I see you mentioned you had your own precalculated best fit line, but I think this might conceptually get you closer to where you need to be.
Cheers!
Related
I have a collection of timestamps that correspond roughly to the times when modules called "metal cells" are being utilized. At present, I take these timestamps in pairs and plot each pair as a small line segment using go.Scatter, appending it to a list which later serves as the data argument for a go.Figure object.
for mcell in metal_cells:
    mdf = cell_dep_grouped.get_group(mcell).reset_index(drop=True)
    for i in range(0, mdf.shape[0], 2):
        start_row = mdf.iloc[i]
        finish_row = mdf.iloc[i + 1]
        buffer_dat = go.Scatter(x=[start_row.Time, finish_row.Time],
                                y=[mcell, mcell],
                                line={'color': chem_palette[start_row.Recipe]},
                                legendgroup=start_row.Recipe,
                                name=start_row.Recipe,
                                showlegend=not plotted_chem[start_row.Recipe])
        plot_data.append(buffer_dat)
The output looks like the picture I have attached, and I would like it to keep looking close to that, because it clearly highlights the time the module was being utilized via the scatter segments, with downtime implied by the empty space between them. The main issue is that I need this plot to be interactive - the main interactive feature being that, upon making a selection, some calculations are made and some of the traces change. But because the plot consists of 80 or so go.Scatter objects, it is quite slow. Is there a more efficient way of plotting segments like these that would make the plot lighter and faster?
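One common way to lighten plots like this (a sketch, not from the question - the function and sample data here are hypothetical) is to merge all of a recipe's segments into a single trace, using None as a separator; Plotly breaks the line wherever it encounters a None, so the gaps between segments are preserved while the trace count drops from one per segment to one per recipe.

```python
def merge_segments(segments):
    """segments: list of ((x_start, x_end), y) pairs.
    Returns one x list and one y list with None gaps between segments."""
    xs, ys = [], []
    for (x0, x1), y in segments:
        xs.extend([x0, x1, None])  # None breaks the line between segments
        ys.extend([y, y, None])
    return xs, ys

# One hypothetical cell with three busy intervals:
xs, ys = merge_segments([((0, 2), "cell_A"), ((3, 5), "cell_A"), ((7, 8), "cell_A")])
# xs == [0, 2, None, 3, 5, None, 7, 8, None]

# A single trace per recipe then replaces dozens of per-segment traces:
# go.Scatter(x=xs, y=ys, mode="lines", name=recipe, legendgroup=recipe)
```

With one trace per recipe the legend grouping also comes for free, since each recipe appears exactly once.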
I'm working on a project and would like to plot my data in a similar way to this example from a book:
So I would like to create a density histogram for my categorical features (left image) and then add a separate column for each value of another feature (middle and right images).
In my case the feature I want to plot is called [district_code], and I would like to create columns based on a feature called [status_group].
What I've tried so far:
sns.kdeplot(data = raw, x = "district_code"): problem: it is a line plot, not a histogram
sns.kdeplot(data = raw, x = "district_code", col = "status_group"): problem: you can't use the col argument for this plot type
sns.displot(raw, x="district_code", col = 'status_group'): problem: the col argument works, but it creates a count plot, not a density plot
I would really appreciate some suggestions about the correct code I could use.
This is just an example for one of my categorical features, but I have many more I would like to plot. Any suggestions on how to turn this into a function where I could run the code for a list of categorical features would be highly appreciated.
UPDATE:
sns.displot(raw, x="source_class", stat = 'density', col = 'status_group', color = 'black'): works, but looks a bit awkward for some features.
How could I improve this?
Good:
Not so good:
I am trying to replicate a Python plot in R that I found in this Kaggle notebook: Titanic Data Science Solutions
This is the Python code to generate the plot, the dataset used can be found here:
import seaborn as sns
...
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
Here is the resulting plot.
The Survived column takes values of 0 and 1 (survived or did not survive), and the y-axis displays the mean per Pclass. When searching for a way to calculate the mean using ggplot2, I usually find the stat_summary() function. The best I could do was this:
library(dplyr)
library(ggplot2)
...
train_df %>%
  ggplot(aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.y = mean, geom = "line") +
  facet_grid(Embarked ~ .)
The output can be found here.
There are some issues:
There seems to be an empty facet, maybe from NA's in Embarked?
The points don't align with the line
The lines are different than those in the Python plot
I think I also haven't fully grasped the layering concept of ggplot. I would like to separate the geom = "line" in the stat_summary() function and rather add it as a + geom_line().
There is actually an empty level (i.e. "") in train_df$Embarked. You can filter that out before plotting.
train_df <- read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
train_df <- subset(train_df, Embarked != "")
ggplot(train_df, aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.data = 'mean_cl_boot') +
  geom_line(stat = 'summary', fun.y = mean) +
  facet_grid(Embarked ~ .)
You can replicate the Python plot by drawing confidence intervals with stat_summary(). Your stat_summary() line was fine, but I've rewritten it as a geom_line() call, as you asked.
Note that your ggplot code doesn't draw any points, so I can't answer that part; probably you were plotting the raw values, which are just many 0s and 1s.
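For cross-checking the lines against the original seaborn plot, the plotted quantity (mean survival per group) can also be computed directly; a sketch in pandas, assuming the column names from the Kaggle data:

```python
import pandas as pd

def survival_means(df):
    """Mean survival rate per (Embarked, Pclass, Sex) - the quantity the
    pointplot / stat_summary lines display."""
    df = df[df["Embarked"] != ""]  # drop the empty level, as in the R answer
    return (df.groupby(["Embarked", "Pclass", "Sex"])["Survived"]
              .mean()
              .reset_index())

# usage: survival_means(train_df)
```

If the numbers here don't match what either plot shows, the discrepancy is in the plotting layer rather than the data.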
Can I ask how to plot this figure? The size of each dot should correspond to its proportion at a particular time.
Could arrows or continuous lines also be used to show the trend instead of discrete dots? The width of the arrow/line would correspond to its proportion at a particular time. It should also handle missing data, e.g. by leaving the position of missing data blank or using a very thin arrow/line.
Both python and R are good for me.
Raw data:
Time;Value A;Value A proportion;Value B;Value B proportion
1;5;90%;12;10%
2;7;80%;43;20%
3;7;80%;83;20%
4;8;70%;44;30%
5;10;80%;65;20%
An example of the plot is like this, but I am happy for other dot patterns.
library(ggplot2)
library(reshape2)
myDF <- read.table("~/Desktop/test.txt",header=TRUE,sep=";")
# remove "%"
myDF <- data.frame(lapply(myDF, function(x) as.numeric(sub("%", "", x))) )
meltVar <- melt(myDF,id.vars = c("Time"),measure.vars = c("Value.A","Value.B"))
meltpropr <- melt(myDF,id.vars = c("Time"),measure.vars = c("Value.A.proportion","Value.B.proportion"))
newDF <- as.data.frame(cbind(meltVar,meltpropr[,"value"]))
names(newDF) <- c("Time","variable","value","prop")
ggplot(newDF,aes(x=Time, y=value)) + geom_point(aes(colour=variable , shape=variable, size = prop))
You can play with aes and theme to get the figure as you like.
That is a plot I generated using pyplot, where I (attempted to) adjust the text using the adjustText library, which I also found here.
As you can see, it gets pretty crowded in the parts where 0 < x < 0.1. I was thinking that there's still ample space in 0.8 < y < 1.0, such that the labels could all fit there and label the points pretty well.
My attempt was:
plt.plot(df.fpr, df.tpr, marker='.', ls='-')
texts = [plt.text(df.fpr[i], df.tpr[i], str(df.thr1[i])) for i in df.index]
adjust_text(texts,
            expand_text=(2, 2),
            expand_points=(2, 2),
            expand_objects=(2, 2),
            force_objects=(2, 20),
            force_points=(0.1, 0.25),
            lim=150000,
            arrowprops=dict(arrowstyle='-', color='red'),
            autoalign='y',
            only_move={'points': 'y', 'text': 'y'}
            )
where my df is a pandas dataframe, which can be found here.
From what I understood in the docs, I tried varying the bounding boxes and the y-force by making them larger, thinking that this would push the labels further up, but that does not seem to be the case.
I'm the author of adjustText; sorry, I just noticed this question. You are having this problem because you have a lot of overlapping texts with exactly the same y-coordinate. It's easy to solve by adding a tiny random shift along the y-axis to the labels (you also need to increase the force for texts, otherwise it works very slowly along one dimension), like so:
np.random.seed(0)
f, ax = plt.subplots(figsize=(12, 6))
plt.plot(df.fpr, df.tpr, marker='.', ls='-')
texts = [plt.text(df.fpr[i], df.tpr[i] + np.random.random() / 100, str(df.thr1[i]))
         for i in df.index]
plt.margins(y=0.125)
adjust_text(texts,
            force_text=(2, 2),
            arrowprops=dict(arrowstyle='-', color='red'),
            autoalign='y',
            only_move={'points': 'y', 'text': 'y'},
            )
Also notice that I increased the margins along the y-axis; it helps a lot with the corners. The result is not quite perfect - limiting the algorithm to just one axis makes life more difficult - but it's OK-ish already.
I have to mention that the size of the figure is very important; I don't know what yours was.