plot scatter dots with normalised dot sizes? - python

How can I plot a figure in which the size of each dot corresponds to its proportion at a particular time?
Can arrows or continuous lines also be used to show the trend instead of discrete dots? The width of the arrow/line would correspond to its proportion at a particular time. Ideally it should also handle missing data, e.g. leave the position of a missing value blank or draw a very thin arrow/line for it.
Either Python or R is fine for me.
Raw data:
Time;Value A;Value A proportion;Value B;Value B proportion
1;5;90%;12;10%
2;7;80%;43;20%
3;7;80%;83;20%
4;8;70%;44;30%
5;10;80%;65;20%
An example of the plot is like this, but I am happy for other dot patterns.

library(ggplot2)
library(reshape2)
myDF <- read.table("~/Desktop/test.txt",header=TRUE,sep=";")
# remove "%"
myDF <- data.frame(lapply(myDF, function(x) as.numeric(sub("%", "", x))) )
meltVar <- melt(myDF,id.vars = c("Time"),measure.vars = c("Value.A","Value.B"))
meltpropr <- melt(myDF,id.vars = c("Time"),measure.vars = c("Value.A.proportion","Value.B.proportion"))
newDF <- as.data.frame(cbind(meltVar,meltpropr[,"value"]))
names(newDF) <- c("Time","variable","value","prop")
ggplot(newDF,aes(x=Time, y=value)) + geom_point(aes(colour=variable , shape=variable, size = prop))
You can play with aes and theme to get the figure as you like.
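Since the question also allows Python, here is a rough pandas equivalent of the reshaping done above. It is only a sketch: the raw table is pasted in as a string, the names RAW, long_df, plot_dots and the max_area scaling factor are made up for illustration, and the plotting helper assumes matplotlib is installed.

```python
import io

import pandas as pd

RAW = """Time;Value A;Value A proportion;Value B;Value B proportion
1;5;90%;12;10%
2;7;80%;43;20%
3;7;80%;83;20%
4;8;70%;44;30%
5;10;80%;65;20%
"""

wide = pd.read_csv(io.StringIO(RAW), sep=";")

# Build one long table: one row per (Time, series) with its value and proportion.
parts = []
for series in ["Value A", "Value B"]:
    parts.append(
        pd.DataFrame(
            {
                "Time": wide["Time"],
                "variable": series,
                "value": wide[series],
                # "90%" -> 0.9
                "prop": wide[f"{series} proportion"].str.rstrip("%").astype(float) / 100,
            }
        )
    )
long_df = pd.concat(parts, ignore_index=True)


def plot_dots(df, max_area=400):
    """Scatter plot whose marker area scales with `prop` (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily so the reshaping runs without it

    fig, ax = plt.subplots()
    for name, grp in df.groupby("variable"):
        ax.scatter(grp["Time"], grp["value"], s=grp["prop"] * max_area, label=name)
    ax.set_xlabel("Time")
    ax.set_ylabel("value")
    ax.legend()
    return fig
```

Calling plot_dots(long_df) should then give one dot per time point and series, with the dot area driven by the proportion column.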

Related

Multiple opacities in Mapbox - Plotly for Python

I am currently working on a Data Visualization project.
I want to plot multiple lines (about 200k) that represent journeys from one subway station to all the others. That is, every pair of subway stations should be connected by a straight line.
The color of the line doesn't really matter (it could be red, blue, etc.); opacity is what matters most. The greater the number of journeys between two stations, the more opaque that line should be, and vice versa.
I feel I am close to the desired output, but I can't figure out how to do it properly.
The DataFrame I am using (df = pd.read_csv(...)) consists of the following columns: id_start_station, id_end_station, lat_start_station, long_start_station, lat_end_station, long_end_station, number_of_journeys.
I extracted the coordinates with:
# Interleave start, end, None so that each journey is drawn as its own segment
lons = np.empty(3 * len(df))
lons[::3] = df['long_start_station']
lons[1::3] = df['long_end_station']
lons[2::3] = None  # stored as NaN, which breaks the line between journeys
lats = np.empty(3 * len(df))
lats[::3] = df['lat_start_station']
lats[1::3] = df['lat_end_station']
lats[2::3] = None
I then started a figure by:
fig = go.Figure()
and then added a trace by:
fig.add_trace(go.Scattermapbox(
name='Journeys',
lat=lats,
lon=lons,
mode='lines',
line=dict(color='red', width=1),
opacity= ¿?, # PROBLEM IS HERE [1]
))
[1] So I tried a few different things to pass an opacity term:
I created a new array with an opacity value for each point:
opacity = np.empty(3 * len(df))
opacity[::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity[1::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity[2::3] = None
and passed it into [1], but this error came out:
ValueError:
Invalid value of type 'numpy.ndarray' received for the 'opacity' property of scattermapbox
The 'opacity' property is a number and may be specified as:
- An int or float in the interval [0, 1]
I then thought of passing the "opacity" term into the "color" term, by using rgba's property alpha, such as: rgba(255,0,0,0.5).
So I first created a "map" of all alpha parameters:
df['alpha'] = df['number_of_journeys'] / max(df['number_of_journeys'])
and then created a function to retrieve all the alpha parameters inside a specific color:
colors_with_opacity = []
def colors_with_opacity_func(df, empty_list):
    for alpha in df['alpha']:
        empty_list.extend(["rgba(255,0,0,"+str(alpha)+")"])
        empty_list.extend(["rgba(255,0,0,"+str(alpha)+")"])
        empty_list.append(None)
colors_with_opacity_func(df, colors_with_opacity)
and passed that into the color attribute of the Scattermapbox, but got the following error:
ValueError:
Invalid value of type 'builtins.list' received for the 'color' property of scattermapbox.line
The 'color' property is a color and may be specified as:
- A hex string (e.g. '#ff0000')
- An rgb/rgba string (e.g. 'rgb(255,0,0)')
- An hsl/hsla string (e.g. 'hsl(0,100%,50%)')
- An hsv/hsva string (e.g. 'hsv(0,100%,100%)')
- A named CSS color:
aliceblue, antiquewhite, aqua, [...] , whitesmoke,
yellow, yellowgreen
Since there is a massive number of lines, looping over one trace per line would cause performance issues.
Any help will be much appreciated. I can't figure out a way to accomplish this properly.
Thank you, in advance.
EDIT 1 : NEW QUESTION ADDED
I add this question here below as I believe it can help others that are looking for this particular topic.
Following Rob's helpful answer, I managed to add multiple opacities, as specified previously.
However, some of my colleagues suggested a change that would improve the visualization of the map.
Now, instead of having multiple opacities (one for each trace, according to the value of the dataframe) I would also like to have multiple widths (according to the same value of the dataframe).
That is, following Rob's answer, I would need something like this:
BINS_FOR_OPACITY=10
opacity_a = np.geomspace(0.001,1, BINS_FOR_OPACITY)
BINS_FOR_WIDTH=10
width_a = np.geomspace(1,3, BINS_FOR_WIDTH)
fig = go.Figure()
# Note the double "for" statement that follows
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_OPACITY, labels=opacity_a)):
    for width, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_WIDTH, labels=width_a)):
        fig.add_traces(
            go.Scattermapbox(
                name=f"{d['number_of_journeys'].mean():.2E}",
                lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
                lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
                line_width=width,
                line_color="blue",
                opacity=opacity,
                mode="lines+markers",
            )
        )
However, the above clearly does not work: it creates many more traces than it should (I can't really explain why, but I guess it is because of the double loop forced by the two for statements).
It occurred to me that some kind of solution could be hiding in the pd.cut part, as I would need something like a double cut, but I couldn't find a way to do it properly.
I also managed to create a pandas Series with:
widths = pd.cut(df["size"], bins=BINS_FOR_WIDTH, labels=width_a)
and iterated over that series, but got the same result as before (an excess of traces).
To clarify, I don't need only multiple opacities or only multiple widths; I need both at the same time, which is what is causing me trouble.
Again, any help is greatly appreciated.
Opacity is set per trace. For markers, per-point transparency can be achieved through the color with rgba(a,b,c,d), but this does not work for lines. (The same applies to ordinary scatter plots.)
To demonstrate, I have used London Underground stations (filtered to reduce the number of nodes) and gone to the extra effort of formatting the data as a CSV. Using JSON as the source has nothing to do with the solution.
number_of_journeys is binned, and each bin becomes one trace; a geometric progression is used to calculate the opacity of each bin.
This sample data set generates 83k lines.
import requests
import geopandas as gpd
import plotly.graph_objects as go
import itertools
import numpy as np
import pandas as pd
from pathlib import Path
# get geometry of london underground stations
gdf = gpd.GeoDataFrame.from_features(
requests.get(
"https://raw.githubusercontent.com/oobrien/vis/master/tube/data/tfl_stations.json"
).json()
)
# limit to zones 1-6 and stations served by at least one line
gdf = gdf.loc[gdf["zone"].isin(["1","2","3","4","5","6"]) & gdf["lines"].apply(len).gt(0)].reset_index(
drop=True
).rename(columns={"id":"tfl_id", "name":"id"})
# wanna join all valid combinations of stations...
combis = np.array(list(itertools.combinations(gdf.index, 2)))
# generate dataframe of all combinations of stations
gdf_c = (
gdf.loc[combis[:, 0], ["geometry", "id"]]
.assign(right=combis[:, 1])
.merge(gdf.loc[:, ["geometry", "id"]], left_on="right", right_index=True, suffixes=("_start_station","_end_station"))
)
gdf_c["lat_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.y)
gdf_c["long_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.x)
gdf_c["lat_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.y)
gdf_c["long_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.x)
gdf_c = gdf_c.drop(
columns=[
"geometry_start_station",
"right",
"geometry_end_station",
]
).assign(number_of_journeys=np.random.randint(1,10**5,len(gdf_c)))
gdf_c
f = Path.cwd().joinpath("SO.csv")
gdf_c.to_csv(f, index=False)
# there's a requirement to start from a CSV, and no sample data was provided, so write one and read it back
df = pd.read_csv(f)
# makes use of ravel simpler...
df["none"] = None
# now it's simple to generate scattermapbox... a trace per required opacity
BINS=10
opacity_a = np.geomspace(0.001,1, BINS)
fig = go.Figure()
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS, labels=opacity_a)):
    fig.add_traces(
        go.Scattermapbox(
            name=f"{d['number_of_journeys'].mean():.2E}",
            lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
            lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
            line_color="blue",
            opacity=opacity,
            mode="lines+markers",
        )
    )
fig.update_layout(
mapbox={
"style": "carto-positron",
"center": {'lat': 51.520214996769255, 'lon': -0.097792388774743},
"zoom": 9,
},
margin={"l": 0, "r": 0, "t": 0, "b": 0},
)
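For the follow-up about varying width and opacity together, one possible approach (a sketch, not something given in the answer above) is to bin number_of_journeys once and look up both style values from the same bin index. That keeps the number of traces at the number of bins, whereas two nested groupby loops produce up to bins × bins traces. The random df below is a hypothetical stand-in for the real data, and the styles list only records what each trace would receive.

```python
import numpy as np
import pandas as pd

BINS = 10
opacity_a = np.geomspace(0.001, 1, BINS)  # one opacity per bin
width_a = np.geomspace(1, 3, BINS)        # one line width per bin

# Hypothetical stand-in for the real journeys table.
rng = np.random.default_rng(0)
df = pd.DataFrame({"number_of_journeys": rng.integers(1, 10**5, size=1000)})

# Bin once; labels=False returns the integer bin index (0 .. BINS-1).
df["bin"] = pd.cut(df["number_of_journeys"], bins=BINS, labels=False)

styles = []
for bin_idx, d in df.groupby("bin"):
    styles.append(
        {
            "bin": int(bin_idx),
            "rows": len(d),
            "opacity": float(opacity_a[bin_idx]),
            "width": float(width_a[bin_idx]),
        }
    )
    # This is where one go.Scattermapbox trace per bin would be added, with
    # opacity=opacity_a[bin_idx] and line_width=width_a[bin_idx].
```

Each entry in styles corresponds to a single trace, so busier routes end up both more opaque and thicker without increasing the trace count.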

Fix the distance between the plotting area and x-axis label in plotnine / ggplot

I want to fix/set (not increase) the distance between the plotting area and the x-axis label in plotnine/ggplot.
library(ggplot2)
ggplot(diamonds)
ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price, color=cut))
I want to fix the distance between the two red bars on the plot. I would like to be able to have x-tick labels that take up more space (rotated, larger font, etc.) without affecting where the x-axis label is located relative to the plot. I have found many examples that adjust the spacing, but none that set it manually.
This might be an R specific solution, I don't know how plotnine works under the hood. In R, the height of the x-axis label is determined dynamically by the dimensions of the text, and there is no convenient way of setting this manually (afaik).
Instead, one can edit the height of that row in the gtable and then plot the result.
library(ggplot2)
library(grid)
p <- ggplot(diamonds) +
geom_point(aes(x=carat, y=price, color=cut)) +
geom_smooth(aes(x=carat, y=price, color=cut))
# Convert plot to gtable
gt <- ggplotGrob(p)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# Find row in gtable where the bottom axis is located
axis_row <- with(gt$layout, t[grep("axis-b", name)])
# Manually set the height of that row
gt$heights[axis_row] <- unit(2, "cm")
# Display new plot
grid.newpage(); grid.draw(gt)
Created on 2021-08-17 by the reprex package (v1.0.0)

Is there a way to cut only the first gap from histogram and take all the remain values in Python?

I have a data frame with the fields 'unique_years' and 'counts'. Plotting it gives the following histogram: histogram - example. I need to define a start-year variable, but if there are empty gaps at the start of the histogram I need to skip them and shift the starting year. I was wondering whether there is a Pythonic way to do this. In the example plot there is a non-empty bin at the start, followed by a big gap of empty bins. So I need to find the point from which the non-empty bins are continuous and use it as the starting year (for the sample above, the starting year should be 1935). The n numpy.ndarray returned by plt.hist tells me which bins are empty, but I need an efficient way to use it. Thank you :)
Sample of my data frame:
import pandas as pd
data = {'unique_years': [1907, 1935, 1938, 1939, 1940],
'counts' : [11, 14, 438, 85, 8]}
df = pd.DataFrame(data, columns = ['unique_years', 'counts'])
Code for the histogram plot:
import matplotlib.pyplot as plt
(n, bins, patches) = plt.hist(df.unique_years, bins=25, label='hst')
plt.show()
The issue with your question is that 'continuous' is not really well defined here. Do you mean that every year should have a non-empty count (that is fairly easy to do as you can filter your data for that prior to building your histogram), or should every consecutive bucket be non empty? If the latter, this means that you must:
Build your histogram
Filter your data on the resulting bins
Either use the filtered histogram or re-bin the remaining data, with bins sizes not guaranteed to stay the same (so it is possible that you have the same issue with the new bins!)
As it is difficult to know exactly what is relevant in your exact case, I think the best answer would be to give you a set of tools that you can use as you see fit for the exact problem that you are encountering:
I want to filter my data starting from a certain date
filtered = df.unique_years[df.unique_years > 1930]
I want to find the second non-empty bin
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
From there you can:
rebin your filtered data:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Re-binning on the filtered data
plt.hist(df.unique_years[df.unique_years >= x[second_nonempty]], bins=25)
Plot your histogram directly on the filtered bins:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Forcing the bins to take the provided values
plt.hist(df.unique_years, bins=x[second_nonempty:])
Now the 'second_nonempty' above can of course be replaced by any estimator of where you want to start, e.g.:
# Last empty bin + 1
all_bins_full_after = np.where(n == 0)[0][-1] + 1
Or anything else really
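As one more possible estimator, the start can be found from the years themselves rather than from the bins, which avoids any dependence on the bin width. The sketch below is an assumption-laden illustration: the function name start_year_after_gaps and the max_gap threshold (how many years may be missing before a gap counts as a break) are made up here and would need tuning for real data.

```python
import numpy as np


def start_year_after_gaps(years, max_gap=5):
    """Return the first year of the trailing run whose consecutive gaps are all <= max_gap."""
    years = np.sort(np.asarray(years))
    gaps = np.diff(years)                   # distance between neighbouring years
    big = np.flatnonzero(gaps > max_gap)    # positions where a break occurs
    if big.size == 0:
        return int(years[0])                # no break at all: keep the first year
    return int(years[big[-1] + 1])          # first year after the last break


start_year = start_year_after_gaps([1907, 1935, 1938, 1939, 1940])  # 1935
```

With the sample data the only gap larger than five years is between 1907 and 1935, so the start year is 1935; a stricter max_gap=2 would also treat 1935 to 1938 as a break and return 1938.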
This should work to eliminate all the bins that are not consecutive. I am working mainly on the df. You can use this to plot your histogram
df = pd.DataFrame(data, columns = ['unique_years', 'counts'])
yd = df.unique_years.diff().eq(1)
df[yd | yd.shift(-1, fill_value=False)]
this is the result you would get:

ggplot summarise mean value of categorical variable on y axis

I am trying to replicate a Python plot in R that I found in this Kaggle notebook: Titanic Data Science Solutions
This is the Python code to generate the plot, the dataset used can be found here:
import seaborn as sns
...
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
Here is the resulting plot.
The survival column takes values of 0 and 1 (survive or not survive) and the y-axis is displaying the mean per pclass. When searching for a way to calculate the mean using ggplot2, I usually find the stat_summary() function. The best I could do was this:
library(dplyr)
library(ggplot2)
...
train_df %>%
ggplot(aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
stat_summary(fun.y = mean, geom = "line") +
facet_grid(Embarked ~ .)
The output can be found here.
There are some issues:
There seems to be an empty facet, maybe from NA's in Embarked?
The points don't align with the line
The lines are different than those in the Python plot
I think I also haven't fully grasped the layering concept of ggplot. I would like to separate the geom = "line" in the stat_summary() function and rather add it as a + geom_line().
There is actually an empty level (i.e. "") in train_df$Embarked. You can filter that out before plotting.
train_df <- read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
train_df <- subset(train_df, Embarked != "")
ggplot(train_df, aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
stat_summary(fun.data = 'mean_cl_boot') +
geom_line(stat = 'summary', fun.y = mean) +
facet_grid(Embarked ~ .)
You can replicate the python plot by drawing confidence intervals using stat_summary. Although your lines with stat_summary were great, I've rewritten it as a geom_line call, as you asked.
Note that your ggplot code doesn't draw any points, so I can't answer that part, but probably you were drawing the raw values which are just many 0s and 1s.
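To make explicit what the y-axis of both plots shows, the mean of the 0/1 Survived column per group can also be computed directly. This is only a pandas sketch on a few made-up rows that reuse the Titanic column names; the toy values and the survival_rate column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the Titanic data.
toy = pd.DataFrame(
    {
        "Embarked": ["S", "S", "S", "S", "C", "C", ""],
        "Pclass": [1, 1, 3, 3, 1, 1, 1],
        "Sex": ["female", "female", "male", "male", "female", "male", "female"],
        "Survived": [1, 0, 0, 0, 1, 0, 1],
    }
)

# Drop the empty Embarked level, as the answer does before plotting.
toy = toy[toy["Embarked"] != ""]

# The mean of a 0/1 column is the share of survivors in each group.
rates = (
    toy.groupby(["Embarked", "Pclass", "Sex"])["Survived"]
    .mean()
    .reset_index(name="survival_rate")
)
```

Each row of rates is one point on the plot: Embarked selects the facet, Pclass the x position, Sex the line, and survival_rate the height.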

R Plot Multiple Graph Function with For Loop

Apologies in advance, I've made a bit of a hash of this one. I have a relatively big data set which looks like this:
Herein lies the problem. I've been creating GLMs from which I take the estimates of the confounding variables and adjust the abline (if you don't know what I mean here, basically I need to calculate my line of best fit, not just shove it through the average points). This is all fine and dandy, as I wrote a line of code which works this out for me. Sadly though, I have 19 of these graphs to produce - 1 for each row - and need to do this for six data sets.
My attempts to automate this process have been painful and depressing thus far. If anyone thinks being a biologist means cuddling pandas they are sadly wrong. I've got the code to take in variables and produce a graph one at a time, but haven't had any luck for producing them all on one frame.
Imagine roughly this, but with 19 graphs on it. that's the dream right now
Unfortunately, your data is not reproducible but I think the following can be adapted.
Working with several objects like that can get very messy. This is where using list can be very helpful. You only need your x, y and intercept in the my_list object. You can then plot all your charts using layout and a loop.
my_list <- list()
for(i in 1:19){
x <- runif(10)
y <- rnorm(10)
intercept <- lm(y~x)$coefficients[1]
name <- paste('plot_',i,sep='')
tmp <- list(x=x, y=y, intercept=intercept)
my_list[[name]] <- tmp
}
layout(matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE))
for(j in 1:length(my_list)) {
plot(x=my_list[[j]]$x, y=my_list[[j]]$y, main=attributes(my_list[j])$names,xlab="x-label",ylab="y-label")
abline(h=my_list[[j]]$intercept)
}
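Note that abline(h = ...) draws a horizontal line at the intercept; a fitted line needs both a slope and an intercept. The same store-everything-in-one-container idea is sketched below in Python with both coefficients kept per panel. It uses simulated data like the loop above, and the names fits, n_rows and n_cols are made up for illustration.

```python
import math

import numpy as np

rng = np.random.default_rng(1776)
n_plots = 19

fits = {}
for i in range(1, n_plots + 1):
    x = rng.uniform(size=10)
    y = rng.normal(size=10)
    # Degree-1 least-squares fit; polyfit returns the highest power first.
    slope, intercept = np.polyfit(x, y, deg=1)
    fits[f"plot_{i}"] = {"x": x, "y": y, "slope": slope, "intercept": intercept}

# Smallest near-square grid that holds every panel (4 rows x 5 columns for 19 plots).
n_cols = math.ceil(math.sqrt(n_plots))
n_rows = math.ceil(n_plots / n_cols)
```

With matplotlib, plt.subplots(n_rows, n_cols) would give the grid of axes, and each panel's line can be drawn from intercept + slope * x.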
Just wanted to post the ggplot2 version of what you're trying to do to see if that might work for you as well.
I also show an example of fitting lines for multiple classes within each facet (depending on how complicated the analysis is you're conducting).
First install ggplot2 if you don't have it already:
# install.packages('ggplot2')
library(ggplot2)
Here I am just setting up some dummy data using the built-in iris dataset. I'm essentially trying to simulate having 19 distinct datasets.
set.seed(1776)
samples <- list()
num_datasets <- 19
datasets <- vector("list", num_datasets)
# dynamically create some samples
for(i in 1:num_datasets) {
samples[[i]] <- sample(1:nrow(iris), 20)
}
# dynamically assign to many data sets (keep 2 numeric columns plus Species)
for(i in 1:num_datasets) {
datasets[[i]] <- cbind(iris[samples[[i]], c('Petal.Length', 'Petal.Width', 'Species')], dataset_id = i)
# assign(paste0("dataset_", i), iris[samples[[i]], c('Petal.Length', 'Petal.Width')])
}
do.call is a bit tricky, but it takes in two arguments, a function, and a list of arguments to apply to that function. So I'm using rbind() on all of the distinct datasets within my datasets object (which is a list of datasets).
combined_data <- do.call(rbind, datasets)
First plot is one big scatter plot to show the data.
# all data
ggplot(data=combined_data, aes(x=Petal.Length, y=Petal.Width)) +
geom_point(alpha = 0.2) +
ggtitle("All data")
Next is 19 individual "facets" of plots all on the same scale and in the same graphing window.
# all data faceted by dataset_id
ggplot(data=combined_data, aes(x=Petal.Length, y=Petal.Width)) +
geom_point(alpha = 0.5) +
ggtitle("All data faceted by dataset") +
facet_wrap(~ dataset_id) +
geom_smooth(method='lm', se = F)
plot of facets with best fit lines
Finally, the data plotted in facets again, but colored by the species of the iris flower and each species has its own line of best fit.
# all data faceted by dataset_id
ggplot(data=combined_data, aes(x=Petal.Length, y=Petal.Width, color = Species)) +
geom_point(alpha = 0.5) +
ggtitle("All data faceted by dataset with best fit lines per species") +
facet_wrap(~ dataset_id) +
geom_smooth(method='lm', se = F)
plots of facets with best fit within categories
I see you mentioned you had your own precalculated best fit line, but I think this conceptually might get you closer to where you need to be?
Cheers!
