Adding X-Y offsets to data points - python

I'm looking for a way to specify an X-Y offset to plotted data points. I'm just getting into Altair, so please bear with me.
The situation: I have a dataset recording daily measurements for 30 people. Every person can register several different types of measurements every day.
Example dataset & plot, with 2 people and 2 measurement types:
import numpy as np
import pandas as pd
# pd.np was removed from pandas, so NumPy is imported directly
df = pd.DataFrame.from_dict({"date": pd.to_datetime(pd.date_range("2019-12-01", periods=5).repeat(4)),
                             "person": np.tile(["Bob", "Amy"], 10),
                             "measurement_type": np.tile(["score_a", "score_a", "score_b", "score_b"], 5),
                             "value": 20.0*np.random.random(size=20)})
import altair as alt
alt.Chart(df, width=600, height=100) \
.mark_circle(size=150) \
.encode(x = "date",
y = "person",
color = alt.Color("value"))
This gives me this graph:
In the example above, the 2 measurement types are plotted on top of each other. I would like to add an offset to the circles depending on the "measurement_type" column, so that they can all be made visible around the date-person location in the graph.
Here's a mockup of what I want to achieve:
I've been searching the docs but haven't figured out how to do this - been experimenting with the "stack" option, with the dx and dy options, ...
I have a feeling this should just be another encoding channel (offset or alike), but that doesn't exist.
Can anyone point me in the right direction?

There is currently no concept of an offset encoding in Altair, so the best approach to this will be to combine a row encoding with a y encoding, similar to the Grouped Bar Chart example in Altair's documentation:
alt.Chart(df,
width=600, height=100
).mark_circle(
size=150
).encode(
x = "date",
row='person',
y = "measurement_type",
color = alt.Color("value")
)
You can then fine-tune the look of the result using standard chart configuration settings:
alt.Chart(df,
width=600, height=alt.Step(25)
).mark_circle(
size=150
).encode(
x = "date",
row='person',
y = alt.Y("measurement_type", title=None),
color = alt.Color("value")
).configure_facet(
spacing=10
).configure_view(
strokeOpacity=0
)

Well, I don't know what result you have been getting up until now, but maybe write a function with parameters like def chart(DotsOnXAxis, FirstDotsOnYAxis, SecondDotsOnYAxis, OffsetAmount)
and then put those variables in the right place.
If you want an offset between the dots, maybe put in a system like: SecondDotsOnYAxis = FirstDotsOnYAxis + OffsetAmount
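A minimal sketch of that idea, assuming the offset is applied to numeric positions that you compute yourself before plotting. The function name offset_positions and its parameters are hypothetical; nothing here is part of Altair:

```python
def offset_positions(categories, groups, offset_amount=0.2):
    """Return one numeric position per (category, group) pair.

    Each category (e.g. a person) gets an integer base position, and each
    group (e.g. a measurement type) is shifted around that base so the
    points no longer overlap.
    """
    category_levels = sorted(set(categories))
    group_levels = sorted(set(groups))
    base = {c: i for i, c in enumerate(category_levels)}
    # Centre the shifts on zero so the groups straddle the base position.
    centre = (len(group_levels) - 1) / 2
    shift = {g: (j - centre) * offset_amount for j, g in enumerate(group_levels)}
    return [base[c] + shift[g] for c, g in zip(categories, groups)]


people = ["Bob", "Amy", "Bob", "Amy"]
kinds = ["score_a", "score_a", "score_b", "score_b"]
positions = offset_positions(people, kinds, offset_amount=0.2)
```

The resulting numbers could be stored in a new column and used as a quantitative y encoding, at the cost of losing the categorical axis labels.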

Related

In Alt HConcat , is it possible to define strokeWidth individually for each (i.e. not use configure_* , which can only be applied to Top level)?

I have a need to have a left side faceted chart (using mark_rect) that has several small-multiples/mini charts. On the right, I would like to label them by their title using mark_text.
The concatenation works fine using (left | right) , however I have a need to remove the box that is on the view from mark_text
Example code describing my situation is below:
import altair as alt
import pandas as pd
df = pd.DataFrame(
[['BookA',0,'Hello',3],
['BookA',0,'World',1],
['BookA',1,'Hello',1],
['BookA',1,'World',0],
['BookA',2,'Hello',4],
['BookA',2,'World',2],
['BookB',0,'Hello',0],
['BookB',0,'World',0],
['BookB',1,'Hello',2],
['BookB',1,'World',3],
['BookC',0,'Hello',1],
['BookC',0,'World',0],
['BookC',1,'Hello',1],
['BookC',1,'World',2],
['BookC',2,'Hello',3],
['BookC',2,'World',4],
['BookC',3,'Hello',0],
['BookC',3,'World',0],
], columns = ['title','line','var','val']
)
base = alt.Chart(df)
# Left chart containing heatmap
left = base.mark_rect().encode(
x=alt.X('line:O',axis=alt.Axis(tickSize=0,labels=False,title="",domain=False)),
y=alt.Y('var:O',axis=alt.Axis(tickSize=0,title="",labelFontSize = 12,labelPadding = 6,domain=False)),
color= alt.Color('val:O',scale=alt.Scale(scheme="blues"),legend=None),
facet=alt.Facet('title:N', columns=1,header=alt.Header(labelExpr="''",labelOrient="right",title=""))
).resolve_scale(
x='independent' #Make each plot have scale of x-axis only as per desired length
).properties(height = 50)
# Right chart containing Book title
right = base.mark_text(strokeOpacity = 1,
dx= 300, #Added to shift by right
fontSize = 14,fontWeight = 'normal', baseline='top').encode(
text = alt.Text('title'),
facet=alt.Facet('title:N', columns=1,header=alt.Header(labelExpr="''",labelOrient="right",title=""))
).properties(height = 51)
(left | right).configure_view(stroke='black',strokeWidth=2)
I would ideally like to remove the black bordered boxes that appear in the middle. If I set strokeWidth = 0 in configure_view, it affects the left visualization in the below image, which I do not wish to.
You could manually edit the properties of the chart object instead of using the configure_view method. This means you can add it to the view property of the chart dictionary instead of using config.view (not sure if this could have side effects):
left.view = {}
left.view['stroke'] = 'black'
left.view['strokeWidth'] = 3
(left | right).configure_view(stroke=None)
Also note that if you use alt.hconcat there is a padding parameter that can achieve what you are doing with dx now. I am not sure if there is a practical advantage in your case, but its purpose is to separate concatenated plots, so it might be useful to know of in general.

Plotting a "timetable" with grouped bars for defined hours/timeboxes

I want to track my mobile devices at home so I can plot a "at home" and "not at home" diagram.
I collect the data as follows:
ip,device_name,start,end,length,date
192.168.178.123,aaa,2022-04-16 00:33:01.395443,2022-04-16 00:37:06.843443
192.168.178.123,aaa,2022-04-16 08:55:24.911787,2022-04-16 08:56:39.197196
192.168.178.123,aaa,2022-04-20 21:49:25.660712,2022-04-20 21:50:25.660712
192.168.178.123,aaa,2022-04-24 14:42:14.781557,2022-04-24 14:44:56.519343
192.168.178.234,bbb,2022-04-16 08:22:37.763442,2022-04-16 08:23:37.763442
192.168.178.234,bbb,2022-04-16 10:05:09.613899,2022-04-16 10:06:09.613899
Each entry of my csv-File represents the status of a device being not at home.
I want to have a diagram as shown
I can't figure out how to do this in plotly. I tried to find a way using time series (https://plotly.com/python/time-series/) but I think there is nothing that helps me to do what I want.
This code brings me to an output which is quite near to what I want but I can not bring the y-axis to hours and show the gaps.
import pandas as pd
import plotly.express as px
data_frame = pd.read_csv("awaytimes.csv", parse_dates=['start', 'end'])
data_frame['length'] = (data_frame['end'] - data_frame['start']) / pd.Timedelta(hours=1)
fig = px.bar(data_frame,
x="date",
y="length",
color='device_name',
barmode='group',
height=400)
fig.show()
I hope one of you can give me a hint.
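One possible starting point for getting hours on the y-axis, sketched under the assumptions that every interval starts and ends on the same day and that the timestamps use the format shown in the CSV sample. The helper name to_day_hours is hypothetical:

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def to_day_hours(start_text, end_text):
    """Convert one away interval into (day, start hour of day, length in hours).

    Assumes the interval does not cross midnight; the sub-second part of
    the start time is ignored when computing the hour of day.
    """
    start = datetime.strptime(start_text, TIMESTAMP_FORMAT)
    end = datetime.strptime(end_text, TIMESTAMP_FORMAT)
    start_hour = start.hour + start.minute / 60 + start.second / 3600
    length_hours = (end - start).total_seconds() / 3600
    return start.date().isoformat(), start_hour, length_hours


# Two rows taken from the sample data in the question
sample = [
    ("2022-04-16 00:33:01.395443", "2022-04-16 00:37:06.843443"),
    ("2022-04-20 21:49:25.660712", "2022-04-20 21:50:25.660712"),
]
segments = [to_day_hours(start, end) for start, end in sample]
```

With the day on the x-axis, the length could then be used as the bar height and the start hour as the bar's base (px.bar accepts a base argument), so that each bar floats at the time of day it covers and the gaps stay visible.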

Multiple opacities in Mapbox - Plotly for Python

I am currently working on a Data Visualization project.
I want to plot multiple lines (about 200k) that represent journeys from one subway station to all the others. That is, every pair of subway stations should be connected by a straight line.
The color of the line doesn't really matter (it could well be red, blue, etc.), but opacity is what matters the most. The bigger the number of travels between two random stations, the more opacity of that particular line; and vice versa.
I feel I am close to the desired output, but can't figure a way to do it properly.
The DataFrame I am using (df = pd.read_csv(...)) consists of a series of columns, namely: id_start_station, id_end_station, lat_start_station, long_start_station, lat_end_station, long_end_station, number_of_journeys.
I got to extract the coordinates by coding
lons = []
lons = np.empty(3 * len(df))
lons[::3] = df['long_start_station']
lons[1::3] = df['long_end_station']
lons[2::3] = None
lats = []
lats = np.empty(3 * len(df))
lats[::3] = df['lat_start_station']
lats[1::3] = df['lat_end_station']
lats[2::3] = None
I then started a figure by:
fig = go.Figure()
and then added a trace by:
fig.add_trace(go.Scattermapbox(
name='Journeys',
lat=lats,
lon=lons,
mode='lines',
line=dict(color='red', width=1),
opacity= ¿?, # PROBLEM IS HERE [1]
))
[1] So I tried a few different things to pass an opacity term:
I created a new array holding the opacity of each line, by:
opacity = []
opacity = np.empty(3 * len(df))
opacity [::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity [1::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity [2::3] = None
and passed it into [1], but this error came out:
ValueError:
Invalid value of type 'numpy.ndarray' received for the 'opacity' property of scattermapbox
The 'opacity' property is a number and may be specified as:
- An int or float in the interval [0, 1]
I then thought of passing the "opacity" term into the "color" term, by using rgba's property alpha, such as: rgba(255,0,0,0.5).
So I first created a "map" of all alpha parameters:
df['alpha'] = df['number_of_journeys'] / max(df['number_of_journeys'])
and then created a function to retrieve all the alpha parameters inside a specific color:
colors_with_opacity = []
def colors_with_opacity_func(df, empty_list):
    for alpha in df['alpha']:
        empty_list.extend(["rgba(255,0,0,"+str(alpha)+")"])
        empty_list.extend(["rgba(255,0,0,"+str(alpha)+")"])
        empty_list.append(None)
colors_with_opacity_func(df, colors_with_opacity)
and passed that into the color attribute of the Scattermapbox, but got the following error:
ValueError:
Invalid value of type 'builtins.list' received for the 'color' property of scattermapbox.line
The 'color' property is a color and may be specified as:
- A hex string (e.g. '#ff0000')
- An rgb/rgba string (e.g. 'rgb(255,0,0)')
- An hsl/hsla string (e.g. 'hsl(0,100%,50%)')
- An hsv/hsva string (e.g. 'hsv(0,100%,100%)')
- A named CSS color:
aliceblue, antiquewhite, aqua, [...] , whitesmoke,
yellow, yellowgreen
Since there is a massive number of lines, looping / iterating through traces will cause performance issues.
Any help will be much appreciated. I can't figure a way to properly accomplish that.
Thank you, in advance.
EDIT 1 : NEW QUESTION ADDED
I add this question here below as I believe it can help others that are looking for this particular topic.
Following Rob's helpful answer, I managed to add multiple opacities, as specified previously.
However, some of my colleagues suggested a change that would improve the visualization of the map.
Now, instead of having multiple opacities (one for each trace, according to the value of the dataframe) I would also like to have multiple widths (according to the same value of the dataframe).
That is, following Rob's answer, I would need something like this:
BINS_FOR_OPACITY=10
opacity_a = np.geomspace(0.001,1, BINS_FOR_OPACITY)
BINS_FOR_WIDTH=10
width_a = np.geomspace(1,3, BINS_FOR_WIDTH)
fig = go.Figure()
# Note the double "for" statement that follows
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_OPACITY, labels=opacity_a)):
    for width, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_WIDTH, labels=width_a)):
        fig.add_traces(
            go.Scattermapbox(
                name=f"{d['number_of_journeys'].mean():.2E}",
                lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
                lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
                line_width=width,
                line_color="blue",
                opacity=opacity,
                mode="lines+markers",
            )
        )
However, the above is clearly not working, as it creates many more traces than it should (I can't fully explain why, but I guess it is because of the double loop forced by the two for statements).
It occurred to me that some kind of solution could be hiding in the pd.cut part, as I would need something like a double cut, but I couldn't find a way to do it properly.
I also managed to create a Pandas series by:
widths = pd.cut(df["size"], bins=BINS_FOR_WIDTH, labels=width_a)
and iterating over that series, but got the same result as before (an excess of traces).
To clarify, I don't need only multiple opacities or only multiple widths; I need both at the same time, which is what is causing me trouble.
Again, any help is greatly appreciated.
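The nested loop runs the inner groupby once for every outer group, and the inner d overwrites the outer one, so up to BINS_FOR_OPACITY x BINS_FOR_WIDTH traces are created. One way to get both properties at once, sketched under the assumption that opacity and width are derived from the same bins of number_of_journeys, is to cut once and use the bin index to look up both values. The tiny data frame and the styles list are stand-ins for illustration only:

```python
import numpy as np
import pandas as pd

BINS = 10
opacity_a = np.geomspace(0.001, 1, BINS)  # one opacity per bin
width_a = np.geomspace(1, 3, BINS)        # one width per bin

# Tiny stand-in for the real data (values are made up for illustration).
df = pd.DataFrame({"number_of_journeys": [1, 50, 100]})

# labels=False makes pd.cut return the integer bin index (0 .. BINS-1).
bin_index = pd.cut(df["number_of_journeys"], bins=BINS, labels=False)

styles = []
for i, d in df.groupby(bin_index):
    # The same bin index selects both the opacity and the width,
    # so each non-empty bin produces exactly one trace.
    styles.append({
        "opacity": float(opacity_a[i]),
        "line_width": float(width_a[i]),
        "rows": len(d),
    })
```

In the real figure, each entry would correspond to one go.Scattermapbox trace built from d, with opacity and line_width taken from the same bin.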
Opacity is set per trace. For markers it can be varied through the color using rgba(a,b,c,d), but not for lines. (The same applies to plain scatter plots.)
To demonstrate, I have used London Underground stations (filtered to reduce the number of nodes), and have gone to the extra effort of writing the data out as a CSV. Using JSON as the source has nothing to do with the solution.
number_of_journeys is binned, with one trace per bin, and a geometric progression is used to calculate the opacity of each bin.
This sample data set generates 83k sample lines.
import requests
import geopandas as gpd
import plotly.graph_objects as go
import itertools
import numpy as np
import pandas as pd
from pathlib import Path
# get geometry of london underground stations
gdf = gpd.GeoDataFrame.from_features(
requests.get(
"https://raw.githubusercontent.com/oobrien/vis/master/tube/data/tfl_stations.json"
).json()
)
# limit to zone 1 and stations that have larger number of lines going through them
gdf = gdf.loc[gdf["zone"].isin(["1","2","3","4","5","6"]) & gdf["lines"].apply(len).gt(0)].reset_index(
drop=True
).rename(columns={"id":"tfl_id", "name":"id"})
# wanna join all valid combinations of stations...
combis = np.array(list(itertools.combinations(gdf.index, 2)))
# generate dataframe of all combinations of stations
gdf_c = (
gdf.loc[combis[:, 0], ["geometry", "id"]]
.assign(right=combis[:, 1])
.merge(gdf.loc[:, ["geometry", "id"]], left_on="right", right_index=True, suffixes=("_start_station","_end_station"))
)
gdf_c["lat_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.y)
gdf_c["long_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.x)
gdf_c["lat_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.y)
gdf_c["long_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.x)
gdf_c = gdf_c.drop(
columns=[
"geometry_start_station",
"right",
"geometry_end_station",
]
).assign(number_of_journeys=np.random.randint(1,10**5,len(gdf_c)))
gdf_c
f = Path.cwd().joinpath("SO.csv")
gdf_c.to_csv(f, index=False)
# there's a requirement to start with a CSV even though no sample data has been provided, now we're starting with a CSV
df = pd.read_csv(f)
# makes use of ravel simpler...
df["none"] = None
# now it's simple to generate scattermapbox... a trace per required opacity
BINS=10
opacity_a = np.geomspace(0.001,1, BINS)
fig = go.Figure()
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS, labels=opacity_a)):
    fig.add_traces(
        go.Scattermapbox(
            name=f"{d['number_of_journeys'].mean():.2E}",
            lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
            lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
            line_color="blue",
            opacity=opacity,
            mode="lines+markers",
        )
    )
fig.update_layout(
mapbox={
"style": "carto-positron",
"center": {'lat': 51.520214996769255, 'lon': -0.097792388774743},
"zoom": 9,
},
margin={"l": 0, "r": 0, "t": 0, "b": 0},
)

Encoding a list column to the legend of a plot

Apologies in advance, I am not sure how to word this question best:
I am working with a large dataset, and I would like to plot Latitude and Longitude where the colour of the points (actually the opacity) is encoded to a 'FeatureType' column bound to the legend. This way I can use the legend to highlight on my map various features I am looking for.
Here is a picture of my map and legend so far
The problem is that in my dataset, the FeatureType column is a list of features that can be found there (i.e arch, bridge, etc..).
How can I make it so that the point shows up for both arch, and bridge. At the moment it creates its own category of (arch,bridge etc.), leading to over 300 combinations of about 20 different FeatureTypes.
The dataset can be found at http://atlantides.org/downloads/pleiades/dumps/pleiades-locations-latest.csv.gz
N.B: I am using altair/pandas
import altair as alt
import pandas as pd
from vega_datasets import data
df = pd.read_csv ('C://path/pleiades-locations.csv')
alt.data_transformers.enable('json')
countries = alt.topo_feature(data.world_110m.url, 'countries')
selection = alt.selection_multi(fields=['featureType'], bind='legend')
brush = alt.selection(type='interval', encodings=['x'])
map = alt.Chart(countries).mark_geoshape(
fill='lightgray',
stroke='white'
).project('equirectangular').properties(
width=500,
height=300
)
points = alt.Chart(df).mark_circle().encode(
alt.Latitude('reprLat:Q'),
alt.Longitude('reprLong:Q'),
alt.Color('featureType:N'),
tooltip=['featureType','timePeriodsKeys:N'],
opacity=alt.condition(selection, alt.value(1), alt.value(0.0))
).add_selection(
selection)
(map + points)
It is not possible for Altair to generate the labels you want from your current column format. You will need to turn your comma-separated string labels into lists and then explode the column so that you get one row per item in the list:
import altair as alt
import pandas as pd
from vega_datasets import data
alt.data_transformers.enable('data_server')
df = pd.read_csv('http://atlantides.org/downloads/pleiades/dumps/pleiades-locations-latest.csv.gz')[['reprLong', 'reprLat', 'featureType']]
df['featureType'] = df['featureType'].str.split(',')
df = df.explode('featureType')
countries = alt.topo_feature(data.world_110m.url, 'countries')
world_map = alt.Chart(countries).mark_geoshape(
fill='lightgray',
stroke='white')
points = alt.Chart(df).mark_circle(size=10).encode(
alt.Latitude('reprLat:Q'),
alt.Longitude('reprLong:Q'),
alt.Color('featureType:N', legend=alt.Legend(columns=2)))
world_map + points
Note that having this many entries in the legend is not meaningful since the colors are repeated. The interactivity would help with that somewhat, but I would consider splitting this up into multiple charts. I am not sure if it is even possible to expand the legend to show those hidden 81 entries. And double check that the long lat location corresponds correctly with the world map projection you are using, they seemed to move around when I changed the projection.
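To see what the split and explode steps do, here is a small sketch on made-up rows (the coordinates and feature types are invented for illustration). The extra str.strip() call is an assumption on my part: it only matters if the real data has spaces after the commas.

```python
import pandas as pd

# Made-up rows that mimic the shape of the real data.
toy = pd.DataFrame({
    "reprLat": [41.9, 37.97],
    "reprLong": [12.5, 23.72],
    "featureType": ["arch, bridge", "temple"],
})

# Turn the comma-separated string into a list, then give each item its own row.
toy["featureType"] = toy["featureType"].str.split(",")
toy = toy.explode("featureType")

# Remove stray whitespace so " bridge" and "bridge" become the same category.
toy["featureType"] = toy["featureType"].str.strip()
```

The first location now appears twice, once as arch and once as bridge, so a legend selection on either value will show it.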

Python: Add calculated lines to a scatter plot with a nested categorical x-axis

Cross-post: https://discourse.bokeh.org/t/add-calculated-horizontal-lines-corresponding-to-categories-on-the-x-axis/5544
I would like to duplicate this plot in Python:
Here is my attempt, using pandas and bokeh:
Imports:
import pandas as pd
from bokeh.io import output_notebook, show, reset_output
from bokeh.palettes import Spectral5, Turbo256
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Band, Span, FactorRange, ColumnDataSource
Create data:
fruits = ['Apples', 'Pears']
years = ['2015', '2016']
data = {'fruit' : fruits,
'2015' : [2, 1],
'2016' : [5, 3]}
fruit_df = pd.DataFrame(data).set_index("fruit")
tidy_df = (pd.DataFrame(data)
.melt(id_vars=["fruit"], var_name="year")
.assign(fruit_year=lambda df: list(zip(df['fruit'], df['year'])))
.set_index('fruit_year'))
Create bokeh plot:
p = figure(x_range=FactorRange(factors=tidy_df.index.unique()),
plot_height=400,
plot_width=400,
tooltips=[('Fruit', '@fruit'), # first string is user-defined; second string must refer to a column
('Year', '@year'),
('Value', '@value')])
cds = ColumnDataSource(tidy_df)
index_cmap = factor_cmap("fruit",
Spectral5[:2],
factors=sorted(tidy_df["fruit"].unique())) # this is a reference back to the dataframe
p.circle(x='fruit_year',
y='value',
size=20,
source=cds,
fill_color=index_cmap,
line_color=None,
)
# how do I add a median just to one categorical section?
median = Span(location=tidy_df.loc[tidy_df["fruit"] == "Apples", "value"].median(), # median value for Apples
#dimension='height',
line_color='red',
line_dash='dashed',
line_width=1.0
)
p.add_layout(median)
# how do I add this standard deviation(ish) band to just the Apples or Pears section?
band = Band(
base='fruit_year',
lower=2,
upper=4,
source=cds,
)
p.add_layout(band)
show(p)
Output:
Am I up against this issue? https://github.com/bokeh/bokeh/issues/8592
Is there any other data visualization library for Python that can accomplish this? Altair, Holoviews, Matplotlib, Plotly... ?
Band is a connected area, but your image of the desired output has two disconnected areas. Meaning, you actually need two bands. Take a look at the example here to better understand bands: https://docs.bokeh.org/en/latest/docs/user_guide/annotations.html#bands
By using Band(base='fruit_year', lower=2, upper=4, source=cds) you ask Bokeh to plot a band where for each value of fruit_year, the lower coordinate will be 2 and the upper coordinate will be 4. Which is exactly what you see on your Bokeh plot.
A bit unrelated but still a mistake - notice how your X axis is different from what you wanted. You have to specify the major category first, so replace list(zip(df['fruit'], df['year'])) with list(zip(df['year'], df['fruit'])).
Now, to the "how to" part. Since you need two separate bands, you cannot provide them with the same data source. The way to do it would be to have two extra data sources - one for each band. It ends up being something like this:
for year, sd in [('2015', 0.3), ('2016', 0.5)]:
    b_df = (tidy_df[tidy_df['year'] == year]
            .drop(columns=['year', 'fruit'])
            .assign(lower=lambda df: df['value'].min() - sd,
                    upper=lambda df: df['value'].max() + sd)
            .drop(columns='value'))
    p.add_layout(Band(base='fruit_year', lower='lower', upper='upper',
                      source=ColumnDataSource(b_df)))
There are two issues left however. The first one is a trivial one - the automatic Y range (an instance of DataRange1d class by default) will not take the bands' heights into account. So the bands can easily go out of bounds and be cropped by the plot. The solution here is to use manual ranging that takes the SD values into account.
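A minimal sketch of such manual ranging, assuming the largest SD from the loop above (0.5) and a small extra margin. The helper name padded_y_range and the margin value are hypothetical:

```python
def padded_y_range(values, max_sd, margin=0.1):
    """Return (low, high) limits that leave room for the bands.

    The limits extend past the data by the largest SD plus a small margin,
    so a band drawn at min - sd or max + sd is not cropped.
    """
    low = min(values) - max_sd - margin
    high = max(values) + max_sd + margin
    return low, high


# Values from the question's data; 0.5 is the largest SD used for the bands.
y_low, y_high = padded_y_range([2, 1, 5, 3], max_sd=0.5)
```

The pair can then be passed when creating the plot, for example figure(..., y_range=(y_low, y_high)), in place of the automatic range.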
The second issue is that the width of band is limited to the X range factors, meaning that the circles will be partially outside of the band. This one is not that easy to fix. Usually a solution would be to use a transform to just shift the coordinates a bit at the edges. But since this is a categorical axis, we cannot do it. One possible solution here is to create a custom Band model that adds an offset:
class MyBand(Band):
    # language=TypeScript
    __implementation__ = """
import {Band, BandView} from "models/annotations/band"
export class MyBandView extends BandView {
    protected _map_data(): void {
        super._map_data()
        const base_sx = this.model.dimension == 'height' ? this._lower_sx : this._lower_sy
        if (base_sx.length > 1) {
            const offset = (base_sx[1] - base_sx[0]) / 2
            base_sx[0] -= offset
            base_sx[base_sx.length - 1] += offset
        }
    }
}
export class MyBand extends Band {
    __view_type__: MyBandView
    static init_MyBand(): void {
        this.prototype.default_view = MyBandView
    }
}
    """
Just replace Band with MyBand in the code above and it should work. One caveat: you will need to have Node.js installed, and the startup time will be longer by a second or two because the custom model code needs to be compiled. Another caveat: the custom model code relies on the internals of BokehJS. That means that while it works with Bokeh 2.0.2, I can't guarantee that it will work with any other Bokeh version.
