I am currently working on a Data Visualization project.
I want to plot multiple lines (about 200k) that represent travels from one subway station to all the others. That is, every pair of subway stations should be connected by a straight line.
The color of the lines doesn't really matter (red, blue, etc.), but opacity matters most: the more journeys between two stations, the more opaque that line should be, and vice versa.
I feel I am close to the desired output, but can't figure out how to do it properly.
The DataFrame I am using (df = pd.read_csv(...)) has the following columns: id_start_station, id_end_station, lat_start_station, long_start_station, lat_end_station, long_end_station, number_of_journeys.
I extracted the coordinates with:
# interleave start, end and a separator so that each journey is its own segment
lons = np.empty(3 * len(df))
lons[::3] = df['long_start_station']
lons[1::3] = df['long_end_station']
lons[2::3] = None  # stored as NaN in a float array
lats = np.empty(3 * len(df))
lats[::3] = df['lat_start_station']
lats[1::3] = df['lat_end_station']
lats[2::3] = None  # stored as NaN in a float array
I then started a figure by:
fig = go.Figure()
and then added a trace by:
fig.add_trace(go.Scattermapbox(
    name='Journeys',
    lat=lats,
    lon=lons,
    mode='lines',
    line=dict(color='red', width=1),
    opacity= ¿?, # PROBLEM IS HERE [1]
))
[1] I tried a few different things to pass an opacity term.
First I created an array with the opacity of each segment:
opacity = np.empty(3 * len(df))
opacity[::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity[1::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity[2::3] = None
and passed it into [1], but got this error:
ValueError:
Invalid value of type 'numpy.ndarray' received for the 'opacity' property of scattermapbox
The 'opacity' property is a number and may be specified as:
- An int or float in the interval [0, 1]
I then thought of passing the opacity through the color instead, using the alpha channel of an rgba color, such as rgba(255,0,0,0.5).
So I first created a column with all the alpha values:
df['alpha'] = df['number_of_journeys'] / max(df['number_of_journeys'])
and then wrote a function that builds one rgba color string per point from those alpha values:
colors_with_opacity = []
def colors_with_opacity_func(df, empty_list):
    for alpha in df['alpha']:
        # two entries per journey (start and end point), then a None separator
        empty_list.extend(["rgba(255,0,0,"+str(alpha)+")"])
        empty_list.extend(["rgba(255,0,0,"+str(alpha)+")"])
        empty_list.append(None)
colors_with_opacity_func(df, colors_with_opacity)
and passed that list into the color attribute of the Scattermapbox line, but got the following error:
ValueError:
Invalid value of type 'builtins.list' received for the 'color' property of scattermapbox.line
The 'color' property is a color and may be specified as:
- A hex string (e.g. '#ff0000')
- An rgb/rgba string (e.g. 'rgb(255,0,0)')
- An hsl/hsla string (e.g. 'hsl(0,100%,50%)')
- An hsv/hsva string (e.g. 'hsv(0,100%,100%)')
- A named CSS color:
aliceblue, antiquewhite, aqua, [...] , whitesmoke,
yellow, yellowgreen
Since there is a massive number of lines, adding one trace per line in a loop would cause performance issues.
Any help will be much appreciated. I can't figure out how to accomplish this properly.
Thank you in advance.
EDIT 1 : NEW QUESTION ADDED
I am adding this question below because I believe it can help others looking into this particular topic.
Following Rob's helpful answer, I managed to add multiple opacities, as specified previously.
However, some of my colleagues suggested a change that would improve the visualization of the map.
Now, in addition to multiple opacities (one for each trace, according to the value in the dataframe), I would also like to have multiple widths (according to the same value in the dataframe).
That is, following Rob's answer, I would need something like this:
BINS_FOR_OPACITY=10
opacity_a = np.geomspace(0.001,1, BINS_FOR_OPACITY)
BINS_FOR_WIDTH=10
width_a = np.geomspace(1,3, BINS_FOR_WIDTH)
fig = go.Figure()
# Note the double "for" statement that follows
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_OPACITY, labels=opacity_a)):
    for width, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_WIDTH, labels=width_a)):
        fig.add_traces(
            go.Scattermapbox(
                name=f"{d['number_of_journeys'].mean():.2E}",
                lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
                lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
                line_width=width,
                line_color="blue",
                opacity=opacity,
                mode="lines+markers",
            )
        )
However, the above is clearly not working, as it creates many more traces than it should (I can't fully explain why, but I guess it is because of the double loop formed by the two for statements).
It occurred to me that the solution could be hiding in the pd.cut part, as I would need something like a double cut, but I couldn't find a way to do it properly.
I also managed to create a pandas Series with:
widths = pd.cut(df["size"], bins=BINS_FOR_WIDTH, labels=width_a)
and iterated over that Series, but got the same result as before (an excess of traces).
To clarify, I don't need only multiple opacities or only multiple widths; I need both at the same time, which is what is causing me trouble.
Again, any help is deeply appreciated.
opacity is set per trace. For markers it can be done per point with color using rgba(a,b,c,d), but not for lines. (The same is true in plain scatter plots.)
To demonstrate, I have used London Underground stations (filtered to reduce the number of nodes), and gone to the extra effort of formatting the data as a CSV. The JSON source has nothing to do with the solution.
number_of_journeys is binned, with one trace per bin, and a geometric progression is used for calculating the opacity of each bin.
This sample data set generates 83k sample lines.
import requests
import geopandas as gpd
import plotly.graph_objects as go
import itertools
import numpy as np
import pandas as pd
from pathlib import Path
# get geometry of london underground stations
gdf = gpd.GeoDataFrame.from_features(
    requests.get(
        "https://raw.githubusercontent.com/oobrien/vis/master/tube/data/tfl_stations.json"
    ).json()
)
# limit to stations in zones 1-6 that have at least one line going through them
gdf = gdf.loc[gdf["zone"].isin(["1","2","3","4","5","6"]) & gdf["lines"].apply(len).gt(0)].reset_index(
    drop=True
).rename(columns={"id":"tfl_id", "name":"id"})
# wanna join all valid combinations of stations...
combis = np.array(list(itertools.combinations(gdf.index, 2)))
# generate dataframe of all combinations of stations
gdf_c = (
    gdf.loc[combis[:, 0], ["geometry", "id"]]
    .assign(right=combis[:, 1])
    .merge(gdf.loc[:, ["geometry", "id"]], left_on="right", right_index=True, suffixes=("_start_station","_end_station"))
)
gdf_c["lat_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.y)
gdf_c["long_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.x)
gdf_c["lat_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.y)
gdf_c["long_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.x)
gdf_c = gdf_c.drop(
    columns=[
        "geometry_start_station",
        "right",
        "geometry_end_station",
    ]
).assign(number_of_journeys=np.random.randint(1,10**5,len(gdf_c)))
gdf_c
f = Path.cwd().joinpath("SO.csv")
gdf_c.to_csv(f, index=False)
# there's a requirement to start with a CSV even though no sample data has been provided, so now we're starting with a CSV
df = pd.read_csv(f)
# makes use of ravel simpler...
df["none"] = None
# now it's simple to generate scattermapbox... a trace per required opacity
BINS=10
opacity_a = np.geomspace(0.001,1, BINS)
fig = go.Figure()
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS, labels=opacity_a)):
    fig.add_traces(
        go.Scattermapbox(
            name=f"{d['number_of_journeys'].mean():.2E}",
            lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
            lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
            line_color="blue",
            opacity=opacity,
            mode="lines+markers",
        )
    )
fig.update_layout(
    mapbox={
        "style": "carto-positron",
        "center": {'lat': 51.520214996769255, 'lon': -0.097792388774743},
        "zoom": 9,
    },
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
)
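On the follow-up about varying width as well as opacity: nesting two groupby loops re-groups the whole dataframe inside every opacity bin, so it produces up to 10 x 10 traces and draws every journey once per opacity. A minimal sketch of one way around that, assuming both properties should follow the same number_of_journeys bins, is to cut once and look up both values from the bin index. The helper name trace_specs and the tiny demo frame are made up for illustration, and the block only builds the keyword arguments rather than calling plotly.

```python
import numpy as np
import pandas as pd

BINS = 10
# One opacity and one width per bin, so each bin maps to exactly one trace.
opacity_a = np.geomspace(0.001, 1, BINS)
width_a = np.geomspace(1, 3, BINS)


def trace_specs(df, bins=BINS):
    """Return one dict of Scattermapbox keyword arguments per non-empty bin (hypothetical helper)."""
    # labels=False yields the integer bin index (0 .. bins-1) for every row.
    bin_idx = pd.cut(df["number_of_journeys"], bins=bins, labels=False)
    specs = []
    for i, d in df.groupby(bin_idx):
        i = int(i)
        gap = np.full(len(d), np.nan)  # NaN separator between journeys
        specs.append(
            dict(
                name=f"{d['number_of_journeys'].mean():.2E}",
                lat=np.column_stack([d["lat_start_station"], d["lat_end_station"], gap]).ravel(),
                lon=np.column_stack([d["long_start_station"], d["long_end_station"], gap]).ravel(),
                line_width=float(width_a[i]),
                line_color="blue",
                opacity=float(opacity_a[i]),
                mode="lines",
            )
        )
    return specs


# Made-up demo rows, only to show the shape of the result.
demo = pd.DataFrame(
    {
        "lat_start_station": [51.50, 51.51, 51.52, 51.53],
        "long_start_station": [-0.10, -0.11, -0.12, -0.13],
        "lat_end_station": [51.60, 51.61, 51.62, 51.63],
        "long_end_station": [-0.20, -0.21, -0.22, -0.23],
        "number_of_journeys": [1, 50, 100, 100],
    }
)
specs = trace_specs(demo)
```

Each dict could then be passed on as fig.add_traces(go.Scattermapbox(**spec)), which keeps the number of traces at no more than BINS.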
I'm using healpy to plot the locations of galaxies on the sky from a list of RAs and Decs. So far, I think I've been able to correctly plot the galaxies, but I'd like to improve the finished product. Is there any way to bin the number of galaxies that appear in each healpy tile, rather than just coloring being based on whether there is or isn't a catalog member in the tile?
Here is the image that I'm currently making; right now it's only really useful for telling you where the Milky Way isn't. Here's the code I'm using.
phis = [np.deg2rad(ra) for ra in ra_list]
thetas = [np.pi / 2 - np.deg2rad(dec) for dec in dec_list]
pixel_indices = hp.ang2pix(NSIDE, thetas, phis)
m = np.zeros(hp.nside2npix(NSIDE))
m[pixel_indices] = np.ones(num_galaxies_to_plot)
hp.mollview(m, title = 'Sky Locations of GLADE Galaxies', cbar = False, rot=(180, 0, 180), cmap = 'binary')
hp.graticule()
You could use numpy.bincount to create an array of the number of galaxies per pixel and then create a map of that.
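A minimal sketch of that idea, using numpy only so it runs without healpy. The function name galaxy_count_map and the sample indices are made up; in the question's code, pixel_indices would come from hp.ang2pix(NSIDE, thetas, phis) and npix from hp.nside2npix(NSIDE). Note that m[pixel_indices] += 1 would not do the same thing, because repeated indices are only incremented once in a fancy-indexed assignment.

```python
import numpy as np


def galaxy_count_map(pixel_indices, npix):
    """Count how many catalog members fall into each pixel (hypothetical helper)."""
    # minlength pads the result so that empty pixels at the end are kept as zeros.
    return np.bincount(pixel_indices, minlength=npix).astype(float)


# Made-up pixel indices; pixel 3 contains three galaxies.
pixel_indices = np.array([0, 3, 3, 7, 3])
m = galaxy_count_map(pixel_indices, npix=12)  # NSIDE = 1 corresponds to 12 pixels
print(m)
```

The resulting m should be usable in place of the 0/1 map in the hp.mollview call, probably with cbar = True and a non-binary colormap so the counts can be read off.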
I'm looking for a way to specify an X-Y offset to plotted data points. I'm just getting into Altair, so please bear with me.
The situation: I have a dataset recording daily measurements for 30 people. Every person can register several different types of measurements every day.
Example dataset & plot, with 2 people and 2 measurement types:
import numpy as np
import pandas as pd
# pd.np is no longer available in recent pandas, so numpy is imported directly
df = pd.DataFrame.from_dict({"date": pd.to_datetime(pd.date_range("2019-12-01", periods=5).repeat(4)),
                             "person": np.tile(["Bob", "Amy"], 10),
                             "measurement_type": np.tile(["score_a", "score_a", "score_b", "score_b"], 5),
                             "value": 20.0*np.random.random(size=20)})
import altair as alt
alt.Chart(df, width=600, height=100) \
   .mark_circle(size=150) \
   .encode(x = "date",
           y = "person",
           color = alt.Color("value"))
This gives me this graph:
In the example above, the 2 measurement types are plotted on top of each other. I would like to add an offset to the circles depending on the "measurement_type" column, so that they can all be made visible around the date-person location in the graph.
Here's a mockup of what I want to achieve:
I've been searching the docs but haven't figured out how to do this - been experimenting with the "stack" option, with the dx and dy options, ...
I have a feeling this should just be another encoding channel (offset or alike), but that doesn't exist.
Can anyone point me in the right direction?
There is currently no concept of an offset encoding in Altair, so the best approach to this will be to combine a row encoding with a y encoding, similar to the Grouped Bar Chart example in Altair's documentation:
alt.Chart(df,
    width=600, height=100
).mark_circle(
    size=150
).encode(
    x = "date",
    row='person',
    y = "measurement_type",
    color = alt.Color("value")
)
You can then fine-tune the look of the result using standard chart configuration settings:
alt.Chart(df,
    width=600, height=alt.Step(25)
).mark_circle(
    size=150
).encode(
    x = "date",
    row='person',
    y = alt.Y("measurement_type", title=None),
    color = alt.Color("value")
).configure_facet(
    spacing=10
).configure_view(
    strokeOpacity=0
)
Well, I don't know what result you are getting up until now, but maybe write a function with parameters like
def chart(DotsOnXAxis, FirstDotsOnYAxis, SecondDotsOnYAxis, OffsetAmount)
and then put those variables in the right place.
If you want an offset between the dots, maybe use something like: SecondDotsOnYAxis = FirstDotsOnYAxis + OffsetAmount
I need to plot one categorical variable over multiple numeric variables.
My DataFrame looks like this:
     party  media_user  business_user      POLI      mass
0  Party_a    0.513999       0.404201  0.696948  0.573476
1  Party_b    0.437972       0.306167  0.432377  0.433618
2  Party_c    0.519350       0.367439  0.704318  0.576708
3  Party_d    0.412027       0.253227  0.353561  0.392207
4  Party_e    0.479891       0.380711  0.683606  0.551105
And I would like a scatter plot with a different color for each variable, e.g. one point per party for each of [media_user, business_user, POLI, mass], each in a different color.
So like this just with scatters instead of bars:
The closest I've come is this
sns.catplot(x="party", y="media_user", jitter=False, data=sns_df, height = 4, aspect = 5);
producing:
By messing around with some other graphs I found that by simply adding linestyle = '' I could remove the line and add markers. Hope this may help somebody else!
sim_df.plot(figsize = (15,5), linestyle = '', marker = 'o')
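An alternative sketch, not the method used above: if you would rather stay with seaborn, reshaping the frame from wide to long form gives a column that can be mapped to hue. The frame below reuses the first two rows from the question; the column names variable and value are my own choice.

```python
import pandas as pd

# First two rows of the DataFrame shown in the question.
sns_df = pd.DataFrame(
    {
        "party": ["Party_a", "Party_b"],
        "media_user": [0.513999, 0.437972],
        "business_user": [0.404201, 0.306167],
        "POLI": [0.696948, 0.432377],
        "mass": [0.573476, 0.433618],
    }
)

# Wide -> long: one row per (party, variable) pair.
long_df = sns_df.melt(id_vars="party", var_name="variable", value_name="value")
print(long_df)
```

With that frame, something like sns.catplot(x="party", y="value", hue="variable", jitter=False, data=long_df) should draw one colored point per variable for each party.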
I'm trying to create a plot of classification accuracy for three ML models, depending on the number of features used from the data (the number of features used is from 1 to 75, ranked according to a feature selection method). I did 100 iterations of calculating the accuracy output for each model and for each "# of features used". Below is what my data looks like (clsf from 0 to 2, timepoint from 1 to 75):
[image: data]
I am then calling the seaborn function as shown in documentation files.
sns.lineplot(x= "timepoint", y="acc", hue="clsf", data=ttest_df, ci= "sd", err_style = "band")
The plot comes out like this:
[image: plot]
I wanted there to be confidence intervals for each point on the x-axis, and don't know why it is not working. I have 100 y values for each x value, so I don't see why it cannot calculate/show it.
You could try your data set using Seaborn's pointplot function instead. It's specifically for showing an indication of uncertainty around a scatter plot of points. By default pointplot will connect values by a line. This is fine if the categorical variable is ordinal in nature, but it can be a good idea to remove the line via linestyles = "" for nominal data. (I used join = False in my example)
I tried to recreate your notebook to give a visual, but wasn't able to get the confidence interval in my plot exactly as you describe. I hope this is helpful for you.
import seaborn as sb
sb.set(style="darkgrid")
sb.pointplot(x = 'timepoint', y = 'acc', hue = 'clsf',
             data = ttest_df, ci = 'sd', palette = 'magma',
             join = False);
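As a quick sanity check on the data itself, you could also compute the mean and standard deviation per point with pandas. This is only a sketch with made-up numbers: the column names timepoint, acc and clsf come from the question, everything else is synthetic. If std is tiny compared with the range of the y-axis, or count is 1, the band would be too thin to see.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for ttest_df: 3 classifiers x 75 feature counts x 100 iterations.
ttest_df = pd.DataFrame(
    {
        "clsf": np.repeat(np.arange(3), 75 * 100),
        "timepoint": np.tile(np.repeat(np.arange(1, 76), 100), 3),
        "acc": rng.uniform(0.5, 1.0, size=3 * 75 * 100),
    }
)

# One row per (clsf, timepoint) with the values an sd band would be built from.
summary = ttest_df.groupby(["clsf", "timepoint"])["acc"].agg(["mean", "std", "count"])
print(summary.head())
```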