Python Altair Radial Plot for Multiple Variables

I'm trying to make a radial plot using altair and streamlit to display two parameters about various car manufacturers. My data frame looks like this:
make            count   value     value_int
MERCEDES-BENZ    8637   21.0283          21
FORD             9405   22.0488          22
SKODA           10617   26.6724          26
TOYOTA          10903   24.2498          24
VOLKSWAGEN      11178   25.4069          25
OPEL            11672   28.5445          28
and my code looks like this:
# d3 is the data frame above
p1 = alt.Chart(d3).encode(
    theta = alt.Theta('value:Q', stack = True),
    radius = alt.Radius('count', scale = alt.Scale(zero = True, rangeMin = 20)),
    color = alt.Color('make:N', legend = alt.Legend(orient = 'top'))
).properties(
    height = 600
)
p11 = p1.mark_arc()
st.altair_chart(p11)
and my output looks like this:
I expect to see radial bars, colored by make, with one of count or value on the radius and the other on theta (I don't really care which way round they are, although value would preferably be on the radius). When I set theta and radius to the same column, the colours plot as expected, but the chart carries less information about the data set. I tried making the values integers and this didn't change anything. I tried changing the scale from sqrt to linear (linear shown) and again saw no difference.
Any thoughts? It seems like it should be relatively simple to make two different columns drive two different parameters of a graph. I expect it's something relatively simple that I've missed.
I've tried changing the type of values being plotted and various graphical parameters within Altair and Streamlit; every attempt ended up not plotting both variables across the whole set.

The issue with your script is that the dataframe was not sorted in descending order by the "count" column.
Because the OPEL arc was drawn last, it covered up the other arcs, making it appear as if there was only one colour in the chart.
Here is the modified script with the data table sorted properly; I am not sure if it is exactly what you wanted:
d3 = pd.DataFrame({
    'make': ['MERCEDES-BENZ', 'FORD', 'SKODA', 'TOYOTA', 'VOLKSWAGEN', 'OPEL'],
    'count': [8637, 9405, 10617, 10903, 11178, 11672],
    'value': [21.0283, 22.0488, 26.6724, 24.2498, 25.4069, 28.5445],
    'value_int': [21, 22, 26, 24, 25, 28]
})

# Sort the data table by the "count" column in descending order
d3 = d3.sort_values('count', ascending=False)

p1 = alt.Chart(d3).encode(
    theta=alt.Theta('value:Q', stack=True),
    radius=alt.Radius('count', scale=alt.Scale(zero=True, rangeMin=20)),
    color=alt.Color('make:N', legend=alt.Legend(orient='top'))
).properties(
    height=600
)
p11 = p1.mark_arc()
st.altair_chart(p11)

Related

Multiple opacities in Mapbox - Plotly for Python

I am currently working on a Data Visualization project.
I want to plot multiple lines (about 200k) that represent travels from one Subway Station to all the others. That is, each subway station should be connected to all the others by a straight line.
The color of the line doesn't really matter (it could well be red, blue, etc.), but opacity is what matters the most. The higher the number of journeys between two given stations, the higher the opacity of that particular line, and vice versa.
I feel I am close to the desired output, but can't figure out a way to do it properly.
The DataFrame I am using (df = pd.read_csv(...)) consists of a series of columns, namely: id_start_station, id_end_station, lat_start_station, long_start_station, lat_end_station, long_end_station, number_of_journeys.
I managed to extract the coordinates with:
lons = []
lons = np.empty(3 * len(df))
lons[::3] = df['long_start_station']
lons[1::3] = df['long_end_station']
lons[2::3] = None
lats = []
lats = np.empty(3 * len(df))
lats[::3] = df['lat_start_station']
lats[1::3] = df['lat_end_station']
lats[2::3] = None
I then started a figure by:
fig = go.Figure()
and then added a trace by:
fig.add_trace(go.Scattermapbox(
    name='Journeys',
    lat=lats,
    lon=lons,
    mode='lines',
    line=dict(color='red', width=1),
    opacity= ¿?,  # PROBLEM IS HERE [1]
))
[1] So I tried a few different things to pass an opacity term:
I created a new array with an opacity value for each coordinate, by:
opacity = []
opacity = np.empty(3 * len(df))
opacity[::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity[1::3] = df['number_of_journeys'] / max(df['number_of_journeys'])
opacity[2::3] = None
and passed it into [1], but this error came out:
ValueError:
Invalid value of type 'numpy.ndarray' received for the 'opacity' property of scattermapbox
The 'opacity' property is a number and may be specified as:
- An int or float in the interval [0, 1]
I then thought of passing the "opacity" term into the "color" term, by using rgba's property alpha, such as: rgba(255,0,0,0.5).
So I first created a "map" of all alpha parameters:
df['alpha'] = df['number_of_journeys'] / max(df['number_of_journeys'])
and then created a function to retrieve all the alpha parameters inside a specific color:
colors_with_opacity = []

def colors_with_opacity_func(df, empty_list):
    for alpha in df['alpha']:
        empty_list.extend(["rgba(255,0,0," + str(alpha) + ")"])
        empty_list.extend(["rgba(255,0,0," + str(alpha) + ")"])
        empty_list.append(None)

colors_with_opacity_func(df, colors_with_opacity)
and passed that into the color attribute of the Scattermapbox, but got the following error:
ValueError:
Invalid value of type 'builtins.list' received for the 'color' property of scattermapbox.line
The 'color' property is a color and may be specified as:
- A hex string (e.g. '#ff0000')
- An rgb/rgba string (e.g. 'rgb(255,0,0)')
- An hsl/hsla string (e.g. 'hsl(0,100%,50%)')
- An hsv/hsva string (e.g. 'hsv(0,100%,100%)')
- A named CSS color:
aliceblue, antiquewhite, aqua, [...] , whitesmoke,
yellow, yellowgreen
Since there is a massive number of lines, looping / iterating through individual traces would cause performance issues.
Any help will be much appreciated; I can't figure out a way to accomplish this properly.
Thank you in advance.
EDIT 1 : NEW QUESTION ADDED
I add this question here below as I believe it can help others that are looking for this particular topic.
Following Rob's helpful answer, I managed to add multiple opacities, as specified previously.
However, some of my colleagues suggested a change that would improve the visualization of the map.
Now, in addition to the multiple opacities (one for each trace, according to the value in the dataframe), I would also like to have multiple widths (according to the same value in the dataframe).
That is, following Rob's answer, I would need something like this:
BINS_FOR_OPACITY=10
opacity_a = np.geomspace(0.001,1, BINS_FOR_OPACITY)
BINS_FOR_WIDTH=10
width_a = np.geomspace(1,3, BINS_FOR_WIDTH)
fig = go.Figure()
# Note the double "for" statement that follows
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_OPACITY, labels=opacity_a)):
for width, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS_FOR_WIDTH, labels=width_a)):
fig.add_traces(
go.Scattermapbox(
name=f"{d['number_of_journeys'].mean():.2E}",
lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
line_width=width
line_color="blue",
opacity=opacity,
mode="lines+markers",
)
)
However, the above is clearly not working, as it creates many more traces than it should (I really can't explain why, but I guess it might be because of the double loop forced by the two for statements).
It occurred to me that some kind of solution could be hiding in the pd.cut part, as I would need something like a double cut, but I couldn't find a way of doing it properly.
I also managed to create a Pandas series by:
widths = pd.cut(df["size"], bins=BINS_FOR_WIDTH, labels=width_a)
and iterating over that series, but got the same result as before (an excess of traces).
To emphasize and clarify: I don't need just multiple opacities or just multiple widths; I need both at the same time, which is what's causing me trouble.
Again, any help is deeply thanked.
opacity is a per-trace property; for markers it can be varied per point via the color channel using rgba(r,g,b,a), but not for lines (the same applies to regular scatter plots).
To demonstrate, I have used London Underground stations (filtered to reduce the number of nodes), and gone to the extra effort of formatting the data as a CSV; the fact that the source is JSON has nothing to do with the solution.
number_of_journeys is encoded into bins, each bin becomes a trace, and a geometric progression is used to calculate an opacity for each bin.
This sample data set generates about 83k lines.
import requests
import geopandas as gpd
import plotly.graph_objects as go
import itertools
import numpy as np
import pandas as pd
from pathlib import Path
# get geometry of london underground stations
gdf = gpd.GeoDataFrame.from_features(
    requests.get(
        "https://raw.githubusercontent.com/oobrien/vis/master/tube/data/tfl_stations.json"
    ).json()
)

# limit to zones 1-6 and stations that have at least one line going through them
gdf = (
    gdf.loc[gdf["zone"].isin(["1", "2", "3", "4", "5", "6"]) & gdf["lines"].apply(len).gt(0)]
    .reset_index(drop=True)
    .rename(columns={"id": "tfl_id", "name": "id"})
)

# wanna join all valid combinations of stations...
combis = np.array(list(itertools.combinations(gdf.index, 2)))

# generate dataframe of all combinations of stations
gdf_c = (
    gdf.loc[combis[:, 0], ["geometry", "id"]]
    .assign(right=combis[:, 1])
    .merge(
        gdf.loc[:, ["geometry", "id"]],
        left_on="right",
        right_index=True,
        suffixes=("_start_station", "_end_station"),
    )
)
gdf_c["lat_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.y)
gdf_c["long_start_station"] = gdf_c["geometry_start_station"].apply(lambda g: g.x)
gdf_c["lat_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.y)
gdf_c["long_end_station"] = gdf_c["geometry_end_station"].apply(lambda g: g.x)
gdf_c = gdf_c.drop(
    columns=[
        "geometry_start_station",
        "right",
        "geometry_end_station",
    ]
).assign(number_of_journeys=np.random.randint(1, 10**5, len(gdf_c)))
gdf_c
f = Path.cwd().joinpath("SO.csv")
gdf_c.to_csv(f, index=False)
# there's a requirement to start with a CSV even though no sample data has been provided, so now we're starting with a CSV
df = pd.read_csv(f)
# an all-None separator column makes the use of ravel below simpler...
df["none"] = None
# now it's simple to generate scattermapbox... a trace per required opacity
BINS=10
opacity_a = np.geomspace(0.001,1, BINS)
fig = go.Figure()
for opacity, d in df.groupby(pd.cut(df["number_of_journeys"], bins=BINS, labels=opacity_a)):
fig.add_traces(
go.Scattermapbox(
name=f"{d['number_of_journeys'].mean():.2E}",
lat=np.ravel(d.loc[:,[c for c in df.columns if "lat" in c or c=="none"]].values),
lon=np.ravel(d.loc[:,[c for c in df.columns if "long" in c or c=="none"]].values),
line_color="blue",
opacity=opacity,
mode="lines+markers",
)
)
fig.update_layout(
    mapbox={
        "style": "carto-positron",
        "center": {'lat': 51.520214996769255, 'lon': -0.097792388774743},
        "zoom": 9,
    },
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
)
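Regarding the EDIT 1 follow-up: this is not part of the original answer, just a hedged sketch. Rather than nesting two for loops, number_of_journeys can be cut twice into hypothetical helper columns (here called opacity_bin and width_bin) and grouped on both at once, so each trace carries exactly one opacity and one width:
BINS_FOR_OPACITY = 10
BINS_FOR_WIDTH = 10
opacity_a = np.geomspace(0.001, 1, BINS_FOR_OPACITY)
width_a = np.geomspace(1, 3, BINS_FOR_WIDTH)

# one bin label per row for each of the two visual channels
df["opacity_bin"] = pd.cut(df["number_of_journeys"], bins=BINS_FOR_OPACITY, labels=opacity_a)
df["width_bin"] = pd.cut(df["number_of_journeys"], bins=BINS_FOR_WIDTH, labels=width_a)

fig = go.Figure()
# observed=True skips (opacity, width) combinations that contain no rows
for (opacity, width), d in df.groupby(["opacity_bin", "width_bin"], observed=True):
    fig.add_trace(
        go.Scattermapbox(
            name=f"{d['number_of_journeys'].mean():.2E}",
            lat=np.ravel(d.loc[:, [c for c in df.columns if "lat" in c or c == "none"]].values),
            lon=np.ravel(d.loc[:, [c for c in df.columns if "long" in c or c == "none"]].values),
            line=dict(color="blue", width=float(width)),
            opacity=float(opacity),
            mode="lines+markers",
        )
    )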

Using Healpy to make star chart

I'm using healpy to plot the locations of galaxies on the sky from a list of RAs and Decs. So far, I think I've been able to correctly plot the galaxies, but I'd like to improve the finished product. Is there any way to bin the number of galaxies that fall in each healpy tile, rather than the coloring just being based on whether or not there is a catalog member in the tile?
Here I show the image that I'm currently making —
right now it's only really useful for telling you where the Milky Way isn't. Here's the code I'm using.
phis = [np.deg2rad(ra) for ra in ra_list]
thetas = [np.pi / 2 - np.deg2rad(dec) for dec in dec_list]
pixel_indices = hp.ang2pix(NSIDE, thetas, phis)
m = np.zeros(hp.nside2npix(NSIDE))
m[pixel_indices] = np.ones(num_galaxies_to_plot)
hp.mollview(m, title = 'Sky Locations of GLADE Galaxies', cbar = False, rot=(180, 0, 180), cmap = 'binary')
hp.graticule()
You could use numpy.bincount to create an array of the number of galaxies per pixel and then create a map of that.
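A minimal sketch of that approach (reusing NSIDE, ra_list and dec_list from the question):
import numpy as np
import healpy as hp

phis = np.deg2rad(ra_list)
thetas = np.pi / 2 - np.deg2rad(dec_list)
pixel_indices = hp.ang2pix(NSIDE, thetas, phis)

# count how many galaxies fall into each pixel; minlength pads the array out to the full map
m = np.bincount(pixel_indices, minlength=hp.nside2npix(NSIDE))

hp.mollview(m, title='Sky Locations of GLADE Galaxies', rot=(180, 0, 180), cmap='binary')
hp.graticule()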

Adding X-Y offsets to data points

I'm looking for a way to specify an X-Y offset to plotted data points. I'm just getting into Altair, so please bear with me.
The situation: I have a dataset recording daily measurements for 30 people. Every person can register several different types of measurements every day.
Example dataset & plot, with 2 people and 2 measurement types:
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({"date": pd.to_datetime(pd.date_range("2019-12-01", periods=5).repeat(4)),
                             "person": np.tile(["Bob", "Amy"], 10),
                             "measurement_type": np.tile(["score_a", "score_a", "score_b", "score_b"], 5),
                             "value": 20.0 * np.random.random(size=20)})
import altair as alt

alt.Chart(df, width=600, height=100) \
    .mark_circle(size=150) \
    .encode(x="date",
            y="person",
            color=alt.Color("value"))
This gives me this graph:
In the example above, the 2 measurement types are plotted on top of each other. I would like to add an offset to the circles depending on the "measurement_type" column, so that they can all be made visible around the date-person location in the graph.
Here's a mockup of what I want to achieve:
I've been searching the docs but haven't figured out how to do this - been experimenting with the "stack" option, with the dx and dy options, ...
I have a feeling this should just be another encoding channel (offset or the like), but that doesn't exist.
Can anyone point me in the right direction?
There is currently no concept of an offset encoding in Altair, so the best approach to this will be to combine a column encoding with a y encoding, similar to the Grouped Bar Chart example in Altair's documentation:
alt.Chart(df,
    width=600, height=100
).mark_circle(
    size=150
).encode(
    x="date",
    row="person",
    y="measurement_type",
    color=alt.Color("value")
)
You can then fine-tune the look of the result using standard chart configuration settings:
alt.Chart(df,
    width=600, height=alt.Step(25)
).mark_circle(
    size=150
).encode(
    x="date",
    row="person",
    y=alt.Y("measurement_type", title=None),
    color=alt.Color("value")
).configure_facet(
    spacing=10
).configure_view(
    strokeOpacity=0
)
Well, I don't know what result you are getting up until now, but maybe write a function with parameters like def chart(DotsOnXAxis, FirstDotsOnYAxis, SecondDotsOnYAxis, OffsetAmount)
and then put those variables in the right place.
If you want an offset between the dots, you could use something like: SecondDotsOnYAxis = FirstDotsOnYAxis + OffsetAmount

Plotting categorical variable over multiple numeric variables in python

I need to plot one categorical variable over multiple numeric variables.
My DataFrame looks like this:
party media_user business_user POLI mass
0 Party_a 0.513999 0.404201 0.696948 0.573476
1 Party_b 0.437972 0.306167 0.432377 0.433618
2 Party_c 0.519350 0.367439 0.704318 0.576708
3 Party_d 0.412027 0.253227 0.353561 0.392207
4 Party_e 0.479891 0.380711 0.683606 0.551105
And I would like a scatter plot with different colors for the different variables; e.g. one point per party for each of [media_user, business_user, POLI, mass], each in a different color.
So like this just with scatters instead of bars:
The closest I've come is this
sns.catplot(x="party", y="media_user", jitter=False, data=sns_df, height = 4, aspect = 5);
producing:
By messing around with some other graphs I found that by simply adding linestyle = '' I could remove the line and add markers. Hope this may help somebody else!
sim_df.plot(figsize = (15,5), linestyle = '', marker = 'o')
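Applied to the dataframe from the question (a hedged example, assuming it is the sns_df used in the catplot call above), setting the categorical column as the index gives one colored marker series per numeric column:
# 'party' becomes the x-axis; each numeric column becomes its own colored marker series
sns_df.set_index("party").plot(figsize=(15, 5), linestyle='', marker='o')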

How to make the confidence interval (error bands) show on seaborn lineplot

I'm trying to create a plot of classification accuracy for three ML models, depending on the number of features used from the data (the number of features used is from 1 to 75, ranked according to a feature selection method). I did 100 iterations of calculating the accuracy output for each model and for each "# of features used". Below is what my data looks like (clsf from 0 to 2, timepoint from 1 to 75):
data
I am then calling the seaborn function as shown in documentation files.
sns.lineplot(x= "timepoint", y="acc", hue="clsf", data=ttest_df, ci= "sd", err_style = "band")
The plot comes out like this:
plot
I wanted there to be confidence intervals for each point on the x-axis, and don't know why it is not working. I have 100 y values for each x value, so I don't see why it cannot calculate/show it.
You could try your data set using Seaborn's pointplot function instead. It's specifically for showing an indication of uncertainty around a scatter plot of points. By default pointplot will connect values by a line. This is fine if the categorical variable is ordinal in nature, but it can be a good idea to remove the line via linestyles = "" for nominal data. (I used join = False in my example)
I tried to recreate your notebook to give a visual, but wasn't able to get the confidence interval in my plot exactly as you describe. I hope this is helpful for you.
import seaborn as sb

sb.set(style="darkgrid")
sb.pointplot(x='timepoint', y='acc', hue='clsf',
             data=ttest_df, ci='sd', palette='magma',
             join=False);
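The linestyles variant mentioned above would look like this (same call as before, reusing the sb alias and ttest_df; untested against your data):
# drop the connecting lines via an empty linestyle instead of join=False
sb.pointplot(x='timepoint', y='acc', hue='clsf',
             data=ttest_df, ci='sd', palette='magma',
             linestyles="");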
