Setting group order on pySankey sankey chart - python

I'm trying to use a Sankey chart to show changes in user segmentation with pySankey, but the class order is the opposite of what I want. Is there a way to specify the order in which the classes are drawn?
Here is the code I'm using (a dummy version):
import numpy as np
import pandas as pd
from pySankey.sankey import sankey  # import path may differ between pySankey versions
test_df = pd.DataFrame({
    'curr_seg': np.repeat(['A', 'B', 'C', 'D'], 4),
    'new_seg': ['A', 'B', 'C', 'D'] * 4,
    'num_users': np.random.randint(low=10, high=20, size=16)
})
sankey(
    left=test_df["curr_seg"], right=test_df["new_seg"],
    leftWeight=test_df["num_users"], rightWeight=test_df["num_users"],
    aspect=20, fontsize=20
)
Which produces this chart:
I want the A class first and the D class last on both the left and right axes. Does anybody know how I can set this up? Thank you very much.

There is a bug in the first line of the check_data_matches_labels function. Change it to the following:
if len(labels) > 0:
Then you can use leftLabels and rightLabels to control the order.
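As a rough illustration of why that one-line change matters, here is a minimal sketch using hypothetical stand-in functions, not pySankey's actual source. It assumes the faulty check had its closing parenthesis misplaced, i.e. `len(labels > 0)`, which in Python 3 fails as soon as a plain list of labels is passed.

```python
def labels_given_buggy(labels):
    # Assumed form of the faulty check: compares the list itself with 0,
    # which raises TypeError for a plain Python list.
    return len(labels > 0)


def labels_given_fixed(labels):
    # Corrected check from the answer: compare the length with 0.
    return len(labels) > 0


order = ['A', 'B', 'C', 'D']

try:
    labels_given_buggy(order)
    buggy_raised = False
except TypeError:
    buggy_raised = True

fixed_result = labels_given_fixed(order)
```

With the check fixed, the same list could be passed as both leftLabels and rightLabels in the sankey(...) call from the question. Whether the first label ends up at the top or the bottom is worth checking on a small example; the list can be reversed if needed.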

Related

Plotting a map using Geoview and using size/ colour option

I'm trying to visualize a dataset which I've filtered down to just longitude/latitude, country name, year and a count of deaths. I'm trying to plot it using geoviews because I want to add a lot more to my dataset, and an interactive map would be a great add-on.
My code is as follows: (for_plot is the dataframe)
# Plotting the graph
Best = gv.Dataset(for_plot)
points = Best.to(gv.Points, ['longitude', 'latitude'], ['deaths', 'country'])
(gts.Wikipedia * points).opts(
    opts.Points(width=600, height=350, tools=['hover'],
                size='deaths', cmap='viridis'))
This creates a perfect graph, but the 'size' option doesn't work. If I change size to color, the graph is not generated. I'm okay with either, but I need at least one of them to work.
Thanks for any help.
I tried using color instead of size; it works with year but not with deaths.

how to replicate plot: density bar plot in Python

I'm working on a project and would like to plot my data in a similar way to this example from a book:
So I would like to create a density histogram for my categorical features (left image) and then add a separate column for each value of another feature (middle and right image).
In my case the feature I want to plot is called [district_code] and I would like to create columns based on a feature called [status_group]
What I've tried so far:
sns.kdeplot(data = raw, x = "district_code"): problem, it is a line plot, not a histogram
sns.kdeplot(data = raw, x = "district_code", col = "status_group"): problem, you can't use the col argument for this plottype
sns.displot(raw, x="district_code", col = 'status_group'): problem, col argument works, but it creates a countplot, not a density plot
I would really appreciate some suggestions about the correct code I could use.
This is just an example for one of my categorical features, but I have many more I would like to plot. Any suggestions on how to turn this into a function where I could run the code for a list of categorical features would be highly appreciated.
UPDATE:
sns.displot(raw, x="source_class", stat = 'density', col = 'status_group', color = 'black'): works, but looks a bit awkward for some features.
How could I improve this?
Good:
Not so good:

How to change titles (facet_col )in imshow (plotly)

I want to plot several images with imshow from plotly and give each image a title.
My code is
fig = px.imshow(numpy.array([img1, img2]), color_continuous_scale='gray', facet_col=0, labels={'facet_col' : 'status'})
fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
fig.show()
and the result looks like
However, I would like to replace status=0 with original and status=1 with clean. Is there an easy way to achieve this result?
Thanks for any help.
I solved my problem with:
fig = px.imshow(
    numpy.array([img1, img2]),
    color_continuous_scale='gray',
    facet_col=0
)
fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
for i, label in enumerate(['original', 'clean']):
    fig.layout.annotations[i]['text'] = label
fig.show()
It would be nice if there were a shorter way, e.g. passing the list of labels directly to imshow. However, I did not find a way to do that.
I appreciate the answer by @DerJFK; however, it won't support multiple rows.
This trick fixes the problem:
item_map={f'{i}':key for i, key in enumerate(['original', 'clean'])}
fig.for_each_annotation(lambda a: a.update(text=item_map[a.text.split("=")[1]]))
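To see what the lambda in that snippet does, here is a small sketch in plain Python that applies the same lookup to strings standing in for the facet annotation texts. The `relabel` helper and the sample strings `status=0` / `status=1` are illustrative and not part of plotly.

```python
labels = ['original', 'clean']

# Map the facet index (as a string) to the title we want to display.
item_map = {str(i): name for i, name in enumerate(labels)}


def relabel(annotation_text):
    # "status=1" -> "1" -> "clean"
    return item_map[annotation_text.split("=")[1]]


# Stand-ins for the text of each facet annotation.
titles = [relabel(text) for text in ["status=0", "status=1"]]
```

Because the lookup depends only on the number after the `=` sign and not on the position of the annotation in the layout, it should keep working when the facets wrap onto several rows.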
According to the docs, adding the parameter 'x' with a list of names should solve it.
fig = px.imshow(numpy.array([img1, img2]),
                color_continuous_scale='gray',
                facet_col=0,
                labels={'facet_col' : 'status'},
                x=['Original', 'Clean'])
fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
fig.show()

Python: Add calculated lines to a scatter plot with a nested categorical x-axis

Cross-post: https://discourse.bokeh.org/t/add-calculated-horizontal-lines-corresponding-to-categories-on-the-x-axis/5544
I would like to duplicate this plot in Python:
Here is my attempt, using pandas and bokeh:
Imports:
import pandas as pd
from bokeh.io import output_notebook, show, reset_output
from bokeh.palettes import Spectral5, Turbo256
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Band, Span, FactorRange, ColumnDataSource
Create data:
fruits = ['Apples', 'Pears']
years = ['2015', '2016']
data = {'fruit' : fruits,
'2015' : [2, 1],
'2016' : [5, 3]}
fruit_df = pd.DataFrame(data).set_index("fruit")
tidy_df = (pd.DataFrame(data)
.melt(id_vars=["fruit"], var_name="year")
.assign(fruit_year=lambda df: list(zip(df['fruit'], df['year'])))
.set_index('fruit_year'))
Create bokeh plot:
p = figure(x_range=FactorRange(factors=tidy_df.index.unique()),
           plot_height=400,
           plot_width=400,
           tooltips=[('Fruit', '@fruit'), # first string is user-defined; second string must refer to a column
                     ('Year', '@year'),
                     ('Value', '@value')])
cds = ColumnDataSource(tidy_df)
index_cmap = factor_cmap("fruit",
Spectral5[:2],
factors=sorted(tidy_df["fruit"].unique())) # this is a reference back to the dataframe
p.circle(x='fruit_year',
y='value',
size=20,
source=cds,
fill_color=index_cmap,
line_color=None,
)
# how do I add a median just to one categorical section?
median = Span(location=tidy_df.loc[tidy_df["fruit"] == "Apples", "value"].median(), # median value for Apples
#dimension='height',
line_color='red',
line_dash='dashed',
line_width=1.0
)
p.add_layout(median)
# how do I add this standard deviation(ish) band to just the Apples or Pears section?
band = Band(
base='fruit_year',
lower=2,
upper=4,
source=cds,
)
p.add_layout(band)
show(p)
Output:
Am I up against this issue? https://github.com/bokeh/bokeh/issues/8592
Is there any other data visualization library for Python that can accomplish this? Altair, Holoviews, Matplotlib, Plotly... ?
A Band is a connected area, but your image of the desired output has two disconnected areas, so you actually need two bands. Take a look at the example here to better understand bands: https://docs.bokeh.org/en/latest/docs/user_guide/annotations.html#bands
By using Band(base='fruit_year', lower=2, upper=4, source=cds) you ask Bokeh to plot a band where, for each value of fruit_year, the lower coordinate is 2 and the upper coordinate is 4, which is exactly what you see on your Bokeh plot.
A bit unrelated but still a mistake - notice how your X axis is different from what you wanted. You have to specify the major category first, so replace list(zip(df['fruit'], df['year'])) with list(zip(df['year'], df['fruit'])).
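The effect of swapping the tuple elements can be sketched in plain Python, without pandas or Bokeh. The `rows` list below is a hand-written stand-in for the melted dataframe from the question; the first element of each factor tuple is what a nested categorical axis uses as the outer group.

```python
# (fruit, year, value) rows, in the order produced by the melt in the question
rows = [('Apples', '2015', 2), ('Pears', '2015', 1),
        ('Apples', '2016', 5), ('Pears', '2016', 3)]

# As in the question: fruit is the outer (major) category.
factors_question = [(fruit, year) for fruit, year, _ in rows]

# As suggested: year is the outer (major) category.
factors_fixed = [(year, fruit) for fruit, year, _ in rows]

# Outer groups in order of first appearance.
outer_groups = list(dict.fromkeys(outer for outer, _ in factors_fixed))
```

With the year first, the axis is grouped into 2015 and 2016, each containing Apples and Pears.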
Now, to the "how to" part. Since you need two separate bands, you cannot provide them with the same data source. The way to do it would be to have two extra data sources - one for each band. It ends up being something like this:
for year, sd in [('2015', 0.3), ('2016', 0.5)]:
    b_df = (tidy_df[tidy_df['year'] == year]
            .drop(columns=['year', 'fruit'])
            .assign(lower=lambda df: df['value'].min() - sd,
                    upper=lambda df: df['value'].max() + sd)
            .drop(columns='value'))
    p.add_layout(Band(base='fruit_year', lower='lower', upper='upper',
                      source=ColumnDataSource(b_df)))
There are two issues left however. The first one is a trivial one - the automatic Y range (an instance of DataRange1d class by default) will not take the bands' heights into account. So the bands can easily go out of bounds and be cropped by the plot. The solution here is to use manual ranging that takes the SD values into account.
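One way to work out such a manual range is sketched below in plain Python, using the values and SD figures from this thread. The `padding` margin is an arbitrary illustrative choice, and the dictionaries are hand-written stand-ins for the dataframe.

```python
# Values per year (from the question) and the SD used for each band (from the loop above)
values_by_year = {'2015': [2, 1], '2016': [5, 3]}
sd_by_year = {'2015': 0.3, '2016': 0.5}
padding = 0.2  # arbitrary extra margin so the bands do not touch the plot edge

# Lower and upper edge of each band, computed the same way as in the loop above
lowers = [min(vals) - sd_by_year[year] for year, vals in values_by_year.items()]
uppers = [max(vals) + sd_by_year[year] for year, vals in values_by_year.items()]

y_start = min(lowers) - padding
y_end = max(uppers) + padding

# These could then be passed when creating the plot,
# e.g. figure(..., y_range=(y_start, y_end))
```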
The second issue is that the width of band is limited to the X range factors, meaning that the circles will be partially outside of the band. This one is not that easy to fix. Usually a solution would be to use a transform to just shift the coordinates a bit at the edges. But since this is a categorical axis, we cannot do it. One possible solution here is to create a custom Band model that adds an offset:
class MyBand(Band):
    # language=TypeScript
    __implementation__ = """
import {Band, BandView} from "models/annotations/band"

export class MyBandView extends BandView {
    protected _map_data(): void {
        super._map_data()
        const base_sx = this.model.dimension == 'height' ? this._lower_sx : this._lower_sy
        if (base_sx.length > 1) {
            const offset = (base_sx[1] - base_sx[0]) / 2
            base_sx[0] -= offset
            base_sx[base_sx.length - 1] += offset
        }
    }
}

export class MyBand extends Band {
    __view_type__: MyBandView

    static init_MyBand(): void {
        this.prototype.default_view = MyBandView
    }
}
"""
Just replace Band with MyBand in the code above and it should work. One caveat: you will need to have Node.js installed, and startup will take a second or two longer because the custom model code needs to be compiled. Another caveat: the custom model code relies on BokehJS internals, so while it works with Bokeh 2.0.2, I can't guarantee that it will work with any other Bokeh version.

plotly - multiple traces using a shared slider variable

As the title hints, I'm struggling to create a plotly chart that has multiple lines that are functions of the same slider variable.
I hacked something together using bits and pieces from the documentation: https://pastebin.com/eBixANqA. This works for one line.
Now I want to add more lines to the same chart, but this is where I'm struggling. https://pastebin.com/qZCMGeAa.
I'm getting a PlotlyListEntryError: Invalid entry found in 'data' at index, '0'
Path To Error: ['data'][0]
Can someone please help?
It looks like you were using https://plot.ly/python/sliders/ as a reference. Unfortunately I don't have time to test with your code, but this should be easily adaptable. Create each trace you want to plot in the same way that you have been:
trace1 = [dict(
    type='scatter',
    visible = False,
    name = "trace title",
    mode = 'markers+lines',
    x = x[0:step],
    y = y[0:step]) for step in range(len(x))]
Note that in my example the data comes from pre-defined lists, whereas you are using a function. That is probably the only change you'll really need to make, besides your own step size etc.
If you create a second trace in the same way, for example
trace2 = [dict(
    type='scatter',
    visible = False,
    name = "trace title",
    mode = 'markers+lines',
    x = x2[0:step],
    y = y2[0:step]) for step in range(len(x2))]
Then you can put all your data together with the following
all_traces = trace1 + trace2
You can then go ahead and plot it, provided your layout is set up correctly (it should remain unchanged from your single-trace example):
fig = py.graph_objs.Figure(data=all_traces, layout=layout)
py.offline.iplot(fig)
Your slider should control both traces, provided you were following https://plot.ly/python/sliders/ to get the slider working. You can combine multiple data dictionaries this way to have multiple plots controlled by the same slider.
Note that if your lists of data dictionaries have different lengths, this gets topsy-turvy.
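For the slider to show one entry from each list at a time, each slider step has to switch on the matching index in every list. The sketch below builds such steps with plain Python dictionaries. It assumes the slider uses 'restyle' steps that toggle 'visible', as in the linked tutorial, and that both lists have the same length; `n_steps` and `n_lines` are hypothetical values.

```python
n_steps = 5   # hypothetical number of slider positions (length of each trace list)
n_lines = 2   # number of trace lists concatenated into all_traces
total = n_steps * n_lines

steps = []
for i in range(n_steps):
    # Start with everything hidden, then show entry i of every line.
    visible = [False] * total
    for line in range(n_lines):
        visible[line * n_steps + i] = True
    steps.append(dict(method='restyle', args=['visible', visible]))

# Slider definition that would go into the layout as layout['sliders']
sliders = [dict(active=0, steps=steps)]
```

If the lists had different lengths, the offset `line * n_steps` would no longer point at the right trace, which is one way the mismatch mentioned above shows up.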
