How to subplot histogram using multiple columns with plotly? - python

So I have data that I transformed up to this point (pic below). How can I now subplot histograms that will show me the programming languages used per job? I tried with just 2 columns at first:
px.histogram(languages_job_title_grouped, x =['Python','SQL'], facet_col = 'Role', facet_col_wrap = 4,height = 1000)
But it didn't work - it plots histogram by job, but the bars are the same for every role (2nd picture below).
How can I do it the right way?

From the context of your question, it seems like you are looking for a bar plot instead.
That is, if I understand correctly, you are starting from a dataframe equivalent to the df constructed in the code below, and you are trying to plot a chart where the facets are the index, the x-axis is each column, and the bar heights are the values in the dataframe.
The code that generates this is:
import pandas as pd
import plotly.express as px

df = pd.DataFrame(
    [[0.1, 0.3, 0.5], [0.2, 0.1, 0.8], [0.5, 0.3, 0.9]],
    columns=["a", "b", "c"],
    index=["index1", "index2", "index3"],
)
px.bar(
    df.melt(ignore_index=False).reset_index(),
    facet_col="index",
    x="variable",
    y="value",
    barmode="group",
)
The key is to reshape your DataFrame from wide to long format with melt before trying to plot with plotly.express.
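To make the reshaping step concrete, here is a small sketch that prints what melt produces for the example df from the answer. The variable name long_df is my own; everything else comes from the code above.

```python
import pandas as pd

df = pd.DataFrame(
    [[0.1, 0.3, 0.5], [0.2, 0.1, 0.8], [0.5, 0.3, 0.9]],
    columns=["a", "b", "c"],
    index=["index1", "index2", "index3"],
)

# Wide to long: one row per (index, column) pair.
# ignore_index=False keeps the original index, and reset_index() turns it
# into a regular column named "index" that facet_col can refer to.
long_df = df.melt(ignore_index=False).reset_index()

print(long_df.columns.tolist())  # ['index', 'variable', 'value']
print(long_df.shape)             # (9, 3)
```

Each of the three original columns contributes three rows, so the long frame has one bar per row, grouped into facets by the "index" column.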

Related

Plotly Express: Plotting individual columns of a dataframe as multiple plots (scrollable) using plotly express

I posed a question at Plotly: How to add a horizontal scrollbar to a plotly express figure? asking how to add a horizontal scrollbar to a plotly express figure for purposes of visualizing a long multivariate time series. A solution for a simple example consisting of three series having 100K points each was given as follows:
import plotly.express as px
import numpy as np
import pandas as pd
np.random.seed(123)
e = np.random.randn(100000,3)
df=pd.DataFrame(e, columns=['a','b','c'])
df['x'] = df.index
df_melt = pd.melt(df, id_vars="x", value_vars=df.columns[:-1])
fig=px.line(df_melt, x="x", y="value",color="variable")
# Add range slider
fig.update_layout(xaxis=dict(rangeslider=dict(visible=True),
                             type="linear"))
fig.show()
This code is nice, but I'd like to have the plots not superimposed on a single set of axes--instead one above the other as would be done with subplot. For example, signal 'a' would appear above signal 'b', which would appear above signal 'c'.
Because my actual time series have at least 50 channels, a vertical scrollbar will likely be necessary.
As far as I know, a vertical scrollbar may be possible in Dash, but it does not exist in plotly itself. The question you quoted also suggests a range slider as a substitute for scrolling. However, the range slider is part of the graph, so unless the slider is made independent of it, it will scroll out of view together with the plot, which is not ideal. I think the solution at the moment is to stack the channels in one figure, each on its own y-axis, and add a range slider.
import plotly.graph_objects as go
import numpy as np
import pandas as pd
np.random.seed(123)
e = np.random.randn(100000,3)
df=pd.DataFrame(e, columns=['a','b','c'])
df['x'] = df.index
df_melt = pd.melt(df, id_vars="x", value_vars=df.columns[:-1])
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_melt.query('variable == "a"')['x'],
                         y=df_melt.query('variable == "a"')['value'], yaxis='y'))
fig.add_trace(go.Scatter(x=df_melt.query('variable == "b"')['x'],
                         y=df_melt.query('variable == "b"')['value'], yaxis='y2'))
fig.add_trace(go.Scatter(x=df_melt.query('variable == "c"')['x'],
                         y=df_melt.query('variable == "c"')['value'], yaxis='y3'))
# Add range slider
fig.update_layout(
    xaxis=dict(
        rangeslider=dict(visible=True),
        type="linear"),
    yaxis=dict(
        anchor='x',
        domain=[0, 0.33],
        linecolor='blue',
        type='linear',
        zeroline=False
    ),
    yaxis2=dict(
        anchor='x',
        domain=[0.33, 0.66],
        linecolor='red',
        type='linear',
        zeroline=False
    ),
    yaxis3=dict(
        anchor='x',
        domain=[0.66, 1.0],
        linecolor='green',
        type='linear',
        zeroline=False
    ),
)
fig.show()

Plotly: Grouped and Stacked barchart, different colors for stacked bars?

I have a bar chart that is both grouped and stacked. Now I want to color the values that are stacked in a different color. My plot currently looks like this (with a marker of the parts that should be colored differently):
The group corresponds to a category and the stacking is based on the delta between two value columns (the delta is stacked on top of the lower values).
I want to color the delta (i.e. the upper part of the stacked bar) in a slightly lighter color (here: light blue for Cat1 and light red for Cat2).
However, I cannot find an appropriate option for doing so.
My data looks like this:
import pandas as pd
data = [
["Station1", 2.5, 3.0, "Cat1"],
["Station1", 3.7, 4.2, "Cat2"],
["Station2", 1.7, 2.1, "Cat1"],
["Station2", 3.9, 4.0, "Cat2"],
]
df = pd.DataFrame(data, columns=["station", "val1", "val2", "category"])
df["delta"] = df["val2"] - df["val1"]
And here is the plotting function:
import plotly.express as px
px.bar(df, x="station", y=["val1", "delta"], color="category", barmode="group").show()
How can I color the delta differently? The solution can also be in plotly (does not have to be plotly express).
Is there a way to avoid the workaround of manually computing the delta column?
I've put together a suggestion that should fit your needs pretty well, at least visually, as long as delta >= 0. To my knowledge, grouped and stacked bar charts are still a bit tricky. But building two different figures using px.bar and then combining their fig._data in a go.Figure() object will let you build this:
As you will see from the code, delta has not been explicitly implemented, but rather appears visually between the opaque bars and the non-opaque bars. If this is something you can use, I'll see if I can include delta in the hoverinfo.
Complete code:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
data = [
["Station1", 2.5, 3.0, "Cat1"],
["Station1", 3.7, 4.2, "Cat2"],
["Station2", 1.7, 2.1, "Cat1"],
["Station2", 3.9, 4.0, "Cat2"],
]
df = pd.DataFrame(data, columns=["station", "val1", "val2", "category"])
df["delta"] = df["val2"] - df["val1"]
fig1_data = px.bar(df, x="station", y=["val1"], color="category", barmode="group")._data
fig2_data = px.bar(df, x="station", y=['val2'], color="category", barmode="group")._data
fig1_data[0]['marker']['color'] = 'rgba(255,0,0,1)'
fig1_data[1]['marker']['color'] = 'rgba(0,0,255,1)'
fig2_data[0]['marker']['color'] = 'rgba(255,0,0,0.2)'
fig2_data[1]['marker']['color'] = 'rgba(0,0,255,0.2)'
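dat = fig1_data+fig2_data
fig = go.Figure(dat)
fig3 = go.Figure(fig._data)
# Note: this layout update is applied to fig, not to fig3, so the figure
# displayed below keeps its default layout (no title, default barmode).
fig.update_layout(barmode='relative', title_text='Relative Barmode')
fig3.show()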

Plotly: How to display individual value on histogram?

I am trying to make dynamic plots with plotly. I want to plot a count of data that have been aggregated (using groupby).
I want to facet the plot by color (and maybe even by column). The problem is that I want the value count to be displayed on each bar. With a histogram, I get smooth bars, but I can't find how to display the count:
With a bar plot I can display the count, but I don't get smooth bars, and the count does not appear once for the whole bar but for each case composing that bar.
Here is my code for the barplot
val = pd.DataFrame(data2.groupby(["program", "gender"])["experience"].value_counts())
px.bar(x=val.index.get_level_values(0), y=val, color=val.index.get_level_values(1), barmode="group", text=val)
It's basically the same for the histogram.
Thank you for your help!
px.histogram does not seem to have a text attribute. So if you're willing to do the binning yourself before producing your plot, I would use px.bar. Normally, you apply text to your bar plot using px.bar(... text = <something>). But this gives the result you've described, with text for all subcategories of your data. But since we know that px.bar adds data and annotations in the order that the source is organized, we can simply update the text of the last subcategory applied using fig.data[-1].text = sums. The only challenge that remains is some data munging to retrieve the correct sums.
Plot:
Complete code with data example:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# data
df = pd.DataFrame({'x':['a', 'b', 'c', 'd'],
'y1':[1, 4, 9, 16],
'y2':[1, 4, 9, 16],
'y3':[6, 8, 4.5, 8]})
df = df.set_index('x')
# calculations
# column sums for transposed dataframe
sums = []
for col in df.T:
    sums.append(df.T[col].sum())
# change dataframe format from wide to long for input to plotly express
df = df.reset_index()
df = pd.melt(df, id_vars = ['x'], value_vars = df.columns[1:])
fig = px.bar(df, x='x', y='value', color='variable')
fig.data[-1].text = sums
fig.update_traces(textposition='inside')
fig.show()
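If your first graph is built with the graph objects library, you can try:
# Use textposition='auto' to draw each value directly on its bar
counts = val.iloc[:, 0]  # val holds a single column of counts
fig = go.Figure(data=[go.Bar(x=counts.index.get_level_values(0),
                             y=counts, text=counts, textposition='auto')])
# go.Bar takes no color or barmode argument: barmode is a layout setting,
# and coloring by gender would need one go.Bar trace per gender
fig.update_layout(barmode="group")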

Plotting multiple overlapping histograms with columns from two separate pandas data frames

I've got two pandas data frames with the same column names ('Height', 'Weight', 'Alc'), but one is with dirty data and one with cleaned data.
To visualise my cleaning, I want to make a histogram with overlapping bars showing the dirty data and the cleaned data for a given variable on the same chart. Eg for 'Height' I want a chart looking like the one pictured below, with dirty 'Height' data in one colour, and cleaned 'Height' data in another.
Importantly, I want to make such a chart for all of my variables (ie 'Height', 'Weight', and 'Alc').
In order to achieve this I've tried appending the data frames and plotting the histograms from a single data frame. Unfortunately it produces just regular histograms, not overlapping ones.
Thanks for any help!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# DIRTY DATA FRAME
data = [[1.0, 10, 0], [0.0, 12, 0.4], [2.0, 918, 0.9], [2.0, 918, 0.9]]
df_dirty = pd.DataFrame(data, columns = ['Height', 'Weight', 'Alc'])
# CLEAN DATA FRAME
data2 = [[1.0, 10, np.nan], [0.0, 12, 0.4], [np.nan, np.nan, 0.9], [np.nan, np.nan, 0.9]]
df_clean = pd.DataFrame(data2, columns = ['Height', 'Weight', 'Alc'])
# ADDITION OF GROUP-BY VARIABLE
my_dfs = [df_dirty, df_clean]
my_status = ['dirty','clean']
for dataframe, status in zip(my_dfs, my_status):
    dataframe['groupbyvar'] = status
# DATA FRAME FOR PLOTTING
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df_plotting = pd.concat([df_dirty, df_clean], ignore_index=True)
# HISTOGRAMS FOR ALL COLUMNS IN DATA FRAME FOR PLOTTING
for col in df_plotting.columns.drop('groupbyvar'):
    df_plotting.hist(by="groupbyvar", column=col)
EDIT:
The above chart is just a picture to explain what I want, my current code does NOT make it.
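No answer is included for this question here. One possible sketch, under my own assumptions, is to skip the combined data frame and draw both versions of each column on the same matplotlib axes with partial transparency. Sharing the bin edges between the two calls keeps the bars aligned; the bin count of 10 and the alpha value are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cols = ['Height', 'Weight', 'Alc']
df_dirty = pd.DataFrame(
    [[1.0, 10, 0], [0.0, 12, 0.4], [2.0, 918, 0.9], [2.0, 918, 0.9]], columns=cols)
df_clean = pd.DataFrame(
    [[1.0, 10, np.nan], [0.0, 12, 0.4], [np.nan, np.nan, 0.9], [np.nan, np.nan, 0.9]],
    columns=cols)

fig, axes = plt.subplots(1, len(cols), figsize=(12, 3))
for ax, col in zip(axes, cols):
    dirty = df_dirty[col].dropna()
    clean = df_clean[col].dropna()
    # Common bin edges computed from both series so the bars line up.
    bins = np.histogram_bin_edges(pd.concat([dirty, clean]), bins=10)
    # alpha < 1 lets the two histograms show through each other.
    ax.hist(dirty, bins=bins, alpha=0.5, label='dirty')
    ax.hist(clean, bins=bins, alpha=0.5, label='clean')
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
# plt.show()  # uncomment to display the figure
```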

Plotly-Express: How to fix the color mapping when setting color by column name

I am using plotly express for a scatter plot. The color of the markers is defined by a variable of my dataframe, as in the example below.
import pandas as pd
import numpy as np
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df[df.species.isin(['virginica', 'setosa'])], x="sepal_width", y="sepal_length", color="species")
fig.show()
When I add another instance of this variable, the color mapping changes (First, 'virginica', is red, then green).
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",size='petal_length', hover_data=['petal_width'])
fig.show()
How can I keep the mapping of the colors when adding variables?
I found a solution. The function px.scatter has an argument color_discrete_map which is exactly what I needed. color_discrete_map takes a dictionary where the keys are the values of the species and the values are colors assigned to the species.
import plotly.express as px
df = px.data.iris()
color_discrete_map = {'virginica': 'rgb(255,0,0)', 'setosa': 'rgb(0,255,0)', 'versicolor': 'rgb(0,0,255)'}
fig = px.scatter(df[df.species.isin(['virginica', 'setosa'])], x="sepal_width", y="sepal_length", color="species", color_discrete_map=color_discrete_map)
fig.show()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", color_discrete_map=color_discrete_map)
fig.show()
Short answer:
1. Assign colors to variables with color_discrete_map :
color_discrete_map = {'virginica': 'blue', 'setosa': 'red', 'versicolor': 'green'}
or:
2. Manage the order of your data to enable the correct color cycle with:
order_df(df_input = df, order_by='species', order=['virginica', 'setosa', 'versicolor'])
... where order_df is a function that handles the ordering of long dataframes for which you'll find the complete definition in the code snippets below.
The details:
1. You can map colors to variables directly with:
color_discrete_map = {'virginica': 'blue', 'setosa': 'red', 'versicolor': 'green'}
The downside is that you'll have to specify variable names and colors, and that quickly becomes tedious if you're working with dataframes where the number of variables is not fixed. In that case it would be much more convenient to follow the default color sequence or specify one to your liking. So I would rather consider managing the order of your dataset so that you'll get the desired color matching.
2. The source of the real challenge:
px.scatter() assigns colors to the values of the color variable in the order they appear in your dataframe. Here you're using two different sources, df and df[df.species.isin(['virginica', 'setosa'])] (let's name the latter df2). Running df2['species'].unique() will give you:
array(['setosa', 'virginica'], dtype=object)
And running df['species'].unique() will give you:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
See that versicolor pops up in the middle? That's why red is no longer assigned to 'virginica', but to 'versicolor' instead.
Suggested solution:
So in order to build a complete solution, you'd have to find a way to specify the order of the variables in the source dataframe. That's very straightforward for a column with unique values. It's a bit more work for a dataframe of a long format such as this. You could do it as described in the post Changing row order in pandas dataframe without losing or messing up data. But below I've put together a very easy function that takes care of both the subset and the order of the dataframe you'd like to plot with plotly express.
Using the complete code and switching between the lines under # data subsets will give you the three following plots:
Plot 1: order=['virginica']
Plot 2: order=['virginica', 'setosa']
Plot 3: order=['virginica', 'setosa', 'versicolor']
Complete code:
# imports
import pandas as pd
import plotly.express as px
# data
df = px.data.iris()
# function to subset and order a pandas
# dataframe of a long format
def order_df(df_input, order_by, order):
    df_output = pd.DataFrame()
    for var in order:
        df_append = df_input[df_input[order_by] == var].copy()
        df_output = pd.concat([df_output, df_append])
    return df_output
# data subsets
df_express = order_df(df_input = df, order_by='species', order=['virginica'])
df_express = order_df(df_input = df, order_by='species', order=['virginica', 'setosa'])
df_express = order_df(df_input = df, order_by='species', order=['virginica', 'setosa', 'versicolor'])
# plotly
fig = px.scatter(df_express, x="sepal_width", y="sepal_length", color="species")
fig.show()
