For context: I'd like to make a plot in plotly showing the evolution of an investment portfolio where the value of each asset is plotted on top of each other. Since assets are bought and sold, not every asset should be shown for the entire range of the curve.
The example below clarifies this. Leading or trailing zeros indicate that the asset was not in the portfolio at that moment.
import pandas as pd
import plotly.express as px
import numpy as np
data = {"Asset 1": [0, 1, 2, 3, 4, 5], "Asset 2": [0, 0, 2, 3, 2, 2], "Asset 3": [1, 1, 3, 0, 0, 0]}
df = pd.DataFrame(data)
fig = px.area(df)
fig.show()
This results in the following figure:
The problem is now that at the indicated time (index=4), Asset 3 is not in the portfolio anymore, hence its value 0. However it is still shown, and the bigger problem is that it makes it impossible to see the value of Asset 2 which is in the portfolio.
I tried changing the zeros to NaN values to indicate that they don't exist but that gives the exact same figure.
data2 = {"a": [np.nan, 1, 2, 3, 4, 5], "b": [np.nan, np.nan, 2, 3, 2, 2], "c": [1, 1, 3, np.nan, np.nan, np.nan]}
df2 = pd.DataFrame(data2)
fig2 = px.area(df2)
fig2.show()
I cannot offer an elegant solution, but the following should cover most of the requirements you stated. How it works:
Instead of relying on the automatic stacking, draw the lines one by one yourself.
That means pre-processing the dataframe a little: compute the cumulative columns a+b and a+b+c.
plotly.express offers limited custom control, so use plotly.graph_objects instead. The two have similar syntax.
The order in which the traces (i.e. lines) are added matters: the last trace rendered is placed on top. In your example the traces are drawn from the left-most to the right-most column, which is why overlaps favour the right-most column.
The NaN values have to be zero-filled manually before plotting. Otherwise the filled areas take on odd shapes, since your sample data contains several NaNs.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
data = {"a": [np.nan, 1, 2, 3, 4, 5], "b": [np.nan, np.nan, 2, 3, 2, 2], "c": [1, 1, 3, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
# fill NAs with zeros before doing anything
df = df.fillna(0)
fig = go.Figure()
# add lines one by one. The order matters - last one lays on top along with its hoverinfo
fig.add_trace(go.Scatter(
    x=df.index,
    y=df['a'],
    mode='lines',
    fill='tonexty',  # fill the area under line to next y
))
fig.add_trace(go.Scatter(
    x=df.index,
    y=df['a']+df['b'],  # sum of 'a' and 'b'
    mode='lines',
    fill='tonexty',  # fill the area under line to next y
))
fig.add_trace(go.Scatter(
    x=df.index,
    y=df['a']+df['b']+df['c'],  # sum of 'a' and 'b' and 'c'
    mode='lines',
    fill='tonexty',  # fill the area under line to next y
))
# pin the y-axis to start at zero; otherwise an area below zero is shown
fig.update_layout(yaxis=dict(range=[0, max(df.sum(axis=1) * 1.05)]))
fig.show()
The resulting plot would look like:
The green line, representing df['a']+df['b']+df['c'], still sits on top. However, the hover label now shows the cumulative value df['a']+df['b']+df['c'] rather than the value of any single asset.
In fact, I find these asset-allocation plots prettier without the edge lines:
This can be done by setting mode='none' on each of the three traces.
Remarks:
For anyone reading, another approach I tried: treat each filled area and each line as two separate traces. This requires defining custom colour pairs (a solid colour and its half-transparent counterpart), and it produced some buggy results. Also, traces that set the stackgroup argument cannot represent NaN values: the NaNs are either zero-filled or interpolated, which gives bad plots in the context of this problem.
In seaborn, is it possible to group observations based on a column without using the hue argument?
For example, how could I get these two lines to show up in the same colour, but as separate lines?
Code for generating this is below.
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.show() below
import seaborn as sns
df = pd.DataFrame(
    {
        'group': ["group01", "group01", "group02", "group02"],
        'x': [1, 2, 3, 5],
        'y': [2, 4, 3, 5]
    }
)
sns.lineplot(df, x='x', y='y', hue='group')
plt.show()
This is straightforward to do in R's ggplot, by mapping the group variable to group, rather than to colour. For example, see this.
The reason I want to do this is that I want to show multiple overlaid plots all in the same colour. This helps to show variability across different datasets. The different colours that I would get with seaborn's hue are unnecessary and distracting, especially when there would be dozens of them. Here is the sort of plot I want to create:
seaborn.lineplot has a units parameter, which seems to be equivalent to ggplot's group:
units: vector or key in data
Grouping variable identifying sampling units. When used, a separate line will be drawn for each unit with appropriate semantics,
but no legend entry will be added. Useful for showing distribution of
experimental replicates when exact identities are not needed.
sns.lineplot(df, x='x', y='y', units='group')
Output:
Combining units and hue in a more complex example:
df = pd.DataFrame(
    {
        'group': ["group01", "group01", "group02", "group02", "group01", "group01"],
        'group2': ['A', 'A', 'A', 'A', 'B', 'B'],
        'x': [1, 2, 3, 5, 2, 4],
        'y': [2, 4, 3, 5, 3, 2]
    }
)
sns.lineplot(df, x='x', y='y', units='group', hue='group2')
Output:
I have a data frame with one column that describes y-axis values and two more columns that describe the upper and lower bounds of a confidence interval. I would like to use those values to draw error bars using plotly. Now I am aware that plotly offers the possibility to draw confidence intervals (using the error_y and error_y_minus keyword-arguments) but not in the logic that I need, because those keywords are interpreted as additions and subtractions from the y-values. Instead, I would like to directly define the upper and lower positions:
For instance, how could I use plotly and this example data frame
import pandas as pd
import plotly.express as px
df = pd.DataFrame({'x': [0, 1, 2],
                   'y': [6, 10, 2],
                   'ci_upper': [8, 11, 2.5],
                   'ci_lower': [5, 9, 1.5]})
to produce a plot like this?
This uses Plotly Express to create the bar chart.
The error bars follow https://plotly.com/python/error-bars/#asymmetric-error-bars: the offsets are obtained by subtracting y from the upper bound and the lower bound from y, which gives the outcome you want.
import pandas as pd
import plotly.express as px
df = pd.DataFrame(
    {"x": [0, 1, 2], "y": [6, 10, 2], "ci_upper": [8, 11, 2.5], "ci_lower": [5, 9, 1.5]}
)
px.bar(df, x="x", y="y").update_traces(
    error_y={
        "type": "data",
        "symmetric": False,
        "array": df["ci_upper"] - df["y"],
        "arrayminus": df["y"] - df["ci_lower"],
    }
)
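If the conversion from absolute bounds to relative offsets is needed in more than one place, it can be wrapped in a small function. The name bounds_to_error_y is hypothetical (it is not part of plotly), and the check that the bounds bracket y is my own assumption about what a caller would want.

```python
import pandas as pd


def bounds_to_error_y(y, lower, upper):
    """Hypothetical helper: absolute bounds -> plotly's relative error_y dict."""
    y, lower, upper = (pd.Series(v, dtype=float) for v in (y, lower, upper))
    # Assumption: a bound on the wrong side of y is a data error, not a feature.
    if ((lower > y) | (upper < y)).any():
        raise ValueError("each y value must lie between its lower and upper bound")
    return {
        "type": "data",
        "symmetric": False,
        "array": (upper - y).tolist(),       # distance from y up to the upper bound
        "arrayminus": (y - lower).tolist(),  # distance from y down to the lower bound
    }


df = pd.DataFrame(
    {"x": [0, 1, 2], "y": [6, 10, 2], "ci_upper": [8, 11, 2.5], "ci_lower": [5, 9, 1.5]}
)
error_y = bounds_to_error_y(df["y"], df["ci_lower"], df["ci_upper"])
print(error_y["array"])       # [2.0, 1.0, 0.5]
print(error_y["arrayminus"])  # [1.0, 1.0, 0.5]
```

The resulting dict can be passed as error_y to update_traces, exactly as in the snippet above.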
When plotting a bar chart with Seaborn and using the hue parameter to color the bars according to their column value, bars with identical column values are nested, or aggregated, and only a single bar is shown. The image below illustrates the problem. Patient number 1 has two samples of sample_type 1, with values 10 and 20. The two values have been nested, and both values are represented as a single bar (as the average of the two).
I'd like to avoid this nesting, and rather have something like in the image below.
Is this possible to achieve? MVE below. Thanks!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
    "patient_number": [1, 1, 1, 2, 2, 2],
    "sample_type": [1, 1, 2, 1, 2, 3],
    "value": [10, 20, 15, 10, 11, 12]
})
sns.barplot(x="patient_number", y="value", hue="sample_type", data=df)
plt.show()
The following approach obtains the desired plot:
Seaborn's hue= parameter both defines the color and the position of the bars.
Per patient, an extra field ('idx') contains a unique number for each of the desired bars. This field 'idx' restarts from 0 for every next patient and is added to the dataframe.
'idx' can then be used as hue='idx' to get the desired columns, although they will simply be colored sequentially.
In order to get one color per sample type, an extra column now contains a factorized version of the sample types (so, 0 for the first type, 1 for the next, etc.).
Seaborn generates the bars per hue, one for each patient. These bars can be accessed as a list via ax.patches. If some patient doesn't have a value for a given 'idx', a dummy bar will be added to the list.
By iterating through the patients and then through the 'idx', all bars can be visited and colored via 'sample_type'. As the ordering of the bars is a bit tricky, an adequate renumbering is needed.
The legend needs to be changed to reflect the sample types.
The given data is extended a bit to be able to test different numbers of samples per patient, and sample types that aren't simple subsequent numbers.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
    'patient_number': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    'sample_type': ['st1', 'st1', 'st2', 'st1', 'st2', 'st3', 'st4', 'st4', 'st4', 'st4', 'st4'],
    'value': [10, 20, 15, 10, 11, 12, 1, 2, 3, 4, 5]
})
df['idx'] = df.groupby('patient_number').cumcount()
df['sample_factors'], sample_labels = pd.factorize(df['sample_type'])
ax = sns.barplot(x='patient_number', y='value', hue='idx', data=df)
colors = plt.get_cmap('Set2').colors  # plt.cm.get_cmap is deprecated in newer matplotlib; see https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
handles = [None for _ in sample_labels]
num_patients = len(ax.patches) // (df['idx'].max() + 1)
for i, (patient_id, group) in enumerate(df.groupby('patient_number')):
    for j, factor in enumerate(group['sample_factors']):
        patch = ax.patches[i + j * num_patients]
        patch.set_color(colors[factor])
        handles[factor] = patch
ax.legend(handles=handles, labels=list(sample_labels), title='Sample type')
plt.show()
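Since the recolouring above depends on the order in which seaborn stores its bars in ax.patches, here is an alternative sketch that places the bars with plain matplotlib instead. It is not the approach described above, just a variant that computes each bar's x position from the same 'idx' column; the bar width of 0.8 per patient and the Set2 palette are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "patient_number": [1, 1, 1, 2, 2, 2],
    "sample_type": [1, 1, 2, 1, 2, 3],
    "value": [10, 20, 15, 10, 11, 12],
})

# Running index of each sample within its patient (0, 1, 2, ...).
df["idx"] = df.groupby("patient_number").cumcount()

patients = sorted(df["patient_number"].unique())
patient_pos = {p: i for i, p in enumerate(patients)}
n_slots = int(df["idx"].max()) + 1
width = 0.8 / n_slots  # each patient occupies 0.8 of a unit on the x axis

palette = plt.get_cmap("Set2").colors
sample_types = sorted(df["sample_type"].unique())
color_of = {t: palette[i % len(palette)] for i, t in enumerate(sample_types)}

fig, ax = plt.subplots()
for sample_type, group in df.groupby("sample_type"):
    # Centre the slots around the patient's tick position.
    x = group["patient_number"].map(patient_pos) + (group["idx"] - (n_slots - 1) / 2) * width
    ax.bar(x, group["value"], width=width, color=color_of[sample_type], label=str(sample_type))

ax.set_xticks(range(len(patients)))
ax.set_xticklabels(patients)
ax.set_xlabel("patient_number")
ax.set_ylabel("value")
ax.legend(title="Sample type")
# plt.show()
```

One bar is drawn per row of the dataframe, so no dummy bars appear; patients with fewer samples simply leave their right-hand slots empty.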
I would like to plot a chart with plotly that shows only the existing values in the x-axis.
When I execute the code below, a chart that looks like in the following image appears:
The range on the x-axis as well as the range on the y-axis is evenly set from zero up to the maximal value.
import plotly.graph_objs as go
from plotly.offline import plot
xValues = [1, 2, 27, 50]
yValues = [7, 1, 2, 3]
trace = go.Scatter( x = xValues, y = yValues, mode='lines+markers', name='high limits' )
plottedData = [trace]
plot( plottedData )
Now, I would like to show only the existing values on the x axis. Related to my example, I want just the values [1, 2, 27, 50] to appear. And they should have the same space in between. Is this possible? If yes, how?
You can force the xaxis.type to be category like this:
plot( dict(data=plottedData, layout=go.Layout(xaxis = {"type": "category"} )))
I have a dataset consisting of financial stock IDs [0, 1400] and timestamps [0, 1800]. For a given ID, it either does or does not have data for a given timestamp.
I have created a dictionary where each key is an ID, and each value is a list of all the timestamps for which that ID has data.
I would now like to plot a chart with each row corresponding to an ID and each column corresponding to a timestamp. Each cell [i, j] of the chart will be coloured green if ID i has data for timestamp j (if j in dict[i]), and red if not.
Here is a sample I produced manually in Excel:
Can this be done through matplotlib or some other library?
Since the chart would be of size 1400x1800, the cells can be very small. I am attempting to reorder the data so that the number of green cells intersecting between adjacent IDs is maximised and so this chart will allow me to provide visualisations of how well I have achieved these overlaps/intersections across the dataset.
To provide some data, I simply iterated through the first 20 IDs in my dictionary and printed out each ID and the list of its timestamps. Each line is in the form: ID [list of the ID's timestamps]
EDIT:
Here is my first attempt on a small-scale example of the data. Although this does achieve what I set out to do, it is a very brute-force solution, so any recommendations for improvements would be appreciated.
import matplotlib.pyplot as plt
import pandas as pd
TSs = [0, 1, 2, 3, 4, 5]
ID_TS = {0: [1, 2, 3], 1: [2, 3, 4, 5]}
df = pd.DataFrame(index=ID_TS.keys(), columns=TSs)
for ID, TS in ID_TS.items():
    bools = []
    for i in TSs:
        if i in TS:
            bools.append(True)
        else:
            bools.append(False)
    df.loc[ID] = bools
plt.imshow(df, cmap='hot', interpolation='nearest')
plt.show()
Your code to generate your dataframe doesn't work. So I took some liberties with that...
import numpy
import pandas
from matplotlib import pyplot
from matplotlib import ticker
TSs = [0, 1, 2, 3, 4, 5]
ID_TS = {0: [1, 2, 3, numpy.nan], 1: [2, 3, 4, 5]}
fig, ax = pyplot.subplots()
img = (
    pandas.DataFrame(data=ID_TS, columns=TSs)
    .isnull()
    .pipe(numpy.bitwise_not)
    .pipe(ax.pcolor, cmap='RdYlGn', edgecolors='k')
)
unit_ints = ticker.MultipleLocator(1)
ax.set_xlabel('Time')
ax.set_ylabel('ID')
ax.yaxis.set_major_locator(unit_ints)
ax.xaxis.set_major_locator(unit_ints)
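Note that pandas.DataFrame(data=ID_TS, columns=TSs) treats the dictionary keys as column labels, so the frame above does not have one row per ID. If the goal is the grid described in the question, where cell [i, j] is green when timestamp j is in the list for ID i, the boolean matrix can be filled in directly. The following is a sketch under that assumption; the red/green colormap and the imshow call are my choices, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap

timestamps = [0, 1, 2, 3, 4, 5]
id_ts = {0: [1, 2, 3], 1: [2, 3, 4, 5]}

ids = sorted(id_ts)
column_of = {ts: j for j, ts in enumerate(timestamps)}

# present[i, j] is True when ids[i] has data for timestamps[j].
present = np.zeros((len(ids), len(timestamps)), dtype=bool)
for i, id_ in enumerate(ids):
    for ts in id_ts[id_]:
        if ts in column_of:
            present[i, column_of[ts]] = True

fig, ax = plt.subplots()
ax.imshow(
    present.astype(int),
    cmap=ListedColormap(["red", "green"]),  # 0 -> red, 1 -> green
    vmin=0,
    vmax=1,
    aspect="auto",
    interpolation="nearest",  # keeps small cells crisp on a 1400x1800 grid
)
ax.set_xlabel("Timestamp")
ax.set_ylabel("ID")
# plt.show()
```

For the full 1400x1800 dataset, imshow draws a single image rather than one patch per cell, so it should stay responsive where a grid with cell edges would not.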