Plotly: How to handle missing dates for a financial time series? - python

Financial time series are often fraught with missing data. And out of the box, plotly handles a series with missing timestamps visually by just displaying a line like below. But the challenge here is that plotly interprets the timestamps as a value, and inserts all missing dates in the figure.
Most of the time, I find that the plot would look better by just completely leaving those dates out.
An example from the plotly docs under https://plotly.com/python/time-series/#hiding-weekends-and-holidays shows how to handle missing dates for some date categories like weekends or holidays using:
fig.update_xaxes(
rangebreaks=[
dict(bounds=["sat", "mon"]), #hide weekends
dict(values=["2015-12-25", "2016-01-01"]) # hide Christmas and New Year's
]
)
The downside here is that your dataset may just as well be missing some data for any other weekday. And of course you would have to specify given dates for holidays for different countries, so are there any other approaches?
Reproducible code:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
# data
np.random.seed(1234)
n_obs = 15
frequency = 'D'
daterange = pd.date_range('2020', freq=frequency, periods=n_obs)
values = np.random.randint(low=-5, high=6, size=n_obs).tolist()
df = pd.DataFrame({'time':daterange, 'value':values})
df = df.set_index('time')
df.iloc[0]=100; df['value']=df.value.cumsum()
# Missing timestamps
df.iloc[2:5] = np.nan; df.iloc[8:13] = np.nan
df.dropna(inplace = True)
# plotly figure
fig=go.Figure(go.Scatter(x=df.index, y =df['value']))
fig.update_layout(template = 'plotly_dark')
fig.show()

They key here is still to use the rangebreak attribute. But if you were to follow the approach explained in the linked example, you'd have to include each missing date manually. But the solution to missing data in this case is actually more missing data. And this is why:
1. You can retrieve the timestamps from the beginning and the end of your series, and then
2. build a complete timeline within that period (with possibly more missing dates) using:
dt_all = pd.date_range(start=df.index[0],
end=df.index[-1],
freq = 'D')
3. Next you can isolate the timestamps you do have in df.index that are not in that timeline using:
dt_breaks = [d for d in dt_all_py if d not in dt_obs_py]
4. And finally you can include those timestamps in rangebreaks like so:
fig.update_xaxes(
rangebreaks=[dict(values=dt_breaks)]
)
Plot:
Complete code:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
# data
np.random.seed(1234)
n_obs = 15
frequency = 'D'
daterange = pd.date_range('2020', freq=frequency, periods=n_obs)
values = np.random.randint(low=-5, high=6, size=n_obs).tolist()
df = pd.DataFrame({'time':daterange, 'value':values})
df = df.set_index('time')
df.iloc[0]=100; df['value']=df.value.cumsum()
# Missing timestamps
df.iloc[2:5] = np.nan; df.iloc[8:13] = np.nan
df.dropna(inplace = True)
# plotly figure
fig=go.Figure(go.Scatter(x=df.index, y =df['value']))
fig.update_layout(template = 'plotly_dark')
# complete timeline between first and last timestamps
dt_all = pd.date_range(start=df.index[0],
end=df.index[-1],
freq = frequency)
# make sure input and synthetic time series are of the same types
dt_all_py = [d.to_pydatetime() for d in dt_all]
dt_obs_py = [d.to_pydatetime() for d in df.index]
# find which timestamps are missing in the complete timeline
dt_breaks = [d for d in dt_all_py if d not in dt_obs_py]
# remove missing timestamps from visualization
fig.update_xaxes(
rangebreaks=[dict(values=dt_breaks)] # hide timestamps with no values
)
#fig.update_layout(title=dict(text="Some dates are missing, but still displayed"))
fig.update_layout(title=dict(text="Missing dates are excluded by rangebreaks"))
fig.update_xaxes(showgrid=False)
fig.show()

When handle big size data, the 'rangebreaks' method is working but performance is low, change the xaxis type to 'category' is also working.
fig.update_xaxes(type='category')

You can use dtick property. change the tick interval to be one day which should be defined in Milliseconds like 86400000 via the dtick property. Refer the following code:
fig.update_xaxes(dtick=86400000)

Related

Plotly: different colors in one line

I am trying to distinguish weekends from weekdays by either 1) shading the region 2) coloring points with different colors or 3) setting x-axis label marked different for weekend.
Here I am trying the 2nd option — coloring data points for weekend differently. I first created an additional column (Is_Weekday) for distinguish weekends from weekdays. However, it’s not drawn on the same line, but rather draws two lines with different colors. I would like them to be in one line but with different color for values on weekends.
Here’s my code for reproducible data:
import pandas as pd
from datetime import datetime
import plotly.express as px
np.random.seed(42)
rng = pd.date_range('2022-04-10', periods=21, freq='D')
practice_df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})
practice_df = practice_df.set_index('Date')
weekend_list = []
for i in range(len(practice_df)):
if practice_df.index[i].weekday() > 4:
weekend_list.append(True)
else:
weekend_list.append(False)
practice_df['IsWeekend'] = weekend_list
fig = px.line(temp_df,
x=temp_df.index, y='cnt',
color = 'Is_Weekend',
markers=True)
fig.show()
What I want to do would look something like this but coloring data points/line for weekends differently.
Edit:
Thanks so much to #Derek_O, I was able to color weekend with my original dataset. But I'd want to color the friday-saturday line also colored as weekend legend, so I set practice_df.index[i].weekday() >= 4 instead of practice_df.index[i].weekday() > 4.
But would it be possible to have the Friday point to be the same as weekdays.
Also, is it possible to have a straight line connecting the points, not like stairs?
Otherwise, it'd also work if we could shade weekend region like the image at the bottom.
Borrowing from #Rob Raymond's answer here, we can loop through the practice_df two elements at a time, adding a trace to the fig for each iteration of the loop.
We also only want to show the legend category the first time it occurs (so that the legend entries only show each category like True or False once), which is why I've created a new column called "showlegend" that determines whether the legend is shown or not.
import numpy as np
import pandas as pd
from datetime import datetime
import plotly.express as px
import plotly.graph_objects as go
np.random.seed(42)
rng = pd.date_range('2022-04-10', periods=21, freq='D')
practice_df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})
practice_df = practice_df.set_index('Date')
weekend_list = []
for i in range(len(practice_df)):
if practice_df.index[i].weekday() > 4:
weekend_list.append(True)
else:
weekend_list.append(False)
practice_df['IsWeekend'] = weekend_list
weekend_color_map = {True:0, False:1}
weekend_name_map = {True:"True", False:"False"}
practice_df['color'] = practice_df['IsWeekend'].map(weekend_color_map)
practice_df['name'] = practice_df['IsWeekend'].map(weekend_name_map)
## use the color column since weekend corresponds to 0, nonweekend corresponds to 1
first_weekend_idx = practice_df['color'].loc[practice_df['color'].idxmin()]
first_nonweekend_idx = practice_df['color'].loc[practice_df['color'].idxmax()]
practice_df["showlegend"] = False
showlegendIdx = practice_df.columns.get_indexer(["showlegend"])[0]
practice_df.iat[first_weekend_idx, showlegendIdx] = True
practice_df.iat[first_nonweekend_idx, showlegendIdx] = True
practice_df["showlegend"] = practice_df["showlegend"].astype(object)
fig = go.Figure(
[
go.Scatter(
x=practice_df.index[tn : tn + 2],
y=practice_df['Val'][tn : tn + 2],
mode='lines+markers',
# line_shape="hv",
line_color=px.colors.qualitative.Plotly[practice_df['color'][tn]],
name=practice_df['name'][tn],
legendgroup=practice_df['name'][tn],
showlegend=practice_df['showlegend'][tn],
)
for tn in range(len(practice_df))
]
)
fig.update_layout(legend_title_text='Is Weekend')
fig.show()

Matplotlib Time-Series Heatmap Visualization Row Modification

Thank you in advance for the assistance!
I am trying to create a heat map from time-series data and the data begins mid year, which is causing the top of my heat map to be shifted to the left and not match up with the rest of the plot (Shown Below). How would I go about shifting the just the top line over so that the visualization of the data syncs up with the rest of the plot?
(Code Provided Below)
import pandas as pd
import matplotlib.pyplot as plt
# links to datadata
url1 = 'https://raw.githubusercontent.com/the-datadudes/deepSoilTemperature/master/minotDailyAirTemp.csv'
# load the data into a DataFrame, not a Series
# parse the dates, and set them as the index
df1 = pd.read_csv(url1, parse_dates=['Date'], index_col=['Date'])
# groupby year and aggregate Temp into a list
dfg1 = df1.groupby(df1.index.year).agg({'Temp': list})
# create a wide format dataframe with all the temp data expanded
df1_wide = pd.DataFrame(dfg1.Temp.tolist(), index=dfg1.index)
# ploting the data
fig, (ax1) = plt.subplots(ncols=1, figsize=(20, 5))
ax1.matshow(df1_wide, interpolation=None, aspect='auto');
Now, what its the problem, the dates on the dataset, if you see the Dataset this start on
`1990-4-24,15.533`
To solve this is neccesary to add the data between 1990/01/01 -/04/23 and delete the 29Feb.
rng = pd.date_range(start='1990-01-01', end='1990-04-23', freq='D')
df = pd.DataFrame(index= rng)
df.index = pd.to_datetime(df.index)
df['Temp'] = np.NaN
frames = [df, df1]
result = pd.concat(frames)
result = result[~((result.index.month == 2) & (result.index.day == 29))]
With this data
dfg1 = result.groupby(result.index.year).agg({'Temp': list})
df1_wide = pd.DataFrame(dfg1['Temp'].tolist(), index=dfg1.index)
# ploting the data
fig, (ax1) = plt.subplots(ncols=1, figsize=(20, 5))
ax1.matshow(df1_wide, interpolation=None, aspect='auto');
The problem with the unfilled portions are a consequence of the NaN values on your dataset, in this case you take the option, replace the NaN values with the column-mean or replace by the row-mean.
Another ways are available to replace the NaN values
df1_wide = df1_wide.apply(lambda x: x.fillna(x.mean()),axis=0)

How to plot multiple graphs with Plotly, where each plot is for a different (next) day?

I want to plot machine observation data by days separately,
so changes between Current, Temperature etc. can be seen by hour.
Basically I want one plot for each day. Thing is when I make too many of these Jupyter Notebook can't display each one of them and plotly gives error.
f_day --> first day
n_day --> next day
I think of using sub_plots with a shared y-axis but then I don't know how I can put different dates in x-axis
How can I make these with graph objects and sub_plots ? So therefore using only 1 figure object so plots doesn't crash.
Data looks like this
,ID,IOT_ID,DATE,Voltage,Current,Temperature,Noise,Humidity,Vibration,Open,Close
0,9466,5d36edfe125b874a36c6a210,2020-08-06 09:02:00,228.893,4.17,39.9817,73.1167,33.3133,2.05,T,F
1,9467,5d36edfe125b874a36c6a210,2020-08-06 09:03:00,228.168,4.13167,40.0317,69.65,33.265,2.03333,T,F
2,9468,5d36edfe125b874a36c6a210,2020-08-06 09:04:00,228.535,4.13,40.11,71.7,33.1717,2.08333,T,F
3,9469,5d36edfe125b874a36c6a210,2020-08-06 09:05:00,228.597,4.14,40.1683,71.95,33.0417,2.0666700000000002,T,F
4,9470,5d36edfe125b874a36c6a210,2020-08-06 09:06:00,228.405,4.13333,40.2317,71.2167,32.9933,2.0,T,F
Code with display error is this
f_day = pd.Timestamp('2020-08-06 00:00:00')
for day in range(days_between.days):
n_day = f_day + pd.Timedelta('1 days')
fig_df = df[(df["DATE"] >= f_day) & (df["DATE"] <= n_day) & (df["IOT_ID"] == iot_id)]
fig_cn = px.scatter(
fig_df, x="DATE", y="Current", color="Noise", color_continuous_scale= "Sunset",
title= ("IoT " + iot_id + " " + str(f_day.date())),
range_color= (min_noise,max_noise)
)
f_day = n_day
fig_cn.show()
updated
The question was with respect to plotly not matplotlib. Same approach works. Clearly axis and titles need some beautification
import pandas as pd
import plotly.subplots
import plotly.express as px
import datetime as dt
import random
df = pd.DataFrame([{"DATE":d, "IOT_ID":random.randint(1,5), "Noise":random.uniform(0,1), "Current":random.uniform(15,25)}
for d in pd.date_range(dt.datetime(2020,9,1), dt.datetime(2020,9,4,23,59), freq="15min")])
# get days to plot
days = df["DATE"].dt.floor("D").unique()
# create axis for each day
fig = plotly.subplots.make_subplots(len(days))
iot_id=3
for i,d in enumerate(days):
# filter data and plot ....
mask = (df["DATE"].dt.floor("D")==d)&(df["IOT_ID"]==iot_id)
splt = px.scatter(df.loc[mask], x="DATE", y="Current", color="Noise", color_continuous_scale= "Sunset",
title= f"IoT ({iot_id}) Date:{pd.to_datetime(d).strftime('%d %b')}")
# select_traces() returns a generator so turn it into a list and take first one
fig.add_trace(list(splt.select_traces())[0], row=i+1, col=1)
fig.show()
It's simple - create the axis that you want to plot on first. Then plot. I've simulated your data as you didn't provide in your question.
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import random
df = pd.DataFrame([{"DATE":d, "IOT_ID":random.randint(1,5), "Noise":random.uniform(0,1), "Current":random.uniform(15,25)}
for d in pd.date_range(dt.datetime(2020,9,1), dt.datetime(2020,9,4,23,59), freq="15min")])
# get days to plot
days = df["DATE"].dt.floor("D").unique()
# create axis for each day
fig, ax = plt.subplots(len(days), figsize=[20,10],
sharey=True, sharex=False, gridspec_kw={"hspace":0.4})
iot_id=3
for i,d in enumerate(days):
# filter data and plot ....
df.loc[(df["DATE"].dt.floor("D")==d)&(df["IOT_ID"]==iot_id),].plot(kind="scatter", ax=ax[i], x="DATE", y="Current", c="Noise",
colormap= "turbo", title=f"IoT ({iot_id}) Date:{pd.to_datetime(d).strftime('%d %b')}")
ax[i].set_xlabel("") # it's in the titles...
output

Formatting datetime for plotting time-series with pandas

I am plotting some time-series with pandas dataframe and I ran into the problem of gaps on weekends. What can I do to remove gaps in the time-series plot?
date_concat = pd.to_datetime(pd.Series(df.index),infer_datetime_format=True)
pca_factors.index = date_concat
pca_colnames = ['Outright', 'Curve', 'Convexity']
pca_factors.columns = pca_colnames
fig,axes = plt.subplots(2)
pca_factors.Curve.plot(ax=axes[0]); axes[0].set_title('Curve')
pca_factors.Convexity.plot(ax=axes[1]); axes[1].set_title('Convexity'); plt.axhline(linewidth=2, color = 'g')
fig.tight_layout()
fig.savefig('convexity.png')
Partial plot below:
Ideally, I would like the time-series to only show the weekdays and ignore weekends.
To make MaxU's suggestion more explicit:
convert to datetime as you have done, but drop the weekends
reset the index and plot the data via this default Int64Index
change the x tick labels
Code:
date_concat = data_concat[date_concat.weekday < 5] # drop weekends
pca_factors = pca_factors.reset_index() # from MaxU's comment
pca_factors['Convexity'].plot() # ^^^
plt.xticks(pca_factors.index, date_concat) # change the x tick labels

Getting Pandas datetime column to display as Dates, not Numbers, on Matplotlib X-axis

Using pandas and wondering why the date column isn't showing up as the actual dates (type = pandas.tslib.Timestamp) but are showing up as numbers.
Take this replicable example:
todays_date = datetime.datetime.now().date()
columns = ['month','A','B','C','D']
_dates = pd.DataFrame(pd.date_range(todays_date-datetime.timedelta(10), periods=150, freq='M'))
_randomdata = pd.DataFrame(np.random.randn(150, 4))
data = pd.concat([_dates, _randomdata], axis=1)
data.plot(figsize = (10,6))
As you can see, the x-axis is showing up as numbers, not dates.
2 questions:
a) How do I change it so that the actual dates are showing up on the x-axis?
b) How do I change the frequency of the ticks and tick labels on the x-axis if I want more/fewer months showing up?
Thanks guys!
Just use the date_range as an index to the DataFrame:
todays_date = datetime.datetime.now().date()
columns = ['A','B','C','D']
data = pd.DataFrame(data=np.random.randn(150, 4),
index=pd.date_range(todays_date-datetime.timedelta(10), periods=150, freq='M'),
columns=columns)
data.plot(figsize = (10,6))

Categories