The Problem:
I have a dataframe df that looks like this:
                                  value      msg_type
date
2022-03-15 08:15:10+00:00           122          None
2022-03-15 08:25:10+00:00           125          None
2022-03-15 08:30:10+00:00           126          None
2022-03-15 08:30:26.542134+00:00    127  ANNOUNCEMENT
2022-03-15 08:35:10+00:00           128          None
2022-03-15 08:40:10+00:00           122          None
2022-03-15 08:45:09+00:00           127          None
2022-03-15 08:50:09+00:00           133          None
2022-03-15 08:55:09+00:00           134          None
....
2022-03-16 09:30:09+00:00           132          None
2022-03-16 09:30:13.234425+00:00    135  ANNOUNCEMENT
2022-03-16 09:35:09+00:00           130          None
2022-03-16 09:40:09+00:00           134          None
2022-03-16 09:45:09+00:00           135          None
2022-03-16 09:50:09+00:00           134          None
The value data occurs in roughly 5 minute intervals, but messages can occur at any time. I am trying to plot one line of values per day, where the x-axis ranges from t=-2 hours to t=+8 hours, and the ANNOUNCEMENT occurs at t=0 (see image below).
So, for example, if an ANNOUNCEMENT occurs at 8:30AM on 3/15 and again at 9:30AM on 3/16, there should be two lines:
one line for 3/15 that plots data from 6:30AM to 4:30PM, and
one line for 3/16 that plots data from 7:30AM to 5:30PM,
both sharing the same x-axis ranging from -2 to +8, with ANNOUNCEMENT at t=0.
What I've Tried:
I am able to do this currently by finding the index position of an announcement (e.g. say it occurs at row 298 -> announcement_index = 298), generating an array of 120 numbers from -24 to 96 (representing 10 hours at 5 minutes per number -> x = np.arange(-24, 96, 1)), then plotting
sns.lineplot(x, y=df['value'].iloc[announcement_index-24:announcement_index+96])
While this does mostly work (see image below), I suspect it's not the correct way to go about it. Specifically, trying to add more info to the plot (like a different set of 'value' markers) at specific times is difficult because I would need to convert the timestamp into this arbitrary 24-96 value range.
How can I make this same plot but by utilizing the datetime index instead? Thank you so much!
Assuming the index has already been converted to_datetime, create an IntervalArray from -2H to +8H of the index:
dl, dr = -2, 8
left = df.index + pd.Timedelta(hours=dl)   # hours= avoids the 'H' unit string, which newer pandas versions deprecate
right = df.index + pd.Timedelta(hours=dr)
df['interval'] = pd.arrays.IntervalArray.from_arrays(left, right)
Then for each ANNOUNCEMENT, plot the window from interval.left to interval.right:
Set the x-axis as seconds since ANNOUNCEMENT
Set the labels as hours since ANNOUNCEMENT
fig, ax = plt.subplots()
for ann in df.loc[df['msg_type'] == 'ANNOUNCEMENT'].itertuples():
    window = df.loc[ann.interval.left:ann.interval.right] # extract interval.left to interval.right
    window.index -= ann.Index # compute time since announcement
    window.index = window.index.total_seconds() # convert to seconds since announcement
    window.plot(ax=ax, y='value', label=ann.Index.date())
deltas = np.arange(dl, dr + 1)
ax.set(xticks=deltas * 3600, xticklabels=deltas) # set tick labels to hours since announcement
ax.legend()
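Because the x-axis is now in seconds since the announcement, any extra timestamp can be placed on the plot by subtracting the announcement time instead of counting rows. Here is a minimal sketch of that conversion using only the standard library; seconds_since and the marker time are hypothetical and only for illustration:

```python
from datetime import datetime, timezone

def seconds_since(event_time, announcement_time):
    """Return the signed offset of event_time from announcement_time, in seconds."""
    return (event_time - announcement_time).total_seconds()

# Announcement timestamp taken from the sample data
announcement = datetime(2022, 3, 15, 8, 30, 26, 542134, tzinfo=timezone.utc)
# Hypothetical extra marker exactly one hour after the announcement
marker = datetime(2022, 3, 15, 9, 30, 26, 542134, tzinfo=timezone.utc)

x_marker = seconds_since(marker, announcement)  # position on the seconds axis
hours_marker = x_marker / 3600                  # same position in tick-label units
```

On the axes above, a call such as ax.axvline(x_marker) should then land at t=+1 hour, since the ticks are placed at multiples of 3600 seconds.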
Here is the output with a smaller window -1H to +2H just so we can see the small sample data more clearly (full code below):
Full code:
import io
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
s = '''
date,value,msg_type
2022-03-15 08:15:10+00:00,122,None
2022-03-15 08:25:10+00:00,125,None
2022-03-15 08:30:10+00:00,126,None
2022-03-15 08:30:26.542134+00:00,127,ANNOUNCEMENT
2022-03-15 08:35:10+00:00,128,None
2022-03-15 08:40:10+00:00,122,None
2022-03-15 08:45:09+00:00,127,None
2022-03-15 08:50:09+00:00,133,None
2022-03-15 08:55:09+00:00,134,None
2022-03-16 09:30:09+00:00,132,None
2022-03-16 09:30:13.234425+00:00,135,ANNOUNCEMENT
2022-03-16 09:35:09+00:00,130,None
2022-03-16 09:40:09+00:00,134,None
2022-03-16 09:45:09+00:00,135,None
2022-03-16 09:50:09+00:00,134,None
'''
df = pd.read_csv(io.StringIO(s), index_col=0, parse_dates=['date'])
# create intervals from -1H to +2H of the index
dl, dr = -1, 2
left = df.index + pd.Timedelta(hours=dl)
right = df.index + pd.Timedelta(hours=dr)
df['interval'] = pd.arrays.IntervalArray.from_arrays(left, right)
# plot each announcement's interval.left to interval.right
fig, ax = plt.subplots()
for ann in df.loc[df['msg_type'] == 'ANNOUNCEMENT'].itertuples():
    window = df.loc[ann.interval.left:ann.interval.right] # extract interval.left to interval.right
    window.index -= ann.Index # compute time since announcement
    window.index = window.index.total_seconds() # convert to seconds since announcement
    window.plot(ax=ax, y='value', label=ann.Index.date())
deltas = np.arange(dl, dr + 1)
ax.set(xticks=deltas * 3600, xticklabels=deltas) # set tick labels to hours since announcement
ax.grid()
ax.legend()
I've found several similar questions, for example: Python - Plotting Error Bar Chart with Uneven Errors (High and Low)
But when I was trying with my data, I got some weird results. I have an original dataframe dotvalue:
ticker      AAPL           AMD      BIDU     GOOGL      IXIC      MSFT      NDXT      NVDA      NXPI      QCOM      SWKS       TXN
5       0.222649  3.512100e-02  0.043558  0.153921  0.374783  0.201710  0.377886  0.159817  0.206961  0.151937  0.132801  0.226767
10      0.203363  3.398862e-02  0.113287  0.173990  0.393209  0.236895  0.421558  0.205209  0.326829  0.128487  0.174043  0.312648
...
145     0.089661  4.591069e-05  0.136814  0.017958  0.030406  0.000834  0.083278  0.162984  0.081382  0.081221  0.047221  0.057464
150     0.143404  2.403103e-02  0.076241  0.113305  0.061792  0.014000  0.096749  0.060709  0.170400  0.341342  0.049486  0.059982
and the dot plot is like below:
And for each value on x-axis, I have a series called avg=dotvalue.mean(axis=1) which looks like:
5 0.190659
10 0.226959
...
145 0.065772
150 0.100953
And if I add the avg to the plot, it looks like below:
Then I calculate the confidence interval as:
ci_u = dotvalue.quantile(q=0.975, axis=1) # upper limit
ci_l = dotvalue.quantile(q=0.025, axis=1) # lower limit
and ci_u is like:
5 0.377033
10 0.413762
...
145 0.155787
150 0.294333
ci_l is like:
5 0.037441
10 0.055796
...
145 0.000263
150 0.016758
For each x value, I want the ci_u as upper bound of the error bar around the avg, and ci_l as the lower bound.
I tried avg.plot(yerr = (ci_l, ci_u), ls = '-', marker = 'o', figsize = (10,5), label = "average") which gives me:
which is clearly wrong, since for example, when x=5, the upper bound should be 0.377033 and lower bound should be 0.037441, while the error bar on the plot for x=5 is something like (0.15, 0.57).
Any help on why this happens and how I should correct it?
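A likely cause, assuming matplotlib's usual yerr semantics apply here: yerr is read as distances from each point, not as absolute bounds. Passing (ci_l, ci_u) would then draw bars from avg - ci_l to avg + ci_u, which for x=5 is about 0.15 to 0.57 and matches the plot. Here is a minimal sketch of converting the bounds into offsets, using a hypothetical helper bounds_to_offsets and the x=5 and x=10 values quoted above:

```python
def bounds_to_offsets(center, lower, upper):
    """Convert absolute lower/upper bounds into distances from center.

    Returns [lower_offsets, upper_offsets], the 2 x N layout that
    matplotlib's Axes.errorbar accepts for asymmetric error bars.
    """
    lower_offsets = [c - lo for c, lo in zip(center, lower)]
    upper_offsets = [hi - c for c, hi in zip(center, upper)]
    return [lower_offsets, upper_offsets]

# Values for x = 5 and x = 10 copied from the question
avg = [0.190659, 0.226959]
ci_l = [0.037441, 0.055796]
ci_u = [0.377033, 0.413762]

yerr = bounds_to_offsets(avg, ci_l, ci_u)
# With matplotlib this could then be drawn as, for example:
#   ax.errorbar([5, 10], avg, yerr=yerr, ls='-', marker='o', label='average')
```

With the pandas Series from the question, the same subtraction works elementwise: avg - ci_l for the lower offsets and ci_u - avg for the upper offsets.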
I am making a matplotlib.axes.Axes.stem graph whose x-axis is a timeline in days. Some days have data, while other days have none (no such entries exist in my data).
Question 1: How do I make a timeline stem graph that shows my data and also includes the days with no data? Is this possible? Is there some way to auto-scale the x-axis to handle such a situation?
Below is a sample data file called test.txt and my Python script that reads it and draws a timeline stem plot. The output from this script is also given below.
Question 2 (presentation): How do I show a "-" symbol at each annotation? Also, how do I rotate the annotations by 30 degrees?
test.txt
No. Date
1 23/01/2020
2 24/01/2020
3 24/01/2020
4 26/01/2020
5 27/01/2020
6 28/01/2020
7 29/01/2020
8 29/01/2020
9 30/01/2020
10 30/01/2020
11 31/01/2020
12 31/01/2020
13 01/02/2020
14 01/02/2020
15 04/02/2020
16 04/02/2020
17 04/02/2020
18 05/02/2020
19 05/02/2020
20 05/02/2020
21 06/02/2020
22 07/02/2020
23 07/02/2020
24 07/02/2020
25 08/02/2020
26 08/02/2020
27 08/02/2020
28 08/02/2020
29 08/02/2020
30 09/02/2020
31 10/02/2020
32 10/02/2020
33 11/02/2020
34 11/02/2020
38 13/02/2020
39 13/02/2020
40 13/02/2020
41 13/02/2020
42 13/02/2020
43 13/02/2020
44 14/02/2020
45 14/02/2020
46 14/02/2020
47 14/02/2020
48 14/02/2020
49 14/02/2020
50 15/02/2020
51 15/02/2020
52 15/02/2020
53 15/02/2020
54 15/02/2020
57 18/02/2020
58 18/02/2020
59 18/02/2020
60 19/02/2020
61 21/02/2020
stem_plot.py
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates
from datetime import datetime
from pathlib import Path
#########################
#### DATA EXTRACTION ####
#########################
source = Path('./test.txt')
with source.open() as f:
    lines = f.readlines()
#print( lines )
# Store source data in dictionary with date shown as mm-dd.
data={}
for line in lines[1:]:
    case, cdate = line.strip().split()
    cdate = datetime.strptime(cdate, "%d/%m/%Y").strftime('%m-%d')
    data[case] = cdate
print( f'\ndata = {data}' )
# Collate data's y-axis for each date, i.e. history
history2={}
cdates = list(data.values())
sorted_dates = sorted( set( cdates ) )
for i in sorted_dates:
    cases=[]
    for case, date in data.items():
        if i == date:
            cases.append(case)
    #print( i, cases)
    history2[i] = cases
print( f'\nhistory2 = {history2}')
###########################
#### DATA PRESENTATION ####
###########################
# Create figure and plot a stem plot with the date
fig, ax = plt.subplots(figsize=(8.8, 5), constrained_layout=True)
ax.set(title="Test")
labels=list( history2.values() ) # For annotation
yy = [ len(i) for i in labels ] # y-axis
xx = list(history2.keys()) # x-axis
markerline, stemline, baseline = ax.stem(
    xx, yy, linefmt="C1:", basefmt="k-", use_line_collection=True)
plt.setp(markerline, marker="None" )
# annotate stem lines
for ann_x, label in list(history2.items()):
    print(ann_x, label)
    each_count=1
    for each in label:
        ax.annotate( each, xy=(ann_x, each_count), xycoords='data')
        each_count += 1
        #print(f'each_count = {each_count}' )
# format xaxis
plt.setp( ax.get_xticklabels(), rotation=30 )
# remove top and right spines
for spine in ["top", "right"]:
    ax.spines[spine].set_visible(False)
# show axis name
ax.get_yaxis().set_label_text(label='Y-axis')
ax.get_xaxis().set_label_text(label='X-axis')
plt.show()
Current output:
About your first question. Basically, you make a list of all days between the days you are using and use that. So add this to the beginning of your code:
import pandas as pd
alldays = pd.date_range(start="20200123",
                        end="20200221",
                        normalize=True)
dates = []
for i in alldays:
    dates.append(f"{i.month:02}-{i.day:02}")
This gets a pandas date range between the two dates and converts it into a list of month-day strings.
Then modify this part of your code like this:
# Collate data's y-axis for each date, i.e. history
history2={}
cdates = list(data.values())
sorted_dates = sorted( set( cdates ) )
for i in dates: # This is the only change!
    cases=[]
    for case, date in data.items():
        if i == date:
            cases.append(case)
    #print( i, cases)
    history2[i] = cases
And this change would give you this:
About your second question, change your code to this:
# annotate stem lines
for ann_x, label in list(history2.items()):
    print(ann_x, label)
    each_count=1
    for each in label:
        ax.annotate(f"--{each}", xy=(ann_x, each_count), xycoords='data', rotation=30)
        each_count += 1
I just changed the ax.annotate line. The two changes are:
added a "--" to each of your annotation labels,
added a rotation parameter. The rotation parameter does not appear directly in the documentation for annotate, but the documentation says that additional keyword arguments are passed on to Text, and rotation is one of the Text properties.
This would hopefully give you what you have asked for:
Adding to @SinanKurmus's answer to my first question:
Solution 1:
A time axis with a daily interval covering the entire history of the given data can be obtained with Matplotlib's drange and num2date functions plus plain Python, so pandas can be avoided here.
First, express the start and end dates of the time axis as Python datetime objects. Note that you need to add one more day to the end date, otherwise data from the last date will not be included. Next, define the time interval as one day using Python's datetime.timedelta. Then pass all three to matplotlib.dates.drange, which returns a NumPy array. Matplotlib's num2date function in turn converts that array back to Python datetime objects.
from datetime import datetime, timedelta
import matplotlib.dates as mdates
def get_time_axis( data ):
    # the strptime format below expects the dates in data as 'YYYY-MM-DD' strings
    start = datetime.strptime(min(data.values()), "%Y-%m-%d")
    end = datetime.strptime(max(data.values()), "%Y-%m-%d") + timedelta(days=1)
    delta = timedelta(days=1)
    time_axis_md = mdates.drange( start, end, delta )
    time_axis_py = mdates.num2date( time_axis_md, tz=None ) # Add tz when required
    return time_axis_py
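For comparison, the same daily axis can be sketched with the standard library alone, which also makes it easy to record a zero count for days without data. This is an illustrative sketch, not the code used above: daily_counts is a hypothetical helper, and it assumes the dates are given as dd/mm/YYYY strings as in test.txt.

```python
from collections import Counter
from datetime import datetime, timedelta

def daily_counts(date_strings, fmt="%d/%m/%Y"):
    """Return (days, counts) covering every day from the first to the last date."""
    parsed = [datetime.strptime(s, fmt).date() for s in date_strings]
    seen = Counter(parsed)
    start, end = min(parsed), max(parsed)
    n_days = (end - start).days + 1  # + 1 so the last day is included
    days = [start + timedelta(days=i) for i in range(n_days)]
    counts = [seen.get(day, 0) for day in days]  # 0 for days with no data
    return days, counts

# First few rows of test.txt; 25/01/2020 has no entries
sample = ["23/01/2020", "24/01/2020", "24/01/2020", "26/01/2020"]
days, counts = daily_counts(sample)
```

The resulting days and counts lists have one entry per calendar day, so they can serve as the x and y values of the plot without gaps in the axis.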
Solution 2:
Apparently, Matplotlib also has a FAQ on how to skip dates where there is no data. I have included their sample code example below.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib.ticker as ticker
r = mlab.csv2rec('../data/aapl.csv')
r.sort()
r = r[-30:] # get the last 30 days
N = len(r)
ind = np.arange(N) # the evenly spaced plot indices
def format_date(x, pos=None):
    thisind = np.clip(int(x+0.5), 0, N-1)
    return r.date[thisind].strftime('%Y-%m-%d')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ind, r.adj_close, 'o-')
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
fig.autofmt_xdate()
plt.show()
With Bokeh, I'm trying to color scatter with datetime values and create a colorbar with a datetime scale.
(something like this) :
Origin pro Graph
A sample of the timeseries :
Date              Rate  Level
01/01/2019 08:59  38.3  -19.7
02/01/2019 09:04  39.1  -21
01/01/2019 09:09  40.7  -31
01/01/2019 09:14  42.1  -15
01/01/2019 09:19  43.6  -14
01/01/2019 09:24  46.8  -19.7
I tried to plot Rate=f(Level) :
import os
import pandas as pd
from bokeh.models import ColorBar, ColumnDataSource, LinearColorMapper
from bokeh.plotting import figure, show
cwd=os.getcwd()
delimit_file=','
fichier = 'qsv1.csv'
qsv = pd.read_csv(fichier,delimiter=delimit_file, encoding = 'ISO-8859-1')
qsv['Date'] = pd.to_datetime(qsv['Date'], format='%Y/%m/%d %H:%M')
p = figure(x_axis_type="datetime", plot_width=800, plot_height=500,)
source = ColumnDataSource(qsv)
exp_cmap = LinearColorMapper(palette="Viridis256",
                             low = min(qsv["Date"]),
                             high = max(qsv["Date"]))
p.circle("Rate", "Level", source=source, line_color=None,
         fill_color={"field":"Date", "transform":exp_cmap})
#p.line("Date", "QS_conv", source=source, color='navy', legend='moyenne glissante')
bar = ColorBar(color_mapper=exp_cmap, location=(0,0))
p.add_layout(bar, "left")
show(p)
But I get:
ValueError: expected a value of type Real, got 2019-01-01 00:04:00 of type Timestamp
Does anyone know how to solve this problem?
Thanks, it works after converting the datetimes to milliseconds:
qsv['actualDateTime'] = qsv['Date'].astype(np.int64) / int(1e6)  # nanoseconds -> milliseconds since epoch
exp_cmap1 = LinearColorMapper(palette="Viridis256",
                              low = min(qsv['actualDateTime']),
                              high = max(qsv['actualDateTime']))
p1.circle("Debit", "Niveau", source=source, line_color=None,
          fill_color={"field":"actualDateTime", "transform":exp_cmap1})
bar1 = ColorBar(color_mapper=exp_cmap1, location=(0,0), formatter=DatetimeTickFormatter(days=["%d/%m/%y"]), label_standoff=12)
Please always include a full stack trace, not just one line. Presumably, this message is from setting, e.g.
low = min(qsv["Date"])
The configuration properties of the linear color mapper do not expect anything other than plain numbers. The underlying units of datetimes in Bokeh are Milliseconds since Epoch so that is what you should convert your datetime values to before passing to the color mapper.
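As a minimal sketch of that conversion with the standard library only: to_epoch_ms is a hypothetical helper, and it assumes that naive timestamps should be interpreted as UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_ms(moment):
    """Return whole milliseconds between the Unix epoch and moment."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are treated as UTC
        moment = moment.replace(tzinfo=timezone.utc)
    return (moment - EPOCH) // timedelta(milliseconds=1)

# First and last timestamps of the sample (day/month/year in the file)
low = to_epoch_ms(datetime(2019, 1, 1, 8, 59))
high = to_epoch_ms(datetime(2019, 1, 1, 9, 24))
```

Plain numbers like these are what the low and high properties of the color mapper expect.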
Hi and thanks in advance,
So I'm trying to write my forecast data to a file from my plot which is using an ARIMA forecast. How would I be able to do so, so that I could access the saved forecast data later?
Here is my code:
import pandas
from matplotlib import pyplot
from statsmodels.tsa.arima_model import ARIMA  # assumed import: the legacy statsmodels ARIMA class, matching fit(disp=..., method='css')
series = pandas.read_csv('Quantity.csv',header=0,parse_dates=[0])
series.columns = ['Date','Quantity']
series.set_index(['Date'],inplace=True)
model = ARIMA(series['Quantity'].astype(float), order=(2,0,1))
fittedModel = model.fit(disp=0,method='css')
stepsAhead = 10
forecastArray = fittedModel.forecast(steps=stepsAhead)
for i in range(stepsAhead):
    series.loc[len(series)] = forecastArray[0][i]
series.plot()
pyplot.show()
Here is the data I used to plot with if needed:
Date Quantity
2010/01/01 1358
2010/07/02 0
2010/08/03 0
2011/02/04 0
2011/11/05 0
2011/12/06 274
2012/06/07 1074
2012/08/30 2223
2013/04/16 0
2013/03/18 1753
2014/02/22 345
2014/01/27 24
2015/12/15 652
2015/09/28 275
2016/05/04 124
2017/11/07 75
2017/09/22 32
2017/04/04 12
Thank you.
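import os
series.to_csv(os.path.join(yourPath, 'Forecasts.csv'))  # yourPath is the folder to save into
I guess that will be convenient for you: the file can be loaded again later with pandas.read_csv.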
I've created a script to create multiple plots in one object. The result I am looking for is two plots, one above the other, such that each plot has its own y-axis scale while the x-axis (dates) is shared. However, only the top plot is created properly; the bottom plot is visible but empty, i.e. the geom_line is not visible. Furthermore, the y-axis of the second plot does not match the range of the values (min to max): it ranges from 0 to 0.05. I also tried facet_grid with scales="free", but the y-axis did not change.
I've limited the date range to the past few weeks. This is the code I am using:
df = df.set_index('date')
weekly = df.resample('w-mon',label='left',closed='left').sum()
data = weekly[-4:].reset_index()
data= pd.melt(data, id_vars=['date'])
pplot = ggplot(aes(x="date", y="value", color="variable", group="variable"), data)
#geom_line()
scale_x_date(labels = date_format('%d.%m'),
             limits=(data.date.min() - dt.timedelta(2),
                     data.date.max() + dt.timedelta(2)))
#facet_grid("variable", scales="free_y")
theme_bw()
The dataframe sample (df) is a daily dataset containing values for each variable, x and a; in this case 'date' is the index:
date          x   a
2016-08-01  100  20
2016-08-02   50   0
2016-08-03   24  18
2016-08-04    0  10
The dataframe sample (to_plot) - weekly overview:
        date variable  value
0 2016-08-01        x    200
1 2016-08-08        x    211
2 2016-08-15        x    104
3 2016-08-22        x    332
4 2016-08-01        a      8
5 2016-08-08        a     15
6 2016-08-15        a     22
7 2016-08-22        a      6
Sorry for not adding the df dataframe before.
Your calls to the plot directives geom_line(), scale_x_date(), etc. are standing on their own in your script; you do not connect them to your plot object. Thus, they do not have any effect on your plot.
In order to apply a plot directive to an existing plot object, use the graphics language and "add" them to your plot object by connecting them with a + operator.
The result (as intended):
The full script:
from __future__ import print_function
import sys
import pandas as pd
import datetime as dt
from ggplot import *
if __name__ == '__main__':
    df = pd.DataFrame({
        'date': ['2016-08-01', '2016-08-08', '2016-08-15', '2016-08-22'],
        'x': [100, 50, 24, 0],
        'a': [20, 0, 18, 10]
    })
    df['date'] = pd.to_datetime(df['date'])
    data = pd.melt(df, id_vars=['date'])
    plt = ggplot(data, aes(x='date', y='value', color='variable', group='variable')) +\
        scale_x_date(
            labels=date_format('%y-%m-%d'),
            limits=(data.date.min() - dt.timedelta(2), data.date.max() + dt.timedelta(2))
        ) +\
        geom_line() +\
        facet_grid('variable', scales='free_y')
    plt.show()