Trigger update of Bokeh plot upon new data point in stream - python

I am trying to create a plot in Bokeh to visualize data from a live data feed. I'm fairly new to Bokeh at this point. The data stream is a stream of files where the data for the plots first needs to be extracted and then pre-processed before being visualized. This part is currently handled by the Python watchdog package, where processing is triggered by the appearance of a new file in the monitored streams.
The output of this is a dictionary holding all the information for that particular data point that is needed in the plots handled by the Bokeh app.
My question is, how would I trigger an update of the Bokeh plot when a new datapoint arrives?
I had looked into add_periodic_callback, but since I do not know up front when a new datapoint will arrive, nor how much time there will be between them, I risk missing data in the plot. What would be the best way to solve this?
1) Use functionality "x" in Bokeh that I am unaware of, that will trigger an update of the ColumnDataSource and the actual plots, exactly when a new datapoint arrives (this would be my preferred solution).
2) Create a form of buffer data source that stores the data for the past NN files, and then use add_periodic_callback with a function that queries this source to update the ColumnDataSource.
3) Another solution than the two above, that I don't know of with my limited software development skills.
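Bokeh's documented pattern for pushing updates from another thread into a server app is doc.add_next_tick_callback() combined with ColumnDataSource.stream(), which fits option 1: the watchdog handler schedules the update the moment the new data point is ready, so nothing is polled or missed. A minimal sketch, assuming it runs under bokeh serve; process_file() and WATCH_DIR are placeholders for your pre-processing and watched path:

from functools import partial

from bokeh.io import curdoc
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

source = ColumnDataSource(data=dict(x=[], y=[]))
plot = figure()
plot.line('x', 'y', source=source)

doc = curdoc()
doc.add_root(plot)

def update(new_data):
    # Runs on the Bokeh server's event loop, so it is safe to touch Bokeh models here
    source.stream(new_data, rollover=1000)

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Runs on the watchdog thread: extract/pre-process, then hand off to Bokeh
        new_data = process_file(event.src_path)  # placeholder: returns a dict of columns
        doc.add_next_tick_callback(partial(update, new_data))

observer = Observer()
observer.schedule(NewFileHandler(), path=WATCH_DIR)  # placeholder watched directory
observer.start()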

Related

How to plot OHLC data in simulated real-time from historical tick data?

I'm working on developing a charting package that is able to take in tick data and, given a predefined time interval (timeframe), plot price movements on a candlestick chart (open, high, low, and close) in simulated real-time, allowing one to change the speed and direction of the simulation, as well as pause it. Here is an example of the desired behavior (updating the data in real-time) represented graphically:
[Animation: real-time price data plotted on a candlestick chart]
The UI aspect of the plotting is not a concern, as the graphical aspect of the project has already been created (I am using a function that takes in all the aggregated OHLC data and plots it according to timeframe). I am, however, having difficulty deciding on how to achieve the above described behavior in a performance-friendly manner (e.g. not re-creating the entire OHLC dataframe from the tick data every time a new tick is read into the method) as the dataset I am working with is quite large.
I've thought of some preliminary solutions, however all of them have their flaws and I am at a loss for how to tackle this problem. The following are some of the ideas I've had:
Deriving OHLC data directly from the tick data with each new tick.
The idea here is that we'll have a sliding timestamp that will allow us to read in all the tick data up to the "current" time in the simulation, and then aggregate it into OHLC format using a dataframe.
Although this achieves our desired behavior, it requires us to recreate the dataframe containing the OHLC data on every tick, so the overhead for a large set of data would be unacceptable.
Pre-processing the ticks into OHLC and reading directly from the dataframe.
Similar to the above, using a sliding timestamp to read data up to the "current" time in the simulation. The difference here, is that we'd be reading data from the OHLC dataframe directly after pre-processing all the tick data into it.
This allows us to easily go through the data, as it would exist in a format that the method which plots it would understand (that being OHLC), but it doesn't achieve our required behavior of simulating movements that happen during a candle's formation.
Performing the first solution every n milliseconds.
Although this would reduce the overhead, it would still recreate the dataframe every n milliseconds, which would still make it unfeasible for large datasets.
Only modifying the current candle.
The idea here is that all the historical candles will be stored in a dataframe, while the current candle is updated with every new tick until it closes (and so on with each successive candle).
While this achieves the desired simulation behavior in a more performance-friendly manner, I'm not sure how we could get rewinding to work in this context, as moving backward on the historical data wouldn't be possible since it would be in OHLC format.
I am also unsure of how we could change the timeframe using this method as it would need to determine how much time to look back and gather ticks to create a new "current" candle.
I'm working with Python, however I think this is more of a conceptual problem than a language-specific one.
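Not a full answer, but a minimal sketch of the last idea (mutating only the current candle), assuming ticks arrive one at a time as (timestamp, price) pairs; rewinding or changing the timeframe would still require keeping the raw ticks around:

import pandas as pd

class CandleAggregator:
    def __init__(self, timeframe='1min'):
        self.timeframe = timeframe   # pandas frequency string, e.g. '1min', '5min'
        self.history = []            # closed candles; appended once, never rebuilt
        self.current = None          # the candle currently forming

    def on_tick(self, timestamp, price):
        bucket = timestamp.floor(self.timeframe)
        if self.current is None or bucket > self.current['time']:
            if self.current is not None:
                self.history.append(self.current)   # close the previous candle
            self.current = {'time': bucket, 'open': price, 'high': price,
                            'low': price, 'close': price}
        else:
            # Only the forming candle is touched on each tick
            self.current['high'] = max(self.current['high'], price)
            self.current['low'] = min(self.current['low'], price)
            self.current['close'] = price

    def frame(self):
        # Build a DataFrame only when the plot needs it, not on every tick
        rows = self.history + ([self.current] if self.current else [])
        return pd.DataFrame(rows).set_index('time')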

Python/Jupyter: scroll scatter-plot with many data points horizontally

I want to visualize time-series-like data with several measurements over time.
There are a lot of such measurements in a dataset, in the order of tens to hundreds of thousands.
In order to view these in a notebook or HTML page, I would like some efficient method to show a subrange of the whole time range with just a few hundred to a few thousand data points, and have controls to scroll left/right, i.e. forward/backward in time through the data.
I have tried doing this with Plotly and a range slider, but unfortunately this does not scale to a lot of data at all. Apparently, this approach creates all the graph data in the output javascript, which slows down everything and at some point makes the browser hang or crash.
What I would need is an approach that actually only renders the data in the subrange and interacts with the python code via the scrolling widgets to update the view.
Ideally, this would work with Plotly as I am using it for all other visualizations, but any other efficient solution would also be welcome.
Plotly runs into rendering issues when there are too many data points within the window (see Plotly Benchmarks). I would suggest using Plotly-Resampler which resamples data that is within the user's view.
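For reference, a rough sketch of that approach (pip install plotly-resampler); the series here is just illustrative noise:

import numpy as np
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

x = np.arange(200_000)
y = np.sin(x / 300) + np.random.randn(200_000) * 0.1

fig = FigureResampler(go.Figure())
# hf_x/hf_y hold the full high-frequency series; only a resampled subset of the
# currently visible range is sent to the browser, and it is refreshed on zoom/pan
fig.add_trace(go.Scattergl(name='measurement', mode='markers'), hf_x=x, hf_y=y)
fig.show_dash(mode='inline')   # serves the figure so scrolling stays responsive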

Python interactive plotting for large data sets

Suppose I have a dataset with 100k rows (1000 different times, 100 different series, an observation for each, and auxiliary information). I'd like to create something like the following:
(1) first panel of plot has time on x axis, and average of the different series (and standard error) on y axis.
(2) based on the time slice (vertical line) we hover over in panel 1, display a (potentially downsampled) scatter plot of auxiliary information versus the series value at that time slice.
I've looked into a few options for this: (1) matplotlib + ipywidgets doesn't seem to handle it unless you explicitly select points via a slider. This also doesn't translate well to HTML export. This is not ideal, but is potentially workable. (2) Altair - this library is pretty sleek, but from my understanding I need to give it the whole dataset for it to handle the interactions, and it can't handle more than ~5k data points. This would preclude my use case, correct?
Any suggestions as to how to proceed? Is what I'm asking impossible in the current state of things?
You can work with datasets larger than 5k rows in Altair, as specified in this section of the docs.
One of the most convenient solutions in my opinion is to install altair_data_server and then add alt.data_transformers.enable('data_server') at the top of your notebooks and scripts. This server will provide the data to Altair as long as your Python process is running, so there is no need to include all the data as part of the created chart specification, which means that the 5k-row error will be avoided. The main drawback is that it won't work if you export to a standalone HTML file, because you rely on being in an environment where the server's Python process is running.
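A minimal sketch of that setup (pip install altair_data_server); the chart itself is just illustrative:

import altair as alt
import pandas as pd

# Serve the data from the running Python process instead of embedding it in the spec
alt.data_transformers.enable('data_server')

df = pd.DataFrame({'x': range(100_000), 'y': range(100_000)})  # more than 5k rows is fine now
alt.Chart(df).mark_line().encode(x='x', y='y')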

Choroplethmapbox slow to render?

I’ve been playing around with Plotly and Dash for the first time over the past few days, with the hope of developing a browser-based data explorer for geographic NetCDF4 data. I’ve been impressed at how straightforward this has been so far, however I’m finding that some interactions with choroplethmapbox are taking longer to update and render than expected. I believe this may be the same issue discussed here
The following refers to the code and sample data available here, where the Dash application can be run using:
python choropleth.py (Python 3.7).
The source of my data comes from a 4D NetCDF4 file (in this case a model of ocean temperature - temp.nc) with dimensions of time, depth, lat and lon. In my case I'm only plotting a 2D choropleth map, but I'd like the user to interactively select the desired time interval (and eventually depth) as well (the render will always be in 2D space).
Using the examples from here, I’m using a GeoJSON file of the 2D grid cells coupled with a Pandas DataFrame to render ocean temperature. Everything is working as expected, however any changes to the slider value (time) take a long time to update (approx six seconds on my machine). It appears as though there’s a second or so between selecting the slider value and running the update_figure() callback, then another 4-5 seconds before the new render starts to take place in the browser.
The update_figure() callback reads the requested data directly from the NetCDF4 file, then directly updates the Z values in the existing figure dictionary and returns this as a new figure (see code fragment below). At first I was concerned that the slow response time was due to reading from the NetCDF4, however a basic timing function shows that the update_figure() callback runs in less than 0.01 seconds in most cases. So it appears the delay is either coming from the @app.callback or the render function (post update_figure()) in Dash?
# Create the callback and callback function (update_figure)
@app.callback(Output('plot', 'figure'),
              [Input('slide', 'value')],
              [State('plot', 'relayoutData'), State('plot', 'figure')])
def update_figure(x, r, f):
    t0 = tme.time()
    # Keep the current centre and zoom by default
    f['layout']['mapbox']['center']['lat'] = f['layout']['mapbox']['center']['lat']
    f['layout']['mapbox']['center']['lon'] = f['layout']['mapbox']['center']['lon']
    f['layout']['mapbox']['zoom'] = f['layout']['mapbox']['zoom']
    # If the map window has been panned or zoomed, grab those values for the new figure
    if r is not None:
        if 'mapbox.center' in r:
            f['layout']['mapbox']['center']['lat'] = r['mapbox.center']['lat']
            f['layout']['mapbox']['center']['lon'] = r['mapbox.center']['lon']
            f['layout']['mapbox']['zoom'] = r['mapbox.zoom']
    # Extract the new time values from the NetCDF file
    tmp = nc['temp'][x, -1, :, :].values.flatten()
    # Replace the Z values in the original figure with the updated values; leave
    # everything else (e.g. cell geojson and max/min ranges) as-is
    f['data'][0]['z'] = np.where(np.isnan(tmp), None, tmp).tolist()
    print("update_figure() time: ", tme.time() - t0)
    return f
I suspect that the slow render times are somehow related to the GeoJSON of each cell polygon (47k grid cell polygons are being rendered in total, with each polygon being defined by 6 points (i.e. 284k points total)), and unfortunately this can’t be simplified any further.
I'm seeking suggestions on how I can speed up the update/render when a user is interacting with the application. Two ideas I've had include:
Utilising WebGL if possible? It's unclear to me from the documentation whether choroplethmapbox already uses WebGL? If not, is there a pathway for making use of this for faster rendering?
Implementing some form of client side callback, although I don't know if this is possible given that I need to read the values directly out of the NetCDF file when requested by the user? Perhaps it's possible to just read/return the new Z values, then merge that with the existing GeoJSON on the client side?
Suggestions appreciated.
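Not from the thread, but a rough sketch of the second idea, assuming a dcc.Store(id='z-store') is added to the layout and that this replaces the update_figure callback above: a lightweight Python callback returns only the new Z values, and a clientside callback merges them into the existing figure in the browser, so the large GeoJSON never leaves the client.

@app.callback(Output('z-store', 'data'), [Input('slide', 'value')])
def fetch_z(x):
    # Read only the requested time slice from the NetCDF file, as before
    tmp = nc['temp'][x, -1, :, :].values.flatten()
    return np.where(np.isnan(tmp), None, tmp).tolist()

app.clientside_callback(
    """
    function(zValues, figure) {
        if (!zValues || !figure) { return window.dash_clientside.no_update; }
        // Swap in the new z array; the geojson and layout are reused untouched
        var newFig = Object.assign({}, figure);
        newFig.data = [Object.assign({}, figure.data[0], {z: zValues})];
        return newFig;
    }
    """,
    Output('plot', 'figure'),
    [Input('z-store', 'data')],
    [State('plot', 'figure')],
)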

matplotlib, draw multiple graphs / points in figure

I am trying to develop a telemetry system which sends sensor data from an Arduino, plotted in real time. For this I'm using Python and the matplotlib library. My problem is that every time a new data point arrives, I want to add it by plotting it into the same figure as the other data points. So far I have not found a solution to this.
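A minimal sketch of updating a single matplotlib figure in place as samples arrive; read_sensor() is a placeholder for however you read the Arduino data (e.g. via pyserial):

import matplotlib.pyplot as plt

plt.ion()                        # interactive mode: draw without blocking
fig, ax = plt.subplots()
line, = ax.plot([], [], 'o-')
xs, ys = [], []

for x, y in read_sensor():       # placeholder generator yielding incoming samples
    xs.append(x)
    ys.append(y)
    line.set_data(xs, ys)        # update the existing line instead of re-plotting
    ax.relim()
    ax.autoscale_view()
    fig.canvas.draw_idle()
    plt.pause(0.01)              # let the GUI event loop process the redraw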
You can stream data from an Arduino into a Plotly graph with the Arduino API in Plotly. You have two options: continuously transmit data (which it sounds like you'll want to do), or transmit a single chunk.
It will update the graph every few seconds if you refresh the page.
The Arduino API is available here. And, if you're already using Python, you can use the extend option to update data into another plot. The Python API is here.
Here's an example of how it looks to transmit from an Arduino, and you can see the interactive version here
Full disclosure: I work at Plotly.
As far as I can see, you have a few different ways of doing this (I'll list them in what I consider increasing difficulty):
Making a bitmap file, e.g. a .png, which has to be regenerated each time a new data point arrives. To do this you need to have your old data stored somewhere in a file or in a database.
Using SVG in a browser. You can then add points or lines using JavaScript (e.g. http://sickel.net/blogg/?p=1506 ).
Making a bitmap, storing it and editing it to add in new points - this really gets tricky if you either want to "roll old points off" at one end, or rescale the image when more data arrives.
Making a series of bitmaps, and having the total graph be a combination of many slices - here you can easily "roll off" old points, but you are out of luck if you want to rescale.
