Dynamic Regression Line on Bokeh Plot - Python - python

I have a bokeh scatter plot that updates the data displayed based on the checkbox selection for time of day. The legend allows you to hide or show the different datasets.
I'd like to add a regression line that updates with the selections. For example, if hour 0 and both data sets are showing it would calculate and show the corresponding regression line. Then if hour 9 was added the regression line would update accordingly.
Example of Current Bokeh Plot
Below is a sample of the code corresponding to the example plot.
# create filtered plot
filt_time_fig = figure(title = 'AC Power vs Expected Power',
x_axis_label = 'AC Power',
y_axis_label = 'Expected Power')
# define source
source = ColumnDataSource(master_interval_df)
# create checkbox
active_hour = '0'
f = BooleanFilter(booleans=[l == active_hour for l in master_interval_df['hour']])
uniq_hours = list(map(str,list(range(0,24))))
cg = CheckboxGroup(labels=uniq_hours, active=[uniq_hours.index(active_hour)])
cg.js_on_change('active',
CustomJS(args=dict(source=source, f=f),
code="""\
const hours = cb_obj.active.map(idx => cb_obj.labels[idx]);
f.booleans = source.data.hour.map(l => hours.includes(l));
source.change.emit();
"""))
#create glyph for each dataset
for turb in turbine_list:
filt_time_fig.circle(x = 'AC_p_'+turb,
y = 'Exp_p_'+turb,
source = source,
legend_label = turb,
line_color = None,
fill_color={"field":"hour","transform":colormap},
view = CDSView(source=source, filters=[f]))
cg.max_width=50
filt_time_fig.legend.click_policy = 'hide'
show(filt_time_fig)

Related

Issues in displaying year by year line plot using bokeh (pyspark dataframe)

I have an issue when I try to displaying line plot from pyspark dataframe using bokeh, it was displayed successfully but it shows what I'm not expected. The problem is some line plot dots are not connected year by year sequentially.
Previously I tried to sort the source data frame using orderBy :
# Join df_max, and df_avg to df_quake_freq
df_quake_freq = df_quake_freq.join(df_avg, ['Year']).join(df_max, ['Year'])
df_quake_freq = df_quake_freq.orderBy(asc('Year'))
df_quake_freq.show(5)
And the output is :
dataframe source for line plot
This is code for plotting:
# Create a magnitude plot
def plotMagnitude():
# Load the datasource
cds = ColumnDataSource(data=dict(
yrs = df_quake_freq['Year'].values.tolist(),
avg_mag = df_quake_freq['Avg_Magnitude'].round(1).values.tolist(),
max_mag = df_quake_freq['Max_Magnitude'].values.tolist()
))
# Tooltip
TOOLTIPS = [
('Year', ' #yrs'),
('Average Magnitude', ' #avg_mag'),
('Maximum Magnitude', ' #max_mag')
]
# Create the figure
mp = figure(title='Maximum and Average Magnitude by Year',
plot_width=1150, plot_height=400,
x_axis_label='Years',
y_axis_label='Magnitude',
x_minor_ticks=2,
y_range=(5, df_quake_freq['Max_Magnitude'].max() + 1),
toolbar_location=None,
tooltips=TOOLTIPS)
# Max Magnitude
mp.line(x='yrs', y='max_mag', color='#cc0000', line_width=2, legend='Max Magnitude', source=cds)
mp.circle(x='yrs', y='max_mag', color='#cc0000', size=8, fill_color='#cc0000', source=cds)
# Average Magnitude
mp.line(x='yrs', y='avg_mag', color='yellow', line_width=2, legend='Avg Magnitude', source=cds)
mp.circle(x='yrs', y='avg_mag', color='yellow', size=8, fill_color='yellow', source=cds)
mp = style(mp)
show(mp)
return mp
plotMagnitude()
The output is :
line plot
We can see from the picture that some dots are not connected sequentially, for example like 1965-1967-1966
My problem is solved, the reason why the dots are not sequenced is that plotMagnitude function is relying on global variables, so to handle this I put
sort.values() inside the function. This is the syntax:
def plotMagnitude():
# Load the datasource
cds = ColumnDataSource(data=dict(
yrs = df_quake_freq['Year'].sort_values().values.tolist(),
avg_mag = df_quake_freq['Avg_Magnitude'].round(1).values.tolist(),
max_mag = df_quake_freq['Max_Magnitude'].values.tolist()

Grouping data in Python using Bokeh and visualizing it

I am trying to build a visual that tracks widget counts by category using hbar. The source data is not aggregated. This is what it looks like:
This data is aggregated at MktCatKey level, but I want to group by category and then perform a calculation on the counts. Lets say if the category is Category_A, I want to add +10 to the counts. Finally, I want to display both current and projected on a visual.
This is how far I have gotten:
query = open('workingsql.sql')
dataset = pd.read_sql_query(query.read(), cnxn)
query.close()
p = figure()
CurrentCount = dataset.Current
ProjCount = dataset.Projected
Cat = dataset.Category
grouped = dataset.groupby('Category')['Current','Projected'].sum()
source = ColumnDataSource(grouped)
p = figure(y_range=Cat)
p.hbar(y=Cat, right = CurrentCount, left = 0, height = 0.5,source=source, fill_color="#D7D7D7")
p.hbar(y=Cat, right = ProjCount, left = 0, height = 0.5,source=source, fill_color="#E21150")
hover = HoverTool()
hover.tooltips = [("Totals", "#Current Current Count")]
hover.mode = 'hline'
p.add_tools(hover)
show(p)
I was able to get this to work if I source directly from the dataset. But since I’m trying to perform a calculation, I cant use the source directly. I’m not fully familiar on how to do an if statement on CurrentCount to see if it’s for Category_A or not but that’s where I’m at.
I have additional things I want to do on this dataset (like bring in a goals dataset and plot against that), but taking small steps for now. Any help is appreciated.
Working code below:
import pyodbc
import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, Div, Select, Slider, TextInput
from bokeh.embed import components
from bokeh.models.tools import HoverTool
query = open('workingsql.sql')
dataset = pd.read_sql_query(query.read(), cnxn)
query.close()
p = figure()
CurrentCount = dataset.Current
ProjCount = dataset.Projected
Cat = dataset.Category
grouped = dataset.groupby('Category')['Current','Projected'].sum()
source = ColumnDataSource(pd.DataFrame(grouped))
Category = source.data['Category'].tolist()
p = figure(y_range=Category)
p.hbar(y='Category', right = 'Current', left = 0, height = 0.5,source=source, fill_color="#D7D7D7")
p.hbar(y='Category', right = 'Projected', left = 0, height = 0.5,source=source, fill_color="#E21150")
hover = HoverTool()
hover.tooltips = [("Totals", "#Current Current Count")]
hover.mode = 'hline'
p.add_tools(hover)
show(p)

Bokeh callback not updating chart [duplicate]

Struggling to understand why this bokeh visual will not allow me to change plots and see the predicted data. The plot and select (dropdown-looking) menu appears, but I'm not able to change the plot for items in the menu.
Running Bokeh 1.2.0 via Anaconda. The code has been run both inside & outside of Jupyter. No errors display when the code is run. I've looked through the handful of SO posts relating to this same issue, but I've not been able to apply the same solutions successfully.
I wasn't sure how to create a toy problem out of this, so in addition to the code sample below, the full code (including the regression code and corresponding data) can be found at my github here (code: Regression&Plotting.ipynb, data: pred_data.csv, historical_data.csv, features_created.pkd.)
import pandas as pd
import datetime
from bokeh.io import curdoc, output_notebook, output_file
from bokeh.layouts import row, column
from bokeh.models import Select, DataRange1d, ColumnDataSource
from bokeh.plotting import figure
#Must be run from the command line
def get_historical_data(src_hist, drug_id):
historical_data = src_hist.loc[src_hist['ndc'] == drug_id]
historical_data.drop(['Unnamed: 0', 'date'], inplace = True, axis = 1)#.dropna()
historical_data['date'] = pd.to_datetime(historical_data[['year', 'month', 'day']], infer_datetime_format=True)
historical_data = historical_data.set_index(['date'])
historical_data.sort_index(inplace = True)
# csd_historical = ColumnDataSource(historical_data)
return historical_data
def get_prediction_data(src_test, drug_id):
#Assign the new date
#Write a new dataframe with values for the new dates
df_pred = src_test.loc[src_test['ndc'] == drug_id].copy()
df_pred.loc[:, 'year'] = input_date.year
df_pred.loc[:, 'month'] = input_date.month
df_pred.loc[:, 'day'] = input_date.day
df_pred.drop(['Unnamed: 0', 'date'], inplace = True, axis = 1)
prediction = lin_model.predict(df_pred)
prediction_data = pd.DataFrame({'drug_id': prediction[0][0], 'predictions': prediction[0][1], 'date': pd.to_datetime(df_pred[['year', 'month', 'day']], infer_datetime_format=True, errors = 'coerce')})
prediction_data = prediction_data.set_index(['date'])
prediction_data.sort_index(inplace = True)
# csd_prediction = ColumnDataSource(prediction_data)
return prediction_data
def make_plot(historical_data, prediction_data, title):
#Historical Data
plot = figure(plot_width=800, plot_height = 800, x_axis_type = 'datetime',
toolbar_location = 'below')
plot.xaxis.axis_label = 'Time'
plot.yaxis.axis_label = 'Price ($)'
plot.axis.axis_label_text_font_style = 'bold'
plot.x_range = DataRange1d(range_padding = 0.0)
plot.grid.grid_line_alpha = 0.3
plot.title.text = title
plot.line(x = 'date', y='nadac_per_unit', source = historical_data, line_color = 'blue', ) #plot historical data
plot.line(x = 'date', y='predictions', source = prediction_data, line_color = 'red') #plot prediction data (line from last date/price point to date, price point for input_date above)
return plot
def update_plot(attrname, old, new):
ver = vselect.value
new_hist_source = get_historical_data(src_hist, ver) #calls the function above to get the data instead of handling it here on its own
historical_data.data = ColumnDataSource.from_df(new_hist_source)
# new_pred_source = get_prediction_data(src_pred, ver)
# prediction_data.data = new_pred_source.data
#Import data source
src_hist = pd.read_csv('data/historical_data.csv')
src_pred = pd.read_csv('data/pred_data.csv')
#Prep for default view
#Initialize plot with ID number
ver = 781593600
#Set the prediction date
input_date = datetime.datetime(2020, 3, 31) #Make this selectable in future
#Select-menu options
menu_options = src_pred['ndc'].astype(str) #already contains unique values
#Create select (dropdown) menu
vselect = Select(value=str(ver), title='Drug ID', options=sorted((menu_options)))
#Prep datasets for plotting
historical_data = get_historical_data(src_hist, ver)
prediction_data = get_prediction_data(src_pred, ver)
#Create a new plot with the source data
plot = make_plot(historical_data, prediction_data, "Drug Prices")
#Update the plot every time 'vselect' is changed'
vselect.on_change('value', update_plot)
controls = row(vselect)
curdoc().add_root(row(plot, controls))
UPDATED: ERRORS:
1) No errors show up in Jupyter Notebook.
2) CLI shows a UserWarning: Pandas doesn't allow columns to be careated via a new attribute name, referencing `historical_data.data = ColumnDatasource.from_df(new_hist_source).
Ultimately, the plot should have a line for historical data, and another line or dot for predicted data derived from sklearn. It also has a dropdown menu to select each item to plot (one at a time).
Your update_plot is a no-op that does not actually make any changes to Bokeh model state, which is what is necessary to change a Bokeh plot. Changing Bokeh model state means assigning a new value to a property on a Bokeh object. Typically, to update a plot, you would compute a new data dict and then set an existing CDS from it:
source.data = new_data # plain python dict
Or, if you want to update from a DataFame:
source.data = ColumnDataSource.from_df(new_df)
As an aside, don't assign the .data from one CDS to another:
source.data = other_source.data # BAD
By contrast, your update_plot computes some new data and then throws it away. Note there is never any purpose to returning anything at all from any Bokeh callback. The callbacks are called by Bokeh library code, which does not expect or use any return values.
Lastly, I don't think any of those last JS console errors were generated by BokehJS.

Linking plots (box select, lasso etc) in Bokeh generated in a loop

I am plotting some data using bokeh using a for loop to iterate over my columns in the dataframe. For some reason the box select and lasso tools which I have managed to have as linked in plots explicitly plotted (i.e. not generated with a for loop) does not seem to work now.
Do I need to increment some bokeh function within the for loop?
#example dataframe
array = {'variable': ['var1', 'var2', 'var3', 'var4'],
'var1': [np.random.rand(10)],
'var2': [np.random.rand(10)],
'var3': [np.random.rand(10)],
'var4': [np.random.rand(10)]}
cols = ['var1',
'var2',
'var3',
'var4']
df = pd.DataFrame(array, columns = cols)
w = 500
h = 400
#collect plots in a list (start with an empty)
plots = []
#iterate over the columns in the dataframe
# specify the tools in TOOLS
#add additional lines to show tolerance bands etc
for c in df[cols]:
source = ColumnDataSource(data = dict(x = df.index, y = df[c]))
TOOLS = "pan,wheel_zoom,box_zoom,reset,save,box_select,lasso_select"
f = figure(tools = TOOLS, width = w, plot_height = h, title = c + ' Run Chart',
x_axis_label = 'Run ID', y_axis_label = c)
f.line('x', 'y', source = source, name = 'data')
f.triangle('x', 'y', source = source)
#data mean line
f.line(df.index, df[c].mean(), color = 'orange')
#tolerance lines
f.line (df.index, df[c + 'USL'][0], color = 'red', line_dash = 'dashed', line_width = 2)
f.line (df.index, df[c + 'LSL'][0], color = 'red', line_dash = 'dashed', line_width = 2)
#append the new plot in this loop to the existing list of plots
plots.append(f)
#link all the x_ranges
for i in plots:
i.x_range = plots[0].x_range
#plot
p = gridplot(plots, ncols = 2)
output_notebook()
show(p)
I expect to produce plots which are linked and allow me to box or lasso select some points on one chart and for them to be highlighted on the others. However, the plots only let me select on one plot with no linked behaviour.
SOLUTION
This may seem a bit of a noob problem, but I am sure someone else will come across this, so here is the answer!!!
Bokeh works by referring to a datasource object (the columndatasource object). You can pass your dataframe completely into this and then call explicit x and y values within the glyph creation (e.g. my f.line, f.triangle etc).
So I moved the 'source' outside of the loop to prevent it being reset each iteration and just passed my df to it. I then within the loop, call the iteration index + descriptor string (USL, LSL, mean) for the y values and the 'index' for my x values.
I add a box select tool explicitly with a 'name' defined so that when the box selects, it only selects those glyphs that I want it to select (i.e. don't want it to select my constant value mean and spec limit lines).
Also, be careful that if you want to output to a html or something, that you probably will need to supress your in-notebook output as bokeh does not like having duplicate plots open. I have not included my html output solution here.
In terms of adding linked lasso objects for loop generated plots, I could only find an explicit box select tool generator so not sure this is possible.
So here it is:
#keep the source out of the loop to stop it resetting every time
Source = ColumnDataSource(df)
for c in cols:
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
f = figure(tools = TOOLS, width = w, plot_height = h, title = c + ' Run Chart',
x_axis_label = 'Run ID', y_axis_label = c)
f.line(x = 'index', y = c , source = Source, name = 'data')
f.triangle(x = 'index', y = c, source = Source, name = 'data')
#data mean line
f.line(x = 'index', y = c + '_mean', source = Source, color = 'orange')
#tolerance lines
f.line (x = 'index', y = c + 'USL', color = 'red', line_dash = 'dashed', line_width = 2, source = Source)
f.line (x = 'index', y = c + 'LSL', color = 'red', line_dash = 'dashed', line_width = 2, source = Source)
# Add BoxSelect tool - this allows points on one plot to be highligted on all linked plots. Note only the delta info
# is linked using name='data'. Again names can be used to ensure only the relevant glyphs are highlighted.
bxselect1 = BoxSelectTool(renderers=f.select(name='data'))
f.add_tools(bxselect1)
plots.append(f)
#tie the x_ranges together so that panning is linked between plots
for i in plots:
i.x_range = plots[0].x_range
forp = gridplot(plots, ncols = 2)
show(forp)

Plotly for python, only first data point is being graphed

I am new to plotly and working on a script to generate a graph based on some results pulled from a database. However when I send the data over to plotly, only the first data point for each of the three traces is being graphed. I've verified that the lists contain the right data, I've even simply pasted the lists in instead of dynamically creating the variables. Unfortunately each time only the first data point is being graphed. Does anyone know what I am missing here? I am also open to another library if needed.
Is it also possible to have the x axis show as a string?
import plotly.plotly as py
import plotly.graph_objs as go
# Custom database class, works fine.
from classes.database import DatabaseConnection
# Database Connections and instances
db_instance = DatabaseConnection()
db_conn = db_instance.conn
db_cur = db_instance.cur
def main():
# Get a list of versions and their stats.
db_cur.execute(
"""
select row_to_json(x) from
(SELECT
versions.version_number,
cast(AVG(results.average) as double precision) as average,
cast(AVG(results.minimum) as double precision) as minimum,
cast(AVG(results.maximum) as double precision) as maximum
FROM versions,results
WHERE
versions.version_number = results.version_number
GROUP BY
versions.version_number) x;
"""
)
versions = []
average = []
minimum = []
maximum = []
unclean = db_cur.fetchall()
# Create lists for x and y coordinates.
for row in unclean:
versions.append(row[0]['version_number'])
average.append(int(row[0]['average']))
minimum.append(int(row[0]['minimum']))
maximum.append(int(row[0]['maximum']))
grph_average = go.Scatter(
x=versions,
y=average,
name = 'Average',
mode='lines',
)
grph_minimum = go.Scatter(
x=versions,
y=minimum,
name = 'Minimum',
mode='lines',
)
grph_maximum = go.Scatter(
x=versions,
y=maximum,
name = 'Maximum',
mode='lines',
)
data = go.Data([grph_average, grph_minimum, grph_maximum])
# Edit the layout
layout = dict(title = 'Responses',
xaxis = dict(title = 'Versions'),
yaxis = dict(title = 'Ms'),
)
fig = dict(data=data, layout=layout)
py.plot(fig, filename='response-times', auto_open=False)
if __name__ == '__main__':
main()
The data that query returns is as follows, if you want to plug in the values :
versions = ['6.1', '5.0', '5.2']
average = [11232, 29391, 10429]
minimum = [3641, 7729, 3483]
maximum = [57440, 62535, 45201]
Here is some matplotlib that might get you started on this:
import matplotlib.pyplot as plt
versions = ['6.1', '5.0', '5.2']
average = [11232, 29391, 10429]
minimum = [3641, 7729, 3483]
maximum = [57440, 62535, 45201]
plt.plot(minimum)
plt.plot(average)
plt.plot(maximum)
plt.xticks(range(len(versions)), versions)
It looks like it was an issue with my x axis. By adding some text before the version number and specifically type casting to a string I was able to get the graphs to generate properly.
# Create lists for x and y coordinates.
for row in unclean:
versions.append("Version: " + str(row[0]['version_number']))
average.append(int(row[0]['average']))
minimum.append(int(row[0]['minimum']))
maximum.append(int(row[0]['maximum']))

Categories