how to visualize high volume 3 dimensional data - python

I have a data set like the following:
import numpy as np
from pandas import DataFrame
mypos = np.random.randint(10, size=(100, 2))
mydata = DataFrame(mypos, columns=['x', 'y'])
myres = np.random.rand(100, 1)
mydata['res'] = myres
The res variable is continuous; the x and y variables are integers representing
positions (and are therefore largely repetitive),
and res represents a kind of correlation between pairs of positions.
What are the best ways of visualizing this data set?
Possible approaches already considered:
Scatter plot, with the res variable visualized by a color gradient.
Parallel coordinates plot.
The first approach is problematic when the number of positions gets large,
because high values of the res variable (which are the values we care about) would be drowned in a sea of
small dots.
The second approach could be promising, but I am having trouble producing it.
I have tried the parallel_coordinates function from the pandas module,
but it's not behaving as I would like it to. (see this question here:
parallel coordinates plot for continous data in pandas
)

I hope this helps to find a solution in R. Good luck.
# you need this package for the colour palette
library(RColorBrewer)
# create the random data
dd <- data.frame(
x = round(runif(100, 0, 10), 0),
y = round(runif(100, 0, 10), 0),
res = runif(100)
)
# pick the number of colours (granularity of colour scale)
nColors <- 100
# create the colour palette
cols <-colorRampPalette(colors=c("white","blue"))(nColors)
# get a zScale for the colours
zScale <- seq(min(dd$res), max(dd$res), length.out = nColors)
# function that returns the nearest colour given a value of res
findNearestColour <- function(x) {
colorIndex <- which(abs(zScale - x) == min(abs(zScale - x)))
return(cols[colorIndex])
}
# the first plot is the scatterplot
### this has problems because points come out on top of each other
plot(y ~ x, dd, type = "n")
for(i in 1:dim(dd)[1]){
with(dd[i,],
points(y ~ x, col = findNearestColour(res), pch = 19)
)
}
# this is your parallel coordinates plot (a little better)
plot(1, 1, xlim = c(0, 1), ylim = c(min(dd$x, dd$y), max(dd$x, dd$y)),
type = "n", axes = F, ylab = "", xlab = "")
for(i in 1:dim(dd)[1]){
with(dd[i,],
segments(0, x, 1, y, col = findNearestColour(res))
)
}
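Since the question asks for Python, the overplotting problem noted for the scatterplot can also be reduced before plotting: keep only the largest res at each repeated (x, y) position, then order the points so that high values are drawn last and end up on top. Below is a minimal sketch in plain Python; the names strongest_per_position and draw_order are illustrative assumptions, not part of the answer above.

```python
def strongest_per_position(xs, ys, res):
    """Keep only the largest res seen at each repeated (x, y) position."""
    best = {}
    for x, y, r in zip(xs, ys, res):
        key = (x, y)
        if key not in best or r > best[key]:
            best[key] = r
    return best


def draw_order(best):
    """Return (x, y, res) tuples sorted so that high res values come last.

    Plotting in this order puts the interesting points on top.
    """
    return sorted(((x, y, r) for (x, y), r in best.items()), key=lambda t: t[2])


# Tiny made-up sample with repeated positions, for illustration only.
xs = [1, 1, 2, 2, 3]
ys = [5, 5, 7, 7, 9]
res = [0.2, 0.9, 0.4, 0.1, 0.6]
points = draw_order(strongest_per_position(xs, ys, res))
print(points)
```

The ordered columns can then be handed to a scatter call with the colour mapped to res, for example matplotlib's scatter with its c argument.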

Related

Draw longest possible vertical line between two curves in seaborn

I currently have a plot like this (consider that data is the dataframe I pasted at the very bottom):
import seaborn as sns
sns.relplot(
data = data,
x = "Threshold",
y = "Value",
kind = "line",
hue="Metric"
).set(xlabel="Threshold")
Which produces:
Now, I want to know how I can annotate a line in this plot, such that it is located between the curves, at the x-axis value where the distance between the curves is maximized. I would also need to annotate text to show the distance value.
It should be something like this:
Here is the pandas dataframe:
Threshold,Metric,Value
0.0,Recall,1.0
0.010101010101010102,Recall,0.9802536231884058
0.020202020202020204,Recall,0.9706521739130435
0.030303030303030304,Recall,0.9621376811594203
0.04040404040404041,Recall,0.9541666666666667
0.05050505050505051,Recall,0.9456521739130435
0.06060606060606061,Recall,0.9322463768115942
0.07070707070707072,Recall,0.9173913043478261
0.08080808080808081,Recall,0.908695652173913
0.09090909090909091,Recall,0.8976449275362319
0.10101010101010102,Recall,0.8813405797101449
0.11111111111111112,Recall,0.8644927536231884
0.12121212121212122,Recall,0.8498188405797101
0.13131313131313133,Recall,0.8358695652173913
0.14141414141414144,Recall,0.818659420289855
0.15151515151515152,Recall,0.7967391304347826
0.16161616161616163,Recall,0.7748188405797102
0.17171717171717174,Recall,0.7521739130434782
0.18181818181818182,Recall,0.7269927536231884
0.19191919191919193,Recall,0.6952898550724638
0.20202020202020204,Recall,0.6704710144927536
0.21212121212121213,Recall,0.648731884057971
0.22222222222222224,Recall,0.6097826086956522
0.23232323232323235,Recall,0.5847826086956521
0.24242424242424243,Recall,0.5521739130434783
0.25252525252525254,Recall,0.5023550724637681
0.26262626262626265,Recall,0.4766304347826087
0.27272727272727276,Recall,0.42047101449275365
0.2828282828282829,Recall,0.3958333333333333
0.29292929292929293,Recall,0.3539855072463768
0.30303030303030304,Recall,0.3327898550724638
0.31313131313131315,Recall,0.3036231884057971
0.32323232323232326,Recall,0.2798913043478261
0.33333333333333337,Recall,0.2371376811594203
0.3434343434343435,Recall,0.22119565217391304
0.3535353535353536,Recall,0.17300724637681159
0.36363636363636365,Recall,0.15996376811594204
0.37373737373737376,Recall,0.13568840579710145
0.38383838383838387,Recall,0.11938405797101449
0.393939393939394,Recall,0.10652173913043478
0.4040404040404041,Recall,0.09891304347826087
0.4141414141414142,Recall,0.08894927536231884
0.42424242424242425,Recall,0.07681159420289856
0.43434343434343436,Recall,0.06557971014492754
0.4444444444444445,Recall,0.05253623188405797
0.4545454545454546,Recall,0.04655797101449275
0.4646464646464647,Recall,0.024456521739130436
0.4747474747474748,Recall,0.019384057971014494
0.48484848484848486,Recall,0.009782608695652175
0.494949494949495,Recall,0.0034420289855072463
0.5050505050505051,Recall,0.002173913043478261
0.5151515151515152,Recall,0.0016304347826086956
0.5252525252525253,Recall,0.0007246376811594203
0.5353535353535354,Recall,0.00018115942028985507
0.5454545454545455,Recall,0.0
0.5555555555555556,Recall,0.0
0.5656565656565657,Recall,0.0
0.5757575757575758,Recall,0.0
0.5858585858585859,Recall,0.0
0.595959595959596,Recall,0.0
0.6060606060606061,Recall,0.0
0.6161616161616162,Recall,0.0
0.6262626262626263,Recall,0.0
0.6363636363636365,Recall,0.0
0.6464646464646465,Recall,0.0
0.6565656565656566,Recall,0.0
0.6666666666666667,Recall,0.0
0.6767676767676768,Recall,0.0
0.686868686868687,Recall,0.0
0.696969696969697,Recall,0.0
0.7070707070707072,Recall,0.0
0.7171717171717172,Recall,0.0
0.7272727272727273,Recall,0.0
0.7373737373737375,Recall,0.0
0.7474747474747475,Recall,0.0
0.7575757575757577,Recall,0.0
0.7676767676767677,Recall,0.0
0.7777777777777778,Recall,0.0
0.787878787878788,Recall,0.0
0.797979797979798,Recall,0.0
0.8080808080808082,Recall,0.0
0.8181818181818182,Recall,0.0
0.8282828282828284,Recall,0.0
0.8383838383838385,Recall,0.0
0.8484848484848485,Recall,0.0
0.8585858585858587,Recall,0.0
0.8686868686868687,Recall,0.0
0.8787878787878789,Recall,0.0
0.888888888888889,Recall,0.0
0.8989898989898991,Recall,0.0
0.9090909090909092,Recall,0.0
0.9191919191919192,Recall,0.0
0.9292929292929294,Recall,0.0
0.9393939393939394,Recall,0.0
0.9494949494949496,Recall,0.0
0.9595959595959597,Recall,0.0
0.9696969696969697,Recall,0.0
0.9797979797979799,Recall,0.0
0.98989898989899,Recall,0.0
1.0,Recall,0.0
0.0,Fall-out,1.0
0.010101010101010102,Fall-out,0.6990465720990212
0.020202020202020204,Fall-out,0.58461408367334
0.030303030303030304,Fall-out,0.516647992727734
0.04040404040404041,Fall-out,0.4643680104855929
0.05050505050505051,Fall-out,0.4172674037587468
0.06060606060606061,Fall-out,0.3796376551170116
0.07070707070707072,Fall-out,0.3507811343889394
0.08080808080808081,Fall-out,0.33186055852694335
0.09090909090909091,Fall-out,0.3152231359533222
0.10101010101010102,Fall-out,0.29964272879098575
0.11111111111111112,Fall-out,0.2855844238208993
0.12121212121212122,Fall-out,0.27161068008371564
0.13131313131313133,Fall-out,0.25719298987379235
0.14141414141414144,Fall-out,0.24338836860241422
0.15151515151515152,Fall-out,0.2312538316808659
0.16161616161616163,Fall-out,0.22026087140350506
0.17171717171717174,Fall-out,0.2083377375642137
0.18181818181818182,Fall-out,0.19694311143056467
0.19191919191919193,Fall-out,0.18402638310466565
0.20202020202020204,Fall-out,0.17440754286197493
0.21212121212121213,Fall-out,0.16548633279073208
0.22222222222222224,Fall-out,0.15278100754709004
0.23232323232323235,Fall-out,0.14292962391391667
0.24242424242424243,Fall-out,0.1317252605542989
0.25252525252525254,Fall-out,0.11555292476164303
0.26262626262626265,Fall-out,0.10612434729298353
0.27272727272727276,Fall-out,0.08902183793839714
0.2828282828282829,Fall-out,0.08331395471745978
0.29292929292929293,Fall-out,0.07232099444009894
0.30303030303030304,Fall-out,0.06735302200706086
0.31313131313131315,Fall-out,0.061454876012092256
0.32323232323232326,Fall-out,0.05665602604485973
0.33333333333333337,Fall-out,0.048982094158932836
0.3434343434343435,Fall-out,0.045641925459273196
0.3535353535353536,Fall-out,0.03748176648415534
0.36363636363636365,Fall-out,0.0341415977844957
0.37373737373737376,Fall-out,0.029321607509037482
0.38383838383838387,Fall-out,0.026996173604211148
0.393939393939394,Fall-out,0.024353635075999407
0.4040404040404041,Fall-out,0.022514428260364035
0.4141414141414142,Fall-out,0.01940680295118703
0.42424242424242425,Fall-out,0.017165930279263473
0.43434343434343436,Fall-out,0.014459970826374648
0.4444444444444445,Fall-out,0.011035240893812233
0.4545454545454546,Fall-out,0.009386296852208105
0.4646464646464647,Fall-out,0.004756569350781135
0.4747474747474748,Fall-out,0.003868676405301989
0.48484848484848486,Fall-out,0.002135171130795087
0.494949494949495,Fall-out,0.0008033317125763693
0.5050505050505051,Fall-out,0.0004228061645138786
0.5151515151515152,Fall-out,0.00031710462338540896
0.5252525252525253,Fall-out,4.228061645138786e-05
0.5353535353535354,Fall-out,0.0
0.5454545454545455,Fall-out,0.0
0.5555555555555556,Fall-out,0.0
0.5656565656565657,Fall-out,0.0
0.5757575757575758,Fall-out,0.0
0.5858585858585859,Fall-out,0.0
0.595959595959596,Fall-out,0.0
0.6060606060606061,Fall-out,0.0
0.6161616161616162,Fall-out,0.0
0.6262626262626263,Fall-out,0.0
0.6363636363636365,Fall-out,0.0
0.6464646464646465,Fall-out,0.0
0.6565656565656566,Fall-out,0.0
0.6666666666666667,Fall-out,0.0
0.6767676767676768,Fall-out,0.0
0.686868686868687,Fall-out,0.0
0.696969696969697,Fall-out,0.0
0.7070707070707072,Fall-out,0.0
0.7171717171717172,Fall-out,0.0
0.7272727272727273,Fall-out,0.0
0.7373737373737375,Fall-out,0.0
0.7474747474747475,Fall-out,0.0
0.7575757575757577,Fall-out,0.0
0.7676767676767677,Fall-out,0.0
0.7777777777777778,Fall-out,0.0
0.787878787878788,Fall-out,0.0
0.797979797979798,Fall-out,0.0
0.8080808080808082,Fall-out,0.0
0.8181818181818182,Fall-out,0.0
0.8282828282828284,Fall-out,0.0
0.8383838383838385,Fall-out,0.0
0.8484848484848485,Fall-out,0.0
0.8585858585858587,Fall-out,0.0
0.8686868686868687,Fall-out,0.0
0.8787878787878789,Fall-out,0.0
0.888888888888889,Fall-out,0.0
0.8989898989898991,Fall-out,0.0
0.9090909090909092,Fall-out,0.0
0.9191919191919192,Fall-out,0.0
0.9292929292929294,Fall-out,0.0
0.9393939393939394,Fall-out,0.0
0.9494949494949496,Fall-out,0.0
0.9595959595959597,Fall-out,0.0
0.9696969696969697,Fall-out,0.0
0.9797979797979799,Fall-out,0.0
0.98989898989899,Fall-out,0.0
1.0,Fall-out,0.0
Use pivot to transform the data from long to wide
Use idxmax to find the x (Threshold) of the max difference between y1 and y2 (Fall-out and Recall)
Use vlines to plot the vertical line at x from y1 to y2
Use annotate to plot the label at the midpoint of y1 and y2
g = sns.relplot(data=data, x='Threshold', y='Value', hue='Metric', kind='line')
# pivot to wide form
p = data.pivot(index='Threshold', columns='Metric', values='Value')
# find x, y1, and y2 corresponding to max difference
diff = p['Fall-out'].sub(p['Recall']).abs()
x = diff.idxmax()
y1, y2 = p.loc[x]
# plot line and label
ax = g.axes.flat[0]
ax.vlines(x, y1, y2, ls='--')
ax.annotate(f'Dist = {diff.loc[x]:.2f}', ha='left', va='center',
xy=(x, 0.5*(y1+y2)), xycoords='data',
xytext=(5, 0), textcoords='offset pixels')
The easiest way I can think of is to create two separate lists: one with all values where the metric is Recall and another with all values where the metric is Fall-out. This can be done as follows (assuming the dataframe is named df):
import matplotlib.pyplot as plt
ls_metric = df['Metric'].to_list()
ls_value = df['Value'].to_list()
ls_threshold = df['Threshold'].to_list()
ls_value_recall = []
ls_value_fallout = []
ls_threshold_recall = []
ls_threshold_fallout = []
# split the long-form rows into one list per metric
for i, j, k in zip(ls_metric, ls_value, ls_threshold):
    if i == 'Recall':
        ls_value_recall.append(j)
        ls_threshold_recall.append(k)
    elif i == 'Fall-out':
        ls_value_fallout.append(j)
        ls_threshold_fallout.append(k)
# absolute distance between the two curves at each threshold
ls_dist = []
for i, j in zip(ls_value_recall, ls_value_fallout):
    ls_dist.append(abs(i - j))
max_diff = max(ls_dist)
location_of_max_diff = ls_dist.index(max_diff)
value_of_threshold_at_max_diff = ls_threshold_recall[location_of_max_diff]
value_of_recall_at_max_diff = ls_value_recall[location_of_max_diff]
value_of_fallout_at_max_diff = ls_value_fallout[location_of_max_diff]
# vertical segment between the two curves at the threshold of maximum distance
x_values = [value_of_threshold_at_max_diff, value_of_threshold_at_max_diff]
y_values = [value_of_recall_at_max_diff, value_of_fallout_at_max_diff]
plt.plot(x_values, y_values)
Assumptions: the Threshold values are the same, and the same number of readings is present, for both metrics. From a brief look at the data I think this holds; if not, the code should still be easy to modify.
You can add this line to your own figure. As for the label, one option is matplotlib.pyplot.text, which adds a text box but needs some tweaking to get the desired location; another option is to add the value as a legend entry.
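If the two metrics were not sampled at exactly the same thresholds, the list-based approach can be adapted by matching readings on the threshold value instead of on list position. The following is a minimal sketch in plain Python; the function name max_gap and the small rows sample are assumptions made for illustration.

```python
def max_gap(rows, metric_a="Recall", metric_b="Fall-out"):
    """Find the threshold where the two metrics are farthest apart.

    rows is an iterable of (threshold, metric, value) tuples in long form.
    Only thresholds present for both metrics are compared.
    Returns (threshold, value_a, value_b, gap).
    """
    by_metric = {metric_a: {}, metric_b: {}}
    for threshold, metric, value in rows:
        if metric in by_metric:
            by_metric[metric][threshold] = value
    shared = sorted(set(by_metric[metric_a]) & set(by_metric[metric_b]))
    if not shared:
        raise ValueError("the two metrics share no threshold values")
    best = max(shared, key=lambda t: abs(by_metric[metric_a][t] - by_metric[metric_b][t]))
    a, b = by_metric[metric_a][best], by_metric[metric_b][best]
    return best, a, b, abs(a - b)


# Made-up sample: threshold 0.2 exists only for Recall, 0.3 only for Fall-out.
rows = [
    (0.0, "Recall", 1.0), (0.1, "Recall", 0.88), (0.2, "Recall", 0.67),
    (0.0, "Fall-out", 1.0), (0.1, "Fall-out", 0.30), (0.3, "Fall-out", 0.07),
]
print(max_gap(rows))
```

With the question's dataframe, the rows could be produced by df[['Threshold', 'Metric', 'Value']].itertuples(index=False, name=None).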

Altair: Extract and display regression coefficients

This question addresses how to access and display the R2 value using mark_text()
I am interested in accessing and displaying the coefficients. Replacing rSquared with coef yields a flattened array of both the intercept and slope, as described in the documentation.
How can I index into this array to display only one of the values, e.g. the slope? I wondered if the mark_text() step should be preceded by a transform (possibly transform_filter()), or if altair.Text() could be used.
I am aware of other approaches which involve determining this information separately then adding it as an additional layer.
Apologies if this is a very straightforward question. Thanks in advance.
import altair as alt
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.linspace(0, 10)
y = x - 5 + np.random.randn(len(x))
df = pd.DataFrame({'x': x, 'y': y})
chart = alt.Chart(df).mark_point().encode(
x='x',
y='y'
)
line = chart.transform_regression('x', 'y').mark_line()
params = alt.Chart(df).transform_regression(
'x', 'y', params=True
).mark_text(align='left').encode(
x=alt.value(20), # pixels from left
y=alt.value(20), # pixels from top
text='rSquared:N',
# text='coef:N' # flattened array
# text='coef[0]:N' # fails
)
chart + line + params
You can access this using a calculate transform:
params = alt.Chart(df).transform_regression(
'x', 'y', params=True
).transform_calculate(
intercept='datum.coef[0]',
slope='datum.coef[1]',
).mark_text(align='left').encode(
x=alt.value(20), # pixels from left
y=alt.value(20), # pixels from top
text='intercept:N'
)
chart + line + params
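To sanity-check the intercept and slope that the chart displays, the same straight-line fit can be computed independently. This is a minimal sketch of ordinary least squares in plain Python; fit_line is a hypothetical helper name and the sample points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope), the same order in which the answer
    above reads coef[0] and coef[1].
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Points that lie exactly on y = 1 + 2x, for illustration only.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Passing the question's x and y arrays to the same function should give values close to what the text mark shows.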

How to Create a Boxplot / Group Boxplot from [Min ,Q1 ,Q2 ,Q3 ,Max] in Python? [duplicate]

From what I can see, boxplot() method expects a sequence of raw values (numbers) as input, from which it then computes percentiles to draw the boxplot(s).
I would like to have a method by which I could pass in the percentiles and get the corresponding boxplot.
For example:
Assume that I have run several benchmarks and for each benchmark I've measured latencies ( floating point values ). Now additionally, I have precomputed the percentiles for these values.
Hence for each benchmark, I have the 25th, 50th, 75th percentile along with the min and max.
Now given these data, I would like to draw the box plots for the benchmarks.
As of 2020, there is a better method than the one in the accepted answer.
The matplotlib.axes.Axes class provides a bxp method, which can be used to draw the boxes and whiskers based on the percentile values. Raw data is only needed for the outliers, and that is optional.
Example:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
boxes = [
{
'label' : "Male height",
'whislo': 162.6, # Bottom whisker position
'q1' : 170.2, # First quartile (25th percentile)
'med' : 175.7, # Median (50th percentile)
'q3' : 180.4, # Third quartile (75th percentile)
'whishi': 187.8, # Top whisker position
'fliers': [] # Outliers
}
]
ax.bxp(boxes, showfliers=False)
ax.set_ylabel("cm")
plt.savefig("boxplot.png")
plt.close()
This produces the following image:
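For several benchmarks with precomputed summaries, the per-box dictionaries can be generated rather than written by hand. Here is a minimal sketch, assuming each benchmark is stored as a (min, q1, median, q3, max) tuple; the helper name bxp_stats and the latency numbers are hypothetical.

```python
def bxp_stats(label, summary, fliers=()):
    """Build one stats dict for Axes.bxp from (min, q1, median, q3, max)."""
    lo, q1, med, q3, hi = summary
    if not lo <= q1 <= med <= q3 <= hi:
        raise ValueError(f"summary for {label!r} is not in ascending order")
    return {
        "label": label,
        "whislo": lo,   # bottom whisker position (here: the minimum)
        "q1": q1,       # 25th percentile
        "med": med,     # 50th percentile
        "q3": q3,       # 75th percentile
        "whishi": hi,   # top whisker position (here: the maximum)
        "fliers": list(fliers),
    }


# Hypothetical benchmark latencies in milliseconds, for illustration only.
benchmarks = {
    "bench A": (1.2, 2.0, 2.6, 3.1, 5.4),
    "bench B": (0.8, 1.5, 1.9, 2.2, 4.0),
}
boxes = [bxp_stats(name, summary) for name, summary in benchmarks.items()]
print(boxes[0])
```

The resulting list can be passed to ax.bxp(boxes, showfliers=False) in the same way as the hand-written list in the example above.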
To draw the box plot using just the percentile values and the outliers ( if any ) I made a customized_box_plot function that basically modifies attributes in a basic box plot ( generated from a tiny sample data ) to make it fit according to your percentile values.
The customized_box_plot function
def customized_box_plot(percentiles, axes, redraw = True, *args, **kwargs):
    """
    Generates a customized boxplot based on the given percentile values
    """
    n_box = len(percentiles)
    box_plot = axes.boxplot([[-9, -4, 2, 4, 9],]*n_box, *args, **kwargs)
    # Creates len(percentiles) no of box plots
    min_y, max_y = float('inf'), -float('inf')
    for box_no, (q1_start,
                 q2_start,
                 q3_start,
                 q4_start,
                 q4_end,
                 fliers_xy) in enumerate(percentiles):
        # Lower cap
        box_plot['caps'][2*box_no].set_ydata([q1_start, q1_start])
        # xdata is determined by the width of the box plot
        # Lower whiskers
        box_plot['whiskers'][2*box_no].set_ydata([q1_start, q2_start])
        # Higher cap
        box_plot['caps'][2*box_no + 1].set_ydata([q4_end, q4_end])
        # Higher whiskers
        box_plot['whiskers'][2*box_no + 1].set_ydata([q4_start, q4_end])
        # Box
        box_plot['boxes'][box_no].set_ydata([q2_start,
                                             q2_start,
                                             q4_start,
                                             q4_start,
                                             q2_start])
        # Median
        box_plot['medians'][box_no].set_ydata([q3_start, q3_start])
        # Outliers
        if fliers_xy is not None and len(fliers_xy[0]) != 0:
            # If outliers exist
            box_plot['fliers'][box_no].set(xdata = fliers_xy[0],
                                           ydata = fliers_xy[1])
            min_y = min(q1_start, min_y, fliers_xy[1].min())
            max_y = max(q4_end, max_y, fliers_xy[1].max())
        else:
            min_y = min(q1_start, min_y)
            max_y = max(q4_end, max_y)
    # The y axis is rescaled to fit the new box plot completely with 10%
    # of the maximum value at both ends
    axes.set_ylim([min_y*1.1, max_y*1.1])
    # If redraw is set to true, the canvas is updated.
    if redraw:
        axes.figure.canvas.draw()
    return box_plot
USAGE
Using inverse logic ( code at the very end ) I extracted the percentile values from this example
>>> percentiles
(-1.0597368367634488, 0.3977683984966961, 1.0298955252405229, 1.6693981537742526, 3.4951447843464449)
(-0.90494930553559483, 0.36916539612108634, 1.0303658700697103, 1.6874542731392828, 3.4951447843464449)
(0.13744105279440233, 1.3300645202649739, 2.6131540656339483, 4.8763411136047647, 9.5751914834437937)
(0.22786243898199182, 1.4120860286080519, 2.637650402506837, 4.9067126578493259, 9.4660357513550899)
(0.0064696168078617741, 0.30586770128093388, 0.70774153557312702, 1.5241965711101928, 3.3092932063051976)
(0.007009744579241136, 0.28627373934008982, 0.66039691869500572, 1.4772725266672091, 3.221716765477217)
(-2.2621660374110544, 5.1901313713883352, 7.7178532139979357, 11.277744848353247, 20.155971739152388)
(-2.2621660374110544, 5.1884411864079532, 7.3357079047721054, 10.792299385806913, 18.842012119715388)
(2.5417888074435702, 5.885996170695587, 7.7271286220368598, 8.9207423361593179, 10.846938621419374)
(2.5971767318505856, 5.753551925927133, 7.6569980004033464, 8.8161056254143233, 10.846938621419374)
Note that to keep this short I haven't shown the outliers vectors which will be the 6th element of each of the percentile array.
Also note that all usual additional kwargs / args can be used since they are simply passed to the boxplot method inside it :
>>> fig, ax = plt.subplots()
>>> b = customized_box_plot(percentiles, ax, redraw=True, notch=0, sym='+', vert=1, whis=1.5)
>>> plt.show()
EXPLANATION
The boxplot method returns a dictionary mapping the components of the boxplot to the individual matplotlib.lines.Line2D instances that were created.
Quoting from the matplotlib.pyplot.boxplot documentation :
That dictionary has the following keys (assuming vertical boxplots):
boxes: the main body of the boxplot showing the quartiles and the median’s confidence intervals if enabled.
medians: horizontal lines at the median of each box.
whiskers: the vertical lines extending to the most extreme, non-outlier data points.
caps: the horizontal lines at the ends of the whiskers.
fliers: points representing data that extend beyond the whiskers (outliers).
means: points or lines representing the means.
For example observe the boxplot of a tiny sample data of [-9, -4, 2, 4, 9]
>>> b = ax.boxplot([[-9, -4, 2, 4, 9],])
>>> b
{'boxes': [<matplotlib.lines.Line2D at 0x7fe1f5b21350>],
'caps': [<matplotlib.lines.Line2D at 0x7fe1f54d4e50>,
<matplotlib.lines.Line2D at 0x7fe1f54d0e50>],
'fliers': [<matplotlib.lines.Line2D at 0x7fe1f5b317d0>],
'means': [],
'medians': [<matplotlib.lines.Line2D at 0x7fe1f63549d0>],
'whiskers': [<matplotlib.lines.Line2D at 0x7fe1f5b22e10>,
<matplotlib.lines.Line2D at 0x7fe20c54a510>]}
>>> plt.show()
The matplotlib.lines.Line2D objects have two methods that I'll be using in my function extensively. set_xdata ( or set_ydata ) and get_xdata ( or get_ydata ).
Using these methods we can alter the position of the constituent lines of the base box plot to conform to your percentile values ( which is what the customized_box_plot function does ). After altering the constituent lines' position, you can redraw the canvas using figure.canvas.draw()
Summarizing the mappings from percentile to the coordinates of the various Line2D objects.
The Y Coordinates :
The max ( q4_end - end of 4th quartile ) corresponds to the topmost cap Line2D object.
The min ( q1_start - start of the 1st quartile ) corresponds to the lowermost cap Line2D object.
The median corresponds to the ( q3_start ) median Line2D object.
The 2 whiskers lie between the ends of the boxes and extreme caps ( q1_start and q2_start - lower whisker; q4_start and q4_end - upper whisker )
The box is actually an interesting n shaped line bounded by a cap at the lower portion. The extremes of the n shaped line correspond to the q2_start and the q4_start.
The X Coordinates :
The Central x coordinates ( for multiple box plots are usually 1, 2, 3... )
The library automatically calculates the bounding x coordinates based on the width specified.
INVERSE FUNCTION TO RETRIEVE THE PERCENTILES FROM THE boxplot DICT:
def get_percentiles_from_box_plots(bp):
    percentiles = []
    for i in range(len(bp['boxes'])):
        percentiles.append((bp['caps'][2*i].get_ydata()[0],
                            bp['boxes'][i].get_ydata()[0],
                            bp['medians'][i].get_ydata()[0],
                            bp['boxes'][i].get_ydata()[2],
                            bp['caps'][2*i + 1].get_ydata()[0],
                            (bp['fliers'][i].get_xdata(),
                             bp['fliers'][i].get_ydata())))
    return percentiles
NOTE:
The reason why I did not make a completely custom boxplot method is because, there are many features offered by the inbuilt box plot that cannot be fully reproduced.
Also excuse me if I may have unnecessarily explained something that may have been too obvious.
Here is an updated version of this useful routine. Setting the vertices directly appears to work for both filled boxes (patch_artist=True) and unfilled ones.
def customized_box_plot(percentiles, axes, redraw = True, *args, **kwargs):
    """
    Generates a customized boxplot based on the given percentile values
    """
    n_box = len(percentiles)
    box_plot = axes.boxplot([[-9, -4, 2, 4, 9],]*n_box, *args, **kwargs)
    # Creates len(percentiles) no of box plots
    min_y, max_y = float('inf'), -float('inf')
    for box_no, pdata in enumerate(percentiles):
        if len(pdata) == 6:
            (q1_start, q2_start, q3_start, q4_start, q4_end, fliers_xy) = pdata
        elif len(pdata) == 5:
            (q1_start, q2_start, q3_start, q4_start, q4_end) = pdata
            fliers_xy = None
        else:
            raise ValueError("Percentile arrays for customized_box_plot must have either 5 or 6 values")
        # Lower cap
        box_plot['caps'][2*box_no].set_ydata([q1_start, q1_start])
        # xdata is determined by the width of the box plot
        # Lower whiskers
        box_plot['whiskers'][2*box_no].set_ydata([q1_start, q2_start])
        # Higher cap
        box_plot['caps'][2*box_no + 1].set_ydata([q4_end, q4_end])
        # Higher whiskers
        box_plot['whiskers'][2*box_no + 1].set_ydata([q4_start, q4_end])
        # Box
        path = box_plot['boxes'][box_no].get_path()
        path.vertices[0][1] = q2_start
        path.vertices[1][1] = q2_start
        path.vertices[2][1] = q4_start
        path.vertices[3][1] = q4_start
        path.vertices[4][1] = q2_start
        # Median
        box_plot['medians'][box_no].set_ydata([q3_start, q3_start])
        # Outliers
        if fliers_xy is not None and len(fliers_xy[0]) != 0:
            # If outliers exist
            box_plot['fliers'][box_no].set(xdata = fliers_xy[0],
                                           ydata = fliers_xy[1])
            min_y = min(q1_start, min_y, fliers_xy[1].min())
            max_y = max(q4_end, max_y, fliers_xy[1].max())
        else:
            min_y = min(q1_start, min_y)
            max_y = max(q4_end, max_y)
    # The y axis is rescaled to fit the new box plot completely with 10%
    # of the maximum value at both ends
    axes.set_ylim([min_y*1.1, max_y*1.1])
    # If redraw is set to true, the canvas is updated.
    if redraw:
        axes.figure.canvas.draw()
    return box_plot
Here is a bottom-up approach where the box plot is built up using matplotlib's vlines, Rectangle, and normal plot functions.
def boxplot(df, ax=None, box_width=0.2, whisker_size=20, mean_size=10, median_size = 10 , line_width=1.5, xoffset=0,
            color=0):
    """Plots a boxplot from existing percentiles.
    Parameters
    ----------
    df: pandas DataFrame
    ax: pandas AxesSubplot
        if to plot on an existing axes
    box_width: float
    whisker_size: float
        size of the bar at the end of each whisker
    mean_size: float
        size of the mean symbol
    color: int or rgb(list)
        If int particular color of property cycler is taken. Example of rgb: [1,0,0] (red)
    Returns
    -------
    f, a, boxes, vlines, whisker_tips, mean, median
    """
    if type(color) == int:
        color = plt.rcParams['axes.prop_cycle'].by_key()['color'][color]
    if ax:
        a = ax
        f = a.get_figure()
    else:
        f, a = plt.subplots()
    boxes = []
    vlines = []
    xn = []
    for row in df.iterrows():
        x = row[0] + xoffset
        xn.append(x)
        # box
        y = row[1][25]
        height = row[1][75] - row[1][25]
        box = plt.Rectangle((x - box_width / 2, y), box_width, height)
        a.add_patch(box)
        boxes.append(box)
        # whiskers
        y = (row[1][95] + row[1][5]) / 2
        vl = a.vlines(x, row[1][5], row[1][95])
        vlines.append(vl)
    for b in boxes:
        b.set_linewidth(line_width)
        b.set_facecolor([1, 1, 1, 1])
        b.set_edgecolor(color)
        b.set_zorder(2)
    for vl in vlines:
        vl.set_color(color)
        vl.set_linewidth(line_width)
        vl.set_zorder(1)
    whisker_tips = []
    if whisker_size:
        g, = a.plot(xn, df[5], ls='')
        whisker_tips.append(g)
        g, = a.plot(xn, df[95], ls='')
        whisker_tips.append(g)
    for wt in whisker_tips:
        wt.set_markeredgewidth(line_width)
        wt.set_color(color)
        wt.set_markersize(whisker_size)
        wt.set_marker('_')
    mean = None
    if mean_size:
        g, = a.plot(xn, df['mean'], ls='')
        g.set_marker('o')
        g.set_markersize(mean_size)
        g.set_zorder(20)
        g.set_markerfacecolor('None')
        g.set_markeredgewidth(line_width)
        g.set_markeredgecolor(color)
        mean = g
    median = None
    if median_size:
        g, = a.plot(xn, df['median'], ls='')
        g.set_marker('_')
        g.set_markersize(median_size)
        g.set_zorder(20)
        g.set_markeredgewidth(line_width)
        g.set_markeredgecolor(color)
        median = g
    a.set_ylim(np.nanmin(df), np.nanmax(df))
    return f, a, boxes, vlines, whisker_tips, mean, median
This is how it looks in action:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
nopts = 12
df = pd.DataFrame()
df['mean'] = np.random.random(nopts) + 7
df['median'] = np.random.random(nopts) + 7
df[5] = np.random.random(nopts) + 4
df[25] = np.random.random(nopts) + 6
df[75] = np.random.random(nopts) + 8
df[95] = np.random.random(nopts) + 10
out = boxplot(df)

how to isolate data that are 2 and 3 sigma deviated from mean and then mark them in a plot in python?

I am reading from a dataset which looks like the following when plotted in matplotlib, with the best-fit curve obtained using linear regression.
A sample of the data looks like the following:
# ID X Y px py pz M R
1.04826492772e-05 1.04828050287e-05 1.048233088e-05 0.000107002791008 0.000106552433081 0.000108704469007 387.02 4.81947797625e+13
1.87380963036e-05 1.87370588085e-05 1.87372620448e-05 0.000121616280029 0.000151924707761 0.00012371156585 428.77 6.54636174067e+13
3.95579877816e-05 3.95603773653e-05 3.95610756809e-05 0.000163470663023 0.000265203868883 0.000228031803626 470.74 8.66961875758e+13
My code looks like the following:
# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = sp.stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p
# plotting z
xz, yz = M, Y_z # data, non-transformed
y_pred, _ = regress(xz, np.log(yz)) # change here # transformed input
plt.semilogy(xz, yz, marker='o',color ='b', markersize=4,linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label = 'best fit') # transformed output
However, I can see a lot of upward scatter in the data, and the best-fit curve is affected by it. So first I want to isolate the data points that are 2 and 3 sigma away from the mean of my data and mark them with a circle around them.
Then I want to compute the best-fit curve using only the points that fall within 1 sigma of the mean.
Is there a good function in Python that can do that for me?
In addition, can I also isolate those rows from my actual dataset? For example, if the third row in the sample input represents a 2 sigma deviation, can I get that row as output too, to save and investigate later?
Your help is most appreciated.
Here's some code that goes through the data in a given number of windows, calculates statistics in said windows, and separates data in well- and misbehaved lists.
Hope this helps.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
num_data = 10000
fake_data_x = np.sort(12.8+np.random.random(num_data))
fake_data_y = np.exp(fake_data_x) + np.random.normal(0,scale=50000,size=num_data)
# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p
# plotting z
xz, yz = fake_data_x, fake_data_y # data, non-transformed
y_pred, _ = regress(xz, np.log(yz)) # change here # transformed input
plt.figure()
plt.semilogy(xz, yz, marker='o',color ='b', markersize=4,linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label = 'best fit') # transformed output
plt.show()
num_bin_intervals = 10 # approx number of data points per averaging window
window_boundaries = np.linspace(min(fake_data_x),max(fake_data_x),int(len(fake_data_x)/num_bin_intervals)) # window boundaries
y_good = [] # list to collect the "well-behaved" y-axis data
x_good = [] # list to collect the "well-behaved" x-axis data
y_outlier = []
x_outlier = []
for i in range(len(window_boundaries)-1):
    # create a boolean mask to select the data within the averaging window
    window_indices = (fake_data_x<=window_boundaries[i+1]) & (fake_data_x>window_boundaries[i])
    # separate the pieces of data in the window
    fake_data_x_slice = fake_data_x[window_indices]
    fake_data_y_slice = fake_data_y[window_indices]
    # calculate the mean and standard deviation of the y values in the window
    y_mean = np.mean(fake_data_y_slice)
    y_std = np.std(fake_data_y_slice)
    # choose and select the outliers (2 sigma or more away from the window mean)
    y_outliers = fake_data_y_slice[np.abs(fake_data_y_slice-y_mean)>=2*y_std]
    x_outliers = fake_data_x_slice[np.abs(fake_data_y_slice-y_mean)>=2*y_std]
    # choose and select the good ones
    y_goodies = fake_data_y_slice[np.abs(fake_data_y_slice-y_mean)<2*y_std]
    x_goodies = fake_data_x_slice[np.abs(fake_data_y_slice-y_mean)<2*y_std]
    # extend the lists with all the good and the bad
    y_good.extend(list(y_goodies))
    y_outlier.extend(list(y_outliers))
    x_good.extend(list(x_goodies))
    x_outlier.extend(list(x_outliers))
plt.figure()
plt.semilogy(x_good,y_good,'o')
plt.semilogy(x_outlier,y_outlier,'r*')
plt.show()

How to generate a fractal graph of a market in python

I wish to generate this in python:
http://classes.yale.edu/fractals/RandFrac/Market/TradingTime/Example1/Example1.html
but I'm incredibly stuck and new to this concept. Does anybody know of a library or gist for this?
Edit:
From what I understand, you need to split the fractal in two every time, so you have to calculate the y-axis point from the line between the two middle points. Then the two sections need to be formed according to the fractal?
I'm not 100% sure what you are asking, but as I understand from your comments, you want to generate a realistic-looking stock market curve using the recursion described in the link.
As far as I understood the description in the linked page and some of the parent pages, it works like this:
You are given a start and an end point and a number of turning points in the form (t_1, v_1), (t_2, v_2), etc., for example start=(0,0), end=(1,1), turns = [(1/4, 1/2), (3/4, 1/4)], where t_i and v_i are fractions between 0 and 1.
You determine the actual turning points scaled to that interval between start and end and calculate the differences between those points, i.e. how far to go from p_i to reach p_(i+1).
You shuffle those segments to introduce some randomness; when put together, they still cover exactly the same distance, i.e. they connect the original start and end point.
Repeat by recursively calling the function for the different segments between the new points.
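As a quick check of steps 2 and 3, here is a minimal sketch using the example values above. The helper name segments is hypothetical, and the sketch assumes that every segment, including the final one to the end point, takes part in the reordering. It uses exact fractions so the sums can be compared without rounding error.

```python
from fractions import Fraction as F
from itertools import permutations

def segments(start, end, turns):
    """Scale the fractional turning points into [start, end] and return the
    (dt, dv) steps between consecutive points, including the step to end."""
    (t0, v0), (t1, v1) = start, end
    points = [start]
    points += [(t0 + (t1 - t0) * t, v0 + (v1 - v0) * v) for t, v in turns]
    points.append(end)
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(points, points[1:])]

start, end = (F(0), F(0)), (F(1), F(1))
turns = [(F(1, 4), F(1, 2)), (F(3, 4), F(1, 4))]
segs = segments(start, end, turns)

# Whatever order the segments are put in, they add up to the same
# displacement, so the reordered path still connects start and end.
for order in permutations(segs):
    total = (sum(s[0] for s in order), sum(s[1] for s in order))
    assert total == (end[0] - start[0], end[1] - start[1])
```

For these values the three steps are (1/4, 1/2), (1/2, -1/4) and (1/4, 3/4), which sum to (1, 1) in any order.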
Here's some Python code I just put together:
from __future__ import division
from random import shuffle

def make_graph(depth, graph, start, end, turns):
    # add points to graph
    graph.add(start)
    graph.add(end)
    if depth > 0:
        # unpack input values
        fromtime, fromvalue = start
        totime, tovalue = end
        # calculate differences between points
        diffs = []
        last_time, last_val = fromtime, fromvalue
        for t, v in turns:
            new_time = fromtime + (totime - fromtime) * t
            new_val = fromvalue + (tovalue - fromvalue) * v
            diffs.append((new_time - last_time, new_val - last_val))
            last_time, last_val = new_time, new_val
        # add 'brownian motion' by reordering the segments
        shuffle(diffs)
        # calculate actual intermediate points and recurse
        last = start
        for segment in diffs:
            p = last[0] + segment[0], last[1] + segment[1]
            make_graph(depth - 1, graph, last, p, turns)
            last = p
        make_graph(depth - 1, graph, last, end, turns)

from matplotlib import pyplot
depth = 8
graph = set()
make_graph(depth, graph, (0, 0), (1, 1), [(1/9, 2/3), (5/9, 1/3)])
pyplot.plot(*zip(*sorted(graph)))
pyplot.show()
And here is some example output:
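Because shuffle draws from the global random state, each run of the code above produces a different curve. If a repeatable curve is wanted, one option is to pass in a seeded random.Random instance. The following is only a sketch under that assumption: fractal_points and rng are hypothetical names, and unlike the code above this variant also reorders the final segment and returns a list that is already ordered by time instead of filling a set.

```python
import random

def fractal_points(depth, start, end, turns, rng):
    """Return the (time, value) points of the shuffled fractal, ordered by time."""
    if depth == 0:
        return [start, end]
    (t0, v0), (t1, v1) = start, end
    # turning points scaled into [start, end], followed by the end point
    scaled = [(t0 + (t1 - t0) * t, v0 + (v1 - v0) * v) for t, v in turns] + [end]
    # steps between consecutive points
    steps, prev = [], start
    for p in scaled:
        steps.append((p[0] - prev[0], p[1] - prev[1]))
        prev = p
    # reorder the steps with the supplied generator
    rng.shuffle(steps)
    points, last = [start], start
    for i, (dt, dv) in enumerate(steps):
        # use the exact end point for the last step to avoid rounding drift
        nxt = end if i == len(steps) - 1 else (last[0] + dt, last[1] + dv)
        # drop the first point of each sub-curve, it is already in the list
        points.extend(fractal_points(depth - 1, last, nxt, turns, rng)[1:])
        last = nxt
    return points

turns = [(1 / 9, 2 / 3), (5 / 9, 1 / 3)]
curve_a = fractal_points(4, (0, 0), (1, 1), turns, random.Random(42))
curve_b = fractal_points(4, (0, 0), (1, 1), turns, random.Random(42))
```

With the same seed, curve_a and curve_b are identical, and the result can be plotted with pyplot.plot(*zip(*curve_a)).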
I had a similar interest and developed a Python 3 library to do just what you want.
pip install fractalmarkets
See https://github.com/hyperstripe50/fractal-market-analysis/blob/master/README.md
Using @tobias_k's solution and pandas, we can translate and scale the normalized fractal to a time-based one.
import arrow
import pandas as pd
# make_graph is the function from @tobias_k's answer
depth = 5
# the "geometry" of fractal
turns = [
    (1 / 9, 0.60),
    (5 / 9, 0.30),
    (8 / 9, 0.70),
]
# select start / end time
t0 = arrow.now().floor("hours")
t1 = t0.shift(days=5)
start = (pd.to_datetime(t0.datetime), 1000)
end = (pd.to_datetime(t1.datetime), 2000)
# create a normalized fractal running from (0, 0) to (1, 1)
_start, _end = (0, 0), (1, 1)
graph = set()
make_graph(depth, graph, _start, _end, turns)
# just check graph length
assert len(graph) == (len(turns) + 1) ** depth + 1
# create a pandas dataframe from the normalized Fractal
df = pd.DataFrame(graph)
df.sort_values(0, inplace=True)
df.reset_index(drop=True, inplace=True)
# translate to real coordinates
X = pd.DataFrame(
    data=[(start[0].timestamp(), start[1]), (end[0].timestamp(), end[1])]
).T
delta = X[1] - X[0]
Y = df.mul(delta) + X[0]
Y[0] = pd.to_datetime(Y[0], unit="s")
# now resample and interpolate data according to *grid* size
grid = "min"
Z = Y.set_index(0)
A = Z.resample(grid).mean().interpolate()
# plot both graphs to check for errors
import matplotlib.pyplot as plt
ax = Z.plot()
A.plot(ax=ax)
plt.show()
showing both graphs:
and zooming to see interpolation and snap-to-grid differences:
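The translate-and-scale step can also be written without arrow. The following is a minimal sketch, assuming the fractal is available as (t, v) pairs in the unit square. The helper name rescale and the sample points are hypothetical and stand in for the output of make_graph.

```python
import pandas as pd

def rescale(points, start, end):
    """Map (t, v) points from the unit square onto the segment start -> end.

    start and end are (pandas.Timestamp, number) pairs. rescale is a
    hypothetical helper, not part of pandas.
    """
    (t0, v0), (t1, v1) = start, end
    df = pd.DataFrame(sorted(points), columns=["t", "v"])
    # fraction of the time span, converted to an offset from t0
    span_seconds = (t1 - t0).total_seconds()
    times = t0 + pd.to_timedelta(df["t"] * span_seconds, unit="s")
    # fraction of the value range, shifted to start at v0
    values = v0 + (v1 - v0) * df["v"]
    return pd.Series(values.to_numpy(), index=pd.DatetimeIndex(times), name="value")

# stand-in for the normalized fractal points
points = [(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)]
start = (pd.Timestamp("2024-01-01"), 1000)
end = (pd.Timestamp("2024-01-02"), 2000)

series = rescale(points, start, end)
# snap to a one-minute grid and fill the gaps by linear interpolation
per_minute = series.resample("min").mean().interpolate()
```

Here series holds the original points on a time axis and per_minute is the regular-grid version, which correspond to Z and A in the code above.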
