I currently have a plot like this (consider that data is the dataframe I pasted at the very bottom):
import seaborn as sns
sns.relplot(
data = data,
x = "Threshold",
y = "Value",
kind = "line",
hue="Metric"
).set(xlabel="Threshold")
Which produces:
Now, I want to know how I can annotate a line in this plot so that it sits between the curves, at the x-axis value where the distance between the curves is maximized. I would also need to annotate text showing the distance value.
It should be something like this:
Here is the pandas dataframe:
Threshold,Metric,Value
0.0,Recall,1.0
0.010101010101010102,Recall,0.9802536231884058
0.020202020202020204,Recall,0.9706521739130435
0.030303030303030304,Recall,0.9621376811594203
0.04040404040404041,Recall,0.9541666666666667
0.05050505050505051,Recall,0.9456521739130435
0.06060606060606061,Recall,0.9322463768115942
0.07070707070707072,Recall,0.9173913043478261
0.08080808080808081,Recall,0.908695652173913
0.09090909090909091,Recall,0.8976449275362319
0.10101010101010102,Recall,0.8813405797101449
0.11111111111111112,Recall,0.8644927536231884
0.12121212121212122,Recall,0.8498188405797101
0.13131313131313133,Recall,0.8358695652173913
0.14141414141414144,Recall,0.818659420289855
0.15151515151515152,Recall,0.7967391304347826
0.16161616161616163,Recall,0.7748188405797102
0.17171717171717174,Recall,0.7521739130434782
0.18181818181818182,Recall,0.7269927536231884
0.19191919191919193,Recall,0.6952898550724638
0.20202020202020204,Recall,0.6704710144927536
0.21212121212121213,Recall,0.648731884057971
0.22222222222222224,Recall,0.6097826086956522
0.23232323232323235,Recall,0.5847826086956521
0.24242424242424243,Recall,0.5521739130434783
0.25252525252525254,Recall,0.5023550724637681
0.26262626262626265,Recall,0.4766304347826087
0.27272727272727276,Recall,0.42047101449275365
0.2828282828282829,Recall,0.3958333333333333
0.29292929292929293,Recall,0.3539855072463768
0.30303030303030304,Recall,0.3327898550724638
0.31313131313131315,Recall,0.3036231884057971
0.32323232323232326,Recall,0.2798913043478261
0.33333333333333337,Recall,0.2371376811594203
0.3434343434343435,Recall,0.22119565217391304
0.3535353535353536,Recall,0.17300724637681159
0.36363636363636365,Recall,0.15996376811594204
0.37373737373737376,Recall,0.13568840579710145
0.38383838383838387,Recall,0.11938405797101449
0.393939393939394,Recall,0.10652173913043478
0.4040404040404041,Recall,0.09891304347826087
0.4141414141414142,Recall,0.08894927536231884
0.42424242424242425,Recall,0.07681159420289856
0.43434343434343436,Recall,0.06557971014492754
0.4444444444444445,Recall,0.05253623188405797
0.4545454545454546,Recall,0.04655797101449275
0.4646464646464647,Recall,0.024456521739130436
0.4747474747474748,Recall,0.019384057971014494
0.48484848484848486,Recall,0.009782608695652175
0.494949494949495,Recall,0.0034420289855072463
0.5050505050505051,Recall,0.002173913043478261
0.5151515151515152,Recall,0.0016304347826086956
0.5252525252525253,Recall,0.0007246376811594203
0.5353535353535354,Recall,0.00018115942028985507
0.5454545454545455,Recall,0.0
0.5555555555555556,Recall,0.0
0.5656565656565657,Recall,0.0
0.5757575757575758,Recall,0.0
0.5858585858585859,Recall,0.0
0.595959595959596,Recall,0.0
0.6060606060606061,Recall,0.0
0.6161616161616162,Recall,0.0
0.6262626262626263,Recall,0.0
0.6363636363636365,Recall,0.0
0.6464646464646465,Recall,0.0
0.6565656565656566,Recall,0.0
0.6666666666666667,Recall,0.0
0.6767676767676768,Recall,0.0
0.686868686868687,Recall,0.0
0.696969696969697,Recall,0.0
0.7070707070707072,Recall,0.0
0.7171717171717172,Recall,0.0
0.7272727272727273,Recall,0.0
0.7373737373737375,Recall,0.0
0.7474747474747475,Recall,0.0
0.7575757575757577,Recall,0.0
0.7676767676767677,Recall,0.0
0.7777777777777778,Recall,0.0
0.787878787878788,Recall,0.0
0.797979797979798,Recall,0.0
0.8080808080808082,Recall,0.0
0.8181818181818182,Recall,0.0
0.8282828282828284,Recall,0.0
0.8383838383838385,Recall,0.0
0.8484848484848485,Recall,0.0
0.8585858585858587,Recall,0.0
0.8686868686868687,Recall,0.0
0.8787878787878789,Recall,0.0
0.888888888888889,Recall,0.0
0.8989898989898991,Recall,0.0
0.9090909090909092,Recall,0.0
0.9191919191919192,Recall,0.0
0.9292929292929294,Recall,0.0
0.9393939393939394,Recall,0.0
0.9494949494949496,Recall,0.0
0.9595959595959597,Recall,0.0
0.9696969696969697,Recall,0.0
0.9797979797979799,Recall,0.0
0.98989898989899,Recall,0.0
1.0,Recall,0.0
0.0,Fall-out,1.0
0.010101010101010102,Fall-out,0.6990465720990212
0.020202020202020204,Fall-out,0.58461408367334
0.030303030303030304,Fall-out,0.516647992727734
0.04040404040404041,Fall-out,0.4643680104855929
0.05050505050505051,Fall-out,0.4172674037587468
0.06060606060606061,Fall-out,0.3796376551170116
0.07070707070707072,Fall-out,0.3507811343889394
0.08080808080808081,Fall-out,0.33186055852694335
0.09090909090909091,Fall-out,0.3152231359533222
0.10101010101010102,Fall-out,0.29964272879098575
0.11111111111111112,Fall-out,0.2855844238208993
0.12121212121212122,Fall-out,0.27161068008371564
0.13131313131313133,Fall-out,0.25719298987379235
0.14141414141414144,Fall-out,0.24338836860241422
0.15151515151515152,Fall-out,0.2312538316808659
0.16161616161616163,Fall-out,0.22026087140350506
0.17171717171717174,Fall-out,0.2083377375642137
0.18181818181818182,Fall-out,0.19694311143056467
0.19191919191919193,Fall-out,0.18402638310466565
0.20202020202020204,Fall-out,0.17440754286197493
0.21212121212121213,Fall-out,0.16548633279073208
0.22222222222222224,Fall-out,0.15278100754709004
0.23232323232323235,Fall-out,0.14292962391391667
0.24242424242424243,Fall-out,0.1317252605542989
0.25252525252525254,Fall-out,0.11555292476164303
0.26262626262626265,Fall-out,0.10612434729298353
0.27272727272727276,Fall-out,0.08902183793839714
0.2828282828282829,Fall-out,0.08331395471745978
0.29292929292929293,Fall-out,0.07232099444009894
0.30303030303030304,Fall-out,0.06735302200706086
0.31313131313131315,Fall-out,0.061454876012092256
0.32323232323232326,Fall-out,0.05665602604485973
0.33333333333333337,Fall-out,0.048982094158932836
0.3434343434343435,Fall-out,0.045641925459273196
0.3535353535353536,Fall-out,0.03748176648415534
0.36363636363636365,Fall-out,0.0341415977844957
0.37373737373737376,Fall-out,0.029321607509037482
0.38383838383838387,Fall-out,0.026996173604211148
0.393939393939394,Fall-out,0.024353635075999407
0.4040404040404041,Fall-out,0.022514428260364035
0.4141414141414142,Fall-out,0.01940680295118703
0.42424242424242425,Fall-out,0.017165930279263473
0.43434343434343436,Fall-out,0.014459970826374648
0.4444444444444445,Fall-out,0.011035240893812233
0.4545454545454546,Fall-out,0.009386296852208105
0.4646464646464647,Fall-out,0.004756569350781135
0.4747474747474748,Fall-out,0.003868676405301989
0.48484848484848486,Fall-out,0.002135171130795087
0.494949494949495,Fall-out,0.0008033317125763693
0.5050505050505051,Fall-out,0.0004228061645138786
0.5151515151515152,Fall-out,0.00031710462338540896
0.5252525252525253,Fall-out,4.228061645138786e-05
0.5353535353535354,Fall-out,0.0
0.5454545454545455,Fall-out,0.0
0.5555555555555556,Fall-out,0.0
0.5656565656565657,Fall-out,0.0
0.5757575757575758,Fall-out,0.0
0.5858585858585859,Fall-out,0.0
0.595959595959596,Fall-out,0.0
0.6060606060606061,Fall-out,0.0
0.6161616161616162,Fall-out,0.0
0.6262626262626263,Fall-out,0.0
0.6363636363636365,Fall-out,0.0
0.6464646464646465,Fall-out,0.0
0.6565656565656566,Fall-out,0.0
0.6666666666666667,Fall-out,0.0
0.6767676767676768,Fall-out,0.0
0.686868686868687,Fall-out,0.0
0.696969696969697,Fall-out,0.0
0.7070707070707072,Fall-out,0.0
0.7171717171717172,Fall-out,0.0
0.7272727272727273,Fall-out,0.0
0.7373737373737375,Fall-out,0.0
0.7474747474747475,Fall-out,0.0
0.7575757575757577,Fall-out,0.0
0.7676767676767677,Fall-out,0.0
0.7777777777777778,Fall-out,0.0
0.787878787878788,Fall-out,0.0
0.797979797979798,Fall-out,0.0
0.8080808080808082,Fall-out,0.0
0.8181818181818182,Fall-out,0.0
0.8282828282828284,Fall-out,0.0
0.8383838383838385,Fall-out,0.0
0.8484848484848485,Fall-out,0.0
0.8585858585858587,Fall-out,0.0
0.8686868686868687,Fall-out,0.0
0.8787878787878789,Fall-out,0.0
0.888888888888889,Fall-out,0.0
0.8989898989898991,Fall-out,0.0
0.9090909090909092,Fall-out,0.0
0.9191919191919192,Fall-out,0.0
0.9292929292929294,Fall-out,0.0
0.9393939393939394,Fall-out,0.0
0.9494949494949496,Fall-out,0.0
0.9595959595959597,Fall-out,0.0
0.9696969696969697,Fall-out,0.0
0.9797979797979799,Fall-out,0.0
0.98989898989899,Fall-out,0.0
1.0,Fall-out,0.0
Use pivot to transform the data from long to wide
Use idxmax to find the x (Threshold) of the max difference between y1 and y2 (Fall-out and Recall)
Use vlines to plot the vertical line at x from y1 to y2
Use annotate to plot the label at the midpoint of y1 and y2
g = sns.relplot(data=data, x='Threshold', y='Value', hue='Metric', kind='line')
# pivot to wide form
p = data.pivot(index='Threshold', columns='Metric', values='Value')
# find x, y1, and y2 corresponding to max difference
diff = p['Fall-out'].sub(p['Recall']).abs()
x = diff.idxmax()
y1, y2 = p.loc[x]
# plot line and label
ax = g.axes.flat[0]
ax.vlines(x, y1, y2, ls='--')
ax.annotate(f'Dist = {diff.loc[x]:.2f}', ha='left', va='center',
xy=(x, 0.5*(y1+y2)), xycoords='data',
xytext=(5, 0), textcoords='offset pixels')
The easiest way I can think of is to build two separate lists: one with all values where the metric is Recall and another with all values where the metric is Fall-out. This is easy to do as follows (assuming the dataframe is named df):
import math
import matplotlib.pyplot as plt
ls_metric = df['Metric'].to_list()
ls_value = df['Value'].to_list()
ls_threshold = df['Threshold'].to_list()
ls_value_recall = []
ls_value_fallout = []
ls_threshold_recall = []
ls_threshold_fallout = []
for i, j, k in zip(ls_metric, ls_value, ls_threshold):
    if i == 'Recall':
        ls_value_recall.append(j)
        ls_threshold_recall.append(k)
    elif i == 'Fall-out':
        ls_value_fallout.append(j)
        ls_threshold_fallout.append(k)
ls_dist = []
for i, j in zip(ls_value_recall, ls_value_fallout):
    # math.fabs returns the absolute difference as a float
    ls_dist.append(math.fabs(i - j))
max_diff = max(ls_dist)
location_of_max_diff = ls_dist.index(max_diff)
value_of_threshold_at_max_diff = ls_threshold_recall[location_of_max_diff]
value_of_recall_at_max_diff = ls_value_recall[location_of_max_diff]
value_of_fallout_at_max_diff = ls_value_fallout[location_of_max_diff]
x_values = [value_of_threshold_at_max_diff, value_of_threshold_at_max_diff]
y_values = [value_of_recall_at_max_diff, value_of_fallout_at_max_diff]
plt.plot(x_values, y_values)
Assumptions: the threshold values are identical, and the same number of readings is present, for both metrics. That looks true from a brief glance at the data; if it is not, the code is still fairly easy to modify.
You can add this line to your own figure. As for the label, one option is matplotlib.pyplot.text, which adds a text box but requires some tweaking to get the position right; another is to show the value as a legend entry instead.
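One way to relax the assumption above is to key both series by threshold and compare only the thresholds that appear for both metrics. The following is a minimal plain-Python sketch; `max_gap` and its argument layout are hypothetical names chosen for illustration, not part of pandas or seaborn.

```python
def max_gap(rows, metric_a="Recall", metric_b="Fall-out"):
    """Return (threshold, value_a, value_b, gap) at the largest |a - b|.

    rows is an iterable of (threshold, metric, value) tuples, for example
    df[['Threshold', 'Metric', 'Value']].itertuples(index=False).
    """
    by_metric = {metric_a: {}, metric_b: {}}
    for threshold, metric, value in rows:
        if metric in by_metric:
            by_metric[metric][threshold] = value
    # Only thresholds measured for both metrics can be compared.
    shared = sorted(by_metric[metric_a].keys() & by_metric[metric_b].keys())
    if not shared:
        raise ValueError("no threshold is present for both metrics")
    best = max(
        shared,
        key=lambda t: abs(by_metric[metric_a][t] - by_metric[metric_b][t]),
    )
    a, b = by_metric[metric_a][best], by_metric[metric_b][best]
    return best, a, b, abs(a - b)


# Toy data: threshold 0.2 and 0.3 each exist for one metric only.
rows = [
    (0.0, "Recall", 1.0), (0.1, "Recall", 0.88), (0.2, "Recall", 0.67),
    (0.0, "Fall-out", 1.0), (0.1, "Fall-out", 0.30), (0.3, "Fall-out", 0.07),
]
x, y1, y2, gap = max_gap(rows)
```

The returned x, y1 and y2 are the two endpoints of the vertical line, and gap is the number to show in the label.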
I have a data frame like the below:
Every row represents a person. Each person stays at 3 different locations for the times given in the dataframe. The first few people don't stay at location1; they are "born" at location2. The rest of them stay at all 3 locations.
I would like to animate every person at the given X, Y coordinates given on the data frame and represent them as dots or any other shape. Here is the flow:
Every person should appear at the first given location (location1) at the given time. Their color should be blue at this state.
Stay at location1 until location2_time and then appear at location2. Their color should be red at this state.
Stay at location2 until location3_time and then appear at location3. Their color should be red at this state.
Stay at location3 for 3 seconds and disappear forever.
There can be several people on the visual at the same time. How can I do that?
There are some good answers on the below links. However, on these solutions, points don't disappear.
How can i make points of a python plot appear over time?
How to animate a scatter plot?
The following is an implementation with python-ffmpeg, pandas, matplotlib, and seaborn. You can find output video on my YouTube channel (link is unlisted).
Each frame with figures is saved directly to memory. New figures are generated only when the state of the population changes (person appears/moves/disappears).
You should definitely separate this code into smaller chunks if you are using this in a Python package:
from numpy.random import RandomState, SeedSequence
from numpy.random import MT19937
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import ffmpeg
RESOLUTION = (12.8, 7.2) # * 100 pixels
NUMBER_OF_FRAMES = 900
class VideoWriter:
    # Courtesy of https://github.com/kylemcdonald/python-utils/blob/master/ffmpeg.py
    def __init__(
        self,
        filename,
        video_codec="libx265",
        fps=15,
        in_pix_fmt="rgb24",
        out_pix_fmt="yuv420p",
        input_args=None,
        output_args=None,
    ):
        self.filename = filename
        self.process = None
        self.input_args = {} if input_args is None else input_args
        self.output_args = {} if output_args is None else output_args
        self.input_args["r"] = self.input_args["framerate"] = fps
        self.input_args["pix_fmt"] = in_pix_fmt
        self.output_args["pix_fmt"] = out_pix_fmt
        self.output_args["vcodec"] = video_codec

    def add(self, frame):
        # Start ffmpeg lazily, once the frame size is known
        if self.process is None:
            height, width = frame.shape[:2]
            self.process = (
                ffmpeg.input(
                    "pipe:",
                    format="rawvideo",
                    s="{}x{}".format(width, height),
                    **self.input_args,
                )
                .filter("crop", "iw-mod(iw,2)", "ih-mod(ih,2)")
                .output(self.filename, **self.output_args)
                .global_args("-loglevel", "quiet")
                .overwrite_output()
                .run_async(pipe_stdin=True)
            )
        conv = frame.astype(np.uint8).tobytes()
        self.process.stdin.write(conv)

    def close(self):
        if self.process is None:
            return
        self.process.stdin.close()
        self.process.wait()


def figure_to_array(figure):
    """adapted from: https://stackoverflow.com/questions/21939658/"""
    figure.canvas.draw()
    # Note: tostring_rgb() is deprecated in recent Matplotlib releases;
    # there, np.asarray(figure.canvas.buffer_rgba())[..., :3] gives the same RGB data.
    buf = figure.canvas.tostring_rgb()
    n_cols, n_rows = figure.canvas.get_width_height()
    return np.frombuffer(buf, dtype=np.uint8).reshape(n_rows, n_cols, 3)
# Generate data for the figure
rs1 = RandomState(MT19937(SeedSequence(123456789)))
time_1 = np.round(rs1.rand(232) * NUMBER_OF_FRAMES).astype(np.int16)
time_2 = time_1 + np.round(rs1.rand(232) * (NUMBER_OF_FRAMES - time_1)).astype(np.int16)
time_3 = time_2 + np.round(rs1.rand(232) * (NUMBER_OF_FRAMES - time_2)).astype(np.int16)
loc_1_x, loc_1_y, loc_2_x, loc_2_y, loc_3_x, loc_3_y = np.round(rs1.rand(6, 232) * 100, 1)
df = pd.DataFrame({
"loc_1_time": time_1,
"loc_1_x": loc_1_x,
"loc_1_y": loc_1_y,
"loc_2_time": time_2,
"loc_2_x": loc_2_x,
"loc_2_y": loc_2_y,
"loc_3_time": time_3,
"loc_3_x": loc_3_x,
"loc_3_y": loc_3_y,
})
"""The stack answer starts here"""
# Add extra column for disappear time
df["disappear_time"] = df["loc_3_time"] + 3
all_times = df[["loc_1_time", "loc_2_time", "loc_3_time", "disappear_time"]]
change_times = np.unique(all_times)
# Prepare ticks for plotting the figure across frames
x_values = df[["loc_1_x", "loc_2_x", "loc_3_x"]].values.flatten()
x_ticks = np.array(np.linspace(x_values.min(), x_values.max(), 6), dtype=np.uint8)
y_values = df[["loc_1_y", "loc_2_y", "loc_3_y"]].values.flatten()
y_ticks = np.array(np.round(np.linspace(y_values.min(), y_values.max(), 6)), dtype=np.uint8)
sns.set_theme(style="whitegrid")
video_writer = VideoWriter("endermen.mp4")
if 0 not in change_times:
    # Generate an empty figure if no person arrives at t=0
    fig, ax = plt.subplots(figsize=RESOLUTION)
    ax.set_xticklabels(x_ticks)
    ax.set_yticklabels(y_ticks)
    ax.set_title("People movement. T=0")
    video_writer.add(figure_to_array(fig))
    loop_range = range(1, NUMBER_OF_FRAMES)
else:
    loop_range = range(NUMBER_OF_FRAMES)
palette = sns.color_palette("tab10")  # Ten colors; sliced below to the number of groups currently shown (at most three)
animation_data_df = pd.DataFrame(columns=["x", "y", "location", "index"])
for frame_idx in loop_range:
    if frame_idx in change_times:
        plt.close("all")
        # Get person who appears/moves/disappears
        indexes, loc_nums = np.where(all_times == frame_idx)
        loc_nums += 1
        for i, loc in zip(indexes, loc_nums):
            if loc != 4:
                x, y = df[[f"loc_{loc}_x", f"loc_{loc}_y"]].iloc[i]
            if loc == 1:  # location_1
                # Note: DataFrame.append was removed in pandas 2.0; pd.concat is its replacement there
                animation_data_df = animation_data_df.append(
                    {"x": x, "y": y, "location": loc, "index": i},
                    ignore_index=True
                )
            else:
                data_index = np.where(animation_data_df["index"] == i)[0][0]
                if loc in (2, 3):  # location_2 or 3
                    animation_data_df.loc[[data_index], :] = x, y, loc, i
                elif loc == 4:  # Disappear
                    animation_data_df.iloc[data_index] = np.nan
        current_palette_size = np.sum(~np.isnan(np.unique(animation_data_df["location"])))
        fig, ax = plt.subplots(figsize=RESOLUTION)
        sns.scatterplot(
            x="x", y="y", hue="location", data=animation_data_df, ax=ax, palette=palette[:current_palette_size]
        )
        ax.set_xticks(x_ticks)
        ax.set_xticklabels(x_ticks)
        ax.set_yticks(y_ticks)
        ax.set_yticklabels(y_ticks)
        ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
        ax.set_title(f"People movement. T={frame_idx}")
    # Every frame is written; the figure is only rebuilt when something changed
    video_writer.add(figure_to_array(fig))
video_writer.close()
Edit: There was a bug in which location_3 wasn't removed after 3 seconds. Fixed now.
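The "only redraw when something changes" idea can be looked at separately from the plotting code. Below is a minimal sketch in plain Python; `change_frames` and `linger` are hypothetical names, and the three-frame linger simply mirrors the `+ 3` used for disappear_time above.

```python
def change_frames(people, linger=3):
    """Return the sorted frame numbers at which any person's state changes.

    people is an iterable of (loc_1_time, loc_2_time, loc_3_time) tuples.
    A person disappears `linger` frames after reaching the third location.
    """
    frames = set()
    for t1, t2, t3 in people:
        frames.update((t1, t2, t3, t3 + linger))
    return sorted(frames)


frames = change_frames([(0, 5, 9), (5, 7, 9)])
```

Between two consecutive entries of the result nothing moves, so the previously rendered figure can be written to the video again without redrawing.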
Modifying the code from this question to include only the positions you want automatically removes the old ones: a point whose old position isn't included in the new set is no longer drawn. This holds whether you animate by time, by iterations, or by anything else. I have opted to use iterations here since it's easier and I don't know how you are handling your dataset. The code does have one bug, though: the last remaining point (or points, if several last the same amount of time) won't disappear. This is easy to solve if you never need to draw anything again. If you do, for example when there is a gap in the data with no people before the data resumes, I haven't found a workaround.
import math
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
#The t0,t1,t2,t3 are the times (in iterations) that the position changes
#If t0 is None then the person will never be displayed
people = [
#  t0  x1  y1  t1  x2  y2  t2  x3  y3  t3
[ 0, 1, 0.1, 1, 2, 0.2, 2, 3, 0.3, 3],
[ 2, None, None, None, 2, 1, 3, 4, 1, 7],
[ 2, float("NaN"), float("NaN"), float("NaN"), 2, 0.8, 4, 4, 0.8, 10],
]
fig = plt.figure()
plt.xlim(0, 5)
plt.ylim(0, 1)
graph = plt.scatter([], [])
def animate(i):
    points = []
    colors = []
    # Iterate over a copy, because people is modified inside the loop
    for person in list(people):
        if person[0] is None or math.isnan(person[0]) or i < person[0]:
            continue
        # Position 1
        elif person[3] is not None and not (math.isnan(person[3])) and i <= person[3]:
            new_point = [person[1], person[2]]
            color = "b"
        # Position 2
        elif person[6] is not None and not (math.isnan(person[6])) and i <= person[6]:
            new_point = [person[4], person[5]]
            color = "r"
        # Position 3
        elif person[9] is not None and not (math.isnan(person[9])) and i <= person[9]:
            new_point = [person[7], person[8]]
            color = "r"
        else:
            # This person's last interval is over, so drop them for good
            people.remove(person)
            new_point = []
        if new_point != []:
            points.append(new_point)
            colors.append(color)
    if points != []:
        graph.set_offsets(points)
        graph.set_facecolors(colors)
    else:
        # You can use graph.remove() to fix the last point not disappearing, but you won't be able to plot anything after that
        # graph.remove()
        pass
    return graph


ani = FuncAnimation(fig, animate, repeat=False, interval=500)
plt.show()
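The per-frame logic can also be written as a pure function that never modifies `people`, which makes it easy to check outside matplotlib. This is a sketch under the same row layout as above; `visible_points` and `_missing` are hypothetical helper names, and treating None and NaN as "no data" is carried over from the answer's code.

```python
import math


def _missing(value):
    """True for None or NaN, the two 'no data' markers used in the rows."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def visible_points(people, i):
    """Return (points, colors) for iteration i without modifying people.

    Each row is [t0, x1, y1, t1, x2, y2, t2, x3, y3, t3].
    """
    points, colors = [], []
    for t0, x1, y1, t1, x2, y2, t2, x3, y3, t3 in people:
        if _missing(t0) or i < t0:
            continue  # has not appeared yet
        if not _missing(t1) and i <= t1:
            point, color = (x1, y1), "b"
        elif not _missing(t2) and i <= t2:
            point, color = (x2, y2), "r"
        elif not _missing(t3) and i <= t3:
            point, color = (x3, y3), "r"
        else:
            continue  # has already disappeared
        points.append(point)
        colors.append(color)
    return points, colors


people = [
    [0, 1, 0.1, 1, 2, 0.2, 2, 3, 0.3, 3],
    [2, None, None, None, 2, 1, 3, 4, 1, 7],
]
points_at_0, colors_at_0 = visible_points(people, 0)
points_at_2, colors_at_2 = visible_points(people, 2)
points_at_8, colors_at_8 = visible_points(people, 8)
```

Inside animate(i) the result would be handed to graph.set_offsets and graph.set_facecolors. When the list is empty, passing an empty array of shape (0, 2) to set_offsets is a commonly used way to clear a scatter plot while keeping it usable for later frames; I have not verified that against this exact script.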
From what I can see, boxplot() method expects a sequence of raw values (numbers) as input, from which it then computes percentiles to draw the boxplot(s).
I would like to have a method by which I could pass in the percentiles and get the corresponding boxplot.
For example:
Assume that I have run several benchmarks and for each benchmark I've measured latencies ( floating point values ). Now additionally, I have precomputed the percentiles for these values.
Hence for each benchmark, I have the 25th, 50th, 75th percentile along with the min and max.
Now given these data, I would like to draw the box plots for the benchmarks.
As of 2020, there is a better method than the one in the accepted answer.
The matplotlib.axes.Axes class provides a bxp method, which can be used to draw the boxes and whiskers based on the percentile values. Raw data is only needed for the outliers, and that is optional.
Example:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
boxes = [
{
'label' : "Male height",
'whislo': 162.6, # Bottom whisker position
'q1' : 170.2, # First quartile (25th percentile)
'med' : 175.7, # Median (50th percentile)
'q3' : 180.4, # Third quartile (75th percentile)
'whishi': 187.8, # Top whisker position
'fliers': [] # Outliers
}
]
ax.bxp(boxes, showfliers=False)
ax.set_ylabel("cm")
plt.savefig("boxplot.png")
plt.close()
This produces the following image:
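For several benchmarks with precomputed percentiles, the list of dictionaries can be generated rather than typed by hand. A small sketch, assuming each benchmark is stored as a (min, 25th, 50th, 75th, max) tuple; `to_bxp_stats` and the benchmark names are hypothetical, while the dictionary keys are the ones used in the example above.

```python
def to_bxp_stats(benchmarks):
    """Convert {name: (min, p25, p50, p75, max)} into a list of bxp dicts."""
    stats = []
    for name, (low, q1, med, q3, high) in benchmarks.items():
        if not low <= q1 <= med <= q3 <= high:
            raise ValueError(f"percentiles for {name!r} are not ordered")
        stats.append({
            "label": name,
            "whislo": low,   # bottom whisker at the minimum
            "q1": q1,
            "med": med,
            "q3": q3,
            "whishi": high,  # top whisker at the maximum
            "fliers": [],    # no raw data, so no outliers
        })
    return stats


latencies = {
    "bench-a": (1.2, 2.0, 2.6, 3.1, 5.4),
    "bench-b": (0.8, 1.1, 1.5, 2.2, 4.0),
}
boxes = to_bxp_stats(latencies)
```

The resulting boxes list has the same shape as the hand-written one, so it can be passed to ax.bxp(boxes, showfliers=False) in the same way.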
To draw the box plot using just the percentile values and the outliers ( if any ) I made a customized_box_plot function that basically modifies attributes in a basic box plot ( generated from a tiny sample data ) to make it fit according to your percentile values.
The customized_box_plot function
def customized_box_plot(percentiles, axes, redraw = True, *args, **kwargs):
    """
    Generates a customized boxplot based on the given percentile values
    """
    n_box = len(percentiles)
    box_plot = axes.boxplot([[-9, -4, 2, 4, 9],]*n_box, *args, **kwargs)
    # Creates len(percentiles) no of box plots
    min_y, max_y = float('inf'), -float('inf')
    for box_no, (q1_start,
                 q2_start,
                 q3_start,
                 q4_start,
                 q4_end,
                 fliers_xy) in enumerate(percentiles):
        # Lower cap
        box_plot['caps'][2*box_no].set_ydata([q1_start, q1_start])
        # xdata is determined by the width of the box plot
        # Lower whiskers
        box_plot['whiskers'][2*box_no].set_ydata([q1_start, q2_start])
        # Higher cap
        box_plot['caps'][2*box_no + 1].set_ydata([q4_end, q4_end])
        # Higher whiskers
        box_plot['whiskers'][2*box_no + 1].set_ydata([q4_start, q4_end])
        # Box
        box_plot['boxes'][box_no].set_ydata([q2_start,
                                             q2_start,
                                             q4_start,
                                             q4_start,
                                             q2_start])
        # Median
        box_plot['medians'][box_no].set_ydata([q3_start, q3_start])
        # Outliers
        if fliers_xy is not None and len(fliers_xy[0]) != 0:
            # If outliers exist
            box_plot['fliers'][box_no].set(xdata = fliers_xy[0],
                                           ydata = fliers_xy[1])
            min_y = min(q1_start, min_y, fliers_xy[1].min())
            max_y = max(q4_end, max_y, fliers_xy[1].max())
        else:
            min_y = min(q1_start, min_y)
            max_y = max(q4_end, max_y)
        # The y axis is rescaled to fit the new box plot completely with 10%
        # of the maximum value at both ends
        axes.set_ylim([min_y*1.1, max_y*1.1])
    # If redraw is set to true, the canvas is updated.
    if redraw:
        axes.figure.canvas.draw()
    return box_plot
USAGE
Using inverse logic ( code at the very end ) I extracted the percentile values from this example
>>> percentiles
(-1.0597368367634488, 0.3977683984966961, 1.0298955252405229, 1.6693981537742526, 3.4951447843464449)
(-0.90494930553559483, 0.36916539612108634, 1.0303658700697103, 1.6874542731392828, 3.4951447843464449)
(0.13744105279440233, 1.3300645202649739, 2.6131540656339483, 4.8763411136047647, 9.5751914834437937)
(0.22786243898199182, 1.4120860286080519, 2.637650402506837, 4.9067126578493259, 9.4660357513550899)
(0.0064696168078617741, 0.30586770128093388, 0.70774153557312702, 1.5241965711101928, 3.3092932063051976)
(0.007009744579241136, 0.28627373934008982, 0.66039691869500572, 1.4772725266672091, 3.221716765477217)
(-2.2621660374110544, 5.1901313713883352, 7.7178532139979357, 11.277744848353247, 20.155971739152388)
(-2.2621660374110544, 5.1884411864079532, 7.3357079047721054, 10.792299385806913, 18.842012119715388)
(2.5417888074435702, 5.885996170695587, 7.7271286220368598, 8.9207423361593179, 10.846938621419374)
(2.5971767318505856, 5.753551925927133, 7.6569980004033464, 8.8161056254143233, 10.846938621419374)
Note that to keep this short I haven't shown the outliers vectors which will be the 6th element of each of the percentile array.
Also note that all usual additional kwargs / args can be used since they are simply passed to the boxplot method inside it :
>>> fig, ax = plt.subplots()
>>> b = customized_box_plot(percentiles, ax, redraw=True, notch=0, sym='+', vert=1, whis=1.5)
>>> plt.show()
EXPLANATION
The boxplot method returns a dictionary mapping the components of the boxplot to the individual matplotlib.lines.Line2D instances that were created.
Quoting from the matplotlib.pyplot.boxplot documentation :
That dictionary has the following keys (assuming vertical boxplots):
boxes: the main body of the boxplot showing the quartiles and the median’s confidence intervals if enabled.
medians: horizontal lines at the median of each box.
whiskers: the vertical lines extending to the most extreme, non-outlier data points.
caps: the horizontal lines at the ends of the whiskers.
fliers: points representing data that extend beyond the whiskers (outliers).
means: points or lines representing the means.
For example observe the boxplot of a tiny sample data of [-9, -4, 2, 4, 9]
>>> b = ax.boxplot([[-9, -4, 2, 4, 9],])
>>> b
{'boxes': [<matplotlib.lines.Line2D at 0x7fe1f5b21350>],
'caps': [<matplotlib.lines.Line2D at 0x7fe1f54d4e50>,
<matplotlib.lines.Line2D at 0x7fe1f54d0e50>],
'fliers': [<matplotlib.lines.Line2D at 0x7fe1f5b317d0>],
'means': [],
'medians': [<matplotlib.lines.Line2D at 0x7fe1f63549d0>],
'whiskers': [<matplotlib.lines.Line2D at 0x7fe1f5b22e10>,
<matplotlib.lines.Line2D at 0x7fe20c54a510>]}
>>> plt.show()
The matplotlib.lines.Line2D objects have two methods that I'll be using in my function extensively. set_xdata ( or set_ydata ) and get_xdata ( or get_ydata ).
Using these methods we can alter the position of the constituent lines of the base box plot to conform to your percentile values ( which is what the customized_box_plot function does ). After altering the constituent lines' position, you can redraw the canvas using figure.canvas.draw()
Summarizing the mappings from percentile to the coordinates of the various Line2D objects.
The Y Coordinates :
The max ( q4_end - end of 4th quartile ) corresponds to the top most cap Line2D object.
The min ( q1_start - start of the 1st quartile ) corresponds to the lowermost cap Line2D object.
The median corresponds to the ( q3_start ) median Line2D object.
The 2 whiskers lie between the ends of the boxes and extreme caps ( q1_start and q2_start - lower whisker; q4_start and q4_end - upper whisker )
The box is actually an interesting n shaped line bounded by a cap at the lower portion. The extremes of the n shaped line correspond to the q2_start and the q4_start.
The X Coordinates :
The Central x coordinates ( for multiple box plots are usually 1, 2, 3... )
The library automatically calculates the bounding x coordinates based on the width specified.
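The y-coordinate mapping summarized above can be written down as plain data, independent of matplotlib. A minimal sketch; `artist_ydata` and its dictionary keys are hypothetical names, and the lists are the same ones that customized_box_plot passes to set_ydata.

```python
def artist_ydata(q1_start, q2_start, q3_start, q4_start, q4_end):
    """Map five percentile values to the y-data of each box plot line."""
    return {
        "lower_cap": [q1_start, q1_start],
        "lower_whisker": [q1_start, q2_start],
        # The box outline starts and ends at the lower edge
        "box": [q2_start, q2_start, q4_start, q4_start, q2_start],
        "median": [q3_start, q3_start],
        "upper_whisker": [q4_start, q4_end],
        "upper_cap": [q4_end, q4_end],
    }


ydata = artist_ydata(-9, -4, 2, 4, 9)
```

Each horizontal line (caps, median) repeats a single value, while the whiskers and the box span two of the five values.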
INVERSE FUNCTION TO RETRIEVE THE PERCENTILES FROM THE boxplot DICT:
def get_percentiles_from_box_plots(bp):
percentiles = []
for i in range(len(bp['boxes'])):
percentiles.append((bp['caps'][2*i].get_ydata()[0],
bp['boxes'][i].get_ydata()[0],
bp['medians'][i].get_ydata()[0],
bp['boxes'][i].get_ydata()[2],
bp['caps'][2*i + 1].get_ydata()[0],
(bp['fliers'][i].get_xdata(),
bp['fliers'][i].get_ydata())))
return percentiles
NOTE:
The reason why I did not make a completely custom boxplot method is because, there are many features offered by the inbuilt box plot that cannot be fully reproduced.
Also excuse me if I may have unnecessarily explained something that may have been too obvious.
Here is an updated version of this useful routine. Setting the vertices directly appears to work for both filled boxes (patch_artist=True) and unfilled ones.
def customized_box_plot(percentiles, axes, redraw = True, *args, **kwargs):
    """
    Generates a customized boxplot based on the given percentile values
    """
    n_box = len(percentiles)
    box_plot = axes.boxplot([[-9, -4, 2, 4, 9],]*n_box, *args, **kwargs)
    # Creates len(percentiles) no of box plots
    min_y, max_y = float('inf'), -float('inf')
    for box_no, pdata in enumerate(percentiles):
        if len(pdata) == 6:
            (q1_start, q2_start, q3_start, q4_start, q4_end, fliers_xy) = pdata
        elif len(pdata) == 5:
            (q1_start, q2_start, q3_start, q4_start, q4_end) = pdata
            fliers_xy = None
        else:
            raise ValueError("Percentile arrays for customized_box_plot must have either 5 or 6 values")
        # Lower cap
        box_plot['caps'][2*box_no].set_ydata([q1_start, q1_start])
        # xdata is determined by the width of the box plot
        # Lower whiskers
        box_plot['whiskers'][2*box_no].set_ydata([q1_start, q2_start])
        # Higher cap
        box_plot['caps'][2*box_no + 1].set_ydata([q4_end, q4_end])
        # Higher whiskers
        box_plot['whiskers'][2*box_no + 1].set_ydata([q4_start, q4_end])
        # Box: edit the path vertices in place
        path = box_plot['boxes'][box_no].get_path()
        path.vertices[0][1] = q2_start
        path.vertices[1][1] = q2_start
        path.vertices[2][1] = q4_start
        path.vertices[3][1] = q4_start
        path.vertices[4][1] = q2_start
        # Median
        box_plot['medians'][box_no].set_ydata([q3_start, q3_start])
        # Outliers
        if fliers_xy is not None and len(fliers_xy[0]) != 0:
            # If outliers exist
            box_plot['fliers'][box_no].set(xdata = fliers_xy[0],
                                           ydata = fliers_xy[1])
            min_y = min(q1_start, min_y, fliers_xy[1].min())
            max_y = max(q4_end, max_y, fliers_xy[1].max())
        else:
            min_y = min(q1_start, min_y)
            max_y = max(q4_end, max_y)
        # The y axis is rescaled to fit the new box plot completely with 10%
        # of the maximum value at both ends
        axes.set_ylim([min_y*1.1, max_y*1.1])
    # If redraw is set to true, the canvas is updated.
    if redraw:
        axes.figure.canvas.draw()
    return box_plot
Here is a bottom-up approach where the box plot is built up using matplotlib's vlines, Rectangle, and normal plot functions:
def boxplot(df, ax=None, box_width=0.2, whisker_size=20, mean_size=10, median_size=10, line_width=1.5, xoffset=0,
            color=0):
    """Plots a boxplot from existing percentiles.

    Parameters
    ----------
    df: pandas DataFrame
    ax: pandas AxesSubplot
        if to plot on an existing axes
    box_width: float
    whisker_size: float
        size of the bar at the end of each whisker
    mean_size: float
        size of the mean symbol
    color: int or rgb(list)
        If int particular color of property cycler is taken. Example of rgb: [1,0,0] (red)

    Returns
    -------
    f, a, boxes, vlines, whisker_tips, mean, median
    """
    if type(color) == int:
        color = plt.rcParams['axes.prop_cycle'].by_key()['color'][color]
    if ax:
        a = ax
        f = a.get_figure()
    else:
        f, a = plt.subplots()

    boxes = []
    vlines = []
    xn = []
    for row in df.iterrows():
        x = row[0] + xoffset
        xn.append(x)
        # box
        y = row[1][25]
        height = row[1][75] - row[1][25]
        box = plt.Rectangle((x - box_width / 2, y), box_width, height)
        a.add_patch(box)
        boxes.append(box)
        # whiskers
        y = (row[1][95] + row[1][5]) / 2
        vl = a.vlines(x, row[1][5], row[1][95])
        vlines.append(vl)

    for b in boxes:
        b.set_linewidth(line_width)
        b.set_facecolor([1, 1, 1, 1])
        b.set_edgecolor(color)
        b.set_zorder(2)
    for vl in vlines:
        vl.set_color(color)
        vl.set_linewidth(line_width)
        vl.set_zorder(1)

    whisker_tips = []
    if whisker_size:
        g, = a.plot(xn, df[5], ls='')
        whisker_tips.append(g)
        g, = a.plot(xn, df[95], ls='')
        whisker_tips.append(g)
    for wt in whisker_tips:
        wt.set_markeredgewidth(line_width)
        wt.set_color(color)
        wt.set_markersize(whisker_size)
        wt.set_marker('_')

    mean = None
    if mean_size:
        g, = a.plot(xn, df['mean'], ls='')
        g.set_marker('o')
        g.set_markersize(mean_size)
        g.set_zorder(20)
        g.set_markerfacecolor('None')
        g.set_markeredgewidth(line_width)
        g.set_markeredgecolor(color)
        mean = g

    median = None
    if median_size:
        g, = a.plot(xn, df['median'], ls='')
        g.set_marker('_')
        g.set_markersize(median_size)
        g.set_zorder(20)
        g.set_markeredgewidth(line_width)
        g.set_markeredgecolor(color)
        median = g

    a.set_ylim(np.nanmin(df), np.nanmax(df))
    return f, a, boxes, vlines, whisker_tips, mean, median
This is how it looks in action:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
nopts = 12
df = pd.DataFrame()
df['mean'] = np.random.random(nopts) + 7
df['median'] = np.random.random(nopts) + 7
df[5] = np.random.random(nopts) + 4
df[25] = np.random.random(nopts) + 6
df[75] = np.random.random(nopts) + 8
df[95] = np.random.random(nopts) + 10
out = boxplot(df)
I have a netcdf file with float values representing chlorophyll concentration at latitudes and longitudes. I am trying to draw a line between two sets of lats/lons and return all chlorophyll values from points on the line.
I'm approaching it geometrically: for points (x1, y1) and (x2, y2), find the slope and intercept of the line, then compute y for each x along the line. Once I have all the x and y values (longitude and latitude), I hope to pass them to the xarray select method to return the chlorophyll concentration.
ds = '~/apr1.nc'
ds = xarray.open_dataset(ds, decode_times=False)
x1, y1 = [34.3282, 32.4791]
x2, y2 = [34.7, 32.21]
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - (slope * x1)
step = 0.1  # longitude spacing in degrees (gives the four points shown in line_lons)
line_lons = np.arange(x1, x2, step)
line_lats = [slope * x + intercept for x in line_lons]
values = ds.CHL.sel(lat=line_lats, lon=line_lons, method='nearest')
values.values
>>> array([[[0.0908799 , 0.06634101, 0.07615771, 0.16289435],
[0.06787204, 0.07480557, 0.0655338 , 0.06064864],
[0.06352911, 0.06586582, 0.06702182, 0.10024723],
[0.0789495 , 0.07035938, 0.07455409, 0.08405576]]], dtype=float32)
line_lons
>>> array([34.3282, 34.4282, 34.5282, 34.6282])
I want to create a plot with longitudes on the x axis and values on the y axis. The problem is that values.values returns a NumPy array with a shape of (1, 4, 4), while there are only 4 longitudes, so the returned array holds far more values than there are points on the line.
plt.plot(line_lons, values.values)
Any idea why that is and how I can return one value for one input?
Thanks.
I assume this is because, by default, the selection returns every value in the box spanned by the given latitudes and longitudes rather than only the values along the selected transect.
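The difference between a box selection and a transect selection can be illustrated with plain NumPy indexing. This is only a sketch: the 4x4 array `chl` and the index arrays `i_lat` and `i_lon` are made-up stand-ins, not the data from the question.

```python
import numpy as np

# Hypothetical 4x4 field standing in for CHL, laid out as chl[i_lat, i_lon].
chl = np.arange(16, dtype=float).reshape(4, 4)

# Assumed indices of the grid cells nearest to four points on a line.
i_lat = np.array([0, 1, 2, 3])
i_lon = np.array([0, 1, 2, 3])

# Orthogonal ("outer") indexing: every lat is paired with every lon,
# which yields the whole 4x4 box.
box = chl[np.ix_(i_lat, i_lon)]

# Pointwise indexing: the k-th lat is paired with the k-th lon,
# which yields one value per point on the line (here the diagonal 0, 5, 10, 15).
transect = chl[i_lat, i_lon]

print(box.shape)       # (4, 4)
print(transect.shape)  # (4,)
```

In xarray, the pointwise behaviour is normally obtained by passing indexers that are themselves DataArray objects sharing one dimension, for example xarray.DataArray(line_lats, dims="points") and xarray.DataArray(line_lons, dims="points"), to .sel. The dimension name "points" is an arbitrary choice.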
I propose a more complex solution with NumPy and netCDF4: first build the transect from arbitrary points along the line, then snap those points to the closest unique coordinates in the input file (unique, so that each point along the transect is encountered only once).
Once you know the output coordinates, there are two ways to extract the data along the transect:
a) you find the indices of the corresponding coordinates
b) interpolate original data to those coordinates (either nearest or bi-linear method)
Here is the code:
#!/usr/bin/env ipython
# --------------------------------------------------------------------------------------------------------------
import numpy as np
from netCDF4 import Dataset
# -----------------------------
# coordinates:
x1, y1 = [10., 55.]
x2, y2 = [20., 58.]
# --------------------------------
# ==============================================================================================================
# create some test data:
nx,ny = 100,100
dataout = np.random.random((ny,nx));
# -------------------------------
lonout=np.linspace(9.,30.,nx);
latout=np.linspace(54.,66.,ny);
# make data:
ncout=Dataset('test.nc','w','NETCDF3_CLASSIC');
ncout.createDimension('lon',nx);
ncout.createDimension('lat',ny);
ncout.createDimension('time',None);
ncout.createVariable('lon','float64',('lon'));ncout.variables['lon'][:]=lonout;
ncout.createVariable('lat','float64',('lat'));ncout.variables['lat'][:]=latout;
ncout.createVariable('var','float32',('lat','lon'));ncout.variables['var'][:]=dataout;
ncout.close()
#=================================================================================================================
# CUT THE DATA FROM FILE:
# make some arbitrary line between start-end point, later let us convert it to indices:
coords=np.linspace(x1+1j*y1,x2+1j*y2,1000);
xo=np.real(coords);yo=np.imag(coords);
# ------------------------------------------------------
# get transect:
ncin = Dataset('test.nc');
lonin=ncin.variables['lon'][:];
latin=ncin.variables['lat'][:];
# ------------------------------------------------------
# get the transect indices:
rxo=np.array([np.squeeze(np.min(lonin[np.where(np.abs(lonin-val)==np.abs(lonin-val).min())])) for val in xo]);
ryo=np.array([np.squeeze(np.min(latin[np.where(np.abs(latin-val)==np.abs(latin-val).min())])) for val in yo]);
rcoords=np.unique(rxo+1j*ryo);
rxo=np.real(rcoords);ryo=np.imag(rcoords);
# ------------------------------------------------------
ixo=[int(np.squeeze(np.where(lonin==val))) for val in rxo];
jxo=[int(np.squeeze(np.where(latin==val))) for val in ryo];
# ------------------------------------------------------
# get var data along transect:
trans_data=np.array([ncin.variables['var'][jxo[ii],ixo[ii]] for ii in range(len(ixo))]);
# ------------------------------------------------------
ncin.close()
# ================================================================================================================
# Another solution using interpolation, when we already know the target coordinates (original coordinates along the transect):
from scipy.interpolate import griddata
ncin = Dataset('test.nc');
lonin=ncin.variables['lon'][:];
latin=ncin.variables['lat'][:];
varin=ncin.variables['var'][:];
ncin.close()
# ----------------------------------------------------------------------------------------------------------------
lonm,latm = np.meshgrid(lonin,latin);
trans_data_b=griddata((lonm.flatten(),latm.flatten()),varin.flatten(),(rxo,ryo),'nearest')
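The nearest-coordinate lookup of option a) can also be sketched more compactly with argmin. This is an illustrative variant, not the code above: `nearest_indices` is a hypothetical helper, the grid is built in memory instead of being read from test.nc, and duplicates are dropped in transect order rather than in the sorted order that np.unique gives on complex coordinates.

```python
import numpy as np

def nearest_indices(grid, targets):
    """Return the index of the closest grid value for each target (hypothetical helper)."""
    grid = np.asarray(grid)
    targets = np.asarray(targets)
    # |grid - target| for every pair, then the argmin along the grid axis.
    return np.abs(grid[None, :] - targets[:, None]).argmin(axis=1)

# Same grid extent and end points as the test data above, kept in memory.
lon = np.linspace(9., 30., 100)
lat = np.linspace(54., 66., 100)
var = np.random.random((lat.size, lon.size))

# Densely sampled straight line between the two end points.
xo = np.linspace(10., 20., 1000)
yo = np.linspace(55., 58., 1000)

ix = nearest_indices(lon, xo)
jy = nearest_indices(lat, yo)

# Drop repeated (jy, ix) pairs while keeping the order along the transect.
pairs = np.stack([jy, ix], axis=1)
_, first = np.unique(pairs, axis=0, return_index=True)
keep = np.sort(first)
jy, ix = pairs[keep, 0], pairs[keep, 1]

# Pointwise indexing: one value per unique grid cell on the line.
trans_data = var[jy, ix]
print(trans_data.shape)
```

Working with integer indices avoids comparing floating-point coordinates for equality, which is the fragile step when coordinates are read back from a file.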
I'm trying to make a movie by taking png images of an updating plot and stitching them together. There are three variables: degrees, ksB, and mp. Only mp changes each frame; the other two are constant. The data for mp for all times is stored in X. This is the relevant part of the code:
def plot(fname, haveMLPY=False):
# Load data from .npz file.
data = np.load(fname)
X = data["X"]
T = data["T"]
N = X.shape[1]
A = data["vipWeights"]
degrees = A.sum(1)
ksB = data["ksB"]
# Initialize a figure.
figure = plt.figure()
# Generate a plottable axis as the first subplot in 1 rows and 1 columns.
axis = figure.add_subplot(1,1,1)
# MP is the first (0th) variable. Plot one trajectory for each cell over time.
axis.plot(T, X[:,:,0], color="black")
# Decorate the plot.
axis.set_xlabel("time [hours]")
axis.set_ylabel("MP [nM]")
axis.set_title("PER mRNA concentration across all %d cells" % N)
firstInd = int(T.size / 2)
if haveMLPY:
import circadian.analysis
# Generate and plot a Signal object, which encapsulates wavelet analysis.
signal = circadian.analysis.Signal(X[firstInd:, 0, 0], T[firstInd:])
signal.showSpectrum(show=False)
files=[]
# filename for the name of the resulting movie
filename = 'animation'
mp = X[10**4-1,:,0]
from mpl_toolkits.mplot3d import Axes3D
for i in range(10**4):
print i
mp = X[i,:,0]
data2 = np.c_[degrees, ksB, mp]
# Find best fit surface for data2
# regular grid covering the domain of the data
mn = np.min(data2, axis=0)
mx = np.max(data2, axis=0)
X,Y = np.meshgrid(np.linspace(mn[0], mx[0], 20), np.linspace(mn[1], mx[1], 20))
XX = X.flatten()
YY = Y.flatten()
order = 2 # 1: linear, 2: quadratic
if order == 1:
# best-fit linear plane
A = np.c_[data2[:,0], data2[:,1], np.ones(data2.shape[0])]
C,_,_,_ = scipy.linalg.lstsq(A, data2[:,2]) # coefficients
# evaluate it on grid
Z = C[0]*X + C[1]*Y + C[2]
# or expressed using matrix/vector product
#Z = np.dot(np.c_[XX, YY, np.ones(XX.shape)], C).reshape(X.shape)
elif order == 2:
# best-fit quadratic curve
A = np.c_[np.ones(data2.shape[0]), data2[:,:2], np.prod(data2[:,:2], axis=1), data2[:,:2]**2]
C,_,_,_ = scipy.linalg.lstsq(A, data2[:,2])
# evaluate it on a grid
Z = np.dot(np.c_[np.ones(XX.shape), XX, YY, XX*YY, XX**2, YY**2], C).reshape(X.shape)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, alpha=0.2)
ax.scatter(degrees, ksB, mp)
ax.set_xlabel('degrees')
ax.set_ylabel('ksB')
ax.set_zlabel('mp')
# form a filename
fname2 = '_tmp%03d.png'%i
# save the frame
savefig(fname2)
# append the filename to the list
files.append(fname2)
# call mencoder
os.system("mencoder 'mf://_tmp*.png' -mf type=png:fps=10 -ovc lavc -lavcopts vcodec=wmv2 -oac copy -o " + filename + ".mpg")
# cleanup
for fname2 in files: os.remove(fname2)
Basically, all the data is stored in X. The format X[i, i, i] means X[time, neuron, data type]. Each time through the loop, I want to update the time, but still plot mp (the 0th variable) for all the neurons.
When I run this code, I get "IndexError: too many indices for array". I asked it to print i to see when the code was going wrong. I get an error when i = 1, meaning that the code loops through once but then has the error the second time.
However, I have data for 10^4 time steps. Just before the loop, I access X[10**4-1, :, 0] successfully, which is why it confuses me that X[1,:,0] would fail. If anybody could explain why or help me get around this, that would be great.
The traceback error is
Traceback (most recent call last):
File "/Users/angadanand/Documents/LiClipseWorkspace/Circadian/scripts/runMeNets.py", line 196, in <module>
plot(fname)
File "/Users/angadanand/Documents/LiClipseWorkspace/Circadian/scripts/runMeNets.py", line 142, in plot
mp = X[i,:,0]
IndexError: too many indices for array
Thanks!
Your problem is that you overwrite your X inside your loop:
X,Y = np.meshgrid(np.linspace(mn[0], mx[0], 20), np.linspace(mn[1], mx[1], 20))
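After the first iteration, X is therefore the 2-D 20x20 meshgrid array rather than your 3-D data array, so X[i,:,0] fails with "too many indices". I would suggest renaming this second X to X_grid (and Y to Y_grid), then checking each later use of X to decide whether it needs the grid or the original data.
For example: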
X_grid, Y_grid = np.meshgrid(np.linspace(mn[0], mx[0], 20), np.linspace(mn[1], mx[1], 20))
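A minimal reproduction makes the effect visible. The array shape and grid size here are made-up stand-ins for the real data, chosen only to keep the sketch small.

```python
import numpy as np

# Hypothetical stand-in for the data, laid out as X[time, neuron, variable].
X = np.random.random((5, 3, 2))

errors = []
for i in range(3):
    try:
        mp = X[i, :, 0]
    except IndexError as exc:
        errors.append((i, str(exc)))
        break
    # Bug: this rebinds X to a 2-D grid, so the next X[i, :, 0] has too many indices.
    X, Y = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))

print(errors)  # the failure is recorded at i == 1

# Fix: give the grid its own names so X keeps pointing at the data.
X = np.random.random((5, 3, 2))
for i in range(3):
    mp = X[i, :, 0]
    X_grid, Y_grid = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))

print(X.shape)  # (5, 3, 2)
```

In the original loop, the later uses that refer to the fitted surface (X.flatten(), the evaluation of Z, and ax.plot_surface) would take X_grid and Y_grid, while mp = X[i,:,0] keeps using the data array.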