Python matplotlib annotate variable length arc - python

I would like to annotate some points in a matplotlib plot (dynamically, of course) with "arc" connection styles so that the annotations are grouped at the top and each connector runs to its point's x-position, stops at a defined y-position, and extends from there as a straight arm down to the data point (see figure). The solution might be to manipulate the individual "armB" values, but those are specified in points and don't correspond to the data coordinate system. The reason I don't just use straight lines in the first place is that in the real data the points are sometimes too close together and the text would overlap, hence the bent "arrows". It should also adapt dynamically to the zoom level (the final plot is interactive), but I think I'll be able to pull that off once the connection line problem is solved. Minimal working example:
import matplotlib.pyplot as plt
x=[2, 3, 4, 6, 7, 8, 10, 11]
y=[1, 3, 4, 2, 3, 1, 5, 2,]
tx=[3, 4, 5, 6, 7, 8, 9, 10]
yd=dict(zip(x, y))
plt.scatter(x, y)
plt.xlim(0, 14)
plt.ylim(0, 7)
arpr = {"arrowstyle": "-", "connectionstyle": "arc,angleA=-90,armA=20,angleB=90,armB=20,rad=10"}
for i, j in zip(x, tx):
    # lines all the way down but messy
    plt.annotate("foo", (i, yd[i]), (j, 6), arrowprops=arpr, rotation="vertical")
    # lines orderly, but incomplete
    plt.annotate("foo", (i, 5), (j, 6), arrowprops=arpr, rotation="vertical")
What I would like (red lines are the issue, added in the pic w/ MS paint...):
Just clipping the connectors to the data points, not good:

You can set the length of armB so that it's dependent on the position of the point.
Here's an example for what I thought looked good:
connection = "arc,angleA=-90,armA=20,angleB=90,rad=10,"
arpr = {"arrowstyle": "-"}
for i, j in zip(x, tx):
    # armB grows with the vertical distance from y=5 down to the point (factor 30 chosen by eye)
    arpr["connectionstyle"] = connection + "armB=" + "{0}".format((5-yd[i])*30)
    plt.annotate("foo", (i, yd[i]), (j, 6), arrowprops=arpr, rotation="vertical")
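The factor of 30 above is tuned by eye for one figure size and zoom level. As a rough sketch of how the arm length could instead be derived from data coordinates, the vertical distance can be pushed through ax.transData (data to display pixels) and converted to points using the figure DPI. The helper name arm_length_points and the variable y_bend are hypothetical, and the result is only approximate because the connector is also shortened slightly at both ends by the arrow's default shrink settings.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def arm_length_points(ax, y_from, y_to):
    """Hypothetical helper: vertical distance between two data y-values, in points."""
    (_, py0), (_, py1) = ax.transData.transform([(0, y_from), (0, y_to)])
    pixels = abs(py1 - py0)
    return pixels * 72.0 / ax.figure.dpi  # 72 points per inch


x = [2, 3, 4, 6, 7, 8, 10, 11]
y = [1, 3, 4, 2, 3, 1, 5, 2]
tx = [3, 4, 5, 6, 7, 8, 9, 10]

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlim(0, 14)
ax.set_ylim(0, 7)

y_bend = 5  # assumed data y-value at which every connector turns vertical
for xi, yi, txi in zip(x, y, tx):
    arm_b = arm_length_points(ax, yi, y_bend)
    style = f"arc,angleA=-90,armA=20,angleB=90,armB={arm_b:.1f},rad=10"
    ax.annotate("foo", (xi, yi), (txi, 6), rotation="vertical",
                arrowprops={"arrowstyle": "-", "connectionstyle": style})
```

Because the conversion depends on the current axis limits and figure size, the arm lengths would need to be recomputed whenever either changes, for example after zooming in an interactive plot.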

Related

What's the fastest way to generate millions of png files using Matplotlib?

For a deep learning project, I need to synthesize plots for each item in my dataset. This means generating 2.5 million plots, each 224x224 pixels.
So far the best I've been able to do is this, which takes 2.7 seconds for 100 plots on my PC:
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import matplotlib.pyplot as plt
for i in range(100):
    # 4 in x 4 in at 56 dpi gives a 224x224 pixel image
    fig = plt.Figure(frameon=False, facecolor="white", figsize=(4, 4))
    ax = fig.add_subplot(111)
    ax.axis('off')
    ax.plot([1, 2, 3, 4, 5, 6, 7, 8], [2, 4, 6, 8, 8, 6, 4, 3])
    canvas = FigureCanvas(fig)
    canvas.print_figure(str(i), dpi=56)
A resulting image (from this reproducible example) looks like this:
The real images use a bit more data (200 rows) but that makes little difference to speed.
At the speed above it will take me around 18 hours to generate all my plots! Are there any clever ways to speed this up?
Per the comment from AKX, Pillow has a function ImageDraw.line() that performs faster for this task:
from PIL import Image, ImageDraw
from itertools import chain
scale = 224
pad = 5
scale_pad = scale - pad * 2
for i in range(200):
    im = Image.new('RGB', (scale, scale), (255, 255, 255))
    draw = ImageDraw.Draw(im)
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2, 4, 6, 8, 8, 6, 4, 3]
    # rescale the data to pixel coordinates inside the padded image
    x = [pad + (v - min(x)) / (max(x) - min(x)) * scale_pad for v in x]
    y = [pad + (v - min(y)) / (max(y) - min(y)) * scale_pad for v in y]
    draw.line(list(chain.from_iterable(zip(x, y))), fill=(0, 0, 0), width=4)
    im.save(f"{i}.png")
This performs about 6x faster than Matplotlib, meaning my task should take only ~3 hours instead of 18.
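One detail worth checking if the orientation of the curve matters: Pillow's pixel origin is the top-left corner, so the rescaling above draws larger y-values lower in the image than matplotlib would. A minimal sketch of the same rescaling with an optional vertical flip follows; the function name to_pixels and its parameters are assumptions made for illustration, not part of Pillow.

```python
def to_pixels(values, size=224, pad=5, flip=False):
    """Hypothetical helper: map data values onto [pad, size - pad] pixel coordinates."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    usable = size - 2 * pad
    scaled = [pad + (v - lo) / span * usable for v in values]
    if flip:
        # mirror vertically so that larger values end up nearer the top
        scaled = [size - s for s in scaled]
    return scaled


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 8, 6, 4, 3]
px = to_pixels(x)
py = to_pixels(y, flip=True)
points = list(zip(px, py))  # list of (x, y) pixel tuples
```

The resulting list of (x, y) tuples can be passed to draw.line() in place of the flattened coordinate list.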

Plotting a histogram from a database using matplot and python

I'm trying to plot a histogram from a database using the matplotlib library in Python,
as shown here:
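import sqlite3
import pandas as pd
cnx = sqlite3.connect('practice.db')
# bin the delivery distances into integer-wide bins and count the rows in each
sql = pd.read_sql_query('''
    SELECT CAST((deliverydistance/1) AS int)*1 AS bin, count(*)
    FROM orders
    GROUP BY 1
    ORDER BY 1;
''', cnx)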
This returns a table with one row per bin and its count.
From the SQL result, I try to extract the columns using a for loop and place them in lists.
distance = []
counts = []
for x, y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
print(distance)
print(counts)
OUTPUT:
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
When I plot a histogram
plt.hist(counts,bins=distance)
I get a plot that is not what I want.
My question is, how do I make it so that the count is on the Y axis and the distance is on the X axis? It doesn't seem to allow me to put it there.
You could also skip the for loop and plot directly from your pandas DataFrame using
sql['bin'].plot(kind='hist', weights=sql['count(*)'])
or, with the for loop:
import matplotlib.pyplot as plt
import pandas as pd
distance =[]
counts = []
for x, y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
# one extra bin edge so that the last distance gets its own bin
plt.hist(distance, bins=distance + [distance[-1] + 1], weights=counts)
You can skip the middle section where you count the instances of each distance. Check out this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'distance':np.round(20 * np.random.random(100))})
df['distance'].hist(bins = np.arange(0,21,1))
Pandas has a built-in histogram plot which counts, then plots, the occurrences of each distance. You can specify the bins (in this case 0-20 with a width of 1).
If you want a horizontal histogram rather than a bar chart, pass orientation='horizontal':
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
# plt.style.use('dark_background')
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
plt.hist(counts,bins=distance, orientation='horizontal')
Since the counts are already computed, use a bar chart:
plt.bar(distance,counts)
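As a small self-contained sketch of the bar-chart approach, using the distance and counts lists printed in the question: the counts are already aggregated, so each bar simply takes its height from the corresponding count. The axis labels and the choice of align="edge" with width=1.0 (so that each bar spans its bin) are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

distance = list(range(19))
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418,
          4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]

fig, ax = plt.subplots()
# one bar per bin: the left edge sits at the bin value, the height is the count
bars = ax.bar(distance, counts, width=1.0, align="edge", edgecolor="black")
ax.set_xlabel("delivery distance (bin)")
ax.set_ylabel("number of orders")
```

With this layout the distance is on the x axis and the count on the y axis, which is what the question asks for.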

matplotlib connecting the dots in scatter plot

I am trying to visualize when each process was running (alive) and when it was idle. For each process, a_x_axis holds the times at which the process started running, and a_alive_for is derived from how long it stayed alive after it woke up. So I have two data points for each run of a process. I am trying to connect these two dots with a line: the first green dot with the first red dot, the second green dot with the second red dot, and so on, so that I can see the alive and idle time for each process in a large data set. I looked into scatter plot examples but could not find a way to solve this.
import matplotlib.pyplot as plt
a_x_axis = [32, 30, 40, 50, 60, 78]
a_live = [1, 3, 2, 1, 2, 4]
a_alive_for = [a + b for a, b in zip(a_x_axis, a_live)]
b_x_axis = [22, 25, 45, 55, 60, 72]
b_live = [1, 3, 2, 1, 2, 4]
b_alive_for = [a + b for a, b in zip(b_x_axis, b_live)]
a_y_axis = []
b_y_axis = []
for i in range(0, len(a_x_axis)):
    a_y_axis.append('process-1')
    b_y_axis.append('process-2')
print("size of a: %s" % len(a_x_axis))
print("size of a: %s" % len(a_y_axis))
plt.xlabel('time (s)')
plt.scatter(a_x_axis, [1]*len(a_x_axis))
plt.scatter(a_alive_for, [1]*len(a_x_axis))
plt.scatter(b_x_axis, [2]*len(b_x_axis))
plt.scatter(b_alive_for, [2]*len(b_x_axis))
plt.show()
You need:
import matplotlib.pyplot as plt
a_x_axis = [32, 30, 40, 50, 60, 78]
a_live = [1, 3, 2, 1, 2, 4]
a_alive_for = [a + b for a, b in zip(a_x_axis, a_live)]
b_x_axis = [22, 25, 45, 55, 60, 72]
b_live = [1, 3, 2, 1, 2, 4]
b_alive_for = [a + b for a, b in zip(b_x_axis, b_live)]
a_y_axis = []
b_y_axis = []
for i in range(0, len(a_x_axis)):
    a_y_axis.append('process-1')
    b_y_axis.append('process-2')
print("size of a: %s" % len(a_x_axis))
print("size of a: %s" % len(a_y_axis))
plt.xlabel('time (s)')
plt.scatter(a_x_axis, [1]*len(a_x_axis))
plt.scatter(a_alive_for, [1]*len(a_x_axis))
plt.scatter(b_x_axis, [2]*len(b_x_axis))
plt.scatter(b_alive_for, [2]*len(b_x_axis))
# connect each start time to its matching end time
for i in range(0, len(a_x_axis)):
    plt.plot([a_x_axis[i],a_alive_for[i]], [1,1], 'green')
for i in range(0, len(b_x_axis)):
    plt.plot([b_x_axis[i],b_alive_for[i]], [2,2], 'green')
plt.show()
scatter is not the tool for drawing lines; plot is. plot also accepts 2D arrays of x- and y-coordinates, so you don't have to iterate over the lists manually. You would need something like
plt.plot([a_x_axis, a_alive_for], [[1]*n,[1]*n], 'green')
with n = len(a_x_axis).
However, you could structure your data much better in numpy arrays or pandas DataFrames, where you can also name the columns. (Is that what you wanted to achieve by appending 'process-x' to your data lists?)
Also, the colors of your markers do not seem to be chosen on purpose; if you want them to match the lines, you could drop scatter entirely.
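To make the 2D form concrete, here is a minimal sketch using the data from the question. When plot receives 2D x and y inputs, each column is drawn as its own line, so a 2-by-n pair of start and end rows yields n separate segments. The variable names a_start and a_end are renamed for illustration, and adding marker="o" to replace the scatter calls is an assumption about the desired look.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

a_start = [32, 30, 40, 50, 60, 78]
a_live = [1, 3, 2, 1, 2, 4]
a_end = [s + d for s, d in zip(a_start, a_live)]
n = len(a_start)

fig, ax = plt.subplots()
# rows are (start, end); each column becomes one start-to-end segment at y=1
lines = ax.plot([a_start, a_end], [[1] * n, [1] * n], color="green", marker="o")
ax.set_xlabel("time (s)")
```

The returned list holds one Line2D object per process run, which can be handy for styling individual segments later.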

Piecewise Fit not working - large dataset

I have been using a solution found in several places on stack overflow for fitting a piecewise function:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15], dtype=float)
y = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03])
def piecewise_linear(x, x0, y0, k1, k2):
    # two line segments joined at (x0, y0): slope k1 left of x0, slope k2 right of it
    return np.piecewise(x, [x < x0], [lambda x:k1*x + y0-k1*x0, lambda x:k2*x + y0-k2*x0])
p, e = optimize.curve_fit(piecewise_linear, x, y)
xd = np.linspace(-5, 30, 100)
plt.plot(x, y, ".")
plt.plot(xd, piecewise_linear(xd, *p))
plt.show()
(for example, here: How to apply piecewise linear fit in Python?)
The first time I try it in the console I get an OptimizeWarning.
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
After that I just get a straight line for my fit. It seems as though there is clearly a bend in the data that the fit isn't following, although I cannot figure out why.
The dataset I am using has about 3200 points in each of x and y. Is this part of the problem?
Here are some fake data that kind of simulate mine (same problem occurs where fit is not piecewise):
x = np.append(np.random.uniform(low=10.0, high=40.2, size=(1500,)), np.random.uniform(low=-10.0, high=20.2, size=(1500,)))
y = np.append(np.random.uniform(low=-3000, high=0, size=(1500,)), np.random.uniform(low=-2000, high=1000, size=(1500,)))
Just to complete the question with the answer provided in the comment above:
The issue was not due to the large number of points, but to the large values on my y axis. The default initial parameter values are all 1, which is far from values of around 1000. To fix that, I passed an initial guess for the fit through the parameter p0. From the docs for scipy.optimize.curve_fit:
p0 : None, scalar, or N-length sequence, optional
Initial guess for the parameters. If None, then the initial values will all be 1 (if the number of parameters for the function can be determined using introspection, otherwise a ValueError is raised).
So my final code ended up looking like this:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15], dtype=float)
y = np.array([500, 700, 900, 1100, 1300, 1500, 2892, 4281, 5670, 7059, 8447, 9836, 11225, 12614, 14003])
def piecewise_linear(x, x0, y0, k1, k2):
    # two line segments joined at (x0, y0): slope k1 left of x0, slope k2 right of it
    return np.piecewise(x, [x < x0], [lambda x:k1*x + y0-k1*x0, lambda x:k2*x + y0-k2*x0])
p, e = optimize.curve_fit(piecewise_linear, x, y, p0=(10, -2500, 0, -500))
xd = np.linspace(-5, 30, 100)
plt.plot(x, y, ".")
plt.plot(xd, piecewise_linear(xd, *p))
plt.show()
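Rather than hard-coding p0, a starting point can be estimated roughly from the data itself. The sketch below is one possible heuristic, not the approach used in the answer above: it takes the median of x as the breakpoint, interpolates y there, and uses the slopes of straight-line fits to the left and right halves. The helper name initial_guess is hypothetical, and a guess like this can still fail on very noisy data.

```python
import numpy as np
from scipy import optimize


def piecewise_linear(x, x0, y0, k1, k2):
    """Two line segments that meet at the breakpoint (x0, y0)."""
    return np.piecewise(
        x,
        [x < x0],
        [lambda x: k1 * x + y0 - k1 * x0, lambda x: k2 * x + y0 - k2 * x0],
    )


def initial_guess(x, y):
    """Hypothetical helper: rough (x0, y0, k1, k2) estimated from the data."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    x0 = float(np.median(xs))          # assume the bend is near the middle
    y0 = float(np.interp(x0, xs, ys))  # y-value at the assumed bend
    left = xs < x0
    k1 = float(np.polyfit(xs[left], ys[left], 1)[0])    # slope left of the bend
    k2 = float(np.polyfit(xs[~left], ys[~left], 1)[0])  # slope right of the bend
    return x0, y0, k1, k2


x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([500, 700, 900, 1100, 1300, 1500, 2892, 4281, 5670, 7059,
              8447, 9836, 11225, 12614, 14003], dtype=float)

p0 = initial_guess(x, y)
p, e = optimize.curve_fit(piecewise_linear, x, y, p0=p0)
```

The point of the heuristic is only to put the starting parameters on the same scale as the data; curve_fit then refines them.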
Just for fun (very scattered case) :
Since the original data was not available, the coordinates of the points were obtained from the figure published in Rachel W's question, by scanning the image and recording the blue pixels. There are some artefacts due to the straight line and the grid, which appear in white after scanning.
The result of a piecewise regression (two segments) is drawn in red on the above figure.
The equation of the fitted function is :
The regression method used is not iterative and does not require an initial guess. The code is very simple: see pp. 12-13 of this paper https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf

matplotlib connecting wrong points in line graph

I am plotting two lists, x and y, using the matplotlib Python library (the resulting plot was linked in the original post).
The code used is this-
import matplotlib.pyplot as plt
plt.plot(x,y,"bo")
plt.fill(x,y,'#99d8c9')
It plots the points then connects the points using a line. But the problem is that it is not connecting the points correctly. Point 0 and 2 on x axis are connected wrongly instead of 1 and 2. Similarly on the other end it connects points 17 to 19, instead of 18 to 19. I also tried plotting simple line graph using-
plt.plot(x,y)
But that also connected the points wrongly. I would really appreciate it if anyone could point me in the right direction as to why this is happening and what can be done to resolve it.
Thanks!!
matplotlib draws lines through the points in the order they are given, so your points are connected in a 'strange' way (although exactly as you told matplotlib to do, e.g. from (0,1) to (3,2)). You can fix this by simply sorting the data before plotting.
#! /usr/bin/env python
import matplotlib.pyplot as plt
x = [20, 21, 22, 23, 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14, 17, 16, 19, 18]
y = [ 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
x2,y2 = zip(*sorted(zip(x,y),key=lambda pair: pair[0]))
plt.plot(x2,y2)
plt.show()
That should give you what you want.
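If the data is already in NumPy arrays, the same reordering can be sketched with np.argsort, which returns the indices that would sort x; indexing both arrays with it keeps each y paired with its x. This is an alternative to the zip/sorted idiom above, shown on the same data.

```python
import numpy as np

x = np.array([20, 21, 22, 23, 1, 0, 3, 2, 5, 4, 7, 6, 9, 8,
              11, 10, 13, 12, 15, 14, 17, 16, 19, 18])
y = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
              2, 2, 2, 2, 2, 2, 2, 2, 1, 1])

order = np.argsort(x)            # indices that put x in ascending order
x_sorted, y_sorted = x[order], y[order]
```

Passing x_sorted and y_sorted to plt.plot then connects neighbouring points from left to right.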
