How do I use sum()/average() for a namedtuple in Python?

from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
points = [Point(x=1.0, y=1.0), Point(x=2.0, y=2.0)]
I'd like to compute the average point of the points list, i.e. to receive Point(1.5, 1.5) as a result:
point = average(points) # x = 1.5, y = 1.5
I know there's np.average(points, axis=0) if points has shape (N, 2), but I'd rather keep a named tuple instead.

Calculate the average coordinate-wise:
import numpy as np
Point(np.average([p.x for p in points]),
      np.average([p.y for p in points]))
#Point(x=1.5, y=1.5)
Or, better, let numpy implicitly convert the list of points to an array, take the average, and unpack the result back into a Point:
Point(*np.average(points, axis=0))
#Point(x=1.5, y=1.5)

Maybe I am missing something, but if you want to avoid numpy then you can do
>>> Point(sum(p.x for p in points)/len(points),sum(p.y for p in points)/len(points))
Point(x=1.5, y=1.5)
Seems a bit roundabout, though.
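If you want this to work for any namedtuple without repeating the arithmetic per field, a small generic helper (my own sketch, not from the answers above) can transpose the points with zip and average each coordinate:
def average_namedtuple(items):
    # works for any namedtuple: zip(*items) groups the values field by field
    cls = type(items[0])
    return cls(*(sum(values) / len(items) for values in zip(*items)))

average_namedtuple(points)
# Point(x=1.5, y=1.5)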

Here's a cutesy way, using only Python built-ins: zip(*points) transposes the list of points into per-coordinate tuples, and statistics.mean averages each one.
In [1]: from collections import namedtuple
...: Point = namedtuple('Point', ['x', 'y'])
...: points = [Point(x=1.0, y=1.0), Point(x=2.0, y=2.0)]
...:
In [2]: import statistics
In [3]: Point(*map(statistics.mean, zip(*points)))
Out[3]: Point(x=1.5, y=1.5)
Why are you using numpy to begin with? It doesn't make much sense here.

This is a subclass of the namedtuple that has classmethods to construct a new point from the average or the sum of a sequence of points.
from collections import namedtuple

class Point(namedtuple('Point', ['x', 'y'])):
    @classmethod
    def from_average(cls, points):
        from_sum = cls.from_sum(points)
        return cls(from_sum.x / len(points), from_sum.y / len(points))

    @classmethod
    def from_sum(cls, points):
        if not all(isinstance(p, cls) for p in points):
            raise ValueError('All items in sequence must be of type {}'.format(cls.__name__))
        x = sum(p.x for p in points)
        y = sum(p.y for p in points)
        return cls(x, y)
point_1 = Point(1.0, 1.0)
point_2 = Point(2.0, 2.0)
point_3 = Point.from_average([point_1, point_2])
point_3
# Point(x=1.5, y=1.5)
point_4 = Point.from_sum([point_1, point_2])
point_4
# Point(x=3.0, y=3.0)

Related

Number format python

I want the legend of the plot to show the value from the list, but what I get is the element index, not the value itself. I don't know how to fix it. I'm referring to the plt.plot line. Thanks for the help.
import matplotlib.pyplot as plt
import numpy as np
x = np.random.random(1000)
y = np.random.random(1000)
n = len(x)
d_ij = []
for i in range(n):
    for j in range(i+1, n):
        a = np.sqrt((x[i]-x[j])**2 + (y[i]-y[j])**2)
        d_ij.append(a)
epsilon = np.linspace(0.01, 1, num=10)
sigma = np.linspace(0.01, 1, num=10)
def lj_pot(epsi, sig, d):
    result = []
    for i in range(len(d)):
        a = 4*epsi*((sig/d[i])**12 - (sig/d[i])**6)
        result.append(a)
    return result
for i in range(len(epsilon)):
    for j in range(len(sigma)):
        a = epsilon[i]
        b = sigma[j]
        plt.cla()
        plt.ylim([-1.5, 1.5])
        plt.xlim([0, 2])
        plt.plot(sorted(d_ij), lj_pot(epsilon[i], sigma[j], sorted(d_ij)), label='epsilon = %d, sigma = %d' % (a, b))
        plt.legend()
        plt.savefig("epsilon_%d_sigma_%d.png" % (i, j))
plt.show()
Your code is a bit unpythonic, so I tried to clean it up to the best of my knowledge. numpy.random.random and numpy.random.uniform(0, 1) are basically the same; however, the latter also allows you to pass the shape of the returned array that you would like to have, in this case an array with 1000 rows and two columns (1000, 2). I then unpack the two columns of the returned array into x and y on the same line.
numpy.hypot does as the name suggests and calculates the hypotenuse of x and y. It can also do that for each entry of arrays with the same size, saving you the for loops, which you should try to avoid in Python since they are pretty slow.
You used plt for all your plotting, which is fine as long as you only have one figure, but I would recommend being as explicit as possible, in line with one of Python's key notions:
explicit is better than implicit.
I recommend you read through this guide, in particular the section called 'Stateful Versus Stateless Approaches'. I changed your commands accordingly.
It is also very unpythonic to loop over the items of a list using their indices (for i in range(len(list)): item = list[i]). You can just iterate over the items directly (for item in list:).
Lastly, I changed your formatted strings to the more convenient f-strings. Have a read here.
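As a side note (this is not in the original answer): the %d conversion in your label truncates floats to integers, which is most likely why the legend showed 0 instead of the actual epsilon and sigma values; f-strings (or %f/%g) keep the float, for example:
>>> 'epsilon = %d, sigma = %d' % (0.01, 0.34)
'epsilon = 0, sigma = 0'
>>> f'epsilon = {0.01}, sigma = {0.34}'
'epsilon = 0.01, sigma = 0.34'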
import matplotlib.pyplot as plt
import numpy as np
def pot(epsi, sig, d):
    result = 4*epsi*((sig/d)**12 - (sig/d)**6)
    return result
# I am not sure why you would create the independent variable this way,
# maybe you are simulating something. In that case, the code below is
# simpler than your version and should achieve the same.
# x, y = zip(*np.random.uniform(0, 1, (1000, 2)))
# d = np.array(sorted(np.hypot(x, y)))
# If you only want to plot your pot function then creating the value range
# like this is just fine.
d = np.linspace(0.001, 1, 1000)
epsilons = sigmas = np.linspace(0.01, 1, num=10)
fig, ax = plt.subplots()
ax.set_xlim([0, 2])
ax.set_ylim([-1.5, 1.5])
line = None
for epsilon in epsilons:
    for sigma in sigmas:
        if line is None:
            line = ax.plot(
                d, pot(epsilon, sigma, d),
                label=f'epsilon = {epsilon}, sigma = {sigma}'
            )[0]
            fig.legend()
        else:
            line.set_data(d, pot(epsilon, sigma, d))
        # plt.savefig(f"epsilon_{epsilon}_sigma_{sigma}.png")
fig.show()

Python data structure: parameter dependent arrays

I have a problem where I build some matrices, call them A, that depend on two integer parameters p1, p2, each taking values from 0 to 5.
Is there a way in Python to store the eigenvalues and eigenvectors of A in an "object", called B, such that something like B(1,2)[i] (or B[1,2,i]) will give as a result the eigenvalues (for i=0) or eigenvectors (for i=1) of the matrix A built with p1 = 1 and p2 = 2?
Currently what I am doing is storing the eigenvectors in a dictionary as in the simple example below, but I think it is a dirty hack. I would appreciate any suggestions.
Example:
import numpy as np
# Build A matrices
def Amatrix(p1, p2):
    return np.array([[p1, p2/10], [p2/10, -p1]])
# Empty dict
eigvec_dict = {}
for p1 in range(3):
    for p2 in range(2):
        label = str(p1) + str(p2)
        eigvec_dict[label] = np.linalg.eigh(Amatrix(p1, p2))

eigvec_dict.keys()
Out[9]: ['11', '10', '00', '01', '20', '21']
eigvec_dict["01"][0]
Out[10]: array([-1.,  1.])
eigvec_dict["01"][1]
Out[11]:
array([[-0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678]])
I would use an object that takes a list of points (I think a point is better represented as a tuple than as a string) and calculates the eighs immediately.
__getitem__ is overridden so that, for example, [0, 1, 0] returns the eigenvalues for the point (0, 1). The internal data structure is still a dict, but it is wrapped in an object and can be called nicely from outside.
import numpy as np

# class to store eigenvalues / eigenvectors
class EigenH(object):
    def __init__(self, points):
        self.eighstore = self._create_eighstore(points)

    def _create_eighstore(self, points):
        eighstore = {}
        for point in points:
            eighs = np.linalg.eigh(self._get_amatrix(point))
            eighstore[point] = eighs
        return eighstore

    def _get_amatrix(self, point):
        p1, p2 = point
        return np.array([[p1, p2/10.], [p2/10., -p1]])

    def __getitem__(self, key):
        return self.eighstore[key[:2]][key[2]]

    def keys(self):
        return self.eighstore.keys()

# create point list
points = []
for p1 in range(3):
    for p2 in range(2):
        # I prefer tuples over strings in this case
        points.append((p1, p2))

# instantiate class
eigh = EigenH(points)

# get eigenvalues
print(eigh[0, 1, 0])

# get eigenvectors
print(eigh[0, 1, 1])

# all available eighs
print(eigh.keys())
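For comparison (not part of the answer above, just a sketch), a plain dict comprehension keyed by (p1, p2) tuples already gives you most of this without a wrapper class; B[1, 2][0] then holds the eigenvalues and B[1, 2][1] the eigenvectors:
import numpy as np
from itertools import product

def amatrix(p1, p2):
    return np.array([[p1, p2/10.], [p2/10., -p1]])

# dict keyed by (p1, p2); each value is the (eigenvalues, eigenvectors) pair from eigh
B = {(p1, p2): np.linalg.eigh(amatrix(p1, p2))
     for p1, p2 in product(range(3), range(2))}

print(B[0, 1][0])  # eigenvalues of A(p1=0, p2=1)
print(B[0, 1][1])  # eigenvectors of A(p1=0, p2=1)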

getting elements in an array1 that are not in array2

Main Problem
What is a better/more pythonic way of retrieving the elements of one array that are not found in another array? This is what I have:
idata = [np.column_stack(data[k]) for k in range(len(data)) if data[k] not in final]
idata = np.vstack(idata)
My interest is in performance. My data is an (X, Y, Z) array of size (7000 x 3) and my gdata is an (X, Y) array of size (11000 x 2).
Preamble
I am working on an octant search to find the n points (e.g. 8) closest to my circle point (o) in each octant. This would mean that my points (+) are reduced to only 64 (8 per octant). Then for each gdata point I would save the elements that are not found in data.
import tkinter as tk
from tkinter import filedialog
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
from collections import defaultdict
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
data = pd.read_excel(file_path)
data = np.array(data, dtype=np.float)
nrow, cols = data.shape
file_path1 = filedialog.askopenfilename()
gdata = pd.read_excel(file_path1)
gdata = np.array(gdata, dtype=np.float)
gnrow, gcols = gdata.shape
npoint_per_octant = N = 8  # number of points to keep per octant
delta = gdata - data[:, :2]
angles = np.arctan2(delta[:, 1], delta[:, 0])
bins = np.linspace(-np.pi, np.pi, 9)
bins[-1] = np.inf  # handle edge case
octantsort = []
for j in range(gnrow):
    delta = gdata[j, ::] - data[:, :2]
    angles = np.arctan2(delta[:, 1], delta[:, 0])
    octantsort = []
    for i in range(8):
        data_i = data[(bins[i] <= angles) & (angles < bins[i+1])]
        if data_i.size > 0:
            dist_order = np.argsort(cdist(data_i[:, :2], gdata[j, ::][np.newaxis]), axis=0)
            if dist_order.size < npoint_per_octant + 1:
                [octantsort.append(data_i[dist_order[:npoint_per_octant][j]]) for j in range(dist_order.size)]
            else:
                [octantsort.append(data_i[dist_order[:npoint_per_octant][j]]) for j in range(npoint_per_octant)]
        final = np.vstack(octantsort)
    idata = [np.column_stack(data[k]) for k in range(len(data)) if data[k] not in final]
    idata = np.vstack(idata)
Is there an efficient and pythonic way of doing this to increase the performance of the last two lines of the code?
If I understand your code correctly, then I see the following potential savings:
dedent the final = ... line
don't use arctan; it's expensive; since you only want octants, compare the coordinates to zero and to each other
don't do a full argsort; use argpartition instead
make your octantsort an "octantargsort", i.e. store the indices into data, not the data points themselves; this would save you the search in the last but one line and allow you to use np.delete for the removal
don't use append inside a list comprehension; this produces a list of Nones that is immediately discarded; you can use list.extend outside the comprehension instead
besides, these list comprehensions look like a convoluted way of converting data_i[dist_order[:npoint_per_octant]] into a list; why not simply cast it, or even keep it as an array, since you want to vstack in the end?
Here is some sample code illustrating these ideas:
import numpy as np

def discard_nearest_in_each_octant(eater, eaten, n_eaten_p_eater):
    # build octants
    # start with quadrants ...
    top, left = (eaten < eater).T
    quadrants = [np.where(v & h)[0] for v in (top, ~top) for h in (left, ~left)]
    dcoord2 = (eaten - eater)**2
    dc2quadrant = [dcoord2[q] for q in quadrants]
    # ... and split them
    oct4158 = [q[:, 0] < q[:, 1] for q in dc2quadrant]
    # main loop
    dc2octants = [[q[o], q[~o]] for q, o in zip(dc2quadrant, oct4158)]
    reloap = [[
        np.argpartition(o.sum(-1), n_eaten_p_eater)[:n_eaten_p_eater]
        if o.shape[0] > n_eaten_p_eater else None
        for o in opair] for opair in dc2octants]
    # translate indices
    octantargpartition = [q[so] if oap is None else q[np.where(so)[0][oap]]
                          for q, o, oaps in zip(quadrants, oct4158, reloap)
                          for so, oap in zip([o, ~o], oaps)]
    octantargpartition = np.concatenate(octantargpartition)
    return np.delete(eaten, octantargpartition, axis=0)
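A quick usage sketch (my own example, not from the original answer; the data and query point are made up) showing how the function drops the nearest n_eaten_p_eater points per octant and returns the rest:
rng = np.random.default_rng(0)
eaten = rng.random((1000, 2))        # hypothetical (X, Y) data points
eater = np.array([0.5, 0.5])         # hypothetical query point
remaining = discard_nearest_in_each_octant(eater, eaten, 8)
print(eaten.shape, remaining.shape)  # (1000, 2) (936, 2): 8 points removed from each of the 8 octants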

Wrong Exponential Power Plot - How to improve curve fit

Unfortunately, the power fit with scipy does not return a good fit. I tried to use p0 as an input argument with close starting values, which did not help.
I would be very glad if someone could point out my problem.
# Imports
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
# Data
data = [[0.004408724185371062, 78.78011887652593], [0.005507091456466967, 65.01330508350753], [0.007073553026306459, 58.13364205119446], [0.009417452253958304, 50.12258366028477], [0.01315330108197482, 44.22980301062208], [0.019648758406406834, 35.436139354228956], [0.03248060063099905, 28.359815190205957], [0.06366197723675814, 21.54769216720596], [0.17683882565766149, 14.532777174472574], [1.5915494309189533, 6.156872080264581]]
# Fill lists to store x and y value
x_data, y_data = [], []
for i in data:
    x_data.append(i[0])
    y_data.append(i[1])
# Exponential Function
def func(x, m, c):
    return x**m * c
# Curve fit
coeff, _ = curve_fit(func, x_data, y_data)
m, c = coeff[0], coeff[1]
# Plot function
x_function = np.linspace(0, 1.5, 100)
y = x_function**m * c
a = plt.scatter(x_data, y_data, s=30, marker = "v")
yfunction = x_function**m * c
plt.plot(x_function, yfunction, '-')
plt.show()
Another dataset for which the fit is really bad would be:
data = [[0.004408724185371062, 194.04075083542443], [0.005507091456466967, 146.09194314074864], [0.007073553026306459, 120.2115882821158], [0.009417452253958304, 74.04014371874908], [0.01315330108197482, 34.167114633194736], [0.019648758406406834, 12.775528348369871], [0.03248060063099905, 7.903195816871708], [0.06366197723675814, 5.186092050500438], [0.17683882565766149, 3.260540592404184], [1.5915494309189533, 2.006254812978579]]
I might be missing something, but I think curve_fit just works fine here. When I compare the residuals obtained by curve_fit to the ones obtained using the parameters from Excel which you provide in the comments, the python results always lead to lower residuals (code is provided below). You say "Unfortunately the power fit with scipy does not return a good fit", but what exactly is your measure for a "good fit"? The python fit always seems to be better than the Excel fit with respect to the residuals.
Not sure whether it has to be exactly this function, but if not, you could also consider adding a third parameter to your function (below it is named "d"), which will lead to better results.
Here is the modified code. I changed your "func" and also increased the resolution for the plot. Then the residuals are printed as well. For the first data set, one obtains with Excel around 79.35 and with python around 34.29. For the second data set it is 15220.79 with Excel and 601.08 with python (assuming I did not mess anything up).
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
# Data
data = [[0.004408724185371062, 78.78011887652593], [0.005507091456466967, 65.01330508350753], [0.007073553026306459, 58.13364205119446], [0.009417452253958304, 50.12258366028477], [0.01315330108197482, 44.22980301062208], [0.019648758406406834, 35.436139354228956], [0.03248060063099905, 28.359815190205957], [0.06366197723675814, 21.54769216720596], [0.17683882565766149, 14.532777174472574], [1.5915494309189533, 6.156872080264581]]
#data = [[0.004408724185371062, 194.04075083542443], [0.005507091456466967, 146.09194314074864], [0.007073553026306459, 120.2115882821158], [0.009417452253958304, 74.04014371874908], [0.01315330108197482, 34.167114633194736], [0.019648758406406834, 12.775528348369871], [0.03248060063099905, 7.903195816871708], [0.06366197723675814, 5.186092050500438], [0.17683882565766149, 3.260540592404184], [1.5915494309189533, 2.006254812978579]]
# Fill lists to store x and y value
x_data, y_data = [], []
for i in data:
    x_data.append(i[0])
    y_data.append(i[1])
# Exponential Function
def func(x, m, c):
    # slightly rewritten; you could also consider using a third parameter d
    return c*np.power(x, m)  # + d
# Curve fit
coeff, _ = curve_fit(func, x_data, y_data)
m, c = coeff[0], coeff[1]  #, coeff[2]
print(m, c)  #, d
# Plot function
a = plt.scatter(x_data, y_data, s=30, marker = "v")
x_function = np.linspace(0, 1.5, 1000)
yfunction = c*np.power(x_function,m) # + d
plt.plot(x_function, yfunction, '-')
plt.show()
print "residuals python:",((y_data - func(x_data, *coeff))**2).sum()
#compare to excel, first data set
print "residuals excel:",((y_data - func(x_data, -0.425,7.027))**2).sum()
#compare to excel, second data set
print "residuals excel:",((y_data - func(x_data, -0.841,1.0823))**2).sum()
Taking your second dataset as an example: If you plot the raw data, a difficulty with the data becomes obvious: your data are very non-uniform. Now, since your function has a pure power law form, it's easiest to do the fitting in log scale:
In [1]: import numpy as np
In [2]: import matplotlib.pyplot as plt
In [3]: plt.ion()
In [4]: data = [[0.004408724185371062, 194.04075083542443], [0.005507091456466967, 146.09194314074864], [0.007073553026306459, 120.2115882821158], [0.009417452253958304, 74.04014371874908], [0.01315330108197482, 34.167114633194736], [0.019648758406406834, 12.775528348369871], [0.03248060063099905, 7.903195816871708], [0.06366197723675814, 5.186092050500438], [0.17683882565766149, 3.260540592404184], [1.5915494309189533, 2.006254812978579]]
In [5]: data = np.asarray(data) # just for convenience
In [6]: data.shape
Out[6]: (10, 2)
In [7]: x, y = data[:, 0], data[:, 1]
In [8]: lx, ly = np.log(x), np.log(y)
In [9]: plt.plot(lx, ly, 'ro')
Out[9]: [<matplotlib.lines.Line2D at 0x323a250>]
In [10]: def lfunc(x, a, b):
   ....:     return a*x + b
   ....:
In [11]: from scipy.optimize import curve_fit
In [12]: opt, cov = curve_fit(lfunc, lx, ly)
In [13]: opt
Out[13]: array([-0.84071518, 0.07906558])
In [14]: plt.plot(lx, lfunc(lx, *opt), 'b-')
Out[14]: [<matplotlib.lines.Line2D at 0x3be0f90>]
Whether this is an adequate model for the data is a separate concern.
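To recover the parameters of the original power law y = c*x**m from the log-space fit (a small extra step, not shown above): since log y = m*log x + log c, the fitted slope is m and the exponential of the intercept is c:
m, c = opt[0], np.exp(opt[1])
print(m, c)  # roughly -0.841 and 1.082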

Interval containing specified percent of values

With numpy or scipy, is there any existing method that will return the endpoints of an interval which contains a specified percent of the values in a 1D array? I realize that this is simple to write myself, but it seems like the kind of thing that might be built in, although I can't find it.
E.g:
>>> import numpy as np
>>> x = np.random.randn(100000)
>>> print(np.bounding_interval(x, 0.68))
Would give approximately (-1, 1)
You can use np.percentile: the central fraction p of the values excludes (1 - p)/2 on each side, so the endpoints are the 100*(1 - p)/2 and 100*(1 + p)/2 percentiles, i.e. 50*(1 - p) and 50*(1 + p):
In [29]: x = np.random.randn(100000)
In [30]: p = 0.68
In [31]: lo = 50*(1 - p)
In [32]: hi = 50*(1 + p)
In [33]: np.percentile(x, [lo, hi])
Out[33]: array([-0.99206523, 1.0006089 ])
There is also scipy.stats.scoreatpercentile:
In [34]: scoreatpercentile(x, [lo, hi])
Out[34]: array([-0.99206523, 1.0006089 ])
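If you want this packaged like the hypothetical np.bounding_interval from the question, a small wrapper around np.percentile (my own naming; it is not a NumPy function) is enough:
import numpy as np

def bounding_interval(values, fraction):
    # central `fraction` of the values, excluding (1 - fraction)/2 on each side
    lo = 50*(1 - fraction)
    hi = 50*(1 + fraction)
    return tuple(np.percentile(values, [lo, hi]))

x = np.random.randn(100000)
print(bounding_interval(x, 0.68))  # approximately (-1.0, 1.0)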
I don't know of a built-in function to do it, but you can write one that sorts the array and uses the math package to pick the approximate endpoint indices, like this:
from __future__ import division
import math
import numpy as np
def bound_interval(arr_in, interval):
    lhs = (1 - interval) / 2  # Specify left-hand side chunk to exclude
    rhs = 1 - lhs             # and the right-hand side
    sorted_arr = np.sort(arr_in)
    lower = sorted_arr[int(math.floor(lhs * len(arr_in)))]  # use floor to get the index
    upper = sorted_arr[int(math.floor(rhs * len(arr_in)))]
    return (lower, upper)
On your specified array, I got the interval (-0.99072237819851039, 0.98691691784955549). Pretty close to (-1, 1)!
