I'd like to try the SciPy suite instead of Octave for doing the statistics in my lab experiments. Most of my questions were answered here; there is just one more thing left:
I usually have an error attached to the measurements, in Octave I just did the following:
R.val = 10;
R.err = 0.1;
U.val = 4;
U.err = 0.1;
And then I would calculate I with it like so:
I.val = U.val / R.val;
I.err = sqrt(
(1 / R.val * U.err)^2
+ (U.val / R.val^2 * R.err)^2
);
When I had a bunch of measurements, I usually used a structure array, like this:
R(0).val = 1;
R(0).err = 0.1;
…
R(15).val = 100;
R(15).err = 9;
Then I could do R(0).val or directly access all of them using R.val and I had a column vector with all the values, for mean(R.val) for instance.
How could I represent this using SciPy/NumPy/Python?
This kind of error propagation is exactly what the uncertainties Python package does. It does so transparently, while correctly handling correlations:
from uncertainties import ufloat
R = ufloat(10, 0.1)
U = ufloat(4, 0.1)
I = U/R
print I
prints 0.4+/-0.0107703296143, after automatically determining and calculating the error formula that you typed manually in your example. Also, I.n and I.s are respectively the nominal value (your val) and the standard deviation (your err).
Arrays holding numbers with uncertainties can also be used (http://pythonhosted.org/uncertainties/numpy_guide.html).
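For example, a minimal sketch using the unumpy submodule (the numbers are made up for illustration, and this assumes a recent version of the package where uarray takes the nominal values and standard deviations as two separate arguments):
from uncertainties import unumpy as unp

# arrays of nominal values and of standard deviations (illustrative numbers)
R = unp.uarray([1.0, 10.0, 100.0], [0.1, 1.0, 9.0])
U = unp.uarray([4.0, 4.1, 3.9], [0.1, 0.1, 0.1])

I = U / R                       # error propagation is handled element-wise
print(unp.nominal_values(I))    # the nominal values (your val)
print(unp.std_devs(I))          # the standard deviations (your err)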
(Disclaimer: I'm the author of this package.)
The easiest approach is indeed to use NumPy structured arrays, which let you define arrays whose elements (records) are composed of named fields.
For example, you could define
R = np.empty(15, dtype=[('val',float),('err',float)])
and then fill the corresponding columns:
R['val'] = ...
R['err'] = ...
Alternatively, you could define the array at once if you have your val and err in two lists:
R = np.array(zip(val_list, err_list), dtype=[('val',float),('err',float)])
In both cases, you can access individual elements by indices, like R[0] (which would give you a specific object, a np.void, that still gives you the possibility to access the fields separately), or by slices R[1:-1]...
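As a small sketch of the kind of access this gives you (values chosen arbitrarily for illustration):
import numpy as np

R = np.zeros(3, dtype=[('val', float), ('err', float)])
R['val'] = [1.0, 10.0, 100.0]
R['err'] = [0.1, 1.0, 9.0]

print(R[0])             # one record: (1.0, 0.1)
print(R['val'])         # all the values as a plain float array
print(R['val'].mean())  # the counterpart of Octave's mean(R.val)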
With your example, you could do:
I = np.empty_like(R)
I['val'] = U['val'] / R['val']
I['err'] = np.sqrt((1 / R['val'] * U['err'])**2 + (U['val'] / R['val']**2 * R['err'])**2)
You could also use record arrays, which are basically structured arrays with the __getattr__ and __setattr__ methods overloaded so that you can access the fields as attributes (like R.val) as well as by index (like the standard R['val']). Of course, because these basic methods are overloaded, record arrays are not as efficient as structured arrays.
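For instance, a small sketch of getting attribute access by viewing a structured array as a record array:
import numpy as np

R = np.zeros(3, dtype=[('val', float), ('err', float)])
Rrec = R.view(np.recarray)   # same data, attribute-style access
print(Rrec.val)              # equivalent to R['val']
print(Rrec[0].err)           # equivalent to R[0]['err']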
For just one measurement, a simple namedtuple would probably suffice (a small sketch follows below).
And instead of structured arrays you can use NumPy's record arrays, although they seem to be a bit more of a mouthful.
Also, the NumPy for Matlab Users guide can help with counterparts of basic operations.
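A minimal namedtuple sketch for a single measurement might look like this (the class name is just for illustration):
from collections import namedtuple

Measurement = namedtuple('Measurement', ['val', 'err'])

R = Measurement(val=10.0, err=0.1)
U = Measurement(val=4.0, err=0.1)
I = Measurement(val=U.val / R.val,
                err=((1.0 / R.val * U.err)**2 + (U.val / R.val**2 * R.err)**2)**0.5)
print(I)   # Measurement(val=0.4, err=0.01077...)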
There is a package for representing quantities along with uncertainties in Python: it is called quantities (also available on PyPI).
Related
How are Galois fields represented in SymPy? I couldn't find any documentation for this online, but SymPy contains a module called "galoistools", so I thought I should give it a try. I tried the following experiment:
from sympy import *
x = symbols("x")
A = [LC(Poly(i*x, modulus=8) * Poly(j*x, modulus=8)) for i in range(1, 8) for j in range(1, i+1)]
B = [LC(Poly(i*x, domain=GF(8)) * Poly(j*x, domain=GF(8))) for i in range(1, 8) for j in range(1, i+1)]
However, the resulting lists A and B are identical, so I'm obviously misunderstanding how this is supposed to be used. I'm trying to work in GF(8), i.e. GF(2^3), which is not the same as computing modulo 8.
At present SymPy does not have support for finite fields other than Z/pZ. The existing class GF(n) is misleadingly named; it actually implements Z/nZ as you observed.
However, using the low-level routines in galoistools module one can create a class for general finite fields GF(p^n) and for polynomials over such a field: see this answer where these classes are implemented (for the purpose of computing an interpolating polynomial, but they can be used for other things too). This is just a minimal class; it does not interface with advanced polynomial manipulation methods that are implemented in SymPy.
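As a rough illustration of what the low-level routines look like, here is a sketch of multiplying two GF(2^3) elements represented as coefficient lists over GF(2) (highest degree first), reduced modulo the irreducible polynomial x**3 + x + 1; the exact helper names and conventions should be checked against your SymPy version:
from sympy.polys.domains import ZZ
from sympy.polys.galoistools import gf_mul, gf_rem

p = 2                 # characteristic
irred = [1, 0, 1, 1]  # x**3 + x + 1, coefficients from highest degree down

a = [1, 1, 0]         # x**2 + x
b = [1, 0, 1]         # x**2 + 1

# multiply in GF(2)[x], then reduce modulo the irreducible polynomial
c = gf_rem(gf_mul(a, b, p, ZZ), irred, p, ZZ)
print(c)              # coefficient list of a*b as an element of GF(8)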
I am trying to plot the force on the ith particle as a function of its distance from the jth particle (i.e. xi - xj) in a Lennard-Jones system. The force is given by

F = 48*epsilon*( sigma**12/(xi-xj)**13 - sigma**6/(2*(xi-xj)**7) )

where sigma and epsilon are two parameters, xi is a known quantity and xj is variable. The force is directed from the ith particle to the jth particle.
The code that I have written for this is given below.
from pylab import*
from numpy import*
#~~~ ARGON VALUES ~~~~~~~~(in natural units)~~~~~~~~~~~~~~~~
epsilon=0.0122 # depth of potential well
sigma=0.335 # dist of closest approach
xi=0.00
xj=linspace(0.1,1.0,300)
f = 48.0*epsilon*( ((sigma**12.0)/((xi-xj)**13.0)) - ((sigma**6.0)/2.0/((xi-xj)**7.0)) ) * float(xj-xi)/abs(xi-xj)
plot(xj,f,label='force')
legend()
show()
But it gives me the following error:
f = 48.0*epsilon*( ((sigma**12.0)/((xi-xj)**13.0)) - ((sigma**6.0)/2.0/((xi-xj)**7.0)) ) * float(xj-xi)/abs(xi-xj)
TypeError: only length-1 arrays can be converted to Python scalars
Can someone help me solve this problem? Thanks in advance.
The error is with this part of the expression:
float(xj-xi)
Look at the answer to a related question. It appears to be a conflict between Python's built-in float() and NumPy arrays.
If you take out the float() call, it at least runs. Does it give the correct numbers?
f = 48.0*epsilon*( ((sigma**12.0)/((xi-xj)**13.0)) - ((sigma**6.0)/2.0/((xi-xj)**7.0)) ) * (xj-xi)/abs(xi-xj)
Instead of the term float(xj-xi)/abs(xi-xj) you should use
sign(xj-xi)
If you really want to do the division, since xi and xj are already floats you could just do:
(xj-xi)/abs(xi-xj)
More generally, if you need to convert a numpy array of ints to floats you could use either of:
1.0*(xj-xi)
(xj-xi).astype(float)
Even more generally, it helps with debugging not to use expressions that stretch across the page: with smaller terms you can locate errors more easily. It also often runs faster. For example, here you calculate xi-xj four times, when it really only needs to be done once. And it is easier to read:
x = xi - xj
f = 48*epsilon*(sigma**12/x**13 - sigma**6/2/x**7)
f *= sign(-x)
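Putting the pieces together, a minimal corrected version of the original script (same parameter values) might look like this:
import numpy as np
import matplotlib.pyplot as plt

# ~~~ ARGON VALUES (in natural units) ~~~
epsilon = 0.0122   # depth of potential well
sigma = 0.335      # distance of closest approach

xi = 0.0
xj = np.linspace(0.1, 1.0, 300)

x = xi - xj
f = 48.0*epsilon*(sigma**12/x**13 - sigma**6/2.0/x**7)
f *= np.sign(-x)   # force points from the ith particle towards the jth

plt.plot(xj, f, label='force')
plt.legend()
plt.show()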
The TypeError is due to float(xj-xi). float() cannot convert an iterable to a single scalar value. Instead, iterate over xj and convert each value in xj-xi to float. This can easily be done with
x = [float(j - xi) for j in xj]
I am trying to calculate the trimmed mean, which excludes the outliers, of an array.
I found there is a function called scipy.stats.tmean, but it requires the user to specify the range as absolute values rather than percentages.
In Matlab, we have m = trimmean(X,percent), which does exactly what I want.
Do we have the counterpart in Python?
At least for scipy v0.14.0, there is a dedicated function for this (scipy.stats.trim_mean):
from scipy import stats
m = stats.trim_mean(X, 0.1) # Trim 10% at both ends
which uses stats.trimboth internally.
From the source code you can see that with proportiontocut=0.1 the mean is calculated using 80% of the data. Note that scipy.stats.trim_mean cannot handle np.nan.
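If your data may contain NaNs, a simple workaround is to drop them before trimming (the numbers here are just for illustration):
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, np.nan, 3.0, 100.0])
m = stats.trim_mean(X[~np.isnan(X)], 0.1)   # drop NaNs, then trim 10% at both ends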
(Edit: the context for this answer was that scipy.stats.trim_mean wasn't documented yet. Now that it is publicly available, use that function instead of rolling your own. My answer below is kept for historical purposes.)
You can also implement the whole thing yourself, following the instructions in the MATLAB documentation.
Here's the code in Python 2:
from numpy import mean, sort

def trimmean(arr, percent):
    n = len(arr)
    k = int(round(n * (float(percent) / 100) / 2))
    # sort, then drop the k smallest and k largest values
    return mean(sort(arr)[k:n-k])
Here's a manual implementation using floor from the math library...
import math

def trimMean(tlist, tperc):
    removeN = int(math.floor(len(tlist) * tperc / 2))
    tlist.sort()
    if removeN > 0:
        tlist = tlist[removeN:-removeN]
    return reduce(lambda a, b: a + b, tlist) / float(len(tlist))
I'm trying to code the expression exp( -(1/2) * (x-mu)^T * Sigma^-1 * (x-mu) ) in Python, but I'm having some difficulty.
This is the code I have so far, and I wanted some advice.
x = 1x2 vector
mu = 1x2 vector
Sigma = 2x2 matrix
xT = (x-mu).transpose()
sig = Sigma**(-1)
dotP = dot(xT ,sig )
dotdot = dot(dotP, (x-mu))
E = exp( (-1/2) dotdot )
Am I on the right track? Any suggestions?
Sigma ** (-1) isn't what you want. That would raise each element of Sigma to the -1 power, i.e. 1 / Sigma, whereas in the mathematical expression it means the inverse, which is written in Python as np.linalg.inv(Sigma).
(-1/2) dotdot is a syntax error; in Python you always need to include * for multiplication, or just write - dotdot / 2. Since you're probably using Python 2, division is a little wonky: unless you've done from __future__ import division (highly recommended), 1/2 is actually 0, because it's integer division. You can use .5 to get around that, though, as I said, I highly recommend doing the division import.
This is pretty trivial, but you're doing the x-mu subtraction twice where it's only necessary to do once. Could save a little speed if your vectors are big by doing it only once. (Of course, here you're doing it in two dimensions, so this doesn't matter at all.)
Rather than calling the_array.transpose() (which is fine), it's often nicer to use the_array.T, which is the same thing.
I also wouldn't use the name xT; it implies to me that it's the transpose of x, which is false.
I would probably combine it like this:
# near the top of the file
# you probably did some kind of `from somewhere import *`.
# most people like to only import specific names and/or do imports like this,
# to make it clear where your functions are coming from.
import numpy as np
centered = x - mu
prec = np.linalg.inv(Sigma)
E = np.exp(-.5 * np.dot(centered.T, np.dot(prec, centered)))
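For example, with some arbitrary concrete values for x, mu and Sigma (just to show the shapes involved):
import numpy as np

x = np.array([1.0, 2.0])
mu = np.array([0.5, 1.5])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])

centered = x - mu
prec = np.linalg.inv(Sigma)
E = np.exp(-.5 * np.dot(centered.T, np.dot(prec, centered)))
print(E)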
The context: my Python code passes arrays of 2D vertices to OpenGL.
I tested two approaches, one with ctypes, the other with struct, the latter being more than twice as fast.
from random import random
points = [(random(), random()) for _ in xrange(1000)]
from ctypes import c_float
def array_ctypes(points):
n = len(points)
return n, (c_float*(2*n))(*[u for point in points for u in point])
from struct import pack
def array_struct(points):
n = len(points)
return n, pack("f"*2*n, *[u for point in points for u in point])
Any other alternative?
Any hint on how to accelerate such code (and yes, this is one bottleneck of my code)?
You can pass numpy arrays to PyOpenGL without incurring any overhead. (The data attribute of the numpy array is a buffer that points to the underlying C data structure that contains the same information as the array you're building)
import numpy as np
def array_numpy(points):
n = len(points)
return n, np.array(points, dtype=np.float32)
On my computer, this is about 40% faster than the struct-based approach.
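For instance, with the legacy fixed-function pipeline you could hand the array straight to the vertex pointer; this is just a sketch (the surrounding drawing setup, such as enabling the client state and issuing the draw call, is omitted):
import numpy as np
from OpenGL.GL import glVertexPointer, GL_FLOAT

def upload_points(points):
    vertices = np.array(points, dtype=np.float32)   # shape (n, 2)
    glVertexPointer(2, GL_FLOAT, 0, vertices)       # PyOpenGL accepts the array directly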
You could try Cython. For me, this gives:
function        usec per loop
                Python  Cython
array_ctypes      1370    1220
array_struct       384     249
array_numpy        336     339
So Numpy only gives 15% benefit on my hardware (old laptop running WindowsXP), whereas Cython gives about 35% (without any extra dependency in your distributed code).
If you can loosen your requirement that each point is a tuple of floats, and simply make 'points' a flattened list of floats:
def array_struct_flat(points):
    n = len(points)
    return pack(
        "f"*n,
        *[
            coord
            for coord in points
        ]
    )
points = [random() for _ in xrange(1000 * 2)]
then the resulting output is the same, but the timing goes down further:
function             usec per loop
                     Python  Cython
array_struct_flat               157
Cython might be capable of substantially better than this too, if someone smarter than me wanted to add static type declarations to the code. (Running 'cython -a test.pyx' is invaluable for this, it produces an html file showing you where the slowest (yellow) plain Python is in your code, versus python that has been converted to pure C (white). That's why I spread the code above out onto so many lines, because the coloring is done per-line, so it helps to spread it out like that.)
Full Cython instructions are here:
http://docs.cython.org/src/quickstart/build.html
Cython might produce similar performance benefits across your whole codebase, and in ideal conditions, with proper static typing applied, can improve speed by factors of ten or a hundred.
There's another idea I stumbled across. I don't have time to profile it right now, but in case someone else does:
# untested, but I'm fairly confident it runs
# using a 'flattened points' list, i.e. a list of n*2 floats
points = [random() for _ in xrange(1000 * 2)]
c_array = (c_float * len(points))()
c_array[:] = points
That is, first we create the ctypes array but don't populate it. Then we populate it using the slice notation. People smarter than I tell me that assigning to a slice like this may help performance. It allows us to pass a list or iterable directly on the RHS of the assignment, without having to use the *iterable syntax, which would perform some intermediate wrangling of the iterable. I suspect that this is what happens in the depths of creating pyglet's Batches.
Presumably you could just create c_array once, then just reassign to it (the final line in the above code) every time the points list changes.
There is probably an alternative formulation which accepts the original definition of points (a list of (x, y) tuples). Something like this:
# very untested, likely contains errors
# using a list of n tuples of two floats
from itertools import chain
points = [(random(), random()) for _ in xrange(1000)]
c_array = (c_float * (2 * len(points)))()
c_array[:] = list(chain.from_iterable(points))
If performance is an issue, you do not want to use ctypes arrays with the star operation (e.g., (ctypes.c_float * size)(*t)).
In my test, pack is fastest, followed by the use of the array module with a cast of the address (or using the from_buffer function).
import timeit
repeat = 100
setup="from struct import pack; from random import random; import numpy; from array import array; import ctypes; t = [random() for _ in range(2* 1000)];"
print(timeit.timeit(stmt="v = array('f',t); addr, count = v.buffer_info();x = ctypes.cast(addr,ctypes.POINTER(ctypes.c_float))",setup=setup,number=repeat))
print(timeit.timeit(stmt="v = array('f',t);a = (ctypes.c_float * len(v)).from_buffer(v)",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(*t)',setup=setup,number=repeat))
print(timeit.timeit(stmt="x = pack('f'*len(t), *t);",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(); x[:] = t',setup=setup,number=repeat))
print(timeit.timeit(stmt='x = numpy.array(t,numpy.float32).data',setup=setup,number=repeat))
The array.array approach is slightly faster than Jonathan Hartley's approach in my test while the numpy approach has about half the speed:
python3 convert.py
0.004665990360081196
0.004661010578274727
0.026358536444604397
0.0028003649786114693
0.005843495950102806
0.009067213162779808
The net winner is pack.
You can use the array module (notice also the generator expression instead of the list comprehension):
from array import array

array("f", (u for point in points for u in point)).tostring()
Another optimization would be to keep the points flattened from the beginning.
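A quick sketch of that idea, keeping the data flat from the start:
from array import array
from random import random

# store the points flat from the start: [x0, y0, x1, y1, ...]
points = [random() for _ in xrange(1000 * 2)]
buf = array("f", points).tostring()   # raw float32 bytes, ready to hand to OpenGL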