limit how much data is read with numpy.genfromtxt for matplotlib - python

I am creating a graph in python using a text file for the source data and matplotlib to plot the graph.
The simple logic below works well.
But is there a way to have numpy.genfromtxt read only the first 50 lines of the file 'temperature_logging'? Currently it reads the entire file.
temp = numpy.genfromtxt('temperature_logging',dtype=None,usecols=(0))
time = numpy.genfromtxt('temperature_logging',dtype=None,usecols=(1))
dates = matplotlib.dates.datestr2num(time)
pylab.plot_date(dates,temp,xdate=True,fmt='b-')
pylab.savefig('gp.png')
Contents of temperature_logging:
21.75 12-01-2012-15:53:35
21.75 12-01-2012-15:54:35
21.75 12-01-2012-15:55:35
.
.
.

numpy.genfromtxt accepts iterators as well as files. That means it will accept the output of itertools.islice. Here, test.txt is a five-line file:
>>> import itertools, numpy
>>> with open('test.txt') as t_in:
...     numpy.genfromtxt(itertools.islice(t_in, 3))
...
array([[ 1., 2., 3., 4., 5.],
[ 6., 7., 8., 9., 10.],
[ 11., 12., 13., 14., 15.]])
One might think this would be slower than letting numpy handle the file IO, but a quick test suggests otherwise. genfromtxt provides a skip_footer keyword argument that you can use if you know how long the file is...
>>> numpy.genfromtxt('test.txt', skip_footer=2)
array([[ 1., 2., 3., 4., 5.],
[ 6., 7., 8., 9., 10.],
[ 11., 12., 13., 14., 15.]])
...but a few informal tests on a 1000-line file suggest that using islice is faster even if you skip only a few lines:
>>> def get(nlines, islice=itertools.islice):
...     with open('test.txt') as t_in:
...         numpy.genfromtxt(islice(t_in, nlines))
...
>>> %timeit get(3)
1000 loops, best of 3: 338 us per loop
>>> %timeit numpy.genfromtxt('test.txt', skip_footer=997)
100 loops, best of 3: 4.92 ms per loop
>>> %timeit get(300)
100 loops, best of 3: 5.04 ms per loop
>>> %timeit numpy.genfromtxt('test.txt', skip_footer=700)
100 loops, best of 3: 8.48 ms per loop
>>> %timeit get(999)
100 loops, best of 3: 16.2 ms per loop
>>> %timeit numpy.genfromtxt('test.txt', skip_footer=1)
100 loops, best of 3: 16.7 ms per loop
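Applied to the file from the question, the islice approach might look like the sketch below. The helper name read_head is hypothetical, and the code assumes the two whitespace-separated columns shown in the question (temperature first, timestamp second).

```python
import itertools

import numpy as np


def read_head(path, nlines=50):
    """Parse only the first `nlines` rows of a two-column log file.

    `read_head` is a hypothetical helper, not part of numpy.
    """
    with open(path) as fh:
        # islice stops after nlines, so later lines are never parsed.
        head = list(itertools.islice(fh, nlines))
    # genfromtxt also accepts a list of strings, so both columns can be
    # parsed from the same in-memory lines.
    temp = np.genfromtxt(head, usecols=0)
    time = np.genfromtxt(head, usecols=1, dtype=str)
    return temp, time
```

The two returned arrays can then replace temp and time in the question's plotting code.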

I don't know numpy well, but one possible solution would be to use the StringIO class (io.StringIO in Python 3).
That allows you to just load the data you actually need into a string with normal file IO (there's also a byte version), create a file-like object from the string and pass that to numpy.
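A minimal sketch of that idea, assuming Python 3's io.StringIO; the helper name genfromtxt_head is hypothetical, and any extra keyword arguments are simply forwarded to numpy.genfromtxt.

```python
import io
import itertools

import numpy as np


def genfromtxt_head(path, nlines, **kwargs):
    """Load the first `nlines` lines into a string and parse only those.

    `genfromtxt_head` is a hypothetical helper name.
    """
    with open(path) as fh:
        text = "".join(itertools.islice(fh, nlines))
    # io.StringIO wraps the string in a file-like object that numpy can read.
    return np.genfromtxt(io.StringIO(text), **kwargs)
```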

Related

Efficient dense counterpart to scipy.sparse.diags

scipy.sparse.diags allows me to enter multiple diagonal vectors, together with their location, to build a matrix such as
from scipy.sparse import diags
vec = np.ones((5,))
vec2 = vec + 1
diags([vec, vec2], [-2, 2])
I'm looking for an efficient way to do the same but build a dense matrix, instead of DIA. np.diag only supports a single diagonal. What's an efficient way to build a dense matrix from multiple diagonal vectors?
Expected output: the same as np.array(diags([vec, vec2], [-2, 2]).todense())
One way would be to index into the flattened output array using a step of N+1:
import numpy as np
from scipy.sparse import diags
from timeit import timeit
def diags_pp(vecs, offs, dtype=float, N=None):
    # Infer the matrix size from the first diagonal if it is not given.
    if N is None:
        N = len(vecs[0]) + abs(offs[0])
    out = np.zeros((N, N), dtype)
    outf = out.reshape(-1)  # flat view of out
    for vec, off in zip(vecs, offs):
        if off<0:
            outf[-N*off::N+1] = vec
        else:
            outf[off:N*(N-off):N+1] = vec
    return out
def diags_sp(vecs, offs):
    # .toarray() is equivalent to the older .A shorthand
    return diags(vecs, offs).toarray()
for N, k in [(10, 2), (100, 20), (1000, 200)]:
    print(N)
    O = np.arange(-k,k)
    D = [np.arange(1, N+1-abs(o)) for o in O]
    for n, f in list(globals().items()):
        if n.startswith('diags_'):
            print(n.replace('diags_', ''), timeit(lambda: f(D, O), number=10000//N)*N)
            if n != 'diags_sp':
                assert np.all(f(D, O) == diags_sp(D, O))
Sample run:
10
pp 0.06757194991223514
sp 1.9529316504485905
100
pp 0.45834919437766075
sp 4.684177896706387
1000
pp 23.397524026222527
sp 170.66762899048626
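To see why a step of N+1 walks along a diagonal, here is a small sketch on a 4x4 array. The values are made up for illustration; the slices follow the same pattern as in diags_pp.

```python
import numpy as np

N = 4
out = np.zeros((N, N), dtype=int)
flat = out.reshape(-1)  # a view: writes to flat show up in out

# Superdiagonal (offset +1): starts at flat index 1, stops before row N-1.
flat[1:N * (N - 1):N + 1] = [1, 2, 3]
# Subdiagonal (offset -2): starts at row 2, i.e. flat index 2*N.
flat[2 * N::N + 1] = [7, 8]

# The same matrix built one diagonal at a time with np.diag.
expected = np.diag([1, 2, 3], 1) + np.diag([7, 8], -2)
assert np.array_equal(out, expected)
```

Moving one row down and one column right advances the flat index by N+1, which is why a single strided slice covers a whole diagonal.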
With Paul Panzer's (10,2) case
In [107]: O
Out[107]: array([-2, -1, 0, 1])
In [108]: D
Out[108]:
[array([1, 2, 3, 4, 5, 6, 7, 8]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9])]
The diagonals have different lengths.
sparse.diags converts this to a sparse.dia_matrix:
In [109]: M = sparse.diags(D,O)
In [110]: M
Out[110]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 36 stored elements (4 diagonals) in DIAgonal format>
In [111]: M.data
Out[111]:
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 0., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.],
[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])
Here the ragged list of diagonals has been converted to a padded 2d array. This can be a convenient way of specifying the diagonals, but it isn't particularly efficient. It has to be converted to csr format for most calculations:
In [112]: timeit sparse.diags(D,O)
99.8 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [113]: timeit sparse.diags(D,O, format='csr')
371 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using np.diag, I can construct the same array by iterating over the diagonals:
np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
In [117]: timeit np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
39.3 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and with Paul's function:
In [120]: timeit diags_pp(D,O)
12.3 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The key step in np.diag is a flat assignment:
res[:n-k].flat[i::n+1] = v
This is essentially the same as Paul's outf assignments. So the functionality is basically the same, assigning each diagonal via a slice. Paul streamlines it.
Creating the M.data array (Out[111]) also requires copying the D arrays to a 2d array - but with different slices.

Compute the pairwise squared difference of multiple vectors

I'm looking for the fastest way to calculate the squared difference between two vectors ((x1-x2)**2), but pairwise (all combinations or only the upper triangle).
x1 = [1,3,5,6,8]
x2 = [3,6,7,9,12]
Expected output:
array([[ 4., 25., 36., 64., 121.],
[ 0., 9., 16., 36., 81.],
[ 4., 1., 4., 16., 49.],
[ 9., 0., 1., 9., 36.],
[ 25., 4., 1., 1., 16.]])
or
array([[ 4., 25., 36., 64., 121.],
[ 0., 9., 16., 36., 81.],
[ 0., 0., 4., 16., 49.],
[ 0., 0., 0., 9., 36.],
[ 0., 0., 0., 0., 16.]])
or even (if faster):
array([ 4., 25., 36., 64., 121., 9., 16., 36., 81.,
4., 1., 4., 16., 49., 9., 1., 9., 36.,
25., 4., 1., 1., 16.])
Here's one that uses broadcasting and a mask to select the upper-triangular elements, then squares only those for better performance -
def pairwise_squared_diff(x1, x2):
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    diffs = x1[:,None] - x2
    mask = np.arange(len(x1))[:,None] <= np.arange(len(x2))
    return (diffs[mask])**2
Sample run -
In [85]: x1
Out[85]: array([1, 3, 5, 6, 8])
In [86]: x2
Out[86]: array([ 3, 6, 7, 9, 12])
In [87]: pairwise_squared_diff(x1, x2)
Out[87]:
array([ 4, 25, 36, 64, 121, 9, 16, 36, 81, 4, 16, 49, 9,
36, 16])
Possible improvements
Improvement #1 :
We could also use np.tri to generate mask -
mask = ~np.tri(len(x1),len(x2),dtype=bool,k=-1)
Improvement #2 :
If we are okay with a 2D output with the lower triangular ones set as 0s, then a simple elementwise multiplication with mask solves it too to get the final output -
(diffs*mask)**2
This works well with the numexpr module on large data, where it saves memory and therefore time.
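For reference, the 2D masked variant can be sketched in plain NumPy (no numexpr), using the sample inputs from the question. The variable name upper_2d is only illustrative.

```python
import numpy as np

x1 = np.asarray([1, 3, 5, 6, 8])
x2 = np.asarray([3, 6, 7, 9, 12])

diffs = x1[:, None] - x2
# True on and above the main diagonal, False strictly below it.
mask = ~np.tri(len(x1), len(x2), dtype=bool, k=-1)

# Multiplying by the boolean mask zeroes the lower triangle before squaring.
upper_2d = (diffs * mask) ** 2
```

This reproduces the second expected output from the question, with zeros below the diagonal.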
Improvement #3 :
We could also compute the differences with numexpr and hence compute the masked output too with the same evaluate method, to give ourselves a new solution altogether -
def pairwise_squared_diff_numexpr(x1, x2):
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    mask = ~np.tri(len(x1),len(x2),dtype=bool,k=-1)
    return ne.evaluate('mask*((x1D-x2)**2)',{'x1D':x1[:,None]})
Timings with improvements
Let's study these suggestions on performance for large arrays -
Setup :
In [136]: x1 = np.random.randint(0,9,(1000))
In [137]: x2 = np.random.randint(0,9,(1000))
With Improvement #1 :
In [138]: %timeit np.arange(len(x1))[:,None] <= np.arange(len(x2))
1000 loops, best of 3: 772 µs per loop
In [139]: %timeit ~np.tri(len(x1),len(x2),dtype=bool,k=-1)
1000 loops, best of 3: 243 µs per loop
With Improvement #2 :
In [140]: import numexpr as ne
In [141]: diffs = x1[:,None] - x2
...: mask = np.arange(len(x1))[:,None] <= np.arange(len(x2))
In [142]: %timeit (diffs[mask])**2
1000 loops, best of 3: 1.46 ms per loop
In [143]: %timeit ne.evaluate('(diffs*mask)**2')
1000 loops, best of 3: 1.05 ms per loop
With Improvement #3 on complete solutions :
In [170]: %timeit pairwise_squared_diff(x1, x2)
100 loops, best of 3: 3.66 ms per loop
In [171]: %timeit pairwise_squared_diff_numexpr(x1, x2)
1000 loops, best of 3: 1.54 ms per loop
Loopy one
For completeness, here's a loopy version that uses slicing and outperforms the pure broadcasting one, owing to its memory efficiency -
def pairwise_squared_diff_loopy(x1,x2):
    n = len(x2)
    idx = np.concatenate(( [0], np.arange(n,0,-1).cumsum() ))
    start, stop = idx[:-1], idx[1:]
    L = n*(n+1)//2
    out = np.empty(L,dtype=np.result_type(x1,x2))
    for i,(s0,s1) in enumerate(zip(start,stop)):
        out[s0:s1] = x1[i] - x2[i:]
    return out**2
Timings -
In [300]: x1 = np.random.randint(0,9,(1000))
...: x2 = np.random.randint(0,9,(1000))
In [301]: %timeit pairwise_squared_diff(x1, x2)
100 loops, best of 3: 3.44 ms per loop
In [302]: %timeit pairwise_squared_diff_loopy(x1, x2)
100 loops, best of 3: 2.73 ms per loop
You can use broadcasting:
x1 = np.asarray([1,3,5,6,8]).reshape(-1, 1)
x2 = np.asarray([3,6,7,9,12]).reshape(1, -1)
(x1 - x2)**2
Output:
array([[ 4, 25, 36, 64, 121],
[ 0, 9, 16, 36, 81],
[ 4, 1, 4, 16, 49],
[ 9, 0, 1, 9, 36],
[ 25, 4, 1, 1, 16]])
which is simple to code, but computes all values, so it may be optimized to compute only the upper triangle.
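One way to compute only the upper triangle, sketched here under the assumption that both vectors have the same length, is to index with np.triu_indices instead of building the full matrix. The name upper_flat is only illustrative.

```python
import numpy as np

x1 = np.asarray([1, 3, 5, 6, 8])
x2 = np.asarray([3, 6, 7, 9, 12])

# Row and column indices of the upper triangle (diagonal included),
# in row-major order.
i, j = np.triu_indices(len(x1))

# Only the n*(n+1)/2 needed differences are computed and squared.
upper_flat = (x1[i] - x2[j]) ** 2
```

The result is a flat array of 15 values in the same order as the masked approach shown earlier on this page.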

sum uneven segments of an array in numpy

Given an ndarray x and a one dimensional array containing the length of contiguous slices of a dimension of x, I want to compute a new array that contains the sum of all of the slices. For example, in two dimensions summing over dimension one:
>>> lens = np.array([1, 3, 2])
array([1, 3, 2])
>>> x = np.arange(4 * lens.sum()).reshape((4, lens.sum())).astype(float)
array([[ 0., 1., 2., 3., 4., 5.],
[ 6., 7., 8., 9., 10., 11.],
[ 12., 13., 14., 15., 16., 17.],
[ 18., 19., 20., 21., 22., 23.]])
# I want to compute:
>>> result
array([[ 0., 6., 9.],
[ 6., 24., 21.],
[ 12., 42., 33.],
[ 18., 60., 45.]])
# 0 = 0
# 6 = 1 + 2 + 3
# ...
# 45 = 22 + 23
The two ways that come to mind are:
a) Use cumsum and fancy indexing:
def cumsum_method(x, lens):
    xc = x.cumsum(1)
    lc = lens.cumsum() - 1
    res = xc[:, lc]
    res[:, 1:] -= xc[:, lc[:-1]]
    return res
b) Use bincount and intelligently generate the appropriate bins:
def bincount_method(x, lens):
    bins = np.arange(lens.size).repeat(lens) + \
        np.arange(x.shape[0])[:, None] * lens.size
    return np.bincount(bins.flat, weights=x.flat).reshape((-1, lens.size))
Timing these two on large input had the cumsum method performing slightly better:
>>> lens = np.random.randint(1, 100, 100)
>>> x = np.random.random((100000, lens.sum()))
>>> %timeit cumsum_method(x, lens)
1 loops, best of 3: 3 s per loop
>>> %timeit bincount_method(x, lens)
1 loops, best of 3: 3.9 s per loop
Is there an obviously more efficient way that I'm missing? It seems like a native c call would be faster because it wouldn't require allocating the cumsum or the bins array. A numpy builtin function that does something close to this could likely be better than (a) or (b). I couldn't find anything through searching and looking through the documentation.
Note, this is similar to this question, but the summation intervals aren't regular.
You can use np.add.reduceat:
>>> np.add.reduceat(x, [0, 1, 4], axis=1)
array([[ 0., 6., 9.],
[ 6., 24., 21.],
[ 12., 42., 33.],
[ 18., 60., 45.]])
The list of indices [0, 1, 4] means: "sum the slices 0:1, 1:4 and 4:". You could generate these values from lens using np.hstack(([0], lens[:-1])).cumsum().
Even factoring in the calculation of the indices from lens, a reduceat method is likely to be significantly faster than alternative methods:
def reduceat_method(x, lens):
    i = np.hstack(([0], lens[:-1])).cumsum()
    return np.add.reduceat(x, i, axis=1)
lens = np.random.randint(1, 100, 100)
x = np.random.random((1000, lens.sum()))
%timeit reduceat_method(x, lens)
# 100 loops, best of 3: 4.89 ms per loop
%timeit cumsum_method(x, lens)
# 10 loops, best of 3: 35.8 ms per loop
%timeit bincount_method(x, lens)
# 10 loops, best of 3: 43.6 ms per loop
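As a quick check, here is the reduceat approach run on the small example from the question. This sketch assumes every entry of lens is positive: np.add.reduceat does not treat a repeated index as an empty slice, so zero-length segments would need separate handling.

```python
import numpy as np

lens = np.array([1, 3, 2])
x = np.arange(4 * lens.sum()).reshape((4, lens.sum())).astype(float)

# Start index of each segment: [0, 1, 4]
starts = np.hstack(([0], lens[:-1])).cumsum()

# Sum each segment along axis 1.
result = np.add.reduceat(x, starts, axis=1)
```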

Numpy Array Broadcasting with different dimensions

I am a little confused by the broadcasting rules of numpy. Suppose you want to perform an axis-wise scalar product of a higher dimension array to reduce the array dimension by one (basically to perform a weighted summation along one axis):
from numpy import *
A = ones((3,3,2))
v = array([1,2])
B = zeros((3,3))
# V01: this works
B[0,0] = v.dot(A[0,0])
# V02: this works
B[:,:] = v[0]*A[:,:,0] + v[1]*A[:,:,1]
# V03: this doesn't
B[:,:] = v.dot(A[:,:])
Why does V03 not work?
Cheers
np.dot(a, b) sums over the last axis of a and the second-to-last axis of b. So for the case in your question, you could always go with:
>>> a.dot(v)
array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]])
If you want to keep the v.dot(a) order, you need to get the axis into position, which can easily be achieved with np.rollaxis:
>>> v.dot(np.rollaxis(a, 2, 1))
array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]])
I don't like np.dot too much, unless it is for the obvious matrix or vector multiplication, because it is very strict about the output dtype when using the optional out parameter. Joe Kington has mentioned it already, but if you are going to be doing this type of thing, get used to np.einsum: once you get the hang of the syntax, it will cut down the amount of time you spend worrying about reshaping things to a minimum:
>>> a = np.ones((3, 3, 2))
>>> np.einsum('i, jki', v, a)
array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]])
Not that it is too relevant in this case, but it is also ridiculously fast:
In [4]: %timeit a.dot(v)
100000 loops, best of 3: 2.43 us per loop
In [5]: %timeit v.dot(np.rollaxis(a, 2, 1))
100000 loops, best of 3: 4.49 us per loop
In [7]: %timeit np.tensordot(v, a, axes=(0, 2))
100000 loops, best of 3: 14.9 us per loop
In [8]: %timeit np.einsum('i, jki', v, a)
100000 loops, best of 3: 2.91 us per loop
You can also use tensordot, in this particular case.
import numpy as np
A = np.ones((3,3,2))
v = np.array([1,2])
print(np.tensordot(v, A, axes=(0, 2)))
This yields:
array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]])
The axes=(0,2) indicates that tensordot should sum over the first axis in v and the third axis in A. (Also have a look at einsum, which is more flexible, but harder to understand if you're not used to the notation.)
If speed is a consideration, tensordot is considerably faster than using apply_along_axis for small arrays.
In [14]: A = np.ones((3,3,2))
In [15]: v = np.array([1,2])
In [16]: %timeit np.tensordot(v, A, axes=(0, 2))
10000 loops, best of 3: 21.6 us per loop
In [17]: %timeit np.apply_along_axis(v.dot, 2, A)
1000 loops, best of 3: 258 us per loop
(The difference is less apparent for large arrays due to a constant overhead, though tensordot is consistently faster.)
You could use numpy.apply_along_axis() for this:
In [35]: np.apply_along_axis(v.dot, 2, A)
Out[35]:
array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]])
I think the reason V03 doesn't work is that it is no different from:
B[:,:] = v.dot(A)
i.e. it tries to contract v with the second-to-last axis of A, which has length 3 rather than 2, so the shapes do not align.
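The shape mismatch, and three working alternatives mentioned on this page, can be checked with a short sketch. The variable names are illustrative; the einsum subscripts are written with an explicit output ('k,ijk->ij') for readability.

```python
import numpy as np

A = np.ones((3, 3, 2))
v = np.array([1, 2])

try:
    # dot pairs v (length 2) with A's second-to-last axis (length 3).
    v.dot(A)
    v03_failed = False
except ValueError:
    v03_failed = True

# Each of these sums over the last axis of A, weighted by v.
B_dot = A.dot(v)
B_tensordot = np.tensordot(v, A, axes=(0, 2))
B_einsum = np.einsum('k,ijk->ij', v, A)
```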

How to create identity matrix with numpy

How do I create an identity matrix with numpy?
Is there a simpler syntax than
numpy.matrix(numpy.identity(n))
Here's a simpler syntax:
np.matlib.identity(n)
And here's an even simpler syntax that runs much faster:
In [1]: n = 1000
In [2]: timeit np.matlib.identity(n)
100 loops, best of 3: 8.78 ms per loop
In [3]: timeit np.matlib.eye(n)
1000 loops, best of 3: 695 us per loop
Also, np.eye can be used to create an identity array (I_n).
For example,
>>> np.eye(2, dtype=int)
array([[1, 0],
[0, 1]])
>>> np.eye(3, k=1)
array([[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 0., 0.]])
I don't think there is a simpler solution. You can do it slightly more efficiently, though:
numpy.matrix(numpy.identity(n), copy=False)
This avoids unnecessarily copying the data.
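If a plain ndarray is acceptable (the NumPy documentation no longer recommends the np.matrix class), np.eye alone is enough. When a matrix object is really needed, np.asmatrix wraps an existing array without copying. A minimal sketch, with illustrative variable names:

```python
import numpy as np

n = 3

# Identity as a regular 2-D array.
I_arr = np.eye(n)

# Matrix view of the same data; asmatrix does not copy an ndarray input.
I_mat = np.asmatrix(I_arr)
```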
