How can I read a Numpy array from a string? Take a string like:
"[[ 0.5544 0.4456], [ 0.8811 0.1189]]"
and convert it to an array:
a = from_string("[[ 0.5544 0.4456], [ 0.8811 0.1189]]")
where a becomes the object: np.array([[0.5544, 0.4456], [0.8811, 0.1189]]).
I'm looking for a very simple interface: a way to convert 2D arrays (of floats) to a string, and a way to read them back to reconstruct the array:
arr_to_string(array([[0.5544, 0.4456], [0.8811, 0.1189]])) should return "[[ 0.5544 0.4456], [ 0.8811 0.1189]]".
string_to_arr("[[ 0.5544 0.4456], [ 0.8811 0.1189]]") should return the object array([[0.5544, 0.4456], [0.8811, 0.1189]]).
Ideally arr_to_string would have a precision parameter that controlled the precision of floating points converted to strings, so that you wouldn't get entries like 0.4444444999999999999999999.
There's nothing I can find in the NumPy docs that does this both ways. np.save lets you make a string but then there's no way to load it back in (np.load only works for files).
The challenge is to save not only the data buffer, but also the shape and dtype. np.fromstring reads the data buffer, but only as a 1d array; you have to get the dtype and shape from elsewhere.
In [184]: a=np.arange(12).reshape(3,4)
In [185]: np.fromstring(a.tostring(),int)
Out[185]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [186]: np.fromstring(a.tostring(),a.dtype).reshape(a.shape)
Out[186]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
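(Note: later NumPy versions deprecate tostring/fromstring for this kind of binary round trip; a minimal sketch of the tobytes/frombuffer equivalent:)
import numpy as np
a = np.arange(12).reshape(3, 4)
# frombuffer gives a read-only view of the bytes; reshape restores the 2d shape
b = np.frombuffer(a.tobytes(), dtype=a.dtype).reshape(a.shape)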
A time-honored mechanism for saving Python objects is pickle, and numpy arrays are pickle-compliant:
In [169]: import pickle
In [170]: a=np.arange(12).reshape(3,4)
In [171]: s=pickle.dumps(a*2)
In [172]: s
Out[172]: "cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I3\nI4\ntp6\ncnumpy\ndtype\np7\n(S'i4'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x04\\x00\\x00\\x00\\x06\\x00\\x00\\x00\\x08\\x00\\x00\\x00\\n\\x00\\x00\\x00\\x0c\\x00\\x00\\x00\\x0e\\x00\\x00\\x00\\x10\\x00\\x00\\x00\\x12\\x00\\x00\\x00\\x14\\x00\\x00\\x00\\x16\\x00\\x00\\x00'\np13\ntp14\nb."
In [173]: pickle.loads(s)
Out[173]:
array([[ 0, 2, 4, 6],
[ 8, 10, 12, 14],
[16, 18, 20, 22]])
There's a numpy function that can read the pickle string:
In [181]: np.loads(s)
Out[181]:
array([[ 0, 2, 4, 6],
[ 8, 10, 12, 14],
[16, 18, 20, 22]])
You mentioned np.save to a string, but that you can't use np.load on it. A way around that is to step further into the code and use np.lib.npyio.format.
In [174]: import StringIO
In [175]: S=StringIO.StringIO() # a file like string buffer
In [176]: np.lib.npyio.format.write_array(S,a*3.3)
In [177]: S.seek(0) # rewind the string
In [178]: np.lib.npyio.format.read_array(S)
Out[178]:
array([[ 0. , 3.3, 6.6, 9.9],
[ 13.2, 16.5, 19.8, 23.1],
[ 26.4, 29.7, 33. , 36.3]])
The saved string has a header with dtype and shape info:
In [179]: S.seek(0)
In [180]: S.readlines()
Out[180]:
["\x93NUMPY\x01\x00F\x00{'descr': '<f8', 'fortran_order': False, 'shape': (3, 4), } \n",
'\x00\x00\x00\x00\x00\x00\x00\x00ffffff\n',
'#ffffff\x1a#\xcc\xcc\xcc\xcc\xcc\xcc##ffffff*#\x00\x00\x00\x00\x00\x800#\xcc\xcc\xcc\xcc\xcc\xcc3#\x99\x99\x99\x99\x99\x197#ffffff:#33333\xb3=#\x00\x00\x00\x00\x00\x80##fffff&B#']
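Under Python 3 the equivalent uses io.BytesIO, since np.save writes binary; np.load can then read it straight back. A minimal sketch:
import io
import numpy as np
a = np.arange(12).reshape(3, 4)
buf = io.BytesIO()   # binary file-like buffer
np.save(buf, a)      # writes the same header + data buffer
buf.seek(0)          # rewind
b = np.load(buf)     # recovers data, shape and dtype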
If you want a human readable string, you might try json.
In [196]: import json
In [197]: js=json.dumps(a.tolist())
In [198]: js
Out[198]: '[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]'
In [199]: np.array(json.loads(js))
Out[199]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Going to/from the list representation of the array is the most obvious use of json. Someone may have written a more elaborate json representation of arrays.
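For instance, a minimal sketch of such a representation that also records dtype and shape (arr_to_json/json_to_arr are just illustrative names, not an existing API):
import json
import numpy as np

def arr_to_json(a):
    # store dtype and shape alongside the nested-list data
    return json.dumps({'dtype': a.dtype.str, 'shape': a.shape, 'data': a.tolist()})

def json_to_arr(s):
    d = json.loads(s)
    return np.array(d['data'], dtype=d['dtype']).reshape(d['shape'])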
You could also go the csv format route - there have been lots of questions about reading/writing csv arrays.
'[[ 0.5544 0.4456], [ 0.8811 0.1189]]'
is a poor string representation for this purpose. It looks a lot like the str() of an array, but with , instead of \n. There isn't a clean way of parsing the nested [], and the missing delimiter between numbers is a pain. If the string consistently used , as a delimiter, json could convert it to a list.
np.matrix accepts a MATLAB like string:
In [207]: np.matrix(' 0.5544, 0.4456;0.8811, 0.1189')
Out[207]:
matrix([[ 0.5544, 0.4456],
[ 0.8811, 0.1189]])
In [208]: str(np.matrix(' 0.5544, 0.4456;0.8811, 0.1189'))
Out[208]: '[[ 0.5544 0.4456]\n [ 0.8811 0.1189]]'
Forward to string:
import numpy as np
def array2str(arr, precision=None):
    s = np.array_str(arr, precision=precision)
    return s.replace('\n', ',')
Backward to array:
import re
import ast
import numpy as np
def str2array(s):
    # Remove space after [
    s = re.sub(r'\[ +', '[', s.strip())
    # Replace commas and runs of whitespace with a single comma-space
    s = re.sub(r'[,\s]+', ', ', s)
    return np.array(ast.literal_eval(s))
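A quick round-trip check of the two helpers (the commented values are what I'd expect; exact spacing from np.array_str can vary between NumPy versions):
a = np.array([[0.5544, 0.4456], [0.8811, 0.1189]])
s = array2str(a, precision=4)   # e.g. '[[ 0.5544  0.4456], [ 0.8811  0.1189]]'
b = str2array(s)                # array([[ 0.5544,  0.4456], [ 0.8811,  0.1189]])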
If you use repr() to convert the array to a string instead, the numbers are already comma-separated, so the conversion back is trivial.
I'm not sure there's an easy way to do this if you don't have commas between the numbers in your inner lists, but if you do, then you can use ast.literal_eval:
import ast
import numpy as np
s = '[[ 0.5544, 0.4456], [ 0.8811, 0.1189]]'
np.array(ast.literal_eval(s))
array([[ 0.5544, 0.4456],
[ 0.8811, 0.1189]])
EDIT: I haven't tested it very much, but you could use re to insert commas where you need them:
import re
s1 = '[[ 0.5544 0.4456], [ 0.8811 -0.1189]]'
# Replace spaces between numbers with commas:
s2 = re.sub(r'(\d) +(-|\d)', r'\1,\2', s1)
s2
'[[ 0.5544,0.4456], [ 0.8811,-0.1189]]'
and then hand it on to ast.literal_eval:
np.array(ast.literal_eval(s2))
array([[ 0.5544, 0.4456],
[ 0.8811, -0.1189]])
(you need to be careful to match spaces between digits, but also spaces between a digit and a minus sign).
In my case I found the following command helpful for dumping:
string = str(array.tolist())
And for reloading:
array = np.array( eval(string) )
This should work for any dimensionality of numpy array.
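If the string may come from an untrusted source, ast.literal_eval is a safer drop-in for eval here, since str(array.tolist()) is a plain Python literal:
import ast
import numpy as np
array = np.array(ast.literal_eval(string))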
numpy.fromstring() allows you to easily create 1D arrays from a string. Here's a simple function to create a 2D numpy array from a string:
import numpy as np
def str2np(strArray):
    lItems = []
    width = None
    for line in strArray.split("\n"):
        lParts = line.split()
        n = len(lParts)
        if n == 0:
            continue
        if width is None:
            width = n
        else:
            assert n == width, "invalid array spec"
        lItems.append([float(p) for p in lParts])  # avoid shadowing built-in str
    return np.array(lItems)
Usage:
X = str2np("""
-2 2
-1 3
0 1
1 1
2 -1
""")
print(f"X = {X}")
Output:
X = [[-2. 2.]
[-1. 3.]
[ 0. 1.]
[ 1. 1.]
[ 2. -1.]]
I need the values from a CSV to have a comma after each individual value, as well as at the end of each row/array.
I have used tolist() before making these changes; converting the numerical values to strings is not wanted.
The code below is what I currently have.
import numpy as np
dataset = open("Dataset.csv")
next(dataset) # Skips first line of dataset
games = np.loadtxt(dataset, delimiter=",")
dataset.close()
print(games)
This is what the code outputs:
[[ 0.228 0.5 0.685 0.378 0.439 0.183 0.387 0.25 0.169]
[ 0.206 0.125 0.686 0.069 0.131 0.778 2.71 0.75 -0.092]]
I am looking for the code to output this:
[[0.228,0.5 ,0.685,0.378,0.439,0.183,0.387,0.25 ,0.169],
[0.206,0.125 ,0.686 ,0.069 ,0.131,0.778 ,2.71 ,0.75 ,-0.092]]
You can set any formatter you desire via np.set_printoptions (this does not change your original array's type; it only changes the printing format, which I think is what you are looking for). If the default isn't quite right, you can define your own format like this:
# be mindful: this puts a comma after each float number, including the last number in sub-arrays
float_formatter = "{:},".format
np.set_printoptions(formatter={'float_kind':float_formatter})
print(games)
output:
[[0.228, 0.5, 0.685, 0.378, 0.439, 0.183, 0.387, 0.25, 0.169,]
[0.206, 0.125, 0.686, 0.069, 0.131, 0.778, 2.71, 0.75, -0.092,]]
and your datatype is still float:
print(games.dtype)
float64
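If you only need the comma-separated text for display, np.array2string with a separator avoids changing the global print options; a minimal sketch:
import numpy as np
print(np.array2string(games, separator=','))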
A better option, mentioned by @David Buck in the comments, is to use repr:
print(repr(games))
output:
array([[ 0.228, 0.5 , 0.685, 0.378, 0.439, 0.183, 0.387, 0.25 ,
0.169],
[ 0.206, 0.125, 0.686, 0.069, 0.131, 0.778, 2.71 , 0.75 ,
-0.092 ]])
Make sure you understand what Python object you have, and what the commas, or the lack of them, mean.
With loadtxt you created a numpy array. A simpler way of doing the same:
In [212]: arr = np.arange(12).reshape(2,6)
The repr display for an array is:
In [213]: arr
Out[213]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
The str display omits the commas. That's intentional, helping to distinguish an array from a list:
In [214]: print(arr)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]]
In [215]: type(arr)
Out[215]: numpy.ndarray
The print display of a list has commas:
In [216]: print(arr.tolist())
[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
The distinction between a list (or list of lists) and an array is important. Whether the display uses commas or not is superficial.
In Python, when I say np.array([1,2,3]), the result is
array([1, 2, 3])
but when I say np.array([11,22,3]) the result is
array([11, 22, 3])
which has two spaces before '3', unlike '22', which has only one space before it. Later I am using the map function to read this result from a CSV file with Pandas
appended_data.append({'array': numpyarray})
OutputDataFrame = pd.DataFrame(appended_data).ix[:, columns]
OutputDataFrame.to_csv('name.csv', index=False)
and I need the spacing to be consistent. Is there any way to do so?
The default display for arrays is a uniform field width per element, not a uniform spacing:
In [30]: x=np.array([11,223,3])
In [31]: x
Out[31]: array([ 11, 223, 3])
In [32]: x.tolist() # list display with uniform spacing
Out[32]: [11, 223, 3]
In effect numpy uses a format like:
In [35]: fmt = ' '.join(['%3d','%3d','%3d'])
In [36]: fmt
Out[36]: '%3d %3d %3d'
In [37]: fmt%tuple(x)
Out[37]: ' 11 223 3'
np.savetxt does just that, using the fmt and delimiter that you provide.
csv stands for 'comma separated'. Tabs are also used. If 'white space' is used, good readers are just as happy with one, two or more 'blanks'. Such tables are usually formatted to keep the columns aligned, not to keep the space between numbers constant.
A 3 row array with mixed number sizes:
In [39]: x=np.array([[1,123,32],[34,1,2],[0,23,1000]])
In [40]: x
Out[40]:
array([[ 1, 123, 32],
[ 34, 1, 2],
[ 0, 23, 1000]])
Fixed width csv formatting:
In [41]: np.savetxt('test.csv',x,fmt='%5d', delimiter=',')
In [42]: cat test.csv
1, 123, 32
34, 1, 2
0, 23, 1000
delimited reading:
In [43]: np.genfromtxt('test.csv',delimiter=',',dtype=None)
Out[43]:
array([[ 1, 123, 32],
[ 34, 1, 2],
[ 0, 23, 1000]])
The default mode for Python string split uses generalized white space:
In [44]: ' 11 223 3'.split()
Out[44]: ['11', '223', '3']
Here's an example of a csv with constant spacing (and variable width)
In [45]: np.savetxt('test.csv',x,fmt='%d', delimiter=' ')
In [46]: cat test.csv
1 123 32
34 1 2
0 23 1000
np.genfromtxt('test.csv',dtype=None) reads it just fine.
You could convert the evenly spaced numpy array to a list first:
np.array([11, 22, 3]).tolist()
will give you
[11, 22, 3]
Also, when you map the numpy array, each individual value passed to the function will not have spacing so you don't have to worry about it.
See hpaulj's answer below as it's much more comprehensive than mine.
For a numpy array X, the location of its element X[k[0], ..., k[d-1]] is offset from the location of X[0, ..., 0] by k[0]*s[0] + ... + k[d-1]*s[d-1], where (s[0], ..., s[d-1]) is the strides tuple X.strides.
As far as I understand, nothing in the numpy array spec requires that distinct indexes of an array X correspond to distinct addresses in memory, the simplest instance of this being a zero-valued stride; e.g. see the advanced NumPy section of the scipy lectures.
Does numpy have a built-in predicate to test whether the strides and the shape are such that distinct indexes map to distinct memory addresses?
If not, how does one write one, preferably so as to avoid sorting of the strides?
edit: It took me a bit to figure out what you are asking about. With striding tricks it's possible to index the same element in a data buffer in different ways, and broadcasting actually does this under the covers. Normally we don't worry about it because it is either hidden or intentional.
Recreating the strided mapping and looking for duplicates may be the only way to test this. I'm not aware of any existing function that checks it.
==================
I'm not quite sure what you're concerned with, but let me illustrate how shape and strides work.
Define a 3x4 array:
In [453]: X=np.arange(12).reshape(3,4)
In [454]: X.shape
Out[454]: (3, 4)
In [455]: X.strides
Out[455]: (16, 4)
Index an item
In [456]: X[1,2]
Out[456]: 6
I can get its index in a flattened version of the array (e.g. the original arange) with ravel_multi_index:
In [457]: np.ravel_multi_index((1,2),X.shape)
Out[457]: 6
I can also get this location using strides, keeping in mind that strides are in bytes (here 4 bytes per item):
In [458]: 1*16+2*4
Out[458]: 24
In [459]: (1*16+2*4)/4
Out[459]: 6.0
All these numbers are relative to the start of the data buffer. We can get the data buffer address from X.data or X.__array_interface__['data'], but usually don't need to.
So the strides tell us: to go from one entry to the next, step 4 bytes; to go from one row to the next, step 16. The 6 is located one row down and 2 over, i.e. 24 bytes into the buffer.
In the as_strided example of your link, strides=(1*2, 0) produces repeated indexing of specific values.
With my X:
In [460]: y=np.lib.stride_tricks.as_strided(X,strides=(16,0), shape=(3,4))
In [461]: y
Out[461]:
array([[0, 0, 0, 0],
[4, 4, 4, 4],
[8, 8, 8, 8]])
y is a 3x4 that repeatedly indexes the 1st column of X.
Changing one item in y ends up changing one value in X but a whole row in y:
In [462]: y[1,2]=10
In [463]: y
Out[463]:
array([[ 0, 0, 0, 0],
[10, 10, 10, 10],
[ 8, 8, 8, 8]])
In [464]: X
Out[464]:
array([[ 0, 1, 2, 3],
[10, 5, 6, 7],
[ 8, 9, 10, 11]])
as_strided can produce some weird effects if you aren't careful.
OK, maybe I've figured out what's bothering you: can I identify a situation like this, where two different indexing tuples end up pointing to the same location in the data buffer? Not that I'm aware of. That y's strides contain a 0 is a pretty good indicator.
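A minimal sketch of that indicator (a partial check only; as the next example shows, overlap can also occur with all-nonzero strides):
def has_zero_stride(a):
    # a zero stride on an axis of length > 1 guarantees duplicate addresses
    return any(s == 0 and n > 1 for s, n in zip(a.strides, a.shape))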
as_strided is often used to create overlapping windows:
In [465]: y=np.lib.stride_tricks.as_strided(X,strides=(8,4), shape=(3,4))
In [466]: y
Out[466]:
array([[ 0, 1, 2, 3],
[ 2, 3, 10, 5],
[10, 5, 6, 7]])
In [467]: y[1,2]=20
In [469]: y
Out[469]:
array([[ 0, 1, 2, 3],
[ 2, 3, 20, 5],
[20, 5, 6, 7]])
Again changing 1 item in y ends up changing 2 values in y, but only 1 in X.
Ordinary array creation and indexing does not have this duplicate-indexing issue. Broadcasting may do something like it under the covers: a (4,) array is changed to (1,4) and then to (3,4), effectively replicating rows. I think there's another stride_tricks function that does this explicitly.
In [475]: x,y=np.lib.stride_tricks.broadcast_arrays(X,np.array([.1,.2,.3,.4]))
In [476]: x
Out[476]:
array([[ 0, 1, 2, 3],
[20, 5, 6, 7],
[ 8, 9, 10, 11]])
In [477]: y
Out[477]:
array([[ 0.1, 0.2, 0.3, 0.4],
[ 0.1, 0.2, 0.3, 0.4],
[ 0.1, 0.2, 0.3, 0.4]])
In [478]: y.strides
Out[478]: (0, 8)
In any case, in normal array use we don't have to worry about this ambiguity. We get it only with intentional actions, not accidental ones.
==============
How about this for a test:
def dupstrides(x):
    uniq = {sum(s*j for s, j in zip(x.strides, i)) for i in np.ndindex(x.shape)}
    print(uniq)
    print(len(uniq))
    print(x.size)
    return len(uniq) < x.size
In [508]: dupstrides(X)
{0, 32, 4, 36, 8, 40, 12, 44, 16, 20, 24, 28}
12
12
Out[508]: False
In [509]: dupstrides(y)
{0, 4, 8, 12, 16, 20, 24, 28}
8
12
Out[509]: True
It turns out this test is already implemented in numpy, see mem_overlap.c:842.
The test is exposed as numpy.core.multiarray_tests.internal_overlap(x).
Example:
>>> import numpy as np
>>> from numpy.core.multiarray_tests import internal_overlap
>>> from numpy.lib.stride_tricks import as_strided
Now create a contiguous array, use as_strided to create an array with internal overlap, and confirm this with the test:
>>> x = np.arange(3*4, dtype=np.float64).reshape((3,4))
>>> y = as_strided(x, shape=(5,4), strides=(16, 8))
>>> y
array([[ 0., 1., 2., 3.],
[ 2., 3., 4., 5.],
[ 4., 5., 6., 7.],
[ 6., 7., 8., 9.],
[ 8., 9., 10., 11.]])
>>> internal_overlap(x)
False
>>> internal_overlap(y)
True
The function is optimized to quickly return False for Fortran- or C-contiguous arrays.
How can I join two numpy ndarrays to accomplish the following in a fast way, using optimized numpy, without any looping?
>>> a = np.random.rand(2,2)
>>> a
array([[ 0.09028802, 0.2274419 ],
[ 0.35402772, 0.87834376]])
>>> b = np.random.rand(2,2)
>>> b
array([[ 0.4776325 , 0.73690098],
[ 0.69181444, 0.672248 ]])
>>> c = ???
>>> c
array([[ 0.09028802, 0.2274419, 0.4776325 , 0.73690098],
[ 0.09028802, 0.2274419, 0.69181444, 0.672248 ],
[ 0.35402772, 0.87834376, 0.4776325 , 0.73690098],
[ 0.35402772, 0.87834376, 0.69181444, 0.672248 ]])
Not the prettiest, but you could combine hstack, repeat, and tile:
>>> a = np.arange(4).reshape(2,2)
>>> b = a+10
>>> a
array([[0, 1],
[2, 3]])
>>> b
array([[10, 11],
[12, 13]])
>>> np.hstack([np.repeat(a,len(a),0),np.tile(b,(len(b),1))])
array([[ 0, 1, 10, 11],
[ 0, 1, 12, 13],
[ 2, 3, 10, 11],
[ 2, 3, 12, 13]])
Or for a 3x3 case:
>>> a = np.arange(9).reshape(3,3)
>>> b = a+10
>>> np.hstack([np.repeat(a,len(a),0),np.tile(b,(len(b),1))])
array([[ 0, 1, 2, 10, 11, 12],
[ 0, 1, 2, 13, 14, 15],
[ 0, 1, 2, 16, 17, 18],
[ 3, 4, 5, 10, 11, 12],
[ 3, 4, 5, 13, 14, 15],
[ 3, 4, 5, 16, 17, 18],
[ 6, 7, 8, 10, 11, 12],
[ 6, 7, 8, 13, 14, 15],
[ 6, 7, 8, 16, 17, 18]])
What you want is, apparently, the cartesian product of a and b, stacked horizontally. You can use the itertools module to generate the indices for the numpy arrays, then numpy.hstack to stack them:
import numpy as np
from itertools import product
a = np.array([[ 0.09028802, 0.2274419 ],
[ 0.35402772, 0.87834376]])
b = np.array([[ 0.4776325 , 0.73690098],
[ 0.69181444, 0.672248 ],
[ 0.79941110, 0.52273 ]])
a_inds, b_inds = map(list, zip(*product(range(len(a)), range(len(b)))))
c = np.hstack((a[a_inds], b[b_inds]))
This results in a c of:
array([[ 0.09028802, 0.2274419 , 0.4776325 , 0.73690098],
[ 0.09028802, 0.2274419 , 0.69181444, 0.672248 ],
[ 0.09028802, 0.2274419 , 0.7994111 , 0.52273 ],
[ 0.35402772, 0.87834376, 0.4776325 , 0.73690098],
[ 0.35402772, 0.87834376, 0.69181444, 0.672248 ],
[ 0.35402772, 0.87834376, 0.7994111 , 0.52273 ]])
Breaking down the indices thing:
product(range(len(a)), range(len(b))) will generate something that looks like this if you convert it to a list:
[(0, 0), (0, 1), (1, 0), (1, 1)]
You want something like this: [0, 0, 1, 1], [0, 1, 0, 1], so you need to transpose the generator. The idiomatic way to do this is with zip(*zipped_thing). However, if you just directly assign these, you'll get tuples, like this:
[(0, 0, 1, 1), (0, 1, 0, 1)]
But numpy arrays interpret tuples as multi-dimensional indexes, so you want to turn them to lists, which is why I mapped the list constructor onto the result of the product function.
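If you'd rather generate those indexes inside numpy, np.repeat and np.tile produce the same index arrays without itertools; a sketch under the same a and b:
import numpy as np
a_inds = np.repeat(np.arange(len(a)), len(b))  # [0, 0, 0, 1, 1, 1] for len(a)=2, len(b)=3
b_inds = np.tile(np.arange(len(b)), len(a))    # [0, 1, 2, 0, 1, 2]
c = np.hstack((a[a_inds], b[b_inds]))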
Let's walk through a prospective solution to handle generic cases involving different shaped arrays with some inlined comments to explain the method involved.
(1) First off, we store shapes of input arrays.
ma,na = a.shape
mb,nb = b.shape
(2) Next up, initialize a 3D array with the number of columns being the sum of the numbers of columns in the input arrays a and b. Use np.empty for this task.
out = np.empty((ma,mb,na+nb),dtype=a.dtype)
(3) Then, set the first axis of the 3D array for the first "na" columns with the rows from a with a[:,None,:]. So, if we assign it to out[:,:,:na], that second colon would indicate to NumPy that we need a broadcasted setting, if possible as always happens with singleton dims in NumPy arrays. In effect, this would be same as tiling/repeating, but possibly in an efficient way.
out[:,:,:na] = a[:,None,:]
(4) Repeat for setting elements from b into the output array. This time we broadcast along the first axis of out with out[:,:,na:], with that first colon helping us do the broadcasting.
out[:,:,na:] = b
(5) The final step is to reshape the output to a 2D shape. This can be done by simply assigning the required 2D shape tuple. Reshaping just changes the view and is effectively zero cost.
out.shape = (ma*mb,na+nb)
Condensing everything, the full implementation would look like this -
ma,na = a.shape
mb,nb = b.shape
out = np.empty((ma,mb,na+nb),dtype=a.dtype)
out[:,:,:na] = a[:,None,:]
out[:,:,na:] = b
out.shape = (ma*mb,na+nb)
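Wrapped as a function (cartesian_stack is just an illustrative name) and applied to the arrays from the question:
import numpy as np

def cartesian_stack(a, b):
    # cartesian product of the rows of a and b, stacked horizontally
    ma, na = a.shape
    mb, nb = b.shape
    out = np.empty((ma, mb, na + nb), dtype=a.dtype)
    out[:, :, :na] = a[:, None, :]   # broadcast a along axis 1
    out[:, :, na:] = b               # broadcast b along axis 0
    out.shape = (ma * mb, na + nb)
    return out

c = cartesian_stack(a, b)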
You can use dstack() and broadcast_arrays():
import numpy as np
a = np.random.randint(0, 10, (3, 2))
b = np.random.randint(10, 20, (4, 2))
np.dstack(np.broadcast_arrays(a[:, None], b)).reshape(-1, a.shape[-1] + b.shape[-1])
Try either np.hstack or np.vstack. This would work even for arrays that are not the same length. All you would need to do is this:
np.hstack(appendedarray[:]) or np.vstack(appendedarray[:])
All arrays are indexable, so you can merge them by just calling:
a[:2],b[:2]
or you can use the core numpy stacking functions, which should look something like this:
c = np.vstack((a, b))
If I have an array like this:
a = np.array([[ 1, 2, 3, 4],
[ 5 ,6, 7, 8],
[ 9,10,11,12],
[13,14,15,16]])
I want to 'change the resolution' and end up with a smaller array (say 2 rows by 2 cols, or 2 rows by 4 cols, etc.). I want this resolution change to happen through summation. I need this to work with large arrays; the number of rows and cols of the smaller array will always be a factor of those of the larger array.
Reducing the above array to a 2 by 2 array would result in (which is what I want):
[[ 14. 22.]
[ 46. 54.]]
I have this function that does it fine:
import numpy as np
def shrink(data, rows, cols):
    shrunk = np.zeros((rows, cols))
    for i in xrange(0, rows):
        for j in xrange(0, cols):
            row_sp = data.shape[0]/rows
            col_sp = data.shape[1]/cols
            zz = data[i*row_sp : i*row_sp + row_sp, j*col_sp : j*col_sp + col_sp]
            shrunk[i,j] = np.sum(zz)
    return shrunk
print shrink(a,2,2)
print shrink(a,2,1)
#correct output:
[[ 14. 22.]
[ 46. 54.]]
[[ 36.]
[ 100.]]
I've had a long look through the examples, but can't seem to find anything that helps.
Is there a faster way to do this, without needing the loops?
With your example:
a.reshape(2,2,2,2).sum(axis=1).sum(axis=2)
returns:
array([[14, 22],
[46, 54]])
Now let's create a general function…
def shrink(data, rows, cols):
    return data.reshape(rows, data.shape[0]/rows, cols, data.shape[1]/cols).sum(axis=1).sum(axis=2)
works for your examples:
In [19]: shrink(a, 2,2)
Out[19]:
array([[14, 22],
[46, 54]])
In [20]: shrink(a, 2,1)
Out[20]:
array([[ 36],
[100]])
I'm sure there is a better/smarter approach without all these horrendous loops...
Here is one way to avoid explicitly looping over every element of data:
def shrink(data, rows, cols):
    row_sp = data.shape[0] / rows   # was a.shape, which relied on a global
    col_sp = data.shape[1] / cols
    tmp = np.sum(data[i::row_sp] for i in xrange(row_sp))
    return np.sum(tmp[:, i::col_sp] for i in xrange(col_sp))
On my machine, this is about 30% faster than your version (for shrink(a, 2, 2)).
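As an aside, the reshape trick from the earlier answer can also do both reductions in one call by summing over a tuple of axes (assuming a NumPy version recent enough to accept tuple axes, 1.7+):
import numpy as np

def shrink(data, rows, cols):
    # sum over both "block" axes at once
    return data.reshape(rows, data.shape[0] // rows,
                        cols, data.shape[1] // cols).sum(axis=(1, 3))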