I'm currently taking some classes on algorithms and data structures and using Python to implement some of the stuff I've been studying.
At the moment I'm implementing a Stack based on a fixed-size array. Given the particularities of python I opted to use numpy.empty().
For a test I've written I'm basically pushing 9 elements into the stack. Up to that point everything is ok because the resulting array has the 9 elements plus space for another 7.
I started popping elements out and when I reach the critical point of just having 4 elements in an array, I expect the array to copy the elements into a new array of size 8.
The thing is that when I create this new array, instead of being created with empty values is already populated.
Here an image of my terminal at that specific step when debugging with PDB
Is there anything I'm missing out?
EDIT: Seems like if I use Python 3 everything works as expected, this is just the case for Python 2
import numpy

class StackV2(object):
    """
    This is the Stack version based on fixed-size arrays.
    """
    def __init__(self):
        self.array = numpy.empty(1, dtype=str)
        self.size = 0

    def push(self, value):
        self.array[self.size] = value
        self.size += 1
        if len(self.array) == self.size:
            self._resize_array(len(self.array) * 2)

    def pop(self):
        self.array[self.size - 1] = ""
        self.size -= 1
        if len(self.array) == (4 * self.size):
            self._resize_array(len(self.array) / 2)

    def _resize_array(self, factor):
        new_array = numpy.empty(factor, dtype=str)
        print(new_array)
        index = 0
        for i in range(0, self.size):
            new_array[index] = self.array[i]
            index += 1
        self.array = new_array
Short answer
Use numpy.zeros instead of numpy.empty to get rid of the surprise garbage values in your new arrays.
Details
The arrays created by numpy.zeros have all of their elements initialized to a "zero value". For arrays with dtype=str, this will be the empty string ''.
From the Numpy docs:
Notes
empty, unlike zeros, does not set the array values to zero, and may therefore be marginally faster. On the other hand, it requires the user to manually set all the values in the array, and should be used with caution.
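For instance, a minimal sketch of the difference (the contents of the array from numpy.empty are whatever happened to be in that memory, so its output is not reproducible):

import numpy as np

initialized = np.zeros(8, dtype=str)    # every element is the empty string ''
print(list(initialized))                # ['', '', '', '', '', '', '', '']

uninitialized = np.empty(8, dtype=str)  # contents are whatever was left in memory
print(list(uninitialized))              # may be empty strings, may be leftover characters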
The fact that it works in Python 3 (but not Python 2) is undefined behavior. Basically, it's a quirk of the implementation which the Numpy developers didn't plan. The best practice is to not rely on such things in your code. As you've seen, the outcome of an undefined behavior is not guaranteed to be consistent across versions, implementations, different computers that you run your code on, etc.
Also, it sounds like you might be a little bit confused about how Numpy arrays work. A numpy array starts off at a fixed size when you create it. This is unlike a normal Python list [], which grows dynamically as you add values to it.
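A tiny illustration of that difference:

import numpy as np

arr = np.empty(4, dtype=str)   # fixed size: always 4 slots
lst = []                       # empty list: grows as you append

arr[0] = "a"
lst.append("a")

print(len(arr), len(lst))      # 4 1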
Also, you don't need both index and i in _resize_array. Just use one or the other, like this:
for i in range(self.size):
    new_array[i] = self.array[i]
Aside from that, your code is fine.
Related
I wanted to know if there is a way to somehow shift an array by a non integer value in python. Let's say I have a matrix T[i,j] and I want to interpolate the value of T[1.3,4.5] by using T[1,4], T[1,5], T[2,4] and T[2,5]. Is there a simple and fast way to do that?
I have been stuck trying to use scipy.ndimage.shift() for the past few hours but I couldn't understand how to make it work.
EDIT: It seems this can be accomplished with scipy.interpolate.interp2d, though interp2d serves a more general purpose and thus may be slower in this case.
You'd want to do something like iT = interp2d(range(T.shape[0]),range(T.shape[1]),T.T). Please note that, in the T.T at the end, the .T stands for transpose, and has nothing to do with the fact that the array is called T. iT can be accessed as a function, for instance, print(iT(1.1,2.3)).
It will return arrays, as opposed to single values, which indicates that one can pass arrays as arguments too (i.e. compute the value of the interpolation at several points "at once").
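A rough sketch of that approach on a toy matrix (interp2d has been deprecated in recent SciPy releases, so treat this as illustrative rather than definitive):

import numpy as np
from scipy.interpolate import interp2d

T = np.arange(20, dtype=float).reshape(4, 5)   # toy 4x5 matrix

# x runs over the first axis of T, y over the second; T.T matches interp2d's layout
iT = interp2d(range(T.shape[0]), range(T.shape[1]), T.T)

print(iT(1.1, 2.3))   # returns an array holding the interpolated value at T[1.1, 2.3]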
I am not aware of any standard way to do this. I'd accomplish it by simply wrapping the array in an instance of some kind of "interpolated array" class. For simplicity, let's start by assuming T is a 1D array:
class InterpolatedArray:
    def __init__(self, array):
        self.array = array

    def __getitem__(self, i):
        i0 = int(i)        # lower integer index
        i1 = i0 + 1        # upper integer index
        f = i - i0         # fractional part of the requested position
        return self.array[i0] * (1 - f) + self.array[i1] * f
As you can see, it overloads the subscripting routine, so that when you attempt to access (for instance) array[1.1], this returns 0.9*array[1]+0.1*array[2]. You'd have to explicitly build this object from your previous array:
iT = InterpolatedArray(T)
print(iT[1.1])
As for the two-dimensional case, it works the same, just with a little bit more work on the indexing:
class InterpolatedMatrix:
    def __init__(self, array):
        self.array = array

    def __getitem__(self, key):
        # m[i, j] arrives here as a single tuple, so unpack it
        i, j = key
        i0 = int(i)
        i1 = i0 + 1
        fi = i - i0
        j0 = int(j)
        j1 = j0 + 1
        fj = j - j0
        return (self.array[i0, j0] * (1 - fi) * (1 - fj)
                + self.array[i1, j0] * fi * (1 - fj)
                + self.array[i0, j1] * (1 - fi) * fj
                + self.array[i1, j1] * fi * fj)
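It can be used the same way as the 1D version (toy data, purely for illustration):

import numpy as np

T = np.arange(20, dtype=float).reshape(4, 5)
iM = InterpolatedMatrix(T)
print(iM[1.3, 2.5])   # bilinear interpolation between rows 1-2 and columns 2-3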
You can probably rewrite some of that code to optimize the amount of operations that are performed.
Note, also, that if for some reason you want to access every index in the image with some small fixed offset (i.e. T[i+0.3, j+0.5] for every i, j), then it would be better to do this with vectorization, using numpy (something similar may also be possible with scipy).
I am trying to break a long sequence into sub-sequences with a smaller window size, using a get_slice function I defined.
Then I realized my code is clumsy: my raw data is already a numpy array, but I store the slices in a list inside get_slice, and then, as I read each row of data_matrix, I need yet another list to store the results again.
The code works fine, yet converting back and forth between numpy arrays and lists seems non-pythonic to me. I wonder if I am doing it right. If not, how can I do it more efficiently and more pythonically?
Here's my code:
import numpy as np

## Artificial Data Generation ##
X_row1 = np.linspace(1, 60, 60, dtype=int)
X_row2 = np.linspace(101, 160, 60, dtype=int)
X_row3 = np.linspace(1001, 1060, 60, dtype=int)
data_matrix = np.append(X_row1.reshape(1, -1), X_row2.reshape(1, -1), axis=0)
data_matrix = np.append(data_matrix, X_row3.reshape(1, -1), axis=0)
##---------End--------------##

## The function for generating time slices of a sequence ##
def get_slice(X, windows=5, stride=1):
    x_slice = []
    for i in range(int(len(X) / stride)):
        if i * stride < len(X) - windows + 1:
            x_slice.append(X[i * stride:i * stride + windows])
    return np.array(x_slice)
##---------End--------------##

x_list = []
for row in data_matrix:
    temp_data = get_slice(row)  # getting the time slices as a numpy array
    x_list.append(temp_data)    # appending the time slices to a list
X = np.array(x_list)            # converting the list back to a numpy array
Putting this here as a semi-complete answer to address your two points - making the code more "pythonic" and more "efficient."
There are many ways to write code and there's always a balance to be found between the amount of numpy code and pure python code used.
Most of that comes down to experience with numpy and knowing some of the more advanced features, how fast the code needs to run, and personal preference.
Personal preference is the most important - you need to be able to understand what your code does and modify it.
Don't worry about what is pythonic, or even worse - numpythonic.
Find a coding style that works for you (as you seem to have done), and don't stop learning.
You'll pick up some tricks (like #B.M.'s answer uses), but for the most part these should be saved for rare instances.
Most tricks tend to require extra work, or only apply in some circumstances.
That brings up the second part of your question.
How to make code more efficient.
The first step is to benchmark it.
Really.
I've been surprised at the number of things I thought would speed up code that barely changed it, or even made it run slower.
Python's lists are highly optimized and give good performance for many things (Although many users here on stackoverflow remain convinced that using numpy can magically make any code faster).
To address your specific point, mixing lists and arrays is fine in most cases. Particularly if:
1. You don't know the size of your data beforehand (lists expand much more efficiently)
2. You are creating a large number of views into an array (a list of arrays is often cheaper than one large array in this case)
3. You have irregularly shaped data (numpy arrays must be rectangular, with every row the same length)
In your code, case 2 applies. The trick with as_strided would also work, and would probably be faster in some cases, but until you've profiled and know what those cases are I would say your code is good enough.
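As a small illustration of case 3 (made-up ragged data, just to show why a plain list is sometimes the right container):

import numpy as np

# rows of different lengths cannot form one rectangular 2D array,
# but a list of 1D arrays handles them naturally
ragged = [np.array([1, 2, 3]), np.array([4, 5]), np.array([6])]
print([row.sum() for row in ragged])   # [6, 9, 6]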
There are very few cases where mixing lists and arrays is necessary. You can build the same data efficiently using only array primitives:
data_matrix = np.add.outer([0, 100, 1000], np.linspace(1, 60, 60, dtype=int))
# shape is (rows, windows per row, window size); the hard-coded strides assume 4-byte integers
X = np.lib.stride_tricks.as_strided(data_matrix, shape=(3, 56, 5), strides=(4*60, 4, 4))
It's just a view. A fresh array can be obtained by X=X.copy().
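A slightly more defensive variant of the same idea, computing the strides from the array's own itemsize rather than hard-coding 4 bytes (a sketch; it still assumes a window of 5 and a stride of 1 over rows of length 60):

import numpy as np

data_matrix = np.add.outer([0, 100, 1000], np.linspace(1, 60, 60, dtype=int))

step = data_matrix.itemsize                 # bytes per element for this dtype
X = np.lib.stride_tricks.as_strided(
    data_matrix,
    shape=(3, 60 - 5 + 1, 5),               # (rows, windows per row, window size)
    strides=(step * 60, step, step),
)
print(X.shape)                              # (3, 56, 5)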
Appending to the list will be slow. Try a list comprehension to make the numpy array.
Something like the below:
import numpy as np

## Artificial Data Generation ##
X_row1 = np.linspace(1, 60, 60, dtype=int)
X_row2 = np.linspace(101, 160, 60, dtype=int)
X_row3 = np.linspace(1001, 1060, 60, dtype=int)
data_matrix = np.append(X_row1.reshape(1, -1), X_row2.reshape(1, -1), axis=0)
data_matrix = np.append(data_matrix, X_row3.reshape(1, -1), axis=0)
##---------End--------------##

## The function for generating time slices of a sequence ##
def get_slice(X, windows=5, stride=1):
    return np.array([X[i * stride:i * stride + windows]
                     for i in range(int(len(X) / stride))
                     if i * stride < len(X) - windows + 1])
##---------End--------------##

X = np.array([get_slice(row) for row in data_matrix])
print(X)
This may be odd, because you have a numpy array of numpy arrays. If you want a 3 dimensional array this is perfectly fine. If you don't want a 3 dimensional array then you may want to vstack or append the arrays.
# X = np.array([get_slice(row) for row in data_matrix])
X = np.vstack([get_slice(row) for row in data_matrix])  # use a list, not a generator, so newer numpy versions accept it
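To make the difference concrete, the shapes you would see with the toy data above:

X_3d = np.array([get_slice(row) for row in data_matrix])
print(X_3d.shape)    # (3, 56, 5) -- one 2D block of window slices per original row

X_2d = np.vstack([get_slice(row) for row in data_matrix])
print(X_2d.shape)    # (168, 5)   -- all window slices stacked into one 2D array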
List Comprehension speed
I am running Python 3.4.4 on Windows 10.
import timeit

TEST_RUNS = 1000
LIST_SIZE = 2000000

def make_list():
    li = []
    for i in range(LIST_SIZE):
        li.append(i)
    return li

def make_list_microopt():
    li = []
    append = li.append  # cache the bound method to skip the attribute lookup each iteration
    for i in range(LIST_SIZE):
        append(i)
    return li

def make_list_comp():
    li = [i for i in range(LIST_SIZE)]
    return li

print("List Append:", timeit.timeit(make_list, number=TEST_RUNS))
print("List Comprehension:", timeit.timeit(make_list_comp, number=TEST_RUNS))
print("List Append Micro-optimization:", timeit.timeit(make_list_microopt, number=TEST_RUNS))
Output
List Append: 222.00971377954895
List Comprehension: 125.9705268094408
List Append Micro-optimization: 157.25782340883387
I am very surprised at how much the micro-optimization helps. Still, list comprehensions are a lot faster for large lists on my system.
Suppose I have a very large numpy array a, and I want to add the numerical value 1 to each element of the array. From what I have read so far:
a += 1
is a good way of doing it rather than:
a = a + 1
since in the second case a new array is created in freshly allocated memory and then bound to the name a, while in the first case the operation is performed in place, reusing the old array's memory.
Suppose I want to do the following instead:
a = 1-a
What would be the memory efficient way of doing the above?
numpy.subtract(1, a, out=a)
Using the subtract ufunc directly gives you more control than the - operator. Here, we use the out parameter to place the results of the subtraction back into a.
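For example, with a small array (the buffer-address check is only there to show that no new memory is allocated):

import numpy as np

a = np.array([0.0, 0.25, 0.5, 1.0])
buffer_before = a.__array_interface__['data'][0]          # address of the data buffer

np.subtract(1, a, out=a)                                  # writes 1 - a back into a

print(a)                                                  # [1.   0.75 0.5  0.  ]
print(buffer_before == a.__array_interface__['data'][0])  # True: same memory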
You could do it in place like so:
a *= -1
a += 1
Python has built-in functionality for checking the validity of entire slices: slice.indices. Is there something similar that is built in for individual indices?
Specifically, I have an index, say a = -2 that I wish to normalize with respect to a 4-element list. Is there a method that is equivalent to the following already built in?
def check_index(index, length):
    if index < 0:
        index += length
    if index < 0 or index >= length:
        raise IndexError(...)
My end result is to be able to construct a tuple with a single non-None element. I am currently using list.__getitem__ to do the check for me, but it seems a little awkward/overkill:
items = [None] * 4
items[a] = 'item'
items = tuple(items)
I would like to be able to do
a = check_index(a, 4)
items = tuple('item' if i == a else None for i in range(4))
Everything in this example is pretty negotiable. The only things that are fixed are that I am getting a in a way that can have all of the problems an arbitrary index can have, and that the final result has to be a tuple.
I would be more than happy if the solution used numpy and only really applied to numpy arrays instead of Python sequences. Either one would be perfect for the application I have in mind.
If I understand correctly, you can use range(length)[index], in your example range(4)[-2]. This properly handles negative and out-of-bounds indices. At least in recent versions of Python, range() doesn't literally create a full list so this will have decent performance even for large arguments.
If you have a large number of indices to do this with in parallel, you might get better performance doing the calculation with Numpy vectorized arithmetic, but I don't think the technique with range will work in that case. You'd have to manually do the calculation using the implementation in your question.
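A quick illustration of the range approach, using the 4-element case from the question:

length = 4

print(range(length)[-2])    # 2 -- the negative index is normalized for free
print(range(length)[3])     # 3

try:
    range(length)[-5]
except IndexError as exc:
    print(exc)              # range object index out of range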
There is a function called numpy.core.multiarray.normalize_axis_index which does exactly what I need. It is particularly useful to me because the implementation I had in mind was for numpy array indexing:
>>> from numpy.core.multiarray import normalize_axis_index
>>> normalize_axis_index(3, 4)
3
>>> normalize_axis_index(-3, 4)
1
>>> normalize_axis_index(-5, 4)
...
numpy.core._internal.AxisError: axis -5 is out of bounds for array of dimension 4
The function was added in version 1.13.0. The source for this function is available here, and the documentation source is here.
This is a pretty simple question. I've written a lot in case people who are a few hours behind me on the 'WHY WON'T THIS WORK?' train find this, to help them out.
In Matlab, the following code would create a dynamically increasing-in-size array:
for i = 1:5
    array(i) = i*2;
end
but I am having some problems with the complexities in Python. The following does work but isn't exactly what I want:
i = []
array = []
for i in range(1, 5):
    array.append(i*2)
This works, however you can only append to the end. You can also assign values to a range of cells that already exist, beyond simply sticking them on the end (i.e. the code below, which could replace cells 14-36 in a 100-cell-long list):
i = []
array = list(xrange(1,100))  # creates a list from 1 to 99
for i in range(14, 36):
    array[i] = i*2  # assign and overwrite previous cell values
Is there some catch-all coding method here that combines the two? A solution to the following code:
i = []
array = list(xrange(1,50))  # creates a list from 1 to 49
for i in range(34, 66):
    array[i] = i*2
Error message:
IndexError: list assignment index out of range
General differences I've seen so far:
Python starts at cell number [0] not [1]
You can't dynamically update list sizes and so you need to use the append function
(Possibly?) need to preallocate the lists before using them
Note for others struggling:
One error that consistently came up was:
TypeError: 'builtin_function_or_method' object does not support item assignment
This was due to trying to assign a value like this:
array.append[i-1] = i*2
See higher up for the correct method.
Misc.
Thanks for any help! I'm sure this is really simple but I have run out of ideas and can't find a solution in previous questions!
Other similar questions that either didn't solve it or I didn't understand:
"TypeError: 'function' object does not support item assignment"
Python Array is read-only, can't append values
http://www.pythonbackend.com/topic/1329787069?reply=3
You're going to need to familiarize yourself with numpy as a minimum if you're going to get close to Matlab-like functionality in python. The following is a useful reference:
https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html
I suggest you try to avoid dynamically increasing the size of your list altogether (in MATLAB as well as python) unless you have a good reason to do so. Maybe you could check the desired (final) size of your list elsewhere in the code, before allocation.
So I would go along the same lines as one of your suggestions, with the difference that you can allocate an empty list like so:
arr = [None] * 100
for i in range(100):
    arr[i] = i*2
Or, as John Greenall suggested, use numpy:
import numpy as np
arr = np.empty(100)
for i in range(100):
    arr[i] = i*2
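And if, as in this toy example, the values are a simple function of the index, numpy lets you skip the loop entirely with vectorized arithmetic:

import numpy as np

arr = np.arange(100) * 2   # 0, 2, 4, ..., 198 in one step, no Python loop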