Is putting a numpy array inside a list pythonic?

I am trying to break a long sequence into sub-sequences with a smaller window size, using a get_slice function that I defined.
Then I realized that my code is clumsy: my raw data is already a numpy array, but I store the slices in a list inside my get_slice function. After that, when I read each row of data_matrix, I need yet another list to store the results.
The code works fine, yet converting between numpy arrays and lists back and forth seems non-pythonic to me. Am I doing this right? If not, how can I do it more efficiently and more pythonically?
Here's my code:
import numpy as np

## Artificial Data Generation ##
X_row1 = np.linspace(1, 60, 60, dtype=int)
X_row2 = np.linspace(101, 160, 60, dtype=int)
X_row3 = np.linspace(1001, 1060, 60, dtype=int)
data_matrix = np.append(X_row1.reshape(1, -1), X_row2.reshape(1, -1), axis=0)
data_matrix = np.append(data_matrix, X_row3.reshape(1, -1), axis=0)
## ---------End-------------- ##

## The function for generating time slices of a sequence ##
def get_slice(X, windows=5, stride=1):
    x_slice = []
    for i in range(int(len(X) / stride)):
        if i * stride < len(X) - windows + 1:
            x_slice.append(X[i * stride:i * stride + windows])
    return np.array(x_slice)
## ---------End-------------- ##

x_list = []
for row in data_matrix:
    temp_data = get_slice(row)  # getting the time slices as a numpy array
    x_list.append(temp_data)    # appending the time slices to a list
X = np.array(x_list)            # converting the list back to a numpy array

Putting this here as a semi-complete answer to address your two points - making the code more "pythonic" and more "efficient."
There are many ways to write code and there's always a balance to be found between the amount of numpy code and pure python code used.
Most of that comes down to experience with numpy and knowing some of the more advanced features, how fast the code needs to run, and personal preference.
Personal preference is the most important - you need to be able to understand what your code does and modify it.
Don't worry about what is pythonic, or even worse - numpythonic.
Find a coding style that works for you (as you seem to have done), and don't stop learning.
You'll pick up some tricks (like @B.M.'s answer uses), but for the most part these should be saved for rare instances.
Most tricks tend to require extra work, or only apply in some circumstances.
That brings up the second part of your question.
How to make code more efficient.
The first step is to benchmark it.
Really.
I've been surprised at the number of things I thought would speed up code that barely changed it, or even made it run slower.
Python's lists are highly optimized and give good performance for many things (Although many users here on stackoverflow remain convinced that using numpy can magically make any code faster).
To address your specific point, mixing lists and arrays is fine in most cases, particularly if:
- You don't know the size of your data beforehand (lists expand much more efficiently)
- You are creating a large number of views into an array (a list of arrays is often cheaper than one large array in this case)
- You have irregularly shaped data (arrays must be rectangular, not ragged)
In your code, case 2 applies. The trick with as_strided would also work, and probably be faster in some cases, but until you've profiled and know what those cases are I would say your code is good enough.
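To make that concrete, here is a minimal timeit sketch comparing the list-based approach with the as_strided view (get_slice and data_matrix are assumed to be the ones defined in the question):
import timeit
import numpy as np

# Minimal benchmark sketch; get_slice and data_matrix come from the question.
list_version = timeit.timeit(
    lambda: np.array([get_slice(row) for row in data_matrix]), number=1000)

# Build the equivalent strided view, taking strides from the array itself
s0, s1 = data_matrix.strides
strided_version = timeit.timeit(
    lambda: np.lib.stride_tricks.as_strided(
        data_matrix, shape=(3, 56, 5), strides=(s0, s1, s1)),
    number=1000)

print("list + np.array:", list_version)
print("as_strided view:", strided_version)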

There are very few cases where mixing lists and arrays is necessary. You can build the same data efficiently with array primitives alone:
data_matrix = np.add.outer([0, 100, 1000], np.linspace(1, 60, 60, dtype=int))
# strides are in bytes; use the array's own itemsize rather than hardcoding 4,
# since the default integer width is platform-dependent
s = data_matrix.itemsize
X = np.lib.stride_tricks.as_strided(data_matrix, shape=(3, 56, 5), strides=(s * 60, s, s))
It's just a view. A fresh array can be obtained with X = X.copy().
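On newer numpy (1.20 and later), numpy.lib.stride_tricks.sliding_window_view builds the same view without hand-computed strides; a minimal sketch:
from numpy.lib.stride_tricks import sliding_window_view  # numpy >= 1.20

# Same (3, 56, 5) view over data_matrix, with the byte arithmetic done for you
X = sliding_window_view(data_matrix, 5, axis=1)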

Appending to the list will be slow. Try a list comprehension to build the numpy array, something like the one below.
import numpy as np

## Artificial Data Generation ##
X_row1 = np.linspace(1, 60, 60, dtype=int)
X_row2 = np.linspace(101, 160, 60, dtype=int)
X_row3 = np.linspace(1001, 1060, 60, dtype=int)
data_matrix = np.append(X_row1.reshape(1, -1), X_row2.reshape(1, -1), axis=0)
data_matrix = np.append(data_matrix, X_row3.reshape(1, -1), axis=0)
## ---------End-------------- ##

## The function for generating time slices of a sequence ##
def get_slice(X, windows=5, stride=1):
    return np.array([X[i * stride:i * stride + windows]
                     for i in range(int(len(X) / stride))
                     if i * stride < len(X) - windows + 1])
## ---------End-------------- ##
X = np.array([get_slice(row) for row in data_matrix])
print(X)
This may seem odd, because you end up with a numpy array of numpy arrays. If you want a 3-dimensional array, this is perfectly fine. If you don't want a 3-dimensional array, you may want to vstack the arrays instead (note that vstack should be given a list, since passing a generator is deprecated in newer numpy):
# X = np.array([get_slice(row) for row in data_matrix])
X = np.vstack([get_slice(row) for row in data_matrix])
List Comprehension speed
I am running Python 3.4.4 on Windows 10.
import timeit
TEST_RUNS = 1000
LIST_SIZE = 2000000
def make_list():
    li = []
    for i in range(LIST_SIZE):
        li.append(i)
    return li

def make_list_microopt():
    li = []
    append = li.append
    for i in range(LIST_SIZE):
        append(i)
    return li

def make_list_comp():
    li = [i for i in range(LIST_SIZE)]
    return li
print("List Append:", timeit.timeit(make_list, number=TEST_RUNS))
print("List Comprehension:", timeit.timeit(make_list_comp, number=TEST_RUNS))
print("List Append Micro-optimization:", timeit.timeit(make_list_microopt, number=TEST_RUNS))
Output
List Append: 222.00971377954895
List Comprehension: 125.9705268094408
List Append Micro-optimization: 157.25782340883387
I am very surprised by how much the micro-optimization (hoisting the li.append attribute lookup out of the loop) helps. Still, list comprehensions are a lot faster for large lists on my system.

Related

How to efficiently make a function call to each row of a 2D ndarray?

I'm implementing a KNN classifier and need to traverse the test set quickly, computing and storing the predicted label for each point.
The way I do it now is a list comprehension that I then turn into an ndarray, along the lines of np.array([predict(point) for point in test_set]). But I think this costs both time and space, because Python's for loop is relatively slow and it creates an intermediate copy. Is there a more efficient way to get such an array?
I know that numpy has an apply_along_axis function, but it is said to only use a for loop implicitly, which may not improve performance.
EDIT: I learned a possible way to save memory: combine np.fromiter() with a generator, as in np.fromiter((predict(point) for point in test_set), int, test_set.shape[0]), which avoids creating an intermediate list. Unfortunately, in my program it seems to run a little slower than the previous method.
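For reference, a runnable sketch of that np.fromiter pattern (predict and test_set here are hypothetical stand-ins, not the asker's actual classifier):
import numpy as np

# Hypothetical stand-ins for the real classifier and data
test_set = np.random.randn(100, 4)
def predict(point):
    return int(point.sum() > 0)

# np.fromiter consumes the generator directly, so no intermediate list is built
labels = np.fromiter((predict(p) for p in test_set), dtype=int,
                     count=test_set.shape[0])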
The good old way:
def my_func(test_set):
    i = 0
    test_set_size = len(test_set)
    result = [None] * test_set_size
    while i < test_set_size:
        result[i] = predict(test_set[i])
        i = i + 1
    return np.array(result)

Is there a difference between adding a scalar to a vector inside a for loop and outside it, using numpy?

I was trying to take advantage of numpy's broadcasting while replacing the for loop in this snippet:
import numpy as np
B = np.random.randn(10, 1)
k = 25
for i in range(len(B)):
    B[i][0] = B[i][0] + k
with this:
for i in range(len(B)):
    B = B + k
I observed that I was getting different results. When I tried it outside the loop, B = B + k gave the same results I was expecting from B[i][0] = B[i][0] + k.
Why is this so? Does Broadcasting follow different rules inside loops?
In your 2nd option you meant to do the following:
B = B + k
As you can see, you don't need the for loop at all, and it is MUCH faster than looping over the "vector" (numpy array).
This is a form of vectorized calculation instead of iterative calculation, which is better in terms of both complexity and readability. Both yield the same result.
You can see a lot of examples of vectorization vs. iteration, including running times, here.
And you can see a great video of Andrew Ng going over the numpy broadcasting property.
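A small sketch of why the in-loop version gives different numbers: re-assigning B = B + k inside the loop applies the broadcast once per iteration, so len(B) * k gets added in total:
import numpy as np

B = np.zeros((3, 1))
k = 25

print(B + k)          # one broadcast: every element becomes 25

C = np.zeros((3, 1))
for i in range(len(C)):
    C = C + k         # broadcast repeated len(C) times
print(C)              # every element is now 75, not 25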

Appending arrays in numpy

I have a loop that reads through a file until the end is reached. On each pass through the loop, I extract a 1D numpy array. I want to append this array to another numpy array along a second dimension. That is, I might read in something of the form
x = [1, 2, 3]
and I want to append it to something of the form
z = [[0, 0, 0],
     [1, 1, 1]]
I know I can simply do z = numpy.append(z, [x], axis=0) and achieve my desired result of
z = [[0, 0, 0],
     [1, 1, 1],
     [1, 2, 3]]
My issue is that on the first run through the loop, I don't have anything to append to yet, because the first array read in is the first row of the 2D array. I don't want to write an if statement to handle the first case, because that is ugly. If I were working with lists, I could simply set z = [] before the loop and do z.append(x) each time I read in an array. However, I can find no way of doing a similar procedure in numpy. I can create an empty numpy array, but then I can't append to it in the way I want. Can anyone help? Am I making any sense?
EDIT:
After some more research, I found another workaround that technically does what I want, although I think I will go with the solution given by @Roger Fan, given that numpy appending is very slow. I'm posting it here just so it's out there.
I can still define z = [] at the beginning of the loop, then append my arrays with z = np.append(z, x). This will ultimately give me something like
z = [0, 0, 0, 1, 1, 1, 1, 2, 3]
Then, because all the arrays I read in are the same size, after the loop I can simply reshape with np.resize(z, (n, m)) and get what I'm after.
Don't do it. Read the whole file into one array, using for example numpy.genfromtxt().
With this one array, you can then loop over the rows, loop over the columns, and perform other operations using slices.
Alternatively, you can create a regular list, append a lot of arrays to that list, and in the end generate your desired array from the list using either numpy.array(list_of_arrays) or, for more control, numpy.vstack(list_of_arrays).
The idea in this second approach is "delayed array creation": find and organize your data first, and then create the desired array once, already in its final form.
As @heltonbiker mentioned in his answer, something like np.genfromtxt is going to be the best way to do this if it fits your needs. Otherwise, I suggest reading the answers to this question about appending to numpy arrays. Basically, appending to a numpy array is extremely slow and should be avoided whenever possible. There are two much better (and about 20x faster) solutions:
If you know the length in advance, you can preallocate your array and assign to it.
length_of_file = 5000
results = np.empty(length_of_file)
with open('myfile.txt', 'r') as f:
    for i, line in enumerate(f):
        results[i] = processing_func(line)
Otherwise, just keep a list of lists or list of arrays and convert it to a numpy array all at once.
results = []
with open('myfile.txt', 'r') as f:
    for line in f:
        results.append(processing_func(line))
results = np.array(results)

Efficient way for appending numpy array

I will keep it simple. I have a loop that appends a new row to a numpy array. What is the efficient way to do this?
n = np.zeros([1, 2])
for x in [[2, 3], [4, 5], [7, 6]]:
    n = np.append(n, [x], axis=0)
Now the thing is there is a [0, 0] row stuck at the front, so I have to remove it with
n = np.delete(n, 0, axis=0)
which seems dumb. So please tell me an efficient way to do this.
n = np.empty([1, 2])
is even worse: it starts with uninitialised values.
A bit of technical explanation for the "why lists" part.
Internally, the problem for a list of unknown length is that it needs to fit in memory somehow regardless of its length. There are essentially two different possibilities:
Use a data structure (linked list, some tree structure, etc.) which makes it possible to allocate memory separately for each new element in a list.
Store the data in a contiguous memory area. This area has to be allocated when the list is created, and it has to be larger than what we initially need. If we get more stuff into the list, we need to try to allocate more memory, preferably at the same location. If we cannot do it at the same location, we need to allocate a bigger block and move all data.
The first approach enables all sorts of fancy insertion and deletion options, sorting, etc. However, it is slower in sequential reading and allocates more memory. Python actually uses method #2: lists are stored as "dynamic arrays". For more information on this, please see:
Size of list in memory
What this means is that lists are designed to be very efficient with the use of append. There is very little you can do to speed things up if you do not know the size of the list beforehand.
If you know even the maximum size of the list beforehand, you are probably best off allocating a numpy.array using numpy.empty (not numpy.zeros) with the maximum size, and then using ndarray.resize to shrink the array once you have filled in all the data.
For some reason numpy.array(l) where l is a list is often slow with large lists, whereas copying even large arrays is quite fast (I just tried to create a copy of a 100 000 000 element array; it took less than 0.5 seconds).
This discussion has more benchmarking on different options:
Fastest way to grow a numpy numeric array
I have not benchmarked the numpy.empty + ndarray.resize combo, but both should be rather microsecond than millisecond operations.
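A minimal sketch of that numpy.empty + ndarray.resize combo, assuming a hypothetical upper bound MAX_ROWS is known in advance:
import numpy as np

MAX_ROWS = 1000                     # hypothetical upper bound
buf = np.empty((MAX_ROWS, 2))

n_filled = 0
for x in [[2, 3], [4, 5], [7, 6]]:  # stand-in for the real data source
    buf[n_filled] = x
    n_filled += 1

# shrink in place to the rows actually used; refcheck=False avoids the
# "cannot resize an array that references..." error in interactive sessions
buf.resize((n_filled, 2), refcheck=False)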
There are three ways to do this, if you already have everything in a list:
data = [[2, 3], [4, 5], [7, 6]]
n = np.array(data)
If you know how big the final array will be:
exp = np.array([2, 3])
n = np.empty((3, 2))
for i in range(3):
    n[i, :] = i ** exp
If you don't know how big the final array will be:
exp = np.array([2, 3])
n = []
i = np.random.random()
while i < .9:
    n.append(i ** exp)
    i = np.random.random()
n = np.array(n)
Just for the record, you can start with n = np.empty((0, 2)), but I would not suggest appending to that array in a loop.
You might want to try:
import numpy as np
n = np.reshape([], (0, 2))
for x in [[2, 3], [4, 5], [7, 6]]:
    n = np.append(n, [x], axis=0)
Instead of np.append you can also use n = np.vstack([n, x]). I also agree with @Bi Rico that I would use a list, if n does not need to be accessed within the loop.

Vectorizing the addition of results to a numpy array

I have a function that works something like this:
import random
import numpy as np

def Function(x):
    a = random.random()
    b = random.random()
    c = OtherFunctionThatReturnsAThreeColumnArray()
    results = np.zeros((1, 5))
    results[0, 0] = a
    results[0, 1] = b
    results[0, 2] = c[-1, 0]
    results[0, 3] = c[-1, 1]
    results[0, 4] = c[-1, 2]
    return results
What I'm trying to do is run this function many, many times, appending each returned one-row, five-column result to a running data set. But as I understand it, the append function and a for loop are both ruinously inefficient; I'm trying to improve my code, and the number of runs will be large enough that this kind of inefficiency won't do me any favors.
What's the best way to do the following with the least overhead:
Create a new numpy array to hold the results
Insert the results of N calls of that function into the array in 1?
You're correct in thinking that numpy.append or numpy.concatenate are going to be expensive if repeated many times (each call allocates a brand-new array and copies both of the previous arrays into it).
The best suggestion (if you know how much space you're going to need in total) is to declare the results array before you run your routine, and then just put the results in place as they become available.
If you're going to run this nrows times, then
results = np.zeros([nrows, 5])
and then add your results
def function(x, i, results):
    <.. snip ..>
    results[i, 0] = a
    results[i, 1] = b
    results[i, 2] = c[-1, 0]
    results[i, 3] = c[-1, 1]
    results[i, 4] = c[-1, 2]
Of course, if you don't know how many times you're going to be running function, this won't work. In that case, I'd suggest a less elegant approach:
Declare a possibly large results array and assign to results[i, :] as above (keeping track of i and the size of results).
When you reach the size of results, do the numpy.append (or concatenate) onto a new array, as sketched below. This is less bad than appending repeatedly and shouldn't destroy performance, but you will have to write some wrapper code.
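A rough sketch of what that wrapper code might look like, using a hypothetical GrowingResults buffer that doubles its capacity when it fills up:
import numpy as np

class GrowingResults:
    """Hypothetical growing buffer: amortizes the cost of np.concatenate."""
    def __init__(self, ncols=5, capacity=1024):
        self._buf = np.zeros((capacity, ncols))
        self._n = 0

    def add(self, row):
        if self._n == len(self._buf):
            # out of room: double the capacity in one concatenate
            self._buf = np.concatenate([self._buf, np.zeros_like(self._buf)])
        self._buf[self._n] = row
        self._n += 1

    def array(self):
        # view of only the rows actually filled
        return self._buf[:self._n]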
There are other ideas you could pursue. Off the top of my head, you could:
Write the results to disk; depending on the speed of OtherFunctionThatReturnsAThreeColumnArray and the size of your data, this may not be too daft an idea.
Save your results in a list comprehension (forgetting numpy until after the run). If function returned (a, b, c) instead of results:
results = [function(x) for x in my_data]
and then do some shuffling to get results into the form you need, as sketched below.
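A sketch of that final shuffling step, with function and my_data stubbed out as hypothetical stand-ins:
import random
import numpy as np

# Hypothetical stand-ins for the question's function and data
def function(x):
    c = np.random.randn(4, 3)   # stand-in three-column array
    return random.random(), random.random(), c

my_data = range(10)

results = [function(x) for x in my_data]
X = np.array([[a, b, c[-1, 0], c[-1, 1], c[-1, 2]] for a, b, c in results])
print(X.shape)                   # (10, 5)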
