I have a loop that reads through a file until the end is reached. On each pass through the loop, I extract a 1D numpy array. I want to append this array to another numpy array in the 2D direction. That is, I might read in something of the form
x = [1,2,3]
and I want to append it to something of the form
z = [[0,0,0],
[1,1,1]]
I know I can simply do z = numpy.append(z, [x], axis=0) and achieve my desired result of
z = [[0,0,0],
[1,1,1],
[1,2,3]]
My issue comes from the fact that on the first pass through the loop, I don't have anything to append to yet, because the first array read in is the first row of the 2D array. I don't want to write an if statement to handle the first case because that is ugly. If I were working with lists, I could simply do z = [] before the loop and then z.append(x) every time I read in an array. However, I can find no way of doing a similar procedure in numpy. I can create an empty numpy array, but then I can't append to it in the way I want. Can anyone help? Am I making any sense?
EDIT:
After some more research, I found another workaround that technically does what I want, although I think I will go with the solution given by @Roger Fan, since numpy appending is very slow. I'm posting it here just so it's out there.
I can still define z = [] at the beginning of the loop. Then I append each array with `np.append(z, x)`. This will ultimately give me something like
z = [0,0,0,1,1,1,1,2,3]
Then, because all the arrays I read in are the same size, after the loop I can simply resize with `np.resize(z, (n, m))` and get what I'm after.
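For concreteness, here is a minimal sketch of that flatten-then-reshape workaround; the three example rows are stand-ins for the arrays read from the file, and row_len is assumed to be known:

import numpy as np

row_len = 3
z = np.array([])                             # start with an empty 1D array
for x in ([0, 0, 0], [1, 1, 1], [1, 2, 3]):  # stand-ins for arrays read from the file
    z = np.append(z, x)                      # flat append; z stays 1D
z = z.reshape(-1, row_len)                   # recover the 2D shape after the loop
print(z)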
Don't do it. Read the whole file into one array, using for example numpy.genfromtxt().
With this one array, you can then loop over the rows, loop over the columns, and perform other operations using slices.
Alternatively, you can create a regular list, append a lot of arrays to that list, and in the end generate your desired array from the list using either numpy.array(list_of_arrays) or, for more control, numpy.vstack(list_of_arrays).
The idea in this second approach is "delayed array creation": find and organize your data first, and then create the desired array once, already in its final form.
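As a hedged sketch of that delayed-creation pattern, assuming a whitespace-separated numeric text file ('myfile.txt' is a placeholder path):

import numpy as np

rows = []
with open('myfile.txt') as f:
    for line in f:
        rows.append(np.array(line.split(), dtype=float))  # one 1D array per line
z = np.vstack(rows)  # build the 2D array once, after all rows are collected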
As @heltonbiker mentioned in his answer, something like np.genfromtxt is going to be the best way to do this if it fits your needs. Otherwise, I suggest reading the answers to this question about appending to numpy arrays. Basically, numpy array appending is extremely slow and should be avoided whenever possible. There are two much better (and about 20x faster) solutions:
If you know the length in advance, you can preallocate your array and assign to it.
import numpy as np

length_of_file = 5000
results = np.empty(length_of_file)
with open('myfile.txt', 'r') as f:
    for i, line in enumerate(f):
        results[i] = processing_func(line)  # processing_func: your own function that parses one line
Otherwise, just keep a list of lists or list of arrays and convert it to a numpy array all at once.
results = []
with open('myfile.txt', 'r') as f:
    for line in f:
        results.append(processing_func(line))
results = np.array(results)
Related
What I want to do:
I want to create an array and add each item from a list to the array. This is what I have so far:
count = 0
arr = []
with open(path, encoding='utf-8-sig') as f:
    data = f.readlines()  # data is the list
for s in data:
    arr[count] = s
    count += 1
What am I doing wrong? The error I get is IndexError: list assignment index out of range
When you try to access arr at index 0, there is nothing there yet. What you want is to add to the list, so you should do arr.append(s) instead.
Your arr is an empty array. So, arr[count] = s is giving that error.
Either you initialize your array with empty elements, or use the append method of array. Since you do not know how many elements you will be entering into the array, it is better to use the append method in this case.
for s in data:
    arr.append(s)
    count += 1
It's worth taking a step back and asking what you're trying to do here.
f is already an iterable of lines: something you can loop over with for line in f:. But it's "lazy"—once you loop over it once, it's gone. And it's not a sequence—you can loop over it, but you can't randomly access it with indexes or slices like f[20] or f[-10:].
f.readlines() copies that into a list of lines: something you can loop over, and index. While files have the readlines method for this, it isn't really necessary—you can convert any iterable to a list just by calling list(f).
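For illustration, a tiny sketch of the difference (the filename 'some_file.txt' is just a placeholder):

with open('some_file.txt') as f:
    lines = list(f)    # same result as f.readlines(): a list you can index and reuse
    print(lines[0])    # works: lists support indexing
    # print(f[0])      # would fail: a file object is iterable but not indexable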
Your loop appears to be an attempt to create another list of the same lines. Which you could do with just list(data). Although it's not clear why you need another list in the first place.
Also, the term "array" betrays some possible confusion.
A Python list is a dynamic array, which can be indexed and modified, but can also be resized by appending, inserting, and deleting elements. So, technically, arr is an array.
But usually when people talk about "arrays" in Python, they mean fixed-size arrays, usually of fixed-size objects, like those provided by the stdlib array module, the third-party numpy library, or special-purpose types like the builtin bytearray.
In general, to convert a list or other iterable into any of these is the same as converting into a list: just call the constructor. For example, if you have a list of numbers between 0-255, you can do bytearray(lst) to get a bytearray of the same numbers. Or, if you have a list of lists of float values, np.array(lst) will give you a 2D numpy array of floats. And so on.
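A tiny illustration of "just call the constructor", using small made-up data:

import numpy as np

nums = [65, 66, 67]
ba = bytearray(nums)             # bytearray from a list of ints in the range 0-255
print(ba)                        # bytearray(b'ABC')

rows = [[1.0, 2.0], [3.0, 4.0]]
arr = np.array(rows)             # 2D numpy array of floats from a list of lists
print(arr.shape)                 # (2, 2)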
So, why doesn't your code work?
When you write arr = [], you're creating a list of 0 elements.
When you write arr[count] = s, you're trying to set the count-th element in the list to s. But there is no count-th element. You're writing past the end of the list.
One option is to call arr.append(s) instead. This makes the list 1 element longer than it used to be, and puts s in the new slot.
Another option is to create a list of the right size in the first place, like arr = [None for _ in data]. Then, arr[count] = s can replace the None in the count-th slot with s.
But if you really just want a copy of data in another list, you're better off just using arr = list(data), or arr = data[:].
And if you don't have any need for another copy, just do arr = data, or just use data as-is—or even, if it works for your needs, just use f in the first place.
It seems like you are coming from a MATLAB or R background. When you do arr = [], it creates an empty list, not an array.
import numpy

count = 0
with open(path, encoding='utf-8-sig') as f:
    data = f.readlines()  # data is the list
size = len(data)
array = numpy.zeros((size, 1))
for s in data:
    array[count, 0] = float(s)  # each line is assumed to hold one number
    count += 1
I am trying to break a long sequence into sub-sequence with a smaller window size by using the get_slice function defined by me.
Then I suddenly realized that my code is too clumsy: since my raw data is already a numpy array, I still need to store it into a list in my get_slice function. After that, when I read each row in data_matrix, I need another list to store the information again.
The code works fine, yet the conversion between numpy array and list back and forth seems non-pythonic to me. I wonder if I am doing it right. If not, how to do it more efficiently and more pythonic?
Here's my code:
import numpy as np
##Artifical Data Generation##
X_row1 = np.linspace(1,60,60,dtype=int)
X_row2 = np.linspace(101,160,60,dtype=int)
X_row3 = np.linspace(1001,1060,60,dtype=int)
data_matrix = np.append(X_row1.reshape(1,-1),X_row2.reshape(1,-1),axis=0)
data_matrix = np.append(data_matrix,X_row3.reshape(1,-1,),axis=0)
##---------End--------------##
##The function for generating time slice for sequence##
def get_slice(X, windows=5, stride=1):
    x_slice = []
    for i in range(int(len(X)/stride)):
        if i*stride < len(X)-windows+1:
            x_slice.append(X[i*stride:i*stride+windows])
    return np.array(x_slice)
##---------End--------------##
x_list = []
for row in data_matrix:
temp_data = get_slice(row) #getting time slice as numpy array
x_list.append(temp_data) #appending the time slice into a list
X = np.array(x_list) #Converting the list back to numpy array
Putting this here as a semi-complete answer to address your two points - making the code more "pythonic" and more "efficient."
There are many ways to write code and there's always a balance to be found between the amount of numpy code and pure python code used.
Most of that comes down to experience with numpy and knowing some of the more advanced features, how fast the code needs to run, and personal preference.
Personal preference is the most important - you need to be able to understand what your code does and modify it.
Don't worry about what is pythonic, or even worse - numpythonic.
Find a coding style that works for you (as you seem to have done), and don't stop learning.
You'll pick up some tricks (like @B.M.'s answer uses), but for the most part these should be saved for rare instances.
Most tricks tend to require extra work, or only apply in some circumstances.
That brings up the second part of your question.
How to make code more efficient.
The first step is to benchmark it.
Really.
I've been surprised at the number of things I thought would speed up code that barely changed it, or even made it run slower.
Python's lists are highly optimized and give good performance for many things (although many users here on Stack Overflow remain convinced that using numpy can magically make any code faster).
To address your specific point, mixing lists and arrays is fine in most cases. Particularly if:
1. You don't know the size of your data beforehand (lists expand much more efficiently)
2. You are creating a large number of views into an array (a list of arrays is often cheaper than one large array in this case)
3. You have irregularly shaped data (arrays must be rectangular, with every row the same length)
In your code, case 2 applies. The trick with as_strided would also work, and probably be faster in some cases, but until you've profiled and know what those cases are I would say your code is good enough.
There are very few cases where mixing lists and arrays is necessary. You can build the same data efficiently using only array primitives:
data_matrix = np.add.outer([0, 100, 1000], np.linspace(1, 60, 60, dtype=int))
itemsize = data_matrix.itemsize  # compute strides from the actual item size rather than hard-coding 4
X = np.lib.stride_tricks.as_strided(data_matrix, shape=(3, 56, 5),
                                    strides=(itemsize*60, itemsize, itemsize))
It's just a view. A fresh array can be obtained by X=X.copy().
Appending to the list will be slow. Try a list comprehension to make the numpy array.
Something like the following:
import numpy as np
##Artifical Data Generation##
X_row1 = np.linspace(1,60,60,dtype=int)
X_row2 = np.linspace(101,160,60,dtype=int)
X_row3 = np.linspace(1001,1060,60,dtype=int)
data_matrix = np.append(X_row1.reshape(1,-1),X_row2.reshape(1,-1),axis=0)
data_matrix = np.append(data_matrix,X_row3.reshape(1,-1,),axis=0)
##---------End--------------##
##The function for generating time slice for sequence##
def get_slice(X, windows=5, stride=1):
    return np.array([X[i*stride:i*stride+windows]
                     for i in range(int(len(X)/stride))
                     if i*stride < len(X)-windows+1])
##---------End--------------##
X = np.array([get_slice(row) for row in data_matrix])
print(X)
This may be odd, because you have a numpy array of numpy arrays. If you want a 3 dimensional array this is perfectly fine. If you don't want a 3 dimensional array then you may want to vstack or append the arrays.
# X = np.array([get_slice(row) for row in data_matrix])
X = np.vstack([get_slice(row) for row in data_matrix])
List Comprehension speed
I am running Python 3.4.4 on Windows 10.
import timeit
TEST_RUNS = 1000
LIST_SIZE = 2000000
def make_list():
    li = []
    for i in range(LIST_SIZE):
        li.append(i)
    return li

def make_list_microopt():
    li = []
    append = li.append
    for i in range(LIST_SIZE):
        append(i)
    return li

def make_list_comp():
    li = [i for i in range(LIST_SIZE)]
    return li
print("List Append:", timeit.timeit(make_list, number=TEST_RUNS))
print("List Comprehension:", timeit.timeit(make_list_comp, number=TEST_RUNS))
print("List Append Micro-optimization:", timeit.timeit(make_list_microopt, number=TEST_RUNS))
Output
List Append: 222.00971377954895
List Comprehension: 125.9705268094408
List Append Micro-optimization: 157.25782340883387
I am very surprised by how much the micro-optimization helps. Still, list comprehensions are a lot faster for large lists on my system.
This is basically what I am trying to do:
array = np.array()  # initialize the array. This is where the error described below is thrown
for i in xrange(?):  # in the full version of this code, this loop goes through the length of a file.
                     # I won't know the length until I go through it. The point of the question is to
                     # see if you can build the array without knowing its exact size beforehand.
    A = random.randint(0, 10)
    B = random.randint(0, 10)
    C = random.randint(0, 10)
    D = random.randint(0, 10)
    row = [A, B, C, D]
    array[i:] = row  # this is supposed to add a row to the array with A,B,C,D as column values
This code doesn't work. First of all, it complains: TypeError: Required argument 'object' (pos 1) not found. But I don't know the final size of the array.
Second, I know the last line is incorrect, but I am not sure how to express this in python/numpy. So how can I do this?
A numpy array must be created with a fixed size. You can create a small one (e.g., one row) and then append rows one at a time, but that will be inefficient. There is no way to efficiently grow a numpy array gradually to an undetermined size. You need to decide ahead of time what size you want it to be, or accept that your code will be inefficient. Depending on the format of your data, you can possibly use something like numpy.loadtxt or various functions in pandas to read it in.
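If the file is plain numeric text, the read-it-all-at-once route can be as simple as this sketch (the filename 'data.txt' is a placeholder):

import numpy as np

arr = np.loadtxt('data.txt')  # number of rows and columns are inferred from the file itself
print(arr.shape)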
Use a list of 1D numpy arrays, or a list of lists, and then convert it to a numpy 2D array (or use more nesting and get more dimensions if you need to).
import numpy as np
a = []
for i in range(5):
    a.append(np.array([1,2,3]))  # or a.append([1,2,3])
a = np.asarray(a) # a list of 1D arrays (or lists) becomes a 2D array
print(a.shape)
print(a)
I have a numpy matrix which I filled with data from a *.csv-file
csv = np.genfromtxt(file, skiprows=22)
matrix = np.matrix(csv)
This is a 64x64 matrix which looks like
print matrix
[[...,...,....]
[...,...,.....]
.....
]]
Now I need to take the logarithm math.log10() of every single value and save it into another 64x64 matrix.
How can I do this? I tried
matrix_lg = np.matrix(csv)
for i in range(0, len(matrix)):
    for j in range(0, len(matrix[0])):
        matrix_lg[i,j] = math.log10(matrix[i,j])
but this only edited the first array (meaning the first row) of my initial matrix.
It's my first time working with python and I'm starting to get confused.
You can just do:
matrix_lg = numpy.log10(matrix)
And it will do it for you. It's also much faster to do it this vectorized way instead of looping over every entry in python. It will also handle domain errors more gracefully.
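For instance, here is a small illustration of the domain-error point, using made-up values: math.log10(0) raises a ValueError, while np.log10 returns -inf and only emits a runtime warning.

import numpy as np

m = np.array([[1.0, 10.0], [0.0, 100.0]])
with np.errstate(divide='ignore'):  # silence the divide-by-zero warning from log10(0)
    print(np.log10(m))              # [[  0.   1.] [-inf   2.]]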
FWIW though, the issue with your posted code is that len() for matrices doesn't work exactly the same as it does for nested lists. As suggested in the comments, you can just use matrix.shape to get the proper dims to iterate through:
matrix_lg = np.matrix(csv)
for i in range(0, matrix_lg.shape[0]):
    for j in range(0, matrix_lg.shape[1]):
        matrix_lg[i,j] = math.log10(matrix_lg[i,j])
I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop which I'm currently doing like this.
for i in range(0, len(data)):
    if new_data[i] == 0: continue
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However, this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorize with the index – so that the vectorised array can compare its items to the corresponding item in the other array?
I thought of zipping the two arrays but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers; that is, it's initialised as zeros, and with each usage of this loop (with a different new_data) it is populated with more values drawn from available_values, which is initially a range of integers. new_data could theoretically be anything.
In answer to your question about vectorising, the answer is probably yes, though you need to clarify what available_values contains and how it's used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be considered as a set of values in which we allocate the first value to the first value in data in which new_data is not 0, that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255, then you can construct an available_values array with unique entries for every possible pair of values in new_data and data like the following:
available_data = numpy.arange(256 * 256).reshape((256, 256))  # one entry for every possible (data, new_data) pair of values 0-255
indices = new_data != 0
# cast the float32 values to integers before using them as indices into the lookup table
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching everything into memory at once, but reading it piece by piece.
If you need to access two large arrays at once, you can do:
from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

for item_a, item_b in izip(data, new_data):
    # ... do your stuff here
izip creates an iterator that iterates over the elements of both arrays at once, but it fetches pieces only as you need them, not all at once.
It seems that replacing the first two lines of the loop to produce:
for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping the loop body upon encountering a zero in new_data. I don't really understand why that many null iterations were so expensive; maybe one day I will...