Comparing 2D arrays one by one to find max values in Python

Could someone please help me with this issue? I really appreciate your time and consideration.
I have many 2D arrays of the same size. The 2D arrays are time series, and each array is for one day. For example:
day1=np.array([[4, 5, 6, 8],[9, 5, 3, 5]])
day2=np.array([[6, 0, 0, 1],[6, 1, 8, 1]])
day3=np.array([[5, 2, 7, 9],[4, 3, 7, 7]])
day4=np.array([[1, 0, 0, 7],[4, 7, 7, 3]])
I need to compare the arrays and find the highest (max) value at each index, along with the date of that highest value. So for the above arrays (day1, day2, day3, day4), I need two outputs like below:
highest_values=([[6, 5, 7, 9],[9, 7, 8, 7]])
date=(['day2', 'day1', 'day3', 'day3'],['day1', 'day4', 'day2', 'day3'])
I could do that with the following code.
import numpy as np
namelist=['day1','day2','day3','day4']
arrays=np.array([day1,day2,day3,day4])
highest_values=arrays.max(axis=0) # to get the max values
index_of_max=arrays.argmax(axis=0) # to get the indices of max values
date=np.array([[namelist[j] for j in index_of_max[i]] for i in range(len(index_of_max))]) # I used the name of each array as the date and then assigned it to the indices of the max values
But I have thousands of large arrays saved on my computer, and reading all the files together to run the above code requires a lot of memory. When I run it on all files simultaneously, I get an out-of-memory error.
I need something like a loop that reads the first two arrays and takes their outputs (the highest values and dates), then compares those outputs with the third array and takes the new outputs, then compares the new outputs with the fourth array, and so on.

It sounds like you want something like:
highest_values = day1
for array in [day2, day3, day4]:
    highest_values = np.maximum(highest_values, array)
This keeps only the running maximum in memory, so each day's array can be loaded, compared, and discarded in turn.
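A sketch of the full incremental approach including the dates, assuming each day's array is saved as a separate .npy file (the filenames below are hypothetical):
import numpy as np

filenames = ['day1.npy', 'day2.npy', 'day3.npy', 'day4.npy']  # hypothetical files

highest_values = np.load(filenames[0])  # running maximum
date = np.full(highest_values.shape, filenames[0], dtype=object)

for name in filenames[1:]:
    array = np.load(name)                # only one extra array in memory at a time
    mask = array > highest_values        # positions where this day sets a new record
    highest_values = np.where(mask, array, highest_values)
    date[mask] = name                    # record this day at those positions
This way only two day-arrays are ever held in memory, no matter how many files there are.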

Related

Picking out rows and columns in a matrix

In Python, I have defined my matrix the following way
A = [[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[-2, 7, 4, 23]]
and wanted to try printing out the individual columns and rows by the following
print(A[2][:]) and print(A[:][2]) for the 3rd row and 3rd column, respectively.
To my surprise, they both printed the 3rd row.
For the purpose of learning, I am not using Numpy or any math packages.
Not sure why print(A[2][:]) and print(A[:][2]) result in the same output
This follows the usual rules for indexing nested lists. A[2] retrieves the 3rd element of A, which is itself a list (the 3rd row), and A[2][:] then takes a full slice of that row, giving a copy of the same row. With A[:][2], the slice A[:] is evaluated first: it produces a shallow copy of the entire outer list, and A[:][2] then retrieves the 3rd element of that copy, which is again the 3rd row. Slicing the outer list never rearranges anything, so neither expression can reach a column.
To extract a column, you need to manually loop through all rows and pick out the element, e.g., with a list comprehension:
third_column = [row[2] for row in A]
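To make the difference concrete, a quick demonstration with the matrix from the question:
A = [[1, 4, 5, 12],
     [-5, 8, 9, 0],
     [-6, 7, 11, 19],
     [-2, 7, 4, 23]]

print(A[2][:])                # [-6, 7, 11, 19] -- the 3rd row
print(A[:][2])                # [-6, 7, 11, 19] -- still the 3rd row
print([row[2] for row in A])  # [5, 9, 11, 4]   -- the 3rd column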

Different axis indication between np.delete and np.mean in NumPy arrays in Python

I learned about axis indexing of NumPy arrays from how is axis indexed in numpy's array.
That article says that, for a 2-D array, axis=0 stands for each column in the array, and axis=1 for each row. This matches np.mean, which with axis=0 averages values by column, but np.delete with axis=0 is different: it deletes elements by row.
import numpy as np
arr = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
'''
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
'''
np.mean(arr, 0)
'''
array([5., 6., 7., 8.])
'''
np.delete(arr,1,axis=0)
'''
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
'''
Am I wrong in my understanding of this?
Why do np.mean and np.delete operate on different axes when axis=0 is given?
The accepted answer to the question you linked to actually says correctly that
Axis 0 is thus the first dimension (the "rows"), and axis 1 is the second dimension (the "columns")
which is what the code does, and is the opposite of what you said. This is most likely the source of your confusion. As we can see from your own example:
np.delete(arr,1,axis=0)
'''
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
'''
Row at index 1 is deleted, which is exactly what we want to happen.
This is a 2D example where we have rows and columns, but it is important to understand how shapes work in general; then they will make sense in higher dimensions. Consider the following example:
[
[
[1, 2],
[3, 4]
],
[
[5, 6],
[7, 8],
],
[
[9, 10],
[11, 12],
]
]
Here, we have 3 grids, each itself is 2x2, so we have something of shape 3x2x2. This is why we have 12 elements in total. Now, how do we know that at axis=0 we have 3 elements? Because if you look at this as a simple array and not some fancy numpy object then len(arr) == 3. Then if you take any of the elements along that axis (any of the "grids" that is), we will see that their length is 2 or len(arr[0]) == 2. That is because each of the grids has 2 rows. Finally, to check how many items each row of each of these grids has, we just have to inspect any one of these rows. Let's look at the second row of the first grid for a change. We will see that: len(arr[0][1]) == 2.
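Verifying those lengths in code, with the nested list above as a NumPy array:
import numpy as np

arr = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
])

print(arr.shape)       # (3, 2, 2)
print(len(arr))        # 3 -- the grids, along axis 0
print(len(arr[0]))     # 2 -- rows in each grid, along axis 1
print(len(arr[0][1]))  # 2 -- items in each row, along axis 2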
Now, what does np.mean(a, axis=0) mean? It means we will go over each of the items along axis=0 and find their mean. If these items are simply numbers (if a=np.array([1,2,3])) that's easy because the average of 1,2,3 is just the sum of these numbers divided by their quantity.
So, what if we have vectors or grids? What is the average of [2,4,6] and [0,0,0]? The convention is that the average of these two lists is a list of the averages at each index. In other words, it's:
[np.mean([2,0]), np.mean([4,0]), np.mean([6,0])]
which is trivially [1,2,3].
So, why does np.delete behave differently? Well, because the purpose of delete is to remove an element along some axis rather than to perform an aggregation over that axis. In this particular case we had 3 grids, so removing one of them simply leaves us with 2 grids. We could alternatively remove the second row of every grid (axis=1); that would leave us with 3 grids, but each would have only 1 row instead of 2.
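Continuing with the 3x2x2 array from above:
print(np.delete(arr, 1, axis=0).shape)  # (2, 2, 2) -- one whole grid removed
print(np.delete(arr, 1, axis=1).shape)  # (3, 1, 2) -- the second row of every grid removed
print(np.mean(arr, axis=0))             # element-wise average of the 3 grids
# [[5. 6.]
#  [7. 8.]]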
Hopefully, this brings some clarity :)
Usually I like to think about the axis in numpy (or pandas) as indicating the axis "along which" a computation is carried out.
In this sense, when you compute the mean along axis 0, that is, along the rows, you do it for each column. But if you delete along axis 0, it means you scroll along the rows to find the index you will delete.
I think your confusion possibly comes from the fact that in delete, the axis refers to the axis you index along when finding the section to delete, while in mean, the axis refers to the axis you average along.
In both cases, axis tells the function which axis to "move along" when performing its operation: for delete it moves along that axis when searching for what to delete, and for mean it moves along that axis when calculating averages.

Reduce size of array based on multiple column criteria in python

I need to reduce the size of an array based on criteria found in another array; I need to look at the relationships and change the values based on the new information. Here is a simplified version of my problem.
I have an array (or dataframe) with my data:
data = np.array([[[[1, 2, 3, 4], [5, 6, 7, 8]]]]).reshape((4,2))
I have another file, of different size, that holds information about the values in the data array:
a = np.array([[1, 1, 2],[2, 3, 4],[3, 5, 6], [4, 7, 8] ]).reshape((4,3))
The information in a tells me how I can reduce the size of data. For example, a[0] tells me that data[0][0:2] == a[0][1:],
so I can replace the pair data[0][0:2] with the single value a[0][0] (effectively reducing the size of the data array).
To clarify, array a holds three pieces of information per row; a[0] holds 1, 1, 2. I want to scan through the data array, and whenever a[i][1:] is equal to one of the pairs in data, replace that pair with the single value a[i][0]. Is that any clearer?
my final array should be like this:
new_format = np.array([[[[1, 2], [3,4]]]]).reshape((2,2))
There are questions like the following: Filtering a DataFrame based on multiple column criteria
but are only based on filtering based on certain numerical criteria.
I figured out a way to do it using the pandas library. Probably not the best solution, but it worked for me.
In my case I read the data with pandas, but for the posted example I can convert the arrays to DataFrames:
import pandas as pd

datas = pd.DataFrame(data)                     # convert to DataFrame; its columns are 0 and 1
az = pd.DataFrame(a)                           # its columns are 0, 1 and 2
datas = datas.rename(columns={0: 1, 1: 2})     # rename the (integer) columns so they line up with az
new_format = pd.merge(datas, az, how='right')  # match rows on the shared columns 1 and 2
new_format = new_format.drop([1, 2], axis=1)   # drop the old columns, keeping only the new values
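For the posted example, a minimal NumPy-only alternative (assuming every pair in data has exactly one matching row in a) is to build a lookup from each pair to its replacement value:
import numpy as np

lookup = {tuple(row[1:]): row[0] for row in a}  # {(1, 2): 1, (3, 4): 2, (5, 6): 3, (7, 8): 4}
new_format = np.array([lookup[tuple(pair)] for pair in data]).reshape((2, 2))
print(new_format)
# [[1 2]
#  [3 4]]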

How to efficiently create a sparse vector in python?

I have a dictionary of keys where each value should be a sparse vector of a huge size (~700000 elements, maybe more). How do I efficiently grow/build this data structure?
Right now my implementation works only for smaller sizes.
from collections import defaultdict

myvec = defaultdict(list)
for id in id_data:
    for item in item_data:
        if item in item_data[id]:
            myvec[id].append(item * 0.5)
        else:
            myvec[id].append(0)
The above code, when used with huge files, quickly eats up all the available memory. I tried removing the myvec[id].append(0) branch and storing only the non-zero values, since the length of each myvec[id] list is constant anyway. That worked on my huge test file with decent memory consumption, but I'd rather find a better way to do it.
I know that there are different types of sparse arrays/matrices for this purpose, but I have no intuition about which one is better. I tried using lil_matrix (from scipy.sparse) instead of my myvec dict, but it turned out to be much slower than the above code.
So the problem basically boils down to the following two questions:
Is it possible to create a sparse data structure on the fly in python?
How can one create such a sparse data structure with decent speed?
Appending to a list (or lists) will always be faster than appending to a numpy.array or to a sparse matrix (which stores its data in several numpy arrays). lil is supposed to be the fastest format when you have to grow a matrix incrementally, but it will still be slower than working directly with lists.
Numpy arrays have a fixed size. So the np.append function actually creates a new array by concatenating the old with the new data.
Your example code would be more useful if you gave us some data, so we could cut, paste and run it.
For simplicity lets define
data_dict=dict(one=[1,0,2,3,0,0,4,5,0,0,6])
Sparse matrices can be created directly from this with:
from scipy import sparse

M = sparse.coo_matrix(data_dict['one'])
whose attributes are:
data: array([1, 2, 3, 4, 5, 6])
row: array([0, 0, 0, 0, 0, 0], dtype=int32)
col: array([ 0, 2, 3, 6, 7, 10], dtype=int32)
or
sparse.lil_matrix(data_dict['one'])
data: array([[1, 2, 3, 4, 5, 6]], dtype=object)
rows: array([[0, 2, 3, 6, 7, 10]], dtype=object)
The coo version is a lot faster in my timings.
The sparse matrix only saves the nonzero data, but it also has to save an index. There is also a dictionary-of-keys format (dok_matrix), which uses a tuple (row, col) as the key.
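For completeness, a tiny illustration of that dictionary-of-keys scheme:
dok = sparse.dok_matrix((1, 11), dtype=int)
dok[0, 2] = 2         # the key is the (row, col) tuple
print(dok.toarray())  # [[0 0 2 0 0 0 0 0 0 0 0]]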
An example of incremental construction is:
llm = sparse.lil_matrix((1, 11), dtype=int)
for i in range(11):
    llm[0, i] = data_dict['one'][i]
For this small case this incremental approach is faster.
I get even better speed by only adding the nonzero terms to the sparse matrix:
llm = sparse.lil_matrix((1, 11), dtype=int)
for i in range(11):
    if data_dict['one'][i] != 0:
        llm[0, i] = data_dict['one'][i]
I can imagine adapting this to your defaultdict example. Instead of myvec[id].append(0), you keep a record of where you appended the item * 0.5 values (whether in a separate list, or via a lil_matrix). It would take some experimenting to adapt this idea to a default dictionary.
So basically the goal is to create 2 lists:
data = [1, 2, 3, 4, 5, 6]
cols = [ 0, 2, 3, 6, 7, 10]
Whether you create a sparse matrix from these or not depends on what else you need to do with the data.
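As a sketch of that idea adapted to the question's loop -- id_data and item_data are placeholders taken from the original (somewhat ambiguous) code -- you could collect only the nonzero entries and build one sparse matrix at the end:
from scipy import sparse

data, rows, cols = [], [], []
for r, id in enumerate(id_data):
    for c, item in enumerate(item_data):
        if item in item_data[id]:  # same condition as the original loop
            data.append(item * 0.5)
            rows.append(r)
            cols.append(c)

myvec = sparse.csr_matrix((data, (rows, cols)),
                          shape=(len(id_data), len(item_data)))
Lists grow cheaply, and the single csr_matrix construction at the end avoids the per-element cost of growing a sparse matrix.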

Slicing of a NumPy 2d array, or how do I extract an mxm submatrix from an nxn array (n>m)?

I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4x4 and I want to extract a 2x2 array from it.
Here is our array:
import numpy as np
x = np.arange(16).reshape(4, 4)
print(x)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
The rows and columns to remove are the same. The easiest case is when I want to extract a 2x2 submatrix that is at the beginning or at the end, i.e.:
In [33]: x[0:2,0:2]
Out[33]:
array([[0, 1],
[4, 5]])
In [34]: x[2:,2:]
Out[34]:
array([[10, 11],
[14, 15]])
But what if I need to remove a different mixture of rows/columns? What if I need to remove the first and third rows and columns, thus extracting the submatrix [[5,7],[13,15]]? There can be any combination of rows and columns. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn't seem to work:
In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])
I found one way, which is:
In [61]: x[[1,3]][:,[1,3]]
Out[61]:
array([[ 5, 7],
[13, 15]])
The first issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I'd certainly like to hear it.
The other thing is that I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, so when dealing with large arrays this could become a problem. Why is that so / how does this mechanism work?
To answer this question, we have to look at how indexing a multidimensional array works in Numpy. Let's first say you have the array x from your question. The buffer assigned to x will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j], NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating in effect i*x.shape[1]+j (and multiplying with the size of an int to get an actual memory offset).
If you extract a subarray by basic slicing like y = x[0:2,0:2], the resulting object will share the underlying buffer with x. But what happens if you access y[i,j]? NumPy can't use i*y.shape[1]+j to calculate the offset into the array, because the data belonging to y is not consecutive in memory.
NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j], what is actually calculated is i*x.strides[0]+j*x.strides[1] (and this already includes the factor for the size of an int):
x.strides
(16, 4)
When y is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y would just be equal to x). The new array object will have a different shape than x and maybe a different starting offset into the buffer, but will share the strides with x (in this case at least):
y.shape
(2,2)
y.strides
(16, 4)
This way, computing the memory offset for y[i,j] will yield the correct result.
But what should NumPy do for something like z=x[[1,3]]? The strides mechanism won't allow correct indexing if the original buffer is used for z. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn't be a really lightweight object anymore.
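A quick way to see the view/copy distinction is np.shares_memory, which reports whether two arrays overlap in memory:
import numpy as np

x = np.arange(16).reshape(4, 4)
y = x[0:2, 0:2]  # basic slicing: a view on the same buffer
z = x[[1, 3]]    # advanced (fancy) indexing: a fresh copy

print(np.shares_memory(x, y))  # True
print(np.shares_memory(x, z))  # False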
This is covered in depth in the NumPy documentation on indexing.
Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:
x[[[1],[3]],[1,3]]
This is because the index arrays are broadcasted to a common shape.
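Spelled out: the row index has shape (2, 1) and the column index has shape (2,), so broadcasting pairs them up as (1,1), (1,3), (3,1), (3,3):
rows = np.array([[1], [3]])  # shape (2, 1)
cols = np.array([1, 3])      # shape (2,)
print(x[rows, cols])
# [[ 5  7]
#  [13 15]]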
Of course, for this particular example, you can also make do with basic slicing:
x[1::2, 1::2]
As Sven mentioned, x[[[0],[2]],[1,3]] will give back rows 0 and 2 matched with columns 1 and 3, while x[[0,2],[1,3]] will return just the values x[0,1] and x[2,3] in a 1-D array.
There is a helpful function for doing the first example I gave, numpy.ix_. You can do the same thing as my first example with x[numpy.ix_([0,2],[1,3])]. This can save you from having to type all of those extra brackets.
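For example:
print(x[numpy.ix_([0, 2], [1, 3])])
# [[ 1  3]
#  [ 9 11]]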
I don't think x[[1,3]][:,[1,3]] is hard to read. If you want to be more explicit about your intent, you can write:
x[[1,3],:][:,[1,3]]
I am not an expert in slicing, but typically, if you slice into an array and the selected values are contiguous (a constant stride apart), you get back a view with an adjusted stride.
e.g. in your inputs 33 and 34, although you get a 2x2 array, the row stride still spans 4 elements. Thus, when you index the next row, the pointer moves to the correct position in memory.
Clearly, this mechanism doesn't carry over to the case of an array of arbitrary indices. Hence, NumPy has to make a copy. After all, many other matrix math functions rely on the size, strides and contiguous memory allocation.
If you want to skip every other row and every other column, then you can do it with basic slicing:
In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]:
array([[ 5, 7],
[13, 15]])
This returns a view, not a copy of your array.
In [51]: y=x[1:4:2,1:4:2]
In [52]: y[0,0]=100
In [53]: x # <---- Notice x[1,1] has changed
Out[53]:
array([[ 0, 1, 2, 3],
[ 4, 100, 6, 7],
[ 8, 9, 10, 11],
[ 12, 13, 14, 15]])
while z=x[(1,3),:][:,(1,3)] uses advanced indexing and thus returns a copy:
In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]
In [60]: z
Out[60]:
array([[ 5, 7],
[13, 15]])
In [61]: z[0,0]=0
Note that x is unchanged:
In [62]: x
Out[62]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
If you wish to select arbitrary rows and columns, then you can't use basic slicing. You'll have to use advanced indexing, using something like x[rows,:][:,columns], where rows and columns are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a numpy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).
With numpy, you can pass a slice for each component of the index - so, your x[0:2,0:2] example above works.
If you just want to evenly skip columns or rows, you can pass slices with three components
(i.e. start, stop, step).
Again, for your example above:
>>> x[1:4:2, 1:4:2]
array([[ 5, 7],
[13, 15]])
Which is basically: slice in the first dimension, starting at index 1, stopping when the index is equal to or greater than 4, and adding 2 to the index at each step. The same applies to the second dimension. Again: this only works for constant steps.
The syntax you used does something quite different internally: x[[1,3]][:,[1,3]] actually creates a new array including only rows 1 and 3 from the original array (the x[[1,3]] part), and then indexes that again, creating a third array including only columns 1 and 3 of the previous one.
I have a similar question here: Writting in sub-ndarray of a ndarray in the most pythonian way (Python 2).
Following the solution from that post, for your case it looks like this:
columns_to_keep = [1,3]
rows_to_keep = [1,3]
And using ix_:
x[np.ix_(rows_to_keep, columns_to_keep)]
Which is:
array([[ 5, 7],
[13, 15]])
I'm not sure how efficient this is, but you can use range() to index both axes (this is still advanced indexing, so it returns a copy). To select rows and columns 1 and 3:
x = np.arange(16).reshape((4, 4))
x[range(1, 4, 2), :][:, range(1, 4, 2)]
