Iterate over numpy array to fill a python list - python

I'm iterating over a numpy array to apply a function through each element and add the new value to a list so I can keep the original data.
The problem is: it's kinda slow.
Is there a better way to do this (without changing the original array)?
import numpy as np
original_data = np.arange(0,16000, dtype = np.float32)
new_data = [i/max(original_data) for i in original_data]
print('done')

You could simply do:
new_data = original_data/original_data.max()
Numpy already performs this operation element-wise.
In your code there is an extra source of slowness: each call to max(original_data) iterates over every element of original_data, making the total cost O(n^2).
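For comparison, here is a minimal sketch of both the partial fix (computing the max only once) and the fully vectorized version; the original array is left untouched in either case:
import numpy as np
original_data = np.arange(0, 16000, dtype=np.float32)
# still a Python loop, but the max is computed only once, so the cost is O(n)
m = original_data.max()
new_list = [i / m for i in original_data]
# vectorized: numpy divides element-wise and returns a new array
new_data = original_data / original_data.max()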

Related

Numpy Arrays in python

I have this code that contains a for loop to print out the result for me.
Could I convert this to a NumPy array instead of the for loop, and use less memory?
categorical__unique = df.select_dtypes(['object']).columns
for col in categorical__unique:
    print('{} : {} unique value(s)'.format(col, df[col].nunique()))
I am trying to turn this categorical__unique value into an array and then use NumPy array functions instead of the for loop.
To convert a DataFrame into a NumPy array you can use the "to_numpy" method from pandas:
categorical_unique = df.select_dtypes(['object']).to_numpy()
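Note that if the goal is just the per-column unique counts, a possible loop-free alternative (not part of the original answer, just a sketch using pandas directly) would be:
# Series of unique-value counts, indexed by column name
unique_counts = df.select_dtypes(['object']).nunique()
print(unique_counts)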

Is there a numpy function to find an array in multi dimensional array?

I have a numpy array with n rows and p columns.
I want to check if a given row is in my array and find the index.
For example I have a numpy array like this:
[[1,0,8,7,2,2],[1,3,7,0,3,0],[1,7,1,0,1,0],[1,9,1,0,6,0],[1,8,1,7,9,0],....]
I want to check whether the array [6,0,5,8,2,1] is in my numpy array and, if so, where.
Is there a numpy function for that ?
I'm sorry for asking a naive question but I'm quite confused right now.
You can use == and .all(axis=1) to match entire rows, then use numpy.where() to get the index:
import numpy as np
a = np.array([[1,0,8,7,2,2],[1,3,7,0,3,0],[1,7,1,0,1,0],[1,9,1,0,6,0],[1,8,1,7,9,0], [6,0,5,8,2,1]])
b = np.array([6,0,5,8,2,1])
print(np.where((a==b).all(axis=1)))
Output:
(array([5], dtype=int32),)
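If the row is not present, np.where just returns an empty index array, so a small check (a sketch reusing the a and b defined above) could look like:
matches = np.where((a == b).all(axis=1))[0]
if matches.size:
    print('found at row', matches[0])
else:
    print('row not found')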

How to access random indices from h5 data set?

I have some h5 data that I want to sample from by using some randomly generated indices. However, if the indices are not in increasing order, the read fails. Is it possible to select indices, that have been generated randomly, from h5 data sets?
Here is an MWE showing the error:
import h5py
import numpy as np
arr = np.random.random(50).reshape(10,5)
with h5py.File('example1.h5', 'w') as h5fw:
    h5fw.create_dataset('data', data=arr)
random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
# TypeError: Indexing elements must be in increasing order
I could sort the indices, but then we lose the randomness component.
As hpaulj mentioned, random indices aren't a problem for numpy arrays in memory. So, yes, it's possible to select data with randomly generated indices from h5 data sets read into numpy arrays. The key is having sufficient memory to hold the dataset in memory. The code below shows how to do this:
#random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
arr = h5py.File('example1.h5', 'r')['data'][:]
random_subset = arr[[3,1]]
A potential solution is to pre-sort the desired indices as follows:
idx = np.sort([3,1])
random_subset = h5py.File('example1.h5', 'r')['data'][idx]
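If you want to keep the originally requested random order while still reading with sorted indices, one option (a sketch building on the example above) is to undo the sort afterwards:
import numpy as np
import h5py
idx = np.array([3, 1])                       # randomly generated indices
order = np.argsort(idx)                      # permutation that sorts idx
with h5py.File('example1.h5', 'r') as f:
    sorted_subset = f['data'][np.sort(idx)]  # h5py requires increasing indices
# invert the permutation to recover the requested (random) order
random_subset = sorted_subset[np.argsort(order)]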

Appending numpy array of arrays

I am trying to append an array to another array, but it appends them as if they were just one array. What I would like is to have each array appended at its own index (without having to use a list; I want to use np arrays), i.e.
temp = np.array([])
for i in my_items:
    m = get_item_ids(i.color)  # returns an array like [1,4,20,5,3] (always the same number of items but different ids)
    temp = np.append(temp, m, axis=0)
On the second iteration, let's suppose I get [5,4,15,3,10].
Then I would like temp to be
array([[1,4,20,5,3],[5,4,15,3,10]])
But instead I keep getting [1,4,20,5,3,5,4,15,3,10].
I am new to Python, but I am sure there is probably a way to concatenate in this way with numpy without using lists?
You have to reshape m so it has two dimensions, with
m = m.reshape(1, -1)
thus adding the second dimension (one row per call). Then you can concatenate along axis=0 (note that np.concatenate takes a tuple of arrays, and temp should start as an empty 2-D array such as np.empty((0, 5))):
temp = np.concatenate((temp, m), axis=0)
List append is much better - faster and easier to use correctly.
temp = []
for i in my_items:
    m = get_item_ids(i.color)  # returns an array like [1,4,20,5,3] (always the same number of items but different ids)
    temp.append(m)
Look at the list to see what it created. Then make an array from that:
arr = np.array(temp)
# or np.vstack(temp)
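For illustration, a minimal self-contained sketch (the two id lists are stand-ins for what get_item_ids would return):
import numpy as np
ids_per_item = [[1, 4, 20, 5, 3], [5, 4, 15, 3, 10]]  # hypothetical get_item_ids results
temp = []
for m in ids_per_item:
    temp.append(m)
arr = np.array(temp)   # or np.vstack(temp)
print(arr.shape)       # (2, 5): one row per item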

How to use np.unique on big arrays?

I work with geospatial images in tif format. Thanks to the rasterio lib I can exploit these images as numpy arrays of dimension (nb_bands, x, y). Here I manipulate an image that contains patches of unique values that I would like to count (they were generated with the scipy.ndimage.label function).
My idea was to use the unique method of numpy to retrieve the information from these patches as follows:
# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)
class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
del mask_raster

# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)
src_flat = src_raster.flatten()
del src_raster

values = [src_flat[index] for index in indices]
df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})
My problem is this:
For an image of shape (69940, 70936) (84.7 MB on my disk), np.unique tries to allocate an array of the same dimensions in uint64 and I get the following error:
Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64
Is it normal that unique recasts my raster to uint64?
Is it possible to force it to use a more compact dtype? (Even if all my patches were 1 pixel, np.int32 would be sufficient.)
Is there another solution using a function I don't know?
The uint64 array is probably allocated during argsort here in the source code.
Since the labels from scipy.ndimage.label are consecutive integers starting at zero you can use numpy.bincount:
num_features = np.max(mask_raster)
# bincount expects a 1-D array, hence the ravel()
count = np.bincount(mask_raster.ravel(), minlength=num_features + 1)
To get values from src you can do the following assignment. It's really inefficient but I don't think it allocates too much memory.
values = np.zeros(num_features+1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
Maybe scipy.ndimage has a function that better suits the use case.
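For example, a sketch (assuming mask_raster and src_raster as read in the question) using scipy.ndimage.sum, which aggregates per label without building a huge intermediate array:
from scipy import ndimage
import numpy as np
labels = np.arange(1, mask_raster.max() + 1)
# pixel count per patch: sum an array of ones over each label
patch_sizes = ndimage.sum(np.ones_like(mask_raster), labels=mask_raster, index=labels)
# sum of the source pixels per patch (divide by patch_sizes for a per-patch mean)
patch_values = ndimage.sum(src_raster, labels=mask_raster, index=labels)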
I think splitting the NumPy array into smaller chunks and yielding unique/count values would be a memory-efficient solution as well, as would changing the data type to int16 or similar.
I dug into the scipy.ndimage lib and did indeed find a solution that avoids the memory explosion. Since it slices the initial raster, it is also faster than I thought:
from scipy import ndimage
import numpy as np
import rasterio as rio
import pandas as pd

# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)

# use patches as slicing material
indices = [i for i in range(1, np.max(mask_raster) + 1)]
counts = []
values = []
for i, loc in enumerate(ndimage.find_objects(mask_raster)):
    loc_values, loc_counts = np.unique(mask_raster[loc], return_counts=True)
    # the value of the patch is the value with the highest count
    idx = np.argmax(loc_counts)
    counts.append(loc_counts[idx])
    values.append(loc_values[idx])

df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})
