Updating a NumPy array by adding columns - python

I am working with a large dataset and I would like to make a new array by adding columns, updating the array by opening a new file, taking a piece from it and adding this to my new array.
I have already tried the following code:
import numpy as np
Powers = np.array([])
with open('paths powers.tex', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            Pname = data[0:32446,0]
            Powers = np.append(Powers,Pname, axis = 1)
np.savetxt("Powers.txt", Powers)
However, what it does here is just adding the stuff from Pname in the bottom of the array, making a large 1D array instead of adding new columns and making an ndarray.
I have also tried this with numpy.insert, numpy.hstack and numpy.concatenate and I tried changing the shape of Pname. Unfortunately, they all give me the same result.

Have you tried numpy.column_stack?
Powers = np.column_stack([Powers,Pname])
However, the array is empty first, so make sure that the array isn't empty before concatenating or you will get a dimension mismatch error:
import numpy as np
Powers = np.array([])
with open('paths powers.tex', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            Pname = data[0:32446,0]
            if len(Powers) == 0:
                Powers = Pname[:,None]
            else:
                Powers = np.column_stack([Powers,Pname])
np.savetxt("Powers.txt", Powers)
len(Powers) returns the number of rows in Powers. At the start this is 0, so on the first iteration the condition is true and Powers has to be set explicitly to a one-column 2D array holding the first column of your file. Powers = Pname[:,None] does this, and is the same as Powers = Pname[:,np.newaxis]: it turns a 1D array into a 2D array with a singleton column. The underlying problem is that 1D arrays in numpy are agnostic of whether they are rows or columns, so you must explicitly convert them into columns before appending. numpy.column_stack takes care of that for you.
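As a small illustration of the shapes involved (the array values here are made up for the example), a 1D array gains a singleton column with [:, None], and column_stack accepts 1D inputs as columns directly:

```python
import numpy as np

col = np.array([1.0, 2.0, 3.0])   # 1D array, shape (3,)
as_column = col[:, None]          # 2D array, shape (3, 1); same as col[:, np.newaxis]

# column_stack treats each 1D input as a column, so no manual reshape is needed
stacked = np.column_stack([as_column, np.array([4.0, 5.0, 6.0])])

print(col.shape, as_column.shape, stacked.shape)
```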
However, you'll also need to make sure that Powers is a 2D array with one column the first time the loop iterates. Should you not want to use numpy.column_stack, you can still certainly use numpy.append, but make sure that what you're concatenating to the array is a column. The [:,None] indexing described above does this:
import numpy as np
Powers = np.array([])
with open('paths powers.tex', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            Pname = data[0:32446,0]
            if len(Powers) == 0:
                Powers = Pname[:,None]
            else:
                Pname = Pname[:,None]
                Powers = np.append(Powers, Pname, axis=1)
np.savetxt("Powers.txt", Powers)
The Pname = Pname[:,None] statement in the else branch ensures that the new data becomes a 2D array with a singleton column before concatenating.
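A possible alternative, sketched here under the assumption that all columns fit in memory at once, is to collect the columns in a Python list and call np.column_stack a single time, which avoids the empty-array special case and the repeated copying. The helper name stack_first_columns and the in-memory tables are made up for the example; in the real script each table would come from np.loadtxt.

```python
import numpy as np

def stack_first_columns(tables, n_rows=None):
    """Return a 2D array whose columns are the first column of each table."""
    # n_rows=None keeps every row; pass e.g. 32446 to mirror data[0:32446, 0]
    columns = [np.asarray(t)[:n_rows, 0] for t in tables]
    return np.column_stack(columns)

# Stand-ins for the arrays np.loadtxt would return for three files
tables = [np.arange(12.0).reshape(4, 3) + 100 * k for k in range(3)]

powers = stack_first_columns(tables)
print(powers.shape)
```

Each array is copied only once at the end, instead of the whole result being reallocated on every iteration.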

Related

How to concatenate numpy arrays to create a 2d numpy array

I'm working on using AI to give me better odds at winning Keno. (don't laugh lol)
My issue is that when I gather my data it comes in the form of 1d arrays of drawings at a time. I have different files that have gathered the data and formatted it as well as performed simple maths on the data set. Now I'm trying to get the data into a certain shape for my Neural Network layers and am having issues.
formatted_list = file.readlines()
#remove newline chars
formatted_list = list(filter(("\n").__ne__, formatted_list))
#iterate through each drawing, format the ends and split into list of ints
for i in formatted_list:
    i = i[1:]
    i = i[:-2]
    i = [int(j) for j in i.split(",")]
    #convert to numpy array
    temp = np.array(i)
    #t1 = np.reshape(temp, (-1, len(temp)))
    #print(np.shape(t1))
    #append to master list
    master_list.append(temp)
print(np.shape(master_list))
This gives output of "(292,)", which is correct in that there are 292 rows of data; however, each row contains 20 columns as well. If I comment in the "#t1 = np.reshape(temp, (-1, len(temp))) #print(np.shape(t1))" lines, it gives output of "(1,20)(1,20)(1,20)(1,20)(1,20)(1,20)(1,20)(1,20)", etc. I want all of those rows to be stacked together while keeping the columns the same, giving (292,20). How can this be accomplished?
I've tried reshaping the final list and many other things and had no luck. It just flattens every number of every row into the first dimension, i.e. (5840,). I was expecting to be able to append each new drawing to a master list, convert it to a numpy array and reshape it to 292 rows of 20 columns. It just appears that it wants to keep the single dimension. I've tried numpy.concat also and no luck. Thank you.
You can use vstack to concatenate your master_list.
master_list = []
for array in formatted_list:
    master_list.append(array)
master_array = np.vstack(master_list)
Alternatively, if you know the length of your formatted_list containing the arrays and array length you can just preallocate the master_array.
import numpy as np
formatted_list = [np.random.rand(20)]*292
master_array = np.zeros((len(formatted_list), len(formatted_list[0])))
for i, array in enumerate(formatted_list):
    master_array[i,:] = array
** Edit **
As mentioned by hpaulj in the comments, np.array(), np.stack() and np.vstack() worked with this input and produced a numpy array with shape (7,20).
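To make the point concrete, here is a minimal sketch of the whole pipeline from text lines to a 2D array. The bracketed line format and the variable raw_lines are assumptions inferred from the slicing in the question, not the actual file contents, and the rows are shortened to four numbers to keep the example small.

```python
import numpy as np

# Hypothetical stand-in for file.readlines(); blank lines mimic the original file
raw_lines = ["[1,2,3,4]\n", "\n", "[5,6,7,8]\n", "[9,10,11,12]\n"]

rows = []
for line in raw_lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # drop the surrounding brackets, then split into ints
    rows.append([int(tok) for tok in line.strip("[]").split(",")])

# A list of equal-length rows converts directly to a 2D array
master_array = np.array(rows)
print(master_array.shape)
```

With 292 drawings of 20 numbers each, the same code would give shape (292, 20), provided every row has the same length.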

Appending numpy array of arrays

I am trying to append an array to another array, but it's appending them as if it was just one array. What I would like is to have each array appended at its own index (without having to use a list, I want to use np arrays), i.e.
temp = np.array([])
for i in my_items:
    m = get_item_ids(i.color) #returns an array as [1,4,20,5,3] (always same number of items but diff ids)
    temp = np.append(temp, m, axis=0)
On the second iteration lets suppose i get [5,4,15,3,10]
then i would like to have temp as
array([1,4,20,5,3][5,4,15,3,10])
But instead i keep getting [1,4,20,5,3,5,4,15,3,10]
I am new to python but i am sure there is probably a way to concatenate in this way with numpy without using lists?
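You have to reshape m so that it has two dimensions, with
m.reshape(-1, 1)
thus adding the second dimension. Then you can concatenate along axis=1. Note that np.concatenate takes the arrays as a single sequence, and temp must already be 2D with the same number of rows as m:
np.concatenate((temp, m), axis=1)
This stacks each m as a column. To stack them as rows, as in the question, use m.reshape(1, -1) and axis=0 instead.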
List append is much better - faster and easier to use correctly.
temp = []
for i in my_items:
    m = get_item_ids(i.color) #returns an array as [1,4,20,5,3] (always same number of items but diff ids)
    temp.append(m)
Look at the list to see what it created. Then make an array from that:
arr = np.array(temp)
# or np.vstack(temp)
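Both approaches can be compared in a short sketch. The get_item_ids function below is a made-up stand-in that returns five random ids, since the original function is not shown.

```python
import numpy as np

def get_item_ids(seed):
    # Hypothetical stand-in for the asker's function: always returns 5 ids
    rng = np.random.default_rng(seed)
    return rng.integers(0, 25, size=5)

# List approach: collect the rows, convert once at the end
rows = [get_item_ids(seed) for seed in range(3)]
arr = np.vstack(rows)  # shape (3, 5)

# Array-only approach: start from an empty (0, 5) array and add rows on axis 0
temp = np.empty((0, 5), dtype=arr.dtype)
for row in rows:
    temp = np.concatenate((temp, row.reshape(1, -1)), axis=0)

print(arr.shape, temp.shape)
```

The two results are identical, but the array-only loop copies the whole array on every iteration, which is why the list version is usually preferred.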

How to use np.unique on big arrays?

I work with geospatial images in tif format. Thanks to the rasterio lib I can exploit these images as numpy arrays of dimension (nb_bands, x, y). Here I manipulate an image that contains patches of unique values that I would like to count. (they were generated with the scipy.ndimage.label function).
My idea was to use the unique method of numpy to retrieve the information from these patches as follows:
# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)
class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
del mask_raster
# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)
src_flat = src_raster.flatten()
del src_raster
values = [src_flat[index] for index in indices]
df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})
My problem is this:
For an image of shape (69940, 70936) (84.7 MB on my disk), np.unique tries to allocate an array of the same shape with a 64-bit integer dtype and I get the following error:
Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64
Is it normal that unique converts my array to int64?
Is it possible to force it to use a more optimal format? (even if all my patches were 1 pixel, np.int32 would be sufficient)
Is there another solution using a function I don't know?
The uint64 array is probably allocated during the argsort step in the np.unique source code.
Since the labels from scipy.ndimage.label are consecutive integers starting at zero, you can use numpy.bincount. It only accepts 1D input, so flatten the raster first (ravel returns a view when possible):
num_features = np.max(mask_raster)
count = np.bincount(mask_raster.ravel(), minlength=num_features+1)
To get values from src you can do the following assignment. It's really inefficient but I don't think it allocates too much memory.
values = np.zeros(num_features+1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
Maybe scipy.ndimage has a function that better suits the use case.
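A tiny worked example of the bincount idea, using hand-written 3x3 arrays in place of the real rasters (the values are invented for illustration, with label 0 as background):

```python
import numpy as np

# Label image of the kind scipy.ndimage.label returns (0 = background)
mask_raster = np.array([[0, 1, 1],
                        [2, 2, 0],
                        [2, 0, 3]], dtype=np.int32)
# Source image: every pixel of a patch carries the same value
src_raster = np.array([[0, 7, 7],
                       [9, 9, 0],
                       [9, 0, 4]], dtype=np.uint8)

num_features = int(mask_raster.max())
# Pixel count per label; bincount needs a 1D input
count = np.bincount(mask_raster.ravel(), minlength=num_features + 1)

# One value per label: each pixel writes its source value into its label's slot
values = np.zeros(num_features + 1, dtype=src_raster.dtype)
values[mask_raster] = src_raster

print(count, values)
```

This only gives a well-defined value per label when all pixels of a patch share the same source value, as assumed here.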
I think splitting the NumPy array into smaller chunks and accumulating the unique:count values per chunk would be a memory-efficient solution, as would changing the data type to int16 or similar.
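A rough sketch of the chunked idea, assuming the raster can be sliced by blocks of rows; the function name chunked_counts and the chunk size are made up for the example. Only one block is passed to np.unique at a time, so the temporary arrays stay small.

```python
from collections import Counter

import numpy as np

def chunked_counts(arr, rows_per_chunk=1024):
    """Count occurrences of each value, processing a few rows at a time."""
    totals = Counter()
    for start in range(0, arr.shape[0], rows_per_chunk):
        block = arr[start:start + rows_per_chunk]
        vals, cnts = np.unique(block, return_counts=True)
        # merge this block's counts into the running totals
        for v, c in zip(vals.tolist(), cnts.tolist()):
            totals[v] += c
    return totals

labels = np.array([[0, 1, 1],
                   [2, 2, 0],
                   [2, 0, 3]])
print(chunked_counts(labels, rows_per_chunk=2))
```

The same loop could be driven by windowed reads from the file instead of slices of an in-memory array, so the full raster never has to be loaded.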
I dug into the scipy.ndimage lib and did find a solution that avoids the memory explosion.
Since it only slices the initial raster, it is faster than I thought:
from scipy import ndimage
import numpy as np
import pandas as pd
import rasterio as rio
# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)
# use patches as slicing material
# find_objects returns one slice per label 1..max, so include the last label
indices = list(range(1, np.max(mask_raster) + 1))
counts = []
values = []
for i, loc in enumerate(ndimage.find_objects(mask_raster)):
    loc_values, loc_counts = np.unique(mask_raster[loc], return_counts=True)
    # the value of the patch is the value with the highest count
    idx = np.argmax(loc_counts)
    counts.append(loc_counts[idx])
    values.append(loc_values[idx])
df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})

using python to read a column vector from excel

OK, I think this must be a super simple thing to do, but I keep getting index error messages no matter how I try to format this. My professor is making us multiply a 1x3 row vector by a 3x1 column vector, and I can't get Python to read the column vector. The row vector is in cells A1-C1, and the column vector is in cells A3-A5 of my Excel spreadsheet. I am using the "format" he wants us to use (if I do something that works but isn't formatted the way he likes, I don't get credit). The row vector reads properly in the variable explorer, but I am only getting a 2x2 column vector (with the first column being the 0th column and being all zeros, again how he wants it). I haven't even gotten to the multiplication part of the code because I can't get Python to read the column vector correctly. Here is the code:
import xlwings as xw
import numpy as np
filename = 'C:\\python\\homework4.xlsm'
wb=xw.Workbook(filename)
#initialize vectors
a = np.zeros((1+1,3+1))
b = np.zeros((3+1,1+1))
n=3
#Read a and b vectors from excel
for i in range(1,n+1):
    for j in range(1,n+1):
        a[i,j] = xw.Range((i,j)).value
    'end j'
    b[i,j] = xw.Range((i+2,j)).value
'end i'
Something like this should work. The way you iterate over i and j is wrong (as is the initialization of a and b).
#initialize vectors
a = np.zeros((1,3))
b = np.zeros((3,1))
n=3
#Read a and b vectors from excel
for i in range(0,n):
    a[0,i] = xw.Range((1,i+1)).value
for i in range(0,n):
    b[i,0] = xw.Range((3+i,1)).value
Remember, Python uses 0-based indexing and Excel uses 1-based indexing.
This code will read out the vectors properly, and then you can look up numpy's dot product (np.dot) to do the multiplication. You can also assign the whole vectors immediately without a loop.
import xlwings as xw
import numpy as np
filename = 'C:\\Temp\\Book2.xlsx'
wb=xw.Book(filename).sheets[0]
n=3
#initialize vectors
a = np.zeros((1,n))
b = np.zeros((n,1))
#Read a and b vectors from excel
for j in range(1,n+1):
    a[0, j-1] = wb.range((1, j)).value
    b[j-1, 0] = wb.range((j+3-1, 1)).value
#Without loop
a = wb.range((1, 1),(1, 3)).value
b = wb.range((3, 1),(5, 1)).value
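Once the values are read, the multiplication itself could look like the sketch below. The numbers are invented stand-ins for cells A1:C1 and A3:A5, and the reshape calls assume the values come back from the sheet as flat lists (which I believe is what xlwings does by default for a single row or column, but check this on your setup).

```python
import numpy as np

# Hypothetical values standing in for cells A1:C1 and A3:A5
row_values = [1.0, 2.0, 3.0]
col_values = [4.0, 5.0, 6.0]

a = np.array(row_values).reshape(1, -1)  # row vector, shape (1, 3)
b = np.array(col_values).reshape(-1, 1)  # column vector, shape (3, 1)

product = np.dot(a, b)  # (1, 3) times (3, 1) gives shape (1, 1)
print(product.shape, product[0, 0])
```

Keeping both operands 2D makes the row/column distinction explicit, which matches the 1x3 by 3x1 framing of the assignment.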

Fastest way to get elements from a numpy array and create a new numpy array

I have numpy array called data of dimensions 150x4
I want to create a new numpy array called mean of dimensions 3x4 by choosing random elements from data.
My current implementation is:
cols = (data.shape[1])
K=3
mean = np.zeros((K,cols))
for row in range(K):
    index = np.random.randint(data.shape[0])
    for col in range(cols):
        mean[row][col] = data[index][col]
Is there a faster way to do the same?
You can specify the number of random integers to draw in numpy.random.randint (the third argument, size). Also, you should be familiar with numpy array index notation. Here, you can access all the elements of a row with the : specifier.
mean = data[np.random.randint(0,len(data),3),:]
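A self-contained version of the one-liner, with random data standing in for the real 150x4 array. Note that randint can draw the same row more than once; if the rows must be distinct, Generator.choice with replace=False is one option (shown here as an addition, not part of the original answer).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((150, 4))  # stand-in for the real 150x4 array
K = 3

# Rows picked with replacement, as in the one-liner above
idx = np.random.randint(0, len(data), K)
mean = data[idx, :]

# Variant that guarantees K distinct rows
idx_unique = rng.choice(len(data), size=K, replace=False)
mean_unique = data[idx_unique]

print(mean.shape, mean_unique.shape)
```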
