How to concatenate numpy arrays to create a 2d numpy array - python

I'm working on using AI to give me better odds at winning Keno. (don't laugh lol)
My issue is that when I gather my data it comes in the form of 1d arrays of drawings at a time. I have different files that have gathered the data and formatted it as well as performed simple maths on the data set. Now I'm trying to get the data into a certain shape for my Neural Network layers and am having issues.
formatted_list = file.readlines()
#remove newline chars
formatted_list = list(filter(("\n").__ne__, formatted_list))
#iterate through each drawing, format the ends and split into list of ints
for i in formatted_list:
    i = i[1:]
    i = i[:-2]
    i = [int(j) for j in i.split(",")]
    #convert to numpy array
    temp = np.array(i)
    #t1 = np.reshape(temp, (-1, len(temp)))
    #print(np.shape(t1))
    #append to master list
    master_list.append(temp)
print(np.shape(master_list))
This outputs "(292,)". The 292 is correct, since there are 292 rows of data, but each row also contains 20 columns. If I uncomment the two lines "#t1 = np.reshape(temp, (-1, len(temp)))" and "#print(np.shape(t1))", the output is "(1,20)(1,20)(1,20)(1,20)...", one per drawing. I want all of those rows stacked together with the columns kept the same, giving shape (292,20). How can this be accomplished?
I've tried reshaping the final list and many other things with no luck. The result just puts every number from every row into the first dimension, i.e. (5840,). I was expecting to be able to append each new drawing to a master list, convert it to a numpy array, and reshape it to 292 rows of 20 columns, but it appears to want to keep the single dimension. I've also tried numpy.concat with no luck. Thank you.

You can use vstack to concatenate your master_list.
master_list = []
for array in formatted_list:
    master_list.append(array)
master_array = np.vstack(master_list)
Alternatively, if you know the number of arrays in formatted_list and the length of each array, you can preallocate master_array.
import numpy as np
formatted_list = [np.random.rand(20)]*292
master_array = np.zeros((len(formatted_list), len(formatted_list[0])))
for i, array in enumerate(formatted_list):
    master_array[i,:] = array
** Edit **
As mentioned by hpaulj in the comments, np.array(), np.stack() and np.vstack() worked with this input and produced a numpy array with shape (7,20).
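As a minimal sketch of the stacking step, assuming synthetic data in place of the Keno file (the random drawings and the names `rng`, `stacked`, `as_array`, and `row_lengths` are illustrative, not from the post), a list of equal-length 1D arrays stacks into a 2D array like this. One plausible cause of a `(292,)` shape, not confirmed by the post, is rows of unequal length, so the sketch also collects the row lengths as a quick check.

```python
import numpy as np

# Synthetic stand-in for the parsed drawings: 292 rows of 20 integers each.
rng = np.random.default_rng(0)
master_list = [rng.integers(1, 81, size=20) for _ in range(292)]

stacked = np.vstack(master_list)   # rows stacked along axis 0
as_array = np.array(master_list)   # same result when all rows have equal length

print(stacked.shape)   # (292, 20)

# Rows of unequal length cannot form a rectangular array; collecting the
# distinct lengths is one way to spot offending lines in real data.
row_lengths = {len(row) for row in master_list}
```

If `row_lengths` contains more than one value for real data, the parsing step is likely dropping or adding a number on some lines.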

Related

Appending numpy array of arrays

I am trying to append an array to another array, but it is appending them as if they were just one array. What I would like is to have each array appended at its own index (without having to use a list; I want to use np arrays), i.e.
temp = np.array([])
for i in my_items:
    m = get_item_ids(i.color) #returns an array as [1,4,20,5,3] (always same number of items but diff ids)
    temp = np.append(temp, m, axis=0)
On the second iteration, let's suppose I get [5,4,15,3,10];
then I would like to have temp as
array([[1,4,20,5,3],[5,4,15,3,10]])
But instead I keep getting [1,4,20,5,3,5,4,15,3,10].
I am new to Python, but I am sure there is probably a way to concatenate in this way with numpy without using lists?
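You have to reshape m so that it has two dimensions:
m = m.reshape(-1, 1)
This adds the second dimension. Then you can concatenate along axis=1. Note that np.concatenate takes the arrays as a single sequence, and temp must also be two-dimensional with the same number of rows as m:
temp = np.concatenate((temp, m), axis=1)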
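A variant of the same idea that yields one row per array, as the question asks, is sketched below. It is an illustration under stated assumptions rather than the answerer's code: `batches` is a hypothetical stand-in for the values returned by `get_item_ids`, and `temp` starts as an empty 2D array with zero rows so that each concatenation along axis 0 adds one row.

```python
import numpy as np

# Hypothetical stand-ins for the ID arrays returned on each iteration.
batches = [np.array([1, 4, 20, 5, 3]), np.array([5, 4, 15, 3, 10])]

# Start from an empty 2-D array with zero rows and five columns so that
# every concatenation along axis 0 adds exactly one row.
temp = np.empty((0, 5), dtype=int)
for m in batches:
    temp = np.concatenate((temp, m.reshape(1, -1)), axis=0)

print(temp.shape)  # (2, 5)
```

Each call to `np.concatenate` copies the whole array, so this gets slower as `temp` grows.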
List append is much better: faster and easier to use correctly.
temp = []
for i in my_items:
    m = get_item_ids(i.color) #returns an array as [1,4,20,5,3] (always same number of items but diff ids)
    temp.append(m)
Look at the list to see what it created. Then make an array from that:
arr = np.array(temp)
# or np.vstack(temp)

Efficient way to remove sections of Numpy array

I am working with a numpy array of features in the following format
[[feat1_channel1,feat2_channel1...feat6_channel1,feat1_channel2,feat2_channel2...]] (so each channel has 6 features and the array shape is 1 x (number channels*features_per_channel) or 1 x total_features)
I am trying to remove specified channels from the feature array, e.g. removing channel 1 would mean removing features 1-6 associated with channel 1.
My current method is shown below:
reshaped_features = current_feature.reshape((-1,num_feats))
desired_channels = np.delete(reshaped_features,excluded_channels,axis=0)
current_feature = desired_channels.reshape((1,-1))
where I reshape the array to be number_of_channels x number_of_features, remove the rows corresponding to the channels I want to exclude, and then reshape the array with the desired variables into the original format of being 1 x total_features.
The problem with this method is that it tremendously slows down my code because this process is done 1000s of times so I was wondering if there were any suggestions on how to speed this up or alternative approaches?
As an example, given the following array of features:
[[0,1,2,3,4,5,6,7,8,9,10,11...48,49,50,51,52,53]]
I reshape it to the following:
[[0,1,2,3,4,5],
[6,7,8,9,10,11],
[12,13,14,15,16,17],
.
.
.
[48,49,50,51,52,53]]
and, as an example, if I want to remove the first two channels then the resulting output should be:
[[12,13,14,15,16,17],
.
.
.
[48,49,50,51,52,53]]
and finally:
[[12,13,14,15,16,17...48,49,50,51,52,53]]
I found a solution that did not use np.delete(), which was the main culprit of the slowdown, building off the answer from msi_gerva.
I found the channels I wanted to keep using a list comprehension:
all_chans = [1,2,3,4,5,6,7,8,9,10]
features_per_channel = 5
my_data = np.arange(len(all_chans)*features_per_channel)
chan_to_exclude = [1,3,5]
channels_to_keep = [i for i in range(len(all_chans)) if i not in chan_to_exclude]
Then reshaped the array
reshaped = my_data.reshape((-1,features_per_channel))
Then selected the channels I wanted to keep
desired_data = reshaped[channels_to_keep]
And finally reshaped to the desired shape
final_data = desired_data.reshape((1,-1))
These changes made the code ~2x faster than the original method.
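The sketch below puts the two approaches side by side on the question's numerical example (54 features, 6 per channel, first two channels removed) to check that they give the same result. The variable names `via_delete` and `via_index` are illustrative, and no timing is claimed here.

```python
import numpy as np

features_per_channel = 6          # as in the question's example
n_channels = 9
current_feature = np.arange(n_channels * features_per_channel).reshape(1, -1)
excluded_channels = [0, 1]        # drop the first two channels

# Original approach: np.delete on the reshaped array.
reshaped = current_feature.reshape(-1, features_per_channel)
via_delete = np.delete(reshaped, excluded_channels, axis=0).reshape(1, -1)

# Indexing approach: build the list of rows to keep, then index with it.
channels_to_keep = [c for c in range(n_channels) if c not in excluded_channels]
via_index = reshaped[channels_to_keep].reshape(1, -1)

print(np.array_equal(via_delete, via_index))  # True
print(via_index[0, :6])                       # [12 13 14 15 16 17]
```

If the excluded channels stay the same across the thousands of calls, `channels_to_keep` can be computed once outside the loop.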
With the numerical examples you provided, I would go with:
import numpy as np
arrays = list(range(54))
arrays = np.reshape(arrays, (54 // 6, 6))
newarrays = arrays.copy()
remove = [1, 3, 5]
take = [0, 2, 4, 6, 7, 8]
arrays = np.delete(arrays, remove, axis=0)
newarrays = newarrays[take]
arrays = list(arrays.flatten())
newarrays = list(newarrays.flatten())

Create array from slices of numpy arrays contained in a list object

I have a pandas dataframe of shape (7761940, 16). I converted it into a list of 7762 numpy arrays using np.array_split, each array of shape (1000, 16) .
Now I need to take a slice of the first 50 elements from each array and create a new array of shape (388100, 16) from them. The number 388100 comes from 7762 arrays multiplied by 50 elements.
I know it is a sort of slicing and indexing but I could not manage it.
If you split the array, you waste memory. If you pad the array to allow a nice reshape, you waste memory. This is not a huge problem, but it can be avoided. One way is to use the arcane np.lib.stride_tricks.as_strided function. This function is dangerous, and we would break some rules with it, but as long as you only want the 50 first elements of a chunk, and the last chunk is longer than 50 elements, everything will be fine:
x = ... # your data as a numpy array
chunks = int(np.ceil(x.shape[0] / 1000))
view = np.lib.stride_tricks.as_strided(x, shape=(chunks, 1000, x.shape[-1]), strides=(max(x.strides) * 1000, *x.strides))
This will create a view of shape (7762, 1000, 16) into the original memory, without making a copy. Since your original array does not have a multiple of 1000 rows, the last plane will have some memory that doesn't belong to you. As long as you don't try to access it, it won't hurt you.
Now accessing the first 50 elements of each plane is trivial:
data = view[:, :50, :]
You can unravel the first dimensions to get the final result:
data.reshape(-1, x.shape[-1])
A much healthier way would be to pad and reshape the original.
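The pad-and-reshape route could look roughly like the sketch below. The function name `first_rows_per_chunk` and its parameters are hypothetical, and the padding uses zeros, so it assumes the last chunk has at least `keep` real rows (otherwise padded zeros would show up in the output).

```python
import numpy as np

def first_rows_per_chunk(x, chunk=1000, keep=50):
    """Return the first `keep` rows of every `chunk`-row block of a 2-D array."""
    n_rows, n_cols = x.shape
    n_chunks = -(-n_rows // chunk)                 # ceiling division
    pad_rows = n_chunks * chunk - n_rows
    padded = np.pad(x, ((0, pad_rows), (0, 0)))    # zero rows appended at the end
    blocks = padded.reshape(n_chunks, chunk, n_cols)
    return blocks[:, :keep, :].reshape(-1, n_cols)

# Small synthetic example: 2540 rows, so the last block is only partly filled.
x = np.arange(2540 * 2).reshape(2540, 2)
out = first_rows_per_chunk(x)
print(out.shape)  # (150, 2)
```

Unlike the strided view, `np.pad` copies the data, which is the memory cost mentioned above.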
After benefiting from friends' comments and some searching, I came up with a solution:
my_data = np.array_split(dataframe, 7762) #split dataframe into a list of 7762 ndarrays,
#each of 1000x16 dimension
my_list = [] #define new list object
for i in range(0,7762): #loop to iterate over the 7762 ndarrays
    my_list.append(my_data[i][0:50, :]) #append first 50 rows from each ndarray to my_list
You can do something like this:
Split the data of size (7762000 x 16) into (7762 x 1000 x 16). np.array_split returns a list, so wrap it in np.array to allow multi-dimensional slicing (this requires equally sized chunks):
data_first_split = np.array(np.array_split(data, 7762))
Slice the data to 7762 x 50 x 16 to get the first 50 elements of each chunk in data_first_split:
data_second_split = data_first_split[:, :50, :]
Reshape to get 388100 x 16:
data_final = np.reshape(data_second_split, (7762 * 50, 16))
As @hpaulj mentioned, you can also do it using np.vstack. IMO you should also give NumPy strides (ndarray.strides) a look.
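The np.vstack route can be sketched as follows on a small synthetic array (10 chunks of 7 rows, keeping the first 2 rows of each). The sizes and names are illustrative only; the same pattern would apply to 7762 chunks and 50 rows.

```python
import numpy as np

# Small synthetic stand-in: 10 chunks of 7 rows each, 3 columns.
data = np.arange(70 * 3).reshape(70, 3)
chunks = np.array_split(data, 10)

# Take the first 2 rows of every chunk and stack them into one 2-D array.
result = np.vstack([chunk[:2] for chunk in chunks])

print(result.shape)  # (20, 3)
```

This also works when the chunks are not all the same size, as long as each chunk has at least as many rows as are being kept.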

Initializing or populating multiple numpy arrays from h5 file groups

I have an h5 file with 5 groups, each group containing a 3D dataset. I am looking to build a for loop that allows me to extract each group into a numpy array and assign the numpy array to an object with the group header name. I am able to get a number of different methods to work with one group, but when I try to build a for loop that applies the code to all 5 groups, it breaks. For example:
import h5py as h5
import numpy as np
f = h5.File("FFM0012.h5", "r+") #read in h5 file
print(list(f.keys())) #['FFM', 'Image'] for my dataset
FFM = f['FFM'] #Generate object with all 5 groups
print(list(FFM.keys())) #['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr'] for my dataset
Amp = FFM['Amp'] #Generate object for 1 group
Amp = np.array(Amp) #Turn into numpy array, this works.
Now when I try to apply the same logic with a for loop:
h5_keys = []
FFM.visit(h5_keys.append) #Create list of group names ['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr']
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)
print(Amp[30,30,30]) #To check that array is populated
When I run this code I get "NameError: name 'Amp' is not defined". I've tried initializing the numpy array before the for loop with:
h5_keys = []
FFM.visit(h5_keys.append) #Create list of group names
Amp = np.array([])
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)
print(Amp[30,30,30]) #To check that array is populated
This produces the error message "IndexError: too many indices for array"
I've also tried generating a dictionary and creating numpy arrays from the dictionary. That is a similar story where I can get the code to work for one h5 group, but it falls apart when I build the for loop.
Any suggestions are appreciated!
You seem to have jumped to using h5py and numpy before learning much of Python.
Amp = np.array([]) # creates a numpy array with 0 elements
for h5_key in h5_keys: # h5_key is set to a new value each iteration
    tmp = FFM[h5_key]
    h5_key = np.array(tmp) # now you reassign h5_key
print(Amp[30,30,30]) # Amp is the original (0,) shape array
Try this basic python loop, paying attention to the value of i:
alist = [1,2,3]
for i in alist:
    print(i)
    i = 10
    print(i)
print(alist) # no change to alist
f is the file.
FFM = f['FFM']
is a group.
Amp = FFM['Amp']
is a dataset. There are various ways of loading the dataset into a numpy array. I believe the [...] slicing is the currently preferred one. .value used to be used but is now deprecated.
Amp = FFM['Amp'][...]
is an array.
alist = [FFM[key][...] for key in h5_keys]
should create a list of arrays from the FFM group.
If the shapes are compatible, you can concatenate the arrays into one array:
np.array(alist)
np.stack(alist)
np.concatenate(alist, axis=0) # or other axis
etc.
adict = {key: FFM[key][...] for key in h5_keys}
should create a dictionary of arrays keyed by dataset names.
In Python, lists and dictionaries are the ways of accumulating objects. The h5py groups behave much like dictionaries. Datasets behave much like numpy arrays, though they remain on the disk until loaded with [...].
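To illustrate the accumulation pattern without needing an HDF5 file, the sketch below uses a plain dictionary of numpy arrays as a stand-in for the h5py group. `fake_group` and its contents are hypothetical; with a real file, `FFM` would take its place and `[...]` would load each dataset from disk.

```python
import numpy as np

# Plain dict standing in for an h5py group (hypothetical data, not a real file).
fake_group = {
    "Amp": np.zeros((4, 4, 4)),
    "Phase": np.ones((4, 4, 4)),
}

# Rebinding the loop variable does not create a variable named after the key.
for key in fake_group:
    key = np.array(fake_group[key])   # 'key' now holds an array; no 'Amp' variable exists

# Accumulate into a dictionary instead and look arrays up by name.
arrays = {key: fake_group[key][...] for key in fake_group}
print(arrays["Amp"].shape)   # (4, 4, 4)
```

Looking arrays up as `arrays["Amp"]` replaces the need for a separate variable per dataset.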

Updating a NumPy array by adding columns

I am working with a large dataset and I would like to make a new array by adding columns, updating the array by opening a new file, taking a piece from it and adding this to my new array.
I have already tried the following code:
import numpy as np
Powers = np.array([])
with open('paths powers.tex', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            Pname = data[0:32446,0]
            Powers = np.append(Powers,Pname, axis = 1)
np.savetxt("Powers.txt", Powers)
However, what it does here is just adding the stuff from Pname in the bottom of the array, making a large 1D array instead of adding new columns and making an ndarray.
I have also tried this with numpy.insert, numpy.hstack and numpy.concatenate and I tried changing the shape of Pname. Unfortunately, they all give me the same result.
Have you tried numpy.column_stack?
Powers = np.column_stack([Powers,Pname])
However, the array is empty at first, so check for that before concatenating or you will get a dimension mismatch error:
import numpy as np
Powers = np.array([])
with open('paths powers.tex', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            Pname = data[0:32446,0]
            if len(Powers) == 0:
                Powers = Pname[:,None]
            else:
                Powers = np.column_stack([Powers,Pname])
np.savetxt("Powers.txt", Powers)
len(Powers) checks the number of rows in Powers. At the start this is 0, so on the first iteration the condition is true and we need to explicitly make Powers a one-column 2D array consisting of the first column in your file. Powers = Pname[:,None] does this, and is the same as Powers = Pname[:,np.newaxis]: it transforms a 1D array into a 2D array with a singleton column. The problem is that 1D arrays in numpy are agnostic of whether they are rows or columns, so you must explicitly convert the arrays into columns before appending. numpy.column_stack takes care of that for you.
However, you'll also need to make sure that Powers is a 2D matrix with one column the first time the loop iterates. Should you not want to use numpy.column_stack, you can certainly still use numpy.append, but make sure that what you're concatenating to the array is a column. The indexing described above helps you do this:
import numpy as np
Powers = np.array([])
with open('paths powers.tex', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            Pname = data[0:32446,0]
            if len(Powers) == 0:
                Powers = Pname[:,None]
            else:
                Pname = Pname[:,None]
                Powers = np.append(Powers, Pname, axis=1)
np.savetxt("Powers.txt", Powers)
The statement Pname = Pname[:,None] in the else branch ensures that Pname becomes a 2D array with a singleton column before concatenating.
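An alternative sketch, not taken from the answer above: collect the 1D pieces in a Python list and call numpy.column_stack once after the loop, which avoids both the empty-array check and repeated copying. The `columns` list here is synthetic stand-in data rather than values read from files.

```python
import numpy as np

# Synthetic stand-ins for the first column taken from each file.
columns = [np.arange(5) * k for k in (1, 2, 3)]

# Collect the 1-D pieces in a list, then build the 2-D array in one call;
# column_stack turns each 1-D array into a column.
powers = np.column_stack(columns)

print(powers.shape)  # (5, 3)
```

In the file-reading loop, this would amount to appending each Pname to a list and stacking after the loop finishes.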
