Split numpy 2D Array with Unknown Length to 3D Array - python

I have a (18, 10525) numpy array.
It holds 18 columns with 10525 rows, but the number of rows is not always the same, and I must slice the array into groups (windows) of 200 rows across the 18 columns to feed it to an AI model.
For example I would like to do
data = np.ones((18, 10525))
data.reshape(-1,18,200)
But 10525 isn't divisible by 200 so I get a ValueError. I would like to get a zero padded array of shape (-1,18,200). I.e. add zeros to data until I can do .reshape(-1,18,200). Thanks in advance.

Assuming you want to fill with zeros, here is one solution:
import numpy as np
data = np.ones((18, 10525))
old_size = data.size
block = 18 * 200
# round up to the next multiple of 18*200 (adds no extra block if the size already divides evenly)
rounded_up_size = -(-old_size // block) * block
reshaped_arr = np.zeros(rounded_up_size)
reshaped_arr[:old_size] = data.reshape(-1)
result = reshaped_arr.reshape(-1, 18, 200)
Note that the data is copied once into reshaped_arr. Only the final reshape is a view, and it is a view on reshaped_arr, not on the original data.
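One thing to watch with the flatten-and-reshape approach: because the (18, 10525) array is flattened row by row, each (18, 200) block is filled with consecutive values from the flattened array rather than with the same 200 positions from all 18 rows. If every window should keep the 18 rows aligned, a possible alternative (a sketch, not taken from the answer above) is to zero-pad the long axis with np.pad, then reshape and transpose. The names window, padded and windows are illustrative.

```python
import numpy as np

window = 200
# Example data with distinct values so the layout can be checked
data = np.arange(18 * 10525, dtype=float).reshape(18, 10525)

# Number of zero columns needed to reach the next multiple of `window`
pad = (-data.shape[1]) % window

# Pad only the second axis, at the end, with zeros
padded = np.pad(data, ((0, 0), (0, pad)), mode="constant", constant_values=0)

# (18, n_windows, 200) -> (n_windows, 18, 200)
windows = padded.reshape(data.shape[0], -1, window).transpose(1, 0, 2)

print(windows.shape)
```

Here windows[k, c] holds positions k*200 to (k+1)*200 of row c, and only the last window contains padding.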

Related

Is there a way to write a python function that will create 'N' arrays? (see body)

I have a numpy array of shape (20, 3). (So 20 3 by 1 arrays. Correct me if I'm wrong, I am still pretty new to python)
I need to separate it into 3 arrays of shape 20,1 where the first array is 20 elements that are the 0th element of each 3 by 1 array. Second array is also 20 elements that are the 1st element of each 3 by 1 array, etc.
I am not sure if I need to write a function for this. Here is what I have tried:
Essentially I'm trying to create an array of 3 20 by 1 arrays that I can later index to get the separate 20 by 1 arrays.
a = np.load() #loads file
num=20 #the num is if I need to change array size
num_2=3
for j in range(0,num):
    for l in range(0,num_2):
        array_elements = np.zeros(3)
        array_elements[l] = a[j:][l]
This gives the following error:
'''
ValueError: setting an array element with a sequence
'''
I have also tried making it a dictionary and making the dictionary values lists that are appended, but it only gives the first or last value of the 20 that I need.
Your array has shape (20, 3), this means it's a 2-dimensional array with 20 rows and 3 columns in each row.
You can access data in this array by indexing with numbers or ':' to indicate ranges. You want to split this into 3 arrays of shape (20, 1), so one array per column. To do this, pick the column with a number and use ':' to mean 'all of the rows'. The three columns are then a[:, 0], a[:, 1] and a[:, 2]. Note that indexing with an integer drops that axis, so each of these has shape (20,); use a slice such as a[:, 0:1] if you need shape (20, 1).
You can then assign these to separate variables if you wish, e.g. arr = a[:, 0], but this is only a view of the original data in a. Any change made to arr is also made to the corresponding data in a.
If you want a new array so this doesn't happen, use the .copy() method. With arr = a[:, 0].copy(), arr is completely separate from a, and changes made to one will not affect the other.
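A short sketch of the difference between a view and a copy when picking a column. The example array and the variable names are made up for illustration.

```python
import numpy as np

a = np.arange(60).reshape(20, 3)  # stand-in for the loaded (20, 3) array

col_view = a[:, 0]          # shape (20,), shares memory with a
col_2d = a[:, 0:1]          # shape (20, 1), also a view
col_copy = a[:, 0].copy()   # independent copy

col_view[0] = -1   # writes through to a[0, 0]
col_copy[1] = -2   # leaves a untouched

print(col_view.shape, col_2d.shape)
print(a[0, 0], a[1, 0])
```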
Essentially you want to group your arrays by their index. There are plenty of ways of doing this. Since numpy does not have a group by method, you have to horizontally split the arrays into a new array and reshape it.
old_length = 3
new_length = 20
a = np.array(np.hsplit(a, old_length)).reshape(old_length, new_length)
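Edit: Rotating the array by -90 degrees gives the same (3, 20) shape. You can do this with rot90 and k=-1 or k=3, which tells numpy how many times to rotate by 90 degrees. Note that the rotation also reverses the order of the 20 values within each row; the plain transpose a.T gives exactly the same result as the hsplit version.
a = np.rot90(a, k=-1)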
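To see how these variants relate, the following sketch compares the hsplit result with a plain transpose and with rot90 on a made-up (20, 3) array. The variable names are illustrative.

```python
import numpy as np

a = np.arange(60).reshape(20, 3)  # made-up data

by_hsplit = np.array(np.hsplit(a, 3)).reshape(3, 20)
by_transpose = a.T
by_rot90 = np.rot90(a, k=-1)

# hsplit + reshape matches the transpose exactly
print(np.array_equal(by_hsplit, by_transpose))

# rot90 has the same shape, but each row comes out in reverse order
print(by_rot90.shape)
print(np.array_equal(by_rot90[:, ::-1], by_transpose))
```

After either the hsplit or the transpose version, row i of the result is column i of the original, so result[0], result[1] and result[2] are the three separate arrays.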

How to concatenate numpy arrays to create a 2d numpy array

I'm working on using AI to give me better odds at winning Keno. (don't laugh lol)
My issue is that when I gather my data it comes in the form of 1d arrays of drawings at a time. I have different files that have gathered the data and formatted it as well as performed simple maths on the data set. Now I'm trying to get the data into a certain shape for my Neural Network layers and am having issues.
formatted_list = file.readlines()
#remove newline chars
formatted_list = list(filter(("\n").__ne__, formatted_list))
#iterate through each drawing, format the ends and split into list of ints
for i in formatted_list:
    i = i[1:]
    i = i[:-2]
    i = [int(j) for j in i.split(",")]
    #convert to numpy array
    temp = np.array(i)
    #t1 = np.reshape(temp, (-1, len(temp)))
    #print(np.shape(t1))
    #append to master list
    master_list.append(temp)
print(np.shape(master_list))
This gives output of "(292,)", which is correct in that there are 292 rows of data, but each row contains 20 columns as well. If I uncomment the lines "#t1 = np.reshape(temp, (-1, len(temp)))" and "#print(np.shape(t1))", it gives output of "(1,20)(1,20)(1,20)(1,20)(1,20)(1,20)(1,20)(1,20)", etc. I want all of those rows stacked together with the columns kept the same, giving (292,20). How can this be accomplished?
I've tried reshaping the final list and many other things and had no luck. It populates each number in the row and adds it to the first dimension, i.e. (5840,). I was expecting to be able to append each new drawing to a master list, convert it to a numpy array and reshape it to 292 rows of 20 columns. It just appears that it wants to keep the single dimension. I've tried numpy.concat also, with no luck. Thank you.
You can use vstack to concatenate your master_list.
master_list = []
for array in formatted_list:
    master_list.append(array)
master_array = np.vstack(master_list)
Alternatively, if you know the length of your formatted_list containing the arrays and the array length, you can just preallocate the master_array.
import numpy as np
formatted_list = [np.random.rand(20)]*292
master_array = np.zeros((len(formatted_list), len(formatted_list[0])))
for i, array in enumerate(formatted_list):
    master_array[i,:] = array
** Edit **
As mentioned by hpaulj in the comments, np.array(), np.stack() and np.vstack() worked with this input and produced a numpy array with shape (7,20).
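As a small check of the stacking step, the sketch below builds a stand-in list of 292 one-dimensional arrays (random integers, not real drawings) and stacks them. It also collects the row lengths, on the assumption (the data file is not shown) that a reported shape of (292,) may come from rows that do not all have 20 entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the parsed drawings: 292 rows of 20 ints each
drawings = [rng.integers(1, 81, size=20) for _ in range(292)]

# Every row must have the same length for a clean 2D result
lengths = {len(d) for d in drawings}
print(lengths)

stacked = np.vstack(drawings)
print(stacked.shape)

# With equal-length rows, np.array gives the same 2D shape
print(np.array(drawings).shape)
```

If lengths contains more than one value, np.vstack raises a ValueError, which points to the rows that need fixing before stacking.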

Efficient way to remove sections of Numpy array

I am working with a numpy array of features in the following format
[[feat1_channel1,feat2_channel1...feat6_channel1,feat1_channel2,feat2_channel2...]] (so each channel has 6 features and the array shape is 1 x (number channels*features_per_channel) or 1 x total_features)
I am trying to remove specified channels from the feature array, ex: removing channel 1 would mean removing features 1-6 associated with channel 1.
my current method is shown below:
reshaped_features = current_feature.reshape((-1,num_feats))
desired_channels = np.delete(reshaped_features,excluded_channels,axis=0)
current_feature = desired_channels.reshape((1,-1))
where I reshape the array to be number_of_channels x number_of_features, remove the rows corresponding to the channels I want to exclude, and then reshape the array with the desired variables into the original format of being 1 x total_features.
The problem with this method is that it tremendously slows down my code because this process is done 1000s of times so I was wondering if there were any suggestions on how to speed this up or alternative approaches?
As an example, given the following array of features:
[[0,1,2,3,4,5,6,7,8,9,10,11...48,49,50,51,52,53]]
i reshape to below:
[[0,1,2,3,4,5],
[6,7,8,9,10,11],
[12,13,14,15,16,17],
.
.
.
[48,49,50,51,52,53]]
and, as an example, if I want to remove the first two channels then the resulting output should be:
[[12,13,14,15,16,17],
.
.
.
[48,49,50,51,52,53]]
and finally:
[[12,13,14,15,16,17...48,49,50,51,52,53]]
I found a solution that did not use np.delete() which was the main culprit of the slowdown, building off the answer from msi_gerva.
I found the channels I wanted to keep using list comp
all_chans = [1,2,3,4,5,6,7,8,9,10]
features_per_channel = 5
my_data = np.arange(len(all_chans)*features_per_channel)
chan_to_exclude = [1,3,5]
channels_to_keep = [i for i in range(len(all_chans)) if i not in chan_to_exclude]
Then reshaped the array
reshaped = my_data.reshape((-1,features_per_channel))
Then selected the channels I wanted to keep
desired_data = reshaped[channels_to_keep]
And finally reshaped to the desired shape
final_data = desired_data.reshape((1,-1))
These changes made the code ~2x faster than the original method.
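Since the same channels are removed thousands of times, another option is to build a boolean column mask once and reuse it, which avoids both reshapes. This is only a sketch under that assumption; the names keep and flat_keep are illustrative, and the sizes follow the 54-feature example from the question.

```python
import numpy as np

num_channels = 9
feats_per_channel = 6
features = np.arange(num_channels * feats_per_channel).reshape(1, -1)
excluded_channels = [0, 1]

# Build the mask once, outside the hot loop
keep = np.ones(num_channels, dtype=bool)
keep[excluded_channels] = False
flat_keep = np.repeat(keep, feats_per_channel)  # one flag per feature column

# Apply it each time: selects the kept columns directly from the 1 x N array
result = features[:, flat_keep]
print(result.shape)
```

Whether this is faster than indexing the reshaped array depends on the array sizes, so it is worth timing on the real data.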
With the numerical examples you provided, I would go with one of the following (np.delete or index selection, which give the same result):
import numpy as np
arrays = np.arange(54).reshape(54 // 6, 6)
newarrays = arrays.copy()
remove = [1, 3, 5]
take = [0, 2, 4, 6, 7, 8]
# option 1: delete the unwanted rows
arrays = np.delete(arrays, remove, axis=0)
# option 2: select the rows to keep
newarrays = newarrays[take]
arrays = list(arrays.flatten())
newarrays = list(newarrays.flatten())

Create array from slices of numpy arrays contained in a list object

I have a pandas dataframe of shape (7761940, 16). I converted it into a list of 7762 numpy arrays using np.array_split, each array of shape (1000, 16) .
Now I need to take a slice of the first 50 elements from each array and create a new array of shape (388100, 16) from them. The number 388100 comes from 7762 arrays multiplied by 50 elements.
I know it is a sort of slicing and indexing but I could not manage it.
If you split the array, you waste memory. If you pad the array to allow a nice reshape, you waste memory. This is not a huge problem, but it can be avoided. One way is to use the arcane np.lib.stride_tricks.as_strided function. This function is dangerous, and we would break some rules with it, but as long as you only want the 50 first elements of a chunk, and the last chunk is longer than 50 elements, everything will be fine:
x = ... # your data as a numpy array
chunks = int(np.ceil(x.shape[0] / 1000))
view = np.lib.stride_tricks.as_strided(x, shape=(chunks, 1000, x.shape[-1]), strides=(x.strides[0] * 1000, *x.strides))
This will create a view of shape (7762, 1000, 16) into the original memory, without making a copy. Since your original array does not have a multiple of 1000 rows, the last plane will have some memory that doesn't belong to you. As long as you don't try to access it, it won't hurt you.
Now accessing the first 50 elements of each plane is trivial:
data = view[:, :50, :]
You can unravel the first dimensions to get the final result:
data.reshape(-1, x.shape[-1])
A much healthier way would be to pad and reshape the original.
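The pad-and-reshape route could look like the sketch below. The array x and the sizes are made up for illustration, and np.pad copies the data once, so this trades some memory for safety. It assumes the last chunk has at least 50 real rows, as in the question.

```python
import numpy as np

chunk, keep = 1000, 50
# Made-up data: 4500 rows is not a multiple of 1000
x = np.arange(4500 * 3).reshape(4500, 3)

# Rows of zeros needed to reach the next multiple of `chunk`
pad_rows = (-x.shape[0]) % chunk
padded = np.pad(x, ((0, pad_rows), (0, 0)), mode="constant")

# (n_chunks, chunk, cols) -> first `keep` rows of each chunk -> 2D again
data = padded.reshape(-1, chunk, x.shape[1])[:, :keep, :].reshape(-1, x.shape[1])
print(data.shape)
```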
After benefiting from friends' comments and some searching, I came up with a solution:
my_data = np.array_split(dataframe, 7762) #split dataframe into a list of 7762 ndarrays,
#each of 1000x16 dimension
my_list = [] #define new list object
for i in range(0,7762): #loop to iterate over the 7762 ndarrays
    my_list.append(my_data[i][0:50, :]) #append first 50 rows from each ndarray to my_list
You can do something like this:
Split the data of size (7762000 x 16) into (7762 x 1000 x 16). np.array_split returns a list, so convert it to an array first (this works when all chunks have the same length):
data_first_split = np.array(np.array_split(data, 7762))
Slice the data to 7762 x 50 x 16 to get the first 50 rows of each chunk:
data_second_split = data_first_split[:, :50, :]
Reshape to get 388100 x 16:
data_final = np.reshape(data_second_split, (7762 * 50, 16))
As @hpaulj mentioned, you can also do it using np.vstack. IMO you should also give numpy strides (ndarray.strides) a look.
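When the row count is not an exact multiple of the number of chunks, np.array_split produces chunks of slightly different lengths, so they cannot be turned into one 3D array. A list comprehension with np.vstack still works in that case. The sketch below uses small made-up sizes (23 rows, 5 chunks, first 2 rows of each) in place of the real ones.

```python
import numpy as np

data = np.arange(23 * 4).reshape(23, 4)  # made-up data, 23 rows of 4 columns

chunks = np.array_split(data, 5)         # chunk lengths: 5, 5, 5, 4, 4
print([len(c) for c in chunks])

# Take the first 2 rows of every chunk and stack them into one 2D array
first_rows = np.vstack([chunk[:2] for chunk in chunks])
print(first_rows.shape)
```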

Replace row with another row in 3D numpy array

I am trying to replace a specific row of NaN's in a 3-D array (filled with NaN's) with rows of known integer values from a specific column in a text file (ex: 24 rows of column 8). Is there a method to perform this replacement that I have missed in my search for help?
My most recent trial code (of many) is as follows:
import numpy as np
tfile = "C:\...\Lee_Gilmer_MEM_GA_01_02_2015.txt"
data = np.genfromtxt(tfile, dtype=None)
#creation of empty 24 hour global matrix
s_array = np.empty((24,361,720))
s_array[:] = np.NAN
#Get values from column 8
c_data = data[:,7]
#Replace all 24 NaN's slices of row 1 column 1 with corresponding 24 row values from column 8
s_array[:,0:1,0:1] = c_data
print s_array
This produces a result of:
ValueError: could not broadcast input array from shape (24) into shape (24,1,1)
When I print out the shape of c_data, I get:
(24L,)
Is this at all possible to do without having to use a loop and replacing each one individually?
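The error message tells you pretty much everything you need to know: the array slice on the left-hand side of the assignment has a shape of (24,1,1), whereas the right-hand side has shape (24,). Broadcasting aligns shapes from the last axis, so the 24 values would have to fit the last axis of length 1. Since a shape of (24,) cannot be broadcast to (24,1,1), numpy raises a ValueError.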
There are two ways to solve this:
1. Make the shape of the LHS (24,) rather than (24, 1, 1). A nice way to do this would be to index with an integer rather than a slice for the last two dimensions:
s_array[:, 0, 0] = c_data
2. Reshape c_data to match the shape of the LHS:
s_array[:, 0:1, 0:1] = c_data.reshape(24, 1, 1)
I think option 1 is a lot more readable.
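Both options can be checked on a smaller stand-in array. The shape (24, 3, 4) and the values in c_data are made up to keep the sketch light; the indexing is the same as for the (24, 361, 720) array.

```python
import numpy as np

c_data = np.arange(24, dtype=float)  # stand-in for the 24 values of column 8

# Option 1: integer indices on the last two axes give a (24,) target
s_array = np.full((24, 3, 4), np.nan)
s_array[:, 0, 0] = c_data

# Option 2: keep the slices and reshape the source to (24, 1, 1)
other = np.full((24, 3, 4), np.nan)
other[:, 0:1, 0:1] = c_data.reshape(24, 1, 1)

print(np.array_equal(s_array[:, 0, 0], other[:, 0, 0]))
print(np.isnan(s_array).sum())  # every other entry is still NaN
```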
