Converting array of pandas Dataframes into 3D NumPy array - python

I have a numpy array of pandas Dataframes which I need to convert into a 3D numpy array of the form (samples, rows, columns) in order to feed into a Keras model for training. I have 46 samples in my dataset and each sample is 1101 rows by 64 columns.
Here is the code for my 1D numpy array of Dataframes:
static_dfs = []
# read in static csvs as pandas df
# static_files is my np array of csv files
for x in range(0, static_files.size):
    df = pd.read_csv(static_files[x], sep='\t', skiprows=skip_rows, header=0)
    # append df to list
    static_dfs.append(df)
# convert list to np array
static_dfs = np.asarray(static_dfs)
Indeed the shape of the array is (46,) [the number of samples].
If I look at one of the Dataframes in the array (static_dfs[0] for instance) the shape is (1101, 64).
I then try to convert this to 3D numpy array:
static_nps = []
for x in range(0, static_dfs.size):
    static_nps.append(static_dfs[x].to_numpy())
# convert to numpy array
static_nps = np.asarray(static_nps)
However it gives me this error:
could not broadcast input array from shape (1101,64) into shape (1101)
for the line of code:
#convert to numpy array
static_nps = np.asarray(static_nps)
Worst part is I had it working before, but a collaborator of mine went through my code and edited it after we found a bug in one of our data files. Now I can't seem to get it back to working like before and am stuck :(
The desired shape of my 3D array would look like (46, 1101, 64). If anyone could solve this you would be a huge help! Thanks
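One common cause of this broadcast error is that the samples no longer all share the same shape, for example if one file now parses to a different number of columns after the data fix. That is an assumption here, since the files themselves are not shown. A minimal sketch, using small synthetic frames in place of the real CSVs and a hypothetical helper named stack_frames, that reports mismatched shapes and otherwise stacks the samples into (samples, rows, columns):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 4 samples of 5 rows x 3 columns
# (the question has 46 samples of 1101 x 64).
frames = [pd.DataFrame(np.zeros((5, 3))) for _ in range(4)]


def stack_frames(dfs):
    """Stack equally shaped DataFrames into one 3D array."""
    shapes = {df.shape for df in dfs}
    if len(shapes) != 1:
        # np.stack needs identical shapes; name the offenders instead
        raise ValueError(f"samples have differing shapes: {sorted(shapes)}")
    return np.stack([df.to_numpy() for df in dfs])


static_nps = stack_frames(frames)
print(static_nps.shape)
```

If the check raises, printing the shape of each frame next to its file name should point to the file that needs attention.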

Related

Best way to store and represent many 1D numpy float arrays to one 1D numpy array

I'm converting .bed files into 1D numpy float arrays, which I will need again later on. For roughly 700 .bed files, saving each of them as a 1D numpy float array is very costly.
My solution is to convert them into string arrays and concatenate them one after another, so that I can later retrieve them in the order in which they were concatenated, like getting the last array as shown below.
import numpy as np
array_size = 10000
number_of_files = 700
sample1 = np.random.uniform(low=0.5, high=13.3, size=(1, array_size))
s1 = np.array(["%.2f" % x for x in sample1.reshape(sample1.size)])
np.save('test', s1)
for i in range(number_of_files):
    sample = np.random.uniform(low=0.5, high=13.3, size=(1, array_size))
    s = np.array(["%.2f" % x for x in sample.reshape(sample.size)])
    s_temp = np.load('test.npy', mmap_mode='r')
    s_new = ['%s_%s' % (x, y) for x, y in zip(s_temp, s)]
    np.save('test', s_new)
result = np.load('test.npy')
last = [x.split('_')[100] for x in result]
However, my test code shows that this string array is much more costly.
Storing a 1D numpy float array of size 250M costs 1.9 GB.
700 files would take 1330 GB!
Storing a 1D numpy string array of size 10K costs 143 MB.
(250M/10K)*143 MB would come to 3575 GB!
Do you have any better solution for this representation and later retrieval problem?
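The size difference comes down to bytes per element: a float64 takes 8 bytes and a float32 takes 4, while a NumPy unicode string array stores 4 bytes per character, padded to the longest string. A minimal sketch, assuming all files yield arrays of the same length and that float32 precision is acceptable (both are assumptions, not stated in the question), keeps one row per file in a 2D float array so that any file can be retrieved by its row index:

```python
import numpy as np

# Scaled-down, hypothetical sizes (the question uses 700 files).
number_of_files, array_size = 7, 1000
rng = np.random.default_rng(0)
data = rng.uniform(low=0.5, high=13.3, size=(number_of_files, array_size))

as_float32 = data.astype(np.float32)                    # 4 bytes per value
as_text = np.array(["%.2f" % v for v in data.ravel()])  # 4 bytes per character

# Total bytes used by each representation of the same values
print(data.nbytes, as_float32.nbytes, as_text.nbytes)

# Row i holds the array from file i, so the last file is the last row.
last = as_float32[-1]
print(last.shape)
```

An array like this can be written with np.save and opened later with np.load(..., mmap_mode='r') so that only the requested rows are read from disk.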

Is it possible to convert a numpy array to a csv file and then load it back as a numpy array?

I have a dataset of images. I have successfully iterated through my directories and subdirectories to store the images into numpy array.
I have used the following statement:
Image_array = np.array(Image_array)
My array size is: 100x224x224
This works fine and the images get stored properly. However, I am now trying to save this numpy array into a CSV file. I have flattened the numpy array and have saved it in an array.csv file as shown below:
array = array.flatten('F')
np.savetxt('array.csv', array, delimiter=',', fmt='%d')
The above code just creates 1 CSV file, with one column with the pixel values.
I then attempted to read the CSV data back into a numpy array but the data is heavily messed up when loading as the image is just blurred. The array also displays with '.' after each number which it was not doing prior.
filename = "array.csv"
data = np.loadtxt(filename, delimiter=',')
new= np.array((data).reshape(100,224,224),order='F')
Am I missing something? please assist?
"I then attempted to read the csv data back into a numpy array but the data is heavily messed up"
Either don't use the order parameter at all, or use it consistently. You flattened the array with order 'F', but the order argument was not passed to reshape when restoring it, so reshape used its default, 'C'. I think you meant to do this but placed the order parameter in the wrong function: it should be inside reshape, not np.array.
"The array also displays with '.' after each number which it was not doing prior."
Read the data with the same dtype as your original array, e.g. data = np.loadtxt(filename, delimiter=',', dtype=int). Since you didn't give a dtype, np.loadtxt used its default, float.
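Both points can be checked with a small round trip. This is a sketch under assumptions: a tiny 2x3x4 integer array stands in for the 100x224x224 image stack, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical small stand-in for the 100x224x224 image stack
original = np.arange(2 * 3 * 4).reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "array.csv")
    # Flatten in Fortran order and write one value per line
    np.savetxt(path, original.flatten(order="F"), delimiter=",", fmt="%d")
    # Read back as integers and reshape with the same order
    data = np.loadtxt(path, delimiter=",", dtype=int)
    restored = data.reshape(original.shape, order="F")

print(np.array_equal(original, restored))
```

Dropping order="F" from both the flatten and the reshape call would work equally well; what matters is that the two calls agree.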
flatten() creates a 1D array (which gives one CSV instead of 100). Try splitting the images first and iterating through them, like this:
import numpy as np
x = np.random.rand(100, 224, 224)
images = [x[i, :, :] for i in range(100)]
Then your images are available as images[1], images[36], etc., so you can save them like this:
def makelist():
    names = []  # avoid shadowing the built-in set
    for i in range(100):
        names.append(f"array{i}.csv")
    return names
files = makelist()
# pair each file name with its own image; nested loops would write every image to every file
for file, image in zip(files, images):
    np.savetxt(file, image)

Numpy array shape after extraction from Pandas Dataframe

I have a column in a Dataframe where each cell has a (300,) shaped numpy array.
When I extract the values of this column using the .values attribute, I get a numpy array of shape (N,), where N is the number of rows of the Dataframe, and each element is a (300,) array. I would have expected the extracted shape to be (N, 300).
So I would like the shape of the extracted column to be (N, 300). I tried using pd.as_matrix() but this still gets me a numpy array of shape (N,).
Any suggestions?
You can use numpy.concatenate, convert to a list and cast to an array:
a = np.random.randint(10, size=300)
print (a.shape)
(300,)
df = pd.DataFrame({ 'A':[a,a,a]})
arr = np.array(np.concatenate(df.values).tolist())
print (arr.shape)
(3, 300)
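An alternative sketch, assuming every cell holds an array of the same length, is to pass the extracted column straight to np.stack, which joins the per-row arrays along a new first axis:

```python
import numpy as np
import pandas as pd

a = np.random.randint(10, size=300)
df = pd.DataFrame({"A": [a, a, a]})

col = df["A"].to_numpy()  # 1D object array, one (300,) array per row
arr = np.stack(col)       # regular 2D array, one row per DataFrame row
print(col.shape, arr.shape)
```

np.stack raises a ValueError if the cells differ in length, which makes ragged data visible early.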

Pandas Series.as_matrix() doesn't properly convert a series of nd arrays into a single nd array

I have a pandas dataframe where one column is labeled "feature_vector" and each cell contains a 1d numpy array of numbers. I need to use this data in a scikit-learn model, so I need it as a single numpy array. Naturally, I call DataFrame["feature_vector"].as_matrix() to get the numpy array from that series. The only problem is that as_matrix() returns a 1d numpy array in which each element is itself a 1d numpy array containing one vector. When this is passed to an sklearn model's .fit() function, it throws an error. What I need instead is a 2d numpy array rather than a 1d array of 1d arrays. I wrote this workaround, which presumably uses unnecessary memory and computation time:
x = dataframe["feature_vector"].as_matrix()
# x is a 1d array of 1d arrays.
l = []
for e in x:
    l.append(e)
x = np.array(l)
# x is now a single 2d array.
Is this a bug in pandas .as_matrix()? Is there a better workaround that doesn't require me to change the structure of the original dataframe?
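This behaviour is not specific to as_matrix(): a column whose cells hold arrays has object dtype, so extracting it gives a 1d object array. A minimal sketch, assuming every vector has the same length and using a small synthetic frame, is shown here. Note that as_matrix() was removed in pandas 1.0, so the sketch uses to_numpy() instead.

```python
import numpy as np
import pandas as pd

# Synthetic frame: 3 rows, each cell holds a length-4 float vector
dataframe = pd.DataFrame(
    {"feature_vector": [np.arange(4.0) + i for i in range(3)]}
)

col = dataframe["feature_vector"].to_numpy()
print(col.dtype, col.shape)  # object dtype, one entry per row

# vstack places each vector on its own row of a regular 2D array
x = np.vstack(col)
print(x.dtype, x.shape)
```

The resulting x has a numeric dtype and shape (n_rows, vector_length), which is the 2D layout that estimator .fit() methods expect.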

Convert list of NDarrays to dataframe

I'm trying to convert a list of ND arrays to a dataframe in order to run an Isomap on it, but the conversion fails. Does anyone know how to convert it so that I can run an Isomap on it?
# Creation and filling of list samples
samples = list()
for i in range(72):
    img = misc.imread('Datasets/ALOI/32/32_r' + str(i*5) + '.png')
    samples.append(img)
...
df = pd.DataFrame(samples)  # This doesn't work, gives
# ValueError: Must pass 2-d input
...
iso = manifold.Isomap(n_neighbors=4, n_components=3)
iso.fit(df)  # The end goal of my DataFrame
That is obvious, isn't it? All images are 2D data, rows and columns. Stacking them in a list causes it to gain a third dimension. DataFrames are by nature 2D. Hence the error.
You have 2 possible fixes:
1. Create a Panel (only in older pandas; Panel was removed in pandas 0.25):
wp = pd.Panel.from_dict(zip(samples, [str(i*5) for i in range(72)]))
2. Stack your arrays one on top of the other, or side by side:
# On top of another:
df = pd.concat([pd.DataFrame(sample) for sample in samples], axis=0,
               keys=[str(i*5) for i in range(72)])
# Side by side:
df = pd.concat([pd.DataFrame(sample) for sample in samples], axis=1,
               keys=[str(i*5) for i in range(72)])
Another way to do it is to convert your 2D arrays (images) to 1D arrays (that are expected by sklearn) using the reshape method on the images:
for i in range(yourRange):
    img = misc.imread(yourFile)
    samples.append(img.reshape(-1))
df = pd.DataFrame(samples)
Olivera's answer is on the right track.
the problem
When you run misc.imread, the output is an NxM (2D) array. Putting these arrays in a list makes the data 3D, but DataFrame expects a 2D input.
the fix
Before it goes in the list, the array should be 'flattened', for example using ravel:
img = misc.imread('Datasets/ALOI/32/32_r' + str(i*5) + '.png').ravel()
ravel() versus .reshape(-1)
Both return a 1D array of shape (N*M,), so .reshape(-1) works here too. What would not work is .reshape(-1, 1), which keeps the array 2D with shape (N*M, 1).
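The shapes involved can be checked with a small sketch. It uses synthetic 4x5 arrays as stand-ins for the images (an assumption, since misc.imread and the ALOI files are not available here). In NumPy, ravel() and reshape(-1) both return a 1D array, whereas reshape(-1, 1) stays 2D.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the images: 3 arrays of 4 rows x 5 columns
samples_2d = [np.full((4, 5), i, dtype=np.uint8) for i in range(3)]

img = samples_2d[0]
print(img.ravel().shape)         # 1D
print(img.reshape(-1).shape)     # also 1D
print(img.reshape(-1, 1).shape)  # still 2D: one column

# One flattened image per row gives the 2D input DataFrame expects
flat = [s.ravel() for s in samples_2d]
df = pd.DataFrame(flat)
print(df.shape)
```

Each row of df is then one image and each column one pixel position, which is the samples-by-features layout that Isomap's fit method takes.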
