Large numpy array causes error when trying to load - python

Edit: For those looking for answers, my file was corrupted, as hpauli suggested. Knowing the shape the array should have, I opened the file with open(filename) in append mode and appended zeros until there was the correct amount of data. Then I split the file in 2 and loaded the first part; the half that still failed I split into 2 more parts, and so on, until I had recovered most of my data. A sketch of the padding step is below.
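A minimal sketch of that padding step, assuming uint8 pixels, the shape from the traceback below, and a 128-byte .npy header (the header length varies, so inspect your file first):

import os

# Hypothetical recovery sketch: pad a truncated .npy file with zero bytes
# until it holds as much pixel data as its header promises.
path = "D:/Dev/Fall-Guys-AI-Race/data/training_data.npy"
header_size = 128                  # assumed; check the real header length
expected = 14460 * 224 * 224 * 3   # bytes of uint8 data for the expected shape
missing = header_size + expected - os.path.getsize(path)
with open(path, "ab") as f:        # append mode, as described above
    f.write(b"\x00" * missing)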
I'm making an AI using image recognition, so I recorded each frame of me playing into a numpy array. It worked just fine the first time: I exported all the images and got six thousand of them. Now I was recording a lot more data, but suddenly I get this error with no change in my code or environment:
Traceback (most recent call last):
  File "D:\Dev\Fall-Guys-AI-Race\utils\CreateImages.py", line 6, in <module>
    data = np.load("D:/Dev/Fall-Guys-AI-Race/data/training_data.npy", allow_pickle=True)
  File "D:\Program Files\Python39\lib\site-packages\numpy\lib\npyio.py", line 430, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "D:\Program Files\Python39\lib\site-packages\numpy\lib\format.py", line 786, in read_array
    array.shape = shape
ValueError: cannot reshape array of size 2147483648 into shape (14460,224,224,3)
Here is my CreateImages.py:
import cv2, os
import numpy as np

listing = os.listdir("D:/Dev/Fall-Guys-AI-Race/data/")

for j in range(1):
    data = np.load("D:/Dev/Fall-Guys-AI-Race/data/training_data.npy", allow_pickle=True)
    targets = np.load("D:/Dev/Fall-Guys-AI-Race/data/target_data.npy", allow_pickle=True)

    print(f'Image Data Shape: {data.shape}')
    print(f'targets Shape: {targets.shape}')

    # Lets see how many of each type of move we have.
    unique_elements, counts = np.unique(targets, return_counts=True)

    # Store both data and targets in a list.
    # We may want to shuffle down the road.
    holder_list = []
    for i, image in enumerate(data):
        holder_list.append([data[i], targets[i]])

    count_up = 0
    count_left = 0
    count_right = 0
    count_jump = 0
    count_down = 0

    for data in holder_list:
        # writes data to image in correct folder, skipped because lots of lines:
        cv2.imwrite(f"*my_path*{count_left}.png", data[0])

    print("done")
    print(count_down, count_up, count_jump, count_left, count_right)
Thanks for the help.
Edit: I can't even load the array (which is stored as a file), so I don't think I can modify it.

It appears that you are attempting to load numpy arrays, and the new arrays are larger than the ones you loaded previously.
The error tells you that the data in the file cannot be reshaped into the expected shape: the file holds 2147483648 elements (exactly 2^31, which hints that the write was cut short), while (14460, 224, 224, 3) needs 2176634880 of them. In other words, the file no longer contains all the data its header promises, which can happen once a recording grows this large.
To fix this, consider saving the data in smaller pieces, or splitting it across several files and loading them independently.
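A minimal sketch of that chunked approach, with stand-in data (file names and chunk size are hypothetical):

import numpy as np

# Hypothetical sketch: save frames in fixed-size chunks instead of one
# giant .npy file, then load the chunks back independently.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(25)]  # stand-in data

chunk_size = 10
n_chunks = 0
for i in range(0, len(frames), chunk_size):
    np.save(f"training_data_{n_chunks}.npy", np.array(frames[i:i + chunk_size]))
    n_chunks += 1

# Load the chunks back and stitch them together, or process one at a time.
data = np.concatenate([np.load(f"training_data_{n}.npy") for n in range(n_chunks)])
print(data.shape)  # (25, 224, 224, 3)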

Related

How can I get my dicom splitter/divider to save split/divided images?

I have cobbled together some code in Python to try and work through a folder of DICOM files, splitting each image in two.
All my DICOM files are X-rays of both the left and right feet, and I need to separate them.
To do this I am adapting some code produced by #g_unit, seen here.
Unfortunately, this attempt results in two unaltered copies of the original file, unsplit. It does work when writing the files as PNG or JPG, but not when writing as DICOMs. My test image in the console also looks good.
In the example below, I am using a folder with only one file in it. I will adapt it to write the new files and filenames after I get my single sample working.
import matplotlib.pyplot as plt
import pydicom
import pydicom as pd
import os

def main():
    path = 'C:/.../test_block_out/'
    # iterate through the names of contents of the folder
    for file in os.listdir(path):
        # create the full input path and read the file
        input_path = os.path.join(path, file)
        dataset = pd.dcmread(input_path)
        shape = dataset.pixel_array.shape
        # get the half of the x dimension. For the y dimension use shape[0]
        half_x = int(shape[1] / 2)
        # slice the halves
        # [first_axis, second_axis] so [:, :half_x] means slice all from first axis, slice 0 to half_x from second axis
        left_part = dataset.pixel_array[:, :half_x].tobytes()
        right_part = dataset.pixel_array[:, half_x:].tobytes()
        # Save halves
        path_to_left_image = 'C:.../test_file/left.dcm'
        path_to_right_image = 'C:.../test_file/right.dcm'
        dataset.save_as(path_to_left_image, left_part)
        dataset.save_as(path_to_right_image, right_part)
        # print test image
        plt.imshow(dataset.pixel_array[:, :half_x])
        #plt.imshow(dataset.pixel_array[:, half_x:])

if __name__ == '__main__':
    main()
I have tried writing the pixel array to dataset.PixelData, but this throws the error:
ValueError: The length of the pixel data in the dataset (5120000 bytes) doesn't match the expected length (10240000 bytes). The dataset may be corrupted or there may be an issue with the pixel data handler.
Which makes sense, since it's half my original dimensions. It will write a DCM, but I cannot load this DCM into any DICOM viewer tools ('Decode error!').
Is there a way to get this to write the files as DCMs, not PNGs? Or will the DCMs always bug if the dimensions are incorrect?
A kind colleague has helped by providing the answer.
The issue was that I was saving "dataset", not "left_part".
The solution was to create a new pydicom object, deep-copying the DCM file, and then modifying the copy.
Code below:
import copy

# iterate through the names of contents of the folder
for file in os.listdir(path):
    # create the full input path and read the file
    input_path = os.path.join(path, file)
    dataset = pd.dcmread(input_path)
    left_part = copy.deepcopy(dataset)
    right_part = copy.deepcopy(dataset)
    shape = dataset.pixel_array.shape
    # get the half of the x dimension. For the y dimension use shape[0]
    half_x = int(shape[1] / 2)
    # slice the halves
    # [first_axis, second_axis] so [:, :half_x] means slice all from first axis, slice 0 to half_x from second axis
    left_part.PixelData = dataset.pixel_array[:, :half_x].tobytes()
    left_part['Columns'].value = half_x
    right_part.PixelData = dataset.pixel_array[:, half_x:].tobytes()
    right_part['Columns'].value = shape[1] - half_x
    # Save halves
    path_to_left_image = os.path.join(path, 'left_' + file)
    path_to_right_image = os.path.join(path, 'right_' + file)
    left_part.save_as(path_to_left_image)
    right_part.save_as(path_to_right_image)
    # print test image
    plt.imshow(left_part.pixel_array)
    plt.show()

.npz file contains different datatypes, how do I plot the images contained within the .npz file?

I'm currently trying to learn Python through a project, and I have been given an .npz file
containing different datatypes; I have explored the data using
cell_data = np.load("C:/Users/cell-data.npz")
d = dict(zip(("data:{}".format(k) for k in cell_data), (cell_data[k] for k in cell_data)))
print(d)
which gives me this as an output.
I've also run cell_data.files, telling me that the file contains ['images', 'counts', 'folds', 'compressed', 'allow_pickle'].
How would I retrieve individual images and plot them? Usually, if it were just a single image in an .npz file, I would use plt.imshow('thatfile.npz'), but I'm unsure how to do this when there are multiple arrays of different datatypes within the .npz.
I have also tried the following:
images = cell_data["images"]
counts = cell_data["counts"]
folds = cell_data["folds"]
X0 = images[folds == 0]
Y0 = counts[folds == 0]
plt.imshow(images, cmap='gray')
plt.show()
However this doesn't seem to be working; I experience the error:
TypeError: Invalid shape (2351, 256, 256, 3) for image data
Any help would be appreciated, thank you
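For what it's worth, plt.imshow draws one image at a time, so the (2351, 256, 256, 3) array has to be indexed down to a single frame first. A minimal sketch, assuming the file loads as shown above:

import numpy as np
import matplotlib.pyplot as plt

# Sketch: pull one (256, 256, 3) RGB image out of the stack instead of
# passing the whole 4-D array to imshow.
cell_data = np.load("C:/Users/cell-data.npz")
images = cell_data["images"]

plt.imshow(images[0])  # first image in the stack
plt.show()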

Running out of memory when building features (converting images into derived features [numpy arrays])?

I copied some image data to an instance on Google Cloud (8 vCPUs, 64GB memory, Tesla K80 GPU) and am running into memory problems when converting the raw data into features and changing the data structure of the output. Eventually I'd like to use the derived features in a Keras/Tensorflow neural net.
Process
After copying the data to a storage bucket, I run a build_features.py function to convert the raw data into processed data for the neural network. In this pipeline, I first take each raw image and put it into a list x (which stores the derived features).
Since I'm working with a large number of images (tens of thousands of images that are type float32 and have dimensions 250x500x3), the list x becomes quite large. Each element of x is a numpy array that stores an image of shape 250x500x3.
Problem 1 - reduced memory as list x grows
I took 2 screenshots that show available memory decreasing as x grows (below). I'm eventually able to complete this step but I'm only left with a few GB of memory so I definitely want to fix this (in the future I want to work with larger data sets). How can I build features in a way where I'm not limited by the size of x?
Problem 2 - Memory error when converting x into numpy array
The step where the instance actually fails is the following:
x = np.array(x)
The failure message is:
Traceback (most recent call last):
File "build_features.py", line 149, in <module>
build_features(pipeline='9_11_2017_fan_3_lights')
File "build_features.py", line 122, in build_features
x = np.array(x)
MemoryError
How can I adjust this step so that I don't run out of memory?
Your code has two copies of every image: one in the list, and one in the array:

images = []
for i in range(many):
    images.append(load_img(i))  # here's the first copy of each image
x = np.array(images)  # join them all together into a second copy
Just load the images straight into the array:

x = np.zeros((many, 250, 500, 3))
for i in range(many):
    x[i] = load_img(i)
Which means that you only hold a copy of one image at a time.
If you don't know the size or dtype of the image ahead of time, or don't want to hard-code it, you can use:

x0 = load_img(0)
x = np.zeros((many,) + x0.shape, x0.dtype)
x[0] = x0
for i in range(1, many):
    x[i] = load_img(i)
Having said that, you're on a tricky path here. If you don't have enough room to store your dataset twice in memory, you also don't have room to compute y = x + 1.
You might want to consider using np.float16 to buy more storage, at the cost of precision.
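For a sense of the saving, a quick sketch (numbers are for the 250x500x3 images above):

import numpy as np

# Sketch: float16 halves the per-image footprint relative to float32.
img_f32 = np.zeros((250, 500, 3), dtype=np.float32)
img_f16 = img_f32.astype(np.float16)
print(img_f32.nbytes, img_f16.nbytes)  # 1500000 vs 750000 bytes per image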

Tensorflow ValueError: setting an array element with a sequence with images

I've looked through many forum sites trying to find a solution, but can't get it.
I am trying to use Tensorflow (Python 3, Win 10 64 bit) with my own set of images. When I run it, I get a ValueError. Specifically:
Traceback (most recent call last):
File "B:\Josh\Programming\Python\imgpredict\predict.py", line 62, in <module>
sess.run(train_step, feed_dict={imgs:batchX, lbls: batchY})
File "C:\Users\Josh\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "C:\Users\Josh\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 968, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File "C:\Users\Josh\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
My code is:
import tensorflow as tf
import numpy as np
import os
import sys
import cv2

content = []  # Where images are stored
labels_list = []

########## File opening function
with open("data/cats/files.txt") as ff:
    for line in ff:
        line = line.rstrip()
        content.append(line)
#################################

########## Labels opening function
with open("data/cats/labels.txt") as fff:
    for linee in fff:
        linee = linee.rstrip()
        labels_list.append(linee)
labels_list = np.array(labels_list)
###############################

def create_batches(batch_size):
    images1 = []
    for img1 in content:
        thedata = cv2.imread(img1)
        thedata = tf.contrib.layers.flatten(thedata)
        images1.append(thedata)
    images1 = np.asarray(images1)
    images1 = np.array(images1)
    while(True):
        for i in range(0, 298, 10):
            yield(images1[i:i+batch_size], labels_list[i:i+batch_size])

imgs = tf.placeholder(dtype=tf.float32, shape=[None, 262144])
lbls = tf.placeholder(dtype=tf.float32, shape=[None, 10])

W = tf.Variable(tf.zeros([262144, 10]))
b = tf.Variable(tf.zeros([10]))

y_ = tf.nn.softmax(tf.matmul(imgs, W) + b)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(lbls * tf.log(y_), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

for i in range(10000):#########################################
    for (batchX, batchY) in create_batches(10):
        for inn, imgs in enumerate(batchX):
            batchX[inn] = imgs.eval()
        sess.run(train_step, feed_dict={imgs:batchX, lbls: batchY})

correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(lbls, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={imgs:content, lbls:labels_list}))
I don't know if the error is from my images or my labels. I've tried lots of suggestions from other SO questions, Reddit, Google Plus, GitHub Issues, etc., but to no avail. My GitHub link for the project is: https://github.com/supamonkey2000/jm-uofa
and the project folder is "imgpredict"
Any help appreciated. Thanks in advance
In this case, I think you are seeing this error because you are passing a tensorflow object to the feed_dict when you are running the training. It could be a tensorflow object as a result of the flattening method you used:
thedata = tf.contrib.layers.flatten(thedata)
which will return a flattened tensor (more info in the docs) that for some reason isn't being properly evaluated.
Following this answer, to get past this issue you need to supply a numpy array to the feed dict. You could instead try:
thedata.flatten()
which will flatten the array to a vector. I tried it and it at least got rid of the error.
Beyond that, like Ofer Sadan pointed out, there are some fundamental issues with your approach. The most obvious one to me is that you are initializing your weight matrix to the image size (512 x 512 = 262144), but since you are loading 3-channel images (RGB color images) you end up with a flattened array three times that size (512 x 512 x 3 channels = 786432), so the training will fail anyway. Try converting to grayscale if the color isn't important to your training data (thedata = cv2.cvtColor(thedata, cv2.COLOR_BGR2GRAY)).
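To see that size mismatch concretely, a quick sketch with a stand-in 512x512 image (no file needed):

import cv2
import numpy as np

# Sketch: a 512x512 colour image flattens to 786432 values, three times
# the 262144 the placeholder expects; grayscale matches it exactly.
img = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for cv2.imread(...)
print(img.flatten().shape)                     # (786432,)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # shape (512, 512)
print(gray.flatten().shape)                    # (262144,)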
I apologize that this isn't a complete answer to the error, but I see many problems with your code that could generate it.
The first is the create_batches function. You use a list for images1 and a tensor for thedata, you append all those tensors to the list, and then you convert that list to a numpy array. That is very bad practice.
The second problem there: it is supposed to yield both images and labels, but the labels are not processed in that function at all and arrive from the global value. Because of that, I see no reason to assume that they even match the images when you do this:
yield(images1[i:i+batch_size],labels_list[i:i+batch_size])
After all that, it appears that your batchX is a list of tensors, so you again transform each of them to an array (with imgs.eval()). At that point only God knows what the actual shapes of the arrays are, and the error itself is usually an indication that batchX is not of a proper "rectangular" shape to be converted from a list into an array (for example, if one of the elements is an array of a certain length and the others are of a different length).
My suggestion: rewrite your function, simplify it, don't use tensors in it, and don't use plain lists in there either. It should return a simple numpy array of a shape that fits sess.run(train_step, feed_dict={imgs: batchX, lbls: batchY}); see the sketch below.
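A minimal sketch of such a rewrite (the grayscale conversion follows the other answer's suggestion, and it assumes the label rows are already in a numeric, feedable format):

import cv2
import numpy as np

def create_batches(batch_size, content, labels_list):
    # Sketch: build one rectangular float32 array of flattened grayscale
    # images up front, then slice images and labels in lockstep.
    images = np.array([cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2GRAY).flatten()
                       for p in content], dtype=np.float32)
    labels = np.asarray(labels_list, dtype=np.float32)
    for i in range(0, len(images), batch_size):
        yield images[i:i + batch_size], labels[i:i + batch_size]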

How can I manipulate my data to allow a random forest to run on it?

I want to train a random forest on a bunch of matrices (first link below for an example). I want to classify them as either "g" or "b" (good or bad, a or b, 1 or 0, it doesn't matter).
I've called the script randfore.py. I am currently using 10 examples, but I will be using a much bigger data set once I actually get this up and running.
Here is the code:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

working_dir = os.getcwd()  # Grabs the working directory
directory = working_dir + "/fakesourcestuff/"  ## The actual directory where the files are located

sources = list()  # Just sets up a list here which is going to become the input for the random forest

for i in range(10):
    cutoutfile = pd.read_csv(directory + "image2_with_fake_geotran_subtracted_corrected_cutout_" + str(i) + ".dat", dtype=object)  ## Where we get the input data for the random forest from
    sources.append(cutoutfile)  # add it to our sources list

targets = pd.read_csv(directory + "faketargets.dat", sep='\n', header=None, dtype=object)  # Reads in our target data... either "g" or "b" (Good or bad)

sources = pd.DataFrame(sources)  ## I convert the list to a dataframe to avoid the "ValueError: cannot copy sequence with size 99 to array axis with dimension 1" error. Necessary?

# Training sets
X_train = sources[:8]  # Inputs
y_train = targets[:8]  # Targets

# Random Forest
rf = RandomForestClassifier(n_estimators=10)
rf_fit = rf.fit(X_train, y_train)
Here is the current error output:
Traceback (most recent call last):
File "randfore.py", line 31, in <module>
rf_fit = rf.fit(X_train, y_train)
File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 247, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
I tried making the dtype object, but it hasn't helped. I'm just not sure what sort of manipulation I need to perform to make this work.
I think the problem is that the files I'm appending to sources aren't just numbers but a mix of numbers, commas, and various square brackets (each is basically a big matrix). Is there a natural way to import this? The square brackets in particular are probably an issue.
Before I converted sources to a DataFrame I was getting the following error:
ValueError: cannot copy sequence with size 99 to array axis with dimension 1
This is due to the dimensions of my input (100 lines long) and my target which has 10 rows and 1 column.
Here is the contents of the first file that's read into cutouts (they're all the exact same style) to be used as the input:
https://pastebin.com/tkysqmVu
And here is the contents of faketargets.dat, the targets:
https://pastebin.com/632RBqWc
Any ideas? Help greatly appreciated. I am sure there is a lot of fundamental confusion going on here.
Try writing:
X_train = sources.values[:8] # Inputs
y_train = targets.values[:8] # Targets
I hope this will solve your problem!
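If .values alone still trips check_array (each element of sources here is itself a whole DataFrame), one way past it, sketched below with stand-in data, is to flatten each matrix into a single numeric row so sklearn sees one plain 2-D array; parsing the real .dat files into numbers is left as an assumption:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Sketch with stand-in data: one flattened numeric row per sample.
matrices = [np.random.rand(10, 10) for _ in range(10)]  # stand-ins for the .dat matrices
X = np.vstack([m.ravel() for m in matrices])            # shape (10, 100)
y = pd.Series(list("ggbgbggbgb"))                       # stand-in "g"/"b" targets

rf = RandomForestClassifier(n_estimators=10)
rf.fit(X[:8], y.values[:8])                             # .values as suggested above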
