theano csv to pkl file

I am trying to create a .pkl file from a CSV file so that it can be loaded into Theano:
import numpy as np
import csv
import gzip, cPickle
from numpy import genfromtxt
import theano
import theano.tensor as T
#Open csv file and read in data
csvFile = "filename.csv"
my_data = genfromtxt(csvFile, delimiter=',', skip_header=1)
data_shape = "There are " + repr(my_data.shape[0]) + " samples of vector length " + repr(my_data.shape[1])
num_rows = my_data.shape[0] # Number of data samples
num_cols = my_data.shape[1] # Length of Data Vector
total_size = (num_cols-1) * num_rows
data = np.arange(total_size)
data = data.reshape(num_rows, num_cols-1) # 2D Matrix of data points
data = data.astype('float32')
label = np.arange(num_rows)
print label.shape
#label = label.reshape(num_rows, 1) # 2D Matrix of data points
label = label.astype('float32')
print data.shape
#Read through data file, assume label is in last col
for i in range(my_data.shape[0]):
    label[i] = my_data[i][num_cols-1]
    for j in range(num_cols-1):
        data[i][j] = my_data[i][j]
#Split data in terms of 70% train, 10% val, 20% test
train_num = int(num_rows * 0.7)
val_num = int(num_rows * 0.1)
test_num = int(num_rows * 0.2)
DataSetState = "This dataset has " + repr(data.shape[0]) + " samples of length " + repr(data.shape[1]) + ". The number of training examples is " + repr(train_num)
print DataSetState
train_set_x = data[:train_num]
train_set_y = label[:train_num]
val_set_x = data[train_num+1:train_num+val_num]
val_set_y = label[train_num+1:train_num+val_num]
test_set_x = data[train_num+val_num+1:]
test_set_y = label[train_num+val_num+1:]
# Divided dataset into 3 parts. split by percentage.
train_set = train_set_x, train_set_y
val_set = val_set_x, val_set_y
test_set = test_set_x, val_set_y
dataset = [train_set, val_set, test_set]
f = gzip.open(csvFile+'.pkl.gz','wb')
cPickle.dump(dataset, f, protocol=2)
f.close()
When I run the resulting pkl file through Theano (as a DBN or SdA), pretraining runs fine, which makes me think the data is stored correctly.
However, when it gets to fine-tuning I get the following error:
epoch 1, minibatch 2775/2775, validation error 0.000000 %
Traceback (most recent call last):
  File "SdA_custom.py", line 489, in <module>
    test_SdA()
  File "SdA_custom.py", line 463, in test_SdA
    test_losses = test_model()
  File "SdA_custom.py", line 321, in test_score
    return [test_score_i(i) for i in xrange(n_test_batches)]
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 606, in __call__
    storage_map=self.fn.storage_map)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 595, in __call__
    outputs = self.fn()
ValueError: Input dimension mis-match. (input[0].shape[0] = 10, input[1].shape[0] = 3)
Apply node that caused the error: Elemwise{neq,no_inplace}(argmax, Subtensor{int64:int64:}.0)
Inputs types: [TensorType(int64, vector), TensorType(int32, vector)]
Inputs shapes: [(10,), (3,)]
Inputs strides: [(8,), (4,)]
Inputs values: ['not shown', array([0, 0, 0], dtype=int32)]
Backtrace when the node is created:
  File "/home/dean/Documents/DeepLearningRepo/DeepLearningTutorials-master/code/logistic_sgd.py", line 164, in errors
    return T.mean(T.neq(self.y_pred, y))
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
10 is my batch size. If I change the batch size to 1, I get the following:
ValueError: Input dimension mis-match. (input[0].shape[0] = 1, input[1].shape[0] = 0)
I think I am storing the labels incorrectly when I build the pkl file, but I can't spot what is going wrong or why changing the batch size alters the error.
Hope you can help!

I came across this while looking into a similar error, so I am posting a reply in case it helps someone else. For me, the error went away when I changed n_out from 1 to 2 in the dbn_test() parameter list: n_out is the number of labels (classes), not the number of output layers.
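Separately from n_out, the script in the question builds test_set from test_set_x and val_set_y, and the `+1` offsets in its slices skip one sample at each boundary. The test labels therefore come from the much shorter validation split, which is consistent with the (10,) versus (3,) mismatch in the traceback. A minimal sketch of a consistent split follows; it uses synthetic data in place of the CSV, and the helper name split_dataset is made up for illustration:

```python
import numpy as np

def split_dataset(data, label, train_frac=0.7, val_frac=0.1):
    """Split rows into contiguous train/validation/test parts.

    Each part pairs inputs and labels taken from the same row range,
    and the ranges share boundaries so no row is skipped.
    """
    num_rows = data.shape[0]
    train_end = int(num_rows * train_frac)
    val_end = train_end + int(num_rows * val_frac)
    train_set = (data[:train_end], label[:train_end])
    val_set = (data[train_end:val_end], label[train_end:val_end])
    test_set = (data[val_end:], label[val_end:])
    return [train_set, val_set, test_set]

# Synthetic stand-in for the CSV contents (100 samples, 5 features).
rng = np.random.RandomState(0)
data = rng.rand(100, 5).astype('float32')
label = rng.randint(0, 2, size=100).astype('float32')

dataset = split_dataset(data, label)
for name, (x, y) in zip(("train", "val", "test"), dataset):
    print(name, x.shape, y.shape)
```

Printing the shapes of every (x, y) pair before pickling is a cheap way to catch this kind of mismatch early.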

How to implement multiprocessing in a for loop inside a function

I've built some code to minimize the sum of the weighted least squares of some residuals. I first read all the data from a .gz file and then process it on the code below (details are irrelevant). I want to use multiprocessing in order to speed up the "runFit" function.
My code is below:
"""
Fit 3D lines to cylinders
"""
from timeit import default_timer as timer
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Circle
from scipy.optimize import minimize
from numba import jit
from multiprocessing import Pool
def readData(filename):
    "Read compressed data."
    return np.loadtxt(filename, delimiter=",")
#jit(nopython=True)
def weightedResiduals(unknown, wire_coords, radii, d_radii, d_zcoords):
    "Calculates the sum of the weighted residuals"
    y_intercept = unknown[0]
    z_intercept = unknown[1]
    xy_slope = unknown[2]
    xz_slope = unknown[3]
    intercept_vector = np.array([0, y_intercept, z_intercept])
    gradient_vector = np.array([1, xy_slope, xz_slope])
    gradient_vector /= np.linalg.norm(gradient_vector)
    result = 0
    for index in range(np.shape(wire_coords)[0]):
        distance = np.linalg.norm(np.cross((wire_coords[index]-intercept_vector), gradient_vector)) - radii[index]
        weight = (d_radii[index]**2 + d_zcoords[index]**2)**(-1/2)
        result += (weight * distance)**2
    return result
def runFit(inputfilename, outputfilename):
    """
    Parameters
    ----------
    inputfilename : string
        input data file name for fitting.
    outputfilename : string
        result storage file name.
    Returns
    -------
    counter : int
        number of successful fits; 100% would be twice the number
        of events (two lines per event).
    """
    counter = 0
    #Reading the required data set
    fulldata = readData(inputfilename)
    #Defining the output array and filling in the first two columns
    event_no = int(fulldata[-1,0])
    result = np.zeros((2*event_no, 10))
    result[:,0] = np.repeat(np.arange(1, event_no+1), 2)
    line_no_array = np.empty((2*event_no,))
    line_no_array[::2] = 1
    line_no_array[1::2] = 2
    result[:,1] = line_no_array
    def singleEventFit(event):
        #Using masking to obtain required rows
        mask = (fulldata==event)
        desired_rows = mask[:, 0]
        #Calculating the fitted line variables using weighted least squares
        for line in range(1,3):
            #Extracting the desired rows from the full data array
            desired_array = fulldata[np.logical_and(desired_rows,(fulldata==line)[:,1])]
            #Extracting grouped data from the desired rows
            wire_coords = desired_array[:,2:5]
            wire_x_coords = wire_coords[:,0]
            wire_y_coords = wire_coords[:,1]
            wire_z_coords = wire_coords[:,2]
            radii = desired_array[:,5]
            d_radii, d_zcoords = desired_array[:,6], desired_array[:,7]
            #Estimating an initial guess for the fitted line variables
            x_min_index = np.argmin(np.abs(wire_x_coords))
            x_max_index = np.argmax(np.abs(wire_x_coords))
            y_intercept_guess = wire_y_coords[x_min_index]
            z_intercept_guess = wire_z_coords[x_min_index]
            xy_slope_guess = (wire_y_coords[x_max_index]-wire_y_coords[x_min_index])/(wire_x_coords[x_max_index]-wire_x_coords[x_min_index])
            xz_slope_guess = (wire_z_coords[x_max_index]-wire_z_coords[x_min_index])/(wire_x_coords[x_max_index]-wire_x_coords[x_min_index])
            init = np.array([y_intercept_guess, z_intercept_guess, xy_slope_guess, xz_slope_guess])
            #Minimizing the sum of the weighted residuals
            fit_vars = minimize(weightedResiduals, init, args=(wire_coords, radii, d_radii, d_zcoords), tol=1e-5)
            if fit_vars.success == True:
                y_intercept, z_intercept = fit_vars.x[0], fit_vars.x[1]
                xy_slope, xz_slope = fit_vars.x[2], fit_vars.x[3]
                #Using the half of the inverse of the Hessian matrix as the covariance matrix to recover errors
                std_array = np.sqrt(np.diag(0.5*fit_vars.hess_inv))
                #Inputting the variables and their errors on the output array
                result[2*event+line-3, 2], result[2*event+line-3, 4] = y_intercept, xy_slope
                result[2*event+line-3, 6], result[2*event+line-3, 8] = z_intercept, xz_slope
                result[2*event+line-3, 3], result[2*event+line-3, 5] = std_array[0], std_array[2]
                result[2*event+line-3, 7], result[2*event+line-3, 9] = std_array[1], std_array[3]
    with Pool() as pool:
        pool.map(singleEventFit, [event for event in range(1, event_no+1)])
    #Returning resulting array as a text file
    np.savetxt(outputfilename, result, delimiter=',')
    return counter
start = timer()
if __name__=='__main__':
    print("Successful Plots: " + str(runFit("tendata.txt.gz", "output.txt.gz")))
end = timer()
print("Time: " + str(end-start) + "s")
However, I get the following traceback:
Traceback (most recent call last):
  File "C:\Users\vanes\Downloads\Python Project\untitled0.py", line 113, in <module>
    print("Successful Plots: " + str(runFit("tendata.txt.gz", "output.txt.gz")))
  File "C:\Users\vanes\Downloads\Python Project\untitled0.py", line 105, in runFit
    pool.map(singleEventFit, [event for event in range(1, event_no+1)])
  File "C:\Users\vanes\anaconda3\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\vanes\anaconda3\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
  File "C:\Users\vanes\anaconda3\lib\multiprocessing\pool.py", line 537, in _handle_tasks
    put(task)
  File "C:\Users\vanes\anaconda3\lib\multiprocessing\connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\vanes\anaconda3\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'runFit.<locals>.singleEventFit'
Is there any way to use multiprocessing to speed up the for loop?
The advice I found online is to move the inner function out of runFit() and make it global. However, that does not work as-is, because the loop needs variables that are defined inside runFit().
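One way to reconcile the two constraints is to keep the worker at module level (so it can be pickled by name) and hand it the data it needs as an argument, for example with functools.partial. Each worker process also has its own copy of any array, so writes into `result` from a worker would not reach the parent; returning the fitted values and assembling them afterwards avoids that. The sketch below is only an illustration under those assumptions: the names fit_single_event and run_fit are hypothetical, and a column average stands in for the real minimize(...) call.

```python
from functools import partial
from multiprocessing import Pool

import numpy as np

def fit_single_event(event, fulldata):
    """Module-level worker: picklable, and it returns its rows instead of
    writing into an array owned by the parent process."""
    rows = fulldata[fulldata[:, 0] == event]
    out = []
    for line in (1, 2):
        selected = rows[rows[:, 1] == line]
        # Stand-in for the real minimize(...) call: average column 2.
        out.append((event, line, float(selected[:, 2].mean())))
    return out

def run_fit(fulldata, processes=None):
    event_no = int(fulldata[-1, 0])
    # partial binds the data; it is sent to the workers along with the tasks.
    worker = partial(fit_single_event, fulldata=fulldata)
    with Pool(processes) as pool:
        per_event = pool.map(worker, range(1, event_no + 1))
    # Flatten the per-event lists into one (2 * event_no, 3) array.
    return np.array([row for rows in per_event for row in rows])

if __name__ == "__main__":
    demo = np.array([
        [1, 1, 2.0], [1, 1, 4.0], [1, 2, 6.0],
        [2, 1, 1.0], [2, 2, 3.0], [2, 2, 5.0],
    ])
    print(run_fit(demo, processes=2))
```

The `if __name__ == "__main__":` guard matters on Windows, where worker processes re-import the main module.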

How to generate predictions on testing triplets dataset after training Siamese network

I have a dataset of images and two txt files in which each line contains the ids of three pictures. The first file is for training: each line tells me that the first picture is more similar to the second one than to the third one. The second file is for testing: for each line I have to predict whether the first image is more similar to the second or to the third one.
To do this I trained a Siamese network with triplet loss, using this article as a guideline: https://keras.io/examples/vision/siamese_network/
After training the network I do not know how to evaluate my testing dataset. To prepare the data I did the following:
with open('test_triplets.txt') as f:
    lines2 = f.readlines()
lines2 = [line.split('\n', 1)[0] for line in lines2]
anchor2 = [line.split()[0] for line in lines2]
pic1 = [line.split()[1] for line in lines2]
pic2 = [line.split()[2] for line in lines2]
anchor2 = ['food/' + item + '.jpg' for item in anchor2]
pic1 = ['food/' + item + '.jpg' for item in pic1]
pic2 = ['food/' + item + '.jpg' for item in pic2]
anchor2_dataset = tf.data.Dataset.from_tensor_slices(anchor2)
pic1_dataset = tf.data.Dataset.from_tensor_slices(pic1)
pic2_dataset = tf.data.Dataset.from_tensor_slices(pic2)
test_dataset = tf.data.Dataset.zip((anchor2_dataset, pic1_dataset, pic2_dataset))
test_dataset = test_dataset.map(preprocess_triplets)
test_dataset = test_dataset.batch(32, drop_remainder=False)
test_dataset = test_dataset.prefetch(8)
I have then tried to utilise a for loop as follows, but the running time is too high since I have around 50000 lines in the txt file.
n_images = len(anchor2)
results = np.zeros((n_images,2))
for i in range(n_images):
    sample = next(iter(test_dataset))
    anchor, positive, negative = sample
    anchor_embedding, positive_embedding, negative_embedding = (
        embedding(resnet.preprocess_input(anchor)),
        embedding(resnet.preprocess_input(positive)),
        embedding(resnet.preprocess_input(negative)),
    )
    cosine_similarity = metrics.CosineSimilarity()
    positive_similarity = cosine_similarity(anchor_embedding, positive_embedding)
    results[i,0] = positive_similarity.numpy()
    negative_similarity = cosine_similarity(anchor_embedding, negative_embedding)
    results[i,1] = negative_similarity.numpy()
How can I generate predictions on my testing triplets? My objective is a vector of shape [n_testing_triplets x 1] where each entry is 1 if the first picture is more similar to the anchor, and 0 otherwise.

You can stack your images first, then calculate all the embeddings in parallel, like this:
import numpy as np
stack = np.stack([anchor0, positive0, negative0, ..., anchor999, positive999, negative999])
# then calculate all the embeddings at once
embeddings = list(embedding(resnet.preprocess_input(stack)).numpy())
Then compare the embeddings however you want, in a loop:
cosine_similarity = metrics.CosineSimilarity()
positive_similarity = cosine_similarity(embeddings[0], embeddings[1])
whatever_storage = positive_similarity.numpy()
negative_similarity = cosine_similarity(embeddings[0], embeddings[2])
whatever_storage = negative_similarity.numpy()
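Two details of the loop in the question are worth noting. `next(iter(test_dataset))` builds a fresh iterator on every pass, so it keeps returning the first batch, and the loop runs once per image although the dataset is already batched. Also, as far as I know, `metrics.CosineSimilarity()` reports a single value averaged over the batch rather than one value per triplet. A minimal sketch of a per-row comparison over batches follows; it uses NumPy only, the helper names rowwise_cosine and predict_triplets are hypothetical, and the identity function stands in for the trained embedding model:

```python
import numpy as np

def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between matching rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, eps)

def predict_triplets(batches, embed):
    """batches yields (anchor, pic1, pic2) arrays; embed maps a batch to
    2-D embeddings. Returns an (n, 1) array with 1 where pic1 is at least
    as similar to the anchor as pic2, else 0."""
    preds = []
    for anchor, pic1, pic2 in batches:
        ea, e1, e2 = embed(anchor), embed(pic1), embed(pic2)
        closer_to_first = rowwise_cosine(ea, e1) >= rowwise_cosine(ea, e2)
        preds.append(closer_to_first.astype(int))
    return np.concatenate(preds).reshape(-1, 1)

# Toy stand-ins: the "images" are already 2-D vectors, embedding is identity.
anchor = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pic1 = np.array([[2.0, 0.1], [1.0, 0.0], [1.0, 0.9]])
pic2 = np.array([[0.0, 1.0], [0.1, 3.0], [-1.0, 0.0]])
batches = [(anchor[:2], pic1[:2], pic2[:2]), (anchor[2:], pic1[2:], pic2[2:])]

predictions = predict_triplets(batches, embed=lambda x: x)
print(predictions.ravel())  # [1 0 1]
```

With the pipeline from the question, the batches would presumably come from iterating `test_dataset` directly, and `embed` would wrap `embedding(resnet.preprocess_input(x)).numpy()`; that wiring is untested here.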

lightfm error: Not all estimated parameters are finite, your model may have diverged

I'm running this very simple code:
def csr_values_analysis(values):
    num_zeros = 0
    num_ones = 0
    num_other = 0
    for v in values:
        if v == 0:
            num_zeros += 1
        elif v == 1:
            num_ones += 1
        else:
            num_other += 1
    return num_zeros, num_ones, num_other
print("Reading user_features.npz")
with open("/path/to/user_features.npz", "rb") as in_file:
    user_features_csr = sp.load_npz(in_file)
print("User features read, shape: {}".format(user_features_csr.shape))
print("Data values analysis: zeros: %i, ones: %i, other: %i" % csr_values_analysis(user_features_csr.data))
print("Reading item_features.npz")
with open("/path/to/item_features.npz", "rb") as in_file:
    item_features_csr = sp.load_npz(in_file)
print("Item features read, shape: {}".format(item_features_csr.shape))
print("Data values analysis: zeros: %i, ones: %i, other: %i" % csr_values_analysis(item_features_csr.data))
print("Reading interactions.npz")
with open("/path/to/interactions.npz", "rb") as in_file:
    interactions_csr = sp.load_npz(in_file)
print("Interactions read, shape: {}".format(interactions_csr.shape))
print("Data values analysis: zeros: %i, ones: %i, other: %i" % csr_values_analysis(interactions_csr.data))
interactions_coo = interactions_csr.tocoo()
# Run lightfm
print("Running lightfm...")
model = LightFM(loss='warp')
model.fit(interactions_coo, user_features=user_features_csr, item_features=item_features_csr, epochs=20, num_threads=2, verbose=True)
With the following output:
Reading user_features.npz
User features read, shape: (827568, 105)
Data values analysis: zeros: 0, ones: 3153032, other: 0
Reading item_features.npz
Item features read, shape: (67339359, 36)
Data values analysis: zeros: 0, ones: 25259081, other: 0
Reading interactions.npz
Interactions read, shape: (827568, 67339359)
Data values analysis: zeros: 0, ones: 172388, other: 0
Running lightfm...
Epoch 0
Traceback (most recent call last):
  File "training.py", line 92, in <module>
    model.fit(interactions_coo, user_features=user_features_csr, item_features=item_features_csr, epochs=20, num_threads=2, verbose=True)
  File "/usr/lib64/python3.6/site-packages/lightfm/lightfm.py", line 479, in fit
    verbose=verbose)
  File "/usr/lib64/python3.6/site-packages/lightfm/lightfm.py", line 578, in fit_partial
    self._check_finite()
  File "/usr/lib64/python3.6/site-packages/lightfm/lightfm.py", line 413, in _check_finite
    raise ValueError("Not all estimated parameters are finite,"
ValueError: Not all estimated parameters are finite, your model may have diverged. Try decreasing the learning rate or normalising feature values and sample weights
All my SciPy sparse matrices are normalized (i.e. the values are 0 or 1).
I've tried changing the learning schedule and the learning rate, with no effect.
I've checked that this only occurs when I add the item features. There is no error when running lightfm with interactions only, or with interactions + user features.
AFAIK, I've installed the latest version:
$ pip freeze | grep lightfm
lightfm==1.15
Any idea? Thanks!
UPDATE 1
I wondered whether my sparse matrices were too sparse. Nevertheless, I tried with very small shapes, and the same error arises:
>>> import scipy.sparse as sp
>>> import numpy as np
>>> import lightfm
>>> uf_row = np.array([2,4,9])
>>> uf_col = np.array([4,9,3])
>>> uf_data = np.array([1,1,1])
>>> if_row = np.array([0,3])
>>> if_col = np.array([9,7])
>>> if_data = np.array([1,1])
>>> i_row = np.array([1])
>>> i_col = np.array([8])
>>> i_data = np.array([1])
>>> uf_csr = sp.csr_matrix((uf_data, (uf_row, uf_col)), shape=(10, 10))
>>> if_csr = sp.csr_matrix((if_data, (if_row, if_col)), shape=(10, 10))
>>> i_csr = sp.csr_matrix((i_data, (i_row, i_col)), shape=(10, 10))
>>> model = lightfm.LightFM(loss='warp')
>>> model.fit(i_csr.tocoo(), user_features=uf_csr, item_features=if_csr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/lightfm/lightfm.py", line 479, in fit
    verbose=verbose)
  File "/usr/lib64/python3.6/site-packages/lightfm/lightfm.py", line 578, in fit_partial
    self._check_finite()
  File "/usr/lib64/python3.6/site-packages/lightfm/lightfm.py", line 413, in _check_finite
    raise ValueError("Not all estimated parameters are finite,"
ValueError: Not all estimated parameters are finite, your model may have diverged. Try decreasing the learning rate or normalising feature values and sample weights
Definitely, I'm doing something wrong...
UPDATE 2
I think I found the problem... I did the following experiment:
>>> uf_csr = sp.csr_matrix((np.array([1]),(np.array([0]), np.array([0]))),shape=(20,20))
>>> if_csr = sp.csr_matrix((np.array([1]),(np.array([0]), np.array([0]))),shape=(20,20))
>>> i_csr = sp.csr_matrix((np.array([1]),(np.array([1]), np.array([1]))),shape=(20,20))
>>> model = lightfm.LightFM(loss='warp')
>>> model.fit(i_csr.tocoo(), user_features=uf_csr, item_features=if_csr, epochs=20)
Traceback (most recent call last):
...
ValueError: Not all estimated parameters are finite, your model may have diverged. Try decreasing the learning rate or normalising feature values and sample weights
I.e. I got the exception as usual. Note that the interaction matrix contains an interaction between a user and an item whose features are all set to 0 in the user and item feature matrices, respectively. So, let's change this, in the user features matrix for instance:
>>> uf_csr = sp.csr_matrix((np.array([1,1]),(np.array([0,1]), np.array([0,0]))),shape=(20,20))
>>> model.fit(i_csr.tocoo(), user_features=uf_csr, item_features=if_csr, epochs=20)
<lightfm.lightfm.LightFM object at 0x7f2d39ea3490>
Et voilà!
We can do the same with the item features matrix:
>>> uf_csr = sp.csr_matrix((np.array([1]),(np.array([0]), np.array([0]))),shape=(20,20))
>>> if_csr = sp.csr_matrix((np.array([1,1]),(np.array([0,1]), np.array([0,0]))),shape=(20,20))
>>> model.fit(i_csr.tocoo(), user_features=uf_csr, item_features=if_csr, epochs=20)
<lightfm.lightfm.LightFM object at 0x7f2d39ea3490>
So, I'll try to find a way of filtering interactions related to all-zero user and item features and I'll post it ;)

As explained in the last update to my question, the problem was with users and items that have all their features set to zero, combined with an interaction that involves one of these users and one of these items at the same time.
That said, my first thought was to remove the interactions involving these users and items, but that could affect the recommendations for those users, or the recommendation of those items.
A different solution is to extend the user and item feature matrices with an identity (diagonal) matrix, so that every user and item has at least one feature (the user or item itself) set to 1:
0 0 0        0 0 0 1 0 0
0 1 0  -->   0 1 0 0 1 0
1 0 0        1 0 0 0 0 1
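The extension pictured above can be sketched with scipy.sparse.hstack. This is only an illustration of the idea on the 3x3 example, not code from the original post, and the helper name add_identity_features is made up:

```python
import numpy as np
import scipy.sparse as sp

def add_identity_features(features):
    """Append one indicator column per row so that no row is all zeros."""
    n_rows = features.shape[0]
    identity = sp.identity(n_rows, dtype=features.dtype, format="csr")
    return sp.hstack([features, identity], format="csr")

# The 3x3 feature matrix from the illustration above.
features = sp.csr_matrix(
    np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0]], dtype=np.float32)
)
augmented = add_identity_features(features)
print(augmented.toarray())
```

The same helper would apply to both the user and the item feature matrices. It adds one feature (and so one embedding) per row, so the model grows with the number of users and items.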

could not broadcast input array from shape (20,310,310) into shape (20)

I'm trying to detect lung cancer nodules using DICOM files. The main steps in cancer detection are the following:
1) Preprocessing
* Converting the pixel values to Hounsfield Units (HU)
* Resampling to an isotropic resolution to remove variance in scanner resolution
* Lung segmentation
2) Training on the preprocessed images with a TensorFlow CNN
3) Testing and validation
I followed a few online tutorials to do this.
I need to combine the solutions given in
1) https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial
2) https://www.kaggle.com/sentdex/first-pass-through-data-w-3d-convnet
I could implement the example in link two. But since it lacks lung segmentation and a few other preprocessing steps, I need to combine the steps in link one with link two, and I'm getting a number of errors while doing so. Since I'm new to Python, can someone please help me solve this?
There are 20 patient folders, and each patient folder contains a number of slices, which are DICOM files.
The slices of each patient and the patient number are passed to the process_data method:
def process_data(slices,patient,labels_df,img_px_size,hm_slices):
    try:
        label=labels_df.get_value(patient,'cancer')
        patient_pixels = get_pixels_hu(slices)
        segmented_lungs2, spacing = resample(patient_pixels, slices, [1,1,1])
        new_slices=[]
        segmented_lung = segment_lung_mask(segmented_lungs2, False)
        segmented_lungs_fill = segment_lung_mask(segmented_lungs2, True)
        segmented_lungs=segmented_lungs_fill-segmented_lung
        #This method returns smallest integer not less than x.
        chunk_sizes =math.ceil(len(segmented_lungs)/HM_SLICES)
        for slice_chunk in chunks(segmented_lungs,chunk_sizes):
            slice_chunk=list(map(mean,zip(*slice_chunk))) #list - []
            #print (slice_chunk)
            new_slices.append(slice_chunk)
        print(len(segmented_lungs), len(new_slices))
        if len(new_slices)==HM_SLICES-1:
            new_slices.append(new_slices[-1])
        if len(new_slices)==HM_SLICES-2:
            new_slices.append(new_slices[-1])
            new_slices.append(new_slices[-1])
        if len(new_slices)==HM_SLICES+2:
            new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
            del new_slices[HM_SLICES]
            new_slices[HM_SLICES-1]=new_val
        if len(new_slices)==HM_SLICES+1:
            new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
            del new_slices[HM_SLICES]
            new_slices[HM_SLICES-1]=new_val
        print('LENGTH ',len(segmented_lungs), len(new_slices))
    except Exception as e:
        # again, some patients are not labeled, but JIC we still want the error if something
        # else is wrong with our code
        print(str(e))
    #print(len(new_slices))
    if label==1: label=np.array([0,1])
    elif label==0: label=np.array([1,0])
    return np.array(new_slices),label
Main method
# Some constants
#data_dir = '../../CT_SCAN_IMAGE_SET/IMAGES/'
#patients = os.listdir(data_dir)
#labels_df=pd.read_csv('../../CT_SCAN_IMAGE_SET/stage1_labels.csv',index_col=0)
#patients.sort()
#print (labels_df.head())
much_data=[]
much_data2=[]
for num,patient in enumerate(patients):
    if num%100==0:
        print (num)
    try:
        slices = load_scan(data_dir + patients[num])
        img_data,label=process_data(slices,patients[num],labels_df,IMG_PX_SIZE,HM_SLICES)
        much_data.append([img_data,label])
        #much_data2.append([processed,label])
    except:
        print ('This is unlabeled data')
np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data)
#np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data2)
The preprocessing part works fine, but when I feed the final output to a convolutional neural network and train on the data set, I get the following error (the output includes some of the debug prints I added):
0
shape hu
(113, 512, 512)
Resize factor
[ 2.49557522 0.6015625 0.6015625 ]
shape
(282, 308, 308)
chunk size
15
282 19
LENGTH 282 20
Tensor("Placeholder:0", dtype=float32)
..........1.........
..........2.........
..........3.........
..........4.........
WARNING:tensorflow:From C:\Research\Python_installation\lib\site-packages\tensorflow\python\util\tf_should_use.py:170: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
..........5.........
..........6.........
Epoch 1 completed out of 20 loss: 0
..........7.........
Traceback (most recent call last):
  File "C:\Research\LungCancerDetaction\sendbox2.py", line 436, in <module>
    train_neural_network(x)
  File "C:\Research\LungCancerDetaction\sendbox2.py", line 424, in train_neural_network
    print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
  File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\framework\ops.py", line 606, in eval
    return _eval_using_default_session(self, feed_dict, self.graph, session)
  File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\framework\ops.py", line 3928, in _eval_using_default_session
    return session.run(tensors, feed_dict)
  File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
    run_metadata_ptr)
  File "C:\Research\Python_installation\lib\site-packages\tensorflow\python\client\session.py", line 968, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File "C:\Research\Python_installation\lib\site-packages\numpy\core\numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not broadcast input array from shape (20,310,310) into shape (20)
I think the issue is with the line 'segmented_lungs=segmented_lungs_fill-segmented_lung'.
In the working example, it was:
segmented_lungs=[cv2.resize(each_slice,(IMG_PX_SIZE,IMG_PX_SIZE)) for each_slice in patient_pixels]
Please help me solve this. I have been stuck on it for some time. If anything is not clear, please let me know.
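The traceback shows np.asarray failing while it stacks the per-patient arrays in validation_data, which happens when those arrays do not all share one shape. After resampling to 1 mm spacing the in-plane size depends on the patient (308x308 in the log above, 310x310 in the error), and the cv2.resize step from the working example no longer runs. A minimal sketch of bringing every slice to a common size before stacking follows; it assumes a NumPy-only nearest-neighbour resize as a stand-in for cv2.resize, and the helper names resize_nearest and to_fixed_size are hypothetical:

```python
import numpy as np

IMG_PX_SIZE = 50  # same constant name as in the question

def resize_nearest(slice_2d, size):
    """Nearest-neighbour resize of one 2-D slice to (size, size)."""
    rows = np.arange(size) * slice_2d.shape[0] // size
    cols = np.arange(size) * slice_2d.shape[1] // size
    return slice_2d[np.ix_(rows, cols)]

def to_fixed_size(volume, size=IMG_PX_SIZE):
    """Resize every slice of a (slices, height, width) volume."""
    return np.stack([resize_nearest(s, size) for s in volume])

# Two fake patients with the in-plane shapes seen in the log and the error.
patient_a = np.zeros((20, 308, 308), dtype=np.int8)
patient_b = np.ones((20, 310, 310), dtype=np.int8)

batch = np.stack([to_fixed_size(p) for p in (patient_a, patient_b)])
print(batch.shape)  # (2, 20, 50, 50)
```

In the question's code the equivalent change would presumably be to resize each slice of segmented_lungs to (IMG_PX_SIZE, IMG_PX_SIZE) before the chunk averaging, as the working example does for patient_pixels.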
The following is the whole code that I tried.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import dicom
import os
import scipy.ndimage
import matplotlib.pyplot as plt
import cv2
import math
import tensorflow as tf
from skimage import measure, morphology
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
# Some constants
data_dir = '../../CT_SCAN_IMAGE_SET/IMAGES/'
patients = os.listdir(data_dir)
labels_df=pd.read_csv('../../CT_SCAN_IMAGE_SET/stage1_labels.csv',index_col=0)
patients.sort()
print (labels_df.head())
#Image pixel array watching
for patient in patients[:10]:
    #label is to get the label of the patient. This is what done in the .get_value method.
    label=labels_df.get_value(patient,'cancer')
    path=data_dir+patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    #You have dicom files and they have attributes.
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    print (len(slices),slices[0].pixel_array.shape)
#If u need to see many slices and resize the large pixelated 2D images into 150*150 pixelated images
IMG_PX_SIZE=50
HM_SLICES=20
for patient in patients[:1]:
    #label is to get the label of the patient. This is what done in the .get_value method.
    label=labels_df.get_value(patient,'cancer')
    path=data_dir+patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    #You have dicom files and they have attributes.
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    #This shows the pixel arrayed image related to the second slice of each patient
    #subplot
    fig=plt.figure()
    for num,each_slice in enumerate(slices[:16]):
        print (num)
        y=fig.add_subplot(4,4,num+1)
        #down sizing everything. Resize the imag size as their pixel values are 512*512
        new_image=cv2.resize(np.array(each_slice.pixel_array),(IMG_PX_SIZE,IMG_PX_SIZE))
        y.imshow(new_image)
    plt.show()
print (len(patients))
###################################################################################
def get_pixels_hu(slices):
    image = np.array([s.pixel_array for s in slices])
    # Convert to int16 (from sometimes int16),
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)
    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    # Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):
        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope
        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)
        image[slice_number] += np.int16(intercept)
    return np.array(image, dtype=np.int16)
#The next problem is each patient is got different number of slices . This is a performance issue.
# Take the slices and put that into a list of slices and chunk that list of slices into fixed numer of
#chunk of slices and averaging those chunks.
#yield is like 'return'. It returns a generator
def chunks(l,n):
    for i in range(0,len(l),n):
        #print ('Inside yield')
        #print (i)
        yield l[i:i+n]
def mean(l):
    return sum(l)/len(l)
def largest_label_volume(im, bg=-1):
    vals, counts = np.unique(im, return_counts=True)
    counts = counts[vals != bg]
    vals = vals[vals != bg]
    if len(counts) > 0:
        return vals[np.argmax(counts)]
    else:
        return None
def segment_lung_mask(image, fill_lung_structures=True):
    # not actually binary, but 1 and 2.
    # 0 is treated as background, which we do not want
    binary_image = np.array(image > -320, dtype=np.int8)+1
    labels = measure.label(binary_image)
    # Pick the pixel in the very corner to determine which label is air.
    # Improvement: Pick multiple background labels from around the patient
    # More resistant to "trays" on which the patient lays cutting the air
    # around the person in half
    background_label = labels[0,0,0]
    #Fill the air around the person
    binary_image[background_label == labels] = 2
    # Method of filling the lung structures (that is superior to something like
    # morphological closing)
    if fill_lung_structures:
        # For every slice we determine the largest solid structure
        for i, axial_slice in enumerate(binary_image):
            axial_slice = axial_slice - 1
            labeling = measure.label(axial_slice)
            l_max = largest_label_volume(labeling, bg=0)
            if l_max is not None: #This slice contains some lung
                binary_image[i][labeling != l_max] = 1
    binary_image -= 1 #Make the image actual binary
    binary_image = 1-binary_image # Invert it, lungs are now 1
    # Remove other air pockets insided body
    labels = measure.label(binary_image, background=0)
    l_max = largest_label_volume(labels, bg=0)
    if l_max is not None: # There are air pockets
        binary_image[labels != l_max] = 0
    return binary_image
#Loading the files
#Load the scans in given folder path
def load_scan(path):
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
    for s in slices:
        s.SliceThickness = slice_thickness
    return slices
def resample(image, scan, new_spacing=[1,1,1]):
    # Determine current pixel spacing
    spacing = np.array([scan[0].SliceThickness] + scan[0].PixelSpacing, dtype=np.float32)
    resize_factor = spacing / new_spacing
    new_real_shape = image.shape * resize_factor
    new_shape = np.round(new_real_shape)
    real_resize_factor = new_shape / image.shape
    new_spacing = spacing / real_resize_factor
    print ('Resize factor')
    print (real_resize_factor)
    image = scipy.ndimage.interpolation.zoom(image, real_resize_factor, mode='nearest')
    print ('shape')
    print (image.shape)
    return image, new_spacing
'''def chunks(l,n):
    for i in range(0,len(l),n):
        #print ('Inside yield')
        #print (i)
        yield l[i:i+n]
def mean(l):
    return sum(l)/len(l)'''
#processing data
def process_data(slices,patient,labels_df,img_px_size,hm_slices):
    #for patient in patients[:10]:
    #label is to get the label of the patient. This is what done in the .get_value method.
    try:
        label=labels_df.get_value(patient,'cancer')
        print ('label process data')
        print (label)
        #path=data_dir+patient
        #slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
        #You have dicom files and they have attributes.
        slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
        #This shows the pixel arrayed image related to the second slice of each patient
        patient_pixels = get_pixels_hu(slices)
        print ('shape hu')
        print (patient_pixels.shape)
        segmented_lungs2, spacing = resample(patient_pixels, slices, [1,1,1])
        #print ('Pix shape')
        #print (segmented_lungs2.shape)
        #segmented_lungs=np.array(segmented_lungs2).tolist()
        new_slices=[]
        segmented_lung = segment_lung_mask(segmented_lungs2, False)
        segmented_lungs_fill = segment_lung_mask(segmented_lungs2, True)
        segmented_lungs=segmented_lungs_fill-segmented_lung
        #print ('length of segmented lungs')
        #print (len(segmented_lungs))
        #print ('Shape of segmented lungs......................................')
        #print (segmented_lungs.shape)
        #print ('hiiii')
        #segmented_lungs=[cv2.resize(each_slice,(IMG_PX_SIZE,IMG_PX_SIZE)) for each_slice in segmented_lungs3]
        #print ('bye')
        #print ('length of slices')
        #print (len(slices))
        #print ('shape of slices')
        #print (slices.shape)
        #print (each_slice.pixel_array)
        #This method returns smallest integer not less than x.
        chunk_sizes =math.ceil(len(segmented_lungs)/HM_SLICES)
        print ('chunk size ')
        print (chunk_sizes)
        for slice_chunk in chunks(segmented_lungs,chunk_sizes):
            slice_chunk=list(map(mean,zip(*slice_chunk))) #list - []
            #print (slice_chunk)
            new_slices.append(slice_chunk)
        print(len(segmented_lungs), len(new_slices))
        if len(new_slices)==HM_SLICES-1:
            new_slices.append(new_slices[-1])
        if len(new_slices)==HM_SLICES-2:
            new_slices.append(new_slices[-1])
            new_slices.append(new_slices[-1])
        if len(new_slices)==HM_SLICES-3:
            new_slices.append(new_slices[-1])
            new_slices.append(new_slices[-1])
            new_slices.append(new_slices[-1])
        if len(new_slices)==HM_SLICES+2:
            new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
            del new_slices[HM_SLICES]
            new_slices[HM_SLICES-1]=new_val
        if len(new_slices)==HM_SLICES+1:
            new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
            del new_slices[HM_SLICES]
            new_slices[HM_SLICES-1]=new_val
        if len(new_slices)==HM_SLICES+3:
            new_val =list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
            del new_slices[HM_SLICES]
            new_slices[HM_SLICES-1]=new_val
        print('LENGTH ',len(segmented_lungs), len(new_slices))
    except Exception as e:
        # again, some patients are not labeled, but JIC we still want the error if something
        # else is wrong with our code
        print(str(e))
    #print(len(new_slices))
    if label==1: label=np.array([0,1])
    elif label==0: label=np.array([1,0])
    return np.array(new_slices),label
# Some constants
#data_dir = '../../CT_SCAN_IMAGE_SET/IMAGES/'
#patients = os.listdir(data_dir)
#labels_df=pd.read_csv('../../CT_SCAN_IMAGE_SET/stage1_labels.csv',index_col=0)
#patients.sort()
#print (labels_df.head())
much_data=[]
much_data2=[]
for num,patient in enumerate(patients):
    if num%100==0:
        print (num)
    try:
        slices = load_scan(data_dir + patients[num])
        img_data,label=process_data(slices,patients[num],labels_df,IMG_PX_SIZE,HM_SLICES)
        much_data.append([img_data,label])
        #much_data2.append([processed,label])
    except:
        print ('This is unlabeled data')
np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data)
#np.save('muchdata-{}-{}-{}.npy'.format(IMG_PX_SIZE,IMG_PX_SIZE,HM_SLICES),much_data2)
IMG_SIZE_PX = 50
SLICE_COUNT = 20
n_classes=2
batch_size=10
x = tf.placeholder('float')
y = tf.placeholder('float')
keep_rate = 0.8
def conv3d(x, W):
    return tf.nn.conv3d(x, W, strides=[1,1,1,1,1], padding='SAME')
def maxpool3d(x):
    # ksize: size of window; strides: movement of window as you slide about
    return tf.nn.max_pool3d(x, ksize=[1,2,2,2,1], strides=[1,2,2,2,1], padding='SAME')
def convolutional_neural_network(x):
    # 3 x 3 x 3 patches, 1 channel, 32 features to compute.
    weights = {'W_conv1':tf.Variable(tf.random_normal([3,3,3,1,32])),
               # 3 x 3 x 3 patches, 32 channels, 64 features to compute.
               'W_conv2':tf.Variable(tf.random_normal([3,3,3,32,64])),
               # 64 features
               'W_fc':tf.Variable(tf.random_normal([54080,1024])),
               'out':tf.Variable(tf.random_normal([1024, n_classes]))}
    biases = {'b_conv1':tf.Variable(tf.random_normal([32])),
              'b_conv2':tf.Variable(tf.random_normal([64])),
              'b_fc':tf.Variable(tf.random_normal([1024])),
              'out':tf.Variable(tf.random_normal([n_classes]))}
    # reshape to [batch, image X, image Y, image Z, channels]
    x = tf.reshape(x, shape=[-1, IMG_SIZE_PX, IMG_SIZE_PX, SLICE_COUNT, 1])
    conv1 = tf.nn.relu(conv3d(x, weights['W_conv1']) + biases['b_conv1'])
    conv1 = maxpool3d(conv1)
    conv2 = tf.nn.relu(conv3d(conv1, weights['W_conv2']) + biases['b_conv2'])
    conv2 = maxpool3d(conv2)
    fc = tf.reshape(conv2,[-1, 54080])
    fc = tf.nn.relu(tf.matmul(fc, weights['W_fc'])+biases['b_fc'])
    fc = tf.nn.dropout(fc, keep_rate)
    output = tf.matmul(fc, weights['out'])+biases['out']
    return output
much_data = np.load('muchdata-50-50-20.npy')
# If you are working with the basic sample data, use maybe 2 instead of 100 here... you don't have enough data to really do this
train_data = much_data[:-4]
validation_data = much_data[-4:]
def train_neural_network(x):
    print ('..........1.........')
    prediction = convolutional_neural_network(x)
    print ('..........2.........')
    #cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(prediction,y) )
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y))
    print ('..........3.........')
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(cost)
    print ('..........4.........')
    hm_epochs = 20
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        successful_runs = 0
        total_runs = 0
        print ('..........5.........')
        for epoch in range(hm_epochs):
            epoch_loss = 0
            for data in train_data:
                total_runs += 1
                try:
                    X = data[0]
                    Y = data[1]
                    _, c = sess.run([optimizer, cost], feed_dict={x: X, y: Y})
                    epoch_loss += c
                    successful_runs += 1
                except Exception as e:
                    # I am passing for the sake of notebook space, but we are getting 1 shaping issue from one
                    # input tensor. Not sure why, will have to look into it. Guessing it's
                    # one of the depths that doesn't come to 20.
                    pass
                    #print(str(e))
            print ('..........6.........')
            print('Epoch', epoch+1, 'completed out of',hm_epochs,'loss:',epoch_loss)
            print ('..........7.........')
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
        print('Done. Finishing accuracy:')
        print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
        print('fitment percent:',successful_runs/total_runs)
print (x)
# Run this locally:
train_neural_network(x)
P.S.: The resample() and segment_lung_mask() methods can be found at link 1.
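The comment in the training loop guesses that some scans do not end up with exactly 20 slices. As a hedged illustration only, the sketch below shows one way to always reduce a volume to a fixed depth. It swaps the chunks()/mean() helpers and the manual padding branches for NumPy's np.array_split, which is not what the code above does. The function name reduce_to_fixed_depth and the random test volume are assumptions made for this example.

```python
import numpy as np

def reduce_to_fixed_depth(volume, target_slices):
    """Average neighbouring slices so the result has exactly target_slices slices.

    volume is expected to have shape (n_slices, height, width) with
    n_slices >= target_slices.
    """
    volume = np.asarray(volume, dtype=np.float32)
    if len(volume) < target_slices:
        raise ValueError("volume has fewer slices than target_slices")
    # array_split always returns target_slices groups, even when the slice
    # count is not evenly divisible (group sizes then differ by at most one).
    groups = np.array_split(volume, target_slices, axis=0)
    return np.stack([group.mean(axis=0) for group in groups])

# Hypothetical scan: 137 slices of 50 x 50 pixels.
scan = np.random.rand(137, 50, 50)
fixed = reduce_to_fixed_depth(scan, 20)
print(fixed.shape)
```

With this approach every patient yields the same depth, so no special cases for HM_SLICES-1, HM_SLICES+1 and so on are needed.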

For training you have
for data in train_data:
    total_runs += 1
    try:
        X = data[0]
        Y = data[1]
        _, c = sess.run([optimizer, cost], feed_dict={x: X, y: Y})
So x and y are, respectively, the first two elements of a single row of train_data.
However, when calculating the accuracy you have
print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
So x is the first element of all rows of validation_data, which gives it dimensions of (20,310,310), which can't be broadcast to a placeholder of dimension (20). Ditto for y. (Broadcasting means that if you gave it a tensor of dimensions (20, 310) it would know to take each of the 310 columns and feed it to the placeholder separately. It can't figure out what to do with a tensor of (20, 310, 310).)
Incidentally, when you declare your placeholders it's a good idea to specify their dimensions, using None for the dimension that depends on the number of examples (the batch dimension). This way the program can warn you when dimensions don't match up.
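To make the point about declared dimensions concrete, here is a minimal NumPy-only sketch of the kind of check an explicitly shaped placeholder gives you. The helper name stack_batch and the example shape (20, 50, 50) are assumptions for illustration and are not part of the question's code. The leading axis produced by np.stack plays the role of the None dimension.

```python
import numpy as np

def stack_batch(samples, expected_shape):
    """Stack per-example arrays into one batch, failing early on a shape mismatch.

    expected_shape is the shape of a single example. The result has shape
    (len(samples),) + expected_shape, which is what a placeholder declared
    with a leading None dimension would accept.
    """
    expected_shape = tuple(expected_shape)
    for index, sample in enumerate(samples):
        shape = np.shape(sample)
        if shape != expected_shape:
            raise ValueError(
                "sample %d has shape %s, expected %s" % (index, shape, expected_shape)
            )
    return np.stack(samples)

# Hypothetical data: four examples of 20 slices, each 50 x 50 pixels.
examples = [np.zeros((20, 50, 50), dtype=np.float32) for _ in range(4)]
batch = stack_batch(examples, (20, 50, 50))
print(batch.shape)  # (4, 20, 50, 50)
```

Running a check like this on train_data and validation_data before feeding them would point at the offending example directly instead of failing inside feed_dict.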

The error message seems to indicate that the placeholder tensors x and y have not been defined correctly. They should have the same shape as the input values X = data[0] and Y = data[1], such as
x = tf.placeholder(shape=[20,310,310], dtype=tf.float32)
# if y is a scalar:
y = tf.placeholder(shape=[], dtype=tf.float32)

Misunderstanding of LSTM time-based data preparation

I am trying to replicate Chevalier's LSTM Human Activity Recognition algorithm and came across a problem when I realized that my methods did not match that of the algorithm. As a follow-up from this question, I was able to produce a result for load_X by this method:
In[0]:
def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        with open(signal_type_path, 'r') as csvfile:
            reader = csv.reader(csvfile)
            next(reader)
            for serie in [row[1:2] for row in reader]:
                #X_signals.append([np.array([row[1:2] for row in reader],dtype=np.float32) for row in reader])
                X_signals.append(np.array(serie, dtype=np.int32))
        file.close()
    return (np.transpose(np.transpose(X_signals), (1, 0)))
X_train_signals_paths = [
    DATASET_PATH + TRAIN + signal + "_train.csv" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + signal + "_test.csv" for signal in INPUT_SIGNAL_TYPES
]
X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
print(X_train)
Out[0]:
[[ 6]
[ 6]
...,
[13]
[13]
[13]]
However I looked over Chevalier's methods a little more and I observed something interesting when I did len(X_train[0]) and len(X_train[0][0]). It seems the way I formatted my x-values is much different than how Chevalier's x-values are. My original CSV file can be found here and the original txt file for Chevalier's X_train can be found here. The following is Chevalier's code for comparison to mine:
def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()
    return np.transpose(np.array(X_signals), (1, 2, 0))
X_train_signals_paths = [
    DATASET_PATH + TRAIN + "Inertial Signals/" + signal + "train.txt" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + "Inertial Signals/" + signal + "test.txt" for signal in INPUT_SIGNAL_TYPES
]
X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
The following is from Chevalier's "Additional Parameters" section and is the main reason for my confusion:
training_data_count = len(X_train) # 7352 training series (with 50% overlap between each serie)
test_data_count = len(X_test) # 2947 testing series
n_steps = len(X_train[0]) # 128 timesteps per series
n_input = len(X_train[0][0]) # 9 input parameters per timestep
What I observe is that this 50% overlap means that the separate evaluated time intervals are overlapping like 0-64, 32-96, 64-128, 96-etc. One fact that I do know is that 7352 is the number of rows in X_train.txt. The [0] and [0][0] mean that it is selecting the 0th column of the X_train array and the 0th column and 0th row of X_train respectively. What my code is currently doing is transposing each of my data points individually. That is why when I evaluated len(X_train[0]) I received a 1 and with len(X_train[0][0]) I received an error:
TypeError Traceback (most recent call last)
<ipython-input-255-14523e544e49> in <module>()
2 test_data_count = len(list(X_test))
3 n_steps = len(X_train[0])
----> 4 n_input = len(list(X_train)[0][0])
5 print(training_data_count, test_data_count, n_steps, n_input)
TypeError: object of type 'numpy.int32' has no len()
I am wondering what I should do to reformat my data to match the intended formatting of Chevalier in the txt file? What do the numbers in the "Additional Parameters" section of the Chevalier's git mean and how can I tailor them to my current model?
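For reference, the layout described in the "Additional Parameters" comments is (series, timesteps, inputs). The sketch below is only an illustration of how a single-column signal could be cut into that layout. It is not Chevalier's code, and it assumes windows of 128 samples with a step of 64 (which is what 50% overlap works out to). The function name make_windows and the synthetic readings are hypothetical.

```python
import numpy as np

def make_windows(signal, n_steps=128, overlap=0.5):
    """Cut a (n_samples, n_input) signal into overlapping fixed-length series.

    Returns an array of shape (n_series, n_steps, n_input), so that
    len(result[0]) == n_steps and len(result[0][0]) == n_input.
    """
    signal = np.asarray(signal, dtype=np.float32)
    if signal.ndim == 1:
        signal = signal[:, np.newaxis]  # a single column becomes n_input == 1
    step = int(n_steps * (1 - overlap))  # 64 for 128 steps at 50% overlap
    if step < 1:
        raise ValueError("overlap must be less than 1")
    starts = range(0, len(signal) - n_steps + 1, step)
    windows = [signal[start:start + n_steps] for start in starts]
    if not windows:
        raise ValueError("signal is shorter than one window")
    return np.stack(windows)

readings = np.arange(512)  # hypothetical single-column signal
X = make_windows(readings)
print(X.shape)  # (7, 128, 1)
```

With this layout, len(X) is the number of series, len(X[0]) is n_steps and len(X[0][0]) is n_input, so the three len() calls from the "Additional Parameters" section work without a TypeError.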

theano csv to pkl file - python

I came across this while looking for a similar error I was getting, and I am posting a reply in case it helps someone else. For me the error was resolved when I changed n_out from 1 to 2 in the dbn_test() parameter list: n_out is the number of labels (classes), not the number of output layers.
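As a small illustration of that fix, n_out can be derived from the labels instead of being hard-coded. This is only a sketch: it assumes the labels are class ids stored as floats, as in the question's script, and the helper name count_classes and the sample labels are made up for the example.

```python
def count_classes(labels):
    """Return the number of distinct class labels."""
    return len(set(int(label) for label in labels))

labels = [0.0, 1.0, 1.0, 0.0, 1.0]  # hypothetical binary labels
n_out = count_classes(labels)
print(n_out)  # 2
```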
