I am working with a 4-D array that is the input to a CNN. The input array has the following shape:
print('X_train shape: ', X_train.shape)
X_train shape: (47204, 1, 100, 4)
Data description:
The input data consists of 47204 instances (fixed-length segments, as the CNN requires). Each instance has shape (1, 100, 4): one segment contains 100 GPS points, and for each point 4 kinematic values (max_speed, avg_speed, max_acc, avg_acc) are stored. Labels are stored in a separate y_train array of shape (47204,), with 5 classes [0..4].
print(y_train)
[3 3 0 ... 2 3 4]
To get a better sense of my X_train array, I show the first 3 elements below:
print(X_train[:3])
[
[[[ 3.82280987e+00 2.16802350e-01 7.49917451e-02 3.44416369e-04]
[ 3.38707371e+00 2.02210055e-01 1.61751110e-03 1.93745950e-03]
[ 2.49202215e+00 1.60605262e-01 8.43561351e-03 2.40057917e-03]
...
[ 2.00022316e+00 2.70020923e-01 5.40441673e-02 3.57212151e-03]
[ 3.25199744e-01 9.06990382e-02 1.46808316e-02 1.65841315e-03]
[2.96587589e-01 0.00000000e+00 6.13293351e-04 4.16518187e-03]]]
[[[ 1.07209176e+00 7.27038312e-02 6.62777026e-03 2.04611951e-04]
[ 1.06194285e+00 5.05005456e-02 4.05676569e-03 3.72293433e-04]
[ 1.02849748e+00 2.12558178e-02 2.95477005e-03 5.56584054e-04]
...
[ 4.51962909e-03 5.63125736e-04 5.98474074e-04 1.63036715e-05]
[ 2.83026181e-03 2.35855075e-03 1.25789358e-03 2.15331510e-06]
[8.49078543e-03 2.16840434e-19 9.43423077e-04 1.29198906e-05]]]
[[[ 7.51127665e+00 3.14033478e-01 6.85170617e-02 7.73415075e-04]
[ 7.42307262e+00 1.33868251e-01 4.10564823e-02 1.16131460e-03]
[ 7.35818066e+00 1.23886976e-02 3.02312582e-02 1.28312101e-03]
...
[ 7.40826167e+00 1.19388656e-01 4.00874715e-02 2.04909489e-04]
[ 7.23779176e+00 1.33269965e-01 1.20430502e-02 1.58195900e-04]
[ 7.11697001e+00 4.68002105e-02 5.42478400e-02 3.58101318e-05]]]
]
Task:
I am required to create a machine learning model (e.g. a random forest) using the 4 kinematics (max_speed, avg_speed, max_acc, avg_acc) as features. This requires going through each instance and extracting these features for each of the 100 points in the instance.
Clearly, the number of samples will then be 4720400 (i.e. 47204 x 100), and each sample should carry the label of the instance it came from, i.e. y_train will then have shape (4720400,).
The expected input would then be like:
max_speed avg_speed max_acc avg_acc class
0 3.82280987e+00 2.16802350e-01 7.49917451e-02 3.44416369e-04 3
1 3.38707371e+00 2.02210055e-01 1.61751110e-03 1.93745950e-03 3
2 2.49202215e+00 1.60605262e-01 8.43561351e-03 2.40057917e-03 3
...
I have been thinking about how to do this all week, but every idea has evaporated. How do I do this, please?
You can reshape your X_train array from (47204, 1, 100, 4) to (4720400, 4) simply with:
X_train_reshaped = X_train.reshape(4720400, 4)
It preserves the data order and the total number of elements will be the same.
Similarly, you can expand the y_train array using numpy.repeat:
y_train_reshaped = numpy.repeat(y_train, 100)
Note the 100 passed to repeat. Since you had one label per 100 data points, each label is repeated 100 times. repeat also preserves order, so every point keeps the label of the instance it came from.
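To convince yourself that the rows and labels stay aligned, here is a minimal sketch on a toy array. The names X_small and y_small and the sizes (3 instances, 5 points each) are made up for illustration; only reshape and repeat come from the answer.

```python
import numpy as np

# Toy stand-in for X_train / y_train: 3 instances, 5 points each, 4 features
n_instances, n_points, n_features = 3, 5, 4
X_small = np.arange(n_instances * n_points * n_features, dtype=float).reshape(
    n_instances, 1, n_points, n_features
)
y_small = np.array([3, 0, 2])

# -1 lets NumPy infer the row count (n_instances * n_points)
X_flat = X_small.reshape(-1, n_features)
# Repeat each label once per point of its instance
y_flat = np.repeat(y_small, n_points)

# Row i of X_flat comes from instance i // n_points, point i % n_points
print(X_flat.shape, y_flat.shape)
print(y_flat)
```

Using reshape(-1, 4) instead of the hard-coded 4720400 means the same line keeps working if the number of instances changes.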
I am trying to reproduce the code from a TensorFlow tutorial, which is supposed to download a CSV file and preprocess the data (up to combining the numerical columns together).
The reproducible example goes as follows:
import pandas as pd
import tensorflow as tf
print("TF version is: {}".format(tf.__version__))
# Download data
train_url = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
test_url = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_path = tf.keras.utils.get_file("train.csv", train_url)
test_path = tf.keras.utils.get_file("test.csv", test_url)
# Get data into batched dataset
def get_dataset(path):
    dataset = tf.data.experimental.make_csv_dataset(path
        ,batch_size=5
        ,num_epochs=1
        ,label_name='survived'
        ,na_value='?'
        ,ignore_errors=True)
    return dataset
raw_train_dataset = get_dataset(train_path)
raw_test_dataset = get_dataset(test_path)
# Define numerical and categorical column lists
def get_df_batch(dataset):
    for batch, label in dataset.take(1):
        df = pd.DataFrame()
        df['survived'] = label.numpy()
        for key, value in batch.items():
            df[key] = value.numpy()
    return df
dfb = get_df_batch(raw_train_dataset)
num_columns = [i for i in dfb if (dfb[i].dtype != 'O' and i!='survived')]
cat_columns = [i for i in dfb if dfb[i].dtype == 'O']
# Combine numerical columns into one `numerics` column
class Pack():
    def __init__(self, names):
        self.names = names
    def __call__(self, features, labels):
        num_features = [features.pop(name) for name in self.names]
        num_features = [tf.cast(feat, tf.float32) for feat in num_features]
        num_features = tf.stack(num_features, axis=1)
        features["numerics"] = num_features
        return features, labels
packed_train = raw_train_dataset.map(Pack(num_columns))
# Show what we got
def show_batch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key, value.numpy()))
show_batch(packed_train)
TF version is: 2.0.0
sex : [b'female' b'female' b'male' b'male' b'male']
class : [b'Third' b'First' b'Second' b'First' b'Third']
deck : [b'unknown' b'E' b'unknown' b'C' b'unknown']
embark_town : [b'Queenstown' b'Cherbourg' b'Southampton' b'Cherbourg' b'Queenstown']
alone : [b'n' b'n' b'y' b'n' b'n']
numerics : [[ 28. 1. 0. 15.5 ]
[ 40. 1. 1. 134.5 ]
[ 32. 0. 0. 10.5 ]
[ 49. 1. 0. 89.1042]
[ 2. 4. 1. 29.125 ]]
Then I try, and fail, to combine the numeric features in a functional way:
#tf.function
def pack_func(row, num_columns=num_columns):
    features, labels = row
    num_features = [features.pop(name) for name in num_columns]
    num_features = [tf.cast(feat, tf.float32) for feat in num_features]
    num_features = tf.stack(num_features, axis=1)
    features['numerics'] = num_features
    return features, labels
packed_train = raw_train_dataset.map(pack_func)
Partial traceback:
ValueError: in converted code:
:3 pack_func *
features, labels = row
ValueError: too many values to unpack (expected 2)
Two questions here:
How do features and labels get assigned in def __call__(self, features, labels): in the definition of class Pack? My intuition is that they should be passed in as defined variables, but I do not understand where they get defined.
When I do
for row in raw_train_dataset.take(1):
    print(type(row))
    print(len(row))
    f, l = row
    print(f)
    print(l)
I see that row in raw_train_dataset is a tuple of length 2, which can be successfully unpacked into features and labels. Why can't the same be done via the map API? Can you suggest the right way of combining numerical features in a functional way?
Many thanks in advance!!!
After some research and trial the answer to the second question seems to be:
def pack_func(features, labels, num_columns=num_columns):
    num_features = [features.pop(name) for name in num_columns]
    num_features = [tf.cast(feat, tf.float32) for feat in num_features]
    num_features = tf.stack(num_features, axis=1)
    features['numerics'] = num_features
    return features, labels
packed_train = raw_train_dataset.map(pack_func)
show_batch(packed_train)
sex : [b'male' b'male' b'male' b'female' b'male']
class : [b'Third' b'Third' b'Third' b'First' b'Third']
deck : [b'unknown' b'unknown' b'unknown' b'E' b'unknown']
embark_town : [b'Southampton' b'Southampton' b'Queenstown' b'Cherbourg' b'Queenstown']
alone : [b'y' b'n' b'n' b'n' b'y']
numerics : [[24. 0. 0. 8.05 ]
[14. 5. 2. 46.9 ]
[ 2. 4. 1. 29.125 ]
[39. 1. 1. 83.1583]
[21. 0. 0. 7.7333]]
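This also seems to answer the first question. When the elements of a dataset are tuples, Dataset.map passes the tuple components to the mapped function as separate positional arguments, so a function (or a callable object like Pack) has to accept features and labels as two parameters rather than one row. Below is a minimal sketch on a toy in-memory dataset, assuming TensorFlow 2.x with eager execution; the column names age and fare and the values are made up for illustration.

```python
import tensorflow as tf

# Toy dataset whose elements are (features_dict, label) tuples, the same
# structure make_csv_dataset yields when label_name is set.
features = {"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.92]}
labels = [0, 1, 1]
ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(3)

def pack(features, labels):
    # map() calls this with the two tuple components as two arguments
    columns = [tf.cast(features[name], tf.float32) for name in ("age", "fare")]
    return {"numerics": tf.stack(columns, axis=1)}, labels

packed = ds.map(pack)
batch_features, batch_labels = next(iter(packed))
numerics_shape = tuple(batch_features["numerics"].shape)
print(numerics_shape)
```

Iterating over the dataset in Python yields the whole tuple as one object, which is why f, l = row works in a for loop while the single-argument pack_func fails inside map.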
I want a TensorFlow function that accepts a 3D matrix and an array (whose length equals the first dimension of the 3D matrix) and slices elements from each 2D matrix inside the 3D matrix according to that array. The equivalent NumPy code is shown below. The basic idea is to pick all the hidden states of each input in a batch (skipping the padded ones) in a dynamic RNN.
import numpy as np
import tensorflow as tf
a = np.random.uniform(-1,1,(3,5,7))
a_length = np.random.randint(5,size=(3))
a_tf = tf.convert_to_tensor(a)
a_length_tf = tf.convert_to_tensor(a_length)
res = []
for index, length_ in enumerate(a_length):
    res.extend(a[index,:length_,:])
res = np.array(res)
Output
print(a_length)
array([1, 4, 4])
print(res)
array([[-0.060161 , 0.36000953, 0.46160677, -0.66576281, 0.28562044,
-0.60026872, 0.08034777],
[ 0.04776443, 0.38018207, -0.73352382, 0.61847258, -0.89731857,
0.57264147, -0.88192537],
[ 0.92657628, 0.6236141 , 0.41977008, 0.88720247, 0.44639323,
0.26165976, 0.2678753 ],
[-0.78125831, 0.76756136, -0.05716537, -0.64696257, 0.48918477,
0.15376225, -0.41974593],
[-0.625326 , 0.3509537 , -0.7884495 , 0.11773297, 0.23713942,
0.30296786, 0.12932378],
[ 0.88413986, -0.10958306, 0.9745586 , 0.8975006 , 0.23023047,
-0.89991669, -0.60032688],
[ 0.33462775, 0.62883724, -0.81839566, -0.70312966, -0.00246936,
-0.95542994, -0.33035891],
[-0.26355579, -0.58104982, -0.54748412, -0.30236209, -0.74270132,
0.46329941, 0.34277915],
[ 0.92837516, -0.06748299, 0.32837354, -0.62863672, 0.86226447,
0.63604586, 0.0905248 ]])
print(a)
array([[[-0.060161 , 0.36000953, 0.46160677, -0.66576281,
0.28562044, -0.60026872, 0.08034777],
[ 0.26379226, 0.67066755, -0.90139221, -0.86862163,
0.36405595, 0.71342926, -0.1265208 ],
[ 0.15007877, 0.82065234, 0.03984378, -0.20038364,
-0.09945102, 0.71605241, -0.55865999],
[ 0.27132257, -0.84289149, -0.15493576, 0.74683429,
-0.71159896, 0.50397217, -0.99025404],
[ 0.51546368, 0.45460343, 0.87519031, 0.0332339 ,
-0.53474897, -0.01733648, -0.02886814]],
[[ 0.04776443, 0.38018207, -0.73352382, 0.61847258,
-0.89731857, 0.57264147, -0.88192537],
[ 0.92657628, 0.6236141 , 0.41977008, 0.88720247,
0.44639323, 0.26165976, 0.2678753 ],
[-0.78125831, 0.76756136, -0.05716537, -0.64696257,
0.48918477, 0.15376225, -0.41974593],
[-0.625326 , 0.3509537 , -0.7884495 , 0.11773297,
0.23713942, 0.30296786, 0.12932378],
[ 0.44550219, -0.38828221, 0.35684203, 0.789946 ,
-0.8763921 , 0.90155917, -0.75549455]],
[[ 0.88413986, -0.10958306, 0.9745586 , 0.8975006 ,
0.23023047, -0.89991669, -0.60032688],
[ 0.33462775, 0.62883724, -0.81839566, -0.70312966,
-0.00246936, -0.95542994, -0.33035891],
[-0.26355579, -0.58104982, -0.54748412, -0.30236209,
-0.74270132, 0.46329941, 0.34277915],
[ 0.92837516, -0.06748299, 0.32837354, -0.62863672,
0.86226447, 0.63604586, 0.0905248 ],
[ 0.70272633, 0.17122912, -0.58209965, 0.55557024,
-0.46295566, -0.33845157, -0.62254313]]])
Here is a way to do that using tf.boolean_mask:
import tensorflow as tf
import numpy as np
# NumPy/Python implementation
a = np.random.uniform(-1,1,(3,5,7)).astype(np.float32)
a_length = np.random.randint(5,size=(3)).astype(np.int32)
res = []
for index, length_ in enumerate(a_length):
    res.extend(a[index,:length_,:])
res = np.array(res)
# TensorFlow implementation
a_tf = tf.convert_to_tensor(a)
a_length_tf = tf.convert_to_tensor(a_length)
# Make a mask for all wanted elements
mask = tf.range(tf.shape(a)[1]) < a_length_tf[:, tf.newaxis]
# Apply mask
res_tf = tf.boolean_mask(a_tf, mask)
# Test
with tf.Session() as sess:
    print(np.allclose(sess.run(res_tf), res))
Output:
True
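The mask is built purely by broadcasting, so the same idea can be checked in plain NumPy without a session. This is only a sketch of the masking step, not a replacement for the TensorFlow code; the lengths [1, 4, 4] are taken from the question's example and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, (3, 5, 7)).astype(np.float32)
a_length = np.array([1, 4, 4], dtype=np.int32)

# Reference: loop over the batch and keep the first a_length[i] rows
res_loop = np.concatenate([a[i, :n] for i, n in enumerate(a_length)], axis=0)

# Positions 0..4 with shape (1, 5) compared against lengths with shape (3, 1)
# broadcast to a (3, 5) boolean mask that is True for the rows to keep
mask = np.arange(a.shape[1])[np.newaxis, :] < a_length[:, np.newaxis]
res_mask = a[mask]

print(mask.astype(int))
print(res_mask.shape)
```

Indexing with a 2D boolean mask flattens the first two axes in row-major order, which is why the kept rows come out in the same order as in the loop.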
I have some MRI scan files with the extensions .img/.hdr/.gif. I am completely new to this.
How can I work with them? How can I look at a slice of the MRI scan in that '.img' array?
I've found
import nibabel as nib
img = nib.load('./OAS1_0001_MR1_mpr_n4_anon_111_t88_masked_gfc_fseg.img')
print(img)
Which shows
<class 'nibabel.spm2analyze.Spm2AnalyzeImage'>
data shape (176, 208, 176, 1)
affine:
[[ -1. 0. 0. 87.5]
[ 0. 1. 0. -103.5]
[ 0. 0. 1. -87.5]
[ 0. 0. 0. 1. ]]
metadata:
<class 'nibabel.spm2analyze.Spm2AnalyzeHeader'> object, endian='>'
sizeof_hdr : 348
data_type : b'\x00B\x00\x00YA\x00\x00\xe8#'
db_name : b'\x00\x00\x14#\x00\x00\x9d?\x00\x00\xbf>\x00\x005>'
extents : 16384
session_error : 0
regular : b'r'
hkey_un0 : b' '
dim : [ 4 176 208 176 1 0 0 0]
vox_units : b'mm'
cal_units : b'7\x00\x00\x006\x16'
unused1 : 0
datatype : uint8
bitpix : 8
dim_un0 : 0
pixdim : [ 0.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
1.65311180e-41 1.60658869e-41 1.55614194e-41 1.50821754e-41]
vox_offset : 0.0
scl_slope : nan
scl_inter : 0.0
funused3 : 0.0
cal_max : 0.0
cal_min : 0.0
compressed : 0
verified : 0
glmax : 3
glmin : 0
descrip : b' '
aux_file : b' '
orient : b''
origin : [8224 8224 8224 8224 8192]
generated : b' '
scannum : b' '
patient_id : b' '
exp_date : b' '
exp_time : b' '
hist_un0 : b' '
views : 0
vols_added : 0
start_field : 0
field_skip : 0
omax : 0
omin : 0
smax : 0
smin : 0
But there is nothing here that gives me the slices of the MRI scan. How can I plot it? I assume that img.get_data() should help me...
For plotting you can use
from nilearn import plotting
plotting.plot_anat(img, title="plot_anat")
and to extract the data, you are right: you can use img.get_data().
Judging by the shape of your data, (176, 208, 176, 1), it's a 3D image, so you need to specify which plane you want plotted in 2D. There are three options:
Axial (horizontal or transverse) plane: divides the body into head and tail portions (Z-axis)
Coronal (frontal) plane: divides the body into posterior and anterior portions (Y-axis)
Sagittal (longitudinal) plane: divides the body into right and left portions (X-axis)
You can find more details on the anatomical planes in Wikipedia along with a very helpful diagram.
So for your data, once you've read it as you describe, you can get out a normal array by calling
im_data = img.get_fdata() (I'm getting a deprecation warning for get_data())
The data array will have the same shape so if you want to plot a center slice along one of the planes, you just write:
import matplotlib.pyplot as plt
center_slice = im_data[:,:,88, 0]
fig, ax = plt.subplots(1,1)
ax.imshow(center_slice, cmap="gray")
plt.show()
Note that the image won't necessarily be correctly positioned (rotation, axes, etc.). Sometimes it's enough to rotate it 90 degrees or shift it manually using scipy.ndimage.shift(). The nilearn plotting source code has a comprehensive list of transformations and modifications to the default Matplotlib axes, so it is worth a look.
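For completeness, here is a rough sketch of pulling the center slice for each of the three planes. It uses a zero-filled array as a stand-in for img.get_fdata(), so it runs without the scan file, and it assumes the X/Y/Z-to-plane mapping listed above; the real orientation depends on the image's affine.

```python
import numpy as np

# Stand-in for im_data = img.get_fdata(); same shape as in the question
im_data = np.zeros((176, 208, 176, 1), dtype=np.float32)

# Drop the trailing singleton axis to get a plain 3D volume
volume = im_data[..., 0]
cx, cy, cz = (size // 2 for size in volume.shape)

sagittal = volume[cx, :, :]  # fixed X
coronal = volume[:, cy, :]   # fixed Y
axial = volume[:, :, cz]     # fixed Z, same as im_data[:, :, 88, 0]

for name, plane in (("sagittal", sagittal), ("coronal", coronal), ("axial", axial)):
    print(name, plane.shape)
```

Each of these 2D arrays can be passed to ax.imshow(..., cmap="gray") as in the snippet above; wrapping it in np.rot90 is one way to apply the 90 degree rotation mentioned in the note.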
I am trying to implement PCA in Python. My goal is to create a version that behaves like Matlab's PCA implementation. However, I think I am missing a crucial point, as some of my test results come out with the wrong sign (+/-).
Can you find a mistake in the algorithm? Why are the signs sometimes different?
An implementation of PCA based on eigen vectors:
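import numpy as np
def pca(A, new_array_rank=4):
    # Center each column (feature) on its mean
    A_mean = np.mean(A, axis=0)
    A = A - A_mean
    covariance_matrix = np.cov(A.T)
    eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
    # Sort the eigenpairs by decreasing eigenvalue
    new_index = np.argsort(eigen_values)[::-1]
    eigen_vectors = eigen_vectors[:,new_index]
    eigen_values = eigen_values[new_index]
    # Keep the leading components and project the centered data onto them
    eigen_vectors = eigen_vectors[:,:new_array_rank]
    return np.dot(eigen_vectors.T, A.T).T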
My test values:
array([[ 0.13298325, 0.2896928 , 0.53589224, 0.58164269, 0.66202221,
0.95414116, 0.03040784, 0.26290471, 0.40823539, 0.37783385],
[ 0.90521267, 0.86275498, 0.52696221, 0.15243867, 0.20894357,
0.19900414, 0.50607341, 0.53995902, 0.32014539, 0.98744942],
[ 0.87689087, 0.04307512, 0.45065793, 0.29415066, 0.04908066,
0.98635538, 0.52091338, 0.76291385, 0.97213094, 0.48815925],
[ 0.75136801, 0.85946751, 0.10508436, 0.04656418, 0.08164919,
0.88129981, 0.39666754, 0.86325704, 0.56718669, 0.76346602],
[ 0.93319721, 0.5897521 , 0.75065047, 0.63916306, 0.78810679,
0.92909485, 0.23751963, 0.87552313, 0.37663086, 0.69010429],
[ 0.53189229, 0.68984247, 0.46164066, 0.29953259, 0.10826334,
0.47944168, 0.93935082, 0.40331874, 0.18541041, 0.35594587],
[ 0.36399075, 0.00698617, 0.61030608, 0.51136309, 0.54185601,
0.81383604, 0.50003674, 0.75414875, 0.54689801, 0.9957493 ],
[ 0.27815017, 0.65417397, 0.57207255, 0.54388744, 0.89128334,
0.3512483 , 0.94441934, 0.05305929, 0.77389942, 0.93125228],
[ 0.80409485, 0.2749575 , 0.22270875, 0.91869706, 0.54683128,
0.61501493, 0.7830902 , 0.72055598, 0.09363186, 0.05103846],
[ 0.12357816, 0.29758902, 0.87807485, 0.94348706, 0.60896429,
0.33899019, 0.36310027, 0.02380186, 0.67207071, 0.28638936]])
My result of the PCA with eigen vectors:
array([[ 5.09548931e-01, -3.97079651e-01, -1.47555867e-01,
-3.55343967e-02, -4.92125732e-01, -1.78191399e-01,
-3.29543974e-02, 3.71406504e-03, 1.06404170e-01,
-1.66533454e-16],
[ -5.15879041e-01, 6.40833419e-01, -7.54601587e-02,
-2.00776798e-01, -7.07247669e-02, 2.68582368e-01,
-1.66124362e-01, 1.03414828e-01, 7.76738500e-02,
5.55111512e-17],
[ -4.42659342e-01, -5.13297786e-01, -1.65477203e-01,
5.33670847e-01, 2.00194213e-01, 2.06176265e-01,
1.31558875e-01, -2.81699724e-02, 6.19571305e-02,
-8.32667268e-17],
[ -8.50397468e-01, 5.14319846e-02, -1.46289906e-01,
6.51133920e-02, -2.83887201e-01, -1.90516618e-01,
1.45748370e-01, 9.49464768e-02, -1.05989648e-01,
4.16333634e-17],
[ -1.61040296e-01, -3.47929944e-01, -1.19871598e-01,
-6.48965493e-01, 7.53188055e-02, 1.31730340e-01,
1.33229858e-01, -1.43587499e-01, -2.20913989e-02,
-3.40005801e-16],
[ -1.70017435e-01, 4.22573148e-01, 4.81511942e-01,
2.42170125e-01, -1.18575764e-01, -6.87250591e-02,
-1.20660307e-01, -2.22865482e-01, -1.73666882e-02,
-1.52655666e-16],
[ 6.90841779e-02, -2.86233901e-01, -4.16612350e-01,
9.38935057e-03, 3.02325120e-01, -1.61783482e-01,
-3.55465509e-01, 1.15323059e-02, -5.04619674e-02,
4.71844785e-16],
[ 5.26189089e-01, 6.81324113e-01, -2.89960115e-01,
2.01781673e-02, 3.03159463e-01, -2.11777986e-01,
2.25937548e-01, -5.49219872e-05, 3.66268329e-02,
-1.11022302e-16],
[ 6.68680313e-02, -2.99715813e-01, 8.53428694e-01,
-1.30066853e-01, 2.31410283e-01, -1.02860624e-01,
1.95449586e-02, 1.30218425e-01, 1.68059569e-02,
2.22044605e-16],
[ 9.68303353e-01, 4.80944309e-02, 2.62865615e-02,
1.44821658e-01, -1.47094421e-01, 3.07366196e-01,
1.91849667e-02, 5.08517759e-02, -1.03558238e-01,
1.38777878e-16]])
Test result of the same data using Matlab's PCA function:
array([[ -5.09548931e-01, 3.97079651e-01, 1.47555867e-01,
3.55343967e-02, -4.92125732e-01, -1.78191399e-01,
-3.29543974e-02, -3.71406504e-03, -1.06404170e-01,
-0.00000000e+00],
[ 5.15879041e-01, -6.40833419e-01, 7.54601587e-02,
2.00776798e-01, -7.07247669e-02, 2.68582368e-01,
-1.66124362e-01, -1.03414828e-01, -7.76738500e-02,
-0.00000000e+00],
[ 4.42659342e-01, 5.13297786e-01, 1.65477203e-01,
-5.33670847e-01, 2.00194213e-01, 2.06176265e-01,
1.31558875e-01, 2.81699724e-02, -6.19571305e-02,
-0.00000000e+00],
[ 8.50397468e-01, -5.14319846e-02, 1.46289906e-01,
-6.51133920e-02, -2.83887201e-01, -1.90516618e-01,
1.45748370e-01, -9.49464768e-02, 1.05989648e-01,
-0.00000000e+00],
[ 1.61040296e-01, 3.47929944e-01, 1.19871598e-01,
6.48965493e-01, 7.53188055e-02, 1.31730340e-01,
1.33229858e-01, 1.43587499e-01, 2.20913989e-02,
-0.00000000e+00],
[ 1.70017435e-01, -4.22573148e-01, -4.81511942e-01,
-2.42170125e-01, -1.18575764e-01, -6.87250591e-02,
-1.20660307e-01, 2.22865482e-01, 1.73666882e-02,
-0.00000000e+00],
[ -6.90841779e-02, 2.86233901e-01, 4.16612350e-01,
-9.38935057e-03, 3.02325120e-01, -1.61783482e-01,
-3.55465509e-01, -1.15323059e-02, 5.04619674e-02,
-0.00000000e+00],
[ -5.26189089e-01, -6.81324113e-01, 2.89960115e-01,
-2.01781673e-02, 3.03159463e-01, -2.11777986e-01,
2.25937548e-01, 5.49219872e-05, -3.66268329e-02,
-0.00000000e+00],
[ -6.68680313e-02, 2.99715813e-01, -8.53428694e-01,
1.30066853e-01, 2.31410283e-01, -1.02860624e-01,
1.95449586e-02, -1.30218425e-01, -1.68059569e-02,
-0.00000000e+00],
[ -9.68303353e-01, -4.80944309e-02, -2.62865615e-02,
-1.44821658e-01, -1.47094421e-01, 3.07366196e-01,
1.91849667e-02, -5.08517759e-02, 1.03558238e-01,
-0.00000000e+00]])
The sign and other normalization choices for eigenvectors are arbitrary. Matlab and NumPy normalize the eigenvectors in the same way, but the sign is arbitrary and can depend on details of the linear algebra library that is used.
When I wrote the NumPy equivalent of Matlab's princomp, I simply normalized the sign of the eigenvectors before comparing them to Matlab's in my unit tests.
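One way to do that normalization is sketched below. The convention used here (flip each eigenvector so that its largest-magnitude entry is positive) is just an assumption chosen to make results comparable; any fixed rule applied to both sides works. The helper name normalize_signs is made up, and the sketch uses np.linalg.eigh, which is intended for symmetric matrices such as a covariance matrix, instead of the eig call in the question.

```python
import numpy as np

def normalize_signs(vectors):
    """Flip each column so that its largest-magnitude entry is positive."""
    cols = np.arange(vectors.shape[1])
    pivot_rows = np.argmax(np.abs(vectors), axis=0)
    signs = np.sign(vectors[pivot_rows, cols])
    signs[signs == 0] = 1.0  # leave all-zero columns untouched
    return vectors * signs

rng = np.random.default_rng(0)
A = rng.random((10, 10))
A_centered = A - A.mean(axis=0)
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(A_centered, rowvar=False))

# Another solver could just as well have returned some columns negated
flipped = eigen_vectors * np.array([1.0, -1.0] * 5)

same = np.allclose(normalize_signs(eigen_vectors), normalize_signs(flipped))
print(same)
```

Since the projected data is the centered data times the eigenvectors, flipping a column of eigenvectors flips the matching column of the result, which is the pattern visible in the two outputs above.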