Tensorflow dataset splitting does not work - python

I recently tried to use the tf.data API. I created an image dataset and need to split it into train/val/test. I'm using the method below with ds.take and ds.skip, but train_ds always comes out correctly while test_ds and val_ds contain no data.
DATASET_SIZE = 2000
train_size = int(0.7 * DATASET_SIZE) # 1400
val_size = int(0.15 * DATASET_SIZE) # 300
test_size = int(0.15 * DATASET_SIZE) # 300
train_ds = ds.take(train_size)
val_ds = ds.skip(train_size).take(val_size)
test_ds = ds.skip(train_size+val_size).take(test_size)
When I run the below:
for image, label in train_ds.take(1):
    print("Image shape: ", image.shape)
    print("Label: ", label.numpy())
I see the output as:
Image shape: (32, 400, 400, 3)
Label: [39 23 21 27 28 18 28 30 28 44 34 37 21 39 35 26 48 37 41 30 22 36 46 28
34 38 33 32 36 35 25 24]
But if I use test_ds.take(1) or val_ds.take(1) in the loop above, there is no output; test_ds and val_ds seem to be empty datasets. Also, when I later pass val_ds to my model.fit() function, I don't see val_loss because of this.
I could use other techniques that would work for me, but I want to understand the reason: what am I doing wrong here?
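One thing to check, based on the output above (this is an assumption inferred from the printed shape, not something shown in your code): the image shape (32, 400, 400, 3) suggests ds was already batched with a batch size of 32, so it yields only about 2000/32 ≈ 63 elements. take and skip then count batches, not samples, and ds.skip(1400) skips past the end of a ~63-batch dataset, leaving val_ds and test_ds empty. A minimal sketch of splitting before batching:
# assuming `ds` yields individual (image, label) examples, *before* .batch()
train_ds = ds.take(train_size)                             # 1400 samples
val_ds = ds.skip(train_size).take(val_size)                # next 300 samples
test_ds = ds.skip(train_size + val_size).take(test_size)   # last 300 samples

# batch each split afterwards
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
test_ds = test_ds.batch(32)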

Related

Pre-processing single feature containing different scales

How do I preprocess this data containing a single feature with different scales? This will then be used for supervised machine learning classification.
Data
import pandas as pd
import numpy as np
np.random.seed(4)
df_eur_jpy = pd.DataFrame({"value": np.random.default_rng().uniform(0.07, 3.85, 50)})
df_usd_cad = pd.DataFrame({"value": np.random.default_rng().uniform(0.0004, 0.02401, 50)})
df_usd_cad["ticker"] = "usd_cad"
df_eur_jpy["ticker"] = "eur_jpy"
df = pd.concat([df_eur_jpy,df_usd_cad],axis=0)
df.head(1)
value ticker
0 0.161666 eur_jpy
We can see the different tickers contain data with a different scale when looking at the max/min of this groupby:
df.groupby("ticker")["value"].agg(['min', 'max'])
min max
ticker
eur_jpy 0.079184 3.837519
usd_cad 0.000405 0.022673
I have many tickers in my real data and would like to combine all of these into one feature (pandas column) to use with an estimator in scikit-learn for supervised machine learning classification.
If I understand correctly (IIUC), you can use the min-max scaling formula:
x_scaled = (x - min(x)) / (max(x) - min(x))
You can apply this formula to your dataframe with sklearn's sklearn.preprocessing.MinMaxScaler like below:
from sklearn.preprocessing import MinMaxScaler
df2 = df.pivot(columns='ticker', values='value')
# ticker eur_jpy usd_cad
# 0 3.204568 0.021455
# 1 1.144708 0.013810
# ...
# 48 1.906116 0.002058
# 49 1.136424 0.022451
df2[['min_max_scl_eur_jpy', 'min_max_scl_usd_cad']] = MinMaxScaler().fit_transform(df2[['eur_jpy', 'usd_cad']])
print(df2)
Output:
ticker eur_jpy usd_cad min_max_scl_eur_jpy min_max_scl_usd_cad
0 3.204568 0.021455 0.827982 0.896585
1 1.144708 0.013810 0.264398 0.567681
2 2.998154 0.004580 0.771507 0.170540
3 1.916517 0.003275 0.475567 0.114361
4 0.955089 0.009206 0.212517 0.369558
5 3.036463 0.019500 0.781988 0.812471
6 1.240505 0.006575 0.290608 0.256373
7 1.224260 0.020711 0.286163 0.864584
8 3.343022 0.020564 0.865864 0.858280
9 2.710383 0.023359 0.692771 0.978531
10 1.218328 0.008440 0.284540 0.336588
11 2.005472 0.022898 0.499906 0.958704
12 2.056680 0.016429 0.513916 0.680351
13 1.010388 0.005553 0.227647 0.212368
14 3.272408 0.000620 0.846543 0.000149
15 2.354457 0.018608 0.595389 0.774092
16 3.297936 0.017484 0.853528 0.725720
17 2.415297 0.009618 0.612035 0.387285
18 0.439263 0.000617 0.071386 0.000000
19 3.335262 0.005988 0.863740 0.231088
20 2.767412 0.013357 0.708375 0.548171
21 0.830678 0.013824 0.178478 0.568255
22 1.056041 0.007806 0.240138 0.309336
23 1.497400 0.023858 0.360896 1.000000
24 0.629698 0.014088 0.123489 0.579604
25 3.758559 0.020663 0.979556 0.862509
26 0.964214 0.010302 0.215014 0.416719
27 3.680324 0.023647 0.958150 0.990918
28 3.169445 0.017329 0.818372 0.719059
29 1.898905 0.017892 0.470749 0.743299
30 3.322663 0.020508 0.860293 0.855869
31 2.735855 0.010578 0.699741 0.428591
32 2.264645 0.017853 0.570816 0.741636
33 2.613166 0.021359 0.666173 0.892456
34 1.976168 0.001568 0.491888 0.040928
35 3.076169 0.013663 0.792852 0.561335
36 3.330470 0.013048 0.862429 0.534891
37 3.600527 0.012340 0.936318 0.504426
38 0.653994 0.008665 0.130137 0.346288
39 0.587896 0.013134 0.112052 0.538567
40 0.178353 0.011326 0.000000 0.460781
41 3.727127 0.016738 0.970956 0.693658
42 1.719622 0.010939 0.421696 0.444123
43 0.460177 0.021131 0.077108 0.882665
44 3.124722 0.010328 0.806136 0.417826
45 1.011988 0.007631 0.228085 0.301799
46 3.833281 0.003896 1.000000 0.141076
47 3.289872 0.017223 0.851322 0.714495
48 1.906116 0.002058 0.472721 0.062020
49 1.136424 0.022451 0.262131 0.939465
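Since the goal is to end up with a single pandas column, a possible variant (a sketch applying the same min-max formula per ticker with groupby/transform instead of pivoting wide, using df as constructed in the question):
# scale within each ticker group, keeping the long one-column layout
df['value_scaled'] = df.groupby('ticker')['value'].transform(
    lambda s: (s - s.min()) / (s.max() - s.min()))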

ValueError: too many values to unpack

1 #coding:utf-8
2 #0 Import modules and generate the simulated dataset
3 import tensorflow as tf
4 import numpy as np
5 BATCH_SIZE = 8
6 seed = 23455
7
8 # Generate random numbers from the given seed
9 rng = np.random.RandomState(seed)
10 # Return a 32-row, 2-column random matrix: 32 samples of (volume, weight) as the input dataset
11 X = rng.rand(32,3)
12
13 Y = [[int(x0+x1<1)] for (x0,x1) in X]
14 print "X:\n",X
15 print "Y:\n",Y
16
17 #1 Define the network's inputs, parameters and outputs; define the forward pass
18 x = tf.placeholder(tf.float32, shape=(None, 2))
19 y_= tf.placeholder(tf.float32, shape=(None, 1))
20
21 w1= tf.Variable(tf.random_normal([2,3], stddev=1, seed=1))
22 w2= tf.Variable(tf.random_normal([3,1], stddev=1, seed=1))
23
24 a = tf.matmul(x,w1)
25 y = tf.matmul(a,w2)
26
27 #2 Define the loss function and the backpropagation method
28 loss = tf.reduce_mean(tf.square(y-y_))
29 train_step = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
30 #train_step = tf.train.MomentumOptimizer(0.001,0.9).minimize(loss)
31 #train_step = tf.train.AdamOptimizer(0.001).minimize(loss)
32
33 #3 Create a session and train for STEPS rounds
34 with tf.Session() as sess:
35     init_op = tf.global_variables_initializer()
36     sess.run(init_op)
37     # Print the parameter values before training
38     print "w1:\n", sess.run(w1)
39     print "w2:\n", sess.run(w2)
40     print "\n"
41
42     # Train the model
43     STEPS = 3000
44     for i in range(STEPS):
45         start = (i*BATCH_SIZE) % 32
46         end = start + BATCH_SIZE
47         sess.run(train_step, feed_dict={x: X[start:end], y_: Y[start:end]})
48         if i % 500 == 0:
49             total_loss = sess.run(loss, feed_dict={x: X, y_: Y})
50             print("After %d training step(s), loss on all data is %g" % (i, total_loss))
51
52     # Print the trained values of the variables
53     print "\n"
54     print "w1:\n", sess.run(w1)
55     print "w2:\n", sess.run(w2)
File "tf3_6.py", line 13, in
Y = [[int(x0+x1<1)] for (x0,x1) in X]
ValueError: too many values to unpack.
I don't think the code is wrong, but I still get this ValueError, so I hope you can help me figure this out. Thanks a lot.
The shape of X is (32, 3), but in your list comprehension, you are only trying to unpack 2 values:
Y = [[int(x0+x1<1)] for (x0,x1) in X]
Either change the shape of your array of rands:
X = rng.rand(32,2)
Or throw away the third rand in your list comp:
Y = [[int(x0+x1<1)] for (x0,x1, _) in X]
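Given that the comment above X in the question describes a 32-row, 2-column matrix, and the placeholder x has shape (None, 2), changing the data shape is most likely the intended fix. A quick check (a sketch using the question's seed):
import numpy as np

rng = np.random.RandomState(23455)
X = rng.rand(32, 2)  # 32 samples with 2 features, matching shape=(None, 2)
Y = [[int(x0 + x1 < 1)] for (x0, x1) in X]  # unpacking two values now works
print(np.asarray(Y).shape)  # (32, 1), matching the y_ placeholder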

Understand tensorflow slice operation

I am confused about the following code:
import tensorflow as tf
import numpy as np
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.framework import dtypes
'''
Randomly crop a tensor, then return the crop position
'''
def random_crop(value, size, seed=None, name=None):
    with ops.name_scope(name, "random_crop", [value, size]) as name:
        value = ops.convert_to_tensor(value, name="value")
        size = ops.convert_to_tensor(size, dtype=dtypes.int32, name="size")
        shape = array_ops.shape(value)
        check = control_flow_ops.Assert(
            math_ops.reduce_all(shape >= size),
            ["Need value.shape >= size, got ", shape, size],
            summarize=1000)
        shape = control_flow_ops.with_dependencies([check], shape)
        limit = shape - size + 1
        begin = tf.random_uniform(
            array_ops.shape(shape),
            dtype=size.dtype,
            maxval=size.dtype.max,
            seed=seed) % limit
        return tf.slice(value, begin=begin, size=size, name=name), begin
sess = tf.InteractiveSession()
size = [10]
a = tf.constant(np.arange(0, 100, 1))
print (a.eval())
a_crop, begin = random_crop(a, size = size, seed = 0)
print ("offset: {}".format(begin.eval()))
print ("a_crop: {}".format(a_crop.eval()))
a_slice = tf.slice(a, begin=begin, size=size)
print ("a_slice: {}".format(a_slice.eval()))
assert (tf.reduce_all(tf.equal(a_crop, a_slice)).eval() == True)
sess.close()
outputs:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
offset: [46]
a_crop: [89 90 91 92 93 94 95 96 97 98]
a_slice: [27 28 29 30 31 32 33 34 35 36]
There are two tf.slice calls:
(1) one inside the function random_crop, as tf.slice(value, begin=begin, size=size, name=name)
(2) one outside, as a_slice = tf.slice(a, begin=begin, size=size)
The parameters (value, begin and size) of these two slice operations are the same.
However, why are the printed values of a_crop and a_slice different, yet tf.reduce_all(tf.equal(a_crop, a_slice)).eval() is True?
Thanks
EDIT1
Thanks @xdurch0, I understand the first question now.
TensorFlow's random_uniform seems to behave like a random generator:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
size = [10]
np_begin = np.random.randint(0, 50, size=1)
tf_begin = tf.random_uniform(shape = [1], minval=0, maxval=50, dtype=tf.int32, seed = 0)
a = tf.constant(np.arange(0, 100, 1))
a_slice = tf.slice(a, np_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, np_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, tf_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, tf_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
sess.close()
output
a_slice: [42 43 44 45 46 47 48 49 50 51]
a_slice: [42 43 44 45 46 47 48 49 50 51]
a_slice: [41 42 43 44 45 46 47 48 49 50]
a_slice: [29 30 31 32 33 34 35 36 37 38]
The confusing thing here is that tf.random_uniform (like every random operation in TensorFlow) produces a new, different value on each evaluation call (each call to .eval() or, in general, each call to tf.Session.run). So if you evaluate a_crop you get one thing, and if you then evaluate a_slice you get a different thing, but if you evaluate tf.reduce_all(tf.equal(a_crop, a_slice)) you get True, because everything is computed in a single evaluation step, so only one random value is produced and it determines the value of both a_crop and a_slice. Another example: if you run tf.stack([a_crop, a_slice]).eval() you will get a tensor with two equal rows; again, only one random value was produced. More generally, if you call tf.Session.run with multiple tensors to evaluate, all the computations in that call will use the same random values.
As a side note, if you actually need a random value in a computation that you want to maintain for a later computation, the easiest thing would be to just retrieve it with tf.Session.run, along with any other needed computation, and feed it back later through feed_dict; or you could have a tf.Variable and store the random value there. A more advanced possibility would be to use partial_run, an experimental API that allows you to evaluate part of the computation graph and continue evaluating it later, while maintaining the same state (i.e. the same random values, among other things).
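To make this concrete, a small sketch (assuming the session and tensors from the question are still open): fetching both tensors in one run uses a single random draw, so they are guaranteed to match.
# one sess.run call -> one random `begin` shared by both computations
crop_val, slice_val = sess.run([a_crop, a_slice])
print(crop_val)                        # some 10-element crop
print(slice_val)                       # identical to crop_val
assert (crop_val == slice_val).all()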

Which activation function to use for sequence prediction in a lstm network in Keras?

I have built a network to predict 4 time series simultaneously, say [respiration, blood-pressure, pulse, spo2], using a TimeDistributed layer in Keras. Before feeding the data into the network I normalize the series with sklearn.preprocessing.StandardScaler. Here is a snapshot of the original and normalized data:
Original: Normalized:
resp sysbp pulse spo2 resp sysbp pulse spo2
18 111.5 71 97 -0.322154 -0.007753 -0.865683 0.051831
18 109.5 71 97 -0.322154 -0.067897 -0.865683 0.051831
19 122 70 97 -0.151163 0.308004 -0.922641 0.051831
18 128 72 98 -0.322154 0.488436 -0.808725 0.292901
18 125 71 96 -0.322154 0.39822 -0.865683 -0.189238
20 113 71 96 0.019828 0.037355 -0.865683 -0.189238
16 121 71 96 -0.664136 0.277932 -0.865683 -0.189238
20 119 71 97 0.019828 0.217788 -0.865683 0.051831
18 119 71.5 97 -0.322154 0.217788 -0.837204 0.051831
19 119 88 97 -0.151163 0.217788 0.102603 0.051831
16 119 88 97 -0.664136 0.217788 0.102603 0.051831
14 119 87 97 -1.006117 0.217788 0.045645 0.051831
19 119 88 98 -0.151163 0.217788 0.102603 0.292901
29 119 92 96 1.558744 0.217788 0.330435 -0.189238
The ranges before & after normalization respectively are as follows:
resp=[0,99] & [-3.4,13.5], sysbp=[0,269] & [-3.3,4.7], pulse=[0,204] & [-4.9,6.7], spo2=[0,100] & [-23.3,0.77]
I have structured the data in timesteps of length 200, set the batch size to 100, and, as we have seen above, the number of dimensions is 4. Hence, the structure of my neural network looks like:
batch_size=100
x = Input(batch_shape=(batch_size,200,4) , name='input')
mask = Masking(mask_value=0., name='input_masked')(x)
lstm1 = Bidirectional(LSTM(4, name="lstm1", dropout=0.25, recurrent_dropout=0.1, return_sequences=True, stateful=True))(mask)
output1 = TimeDistributed(Dense(4, activation='relu'), name='output1')(lstm1)
model = Model(inputs=x, outputs=output1)
optimizer = Adam(lr=0.001)
model.compile(optimizer=optimizer, loss='mean_absolute_error', metrics=['accuracy'])
history = model.fit(X_train, [y_train1, y_train2], batch_size=batch_size, epochs=500, verbose=1)
The network trains with no errors and reports an accuracy of 85%. But when I predict on my test data with the trained network, the predicted values are all positive. So after inverse scaling, the higher values are predicted nicely, but the lower values are not predicted at all; the lowest a prediction goes is the mean of that variable.
To experiment, I even tried the PReLU activation function, and some of the predicted values were negative, but it still hardly reached the lowest values. The questions I have are:
Is the structure and approach of sequence prediction correct, or am I missing something?
And is there a better activation function which I should be using?
With ReLU you should scale your values to the range 0 to 1, not just remove the mean and divide by the standard deviation, which is what you are doing.
ReLU activation function - ask yourself: what happens when you feed negative values into this activation function?
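A quick sketch of that point, using values taken from the tables in the question: ReLU zeroes out anything negative, so a standardized target below its mean can never be reproduced, while min-max scaling the raw targets keeps every true value inside ReLU's output range.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

relu = lambda z: np.maximum(z, 0.0)
print(relu(np.array([-0.865683, -0.189238, 0.051831, 1.558744])))
# -> [0.       0.       0.051831 1.558744]; the negative targets are unreachable

# Min-max scaling the raw targets instead keeps them in [0, 1]:
y = np.array([[18, 111.5, 71, 97],
              [19, 122.0, 70, 97],
              [29, 119.0, 92, 96]], dtype=float)
print(MinMaxScaler().fit_transform(y))  # every column now lies in [0, 1]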

how to implement tensorflow's next_batch for own data

In the tensorflow MNIST tutorial the mnist.train.next_batch(100) function comes very handy. I am now trying to implement a simple classification myself. I have my training data in a numpy array. How could I implement a similar function for my own data to give me the next batch?
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
Xtr, Ytr = loadData()
for it in range(1000):
    batch_x = Xtr.next_batch(100)
    batch_y = Ytr.next_batch(100)
The link you posted says: "we get a "batch" of one hundred random data points from our training set". In my example I use a global function (not a method like in your example) so there will be a difference in syntax.
In my function you'll need to pass the number of samples wanted and the data array.
Here is the correct code, which ensures samples have correct labels:
import numpy as np

def next_batch(num, data, labels):
    '''
    Return a total of `num` random samples and labels.
    '''
    idx = np.arange(0, len(data))
    np.random.shuffle(idx)
    idx = idx[:num]
    data_shuffle = [data[i] for i in idx]
    labels_shuffle = [labels[i] for i in idx]
    return np.asarray(data_shuffle), np.asarray(labels_shuffle)
Xtr, Ytr = np.arange(0, 10), np.arange(0, 100).reshape(10, 10)
print(Xtr)
print(Ytr)
Xtr, Ytr = next_batch(5, Xtr, Ytr)
print('\n5 random samples')
print(Xtr)
print(Ytr)
And a demo run:
[0 1 2 3 4 5 6 7 8 9]
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]
5 random samples
[9 1 5 6 7]
[[90 91 92 93 94 95 96 97 98 99]
[10 11 12 13 14 15 16 17 18 19]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]]
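Plugged into the training loop from the question, the function might be used like this (a sketch; x, y_ and train_step are assumed to be placeholders/ops from your own graph):
Xtr, Ytr = loadData()
for it in range(1000):
    batch_x, batch_y = next_batch(100, Xtr, Ytr)
    sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})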
In order to shuffle and sample each mini-batch, you should also track whether a sample has already been selected within the current epoch. Here is an implementation which uses the data from the answer above.
import numpy as np

class Dataset:
    def __init__(self, data):
        self._index_in_epoch = 0
        self._epochs_completed = 0
        self._data = data
        self._num_examples = data.shape[0]

    @property
    def data(self):
        return self._data

    def next_batch(self, batch_size, shuffle=True):
        start = self._index_in_epoch
        if start == 0 and self._epochs_completed == 0:
            idx = np.arange(0, self._num_examples)  # get all possible indexes
            np.random.shuffle(idx)  # shuffle indexes
            self._data = self.data[idx]  # shuffle the data
        # go to the next batch
        if start + batch_size > self._num_examples:
            self._epochs_completed += 1
            rest_num_examples = self._num_examples - start
            data_rest_part = self.data[start:self._num_examples]
            idx0 = np.arange(0, self._num_examples)  # get all possible indexes
            np.random.shuffle(idx0)  # shuffle indexes
            self._data = self.data[idx0]  # reshuffle the data for the new epoch
            start = 0
            # avoid the case where #samples is not an integer multiple of batch_size
            self._index_in_epoch = batch_size - rest_num_examples
            end = self._index_in_epoch
            data_new_part = self._data[start:end]
            return np.concatenate((data_rest_part, data_new_part), axis=0)
        else:
            self._index_in_epoch += batch_size
            end = self._index_in_epoch
            return self._data[start:end]

dataset = Dataset(np.arange(0, 10))
for i in range(10):
    print(dataset.next_batch(5))
the output is:
[2 8 6 3 4]
[1 5 9 0 7]
[1 7 3 0 8]
[2 6 5 9 4]
[1 0 4 8 3]
[7 6 2 9 5]
[9 5 4 6 2]
[0 1 8 7 3]
[9 7 8 1 6]
[3 5 2 4 0]
The first and second (third and fourth, ...) mini-batches together correspond to one whole epoch.
I use Anaconda and Jupyter.
In Jupyter if you run ?mnist you get:
File: c:\programdata\anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py
Docstring: Datasets(train, validation, test)
In the folder datasets you will find mnist.py, which contains all the methods, including next_batch.
I tried the algorithm from the answer marked above, but it didn't give me good results, so I searched on Kaggle and found a really nice algorithm which worked well. Do try it. In the algorithm below, the global variables take the input you declared above, where you read your dataset.
epochs_completed = 0
index_in_epoch = 0
num_examples = X_train.shape[0]

# for splitting out batches of data
def next_batch(batch_size):
    global X_train
    global y_train
    global index_in_epoch
    global epochs_completed

    start = index_in_epoch
    index_in_epoch += batch_size

    # when all training data has been used, it is reordered randomly
    if index_in_epoch > num_examples:
        # finished epoch
        epochs_completed += 1
        # shuffle the data
        perm = np.arange(num_examples)
        np.random.shuffle(perm)
        X_train = X_train[perm]
        y_train = y_train[perm]
        # start next epoch
        start = 0
        index_in_epoch = batch_size
        assert batch_size <= num_examples
    end = index_in_epoch
    return X_train[start:end], y_train[start:end]
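A hypothetical usage sketch, assuming X_train and y_train are the numpy arrays you read in above; the epochs_completed global lets you monitor progress:
for step in range(1000):
    batch_x, batch_y = next_batch(100)   # feed these to your model
    if step % 100 == 0:
        print("step %d, epochs completed: %d" % (step, epochs_completed))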
If you want to avoid a shape mismatch error when you run your TensorFlow session, use the function below instead of the one provided in the first solution above (https://stackoverflow.com/a/40995666/7748451):
import numpy as np

def next_batch(num, data, labels):
    '''
    Return a total of `num` random samples and labels.
    '''
    idx = np.arange(0, len(data))
    np.random.shuffle(idx)
    idx = idx[:num]
    data_shuffle = data[idx]
    labels_shuffle = labels[idx]
    # assumes `labels` is a pandas Series; reshape it into a column vector
    labels_shuffle = np.asarray(labels_shuffle.values.reshape(len(labels_shuffle), 1))
    return data_shuffle, labels_shuffle
Yet another implementation:
from typing import Tuple
import numpy as np

class BatchMaker(object):
    def __init__(self, feat: np.array, lab: np.array) -> None:
        if len(feat) != len(lab):
            raise ValueError("Expected feat and lab to have the same number of samples")
        self.feat = feat
        self.lab = lab
        self.indexes = np.arange(len(feat))
        np.random.shuffle(self.indexes)
        self.pos = 0

    # "BatchMaker, BatchMaker, make me a batch..."
    def next_batch(self, batch_size: int) -> Tuple[np.array, np.array]:
        if self.pos + batch_size > len(self.feat):
            np.random.shuffle(self.indexes)
            self.pos = 0
        batch_indexes = self.indexes[self.pos: self.pos + batch_size]
        self.pos += batch_size
        return self.feat[batch_indexes], self.lab[batch_indexes]
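A quick usage sketch with toy arrays:
feat = np.arange(20).reshape(10, 2)       # 10 samples, 2 features
lab = np.arange(10)                       # 10 matching labels
maker = BatchMaker(feat, lab)
batch_feat, batch_lab = maker.next_batch(4)
print(batch_feat.shape, batch_lab.shape)  # (4, 2) (4,)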
