I'm trying to analyse some Python code to identify where specific functions are being called and which arguments are being passed.
For instance, suppose I have an ML script that contains a call like model.fit(X_train, y_train). I want to find this line in the script, identify which object is being fit (i.e., model), and identify X_train and y_train (as well as any other arguments) as the arguments.
I'm new to AST, so I don't know how to do this in an efficient way.
So far, I've been able to locate the line in question by iterating through a list of child nodes (using ast.iter_child_nodes) until I arrive at the ast.Call object, and then calling its func.attr, which returns "fit". I can also get "X_train" and "y_train" with args.
The problem is that I have to know where it is in advance in order to do it this way, so it's not particularly useful. The idea would be for it to obtain the information I'm looking for automatically.
Additionally, I have not been able to find a way to determine that model is what is calling fit.
You can traverse the tree with ast.walk and search for ast.Call nodes whose attribute name is fit (note that ast.unparse, used below, requires Python 3.9+):
import ast

def fit_calls(tree):
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == 'fit'):
            yield {'model_obj_str': ast.unparse(node.func.value),
                   'model_obj_ast': node.func.value,
                   'args': [ast.unparse(a) for a in node.args],
                   'kwargs': {kw.arg: ast.unparse(kw.value) for kw in node.keywords}}
Test samples:
#https://www.tensorflow.org/api_docs/python/tf/keras/Model
sample_1 = """
model = tf.keras.models.Model(
inputs=inputs, outputs=[output_1, output_2])
model.compile(optimizer="Adam", loss="mse", metrics=["mae", "acc"])
model.fit(x, (y, y))
model.metrics_names
"""
sample_2 = """
optimizer = tf.keras.optimizers.SGD()
model.compile(optimizer, loss='mse', steps_per_execution=10)
model.fit(dataset, epochs=2, steps_per_epoch=10)
"""
sample_3 = """
x = np.random.random((2, 3))
y = np.random.randint(0, 2, (2, 2))
_ = model.fit(x, y, verbose=0)
"""
#https://scikit-learn.org/stable/developers/develop.html
sample_4 = """
estimator = estimator.fit(data, targets)
"""
sample_5 = """
y_predicted = SVC(C=100).fit(X_train, y_train).predict(X_test)
"""
print([*fit_calls(ast.parse(sample_1))])
print([*fit_calls(ast.parse(sample_2))])
print([*fit_calls(ast.parse(sample_3))])
print([*fit_calls(ast.parse(sample_4))])
print([*fit_calls(ast.parse(sample_5))])
Output:
[{'model_obj_str': 'model', 'model_obj_ast': <ast.Name object at 0x1007737c0>,
'args': ['x', '(y, y)'], 'kwargs': {}}]
[{'model_obj_str': 'model', 'model_obj_ast': <ast.Name object at 0x1007731f0>,
'args': ['dataset'], 'kwargs': {'epochs': '2', 'steps_per_epoch': '10'}}]
[{'model_obj_str': 'model', 'model_obj_ast': <ast.Name object at 0x100773d00>,
'args': ['x', 'y'], 'kwargs': {'verbose': '0'}}]
[{'model_obj_str': 'estimator', 'model_obj_ast': <ast.Name object at 0x100773ca0>,
'args': ['data', 'targets'], 'kwargs': {}}]
[{'model_obj_str': 'SVC(C=100)', 'model_obj_ast': <ast.Call object at 0x100773130>,
'args': ['X_train', 'y_train'], 'kwargs': {}}]
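If you also need where each call occurs (the question mentions locating the line in the script), every AST node carries lineno and col_offset attributes, so the dict can be extended; a minimal variant of the function above:
def fit_calls_with_location(tree):
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == 'fit'):
            yield {'lineno': node.lineno,  # 1-based line number of the call
                   'model_obj_str': ast.unparse(node.func.value),
                   'args': [ast.unparse(a) for a in node.args]}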
Related
For my ML project with PyTorch I am dividing my initial data set into a training and a testing set, using a custom function which ensures that all labels in the original data set exist in both the training and the testing set:
(working_indices, working_labels, testing_indices, testing_labels) = split_dataset_equally_random(
    target_labels=train_labels,
    percentage_to_split=100 - TEST_PERCENTAGE,
    random_seed=-1,
)

unique_testing_indices = set([label.detach().numpy().tolist() for label in testing_labels])
count_of_unique_testing_indices = {}
for entry in unique_testing_indices:
    count_of_unique_testing_indices[entry] = testing_labels.detach().numpy().tolist().count(entry)

testing_labels_from_indices = [train_labels[index] for index in testing_indices]
# count_of_unique_testing_indices = {i: count for (i, testing_labels.detach().numpy().tolist().count(i)) in unique_testing_indices}

print(f"Unique testing indices: {unique_testing_indices}")
print(f"Unique testing indices from labels: {set([index.detach().numpy().tolist() for index in testing_labels_from_indices])}")
print(f"Number of unique testing indices: {count_of_unique_testing_indices}")
For my current application that gives me the output
Unique testing indices: {0, 1, 2, 3}
Unique testing indices from labels: {0, 1, 2, 3}
Number of unique testing indices: {0: 2160, 1: 4104, 2: 3024, 3: 1080}
Now, for testing I use the following code:
test_loader = DataLoader(
    TensorDataset(feat, torch_labels),
    batch_size=self.Model.module_options["batch_size"],
    sampler=SubsetRandomSampler(testing_indices),
)

accuracy, predictions, prediction_distributions, actual_labels = self.Model.evaluate_model(
    test_data_loader=test_loader
)
with evaluate_model() being defined as
def evaluate_model(self, test_data_loader=None):
    """_summary_

    Args:
        test_data_loader (_type_, optional): _description_. Defaults to None.

    Returns:
        _type_: _description_
    """
    self.model.eval()
    if test_data_loader is not None:
        predictions, actuals = list(), list()
        for (inputs, targets) in test_data_loader:
            # print(f"{inputs}, {targets}")
            print(f"{set(targets.detach().numpy().tolist())}")
            inputs, targets = inputs.to(self.device), targets.to(self.device)
            # yhat, yhat_x = self.model(inputs).to("cpu")
            yhat, yhat_x = self.model(inputs)
            yhat = yhat.to("cpu")
            yhat_x = yhat_x.to("cpu")
            # print(yhat.detach().numpy(), yhat_x)
            yhat = yhat.detach().numpy()
            actual = targets.to("cpu").numpy()
            actual = actual.reshape((len(actual), 1))
            # yhat = yhat.round()
            predictions.append(yhat)
            actuals.append(actual)
        prediction_distributions, actuals = np.vstack(predictions), np.vstack(actuals)
        predictions = np.argmax(prediction_distributions, axis=1)
        acc = accuracy_score(actuals, predictions)
        return acc, predictions, prediction_distributions, actuals
    else:
        print("Test_data_loader is none")
        return -1, -1, -1, -1
Unfortunately, sometimes I run into the issue that the sampler only picks features corresponding to two or three of the four labels during testing, i.e. one or two labels are completely omitted for the entire run, which is then also reflected in the predictions (i.e. when plotting the labels used during testing I might only get {0, 1, 2} instead of {0, 1, 2, 3} for the entire testing run).
Why is this happening, and how can I avoid it in future sessions? The problem goes away by simply re-running the evaluation function.
I have a Keras model with 1 input and 2 outputs in TensorFlow 2. When calling model.fit I want to pass the dataset as x=train_dataset and call model.fit only once. The train_dataset is made with tf.data.Dataset.from_generator, which yields: x1, y1, y2.
The only way I can run training is the following:
for x1, y1, y2 in train_dataset:
    model.fit(x=x1, y=[y1, y2], ...)
How can I tell TensorFlow to unpack the variables and train without the explicit for loop? Using the for loop makes many things less practical, as does using train_on_batch.
If I want to run model.fit(train_dataset, ...), the function doesn't understand what x and y are, even though the model is defined like:
model = Model(name='Joined_Model', inputs=self.x, outputs=[self.network.y1, self.network.y2])
It throws an error that it is expecting 2 targets while getting 1, even though the dataset has 3 variables, which can be iterated through in the loop.
The dataset and mini-batch are generated as:
def dataset_joined(self, n_epochs, buffer_size=32):
    dataset = tf.data.Dataset.from_generator(
        self.mbatch_gen_joined,
        (tf.float32, tf.float32, tf.int32),
        (tf.TensorShape([None, None, self.n_feat]),
         tf.TensorShape([None, None, self.n_feat]),
         tf.TensorShape([None, None])),
        [tf.constant(n_epochs)]
    )
    dataset = dataset.prefetch(buffer_size)
    return dataset

def mbatch_gen_joined(self, n_epochs):
    for _ in range(n_epochs):
        random.shuffle(self.train_s_list)
        start_idx, end_idx = 0, self.mbatch_size
        for _ in range(self.n_iter):
            s_mbatch_list = self.train_s_list[start_idx:end_idx]
            d_mbatch_list = random.sample(self.train_d_list, end_idx - start_idx)
            s_mbatch, d_mbatch, s_mbatch_len, d_mbatch_len, snr_mbatch, label_mbatch, _ = \
                self.wav_batch(s_mbatch_list, d_mbatch_list)
            x_STMS_mbatch, xi_bar_mbatch, _ = \
                self.training_example(s_mbatch, d_mbatch, s_mbatch_len,
                                      d_mbatch_len, snr_mbatch)
            # seq_mask_mbatch = tf.cast(tf.sequence_mask(n_frames_mbatch), tf.float32)
            start_idx += self.mbatch_size; end_idx += self.mbatch_size
            if end_idx > self.n_examples: end_idx = self.n_examples
            yield x_STMS_mbatch, xi_bar_mbatch, label_mbatch
Keras models expect Python generators or tf.data.Dataset objects to provide the input data as a tuple with the format (input_data, target_data) (or (input_data, target_data, sample_weights)). Each of input_data and target_data should itself be a list/tuple if the model has multiple input/output layers. Therefore, in your code, the generated data should be made compatible with this expected format:
yield x_STMS_mbatch, (xi_bar_mbatch, label_mbatch) # <- the second element is a tuple itself
Also, this should be considered in the arguments passed to the from_generator method as well:
dataset = tf.data.Dataset.from_generator(
    self.mbatch_gen_joined,
    output_types=(
        tf.float32,
        (tf.float32, tf.int32)
    ),
    output_shapes=(
        tf.TensorShape([None, None, self.n_feat]),
        (
            tf.TensorShape([None, None, self.n_feat]),
            tf.TensorShape([None, None])
        )
    ),
    args=(tf.constant(n_epochs),)
)
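For completeness, here is a minimal, self-contained sketch (dummy generator and toy two-output model, all names invented for illustration) showing that model.fit accepts this (inputs, (target_1, target_2)) structure directly, without the explicit for loop:
import numpy as np
import tensorflow as tf

def toy_gen():
    # dummy data: one input and two targets per example
    for _ in range(8):
        x = np.random.rand(4).astype("float32")
        y1 = np.random.rand(1).astype("float32")
        y2 = np.random.rand(1).astype("float32")
        yield x, (y1, y2)  # <- targets grouped in a tuple

ds = tf.data.Dataset.from_generator(
    toy_gen,
    output_types=(tf.float32, (tf.float32, tf.float32)),
    output_shapes=(tf.TensorShape([4]),
                   (tf.TensorShape([1]), tf.TensorShape([1])))
).batch(2)

inp = tf.keras.Input(shape=(4,))
out1 = tf.keras.layers.Dense(1)(inp)
out2 = tf.keras.layers.Dense(1)(inp)
model = tf.keras.Model(inputs=inp, outputs=[out1, out2])
model.compile(optimizer="adam", loss="mse")
model.fit(ds, epochs=1)  # no explicit for loop needed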
Use yield (x1, [y1, y2]) so model.fit will understand your generator's output.
From the Tensorflow dataset guide it says
It is often convenient to give names to each component of an element,
for example if they represent different features of a training
example. In addition to tuples, you can use collections.namedtuple or
a dictionary mapping strings to tensors to represent a single element
of a Dataset.
dataset = tf.data.Dataset.from_tensor_slices(
    {"a": tf.random_uniform([4]),
     "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})

print(dataset.output_types)   # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"
https://www.tensorflow.org/guide/datasets
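(Aside: output_types and output_shapes are TF1-era properties; in TF 2.x the equivalent inspection is dataset.element_spec, e.g.:)
dataset = tf.data.Dataset.from_tensor_slices(
    {"a": tf.random.uniform([4]),
     "b": tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.element_spec)
# {'a': TensorSpec(shape=(), dtype=tf.float32, name=None),
#  'b': TensorSpec(shape=(100,), dtype=tf.int32, name=None)}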
And this is very useful in Keras. If you pass a dataset object to model.fit, the names of the components can be used to match the inputs of your Keras model. Example:
image_input = keras.Input(shape=(32, 32, 3), name='img_input')
timeseries_input = keras.Input(shape=(None, 10), name='ts_input')
x1 = layers.Conv2D(3, 3)(image_input)
x1 = layers.GlobalMaxPooling2D()(x1)
x2 = layers.Conv1D(3, 3)(timeseries_input)
x2 = layers.GlobalMaxPooling1D()(x2)
x = layers.concatenate([x1, x2])
score_output = layers.Dense(1, name='score_output')(x)
class_output = layers.Dense(5, activation='softmax', name='class_output')(x)
model = keras.Model(inputs=[image_input, timeseries_input],
                    outputs=[score_output, class_output])

train_dataset = tf.data.Dataset.from_tensor_slices(
    ({'img_input': img_data, 'ts_input': ts_data},
     {'score_output': score_targets, 'class_output': class_targets}))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
model.fit(train_dataset, epochs=3)
So it would be useful to be able to look up, add, and change the names of components in tf.data.Dataset objects. What is the best way to go about these tasks?
You can use map to modify your dataset, if that is what you are looking for. For example, to transform a plain tuple output into a dict with meaningful names:
import tensorflow as tf
# dummy example
ds_ori = tf.data.Dataset.zip((tf.data.Dataset.range(0, 10), tf.data.Dataset.range(10, 20)))
ds_renamed = ds_ori.map(lambda x, y: {'input': x, 'output': y})
batch_ori = ds_ori.make_one_shot_iterator().get_next()
batch_renamed = ds_renamed.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(batch_ori))
    print(sess.run(batch_renamed))
# (0, 10)
# {'input': 0, 'output': 10}
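(The snippet above is TF1-style; in TF 2.x the same map-based renaming works with plain eager iteration:)
for batch in ds_renamed.take(1):
    print(batch)  # {'input': <tf.Tensor ... 0>, 'output': <tf.Tensor ... 10>}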
While the accepted answer is good for changing the names of (existing) components, it does not cover adding components. This can be done as follows:
y_dataset = x_dataset.map(fn1)
where you can define fn1 as you want
@tf.function
def fn1(x):
    # use x to derive the additional columns you want; set the shape as well
    y = {}
    y.update(x)
    y['new1'] = new1
    y['new2'] = new2
    return y
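For instance, a minimal self-contained sketch (the column name a_squared is invented for illustration) that adds a derived component to every element:
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices({'a': [1.0, 2.0, 3.0]})

def add_square(x):
    y = dict(x)
    y['a_squared'] = x['a'] ** 2  # derived column
    return y

for elem in ds.map(add_square):
    print(elem['a'].numpy(), elem['a_squared'].numpy())
# 1.0 1.0
# 2.0 4.0
# 3.0 9.0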
I optimized my Keras model using hyperopt. Now how do I save the best optimized Keras model and its weights to disk?
My code:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.metrics import roc_auc_score
import sys
X = []
y = []
X_val = []
y_val = []
space = {'choice': hp.choice('num_layers',
                             [{'layers': 'two'},
                              {'layers': 'three',
                               'units3': hp.uniform('units3', 64, 1024),
                               'dropout3': hp.uniform('dropout3', .25, .75)}
                              ]),
         'units1': hp.choice('units1', [64, 1024]),
         'units2': hp.choice('units2', [64, 1024]),
         'dropout1': hp.uniform('dropout1', .25, .75),
         'dropout2': hp.uniform('dropout2', .25, .75),
         'batch_size': hp.uniform('batch_size', 20, 100),
         'nb_epochs': 100,
         'optimizer': hp.choice('optimizer', ['adadelta', 'adam', 'rmsprop']),
         'activation': 'relu'
         }
def f_nn(params):
    from keras.models import Sequential
    from keras.layers.core import Dense, Dropout, Activation
    from keras.optimizers import Adadelta, Adam, rmsprop
    print('Params testing: ', params)
    model = Sequential()
    model.add(Dense(output_dim=params['units1'], input_dim=X.shape[1]))
    model.add(Activation(params['activation']))
    model.add(Dropout(params['dropout1']))
    model.add(Dense(output_dim=params['units2'], init="glorot_uniform"))
    model.add(Activation(params['activation']))
    model.add(Dropout(params['dropout2']))
    if params['choice']['layers'] == 'three':
        model.add(Dense(output_dim=params['choice']['units3'], init="glorot_uniform"))
        model.add(Activation(params['activation']))
        model.add(Dropout(params['choice']['dropout3']))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=params['optimizer'])
    model.fit(X, y, nb_epoch=params['nb_epochs'], batch_size=params['batch_size'], verbose=0)
    pred_auc = model.predict_proba(X_val, batch_size=128, verbose=0)
    acc = roc_auc_score(y_val, pred_auc)
    print('AUC:', acc)
    sys.stdout.flush()
    return {'loss': -acc, 'status': STATUS_OK}
trials = Trials()
best = fmin(f_nn, space, algo=tpe.suggest, max_evals=100, trials=trials)
print('best: ')
print(best)
The Trials object stores a lot of relevant information about each iteration of hyperopt. We can also ask this object to store the trained model.
You have to make a few small changes in your code base to achieve this:
-- return {'loss': -acc, 'status': STATUS_OK}
++ return {'loss': -acc, 'status': STATUS_OK, 'Trained_Model': model}
Note: 'Trained_Model' is just a key; you can use any other string.
best = fmin(f_nn, space, algo=tpe.suggest, max_evals=100, trials=trials)
model = getBestModelfromTrials(trials)
Retrieve the trained model from the trials object:
import numpy as np
from hyperopt import STATUS_OK
def getBestModelfromTrials(trials):
    valid_trial_list = [trial for trial in trials
                        if STATUS_OK == trial['result']['status']]
    losses = [float(trial['result']['loss']) for trial in valid_trial_list]
    index_having_minimum_loss = np.argmin(losses)
    best_trial_obj = valid_trial_list[index_having_minimum_loss]
    return best_trial_obj['result']['Trained_Model']
Note: I have used this approach with Scikit-Learn classes.
Make f_nn return the model.
def f_nn(params):
    # ...
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}
The models will then be available on the trials object under results. I put in some sample data, and print(trials.results) spits out:
[{'loss': 2.8245880603790283, 'status': 'ok', 'model': <keras.engine.training.Model object at 0x000001D725F62B38>}, {'loss': 2.4592788219451904, 'status': 'ok', 'model': <keras.engine.training.Model object at 0x000001D70BC3ABA8>}]
Use np.argmin to find the smallest loss, then save using model.save
trials.results[np.argmin([r['loss'] for r in trials.results])]['model']
(Side note, in C# this would be trials.results.min(r => r.loss).model... if there's a better way to do this in Python please let me know!)
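Putting it together, retrieving and saving the best model might look like this minimal sketch (the filename is illustrative):
import numpy as np

best_model = trials.results[np.argmin([r['loss'] for r in trials.results])]['model']
best_model.save('best_model.h5')  # persists architecture + weights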
You may wish to use attachments on the trial object if you're using MongoDB, as the model may be very large:
attachments - a dictionary of key-value pairs whose keys are short strings (like filenames) and whose values are potentially long strings (like file contents) that should not be loaded from a database every time we access the record. (Also, MongoDB limits the length of normal key-value pairs so once your value is in the megabytes, you may have to make it an attachment.) Source.
I don't know how to pass some variable to f_nn or another hyperopt target explicitly. But I've used two approaches to accomplish the same task.
The first approach is a global variable (I don't like it, because it's unclear); the second is to save the metric value to a file, then read it and compare with the current metric. The latter approach seems better to me.
def f_nn(params):
    ...
    # I omit a part of the code
    pred_auc = model.predict_proba(X_val, batch_size=128, verbose=0)
    acc = roc_auc_score(y_val, pred_auc)
    try:
        with open("metric.txt") as f:
            best_acc = float(f.read().strip())  # read the best metric so far
    except FileNotFoundError:
        best_acc = -1.0  # no previous run: any score beats this
    if acc > best_acc:
        model.save("model.hd5")  # save the new best model and overwrite the metric
        with open("metric.txt", "w") as f:
            f.write(str(acc))
    print('AUC:', acc)
    sys.stdout.flush()
    return {'loss': -acc, 'status': STATUS_OK}
trials = Trials()
best = fmin(f_nn, space, algo=tpe.suggest, max_evals=100, trials=trials)
print('best: ')
print(best)

from keras.models import load_model
best_model = load_model("model.hd5")
This approach has several advantages: you can keep the metric and model together, and even put them under some version or data version control system, so you can reproduce the results of an experiment later.
Edit
It can cause unexpected behaviour if a metric file from a previous run is still present and you don't delete it. To handle this, you can adapt the code to remove the metric file after the optimization, or use a timestamp etc. to distinguish your experiments' data.
It is easy to use a global variable to save the model, but I would recommend saving it as an attribute on the trials object for clarity. In my experience with hyperopt, unless you wrap ALL the remaining (non-tuned) parameters into a dict to feed into the objective function (e.g. objective_fn = partial(objective_fn_withParams, otherParams=otherParams)), it is very difficult to avoid global vars.
Example provided below:
trials = Trials()
trials.mybest = None  # initialize an attribute for saving the model later
best = fmin(f_nn, space, algo=tpe.suggest, max_evals=100, trials=trials)
trials.mybest['model'].save("model.hd5")

## In your optimization objective function
def f_nn(params):
    global trials
    model = trainMyKerasModelWithParams(..., params)
    ...
    pred_auc = model.predict_proba(X_val, batch_size=128, verbose=0)
    acc = roc_auc_score(y_val, pred_auc)
    loss = -acc
    ## Track only the best model (for saving later)
    if ((trials.mybest is None)
            or (loss < trials.mybest['loss'])):
        trials.mybest = {'loss': loss, 'model': model}
    ...
##
I want to use a lift metric, via lift_score(), as the evaluation metric in an XGBoost tree model, so I set
.cv(...,
    feval=lift_score,
    ...,
    )
but it shows the error:
TypeError: len() of unsized object
It might be because my dataset is of an int type and the xgboost tree only accepts integer data; I am not sure how to fix this problem.
Below is my code:
import xgboost as xgb
from mlxtend.evaluate import lift_score
t_params = {'objective': 'binary:logistic',
            'eta': 0.1,
            'subsample': 0.8,
            'colsample_bytree': 0.8,
            'max_depth': 4,
            'min_child_weight': 6,
            'seed': 0,
            }

xgdmat = xgb.DMatrix(X_train, y_train)  # my data

cv_xgb = xgb.cv(params=t_params,
                dtrain=xgdmat,
                feval=lift_score,
                maximize=True,
                num_boost_round=600,
                nfold=5,
                early_stopping_rounds=100
                )
Why? Well, the API expectations were not met:
While the feval parameter can be set freely, there are some expectations from the xgboost.cv() method API that ought to be fulfilled.
The simpler part:
# user defined evaluation function, return a pair metric_name, result
So your metric-evaluator function has to deliver a compatible result, best as:
return 'error', <_a_custom_LIFT_score_>
Not meeting this requirement (detected by testing whether len(...) was at least 2) is what actually raised the TypeError exception above. So this part is solved.
Next, the harder part to meet (taken from the xgboost source code):
# NOTE: when you do customized loss function, the default prediction value is margin
# this may make builtin evaluation metric not function properly
# for example, we are doing logistic loss, the prediction is score before logistic transformation
# the builtin evaluation error assumes input is after logistic transformation
# Take this in mind when you use the customization, and maybe you need write customized evaluation function
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # return a pair metric_name, result
    # since preds are margin (before logistic transformation, cutoff at 0)
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
Call-interface matching issues:
def eval_LIFT(ModelPREDICTIONS, dtrain):
    # a thin wrapper to mediate conversion from the
    # xgboost.cv() <feval>-FUN call-signature
    # to the target lift_score() call-signature
    return 'LIFT', lift_score(dtrain.get_label(),
                              ModelPREDICTIONS)
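With this wrapper in place, the cv() call from the question would pass eval_LIFT instead of the raw lift_score (a sketch reusing the question's variables):
cv_xgb = xgb.cv(params=t_params,
                dtrain=xgdmat,
                feval=eval_LIFT,  # the wrapper, not lift_score itself
                maximize=True,    # lift: higher is better
                num_boost_round=600,
                nfold=5,
                early_stopping_rounds=100)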
because the mlxtend.evaluate.lift_score() call-signature does not match the xgboost.cv()-native parameter ordering, but has its own:
def lift_score(y_target,
               y_predicted,
               binary=True,
               positive_label=1
               ):
    """
    Lift measures the degree to which the predictions of a
    classification model are better than randomly-generated predictions.
    In terms of True Positives (TP), True Negatives (TN),
    False Positives (FP), and False Negatives (FN), the lift score is
    computed as:
    [ TP/(TP+FN) ] / [ (TP+FP) / (TP+TN+FP+FN) ]
    ...
    """
    ...
The simple wrapper above will handle the simpler part.
If the value-bias warning from the xgboost walkthrough source code above also applies to your LIFT-metric case, the value-adaptation step will have to take place inside eval_LIFT(), before passing the correctly adapted values to mlxtend.evaluate.lift_score(), which expects un-biased values.
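For instance, if the lift computation needs class predictions rather than raw margins, that adaptation could look like this sketch (the logistic transform and the 0.5 cutoff are assumptions about your setup):
import numpy as np
from mlxtend.evaluate import lift_score

def eval_LIFT_from_margin(preds, dtrain):
    # preds arrive as raw margins; map them through the logistic function first
    probs = 1.0 / (1.0 + np.exp(-preds))
    hard_preds = (probs > 0.5).astype(int)  # the cutoff choice is an assumption
    return 'LIFT', lift_score(dtrain.get_label(), hard_preds)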