I apologize for a longer than usual intro, but it is important for the question:
I've recently been assigned to work on an existing project, which uses Keras+Tensorflow to create a Fully Connected Net.
Overall, the model has 3 fully connected layers of 500 neurons each and 2 output classes. The 500 neurons of the first layer are connected to 82 input features. The model is used in production and is retrained weekly on that week's data, which is generated by an external source.
The engineer who designed the model no longer works here, and I'm trying to reverse engineer the model and understand its behavior.
A couple of objectives I have defined for myself are:
Understand the feature selection process and feature importance.
Understand and control the weekly re-training process.
In order to try and answer both of them, I've implemented an experiment where I feed my code with two models: one from the previous week and the other from the current week:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from keras.models import model_from_json
path1 = 'C:/Model/20190114/'
path2 = 'C:/Model/20190107/'
model_name1 = '0_10.1'
model_name2 = '0_10.2'
models = [path1 + model_name1, path2 + model_name2]
features_cum_weight = {}
I then take each feature and try to sum all the weights (their absolute value) which connect it to the first hidden layer.
This way I create two vectors of 82 values:
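for model_name in models:
    structure_filename = model_name + "_structure.json"
    weights_filename = model_name + "_weights.h5"
    with open(structure_filename, 'r') as model_json:
        model = model_from_json(model_json.read())
    # Load the trained weights; without this the model keeps its random initialization
    model.load_weights(weights_filename)
    # Kernel of the first layer, shape (82 features, 500 neurons)
    in_layer_weights = model.layers[0].get_weights()[0]
    in_layer_weights = abs(in_layer_weights)
    # Sum over the neuron axis to get one value per input feature
    features_cum_weight[model_name] = in_layer_weights.sum(axis=1)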
I then plot them, using MatplotLib:
# Plot the Evolvement of Input Neuron Weights:
keys = list(features_cum_weight.keys())
weights_1 = features_cum_weight[keys[0]]
weights_2 = features_cum_weight[keys[1]]
fig, ax = plt.subplots(nrows=2, ncols=2)
width = 0.35 # the width of the bars
n_plots = 4
batch = int(np.ceil(len(weights_1)/n_plots))
for i in range(n_plots):
    start = i*(batch+1)
    stop = min(len(weights_1), start + batch + 1)
    cur_w1 = weights_1[start:stop]
    cur_w2 = weights_2[start:stop]
    ind = np.arange(len(cur_w1))
    cur_ax = ax[i//2][i%2]
    cur_ax.bar(ind - width/2, cur_w1, width, color='SkyBlue', label='Current Model')
    cur_ax.bar(ind + width/2, cur_w2, width, color='IndianRed', label='Previous Model')
    cur_ax.set_ylabel('Sum of Weights')
    cur_ax.set_title('Sum of all weights connected by feature')
    cur_ax.set_ylim(0, 30)
Resulting in the following plot:
MatPlotLib plot
I then try to compare the vectors to deduce:
If the vectors have been changed drastically - there might be some major change in the training data or some problem while retraining the model.
If some value is close to zero the model might have recognized this feature as not important.
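A minimal sketch of that comparison, assuming the two 82-value vectors are NumPy arrays. The function name, the toy vectors, and both thresholds (change_tol, zero_tol) are illustrative assumptions, not values taken from the project:

```python
import numpy as np

def compare_weight_sums(current, previous, change_tol=0.5, zero_tol=1e-3):
    """Flag features whose summed |weights| moved a lot or are near zero.

    change_tol and zero_tol are illustrative thresholds, not tuned values.
    """
    current = np.asarray(current, dtype=float)
    previous = np.asarray(previous, dtype=float)
    # Relative change per feature, guarded against division by zero
    rel_change = np.abs(current - previous) / np.maximum(np.abs(previous), 1e-12)
    changed = np.flatnonzero(rel_change > change_tol)
    near_zero = np.flatnonzero(current < zero_tol)
    return rel_change, changed, near_zero

# Toy vectors standing in for weights_1 (current) and weights_2 (previous)
cur = np.array([10.0, 0.0005, 4.0, 12.0])
prev = np.array([10.5, 0.0004, 9.0, 11.0])
rel_change, changed, near_zero = compare_weight_sums(cur, prev)
print("changed features:", changed)
print("near-zero features:", near_zero)
```

With the real data, cur and prev would be weights_1 and weights_2 from the plotting code.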
I want your opinion and insights on the following:
The overall approach to this experiment.
Advice on other ideas on reverse engineering on a given model.
Insights on the output I provide here.
Thank you all, I am open to any suggestions and criticism!
This type of deduction is not entirely valid, because the features are not combined linearly. It is true that a feature whose first-layer weights are all strictly 0 cannot matter, but a feature with small nonzero weights may still be recombined in another way in a deeper layer.
It would be valid if your model were linear. In fact, this is how PCA works: it searches for linear relationships through the covariance matrix, and the eigenvalues indicate the importance of each component.
I think that there are several ways to confirm your suspicions:
Eliminate the features that you think are not important, train again, and compare the results. If they are similar, your suspicions are correct.
Using the current model, take one example (call it the pivot), significantly change the features that you consider irrelevant to create many new examples, and evaluate them. Repeat this for several pivots. If the predictions stay similar, that feature should not matter. Example (here I consider the first feature to be irrelevant):
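data = np.array([[0.5, 1, 0.5], [1, 2, 5]])
range_values = 50
new_data = []
for i in range(data.shape[0]):
    sample = data[i]
    # Create new samples that differ from the pivot only in the first feature
    for _ in range(1000):
        noise = np.random.rand() * range_values
        new_sample = sample.copy()
        new_sample[0] += noise
        new_data.append(new_sample)
# Feed new_data to the model and compare the predictions with those of the pivots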
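The perturbation test can also be sketched end to end. In this sketch, toy_predict is a hypothetical stand-in for the real model's predict method, and perturbation_spread and the sampling range are assumptions made for illustration:

```python
import numpy as np

def perturbation_spread(predict, pivot, feature_idx, low, high, n_samples=1000, seed=0):
    """Std of the predictions when only one feature of the pivot is resampled."""
    rng = np.random.default_rng(seed)
    samples = np.tile(np.asarray(pivot, dtype=float), (n_samples, 1))
    samples[:, feature_idx] = rng.uniform(low, high, size=n_samples)
    return float(np.std(predict(samples)))

# Hypothetical stand-in for model.predict: it ignores feature 0 entirely
def toy_predict(x):
    return np.tanh(x[:, 1] - 0.3 * x[:, 2])

pivot = [0.5, 1.0, 0.5]
spread_0 = perturbation_spread(toy_predict, pivot, 0, 0.0, 50.0)
spread_1 = perturbation_spread(toy_predict, pivot, 1, 0.0, 50.0)
print(spread_0, spread_1)
```

A spread close to zero for a feature, repeated over several pivots, suggests the model's output does not depend on it. A clearly nonzero spread means the feature matters, whatever its first-layer weights look like.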
I have a Python code that aims to build several multi-fidelity models (one for each of several variables) and use Emukit's experimental design functions to update them iteratively. I am using simple uncertainty acquisition (ModelVariance) and the multi-fidelity-wrapped gradient optimizer as shown in the examples here and here. I started by applying this technique to only one of my several variables. When doing that I noticed that 1) all update points (x_new) seemed to be selected from the LF model and 2) the variance dropped precipitously everywhere after adding only a single update point. I shrugged this off initially, and applied the technique to all my variables (using a loop over a dictionary to do each variable in turn). When I did that, I discovered that the mean predictions (new model points) seemed perfectly reasonable, but the reported variances using .predict() for ALL the models of ALL the variables were exactly the same, and were in fact what I had been given by the program when just doing the single variable. Something seems to be going very wrong finding and updating the variances after adding a new training point and using .set_data to update the model and I am not sure what or where the problem is. Is there an emukit bug? Am I using an incorrect setting? Is the problem with my dictionaries or for-loops? I am at a loss. Can anyone offer some insight?
Here is the code I currently have, somewhat redacted. I am sorry that it's such a long read....
def make_mf(x,y,kernel,fidels):
    # Generic multifidelity model builder.
    # Returns a multifidelity model built based on the training points (x and y),
    # kernels, and number of fidelities
    mf_lin_model=GPyLinearMultiFidelityModel(x, y,kernel, n_fidelities=fidels)
    # set up loop to fix noise to 0 for all fidelities, indicating training points are exact
    for i in range(fidels):
        if i == 0:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise.fix(0)"
        else:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise_" + str(i) + ".fix(0)"
    ## Wrap the model using the given 'GPyMultiOutputWrapper'
    mf_model= model = GPyMultiOutputWrapper(mf_lin_model, 2, n_optimization_restarts=5,verbose_optimization=False)
    # Fit the model
    # Return the final model to the calling procedure
# list of y (result variables)
#list of x (input) variables
# list of fidelity levels. levels should be in order of ascending fidelity (0=lowest)
# list of what we'll need to store for each variable and level
# these are the model itself, the predicted values for plotting,
# and the predicted values at the training points
# list of medium_fidelity variables
# these are the training coordinates, the model, predicted values for plotting,
# predicted variances, the maximum and mean variance, and predicted
# values at the training points
# set up a dictionary to store the models and related results for each y-variable
# and each fidelity
MyModels={key:{lkey:{ckey:None for ckey in contents} for lkey in levels} for key in yvars}
# Set up a dictionary for the multi-fidelity models
MultiFidelity={key:{vkey: None for vkey in mainvars}for key in yvars}
for key in MultiFidelity.keys():
for level in levels:
MultiFidelity[key][level]={mkey:None for mkey in multifivars}
#set up a dictionary to easily access data
MyData={key:None for key in levels}
# set up a dictionaries to easily access training and plotting points
x_train={key:None for key in levels}
Y_plot={key:None for key in levels}
T_plot={key:None for key in levels}
# Number of initial points evaluated at each fidelity level
MyPoints={levels[i]:npoints[i] for i in range(len(levels))}
# High sampling of models for plotting of functions
x_plot = np.linspace(2, 16, 200)[:, None]
# set up points for plotting and retrieving MF model
X_plot = convert_x_list_to_array([x_plot, x_plot])
for i in range(len(levels)):
Y_plot[levels[i]] = X_plot[i*len(x_plot):(i+1)*len(x_plot)]
# Sampling for training for multi-fidelity analysis
x_train[levels[0]] = np.atleast_2d(np.random.rand(MyPoints[levels[0]])*14+2).T
for i in range (1,len(levels)):
x_train[levels[i]] = np.atleast_2d(np.random.permutation(x_train[levels[i-1]])[:MyPoints[levels[i]]])
#x_train_h = np.atleast_2d([3, 9.5, 11, 15]).T
# set up points for plotting mf result at training points
for i in range(len(levels)):
T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
# combine the training points of all fidelity levels into a list of arrays
for level in levels:
kernels = [GPy.kern.RBF(1), GPy.kern.RBF(1)]
lin_mf_kernel = emukit.multi_fidelity.kernels.LinearMultiFidelityKernel(kernels)
for var in MyModels.keys():
for level in levels:
# use SciPy interpolate to build surrogate for given variable and fidelity level
# find y-values for training MF points and append to a list of arrays
## Convert lists of arrays to ndarrays augmented with fidelity indicators
# Build the multi-fidelity model
## Construct a linear multi-fidelity model
MultiFidelity[var]['model']= make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel,len(levels))
# Get multifidelity model values and variances at plotting points
for level in levels:
# find maximum and average variance to measure the accuracy of the MF model
MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
for key in MyModels.keys():
for level in levels:
# set up the parameter space. we are scanning in x between 2 and 16 to match the range of my input
parameter_space = ParameterSpace([ContinuousParameter('x', 2, 16), InformationSourceParameter(len(levels))])
# set up how we will look for the target of our search
optimizer = MultiSourceAcquisitionOptimizer(GradientAcquisitionOptimizer(parameter_space), parameter_space)
# Plot each variable vs X for BEFORE any new points are added
for var in yvars:
# Note: right now I am basing the acquisition function on the first variable ONLY. I intend to
# build a more complex function later when I get these bugs worked out.
# perform optimization to find the target point
x_new, val = optimizer.optimize(acquisition)
# x_new=np.atleast_2d(0)
# x_new[0][0]=np.random.rand()*14+2
print('first update points is',x_new)
# I want to manually specify that I add one HF training point and 4 LF training points,
# hence the way the following code is built. This could be a source of problems?
# construct our own version of the new data point because we will want it from the HF surrogate model
# (hence the value 1 in the final column)
new_point_x_hi = [[x_new[0][0],1.]]
# also, since this is an HF point, we include it as a training point in the LF model
new_point_x_lo = [[x_new[0][0],0.]]
# # we also append the new x-value to the training point x-array
# next, prepare points to allow the plotting of the training points on each model
for i in range(len(levels)):
T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
for var in yvars:
# Now, for every variable in our list we add training points and update the models
# find the corresponding y-values from the respective surrogates
new_point_y_hi = np.atleast_2d(MyModels[var]['hf']['surrogate'](x_new[0][0]))
new_point_y_lo = np.atleast_2d(MyModels[var]['lf']['surrogate'](x_new[0][0]))
# Note that, as usual, we make these into 2D arrays to match EMUKit's formatting
# now append the new point to our model's training data arrays
# now we use .set_data to update the model based on the extended training data
# MultiFidelity[var]['model']= make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel,len(levels))
# and finally, re-calculate the values and variances at our plotting points to create an updated plot
# MultiFidelity[var]['lf']['y_plot'],MultiFidelity[var]['lf']['variance']=MultiFidelity[var]['model'].predict(Y_plot['lf'])
# MultiFidelity[var]['hf']['y_plot'],MultiFidelity[var]['hf']['variance']=MultiFidelity[var]['model'].predict(Y_plot['hf'])
# MultiFidelity[var]['hf']['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot['hf'])
# not forgetting to update the maximum and average variances
for level in levels:
# get new plotting points
MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
# find maximum and average variance to measure the accuracy of the MF model
# report maximum and average variance
print(var,level,'max = ',MultiFidelity[var][level]['varmax'],'mean = ', MultiFidelity[var][level]['varmean'])
# Plot each variable vs Coll for rcas, helios and the low and high-fidelity models for after HF point added
I have tried using different acquisition functions and got the same behavior. I have also tried rebuilding the model from scratch using model.optimize() and only got stranger behavior.
I am trying to fit the parameters of a transit light curve.
I have observed transit light curve data, and I am using a Python module (.py) that takes 4 parameters (period, semi-major axis a, inclination, planet radius) and returns a model transit light curve. I would like to minimize the residual between these two light curves. This is what I am trying to do: first, estimate the maximum likelihood using method="L-BFGS-B", and then run MCMC with emcee to estimate the uncertainties.
The code:
p = lmfit.Parameters()
p.add_many(('per', 2.), ('inc', 90.), ('a', 5.), ('rp', 0.1))
per_b = [1., 3.]
a_b = [4., 6.]
inc_b = [88., 90.]
rp_b = [0.1, 0.3]
bounds = [(per_b[0], per_b[1]), (inc_b[0], inc_b[1]), (a_b[0], a_b[1]), (rp_b[0], rp_b[1])]
def residual(p):
    v = p.valuesdict()
    eclipse.criarEclipse(v['per'], v['a'], v['inc'], v['rp'])
    lc0 = numpy.array(eclipse.getCurvaLuz())  # observed flux data
    ts0 = numpy.array(eclipse.getTempoHoras())  # observed time data
    c = numpy.linspace(min(time_phased[bb]),max(time_phased[bb]),len(time_phased[bb]),endpoint=True)
    nn = interpolate.interp1d(ts0,lc0)
    return nn(c) - smoothed_LC[bb]  # residual between the model and the data
Inside def residual(p) I make sure that both the observed data (time_phased[bb] and smoothed_LC[bb]) have the same size of the model transit light curve. I want it to give me the best fit values for the parameters (v['per'], v['a'], v['inc'], v['rp']).
I need your help and I appreciate your time and your attention. Kindest regards, Yuri.
Your example is incomplete, with many partial concepts and some invalid Python. This makes it slightly hard to understand your intention. If the answer below is not sufficient, update your question with a complete example.
It seems pretty clear that you want to model your data smoothed_LC[bb] (not sure what bb is) with a model for some effect of an eclipse. With that assumption, I would recommend using the lmfit.Model approach. Start by writing a function that models the data, just so you check and plot your model. I'm not entirely sure I understand everything you're doing, but this model function might look like this:
import numpy
from scipy import interpolate
from lmfit import Model
# import eclipse from somewhere....
def eclipse_lc(c, per, a, inc, rp):
    eclipse.criarEclipse(per, a, inc, rp)
    lc0 = numpy.array(eclipse.getCurvaLuz()) # observed flux data
    ts0 = numpy.array(eclipse.getTempoHoras()) # observed time data
    return interpolate.interp1d(ts0,lc0)(c)
With this model function, you can build a Model:
lc_model = Model(eclipse_lc)
and then build parameters for your model. This will automatically name them after the argument names of your model function. Here, you can also give them initial values:
params = lc_model.make_params(per=2, inc=90, a=5, rp=0.1)
You wanted to place upper and lower bounds on these parameters. This is done by setting the min and max attributes of each parameter, not by making an ordered array of bounds:
params['per'].min = 1.0
params['per'].max = 3.0
and so on. But also: setting such tight bounds is usually a bad idea. Set bounds to avoid unphysical parameter values or when it becomes evident that you need to place them.
Now, you can fit your data with this model. Well, first you need to get the data you want to model. This seems less clear from your example, but perhaps:
c_data = numpy.linspace(min(time_phased[bb]), max(time_phased[bb]),
len(time_phased[bb]), endpoint=True)
lc_data = smoothed_LC[bb]
Well: why do you need to make this c_data? Why not just use time_phased as the independent variable? Anyway, now you can fit your data to your model with your parameters:
result = lc_model.fit(lc_data, params, c=c_data)
At this point, you can print out a report of the results and/or view or get the best-fit arrays:
for p in result.params.items(): print(p)
import matplotlib.pyplot as plt
plt.plot(c_data, lc_data, label='data')
plt.plot(c_data, result.best_fit, label='fit')
Hope that helps...
I have fit a Kmeans model on document embeddings from a Doc2Vec model to cluster the embeddings and get a visualization as well as the most frequent terms per cluster. I have been able to do this fine and get the same visualization each time.
When I run the kmeans.fit_predict on the model it gives me a list of cluster labels according to the clusters I have specified of the same length as the number of document embeddings I have. The issue comes when running the model multiple times it gives a similar spread per cluster each time but the cluster labels will change after running it multiple times. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and using the same model multiple times but encountered the same issue. This causes the most frequent terms per cluster and the position of each cluster in the visualization to change, which changes the way it is interpreted. I was planning to use the labels as a classification method, but doesn't this make that impossible? I'm not sure if it's an issue with my code or if this is normal behavior. If anyone can help, it would be much appreciated.
df = pd.read_csv("data.csv")
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
for i in range(clusters):
    df_temp = df[df["clusters"]==i]
    cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
    [cluster_list.append(x[0]) for x in cluster_words]
For KMeans, every time you run fit the centroids are initialized randomly. To make it deterministic you can use the random_state parameter; see the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state=42)  # any fixed integer works
Stabilizing the initialization randomization by specifying a random_state (per @qaiser's answer) may help – perhaps by ensuring similar-ish sets of doc-vectors, against same starting KMeans state, tends to find the 'same' clusters in the same named slots.
But there could be situations, where the doc-vectors have a different distribution, or where initialized state is (by bad luck) highly sensitive to doc-vector distribution, where even this repeated-initialization doesn't maintain coherent clusters.
You might want to also consider one or both of:
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, rename the clusters according to which (of all possible 3! arbitrary naming permutations of 3 clusters) leaves the smallest possible total distances between each 'new' cluster of the same name to the 'prior' cluster of the same name.
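Option (2) can be sketched with NumPy alone. This is a minimal illustration that assumes the clusters are compared through their centroids; align_labels and the toy centroids are made up for the example:

```python
from itertools import permutations
import numpy as np

def align_labels(prev_centroids, new_centroids, new_labels):
    """Rename new clusters so that the total distance to the prior centroids is smallest."""
    k = len(prev_centroids)
    # dist[i, j] = distance from prior centroid i to new centroid j
    dist = np.linalg.norm(prev_centroids[:, None, :] - new_centroids[None, :, :], axis=2)
    # perm[i] = index of the new cluster that should take the prior name i
    best = min(permutations(range(k)),
               key=lambda perm: sum(dist[i, perm[i]] for i in range(k)))
    mapping = {new_name: old_name for old_name, new_name in enumerate(best)}
    return np.array([mapping[label] for label in new_labels]), mapping

prev = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
new = np.array([[9.8, 0.1], [0.2, -0.1], [5.1, 4.9]])  # same clusters, shuffled names
labels, mapping = align_labels(prev, new, [0, 1, 2, 2, 0])
print(mapping)
print(labels)
```

Trying all k! permutations is fine for 3 clusters. For many clusters, an assignment solver such as scipy.optimize.linear_sum_assignment on the same distance matrix would scale better.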
I think the issue might be the use of .fit_predict, which refits the model every time it is called. Try just .predict on the already fitted model; see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
Something similar worked for me.
I made the simplest 1D example for TensorBoard (tracking the minimization of a quadratic) but I get plots that don't make sense to me and I can't figure out why. Is it my own implementation or is TensorBoard buggy?
Here are the plots:
Usually I think of histograms as bar graphs that encode probability distributions (or frequency counts). I assume that the y-axis shows the values and the x-axis the count? Since my number of steps is 120, that seemed a reasonable guess.
and Scalar plot:
why is there a strange line going through my plots?
The code that produced it (you should be able to copy paste it and run it):
## run cmd to collect model: python playground.py --logdir=/tmp/playground_tmp
## show board on browser run cmd: tensorboard --logdir=/tmp/playground_tmp
## browser: http://localhost:6006/
import tensorflow as tf
# x variable
x = tf.Variable(10.0,name='x')
# b placeholder (simulates the "data" part of the training)
b = tf.placeholder(tf.float32)
# make model (1/2)(x-b)^2
xx_b = 0.5*tf.pow(x-b,2)
learning_rate = 1.0
# get optimizer
opt = tf.train.GradientDescentOptimizer(learning_rate)
# gradient variable list = [ (gradient,variable) ]
gv = opt.compute_gradients(xx_b,[x])
# transformed gradient variable list = [ (T(gradient),variable) ]
decay = 0.9 # decay the gradient for the sake of the example
# apply transformed gradients
tgv = [ (decay*g, v) for (g,v) in gv] #list [(grad,var)]
apply_transform_op = opt.apply_gradients(tgv)
# track value of x
x_scalar_summary = tf.scalar_summary("x", x)
x_histogram_sumarry = tf.histogram_summary('x_his', x)
with tf.Session() as sess:
    merged = tf.merge_all_summaries()
    tensorboard_data_dump = '/tmp/playground_tmp'
    writer = tf.train.SummaryWriter(tensorboard_data_dump, sess.graph)
    # variables must be initialized before the first update
    sess.run(tf.initialize_all_variables())
    epochs = 120
    for i in range(epochs):
        b_val = 1.0 #fake data (in SGD it would be different on every epoch)
        # applies the gradients
        [summary_str_apply_transform,_] = sess.run([merged,apply_transform_op], feed_dict={b: b_val})
        writer.add_summary(summary_str_apply_transform, i)
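For reference, the update this code applies can be replayed in plain Python without TensorFlow. This is only a sketch of the arithmetic (simulate_x is a made-up helper), but it shows what a single clean scalar curve for x should look like: each step shrinks x - b by a factor of 1 - learning_rate*decay = 0.1.

```python
def simulate_x(x0=10.0, b=1.0, learning_rate=1.0, decay=0.9, steps=5):
    """Replay x <- x - learning_rate * decay * (x - b) for the loss 0.5*(x-b)^2."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        grad = x - b  # derivative of 0.5*(x-b)^2 with respect to x
        x = x - learning_rate * decay * grad
        xs.append(x)
    return xs

trajectory = simulate_x()
print(trajectory)
```

The values fall from 10 to about 1.9, 1.09, 1.009 and so on, so x should settle near 1 within a handful of steps.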
I also met the same problem, where multiple lines appeared in the Instance tab in TensorBoard. (When I tried your code, the board showed the duplicate-graph warning below and presented only one curve, which is better than what I got.)
WARNING:tensorflow:Found more than one graph event per run. Overwriting the graph with the newest event.
Nevertheless, the solution is the same as @Olivier Moindrot mentioned: delete the old logs. The board may sometimes cache results, so you may also want to restart the board service.
The way to make sure we present the newest summary, as the MNIST example shows, is to log to a new folder:
if tf.gfile.Exists(FLAGS.summaries_dir):
Link to full source, with TF version r0.10: https://github.com/tensorflow/tensorflow/blob/r0.10/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py
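The same idea, clearing stale event files before each run, can be sketched with the standard library only. fresh_log_dir is a hypothetical helper, and the demonstration uses a throwaway temporary directory rather than a real TensorBoard log folder:

```python
import os
import shutil
import tempfile

def fresh_log_dir(path):
    """Delete any previous contents of path and recreate it empty."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path

# Demonstration on a throwaway directory
log_dir = os.path.join(tempfile.mkdtemp(), "playground_tmp")
os.makedirs(log_dir)
open(os.path.join(log_dir, "events.out.old"), "w").close()  # stale event file
fresh_log_dir(log_dir)
print(os.listdir(log_dir))
```

Calling such a helper on the log directory before creating the summary writer means each run starts from an empty folder, so TensorBoard has no older events to overlay on the new curve.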
In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated python function.
Is it possible to do that?
Following is a pyMC3 code which shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm
#Predefine values on two parameter Grid (x,w) for a set of i values (1,2,3)
idata = np.array([1,2,3])
size= 20
gridlength = size*size
Grid = np.empty((gridlength,2+len(idata)))
for x in range(size):
    for w in range(size):
        # A silly version of my real model evaluated on grid.
        Grid[x*size+w,:]= np.array([x,w]+[(x**i + w**i) for i in idata])
# A function to find the nearest value in Grid and return its product with third variable z
def FindFromGrid(x,w,z):
    return Grid[int(x)*size+int(w),2:] * z
#Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size+12,2:]*3.6 + yerror # ie. True x= 16, w= 12 and z= 3.6
with pm.Model() as model:
    x = pm.Uniform('x',lower=0,upper= size)
    w = pm.Uniform('w',lower=0,upper =size)
    z = pm.Uniform('z',lower=-5,upper =10)
    #Expected value
    y_hat = pm.Deterministic('y_hat',FindFromGrid(x,w,z))
    #Data likelihood
    ysigmas = np.ones(len(idata))*9.0
    y_like = pm.Normal('y_like',mu= y_hat, sd=ysigmas, observed=ydata)
    # Inference...
    start = pm.find_MAP() # Find starting value by optimization
    step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(1000, step, start=start, progressbar=False) # draw 1000 posterior samples using NUTS sampling
print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z':3.6})
When I run this code, I get an error at the y_hat stage, because the int() function inside FindFromGrid(x,w,z) needs an integer, not a FreeRV.
Finding y_hat from a pre calculated grid is important because my real model for y_hat does not have an analytical form to express.
I have earlier tried to use OpenBUGS, but I found out here it is not possible to do this in OpenBUGS. Is it possible in PyMC ?
Based on an example in pyMC github page, I found I need to add the following decorator to my FindFromGrid(x,w,z) function.
@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar],otypes=[t.dvector])
This seems to solve the above mentioned issue. But I cannot use NUTS sampler anymore since it needs gradient.
Metropolis seems to be not converging.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: Are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason this might not converge is that Metropolis() by default samples in the joint space, while Metropolis within Gibbs is often more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice samplers to be non-blocked (i.e. Metropolis within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always explicitly set this with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.
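To illustrate what non-blocked (Metropolis within Gibbs) sampling means, here is a minimal NumPy sketch. It is not PyMC's implementation: the function name, the toy target, and the step size are all assumptions made for the example.

```python
import numpy as np

def metropolis_within_gibbs(logp, start, n_samples=2000, step=1.0, seed=0):
    """Random-walk Metropolis that proposes and accepts one coordinate at a time."""
    rng = np.random.default_rng(seed)
    current = np.array(start, dtype=float)
    current_logp = logp(current)
    trace = np.empty((n_samples, current.size))
    for n in range(n_samples):
        for d in range(current.size):  # non-blocked: update each variable separately
            proposal = current.copy()
            proposal[d] += rng.normal(0.0, step)
            proposal_logp = logp(proposal)
            # Metropolis acceptance test on the log scale
            if np.log(rng.uniform()) < proposal_logp - current_logp:
                current, current_logp = proposal, proposal_logp
        trace[n] = current
    return trace

# Toy target: independent unit normals centred at (16, 12), standing in for a real posterior
target_mean = np.array([16.0, 12.0])

def toy_logp(theta):
    return -0.5 * np.sum((theta - target_mean) ** 2)

trace = metropolis_within_gibbs(toy_logp, start=[10.0, 10.0])
print(trace[500:].mean(axis=0))
```

A blocked sampler would instead perturb all coordinates in one proposal and accept or reject them together, which tends to be rejected more often when the variables have very different scales.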