Implement an N-ary TreeLSTM version of the TreeLSTM in TensorFlow Fold - python

I'm trying to implement the TreeLSTM paper using TensorFlow Fold. TensorFlow Fold already includes a TreeLSTM example, but only as a BinaryTreeLSTM; here's the tutorial: https://github.com/tensorflow/fold/blob/master/tensorflow_fold/g3doc/sentiment.ipynb
What I'm trying to do now is implement a true N-ary TreeLSTM, meaning that an LSTM node can be the parent of any number of children, not just two as in the tutorial above.
This is my attempt at folding the tree; it is a modified version of logits_and_state() from the example above:
def logits_and_state():
    """Creates a block that goes from tokens to (logits, state) tuples."""
    word2vec = (td.GetItem(0) >> td.InputTransform(lookup_word) >>
                td.Scalar('int32') >> word_embedding)

    children_num =   # <-- this is the part I can't figure out
    children2vec_list = list()
    children2vec_list.append(embed_subtree())
    for i in range(children_num):
        children2vec_list.append(embed_subtree())
    children2vec = tuple(children2vec_list)

    # Trees are binary, so the tree layer takes two states as its input_state.
    zero_state = td.Zeros((tree_lstm.state_size,) * 2)
    # Input is a word vector.
    zero_inp = td.Zeros(word_embedding.output_type.shape[0])

    # word_case =
    word_case = td.AllOf(word2vec, zero_state)
    children_case = td.AllOf(zero_inp, children2vec)

    tree2vec = td.OneOf(lambda x: 1 if len(x) == 1 else 2,
                        [(1, word_case), (2, children_case)])
    return tree2vec >> tree_lstm >> (output_layer, td.Identity())
children_num is what I'm struggling with at the moment; I have no idea how to get that number out. I know that the children can be obtained with td.GetItem(1), which produces a block containing the sequence of children, but how do I get the actual length of that block?
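The only idea I've come up with so far is to dispatch on the arity, with one OneOf case per possible number of children, the same way the binary example dispatches on len. This is just a sketch of the idea, not working code: MAX_CHILDREN is a bound I would have to pick in advance, I'm assuming a leaf is a 1-tuple (word,) and an internal node is simply the tuple of its children, and I haven't figured out how to make the output types (number of child states) line up across the cases:

MAX_CHILDREN = 5  # a fixed upper bound on the arity, chosen in advance

def children_case(n):
    # same trick as the binary pair_case, but with n child subtrees:
    # the tuple of blocks is applied element-wise to the n children
    children2vec = tuple(embed_subtree() for _ in range(n))
    return td.AllOf(zero_inp, children2vec)

# a leaf is a 1-tuple (word,); an internal node is a tuple of its children
tree2vec = td.OneOf(len, [(1, word_case)] +
                         [(n, children_case(n)) for n in range(2, MAX_CHILDREN + 1)])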
You may say that I should try PyTorch or another DL framework that also provides dynamic computation graphs, but in my case the requirement is strictly TensorFlow Fold.

Related

Find nearest neighbors change algorithm

I am in the process of creating a recommender system that suggests the 20 most suitable songs to a user. I've trained my model, and I'm ready to recommend songs for a given playlist! However, one issue I've encountered is that I need the embedding of that new playlist in order to find the closest relevant playlists in the embedding space using kmeans.
To recommend songs, I first cluster the learned embeddings for all of the training playlists, and then select "neighbor" playlists for my given test playlist as all of the other playlists in that same cluster. I then take all of the tracks from these playlists and feed the test playlist embedding and these "neighboring" tracks into my model for prediction. This ranks the "neighboring" tracks by how likely they are (under my model) to occur next in the given test playlist.
desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model = keras.models.load_model(model_path)
print('Loaded model!')

mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())

# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

print('\nPerforming kmeans to find the nearest users/playlists...')
# get 100 similar users
kmeans = KMeans(n_clusters=100, random_state=0, verbose=0).fit(user_latent_matrix)
desired_user_label = kmeans.predict(one_user_vector)
user_label = kmeans.labels_
neighbors = []
for user_id, user_label in enumerate(user_label):
    if user_label == desired_user_label:
        neighbors.append(user_id)
print('Found {0} neighbor users/playlists.'.format(len(neighbors)))

tracks = []
for user_id in neighbors:
    tracks += list(df[df['pid'] == int(user_id)]['trackindex'])
print('Found {0} neighbor tracks from these users.'.format(len(tracks)))

users = np.full(len(tracks), desired_user_id, dtype='int32')
items = np.array(tracks, dtype='int32')

# and predict tracks for my user
results = model.predict([users, items], batch_size=100, verbose=0)
results = results.tolist()
print('Ranked the tracks!')

results_df = pd.DataFrame(np.nan, index=range(len(results)), columns=['probability', 'track_name', 'track artist'])
print(results_df.shape)

# loop through and get the probability (of being in the playlist according to my model), the track, and the track's artist
for i, prob in enumerate(results):
    results_df.loc[i] = [prob[0], df[df['trackindex'] == i].iloc[0]['track_name'], df[df['trackindex'] == i].iloc[0]['artist_name']]
results_df = results_df.sort_values(by=['probability'], ascending=False)
results_df.head(20)
Instead of the code above, I would like to use this: https://www.tensorflow.org/recommenders/examples/basic_retrieval#building_a_candidate_ann_index or Annoy from Spotify's official GitHub repository: https://github.com/spotify/annoy.
Unfortunately I don't know exactly how to use these so that the new program gives me the 20 most suitable tracks for a user.
What do I have to change?
Edit:
What I tried:
from annoy import AnnoyIndex
import random

desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model = keras.models.load_model(model_path)
print('Loaded model!')

mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())

# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

t = AnnoyIndex(desired_user_id, one_user_vector)  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)
t.build(10)  # 10 trees
t.save('test.ann')

u = AnnoyIndex(desired_user_id, one_user_vector)
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors

# Now how do I get the probability and the values?
You're almost there!
In your code starting with desired_user_id = 123, you have 4 main steps:
1 (L 1-12): retrieve the matrix of user embeddings (user_latent_matrix) from your saved model.
2 (L 14-23): find the user's cluster label (desired_user_label) with kmeans and list the other users in that cluster (neighbors). Users in the same cluster should listen to similar songs to you.
3 (L 25-31): list the songs that other users in the cluster like (tracks). Music that you like will be similar to music listened to by other people in your cluster. Steps 2 and 3 just filter out 99% of all music so you only have to run your model on the last 1%, to save time and money. Removing steps 2 and 3 and adding every song to tracks would still work (but take 100x longer).
4 (L 33+): use the saved model to predict whether the songs liked by other users in the cluster are suitable for you (results_df).
Annoy is a drop-in replacement for finding the similar users (step 2). Instead of using kmeans to find the user's cluster and then finding the other users in that cluster, it uses a k-nearest-neighbors-style algorithm to find close users directly.
After you've found one_user_vector on line 12, replace step 2 (lines 14-23) with something like:
from annoy import AnnoyIndex

user_embedding_length = 32  # the user embeddings have 32 dimensions (see the (1, 32) reshape above)
t = AnnoyIndex(user_embedding_length, 'angular')

# add the user embeddings to annoy (your annoy user ids will be the row indexes)
for user_id, user_embedding in enumerate(user_latent_matrix):
    t.add_item(user_id, user_embedding)

# build the forest
t.build(10)  # 10 trees

# save the forest for later if you're using this again and don't want to rebuild the trees every time
t.save('test.ann')

# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100)
If you want to run your code again later, have already run it once, and don't want to rebuild the trees, replace step 2 with just:
from annoy import AnnoyIndex

user_embedding_length = 32
t = AnnoyIndex(user_embedding_length, 'angular')

# load the trees
t.load('test.ann')

# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100)
After replacing step 2, just run steps 3 and 4 (lines 25+) as normal.
In the existing code you really seem to have two prediction steps: one for finding the 100 nearest neighbor users, and another one for ranking all the tracks of these users in relation to the current user. In general, you should decide which of these steps (or both) you want to replace with the Annoy algorithm.
Looking at the example code from GitHub, you do not need the t = AnnoyIndex ... part here; it just creates some sample data to demonstrate the usage.
u = AnnoyIndex(f, metric) needs the number of dimensions as the input parameter f, and one of the values "angular", "euclidean", "manhattan", "hamming", or "dot" as the metric.
From your question I cannot tell what the number of dimensions is in your case, and you will probably have to experiment with the metric yourself to find out what yields the best results.
After that you will have to import your data into the AnnoyIndex object; it will probably be derived from user_latent_matrix and/or users/items.
Finally, you should be able to retrieve the 20 nearest neighbors by running u.get_nns_by_item(i, 20), where i is the id of the user or track in question. Setting include_distances=True will also give you the corresponding distances (not probabilities as in your approach).
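As a rough sketch of that workflow (assuming, as in your code above, that the embeddings come from user_latent_matrix and have 32 dimensions, and picking 'angular' purely as an example metric):

from annoy import AnnoyIndex

f = 32  # number of embedding dimensions (matches the (1, 32) reshape in your code)
u = AnnoyIndex(f, 'angular')  # 'angular' chosen arbitrarily; try other metrics too

# index every user embedding; the Annoy item id is simply the row index
for user_id, embedding in enumerate(user_latent_matrix):
    u.add_item(user_id, embedding)
u.build(10)  # 10 trees

# the 20 nearest users to the desired user, with distances instead of probabilities
# (note: the queried item itself is typically included in the results)
neighbor_ids, distances = u.get_nns_by_item(desired_user_id, 20, include_distances=True)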
Hope this gives you some hints for getting ahead.

Implementing the TD-Gammon algorithm

I am attempting to implement the algorithm from the TD-Gammon article by Gerald Tesauro. The core of the learning algorithm is the weight-update rule described in the article:
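(As far as I can transcribe it, with Y_t the network output at time step t, \alpha the learning rate, \lambda the trace-decay parameter, and w the weights:)

w_{t+1} - w_t = \alpha \, (Y_{t+1} - Y_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w Y_k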
I have decided to use a single hidden layer (if that was enough to play world-class backgammon in the early 1990s, then it's enough for me). I am pretty certain that everything except the train() function is correct (those parts are easier to test), but I have no idea whether I have implemented this final algorithm correctly.
import numpy as np

class TD_network:
    """
    Neural network with a single hidden layer and a Temporal Difference training algorithm
    taken from G. Tesauro's 1995 TD-Gammon article.
    """
    def __init__(self, num_input, num_hidden, num_output, hnorm, dhnorm, onorm, donorm):
        self.w21 = 2*np.random.rand(num_hidden, num_input) - 1
        self.w32 = 2*np.random.rand(num_output, num_hidden) - 1
        self.b2 = 2*np.random.rand(num_hidden) - 1
        self.b3 = 2*np.random.rand(num_output) - 1
        self.hnorm = hnorm
        self.dhnorm = dhnorm
        self.onorm = onorm
        self.donorm = donorm

    def value(self, input):
        """Evaluates the NN output"""
        assert(input.shape == self.w21[1,:].shape)
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3
        return(self.onorm(o))

    def gradient(self, input):
        """
        Calculates the gradient of the NN at the given input. Outputs a list of dictionaries
        where each dict corresponds to the gradient of an output node, and each element in
        a given dict gives the gradient for a subset of the weights.
        """
        assert(input.shape == self.w21[1,:].shape)
        J = []
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3
        for i in range(len(self.b3)):
            db3 = np.zeros(self.b3.shape)
            db3[i] = self.donorm(o[i])
            dw32 = np.zeros(self.w32.shape)
            dw32[i, :] = self.donorm(o[i])*hn
            db2 = np.multiply(self.dhnorm(h), self.w32[i,:])*self.donorm(o[i])
            dw21 = np.transpose(np.outer(input, db2))
            J.append(dict(db3 = db3, dw32 = dw32, db2 = db2, dw21 = dw21))
        return(J)

    def train(self, input_states, end_result, a = 0.1, l = 0.7):
        """
        Trains the network using a single series of input states representing a game from beginning
        to end, and a final (supervised / desired) output for the end state
        """
        outputs = [self.value(input_state) for input_state in input_states]
        outputs.append(end_result)
        for t in range(len(input_states)):
            delta = dict(
                db3 = np.zeros(self.b3.shape),
                dw32 = np.zeros(self.w32.shape),
                db2 = np.zeros(self.b2.shape),
                dw21 = np.zeros(self.w21.shape))
            grad = self.gradient(input_states[t])
            for i in range(len(self.b3)):
                for key in delta.keys():
                    td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
                    delta[key] += a*(outputs[t + 1][i] - outputs[t][i])*td_sum
            self.w21 += delta["dw21"]
            self.w32 += delta["dw32"]
            self.b2 += delta["db2"]
            self.b3 += delta["db3"]
The way I use this is I play through a whole game (or rather, the neural net plays against itself), and then I send the states of that game, from start to finish, into train(), along with the final result. It then takes this game log, and applies the above formula to alter weights using the first game state, then the first and second game states, and so on until the final time, when it uses the entire list of game states. Then I repeat that many times and hope that the network learns.
To be clear, I am not after feedback on my code writing. This was never meant to be more than a quick and dirty implementation to see that I have all the nuts and bolts in the right spots.
However, I have no idea whether it is correct, as I have thus far been unable to make it play tic-tac-toe at any reasonable level. There could be many reasons for that. Maybe I'm not giving it enough hidden nodes (I have used 10 to 12). Maybe it needs more games to train on (I have used 200,000). Maybe it would do better with different normalisation functions (I've tried sigmoid and ReLU, leaky and non-leaky, in different variations). Maybe the learning parameters are not tuned right. Maybe tic-tac-toe and its deterministic gameplay mean it "locks in" on certain paths in the game tree. Or maybe the training implementation is just wrong. Which is why I'm here.
Have I misunderstood Tesauro's algorithm?
I can't say that I entirely understand your implementation, but this line jumps out at me:
td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
Comparing it with the formula you reference (the weight-update rule quoted above), I see at least two differences:
Your implementation sums over t+1 elements compared to t elements in the formula
The gradient should be indexed with the same k as used in l**(t-k), but in your implementation it is indexed with i and key, without any reference to k
Perhaps if you fix these discrepancies your solution will behave more as expected.
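To illustrate the second point, here is a rough sketch of the kind of change I mean (purely illustrative, I haven't run it, and you would still need to settle the off-by-one range from the first point): store the gradient at every state visited so far and index it by k inside the sum.

# inside train(), replacing the single grad = self.gradient(input_states[t]):
grads = [self.gradient(s) for s in input_states[:t + 1]]  # grads[k] = gradient at time k (could be cached across t)
for i in range(len(self.b3)):
    for key in delta.keys():
        td_sum = sum(l**(t - k) * grads[k][i][key] for k in range(t + 1))
        delta[key] += a * (outputs[t + 1][i] - outputs[t][i]) * td_sum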

Create a model to classify a sentence as logical or not

As the title mentions, how could I train a model to classify whether sentences like the following are logical or illogical?
“He has two legs”–logical
“He has six legs”–illogical
Solutions I tried:
1: Train a classifier (CNN)
I have done this before; it works very well if you have enough data. The problem is that I do not have a large data set with "logical"/"illogical" labels for this case.
2: Use a language model
Train a language model provided by gluonnlp on a data set like Wikipedia and use it to estimate the probability of each sentence. If the probability of a sentence is high, mark it as logical, and vice versa. The problem is that the results are not good.
This is how I estimate the probability:
def __predict(self):
    lines = self.__text_edit_input.toPlainText().split("\n")
    result = ""
    for line in lines:
        result += str(self.__sentence_prob(line, 10)) + "\n"
    self.__text_edit_output.setPlainText(result)

def __prepare_sentence(self, text, max_len):
    result = mx.nd.zeros([max_len, 1], dtype='float32')
    max_len = min(len(text), max_len)
    i = max(max_len - len(text), 0)
    j = 0
    for index in range(i, max_len):
        result[index][0] = self.__vocab[text[j]]
        j = j + 1
    return result

def __sentence_prob(self, text, max_len):
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    tokens = self.__tokenizer(text)
    data = self.__prepare_sentence(tokens, max_len)
    output, _ = self.__model(data, hiddens)
    prob = 0
    for i in range(max_len):
        total_prob = mx.nd.softmax(output[i][0])
        prob += total_prob[self.__vocab[i]].asscalar()
    return prob / max_len
Possible issues with the language model approach:
1. I am not splitting the sentences correctly (I am using jieba to split the Chinese sentences).
2. The vocabulary size is too small/large (tested 10000, 15000 and 30000).
3. The loss is too high (perplexity around 190) after 50 epochs.
4. The sentence length should be larger/smaller (tried 10, 20, 35).
5. The data I use does not meet my requirements (not every sentence is logical).
6. A language model is not appropriate for this task?
Any suggestions? Thanks.
Issue 6 ("a language model is not appropriate for this task") is the main problem. Language models are built to make sense of input text with respect to language usage (syntax, semantics, etc.), not to draw logical conclusions. So you may not get good results even with a large amount of data or very deep models.
The problem you're trying to solve is extremely difficult. Something you may want to look at is Symbolic AI. There's a lot of ongoing research in this area.

Preprocessing CSV data using tensorflow DataSet API

I'm playing around a bit with TensorFlow, but am a bit confused about the input pipeline. The data I'm working on is in a large CSV file with 307 columns, of which the first is a string representing a date and the rest are floats.
I'm running into some problems preprocessing my data. I want to replace the date string with a couple of features based on it (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 after that as another feature, and base my label on the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])

def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]

        ## Do something to convert datetimeString to timeSine and timeCosine

        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }

        label = [1]  # placeholder. I seem to need some python logic here, but I'm
                     # not sure how to apply that to data in tensor format.

        return features_dict, label

    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))

    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: Pretty much the same for the label based on the remaining values of the CSV. Can I just use standard python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.
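The only workaround I've come up with so far for the datetime conversion is dropping back to plain Python with tf.py_func inside decode_line, but I'm not sure this is the intended approach (the date format string here is just a guess, since it depends on my data):

import datetime
import numpy as np

def time_features(date_bytes):
    # plain Python: parse the date string and encode the time of day as sine/cosine
    dt = datetime.datetime.strptime(date_bytes.decode('utf-8'), '%Y-%m-%d %H:%M:%S')
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
    angle = 2.0 * np.pi * seconds / 86400.0
    return np.float32(np.sin(angle)), np.float32(np.cos(angle))

# inside decode_line, after extracting datetimeString:
timeSine, timeCosine = tf.py_func(time_features, [datetimeString],
                                  [tf.float32, tf.float32])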

How to optimize filtering of a layer in QGS API

I'm developing a QGIS plugin (under version 2.8.1) for traffic assignment, where I want to show the results of my simulation at each time step. Right now I'm using the Time Manager plugin, but it gets very slow when my layer has hundreds of thousands of attributes. In my case I know exactly which feature IDs I want to show at each time step, so I thought it would be easy to make it faster.
Here is what I tried (sorry for my way of Python programming, I'm quite new to this language): at each time step of my loop I set the ordered list of indexes of the features to show (they are always ordered in my case).
# TEST 1 -----------------------------------
for step in time_steps:
    index_start = my_list_of_indexes_start[step]
    index_end = my_list_of_indexes_end[step]
    expression = 'fid >= ' + str(index_start) + ' AND fid <= ' + str(index_end)
    # Or for optimization tests
    # expression = '"FIELD_TIME"' + "=" + str(step)
    layer_dynamic.setSubsetString(expression)
    self.iface.mapCanvas().refresh()
    time.sleep(0.2)

# TEST 2 ------------------------------------
for step in time_steps:
    index_start = my_list_of_indexes_start[step]
    index_end = my_list_of_indexes_end[step]
    indexes = list(j for j in range(index_start, index_end))
    request = QgsFeatureRequest().setFilterFids(indexes)
    layer_dynamic.getFeatures(request)
    self.iface.mapCanvas().refresh()
    time.sleep(0.2)
Solution 1, with
layer_dynamic.setSubsetString(expression)
works: it refreshes the view with the correct filtered features displayed on the canvas at each time step, but it is even slower than using a SQL expression based on attribute values rather than on the indexes (as shown in the comment in the TEST 1 loop).
Solution 2 with
layer_dynamic.getFeatures(request)
is fast but the display of the layer doesn't change.
Any idea why?
The method
bool QgsVectorLayer.setSubsetString(self, QString subset)
filters the layer (more details in setSubsetString), so only the features that match the filter (provided as a SQL statement or another definition string in the "subset" QString) "will belong to the layer" after it has been filtered. Thus, when you call refresh, only the filtered features are displayed.
On the other hand, the method
QgsFeatureIterator QgsVectorLayer.getFeatures(self, QgsFeatureRequest request=QgsFeatureRequest())
returns an iterator over the features matching your request (more details in getFeatures). It does not filter the layer; with the iterator, you simply iterate over the features matching the request.
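In other words, a request only gives you features to work with in Python; it does not change what the canvas draws. A minimal sketch (using layer_dynamic and indexes from your TEST 2 loop):

from qgis.core import QgsFeatureRequest

request = QgsFeatureRequest().setFilterFids(indexes)
for feature in layer_dynamic.getFeatures(request):
    # the feature is available here for reading or processing,
    # but the canvas still renders the unfiltered layer
    print(feature.id())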
