Spark equivalent to Keras Tokenizer? - python

So far, I pre-process text data using NumPy and built-in functions (such as the Keras tokenizer class, tf.keras.preprocessing.text.Tokenizer: https://keras.io/api/preprocessing/text/).
And this is where I got stuck:
Since I am trying to scale up my model and data set, I am experimenting with Spark and Spark NLP (https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer)... however, I couldn't find a comparable tokenizer yet. The fitted tokenizer must later be available to transform validation/new data.
My output should represent each token as a unique integer value (starting from 1), something like:
[ 10,... , 64, 555]
[ 1,... , 264, 39]
[ 12,..., 1158, 1770]
So far, I have been able to use the Spark NLP tokenizer to obtain tokenized words:
[okay,..., reason, still, not, get, background]
[picture,..., expand, fill, whole, excited]
[not, worry,..., happy, well, depend, on, situation]
Does anyone have a solution that doesn't require copying the data out of the Spark environment?
UPDATE:
I created two CSVs to clarify my current issue. The first file was created through a pre-processing pipeline: 1. cleaned_delim_text
After that, the delimited words should be "translated" into integer values and each sequence should be padded with zeros to the same length: 2. cleaned_tok_text

Please try the combination below:
1. Use Tokenizer to convert the statements into words, and then
2. use Word2Vec to compute distributed vector representations of those words.
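A minimal sketch of that combination with the built-in pyspark.ml transformers, assuming a DataFrame train_df (and later new_df) with a string column named "text"; the column names and vector size are placeholders:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

tokenizer = Tokenizer(inputCol="text", outputCol="words")    # split each statement into words
word2vec = Word2Vec(vectorSize=100, minCount=1,
                    inputCol="words", outputCol="features")  # averages learned word vectors per row
pipeline = Pipeline(stages=[tokenizer, word2vec])

model = pipeline.fit(train_df)        # fit once on the training data
features = model.transform(new_df)    # the fitted pipeline can be reused on validation/new data
Note this produces dense float vectors rather than the integer IDs asked for in the question; mapping tokens to integer indices would still have to be handled separately.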

Related

Is there a way to iterate through the vectors of Gensim's Word2Vec?

I'm trying to perform a simple task which requires iterating over and interacting with specific vectors after loading them into Gensim's Word2Vec.
Basically, given a txt file of the form:
t1 -0.11307 -0.63909 -0.35103 -0.17906 -0.12349
t2 0.54553 0.18002 -0.21666 -0.090257 -0.13754
t3 0.22159 -0.13781 -0.37934 0.39926 -0.25967
where t1 is the name of the vector and what follows are the vectors themselves. I load it in using the function vecs = KeyedVectors.load_word2vec_format(datapath(f), binary=False).
Now, I want to iterate through the vectors I have and make a calculation; take summing up all of the vectors as an example. If this were read in using with open(f), I know I could just use .split(' ') on it, but since this is now a KeyedVectors object, I'm not sure what to do.
I've looked through the word2vec documentation, as well as used dir(KeyedVectors) but I'm still not sure if there is an attribute like KeyedVectors.vectors or something that allows me to perform this task.
Any tips/help/advice would be much appreciated!
There's a list of all words in the KeyedVectors object in its .index_to_key property. So one way to sum all the vectors would be to retrieve each by name in a list comprehension:
np.sum([vecs[key] for key in vecs.index_to_key], axis=0)
But if all you really want to do is sum the vectors, and the keys (word tokens) aren't an important part of your calculation, the set of all the raw word vectors is available in the .vectors property, as a NumPy array with one vector per row. So you could also do:
np.sum(vecs.vectors, axis=0)
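For reference, a small self-contained sketch combining both approaches; the filename vectors.txt is a placeholder for a file in word2vec text format:
import numpy as np
from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)

# sum by looking up each key by name...
total_by_key = np.sum([vecs[key] for key in vecs.index_to_key], axis=0)

# ...or sum the raw array directly (one row per word)
total_raw = np.sum(vecs.vectors, axis=0)

assert np.allclose(total_by_key, total_raw)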

Clustering sentence vectors in a dictionary

I'm working with a kind of unique situation. I have words in Language1 that I've defined in English. I then took each English word, took its word vector from a pretrained GoogleNews w2v model, and averaged the vectors for every definition. The result, shown here with 3-dimensional vectors as an example:
L1_words = {
    'word1': array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=float32),
    'word2': array([ 5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=float32)
}
What I want to do is cluster (using K-means probably, but I'm open to other ideas) the keys of the dict by their numpy-array values.
I've done this before with standard w2v models, but the issue I'm having is that this is a dictionary. Is there another data structure I can convert it to? I'm inclined to write it to a CSV or turn it into a pandas DataFrame and work on it in Pandas or R, but I'm told that floats are a problem when it comes to things requiring binary (as in: they lose information in unpredictable ways). I tried saving my dictionary to HDF5, but dictionaries are not supported.
Thanks in advance!
If I understand your question correctly, you want to cluster words according to their W2V representation, but you are storing them in a dictionary. If that's the case, I don't think it is a unique situation at all. All you have to do is convert the dictionary into a matrix and then perform the clustering on the matrix. If each row of the matrix corresponds to one word in your dictionary, you can map the words back after clustering.
I couldn't test the code below, so it may not be completely functional, but the idea is the following:
from nltk.cluster import KMeansClusterer
import nltk

# build the matrix from the word vectors (list() so the words can be indexed later)
words = list(L1_words.keys())
X = []
for w in words:
    X.append(L1_words[w])

# perform the clustering on the matrix
NUM_CLUSTERS = 3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

# print the cluster each word belongs to
for i in range(len(X)):
    print(words[i], assigned_clusters[i])
You can read more details in this link.
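As an alternative (not what the snippet above uses), a similar sketch with scikit-learn's KMeans, assuming the same L1_words dictionary:
import numpy as np
from sklearn.cluster import KMeans

words = list(L1_words.keys())
X = np.vstack([L1_words[w] for w in words])  # one row per word

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
for word, label in zip(words, kmeans.labels_):
    print(word, label)
Unlike the NLTK clusterer above, sklearn's KMeans uses Euclidean distance; normalizing each row first roughly approximates cosine-distance clustering.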

Is there a tensor equivalent to Python's list.count()?

I'm attempting to do all my input pipeline work in tensorflow. This includes transforming the examples into the types required by the classifier.
I just learned I can't iterate over a string tensor like I would do with a standard python list. My specific question is "is there a tf function for testing the existence of a constant value within a tensor?" Of course there may be a better way to do this (I'm new to tf and python).
# creating a unique list of tokens (python)
a_global = []
a = [...]
for token in a:
    if a_global.count(token) == 0:
        a_global.append(token)
I'm indexing string tokens so I can essentially convert them into integers using the token's position within the list as its new value. That snippet will not work when "a" is a tensor, so I'm trying tf.map_fn() instead, but I don't know how to replicate the IF statement predicate. Can someone point me in the right direction?
tf ver 1.8
If you don't need gradients for this operation (which I guess you don't for preprocessing stuff), the easiest could be to use tf.py_func. It essentially is able to wrap numpy code snippets into TensorFlow ops.
If that doesn't work for you, look at this post to count occurrences. Then you could use tf.cond to replicate the if statement.
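A rough sketch of both suggestions against the TF 1.x API the question mentions; the example tensor contents are made up:
import numpy as np
import tensorflow as tf

a = tf.constant(["okay", "reason", "still", "not", "okay"])

# count how many elements of a 1-D tensor equal a constant value
def tf_count(t, value):
    return tf.reduce_sum(tf.cast(tf.equal(t, value), tf.int32))

count = tf_count(a, "okay")

# replicate the `if` with tf.cond: both branches must be callables
result = tf.cond(tf.equal(count, 0),
                 lambda: tf.constant("absent"),
                 lambda: tf.constant("present"))

# or wrap plain NumPy logic in a graph op (note: np.unique also sorts)
unique_tokens = tf.py_func(np.unique, [a], tf.string)

with tf.Session() as sess:
    print(sess.run([count, result, unique_tokens]))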

Gensim's Doc2vec - inferred vector isn't similar

When I train Doc2vec (using Gensim's Doc2vec in Python) on a corpus of about 10k documents (each a few hundred words long) and then infer document vectors using the same documents, they are not at all similar to the trained document vectors. I would expect them to be at least somewhat similar.
That is, I compare model.docvecs['some_doc_id'] and model.infer_vector(documents['some_doc_id']).
Cosine distances between trained and inferred vectors for the first few documents:
0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432
As you can see, they are not really similar. If the similarity is so terrible even for documents used for training, I can't even begin to try to infer unseen documents.
Training configuration:
model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1, workers=4,
                seed=44, sample=1e-5, iter=15, hs=0, negative=8, dm_mean=1, min_alpha=0.01, min_count=2)
Inferring:
model.infer_vector(tokens, steps=20, alpha=0.025)
A side note: documents are always preprocessed the same way (I checked that the same list of tokens goes into training and into inference).
I also played around with the parameters a bit, and the results were similar. So if your suggestion is something like "try increasing or decreasing this or that training parameter", I've most likely tried it already. Maybe I just didn't come across the 'correct' parameters, though.
Thanks for any suggestions as to what I can do to make it work better.
EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec); it doesn't have to be this one, if you know of another that achieves better results.
EDIT: Minimal working example
import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence

files = {}
folder = 'some path'  # each file contains a few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
    files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()

documents = []
for k, v in files.items():
    words = text_to_word_sequence(v, lower=True)  # converts string to list of words, removes commas etc.
    documents.append(TaggedDocument(tags=[k], words=words))

d2 = Doc2Vec(size=200, documents=documents)

for doc in documents:
    trained = d2.docvecs[doc.tags[0]]
    inferred = d2.infer_vector(doc.words, steps=50)
    print(cosine(trained, inferred))  # cosine distance from scipy
What is the type of your documents object, and are you sure that it is a multiply-iterable object, so that the model can do all of its 16 passes over the set of TaggedDocument-shaped text examples? That is, does iter(documents) always return a fresh iterator, with all items as TaggedDocument-shaped objects with the right list-of-words in words and list-of-tags in tags? (A common error is to supply a corpus that can be iterated over only once, and then to ignore any logged hints/warnings that no real training has happened. The inference/similarity results from such a model will be essentially random.)
Then for infer_vector(), does documents[tag] really return just the list-of-words it expects (not TaggedDocument or string)? (Users often supply strings, rather than lists-of-tokens, for training or inference words and get results that are just noise.)
Was there an evaluation-guided reason for changing various defaults, either a little (window=6, negative=8) or a lot (alpha=0.1, min_count=2)? Such tweaks may not be a major factor in your problem, and there's nothing magical about the class defaults. But until you have the basics working, it's best to stick close to the common configuration. (And then, even after the basics are working, limit changes to those that can be demonstrated as better via a repeatable scoring process.)
Some report needing much higher steps values – 100 or more – to get better inference results, though that would be most crucial for very small documents (of a handful to a couple dozen words) rather than the few-hundred-word documents you describe.
A corpus of 10k documents is on the small side for Paragraph Vectors (Doc2Vec), but with your smallish vector-size (100) and larger number of iterations (15), it might be workable.
If you're still having problems, you should expand your question with more code showing how documents works, some suggestive example documents, and your cosine-similarity evaluation process – to see if there are any oversights at each of those steps.
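To make the first point concrete, here is a minimal sketch of a restartable corpus class; the folder path and '*.sent' pattern follow the question's example, and the whitespace tokenization is only a stand-in for the question's own preprocessing:
import fnmatch
import os
from gensim.models.doc2vec import TaggedDocument

class SentCorpus:
    """Re-reads the files on every iteration, so Doc2Vec can make all of its
    training passes; a plain generator would be exhausted after one pass."""
    def __init__(self, folder):
        self.folder = folder

    def __iter__(self):
        for fname in fnmatch.filter(os.listdir(self.folder), '*.sent'):
            with open(os.path.join(self.folder, fname), encoding="UTF-8") as fh:
                words = fh.read().lower().split()  # stand-in tokenizer
                yield TaggedDocument(words=words, tags=[fname])

# d2 = Doc2Vec(documents=SentCorpus('some path'), size=200)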

Create LabeledPoint from rdd data which has both strings and numbers - PySpark

I have lines like this in my data:
0,tcp,http,SF,181,5450,0,0,0.5,normal.
I want to use the decision tree algorithm for training. I couldn't create LabeledPoints, so I wanted to try HashingTF for the strings, but I couldn't get it to work. "normal" is my target label. How can I create a LabeledPoint RDD to use in PySpark? Also, the label for LabeledPoint requires a double; should I just create some double values for labels, or should they be hashed?
I came up with a solution.
First of all, Spark's decision tree classifier already has a parameter for this: categoricalFeaturesInfo. From the PySpark API documentation:
categoricalFeaturesInfo - Map from categorical feature index to number of categories. Any feature not in this map is treated as continuous.
However, before doing this, we should first replace the strings with numbers so that PySpark can understand them.
Then, for the example data above, we create categoricalFeaturesInfo as in the definition:
categoricalFeaturesInfo = {1:len(feature1), 2:len(feature2), 3:len(feature3), 9:len(labels)}
Put simply, the keys are the indexes of the categorical features and the values are the numbers of categories in those features.
Note that converting the strings to numbers is enough for the training algorithm, but if you declare the categorical features like this, it will train faster.
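A rough end-to-end sketch for the example line above, assuming raw_rdd is an RDD of such CSV strings; the string-to-number maps and class names here are illustrative, not exhaustive:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

# hypothetical index maps; build the real ones from your data
protocol_idx = {'tcp': 0.0, 'udp': 1.0, 'icmp': 2.0}
service_idx = {'http': 0.0, 'smtp': 1.0, 'ftp': 2.0}
flag_idx = {'SF': 0.0, 'REJ': 1.0}
label_idx = {'normal.': 0.0, 'anomaly.': 1.0}

def parse_line(line):
    p = line.split(',')
    features = ([float(p[0]),
                 protocol_idx[p[1]], service_idx[p[2]], flag_idx[p[3]]]
                + [float(x) for x in p[4:-1]])
    return LabeledPoint(label_idx[p[-1]], features)  # label must be a double

points = raw_rdd.map(parse_line)

model = DecisionTree.trainClassifier(
    points,
    numClasses=len(label_idx),
    categoricalFeaturesInfo={1: len(protocol_idx), 2: len(service_idx), 3: len(flag_idx)},
    impurity='gini', maxDepth=5)
Here the label is kept out of the feature vector, so categoricalFeaturesInfo only covers feature indexes 1, 2 and 3.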
