I'm using scikit-learn and the SGD classifier to train an SVM in mini-batches. Here's a little code snippet:
for row in reader:
    if row[0] in model.docvecs:
        TRAINING_X.append(model.docvecs[row[0]])
        TRAINING_Y.append(row[2])
    if count % 10000 == 0:
        np_x = np.asarray(TRAINING_X)
        np_y = np.asarray(TRAINING_Y)
        clf.partial_fit(np_x,np_y, np.unique(np.asarray))
        TRAINING_X = []
        TRAINING_Y = []
    count += 1
I'm using the partial_fit function to train on the data in batches (every 10000 rows in the snippet above) and using np.unique() to generate class labels as per the documentation.
However, when I run this, I get the following error:
    raise ValueError("The number of class labels must be "
ValueError: The number of class labels must be greater than one.
I'm a little confused. Am I generating class labels incorrectly?
The documentation for partial_fit says of the classes argument: "Classes across all calls to partial_fit. Can be obtained by via np.unique(y_all), where y_all is the target vector of the entire dataset."
You are passing np.unique(np.asarray), which applies np.unique to the function np.asarray itself rather than to your labels, so the value of the classes argument is incorrect.
Going by the error thrown by the program, only one unique class is reaching the model. Pass np.unique(np_y) instead of np.unique(np.asarray), check how many unique classes you are feeding into the model, and ensure that it is more than one.
Per the documentation quoted above, the labels passed as classes should cover every class that appears across all calls to partial_fit, not only those in the current batch.
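As a minimal sketch of the point about the classes argument (the random data, the batch size of 50, and the variable names are made up for illustration, not taken from the question), the full set of labels can be computed once and passed to every partial_fit call, so that a batch containing a single class does not cause trouble:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the document vectors and labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.repeat([0, 1, 2], 100)  # sorted labels: the first batches hold only class 0

# Compute the labels once, over the whole target vector.
all_classes = np.unique(y)

clf = SGDClassifier(loss="hinge", random_state=0)
for start in range(0, len(X), 50):
    stop = start + 50
    # Every call receives the full label set, even if this batch
    # happens to contain a single class.
    clf.partial_fit(X[start:stop], y[start:stop], classes=all_classes)
```

If the full label set is not known up front, it would have to be collected in a first pass over the data before training starts.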
How can I get the words of each cluster?
I divided the documents into groups:
LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
all_content_train = []
j=0
for em in train['KARMA'].values:
    all_content_train.append(LabeledSentence1(em,[j]))
    j+=1
print('Number of texts processed: ', j)
d2v_model = Doc2Vec(all_content_train, vector_size = 100, window = 10, min_count = 500, workers=7, dm = 1,alpha=0.025, min_alpha=0.001)
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count, epochs=10, start_alpha=0.002, end_alpha=-0.016)
kmeans_model = KMeans(n_clusters=10, init='k-means++', max_iter=100)
X = kmeans_model.fit(d2v_model.docvecs.doctag_syn0)
labels=kmeans_model.labels_.tolist()
l = kmeans_model.fit_predict(d2v_model.docvecs.doctag_syn0)
pca = PCA(n_components=2).fit(d2v_model.docvecs.doctag_syn0)
datapoint = pca.transform(d2v_model.docvecs.doctag_syn0)
I can get each text and its cluster, but how can I find out which words mainly created those groups?
It's not an inherent feature of Doc2Vec to list words most-related to any document or doc-vector. (Other algorithms, such as LDA, will offer that.)
So, you could potentially write your own code, once you've split your documents into clusters, to report the words that are "most over-represented" in each cluster.
For example, calculate every word's frequency in the entire corpus, then each word's frequency in each cluster. For each cluster, report the N words whose in-cluster-frequency is the largest multiple of the full-corpus-frequency. Would this give helpful results on your data, for your needs? You'd have to try it.
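The frequency-ratio idea above can be sketched with the standard library alone. The function name, the toy documents, and the cluster labels below are hypothetical; in practice the labels would come from the clustering step and the documents would be the tokenized texts:

```python
from collections import Counter

def overrepresented_words(docs, labels, top_n=3):
    """Return, per cluster, the words whose in-cluster frequency is the
    largest multiple of their full-corpus frequency.

    docs:   list of token lists
    labels: one cluster id per document
    """
    corpus_counts = Counter(word for doc in docs for word in doc)
    corpus_total = sum(corpus_counts.values())

    cluster_counts = {}
    for doc, label in zip(docs, labels):
        cluster_counts.setdefault(label, Counter()).update(doc)

    result = {}
    for label, counts in cluster_counts.items():
        cluster_total = sum(counts.values())
        ratios = {
            word: (count / cluster_total) / (corpus_counts[word] / corpus_total)
            for word, count in counts.items()
        }
        # Highest ratio first; ties are broken alphabetically.
        result[label] = sorted(ratios, key=lambda w: (-ratios[w], w))[:top_n]
    return result

# Toy data (assumption): two clusters of two short documents each.
docs = [["cat", "purrs", "the"], ["cat", "sleeps", "the"],
        ["stock", "falls", "the"], ["stock", "rises", "the"]]
labels = [0, 0, 1, 1]
top_words = overrepresented_words(docs, labels, top_n=3)
```

On real data, very rare words tend to dominate such ratios, so a minimum-count threshold is probably worth adding before ranking.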
Separately, regarding your use of Doc2Vec:
There's no good reason to alias the existing class TaggedDocument to a strange class name like LabeledSentence1. Just use TaggedDocument directly.
If you supply your corpus, all_content_train, to the object initialization (as your code does), then you don't need to also call train(). Training will have already happened automatically. If you do want more than the default amount of training (epochs=5), just supply a larger epochs value to the initialization.
The learning-rate values you've supplied to train() (start_alpha=0.002, end_alpha=-0.016) are nonsensical & destructive. Few users should need to tinker with these alpha values at all, but especially, the learning rate should never end at a negative value, or grow in magnitude from the beginning to the end of a training cycle, as these values do.
If you were running with logging enabled at the INFO level, and/or watching the output closely, you would likely see readouts and warnings indicating that excessive training was happening, or problematic values used.
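For reference, INFO-level logging can be switched on with the standard library before building the model. This is a minimal sketch; the format string is just one reasonable choice:

```python
import logging

# Configure the root logger; this has no effect if handlers were already set up.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# gensim's modules log under names that start with "gensim",
# so setting the level here covers e.g. "gensim.models.doc2vec".
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)
```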
Hi, I've hit another roadblock in the TensorFlow crash course, in the representation programming exercises at this page:
https://developers.google.com/…/repres…/programming-exercise
I'm at Task 2: Make Better Use of Latitude.
It seems I've narrowed the issue down to the step where I convert the raw latitude data into "buckets" or ranges, which are represented as 1 or 0 in my feature. The actual code and issue are in the pastebin. Any advice would be great! Thanks!
https://pastebin.com/xvV2A9Ac
This converts the raw latitude data in my pandas DataFrame into "buckets" or ranges, as Google calls them.
LATITUDE_RANGES = zip(xrange(32, 44), xrange(33, 45))
I changed the above code, replacing xrange with plain range, since xrange no longer exists in Python 3.
Could this be the problem, using range instead of xrange? See below for my conundrum.
def select_and_transform_features(source_df):
    selected_examples = pd.DataFrame()
    selected_examples["median_income"] = source_df["median_income"]
    for r in LATITUDE_RANGES:
        selected_examples["latitude_%d_to_%d" % r] = source_df["latitude"].apply(
            lambda l: 1.0 if l >= r[0] and l < r[1] else 0.0)
    return selected_examples
The next two lines run the above function to convert my existing training and validation data sets into ranges or buckets for latitude:
selected_training_examples = select_and_transform_features(training_examples)
selected_validation_examples = select_and_transform_features(validation_examples)
This is the call that trains the model:
_ = train_model(
    learning_rate=0.01,
    steps=500,
    batch_size=5,
    training_examples=selected_training_examples,
    training_targets=training_targets,
    validation_examples=selected_validation_examples,
    validation_targets=validation_targets)
THE PROBLEM:
OK, so here is how I understand the problem. When I run the training model, it throws this error:
ValueError: Feature latitude_32_to_33 is not in features dictionary.
So I inspected selected_training_examples and selected_validation_examples.
Here's what I found. If I run
selected_training_examples = select_and_transform_features(training_examples)
then selected_training_examples contains the proper data set, with all the feature "buckets", including latitude_32_to_33.
But when I run the next call
selected_validation_examples = select_and_transform_features(validation_examples)
it yields no buckets or ranges, resulting in the
`ValueError: Feature latitude_32_to_33 is not in features dictionary.`
So I next tried disabling the first call
selected_training_examples = select_and_transform_features(training_examples)
and ran only the second call
selected_validation_examples = select_and_transform_features(validation_examples)
If I do this, I get the desired dataset for selected_validation_examples.
The problem now is that running the first call afterwards no longer gives me the "buckets", and I'm back where I began. I guess my question is: how are the two calls affecting each other, and why do they prevent each other from giving me the datasets I need when I run them together?
Thanks in advance!
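A Python developer gave me the solution, so I just wanted to share. In Python 3, zip returns an iterator, so LATITUDE_RANGES = zip(range(32, 44), range(33, 45)) can only be looped over once the way it was written: the first call consumes it and the second call sees nothing. I placed the assignment inside the succeeding def select_and_transform_features(source_df) function, so it is recreated on every call, which solved the issue. Thanks again everyone.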
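The one-shot behaviour can be reproduced without pandas or TensorFlow. The sketch below also shows an alternative fix that is not the one the poster used: wrapping the zip in list() so the ranges can be reused from module level.

```python
# In Python 3, zip() returns an iterator that is exhausted after one pass.
ranges_iter = zip(range(32, 44), range(33, 45))
first_pass = [r for r in ranges_iter]    # all twelve (low, high) pairs
second_pass = [r for r in ranges_iter]   # nothing left to iterate over

# Materialising the pairs in a list allows repeated iteration.
LATITUDE_RANGES = list(zip(range(32, 44), range(33, 45)))
columns_a = ["latitude_%d_to_%d" % r for r in LATITUDE_RANGES]
columns_b = ["latitude_%d_to_%d" % r for r in LATITUDE_RANGES]
```

This matches the symptom in the question: whichever call to select_and_transform_features runs first gets the bucket columns, and the next one gets none.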
I'm playing around a bit with tensorflow, but am a bit confused about the input pipeline. The data I'm working on is in a large csv file, with 307 columns, of which the first is a string representing a date, and the rest are floats.
I'm running into some problems with preprocessing my data. I want to add a couple of features instead of, but based on, the date string. (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 ones after that as one feature, and base my label off of the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf
defaults = []
defaults.append([""])
for i in range(0,306):
    defaults.append([1.0])
def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"
    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]
        ## Do something to convert datetimeString to timeSine and timeCosine
        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }
        label = [1]  # placeholder. I seem to need some python logic here, but I'm
                     # not sure how to apply that to data in tensor format.
        return features_dict, label
    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)
    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)
    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))
    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))
    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))
    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: Pretty much the same for the label based on the remaining values of the CSV. Can I just use standard python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.
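For illustration only, the sine/cosine time features described in the question can be computed in plain Python as sketched below. The timestamp format and the function name are assumptions, since the question does not show what the date column looks like:

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60

def time_features(datetime_string, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (time_sine, time_cosine, day_of_week) for a timestamp string.

    The default format string is an assumption about the CSV layout.
    day_of_week follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    moment = datetime.strptime(datetime_string, fmt)
    seconds = moment.hour * 3600 + moment.minute * 60 + moment.second
    # Map the time of day onto a circle so 23:59 and 00:00 end up close together.
    angle = 2.0 * math.pi * seconds / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle), moment.weekday()

time_sine, time_cosine, day_of_week = time_features("2018-03-05 06:00:00")
```

One option is to apply a function like this while preparing the CSV, so the pipeline only ever sees floats. In TensorFlow 1.x, Python logic can also be wrapped with tf.py_func inside the map step, though whether that suits this pipeline is not something the question settles.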
I am preparing a Doc2Vec model using tweets. Each tweet's word array is considered as a separate document and is labeled as "SENT_1", "SENT_2" etc.
taggeddocs = []
for index,i in enumerate(cleaned_tweets):
    if len(i) > 2: # Non empty tweets
        sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[u'SENT_{:d}'.format(index)])
        taggeddocs.append(sentence)
# build the model
model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
    if epoch % 20 == 0:
        print('Now training epoch %s' % epoch)
    model.train(taggeddocs)
    model.alpha -= 0.002 # decrease the learning rate
    model.min_alpha = model.alpha # fix the learning rate, no decay
I wish to find tweets similar to a given tweet, say "SENT_2". How?
I get labels for similar tweets as:
sims = model.docvecs.most_similar('SENT_2')
for label, score in sims:
    print(label)
It prints as:
SENT_4372
SENT_1143
SENT_4024
SENT_4759
SENT_3497
SENT_5749
SENT_3189
SENT_1581
SENT_5127
SENT_3798
But given a label, how do I get the original tweet words/sentence? E.g. what are the tweet words of, say, "SENT_3497"? Can I query the Doc2Vec model for this?
Gensim's Word2Vec/Doc2Vec models don't store the corpus data – they only examine it, in multiple passes, to train up the model. If you need to retrieve the original texts, you should populate your own lookup-by-key data structure, such as a Python dict (if all your examples fit in memory).
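A minimal sketch of such a lookup structure, built in the same loop that creates the tagged documents. The sample tweets and the dictionary name are made up; the tag format and the length filter follow the question's code:

```python
# Stand-in for the question's cleaned_tweets list (assumption).
cleaned_tweets = ["good morning everyone", "ok", "rain again in the city today"]

tweet_words_by_tag = {}
for index, tweet in enumerate(cleaned_tweets):
    if len(tweet) > 2:  # same filter as in the question
        tag = "SENT_{:d}".format(index)
        # Store the words under the same tag that is given to TaggedDocument.
        tweet_words_by_tag[tag] = tweet.split()

# Given a tag returned by most_similar(), look the words up directly.
words = tweet_words_by_tag["SENT_2"]
```

For a corpus too large for memory, the same idea would need an on-disk key-value store instead of a dict.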
Separately, in recent versions of gensim, your code will actually be doing 1,005 training passes over your taggeddocs, including many with a nonsensically/destructively negative alpha value.
By passing it into the constructor, you're telling the model to train itself, using your parameters and defaults, which include a default number of iter=5 passes.
You then do 200 more loops. Each call to train() will do the default 5 passes. And by decrementing alpha from 0.025 by 0.002 199 times, the last loop will use an effective alpha of 0.025-(199*0.002)=-0.373, a negative value essentially telling the model to make a large correction in the opposite direction of improvement for each training example.
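The pass count and the alpha values can be checked with a few lines of arithmetic. This sketch only mirrors the numbers in the question's loop (it does not call gensim) and tracks alpha in thousandths to avoid floating-point rounding:

```python
INITIAL_PASSES = 5          # passes done by the constructor (default of 5)
LOOPS = 200                 # iterations of the question's for-loop
PASSES_PER_TRAIN_CALL = 5   # each train() call repeats the default passes

total_passes = INITIAL_PASSES + LOOPS * PASSES_PER_TRAIN_CALL

# Alpha used in each loop, in thousandths: starts at 25, drops by 2 per loop.
alpha_milli = [25 - 2 * loop for loop in range(LOOPS)]

# Index of the first loop that trains with a negative learning rate.
first_negative_loop = next(i for i, a in enumerate(alpha_milli) if a < 0)

# Alpha used by the final loop, after 199 decrements.
last_alpha = alpha_milli[-1] / 1000
```

So the learning rate is already negative from the loop with index 13 onwards, which means most of the 200 loops work against the model.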
Just use the iter parameter to choose the desired number of passes. Let the class manage the alpha changes itself. If supplying the corpus when instantiating the model, no further steps are necessary. But if you don't supply the corpus at instantiation, you'll need to do model.build_vocab(tagged_docs) once, then model.train(tagged_docs) once.
I am playing with Tensorflow sequence to sequence translation model. I was wondering if I could import my own word2vec into this model? Rather than using its original 'dense representation' mentioned in the tutorial.
From my point of view, it looks like TensorFlow is using a one-hot representation for the seq2seq model. Firstly, for the function tf.nn.seq2seq.embedding_attention_seq2seq, the encoder's input is a tokenized symbol, e.g. 'a' would be '4' and 'dog' would be '15715', and the function requires num_encoder_symbols. So I think it makes me provide the position of the word and the total number of words, so that the function can represent the word in one-hot form. I am still studying the source code, but it is hard to understand.
Could anyone give me an idea on the above problem?
The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:
EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
Knowing this, you can just modify this variable. I mean -- get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab you can then assign the read vectors in a way illustrated by the snippet below (it's just a snippet, you'll have to change it to make it work, but I hope it shows the idea).
import sys
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile

vectors_variable = [v for v in tf.trainable_variables()
                    if EMBEDDING_KEY in v.name]
if len(vectors_variable) != 1:
    print("Word vector variable not found or too many.")
    sys.exit(1)
vectors_variable = vectors_variable[0]
vectors = vectors_variable.eval()
print("Setting word vectors from %s" % FLAGS.word_vector_file)
with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
    # Lines have format: dog 0.045123 -0.61323 0.413667 ...
    for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
            # Remaining parts are components of the vector.
            # A list comprehension is used because map() is lazy on Python 3.
            word_vector = np.array([float(x) for x in line_parts[1:]])
            if len(word_vector) != vec_size:
                print("Warn: Word '%s', Expecting vector size %d, found %d"
                      % (word, vec_size, len(word_vector)))
            else:
                vectors[model.vocab[word]] = word_vector
# Assign the modified vectors to the vectors_variable in the graph.
session.run([vectors_variable.initializer],
            {vectors_variable.initializer.inputs[1]: vectors})
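The row-assignment step of the snippet can be tried in isolation with NumPy alone. In this sketch the function name, the toy vocabulary, and the sample lines are hypothetical, and vocab is assumed to map each word to its row index in the embedding matrix, which is how the snippet uses model.vocab[word]:

```python
import numpy as np

def load_vectors_into(matrix, vocab, lines):
    """Overwrite rows of `matrix` with pretrained vectors.

    matrix: 2-D array of shape (vocabulary size, vector size)
    vocab:  dict mapping word -> row index (assumption)
    lines:  iterable of text lines like "dog 0.045 -0.613 0.414"
    Returns the number of rows that were replaced.
    """
    vec_size = matrix.shape[1]
    replaced = 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        word, values = parts[0], parts[1:]
        # Skip words outside the vocabulary and vectors of the wrong size.
        if word in vocab and len(values) == vec_size:
            matrix[vocab[word]] = np.array([float(v) for v in values])
            replaced += 1
    return replaced

vocab = {"_PAD": 0, "dog": 1, "cat": 2}
embeddings = np.zeros((3, 3))
lines = ["dog 0.5 -0.25 1.0", "bird 0.1 0.2 0.3", "cat 1.0 2.0"]
replaced = load_vectors_into(embeddings, vocab, lines)
```

Rows for words that have no pretrained vector keep whatever values the matrix held before, which in the TensorFlow case would be the model's own initial embeddings.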
I guess with the scope style, which Matthew mentioned, you can get the variable:
with tf.variable_scope("embedding_attention_seq2seq"):
    with tf.variable_scope("RNN"):
        with tf.variable_scope("EmbeddingWrapper", reuse=True):
            embedding = vs.get_variable("embedding", [shape], [trainable=])
Also, I would imagine you would want to inject embeddings into the decoder as well; the key (or scope) for it would be something like:
"embedding_attention_seq2seq/embedding_attention_decoder/embedding"
Thanks for your answer, Lukasz!
I was wondering, what exactly does model.vocab[word] stand for in the code snippet? Just the position of the word in the vocabulary?
In that case, wouldn't it be faster to iterate through the vocabulary and inject w2v vectors for the words that exist in the w2v model?