How to use Word2Vec for a classification problem in TensorFlow (Python)

I'm trying to predict the Big Five personality traits (Extraversion, Neuroticism, Agreeableness, Conscientiousness, Openness) from analysis of text written by a user.
Here is my preprocessed data set (attached as an image in the original post), and here is my Word2Vec model:
from gensim.models import Word2Vec

# skip-gram model with 300-dimensional vectors (size= was renamed vector_size= in gensim 4+)
model_wv = Word2Vec(df_processed['TEXT'], sg=1, size=300, window=10, min_count=1)
The vocabulary consists of 26,126 words.
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

features = ['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']
X = df_processed['TEXT']
y = df_processed[features]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

model = Sequential()
model.add(Embedding(max_features, 100))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(5, activation='softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test, y_test))
My question is: how can I use the Word2Vec vectors in the TensorFlow model?
Should I replace every word in every row of X with its vector from the Word2Vec model? I think that would be computationally expensive, but what is the alternative?
Sorry if my question sounds dumb; I'm just really new to NLP and TensorFlow.
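A common alternative to substituting vectors row by row (a sketch, assuming X_train holds the preprocessed texts; maxlen=200 and the variable names here are illustrative) is to feed the network integer word indices and initialize the Embedding layer with the trained Word2Vec vectors:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

# Turn each text into a padded sequence of integer word indices
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=200)

# Embedding matrix: row i holds the Word2Vec vector of the word with index i
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 300  # must match size= in the Word2Vec call above
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if word in model_wv.wv:
        embedding_matrix[i] = model_wv.wv[word]

# Initialize the Embedding layer with the pretrained vectors
embedding_layer = Embedding(vocab_size, embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)

With trainable=False the pretrained vectors stay fixed; set it to True to fine-tune them during training. Note also that if the five traits are independent binary labels, a sigmoid output layer is the usual partner for binary_crossentropy; softmax forces the five scores to sum to one.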

Related

What is the best approach to build a model with multiple targets on the Y set?

I need to create a model that predicts multiple labels from sixteen (16) input features. The dataset has 4486 instances, and each instance has a varying number of labels drawn from 48 distinct labels.
This is how the data looks (the X and Y data examples were attached as images).
The challenge is to predict the labels of a new instance. I suspect the imbalance in the label distribution is what makes learning difficult.
I would appreciate comments and advice on how to tackle this issue.
My best result is 30% accuracy, but I've noticed the model sometimes predicts the same labels and hasn't given any satisfactory results so far.
This is the model I've implemented:
import math
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
n_inputs, n_outputs = X_train.shape[1], y_train.shape[1]
nodes = int(math.sqrt(n_inputs * n_outputs))  # Dense expects an integer unit count

model = Sequential()
model.add(Dense(nodes, activation='relu'))
model.add(Dense(n_outputs, activation='sigmoid'))  # one sigmoid per label (multi-label)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[tf.keras.metrics.BinaryAccuracy(), tf.keras.metrics.AUC(), 'accuracy'])
history = model.fit(X_train, y_train, epochs=300, verbose=1, shuffle=True,
                    validation_data=(X_test, y_test), batch_size=8)
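One way to counter the label imbalance (a sketch, not from the original post; pos_weight and weighted_bce are illustrative names) is a per-label weighted binary cross-entropy that up-weights the rare positive class of each label:

import numpy as np
import tensorflow as tf

# pos_weight[j] = (#negatives of label j) / (#positives of label j)
y_arr = np.asarray(y_train, dtype='float32')
pos = y_arr.sum(axis=0)
pos_weight = tf.constant((len(y_arr) - pos) / np.maximum(pos, 1.0), dtype=tf.float32)

def weighted_bce(y_true, y_pred):
    # standard binary cross-entropy, but positives of each label are scaled by pos_weight
    eps = tf.keras.backend.epsilon()
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    loss = -(pos_weight * y_true * tf.math.log(y_pred)
             + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(loss)

model.compile(optimizer='adam', loss=weighted_bce,
              metrics=[tf.keras.metrics.BinaryAccuracy(), tf.keras.metrics.AUC()])

Exact-match accuracy is also a harsh metric for a 48-label problem; per-label AUC or F1 usually gives a clearer picture of progress.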

TensorFlow (Keras) model always gives me the same value

I run model.predict(x), where x is the same NumPy array I used to train the model (x obviously excludes the validation values). Running this, I just get the same value for all 1733 rows of the array. If you need code or an example of the arrays used, ask me.
The model is:
dataset = pd.read_csv('BNB.csv')
x = dataset.drop(columns=["Valuable"])
x = np.asarray(x).astype('float32')
y = dataset["Valuable"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(256, input_shape=x_train.shape, activation='sigmoid'))
model.add(tf.keras.layers.Dense(256, activation='sigmoid'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1000)
The NumPy array (from the CSV file) I used to train and test looks like this:
Valuable,Open,High,Low,Close,EMA8,EMA14,EMA50,ht,sar,MorningStar,Engulfing
-1,355.48,355.82,355.21,355.76,355.21,355.51,357.96,356.63,351.08,0,0
0,355.77,356.2,355.52,355.79,355.34,355.54,357.87,356.51,351.08,0,0
0,355.82,356.61,355.5,356.23,355.54,355.63,357.81,356.44,351.08,0,0
0,356.14,356.17,354.63,354.92,355.4,355.54,357.69,356.46,351.08,0,0
0,354.88,355.54,354.81,354.96,355.3,355.46,357.59,356.55,351.08,0,0
0,354.91,354.91,353.71,354.11,355.04,355.28,357.45,356.59,351.08,0,0
0,354.12,354.93,353.89,354.72,354.97,355.21,357.34,356.44,351.08,0,0
0,354.72,355.2,354.01,354.7,354.91,355.14,357.24,356.21,351.08,0,0
0,354.69,355.46,354.43,355.23,354.98,355.15,357.16,355.9,351.08,0,100
0,355.27,355.47,354.54,355.39,355.07,355.18,357.09,355.57,351.08,0,0
0,355.37,356.0,355.22,355.81,355.24,355.27,357.04,355.31,351.08,0,0
0,355.79,356.23,355.11,355.54,355.3,355.3,356.98,355.15,351.08,0,0
0,355.56,355.67,354.78,355.21,355.28,355.29,356.91,355.08,351.08,0,0
0,355.2,355.63,354.88,355.2,355.26,355.28,356.84,355.06,351.08,0,0
0,355.2,355.99,355.2,355.76,355.37,355.34,356.8,355.08,351.08,0,0
0,355.74,355.97,355.17,355.37,355.37,355.35,356.75,355.14,351.08,0,0
0,355.37,355.38,354.51,354.69,355.22,355.26,356.67,355.19,351.08,0,0
0,354.78,355.4,354.64,355.02,355.18,355.23,356.6,355.23,351.08,0,0
I want to predict whether Valuable is -2, -1, 0, 1, or 2 (my CSV file is about 1700 lines long).
There are a few problems with your model.
First: you should use sparse categorical cross-entropy loss instead of binary cross-entropy if you have more than two classes in the output.
Second: use a softmax activation for the last/output layer.
Third: use as many neurons in the last layer as there are classes.
I assume the distinct values in the Valuable column are [-2, -1, 0, 1, 2].
First, encode your target column like this:
y = dataset["Valuable"] # after this
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
Then change your model definition like this:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = tf.keras.models.Sequential()
# changes: relu hidden layers, 5-way softmax output; input_shape is the
# per-sample feature shape, not the shape of the whole array
model.add(tf.keras.layers.Dense(256, input_shape=(x_train.shape[1],), activation="relu"))
model.add(tf.keras.layers.Dense(256, activation="relu"))
model.add(tf.keras.layers.Dense(5, activation="softmax"))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1000)
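To read predictions back in the original label space (a short sketch; it assumes the le encoder from above is still in scope), take the argmax over the five class probabilities and invert the encoding:

import numpy as np

probs = model.predict(x_test)                                 # shape (n_samples, 5)
pred_labels = le.inverse_transform(np.argmax(probs, axis=1))  # back to -2..2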

RNN model predicting only one class?

I am trying to use GloVe embeddings to train an RNN model based on this article.
I have labeled data: text (tweets) in one column and labels in another (hate, offensive, or neither).
However, the model seems to predict only one class.
This is the RNN (GRU-based) model:
model = Sequential()
hidden_layer = 3
gru_node = 32

# model embedding matrix here....
for i in range(0, hidden_layer):
    model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(Dropout(dropout))
model.add(GRU(gru_node, recurrent_dropout=0.2))
model.add(Dropout(dropout))
model.add(Dense(64, activation='softmax'))
model.add(Dense(nclasses, activation='softmax'))

start = time.time()
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Fitting the model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
X_train_Glove,X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train, X_test)
model_RNN = Build_Model_RNN_Text(word_index,embeddings_index, 20)
model_RNN.fit(X_train_Glove,y_train,
validation_data=(X_test_Glove, y_test),
epochs=4,
batch_size=128,
verbose=2)
y_preds = model_RNN.predict_classes(X_test_Glove)
print(metrics.classification_report(y_test, y_preds))
Results: the classification report and confusion matrix were attached as images.
Am I missing something here?
Update: the label distribution and (more or less) the model summary were attached as images.
What does the distribution of your data look like? The first suggestion is to stratify the train/test split (see the train_test_split documentation for the stratify parameter).
The second question is how much data you have relative to the complexity of the model. Maybe your model is so complex that it simply overfits. You can call model.summary() to see the number of trainable parameters.
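The stratified split is a one-argument change (a sketch of the suggestion above; it assumes y holds the class labels):

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)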

Predict future values using Keras

I am pretty new to creating neural networks, and I am trying to use Keras in Python to create a model that will predict a sequence based on the first ~100 inputs.
I am using 11x10000 arrays to predict a single output value, and I have created an NN that fits that regression pretty well. However, I want to create another NN that will predict the inputs for the regression NN.
I have a very large dataset at my disposal for training, and I am trying to predict an 11x10000 array using the first few inputs of an array.
I can get it to predict the expected value at the current time step, but I want it to predict far-future values based on current and near-future values.
Here is some code:
names = getNames()
test = getNames()
for index in range(len(names)):
    print(index)
    X, Y = getData(names, index, False)
    X_test, Y_test = getData(test, index, True)
    NN = makeNN(X, Y, X_test, X_test)

def getModel():
    keras.backend.clear_session()
    model = Sequential()
    model.add(Dense(100, input_dim=11, kernel_initializer='normal', activation='relu'))
    model.add(Dense(10, init='uniform', activation='relu'))
    model.add(Dense(11, init='uniform', activation='relu'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

def makeNN(X, Y, X_test, Y_test):
    try:
        model = keras.models.load_model('./trajSweepNNVacc.h5')
        print("previous model loaded")
    except:
        model = getModel()
        print("new model created")
    X_train = X[0:]
    print('Training -----------')
    model.fit(X_train, X_train, epochs=2)
    print('\nTesting ------------')
    cost = model.evaluate(X_test, Y_test)
    print('test cost:', cost)
    W, b = model.layers[0].get_weights()
    model.save('trajSweepNNVacc.h5')
    return model
I want to get it to predict the 10000 future values, but currently I can only get it to predict the current value.
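One way to turn a one-step model into a far-future predictor (a sketch, not from the original post; rollout and first_step are illustrative names, and it assumes the model was trained to map step t to step t+1) is to feed each prediction back in as the next input:

import numpy as np

def rollout(model, first_step, n_steps):
    # autoregressive rollout: each predicted 11-dim step becomes the next input
    steps = [np.asarray(first_step, dtype='float32')]
    for _ in range(n_steps - 1):
        nxt = model.predict(steps[-1][None, :], verbose=0)[0]
        steps.append(nxt)
    return np.stack(steps)  # shape (n_steps, 11)

# e.g. predicted = rollout(NN, X_test[0], 10000)

Errors compound over long rollouts, so the quality at step 10000 depends heavily on one-step accuracy.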

Don't know what's going wrong while training the dataset (Keras model.fit)

I have just started using Keras and was trying to train a model with the Keras deep learning kit. It works while the epochs are running but crashes just after.
np.random.seed(1778)  # for reproducibility
need_normalise = True
need_validataion = True
nb_epoch = 2  # 8

# Creating model
model = Sequential()
model.add(Dense(512, input_shape=(dims,)))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
opt = Adadelta(lr=1, decay=0.995, epsilon=1e-5)
model.compile(loss='binary_crossentropy', optimizer=opt)

auc_scores = []
best_score = -1
best_model = None
print('Training model...')
if need_validataion:
    for i in range(nb_epoch):
        # early_stopping = EarlyStopping(monitor='val_loss', patience=0, verbose=1)
        # model.fit(X_train, y_train, nb_epoch=nb_epoch, batch_size=256, validation_split=0.01, callbacks=[early_stopping])
        model.fit(X_train, y_train, nb_epoch=2, batch_size=256, validation_split=0.15)
        y_pre = model.predict_proba(X_valid)
        scores = roc_auc_score(y_valid, y_pre)
        auc_scores.append(scores)
        print(i, scores)
        if scores > best_score:
            best_score = scores
            best_model = model
    plt.plot(auc_scores)
    plt.show()
else:
    model.fit(X_train, y_train, nb_epoch=nb_epoch, batch_size=256)
    y_pre = model.predict_proba(X_test)[:, 1]
    print(roc_auc_score(y_test, y_pre))
Error received:
I have pasted it here; please have a look:
http://pastebin.com/dSw9ckkk
It looks like you have two classes, a positive class and a negative class, such that the positive-class labels are 1 minus the negative-class labels. In that case, you can discard the negative class labels and make it a single-output binary problem:
model.add(Dense(1, activation='sigmoid'))  # instead of Dense(nb_classes) and Activation('softmax')
Alternatively, you can still train the model on both classes and just use the positive class in the AUC calculation:
roc_auc_score(y_test[:, 1],y_pre[:, 1])
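For the single-output variant, the AUC computation simplifies accordingly (a short sketch; it assumes y_test holds the 0/1 positive-class labels):

y_pre = model.predict(X_test).ravel()  # probability of the positive class
print(roc_auc_score(y_test, y_pre))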
