fit word2vec into training set which has data frame structure [closed] - python

I am a beginner in NLP and I have some questions about a classification task. I have a data set in a data frame structure which contains two columns: the first one is the texts (so strings) and the second one is the label of each text. So let's call the first column x_train and the second one y_train. In order to apply an MLP I could use this code:
from sklearn.feature_extraction.text import TfidfVectorizer

Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(input_text)                      # fit the vocabulary on the whole corpus
Train_X_Tfidf = Tfidf_vect.transform(x_train)
Test_X_Tfidf = Tfidf_vect.transform(x_test)
I want to try the Word2Vec model, but I don't know how to transform my training data into numbers using Word2Vec, so that I could then apply the MLP model again. I would be grateful if you could help me.

According to the scikit-learn documentation:
max_features : int, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This means that, based on your texts, the TfidfVectorizer will build a vocabulary containing the max_features most frequent tokens (words or characters). For example, at the word level with max_features = 10, it will take the 10 most common words in your texts as its vocabulary. How many features you should use depends on the number of words in your texts; 10000 is the most common choice, though.
As for your second question, aside from Gensim's Word2Vec you could try a Keras Embedding layer. A good tutorial is posted on the TensorFlow website here.

What do you mean by "transform my training data into numbers by using Word2Vec"? If you are referring to obtaining an embedded representation of a given text, you can use Gensim's Word2Vec. In the documentation you will find some examples of how to use the model.
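A minimal sketch of that approach, assuming Gensim 4.x (where the embedding size argument is vector_size) and that x_train holds raw strings; each document is reduced to the average of its word vectors so it can be fed to the MLP:

import numpy as np
from gensim.models import Word2Vec

# Tokenize each document (a very simple whitespace tokenizer for illustration).
tokenized = [text.lower().split() for text in x_train]
w2v = Word2Vec(sentences=tokenized, vector_size=100, min_count=1)

def document_vector(tokens, model):
    # Average the vectors of the tokens that are in the Word2Vec vocabulary.
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

Train_X_w2v = np.vstack([document_vector(t, w2v) for t in tokenized])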

Related

Using a XGBoost model that was fitted on a database to make predictions on a new database [closed]

I have a database that I have split into train and test datasets, fitted an XGBoost model on the train set, and made predictions using the fitted model on the test set. So far everything is good.
Now if I save the fitted model and want to use it on a completely new dataset to make predictions, what should my new database look like?
Does it have to contain the exact number of features?
Does a categorical feature have to have the same categories in both databases?
I assume you are using one-hot encoding for, let's say, the color feature?
So technically, to avoid extra or new features in the test data, you should form the feature vector using train + test data.
Do one-hot encoding/featurization on the whole set of training + testing data, then separate out the training dataset and the testing dataset.
Let's say [v1, v2, v3, ..., vn] is the list of feature names from the train + test data.
Now form the training data using these feature names. As expected, the feature column corresponding to the 5th color in the training data would be all zeros, and that's fine.
Use this same feature list for the test data; now you shouldn't have any discrepancies in terms of new features coming up.
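A minimal sketch of that idea with pandas; train_df, test_df and the 'color' column are hypothetical names:

import pandas as pd

# Encode on the combined data so both frames end up with the same columns.
combined = pd.concat([train_df, test_df], keys=["train", "test"])
combined = pd.get_dummies(combined, columns=["color"])

# Split back; a category missing from one set simply becomes an all-zero column.
train_encoded = combined.loc["train"]
test_encoded = combined.loc["test"]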
Hope that clarifies.

What's the best way to process strings for inputs in Keras? [closed]

I have a dataset where name is an important feature. I want to use it as an input node in my Keras neural network in Python. But since this is not possible directly, what's the best way to do it?
I have tried one-hot encoding, but since the length of the name is not fixed, it is not useful.
You can use Embeddings, which translate large sparse (one-hot encoded) vectors into a lower-dimensional space that preserves semantic relationships. So for the categorical feature you will have a dense vector representation.
import numpy as np
import tensorflow as tf

unique_amount = len(np.unique(col1))  # vocabulary size: number of distinct values in col1
input_1 = tf.keras.layers.Input(shape=(1,), name='input_1')
embedding_1 = tf.keras.layers.Embedding(unique_amount, 50, trainable=True)(input_1)
col1_embedding = tf.keras.layers.Flatten()(embedding_1)
Here 50 is the size of the embedding vector, which you can choose yourself.
You can also try character-level one-hot encoding in Keras as follows. Make sure you set the char_level=True flag in Tokenizer. This leads to a very low-dimensional sparse matrix.
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(char_level=True)            # tokenize at the character level
tokenizer.fit_on_texts(<names>)                   # list of name strings
sequence_of_int = tokenizer.texts_to_sequences(<dataset_names>)
You could even try a frequency-based character encoding of your own.

Tensor flow Model saving and Calculating Average of Models [closed]

I am trying to implement and reproduce the results of federated BERT pretraining from the paper Federated pretraining and fine-tuning of BERT using clinical notes from multiple silos. I prefer to use the TensorFlow code for BERT pretraining.
For training in a federated way, I initially divided the dataset into 3 different silos (each of which contains the discharge summaries of 50 patients, using MIMIC-III data), and then pretrained the BERT model on each dataset using the TensorFlow implementation of BERT pretraining from the official BERT release.
Now I have three different models, each pretrained on a different dataset. For model aggregation, I need to take an average of all three models. Since the number of notes in each silo is equal, for averaging I need to sum all the models and divide by three.
How do I take the average of the models as done in the paper? Please give me some insights to code this correctly. The idea of averaging the model weights is taken from the paper Federated Learning: Strategies for Improving Communication Efficiency.
I am very new to deep learning and TensorFlow, so please help me figure out the issue and suggest some reading material for TensorFlow.
In the paper, it is mentioned that this is a good option for overcoming privacy and regulatory issues when sharing clinical data. My question is: is it possible to recover sensitive data from these model.ckpt files? If so, how?
Any help would be appreciated. Thanks.
Model averaging can be done in many ways. The simplest is to have a complete copy of the architecture in each silo, take a (weighted) average of their parameters, and use that as the parameters of the full model. However, there are a number of practical issues (latency, network speed, computational power of the devices) which may prohibit this, so more complex solutions where silos are only trained on subsets of variables etc. are used (as in the paper you cite).
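A minimal sketch of plain (unweighted) parameter averaging with tf.keras, assuming silo_models is a list of models that all share the same architecture (e.g. your three pretrained checkpoints loaded separately):

import numpy as np
import tensorflow as tf

def average_models(silo_models, target_model):
    # Collect the per-layer weight tensors from every silo model.
    all_weights = [m.get_weights() for m in silo_models]
    # Element-wise mean across silos for each weight tensor.
    avg_weights = [np.mean(tensors, axis=0) for tensors in zip(*all_weights)]
    target_model.set_weights(avg_weights)
    return target_model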
It is not generally possible to retrieve information (sensitive or otherwise) about a dataset purely from the parameter updates of a model fine-tuned on it.

How to fit data into a machine learning model in parts? [closed]

I am working on a text classification problem. I have a huge amount of data, and when I try to fit the data into the machine learning model it causes a memory error. Is there any way I can fit the data in parts to avoid the memory error?
Additional information
I am using the LinearSVC model.
I have training data of 1.1 million rows.
I have vectorized the text data using TF-IDF.
The shape of the vectorized data is (1121063, 4235687), which has to be fitted into the model.
Or is there any other way out of this problem?
Unfortunately, I don't have any reproducible code for the same.
Thanks in advance.
The simple answer is not to use what I assume is the scikit-learn implementation of LinearSVC, and instead use an algorithm/implementation that allows training in batches. The most common of these are neural networks, but several other algorithms exist. In scikit-learn, look for classifiers with the partial_fit method, which will allow you to fit your classifier in batches; see e.g. this list.
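A minimal sketch of batch-wise training with SGDClassifier.partial_fit, assuming X_tfidf is your sparse TF-IDF matrix and y your labels; the batch size is illustrative:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")   # hinge loss gives a linear SVM
classes = np.unique(y)              # partial_fit needs the full label set up front
batch_size = 10000
for start in range(0, X_tfidf.shape[0], batch_size):
    X_batch = X_tfidf[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=classes)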
You could also try what's suggested in the docstring of sklearn.svm.SVC (the second part; the first is using LinearSVC, which you did):
For large datasets consider using :class:'sklearn.svm.LinearSVC' or :class:'sklearn.linear_model.SGDClassifier' instead, possibly after a :class:'sklearn.kernel_approximation.Nystroem' transformer.
If you check SGDClassifier() you can set the parameter warm_start=True, so when you iterate through your dataset it won't lose its state:
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(warm_start=True)
for X_batch, y_batch in batches:   # iterate over chunks of your data and labels
    clf.fit(X_batch, y_batch)
Additionally, you could reduce the dimensionality of your dataset by removing some words from your TF-IDF model. Check the max_df and min_df parameters: they remove words whose document frequency is higher or lower than a threshold, given either as a proportion or as an absolute count.
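For example (the thresholds here are illustrative, not recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer

# Drop words that appear in more than 90% of documents or in fewer than 5 documents.
vectorizer = TfidfVectorizer(max_df=0.9, min_df=5)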

what are the best methods to classify the user gender based on names? [closed]

If you check my GitHub, I have successfully implemented CNN and KNN for classifying signal faults. For that, I took the signal with a little preprocessing for dimensionality reduction and provided it to the network; using its class information I trained the network, and later the trained network was tested with testing samples to determine the class and compute the accuracy.
My question here is how to input text information to a CNN or any other network. For inputs, I took the Twitter database from Kaggle and selected 2 columns, which contain names and gender information. I have gone through some algorithms that classify gender based on blog data, but it wasn't clear how to apply them to my data (in my case, I want to classify using names alone).
In some examples, which I understood, I saw a sparse matrix being computed for the text, but for 20,000 samples the sparse matrix is huge to give as input. I have no problem implementing CNN architectures (I want to use them because no hand-crafted features are required) or any other network. I am stuck on how to input the data to the network. What kind of conversions can I make so that the names and gender information can be used to train the network?
If my way of thinking is wrong, please suggest which algorithm is the best approach. Deep learning or any other methods are fine!
You could use character-level embeddings (i.e. your input classes are the different characters, so 'a' is class 1, 'b' is class 2, etc.). One-hot encoding the classes and then passing them through an embedding layer will yield unique representations for each character. A string can then be treated as a character sequence (or equally a vector sequence), which can be used as an input for either a recurrent or a convolutional network. If you feel like reading, this paper by Kim et al. will provide all the necessary theoretical background.
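A minimal sketch of that setup in Keras, assuming the names have already been integer-encoded at the character level (e.g. with Tokenizer(char_level=True)) and padded to a fixed length; vocab_size and max_len are illustrative values:

import tensorflow as tf

max_len, vocab_size = 20, 60   # padded name length and number of distinct characters
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, 16),          # learn a vector per character
    tf.keras.layers.Conv1D(32, 3, activation="relu"),    # convolve over the character sequence
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),       # binary gender label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])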
