Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
I have extracted unstructured textual data from approximately 3000 documents and I am attempting to use this data to classify these documents.
However, even after removing stopwords & punctuation and lemmatizing the data, the count vectorization produces more than 64000 features.
A lot of these features contain unnecessary tokens like random numbers and text in different languages.
The libraries I have used are:
tokenization: Punkt (NLTK)
pos tagging: Penn Treebank (NLTK)
lemmatization: WordNet (NLTK)
vectorization: CountVectorizer (scikit-learn)
Can anyone suggest how I can reduce the number of features for training my classifier?
You have two choices here, and they can be complementary:
Change your tokenization by adding stricter regex rules that remove numbers and other tokens you are not interested in.
Use feature selection to keep only the subset of features that are relevant for the classification. Here is a small demo that keeps 50% of the features in the data:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2

# Toy data standing in for your document-term matrix and labels
iris = load_iris()
X, y = iris.data, iris.target
# Keep the 50% of features with the highest chi-squared score against y
selector = SelectPercentile(score_func=chi2, percentile=50)
X_reduced = selector.fit_transform(X, y)
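For the first choice (stricter tokenization), here is a minimal sketch of how it could look with CountVectorizer. The toy documents, the exact regex, and the min_df value are assumptions for illustration, not something taken from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for the 3000 real ones.
docs = [
    "Invoice 20391 was paid in 2019",
    "The invoice total was paid late",
    "Payment 77 received, invoice closed",
]

# token_pattern keeps only purely ASCII-alphabetic tokens of 3+ letters,
# so numbers and mixed alphanumeric strings never become features.
# min_df=2 additionally drops tokens that occur in fewer than 2 documents.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{3,}\b", min_df=2)
X = vectorizer.fit_transform(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Restricting the pattern to [a-zA-Z] also discards tokens written in non-Latin scripts, which may or may not be what you want for your corpus.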
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am a beginner in NLP and I have some questions about a classification task. I have a data set in a data frame which contains two columns: the first one holds the texts (strings) and the second one holds the label of each text. Let's call the first column x_train and the second one y_train. In order to apply an MLP I could use this code:
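from sklearn.feature_extraction.text import TfidfVectorizer

# input_text is the corpus used to build the vocabulary (name as in the original post)
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(input_text)
Train_X_Tfidf = Tfidf_vect.transform(x_train)
Test_X_Tfidf = Tfidf_vect.transform(x_test)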
I want to try the Word2Vec model, but I don't know how to transform my training data into numbers using Word2Vec so that I can then apply the MLP model again. I would be grateful if you could help me.
According to the documentation from sklearn,
max_features int, default=None
If not None, build a vocabulary that
only consider the top max_features ordered by term frequency across
the corpus.
This means that the TfidfVectorizer builds its vocabulary from the max_features most frequent tokens (words or characters) in your texts. For example, at the word level with max_features = 10, it keeps the 10 most frequent words in your texts as its vocabulary. How many features to use depends on the number of distinct words in your texts; 10000 is a common choice, though.
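As a small illustration of that behaviour, here is a sketch on a made-up corpus (the documents and the value max_features=2 are assumptions chosen so the effect is easy to see):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "apple" occurs 4 times, "banana" twice,
# "cherry" and "date" once each.
docs = [
    "apple apple banana",
    "apple banana cherry",
    "apple date",
]

# Only the 2 terms with the highest corpus-wide term frequency are kept.
vectorizer = TfidfVectorizer(max_features=2)
X = vectorizer.fit_transform(docs)
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(X.shape)
```

Every document is then represented by a vector with exactly max_features columns, regardless of how many distinct words it contains.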
As for your second question, aside from Gensim's Word2Vec, you could try the Keras Embedding layer. A good tutorial is posted on the TensorFlow website here.
What do you mean by "transform my training data into number by using Word2vec"? If you are referring to obtaining an embedded representation of a given text, you can use Gensim's Word2Vec. In the documentation you will find some examples of how to use the model.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
I have a data set with four regression labels. The number of samples per regression label is imbalanced. The data is attached to this post:
data_multi_label_reg.csv.
It has 5 columns: four of them (A, B, C, and D) are regression labels, and sample identifies the sample (training example) in the data.
Each sample is defined for only one of the four labels. Therefore, each sample carries one label value and the rest are empty.
Also, the labels are highly imbalanced. For instance, D is defined for most of the samples while A is defined for the fewest.
Is there a Python package that can split this data set into train and test sets such that the ratio of each label in both splits is the same as in the original data set?
scikit-learn has the following function:
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.33,
                                                    random_state=0,
                                                    stratify=y)
But this seems to work only with single-label output. Is there a similar function for multi-label regression output?
You could take a look at the scikit-multilearn library, which provides the iterative_train_test_split function (in skmultilearn.model_selection). Check out this simple usage example and this doc.
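Since the question states that every sample carries exactly one of the four labels, a simpler alternative that needs only scikit-learn may also work: stratify on which label column is filled in. This is a different technique from the scikit-multilearn suggestion, and the sketch below runs on synthetic data that only mimics the described layout (the column names come from the question, everything else is assumed):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV: each row has a value in exactly one of A-D.
rng = np.random.default_rng(0)
which = np.array(["A"] * 10 + ["B"] * 20 + ["C"] * 30 + ["D"] * 40)
df = pd.DataFrame({"sample": np.arange(len(which))})
for col in "ABCD":
    df[col] = np.where(which == col, rng.normal(size=len(which)), np.nan)

# Name of the single non-empty label column for every row.
defined = df[list("ABCD")].notna().idxmax(axis=1)

# Stratifying on that name keeps the A/B/C/D proportions in both splits.
train, test = train_test_split(df, test_size=0.3, random_state=0, stratify=defined)
test_counts = defined.loc[test.index].value_counts().sort_index()
print(test_counts)
```

This only balances how often each label is defined; it does not balance the distribution of the regression values themselves.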
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am working on a text classification problem. I have a huge amount of data, and when I try to fit it to the machine learning model it causes a memory error. Is there any way to fit the data in parts to avoid the memory error?
Additional information
I am using a LinearSVC model.
I have training data of 1.1 million rows.
I have vectorized the text data using TF-IDF.
The shape of the vectorized data that has to be fitted into the model is (1121063, 4235687).
Or is there any other way out of this problem?
Unfortunately, I don't have any reproducible code for the same.
Thanks in advance.
The simple answer is not to use what I assume is the scikit-learn implementation of LinearSVC, and instead to use an algorithm/implementation that allows training in batches. The most common of these are neural networks, but several other algorithms exist. In scikit-learn, look for classifiers with the partial_fit method, which lets you fit your classifier in batches. See e.g. this list.
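As a rough sketch of batch training with partial_fit, assuming SGDClassifier with hinge loss as a stand-in for a linear SVM and a small random sparse matrix in place of the real TF-IDF data:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# Small random sparse matrix standing in for the (much larger) TF-IDF matrix.
rng = np.random.default_rng(0)
dense = rng.random((1000, 500)) * (rng.random((1000, 500)) < 0.01)
X = sparse.csr_matrix(dense)
y = rng.integers(0, 2, size=1000)

# Hinge loss makes SGDClassifier optimize a linear SVM objective.
clf = SGDClassifier(loss="hinge", random_state=0)

# partial_fit needs the full set of classes on the first call,
# because a single batch may not contain every class.
classes = np.unique(y)
batch_size = 200
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

print(clf.coef_.shape)
```

In a real out-of-core setup the batches would be read from disk one at a time rather than sliced from a matrix that is already in memory.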
You could also try what's suggested in the docstring of sklearn.svm.SVC (the second part; the first is using LinearSVC, which you already did):
For large datasets consider using :class:`sklearn.svm.LinearSVC` or
:class:`sklearn.linear_model.SGDClassifier` instead, possibly after a :class:`sklearn.kernel_approximation.Nystroem` transformer.
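If you check SGDClassifier(), you can set the parameter warm_start=True so that when you iterate through your dataset it won't lose its state:
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(warm_start=True)
# batches: any iterable of (X_batch, y_batch) chunks of your data
for X_batch, y_batch in batches:
    # with warm_start=True, each fit starts from the previous coefficients
    clf.fit(X_batch, y_batch)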
Additionally, you could reduce the dimensionality of your dataset by removing some words from your TF-IDF model. Check the max_df and min_df parameters: they drop words whose document frequency is higher than max_df or lower than min_df, and each can be given either as a proportion of documents (float) or as an absolute count (int).
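A minimal sketch of how min_df and max_df prune the vocabulary, on a made-up corpus (the documents and thresholds are assumptions chosen to make the effect visible):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "common" is in every document, "shared" in three,
# and every other word in exactly one.
docs = [
    "common rare shared",
    "common shared other",
    "common unique",
    "common shared",
    "common alone",
]

# min_df=2   -> drop words that appear in fewer than 2 documents (absolute count)
# max_df=0.8 -> drop words that appear in more than 80% of documents (proportion)
vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
X = vectorizer.fit_transform(docs)
remaining = sorted(vectorizer.vocabulary_)
print(remaining)
print(X.shape)
```

On a corpus with millions of distinct tokens, even a small min_df such as 2 or 5 usually removes a large share of the columns, because most tokens occur in very few documents.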
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
How do I apply scikit-learn to a numpy array with 4 columns each representing a different attribute?
Basically, I want to teach it how to recognize a healthy patient from these 4 characteristics and then see if it can identify an abnormal one.
Thanks in advance!
A pipeline usually has the following steps:
Define a classifier/regressor
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
Fit the data
clf.fit(X_train, y_train)
Here X_train will be your four feature columns and y_train will be the labels indicating whether each patient is healthy.
Predict on new data
y_pred = clf.predict(X_test)
This tutorial is a great starting point for getting a basic idea of the pipeline.
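Put together, the steps above can be sketched end to end as follows. The patient data here is synthetic (two Gaussian clusters with four features each) and the label encoding 0 = healthy, 1 = abnormal is an assumption; only the SVC settings come from the steps above:

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 4-column patient array.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
abnormal = rng.normal(loc=3.0, scale=1.0, size=(100, 4))
X = np.vstack([healthy, abnormal])
y = np.array([0] * 100 + [1] * 100)  # assumed encoding: 0 = healthy, 1 = abnormal

# Hold out part of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = svm.SVC(gamma=0.001, C=100.)   # 1. define the classifier
clf.fit(X_train, y_train)            # 2. fit the data
y_pred = clf.predict(X_test)         # 3. predict on new data

accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```

With real data, X would be your NumPy array with the four attribute columns and y the known healthy/abnormal labels.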
Look into the pandas package which allows you to import CSV files into a dataframe. pandas is supported by scikit-learn.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I got started with naive numerical prediction. Here is the training data
https://gist.github.com/karimkhanp/75d6d5f9c4fbaaaaffe8258073d00a75
Test data
https://gist.github.com/karimkhanp/0f93ecf5fe8ec5fccc8a7f360a6c3950
I wrote basic scikit-learn code to train and test.
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn import metrics, linear_model
from sklearn.linear_model import LogisticRegression, LinearRegression, BayesianRidge, OrthogonalMatchingPursuitCV, SGDRegressor
from datetime import datetime, date, timedelta


class NumericPrediction(object):
    def __init__(self):
        pass

    def dataPrediction(self):
        # Load the training data and parse the date column as datetimes
        Train = pd.read_csv("data_scientist_assignment.tsv", sep='\t', parse_dates=['date'])
        Train_visualize = Train
        # Dates as integer nanoseconds since the epoch (pd.np is no longer available in pandas)
        Train['timestamp'] = Train.date.values.astype('int64')
        Train_visualize['date'] = Train['timestamp']
        print(Train.describe())
        x1 = ["timestamp", "hr_of_day"]
        test = pd.read_csv("test.tsv", sep='\t', parse_dates=['date'])
        test['timestamp'] = test.date.values.astype('int64')
        model = LinearRegression()
        model.fit(Train[x1], Train["vals"])
        # print(model)
        # print(model.score(Train[x1], Train["vals"]))
        print(model.predict(test[x1]))
        Train.hist()
        pl.show()


if __name__ == '__main__':
    NumericPrediction().dataPrediction()
But the accuracy is very low here, because the approach is very naive. Are there any suggestions for improving the accuracy (in terms of algorithms, examples, references, or libraries)?
For starters, your 'test' set doesn't look right. Please check it.
Secondly, your model is doomed to fail. Plot your data: what do you see? Clearly there is seasonality here, while linear regression assumes that observations are independent. It's important to recognize that you are dealing with a time series.
The R language is excellent when it comes to time series, with advanced packages for time series forecasting like bsts. Still, Python will be just as good here, and the pandas module is going to serve you well. Mind that you do not necessarily have to use machine learning here. Check ARMA and ARIMA. Bayesian structural time series are also excellent.
Here is a very good article that guides you through the basics of dealing with time series data.
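To illustrate why the dependence between observations matters, here is a sketch that compares a regression on time alone with a simple autoregressive model built from lagged values. It is a hand-rolled AR stand-in using pandas and LinearRegression, not a full ARMA/ARIMA implementation, and the hourly series with a 24-step cycle is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic hourly series with a daily (24-step) seasonal pattern plus noise.
rng = np.random.default_rng(0)
t = np.arange(24 * 30)
vals = 10 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(scale=0.3, size=t.size)
df = pd.DataFrame({"t": t, "vals": vals})

# Lag features: the previous 24 observations become the predictors.
n_lags = 24
lag_cols = [f"lag_{lag}" for lag in range(1, n_lags + 1)]
for lag, col in enumerate(lag_cols, start=1):
    df[col] = df["vals"].shift(lag)
df = df.dropna()

# Chronological split: never shuffle a time series before splitting.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

# Autoregressive model on the lags vs. plain regression on the time index.
ar_model = LinearRegression().fit(train[lag_cols], train["vals"])
trend_model = LinearRegression().fit(train[["t"]], train["vals"])

ar_r2 = ar_model.score(test[lag_cols], test["vals"])
trend_r2 = trend_model.score(test[["t"]], test["vals"])
print(f"lagged model R^2: {ar_r2:.3f}, time-only model R^2: {trend_r2:.3f}")
```

The time-only model cannot follow the daily cycle, while the lagged model can, which is the kind of structure that ARMA/ARIMA models handle in a more principled way.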