Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
I have a dataset with four regression labels, and the number of samples per label is imbalanced. The data is attached to this post as
data_multi_label_reg.csv.
It has 5 columns: four of them (A, B, C, and D) hold the regression labels, and sample identifies the sample (training example) in the data.
Each sample is defined for only one of the four labels, so each sample carries one label value and the rest are empty.
The labels are also highly imbalanced. For instance, D is defined for most of the samples, while A is defined for the fewest.
Is there a Python package that can split this dataset into train and test sets such that each split retains the ratio of each label found in the original dataset?
sklearn has the following function:
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size=0.33,
random_state=0,
stratify=y)
But this seems to work only with single-label output. Is there a similar function for multi-label regression output?
You could take a look at the scikit-multilearn library. Its skmultilearn.model_selection module provides the iterative_train_test_split function. Check out this simple usage example and this doc.
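Because each sample in the question carries exactly one of the four labels, an alternative that needs only scikit-learn is to stratify on which label is defined rather than on the label values. The following is a minimal sketch under that assumption; the generated frame is a hypothetical stand-in for data_multi_label_reg.csv, and which_label is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

label_cols = ["A", "B", "C", "D"]

# Hypothetical stand-in for data_multi_label_reg.csv: every row has a value
# in exactly one label column, and D is far more common than A.
rng = np.random.default_rng(0)
defined = np.repeat(label_cols, [20, 40, 60, 180])
values = np.full((len(defined), len(label_cols)), np.nan)
for row, label in enumerate(defined):
    values[row, label_cols.index(label)] = rng.normal()
df = pd.DataFrame(values, columns=label_cols)
df.insert(0, "sample", np.arange(len(df)))

# Name of the single non-empty label column for each row.
first_defined = df[label_cols].notna().to_numpy().argmax(axis=1)
which_label = np.array(label_cols)[first_defined]

# Stratifying on that name keeps the per-label share similar in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.33, random_state=0, stratify=which_label
)

# Share of rows per label in each split.
print(train_df[label_cols].notna().mean())
print(test_df[label_cols].notna().mean())
```

This only preserves how often each label occurs. It does not balance the distribution of the regression values within a label.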
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
Please do help, I am unable to split the two. Should we do it after importing the data or before?
You can use train_test_split to split your dataset once it has been loaded.
Try this:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
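A minimal sketch of the call with its four return values unpacked, assuming the data has already been loaded into arrays. X and y here are made-up arrays for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, plus one target per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Rows of X and entries of y are shuffled together, so they stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train, y_test)
```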
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
Dataset has around 150k records with four labels: ['A','B','C','D'] and the distribution is as follows:
A: 60000
B: 50000
C: 36000
D: 4000
When I use classification_report to get the precision, recall, and f1-score, the f1-score triggers an UndefinedMetricWarning because class D is never predicted, owing to its low number of records.
I know that I need to oversample or undersample to fix the imbalanced data.
Question: Would it be a good idea to fix the imbalance by randomly sampling 4000 records from each class so that the data is balanced?
I think you want to oversample your class D. One technique for this is the Synthetic Minority Oversampling Technique, or SMOTE.
One way to solve this problem is to oversample the examples in the minority class. This can be achieved by simply duplicating examples from the minority class in the training dataset prior to fitting a model. This can balance the class distribution but does not provide any additional information to the model.
An improvement on duplicating examples from the minority class is to synthesize new examples from the minority class. This is a type of data augmentation for tabular data and can be very effective.
Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
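The simpler option described in the quote, duplicating minority-class examples, can be sketched with sklearn.utils.resample. This is plain random oversampling, not SMOTE, and the data is a made-up, scaled-down version of the class counts in the question (600/500/360/40 instead of 60000/50000/36000/4000).

```python
import numpy as np
from sklearn.utils import resample

# Made-up features with the question's class ratio, scaled down by 100.
rng = np.random.default_rng(0)
y = np.repeat(["A", "B", "C", "D"], [600, 500, 360, 40])
X = rng.normal(size=(len(y), 5))

# Oversample every class up to the size of the largest one.
target = 600
X_parts, y_parts = [], []
for label in np.unique(y):
    mask = y == label
    X_c, y_c = X[mask], y[mask]
    if len(y_c) < target:
        # Draw with replacement, i.e. duplicate existing rows of this class.
        X_c, y_c = resample(
            X_c, y_c, replace=True, n_samples=target, random_state=0
        )
    X_parts.append(X_c)
    y_parts.append(y_c)

X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)
print(dict(zip(*np.unique(y_bal, return_counts=True))))
```

Oversampling like this is normally applied to the training split only, so that duplicated rows do not leak into the test set.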
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am working on a text classification problem. I have a huge amount of data, and when I try to fit it to the machine learning model it causes a memory error. Is there any way to fit the data in parts to avoid the memory error?
Additional information
I am using the LinearSVC model.
I have training data of 1.1 million rows.
I have vectorized the text data using tfidf.
The vectorized data has shape (1121063, 4235687) and has to be fitted into the model.
Or is there any other way around this problem?
Unfortunately, I don't have any reproducible code for the same.
Thanks in advance.
The simple answer is not to use what I assume is the scikit-learn implementation of LinearSVC, and instead use an algorithm/implementation that allows training in batches. The most common of these are neural networks, but several other algorithms exist. In scikit-learn, look for classifiers with the partial_fit method, which lets you fit your classifier in batches. See e.g. this list
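A minimal sketch of batch training with partial_fit, assuming SGDClassifier with hinge loss as a linear-SVM-style stand-in for LinearSVC. It swaps tfidf for HashingVectorizer, which is stateless and so can transform each batch independently; the documents, labels, and batch size are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up corpus; in practice each batch would be read from disk.
docs = [
    "great product works well", "terrible quality broke fast",
    "really great value", "terrible support and slow",
    "works great every day", "broke after one week",
    "great and reliable", "slow and terrible",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Stateless vectorizer: no fit step, so no pass over the full corpus is needed.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(loss="hinge", random_state=0)

# partial_fit needs the full set of classes, since a batch may not contain all.
classes = np.array([0, 1])
batch_size = 4
for start in range(0, len(docs), batch_size):
    X_batch = vectorizer.transform(docs[start:start + batch_size])
    y_batch = labels[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=classes)

pred = clf.predict(vectorizer.transform(["great value"]))
print(pred)
```

Only one batch of vectorized text is held in memory at a time, which is what avoids materializing the full matrix.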
You could also try what the sklearn.svm.SVC docstring suggests (the second part; the first is using LinearSVC, which you did):
For large datasets consider using :class:'sklearn.svm.LinearSVC' or
:class:'sklearn.linear_model.SGDClassifier' instead, possibly after a :class:'sklearn.kernel_approximation.Nystroem' transformer.
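If you check SGDClassifier(), you can set the parameter warm_start=True so that, when you iterate through your dataset, it won't lose its state:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(warm_start=True)
# batches is a placeholder for an iterable of (X_batch, y_batch) chunks of your data
for X_batch, y_batch in batches:
    clf.fit(X_batch, y_batch)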
Additionally, you could reduce the dimensionality of your dataset by removing some words from your TF-IDF model. Check the max_df and min_df parameters: they drop words whose document frequency is higher than max_df or lower than min_df, and each can be given as a proportion of documents or an absolute count.
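A small sketch of how min_df and max_df shrink the vocabulary, using a made-up four-document corpus. The thresholds are chosen only to make the effect visible and are not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
]

# No pruning: every token of two or more characters becomes a feature.
full = TfidfVectorizer().fit(docs)

# min_df=2 (absolute count): drop words that appear in fewer than 2 documents.
# max_df=0.75 (proportion): drop words that appear in more than 75% of documents.
pruned = TfidfVectorizer(min_df=2, max_df=0.75).fit(docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(pruned.vocabulary_), sorted(pruned.vocabulary_))
```

Here "the" is removed by max_df because it appears in every document, and the words that occur in a single document are removed by min_df.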
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
How do I apply scikit-learn to a numpy array with 4 columns each representing a different attribute?
Basically, I want to teach it how to recognize a healthy patient from these 4 characteristics and then see if it can identify an abnormal one.
Thanks in advance!
A pipeline usually has the following steps:
Define a classifier/ regressor
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
Fit the data
clf.fit(X_train,y_train)
Here X_train will be your four feature columns and y_train will be the labels indicating whether each patient is healthy.
Predict on new data
y_pred = clf.predict(X_test)
This tutorial is a great starting point for getting a basic idea of the pipeline.
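The three steps above can be put together as one runnable sketch. The patient data is synthetic (two made-up clusters with 4 features each), and the scaler and default SVC settings are assumptions for illustration rather than tuned choices.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 4 measurements per patient, 0 = healthy, 1 = abnormal.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
abnormal = rng.normal(loc=3.0, scale=1.0, size=(100, 4))
X = np.vstack([healthy, abnormal])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: define the classifier (scaling first, since SVMs are scale-sensitive).
clf = make_pipeline(StandardScaler(), SVC())
# Step 2: fit on the training data.
clf.fit(X_train, y_train)
# Step 3: predict on unseen data.
y_pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
```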
Look into the pandas package which allows you to import CSV files into a dataframe. pandas is supported by scikit-learn.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
My dataset has a number of numerical inputs and 1 categorical (factor) output, and I want to train a CNN/RNN/LSTM model to predict the output.
My data looks like:
input1 input2 ... input_n output
2 1.2 ... -0.44 "b"
1 0.2 ... 3.2 "f"
3 1 ... 2.1 "a"
I tried Keras and Lasagne in Python, but did not succeed. I could not find a runnable example that fits my dataset, though I thought this type of task should be basic (predict the output from a set of inputs).
Could you point me to an example that uses a dataset similar to mine? Any programming language will be helpful.
A simple classification example using skflow, a wrapper for TensorFlow:
import skflow
from sklearn import datasets, metrics
iris = datasets.load_iris()
classifier = skflow.TensorFlowLinearClassifier(n_classes=3)
classifier.fit(iris.data, iris.target)
score = metrics.accuracy_score(classifier.predict(iris.data), iris.target)
print("Accuracy: %f" % score)
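As an alternative that needs only scikit-learn, the same kind of task (numeric inputs, one categorical string output) can be sketched with MLPClassifier, a small feed-forward network rather than a CNN/RNN/LSTM. The data is synthetic, the class names "a", "b", and "f" are borrowed from the question's sample rows, and the layer size and iteration count are arbitrary assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 numeric inputs, string class labels as the output.
rng = np.random.default_rng(0)
classes = np.array(["a", "b", "f"])
y = np.repeat(classes, 50)
X = np.vstack([rng.normal(loc=c, size=(50, 5)) for c in (-3.0, 0.0, 3.0)])

# String labels can be passed directly; no manual encoding is required.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
model.fit(X, y)

pred = model.predict(X[:3])
print(pred)
```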