Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
My dataset has a number of numerical input, and 1 categorical (factor) output, and I want to train the model with CNN/RNN/LSTM to predict the output.
My data looks like:
input1 input2 ... input_n output
2 1.2 ... -0.44 "b"
1 0.2 ... 3.2 "f"
3 1 ... 2.1 "a"
I tried with Keras and lasagne in Python, but did not succeed. I could not find a runnable example with my dataset, but I thought that this type of task should be basic (based on a set of input, predict the output).
Could you point me out an example that use the dataset similar with my dataset? Any programming language will be help.
Simple classification from skflow wrapper for tensor-flow.
import skflow
from sklearn import datasets, metrics
iris = datasets.load_iris()
classifier = skflow.TensorFlowLinearClassifier(n_classes=3)
classifier.fit(iris.data, iris.target)
score = metrics.accuracy_score(classifier.predict(iris.data), iris.target)
print("Accuracy: %f" % score)
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
Improve this question
I have a question regarding the model.fit method and overfitting from the scikit learn library in Pandas
Does the generic sklearn method model.fit(x---, y--) returns the score after applying the model to the specified training data?
Also, it is overfitting when performance on the test data degrades as more training data is used to learn the model?
model.fit(X, y) doesn't explicitly give you the score, if you assign a variable to it, it stores all the artifacts, training parameters. You can get the score by using model.score(X, y).
Overfitting in simple words is increasing the variance in your model by which your model fails to generalize. There are ways to reduce overfitting like feature engineering, normalization, regularization, ensemble methods etc.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
Improve this question
Dataset has around 150k records with four labels: ['A','B','C','D'] and the distribution is as follows:
A: 60000
B: 50000
C: 36000
D: 4000
I notice using the package classification report to get the precision, recall, and f1-score, the f1-score is causing an UndefinedMetricWarning because class D is not being predicted due to the low number of records.
I know that I need to perform oversample/undersample to fix the imbalanced data.
Question: Would it be a good idea to fix the imbalanced data but randomly sample 4000 records from each class so that it is balanced?
I think you want to oversample from your class D. The technique is called Synthetic Minority Oversampling Technique, or SMOTE.
One way to solve this problem is to oversample the examples in the minority class. This can be achieved by simply duplicating examples from the minority class in the training dataset prior to fitting a model. This can balance the class distribution but does not provide any additional information to the model.
An improvement on duplicating examples from the minority class is to synthesize new examples from the minority class. This is a type of data augmentation for tabular data and can be very effective.
Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
Improve this question
I have a data with four regression labels. Samples for each regression labels are imbalanced. The data is attached here with the post
data_multi_label_reg.csv.
It has 5 columns, out of which 4 i.e A, B, C, and D are for regression labels sample is for sample or training example in the data.
Each sample is defined for one of the four labels only. Therefore, each sample carries one label value and rest are empty.
Also, the labels are highly imbalanced. For instance, D is defined for most of the samples while A is defined for least samples.
Is there any python package which can divide this data set into train_test_split such that in either of the train and test split, the ratio of each label is retained as in the original data set.
There is sklearn function as follows.
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size=0.33,
random_state=0,
stratify=y)
But this seems to be working with single label output. Is there any similar function for multi-label regression output?
You could take a look at scikit-multilearn library. There is the iterative_train_test_split module. Check out this simple usage example and this doc.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I'm trying to build a classification model and my target is not binary. The correlations of my features against my target are all weak (mostly 0.1). I have preprocessed my data and applied the all the algorithms i used to it (the algorithms i used are svm, knn, naivebayes,logistic regression, decision tree,gradient boosting, random forest). I evaluated all of the models with sklearn metrics.accuracy_score just to know how good they perform on my data but all of them scored 0.1~0.2 . The target is productline column.
My questions
How could this happen?
How to tackle this issue?
Is there any other algorithm that could make better score?
What's the accuracy if you use a dummy classifier? The accuracy of the models you have tried should be at least equal to that of the dummy classifier.
"How could this happen?" If there's no relationship between the features and the target variable, the model isn't going to return good results.
I'm not sure about the details of your dataset, but you can try to 1) Get more data 2) Get more features 3) Do some feature engineering 4) Clean your dataset if you haven't, there might be outliers or wrong inputs affecting your results
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
How do I apply scikit-learn to a numpy array with 4 columns each representing a different attribute?
Basically, I'm wanting to teach it how to recognize a healthy patient from these 4 characteristics and then see if it can identify an abnormal one.
Thanks in advance!
A pipeline usually has the following steps:
Define a classifier/ regressor
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
Fit the data
clf.fit(X_train,y_train)
Here X_train will your four column features and y_train will be the labels whether the patient is healthy.
Predict on new data
y_pred = clf.prdict(X_test)
This tutorial is great starting point for you to get some basic idea about the pipeline.
Look into the pandas package which allows you to import CSV files into a dataframe. pandas is supported by scikit-learn.