I have a dataset like the one in the picture.
I need to apply Logistic Regression to this dataset and try to predict the correct result of R.
The problem is that R can assume 3 different values, as you can see in the picture (W, L, D). How can I structure it to work with logistic regression?
I am using python and sklearn logistic regression.
Create a DataFrame or NumPy ndarray with the columns A, M, D, H (call it x_train), and put R into another ndarray (call it y_train). Then you can use:

from sklearn.linear_model import LogisticRegression
LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(x_train, y_train)
But depending on your actual data you may need to do some standardization/transformation before you fit the model; you can also refer to this explanation of using the iris data set.
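A minimal sketch of the whole flow, assuming the data sits in a pandas DataFrame called df with feature columns A, M, D, H and the result column R:

from sklearn.linear_model import LogisticRegression

# columns A, M, D, H are the features; R holds the labels 'W', 'L', 'D'
x_train = df[['A', 'M', 'D', 'H']].values
y_train = df['R'].values

# sklearn accepts string class labels directly
clf = LogisticRegression(multi_class='multinomial', solver='newton-cg')
clf.fit(x_train, y_train)

print(clf.predict(x_train[:5]))        # predicted labels such as 'W', 'L' or 'D'
print(clf.predict_proba(x_train[:5]))  # one probability column per class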
Download the dataset, where the first four columns are features and the last column corresponds to categories (3 labels). Perform the following tasks:
1. Split the dataset into train and test sets (80:20).
2. Construct the Naive Bayes classifier from scratch and train it on the train set. Assume a Gaussian distribution to compute probabilities.
3. Evaluate the performance using the following metrics on the test set:
a. Confusion matrix
b. Overall and class-wise accuracy
c. ROC curve, AUC
4. Use any library (e.g. scikit-learn) and repeat 1 to 3.
5. Compare and comment on the performance of the results of the classifiers in 2 and 4.
6. Calculate the Bayes risk.
Consider the loss matrix
λ =
2 1 6
4 2 4
6 3 1
where λ is a loss function and the rows and columns correspond to classes (ci) and actions (aj) respectively, e.g. λ(a3 | c2) = 4.
It's not clear what specific part of the problem you're having trouble with, which makes it hard to give specific advice.
With that in mind, here is some reading that might help get you started:
If the dataset is in CSV format, you can read it into a dataframe using pd.read_csv() as discussed here: https://www.geeksforgeeks.org/python-read-csv-using-pandas-read_csv/
To split the df into a train set and test set, you can import scikit-learn (sklearn) and then use train_test_split() as discussed here: https://www.stackvidhya.com/train-test-split-using-sklearn-in-python/
It sounds like your professor (or whoever is the source of this question) wants you to write a Naive Bayes classifier from scratch, so I'll leave you to figure that part out. Sklearn does provide a Naive Bayes classifier you can read about here and use to verify your results: https://scikit-learn.org/stable/modules/naive_bayes.html
For confusion matrices, sklearn (again) provides some functionality that will let you plot a confusion matrix: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_predictions
For the ROC curve, you can see here: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Hope this is enough to get you started.
from sklearn.model_selection import train_test_split

# features and targets are the arrays built from the dataset
X_train, X_test, y_train, y_test = train_test_split(features, targets,
                                                    test_size=0.20, random_state=42)

Example of Gaussian Naive Bayes:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# define the model
model = GaussianNB()
# fit the model
model.fit(X_train, y_train)

predict = model.predict(X_test)
matrix = classification_report(y_test, predict)
print('Classification report :\n', matrix)
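For the remaining metrics, here is a minimal sketch building on the fitted model and predict from above (for the full per-class ROC curves, the plot_roc example linked above is the best reference):

from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

cm = confusion_matrix(y_test, predict)
print('Confusion matrix:\n', cm)

# overall accuracy, and class-wise accuracy as the diagonal of the
# confusion matrix divided by the row sums (per-class recall)
print('Overall accuracy:', accuracy_score(y_test, predict))
print('Class-wise accuracy:', cm.diagonal() / cm.sum(axis=1))

# AUC for a 3-class problem uses predicted probabilities, averaged one-vs-rest
probs = model.predict_proba(X_test)
print('AUC (one-vs-rest):', roc_auc_score(y_test, probs, multi_class='ovr'))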
For cross-validation, see: https://scikit-learn.org/stable/modules/cross_validation.html
I made a decision tree and a logistic regression model. I am satisfied with the results. How do I use them on unsupervised (unlabeled) data?
Also: will I always need to apply StandardScaler to new data?
While your question is too broad for SO, I still want to give some short advice:
You need supervised (labeled) data only for the training stage of your model. Once you have a trained model you can make predictions on unsupervised data (i.e. data that has no labels/targets), and the model returns predicted labels. Usually you do this with the predict method.
Important point: to use the predict method, the data passed to the model must have the same form as during training - the same set of features and the same number of features (excluding labels/targets, of course).
The same goes for preprocessing - if you used StandardScaler on the training data you must use it for new data too - the SAME StandardScaler (i.e. call the transform method of the scaler already fitted on the training data); see the sketch after this list.
The philosophy of using StandardScaler or some normalisation, in short: use it for linear models (including your logistic regression). Read about it here, for example: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
But for trees it is not necessary. Example: https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6
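A minimal sketch of that reuse pattern (the variable names are made up; X_train/y_train are the labeled training data and X_new is the new, unlabeled data):

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler ONLY on training data

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# new, unlabeled data: same features, same order, same (already fitted) scaler
X_new_scaled = scaler.transform(X_new)          # transform, NOT fit_transform
predicted_labels = model.predict(X_new_scaled)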
I am trying to use linear regression in combination with Python and scikit-learn to answer the question "can user session lengths be predicted given user demographic information?"
I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one hot encoded all of my categorical variables including gender, country, and age range.
I am not sure how to take into account my one hot encoding, or if I even need to.
Input Data:
I tried reading here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
I understand that the main inputs are whether to calculate a fit intercept, whether to normalize, whether to copy X (all boolean), and then n_jobs.
I'm not sure what factors to take into account when deciding on these inputs. I'm also unsure whether my one-hot encoding of the variables makes an impact.
You can do something like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# X is a numpy array with your features
# y is the label array
enc = OneHotEncoder(sparse=False)
X_transform = enc.fit_transform(X)

# apply your linear regression as you want
model = LinearRegression()
model.fit(X_transform, y)

print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))
Please note that in this example I am training and testing with the same dataset! This may cause your model to overfit. You should avoid that by splitting the data or doing cross-validation.
I just wanted to fit a linear regression with sklearn which I use as benchmark for other non-linear approaches, such as MLPRegressor, but also variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this group: http://scikit-learn.org/stable/modules/linear_model.html).
Doing it the same way as described by #silviomoreto (which worked for all other models) actually resulted in an erroneous model for me (very high errors). This is most likely due to the so-called dummy variable trap, which occurs due to multicollinearity when you include one dummy variable per category for categorical variables -- which is exactly what OneHotEncoder does! See also the following discussion on statsexchange: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn.
To avoid this, I wrote a simple wrapper that excludes one variable, which then acts as the default.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DummyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, n_values='auto'):
        self.n_values = n_values

    def transform(self, X):
        # one-hot encode, then drop the last column so it acts as the default level
        ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
        return ohe.fit_transform(X)[:, :-1]

    def fit(self, X, y=None, **fit_params):
        return self
So building on the code of #silviomoreto, you would change the encoder line to:
enc = DummyEncoder()
This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.
I chose this way, because I wanted to include it in my feature pipeline. But you seem to have the data already encoded. Here, you would have to drop one column per category (e.g. for male/female only include one). So if you for example used pandas.get_dummies(...), this can be done with the parameter drop_first=True.
Last but not least, if you really need to go deeper into linear regression in Python, and not use it just as a benchmark, I would recommend statsmodels over scikit-learn (https://pypi.python.org/pypi/statsmodels), as it provides better model statistics, e.g. p-values per variable, etc.
how to prepare data for sklearn LinearRegression
OneHotEncoder should only be used on the intended columns: those with categorical variables or strings, or integers that are essentially levels rather than numeric values.
DO NOT apply OneHotEncoder to your entire dataset, including numerical variables or Booleans.
To prepare the data for sklearn LinearRegression, the numerical and categorical columns should be handled separately (a minimal sketch follows this list):
numerical columns: standardize if your model contains interactions or polynomial terms
categorical columns: apply one-hot encoding either through sklearn or pd.get_dummies. pd.get_dummies is more flexible, while OneHotEncoder is more consistent in working with the sklearn API.
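The sketch, assuming hypothetical column names in a pandas DataFrame df (the target name session_length_ms is also made up):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

numeric_cols = ['age']                    # hypothetical numeric feature
categorical_cols = ['gender', 'country']  # categorical features from the question

# scale the numeric columns, one-hot encode the categorical ones
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols),
])

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('regression', LinearRegression()),
])

pipeline.fit(df[numeric_cols + categorical_cols], df['session_length_ms'])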
drop='first'
As of version 0.22, OneHotEncoder in sklearn has a drop option. For example, OneHotEncoder(drop='first').fit(X), which is similar to
pd.get_dummies(drop_first=True).
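A tiny illustration of the equivalence on a made-up column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})

# sklearn: drop the first level of each encoded feature
enc = OneHotEncoder(drop='first')
print(enc.fit_transform(df[['gender']]).toarray())  # one column: 1 for 'male', 0 for 'female'

# pandas equivalent
print(pd.get_dummies(df['gender'], drop_first=True))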
use regularized linear regression
If you use regularized linear regression such as Lasso, multicollinear variables will be penalized and shrunk.
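For example, a rough sketch reusing X_transform and y from the earlier snippet (alpha is an arbitrary regularization strength):

from sklearn.linear_model import Lasso

# L1 regularization shrinks redundant (e.g. multicollinear) coefficients,
# possibly all the way to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_transform, y)
print(lasso.coef_)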
limitation of p-value statistics
The p-values in OLS are only valid when the OLS assumptions are more or less true. While there are methods to deal with situations where p-values cannot be trusted, one practical option is to use cross-validation or leave-one-out to gain confidence in the model.
I have a training dataset of 8670 trials, each with a length of 125 time samples, while my test set consists of 578 trials. When I apply the SVM algorithm from scikit-learn, I get pretty good results.
However, when I apply logistic regression, this error occurs:
"ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1.0" .
My question is: why is SVM able to give predictions while logistic regression gives this error?
Could it be that something is wrong in the dataset, or just that logistic regression was not able to classify because the training samples look similar to it?
I read this in the following issue on a similar linear module: https://github.com/lensacom/sparkit-learn/issues/49
"Sadly this is a bug indeed. Sparkit trains sklearn's linear models in parallel, then averages them in a reduce step. There is at least one block, which contains only one of the labels. To check try the following:
train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()
To resolve You could randomize the train data to avoid blocks with one label, but this is still waiting for a clever solution."
EDIT: I found a solution; the above analysis of the error was correct, and the following fixes it.
To Shuffle the arrays in the same order I used a scikitlearn utils module:
from sklearn.utils import shuffle
X_shuf, Y_shuf = shuffle(X_transformed, Y)
Then use those shuffled arrays to train your model again and it'll work!
How does scikit-learn's sklearn.linear_model.LogisticRegression class work with regression as well as classification problems?
As given on the Wikipedia page as well as a number of sources, since the output of Logistic Regression is based on the sigmoid function, it returns a probability. Then how does the sklearn class work as both a classifier and regressor?
Logistic regression is a method for classification, not regression. This goes for scikit-learn as for anywhere else.
If you have entered continuous values as the target vector y, then LogisticRegression will most probably fail, as it interprets the unique values of y (i.e. np.unique(y)) as different classes. So you may end up with as many classes as samples.
TL;DR: Logistic regression needs a categorical target variable, because it is a classification method.
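A minimal sketch with made-up data to illustrate the point:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2], [0.5], [0.9], [1.4], [1.8], [2.3]])
y = np.array(['no', 'no', 'no', 'yes', 'yes', 'yes'])  # discrete class labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0]]))        # a class label: 'no' or 'yes'
print(clf.predict_proba([[1.0]]))  # the underlying sigmoid probabilities

# a continuous-valued y (e.g. prices) calls for a regressor such as
# sklearn.linear_model.LinearRegression, not LogisticRegression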