I would like to use Isolation Forest to identify the outliers in my dataset.
The training set contains 4000 records with 40 feature columns, each with value 1 or 0.
I know how to use Isolation Forest for 2 features from the example given in scikit-learn.
How do I use all 40 features and see the outliers?
I simplified the scikit-learn example a bit. X is your dataset; in your case it has 40 features and 4000 rows, and in this example it has 5 features and 300 rows. You fit the model to your numerical data X with clf.fit(X) so that it learns the "boundaries" of your data. In the next step you classify the same data X with the fitted model and get an array y with 300 entries, one for each row in your dataset. Each entry in y is -1 (outlier) or 1 (inlier).
import numpy as np
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
# Generate train data
s = rng.randn(100, 5)
X = np.r_[s + 2, s - 2, s - 5]
# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X)
y = clf.predict(X)
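The same pattern carries over to the 40-column case from the question. The following is a minimal sketch, assuming randomly generated 0/1 data as a stand-in for the real dataset; the contamination setting and the choice to list the ten lowest-scoring rows are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Hypothetical stand-in for the real data: 4000 rows, 40 binary (0/1) features
X = rng.randint(0, 2, size=(4000, 40))

# contamination="auto" is an assumption; set it to the expected outlier share if known
clf = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
clf.fit(X)

y = clf.predict(X)             # 1 = inlier, -1 = outlier, one entry per row
scores = clf.score_samples(X)  # lower score = more anomalous

outlier_rows = np.where(y == -1)[0]       # row indices flagged as outliers
most_anomalous = np.argsort(scores)[:10]  # ten rows with the lowest scores
```

With a DataFrame, passing df.values (or the DataFrame itself) in place of X should work the same way, and outlier_rows can then be used with df.iloc to inspect the flagged records.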
I have a pandas DataFrame like the one below, with an ID and a target variable (for a machine learning model).
My DataFrame is really large and unbalanced.
I need to take a sample of my DataFrame because it is really large.
The class balance of the DataFrame looks like this:
99.60% - 0
0.40% - 1
ID    TARGET
111   1
222   1
333   0
444   1
...   ...
How can I sample the data without losing too many ones (target = 1), which are very rare anyway? In the next step I will of course add the remaining variables and perform oversampling, but first I need to take a sample of the data.
How can I do that in Python?
Assume you want a sample size of 1000.
Try the following line:
df.sample(frac=1000/len(df), replace=True, random_state=1)
Perhaps this is what you need. The stratify parameter makes sure the data is sampled in a stratified fashion, so each class keeps its original proportion in the sample.
from sklearn.model_selection import train_test_split
import numpy as np
X = np.random.rand(30000, 2)
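y = np.random.randint(2, size=30000)
# train_test_split returns four arrays; X_train / y_train form the stratified sample
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100, test_size=100, stratify=y, shuffle=True)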
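If the goal is to lose none of the rare ones, an alternative is to keep every positive row and only sample the negatives. This is a minimal sketch under stated assumptions: the DataFrame is synthetic, the column names ID and TARGET follow the question, and sample_size is a hypothetical target size. Note that this deliberately changes the class ratio, unlike stratified sampling.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Hypothetical data with roughly 0.4% positives, mirroring the question
n_rows = 100_000
df = pd.DataFrame({
    "ID": np.arange(n_rows),
    "TARGET": (rng.rand(n_rows) < 0.004).astype(int),
})

sample_size = 1000  # assumed total size of the sample

# Keep every rare positive, then fill the rest of the sample with random negatives
positives = df[df["TARGET"] == 1]
n_negatives = max(sample_size - len(positives), 0)
negatives = df[df["TARGET"] == 0].sample(n=n_negatives, random_state=1)

# Concatenate and shuffle so that the classes are not grouped together
sample = pd.concat([positives, negatives]).sample(frac=1, random_state=1)
```

To keep the original 99.6% / 0.4% ratio instead, df.groupby("TARGET").sample(frac=...) is one option in recent pandas versions.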
I think the solution is to combine Oversampling and Undersampling.
Random Oversampling: Randomly duplicate examples in the minority class.
Random Undersampling: Randomly delete examples in the majority class.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)
I have a large dataset of 23k rows. The data looks something like this:
import pandas as pd
d = {'Date': ["1-1-2020", "1-1-2020", "1-2-2020", "1-2-2020"], 'Stock_id': [5, 41, 5, 41],
     'last_price': [230, 8, 241, 9], 'price': [241, 9, 240, 8.5]}
df = pd.DataFrame(data=d)
       Date  Stock_id  last_price  price
0  1-1-2020         5         230  241.0
1  1-1-2020        41           8    9.0
2  1-2-2020         5         241  240.0
3  1-2-2020        41           9    8.5
Note that the data includes many stocks on many different dates. How can I create a model that uses features such as last_price and Stock_id to predict the next-day price, and that is re-trained as older data accumulates?
This was the best I could do. I used LinearRegression, but advice on any other model is welcome.
X = df[['Stock_id', 'last_price']]
y = df['price']  # 1-D target, so y_test and y_pred line up in the result frame below
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import linear_model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
result = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
Index Actual Predicted
487 45 32
4154 420 512
Is there a way where the model is trained on the first 3000 rows? Then the model makes a prediction for say date 12-11-2020 and then adds 12-11-2020 info to make the prediction for 12-12-2020 and so on?
I was hoping to get something like this.
Date Actual Predicted
12-11-2020 45 32
12-11-2020 420 512
12-12-2020 43 34
12-12-2020 423 513
I don't think including the ID in your training data is appropriate: comparing IDs gives no usable information and may lead to a badly fitted linear function. The ID only signifies which stock a row belongs to and is constant for that stock across the whole dataset. The value of Stock_id also has no meaning that can be used to compare stocks. For example, Stock_id = 1 and Stock_id = 2 are not closer together than Stock_id = 1 and Stock_id = 100; they are just names. So I think you should split your original dataset by Stock_id and include only last_price in each of the new training datasets (X). You can do that in several ways, one of them being the groupby function of pandas:
grouped = df.groupby(df.Stock_id)
stock_1= grouped.get_group(1)
After that, you can use a for loop on the unique value of your Stock_id column to get all the ids and their dataframes. Then you define a regression model for each of these new datasets and use the fit method to train it.
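The per-stock loop described above could look roughly like this. It is a minimal sketch, assuming a small made-up frame with the question's column names; the models dictionary and the example price 243 are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small hypothetical frame in the same shape as the question's data
df = pd.DataFrame({
    "Date": ["1-1-2020", "1-1-2020", "1-2-2020", "1-2-2020", "1-3-2020", "1-3-2020"],
    "Stock_id": [5, 41, 5, 41, 5, 41],
    "last_price": [230, 8, 241, 9, 240, 8.5],
    "price": [241, 9, 240, 8.5, 243, 9.5],
})

# One regression model per stock, trained only on that stock's rows
models = {}
for stock_id, group in df.groupby("Stock_id"):
    model = LinearRegression()
    model.fit(group[["last_price"]], group["price"])
    models[stock_id] = model

# Predict the next price for stock 5 given its latest known price
next_price = models[5].predict(pd.DataFrame({"last_price": [243]}))
```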
LinearRegression does not support partial fitting, so I think you need to call the fit method again each time you want to update your model. You can use the first N rows of each stock to fit the model, then predict the value for the next row, add the predicted value to the N rows, and re-fit the model on the extended dataset. However, if your model already fits a good line to the data, I don't think you will see much of a difference from adding new predictions to the training dataset.
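A refit-per-date loop could be sketched as follows. Note the swap: this version appends the actual observed price of each date to the training window rather than the predicted one, which is an assumption about what is wanted (it matches the Date / Actual / Predicted table in the question). The data is a synthetic single-stock random walk, and n_initial is a hypothetical window size.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Hypothetical single-stock history: 30 days of a random-walk price
dates = pd.date_range("2020-11-01", periods=30, freq="D")
price = 100 + rng.randn(30).cumsum()
df = pd.DataFrame({"Date": dates, "price": price})
df["last_price"] = df["price"].shift(1)   # previous day's price as the feature
df = df.dropna().reset_index(drop=True)

n_initial = 20  # assumed size of the first training window
rows = []
for i in range(n_initial, len(df)):
    train = df.iloc[:i]    # everything observed before the current date
    today = df.iloc[[i]]   # the single row to predict
    model = LinearRegression().fit(train[["last_price"]], train["price"])
    predicted = model.predict(today[["last_price"]])[0]
    rows.append({"Date": today["Date"].iloc[0],
                 "Actual": today["price"].iloc[0],
                 "Predicted": predicted})

result = pd.DataFrame(rows)
```

Each pass refits from scratch on a window that grows by one date, so the cost rises with the history length; for 23k rows split by stock this is usually still cheap for a linear model.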
Another option is to use SGDRegressor instead of LinearRegression. Its partial_fit() method allows incremental training, which lets you train your model on new data without re-training on the whole dataset. The scikit-learn documentation describes this model, and a linked answer explains the difference between SGDRegressor and LinearRegression.
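As a rough illustration of incremental training with partial_fit(), here is a minimal sketch on synthetic data. The batch size of 10, the 400-row initial fit, and the StandardScaler step are assumptions made for the example; scaling is included because stochastic gradient descent tends to be sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical data: price is roughly last_price plus noise
last_price = rng.uniform(5, 250, size=(500, 1))
price = last_price.ravel() + rng.randn(500)

# Standardize the feature using only the initial batch
scaler = StandardScaler().fit(last_price[:400])

model = SGDRegressor(random_state=0)
model.partial_fit(scaler.transform(last_price[:400]), price[:400])  # initial training

# Later, update with new rows only, without revisiting the first 400
for start in range(400, 500, 10):
    batch = slice(start, start + 10)
    model.partial_fit(scaler.transform(last_price[batch]), price[batch])

prediction = model.predict(scaler.transform([[100.0]]))
```

Each partial_fit call performs a single pass over the rows it is given, so a model updated this way may need more data than a full fit to reach a comparable line.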
If you still want to use LinearRegression and retrain the model, I suggest you update your model with batches of data instead of retraining it on each new predicted value. You can wait until a certain number of predicted values has accumulated, for example 10, then add these 10 new values to your training dataset and retrain the model just once. A related answer explains three approaches to retraining the model, which might be useful for you.
Below is the code I wrote to predict possible diseases using k-means from a dataset which has 3 parameters. Is this correct?
It is not giving results as accurate as I want.
import pandas as pd #importing library for reading dataset
from sklearn.cluster import KMeans #using ML library in python for utilizing kmeans
##reading the dataset from csv file and storing in variable called data..
data = pd.read_csv(r"C:\Users\Hassan Tariq\Disease Prediction\DataSet.csv")
##selecting data cols from dataset.
X_Data = data.iloc[:,[1]] #first col as a part of first variable
Y_Data = data.iloc[:,[2,3]] ##second col as a part of second variable
##i have used two cols in second variable because we cannot train kmeans on three parameters.
#initializing the model with 3 initial clusters.
model1 = KMeans(n_clusters=3, random_state=3)
#training model on the selected data..
prediction = model1.fit_predict(X_Data,Y_Data)
#printing the clusters prediction from the model.
print("Clustered Dataset: \n",prediction)
#printing the centroids which shows the data behavior in each cluster
print("Centroids of the clusters formed: \n",model1.cluster_centers_)
centeroids_collection = model1.cluster_centers_
#specifying the diseases which can be possible.
disease1 = ['Muscle Twitching','Nausea']
disease2 = ['Eye Irritation', 'Lung Irritation']
disease3 = ['Eye Irritation','Diarrhea']
#loop for iterating all the data in the dataset to predict the disease..
Don't hard-code the number of clusters. First estimate the number of clusters with the elbow method, then fit the model with that number; this way you should get better results from your predictions. Sample code to fit the clusters is below:
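from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Standardize the features so that no single column dominates the distances
X_std = StandardScaler().fit_transform(data)
# Fit k-means with scikit-learn's KMeans; here we tested 3 clusters
km = KMeans(n_clusters=3, max_iter=100, random_state=42)
km.fit(X_std)
centroids = km.cluster_centers_
# Predicting on the training data is equivalent to calling fit(x) and reading km.labels_
labels_ = km.predict(X_std)
labels_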
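The elbow method mentioned above can be sketched as follows: fit k-means for several values of k and look for the point where the inertia stops dropping sharply. This is a minimal sketch on synthetic data from make_blobs; the range 1 to 10 and n_init=10 are assumptions, and all three feature columns are used together.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 300 points with 3 features drawn around 4 centers
X, _ = make_blobs(n_samples=300, n_features=3, centers=4, random_state=42)
X_std = StandardScaler().fit_transform(X)

# Fit k-means for a range of k and record the inertia
# (sum of squared distances of points to their nearest centroid)
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_std)
    inertias.append(km.inertia_)

# Print k next to its inertia; the "elbow" is where the decrease flattens out
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting inertias against k_values usually makes the elbow easier to spot than reading the printed numbers.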
I'm using a dataset from Kaggle, the Cardiovascular Disease Dataset.
The model has been trained, and what I want to do is label a single input (a row of 13 values)
entered dynamically.
The dataset has 13 features + 1 target and 66k rows.
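import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
#prepare dataset for train and test
dfCardio = pd.read_csv("cleanCardio.csv")
y = dfCardio['cardio']
x = dfCardio.drop('cardio',axis = 1, inplace=False)
model = KNeighborsClassifier()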
x_train,x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)
model.fit(x_train, y_train)
# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The model is trained; what I want to do is predict the label of this single row:
['69','1','151','22','37','0','65','140','90','2','1','0','0','1']
and get 0 or 1 back for the target.
So I wrote this code:
import numpy as np
import pandas as pd
single = np.array(['69','1','151','22','37','0','65','140','90','2','1','0','0','1'])
singledf = pd.DataFrame(single)
final=singledf.transpose()
prediction = model.predict(final)
print(prediction)
But it gives the error: query data dimension must match training data dimension
How can I fix the labeling for a single row? Why am I not able to predict a single case?
Each instance in your dataset has 13 features and 1 label.
x = dfCardio.drop('cardio',axis = 1, inplace=False)
This line in the code removes what I assume is the label column from the data, leaving only the (13) feature columns.
The feature vector on which you are trying to predict, is 14 elements long. You can only predict on feature vectors that are 13 elements long because that is what the model was trained on.
If you are looking for a quick solution, you can use this: pass a 2-D array with exactly 13 numeric feature values.
import numpy as np
import pandas as pd
single = np.array([[69, 1, 151, 22, 37, 0, 65, 140, 90, 2, 1, 0, 0]])
prediction = model.predict(single)
print(prediction)
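The shape requirement can be shown end to end with a self-contained sketch. The training data here is random and the feature names f0 to f12 are hypothetical; the point is only that a single case must be one row with the same 13 columns the model was fitted on.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(42)

# Hypothetical training data: 13 numeric features and a binary target
feature_names = [f"f{i}" for i in range(13)]
x_train = pd.DataFrame(rng.rand(200, 13), columns=feature_names)
y_train = rng.randint(0, 2, size=200)

model = KNeighborsClassifier().fit(x_train, y_train)

# A single case must be 2-D with shape (1, 13): one row, 13 feature columns
single = pd.DataFrame([rng.rand(13)], columns=feature_names)
prediction = model.predict(single)
```

With a plain 1-D NumPy array, calling .reshape(1, -1) before predict gives the same one-row layout.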
I disagree with the others; I don't think this is a problem with including the target.
I had this problem too. The only way I got around it was to input part of x.
So:
x2=x.iloc[0:3]
then give the first row a new value:
x2.iloc[0]=single
ypred=model.predict(x2)
and just look at ypred[0].
Or try a dataframe with 2 values
I have the following code for a gradient boosting classifier to be used for a binary classification problem.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
#Creating training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30,random_state=1)
#Count of goods in the training set
#This count is 50000
y0 = len(y_train[y_train['bad_flag'] == 0])
#Count of bads in the training set
#This count is 100
y1 = len(y_train[y_train['bad_flag'] == 1])
#Creating the sample_weights array. Include all bad customers and
#twice the number of goods as bads
w0=(y1/y0)*2
w1=1
sample_weights = np.zeros(len(y_train))
sample_weights[y_train['bad_flag'] == 0] = w0
sample_weights[y_train['bad_flag'] == 1] = w1
model=GradientBoostingClassifier(
n_estimators=100,max_features=0.5,random_state=1)
model=model.fit(X_train, y_train.values.ravel(), sample_weight=sample_weights)
My thinking in writing this code is as follows:
sample_weights will allow model.fit to select all 100 bads and 200 goods from the training set, and this same set of 300 customers will be used to fit 100 estimators in a forward stage-wise fashion. I want to undersample my training set because the two response classes are highly imbalanced. Is my understanding of the code correct?
Also, I would like to confirm that n_estimators=100 means that 100 estimators will be fit on the same set of 300 customers. This would also mean that there is no bootstrapping in the gradient boosting classifier, as there is in a bagging classifier.
As far as I understand, this is not how it works. By default you have GradientBoostingClassifier(subsample=1.0), which means that the sample used at each stage (for each of the n_estimators) has the same size as your original training set. The weights do not select rows; they only change how much each row counts in the loss, so they do not change the size of the subsample. If you want to enforce 300 observations for each stage, you need to set subsample = 300/(50000+100) in addition to the weight definition.
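Combining the weights with subsample could look like the sketch below. The data is synthetic and smaller than in the question (5000 goods and 100 bads, chosen to keep the example fast), so the numbers are assumptions. Also note that the roughly 300 rows used at each stage are drawn at random from the whole training set, so they are not guaranteed to be all bads plus 200 goods.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(1)

# Hypothetical imbalanced training set: 5000 goods (0) and 100 bads (1)
n_good, n_bad = 5000, 100
X_train = np.vstack([rng.randn(n_good, 4), rng.randn(n_bad, 4) + 1.5])
y_train = np.r_[np.zeros(n_good, dtype=int), np.ones(n_bad, dtype=int)]

# Weights rescale each row's contribution to the loss; no rows are dropped
sample_weights = np.where(y_train == 1, 1.0, 2 * n_bad / n_good)

# subsample < 1.0 makes each stage train on a random fraction of the rows
model = GradientBoostingClassifier(
    n_estimators=100,
    max_features=0.5,
    subsample=300 / (n_good + n_bad),
    random_state=1,
)
model.fit(X_train, y_train, sample_weight=sample_weights)
```

If the intent is a true undersample of exactly 100 bads and 200 goods, selecting those rows before calling fit is the more direct route.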
The answer is no. For each stage, a new sample containing a fraction subsample of the observations is drawn. You can read more about it here: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting. It says:
At each iteration the base classifier is trained on a fraction subsample of the available training data.
So, as a result, when subsample is below 1.0 there is some random subsampling, similar in spirit to bagging, combined with the boosting algorithm.