I want to fit an xgboost model on a dataset, grouped by a set of variables.
Here is the reproducible code.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

train = pd.DataFrame({'A': np.random.randint(1, 3, 1000),
                      'B': np.random.randint(1, 3, 1000),
                      'C': np.random.uniform(1, 10, 1000),
                      'D': np.random.uniform(1, 25, 1000),
                      'E': np.random.uniform(10, 50, 1000),
                      'F': np.random.uniform(5, 250, 1000)})
This is the mock-up data. I want to fit the model separately for each group defined by columns 'A' and 'B'.
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(train.iloc[:,:-1], train.iloc[:,-1], test_size=test_size, random_state=seed)
Here I am splitting the data into train and test sets.
models = []
def xgb_model_fit(x):
    xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                              max_depth=5, alpha=10, n_estimators=10)
    # fit on this group's rows only; the matching y values are looked up by index
    xg_reg.fit(x, y_train.loc[x.index])
    models.append([x.name, xg_reg])
The group-by variable list is
gb = ['A','B']
The model-building step is
X_train.groupby(gb,as_index=False).apply(xgb_model_fit)
Now I want to use the fitted models on the test data for their respective groups.
Question1:
In the test data, my first group is (1, 1). I want to apply the model fitted for group (1, 1) to the test rows of that same group and get the fitted values. How do I do this?
Question2:
Is there a more direct way to fit a model per group instead of running the model for each group separately?
How can I do this task?
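For Question 1, a minimal sketch (not a definitive answer, and assuming the models list is populated with [group_key, fitted_model] pairs as above) of applying each group's model to the matching test rows could look like this:
# map each stored group key (a tuple like (1, 1)) to its fitted model
model_lookup = {name: m for name, m in models}
preds = {}
for key, grp in X_test.groupby(gb):
    if key in model_lookup:
        preds[key] = model_lookup[key].predict(grp)
# preds[(1, 1)] now holds the predictions for test group (1, 1)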
Related
I'm working with the fetch_kddcup99 dataset, and using pandas I've converted the original dataset to something like this, with dummy variables:
DataFrame
Note that after dropping duplicates, the final dataframe only contains 149 observations.
Then I start the feature-engineering phase by one-hot encoding protocol_type, which is a string categorical variable, and transforming y to 0/1.
X = pd_data.drop(target, axis=1)
y = pd_data[target]
y = y.astype('int')
protocol_type = [['tcp', 'udp', 'icmp']]
col_transformer = ColumnTransformer([
    ("encoder_tipo1", OneHotEncoder(categories=protocol_type, handle_unknown='ignore'),
     ['protocol_type']),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=89)
Finally I proceed to the model evaluation, which gives me the following result:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('SVM', SVC()))
#selector = SelectFromModel(estimator=model)
scaler = option2
selector = SelectKBest(score_func=f_classif, k=3)
results = []
for name, model in models:
    pipeline = make_pipeline(col_transformer, scaler, selector)
    #print(pipeline)
    X_train_selected = pipeline.fit_transform(X_train, y_train)
    #print(X_train_selected)
    X_test_selected = pipeline.fit_transform(X_test, y_test)
    modelo = model.fit(X_train_selected, y_train)
    kf = KFold(n_splits=10, shuffle=True, random_state=89)
    cv_results = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='accuracy')
    results.append(cv_results)
    print(name, cv_results)
plt.boxplot(results)
plt.show()
Boxplots from CV
My question is: why are the models all the same? Could it be due to the small number of rows in the dataframe, or am I doing something wrong?
You have 149 rows, of which 80% go into the training set, so 119. You then do 10-fold cross-validation, so each test fold has about 12 samples. So each individual test fold has only 13 possible accuracy scores; even if the classifiers predict some samples a little differently, they may have the same accuracy. (The common scores you see (1, 0.88, 0.71) don't line up with the fractions I'm expecting though, so maybe I've missed something?) So yes, possibly it's just the small number of rows, compounded with the cross-validation. Selecting down to just 3 features also probably contributes.
One quick thing to check is some continuous score of the models' performance, say log-loss or Brier score.
(And, Gaussian is probably the wrong Naive Bayes to use with your data, containing so many binary features.)
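For example, a hedged sketch of checking log-loss with the same kind of folds, reusing the names from the question's loop (only models that expose predict_proba can be scored this way, so e.g. SVC would need probability=True):
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

kf = KFold(n_splits=10, shuffle=True, random_state=89)
# 'neg_log_loss' is a continuous score, so ties in accuracy no longer hide differences
log_losses = cross_val_score(LogisticRegression(), X_train_selected, y_train,
                             cv=kf, scoring='neg_log_loss')
print(log_losses)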
I have a dataframe with 36540 rows. The objective is to predict y, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost, which doesn't exist); however, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0, so a random train/test split will likely result in a test set with virtually no samples above 0 (hence a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
I have spent 30 hours debugging this single problem and it makes absolutely no sense; hopefully one of you can show me a different perspective.
The problem is that I use my training dataframe in a random forest and get very good accuracy (98%-99%), but when I try to load in a new sample to predict on, the model ALWAYS guesses the same class.
# Shuffle the data-frames records. The labels are still attached
df = df.sample(frac=1).reset_index(drop=True)
# Extract the labels and then remove them from the data
y = list(df['label'])
X = df.drop(['label'], axis='columns')
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)
# Construct the model
model = RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, random_state=RANDOM_STATE,oob_score=True)
# Calculate the training accuracy
in_sample_accuracy = model.fit(X_train, y_train).score(X_train, y_train)
# Calculate the testing accuracy
test_accuracy = model.score(X_test, y_test)
print()
print('In Sample Accuracy: {:.2f}%'.format(model.oob_score_ * 100))
print('Test Accuracy: {:.2f}%'.format(test_accuracy * 100))
The way I am processing the data is the same, but when I predict on X_test or X_train I get my normal 98%, and when I predict on my new data it always guesses the same class.
# The json file is not in the correct format, this function normalizes it
normalized_json = json_normalizer(json_file, "", training=False)
# Turn the json into a list of dictionaries which contain the features
features_dict = create_dict(normalized_json, label=None)
# Convert the dictionaries into pandas dataframes
df = pd.DataFrame.from_records(features_dict)
print('Total amount of email samples: ', len(df))
print()
df = df.fillna(-1)
# One hot encodes string values
df = one_hot_encode(df, noOverride=True)
if 'label' in df.columns:
    df = df.drop(['label'], axis='columns')
print(list(model.predict(df))[:100])
print(list(model.predict(X_train))[:100])
Above is my testing scenario; you can see in the last two lines that I am predicting on X_train (the data used to train the model) and on df (the out-of-sample data), where it always guesses class 0.
Some useful information:
The datasets are imbalanced; class 0 has about 150,000 samples while class 1 has about 600,000 samples
There are 141 features
Changing n_estimators and max_depth doesn't fix it
Any ideas would be helpful. If you need more information, let me know; my brain is fried right now and that's all I could think of.
Fixed. The issue was the imbalance of the datasets; I also realized that changing the depth gave me different results.
For example, 10 trees with 3 depth -> seemed to work fine
10 trees with 6 depth -> back to guessing only the same class
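One common way to account for this kind of imbalance (a hedged sketch, not necessarily what was done here; it reuses the constants from the question) is to let the forest reweight the classes:
from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' weights samples inversely proportional to class frequencies
model = RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH,
                               random_state=RANDOM_STATE, class_weight='balanced',
                               oob_score=True)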
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
# 5 Cross-validation
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
Error:
UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
When using cross-validation on a classifier, by default it splits the training set so that the train and test folds have the same class distribution (stratification). If there are 10 objects with label "A", which are about 20% of all examples, it will split them so that every fold gets 2 objects with label "A", i.e. label "A" is also 20% of each test fold.
But what happens when label "A" has only 1 object (one row with that class) and you try to split it across 5 folds? That is the warning you get: it does not know how to handle that.
It's a bit hard to tell how to solve this without knowing what your data looks like and what your needs are. Different problems may have different solutions.
You can:
Remove the problematic label from the dataset, or check for all classes with extremely low frequency and group them together into an "Other" class.
Give up on the default stratified splitting and pass a plain KFold, which does not require that the folds have the same class distribution (see the sketch below).
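A minimal sketch of the second option, reusing the names from the question (cross_val_score accepts a CV splitter object instead of an integer):
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# a plain (non-stratified) KFold does not care how rare a class is
kf = KFold(n_splits=5, shuffle=True, random_state=85)
accuracies = cross_val_score(LogisticRegression(random_state=0), features, labels,
                             scoring='accuracy', cv=kf)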
I have a dataset that shows whether a person has diabetes based on indicators; it looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (that need to be predicted)? That is to say, should I upload the file again? Let's say I want to predict the following:
Rows to predict:
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means that your prediction variable will come both from data that have already been used for model fitting (the X_train part) and from data unseen by the model during training (the X_test part). Not quite sure what your final objective is (nor is this what the question is about), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new and want to predict the outcome, you do it in a similar way; always assuming that X_new has the same features as X (i.e. again removing the Outcome column, as you did for X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
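Since the question's code standardized X with a StandardScaler before fitting, any new data should pass through that same fitted scaler before predicting. A hedged sketch (data_new and its column layout are assumptions):
# reuse the scaler that was fitted on the original X; drop Outcome only if it exists
X_new = sc.transform(data_new.drop(columns='Outcome', errors='ignore').values)
y_new = model.predict(X_new)
data_new['prediction'] = y_new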