XGBoost pairwise setup - python

In XGBoost I have tried multiple ways to make rank:pairwise work with groups set, but without success. The following code fails when set_group is used on xgbTrain, but runs fine with the set_group call commented out.
import numpy as np
import pandas as pd
import xgboost
from xgboost import DMatrix, train
xgb_params = {
    'booster': 'gbtree',
    'eta': 0.1,
    'gamma': 1.0,
    'min_child_weight': 0.1,
    'objective': 'rank:pairwise',
    'eval_metric': 'merror',
    #'num_class': 3, #
    'max_depth': 6,
    'num_round': 4,
    'save_period': 0
}
n_group=2
n_choice=3
#training dataset
dtrain=np.random.uniform(0,100,[n_group*n_choice,2])
dtarget=np.array([np.random.choice([0,1,2],3,False) for i in range(n_group)]).flatten()
dgroup=np.array([np.repeat(i,3)for i in range(n_group)]).flatten()
xgbTrain = DMatrix(dtrain, label = dtarget)
xgbTrain =xgbTrain.set_group(dgroup)
#watchlist
dtrain_eval=np.random.uniform(0,100,[n_group*n_choice,2])
xgbTrain_eval = DMatrix(dtrain_eval, label = dtarget)
#xgbTrain_eval =xgbTrain_eval .set_group(dgroup)
#test dataset
dtest=np.random.uniform(0,100,[n_group*n_choice,2])
dtestgroup=np.array([np.repeat(i,3)for i in range(n_group)]).flatten()
xgbTest = DMatrix(dtest)
#xgbTest =xgbTest.set_group(dgroup)
evallist = [(xgbTrain_eval, 'eval')]
rankModel = xgboost.train(params=xgb_params,dtrain=xgbTrain )
print(rankModel.predict( xgbTest))
The error returned seems to point to a lack of eval data, but even when specifying the evals as
rankModel = xgboost.train(params=xgb_params,dtrain=xgbTrain,evals=evallist )
the error remains.
Note that num_class is commented out, but intuitively it should have a value: either 3 (here, the number of classes) or 2 (the number of groups in the case of pairwise ranking)?
Any help in pointing out what is wrong?
(XGBoost 0.6)

The error:
Mea culpa, the set_group call is incorrect and should be
xgbTrain.set_group(dgroup)
and not
xgbTrain =xgbTrain.set_group(dgroup)
The solution:
The data passed to set_group should just be the count of items in each group, with one entry per group.
dgroup=np.array([n_choice for i in range(n_group)]).flatten()
That did it!
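For reference, a minimal end-to-end sketch of the corrected setup, assuming the same toy data as above (group sizes are what set_group expects, and the call is not reassigned):
import numpy as np
import xgboost
from xgboost import DMatrix
n_group, n_choice = 2, 3
xgb_params = {'objective': 'rank:pairwise', 'eta': 0.1, 'max_depth': 6}
# features and per-group relevance labels for the toy ranking problem
dtrain = np.random.uniform(0, 100, [n_group * n_choice, 2])
dtarget = np.array([np.random.choice([0, 1, 2], 3, False) for _ in range(n_group)]).flatten()
# one entry per group, each entry being the number of rows in that group
group_sizes = [n_choice] * n_group
xgbTrain = DMatrix(dtrain, label=dtarget)
xgbTrain.set_group(group_sizes)  # set_group modifies the DMatrix in place
rankModel = xgboost.train(params=xgb_params, dtrain=xgbTrain, num_boost_round=4)
print(rankModel.predict(DMatrix(dtrain)))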

I am able to select rows from a column based on one parameter, but not two?

So I am trying to select rows with label values between 4501 and 4524 in the column vehicle.vehicle.id. When I try this
df = df.loc[df["vehicle.vehicle.id"].between(4501,4524)]
I get this
(0,11)
where there is nothing in between, as indicated by (0,11). The same thing happens when I break it down:
df = df.loc[(df["vehicle.vehicle.id"] >= 4501) & (df["vehicle.vehicle.id"] <= 4524)]
print(df.shape)
(0,11)
However, when I try this, it works. Why will it not take multiple parameters? The (115,11) shape indicates that it is able to find 115 instances of vehicle id greater than or equal to 4501.
df = df.loc[df["vehicle.vehicle.id"] >=4501]
print(df.shape)
(115,11)
TL;DR: you don't have an issue with taking multiple parameters.
The detailed / possible explanation is below:
Sharing sample data, like what I have done here, would be helpful in figuring out the issue. It also gives much better context for the problem you are facing.
As @jezrael mentioned, it could be the case that your records simply have no ids between 4501 and 4524, but do have records after id 4524.
Here is an example illustrated below, based on your functions. There is nothing wrong with your query; perhaps what you are trying to find is simply not there.
Code example illustrated below:
import pandas as pd
_list = [
    {'vehicle.vehicle.id': 1, 'vehicle_name': 'a'},
    {'vehicle.vehicle.id': 2, 'vehicle_name': 'b'},
    {'vehicle.vehicle.id': 3, 'vehicle_name': 'c'},
    {'vehicle.vehicle.id': 4, 'vehicle_name': 'd'},
]
_df = pd.DataFrame(_list)
# resulting _df table (screenshot): https://i.stack.imgur.com/YFb2V.png
##############
# first approach
##############
_range = _df.loc[_df['vehicle.vehicle.id'].between(1,3)]
print(_range.shape)
# outputs: (3,2)
# means 3 rows, and 2 cols extracted
# row with the id 1,2,3 got extracted
##############
# second approach
##############
_range = _df.loc[(_df['vehicle.vehicle.id'] > 1) & (_df['vehicle.vehicle.id'] < 4)]
print(_range.shape)
# outputs: (2,2)
# means 2 rows, and 2 cols extracted
# row with the id 2 and 3 got extracted
##############
# third approach
##############
_range = _df.loc[_df['vehicle.vehicle.id'] > 3]
print(_range.shape)
# outputs: (1,2)
# means 1 rows, and 2 cols extracted
# row with the id 4 got extracted
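As a quick sanity check on your own data (a sketch, assuming df is already loaded), you can inspect the range of ids actually present to confirm whether anything falls between 4501 and 4524:
# inspect the range of ids actually present in the data
print(df["vehicle.vehicle.id"].min(), df["vehicle.vehicle.id"].max())
# count how many rows fall inside the requested window
print(df["vehicle.vehicle.id"].between(4501, 4524).sum())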

Running T Test - Statistical Significance

I want to calculate whether there is a statistically significant difference between the different species when looking at the occurrence of the virus. I tried:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
data = {'WNV Present': ["negative", "negative", "positive", "negative", "positive", "positive", "negative", "negative", "negative"],
        'Species': ["Myotis", "Myotis", "Hoary", "Myotis", "Myotis", "Keens", "Myotis", "Keens", "Keens"]}
my_data = pd.DataFrame(data)
# binarize the WNV Present column
my_data["WNV Present"] = np.where(my_data["WNV Present"] == "positive", 1, 0)
my_data
# binarize the Species column
dum_col3 = pd.get_dummies(my_data["Species"])
dum_col3
dummy_df5 = my_data.join(dum_col3)
dummy_df5.drop(["Species"], axis=1, inplace=True)
dummy_df5
# running the t test
set1 = dummy_df5[dummy_df5['WNV Present'] == 1]
set2 = dummy_df5[dummy_df5['Myotis'] == 1]
ttest_ind(set1, set2)
My results:
Ttest_indResult(statistic=array([ 3. , 1.36930639, 1.36930639, -2.73861279]), pvalue=array([0.0240082 , 0.21994382, 0.21994382, 0.03379779]))
Why am I receiving multiple p-value results? I tried running this again without binarizing the Species column, but that also doesn't tell me whether there is a significant difference between species.
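A likely explanation, with a minimal sketch (not necessarily the comparison you ultimately want): scipy's ttest_ind works column-wise (axis=0) on 2D inputs, so passing whole DataFrames returns one statistic and one p-value per column, four of each here. Comparing a single column between two groups gives a single result:
# e.g. compare WNV Present between two species groups, one column at a time
myotis = dummy_df5[dummy_df5['Myotis'] == 1]['WNV Present']
keens = dummy_df5[dummy_df5['Keens'] == 1]['WNV Present']
print(ttest_ind(myotis, keens))  # single statistic, single p-value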

Python - Kmeans - Add the centroids as a new column

Assume I have the following dataframe. How can I create a new column "new_col" containing the centroids? I can only create the column with the labels, not with the centroids.
Here is my code.
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans
numbers = pd.DataFrame(list(range(1, 1000)), columns=['num'])
kmean_model = KMeans(n_clusters=5)
kmean_model.fit(numbers[['num']])
kmean_model.cluster_centers_
array([[699. ],
[297. ],
[497.5],
[899.5],
[ 99. ]])
numbers['new_col'] = kmean_model.predict(numbers[['num']])
It is simple. Just use .labels_ as follows.
numbers['new_col'] = kmean_model.labels_
Edit: sorry, my mistake.
Make a dictionary whose keys are the cluster labels and whose values are the corresponding centers, then replace new_col using that dictionary. See the following (cluster_centers_ is indexed by label, and each center is a one-element array here, hence center[0]):
label_center_dict = {label: center[0] for label, center in enumerate(kmean_model.cluster_centers_)}
numbers['new_col'] = kmean_model.labels_
numbers['new_col'] = numbers['new_col'].replace(label_center_dict)
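An alternative sketch that avoids the dictionary entirely: since cluster_centers_ is indexed by label, the label array can index it directly (assuming a single feature, so column 0 is taken):
# map each sample's label straight to its cluster centre (single-feature case)
numbers['new_col'] = kmean_model.cluster_centers_[kmean_model.labels_, 0]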

Scikit-learn - ValueError: Input contains NaN, infinity or a value too large for dtype('float32') with Random Forest

First, I have checked the different posts concerning this error and none of them can solve my issue.
So I am using RandomForest, and I am able to generate the forest and make a prediction, but sometimes during the generation of the forest I get the following error.
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error occurs with the same dataset: sometimes the dataset causes an error during training, and most of the time it does not. The error sometimes occurs at the start and sometimes in the middle of the training.
Here's my code:
import pandas as pd
from sklearn import ensemble
import numpy as np

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Execution logic goes here
    Input = dataframe1.values[:,:]
    InputData = Input[:,:15]
    InputTarget = Input[:,16:]
    limitTrain = 2175
    clf = ensemble.RandomForestClassifier(n_estimators = 10000, n_jobs = 4)
    features = np.empty([len(InputData),10])
    j = 0
    for i in range(0,14):
        if (i == 1 or i == 4 or i == 5 or i == 6 or i == 8 or i == 9 or i == 10 or i == 11 or i == 13 or i == 14):
            features[:,j] = (InputData[:, i])
            j += 1
    clf.fit(features[:limitTrain,:], np.asarray(InputTarget[:limitTrain,1], dtype = np.float32))
    res = clf.predict_proba(features[limitTrain+1:,:])
    listreu = np.empty([len(res),5])
    for i in range(len(res)):
        if res[i,0] > 0.5:
            listreu[i,4] = 0
        elif res[i,1] > 0.5:
            listreu[i,4] = 1
        elif res[i,2] > 0.5:
            listreu[i,4] = 2
        else:
            listreu[i,4] = 3
    listreu[:,0] = features[limitTrain+1:,0]
    listreu[:,1] = InputData[limitTrain+1:,2]
    listreu[:,2] = InputData[limitTrain+1:,3]
    listreu[:,3] = features[limitTrain+1:,1]
    # Return value must be of a sequence of pandas.DataFrame
    return pd.DataFrame(listreu),
I ran my code locally and on Azure ML Studio and the error occurs in both cases.
I am sure that it is not due to my dataset since most of the time I don't get the error and I am generating the dataset myself from different input.
This is a part of the dataset I use
EDIT: I probably found the cause: I had 0 values which were not real 0 values. The values were like
3.0x10^-314
I would presume that somewhere in your dataframe you sometimes have NaN values.
These can simply be removed using
dataframe1 = dataframe1.dropna()
However, with this approach you could be losing some valuable training data, so it may be worth looking into .fillna() or sklearn.preprocessing.Imputer in order to fill in values for the NaN cells in the df.
Without seeing the source of dataframe1 it is hard to give a complete answer, but it is possible that some sort of train/test split is going on, resulting in the dataframe that gets passed in only having NaN values some of the time.
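For example, a minimal sketch of filling instead of dropping (mean-imputation is just one choice, and dataframe1 is assumed to be fully numeric here):
# fill NaNs with each column's mean instead of dropping rows
dataframe1 = dataframe1.fillna(dataframe1.mean())
# or, with the older sklearn API mentioned above
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
dataframe1 = pd.DataFrame(imputer.fit_transform(dataframe1.values))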
Since I corrected the problem from the edit, I have no more errors. I just replaced the 3.0x10^-314 values with zeros.
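A rough sketch of that cleanup (using the smallest normal float32 as the cutoff is just one possible choice):
import numpy as np
# values such as 3.0x10^-314 underflow float32; zero out anything smaller
# than the smallest normal float32 before training
tiny = np.finfo(np.float32).tiny
features[np.abs(features) < tiny] = 0.0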
Some time ago I got unstable errors when I used an explicit CPU count in a parameter such as your n_jobs = 4. Try not using n_jobs at all, or use n_jobs = -1 for automatic CPU count detection. Maybe it will help.
Try to use float64 instead of float32.
EDIT:
Show us the dataset that did it

Input contains NaN, infinity or a value too large for dtype('float64') error but no values in dataset

I am working on the Titanic machine learning problem from Kaggle - the beginner one.
I am writing my code in python, and the model type is K-NN.
I am receiving the error 'Input contains NaN, infinity or a value too large for dtype('float64')'; however, I have checked my data thoroughly. There are no infinite values, no NaN values, and no large values. The error is not thrown on my training set but is thrown on the test set - they are not different in values (obviously different in content, but the types of values are the same).
Here is my code:
import numpy as np
import pandas as pd
test_dataset = pd.read_csv('test.csv')
X_classt = test_dataset.iloc[:, 1].values.reshape((1,-1))
X_faret = test_dataset.iloc[:,8].values.reshape((1,-1))
X_Stpt = test_dataset.iloc[:,3:7]
X_embarkedt = test_dataset.iloc[:,10].values.reshape((-1,1))
X_onet = np.concatenate((X_classt,X_faret))
X_onet = np.matrix.transpose(X_onet)
X_twot = np.concatenate((X_Stpt,X_embarkedt),axis=1)
Xt = np.concatenate((X_onet,X_twot),axis=1)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN',strategy ='mean', axis = 0)
imputer = imputer.fit(Xt[:,3:5])
Xt[:,3:5] = imputer.transform(Xt[:,3:5])
Xt_one = np.array(Xt[:,0:2],dtype = np.float)
ColThreet = Xt[:,2]
Xt_two = np.array(Xt[:,3:6],dtype=np.float)
ColSevent = Xt[:,6]
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
lett = LabelEncoder()
Xt[:,2] = lett.fit_transform(ColThreet)
lest = LabelEncoder()
Xt[:,6] = lest.fit_transform(Xt[:,6])
#This is where the error is thrown
ohct = OneHotEncoder(categorical_features=[6])
Xt = ohct.fit_transform(Xt).toarray()
Thank you for any help you can provide. I realize that my naming convention is weird, but it is because I used basically the same variables I did for my training code, so I added a 't' at the end of each variable to 'reuse' the names for the test set code.
Thanks in advance.
There are still null values, hence the error message. By quickly running your code I could see there is a null value in the 2nd feature.
Just after Xt = np.concatenate((X_onet,X_twot),axis=1) I could see there are null values in the 2nd and 4th features:
pd.DataFrame(Xt).isnull().sum()
while you only pass features 3:5 for null handling.
Just checking before encoding confirms this. Hope this helps.
Just a quick off-topic suggestion: you should always include column headers, as it helps to build some intuition about the data and results.
You could apply df['columnX'].fillna(0) to your dataframe to use 0 as a default value.
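Putting the check and the fix together, a minimal sketch (the column indices follow the answer above, i.e. it assumes the NaNs turned out to be in columns 1 and 3 of Xt, and may differ for your data):
# see which features still contain NaN after the concatenation
print(pd.DataFrame(Xt).isnull().sum())
# impute every column the check flags, not only columns 3:5
imputer_all = Imputer(missing_values='NaN', strategy='mean', axis=0)
Xt[:, [1, 3]] = imputer_all.fit_transform(Xt[:, [1, 3]])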
