How can I extract the newly added rows after SMOTE (imblearn module) - python

Is it possible to extract the newly added rows from a pandas DataFrame that were created by imblearn's SMOTE?

I think I figured it out. Apparently the synthetic rows are appended at the end of the dataframe returned by fit_resample.
My target is "DIED":
smotez = SMOTENC(categorical_features=[10, 11], random_state=555, k_neighbors=10)
smote_tomek = SMOTETomek(random_state=555, smote=smotez, n_jobs=-1)
X_train_new, y_train_new = smote_tomek.fit_resample(X_train, y_train)
# fit_resample returns a fresh RangeIndex, so X and y can be concatenated directly
train_data_new = pd.concat([X_train_new, y_train_new], axis=1)
# everything beyond the length of the original training set is newly added data
smote_data = train_data_new.iloc[len(X_train):]
print("Y_train_smote:\n", np.asarray(np.unique(smote_data['DIED'], return_counts=True)).T, smote_data['DIED'].mean())
As you can see, all rows are of the minority class ("DIED")
Y_train_smote:
[[ 1 91936]] 1.0
Double-checking, the expression below should return 0:
print(len(smote_data) + len(X_train) - len(X_train_new))
0
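If you would rather not rely on the rows being appended at the end (SMOTETomek can also drop some of the original rows when it removes Tomek links), a hedged alternative is an anti-join of the resampled data against the original training data. This is only a sketch and assumes the columns together identify a row:
original = pd.concat([X_train, y_train], axis=1)
merged = train_data_new.merge(original.drop_duplicates(), how='left', indicator=True)
# rows present only in the resampled data are the newly added ones
smote_data = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')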


Drop bad data from dataset Tensorflow

I have a training pipeline using tf.data. Inside the dataset there are some bad elements, in my case values of 0. How do I drop these bad elements based on their value? I want to be able to remove them within the pipeline while training, since the dataset is large.
Assume the following pseudo code:
def parse_function(element):
    height = element['height']
    if height <= 0:
        skip()  # How to skip this element?
    labels = element['label']
    features['height'] = height
    return features, labels
ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)
One suggestion would be using ds.skip(1) based on the feature value, or providing some sort of neutral weight/loss?
You can use tf.data.Dataset.filter:
def filter_func(elem):
    """Return True if the element is to be kept."""
    return tf.math.greater(elem['height'], 0)

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.filter(filter_func)
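A minimal, self-contained sketch (toy data standing in for ds_files, which the question does not show) to confirm that elements with a non-positive height are dropped:
import tensorflow as tf

toy = tf.data.Dataset.from_tensor_slices({'height': [0.0, 1.7, -2.0, 1.5],
                                          'label': [0, 1, 0, 1]})
clean = toy.filter(lambda elem: tf.math.greater(elem['height'], 0))
for elem in clean:
    print(elem['height'].numpy(), elem['label'].numpy())  # prints 1.7 1 and 1.5 1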
Assuming that element is a DataFrame in your code, it would be:
def parse_function(element):
    element = element.query('height > 0')
    labels = element['label']
    features = {}  # build the feature dict from the filtered rows
    features['height'] = element['height']
    return features, labels

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)
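If the cleaning really happens in pandas rather than inside map, a hedged variant of the same idea (df, 'height' and 'label' are assumed names, not from the question) is to filter the DataFrame before building the dataset:
df_clean = df.query('height > 0')
ds = tf.data.Dataset.from_tensor_slices({'height': df_clean['height'].values,
                                         'label': df_clean['label'].values})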

How to split a dataframe according to sub ID

I have a CSV file with 3 columns containing an image data set. The 1st column, 'ID', represents the patient ID; the 2nd and 3rd columns represent the side and the label of the data set, respectively. I would like to split this dataframe into a test and a train set according to patient ID, such that no patient ID is repeated in both sets, i.e. the train IDs are not present in the test set. I am using the code below:
# Defining a function for splitting the dataframe into train and test
df_Datacopy = df_Data.copy()  # copy the df
# df_Datacopy = df_Datacopy.sort_values(by=['ID'])
df_Datacopy = df_Datacopy.sample(frac=1)  # shuffle the rows
train_df = df_Datacopy.sample(frac=0.80, random_state=0)  # train split size 80%
# sorted according to ID
train_df = train_df.sort_values(by=['ID'])
# test split by removing the train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to ID
test_df = test_df.sort_values(by=['ID'])
u1 = np.unique(train_df['ID'])
u2 = np.unique(test_df['ID'])
print(set(u1).intersection(set(u2)))  # IDs present in both sets; should be empty
I tried to split the test and train sets, but the problem is that some IDs are present in both the test and the train set. It would be a great help if I could get some guidance, including a code example.
Simple Python Lists Approach
I would recommend using simple Python lists for this as the preferred and simpler approach. Since you started with pandas, I'll also provide a way to use pandas methods to achieve something similar, but with a possibly worse outcome.
whole_dataset_list = df_copy.to_numpy().tolist()
patientid_list = df['ID'].to_numpy().tolist()
patientid_set = list(set(patientid_list))
import random as rand
rand.shuffle(patientid_set)
# Change the numbers so that they represent 80%/10%/10% slices of your patient IDs
train_set_by_patientID = patientid_set[0:800]   # 80%
val_set_by_patientID = patientid_set[800:900]   # 10%
test_set_by_patientID = patientid_set[900:]     # 10%
After splitting these lists, you can use them to obtain the final train/test/val splits like this:
train_set, val_set, test_set = [], [], []
for i in range(len(whole_dataset_list)):
    curr_pt_id = whole_dataset_list[i][0]  # the patient ID is the first column
    if curr_pt_id in train_set_by_patientID:
        train_set.append(whole_dataset_list[i])
    elif curr_pt_id in val_set_by_patientID:
        val_set.append(whole_dataset_list[i])
    elif curr_pt_id in test_set_by_patientID:
        test_set.append(whole_dataset_list[i])
    else:
        raise RuntimeError("Whole dataset does not contain the given ID")
Finally, you can convert back to dataframes if you want:
train_df = pd.DataFrame(train_set, columns=df_copy.columns)
val_df = pd.DataFrame(val_set, columns=df_copy.columns)
test_df = pd.DataFrame(test_set, columns=df_copy.columns)
Second Option using Pandas Only:
Here sop_uid is a unique index. I am using a train/val/test split instead of a train/test split but that can be changed easily.
dff.sort_values(by="patient_id", axis=0, inplace=True)
# count how many instances (sop_uid rows) each patient has
dff["count_instances"] = dff.groupby("patient_id")["sop_uid"].transform("count")
df_Datacopy = dff
train_df = df_Datacopy.sample(frac=0.90, weights='count_instances', random_state=0)  # train split size 90%
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
# test split by removing the train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to count_instances
test_df = test_df.sort_values(by=['count_instances'], ascending=False)
# sample again to carve a validation set out of the remaining 90%
train_df = train_df.sample(frac=0.89, weights='count_instances', random_state=0)  # final train split ~80%
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
val_df = df_Datacopy.drop(train_df.index.append(test_df.index))
I recommend using a boolean mask to filter the dataset.
If you want to split 50/50, checking whether the ID is even or odd might work.
Since you didn't provide any sample data or further detail on the criteria to split by, I suggest:
train_df= df[df.ID % 2 == 0]
test_df = df[df.ID % 2 != 0]
Is that what you wanted to achieve?
If not, maybe provide more information on the result you want.
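For reference, a compact sketch that guarantees ID-disjoint splits, assuming scikit-learn is available and the dataframe is df_Data with an 'ID' column as in the question:
from sklearn.model_selection import GroupShuffleSplit

# 80/20 split in which every patient ID lands entirely on one side
gss = GroupShuffleSplit(n_splits=1, train_size=0.80, random_state=0)
train_idx, test_idx = next(gss.split(df_Data, groups=df_Data['ID']))
train_df = df_Data.iloc[train_idx]
test_df = df_Data.iloc[test_idx]
assert set(train_df['ID']).isdisjoint(test_df['ID'])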

Scikit-learn - ValueError: Input contains NaN, infinity or a value too large for dtype('float32') with Random Forest

First, I have checked the different posts concerning this error and none of them can solve my issue.
So I am using RandomForest and I am able to generate the forest and to do a prediction, but sometimes during the generation of the forest I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error occurs with the same dataset: sometimes the dataset causes an error during training and most of the time it does not. The error sometimes occurs at the start and sometimes in the middle of the training.
Here's my code:
import pandas as pd
from sklearn import ensemble
import numpy as np

def azureml_main(dataframe1=None, dataframe2=None):
    # Execution logic goes here
    Input = dataframe1.values[:, :]
    InputData = Input[:, :15]
    InputTarget = Input[:, 16:]
    limitTrain = 2175
    clf = ensemble.RandomForestClassifier(n_estimators=10000, n_jobs=4)
    features = np.empty([len(InputData), 10])
    j = 0
    for i in range(0, 14):
        if (i == 1 or i == 4 or i == 5 or i == 6 or i == 8 or i == 9
                or i == 10 or i == 11 or i == 13 or i == 14):
            features[:, j] = InputData[:, i]
            j += 1
    clf.fit(features[:limitTrain, :], np.asarray(InputTarget[:limitTrain, 1], dtype=np.float32))
    res = clf.predict_proba(features[limitTrain + 1:, :])
    listreu = np.empty([len(res), 5])
    for i in range(len(res)):
        if res[i, 0] > 0.5:
            listreu[i, 4] = 0
        elif res[i, 1] > 0.5:
            listreu[i, 4] = 1
        elif res[i, 2] > 0.5:
            listreu[i, 4] = 2
        else:
            listreu[i, 4] = 3
    listreu[:, 0] = features[limitTrain + 1:, 0]
    listreu[:, 1] = InputData[limitTrain + 1:, 2]
    listreu[:, 2] = InputData[limitTrain + 1:, 3]
    listreu[:, 3] = features[limitTrain + 1:, 1]
    # Return value must be of a sequence of pandas.DataFrame
    return pd.DataFrame(listreu),
I ran my code locally and on Azure ML Studio, and the error occurs in both cases.
I am sure that it is not due to my dataset, since most of the time I don't get the error and I generate the dataset myself from different inputs.
This is a part of the dataset I use.
EDIT: I think I found the cause: I had 0 values which were not real 0 values. The values were something like 3.0x10^-314.
I would presume that somewhere in your dataframe you sometimes have NaN values.
These can simply be removed using:
dataframe1 = dataframe1.dropna()
However, with this approach you could be losing some valuable training data, so it may be worth looking into .fillna() or sklearn.preprocessing.Imputer in order to fill in values for the NaN cells in the df.
Without seeing the source of dataframe1 it is hard to give a full/complete answer, but it is possible that some sort of train/test split is going on, resulting in the dataframe being passed in only having NaN values some of the time.
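For example, a minimal sketch of the .fillna() route (assuming the feature columns are numeric) that keeps every row by imputing with the column means:
# impute NaNs with each column's mean instead of dropping whole rows
dataframe1 = dataframe1.fillna(dataframe1.mean())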
Since I corrected the problem described in my edit, I have no more errors. I just replaced the 3.0x10^-314 values with zeros.
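A hedged sketch of that clean-up (the column selection and the replacement value are assumptions): zero out anything too small or too large to survive a cast to float32, such as 3.0x10^-314.
import numpy as np

info32 = np.finfo(np.float32)
num = dataframe1.select_dtypes(include=[np.number])
# values that would underflow or overflow float32
bad = ((num.abs() < info32.tiny) & (num != 0)) | (num.abs() > info32.max)
dataframe1[num.columns] = num.mask(bad, 0.0)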
Some time ago I got unstable errors when I used an explicit number of CPUs in a parameter such as your n_jobs = 4. Try not to use n_jobs at all, or use n_jobs = -1 for automatic CPU count detection. Maybe it will help.
Try to use float64 instead of float32.
EDIT: Show us the dataset that caused it.
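Regarding the float64 suggestion above, a hedged illustration against the fit call from the question (float64 can still represent denormal values such as 3.0x10^-314, which float32 cannot):
clf.fit(features[:limitTrain, :].astype(np.float64),
        np.asarray(InputTarget[:limitTrain, 1], dtype=np.float64))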

Input contains NaN, infinity or a value too large for dtype('float64') error but no values in dataset

I am working on the Titanic machine learning problem from Kaggle - the beginner one.
I am writing my code in Python, and the model type is K-NN.
I am receiving the error 'Input contains NaN, infinity or a value too large for dtype('float64')'. However, I have checked my data thoroughly: there are no infinite values, no NaN values, and no overly large values. The error is not thrown on my training set but is thrown on the test set - the two are not different in the kind of values they hold (obviously different in content, but the type of values is the same).
Here is my code:
import numpy as np
import pandas as pd
test_dataset = pd.read_csv('test.csv')
X_classt = test_dataset.iloc[:, 1].values.reshape((1,-1))
X_faret = test_dataset.iloc[:,8].values.reshape((1,-1))
X_Stpt = test_dataset.iloc[:,3:7]
X_embarkedt = test_dataset.iloc[:,10].values.reshape((-1,1))
X_onet = np.concatenate((X_classt,X_faret))
X_onet = np.matrix.transpose(X_onet)
X_twot = np.concatenate((X_Stpt,X_embarkedt),axis=1)
Xt = np.concatenate((X_onet,X_twot),axis=1)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN',strategy ='mean', axis = 0)
imputer = imputer.fit(Xt[:,3:5])
Xt[:,3:5] = imputer.transform(Xt[:,3:5])
Xt_one = np.array(Xt[:,0:2],dtype = np.float)
ColThreet = Xt[:,2]
Xt_two = np.array(Xt[:,3:6],dtype=np.float)
ColSevent = Xt[:,6]
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
lett = LabelEncoder()
Xt[:,2] = lett.fit_transform(ColThreet)
lest = LabelEncoder()
Xt[:,6] = lest.fit_transform(Xt[:,6])
#This is where the error is thrown
ohct = OneHotEncoder(categorical_features=[6])
Xt = ohct.fit_transform(Xt).toarray()
Thank you for any help you can provide. I realize that my naming convention is a bit odd; it is because I used basically the same variables as in my training code, so I added a 't' at the end of each variable to 'reuse' the names for the test-set code.
Thanks in advance.
There are still null values, hence the error message. By quickly running your code I could see there is a null value in the 2nd feature.
Just after Xt = np.concatenate((X_onet,X_twot),axis=1) I could see there are null values in the 2nd and 4th features:
pd.DataFrame(Xt).isnull().sum()
while you only pass features 3:5 for null handling.
Just checking before encoding confirms this. Hope this helps.
Just a quick off-topic suggestion: you should always include column headers, as they help to build some intuition about the data and the results.
You could apply df['columnX'].fillna(0) to your dataframe to use 0 as a default value.
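Building on the null check above, a hedged sketch that imputes every column still containing NaNs before encoding. The column indices come out of that check at run time, the NaN-bearing columns are assumed to be numeric, and Imputer is the older scikit-learn API (newer versions use sklearn.impute.SimpleImputer):
null_counts = pd.DataFrame(Xt).isnull().sum()
print(null_counts)  # shows which features still contain NaNs
cols_with_nan = [i for i, n in enumerate(null_counts) if n > 0]
imputer_all = Imputer(missing_values='NaN', strategy='mean', axis=0)
Xt[:, cols_with_nan] = imputer_all.fit_transform(Xt[:, cols_with_nan])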

DataFrame has float values but calling to_csv() on it generates an empty CSV

Consider the following code:
columns = ['tf-idf', 'bag_of_words']
index = ['MultinomialNB', 'LinearSVC', 'LogisticRegression',
         'DecisionTreeClassifier', 'MLPClassifier']
df = pd.DataFrame(columns=columns, index=index)

estimators_dict = OrderedDict([('MultiNomialNB', MultinomialNB()),
                               ('LinearSVC', LinearSVC()),
                               ('LogisticRegression', LogisticRegression()),
                               ('DecisionTreeClassifier', DecisionTreeClassifier()),
                               ('MLPClassifier', MLPClassifier(max_iter=10))])
transformers_dict = OrderedDict([('tf-idf', TfidfVectorizer(max_features=500)),
                                 ('bag_of_words', CountVectorizer())])

steps = []
for transformer_name, transformer in transformers_dict.items():
    steps.append((transformer_name, transformer))
    for estimator_name, estimator in estimators_dict.items():
        steps.append((estimator_name, estimator))
        model = Pipeline(steps)
        predicted_labels = cross_val_predict(model, all_features, all_labels, cv=5)
        # f1 is a float
        f1 = f1_score(all_labels, predicted_labels, average='weighted')
        # writing to DataFrame
        df[transformer_name][estimator_name] = round(f1, 2)
        # This correctly shows the value which was just written
        print(str(df[str(transformer_name)][str(estimator_name)]))  # line a
        del steps[1]
    del steps[0]

# but writing to csv creates a file with no values whatsoever
df.to_csv('classification_results_f1score')  # line b
A bit of context: in my classification task I am using a set of feature transformers and another set of sklearn classifiers. I am running all possible combinations of the two sets to see which model performs best.
I am calculating the f1-score (a float value) of each model and storing it in a dataframe. The value is successfully written to the dataframe; I am able to verify this by accessing it (line a).
But after all the model runs are over (at the end of both for loops), when I write the dataframe to a CSV, it generates a CSV as follows:
,tf-idf,bag_of_words
MultinomialNB,,
LinearSVC,,
LogisticRegression,,
DecisionTreeClassifier,,
MLPClassifier,,
What seems to be the issue here? Why are the values not showing up in the csv?
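No answer is recorded here, but two things stand out in the snippet and are worth illustrating (as observations, not a confirmed diagnosis): the dict key 'MultiNomialNB' does not match the index label 'MultinomialNB', and chained indexing such as df[transformer_name][estimator_name] = ... can assign to a temporary copy rather than to df itself (the situation pandas flags with SettingWithCopyWarning). A label-based scalar write avoids the second issue:
# write a single cell by (row label, column label); 0.87 is just a placeholder score
df.at['MultinomialNB', 'tf-idf'] = round(0.87, 2)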
