After training my model, I want to see my original dataframe along with the y_pred values.
y_hats = model.predict(X_test)  # get the predicted values
y_test['preds'] = y_hats  # trying to join them (preds column next to y_test, whose index
                          # is what needs to be connected back to the original dataframe)
df_out = pd.merge(df, y_test[['preds']], how='left', left_index=True, right_index=True)
But in the result all of the y_pred values are null. y_test has the correct index values to connect back to the original dataframe, but when I create the 'preds' column it's actually an array, so it doesn't work. And when I create a DataFrame out of the preds array, the index resets, obviously. Any ideas how to fix this issue?
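A sketch of one likely fix (assuming X_test kept the original dataframe's index): wrap the predictions in a pandas Series that reuses that index before merging, so the values align by index instead of arriving as a bare array.

y_hats = model.predict(X_test)
preds = pd.Series(y_hats, index=X_test.index, name='preds')  # keep the original index
df_out = pd.merge(df, preds, how='left', left_index=True, right_index=True)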
I have a few data structures returned from a RandomForestClassifier() and from encoding string data from a CSV. I am predicting the probability of certain crimes happening given some weather data. The model part works well, but I'm a bit of a Python newbie and can't wrap my head around merging this data.
Here's a dumbed down version of what I have:
#this line is pseudo code
data = from_csv_file
label_dict = { 'Assault': 0, 'Robbery': 1 }
# index 0 of each cell in predictions is Assault, index 1 is Robbery
encoded_labels = [0, 1]
# Probabilities of crime being assault or robbery
predictions = [
    [0.4, 0.6],
    [0.1, 0.9],
    [0.8, 0.2],
    ...
]
I'd like to add a new column to data for each crime label with the cell contents being the probability, e.g. new columns called prob_Assault and prob_Robbery. Eventually I'd like to add a boolean column (True/False) that shows if the prediction was correct or not.
How could I go about this? Using Python 3.10, pandas, numpy, and scikit-learn.
EDIT: Might be easier for some if you saw the important part of my actual code
# Training data X, Y
tr_y = tr['offence']
tr_x = tr.drop('offence', axis=1)
# Test X (what to predict)
test_x = test_data.drop('offence', axis=1)
clf = RandomForestClassifier(n_estimators=40)
fitted = clf.fit(tr_x, tr_y)
pred = clf.predict_proba(test_x)
encoded_labels = fitted.classes_
# I also have the encodings dictionary that shows the encodings for crime types
You are on the right track. What you need is to convert the predictions from a list to a numpy array and then access its columns:
import numpy as np
predictions = np.array(predictions)
data["prob_Assault"] = predictions[:,0]
data["prob_Robbery"] = predictions[:,1]
I am assuming that data is a pandas dataframe. I am not sure how you want to evaluate these probabilities, but you can use logical statements in pandas as well:
data["prob_Assault"] == 0.8 # For example, 0.8 is the correct probability
The code above will return a boolean Series such as:
0 True
1 False
2 False
...
You can assign these values to the dataframe as a new column:
data["check"] = data["prob_Assault"] == 0.8
Or even select the True rows of the dataframe:
data[data["prob_Assault"] == 0.8]
Maybe I misunderstood your problem, but if not, this could be a solution:
Create a dataframe with two columns, prob_Assault and prob_Robbery:
predictions_df = pd.DataFrame(predictions, columns = ['prob_Assault', 'prob_Robbery'])
Then join that predictions_df to your data, for example as sketched below.
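A minimal sketch of the join, assuming data and predictions_df describe the same test rows in the same order:

data = pd.concat([data.reset_index(drop=True), predictions_df], axis=1)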
How to get the correct row from a dataframe which is sliced?
To show what I mean, look at this code sample:
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
import numpy as np
data=pd.DataFrame()
data['one']=range(0,1000)
data['p1']=data['one']+1
data['p2']=data['one']+2
label=data['p1']%2==0
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=100)
lgb_model = lgb.LGBMClassifier(objective = 'binary')
lgb_fitted = lgb_model.fit(X_train, y_train, verbose = False)
y_prob=lgb_fitted.predict_proba(X_test)
y_prob= pd.DataFrame(y_prob,columns = ['No','Yes'])
model_uncertain=y_prob.loc[(y_prob['Yes'] >= .5) & (y_prob['Yes'] <= .52)]
model_uncertain
My question:
How can I get the row in the X_test dataframe which is related to the first row in the model_uncertain dataframe?
To make sure that I am getting the right row, I test it by passing the same row to predict_proba using the following code, as I should get the same result:
y_prob_3=lgb_fitted.predict_proba([X_test.iloc[3]])
y_prob_3
But the result is not the same.
I think I am not sending the correct row to predict_proba, as it should return the same value for a row.
What is the correct way to take the n-th row in model_uncertain and find the corresponding row in the X_test dataframe?
How can I get the row in the X_test dataframe which is related to the first row in the model_uncertain dataframe?
You're on the right track:
>>> idx_of_first_uncertainty_row = model_uncertain.index[0]
>>> row_in_test_data = X_test.loc[idx_of_first_uncertainty_row]
Yes, indexes are preserved between the original dataframe and its slices (unless you reset the index somewhere in between).
To make sure that I am getting the right row, I test it by passing the same row to predict_proba using the following code, as I should get the same result (...) But the result is not the same.
Why do you think they're not the same? In the dataframe image you can't see all of the decimals. A better way to confirm whether they're the same (well, really, really similar) would be to use something like np.isclose to compare model_uncertain.iloc[0] (the first row of that dataframe) with y_prob_3 (the probabilities predicted again for that row):
>>> np.isclose(model_uncertain.iloc[0].values, y_prob_3)
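One caveat worth noting here (based on the code in the question): y_prob was built from a plain NumPy array, so its index is a fresh 0..n-1 range; it lines up with X_test positionally rather than carrying X_test's original index labels. A small sketch of re-checking a row under that assumption:

pos = model_uncertain.index[0]            # positional row number within X_test
row_in_test = X_test.iloc[[pos]]          # double brackets keep it 2-D for predict_proba
lgb_fitted.predict_proba(row_in_test)     # should be close to model_uncertain.iloc[0].values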
I would like to get a dataframe of important features. With the code below I have got the shap_values and I am not sure what the values mean. My df has 142 features and 67 experiments, but I got an array with ca. 2500 values.
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
I have tried to store them in a df:
rf_resultX = pd.DataFrame(shap_values, columns = ['shap_values'])
but got: ValueError: Shape of passed values is (18, 142), indices imply (18, 1)
142 is the number of features.
18 - I have no idea.
I believe it works as follows:
the shap_values need to be averaged,
and then paired with the feature names: pd.DataFrame(feature_names, columns = ['feature_names'])
Does anybody have experience with how to interpret shap_values?
At first I thought that the number of values was the number of features x the number of rows.
Combining the other two answers like this worked for me.
import numpy as np
import pandas as pd

feature_names = X_train.columns
rf_resultX = pd.DataFrame(shap_values, columns=feature_names)
vals = np.abs(rf_resultX.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)),
                               columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'],
                            ascending=False, inplace=True)
shap_importance.head()
shap_values have (num_rows, num_features) shape; if you want to convert it to a dataframe, you should pass the list of feature names to the columns parameter: rf_resultX = pd.DataFrame(shap_values, columns = feature_names).
Each sample has its own shap value for each feature; the shap value tells you how much that feature has contributed to the prediction for that particular sample; this is called a local explanation. You could average shap values for each feature to get a feeling of global feature importance, but I'd suggest you take a look at the documentation since the shap package itself provides much more powerful visualizations/interpretations.
From https://github.com/slundberg/shap/issues/632
vals = np.abs(shap_values.values).mean(0)
feature_names = train_x.columns
feature_importance = pd.DataFrame(list(zip(feature_names, vals)),
                                  columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],
                               ascending=False, inplace=True)
feature_importance.head()
I wrote a short function for this which also works for multi-class classifications. It expects the data as a pandas DataFrame, a list of shap value arrays with one array for each class, and optionally a list of columns for which you want the average shap values.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
def shap_feature_ranking(data, shap_values, columns=[]):
    if not columns: columns = data.columns.tolist()  # If columns are not given, take all columns
    c_idxs = []
    for column in columns: c_idxs.append(data.columns.get_loc(column))  # Get column locations for desired columns in given dataframe
    if isinstance(shap_values, list):  # If shap values is a list of arrays (i.e., several classes)
        means = [np.abs(shap_values[class_][:, c_idxs]).mean(axis=0) for class_ in range(len(shap_values))]  # Compute mean shap values per class
        shap_means = np.sum(np.column_stack(means), 1)  # Sum of shap values over all classes
    else:  # Else there is only one 2D array of shap values
        assert len(shap_values.shape) == 2, 'Expected two-dimensional shap values array.'
        shap_means = np.abs(shap_values).mean(axis=0)
    # Put into dataframe along with columns and sort by shap_means, reset index to get ranking
    df_ranking = pd.DataFrame({'feature': columns, 'mean_shap_value': shap_means}).sort_values(by='mean_shap_value', ascending=False).reset_index(drop=True)
    df_ranking.index += 1
    return df_ranking
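Usage would then look something like this (assuming X is the dataframe the shap values were computed on):

shap_feature_ranking(X, shap_values)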
For the latest version 0.40.0:
feature_names = shap_values.feature_names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
vals = np.abs(shap_df.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
I am following the example: https://www.tensorflow.org/tutorials/structured_data/time_series.
In my case I have a sensor which collects the data every hour. It has not been really reliable during the last months and I have lost some data. To work around this, the missing values have been replaced with the previous valid value. As a result I have many duplicated values, and I think this is the reason why my NN is unable to predict anything. I do not want to skip the wrong values before creating the dataset, because that would create time series with non-consecutive values.
I would like to create the timeseries dataset as in the example and then either remove the entries/outputs (tensors) which have a certain amount of duplicated data, or update those tensor values with the value 0.
from collections import Counter

def hasMultipleDuplicatedElements(mylist, multiplicity):
    return Counter(mylist[:, 0]).most_common(1)[0][1] > multiplicity

WindowGenerator.hasMultipleDuplicatedElements = hasMultipleDuplicatedElements
def dsCleanedRowsWithHighMultiplycity(self, ds, multiplicity):
    for batch in ds:
        dataBatch = batch.numpy()
        for j in range(len(dataBatch)):
            selectedDataBatch = dataBatch[j]
            indices = tf.constant([[j] for j in range(len(selectedDataBatch))])
            inputData = selectedDataBatch[:self.input_width]
            labelData = selectedDataBatch[self.input_width:]
            if (hasMultipleDuplicatedElements(inputData, multiplicity) or
                    hasMultipleDuplicatedElements(labelData, multiplicity)):
                # print(batch[j])
                tf.tensor_scatter_nd_update(batch[j], indices,
                                            tf.zeros(shape=selectedDataBatch.shape, dtype=tf.float32),
                                            name=None)
                # print(batch[j])

WindowGenerator.dsCleanedRowsWithHighMultiplycity = dsCleanedRowsWithHighMultiplycity
def make_dataset(self, data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=data,
        targets=None,
        sequence_length=self.total_window_size,
        sequence_stride=1,
        shuffle=True,
        batch_size=32,)
    self.dsCleanedRowsWithHighMultiplycity(ds, 10)
    ds = ds.map(self.split_window)
    return ds
The dataset contains batches, each one with 32 entries/outputs (tensors). I scan every entry/output looking for duplicated data which appears a minimum of 10 times. I manage to spot these entries and create a new tensor with tf.tensor_scatter_nd_update, but what I would like is to update the original tensor inside the batch.
If there is a way to remove the wrong tensor from the batch, it would also be an acceptable solution.
Thanks in advance!
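Not from the original post, but one possible direction (a sketch, assuming the duplicate check only needs to look at the first feature column of each window): filter the unwanted windows out of the dataset before batching, instead of mutating the batches in place.

def window_is_clean(window, multiplicity=10):
    # window has shape (sequence_length, num_features); count repeats in the first feature
    _, _, counts = tf.unique_with_counts(window[:, 0])
    return tf.reduce_max(counts) <= multiplicity

# inside make_dataset, after timeseries_dataset_from_array and before split_window:
ds = ds.unbatch().filter(window_is_clean).batch(32)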
I'm trying to apply SMOTE to a dataframe full of sliding windows, shown here:
[image: DataFrame of sliding windows]
I'm using imblearn's SMOTE() function on it. Without any manipulation, I'm getting an error that each cell must have a size-1 array. Applying SMOTE individually by row, or exploding the dataframe and applying SMOTE per window (same index), results in a ValueError because there is only one class in each SMOTE attempt. How do I get around the problem of wanting to SMOTE an entire sliding window, without aggregating the windows or running into dimension errors, while keeping it in the dataframe shown in the first picture?
new_df_labels = X_with_labels.reset_index().apply(pd.Series.explode)
new_df = X_smoted.reset_index().apply(pd.Series.explode)
np.unique(new_df.index)

X_list = pd.DataFrame(columns=X_smoted.columns)
y_list = []

for j in np.unique(new_df.index):
    new_df1 = new_df[new_df.index == j]
    new_df_labels1 = new_df_labels[new_df_labels.index == j]
    X_smoted_1, y_smoted_1 = smot.fit_resample(new_df1, new_df_labels1['Activity'])
    X_list = X_list.append(X_smoted_1)
    y_list.append(y_smoted_1.ravel())
[image: exploded DataFrame]
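One possible way around the one-class-per-window problem (a sketch, not from the original post; windows and labels here are hypothetical names for a 3-D array of windows and a per-window label vector): flatten each window into a single numeric row so SMOTE sees a 2-D array, then reshape the resampled rows back into windows.

import numpy as np
from imblearn.over_sampling import SMOTE

# windows: shape (n_windows, window_len, n_features); labels: one class per window
n_windows, window_len, n_features = windows.shape
X_flat = windows.reshape(n_windows, window_len * n_features)

X_res, y_res = SMOTE().fit_resample(X_flat, labels)
windows_res = X_res.reshape(-1, window_len, n_features)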