I'm trying to randomly select a row from a pandas DataFrame based on provided weights. I tried to use the .sample() method with these parameters, but I can't get the syntax working:
import pandas as pd
df = pd.DataFrame({
'label': [1,0,1,-1],
'ind': [2,3,6,8],
})
df.sample(n=1, weights=[0.5, 0.4, 0.1], axis=0)  # raises ValueError: 3 weights for 4 rows
The labels are 1, 0, and -1, and I want to assign a different weight to each label for the random selection.
You should scale each label's weight by that label's frequency so the sampled distribution matches the expected one (label 1 appears twice, so without scaling its total weight would be double-counted):
weights = {-1:0.1, 0:0.4, 1:0.5}
scaled_weights = (pd.Series(weights) / df.label.value_counts(normalize=True))
df.sample(n=1, weights=df.label.map(scaled_weights) )
Test the distribution with 10,000 samples:
(df.sample(n=10000, replace=True, random_state=1,
weights=df.label.map(scaled_weights))
.label.value_counts(normalize=True)
)
Output:
1 0.5060
0 0.3979
-1 0.0961
Name: label, dtype: float64
For each row, divide the desired weight by the frequency of that label in the df:
weights = df['label'].replace({1: 0.5, 0: 0.4, -1: 0.1}) / df.groupby('label')['label'].transform('count')
df.sample(n=1, weights=weights, axis=0)
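The same 10,000-draw check used in the first answer can confirm this weighting; a quick sketch:
(df.sample(n=10000, replace=True, random_state=1, weights=weights)
 .label.value_counts(normalize=True))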
You can try the following code. It assigns the desired weights from a dictionary to the rows of df (assuming that is the label-to-weight mapping you intended). If you want the weights to depend on the number of elements per label, you can replace the lambda with a more complex function, as in the sketch after the code.
w = df['label'].apply(lambda x: {-1: 0.5, 0: 0.4, 1: 0.1}[x])
df.sample(n=1, weights=w, axis=0)
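For example, a sketch of one such function that divides each label's weight by that label's count, matching the scaling used in the other answers:
counts = df['label'].value_counts()
w = df['label'].apply(lambda x: {-1: 0.5, 0: 0.4, 1: 0.1}[x] / counts[x])
df.sample(n=1, weights=w, axis=0)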
I have a few data structures returned from a RandomForestClassifier() and from encoding string data from a CSV. I am predicting the probability of certain crimes happening given some weather data. The model part works well but I'm a bit of a Python nooby and can't wrap my head around merging this data.
Here's a dumbed down version of what I have:
#this line is pseudo code
data = from_csv_file
label_dict = { 'Assault': 0, 'Robbery': 1 }
# index 0 of each cell in predictions is Assault, index 1 is Robbery
encoded_labels = [0, 1]
# Probabilities of crime being assault or robbery
predictions = [
[0.4, 0.6],
[0.1, 0.9],
[0.8, 0.2],
...
]
I'd like to add a new column to data for each crime label with the cell contents being the probability, e.g. new columns called prob_Assault and prob_Robbery. Eventually I'd like to add a boolean column (True/False) that shows if the prediction was correct or not.
How could I go about this? I'm using Python 3.10, pandas, numpy, and scikit-learn.
EDIT: It might be easier for some if you see the important part of my actual code:
# Training data X, Y
tr_y = tr['offence']
tr_x = tr.drop('offence', axis=1)
# Test X (what to predict)
test_x = test_data.drop('offence', axis=1)
clf = RandomForestClassifier(n_estimators=40)
fitted = clf.fit(tr_x, tr_y)
pred = clf.predict_proba(test_x)
encoded_labels = fitted.classes_
# I also have the encodings dictionary that shows the encodings for crime types
You are on the right track. What you need is to convert the predictions from a list to a numpy array and then access its columns:
import numpy as np
predictions = np.array(predictions)
data["prob_Assault"] = predictions[:,0]
data["prob_Robbery"] = predictions[:,1]
I am assuming that data is a pandas DataFrame. I am not sure how you want to evaluate these probabilities, but you can use logical statements in pandas as well:
data["prob_Assault"] == 0.8 # For example, 0.8 is the correct probability
The code above will return a boolean Series such as:
0 True
1 False
2 False
...
You can assign these values to the dataframe as a new column:
data["check"] = data["prob_Assault"] == 0.8
Or even select the True rows of the dataframe:
data[data["prob_Assault"] == 0.8]
Maybe I misunderstood your problem, but if not, this could be a solution:
Create a dataframe with two columns: prob_Assault and prob_Robbery.
predictions_df = pd.DataFrame(predictions, columns = ['prob_Assault', 'prob_Robbery'])
Join that predictions_df to your data.
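A minimal sketch, assuming data and predictions_df share the same default integer index:
data = data.join(predictions_df)  # rows are matched by index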
I would like to get a dataframe of important features. With the code below I got the shap_values, and I am not sure what the values mean. My df has 142 features and 67 experiments, but I got an array with ca. 2500 values.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
I have tried to store them in a df:
rf_resultX = pd.DataFrame(shap_values, columns = ['shap_values'])
but got: ValueError: Shape of passed values is (18, 142), indices imply (18, 1)
142 is the number of features.
18: I have no idea where it comes from.
I believe it works as follows: the shap_values need to be averaged and then paired with the feature names, e.g. pd.DataFrame(feature_names, columns=['feature_names']).
Does anybody have experience with how to interpret shap_values? At first I thought the number of values would be the number of features times the number of rows.
Combining the other two answers like this worked for me.
import numpy as np
import pandas as pd

feature_names = X_train.columns
rf_resultX = pd.DataFrame(shap_values, columns=feature_names)
vals = np.abs(rf_resultX.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)),
                               columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'],
                            ascending=False, inplace=True)
shap_importance.head()
shap_values have (num_rows, num_features) shape; if you want to convert it to dataframe, you should pass the list of feature names to the columns parameter: rf_resultX = pd.DataFrame(shap_values, columns = feature_names).
Each sample has its own shap value for each feature; the shap value tells you how much that feature has contributed to the prediction for that particular sample; this is called a local explanation. You could average shap values for each feature to get a feeling of global feature importance, but I'd suggest you take a look at the documentation since the shap package itself provides much more powerful visualizations/interpretations.
From https://github.com/slundberg/shap/issues/632
vals = np.abs(shap_values.values).mean(0)
feature_names = train_x.columns  # .columns is an attribute, not a method
feature_importance = pd.DataFrame(list(zip(feature_names, vals)),
columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],
ascending=False, inplace=True)
feature_importance.head()
I wrote a short function for this which also works for multi-class classifications. It expects the data as a pandas DataFrame, a list of shap value arrays with one array for each class, and optionally a list of columns for which you want the average shap values.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
def shap_feature_ranking(data, shap_values, columns=None):
    # If columns are not given, take all columns
    if columns is None:
        columns = data.columns.tolist()
    # Get column locations for the desired columns in the given dataframe
    c_idxs = [data.columns.get_loc(column) for column in columns]
    if isinstance(shap_values, list):
        # shap values is a list of arrays, i.e. one array per class
        means = [np.abs(shap_values[class_][:, c_idxs]).mean(axis=0)
                 for class_ in range(len(shap_values))]  # mean shap values per class
        shap_means = np.sum(np.column_stack(means), 1)  # sum of shap values over all classes
    else:
        # Else there is only one 2D array of shap values
        assert len(shap_values.shape) == 2, 'Expected two-dimensional shap values array.'
        shap_means = np.abs(shap_values).mean(axis=0)
    # Put into a dataframe along with the columns, sort by shap_means,
    # and reset the index to get the ranking
    df_ranking = (pd.DataFrame({'feature': columns, 'mean_shap_value': shap_means})
                  .sort_values(by='mean_shap_value', ascending=False)
                  .reset_index(drop=True))
    df_ranking.index += 1
    return df_ranking
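A usage sketch, assuming X is the same data the shap values were computed on:
ranking = shap_feature_ranking(X, shap_values)
print(ranking.head(10))  # top 10 features by summed mean |shap value|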
For the latest version 0.40.0:
feature_names = shap_values.feature_names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
vals = np.abs(shap_df.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
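For reference, a sketch of how a shap_values object with .values and .feature_names attributes is produced in recent shap versions (assuming a fitted tree model rf and a DataFrame X_test):
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer(X_test)  # returns an Explanation object
# shap_values.values is the raw array; shap_values.feature_names are the columns of X_test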
For a regression problem, I have a training data set with:
- 3 variables with a gaussian distribution
- 20 variables with a uniform distribution.
All my variables are continuous, between [0, 1].
The problem is that the test data used to score my regression model has a uniform distribution for all variables. I get bad results at the tails of the distribution, so I want to oversample my training set in order to duplicate the rarest rows.
So my idea is to bootstrap (using sampling with replacement) on my training set, in order to get a set of data with the same distribution as the test set.
In order to do that, my idea (I don't know if it's a good one!) is to add 3 columns with intervals for my 3 variables and use these columns to stratify the resampling.
Example:
First, generate the data:
from scipy.stats import truncnorm
import numpy as np

def get_truncated_normal(mean=0.5, sd=0.15, min_value=0, max_value=1):
    return truncnorm(
        (min_value - mean) / sd, (max_value - mean) / sd, loc=mean, scale=sd)

generator = get_truncated_normal()
S1 = generator.rvs(1000)
S2 = generator.rvs(1000)
S3 = generator.rvs(1000)
u = np.random.uniform(0, 1, 1000)
Then check the distribution:
import seaborn as sns
sns.distplot(u);
sns.distplot(S2);
It's OK, so I'll add category columns:
import pandas as pd
df = pd.DataFrame({'S1':S1,'S2':S2,'S3':S3,'Unif':u})
BINS_NUMBER = 10
df['S1_range'] = pd.cut(df.S1,
                        bins=BINS_NUMBER,
                        precision=6,
                        right=True,
                        include_lowest=True)
df['S2_range'] = pd.cut(df.S2,
                        bins=BINS_NUMBER,
                        precision=6,
                        right=True,
                        include_lowest=True)
df['S3_range'] = pd.cut(df.S3,
                        bins=BINS_NUMBER,
                        precision=6,
                        right=True,
                        include_lowest=True)
A check:
df.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
That looks good to me.
So now I'll try to resample, but it's not working as intended:
from sklearn.utils import resample
df_resampled = resample(df,replace=True,n_samples=1000, stratify=df['S1_range'])
df_resampled.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
So it's not working: I get the same distribution in the output as in the input...
Can you help me?
Perhaps this is not the right way to do it?
Thanks!!
Note that stratify is designed to preserve the input proportions, which is why your output distribution matches the input. Rather than writing code from scratch to resample your continuous data, you should take advantage of a library for resampling regression data.
Whereas the popular libraries (imbalanced-learn, etc.) focus on classification (categorical) variables, there is a recent Python library called resreg (RESampling for REGression) that allows you to resample your continuous data (resreg GitHub page).
Also, rather than bootstrapping, you may want to generate synthetic data points at the tail ends of your normally distributed variables, as doing this will likely lead to much better results (see this paper). Similar to SMOTE for classification, which interpolates between features, you can use SMOTER (SMOTE for regression) in the resreg package to generate synthetic values for regression/continuous data.
Here is an example of how you would use resreg to achieve resampling with a few lines of code:
import numpy as np
import resreg
cl = np.percentile(y, 10)  # oversample values below the 10th percentile
ch = np.percentile(y, 90)  # oversample values above the 90th percentile
# Assign relevance scores to indicate which samples in your dataset are
# to be resampled. Values below cl and above ch are assigned a relevance
# value above 0.5; other values are assigned a relevance value below 0.5
relevance = resreg.sigmoid_relevance(X, y, cl=cl, ch=ch)
# Resample the relevant values (i.e., relevance >= 0.5) by interpolating
# between nearest k-neighbors (k=5). By setting over='balance', the
# relevant values are oversampled so that the numbers of relevant and
# irrelevant values are equal
X_res, y_res = resreg.smoter(X, y, relevance=relevance, relevance_threshold=0.5, k=5, over='balance', random_state=0)
My solution:
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

def create_sampled_data_set(n_samples_by_bin=1000,
                            n_bins=10,
                            replace=True,
                            save_csv=True):
    """In order to have the same distribution for S1..S3 between the
    training set and the test set, this function generates a new,
    resampled training set.
    Return: (X_train, y_train)
    """
    def stratified_sample_df_(df, col, n_samples, replace=True):
        if replace:
            n = n_samples
        else:
            n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n, replace=replace))
        df_.index = df_.index.droplevel(0)
        return df_

    X_train, y_train = load_data_for_train()
    # merge the dataframes for the sampling. Target will be removed after
    X_train = pd.merge(
        X_train, y_train[['Target']], left_index=True, right_index=True)
    del y_train
    # build a categorical feature from the S1..S3 distribution
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
    disc.fit(X_train[['S1', 'S2', 'S3']])
    y_bin = disc.transform(X_train[['S1', 'S2', 'S3']]).astype(int)
    del disc
    # concatenate the three bin ids into one stratification key, e.g. '3;5;2'
    y_concat = []
    for i in range(len(y_bin)):
        a = str(y_bin[i, 0])
        b = str(y_bin[i, 1])
        c = str(y_bin[i, 2])
        y_concat.append(a + ';' + b + ';' + c)
    del y_bin
    X_train['S_Class'] = y_concat
    del y_concat
    X_train_resampled = stratified_sample_df_(
        X_train, 'S_Class', n_samples_by_bin)
    del X_train
    y_train_resampled = X_train_resampled[['Target']].copy()
    y_train_resampled.rename(
        columns={y_train_resampled.columns[0]: 'Target'}, inplace=True)
    X_train_resampled = X_train_resampled.drop(['S_Class', 'Target'], axis=1)
    # save to files for further usage
    if save_csv:
        X_train_resampled.to_csv(
            "./data/training_input_resampled.csv", sep=",")
        y_train_resampled.to_csv(
            "./data/training_output_resampled.csv", sep=",")
    return (X_train_resampled,
            y_train_resampled)
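A usage sketch; load_data_for_train is the asker's own helper, assumed to return index-aligned X_train and y_train dataframes:
X_res, y_res = create_sampled_data_set(n_samples_by_bin=500, save_csv=False)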
I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:
data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)
My question is, how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes. Is there an easy way to do this?
Thanks
You can do it by passing retbins=True.
Consider the following DataFrame:
import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])
pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:
ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)
ser is the categorical series and bins are the break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:
pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]:
0 13
1 19
2 3
3 9
4 13
5 17
...
User @Karen said:
By using this logic, I am getting NaN values in my validation set. Is there some way to solve it?
If this is happening to you, it most likely means that the validation set has values below (or above) the smallest (or greatest) value from the training data. Some values therefore fall out of range and are not assigned a bin.
You can solve this problem by extending the range of the training data:
# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
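An alternative sketch that widens the outer bin edges instead of modifying the training values, so out-of-range validation values still land in the first or last bin (b is the bins array returned by pd.qcut above):
import numpy as np

b = b.copy()
b[0], b[-1] = -np.inf, np.inf  # extend the outer bins to cover any value
test['bin'] = pd.cut(test['value'], b)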
How do I select specific rows from a python dataframe for an ols regression in PANDAS?
I have a pandas dataframe with 1,000 rows. I want to regress column A on columns B + C, for the first 10 rows. When I type:
mod = pd.ols(y=df['A'], x=df[['B','C']], window=10)
I get the regression results for rows 991-1000. How do I specify that I want the FIRST (or second, etc.) 10 rows?
Thanks in advance.
I think you can use iloc:
mod = pd.ols(y=df['A'].iloc[2:12], x=df[['B','C']].iloc[2:12], window=10)
Or ix:
mod = pd.ols(y=df.ix[2:12, 'A'], x=df.ix[2:12, ['B', 'C']], window=10)
If you need all windows, use range:
for i in range(10):
    # print(i, i + 10)
    mod = pd.ols(y=df['A'].iloc[i:i + 10], x=df[['B','C']].iloc[i:i + 10], window=10)
If you need help with ols, try help(pd.ols) in IPython, because this function is missing from the pandas docs:
In [79]: help(pd.ols)
Help on function ols in module pandas.stats.interface:
ols(**kwargs)
Returns the appropriate OLS object depending on whether you need
simple or panel OLS, and a full-sample or rolling/expanding OLS.
Will be a normal linear regression or a (pooled) panel regression depending
on the type of the inputs:
y : Series, x : DataFrame -> OLS
y : Series, x : dict of DataFrame -> OLS
y : DataFrame, x : DataFrame -> PanelOLS
y : DataFrame, x : dict of DataFrame/Panel -> PanelOLS
y : Series with MultiIndex, x : Panel/DataFrame + MultiIndex -> PanelOLS
Parameters
----------
y: Series or DataFrame
See above for types
x: Series, DataFrame, dict of Series, dict of DataFrame, Panel
weights : Series or ndarray
The weights are presumed to be (proportional to) the inverse of the
variance of the observations. That is, if the variables are to be
transformed by 1/sqrt(W) you must supply weights = 1/W
intercept: bool
True if you want an intercept. Defaults to True.
nw_lags: None or int
Number of Newey-West lags. Defaults to None.
nw_overlap: bool
Whether there are overlaps in the NW lags. Defaults to False.
window_type: {'full sample', 'rolling', 'expanding'}
'full sample' by default
window: int
size of window (for rolling/expanding OLS). If window passed and no
explicit window_type, 'rolling" will be used as the window_type
Panel OLS options:
pool: bool
Whether to run pooled panel regression. Defaults to true.
entity_effects: bool
Whether to account for entity fixed effects. Defaults to false.
time_effects: bool
Whether to account for time fixed effects. Defaults to false.
x_effects: list
List of x's to account for fixed effects. Defaults to none.
dropped_dummies: dict
Key is the name of the variable for the fixed effect.
Value is the value of that variable for which we drop the dummy.
For entity fixed effects, key equals 'entity'.
By default, the first dummy is dropped if no dummy is specified.
cluster: {'time', 'entity'}
cluster variances
Examples
--------
# Run simple OLS.
result = ols(y=y, x=x)
# Run rolling simple OLS with window of size 10.
result = ols(y=y, x=x, window_type='rolling', window=10)
print(result.beta)
result = ols(y=y, x=x, nw_lags=1)
# Set up LHS and RHS for data across all items
y = A
x = {'B' : B, 'C' : C}
# Run panel OLS.
result = ols(y=y, x=x)
# Run expanding panel OLS with window 10 and entity clustering.
result = ols(y=y, x=x, cluster='entity', window_type='expanding', window=10)
Returns
-------
The appropriate OLS object, which allows you to obtain betas and various
statistics, such as std err, t-stat, etc.
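Note that pd.ols (and .ix) were removed in later pandas versions. A minimal sketch of the same first-10-rows regression using statsmodels instead (assuming df has columns A, B, and C):
import statsmodels.api as sm

y = df['A'].iloc[:10]
X = sm.add_constant(df[['B', 'C']].iloc[:10])  # add an intercept column
result = sm.OLS(y, X).fit()
print(result.params)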