I am working with sklift to describe the uplift for a given treatment (in this case, a marketing discount). When training the model, we can get both probabilities, such as:
# model results: conditional probabilities of treatment effect
# probability of performing the targeted action (visits):
prob_treat = model_sm.trmnt_preds_    # probability in treatment group
prob_control = model_sm.ctrl_preds_   # probability in control group
But when I try to get prob_treat and prob_control for inference (unseen data), I cannot find anything in the docs. All I can do is get the uplift using model.predict(X_val). For inference, I need to understand what the probabilities are if the treatment is given vs. not given. Can someone help?
I haven't used sklift; I use the causalml package, which is from Uber.
But I have some ideas about your question.
From your description, I guess you are using a T-learner or S-learner, which predicts the probability separately for the treatment group and the control group.
So the result (uplift value) you want is prob_treat - prob_control.
And the probabilities when treatment is given vs. not given are counterfactual outcomes.
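If it helps, here is a minimal two-model (T-learner) sketch in plain scikit-learn rather than sklift's own API; X_train, y_train, treatment_train (a 0/1 flag), and X_val are assumed names:

from sklearn.ensemble import RandomForestClassifier

# Two-model (T-learner) approach: fit one classifier per group.
clf_trmnt = RandomForestClassifier(random_state=42)
clf_ctrl = RandomForestClassifier(random_state=42)
clf_trmnt.fit(X_train[treatment_train == 1], y_train[treatment_train == 1])
clf_ctrl.fit(X_train[treatment_train == 0], y_train[treatment_train == 0])

# Counterfactual probabilities for unseen data: "if treated" vs. "if not treated".
prob_treat = clf_trmnt.predict_proba(X_val)[:, 1]
prob_control = clf_ctrl.predict_proba(X_val)[:, 1]

# Uplift is the difference between the two probabilities.
uplift = prob_treat - prob_control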
I made an ML model using the ResNet9 architecture. The model has 38 classes.
After prediction, it returns a tuple like this:
It returns negative values. In the above case, the highest value, -204.0966, means that class has the highest chance of being the correct one.
I want to convert these values to percentages to see the confidence for that image. If anyone wants to see the code, please tell me.
You should apply the softmax function, which maps a vector of values to values that sum to 1 (i.e. a probability distribution), and multiply the result by 100 to get percentages:
import torch

# Specify the class dimension explicitly to avoid the implicit-dim deprecation warning.
torch.nn.functional.softmax(model_output, dim=-1) * 100
See: Softmax (Wikipedia)
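For example, a minimal sketch with a hypothetical logits tensor of shape [1, 38] (standing in for your model output over 38 classes):

import torch
import torch.nn.functional as F

# Hypothetical raw model outputs (logits) for one image and 38 classes.
logits = torch.randn(1, 38)

# Softmax along the class dimension turns the logits into probabilities that sum to 1.
percentages = F.softmax(logits, dim=1) * 100

# Predicted class and its confidence in percent.
confidence, predicted_class = torch.max(percentages, dim=1)
print(f"Class {predicted_class.item()} with confidence {confidence.item():.2f}%")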
I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. Every user rated at least one movie. A value of 1 means good, 0 means bad, and a blank means the user did not rate that movie.
I want to cluster similar users based on their reviews, with the idea that users who rated similar movies as good might also rate as good a movie that was not rated by other users in the same cluster. I used the cosine similarity measure with k-means clustering. The csv file is shown below:
UserID    M1  M2  M3  ...  M200
user1      1   0   0
user2      0   1   1
user3      1   1   1
...
user100    1   0   1
The problem I am facing is that I don't know exactly how to find the optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering them with k-means and there is no issue with that, but I want to know the most stable or optimal number of clusters for this dataset.
I would appreciate some help.
Clustering belongs to the unsupervised machine learning methods. Unlike supervised methods, unsupervised methods have no straightforward way to determine the "best" model among a set of models trained on a given dataset.
Nonetheless, there are some quantitative measures. Most of them are based on the question "how much more similar are the points within a cluster to each other than to points in other clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation, in particular at all the techniques that do not require labels_true (i.e. the unsupervised ones).
Once you have a quantitative measure of the "goodness" of a clustering, you usually observe how this quantity evolves as you change the number of clusters; this approach is called the Elbow Method.
Here is some code that runs the K-Means algorithm for every K from 2 to 30, calculates various scores for each K, and stores all scores in a DataFrame.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

seed_random = 1
fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
    # Perform clustering.
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=seed_random,
                    )
    labels_clusters = kmeans.fit_predict(X)
    # Insert fitted model and calculated cluster labels in dictionaries,
    # for further reference.
    fitted_kmeans[n_clusters] = kmeans
    labels_kmeans[n_clusters] = labels_clusters
    # Calculate various scores, and save them for further reference.
    silhouette = silhouette_score(X, labels_clusters)
    ch = calinski_harabasz_score(X, labels_clusters)
    db = davies_bouldin_score(X, labels_clusters)
    tmp_scores = {"n_clusters": n_clusters,
                  "silhouette_score": silhouette,
                  "calinski_harabasz_score": ch,
                  "davies_bouldin_score": db,
                  }
    df_scores.append(tmp_scores)

# Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)
This code assumes that all your numerical features are in a DataFrame X.
All clustering performance metrics are stored in the df_scores DataFrame.
You can easily use the elbow method by plotting columns from df_scores; for instance, if you want to see the elbow graph of the Silhouette Score, you can use df_scores["silhouette_score"].plot().
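For example, a minimal plotting sketch (assuming matplotlib is installed) that draws all three metrics from df_scores against the number of clusters:

import matplotlib.pyplot as plt

# One elbow plot per metric, using the df_scores DataFrame built above.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, column in zip(axes, df_scores.columns):
    df_scores[column].plot(ax=ax, marker="o")
    ax.set_title(column)
    ax.set_xlabel("n_clusters")
plt.tight_layout()
plt.show()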
It's pretty common to start by visualizing the data. Sometimes it is graphically obvious that there are N classes/clusters. Other times you may be able to see whether it's <5, <10, or <100 classes. It really depends on your data.
Another common approach is to use the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
The main takeaway is that a lot of clustering problems can yield seemingly optimal results if, for example, you have as many clusters as you have inputs: every input fits perfectly in its own cluster.
BIC/AIC penalize such high-complexity solutions, based on the insight that simpler models are often better and more stable, i.e. they generalize better and overfit less.
From Wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
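K-Means itself does not expose BIC/AIC in scikit-learn, but a Gaussian mixture model does; here is a rough sketch of the idea, assuming your features are in X as in the other answer:

from sklearn.mixture import GaussianMixture

# Fit a Gaussian mixture for each candidate number of clusters and record its BIC.
bic_scores = []
for n_components in range(2, 31):
    gmm = GaussianMixture(n_components=n_components, random_state=1).fit(X)
    bic_scores.append((n_components, gmm.bic(X)))

# Lower BIC is better; the minimum is a reasonable candidate for the number of clusters.
best_k = min(bic_scores, key=lambda t: t[1])[0]
print("Best number of clusters by BIC:", best_k)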
You can use the Gini index as a metric and then do a grid search based on this metric. Tell me if you have any other questions.
You could use the elbow method.
The core idea of K-Means is to cluster the data points so that the total within-cluster sum of squares (a.k.a. WSS) is minimized. Hence you can vary k from 2 to n, calculating the WSS at each value; then plot the curve, find the location of the bend, and that can be considered an optimal number of clusters.
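A minimal sketch of this WSS-based elbow curve, using KMeans's inertia_ attribute (the within-cluster sum of squares) and again assuming your features are in X:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Compute the within-cluster sum of squares (inertia) for a range of k values.
wss = []
k_range = range(2, 21)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=1).fit(X)
    wss.append(km.inertia_)

# Plot the curve and look for the "bend" (elbow).
plt.plot(list(k_range), wss, marker="o")
plt.xlabel("k")
plt.ylabel("WSS (inertia)")
plt.show()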
Suppose I have a paragraph:
Str_wrds ="Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs."
And I have the following Test_wrds:
Test_wrds = ['Power curve', 'data-driven','wind turbines']
I would like to select the sentence before and after each sentence in which a Test_wrds entry is found in the paragraph, and list them as a separate string. For example, 'Power curve' first appears in the 1st sentence, and the 2nd sentence contains 'power curve' again, so the output would be something like this:
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.
Likewise, I would like to slice the sentences for data-driven and wind turbines and save them in separate strings.
How can I implement this using Python in a simple way?
So far I have found code which basically removes the entire sentence whenever a Test_wrds entry is in it:
def remove_sentence(Str_wrds, Test_wrds):
    # Drops every sentence that contains Test_wrds (treated here as a single word/phrase).
    return ".".join(sentence for sentence in Str_wrds.split(".")
                    if Test_wrds not in sentence)
But I don't understand how to use this for my problem.
Update on the problem: basically, whenever a Test_wrds entry is present in the paragraph, I would like to slice that sentence as well as the sentence before and after it, and save them in a single string. So for the three Test_wrds entries I expect to get three strings, each covering the sentences containing that entry. I attached a PDF with an example of the output I am looking for.
You could define a function something like this one
def find_sentences(word, text):
    # Split the paragraph into sentences on '.'
    sentences = text.split('.')
    findings = []
    for i in range(len(sentences)):
        # Case-insensitive match of the search word in the current sentence.
        if word.lower() in sentences[i].lower():
            if i == 0:
                findings.append(sentences[i+1] + '.')
            elif i == len(sentences) - 1:
                findings.append(sentences[i-1] + '.')
            else:
                findings.append(sentences[i-1] + '.' + sentences[i+1] + '.')
    return findings
This can then be called as
findings = find_sentences( 'Power curve', Str_wrds )
With some pretty printing
for finding in findings:
    print(finding + '\n')
We get the results
However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. Data-driven model accuracy is significantly affected by uncertainty.
The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.
The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines..
which I hope is what you were looking for :)
When you say,
I would like to select the sentence before and after each sentence in which a Test_wrds entry is found in the paragraph, and list them as a separate string.
I guess you mean that all the sentences containing one of the words in Test_wrds, plus the sentence before and the sentence after each of them, should be selected.
Function
def remove_sentence(Str_wrds: str, Test_wrds):
    # store all selected sentences
    all_selected_sentences = {}
    # initialize empty dictionary
    for k in Test_wrds:
        # one element for each occurrence
        all_selected_sentences[k] = [''] * Str_wrds.lower().count(k.lower())
    # list of sentences
    sentences = Str_wrds.split(".")
    word_counter = {}.fromkeys(Test_wrds, 0)
    for i, sentence in enumerate(sentences):
        for j, word in enumerate(Test_wrds):
            # case insensitive
            if word.lower() in sentence.lower():
                if i == 0:  # first sentence
                    chosen_sentences = sentences[0:2]
                elif i == len(sentences) - 1:  # last sentence
                    chosen_sentences = sentences[-2:]
                else:
                    chosen_sentences = sentences[i - 1:i + 2]
                # get which occurrence of the word it is
                k = word_counter[word]
                all_selected_sentences[word][k] += '.'.join(
                    [s for s in chosen_sentences
                     if s not in all_selected_sentences[word][k]]) + "."
                word_counter[word] += 1  # increment the word counter
    return all_selected_sentences
Running this
answer = remove_sentence(Str_wrds, Test_wrds)
print(answer)
with the provided values for Str_wrds and Test_wrds,
returns this output
{
'Power curve': [
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.',
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty.',
' The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.',
' The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
],
'data-driven': [
' However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.',
' Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression.',
' Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated.'
],
'wind turbines': [
' A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
]
}
Notes:
the function returns a dict of lists
every key is a word in Test_wrds, and each list element corresponds to one occurrence of that word.
For example, because the phrase 'Power curve' occurs 4 times in the entire text, the value for 'Power curve' in the output is a list of 4 elements.
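For instance, to print the selected context for the first occurrence of 'Power curve':

# `answer` is the dict returned by remove_sentence above.
print(answer['Power curve'][0])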
I am working on a classification problem. I have around 1000 features and the target variable has 2 classes. All 1000 features have values of 1 or 0. I am trying to find feature importance, but my feature importance values vary from 0.0 to 0.003. I am not sure if such low values are meaningful.
Is there a way I can increase feature importance?
# Variable importance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(min_samples_split=10, random_state=1)
rf.fit(X, Y)
print("Features sorted by their score:")
# Pair each (rounded) importance with its feature name and sort, highest first.
a = sorted(zip(map(lambda x: round(x, 3), rf.feature_importances_), X), reverse=True)
print(a)
I would really appreciate any help! Thanks
Since you only have two target classes, you can perform an unequal-variance t-test, which has been useful for finding important features in a binary classification task when all other feature ranking methods have failed me. You can implement it using the scipy.stats.ttest_ind function. It is basically a statistical test that checks whether two distributions are different: if the returned p-value is less than 0.05, they can be assumed to be different distributions. To implement this for each feature, follow these steps:
Extract all predictor values for class 1 and 2 respectively.
Run ttest_ind on these two distributions, specifying that their variances are not assumed equal (equal_var=False, i.e. Welch's t-test); by default ttest_ind performs a two-tailed test.
If the p-value is less than 0.05, this feature is important.
Alternatively, you can do this for all your features and use the p-value as the measure of feature importance: the lower the p-value, the higher the importance of the feature. A minimal sketch of this approach follows.
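Here is one way it could look, assuming X is the DataFrame of 0/1 features and Y the binary target from the question's code:

import pandas as pd
from scipy.stats import ttest_ind

def t_test_feature_ranking(X, Y, alpha=0.05):
    """Rank features by Welch's t-test p-value between the two target classes."""
    results = []
    for col in X.columns:
        # Split this feature's values by target class.
        class0 = X.loc[Y == 0, col]
        class1 = X.loc[Y == 1, col]
        # equal_var=False -> unequal-variance (Welch's) t-test, two-tailed by default.
        _, p_value = ttest_ind(class0, class1, equal_var=False)
        results.append({"feature": col, "p_value": p_value, "important": p_value < alpha})
    # Lower p-value -> higher importance.
    return pd.DataFrame(results).sort_values("p_value")

ranking = t_test_feature_ranking(X, Y)
print(ranking.head(10))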
Cheers!