Selecting the sentences surrounding particular words - Python

Suppose I have a paragraph:
Str_wrds ="Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs."
And I have the following Test_wrds:
Test_wrds = ['Power curve', 'data-driven','wind turbines']
I would like to select the sentence before and the sentence after every sentence in which one of the Test_wrds is found, and collect them into a separate string. For example, 'Power curve' appears first in the 1st sentence, and when we select the 2nd sentence as its neighbour there is another 'power curve' occurrence in it, so the output would be something like this:
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.
And likewise, I would like to slice the sentences for 'data-driven' and 'wind turbines' and save them in separate strings.
How can I implement this using Python in a simple way?
So far I have found code which basically removes the entire sentence whenever any of the Test_wrds is in it:
def remove_sentence(Str_wrds, Test_wrds):
    # keep only the sentences that do not contain Test_wrds
    # (note: this only works if Test_wrds is a single word/phrase, not a list)
    return ".".join(sentence for sentence in Str_wrds.split(".")
                    if Test_wrds not in sentence)
But I don't understand how to use this for my problem.
Update on the problem: basically, whenever one of the Test_wrds is present in the paragraph, I would like to slice that sentence together with the sentence before and after it, and save the result in a single string. So for the three Test_wrds I expect to get three strings, each covering the sentences containing the corresponding word. I attached a PDF with an example of the output I am looking for.

You could define a function something like this one:
def find_sentences(word, text):
    # split into sentences and, for every sentence that contains the word
    # (case-insensitive), collect its neighbouring sentences
    sentences = text.split('.')
    findings = []
    for i in range(len(sentences)):
        if word.lower() in sentences[i].lower():
            if i == 0:
                findings.append(sentences[i + 1] + '.')
            elif i == len(sentences) - 1:
                findings.append(sentences[i - 1] + '.')
            else:
                findings.append(sentences[i - 1] + '.' + sentences[i + 1] + '.')
    return findings
This can then be called as
findings = find_sentences('Power curve', Str_wrds)
With some pretty printing,
for finding in findings:
    print(finding + '\n')
we get the results:
However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. Data-driven model accuracy is significantly affected by uncertainty.
The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.
The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines..
which I hope is what you were looking for :)

When you say,
I would like to select the sentence before and the sentence after every sentence in which one of the Test_wrds is found, and collect them into a separate string.
I take this to mean that every sentence containing one of the words in Test_wrds should be selected, together with the sentence before it and the sentence after it.
Function
def remove_sentence(Str_wrds: str, Test_wrds):
    # store all selected sentences
    all_selected_sentences = {}
    # initialize empty dictionary
    for k in Test_wrds:
        # one element for each occurrence
        all_selected_sentences[k] = [''] * Str_wrds.lower().count(k.lower())
    # list of sentences
    sentences = Str_wrds.split(".")
    word_counter = {}.fromkeys(Test_wrds, 0)
    for i, sentence in enumerate(sentences):
        for j, word in enumerate(Test_wrds):
            # case insensitive
            if word.lower() in sentence.lower():
                if i == 0:  # first sentence
                    chosen_sentences = sentences[0:2]
                elif i == len(sentences) - 1:  # last sentence
                    chosen_sentences = sentences[-2:]
                else:
                    chosen_sentences = sentences[i - 1:i + 2]
                # get which occurrence of the word it is
                k = word_counter[word]
                all_selected_sentences[word][k] += '.'.join(
                    [s for s in chosen_sentences
                     if s not in all_selected_sentences[word][k]]) + "."
                word_counter[word] += 1  # increment the word counter
    return all_selected_sentences
Running this
answer = remove_sentence(Str_wrds, Test_wrds)
print(answer)
with the provided values for Str_wrds and Test_wrds,
returns this output
{
'Power curve': [
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.',
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty.',
' The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.',
' The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
],
'data-driven': [
' However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.',
' Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression.',
' Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated.'
],
'wind turbines': [
' A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
]
}
Notes:
the function returns a dict of lists
every key is a word in Test_wrds, and each list element corresponds to one occurrence of that word.
for example, because the phrase 'power curve' occurs 4 times in the entire text, the value for 'Power curve' in the output is a list of 4 elements.
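If, as the update to the question suggests, you want a single string per test word rather than a list of snippets, you can simply join each list afterwards, for example:
answer = remove_sentence(Str_wrds, Test_wrds)
# one combined string per word in Test_wrds
combined = {word: " ".join(snippets) for word, snippets in answer.items()}
print(combined['Power curve'])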

Related

How to find appropriate clustering algorithm to cluster my data? [duplicate]

I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 means good, 0 means bad, and the cell is blank if the user made no choice.
I want to cluster similar users based on their reviews with the idea that users who rated similar movies as good might also rate a movie as good which was not rated by any user in the same cluster. I used cosine similarity measure with k-means clustering. The csv file is shown below:
UserID M1 M2 M3 ............... M200
user1 1 0 0
user2 0 1 1
user3 1 1 1
.
.
.
.
user100 1 0 1
The problem I am facing is that I don't know exactly how to find the optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering them with k-means and there is no issue with that, but I want to know the most stable or optimal number of clusters for this dataset.
I would appreciate some help.
Clustering is part of the unsupervised machine learning methods. Contrary to supervised methods, in unsupervised methods there is not a straightforward approach to determine the "best" model among a set of models that were trained on a certain dataset.
Nonetheless, there are some quantitative measures. Most of them are based on the question "how much more similar are the points within a cluster to each other than to the points in other clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation, in particular at the techniques that do not require labels_true (i.e. the unsupervised ones).
Once you have a quantitative measure of the "goodness" of a certain clustering, you usually observe how this quantity evolves while changing the number of clusters; this approach is called the Elbow Method.
Here is some code that uses K-Means algorithm with all possible K values from 2 to 30, calculates various scores for each K value, and stores all scores in a DataFrame.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

seed_random = 1
fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
    # Perform clustering.
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=seed_random,
                    )
    labels_clusters = kmeans.fit_predict(X)
    # Insert fitted model and calculated cluster labels in dictionaries,
    # for further reference.
    fitted_kmeans[n_clusters] = kmeans
    labels_kmeans[n_clusters] = labels_clusters
    # Calculate various scores, and save them for further reference.
    silhouette = silhouette_score(X, labels_clusters)
    ch = calinski_harabasz_score(X, labels_clusters)
    db = davies_bouldin_score(X, labels_clusters)
    tmp_scores = {"n_clusters": n_clusters,
                  "silhouette_score": silhouette,
                  "calinski_harabasz_score": ch,
                  "davies_bouldin_score": db,
                  }
    df_scores.append(tmp_scores)

# Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)
This code assumes that all your numerical features are in a DataFrame X.
All clustering performance metrics are stored in df_scores DataFrame.
You can easily use the elbow method by plotting columns from df_scores; for instance, if you want to see the elbow graph of the Silhouette Score, you can use df_scores["silhouette_score"].plot().
It's pretty common to start by visualizing the data. Sometimes it is graphically obvious that there are N classes/clusters. Other times you may at least be able to see whether it's <5, <10, or <100 classes. It really depends on your data.
Another common approach is to use the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
The main takeaway is that a lot of classification problems can yield "optimal" results if, for example, you have as many classes as you have inputs: every input fits perfectly into its own cluster.
BIC/AIC penalize such high-dimensional solutions, based on the insight that simpler models are often better/more stable, i.e. they generalize better and overfit less.
From Wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
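Note that BIC/AIC are defined for probabilistic models, so they do not apply to K-Means directly; a common workaround is to fit Gaussian mixture models over a range of component counts and compare their BIC values. A minimal sketch (assuming, as above, that your features are in X):
from sklearn.mixture import GaussianMixture

bic_scores = {}
for n_components in range(2, 31):
    gmm = GaussianMixture(n_components=n_components, random_state=1).fit(X)
    bic_scores[n_components] = gmm.bic(X)  # lower BIC is better

best_k = min(bic_scores, key=bic_scores.get)
print(best_k, bic_scores[best_k])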
You can use the Gini index as a metric, and then do a grid search based on this metric. Tell me if you have any other questions.
You could use the elbow method.
The basic idea of K-Means is to cluster the data points such that the total within-cluster sum of squares (a.k.a. WSS) is minimized. Hence you can vary k from 2 to n while calculating the WSS at each value; plot the resulting curve, find the location of the bend, and that can be considered the optimal number of clusters!
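In scikit-learn the WSS is exposed on a fitted model as the inertia_ attribute, so the elbow curve only takes a few lines (again assuming your features are in X):
from sklearn.cluster import KMeans

wss = {}
for k in range(2, 31):
    km = KMeans(n_clusters=k, random_state=1).fit(X)
    wss[k] = km.inertia_  # within-cluster sum of squares for this k

# Plot k against wss (e.g. pd.Series(wss).plot()) and look for the bend.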

Interpreting logistic regression coefficients of scaled features

I'm using a logistic regression to estimate the probability of scoring a goal in soccer/football. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is always a must, I've scaled my features before fitting my model. I've used the MinMaxScaler, which scales all features into the range [0-1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thought is that the second feature is the most important, followed by the first. Is this always true?
I read on this site that "for a one-unit increase in the second feature, the expected change in log odds is 4.05722387", but there the features were normalized to a mean of 50 and some standard deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in the literature on my topic that this is indeed true. So this confuses me, of course.
My questions are:
Which of my features is the most important, and what is the best methodology to find this?
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one unit increase'?
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Which of my features is the most important, and what is the best methodology to find this?
Look at several versions of marginal effects calculations. For example, see overview/discussion in a blog Stata's example resources for R
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one unit increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for scaling when you talk about one unit of X increasing/decreasing the change in probability or odds ratio etc.
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Yes, it's just that the features x are expressed in scaled units.
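To make the unit conversion concrete: with X_scaled = (x - x_min)/(x_max - x_min), a 1 meter increase in the original feature corresponds to an increase of 1/(x_max - x_min) in the scaled feature, so the change in log odds is coef_scaled/(x_max - x_min). A small sketch, where the range values for feature 1 are purely illustrative:
import numpy as np

coef_scaled = -2.26286643   # coefficient of feature 1 fitted on scaled data
x_min, x_max = 5.0, 40.0    # hypothetical observed range of feature 1, in meters

# change in log odds for a 1 meter increase of the original (unscaled) feature
delta_log_odds = coef_scaled * 1.0 / (x_max - x_min)
odds_ratio = np.exp(delta_log_odds)  # equivalent multiplicative change in the odds
print(delta_log_odds, odds_ratio)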

Evaluating vector distance measures

I am working with vectors of word frequencies and trying out some of the different distance measures available in scikit-learn's pairwise distances. I would like to use these distances for clustering and classification.
I usually have a feature matrix of ~ 30,000 x 100. My idea was to choose the distance metric that maximizes the pairwise distances, by computing pairwise distances over the same dataset with the distance metrics available in SciPy (e.g. Euclidean, Cityblock, etc.) and, for each metric,
convert the distances computed for the dataset to z-scores to normalize across metrics
get the range of these z-scores, i.e. the spread of the distances
use the distance metric that gives me the widest range of distances, as it apparently gives me the maximum spread over my dataset and the most variance to work with. (Cf. code below)
My questions:
Does this approach make sense?
Are there other evaluation procedures that one should try? I found these papers (Gavin, Aggarwal), but they don't apply 100% here...
Any help is much appreciated!
My code:
import numpy as np
import scipy.stats
import sklearn.metrics

matrix = np.random.uniform(0, .1, size=(10, 300))  # test data set
scipy_distances = ['euclidean', 'minkowski', ...]  # these are the distance metrics

for d in scipy_distances:  # iterate over distances
    distmatrix = sklearn.metrics.pairwise.pairwise_distances(matrix, metric=d)
    distzscores = scipy.stats.mstats.zscore(distmatrix, axis=0, ddof=1)
    diststats = basicstatsmaker(distzscores)  # basicstatsmaker is a helper defined elsewhere
    zscore_range = np.ptp(distzscores, axis=0)
    print("range of metric", d, np.ptp(zscore_range))
In general, this is just a heuristic which might or might not work. In particular, it is easy to construct a "dummy metric" which will "win" in your approach even though it is useless. Try out
class Dummy_dist:
    def __init__(self):
        self.cheat = True

    def __call__(self, x, y):
        if self.cheat:
            self.cheat = False
            return 1e60
        else:
            return 0

dummy_dist = Dummy_dist()
This will give you a huuuuge spread (even with z-score normalization). Of course this is a cheating example, as it is non-deterministic, but I wanted to show a basic counterexample, and of course given your data one can construct a deterministic analogue.
So what should you do? Your metric should be treated as a hyperparameter of your process. You should not divide the process of generating your clustering/classification into two separate phases, choosing a distance and then learning something; you should do this jointly and consider your clustering/classification + distance pairs as a single model. Thus, instead of working with k-means, you will work with k-means+euclidean, k-means+Minkowski and so on. This is the only statistically supported approach. You cannot construct a method of assessing the "general goodness" of a metric, as there is no such object; metric quality can only be assessed in a particular task, which involves fixing every other element (such as the clustering/classification method, the particular dataset, etc.). Once you perform such a wide, exhaustive evaluation, checking many such pairs on many datasets, you might claim that a given metric performs best in a certain range of tasks.
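As a concrete version of treating the metric as a hyperparameter: if the downstream task is classification, you can let a grid search pick the metric together with the rest of the model, evaluated by cross-validation on that task. A minimal sketch with k-nearest neighbours (assuming your features are in X and labels in y):
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "metric": ["euclidean", "manhattan", "cosine", "chebyshev"],
    "n_neighbors": [3, 5, 11],
}
# brute-force search supports all of the metrics above
search = GridSearchCV(KNeighborsClassifier(algorithm="brute"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)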

Fitting autoregressive models to EEG timeseries

So I read that it is possible to fit AR models to EEG data and then use the AR coefficients as features for clustering or classifying the data: e.g. Mohammadi et al., Person identification by using AR model for EEG signals, 2006.
As a quality control step, and as an aid for explanation, I wanted to visually see the type of timeseries produced/simulated by the fitted model. This would also allow me to show the prototype model if I were doing K-means or something for classification.
However, all I seem to be able to produce is noise!
Any steps towards getting towards what I want would be more than welcome.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_process import arma_generate_sample

# `data` is the raw EEG recording, loaded elsewhere
section1 = data[88000:91800]
section2 = data[0:8000]
section3 = data[143500:166000]

section1 -= np.mean(section1)
section2 -= np.mean(section2)
section3 -= np.mean(section3)
When plotted:
maxOrder = 20
model_one = AR(section1).fit(maxOrder, ic='aic', trend='nc')
model_two = AR(section2).fit(maxOrder, ic='aic', trend='nc')
model_three = AR(section3).fit(maxOrder, ic='aic', trend='nc')

fake1 = arma_generate_sample(model_one.params, [1], 1000, sigma=1)
fake2 = arma_generate_sample(model_two.params, [1], 1000, sigma=1)
fake3 = arma_generate_sample(model_three.params, [1], 1000, sigma=1)

fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(fake1)
ax2.plot(fake2)
ax3.plot(fake3)
The standard simplest more-or-less-true thing to say about EEG data is that it has a 1/f or "pink" distribution. An interesting thing about 1/f signals is that they are non-stationary, and cannot be correctly modelled by an ARMA process of any order. (1/f means that low frequency fluctuations are arbitrarily large, which means that arbitrarily far apart points remain correlated, and the more data you have, the further apart the correlations you can detect -- the ACF never converges to anything finite. Also, it's important to realize that spectral content and ARMA-like processes are super super related, because a signal's auto-correlation function totally determines its spectral distribution, and vice-versa -- the two functions are Fourier transforms of each other.)
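For intuition, here is a minimal sketch (independent of your data) that generates an approximately 1/f signal by shaping white noise in the frequency domain; plotting it next to the AR simulations makes the qualitative difference easy to see:
import numpy as np

def pink_noise(n, rng=None):
    # Shape white noise so that power falls off as 1/f ("pink" noise).
    rng = np.random.default_rng(rng)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    scale = np.ones_like(freqs)
    scale[1:] = 1.0 / np.sqrt(freqs[1:])  # amplitude ~ 1/sqrt(f) => power ~ 1/f
    return np.fft.irfft(spectrum * scale, n)

example = pink_noise(1000)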
So basically this means that anything you do using basic time series statistics is going to be a huge theory-violating hack. It doesn't mean it won't work in practice to produce some useful classification features, but calibrate your expectations accordingly... it might well be that the results you're getting are exactly the same as Mohammadi et al got, and they just didn't bother to do any checking/reporting of goodness of fit.
There are ways to model 1/f noise directly, via wavelets or ARIMA processes.
Depending on your data, you may also need to worry about deviations from the simple 1/f distribution: stuff like alpha (which produces a substantial bump in the spectral distribution at 10 Hz), artifacts like muscle noise, electrical line noise, and heart beat (which also cause substantial deviations from the simple 1/f spectrum -- muscle in particular produces very distinctive broad-band ~whitish noise), and eye blinks (which produce huge impulse deviations that aren't going to be well-modelled by any technique that assumes stationarity or works in the frequency domain).
There's more discussion (with references) of these issues in section 5.3 of my thesis, though in the context of doing ERP-like analyses rather than machine learning.
