Related
Suppose I have a paragraph:
Str_wrds ="Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs."
And have the following test_wrds,
Test_wrds = ['Power curve', 'data-driven','wind turbines']
I would like to select the sentence before and the sentence after whenever one of the Test_wrds is found in the paragraph, and list them as a separate string. For example, 'Power curve' appears first in the 1st sentence, but the 2nd sentence also contains 'power curve', so the output would be something like this:
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.
And likewise, I would like to slice the sentences for 'data-driven' and 'wind turbines' and save them in separate strings.
How can I implement this using Python in a simple way?
So far I have found code which basically removes the entire sentence whenever any of the Test_wrds is in it:
def remove_sentence(Str_wrds, Test_wrds):
    return ".".join(sentence for sentence in Str_wrds.split(".")
                    if Test_wrds not in sentence)
But I don't understand how to use this for my problem.
Update on the problem: basically, whenever one of the Test_wrds is present in the paragraph, I would like to slice that sentence as well as the sentence before and after it, and save them in a single string. So for the three Test_wrds I expect to get three strings, each covering the sentences containing that word. I attached a pdf with an example of the output I am looking for.
You could define a function something like this one
def find_sentences(word, text):
    sentences = text.split('.')
    findings = []
    for i in range(len(sentences)):
        if word.lower() in sentences[i].lower():
            if i == 0:
                findings.append(sentences[i+1] + '.')
            elif i == len(sentences) - 1:
                findings.append(sentences[i-1] + '.')
            else:
                findings.append(sentences[i-1] + '.' + sentences[i+1] + '.')
    return findings
This can then be called as
findings = find_sentences( 'Power curve', Str_wrds )
With some pretty printing
for finding in findings:
    print(finding + '\n')
We get the results
However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.
Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. Data-driven model accuracy is significantly affected by uncertainty.
The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.
The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines..
which I hope is what you were looking for :)
When you say,
I would like to select before and after 1 sentence whenever Test_wrds found it in a paragraph and list them as a separate string.
I guess you mean that every sentence containing one of the words in Test_wrds should be selected, along with the sentence before it and the sentence after it.
Function
def remove_sentence(Str_wrds: str, Test_wrds):
    # store all selected sentences
    all_selected_sentences = {}
    # initialize the dictionary: one element for each occurrence of each word
    for k in Test_wrds:
        all_selected_sentences[k] = [''] * Str_wrds.lower().count(k.lower())

    # list of sentences
    sentences = Str_wrds.split(".")
    word_counter = dict.fromkeys(Test_wrds, 0)

    for i, sentence in enumerate(sentences):
        for word in Test_wrds:
            # case insensitive
            if word.lower() in sentence.lower():
                if i == 0:  # first sentence
                    chosen_sentences = sentences[0:2]
                elif i == len(sentences) - 1:  # last sentence
                    chosen_sentences = sentences[-2:]
                else:
                    chosen_sentences = sentences[i - 1:i + 2]

                # get which occurrence of the word this is
                k = word_counter[word]
                all_selected_sentences[word][k] += '.'.join(
                    [s for s in chosen_sentences
                     if s not in all_selected_sentences[word][k]]) + "."
                word_counter[word] += 1  # increment the word counter

    return all_selected_sentences
Running this
answer = remove_sentence(Str_wrds, Test_wrds)
print(answer)
with the provided values for Str_wrds and Test_wrds,
returns this output
{
'Power curve': [
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.',
'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty.',
' The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.',
' The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
],
'data-driven': [
' However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.',
' Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression.',
' Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated.'
],
'wind turbines': [
' A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
]
}
Notes:
the function returns a dict of lists
every key is a word in Test_wrds, and each element of the list corresponds to one occurrence of that word.
for example, because the word 'power curve' occurs 4 times in the entire text, the value for 'power curve' in the output is a list of 4 elements.
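One caveat worth noting (not part of the original answers): both solutions split sentences with Str_wrds.split("."), which will also break on abbreviations or decimal numbers. If that becomes an issue, a regex split on sentence-ending punctuation followed by whitespace is a mostly drop-in replacement; note that it keeps the trailing punctuation attached to each sentence, so the re-joining logic would need a small adjustment.
import re

# split on '.', '!' or '?' followed by whitespace; punctuation stays attached
sentences = re.split(r'(?<=[.!?])\s+', Str_wrds)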
Thank you for taking a look at this. I have failure data for tires over a 5 year period. For each tire, I have the start date (day0), the end date (dayn), and the number of miles driven for each day. I used the total miles each car drove to create 2 distributions, one Weibull, one ECDF. My hope is to be able to use those distributions to predict the probability that a tire will fail 50 miles in the future during the life of the tire. So as an example, if it's 2 weeks into the life of a tire, the total miles is currently 100 and the average miles per week is 50, I want to predict the probability it will fail at 150 miles, i.e. within a week.
My thinking is that if I can get the probabilities of all tires active on a given day, I can sum the probability of each tire's failure to get a prediction of how many tires will need to be replaced over a given time period in the future of that day.
My current methodology is to fit a distribution using 3 years of failure data with scipy's weibull_min and statsmodels' ECDF. Then, if a tire is currently at 100 miles and we expect the next week to add 50 miles to that, I take the CDF at 150.
However, after I run this across all tires that are on the road on the date I am predicting from and sum their respective probabilities, I get a prediction that is ~50% higher than the actual number of tire replacements. My first thought is that it is an issue with my methodology. Does it sound valid, or am I doing something dumb?
This might be too late of a reply but perhaps it will help someone in the future reading this.
If you are looking to make predictions, you need to fit a parametric model (like the Weibull Distribution). The ecdf (Empirical CDF / Nonparametric model) will give you an indication of how well the parametric model fits but it will not allow you to make any future predictions.
To fit the parametric model, I recommend you use the Python reliability library.
This library makes it fairly straightforward to fit a parametric model (especially if you have right censored data) and then use the fitted model to make the kind of predictions you are trying to make. Scipy won't handle censored data.
If you have failure data for a population of tires then you will be able to fit a model. The question you asked (about the probability of failure in the next week given that it has survived 2 weeks) is called conditional survival. Essentially you want CS(1|2) which means the probability it will survive 1 more week given that it has survived to week 2. You can find this as the ratio of the survival functions (SF) at week 3 and week 2: CS(1|2) = SF(2+1)/SF(2).
Let's take a look at some code using the Python reliability library. I'll assume we have 10 failure times that we will use to fit our distribution and from that I'll find CS(1|2):
from reliability.Fitters import Fit_Weibull_2P
data = [113, 126, 91, 110, 146, 147, 72, 83, 57, 104] # failure times (in weeks) of some tires from our vehicle fleet
fit = Fit_Weibull_2P(failures=data, show_probability_plot=False)
CS_1_2 = fit.distribution.SF([3])[0] / fit.distribution.SF([2])[0] # conditional survival
CF_1_2 = 1 - CS_1_2 # conditional failure
print('Probability of failure in the next week given it has survived 2 weeks:', CF_1_2)
'''
Results from Fit_Weibull_2P (95% CI):
Point Estimate Standard Error Lower CI Upper CI
Parameter
Alpha 115.650803 9.168086 99.008075 135.091084
Beta 4.208001 1.059183 2.569346 6.891743
Log-Likelihood: -47.5428956288772
Probability of failure in the next week given it has survived 2 weeks: 1.7337430857633507e-07
'''
Let's now assume you have 250 vehicles in your fleet, each with 4 tires (1000 tires in total). The probability of 1 tire failing is CF_1_2 = 1.7337430857633507e-07
We can find the probability of X tires failing (throughout the fleet of 1000 tires) like this:
from scipy.stats import poisson

X = [0, 1, 2, 3, 4, 5]
print('n failed probability')
for x in X:
    PF = poisson.pmf(k=x, mu=CF_1_2 * 1000)
    print(x, ' ', PF)
'''
n failed probability
0 0.9998266407198806
1 0.00017334425253100934
2 1.502671996412269e-08
3 8.684157279833254e-13
4 3.764024409898102e-17
5 1.305170259061071e-21
'''
These numbers make sense because I generated the data from a weibull distribution with a characteristic life (alpha) of 100 weeks, so we'd expect that the probability of failure during week 3 should be very low.
If you have further questions, feel free to email me directly.
I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end-gauge of actual (unknown) length at room temperature, and measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient and dT is the deviation from room temperature and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty)
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt

basic_model = pm.Model()

xdata = np.random.normal(50.000215, 5.8e-6*np.sqrt(1000), 1000)

with basic_model:
    # prior distributions
    theta = pm.Normal('theta', mu=-.1, sd=.04)
    alpha = pm.Normal('alpha', mu=.0000115, sd=.0000012)
    length = pm.Normal('length', mu=50, sd=1)
    mumeas = length*(1+alpha*theta)

with basic_model:
    obs = pm.Normal('obs', mu=mumeas, sd=5.8e-6, observed=xdata)
    #yobs = Normal('yobs',)
    start = pm.find_MAP()
    #trace = pm.sample(2000, step=pm.Metropolis, start=start)
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=200000, step=step, start=start, njobs=1)

length_samples = trace['length']

fig, ax = plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn’t working. I've been trying for a while and it never converges to the expected solution given from the R code. I tried the default sampler (NUTS I think) as well as Metropolis but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
                        data = jags_data,
                        n.chains = 1,
                        n.adapt = 30000)

post_samples <- coda.samples(model = jags_code,
                             variable.names = c("l","mu","alpha","theta"), #,"ypred"),
                             n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
    l ~ dnorm(50, 1.0)
    alpha ~ dnorm(0.0000115, 694444444444)
    theta ~ dnorm(-0.1, 625)
    mu <- l*(1 + alpha*theta)
    xbar ~ dnorm(mu, 29726516052)
}
(note I think the dnorm distribution takes 1/sigma^2, hence the weird-looking variances)
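(A quick check, added here for reference, confirms that those precisions correspond to the standard deviations quoted above:)
# tau = 1/sigma^2 for each of the dnorm() calls in calibration.txt
print(1 / 0.04**2)     # 625       -> theta prior
print(1 / 1.2e-6**2)   # ~6.94e11  -> alpha prior
print(1 / 5.8e-6**2)   # ~2.97e10  -> xbar likelihood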
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model, which can lead to highly correlated posterior parameters. Add to that the extreme scale imbalances, and it makes sense that this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
    # tau === precision parameterization
    dT = pm.Normal('dT', mu=-0.1, tau=625)
    alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
    length = pm.Normal('length', mu=50.0, tau=1.0)
    mu = pm.Deterministic('mu', length*(1+alpha*dT))

    # only one input observation; tau indicates the 5.8 nm sd
    obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])

    trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
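(For reference, diagnostics like the ones below can be produced with PyMC3's built-in helpers; this is only a minimal sketch, assuming the m0 model and trace from the code above.)
import matplotlib.pyplot as plt

with m0:
    print(pm.summary(trace))  # posterior means, sds and convergence diagnostics
    pm.traceplot(trace)       # trace plots
    pm.forestplot(trace)      # forest plot

# crude pair plot of the two strongly correlated parameters
plt.figure()
plt.scatter(trace['length'], trace['dT'], s=1, alpha=0.1)
plt.xlabel('length')
plt.ylabel('dT')
plt.show()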
Trace Plots
Forest Plot
Pair Plot
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.
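As a starting point (this is only a sketch of the standard standardization trick, not something I have tuned for this problem), you could sample unit-scale variables and transform them back to the physical parameters inside the model, so that NUTS sees priors of comparable scale; the tiny observation noise may still need its own rescaling:
with pm.Model() as m1:
    # unit-scale latent variables, all N(0, 1)
    z_dT = pm.Normal('z_dT', mu=0.0, sd=1.0)
    z_alpha = pm.Normal('z_alpha', mu=0.0, sd=1.0)
    z_length = pm.Normal('z_length', mu=0.0, sd=1.0)

    # transform back to physical units: mean + sd * z
    dT = pm.Deterministic('dT', -0.1 + 0.04 * z_dT)
    alpha = pm.Deterministic('alpha', 1.15e-5 + 1.2e-6 * z_alpha)
    length = pm.Deterministic('length', 50.0 + 1.0 * z_length)

    mu = pm.Deterministic('mu', length * (1 + alpha * dT))

    # same single summary observation as in the model above
    obs = pm.Normal('obs', mu=mu, sd=5.8e-6, observed=[50.000215])

    trace_nuts = pm.sample(5000, tune=5000, chains=4, cores=4)  # default NUTS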
So I read that it is possible to fit AR models to EEG data and then use the AR coefficients as features for clustering or classifying data : e.g. Mohammadi et al, Person identification by using AR model for EEG signals, 2006.
As a quality control step, and as an aid for explanation, I wanted to visually see the type of timeseries produced/simulated by the fitted model. This would also allow me to show the prototype model if I was doing K means or something for classification.
However, all I seem to be able to produce is noise!
Any steps towards getting towards what I want would be more than welcome.
import numpy as np

# `data` is the raw single-channel EEG recording (a 1-D numpy array)
section1 = data[88000:91800]
section2 = data[0:8000]
section3 = data[143500:166000]

section1 -= np.mean(section1)
section2 -= np.mean(section2)
section3 -= np.mean(section3)
When plotted:
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_process import arma_generate_sample
import matplotlib.pyplot as plt

maxOrder = 20
model_one = AR(section1).fit(maxOrder, ic='aic', trend='nc')
model_two = AR(section2).fit(maxOrder, ic='aic', trend='nc')
model_three = AR(section3).fit(maxOrder, ic='aic', trend='nc')

fake1 = arma_generate_sample(model_one.params, [1], 1000, sigma=1)
fake2 = arma_generate_sample(model_two.params, [1], 1000, sigma=1)
fake3 = arma_generate_sample(model_three.params, [1], 1000, sigma=1)

fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(fake1)
ax2.plot(fake2)
ax3.plot(fake3)
The standard simplest more-or-less-true thing to say about EEG data is that it has a 1/f or "pink" distribution. An interesting thing about 1/f signals is that they are non-stationary, and cannot be correctly modelled by an ARMA process of any order. (1/f means that low frequency fluctuations are arbitrarily large, which means that arbitrarily far apart points remain correlated, and the more data you have, the further apart the correlations you can detect -- the ACF never converges to anything finite. Also, it's important to realize that spectral content and ARMA-like processes are super super related, because a signal's auto-correlation function totally determines its spectral distribution, and vice-versa -- the two functions are Fourier transforms of each other.)
So basically this means that anything you do using basic time series statistics is going to be a huge theory-violating hack. It doesn't mean it won't work in practice to produce some useful classification features, but calibrate your expectations accordingly... it might well be that the results you're getting are exactly the same as Mohammadi et al got, and they just didn't bother to do any checking/reporting of goodness of fit.
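If you do want a quick goodness-of-fit check along these lines, comparing the power spectra of the real and AR-simulated data is informative. This is only a rough sketch, assuming the section1/fake1 arrays from the question and a sampling rate fs that you know for your recording (256 Hz is just a placeholder):
import matplotlib.pyplot as plt
from scipy.signal import welch

fs = 256.0  # placeholder sampling rate -- use your recording's actual rate

f_real, p_real = welch(section1, fs=fs, nperseg=1024)
f_sim, p_sim = welch(fake1, fs=fs, nperseg=512)

# on log-log axes, real EEG should show a roughly 1/f slope (plus an alpha
# bump near 10 Hz) that a low-order AR simulation may not reproduce
plt.loglog(f_real, p_real, label='EEG section 1')
plt.loglog(f_sim, p_sim, label='AR simulation')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Power spectral density')
plt.legend()
plt.show()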
There are ways to model 1/f noise directly, via wavelets or ARIMA processes.
Depending on your data, you may also need to worry about deviations from the simple 1/f distribution: stuff like alpha (which produces a substantial bump in the spectral distribution at 10 Hz), artifacts like muscle noise, electrical line noise, and heart beat (which also cause substantial deviations from the simple 1/f spectrum -- muscle in particular produces very distinctive broad-band ~whitish noise), and eye blinks (which produce huge impulse deviations that aren't going to be well-modelled by any technique that assumes stationarity or works in the frequency domain).
There's more discussion (with references) of these issues in section 5.3 of my thesis, though in the context of doing ERP-like analyses rather than machine learning.
This is likely a math problem as much as it is a programming problem, but I seem to be encountering severe oscillations in temperature in my class method "update()" when warp is set for a high value (1000+) in the code below. All temperatures are in Kelvin for simplicity.
(I am not a programmer by profession. This formatting is likely unpleasant.)
import math

# Critical to the Stefan-Boltzmann equation. Otherwise known as Sigma
BOLTZMANN_CONSTANT = 5.67e-8

class GeneratorObject(object):
    """Create a new object to run thermal simulation on."""

    def __init__(self, mass, emissivity, surfaceArea, material, temp=0, power=5000, warp=1):
        self.tK = temp                  # Temperature of the object.
        self.mass = mass                # Mass of the object.
        self.emissivity = emissivity    # Emissivity of the object. Always between 0 and 1.
        self.surfaceArea = surfaceArea  # Emissive surface area of the object.
        self.material = material        # Store the material name for some reason.
        self.specificHeat = (0.45*1000)*self.mass  # Heat capacity of the object in J/K (iron's specific heat: 0.45*1000 = 450 J/(kg*K))
        self.power = power              # Joules/second (watts) input. This is for heating the object.
        self.warp = warp                # Warp multiplier. This pertains to how KSP's warp multiplier works.

    def update(self):
        """Update the object's temperature according to its properties."""
        # This method updates the object's temperature according to heat losses and other factors.
        self.tK -= (((self.emissivity * BOLTZMANN_CONSTANT * self.surfaceArea *
                      (math.pow(self.tK, 4) - math.pow(30+273.15, 4))) / self.specificHeat)
                    - (self.power / self.specificHeat)) * self.warp
The law used is the Stefan-Boltzmann law for calculating black-body heat losses:
Temp -= (Emissivity*Sigma*SurfaceArea*(Temp^4 - Amb^4)) / SpecificHeat
This was ported from a KSP plugin for quicker debugging. Object.update() is called 50 times per second.
Would there be a solution to preventing these extreme oscillations that doesn't involve executing the code multiple times per step?
Your integration scheme is bad, as already hinted by #Beta and #tom10. The integration timestep is self.warp units of time, i.e. self.warp seconds, since you work with physical units. This is not the way things are done. You should first convert the equation to a dimensionless form by expressing each term in some sort of computational units. For example, the Stefan-Boltzmann constant and self.power could be measured in units in which the constant is 1. Then you should determine the characteristic time for the object, e.g. the time it takes the temperature to get reasonably close to the equilibrium one. If there are many such objects, you should find the smallest of all characteristic times and use it as the unit of measurement for time. The integration timestep should then be about an order of magnitude less than the characteristic time, otherwise you completely miss the correct solution of the differential equation and end up with wild oscillations.
Example of what happens now: let's take a 1 kg iron sphere. With a surface area of 3.05×10^-3 m^2, the radiative heating/cooling power is about 1.73×10^-10 W/K^4 times the fourth power of the temperature. With self.power equal to 5 kW, the radiative power equals the internal one when the temperature reaches 2319 K, and that is the equilibrium temperature. At low temperatures the radiative heating/cooling is negligible, and with the internal heating alone you get a temperature rate of 11.1 K/s. If warp is 1000+, your first integration step results in a temperature of 11100 K or more, which overshoots the equilibrium one almost 5 times. Now the radiative energy is orders of magnitude higher than the internal heating, which leads to a huge cool-down rate; multiply it by 1000+ and you end up with a negative temperature. Then the cycle repeats with higher and higher absolute temperatures until you run outside the range of floating-point arithmetic.
Here is a hint for you: if self.power is kept constant, then the equation has an analytical solution. Find it (or use a tool like Maple or Mathematica to find it for you) and then plot the solution. See how your timestep of 1000+ units compares to the timescale of the solution, i.e. the time it takes for the system to reach an almost equilibrium state.
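For illustration, here is a minimal sketch (not the original plugin code; it assumes SciPy is available and reuses the numbers from the example above) that lets an adaptive ODE solver integrate the same equation across one warped frame instead of taking a single Euler step; it approaches the ~2319 K equilibrium smoothly, without oscillating:
from scipy.integrate import solve_ivp

SIGMA = 5.67e-8            # Stefan-Boltzmann constant
EPS, AREA = 1.0, 3.05e-3   # emissivity and surface area from the example above
C = 450.0                  # heat capacity of 1 kg of iron, J/K
POWER = 5000.0             # internal heating, W
T_AMB = 30 + 273.15        # ambient temperature, K

def dTdt(t, T):
    # dT/dt = (P - eps*sigma*A*(T^4 - T_amb^4)) / C
    return [(POWER - EPS * SIGMA * AREA * (T[0]**4 - T_AMB**4)) / C]

def update(tK, warp, frame_dt=1.0 / 50):
    # advance the temperature by one frame's worth of warped time
    sol = solve_ivp(dTdt, (0.0, warp * frame_dt), [tK], rtol=1e-6)
    return sol.y[0, -1]

T = 300.0
for _ in range(100):       # 100 frames at warp 1000 -> 2000 simulated seconds
    T = update(T, warp=1000)
print(T)                   # climbs smoothly towards the ~2319 K equilibrium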
I guess KSP = Kerbal Space Program, so I gather this is a problem in game physics. If so, maybe an approximation with the same qualitative behavior is sufficient. Maybe an exponential curve which starts at the initial temperature and falls to the ambient temperature is enough. Pick the decay constant by matching the heat transfer at the initial time.
Sometimes an approximation is good enough. I don't know if this is one of those situations.