I have my cumulative bike riding time for the first 14 days
time = [2.29,2.29,3.15,3.89,4.72,5.21,5.21,5.55,5.8,6.18,6.44,6.9,7.11,7.32]
I know these values are described by the equation
y(t) = a*ln(t+b)
Where:
t - day of my riding
a, b - coefficients to be found
I need to find the coefficients a and b for the 14-day values by minimizing the sum of squared deviations (pretty simple with Solver in Excel), and then predict the day-30 value. How do I find these coefficients in Python?
Thanks for helping me!
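A minimal sketch with SciPy's curve_fit, which minimizes the sum of squared deviations much like Excel's Solver (the starting guess p0 and the bounds are assumptions to keep t + b positive; adjust them if the fit does not converge):
import numpy as np
from scipy.optimize import curve_fit

time = [2.29, 2.29, 3.15, 3.89, 4.72, 5.21, 5.21, 5.55, 5.8, 6.18, 6.44, 6.9, 7.11, 7.32]
t = np.arange(1, 15)   # days 1..14
y = np.array(time)

def model(t, a, b):
    return a * np.log(t + b)

# least-squares fit; the bounds keep b > -1 so the log stays defined
(a, b), _ = curve_fit(model, t, y, p0=(1.0, 1.0), bounds=([0.0, -0.99], [np.inf, np.inf]))

print('a =', a, 'b =', b)
print('predicted day-30 value:', model(30, a, b))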
So I'm having a hard time conceptualizing how to make mathematical representation of my solution for a simple logistic regression problem. I understand what is happening conceptually and have implemented it, but I am answering a question which asks for a final solution.
Say I have a simple two-column dataset denoting something like the likelihood of getting a promotion per year worked, so the likelihood would increase as the person accumulates experience. Here X denotes the year and Y is a binary indicator for receiving a promotion:
X | Y
1 | 0
2 | 1
3 | 0
4 | 1
5 | 1
6 | 1
I implement logistic regression to find the probability per year worked of receiving a promotion, and get an output set of probabilities that seem correct.
I get an output weight vector that is two items, which makes sense as there are only two inputs: the number of years X, and, since I add an intercept to handle the bias, a column of 1s. So one weight for the years, one for the bias.
So I have two questions about this.
Since it is easy to get an equation of the form y = mx + b as a decision boundary for something like linear regression or a PLA, how can I similarly denote a mathematical solution with the weights of the logistic regression model? Say I have a weights vector [0.9, -0.34]; how can I convert this into an equation?
Secondly, I am performing gradient descent, which returns a gradient, and I multiply that by my learning rate. Am I supposed to update the weights at every epoch? My gradient never returns zeros in this case, so I am always updating.
Thank you for your time.
The logistic regression is trying to map the input value (x = years) to the output value (y = likelihood) through this relationship:
y = 1 / (1 + exp(-(theta*x + b)))
where theta and b are the weights you are trying to find.
The decision boundary is then defined by L(x) > p or L(x) < p, where L(x) is the right-hand side of the equation above. That is the relationship you want.
You can transform it into a more linear form, like that of linear regression, by moving the exponential to the numerator and taking the log of both sides, which gives the log-odds form: log(y / (1 - y)) = theta*x + b.
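As a concrete illustration (assuming the first weight is the coefficient on years and the second is the bias; swap the indices if your implementation orders them the other way), a minimal sketch:
import numpy as np

# weights from the question, assumed ordering [coefficient_for_years, bias]
theta, b = 0.9, -0.34

def predict_proba(x):
    # probability of promotion after x years: 1 / (1 + exp(-(theta*x + b)))
    return 1.0 / (1.0 + np.exp(-(theta * x + b)))

print(predict_proba(np.arange(1, 7)))          # probabilities for years 1..6
print('decision boundary at x =', -b / theta)  # theta*x + b = 0  ->  x ~= 0.38
For a threshold other than p = 0.5, solve theta*x + b = log(p / (1 - p)) for x instead.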
Thank you for taking a look at this. I have failure data for tires over a 5-year period. For each tire, I have the start date (day 0), the end date (day n), and the number of miles driven each day. I used the total miles each car drove to create 2 distributions, one Weibull, one ECDF. My hope is to be able to use those distributions to predict the probability that a tire will fail 50 miles in the future during the life of the tire. As an example: if it's 2 weeks into the life of a tire, the total mileage is currently 100 miles, and the average is 50 miles per week, I want to predict the probability that it will fail at 150 miles / in a week.
My thinking is that if I can get the probabilities of all tires active on a given day, I can sum the probability of each tire's failure to get a prediction of how many tires will need to be replaced over a given time period in the future of that day.
My current methodology is to fit a distribution using 3 years of failure data with scipy's weibull_min and statsmodels' ECDF. Then, if a tire is currently at 100 miles and we expect the next week to add 50 miles to that, I take the CDF at 150.
However, when I run this across all tires that are on the road on the date I am predicting from and sum their respective probabilities, I get a prediction that is ~50% higher than the actual number of tire replacements. My first thought is that it is an issue with my methodology. Does it sound valid, or am I doing something dumb?
This might be too late of a reply but perhaps it will help someone in the future reading this.
If you are looking to make predictions, you need to fit a parametric model (like the Weibull Distribution). The ecdf (Empirical CDF / Nonparametric model) will give you an indication of how well the parametric model fits but it will not allow you to make any future predictions.
To fit the parametric model, I recommend you use the Python reliability library.
This library makes it fairly straightforward to fit a parametric model (especially if you have right censored data) and then use the fitted model to make the kind of predictions you are trying to make. Scipy won't handle censored data.
If you have failure data for a population of tires then you will be able to fit a model. The question you asked (about the probability of failure in the next week given that it has survived 2 weeks) is called conditional survival. Essentially you want CS(1|2) which means the probability it will survive 1 more week given that it has survived to week 2. You can find this as the ratio of the survival functions (SF) at week 3 and week 2: CS(1|2) = SF(2+1)/SF(2).
Let's take a look at some code using the Python reliability library. I'll assume we have 10 failure times that we will use to fit our distribution and from that I'll find CS(1|2):
from reliability.Fitters import Fit_Weibull_2P
data = [113, 126, 91, 110, 146, 147, 72, 83, 57, 104] # failure times (in weeks) of some tires from our vehicle fleet
fit = Fit_Weibull_2P(failures=data, show_probability_plot=False)
CS_1_2 = fit.distribution.SF([3])[0] / fit.distribution.SF([2])[0] # conditional survival
CF_1_2 = 1 - CS_1_2 # conditional failure
print('Probability of failure in the next week given it has survived 2 weeks:', CF_1_2)
'''
Results from Fit_Weibull_2P (95% CI):
Point Estimate Standard Error Lower CI Upper CI
Parameter
Alpha 115.650803 9.168086 99.008075 135.091084
Beta 4.208001 1.059183 2.569346 6.891743
Log-Likelihood: -47.5428956288772
Probability of failure in the next week given it has survived 2 weeks: 1.7337430857633507e-07
'''
Let's now assume you have 250 vehicles in your fleet, each with 4 tires (1000 tires in total). The probability of 1 tire failing is CF_1_2 = 1.7337430857633507e-07
We can find the probability of X tires failing (throughout the fleet of 1000 tires) like this:
from scipy.stats import poisson

X = [0, 1, 2, 3, 4, 5]
print('n failed probability')
for x in X:
    PF = poisson.pmf(k=x, mu=CF_1_2 * 1000)
    print(x, ' ', PF)
'''
n failed probability
0 0.9998266407198806
1 0.00017334425253100934
2 1.502671996412269e-08
3 8.684157279833254e-13
4 3.764024409898102e-17
5 1.305170259061071e-21
'''
These numbers make sense because I generated the data from a Weibull distribution with a characteristic life (alpha) of 100 weeks, so we'd expect the probability of failure during week 3 to be very low.
If you have further questions, feel free to email me directly.
I am working on time-series classification problem using CNN. The dataset used is financial stock market data (like Yahoo Finance). I am using some technical indicators calculated using raw values high,low,volume,open,close.
One of the technical indicators is MACD (Moving Average Convergence Divergence), computed with the TA library. However, in most places it is written that MACD is calculated with n_fast = 12 and n_slow = 26 periods, with RSI (Relative Strength Index) calculated over 14 days and n_sign = 9 (a parameter of macd_diff() in the ta library).
So, if I am calculating RSI over a 5-day period, how should I set these n_fast and n_slow values to match? Should they be n_fast = 3 and n_slow = 8? Also, what should the value of n_sign be then? I am new to the finance domain.
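For reference, a minimal sketch of how these windows are passed to the ta library (recent versions use window_* keyword names, older ones n_fast/n_slow/n_sign; the scaled-down 3/8/3 values below are only an assumption, not an established convention):
import numpy as np
import pandas as pd
import ta

# dummy closing prices standing in for the Yahoo Finance data
close = pd.Series(100 + np.cumsum(np.random.randn(300)))

# default parameterisation: MACD(12, 26) with a 9-period signal, 14-period RSI
macd_hist = ta.trend.macd_diff(close, window_fast=12, window_slow=26, window_sign=9)
rsi_14 = ta.momentum.rsi(close, window=14)

# proportionally scaled-down variant to pair with a 5-period RSI (assumed values)
macd_hist_short = ta.trend.macd_diff(close, window_fast=3, window_slow=8, window_sign=3)
rsi_5 = ta.momentum.rsi(close, window=5)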
For a machine learning problem, I want to derive the hourly PV power of a specific system given various weather parameters, including hourly GHI and DHI, but no DNI. If I take one of the pvlib DNI estimation models, I always need the zenith angle. Since I only have hourly irradiance values, I cannot be very specific regarding the angle. Would you take an hourly average? There is always the problem that angles close to 90° result in extremely high DNI values.
So far I have tried to manually calculate hourly DNI = (GHI-DHI)/cos(zenith), taking the mean of the 5-minute-resolution zenith angles as the hourly zenith. Sunrise at the location is almost always before 7 am, so I should get some very small PV power in hour 6 of the day. However, because the mean angle is almost always over 90°, I get 0 kW AC power, or, for the few days when the mean angle is just below 90°, I get 40 kW AC power, which is the system's maximum as limited by the inverters, and in these early hours that is even more unrealistic.
ModelChain Parameters:
pvsys_ref=pvsyst
loc_ref=loc
orient_strat_ref=None
sky_mod_ref='ineichen'
transp_mod_ref='haydavies'
sol_pos_mod_ref='nrel_numpy'
airm_mod_ref='kastenyoung1989'
dc_mod_ref='cec'
ac_mod_ref=None
aoi_mod_ref='physical'
spec_mod_ref='no_loss'
temp_mod_ref='sapm'
loss_mod_ref='no_loss'
The required weather DataFrame consists of the hourly simulated ghi, dhi, temp, and wind speed, as well as the manually calculated dni.
Usually the midpoint of the hour is used to calculate the sun position/sun zenith, and for the sunset and sunrise hours, the midpoint of the period when the sun is above the horizon.
To calculate DNI from GHI and DHI, try using the function dni in pvlib.irradiance:
https://pvlib-python.readthedocs.io/en/latest/generated/pvlib.irradiance.dni.html#pvlib.irradiance.dni
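A minimal sketch of that approach (hour-midpoint sun positions plus pvlib.irradiance.dni; the timestamps, coordinates, and zero irradiance series below are placeholders, so substitute your own data):
import pandas as pd
import pvlib

# hourly timestamps, assumed to be labelled at the start of each hour
times = pd.date_range('2020-06-01', periods=24, freq='1h', tz='Europe/Berlin')
midpoints = times + pd.Timedelta(minutes=30)  # evaluate the sun position mid-hour

# placeholder site coordinates -- replace with the system's location
loc = pvlib.location.Location(latitude=48.1, longitude=11.6, tz='Europe/Berlin')
solpos = loc.get_solarposition(midpoints)

# ghi / dhi would be the measured hourly series; zeros here as placeholders
ghi = pd.Series(0.0, index=times)
dhi = pd.Series(0.0, index=times)

# pvlib.irradiance.dni applies (ghi - dhi) / cos(zenith) but discards values
# at very high zenith angles, which avoids the blow-up near 90 degrees
dni = pd.Series(
    pvlib.irradiance.dni(ghi.to_numpy(), dhi.to_numpy(), solpos['zenith'].to_numpy()),
    index=times,
)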
I'm a PhD student in sociology working on my dissertation. In the course of some data analysis, I have bumped up against the following problem.
I have a table of measured values of some variable over a series of years. The values count how many events of a certain type occurred in a given year. Here is a sample of what it looks like:
year var
1983 22
1984 55
1985 34
1986 29
1987 15
1988 20
1989 41
So, e.g., in 1984, 55 such events occurred over the whole year.
One way to represent this data over the domain of real numbers in [1983, 1990) is with a piecewise function f:
f(x) = var if floor(x) == year, for all x in [1983, 1990).
This function plots a series of horizontal lines of width 1, mapping a bar chart of the variable. The area under each of these lines is equal to the variable's value in that year. However, for this variable, I know that in each year, the rate is not constant over the whole year. In other words, the events don't suddenly jump from one yearly rate to another rate overnight on Dec 31, as the (discontinuous) function f seems to present. I don't know exactly how the rate changes, but I'd like to assume a smooth transition from year to year.
So, what I want is a function g which is both continuous and smooth (continuously differentiable) over the domain [1983, 1990), which also preserves the yearly totals. That is, the definite integral of g from 1984 to 1985 must still equal 55, and same for all other years. (So, for example, an n-degree polynomial which hits all the midpoints of the bars will NOT work.) Also, I'd like g to be a piecewise function, with all the pieces relatively simple -- quadratics would be best, or a sinusoid.
In sum: I want g to be a series of parabolas defined over each year, which smoothly transition from one to the other (left and right limits of g'(x) should be equal at the year boundaries), and where the area under each parabola is equal to the totals given by my data above.
I've drawn a crude version of what I want here. The cartoon uses the same data as above, with the black curve representing my hoped-for function, g. Toward the right end things got particularly bad, esp 1988 and 1989. But it's just meant to show a picture of what I would like to end up with.
Thanks for your help, or for pointing me towards other resources you think might be helpful!
PS I have looked at this paper which is linked inside this question. I agree with the authors (see section 4) that if I could replace my data with pseudodata d' using matrix A, from which I could very simply generate some sort of smooth function, that would be great, but they do not say how A could be obtained. Just some food for thought. Thanks again!
PPS What I need is a reliable method of generating g, given ANY data table as above. I actually have hundreds of these kinds of yearly count data, so I need a general solution.
You need the integral of your curve to go through a specific set of points, defined by the cumulative totals, so...
Interpolate between the cumulative totals to get an integral curve, and then
take the derivative of that to get the function you're looking for.
Since you want your function to be "continuous and smooth", i.e., C1-continuous, the integral curve you interpolate needs to be C2-continuous, i.e., it has to have continuous first and second derivatives. You can use polynomial interpolation, sinc interpolation, splines of sufficient degree, etc.
Using "natural" cubic splines to interpolate the integral will give you a piece-wise quadratic derivative that seems to satisfy all your requirements.
There's a pretty good description of the natural cubic splines here: http://mathworld.wolfram.com/CubicSpline.html
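A minimal sketch of that approach with SciPy, using the yearly counts from the question: CubicSpline with bc_type='natural' interpolates the cumulative totals, and its derivative is the piecewise-quadratic g.
import numpy as np
from scipy.interpolate import CubicSpline

years = np.array([1983, 1984, 1985, 1986, 1987, 1988, 1989])
counts = np.array([22, 55, 34, 29, 15, 20, 41])

# knots of the integral curve: cumulative total at the end of each year
knots_x = np.append(years, years[-1] + 1)   # 1983 .. 1990
knots_y = np.append(0, np.cumsum(counts))   # 0, 22, 77, ...

# natural cubic spline through the cumulative totals (C2-continuous)
integral = CubicSpline(knots_x, knots_y, bc_type='natural')

# its derivative is the C1-continuous, piecewise-quadratic density g
g = integral.derivative()

# check: integrating g over each year reproduces the original counts
print(integral(knots_x[1:]) - integral(knots_x[:-1]))   # [22. 55. 34. 29. 15. 20. 41.]
Because the spline interpolates the cumulative totals exactly, the yearly areas are preserved exactly as well.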
If your goal is to transform discrete data into a continuous representation, I would recommend looking up Kernel Density Estimation. KDE essentially models each data point as a (usually Gaussian) kernel and sums the kernels, resulting in a smooth continuous distribution. This blog does a very thorough treatment of KDE using the SciPy module.
One of the downsides of KDE is that it does not provide an analytic solution. If that is your goal, I would recommend looking up polynomial regression.
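For what it's worth, a minimal sketch of KDE on the yearly-count example above, treating the year midpoints as weighted sample locations (an approximation: unlike the spline approach, this does not preserve the exact yearly totals):
import numpy as np
from scipy.stats import gaussian_kde

years = np.array([1983, 1984, 1985, 1986, 1987, 1988, 1989])
counts = np.array([22, 55, 34, 29, 15, 20, 41])

# weight each year's midpoint by its count; gaussian_kde normalizes the result,
# so rescale by counts.sum() to get back to an events-per-year rate
kde = gaussian_kde(years + 0.5, weights=counts)

x = np.linspace(1983, 1990, 200)
rate = kde(x) * counts.sum()   # smooth estimate of events per year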