Need input: Linear regression prediction of route difficulty is quite bad - python

(Data: https://1drv.ms/u/s!ArLDiUd-U5dtg1H6y1_0f_m5f2by?e=OmKeWp)
I'm trying to predict the difficulty of a route. A route consists of a series of points, each 10 meters apart. Each point has the following information:
Path width
Forest density
Falling Velocity (What speed will your body reach in case of falling)
Slope
For each route there is also a given difficulty. But those difficulties were assigned by different people and vary heavily: one person gave a route a 4, while another may have given the same route a 2. So the data contains human error.
What I did so far:
I calculated the mean and std for each route: I took all points of one route and used them to compute those statistics. I also added the length of the route (number of points * 10).
(diff = difficulty of the route. Values from 1-12)
Then I fed those values into a linear regression model, which turned out to be a good start:
Mean Absolute Error: 1.239902061226418
Mean Squared Error: 2.3566221702532917
Root Mean Squared Error: 1.53512936596669
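For reference, here is a minimal sketch of the pipeline I described (file and column names are placeholders, not from the actual data):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical layout: one row per point, plus a per-route difficulty table.
points = pd.read_csv("points.csv")   # route_id, width, density, fall_velocity, slope
routes = pd.read_csv("routes.csv")   # route_id, diff

# Per-route mean/std of every point feature, plus route length in meters.
feats = points.groupby("route_id").agg(["mean", "std"])
feats.columns = ["_".join(c) for c in feats.columns]
feats["length_m"] = points.groupby("route_id").size() * 10

data = feats.join(routes.set_index("route_id"))
X, y = data.drop(columns="diff"), data["diff"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)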
Problem
But now I don't know what to do to improve on that, since I'm lacking knowledge in machine learning.
I had the idea of using a neural network and just feeding in all the points. The longest route is 5300 points long, so I would use 5300 inputs per route and zero-pad the routes that are shorter.
Any info or input on something like that?
But I would also like to get a good result using aggregate predictors like those shown above (mean, std, and so on). So what can I do to improve the prediction?

Below are some of the steps you can follow to develop a better model:
Check for outliers in the data and normalize the data.
Check the strength of the correlation between the independent and dependent variables.
Impute the missing values, or create a separate segment to handle the missing values in the data columns.
Look at the variance inflation factor and tolerance.
This will improve the data quality and improve the accuracy of the model.
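A rough sketch of those steps in Python (file and column names are placeholders, and the thresholds are just common rules of thumb):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("route_features.csv")  # placeholder: numeric per-route features plus "diff"

# Outliers: drop rows with any value more than 3 standard deviations from the mean.
z = (df - df.mean()) / df.std()
df = df[(z.abs() < 3).all(axis=1)]

# Missing values: simple mean imputation (or model them as a separate segment).
df = df.fillna(df.mean())

# Correlation of each predictor with the target.
print(df.corr()["diff"].sort_values())

# Normalize the predictors.
X = df.drop(columns="diff")
X_scaled = StandardScaler().fit_transform(X)

# Variance inflation factor per predictor (values above ~5-10 suggest collinearity).
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X_scaled, i))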

Related

How to apply differential privacy on list of data?

OpenMined released a differential privacy project called PyDP two years ago.
In the examples provided, they show how to use PyDP to compute statistical features of the data, such as the mean, max, and median.
Is there a way to apply differential privacy to a list of data and get a noised list back, without computing any statistical feature first?
e.g. input_list = [1.03,2.23,3.058,4.97]
out_put_differential_privacy_list = dp_function(input_list)
out_put_differential_privacy_list
>> [1.01,2.03,3.8,4.04]
How is the noise added to the data (they use Laplace noise)?
Is the noise added taking the whole data set into account, or is it added to each single value at a time?
I couldn't find the GitHub code for pydp.algorithms.laplacian.
These are the statistical features they showed how to compute.
from pydp.algorithms.laplacian import (
    BoundedSum,
    BoundedMean,
    BoundedStandardDeviation,
    Count,
    Max,
    Min,
    Median,
)
Are there also functions to compute differentially private percentiles?
Any other resources will also be welcome.
Here are my two cents on the question,
The idea of differential privacy is to publish aggregated information about sensitive values only if noise is added to the aggregated info. This in turn makes it infeasible to match sensitive values to their owners, and keeps the result from depending too strongly on any particular record in the dataset.
The way noise is added in differential privacy is by injecting Laplace noise into each piece of data at a time, which in turn adds noise to the overall result. The essential idea of DP is the following:
A(D, f) = f(D) + noise
A = some randomized algorithm. This ensures that the result is slightly different each time.
f = the function to be computed; its sensitivity determines to what degree an individual piece of data can affect the output.
D = the dataset you want to 'mask', the overall thing; in your case it would be the list of numbers.
noise = Laplace noise, i.e. with scale lambda = delta f / epsilon = 1/epsilon (when the sensitivity delta f is 1).
The epsilon value indicates the privacy loss from adding/removing an entry in the dataset, i.e. from making adjustments to the dataset. The smaller epsilon is, the smaller the privacy loss from such adjustments, which means better privacy protection.
And as you can see, the noise depends only on the sensitivity and the epsilon value, and has nothing to do with the underlying dataset.
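To illustrate that formula, here is the textbook Laplace mechanism in plain NumPy (this is not PyDP's internal code; note it noises each value individually rather than an aggregate):

import numpy as np

def laplace_mechanism(values, sensitivity, epsilon):
    # Scale lambda = sensitivity / epsilon; one independent draw per value.
    scale = sensitivity / epsilon
    return values + np.random.laplace(loc=0.0, scale=scale, size=len(values))

input_list = np.array([1.03, 2.23, 3.058, 4.97])
print(laplace_mechanism(input_list, sensitivity=1.0, epsilon=0.5))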
... they showed how to compute the PyDP on the data by computing some statistical features such as the mean, Max, Median.
Let's say, for example, we have a bunch of numbers like you have here. We could first find the max of the list, which would be 4.97, and then draw the noise eta from Lap(4.97/epsilon). I believe the idea is to anchor the noise scale to some statistical feature of the data.
Hope this is somewhat useful :)

Implementing ARIMA(X) model in TensorFlow

I'm currently scratching my head about how I might implement a classic ARIMA(X) model using base TensorFlow (and optionally Keras). The equation I am attempting to set up has the following (standard ARIMAX) form:
y'_t = c + phi_1 y'_{t-1} + ... + phi_p y'_{t-p} + beta^T x_t + theta_1 eps_{t-1} + ... + theta_q eps_{t-q} + eps_t
where y' is the d-times differenced series: d represents the level of differencing applied to the observed input time series, p is the auto-regressive order, and q is the moving-average order. The part which is currently stumping me is the calculation/estimation of the residuals eps. The auto-regression portion is a simple linear regression on the lagged samples, and the same is true for the terms involving the exogenous series (X). When estimating the residuals, should I simply feed the q previous steps through the current estimated parameters and compute the residuals as y_true - y_predict? This also begs the question: how does one estimate the residuals for observations that have no previous observations? Do we simply draw residuals 0 through q from a chosen distribution with set variance (e.g. normal, Poisson, etc.) and a mean of 0?
I have looked at the source for the statsmodels package to try to understand it, but it is quite opaque. Part of the reason for implementing the model this way is that it needs to fit into a fairly standard ecosystem at the company I work for, and we need control over what slices of data the model is fitted to at a given time step. This is because some data may arrive (much) later than the time stamp it relates to, due to lag at the source etc.
Thank you for any help you might be able to offer.

Pattern prediction in time series

Has anyone tried to predict a specific pattern in time series data?
Example: at a specific time, there is a huge upward spike in certain variables of a time series...
How would I build a model to predict that spike the next time it occurs?
Please do respond if anyone is working in this area.
I tried converting that particular series of data into a NumPy array and feeding it into the model, but it's not working.
Here is what the data looks like:
This data was generated in a controlled manner so that the spikes occur close together. In the actual case they could be random, and our main objective is to catch this pattern and keep a count.
Das, you could try implementing LSTM-based neural network models.
See:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
It is still preferable that the data contains a trend. If the upward spike happens around the same point of a recurring time interval, you are more likely to get a good prediction result.
In the image you shared, there seems to be a trend in the data, so LSTM models can extract the pattern fairly efficiently and output a prediction.
Statistical modelling of the data can also provide good results.
See: https://orangematter.solarwinds.com/2019/12/15/holt-winters-forecasting-simplified/
Das, if outputting the total number of peaks is the sole requirement, then I think heavy neural network models are a bit of an overkill. Neural network models can do the job pretty well too, but they require a lot of input data for training and for fine-tuning the weights and biases to give a really good result.
How about trying a thresholding-based technique, where you increment a counter every time the data value crosses a preset threshold? In such an approach you should make sure to group very nearby peaks together so that they are counted only once; you could set a threshold on the x axis too, as sketched after this example.
For instance, with respect to the given plot, let the y-threshold be 4. Considering the y-axis threshold (y value 4) alone gives a count of 5, because at the x value 15:48.2 there are two peaks that cross y value 4. If you also set a threshold on the x axis, these nearby peaks get grouped together within the preset limit, and the final count becomes 4 (which is the requirement).
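That threshold-plus-grouping idea is essentially what scipy.signal.find_peaks provides out of the box. A sketch (the threshold values follow the example above, not your real data):

import numpy as np
from scipy.signal import find_peaks

signal = np.loadtxt("series.txt")  # placeholder for the spiky time series

# height = y-axis threshold; distance = minimum gap (in samples) between
# counted peaks, so nearby spikes crossing the threshold are grouped as one.
peaks, _ = find_peaks(signal, height=4, distance=10)
print("number of spikes:", len(peaks))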

NEAT algorithm result precision

I am a PhD student trying to use the NEAT algorithm as a controller for a robot, and I am having some accuracy issues with it. I am working with Python 2.7 and am using two NEAT Python implementations:
The NEAT which is in this GitHub repository: https://github.com/CodeReclaimers/neat-python
Searching on Google, it looks like it has been used in some projects with success.
The multiNEAT library developed by Peter Chervenski and Shane Ryan: http://www.multineat.com/index.html.
It appears on the "official" NEAT software catalog web page.
While testing the first one, I found that my program converges quickly to a solution, but that solution is not precise enough. By lack of precision I mean a deviation of at least 3-5% in the median and average relative to the "perfect" solution at the end of the evolution (depending on the complexity of the problem, an error of around 10% is normal for my solutions; furthermore, I have "never" seen an error below 1% between the solution given by NEAT and the correct one). I must say that I've tried a lot of different parameter combinations and configurations (this is an old problem for me).
Because of that, I tested the second library. The MultiNEAT library converges quickly and more easily than the previous one (I assume that is due to the C++ implementation instead of the pure Python one). I get similar results, but I still have the same problem: lack of accuracy. This second library has different configuration parameters too, and I haven't found a combination of them that improves performance on the problem.
My question is:
Is it normal to have this lack of accuracy in the NEAT results? It achieves good solutions, but not good enough for controlling a robot arm, which is what I want to use it for.
I'll write what I am doing in case someone sees some conceptual or technical mistake in the way I set out my problem:
To simplify the problem, I'll show a very simple example: I want a NN that can compute the following function: y = x^2 (similar results occur with y = x^3, y = x^2 + x^3, or similar functions).
The steps that I follow to develop the program are:
"Y" are the inputs to the network and "X" the outputs. The
activation functions of the neural net are sigmoid functions.
I create a data set of "n" samples given values to "X" between the
xmin = 0.0 and the xmax = 10.0
As I am using sigmoid functions, I make a normalization of the "Y"
and "X" values:
"Y" is normalized linearly between (Ymin, Ymax) and (-2.0, 2.0) (input range of sigmoid).
"X" is normalized linearly between (Xmin, Xmax) and (0.0, 1.0) (the output range of sigmoid).
After creating the data set, I subdivide it into a training sample (70% of the total), a validation sample, and a test sample (15% each).
At this point, I create a population of individuals for the evolution. Each individual of the population is evaluated on all the training samples. Each position is evaluated as:
eval_pos = xmax - abs(xtarget - xobtained)
and the fitness of the individual is the average value over all the training positions (I've tried the minimum too, but it gives worse performance).
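In neat-python terms, my evaluation looks roughly like this (a sketch; train_set and xmax are assumed to be defined as described above):

import neat

def eval_genomes(genomes, config):
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        scores = []
        for y_in, x_target in train_set:  # normalized (input, target) pairs
            x_obtained = net.activate([y_in])[0]
            scores.append(xmax - abs(x_target - x_obtained))
        genome.fitness = sum(scores) / len(scores)  # average over train positions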
After the whole evaluation, I test the best individual obtained against the test sample, and this is where I obtain those "imprecise" values. Moreover, during the evaluation process, the maximum value, where abs(xtarget - xobtained) = 0, is never reached.
Furthermore, I assume that the way I manipulate the data is right, because using the same data set to train a neural network in Keras gives me much better results than NEAT (an error of less than 1% is achievable after 1000 epochs with a layer of 5 neurons).
At this point, I would like to know whether what is happening is normal, because I shouldn't be developing the controller from a fixed data set; it must be learned "online", and NEAT looks like a suitable solution for my problem.
Thanks in advance.
EDITED POST:
Firstly, thanks for your comment, nick.
I'll answer your questions below:
I am using the NEAT algorithm.
Yes, I've carried out experiments increasing the number of individuals in the population and the number of generations. A typical graph that I get looks like this:
Although the population size in this example is not that big, I've obtained similar results in experiments increasing the number of individuals or the number of generations (populations of 500 individuals over 500 generations, for example). In those experiments, the algorithm converges quickly to a solution, but once there, the best solution gets stuck and does not improve any more.
As I mentioned in my previous post, I've tried several experiments with many different parameter configurations, and the graphs are more or less similar to the one shown above.
Furthermore, two other experiments that I've tried were the following: once the evolution reaches the point where the maximum value and the median converge, I generate another population based on that genome with new configuration parameters, where:
The mutation parameters change to a high probability of mutation (weight and neuron probability), in order to find new solutions and "jump" from the current genome to a better one.
The neuron mutation is reduced to 0, while the weight mutation probability increases and weights mutate within a lower range, in order to make slight modifications and fine-tune the weights (trying to get functionality "similar" to backprop by making slight changes to the weights).
These two experiments didn't work as I expected, and the best genome of the new population was the same as in the previous population.
I am sorry, but I do not understand very well what you mean by "applying your own weighted penalties and rewards in your fitness function". What do you mean by including weight penalties in the fitness function?
Regards!
Disclaimer: I have contributed to these libraries.
Have you tried increasing the population size to speed up the search and increasing the number of generations? I use it for a trading task, and by increasing the population size my champions were found much sooner.
Another thing to think about is applying your own weighted penalties and rewards in your fitness function, so that anything that doesn't get very close right away is "killed off" sooner and the correct genome is found faster. It should be noted that NEAT learns through a fitness function, as opposed to gradient descent, so it won't converge in the same way, and it's possible you may have to train a bit longer.
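For instance, a shaped fitness along these lines (just a sketch; the cutoff value is arbitrary):

def shaped_fitness(errors, cutoff=0.5):
    # Penalize genomes whose worst-case error exceeds the cutoff so they are
    # "killed off" early; otherwise reward a small average error.
    worst = max(errors)
    if worst > cutoff:
        return -worst
    return 1.0 - sum(errors) / len(errors)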
Last question: are you using the NEAT or HyperNEAT algorithm from MultiNEAT?

what does "observation offset" and "predicted state mean" mean in pykalman standard filtercorrect module?

I am using the pykalman module, from which I imported a function named _filter_correct in pykalman.standard in order to correct my forecast data.
There are parameters of the function that I do not understand, even after trying to understand what is behind it.
They say that:
observation_offset is the offset for the observation at time t.
predicted_state_mean is the mean of the state at time t given observations from times [0...t-1].
I understood the other parameters.
By the way, I am forecasting a univariate parameter (wind speed) 7 days and 6 hours ahead, which I am concatenating!
Can you help me with the parameters of this function?
As far as I understand, observation_offset is really the mean of the measurement noise. The Kalman filter requires the noise to be a zero-mean Gaussian signal, where an offset would break the prediction; the library eliminates the offset by itself, so it asks you to provide the value beforehand. (But I don't think that is strictly necessary, because I guess it calculates it anyway; providing it beforehand just helps in the first iterations. I'm not sure, though, because the documentation is very poor. Another clue for this hypothesis is that NULL is assigned as the default value, so NULL should not be a problem.)
For example, if your thermometer always gives the result as real_temperature + 2 + zero_mean_gaussian_noise, then your observation_offset must be 2 to bring the noise back to zero mean. Here you can find more detailed information.
I again assume (because I couldn't find anything on predicted_state_mean so far through googling) that predicted_state_mean is, as the name suggests, the predicted state value. The Kalman filter has two phases: prediction and update. In the prediction phase, based on your system dynamics and knowledge of the current state, you predict the next state.
For example, if you have a car at point 0 moving at 5 units/second, then after 2 seconds your car will be at point 10. Kalman first predicts this information, then compares the predicted and observed (your sensor data) results in the update step.
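To make both parameters concrete, here is the linear-Gaussian model they come from, in plain NumPy with generic matrix names (not pykalman's internal code; all values are placeholders):

import numpy as np

# Observation model: z_t = C x_t + d + noise, where d is the observation_offset.
A = np.eye(2)                # state transition matrix (placeholder values)
b = np.zeros(2)              # transition offset
C = np.array([[1.0, 0.0]])   # observation matrix
d = np.array([2.0])          # observation_offset, e.g. the thermometer's +2 bias

x_prev = np.array([20.0, 0.1])          # state at t-1 given observations [0...t-1]
predicted_state_mean = A @ x_prev + b   # prediction phase: mean of state at time t
expected_observation = C @ predicted_state_mean + d  # compared against the sensor reading in the update step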
