How to apply differential privacy to a list of data? - python

How can I apply differential privacy to a list of data?
OpenMined released a differential privacy project called PyDP 2 years ago.
In the examples provided, they show how to apply PyDP to data by computing statistical features such as the mean, max, and median.
Is there a way to apply differential privacy to the list of data itself and get a noisy list back, without computing any statistical feature yet?
e.g. input_list = [1.03,2.23,3.058,4.97]
out_put_differential_privacy_list = dp_function(input_list)
out_put_differential_privacy_list
>> [1.01,2.03,3.8,4.04]
How is the noise added to the data (they use the Laplace mechanism)?
Is the noise added taking into account the whole dataset, or is it added to each single value at a time?
I couldn't find the GitHub code for pydp.algorithms.laplacian.
These are the statistical features they showed how to compute.
from pydp.algorithms.laplacian import (
    BoundedSum,
    BoundedMean,
    BoundedStandardDeviation,
    Count,
    Max,
    Min,
    Median,
)
Are there also functions to compute differentially private percentiles?
Any other resources will also be welcome.

Here are my two cents on the question.
The idea of differential privacy is to publish aggregated information about sensitive values only if noise is added to the aggregate. This in turn makes it infeasible to match sensitive values to their owners, and ensures the published result does not depend too heavily on any particular record in the dataset.
The noise is added by injecting Laplace noise into each piece of data at a time, which in turn adds noise to the overall dataset. The essential idea of DP would be the following:
A(D,f) = f(D) + noise
A = some randomised algorithm
This is to ensure that the result each time will be slightly different.
f = the function being computed; its sensitivity is used to determine to what degree an individual piece of data can affect the output.
D = the dataset you want to 'mask', the overall thing; in your case it would be the list of numbers.
noise = Laplace noise, drawn with scale lambda = delta f / epsilon (= 1/epsilon when the sensitivity delta f is 1).
The epsilon value here indicates the privacy loss incurred by adding/removing an entry from the dataset, i.e. by making adjustments to the dataset. The smaller epsilon is, the less privacy is lost through such adjustments, which means better protection of privacy.
And as you can see, the noise depends only on the sensitivity and the epsilon value, and has nothing to do with the underlying dataset.
... they showed how to compute the PyDP on the data by computing some statistical features such as the mean, Max, Median.
Let's say, for example, we have a bunch of numbers like you have here. We could first find the max of the list, which would be 4.97, and then draw the noise eta from Lap(4.97/epsilon). I believe the idea is to anchor the noise scale to some statistical feature of the data.
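If you just want a noisy copy of the list itself (no aggregate yet), a minimal sketch of the per-value Laplace mechanism described above could look like the following. Note this is not PyDP itself; the sensitivity and epsilon values are illustrative assumptions you would have to choose for your data.

import numpy as np

def dp_function(values, epsilon=1.0, sensitivity=1.0):
    # Add independent Laplace noise with scale = sensitivity / epsilon
    # to every element (per-value perturbation, as described above).
    scale = sensitivity / epsilon
    return [v + np.random.laplace(loc=0.0, scale=scale) for v in values]

input_list = [1.03, 2.23, 3.058, 4.97]
print(dp_function(input_list, epsilon=1.0))   # noisy copy, different on every run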
Hope this is somewhat useful :)

Related

How to Make statistical tests in time series applications

I received feedback on my paper about stock market forecasting with machine learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance
of your methods. Hence 'differ significantly' in the original wording.
I agree that some of the figures look awesome visually, but visually,
random noise seems to contain patterns. I believe Sortino Ratio is the
appropriate statistic to test, and it can be tested by using
bootstrap. I.e., a distribution is obtained for both BH and your
strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk-forward, where I shift the data in time 11 times, generating 11 different combinations of training and test sets with no overlap. So, here are my questions:
1 - What would be the best (or most appropriate) statistical test to use, given what the reviewer is asking?
2 - If I remember correctly, statistical tests require vectors as input, is that correct? Can I generate a vector containing 11 Sortino ratios (one for each walk) and then compare them with baselines? Or should I run my code more than once? I am afraid the last option would be unfeasible given the short time to review.
So, what would be the correct actions to compare machine learning approaches statistically in this time series scenario?
By pointing out that random noise seems to contain patterns, the reviewer means that your plots show nice patterns, but those patterns might just be random noise following some distribution (e.g. uniform random noise), which makes the results less convincing. It might be a good idea to split the data into k groups randomly, then apply a Z-test or t-test and compare the k groups pairwise, or to bootstrap the statistic the reviewer suggests, as sketched below.
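A minimal sketch of the bootstrap comparison the reviewer describes (the return series below are random placeholders, and sortino_ratio is a simple illustrative implementation, not a vetted one):

import numpy as np

def sortino_ratio(returns, target=0.0):
    # Mean excess return divided by downside deviation (per-period, simplified).
    downside = np.minimum(returns - target, 0.0)
    return (returns.mean() - target) / np.sqrt(np.mean(downside ** 2))

def bootstrap_sortino(returns, n_boot=5000, seed=0):
    # Resample the return series with replacement and recompute the ratio.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(returns), size=(n_boot, len(returns)))
    return np.array([sortino_ratio(returns[i]) for i in idx])

strategy_returns = np.random.normal(0.0010, 0.01, 500)   # placeholder strategy returns
bh_returns = np.random.normal(0.0005, 0.01, 500)         # placeholder buy-and-hold returns

dist_strategy = bootstrap_sortino(strategy_returns)
dist_bh = bootstrap_sortino(bh_returns, seed=1)

# Overlap of the two bootstrap distributions: how often the strategy does not beat BH.
p_overlap = np.mean(dist_strategy - dist_bh <= 0)
print(p_overlap)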
The reviewer points to the Sortino ratio, which seems somewhat ambiguous here: since you are building a machine learning model for a forecasting task, what you actually care about is forecasting accuracy and reliability, which can be assessed with cross-validation (in convex optimization the analogue would be a sensitivity analysis).
Update
The problem of serial dependency in time series data arises when the series is non-stationary, which does not seem to be the case for your data. Even if it were, it could be addressed by removing trends, i.e. converting the non-stationary series into a stationary one (checking stationarity with an ADF test, for example, as sketched below), and you might also consider ARIMA models.
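For the ADF check mentioned above, a minimal sketch with statsmodels (the series here is an illustrative random walk, not your data):

import numpy as np
from statsmodels.tsa.stattools import adfuller

series = np.cumsum(np.random.randn(300))   # illustrative non-stationary random walk
adf_stat, p_value, *_ = adfuller(series)
print(adf_stat, p_value)                   # large p-value suggests non-stationarity
differenced = np.diff(series)              # first difference removes the stochastic trend
print(adfuller(differenced)[1])            # p-value should now be much smaller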
Time shifting can sometimes be useful, but it is not considered a good measurement of noise; it might, however, help improve model accuracy by shifting the data and extracting some features (e.g. mean and variance over a window size, etc.).
There is nothing preventing you from trying the time-shifting approach, but you can't rely on it as an accurate measurement, and you still need to support your statistical analysis with more robust techniques.

Need input: Linear Regression prediction of difficulty of routes quite bad

(Data: https://1drv.ms/u/s!ArLDiUd-U5dtg1H6y1_0f_m5f2by?e=OmKeWp)
I'm trying to predict the difficulty of a route. A route consists of a series of points, each 10 meters apart. Each point has the following information:
Path width
Forest density
Falling Velocity (What speed will your body reach in case of falling)
Slope
For each route there is also a given difficulty. But those difficulties were given by different persons and differ heavily. So one person gave a route a 4. But another one may have given this route a 2. So the data contains human errors.
What I did so far:
I calculated the mean and std for each route. So I took all points of one route and used them to calculate those statistical values. I also added the length of a route (number of points * 10).
(diff = difficulty of the route. Values from 1-12)
Then I took those values and put them into a linear regression model, which turned out to be a good start (a rough sketch of this setup follows the error metrics below):
Mean Absolute Error: 1.239902061226418
Mean Squared Error: 2.3566221702532917
Root Mean Squared Error: 1.53512936596669
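For reference, a rough sketch of the kind of setup described above (the column names and the synthetic feature table are illustrative assumptions, not the actual data):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# One row per route: aggregated statistics plus route length and difficulty label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mean_slope": rng.normal(20, 5, 200),
    "std_slope": rng.normal(5, 1, 200),
    "length": rng.integers(50, 5300, 200) * 10,
})
df["diff"] = np.clip(0.2 * df["mean_slope"] + rng.normal(0, 1, 200), 1, 12)

X_train, X_test, y_train, y_test = train_test_split(
    df[["mean_slope", "std_slope", "length"]], df["diff"], random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))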
Problem
But now I don't know what to do to improve that, since I'm lacking the knowledge in machine learning.
I had the idea of using a neural network and just putting in all the points. The longest route is 5300 points long, so I would use 5300 inputs per route and pad with 0 values for those routes that are not long enough.
Any info or input for something like that?
But I would also like to get a good result by using predictor values like shown above (mean, std and so on). So what can I do to improve the prediction?
Below are some of the steps you can follow to develop a better model (a sketch of a few of these checks follows below):
Check for outliers in the data and normalize the data.
Check the strength of the correlation between the independent and dependent variables.
Impute the missing values, or create a separate segment to handle the missing values in the data columns.
Look at the variance inflation factor and tolerance.
This will improve the data quality and improve the accuracy of the model.
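A minimal sketch of the normalization, correlation, and VIF checks mentioned above (the feature table and target are illustrative placeholders):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "mean_slope": rng.normal(20, 5, 200),
    "std_slope": rng.normal(5, 1, 200),
    "length": rng.normal(2000, 500, 200),
})
target = pd.Series(rng.normal(6, 2, 200), name="diff")   # placeholder difficulty

norm = (df - df.mean()) / df.std()          # z-score normalization
print(norm.corrwith(target))                # correlation of each feature with the target

# Variance inflation factor per feature (values well above ~5-10 suggest collinearity).
X = norm.values
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(dict(zip(norm.columns, vif)))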

Is there a form of lazy evaluation where a function (like mean) returns an approximate value when operating on arrays

For example, we want to calculate the mean of a list of numbers where the list is very long, and the numbers, when sorted, are nearly linear (or we can fit a linear regression model to the data). Mathematically, we can approximate the mean by
((arr[0] + arr[len(arr) - 1]) / 2) + intercept
Or, in the case where the linear model is nearly constant (slope coefficient nearly 1), we can calculate approximately:
mean(arr[::const]) ≈ mean(arr), i.e. the mean of a subsample of size n/const approximates the mean of the whole array.
The same concept applies to both cases, and it is quite basic; a tiny illustrative check follows below.
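As an illustrative check of the idea above, using synthetic data that is nearly linear once sorted:

import numpy as np

arr = np.sort(5.0 * np.random.rand(1_000_000))   # nearly linear when sorted
exact = arr.mean()
endpoint_estimate = (arr[0] + arr[-1]) / 2        # first/last-element shortcut
subsample_estimate = arr[::1000].mean()           # mean of every 1000th element
print(exact, endpoint_estimate, subsample_estimate)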
Is there a pattern, function (hopefully in Python), or any study you can suggest that could help? Such suggestions would be gratefully welcomed. Of course such a pattern, if it exists, should be general and not only for the mean case (ideally any function,
or at least aggregate functions like sum, mean, ...). As I don't have a strong mathematical background and I'm new to machine learning, please tolerate my ignorance.
Please let me know if anything is not clear.
The Law of Large Numbers states that as sample size increases, an average of a sample of observations converges to the true population average with probability 1.
Therefore, if your hypothetical array is too big to average, you could at the very least take the average of a large sample and know that you are close to the true population mean.
You can sample from a NumPy array using numpy.random.choice(arr, n), where arr is your array and n is as many elements as you wish (or are able) to sample.
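A minimal sketch of that sampling approach (the array here is just an illustrative placeholder):

import numpy as np

arr = np.random.rand(10_000_000)            # illustrative large array
sample = np.random.choice(arr, 100_000)     # random sample of 100,000 elements
print(sample.mean(), arr.mean())            # sample mean vs. full mean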
There are more general solutions to such jobs, like the Dask package, for example: http://dask.pydata.org/en/latest/
It can optimize computation graphs, parallelize computation, and much more.
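A minimal sketch of the lazy-evaluation style Dask offers (the chunk size here is an arbitrary illustrative choice):

import numpy as np
import dask.array as da

arr = np.random.rand(10_000_000)
lazy = da.from_array(arr, chunks=1_000_000)   # nothing is computed yet
mean_task = lazy.mean()                       # still lazy: just a task graph
print(mean_task.compute())                    # evaluation happens only here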

Predictions with ARIMA (python statsmodels)

I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
exog = np.column_stack([df_arima['log_var_diff_wk'],
                        df_arima['log_var_diff_mth'],
                        df_arima['log_var_diff_yr']])
model = ARIMA(df_arima['log_var'], exog=exog, order=(1, 0, 1))
results_ARIMA = model.fit()
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0 given that I have already taken the difference myself).
However, I then want to check what the predictions look like, and in order to do this I redefine exog to just be the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would just be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
                        df_arima_predict['log_val_diff_mth'],
                        df_arima_predict['log_val_diff_yr']])
arima_predict = results_ARIMA.predict(start=training_cut_date, end='2017-01-01',
                                      dynamic=False, exog=exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?
I have a similar problem at the moment which I have not entirely figured out yet. It seems that including multiple seasonal terms in Python is still a bit tricky. R does seem to have this capacity, see here. So, one suggestion I can give you is to try this with the more sophisticated functionality R provides for now (although that could require a large investment of time if you are not familiar with R yet).
Looking at your approach for modeling the seasonal patterns, taking the nth order difference scores does not give you seasonal constants, but rather some representation of the difference between the time points that you designate as seasonally related. If those differences are small, correcting for them might not have much impact on your modeling results. In such cases, model prediction might turn out fairly well. Conversely, if the differences are big, including them can easily distort prediction results. This could explain the variation you are seeing in your modeling results. Conceptually, then, what you'd want to do instead is represent the constants over time.
In the blog post referenced above, the author advocates the use of Fourier series to model the variance within each time period. Both the NumPy and SciPy packages offer routines for calculating the fast Fourier transform. However, as a non-mathematician I found it difficult to ascertain that the fast Fourier transform yielded the appropriate numbers.
In the end I opted to use the Welch signal decomposition from SciPy's signal module. What this does is return a spectral density analysis of your time series, from which you can deduce signal strength at various frequencies.
If you identify the peaks in the spectral density analysis which correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves representing the seasonal variations. You can then include these in your ARIMA as exogenous variables, much like the Fourier terms in the blog post.
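A minimal sketch of that idea with SciPy (the series and its weekly seasonality are illustrative; the resulting exog matrix would then be passed to the ARIMA fit as in the question):

import numpy as np
from scipy import signal

# Illustrative daily series with a weekly (period-7) seasonal component plus noise.
n = 730
t = np.arange(n)
y = 10 + 2 * np.sin(2 * np.pi * t / 7) + np.random.randn(n)

# Spectral density estimate; fs=1 means frequencies are in cycles per observation.
freqs, psd = signal.welch(y, fs=1.0, nperseg=256)

# Pick the dominant peak (skipping the zero-frequency bin) and build sine/cosine regressors.
f_peak = freqs[np.argmax(psd[1:]) + 1]
exog = np.column_stack([np.sin(2 * np.pi * f_peak * t),
                        np.cos(2 * np.pi * f_peak * t)])
print(1 / f_peak)   # recovered period, close to 7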
This is about as far as I have gotten myself at this point - right now I am trying to figure out whether I can get the statsmodels ARIMA process to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation specifies they should not represent trends, but hey, a guy can dream, right?). Edit: this blog post by Rob Hyndman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I'm not able to give you a solution that's proven to be effective within Python, but I hope this gives you some new ideas to control for that pesky seasonal variance.
TL;DR:
It seems Python is not very well suited to handling multiple seasonal terms right now; R might be a better solution (see reference);
Using difference scores to account for seasonal trends seems not to capture the constant variance associated with the recurrence of the season;
One way to do this in python could be to use Fourier series representing seasonal trends (also see reference), which can be obtained using, among other ways, a Welch signal decomposition. How to use these as exogenous variables in an ARIMA to good effect is an open question, though.
Best of luck,
Evert
p.s.: I'll update if I find a way to get this to work in Python

Using fourier analysis for time series prediction

For data that is known to have seasonal or daily patterns, I'd like to use Fourier analysis to make predictions. After running an FFT on the time series data, I obtain coefficients. How can I use these coefficients for prediction?
I believe the FFT assumes all the data it receives constitutes one period; then, if I simply regenerate the data using the IFFT, I am also regenerating the continuation of my function, so can I use these values for future values?
Simply put: I run the FFT for t = 0, 1, 2, ..., 10; then, using the IFFT on the coefficients, can I use the regenerated time series for t = 11, 12, ..., 20?
I'm aware that this question may no longer be relevant for you, but for others who are looking for answers I wrote a very simple example of Fourier extrapolation in Python: https://gist.github.com/tartakynov/83f3cd8f44208a1856ce
Before you run the script make sure that you have all dependencies installed (numpy, matplotlib). Feel free to experiment with it.
P.S. Locally Stationary Wavelets may be better than Fourier extrapolation. LSW is commonly used in predicting time series. The main disadvantage of Fourier extrapolation is that it just repeats your series with period N, where N is the length of your time series.
It sounds like you want a combination of extrapolation and denoising.
You say you want to repeat the observed data over multiple periods. Well, then just repeat the observed data. No need for Fourier analysis.
But you also want to find "patterns". I assume that means finding the dominant frequency components in the observed data. Then yes, take the Fourier transform, preserve the largest coefficients, and eliminate the rest.
import numpy as np
from scipy.fft import fft
X = fft(x)                               # frequency-domain coefficients
Y = np.zeros(len(X), dtype=complex)
Y[important_frequencies] = X[important_frequencies]   # important_frequencies: placeholder index array of bins to keep
As for periodic repetition: let z = [x, x], i.e., two periods of the signal x. Then Z[2k] = 2·X[k] for all k in {0, 1, ..., N-1}, and the odd-indexed entries are zero (the factor 2 comes from summing over twice as many samples).
Z = np.zeros(2 * len(X), dtype=complex)
Z[::2] = 2 * X                           # spectrum of the signal repeated over two periods
When you run an FFT on time series data, you transform it into the frequency domain. The coefficients multiply the terms in the series (sines and cosines or complex exponentials), each with a different frequency.
Extrapolation is always a dangerous thing, but you're welcome to try it. You're using past information to predict the future when you do this: "Predict tomorrow's weather by looking at today." Just be aware of the risks.
I'd recommend reading "Black Swan".
You can use the function that @tartakynov posted and, to avoid repeating exactly the same time series in the forecast (overfitting), you can add a new parameter to the function called n_param and fix a lower bound h for the amplitudes of the frequencies.
def fourierExtrapolation(x, n_predict, n_param):
Usually you will find that, in a signal, some frequencies have a significantly higher amplitude than others, so if you select those frequencies you will be able to isolate the periodic nature of the signal.
You can add these two lines, whose effect is determined by the chosen n_param:
h = np.sort(np.absolute(x_freqdom))[-n_param]   # n_param-th largest amplitude
x_freqdom = [x_freqdom[i] if np.absolute(x_freqdom[i]) >= h else 0 for i in range(len(x_freqdom))]
Just by adding this, you will be able to produce a nice and smooth forecast.
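For completeness, here is a self-contained sketch of this kind of amplitude-thresholded Fourier extrapolation (the function body is an illustrative reconstruction of the approach, not the exact gist code):

import numpy as np

def fourier_extrapolation(x, n_predict, n_param):
    # Fit and remove a linear trend, FFT the detrended series, keep only the
    # n_param strongest frequencies, then rebuild and extend the signal.
    n = len(x)
    t = np.arange(n)
    p = np.polyfit(t, x, 1)                      # linear trend
    x_freqdom = np.fft.fft(x - p[0] * t - p[1])
    freqs = np.fft.fftfreq(n)

    h = np.sort(np.absolute(x_freqdom))[-n_param]           # amplitude threshold
    x_freqdom = np.where(np.absolute(x_freqdom) >= h, x_freqdom, 0)

    t_ext = np.arange(n + n_predict)
    restored = np.zeros(len(t_ext))
    for i in range(n):
        amp = np.absolute(x_freqdom[i]) / n
        phase = np.angle(x_freqdom[i])
        restored += amp * np.cos(2 * np.pi * freqs[i] * t_ext + phase)
    return restored + p[0] * t_ext + p[1]        # add the trend back

# Example: extend a noisy periodic signal 50 steps into the future.
x = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * np.random.randn(200)
forecast = fourier_extrapolation(x, n_predict=50, n_param=10)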
Another useful article about FFT:
forecast FFT in R
