I am using the Python version of the Shogun Toolbox.
I want to use the LinearTimeMMD, which accepts data under the streaming interface CStreamingFeatures. I have the data in the form of two RealFeatures objects: feat_p and feat_q. These work just fine with the QuadraticTimeMMD.
In order to use it with the LinearTimeMMD, I need to create StreamingFeatures objects from these; as far as I know, these would be StreamingRealFeatures in this case.
My first approach was using this:
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
This, however, does not seem to work: the LinearTimeMMD emits warnings and an unrealistic result (it grows steadily with the number of samples), and calling gen_p.get_dim_feature_space() returns -1. Also, calling gen_p.get_streamed_features(100) results in a segmentation fault.
I tried another approach using StreamingFileFromFeatures:
streamFile_p = sg.StreamingFileFromRealFeatures()
streamFile_p.set_features(feat_p)
streamFile_q = sg.StreamingFileFromRealFeatures()
streamFile_q.set_features(feat_q)
gen_p = StreamingRealFeatures(streamFile_p, False, 100)
gen_q = StreamingRealFeatures(streamFile_q, False, 100)
But this results in the same situation with the same described problems.
It seems that in both cases, the contents of the RealFeatures object handed to the StreamingRealFeatures object cannot be accessed.
What am I doing wrong?
EDIT: I was asked for a minimal example that reproduces the error:
import os
SHOGUN_DATA_DIR = os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import laplace, norm

def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):
    # sample from both distributions
    X = norm.rvs(size=n) * np.sqrt(sigma2) + mu
    Y = laplace.rvs(size=n, loc=mu, scale=b)
    return X, Y

# Main script
mu = 0.0
sigma2 = 1
b = np.sqrt(0.5)
n = 220
X, Y = sample_gaussian_vs_laplace(n, mu, sigma2, b)

# turn data into Shogun representation (column vectors)
feat_p = sg.RealFeatures(X.reshape(1, len(X)))
feat_q = sg.RealFeatures(Y.reshape(1, len(Y)))

gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)

print("Dimensions: ", gen_p.get_dim_feature_space())
print("Number of features: ", gen_p.get_num_features())
print("Number of vectors: ", gen_p.get_num_vectors())

test_features = gen_p.get_streamed_features(1)
print("success")
EDIT 2: The output of the example above:
Dimensions: -1
Number of features: -1
Number of vectors: 1
Segmentation fault (core dumped)
EDIT 3: Additional code using the LinearTimeMMD with the RealFeatures directly.
mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(1000)
mmd.set_num_samples_q(1000)
alpha = 0.05
# Code taken from notebook example on
# http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html
# Location on page: In[16]
block_size=100
mmd.set_num_blocks_per_burst(block_size)
# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("MMD_l[X,Y]^2=%.2f" % statistic)
EDIT 4: Additional code sample showing the growing MMD problem:
import os
SHOGUN_DATA_DIR = os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt

def mmd(n):
    X = [(1.0, i) for i in range(n)]
    Y = [(2.0, i) for i in range(n)]
    X = np.array(X)
    Y = np.array(Y)

    # turn data into Shogun representation (column vectors)
    feat_p = sg.RealFeatures(X.reshape(2, len(X)))
    feat_q = sg.RealFeatures(Y.reshape(2, len(Y)))

    mmd = sg.LinearTimeMMD()
    kernel = sg.GaussianKernel(10, 1)
    mmd.set_kernel(kernel)
    mmd.set_p(feat_p)
    mmd.set_q(feat_q)
    mmd.set_num_samples_p(100)
    mmd.set_num_samples_q(100)

    alpha = 0.05
    block_size = 100
    mmd.set_num_blocks_per_burst(block_size)

    # compute an unbiased estimate in linear time
    statistic = mmd.compute_statistic()

    print("N =", n)
    print("MMD_l[X,Y]^2=%.2f" % statistic)
    print()

for n in [1000, 10000, 15000, 20000, 25000, 30000]:
    mmd(n)
Output:
N = 1000
MMD_l[X,Y]^2=-12.69
N = 10000
MMD_l[X,Y]^2=-40.14
N = 15000
MMD_l[X,Y]^2=-49.16
N = 20000
MMD_l[X,Y]^2=-56.77
N = 25000
MMD_l[X,Y]^2=-63.47
N = 30000
MMD_l[X,Y]^2=-69.52
For some reason, the Python environment on my machine is broken, so I couldn't provide a Python snippet. Instead, let me point to a working example in C++ that addresses the issues (https://gist.github.com/lambday/983830beb0afeb38b9447fd91a143e67).
I think the easiest way is to create a StreamingRealFeatures instance directly from a RealFeatures instance (as you tried the first time). Check the test1() and test2() methods in the gist, which show the equivalence of using RealFeatures and StreamingRealFeatures in this use case. The reason you were getting weird results when streaming directly is that the streaming process has to be started by calling the start_parser method of the StreamingRealFeatures class. We handle these technicalities internally inside the MMD classes, but when using the class directly you need to invoke it yourself (see the test3() method in the gist, and the sketch below).
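In Python, that fix might look something like the following minimal sketch (untested here, based on the gist's test3(); the count of 100 is just illustrative):

gen_p = StreamingRealFeatures(feat_p)
gen_p.start_parser()                         # must be called before streaming starts
streamed = gen_p.get_streamed_features(100)  # should no longer crash
gen_p.end_parser()                           # stop the parser when done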
Please note that the compute_statistic() method doesn't return the MMD directly, but rather \frac{n_x \times n_y}{n_x + n_y} \times \mathrm{MMD}^2 (as mentioned in the doc http://shogun.ml/api/latest/classshogun_1_1CMMD.html). With that in mind, the results you are getting for varying numbers of samples may well make sense.
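If you want the plain MMD^2 estimate back, you can undo that scaling yourself. A hypothetical post-processing sketch, using the sample counts configured in your EDIT 3 code:

n_p, n_q = 1000, 1000  # the values passed to set_num_samples_p/q
mmd_squared = statistic * (n_p + n_q) / (n_p * n_q)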
Hope it helps.
I'm trying to estimate impulse response functions for a -1 standard-deviation shock in a 3-dimensional VAR using statsmodels.tsa, but I'm currently having trouble setting the shock magnitude.
This gives me the IRFs for a 1 s.d. shock, the default:
import numpy as np
import statsmodels.tsa as sm
model = sm.vector_ar.var_model.VAR(endog = data)
fitted = model.fit()
shock= -1*fitted.sigma_u
irf = sm.vector_ar.irf.IRAnalysis(model = fitted)
The IRAnalysis function takes an argument P, an upper-diagonal matrix that sets the shocks; I found this by looking at the source code. However, passing P as shown below doesn't seem to do anything.
irf = statsmodels.tsa.vector_ar.irf.IRAnalysis(model = fitted, P = -np.linalg.cholesky(model.fitted_U))
I would really appreciate some help.
Thanks in advance.
I have had the same question and finally found something that works on my end.
Instead of using IRAnalysis explicitly, I found that transforming the VAR model into its MA representation was the best way to adjust the size of the shock.
from statsmodels.tsa.vector_ar.irf import IRAnalysis

T = 20  # illustrative horizon; set this to however many periods you need
J = fitted.ma_rep(T)
J = shock * np.array(J)
This gives you the IRF output for T periods.
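If you also want the shock orthogonalized rather than applied to the raw MA coefficients, here is a minimal sketch under the same assumptions (fitted is the fitted VAR results object; T = 20 is again just an illustrative horizon):

import numpy as np

T = 20                                  # illustrative horizon
ma = np.array(fitted.ma_rep(T))         # shape (T+1, k, k): MA coefficient matrices
P = np.linalg.cholesky(fitted.sigma_u)  # lower-triangular Cholesky factor of the residual covariance
irfs = np.array([phi @ (-P) for phi in ma])  # responses to a -1 s.d. orthogonalized shock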
I also wanted standard error bands on my plots, so I used the corresponding function for those as well.
G, H = fitted.irf_errband_mc(orth=False, repl=1000, steps=T, signif=0.05, seed=None, burn=100, cum=False)
Hope this helps
I'm working on fitting muon lifetime data to a curve to extract the mean lifetime, using the lmfit module. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculate the uncertainty in each bin as the square root of its counts (it's an exponential model), and then use lmfit to determine the best fit along with the parameter values and uncertainties. However, graphing the output of the model.fit() method shows a fit (the red line in the linked fit-result graph) that is obviously not correct.
I've looked online and can't find a solution to this; I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model

class data():
    def __init__(self, file_name):
        times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ', unpack=False)[:, 0])
        self.times = []
        for i in range(len(times_dirty)):
            if times_dirty[i] < 40000:
                self.times.append(times_dirty[i])
        self.counts = []
        self.binBounds = []
        self.uncertainties = []
        self.means = []

    def binData(self, k):
        self.counts, self.binBounds = np.histogram(self.times, bins=k)
        self.binBounds = self.binBounds[:-1]

    def calcStats(self):
        if len(self.counts) == 0:
            print('Run binData function first')
        else:
            self.uncertainties = sqrt(self.counts)

    def plotData(self, fit):
        plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
        plt.plot(self.binBounds, fit.init_fit, 'k--')
        plt.plot(self.binBounds, fit.best_fit, 'r')
        plt.show()

def decay(t, N, lamb, B):
    return N * lamb * exp(-lamb * t) + B

def main():
    # raw string so the backslashes in the Windows path aren't treated as escapes
    muonEvents = data(r'C:\Users\Colt\Downloads\muon.data')
    muonEvents.binData(10)
    muonEvents.calcStats()

    mod = Model(decay)
    result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B=1)

    muonEvents.plotData(result)
    print(result.fit_report())
    print(len(muonEvents.times))

if __name__ == "__main__":
    main()
This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.
Just to build on James Phillips' answer: I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1 and t > 100. So, if the algorithm starts at lamb = 1 and varies it by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest starting with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100 (see the sketch below).
As James suggested, keeping the variable values on the order of 1 and putting in scale factors as necessary is often helpful in getting numerically stable solutions.
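For instance, a minimal sketch of the fit call from the question with those suggested starting values (the numbers are illustrative guesses for this data's scale, not fitted results):

# same model and data as in the question; only the starting values change
result = mod.fit(muonEvents.counts, t=muonEvents.binBounds,
                 N=1.e6, lamb=1.e-4, B=100)
print(result.fit_report())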
I wish to perform a Fourier transform of the function 'stress' from 0 to infinity and extract the real and imaginary parts. I have the following code that does this using a numerical integration technique:
import numpy as np
from scipy.integrate import trapz
import fileinput
import sys, string

# stress: the input data array (defined elsewhere)
window = 200000  # length of the array I wish to transform (number of data points)
time = np.linspace(1, window, window)
freq = np.logspace(-5, 2, window)
output = [0] * len(freq)

for index, f in enumerate(freq):
    visco = trapz(stress * np.exp(-1j * f * time), time)
    soln = visco * (1j * f)
    output[index] = soln

print('f storage loss')
for i in range(len(freq)):
    print(freq[i], output[i].real, output[i].imag)
This gives me a nice transformation of my input data.
Now I have an array of size 2x10^6, and the above technique is not feasible (the computation time scales as O(N^2)), so I have turned to the built-in fft function in numpy.
There aren't many arguments you can specify to change this function's behavior, so I'm finding it difficult to customize it to my needs.
So far I have
import numpy as np
import fileinput
import sys, string

np.set_printoptions(threshold=np.inf)  # print full arrays

N = len(stress)
fvi = np.fft.fft(stress, n=N)
gprime = fvi.real
gdoubleprime = fvi.imag
for i in range(len(stress)):
    print(gprime[i], gdoubleprime[i])
And it's not giving me accurate results.
The DFT in numpy has the form A_k = \sum_{m=0}^{n-1} a_m \exp(-2\pi i\, m k / n) (http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.fft.html). How can I change it to the form I used in my first code, i.e. np.exp(-1j*f*time) (where f is the frequency and time is the time array, both already defined)? Or is there some post-processing of the data that I have to do?
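For reference, a sketch of the textbook correspondence, assuming the samples in stress are uniformly spaced in time: the integral is approximated by dt times the DFT, but only on the DFT's own frequency grid, not on an arbitrary log-spaced one.

import numpy as np

# stress and time as in the first snippet; uniform spacing assumed
dt = time[1] - time[0]
fvi = np.fft.fft(stress) * dt                      # Riemann-sum approximation of the integral
w = 2 * np.pi * np.fft.fftfreq(len(stress), d=dt)  # angular frequency of each DFT bin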
Thanks in advance for all your help.
I'm trying to generate random variables according to a certain ugly distribution, in Python. I have an explicit expression for the PMF, but it involves some products which make it unpleasant to obtain and invert the CDF (see the code below for the explicit form of the PMF).
In essence, I'm trying to define a random variable in Python by its PMF and then have built-in code do the hard work of sampling from the distribution. I know how to do this if the support of the RV is finite, but here the support is countably infinite.
The code I am currently trying to run as per #askewchan's advice below is:
import scipy as sp
import scipy.stats  # needed so that sp.stats resolves
import numpy as np

class x_gen(sp.stats.rv_discrete):
    def _pmf(self, k, param):
        num = np.arange(1 + param, k + param, 1)
        denom = np.arange(3 + 2 * param, k + 3 + 2 * param, 1)
        p = (2 + param) * (np.prod(num) / np.prod(denom))
        return p

pa_limit = x_gen()
print(pa_limit.rvs(alpha, n=1))
However, running this returns the following error:
File "limiting_sim.py", line 42, in _pmf
num = np.arange(1+param, k+param, 1)
TypeError: only length-1 arrays can be converted to Python scalars
Basically, it seems that the np.arange() call isn't working inside the _pmf() function, and I'm at a loss to see why. Can anyone enlighten me here and/or point out a fix?
EDIT 1: cleared up some questions by askewchan, edits reflected above.
EDIT 2: askewchan suggested an interesting approximation using the factorial function, but I'm looking more for an exact solution such as the one that I'm trying to get work with np.arange.
You should be able to subclass rv_discrete like so:
from scipy.stats import rv_discrete

class mydist_gen(rv_discrete):
    def _pmf(self, n, param):
        return yourpmf(n, param)
Then you can create a distribution instance with:
mydist = mydist_gen()
And generate samples with:
mydist.rvs(param, size=1000)
Or you can then create a frozen distribution object with:
mydistp = mydist(param)
And finally generate samples with:
mydistp.rvs(1000)
With your example, this should work, since factorial automatically broadcasts over arrays; that is the crucial difference, because rv_discrete calls _pmf with k as an array, which is why your np.arange version raised the TypeError. But it might fail for large enough alpha, since the factorials overflow float64:
import scipy as sp
import scipy.stats  # needed so that sp.stats resolves
import numpy as np
from scipy.misc import factorial

class limitrv_gen(sp.stats.rv_discrete):
    def _pmf(self, k, alpha):
        # num = np.prod(np.arange(1 + alpha, k + alpha))
        num = factorial(k + alpha - 1) / factorial(alpha)
        # denom = np.prod(np.arange(3 + 2*alpha, k + 3 + 2*alpha))
        denom = factorial(k + 2 + 2 * alpha) / factorial(2 + 2 * alpha)
        return (2 + alpha) * num / denom

pa_limit = limitrv_gen()
alpha = 100
pa_limit.rvs(alpha, size=10)
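As a quick sanity check (my own sketch, not part of the original answer), you can verify that the pmf approximately sums to 1. Keep alpha and the range small, since float64 factorials overflow beyond roughly 170!:

ks = np.arange(1, 160)            # stay below the factorial overflow threshold
print(pa_limit.pmf(ks, 1).sum())  # should print a value very close to 1.0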
I am using scipy.sparse.linalg.cg to solve a large, sparse linear system, and it works fine, except that I would like to add a progress report, so that I can monitor the residual as the solver works. I've managed to set up a callback, but I can't figure out how to access the current residual from inside the callback. Calculating the residual myself is possible, of course, but that is a rather heavy operation, which I'd like to avoid. Have I missed something, or is there no efficient way of getting at the residual?
The callback is only passed xk, the current solution vector, so you don't have direct access to the residual. However, the source code shows that resid is a local variable in the cg function.
So, with CPython, it is possible to use the inspect module to peek at the local variables in the caller's frame:
import inspect
import random

import numpy as np
import scipy as sp
import scipy.sparse as sparse
import scipy.sparse.linalg as splinalg

def report(xk):
    # peek into the caller's (cg's) frame and read its local variable `resid`
    frame = inspect.currentframe().f_back
    print(frame.f_locals['resid'])

N = 200
A = sparse.lil_matrix((N, N))
for _ in range(N):
    A[random.randint(0, N - 1), random.randint(0, N - 1)] = random.randint(1, 100)
b = np.random.randint(0, N - 1, size=N)

x, info = splinalg.cg(A, b, callback=report)