I am having some issues implementing the Mutual Information Function that Python's machine learning libraries provide, in particular :
sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None)
(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html)
I am trying to implement the example I find in the Stanford NLP tutorial site:
The site is found here : http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2
The problem is I keep getting different results, without figuring out the reason yet.
I understand the concept of mutual information and feature selection; I just don't understand how it is implemented in Python. What I do is provide the mutual_info_score method with two arrays based on the NLP site example, but it outputs different results. Another curious thing is that however you play around and change the numbers in those arrays, you are still likely to get the same result. Am I supposed to use a different data structure specific to Python, or what is the issue here? If anyone has used this function successfully in the past, it would be a great help to me. Thank you for your time.
I encountered the same issue today. After a few trials I found the real reason: you take log2 if you strictly follow the NLP tutorial, but sklearn.metrics.mutual_info_score uses the natural logarithm (base e, Euler's number). I didn't find this detail in the sklearn documentation...
I verified this by:
import numpy as np

def computeMI(x, y):
    sum_mi = 0.0
    x_value_list = np.unique(x)
    y_value_list = np.unique(y)
    Px = np.array([len(x[x == xval]) / float(len(x)) for xval in x_value_list])  # P(x)
    Py = np.array([len(y[y == yval]) / float(len(y)) for yval in y_value_list])  # P(y)
    for i in range(len(x_value_list)):
        if Px[i] == 0.:
            continue
        sy = y[x == x_value_list[i]]
        if len(sy) == 0:
            continue
        pxy = np.array([len(sy[sy == yval]) / float(len(y)) for yval in y_value_list])  # P(x,y)
        t = pxy[Py > 0.] / Py[Py > 0.] / Px[i]  # P(x,y) / (P(x) * P(y))
        sum_mi += sum(pxy[t > 0] * np.log2(t[t > 0]))  # sum of P(x,y) * log(P(x,y) / (P(x) * P(y)))
    return sum_mi
If you change this np.log2 to np.log, I think it would give you the same answer as sklearn. The only difference is that when this method returns 0, sklearn will return a number very close to 0. (And of course, use sklearn if you don't care about the log base; my piece of code is just a demo and performs poorly...)
FYI: 1) sklearn.metrics.mutual_info_score accepts lists as well as np.array; 2) sklearn.metrics.cluster.entropy also uses the natural log, not log2.
Edit: as for "same result", I'm not sure what you really mean. In general, the values in the vectors don't really matter; it is the "distribution" of values that matters. You care about P(X=x), P(Y=y) and P(X=x, Y=y), not the values x and y themselves.
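For instance, relabeling the values does not change the score; a quick sketch (the label values here are arbitrary):
from sklearn.metrics import mutual_info_score

a = [0, 0, 1, 1]
b = [5, 5, 9, 9]  # same partition as a, just with different label values

# both calls give the same mutual information, because the joint
# distribution of the two labelings is identical
print(mutual_info_score(a, a))
print(mutual_info_score(a, b))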
The code below should produce this result: 0.00011053558610110256
c=np.concatenate([np.ones(49), np.zeros(27652), np.ones(141), np.zeros(774106) ])
t=np.concatenate([np.ones(49), np.ones(27652), np.zeros(141), np.zeros(774106)])
computeMI(c,t)
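For comparison, a sketch using the same vectors: sklearn reports the same quantity in nats, so multiplying the log2-based result above by ln(2) should match it closely.
from sklearn.metrics import mutual_info_score

# computeMI uses log2 (bits); mutual_info_score uses the natural log (nats),
# so the two should agree after converting: bits * ln(2) = nats
print(mutual_info_score(c, t))
print(computeMI(c, t) * np.log(2))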
I have a very large dataset comprised of data for several dozen samples, and several hundred subsamples within each sample. I need to get the mean, standard deviation, confidence intervals, etc. However, I'm running into a (suspected) massive performance problem that causes the code to never finish executing. I'll begin by explaining what my actual code does (I'm not sure how much of the actual code I can share, as it is part of an active research project. I hope to open-source it, but that will depend on the IP rules in the agreement), and then I'll share some code that replicates the problem and should hopefully allow somebody a bit more well-versed in Vaex to tell me what I'm doing wrong!
My code currently calls the unique() method on my large Vaex dataframe to get a list of samples, and loops through that list of unique samples. On each loop, it uses the sample number to make an expression representing that sample (so: df[df["sample"] == i]) and uses unique() on that subset to get a list of subsamples. Then, it uses another for-loop to repeat that process, creating an expression for the subsample and getting the statistical results for that subsample. This isn't the exact code but, in concept, it works like the code block below:
means = {}
list_of_samples = df["sample"].unique()
for sample_number in list_of_samples:
    sample = df[ df["sample"] == sample_number ]
    list_of_subsamples = sample["subsample"].unique()
    means[sample_number] = {}
    for subsample_number in list_of_subsamples:
        subsample = sample[ sample["subsample"] == subsample_number ]
        means[sample_number][subsample_number] = subsample["value"].mean()
If I try to run this code, it hangs on the line means[sample_number][subsample_number] = subsample["value"].mean() and never completes (not within around an hour, at least), so something is clearly wrong there. To try and diagnose the issue, I have tested the mean function by itself, and in expressions without the looping and other stuff. If I run:
mean = df["value"].mean()
it successfully gives me the mean of the entire "value" column within about 45 seconds. However, if instead I run:
sample = df[ df["sample"] == 1 ]
subsample = sample[ sample["subsample"] == 1 ]
mean = subsample["value"].mean()
The program just hangs. I've left it for an hour and still not gotten a result!
How can I fix this, and what am I doing wrong, so I can avoid this mistake in the future? If my reading of some discussions regarding Vaex is correct, I think I might be able to fix this using Vaex "selections", but I've tried to read the documentation on those and can't wrap my head around how I would properly use them here. Any help from a more experienced Vaex user would be greatly appreciated!
Edit: In case anyone finds this in the future, I was able to fix it by using the groupby method. I'm still really curious what was going wrong here, but I'll have to wait until I have more time to investigate it.
Looping can be slow, especially if you have many groups; it's more efficient to rely on the built-in grouping:
import vaex
df = vaex.example()
df.groupby(by='id', agg="mean")
# for more complex cases, can use by=['sample', 'sub_sample']
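Applied to the dataframe from the question, a sketch might look like this (assuming the "sample", "subsample", and "value" columns from the question's code):
# one pass over the data instead of a nested Python loop;
# the result is a new dataframe with one row per (sample, subsample) pair
stats = df.groupby(by=["sample", "subsample"],
                   agg={"value_mean": vaex.agg.mean("value")})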
I am currently using FiPy but am still relatively new to the nuances of the package. While I have been able to regenerate the desired heatmap for the mesh20x20 diffusion example from the examples folder using the command line, I have struggled to replicate it within the Spyder IDE. I am using Python version 3.8. It is simple enough to generate it from the command line using the "examples" folder; however, when I attempt to "re-program" it myself, I end up with iterations of a discrete, dichromatic image instead. I am hoping to be able to regenerate the smooth color transition from the examples folder, as opposed to the discrete dichromatic option that I have been limited to at present. I believe there are some issues with the viewer in some capacity. Some related issues may have cropped up in the past for others, potentially related to colorbar reformatting, though I have not yet been able to implement those workarounds to generate the desired imagery. Setting datamin and datamax in Viewer() did not work.
I would be greatly indebted for any assistance the community could provide.
from fipy.terms.transientTerm import TransientTerm
from fipy.terms.implicitDiffusionTerm import ImplicitDiffusionTerm
from fipy.terms.explicitDiffusionTerm import ExplicitDiffusionTerm
from fipy.meshes.nonUniformGrid2D import NonUniformGrid2D
from fipy.variables.cellVariable import CellVariable
from fipy.viewers.matplotlibViewer.matplotlib2DViewer import Matplotlib2DViewer
####
#Global Inputs
D=1
steps=10
#Dimensional Inputs
nx=20
dx=1
ny=20
dy=1
L=dx*nx
#Temporal Inputs
#nt=20
#dt=1
#cell variable initial values
value=0
#construct mesh from dimensional pts
mesh=NonUniformGrid2D(nx=nx, dx=dx, ny=ny, dy=dy)
#construct term variables phi with name, mesh design
phi=CellVariable(name="solutionvariable", mesh=mesh, value=0)
#construct boundary conditions
#dirichlet ---> we get an automatic application of neumann to top right and bottom left
valueTopLeft=0
valueBottomRight=1
#assign boundary conditions to a face or cell
X, Y=mesh.faceCenters
facesTopLeft=((mesh.facesLeft & (Y > L/2 )) | (mesh.facesTop &( X < L/2)))
facesBottomRight=((mesh.facesRight & (Y < L/2)) | (mesh.facesBottom & (X > L/2)))
#constrain variables
phi.constrain(valueTopLeft, facesTopLeft)
phi.constrain(valueBottomRight, facesBottomRight)
#equation construction
eq=TransientTerm()==ExplicitDiffusionTerm(coeff=D)
#equation solving and either viewing and/or extraction
timestepduration=0.9 *(dx**2)/(2*D)
for step in range(steps):
    eq.solve(var=phi, dt=timestepduration)
    print(phi[step])
viewer=Matplotlib2DViewer(vars=phi, datamin=0, datamax=1)
viewer.axes.set_title("Solutionvbl(Step %d)" % (step+1,))
Figured it out, I think. I was using ExplicitDiffusionTerm while the example uses ImplicitDiffusionTerm. When I first tried that, all I got back was a blank monochromatic image (and zeros for phi[step] at the end). I am happy to report that once a "kickstart" value is provided as the CellVariable's initial value (I used 0.001), used in conjunction with ImplicitDiffusionTerm, and the timestepduration is increased from its explicit-scheme limit of 0.9*dx**2/(2*D) to the 9*dx**2/(2*D) used in the example documentation, it more or less adheres to the image generated when run from the command line. Grateful to have this sorted. Hope this helps anyone else who runs into a similar problem.
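A minimal sketch of the changes described above (it assumes the rest of the script from the question stays the same; the 0.001 starting value and the larger timestep factor are taken from the description, not from the original example):
phi = CellVariable(name="solutionvariable", mesh=mesh, value=0.001)  # small "kickstart" initial value
eq = TransientTerm() == ImplicitDiffusionTerm(coeff=D)               # implicit instead of explicit diffusion
timestepduration = 9 * (dx**2) / (2 * D)                             # larger timestep than the explicit limit
for step in range(steps):
    eq.solve(var=phi, dt=timestepduration)
viewer = Matplotlib2DViewer(vars=phi, datamin=0., datamax=1.)
viewer.plot()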
I am trying to translate some Julia code to Python. This is code for a multinomial distribution, and I am stuck on the last part of it. I don't know how to write it in Python, and I would like to know whether there is a package that will do what I want. I am not sure I can do this using scipy.stats, because the documentation seems rather limited.
Here you will find a part of the Julia code where I'm stuck:
x1s is an array, x is also an array, and OrthoNNDist is the name of a struct:
Base.length(d::OrthoNNDist) = length(d.x0)
Distributions.rand(d::OrthoNNDist) = rand(d.x1s)
Distributions.pdf(d::OrthoNNDist, x::Vector) = x in d.x1s ? d.prob : 0.0
Distributions.pdf(d::OrthoNNDist) = fill(d.prob, size(d.x1s))
Distributions.logpdf(d::OrthoNNDist, x::Vector) = log(pdf(d, x))
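One possible way to mirror these method definitions in plain Python/NumPy (a sketch only; it assumes the struct's fields are x0, x1s and prob, as referenced in the Julia code, and that x1s is a list of vectors):
import numpy as np

class OrthoNNDist:
    def __init__(self, x0, x1s, prob):
        self.x0 = x0      # reference vector, only used for the length
        self.x1s = x1s    # list of admissible vectors
        self.prob = prob  # shared probability of each admissible vector

    def __len__(self):
        return len(self.x0)

    def rand(self):
        # draw one of the admissible vectors uniformly at random
        return self.x1s[np.random.randint(len(self.x1s))]

    def pdf(self, x=None):
        if x is None:
            # no argument: vector of probabilities, like fill(d.prob, size(d.x1s))
            return np.full(len(self.x1s), self.prob)
        return self.prob if any(np.array_equal(x, v) for v in self.x1s) else 0.0

    def logpdf(self, x):
        return np.log(self.pdf(x))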
I'm processing some experimental data in Python 3. The data (raw_data in my code) is pretty noisy:
One of my goals is to find the peaks, and for this I'd like to filter out the noise. Based on what I found in the documentation of SciPy's signal module, the theory of filtering seems to be really complicated, and unfortunately I have zero background. Of course I'll have to learn it sooner or later - and I intend to - but right now the payoff isn't worth the time (and learning filter theory isn't the purpose of my work), so I shamefully copied the code in Lyken Syu's answer without really understanding the background:
import numpy as np
from scipy import signal as sg
from matplotlib import pyplot as plt
# [...] code, resulting in this:
raw_data = [arr_of_xvalues, arr_of_yvalues] # xvalues are in decreasing order
# <magic beyond my understanding>
n = 20 # the larger n is, the smoother the curve will be
b = [1.0 / n] * n
a = 2
filt = sg.lfilter(b, a, raw_data)
filtered = sg.lfilter(b, a, filt)
# <\magic>
plt.plot(filtered[0], filtered[1], ".")
plt.show()
It kind of works:
What concerns me is the curve the filter adds from 0 to the beginning of my dataset. I guess it's a property of the IIR filter I used, but I don't know how to prevent it. Also, I haven't been able to make other filters work so far. I need to use this code on other, similar experimental results, so I need a somewhat more general solution than e.g. cutting out all points with y < 10.
Is there a better (possibly simpler) way, or choice of filter that is easy to implement without serious theoretical background?
How, if at all, could I prevent my filter from adding that curve to my data?
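For what it's worth, a Savitzky-Golay filter is one commonly suggested smoother that avoids this kind of startup transient; a minimal sketch (the window length and polynomial order are arbitrary starting values and would need tuning for the real data):
from scipy.signal import savgol_filter

# smooth only the y-values; the window must be an odd number of samples
smoothed_y = savgol_filter(raw_data[1], window_length=21, polyorder=3)
plt.plot(raw_data[0], smoothed_y, ".")
plt.show()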
With this code:
X = numpy.array(range(0,5))
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
I get
tuple index out of range
self.n_features = obs[0].shape[1]
So what exactly are you supposed to pass to .fit()? The hidden states AND emissions in a tuple? If so, in what order? The documentation is less than helpful.
I noticed it likes being passed tuples as this does not give an error:
X = numpy.column_stack([range(0,5),range(0,5)])
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
Edit:
Let me clarify a bit. The documentation indicates that the shape of each array must be:
List of array-like observation sequences (shape (n_i, n_features)).
This would almost indicate that you pass a tuple for each sample that indicates in a binary fashion which observations are present. However, their example indicates otherwise:
# pack diff and volume for training
X = np.column_stack([diff, volume])
hence the confusion
It would appear the GaussianHMM class is for multivariate-emission-only HMM problems, hence the requirement to have more than one emission vector. When the documentation refers to 'n_features', it is not referring to the number of ways emissions can express themselves, but to the number of orthogonal emission vectors.
Hence, "features" (the orthogonal emission vectors) are not to be confused with "symbols" which, in sklearn's parlance (which is likely shared with the greater hmm community for all I know), refer to what actual unique values the system is capable of emitting.
For univariate emission-vector problems, use MultinomialHMM.
Hope that clarifies things for anyone else who wants to use this stuff without becoming the world's foremost authority on HMMs :)
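To illustrate, a sketch only: it assumes the old sklearn.hmm-style API that this thread uses (which matches the fit([X]) call and error message above), reuses the diff/volume names from the documentation example quoted in the question, and fills them with random placeholder data.
import numpy as np
from sklearn.hmm import GaussianHMM  # old sklearn location; newer code uses hmmlearn.hmm

diff = np.random.randn(100)          # first emission vector
volume = np.random.rand(100)         # second emission vector
X = np.column_stack([diff, volume])  # shape (100, 2) -> n_features = 2

model = GaussianHMM(n_components=3, covariance_type='full', n_iter=1000)
model.fit([X])  # a list of observation sequences, each of shape (n_i, n_features)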
I realize this is an old thread, but the problem in the example code is still there. I believe the example is now at this link and is still giving the same error:
tuple index out of range
self.n_features = obs[0].shape[1]
The offending line of code is:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit(X)
Which should be:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit([X])