Excuse my ignorance, I'm very new to Python. I'm trying to perform factor analysis in Python using MDP (though I can use another library if there's a better solution).
I have an m by n matrix (called matrix) and I tried to do:
import mdp
mdp.nodes.FANode()(matrix)
but I get back an error. I'm guessing maybe my matrix isn't formed properly? My goal is to find out how many components are in the data and which rows load onto which components.
Here is the traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mdp/signal_node.py", line 630, in __call__
return self.execute(x, *args, **kwargs)
File "mdp/signal_node.py", line 611, in execute
self._pre_execution_checks(x)
File "mdp/signal_node.py", line 480, in _pre_execution_checks
self.train(x)
File "mdp/signal_node.py", line 571, in train
self._check_input(x)
File "mdp/signal_node.py", line 429, in _check_input
if not x.ndim == 2:
AttributeError: 'list' object has no attribute 'ndim'
Does anyone have any idea what's going on, and feel like explaining it to a Python newbie?
I have absolutely no experience with MDP, but it looks like it expects your matrix to be passed as a NumPy array instead of a list. NumPy is a package for high-performance scientific computing. You can go to the NumPy home page and install it. After doing so, try altering your code to this:
import mdp, numpy
mdp.nodes.FANode()(numpy.array(matrix))
As Stephen said, the data must be a NumPy array. More precisely, it must be a 2D array, with the first index representing the different samples and the second index representing the data dimensions (using the wrong order here can lead to the "singular matrix" error).
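For illustration, here is a minimal sketch with made-up random data; the shapes and the fa.A attribute reflect my reading of the MDP docs, so double-check them there:

import mdp
import numpy as np

# Hypothetical data: 50 samples (rows) by 10 observed variables (columns).
matrix = np.random.random((50, 10))

fa = mdp.nodes.FANode()
projected = fa(matrix)  # calling the node trains it, then executes it

print projected.shape  # latent variables, one row per sample
# If I read the MDP docs right, the estimated loadings are then in fa.A.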
You should also take a look at the MDP documentation, which should answer all your questions. If that doesn't help there is the MDP user mailing list.
I have been trying to use TieDIE. In a few words, this software includes an algorithm that finds significant subnetworks when you pass it some query nodes and a network. It works just fine with smaller networks, but the network I am interested in is quite big: it has 21988 nodes and 360474 edges. TieDIE generates an initial network kernel using scipy (Matlab is also an option to generate this kernel, but I do not own a license). During the generation of this kernel I get the following error:
Not enough memory to perform factorization.
Traceback (most recent call last):
File "Trials.py", line 44, in <module>
diffuser = SciPYKernel(network_path)
File "lib/kernel_scipy.py", line 83, in __init__
self.kernel = expm(time_T*L)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 602, in expm
return _expm(A, use_exact_onenorm='auto')
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 665, in _expm
X = _solve_P_Q(U, V, structure=structure)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 699, in _solve_P_Q
return spsolve(Q, P)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 198, in spsolve
Afactsolve = factorized(A)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 440, in factorized
return splu(A).solve
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 309, in splu
ilu=False, options=_options)
MemoryError
The most interesting thing about this is that I am using a cluster computer with 64 CPUs and 700GB of RAM, and according to ps monitoring the software peaks at 1.3% of memory usage (~10GB) at some point of execution and crashes later. I have been told that there is no limit on the usage of RAM... So I really have no clue about what could be happening.
Maybe someone here could help me find an alternative to scipy, or a way to solve this.
Is it possible that the memory error comes from only one node being used? If that is the case, how could I distribute the work across the nodes?
Thanks in advance.
That's right, for a very large network like that you'll need high memory on a single node. The easiest solution is of course a workaround, either:
(1) Is there any way you can reduce the size of your input network while still capturing the relevant biology? Maybe just look at all the nodes within 2 steps of your input nodes (see the sketch after this list)?
(2) Use the new Cytoscape API to do the diffusion for you: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005598 (https://github.com/idekerlab/heat-diffusion)
(3) Use PageRank instead of computing a heat kernel (not ideal, as we've shown that Diffusion tends to work better on biological networks).
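For option (1), here is a rough sketch of how the 2-step neighbourhood could be extracted with networkx; the file name, separator and query genes are made-up placeholders, since I don't know your exact input format:

import networkx as nx

# Placeholder input format: one tab-separated edge per line.
G = nx.Graph()
with open('network.txt') as f:
    for line in f:
        fields = line.strip().split('\t')
        G.add_edge(fields[0], fields[-1])

query_nodes = ['TP53', 'EGFR']  # made-up example query set

# Keep every node within 2 steps of any query node.
keep = set()
for n in query_nodes:
    if n in G:
        keep.update(nx.ego_graph(G, n, radius=2).nodes())

sub = G.subgraph(keep)
print len(sub), sub.number_of_edges()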
Hope this helps!
-Evan Paull (TieDIE developer/lead author)
I want to load hyperspectral data pixel by pixel into an array, calculate something with the spectral information of each pixel, and write the pixel out again, using Python 3.5.
I have tried two different ways and neither works the way I want.
First of all, I updated the spectral package, since the previous version was stated not to work with calling envi.save_image iteratively, but my approach still does not work.
Second, I know that neither of my approaches is very elegant with the double for loop.
I would appreciate any help with this problem.
1st:
myfile = open_image('input.hdr')
for i in range(0, myfile.shape[0]):
    for j in range(0, myfile.shape[1]):
        mypixel = myfile.read_pixel(i, j)
        envi.save_image('output.hdr', mypixel, dtype=np.int16)
The 1st example does not save the image; instead it gives me this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "/usr/local/lib/python3.5/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "/dtc/Python/Masking.py", line 132, in <module>
envi.save_image('test.hdr', mypixel, dtype=np.int16)#, metadata=myfile.metadata)
File "/usr/local/lib/python3.5/site-packages/spectral/io/envi.py", line 415, in save_image
data, metadata = _prepared_data_and_metadata(hdr_file, image, **kwargs)
File "/usr/local/lib/python3.5/site-packages/spectral/io/envi.py", line 568, in _prepared_data_and_metadata
add_image_info_to_metadata(image, metadata)
File "/usr/local/lib/python3.5/site-packages/spectral/io/envi.py", line 613, in add_image_info_to_metadata
metadata['samples'] = image.shape[1]
IndexError: tuple index out of range
2nd:
myfile = open_image('input.hdr')
envi.create_image('test.hdr', ext='.bip', interleave='bip', dtype='h', force=True, metadata=myfile.metadata)
open('test.bip', 'w').close()  # empties the created file
file = open('test.bip', 'ab')  # opens the created file for appending the new bands
for i in range(0, myfile.shape[0]):
    for j in range(0, myfile.shape[1]):
        mypixel = myfile.read_pixel(i, j)
        file.write(mypixel)
file.close()
myfile.close()
The second example saves the image, but stores the pixels in a different order and messes up my image.
So this is the very short, fast and easy solution, thanks to a colleague.
myfile = envi.open('input.hdr')  # open the image to do math with it
imageArray = 10000 * myfile[:, :, :]  # do some math with it
# The factor 10000 is needed because the input data are scaled to {0..10000}
# and during processing get scaled to {0..1}.
# To avoid zeros in the int16 output, scaling back to {0..10000} is necessary.
envi.save_image('test.hdr', imageArray, dtype=np.int16, metadata=myfile.metadata, force=True)
I have to say in advance that I am not familiar with the spectral package and ENVI, and therefore unfortunately cannot offer a ready-to-use solution. Besides, I am not sure I correctly understood what you are trying to do with your image.
But just some thoughts: could the write/save call inside the for loop cause your problem, because every pixel is treated in exactly the same way and the output gets overwritten? I cannot relate that to the IndexError, though.
Maybe you need a function that writes a single pixel into an empty image at a given position i, j. A second option could be to collect every pixel in an array and save it to an image at once after the for loop, as sketched below.
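A minimal sketch of that second option, reusing the calls from the question (I have not verified this against the spectral API, so treat the shape assumptions as exactly that):

import numpy as np
from spectral import open_image
from spectral.io import envi

myfile = open_image('input.hdr')
rows, cols, bands = myfile.shape  # assumes (rows, cols, bands) ordering

# Collect all pixels into one array first ...
cube = np.empty((rows, cols, bands), dtype=np.int16)
for i in range(rows):
    for j in range(cols):
        cube[i, j, :] = myfile.read_pixel(i, j)

# ... then save the whole image once, outside the loop.
envi.save_image('output.hdr', cube, dtype=np.int16, metadata=myfile.metadata, force=True)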
I am trying to create a PETSc matrix from an already existing CSC matrix. With this in mind I created the following example code:
import numpy as np
import scipy.sparse as sp
import math
from petsc4py import PETSc

n = 100
A = sp.csc_matrix((n, n), dtype=np.complex128)
print A.shape
A[1:5, :] = 1 + 1j*5*math.pi
p1 = A.indptr
p2 = A.indices
p3 = A.data
petsc_mat = PETSc.Mat().createAIJ(size=A.shape, csr=(p1, p2, p3))
This works perfectly well as long as the matrix consists only of real values. When the matrix is complex, running this piece of code results in a
TypeError: Cannot cast array data from dtype('complex128') to dtype('float64') according to the rule 'safe'.
I tried to figure out where the error occurs exactly, but could not make much sense of the traceback:
petsc_mat = PETSc.Mat().createAIJ(size=A.shape,csr=(p1,p2,p3))
File "Mat.pyx", line 265, in petsc4py.PETSc.Mat.createAIJ (src/petsc4py.PETSc.c:98970)
File "petscmat.pxi", line 662, in petsc4py.PETSc.Mat_AllocAIJ (src/petsc4py.PETSc.c:24264)
File "petscmat.pxi", line 633, in petsc4py.PETSc.Mat_AllocAIJ_CSR (src/petsc4py.PETSc.c:23858)
File "arraynpy.pxi", line 136, in petsc4py.PETSc.iarray_s (src/petsc4py.PETSc.c:8048)
File "arraynpy.pxi", line 117, in petsc4py.PETSc.iarray (src/petsc4py.PETSc.c:7771)
Is there an efficient way of creating a PETSc matrix (from which I want to retrieve some eigenpairs later) from a complex scipy CSC matrix?
I would be really happy if you guys could help me find my (hopefully not too obvious) mistake.
I had trouble getting PETSc to work, so I configured it more than once, and in the last run I obviously forgot the option --with-scalar-type=complex.
This is what I should have done:
Either check the log file $PETSC_DIR/arch-linux2-c-opt/conf/configure.log.
Or take a look at the reconfigure-arch-linux2-c-opt.py script.
There you can find all the options you used to configure PETSc. In case you use SLEPc as well, you need to recompile it too. Once I added the option --with-scalar-type=complex to the reconfigure script and ran it, everything worked perfectly fine.
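As a quick sanity check, petsc4py exposes the scalar type the library was built with, so you can verify the configuration from Python before anything else. A small sketch; PETSc.ScalarType should be a complex NumPy type only for a complex build:

import numpy as np
from petsc4py import PETSc

# ScalarType mirrors PetscScalar; for a build configured with
# --with-scalar-type=complex it should be a complex floating type.
print PETSc.ScalarType
print np.issubdtype(PETSc.ScalarType, np.complexfloating)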
I have a dataset which has 7265 samples and 132 features.
I want to use the meanshift algorithm from scikit learn but I ran into this error:
Traceback (most recent call last):
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
labels, centers = getClusters(data,clusters)
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
ms.fit(np.array(dataarray))
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
cluster_all=self.cluster_all)
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
return self._fit(X)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
My code:
dataarray = np.array(data)
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
If I check the datatype of the data variable I see:
print isinstance(dataarray, np.ndarray)
>>> True
The bandwidth is 0.925538333061 and dataarray.dtype is float64.
I'm using scikit learn 0.14.1
I can cluster with other algorithms in scikit-learn (I tried KMeans and DBSCAN). What am I doing wrong?
EDIT:
The data can be found here:
(pickle format) : http://ojtwist.be/datatocluster.p
and : http://ojtwist.be/datatocluster.npz
That's a bug in the scikit-learn project. It is documented here.
There is a float -> int casting during the fitting process that can crash in some cases (by making the seed points be placed at the corner of the bins instead of at the center). There is some code in the link to fix the problem.
If you don't want to dig into the scikit-learn code (and want to keep your code compatible with other machines), I suggest you normalize your data before passing it to MeanShift.
Try this:
>>> from sklearn import preprocessing
>>> data2 = preprocessing.scale(dataarray)
And then use data2 in your code.
It worked for me.
If you don't want to do either, it is a great opportunity to contribute to the project by making a pull request with the fix :)
Edit: you probably want to retain the information needed to "descale" the results of MeanShift afterwards. So use a StandardScaler object instead of the scale function, as in the sketch below.
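A minimal sketch of that variant, using the variable names from the question (the quantile and the other parameters are just carried over, not tuned):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift, estimate_bandwidth

scaler = StandardScaler()
data2 = scaler.fit_transform(dataarray)  # remembers per-feature mean and std

bandwidth = estimate_bandwidth(data2, quantile=0.2, n_samples=len(data2))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(data2)

# Map the cluster centers back into the original feature space.
centers = scaler.inverse_transform(ms.cluster_centers_)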
Good luck!
I am trying to visualize some data using matplotlib, but I got this error:
File "C:\Python27\lib\site-packages\matplotlib\mlab.py", line 2775, in griddata
tri = delaunay.Triangulation(x,y)
File "C:\Python27\lib\site-packages\matplotlib\delaunay\triangulate.py", line 98, in __init__
duplicates = self._get_duplicate_point_indices()
File "C:\Python27\lib\site-packages\matplotlib\delaunay\triangulate.py", line 137, in _get_duplicate_point_indices
return j_sorted[mask_duplicates]
ValueError: too many boolean indices
It happens when I call the function
data=griddata(self.dataX,self.dataY,self.dataFreq,xi,yi)
Does anyone know why I get that error? I suppose it is something with the parameters, but I can't figure out what.
Might be worth updating your matplotlib. There has been a lot of work on the triangulation code that has made it into v1.3.0.
The what's new page for matplotlib v1.3.0 can be found at http://matplotlib.org/users/whats_new.html#triangular-grid-interpolation
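If upgrading is not an option right away, it may also be worth making sure the inputs are flat, equal-length float arrays before the call; the old delaunay code is picky about shapes, and this is a guess at the trigger rather than a confirmed fix (note mlab.griddata only exists in older matplotlib versions):

import numpy as np
from matplotlib.mlab import griddata

def grid_safely(data_x, data_y, data_z, xi, yi):
    # Coerce inputs to flat, equal-length float arrays before gridding.
    x = np.asarray(data_x, dtype=float).ravel()
    y = np.asarray(data_y, dtype=float).ravel()
    z = np.asarray(data_z, dtype=float).ravel()
    if not (x.shape == y.shape == z.shape):
        raise ValueError("x, y and z must contain the same number of points")
    return griddata(x, y, z, xi, yi)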