Bioinformatics sequence clustering in Python

I am trying to develop a new method for clustering sequence data. I have implemented my method and measured its accuracy. Now I need to compare it against established methods to see whether it performs as I expect.
Could you tell me which methods are the most widely used in the bioinformatics domain, and which Python packages correspond to those methods? I am an engineer and have no idea which methods in this field are considered the most accurate, so I don't know which ones I should compare mine against.

Two commonly used methods are:
CD-HIT, http://weizhongli-lab.org/cd-hit/
UCLUST (part of USEARCH; the 32-bit version is free), https://drive5.com/usearch/
Both are command-line tools and, I think, written in C++.

It also depends on what you need the tool for (data reduction, OTU clustering, building a tree, etc.). These days there is a shift toward clustering tools that use a more dynamic approach instead of a fixed similarity cutoff.
Examples:
DADA2
UNOISE
SeekDeep
Fixed-cutoff clustering:
CD-HIT
UCLUST
VSEARCH
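
If you just want to benchmark against these from Python, the easiest route is to shell out to the command-line tool and parse its output. A minimal sketch, assuming cd-hit is installed and on your PATH (for nucleotide data the binary is usually cd-hit-est; -c is the fixed identity cutoff):

    import subprocess
    from collections import defaultdict

    def run_cdhit(fasta_in, fasta_out, identity=0.97):
        """Cluster sequences with CD-HIT at a fixed identity cutoff."""
        subprocess.run(
            ["cd-hit", "-i", fasta_in, "-o", fasta_out, "-c", str(identity)],
            check=True,
        )
        # CD-HIT writes cluster membership to <fasta_out>.clstr
        clusters = defaultdict(list)
        with open(fasta_out + ".clstr") as fh:
            cluster_id = None
            for line in fh:
                if line.startswith(">Cluster"):
                    cluster_id = int(line.split()[1])
                else:
                    # member lines look like: "0   120nt, >seq_id... *"
                    seq_id = line.split(">")[1].split("...")[0]
                    clusters[cluster_id].append(seq_id)
        return clusters

    # clusters = run_cdhit("input.fasta", "clustered.fasta", identity=0.97)

You can then score the resulting partition against your own method's clusters with whatever accuracy metric you are already using.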

Related

Replace objective function in Word2Vec Gensim

I'm doing my final degree project. I need to create an extended version of the word2vec algorithm, changing the default objective function of the original paper. This has already been done (see this paper). In that paper they only state the new objective function; they do not explain how they ran the model.
Now I need to extend that model too, with another objective function, but I'm not sure whether I have to implement word2vec myself with the new function, or whether there is a way to replace it in the Gensim word2vec implementation.
I have checked the Gensim Word2Vec documentation, but I have not seen any parameter for this. Do you have any idea how to do it? Is it even possible?
I was unsure whether this Stack Exchange site was the correct one; maybe https://ai.stackexchange.com/ is more appropriate.
There's no official support in Gensim for simply dropping in your own objective function.
However, the full source code is available – https://github.com/RaRe-Technologies/gensim – so by editing it, or using it as a model for your own implementation, you could theoretically do anything.
Beware, though:
the code has gone through a lot of optimization and customization for options that may not be relevant to your needs, so it may not be the cleanest or simplest starting point
for performance, the core routines are written in Cython (see the .pyx files), which can be especially hard to debug, and they rely on library bulk-array functions that may obscure how to slot in your alternative objective
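
For a rough sense of what you would be replacing: the default skip-gram negative-sampling objective comes down to a few dot products and sigmoid gradients per (center, context) pair. A toy NumPy sketch of one SGD update is below; this is an illustration of the standard SGNS objective, not Gensim's actual code, and your alternate objective would change the loss/gradient lines:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_update(W_in, W_out, center, context, negatives, lr=0.025):
        """One skip-gram negative-sampling update for a (center, context) pair.
        W_in, W_out: (vocab_size, dim) input/output embedding matrices.
        center, context: word indices; negatives: list of sampled word indices.
        """
        v = W_in[center]
        grad_v = np.zeros_like(v)
        for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = W_out[word]
            # gradient of -log sigmoid(+/- u.v) with respect to u and v
            g = lr * (label - sigmoid(np.dot(u, v)))
            grad_v += g * u
            W_out[word] += g * v
        W_in[center] += grad_v

Reimplementing something like this yourself, and only then worrying about speed, is often easier than untangling the optimized Cython code.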

How to perform adjoint sensitivity in Python (preferably through CVODE)

I want to implement adjoint sensitivity analysis in Python, in order to determine the gradient of my objective function with respect to some parameters. Specifically, the objective function depends on the solution of a differential equation, which in turn depends on the parameters I am trying to optimize.
There are numerous good packages for this in Julia (see here), as well as CVODES from SUNDIALS; however, the latter, which apparently does have a Python wrapper, does not include sensitivity analysis capabilities according to this link. Furthermore, I have looked into SALib for sensitivity analysis, but as far as I understand it deals with a different kind of 'sensitivity analysis', so adjoint or even forward sensitivity analysis is not included (correct me if I am wrong on this one).
Thus my question is: does a Python version of CVODES with sensitivity analysis capabilities exist, or is there any other package one can use to perform adjoint sensitivity analysis?
You can easily call Julia code / packages from Python with pyjulia.
https://github.com/JuliaPy/pyjulia
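A minimal sketch of what that looks like (assuming Julia itself and whatever sensitivity packages you need are already installed; setup can be finicky, see the pyjulia README):

    # pip install julia, then configure PyCall once as described in the README
    from julia import Main

    # evaluate arbitrary Julia code and get results back as Python objects
    Main.eval("f(x) = 2x + 1")
    print(Main.f(20))  # -> 41

    # load a Julia package and call into it
    Main.eval("using LinearAlgebra")
    print(Main.eval("norm([3.0, 4.0])"))  # -> 5.0

Once that bridge works, you can drive the Julia adjoint-sensitivity machinery from your Python optimization loop in the same way.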
You can try Assimulo, which is a Python wrapper of the SUNDIALS suite. I've been using it for some years now and it works pretty robustly. So far I have performed forward sensitivity analysis on ODE systems with a moderate number of states/parameters using CVode (fewer than 20 states, fewer than 10 parameters). It works well in terms of robustness (it can handle stiff problems and supports a variety of linear solvers for sparse problems) and speed, and it also supports DAEs through IDA.
I installed Assimulo using conda, which handles the whole dependency tree (including a recent version of SUNDIALS). Finally, I'm not aware whether adjoint sensitivity analysis can be performed using Assimulo. If you find something, let us all know.
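For reference, here is roughly how my forward-sensitivity setup looks with Assimulo. The names are from memory of Assimulo's sensitivity example, so double-check them against the docs: you give the right-hand side an extra parameter argument and pass p0 to the problem, and CVode then carries the sensitivities along.

    import numpy as np
    from assimulo.problem import Explicit_Problem
    from assimulo.solvers import CVode

    def rhs(t, y, p):
        # simple decay chain: y0 -[p0]-> y1 -[p1]-> sink
        return np.array([-p[0] * y[0],
                          p[0] * y[0] - p[1] * y[1]])

    y0 = [1.0, 0.0]
    p0 = [0.5, 0.3]

    problem = Explicit_Problem(rhs, y0, p0=p0)   # passing p0 switches on sensitivities
    sim = CVode(problem)
    sim.report_continuously = True               # needed so sensitivities are stored

    t, y = sim.simulate(10.0)
    # sim.p_sol[i] holds d y / d p_i along the trajectory
    print(len(sim.p_sol))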

Python: Create Nomograms from Data (using PyNomo)

I am working with Python 2.7. I want to create nomograms based on data for various variables in order to predict one variable. I have looked into, and installed, the PyNomo package.
However, from the documentation here and here and the examples, it seems that nomograms can only be made when you have equation(s) relating these variables, and not from the data itself. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use that to predict things. How do I do that? In other words, how do I make the nomograph take data as input and not a function? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest some other package (in any language). For example, I am trying the nomogram function from the rms package in R, but I am not having luck figuring out how to use it properly. I have asked a separate question about that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation: you mark two scales, draw a straight line across the marks, and read the answer off a third scale. This is the type of nomogram that PyNomo produces and, as you correctly say, it needs a formula. Producing a nomogram like this from data is therefore a two-step process: first fit an equation to the data, then feed that equation to PyNomo.
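For example, the first step could be as simple as fitting a regression to your data and taking the fitted formula as the equation you hand to PyNomo (the file name and the linear model here are just placeholders for whatever fits your data):

    import numpy as np

    # step 1: fit an equation to the data, e.g. a linear model z = a*x + b*y + c
    x, y, z = np.loadtxt("data.csv", delimiter=",", unpack=True)
    A = np.column_stack([x, y, np.ones_like(x)])
    a, b, c = np.linalg.lstsq(A, z)[0]
    print("z = %.3f*x + %.3f*y + %.3f" % (a, b, c))

    # step 2: this fitted formula (not the raw data) is what goes into
    # PyNomo's block/axis definitions to draw the nomogram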
The other use of the term (very popular recently) is to refer to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). For these, a group of parallel predictor variables is depicted with a common scale at the bottom; for each predictor you read a 'score' off the scale and then add the scores up. These types of nomograms have become very popular in the last few years, and that's what the rms package will draft. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)

Does Matlab's fminimax apply Pareto optimality?

I am working on multi-objective optimization in Matlab, and am using fminimax from the Optimization Toolbox. I want to know whether fminimax applies Pareto optimization, and if not, why? Also, can you suggest a multi-objective optimization package in Matlab or Python that does use Pareto optimality?
For Python, DEAP may be what you're looking for. It has extensive documentation with a lot of real-life examples, and a really helpful Google Groups forum. It implements two robust MO algorithms: NSGA-II and SPEA-II.
Edit (as requested)
I am using DEAP for my MSc thesis, so I can tell you how we are using Pareto optimality. Setting DEAP up is pretty straightforward, as you will see in the examples. Use this one as a starting point. This is the short version, which uses the built-in algorithms and operators. Read both and then follow these guidelines.
As the OneMax example is single-objective, it doesn't use MO algorithms. However, it's easy to adapt it:
Change your evaluation function so it returns an n-tuple with the desired scores. If, for instance, you also want to minimize the standard deviation, something like return sum(individual), numpy.std(individual) would work.
Also, modify the weights parameter of the base.Fitness object so it matches that returned n-tuple. A positive float means maximization, while a negative one means minimization. You can use any real number, but I would stick with 1.0 and -1.0 for the sake of simplicity.
Change your genetic operators to cxSimulatedBinaryBounded(), mutPolynomialBounded() and selNSGA2(), for crossover, mutation and selection, respectively. These are the suggested methods, as they were developed by the NSGA-II authors.
If you want to use one of the ready-to-go algorithms built into DEAP, choose eaMuPlusLambda().
When calling the algorithm, remember to change the halloffame parameter from HallOfFame() to ParetoFront(). This will return all non-dominated individuals, instead of a lexicographically sorted list of "best individuals in all generations". You can then process your Pareto front as desired: weighted sum, custom lexicographic sorting, etc. A minimal sketch pulling these steps together follows below.
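Here is that sketch for a two-objective OneMax-style problem (maximize the sum, minimize the standard deviation). Treat it as a starting point rather than a tuned setup:

    import random
    import numpy
    from deap import algorithms, base, creator, tools

    # weights: maximize the sum, minimize the standard deviation
    creator.create("FitnessMulti", base.Fitness, weights=(1.0, -1.0))
    creator.create("Individual", list, fitness=creator.FitnessMulti)

    def evaluate(ind):
        return sum(ind), numpy.std(ind)

    toolbox = base.Toolbox()
    toolbox.register("attr_float", random.uniform, 0.0, 1.0)
    toolbox.register("individual", tools.initRepeat, creator.Individual,
                     toolbox.attr_float, n=30)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", evaluate)
    toolbox.register("mate", tools.cxSimulatedBinaryBounded, low=0.0, up=1.0, eta=20.0)
    toolbox.register("mutate", tools.mutPolynomialBounded, low=0.0, up=1.0,
                     eta=20.0, indpb=1.0 / 30)
    toolbox.register("select", tools.selNSGA2)

    pop = toolbox.population(n=100)
    pareto = tools.ParetoFront()
    algorithms.eaMuPlusLambda(pop, toolbox, mu=100, lambda_=100,
                              cxpb=0.6, mutpb=0.3, ngen=50, halloffame=pareto)

    for ind in pareto:
        print(ind.fitness.values)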
I hope that helps. Note that there is also a full, somewhat more advanced NSGA2 example available here.
For fminimax and fgoalattain it looks like the answer is no. However, the genetic algorithm solver, gamultiobj, is Pareto set-based, though I'm not sure if it's the kind of multi-objective optimization function you want to use. gamultiobj implements the NSGA-II evolutionary algorithm. There's also this package, which implements the Strength Pareto Evolutionary Algorithm 2 (SPEA-II) in C with a Matlab MEX interface. It's a bit old, so you might want to recompile it (you'll need to anyway if you're not on 32-bit Windows).

Implementing alternative forms of LDA

I am using Latent Dirichlet Allocation with a corpus of news data from six different sources. I am interested in topic evolution and emergence, and I want to compare how the sources are alike and different from each other over time. I know that there are a number of modified LDA algorithms, such as the Author-Topic model, Topics over Time, and so on.
My issue is that very few of these alternative model specifications are implemented in any standard format. A few are available in Java, but most exist only as conference papers. What is the best way to go about implementing some of these algorithms on my own? I am fairly proficient in R and JAGS, and can stumble around in Python when given long enough. I am willing to write the code, but I don't really know where to start, and I don't know C or Java. Can I build a model in JAGS or Python from just the formulas in the manuscript? If so, can someone point me at an example of doing this? Thanks.
My friend's response is below, pardon the language please.
First I wrote up a Python implementation of the collapsed Gibbs sampler seen here (http://www.pnas.org/content/101/suppl.1/5228.full.pdf+html) and fleshed out here (http://cxwangyi.files.wordpress.com/2012/01/llt.pdf). This was slow as balls.
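For a sense of scale, the core of that collapsed Gibbs sampler is only a few dozen lines; a bare-bones sketch (with none of the speed tricks, which is exactly why the pure-Python version is slow) looks roughly like this:

    import numpy as np

    def lda_gibbs(docs, n_topics, n_vocab, alpha=0.1, beta=0.01, n_iter=200):
        """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
        ndk = np.zeros((len(docs), n_topics))   # document-topic counts
        nkw = np.zeros((n_topics, n_vocab))     # topic-word counts
        nk = np.zeros(n_topics)                 # topic totals
        z = [np.random.randint(n_topics, size=len(doc)) for doc in docs]
        for d, doc in enumerate(docs):          # initialize counts
            for w, k in zip(doc, z[d]):
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        for _ in range(n_iter):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]                 # remove the current assignment
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                    # conditional p(z_i = k | rest), up to a constant
                    p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                    k = np.random.choice(n_topics, p=p / p.sum())
                    z[d][i] = k                 # resample and restore counts
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        return ndk, nkw   # unnormalized theta and phi estimates

Most of the modified LDA variants only change the conditional distribution in the middle of that triple loop, which is why working from the formulas in the paper is feasible.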
Then I used a Python wrapper around a C implementation of this paper (http://books.nips.cc/papers/files/nips19/NIPS2006_0511.pdf), which is fast as f*ck, but the results are not as great as one would see with NMF.
But the NMF implementations I've seen, with scikits, and even with the recently released scipy-sparse-compatible NIMFA library, all blow the f*ck up on any sizable corpus. My new white whale is a sliced, distributed implementation of the thing. This'll be non-trivial.
In Python, do you know of PyMC? It's flexible in specifying both the model and the fitting algorithm.
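If you go that route, the general pattern (shown here with the PyMC3-style API as a toy illustration, not an LDA model) is to declare the priors and likelihood inside a model context and then pick the fitting algorithm:

    import numpy as np
    import pymc3 as pm

    data = np.random.normal(1.0, 2.0, size=200)   # toy observations

    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sd=10.0)       # 'sd' is the PyMC3 keyword
        sigma = pm.HalfNormal("sigma", sd=5.0)
        pm.Normal("obs", mu=mu, sd=sigma, observed=data)
        # the fitting algorithm is chosen here: NUTS, Metropolis, ADVI, ...
        trace = pm.sample(1000, step=pm.Metropolis())

    print(trace["mu"].mean(), trace["sigma"].mean())

Writing an LDA variant this way means translating the paper's generative story into those prior/likelihood declarations.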
Also, when starting with R and JAGS, there is this tutorial on "Using JAGS in R with the rjags Package" together with a collection of examples.
