I'm doing my final degree project. I need to create an extended version of the word2vec algorithm, changing the default objective function of the original paper. This has already been done (check this paper). In that paper, they only state the new objective function; they do not describe how they actually trained the model.
Now I need to extend that model further with yet another function, but I'm not sure whether I have to implement word2vec myself with the new function, or whether there is a way to swap it into the Gensim word2vec implementation.
I have checked the Gensim Word2Vec documentation, but I have not seen any parameter for this. Do you have any idea how to do it? Is it even possible?
I was unsure whether this Stack Exchange site was the correct one; maybe https://ai.stackexchange.com/ is more appropriate.
There's no official support in Gensim for simply dropping in your own objective function.
However, the full source code is available – https://github.com/RaRe-Technologies/gensim – so by editing it, or using it as a model for your own implementation, you could theoretically do anything.
Beware, though:
the code has gone through a lot of optimization, and customization for new options that may not be relevant to your needs, so it may not be the cleanest or simplest starting point
for performance, the core routines are written in Cython (see the .pyx files), which can be especially hard to debug, and they rely on library bulk-array functions that may obscure where to plug in your alternate objective
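To make that concrete, here is a minimal pure-NumPy sketch of the standard skip-gram negative-sampling update (this is not Gensim's code; the function and variable names are my own). The objective-specific gradient lines are marked, since those are what you would replace with your paper's function:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w_in, w_out, neg_out, lr=0.025):
    # One skip-gram-with-negative-sampling update for a single
    # (center, context) pair; all vectors are updated in place.
    # w_in: center-word input vector; w_out: true context output vector;
    # neg_out: 2-D array of k negative-sample output vectors.
    grad_in = np.zeros_like(w_in)

    # Positive pair: gradient of -log sigmoid(w_in . w_out)
    g = sigmoid(w_in @ w_out) - 1.0      # <-- objective-specific
    grad_in += g * w_out
    w_out -= lr * g * w_in

    # Negative samples: gradient of -log sigmoid(-w_in . v)
    for v in neg_out:
        g = sigmoid(w_in @ v)            # <-- objective-specific
        grad_in += g * v
        v -= lr * g * w_in

    w_in -= lr * grad_in

Gensim's Cython routines compute the same gradients, but fused and vectorized across many pairs at once, which is exactly why isolating the objective there is harder than in a sketch like this.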
I'm trying to learn from an example which uses an older version of gensim. In particular, I have a section of code like:
word_vectors = Word2Vec(vector_size=word_vector_dim, min_count=1)
word_vectors.build_vocab(corpus_iterable)
word_vectors.intersect_word2vec_format(pretrained_dir + 'GoogleNews-vectors-negative300.bin.gz', binary=True)
My understanding is that this fills the model's vocabulary with pre-trained word vectors where available; words in my vocabulary that are not in the pretrained vectors are initialized to random values. However, the method intersect_word2vec_format doesn't exist in the latest version of gensim. What is the cleanest way to replicate this process in gensim 4.0.0?
The .intersect_word2vec_format() method still exists but, as an operation on a set of word-vectors, has moved to KeyedVectors. So in some cases, older code that called the method on a Word2Vec model itself will need to call it on the model's .wv property, which holds a KeyedVectors object, instead. For example:
w2v_model = Word2Vec(vector_size=word_vector_dim, min_count=1)
w2v_model.build_vocab(corpus_iterable)
# (you'll likely need another workaround here, see below)
w2v_model.wv.intersect_word2vec_format(pretrained_dir + 'GoogleNews-vectors-negative300.bin.gz', binary=True)
However, you'll still hit some problems:
It has always been at best an experimental, advanced feature, and not part of any well-documented process. So it's best used if you're able to review its source code and understand the limits and tradeoffs of such (partially) pre-initialized word-vectors, which may be further trained or frozen depending on the vectors_lockf values chosen.
The equally experimental vectors_lockf functionality now, in Gensim 4+, requires manual initialization by the knowledgeable, and because .intersect_word2vec_format() assumes a particular pre-allocation, that method will break in Gensim 4.1 without an explicit workaround. See this open issue for more details.
Most generally: pre-initializing with other word-vectors is at best a fussy, advanced technique, so be sure to study the code, consider the potential tradeoffs, and carefully evaluate its effect on your end results before embracing it. It's not an easy, automatic, or well-characterized shortcut.
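For reference, the workaround usually suggested is to allocate the per-word lock factors by hand before the intersect call. A sketch, under the assumption that your Gensim 4.x version still exposes vectors_lockf this way (double-check against the linked issue):

import numpy as np
from gensim.models import Word2Vec

w2v_model = Word2Vec(vector_size=word_vector_dim, min_count=1)
w2v_model.build_vocab(corpus_iterable)

# Gensim 4.x no longer pre-allocates one lock factor per word, but
# .intersect_word2vec_format() assumes that array exists, so create it.
# 1.0 leaves the imported vectors trainable; 0.0 would freeze them.
w2v_model.wv.vectors_lockf = np.ones(len(w2v_model.wv), dtype=np.float32)

w2v_model.wv.intersect_word2vec_format(
    pretrained_dir + 'GoogleNews-vectors-negative300.bin.gz', binary=True)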
I am trying to develop a new method to cluster sequence data. I implemented my method and measured its accuracy. Now I need to compare it with established methods to see whether it works as I expected.
Could you tell me which methods are most widely used in the bioinformatics domain, and which Python packages correspond to those methods? I am an engineer and have no idea which methods in this field are considered most accurate, i.e. which ones I should compare my method against.
Two commonly used methods are:
CD-HIT, http://weizhongli-lab.org/cd-hit/
UCLUST (part of USEARCH; the 32-bit version is free), https://drive5.com/usearch/
Both are command-line tools, written in C++ I think (a typical invocation is sketched below the lists).
It also depends on what you need the tool for (data reduction, OTU clustering, building a tree, etc.). These days you see a shift toward clustering tools that use a more dynamic approach instead of a fixed similarity cutoff.
Examples of the dynamic approach:
DADA2
UNOISE
SeekDeep
Fixed clustering:
CD-HIT
uclust
vsearch
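To give a feel for the fixed-cutoff tools, a typical CD-HIT run driven from Python might look like the sketch below (file names are placeholders; -i is the input FASTA, -o the output prefix, -c the identity cutoff):

import subprocess

# Cluster sequences at 90% identity with CD-HIT (assumes cd-hit is on PATH).
subprocess.run(
    ["cd-hit", "-i", "sequences.fasta", "-o", "clusters90", "-c", "0.9"],
    check=True)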
Is it possible, using the Z3 API (e.g. the Python API), to save the current state of a solver, including what the solver has learned (in SAT solving we would say the "learned clauses"), to a file in SMT2 format?
I would like to be able to save the solver's state to a temporary file so that I can resume solving later, giving me some time to work out what further queries I should make.
Many thanks in advance...
SMT2 has no provision for saving a given solver's state, which will no doubt differ widely from solver to solver. Each solver might have its own mechanism for doing so, but it will definitely not be in SMTLib2 format.
Since your question is entirely Z3-specific, I recommend asking it at https://github.com/Z3Prover/z3/issues to see if they might have anything interesting. So far as I know, however, this isn't currently possible.
In the end, Levent was right :)
Below are some observations by Nikolaj Bjorner, from the Z3 GitHub issue tracker.
"The state of the solver isn't fully serializable to SMT2 format.
You can print the solver to smt2 format based on the current assertions,
but not learned clauses/units using the sexpr() method on the Solver object."
...
"We don't expose ways to print internal state. You could perhaps interrupt the solver, then clone it using the "translate" methods and access the translated solver state using internal print utilities. You would have to change the code a bit to get to this state.
The print features on solvers don't access the internal state of any of the solvers, instead they look at the asserted formulas and print them.
I don't translate learned lemmas. For example, the code in smt_context.cpp line 176 is disabled because it didn't help with any performance enhancements. Similarly, the copy code in sat_solver does not copy learned clauses even though it retains unit literals and binary clauses that are learned."
You can see the above comments by Nikolaj at this link.
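To illustrate what the z3py API does let you do today, here is a small sketch (the file name is a placeholder): sexpr() dumps the asserted formulas, from_file() reloads them, and translate() clones a solver into another context. None of these carry the learned clauses along.

from z3 import Solver, Context, Int

s = Solver()
x = Int('x')
s.add(x > 3, x < 10)

# Dump the *asserted* formulas (not learned clauses) in SMT2 form.
with open('snapshot.smt2', 'w') as f:
    f.write(s.sexpr())

# Later session: rebuild a solver from the snapshot and resume querying.
s2 = Solver()
s2.from_file('snapshot.smt2')
print(s2.check())

# Cloning into another context likewise copies assertions only.
s3 = s.translate(Context())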
I am working in Python 2.7. I want to create nomograms based on data for various variables in order to predict one variable. I have been looking into, and have installed, the PyNomo package.
However, from the documentation (here and here) and the examples, it seems that nomograms can only be made when you have equation(s) relating the variables, not from the data itself. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use that to predict things. How do I do that? In other words, how do I make the nomograph take data as input rather than a function? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest another package (in any language). For example, I am trying the nomogram function from the rms package in R, but am not having luck figuring out how to use it properly. I have asked a separate question about that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation: you mark two scales, draw a straight line across the marks, and read your answer from a third scale. This is the type of nomogram that PyNomo produces, and as you correctly say, you need a formula. Producing a nomogram like this from data is therefore a two-step process: first fit a formula to the data, then hand that formula to PyNomo.
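A sketch of step one, assuming (purely for illustration) that your target z is roughly linear in two predictors x and y stored in a CSV file; the fitted coefficients give you the formula PyNomo needs for step two:

import numpy as np

# Fit z ~ a*x + b*y + c to the data by ordinary least squares.
x, y, z = np.loadtxt("data.csv", delimiter=",", unpack=True)
A = np.column_stack([x, y, np.ones_like(x)])
a, b, c = np.linalg.lstsq(A, z)[0]

# The fitted formula is what you would then encode as PyNomo scale functions.
print("z = %.4g*x + %.4g*y + %.4g" % (a, b, c))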
The other use of the term (very popular recently) is to refer to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). For these, a group of parallel predictor scales is depicted with a common points scale; for each predictor you read the 'score' from the scale, then add the scores up. These types of nomograms have become very popular in the last few years, and that's what the rms package will draft. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)
I am using Latent Dirichlet Allocation with a corpus of news data from six different sources. I am interested in topic evolution and emergence, and want to compare how the sources are alike and different from each other over time. I know that there are a number of modified LDA algorithms, such as the Author-Topic model, Topics over Time, and so on.
My issue is that very few of these alternate model specifications are implemented in any standard form. A few are available in Java, but most exist only as conference papers. What is the best way to go about implementing some of these algorithms on my own? I am fairly proficient in R and JAGS, and can stumble around in Python when given long enough. I am willing to write the code, but I don't really know where to start, and I don't know C or Java. Can I build a model in JAGS or Python just from the formulas in the manuscript? If so, can someone point me to an example of doing this? Thanks.
My friend's response is below, pardon the language please.
First I wrote up a Python implementation of the collapsed Gibbs sampler seen here (http://www.pnas.org/content/101/suppl.1/5228.full.pdf+html) and fleshed out here (http://cxwangyi.files.wordpress.com/2012/01/llt.pdf). This was slow as balls.
Then I used a Python wrapping of a C implementation of this paper (http://books.nips.cc/papers/files/nips19/NIPS2006_0511.pdf). Which is fast as f*ck, but the results are not as great as one would see with NMF.
But the NMF implementations I've seen, with scikits, and even with the scipy sparse-compatible, recently released NIMFA library, all blow the f*ck up on any sizable corpus. My new white whale is a sliced, distributed implementation of the thing. This'll be non-trivial.
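For what it's worth, the collapsed Gibbs sampler from the Griffiths & Steyvers paper linked above does fit in a page of NumPy, so coding directly from a paper's formulas is realistic. A minimal sketch (my own names and hyperparameter defaults; slow, as my friend says, but a readable starting point):

import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    # Collapsed Gibbs sampling for vanilla LDA (Griffiths & Steyvers 2004).
    # docs: list of lists of word ids in [0, V). Returns the count matrices
    # from which theta (doc-topic) and phi (topic-word) can be estimated.
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # doc-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                      # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_i = k | z_-i, words)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

Variants like Topics over Time typically change little beyond that full-conditional line, which is why implementing them yourself from the manuscript's formulas is feasible.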
In Python, do you know of PyMC? It's flexible in specifying both the model and the fitting algorithm.
Also, when starting with R and JAGS, there is this tutorial on "Using JAGS in R with the rjags Package" together with a collection of examples.