I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).
My prototype uses Python's set as the obvious data type.
When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.
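For concreteness, here is a minimal sketch of the kind of query I mean (names are illustrative; index maps a word id to its set of document ids):

def search(word_ids, index):
    # Intersect smallest-first so the intermediate results stay small.
    doc_sets = sorted((index[w] for w in word_ids), key=len)
    result = set(doc_sets[0])
    for s in doc_sets[1:]:
        result &= s              # pairwise reduction
        if not result:
            break
    return result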
Python's "intersect" method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, so long as the library offers a fast way to get the next entry after i.
I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.
I can't find any, or at least not any with Python bindings. There is avltree, which does what I want, but since even the pairwise set merge takes longer than I want, I suspect I want all of my operations done in C/C++.
Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?
Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)
(Edit: to clarify what I'm looking for from an API, I want something like:
def merge(document_sets):
    probe_i = 0
    probe_set = document_sets[probe_i]
    document_id = GET_FIRST(probe_set)
    while IS_VALID(document_id):
        # See if the document is present in all sets
        for i in range(1, len(document_sets)):
            # dynamically adapt to favor the least matching set
            target_i = (i + probe_i) % len(document_sets)
            target_set = document_sets[target_i]
            if document_id not in target_set:
                probe_i = target_i
                probe_set = document_sets[probe_i]
                document_id = GET_NEXT(probe_set, document_id)
                break
        else:
            yield document_id
            document_id = GET_NEXT(probe_set, document_id)
I'm looking for something which implements GET_NEXT() to return the next entry which occurs after the given entry. This corresponds to Judy1N and the similar entries for other Judy arrays.
This algorithm dynamically adapts to the data and should preferentially favor sets with few hits. For the type of data I work with this has given a 5-10% increase in performance.)
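(For concreteness: GET_FIRST / GET_NEXT / IS_VALID could be emulated in pure Python over sorted lists with the standard bisect module, as in this rough sketch; the whole point of the question, of course, is to have these operations done in C.)

import bisect

def GET_FIRST(sorted_ids):
    return sorted_ids[0] if sorted_ids else None

def GET_NEXT(sorted_ids, doc_id):
    # Smallest entry strictly greater than doc_id, or None when exhausted.
    i = bisect.bisect_right(sorted_ids, doc_id)
    return sorted_ids[i] if i < len(sorted_ids) else None

def IS_VALID(doc_id):
    return doc_id is not None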
Yes, there are some, though I'm not sure they're suitable for your use case: it seems none of them are exactly what you asked for.
BioPython has a Trie implementation in C.
Ah, here's a nice discussion including benchmarks: http://bugs.python.org/issue9520
Other (some very stale) implementations:

http://pypi.python.org/pypi/radix
    py-radix is an implementation of a radix tree data structure for the storage and retrieval of IPv4 and IPv6 network prefixes.

https://bitbucket.org/markon/patricia-tree/src
    A Python implementation of patricia-tree.

http://pypi.python.org/pypi/trie
    A prefix tree (trie) implementation.

http://pypi.python.org/pypi/logilab-common/0.50.3
    patricia.py: A Python implementation of PATRICIA trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric).
I've recently added iteration support to datrie; you may give it a try.
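A rough usage sketch (from memory of datrie's README; check the docs for the exact API):

import string
import datrie

trie = datrie.Trie(string.ascii_lowercase)   # the alphabet your keys are drawn from
trie[u'stackoverflow'] = 1
trie[u'stackbase'] = 2

print(u'stackbase' in trie)                  # True
print(trie.keys(u'stack'))                   # keys that start with the given prefix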
Related
I have a big text file from which I want to count the occurrences of known phrases. I currently read the whole text file line by line into memory and use the 'find' function to check whether a particular phrase exists in the text file or not:
found = txt.find(phrase)
This is very slow for a large file. Building an index of all possible phrases and storing them in a dict would help, but the problem is that it's challenging to create all meaningful phrases myself. I know that the Lucene search engine supports phrase search. When using Lucene to create an index for a text set, do I need to come up with my own tokenization method, especially for my phrase-search purpose above? Or does Lucene have an efficient way to automatically create an index for all possible phrases, without me having to worry about how to create the phrases?
My main purpose is to find a good way to count occurrences in a big text.
Summary: Lucene will take care of allocating higher matching scores to indexed text which more closely match your input phrases, without you having to "create all meaningful phrases" yourself.
Start Simple
I recommend you start with a basic Lucene analyzer, and see what effect that has. There is a reasonably good chance that it will meet your needs.
If that does not give you satisfactory results, then you can certainly investigate more specific/targeted analyzers/tokenizers/filters (for example if you need to analyze non-Latin character sets).
It is hard to be more specific without looking at the source data and the phrase matching requirements in more detail.
But, having said that, here are two examples (and I am assuming you have basic familiarity with how to create a Lucene index, and then query it).
All of the code is based on Lucene 8.4.
CAVEAT - I am not familiar with Python implementations of Lucene. So, with apologies, my examples are in Java - not Python (as your question is tagged). I would imagine that the concepts are somewhat translatable. Apologies if that's a showstopper for you.
A Basic Multi-Purpose Analyzer
Here is a basic analyzer - using the Lucene "service provider interface" syntax and a CustomAnalyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("asciiFolding")
        .build();
The above analyzer tokenizes your input text using Unicode whitespace rules, as encoded into the ICU libraries. It then standardizes on lowercase, and maps accents/diacritics/etc. to their ASCII equivalents.
An Example Using Shingles
If the above approach proves to be weak for your specific phrase matching needs (i.e. false positives scoring too highly), then one technique you can try is to use shingles as your tokens. Read more about shingles here (Elasticsearch has great documentation).
Here is an example analyzer using shingles, and using the more "traditional" syntax:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
...
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream tokenStream = source;
    tokenStream = new LowerCaseFilter(tokenStream);
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // default shingle size is 2:
    tokenStream = new ShingleFilter(tokenStream);
    return new Analyzer.TokenStreamComponents(source, tokenStream);
}
In this example, the default shingle size is 2 (two words per shingle) - which is a good place to start.
Finally...
Even if you think this is a one-time exercise, it is probably still worth going to the trouble to build some Lucene indexes in a repeatable/automated way (which may take a while depending on the amount of data you have).
That way, it will be fast to run your set of known phrases against the index, to see how effective each index is.
I have deliberately not said anything about your ultimate objective ("to count occurrences"), because that part should be relatively straightforward, assuming you really do want to find exact matches for known phrases. It's possible I have misinterpreted your question - but at a high level I think this is what you need.
I am using a lexicon of positive and negative words, and I want to count how many positive and negative words appear in each document from a large corpus. The corpus has almost 2 million documents, so the code I'm running is taking too long to count all these occurrences.
I have tried using numpy, but get a memory error when trying to convert the list of documents into an array.
This is the code I am currently running to count just the positive words in each document.
reviews_pos_wc = []
for review in reviews_upper:
    pos_words = 0
    for word in review:
        if word in pos_word_list:
            pos_words += 1
    reviews_pos_wc.append(pos_words)
After running this for half an hour, it only gets through 300k documents.
I have done a search for similar questions on this website. I found someone else doing a similar thing, but not nearly on the same scale as they only used one document. The answer suggested using the Counter class, but I thought this would just add more overhead.
It appears that your central problem is that you don't have the hardware needed to do the job you want in the time you want. For instance, your RAM appears insufficient to hold the names of 2M documents in both list and array form.
I do see a couple of possibilities. Note that "vectorization" is not a magic solution to large problems; it's merely a convenient representation that allows certain optimizations to occur among repeated operations.
Regularize your file names, so that you can represent their names in fewer bytes. Iterate through a descriptive expression, rather than the full file names. This could give you freedom to vectorize something later.
Your variable name implies that your lexicon is a list, which has inherently linear lookup. Change it to a data structure amenable to faster search, such as a set (hash-based) or an appropriate search tree. Even a sorted list with an interpolation search would speed up your work.
Do consider using popular modules (such as collections); let the module developers optimize the common operations on your behalf. Write a prototype and time its performance: given the simplicity of your processing, the coding shouldn't take long (see the sketch below).
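For instance, a minimal sketch along those lines, reusing the question's reviews_upper and pos_word_list (untested against your data):

from collections import Counter

pos_word_set = set(pos_word_list)    # O(1) membership tests instead of scanning a list

reviews_pos_wc = []
for review in reviews_upper:
    counts = Counter(review)         # tally each word of the review in one pass
    pos_words = sum(n for word, n in counts.items() if word in pos_word_set)
    reviews_pos_wc.append(pos_words)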
Does that give you some ideas for experimentation? I'm hopeful that my first paragraph proves to be unrealistically pessimistic, i.e. that one of these changes (especially the lexicon-as-a-set one) provides a solution.
This is py2neo 1.6.
My question is how to generate the unique_identifier for each idea (see commented lines) in order to have a distinct filename for the image.
For the moment we are using Python's uuid.
I wonder if there is some utility in neo4j that can associate a distinct number with each node when the node is added to the index, so that we can use this number as our unique_identifier.
def create_idea_node(idea_text):
    #basepath = 'http://www.example.com/ideas/img/'
    #filename = str(unique_identifier) + '.png'
    #idea_image_url = basepath + filename
    new_idea_node, = getGraph().create({"idea": idea_text, "idea_image_url": idea_image_url})
    _getIdeasIndex().add("idea", idea_text, new_idea_node)
    return OK

def _getIdeasIndex():
    return getGraph().get_or_create_index(neo4j.Node, "Ideas")
Neo4j nodes have ids; they are integers, but if a node is destroyed and recreated, the integer may be reused (id(n) is node n's id). Is there something wrong with the UUID? Integer solutions can become problematic when you are multi-threading or distributing your computation across multiple servers as you scale. So unless there is something wrong with the UUID solution, I'd just stick with that.
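If you do stay with UUIDs, the commented-out lines in your function could be filled in with something like this minimal sketch (uuid4 gives a random, effectively collision-free identifier):

import uuid

basepath = 'http://www.example.com/ideas/img/'
filename = uuid.uuid4().hex + '.png'   # 32 hex characters, unique per call for practical purposes
idea_image_url = basepath + filename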
In spite of being hard to read, and perhaps requiring slightly more storage, UUIDs have many advantages over trying to enforce uniqueness with integers (in general). I encourage you to read up on the nature of UUIDs on Wikipedia.
Integer uniqueness has many pitfalls when trying to scale across independent systems (for fault tolerance and performance reasons). If you can start out working with UUIDs, you can grow with your solution for the long term with many fewer headaches down the road.
FWIW, if you end up storing UUIDs in PostgreSQL sometime down the road, be sure to take advantage of the 'uuid' datatype. It will make storing and indexing those values almost as efficient as plain integers. (It will be hard to tell the difference.)
It appears that a certain project of mine will require the use of quad-trees, something that I have never worked with before. From what I have read, they should allow substantially better performance than a brute-force attempt at the problem would yield. Are any of these Python modules any good?
Quadtree 0.1.2 <= No: unable to execute in Python 3.1
QuadTree <= Yes: simple while working with rectangles
quadtree.py <= No: no support for needed operations
EDIT 1: Does anyone know of a better implementation than the one presented in the pygame wiki?
EDIT 2: Here are a few resources that others may find useful for path-finding techniques in Python.
Game Entity Navigation
Catch the Cootie
In this comment, joferkington refers to the current question and says:
Just for whatever it's worth, scipy.spatial.KDTree (and/or scipy.spatial.cKDTree, which is written in C for performance reasons) is a far more robust choice than the options listed.
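For a flavour of the scipy option, here is a minimal sketch (the points and query values are made up for illustration):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 2)                    # 1000 random 2-D points
tree = cKDTree(points)

nearby = tree.query_ball_point([0.5, 0.5], r=0.05)  # indices of points within radius 0.05
dist, idx = tree.query([0.5, 0.5], k=3)             # distances and indices of the 3 nearest neighbours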
Another library to check out is PyQuadTree, a pure-Python quadtree index that also works on Python 3.x. All you need to add an item is its bounding box as a 4-length sequence, so it could be used for a variety of purposes and even negative coordinate systems.
Although I am the author, I really just took someone else's quadtree structure/code and made it more user-friendly, added support for rectangle-quads, and added documentation. A simple example of how to use it:
# SETUP
import pyqtree
spindex = pyqtree.Index(bbox=[0, 0, 1000, 500])

# ADD SOME ITEMS
for item in items:
    spindex.insert(item=item, bbox=item.bbox)

# RETRIEVE ITEMS FROM A REGION
result = spindex.intersect(bbox=[233, 121, 356, 242])
The Python Package Index turns up two other libraries when searching for quadtree: http://pypi.python.org/pypi?%3Aaction=search&term=quadtree&submit=search
Disclaimer: I have never used quadtrees or any of these libraries.
Sometimes, it is not obvious how to implement data structures like trees in Python.
For instance,
    D
   / \
  B   F
 / \ / \
A  C E  G
is a simple binary tree structure. In Python, you would represent it like so:
[D,B,F] is a node with a left and right subtree. To represent the full tree you would have:
[D, [B, A, C], [F, E, G]]
That is a simple list of nested lists where any node can be a value like D or C, and any node can be a subtree which is, recursively, a list of nested lists. You could do something similar with a dictionary of dictionaries. These types of implementations are a bit quick and dirty and might not be acceptable in an assignment where the instructor expects a Node class with pointers to other nodes, but in the real world it is generally better to use the optimized implementations of Python lists/dictionaries first. Only if the result is inadequate in some way should you rewrite it to be more like what you would write in C or Java.
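For instance, a minimal sketch of the nested-list idea, writing each subtree as [value, left, right] and filling in None for missing children:

tree = ["D",
        ["B", ["A", None, None], ["C", None, None]],
        ["F", ["E", None, None], ["G", None, None]]]

def inorder(node):
    # Yield the values left-to-right: A, B, C, D, E, F, G.
    if node is None:
        return
    value, left, right = node
    yield from inorder(left)
    yield value
    yield from inorder(right)

print(list(inorder(tree)))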
Beyond that of course you need to implement the various algorithms to manipulate your trees because a quadtree is more than just some data; it is a set of rules about how to insert and delete nodes. If this is not a coursework question, then Quadtree 0.1.2 would probably be a good idea.
I don't know if this is the place to ask about algorithms. But let's see if I get any answers ... :)
If anything is unclear I'm very happy to clarify things.
I just implemented a Trie in Python. However, one bit seemed to be more complicated than it ought to be (says someone who loves simplicity). Perhaps someone has had a similar problem?
My aim was to minimize the number of nodes by storing the largest common prefix of a sub-trie in its root. For example, if we had the words stackoverflow, stackbase and stackbased, then the tree would look something like this:
        [s]tack
        /      \
[o]verflow    [b]ase
                   \
                   [d]
Note that one can still think of the edges as having one character (the first character of the child node).
Find-query is simple to implement.
Insertion is not hard, but somewhat more complex than I would like... :(
My idea was to insert the keys one after the other (starting from an empty trie), by first searching for the to-be-inserted key k (Find(k)), and then rearranging/splitting the nodes locally at the place where the find-procedure stops. There turn out to be 4 cases:
(Let k be the key we want to insert, and k' be the key of the node where the search ended.)

1. k is identical to k'
2. k is a "proper" prefix of k'
3. k' is a "proper" prefix of k
4. k and k' share some common prefix, but none of cases (1), (2) or (3) occurs.
It seems that each of the cases are unique and thus imply different modifications of the Trie. BUT: is it really that complex? Am I missing something? Is there a better approach?
Thanks :)
At a glance, it sounds like you've implemented a PATRICIA trie. This approach is also called path compression in some of the literature. There should be copies of Morrison's original PATRICIA paper that aren't behind the ACM paywall, and it includes an insertion algorithm.
There's also another compression method you may want to look at: level compression. The idea behind path compression is to replace strings of single child nodes with a single super node that has a "skip" count. The idea behind level compression is to replace full or nearly full subtrees with a super node with a "degree" count that says how many digits of the key the node decodes. There's also a 3rd approach called width compression, but I'm afraid my memory fails me and I couldn't find a description of it with quick googling.
Level compression can shorten the average path considerably, but insertion and removal algorithms get quite complicated, as they need to manage the trie nodes similarly to dynamic arrays. For the right data sets, level-compressed tries can be fast. From what I remember, they're the second-fastest approach for storing IP routing tables; the fastest is some sort of hash trie.
I don't see anything wrong with your approach. If you're looking for a spike solution, perhaps the action taken in case 4 is actually feasible for the first three cases as well, i.e. find the common prefix of k and k' and rebuild the node with that in mind. If it happens that one key is a prefix of the other, the resulting trie will still be correct; the implementation just did a bit more work than it really had to. But then again, without any code to look at, it's hard to say whether this works in your case.
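To make that "treat everything like case 4" idea concrete, here is a rough sketch (not your code; it assumes a node holds its incoming edge label, a terminal flag, and a dict of children keyed by first character) of an insert that always splits on the common prefix:

class RadixNode:
    def __init__(self, label="", terminal=False):
        self.label = label          # text on the edge leading into this node
        self.terminal = terminal    # True if an inserted key ends exactly here
        self.children = {}          # first character of a child's label -> RadixNode

def common_prefix_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def insert(root, key):
    node = root
    while True:
        if not key:                          # the key ends exactly at this node
            node.terminal = True
            return
        child = node.children.get(key[0])
        if child is None:                    # no edge shares the first character: new leaf
            node.children[key[0]] = RadixNode(key, terminal=True)
            return
        p = common_prefix_len(key, child.label)
        if p == len(child.label):            # child's label is a prefix of the rest: descend
            node, key = child, key[p:]
            continue
        # Otherwise split the child at the common prefix. This single rule covers
        # both "one key is a proper prefix of the other" and "partial overlap".
        split = RadixNode(child.label[:p])
        child.label = child.label[p:]
        split.children[child.label[0]] = child
        node.children[key[0]] = split
        rest = key[p:]
        if rest:
            split.children[rest[0]] = RadixNode(rest, terminal=True)
        else:
            split.terminal = True
        return

Inserting stackoverflow, stackbase and stackbased in that order reproduces the [s]tack / [o]verflow / [b]ase / [d] shape from the question, hanging off an empty root.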
Somewhat of a tangent, but if you are super worried about the number of nodes in your Trie, you may look at joining your word suffixes too. I'd take a look at the DAWG (Directed Acyclic Word Graph) idea: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
The downside of these is that they aren't very dynamic and creating them can be difficult. But, if your dictionary is static, they can be super compact.
I have a question regarding your implementation: what level of granularity did you decide to split your strings on to make the prefix tree? You could split stack as either s, t, a, c, k or st, ta, ac, ck, and many other n-grams of it. Most prefix tree implementations take into account an alphabet for the language; based on this alphabet, you do the splitting.
If you were building a prefix tree implementation for Python source, then your alphabet would contain tokens like def, :, if, else, etc.
Choosing the right alphabet makes a huge difference in building efficient prefix trees. As for answers to your question, you could look for Perl packages on CPAN which do longest-common-substring computation using tries. You may have some luck there, as most of their implementations are pretty robust.
Look at: Judy arrays and the Python interface at http://www.dalkescientific.com/Python/PyJudy.html