I don't know if this is the place to ask about algorithms. But let's see if I get any answers ... :)
If anything is unclear I'm very happy to clarify things.
I just implemented a Trie in Python. However, one bit seemed to be more complicated than it ought to be (and I love simplicity). Perhaps someone has had a similar problem?
My aim was to minimize the number of nodes by storing the largest common prefix of a sub-trie in its root. For example, if we had the words stackoverflow, stackbase and stackbased, then the tree would look something like this:
              [s]tack
             /       \
    [o]verflow       [b]ase
                          \
                          [d]
Note that one can still think of the edges as having one character (the first character of the child node).
Find-query is simple to implement.
Insertion is not hard, but somewhat more complex than I'd like... :(
My idea was to insert the keys one after the other (starting from an empty trie), by first searching for the to-be-inserted key k (Find(k)), and then rearranging/splitting the nodes locally at the place where the find-procedure stops. There turn out to be 4 cases:
(Let k be the key we want to insert, and k' the key of the node where the search ended.)
(1) k is identical to k'
(2) k is a "proper" prefix of k'
(3) k' is a "proper" prefix of k
(4) k and k' share some common prefix, but none of the cases (1), (2) or (3) occur.
It seems that each of the cases is unique and thus implies different modifications of the Trie. BUT: is it really that complex? Am I missing something? Is there a better approach?
Thanks :)
At a glance, it sounds like you've implemented a Patricia Trie. This approach is also called path compression in some of the literature. There should be copies of the original PATRICIA paper that aren't behind the ACM paywall, which will include an insertion algorithm.
There's also another compression method you may want to look at: level compression. The idea behind path compression is to replace strings of single-child nodes with a single super node that has a "skip" count. The idea behind level compression is to replace full or nearly full subtrees with a super node that has a "degree" count saying how many digits of the key the node decodes. There's also a third approach called width compression, but I'm afraid my memory fails me and I couldn't find a description of it with quick googling.
Level compression can shorten the average path considerably, but insertion and removal algorithms get quite complicated, as they need to manage the trie nodes similarly to dynamic arrays. For the right data sets, level-compressed trees can be fast. From what I remember, they're the second-fastest approach for storing IP routing tables; the fastest is some sort of hash trie.
I don't see anything wrong with your approach. If you're looking for a spike solution, perhaps the action taken in case 4 is actually feasible for the first three cases, i.e. find the common prefix of k and k' and rebuild the node with that in mind. If it happens that the keys were prefixes of one another, the resulting trie will still be correct; the implementation just did a bit more work than it really had to. But then again, without any code to look at it's hard to say if this works in your case.
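To make that concrete, here is a minimal self-contained sketch of the "always split on the common prefix" idea (my own illustration, not your code; the Node class and names are made up):

class Node:
    def __init__(self, prefix="", terminal=False):
        self.prefix = prefix        # key fragment stored at this node
        self.children = {}          # first char of child's prefix -> child node
        self.terminal = terminal    # True if a whole key ends here

def insert(node, key):
    # Length of the common prefix between the node's fragment and the key.
    p = 0
    while p < len(node.prefix) and p < len(key) and node.prefix[p] == key[p]:
        p += 1

    if p < len(node.prefix):
        # Split: the tail of this node's fragment moves into a new child.
        tail = Node(node.prefix[p:], node.terminal)
        tail.children = node.children
        node.prefix = node.prefix[:p]
        node.children = {tail.prefix[0]: tail}
        node.terminal = False

    rest = key[p:]
    if not rest:                        # key fully consumed: mark a word here
        node.terminal = True
    elif rest[0] in node.children:      # descend into the matching child
        insert(node.children[rest[0]], rest)
    else:                               # no matching child: attach the remainder
        node.children[rest[0]] = Node(rest, True)

root = Node()
for word in ["stackoverflow", "stackbase", "stackbased"]:
    insert(root, word)

All four cases fall out of the same two steps (maybe split, then either mark or attach/descend), at the cost of occasionally doing a bit more work than strictly necessary.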
Somewhat of a tangent, but if you are super worried about the number of nodes in your Trie, you may look at joining your word suffixes too. I'd take a look at the DAWG (Directed Acyclic Word Graph) idea: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
The downside of these is that they aren't very dynamic and creating them can be difficult. But, if your dictionary is static, they can be super compact.
I have a question regarding your implementation: what level of granularity do you decide to split your strings on to make the prefix tree? You could split stack as either s,t,a,c,k or st,ta,ac,ck and many other n-grams of it. Most prefix tree implementations take into account an alphabet for the language, and based on this alphabet you do the splitting.
If you were building a prefix tree implementation for Python, then your alphabet would be things like def, :, if, else, etc.
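For illustration (my own sketch, not from your code), the standard tokenize module can split source into that kind of alphabet:

import io
import tokenize

source = "if x:\n    y += 1\n"
# Keep only the visible tokens; these would be the "letters" of the trie alphabet.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]
print(tokens)   # ['if', 'x', ':', 'y', '+=', '1']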
Choosing the right alphabet makes a huge difference in building efficient prefix trees. As for your question, you could look at Perl packages on CPAN which do longest-common-substring computation using tries. You may have some luck there, as most of their implementations are pretty robust.
Look at Judy arrays and the Python interface at http://www.dalkescientific.com/Python/PyJudy.html
Brief version
I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).
Clarification: This is about indexing python source code, not English text that talks about python.
Background
My materials are in the form of IPython notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".
I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.
The dumb solution: Read the file, tokenize on whitespaces and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"]
Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:
for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):
        ...  # Get node's keyword, identifier etc., and line number -- how?
        print(term, source, line)  # I do know how to make an index
So, is this a reasonable approach? Is there a better one? How should this be done?
Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.
I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.
That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator comprehension (both explicit and built-in).
Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.
Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.
Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)
Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.
This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find each concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
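As a rough, self-contained sketch of that traversal (my own illustration; the helper name and the builtins check are assumptions, not something from your materials):

import ast
import builtins

def first_uses(source, filename="<unknown>"):
    # Map each node type (standing in for a keyword: For, If, Lambda, ...) and
    # each builtin name to the file and line where it first appears.
    first = {}
    for node in ast.walk(ast.parse(source)):
        line = getattr(node, "lineno", None)
        if line is None:
            continue
        first.setdefault(type(node).__name__, (filename, line))
        if isinstance(node, ast.Name) and node.id in dir(builtins):
            first.setdefault(node.id, (filename, line))
    return first

print(first_uses("for i in range(3):\n    print(i)\n"))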
I'm looking for more information about dealing with trees of class instances, and how best to go about calling methods on the leaves from the trunk. I have a trunk instance with many branch instances (in a dictionary), and each has many leaf instances (and dicts in the branches). The leaves are where the action really happens, and as such there are methods in the leaves for querying values, restoring values, and many other things.
This leads to what feels like duplication of code: since I might want to do something to all leaves of a branch, there are methods in the branches for doing something to a leaf, to a specified set of leaves, or to all leaves known to that branch. These do reduce the duplication by simply looping over the leaves and asking them to do said things to themselves (so the actual code doing the work lives in one place, the leaf class).
Then the trunk comes in, where I might want to do something to the entire tree (i.e. all leaves) in one fell swoop, so I have methods there that ask all known objects to run their all-leaf functions. I start to feel pretty removed from the real action in the leaves this way, though it works fine, and the code seems fairly tight - extremely brief, readable, and functioning fine.
Another issue comes in logical groupings. There are bits of data I might want to associate with some, most, or all leaves, to indicate that they're part of some logical group, so currently the leaves themselves are all storing that kind of data. When I want to get a logical group, I have to scan all leaves and gather them back up, rather than having some sort of list at the trunk level. This actually all works fine, and is even pretty logical, yet it feels insane. Is this simply the nature of working with tree-like structures, because of their complexity, or are there other ways of doing these kinds of things? I prefer not to build secondary structures to connect to things from the opposite direction - e.g. making a structure with references to the leaves in a logical group, approaching them then from that more list-like direction. One bonus of keeping things all in a large tree like this is that it can be dumped and loaded in one shot with pickle.
I'd love to hear thoughts - any and all - from anyone else's experience with such things.
What I'm taking away from your question is that "everything works", but that the code is starting to feel unmanageable and difficult to reason about, and: is there a better way to do this?
The one thing your question is missing is a solid context. What sort of problem is your tree structure actually solving? What do these objects actually do? Are they all the same type of object, or is there a mix of objects? With some of these specifics you might get more practical responses.
As it stands, I would suggest checking out some resources on design patterns. Specifically the composite and visitor patterns.
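To give a flavour of the composite pattern in Python (a generic sketch with made-up names, since I don't know your actual classes):

class Leaf:
    def __init__(self, value):
        self.value = value

    def restore(self):
        print("restoring leaf", self.value)

class Branch:
    def __init__(self, children=()):
        self.children = list(children)

    def restore(self):
        # Branches and leaves share one interface, so a branch (or the trunk)
        # just delegates; it never needs to know how deep the tree goes.
        for child in self.children:
            child.restore()

trunk = Branch([Branch([Leaf(1), Leaf(2)]), Leaf(3)])
trunk.restore()   # one call fans out to every leaf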
On the book end of things you could have a look at Design Patterns and/or Refactoring to Patterns. Neither of these have any Python code in them, but if you don't mind Java, the latter is an excellent introduction to taking hard to reason code structures and using a pattern to better organize things.
You might also have a look at Alex Martelli's talk on Python Design Patterns.
This question has some further resource links regarding patterns and python in general.
Hope that helps.
I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).
My prototype uses Python's set as the obvious data type.
When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.
Python's "intersect" method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, so long as the library offers a fast way to get the next entry after i.
I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.
I can't find any, or at least not any with Python bindings. There is avltree, which does what I want, but since even the pairwise set merge takes longer than I want, I suspect I want all my operations done in C/C++.
Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?
Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)
(Edit: to clarify what I'm looking for from an API, I want something like:
def merge(document_sets):
    probe_i = 0
    probe_set = document_sets[probe_i]
    document_id = GET_FIRST(probe_set)
    while IS_VALID(document_id):
        # See if the document is present in all sets
        for i in range(1, len(document_sets)):
            # dynamically adapt to favor the least matching set
            target_i = (i + probe_i) % len(document_sets)
            target_set = document_sets[target_i]
            if document_id not in target_set:
                probe_i = target_i
                probe_set = document_sets[probe_i]
                document_id = GET_NEXT(probe_set, document_id)
                break
        else:
            yield document_id
            document_id = GET_NEXT(probe_set, document_id)
I'm looking for something which implements GET_NEXT() to return the next entry which occurs after the given entry. This corresponds to Judy1N and the similar entries for other Judy arrays.
This algorithm dynamically adapts to the data and should preferentially favor sets with low hits. For the type of data I work with this has given a 5-10% increase in performance.)
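For reference, GET_NEXT over a plain sorted Python list can be emulated with bisect; this sketch shows the semantics I'm after, not a fast implementation:

import bisect

def get_next(sorted_ids, document_id):
    # Smallest id strictly greater than document_id, or None if there is none.
    i = bisect.bisect_right(sorted_ids, document_id)
    return sorted_ids[i] if i < len(sorted_ids) else None

print(get_next([2, 5, 9], 5))   # 9
print(get_next([2, 5, 9], 9))   # None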
Yes, there are some, though I'm not sure they're suitable for your use case; it seems none of them are exactly what you asked for.
BioPython has a Trie implementation in C.
Ah, here's a nice discussion including benchmarks: http://bugs.python.org/issue9520
Other (some very stale) implementations:
http://pypi.python.org/pypi/radix
py-radix is an implementation of a radix tree data structure for the storage and retrieval of IPv4 and IPv6 network prefixes.
https://bitbucket.org/markon/patricia-tree/src
A Python implementation of patricia-tree.
http://pypi.python.org/pypi/trie
A prefix tree (trie) implementation.
http://pypi.python.org/pypi/logilab-common/0.50.3
patricia.py: A Python implementation of PATRICIA trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric).
I've recently added iteration support to datrie, you may give it a try.
It appears that a certain project of mine will require the use of quad-trees, something that I have never worked with before. From what I have read, they should allow substantial performance enhancements over what a brute-force attempt at the problem would yield. Are any of these Python modules any good?
Quadtree 0.1.2 <= No: unable to execute in Python 3.1
QuadTree <= Yes: simple while working with rectangles
quadtree.py <= No: no support for needed operations
EDIT 1: Does anyone know of a better implementation than the one presented in the pygame wiki?
EDIT 2: Here are a few resources that others may find useful for path-finding techniques in Python.
Game Entity Navigation
Catch the Cootie
In this comment, joferkington refers to the current question and says:
Just for whatever it's worth, scipy.spatial.KDTree (and/or scipy.spatial.cKDTree, which is written in C for performance reasons) is a far more robust choice than the options listed.
Another library to check out is PyQuadTree, a pure Python quadtree index that also works on Python 3.x. All you need to add an item is its bounding box as a 4-length sequence, so it could be used for a variety of purposes and even negative coordinate systems.
Although I am the author, I really just took someone else's quadtree structure/code and made it more user-friendly, added support for rectangle-quads, and added documentation. A simple example of how to use it:
# SETUP
import pyqtree
spindex = pyqtree.Index(bbox=[0, 0, 1000, 500])

# ADD SOME ITEMS
for item in items:
    spindex.insert(item=item, bbox=item.bbox)

# RETRIEVE ITEMS FROM A REGION
result = spindex.intersect(bbox=[233, 121, 356, 242])
The Python Package Index turns up two other libraries when searching for quadtree: http://pypi.python.org/pypi?%3Aaction=search&term=quadtree&submit=search
Disclaimer: I've never used quadtrees or any of these libraries.
Sometimes, it is not obvious how to implement data structures like trees in Python.
For instance,
      D
    /   \
   B     F
  / \   / \
 A   C E   G
is a simple binary tree structure. In Python, you would represent it like so:
[D, B, F] is a node with value D, a left subtree B and a right subtree F. To represent the full tree you would have:
[D, [B, A, C], [F, E, G]]
That is a simple list of nested lists where any node can be a value like D or C, and any node can be a subtree which is, recursively, a list of nested lists. You could do something similar with a dictionary of dictionaries. These types of implementations are a bit quick and dirty and might not be acceptable in an assignment where the instructor expects a Node class with pointers to other nodes, but in the real world it is generally better to use the optimized implementations of Python lists/dictionaries first. Only if the result is inadequate in some way should you rewrite it to be more like what you would write in C or Java.
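To make that concrete, here is the tree above as nested lists together with a small in-order walk (a sketch, not the only way to do it):

# A leaf is a bare value; an internal node is [value, left, right].
tree = ["D", ["B", "A", "C"], ["F", "E", "G"]]

def inorder(node):
    if isinstance(node, list):
        value, left, right = node
        inorder(left)
        print(value, end=" ")
        inorder(right)
    else:
        print(node, end=" ")

inorder(tree)   # prints: A B C D E F G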
Beyond that of course you need to implement the various algorithms to manipulate your trees because a quadtree is more than just some data; it is a set of rules about how to insert and delete nodes. If this is not a coursework question, then Quadtree 0.1.2 would probably be a good idea.
So I have a huge list of entries in a DB (MySQL).
I'm using Python and Django in the creation of my web application.
This is the base Django model I'm using:
class DJ(models.Model):
    alias = models.CharField(max_length=255)
    # other fields...
In my DB I now have duplicates, e.g. Above & Beyond, Above and Beyond, Above Beyond, DJ Above and Beyond, Disk Jokey Above and Beyond, ...
This is a problem... as it blows a big hole in my DB and therefore my application.
I'm sure other people have encountered this problem and thought about it.
My ideas are the following:
Create a set of rules so a new entry cannot be created?
eg. "DJ Above and Beyond" cannot be
created because "Above & Beyond" is in
the DB
Relate these aliases to each other somehow?
eg. relate "DJ Above and Beyond" to "Above & Beyond"
I have literally no clue how to go about this; even if someone could just point me in a direction, that would be very helpful.
Any help would be very much appreciated! Thank you guys.
I guess you could do something based on Levenshtein distance, but there's no real way to do this automatically - without creating a fairly complex rules-based system.
Unless you can define a rules system that can work out for any x and y whether x is a duplicate of y, you're going to have to deal with this in a fuzzy, human way.
Stack Overflow has a fairly decent way of dealing with this - warn users if something may be a duplicate, based on something like Levenshtein distance (and perhaps some kind of rules engine), and then allow a subset of your users to merge things as duplicates if other users ignore the warnings.
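A minimal sketch of that "warn on probable duplicates" step, using difflib from the standard library as a stand-in for Levenshtein distance (the names and the threshold are made up):

import difflib

def probable_duplicates(new_alias, existing_aliases, threshold=0.7):
    # Return existing aliases that look suspiciously similar to the new one.
    new_norm = new_alias.lower()
    return [alias for alias in existing_aliases
            if difflib.SequenceMatcher(None, new_norm, alias.lower()).ratio() >= threshold]

print(probable_duplicates("DJ Above and Beyond", ["Above & Beyond", "Armin van Buuren"]))
# -> ['Above & Beyond']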
From the examples you give, it sounds like you have more a natural language problem than an exact matching problem. Given that natural language matching is inexact by nature you're unlikely to come up with a perfect solution.
String distance doesn't really work, as strings that are algorithmically close may not be semantically close (e.g. "DJ Above & Beyond" should match "Above and Beyond" but not "DJ Above & Beyond 2", which is closer in Levenshtein distance).
Some cheap alternatives to natural language parsing are soundex, which will match by phonetic sounds, and Stemming, which removes prefixes/suffixes to normalize on word stems. I suppose you could create a linked list of word roots, but this wouldn't be terribly accurate either.
If this is a User-interacting program, you could echo "near misses" to the user, e.g. "Is one of these what you meant to enter?"
You could normalize the entries in some way so that different entries map to the same normalized value (e.g. case normalize, "&" -> "And", etc, etc. which some of the above suggestions might be a step towards) to find near misses or map multiple inputs to a single value.
Add the caveat that my experience only applies to English; e.g. an English PorterStemmer won't recognize the one French title you put in there.
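For example, a crude normalization pass (illustrative only, and English-only, matching the caveat above):

import re

def normalize(name):
    name = name.lower()
    name = name.replace("&", "and")
    name = re.sub(r"\b(dj|disk jokey)\b", "", name)   # drop the job-title words
    return " ".join(name.split())                     # collapse whitespace

for alias in ["Above & Beyond", "DJ Above and Beyond", "Disk Jokey Above and Beyond"]:
    print(normalize(alias))   # all three print "above and beyond"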
I think this is more of a social problem than a programming problem. Any sort of programatic solution to natural language processing like this is going to be buggy and error prone. It's very hard to distinguish things that are close, but legitimately different from the sort of undesired duplicates that you're talking about.
As Dominic mentioned, Stack Overflow's tagging system is a pretty good model for this. It provides cues to the user that encourage them to use existing tags if appropriate (drop down lists as the user types), it allows trusted users to retag individual questions, and it allows moderators to do mass retags.
This is really a process that has to have a person directly involved.
This is not a complete solution, but one thought I had:
class DJ(models.Model):
    # other fields, no alias!
    pass

class DJAlias(models.Model):
    alias = models.CharField(max_length=255)
    dj = models.ForeignKey(DJ)
This would allow you to have several Aliases for the same dj.
But still you will need to find a proper way to ensure the aliases are added to the right DJ. See Dominic's post.
But if you check an alias against several other aliases pointing to one dj, the algorithms might work better.
You could try to solve this problem for this instance only (replacing "&" with "and", "DJ" with "Disk Jokey", or ignoring "DJ", etc.). If your table only contains DJs, you could set up a bunch of rules like those.
If your table contains more diverse stuff you will have to go with a more structural approach. Could you give a sample of your dataset?
First of all, the programming task (NLP etc., as mentioned) is of course interesting. But, as mentioned, aiming to perfect it is overkill.
The other view is, as mentioned, the "social" one: who enters the data, who views it, and how long and how correct should it be? So it's a naming-convention issue, and it reminds me of the great project musicbrainz.org. Should your site "just work", or do you prefer to follow standards? In the latter case I would orient myself along the MusicBrainz project, in case you haven't already done so or heard of it.
I.e. see here for Above & Beyond: they have one alias defined, and they use it to match user searches.
http://musicbrainz.org/show/artist/aliases.html?artistid=58438
Also check out the Artist_Alias page in the wiki.
The data model is worth a look and there are even several API bindings to sync data, also in python.
How about changing the model so that "alias" is a list of keys to another table that looks like this (skipping small words like "the", "and", etc.):
1 => Above;
2 => Beyond;
3 => Disk;
4 => Jokey;
Then when you want to insert a new record, just check how many of the significant words from the title are already in the table and match currently existing model entities. If it is more than 50% (for example), you may have a match, and you can show a list of the candidates to the visitor, asking "did you mean one of these?"
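A rough sketch of that check (hypothetical names; the stop-word list and the 50% threshold are just illustrations):

STOP_WORDS = {"the", "and", "dj", "disk", "jokey"}

def significant_words(title):
    return {w for w in title.lower().replace("&", "and").split() if w not in STOP_WORDS}

def possible_matches(new_title, existing_titles, min_overlap=0.5):
    new_words = significant_words(new_title)
    hits = []
    for title in existing_titles:
        words = significant_words(title)
        if words and len(new_words & words) / len(words) >= min_overlap:
            hits.append(title)
    return hits

print(possible_matches("DJ Above and Beyond", ["Above & Beyond", "Armin van Buuren"]))
# -> ['Above & Beyond']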
Looks like fuzzywuzzy is a perfect match to your needs.
This article explains the reason it was set up, which very closely matches your requirements -- basically, to handle situations in which two different things were named slightly differently:
One of our most consistently frustrating issues is trying to figure out whether two ticket listings are for the same real-life event (that is, without enlisting the help of our army of interns).
…
To achieve this, we've built up a library of "fuzzy" string matching routines to help us along.
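A quick sketch of what using it for this might look like (assumes pip install fuzzywuzzy; the threshold is a guess you would tune for your data):

from fuzzywuzzy import fuzz, process

existing = ["Above & Beyond", "Armin van Buuren", "Deadmau5"]

# token_set_ratio ignores word order and extra tokens such as "DJ"/"and".
print(fuzz.token_set_ratio("DJ Above and Beyond", "Above & Beyond"))   # very high score

match, score = process.extractOne("DJ Above and Beyond", existing)
if score > 80:
    print("possible duplicate of:", match)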
If you're only after artist names or generally media-related names, it might be much better to just use the API of Last.fm or EchoNest, as they already have a huge rule set and a huge database to draw on.