Does anyone know an algorithm to check which of two wildcard expressions is more general than the other?
For example I'd like to compare
*/foo/foo.bar
with
*.bar
Clearly the first expression is contained in the second. I know that this is not feasible for regexes (at least not unless you have a lot of time; as far as I remember, the problem is in the non-elementary complexity class), but it could be possible for wildcard expressions, which are far less expressive. I tried to put together a simple Python algorithm, but it gets very nasty when it comes to special cases.
Does anybody have an idea whether there is an algorithm for that problem?
UPDATE:
I do not want to use a brute-force algorithm, since it won't work in general for performance reasons.
Regards,
Gerald
You basically need to find a string that matches the more general glob but not the more specific one. Just being Captain Obvious...
Probably by replacing each * character with 0 or more random symbols.
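A deterministic variant of that idea can be sketched as follows, under two loudly stated assumptions: the only wildcard is * (no ?, no character classes), and the alphabet contains at least one character that appears in neither pattern. In that setting, the specific pattern is contained in the general one exactly when the general pattern "matches" the specific pattern read as a string, where a * in the specific pattern may only be absorbed by a * in the general one. The function name glob_contained_in is made up for this illustration.

```python
from functools import lru_cache


def glob_contained_in(specific, general):
    """Return True if every string matched by `specific` is also matched by `general`.

    Assumes '*' is the only wildcard and that some character occurs in neither pattern.
    """
    n, m = len(specific), len(general)

    @lru_cache(maxsize=None)
    def covers(i, j):
        # Can general[j:] cover specific[i:]?
        if j == m:
            return i == n
        if general[j] == '*':
            # The star absorbs nothing, or one more symbol (literal or star) of `specific`.
            return covers(i, j + 1) or (i < n and covers(i + 1, j))
        # A literal in `general` must meet the same literal in `specific`;
        # it can never meet a '*', because '*' != any literal here.
        return i < n and specific[i] == general[j] and covers(i + 1, j + 1)

    return covers(0, 0)


if __name__ == "__main__":
    print(glob_contained_in("*/foo/foo.bar", "*.bar"))
    print(glob_contained_in("*.bar", "*/foo/foo.bar"))
```

The memoised recursion visits each (i, j) pair at most once, so the check is quadratic in the pattern lengths rather than a brute-force search over candidate strings.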
I'm currently working on a neural network that evaluates students' answers to exam questions. Therefore, preprocessing the corpora for a Word2Vec network is needed. Hyphenation in German texts is quite common. There are mainly two different types of hyphenation:
1) End of line:
The text reaches the end of the line so the last word is sepa-
rated.
2) Short form of enumeration:
in case of two "elements":
Geistes- und Sozialwissenschaften
more "elements":
Wirtschafts-, Geistes- und Sozialwissenschaften
The de-hyphenated form of these enumerations should be:
Geisteswissenschaften und Sozialwissenschaften
Wirtschaftswissenschaften, Geisteswissenschaften und Sozialwissenschaften
I need to remove all hyphenations and put the words back together. I already found several solutions for the first problem.
But I have absolutely no clue how to get the second part of the words (in the example above, "wissenschaften") in the enumeration problem. I don't even know if it is possible at all.
I hope that I have explained my problem properly.
So does anyone have an idea how to solve this problem?
Thank you very much in advance!
It's surely possible, as the pattern seems fairly regular. (Something vaguely analogous is sometimes seen in English. For example: The new requirements applied to under-, over-, and average-performing employees.)
The rule seems to be roughly, "when you see word-fragments with a trailing hyphen, and then an und, look for known words that begin with the word-fragments, and end the same as the terminal-word-after-und – and replace the word-fragments with the longer words".
I'm not a German speaker, and without language-specific knowledge it isn't possible to know exactly where breaks are appropriate. That is, in your Geistes- und Sozialwissenschaften example, without language-specific knowledge, it's unclear whether the first fragment should become Geisteszialwissenschaften or Geisteswissenschaften or Geistesenschaften or Geistesaften or any other shared-suffix with Sozialwissenschaften. But if you've got a dictionary of word-fragments, or word-frequency info from other text that uses the same full-length word(s) without this particular enumeration-hyphenation, that could help choose.
(If there's more than one plausible suffix based on known words, this might even be a possible application of word2vec: the best suffix to choose might well be the one that creates a known-word that is closest to the terminal-word in word-vector-space.)
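The rule described above could be sketched roughly like this. It is only an illustration under stated assumptions: a set of known full words (vocabulary) is available from elsewhere, the conjunction is "und" or "oder", and the longest suffix of the terminal word that yields a known word is taken. The names expand_fragment and dehyphenate_enumerations are made up for this example.

```python
import re

# One or more "Fragment-" pieces (optionally followed by a comma), a conjunction, then the terminal word.
ENUMERATION = re.compile(r'((?:\w+-,?\s+)+)(und|oder)\s+(\w+)')


def expand_fragment(fragment, terminal, vocabulary):
    """Try suffixes of `terminal`, longest first, until fragment + suffix is a known word."""
    for start in range(1, len(terminal)):
        candidate = fragment + terminal[start:]
        if candidate in vocabulary:
            return candidate
    return None


def dehyphenate_enumerations(text, vocabulary):
    """Replace hyphenated enumeration fragments with full words where the vocabulary allows it."""
    def fix(match):
        fragments, conjunction, terminal = match.groups()

        def expand(piece):
            full = expand_fragment(piece.group(1), terminal, vocabulary)
            # Leave the fragment untouched if no known word can be formed.
            return full if full else piece.group(0)

        return re.sub(r'(\w+)-', expand, fragments) + conjunction + ' ' + terminal

    return ENUMERATION.sub(fix, text)


if __name__ == "__main__":
    vocabulary = {"Geisteswissenschaften", "Wirtschaftswissenschaften", "Sozialwissenschaften"}
    print(dehyphenate_enumerations("Wirtschafts-, Geistes- und Sozialwissenschaften", vocabulary))
```

When several suffixes produce known words, this sketch simply keeps the longest one; a frequency- or vector-based choice would replace the first-match return in expand_fragment.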
Since this seems a very German-specific issue, I'd try asking in forums specific to German natural-language-processing, or to libraries with specific German support. (Maybe NLTK or spaCy?)
But also, knowing word2vec, this sort of patch-up may not actually be that important to your end-goals. Training without this logical-reassembly of the intended full words may still let the fragments achieve useful vectors, and the corresponding full words may achieve useful vectors from other usages. The fragments may wind up close enough to the full compound words that they're "good enough" for whatever your next regression/classifier step does. So if this seems a blocker, don't be afraid to just try ignoring it as a non-problem. (Then if you later find an adequate de-hyphenation approach, you can test whether it really helped or not.)
I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of several thousand smaller terms in a CSV file. What I want to do is identify terms in the database that are similar to the blacklisted terms in my CSV file. Similarity in this case can be construed as misspellings of the blacklisted terms.
I am familiar with libraries in Python such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is, is it programmatically possible in python to accomplish this or would it be easier to just retrieve all the similar looking keywords and filter for false positives?
I'm not sure there's any definitive answer to this problem, so the best I can do is to explain how I'd approach this problem, and hopefully you'll be able to get any ideas from my ramblings. :-)
First.
On an unrelated angle, fuzzy string matching might not be enough. People are going to be using similar-looking characters and non-character symbols to get around any text matches, to the point where there's nearly zero match between a blacklisted word and actual text, and yet it's still readable for what it is. So perhaps you will need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X' etc. I believe there are libraries and/or conversion references to that purpose. Non-latin symbols are also a distinct possibility and should be accounted for.
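A minimal sketch of such a normalization step, using only the standard library. The substitution table is an invented example, not a reference list, and because everything is lower-cased first, '0' maps to 'o' rather than to a capital 'O'.

```python
# Hypothetical look-alike table; a real project would need a far larger one.
LOOKALIKES = str.maketrans({
    '0': 'o',
    '1': 'l',
    '3': 'e',
    '4': 'a',
    '5': 's',
    '@': 'a',
    '$': 's',
})


def normalize(text):
    """Lower-case the text and fold common look-alike symbols into plain letters."""
    text = text.lower().replace('><', 'x')  # multi-character look-alike handled separately
    return text.translate(LOOKALIKES)


if __name__ == "__main__":
    print(normalize("H0TEL$ in Rio"))
```

Both the blacklist and the text being searched would go through the same function before any fuzzy comparison.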
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check if what you found matches some legal words too. Which means you will need not only the blacklisted dictionary, but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people who're writing acceptable text are probably going to write it in acceptable language without any tricks outlined above. Or you could normalize it if you're feeling paranoid. :-)
Third.
However, the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write a "big but" intentionally to get a higher match against a whitelisted word, even if context makes it quite obvious what the real meaning is). So you might have to actually check the context in case there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify what words are more (or less) frequently used in combination with them, and use them to check your specific example. Using this very paragraph as example, a word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
Fourth.
Even applying all those measures together you might want to leave a certain amount of gray area. That is, unless there's a high enough certainty in either direction, leave the final decision for a human (screening comments/posts for a time, automatically putting them into moderation queue, or whatever else your project dictates).
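The blacklist pass, whitelist check and gray area could be combined along these lines. This is a sketch only: it uses difflib.SequenceMatcher from the standard library in place of fuzzywuzzy, and the thresholds 0.9 and 0.75 and the function name classify are arbitrary assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def classify(term, blacklist, whitelist, block_at=0.9, review_at=0.75):
    """Return 'block', 'review' (send to a human) or 'allow' for a single term."""
    best = max((similarity(term, bad) for bad in blacklist), default=0.0)
    if best < review_at:
        return "allow"       # nothing on the blacklist looks similar
    if term in whitelist:
        return "allow"       # looks similar, but is a known legitimate term
    if best >= block_at:
        return "block"       # confident match against the blacklist
    return "review"          # gray area: leave the decision to a human


if __name__ == "__main__":
    blacklist = ["big butt"]
    whitelist = {"big but"}
    for term in ["big butt", "big but", "big bat", "zebra videos"]:
        print(term, "->", classify(term, blacklist, whitelist))
```

Comparing every term against every blacklist entry is quadratic, so for a million rows the candidate pairs would need to be narrowed first, for example by word count or length.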
Fifth.
You might try to dabble in learning algorithms, collecting human input from previous step and using it to automatically fine-tune your algorithm as time goes by.
Hope that helps. :-)
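import numpy as np
# For each document, sum the occurrence counts of every dictionary entry (entries are UTF-8 byte strings).
n_dicwords = [np.sum([c.lower().count(w.decode('utf-8')) for w in dictionary])
              for c in documents]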
Here I am trying to determine my feature engineering computation time:
Using this line of code, which goes through every document and counts how many of its words also appear in a dictionary that I have, I generate a feature called n_dicwords. Sorry, I am a noob at complexity theory; I think the time complexity for generating this feature is O(n*m*w), where n is the number of documents, m is the number of words in each document and w is the number of words in the dictionary. Am I right? And if so, is there any way to improve this?
Thank you so much! I really appreciate your help!
Unless the code underneath your code does any clever stuff your complexity analysis should be correct.
If performance in this part is important you should use a multiple-pattern string search algorithm, which attempts to solve pretty much the exact problem you are doing.
To start with have a look at Aho-Corasick which is the most commonly used one and runs in linear time. Googling "Aho-Corasick python" turned up a few different implementations, so while I have not used any of them personally I would think you would not have to implement the algorithm itself to use it.
If you just need your code to run a little faster, and don't need the best possible performance, you could simply use a set for the dictionary. In Python a normal set is a hash set, so it has constant-time lookup on average. Then, for each word, you could just check whether it is in the dictionary.
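A minimal sketch of the set-based variant. Note one assumption that changes behaviour slightly: it splits each document on whitespace and counts whole-word hits, whereas the original line counts substring occurrences. The function name count_dictionary_words is made up for this example.

```python
def count_dictionary_words(documents, dictionary):
    """For each document, count how many of its whitespace-separated tokens are in the dictionary."""
    vocab = {word.lower() for word in dictionary}  # built once; average O(1) membership tests
    return [
        sum(1 for token in doc.lower().split() if token in vocab)
        for doc in documents
    ]


if __name__ == "__main__":
    docs = ["Big giant cars", "zebra videos of cars"]
    print(count_dictionary_words(docs, ["cars", "zebra"]))
```

With the set built up front, the work is roughly proportional to the total number of tokens, independent of the dictionary size.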
I'm slightly surprised to note that the "x in s" construction in Python is O(n), where n is the number of items in the list. So, your estimation is correct. A slightly more correct way of looking at it: since your document or word counts aren't changing at all, the important numbers are the total number of words which must be checked, and the length of the dictionary against which they are being checked. Obviously, this doesn't change the number of computations at all, it just gets us to a quickly recognizable form of O(m*n).
You could conceivably store your dictionary in a binary tree, which would reduce each lookup to O(log(n)).
Search for "binary tree python" on Google; I saw a few interesting things out there, like a package called "bintrees".
However, Erik Vesteraas points out that the Python 'set' data structure is a hash-based collection, with a lookup complexity of O(1) in the average case and O(n) in the rare worst case.
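For illustration, the same O(log(n)) lookup can be had without a tree package by keeping the dictionary as a sorted list and using binary search from the standard library's bisect module. This swaps a sorted list in for the binary tree mentioned above, and contains_sorted is a made-up helper name.

```python
import bisect


def contains_sorted(sorted_words, word):
    """Binary-search membership test on an already sorted list of words."""
    index = bisect.bisect_left(sorted_words, word)
    return index < len(sorted_words) and sorted_words[index] == word


if __name__ == "__main__":
    words = sorted(["zebra", "cars", "hotels"])
    print(contains_sorted(words, "cars"), contains_sorted(words, "videos"))
```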
See https://docs.python.org/2/library/stdtypes.html#set
I have the special case of the problem, but it would be nice to know whether it is possible for any function.
So I want to find the position of a substring in a string. Ok, in python there is a find method which does exactly what is needed.
string.find(s, sub[, start[, end]])
Return the lowest index in s where
the substring sub is found such that sub is wholly contained in
s[start:end]. Return -1 on failure. Defaults for start and end and
interpretation of negative values is the same as for slices.
Amazing, but the problem is that finding a big substring in a big string can run from O(n*m) to O(n) (which is a huge deal) depending on the algorithm. Documentation gives no information about time complexity, nor information about the underlying algorithm.
I see a few approaches to resolve this:
benchmark
go to source code and try to understand it
Neither sounds really easy (I hope that there is an easier way). So how can I find the complexity of a built-in function?
You say, "go to source code and try to understand it," but it might be easier than you think. Once you get to the actual implementation code, in Objects/stringlib/fastsearch.h, you find:
/* fast search/count implementation, based on a mix between boyer-
moore and horspool, with a few more bells and whistles on the top.
for some more background, see: http://effbot.org/zone/stringlib.htm */
The URL referenced there has a good discussion of the algorithm and its complexity.
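The benchmarking route mentioned in the question can also be sketched quickly with timeit. The input shape below (a long run of 'a' followed by a 'b', searched for a shorter run of the same form) is just one assumed hard-looking case, not a proven worst case for the real implementation, and time_find is a made-up helper name.

```python
import timeit


def time_find(n, repeats=5):
    """Best-of-`repeats` time in seconds for 10 calls of str.find on inputs of size about n."""
    haystack = "a" * n + "b"
    needle = "a" * (n // 2) + "b"
    timings = timeit.repeat(lambda: haystack.find(needle), number=10, repeat=repeats)
    return min(timings)


if __name__ == "__main__":
    # If the time grows about 10x per row the behaviour looks linear here;
    # growth of about 100x per row would suggest quadratic behaviour on this input.
    for n in (1_000, 10_000, 100_000):
        print(n, time_find(n))
```

A benchmark only shows behaviour on the inputs tried, so it complements reading the source rather than replacing it.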
Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.
So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.
To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.
The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.
EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.
Writing an implementation of a regular expression engine is indeed a quite complex task.
But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.
I've already given a +1 to Mark Byers - but as far as I remember the paper doesn't really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?
I'll focus on the good approach - creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn't really too difficult.
What I'll (very quickly) describe is the approach taken in Modern Compiler Design.
Imagine you have the following regular expression...
a (b c)* d
The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we'll take as the state where nothing has been matched yet, so the dot goes at the front...
0 : .a (b c)* d
The only possible match is 'a', so the next state we derive is...
1 : a.(b c)* d
We now have two possibilities - match the 'b' (if there's at least one repeat of 'b c') or match the 'd' otherwise. Note - we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we'll need to queue one of our cases for later consideration, but I'll ignore that issue from here on. Anyway, we've discovered two new states...
2 : a (b.c)* d
3 : a (b c)* d.
State 3 is an end state (there may be more than one). For state 2, we can only match the 'c', but we need to be careful with the dot position afterwards. We get "a.(b c)* d" - which is the same as state 1, so we don't need a new state.
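The four states discovered so far can be written down directly as a transition table and run. This is a hand-transcribed sketch of the automaton for a (b c)* d from the walk-through, with the spaces in the pattern treated as non-significant; the names TRANSITIONS, ACCEPTING and matches are made up for illustration.

```python
# (state, character) -> next state, exactly as derived from the dotted rules above.
TRANSITIONS = {
    (0, 'a'): 1,  # .a (b c)* d  ->  a.(b c)* d
    (1, 'b'): 2,  # enter a repetition
    (1, 'd'): 3,  # skip or finish the repetitions
    (2, 'c'): 1,  # back to the state before the next repetition
}
ACCEPTING = {3}


def matches(text):
    """Run the automaton over the whole text; any missing transition means failure."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for sample in ["ad", "abcd", "abcbcd", "abd", "a"]:
        print(sample, matches(sample))
```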
IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...
1 : a.b c (b c)* d
a.d
That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the '.') as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).
EDIT If you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.
In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any members in the set always produces another member that is in the set. The closure of a set is the minimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators - the minimal set of reachable states.
Minimal here refers to the closure process - there may be a smaller equivalent automaton, which is what is normally referred to as minimal.
With this basic idea in mind, it's not too difficult to say "if I have two state machines representing two sets of strings, how do I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will be a current state (or set of current states) from each input automaton and perhaps additional details.
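A minimal sketch of that construction, assuming each automaton is given as a tuple (start, transitions, accepting) with transitions as a dict keyed by (state, character), and a missing entry standing for a dead state. The representation and the names product_dfa and accepts are assumptions made for this example. The states of the result are pairs, discovered breadth-first exactly as in the closure process described above.

```python
from collections import deque


def product_dfa(dfa_a, dfa_b, alphabet, combine):
    """Build the reachable product automaton; `combine` decides acceptance from both sides."""
    start = (dfa_a[0], dfa_b[0])
    transitions, accepting = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        state_a, state_b = state
        if combine(state_a in dfa_a[2], state_b in dfa_b[2]):
            accepting.add(state)
        for ch in alphabet:
            # None plays the role of a dead state on either side.
            nxt = (dfa_a[1].get((state_a, ch)), dfa_b[1].get((state_b, ch)))
            transitions[(state, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, transitions, accepting


def accepts(dfa, text):
    """Run a (start, transitions, accepting) automaton over the whole text."""
    start, transitions, accepting = dfa
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting


if __name__ == "__main__":
    a = (0, {(0, 'a'): 1, (1, 'b'): 2, (1, 'd'): 3, (2, 'c'): 1}, {3})  # a (b c)* d
    b = (0, {(0, 'a'): 1, (1, 'd'): 2}, {2})                             # just "ad"
    difference = product_dfa(a, b, "abcd", lambda in_a, in_b: in_a and not in_b)
    print(accepts(difference, "abcd"), accepts(difference, "ad"))
```

Passing a different combine function (or, and, "and not") gives union, intersection or set difference from the same loop.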
If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You start by grouping your states into coarse equivalence classes or "blocks" (at the very least, accepting states apart from non-accepting ones). Then you repeatedly test whether you need to split blocks (the states aren't really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.
Hopcroft's algorithm is an efficient way to handle this basic idea.
A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form. Furthermore, Hopcroft's algorithm will produce the same representation of that minimal form, no matter what representation of what larger case it started from. That is, this is a "canonical" representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.
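The block-splitting idea can be sketched as a plain partition refinement. To be clear about what this is: it is the simple repeated-refinement scheme (often attributed to Moore), not Hopcroft's more efficient algorithm, and it assumes a complete automaton in which every state has a transition for every character. The name equivalent_blocks is made up for this example.

```python
def equivalent_blocks(states, alphabet, transitions, accepting):
    """Partition `states` into blocks of equivalent states by repeated refinement."""
    # Initial split: accepting versus non-accepting states.
    block_of = {s: int(s in accepting) for s in states}
    n_blocks = len(set(block_of.values()))
    while True:
        ids, refined = {}, {}
        for s in states:
            # A state's signature: its current block plus the block reached on each character.
            signature = (block_of[s],) + tuple(block_of[transitions[(s, ch)]] for ch in alphabet)
            refined[s] = ids.setdefault(signature, len(ids))
        if len(ids) == n_blocks:
            break  # no block was split in this round, so the partition is stable
        block_of, n_blocks = refined, len(ids)
    groups = {}
    for s in states:
        groups.setdefault(block_of[s], []).append(s)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Strings over {a, b} ending in 'a'; state 2 is a redundant copy of state 0.
    transitions = {(0, 'a'): 1, (0, 'b'): 2, (1, 'a'): 1, (1, 'b'): 0, (2, 'a'): 1, (2, 'b'): 0}
    print(equivalent_blocks([0, 1, 2], "ab", transitions, {1}))
```

Each resulting block becomes one state of the minimised automaton.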
The above is probably a bit sloppy WRT definitions, so make sure you look up any terms before using them yourself, but with a bit of luck this gives a fair quick introduction to the basic ideas.
BTW - have a look around the rest of Dick Grune's site - he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you'll see, there's a second edition imminent.
"A play on regular expressions: functional pearl" takes an interesting approach. The implementation is given in Haskell, but it's been reimplemented in Python at least once.
The developed program is based on an old technique to turn regular expressions into finite automata which makes it efficient both in terms of worst-case time and space bounds and actual performance: despite its simplicity, the Haskell implementation can compete with a recently published professional C++ program for the same problem.
There's an interesting (if slightly short) chapter in Beautiful Code by Brian Kernighan, appropriately called "A Regular Expression Matcher". In it he discusses a simple matcher that can match literal characters, and the .^$* symbols.
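To give a flavour of how small such a matcher can be, here is a Python sketch in the spirit of that chapter. It is a loose port written for this illustration (the chapter's own code is in C), it supports only literal characters, '.', '^', '$' and '*', and it uses simple backtracking rather than an automaton.

```python
def match(regexp, text):
    """Search for regexp anywhere in text."""
    if regexp.startswith('^'):
        return match_here(regexp[1:], text)
    # Try every starting position, including the empty suffix.
    return any(match_here(regexp, text[i:]) for i in range(len(text) + 1))


def match_here(regexp, text):
    """Match regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == '*':
        return match_star(regexp[0], regexp[2:], text)
    if regexp == '$':
        return not text
    if text and regexp[0] in ('.', text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(char, regexp, text):
    """Match zero or more occurrences of char, followed by regexp, at the beginning of text."""
    i = 0
    while True:
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and char in ('.', text[i]):
            i += 1
        else:
            return False


if __name__ == "__main__":
    print(match("^a.*d$", "abcd"), match("b*c", "aabbbc"), match("^b", "ab"))
```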
I do agree that writing a regex engine will improve your understanding, but have you taken a look at ANTLR? It generates parsers automatically for any kind of language. So maybe you can try your hand by taking one of the language grammars listed at Grammar examples and stepping through the AST and parser that it generates. The generated code is really complicated, but you will get a good understanding of how a parser works.