Text Parsing Design - python

Let's say I have a paragraph of text like so:
Snails can be found in a very wide
range of environments including
ditches, deserts, and the abyssal
depths of the sea. Numerous kinds of snail can
also be found in fresh waters.
I have 10,000 regex rules to match against the text, and their matches can overlap. For example, the regex /Snails? can/i finds two matches in the text above, and the regex /can( also)? be/i also finds two matches.
After iterating through my regexes and finding matches, what is the best data structure to use so that, given some position in the text, it returns all regexes that matched it? For example, if I ask for the matches at line 1, character 8 (0-based, which is the a in can), I would get a match for both regexes described above.
I can create a hashmap: (key: character location, value: set of all matching regexes). Is this optimal? Is there a better way to parse the text with thousands of regexes (to not loop through each one)?
Thanks!

Storing all of the matches in a dictionary will work, but it means you'll have to keep all of the matches in memory at the same time. If your data is small enough to easily fit into memory, don't worry about it. Just do what works and move on.
If you do need to reduce memory usage or increase speed, it really depends on how you are using the data. For example, if you process positions starting at the beginning and going to the end, you could use re.finditer to iteratively process all of the regexes and not keep matches in memory longer than needed.
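As a minimal sketch of the dictionary approach (the sample text is shortened and the two patterns stand in for your 10,000 rules; matches_at is just an illustrative name):

import re
from collections import defaultdict

text = ("Snails can be found in a very wide range of environments "
        "including ditches, deserts, and the abyssal depths of the sea. "
        "Numerous kinds of snail can also be found in fresh waters.")

# Stand-ins for the full rule set.
patterns = [re.compile(r"snails? can", re.I),
            re.compile(r"can( also)? be", re.I)]

# character position -> set of indexes of the patterns whose match covers it
matches_at = defaultdict(set)
for i, pattern in enumerate(patterns):
    for m in pattern.finditer(text):
        for pos in range(m.start(), m.end()):
            matches_at[pos].add(i)

print(matches_at[8])   # position 8 is the 'a' in the first "can": {0, 1}

Note that this stores one set entry per covered character, which is exactly the memory cost discussed above; it is fine for small inputs but grows quickly with 10,000 rules.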

I'm assuming that your regexes do not cross sentence boundaries. In that case you could:
1) break your text into an array of sentences
2) for each sentence, simply record which regexes (by id) matched
3) when you want to see the actual match, run the regex again (see the sketch below)
"Store less / compute more" solution.

Related

Python Regex: How to get all matches of the _entire_ regex in a string with different occurrences and multiple regexes involved

I extracted text from a PDF and I am using re.finditer() at the moment, but as the documentation on re.match() says, the latter already returns a match object if "zero or more characters at the beginning of string match the regular expression pattern".
re.finditer() also behaves that way: apparently it is enough for the beginnings of two quite similar parts of the string to agree for them to be reported as occurrences, or "matches", of the same compiled regular expression, which is NOT what I want/need.
In order to correctly "parse" the full text extracted from the PDF I will need to employ multiple regular expressions, and I must employ them in their entirety. Either a block (whose size is unknown beforehand) of that text fully satisfies the unique, specific pattern type, or it does not.
Sadly re.fullmatch is not an alternative, because it wants to match the whole text; but as I said, the whole text is a composition of different regexp patterns, partly with multiple occurrences that differ only on a very individual level, such as the name of the store where I purchased things, which is still within the specificity of the respective regexp type as a capturing group that I need to process further.
Hence the question: what else can I use besides re.finditer() if I don't know the start and end position of each possible block? Finding out where the borders of each type-block instance are is one reason I want to test the text against multiple regexes.

Matching list of words in any order but only if adjacent/separated by a maximum of n words

I have only been able to find advice on how to match a number of substrings in any order, or separated by a maximum of n words (not both at once).
I need to implement a condition in python where a number of terms appear in any order but separated by e.g. a maximum of one word, or adjacent. I've found a way to implement the "in any order" part using lookarounds, but it doesn't account for the adjacent/separated by a maximum of one word issue. To illustrate:
re.search("^.*(?=.*word1\s*\w*\sword2)(?=.*\w*)(?=.*word3\w*).*$", "word1 filler word2 and word3")
This should match "word1 word2" or "word1 max1word word2" and "word3*", in any order, separated by one word as in this case - which it does. However, it also matches a string where the terms are separated by two or more words, which it should not. I tried doing it like this:
re.search("^.*(?=\s?word1\s*\w*\sword2)(?=\s?\w*)(?=\s?word3\w*).*$", "word1 word2 word3")
hoping that using \s? at the beginning of each bracketed term instead of .* would fix it but that doesn't work at all (no match even when there should be one).
Does anyone know of a solution?
In the actual patterns I'm looking for it's more than just two separate strings, so writing out each possible combination is not feasible.
Well, your question is not perfectly clear, but you could try this, assuming word1, word2, and word3 are known words:
(?:word1(\s\w+)?\sword2)|(?:word2(\s\w+)?\sword1)|word3
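For example, a quick check of that pattern against the strings from the question (purely illustrative):

import re

pattern = re.compile(r"(?:word1(\s\w+)?\sword2)|(?:word2(\s\w+)?\sword1)|word3")

print(bool(pattern.search("word1 filler word2 and word3")))  # True (one filler word)
print(bool(pattern.search("word2 word1")))                    # True (reversed order)
print(bool(pattern.search("word1 one two word2")))            # False (two words apart)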
I'm trying to identify patents related to AI using a keyword search in abstracts and titles
In that case, I wouldn't recommend regular expressions at all.
Instead,
Use a library that provides stemming.
This is how your code knows that "program", "programming", "programmer", etc. are related; it is much more robust than a regular expression solution and more likely to be semantically accurate.
Use a library that can generate n-grams.
N-grams are essentially tuples of consecutive words from a text. You can infer that words appearing in the same tuple are more likely to be related to each other.
Use a library of stop words.
These are words that are so common as to be unlikely to provide value, like "and".
Putting this all together, 3-grams for
inductive randomword logic and programming
would be something like
(induct, randomword, logic)
(randomword, logic, program)
Applying this to an arbitrary article, you can then look for stemmed n-grams with whatever terms you wish. Adjust n if you find you're having too many false negatives or positives.
This isn't the place for a complete tutorial, but a library like NLTK provides all of these features.
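A rough sketch of that pipeline with NLTK (the Porter stemmer and the standard English stop-word list are just one possible choice; NLTK's punkt and stopwords data need to be downloaded once):

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

# nltk.download("punkt"); nltk.download("stopwords")   # first run only

text = "inductive randomword logic and programming"
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

# tokenize, drop stop words, and stem what is left
tokens = [stemmer.stem(t) for t in word_tokenize(text.lower())
          if t.isalpha() and t not in stop_words]

print(list(ngrams(tokens, 3)))
# [('induct', 'randomword', 'logic'), ('randomword', 'logic', 'program')]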

Is a single big regex more efficient than a bunch of smaller ones?

I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.
There are many possible forms of that code, for example:
UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...
What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?
It depends on what kind of big regex you write. If you end up with a pathological pattern, it's better to test smaller patterns. Example:
UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}
This pattern is very inefficient because it starts with an alternation: in the worst case (no match), each alternative needs to be tested at every position in the string.
(Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with literal characters.)
But if you write your pattern like this:
(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})
or the variation:
[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})
Most of the positions where the match isn't possible are quickly discarded before the alternation.
Also, when you write a single pattern, obviously, the string is parsed only once.
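If you want to see the difference for yourself, a small timing sketch along these lines can help (the test string, stand-in patterns, and iteration count are arbitrary; results vary by engine and input):

import re
import timeit

text = "x" * 200_000 + "DE" + "a" * 20   # mostly non-matching text

separate = [re.compile(p) for p in
            (r"UK[A-Za-z]{10}", r"DE[A-Za-z]{20}", r"PL[A-Za-z]{7}")]
naive = re.compile(r"UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}")
filtered = re.compile(r"(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})")

print(timeit.timeit(lambda: [p.search(text) for p in separate], number=50))
print(timeit.timeit(lambda: naive.search(text), number=50))
print(timeit.timeit(lambda: filtered.search(text), number=50))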

How can I use a regex to match characters that aren't included in certain words?

Suppose I want to return all occurrences of 'lep' in a string in Python, but not if an occurrence is in a substring like 'filepath' or 'telephone'. Right now I am using a combination of negative lookahead/lookbehind:
(?<!te|fi)lep(?!hone|ath)
However, I do want 'telepath' and 'filephone' as well as 'filep' and 'telep'. I've seen similar questions but not one that addresses this type of combination of lookahead/behind.
Thanks!
You can place lookaheads inside lookbehinds (and vice-versa; any combination, really, so long as every lookbehind has a fixed length). That allows you to combine the two conditions into one (doesn't begin with X and end with Y):
lep(?<!telep(?=hone))(?<!filep(?=ath))
Putting the lookbehinds last is more efficient, too. I would advise doing it that way even if there's no suffix (for example, lep(?<!filep) to exclude filep).
However, generating the regexes from user input like lep -telephone -filepath promises to be finicky and tedious. If you can, it would be much easier to search for the unwanted terms first and eliminate them. For example, search for:
(?:telephone|filepath|(lep))
If the search succeeds and group(1) is not None, it's a hit.
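A quick sketch of that check (the sample strings are only illustrative):

import re

pattern = re.compile(r"(?:telephone|filepath|(lep))")

for text in ("telephone", "filepath", "telepath", "filephone", "telep", "filep"):
    hits = [m.start() for m in pattern.finditer(text) if m.group(1) is not None]
    print(text, hits)
# 'telephone' and 'filepath' give no hits; the others each report the position of 'lep'.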

Reading text from a file, then writing to another file with repetitions in text marked

I'm a beginner to both Python and to this forum, so please excuse any vague descriptions or mistakes.
I have a problem regarding reading/writing to a file. What I'm trying to do is to read a text from a file and then find the words that occur more than one time, mark them as repeated_word and then write the original text to another file but with the repeated words marked with star signs around them.
I find it difficult to understand how I'm going to compare just the words (without punctuation etc.) but still be able to write the words in their original context to the file.
I have been recommended to use regex by some, but I don't know how to use it. Another approach is to iterate through the text string, tokenize and normalize it, sort of by going through each character, and then make some kind of object or element out of each word.
I am thankful to anyone who might have ideas on how to solve this. The main problem is not how to find which words are repeated, but how to mark them and then write them to the file in their context. Some help with the coding would be much appreciated, thanks.
EDIT
I have updated the code with what I've come up with so far. If there is anything you would consider "bad coding", please comment on it.
To explain the Whitelist class, the assignment has two parts, one of where I am supposed to mark the words and one regarding a whitelist, containing words that are "allowed repetitions", and shall therefore not be marked.
I have read heaps of stuff about regex but I still can't get my head around how to use it.
Basically, you need to do two things: find which words are repeated, and then transform each of these words into something else (namely, the original word with some marker around it). Since there's no way to know which words are repeated without going through the entire file, you will need to make two passes.
For the first pass, all you need to do is extract the words from the text and count how many times each one occurs. In order to determine what the words are, you can use a regular expression. A good starting point might be
regex = re.compile(r"[\w']+")
The function re.compile creates a regular expression from a string. This regular expression matches any sequence of one or more word characters (\w) or apostrophes, so it will catch contractions but not punctuation, and I think in many "normal" English texts this should capture all the words.
Once you have created the regular expression object, you can use its finditer method to iterate over all matches of this regular expression in your text.
for match in regex.finditer(text):
    word = match.group()
You can use the Counter class to count how many times each word occurs. (I leave the implementation as an exercise. :-P The documentation should be quite helpful.)
After you've gotten a count of how many times each word occurs, you will have to pick out those whose counts are 2 or more, and come up with some way to identify them in the input text. I think a regular expression will also help you here. Specifically, you can create a regular expression object which will match any of a selected set of words, by compiling the string consisting of the words joined by |.
regex = re.compile('|'.join(words))
where words is a list or set or some iterable. Since you're new to Python, let's not get too fancy (although one can); just code up a way to go through your Counter or whatever and create a list of all words which have a count of 2 or more, then create the regular expression as I showed you.
Once you have that, you'll probably benefit from the sub method, which takes a string and replaces all matches of the regular expression in it with some other text. In your case, the replacement text will be the original word with asterisks around it, so you can do this:
new_text = regex.sub(r'*\g<0>*', text)
In a regular expression replacement, \g<0> refers to whatever was matched by the regex.
Finally, you can write new_text to a file.
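A condensed sketch of those steps end to end, in case seeing the flow helps (the file names are placeholders, and the word-boundary handling is a simplification you may want to refine):

import re
from collections import Counter

with open("input.txt") as f:      # placeholder file name
    text = f.read()

word_re = re.compile(r"[\w']+")
counts = Counter(m.group().lower() for m in word_re.finditer(text))
repeated = [w for w, n in counts.items() if n >= 2]

if repeated:
    marker = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in repeated) + r")\b",
                        re.IGNORECASE)
    text = marker.sub(r"*\g<0>*", text)

with open("output.txt", "w") as f:   # placeholder file name
    f.write(text)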
If you know that the text only contains alphabetic characters, it may be easier to just ignore characters that are outside of a-z than to try to remove all the punctuation.
Here is one way to remove all characters that are not a-z or space:
file = ''.join(c for c in file if 97 <= ord(c) <= 122 or c == ' ')
This works because ord() returns the ASCII code for a given character, and ASCII 97-122 represent a-z (in lowercase).
Then you'll want to split those into words, you can accomplish that like:
words = file.split()
If you pass this to the Counter data structure it will count the occurrences of each word.
counter = Counter(words)
Then counter.items() will contain a mapping from word to number of occurrences.
OK. I presume that this is a homework assignment, so I'm not going to give you a complete solution. But, you really need to do a number of things.
The first is to read the input file into memory. Then split it into its component words (tokenize it), probably held in a list, suitably cleaned up to remove stray punctuation. You seem to be well on your way to doing that, but I would recommend you look at the split() and strip() methods available for strings.
You need to consider whether you want the count to be case sensitive or not, and so you might want to convert each word in the list to (say) lowercase to keep this consistent. So you could do this with a for loop and the string lower() method, but a list-comprehension is probably better.
You then need to go through the list of words and count how many times each one appears. If you check out collections.Counter you will find that it does the heavy lifting for you; alternatively, you will need to build a dictionary which has the words as keys and their counts as values. (You might also want to check out the collections.defaultdict class here as well.)
Finally, you need to go through the text you've read from the file and for each word it contains which has more than one match (i.e. the count in the dictionary or counter is > 1) mark it appropriately. Regular expressions are designed to do exactly this sort of thing. So I recommend you look at the re library.
Having done that, you simply then write the result to a file, which is simple enough.
Finally, with respect to your file operations (reading and writing) I would recommend you consider replacing the try ... except construct with a with ... as one.
