Replacing a group of words from a long document [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I was trying to solve this problem.
I have a long-form technical document (600 pages). I need to substitute groups of words with their abbreviations. Say, 'Long Form Document' should be detected and replaced with LFD in the text.
I have a list of these word groups and their abbreviations. The length of the groups is not fixed: anywhere from 2 to 6 words may be replaced by a single abbreviation.
I have tried creating n-grams and substituting, but that distorts the document with unnecessary combinations, and the token count matters. I also tried a regex with a window of 5 words and capital letters not preceded by a full stop. Please suggest a suitable solution.

Below is an example in Python 3. There is no tokenisation: all terms to be abbreviated are simply OR-ed together into one pattern, and each match is replaced with the corresponding abbreviation.
The performance of this solution with a large dictionary of abbreviations would have to be determined experimentally.
If 600 pages are too much to load into memory, the file can be processed line by line, assuming no group spans several lines. If that can happen, a sliding window of two or three lines can be processed at a time, advancing by one line.
import re
text = '''I was trying to solve this problem. I have a long form technical document (600 pages).
I need to substitute group of words with their abbreviations.
Say, 'Long Form Document' is to be detected and replaced with LFD in the text.
I have a list of these group of words and their abbreviation.
Also, the length of words is not fixed, it ranges from 2-6 words to be replaced by one single abbreviation.
I have tried creating n-grams and substituting but it distorts the document with unnecessary combinations and count of tokens is important.
I also tried using a regex with window of 5 words and capital alphabets not preceded by full stop. Please suggest a suitable solution.'''
abbr = {
    'Long Form Document': 'LFD',
    'Short Form Document': 'SFD'
}
def abbreviate(text: str, abbr: dict) -> str:
    # Longest keys first so longer terms win; escape them so they match literally.
    keys = sorted(abbr, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, keys)), re.I)
    # Look the match up case-insensitively: with re.I the matched text may be
    # cased differently from the key, and a plain abbr[m.group()] would raise KeyError.
    lookup = {k.lower(): v for k, v in abbr.items()}
    return pattern.sub(lambda m: lookup[m.group().lower()], text)
# Test:
print(abbreviate(text, abbr))
Output:
I was trying to solve this problem. I have a long form technical document (600 pages).
I need to substitute group of words with their abbreviations.
Say, 'LFD' is to be detected and replaced with LFD in the text.
I have a list of these group of words and their abbreviation.
Also, the length of words is not fixed, it ranges from 2-6 words to be replaced by one single abbreviation.
I have tried creating n-grams and substituting but it distorts the document with unnecessary combinations and count of tokens is important.
I also tried using a regex with window of 5 words and capital alphabets not preceded by full stop. Please suggest a suitable solution.

Related

Find regex match in large dictionary using python quickly

I have a large dictionary with regex patterns as keys and numeric values as values. Given a corpus (broken down into a list of individual word tokens), I would like to find the regex that best matches each word, to obtain its respective value.
The dictionary contains many regexes that are ambiguous, in the sense that a word may match several of them, so you want the longest regex, i.e. the 'best match' (e.g. the dictionary contains affect+ as well as affected and affection).
My issue is that when running a large text sample through the dictionary and finding the regex match of each word token, it takes a long time (0.1s per word), which obviously adds up over thousands of words. This is because it goes through the whole dictionary each time to find the 'best match'.
Is there a faster way to achieve this? Please see the problematic part of my code below.
for word in textTokens:
    for reg, value in dictionary.items():
        if re.match(reg, word):
            matchedWords.append(reg)
Because you mentioned the input regexes have the structure word+ (a simple word plus a regex + symbol), you can use a modified version of the Aho-Corasick algorithm. In that algorithm you build a finite state machine from the search patterns, and it can be extended to accept some regex constructs. In your case a very simplistic solution would be to pad your keys to the length of the longest word in the list and accept anything that comes after the padding. The wildcards . and ? are easy to implement; for * you have to either go to the end of the word or return and follow the other path through the machine, which can mean an exponential number of choices in constant memory (as with any deterministic finite automaton).
A finite state machine for a list of regex keys can be built in linear time, i.e. in time proportional to the sum of the lengths of your dictionary keys. As explained here, there is also support for the longest matched word in the dictionary.
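A minimal sketch of the trie part of that idea (no Aho-Corasick failure links; each token is walked through the trie once instead of being tested against every regex, and word+ is treated as "word followed by anything", as suggested above). The sample keys mirror the affect+/affected/affection example; everything else is illustrative:

```python
def build_trie(patterns):
    root = {}
    for pat, value in patterns.items():
        node = root
        for ch in pat.rstrip('+'):
            node = node.setdefault(ch, {})
        node['$'] = (pat, value)          # '$' marks the end of a stored key
    return root

def best_match(trie, word):
    """Return the (pattern, value) of the longest stored key matching word."""
    node, best = trie, None
    for ch in word:
        if '$' in node:
            best = node['$']              # remember the longest key seen so far
        if ch not in node:
            return best
        node = node[ch]
    return node.get('$', best)

trie = build_trie({'affect+': 1, 'affected': 2, 'affection': 3})
print(best_match(trie, 'affected'))   # exact key wins over the prefix key
print(best_match(trie, 'affects'))    # falls back to the prefix key affect+
```

Lookup time is now proportional to the length of the word, not to the size of the dictionary.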

Matching list of words in any order but only if adjacent/separated by a maximum of n words

I have only been able to find advice on how to match a number of substrings in any order, or separated by a maximum of n words (not both at once).
I need to implement a condition in python where a number of terms appear in any order but separated by e.g. a maximum of one word, or adjacent. I've found a way to implement the "in any order" part using lookarounds, but it doesn't account for the adjacent/separated by a maximum of one word issue. To illustrate:
re.search("^.*(?=.*word1\s*\w*\sword2)(?=.*\w*)(?=.*word3\w*).*$", "word1 filler word2 and word3")
This should match "word1 word2" or "word1 max1word word2" and "word3*", in any order, separated by one word as in this case - which it does. However, it also matches a string where the terms are separated by two or more words, which it should not. I tried doing it like this:
re.search("^.*(?=\s?word1\s*\w*\sword2)(?=\s?\w*)(?=\s?word3\w*).*$", "word1 word2 word3")
hoping that using \s? at the beginning of each bracketed term instead of .* would fix it, but that doesn't work at all (no match even when there should be one).
Does anyone know of a solution?
In the actual patterns I'm looking for it's more than just two separate strings, so writing out each possible combination is not feasible.
Well, your question is not perfectly clear, but you could try this, assuming word1, word2 and word3 are known words:
(?:word1(\s\w+)?\sword2)|(?:word2(\s\w+)?\sword1)|word3
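A quick sketch of this alternation in use with Python's re module (the sample strings below are made up for illustration):

```python
import re

# Either order of word1/word2, adjacent or separated by at most one word,
# or word3 on its own.
pattern = r"(?:word1(\s\w+)?\sword2)|(?:word2(\s\w+)?\sword1)|word3"

print(bool(re.search(pattern, "word1 word2")))              # adjacent
print(bool(re.search(pattern, "word2 filler word1")))       # reversed, one word between
print(bool(re.search(pattern, "word1 two fillers word2")))  # too far apart: no match
```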
I'm trying to identify patents related to AI using a keyword search in abstracts and titles
In that case, I wouldn't recommend regular expressions at all.
Instead,
Use a library that provides stemming.
This is how your code knows that "program", "programming", "programmer", etc. are related; it is much more robust than a regular expression solution and more likely to be semantically accurate.
Use a library that can generate n-grams.
N-grams are essentially tuples of subsequent words from a text. You can infer that words appearing in the same tuple are more likely to be related to each other.
Use a library of stop words.
These are words that are so common as to be unlikely to provide value, like "and".
Putting this all together, 3-grams for
inductive randomword logic and programming
would be something like
(induct, randomword, logic)
(randomword, logic, program)
Applying this to an arbitrary article, you can then look for stemmed n-grams with whatever terms you wish. Adjust n if you find you're having too many false negatives or positives.
This isn't the place for a complete tutorial, but a library like NLTK provides all of these features.
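A rough pure-Python sketch of that pipeline. A real implementation would use NLTK (PorterStemmer, the stopwords corpus, nltk.util.ngrams); the tiny stop-word set and crude suffix-stripper here are only stand-ins:

```python
STOP_WORDS = {'and', 'the', 'of', 'a', 'to'}   # tiny example list

def crude_stem(word: str) -> str:
    # Naive suffix stripping -- NOT the Porter algorithm, just illustrative.
    for suffix in ('ming', 'ing', 'ive', 'er', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stemmed_ngrams(text: str, n: int = 3):
    # Drop stop words, stem what remains, then slide a window of n tokens.
    tokens = [crude_stem(w.lower()) for w in text.split()
              if w.lower() not in STOP_WORDS]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(stemmed_ngrams('inductive randomword logic and programming'))
# -> [('induct', 'randomword', 'logic'), ('randomword', 'logic', 'program')]
```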

Regex to match word bounded instances with 'dot' inside? [duplicate]

This question already has answers here:
Regular expression for floating point numbers
(20 answers)
Closed 3 years ago.
Hope the question was understandable.
What I want to do is to match anything that constitutes a number (int and float) in python syntax. For instance, I want to match everything on the form (including the dot):
123
123.321
123.
My attempted solution was
"\b\d+/.?\d*\b"
...but this fails. The idea is to match any sequence that starts with one or more digit (\d+), followed by an optional dot (/.?), followed by an arbitrary number of digits (\d*), with word boundaries around. This would match all three number forms specified above.
The word boundary is important because I do not want to match the numbers in
foo123
123foo
and want to match the numbers in
a=123.
foo_method(123., 789.1, 10)
However, the problem is that the last word boundary is recognised right before the optional dot. This prevents the regex from matching 123. and 123.321; instead it matches 123 and 321.
How can I do this, given that word boundaries seem out of the question? Is it possible to make the program treat the dot as a word character?
The float spec is a little more complicated than what you've covered there.
This matches Python's float spec, though there are other valid forms as well:
r"[+-]?\d+\.?\d*([eE][+-]?\d+)?"
You can add positive lookaheads and lookbehinds to this if you are doing something relatively simple, but for anything more complex you may want to split the input on word boundaries before parsing.
This would be the version ensuring word boundaries:
r"(?<=\b)[+-]?\d+\.?\d*([eE][+-]?\d+)?(?=\b)"

How to grab multiple paragraphs in the capture group? [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
I'm using this code: (?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*(.+?)\n*Item.*?1B to grab the following text:
ITEM 1A. RISK FACTORS
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
ITEM 1B.
But it would not grab anything in the capturing group, unless it's one paragraph like this:
ITEM 1A. RISK FACTORS
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
ITEM 1B.
Your regex matches any number of newlines, then any amount of text on one line, then any number of newlines. It only finds a single "paragraph" between newlines, since . does not match across lines.
Try replacing it with something like [\s\S], which matches everything: newlines, paragraphs, text, whitespace, anything you want. Of special note, this will capture any number of paragraphs, with any amount of whitespace between them.
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors Match up to the end of risk factors.
\n* Match as many newlines as needed 'till we hit the next paragraph.
([\s\S]*?) Capture anything, across any number of lines (lazy).
\n* Match as many newlines as needed 'till we hit the next paragraph.
Item.*?1B Match the rest of the content. (This doesn't match the . at the very end, did you mean for it to? If so, add \. to the end).
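A small sketch of the corrected pattern on a shortened, made-up filing:

```python
import re

pattern = r"(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B"
text = ("ITEM 1A. RISK FACTORS\n"
        "First risk paragraph.\n"
        "\n"
        "Second risk paragraph.\n"
        "ITEM 1B.")

m = re.search(pattern, text)
print(m.group(1))  # both paragraphs, blank line included
```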
Try
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*((.*\n*)+)\n*Item.*?1B
And for the sake of your future regex headaches, an incredible resource:
https://regex101.com
Cheers-

Python Regular expression search specific string beside number

I need help here.
I have a list and a string.
What I want to do is find all the numbers in the string, and also match the words from the list that appear beside those numbers.
text = '''Lily goes to school everyday at 9:00. Her House is near to her school.
Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall,
washington. Her school name is kids International.'''
words = ['school', 'international', 'house', 'flat no']
I wrote a regex which can pull the numbers:
x = re.findall(r'([0-9]+[\S]+[0-9]+|[0-9]+)', text, re.I | re.M)
Output I want:
Numbers - ['9:00', '203', '14th']
Flat No. 203 (because 'flat no' is beside 203)
'14th' is also beside a word, but I don't want that one because the word is not contained in the list.
But how can I write the regex so that the second condition is also satisfied, that is, to check in the same regex whether 'flat no' is beside 203 or not?
There you go:
(\d{1,2}:\d{1,2})|(?:No\. (\d+))|(\d+\w{2})
A demo can be found on Regex101.com.
What does it do and how does it work?
I use two pipes (|) to combine the different number "types" you want:
First alternation (\d{1,2}:\d{1,2}) - captures a time: 1-2 digits, a colon, and another 1-2 digits (you could probably require exactly 2 digits for the minutes).
Second alternation (?:No\. (\d+)) - finds the literal prefix "No. " (note the space at the end) and captures the number that follows, no matter how long (at least one digit).
The third and last part (\d+\w{2}) - simply captures any number of digits (again, at least one) followed by two word characters. You could further improve this part of the regex to match only the st, nd, rd and th suffixes, but I will leave this up to you.
Also, to get rid of further unneeded matches you could use lookarounds, but again, I'll leave this up to you to implement.
General note - rather than using one regex to rule... erm - match them all, you should focus on creating several simple regexes. Not only will this improve legibility, but also maintainability. It also allows you to search for timestamps, building numbers and ordinal numerals separately, making it easy to split this information into specific variables.
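A quick sketch running this pattern over the question's sample text:

```python
import re

pattern = r"(\d{1,2}:\d{1,2})|(?:No\. (\d+))|(\d+\w{2})"
text = ("Lily goes to school everyday at 9:00. Her House is near to her school. "
        "Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall, "
        "washington. Her school name is kids International.")

# Each match fills exactly one group: time, building number, or ordinal.
numbers = [next(g for g in m.groups() if g) for m in re.finditer(pattern, text)]
print(numbers)  # -> ['9:00', '203', '14th']
```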
