Parsing special symbol '( )' using regex - python

I am trying to parse text from a document using regex. The document contains differently structured markers, e.g. section 1.2 and section (1). The regex below is able to parse text whose markers contain a decimal point, but it fails for markers in parentheses such as (4).
Any suggestions for handling content that starts with a parenthesised number?
For example:
import re

RAW_Data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'

f = re.findall(r'(^\d+\.[\d\.]*)(.*?)(?=^\d+\.[\d\.]*)', RAW_Data, re.DOTALL | re.M)
for z in f:
    z = ''.join(z).strip().replace('\n', '')
    print(z)
Expected output:
(4) The Governor-General may arrange with the Chief Minister of the Australian Capital Territory for the variation or revocation of an arrangement in force under subsection
(3) Northern Territory
(5) The Governor-General may make arrangements with the Administrator of the Northern Territory with respect to the

Use the regex [sS]ection\s*\(?\d+(?:\.\d+)?\)?
The \(?\d+(?:\.\d+)?\)? part will match any number, with or without a decimal part and with or without parentheses.
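A quick, illustrative check (the sample sentence below is made up, not taken from the question):
import re

# Sketch only: an invented string just to show what the pattern matches.
sample = "As noted in section 1.2 and Section (1), the rules differ."
print(re.findall(r'[sS]ection\s*\(?\d+(?:\.\d+)?\)?', sample))
# ['section 1.2', 'Section (1)']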

You can try:
(?<=(\(\d\)|\d\.\d))(.(?!\(\d\)|\d\.\d))*
To understand how it works, consider the following block:
(\(\d\)|\d\.\d)
It looks for strings of the form (X) or X.Y, where X and Y are digits. Let's call such strings 'delimiters'.
Now, the regex above looks for the first character preceded by a delimiter (positive lookbehind) and matches the following characters until it finds one which is followed by a delimiter (negative lookahead).
Try it here!
Hope it helps!
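Applied to the RAW_Data from the question, a sketch with re.finditer might look like this (DOTALL lets the dot cross newlines; the delimiter itself stays outside the match because it sits in the lookbehind):
import re

# Sketch: each match is the text between two delimiters, without the delimiter.
pattern = re.compile(r'(?<=(\(\d\)|\d\.\d))(.(?!\(\d\)|\d\.\d))*', re.DOTALL)
for m in pattern.finditer(RAW_Data):
    print(m.group(0).replace('\n', ' ').strip())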

Here is another regex: \(\d\)[^(]+
\(\d\) matches any string like (1), (2), (3), ...
[^(]+ matches one or more characters and stops matching when a ( is found.
Test it on Regex101.
But I wonder whether you have a special case like (4) The Governor-General may arrange\n with the Chief Minister of the Austr ... (2) (3). \nNorthern Territory \n, i.e. a sentence running from (4) to (2), because my regex cannot match that type of sentence.
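Against the RAW_Data from the question, a quick sketch of this might be:
import re

# Sketch: each match starts at a "(n)" marker and runs until the next "(".
for chunk in re.findall(r'\(\d\)[^(]+', RAW_Data):
    print(chunk.replace('\n', ' ').strip())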

Related

Regex: match address string if multiple words

Disclaimer: I know from this answer that regex isn't great for U.S. addresses since they're not regular. However, this is a fairly small project and I want to see if I can reduce the number of false positives.
My challenge is to distinguish (i.e. match) between addresses like "123 SOUTH ST" and "123 SOUTH MAIN ST". The best solution I can come up with is to check if more than 1 word comes after the directional word.
My python regex is of the form:
^(NORTH|SOUTH|EAST|WEST)(\s\S*\s\S*)+$
Explanation:
^(NORTH|SOUTH|EAST|WEST) matches direction at the start of the string
(\s\S*\s\S*)+$ attempts to match a space, a word of any length, another space, and another word of any length 1 or more times
But my expression doesn't seem to distinguish between the two types of address. Where's my error (besides using regex for U.S. addresses)?
Thanks for your help.
Your regex misses the number at the beginning of the address and treats the optional word (MAIN in this case) as mandatory. Try this:
^\d+ (NORTH|SOUTH|EAST|WEST)((\s\S*)?\s\S*)+$
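A quick check of that pattern against the two addresses from the question (assuming the intent is that both forms should now match):
import re

# Sketch only: both example addresses should match the suggested pattern.
pattern = re.compile(r'^\d+ (NORTH|SOUTH|EAST|WEST)((\s\S*)?\s\S*)+$')
for addr in ["123 SOUTH ST", "123 SOUTH MAIN ST"]:
    print(addr, bool(pattern.match(addr)))
# 123 SOUTH ST True
# 123 SOUTH MAIN ST True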

Removing a sentence from a text in dataframe column

I want to format a text column in the dataframe in the following way:
In entries where the last character of the string is a colon ":", I want to delete the last sentence in that text, i.e. the substring starting after the last ".", "?" or "!" and finishing on that colon.
Example df:
index text
1 Trump met with Putin. Learn more here:
2 New movie by Christopher Nolan! Watch here:
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
after formatting should look like this:
index text
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
Let's do it with regex, to have more problems:
df.text = df.text.str.replace(r"(?<=[.!?])[^.!?]*:\s*$", "", regex=True)
now df.text.tolist() is
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.',
"I don't want to do a national lockdown again. If #coronavirus continues to 'progress' in the UK."]
variable lookbehind ftw
On regex:
(?<=[.!?])
This is a "lookbehind". It doesnt physically match anything but asserts something, which is that there must be something before what follows this. That something happens to be a character class here [.!?] which means either . or ! or ?.
[^.!?]*
Again we have a character class with square brackets. But now we have a caret ^ as the first which means that we want everything except those in the character class. So any character other than . or ! or ? will do.
The * after the character class is the 0-or-more quantifier, meaning the "any character but .?!" can repeat as many times as possible.
So far, we start matching right after a . or ? or !, then consume a run of characters that can be "anything but .?!". Because that run can never contain .?!, we are sure we are matching after the last sentence boundary.
:\s*$
With :, we say that the 0-or-more stream above is to stop whenever it sees : (if ever; if not, no replacement happens as desired).
The \s* after it allows some possible (again, 0 or more due to *) whitespace characters (\s matches any whitespace) after the :. You can remove that if you are certain there will not be any space after the :.
Lastly we have $: this matches the end of string (nothing physical, but positional). So we are sure that the string ends with : followed optionally by some spaces.
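Outside pandas, the same pattern can be checked on plain strings (a small sketch using two of the example rows):
import re

# Sketch: the last "sentence" ending in a colon is stripped; rows that do not
# end in a colon are left untouched.
pat = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")
print(pat.sub("", "Trump met with Putin. Learn more here:"))
# Trump met with Putin.
print(pat.sub("", "Campers: Get ready to stop COVID-19 in its tracks!"))
# Campers: Get ready to stop COVID-19 in its tracks!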
Using sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:
from nltk.tokenize import sent_tokenize

(df['text'].map(sent_tokenize)
    .map(lambda sents: ' '.join([s for s in sents if not s.endswith(':')])))
index
1 Trump met with Putin.
2 New movie by Christopher Nolan.
3 Campers: Get ready to stop COVID-19 in its tra...
4 London was building a bigger rival to the Eiff...
Name: text, dtype: object
You might have to handle NaNs appropriately with a preceding fillna('') call if your column contains those.
In list form the output looks like this:
['Trump met with Putin.',
'New movie by Christopher Nolan.',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
Note that NLTK needs to be pip-installed.
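Depending on your environment, the punkt models may also need to be downloaded once before sent_tokenize works:
import nltk
nltk.download('punkt')  # one-time download of the sentence tokenizer models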

Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks; to clean it I'm using regular expressions:
import re

def remove_duplicate_characters(text):
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s*\?+", "?", text)
    return text

remove_duplicate_characters(msg)
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it does work, but it does not look like the best approach if I want to add more characters to remove. Is there an optimal way to solve this?
To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
See the regex demo online.
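For example, on a shortened, made-up snippet based on the text in the question:
import re

# Sketch: runs of the same punctuation character collapse to one occurrence.
msg = "See you next week, I hope-- ?? Bobette ??? 713/647-8690"
print(re.sub(r"([^\w\s]|_)\1+", r"\1", msg))
# See you next week, I hope- ? Bobette ? 713/647-8690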
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left where you capture all the contexts where you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
See this regex demo.
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either side) and a :// string (commonly seen in URLs), and the rest is the original regex with the adjusted backreference (since there are now two capturing groups).
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
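A small check of the exception version (the ... and :// survive, other runs are collapsed; the test string is made up):
import re

# Sketch: '...' and '://' are captured by group 1 and put back unchanged,
# while other repeated punctuation is collapsed via group 2.
text = "Wait... see http://example.com !!! now??"
print(re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text))
# Wait... see http://example.com ! now?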

Extract words beginning with capital letters

I have a string like this
text1="sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
I want to extract the words in this text that begin with a capital letter but do not follow a full stop. So [Takocok The New England Journal of Medicine] should be extracted, but not [That's Allan].
I tried this regex, but it still extracts Allan and That's:
t=re.findall("((?:[A-Z]\w+[ -]?)+)",text1)
Here is an option using re.findall:
text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
matches = re.findall(r'(?:(?<=^)|(?<=[^.]))\s+([A-Z][a-z]+)', text1)
print(matches)
This prints:
['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']
Here is an explanation of the regex pattern:
(?:(?<=^)|(?<=[^.])) asserts that what precedes is either the start of the string or a non-full-stop character
\s+ then match (but do not capture) one or more spaces
([A-Z][a-z]+) then match AND capture a word starting with a capital letter
It's probably possible to find a single regular expression for this case, but it tends to get messy.
Instead, I suggest a two-step approach:
split the text into tokens
work on these tokens to extract the interesting words
tokens = [
    'sedentary',
    '.',
    ' ',
    'Allan',
    ' ',
    'Takocok',
    '.',
    ' ',
    'That\'s',
    …
]
This token splitting is already complicated enough.
Using this list of tokens, it is easier to express the actual requirements since you now work on well-defined tokens instead of arbitrary character sequences.
I kept the spaces in the token list because you might want to distinguish between 'a.dotted.brand.name' or 'www.example.org' and the dot at the end of a sentence.
Using this token list, it is easier than before to express rules like "must be preceded immediately by a dot".
I expect that your rules will get quite complicated over time since you are dealing with natural-language text; hence the abstraction to tokens.
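A minimal sketch of that token-based idea (my own illustration, not a full solution, reusing text1 from the question): tokenize into words, whitespace and punctuation, then keep capitalized words whose preceding non-space token is not a dot.
import re

def capitalized_not_after_dot(text):
    # Split into word, whitespace and punctuation tokens.
    tokens = re.findall(r"[A-Za-z']+|\s+|[^\w\s]", text)
    result = []
    prev = None  # last non-space token seen
    for tok in tokens:
        if tok.isspace():
            continue
        if re.fullmatch(r"[A-Z][a-z]+", tok) and prev != '.':
            result.append(tok)
        prev = tok
    return result

print(capitalized_not_after_dot(text1))
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']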
This should be the regex you're looking for:
(?<!\.)\s+([A-Z][A-Za-z]+)
See the regex101 here: https://regex101.com/r/EoPqgw/1

Multiple negative lookbehind assertions in python regex?

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter like
"\. A-Z"
However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that word is a month name, like January | February | March.
I tried implementing the first half, but even this did not work. My code was:
"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "
First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).
Second, to match an uppercase letter you have to use [A-Z]; A-Z alone will not work (and remember there may be other uppercase letters than A-Z ...).
Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs or S. The thing is that, if it is preceded by an S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way one of those alternatives will match, since "preceded by Abs" and "preceded by S" are mutually exclusive.
The pattern for the first part of your question could be
(?<!Abs)(?<!S)(\. [A-Z])
or
(?<!Abs)(?<!S)(\.\s+[A-Z])
(with my suggestion)
That is because you have to avoid |; without it, the expression now says "not preceded by Abs and not preceded by S". If both are true, the pattern matcher will continue scanning the string and find your match.
To exclude the month names I came up with this regular expression:
(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]
The same reasoning holds for the negative lookahead patterns.
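As a quick check of that last pattern (using a sample sentence like the one in the answer further down):
import re

# Sketch: each match is a sentence boundary that is not preceded by Abs/S
# and not followed by a month name.
pattern = re.compile(r'(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]')
text = "First. Second. January. Third. Abs. Forth. S. Fifth."
print([m.group(0) for m in pattern.finditer(text)])
# ['. S', '. T', '. A', '. S']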
I'm adding a short answer to the question in the title, since this is at the top of Google's search results:
The way to have multiple differently-lengthed negative lookbehinds is to chain them together like this:
"(?<!1)(?<!12)(?<!123)example"
This would match example, 2example and 3example, but not 1example, 12example or 123example.
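A quick sketch to verify that claim, with the test strings listed above:
import re

# Sketch: each lookbehind rules out one prefix length, so only the unprefixed
# and differently-prefixed strings match.
pattern = re.compile(r"(?<!1)(?<!12)(?<!123)example")
for s in ["example", "2example", "3example", "1example", "12example", "123example"]:
    print(s, bool(pattern.search(s)))
# example True, 2example True, 3example True,
# 1example False, 12example False, 123example False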
Use nltk punkt tokenizer. It's probably more robust than using regex.
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
Use nltk or similar tools as suggested by @root.
To answer your regex question:
import re
import sys

print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))
Input
First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth
Output
['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
'S. Sixth', 'ABs', 'Eighth']
You can use a character class (set) []:
'(?<![123])example'
This would not match 1example, 2example or 3example (this works here because each excluded prefix is a single character).
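A short check of that variant:
import re

# Sketch: the set [123] excludes any single 1, 2 or 3 directly before 'example'.
pattern = re.compile(r"(?<![123])example")
for s in ["example", "1example", "2example", "3example"]:
    print(s, bool(pattern.search(s)))
# example True, 1example False, 2example False, 3example False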
