Removing a sentence from a text in dataframe column - python

I want to format a text column in a dataframe in the following way:
In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring starting from the character after the last ".", "?" or "!" and ending at that colon.
Example df:
index text
1 Trump met with Putin. Learn more here:
2 New movie by Christopher Nolan! Watch here:
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
After formatting, it should look like this:
index text
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.

Let's do it with regex to have more problems:
df.text = df.text.str.replace(r"(?<=[.!?])[^.!?]*:\s*$", "", regex=True)
now df.text.tolist() is
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.',
"I don't want to do a national lockdown again. If #coronavirus continues to 'progress' in the UK."]
Lookbehind ftw (a fixed-width one here, which is the only kind Python's re module supports).
On regex:
(?<=[.!?])
This is a "lookbehind". It doesn't consume any characters; it only asserts that whatever follows must be preceded by something. That something here is the character class [.!?], which means either . or ! or ?.
[^.!?]*
Again we have a character class in square brackets, but now with a caret ^ as its first character, which means we want everything except the characters in the class. So any character other than . or ! or ? will do.
The * after the character class is the 0-or-more quantifier, meaning the "any character but .?!" is matched as many times as possible.
So far, the match starts right after a . or ? or !, and continues over a stream of characters that can be "anything but .?!". This guarantees that the match begins after the last sentence terminator, because the "anything but" part can't cross another .?! on the way.
:\s*$
With :, we say that the 0-or-more stream above must stop at a : (if there is none, no replacement happens, as desired).
The \s* after it allows some optional (again, 0 or more due to *) whitespace (\s means any whitespace character) after the :. You can remove it if you are certain there will never be whitespace after the :.
Lastly we have $: this matches the end of the string (nothing physical, only a position). So we are sure that the string ends with : optionally followed by some whitespace.
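To see the pieces working together, here is a minimal sketch that applies the same pattern with the standard re module to plain strings instead of a dataframe column. The names TRAILING_COLON_SENTENCE, texts and cleaned are made up for this illustration; the sample rows are the ones from the question.

```python
import re

# Same pattern as in the answer: starting right after a sentence terminator,
# remove everything up to a colon that ends the string.
TRAILING_COLON_SENTENCE = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")

texts = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    "Campers: Get ready to stop COVID-19 in its tracks!",
    "London was building a bigger rival to the Eiffel Tower. Then it all went wrong.",
]

cleaned = [TRAILING_COLON_SENTENCE.sub("", t) for t in texts]

for line in cleaned:
    print(line)
```

Note that a text consisting of a single colon-terminated sentence (for example "Watch here:") is left untouched, because the lookbehind finds no earlier ., ! or ? to anchor on.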

Using sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:
from nltk.tokenize import sent_tokenize
(df['text'].map(sent_tokenize)
           .map(lambda sent: ' '.join([s for s in sent if not s.endswith(':')])))
index
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tra...
4 London was building a bigger rival to the Eiff...
Name: text, dtype: object
You might have to handle NaNs appropriately with a preceding fillna('') call if your column contains those.
In list form the output looks like this:
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
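Note that NLTK needs to be pip-installed, and sent_tokenize also needs NLTK's Punkt tokenizer data, which is fetched separately with nltk.download().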

Related

Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks. To clean it, I'm using regular expressions:
import re

def remove_duplicate_characters(text):
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace into one space
    text = re.sub(r"\s*\?+", "?", text)  # collapse each " ???" run into a single "?"
    text = re.sub(r"\s*\?+", "?", text)  # second pass merges "?"s left adjacent by the first
    return text

remove_duplicate_characters(msg)
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it does work, but it does not look like the best approach if I want to add more characters to remove. Is there a better way to solve this?
To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
See the regex demo online.
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left where you capture all the contexts in which you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
See this regex demo.
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either end) and a :// string (commonly seen in URLs), and the rest is the original regex with the adjusted backreference (since there are now two capturing groups).
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
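A small sketch comparing the two substitutions above on a made-up sample string. The function names and the sample are hypothetical; the patterns are the ones from the answer. It relies on Python 3.5+, where a group that did not take part in the match is replaced with an empty string.

```python
import re

def collapse_punct(text):
    """Collapse every run of one repeated punctuation character to a single one."""
    return re.sub(r"([^\w\s]|_)\1+", r"\1", text)

def collapse_punct_keep(text):
    """Same, but keep an exact '...' and the '://' of URLs untouched."""
    pattern = r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+"
    # Group 1 holds a protected sequence, group 2 the repeated character;
    # only one of them is set for any given match.
    return re.sub(pattern, r"\1\2", text)

sample = "Wait... what?? http://x.com"
print(collapse_punct(sample))       # generic version also squeezes "..." and "//"
print(collapse_punct_keep(sample))  # exceptions are preserved
```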

Regular Expression cleaning except abbreviations

I'm using [^A-Za-z'] expression to clean data from a CSV file before processing it. But I want to keep dots for abbreviations (such as U.S)
I want to exclude [A-Za-z]\.[A-Za-z] from [^A-Za-z']. How can I do that?
Edit:
To make it clearer. I will provide an example sentence:
"The plastic buildout in the U.S. is clustered in the Gulf of Mexico
region, where much of the U.S. petrochemical industry is already
located."
I convert to lowercase, clean any characters that aren't alphabetical and divide the sentence into words. When I'm cleaning it, I get the result:
"the plastic buildout in the u s is clustered in the gulf of mexico
region where much of the u s petrochemical industry is already
located"
I want to exclude [A-Za-z]\.[A-Za-z] to ignore U.S
The line of code:
corpus_text['Sentence'] = corpus_text['Sentence'].str.replace("[^A-Za-z']", ' ', regex=True).str.lower()
Am I reading your question correctly, that you want to remove all non A-Za-z characters, except if there is a dot in the middle, e.g.
U.S --> U.S
U.S. --> U.S
end of sentence. --> end of sentence
an ellipsis ... like this --> an ellipsis like this
That means that any trailing dots, like the one at the end of a sentence, still need to be removed.
So, clean out any optional trailing dots followed by a character that is neither a letter nor a dot:
\.*[^A-Za-z\.]
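Here is a minimal sketch of that pattern used with re.sub, followed by the lowercase-and-split step described in the question. The helper name clean_words and the shortened sample sentence are made up for illustration. One thing the example shows: a dot at the very end of the string survives, because the pattern needs some character after the dots.

```python
import re

# Pattern from the answer: optional dots followed by one character
# that is neither a letter nor a dot.
NON_ALPHA = re.compile(r"\.*[^A-Za-z\.]")

def clean_words(sentence):
    """Replace unwanted characters with spaces, lowercase, and split into words."""
    return NON_ALPHA.sub(" ", sentence).lower().split()

words = clean_words(
    "The plastic buildout in the U.S. is clustered in the Gulf of Mexico region, where it stays."
)
print(words)
```

If that final dot matters, appending a space to the input before cleaning is one simple way around it. Unlike the character class in the question, this pattern also replaces apostrophes.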

regex for returning first sentence from a bigger text Python3

I want to get the first sentence from a text. I am encountering various text formats. I am using Python 3's re.split(), and the regex I wrote is '.*\. [A-Z]', meaning: take anything until this format appears.
This works for 90% of the cases, but a 'Dr. Firstname Lastname' in the first sentence breaks the pattern: it returns the first sentence only up to Firstname. I was thinking of trying to exclude substrings like 'Dr. [A-Z]' but cannot figure out a way to do it. Any ideas? Thanks
Sample: The rain in U.S.A. and Spain is researched by Dr. Martin Laurance. This is the latest U.S.A. study. Anything else will just be ignored.
Don't reinvent the wheel, the problem has been tackled before.
If you are using Python (as your link suggests), give nltk a try:
from nltk import sent_tokenize
string = "The rain in U.S.A. and Spain is researched by Dr. Martin Laurance. This is the latest U.S.A. study. Anything else will just be ignored."
for sent in sent_tokenize(string):
    print(sent)
This yields
The rain in U.S.A. and Spain is researched by Dr. Martin Laurance.
This is the latest U.S.A. study.
Anything else will just be ignored.
Wanted to kill a minute or two (or 25 ;)) so I came up with this (not at all foolproof) solution:
(?i).*?\b((?=[a-z']*[aoueiy])(?=[a-z']*[^aoueiy])\w{2,}\.)
What it does is identify a word followed by a full stop. To separate this word from abbreviations, it searches for a sequence of characters ({2,} = more than 1) that contains at least one vowel and one consonant. This is achieved using two lookaheads prior to matching the word.
Lookahead to find a vowel in a word: (?=[a-z']*[aoueiy])
[a-z']* = any number of letters (or apostrophes), followed by the character class [aoueiy] - a vowel.
The consonant check is the same, only with a negated character class [^aoueiy] matching any consonant (and also any other non-letter, but since the match is letters only it doesn't matter ;)
Note that this is of course nothing close to a complete language parser, but it may work in many cases. One thing it would miss is sentences ending with the one-letter word "I", like "We're good together you and I."
See it here at regex101
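A short sketch of how that pattern could be applied with re.match to pull out the first sentence. The helper name first_sentence is hypothetical; the pattern and the first sample are taken from above.

```python
import re

# Pattern from the answer: lazily skip ahead to the first word of 2+ characters
# that has both a vowel and a non-vowel and is directly followed by a full stop.
FIRST_SENTENCE = re.compile(
    r"(?i).*?\b((?=[a-z']*[aoueiy])(?=[a-z']*[^aoueiy])\w{2,}\.)"
)

def first_sentence(text):
    """Return the text up to and including the first 'real' sentence end, or None."""
    match = FIRST_SENTENCE.match(text)
    return match.group(0) if match else None

sample = ("The rain in U.S.A. and Spain is researched by Dr. Martin Laurance. "
          "This is the latest U.S.A. study. Anything else will just be ignored.")
print(first_sentence(sample))

# The limitation mentioned above: a sentence ending in the one-letter word "I"
# is not recognised as an end, so the match runs on into the next sentence.
print(first_sentence("We're good together you and I. Next one here."))
```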

Parsing special symbol '( )' using regex

I am trying to parse text from a document using regex. The document contains different structures, e.g. section 1.2 and section (1). The regex below is able to parse text with a decimal point but fails for ().
Any suggestions for handling content that starts with ()?
For example:
import re
RAW_Data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'
f = re.findall(r'(^\d+\.[\d\.]*)(.*?)(?=^\d+\.[\d\.]*)', RAW_Data,re.DOTALL|re.M|re.S)
for z in f:
    z = ''.join(z).strip().replace('\n', '')
    print(z)
Expected output:
(4) The Governor-General may arrange with the Chief Minister of the Australian Capital Territory for the variation or revocation of an arrangement in force under subsection
(3) Northern Territory
(5) The Governor-General may make arrangements with the Administrator of the Northern Territory with respect to the'
Use the regex [sS]ection\s*\(?\d+(?:\.\d+)?\)?
The \(?\d+(?:\.\d+)?\)? part will match any number, with or without a decimal part or surrounding parentheses.
You can try:
(?<=(\(\d\)|\d\.\d))(.(?!\(\d\)|\d\.\d))*
To understand how it works, consider the following block:
(\(\d\)|\d\.\d)
It looks for strings which are of type (X) or X.Y, where X and Y are numbers. Let's call such string 'delimiters'.
Now, the regex above looks for the first character preceded by a delimiter (positive lookbehind) and matches the following characters until it reaches one that is followed by a delimiter (negative lookahead).
Try it here!
Hope it helps!
Here is another regex: \(\d\)[^(]+
\(\d\) matches any string like (1) (2) (3) ...
[^(]+ matches one or more characters and stops when it finds a (
Test it on Regex101.
But I wonder whether you have special cases like (4) The Governor-General may arrange\n with the Chief Minister of the Austr ... (2) (3). \nNorthern Territory \n, where a single sentence runs from (4) to (2). My regex cannot match that kind of sentence.
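As a rough sketch, the \(\d\)[^(]+ pattern from this last answer can be tried on the RAW_Data string from the question like this. The whitespace clean-up with split/join and the name sections are additions for illustration, not part of the answer.

```python
import re

RAW_DATA = (
    '(4) The Governor-General may arrange\n with the Chief Minister of the '
    'Australian Capital Territory for the variation or revocation of an \n\n'
    'arrangement in force under subsection (3). \nNorthern Territory \n '
    '(5) The Governor-General may make arrangements with the \n'
    'Administrator of the Northern \nTerritory with respect to the'
)

# Each match starts at a "(digit)" marker and runs until the next "(".
chunks = re.findall(r"\(\d\)[^(]+", RAW_DATA)

# Collapse newlines and repeated spaces inside each chunk.
sections = [" ".join(chunk.split()) for chunk in chunks]

for section in sections:
    print(section)
```

On this input the text is cut at (4), (3) and (5), which is close to the expected output in the question, except that the cross-reference "(3)." keeps its trailing dot. The pattern only handles single-digit markers; \(\d+\) would be needed for (10) and above.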

Match double quotation mark following punctuation with regex in Python to split sentence

I'm sure I'm just missing something, but my regex is a little rusty.
I have a well formatted text corpus and it came out of a SQLite DB that had each review as a row, which is fine and I wrote it out that way to a text file, so each review is a line followed by a new line character.
What I need to do is convert every sentence into a line to feed an iterator that expects sentences as lines that then feeds a model. The text is all professionally written and edited, so a simple regex that splits lines based on strings ending in [.!?] or [.!?] followed by a double quotation mark (") is actually sufficient. something like
re.split('(?<=[.!?]) +|((?<=[.!?])\")', text)
The lookbehind works for anything except ("). I've usually done regex mostly in R or Ruby and this is just making me feel dumb in the wee hours of Sunday night.
Example text:
“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.” But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule.
Thanks in advance for any suggestions.
You may use
r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
See the regex demo
Details
(?: - start of a non-capturing alternation group matching:
(?<=[.!?]) - a position that is immediately preceded with ., ! or ?
| - or
(?<=[.!?]["”]) - a position that is immediately preceded with ., ! or ? followed with " or ”
) - end of the grouping
\s+ - 1+ whitespaces.
Python 3 demo:
import re
rx = r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
s = "“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.” But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule."
for result in re.split(rx, s):
    print(result)
Output:
“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.”
But today, the much-maligned subgenre almost feels like a secret precedent.
Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule.
