Regular Expression cleaning except abbreviations - python

I'm using the expression [^A-Za-z'] to clean data from a CSV file before processing it, but I want to keep the dots in abbreviations (such as U.S).
I want to exclude [A-Za-z]\.[A-Za-z] from [^A-Za-z']. How can I do that?
Edit:
To make it clearer. I will provide an example sentence:
"The plastic buildout in the U.S. is clustered in the Gulf of Mexico
region, where much of the U.S. petrochemical industry is already
located."
I convert to lowercase, clean any characters that aren't alphabetical and divide the sentence into words. When I'm cleaning it, I get the result:
"the plastic buildout in the u s is clustered in the gulf of mexico
region where much of the u s petrochemical industry is already
located"
I want to exclude [A-Za-z]\.[A-Za-z] to ignore U.S
The line of code:
corpus_text['Sentence'] = corpus_text['Sentence'].str.replace("[^A-Za-z']", ' ', regex=True).str.lower()

Am I reading your question correctly, that you want to remove all non A-Za-z characters, except if there is a dot in the middle, e.g.
U.S --> U.S
U.S. --> U.S
end of sentence. --> end of sentence
an ellipsis ... like this --> an ellipsis like this
That means any trailing dots, like those at the end of a sentence, still need to be removed.
So, clean out any optional trailing dots followed by a character that is neither a letter nor a dot:
\.*[^A-Za-z\.]
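A minimal sketch of this idea in plain Python. Two things here are my assumptions rather than part of the pattern above: the apostrophe is added to the character class so that it survives (as in the question's original expression), and the extra `|\.+$` branch removes dots at the very end of the string, where no other character follows them. The function name clean_keep_inner_dots is hypothetical.

```python
import re

def clean_keep_inner_dots(text):
    # Replace optional dots followed by one character that is not a letter,
    # a dot or an apostrophe. The second branch (an assumption) also drops
    # dots at the very end of the string.
    cleaned = re.sub(r"\.*[^A-Za-z.']|\.+$", " ", text)
    # Lowercase and collapse the runs of spaces left by the replacements.
    return " ".join(cleaned.lower().split())

sentence = "The plastic buildout in the U.S. is clustered in the Gulf of Mexico region."
print(clean_keep_inner_dots(sentence))
# the plastic buildout in the u.s is clustered in the gulf of mexico region
```

On a pandas column, something like corpus_text['Sentence'].map(clean_keep_inner_dots) should apply the same cleaning row by row.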

Related

How to make a new line for a sentence after a finished sentence with a dot?

I have a large text file that I process in Python. I want to put each sentence on its own line, so that every line contains exactly one sentence.
For example:
Input:
The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world". Numerous attempts in the 21. century to settle the debate.
Output:
The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci.
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21. century to settle the debate.
I tried :
with open("new_all_data.txt", 'r') as text, open("new_all_data2.txt", "w") as new_text2:
    text_lines = text.readlines()
    for line in text_lines:
        if "." in line:
            new_lines = line.replace(".", ".\n")
            new_text2.write(new_lines)
It makes a new line for sentences; however, it breaks the line after every ".", not only at the end of a sentence.
For example:
The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci.
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21.
century to settle the debate.
I want to keep "Numerous attempts in the 21. century to settle the debate" in one line.
You only need to replace periods followed by a space and a capital letter:
import re
with open("new_all_data.txt", 'r') as text, open("new_all_data2.txt", "w") as new_text2:
    text_lines = text.readlines()
    for line in text_lines:
        if "." in line:
            new_lines = re.sub(
                r"(?<=\.) (?=[A-Z])",
                "\n",
                line
            )
            new_text2.write(new_lines)
I use the re module, whose re.sub function performs regex-based replacements. In each line, I search for spaces that match the following regex: (?<=\.) (?=[A-Z])
The space must have a period right before it. (?<=xxx) is a positive lookbehind: it makes sure that the match has xxx just before it. \. matches a period, so "(?<=\.) " (note the space at the end) matches only spaces that have a period right before them.
The space must have a capital letter right after it. (?=xxx) is a positive lookahead: it makes sure that the match has xxx just after it. [A-Z] matches any capital letter, so " (?=[A-Z])" (note the space at the beginning) matches only spaces that have a capital letter right after them.
Combining those two conditions should be enough to replace only the spaces between two sentences with a newline.
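The same substitution can be tried on a plain string without any files. This is only a sketch: the function name split_sentences and the shortened sample text are mine.

```python
import re

def split_sentences(text):
    # Replace a space that follows a period and precedes a capital letter.
    return re.sub(r"(?<=\.) (?=[A-Z])", "\n", text)

sample = ("The Mona Lisa is a portrait painting by Leonardo da Vinci. "
          "Numerous attempts in the 21. century to settle the debate.")
print(split_sentences(sample))
# The Mona Lisa is a portrait painting by Leonardo da Vinci.
# Numerous attempts in the 21. century to settle the debate.
```

Note that this heuristic still splits after an abbreviation that is followed by a capitalized word, such as "Mr. Jack".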

Regex: match address string if multiple words

Disclaimer: I know from this answer that regex isn't great for U.S. addresses since they're not regular. However, this is a fairly small project and I want to see if I can reduce the number of false positives.
My challenge is to distinguish (i.e. match) between addresses like "123 SOUTH ST" and "123 SOUTH MAIN ST". The best solution I can come up with is to check if more than 1 word comes after the directional word.
My python regex is of the form:
^(NORTH|SOUTH|EAST|WEST)(\s\S*\s\S*)+$
Explanation:
^(NORTH|SOUTH|EAST|WEST) matches direction at the start of the string
(\s\S*\s\S*)+$ attempts to match a space, a word of any length, another space, and another word of any length 1 or more times
But my expression doesn't seem to distinguish between the 2 types of term. Where's my error (besides using regex for U.S. addresses)?
Thanks for your help.
Your regex misses the number at the beginning of the address and treats the optional word (MAIN in this case) as mandatory. Try this:
^\d+ (NORTH|SOUTH|EAST|WEST)((\s\S*)?\s\S*)+$
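A quick way to check the pattern above against both kinds of address. The second pattern, STRICT, is a hypothetical variant of mine (not from the answer) that requires at least two words after the direction, in case the goal is to tell the two forms apart rather than to accept both.

```python
import re

# Pattern from the answer above: the middle word is optional.
LOOSE = re.compile(r"^\d+ (NORTH|SOUTH|EAST|WEST)((\s\S*)?\s\S*)+$")
# Hypothetical stricter variant: at least two words after the direction.
STRICT = re.compile(r"^\d+ (NORTH|SOUTH|EAST|WEST)(\s\S+){2,}$")

for address in ["123 SOUTH ST", "123 SOUTH MAIN ST"]:
    print(address, bool(LOOSE.match(address)), bool(STRICT.match(address)))
# 123 SOUTH ST True False
# 123 SOUTH MAIN ST True True
```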

Removing a sentence from a text in dataframe column

I want to format a text column in the dataframe in the following way:
In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring that starts right after the last ".", "?" or "!" and ends at that colon.
Example df:
index text
1 Trump met with Putin. Learn more here:
2 New movie by Christopher Nolan! Watch here:
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
after formatting should look like this:
index text
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
Let's do it with regex, to have more problems:
df.text = df.text.str.replace(r"(?<=[.!?])[^.!?]*:\s*$", "", regex=True)
now df.text.tolist() is
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.',
"I don't want to do a national lockdown again. If #coronavirus continues to 'progress' in the UK."]
lookbehind ftw (it is fixed-width here, which is what Python's re module requires)
On regex:
(?<=[.!?])
This is a "lookbehind". It doesn't consume any characters; it only asserts that whatever follows must be preceded by something. That something here is the character class [.!?], which means either . or ! or ?.
[^.!?]*
Again we have a character class in square brackets, but now with a caret ^ as its first character, which means that we want everything except the characters in the class. So any character other than . or ! or ? will do.
The * after the character class is the 0-or-more quantifier, meaning the "any character but .?!" can repeat as many times as possible.
So far, the match starts right after a . or ? or ! and continues over a stream of characters that are "anything but .?!". This guarantees that we start matching after the last sentence terminator, because the "anything but" part cannot pass over another .?! on the way.
:\s*$
With :, we say that the 0-or-more stream above must end at a : (if there is one; if not, no replacement happens, as desired).
The \s* after it allows some possible (again, 0 or more due to *) whitespace after the : (\s matches any whitespace character). You can remove it if you are certain there will never be whitespace after the :.
Lastly we have $: this matches the end of the string (nothing physical, only a position). So we are sure that the string ends with : followed optionally by some whitespace.
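To see the pattern at work without a dataframe, here is a small sketch that applies the same regex with re.sub to a plain list of strings (the list and the names PATTERN and cleaned are mine). It should behave like str.replace(..., regex=True) on the column.

```python
import re

PATTERN = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")

texts = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    "Campers: Get ready to stop COVID-19 in its tracks!",
]
# Drop the trailing "sentence" only when the string ends with a colon.
cleaned = [PATTERN.sub("", t) for t in texts]
print(cleaned)
# ['Trump met with Putin.', 'New movie by Christopher Nolan!',
#  'Campers: Get ready to stop COVID-19 in its tracks!']
```

A string that ends with a colon but contains no earlier ., ! or ? is left unchanged, because the lookbehind has nothing to anchor on.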
Using sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:
from nltk.tokenize import sent_tokenize
(df['text'].map(sent_tokenize)
           .map(lambda sent: ' '.join([s for s in sent if not s.endswith(':')])))
index
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tra...
4 London was building a bigger rival to the Eiff...
Name: text, dtype: object
You might have to handle NaNs appropriately with a preceding fillna('') call if your column contains those.
In list form the output looks like this:
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
Note that NLTK needs to be pip-installed.

Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks, to clean it I'm using regular expressions:
import re
def remove_duplicate_characters(text):
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace into one space
    text = re.sub(r"\s*\?+", "?", text)  # collapse runs of "?" and drop the whitespace before them
    return text
remove_duplicate_characters(msg)
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it works, but it does not look like the best approach if I want to add more characters to remove. Is there a better way to solve this?
To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
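A short sketch of both variants. The wrapper function collapse_repeated_punctuation and its strip_leading_space flag are hypothetical names of mine; the two patterns are the ones given above.

```python
import re

def collapse_repeated_punctuation(text, strip_leading_space=False):
    # ([^\w\s]|_) captures one punctuation character (or an underscore);
    # \1+ requires the same character to repeat at least once more.
    pattern = r"([^\w\s]|_)\1+"
    if strip_leading_space:
        pattern = r"\s*" + pattern
    return re.sub(pattern, r"\1", text)

print(collapse_repeated_punctuation("Bobette Riner????? Senior Analyst!!"))
# Bobette Riner? Senior Analyst!
print(collapse_repeated_punctuation("I hope-- ?? Bobette", strip_leading_space=True))
# I hope-? Bobette
```

Be aware that this generic pattern also collapses the // in a URL such as http://example.com.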
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left that captures all the contexts where you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either end) or a :// string (commonly seen in URLs); the rest is the original regex with the backreference adjusted (since there are now two capturing groups).
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
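A small sketch of the pattern with exceptions, on a made-up sample string (the names PATTERN and collapse_with_exceptions are mine). It relies on Python 3.5+ behaviour, where a group that did not take part in the match expands to an empty string in the replacement.

```python
import re

PATTERN = re.compile(r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+")

def collapse_with_exceptions(text):
    # Group 1 holds a protected "..." or "://"; group 2 holds the repeated
    # character. Only one of them takes part in any given match, and the
    # other one contributes an empty string to the replacement.
    return PATTERN.sub(r"\1\2", text)

print(collapse_with_exceptions("Wait... see http://example.com!!"))
# Wait... see http://example.com!
print(collapse_with_exceptions("Hmm...."))
# Hmm.
```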

Parsing special symbol '( )' using regex

I am trying to parse text from a document using regex. The document contains different numbering structures, e.g. section 1.2 and section (1). The regex below can parse text that starts with a decimal number but fails for ().
Any suggestions for handling content that starts with ()?
For example:
import re
RAW_Data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'
f = re.findall(r'(^\d+\.[\d\.]*)(.*?)(?=^\d+\.[\d\.]*)', RAW_Data, re.DOTALL | re.M | re.S)
for z in f:
    z = ''.join(z).strip().replace('\n', '')
    print(z)
Expected output:
(4) The Governor-General may arrange with the Chief Minister of the Australian Capital Territory for the variation or revocation of an arrangement in force under subsection
(3) Northern Territory
(5) The Governor-General may make arrangements with the Administrator of the Northern Territory with respect to the
Use the regex [sS]ection\s*\(?\d+(?:\.\d+)?\)?
The \(?\d+(?:\.\d+)?\)? part matches any number, with or without a decimal part and with or without surrounding parentheses.
You can try:
(?<=(\(\d\)|\d\.\d))(.(?!\(\d\)|\d\.\d))*
To understand how it works, consider the following block:
(\(\d\)|\d\.\d)
It looks for strings which are of type (X) or X.Y, where X and Y are numbers. Let's call such string 'delimiters'.
Now, the regex above looks for the first character preceded by a delimiter (positive lookbehind) and matches the following characters until it finds one that is followed by a delimiter (negative lookahead).
Hope it helps!
Here is another regex: \(\d\)[^(]+
\(\d\) matches any string like (1), (2), (3), ...
[^(]+ matches one or more characters and stops at the next (
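A sketch of how this pattern behaves on the RAW_Data string from the question. The whitespace clean-up with split and join is my addition. Note that \(\d\) only covers single-digit numbers.

```python
import re

RAW_Data = (
    "(4) The Governor-General may arrange\n with the Chief Minister of the "
    "Australian Capital Territory for the variation or revocation of an \n\n"
    "arrangement in force under subsection (3). \nNorthern Territory \n "
    "(5) The Governor-General may make arrangements with the \n"
    "Administrator of the Northern \nTerritory with respect to the"
)

# Each match starts at "(digit)" and runs up to, but not including, the next "(".
chunks = re.findall(r"\(\d\)[^(]+", RAW_Data)
# Collapse newlines and repeated spaces inside every chunk.
parts = [" ".join(chunk.split()) for chunk in chunks]
for part in parts:
    print(part)
```

This yields three chunks, starting at (4), (3) and (5). The chunk for (4) is cut at the cross-reference "(3)", because the pattern treats every parenthesised digit as a new start.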
But I wonder whether you have special cases like (4) The Governor-General may arrange\n with the Chief Minister of the Austr ... (2) (3). \nNorthern Territory \n, where a single sentence runs from (4) through (2). My regex cannot match that kind of sentence.
