How to replace the dots in abbreviations using regex in Python? - python

Basically, I want to drop all the dots in the abbreviations like "L.L.C.", converting to "LLC". I don't have a list of all the abbreviations. I want to convert them as they are found. This step is performed before sentence tokenization.
text = """
Proligo L.L.C. is a limited liability company.
S.A. is a place.
She works for AAA L.P. in somewhere.
"""
text = re.sub(r"(?:([A-Z])\.){2,}", "\1", text)
This does not work.
I want to remove the dots from the abbreviations so that the dots will not break the sentence tokenizer.
Thank you!
P.S. Sorry for not being clear enough. I edited the sample text.

Try using a callback function with re.sub:
import re
def callback(match_text):
    return match_text.replace('.', '')
text = "L.L.C., S.A., L.P."
text = re.sub(r"(?:[A-Z]\.)+", lambda m: callback(m.group()), text)
print(text)  # LLC, SA, LP
The regex pattern (?:[A-Z]\.)+ matches one or more capital letters, each followed by a dot. For each match, the callback function strips the dots.
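As a rough sketch of how the callback idea could be applied to the sample text from the question: the variant below assumes that an abbreviation is two or more capital letters, each followed by a dot. That rule is my assumption, not something the question states, and the names ABBREVIATION and strip_abbreviation_dots are made up for illustration.

```python
import re

# Assumption: an abbreviation is two or more "capital letter + dot" pairs
# starting at a word boundary. Single initials such as "A." are left alone.
ABBREVIATION = re.compile(r"\b(?:[A-Z]\.){2,}")

def strip_abbreviation_dots(text):
    """Remove the dots inside every match of ABBREVIATION."""
    return ABBREVIATION.sub(lambda m: m.group().replace(".", ""), text)

sample = (
    "Proligo L.L.C. is a limited liability company. "
    "S.A. is a place. "
    "She works for AAA L.P. in somewhere."
)
print(strip_abbreviation_dots(sample))
# Proligo LLC is a limited liability company. SA is a place. She works for AAA LP in somewhere.
```

Note that an abbreviation at the very end of a sentence loses its final dot as well, so the tokenizer will no longer see that sentence boundary.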

import re
string = 'ha.f.d.s.a.s.d.f'
print(re.sub(r'\.', '', string))
# output:
# hafdsasdf
Note that this removes every '.', so it only works properly if your text is a single sentence. With several sentences, they all run together into one long sentence.

Use this regular expression:
>>> re.sub(r"(?<=[A-Z])\.", "", text)
'LLC, SA, LP'
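A small check of why the dot after the lookbehind should be escaped. This is only a sketch on one sentence from the question: with an unescaped dot, the pattern removes any character that follows a capital letter, not just periods.

```python
import re

text = "She works for AAA L.P. in somewhere."

# Escaped dot: only a literal "." preceded by a capital letter is removed.
escaped = re.sub(r"(?<=[A-Z])\.", "", text)

# Unescaped dot: ANY character preceded by a capital letter is removed,
# which also eats letters and spaces.
unescaped = re.sub(r"(?<=[A-Z]).", "", text)

print(escaped)    # She works for AAA LP in somewhere.
print(unescaped)  # Se works for ALP in somewhere.
```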

The answers here are extremely aggressive: any period that follows a capital letter will be removed.
I'd recommend:
text = "L.L.C., S.A., L.P."
text = re.sub(r"L\.L\.C\.|S\.A\.|L\.P\.", lambda x: x.group().replace(".", ""), text)
print(text) # => "LLC, SA, LP"
This will only match the abbreviations you're asking for. You can add word boundaries for additional strictness.
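One possible way to add that strictness, sketched under the assumption that the abbreviations live in a list (KNOWN_ABBREVIATIONS and the helper names are hypothetical). Because \b does not behave usefully right after a trailing dot, the sketch uses lookarounds in place of plain word boundaries.

```python
import re

# Hypothetical list of known abbreviations; extend it as needed.
KNOWN_ABBREVIATIONS = ["L.L.C.", "S.A.", "L.P."]

# Longest first, so a longer abbreviation is tried before any shorter one
# that happens to be its prefix.
_alternatives = "|".join(
    re.escape(abbr) for abbr in sorted(KNOWN_ABBREVIATIONS, key=len, reverse=True)
)

# No word character or dot directly before, no word character directly after.
KNOWN_PATTERN = re.compile(r"(?<![\w.])(?:" + _alternatives + r")(?!\w)")

def strip_known_abbreviations(text):
    """Drop the dots only from abbreviations listed in KNOWN_ABBREVIATIONS."""
    return KNOWN_PATTERN.sub(lambda m: m.group().replace(".", ""), text)

print(strip_known_abbreviations("L.L.C., S.A., L.P."))  # LLC, SA, LP
print(strip_known_abbreviations("U.S.A. and S.A."))     # U.S.A. and SA
```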

Related

Python - Remove word only from within a sentence

I am trying to remove a specific word from within a sentence, which is 'you'. The code is as listed below:
out1.text_condition = out1.text_condition.replace('you','')
This works, however, it also removes it from within a word that contains it, so when 'your' appears, it removes the 'you' from within it, leaving 'r' standing. Can anyone help me figure out what I can do to just remove the word, not the letters from within another string?
Thanks!
In order to replace whole words and not substrings, you should use a regular expression (regex).
Here is how to replace a whole word with the module re:
import re
def replace_whole_word_from_string(word, string, replacement=""):
    regular_expression = rf"\b{word}\b"
    return re.sub(regular_expression, replacement, string)
string = "you you ,you your"
result = replace_whole_word_from_string("you", string)
print(result)
Output:
, your
Explanation:
The two \b are what we call "word boundaries". The advantage over str.replace is that it will take into account the punctuation too.
In order to create the regular expression, here we use Literal String Interpolation (also called "f-strings", https://www.python.org/dev/peps/pep-0498/).
To create an f-string, we add the prefix f.
We also use the prefix r, in order to create a "raw string". We use a raw string in order to avoid escaping the backslash in \b.
Without the prefix r, we would have written regular_expression = f"\\b{word}\\b".
If you had used string.replace(' you ', ' '), you would have received this (wrong) output:
you ,you your
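A slightly more defensive variant of the same idea, as a sketch only. It assumes you also want to escape the word (in case it contains regex metacharacters) and to tidy up the spaces left behind. The name remove_whole_word is made up for this example.

```python
import re

def remove_whole_word(word, text):
    """Remove `word` where it stands as a whole word, then tidy the spaces."""
    # re.escape keeps characters such as "." or "+" inside `word` literal.
    pattern = rf"\b{re.escape(word)}\b"
    without_word = re.sub(pattern, "", text)
    # Collapse the runs of spaces left behind and trim both ends.
    return re.sub(r" {2,}", " ", without_word).strip()

print(remove_whole_word("you", "Are you sure your code works?"))
# Are sure your code works?
```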
A very simple solution is to replace the word with spaces around it with one space:
out1.text_condition = out1.text_condition.replace(' you ', ' ')
But note that it wouldn't remove, for example, "you." (at the end of a sentence) or "you,".
Easiest way is probably just to assume there are spaces before and after the word:
out1.text_condition = out1.text_condition.replace(' you ','')

How to replace substring between two other substrings in python?

I have a corpus of text documents, some of which will have a sequence of substrings. The first and last substrings are consistent, and mark the beginning and the end of the parts I want to replace. But, I would also like to delete/replace all substrings that exist between these first and last positions.
origSent = 'This is the sentence I am intending to edit'
Using the above as an example, how would I go about using 'the' as the start substring, and 'intending' as the end substring, deleting both in addition to the words that exist between them to make the following:
newSent = 'This is to edit'
You could use regex replacement here:
import re
origSent = 'This is the sentence I am intending to edit'
newSent = re.sub(r'\bthe((?!\bthe\b).)*\bintending\b', '', origSent)
print(newSent)
This prints:
This is to edit
The "secret sauce" in the regex pattern is the tempered dot:
((?!\bthe\b).)*
This will consume all content which does not cross over another occurrence of the word the. This prevents matching on some earlier the before intending, which we don't want to do.
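To see what the tempered dot buys you, here is a small comparison against a plain lazy match. The input sentence is invented for this sketch and contains the word "the" twice.

```python
import re

# Made-up sentence with an earlier "the" that should be kept.
sentence = "Keep the start, drop the sentence I am intending to edit"

# Lazy match: starts at the FIRST "the" and runs up to "intending".
lazy = re.sub(r"\bthe\b.*?\bintending\b", "", sentence)

# Tempered dot: the match may not cross another "the", so it can only
# start at the LAST "the" before "intending".
tempered = re.sub(r"\bthe\b(?:(?!\bthe\b).)*\bintending\b", "", sentence)

print(repr(lazy))      # 'Keep  to edit'
print(repr(tempered))  # 'Keep the start, drop  to edit'
```

Both results keep the spaces that surrounded the removed span, so you may want to collapse double spaces afterwards.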
I would do this:
s_list = origSent.split()
newSent = ' '.join(s_list[:s_list.index('the')] + s_list[s_list.index('intending')+1:])
Hope this helps.

Python: multiline regular expression problem

I have a very long string, the first lines of which are:
text = """
[text begins here] ... """
I want to remove all the \n characters at the beginning of it, so that I get only something like:
text = """[text begins here] ... """
I'm trying the following:
pattern = r"^\n*"
search = re.compile(pattern, re.S)
out = re.sub(pattern, "", text)
But it doesn't catch or replace anything.
How can I fix this?
(Note: I need to use RegEx for this, not string slicing or other methods.)
You can perform a left-strip on your text using str.lstrip:
out = text.lstrip()
You probably have more than just linefeed characters (\n) in your string (like carriage return (\r)). You should just use lstrip like the others have said, but the following regex should work:
re.sub(r'^[\r\n]*', '', text)
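A quick side-by-side of the two suggestions, assuming the text starts with a mix of carriage returns and linefeeds (the sample string here is invented). Passing "\r\n" to lstrip restricts it to those two characters, so leading spaces or tabs would be kept.

```python
import re

# Invented sample: the text starts with a Windows line ending plus a blank line.
text = "\r\n\n[text begins here] ...\n"

# Regex version: \A anchors at the very start of the string.
by_regex = re.sub(r"\A[\r\n]+", "", text)

# String-method version, limited to CR and LF characters.
by_lstrip = text.lstrip("\r\n")

print(repr(by_regex))   # '[text begins here] ...\n'
print(repr(by_lstrip))  # '[text begins here] ...\n'
```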

Python Regex: remove underscores and dashes except if they are in a dict of string substitutions

I'm pre-processing a string. I have a dictionary of 10k string substitutions (e. g. "John Lennon": "john_lennon"). I want to replace all other punctuation with a space.
The problem is some of these string substitutions contain underscores or hyphens, so I want to replace punctuation (except full stops) with spaces unless the word is contained in the keys of this dict. I also want to do it in one Regex expression since the text corpus is quite large and this could be a bottleneck.
So far, I have:
import re
input_str = "John Lennon: a musician, artist and activist."
multi_words = dict((re.escape(k), v) for k, v in multi_words.items())
pattern = re.compile("|".join(multi_words.keys()))
output_str = pattern.sub(lambda m: multi_words[re.escape(m.group(0))], input_str)
This replaces all strings using the keys in a dict. Now I just need to also remove punctuation in the same pass. This should return "john_lennon a musician artist and activist."
You could do it by adding one more alternative to the constructed regex which matches a single punctuation character. When the match is processed, a match not in the dictionary can be replaced with a space, using dictionary's get method. Here, I use [,:;_-] but you probably want to replace other characters.
Note: I moved the call to re.escape into the construction of the regex to avoid having to call it on every match.
import re
multi_words = {"John Lennon": "john_lennon"}  # sample entry from the question
input_str = "John Lennon: a musician, artist and activist."
pattern = re.compile("|".join(map(re.escape, multi_words.keys())) + "|[,:;_-]+")
output_str = pattern.sub(lambda m: multi_words.get(m.group(0), ' '), input_str)
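A self-contained sketch of this single-pass approach with two optional refinements that are my own assumptions, not part of the answer: keys are sorted longest first so that a longer key wins over a shorter key that is its prefix, and runs of spaces are collapsed afterwards. The dictionary contents and the name substitute are hypothetical.

```python
import re

# Hypothetical substitutions; the real dictionary would have ~10k entries.
multi_words = {"John Lennon": "john_lennon", "John": "john"}

# Longest keys first, so "John Lennon" is tried before "John".
keys = sorted(multi_words, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, keys)) + r"|[,:;_-]+")

def substitute(text):
    """Replace dictionary keys and turn the listed punctuation into spaces."""
    replaced = pattern.sub(lambda m: multi_words.get(m.group(0), " "), text)
    # A punctuation mark followed by a space leaves two spaces; collapse them.
    return re.sub(r" {2,}", " ", replaced)

print(substitute("John Lennon: a musician, artist and activist."))
# john_lennon a musician artist and activist.
```

The underscore inside "john_lennon" survives because re.sub does not rescan replacement text.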
You could handle the punctuation you would like to remove like the entries in your dictionary:
pattern = re.compile("|".join(multi_words.keys()) + r'|_|-')
and
multi_words['_'] = ' '
multi_words['-'] = ' '
Then these occurrences are treated like your key words.
But let me remind you that your code only works for a certain set of regular expressions. If you have the pattern foo.*bar in your keys and that matches a string like foo123bar, you will not find the corresponding value to the key by passing foo123bar through re.escape() and then searching in your multi_words dictionary for it.
I think the whole escaping you do should be removed and the code should be commented to make clear that only fixed strings are allowed as keys, not complex regular expressions matching variable inputs.
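The lookup problem described above can be reproduced in a few lines. This is only an illustration with a made-up one-entry dictionary whose key is a real regular expression.

```python
import re

# Made-up dictionary whose key is a regular expression, not a fixed string.
multi_words = {r"foo.*bar": "replacement"}

pattern = re.compile("|".join(multi_words))
match = pattern.search("foo123bar")

print(match.group(0))                            # foo123bar
# Neither the matched text nor its escaped form is a key of the dictionary,
# so multi_words[...] would raise KeyError in the substitution callback.
print(match.group(0) in multi_words)             # False
print(re.escape(match.group(0)) in multi_words)  # False
```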
You can add punctuation characters (excluding the full stop) to a character set as part of the items to match, and then handle punctuation and dict keys separately in the substitution function:
import re
import string
punctuation = string.punctuation.replace('.', '')
pattern = re.compile("|".join(multi_words.keys()) +
                     "|[{}]".format(re.escape(punctuation)))
def func(m):
    m = m.group(0)
    print(m, re.escape(m))
    if m in string.punctuation:
        return ''
    return multi_words[re.escape(m)]
output_str = pattern.sub(func, input_str)
print(output_str)
# john_lennon a musician artist and activist.
You may use a regex like (?:alt1|alt2...|altN)|([^\w\s.]+) and check if Group 1 (that is, any punctuation other than .) was matched. If yes, replace with an empty string:
pattern = re.compile(r"(?:{})|([^\w\s.]+)".format("|".join(multi_words.keys())))
output_str = pattern.sub(lambda m: "" if m.group(1) else multi_words[re.escape(m.group(0))], input_str)
A note about _: if you need to remove it as well, use r"(?:{})|([^\w\s.]+|_+)" because [^\w\s.] (matching any char other than word, whitespace and . chars) does not match an underscore (a word char) and you need to add it as a separate alternative.
Note on Unicode: if you deal with Unicode strings, in Python 2.x, pass re.U or re.UNICODE modifier flag to the re.compile() method.
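A runnable sketch of the capture-group approach, assuming Python 3 and a dictionary with plain (unescaped) keys. Escaping happens once when the pattern is built, so the lookup can use the matched text directly. The dictionary contents and the helper name replace_match are hypothetical.

```python
import re

# Hypothetical dictionary with plain, unescaped keys.
multi_words = {"John Lennon": "john_lennon"}

alternatives = "|".join(map(re.escape, multi_words))
# Group 1 only participates when punctuation (or an underscore) was matched.
pattern = re.compile(r"(?:{})|([^\w\s.]+|_+)".format(alternatives))

def replace_match(m):
    """Drop punctuation; otherwise look the matched key up in the dictionary."""
    return "" if m.group(1) else multi_words[m.group(0)]

print(pattern.sub(replace_match, "John Lennon: a musician, artist and activist."))
# john_lennon a musician artist and activist.
```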

How do I split a string on a space and swap the first part with the second part using Python's re.sub() method?

The string always has two parts, separated by a space.
import re
text="jahir islam"
print(re.sub(r' ',text))
Input: jahir islam
Output: islam jahir
You don't need re.sub for that. Just split the string on the space, reverse the parts, and join them back:
input_str = "jahir islam"
output_str = " ".join(input_str.split(" ")[::-1])
Using regex, you can do it this way if you know there are only two words to swap:
import re
text = 'jahir islam'
print(re.sub(r'(.*) (.*)', r'\2 \1', text))
Explanation:
re.sub(r'(.*) (.*)', r'\2 \1', text)
The pattern captures the two words as groups, and \1 and \2 in the replacement refer to those groups, so you can place them in any order you want.
If you have further questions, post a comment.
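A stricter variant of the same substitution, sketched under the assumption that the input is exactly two words separated by one space. Using \S+ with anchors means any other input is returned unchanged. The function name swap_two_words is made up.

```python
import re

def swap_two_words(text):
    """Swap the two words of a 'first second' string; leave other input as is."""
    # ^ and $ anchor the whole string; \S+ matches one word without spaces.
    return re.sub(r"^(\S+) (\S+)$", r"\2 \1", text)

print(swap_two_words("jahir islam"))  # islam jahir
print(swap_two_words("one two three"))  # one two three (no match, unchanged)
```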
