I want a function using regex that will map certain punctuation characters (these: ., ,, (, ), ;, :) to a function of that character. Specifically, it will put a space on either side.
For example, it would map the string "Hello, this is a test string." to "Hello , this is a test string . "
This is what I have right now:
import re
def add_spaces_to_punctuation(input_text):
    text = re.sub('[.]', ' . ', input_text)
    text = re.sub('[,]', ' , ', text)
    text = re.sub('[:]', ' : ', text)
    text = re.sub('[;]', ' ; ', text)
    text = re.sub('[(]', ' ( ', text)
    text = re.sub('[)]', ' ) ', text)
    return text
This works as intended, but is pretty unwieldy and hard to read. If I had a way to map each punctuation character to a function of that character in a single regex line, it would improve it significantly. Is there a way to do this with regex? Sorry if it's something obvious, I'm pretty new to regex and don't know what this kind of thing would be called.
You could try the following:
import re
recmp = re.compile(r'([.,:;()])')  # capture group so that \1 has something to refer to
def add_spaces_to_punctuation(input_text):
    text = recmp.sub(r' \1 ', input_text)
    return text
Also, according to this answer, precompiling the pattern should be faster if you need to run it often.
Try this:
text = re.sub(r"(\.|\,|\(|\)|\;|\:)", lambda x: f' {x.group(1)} ', text)
It uses a capture group to capture whatever the character may be and then map that to the same character with a space on either side using a lambda expression (see https://www.w3schools.com/python/python_lambda.asp for more info).
The part of the regex in parentheses (in this case the whole thing) is what gets captured, and x.group(1) will give you the captured character.
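As a minimal sketch of the same idea, a single character class combined with the whole-match reference \g<0> avoids both the capture group and the lambda. The helper add_spaces_normalized is a hypothetical addition, not part of the question; it assumes you also want to collapse the doubled spaces that the padding produces.

```python
import re

# One character class covers all six punctuation characters.
PUNCT = re.compile(r"[.,:;()]")

def add_spaces_to_punctuation(input_text):
    # \g<0> refers to the entire match, so no capture group is needed.
    return PUNCT.sub(r" \g<0> ", input_text)

def add_spaces_normalized(input_text):
    # Hypothetical variant: collapse runs of spaces created by the padding.
    return re.sub(r" {2,}", " ", add_spaces_to_punctuation(input_text))

print(repr(add_spaces_to_punctuation("Hello, this is a test string.")))
print(repr(add_spaces_normalized("Hello, this is a test string.")))
```

The first call leaves two spaces after the comma (one from the padding, one from the original text); the second collapses them into one.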
I'm working with Python regex and I'm trying to get the pattern that matched from a match object, not the matched text itself.
I have some patterns to replace and I'm doing this:
import re
patterns = {
    r'^[ |\n]+': '',
    r'[ |\n]+$': '',
    r'[ |\n]+': ' '
}
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join(patterns.keys()),
              lambda match: patterns[ match.group(0) ],
              text)
But this is a wrong solution because match.group(0) returns the matched text, so it will never equal any key of the patterns dict.
I tried match.pattern but got an exception, and tried match.re, but that gives the whole compiled re object, whose pattern for this problem is '^[ |\n]+|[ |\n]+$|[ |\n]+'.
EDIT: based on Barmar solution I got this:
import re
patterns = [
    (r'^[ |\n]+', ''),
    (r'[ |\n]+$', ''),
    (r'[ |\n]+', ' ')
]
def getreplacement(match):
    for i, group in enumerate(match.groups()):
        if group:
            return patterns[ i ][ 1 ]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[ 0 ] + ')' for p in patterns), getreplacement, text)
print(text)
But this is still not a general way to get the pattern from a match object.
I don't think there's a way to find out directly which alternative matched.
Use a list instead of a dictionary, and put each pattern in a capture group. Then you can see which capture group matched, and use that as the index to get the corresponding replacement.
Note that this won't work if there are any capture groups in the patterns. If groups are needed, make sure they're non-capturing.
import re
patterns = [
    (r'^[ |\n]+', ''),
    (r'[ |\n]+$', ''),
    (r'[ |\n]+', ' ')
]
def getreplacement(match):
    # match.groups() is a method; group indices run from 1 to the number of groups
    for i in range(1, len(match.groups()) + 1):
        if match.group(i):
            return patterns[i-1][1]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[0] + ')' for p in patterns), getreplacement, text)
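A possibly simpler variant of the same idea, as a minimal sketch: when every alternative is wrapped in its own top-level capture group and the patterns themselves contain no capture groups, match.lastindex holds the index of the group that matched. The character class [ \n] (without the pipe) and the sample text are assumptions made for illustration.

```python
import re

# (pattern, replacement) pairs; order matters because alternatives
# are tried left to right.
patterns = [
    (r"^[ \n]+", ""),   # leading whitespace
    (r"[ \n]+$", ""),   # trailing whitespace
    (r"[ \n]+", " "),   # inner runs of whitespace
]

# Wrap each pattern in its own capture group and join with "|".
combined = re.compile("|".join(f"({p})" for p, _ in patterns))

def get_replacement(match):
    # lastindex is the 1-based index of the capture group that matched,
    # assuming the individual patterns contain no capture groups.
    return patterns[match.lastindex - 1][1]

text = "  Hello there, I\n need your help here please :)  "
result = combined.sub(get_replacement, text)
print(repr(result))
```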
If I got it right, you want to strip leading and trailing spaces and reduce the ones in the middle to just one.
First, your code likely has a bug: [ |\n] will match a space ( ), a pipe (|), or a new line. You probably don't want to match a pipe, but you might want to match all whitespace characters, like tabs (\t), for example.
Second, styling: keep your lines under 80 chars and no spaces around indices in brackets.
Third, removing the leading and trailing spaces is simply done with str.strip. The only thing remaining to replace now is sequences of two or more whitespaces, which is easily matched with \s{2,} (\s = "whitespace", {2,} = "two or more").
Here is a modification of your code:
import re
patterns = [
    (r"^[ |\n]+", ""),
    (r"[ |\n]+$", ""),
    (r"[ |\n]+", " "),
]
def get_replacement(m: re.Match) -> str:
    return next(
        patterns[i][1]
        for i, group in enumerate(m.groups())
        if group is not None
    )
text = (
    "\n"
    " \t Hello there, I\n need your help here plase :) \t \n"
    " \t Hello there, I\n need your help here plase :) \t "
    "\n"
)
result1 = re.sub(
    "|".join(f"({p})" for p, _ in patterns),
    get_replacement,
    text,
)
result2 = re.sub(r"[ \n]{2,}", " ", text.strip())
result3 = re.sub(r"\s{2,}", " ", text.strip())
print(repr(result1))
print(repr(result2))
print(repr(result3))
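If the goal is only to strip the ends and collapse every run of whitespace to one space, a regex-free alternative is str.split() followed by str.join(). This is a different technique from the ones above, shown as a sketch; the function name normalize_whitespace is hypothetical. Unlike the \s{2,} version, it also turns a single tab or newline between words into a space.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())

sample = " \t Hello there, I\n need   your help \t "
print(repr(normalize_whitespace(sample)))
```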
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
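Putting this together with the lowercasing and punctuation removal from the question, a minimal sketch might look like the following. The function name clean_line is hypothetical, and the filter that drops empty words assumes you do not want a lone quote to leave a gap behind.

```python
import re

def clean_line(line):
    line = line.lower()
    # Replace everything except lowercase letters, spaces and quotes.
    line = re.sub(r"[^a-z ']+", " ", line)
    # Strip quotes from both ends of each word; quotes inside a word stay.
    words = (w.strip("'") for w in line.split())
    # Drop words that were nothing but quotes.
    return " ".join(w for w in words if w)

print(clean_line("Here is some stuff. 'Now there are quotes.' Now there's not."))
```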
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It's probably more efficient to do it @TigerhawkT3's way, though the two approaches produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\'   match any quote not preceded by a word character
# |           or
# \'(?!\w)    match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s))  # here is some stuff now there are quotes now there's not
If a word is a sequence of one or more letters, digits and underscores (that is, anything \w+ matches), you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it has a word character (letter, digit or _) on both sides.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it has a letter (ASCII or any) on both sides.
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext (text):
    ok = 0
    for x in text: ok += x.isalnum()
    return ok>0
def stripit (text, ofwhat):
    for x in ofwhat: text = text.strip(x)
    return text
def purge (text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this will kill all whitespaces too and transform them to space or remove them completely.
Basically, I have a string that has multiple double-whitespaces like this:
"Some text\s\sWhy is there no punctuation\s\s"
I also have a list of punctuation marks that should replace the double-whitespaces, so that the output would be this:
puncts = ['.', '?']
# applying some function
# output:
>>> "Some text. Why is there no punctuation?"
I have tried re.sub(' +', puncts[i], text) but my problem here is that I don't know how to properly iterate through the list and replace the 1st double-whitespace with the 1st element in puncts, the 2nd double-whitespace with the 2nd element in puncts and so on.
If we're still using re.sub(), here's one possible solution that follows this basic pattern:
Get the next punctuation character.
Replace only the first occurrence of a double whitespace in text with that character.
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s(?=\s)', i, text, count=1)
The call to re.sub() returns a string, and basically says "find all series of two whitespace characters, but only replace the first whitespace character with a punctuation character." The final argument "1" makes it so that we only replace the first instance of the double whitespace, and not all of them (default behavior).
If the positive lookahead (the part of the regex that we want to match but not replace) confuses you, you can also do without it:
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s\s', i + " ", text, count=1)
This yields the same output.
There will be a leftover whitespace at the end of the sentence, but if you're picky about that, a simple text.rstrip() should take care of it.
Further explanation
Your first try of using regex ' +' doesn't work because that regex matches every run of one or more spaces, including the single spaces between words, and replaces each of them with a punctuation character. The above solutions account for the double-whitespace in their respective regexes.
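The loop can also be folded into a single re.sub() call by passing a replacement function that pulls the next mark from an iterator. This is a sketch under stated assumptions: the function name punctuate is hypothetical, and once the marks run out, any remaining double whitespace is simply reduced to a single space.

```python
import re

def punctuate(text, puncts):
    marks = iter(puncts)

    def repl(match):
        # Take the next punctuation mark; fall back to "" when exhausted.
        return next(marks, "") + " "

    # Each double whitespace becomes "<mark> "; strip the trailing space.
    return re.sub(r"\s\s", repl, text).rstrip()

text = "Some text  Why is there no punctuation  "
print(punctuate(text, [".", "?"]))
```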
You can do it simply using the replace method!
text = "Some text  Why is there no punctuation  "
puncts = ['.', '?']
for i in puncts:
    text = text.replace("  ", i, 1)  # notice the 1 here
print(text)
Output : Some text.Why is there no punctuation?
You can use re.split() to break the string into substrings between the double spaces and intersperse the punctuation marks using join:
import re
string = "Some text  Why is there no punctuation  "
iPunct = iter([". ","? "])
result = "".join(x+next(iPunct,"") for x in re.split(r"\s\s",string))
print(result)
# Some text. Why is there no punctuation?
Basically, I want to drop all the dots in the abbreviations like "L.L.C.", converting to "LLC". I don't have a list of all the abbreviations. I want to convert them as they are found. This step is performed before sentence tokenization.
text = """
Proligo L.L.C. is a limited liability company.
S.A. is a place.
She works for AAA L.P. in somewhere.
"""
text = re.sub(r"(?:([A-Z])\.){2,}", "\1", text)
This does not work.
I want to remove the dots from the abbreviations so that the dots will not break the sentence tokenizer.
Thank you!
P.S. Sorry for not being clear enough. I edited the sample text.
Try using a callback function with re.sub:
import re
def callback(s):
    return s.replace('.', '')
text = "L.L.C., S.A., L.P."
text = re.sub(r"(?:[A-Z]\.)+", lambda m: callback(m.group()), text)
print(text)
The regex pattern (?:[A-Z]\.)+ will match one or more consecutive capital letters, each followed by a dot. Then, for each match, the callback function will strip off the dots.
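A slightly stricter variant, as a sketch: requiring a word boundary in front and at least two letter-dot pairs keeps the pattern from touching a capital letter that merely ends a sentence or a longer word. The function name drop_abbreviation_dots is hypothetical. One caveat under this assumption: when an abbreviation is the last word of a sentence, its final dot is removed as well.

```python
import re

# \b keeps the match from starting in the middle of a word;
# {2,} requires at least two "capital letter + dot" pairs.
ABBREVIATION = re.compile(r"\b(?:[A-Z]\.){2,}")

def drop_abbreviation_dots(text):
    return ABBREVIATION.sub(lambda m: m.group().replace(".", ""), text)

sample = (
    "Proligo L.L.C. is a limited liability company.\n"
    "S.A. is a place.\n"
    "She works for AAA L.P. in somewhere.\n"
)
print(drop_abbreviation_dots(sample))
```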
import re
string = 'ha.f.d.s.a.s.d.f'
re.sub(r'\.', '', string)
#output
hafdsasdf
Note that this only works properly if your text does not contain multiple sentences. If it does it will create one long sentence as all '.' are replaced.
Use this regular expression (note the escaped dot; an unescaped . would match any character after a capital letter):
>>> re.sub(r"(?<=[A-Z])\.", "", text)
'LLC, SA, LP'
>>>
The answers here are extremely aggressive: any capital alphabetical character followed by a period will be replaced.
I'd recommend:
text = "L.L.C., S.A., L.P."
text = re.sub(r"L\.L\.C\.|S\.A\.|L\.P\.", lambda x: x.group().replace(".", ""), text)
print(text) # => "LLC, SA, LP"
This will only match the abbreviations you're asking for. You can add word boundaries for additional strictness.
I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to be able to remove every whitespace-delimited token in this sentence that contains &#.
result = "this is some text and some more text and some other stuff"
I've been trying:
re.compile(r'([\s]&#.*?([\s])).sub(" ", text)
I can't seem to get the first part though.
You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff
You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, \s+ matches one or more whitespace characters, then \S* matches zero or more non-whitespace characters, &# matches literally, \S* again matches zero or more non-whitespace characters, and the final \s+ matches one or more whitespace characters. The whole match is replaced by a single space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
Regex Demo
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
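For comparison, the same cleanup can be sketched without a regex by splitting on whitespace and dropping every token that contains &#. This is a different technique from the answers above; the function name remove_noise is hypothetical, and it assumes that collapsing all whitespace runs to single spaces is acceptable.

```python
def remove_noise(text, marker="&#"):
    # Keep only the whitespace-delimited tokens that lack the marker.
    return " ".join(word for word in text.split() if marker not in word)

text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
print(remove_noise(text))
```

Because it works token by token, this version also handles noise at the very start or end of the string.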
Try this:
import re
result = re.findall(r"[a-zA-Z]+&#[a-zA-Z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the words in the result list from the list of all words.
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+&#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff'
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+&#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'