Punctuation special sequence \p fails in regex [duplicate] - python

I have this code for removing all punctuation from a regex string:
import regex as re
re.sub(ur"\p{P}+", "", txt)
How would I change it to allow hyphens? If you could explain how you did it, that would be great. I understand that here, correct me if I'm wrong, P with anything after it is punctuation.

[^\P{P}-]+
\P is the complement of \p: it matches anything that is not punctuation. So the class matches anything that is not (not punctuation or a dash), which leaves all punctuation except dashes.
Example: http://www.rubular.com/r/JsdNM3nFJ3
If you want a non-convoluted way, an alternative is \p{P}(?<!-): match all punctuation, and then check it wasn't a dash (using negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
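If you cannot install the regex module, the same "all punctuation except the hyphen" rule can be approximated with the standard library. This is only a sketch using a different technique (unicodedata instead of \p{P}); the function name is made up for illustration.

```python
import unicodedata

def strip_punctuation_except_hyphen(text):
    # Keep a character if it is the hyphen, or if its Unicode general
    # category is not one of the punctuation categories (Pc, Pd, Ps, Pe, ...).
    return "".join(
        ch for ch in text
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation_except_hyphen("well-known, “quoted” text!"))
# well-known quoted text
```

Like \p{P}, this leaves symbol characters such as $ or ^ alone, because Unicode classifies them as symbols rather than punctuation.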

Here's how to do it with the re module, in case you have to stick with the standard libraries:
# works in python 2 and 3
import re
import string
remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
pattern = r"[{}]".format(re.escape(remove)) # create the pattern; escaping keeps ], \ and ^ literal inside the class
txt = ")*^%{}[]thi's - is - ###!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'
If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
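For reference, a minimal Python 3 sketch of the str.translate variant, using str.maketrans to build the deletion table. It reuses the sample text from the answer above; the names table and cleaned are only illustrative.

```python
import string

# Every ASCII punctuation character except the hyphen.
remove = string.punctuation.replace("-", "")

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", remove)

txt = ")*^%{}[]thi's - is - ###!a !%%!!%- test."
cleaned = txt.translate(table)
print(cleaned)
# this - is - a - test
```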

You could either specify the punctuation you want to remove manually, as in [._,] or supply a function instead of the replacement string:
re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)

Related

Python Regular Expression: replace a letter if it is not a part of the word in a list

Assume I have a word list like [cat,hat,mat,ate] and I would like to remove every letter a in a string like acatbatmate, giving catbtmate, unless that a is part of a word in the list.
In the current step, I can split the string by the words in the word list with the following codes:
''.join([word.replace('a','')
if word not in ['cat','hat','mat','ate']
else word for word in re.split('(cat|hat|mat|ate)','acatbatmate') ])
Is it possible for me to use re.sub(pattern, repl, string) to remove the letter a straightforwardly?
You may easily do it with re like this:
import re
except_contexts = ['cat','hat','mat','ate']
print(re.sub(r'({})|a'.format("|".join(except_contexts)), lambda x: x.group(1) if x.group(1) else '', 'acatbatmate'))
# => catbtmate
See the Python 2 demo.
If you are using Python 3.5+, it is even easier with a mere backreference:
import re
except_contexts = ['cat','hat','mat','ate']
print(re.sub(r'({})|a'.format("|".join(except_contexts)), r'\1', 'acatbatmate'))
However, if you plan to replace that a, you will need to use the lambda expression.
Details
r'({})|a'.format("|".join(except_contexts)) will look like (cat|hat|mat|ate)|a regex. It will match and capture cat, hat, etc. into Group 1 and if it matches, we need to replace with this group contents. Else, we either replace with an empty string or a required replacement.
See the regex demo.
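For the case where the a should be replaced rather than removed, a sketch along the same lines might look like this. The replacement character "_" and the function name replace_a are assumptions for illustration, and re.escape is added only as a precaution in case a protected word contains regex metacharacters.

```python
import re

except_contexts = ["cat", "hat", "mat", "ate"]
pattern = re.compile("({})|a".format("|".join(map(re.escape, except_contexts))))

def replace_a(match, replacement="_"):
    # Group 1 is set when a protected word matched: put it back unchanged.
    # Otherwise a lone 'a' matched: swap in the replacement.
    return match.group(1) if match.group(1) else replacement

result = pattern.sub(replace_a, "acatbatmate")
print(result)
# _catb_tmate
```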
Yes, you can (I've always wanted to write it like that...):
import regex as re
exceptions = ['cat','hat','mat','ate']
rx = re.compile(r'''(?:{})(*SKIP)(*FAIL)|a+'''.format('|'.join(exceptions)))
word = rx.sub('', 'acatbatmate')
print(word)
This makes use of the newer regex module that supports (*SKIP)(*FAIL).
The pattern here is:
(?:cat|hat|mat|ate)(*SKIP)(*FAIL)
|
a+
Without a new module, you could use a function handler:
import re
exceptions = ['cat','hat','mat','ate']
def handler(match):
    # group(1) is set only when a run of 'a' matched outside the exceptions
    if match.group(1):
        return ''
    return match.group(0)
rx = re.compile(r'''(?:{})|(a+)'''.format('|'.join(exceptions)))
word = rx.sub(handler, 'acatbatmate')
print(word)

Python Regex: remove underscores and dashes except if they are in a dict of string substitutions

I'm pre-processing a string. I have a dictionary of 10k string substitutions (e. g. "John Lennon": "john_lennon"). I want to replace all other punctuation with a space.
The problem is some of these string substitutions contain underscores or hyphens, so I want to replace punctuation (except full stops) with spaces unless the word is contained in the keys of this dict. I also want to do it in one Regex expression since the text corpus is quite large and this could be a bottleneck.
So far, I have:
import re
input_str = "John Lennon: a musician, artist and activist."
multi_words = dict((re.escape(k), v) for k, v in multi_words.items())
pattern = re.compile("|".join(multi_words.keys()))
output_str = pattern.sub(lambda m: multi_words[re.escape(m.group(0))], input_str)
This replaces all strings using the keys in a dict. Now I just need to also remove punctuation in the same pass. This should return "john_lennon a musician artist and activist."
You could do it by adding one more alternative to the constructed regex which matches a single punctuation character. When the match is processed, a match not in the dictionary can be replaced with a space, using dictionary's get method. Here, I use [,:;_-] but you probably want to replace other characters.
Note: I moved the call to re.escape into the construction of the regex to avoid having to call it on every match.
import re
input_str = "John Lennon: a musician, artist and activist."
pattern = re.compile("|".join(map(re.escape, multi_words.keys())) + "|[,:;_-]+")
output_str = pattern.sub(lambda m: multi_words.get(m.group(0), ' '), input_str)
You could handle the punctuation you would like to remove like the entries in your dictionary:
pattern = re.compile("|".join(multi_words.keys()) + r'|_|-')
and
multi_words['_'] = ' '
multi_words['-'] = ' '
Then these occurrences are treated like your key words.
But let me remind you that your code only works for a certain set of regular expressions. If you have the pattern foo.*bar in your keys and that matches a string like foo123bar, you will not find the corresponding value to the key by passing foo123bar through re.escape() and then searching in your multiword dictionary for it.
I think the whole escaping you do should be removed and the code should be commented to make clear that only fixed strings are allowed as keys, not complex regular expressions matching variable inputs.
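A minimal sketch of that suggestion, assuming the keys are plain strings: keep the dictionary unescaped, apply re.escape only while building the pattern, and look matches up directly. The two sample entries and the helper name substitute are made up for illustration, and punctuation is dropped rather than replaced with a space so that the result matches the output the question asks for.

```python
import re

# Hypothetical substitution table; keys are fixed strings, not regexes.
multi_words = {"John Lennon": "john_lennon", "Yoko Ono": "yoko_ono"}

# Longest keys first so that a longer phrase wins over a shorter prefix.
keys = sorted(multi_words, key=len, reverse=True)

# Escape once at build time, then add punctuation (except '.') and '_'.
pattern = re.compile("|".join(map(re.escape, keys)) + r"|[^\w\s.]+|_+")

def substitute(match):
    # A dictionary hit returns the mapped value; anything else that
    # matched is punctuation and is dropped.
    return multi_words.get(match.group(0), "")

input_str = "John Lennon: a musician, artist and activist."
output_str = pattern.sub(substitute, input_str)
print(output_str)
# john_lennon a musician artist and activist.
```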
You can add punctuations (excluding full stop) in a character set as part of the items to match, and then handle punctuations and dict keys separately in the substitution function:
import re
import string
punctuation = string.punctuation.replace('.', '')
pattern = re.compile("|".join(multi_words.keys())+
"|[{}]".format(re.escape(punctuation)))
def func(m):
    m = m.group(0)
    print(m, re.escape(m))
    if m in string.punctuation:
        return ''
    return multi_words[re.escape(m)]
output_str = pattern.sub(func , input_str)
print(output_str)
# john_lennon a musician artist and activist.
You may use a regex like (?:alt1|alt2...|altN)|([^\w\s.]+) and check if Group 1 (that is, any punctuation other than .) was matched. If yes, replace with an empty string:
pattern = re.compile(r"(?:{})|([^\w\s.]+)".format("|".join(multi_words.keys())))
output_str = pattern.sub(lambda m: "" if m.group(1) else multi_words[re.escape(m.group(0))], input_str)
See the Python demo.
A note about _: if you need to remove it as well, use r"(?:{})|([^\w\s.]+|_+)" because [^\w\s.] (matching any char other than word, whitespace and . chars) does not match an underscore (a word char) and you need to add it as a separate alternative.
Note on Unicode: if you deal with Unicode strings, in Python 2.x, pass re.U or re.UNICODE modifier flag to the re.compile() method.

how to use python regex find matched string?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find each "#..'...'" part, such as "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'" as a single match. How should I modify the regex to find each part separately?
Make the inner groups non-capturing, and instead of matching any character, match only characters that are not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
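To see the difference concretely, here is a small sketch that runs a permissive pattern (the question's regex without its groups) and the restricted one on the test string. The variable names are only for illustration.

```python
import re

test = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Permissive: '.+' is greedy, so it runs on to the last "~'" in the string.
permissive = re.findall(r"#.+~'.*?'", test)

# Restricted: only the characters expected in a name and in a value.
restricted = re.findall(r"#[a-z]+~'[-a-z]*'", test)

print(len(permissive))  # 1
print(restricted)
# ["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
```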
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# find everything that begins with '#' and stop before the next ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

How to remove substrings marked with special characters from a string?

I have a string in Python:
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print Tt
'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'
As you can see, some words are repeated inside <\" \"> markers.
My question is: how can I delete those repeated parts (the ones delimited by these characters)?
The result should be like:
'This is a string, It should be changed to a nummber.'
Use regular expressions:
import re
Tt = re.sub('<\".*?\">', '', Tt)
Note the ? after *. It makes the expression non-greedy,
so it tries to match as few symbols between <\" and \"> as possible.
James's solution works only when each delimiter is a single character (< and >); in that case you can use a negated class such as [^>]. If you want to remove a substring delimited by character sequences (e.g. by begin and end), you should use a non-greedy regular expression (i.e. .*?).
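The effect of the ? is easy to check on the string from the question. This is just a sketch; greedy and lazy are illustrative names.

```python
import re

Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."

# Greedy: .* runs from the first <" to the last ">, swallowing the text between.
greedy = re.sub(r'<".*">', '', Tt)

# Non-greedy: .*? stops at the nearest ">, so each marker is removed on its own.
lazy = re.sub(r'<".*?">', '', Tt)

print(greedy)  # This is a a nummber.
print(lazy)    # This is a string, It should be changed to a nummber.
```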
I'd use a quick regular expression:
import re
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(re.sub("<[^<]+>","",Tt))
#Out: This is a string, It should be changed to a nummber.
Ah - similar to Igor's post, he beat me by a bit. Rather than making the expression non-greedy, I don't match an expression if it contains another start tag "<" in it, so it will only match a start tag that's followed by an end tag ">".

Python: Ignore a # / and random numbers in a string

I use some code to read a website, scrape some information, pass it to Google and print some directions.
I'm having an issue with some of the information: the site I use sometimes adds a # followed by 3 random digits, then a / and another 3 digits, e.g. #037/100.
How can I use Python to ignore this "#037/100" string?
I currently use
for i, part in enumerate(list(addr_p)):
    if '#' in part:
        del addr_p[i]
        break
to remove the # if found but I'm not sure how to do it for the random numbers
Any ideas ?
If you find yourself wanting to remove "three digits followed by a forward slash followed by three digits" from a string s, you could do
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\d{3}\/\d{3}', '', s)
print(t)
Result (the space on each side of the removed token is kept, so two spaces remain):
this is a string  with other stuff
Explanation:
# - literal character '#'
\d{3} - exactly three digits
\/ - forward slash (the backslash is optional here; / has no special meaning in Python's re)
\d{3} - exactly three digits
And the whole thing that matches the above (if it's present) is replaced with '' - i.e. "removed".
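One side effect of removing only the token is that the spaces on both sides of it remain. A small variation, offered as a sketch, also consumes the whitespace in front of the token so the sentence closes up cleanly:

```python
import re

s = "this is a string #123/234 with other stuff"

# \s* also removes any whitespace immediately before the token.
t = re.sub(r"\s*#\d{3}/\d{3}", "", s)
print(t)
# this is a string with other stuff
```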
import re
re.sub(r'#[0-9]+\/[0-9]+$', '', addr_p[i])
I'm no wizard with regular expressions, but I'd imagine you could do something like this.
You could even handle '#' in the regexp as well.
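If addr_p is a list of address parts, as the loop in the question suggests, the whole token can be filtered out in one step. This is a sketch under that assumption; the sample list and the name TOKEN are made up for illustration.

```python
import re

# A part consisting of '#', three digits, '/', three digits and nothing else.
TOKEN = re.compile(r"#\d{3}/\d{3}")

# Hypothetical scraped address parts.
addr_p = ["12 High Street", "#037/100", "London"]

# fullmatch only succeeds when the entire part is the token.
cleaned = [part for part in addr_p if not TOKEN.fullmatch(part)]
print(cleaned)
# ['12 High Street', 'London']
```

Building a new list also avoids deleting from the list while looping over it.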
If the format is always the same, then you could check if the line starts with a #, then set the string to itself without the first 8 characters.
if part[0:1] == '#':
    part = part[8:]
If the first character is a #, this sets the string to itself minus its first 8 characters (the length of a token like #037/100).
I'd double your problems and match against a regular expression for this.
import re
regex = re.compile(r'([\w\s]+)#\d+\/\d+([\w\s]+)')
m = regex.match('This is a string with a #123/987 in it')
if m:
    s = m.group(1) + m.group(2)
    print(s)
A more concise way:
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\S+', '', s)
print(t)
