Cannot substitute with regular expression re library - python

I am trying to remove all elements that have the form \x followed by two numbers. I have created the following regular expression r'\\x[0-9][0-9]'. I then test it with the following code:
pattern1 = r'\\x[0-9][0-9]'
a = "\x85ciao \x85839"
re.sub(pattern1, "", a)
But it is not working as it does not replace anything. The output is in fact the same as string a. What could be causing this behaviour?
I am also having the same issue with replacing \' in strings. I would like to remove only the backslash and keep the '. How can I do this?

You can do it like this:
import re
a = "\x85ciao \x85839"
re.sub('\x85','', a)
or simply:
a.replace("\x85", "")
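For context on why the original pattern replaced nothing: in a normal (non-raw) string literal, "\x85" is an escape sequence that Python turns into a single character with code point 0x85, so the string never contains a backslash for r'\\x[0-9][0-9]' to find. A minimal sketch, assuming Python 3; the names b and c are only for illustration:

```python
import re

pattern1 = r'\\x[0-9][0-9]'

# In a normal literal, "\x85" is ONE character (code point 0x85), not four.
a = "\x85ciao \x85839"
print(len("\x85"))                      # 1
print(re.sub(pattern1, "", a) == a)     # True: there is no backslash to match

# In a raw literal the backslash is kept, so the pattern does match.
b = r"\x85ciao \x85839"
print(re.sub(pattern1, "", b))          # ciao 839

# Second part of the question: drop a backslash that precedes a single quote.
c = r"it\'s"                            # i, t, backslash, ', s
print(c.replace("\\'", "'"))            # it's
```

If the backslashes only show up when you look at repr() of the string, they are probably not in the data at all, and the plain replace on the real character is the right tool.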

Related

replacing a sub string using regular expression in python

I have a string that contains sub strings like
RTDEFINITION(55,4) RTDEFINITION(45,2)
I need to replace every occurrence of this kind of string with another string:
DEFRTE
using Python and regular expressions. Any ideas?
thx
This should work
import re
re.sub(r'RTDEFINITION\(\d+,\d+\)', 'DEFRTE', mystring)
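A quick check of that pattern against the sample text from the question (the variable name result is just for illustration):

```python
import re

mystring = "RTDEFINITION(55,4) RTDEFINITION(45,2)"  # sample input from the question

# \( and \) match literal parentheses; \d+ matches one or more digits.
result = re.sub(r'RTDEFINITION\(\d+,\d+\)', 'DEFRTE', mystring)
print(result)  # DEFRTE DEFRTE
```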

Regex check if backslash before every symbols using python

I ran into a problem while trying to check whether an input regex is correct.
I'd like to check whether there is a backslash before every symbol, but I don't know how to implement this in Python.
For example:
number: 123456789. (return False)
phone\:111111 (return True)
I tried using (?!) and (?=) in Python, but it didn't work.
Update:
I'd like to match the following string:
\~, \!, \#, \$, \%, \^, \&, \*, \(, \), \{, \}, \[, \], \:, \;, \", \', \>, \<, \?
Thank you very much.
import re
# A raw string keeps the backslash in the test input as a literal character.
if re.search(r"\\\W", r"phone\:111111") is not None:
    print("OK")
Does it work?
Reading between the lines a bit, it sounds like you are trying to pass a string to a regex and you want to make sure it has no special characters in it that are unescaped.
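Python's re module has a built-in re.escape() function for this.
Example (output from Python versions before 3.7; from 3.7 on, re.escape() no longer escapes ':' because it has no special meaning in a regex):
>>> import re
>>> print(re.escape("phone:111111"))
phone\:111111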
Check that the entire string is composed of single characters or pairs of backslash+symbol:
import re
def has_backslash_before_every_symbol(s):
    # The inner double quote is written as \" so it does not end the string literal.
    return re.match(r"^(\\[~!#$%^&*(){}\[\]:;\"'><?]|[^~!#$%^&*(){}\[\]:;\"'><?])*$", s) is not None
Python regex reference: https://docs.python.org/3/library/re.html
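Since the question mentions trying lookarounds, here is one way the same check could be sketched with a negative lookbehind: search for any listed symbol that is not preceded by a backslash. The names SYMBOLS, UNESCAPED and every_symbol_escaped are made up for this example, and the sketch assumes a backslash directly before a symbol always counts as escaping it (so a doubled backslash is not treated specially).

```python
import re

# Symbols listed in the question's update.
SYMBOLS = "~!#$%^&*(){}[]:;\"'><?"

# A listed symbol that is not immediately preceded by a backslash.
UNESCAPED = re.compile(r"(?<!\\)[" + re.escape(SYMBOLS) + "]")

def every_symbol_escaped(s):
    """Return True if no listed symbol appears without a backslash before it."""
    return UNESCAPED.search(s) is None

print(every_symbol_escaped("number: 123456789."))   # False
print(every_symbol_escaped(r"phone\:111111"))       # True
```

The "." in the first input is not in the question's list, so only the ":" makes it fail.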

How to find a specific character in a string and put it at the end of the string

I have this string:
'Is?"they'
I want to find the question mark (?) in the string, and put it at the end of the string. The output should look like this:
'Is"they?'
I am using the following regular expression in python 2.7. I don't know why my regex is not working.
import re
regs = re.sub('(\w*)(\?)(\w*)', '\\1\\3\\2', 'Is?"they')
print regs
Is?"they # this is the output of my regex.
Your regex does match, but " is not in the \w character class, so the third group (\w*) captures nothing and the ? is put back where it was. You would need to allow " in the third group, for example:
regs = re.sub('(\w*)(\?)(["\w]*)', '\\1\\3\\2', 'Is?"they')
As your output shows, " is not matched by \w. Hence, it would probably be best to just use a .:
>>> import re
>>> re.sub("(.*)(\?)(.*)", r'\1\3\2', 'Is?"they')
'Is"they?'
>>>
. matches any character in a regex (except newlines, unless the re.DOTALL flag is set).
Also, you'll notice that I used a raw-string for the second argument of re.sub. Doing so is cleaner than having all those backslashes.
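For comparison, a small sketch that puts the regex approach next to a plain string approach (assuming Python 3; the name moved is only for illustration). Note that because (.*) is greedy, the regex moves the last ? when there are several, while the string version below moves the first one.

```python
import re

s = 'Is?"they'

# Regex version with raw strings for both the pattern and the replacement.
print(re.sub(r'(.*)(\?)(.*)', r'\1\3\2', s))   # Is"they?

# Plain string version: drop the first "?" and append one at the end.
if '?' in s:
    moved = s.replace('?', '', 1) + '?'
else:
    moved = s
print(moved)                                    # Is"they?
```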

can't use variable inside regex

So, I have a long sequence of Unicode characters that I want to match using regular expressions:
char_set = '\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
(These are all the uppercase characters included in the Unicode range 0-382. Most of them are accented. PEP8 discourages the use of non-ASCII characters in Python scripts, so I'm using the Unicode codes instead of the string literals.)
If I simply compile that long string directly, it works. For instance, this matches all the words that begin with one of those characters:
regex = re.compile(u'\A[\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D]')
But I want to re-use that same sequence of characters in several other regular expressions. I could simply copy and paste it every time, but that's ugly. So based on previous answers to similar questions I've tried this:
regex = re.compile(u'\A[%s]' % char_set)
No good. Somehow the above expression seems to match ANY character, not just the ones hardcoded under the variable 'char_set'.
I've also tried this:
regex = re.compile(u'\A[' + char_set + ']')
And this:
regex = re.compile(u'\A[' + re.escape(char_set) + ']')
And this too:
regex = re.compile(u'\A[{ }]'.format(char_set))
None of which works as expected.
Any thoughts? What am I doing wrong?
(I'm using Python 2.7 and Mac OS X 10.6)
When you're using a pattern with a set of characters in square brackets, you don't want to put any vertical bar (|) characters in the set. Instead, just string the characters together and it should work. Here's a session where I tried out your characters with no problems after stripping the | chars:
>>> import re
>>> char_set = u'\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
>>> fixed_char_set = char_set.replace("|", "") # remove the unneeded vertical bars
>>> pattern = ur"\A[{}]".format(fixed_char_set) # create a pattern string
>>> regex = re.compile(pattern) # compile the pattern into a regex object
>>> print regex.match("%foo") # "%" is not in the character set, so match returns None
None
edit: Actually, it seems like there must be some other issue going on, since I don't match "%foo" even if I use your original char_set without stripping out anything. Please give examples of text that is matching when it shouldn't!
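A likely cause, offered as a guess since the failing inputs were never posted: the question assigns char_set with a plain '...' literal, while the session above uses u'...'. In Python 2.7 a plain literal does not interpret \uXXXX, so the class would contain a literal backslash, the letter u and digits, and "1-\u" would become a range running from "1" to "u" that covers most letters and digits. Below is a minimal sketch of the reusable-class idea, assuming Python 3 (where every str literal is Unicode); the set is shortened for illustration and the names UPPER, starts_upper and all_upper are made up.

```python
import re

# Character-class body: ranges and single characters placed side by side,
# with no "|" separators (inside [...] a "|" is just a literal bar).
# Shortened here; the full list from the question is used the same way.
UPPER = '\u0041-\u005A\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104'

starts_upper = re.compile(r'\A[' + UPPER + r']')      # text starts with one of them
all_upper = re.compile(r'\A[' + UPPER + r']+\Z')      # every character is one of them

print(bool(starts_upper.match('\u00C9cole')))   # True, U+00C9 is inside \u00C0-\u00D6
print(bool(starts_upper.match('%foo')))         # False
print(bool(starts_upper.match('|foo')))         # False, no bar in the class
print(bool(all_upper.match('ABC')))             # True
```

On Python 2.7 the same idea should work as long as both the set and the pattern are u'...' literals.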

Python regular expression with string in it

I would like to match a string with something like:
re.match(r'<some_match_symbols><my_match><some_other_match_symbols>', mystring)
where <my_match> is a string I would like it to find. The problem is that it may be different from time to time, and it is stored in a variable. Would it be possible to add a variable to a regex?
Nothing prevents you from simply doing this (with my_match being the variable that holds your string):
re.match('<some_match_symbols>' + my_match + '<some_other_match_symbols>', mystring)
Regular expressions are nothing more than strings containing some special characters specific to the regular expression syntax. They are still strings, so you can do whatever you are used to doing with strings.
By the way, the r'…' syntax is the raw string syntax, which just prevents escape sequences inside the string from being evaluated. So r'\n' is the same as '\\n', a string containing a backslash and an n, while '\n' contains a line break.
import re
url = "www.dupe.com"
# Interpolate the variable into the pattern string, then compile it.
expression = re.compile('<p>%s</p>' % url)
result = expression.match("<p>www.dupe.com</p>BBB")
if result:
    print(result.start(), result.end())
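The r'' prefix only applies to string literals written in your source code. To build a pattern from a variable, format or concatenate the variable into the pattern string and pass the result to re.compile().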
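One caveat when the variable holds plain text rather than regex syntax: characters such as "." keep their regex meaning once they are inside the pattern. A short sketch of wrapping the variable in re.escape() to avoid that, assuming Python 3; the name loose is only for illustration:

```python
import re

url = "www.dupe.com"          # plain text held in a variable

# re.escape makes the "." in the variable match a literal dot only.
expression = re.compile(r'<p>' + re.escape(url) + r'</p>')

print(bool(expression.match("<p>www.dupe.com</p>BBB")))   # True
print(bool(expression.match("<p>wwwXdupeYcom</p>")))      # False

# Without re.escape, "." matches any character, so this also matches.
loose = re.compile('<p>%s</p>' % url)
print(bool(loose.match("<p>wwwXdupeYcom</p>")))           # True
```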
