Using python 2.7 regex to replace parts of a string - python

I have the following line:
b = re.sub('^xMain (\S+)/y1,/y0 (\S+ )(.*)$', 'xMain \2\1\3', a)
where a is:
xMain Buchan/y1,/y0 Angus Sub1
Why does b come out as 'xMain \x02\x01\x03'?
My intention is to de-invert a name. In Regexbuddy this works OK but not in Python 2.7.

You see unprintable characters because \2\1\3 have meaning in a regular python string too, as octal escape codes:
>>> '\2'
'\x02'
>>> 'xMain \2\1\3'
'xMain \x02\x01\x03'
They never make it to the re.sub() function as written.
Use a raw string literal instead:
b = re.sub('^xMain (\S+)/y1,/y0 (\S+ )(.*)$', r'xMain \2\1\3', a)
Note the r'...' string. In a raw string literal \... escape codes are not interpreted, leaving the back-references in place for the re module to use:
>>> r'xMain \2\1\3'
'xMain \\2\\1\\3'
The alternative would be to double the backslashes, escaping the escape:
b = re.sub('^xMain (\S+)/y1,/y0 (\S+ )(.*)$', 'xMain \\2\\1\\3', a)
Either way, your replacement pattern now works as expected:
>>> import re
>>> a = 'xMain Buchan/y1,/y0 Angus Sub1'
>>> re.sub('^xMain (\S+)/y1,/y0 (\S+ )(.*)$', r'xMain \2\1\3', a)
'xMain Angus BuchanSub1'
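The same fix as a small script, as a minimal sketch: it assumes Python 3 syntax (though nothing here is version-specific), and the helper name deinvert is made up for illustration. Both the pattern and the replacement are raw strings, and the replacement uses the \g<n> form of back-reference, which the re module also accepts and which stays unambiguous when a digit follows the group number.

```python
import re

# Raw strings keep every backslash intact for the re module.
PATTERN = r'^xMain (\S+)/y1,/y0 (\S+ )(.*)$'
REPLACEMENT = r'xMain \g<2>\g<1>\g<3>'  # equivalent to r'xMain \2\1\3'


def deinvert(line):
    """Swap the two name parts (hypothetical helper name)."""
    return re.sub(PATTERN, REPLACEMENT, line)


print(deinvert('xMain Buchan/y1,/y0 Angus Sub1'))  # xMain Angus BuchanSub1
```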

Related

Python prevent decoding HEX to ASCII while removing backslashes from my Var

I want to strip some unwanted symbols from my variable. In this case the symbols are backslashes. I am using a HEX number, and as an example I will show some short, simple code below. I don't want Python to convert my HEX to ASCII; how do I prevent this from happening? I have some long shellcode for asm to work with later, and removing \ by hand is a long process. I know there are other ways, like using echo -e "x\x\x\x" > output, but my whole script will be written in Python.
Thanks
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> b = a.strip("\\")
>>> print b
1�Phtv
>>> a = "\x31\x32\x33\x34\x35\x36"
>>> b = a.strip("\\")
>>> print b
123456
At the end I would like it to print my var:
>>> print b
x31x32x33x34x35x36
There are no backslashes in your variable:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(a)
1ÀPhtv
Take newline for example: writing "\n" in Python gives you a string with one character, a newline, and no backslashes. See the string literals docs for the full syntax of these escapes.
Now, if you really want to write a string that contains such backslashes, you can do it with the r prefix:
>>> a = r"\x31\xC0\x50\x68\x74\x76"
>>> print(a)
\x31\xC0\x50\x68\x74\x76
>>> print(a.replace('\\', ''))
x31xC0x50x68x74x76
But if you want to convert a regular string to hex-coded symbols, you can do it character by character, converting it to number ("\x31" == "1" --> 49), then to hex ("0x31"), and finally stripping the first character:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(''.join([hex(ord(x))[1:] for x in a]))
x31xc0x50x68x74x76
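One caveat with hex(ord(x))[1:] is that characters below 0x10 come out with a single hex digit ("\x05" becomes x5). A sketch that pads every byte to two digits, assuming Python 3 and a bytes value; the name to_hex_text is made up for illustration:

```python
def to_hex_text(data):
    """Render each byte as 'x' followed by two lowercase hex digits."""
    # Iterating over a bytes object in Python 3 yields integers.
    return ''.join('x{:02x}'.format(byte) for byte in data)


shellcode = b"\x31\xC0\x50\x68\x74\x76"
print(to_hex_text(shellcode))  # x31xc0x50x68x74x76
print(to_hex_text(b"\x05"))    # x05
```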
There are two problems in your code.
First the simple one:
strip() only removes characters from the beginning and end of the string. So you should use replace("\\", ""). This replaces every backslash with "", which is the same as removing it.
The second problem is Python's behavior with backslashes:
To get your example working you need to put an 'r' in front of your string to indicate that it is a raw string: a = r"\x31\xC0\x50\x68\x74\x76". In raw strings, a backslash doesn't escape a character but just stays a backslash.
>>> r"\x31\xC0\x50\x68\x74\x76"
'\\x31\\xC0\\x50\\x68\\x74\\x76'

Word boundary detection doesn't seem to work for me in Python re

I tried using:
>>> wbpat='\btest\b'
>>> re.findall(wbpat, 'a test tested in testing')
The result that I expected to get was ['test'], but somehow I am getting an empty list. What could be the problem?
In a regular string literal, \b is the escape code for a backspace (a length-1 string). Use r'\btest\b'. The leading r tells the Python interpreter to treat the literal as a "raw" string: backslashes are kept as they are and escape sequences are not interpreted.
Example:
>>> len('\btest\b') # <backspace>test<backspace>
6
>>> len(r'\btest\b') # <backslash>btest<backslash>b
8
>>> import re
>>> re.findall(r'\btest\b','a test tested in testing')
['test']
It's a good habit to use a raw string for regular expressions in Python.
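When the word to search for comes from a variable, the same idea can be sketched as below. The helper name find_whole_word is hypothetical; re.escape() is added on the assumption that the word may contain characters that are special in a regular expression.

```python
import re


def find_whole_word(word, text):
    """Return every standalone occurrence of word in text."""
    # Raw strings keep \b as a regex word boundary, not a backspace.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.findall(pattern, text)


print(find_whole_word('test', 'a test tested in testing'))  # ['test']
```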

ignoring backslash character in python

This one is a bit tricky I think.
if I have:
a = "fwd"
b = "\fwd"
how can I ignore the "\" so something like
print(a in b)
can evaluate to True?
You don't have fwd in b. You have wd, preceded by ASCII codepoint 0C, the FORM FEED character. That's the value Python puts there when you use a \f escape sequence in a regular string literal.
Double the backslash if you want to include a backslash or use a raw string literal:
b = '\\fwd'
b = r'\fwd'
Now a in b works:
>>> 'fwd' in '\\fwd'
True
>>> 'fwd' in r'\fwd'
True
See the String literals documentation:
Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
[...]
\f ASCII Formfeed (FF)
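A quick way to check what a literal really contains is to look at its length and code points. This is only an illustrative sketch, using the same value as in the question:

```python
b = "\fwd"
print(len(b))                     # 3
print([hex(ord(c)) for c in b])   # ['0xc', '0x77', '0x64']

raw = r"\fwd"
print(len(raw))                   # 4
print('fwd' in b, 'fwd' in raw)   # False True
```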
One way of doing it using raw strings:
>>> a = "fwd"
>>> b = "\fwd"
>>> a in b
False
>>> a = r"fwd"
>>> b = r"\fwd"
>>> a in b
True
The relevant docs
You need to "escape" the backslash, as in:
b = '\\fwd'
Otherwise, it reads the single backslash + f as an ASCII character (a formfeed).
Here's an example.
>>> a='fwd'
>>> b='\fwd'
>>> c='\\fwd'
>>> a in b
False
>>> a in c
True

C-style escaping in python

How do I escape (and unescape) the C escaped characters( newlines, slashes etc) for a string in python?
I guess JSON.encode( string) does this, but is there a better way?
Use str.encode('string-escape') in Python 2.7:
>>> '12\t34\n'.encode('string-escape')
'12\\t34\\n'
>>> '12\\t34\\n'.decode('string-escape')
'12\t34\n'
In Python 3, use str.encode('unicode-escape') (which returns bytes), or str.encode('unicode-escape').decode('utf-8') to get a str back:
>>> '12\t34\n'.encode('unicode-escape')
b'12\\t34\\n'
>>> b'12\\t34\\n'.decode('unicode-escape')
'12\t34\n'
>>> '12\t34\n'.encode('unicode-escape').decode('utf-8')
'12\\t34\\n'
>>> '12\\t34\\n'.encode('utf-8').decode('unicode-escape')
'12\t34\n'
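The Python 3 round trip can be wrapped in two small helpers. This is a sketch under stated assumptions: the names c_escape and c_unescape are made up, and the unicode_escape codec follows Python's escaping rules rather than C's exactly (for instance, it does not escape quote characters).

```python
def c_escape(text):
    """Turn control, backslash and non-ASCII characters into escapes."""
    return text.encode('unicode_escape').decode('ascii')


def c_unescape(text):
    """Reverse c_escape(); expects ASCII-only input."""
    return text.encode('ascii').decode('unicode_escape')


original = '12\t34\n'
escaped = c_escape(original)
print(escaped)                          # 12\t34\n
print(c_unescape(escaped) == original)  # True
```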

Python Replace \\ with \ [duplicate]

So I can't seem to figure this out... I have a string say, "a\\nb" and I want this to become "a\nb". I've tried all the following and none seem to work;
>>> a
'a\\nb'
>>> a.replace("\\","\")
File "<stdin>", line 1
a.replace("\\","\")
^
SyntaxError: EOL while scanning string literal
>>> a.replace("\\",r"\")
File "<stdin>", line 1
a.replace("\\",r"\")
^
SyntaxError: EOL while scanning string literal
>>> a.replace("\\",r"\\")
'a\\\\nb'
>>> a.replace("\\","\\")
'a\\nb'
I really don't understand why the last one works, because this works fine:
>>> a.replace("\\","%")
'a%nb'
Is there something I'm missing here?
EDIT I understand that \ is an escape character. What I'm trying to do here is turn all \\n \\t etc. into \n \t etc. and replace doesn't seem to be working the way I imagined it would.
>>> a = "a\\nb"
>>> b = "a\nb"
>>> print a
a\nb
>>> print b
a
b
>>> a.replace("\\","\\")
'a\\nb'
>>> a.replace("\\\\","\\")
'a\\nb'
I want string a to look like string b. But replace isn't replacing slashes like I thought it would.
There's no need to use replace for this.
What you have is an encoded string (using the string_escape encoding) and you want to decode it:
>>> s = r"Escaped\nNewline"
>>> print s
Escaped\nNewline
>>> s.decode('string_escape')
'Escaped\nNewline'
>>> print s.decode('string_escape')
Escaped
Newline
>>> "a\\nb".decode('string_escape')
'a\nb'
In Python 3:
>>> import codecs
>>> codecs.decode('\\n\\x21', 'unicode_escape')
'\n!'
You are missing that \ is the escape character.
Look here: http://docs.python.org/reference/lexical_analysis.html
at 2.4.1 "Escape Sequence"
Most importantly \n is a newline character.
And \\ is an escaped escape character :D
>>> a = 'a\\\\nb'
>>> a
'a\\\\nb'
>>> print a
a\\nb
>>> a.replace('\\\\', '\\')
'a\\nb'
>>> print a.replace('\\\\', '\\')
a\nb
r'a\\nb'.replace('\\\\', '\\')
or
'a\nb'.replace('\n', '\\n')
Your original string, a = 'a\\nb' does not actually have two '\' characters, the first one is an escape for the latter. If you do, print a, you'll see that you actually have only one '\' character.
>>> a = 'a\\nb'
>>> print a
a\nb
If, however, what you mean is to interpret the '\n' as a newline character, without escaping the slash, then:
>>> b = a.replace('\\n', '\n')
>>> b
'a\nb'
>>> print b
a
b
It's because, even in "raw" strings (strings with an r before the opening quote), an unescaped backslash cannot be the last character in the string. This should work instead:
'\\ '[0]
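The restriction on a trailing backslash can be checked without breaking the script itself by handing the source text to compile(). A minimal sketch, assuming Python 3; the variable names are illustrative:

```python
# The source text r"\" is rejected by the parser, even with the r prefix.
try:
    compile('r"\\"', '<example>', 'eval')
    raw_ok = True
except SyntaxError:
    raw_ok = False
print(raw_ok)                # False

# Ways to get a string holding exactly one backslash:
single = '\\'
print(len(single))           # 1
print(single == chr(92))     # True
print(single == r'\ '[:-1])  # True
```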
In Python string literals, backslash is an escape character. This is also true when the interactive prompt shows you the value of a string. It will give you the literal code representation of the string. Use the print statement to see what the string actually looks like.
This example shows the difference:
>>> '\\'
'\\'
>>> print '\\'
\
In Python 3 it will be:
bytes(s, 'utf-8').decode("unicode_escape")
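One thing to watch for with that one-liner: unicode_escape decodes the bytes as Latin-1, so non-ASCII characters that were encoded as UTF-8 come out garbled. A possible workaround, sketched here with a made-up helper name unescape, is to encode with latin-1 and the backslashreplace error handler instead:

```python
def unescape(text):
    """Process backslash escapes while keeping non-ASCII text intact."""
    # backslashreplace turns characters outside Latin-1 into \uXXXX escapes,
    # which unicode_escape then decodes back to the original characters.
    return text.encode('latin-1', 'backslashreplace').decode('unicode_escape')


s = 'caf\u00e9\\n'  # "café" followed by a literal backslash and n
naive = bytes(s, 'utf-8').decode('unicode_escape')
print(naive == 'caf\u00c3\u00a9\n')    # True: the accented letter is garbled
print(unescape(s) == 'caf\u00e9\n')    # True: the text survives
```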
This works on Windows with Python 3.x:
import os
str(filepath).replace(os.path.sep, '/')
Where: os.path.sep is \ on Windows and / on Linux.
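If the goal is only to turn a Windows path into forward-slash form, pathlib offers an alternative that does not depend on the platform the script runs on. A sketch using the same example path as the case study below:

```python
from pathlib import PureWindowsPath

filepath = r'C:\Users\Programming\Downloads'
# PureWindowsPath parses Windows-style paths on any operating system.
posix_style = PureWindowsPath(filepath).as_posix()
print(posix_style)  # C:/Users/Programming/Downloads
```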
Case study
Used this to prevent errors when generating a Markdown file then rendering it to pdf.
path = "C:\\Users\\Programming\\Downloads"
# str.replace() returns a new string, so assign the result back to path.
# Replace each backslash with a backslash followed by a placeholder token
path = path.replace('\\', '\\pppyyyttthhhooonnn')
# Now replace pppyyyttthhhooonnn with a blank string
path = path.replace("pppyyyttthhhooonnn", "")
print(path)
#Output...
C:\Users\Programming\Downloads
