ignoring backslash character in python - python

This one is a bit tricky, I think.
If I have:
a = "fwd"
b = "\fwd"
how can I ignore the "\" so something like
print(a in b)
can evaluate to True?

You don't have fwd in b. You have wd, preceded by ASCII codepoint 0C, the FORM FEED character. That's the value Python puts there when you use a \f escape sequence in a regular string literal.
Double the backslash if you want to include a backslash or use a raw string literal:
b = '\\fwd'
b = r'\fwd'
Now a in b works:
>>> 'fwd' in '\\fwd'
True
>>> 'fwd' in r'\fwd'
True
See the String literals documentation:
Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
[...]
\f ASCII Formfeed (FF)
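A quick way to check what a literal really contains is to look at its length and code points. This is only a minimal sketch; the variable names are made up for illustration:

```python
# Compare a regular literal with a raw literal.
plain = "\fwd"    # \f is an escape sequence: one FORM FEED character
raw = r"\fwd"     # raw literal: a backslash followed by the letter f

plain_codes = [hex(ord(ch)) for ch in plain]
raw_codes = [hex(ord(ch)) for ch in raw]

print(len(plain), plain_codes)  # 3 characters, the first is 0xc
print(len(raw), raw_codes)      # 4 characters, the first is 0x5c (backslash)
print("fwd" in plain, "fwd" in raw)
```

The regular literal never contains the letter f at all, which is why the membership test fails.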

One way of doing it using raw strings:
>>> a = "fwd"
>>> b = "\fwd"
>>> a in b
False
>>> a = r"fwd"
>>> b = r"\fwd"
>>> a in b
True
The relevant docs

You need to "escape" the backslash, as in:
b = '\\fwd'
Otherwise, it reads the single backslash + f as an ASCII character (a formfeed).
Here's an example.
>>> a='fwd'
>>> b='\fwd'
>>> c='\\fwd'
>>> a in b
False
>>> a in c
True

Related

Python prevent decoding HEX to ASCII while removing backslashes from my Var

I want to strip some unwanted symbols from my variable. In this case the symbols are backslashes. I am using a HEX number, and as an example I will show some short, simple code below. But I don't want Python to convert my HEX to ASCII; how would I prevent this from happening? I have some long shellcode for asm to work with later, and removing \ by hand is a long process. I know there are different ways like using echo -e "x\x\x\x" > output etc., but my whole script will be written in Python.
Thanks
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> b = a.strip("\\")
>>> print b
1�Phtv
>>> a = "\x31\x32\x33\x34\x35\x36"
>>> b = a.strip("\\")
>>> print b
123456
At the end I would like it to print my var:
>>> print b
x31x32x33x34x35x36
There are no backslashes in your variable:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(a)
1ÀPhtv
Take newline for example: writing "\n" in Python will give you a string with one character (a newline) and no backslashes. See the string literals docs for the full syntax of these.
Now, if you really want to write a string with such backslashes, you can do it with the r prefix:
>>> a = r"\x31\xC0\x50\x68\x74\x76"
>>> print(a)
\x31\xC0\x50\x68\x74\x76
>>> print(a.replace('\\', ''))
x31xC0x50x68x74x76
But if you want to convert a regular string to hex-coded symbols, you can do it character by character, converting it to number ("\x31" == "1" --> 49), then to hex ("0x31"), and finally stripping the first character:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(''.join([hex(ord(x))[1:] for x in a]))
x31xc0x50x68x74x76
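In Python 3, shellcode like this is usually kept as bytes rather than str. As a sketch under that assumption (hex_text is a hypothetical helper name, not something from the question), a format spec also zero-pads small values, which hex(ord(x))[1:] does not:

```python
def hex_text(data: bytes) -> str:
    """Return each byte as 'x' followed by two lowercase hex digits."""
    return "".join(f"x{byte:02x}" for byte in data)

shellcode = b"\x31\xC0\x50\x68\x74\x76"
print(hex_text(shellcode))   # x31xc0x50x68x74x76
print(hex_text(b"\x01"))     # x01 (zero-padded)
print(shellcode.hex())       # 31c050687476, the built-in form without the x prefix
```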
There are two problems in your code.
First, the simple one:
strip() only removes characters from the beginning and end of the string. To remove every backslash, use replace("\\", ""), which replaces each backslash with the empty string.
The second problem is Python's behavior with backslashes:
To get your example working, you need to put an 'r' in front of your string to indicate that it is a raw string: a = r"\x31\xC0\x50\x68\x74\x76". In a raw string, a backslash doesn't escape a character but just stays a backslash.
>>> r"\x31\xC0\x50\x68\x74\x76"
'\\x31\\xC0\\x50\\x68\\x74\\x76'
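To make the strip() versus replace() difference concrete, here is a small sketch (variable names are illustrative only):

```python
text = r"\x31\xC0\x50"               # raw literal: contains real backslashes

stripped = text.strip("\\")          # only touches the two ends of the string
replaced = text.replace("\\", "")    # removes every backslash

print(stripped)   # x31\xC0\x50
print(replaced)   # x31xC0x50
```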

In Python, how can you write the string String = "\s"? [duplicate]

Why does:
B = "The" + "\s"
and
B = "The" + r"\s"
yield:
"The\\s"
Is it possible to write the above, such that the output string is:
"The\s"
I have read similar questions on both the issue of backslashes, and their property for escaping, and the interpretation of regex characters in Python.
How to print backslash with Python?
Why can't Python's raw string literals end with a single backslash?
Does this mean there is no way to write what I want?
If it is useful, my end goal is to write a program that inserts the regex token for whitespace (\s) wherever a string contains a space:
For example, start with:
A = "The Cat and Dog"
After applying the function, this becomes:
B = "The\sCat\sand\sDog"
I believe this is related to Why does printing a tuple (list, dict, etc.) in Python double the backslashes?
The representation of the string and what it actually contains can differ.
Observe:
>>> B = "The" + "\s"
>>> B
'The\\s'
>>> print B
The\s
Furthermore
>>> A = "The Cat and Dog"
>>> B = str.replace(A, ' ', '\s')
>>> B
'The\\sCat\\sand\\sDog'
>>> print B
The\sCat\sand\sDog
From the docs:
all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result
So while \s is not a proper escape sequence, Python forgives the mistake and treats the backslash as if you had properly escaped it as \\. (Recent Python 3 releases still accept it but emit a warning about the invalid escape sequence, so writing "\\s" or r"\s" is safer.) But when you then view the string's representation, it shows the backslash properly escaped. That said, the string only contains one backslash. It's only the representation that shows it as an escape sequence with two.
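For the asker's end goal, a rough Python 3 sketch could look like this. The function name spaces_to_regex is hypothetical, and a raw string is used so the backslash is unambiguous:

```python
import re

def spaces_to_regex(text):
    """Replace each space with the two characters backslash + s."""
    return text.replace(" ", r"\s")

pattern = spaces_to_regex("The Cat and Dog")
print(pattern)        # what the string contains: The\sCat\sand\sDog
print(repr(pattern))  # how Python would write it as a literal (backslashes doubled)
print(len(r"\s"))     # 2: one backslash plus the letter s

# The result behaves as a regex where \s matches any whitespace character.
print(bool(re.fullmatch(pattern, "The\tCat and\nDog")))
```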
You must escape the "\"
B = "The" + "\\s"
>>> B = "The" + "\\s"
>>> print(B)
The\s
See the Escape Sequences part:
Python 3 - Lexical Analysis

Python Escape Sequence and String Manipulation

I have the following two vars:
a = chr(92) + 'x11'
b = '\x11'
print 'a is: ' + a
print 'b is: ' + b
The result of these print statements:
a is: \x11
b is: <| # Here I am just showing a representation of the symbol that is printed for b
How can I make it so that variable a prints the same thing as var b using the chr(92) call? Thank you in advance.
The other answers are showing you how to make b give you what you get in a. If you want a to give you what you get in b (which is what you're asking, if I read you correctly), you need to decode the escape sequence:
>>> a
u'\\x11'
>>> a.decode('string-escape')
'\x11'
You can also use unicode-escape instead of string-escape if you want a unicode string as the result.
Check out the documentation for string literals.
Backslash is an escape character in Python strings, so to include a literal backslash in your string you need to escape them by using two consecutive backslashes. Alternatively, you can suppress the escaping behavior of backslashes by using a raw string literal, which is done by prefixing the string with r. For example:
Escaping the backslash:
b = '\\x11'
Using a raw string literal:
b = r'\x11'
If I am misinterpreting your question and b should be '\x11' or equivalently chr(17), but you just want it to display in the escaped format, you can use repr() for that:
>>> b = '\x11'
>>> print 'b is: ' + repr(b)
b is: '\x11'
If you don't want the quotes, use the string_escape encoding:
>>> print 'b is: ' + b.encode('string_escape')
b is: \x11
Or to get a to be the same as b, you can use a.decode('string_escape').
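The string_escape codec only exists in Python 2. In Python 3, a roughly equivalent sketch uses the unicode_escape codec in both directions (this assumes the text is plain ASCII):

```python
import codecs

a = chr(92) + "x11"   # four characters: backslash, x, 1, 1
b = "\x11"            # one character: DC1 (code point 17)

decoded = codecs.decode(a, "unicode_escape")           # escape text -> character
escaped = b.encode("unicode_escape").decode("ascii")   # character -> escape text

print(decoded == b)   # True
print(escaped == a)   # True
```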
\x11 appears to be the hex value for a ^Q control character in ASCII:
\021 17 DC1 \x11 ^Q (Device control 1) (XON) (Default UNIX START char.)
You need to escape the \ to get the literal \x11

Python Replace \\ with \ [duplicate]

So I can't seem to figure this out... I have a string say, "a\\nb" and I want this to become "a\nb". I've tried all the following and none seem to work;
>>> a
'a\\nb'
>>> a.replace("\\","\")
File "<stdin>", line 1
a.replace("\\","\")
^
SyntaxError: EOL while scanning string literal
>>> a.replace("\\",r"\")
File "<stdin>", line 1
a.replace("\\",r"\")
^
SyntaxError: EOL while scanning string literal
>>> a.replace("\\",r"\\")
'a\\\\nb'
>>> a.replace("\\","\\")
'a\\nb'
I really don't understand why the last one works, because this works fine:
>>> a.replace("\\","%")
'a%nb'
Is there something I'm missing here?
EDIT I understand that \ is an escape character. What I'm trying to do here is turn all \\n \\t etc. into \n \t etc. and replace doesn't seem to be working the way I imagined it would.
>>> a = "a\\nb"
>>> b = "a\nb"
>>> print a
a\nb
>>> print b
a
b
>>> a.replace("\\","\\")
'a\\nb'
>>> a.replace("\\\\","\\")
'a\\nb'
I want string a to look like string b. But replace isn't replacing slashes like I thought it would.
There's no need to use replace for this.
What you have is an encoded string (using the string_escape encoding) and you want to decode it:
>>> s = r"Escaped\nNewline"
>>> print s
Escaped\nNewline
>>> s.decode('string_escape')
'Escaped\nNewline'
>>> print s.decode('string_escape')
Escaped
Newline
>>> "a\\nb".decode('string_escape')
'a\nb'
In Python 3:
>>> import codecs
>>> codecs.decode('\\n\\x21', 'unicode_escape')
'\n!'
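One caveat with unicode_escape: it treats non-escape bytes as Latin-1, so a plain UTF-8 round trip garbles non-ASCII text. A possible workaround, sketched here with a hypothetical helper named decode_escapes, is to encode with Latin-1 and the backslashreplace error handler first:

```python
def decode_escapes(text):
    """Interpret backslash escapes in text, leaving other characters alone."""
    # backslashreplace turns anything outside Latin-1 into a \uXXXX escape,
    # which unicode_escape then turns back into the original character.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")

print(repr(decode_escapes("a\\nb")))           # 'a\nb'
print(decode_escapes("caf\u00e9\\t\u4e21"))    # accented and CJK characters survive

# For comparison, the naive UTF-8 route mangles the accented character.
naive = "caf\u00e9".encode("utf-8").decode("unicode_escape")
print(naive)
```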
You are missing that \ is the escape character.
Look here: http://docs.python.org/reference/lexical_analysis.html
at 2.4.1 "Escape Sequence"
Most importantly \n is a newline character.
And \\ is an escaped escape character :D
>>> a = 'a\\\\nb'
>>> a
'a\\\\nb'
>>> print a
a\\nb
>>> a.replace('\\\\', '\\')
'a\\nb'
>>> print a.replace('\\\\', '\\')
a\nb
r'a\\nb'.replace('\\\\', '\\')
or
'a\nb'.replace('\n', '\\n')
Your original string, a = 'a\\nb', does not actually have two '\' characters; the first one is an escape for the second. If you print a, you'll see that you actually have only one '\' character.
>>> a = 'a\\nb'
>>> print a
a\nb
If, however, what you mean is to interpret the '\n' as a newline character, without escaping the slash, then:
>>> b = a.replace('\\n', '\n')
>>> b
'a\nb'
>>> print b
a
b
It's because, even in "raw" strings (=strings with an r before the starting quote(s)), an unescaped escape character cannot be the last character in the string. This should work instead:
'\\ '[0]
In Python string literals, backslash is an escape character. This is also true when the interactive prompt shows you the value of a string. It will give you the literal code representation of the string. Use the print statement to see what the string actually looks like.
This example shows the difference:
>>> '\\'
'\\'
>>> print '\\'
\
In Python 3 it will be:
bytes(s, 'utf-8').decode("unicode_escape")
This works on Windows with Python 3.x:
import os
str(filepath).replace(os.path.sep, '/')
Where: os.path.sep is \ on Windows and / on Linux.
Case study
Used this to prevent errors when generating a Markdown file then rendering it to pdf.
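Note that the os.path.sep version only changes anything when the script itself runs on Windows. If the Windows-style path might be processed on another platform, a sketch using pathlib's PureWindowsPath avoids that dependency (the example path is just an illustration):

```python
from pathlib import PureWindowsPath

windows_path = r"C:\Users\Programming\Downloads"

# PureWindowsPath parses Windows syntax regardless of the host operating system.
posix_style = PureWindowsPath(windows_path).as_posix()
print(posix_style)  # C:/Users/Programming/Downloads
```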
path = "C:\\Users\\Programming\\Downloads"
# Note: "\\" in a literal is already a single backslash in the string.
# str.replace returns a new string, so assign the result back.
# Add a placeholder after every backslash...
path = path.replace('\\', '\\pppyyyttthhhooonnn')
# ...then replace pppyyyttthhhooonnn with a blank string
path = path.replace("pppyyyttthhhooonnn", "")
print(path)
#Output...
C:\Users\Programming\Downloads

Removing control characters from a string in python

I currently have the following code
def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
        i += 1
    return line
This just does not work if there is more than one character to be deleted.
There are hundreds of control characters in Unicode. If you are sanitizing data from the web or some other source that might contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".
This snippet removes all control characters from a string.
import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")
Examples of unicode categories:
>>> from unicodedata import category
>>> category('\r') # carriage return --> Cc : control character
'Cc'
>>> category('\0') # null character ---> Cc : control character
'Cc'
>>> category('\t') # tab --------------> Cc : control character
'Cc'
>>> category(' ') # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A') # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',') # comma -----------> Po : punctuation
'Po'
>>>
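Since tab and newline are themselves category Cc, the snippet above removes them too. If that is not wanted, one possible variation is sketched below; the keep parameter is my own addition, not part of the original answer:

```python
import unicodedata

def remove_control_characters(text, keep="\n\t"):
    """Drop characters whose Unicode category starts with 'C', except those in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )

sample = "line one\n\ttab\x00bed\u200bword\x1b[0m"
print(repr(remove_control_characters(sample)))
print(repr(remove_control_characters("a\nb", keep="")))
```

Keep in mind that the "C" group also covers format, private-use, surrogate, and unassigned code points, not just the classic control characters.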
You could use str.translate with the appropriate map, for example like this:
>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
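The map above only covers code points 0 to 31. As a sketch, the same str.translate idea can be extended to DEL and the C1 range (127 to 159); the names CONTROL_TABLE and strip_cc are hypothetical:

```python
# Code points 0-31 plus 127-159: the C0 and C1 control ranges.
# dict.fromkeys maps each one to None, which tells translate to delete it.
CONTROL_TABLE = dict.fromkeys(list(range(32)) + list(range(127, 160)))

def strip_cc(text):
    """Delete every character that CONTROL_TABLE maps to None."""
    return text.translate(CONTROL_TABLE)

print(strip_cc("abc\x02de\x7f\x85f"))  # abcdef
```

Building the table once and reusing it keeps each call to a single pass over the string.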
Anyone interested in a regex character class that matches any Unicode control character may use [\x00-\x1f\x7f-\x9f].
You may test it like this:
>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True
So to remove the control characters using re just use the following:
>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.
pip install regex
import regex as rx
def remove_control_characters(text):
    # Apply the substitution to the argument, not to a hard-coded string
    return rx.sub(r'\p{C}', '', text)
\p{C} is the Unicode general category "Other", which covers control characters along with format, surrogate, private-use, and unassigned code points, so you can leave it up to the Unicode Consortium which characters should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of separator (spaces as well as line and paragraph separators).
Your implementation is wrong because the value of i is incorrect. However that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n^2) instead of O(n). Try this instead:
return ''.join(c for c in line if ord(c) >= 32)
And for Python 2, with the builtin translate:
import string
all_bytes = string.maketrans('', '') # String of 256 characters with (byte) value 0 to 255
line.translate(all_bytes, all_bytes[:32]) # All bytes < 32 are deleted (the second argument lists the bytes to delete)
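To see the index problem concretely, here is a Python 3 sketch that puts the loop from the question (reproduced under the made-up name broken) next to the single-pass version:

```python
def broken(line):
    # Reproduction of the loop from the question, kept buggy on purpose:
    # i indexes the original string, but line shrinks after every deletion,
    # and line[:i - 1] also cuts one character too early.
    i = 0
    for c in line:
        if c < chr(32):
            line = line[:i - 1] + line[i + 1:]
        i += 1
    return line

def remove_control_characters(line):
    # Build the result in one pass instead of slicing repeatedly.
    return "".join(c for c in line if ord(c) >= 32)

sample = "a\x01\x02b"
print(repr(broken(sample)))                     # '\x02'
print(repr(remove_control_characters(sample)))  # 'ab'
```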
You modify the line while iterating over it. Try something like ''.join([x for x in line if ord(x) >= 32]) instead.
filter(string.printable[:-5].__contains__,line)
I've tried all of the above and it didn't help. In my case, I had to remove Unicode 'LRM' (left-to-right mark) characters.
Finally I found this solution that did the job:
df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')
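Be aware that the ASCII round trip above drops every non-ASCII character, not only LRM. If other characters such as currency symbols need to survive, a narrower sketch in plain Python could look like this (remove_lrm is a hypothetical helper and the sample value is invented):

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK

def remove_lrm(text):
    """Remove only LEFT-TO-RIGHT MARK characters, keeping other non-ASCII text."""
    return text.replace(LRM, "")

amount = "\u200e12\u20ac"   # LRM, the digits 12, then a euro sign
print(remove_lrm(amount))                                # keeps the euro sign
print(amount.encode("ascii", "ignore").decode("utf-8"))  # loses the euro sign too
```

With pandas, the same replacement should be possible column-wise via df["AMOUNT"].str.replace("\u200e", "", regex=False), though I have not run that against the asker's data.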
