String with backslash and quotes in Python is problems galore

I have an ordinary string that I want to send to a program, which only accepts it in the form "\"text\"", exactly like that. In Python I can print it that way, but I can't seem to assign it that way. See the following:
My text:
In [12]:
i = fieldList[0]
print str(i.name)
Y03M01D01
Which I can print as "\"text\""
In [13]:
field_new = '"\\"'+str(i.name)+'\\""'
print field_new
"\"Y03M01D01\""
But this is how it is eaten by the program
In [14]:
field_new
Out[14]:
'"\\"Y03M01D01\\""'
Which is not equal to "\"text\"" and so my code fails.
Any suggestions how to resolve this?

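Your string already contains what you want. The doubled backslashes in Out[14] are only how the interpreter displays the string (its repr); print shows the actual characters. To check this, compare against a raw string: with the r prefix, Python keeps the backslashes in the literal as they are instead of treating them as escape characters.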
>>> i = "text"
>>> field_new = '"\\"'+str(i)+'\\""'
>>> field_new
'"\\"text\\""'
>>> field_new == r'"\"text\""'
True
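To see why the comparison succeeds, it may help to separate what the string contains from how the interpreter displays it. The following is a minimal sketch in Python 3 syntax; `name` is a stand-in for `str(i.name)` from the question.

```python
# Stand-in for str(i.name) from the question (hypothetical value).
name = "Y03M01D01"

# In a normal literal, '\\' is ONE backslash character.
field_new = '"\\"' + name + '\\""'

# print() shows the characters the string really contains.
print(field_new)        # "\"Y03M01D01\""

# repr() (what the prompt echoes) re-escapes each backslash for display only.
print(repr(field_new))  # '"\\"Y03M01D01\\""'

# The string holds two backslashes in total, not four.
backslashes = field_new.count("\\")
print(backslashes)      # 2

# A raw literal spells the same value without doubling the backslashes.
same = field_new == r'"\"Y03M01D01\""'
print(same)             # True
```

So if the receiving program still rejects the value, the cause is probably somewhere other than the backslashes.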

Related

Python prevent decoding HEX to ASCII while removing backslashes from my Var

I want to strip some unwanted symbols from my variable, in this case backslashes. I am working with hex values, and I show some short example code below. I don't want Python to convert my hex to ASCII; how can I prevent this? I have some long shellcode for asm to work with later, and removing each \ by hand is a slow process. I know there are other ways, like using echo -e "x\x\x\x" > output, but my whole script will be written in Python.
Thanks
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> b = a.strip("\\")
>>> print b
1�Phtv
>>> a = "\x31\x32\x33\x34\x35\x36"
>>> b = a.strip("\\")
>>> print b
123456
At the end I would like it to print my var:
>>> print b
x31x32x33x34x35x36
There are no backslashes in your variable:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(a)
1ÀPhtv
Take newline, for example: writing "\n" in Python gives you a string with one character (a newline) and no backslashes. See the string literals docs for the full syntax of these escapes.
Now, if you really want to write a string that contains such backslashes, you can do it with the r prefix:
>>> a = r"\x31\xC0\x50\x68\x74\x76"
>>> print(a)
\x31\xC0\x50\x68\x74\x76
>>> print(a.replace('\\', ''))
x31xC0x50x68x74x76
But if you want to convert a regular string to hex-coded symbols, you can do it character by character, converting it to number ("\x31" == "1" --> 49), then to hex ("0x31"), and finally stripping the first character:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(''.join([hex(ord(x))[1:] for x in a]))
x31xc0x50x68x74x76
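One detail worth checking in the character-by-character approach: `hex()` does not pad, so a value below 0x10 loses its leading zero. Here is a hedged Python 3 sketch that holds the shellcode as `bytes` and formats every byte with two hex digits. The trailing `\x05` byte is not in the original data; it is added here only to show the padding difference.

```python
# Python 3 sketch: the shellcode is held as bytes rather than a text string.
# The final \x05 is an illustrative extra byte, not part of the original data.
shellcode = b"\x31\xC0\x50\x68\x74\x76\x05"

# Iterating over bytes yields integers; "x%02x" always writes two hex digits.
padded = "".join("x%02x" % byte for byte in shellcode)
print(padded)    # x31xc0x50x68x74x76x05

# hex()[1:] omits the leading zero for values below 0x10.
unpadded = "".join(hex(byte)[1:] for byte in shellcode)
print(unpadded)  # x31xc0x50x68x74x76x5
```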
There are two problems in your code.
First the simple one:
strip() only removes characters from the beginning and end of the string, not from the middle. So you should use replace("\\", ""). This replaces every backslash with "", which is the same as removing it.
The second problem is Python's behavior with backslashes:
To get your example working you need to put an 'r' in front of your string to indicate that it is a raw string: a = r"\x31\xC0\x50\x68\x74\x76". In a raw string, a backslash doesn't escape the following character but simply stays a backslash.
>>> r"\x31\xC0\x50\x68\x74\x76"
'\\x31\\xC0\\x50\\x68\\x74\\x76'
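The difference between the two methods can be seen on a shortened raw string. This is a small sketch; it behaves the same in Python 2 and Python 3.

```python
# A raw literal keeps every backslash as a real character.
a = r"\x31\xC0\x50"

# strip() only trims matching characters from the two ends of the string.
stripped = a.strip("\\")
print(stripped)   # x31\xC0\x50

# replace() removes every occurrence, wherever it appears.
replaced = a.replace("\\", "")
print(replaced)   # x31xC0x50
```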

python convert "unicode" as list

I have a question about how to handle a return type in Python.
I have a database function that returns this as value:
(1,13616,,"My string, that can have comma",170.90)
I put this into a variable and did test the type:
print(type(var))
I got the result:
<type 'unicode'>
I want to convert this to a list and get the comma-separated values.
Ex.:
var[0] = 1
var[1] = 13616
var[2] = None
var[3] = "My string, that can have comma"
var[4] = 170.90
Is it possible?
Using standard library csv readers:
>>> import csv
>>> s = u'(1,13616,,"My string, that can have comma",170.90)'
>>> [var] = csv.reader([s[1:-1]])
>>> var[3]
'My string, that can have comma'
Some caveats:
var[2] will be an empty string, not None, but you can post-process that.
Numbers will be strings and also need post-processing, since csv cannot tell the difference between 0 and '0'.
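The post-processing mentioned in the caveats could look roughly like this. It is a sketch in Python 3 syntax, and `convert` and `parse_row` are hypothetical helper names, not part of the csv module. Note that a quoted field such as "0" would also become a number, because csv drops the quoting information.

```python
import csv

def convert(field):
    """Best-effort conversion: '' -> None, then int, then float, else text."""
    if field == "":
        return None
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

def parse_row(text):
    """Parse a '(a,b,,"c, d",1.5)' style row into a list of Python values."""
    # Drop the surrounding parentheses, then let csv handle the quoted comma.
    [fields] = csv.reader([text[1:-1]])
    return [convert(field) for field in fields]

row = parse_row('(1,13616,,"My string, that can have comma",170.90)')
print(row)  # [1, 13616, None, 'My string, that can have comma', 170.9]
```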
You can try to do the following:
b = []
for i in a:
    if i is not None:
        b.append(i)
    else:
        b.append(None)
print(type(b))
The issue is not with the comma.
this works fine:
a = (1,13616,"My string, that can have comma",170.90)
and this also works:
a = (1,13616,None,"My string, that can have comma",170.90)
but two consecutive commas (",,") are a syntax error in a Python tuple literal, so that form doesn't work.
Unicode strings are (basically) just strings in Python 2 (in Python 3, drop the word "basically"). They're written as literals by putting a u prefix before the string (compare raw strings r"something", or Python 3.6+ formatted strings f"{some_var}thing").
Just strip off your parens and split by comma. You'll have to do some post-parsing if you want 170.90 instead of u'170.90' or None instead of u'', but I'll leave that for you to decide. Note that a plain split also breaks the quoted field at its embedded comma, as the output shows.
>>> var.strip(u'()').split(u',')
[u'1', u'13616', u'', u'"My string', u' that can have comma"', u'170.90']

Unicode object to a list

I have a UTF-8 text corpus that I can read easily in Python 2.7:
sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()
> This is my sentence in the right format
However, when I turn this text corpus into a list (for example, by tokenizing it):
tokens = sentence.tokenize()
and print it in the notebook, I get escape-code-like characters, such as:
(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")
whereas I would like normal characters, just like in my original import.
So my question is: how can I put unicode objects into a list without getting these strange escape codes?
It's all in how you print. Python 2 displays lists using ASCII-only characters and substituting backslash escape codes for non-ASCII characters. This is to make it easy to see hidden characters that normal printing would make invisible, like the double byte-order-mark (BOM) \ufeff you see in your strings. Printing individual string items will display them correctly.
Many examples
Original strings:
>>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t = (u'Tunisie', u"l'\xc9gypte,")
Displaying at the interactive prompt:
>>> s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t
(u'Tunisie', u"l'\xc9gypte,")
>>> print s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> print t
(u'Tunisie', u"l'\xc9gypte,")
Printing individual strings from the tuples:
>>> print s[0]
Faux,
>>> print s[1]
Tunisie
>>> print t[0]
Tunisie
>>> print t[1]
l'Égypte,
>>> print ' '.join(s)
Faux, Tunisie
>>> print ' '.join(t)
Tunisie l'Égypte,
A way to print tuples without escape codes:
>>> print "('"+"', '".join(s)+"')"
('Faux,', 'Tunisie')
>>> print "('"+"', '".join(t)+"')"
('Tunisie', 'l'Égypte,')
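The examples above are Python 2. As a hedged Python 3 sketch of the same idea, the code below also strips the stray byte-order marks before tokenizing. `split(", ")` is only a stand-in for whatever tokenizer is actually used, and the sample text is assembled from the strings in the question. In Python 3, repr() of a list keeps printable non-ASCII characters such as É but still escapes invisible ones such as the BOM.

```python
# Python 3 sketch: sample text encoded as UTF-8 with a doubled BOM.
raw = "\ufeff\ufeffFaux, Tunisie, l'\xc9gypte".encode("utf-8")

# Decode, then drop any byte-order marks left at the start of the text.
text = raw.decode("utf-8").lstrip("\ufeff")

# Stand-in tokenizer (assumption): split on comma plus space.
tokens = text.split(", ")

# Printing the list shows repr() of each item; printing items shows the text.
print(tokens)
for token in tokens:
    print(token)
```

When reading from a file, opening it with encoding="utf-8-sig" removes one leading BOM automatically; a doubled BOM like the one here still needs the lstrip step.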
Hm, codecs.open(...) returns a "wrapped version of the underlying file object", and then you overwrite that variable with the result of calling read() on it. Brave, irritating - but ok ;-)
When you type, say, äöüß into your notebook, does it show up like that, or do you see some \uxxxx escapes instead?
The default for codecs.open(...) is errors='strict', so if all the samples come from the same environment, reading should work.
As I understand it, when you write "print it" you are printing the list, which is different from printing the contents of the list.
Sample (taking a tab typed as \t into a normal "byte" string - this is python 2.7.11):
>>> a="\t"
>>> print a # below is an expanded tab
>>> a
'\t'
>>> [a]
['\t']
>>> print [a]
['\t']
>>> for element in [a]:
...     print element
...
>>> # above is an expanded tab

Raw string (r'...') in a regular expression in Python doesn't work?

Below is my raw string (r'...') test in Python.
import re
a = re.compile('\d')
b = re.compile('\\d')
c = re.compile(r'\d')
d = re.compile(r'\\d')
print a.search("1") # (O)
print a.search("\d")
print a.search("\1")
print b.search("1") # (O)
print b.search("\d")
print b.search("\1")
print c.search("1") # (O)
print c.search("\d")
print c.search("\1")
print d.search("1")
print d.search("\d") # (O)
print d.search("\1")
But it seems like the raw string doesn't work.
For example, regular expression 'b' should match the text made up of a backslash followed by the letter d, but it only matches the digit '1'.
And given what the 'r' prefix means, regular expression 'c' should also match a backslash followed by the letter d, but it doesn't.
Could anyone explain this?
Thanks
Your first three strings are exactly the same.
>>> '\d' == '\\d' == r'\d'
True
Thus, when run through the regex engine, they all match only a single digit. This is true because '\d' has no interesting behavior in the way that '\n' does, so parsing the backslash as literal is the only reasonable way for the Python interpreter to respond (barring a parse error -- which I'd argue might have been a better idea, but couldn't be implemented now without breaking compatibility).
By contrast, the same is not true of \n:
>>> '\n' == '\\n'
False
>>> '\\n' == r'\n'
True
Your fourth string, r'\\d', is the same as '\\\\d'; thus, that it matches only the literal string \d should be no surprise.
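The same checks can be written as a small script. This is a sketch in Python 3 syntax that uses only raw or fully escaped literals, since newer Python versions warn about unrecognized escapes like '\d' in normal strings. It also shows why the "\1" searches in the question never match: in a normal literal, "\1" is an octal escape for a single control character.

```python
import re

# Both spellings build the same two-character pattern: backslash + d.
print("\\d" == r"\d")                 # True

digit = re.compile(r"\d")             # regex \d: any single digit
literal = re.compile(r"\\d")          # regex \\d: a backslash, then the letter d

subject = r"\d"                       # the two characters backslash and d

results = [
    bool(digit.search("1")),          # True
    bool(digit.search(subject)),      # False
    bool(literal.search("1")),        # False
    bool(literal.search(subject)),    # True
]
print(results)                        # [True, False, False, True]

# "\1" in a normal literal is an octal escape: the single character chr(1).
print("\1" == "\x01")                 # True
```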

How to convert hex string "\x89PNG" to plain text in python

I have a string "\x89PNG" which I want to convert to plain text.
I referred to http://love-python.blogspot.in/2008/05/convert-hext-to-ascii-string-in-python.html
But I found it a little complicated. Can this be done in a simpler way?
\x89PNG is already plain text. Just try to print it:
>>> s = '\x89PNG'
>>> print s
┴PNG
The recipe in the link does nothing:
>>> hex_string = '\x70f=l\x26hl=en\x26geocode=\x26q\x3c'
>>> ascii_string = reformat_content(hex_string)
>>> hex_string == ascii_string
True
The real hex<->plaintext encoding/decoding is a piece of cake:
>>> s.encode('hex')
'89504e47'
>>> '89504e47'.decode('hex')
'\x89PNG'
However, you may have problems with strings like '\x70f=l\x26hl=en\x26geocode=\x26q\x3c', where '\' and 'x' are separate characters:
>>> s = '\\x70f=l\\x26hl=en\\x26geocode=\\x26q\\x3c'
>>> print s
\x70f=l\x26hl=en\x26geocode=\x26q\x3c
In this case string_escape encoding is really helpful:
>>> print s.decode('string_escape')
pf=l&hl=en&geocode=&q<
More about encodings - http://docs.python.org/library/codecs.html#standard-encodings
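The 'hex' and 'string_escape' forms above are Python 2 only. In Python 3 the rough equivalents are the built-in `bytes.hex()` and `bytes.fromhex()` methods and the 'unicode_escape' codec, sketched here under the assumption that the escaped text is pure ASCII.

```python
# Python 3 sketch: binary data is bytes, and hex conversion is built in.
data = b"\x89PNG"

hex_text = data.hex()
print(hex_text)                     # 89504e47

round_trip = bytes.fromhex(hex_text)
print(round_trip == data)           # True

# Text in which backslash and x are separate characters, as in the
# string_escape example.
s = "\\x70f=l\\x26hl=en\\x26geocode=\\x26q\\x3c"

# 'unicode_escape' interprets the escapes; the ASCII round trip is an
# assumption that holds for this input but not for arbitrary text.
decoded = s.encode("ascii").decode("unicode_escape")
print(decoded)                      # pf=l&hl=en&geocode=&q<
```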
