I get a string which includes Unicode escape sequences, but the backslashes are doubled. I want to remove one backslash so Python can treat the Unicode escapes the right way.
Using replace I am only able to remove both backslashes at a time, not just one of them.
my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\', '')
'\\uD83D\\uDE01\\n\\uD83D\\uDE01' should be '\uD83D\uDE01\n\uD83D\uDE01'
edit:
Thank you for the many responses. You are right my example was wrong. Here are other things I have tried
my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\\\', '\\') # no unicode
my_str2 = my_str.replace('\\', '')
That's… probably not going to work. Escape characters are handled during lexical analysis (parsing); what you have in your string is already a single backslash, and what you're seeing is just the escaped representation of that single backslash:
>>> r'\u3d5f'
'\\u3d5f'
What you need to do is encode the string back to bytes (as if it were Python source), then re-decode it while applying the unicode escapes:
>>> my_str.encode('utf-8').decode('unicode_escape')
'\ud83d\ude01\n\ud83d\ude01'
However, note that these code points are surrogates, so your string is pretty much broken / invalid; you're not going to be able to e.g. print it, because the UTF-8 encoder is going to reject it:
>>> print(my_str.encode('utf-8').decode('unicode_escape'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
To fix that, you need a second fixup pass: encode to UTF-16 letting the surrogates pass through directly (using the "surrogatepass" mode) then do proper UTF-16 decoding back to an actual well-formed string:
>>> print(my_str.encode('utf-8').decode('unicode_escape').encode('utf-16', 'surrogatepass').decode('utf-16'))
😁
😁
You may really want to do a source analysis on your data though: it's not logically valid to get a (Unicode) string with Unicode escapes in it, so this might be incorrect loading of JSON data or some such. If it's an option (I realise that's not always the case), fixing that at the source would be much better than applying hacky fixups afterwards.
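For instance, if the escaped string did come from JSON, it's usually cleaner to let the json module do the decoding: it applies the \uXXXX escapes and recombines the surrogate pairs in one step (a sketch; note the payload has to be wrapped in JSON string quotes first):

```python
import json

raw = '\\ud83d\\ude01\\n\\ud83d\\ude01'  # the doubled-backslash payload from the question
# json.loads interprets the \uXXXX escapes and joins the surrogate pairs:
print(json.loads('"%s"' % raw))          # two emoji separated by a newline
```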
Related
I have this issue and I can't figure out how to solve it. I have this string:
data = '\xc4\xb7\x86\x17\xcd'
When I tried to encode it:
data.encode()
I get this result:
b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'
I only want:
b'\xc4\xb7\x86\x17\xcd'
Does anyone know the reason and how to fix this? The string is already stored in a variable, so I can't add the literal b in front of it.
You cannot convert a string into bytes or bytes into string without taking an encoding into account. The whole point about the bytes type is an encoding-independent sequence of bytes, while str is a sequence of Unicode code points which by design have no unique byte representation.
So when you want to convert one into the other, you must tell explicitly what encoding you want to use to perform this conversion. When converting into bytes, you have to say how to represent each character as a byte sequence; and when you convert from bytes, you have to say what method to use to map those bytes into characters.
If you don’t specify the encoding, then UTF-8 is the default, which is a sane default since UTF-8 is ubiquitous, but it's also just one of many valid encodings.
If you take your original string, '\xc4\xb7\x86\x17\xcd', take a look at what Unicode code points these characters represent. \xc4 for example is the LATIN CAPITAL LETTER A WITH DIAERESIS, i.e. Ä. That character happens to be encoded in UTF-8 as 0xC3 0x84 which explains why that’s what you get when you encode it into bytes. But it also has an encoding of 0x00C4 in UTF-16 for example.
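To make the point concrete, the same character has a different byte sequence under each encoding:

```python
ch = '\xc4'  # Ä -- LATIN CAPITAL LETTER A WITH DIAERESIS (U+00C4)

print(ch.encode('utf-8'))      # b'\xc3\x84' -- two bytes in UTF-8
print(ch.encode('latin-1'))    # b'\xc4'     -- one byte in Latin-1
print(ch.encode('utf-16-be'))  # b'\x00\xc4' -- two bytes in UTF-16 (big-endian)
```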
As for how to solve this properly so you get the desired output, there is no clear correct answer. The solution that Kasramvd mentioned is also somewhat imperfect. If you read about the raw_unicode_escape codec in the documentation:
raw_unicode_escape
Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.
So this is just a Latin-1 encoding which has a built-in fallback for characters outside of it. I would consider this fallback somewhat harmful for your purpose. For Unicode characters that cannot be represented as a \xXX sequence, this might be problematic:
>>> chr(256).encode('raw_unicode_escape')
b'\\u0100'
So the code point 256 is explicitly outside of Latin-1 which causes the raw_unicode_escape encoding to instead return the encoded bytes for the string '\\u0100', turning that one character into 6 bytes which have little to do with the original character (since it’s an escape sequence).
So if you wanted to use Latin-1 here, I would suggest you use it explicitly, without the escape-sequence fallback of raw_unicode_escape. This will simply raise an exception when trying to convert code points outside of the Latin-1 range:
>>> '\xc4\xb7\x86\x17\xcd'.encode('latin1')
b'\xc4\xb7\x86\x17\xcd'
>>> chr(256).encode('latin1')
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
chr(256).encode('latin1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
Of course, whether or not code points outside of the Latin-1 range can cause problems for you depends on where that string actually comes from. But if you can guarantee that the input will only contain valid Latin-1 characters, then chances are you don't really need to be working with a string there in the first place. Since you are actually dealing with some kind of bytes, you should look into whether you can simply retrieve those values as bytes in the first place. That way you won't introduce two levels of encoding where you can corrupt data by misinterpreting the input.
You can use 'raw_unicode_escape' as your encoding:
In [14]: bytes(data, 'raw_unicode_escape')
Out[14]: b'\xc4\xb7\x86\x17\xcd'
As mentioned in comments you can also pass the encoding directly to the encode method of your string.
In [15]: data.encode("raw_unicode_escape")
Out[15]: b'\xc4\xb7\x86\x17\xcd'
I'm writing a program that pulls screen names and tweets from Twitter into a txt file. Some screen names contain special Unicode characters like ♡. In my Bash terminal these characters show up as an empty box. My SQL fails when I try to insert this character and tells me it contains an untranslatable character. Is there a way to convert only the special characters in Python to their hexadecimal form? I would also be happy just replacing these special characters with a placeholder.
Ideally "screenName♡" would convert to "screenName0x2661", or the special characters would be replaced with something like "screenName#REPLACE#".
You can achieve this using the encode method, explained here. From the docs:
Another important method is .encode([encoding], [errors='strict']),
which returns an 8-bit string version of the Unicode string, encoded
in the requested encoding. The errors parameter is the same as the
parameter of the unicode() constructor, with one additional
possibility; as well as ‘strict’, ‘ignore’, and ‘replace’, you can
also pass ‘xmlcharrefreplace’ which uses XML’s character references.
The following example shows the different results:
>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
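In Python 3 (where unichr and the unicode type are gone), the same error handlers work directly on str. Here's a sketch applied to the screen-name case from the question, using 'backslashreplace' to get the escape-code form and 'replace' for a plain substitute:

```python
name = 'screenName\u2661'  # 'screenName♡'

# Escape-code form, close to the "screenName0x2661" the question asks for:
print(name.encode('ascii', 'backslashreplace').decode('ascii'))
# -> screenName\u2661

# XML character reference form (9825 is the decimal value of U+2661):
print(name.encode('ascii', 'xmlcharrefreplace').decode('ascii'))
# -> screenName&#9825;

# Plain placeholder character:
print(name.encode('ascii', 'replace').decode('ascii'))
# -> screenName?
```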
I'm trying to read some files using Python 3.2; some of the files may contain Unicode characters while others do not.
When I try:
file = open(item_path + item, encoding="utf-8")
for line in file:
    print(repr(line))
I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)
I am following the documentation here: http://docs.python.org/release/3.0.1/howto/unicode.html
Why would Python be trying to encode to ascii at any point in this code?
The problem is that repr(line) in Python 3 also returns a Unicode string. It does not convert the characters above 128 to ASCII escape sequences.
Use ascii(line) instead if you want to see the escape sequences.
Actually, repr(line) is expected to return a string that, if placed in source code, would produce an object with the same value. This way, the Python 3 behaviour is just fine, as there is no need for ASCII escape sequences in source files to express a string with more than ASCII characters. It is quite natural to use UTF-8 or some other Unicode encoding these days. The truth is that Python 2 did produce the escape sequences for such characters.
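A quick illustration of the difference (Python 3):

```python
line = 'naïve\n'
print(repr(line))   # 'naïve\n'    -- non-ASCII characters kept as-is
print(ascii(line))  # 'na\xefve\n' -- everything above ASCII escaped
```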
What's your output encoding? If you remove the call to print(), does it start working?
I suspect you've got a non-UTF-8 locale, so Python is trying to encode repr(line) as ASCII as part of printing it.
To resolve the issue, you must either encode the string yourself and print the resulting bytes, or set your default encoding to something that can handle your strings (UTF-8 being the obvious choice).
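To see what encoding your environment is giving Python (a quick diagnostic; on most systems the PYTHONIOENCODING environment variable can override it):

```python
import sys

# The codec print() uses when writing to this stream; if this reports
# something like 'ANSI_X3.4-1968' (plain ASCII), non-ASCII output will fail.
print(sys.stdout.encoding)
```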
>>> s = 'auszuschließen'
>>> print(s.encode('ascii', errors='xmlcharrefreplace'))
b'auszuschließen'
>>> print(str(s.encode('ascii', errors='xmlcharrefreplace'), 'ascii'))
auszuschließen
Is there a prettier way to print any string without the b''?
EDIT:
I'm just trying to print escaped characters from Python, and my only gripe is that Python adds "b''" when I do that.
If I try to see the actual character in a dumb terminal like Windows 7's, then I get this:
Traceback (most recent call last):
File "Mailgen.py", line 378, in <module>
marked_copy = mark_markup(language_column, item_row)
File "Mailgen.py", line 210, in mark_markup
print("TP: %r" % "".join(to_print))
File "c:\python32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 29: character maps to <undefined>
>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschließen…'
>>> print(b)
b'auszuschließen…'
>>> b.decode()
'auszuschließen…'
>>> print(b.decode())
auszuschließen…
You start out with a Unicode string. Encoding it to ascii creates a bytes object with the characters you want. Python won't print it without converting it back into a string, and the default conversion puts in the b and the quotes. Using decode explicitly converts it back to a string; the default encoding is UTF-8, and since your bytes consist only of ASCII, which is a subset of UTF-8, it is guaranteed to work.
To see ascii representation (like repr() on Python 2) for debugging:
print(ascii('auszuschließen…'))
# -> 'auszuschlie\xdfen\u2026'
To print bytes:
import sys
sys.stdout.buffer.write('auszuschließen…'.encode('ascii', 'xmlcharrefreplace'))
# -> auszuschlie&#223;en&#8230;
Not all terminals can handle more than some 8-bit character set, that's true. But they won't handle it no matter what you do, really.
Printing a Unicode string will, assuming that your OS sets up the terminal properly, give the best result possible, which means that the characters the terminal cannot print are replaced with some character, like a question mark or similar. Doing that translation yourself will not really improve things.
Update:
Since you want to know what characters are in the string, you actually want to know the Unicode codes for them, or the XML equivalent in this case. That's more inspecting than printing, and then usually the b'' part isn't a problem per se.
But you can get rid of it easily and hackily like so:
print(repr(s.encode('ascii', errors='xmlcharrefreplace'))[2:-1])
Since you're using Python 3, you're afforded the ability to write print(s) to the console.
I can agree that, depending on the console, it may not be able to print properly, but I would imagine that most modern OSes since 2006 can handle Unicode strings without too much of an issue. I'd encourage you to give it a try and see if it works.
Alternatively, you can declare the source file's encoding by placing this before any code lines in the file (similar to a shebang):
# -*- coding: utf-8 -*-
This tells the interpreter that the source file itself is UTF-8; note that it controls how string literals in the file are decoded, not how the console renders your output.
For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003cfoo/\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode('unicode-escape')
u'<foo/>'
>>> s.decode('unicode-escape').encode('ascii')
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).
EDIT: See also Python Standard Encodings.
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure if newer versions of Python changed the codec name, but for me it only worked with the underscore.
Anyway, this is it.
At some point you will run into issues when you encounter special characters, like Chinese characters or emoticons, in a string you want to decode, i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (twitter data processing), I decoded as follows to allow me to see all characters with no errors
>>> s = '\u003cfoo\u003e'
>>> s.decode('unicode-escape').encode('utf-8')
'<foo>'
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quoted string and the dash right before the closing triple quote.
Using a triple-quoted string ensures that if the user enters ' \\" ' (spaces added for visual clarity) in the string, it will not disrupt the evaluator.
The dash at the end is a failsafe in case the user's string ends with a ' \" '. Before we assign the result, we slice off the inserted dash with [:-1].
So there is no need to worry about what users enter, as long as it is captured in raw format.
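On versions of Python with the ast module, ast.literal_eval is a simpler way to make this safe: it evaluates only literals, so arbitrary code can never run (a sketch; the quote-escaping caveat discussed above still applies):

```python
import ast

s = '\\u003cfoo\\u003e'  # a str actually containing backslash escapes
# Wrap in quotes to form a string literal, then evaluate only that literal:
unescaped = ast.literal_eval('"%s"' % s.replace('"', '\\"'))
print(unescaped)  # -> <foo>
```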
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'