I want to encode a URL with special characters. In my case these are: š, ä, õ, æ, ø (it is not a finite list).
urllib2.quote(symbol) gives a very strange result, which is not correct. How else can these symbols be encoded?
urllib2.quote("Grønlandsleiret, Oslo, Norway") gives '%27Gr%B8nlandsleiret%2C%20Oslo%2C%20Norway%27'
Use UTF-8 explicitly then:
urllib2.quote(u"Grønlandsleiret, Oslo, Norway".encode('UTF-8'))
And always state the encoding in your file. See PEP 0263.
A non-UTF-8 string needs to be decoded first, then encoded:
# You've got a str "s".
s = s.decode('latin-1') # (or what the encoding might be …)
# Now "s" is a unicode object.
s = s.encode('utf-8') # Encode as UTF-8 string.
# Now "s" is a str again.
s = urllib2.quote(s) # URL encode.
# Now "s" is encoded the way you need it.
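For reference, in Python 3 the same flow collapses to a single call, since urllib.parse.quote accepts a str directly and percent-encodes it as UTF-8 by default. A minimal sketch:

```python
from urllib.parse import quote

s = "Grønlandsleiret, Oslo, Norway"
q = quote(s)  # str in, UTF-8-based percent-encoding out by default
print(q)      # Gr%C3%B8nlandsleiret%2C%20Oslo%2C%20Norway
```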
Given a byte string, for instance B = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_", I want to be able to convert this to a valid printable UTF-8 string that is as UTF-8 as possible: S = "\\x81\\xc9\\x00\\x07I ABCD_↗_". Note that the first group of hex bytes are not valid UTF-8 characters, but the last 3 bytes do encode a valid UTF-8 character (the arrow). It seems like this should be part of codecs, but I cannot figure out how to make it happen.
for instance
>>> codecs.decode(codecs.escape_encode(B, 'utf-8')[0], 'utf-8')
'\\x81\\xc9\\x00\\x07I ABCD_\\xe2\\x86\\x97_'
escapes a valid UTF-8 character along with the invalid characters.
Specifying 'backslashreplace' as the error handling mode when decoding a bytestring will replace un-decodable bytes with backslashed escape sequences:
decoded = b.decode('utf-8', errors='backslashreplace')
Also, this is a decoding operation, not an encoding operation. Decoding is bytes->string. Encoding is string->bytes.
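A quick demonstration with the byte string from the question (Python 3 syntax). Note that backslashreplace only escapes the bytes that fail to decode; valid control characters such as \x00 and \x07 come through as real control characters, not as escape sequences:

```python
b = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_"

# \x81 and \xc9 are not valid UTF-8, so they become literal '\x81' and '\xc9';
# the sequence \xe2\x86\x97 decodes normally to the arrow character.
decoded = b.decode('utf-8', errors='backslashreplace')
print(decoded)
```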
I have this function in python
Str = "ü"
print Str

def correctText(str):
    str = str.upper()
    correctedText = str.decode('UTF8').encode('Windows-1252')
    return correctedText

corText = correctText(Str)
print corText
It works and converts characters like ü and é; however, it fails when I try Ã? and ¶.
Is there a way I can fix it?
The byte sequences behind Ã? and ¶ are not valid UTF-8 (in a multi-byte UTF-8 sequence, every byte after the first must start with the bits 10), so the decode step fails. What you need to do is either use some other kind of encoding or strip out the errors in your str, for example with unicode(s, 'utf8', errors='ignore'). I recommend the latter.
What you are trying to do is to compose valid UTF-8 codes by several consecutive Windows-1252 codes.
For example, for ü, the Windows-1252 code of à is C3 and for ¼ it's BC. Together the code C3BC happens to be the UTF-8 code of ü.
Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code (because the second byte does not start with 10).
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of the UTF-8 code (C3A0) is Ã followed by a non-printable character (a non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
For ¶, the Windows-1252 pair is ¶ (C2B6), which happens to be the valid UTF-8 code of ¶ itself. Shouldn't it rather be ö, for which the Windows-1252 pair is ö (C3B6), equal to the UTF-8 code of ö?
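This kind of mojibake can be verified, and undone, by re-encoding the garbled text as Windows-1252 and decoding the resulting bytes as UTF-8. A sketch in Python 3 syntax:

```python
# The two Windows-1252 characters 'Ã' (U+00C3) and '¼' (U+00BC)
garbled = "\u00c3\u00bc"

# cp1252 maps them back to bytes C3 BC, which is the UTF-8 encoding of 'ü'
fixed = garbled.encode('cp1252').decode('utf-8')
print(fixed)  # ü
```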
I have a corpus of different texts from different languages.
I want to capture all characters. I use Python 2.7, and the default encoding setting is utf-8.
I do not know why, when I use this code for a German umlaut, it prints the umlaut correctly:
'Erd\xC3\xA4pfel'.decode('unicode-escape').encode('latin1').decode('utf-8')
Result is:
Erdäpfel
but when I use this code :
'Erd\xC3\xA4pfel'.decode('unicode-escape').encode('utf-8').decode('utf-8')
result is :
Erdäpfel, which is different.
I am not familiar with text mining. I know that, for example, the latin1 encoding does not contain French letters, which is not desired in my project.
How can I convert all unicode-escape strings in my corpus, regardless of their language, to the respective characters?
UTF-8, according to the documentation, covers all languages, so why does it not print the German umlaut correctly while the latin1 encoding does?
PS: Whether the escape sequences use lowercase or uppercase letters makes no difference; I have tried both and the results were the same.
You already have UTF-8 encoded data. There are no string literal characters to escape in your bytestring. You are looking at the repr() output of a string, where non-printable and non-ASCII bytes are shown as escape sequences because that makes the value easily copy-pastable in an ASCII-safe way. The \xc3 you see is one byte, not separate characters:
>>> 'Erd\xC3\xA4pfel'
'Erd\xc3\xa4pfel'
>>> 'Erd\xC3\xA4pfel'[3]
'\xc3'
>>> 'Erd\xC3\xA4pfel'[4]
'\xa4'
>>> print 'Erd\xC3\xA4pfel'
Erdäpfel
You'd have to use a raw string literal or doubled backslashes to actually get escape sequences that unicode-escape would handle:
>>> '\\xc3\\xa4'
'\\xc3\\xa4'
>>> '\\xc3\\xa4'[0]
'\\'
>>> '\\xc3\\xa4'[1]
'x'
>>> '\\xc3\\xa4'[2]
'c'
>>> '\\xc3\\xa4'[3]
'3'
>>> print '\\xc3\\xa4'
\xc3\xa4
Note how there is a separate \ backslash character in that string (echoed as \\, escaped again).
Besides interpreting actual escape sequences, the unicode-escape codec decodes your data as Latin-1, so you end up with a Unicode string containing the character U+00C3 LATIN CAPITAL LETTER A WITH TILDE. Encoding that back to Latin-1 bytes gives you the \xC3 byte again, and you are back to UTF-8 bytes. Decoding those as UTF-8 then works correctly.
But your second attempt encoded the U+00C3 LATIN CAPITAL LETTER A WITH TILDE codepoint to UTF-8, and that encoding gives you the byte sequence \xc3\x83. Printing those bytes to your UTF-8 terminal will show the à character. The other byte, \xA4 became U+00A4 CURRENCY SIGN, and the UTF-8 byte sequence for that is \xc2\xa4, which prints as ¤.
There is absolutely no need to decode as unicode-escape here. Just leave the data as is. Or, perhaps, decode as UTF-8 to get a unicode object:
>>> 'Erd\xC3\xA4pfel'.decode('utf8')
u'Erd\xe4pfel'
>>> print 'Erd\xC3\xA4pfel'.decode('utf8')
Erdäpfel
If your actual data (and not the test you did) contains \xhh escape sequences that encode UTF-8, then don't use unicode-escape to decode those sequences either. Use string-escape so you get a byte string containing UTF-8 data (which you can then decode to Unicode as needed):
>>> 'Erd\\xc3\\xa4pfel'
'Erd\\xc3\\xa4pfel'
>>> 'Erd\\xc3\\xa4pfel'.decode('string-escape')
'Erd\xc3\xa4pfel'
>>> 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
u'Erd\xe4pfel'
>>> print 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
Erdäpfel
I am currently learning Python and I came across the following code:
text = raw_input()
for letter in text:
    x = [alpha_dic[letter]]
    print x
When I enter an umlaut (which is in the dictionary, by the way), it gives me an error like KeyError: '\xfc' (for ü in this case), because the umlauts are stored internally in this way! I saw some solutions with unicode encoding or UTF, but either I am not skilled enough to apply them correctly or it simply does not work that way.
You are running into multiple shortcomings of Python (2.x):
raw_input() gives you raw bytes from the system with no encoding info.
The native encoding for Python strings is 'ascii', which cannot represent 'ü'.
The encoding of literals in your script is either ascii, or needs to be declared in a header at the top of the file.
So if you have a simple file like this:
x = {'ü': 20, 'ä': 10}
And run it with python you get an error, because the encoding is unknown:
SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details
This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.
For example, if the encoding is CP1252 (like a German Windows GUI):
# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä':30}
print repr(x)
This prints:
{u'\xfc': 20, u'\xe4': 30}
But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:
{u'\xb3': 20, u'\xf5': 30}
Totally different.
So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
Next step is fixing raw_input(). It does what it says it does: it provides you raw input from the console. Just bytes. But an 'ü' can be represented by a lot of different bytes: 0xfc in ISO-8859-1 and CP1252, 0x81 in CP850, 0xc3 0xbc in UTF-8, 0x00 0xfc or 0xfc 0x00 in UTF-16, and so on.
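The point about the many byte representations of 'ü' is easy to check. In Python 3 syntax, where str.encode shows the raw bytes:

```python
# One character, many byte sequences depending on the encoding
for enc in ('latin-1', 'cp1252', 'cp850', 'utf-8', 'utf-16-le'):
    print(enc, '\u00fc'.encode(enc))
# latin-1 and cp1252 give b'\xfc', cp850 gives b'\x81',
# utf-8 gives b'\xc3\xbc', utf-16-le gives b'\xfc\x00'
```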
So your code has two issues with that:
for letter in text:
If text happens to be a simple byte string in a multibyte encoding (e.g. UTF-8, UTF-16, some others), one byte is not equal to one letter, so iterating like that over the string will not do what you expect. For a very simplified view of a letter, you might be able to do that kind of iteration with Python unicode strings (if properly normalized). So you need to make sure text is a unicode string first.
How to convert from a byte string to unicode? A byte string offers the decode() method, which takes an encoding. A good first guess for that encoding is sys.stdin.encoding or locale.getpreferredencoding(True).
Putting things together:
import sys, locale

alpha_dic = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over the unicode string, not really letters...
for letter in utext:
    x = [alpha_dic[letter]]
    print x
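For comparison, in Python 3 none of this decoding is needed, because input() already returns a Unicode str and iteration is per code point. A sketch, using the same hypothetical mapping as above:

```python
alpha_dic = {'\u00fc': 'small umlaut u'}  # hypothetical example mapping

def lookup(text):
    # text is already str (Unicode) in Python 3, so this iterates code points
    return [alpha_dic.get(letter, '?') for letter in text]

print(lookup('\u00fc'))  # ['small umlaut u']
```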
I got this to work borrowing from this answer:
# -*- coding: utf-8 -*-
import sys, locale
alpha_dict = {u"ü":"umlaut"}
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
for letter in text:
    x = [alpha_dict[unicode(letter)]]
    print x
>>> ü
>>> ['umlaut']
Python 2 and unicode are not for the faint of heart...
thestring = urllib.quote(thestring.encode('utf-8'))
This will encode it. How to decode it?
What about
backtonormal = urllib.unquote(thestring)
If you mean to decode a string from UTF-8, you can first transform the string to unicode and then to any other encoding you would like (or leave it as unicode), like this:
unicodethestring = unicode(thestring, 'utf-8')
latin1thestring = unicodethestring.encode('latin-1','ignore')
'ignore' means that if you encounter a character that is not in the Latin-1 character set, you ignore it.
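Putting the whole round trip together in Python 3 syntax (where quote and unquote live in urllib.parse, and unquote returns a str decoded as UTF-8 by default), a sketch:

```python
from urllib.parse import quote, unquote

original = "Grønlandsleiret, Oslo, Norway"
encoded = quote(original)    # UTF-8-based percent-encoding
decoded = unquote(encoded)   # back to the original string
assert decoded == original
```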