I'm seeing strange behavior from Python's str.title() method when the string contains German umlauts (üöä): not only the first character of the string is capitalized, but also the character following each umlaut.
# -*- coding: utf-8 -*-
a = "müller"
print a.title()
# this returns >MüLler< , not >Müller< as expected
I tried to fix it by setting the locale to a German UTF-8 charset, but with no success:
import locale
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
a="müller"
print a.title()
# same value >MüLler<
Any ideas how to prevent the capitalization after the umlaut?
My Python version is 2.6.6 on Debian Linux.
Decode your byte string to Unicode, then use unicode.title(). On a byte string, str.title() works byte by byte: the second byte of the UTF-8 encoded ü is not an ASCII letter, so the character after it is treated as the start of a new word.
>>> a = "müller"
>>> a.decode('utf8').title()
u'M\xfcller'
>>> print a.decode('utf8').title()
Müller
You can always encode to UTF-8 again later on.
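For reference, the full round trip can be sketched like this (written with a byte literal so it runs the same on Python 2 and 3; the bytes are the UTF-8 encoding of "müller"):

```python
# -*- coding: utf-8 -*-
raw = b'm\xc3\xbcller'             # UTF-8 bytes for "müller"
name = raw.decode('utf-8')         # bytes -> Unicode string
titled = name.title()              # Unicode-aware titlecasing: u'Müller'
back = titled.encode('utf-8')      # Unicode -> UTF-8 bytes again, if needed
print(titled)
```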
Related
I am a newbie and I don't know how to set my console to print Urdu/Arabic characters. I am using Wing IDE, and when I run this code
print "طجکسعبکبطکسبطب"
i get this on my console
طجکسعبکبطکسبطب
You should declare your source encoding as UTF-8 and mark your string literals as Unicode (u'your text').
Additionally, you should make sure that your terminal/prompt window is set to a Unicode-capable encoding.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
arabic_words = u'لغت العربیه'
print arabic_words
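A quick way to check that the Python side is sound, independently of the console, is to round-trip the text through UTF-8 (the string is written with codepoint escapes here so the snippet itself stays plain ASCII); if this passes, any remaining garbling comes from the terminal's encoding or font:

```python
# u'لغت العربیه' spelled out as codepoint escapes
arabic_words = u'\u0644\u063a\u062a \u0627\u0644\u0639\u0631\u0628\u06cc\u0647'
utf8_bytes = arabic_words.encode('utf-8')   # what actually reaches a UTF-8 console
assert utf8_bytes.decode('utf-8') == arabic_words
```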
Whenever I try to use the characters "šđžćč" in Python 2.7, the console gives a non-ASCII character error.
This is fixed by adding # -*- coding: utf-8 -*- to the header.
However, when I try to print the characters, this happens:
The code is print "Upiši svoj tekst:" but Upi┼íi svoj tekst: is printed.
I want to print a non-ASCII (UTF-8) by its code rather than the character itself using Python 2.7.
For example, I have the following:
# -*- coding: utf-8 -*-
print "…"
and that's OK. However, I want to print '…' using its escape codes (starting with '\xe2') instead of the literal character.
Any ideas?
Printing '\xe2\x80\xa6' will give you …
In [36]: print'\xe2\x80\xa6'
…
In [45]: print repr("…")
'\xe2\x80\xa6'
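Equivalently, on a Unicode string you can spell the same character as the single codepoint escape u'\u2026'. A quick check that the two notations denote the same character (runs on Python 2 and 3):

```python
ellipsis_bytes = b'\xe2\x80\xa6'   # '…' as three UTF-8 bytes
ellipsis_char = u'\u2026'          # '…' as one Unicode codepoint
assert ellipsis_bytes.decode('utf-8') == ellipsis_char
```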
I'm trying to use Python's sub function from the regex module to recognize and change a pattern in a string. Below is my code.
old_string = "afdëhë:dfp"
newString = re.sub(ur'([aeiouäëöüáéíóúàèìò]|ù:|e:|i:|o:|u:|ä:|ë:|ö:|ü:|á:|é:|í:|ó:|ú:|à:|è:|ì:|ò:|ù:)h([aeiouäëöüáéíóúàèìòù])', ur'\1\2', old_string)
So what I'm looking to get after the code is applied is afdëë:dfp (without the h). So I'm trying to match a vowel (sometimes with accents, sometimes with a colon after it) then the h then another vowel (sometimes with accents). So a few examples...
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë
So I'm trying to remove the h when it is between two vowels, and also when it follows a vowel with a colon after it and precedes another vowel (i.e. a:ha). Any help is greatly appreciated. I've been playing around with this for a while.
A single user-perceived character may consist of multiple Unicode codepoints. Such characters can break a u'[abc]'-style regex, which sees only individual codepoints in Python. To work around it, you could use a u'(?:a|b|c)' regex instead. In addition, don't mix bytes and Unicode strings, i.e., old_string should also be Unicode.
Applying the last rule fixes your example.
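A minimal sketch of that fix, with both the string and the pattern as Unicode (the vowels are written as \xNN escapes so the snippet is independent of the source file's encoding; the optional-colon-plus-lookahead pattern is a simplified stand-in for the original alternation):

```python
import re

old_string = u'afd\xebh\xeb:dfp'   # u'afdëhë:dfp'
vowels = u'[aeiou\xe4\xeb\xf6\xfc\xe1\xe9\xed\xf3\xfa\xe0\xe8\xec\xf2\xf9]'
# a vowel, an optional colon, then h, with a vowel required after the h
pattern = u'(' + vowels + u':?)h(?=' + vowels + u')'
new_string = re.sub(pattern, u'\\1', old_string)
assert new_string == u'afd\xeb\xeb:dfp'   # u'afdëë:dfp' -- the h is gone
```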
You could write your regex using lookahead/lookbehind assertions:
# -*- coding: utf-8 -*-
import re
from functools import partial
old_string = u"""
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë"""
# (?<=a|b|c)(:?)h(?=a|b|c)
chars = u"a e i o u ä ë ö ü á é í ó ú à è ì ò".split()
pattern = u"(?<=%(vowels)s)(:?)h(?=%(vowels)s)" % dict(vowels=u"|".join(chars))
remove_h = partial(re.compile(pattern).sub, ur'\1')
# remove 'h' followed and preceded by vowels
print(remove_h(old_string))
Output
ò:a becomes ò:a
ä:à becomes ä:à
aa becomes aa
üa becomes üa
ëë becomes ëë
For completeness, you could also normalize all Unicode strings in the program using the unicodedata.normalize() function (see the example in the docs to understand why you might need it).
It is an encoding issue. Different combinations of file encoding and a non-Unicode old_string behave differently on different Python versions.
For example, your code works fine on Python 2.6 and 2.7 this way (all data below is cp1252-encoded):
# -*- coding: cp1252 -*-
old_string = "afdëhë:dfp"
but fails with SyntaxError: Non-ASCII character '\xeb' if no encoding is specified in the file.
However, those lines fail on Python 2.5 with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
Meanwhile, on all Python versions the substitution fails to remove the h when old_string is a non-Unicode byte string:
# -*- coding: utf8 -*-
old_string = "afdëhë:dfp"
So you have to declare the correct encoding and define old_string as a Unicode string as well; for example, this will do:
# -*- coding: cp1252 -*-
old_string = u"afdëhë:dfp"
So I have something like this:
x = "CЕМЬ"
x[:len(x)-1]
Which is to remove the last character from the string.
However it doesn't work and gives me an error. I figured it's because the string is Unicode. So how do you do this simple operation on non-ASCII strings?
That's because in Python 2.x, "CЕМЬ" is a byte string: a strange way of writing b'C\xd0\x95\xd0\x9c\xd0\xac'.
You want a character string. In Python 2.x, character strings are prefixed with a u:
x = u"CЕМЬ"
x[:-1] # Returns u"CЕМ" (a negative index counts from the end, so len(x) is implicit)
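The difference is easy to demonstrate (a sketch that runs on both Python 2 and 3): slicing the byte string drops one byte and splits the last character in half, while slicing the decoded string drops one whole character:

```python
raw = b'C\xd0\x95\xd0\x9c\xd0\xac'   # the UTF-8 bytes behind "CЕМЬ"
text = raw.decode('utf-8')           # four characters: C, Е, М, Ь
assert len(raw) == 7 and len(text) == 4
# Slicing the bytes leaves a dangling \xd0 -- half of the two-byte 'Ь'
assert raw[:-1] == b'C\xd0\x95\xd0\x9c\xd0'
# Slicing the Unicode string removes the whole last character
assert text[:-1] == u'C\u0415\u041c'
```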
If you're writing this in a program (as opposed to an interactive shell), you will want to specify a source code encoding. To do that, simply add the following line to the beginning of the file, where utf-8 matches your file encoding:
# -*- coding: utf-8 -*-
Save the file with UTF-8 encoding:
# -*- coding: utf-8 -*-
x = u'CЕМЬ'
print x[:-1] #prints CЕМ
x = u'some string'
x2 = x[:-1]