Django French Translation - how to handle single quotes in translation strings?


I am using Python 3.5.2 and Django 1.10.
I have received the French translation .po file and can run the compilemessages command without receiving any errors.
However, when I run the site, many pages refuse to load.
I suspect that this is because the French translation .po file contains many single quotes (') in the translation strings.
For example,
#: .\core\constants\address_country_style_types.py:274
msgid "Ascension Island"
msgstr "Île de l'Ascension"
I remember reading somewhere (but cannot find the reference again) that single quotes must be preceded by either a forward slash or a backslash. I tried that, but when I ran the compilemessages command, I got this error message:
C:\Users\me\desktop\myapp\myapp\locale\fr\LC_MESSAGES\django.po:423:18: invalid control sequence
So how do I escape the French single quotes in translation strings?
Here is the header of my French language .po file:
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2017-05-04 12:55+1000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL#ADDRESS>\n"
"Language-Team: LANGUAGE <LL#li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"

I am unsure what the cause of this issue is (maybe the translator somehow corrupted the file?).
However, as a workaround, instead of the standard single quotation mark ', I have used this single quotation mark (taken from the symbols in MS Word):
′
I am yet to check this with the French translator, but it looks and works OK.
I hope this helps someone.

The correct way is to escape the single quote; however, you need to know which endpoint will consume the text. You already tried the backslash, as in:
L\'Ascension
Trust me, nobody who is French will like seeing the backquote. Back in the DOS days of the '90s there was almost no visual difference. Now, with fonts, it gets ugly.
Since you're producing for the web, use an HTML replacement, like &apos;
See this article:
Why shouldn't `&apos;` be used to escape single quotes?

The solution is
#: .\core\constants\address_country_style_types.py:274
msgid "Ascension Island"
msgstr "Ile de l‘Ascension"
It works, even if it is used in some JavaScript. Don't use the numeric character reference (such as &#39;): it will not work inside form fields, it will not be rendered, and you will see the ugly number. I have already tested all this.
As I said in the comments, beginning a word with an uppercase accented letter is not recommended. If you write Île and then sort the list of countries, the Î character will come after Z and the list will not follow the natural order you would expect.
This is another problem with Python's default sorting. It compares strings by the code point of each character, and Î has a code point of 206, so it comes after Z, which is 90.
Maybe Python provides a solution to this, but I have not found one yet. If someone has, I would be glad to know.
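One approach sometimes used for this is to sort on a key with the accents stripped. The following is only a minimal sketch using the standard library's unicodedata module; the function name accent_insensitive_key and the sample list are made up for illustration, and a properly locale-aware collation (for example via locale.strxfrm or an ICU binding) may be a better fit for real data:

```python
import unicodedata


def accent_insensitive_key(name):
    """Hypothetical sort key: drop combining accents, then ignore case."""
    # NFKD splits "Î" into "I" followed by a combining circumflex.
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only the base characters (combining marks have a non-zero class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


countries = ["Zimbabwe", "Île de l'Ascension", "France", "Italie"]

# Default sort compares code points, so "Île" lands after "Zimbabwe".
default_order = sorted(countries)

# With the key, "Île" sorts as if it were spelled "ile".
natural_order = sorted(countries, key=accent_insensitive_key)

print(default_order)
print(natural_order)
```

The original strings are left untouched; only the comparison key changes, so the accented spelling can still be displayed.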

I'm a French speaker, and so are most of my users.
This is a very annoying bug.
The normal Django escaping techniques (\' or format_html(my_translated_string)) do not work for me either.
I have used ′ instead of ' and it works OK: the compilemessages command succeeds and the HTML node renders fine.
It is, however, not very elegant or robust, since every future message needs to take this into account, and the character ´ is not commonly used.
I found a better and more robust solution:
escaping through template filters.
In the HTML template:
<h5 class="modal-title">{{help_message_body|escape}}</h5>
and in javascript:
modal.find('.modal-message').html('<h5 class="modal-title">{{help_message_body|escapejs}}</h5>')
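To illustrate why context-specific escaping helps, here is a rough sketch using only Python's standard library. It is not how Django's escape and escapejs filters are implemented; it only shows the same idea, namely that the apostrophe is encoded differently depending on whether the string lands in HTML or in JavaScript. The variable names are assumptions made for the example:

```python
import html
import json

message = "Île de l'Ascension"

# HTML context: the apostrophe is replaced by a character reference,
# so it cannot terminate a single-quoted attribute value.
html_safe = html.escape(message, quote=True)

# JavaScript context: json.dumps produces a double-quoted string literal,
# so the apostrophe no longer clashes with single-quoted JS strings.
js_literal = json.dumps(message, ensure_ascii=False)

print(html_safe)
print(js_literal)
```

In a Django project the template filters shown above remain the natural choice; the sketch is only meant to make visible what "escaping for the target context" does to the apostrophe.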

Related

Regex behaves differently for the same input string

I am trying to get a pdf page with a particular string and the string is:
"statement of profit or loss"
and I'm trying to accomplish this using the following regex:
re.search('statement of profit or loss', text, re.IGNORECASE)
But even though the page contained this string "statement of profit or loss" the regex returned None.
On further investigating the document, I found that the characters 'fi' in the "profit" as written in the document are more congested. When I copied it from the document and pasted it in my code it worked fine.
So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works fine. But if I type "statement of profit or loss" manually in my code, re.search() returns None.
How can I avoid this behavior?

The 'congested' characters copied from your PDF are actually a single character: the 'fi ligature' U+FB01: ﬁ.
Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f+i with ﬁ.
Combining two or more characters into a single glyph is a fairly usual operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) Actually, you can have any number of ligatures; some Adobe fonts use a single ligature for Th.
Usually this is not a problem when extracting text, because the PDF can specify that certain glyphs must be decoded as a string of characters, namely the original characters. So possibly your PDF does not contain such a definition, or the typesetting engine did not bother because the single character ﬁ is a valid Unicode character in itself (although using it is strongly discouraged).
You can work around this by explicitly cleaning up your text strings before processing any further:
text = text.replace('ﬁ', 'fi')
Repeat this for the other problematic ligatures that have a Unicode code point: ﬂ, ﬀ, ﬃ, ﬄ (I may have missed some).
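As an alternative to listing each ligature by hand, Unicode compatibility normalization (NFKC) expands these ligatures into their component letters. A minimal sketch, assuming the extracted page text is available as a Python string; the sample text is invented for the example:

```python
import re
import unicodedata

# Sample text as it might come out of the PDF: "profit" contains U+FB01.
text = "Consolidated statement of pro\ufb01t or loss"
pattern = "statement of profit or loss"

# The plain search fails because the ligature is a single character.
raw_match = re.search(pattern, text, re.IGNORECASE)

# NFKC replaces compatibility characters such as the fi ligature
# with their plain equivalents ("f" followed by "i").
normalized = unicodedata.normalize("NFKC", text)
clean_match = re.search(pattern, normalized, re.IGNORECASE)

print(raw_match)
print(clean_match.group(0))
```

Note that NFKC also changes other compatibility characters (for example full-width letters or superscript digits), which may or may not be wanted for a given document.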

Japanese characters won't appear when printed

I am printing Unicode characters in Python. All of the symbols I have used so far work except for Japanese characters. When I print the characters, it only shows the "question mark in a box" symbol. How can I fix this?
When I first encountered the problem I thought it might be Python. I searched Google, but I found almost nothing.
Then I wondered if it was Command Prompt. (I use Command Prompt to test my code.) No relevant results.
For my code, I use a list made of the Unicode characters so I won't have to look up and type the specific code. This is what it looks like.
UD = [u"\u3053", u"\u3093", u"\u306B", u"\u3061", u"\u306F"]
UDTemp = UD[0] + UD[1] + UD[2] + UD[3] + UD[4]
print(UDTemp)
When printing, I expected "こんにちは", but instead I got the weird symbols.

The font has to support the characters. For example, I have East Asian IMEs installed on a US Windows 10 system, which makes fonts that support Japanese available.
To obtain the fonts you want, it is easiest to add support for the desired language in Windows 10. To add a language, search for "Language settings".
Once the language is installed, fonts supporting that language will appear in the Console properties, and IMEs will be installed so you can type in that language if you know how to use them.
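To confirm that the string itself is correct and only the console font is at fault, it can help to inspect the code points rather than the rendered glyphs. A small diagnostic sketch using the standard library; it prints only ASCII, so it should not depend on the console font:

```python
import unicodedata

# The same five characters as in the question, written as one string.
greeting = "\u3053\u3093\u306B\u3061\u306F"

# ascii() shows the escaped form, which any console can display.
print(ascii(greeting))

# Look up the official Unicode name of each character.
names = [unicodedata.name(ch) for ch in greeting]
for ch, name in zip(greeting, names):
    print(f"U+{ord(ch):04X} {name}")
```

If every line reports a HIRAGANA LETTER, the data is fine and the boxes come from the font used by the console window.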

Detecting Arabic characters in regex

I have a dataset of Arabic sentences, and I want to remove non-Arabic characters or special characters. I used this regex in python:
text = re.sub(r'[^ء-ي0-9]',' ',text)
It works perfectly, but in some sentences (4 cases from the whole dataset) the regex also removes the Arabic words!
I read the dataset using pandas (a Python package) like this:
train = pd.read_excel('d.xlsx', encoding='utf-8')
I also tested it on the Pythex site.
What is the problem?
Edit:
The sentences in the example:
انا بحكي رجعو مبارك واعملو حفلة واحرقوها بالمعازيم ولما الاخوان يروحو
يعزو احرقو العزا -- احسنلكم والله #مصر
ﺷﻔﻴﻖ ﺃﺭﺩﻭﻏﺎﻥ ﻣﺼﺮ ..ﺃﺣﻨﺍ ﻧﺒﻘﻰ ﻣﻴﻦ ﻳﺎ ﺩﺍﺩﺍ؟ #ﻣﺴﺨﺮﺓ #ﻋﺒﺚ #EgyPresident #Egypt #ﻣﻘﺎﻃﻌﻮﻥ لا يا حبيبي ما حزرت: بشار غبي بوجود بعثة أنان حاب يفضح روحه انه مجرم من هيك نفذ المجزرة لترى البعثة اجرامه بحق السورين

The characters that are incorrectly removed are not in the common Unicode range for Arabic (U+0621..U+064A); they are "hardcoded" as their initial, medial, and final forms.
Comparable to capitalization in Latin-based languages, but stricter, Arabic writing marks both the start and the end of words with a special 'flourish' form. In addition, it allows an "isolated" form (used when the character is not part of a full word).
This is usually encoded in a file as 'an' Arabic character, and the actual rendering in initial, medial, or final form is left to the text renderer. But since all forms also have Unicode code points of their own, it is also possible to "hardcode" the exact forms. That is what you encountered: a mix of these two systems.
Fortunately, the Unicode ranges for the hardcoded forms are also fixed values:
Arabic Presentation Forms-A is a Unicode block encoding contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The presentation forms are present only for compatibility with older standards such as codepage 864 used in DOS, and are typically used in visual and not logical order.
(https://en.wikipedia.org/wiki/Arabic_Presentation_Forms-A)
and their ranges are U+FB50..U+FDFF (Presentation Forms A) and U+FE70..U+FEFC (Presentation Forms B). If you add these ranges to your exclusion set, the regex will no longer delete these texts:
[^ء-ي0-9ﭐ-﷿ﹰ-ﻼ]
Depending on your browser and/or editor, you may have problems selecting this text to copy and paste it. It may be clearer to specify the exact characters with escapes:
[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]
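As a quick check, the two character classes can be compared on a short sample. This is only a sketch; the sample string is the first word of the problematic sentence, written out as presentation-form code points, followed by an ASCII hashtag:

```python
import re

# Class from the question: base Arabic letters and digits only.
original = re.compile(r"[^\u0621-\u064a0-9]")

# Extended class that also keeps Arabic Presentation Forms A and B.
extended = re.compile(r"[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]")

# Four presentation-form letters (U+FEB7 U+FED4 U+FEF4 U+FED6) plus Latin text.
sample = "\ufeb7\ufed4\ufef4\ufed6 #Egypt"

removed_everything = original.sub(" ", sample)
kept_arabic = extended.sub(" ", sample)

print(repr(removed_everything))
print(ascii(kept_arabic))
```

With the original class the whole sample is turned into spaces; with the extended class only the hashtag is removed and the Arabic word survives.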

I did some tests on Pythex and, with the help of "Regular Expression Arabic characters and numbers only", found this: [\u0621-\u064A0-9], which catches almost all non-Arabic characters. For some unknown reason, it doesn't catch 'y', so you have to add it yourself: [\u0621-\u064A0-9y]
This can catch all non-Arabic characters. For special characters, I'm sorry, but I found nothing better than adding them inside the class: [\u0621-\u064A0-9y#\!\?\,]

Gettext fallbacks don't work with untranslated strings

In my application's source code, the strings wrapped with gettext are in Russian, so Russian is my default language and the *.po files are keyed on it.
Now I need to build a fallback chain: a string that is not translated in the Spanish catalog should be looked up in the English catalog, and if it is not translated there either, it should be returned as-is in Russian.
I am trying to do this with the add_fallback method, but untranslated strings in self._catalog of GNUTranslations(NullTranslations) are already replaced with themselves, so the ugettext method never falls back.
What am I doing wrong?
Example:
Current locale is Spanish, and we’ve got no translations for string "Титул должен быть уникальным" in Spanish catalog and as a result "Title should be unique" from English catalog should be returned.
Spanish *.po file
msgid "Титул должен быть уникальным"
msgstr "" # <— We've got no translation for this string
English *.po file
msgid "Титул должен быть уникальным"
msgstr "Title should be unique"
The Russian *.po file does not contain translations, because this language is used for the keys in the source code (default language)
msgid "Титул должен быть уникальным"
msgstr ""
I've got a Spanish translator (a GNUTranslations object), and I add an English translator (another GNUTranslations object) as its fallback with the add_fallback method.
So my es_translator._fallback is the en_translator object.
In the ugettext function we try to get the value from self._catalog using the message as the key, and only if it is missing do we call self._fallback.
But self._catalog.get(message) for an untranslated string returns the string itself.
self._catalog["Титул должен быть уникальным"] -> "Титул должен быть уникальным", so we never search the English catalog.
def add_fallback(self, fallback):
    if self._fallback:
        self._fallback.add_fallback(fallback)
    else:
        self._fallback = fallback

def ugettext(self, message):
    missing = object()
    tmsg = self._catalog.get(message, missing)
    if tmsg is missing:
        if self._fallback:
            return self._fallback.ugettext(message)
        return unicode(message)
    return tmsg
However, if a message is marked as fuzzy, it is not included in self._catalog and the fallback works well.
#, fuzzy
msgid "Отсутствуют файлы фотографий"
msgstr "Archivos de fotos ausentes"

OK, Python is doing something different from the standard fallback mechanism to provide added functionality, and it is not working the way you think it should. This may warrant a bug report.
The standard fallback mechanism has only one fallback if a string is not in a translation: use the source string. In most cases this is English (the C or POSIX locale forces no lookups), but in your case the messages in the source, and therefore the C locale, contain Russian text (which may cause other problems, because the C locale is sometimes assumed to be ASCII, not UTF-8). The current recommended best practice is to use English encoded in seven-bit ASCII in the C locale and then translate to all other languages. This is a significant redesign (and admittedly Anglocentric), but unless someone improves the tools (which would be an even more significant redesign), this is probably your best bet.

The only way I found to solve it was to remove untranslated strings while compiling the *.mo files.
Patch write_mo in babel/messages/mofile.py with:
messages = [m for m in messages if m.string]
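If patching Babel is not an option, another possibility might be to treat "translation equals msgid" as untranslated at lookup time. The following is only a sketch under assumptions: it is written for Python 3 (where the method is gettext rather than ugettext), and the DictTranslations class with its in-memory catalogs is hypothetical, standing in for real GNUTranslations objects loaded from *.mo files:

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory catalog, used only for this illustration."""

    def __init__(self, catalog):
        super().__init__()
        self._catalog = catalog

    def gettext(self, message):
        translated = self._catalog.get(message)
        # A missing, empty, or identical entry counts as "untranslated",
        # so the lookup moves on to the next catalog in the chain.
        if not translated or translated == message:
            if self._fallback:
                return self._fallback.gettext(message)
            return message
        return translated


msgid = "Титул должен быть уникальным"

# Spanish catalog: the untranslated entry maps the msgid to itself.
es_translator = DictTranslations({msgid: msgid})
en_translator = DictTranslations({msgid: "Title should be unique"})
es_translator.add_fallback(en_translator)

print(es_translator.gettext(msgid))
```

One caveat of this design: a string whose correct translation happens to be identical to its msgid would also be passed on to the fallback catalog.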

Translating content in filesystem for a Plone product

I'm trying to get certain strings in a .py file translated, using the i18n machinery. Translating .pt files is not a problem, but whenever I try to translate using _('Something') in Python code on the filesystem, it always gives English text (which is the default) instead of the Norwegian text that should be there. So I can see output from python code in English, while other Page Templates bits are correctly translated.
Is there a how-to or something similar for this?

Is the domain name used for _('Something') the same as what you use in the Norwegian .po file that has the translation? They should be the same, so do not use 'plone' in one case and 'my.domain' in the other.
Also, the call to the underscore function does not in itself translate the string; it only creates a string that can be translated. If this string ends up on its own directly in a template, you should add i18n:translate="" to that tag, probably with a matching i18n:domain.
Otherwise you should manually call the translate method, as in http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html#manually-translated-message-ids. Read the Plone 4 migration guide for some differences between Plone 3 and 4 that might bite you here.

If you are looking for how-tos, you should probably read these docs:
http://plone.org/documentation/kb/i18n-for-developers
http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html
Bye,
Giacomo

Be aware that _() does not translate the text when called; it returns a Message object, which is translated when rendered in a template.
That means:
Do not concatenate Message objects. "text %s" % _('translation') will not work, and neither will "text" + _('translation').
If you do not send the text to the browser through a template, it may not be translated. For example, if you generate an email, you need to translate the Message object manually.
