Gettext fallbacks don't work with untranslated strings - python

In the source code of my application I wrapped Russian strings with gettext, so Russian is my default language and the *.po files are based on it.
Now I need to build a fallback chain: a string that is not translated in the Spanish catalog should be looked up in the English catalog, and if it is not translated there either, it should be returned as is, in Russian.
I am trying to do this with the add_fallback method, but untranslated strings in self._catalog of GNUTranslations(NullTranslations) are already replaced with themselves, so the ugettext method never falls back.
What am I doing wrong?
Example:
The current locale is Spanish, and we have no translation for the string "Титул должен быть уникальным" in the Spanish catalog, so "Title should be unique" from the English catalog should be returned.
Spanish *.po file
msgid "Титул должен быть уникальным"
msgstr "" # <— We've got no translation for this string
English *.po file
msgid "Титул должен быть уникальным"
msgstr "Title should be unique"
The Russian *.po file does not contain translations, because this language is used for the keys in the source code (default language)
msgid "Титул должен быть уникальным"
msgstr ""
I have a Spanish translator (a GNUTranslations object), and I add an English translator (also a GNUTranslations object) as a fallback for it with the add_fallback method.
So my es_translator._fallback is the en_translator object.
In the ugettext function we try to get the value from self._catalog using the message as the key, and only if it is missing do we call self._fallback.
But self._catalog.get(message) returns the string itself for an untranslated string.
self._catalog["Титул должен быть уникальным"] -> "Титул должен быть уникальным", so we never search the English catalog.
def add_fallback(self, fallback):
    if self._fallback:
        self._fallback.add_fallback(fallback)
    else:
        self._fallback = fallback

def ugettext(self, message):
    missing = object()
    tmsg = self._catalog.get(message, missing)
    if tmsg is missing:
        if self._fallback:
            return self._fallback.ugettext(message)
        return unicode(message)
    return tmsg
However, if a message is marked as fuzzy it is not included in self._catalog, and the fallback works well.
#, fuzzy
msgid "Отсутствуют файлы фотографий"
msgstr "Archivos de fotos ausentes"

OK, Python is doing something different from the standard fallback mechanism here, as added functionality, and it is not working the way you think it should. This may warrant a bug report.
The standard fallback mechanism has only one fallback if a string is not in a translation: use the source string. In most cases this is English (the C or POSIX locale forces no lookups), but in your case, because the messages in the source are Russian, the C locale has Russian text (which may cause other problems, because the C locale sometimes assumes ASCII, not UTF-8). The current recommended best practice is to use English in the C locale, encoded in seven-bit ASCII, and then translate to all other languages. This is a significant redesign (and admittedly anglocentric), but unless someone improves the tools (which would be an even more significant redesign) this is probably your best bet.

The only way to solve it was to remove untranslated strings while compiling the *.mo files.
Patch write_mo in babel/messages/mofile.py with:
messages = [m for m in messages if m.string]
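The same idea can also be applied at lookup time instead of compile time. The following is only a minimal sketch for Python 3 (where the method is gettext rather than ugettext); DictTranslations is a hypothetical in-memory class invented for this illustration, not part of gettext or Babel. It treats an empty entry, or an entry equal to its own msgid, as untranslated and hands the lookup to the fallback chain.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory catalog, used only for this illustration."""

    def __init__(self, catalog):
        super().__init__()  # no .mo file is parsed
        self._catalog = dict(catalog)

    def gettext(self, message):
        translated = self._catalog.get(message)
        # Treat a missing, empty, or identical entry as "not translated".
        if not translated or translated == message:
            # NullTranslations.gettext consults the fallback chain and
            # returns the message itself when no fallback is set.
            return super().gettext(message)
        return translated


msgid = "Титул должен быть уникальным"
es = DictTranslations({msgid: msgid})  # untranslated entry, stored as itself
en = DictTranslations({msgid: "Title should be unique"})
es.add_fallback(en)

print(es.gettext(msgid))           # Title should be unique
print(es.gettext("Нет перевода"))  # Нет перевода
```

One trade-off of this sketch: a translation that is legitimately identical to its msgid is also passed down the chain, so filtering at compile time, as in the patch above, may still be the cleaner choice.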

Related

How to disable fuzzy on django translations?

I don't want to use the fuzzy tag. Is that possible?
For example:
When I add new sentence or word translations, the fuzzy flag is usually added to them automatically, but I don't like it.
#: frontend/src/components/language_consts.js:74
#, fuzzy
#| msgid "Patient Address"
msgid "Patient's address?"
msgstr "Adresse du doctor"
This is probably because of the software you use to translate your strings. The fuzzy flag means that the translation needs reviewing. Mark the translations as reviewed and it should disappear.

How to create a word docx using python docx in other than english?

I am building a program that creates printed output from Python code, and the final output contains text in another language (Sinhala). I want to use python-docx to save this output into a Word document. How do I write to Word in another language?
My aim is to produce a report-making program in another language (Sinhala). I take all user input from widgets and have managed to print the resulting lines in that language in Python.
Now I want to write these lines into a Word file in Sinhala.
from docx import Document

a = "කණ්ඩියේ උස මීටර් 5.0 ක් පළල මීටර් 2.0 හා දිග මීටර් 2.0 ක් පමණ වන කොටසක් අස්ථාවර වී"
document = Document()
document.add_heading("python word doc")
document.add_paragraph(a)
document.save('****\\report.docx')
When I use English, the code does the job. But for Sinhala, I'm not sure how to do that.
I get the following error message for Sinhala:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The error code you're seeing is not directly related to the language. The only thing Word knows about language is which spelling dictionary to use. Otherwise its text is just an arbitrary sequence of unicode characters.
What I suspect is that the Unicode encoding of the Sinhala strings you're trying to write is not UTF-8. The other possibility is that the string contains some control characters (as mentioned in the error message), particularly the vertical-tab (VT, 0xB or decimal 11) which can arise in copy and paste from PowerPoint.
This latter one is easier to check for, so perhaps start there.
import re

def sanitize_str(s):
    # Strip all C0/C1 control characters (this also removes tabs and newlines).
    control_chars = "\x00-\x1f\x7f-\x9f"
    control_char_re = re.compile("[%s]" % control_chars)
    return control_char_re.sub("", s)

document.add_paragraph(sanitize_str(a))
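Before stripping anything, it can help to see which characters are actually the problem. The sketch below is one possible way to list them; find_control_chars and XML_SAFE_CONTROLS are hypothetical names chosen for this example. It assumes that tab, line feed and carriage return should be left alone because XML 1.0 permits them; whether python-docx treats them the way you want is worth checking separately.

```python
import unicodedata

# Tab, line feed and carriage return are legal in XML 1.0 text.
XML_SAFE_CONTROLS = {"\t", "\n", "\r"}


def find_control_chars(s):
    """Return (index, code point) pairs for control characters XML rejects."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(s)
        if unicodedata.category(ch) == "Cc" and ch not in XML_SAFE_CONTROLS
    ]


sample = "report line\x0b with a vertical tab\n"
print(find_control_chars(sample))
```

An empty list for the Sinhala string would suggest that control characters are not the cause and that the encoding of the input deserves a closer look.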

Detecting Arabic characters in regex

I have a dataset of Arabic sentences, and I want to remove non-Arabic characters or special characters. I used this regex in python:
text = re.sub(r'[^ء-ي0-9]',' ',text)
It works perfectly, but in some sentences (4 cases from the whole dataset) the regex also removes the Arabic words!
I read the dataset using pandas (a Python package) like this:
train = pd.read_excel('d.xlsx', encoding='utf-8')
Just to show you in a picture, I tested on Pythex site:
What is the problem?
------------------ Edited:
The sentences in the example:
انا بحكي رجعو مبارك واعملو حفلة واحرقوها بالمعازيم ولما الاخوان يروحو
يعزو احرقو العزا -- احسنلكم والله #مصر
ﺷﻔﻴﻖ ﺃﺭﺩﻭﻏﺎﻥ ﻣﺼﺮ ..ﺃﺣﻨﺍ ﻧﺒﻘﻰ ﻣﻴﻦ ﻳﺎ ﺩﺍﺩﺍ؟ #ﻣﺴﺨﺮﺓ #ﻋﺒﺚ #EgyPresident #Egypt #ﻣﻘﺎﻃﻌﻮﻥ لا يا حبيبي ما حزرت: بشار غبي بوجود بعثة أنان حاب يفضح روحه انه مجرم من هيك نفذ المجزرة لترى البعثة اجرامه بحق السورين
Those incorrectly included characters are not in the common Unicode range for Arabic (U+0621..U+064A), but are "hardcoded" as their initial, medial, and final forms.
Comparable to capitalization in Latin-based languages, but more strict than that, Arabic writing indicates both the start and end of words with a special 'flourish' form. In addition it also allows an "isolated" form (to be used when the character is not part of a full word).
This is usually encoded in a file as 'an' Arabic character and the actual rendering in initial, medial, or final form is left to the text renderer, but since all forms also have Unicode codepoints of their own, it is also possible to "hardcode" the exact forms. That is what you encountered: a mix of these two systems.
Fortunately, the Unicode ranges for the hardcoded forms are also fixed values:
Arabic Presentation Forms-A is a Unicode block encoding contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The presentation forms are present only for compatibility with older standards such as codepage 864 used in DOS, and are typically used in visual and not logical order.
(https://en.wikipedia.org/wiki/Arabic_Presentation_Forms-A)
and their ranges are U+FB50..U+FDFF (Presentation Forms A) and U+FE70..U+FEFC (Presentation Forms B). If you add these ranges to your exclusion set, the regex will no longer delete these texts:
[^ء-ي0-9ﭐ-﷿ﹰ-ﻼ]
Depending on your browser and/or editor, you may have problems with selecting this text to copy and paste it. It may be more clear to explicitly use a string specifying the exact characters:
[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]
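As a quick check of the difference, here is a small sketch in Python 3. The names NARROW, WIDE and keep_arabic are made up for this example, and the sample consists of three letters written with Presentation Forms-B code points.

```python
import re

# The original class: basic Arabic letters and ASCII digits only.
NARROW = re.compile(r"[^\u0621-\u064a0-9]")
# The extended class: also keeps Presentation Forms-A and Forms-B.
WIDE = re.compile(r"[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]")


def keep_arabic(text):
    """Replace every character outside the extended class with a space."""
    return WIDE.sub(" ", text)


# Three letters encoded as presentation forms (U+FEE3, U+FEBC, U+FEAE).
sample = "\ufee3\ufebc\ufeae #Egypt"

print(repr(NARROW.sub(" ", sample)))  # the letters are wiped out
print(repr(keep_arabic(sample)))      # the letters survive
```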
I made some attempts on Pythex and found this (with help from "Regular Expression Arabic characters and numbers only"): [\u0621-\u064A0-9], which catches almost all non-Arabic characters. For an unknown reason, this doesn't catch 'y', so you have to add it yourself: [\u0621-\u064A0-9y]
This can catch all non-Arabic characters. For special characters, I'm sorry, but I found nothing except adding them inside the class: [\u0621-\u064A0-9y#\!\?\,]

Django French Translation - how to handle single quotes in translation strings?

I am using Python 3.5.2 and Django 1.10.
I have received the French translation .po file and can run the compilemessages command without receiving any errors.
However, when I run the site, many pages refuse to load.
I suspect that this is because the French translation .po file contains many single quotes (') in the translation strings.
For example,
#: .\core\constants\address_country_style_types.py:274
msgid "Ascension Island"
msgstr "Île de l'Ascension"
I remember reading somewhere (but cannot find that reference anywhere) that single quotes must have either a forward slash or a backslash before them. So I tried that, but when I ran the compilemessages command, I got this error message:
C:\Users\me\desktop\myapp\myapp\locale\fr\LC_MESSAGES\django.po:423:18: invalid control sequence
So how do I escape the single quotes in the French strings?
Here is the header of my French language .po file:
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2017-05-04 12:55+1000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"
I am unsure what is the cause of this issue (maybe that the translator somehow corrupted the file?).
However, a workaround is instead of using the standard single quotation mark ', I have used this single quotation mark (taken from symbols in MS Word):
′
I am yet to check this with the French translator, but it looks and works OK.
I hope this helps someone.
The correct way is to "escape" the single quote; however, you need to know the end point that consumes the text. You found this out with the backslash, as in:
L\'Ascension
Trust me, nobody who is French will like seeing the backquote. Back in the DOS days of the '90s, there was almost no visual difference. Now, with fonts, it gets ugly.
Since you're producing for the web, use an HTML replacement, like &apos;
See this article:
Why shouldn't `&apos;` be used to escape single quotes?
The solution is
#: .\core\constants\address_country_style_types.py:274
msgid "Ascension Island"
msgstr "Ile de l‘Ascension"
It works, even if it will be used in some JavaScript. Don't use the numeric character reference (such as &#39;): it will not work inside form fields, it will not be rendered, and you will see the ugly number. I have already tested all this.
As I said in the comments, beginning a word with an uppercase accented letter is not recommended. If you put Île and then sort the list of countries, the Î character will come after the Z and will not be sorted in a natural order, as you would expect.
This is another problem with Python's sorting capabilities. By default it only compares the code point numbers of the letters, and Î has code 206, so it comes after Z, which is 90.
Maybe Python provides a solution to this, but I haven't found one yet. If someone has found it, I would be glad to know.
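One locale-independent approximation, offered only as a sketch, is to sort on a key with the accents stripped. accent_insensitive_key is a hypothetical helper written for this example. It is not a full French collation; locale.strxfrm (with a French locale installed) or an ICU binding would be closer to that.

```python
import unicodedata


def accent_insensitive_key(name):
    """Sort key: the name without combining accents, case-folded."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original name is kept as a tie-breaker for otherwise equal keys.
    return stripped.casefold(), name


countries = ["Zambie", "\u00cele de l'Ascension", "Inde", "Italie"]

print(sorted(countries))
# ['Inde', 'Italie', 'Zambie', "Île de l'Ascension"]
print(sorted(countries, key=accent_insensitive_key))
# ["Île de l'Ascension", 'Inde', 'Italie', 'Zambie']
```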
I'm a French speaker, so are most of my users.
Very annoying bug.
The normal Django escaping techniques (\' or format_html(my_translated_string)) do not work for me either.
I have used ′ instead of ' and it works OK: the compilemessages command runs and the HTML node renders fine.
It is, however, not very elegant or robust, as any future message needs to take this into account, and the character ´ is not very common.
I found another, more robust solution:
escaping through template filters.
In the HTML template:
<h5 class="modal-title">{{help_message_body|escape}}</h5>
and in JavaScript:
modal.find('.modal-message').html('<h5 class="modal-title">{{help_message_body|escapejs}}</h5>')

Translating content in filesystem for a Plone product

I'm trying to get certain strings in a .py file translated, using the i18n machinery. Translating .pt files is not a problem, but whenever I try to translate using _('Something') in Python code on the filesystem, it always gives English text (which is the default) instead of the Norwegian text that should be there. So I can see output from python code in English, while other Page Templates bits are correctly translated.
Is there a how-to or something similar for this?
Is the domain name used for _('Something') the same as what you use in the Norwegian .po file that has the translation? They should be the same, so do not use 'plone' in one case and 'my.domain' in the other.
Also, the call to the underscore function does not in itself translate the string; it only creates a string that can be translated. If this string ends up on its own directly in a template, you should add i18n:translate="" to that tag, probably with a matching i18n:domain.
Otherwise you should manually call the translate method, as in http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html#manually-translated-message-ids. Read the Plone 4 migration guide for some differences between Plone 3 and 4 that might bite you here.
If you are looking for how-tos, you should probably read these docs:
http://plone.org/documentation/kb/i18n-for-developers
http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html
Bye,
Giacomo
Be aware that _() does not translate the text when it is called; it returns a Message object, which is translated when it is rendered in a template.
That means:
Do not concatenate Message objects. "text %s" % _('translation') will not work, and neither will "text" + _('translation').
If you do not send the text to the browser through a template, it may not be translated. For example, if you generate an email you need to translate the Message object manually.
