Working with Urdu and Arabic names in Python - python

I'm trying to work with Urdu text but am unable to get the right output.
name = '\xd9\x87\xd9\x84\xd9\x84\xd8\xa7 \xd8\xa7\xd9\x85\xd8\xa7\xd9\x86'
print name
OUTPUT
هللا امان
DESIRED OUTPUT
امان اللہ
please advise.

I see two main issues with your snippet.
The first is that Unicode has dedicated code points for a few whole-word Arabic ligatures, and the word you are trying to print, اللہ, has one: ARABIC LIGATURE ALLAH ISOLATED FORM, which is U+FDF2, or 0xEF 0xB7 0xB2 in UTF-8.
If you write it as a sequence of individual characters instead, you may not get the representation you want.
Second, the font used by your terminal (or whatever application renders the text) must support the glyph, and you should ensure that the text direction is switched to right-to-left.
Here is an example from the online Python shell:
>>> print(u"\uFDF2")
ﷲ
Since this shell is not configured for right-to-left text, you can see that it prints the output left to right.
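As a rough check of the code point details above, the sketch below (assuming Python 3) looks up the ligature's Unicode name and UTF-8 bytes, then lists the code points of the string from the question in their stored order. The variable names are only illustrative.

```python
import unicodedata

# The single-code-point ligature mentioned in the answer.
ligature = "\uFDF2"
print(unicodedata.name(ligature))   # official Unicode character name
print(ligature.encode("utf-8"))     # its UTF-8 byte sequence

# The byte string from the question, decoded as UTF-8.
raw = b"\xd9\x87\xd9\x84\xd9\x84\xd8\xa7 \xd8\xa7\xd9\x85\xd8\xa7\xd9\x86"
text = raw.decode("utf-8")

# List each character in logical (stored) order, independent of how a
# terminal chooses to display it.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
```

Inspecting the stored order this way separates what the string contains from how a given terminal or font renders it.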

Related

Write both right-to-left and left-to-right text in Python

I wrote code that writes a string to a text file. The string contains
both a right-to-left language (like Hebrew) and a left-to-right language (like English).
I used Unicode control characters to set the direction,
u'\u2067' + hebrew + u'\u2069', surrounding the Hebrew part, but it is not working.
When I run the code, the printed output looks correct,
but when I look at the text file, the positions of the fields have changed.
I want the text file to look the same as the printed output.
How can I make the text file match?
A .txt file is just plain text. There is no formatting stored in the file. Unlike an MS Word document, which saves style formatting and restores it when you load the document, a .txt file holds only characters.
You cannot style it the way you style HTML with CSS.
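To illustrate that the file holds only characters, the sketch below (assuming Python 3) wraps a Hebrew word in the same U+2067 and U+2069 controls from the question, writes the line as UTF-8, and reads it back. The helper wrap_rtl, the sample word, and the file name are hypothetical. The file stores the characters in the same logical order as the in-memory string, so any visual difference likely comes from how the viewing program lays out bidirectional text.

```python
import os
import tempfile

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def wrap_rtl(s):
    """Surround s with the isolate controls used in the question (hypothetical helper)."""
    return RLI + s + PDI

hebrew = "\u05e9\u05dc\u05d5\u05dd"  # sample Hebrew word, written with escapes
line = "name: " + wrap_rtl(hebrew) + " id: 42"

# Write and re-read the line using an explicit UTF-8 encoding.
path = os.path.join(tempfile.mkdtemp(), "mixed.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(line)
with open(path, encoding="utf-8") as f:
    read_back = f.read()

# The stored characters are identical to the in-memory string.
print(read_back == line)
```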

Unexpected question mark when trying to regex replace

I run this file test.py in my Sublime venv Python build system:
import re
text = "skull ☠️..."
print(text)
print(repr(text))
x = re.sub(r' *[\u2600-\u26FF]', r'', text)
print(x)
print(repr(x))
And see the output in Sublime window as expected:
skull ☠️...
'skull ☠️...'
skull️...
'skull️...'
But when I run the same file from the command line in Windows 10, I get strange question marks in the output.
In Google Colab it also works as expected.
The result contains an invisible symbol at index 5.
What's happening here? How can I remove ☠️ without any question marks or zero width symbols on its place?
To identify the character that is left, you can paste it into an online Unicode character inspector.
The remaining character is U+FE0F VARIATION SELECTOR-16 (VS16, the emoji variation selector),
and you can match or replace it with \uFE0F.
Combined with your current pattern: [\u2600-\u26FF\uFE0F]
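A minimal sketch of the extended pattern applied to the text from the question, assuming Python 3. The emoji is written with escapes so the invisible selector is explicit.

```python
import re

# "skull", a space, U+2620 SKULL AND CROSSBONES, U+FE0F VARIATION SELECTOR-16, "..."
text = "skull \u2620\uFE0F..."

# Same character range as in the question, extended with the variation selector.
pattern = re.compile(r" *[\u2600-\u26FF\uFE0F]")

cleaned = pattern.sub("", text)
print(ascii(cleaned))
```

The pattern matches the symbol (with any preceding spaces) and then matches the selector on its own, so neither is left behind.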
The Windows command prompt is a text user interface. So why do you want to output graphic symbols like emojis on a pure text interface at all? The font configured for drawing characters and symbols into a Windows console window must support the characters and symbols you want to see in the console window.
So you simply have to add a custom font to cmd that supports drawing this emoji. Here is a link that explains how to add custom fonts to the command prompt: https://www.maketecheasier.com/add-custom-fonts-command-prompt-windows10/
The default Windows console host (conhost.exe) has limited Unicode support and generally cannot display emoji. The newer Windows Terminal can, so run the code in Windows Terminal (wt.exe), which has full Unicode support.
As per this answer: does all windows command prompt not support emoji?
This article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/, will help you understand encodings and character sets.
I hope I could help you

How to create a Word docx using python-docx in a language other than English?

I am building a program that creates printed output from Python code. The final output contains text in another language (Sinhala). I want to use python-docx to save this output to a Word document. How do I write text in another language to Word?
My aim is a report-generating program in another language (Sinhala). I take all user input from widgets and have managed to print the resulting lines in that language in Python.
Now I want to write these lines to a Word file in Sinhala.
from docx import Document  # provided by the python-docx package
a = "කණ්ඩියේ උස මීටර් 5.0 ක් පළල මීටර් 2.0 හා දිග මීටර් 2.0 ක් පමණ වන කොටසක් අස්ථාවර වී"
document = Document()
document.add_heading("python word doc")
document.add_paragraph(a)
document.save('****\\report.docx')
When I use English, the code does the job, but I'm not sure how to do this for Sinhala.
I get the following error message for Sinhala:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The error message you're seeing is not directly related to the language. The only thing Word knows about language is which spelling dictionary to use. Otherwise its text is just an arbitrary sequence of Unicode characters.
What I suspect is that the Unicode encoding of the Sinhala strings you're trying to write is not UTF-8. The other possibility is that the string contains some control characters (as mentioned in the error message), particularly the vertical-tab (VT, 0xB or decimal 11) which can arise in copy and paste from PowerPoint.
This latter one is easier to check for, so perhaps start there.
import re
def sanitize_str(s):
    control_chars = "\x00-\x1f\x7f-\x9f"
    control_char_re = re.compile("[%s]" % control_chars)
    return control_char_re.sub("", s)
document.add_paragraph(sanitize_str(a))
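Before stripping anything, it can help to see which control characters are actually present. The sketch below, assuming Python 3, lists the position and code point of every character in Unicode category Cc. The helper name find_control_chars is hypothetical and not part of python-docx.

```python
import unicodedata

def find_control_chars(s):
    """Return (index, code point) pairs for characters in category Cc.

    Note that tab, newline, and carriage return are also category Cc,
    so they are reported here as well.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(s)
        if unicodedata.category(ch) == "Cc"
    ]

# A made-up sample containing a vertical tab, as can happen after copy and paste.
sample = "abc\x0bdef"
print(find_control_chars(sample))
```

An empty list for the Sinhala string would suggest the control-character explanation does not apply and the encoding is the thing to look at next.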

Advanced input in python

I want to receive some information from a user in the following way:
My score is of 10 - is already printed
Between 'is' and 'of' there is an empty space for the user's input, so the input is entered in the middle rather than at the end (as with a plain input()). While the user is typing, the text appears between 'is' and 'of'.
Is there any way to do this?
One way to get something close to what you want, if your terminal supports ANSI escape codes:
x = input("My score is \x1b[s of 10\x1b[u")
\x1b is the escape character. The escape characters themselves are not displayed on the screen; instead, they introduce byte sequences that the terminal interprets as instructions. ESC[s tells the terminal to remember where the cursor is at the moment. ESC[u tells the terminal to move the cursor back to the last-remembered position.
(The rectangle is the cursor in an unfocused window.)
Using a library that abstracts away the exact terminal you are using is preferable, but this gives you an idea of how such libraries interact with your terminal: it's all just bytes written to standard output.
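The same prompt can be built from named pieces, which makes the role of each sequence easier to see. This is a minimal sketch, assuming a terminal that honours ESC[s and ESC[u (not all do). The names build_inline_prompt and ask_score are hypothetical.

```python
ESC = "\x1b"
SAVE_CURSOR = ESC + "[s"      # ask the terminal to remember the cursor position
RESTORE_CURSOR = ESC + "[u"   # ask the terminal to jump back to that position

def build_inline_prompt(before, after):
    """Return a prompt that prints `before` and `after`, then moves the
    cursor back to the gap between them."""
    return before + SAVE_CURSOR + after + RESTORE_CURSOR

def ask_score():
    """Interactive use: the typed text starts right after 'My score is '."""
    return input(build_inline_prompt("My score is ", " of 10"))

# Show the raw bytes-as-text that would be sent to the terminal.
prompt = build_inline_prompt("My score is ", " of 10")
print(repr(prompt))
```

Calling ask_score() in a supporting terminal behaves like the one-line input() call above. The typed characters overwrite whatever follows the saved position, so a wider gap in `after` leaves more room.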
If you are working in a console, consider using the curses library. It is part of the standard library on Linux; on Windows it has to be installed separately, for example from http://www.lfd.uci.edu/~gohlke/pythonlibs/#curses
This library gives you full control over the console. The following question answers yours:
How to input a word in ncurses screen?

PDFtotext - whitespace showing as aacute on commandline

I am extracting text using python from a textfile created from pdf using pdftotext. It is one of 2000 files and in this particular one, a line of keywords ends in EU. The remainder of the line is blank to the naked eye and so is the following line.
The program normally strips off any trailing blanks at the end of a line and ignores the subsequent blank line.
In this instance, it is keeping the whitespace, which is seen when it is printed out to a text file between "EU. " and similarly in HTML (Simile Exhibit).
I also printed to the command line, and there I see a string of aacute characters.
I thought the obvious way to deal with this was to search and replace the aacute. I've tried to do that with a compile statement and I've played with permutations of decoding the incoming text.
Oddly though, when I print "\255" I don't get an aacute, I get an o grave.
It seems likely with this odd combination of errors that I have misunderstood something fundamental. Any tips of how to begin unravelling this?
Many thanks.
The first tip is not to print wildly to all possible output mechanisms using various unstated encodings. Find out exactly what you have got. Do this:
print repr(the_line_with_the_problem) # Python 2.x
print(ascii(the_line_with_the_problem)) # Python 3.x
and edit your question and copy/paste the result.
Second tip: When asking for help, give information about your environment:
What version of Python? What version of what operating system?
Also show locale-related info; the following example is from my computer running Python 2.7 in a Windows 7 Command Prompt window:
>>> import sys, locale
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_AU', 'cp1252')
>>>
Third tip: Don't use your own jargon ... the concepts "Simile Exhibit", "printed to the command line", and "compile statement" need explanation.
What is the relevance of "\255"? Where did you get that from?
Wild guesses while waiting for some facts to emerge:
(1) The offending character is U+00A0 NO-BREAK SPACE aka NBSP which appears in your text as "\xA0" and when sent to stdout in a Western European locale on Windows using a Command Prompt window would be treated as being encoded in cp850 and thus appear as a-acute. How this could be transmogrified into o-grave is a mystery.
(2) "\255" == \xAD implies the offending character is U+00AD SOFT HYPHEN, but why this would be seen as o-grave is a mystery, and it's not "whitespace"; it shouldn't be shown at all, and if it is shown it should be as a hyphen/minus-sign, not a space.
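Guess (1) can be reproduced in a few lines. The sketch below assumes Python 3 and a made-up line ending in NBSP characters; it shows how the byte 0xA0 turns into a-acute when interpreted as cp850, confirms what "\255" denotes, and shows that str.strip() on decoded text removes NBSP.

```python
import unicodedata

# Hypothetical line: keywords ending in "EU." followed by no-break spaces.
line = "EU.\u00a0\u00a0\u00a0"
print(ascii(line))                    # makes the invisible characters visible

print(unicodedata.name("\u00a0"))     # name of the suspected character
print(unicodedata.name("\255"))       # octal escape 255 is code point 0xAD

# Simulate a cp850 console: the same bytes, interpreted with the wrong code page.
as_bytes = line.encode("latin-1")
shown = as_bytes.decode("cp850")
print(ascii(shown))

# On a decoded (Unicode) string, strip() treats NBSP as whitespace.
cleaned = line.strip()
print(ascii(cleaned))
```

On Python 2 byte strings, strip() may leave the 0xA0 bytes in place, which would be consistent with the trailing blanks surviving in the question.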
