extract some arabic/persian (unicode) words with regex using python [duplicate]

This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 2 years ago.
I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using python re library.
example (the word "شرکت" means "company" and we want to extract what the company name is):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer], and replacing "university" with "شرکت" in that example would be fine, but I don't understand how to match the keyword by regex with Arabic Unicode, since it can't be used this way:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None

Python 2 does not treat plain string literals as Unicode (for example when you paste non-ASCII letters or put a \u escape in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic is stored as its encoded bytes, which differ from the Unicode code points, and the string on the right keeps its literal backslashes, since Python 2 does not recognize \u as an escape in a plain (byte) string literal.
Another option is to import from the future. In Python 3 every string literal is parsed as Unicode, which makes the u"..." prefix largely redundant:
from __future__ import unicode_literals
With this import, string literals are parsed as Unicode without the u"" prefix.
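On Python 3, where string literals are already Unicode, the extraction from the question could look roughly like the sketch below. The names `KEYWORD`, `PATTERN` and `extract_company_name` are illustrative assumptions, not part of the linked answer, and the pattern simply captures whatever follows the keyword on the same line.

```python
import re

# The keyword meaning "company", written with the escapes from the question
# (the same text as the literal "شرکت").
KEYWORD = "\u0634\u0631\u06A9\u062A"

# Capture everything after the keyword and its trailing whitespace,
# up to the end of the line.
PATTERN = re.compile(re.escape(KEYWORD) + r"\s+(.+)")


def extract_company_name(text):
    """Return the text that follows the keyword, or None if it is absent."""
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None


print(extract_company_name("شرکت تست گستران خلیج فارس"))
```

If the real input can contain several names per line, the greedy `(.+)` would need a tighter stop condition; that depends on data the question does not show.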

Related

Emoji to unicode [duplicate]

This question already has answers here:
Escaped Unicode to Emoji in Python
(1 answer)
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I was looking at https://r12a.github.io/app-conversion/ and I see that they have a "JS/Java/C" section. I was wondering if anyone had the code for that in python. I can't seem to find it. Thanks!
Edit: code
b = '😀'
txt = b.encode('utf-8')

From How to work with surrogate pairs in Python? (linked from the duplicate Escaped Unicode to Emoji in Python):
If you see the Python string '\ud83d\ude4f' (2 characters), then there is a bug upstream. Normally, you shouldn't get such a string. If you get one and you can't fix the upstream code that generates it, you can repair it using the surrogatepass error handler:
>>> "\uD83D\uDE00".encode('utf-16', 'surrogatepass').decode('utf-16')
'😀'
Original Answer
Perhaps you're looking for ord()?
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
>>> hex(ord("😀"))
'0x1f600'
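For the "JS/Java/C" style output the question asks about, one possible approach is to encode the text as UTF-16 and print each 16-bit code unit as a \uXXXX escape. This is only a sketch: `to_utf16_escapes` is a hypothetical helper name, and it escapes every character, including plain ASCII, which may differ from what the converter page does.

```python
def to_utf16_escapes(text):
    """Escape every UTF-16 code unit of text as a \\uXXXX sequence."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = (
        int.from_bytes(data[i:i + 2], "big")
        for i in range(0, len(data), 2)
    )
    return "".join("\\u{:04X}".format(unit) for unit in units)


# "\U0001F600" is the grinning face emoji from the question.
print(to_utf16_escapes("\U0001F600"))  # \uD83D\uDE00 (a surrogate pair)
print(hex(ord("\U0001F600")))          # 0x1f600 (the single code point)
```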

How to ignore backslashes as escape characters in Python? [duplicate]

This question already has answers here:
How to write string literals in Python without having to escape them?
(6 answers)
Closed 7 months ago.
I know this is similar to many other questions regarding backslashes, but this deals with a specific problem that has yet to have been addressed. Is there a mode that can be used to completely eliminate backslashes as escape characters in a print statement? I need to know this for ascii art, as it is very difficult to find correct positioning when all backslashes must be doubled.
print('''
/\\/\\/\\/\\/\\
\\/\\/\\/\\/\\/
''')

Prefix the string with r (for "raw") and backslashes will be taken literally, with no escape processing:
>>> # Your original
>>> print('''
... /\\/\\/\\/\\/\\
... \\/\\/\\/\\/\\/
... ''')
/\/\/\/\/\
\/\/\/\/\/
>>> # as a raw string instead
>>> print(r'''
... /\\/\\/\\/\\/\\
... \\/\\/\\/\\/\\/
... ''')
/\\/\\/\\/\\/\\
\\/\\/\\/\\/\\/
Raw strings are often used for regular expressions, where it gets tedious to double-escape backslashes. There are a few other prefix letters, including f (for format strings, which behave differently), b (a literal bytes object instead of a string), and u, which designated Unicode strings in Python 2 and is accepted but has no effect in Python 3.
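With a raw string the ASCII art can be typed with single backslashes, exactly as it should appear. A small sketch comparing it with the doubled-backslash form (the variable names are just for illustration):

```python
# Raw triple-quoted string: backslashes are kept exactly as typed.
art = r"""
/\/\/\/\/\
\/\/\/\/\/
"""

# The same text written as a normal string with every backslash doubled.
escaped = "\n/\\/\\/\\/\\/\\\n\\/\\/\\/\\/\\/\n"

print(art)
print(art == escaped)  # True
```

One limitation to keep in mind: a raw string literal cannot end with a single backslash immediately before the closing quotes.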

Details of Unicode Names \N Documented? [duplicate]

This question already has answers here:
List of unicode character names
(7 answers)
Closed 2 years ago.
It appears, based on a urwid example, that u'\N{HYPHEN BULLET}' will create a Unicode character that is a hyphen intended for use as a bullet.
The names for Unicode characters seem to be defined at fileformat.info, and some aspects of using Unicode in Python are covered in the HOWTO documentation, though there is no mention of the \N{} syntax.
If you pull all these docs together you get the idea that the constant u"\N{HYPHEN BULLET}" creates a ⁃
However, this is all a theory based on pulling all this data together. I can find no documentation for "\N{}" in the Python docs.
My question is whether my theory of operation is correct and whether it is documented anywhere?

Not every gory detail can be found in a how-to. The table of escape sequences in the reference manual includes:
Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database (Unicode only)

You are correct that u"\N{CHARACTER NAME}" produces a valid Unicode character in Python.
It is not documented much in the Python docs, but after some searching I found a reference to it on effbot.org:
http://effbot.org/librarybook/ucnhash.htm
The ucnhash module
(Implementation, 2.0 only) This module is an implementation module,
which provides a name to character code mapping for Unicode string
literals. If this module is present, you can use \N{} escapes to map
Unicode character names to codes.
In Python 2.1, the functionality of this module was moved to the
unicodedata module.
Checking the documentation for unicodedata shows that the module is using the data from the Unicode Character Database.
unicodedata — Unicode Database
This module provides access to the Unicode Character Database (UCD)
which defines character properties for all Unicode characters. The
data contained in this database is compiled from the UCD version
9.0.0.
The full data can be found at: https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt
Each line has the structure HEXVALUE;CHARACTER NAME;etc., so you could use this data to look up characters.
For example:
# 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
>>> u"\N{LATIN CAPITAL LETTER A}"
'A'
# FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;
>>> u"\N{HALFWIDTH KATAKANA LETTER SA}"
'ｻ'
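To make the file structure concrete, here is a minimal sketch that splits one UnicodeData.txt line (the LATIN CAPITAL LETTER A line quoted above) on semicolons and checks it against the unicodedata module. The variable names are illustrative, and only the first three fields plus the lowercase mapping are interpreted.

```python
import unicodedata

# One record from UnicodeData.txt, as quoted in the answer.
line = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

fields = line.split(";")
code_point = int(fields[0], 16)   # hexadecimal code point
name = fields[1]                  # character name
category = fields[2]              # general category, e.g. Lu
char = chr(code_point)

print(char, name, category)

# The unicodedata module exposes the same information.
print(unicodedata.name(char) == name)
print(unicodedata.category(char) == category)
print(chr(int(fields[13], 16)) == char.lower())  # simple lowercase mapping
```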

The \N{} syntax is documented in the Unicode HOWTO, at least.
The names are documented in the Unicode standard, such as:
http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt
The unicodedata module can look up a name for a character:
>>> import unicodedata as ud
>>> ud.name('A')
'LATIN CAPITAL LETTER A'
>>> print('\N{LATIN CAPITAL LETTER A}')
A
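The reverse direction, from name to character, is available through `unicodedata.lookup`, which resolves names against the same database as the \N{} escape. A short Python 3 sketch using the HYPHEN BULLET character from the question:

```python
import unicodedata

# Resolve a character by name at run time.
bullet = unicodedata.lookup("HYPHEN BULLET")

print("U+{:04X}".format(ord(bullet)))   # U+2043
print(unicodedata.name(bullet))          # HYPHEN BULLET

# The \N{} escape is resolved when the literal is compiled.
print(bullet == "\N{HYPHEN BULLET}")     # True

# Unknown names raise KeyError.
try:
    unicodedata.lookup("NO SUCH CHARACTER NAME")
except KeyError as exc:
    print("lookup failed:", exc)
```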

utf-8 encoding and greek characters [duplicate]

This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
How to output a utf-8 string list as it is in python?
(4 answers)
Closed 6 years ago.
While I managed to get all the data that I need and save it to a CSV file, the output I get is in UTF-8 format, which is normal (correct me if I'm wrong).
TBH, I've already "played" with the .encode() and .decode() options without any results.
here is my code
brands=[name.text for name in Unibrands]
here is the output
u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae'
And this is the desired output
u'Spirulina Ελληνική'

That string is already fine; you're seeing the repr of it, which escapes certain characters because it is intended to be safe to copy and paste directly into Python source code (which in Python 2.x means it may contain only printable ASCII characters). For example, \u0395 represents the code point U+0395 GREEK CAPITAL LETTER EPSILON. You're seeing this form because printing a list (or other container) always shows the repr of its contents. If you instead print the string directly, you should see the appropriate glyphs instead of the escaped form:
>>> print(u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae')
Spirulina Ελληνική
You could also consider upgrading to a newer Python version; in Python 3 the repr no longer escapes printable non-ASCII letters such as these, and Unicode characters are accepted in source files by default.
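The difference between the escaped repr and the printed string can be reproduced on Python 3 with the built-in `ascii()`, which gives an ASCII-only repr similar to what Python 2 showed (without the u prefix). A small sketch, with the list contents taken from the question:

```python
brands = ["Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae"]

# ASCII-only repr: non-ASCII letters appear as \uXXXX escapes.
escaped = ascii(brands[0])
print(escaped)

# Python 3's normal repr keeps printable non-ASCII letters,
# so printing the list already shows the Greek text.
print(brands)

# Printing the item itself shows the glyphs with no quotes at all.
print(brands[0])
```

In both cases the underlying string is identical; only its displayed form differs.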

Unicode Regex in Python 3 (from Python 2 Code)

I'm trying to convert my Python 2 script to Python 3. How do we do Regex with Unicode?
This is what I had in Python 2, which works. It replaces straight quotes with « and »:
text = re.sub(ur'"(.*?)"', ur'«\1»', text)
I have some really complex ones, which the "ur" prefix made so easy. But it doesn't work in Python 3:
text = re.sub(ur'ه\sایم([\]\.،\:»\)\s])', ur'ه\u200cایم\1', text)

All strings in Python 3 are Unicode by default. Just remove the u and you should be fine.
In Python 2, strings are byte strings by default, so we use u to mark them as Unicode strings.

Since Python 3.0, the language features a str type that contains
Unicode characters, meaning any string created using "unicode rocks!",
'unicode rocks!', or the triple-quoted string syntax is stored as
Unicode.
The Unicode HOWTO will help you.
So just do what you did in Python 2 and it will work, with nothing extra needed.
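A Python 3 port of the two substitutions might look like the sketch below. One caveat, as far as I know: Python 2's ur'' literals still expanded \u escapes, while a Python 3 raw string leaves \u200c as a backslash sequence, and recent versions of re reject that in a replacement string as a bad escape. The sketch therefore inserts the real zero-width non-joiner character. The helper and constant names are illustrative assumptions.

```python
import re

ZWNJ = "\u200c"   # zero-width non-joiner, expanded by Python itself
HEH = "ه"
SUFFIX = "ایم"

# Same pattern as in the question, assembled from named pieces.
SUFFIX_RE = re.compile(HEH + r"\s" + SUFFIX + r"([\]\.،\:»\)\s])")


def fix_quotes(text):
    """Replace straight double quotes around a span with « and »."""
    return re.sub(r'"(.*?)"', r"«\1»", text)


def join_suffix(text):
    """Replace the whitespace before the suffix with a ZWNJ."""
    # The replacement contains the actual ZWNJ character plus a group reference.
    return SUFFIX_RE.sub(HEH + ZWNJ + SUFFIX + r"\1", text)


print(fix_quotes('say "hi"'))
print(ascii(join_suffix("رفته ایم.")))
```

Writing the replacement as a normal string with a doubled backslash for the group reference ('...\u200c...\\1') should work as well.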
