utf-8 encoding and greek characters [duplicate] - python

This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
How to output a utf-8 string list as it is in python?
(4 answers)
Closed 6 years ago.
While I managed to get all the data that I need as well as save it on a cv file, the output I get is in UTF-8 format, which is normal(correct me If I'm wrong)
TBH I've already "played" with the .encode() and .decode() option without any results.
here is my code
brands=[name.text for name in Unibrands]
here is the output
u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae'
And this is the desired output
u'Spirulina Ελληνική'

That string is already fine; you're seeing the repr of it, which does escape certain characters because this is intended to be safe to copy and paste directly into Python source code (which in Python 2.x means it needs to have only printable ASCII characters) - eg, \u0395 represents the codepoint U+0395 GREEK CAPITAL LETTER EPSILON. You're seeing this form of it because printing a list (or other container) always shows you the repr of its contents - if you instead print the string directly, you should see an appropriate glyph instead of the escaped form:
>>> print(u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae')
>>> 'Spirulina Ελληνική'
You could also consider upgrading to a newer Python version; Python 3.5 (and possibly earlier 3.x versions) no longer escape these letters in the repr, since Python now accepts Unicode characters in source files by default.

Related

Emoji to unicode [duplicate]

This question already has answers here:
Escaped Unicode to Emoji in Python
(1 answer)
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I was looking at https://r12a.github.io/app-conversion/ and I see that they have a "JS/Java/C" section. I was wondering if anyone had the code for that in python. I can't seem to find it. Thanks!
Edit: code
b = '😀'
txt = b.encode('utf-8')
From How to work with surrogate pairs in Python? (linked from duplicate Escaped Unicode to Emoji in Python )
If you see '\ud83d\ude4f' Python string (2 characters) then there is a bug upstream. Normally, you shouldn't get such string. If you get one and you can't fix upstream that generates it; you could fix it using surrogatepass error handler:
>>> "\uD83D\uDE00".encode('utf-16', 'surrogatepass').decode('utf-16')
'😀'
Original Answer
Perhaps you're looking for ord()?
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
>>> hex(ord("😀"))
'0x1f600'

How to make python 3 understand double backslash? [duplicate]

This question already has answers here:
How to fix "<string> DeprecationWarning: invalid escape sequence" in Python?
(2 answers)
Closed 1 year ago.
So, as SO keeps suggesting me, I do not want to replace double backslashes, I want python to understand them.
I need to copy files from a windows distant directory to my local machine.
For example, a "equivalent" (even if not) in shell (with windows paths):
cp \\directory\subdirectory\file ./mylocalfile
But python does not even understand double backslashes in strings:
source = "\\dir\subdir\file"
print(source)
$ python __main__.py
__main__.py:1: DeprecationWarning: invalid escape sequence \s
source = "\\dir\subdir\file"
\dir\subdir
ile
Is Python able to understand windows paths (with double backslashes) in order to perform file copies ?
You can try this also:
source = r"\dir\subdir\file"
print(source)
# \dir\subdir\file
You can solve this issue by using this raw string also.
What we are doing here is making "\dir\subdir\file" to raw string by using r at first.
You can visit here for some other information.
raw strings are raw string literals that treat backslash (\ ) as a literal character. For example, if we try to print a string with a “\n” inside, it will add one line break. But if we mark it as a raw string, it will simply print out the “\n” as a normal character.

extract some arabic/persian (unicode) words with regex using python [duplicate]

This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 2 years ago.
I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using python re library.
example (the word "شرکت" means "company" and we want to extract what the company name is):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer] and it would be fine to replace "university" with "شرکت" in that example but I don't understand how to find the keywords by regex with Arabic Unicode when it's not possible to use that in this way:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None
Python 2 does not default to parsing unicode literals (like when pasting unicode letters, or having a \u in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic will be translated to the actual bytes, which are different then the unicode code-points, and the Unicode string on the right will have literal backslashes since Python 2 does not recognize \u as a valid escape by default.
Another option is to import from the future - in Python 3 everything is initially parsed as unicode, making that u"..." somewhat obsolete:
from __future__ import unicode_literals
will make unicode literals be parsed correctly with no u"".

Escape sequence in python [duplicate]

This question already has answers here:
What does a leading `\x` mean in a Python string `\xaa`
(2 answers)
Closed 7 years ago.
What is the difference between \x and \u escape sequences in python? (Apart from the fact that \x uses the syntax \xXX and \u uses \uXXXX). print('\xa5') gives the output as '¥' in script mode and so does print('\u00a5'), so how is one different from the other, apart from the syntax used?
The most important difference is that \uXXXX accepts 4 hexadecimal digits and is therefore suitable for higher numbers (and therefore can be used to refer to special characters that are not in ASCII or your current code page). It can therefore only be used in unicode strings:
u'\u0123'
The older \xXX can be used in both unicode strings and str strings, but only for code points up to 255:
u'\u0123\x20'
'\x20'

how to convert u'\uf04a' to unicode in python [duplicate]

This question already has answers here:
Python unicode codepoint to unicode character
(4 answers)
Closed 1 year ago.
I am trying to decode u'\uf04a' in python thus I can print it without error warnings. In other words, I need to convert stupid microsoft Windows 1252 characters to actual unicode
The source of html containing the unusual errors comes from here http://members.lovingfromadistance.com/showthread.php?12338-HAVING-SECOND-THOUGHTS
Read about u'\uf04a' and u'\uf04c' by clicking here http://www.fileformat.info/info/unicode/char/f04a/index.htm
one example looks like this:
"Oh god please some advice ":
Out[408]: u'Oh god please some advice \uf04c'
Given a thread like this as one example for test:
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread.decode('utf8')
print u'\uf04a'
print u'\uf04a'.decode('utf8') # error!!!
'charmap' codec can't encode character u'\uf04a' in position 1526: character maps to undefined
With the help of two Python scripts, I successfully convert the u'\x92', but I am still stuck with u'\uf04a'. Any suggestions?
References
https://github.com/AnthonyBRoberts/NNS/blob/master/tools/killgremlins.py
Handling non-standard American English Characters and Symbols in a CSV, using Python
Solution:
According to the comments below: I replace these character set with the question mark('?')
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread = thread.replace(u'\uf04a', '?')
thread = thread.replace(u'\uf04c', '?')
Hope this helpful to the other beginners.
The notation u'\uf04a' denotes the Unicode codepoint U+F04A, which is by definition a private use codepoint. This means that the Unicode standard does not assign any character to it, and never will; instead, it can be used by private agreements.
It is thus meaningless to talk about printing it. If there is a private agreement on using it in some context, then you print it using a font that has a glyph allocated to that codepoint. Different agreements and different fonts may allocate completely different characters and glyphs to the same codepoint.
It is possible that U+F04A is a result of erroneous processing (e.g., wrong conversions) of character data at some earlier phase.
u'\uf04a'
already is a Unicode object, which means there's nothing to decode. The only thing you can do with it is encode it, if you're targeting a specific file encoding like UTF-8 (which is not the same as Unicode, but is confused with it all the time).
u'\uf04a'.encode("utf-8")
gives you a string (Python 2) or bytes object (Python 3) which you can then write to a file or a UTF-8 terminal etc.
You won't be able to encode it as a plain Windows string because cp1252 doesn't have that character.
What you can do is convert it to an encoding that doesn't have those offending characters by telling the encoder to replace missing characters by ?:
>>> u'who\uf04a why\uf04c'.encode("ascii", errors="replace")
'who? why?'

Categories