Replace a Unicode symbol - python

In Python, I am trying to replace a symbol in a string.
I have this string:
a = "• HELLO • HOW • ARE • AYOU"
I want to replace each "•" with ";".
I tried this, but my string was not modified:
b = a.replace("&bull;", ";")
I also tried this, which works in Python:
b = a.replace("•", ";")
but when I launch it with spark-submit, I get this error:
SyntaxError: Non-UTF-8 code starting with '\x95' in file file_test.py on line 392, but no encoding declared;
thank you for your help

The Unicode code point of • is 8226 (it is not an ASCII character), which you can find using ord("•").
Try changing the line to b = a.replace(chr(8226), ";")

The error message tells you that you need to declare an encoding in your source file. You do this by including the following comment at the beginning:
# coding=utf-8
(Either as the very first line, or as the second line behind the shebang declaration.)
Your first code doesn't work because &bull; is an HTML entity. It has nothing to do with Python. In Python, instead of using a Unicode character in code, you could also use an escape sequence to encode the value of the bullet character:
a.replace('\u2022', ';')
(U+2022 is the Unicode code point “BULLET”.)
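A minimal sketch of the escape-sequence approach, assuming Python 3. The source below contains only ASCII bytes, so it should run no matter how the file is saved or whether an encoding is declared. The name BULLET is just for illustration.

```python
import unicodedata

# "\u2022" is the bullet written as an escape, so this file stays pure ASCII.
BULLET = "\u2022"

a = BULLET + " HELLO " + BULLET + " HOW " + BULLET + " ARE " + BULLET + " AYOU"
b = a.replace(BULLET, ";")

print(b)                              # ; HELLO ; HOW ; ARE ; AYOU
print(ord(BULLET), hex(ord(BULLET)))  # 8226 0x2022
print(unicodedata.name(BULLET))       # BULLET
```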

Related

Specifying Python statements using Unicode Code Points

I am trying to understand how the Python parser handles the source code text and how it tokenizes while parsing. I have 3 statements in the same source file, essentially doing the same thing.
# -*- coding: latin-1 -*-
print("This is a Unicode List")
eval("\u0070\u0072\u0069\u006E\u0074\u0028\u0022This is a Unicode List\u0022\u0029")
\u0070\u0072\u0069\u006E\u0074\u0028\u0022This is a Unicode List\u0022\u0029
The first two statements work as expected; however, I get a syntax error for the third one:
File "...\UnicodeInput.py", line 4
\u0070\u0072\u0069\u006E\u0074\u0028\u0022This is a Unicode List\u0022\u0029
^
SyntaxError: unexpected character after line continuation character
Is there a way for me to write Python statements using their equivalent code points?
No, Python won't accept escape sequences as source code; they only make sense inside strings. Outside a string, a backslash is the line continuation character in Python source. The only way to do what you want is to use eval, or to generate the Python file from another Python script that uses the escape sequences.
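As a rough illustration of the eval route, assuming Python 3 (the variable name source is made up here): the escapes are decoded when the string literal is evaluated, and only the resulting text is parsed as Python.

```python
# The escapes below spell out: len("abc")
# The string literal decodes them, so eval() receives ordinary source text.
source = "\u006C\u0065\u006E\u0028\u0022abc\u0022\u0029"

print(source)        # len("abc")
print(eval(source))  # 3
```

This is a sketch of the mechanism rather than a recommendation; eval on untrusted text is unsafe.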

accented letters in Python

Is there any way to define a string with accented letters in python?
An extreme example is this one:
message = "ÂÃÄÀÁÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"
Error:
SyntaxError: Non-UTF-8 code starting with '\xc2'
When source code contains anything other than ASCII, you have to add a line to tell the Python interpreter which encoding it uses:
#!/usr/bin/env python
# encoding: utf-8
Read more in PEP-0263 for the exact rules how to include the encoding hint in a magic comment.
If you use Python 3.x you can use accented (Unicode) strings without doing anything special. If you are using Python 2.x, use u prefix to denote Unicode:
message = u"ÂÃÄÀÁÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"
Also remember to include the following line at the top of your script:
# coding=utf-8
PEP-0263 explains this in detail:
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
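To check which encoding Python would pick up from such a magic comment, one option in Python 3 is the standard library's tokenize.detect_encoding. The helper name declared_encoding below is hypothetical, and the byte strings are made-up sample sources.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to decode this source text."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# Magic comment on the first line.
print(declared_encoding(b"# coding=utf-8\nx = 1\n"))  # utf-8
# Magic comment on the second line, after a shebang.
print(declared_encoding(b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"))  # iso-8859-1
# No declaration at all: Python 3 falls back to UTF-8.
print(declared_encoding(b"x = 1\n"))  # utf-8
```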

How do I split a multi-languages line in Python and get the Unicode hex value?

I try to split this kind of lines in Python:
aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"
This line contains Hebrew, simplified Chinese and English.
If I have a tuple T for example, I would like to get the tuple to be T= (Hebrew string, English string, Chinese string).
The problem is that I can't figure out how to get the Unicode value of the Chinese or the Hebrew letters. Neither of these lines works:
print ((unicode("释","utf-8")).encode("utf-8"))
print ((unicode("א","utf-8")).encode("utf-8"))
And I get this error:
SyntaxError: Non-ASCII character '\xe9' in file split_or.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
In Python 2, you need to open the file specifying an encoding like this:
import codecs
f = codecs.open("myfile.txt","r",encoding="utf-8")
In Python 3, you can just add the encoding option to any open() calls.
This will guarantee that the file is correctly decoded. Note that this doesn't mean your print calls will work properly, that depends on many things (see for example http://www.pycs.net/users/0000323/stories/14.html and that's just a start); it's better to either use a proper debugger, or output to a file (which will again be opened with codecs.open() ).
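A small sketch of the Python 3 variant, assuming a throwaway file in a temporary directory (the sample text is borrowed from the question): passing encoding="utf-8" to open() on both the write and the read side avoids depending on the platform default.

```python
import os
import tempfile

text = "aiburenshi 爱不忍释\n"

# Write and read back with an explicit encoding instead of the platform default.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    line = f.readline()

print(line == text)  # True
```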
To get the actual codepoint (i.e. integer "value"), you can use the built-in ord():
>>> ord(u"£")
163
If you know the code point ranges for the different languages, that's all you need.
Otherwise, you might want to use unicodedata to look up stuff, like the bidirectional category:
>>> import unicodedata
>>> unicodedata.bidirectional(u"£")
'ET'  # 'E'uropean 'T'erminator
In Python 2, Unicode string constants need to be prefaced with the "u" character. Such a constant is already decoded, so call encode() on it directly instead of passing it through unicode() again (which raises a TypeError):
print (u"释".encode("utf-8"))
print (u"א".encode("utf-8"))
In Python 3, string constants are Unicode by default.
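Putting the pieces together for the line in the question, here is a sketch that assumes Python 3 and assumes every line has the layout "pinyin, Chinese, quoted Hebrew". The helper script_of is hypothetical; it guesses the script from the first word of the official Unicode character name, which is a rough heuristic rather than a proper script lookup.

```python
import unicodedata

line = 'aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"'

# Assumed layout: two whitespace-separated fields, then the quoted Hebrew text.
english, chinese, hebrew = line.split(None, 2)
hebrew = hebrew.strip('"')
T = (hebrew, english, chinese)

def script_of(ch):
    """Rough script guess from the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

# Unicode hex values of the Chinese characters.
print([hex(ord(ch)) for ch in chinese])
print(script_of(chinese[0]), script_of(hebrew[0]), script_of(english[0]))  # CJK HEBREW LATIN
```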

Python - \n interfering

Hopefully a quick fix for this one. I have a script replacing a specific value with a file location. The location unfortunately quite often seems to contain \n or n\ (because the current directory is in the temp folders), causing the line to either break or be removed entirely, which makes the folder location invalid.
The temp dir usually looks something like this.
C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp\Firefox
Is there a way to prevent \n or n\ from executing? Any help is appreciated, and here's what my line replacement script looks like. Thanks in advance!
# Editing prefs.js
import fileinput
import sys

def replaceAll(file, searchExp, replaceExp):
    # With inplace=1, fileinput redirects stdout into the file being read.
    for line in fileinput.input(file, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)

replaceAll(rootDir + "/Firefox/Data/prefs.js", 'FirefoxAppDirHere', rootDir + "\\FirefoxApp.exe")
EDIT:
The method eryksun posted in a comment on this question worked perfectly for me! Thanks a lot! I'd mark the question as solved, but you must post it as an answer first.
If you are specifying the directory name within your script, you should use a raw string literal by prefixing the literal with r. For example, r"C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp\Firefox". This will keep the backslashes from being interpreted.
Your string in memory has plain backslash characters. It's not a problem of accidentally creating control characters such as line feed on the Python side. But if you're writing this out to a Javascript program, then you have to escape the backslashes. For example:
>>> x = r"C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp"
>>> print(x)
C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp
So in memory this string has single backslash characters. Let's try to compile and evaluate it as a string:
>>> print(eval("'%s'" % x))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "<string>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 2-4: truncated \UXXXXXXXX escape
To fix this you can replace each backslash with two backslashes:
>>> x = x.replace('\\', '\\\\')
>>> print(x)
C:\\Users\\Admin\\AppData\\Local\\Temp\\nsfCDAC.tmp
>>> print(eval("'%s'" % x))
C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp
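The same idea as a short script, assuming Python 3. The preference name "example.path" is hypothetical and only stands in for whatever key the JavaScript file expects.

```python
path = r"C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp\Firefox"

# The raw literal holds single backslashes and no newline character.
print("\n" in path)      # False
print(path.count("\\"))  # 7

# Double each backslash before embedding the path in a JavaScript string literal.
js_line = 'user_pref("example.path", "%s");' % path.replace("\\", "\\\\")
print(js_line)
```

Doubling happens only at the point where the text is written into another language's string literal; the Python value itself stays unchanged.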
Michael Hoffman's solution is good in general. If for any reason you need the string not to be raw, you can also add an extra backslash:
"C:\Users\Admin\AppData\Local\Temp\\nsfCDAC.tmp"
The extra backslash keeps the \n (or any other escape sequence like it) from being interpreted. Note that in Python 3, \U also starts an escape sequence, so there every backslash in this path needs to be doubled. Similarly (I believe, I'm going from vague recollection here), if you need a string with ' and " in it, you can do:
"blah blah blah, he said \"hi!\", and continued on, \'til he got to the road. Blah blah!"
You should use a raw string literal by prefixing the literal with r.

How do I regex search for weird non-ASCII characters in Python?

I'm using the following regular expression basically to search for and delete these characters.
invalid_unicode = re.compile(ur'(Û|²|°|±|É|¹|Í)')
My source code is ASCII encoded, and whenever I try to run the script it spits out:
SyntaxError: Non-ASCII character '\xdb' in file ./release.py on line 273, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
If I follow the instructions at the given website and declare utf-8 as the encoding on the second line, my script doesn't run. Instead it gives me this error:
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xdb in position 0: unexpected end of data
How do I get this one regular expression running in an ASCII-encoded script?
You need to find out what encoding your editor is using, and set that per PEP263; or, make things more stable and portable (though alas perhaps a bit less readable) and use escape sequences in your string literal, i.e., use u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)' as the parameter to the re.compile call.
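A sketch of the escape-sequence variant, assuming Python 3 (where plain string literals are Unicode, so no u prefix is needed). It uses a character class instead of the question's alternation, which matches the same single characters; the sample string is made up.

```python
import re

# Same characters as in the question, written as escapes so the source stays ASCII.
invalid_unicode = re.compile("[\xdb\xb2\xb0\xb1\xc9\xb9\xcd]")

sample = "Release\xb2 notes \xb1 draft"
cleaned = invalid_unicode.sub("", sample)
print(cleaned)  # Release notes  draft
```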
After telling Python that your source file uses UTF-8 encoding, did you actually make sure that your editor is saving the file using UTF-8 encoding? The error you get indicates that your editor is probably not using UTF-8.
What text editor are you using?
\x{c0de}
in a PCRE-style regex will match the Unicode character at code point c0de.
Python's re module is not PCRE, though, and does not accept that syntax; use \uC0DE instead.
