Python conversion from u'\\u795d\\u798f' to u'\u795d\u798f' - python

How to convert from u'\\u795d\\u798f' to u'\u795d\u798f'?
I'm quite confused....
u'\u795d\u798f' is 祝福 in Chinese.
Thanks~
UPDATE:
I'm sorry, I didn't know how to express it at the beginning. My problem is: I have u'\\u795d\\u798f' and I want it to be u'\u795d\u798f'.

From your title (but not the question text) it looks like the problem is that the backslashes in the strings are escaped (i.e., you have \\u795d and you want \u795d). There are several questions on this issue (like Process escape sequences in a string in Python).
In python 2, you can do:
>>> u'\\u795d\\u798f'.decode('unicode_escape')
u'\u795d\u798f'
Applying the print statement to this should print the Chinese characters.
The python 3 equivalent is:
>>> bytes('\\u795d\\u798f','utf-8').decode('unicode_escape')
'祝福'
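One caveat with the Python 3 form: encoding as UTF-8 and then decoding with unicode_escape reads each byte as Latin-1, so any real non-ASCII characters already in the string come out garbled. A minimal sketch that avoids this, assuming every backslash escape in the input is meant to be processed (the helper name decode_escapes is made up for illustration):

```python
def decode_escapes(text):
    """Turn literal backslash escapes such as \\u795d into the characters they name."""
    # backslashreplace rewrites real non-ASCII characters as escapes first,
    # so the unicode_escape decoder only ever sees plain ASCII bytes.
    return text.encode("ascii", "backslashreplace").decode("unicode_escape")


escaped = "\\u795d\\u798f"   # literal backslashes, as in the question
mixed = "祝\\u798f"           # one real character plus one escaped one

print(decode_escapes(escaped))   # 祝福
print(decode_escapes(mixed))     # 祝福
```

A string that contains backslashes which should stay literal would need a different approach.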

Related

Output of print("""Hello World's"s""""") in python 3.6 [duplicate]

This question already has answers here:
String concatenation without '+' operator
(6 answers)
Closed 4 years ago.
I read that anything between triple quotes inside print is treated literally, so I tried experimenting a bit. Now I can't get the statement below to work. I searched the internet but could not find anything.
statement:
print("""Hello World's"s""""")
Output I am getting:
Hello World's"s
Expected output:
Hello World's"s""
print("""Hello World's"s""""") is seen as print("""Hello World's"s""" "") because as soon as Python finds the next """ it ends the string that began with a triple double-quote. The leftover "" is a second, empty string.
Try this:
>>> print("a"'b')
ab
So your """Hello World's"s""""" is really two adjacent string literals, Hello World's"s and an empty string, which Python concatenates.
Triple-quoted strings are usually used for docstrings.
As @zimdero pointed out, see: Triple-double quote v.s. Double quote
You can also read https://stackoverflow.com/a/19479874/1768843
And https://www.python.org/dev/peps/pep-0257/
If you really want that output, escape the quotes with \", or build the string some other way, such as with .format():
print("Hello World's\"s\"\"")
https://repl.it/repls/ThatQuarrelsomeSupercollider
Triple quotes within a triple-quoted string must still be escaped for the same reason a single quote within a single quoted string must be escaped: The string parsing ends as soon as python sees it. As mentioned, once tokenized your string is equivalent to
"""Hello World's"s""" ""
That is, two strings which are then concatenated by the compiler. Triple quoted strings can include newlines. Your example is similar to
duke = """Thou seest we are not all alone unhappy:
This wide and universal theatre
Presents more woeful pageants than the scene
Wherein we play in."""
jaques = """All the world's a stage,
And all the men and women merely players:
They have their exits and their entrances;
And one man in his time plays many parts."""
If Python were looking for the outermost triple quotes, it would have defined only one string here.
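To see the split for yourself, the standard library tokenize module can list the string literals Python finds in the original statement. This is only an illustrative sketch; the variable names are made up, and the last line shows one way to keep the trailing quotes inside a triple-quoted string by escaping them:

```python
import io
import tokenize

source = 'print("""Hello World\'s"s""""")'

# Collect only the string-literal tokens the tokenizer finds in the statement.
string_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.STRING
]

for literal in string_tokens:
    print(literal)   # two literals: the triple-quoted one, then an empty ""

# Escaping the two trailing quotes keeps them inside the triple-quoted string.
fixed = """Hello World's"s\"\""""
print(fixed)         # Hello World's"s""
```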
It is simpler to use ''' so that the double quotes need no escaping:
print('''Hello World's"s""''')
Maybe this is what you are looking for?
print("\"\"Hello World's's\"\"")
Output:
""Hello World's's""

How to fix "nothing to repeat" regex error?

I know from this question that "nothing to repeat" in a regular expression is a known Python bug.
But I must compile this unicode expression
re.compile(u'\U0000002A \U000020E3')
as a single character. This is an emoticon, and it is a single character. Python understands this string as u'* \\u20e3' and raises a 'nothing to repeat' error.
I have looked around but can't find a solution. Is there a workaround?
This has little to do with the question you linked. You're not running into a bug. Your regex simply has a special character (a *) that you haven't escaped.
Simply escape the string before compiling it into a regex:
re.compile(re.escape(u'\U0000002A \U000020E3'))
Now, I'm a little unsure as to why you're representing * as \U0000002A — perhaps you could clarify what your intent is here?
You need to use re.escape (as shown in Thomas Orozco's answer),
but use it only on the part that is dynamic, such as:
print re.findall( u"cool\s*%s" % re.escape(u'\U0000002A \U000020E3'),
u"cool * \U000020E3 crazy")
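For Python 3, the same idea might look like the sketch below. It reproduces the error with the unescaped string from the question, then escapes only the dynamic part. The variable names are illustrative, and the pattern prefix is a raw string so that \s is not treated as a string escape:

```python
import re

emoji = "\U0000002A \U000020E3"   # '*', a space, then U+20E3, as in the question

# Unescaped, the leading '*' is a quantifier with nothing before it.
try:
    re.compile(emoji)
except re.error as exc:
    error_message = str(exc)
    print(error_message)

# Escape only the dynamic part; the rest of the pattern stays a real regex.
pattern = re.compile(r"cool\s*" + re.escape(emoji))
matches = pattern.findall("cool * \U000020E3 crazy")
print(matches)
```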

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol) is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
To determine which character it is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
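In Python 3, ascii() gives the same kind of escaped representation, and unicodedata.name() can put a name to the code point. A small sketch, where U+2028 is only a stand-in (the question does not say which character was actually found) and describe is a hypothetical helper:

```python
import unicodedata


def describe(char):
    """Return the code point and Unicode name of a single character."""
    return "U+{:04X} {}".format(ord(char), unicodedata.name(char, "<no name>"))


mystery = "\u2028"   # assumed stand-in for the character isolated from the PDF text

print(ascii(mystery))      # '\u2028'
print(describe(mystery))   # U+2028 LINE SEPARATOR
```

Control characters such as form feed have no Unicode name, which is why the sketch passes a fallback to unicodedata.name().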

How can I print a string using .format(), and print literal curly brackets around my replaced string [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How can I print a literal “{}” characters in python string and also use .format on it?
Basically, I want to use .format(), like this:
my_string = '{{0}:{1}}'.format('hello', 'bonjour')
And have it match:
my_string = '{hello:bonjour}' #this is a string with literal curly brackets
However, the first piece of code gives me an error.
The curly brackets are important, because I'm using Python to communicate with a piece of software via text-based commands. I have no control over what kind of formatting the software expects, so it's crucial that I sort out all the formatting on my end. It uses curly brackets around strings to ensure that spaces in the strings are interpreted as single strings, rather than multiple arguments, much like you normally do with quotation marks in file paths, for example.
I'm currently using the older method:
my_string = '{%s:%s}' % ('hello', 'bonjour')
Which certainly works, but .format() seems easier to read, and when I'm sending commands with five or more variables all in one string, then readability becomes a significant issue.
Thanks!
Here is the new style:
>>> '{{{0}:{1}}}'.format('hello', 'bonjour')
'{hello:bonjour}'
But I think the escaping is somewhat hard to read, so I prefer to switch back to the older style to avoid it:
>>> '{%s:%s}' % ('hello', 'bonjour')
'{hello:bonjour}'
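If the doubled braces hurt readability when a command has many fields, one option is to keep the literal braces out of the format string altogether. The sketch below compares the approaches; the helper braced is hypothetical, and the f-string line needs Python 3.6 or later:

```python
greeting, translation = "hello", "bonjour"

# Doubled braces are literal; single braces are replacement fields.
with_format = "{{{0}:{1}}}".format(greeting, translation)

# The same escaping rule applies inside f-strings.
with_fstring = f"{{{greeting}:{translation}}}"


def braced(*parts):
    """Join the parts with ':' and wrap the result in literal curly brackets."""
    return "{" + ":".join(parts) + "}"


print(with_format)                      # {hello:bonjour}
print(with_fstring)                     # {hello:bonjour}
print(braced(greeting, translation))    # {hello:bonjour}
```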

Python regex to convert non-ascii characters in a string to closest ascii equivalents

I'm seeking a simple Python function that takes a string and returns a similar one but with all non-ascii characters converted to their closest ascii equivalent.
For example, diacritics and whatnot should be dropped.
I'm imagining there must be a pretty canonical way to do this and there are plenty of related stackoverflow questions but I'm not finding a simple answer so it seemed worth a separate question.
Example input/output:
"Étienne" -> "Etienne"
Reading this question made me go looking for something better.
https://pypi.python.org/pypi/Unidecode/0.04.1
Does exactly what you ask for.
In Python 3 and using the regex implementation at PyPI:
http://pypi.python.org/pypi/regex
Starting with the string:
>>> s = "Étienne"
Normalise to NFKD and then remove the diacritics:
>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
Doing a search for 'iconv TRANSLIT python' I found:
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/ which looks like it might be what you need. The comments have some other ideas which use the standard library instead.
There's also http://web.archive.org/web/20070807224749/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python/ which uses NFKD to get the base characters where possible.
Read the answers to some of the duplicate questions. The NFKD gimmick works only as an accent stripper. It doesn't handle ligatures and lots of other Latin-based characters that can't be (or aren't) decomposed. For this a prepared translation table is necessary (and much faster).
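The two ideas can be combined using only the standard library: a small translation table for characters that do not decompose, followed by NFKD normalisation to strip accents. This is a rough sketch; the name to_ascii and the table entries are illustrative assumptions, and the table is far from a complete mapping:

```python
import unicodedata

# Hand-picked replacements for characters that NFKD leaves untouched.
EXTRA = str.maketrans({
    "ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ł": "l", "Ł": "L",
})


def to_ascii(text):
    """Approximate text in ASCII: translate, decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no mapping here and is silently dropped.
    return stripped.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Étienne"))   # Etienne
print(to_ascii("Straße"))    # Strasse
print(to_ascii("Søren"))     # Soren
```

For broad coverage, a prepared table such as the one shipped with Unidecode is the more practical choice.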
