Replace accented letters with the respective non-accented ones in Python 3 - python

I am not sure that this popular answer works in Python 3, since there is no unicode type in Python 3.
Therefore, how can I replace accented letters with their non-accented equivalents in Python 3?
For example,
sentence = 'intérêt'
to
new_sentence = 'interet'

The linked answer references the third-party module unidecode, not Python 2's unicode type.
$ python3
Python 3.7.1 (default, Nov 19 2018, 13:04:22)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unidecode
>>> unidecode.unidecode('intérêt')
'interet'
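
If installing a third-party package is not an option, a standard-library approach is sometimes used instead. The sketch below is not from the linked answer; it assumes that decomposing the text with unicodedata.normalize and dropping the combining marks is good enough for your input. The helper name strip_accents is made up for illustration, and unlike unidecode it leaves letters that have no decomposition (such as 'ø') untouched.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks (accents) removed.

    Hypothetical helper: NFD splits 'é' into 'e' plus a combining
    acute accent, and the combining characters are then dropped.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


sentence = 'intérêt'
new_sentence = strip_accents(sentence)
print(new_sentence)  # interet
```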

Related

Python 3.6 variable annotations and numeric literals

The "What's new in Python 3.6" section of the Python documentation presents, among other things, variable annotations and underscores in numeric literals.
However, I tried the examples shown and not all of them worked.
Are these examples incomplete? Do they require additional code that is assumed to be present?
For example this statement
primes: List[int] = []
issues
NameError: name 'List' is not defined
This statement
print( 1_000_000_000_000_000 )
is also reported as an error.
The first case works if you first import List from typing. Most types used with type hints aren't built in; they need to be imported first.
The second case also works if you are running under 3.6. On my machine it correctly prints:
Python 3.6.2 | packaged by conda-forge | (default, Jul 23 2017, 22:59:30)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print( 1_000_000_000_000_000 )
1000000000000000
If the error message you receive is SyntaxError: invalid syntax, you're on 3.5 or earlier. If it's SyntaxError: invalid token, you're not using the underscores correctly. I'm guessing you're receiving the first.
So, you might want to double check you're running with 3.6 (python -V).
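
Putting both fixes together, a minimal sketch that should run as-is on Python 3.6 or later might look like this (the variable name big is only for illustration):

```python
from typing import List  # List is not a builtin; it lives in typing

# Variable annotation (PEP 526): the annotation is a hint, not a check.
primes: List[int] = []
primes.append(2)

# Underscores in numeric literals (PEP 515) are ignored by the parser.
big = 1_000_000_000_000_000
print(primes)  # [2]
print(big)     # 1000000000000000
```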

String/unicode references in Python embedded dictionaries [duplicate]

This question already has answers here:
Why does comparing strings using either '==' or 'is' sometimes produce a different result?
(15 answers)
Python string interning
(2 answers)
Closed 5 years ago.
I have a question about Python 2.7.5 through 2.7.13. It may be
about semantics or it may be a genuine Python bug; I'm not entirely
sure which. Here is the simplest code I can construct that shows the
issue:
Python 2.7.13 |Enthought, Inc. (x86_64)| (default, Mar 2 2017, 08:20:50)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
>>> dd = {'foo': {'yy':u'Tannenbaum'}}
>>> dd['foo']['yy'] is u'Tannenbaum'
False
>>> dd['foo']['yy'] == u'Tannenbaum'
True
Note: If u'Tannenbaum' is changed from a unicode literal to a plain string, the outcome changes: both of the final tests are true. The question is: why do the two final tests differ in the unicode case? My understanding is that since unicode objects and strings are both immutable, the "is" and "==" tests should never differ in value. But I get this behavior in both Python 2.7.13 and the old 2.7.5 that came installed on my Mac. Am I relying on something I shouldn't rely on? Is the moral that I should never use "is" for string equality? But what is the principle that tells me that?
Postscript: I have access to a Python 3.6.2 on another machine, and lo and behold, I cannot reproduce this anomaly.
Python 3.6.2 (default, Jul 30 2017, 12:03:06)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> dd = {'foo': {'yy':u'Tannenbaum'}}
>>> dd['foo']['yy'] is u'Tannenbaum'
True
>>> dd['foo']['yy'] == u'Tannenbaum'
True
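
As the linked duplicates suggest, the principle is that "is" tests object identity while "==" tests value equality; whether two equal strings happen to be the same object depends on interning, which is an implementation detail. The following Python 3 sketch illustrates the distinction. The variable names are made up, and the result of the plain "is" test is deliberately not stated because it is not guaranteed by the language.

```python
import sys

literal = 'Tannenbaum'
# Built at runtime, so the interpreter is free to create a separate object.
built = ''.join(['Tannen', 'baum'])

print(literal == built)  # True: the values are equal
print(literal is built)  # identity: depends on the implementation

# sys.intern returns one canonical object per distinct string value,
# so identity is only meaningful after explicit interning.
print(sys.intern(literal) is sys.intern(built))  # True
```

In short: use "==" for string equality and keep "is" for singletons such as None.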

Does python support unicode beyond basic multilingual plane?

Below is a simple test. repr seems to work fine, yet len and [x for x in ...] don't seem to split the unicode text correctly in Python 2.6 and 2.7:
In [1]: u"爨爵"
Out[1]: u'\U0002f920\U0002f921'
In [2]: [x for x in u"爨爵"]
Out[2]: [u'\ud87e', u'\udd20', u'\ud87e', u'\udd21']
The good news is that Python 3.3 does the right thing™.
Is there any hope for the Python 2.x series?
Yes, provided you compiled your Python with wide-unicode support.
By default, Python is built with narrow unicode support only. Enable wide support with:
./configure --enable-unicode=ucs4
You can verify what configuration was used by testing sys.maxunicode:
import sys
# sys.maxunicode is 0x10FFFF on wide builds and 0xFFFF on narrow builds
if sys.maxunicode == 0x10FFFF:
    print('Python built with UCS4 (wide unicode) support')
else:
    print('Python built with UCS2 (narrow unicode) support')
A wide build will use UCS4 characters for all unicode values, doubling memory usage for these. Python 3.3 switched to variable width values; only enough bytes are used to represent all characters in the current value.
Quick demo showing that a wide build handles your sample Unicode string correctly:
$ python2.6
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> [x for x in u'\U0002f920\U0002f921']
[u'\U0002f920', u'\U0002f921']
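
To see where the four values in the narrow-build output come from, here is a small sketch for Python 3.3 or later. It assumes that encoding to UTF-16 is a fair stand-in for how a narrow build stores the string; the names text and units are only illustrative.

```python
text = '\U0002f920\U0002f921'

# Python 3.3+ counts code points, so this is 2.
print(len(text))

# Encode as big-endian UTF-16 and split into 16-bit code units;
# each supplementary-plane character becomes a surrogate pair.
utf16 = text.encode('utf-16-be')
units = [int.from_bytes(utf16[i:i + 2], 'big') for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])  # ['0xd87e', '0xdd20', '0xd87e', '0xdd21']
```

The code units match the u'\ud87e', u'\udd20', u'\ud87e', u'\udd21' items shown in the question.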

Python sys.maxint, sys.maxunicode on Linux and windows

On 64-bit Debian Linux 6:
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
9223372036854775807
>>> sys.maxunicode
1114111
On 64-bit Windows 7:
Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
2147483647
>>> sys.maxunicode
65535
Both operating systems are 64-bit, yet they report different values for sys.maxunicode. According to Wikipedia, there are 1,114,112 code points in Unicode. Is sys.maxunicode on Windows wrong?
And why do they have different sys.maxint?
I don't know what your question is, but sys.maxunicode is not wrong on Windows.
See the docs:
sys.maxunicode
An integer giving the largest supported code point for a Unicode character. The value of this depends on the configuration option that
specifies whether Unicode characters are stored as UCS-2 or UCS-4.
Python on Windows uses UCS-2, so the largest code point is 65,535 (and the supplementary-plane characters are encoded by 2*16 bit "surrogate pairs").
About sys.maxint, this shows at which point Python 2 switches from "simple integers" (123) to "long integers" (12345678987654321L). Obviously Python for Windows uses 32 bits, and Python for Linux uses 64 bits. Since Python 3, this has become irrelevant because the simple and long integer types have been merged into one. Therefore, sys.maxint is gone from Python 3.
Regarding the difference in sys.maxint, see What is the bit size of long on 64-bit Windows?. Python 2.x uses the C long type internally to store a small integer, and long is 32 bits wide on 64-bit Windows.
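
For comparison, here is a quick sketch of how the same checks look on Python 3.3 or later, where the narrow/wide distinction and sys.maxint are both gone. The variable names are only illustrative.

```python
import sys

# Code points run from 0 to 0x10FFFF, which is where the
# 1,114,112 figure comes from: the maximum plus one.
code_point_count = sys.maxunicode + 1
print(code_point_count)  # 1114112

# sys.maxint no longer exists; ints grow as needed.
has_maxint = hasattr(sys, 'maxint')
print(has_maxint)  # False

# sys.maxsize still exists, but it is not a limit on int values.
beyond = sys.maxsize + 1
print(type(beyond))  # <class 'int'>
```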

Python2.4 and 2.6 behave differently for os.path.getmtime() on Windows

I get two different modification times for the same file when using different Python versions on Windows XP.
Python2.4
C:\Copy of elisp>c:\python24\python
Python 2.4.4 (#71, Oct 18 2006, 08:34:43) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.getmtime("auto-complete-emacs-lisp.el")
1251684178
>>> ^Z
Python2.6
C:\Copy of elisp>C:\Python26\python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.getmtime("auto-complete-emacs-lisp.el")
1251687778.0
>>>
There is a difference of 3600 seconds reported by Python2.6 and Python2.4.
What is the reason for this strange behavior?
It's a bug in Microsoft's implementation of the C standard library. Python 2.4 used to use the stdlib fstat call to get file information, and hence could end up an hour out in locales that use DST.
In Python 2.5 and later, os.stat calls the direct Win32-only API to get file information when running on Windows, resulting in the correct output. See this thread for more.
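
As a rough sanity check, the two values from the question differ by exactly one hour, and a timestamp can be rendered in UTC to take the local DST setting out of the picture. This is only a sketch for a modern Python 3; it uses a scratch temporary file rather than the file from the question, and the variable names are made up.

```python
import os
import tempfile
from datetime import datetime, timezone

old_value = 1251684178    # reported by Python 2.4 in the question
new_value = 1251687778.0  # reported by Python 2.6 in the question
offset_seconds = new_value - old_value
print(offset_seconds)  # 3600.0

# Create a scratch file so the example does not depend on a real path.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    scratch_path = handle.name
try:
    mtime = os.path.getmtime(scratch_path)  # seconds since the epoch
    # The UTC rendering does not depend on the machine's DST rules.
    print(datetime.fromtimestamp(mtime, tz=timezone.utc))
    # The local rendering does.
    print(datetime.fromtimestamp(mtime))
finally:
    os.remove(scratch_path)
```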
There is a difference of 3600 seconds ...
This should be the kicker. It's a timezone problem, pure and simple.
Now all you have to do is find out why 2.4 and 2.6 are using different timezone information :-)
