All the ASCII characters can be represented in UTF-8 (they occupy the first 128 code points, i.e. the seven-bit range). Using UTF-8 exclusively could simplify string handling greatly. Granted, UTF-8 is not a fixed-length format and therefore has certain performance penalties with respect to ASCII, but I have the feeling Python normally goes for Pythonic before performance.
My question: has it ever been addressed why Python 3 implements strings this way rather than using UTF-8 exclusively? That is, why not always store strings in the UTF-8 encoding instead of switching between different internal representations?
I'm not looking for personal opinions from SO users but for PEPs or a transcript from the BDFL addressing this very point.
From PEP 393:
Rationale
There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).
One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.
For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced. If representations do share
data, it is also possible to omit structure fields, reducing the base
size of string objects.
If it is not clear from the above text:
We want most string representations to be space efficient
We want efficient indexing whenever possible
We want to be compatible with all systems and provide the full Unicode range on all systems
The result is that using a single internal representation would violate at least one of these constraints.
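For illustration, a rough sketch of the flexible representation at work (the exact byte counts are CPython implementation details and vary by version and platform; the trend is what matters):

import sys

ascii_text = 'a' * 1000             # stored as 1 byte per character
cyrillic_text = '\u0440' * 1000     # needs 2 bytes per character
emoji_text = '\U0001F600' * 1000    # non-BMP, needs 4 bytes per character

for s in (ascii_text, cyrillic_text, emoji_text):
    print(len(s), sys.getsizeof(s))   # same length, very different memory footprint

All three still support s[i] in constant time, which a single UTF-8 representation could not offer without extra bookkeeping.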
Related
I have a large dataset with over 2 million rows of textual data. Now I want to remove the accents from the strings.
In the link below, two different modules are described to remove the accents:
What is the best way to remove accents in a Python unicode string?
The modules described are unicode and unicodedata. To me it's not clear what the differences are between the two, and a comparison is hard because I don't have many rows with accents and I don't know which accents might be replaced and which ones would not.
Therefore, I would like to know what the differences are between the two and which one is recommended to use.
There is only one module: unicodedata, which includes the Unicode database, i.e. the names and properties of Unicode code points.
unicode was a built-in type in Python 2. It merely converted byte strings to Unicode strings given an encoding, so it did not need to carry any of that character data. In Python 3 all strings are Unicode (with some particularities); only the encoding now has to be specified explicitly when converting to or from bytes.
In that answer you see only import unicodedata, so only one module is involved. To remove accents you need not just the Unicode code points themselves, but also information about the category of each code point (whether it is a combining character), and that is exactly what unicodedata provides.
Maybe you mean unidecode. That is a separate module, outside the standard library. It can be useful in some cases: it is simple and only produces output in the ASCII range. That may be fine in some situations, but it can cause problems outside the Latin writing system.
unicodedata, on the other hand, does nothing for you by itself. You have to understand Unicode and apply the right filter function yourself (and maybe know how other languages work).
So it depends on the case, and maybe you just need a different slug function (to create a non-escaped string). When working with languages, take care not to overdo things (you may build an offensive word).
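A minimal sketch of the usual unicodedata approach (decompose with NFKD, then drop the combining marks; it handles Latin diacritics like "é" but not characters such as "ø" that do not decompose into base + mark):

import unicodedata

def strip_accents(text):
    # Split each accented character into base character + combining mark(s),
    # then keep only the non-combining code points.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('café naïve São Paulo'))   # cafe naive Sao Paulo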
Python3 has unicode strings (str) and bytes. We already have bytestring literals and methods. Why do we need two different types, instead of just byte strings of various encodings?
The answer to your question depends on the meaning of the word "need."
We certainly don't need the str type in the sense that everything we can compute with the type we can also compute without it (as you know quite well from your well-worded question).
But we can also understand "need" from the point of view of convenience. Isn't it nice to have a sqrt function? Or log or exp or sin? You could write these yourself, but why bother? A standard library designer will add functions that are useful and convenient.
It is the same for the language itself. Do we "need" a while loop? Not really, we could use tail-recursive functions. Do we "need" list comprehensions? Tons of things in Python are not primitive. For that matter, do we "need" high-level languages at all? John von Neumann himself once asked "why would you want more than machine language?"
It is the same with str and bytes. The type str, while not necessary, is a nice, time-saving, convenient thing to have. It gives us an interface as a sequence of characters, so that we can manipulate text character-by-character without:
us having to write all the encoding and decoding logic ourselves, or
bloating the string interface with multiple sets of iterators, like each_byte and each_char.
As you suspect, we could have one type which exposes the byte sequence and the character sequence (as Ruby's String class does). The Python designers wanted to separate those usages into two separate types. You can convert an object of one type into the other very easily. By having two types, they are saying that separation of concerns (and usages) is more important than having fewer built-in types. Ruby makes a different choice.
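For example, converting between the two types is explicit but trivial, while character counts and byte counts stay clearly distinct (a small sketch):

text = 'naïve'                        # str: a sequence of characters
data = text.encode('utf-8')           # bytes: a sequence of raw bytes
print(len(text), len(data))           # 5 characters, 6 bytes ('ï' takes two bytes in UTF-8)
print(data.decode('utf-8') == text)   # True: decoding round-trips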
TL;DR It's a matter of preference in language design: separation of concerns by distinct type rather than by different methods on the same type.
Because bytes should not be considered strings, and strings should not be considered bytes. Python3 gets this right, no matter how jarring this feels to the brand new developer.
In Python 2.6, if I read data from a file, and I passed the "r" flag, the text would be read in the current locale by default, which would be a string, while passing the "rb" flag would create a series of bytes. Indexing the data is entirely different, and methods that take a str may be unsure of whether I am using bytes or a str. This gets worse since for ASCII data the two are often synonymous, meaning that code which works in simple test cases or English locales will fail upon encountering non-ASCII characters.
There was therefore a conscious effort to ensure bytes and strings were not identical: that one was a sequence of "dumb bytes", and the other was a Unicode string stored in whichever internal width preserves O(1) indexing (one, two, or four bytes per character depending on the data, per PEP 393).
In Python 2, the Unicode string was used to disambiguate text from "dumb bytes", however, str was treated as text by many users.
Or, to quote the Benevolent Dictator:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.
tl;dr version
Forcing the separation of bytes and str forces coders to be conscious of their difference, at the cost of some short-term dissatisfaction, but it leads to better code in the long term. It's a conscious choice made after years of experience: forcing you to notice the difference immediately will save you days in a debugger later.
Byte strings with different encodings are incompatible with each other, but until Python 3 there was nothing in the language to remind you of this fact. It turns out that mixing different character encodings is a surprisingly common problem in today's world, leading to far too many bugs.
Also it's often just easier to work with whole characters, without having to worry that you just modified a byte that accidentally rendered your 4-byte character into an invalid sequence.
There are at least two reasons:
the str type has an important property "one element = one character".
the str type does not depend on encoding.
Just imagine how you would implement a simple operation like reversing a string (rword = word[::-1]) if word were a bytestring in some multi-byte encoding.
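A quick sketch of what goes wrong (assuming a UTF-8 encoded bytestring):

word = 'héllo'
print(word[::-1])                     # 'olléh' - reversing characters works fine

raw = word.encode('utf-8')            # b'h\xc3\xa9llo'
broken = raw[::-1]                    # reverses bytes, tearing the 2-byte 'é' apart
print(broken.decode('utf-8', errors='replace'))   # mojibake with replacement characters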
Is it possible to convert a file (text/image/mp3) to just the binary code that it's made up of, so it can then be manipulated, for example in Python or whatever language? I poked around a bit online and binary files were mentioned a lot, but nothing was particularly useful or coherent. Thanks for any info; I've done a fair bit of high-level programming, so now I'm looking to branch out a bit.
If you want to manipulate binary files, use the rb (read binary) and wb (write binary) file modes:
with open('binary_file.mp3', 'rb') as f:
    first_byte = f.read(1)
To be clear, all files are binary. Some binary files can be interpreted as text files, but they're still stored in binary underneath.
Think of it this way: a file is a series of numbers, and each number can only be in the range 0 to 255. Then in the '60s and '70s some Americans decided that if you see the number 65, it's actually the capital letter "A", 66 is "B", and so on; 97 is lowercase "a", 98 is "b", etc., and numbers greater than 127 would never be used. You could come up with your own mapping of numbers to letters (and people in different countries did), but you should probably use the mapping people have more or less all agreed on, which is called ASCII (and its ASCII-compatible successor, UTF-8).
If you want to look at the actual numbers under the hood of a file, you need a hex editor, though it shows those numbers in hexadecimal rather than the decimal notation we are used to.
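You can see that number-to-letter mapping directly from Python with just the built-in ord, chr and bytes:

print(bytes([72, 105, 33]))    # b'Hi!' - the numbers 72, 105, 33 are 'H', 'i', '!'
print(ord('A'), chr(66))       # 65 B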
If you want to see what the actual ones and zeros of a file are, just use this (the := operator requires Python 3.8+):
with open('binary_file_name', 'rb') as f:
    while byte := f.read(1):
        print(f'{ord(byte):08b}')
A binary file is just an array of bytes, and most programming languages deal with arrays, so there's no "binary code" conversion to do. Binary formats then exist to tell one file type from another (e.g. an image from an mp3), because you can only interpret raw bytes if you gave them a meaning in the first place.
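For example, many formats announce themselves in their first few bytes ("magic numbers"). A rough sketch, where 'some_file' is a placeholder path and real format detection is more involved:

with open('some_file', 'rb') as f:
    header = f.read(8)

if header.startswith(b'\x89PNG\r\n\x1a\n'):
    print('looks like a PNG image')
elif header.startswith(b'ID3'):
    print('looks like an MP3 file with an ID3 tag')
else:
    print('unrecognised header:', header)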
Can you point out a scenario in which Python's bytearray is useful? Is it simply a non-unicode string that supports list methods which can be easily achieved anyway with str objects?
I understand some think of it as "the mutable string". However, when would I need such a thing? Also: unlike with strings or lists, I am strictly limited to ASCII, so in any case I'd prefer the others - is that not true?
Bach,
bytearray is a mutable block of memory. To that end, you can push arbitrary bytes into it and manage your memory allocations carefully. While many Python programmers never worry about memory use, there are those of us that do. I manage buffer allocations in high load network operations. There are numerous applications of a mutable array -- too many to count.
It isn't correct to think of the bytearray as a mutable string. It isn't. It is an array of bytes. Strings have a specific structure. bytearrays do not. You can interpret the bytes however you wish. A string is one way. Be careful though, there are many corner cases when using Unicode strings. ASCII strings are comparatively trivial. Modern code runs across borders. Hence, Python's rather complete Unicode-based string classes.
A bytearray exposes its memory through the buffer protocol. This very rich protocol is the foundation of Python's interoperation with C and enables numpy and other direct-memory-access technologies. In particular, it allows memory to be easily structured into multi-dimensional arrays in either C or FORTRAN order.
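A small sketch of the "reusable buffer" idea (the socket here is hypothetical; recv_into is the standard-library call that fills an existing buffer in place):

buf = bytearray(4096)          # allocate the receive buffer once
view = memoryview(buf)         # zero-copy slicing via the buffer protocol

# With a real socket `sock`, recv_into() would fill buf in place,
# avoiding a fresh bytes object per read:
#     n = sock.recv_into(view)

buf[0] = 0x7F                  # individual bytes are mutable
print(buf[:4])                 # bytearray(b'\x7f\x00\x00\x00')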
You may never have to use a bytearray.
Anon, Andrew
deque seems to need much more space than bytearray.
>>> sys.getsizeof(collections.deque(range(256)))
1336
>>> sys.getsizeof(bytearray(range(256)))
293
I guess this is because of the layout: a deque stores pointers to full Python int objects, while a bytearray stores the raw bytes themselves.
If you need samples using bytearray I suggest searching the online code with nullege.
bytearray also has one more advantage: you do not need to import anything for that. This means that people will use it - whether it makes sense or not.
Further reading about bytearray: the bytes type in python 2.7 and PEP-358
I am tabulating a lot of output from some network analysis, listing one edge per line, which results in dozens of gigabytes, stretching the limits of my resources (understatement). As I only deal with numerical values, it occurred to me that I might be smarter than using the Py3k defaults: some other character encoding might save me quite some space if I only have digits (and spaces and the occasional decimal dot). Constrained as I am, I might even save on the line endings (so as not to have the two-byte Windows-standard CRLF).
An example line would read like this:
62233 242344 0.42442423
(Where actually the last number is pointlessly precise, I will cut it back to three nonzero digits.)
As I will need to read the text file in with other software (Stata, actually), I cannot keep the data in an arbitrary binary format, though I see no reason why Stata would read only UTF-8 text. Or would you simply say that avoiding UTF-8 barely saves me anything?
I think compression would not work for me, as I write the text line by line and it would be great to limit the output size even while it is being generated. I might easily be mistaken about how compression works, but I thought it could only save me space after the file is generated, and my issue is that my code crashes already while I am tabulating the text file (line by line).
Thanks for all the ideas and clarifying questions!
You can use zlib or gzip to compress the data as you generate it. You won't need to change your format at all; the compression will adapt to the characters and sequences that you use the most to produce an optimal file size.
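For instance, a sketch of writing lines straight into a gzip stream so the uncompressed file never has to exist on disk (the filename and rows are placeholders):

import gzip

with gzip.open('edges.txt.gz', 'wt', encoding='utf-8', newline='\n') as f:
    for a, b, w in [(62233, 242344, 0.424), (1, 2, 0.5)]:
        f.write(f'{a} {b} {w:.3g}\n')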
Alternatively, avoid character encodings entirely and save your data in a binary format - see Python's struct. An ASCII-encoded value like 4 billion takes 10 bytes, but fits in a 4-byte integer. There are a lot of downsides to a custom binary format (it's hard to debug manually or to inspect with other tools, etc.).
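A sketch of such a fixed-width record with struct (two unsigned 32-bit integers and one 32-bit float, little-endian):

import struct

record = struct.pack('<IIf', 62233, 242344, 0.42442423)
print(len(record))                    # 12 bytes instead of ~24 characters of text
print(struct.unpack('<IIf', record))  # the float comes back only approximately equal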
I have done some study on this. Clever encoding does not matter much once you apply compression: even if you use some binary encoding, the data seems to contain the same entropy and ends up at a similar size after compression.
The Power of Gzip
Yes, there are Python libraries that allow you to stream output and compress it automatically.
Lossy encoding does save space, though. Cutting down the precision helps.
I don't know the capabilities of data input in Stata, and a quick search reveals that those capabilities are described in the User's Guide, which seems to be available only in dead-tree copies. So I don't know if my suggestion is feasible.
An instant saving of half the size would come from using 4 bits per character. You have an alphabet of 0 to 9, the period, (possibly) the minus sign, the space and the newline - 14 characters, fitting comfortably in the 2**4 == 16 available slots.
If this can be used in Stata, I can help more with suggestions for quick conversions.
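A toy sketch of that nibble packing (hypothetical pack/unpack helpers; Stata would still need the data decoded back to text before reading it):

ALPHABET = '0123456789.- \n'               # 14 symbols, each fits in 4 bits
ENC = {c: i for i, c in enumerate(ALPHABET)}

def pack(text):
    nibbles = [ENC[c] for c in text]
    if len(nibbles) % 2:                   # pad to a whole number of bytes
        nibbles.append(ENC[' '])
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

def unpack(data):
    return ''.join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)

line = '62233 242344 0.424\n'
packed = pack(line)
print(len(line), len(packed))              # 19 characters -> 10 bytes
print(unpack(packed))                      # original text plus one padding space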