Can you point out a scenario in which Python's bytearray is useful? Is it simply a non-Unicode string that supports list methods, which can be achieved easily enough with str objects anyway?
I understand some think of it as "the mutable string". However, when would I need such a thing? And: unlike with strings or lists, I am strictly limited to ASCII, so in any case I'd prefer the others -- is that not true?
Bach,
bytearray is a mutable block of memory. To that end, you can push arbitrary bytes into it and manage your memory allocations carefully. While many Python programmers never worry about memory use, there are those of us who do. I manage buffer allocations in high-load network operations. There are numerous applications of a mutable array -- too many to count.
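For instance, a preallocated bytearray can receive network data in place, so no new bytes object is created per read. A minimal sketch, assuming sock is an already-connected socket:

buf = bytearray(4096)        # one reusable 4 KiB buffer
view = memoryview(buf)       # zero-copy window into the buffer
n = sock.recv_into(view)     # fills buf in place; returns bytes received
chunk = view[:n]             # zero-copy slice of what arrived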
It isn't correct to think of the bytearray as a mutable string. It isn't. It is an array of bytes. Strings have a specific structure; bytearrays do not. You can interpret the bytes however you wish, and a string is one way. Be careful, though: there are many corner cases when using Unicode strings. ASCII strings are comparatively trivial, but modern code runs across borders -- hence Python's rather complete Unicode-based string classes.
A bytearray exposes its memory through the buffer protocol. This very rich protocol is the foundation of Python's interoperation with C and enables numpy and other direct-memory-access technologies. In particular, it allows memory to be easily structured into multi-dimensional arrays in either C or FORTRAN order.
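To see the buffer protocol at work, a memoryview can reinterpret a bytearray as a two-dimensional array without copying (a small sketch):

buf = bytearray(range(12))                       # twelve bytes of backing storage
grid = memoryview(buf).cast('B', shape=[3, 4])   # same memory, viewed as 3x4
grid[2, 3] = 255                                 # writes through to the bytearray
print(buf[11])                                   # 255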
You may never have to use a bytearray.
Anon, Andrew
deque seems to need much more space than bytearray:
>>> sys.getsizeof(collections.deque(range(256)))
1336
>>> sys.getsizeof(bytearray(range(256)))
293
I guess this is because of the layout: a deque stores pointers to separate int objects in linked blocks, while a bytearray stores one raw byte per element.
If you need samples using bytearray, I suggest searching online code with Nullege.
bytearray also has one more advantage: you do not need to import anything to use it. This means that people will use it -- whether it makes sense or not.
Further reading about bytearray: the bytes type in Python 2.7 and PEP 358.
Related
Is there any Python module that allows one to freely reinterpret raw bytes as different data types and perform various arithmetic operations on them? For example, take a look at this C snippet:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    char test[9] = "test_data";     /* exactly 9 chars, no terminating NUL */
    int32_t *x = (int32_t *)test;   /* reinterpret the first 4 bytes as an int */
    (*x)++;
    printf("%d %.9s\n", *x, test);  /* outputs 1953719669 uest_data on a LE system */
    return 0;
}
Here, both test and x point to the same chunk of memory. I can use x to perform arithmetic and test for string-based operations and individual byte access.
I would like to do the same in Python - I know that I can use struct.pack and struct.unpack whenever I feel the need to convert between a list of bytes and an integer, but maybe there's a module that makes such fiddling much easier.
I'd like to have a datatype that would support arithmetic and, at the same time, allow me to mess with the individual bytes. This doesn't sound like a use case Python was designed for, but nevertheless it's quite a common problem.
I took a look into pycrypto codebase, since cryptography is one of the most common use cases for such functionality, but pycrypto implements most of its algorithms in plain C and the much smaller Python part uses two handcrafted methods (long_to_bytes and bytes_to_long) for conversion between the types.
You may be able to use ctypes.Union, which is analogous to unions in C.
I don't have much experience in C, but as I understand it, unions allow you to write to a memory location as one type and read it as another, and vice versa, which seems to fit your use case.
https://docs.python.org/3/library/ctypes.html#structures-and-unions
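A minimal sketch of that idea (the type and field names here are made up for illustration):

import ctypes

class IntBytes(ctypes.Union):
    # both fields overlay the same 4 bytes of memory
    _fields_ = [("as_int", ctypes.c_int32),
                ("as_bytes", ctypes.c_char * 4)]

v = IntBytes()
v.as_bytes = b"test"
v.as_int += 1                  # arithmetic on the shared memory
print(v.as_int, v.as_bytes)    # 1953719669 b'uest' on a little-endian system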
Otherwise, if your needs are simpler, you could just convert bytes / bytearray objects from / to integers using the built-in int.to_bytes and int.from_bytes methods.
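For example, using little-endian order to match the C snippet above:

n = int.from_bytes(b"test", "little")  # 1953719668
n += 1
print(n.to_bytes(4, "little"))         # b'uest'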
Python3 has unicode strings (str) and bytes. We already have bytestring literals and methods. Why do we need two different types, instead of just byte strings of various encodings?
The answer to your question depends on the meaning of the word "need."
We certainly don't need the str type in the sense that everything we can compute with the type we can also compute without it (as you know quite well from your well-worded question).
But we can also understand "need" from the point of view of convenience. Isn't it nice to have a sqrt function? Or log or exp or sin? You could write these yourself, but why bother? A standard library designer will add functions that are useful and convenient.
It is the same for the language itself. Do we "need" a while loop? Not really; we could use tail-recursive functions. Do we "need" list comprehensions? Tons of things in Python are not primitive. For that matter, do we "need" high-level languages at all? John von Neumann himself once asked "why would you want more than machine language?"
It is the same with str and bytes. The type str, while not necessary, is a nice, time-saving, convenient thing to have. It gives us an interface to text as a sequence of characters (see the short example after this list), so that we can manipulate it character-by-character without:
us having to write all the encoding and decoding logic ourselves, or
bloating the string interface with multiple sets of iterators, like each_byte and each_char.
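To make the character view versus the byte view concrete (assuming UTF-8):

>>> len("héllo")                  # five characters
5
>>> len("héllo".encode("utf-8"))  # six bytes: the é takes two
6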
As you suspect, we could have one type which exposes the byte sequence and the character sequence (as Ruby's String class does). The Python designers wanted to separate those usages into two separate types. You can convert an object of one type into the other very easily. By having two types, they are saying that separation of concerns (and usages) is more important than having fewer built-in types. Ruby makes a different choice.
TL;DR It's a matter of preference in language design: separation of concerns by distinct type rather than by different methods on the same type.
Because bytes should not be considered strings, and strings should not be considered bytes. Python3 gets this right, no matter how jarring this feels to the brand new developer.
In Python 2.6, if I read data from a file and passed the "r" flag, the text would be read in the current locale by default, giving a string, while passing the "rb" flag would produce a series of bytes. Indexing the data is entirely different, and a method that takes a str gives no hint about whether I am passing bytes or text. This gets worse because, for ASCII data, the two are often synonymous, meaning that code which works in simple test cases or English locales will fail upon encountering non-ASCII characters.
There was therefore a conscious effort to ensure bytes and strings were not identical: that one was a sequence of "dumb bytes", and the other was a Unicode string stored with a fixed width per character to preserve O(1) indexing (one, two, or four bytes per character -- Latin-1, UCS-2, or UCS-4 -- depending on the widest character in the string, per PEP 393).
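In Python 3 the separation shows up directly at the prompt:

>>> "abc"[0]     # indexing a str yields a one-character str
'a'
>>> b"abc"[0]    # indexing bytes yields an integer
97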
In Python 2, the Unicode string was used to disambiguate text from "dumb bytes"; however, many users treated str as text.
Or, to quote the Benevolent Dictator:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.
tl;dr version
Forcing the separation of bytes and str makes coders conscious of their difference, at the cost of short-term dissatisfaction but in exchange for better code long-term. It's a conscious choice after years of experience: forcing you to confront the difference immediately will save you days in a debugger later.
Byte strings with different encodings are incompatible with each other, but until Python 3 there was nothing in the language to remind you of this fact. It turns out that mixing different character encodings is a surprisingly common problem in today's world, leading to far too many bugs.
Also, it's often just easier to work with whole characters, without having to worry that you modified a byte and accidentally turned your 4-byte character into an invalid sequence.
There are at least two reasons:
the str type has an important property "one element = one character".
the str type does not depend on encoding.
Just imagine how you would implement a simple operation like reversing a string (rword = word[::-1]) if word were a bytestring in some multi-byte encoding.
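A concrete demonstration, assuming UTF-8, where é occupies two bytes:

>>> "héllo"[::-1]                  # reversing a str is safe
'olléh'
>>> "héllo".encode("utf-8")[::-1]  # reversing the bytes splits the é
b'oll\xa9\xc3h'

The reversed byte string is no longer valid UTF-8, because the two bytes of the é now appear in the wrong order.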
Python PEP 3137 introduced bytearray as a mutable 8-bit array type. However, a list of the immutable bytes type accomplishes the same goal, and actually has better performance, albeit with perhaps clunkier syntax. This new type contradicts the Zen of Python:
There should be one-- and preferably only one --obvious way to do it.
So my question is: is there any documented major advantage or design consideration for using a bytearray over a list of bytes?
So far, I have not found a piece of motivation documented in the PEP or in the documentation pages. In fact, the documentation treats them as near-equals:
The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences...
And then,
List and bytearray objects support additional operations that allow in-place modification of the object. Other mutable sequence types (when added to the language) should also support these operations.
As bytearrays are statically typed (as 8-bit unsigned integers), one might expect a performance increase, but as mentioned above the inverse is probably true. Also, there should be no memory advantage for a bytearray over a list of bytes. I could imagine that there was a need for an itertools.chain-style mutable type, but this is not mentioned anywhere and does not seem to be the design goal.
First, a list-style container for sequences of objects that can seamlessly iterate through them (something that itertools.chain provides, actually) definitely seems convenient. As PEP 3137 mentions, the array module might have served for this purpose, but was "far from ideal".
The authors wished for "both a mutable and an immutable bytes type", which might mean that the design goal was having the same interface for both a mutable implementation (perhaps using non-contiguous memory for fast insertions and deletions) and an immutable one (definitely contiguous memory). This created, as far as I can tell, a jumble of sequence types, including array.array, bytes, bytearray, str, unicode, list, tuple and generators. In essence, they provide interfaces to statically typed or duck-typed, mutable or immutable sequences which are stored in memory or evaluated on the fly (generators). Most of these do follow the rationale of the Abstract Base Classes, but I do think that there is more design to be done before this can be deemed Pythonic.
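On the memory claim in the question, the two are easy to compare: on a 64-bit CPython a list stores an 8-byte pointer per element, before even counting the int objects those pointers reference, while a bytearray stores one raw byte per element.

import sys
print(sys.getsizeof(list(range(256))))       # about 2 KiB of pointers alone
print(sys.getsizeof(bytearray(range(256))))  # under 300 bytes, as shown earlier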
Note: I'm leaving this answer open for editing hoping that people might contribute insights or corrections.
After an initial search on this, I'm a bit lost.
I want to use a buffer object to hold a sequence of Unicode code points. I just need to scan and extract tokens from the sequence, so basically this is a read-only buffer, and I need functionality to advance a pointer within the buffer and to extract sub-segments. The buffer object should of course support the usual regex and search operations on strings.
An ordinary Unicode string can be used for this, but the issue would be creating sub-string copies to simulate advancing a pointer within the buffer. This seems very inefficient, especially for larger buffers, unless there's some workaround.
I can see that there's a memoryview object that would be suitable, but it does not support Unicode (?).
What else can I use to provide the above functionality? (Whether in Py2 or Py3).
It depends on what exactly is needed, but usually just one Unicode string is enough. If you need to keep non-tiny slices around, you can represent them as 3-tuples (big unicode, start pos, end pos) or as custom objects with those 3 attributes and whatever API is needed. The point is that a lot of methods like unicode.find() or a regex pattern object's search() support specifying start and end points, so you can do most basic things without actually slicing the single big unicode string.
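A minimal sketch of that approach, advancing a position instead of slicing (the token pattern is just an example):

import re

text = "alpha beta gamma"
word = re.compile(r"\w+")
pos = 0                          # our "pointer" into the buffer
while True:
    m = word.search(text, pos)   # scan from pos; no substring copies
    if m is None:
        break
    print(m.group())             # alpha, beta, gamma
    pos = m.end()                # advance the pointer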
I want to send lots of numbers via ZeroMQ, but converting them to str is inefficient. What is the best way to send numbers via zmq?
A few options:
use Python's struct.pack / struct.unpack functions, e.g. struct.pack("!L", 1234567) (see the sketch after this list)
use another serializer like msgpack
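A sketch of the struct option, assuming sock is an already-connected zmq socket (the "!L" format limits values to unsigned 32-bit integers):

import struct

payload = struct.pack("!L", 1234567)  # 4 bytes in network (big-endian) order
# sock.send(payload)                  # ship the raw bytes over zmq
# (n,) = struct.unpack("!L", sock.recv())   # decode on the receiving side
(n,) = struct.unpack("!L", payload)
assert n == 1234567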
You state that converting numbers to str is inefficient. And yet, unless you have a truly exotic network, that is exactly what must occur no matter what solution is chosen, because all networks in wide use today are byte-based.
Of course, some ways of converting numbers to byte-strings are faster than others. Performing the conversion in C code will likely be faster than in Python code, but consider also whether it is acceptable to exclude "long" (bignum) integers. If excluding them is not acceptable, the str function may be as good as it gets.
The struct and cPickle modules may perform better than str if excluding long integers is acceptable.