How to implement a Unicode buffer in Python

After an initial search on this, I'm a bit lost.
I want to use a buffer object to hold a sequence of Unicode code points. I just need to scan and extract tokens from that sequence, so this is essentially a read-only buffer, and I need the ability to advance a pointer within the buffer and to extract sub-segments. The buffer object should of course support the usual regex and search operations on strings.
An ordinary Unicode string can be used for this, but the issue would be creating sub-string copies to simulate advancing a pointer within the buffer. That seems very inefficient, especially for larger buffers, unless there's some workaround.
I can see that there's a memoryview object that would be suitable, but it does not support Unicode (?).
What else can I use to provide the above functionality? (Whether in Py2 or Py3).

It depends on what exactly is needed, but usually just one Unicode string is enough. If you need to take non-tiny slices, you can keep them as 3-tuples (big unicode, start pos, end pos), or just make custom objects with these three attributes and whatever API is needed. The point is that many methods such as unicode.find() or a compiled regex pattern's search() accept start and end positions, so you can do most basic things without actually slicing the single big unicode string.
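For illustration, a minimal sketch of that idea; the Cursor name and its token() helper are made up for this answer, not part of any library:
import re

NUMBER = re.compile(r"\d+")

class Cursor:
    def __init__(self, text, pos=0):
        self.text = text   # the single big unicode string, never sliced
        self.pos = pos     # current scan position

    def token(self, pattern):
        # Pattern.match() accepts a start position, so no copy is made
        m = pattern.match(self.text, self.pos)
        if m:
            self.pos = m.end()   # "advance the pointer"
        return m

cur = Cursor("12 apples and 34 oranges")
print(cur.token(NUMBER).group())          # '12'
print(cur.text.find("apples", cur.pos))   # str.find() also takes start/end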

Related

read list of strings as file, python3

I have a list of strings, and would like to pass this to an api that accepts only a file-like object, without having to concatenate/flatten the list to use the likes of StringIO.
The strings are UTF-8, don't necessarily end in newlines, and if naively concatenated could be used directly in StringIO.
The preferred solution would be within the standard library (Python 3.8). Given that the shape of the data is naturally similar to a file (~identical to readlines(), obviously) and the memory access pattern would be efficient, I have a feeling I'm just failing to DuckDuckGo correctly - but if that doesn't exist, any "streaming" (no data concatenation) solution would suffice.
[Update, based on #JonSG's links]
Both RawIOBase and TextIOBase appear to provide an API that decouples arbitrarily sized "chunks"/fragments (in my case: strings in a list) from a file-like read that can specify its own read chunk size, while streaming the data itself (memory cost grows only by some window at any given time, depending of course on the behavior of your source & sink).
RawIOBase.readinto looks especially promising because it hands you the very buffer that client reads receive, allowing much simpler code - but this appears to come at the cost of one full copy (into that buffer).
TextIOBase.read() has its own cost for solving the same subproblem, namely concatenating k chunks together (k much smaller than N).
I'll investigate both of these.
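For reference, a rough sketch of the readinto approach; the ListReader name is made up, and it assumes the fragments should be exposed as UTF-8 bytes:
import io

class ListReader(io.RawIOBase):
    # Expose a list of str as a binary file-like object, one fragment at a time
    def __init__(self, fragments):
        self._fragments = iter(fragments)
        self._pending = b""          # leftover bytes of the current fragment

    def readable(self):
        return True

    def readinto(self, buffer):
        while not self._pending:     # refill only when the window runs dry
            try:
                self._pending = next(self._fragments).encode("utf-8")
            except StopIteration:
                return 0             # EOF
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n

# Wrap it for convenient text reads:
reader = io.TextIOWrapper(io.BufferedReader(ListReader(["abc", "def\n", "ghi"])), encoding="utf-8")
print(reader.read(4))                # 'abcd'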

What are the differences between the modules unicode and unicodedata?

I have a large dataset with over 2 million rows of textual data. Now I want to remove the accents from the strings.
In the link below, two different modules are described to remove the accents:
What is the best way to remove accents in a Python unicode string?
The modules described are unicode and unicodedata. To me it's not clear what the differences are between the two and a comparison is hard, because I don't have many rows with accents and I don't know what accents might be replaced and which ones are not.
Therefore, I would like to know what the differences are between the two and which one is recommended to use.
There is only one module: unicodedata, which includes the Unicode database, i.e. the names and properties of Unicode code points.
unicode was a built-in type in Python 2, not a module. Calling it just converted strings to Unicode strings, so it dealt only with decoding and did not need the whole Unicode database. In Python 3 all strings are Unicode (with some particularities); only the encoding now has to be given explicitly.
In that answer you see only import unicodedata, so only one module. To remove accents you need not just the Unicode code points, but also information about the category of each code point (whether it is a combining character), so you need unicodedata.
Maybe you mean unidecode. That is a dedicated module, but outside the standard library. It can be useful for some purposes: it is simple and gives results only in the ASCII domain. That can be fine in some cases, but it can cause problems outside the Latin writing system.
On the other hand, unicodedata does nothing for you by itself. You have to understand Unicode and apply the right filter function yourself (and maybe know how other languages work).
So it depends on the case, and maybe you just need some other slug function (to create a non-escaped string). When working with languages, take care not to overdo things (you might build an offensive word).
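For the unicodedata route, the kind of filter that answer relies on looks roughly like this (assuming NFKD decomposition is acceptable for your data):
import unicodedata

def strip_accents(text):
    decomposed = unicodedata.normalize("NFKD", text)   # split base char + accent
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café crème"))   # 'cafe creme'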

C-Style data fiddling in Python

Is there any Python module that allows one to freely reinterpret raw bytes as different data types and perform various arithmetic operations on them? For example, take a look at this C snippet:
char test[9] = "test_data";
int32_t *x = (int32_t *)test;
(*x)++;
printf("%d %.9s\n", *x, test);
//outputs 1953719669 uest_data on an LE system.
Here, both test and x point to the same chunk of memory. I can use x to perform arithmetic and test for string-based operations and individual byte access.
I would like to do the same in Python - I know that I can use struct.pack and struct.unpack whenever I feel the need to convert between a list of bytes and an integer, but maybe there's a module that makes such fiddling much easier.
I'd like to have a datatype that would support arithmetic and, at the same time, allow me to mess with the individual bytes. This doesn't sound like a use case that Python was designed for, but nevertheless it's quite a common problem.
I took a look into pycrypto codebase, since cryptography is one of the most common use cases for such functionality, but pycrypto implements most of its algorithms in plain C and the much smaller Python part uses two handcrafted methods (long_to_bytes and bytes_to_long) for conversion between the types.
You may be able to use ctypes.Union, which is analogous to Unions in C.
I don't have much experience in C, but as I understand it, unions allow you to write to a memory location as one type and read it as another, and vice versa, which seems to fit your use case.
https://docs.python.org/3/library/ctypes.html#structures-and-unions
Otherwise, if your needs are simpler, you could just convert bytes / bytearray objects from / to integers using the built-in int.to_bytes and int.from_bytes methods.
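To illustrate, here is a rough Python counterpart of the C snippet above using a ctypes union, plus the simpler int.from_bytes/int.to_bytes route; the Overlay name is made up for this example:
import ctypes

class Overlay(ctypes.Union):
    _fields_ = [("text", ctypes.c_char * 9),
                ("number", ctypes.c_int32)]

buf = Overlay()
buf.text = b"test_data"
buf.number += 1                      # increment the first 4 bytes viewed as an int32
print(buf.number, buf.text)          # 1953719669 b'uest_data' on a little-endian system

# Without ctypes:
n = int.from_bytes(b"test", "little") + 1
print(n.to_bytes(4, "little"))       # b'uest'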

Why do we need str type? Why not just byte-strings?

Python3 has unicode strings (str) and bytes. We already have bytestring literals and methods. Why do we need two different types, instead of just byte strings of various encodings?
The answer to your question depends on the meaning of the word "need."
We certainly don't need the str type in the sense that everything we can compute with the type we can also compute without it (as you know quite well from your well-worded question).
But we can also understand "need" from the point of view of convenience. Isn't it nice to have a sqrt function? Or log or exp or sin? You could write these yourself, but why bother? A standard library designer will add functions that are useful and convenient.
It is the same for the language itself. Do we "need" a while loop? Not really, we can use tail-recursive functions. Do we "need" list comprehensions? Tons of things in Python are not primitive. For that matter, do we "need" high-level languages? John von Neumann himself once asked, "why would you want more than machine language?"
It is the same with str and bytes. The type str, while not necessary, is a nice, time-saving, convenient thing to have. It gives us an interface as a sequence of characters, so that we can manipulate text character-by-character without:
us having to write all the encoding and decoding logic ourselves, or
bloating the string interface with multiple sets of iterators, like each_byte and each_char.
As you suspect, we could have one type which exposes the byte sequence and the character sequence (as Ruby's String class does). The Python designers wanted to separate those usages into two separate types. You can convert an object of one type into the other very easily. By having two types, they are saying that separation of concerns (and usages) is more important than having fewer built-in types. Ruby makes a different choice.
TL;DR It's a matter of preference in language design: separation of concerns by distinct type rather than by different methods on the same type.
Because bytes should not be considered strings, and strings should not be considered bytes. Python3 gets this right, no matter how jarring this feels to the brand new developer.
In Python 2.6, if I read data from a file, and I passed the "r" flag, the text would be read in the current locale by default, which would be a string, while passing the "rb" flag would create a series of bytes. Indexing the data is entirely different, and methods that take a str may be unsure of whether I am using bytes or a str. This gets worse since for ASCII data the two are often synonymous, meaning that code which works in simple test cases or English locales will fail upon encountering non-ASCII characters.
There was therefore a conscious effort to ensure bytes and strings were not identical: that one was a sequence of "dumb bytes", and the other was a Unicode string stored in whatever internal representation suits the data while preserving O(1) indexing (Latin-1, UCS-2, or UCS-4, depending on the data, per PEP 393).
In Python 2, the Unicode string was used to disambiguate text from "dumb bytes", however, str was treated as text by many users.
Or, to quote the Benevolent Dictator:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.
tl;dr version
Forcing the separation of bytes and str makes coders conscious of the difference, at the cost of short-term dissatisfaction, but with better code in the long term. It's a conscious choice after years of experience: forcing you to be conscious of the difference immediately will save you days in a debugger later.
Byte strings with different encodings are incompatible with each other, but until Python 3 there was nothing in the language to remind you of this fact. It turns out that mixing different character encodings is a surprisingly common problem in today's world, leading to far too many bugs.
Also it's often just easier to work with whole characters, without having to worry that you just modified a byte that accidentally rendered your 4-byte character into an invalid sequence.
There are at least two reasons:
the str type has an important property "one element = one character".
the str type does not depend on encoding.
Just imagine how you would implement a simple operation like reversing a string (rword = word[::-1]) if word were a bytestring in some multi-byte encoding.
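A quick illustration of why that matters:
word = "héllo"
print(word[::-1])                    # 'olléh' -- reversing characters is trivial

raw = word.encode("utf-8")           # b'h\xc3\xa9llo'
backwards = raw[::-1]                # b'oll\xa9\xc3h' -- the two bytes of 'é' are now swapped
print(backwards.decode("utf-8", errors="replace"))   # mojibake: replacement characters where 'é' was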

Use cases for bytearray

Can you point out a scenario in which Python's bytearray is useful? Is it simply a non-unicode string that supports list methods which can be easily achieved anyway with str objects?
I understand some think of it as "the mutable string". However, when would I need such a thing? And unlike with strings or lists, I am strictly limited to ASCII, so in any case I'd prefer the others - is that not true?
Bach,
bytearray is a mutable block of memory. To that end, you can push arbitrary bytes into it and manage your memory allocations carefully. While many Python programmers never worry about memory use, there are those of us that do. I manage buffer allocations in high load network operations. There are numerous applications of a mutable array -- too many to count.
It isn't correct to think of the bytearray as a mutable string. It isn't. It is an array of bytes. Strings have a specific structure. bytearrays do not. You can interpret the bytes however you wish. A string is one way. Be careful though, there are many corner cases when using Unicode strings. ASCII strings are comparatively trivial. Modern code runs across borders. Hence, Python's rather complete Unicode-based string classes.
A bytearray exposes its memory through the buffer protocol. This very rich protocol is the foundation of Python's interoperation with C and enables numpy and other direct memory-access technologies. In particular, it allows memory to be easily structured into multi-dimensional arrays in either C or FORTRAN order.
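As a small sketch of that buffer-management pattern (io.BytesIO stands in here for a real socket or file):
import io

buf = bytearray(4096)                # one fixed allocation, reused for every read
view = memoryview(buf)               # zero-copy windows into it

stream = io.BytesIO(b"example payload " * 100)
while True:
    n = stream.readinto(buf)         # fills buf in place, returns number of bytes read
    if n == 0:
        break
    chunk = view[:n]                 # no copy made; process the chunk here
    chunk[0:1] = b"E"                # a bytearray can be mutated byte by byte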
You may never have to use a bytearray.
Anon, Andrew
deque seems to need much more space than bytearray.
>>> sys.getsizeof(collections.deque(range(256)))
1336
>>> sys.getsizeof(bytearray(range(256)))
293
I guess this is because of the layout.
If you need samples using bytearray I suggest searching the online code with nullege.
bytearray also has one more advantage: you do not need to import anything for that. This means that people will use it - whether it makes sense or not.
Further reading about bytearray: the bytes type in Python 2.7 and PEP 358
