LZ77 compression reserved bytes "< , >"

I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So how do I compress a file that contains these bytes, if I can neither compress them nor replace them with different bytes (because decoders wouldn't be able to read the result)? Is there a way? Or do decoders only decode when there is an exact <d, l> string? (If so, imagine that by coincidence we find these bytes in a file. What would happen?)
Thanks!

LZ77 is about referencing strings back in the decompression buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references. Different implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from the already decompressed portion).
One way to do it is reserving some characters as "special" (so called "escape sequences"). You can do it the way you did it, that is, by using < to mark the start of a back-reference. But then you also need a way to output < if it is a literal. You can do it, for example, by establishing that when after < there's another <, then it means a literal, and you just output one <. Or, you can establish that if after < there's immediately >, with nothing in between, then that's not a back-reference, so you just output <.
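To make the first escaping rule concrete, here is a minimal sketch of a decoder for such a textual format, assuming "<<" stands for a literal "<" and "<d,l>" is a back-reference. The function names and the exact token syntax are my own choices for illustration, not part of any standard LZ77 format.

```python
def escape_literal(text: str) -> str:
    """Escape literal text so that '<' can never be mistaken for a back-reference."""
    return text.replace("<", "<<")


def decode(encoded: str) -> str:
    """Decode text where '<d,l>' is a back-reference and '<<' is a literal '<'."""
    out = []  # characters decoded so far
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch != "<":
            out.append(ch)  # ordinary literal (',' and '>' included)
            i += 1
        elif encoded[i + 1] == "<":
            out.append("<")  # escaped literal '<'
            i += 2
        else:
            end = encoded.index(">", i)
            distance, length = map(int, encoded[i + 1:end].split(","))
            start = len(out) - distance
            for k in range(length):  # copy one by one so overlapping copies work
                out.append(out[start + k])
            i = end + 1
    return "".join(out)


print(decode("abc<3,3>"))                  # back-reference repeats "abc"
print(decode(escape_literal("a<b,c>d")))   # round-trips text containing '<', ',' and '>'
```

Note that only "<" needs escaping here: "," and ">" are special only after an opening "<", so they pass through as ordinary literals everywhere else.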
The textual <d,l> form also wouldn't be the most efficient way to encode those back-references, because it uses several bytes per back-reference, so it pays off only when referencing strings longer than those several bytes. For shorter matches it will inflate the data instead of compressing it, unless you establish that matches shorter than the encoded back-reference are left as literals instead of generating back-references. But again, this means lower compression gains.
If you compress only plain old ASCII text, you can employ a better encoding scheme, because ASCII uses just 7 out of 8 bits in a byte. So you can use the highest bit to signal a back-reference, use the remaining 7 bits as the length, and use the next 2 bytes as the back-reference's distance. This way you can always tell for sure whether the next byte is a literal ASCII character or a back-reference, by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the remaining 7 bits as the length, and read the next 2 bytes as the distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences more than 3 characters long.
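A decoder for this high-bit scheme could look roughly like the sketch below. The byte layout (flag bit plus 7-bit length, followed by a 2-byte big-endian distance) is an assumption I made to match the description; a real format would have to fix the byte order and the allowed ranges explicitly.

```python
def decode_ascii_lz(data: bytes) -> bytes:
    """Decode a stream of ASCII literals and 3-byte back-references.

    Assumed layout of a back-reference:
      byte 0: 1LLLLLLL  (high bit set, low 7 bits = length)
      bytes 1-2: distance, big-endian
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if (b & 0x80) == 0:
            out.append(b)  # plain ASCII literal
            i += 1
        else:
            length = b & 0x7F
            distance = int.from_bytes(data[i + 1:i + 3], "big")
            start = len(out) - distance
            for k in range(length):  # byte-by-byte copy allows overlapping matches
                out.append(out[start + k])
            i += 3
    return bytes(out)


# "abcd" followed by a back-reference: length 4, distance 4
packed = b"abcd" + bytes([0x80 | 4, 0x00, 0x04])
print(decode_ascii_lz(packed))
```

With this layout a match of 3 characters or fewer gains nothing, which is why an encoder would leave such short matches as literals.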
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often have the shortest codes, and those which are rare have longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code is a prefix of any other code. When your codes have this property, you can always distinguish them by reading bits in sequence until you have decoded one of them. Then you can be sure that you won't get any other valid item by reading more bits. The next bit always starts a new sequence. To produce such codes, you can use Huffman trees. You can then join all your bytes and the different lengths of references into one such tree and generate distinct bit codes for them, depending on their frequency. When you decode them, you just read bits until you reach the code of one of these elements, and then you know for sure whether it is the code of a literal character or the code for a back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what the DEFLATE compression scheme does. But this is a whole other story, and you will find the details in the RFC supplied by @MarkAdler.
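To illustrate just the prefix-code idea (not DEFLATE's actual bit format, which is considerably more involved), here is a small sketch that builds Huffman codes for the characters of a non-empty string and decodes a bit string by reading one bit at a time. The helper names are made up for this example.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    # Heap entries: (frequency, tie_breaker, {symbol: code_so_far})
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


def decode_bits(bits: str, codes: dict[str, str]) -> str:
    """Read bits one by one; a symbol is emitted as soon as a full code is seen."""
    by_code = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:  # unambiguous because no code prefixes another
            out.append(by_code[current])
            current = ""
    return "".join(out)


text = "abracadabra"
codes = huffman_codes(text)
bits = "".join(codes[ch] for ch in text)
print(codes)
print(len(bits), "bits instead of", 8 * len(text))
print(decode_bits(bits, codes))
```

In a DEFLATE-like scheme the alphabet would also contain symbols for match lengths, so the same bit-by-bit reading tells the decoder whether it has a literal or the start of a back-reference.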

If I understand your question correctly, it makes no sense. There are no "reserved bytes" for the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.

'utf-8' codec can't decode byte 0xe1: unexpected end of data [duplicate]

I am quite confused about the concept of character encoding.
What is Unicode, GBK, etc? How does a programming language use them?
Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?

ASCII is fundamental
Originally 1 character was always stored as 1 byte. A byte (8 bits) can distinguish 256 possible values. But in fact only the first 7 bits were used, so only 128 characters were defined. This set is known as the ASCII character set.
0x00 - 0x1F contain control codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
0x20 - 0x40 contain numbers and punctuation
0x41 - 0x7F contain mostly alphabetic characters
0x80 - 0xFF the 8th bit = undefined.
French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".
The problem is that the 1 additional bit does not have enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).
Popular question: "Is ASCII a character set or is it an encoding?" ASCII is a character set. However, in programming, charset and encoding are widely used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.
Unicode goes one step further
Unicode is a great example of a character set - not an encoding. It uses the same characters as the ASCII standard, but it extends the list with additional characters, and gives each character a code point written in the format U+XXXX. It has the ambition to contain all characters (and popular icons) used in the entire world.
UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way of encoding the code points. UTF-8 uses only 1 byte when encoding an ASCII character, giving the same output as plain ASCII. For every other character it uses 2 to 4 bytes: the high bit of each of those bytes is set, and the leading bits of the first byte indicate how many bytes follow.
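A quick way to see this in Python is to encode a few characters and look at the resulting bytes. This is only an illustration; the helper name utf8_bits and the sample characters are my own picks.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated bit strings."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


# "A", e with acute accent, a Chinese character, an emoji
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s): {utf8_bits(ch)}")
```

The ASCII character comes out as a single byte starting with 0, while every byte of the longer sequences starts with 1.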
GBK is an encoding which, just like UTF-8, uses multiple bytes. The principle is pretty much the same. ASCII characters take a single byte in which only 7 bits are used. But just like with UTF-8, the 8th bit can be set to indicate the presence of a 2nd byte, and the pair then encodes one of 22,000 Chinese characters. The main difference is that GBK does not follow the Unicode character set; instead it uses a Chinese character set.
Decoding data
When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).
There still is a lack of awareness about this, as still many developers don't even know what an encoding is.
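Here is a small Python illustration of what goes wrong when the decoder assumes a different encoding than the one used for encoding. The sample string is arbitrary.

```python
text = "caf\u00e9"              # "cafe" with an acute accent on the e
data = text.encode("utf-8")     # the accented letter becomes 2 bytes: 0xC3 0xA9

right = data.decode("utf-8")    # same encoding on both sides: round-trips
wrong = data.decode("latin-1")  # wrong guess: no error, but 2 garbage characters

print(ascii(right))
print(ascii(wrong))

try:
    data.decode("ascii")        # ASCII cannot represent bytes >= 0x80 at all
except UnicodeDecodeError as exc:
    print("ascii failed:", exc.reason)
```

The latin-1 case is the nastier one in practice, because nothing fails: the program just carries on with the wrong text.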
Mime types
Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example of how the HTTP protocol defines its content type using a mime type declaration.
Content-Type: text/html; charset=utf-8
And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally describe how the data is encoded (i.e. charset=utf-8). Two points of confusion:
Not all mime types declare an encoding. In some cases it is optional, and sometimes it is completely pointless.
The syntax charset=utf-8 adds to the semantic confusion, because as explained earlier, UTF-8 is an encoding and not a character set. But as explained earlier, some people just use the 2 words interchangeably.
For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the encoding attribute of the <?xml ... ?> declaration. If it's there, then they will reopen the file using that encoding.
The same problem exists when sending e-mails. An e-mail can contain a html message or just plain text. Also in that case mime types are used to define the type of the content.
But in summary, a mime type isn't always sufficient to solve the problem.
Data types in programming languages
In case of Java (and many other programming languages) in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters because their content is stored in different ranges.
A byte is stored as a signed byte (range: -128 to 127).
The char type in Java is stored in 2 unsigned bytes (range: 0 to 65535).
A stream returns an integer in the range -1 to 255.
If you know that your data only contains ASCII values, then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.
// the -1 indicates that there is no data
int input = stream.read();
if (input == -1) throw new EOFException();
// bytes must be made positive first.
byte myByte = (byte) input;
int unsignedInteger = myByte & 0xFF;
char ascii = (char)(unsignedInteger);
Shortcuts
The shortcut in Java is to use readers and writers and to specify the encoding when you instantiate them.
// wrap your stream in a reader.
// specify the encoding
// The reader will decode the data for you
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
As explained earlier for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.

(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)
A byte can only have 256 distinct values, being 8 bits.
Since there are character sets with more than 256 characters in the character set one cannot in general simply say that each character is a byte.
Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.
Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.
As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java uses a version of Unicode where every character is assigned a 16-bit value (and this is why Java characters are 16 bits wide and have integer values from 0 to 65535). When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.

Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.
You don't know what the characters are or how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and frankly, you don't care.
What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.
It is what allows you to write software that works regardless of the way languages are represented in binary.

Most computer programs must communicate with a person using some text in a natural language (a language used by humans). But computers have no fundamental means for representing text: the fundamental computer representation is a sequence of bits organized into bytes and words, with hardware support for interpreting sequences of bits as fixed width base-2 (binary) integers and floating-point real numbers. Computer programs must therefore have a scheme for representing text as sequences of bits. This is fundamentally what character encoding is. There is no inherently obvious or correct scheme for character encoding, and so there exist many possible character encodings.
However, practical character encodings have some shared characteristics.
Encoded texts are divided into a sequence of characters (graphemes).
Each of the known possible characters has an encoding. The encoding of a text consists of the sequence of the encoding of the characters of the text.
Each possible (allowed) character is assigned a unique unsigned (non negative) integer (this is sometimes called a code point). Texts are therefore encoded as a sequence of unsigned integers. Different character encodings differ in the characters they allow, and how they assign these unique integers. Most character encodings do not allow all the characters used by the many human writing systems (scripts) that do and have existed. Thus character encodings differ in which texts they can represent at all. Even character encodings that can represent the same text can represent it differently, because of their different assignment of code points.
The unsigned integer encoding a character is encoded as a sequence of bits. Character encodings differ in the number of bits they use for this encoding. When those bits are grouped into bytes (as is the case for popular encodings), character encodings can differ in endianness. Character encodings can differ in whether they are fixed width (the same number of bits for each encoded character) or variable width (using more bits for some characters).
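As a concrete illustration of the last point, the sketch below encodes one and the same code point (U+20AC, the euro sign, chosen arbitrarily) with several encodings from Python's standard codec set and shows the resulting bytes.

```python
ch = "\u20ac"  # EURO SIGN, code point U+20AC

# Same code point, different byte sequences depending on width and endianness.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")}
for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes: {data.hex(' ')}")
```

UTF-16-BE and UTF-16-LE contain the same two bytes in opposite order, UTF-32 always spends 4 bytes per character, and UTF-8 uses a variable number.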
Therefore, if a computer program receives a sequence of bytes that are meant to represent some text, the computer program must know the character encoding used for that text, if it is to do any kind of manipulation of that text (other than regarding it as an opaque value and forwarding it unchanged). The only possibilities are that the text is accompanied by additional data that indicates the encoding used or the program requires (assumes) that the text has a particular encoding.
Similarly, if a computer program must send (output) text to another program or a display device, it must either tell the destination the character encoding used or the program must use the encoding that the destination expects.
In practice, almost all problems with character encodings are caused when a destination expects text sent using one character encoding, and the text is actually sent with a different character encoding. That in turn is typically caused by the computer programmer not bearing in mind that there exist many possible character encodings, and that their program can not treat encoded text as opaque values, but must convert from an external representation on input and convert to an external representation on output.

Which are the advantages of byte objects over string objects in Python?

I understand the differences between byte/bytearray and string in Python and how to handle/manipulate/convert these objects but I cannot find real life scenarios/examples where you would prefer to work with bytes instead of strings in the code.
Which are the advantages of byte objects over string objects in Python?
And in which real life scenarios should you convert strings into bytes in your code, and why?

For all modern computer architectures, a byte consists of 8 bits and thus can encode 256 distinct values.
In the ASCII character encoding, there are only 128 different values, with only a subset of those being printable. With UTF-8 it gets a little more complicated, but you end up with a similar problem: not all byte sequences are representable as a string. So anytime you have a sequence of bytes that is not representable as a string, you have to use bytes or bytearray.
One example of when you might need to use bytes, is when working with crypto and pseudo-random sequence generation, where you will often end up with a sequence of bytes that cannot be represented 1-to-1 as a string. This is because you want to work with as large as possible an output space when generating pseudo-random numbers and sequences. See for example secrets.token_bytes from the stdlib.
If you want to represent such a sequence as a string, it's possible to encode it into a sequence of bytes that are all inside the ASCII encoding space, but of course, at the cost of using more bytes. For example, you can encode it as hex characters or in base64. Hex has the advantage that the size of the resulting string is always 2 * n_bytes, while base64 is more compact, using 4 characters for every 3 bytes. Note that the secrets stdlib module also gives you convenience functions that do this conversion for you.
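For example, with the standard library only (the 16-byte size is an arbitrary choice):

```python
import base64
import secrets

raw = secrets.token_bytes(16)                    # 16 random bytes, any value 0-255

as_hex = raw.hex()                               # 2 characters per byte
as_b64 = base64.b64encode(raw).decode("ascii")   # 4 characters per 3 bytes, padded

print(len(raw), len(as_hex), len(as_b64))

# Both text forms convert back to exactly the same bytes.
assert bytes.fromhex(as_hex) == raw
assert base64.b64decode(as_b64) == raw

# The convenience functions from secrets return the text form directly.
print(secrets.token_hex(16))
print(secrets.token_urlsafe(16))
```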

in which real life scenarios should you convert in your code strings into bytes and why?
One example is using a compression algorithm which works on bytes rather than str. Take a look at the examples for the built-in lzma module and note that it works with bytes rather than str. For a large amount of text, this allows more efficient use of available memory (i.e. storing the same text in less space).
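A minimal sketch of that round trip, with a made-up repetitive sample text:

```python
import lzma

text = "the quick brown fox jumps over the lazy dog\n" * 200

raw = text.encode("utf-8")        # str -> bytes before compressing
packed = lzma.compress(raw)       # lzma works on bytes-like objects, not on str
restored = lzma.decompress(packed).decode("utf-8")  # bytes -> str afterwards

print(len(raw), "bytes uncompressed,", len(packed), "bytes compressed")
assert restored == text
```

The encode and decode calls sit at the edges; everything in between deals purely with bytes.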

Can't regex string between string and line break - str(x) versus x.decode()

Why does a regex fail on a string created from a bytes object when line breaks are present?
That is, why does this fail to find a match (i.e. print 'Green') in a string created from str(obj):
import re
s = str(b'Package Name: Green\r\n Release version: 8.1\r\n')
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
When this succeeds using a string created from obj.decode()?
import re
s = (b'Package Name: Green\r\n Release version: 8.1\r\n').decode()
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
No matter what search pattern was tried, searching the string created by str(obj) failed to find a match...

The reason you get different results is that you’re doing different things. Calling str on a bytes with newline characters returns a string with a literal backslash and n; calling decode returns a string with a newline character in it. So, if you’re searching the results for newline characters, the second one will succeed, and the first will fail. And it’s the second one that you wanted.
In other words, using decode here is right, and str is wrong; that’s why you get different results. If you can’t think through the difference, try just printing them out: print(b.decode()); print(str(b)) and you’ll see the difference immediately.
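Spelled out as a runnable snippet, using the data from the question (the variable names are mine):

```python
import re

raw = b"Package Name: Green\r\n Release version: 8.1\r\n"
pattern = r"Package Name: (.*)\r\n"

as_repr = str(raw)             # the repr: starts with b' and contains backslash + r, backslash + n
as_text = raw.decode("ascii")  # the actual text, with real CR and LF characters

print(as_repr)
print(repr(as_text))

print(re.search(pattern, as_repr))            # None: there is no real line break to match
print(re.search(pattern, as_text).group(1))   # the captured package name
```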
In fact, you should usually be decoding the strings as soon as you receive them, and never looking at the bytes again. Then you never have to worry about the str representation of bytes objects (except maybe in some code that logs errors caused by invalid strings that you couldn’t decode). The only exception is when you know the bytes are some kind of encoded text, but can’t be sure what the encoding is. For example, if you’re parsing HTTP headers or email messages or Python source code, you don’t know the character set until you read part of the file and search it for special ASCII-encoded strings. Or, if you’re converting a bunch of old text files from Windows to Unix line endings and some are cp1252 while others are cp1250, you don’t care which is which because they both encode line endings the same way. For those cases, just stick with bytes, and search for b'\n' instead of '\n'.
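For those "stick with bytes" cases, both replace and the re module work directly on bytes as long as the patterns are bytes too. A rough sketch, with a made-up helper name and sample data:

```python
import re


def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows line endings without knowing the text encoding."""
    return data.replace(b"\r\n", b"\n")


# Works the same whether the file is cp1250 or cp1252: CR and LF are
# encoded identically in both, and all other bytes are left untouched.
czech = "\u0159e\u010d\r\n".encode("cp1250")
danish = "\u00f8l\r\n".encode("cp1252")
print(to_unix_newlines(czech), to_unix_newlines(danish))

# Searching raw bytes for an ASCII-encoded marker, e.g. in an HTTP header.
header = b"Content-Type: text/html; charset=utf-8\r\n"
match = re.search(rb"charset=([\w-]+)", header)
charset = match.group(1).decode("ascii")
print(charset)
```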
If you want to know why Python makes this so complicated:
bytes objects are used to store strings encoded in your default encoding—but they’re also used to store strings encoded in different encodings, and binary data that isn’t a string at all. And a bytes object has no idea which of those it’s storing; they’re all just sequences of numbers.
Python 2 effectively assumed that a bytes was being used to store a string in your default encoding, so it let you convert back and forth to Unicode by calling functions like str, or even concatenating a bytes and Unicode string. That turned out to be one of the biggest sources of errors in the language. You still see Python 2 users posting questions here every few days asking why they got a UnicodeEncodeError when they weren’t calling encode anywhere (or, worse, when they were calling decode), and fixing that was one of the main reasons for Python 3’s existence.
The human-readable representation of a bytes object has to be something that can be produced without error, and read unambiguously, whether it’s a string in the default encoding, a string in a completely different encoding, or a sequence of pixel brightness values ranging from 0 to 255. The compromise solution (for things like the HTTP headers case above) is the backslash-escaped quoted string.
By the way, during the Python 2 to 3 transition, the core devs assumed multiple people would come up with clever EncodedBytes types that carried around their encoding, and could therefore act more like Python 2 byte strings but without all the associated errors, and after a couple years one of them would be the clear winner on PyPI and maybe they could add it to Python 3.3 or so. That’s what you’re probably instinctively reaching for here. But, as it turned out, nobody used any such libraries, because it’s almost always easier to just decode and encode at the edges of your program and use Unicode everywhere, and the exceptions are almost always cases where you don’t know the encoding so EncodedBytes wouldn’t help.
One last thing: thinking of functions like str or float as “casts” is misleading. While it looks superficially similar to the way you do explicit casts in C or Java or Go or whatever language you’re used to, it has a very different meaning.

Python 3.5 base64 decoding seems to be incorrect?

In Python 3.5 the base64 module has a function, standard_b64decode(), for decoding strings from base64, which returns a bytes object.
When I run base64.standard_b64decode("wc==") the output is b'\xc1'. When you base64 encode b'\xc1', you get "wQ==". It looks like there is an error in the decoding function. Actually, I think "wc==" is an invalid base64 encoded string, by this reasoning:
wc== ends with ==, which means that it was produced from a single input byte.
The corresponding values of 'w' and 'c' in the regular base64 alphabet are, respectively, 48 and 28, meaning their 6-bit representations are, respectively, 110000 and 011100.
Concatenating these, the first 8 bits are 11000001, which is \xc1, but the remaining bits (1100) are non-zero, so couldn't have been produced by the padding process performed during base64 encoding, as that only appends bits with value 0, which means these extra 1 bits can't have been produced through valid base64 encoding -> the string is not a valid base64 encoded string.
I think this is true for any 4 character chunk of base64 encoding ending in == when any of the last 4 bits of the second character are 1.
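The bit arithmetic above can be checked with a few lines of Python. The alphabet string is the regular base64 alphabet; the helper name is mine.

```python
import string

# Regular base64 alphabet: A-Z, a-z, 0-9, '+', '/'
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def sextets(chars: str) -> list[str]:
    """6-bit binary representation of each base64 character."""
    return [format(B64_ALPHABET.index(c), "06b") for c in chars]


bits = "".join(sextets("wc"))   # 12 bits from the two data characters
first_byte = int(bits[:8], 2)   # the single decoded byte
leftover = bits[8:]             # pad bits, which a canonical encoder sets to 0

print(sextets("wc"), hex(first_byte), leftover)
print(sextets("wQ"))            # canonical encoding of the same byte
```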
I'm pretty convinced that this is right, but I'm rather less experienced than the Python developers.
Can anyone confirm the above, or explain why it's wrong, if indeed it is?

The Base64 standard is defined by RFC 4648. Your question is answered by §3.5:
Canonical Encoding
The padding step in base 64 and base 32 encoding can, if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders, which is described in the descriptions on padding below. If this property do not hold, there is no canonical representation of base-encoded data, and multiple base-encoded strings can be decoded to the same binary data. If this property (and others discussed in this document) holds, a canonical encoding is guaranteed.
In some environments, the alteration is critical and therefore decoders MAY chose to reject an encoding if the pad bits have not been set to zero.
The meaning of MAY is defined by RFC 2119:
MAY This word, or the adjective "OPTIONAL", mean that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item.
So Python is not obliged by the standard to reject non-canonical encodings.
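If you do need to reject non-canonical input in your own code, one simple approach (my suggestion, not something the base64 module provides as a dedicated function) is to decode and then re-encode, and compare the result with the original string:

```python
import base64
import binascii


def is_canonical_b64(s: str) -> bool:
    """True if s is exactly what a conforming encoder would produce for its payload."""
    try:
        decoded = base64.b64decode(s)
    except binascii.Error:
        return False  # not decodable at all
    return base64.b64encode(decoded).decode("ascii") == s


print(is_canonical_b64("wQ=="))  # canonical encoding of the byte 0xC1
print(is_canonical_b64("wc=="))  # same byte, but with non-zero pad bits
```

This costs one extra encode per input, which is usually negligible compared to whatever is done with the decoded data.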
