Python unicode.splitlines() triggers at non-EOL character

Trying this in Python 2.7:
>>> s = u"some\u2028text"
>>> s
u'some\u2028text'
>>> l = s.splitlines(True)
>>> l
[u'some\u2028', u'text']
\u2028 is the Left-To-Right Embedding character, not \r or \n, so the line should not be split there. Is this a bug, or am I misunderstanding something?

\u2028 is LINE SEPARATOR; LEFT-TO-RIGHT EMBEDDING is \u202A:
>>> import unicodedata
>>> unicodedata.name(u'\u2028')
'LINE SEPARATOR'
>>> unicodedata.name(u'\u202A')
'LEFT-TO-RIGHT EMBEDDING'
The list of code points treated as line breaks is easy to read (though not so easy to find) in the Python source (Python 2.7, comments mine):
/* Returns 1 for Unicode characters having the line break
 * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
 * type 'B', 0 otherwise.
 */
int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
{
    switch (ch) {
    // Basic Latin
    case 0x000A: // LINE FEED
    case 0x000B: // VERTICAL TABULATION
    case 0x000C: // FORM FEED
    case 0x000D: // CARRIAGE RETURN
    case 0x001C: // FILE SEPARATOR
    case 0x001D: // GROUP SEPARATOR
    case 0x001E: // RECORD SEPARATOR
    // Latin-1 Supplement
    case 0x0085: // NEXT LINE
    // General punctuation
    case 0x2028: // LINE SEPARATOR
    case 0x2029: // PARAGRAPH SEPARATOR
        return 1;
    }
    return 0;
}
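As a quick cross-check of that list, here is a small sketch for Python 3, where str is already Unicode (the thread itself is about the Python 2.7 unicode type). The name candidates is only illustrative.

```python
# Code points taken from the switch statement above.
candidates = [0x0A, 0x0B, 0x0C, 0x0D, 0x1C, 0x1D, 0x1E, 0x85, 0x2028, 0x2029]

for cp in candidates:
    # Each of these should split "a?b" into two lines.
    parts = ("a" + chr(cp) + "b").splitlines()
    print(hex(cp), parts)

# LEFT-TO-RIGHT EMBEDDING (U+202A) is not in the list, so nothing is split.
print("some\u202atext".splitlines())
```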

U+2028 is LINE SEPARATOR. Both U+2028 and U+2029 (PARAGRAPH SEPARATOR) should be treated as newlines, so Python is doing the right thing.
Of course it is sometimes perfectly reasonable to want to split on a non-standard list of newline characters. But you can't do that with splitlines. You will have to use split—and, if you need the additional features of splitlines, you'll have to implement them yourself. For example:
return [line.rstrip(sep) for line in s.split(sep)]
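A slightly fuller sketch of that idea, assuming you only want to break on \n, \r\n and \r. The helper name split_lines and its parameters are made up for illustration; this is not a standard library function.

```python
import re

def split_lines(s, seps=("\r\n", "\n", "\r"), keepends=False):
    """Split s on the given separators only (hypothetical helper)."""
    # Try longer separators first so "\r\n" counts as one break, not two.
    ordered = sorted(seps, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(sep) for sep in ordered) + ")"
    # With a capturing group, re.split returns text, sep, text, sep, ..., text.
    parts = re.split(pattern, s)
    lines = []
    for i in range(0, len(parts), 2):
        text = parts[i]
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        if text == "" and sep == "":
            continue  # no empty trailing line, mirroring splitlines
        lines.append(text + sep if keepends else text)
    return lines

print(split_lines("some\u2028text\nmore\r\nend"))
```

With keepends=True the separators stay attached to each line, which is the splitlines(True) behaviour from the question, minus the split at \u2028.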

Related

re.search searches only in one line [duplicate]

I have this file loaded in string:
// some preceding stuff
static char header_data[] = {
1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,
1,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,
1,1,0,1,0,1,0,1,1,0,1,0,1,0,1,1,
1,0,1,1,1,0,0,1,1,0,0,1,1,1,0,1,
0,0,0,1,1,1,1,1,1,1,1,1,1,0,1,1,
1,0,0,0,1,1,0,1,1,1,1,1,0,1,1,1,
0,1,0,0,0,1,0,0,1,1,1,1,0,0,0,0,
0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,
0,1,1,1,0,0,0,0,0,0,1,1,0,1,1,0,
0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,
1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,
1,1,0,1,1,1,1,1,1,1,1,0,0,0,1,1,
1,0,1,1,1,0,0,1,1,0,0,0,0,0,1,1,
1,1,0,1,0,1,0,1,1,1,1,0,0,0,0,1,
1,1,1,0,1,1,0,1,1,0,1,1,1,1,0,1,
1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1
};
I want to get only the block with ones and zeros, and then somehow process it.
I imported re, and tried:
In [11]: re.search('static char header_data(.*);', src, flags=re.M)
In [12]: re.findall('static char header_data(.*);', src, flags=re.M)
Out[12]: []
Why doesn't it match anything? How to fix this? (It's python3)

You need to use the re.S flag, not re.M.
re.M (re.MULTILINE) controls the behavior of ^ and $ (whether they match at the start/end of the entire string or of each line).
re.S (re.DOTALL) controls the behavior of the . and is the option you need when you want to allow the dot to match newlines.
See also the documentation.
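A minimal sketch of the difference between the two flags, using a made-up two-line string named text:

```python
import re

text = "first line\nsecond line"

# Without flags, "." stops at the newline, so the match cannot span lines.
assert re.search(r"first.*second", text) is None

# re.M only changes ^ and $; "." still refuses to cross the newline.
assert re.search(r"first.*second", text, flags=re.M) is None
assert re.findall(r"^\w+", text, flags=re.M) == ["first", "second"]

# re.S lets "." match the newline as well.
match = re.search(r"first.*second", text, flags=re.S)
print(repr(match.group(0)))
```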

and then somehow process it.
Here is a way to get a usable list out of the file:
import re
match = re.search(r"static char header_data\[\] = {(.*?)};", src, re.DOTALL)
if match:
    header_data = "".join(match.group(1).split()).split(',')
    print(header_data)
.*? is a non-greedy match, so you really will get just the value between this set of braces.
A more explicit way without DOTALL or MULTILINE would be
match = re.search(r"static char header_data\[\] = {([01,\s\r\n]*?)};", src)
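If the goal is to process the values as numbers, one possible follow-up is to convert the captured text to integers and group them into rows. The sketch below uses a shortened stand-in for src, and the row width is an assumption about the data, not something the file states.

```python
import re

# Shortened stand-in for the file contents (assumed layout: 2 rows of 4).
src = """// some preceding stuff
static char header_data[] = {
    1,1,0,0,
    0,1,1,0
    };
"""

match = re.search(r"static char header_data\[\] = \{(.*?)\};", src, re.DOTALL)
# int() ignores the surrounding whitespace and newlines in each token.
values = [int(tok) for tok in match.group(1).split(",")] if match else []

width = 4  # assumed row width; the real data in the question has 16 columns
rows = [values[i:i + width] for i in range(0, len(values), width)]
print(rows)
```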

If the format of the file does not change, you might as well skip re and use slices. Something along these lines could be useful:
>>> file_in_string
'\n// some preceding stuff\nstatic char header_data[] = {\n 1,1,1,1,1,1,0,0,0
,0,1,1,1,1,1,1,\n 1,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,\n 1,1,0,1,0,1,0,1,1,0,1
,0,1,0,1,1,\n 1,0,1,1,1,0,0,1,1,0,0,1,1,1,0,1,\n 0,0,0,1,1,1,1,1,1,1,1,1,1
,0,1,1,\n 1,0,0,0,1,1,0,1,1,1,1,1,0,1,1,1,\n 0,1,0,0,0,1,0,0,1,1,1,1,0,0,0
,0,\n 0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,\n 0,1,1,1,0,0,0,0,0,0,1,1,0,1,1,0,\
n 0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,\n 1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,\n
1,1,0,1,1,1,1,1,1,1,1,0,0,0,1,1,\n 1,0,1,1,1,0,0,1,1,0,0,0,0,0,1,1,\n 1,1
,0,1,0,1,0,1,1,1,1,0,0,0,0,1,\n 1,1,1,0,1,1,0,1,1,0,1,1,1,1,0,1,\n 1,1,1,1
,1,1,0,0,0,0,1,1,1,1,1,1\n };\n'
>>> lines = file_in_string.split()
>>> lines[9:-1]
['1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,', '1,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,', '1,1,0,
1,0,1,0,1,1,0,1,0,1,0,1,1,', '1,0,1,1,1,0,0,1,1,0,0,1,1,1,0,1,', '0,0,0,1,1,1,1,
1,1,1,1,1,1,0,1,1,', '1,0,0,0,1,1,0,1,1,1,1,1,0,1,1,1,', '0,1,0,0,0,1,0,0,1,1,1,
1,0,0,0,0,', '0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,', '0,1,1,1,0,0,0,0,0,0,1,1,0,1,1,
0,', '0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,', '1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,', '1,
1,0,1,1,1,1,1,1,1,1,0,0,0,1,1,', '1,0,1,1,1,0,0,1,1,0,0,0,0,0,1,1,', '1,1,0,1,0,
1,0,1,1,1,1,0,0,0,0,1,', '1,1,1,0,1,1,0,1,1,0,1,1,1,1,0,1,', '1,1,1,1,1,1,0,0,0,
0,1,1,1,1,1,1']

Why is \x00 not converted to \0 by repr

Here is an interesting oddity about Python's repr:
The tab character \x09 is represented as \t. However this convention does not apply for the null terminator.
Why is \x00 represented as \x00, rather than \0?
Sample code:
# Some facts to make sure we are on the same page
>>> '\x31' == '1'
True
>>> '\x09' == '\t'
True
>>> '\x00' == '\0'
True
>>> x = '\x31'
>>> y = '\x09'
>>> z = '\x00'
>>> x
'1' # As Expected
>>> y
'\t' # Okay
>>> z
'\x00' # Inconsistent - why is this not \0

The short answer: because that's not a specific escape that is used. String representations only use the single-character escapes \\, \n, \r, \t, (plus \' when both " and ' characters are present) because there are explicit tests for those.
The rest is either considered printable and included as-is, or included using a longer escape sequence (depending on the Python version and string type, \xhh, \uhhhh and \Uhhhhhhhh, always using the shortest of the 3 options that'll fit the value).
Moreover, when generating the repr() output for a string consisting of a null byte followed by a digit from '0' through '7' (so bytes([0x00, 0x31]), or bytes([0x00, 0x32]), etc.), you can't just use \0 in the output without then also having to escape the following digit. '\01' is a single octal escape sequence, and not the same value as '\x001', which is two bytes. While forcing the output to always use three octal digits (e.g. '\0001') could be a workaround, it is simpler to stick to a standardised escape sequence format. Scanning ahead to see if the next character is an octal digit and switching output styles would just produce confusing output (imagine the question on SO: What is the difference between '\x001' and '\0Ol'?)
The output is always consistent. Apart from the single quote (which can appear either with ' or \', depending on the presence of " characters), Python will always use same escape sequence style for a given codepoint.
If you want to study the code that produces the output, you can find the Python 3 str.__repr__ implementation in the Objects/unicodeobject.c unicode_repr() function, which uses
        /* Escape quotes and backslashes */
        if ((ch == quote) || (ch == '\\')) {
            PyUnicode_WRITE(okind, odata, o++, '\\');
            PyUnicode_WRITE(okind, odata, o++, ch);
            continue;
        }
        /* Map special whitespace to '\t', \n', '\r' */
        if (ch == '\t') {
            PyUnicode_WRITE(okind, odata, o++, '\\');
            PyUnicode_WRITE(okind, odata, o++, 't');
        }
        else if (ch == '\n') {
            PyUnicode_WRITE(okind, odata, o++, '\\');
            PyUnicode_WRITE(okind, odata, o++, 'n');
        }
        else if (ch == '\r') {
            PyUnicode_WRITE(okind, odata, o++, '\\');
            PyUnicode_WRITE(okind, odata, o++, 'r');
        }
for single-character escapes, followed by additional checks for longer escapes below. For Python 2, a similar but shorter PyString_Repr() function does much the same thing.

If it tried to use \0, then it would have to special-case when numbers immediately followed it, to prevent them from being interpreted as an octal literal. Always using \x00 is simpler and always correct.
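The ambiguity is easy to see in a short Python 3 session; the variable names here are only illustrative:

```python
one_char = '\01'       # octal escape: a single character with code 1
two_chars = '\x001'    # null character followed by the digit '1'

assert len(one_char) == 1 and ord(one_char) == 1
assert len(two_chars) == 2 and two_chars == '\x00' + '1'

# repr() sidesteps the ambiguity by always writing the null as \x00.
print(repr(two_chars))
print(repr('\0'))
```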

Increase C++ regex replace performance

I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and strip the XML tags from them. I've succeeded in doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing the regex_replace function over their contents takes around 6 seconds on my PC. I can bring this down to 2 seconds by adding some compiler optimization flags. Using Python, however, I can get it done in less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?
My C++ code:
std::regex xml_tags_regex("<[^>]*>");
for (std::vector<std::string>::iterator it = _files.begin(); it != _files.end(); it++) {
    std::ifstream file(*it);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();
    std::string buffer(size, ' ');
    file.seekg(0);
    file.read(&buffer[0], size);
    buffer = regex_replace(buffer, xml_tags_regex, "");
    file.close();
}
My Python code:
import re
regex = re.compile('<[^>]*>')
for filename in filenames:
    with open(filename) as f:
        content = f.read()
    content = regex.sub('', content)
P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word or character by character slowed it down considerably.

C++11 regex replace is indeed rather slow, at least for now. PCRE performs much better in pattern matching speed; however, PCRECPP provides only very limited means for regular-expression-based substitution. Quoting the man page:
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert text matching corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text.
This is really poor, compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular expression based substitution in a fashion that is close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP:
Command string syntax
Command syntax follows Perl s/pattern/substitute/[options]
convention. Any character (except the backslash \) can be used as a
delimiter, not just /, but make sure that delimiter is escaped with
a backslash (\) if used in pattern, substitute or options
substrings, e.g.:
s/\\/\//g to replace all backslashes with forward ones
Remember to double backslashes in C++ code, unless using raw string
literal (see string literal):
pcrscpp::replace rx("s/\\\\/\\//g");
Pattern string syntax
Pattern string is passed directly to pcre*_compile, and thus has to
follow PCRE syntax as described in PCRE documentation.
Substitute string syntax
Substitute string backreferencing syntax is similar to Perl's:
$1 ... $n: nth capturing subpattern matched.
$& and $0: the whole match
${label} : labled subpattern matched. label is up to 32 alphanumerical +
underscore characters ('A'-'Z','a'-'z','0'-'9','_'),
first character must be alphabetical
$` and $' (backtick and tick) refer to the areas of the subject before
and after the match, respectively. As in Perl, the unmodified
subject is used, even if a global substitution previously matched.
Also, following escape sequences get recognized:
\n: newline
\r: carriage return
\t: horizontal tab
\f: form feed
\b: backspace
\a: alarm, bell
\e: escape
\0: binary zero
Any other escape sequence \<char>, is interpreted as <char>,
meaning that you have to escape backslashes too
Options string syntax
In Perl-like manner, options string is a sequence of allowed modifier
letters. PCRSCPP recognizes following modifiers:
Perl-compatible flags
g: global replace, not just the first match
i: case insensitive match
(PCRE_CASELESS)
m: multi-line mode: ^ and $ additionally match positions
after and before newlines, respectively
(PCRE_MULTILINE)
s: let the scope of the . metacharacter include newlines
(treat newlines as ordinary characters)
(PCRE_DOTALL)
x: allow extended regular expression syntax,
enabling whitespace and comments in complex patterns
(PCRE_EXTENDED)
PHP-compatible flags
A: "anchor" pattern: look only for "anchored" matches: ones that
start with zero offset. In single-line mode is identical to
prefixing all pattern alternative branches with ^
(PCRE_ANCHORED)
D: treat dollar $ as subject end assertion only, overriding the default:
end, or immediately before a newline at the end.
Ignored in multi-line mode
(PCRE_DOLLAR_ENDONLY)
U: invert * and + greediness logic: make ungreedy by default,
? switches back to greedy. (?U) and (?-U) in-pattern switches
remain unaffected
(PCRE_UNGREEDY)
u: Unicode mode. Treat pattern and subject as UTF8/UTF16/UTF32 string.
Unlike in PHP, also affects newlines, \R, \d, \w, etc. matching
((PCRE_UTF8/PCRE_UTF16/PCRE_UTF32) | PCRE_NEWLINE_ANY
| PCRE_BSR_UNICODE | PCRE_UCP)
PCRSCPP own flags:
N: skip empty matches
(PCRE_NOTEMPTY)
T: treat substitute as a trivial string, i.e., make no backreference
and escape sequences interpretation
n: discard non-matching portions of the string to replace
Note: PCRSCPP does not automatically add newlines,
the replacement result is plain concatenation of matches,
be specifically aware of this in multiline mode
I wrote a simple speed test, which builds a string from 10 copies of the file "move.sh" and measures regex performance on the result:
#include <pcrscpp.h>
#include <string>
#include <iostream>
#include <fstream>
#include <regex>
#include <chrono>
int main (int argc, char *argv[]) {
    const std::string file_name("move.sh");
    pcrscpp::replace pcrscpp_rx(R"del(s/(?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n)/$1\n$2\n/Dgn)del");
    std::regex std_rx (R"del((?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n))del");
    std::ifstream file (file_name);
    if (!file.is_open ()) {
        std::cerr << "Unable to open file " << file_name << std::endl;
        return 1;
    }
    std::string buffer;
    {
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();
        file.seekg(0);
        if (size > 0) {
            buffer.resize(size);
            file.read(&buffer[0], size);
            buffer.resize(size - 1); // strip '\0'
        }
    }
    file.close();
    std::string bigstring;
    bigstring.reserve(10*buffer.size());
    for (std::string::size_type i = 0; i < 10; i++)
        bigstring.append(buffer);
    int n = 10;
    std::cout << "Running tests " << n << " times: be patient..." << std::endl;
    // value-initialize so the running averages start from zero
    std::chrono::high_resolution_clock::duration std_regex_duration{}, pcrscpp_duration{};
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::string result1, result2;
    for (int i = 0; i < n; i++) {
        // clear result
        std::string().swap(result1);
        t1 = std::chrono::high_resolution_clock::now();
        result1 = std::regex_replace (bigstring, std_rx, "$1\\n$2", std::regex_constants::format_no_copy);
        t2 = std::chrono::high_resolution_clock::now();
        std_regex_duration = (std_regex_duration*i + (t2 - t1)) / (i + 1);
        // clear result
        std::string().swap(result2);
        t1 = std::chrono::high_resolution_clock::now();
        result2 = pcrscpp_rx.replace_copy (bigstring);
        t2 = std::chrono::high_resolution_clock::now();
        pcrscpp_duration = (pcrscpp_duration*i + (t2 - t1)) / (i + 1);
    }
    std::cout << "Time taken by std::regex_replace: "
              << std_regex_duration.count()
              << " ms" << std::endl
              << "Result size: " << result1.size() << std::endl;
    std::cout << "Time taken by pcrscpp::replace: "
              << pcrscpp_duration.count()
              << " ms" << std::endl
              << "Result size: " << result2.size() << std::endl;
    return 0;
}
(Note that the std and pcrscpp regular expressions do the same thing here; the trailing newline in the pcrscpp substitution is there because std::regex_replace does not strip newlines despite std::regex_constants::format_no_copy.)
and launched it on a large (20.9 MB) shell move script:
Running tests 10 times: be patient...
Time taken by std::regex_replace: 12090771487 ms
Result size: 101087330
Time taken by pcrscpp::replace: 5910315642 ms
Result size: 101087330
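As you can see, PCRSCPP is more than 2x faster. (The program labels the figures "ms", but they are raw duration count() values, most likely nanoseconds given their magnitude, so roughly 12.1 s versus 5.9 s.) I expect this gap to grow as pattern complexity increases, since PCRE deals with complicated patterns much better. I originally wrote the wrapper for myself, but I think it can be useful for others too.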
Regards,
Alex

I don't think you're doing anything "wrong" per se; the C++ regex library just isn't as fast as the Python one (for this use case, at this time at least). This isn't too surprising: the Python regex code is all C/C++ under the hood as well, and it has been tuned over the years because fast regexes are an important feature in Python.
But there are other options in C++ for getting things faster if you need. I've used PCRE ( http://pcre.org/ ) in the past with great results, though I'm sure there are other good ones out there these days as well.
For this case in particular, however, you can also achieve what you're after without regexes, which in my quick tests yielded a 10x performance improvement. For example, the following code scans the input string and copies everything to a new buffer; when it hits a <, it skips characters until it sees the closing >:
std::string buffer(size, ' ');
std::string outbuffer(size, ' ');
// ... read in buffer from your file
size_t outbuffer_len = 0;
for (size_t i = 0; i < buffer.size(); ++i) {
    if (buffer[i] == '<') {
        // skip ahead to the closing '>', checking bounds before each access
        while (i < buffer.size() && buffer[i] != '>') {
            ++i;
        }
    } else {
        outbuffer[outbuffer_len] = buffer[i];
        ++outbuffer_len;
    }
}
outbuffer.resize(outbuffer_len);

Show non printable characters in a string

Is it possible to visualize non-printable characters in a python string with its hex values?
e.g. If I have a string with a newline inside I would like to replace it with \x0a.
I know there is repr() which will give me ...\n, but I'm looking for the hex version.

I don't know of any built-in method, but it's fairly easy to do using a comprehension:
import string
printable = string.ascii_letters + string.digits + string.punctuation + ' '
def hex_escape(s):
    return ''.join(c if c in printable else r'\x{0:02x}'.format(ord(c)) for c in s)
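A short usage sketch of this approach, repeating the definitions so that it runs on its own. The sample string is made up.

```python
import string

printable = string.ascii_letters + string.digits + string.punctuation + ' '

def hex_escape(s):
    # Keep characters from the allow-list, hex-escape everything else.
    return ''.join(c if c in printable else r'\x{0:02x}'.format(ord(c)) for c in s)

print(hex_escape('tab\there\n'))
```

Two caveats worth keeping in mind: a literal backslash counts as punctuation and is passed through unescaped, and code points above 0xff come out with more than two hex digits after \x.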

I'm kind of late to the party, but if you need it for simple debugging, I found that this works:
string = "\n\t\nHELLO\n\t\n\a\17"
procd = [c for c in string]
print(procd)
# Prints ['\n', '\t', '\n', 'H', 'E', 'L', 'L', 'O', '\n', '\t', '\n', '\x07', '\x0f']
While just list(string) is simpler, a comprehension makes it easier to add filtering or mapping if necessary.

You'll have to make the translation manually; go through the string with a regular expression for example, and replace each occurrence with the hex equivalent.
import re
replchars = re.compile(r'[\n\r]')
def replchars_to_hex(match):
    return r'\x{0:02x}'.format(ord(match.group()))
replchars.sub(replchars_to_hex, inputtext)
The above example only matches newlines and carriage returns, but you can expand what characters are matched, including using \x escape codes and ranges.
>>> inputtext = 'Some example containing a newline.\nRight there.\n'
>>> replchars.sub(replchars_to_hex, inputtext)
'Some example containing a newline.\\x0aRight there.\\x0a'
>>> print(replchars.sub(replchars_to_hex, inputtext))
Some example containing a newline.\x0aRight there.\x0a
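As one way of expanding the set of matched characters, the sketch below uses a range covering the ASCII control characters plus DEL. The names control_chars, to_hex and sample are illustrative.

```python
import re

# \x00-\x1f are the C0 control characters, \x7f is DEL.
control_chars = re.compile(r'[\x00-\x1f\x7f]')

def to_hex(match):
    return r'\x{0:02x}'.format(ord(match.group()))

sample = 'a\tb\x00c\x1bd\n'
print(control_chars.sub(to_hex, sample))
```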

Modifying ecatmur's solution to handle non-printable non-ASCII characters makes it less trivial and more obnoxious:
def escape(c):
    if c.isprintable():
        return c
    c = ord(c)
    if c <= 0xff:
        return r'\x{0:02x}'.format(c)
    elif c <= 0xffff:
        return r'\u{0:04x}'.format(c)
    else:
        return r'\U{0:08x}'.format(c)
def hex_escape(s):
    return ''.join(escape(c) for c in s)
Of course if str.isprintable isn't exactly the definition you want, you can write a different function. (Note that it's a very different set from what's in string.printable: besides handling non-ASCII printable and non-printable characters, it also considers \n, \r, \t, \x0b, and \x0c as non-printable.)
You can make this more compact; this is explicit just to show all the steps involved in handling Unicode strings. For example:
def escape(c):
    if c.isprintable():
        return c
    elif c <= '\xff':
        return r'\x{0:02x}'.format(ord(c))
    else:
        return c.encode('unicode_escape').decode('ascii')
Really, no matter what you do, you're going to have to handle \r, \n, and \t explicitly, because all of the built-in and stdlib functions I know of will escape them via those special sequences instead of their hex versions.
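To make that last point concrete, here is a small sketch that shows what the unicode_escape codec produces, followed by one way to force hex forms for the three special cases. The names HEX_FORMS and escape_char are made up for this example.

```python
s = 'a\tb\nc\rd\x00\u2028'

# The codec keeps the short forms for tab, newline and carriage return.
builtin = s.encode('unicode_escape').decode('ascii')
print(builtin)

# Explicit table for the characters the codec would render as \t, \n, \r.
HEX_FORMS = {'\t': r'\x09', '\n': r'\x0a', '\r': r'\x0d'}

def escape_char(c):
    if c in HEX_FORMS:
        return HEX_FORMS[c]
    if c.isprintable():
        return c
    # Everything else non-printable goes through the codec (\xhh, \uhhhh, ...).
    return c.encode('unicode_escape').decode('ascii')

print(''.join(escape_char(c) for c in s))
```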

I did something similar once by deriving a str subclass with a custom __repr__() method which did what I wanted. It's not exactly what you're looking for, but may give you some ideas.
# -*- coding: iso-8859-1 -*-
# special string subclass to override the default
# representation method. main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord > 128
class MsgStr(str):
    def __repr__(self):
        # use double quotes unless there are more of them within the string than
        # single quotes
        if self.count("'") >= self.count('"'):
            quotechar = '"'
        else:
            quotechar = "'"
        rep = [quotechar]
        for ch in self:
            # control char?
            if ord(ch) < ord(' '):
                # remove the single quotes around the escaped representation
                rep += repr(str(ch)).strip("'")
            # embedded quote matching quotechar being used?
            elif ch == quotechar:
                rep += "\\"
                rep += ch
            # else just use others as they are
            else:
                rep += ch
        rep += quotechar
        return "".join(rep)
if __name__ == "__main__":
    s1 = '\tWürttemberg'
    s2 = MsgStr(s1)
    print "str s1:", s1
    print "MsgStr s2:", s2
    print "--only the next two should differ--"
    print "repr(s1):", repr(s1), "# uses built-in string 'repr'"
    print "repr(s2):", repr(s2), "# uses custom MsgStr 'repr'"
    print "str(s1):", str(s1)
    print "str(s2):", str(s2)
    print "repr(str(s1)):", repr(str(s1))
    print "repr(str(s2)):", repr(str(s2))
    print "MsgStr(repr(MsgStr('\tWürttemberg'))):", MsgStr(repr(MsgStr('\tWürttemberg')))

Non-printable characters can also be present in a string and take effect when it is printed, even though they are invisible. You can detect them by measuring the string with len, or by placing the cursor at the start of the string and counting how many arrow-key presses it takes to reach the end: a stretch that looks like two adjacent characters may take three presses because an invisible character sits between them. (I am not sure whether earlier answers already showed this.)
In the example below (originally a screenshot), I pasted a 135-bit string, whose structure and length I had prepared by hand, into a program that interprets it as ASCII. The printed result contains non-printable characters, including a form feed (new page), which produces an entire blank line in the middle of the printed output:
Example of printing non-printable characters that appear in printed string
Input a string:100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000
HPQGg]+\,vE!:#
>>> len('HPQGg]+\,vE!:#')
17
>>>
Try copying the string HPQGg]+\,vE!:# from the excerpt above and pasting it into Python's IDLE.
Hint: you have to press the arrow key three times to move across the two letters P and Q, even though they appear adjacent, because a File Separator control character sits between them.
However, even though we get the same starting value when converting the byte string to hex, converting that hex back to bytes displays differently (perhaps an encoding issue, I am not sure). Either way, the program's output above does contain non-printable characters. (I came across this by chance while experimenting with a compression method.)
>>> bytes(b'HPQGg]+\,vE!:#').hex()
'48501c514767110c5d2b5c2c7645213a40'
>>> bytes.fromhex('48501c514767110c5d2b5c2c7645213a40')
b'HP\x1cQGg\x11\x0c]+\\,vE!:#'
>>> (0x48501c514767110c5d2b5c2c7645213a40 == 0b100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000)
True
>>>
In the above 135 bit string, the first 16 groups of 8 bits from the big-endian side encode each character (including non-printable), whereas the last group of 7 bits results in the # symbol, as seen below:
Technical breakdown of the format of the above 135-bit string
And here as text is the breakdown of the 135-bit string:
10010000 = H (72)
10100000 = P (80)
00111000 = x1c (28 for File Separator) *
10100010 = Q (81)
10001110 = G(71)
11001110 = g (103)
00100010 = x11 (17 for Device Control 1) *
00011000 = x0c (12 for NP form feed, new page) *
10111010 = ] (93 for right bracket ‘]’
01010110 = + (43 for + sign)
10111000 = \ (92 for backslash)
01011000 = , (44 for comma, ‘,’)
11101100 = v (118)
10001010 = E (69)
01000010 = ! (33 for exclamation)
01110100 = : (58 for colon ‘:’)
1000000 = # (64 for ‘#’ sign)
So, to answer the sub-question about showing non-printable characters as hex: in the bytes repr further above, \x1c denotes the File Separator control character mentioned in the hint. The bytes object can be read as a string if you ignore the b prefix, and the same character is present in the printed string, although it is invisible (its presence can be observed with the hint and with len, as demonstrated above).
