Unescape _xHHHH_ XML escape sequences using Python

I'm using Python 2.x [not negotiable] to read XML documents [created by others] that allow the content of many elements to contain characters that are not valid XML characters, by escaping them using the _xHHHH_ convention; e.g. ASCII BEL aka U+0007 is represented by the 7-character sequence u"_x0007_". Neither the functionality that allows representation of any old character in the document nor the manner of escaping is negotiable. I'm parsing the documents using cElementTree or lxml [semi-negotiable].
Here is my best attempt at unescaping the parser output as efficiently as possible:
import re

def unescape(s,
             subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
             repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
             ):
    if "_" in s:
        return subber(repl, s)
    return s
The above is biased by observing a very low frequency of "_" in typical text, and a better-than-doubling of speed by avoiding the regex apparatus where possible.
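For illustration, here is how it behaves on a made-up input (Python 2 interactive session):

>>> unescape(u'bell: _x0007_, lone underscore kept: _x_')
u'bell: \x07, lone underscore kept: _x_'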
The question: Any better ideas out there?

You might as well check for '_x' rather than just '_'; it won't matter much, but the two-character sequence is surely even rarer than the single underscore. Apart from such details, you do seem to be making the best of a bad situation!
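A minimal sketch of that tweak applied to the function above (only the guard changes):

def unescape(s,
             subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
             repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
             ):
    # '_x' is rarer than '_' alone, so the regex machinery is skipped more often
    if '_x' in s:
        return subber(repl, s)
    return s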


Simple parser, but not a calculator

I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic-like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}
Where the <TYPE> part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
How do I look for tokens that are longer than one character?
How do I break the text up into its different parts? For example, after a TYPE I can have whitespace, a <, or whitespace followed by a <. How do I address that?
Short answer
All your questions boil down to the fact that you are not tokenizing your string before parsing it.
Long answer
The process of parsing is actually split in two distinct parts: lexing and parsing.
Lexing
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking "How do I break the text up into its different parts?"
You can do it yourself by checking your string against a list of regexps using re, or you can use a well-known library such as PLY. If you are using Python 3, though, I am biased toward a lexing-parsing library that I wrote, ComPyl.
Proceeding with ComPyl, the syntax you are looking for seems to be the following:
from compyl.lexer import Lexer

rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'),  # Will match your <TYPE> token with inner whitespace
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]

lexer = Lexer(rules=rules, line_rule='\n')
# See the ComPyl docs to figure out how to proceed from here
Notice that the first rule, (r'\s+', None), is actually what solves your whitespace issue. It tells the lexer to match any run of whitespace characters and ignore it. Of course, if you do not want to use a lexing tool, you can simply add a similar rule to your own re-based implementation.
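If you prefer to stay with plain re, a minimal hand-rolled tokenizer along those lines might look like this (a sketch, not ComPyl's API; the rules mirror the ones above, plus a rule for the semicolon in your DSL):

import re

# Ordered list of (pattern, token_name) rules; None means "skip the match"
TOKEN_RULES = [
    (re.compile(r'\s+'), None),
    (re.compile(r'< *\w+ *>'), 'TYPE'),
    (re.compile(r'\w+'), 'ID'),
    (re.compile(r'{'), 'L_BRACKET'),
    (re.compile(r'}'), 'R_BRACKET'),
    (re.compile(r';'), 'SEMICOLON'),
]

def tokenize(text):
    pos = 0
    tokens = []
    while pos < len(text):
        for regex, name in TOKEN_RULES:
            m = regex.match(text, pos)
            if m:
                if name is not None:
                    tokens.append((name, m.group(0)))
                pos = m.end()
                break
        else:
            raise ValueError('Unexpected character at position %d' % pos)
    return tokens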
Parsing
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist a lot of tools that can do it for you (the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful but harder to write by hand; see the difference between LL(1) and LR(1) here).
Simply notice that, now that you know how to tokenize your string, the issue of "How do I look for tokens that are longer than one character?" has been solved. You are now parsing not a stream of characters, but a stream of tokens that encapsulate the matched words.
Olivier's answer regarding lexing/tokenizing and then parsing is helpful.
However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.
An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser .
It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.
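For a taste of the parsy style, here is a minimal sketch for the member-declaration line of the DSL above (the combinator names are parsy's; the grammar itself is my assumption from the question):

from parsy import regex, string, seq

whitespace = regex(r'\s*')
lexeme = lambda p: p << whitespace  # parse p, then consume optional whitespace

ident = lexeme(regex(r'\w+'))
type_param = (lexeme(string('<')) >> ident << lexeme(string('>'))).optional()

# Matches e.g. "TYPE<TYPE> memberName;"
member = seq(ident, type_param, ident << lexeme(string(';')))
print(member.parse('TYPE<TYPE> memberName;'))  # -> ['TYPE', 'TYPE', 'memberName']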

Pyparsing delimited list only returns first element

Here is my code:
from pyparsing import Word, alphanums, delimitedList

l = "1.3E-2 2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim='\t ')
print(grammar.parseString(l))
It returns:
['1.3E-2']
Obviously, I want both values, not just one. Any idea what is going on?
As #dawg explains, delimitedList is intended for cases where you have an expression with separating non-whitespace delimiters, typically commas. Pyparsing implicitly skips over whitespace, so in the pyparsing world, what you are really seeing is not a delimitedList, but OneOrMore(realnumber). Also, parseString internally calls str.expandtabs on the provided input string, unless you use the parseWithTabs=True argument. Expanding tabs to spaces helps preserve columnar alignment of data when it is in tabular form, and when I originally wrote pyparsing, this was a prevalent use case.
If you have control over this data, then you might want to use a different delimiter than <TAB>, perhaps commas or semicolons. If you are stuck with this format, but determined to use pyparsing, then use OneOrMore.
As you move forward, you will also want to be more precise about the expressions you define and the variable names that you use. The name "parser" is not very informative, and the pattern of Word(alphanums+'+-.') will match a lot of things besides valid real values in scientific notation. I understand if you are just trying to get anything working, this is a reasonable first cut, and you can come back and tune it once you get something going. If in fact you are going to be parsing real numbers, here is an expression that might be useful:
realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
Then you can define your grammar as "OneOrMore(realnum)", which is also a lot more self-explanatory. And the parse action will convert your strings to floats at parse time, which will save you a step later when actually working with the parsed values.
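Putting that together, a minimal complete version of the suggestion (the floats in the comment are what the parse action should produce):

from pyparsing import Regex, OneOrMore

realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
grammar = OneOrMore(realnum)
print(grammar.parseString('1.3E-2 2.5E+1'))  # -> [0.013, 25.0]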
Good luck!
Works if you switch to raw strings:
l = r"1.3E-2\t2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim=r'\t')
print(grammar.parseString(l))
Prints:
['1.3E-2', '2.5E+1']
In general, delimitedList works with input like PDPDP, where P is the parse target and D is the delimiter or delimiting sequence.
You have delim='\t '. That is specifically a delimiter of one tab followed by one space; it is not "either tab or space".
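For contrast, a quick sketch of the kind of input delimitedList is designed for (the default delimiter is a comma):

from pyparsing import Word, alphanums, delimitedList

items = delimitedList(Word(alphanums))
print(items.parseString('a, b, c'))  # -> ['a', 'b', 'c']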

Efficient way to do a large number of search/replaces in Python?

I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
The str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure whether this would be better. I guess it depends on what kind of optimizations Python does internally. I thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)  # potentially problematic
Any thoughts on efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by using larger strings (call re.sub once on the whole text instead of once per line). That increases memory use, which shouldn't be a problem unless the file is huge, and it improves execution time.
If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution, though, would probably be to use cStringIO.
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
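A rough sketch of that idea, reusing the tag pattern from the question (only the two substitutions from your example are shown):

import re

tag_re = re.compile(r'(<[^>]+>)')

def process_line(line):
    # re.split with a capturing group keeps the tags in the result list
    tokens = tag_re.split(line)
    for i, tok in enumerate(tokens):
        if tok.startswith('<'):
            tok = re.sub(r'\*t\([^\)]*\)', '', tok)
            tok = re.sub(r'\*p\([^\)]*\)', '', tok)
            tokens[i] = tok
    return ''.join(tokens)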
You can pass a function object to re.sub instead of a substitution string; it takes the match object and returns the substitution. For example:
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object; this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes each substitution may be significant.
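To make the single-pass idea concrete, here is a sketch that builds one alternation from a dict of literal replacements (the entries are the ones from the question; longest-first ordering guards against one key being a prefix of another):

import re

replacements = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
}
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

def replace_all(line):
    # One scan of the line; the lambda looks up the matched literal
    return pattern.sub(lambda m: replacements[m.group(0)], line)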

Converting html entities into their values in python

I use this regex on some input:
[^a-zA-Z0-9&#]
However this ends up removing lots of HTML special characters within the input, such as &#227;, &#1606;, and &#1588;.
Is there a way that I can convert them to their values first, so that they will satisfy the regex?
Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
import re

xed_re = re.compile(r'&#(\d+);')

def usub(m):
    return unichr(int(m.group(1)))

s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
If your terminal emulator can display arbitrary Unicode glyphs, a print u will then show
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
You can adapt the following script:
import htmlentitydefs
import re

def substitute_entity(match):
    name = match.group(1)
    if name in htmlentitydefs.name2codepoint:
        return unichr(htmlentitydefs.name2codepoint[name])
    elif name.startswith('#'):
        try:
            return unichr(int(name[1:]))
        except ValueError:
            pass
    return '?'

print re.sub('&(#?\\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
Without knowing what the expression is being used for, I can't tell exactly what you need.
This will match special characters or runs of characters excluding letters, digits, #, and &, as well as numeric character references:
[^a-zA-Z0-9#&]*|&#[0-9A-Za-z]+;

What's a good way to replace international characters with their base Latin counterparts using Python?

Say I have the string "blöt träbåt", which has a few a's and o's with umlauts and a ring above. I want it to become "blot trabat" as simply as possible. I've done some digging and found the following method:
import unicodedata
unicode_string = unicodedata.normalize('NFKD', unicode(string))
This will give me the string in Unicode format with the international characters split into base letter and combining character (\u0308 for umlauts). Now to get this back to an ASCII string I could do ascii_string = unicode_string.encode('ASCII', 'ignore') and it'll just ignore the combining characters, resulting in the string "blot trabat".
The question here is: is there a better way to do this? It feels like a roundabout way, and I was thinking there might be something I don't know about. I could of course wrap it up in a helper function, but I'd rather check if this doesn't exist in Python already.
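For reference, here is the whole round trip wrapped into the helper function mentioned above (Python 2; a sketch of exactly the approach described):

import unicodedata

def strip_accents(u):
    # Decompose into base letters plus combining marks, then drop
    # everything non-ASCII (the marks) via the 'ignore' error handler.
    return unicodedata.normalize('NFKD', u).encode('ASCII', 'ignore')

print strip_accents(u'bl\xf6t tr\xe4b\xe5t')  # -> blot trabat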
It would be better if you created an explicit table, and then used the unicode.translate method. The advantage would be that transliteration is more precise, e.g. transliterating "ö" to "oe" and "ß" to "ss", as should be done in German.
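A minimal sketch of that table-based approach (Python 2; the handful of entries here is illustrative, not a complete table):

translit_table = {
    ord(u'\xe4'): u'ae',  # ä
    ord(u'\xf6'): u'oe',  # ö
    ord(u'\xfc'): u'ue',  # ü
    ord(u'\xdf'): u'ss',  # ß
    ord(u'\xe5'): u'a',   # å
}

print u'bl\xf6t tr\xe4b\xe5t'.translate(translit_table)  # -> bloet traebat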
There are several transliteration packages on PyPI: translitcodec, Unidecode, and trans.
