Dive into Python: XML Processing
Here I am referring to a portion of the kgp.py program:
def getDefaultSource(self):
    xrefs = {}
    for xref in self.grammar.getElementsByTagName("xref"):
        xrefs[xref.attributes["id"].value] = 1
    xrefs = xrefs.keys()
    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
    if not standaloneXrefs:
        raise NoSourceError, "can't guess source, and no source specified"
    return '<xref id="%s"/>' % random.choice(standaloneXrefs)
self.grammar: the parsed XML representation (using xml.dom.minidom) of:
<?xml version="1.0" ?>
<grammar>
  <ref id="bit">
    <p>0</p>
    <p>1</p>
  </ref>
  <ref id="byte">
    <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
  </ref>
</grammar>
self.refs: a cache of all the <ref> elements of the above XML, keyed by their id.
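(For context, a minimal sketch, not the book's exact code, of how such a cache might be built with xml.dom.minidom:)

# Hypothetical sketch of how self.refs might be populated:
self.refs = {}
for ref in self.grammar.getElementsByTagName("ref"):
    self.refs[ref.attributes["id"].value] = ref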
I have two doubts about this code:
Doubt 1:
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
Eventually xrefs holds the id values in a list. Couldn't we have done this simply with:
xrefs = [xref.attributes["id"].value
         for xref in self.grammar.getElementsByTagName("xref")]
Doubt 2:
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
Here, we save the refs from self.refs that we do NOT see among our computed xrefs. But then, instead of creating a <ref> element, we create an <xref> with the same ID. This takes us one step backward, since later we are going to resolve the cross reference for this computed <xref> anyway and eventually reach the <ref>. We could have just started with this <ref> in the first place.
Disclaimer
I am in no way trying to criticize the book; I am not even qualified for that.
I am loving every moment of reading it. I realize a few chapters have become outdated, but I love Mark Pilgrim's writing style and I cannot stop reading.
Dive Into Python is seven years old now (published 2004), and doesn't always contain the most modern code. So you need to go easy on it: Dive Into Python 3 might be a better bet.
Your suggestion for doubt 1 changes the meaning of the code: putting the ids into the keys of a dictionary and then getting them out again eliminates duplicates, whereas your list comprehension includes duplicates. The modern approach would be to use a set comprehension:
xrefs = {xref.attributes["id"].value
         for xref in self.grammar.getElementsByTagName("xref")}
but this wasn't available in 2004.
On your doubt 2, I'm not entirely sure I see the problem. Yes, in some sense this is a waste, but on the other hand the code already has a handler for the xref case, so it makes sense to re-use that handler rather than add an extra special case.
There are several other bits of code in that example that could be modernized. For example,
source and source or self.getDefaultSource()
would now be source or self.getDefaultSource().
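(Aside: the cond and a or b idiom was the usual workaround before Python 2.5 introduced conditional expressions, and it silently misbehaves when a is falsy; here the middle operand is source itself, so the and adds nothing. A quick illustration:)

source = ""
# For an empty (falsy) source, both forms fall back to the default:
print(source and source or "default")  # -> default
print(source or "default")             # -> default

And the line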
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
would be better expressed as a set difference operation, something like:
standaloneXrefs = set(self.refs) - set(xrefs)
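(One caveat with the set version: random.choice needs an indexable sequence, so you would convert back before choosing, e.g. random.choice(list(standaloneXrefs)).)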
But that's what happens as languages become more expressive: old code starts to look rather inelegant.
Your doubts are totally justified: that code doesn't look very good to me at all. For example, it uses 1 as a boolean value where True would have sufficed and been clearer.
Doubt 1:
These two snippets don't do the same thing. If there are duplicates, the original code will filter them out, but your alternative won't. On the other hand, your code preserves the original ordering, whereas the original returns the elements in an arbitrary order.
To be fully equivalent, we could use the set builtin:
xrefs = list(set([xref.attributes["id"].value
                  for xref in self.grammar.getElementsByTagName("xref")]))
(It might not make sense to convert back to a list, though.)
Doubt 2:
Out of time, gotta run, sorry...
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
This is an extremely crude way to construct a set. This should be written as
set(xref.attributes["id"].value
    for xref in self.grammar.getElementsByTagName("xref"))
or even (in Python 2.7+):
{xref.attributes["id"].value
 for xref in self.grammar.getElementsByTagName("xref")}
If avoiding duplicates is not an issue, your solution (constructing a list) works too. Since the result is only iterated over anyway, one could even use a generator expression.
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
This code is completely broken if xref contains a special character such as " or &.
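(A safer sketch using the standard library's xml.sax.saxutils, which handles both the escaping and the quoting:)

from xml.sax.saxutils import quoteattr

def make_xref(ref_id):
    # quoteattr escapes &, <, > and quotes, and supplies the
    # surrounding quotation marks itself:
    return '<xref id=%s/>' % quoteattr(ref_id)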
However, in principle, it is correct to construct an <xref> element here, since this must be the same format that the external source has (getDefaultSource is called as
self.loadSource(source and source or self.getDefaultSource())
).
Both code excerpts are examples of bad programming and should not be included in a book that intends to teach people how to program. Dive Into Python 3 has better XML examples and code.
Related
I wrote a library using just the ast and inspect libraries to parse and emit internal Python constructs (it uses astor on Python < 3.9).
I just realised that I really need to preserve comments after all, preferably without resorting to RedBaron or LibCST, as I just need to emit the unaltered commentary. Is there a clean and concise way of comment-preserving parsing/emitting of Python source with just the stdlib?
What I ended up doing was writing a simple parser, without a meta-language in 339 source lines:
https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py
Implementation of Concrete Syntax Tree [List!]
Reads source character by character;
Once end of statement† is detected, add statement-type into 1D list;
†end of statement is the end of the line if line.lstrip().startswith("#"), or if the line does not end with '\\' and balanced_parens(line) holds; otherwise continue munching until that condition is true… plus some edge cases around multiline strings and the like (a naive sketch of balanced_parens appears after this list);
Once finished there is a big (1D) list where each element is a namedtuple with a value property.
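(balanced_parens is not shown in the original; a sketch of what it might look like, under the assumption that it simply counts bracket depth, ignoring the brackets-inside-string-literals edge case the real implementation must handle:)

def balanced_parens(line):
    # Hypothetical helper: zero net bracket depth means balanced.
    depth = 0
    for ch in line:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
    return depth == 0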
Integration with builtin Abstract Syntax Tree ast
Limit the ast nodes to modify (not remove) to: {ClassDef, AsyncFunctionDef, FunctionDef} docstrings (first body element Constant|Str), Assign and AnnAssign;
cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
if doc_str node then maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list)
…
Now the cst_list contains only changes to those aforementioned nodes, and only when the change is more than whitespace, and it can be turned into a string with "".join(map(attrgetter("value"), cst_list)) for outputting to eval or straight out to a source file (e.g., in-place overwriting).
Quality control
100% test coverage
100% doc coverage
Support for last 6 versions of Python (including latest alpha)
CI/CD
(Apache-2.0 OR MIT) licensed
Limitations
Lack of a meta-language: specifically, not using Python's provided grammar means new syntax elements won't automatically be supported (match/case is supported, but any syntax introduced since isn't [yet?] supported… at least not automatically);
Not built into the stdlib, so the stdlib could break compatibility;
Deleting nodes is [probably] not supported;
Nodes can be incorrectly identified if there are shadow variables or similar issues that linters should point out.
Comments can be preserved by merging them back into the generated source code by capturing them with the tokenizer.
Given a toy program in a program variable, we can demonstrate how comments get lost in the AST:
import ast
program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
'b'] # Ending comment for b lost
def p1f(x):
"p1f docstring"
# Comment in function p1f lost
return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))
The output shows all comments gone:
== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']
def p1f(x):
    """p1f docstring"""
    return x
print(p1f(p1l), p1f(p1v))
However, if we scan the comments with the tokenizer, we can use this to merge the comments back in:
from io import StringIO
import tokenize

def scan_comments(source):
    """ Scan source code file for relevant comments
    """
    # Find the token type for comments
    for k, v in tokenize.tok_name.items():
        if v == 'COMMENT':
            comment = k
            break
    comtokens = []
    with StringIO(source) as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            if token.type != comment:
                continue
            comtokens += [token]
    return comtokens
comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])
Output (edited to split long line):
== Comment after p1l[0]
     TokenInfo(type=60 (COMMENT),
               string='# Implicit line joining comment for a lost',
               start=(4, 12), end=(4, 54),
               line="p1l = ['a', # Implicit line joining comment for a lost\n")
Using a slightly modified version of ast.unparse(), replacing the methods maybe_newline() and traverse() with modified versions, you should be able to merge back in all comments at their approximate locations, using the location info from the comment scanner (the start variable) combined with the location info from the AST (most nodes have a lineno attribute).
Not exactly. See for example the list variable assignment: the source code is split over two lines, but ast.unparse() generates only one line (see the output in the second code segment). Also, you need to make sure to update the location info in the AST using ast.increment_lineno() after adding code.
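(For reference, a minimal illustration of ast.increment_lineno, which shifts the location info of a node and all its descendants:)

import ast
tree = ast.parse("x = 1\ny = 2")
ast.increment_lineno(tree, n=1)  # shift every node's lineno down by one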
It seems some more calls to maybe_newline() might be needed in the library code (or its replacement).
I have a list like this:

["<name:john student male age=23 subject=\computer\sience_{20092973}>",
 "<name:Ahn professor female age=61 subject=\computer\math_{20092931}>"]

I want to look up the role using {20092973} or {20092931}.

My expected result 1 is this (input is {20092973}):

"student"

My expected result 2 is this (input is {20092931}):

"professor"

I have already searched, but I can't find an answer. Sorry. How can I do this?
I don't think you should be doing this in the first place. Unlike your toy example, your real problem doesn't involve a string in some clunky format; it involves a Scapy NetworkInterface object. Which has attributes that you can just access directly. You only have to parse it because for some reason you stored its string representation. Just don't do that; store the attributes you actually want when you have them as attributes.
The NetworkInterface object isn't described in the documentation (because it's an implementation detail of the Windows-specific code), but you can interactively inspect it like any other class in Python (e.g., dir(ni) will show you all the attributes), or just look at the source. The values you want are name and win_name. So, instead of print ni, just do something like print '%s,%s' % (ni.name, ni.win_name). Then, parsing the results in some other program will be trivial, instead of a pain in the neck.
Or, better, if you're actually using this in Scapy itself, just make the dict directly out of {ni.win_name: ni.name for ni in nis}. (Or, if you're running Scapy against Python 2.5 or something, dict((ni.win_name, ni.name) for ni in nis).)
But to answer the question as you asked it (maybe you already captured all the data and it's too late to capture new data, so now we're stuck working around your earlier mistake…), there are three steps to this: (1) Figure out how to parse one of these strings into its component parts. (2) Do that in a loop to build a dict mapping the numbers to the names. (3) Just use the dict for your lookups.
For parsing, I'd use a regular expression. For example:
<name:\S+\s(\S+).*?\{(\d+)\}>
Debuggex Demo
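(Equivalently, with re.VERBOSE the pattern can carry its own explanation; the groups are the same as above:)

r = re.compile(r'''
    <name:\S+\s      # "name:", the name itself, then whitespace
    (\S+)            # group 1: the role (student, professor, ...)
    .*?              # lazily skip ahead
    \{(\d+)\}        # group 2: the number inside the braces
    >
''', re.VERBOSE)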
Now, let's build the dict:
r = re.compile(r'<name:\S+\s(\S+).*?\{(\d+)\}>')
matches = (r.match(thing) for thing in things)
d = {match.group(2): match.group(1) for match in matches}
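(If some lines might not match the pattern, r.match returns None and match.group would raise AttributeError; you could filter first, e.g. matches = (m for m in matches if m).)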
And now:
>>> d['20092973']
'student'
Code:
def grepRole(role, lines):
    return [line.split()[1] for line in lines if role in line][0]

l = ["<name:john student male age=23 subject=\computer\sience_{20092973}>",
     "<name:Ahn professor female age=61 subject=\compute\math_{20092931}>"]

print(grepRole("{20092973}", l))
print(grepRole("{20092931}", l))
Output:
student
professor
current_list = ["<name:john student male age=23 subject=\computer\sience_{20092973}>",
                "<name:Ahn professor female age=61 subject=\computer\math_{20092931}>"]

def get_identity(code):
    print([row.split(' ')[1] for row in current_list if code in row][0])

get_identity("{20092973}")
Regular expressions are good, but for me, a rookie, they are another big problem...
I'm parsing some data with the pattern as follows:
tagA:
    titleA
    dataA1
    dataA2
    dataA3
    ...
tagB:
    titleB
    dataB1
    dataB2
    dataB3
    ...
tagC:
    titleC
    dataC1
    dataC2
    ...
...
These tags are stored in a list list_of_tags; if I iterate through the list, I can get all the tags, and if I iterate through a tag, I can get the title and the data associated with it.
The tags in my data are pretty much something like <div>, so they are not useful to me; what I'm trying to do is to construct a dictionary which uses the titles as keys and the data items as lists of values.
The constructed dictionary would look like:
{
    titleA: [dataA1, dataA2, dataA3...],
    titleB: [dataB1, dataB2, dataB3...],
    ...
}
Notice every tag contains only one title and some data items, and the title always comes before the data.
So here are my working codes:
Method 1:
result = {}
for tag in list_of_tags:
    list_of_values = []
    for idx, elem in enumerate(tag):
        if not idx:
            key = elem
        else:
            construct_list_of_values()
    update_the_dictionary()
Actually, method 1 works fine and gives me my desired result; however, if I put that piece of code in PyCharm, it warns me that "Local variable 'key' might be referenced before assignment" at the last line. Hence, I try another approach:
Method 2:
result = {tag[0]: tag[1:] for tag in list_of_tags}
Method 2 works fine if the tags are lists, but I also want the code to work if the tags are generators (a 'generator' object is not subscriptable error will occur with method 2).
In order to work with generators, I come up with:
Method 3:
key_val_list = [(next(tag), list(tag)) for tag in list_of_tags]
result = dict(key_val_list)
Method 3 also works, but I cannot write it as a dictionary comprehension ({next(tag): list(tag) for tag in list_of_tags} would give a StopIteration exception because list(tag) will be evaluated first).
So, my question is: is there an elegant way of dealing with this pattern that works whether the tags are lists or generators? (Method 1 seems to work for both, but I don't know if I should ignore the warning PyCharm gives; the other two methods look more concise, but one only works on lists while the other only works on generators.)
Sorry for the long question, thanks for the patience!
I guess the reason why PyCharm is giving you a warning is that you are using key in update_the_dictionary, but key could be left unassigned if tag does not contain at least one element. You might have the knowledge that the title will always be in the list, but the static analyzer is not able to infer that from the context.
If you are using Python 3, you might want to try using PEP 3132 - Extended Iterable Unpacking. It should work for both lists and generators.
e.g.
title, *data = tag
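For instance, a minimal sketch of the whole construction (assuming list_of_tags as described); the unpacking works whether each tag is a list or a generator:

result = {}
for tag in list_of_tags:
    title, *data = tag  # first element becomes the key, the rest the values
    result[title] = data

(Note also that since Python 3.8, dict comprehensions compute the key before the value, so {next(tag): list(tag) for tag in list_of_tags} works there; on earlier versions it fails as you observed.)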
I'm automating the process of creating a Word document with the python-docx module. In particular, I'm creating a multiple choice test where the questions are numbered 1., 2., 3., ... and under each question there are 4 answers that should be labeled as A., B., C., and D. I used a style to create the numbered list and the lettered list. However, I don't know how to restart the letters. For example, the answers for the 2nd question would range from E., F., G., H. Does anyone know how to restart the lettering back to A? I could manually specify the lettering in the answer string but I'm wondering how to do it with the style sheet. Thank you.
The short answer is that this is not supported yet in python-docx, but we might be able to provide you with a workaround if you ask on this issue on the GitHub issue list for the project: https://github.com/python-openxml/python-docx/issues/25
This particular operation turns out to be way more difficult in Word than anyone I know would have imagined. I believe that has to do with the need to maintain backward compatibility across the many versions spanning nearly three decades of Word.
Just to give you an idea, the style references a numbering definition which itself references an abstract numbering definition. Each paragraph with the style gets the next number/letter in the sequence. To restart the sequence, you have to create a NEW numbering definition that references the same abstract numbering sequence as the prior one. Then you reference the new numbering definition on the paragraph where the sequence should restart. Paragraphs with the style following that one get the next number/letter in the restarted sequence.
In order to accomplish that, you need to:
locate the numbering definition of the style
locate the abstract numbering definition it points to
create a new numbering definition with the restart bit set
tweak the paragraph element to refer to the new numbering definition
Anyway, now that I've vented about all that I can tell you we've actually gotten it to work before. We haven't added it to the API yet mostly I suppose because it's not entirely clear what the API should look like and it hasn't risen to the top of the backlog yet. But a couple workaround functions could probably get it done for you if you want it badly enough.
In your case I suppose it would be a toss-up. I would strongly consider placing those bits directly in the paragraph in this case, but you'll be best able to decide.
I've created a pull request (#582) that addresses this situation at a low level. All I have done is define the XML types necessary to implement the numbering subsystem of WML. @scanny has created a submodule called xmlchemy that creates a semi-abstract representation of the XML, so that you can handle multilevel lists and other numbering tasks if you are familiar with the standard. So if you build my fork, the following code will work:
#!/usr/bin/python
from docx import Document
from docx import oxml

d = Document()

"""
1. Create an abstract numbering definition for a multi-level numbering style.
"""
numXML = d.part.numbering_part.numbering_definitions._numbering
nextAbstractId = max([J.abstractNumId for J in numXML.abstractNum_lst]) + 1
l = numXML.add_abstractNum()
l.abstractNumId = nextAbstractId
m = l.add_multiLevelType()
m.val = 'multiLevel'

"""
2. Define numbering formats for each (zero-indexed) level.
   N.B. The formatting text is one-indexed.
   The user agent will accept up to nine levels.
"""
formats = {0: "decimal", 1: "upperLetter"}
textFmts = {0: '%1.', 1: '%2.'}
for i in range(2):
    lvl = l.add_lvl()
    lvl.ilvl = i
    n = lvl.add_numFmt()
    n.val = formats[i]
    lt = lvl.add_lvlText()
    lt.val = textFmts[i]

"""
3. Link the abstract numbering definition to a numbering definition.
"""
n = numXML.add_num(nextAbstractId)

"""
4. Define a function to set the (0-indexed) numbering level of a paragraph.
"""
def set_ilvl(p, ilvl):
    pr = p._element._add_pPr()
    np = pr.get_or_add_numPr()
    il = np.get_or_add_ilvl()
    il.val = ilvl
    ni = np.get_or_add_numId()
    ni.val = n.numId
    return p

"""
5. Create some content.
"""
for x in [1, 2, 3]:
    p = d.add_paragraph()
    set_ilvl(p, 0)
    p.add_run("Question %i" % x)
    for y in [1, 2, 3, 4]:
        p2 = d.add_paragraph()
        set_ilvl(p2, 1)
        p2.add_run("Choice %i" % y)

d.save('test.docx')
I have a task that I'm sure Python and pyparsing could really help with, but I'm still too much of a novice at programming to make a smart choice about how challenging the complete implementation will be, and whether it's worth attempting or is certain to be a fruitless time-sink.
The task is to translate strings of arbitrary length and nesting depth with a structure following the general grammar of this one:
item12345 'topic(subtopic(sub-subtopic), subtopic2), topic2'
into an item in a dictionary like this one:
{item12345: 'topic, topic:subtopic, topic:subtopic:sub-subtopic, topic:subtopic2, topic2'}
In other words, the logic is exactly like mathematics, where the item immediately to the left of the parentheses is distributed to everything inside, and the ',' separates the terms inside the parentheses, much like how multiplication distributes over the terms of a binomial.
I've either discovered for myself or found and understood examples of some of the seemingly necessary elements for creating this solution so far.
Parsing nested expressions in Python:
def parenthetic_contents(string):
    """Generate parenthesized contents in string as pairs (level, contents)."""
    stack = []
    for i, c in enumerate(string):
        if c == '(':
            stack.append(i)
        elif c == ')' and stack:
            start = stack.pop()
            yield (len(stack), string[start + 1:i])
Distributing one string to others:
from pyparsing import Suppress, Word, ZeroOrMore, alphas, nums, delimitedList

data = '''\
MSE 2110, 3030, 4102
CSE 1000, 2000, 3000
DDE 1400, 4030, 5000
'''

def memorize(t):
    memorize.dept = t[0]

def token(t):
    return "Course: %s %s" % (memorize.dept, int(t[0]))

course = Suppress(Word(alphas).setParseAction(memorize))
number = Word(nums).setParseAction(token)
line = course + delimitedList(number)
lines = ZeroOrMore(line)

final = lines.parseString(data)
for i in final:
    print i
And some others, but these methods won't directly apply to my ultimate solution, and I've still got a ways to go before I understand python and pyparsing well enough to combine the ideas or find new ones.
I've been hammering away at it by looking for examples, looking for things that work similarly, and learning more Python and more of pyparsing's classes and methods, but I'm not sure how far I am from knowing enough to build the full solution rather than just intermediate exercises that won't work for the general case.
So my questions are these. How complex a solution will I ultimately need in order to do what I'm looking for? What suggestions do you have that might help me get closer?
Thanks in advance! (PS - first post on StackOverflow, let me know if I need to do anything differently with regard to this post)
In pyparsing, your example would look something like:
from pyparsing import Word, alphanums, Forward, Optional, nestedExpr, delimitedList
topicString = Word(alphanums+'-')
expr = Forward()
expr << topicString + Optional(nestedExpr(content=delimitedList(expr)))
test = 'topic(subtopic(sub-subtopic), subtopic2), topic2'
print delimitedList(expr).parseString(test).asList()
Prints
['topic', ['subtopic', ['sub-subtopic'], 'subtopic2'], 'topic2']
Converting to topic:subtopic, etc. is left as an exercise for the OP.
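A sketch of that exercise (not part of the original answer): given the .asList() output above, a recursive walk can join each topic to its parent with ':'.

def flatten(parsed, prefix=''):
    # A sub-list following a topic holds that topic's children.
    out = []
    i = 0
    while i < len(parsed):
        path = prefix + ':' + parsed[i] if prefix else parsed[i]
        out.append(path)
        if i + 1 < len(parsed) and isinstance(parsed[i + 1], list):
            out.extend(flatten(parsed[i + 1], path))
            i += 1
        i += 1
    return out

print(flatten(['topic', ['subtopic', ['sub-subtopic'], 'subtopic2'], 'topic2']))
# -> ['topic', 'topic:subtopic', 'topic:subtopic:sub-subtopic',
#     'topic:subtopic2', 'topic2']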