Difficulty of this particular job using pyparsing? (beginner) - python

I have a task to do that I'm sure Python and pyparsing could really help with, but I'm still too much of a novice with programming to make a smart choice about how challenging the complete implementation will be and whether it's worth trying or is certain to be a fruitless time-sink.
The task is to translate strings of arbitrary length and nesting depth with a structure following the general grammar of this one:
item12345 'topic(subtopic(sub-subtopic), subtopic2), topic2'
into a dictionary entry like this one:
{'item12345': 'topic, topic:subtopic, topic:subtopic:sub-subtopic, topic:subtopic2, topic2'}
In other words, the logic is exactly like mathematics where the item immediately to the left of parentheses is distributed to everything inside, and the ',' designates the terms inside of the parentheses, much like how addition functions with respect to factors of a binomial.
I've either discovered for myself or found and understood examples of some of the seemingly necessary elements for creating this solution so far.
Parsing nested expressions in Python:
def parenthetic_contents(string):
    """Generate parenthesized contents in string as pairs (level, contents)."""
    stack = []
    for i, c in enumerate(string):
        if c == '(':
            stack.append(i)
        elif c == ')' and stack:
            start = stack.pop()
            yield (len(stack), string[start + 1: i])
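As a quick sanity check (my addition, not part of the snippet above), running it on the topic string from the question yields one pair per pair of parentheses, innermost first:
for level, contents in parenthetic_contents('topic(subtopic(sub-subtopic), subtopic2), topic2'):
    print level, repr(contents)
# prints:
# 1 'sub-subtopic'
# 0 'subtopic(sub-subtopic), subtopic2'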
Distributing one string to others:
from pyparsing import Suppress, Word, ZeroOrMore, alphas, nums, delimitedList

data = '''\
MSE 2110, 3030, 4102
CSE 1000, 2000, 3000
DDE 1400, 4030, 5000
'''

def memorize(t):
    memorize.dept = t[0]

def token(t):
    return "Course: %s %s" % (memorize.dept, int(t[0]))

course = Suppress(Word(alphas).setParseAction(memorize))
number = Word(nums).setParseAction(token)
line = course + delimitedList(number)
lines = ZeroOrMore(line)

final = lines.parseString(data)
for i in final:
    print i
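For reference (my note, assuming Python 2 and a contemporary pyparsing), this should print one line per course number, with the department distributed across its numbers:
Course: MSE 2110
Course: MSE 3030
Course: MSE 4102
Course: CSE 1000
Course: CSE 2000
Course: CSE 3000
Course: DDE 1400
Course: DDE 4030
Course: DDE 5000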
And some others, but these methods won't directly apply to my ultimate solution, and I've still got a ways to go before I understand python and pyparsing well enough to combine the ideas or find new ones.
I've been hammering away at it by looking for examples, looking for stuff that works similarly, learning more python and more of pyparsing's classes and methods, but I'm not sure how far away I am from knowing enough to make something that works for my full solution rather than just intermediate exercises that won't work for the general case.
So my questions are these. How complex a solution will I ultimately need in order to do what I'm looking for? What suggestions do you have that might help me get closer?
Thanks in advance! (PS - first post on StackOverflow, let me know if I need to do anything differently with regard to this post)

In pyparsing, your example would look something like:
from pyparsing import Word,alphanums,Forward,Optional,nestedExpr,delimitedList
topicString = Word(alphanums+'-')
expr = Forward()
expr << topicString + Optional(nestedExpr(content=delimitedList(expr)))
test = 'topic(subtopic(sub-subtopic), subtopic2), topic2'
print delimitedList(expr).parseString(test).asList()
Prints
['topic', ['subtopic', ['sub-subtopic'], 'subtopic2'], 'topic2']
Converting to topic:subtopic, etc. is left as an exercise for the OP.
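For completeness, here is one way that exercise could go. This is my own sketch rather than part of the answer above; it simply walks the nested list, treating each sub-list as belonging to the item that precedes it:
def flatten(nested, prefix=''):
    """Expand ['topic', ['subtopic', ...], 'topic2'] into colon-joined paths."""
    paths = []
    current = prefix
    for item in nested:
        if isinstance(item, list):
            # a nested group is distributed under the preceding topic
            paths.extend(flatten(item, current))
        else:
            current = item if not prefix else prefix + ':' + item
            paths.append(current)
    return paths

print ', '.join(flatten(delimitedList(expr).parseString(test).asList()))
# topic, topic:subtopic, topic:subtopic:sub-subtopic, topic:subtopic2, topic2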

Related

Parsing a "hierarchical" URL with regexes if there are splits in parsing logic

Is there any way to adjust the remaining regex pattern to what have been already matched? A rough sketch to illustrate the idea:
             pattern
            /   |   \
           /    |    \
    prefix1  prefix2  prefix3
       |        |        |
   postfix1  postfix2  postfix3
This is a rather theoretical question; the practical application below is only for illustrative purposes.
I'm trying to find the first URLs to popular code hosting platforms, like github, gitlab etc., in a large text. The problem is, all platforms have different URL patterns:
github.com/<user>/<repo>
gitlab.com/<group1>/<group2>/.../<repo>
sourceforge.net/projects/<repo>
I can use lookbehind expressions, but then the expression gets really monstrous (Python re):
pattern = re.compile(
    r"(github\.com|bitbucket\.org|gitlab\.com|sourceforge\.net)/"
    # middle part - empty for all except sourceforge
    r"(?:(?<=github\.com/)|(?<=bitbucket\.org/)|(?<=gitlab\.com/)|"
    r"(?<=sourceforge\.net/)projects/)("
    # final part, the repository pattern
    r"(?<=github\.com/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+|"
    r"(?<=bitbucket\.org/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+|"
    r"(?<=gitlab\.com/)[a-zA-Z0-9_.-]+(?:/[a-zA-Z0-9_.-]+)+|"
    r"(?<=sourceforge\.net/projects/)[a-zA-Z0-9_.-]+"
    r")")
Is there a more elegant way to do something like this?
The best way would probably be to use a custom parser instead and parse state-machine-style: first determine the site, then go down a site-specific route:
patterns = {
    'github.com': r'/(?P<user>[^/]+)/(?P<project>[^/#]+)(?:[/#]|$)',
    'sourceforge.net': r'/projects/(?P<project>[^/]+)/',
    <etc etc etc>
}
import urllib.parse
pr = urllib.parse.urlparse(url)
site = pr.hostname # in case port is specified
parts = re.match(patterns[site], pr.path).groupdict()
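Pulling those fragments together, a minimal runnable version might look like this. The www-stripping, the gitlab pattern and the example URL are my own assumptions for illustration, not taken from the answer above:
import re
import urllib.parse

patterns = {
    'github.com': r'/(?P<user>[^/]+)/(?P<project>[^/#]+)(?:[/#]|$)',
    'gitlab.com': r'/(?P<groups>[^#?]+)/(?P<project>[^/#?]+)/?$',   # assumption
    'sourceforge.net': r'/projects/(?P<project>[^/]+)/?',
}

def parse_repo_url(url):
    pr = urllib.parse.urlparse(url)
    site = (pr.hostname or '').lower()
    if site.startswith('www.'):
        site = site[4:]                     # normalise 'www.github.com' etc.
    if site not in patterns:
        return None
    m = re.match(patterns[site], pr.path)
    return m.groupdict() if m else None

print(parse_repo_url('https://github.com/someuser/somerepo/issues/42'))
# {'user': 'someuser', 'project': 'somerepo'}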
Instead of regexes, paths can be parsed with a state machine, too, which would likely be more manageable if there are further splits ahead:
(An enum is generally recommended instead of magic strings for states; I used magic strings solely to simplify the example code.)
import argparse

def parse_github(path):
    r = argparse.Namespace()
    pp = path.split('/')
    p = pp.pop(0)
    assert p == ''
    state = 'user'
    for p in pp:
        # we don't need to backtrack in this case, so `for' is a
        # fitting mechanism to iterate over the parts.
        # if we needed to backtrack, we'd have to use an index
        # variable or a stack or something
        if state == 'user':
            r.user = p
            state = 'project'
        elif state == 'project':
            r.project = p
            state = 'kind'
        elif state == 'kind':
            if p in {'pull', 'commit', 'blob'}:
                state = p
            else:
                break   # end parsing, ignore anything that's left
        elif state == 'pull':
            r.pr = p
            state = 'pr_tab'
        <etc etc>
    return r
In principle, there are no recursive constructs here, so this can be done solely with regexes, but this is very awkward:
site_patterns = [
    r"(github\.com/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+",
    r"(bitbucket\.org/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+",
    r"(gitlab\.com/)[a-zA-Z0-9_.-]+(?:/[a-zA-Z0-9_.-]+)+",
    r"(sourceforge\.net/projects/)[a-zA-Z0-9_.-]+",
    <etc etc etc>
]
r_all = re.compile("(" + "|".join(site_patterns) + ")")  # good luck debugging this monster

Understanding a custom encryption method in python

As part of an assignment I've been given some code written in python that was used to encrypt a message, and I have to try and understand the code and decrypt the ciphertext. I've never used python before and am somewhat out of my depth.
I understand most of it and the overall gist of what the code is trying to accomplish, however there are a few lines near the end tripping me up. Here's the entire thing (the &&& denotes sections of code which are supposed to be "damaged", while testing the code I've set secret to "test" and count to 3):
import string
import random
from base64 import b64encode, b64decode

secret = '&&&&&&&&&&&&&&'  # We don't know the original message or length
secret_encoding = ['step1', 'step2', 'step3']

def step1(s):
    _step1 = string.maketrans("zyxwvutsrqponZYXWVUTSRQPONmlkjihgfedcbaMLKJIHGFEDCBA",
                              "mlkjihgfedcbaMLKJIHGFEDCBAzyxwvutsrqponZYXWVUTSRQPON")
    return string.translate(s, _step1)

def step2(s):
    return b64encode(s)

def step3(plaintext, shift=4):
    loweralpha = string.ascii_lowercase
    shifted_string = loweralpha[shift:] + loweralpha[:shift]
    converted = string.maketrans(loweralpha, shifted_string)
    return plaintext.translate(converted)

def make_secret(plain, count):
    a = '2{}'.format(b64encode(plain))
    for count in xrange(count):
        r = random.choice(secret_encoding)
        si = secret_encoding.index(r) + 1
        _a = globals()[r](a)
        a = '{}{}'.format(si, _a)
    return a

if __name__ == '__main__':
    print make_secret(secret, count=&&&)
Essentially, I assume the code is meant to choose randomly from the three encryption methods step1, step2 and step3, then apply them to the cleartext a number of times as governed by whatever the value of "count" is.
The "make_secret" method is the part that's bothering me, as I'm having difficulty working out how it ties everything together and what the overall purpose of it is. I'll go through it line by line and give my reasons on each part, so someone can correct me if I'm mistaken.
a = '2{}'.format(b64encode(plain))
This takes the base64 encoding of whatever the "plain" variable corresponds to and appends a 2 to the start of it, resulting in something like "2VGhpcyBpcyBhIHNlY3JldA==" using "this is a secret" for plain as a test. I'm not sure what the 2 is for.
r = random.choice(secret_encoding)
si = secret_encoding.index(r) + 1
r is a random selection from the secret_encoding array, while si is that element's position in the array (its index plus one).
_a = globals()[r](a)
This is one of the parts that has me stumped. From researching global() it seems that the intention here is to turn "r" into a global dictionary consisting of the characters found in "a", ie somewhere later in the code a's characters will be used as a limited character set to choose from. Is this correct or am I way off base?
I've tried printing _a, which gives me what appears to be the letters and numbers found in the final output of the code.
a = '{}{}'.format(si, _a)
It seems as if this is creating a string which is a concatenation of the si and _a variables, however I'll admit I don't understand the purpose of doing this.
I realize this is a long question, but I thought it would be best to put the parts that are bothering me into context.
I will refrain from commenting on the readability of the code. I daresay
it was all intentional, anyway, for purposes of obfuscation. Your
professor is an evil bastard and I want to take his or her course :)
r = random.choice(secret_encoding)
...
_a = globals()[r](a)
You're way off base. This is essentially an ugly and hard-to-read way to
randomly choose one of the three functions and run it on a. The
function globals() returns a dict that maps names to identifiers; it
includes the three functions and other things. globals()[r] looks up
one of the three functions based on the name r. Putting (a) after
that runs the function with a as the argument.
a = '{}{}'.format(si, _a)
The idea here is to prepend each interim result with the number of the
function that encrypted it, so you know which function you need to
reverse to decrypt that step. They all accumulate at the beginning, and
get encrypted and re-encrypted with each step, except for the last one.
a = '2{}'.format(b64encode(plain))
Essentially, this is applying step2 first. Each encryption with
step2 prepends a 2.
So, the program applies count encryptions to the plaintext, with each
step using a randomly-chosen transformation, and the choice appears in
plaintext before the ciphertext. Your task is to read each prepended
number and apply the inverse transformation to the rest of the message.
You stop when the first character is not in "123".
One problem I see is that if the plaintext begins with a digit in
"123", it will look like we should perform another decryption step. In
practice, however, I feel sure that the professor's choice of plaintext
does not begin with such a digit (unless they're really evil).
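To make the decryption loop described above concrete, here is a sketch of the reverse procedure. It is my own code, not part of the assignment, and it assumes the same Python 2 environment, the default shift of 4 for step3, and a plaintext that does not start with a digit in "123":
from base64 import b64decode
import string

def unstep1(s):
    # step1 swaps the two halves of the alphabet, so it is its own inverse
    table = string.maketrans("zyxwvutsrqponZYXWVUTSRQPONmlkjihgfedcbaMLKJIHGFEDCBA",
                             "mlkjihgfedcbaMLKJIHGFEDCBAzyxwvutsrqponZYXWVUTSRQPON")
    return string.translate(s, table)

def unstep2(s):
    return b64decode(s)

def unstep3(s, shift=4):
    loweralpha = string.ascii_lowercase
    shifted = loweralpha[shift:] + loweralpha[:shift]
    # translate the shifted alphabet back onto the plain one
    return s.translate(string.maketrans(shifted, loweralpha))

inverses = {'1': unstep1, '2': unstep2, '3': unstep3}

def reveal(ciphertext):
    a = ciphertext
    # keep undoing steps while the leading character names one of them
    while a and a[0] in '123':
        a = inverses[a[0]](a[1:])
    return a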

Fastest way to compare and replace key value pairs in Python

I have a number of files where I want to replace all instances of a specific string with another one.
I currently have this code:
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
replaceFile = open('file', 'r+')

# read in all the lines
lines = replaceFile.readlines()

# seek to the start of the file and truncate
# (this is because I want to do an "inline" replace)
replaceFile.seek(0)
replaceFile.truncate()

# Loop through each line from file
for line in lines:
    # Loop through each key in the mappings dict
    for i in mappings.keys():
        # if the key appears in the line
        if i in line:
            # do replacement
            line = line.replace(i, mappings[i])
    # Write the line to the file and move to next line
    replaceFile.write(line)
This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.
For instance, in the "mappings" dict there are 60728 key value pairs.
I need to process up to 50 files and replace all instances of "key" with the corresponding value, and each of the 50 files is approximately 250000 lines.
There are also multiple instances where there are multiple keys that need to be replaced on the one line, hence I can't just find the first match and then move on.
So my question is:
Is there a faster way to do the above?
I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
If you need more info, let me know.
If this performance is slow, you'll have to find something fancy. It's just about all running at C-level:
for filename in filenames:
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)
Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
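If you would rather keep everything in one script than launch several shell instances, a multiprocessing.Pool can hand each worker its own file. This is only a sketch of that idea; the file names are placeholders, and mappings is defined at module level so the worker processes can see it:
import multiprocessing as mp

mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

def replace_in_file(filename):
    with open(filename, 'r+') as f:
        data = f.read()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.seek(0)
        f.truncate()
        f.write(data)

if __name__ == '__main__':
    filenames = ['file1', 'file2', 'file3']   # your ~50 files here
    pool = mp.Pool(processes=mp.cpu_count())
    pool.map(replace_in_file, filenames)      # one file per worker task, in parallel
    pool.close()
    pool.join()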
Apparently str.replace is faster than regex.sub.
So I got to thinking about this a bit more: suppose you have a really huge mappings. So much so that the likelihood of any one key in mappings being detected in your files is very low. In this scenario, all the time will be spent doing the searching (as pointed out by @abarnert).
Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).
So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:
import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into roughly equal pieces
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top level, or
    # it can't be pickled (multiprocessing uses pickling)
    queue.put([k for k in keys if k in data])

def mass_replace(data, mappings):
    manager = mp.Manager()
    queue = mp.Queue()
    # Data will be SHARED (not duplicated for each process)
    d = manager.list(data)
    # Split the MAPPINGS KEYS up into multiple LISTS,
    # same number as CPUs
    key_batches = split_seq(mappings.keys(), mp.cpu_count())
    # Start the key detections
    processes = []
    for i, keys in enumerate(key_batches):
        p = mp.Process(target=detect_active_keys, args=(keys, d, queue))
        # This is non-blocking
        p.start()
        processes.append(p)
    # Consume the output from the queues
    active_keys = []
    for p in processes:
        # We expect one list of active keys per process exactly
        # (this is blocking)
        active_keys.extend(queue.get())
    # Wait for the processes to finish
    for p in processes:
        # Note that you MUST only call join() after
        # calling queue.get()
        p.join()
    # Same as original submission, now with MUCH fewer keys
    for key in active_keys:
        data = data.replace(key, mappings[key])
    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from
    # here, due to how multiprocessing works
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)
Some notes:
I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid python. It should not be difficult to get it to run.
Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster with the same code. The docs for multiprocessing show how to work with machines on a network.
This code is still pretty simple. I would love to know whether it improves your speed at all.
There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.
According to http://pravin.paratey.com/posts/super-quick-find-replace, regex is the fastest way to go for Python. (Building a Trie data structure would be fastest for C++) :
import sys, re, time, hashlib

class Regex:
    # Regex implementation of find/replace for a massive word list.
    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, matchObj):
        key = matchObj.group(0)
        if self._mappings.has_key(key):
            return self._mappings[key]
        else:
            return key

    def replace_all(self, filename):
        text = ''
        with open(filename, 'r+') as fp:
            text = fp.read()
        text = re.sub("[a-zA-Z]+", self.replace_func, text)
        with open(filename, "w") as fp:
            fp.write(text)

# mapping dictionary of find -> replace pairs
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# initialize the Regex class with the mapping dictionary
r = Regex(mappings)

# replace file
r.replace_all('file')
The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched for all M substrings at each position, instead of making M passes over the whole string, might offer some cache/paging benefits, but it'll be a lot more complicated for probably only a small benefit.
So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)
The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so, e.g., "original-1" and "original-2" are both branches off the same subtree "original-", so they don't need to be handled separately until the very last character.
The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)
There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and package filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.
Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.
Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, that you can then run against your data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:
import ahocorasick

tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()

for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]
This uses a with block to prevent leaking file descriptors. The string replace call will ensure all instances of each key get replaced within the text.
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
with open('file', 'r+') as fd:
    # read in all the data
    text = fd.read()
    # seek to the start of the file and truncate so the file will be edited inline
    fd.seek(0)
    fd.truncate()
    for key in mappings.keys():
        text = text.replace(key, mappings[key])
    fd.write(text)

Construct a tree from a list of file paths (Python) - Performance dependent

Hey I am working on a very high performance file-managing/analyzing toolkit written in python.
I want to create a function that gives me a list or something like that in a tree format.
Something like in this question (java-related)
From:
dir/file
dir/dir2/file2
dir/file3
dir3/file4
dir3/file5
Note: the list of paths is unsorted
To:
dir/
    file
    dir2/
        file2
    file3
dir3/
    file4
    file5
[[dir, [file, [dir2, [file2]], file3]], [dir3, [file4, file5]]]
something along those lines. I've been playing around with some ideas but none of them provided the speed that I would like to have.
Note: I do already have the list of paths, so no worrying about that. The function takes paths list and gives tree list.
Thanks in Advance
Now that you clarified the question a bit more, I guess the following is what you want:
from collections import defaultdict

input_ = '''dir/file
dir/dir2/file2
dir/file3
dir2/alpha/beta/gamma/delta
dir2/alpha/beta/gamma/delta/
dir3/file4
dir3/file5'''

FILE_MARKER = '<files>'

def attach(branch, trunk):
    '''
    Insert a branch of directories on its trunk.
    '''
    parts = branch.split('/', 1)
    if len(parts) == 1:  # branch is a file
        trunk[FILE_MARKER].append(parts[0])
    else:
        node, others = parts
        if node not in trunk:
            trunk[node] = defaultdict(dict, ((FILE_MARKER, []),))
        attach(others, trunk[node])

def prettify(d, indent=0):
    '''
    Print the file tree structure with proper indentation.
    '''
    for key, value in d.iteritems():
        if key == FILE_MARKER:
            if value:
                print ' ' * indent + str(value)
        else:
            print ' ' * indent + str(key)
            if isinstance(value, dict):
                prettify(value, indent + 1)
            else:
                print ' ' * (indent + 1) + str(value)

main_dict = defaultdict(dict, ((FILE_MARKER, []),))
for line in input_.split('\n'):
    attach(line, main_dict)
prettify(main_dict)
It outputs:
dir3
 ['file4', 'file5']
dir2
 alpha
  beta
   gamma
    ['delta']
    delta
     ['']
dir
 dir2
  ['file2']
 ['file', 'file3']
A few things to note:
The script makes heavy use of defaultdicts; basically, this allows it to skip checking for the existence of a key and initialising it if it is not there.
Directory names are mapped to dictionary keys. I thought this might be a good feature for you, as keys are hashed and you will be able to retrieve information much faster this way than with lists. You can access the hierarchy in the form main_dict['dir2']['alpha']['beta']...
Note the difference between .../delta and .../delta/. I thought this was helpful for you to be able to quickly differentiate between your leaf being a directory or a file.
I hope this answers your question. If anything is unclear, post a comment.
I'm not fully clear on what you have vs what you need (it'd probably help to provide some of the code you have that's too slow), but you probably should just break up your pathnames into dirnames and basenames, then build a tree from that using a purpose-made class, or at least a hierarchy of lists or dictionaries. Then various traversals should allow you to serialize in almost any way you require.
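As a rough illustration of that suggestion (my sketch, untested against the OP's data), splitting each path and walking a nested dict builds the hierarchy in a single pass; files become None leaves and directories become sub-dicts:
def build_tree(paths):
    tree = {}
    for path in paths:
        node = tree
        parts = path.split('/')
        for directory in parts[:-1]:
            # descend, creating intermediate directories as needed
            node = node.setdefault(directory, {})
        node[parts[-1]] = None   # leaf: a file
    return tree

paths = ['dir/file', 'dir/dir2/file2', 'dir/file3', 'dir3/file4', 'dir3/file5']
print build_tree(paths)
# roughly (dict ordering may vary):
# {'dir': {'file': None, 'dir2': {'file2': None}, 'file3': None},
#  'dir3': {'file4': None, 'file5': None}}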
As to the performance issues, have you considered using Pypy, Cython or Shedskin? I have a deduplicating backup system I've been working on for fun, that can run the same code on Pypy or Cython; running it on Pypy actually outperforms the Cython-augmented version (by a lot on 32 bit, by a little bit on 64 bit). I'd love to compare shedskin as well, but it apparently can't yield across the shedskin/cpython boundary.
Also, profiling is de rigueur when you have performance issues - at least, if you've already selected an appropriate algorithm.
First off, "very hight performance" and "Python" don't mix well. If what you are looking for is optimising performance to the extreme, switching to C will bring you benefits far superior to any smart code optimisation that you might think of.
Secondly, it's hard to believe that the bottleneck in a "file-managing/analyzing toolkit" will be this function. I/O operations on disk are at least a few order of magnitude slower than anything happening in memory. Profiling your code is the only accurate way to gauge this but... I'm ready to pay you a pizza if I'm wrong! ;)
I built a silly test function just to perform some preliminary measurement:
from timeit import Timer as T

PLIST = [['dir', ['file', ['dir2', ['file2']], 'file3']], ['dir3', ['file4', 'file5', 'file6', 'file7']]]

def tree(plist, indent=0):
    level = []
    for el in plist:
        if isinstance(el, list):
            level.extend(tree(el, indent + 2))
        else:
            level.append(' ' * indent + el)
    return level

print T(lambda : tree(PLIST)).repeat(number=100000)
This outputs:
[1.0135619640350342, 1.0107290744781494, 1.0090651512145996]
Since the test path list is 10 files and the number of iterations is 100000, this means that in 1 second you can process a tree of about 1 million files. Now... unless you are working at Google, that seems an acceptable result to me.
By contrast, when I started writing this answer, I clicked on the "Properties" option on the root of my main 80 GB HD [this should give me the number of files on it, using C code]. A few minutes have gone by, and I'm at around 50 GB, 300000 files...
HTH! :)

Specific doubts on kgp.py program in dive into python book

Dive into Python: XML Processing -
Here I am referring to a portion of kgp.py program -
def getDefaultSource(self):
    xrefs = {}
    for xref in self.grammar.getElementsByTagName("xref"):
        xrefs[xref.attributes["id"].value] = 1
    xrefs = xrefs.keys()
    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
    if not standaloneXrefs:
        raise NoSourceError, "can't guess source, and no source specified"
    return '<xref id="%s"/>' % random.choice(standaloneXrefs)
self.grammar: parsed XML representation (using xml.dom.minidom) of -
<?xml version="1.0" ?>
<grammar>
  <ref id="bit">
    <p>0</p>
    <p>1</p>
  </ref>
  <ref id="byte">
    <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
       <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
  </ref>
</grammar>
self.refs: a cache of all the refs of the above XML, keyed by their id
I have two doubts with this code:
Doubt 1:
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
Eventually xrefs holds the id values in a list. Couldn't we have done this simply by -
xrefs = [xref.attributes["id"].value
         for xref in self.grammar.getElementsByTagName("xref")]
Doubt 2:
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
Here, we are saving the ref from self.refs which we do NOT see in our computed xrefs. But next instead of creating a <ref> element, we are creating a <xref> with the same ID. This takes us one step backward, since later we are anyway going to find the cross reference for this computed <xref> and eventually reach the <ref>. We could have just started with this <ref> in the first place.
Disclaimer
I am in no way trying to make a remark on the book. I am not even qualified for that.
I am loving every moment of reading this book. I realize a few chapters have become outdated, but I love Mark Pilgrim's writing style and I cannot stop reading.
Dive Into Python is seven years old now (published 2004), and doesn't always contain the most modern code. So you need to go easy on it: Dive Into Python 3 might be a better bet.
Your suggestion for doubt 1 changes the meaning of the code: putting the ids into the keys of a dictionary and then getting them out again eliminates duplicates, whereas your list comprehension includes duplicates. The modern approach would be to use a set comprehension:
xrefs = {xref.attributes["id"].value
         for xref in self.grammar.getElementsByTagName("xref")}
but this wasn't available in 2004.
On your doubt 2, I'm not entirely sure I see the problem. Yes, in some sense this is a waste, but on the other hand the code already has a handler for the xref case, so it makes sense to re-use that handler rather than add an extra special case.
There are several other bits of code in that example that could be modernized. For example,
source and source or self.getDefaultSource()
would now be source or self.getDefaultSource(). And the line
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
would be better expressed as a set difference operation, something like:
standaloneXrefs = set(self.refs) - set(xrefs)
But that's what happens as languages become more expressive: old code starts to look rather inelegant.
Your doubts are totally justified: that code doesn't look very good to me at all. For example, it uses 1 as a boolean value where True would have sufficed and been clearer.
Doubt 1:
These two snippets don't do the same thing. If there are duplicates, the original code will filter them out, but your alternative won't. On the other hand, your code preserves the original ordering, whereas the original returns the elements in an arbitrary order.
To be fully equivalent, we could use the set builtin:
xrefs = list(set([xref.attributes["id"].value
                  for xref in self.grammar.getElementsByTagName("xref")]))
(It might not make sense to convert back to a list, though.)
Doubt 2:
Out of time, gotta run, sorry...
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
This is an extremely crude way to construct a set. This should be written as
set(xref.attributes["id"].value
    for xref in self.grammar.getElementsByTagName("xref"))
or even (in Python 2.7+):
{xref.attributes["id"].value
 for xref in self.grammar.getElementsByTagName("xref")}
If avoiding duplicates is not an issue, your solution (constructing a list) works too. Since xref is iterated over anyway, one could even generate an iterator.
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
This code is completely broken if xref contains a special character such as " or &.
However, in principle, it is correct to construct an <xref> element here, since this must be the same format that the external source has (getDefaultSource is called as
self.loadSource(source and source or self.getDefaultSource())
).
Both code excerpts are examples of bad programming and should not be included in a book that intends to teach people how to program. Dive Into Python3 has better XML examples and code.
