python-docx - how to restart list lettering - python

I'm automating the process of creating a Word document with the python-docx module. In particular, I'm creating a multiple choice test where the questions are numbered 1., 2., 3., ... and under each question there are 4 answers that should be labeled as A., B., C., and D. I used a style to create the numbered list and the lettered list. However, I don't know how to restart the letters. For example, the answers for the 2nd question would range from E., F., G., H. Does anyone know how to restart the lettering back to A? I could manually specify the lettering in the answer string but I'm wondering how to do it with the style sheet. Thank you.

The short answer is that this isn't supported in python-docx yet, but we may be able to provide a workaround if you ask on this issue on the project's GitHub issue list: https://github.com/python-openxml/python-docx/issues/25
This particular operation turns out to be far more difficult in Word than anyone I know would have imagined. I believe that has to do with the need to maintain backward compatibility across the many versions Word has accumulated over roughly three decades.
Just to give you an idea, the style references a numbering definition which itself references an abstract numbering definition. Each paragraph with the style gets the next number/letter in the sequence. To restart the sequence, you have to create a NEW numbering definition that references the same abstract numbering sequence as the prior one. Then you reference the new numbering definition on the paragraph where the sequence should restart. Paragraphs with the style following that one get the next number/letter in the restarted sequence.
In order to accomplish that, you need to:
locate the numbering definition of the style
locate the abstract numbering definition it points to
create a new numbering definition with the restart bit set
tweak the paragraph element to refer to the new numbering definition
Anyway, now that I've vented about all that, I can tell you we've actually gotten it to work before. We haven't added it to the API yet, mostly because it's not entirely clear what the API should look like and it hasn't risen to the top of the backlog. But a couple of workaround functions could probably get it done for you if you want it badly enough.
In your case I suppose it would be a toss-up. I would strongly consider placing the letters directly in the answer text in this case, but you'll be best able to decide.
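For reference, here is a rough sketch of what such a workaround could look like, using python-docx's low-level oxml objects. The numbering_part / numbering_definitions / _numbering / add_num members are the same internals the answer below relies on; the add_lvlOverride and add_startOverride calls are an assumption on my part, so verify they exist in your installed version before relying on this:
def restart_numbering(document, paragraph, abstract_num_id, ilvl=0):
    """Sketch only: give *paragraph* a brand-new numbering definition that
    reuses *abstract_num_id* and restarts its sequence (so letters go back
    to 'A.'). Relies on python-docx internals, not public API."""
    numbering = document.part.numbering_part.numbering_definitions._numbering
    num = numbering.add_num(abstract_num_id)              # new <w:num> element
    num.add_lvlOverride(ilvl=ilvl).add_startOverride(1)   # the "restart bit"
    # Point the paragraph at the new numbering definition.
    pPr = paragraph._p.get_or_add_pPr()
    numPr = pPr.get_or_add_numPr()
    numPr.get_or_add_ilvl().val = ilvl
    numPr.get_or_add_numId().val = num.numId
    return num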

I've created a pull request (#582) that addresses this situation at a low level. All I have done is define the XML types necessary to implement the numbering subsystem of WML. @scanny has created a submodule called xmlchemy that builds a semi-abstract representation of the XML, so you can handle multilevel lists and other numbering tasks if you are familiar with the standard. So if you build my fork, the following code will work:
#!/usr/bin/python
from docx import Document
from docx import oxml
d = Document()
"""
1. Create an abstract numbering definition for a multi-level numbering style.
"""
numXML = d.part.numbering_part.numbering_definitions._numbering
nextAbstractId = max([ J.abstractNumId for J in numXML.abstractNum_lst ] ) + 1
l = numXML.add_abstractNum()
l.abstractNumId = nextAbstractId
m = l.add_multiLevelType()
m.val = 'multiLevel'
"""
2. Define numbering formats for each (zero-indexed)
level. N.B. The formatting text is one-indexed.
The user agent will accept up to nine levels.
"""
formats = {0: "decimal", 1: "upperLetter" }
textFmts = {0: '%1.', 1: '%2.' }
for i in range(2):
    lvl = l.add_lvl()
    lvl.ilvl = i
    n = lvl.add_numFmt()
    n.val = formats[i]
    lt = lvl.add_lvlText()
    lt.val = textFmts[i]
"""
3. Link the abstract numbering definition to a numbering definition.
"""
n = numXML.add_num(nextAbstractId)
"""
4. Define a function to set the (0-indexed) numbering level of a paragraph.
"""
def set_ilvl(p, ilvl):
    pr = p._element._add_pPr()
    np = pr.get_or_add_numPr()
    il = np.get_or_add_ilvl()
    il.val = ilvl
    ni = np.get_or_add_numId()
    ni.val = n.numId
    return p
"""
5. Create some content
"""
for x in [1, 2, 3]:
    p = d.add_paragraph()
    set_ilvl(p, 0)
    p.add_run("Question %i" % x)
    for y in [1, 2, 3, 4]:
        p2 = d.add_paragraph()
        set_ilvl(p2, 1)
        p2.add_run("Choice %i" % y)
d.save('test.docx')

Related

Python comment-preserving parsing using only builtin libraries?

I wrote a library using just the ast and inspect libraries to parse and emit internal Python constructs (it uses astor on Python < 3.9).
I just realised that I really need to preserve comments after all, preferably without resorting to RedBaron or LibCST, as I only need to emit the unaltered commentary. Is there a clean and concise way of comment-preserving parsing/emitting of Python source with just the stdlib?
What I ended up doing was writing a simple parser, without a meta-language, in 339 source lines:
https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py
Implementation of Concrete Syntax Tree [List!]
Reads source character by character;
Once end of statement† is detected, add statement-type into 1D list;
†"end of statement" means: end of line if line.lstrip().startswith("#"), or if the line does not end with '\\' and balanced_parens(line) holds; otherwise continue munching until that condition is true… plus some edge cases around multiline strings and the like;
Once finished there is a big (1D) list where each element is a namedtuple with a value property.
Integration with builtin Abstract Syntax Tree ast
Limit ast nodes to modify—not remove—to: {ClassDef,AsyncFunctionDef,FunctionDef} docstring (first body element Constant|Str), Assign and AnnAssign;
cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
if doc_str node then maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list)
…
Now the cst_list contains changes only to those aforementioned nodes, and only when the change is more than whitespace; it can be turned back into a string with "".join(map(attrgetter("value"), cst_list)) for passing to eval or writing straight out to a source file (e.g., for in-place overwriting).
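To make the shape of that list concrete, here is a tiny hypothetical illustration (the real node type in cdd-python's cst_utils carries more fields and different names; only the value attribute is relied on above):
from collections import namedtuple
from operator import attrgetter

# Hypothetical CST node: the real implementation stores more metadata,
# but each element exposes a .value holding the exact source text,
# whitespace and comments included.
CstNode = namedtuple("CstNode", ("lineno", "value"))

cst_list = [
    CstNode(1, "# a preserved comment\n"),
    CstNode(2, "x = 4 + 4  # trailing comment\n"),
]

# Round-trip back to source text, exactly as described above.
print("".join(map(attrgetter("value"), cst_list)), end="")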
Quality control
100% test coverage
100% doc coverage
Support for last 6 versions of Python (including latest alpha)
CI/CD
(Apache-2.0 OR MIT) licensed
Limitations
Lack of a meta-language (specifically, not using Python's provided grammar) means new syntax elements won't automatically be supported (match/case is supported, but any syntax introduced since isn't [yet?] supported, at least not automatically);
Not built into the stdlib, so the stdlib could break compatibility;
Deleting nodes is [probably] not supported;
Nodes can be incorrectly identified if there are shadow variables or similar issues that linters should point out.
Comments can be preserved by merging them back into the generated source code by capturing them with the tokenizer.
Given a toy program in a program variable, we can demonstrate how comments get lost in the AST:
import ast
program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
       'b'] # Ending comment for b lost
def p1f(x):
    "p1f docstring"
    # Comment in function p1f lost
    return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))
The output shows all comments gone:
== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']
def p1f(x):
    """p1f docstring"""
    return x
print(p1f(p1l), p1f(p1v))
However, if we scan the comments with the tokenizer, we can
use this to merge the comments back in:
from io import StringIO
import tokenize
def scan_comments(source):
    """ Scan source code file for relevant comments
    """
    # Find token for comments
    for k, v in tokenize.tok_name.items():
        if v == 'COMMENT':
            comment = k
            break
    comtokens = []
    with StringIO(source) as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            if token.type != comment:
                continue
            comtokens += [token]
    return comtokens
comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])
Output (edited to split long line):
== Comment after p1l[0]
TokenInfo(type=60 (COMMENT),
string='# Implicit line joining comment for a lost',
start=(4, 12), end=(4, 54),
line="p1l = ['a', # Implicit line joining comment for a lost\n")
Using a slightly modified version of ast.unparse(), replacing
methods maybe_newline() and traverse() with modified versions,
you should be able to merge back in all comments at their
approximate locations, using the location info from the comment
scanner (start variable), combined with the location info from the
AST; most nodes have a lineno attribute.
Not exactly. See for example the list variable assignment. The
source code is split out over two lines, but ast.unparse()
generates only one line (see output in the second code segment).
Also, make sure to update the location info in the AST using ast.increment_lineno() after adding code.
It seems some more calls to
maybe_newline() might be needed in the library code (or its
replacement).
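As a very rough sketch of that merging idea (a sketch only, assuming that re-inserting full-line comments at their original line numbers is good enough, which, as discussed above, is only approximate because ast.unparse() reflows code):
import ast
import tokenize
from io import StringIO

def unparse_with_comments(source):
    """Regenerate source from the AST, then re-insert comments near their
    original line numbers. Approximate: unparse() reflows code, so comment
    positions can drift (see the p1l example above)."""
    lines = ast.unparse(ast.parse(source)).splitlines()
    comments = [t for t in tokenize.generate_tokens(StringIO(source).readline)
                if t.type == tokenize.COMMENT]
    # Insert from the last comment backwards so earlier indices stay valid.
    for tok in reversed(comments):
        idx = min(tok.start[0] - 1, len(lines))
        lines.insert(idx, tok.string)
    return "\n".join(lines) + "\n"

print(unparse_with_comments(program))   # using the toy program from above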

Can Python-Markdown support imageboard-style links?

I would like to add an additional syntax to Python-Markdown: if n is a positive integer, >>n should expand into a link to post n, e.g. <a href="#post-n">>>n</a>. (Double angled brackets (>>) is a conventional syntax for creating links in imageboard forums.)
By default, Python-Markdown expands >>n into nested blockquotes: <blockquote><blockquote>n</blockquote></blockquote>. Is there a way to create links out of >>n, while preserving the rest of blockquote's default behavior? In other words, if x is a positive integer, >>x should expand into a link, but if x is not a positive integer, >>x should still expand into nested blockquotes.
I have read the relevant wiki article: Tutorial 1 Writing Extensions for Python Markdown. Based on what I learned in the wiki, I wrote a custom extension:
import markdown
import xml.etree.ElementTree as ET
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern
class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        # Create link.
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        IMAGEBOARD_LINK_RE = '>>(?P<number>[1-9][0-9]*)'
        imageboard_link = ImageboardLinkPattern(IMAGEBOARD_LINK_RE)
        md.inlinePatterns['imageboard_link'] = imageboard_link

html = markdown.markdown('>>123',
                         extensions=[ImageboardLinkExtension()])
print(html)
However, >>123 still produces <blockquote><blockquote>123</blockquote></blockquote>. What is wrong with the implementation above?
The problem is that your new syntax conflicts with the preexisting blockquote syntax. Your extension would presumably work if it were ever called. However, due to the conflict, that never happens. Note that there are five types of processors. As documented:
Preprocessors alter the source before it is passed to the parser.
Block Processors work with blocks of text separated by blank lines.
Tree Processors modify the constructed ElementTree.
Inline Processors are common tree processors for inline elements, such as *strong*.
Postprocessors munge the output of the parser just before it is returned.
Of importance here is that the processors are run in that order. In other words, all block processors are run before any inline processors are run. Therefore, the blockquote block processor runs first on your input and removes the double angle bracket, wrapping the rest of the line in double blockquote tags. By the time your inline processor sees the document, your regex will no longer match and will therefore never be called.
That being said, an inline processor is the correct way to implement a link syntax. However, you would need to do one of two things to make it work.
Alter the syntax so that it does not clash with any preexisting syntax; or
Alter the blockquote behavior to avoid the conflict.
Personally, I would recommend option 1, but I understand you are trying to implement a preexisting syntax from another environment. So, if you want to explore option 2, then I would suggest perhaps making the blockquote syntax a little more strict. For example, while it is not required, the recommended syntax is to always insert a space after the angle bracket in a blockquote. It should be relatively simple to alter the BlockquoteProcessor to require the space, which would cause your syntax to no longer clash.
This is actually pretty simple. As you may note, the entire syntax is defined via a rather simple regex:
RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')
You simply need to rewrite that so that zero whitespace is no longer accepted ("> " rather than ">[ ]?"). First import and subclass the existing processor, then override the regex:
import re
from markdown.blockprocessors import BlockQuoteProcessor

class CustomBlockQuoteProcessor(BlockQuoteProcessor):
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')
Finally, you just need to tell Markdown to use your custom class rather than the default. Add the following to the extendMarkdown method of your ImageboardLinkExtension class:
md.parser.blockprocessors.register(CustomBlockQuoteProcessor(md.parser), 'quote', 20)
Now the blockquote syntax will no longer clash with your link syntax and you will get an opportunity to have your code run on the text. Just be careful to remember to always include the now required space for any actual blockquotes.
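Putting the pieces together, the whole extension might end up looking roughly like this (a sketch assembled from the snippets above; the priority 20 for 'quote' mirrors the register() call earlier, and the inline-pattern priority of 175 is an arbitrary choice placed ahead of the built-in link patterns):
import re
import markdown
import xml.etree.ElementTree as ET
from markdown.blockprocessors import BlockQuoteProcessor
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class CustomBlockQuoteProcessor(BlockQuoteProcessor):
    # Require a space after '>' so that '>>123' is left for the inline pattern.
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        md.parser.blockprocessors.register(
            CustomBlockQuoteProcessor(md.parser), 'quote', 20)
        md.inlinePatterns.register(
            ImageboardLinkPattern('>>(?P<number>[1-9][0-9]*)'),
            'imageboard_link', 175)

print(markdown.markdown('>>123', extensions=[ImageboardLinkExtension()]))
# -> <p><a href="#post-123">&gt;&gt;123</a></p>
print(markdown.markdown('> a real quote', extensions=[ImageboardLinkExtension()]))
# -> <blockquote> ... <p>a real quote</p> ... </blockquote>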

Getting element density from abaqus output database using python scripting

I'm trying to get the element density from the abaqus output database. I know you can request a field output for the volume using 'EVOL', is something similar possible for the density?
I'm afraid it isn't, because of this: Getting element mass in Abaqus postprocessor
What would be the most efficient way to get the density? Look for every element in which section set it is?
Found a solution, I don't know if it's the fastest but it works:
odb_file_path = r'your_path\file.odb'
odb = session.openOdb(name=odb_file_path)
instance = odb.rootAssembly.instances['MY_PART']
material_name = instance.elements[0].sectionCategory.name[8:-2]
density = odb.materials[material_name].density.table[0][0]
Note: the 'name' attribute will give you a string like 'solid MATERIALNAME', so I just cut out the part of the string that gives the real material name. So it's the sectionCategory attribute of an OdbElementObject that is the answer.
EDIT: This doesn't seem to work after all, it turns out that it gives all elements the same material name, being the name of the first material.
The properties are associated something like this:
sectionAssignment connects section to set
set is the container for element
section connects sectionAssignment to material
instance is connected to part (could be from a part from another model)
part is connected to model
model is connected to section
Use the .inp or .cae file if you can; the following gets it from an opened cae file. To thoroughly get elements from materials, you would do something like the following, assuming you're starting your search in rootAssembly.instances:
Find the parts which the instances were created from.
Find the models which contain these parts.
Look for all sections with material_name in these parts, and store all the sectionNames associated with these sections.
Look for all sectionAssignments which reference these sectionNames.
Under each of these sectionAssignments, there is an associated region object which has the name (as a string) of an elementSet and the name of a part. Get all the elements from this elementSet in this part.
Cleanup:
Use the Python set object to remove any multiple references to the same element.
Multiply the number of elements in this set by the number of identical part instances that refer to this material in rootAssembly.
E.g., for some cae model variable called model:
model_part_repeats = {}
model_part_elemLabels = {}
for instance in model.rootAssembly.instances.values():
    p = instance.part.name
    m = instance.part.modelName
    try:
        model_part_repeats[(m, p)] += 1
        continue
    except KeyError:
        model_part_repeats[(m, p)] = 1

    # Get all sections in model
    sectionNames = []
    for s in mdb.models[m].sections.values():
        if s.material == material_name:  # material_name is already known
            # This is a valid section - search for section assignments
            # in part for this section, and then the associated set
            sectionNames.append(s.name)

    if sectionNames:
        labels = []
        for sa in mdb.models[m].parts[p].sectionAssignments:
            if sa.sectionName in sectionNames:
                eset = sa.region[0]
                labels = labels + [e.label for e in mdb.models[m].parts[p].sets[eset].elements]
        labels = list(set(labels))
        model_part_elemLabels[(m, p)] = labels
    else:
        model_part_elemLabels[(m, p)] = []

num_elements_with_material = sum([model_part_repeats[k] * len(model_part_elemLabels[k]) for k in model_part_repeats])
Finally, grab the material density associated with material_name then multiply it by num_elements_with_material.
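For example, a minimal sketch of that last step (assuming the material has a Density definition in the same model m used above; the table rows are (density, temperature) pairs, as in the odb-based snippet earlier in this thread):
# Sketch: look up the density from the model database and scale it by the
# element count computed above, as described in the text.
density = mdb.models[m].materials[material_name].density.table[0][0]
density_times_element_count = density * num_elements_with_material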
Of course, this method will be extremely slow for larger models, and it is more advisable to use string techniques on the .inp file for faster performance.

Specific doubts on kgp.py program in dive into python book

Dive into Python: XML Processing -
Here I am referring to a portion of the kgp.py program:
def getDefaultSource(self):
    xrefs = {}
    for xref in self.grammar.getElementsByTagName("xref"):
        xrefs[xref.attributes["id"].value] = 1
    xrefs = xrefs.keys()
    standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
    if not standaloneXrefs:
        raise NoSourceError, "can't guess source, and no source specified"
    return '<xref id="%s"/>' % random.choice(standaloneXrefs)
self.grammar: parsed XML representation (using xml.dom.minidom) of -
<?xml version="1.0" ?>
<grammar>
<ref id="bit">
<p>0</p>
<p>1</p>
</ref>
<ref id="byte">
<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>
self.refs: a cache of all the <ref> elements of the above XML, keyed by their id
I have two doubts about this code:
Doubt 1:
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
Eventually xrefs holds the id values in a list. Couldn't we have done this simply with:
xrefs = [xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref")]
Doubt 2:
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
Here, we are picking a ref from self.refs that we do NOT see in our computed xrefs. But then, instead of creating a <ref> element, we create an <xref> with the same ID. This takes us one step backward, since later we will have to resolve the cross reference for this computed <xref> anyway and eventually reach the <ref>. We could have just started with this <ref> in the first place.
Disclaimer
I am in no way trying to criticize the book. I am not even qualified for that.
I am loving every moment of reading this book. I realize a few chapters are now outdated, but I love Mark Pilgrim's writing style and I cannot stop reading.
Dive Into Python is seven years old now (published 2004), and doesn't always contain the most modern code. So you need to go easy on it: Dive Into Python 3 might be a better bet.
Your suggestion for doubt 1 changes the meaning of the code: putting the ids into the keys of a dictionary and then getting them out again eliminates duplicates, whereas your list comprehension includes duplicates. The modern approach would be to use a set comprehension:
xrefs = {xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref")}
but this wasn't available in 2004.
On your doubt 2, I'm not entirely sure I see the problem. Yes, in some sense this is a waste, but on the other hand the code already has a handler for the xref case, so it makes sense to re-use that handler rather than add an extra special case.
There are several other bits of code in that example that could be modernized. For example,
source and source or self.getDefaultSource()
would now be source or self.getDefaultSource(). And the line
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
would be better expressed as a set difference operation, something like:
standaloneXrefs = set(self.refs) - set(xrefs)
But that's what happens as languages become more expressive: old code starts to look rather inelegant.
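Putting those suggestions together, a modernized getDefaultSource might look roughly like this (a sketch only; NoSourceError, self.grammar and self.refs all come from the original kgp.py):
import random

def getDefaultSource(self):
    # The set comprehension removes duplicate ids, like the original dict trick.
    xrefs = {xref.attributes["id"].value
             for xref in self.grammar.getElementsByTagName("xref")}
    # Refs that are never cross-referenced are the candidate starting points.
    standaloneXrefs = set(self.refs) - xrefs
    if not standaloneXrefs:
        raise NoSourceError("can't guess source, and no source specified")
    return '<xref id="%s"/>' % random.choice(list(standaloneXrefs))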
Your doubts are totally justified: that code doesn't look very good to me at all. For example, it uses 1 as a boolean value where True would have sufficed and been clearer.
Doubt 1:
These two snippets don't do the same thing. If there are duplicates, the original code will filter them out, but your alternative won't. On the other hand, your code preserves the original ordering whereas the original returns the elements in an arbitrary order.
To be fully equivalent, we could use the set builtin:
xrefs = list(set([xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref")]))
(It might not make sense to convert back to a list, though.)
Doubt 2:
Out of time, gotta run, sorry...
for xref in self.grammar.getElementsByTagName("xref"):
    xrefs[xref.attributes["id"].value] = 1
xrefs = xrefs.keys()
This is an extremely crude way to construct a set. This should be written as
set(xref.attributes["id"].value
for xref in self.grammar.getElementsByTagName("xref"))
or even (in Python 2.7+):
{xref.attributes["id"].value
 for xref in self.grammar.getElementsByTagName("xref")}
If avoiding duplicates is not an issue, your solution (constructing a list) works too. Since xref is iterated over anyway, one could even generate an iterator.
standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
...
return '<xref id="%s"/>' % random.choice(standaloneXrefs)
This code is completely broken if xref contains a special character such as " or &.
However, in principle, it is correct to construct an <xref> element here, since this must be the same format that the external source has (getDefaultSource is called as self.loadSource(source and source or self.getDefaultSource())).
Both code excerpts are examples of bad programming and should not be included in a book that intends to teach people how to program. Dive Into Python 3 has better XML examples and code.

renumber residues in a protein structure file (pdb)

Hi
I am currently involved in making a website aimed at combining all papillomavirus information in a single place.
As part of the effort we are curating all known files on public servers (e.g. genbank)
One of the issues I ran into was that many (~50%) of all solved structures are not numbered according to the protein.
I.e. a subdomain was crystallized (amino acids 310-450), but the crystallographer deposited it as residues 1-140.
I was wondering whether anyone knows of a way to renumber the entire PDB file. I have found ways to renumber the sequence (identified by SEQRES); however, this does not update the helix and sheet information.
I would appreciate it if you had any suggestions…
Thanks
I'm the maintainer of pdb-tools - which may be a tool that can assist you.
I have recently modified the residue-renumber script within my application to provide more flexibility. It can now renumber hetatms and specific chains, and either force the residue numbers to be continuous or just add a user-specified offset to all residues.
Please let me know if this assists you.
I frequently encounter this problem too. After abandoning an old Perl script I had for this, I've been experimenting with some Python instead. This solution assumes you've got Biopython, ProDy (http://www.csb.pitt.edu/ProDy/#prody) and EMBOSS (http://emboss.sourceforge.net/) installed.
I used one of the papillomavirus PDB entries here.
from Bio import AlignIO,SeqIO,ExPASy,SwissProt
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Emboss.Applications import NeedleCommandline
from prody.proteins.pdbfile import parsePDB, writePDB
import os
oneletter = {
    'ASP': 'D', 'GLU': 'E', 'ASN': 'N', 'GLN': 'Q',
    'ARG': 'R', 'LYS': 'K', 'PRO': 'P', 'GLY': 'G',
    'CYS': 'C', 'THR': 'T', 'SER': 'S', 'MET': 'M',
    'TRP': 'W', 'PHE': 'F', 'TYR': 'Y', 'HIS': 'H',
    'ALA': 'A', 'VAL': 'V', 'LEU': 'L', 'ILE': 'I',
}
# Retrieve pdb to extract sequence
# Can probably be done with Bio.PDB but being able to use the vmd-like selection algebra is nice
pdbname="2kpl"
selection="chain A"
structure=parsePDB(pdbname)
pdbseq_str=''.join([oneletter[i] for i in structure.select("protein and name CA and %s"%selection).getResnames()])
alnPDBseq=SeqRecord(Seq(pdbseq_str,IUPAC.protein),id=pdbname)
SeqIO.write(alnPDBseq,"%s.fasta"%pdbname,"fasta")
# Retrieve reference sequence
accession="Q96QZ7"
handle = ExPASy.get_sprot_raw(accession)
swissseq = SwissProt.read(handle)
refseq=SeqRecord(Seq(swissseq.sequence,IUPAC.protein),id=accession)
SeqIO.write(refseq, "%s.fasta"%accession,"fasta")
# Do global alignment with needle from EMBOSS, stores entire sequences which makes numbering easier
needle_cli = NeedleCommandline(asequence="%s.fasta"%pdbname,bsequence="%s.fasta"%accession,gapopen=10,gapextend=0.5,outfile="needle.out")
needle_cli()
aln = AlignIO.read("needle.out", "emboss")
os.remove("needle.out")
os.remove("%s.fasta"%pdbname)
os.remove("%s.fasta"%accession)
alnPDBseq = aln[0]
alnREFseq = aln[1]
# Initialize per-letter annotation for pdb sequence record
alnPDBseq.letter_annotations["resnum"]=[None]*len(alnPDBseq)
# Initialize annotation for reference sequence, assume first residue is #1
alnREFseq.letter_annotations["resnum"]=range(1,len(alnREFseq)+1)
# Set new residue numbers in alnPDBseq based on alignment
reslist = [[i,alnREFseq.letter_annotations["resnum"][i]] for i in range(len(alnREFseq)) if alnPDBseq[i] != '-']
for [i, r] in reslist:
    alnPDBseq.letter_annotations["resnum"][i] = r
# Set new residue numbers in the structure
newresnums=[i for i in alnPDBseq.letter_annotations["resnum"][:] if i != None]
resindices=structure.select("protein and name CA and %s"%selection).getResindices()
resmatrix = [[newresnums[i],resindices[i]] for i in range(len(newresnums)) ]
for [newresnum, resindex] in resmatrix:
    structure.select("resindex %d" % resindex).setResnums(newresnum)
writePDB("%s.renumbered.pdb"%pdbname,structure)
pdb-tools
Phenix pdb-tools
BioPython or Bio3D
Check the first one - it should fit your needs
