Using python-docx's _document method - python

I am using python-docx and am trying to insert a <w:bookmarkStart> tag. I do not see any API method that creates the tag, so I searched and found several references that access the raw XML through the document._document_part attribute. However, when I try to use it, Python tells me it does not exist:
>>> import docx
>>> document = docx.Document()
>>> print document._document_part
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Document' object has no attribute '_document_part'
I am using python-docx 0.8.5.
Is there a method to add a <w:bookmarkStart> tag?

I found the solution. Here's an example:
from docx.oxml.shared import OxmlElement # Necessary Import
tags = document.element.xpath('//w:r') # Locate the right <w:r> tag
tag = tags[0] # Specify which <w:r> tag you want
child = OxmlElement('w:ARBITRARY') # Create arbitrary tag
tag.append(child) # Append in the new tag
To add an attribute:
from docx.oxml.shared import qn
child.set( qn('w:val'), 'VALUE') # Add in the value
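For readers who want to see what the resulting XML looks like, here is a minimal sketch that uses only the standard library's ElementTree rather than python-docx. The helper clark() is a hypothetical stand-in for what qn() does (expanding the w: prefix to the WordprocessingML namespace), the bookmark id and name are made-up values, and attaching the bookmark to a <w:p> parent is an assumption based on where Word itself usually places bookmark markers.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in .docx files
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def clark(tag):
    """Hypothetical stand-in for qn(): turn 'w:val' into '{namespace}val'."""
    prefix, local = tag.split(":")
    if prefix != "w":
        raise ValueError("only the 'w' prefix is handled in this sketch")
    return "{%s}%s" % (W_NS, local)


# Make the serializer write the namespace with the conventional "w" prefix
ET.register_namespace("w", W_NS)

# Assumption: the bookmark sits directly inside a paragraph element
paragraph = ET.Element(clark("w:p"))
bookmark = ET.SubElement(paragraph, clark("w:bookmarkStart"))
bookmark.set(clark("w:id"), "0")               # made-up id
bookmark.set(clark("w:name"), "my_bookmark")   # made-up name

xml_text = ET.tostring(paragraph, encoding="unicode")
print(xml_text)
```

With python-docx the same attribute calls would be made on the element returned by OxmlElement, as in the answer above.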

Related

python docx: AttributeError: 'function' object has no attribute 'add_paragraph'

I just started learning Python, and now I need help with python-docx.
I'm using Python v3.5.1.
This is the code I'd like to run from a .py file:
from docx import Document
document = Document
paragraph = document.add_paragraph('I am adding a new paragraph here.')
document.save('test-thu18feb-b.docx')
After pressing F5, I get this message in the python shell:
Traceback (most recent call last):
File "C:/Users/Schauer/AppData/Local/Programs/Python/Python35/docx-test-thu18feb-a.py", line 4, in <module>
paragraph = document.add_paragraph('I am adding a new paragraph here.')
AttributeError: 'function' object has no attribute 'add_paragraph'
Thanks a lot for helping out!
The statement
document = Document
assigns the function docx.Document to document.
document = Document()
assigns the value returned by the function docx.Document to document. You need the latter.
docx.Document is a constructor function. It returns instances of the docx.document.Document class.
Per the docs, this is the definition of the docx.Document function:
def Document(docx=None):
    """
    Return a |Document| object loaded from *docx*, where *docx* can be
    either a path to a ``.docx`` file (a string) or a file-like object. If
    *docx* is missing or ``None``, the built-in default document "template"
    is loaded.
    """
    docx = _default_docx_path() if docx is None else docx
    document_part = Package.open(docx).main_document_part
    if document_part.content_type != CT.WML_DOCUMENT_MAIN:
        tmpl = "file '%s' is not a Word file, content type is '%s'"
        raise ValueError(tmpl % (docx, document_part.content_type))
    return document_part.document
So docx.Document is a function, but docx.document.Document is a class.
Since you import
from docx import Document
Document refers to docx.Document in your code.
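The difference between naming a function and calling it can be reproduced without python-docx. The sketch below uses a hypothetical _Document class and a Document() factory function that only imitate the shape of the real library; they are not its implementation.

```python
class _Document:
    """Hypothetical stand-in for docx.document.Document."""

    def __init__(self):
        self.paragraphs = []

    def add_paragraph(self, text):
        self.paragraphs.append(text)
        return text


def Document():
    """Hypothetical stand-in for the docx.Document factory function."""
    return _Document()


# Without parentheses the name is bound to the function object itself
not_called = Document
try:
    not_called.add_paragraph('I am adding a new paragraph here.')
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# With parentheses the function runs and returns a _Document instance
document = Document()
paragraph = document.add_paragraph('I am adding a new paragraph here.')
print(document.paragraphs)
```

The first attempt fails with the same AttributeError as in the question; the second succeeds because add_paragraph is looked up on the returned instance.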

python list substring

I am trying to read the variables from newreg.py (state, district, dcode and so on; a long list of about 50-55 variables populated from a web form) into insertNew.py.
So far I have read the whole file into a list named 'lines'. Because I read the entire page, the list also contains HTML. How do I filter each variable (state, district, etc.) out of 'lines'?
Is there a better, more efficient way to do it?
Once I can read each variable, I need to convert the values to strings, concatenate them, and insert them into MongoDB.
Finally, once the data has been inserted into the DB, the 'home.py' page opens.
I am giving these details so that the complete picture is available. I hope I have kept it both simple and complete.
I want to loop over the list (sample below) and extract the variable names (the part before the '=' sign). The following is in 'newreg.py':
state = form.getvalue('state','ERROR')
district = form.getvalue('district','ERROR')
dcode = form.getvalue('Dcode','ERROR')
I read the file into a list:
fp = open('/home/dev/wsgi-scripts/newreg.py','r')
lines = fp.readlines()
so that I can create a dictionary to insert into MongoDB, e.g.:
info = {'state' : state , 'district' : district, . . . . }
Here each entry is {key : value}, where the value is the corresponding variable from the list above.
Thanks
However, I get the following error when I run:
print getattr(newreg, 'state')
the error is
>>> print getattr(newreg, 'state')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'state'
I also tried
>>> print newreg.state
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'state'
This is how I added the module
>>> import os,sys
>>> sys.path.append('/home/dev/wsgi-scripts/')
>>> import newreg
>>> newreg_vars = dir(newreg)
>>> print newreg_vars
['Connection', 'Handler', '__builtins__', '__doc__', '__file__', '__name__',
'__package__', 'application', 'cgi', 'datetime', 'os', 'sys', 'time']
Handler in the list above is a class defined in newreg.py:
#!/usr/bin/env python
import os, sys
import cgi
from pymongo import Connection
import datetime
import time
class Handler:
    def do(self, environ, start_response):
        form = cgi.FieldStorage(fp=environ['wsgi.input'],
                                environ=environ)
        state = form.getvalue('state','<font color="#FF0000">ERROR</font>')
        district = form.getvalue('district','<font color="#FF0000">ERROR</font>')
        dcode = form.getvalue('Dcode','<font color="#FF0000">ERROR</font>')
I am assuming you want to copy the variables from one Python module to another at runtime.
import newreg
newreg_vars = dir(newreg)
print newreg_vars
will print all of the attributes of the module "newreg".
To read the variables from the module:
print getattr(newreg, 'state')
print getattr(newreg, 'district')
print getattr(newreg, 'dcode')
or if you know the names of the attributes:
print newreg.state
print newreg.district
print newreg.dcode
To convert the attribute values into strings, use a list comprehension (or a generator):
newreg_strings = [str(getattr(newreg, name)) for name in newreg_vars]
This will save you lots of effort, as you will not have to parse "newreg" as a text file with re.
As a side note: Type conversion is not concatenation (although concatenation may involve type conversion in some other programming languages).
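One thing worth checking against the traceback in the question: names assigned inside a method body are local to that call, so they never become attributes of the module or the class. The sketch below (Python 3) illustrates this with a simplified Handler whose do() takes a plain dict. The dict is an assumption standing in for cgi.FieldStorage, and the sample values are made up. Returning a dictionary is one possible way to hand the values to the caller.

```python
class Handler:
    def do(self, form):
        # These names exist only while do() is running
        state = form.get('state', 'ERROR')
        district = form.get('district', 'ERROR')
        dcode = form.get('Dcode', 'ERROR')
        # Returning a dict makes the values available to the caller
        return {'state': state, 'district': district, 'dcode': dcode}


# The class has no 'state' attribute; 'state' is a local variable of do()
print(hasattr(Handler, 'state'))
print(Handler.do.__code__.co_varnames)

# Made-up form data; 'Dcode' is missing, so its default is used
info = Handler().do({'state': 'Goa', 'district': 'North Goa'})
print(info)
```

A dictionary built this way already has the {key: value} shape described in the question.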

renderContents in beautifulsoup (python)

The code I'm trying to get working is:
h = str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
I get this error:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
print h.renderContents()
AttributeError: 'str' object has no attribute 'renderContents'
Any ideas?
I have a string with HTML tags and I need to clean it. If there is a different way of doing that, please suggest it.
Your error message and your code sample don't line up. You say you're calling:
heading.renderContents()
But your error message says you're calling:
print h.renderContents()
Which suggests that perhaps you have a bug in your code, trying to call renderContents() on a string object that doesn't define that method.
In any case, it would help if you checked what type of object heading is to make sure it's really a BeautifulSoup instance. This works for me with BeautifulSoup 3.2.0:
from BeautifulSoup import BeautifulSoup
heading = BeautifulSoup('<h1>heading</h1>')
repr(heading)
# '<h1>heading</h1>'
print heading.renderContents()
# <h1>heading</h1>
print str(heading)
# <h1>heading</h1>
h = str(heading)
print h
# <h1>heading</h1>
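Since the question also asks for a different way to clean a string of HTML tags, here is a minimal sketch that uses the standard library's html.parser (Python 3) instead of BeautifulSoup. The names TextExtractor and strip_tags are hypothetical helpers written for this example, and the approach keeps only text nodes, which may be too simple for real-world markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(strip_tags('<h1>Heading</h1>'))
```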

How can I get specific elements from XML data?

I have some code to retrieve XML data:
import cStringIO
import pycurl
from xml.etree import ElementTree
_API_KEY = 'my api key'
_ima = '/the/path/to/a/image'
sock = cStringIO.StringIO()
upl = pycurl.Curl()
values = [
("key", _API_KEY),
("image", (upl.FORM_FILE, _ima))]
upl.setopt(upl.URL, "http://api.imgur.com/2/upload.xml")
upl.setopt(upl.HTTPPOST, values)
upl.setopt(upl.WRITEFUNCTION, sock.write)
upl.perform()
upl.close()
xmldata = sock.getvalue()
#print xmldata
sock.close()
The resulting data looks like:
<?xml version="1.0" encoding="utf-8"?>
<upload><image><name></name><title></title><caption></caption><hash>dxPGi</hash><deletehash>kj2XOt4DC13juUW</deletehash><datetime>2011-06-10 02:59:26</datetime><type>image/png</type><animated>false</animated><width>1024</width><height>768</height><size>172863</size><views>0</views><bandwidth>0</bandwidth></image><links><original>http://i.stack.imgur.com/dxPGi.png</original><imgur_page>http://imgur.com/dxPGi</imgur_page><delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page><small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square><large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail></links></upload>
Now, following this answer, I'm trying to get some specific values from the data.
This is my attempt:
tree = ElementTree.fromstring(xmldata)
url = tree.findtext('original')
webpage = tree.findtext('imgur_page')
delpage = tree.findtext('delete_page')
print 'Url: ' + str(url)
print 'Pagina: ' + str(webpage)
print 'Link de borrado: ' + str(delpage)
I get an AttributeError if I try to add the .text access:
Traceback (most recent call last):
File "<pyshell#28>", line 27, in <module>
url = tree.find('original').text
AttributeError: 'NoneType' object has no attribute 'text'
I couldn't find anything in Python's help for ElementTree about this attribute. How can I get only the text, not the object?
I found some info about getting a text string here; but when I try it I get a TypeError:
Traceback (most recent call last):
File "<pyshell#32>", line 34, in <module>
print 'Url: ' + url
TypeError: cannot concatenate 'str' and 'NoneType' objects
If I try to print 'Url: ' + str(url) instead, there is no error, but the result shows as None.
How can I get the url, webpage and delete_page data from this XML?
Your find() call looks only for an immediate child of the root element with the tag original; it does not search lower levels, so it returns None. Use:
url = tree.find('.//original').text
to find the first element named original at any depth in the tree (findall() with the same pattern returns every match). The pattern matching rules for ElementTree's find() method are laid out in a table on this page: http://effbot.org/zone/element-xpath.htm
For // matching it says:
Selects all subelements, on all levels beneath the current element (search the entire subtree). For example, “.//egg” selects all “egg” elements in the entire tree.
Edit: here is some test code. It uses the XML sample string you posted; I just ran it through XML Tidy in TextMate to make it legible:
from xml.etree import ElementTree
xmldata = '''<?xml version="1.0" encoding="utf-8"?>
<upload>
  <image>
    <name/>
    <title/>
    <caption/>
    <hash>dxPGi</hash>
    <deletehash>kj2XOt4DC13juUW</deletehash>
    <datetime>2011-06-10 02:59:26</datetime>
    <type>image/png</type>
    <animated>false</animated>
    <width>1024</width>
    <height>768</height>
    <size>172863</size>
    <views>0</views>
    <bandwidth>0</bandwidth>
  </image>
  <links>
    <original>http://i.stack.imgur.com/dxPGi.png</original>
    <imgur_page>http://imgur.com/dxPGi</imgur_page>
    <delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page>
    <small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square>
    <large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail>
  </links>
</upload>'''
tree = ElementTree.fromstring(xmldata)
print tree.find('.//original').text
On my machine (OS X running python 2.6.1) that produces:
Ian-Cs-MacBook-Pro:tmp ian$ python test.py
http://i.stack.imgur.com/dxPGi.png
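The three lookup styles can be compared side by side. This is a small sketch in Python 3 syntax on a shortened copy of the XML from the question; the tag no_such_tag is a made-up name used only to show the default argument of findtext().

```python
from xml.etree import ElementTree

# Shortened copy of the response from the question
xmldata = (
    '<upload>'
    '<image><hash>dxPGi</hash></image>'
    '<links>'
    '<original>http://i.stack.imgur.com/dxPGi.png</original>'
    '<imgur_page>http://imgur.com/dxPGi</imgur_page>'
    '</links>'
    '</upload>'
)
tree = ElementTree.fromstring(xmldata)

# A bare tag name matches only direct children of <upload>
direct = tree.find('original')

# An explicit path walks down through <links>
by_path = tree.findtext('links/original')

# './/' searches the whole subtree and returns the first match
anywhere = tree.findtext('.//imgur_page')

# findtext() accepts a default, which avoids the NoneType errors above
missing = tree.findtext('.//no_such_tag', default='')

print(direct, by_path, anywhere, repr(missing))
```

findtext() returns the text directly, so there is no need to call .text on a result that might be None.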

pubDate RSS parsing weirdness with Beautifulsoup/Python

I'm trying to parse an RSS/podcast feed using BeautifulSoup, and everything works nicely except that I can't seem to parse the 'pubDate' field.
import urllib2
from BeautifulSoup import BeautifulStoneSoup

data = urllib2.urlopen("http://www.democracynow.org/podcast.xml")
dom = BeautifulStoneSoup(data, fromEncoding='utf-8')
items = dom.findAll('item')
for item in items:
    title = item.find('title').string.strip()
    pubDate = item.find('pubDate').string.strip()
The title gets parsed fine but when it gets to pubDate, it says:
Traceback (most recent call last):
File "", line 2, in <module>
AttributeError: 'NoneType' object has no attribute 'string'
However, when I download a copy of the XML file and rename 'pubDate' to something else, then parse it again, it seems to work. Is pubDate a reserved variable or something in Python?
Thanks,
g
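It works with item.find('pubdate').string.strip(), that is, with the tag name written in all lowercase. pubDate is not a reserved word in Python.
Alternatively, why not use feedparser?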
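If neither BeautifulSoup nor feedparser is available, a standard-library sketch is possible too. The example below (Python 3) uses xml.etree.ElementTree, which matches tag names case-sensitively, so 'pubDate' is looked up exactly as it is written in the feed. The feed content is a made-up two-item sample, not the real podcast feed, and email.utils.parsedate_to_datetime is used on the assumption that the dates follow the usual RFC 822 style.

```python
from email.utils import parsedate_to_datetime
from xml.etree import ElementTree

# Made-up sample feed; the second item deliberately has no pubDate
rss = '''<rss version="2.0"><channel>
<item><title> First episode </title><pubDate>Fri, 10 Jun 2011 08:00:00 -0400</pubDate></item>
<item><title>No date</title></item>
</channel></rss>'''

root = ElementTree.fromstring(rss)
episodes = []
for item in root.iter('item'):
    title = item.findtext('title', default='').strip()
    raw_date = item.findtext('pubDate')  # None when the tag is absent
    published = parsedate_to_datetime(raw_date) if raw_date else None
    episodes.append((title, published))

for title, published in episodes:
    print(title, published)
```

Checking for a missing tag before parsing avoids the 'NoneType' error from the question when an item has no date.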
