I'm trying to solve this problem.
When I write to the XML file, it writes the same thing twice.
Here is the code:
def writeIntoXml(fileName, tagElement, textElement):
    tree = ET.ElementTree(file=fileName)
    root = tree.getroot()
    newElement = ET.SubElement(root, tagElement)
    newElement.text = textElement
    newElement.tail = "\n"
    root.append(newElement)
    tree.write(fileName, encoding='utf-8')
Given this XML file with these tags, if I write a new tag (e.g. <Question-3>Example3</Question-3>), I get a problem.
XML file before writing:
<Questions>
<Question-1>Example1</Question-1>
<Question-2>Example2</Question-2>
</Questions>
XML file after writing:
<Questions>
<Question-1>Example1</Question-1>
<Question-2>Example2</Question-2>
<Question-3>Example3</Question-3>
<Question-3>Example3</Question-3>
</Questions>
Sorry for grammatical errors
Note that ET.SubElement() appends the new element to its parent automatically. You are adding the element twice: first in SubElement(), then in append().
Use either
newElement = ET.SubElement(root, tagElement)
newElement.text = textElement
newElement.tail = "\n"
or
newElement = ET.Element(tagElement)
newElement.text = textElement
newElement.tail = "\n"
root.append(newElement)
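As a minimal sketch of the first variant, assuming the file already exists and has a single root element, the corrected function could look like this. The name write_into_xml and the temporary demo file are illustrative and not part of the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_into_xml(file_name, tag_element, text_element):
    """Append one child to the root element and save the file."""
    tree = ET.parse(file_name)
    root = tree.getroot()
    # SubElement() creates the element AND attaches it to root,
    # so no extra root.append() call is needed.
    new_element = ET.SubElement(root, tag_element)
    new_element.text = text_element
    new_element.tail = "\n"
    tree.write(file_name, encoding="utf-8")


# Illustrative round trip on a temporary copy of the sample file.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<Questions>\n<Question-1>Example1</Question-1>\n</Questions>")
    demo_path = handle.name

write_into_xml(demo_path, "Question-3", "Example3")
question_count = len(ET.parse(demo_path).getroot().findall("Question-3"))
os.remove(demo_path)
```

After the call, question_count holds the number of Question-3 elements in the saved file, which makes the duplicate easy to spot if append() is added back.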
I am using textract to convert doc/docx files to text.
Here is my method:
def extract_text_from_doc(file):
    temp = None
    temp = textract.process(file)
    temp = temp.decode("UTF-8")
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    return ''.join(text)
I have two files, both with the .docx extension. Converting one of them to a string works fine, but the other one throws an exception:
'There is no item named \'word/document.xml\' in the archive'
Looking further, I found that the problem shows up after the archive is opened with zipfile.ZipFile(docx).
The code looks like this:
def process(docx, img_dir=None):
    text = u''
    # unzip the docx in memory
    zipf = zipfile.ZipFile(docx)
    filelist = zipf.namelist()
    # get header text
    # there can be 3 header files in the zip
    header_xmls = 'word/header[0-9]*.xml'
    for fname in filelist:
        if re.match(header_xmls, fname):
            text += xml2text(zipf.read(fname))
    # get main text
    doc_xml = 'word/document.xml'
    text += xml2text(zipf.read(doc_xml))
    # some more code
In the code above, for the file that works, filelist contains values like 'word/document.xml' and 'word/header1.xml',
but for the file that fails, filelist contains
['[Content_Types].xml', '_rels/.rels', 'theme/theme/themeManager.xml', 'theme/theme/theme1.xml', 'theme/theme/_rels/themeManager.xml.rels']
Since the second filelist does not contain 'word/document.xml',
doc_xml = 'word/document.xml'
text += xml2text(zipf.read(doc_xml))
throws the exception (internally it tries to read the archive member named word/document.xml).
Can anyone please help? I don't know whether the problem is with the docx file or with the code.
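The member list above suggests the second file is not laid out like the first one, although the listing alone does not say why. As a minimal sketch, the archive can be checked for the main document part before it is read, so that such files are reported instead of raising. Here has_main_document and make_archive are hypothetical helpers written for this example, not textract functions.

```python
import io
import zipfile

MAIN_DOCUMENT = "word/document.xml"


def has_main_document(docx):
    """Return True if the archive contains the main document part."""
    with zipfile.ZipFile(docx) as archive:
        return MAIN_DOCUMENT in archive.namelist()


def make_archive(member_names):
    """Build a small in-memory zip, for demonstration purposes only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "<xml/>")
    buffer.seek(0)
    return buffer


# One archive shaped like the working file, one like the failing file.
good = make_archive(["[Content_Types].xml", "word/document.xml"])
bad = make_archive(["[Content_Types].xml", "theme/theme/theme1.xml"])
good_ok = has_main_document(good)
bad_ok = has_main_document(bad)
```

Running the same check on the real failing file and opening it in Word would help tell a damaged or unusual document apart from a code problem.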
The loop goes through the list:
for file in files:
    if id == file['param_id']:
        resources_dict = {'fileNo': str(i), 'startPageNo': str(i), 'endPageNo': str(i),
                          'format': 'cpk:JPEG'}
        ET.SubElement(cpf_resources, 'cpf:ContentFile', resources_dict).text = 'cid:{}'.format(str(file['filename']))
        i = i + 1
Then the data is written to the file as follows:
tree = ET.ElementTree(jobticket)
filename = '{}\\{}.xml'.format(os.getcwd(), get_af_value(project_data, id, 'filename'))
tree.write(filename, encoding="UTF-8", xml_declaration=True)
In the resulting file, the data appears as follows:
<cpf:Resources>
<cpf:ContentFile endPageNo="1" fileNo="1" format="cpk:JPEG" startPageNo="1">cid:page_0005.jpg
</cpf:ContentFile>
<cpf:ContentFile endPageNo="2" fileNo="2" format="cpk:JPEG" startPageNo="2">cid:page_0009.jpg
</cpf:ContentFile>
</cpf:Resources>
Is there a way to put the closing tag </cpf:ContentFile> on the same line?
<cpf:ContentFile endPageNo="2" fileNo="2" format="cpk:JPEG" startPageNo="2">cid:page_0009.jpg</cpf:ContentFile>
After a few curses I managed to come up with something like this:
root = tree.getroot()
xml_bytes = ET.tostring(root)
# toprettyxml() returns bytes when an encoding is given, so write in binary mode
xmlstr = minidom.parseString(xml_bytes).toprettyxml(indent=" ", encoding='UTF-8')
with open("filename.xml", "wb") as f:
    f.write(xmlstr)
Maybe somebody can use it.
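A self-contained sketch of the same idea follows, with simplified stand-in tag and attribute names rather than the real cpf: ones. The strip() call rests on an assumption: the output in the question would be consistent with file names that end in a newline, but that is only a guess.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Illustrative tree; the tag and attribute names are simplified stand-ins.
resources = ET.Element("Resources")
for number, name in enumerate(["page_0005.jpg\n", "page_0009.jpg\n"], start=1):
    content = ET.SubElement(resources, "ContentFile", {"fileNo": str(number)})
    # strip() removes a stray newline at the end of the file name (assumed cause)
    content.text = "cid:{}".format(name.strip())

# Without an encoding argument, toprettyxml() returns a str.
pretty = minidom.parseString(ET.tostring(resources)).toprettyxml(indent="  ")
print(pretty)
```

With the text stripped, minidom keeps an element that contains only text on a single line, closing tag included.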
I have a Python script that parses a number of XML files with the same structure, finds the relevant elements and prints all tags and attributes (it also writes them to a file, but I would like to create some structured data instead).
This works perfectly fine, but I would like to create a new XML file that mirrors the structure of the original and contains only the elements matching the patterns I specified.
Here's the function that searches through the files:
# cElementTree was removed in Python 3.9; ElementTree offers the same API
import xml.etree.ElementTree as ET
import glob
filename = "media_code2_output.txt"
def find_mediacode2(root, outputfile):
    # find parent node
    for parent in root.iter("musa"):
        # parent node attribute "dr-production" must be true (as string)
        if parent.attrib.get("dr-production") == "true":
            # each child element must have a media-code element equal to 2
            for mediekode in parent.iter("media-code"):
                if mediekode.text == "2":
                    # print all fields
                    for field in parent.iter():
                        print(field.tag, field.attrib, field.text)
                        # write all fields to file
                        outputfile.write(str(field.tag) + " " + str(field.attrib) + " " + str(field.text) + "\n")
                    # print spacer line
                    outputfile.write("\n"+"-"*80+"\n")
                    print("\n"+"-"*80+"\n")
for inputfile in glob.glob('*/*.xml'):
    tree = ET.parse(inputfile)
    root = tree.getroot()
    with open(filename, "a+") as outputfile:
        find_mediacode2(root, outputfile)
Here's a sample of the data from the files:
https://pastebin.com/AHEcDv36
Ideally, I would like to represent the data in an Access database.
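One possible way to get a filtered XML file, sketched under assumptions: matching "musa" elements are copied whole and placed directly under a new root that reuses the original root's tag, so any intermediate ancestors are not reproduced. The function name filter_musa and the sample data are made up for this example, since the real file layout is only available via the link above.

```python
import copy
import xml.etree.ElementTree as ET


def filter_musa(root):
    """Return a new root holding copies of the matching "musa" elements."""
    new_root = ET.Element(root.tag, root.attrib)
    for parent in root.iter("musa"):
        if parent.attrib.get("dr-production") != "true":
            continue
        # Keep the element if any descendant media-code equals "2".
        if any(code.text == "2" for code in parent.iter("media-code")):
            new_root.append(copy.deepcopy(parent))
    return new_root


# Made-up sample data; the real files may nest these elements differently.
sample = ET.fromstring(
    "<archive>"
    "<musa dr-production='true'><media-code>2</media-code></musa>"
    "<musa dr-production='true'><media-code>1</media-code></musa>"
    "<musa dr-production='false'><media-code>2</media-code></musa>"
    "</archive>"
)
filtered = filter_musa(sample)
print(ET.tostring(filtered, encoding="unicode"))
```

The result can be saved with ET.ElementTree(filtered).write(...), and a tree in this shape is also a convenient starting point for exporting rows to a database.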
I'm trying to iterate through tables in HTML by a search label, store the found value in a dictionary, and then write those values to a CSV file. The output currently works for both the url and the headline, but the name output is either blank or shows "None". If I print blog["name"], however, it correctly shows the information I want. I suspect an indentation error, but I can't figure out where to line things up. I've tried moving things around, but nothing gets the name assignment to work inside that loop.
import os
from bs4 import BeautifulSoup
import my_csv_writer
def td_finder(tr, searchLabel):
    value = ""
    index = tr.text.find(searchLabel)
    if index>-1:
        tds = tr.findAll('td')
        if len(tds)>1:
            value = tds[1].text
    return value
def main():
    topdir = 'some_directory'
    writer = my_csv_writer.CsvWriter("output.csv")
    writer.writeLine(["url", "headline", "name"])
    """Main Function"""
    blog = []
    for root, dirs, files in os.walk(topdir):
        for f in files:
            url = os.path.join(root, f)
            url = os.path.dirname(url).split('some_file')[1]
            if f.lower().endswith((".html")):
                file_new = open(os.path.join(root, f), "r").read()
                soup = BeautifulSoup(file_new)
                blog = {}
                #Blog Title
                blog["title"] = soup.find('title').text
                for table in soup.findAll("table"):
                    for tr in table.findAll("tr"):
                        #name
                        blog["name"] = td_finder(tr, "name:")
                seq = [url, unicode(blog["title"]), unicode(blog.get("name"))]
                writer.writeLine(seq)
    #return ""
if __name__ == '__main__':
    main()
    print "Finished main"
You're writing unicode strings to a CSV file, and according to the official Python 2 docs, "The csv module doesn't directly support reading and writing Unicode...".
The docs do offer a UnicodeWriter recipe that enables different encodings. The following answer from Boud on SO highlights the need to set the desired encoding in the CSV file.
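Independently of the encoding question, the loop in the question assigns blog["name"] for every row, and td_finder returns an empty string for rows that do not contain the label, so a later non-matching row can overwrite an earlier match. A minimal sketch of that effect, and of stopping at the first non-empty value, is shown here with plain lists of cell strings as a stand-in for the BeautifulSoup rows. The helper find_first_value is hypothetical.

```python
def td_finder(cells, search_label):
    """Return the second cell if the row mentions search_label, else ''."""
    row_text = " ".join(cells)
    if search_label in row_text and len(cells) > 1:
        return cells[1]
    return ""


def find_first_value(rows, search_label):
    """Return the first non-empty td_finder() result, or '' if none matches."""
    for cells in rows:
        value = td_finder(cells, search_label)
        if value:
            return value
    return ""


# Stand-in for the <tr> rows of one table: the label row is not the last row.
rows = [["name:", "Alice"], ["date:", "2014-01-01"]]

# Mirrors the loop in the question: every row overwrites the previous result.
overwritten = ""
for cells in rows:
    overwritten = td_finder(cells, "name:")

first_match = find_first_value(rows, "name:")
print(repr(overwritten), repr(first_match))
```

In the original code the equivalent change would be to assign blog["name"] only when td_finder returns a non-empty value.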
I need to avoid creating duplicate branches in an XML tree when parsing a text file. Let's say the text file is as follows (the order of lines is random):
branch1:branch11:message11
branch1:branch12:message12
branch2:branch21:message21
branch2:branch22:message22
So the resulting xml tree should have a root with two branches. Both of those branches have two subbranches. The Python code I use to parse this textfile is as follows:
import string
fh = open ('xmlbasic.txt', 'r')
allLines = fh.readlines()
fh.close()
import xml.etree.ElementTree as ET
root = ET.Element('root')
for line in allLines:
    tempv = line.split(':')
    branch1 = ET.SubElement(root, tempv[0])
    branch2 = ET.SubElement(branch1, tempv[1])
    branch2.text = tempv[2]
tree = ET.ElementTree(root)
tree.write('xmlbasictree.xml')
The problem with this code is that a new branch is created in the XML tree for every line of the text file.
Any suggestions on how to avoid creating another branch in the XML tree if a branch with that name already exists?
import xml.etree.ElementTree as ET
with open("xmlbasic.txt") as lines_file:
    # splitlines() yields whole lines without the trailing newline
    lines = lines_file.read().splitlines()
root = ET.Element('root')
for line in lines:
    head, subhead, tail = line.split(":")
    head_branch = root.find(head)
    # compare with None: an Element without children is falsy
    if head_branch is None:
        head_branch = ET.SubElement(root, head)
    subhead_branch = head_branch.find(subhead)
    if subhead_branch is None:
        subhead_branch = ET.SubElement(head_branch, subhead)
    subhead_branch.text = tail
tree = ET.ElementTree(root)
ET.dump(tree)
The logic is simple, and you already stated it in your question: check whether a branch already exists in the tree before creating it.
Note that this is likely inefficient, since each find() call scans the children of the parent element for every line. ElementTree does not enforce unique tag names, so it keeps no index to look them up.
If you require speed (which you may not, especially for smallish trees!), a more efficient way would be to use a defaultdict to store the tree structure before converting it to an ElementTree.
import collections
import xml.etree.ElementTree as ET
with open("xmlbasic.txt") as lines_file:
    # splitlines() yields whole lines without the trailing newline
    lines = lines_file.read().splitlines()
root_dict = collections.defaultdict(dict)
for line in lines:
    head, subhead, tail = line.split(":")
    root_dict[head][subhead] = tail
root = ET.Element('root')
for head, branch in root_dict.items():
    head_element = ET.SubElement(root, head)
    for subhead, tail in branch.items():
        ET.SubElement(head_element, subhead).text = tail
tree = ET.ElementTree(root)
ET.dump(tree)
Something along these lines? You keep the first-level branches that should be reused in a dict.
b1map = {}
for line in allLines:
    tempv = line.split(':')
    branch1 = b1map.get(tempv[0])
    if branch1 is None:
        branch1 = b1map[tempv[0]] = ET.SubElement(root, tempv[0])
    branch2 = ET.SubElement(branch1, tempv[1])
    branch2.text = tempv[2]
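The dict idea can be extended to both levels. This is a minimal, self-contained sketch that assumes every non-empty line has the form branch:subbranch:message. The function name build_tree and the in-memory sample lines are illustrative.

```python
import xml.etree.ElementTree as ET


def build_tree(lines):
    """Build <root> from "branch:subbranch:message" lines without duplicates."""
    root = ET.Element("root")
    cache = {}  # maps a path tuple such as ("branch1", "branch11") to its element
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # maxsplit=2 keeps any further colons inside the message
        head, subhead, message = line.split(":", 2)
        parent = cache.get((head,))
        if parent is None:
            parent = cache[(head,)] = ET.SubElement(root, head)
        child = cache.get((head, subhead))
        if child is None:
            child = cache[(head, subhead)] = ET.SubElement(parent, subhead)
        child.text = message
    return root


# Same data as in the question, in shuffled order.
sample_lines = [
    "branch1:branch11:message11\n",
    "branch2:branch21:message21\n",
    "branch1:branch12:message12\n",
    "branch2:branch22:message22\n",
]
root = build_tree(sample_lines)
print(ET.tostring(root, encoding="unicode"))
```

Each lookup is a dict access, so the cost per line does not grow with the number of branches already in the tree.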