Python xmlutils: formatting XML to CSV conversion

I am converting a generated xml file to a csv file using xmlutils. However, the nodes that I tagged in the xml file sometimes have an extra child node which messes up the formatting of the converted csv file.
For instance,
<issue>
    <name>project1</name>
    <key>733</key>
</issue>
<issue>
    <name>project2</name>
    <key>123</key>
    <debt>233</debt>
</issue>
I tagged "issue" and the XML file was converted to a CSV. However, once I opened the CSV file, the formatting was wrong. Since there was an extra "debt" node in the second issue element, the columns for the second row were shifted.
For instance,
name key
project1 733
project2 123 233
How can I tell xmlutils to generate a new column "debt"?
Also, if xmlutils cannot do the job, can you recommend a tool that is better suited?
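If xmlutils cannot be coaxed into this, one alternative is to do the conversion with the standard library: collect the union of child tags across all issue elements first, then let csv.DictWriter fill the missing cells. This is a minimal sketch, not xmlutils' own behavior, using an inline copy of the sample data above:

```python
# Sketch: convert <issue> elements to CSV with a consistent column set,
# using only the standard library instead of xmlutils.
import csv
import io
import xml.etree.ElementTree as ET

xml_data = """<issues>
<issue><name>project1</name><key>733</key></issue>
<issue><name>project2</name><key>123</key><debt>233</debt></issue>
</issues>"""

root = ET.fromstring(xml_data)
rows = [{child.tag: child.text for child in issue} for issue in root.iter("issue")]

# Union of all tags seen, preserving first-seen order, so "debt" gets its own column.
fieldnames = list(dict.fromkeys(tag for row in rows for tag in row))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Because every row is written against the same fieldnames list, the first issue simply gets an empty "debt" cell instead of shifting the columns.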

Related

Parsing inconsistent XML file in Python to upload it to DB with Django

I am trying to parse an XML file containing multiple elements of the following structure:
<root attribute_xmlns="date_as_str">
<element1 attrib1="str" attrib2="str" attrib3="str" attrib4="str">
<element2>
<element21>Some text</element21>
<element22>Some other text</element22>
</element2>
<element3>
<element31 attrib1="str" attrib2="str" attrib3="str"/>
<element32 attrib1="str" attrib3="str"/>
<element33 attrib1="str" attrib2="str" attrib3="str" attrib4="str"/>
</element3>
</element1>
</root>
Some of the information I want to retrieve is held in element attributes and some in element text.
I was trying to save it to CSV with xml.etree.ElementTree by iterating over each level of the file and reading the successive attribute and element texts. The problem, however, is that the file is internally inconsistent: some elements are missing attributes, and some elements have no value. In such cases my final CSV file is just a big mess.
How can I get that done? My plan is to upload this data into SQL DB in Django app.
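One way to keep the rows aligned despite the inconsistencies is to supply a default for every attribute and element lookup, so each record always has the same keys. This is a sketch under the hypothetical tag/attribute names from the structure above, with a small inline sample:

```python
# Sketch: tolerate missing attributes/elements by supplying defaults,
# so every record has the same keys regardless of gaps in the XML.
import xml.etree.ElementTree as ET

xml_data = """<root>
<element1 attrib1="a" attrib2="b">
  <element2><element21>Some text</element21></element2>
</element1>
<element1 attrib1="c">
  <element2><element21>Other</element21><element22>More</element22></element2>
</element1>
</root>"""

root = ET.fromstring(xml_data)
rows = []
for el in root.iter("element1"):
    rows.append({
        # Element.get() returns the fallback when the attribute is absent
        "attrib1": el.get("attrib1", ""),
        "attrib2": el.get("attrib2", ""),
        # findtext() takes a default for missing elements
        "element21": el.findtext("element2/element21", default=""),
        "element22": el.findtext("element2/element22", default=""),
    })

print(rows)
```

A list of uniform dicts like this maps cleanly onto a Django model: each key becomes a (nullable or blank-allowed) model field, and the empty-string defaults keep the bulk insert from tripping on missing data.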

Best format for saving list of JSON strings

I'm trying to run an ETL over a lot of XML files that exist in my datalake. The first step is to translate those XML files into JSON files, because it is much easier to load JSON files into databases than XML strings.
I'm trying to understand which format is better:
Format 1:
[{'key':val}, {'key':val}, {'key':val}, {'key':val}]
Format 2:
{'key':val}, {'key':val}, {'key':val}, {'key':val}
Format 3 (as one column CSV):
{'key':val}
{'key':val}
{'key':val}
The advantage of Format 1 is that I'm able to load the file back with json.load, which I can't do with the second format (I'll get json.decoder.JSONDecodeError: Extra data).
The advantage of Format 2 is that I can load the file as-is into many databases.
Another option is saving all three versions of the file, but that feels like a waste of space if there is a better solution.
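The trade-off between the formats can be seen directly in how each parses. Format 1 is a single JSON array, Format 3 is what is commonly called JSON Lines (one object per line), and Format 2 is not valid JSON at all, which is where the "Extra data" error comes from. A small sketch:

```python
# Sketch: Format 1 is a JSON array (one json.loads call); Format 3 is
# JSON Lines (parse each line separately). Format 2 is not valid JSON.
import json

format1 = '[{"key": 1}, {"key": 2}, {"key": 3}]'
records_1 = json.loads(format1)

format3 = '{"key": 1}\n{"key": 2}\n{"key": 3}\n'
records_3 = [json.loads(line) for line in format3.splitlines() if line]

assert records_1 == records_3  # same data, different framing

# Format 2 raises the "Extra data" error mentioned above:
try:
    json.loads('{"key": 1}, {"key": 2}')
except json.JSONDecodeError as exc:
    print("Format 2 fails:", exc.msg)
```

Since Formats 1 and 3 are losslessly convertible into each other, storing only one of them (whichever your database loader prefers) and converting on the fly avoids keeping redundant copies.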

How to update data to CSV file by specific field using Robot Framework

Now I use keyword
Append Data
${list}= Create List Test1 Test2
${data}= create list ${list}
Append To Csv File ${File_Path} ${list}
but it cannot specify the position of the data I want to update. In my test suite I have to update the data every time a case finishes, so that the next case can use the new data. (I keep the test data in a CSV file.)
Looks like you are already making use of CSVLibrary.
This library only offers the following keywords; notice that none of them replaces a CSV line or file, hence we need to come up with our own procedure.
Append To Csv File
Csv File From Associative
Empty Csv File
Read Csv File To Associative
Read Csv File To List
APPROACH#1
In my test case I have to update new data everytimes after finished
case to use new data in next case.
One way to solve your problem is to convert all of the CSV file data into a list of dicts.
Read the CSV into a list of dicts using Read Csv File To Associative
Make a copy of the original list of dicts, in case you would like to go back to it for a quick referral
Start of Testcase#1
Make the modifications to the (copied) list of dicts
End of Testcase#1
Start of Testcase#2
Use the modified list of dicts from Testcase#1
End of Testcase#2
And so on for the rest of the test cases.
Here there is no need to use the CSV library.
If we always want to create a new CSV file with new data, we can use the Create File keyword from the OperatingSystem library:
Create File filename.csv content=content_added_in_csvFile
e.g. Create File ${CURDIR}/Demo.csv content=675432561
If we want to add multiple values to the CSV:
Create File ${CURDIR}/Demo.csv content=68868686,85757464,5757474
When this code runs, the old file is replaced by a new file with the provided content.
Hope this resolves the issue.
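The read-modify-write cycle from APPROACH#1 can also be sketched in plain Python; a function like this could be exposed to Robot Framework as a custom library keyword. The column names and values here are hypothetical stand-ins for your test data:

```python
# Sketch of the read-modify-write cycle: read the CSV into a list of
# dicts (like Read Csv File To Associative), update a specific field,
# then write the whole file back with the modification applied.
import csv
import io

csv_text = "name,value\r\nTest1,old\r\nTest2,old\r\n"

# Read the CSV into a list of dicts
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Update a specific field in a specific row
for row in rows:
    if row["name"] == "Test1":
        row["value"] = "new"

# Write everything back out, replacing the old contents
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "value"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Rewriting the whole file is what makes targeted updates possible: unlike Append To Csv File, nothing here is limited to adding rows at the end.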

pdfminer odd result with a two columns pdf

I'm converting some PDF documents to text using pdfminer (via pdf2txt.py). I'm not converting directly from PDF to TXT, because I want to mark formatting such as italics or bold. Therefore I first convert the PDF to XML.
I'm converting pdf to xml using:
pdf2txt.py -t xml -o out_file.xml in_file.pdf
My problem is that I found an odd error in the xml file when converting this pdf. If you convert it to xml, check the following:
In page 21 of the pdf the second column starts with "Recentemente...".
The first paragraph of the first column (of the same page) ends with "...lhes falta".
The resulting XML file contains item 1 (and its full column) just after item 2. You can check it at line 128370 of the XML file. Then at line 131782 the correct order resumes, i.e., the paragraph that starts with "O terceiro..." follows.
So, my question is if there is a solution to avoid this error.
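One possible workaround, rather than a fix inside pdfminer itself: the XML that pdf2txt.py emits gives each textbox a bbox="x0,y0,x1,y1" attribute, so the boxes can be re-sorted into reading order (left column top to bottom, then right column). The sample below is a simplified stand-in for the real output, where text is actually split into textline and per-character text elements, and the column-split x-coordinate is an assumption you would tune per document:

```python
# Sketch: re-sort pdfminer XML textboxes into two-column reading order
# using their bbox attributes. Sample data and column_split are
# simplified assumptions, not real pdf2txt.py output.
import xml.etree.ElementTree as ET

xml_data = """<page id="21" bbox="0,0,595,842">
<textbox id="0" bbox="310,700,580,800"><text>Recentemente...</text></textbox>
<textbox id="1" bbox="30,600,290,800"><text>...lhes falta</text></textbox>
</page>"""

page = ET.fromstring(xml_data)
boxes = page.findall("textbox")

def reading_order(box, column_split=300.0):
    x0, y0, x1, y1 = (float(v) for v in box.get("bbox").split(","))
    # Column 0 if the box starts left of the split; PDF y grows upward,
    # so sorting by -y1 goes top to bottom within a column.
    return (0 if x0 < column_split else 1, -y1)

boxes.sort(key=reading_order)
print([b.findtext("text") for b in boxes])
```

This only helps for pages with a consistent two-column layout; a robust version would detect the column boundary per page instead of hard-coding it.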

python excel processing error

I am working on the excel processing using python.
I am using xlrd module (version 0.6.1) for the same.
I am able to read most of the Excel files, but for some of them it gives this error:
XLRDError: Expected BOF record; found 0x213c
Can anyone let me know about how to solve this issue?
thanks in advance.
What you have is most probably an "XML Spreadsheet 2003 (*.xml)" file ... the leading "<!" aka "\x3c\x21" (the start of a comment or DOCTYPE in an XML/HTML stream) is being interpreted as the little-endian number 0x213c.
Notepad: First two lines:
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
You can also check this by opening the file with Excel and then click on Save As and look at the file-type that is displayed. While you are there, save it as an XLS file so that your xlrd can read it.
Note: this XML file is NOT an Excel 2007+ XLSX file. An XLSX is actually a ZIP file (starts with "PK", not "<?") containing a bunch of XML streams.
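The byte-signature check described above can be automated before handing a file to xlrd. This is a sketch; the format labels are my own, and the signatures are the well-known magic numbers for each container:

```python
# Sketch: sniff the first bytes of a spreadsheet file to tell the
# formats apart before passing it to xlrd.
def sniff_excel_format(first_bytes: bytes) -> str:
    # OLE2 compound document: a real binary .xls file
    if first_bytes.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls (OLE2)"
    # ZIP archive: an Excel 2007+ .xlsx file
    if first_bytes.startswith(b"PK"):
        return "xlsx (ZIP)"
    # An XML stream: likely the "XML Spreadsheet 2003" format
    if first_bytes.startswith(b"<?") or first_bytes.startswith(b"<!"):
        return "XML Spreadsheet 2003 or other XML"
    return "unknown"

print(sniff_excel_format(b'<?xml version="1.0"?>'))
```

Only the first branch is a format xlrd 0.6.1 can actually open; the other two would need converting (Save As XLS in Excel) or a different parser.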
