Parse an XML file when an element contains specific text with Python

I would like to parse an XML file and write some parts of it to a CSV file using Python. I am pretty new to programming and XML. I have read a lot, but I couldn't find a useful example for my problem.
My XML file looks like this:
<Host name="1.1.1.1">
  <Properties>
    <tag name="id">1</tag>
    <tag name="os">windows</tag>
    <tag name="ip">1.11.111.1</tag>
  </Properties>
  <Report id="123">
    <output>
      Host is configured to get updates from another server.
      Update status:
      last detected: 2015-12-02 18:48:28
      last downloaded: 2015-11-17 12:34:22
      last installed: 2015-11-23 01:05:32
      Automatic settings:.....
    </output>
  </Report>
  <Report id="123">
    <output>
      Host is configured to get updates from another server.
      Environment Options:
      Automatic settings:.....
    </output>
  </Report>
</Host>
My XML file contains 500 of these entries! I only want to parse the XML blocks whose output contains "Update status", because I want to write the three dates (last detected, last downloaded and last installed) to my CSV file. I would also like to add the id, os and ip.
I tried the ElementTree library, but I am not able to filter element.text for outputs that contain "Update status". At the moment I can extract all text and attributes from the whole file, but I cannot filter the blocks whose output contains Update status, last detected, last downloaded or last installed.
Can anyone give me some advice on how to achieve this?
Desired output:
id:1
os:windows
ip:1.11.111.1
last detected: 2015-12-02 18:48:28
last downloaded: 2015-11-17 12:34:22
last installed: 2015-11-23 01:05:32
All of this information should be written to a .csv file.
At the moment my code looks like this:
#!/usr/bin/env python
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("file.xml")
root = tree.getroot()
# open csv file for writing
data = open('test.csv', 'w')
# create csv writer object
csvwriter = csv.writer(data)
# filter xml file
for tag in root.findall("./Host/Properties/tag[@name='ip']"):
    print(tag.text)  # gives all IPs from the whole XML
for output in root.iter('output'):
    print(output.text)  # gives all outputs from the whole XML
data.close()
Best regards

It's relatively straightforward when you start at the <Host> element and work your way down.
Iterate all the nodes, but only output something when the substring "Update status:" occurs in the value of <output>:
for host in tree.iter("Host"):
    host_id = host.find('./Properties/tag[@name="id"]')
    host_os = host.find('./Properties/tag[@name="os"]')
    host_ip = host.find('./Properties/tag[@name="ip"]')
    for output in host.iter("output"):
        if output.text is not None and "Update status:" in output.text:
            print("id:" + host_id.text)
            print("os:" + host_os.text)
            print("ip:" + host_ip.text)
            for line in output.text.splitlines():
                if ("last detected:" in line or
                        "last downloaded" in line or
                        "last installed" in line):
                    print(line.strip())
This outputs the following for your sample XML:
id:1
os:windows
ip:1.11.111.1
last detected: 2015-12-02 18:48:28
last downloaded: 2015-11-17 12:34:22
last installed: 2015-11-23 01:05:32
Minor point: That's not really CSV, so writing that to a *.csv file as-is wouldn't be very clean.
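If actual CSV is the goal, one row per matching report plus a header line is one way to lay it out. This is a rough sketch, not the only way to do it: the function names extract_rows and write_csv, the column names, and the <Hosts> wrapper element in the inline sample are all made up for illustration, and the sample only mimics the layout from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Labels to pull out of <output>; they double as CSV column names.
LABELS = ("last detected", "last downloaded", "last installed")

# Inline stand-in for the real file; the <Hosts> wrapper is an assumption.
SAMPLE = """<Hosts>
<Host name="1.1.1.1">
<Properties>
<tag name="id">1</tag>
<tag name="os">windows</tag>
<tag name="ip">1.11.111.1</tag>
</Properties>
<Report id="123"><output>
Update status:
last detected: 2015-12-02 18:48:28
last downloaded: 2015-11-17 12:34:22
last installed: 2015-11-23 01:05:32
</output></Report>
<Report id="123"><output>
Environment Options:
</output></Report>
</Host>
</Hosts>"""


def extract_rows(root):
    """Yield one dict per <output> that contains 'Update status:'."""
    for host in root.iter("Host"):
        # Collect id/os/ip from <Properties>; missing tags become "".
        props = {
            name: host.findtext('./Properties/tag[@name="%s"]' % name, default="")
            for name in ("id", "os", "ip")
        }
        for output in host.iter("output"):
            text = output.text or ""
            if "Update status:" not in text:
                continue
            row = dict(props)
            for line in text.splitlines():
                # Split at the first colon only, so the time of day stays intact.
                label, sep, value = line.strip().partition(":")
                if sep and label in LABELS:
                    row[label] = value.strip()
            yield row


def write_csv(rows, stream):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(stream, fieldnames=("id", "os", "ip") + LABELS)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_csv(extract_rows(ET.fromstring(SAMPLE)), buffer)
print(buffer.getvalue())
```

For a real file, pass ET.parse("file.xml").getroot() to extract_rows and hand write_csv a file opened with open("test.csv", "w", newline="") instead of the StringIO buffer.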


Goodreads API Error: List Indices must be integers or slices, not str

So, I'm trying to write a Goodreads information fetcher app in Python using the Goodreads API. I'm currently working on the app's first function, which fetches information from the API; the API returns an XML file.
I parsed the XML file and converted it to JSON, then converted that to a dictionary, but I still can't seem to extract the information from it. I've looked at other posts here, but nothing works.
main.py
import json
import requests
import xmltodict

def get_author_books(authorId):
    url = "https://www.goodreads.com/author/list/{}?format=xml&key={}".format(authorId, key)
    r = requests.get(url)
    xml_file = r.content
    json_file = json.dumps(xmltodict.parse(xml_file))
    data = json.loads(json_file)
    print("Book Name: " + str(data[0]["GoodreadsResponse"]["author"]["books"]["book"]))
I expect the output to give me the name of the first book in the dictionary.
Here is a sample XML file provided by Goodreads.
I think you lack understanding of how xml works, or at the very least, how the response you're getting is formatted.
The xml file you linked to has the following format:
<GoodreadsResponse>
  <Request>...</Request>
  <author>
    <id>...</id>
    <name>...</name>
    <link>...</link>
    <books>
      <book> [some stuff about the first book] </book>
      <book> [some stuff about the second book] </book>
      [More books]
    </books>
  </author>
</GoodreadsResponse>
This means that in your data object, data["GoodreadsResponse"]["author"]["books"]["book"] is a collection of all the books in the response (all the elements surrounded by the <book> tags). So:
data["GoodreadsResponse"]["author"]["books"]["book"][0] is the first book.
data["GoodreadsResponse"]["author"]["books"]["book"][1] is the second book, and so on.
Looking back at the xml, each book element has an id, isbn, title, description, among other tags. So you can print the title of the first book by printing:
data["GoodreadsResponse"]["author"]["books"]["book"][0]["title"]
For reference, I'm running the following code against the XML file you linked to; normally you'd fetch this from the API:
import json
import xmltodict

with open("source.xml", "r") as f:  # XML file from the question
    xml_file = f.read()
json_file = json.dumps(xmltodict.parse(xml_file))
data = json.loads(json_file)
books = data["GoodreadsResponse"]["author"]["books"]["book"]
print(books[0]["title"])  # The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary
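If only a few fields are needed, the detour through a dictionary can be skipped. Here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of xmltodict. The inline XML is an invented stand-in with the same nesting as the response described above, and book_titles is a name chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the API response (same nesting, made-up values).
SAMPLE = """
<GoodreadsResponse>
  <Request/>
  <author>
    <id>1</id>
    <name>Example Author</name>
    <books>
      <book><id>10</id><title>First Book</title></book>
      <book><id>11</id><title>Second Book</title></book>
    </books>
  </author>
</GoodreadsResponse>
"""


def book_titles(xml_text):
    """Return the titles of all <book> elements, in document order."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title", default="")
        for book in root.iterfind("./author/books/book")
    ]


titles = book_titles(SAMPLE)
print(titles[0])
```

This always returns a list, even when the author has a single book. With xmltodict, a lone <book> comes back as a single dict rather than a one-element list unless you pass force_list=("book",) to xmltodict.parse, so indexing with [0] can fail in that case.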

How to replace xml lines using 'if statements' in python?

Hi, I'm new to XML files in general, but I am trying to replace specific lines in an XML file using 'if statements' in Python 3.6. I've looked at suggestions to use ElementTree, but none of the posts online quite fit my problem, so here I am.
My file is as follows:
<?xml version="1.0" encoding="UTF-8"?>
-<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
I want to replace
url value="http://example.org/fhir/StructureDefinition/MyObservation"/
to something like
url value="http://example.org/fhir/StructureDefinition/NewObservation"/
using conditional statements, because these values are repeated multiple times in other files.
I have tried looping through the XML file to find the exact string match (which succeeded), but I wasn't able to delete or replace the line (probably because this isn't a .txt file).
Any help is greatly appreciated!
Your sample file contains a stray "-" in front of <StructureDefinition> that is easy to overlook when copying and pasting; it has to be removed before the file can be parsed.
Input File
<?xml version="1.0" encoding="UTF-8"?>
<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
Script
from xml.dom.minidom import parse  # use minidom for this task

dom = parse('june.xml')  # read in your file
search = "http://example.org/fhir/StructureDefinition/MyObservation"  # set search value
replace = "http://example.org/fhir/StructureDefinition/NewObservation"  # set replace value
res = dom.getElementsByTagName('url')  # collect all url tags
for element in res:
    if element.getAttribute('value') == search:  # in case of a match
        element.setAttribute('value', replace)  # replace
with open('june_updated.xml', 'w') as f:
    f.write(dom.toxml())  # serialize the updated DOM to a new XML file
Output file
<?xml version="1.0" ?><StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/NewObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
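Since the question mentions ElementTree, the same replacement can be sketched with it as well. The catch is the default namespace: element names have to be written as {http://hl7.org/fhir}url when searching. This is only a sketch; the helper name replace_url_value and the shortened inline sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

FHIR_NS = "http://hl7.org/fhir"

# Shortened stand-in for the file from the question.
SAMPLE = """<StructureDefinition xmlns="http://hl7.org/fhir">
  <url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
  <name value="MyObservation"/>
</StructureDefinition>"""


def replace_url_value(root, search, replace):
    """Set value=replace on every FHIR <url> whose value equals search.

    Returns the number of elements that were changed.
    """
    changed = 0
    for element in root.iter("{%s}url" % FHIR_NS):
        if element.get("value") == search:
            element.set("value", replace)
            changed += 1
    return changed


# Keep the FHIR namespace as the unprefixed default when serializing;
# without this, ElementTree writes generated prefixes such as ns0:.
ET.register_namespace("", FHIR_NS)

root = ET.fromstring(SAMPLE)
count = replace_url_value(
    root,
    "http://example.org/fhir/StructureDefinition/MyObservation",
    "http://example.org/fhir/StructureDefinition/NewObservation",
)
result = ET.tostring(root, encoding="unicode")
print(count)
print(result)
```

For files on disk, ET.parse(path) returns a tree whose getroot() can be passed to the helper, and tree.write(out_path, encoding="UTF-8", xml_declaration=True) saves the result. Because the helper reports how many elements it changed, it is easy to run over many files and skip the ones with no match.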

ParseError while parsing AndroidManifest.xml in python

I'm trying to parse an AndroidManifest.xml file to extract some information, and I get this error when loading the file:
xml.etree.ElementTree.ParseError: not well-formed (invalid token):
line 1, column 0
Here is my code :
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='AndroidManifest.xml')
root = tree.getroot()
My XML file seems well formed :
<?xml version="1.0" encoding="utf-8"?>
<manifest
xmlns:android="http://schemas.android.com/apk/res/android"
android:versionCode="132074037"
android:versionName="193.0.0.21.98"
android:installLocation="0"
package="com.facebook.orca">
How can I fix that and parse my XML to get the 'android:versionName' attribute?
Solved
I was trying to parse an AndroidManifest.xml that I had extracted by unzipping an APK, but inside an APK the manifest is stored in Android's binary XML format rather than as plain text, so it cannot be read or parsed as ordinary XML. I could only read it in Android Studio, which decodes the manifest automatically.
To read values from the manifest of an APK, the simplest way I found is the aapt command-line tool:
/Users/{Path_to_your_sdk}/sdk/build-tools/28.0.3/aapt dump badging com.squareup.cash.apk | sed -n "s/.*versionName='\([^']*\).*/\1/p"
This prints the versionName of your app. Hope it helps.
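Since the question was about Python, the same aapt call can be wrapped in a small script. This is a sketch under a few assumptions: aapt is on PATH (otherwise pass the full path to the binary), the function names are made up, and the example line only imitates the "package:" line that aapt dump badging prints.

```python
import re
import subprocess

# aapt dump badging starts with a line of the form:
#   package: name='...' versionCode='...' versionName='...'
VERSION_NAME_RE = re.compile(r"versionName='([^']*)'")


def extract_version_name(badging_output):
    """Return the versionName found in aapt's badging text, or None."""
    match = VERSION_NAME_RE.search(badging_output)
    return match.group(1) if match else None


def read_version_name(apk_path, aapt="aapt"):
    """Run `aapt dump badging` on an APK and return its versionName.

    Raises subprocess.CalledProcessError if aapt exits with an error.
    """
    completed = subprocess.run(
        [aapt, "dump", "badging", apk_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return extract_version_name(completed.stdout)


# Imitation of the first line of aapt's output, using the values from the question.
example = "package: name='com.facebook.orca' versionCode='132074037' versionName='193.0.0.21.98'"
print(extract_version_name(example))
```

Keeping the regular expression in its own function means the parsing can be checked without an APK or the Android SDK at hand; read_version_name("com.squareup.cash.apk") would then be the call for a real file.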

How to remove all " \n" in xml payload by using lxml library

I'm trying to change a text value in an XML file and return the updated XML content using the lxml library. I can update the value successfully, but the updated XML contains "\n" (newline) characters, as shown below.
Output:
<?xml version='1.0' encoding='ASCII'?>\n<Order>\n <content>\n <sID>123</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
<content>\n <sID>111</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
</Order>
Note: I didn't format the XML output above; I posted it exactly as it appears in the output console.
Input:
<Order>
<content>
<sID>123</sID>
<spNumber>UserTemp</spNumber>
<client>WALLMART</client>
<orderType>Dashboard</orderType>
</content>
<content>
<sID>111</sID>
<spNumber>UserTemp</spNumber>
<client>D&amp;B</client>
<orderType>Dashboard</orderType>
</content>
</Order>
Also, I tried to remove the \n character in output xml file by using
getValue = getValue.replace('\n','')
but, no luck.
I used the code below to update the XML (the <client> tag) and tried to return the updated XML content.
Python Code:
from lxml import etree
from io import StringIO
import six
import numpy

def getListOfNodes(location):
    f = open(location)
    xml = f.read()
    f.close()
    #print(xml)
    getXml = etree.parse(location)
    for elm in getXml.xpath('.//Order//content/client'):
        index='ARRCHANA'
        elm.text=index
    #with open('C:\\New folder\\temp.xml','w',newline='\r\n') as writeFile:
    #writeFile.write(str(etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    getValue=str((etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    #getValue = getValue.replace('\n','')
    #getValue=getValue.replace("\n","<br/>")
    print(getValue)
    return getValue
When I try to open the response payload in Firefox, it shows the following error message:
XML Parsing Error: no element found Location:
file:///C:/New%20folder/Confidential.xml
Line Number 1, Column 1:
It reports "no element found" at line 1, column 1 of the XML file when it encounters the "\n" characters.
Can somebody suggest a better way to update the text value and return the XML without any additional characters?
I fixed it myself using the script below:
code = root.xpath('.//Order//content/client')
if code:
    code[0].text = 'ARRCHANA'
# raw string, so that "\t" in the path is not read as a tab character
etree.ElementTree(root).write(r'D:\test.xml', pretty_print=True)
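The literal \n sequences in the question most likely come from calling str() on the bytes object that tostring() returns. In Python 3 that produces the text of the bytes repr (b'...' with escaped newlines), not the decoded document, which would also explain why replace('\n', '') found nothing to remove. Here is a minimal sketch of the difference. It uses the standard library's xml.etree.ElementTree so it runs without extra packages; lxml's etree.tostring likewise returns bytes by default and accepts encoding="unicode".

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Order>\n  <content>\n    <client>WALLMART</client>\n  </content>\n</Order>"
)

# Update every <client>, as in the question.
for client in root.iter("client"):
    client.text = "ARRCHANA"

as_bytes = ET.tostring(root)  # bytes object

# str() on bytes gives the repr: it starts with b' and contains a
# backslash followed by the letter n instead of real newlines.
wrong = str(as_bytes)

# Either ask for text directly, or decode the bytes.
right = ET.tostring(root, encoding="unicode")
also_right = as_bytes.decode("ascii")

print(wrong)
print(right)
```

Returning right (or also_right) from the function gives the caller well-formed XML with real line breaks, so there is nothing left to strip afterwards.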

Parsing Weather XML with Python

I'm a beginner, but with a lot of effort I'm trying to parse some weather data from an .xml file called "weather.xml", which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Weather>
  <locality name="Rome" alt="21">
    <situation temperature="18°C" temperatureF="64,4°F" humidity="77%" pression="1016 mb" wind="5 SSW km/h" windKN="2,9 SSW kn">
      <description>clear sky</description>
      <lastUpdate>17:45</lastUpdate>
    </situation>
    <sun sunrise="6:57" sunset="18:36" />
  </locality>
</Weather>
I parsed some data from this XML and this is how my Python code looks now:
#!/usr/bin/python
from xml.dom import minidom
xmldoc = minidom.parse('weather.xml')
entry_situation = xmldoc.getElementsByTagName('situation')
entry_locality = xmldoc.getElementsByTagName('locality')
print(entry_locality[0].attributes['name'].value)
print("Temperature: "+entry_situation[0].attributes['temperature'].value)
print("Humidity: "+entry_situation[0].attributes['humidity'].value)
print("Pression: "+entry_situation[0].attributes['pression'].value)
It works fine, but if I try to read data from the "description" or "lastUpdate" node with the same method, I get an error, so this approach must be wrong for those nodes, which I can see are different.
I'm also trying to write the output to a log file, without success; the most I get is an empty file.
Thank you for your time reading this.
It is because "description" and "lastUpdate" are not attributes but child nodes of the "situation" node.
Try:
d = entry_situation[0].getElementsByTagName("description")[0]
print("Description: %s" % d.firstChild.nodeValue)
You should use the same method to access the "situation" node from its parent "locality".
By the way you should take a look at the lxml module, especially the objectify API as yegorich said. It is easier to use.
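Putting the two cases side by side may help: attributes are read with getAttribute, while the text of a child element sits in its first child node. Below is a small self-contained sketch. The helper names child_text and weather_report and the trimmed-down inline sample are made up for illustration.

```python
from xml.dom import minidom

# Trimmed-down stand-in for weather.xml.
SAMPLE = """<Weather>
  <locality name="Rome" alt="21">
    <situation humidity="77%" pression="1016 mb">
      <description>clear sky</description>
      <lastUpdate>17:45</lastUpdate>
    </situation>
  </locality>
</Weather>"""


def child_text(parent, tag_name):
    """Return the text of the first <tag_name> below parent, or '' if absent or empty."""
    nodes = parent.getElementsByTagName(tag_name)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.nodeValue


def weather_report(xml_text):
    """Build the report lines for the first locality in the document."""
    doc = minidom.parseString(xml_text)
    locality = doc.getElementsByTagName("locality")[0]
    # Reach <situation> through its parent, as suggested above.
    situation = locality.getElementsByTagName("situation")[0]
    return [
        locality.getAttribute("name"),                             # attribute
        "Humidity: " + situation.getAttribute("humidity"),         # attribute
        "Description: " + child_text(situation, "description"),    # child node
        "Last update: " + child_text(situation, "lastUpdate"),     # child node
    ]


lines = weather_report(SAMPLE)
print("\n".join(lines))
```

As for the log file, an empty file is often a sign that it was never flushed or closed. Writing inside a with block, for example with open("weather.log", "w") as log: log.write("\n".join(lines) + "\n"), closes the file automatically when the block ends.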
