Convert XML to python dictionary - python

I am sending a request to an API:
import requests as rq
resp_tf = rq.get("https://api.t....")
tf_text = resp_tf.text
Which prints:
<?xml version="1.0" encoding="UTF-8"?>
<flowSegmentData version="traffic-service 4.0.011">
<frc>FRC0</frc>
<currentSpeed>78</currentSpeed>
<freeFlowSpeed>78</freeFlowSpeed>
<currentTravelTime>19</currentTravelTime>
<freeFlowTravelTime>19</freeFlowTravelTime>
<confidence>0.980000</confidence>
<roadClosure>false</roadClosure>
<coordinates>
<coordinate>
.....
How can I get the value of a tag, for example "currentSpeed"?

This can be done using the BeautifulSoup module.
The approach has two steps:
Search for the tag by its name using the find_all() method. The "html.parser" parser lowercases tag names, which is why the search string is "currentspeed".
Create a dictionary where the key is the name of the tag found, and the value is the text of the tag.
from bs4 import BeautifulSoup
xml = """<?xml version="1.0" encoding="UTF-8"?>
<flowSegmentData version="traffic-service 4.0.011">
<frc>FRC0</frc>
<currentSpeed>78</currentSpeed>
<freeFlowSpeed>78</freeFlowSpeed>
<currentTravelTime>19</currentTravelTime>
<freeFlowTravelTime>19</freeFlowTravelTime>
<confidence>0.980000</confidence>
<roadClosure>false</roadClosure>
<coordinates>
<coordinate>"""
soup = BeautifulSoup(xml, "html.parser")
print({tag.name: tag.text for tag in soup.find_all("currentspeed")})
Output:
{'currentspeed': '78'}
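If the goal is a dictionary of every tag rather than a single one, the same idea extends to the whole response. Below is a minimal sketch using only the standard library's ElementTree. It assumes the real response is complete, well-formed XML (the sample is shortened and closed by hand), and `xml_to_flat_dict` is a hypothetical helper name, not part of any library:

```python
import xml.etree.ElementTree as ET

# Shortened, hand-closed version of the response from the question
# (assumption: the real response is complete, well-formed XML).
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<flowSegmentData version="traffic-service 4.0.011">
<frc>FRC0</frc>
<currentSpeed>78</currentSpeed>
<freeFlowSpeed>78</freeFlowSpeed>
<confidence>0.980000</confidence>
<roadClosure>false</roadClosure>
</flowSegmentData>"""


def xml_to_flat_dict(text):
    """Map each direct child tag of the root to its text (hypothetical helper)."""
    root = ET.fromstring(text)
    # Iterating over an element yields its direct children only.
    return {child.tag: child.text for child in root}


data = xml_to_flat_dict(xml_text)
print(data["currentSpeed"])
```

ElementTree keeps the original tag case, so the key is "currentSpeed" here. All values are strings; convert them with int() or float() where needed.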

Related

Working with XML data from Python API request

I'm trying to retrieve data from an API, however it appears to be returning in XML format.
response = requests.get('https string')
print(response.text)
Output:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><RegisterSearch TotalResultsOnPage="500" TotalResults="15167" TotalPages="31" PageSize="500" CurrentPage="1"><SearchResults><Document DocumentId="1348828088640186163"/><Document DocumentId="1348828088751561003"/></SearchResults></RegisterSearch>
I've tried using ElementTree as suggested by other answers, but receive a file not found error. I think I'm missing something.
import xml.etree.ElementTree as ET
tree = ET.parse(response.text)
root = tree.getroot()
EDIT:
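If you want to use ElementTree, you need to parse from a string. ET.parse() expects a filename or file object, which is why passing the response text raises a file-not-found error: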
root = ET.fromstring(response.text)
Alternatively, you can parse it with Beautiful Soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'xml')
Then, depending on what you want to extract, you can use find. Note that DocumentId is an attribute of the Document tag, not a tag itself:
soup.find('Document')['DocumentId']
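A standard-library sketch of the same extraction, using the response text shown in the question. The variable names are illustrative only:

```python
import xml.etree.ElementTree as ET

# The response body from the question, as a string.
response_text = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<RegisterSearch TotalResultsOnPage="500" TotalResults="15167" TotalPages="31" '
    'PageSize="500" CurrentPage="1"><SearchResults>'
    '<Document DocumentId="1348828088640186163"/>'
    '<Document DocumentId="1348828088751561003"/>'
    '</SearchResults></RegisterSearch>'
)

# fromstring() parses text that is already in memory;
# parse() would treat the same text as a file path.
root = ET.fromstring(response_text)

# Attributes are read with .get(); they come back as strings.
total_pages = int(root.get("TotalPages"))
document_ids = [doc.get("DocumentId") for doc in root.iter("Document")]
print(total_pages, document_ids)
```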

In Python, using beautiful soup- how do I get the text of a XML tag that contains a hyphen

I'm trying to get the text content of the 'event-id' tag in the XML, but the hyphenated name is not recognized as an element. I know the script itself works, because if I replace the hyphen with an underscore in the XML and run it again, it works. Does anybody know what the problem could be?
<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate>
from bs4 import BeautifulSoup
dir_path = '20211006085201.xml'
file = open(dir_path, encoding='UTF-8')
contents = file.read()
soup = BeautifulSoup(contents, 'xml')
events = soup.find_all('fullEventUpdate')
print(' \n-------', len(events), 'events calculated on ', dir_path, '--------\n')
idi = soup.find_all('event-reference')
for x in range(0, len(events)):
    idText = (idi[x].event-id.get_text())
    print(idText)
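The attribute syntax fails because Python reads idi[x].event-id as idi[x].event minus id, so the hyphenated tag is never looked up. The document is also namespaced XML, and for that type of document CSS selectors are a convenient way to look tags up by name:
events = soup.select('fullEventUpdate')
for event in events:
    print(event.select_one('event-id').text)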
Output:
24425412
24342548
More generally, when dealing with XML documents, you are probably better off using something that supports XPath (like lxml or ElementTree).
The idiomatic approach to XML parsing is to use XPath selectors.
In Python this can easily be done with the parsel package, which is similar to BeautifulSoup but built on top of lxml for full XPath support:
body = ...
from parsel import Selector
selector = Selector(body)
for event in selector.xpath("//event-reference"):
    print(event.xpath('event-id/text()').get())
results in:
24425412
24342548
Without any external lib (Just ElementTree)
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate> '''
root = ET.fromstring(xml)
ids = [e.text for e in root.findall('.//event-id')]
print(ids)
output
['24425412', '24342548']

Parse XML in Python (one correct way to do so) using lxml

I am getting a response using requests module in Python and the response is in form of xml. I want to parse it and get details out of each 'dt' tag. I am not able to do that using lxml.
Here is the xml response:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="harsh">
<ew>harsh</ew><subj>MD-2</subj><hw>harsh</hw>
<sound><wav>harsh001.wav</wav><wpr>!h#rsh</wpr></sound>
<pr>ˈhärsh</pr>
<fl>adjective</fl>
<et>Middle English <it>harsk,</it> of Scandinavian origin; akin to Norwegian <it>harsk</it> harsh</et>
<def>
<date>14th century</date>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
<sn>b</sn>
<dt>:physically discomforting :<sx>painful</sx></dt>
<sn>3</sn>
<dt>:unduly exacting :<sx>severe</sx></dt>
<sn>4</sn>
<dt>:lacking in aesthetic appeal or refinement :<sx>crude</sx></dt>
<ss>rough</ss>
</def>
<uro><ure>harsh*ly</ure> <fl>adverb</fl></uro>
<uro><ure>harsh*ness</ure> <fl>noun</fl></uro>
</entry>
</entry_list>
A simple way would be to traverse down the hierarchy of the xml document.
import requests
from lxml import etree
# url is the API endpoint from the question (not shown here)
resp = requests.get(url)
root = etree.fromstring(resp.content)
print(root.xpath('//entry_list/entry/def/dt/text()'))
This returns the text nodes that sit directly inside each 'dt' tag in the XML document.
from xml.dom import minidom
# List with dt values
dt_elems = []
# Process xml getting elements by tag name
xmldoc = minidom.parse('text.xml')
itemlist = xmldoc.getElementsByTagName('dt')
# Get the values
for i in itemlist:
    dt_elems.append(" ".join(t.nodeValue for t in i.childNodes if t.nodeType==t.TEXT_NODE))
# Print the list result
print(dt_elems)
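Both answers above collect only the text that sits directly inside each 'dt' tag, so words wrapped in a nested tag such as 'sx' are left out. If the nested text is wanted too, one option is ElementTree's itertext(). This is a sketch on a shortened copy of the response, not a replacement for either answer:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question.
xml_text = """<entry_list version="1.0">
<entry id="harsh">
<def>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
</def>
</entry>
</entry_list>"""

root = ET.fromstring(xml_text)

# itertext() walks the element and all of its descendants in document order,
# so the text inside <sx> is included.
definitions = ["".join(dt.itertext()) for dt in root.iter("dt")]
for definition in definitions:
    print(definition)
```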

BeautifulSoup does not parse content of the tag like first-name that contains '-'

Hi, I have a response like the one below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person>
<first-name>hede</first-name>
<last-name>hodo</last-name>
<headline>Python Developer at hede</headline>
<site-standard-profile-request>
<url>http://www.linkedin.com/profile/view?id=hede&authType=godasd*</url>
</site-standard-profile-request>
</person>
I want to parse the content returned from the LinkedIn API.
I am using BeautifulSoup like this:
ipdb> hede = BeautifulSoup(response.content)
ipdb> hede.person.headline
<headline>Python Developer at hede</headline>
But when I do
ipdb> hede.person.first-name
*** NameError: name 'name' is not defined
Any ideas?
Python attribute names cannot contain a hyphen, so first-name is parsed as first minus name.
Instead use
hede.person.findChild('first-name')
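Also, to parse XML with BeautifulSoup, use the 'xml' parser, which requires lxml to be installed:
hede = bs.BeautifulSoup(content, 'xml')
Note that passing 'lxml' selects lxml's HTML parser rather than its XML parser:
hede = bs.BeautifulSoup(content, 'lxml')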
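To see why the NameError mentions 'name', it can help to look at how Python parses the expression. The sketch below uses the standard ast module for that, then shows a string-based lookup with ElementTree as one library-independent way to reach a hyphenated tag (the one-tag sample document is an assumption made for brevity):

```python
import ast
import xml.etree.ElementTree as ET

# Parse the expression from the question without evaluating it.
expr = ast.parse("hede.person.first-name", mode="eval").body

# The top-level node is a binary subtraction, not an attribute access.
print(type(expr).__name__, type(expr.op).__name__)
left_attr = expr.left.attr   # attribute on the left of the minus sign
right_name = expr.right.id   # bare variable on the right of the minus sign
print(left_attr, right_name)

# Looking the tag up by string avoids the problem entirely.
person = ET.fromstring("<person><first-name>hede</first-name></person>")
first_name = person.find("first-name").text
print(first_name)
```

The right-hand side of the subtraction is a plain variable called name, which does not exist in the session, hence the NameError.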

BeautifulSoup XML Only printing first line

I'm using BeautifulSoup4 (and lxml) to parse an XML file. For some reason, when I print soup.prettify() it only prints the first line:
from bs4 import BeautifulSoup
f = open('xmlDoc.xml', "r")
soup = BeautifulSoup(f, 'xml')
print(soup.prettify())
#>>> <?xml version="1.0" encoding="utf-8"?>
Any idea why it's not grabbing everything?
UPDATE:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Data Junction generated file.
Macro type "1000" is reserved. -->
<djmacros>
<macro name="Test" type="5000" value="TestValue">
<description>test</description>
</macro>
<macro name="AnotherTest" type="0" value="TestValue2"/>
<macro name="TestLocation" type="1000" value="C:\RandomLocation">
<description> </description>
</macro>
<djmacros>
Either the file position is already at EOF, so nothing is left to read:
>>> soup = BeautifulSoup("", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'
Or the content is not valid XML:
>>> soup = BeautifulSoup("no <root/> element", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'
As per J.F.Sebastion's answer, the XML is invalid.
Your final tag is incorrect:
<djmacros>
The correct tag is:
</djmacros>
You can confirm this with an XML validator, e.g. http://www.w3schools.com/xml/xml_validator.asp
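A quick local check is also possible without an online validator. The sketch below tests well-formedness only (not validity against a schema) with the standard library; `is_well_formed` is a hypothetical helper name, and the two sample strings are shortened versions of the file from the question:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, "") if text parses, else (False, parser message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


# Shortened versions of the file: the first repeats the opening tag at the
# end (the mistake in the question), the second closes it properly.
broken = '<djmacros><macro name="Test" type="5000" value="TestValue"/><djmacros>'
fixed = '<djmacros><macro name="Test" type="5000" value="TestValue"/></djmacros>'

print(is_well_formed(broken))
print(is_well_formed(fixed))
```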
I had the same problem with a valid XML file.
The problem was that the XML file is encoded in UTF-8 with BOM.
I discovered that by printing the raw content:
content = open(path, "r").read()
print(content)
And I got (see this thread: What's ï»¿ sign at the beginning of my source file?):
ï»¿<?xml version="1.0" encoding="utf-8"?>
If the file is encoded as UTF-8 with a BOM rather than plain UTF-8, parsing may fail even if the XML is otherwise valid.
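One way to handle this, sketched with the standard library only, is to decode with the "utf-8-sig" codec, which drops a leading BOM if one is present. The byte string below stands in for the file contents and is an assumption made for illustration:

```python
import codecs
import xml.etree.ElementTree as ET

# Stand-in for the raw bytes of a UTF-8 file saved with a BOM.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?><root/>'

with_bom = raw.decode("utf-8")         # BOM survives as the character U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped during decoding

print(repr(with_bom[0]))
print(without_bom[:5])

# The cleaned text now starts with the XML declaration and parses normally.
root_tag = ET.fromstring(without_bom).tag
print(root_tag)
```

When reading from disk, the same codec can be passed to open(), as in open(path, encoding="utf-8-sig"), before handing the file to the parser.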
