I'm trying to retrieve data from an API; however, it appears to be returning XML.
response = requests.get('https string')
print(response.text)
Output:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><RegisterSearch TotalResultsOnPage="500" TotalResults="15167" TotalPages="31" PageSize="500" CurrentPage="1"><SearchResults><Document DocumentId="1348828088640186163"/><Document DocumentId="1348828088751561003"/></SearchResults></RegisterSearch>
I've tried using ElementTree as suggested by other answers, but receive a file not found error. I think I'm missing something.
import xml.etree.ElementTree as ET
tree = ET.parse(response.text)  # fails: parse() treats the string as a filename
root = tree.getroot()
EDIT:
If you want to use ElementTree, you need to parse from a string: parse() expects a filename or file object, while fromstring() takes the XML text itself.
root = ET.fromstring(response.text)
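For the response shown above, a minimal sketch (inlining the sample XML as a string; in practice you'd pass response.text) could look like this:

```python
import xml.etree.ElementTree as ET

# Sample response body from the question
xml_text = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            '<RegisterSearch TotalResultsOnPage="500" TotalResults="15167" '
            'TotalPages="31" PageSize="500" CurrentPage="1"><SearchResults>'
            '<Document DocumentId="1348828088640186163"/>'
            '<Document DocumentId="1348828088751561003"/>'
            '</SearchResults></RegisterSearch>')

root = ET.fromstring(xml_text)

# Attributes of the root element
total = root.get('TotalResults')

# DocumentId is an attribute of each Document element, not a child tag
doc_ids = [doc.get('DocumentId') for doc in root.iter('Document')]

print(total)    # 15167
print(doc_ids)  # ['1348828088640186163', '1348828088751561003']
```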
You can parse it with Beautiful Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'xml')
Then, depending on what you want to extract, you can use find. Note that in your sample DocumentId is an attribute of the Document tags rather than a tag of its own, so access it like a dictionary key:
soup.find('Document')['DocumentId']
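As a fuller sketch against a trimmed copy of the sample response, collecting every DocumentId with find_all (variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample response
xml_text = ('<RegisterSearch TotalResults="15167"><SearchResults>'
            '<Document DocumentId="1348828088640186163"/>'
            '<Document DocumentId="1348828088751561003"/>'
            '</SearchResults></RegisterSearch>')

soup = BeautifulSoup(xml_text, 'xml')

# DocumentId lives in each tag's attributes, so index into the tag
doc_ids = [doc['DocumentId'] for doc in soup.find_all('Document')]
print(doc_ids)  # ['1348828088640186163', '1348828088751561003']
```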
Related
I am writing a simple Python script using Beautiful Soup to parse the data I need out of an XML file. It's working how I need it to, but there's one thing I haven't been able to find by searching.
Sample of XML string:
<ProductAttribute MaintenanceType="C" AttributeID="Attachment Type" PADBAttribute="N" RecordNumber="1" LanguageCode="EN">Clamp-On</ProductAttribute>
I need the AttributeID within the ProductAttribute. When I write the below, I am able to grab the value "Clamp-On", but I need AttributeID to tell me what "Clamp-On" is referencing.
attributes[part.find('PartNumber').get_text()] = [x.get_text() for x in part.find_all('ProductAttribute')]
for key, value in attributes.items():
    for v in value:
        print(v)
Any guidance is appreciated. Thanks!
A simple solution using only the lxml library:
from lxml import etree
xml_string = """<ProductAttribute MaintenanceType="C" AttributeID="Attachment Type" PADBAttribute="N" RecordNumber="1" LanguageCode="EN">Clamp-On</ProductAttribute>"""
xml = etree.XML(xml_string)
print(xml.get("AttributeID"))
Output:
Attachment Type
Here is how you can get a tag attribute from XML using BeautifulSoup with the lxml-based xml parser:
from bs4 import BeautifulSoup
xml_string = '<ProductAttribute MaintenanceType="C" AttributeID="Attachment Type" PADBAttribute="N" RecordNumber="1" LanguageCode="EN">Clamp-On</ProductAttribute>'
soup = BeautifulSoup(xml_string, 'xml')
tag = soup.ProductAttribute
print(tag['AttributeID'])
This code prints the value of the AttributeID attribute.
While parsing an SVG file, I noticed that BeautifulSoup adds HTML tags to it.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<svg></svg>', 'lxml')
print(soup)
results in:
<html><body><svg></svg></body></html>
Why is this so and can this be avoided?
You used the lxml parser, which is an HTML parser. To parse XML you should use the xml parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<svg></svg>', 'xml')
print(soup) # ^^^^^^
From BeautifulSoup documentation:
Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different. Different parsers will create
different parse trees from the same document. The biggest differences
are between the HTML parsers and the XML parsers. Here’s a short
document, parsed as HTML:
BeautifulSoup("<a><b /></a>")
# <html><head></head><body><a><b></b></a></body></html>
Since an empty <b /> tag is not valid HTML, the parser turns it into a
<b></b> tag pair.
Here’s the same document parsed as XML (running this requires that you
have lxml installed). Note that the empty <b /> tag is left alone, and
that the document is given an XML declaration instead of being put
into an <html> tag:
BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
Source: Differences between parsers, emphasis mine.
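A quick side-by-side run, assuming both the lxml HTML parser and its XML mode are installed, makes the difference visible:

```python
from bs4 import BeautifulSoup

doc = '<svg></svg>'

as_html = BeautifulSoup(doc, 'lxml')  # HTML parser wraps the fragment in html/body
as_xml = BeautifulSoup(doc, 'xml')    # XML parser leaves the fragment alone

print(as_html)  # <html><body><svg></svg></body></html>
print(as_xml)   # XML declaration followed by the bare svg element
```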
I am getting a response using the requests module in Python, and the response is in the form of XML. I want to parse it and get the details out of each 'dt' tag. I am not able to do that using lxml.
Here is the xml response:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="harsh">
<ew>harsh</ew><subj>MD-2</subj><hw>harsh</hw>
<sound><wav>harsh001.wav</wav><wpr>!h#rsh</wpr></sound>
<pr>ˈhärsh</pr>
<fl>adjective</fl>
<et>Middle English <it>harsk,</it> of Scandinavian origin; akin to Norwegian <it>harsk</it> harsh</et>
<def>
<date>14th century</date>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
<sn>b</sn>
<dt>:physically discomforting :<sx>painful</sx></dt>
<sn>3</sn>
<dt>:unduly exacting :<sx>severe</sx></dt>
<sn>4</sn>
<dt>:lacking in aesthetic appeal or refinement :<sx>crude</sx></dt>
<ss>rough</ss>
</def>
<uro><ure>harsh*ly</ure> <fl>adverb</fl></uro>
<uro><ure>harsh*ness</ure> <fl>noun</fl></uro>
</entry>
</entry_list>
A simple way would be to traverse down the hierarchy of the xml document.
import requests
from lxml import etree
re = requests.get(url)
root = etree.fromstring(re.content)
print(root.xpath('//entry_list/entry/def/dt/text()'))
This will give the text value for each 'dt' tag in the xml document.
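If you'd rather stay in the standard library, xml.etree.ElementTree can do the same traversal; here's a sketch against a trimmed copy of the response:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response
xml_text = '''<entry_list version="1.0">
  <entry id="harsh">
    <def>
      <sn>1</sn>
      <dt>:having a coarse uneven surface</dt>
      <sn>2 a</sn>
      <dt>:causing a disagreeable sensory reaction</dt>
    </def>
  </entry>
</entry_list>'''

root = ET.fromstring(xml_text)

# findall with a relative path mirrors the xpath expression above
dts = [dt.text for dt in root.findall('entry/def/dt')]
print(dts)
```

Note that .text only returns the text before any child element, which matches what the xpath text() call gives you for 'dt' entries containing nested 'sx' tags.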
from xml.dom import minidom
# List with dt values
dt_elems = []
# Process xml getting elements by tag name
xmldoc = minidom.parse('text.xml')
itemlist = xmldoc.getElementsByTagName('dt')
# Get the values
for i in itemlist:
    dt_elems.append(" ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE))
# Print the list result
print(dt_elems)
My offline code works fine, but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication, then lxml to parse (it gives a good result with the specific pages we need to scrape), and then passing to BeautifulSoup.
#! /usr/bin/python
import urllib.request
import urllib.error
from io import StringIO
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly
With that working, I tried to feed it a page via urllib:
# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes
Trying to deal with the error message, I tried:
# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file
I didn't know what to do about the OSError so I sought another method. I found suggestions to use lxml.html instead of lxml.etree so the next attempt is:
# attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer
This clearly gives a structure of some sort, but how do I pass it to BeautifulSoup? My question is twofold: how can I pass a page from urllib to lxml.etree (as in attempt 1, closest to my working code)? Or, how can I pass an lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes but don't know what to do about them.
Python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to Python. Thanks to the internet for code fragments and examples.
BeautifulSoup can use the lxml parser directly; there's no need to go to these lengths:
BeautifulSoup(doc, 'lxml')
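In full, the fetch-and-parse step collapses to something like this sketch (an inline byte string stands in for page.read(), since BeautifulSoup accepts bytes directly and detects the encoding itself):

```python
from bs4 import BeautifulSoup

# Stand-in for doc = urllib.request.urlopen(req).read()
doc = b'<html><head><title>Sample</title></head><body><p>Hello</p></body></html>'

# No StringIO, no etree round-trip: hand the bytes straight to BeautifulSoup
soup = BeautifulSoup(doc, 'lxml')
print(soup.title.string)  # Sample
```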
I'm trying to learn Python. My only experience is AppleScript, and it's not so easy to learn, so far anyway.
I'm trying to parse an XML weather site, and so far I have the data I need, but I can't figure out how to get it into a list to process it further. Can anyone help?
from BeautifulSoup import BeautifulSoup
import xml.etree.cElementTree as ET
from xml.etree.cElementTree import parse
import urllib2
url = "http://www.weatheroffice.gc.ca/rss/city/ab-52_e.xml"
response = urllib2.urlopen(url)
local_file = open("\Temp\weather.xml", "w")
local_file.write(response.read())
local_file.close()
invalid_tags = ['b', 'br']
tree = parse("\Temp\weather.xml")
stuff = tree.findall("channel/item/description")
item = stuff[1]
parsewx = BeautifulSoup(stuff[1].text)
for tag in invalid_tags:
    for match in parsewx.findAll(tag):
        match.replaceWithChildren()
print parsewx
Since XML is structured data, BeautifulSoup returns a tree of Tags.
The documentation has extensive information on how to search and navigate in that tree.
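For reference, here is a sketch of the tag-stripping step with the modern bs4 package (where replaceWithChildren is called unwrap); the inline string stands in for one description element:

```python
from bs4 import BeautifulSoup

# Stand-in for one channel/item/description payload
desc = '<b>Temperature:</b> -5&deg;C<br/><b>Wind:</b> NW 20 km/h'

parsed = BeautifulSoup(desc, 'html.parser')
for tag in ['b', 'br']:
    for match in parsed.find_all(tag):
        match.unwrap()  # bs4's name for replaceWithChildren

# The remaining strings can be collected into a list for further processing
lines = list(parsed.stripped_strings)
print(lines)
```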