Beautiful Soup parsing an XML file - python

I am writing a simple Python script using Beautiful Soup to parse the data I need out of an XML file. It's working how I need it to, but I have one question for you guys, as I have tried to Google this but can't seem to find what I am looking for.
Sample of XML string:
<ProductAttribute MaintenanceType="C" AttributeID="Attachment Type" PADBAttribute="N" RecordNumber="1" LanguageCode="EN">Clamp-On</ProductAttribute>
I need the AttributeID within the ProductAttribute. With the code below I am able to grab the value "Clamp-On", but I need AttributeID to tell me what Clamp-On is referencing.
attributes[part.find('PartNumber').get_text()] = [x.get_text() for x in part.find_all('ProductAttribute')]
for key, value in attributes.items():
    for v in value:
        print(v)
Any guidance is appreciated before negative feedback. Thanks!

Simple solution using only the lxml library:
from lxml import etree
xml_string = """<ProductAttribute MaintenanceType="C" AttributeID="Attachment Type" PADBAttribute="N" RecordNumber="1" LanguageCode="EN">Clamp-On</ProductAttribute>"""
xml = etree.XML(xml_string)
print(xml.get("AttributeID"))
Output:
Attachment Type
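To also see what the attribute describes, the element's text can be printed alongside it (same lxml element as above):
print(xml.get("AttributeID"), xml.text)  # Attachment Type Clamp-On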

Here is how you can get a tag attribute from an XML string using BeautifulSoup and lxml:
from bs4 import BeautifulSoup
xml_string = '<ProductAttribute MaintenanceType="C" AttributeID="Attachment Type" PADBAttribute="N" RecordNumber="1" LanguageCode="EN">Clamp-On</ProductAttribute>'
soup = BeautifulSoup(xml_string, 'xml')
tag = soup.ProductAttribute
print(tag['AttributeID'])
This code prints the value of the AttributeID attribute.
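Applied to the loop in the question, a small sketch that pairs each AttributeID with its text (assuming part is the same tag the original code iterates over):
for x in part.find_all('ProductAttribute'):
    print(x['AttributeID'], x.get_text())  # e.g. Attachment Type -> Clamp-On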

Related

Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags

I'm new to working with XML and BeautifulSoup and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy) but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get the text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the Beautiful Soup documentation.
You can search for the field tag in lowercase, and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
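From there, the text values can be pulled out and paired up; a small usage sketch, assuming the two result lists come back in matching order:
nct_ids = [field.get_text() for field in m1_nctid]
official_titles = [field.get_text() for field in m1_officialtitle]
for nct_id, title in zip(nct_ids, official_titles):
    print(nct_id, title)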

How do I access the inner tag with BeautifulSoup?

<wd:Employee_Name wd:Descriptor="John Doe"><wd:ID wd:type="WID">09300cd006150</wd:ID></wd:Employee_Name>
I would like to get John Doe. Even though it seems simple, I am struggling, so I am posting here.
soup.find_all('Employee_Name')[0].text
gives me 09300cd006150
Thank you very much for your help
As wd:Descriptor is an attribute, the get() method should be used to retrieve it. (.text concatenates the text of all the tag's descendants, which is why you got the ID instead:)
from bs4 import BeautifulSoup

xml = '''<root xmlns:wd="http://wd">
<wd:Employee_Name wd:Descriptor="John Doe">
<wd:ID wd:type="WID">09300cd006150</wd:ID>
</wd:Employee_Name></root>'''

soup = BeautifulSoup(xml, 'xml')
name = soup.find_all('Employee_Name')[0]
print(name.get('wd:Descriptor'))
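Output:
John Doe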

Scraping links in Pattern library for Python

I found code similar to this in a course I was taking. This code gets all of the links of a certain format that are mentioned in the source code of the webpage. I understand everything, except for the last line. The last line says the following:
print link.attrs.get('href', '')
This works; however, I'm unsure how the instructor figured out how to do this. I've looked through the documentation and I can't figure out what .get does. Could someone please let me know how I can find this information?
Documentation for Pattern Library: http://www.clips.ua.ac.be/pages/pattern-web
import requests
from pattern import web

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
pattern = 'http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html'
dom = web.Element(xml)
all_links = dom.by_tag('a')
for link in all_links:
    print link.attrs.get('href', '')
It would get all the href "hyperlinks" on that page. link.attrs is a dictionary of the tag's attributes, so .get('href', '') is the standard dict.get: it returns the value stored under 'href', or the default '' when the key is missing. You can also use the BeautifulSoup package, which is more convenient:
import requests
from bs4 import BeautifulSoup

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
soup = BeautifulSoup(xml, "lxml")  # lxml is just the parser for reading the HTML
links = soup.find_all('a', href=True)  # this is the line that does what you want: every <a> with an href
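If you only want the links matching the glob pattern from the course code, the standard-library fnmatch module can filter the extracted hrefs. A sketch (fnmatch is my own addition here, not something from the course):
from fnmatch import fnmatch

pattern = 'http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html'
# keep only the hrefs that match the ?/* glob pattern
poll_links = [link['href'] for link in links if fnmatch(link['href'], pattern)]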

Python BS4 with SDMX

I would like to retrieve the data given in an SDMX file (like https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried to use BeautifulSoup, but it seems it does not see the tags. The code is below:
import urllib2
from bs4 import BeautifulSoup
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")
which gives me an empty object.
Is BS4 the wrong tool, or (more likely) what am I doing wrong?
Thanks in advance
soup.findAll("bbk:series") would return the result.
In fact, in this case, even if you use lxml as the parser, BeautifulSoup still parses the document as HTML. Since HTML tags are case insensitive, BeautifulSoup downcases all the tags, which is why soup.findAll("bbk:series") works. See Other parser problems in the official docs.
If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, since lxml is the only XML parser BeautifulSoup has. Now you can use ts_series = soup.findAll("Series") to get the result, as BeautifulSoup will strip the namespace prefix bbk.
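A minimal sketch of the XML variant, using the same html_source as in the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, 'xml')  # parse as XML, so tag case is preserved
ts_series = soup.findAll('Series')        # the bbk: prefix is stripped, so search for Series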

Python - getting information from nodes

I've been trying to get information from a site, and recently found out that it is stored in childNodes[0].data.
I'm pretty new to python and never tried scripting against websites.
Somebody told me I could make a tmp.xml file and extract the information from there, but as it's only getting the source code (which I think is of no use to me), I don't get any results.
Current code:
import urllib2
from xml.dom.minidom import parse

response = urllib2.urlopen(get_link)
html = response.read()
with open("tmp.xml", "w") as f:
    f.write(html)
dom = parse("tmp.xml")
name = dom.getElementsByTagName("name[0].firstChild.nodeValue")
I've also tried using 'dom = parse(html)' without better result.
getElementsByTagName() takes an element name, not an expression. It is highly unlikely that the page you are loading contains <name[0].firstChild.nodeValue> tags.
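For example, with xml.dom.minidom the lookup and the navigation are two separate steps; a sketch, assuming the document really contains <name> elements:
from xml.dom.minidom import parse

dom = parse("tmp.xml")
# getElementsByTagName() takes only the tag name; indexing and
# .firstChild.nodeValue are applied afterwards, to the returned elements
name = dom.getElementsByTagName("name")[0].firstChild.nodeValue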
If you are loading HTML, use an HTML parser instead, like BeautifulSoup. For XML, using the ElementTree API is a lot easier than using the (archaic and very verbose) DOM API.
Neither approach requires that you first save the source to disk, both APIs can parse directly from the response object returned by urllib2.
# HTML
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen(get_link)
soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))
print soup.find('title').text
or
# XML
import urllib2
from xml.etree import ElementTree as ET
response = urllib2.urlopen(get_link)
tree = ET.parse(response)
print tree.find('elementname').text
