How to extract text from an XML error message in Python

I want to determine whether a response parsed with BeautifulSoup looks like this:
Out[32]:
<?xml version="1.0" encoding="utf-8"?>
<boardgames termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
<boardgame>
<error message="Item not found"/>
</boardgame>
</boardgames>
I can extract the center of the previous output using:
soup.find_all('boardgame')[0], which produces the following:
Out[24]:
<boardgame>
<error message="Item not found"/>
</boardgame>
I feel like this should be easy. I've tried the following, but I still can't determine whether the error message="Item not found" is in there. What am I missing here?
soup.findAll('boardgame')[0].getText()
Out[26]: '\n\n'

The message is an attribute of the error tag, not text content, which is why getText() returns only whitespace. Find the error tag first, then read its message attribute:
from bs4 import BeautifulSoup
data='''<?xml version="1.0" encoding="utf-8"?>
<boardgames termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
<boardgame>
<error message="Item not found"/>
</boardgame>
</boardgames>'''
soup=BeautifulSoup(data,'html.parser')
message=soup.find('boardgame').find('error')['message']
print(message)
Output:
Item not found
Or you can use a CSS selector:
from bs4 import BeautifulSoup
data='''<?xml version="1.0" encoding="utf-8"?>
<boardgames termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
<boardgame>
<error message="Item not found"/>
</boardgame>
</boardgames>'''
soup=BeautifulSoup(data,'html.parser')
message=soup.select_one('boardgame error')['message']
print(message)
Output:
Item not found
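To answer the original "is the error in there?" question directly, the lookup can be wrapped in a small helper that returns None when no error element is present. The following is only a sketch: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree, and the function name get_error_message is a hypothetical name chosen for illustration.

```python
import xml.etree.ElementTree as ET


def get_error_message(xml_text):
    """Return the message attribute of the first <error> element, or None."""
    root = ET.fromstring(xml_text)
    # ".//error" searches all descendants of the root element.
    error = root.find(".//error")
    return None if error is None else error.get("message")


sample = '''<?xml version="1.0" encoding="utf-8"?>
<boardgames termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
<boardgame>
<error message="Item not found"/>
</boardgame>
</boardgames>'''

message = get_error_message(sample)
if message is not None:
    print("API returned an error:", message)
```

A response without an error element makes the helper return None, so the caller can branch on that.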

Related

Convert XML to python dictionary

I am sending a request to an API:
import requests as rq
resp_tf = rq.get("https://api.t....")
tf_text = resp_tf.text
Which prints:
<?xml version="1.0" encoding="UTF-8"?>
<flowSegmentData version="traffic-service 4.0.011">
<frc>FRC0</frc>
<currentSpeed>78</currentSpeed>
<freeFlowSpeed>78</freeFlowSpeed>
<currentTravelTime>19</currentTravelTime>
<freeFlowTravelTime>19</freeFlowTravelTime>
<confidence>0.980000</confidence>
<roadClosure>false</roadClosure>
<coordinates>
<coordinate>
.....
Now, how can I get the values of the tags, for example "currentSpeed"?
This can be done using the BeautifulSoup module:
Search for the tag by its name using the find_all() method. Note that html.parser lowercases tag names, which is why the search below uses "currentspeed".
Create a dictionary where the key is the name of the tag found, and the value is the text of the tag.
from bs4 import BeautifulSoup
xml = """<?xml version="1.0" encoding="UTF-8"?>
<flowSegmentData version="traffic-service 4.0.011">
<frc>FRC0</frc>
<currentSpeed>78</currentSpeed>
<freeFlowSpeed>78</freeFlowSpeed>
<currentTravelTime>19</currentTravelTime>
<freeFlowTravelTime>19</freeFlowTravelTime>
<confidence>0.980000</confidence>
<roadClosure>false</roadClosure>
<coordinates>
<coordinate>"""
soup = BeautifulSoup(xml, "html.parser")
print({tag.name: tag.text for tag in soup.find_all("currentspeed")})
Output:
{'currentspeed': '78'}
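If the original tag capitalization matters, a strict XML parser keeps it. The sketch below uses the standard library's xml.etree.ElementTree rather than BeautifulSoup and collects every leaf child of the root into a dictionary. It assumes a complete, well-formed document, so the sample is shortened and closed by hand here; the truncated response above would not parse as-is.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-closed version of the sample response (an assumption:
# the real response continues with <coordinates> and further elements).
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<flowSegmentData version="traffic-service 4.0.011">
<frc>FRC0</frc>
<currentSpeed>78</currentSpeed>
<freeFlowSpeed>78</freeFlowSpeed>
<roadClosure>false</roadClosure>
</flowSegmentData>"""

root = ET.fromstring(xml_text)
# Keep only children that have no child elements of their own.
values = {child.tag: child.text for child in root if len(child) == 0}
print(values["currentSpeed"])
```

All values are strings at this point; converting "78" to an int or "false" to a bool is left to the caller.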

Remove unwanted tags from XML file

I am working on an XML file that contains SOAP tags. I want to remove those SOAP tags as part of an XML cleanup process.
How can I achieve this in either Python or Scala? A shell script should not be used.
Sample Input :
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>
Expected Output :
<?xml version="1.0" encoding="UTF-8"?>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
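This could help you! Note that soap is a namespace prefix, not an element name, so it has to be mapped to its URI in the query. Rather than deleting the envelope (which would also delete its children), select the first child of soap:Body and serialize it on its own. Namespace declarations inherited from the envelope may still appear in the output.
from lxml import etree
doc = etree.parse('test.xml')
# The soap prefix is bound to this URI in the sample input.
ns = {'soap': 'http://sample.com/'}
body = doc.find('.//soap:Body', namespaces=ns)
payload = body[0]  # first child of soap:Body, i.e. com:RESPONSE
print(etree.tostring(payload, xml_declaration=True, encoding='UTF-8').decode())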
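For comparison, here is a sketch of the same unwrapping step with only the standard library (xml.etree.ElementTree instead of lxml). It relies on the sample input, where both the soap and com prefixes are bound to http://sample.com/, and it reads the XML from a string rather than from test.xml. The call to register_namespace is there so the serializer writes the com prefix instead of an auto-generated one; it changes a module-wide setting.

```python
import xml.etree.ElementTree as ET

soap_text = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>"""

# Ask the serializer to use "com" for this URI (module-wide setting).
ET.register_namespace("com", "http://sample.com/")

root = ET.fromstring(soap_text)                # the soap:Envelope element
body = root.find("{http://sample.com/}Body")   # namespace-qualified lookup
payload = body[0]                              # first child of soap:Body
payload.tail = None                            # drop whitespace after the payload

result = ET.tostring(payload, encoding="unicode")
print(result)
```

With encoding="unicode" the result is a str without an XML declaration, so the declaration has to be added separately if the output file needs one.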

Parse XML in Python

I am getting this XML response. Can anybody help me get the token from the XML tags?
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"><s:Body><LoginResponse xmlns="http://videoos.net/2/XProtectCSServerCommand"><LoginResult xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><RegistrationTime>2018-09-06T07:30:38.4571763Z</RegistrationTime><TimeToLive><MicroSeconds>3600000000</MicroSeconds></TimeToLive><TimeToLiveLimited>false</TimeToLiveLimited><Token>TOKEN#xxxxx#</Token></LoginResult></LoginResponse></s:Body></s:Envelope>
I have it as a string.
I tried lxml and other libraries such as ElementTree (ET), but wasn't able to extract the Token field.
Update: here is the same XML, formatted for readability.
<?xml version="1.0" encoding="utf-8"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<LoginResponse xmlns="http://videoos.net/2/XProtectCSServerCommand">
<LoginResult xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<RegistrationTime>2018-09-06T07:30:38.4571763Z</RegistrationTime>
<TimeToLive>
<MicroSeconds>3600000000</MicroSeconds>
</TimeToLive>
<TimeToLiveLimited>false</TimeToLiveLimited>
<Token>TOKEN#xxxxx#</Token>
</LoginResult>
</LoginResponse>
</s:Body>
</s:Envelope>
text = """
<?xml version="1.0" encoding="utf-8"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<LoginResponse xmlns="http://videoos.net/2/XProtectCSServerCommand">
<LoginResult xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<RegistrationTime>2018-09-06T07:30:38.4571763Z</RegistrationTime>
<TimeToLive>
<MicroSeconds>3600000000</MicroSeconds>
</TimeToLive>
<TimeToLiveLimited>false</TimeToLiveLimited>
<Token>TOKEN#xxxxx#</Token>
</LoginResult>
</LoginResponse>
</s:Body>
</s:Envelope>
"""
from bs4 import BeautifulSoup
parser = BeautifulSoup(text,'xml')
for item in parser.find_all('Token'):
    print(item.text)
Using lxml
Demo:
x = '''<?xml version="1.0" encoding="utf-8"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<LoginResponse xmlns="http://videoos.net/2/XProtectCSServerCommand">
<LoginResult xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<RegistrationTime>2018-09-06T07:30:38.4571763Z</RegistrationTime>
<TimeToLive>
<MicroSeconds>3600000000</MicroSeconds>
</TimeToLive>
<TimeToLiveLimited>false</TimeToLiveLimited>
<Token>TOKEN#xxxxx#</Token>
</LoginResult>
</LoginResponse>
</s:Body>
</s:Envelope>'''
from lxml import etree
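# Pass bytes: on Python 3, lxml rejects a str that carries an encoding declaration.
xmltree = etree.fromstring(x.encode('utf-8'))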
namespaces = {'content': "http://videoos.net/2/XProtectCSServerCommand"}
items = xmltree.xpath('//content:Token/text()', namespaces=namespaces)
print(items)
Output:
['TOKEN#xxxxx#']
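The same namespace-aware lookup also works with the standard library, which may explain why earlier ElementTree attempts failed: Token inherits the default namespace declared on LoginResponse, so a plain search for "Token" finds nothing. A minimal sketch, where the prefix x is an arbitrary label chosen for this example:

```python
import xml.etree.ElementTree as ET

response = '<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"><s:Body><LoginResponse xmlns="http://videoos.net/2/XProtectCSServerCommand"><LoginResult xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><RegistrationTime>2018-09-06T07:30:38.4571763Z</RegistrationTime><TimeToLive><MicroSeconds>3600000000</MicroSeconds></TimeToLive><TimeToLiveLimited>false</TimeToLiveLimited><Token>TOKEN#xxxxx#</Token></LoginResult></LoginResponse></s:Body></s:Envelope>'

# Map a prefix of our choosing to the default namespace of LoginResponse.
ns = {"x": "http://videoos.net/2/XProtectCSServerCommand"}

root = ET.fromstring(response)
token = root.findtext(".//x:Token", namespaces=ns)
unqualified = root.findtext(".//Token")  # no namespace given, so no match

print(token)
```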

BeautifulSoup does not parse content of the tag like first-name that contains '-'

Hi, I have a response like the one below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person>
<first-name>hede</first-name>
<last-name>hodo</last-name>
<headline>Python Developer at hede</headline>
<site-standard-profile-request>
<url>http://www.linkedin.com/profile/view?id=hede&authType=godasd*</url>
</site-standard-profile-request>
</person>
I want to parse the content returned from the LinkedIn API.
I am using BeautifulSoup like this:
ipdb> hede = BeautifulSoup(response.content)
ipdb> hede.person.headline
<headline>Python Developer at hede</headline>
But when I do
ipdb> hede.person.first-name
*** NameError: name 'name' is not defined
Any ideas ?
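Python attribute names cannot contain a hyphen, so hede.person.first-name is parsed as the subtraction hede.person.first - name, which is why the NameError complains about name.
Instead use
hede.person.find('first-name')
Also, to parse XML with BeautifulSoup, pass the 'xml' parser, which requires lxml to be installed:
hede = BeautifulSoup(response.content, 'xml')
Note that 'lxml' selects lxml's HTML parser, not its XML parser:
hede = BeautifulSoup(response.content, 'lxml')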
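The general point is that any API taking the tag name as a string is unaffected by the hyphen. As an illustration only, here is the same lookup with the standard library's xml.etree.ElementTree instead of BeautifulSoup. The sample is trimmed: the url element is left out because its unescaped & would be rejected by a strict XML parser.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample response (the <url> element is omitted).
person_xml = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person>
<first-name>hede</first-name>
<last-name>hodo</last-name>
<headline>Python Developer at hede</headline>
</person>"""

person = ET.fromstring(person_xml)
# Tag names are passed as strings, so the hyphen is not a problem.
first_name = person.findtext("first-name")
last_name = person.findtext("last-name")
print(first_name, last_name)
```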

Python and XML Processing

I have used urllib to get the following data:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<videos xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:www="http://www.www.com"">
<video type="cl">
<cd>
<src lang="music">http://www.google.com/ </src>
</cd>
</video>
</videos>
I want to get http://www.google.com/ out, here is my code:
import xml.etree.ElementTree as etree
data='<?xml version="1.0" encoding="UTF-8" standalone="yes"?><videos xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:www="http://www.www.com""><video type="cl"><cd><src lang="music">http://www.google.com/ </src></cd></video></videos>'
tree = etree.fromstring(data)
geturl=tree.findtext('/video/cd/src').strip()
print geturl
I get error:
AttributeError: 'NoneType' object has no attribute 'strip'
Obviously, findtext failed. I also tried findtext('src'), which doesn't work either.
What's wrong?
Remove the first forward-slash from the path: video/cd/src:
import xml.etree.ElementTree as etree
data='''<?xml version="1.0" encoding="UTF-8" standalone="yes"?><videos xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:www="http://www.www.com"><video type="cl"><cd><src lang="music">http://www.google.com/ </src></cd></video></videos>'''
tree = etree.fromstring(data)
geturl=tree.findtext('video/cd/src').strip()
print(geturl)
yields
http://www.google.com/
The forward-slash indicates an absolute path, which is not allowed on elements.
PS. There is also a syntax error in the data you posted: xmlns:www="http://www.www.com"" has two double-quotes at the end...
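The path rules can be checked side by side. In this sketch (standard library only, using the corrected data string), a path relative to the root element works, a descendant search with .// works, and the absolute path is rejected. As far as I know, current Python 3 versions raise SyntaxError for an absolute path on an element rather than returning None, so the example handles both outcomes.

```python
import xml.etree.ElementTree as ET

data = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?><videos xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:www="http://www.www.com"><video type="cl"><cd><src lang="music">http://www.google.com/ </src></cd></video></videos>'''

root = ET.fromstring(data)  # root is the <videos> element

# Relative to the root element: child video, then cd, then src.
relative = root.findtext("video/cd/src").strip()

# Descendant search: finds src at any depth below the root.
descendant = root.findtext(".//src").strip()

# Absolute path on an element: not allowed.
try:
    absolute = root.findtext("/video/cd/src")
except SyntaxError:
    absolute = None

print(relative, descendant, absolute)
```

This also explains why findtext('src') fails: a bare tag name only matches direct children of the root, and src is three levels down.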
