Stripping (XML?) markup from a document using python

I have a file which contains the names of scientists in the following format:
<scientist_names>
<scientist>abc</scientist>
</scientist_names>
I want to use Python to extract the scientists' names from the above format. How should I do it?
I would like to use regular expressions but don't know how to use them. Please help.

DO NOT USE REGULAR EXPRESSIONS! (all reasons well explained [here])
Use an xml/html parser, take a look at BeautifulSoup.

This is XML and you should use a XML parser like lxml instead of regular expressions (because XML is not a regular language).
Here is an example:
from lxml import etree
text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print(scientist.text)
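If lxml is not installed, roughly the same thing can be sketched with the standard library's xml.etree.ElementTree. This is an alternative to the lxml call above, not a drop-in replacement: ElementTree only supports a limited subset of XPath, so the sketch uses iter() instead of xpath().

```python
import xml.etree.ElementTree as ET

text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

root = ET.fromstring(text)
# iter() walks the whole tree, so nested <scientist> elements are found too.
names = [scientist.text for scientist in root.iter("scientist")]
print(names)
```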

As noted, this appears to be xml. In that case, you should use an xml parser to parse this document; I recommend lxml ( http://lxml.de ).
Given your requirements, you may find SAX-style parsing more convenient than DOM-style parsing, because SAX parsing simply involves registering handlers that are called when the parser encounters a particular tag. This works well as long as the meaning of a tag does not depend on context and you have more than one type of tag to process (which may not be the case here).
In case your input document may be incorrectly formed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML
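To illustrate the SAX-style approach mentioned above, here is a minimal sketch using the standard library's xml.sax rather than lxml. The ScientistHandler class name and the two-name sample input are made up for illustration.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text content of every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None means we are outside a <scientist> tag

    def startElement(self, name, attrs):
        if name == "scientist":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "scientist" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None


# Hypothetical sample input with two names.
text = (b"<scientist_names><scientist>abc</scientist>"
        b"<scientist>def</scientist></scientist_names>")
handler = ScientistHandler()
xml.sax.parseString(text, handler)
print(handler.names)
```

The handler never holds the whole document in memory, which is the main reason to prefer this style for large files.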

Here is a simple example that should handle the XML tags for you:
#import library to do http requests:
import urllib2
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file if it's not on the same machine otherwise just use a path:
file = urllib2.urlopen('http://www.somedomain.com/somexmlfile.xml')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName,
#in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<scientist>','').replace('</scientist>','')
#print out the xml tag and data in this format: <tag>data</tag>
print(xmlTag)
#just print the data
print(xmlData)
If you find anything unclear just let me know
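As a variation on the minidom approach, the text can be read from the element's child text nodes instead of serializing the tag and stripping it with replace(). The sketch below assumes the XML is already in a string and uses a made-up two-name sample so that it runs without a download.

```python
from xml.dom.minidom import parseString

# Hypothetical sample input; in practice this would be the downloaded data.
data = ("<scientist_names><scientist>abc</scientist>"
        "<scientist>def</scientist></scientist_names>")
dom = parseString(data)

names = []
for node in dom.getElementsByTagName("scientist"):
    # Join the element's direct text children instead of editing the serialized tag.
    text = "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)
    names.append(text)
print(names)
```

This also handles every <scientist> element rather than only the first one.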

Related

Extract String/Value of XML Key Without Etree

I need to extract a string from an xml file, but without the use of etree.
Small part of the XML:
<key>FDisplayName</key>
<string>Dripo</string>
<key>CFBundleIdentifier</key>
<string>com.getdripo.dripo</string>
<key>DTXcode</key>
Say I wanted to extract com.getdripo.dripo, how could I do this, but without the use of etree?
I only know how to do it with etree, but in this case I cannot use it.
Couldn't find anything online, any ideas?
Using regex.
import re
s = """<key>FDisplayName</key>
<string>Dripo</string>
<key>CFBundleIdentifier</key>
<string>com.getdripo.dripo</string>
<key>DTXcode</key>"""
print(re.findall("<string>(.*?)</string>", s))  # finds all content between <string> tags
print(re.findall("<string>(com.*?)</string>", s))  # only values that start with "com"
Output:
['Dripo', 'com.getdripo.dripo']
['com.getdripo.dripo']
Note: I highly recommend using an XML parser instead.
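For a parser-based route that still avoids etree, one option is the standard library's xml.dom.minidom. The sketch below is only an illustration: the <dict> wrapper and the value_for helper are made up here, and the wrapper is needed because the fragment shown has no single root element.

```python
from xml.dom.minidom import parseString

fragment = """<key>FDisplayName</key>
<string>Dripo</string>
<key>CFBundleIdentifier</key>
<string>com.getdripo.dripo</string>
<key>DTXcode</key>"""

# The fragment has no single root element, so wrap it in a hypothetical <dict> root.
dom = parseString("<dict>" + fragment + "</dict>")


def value_for(key_name):
    """Return the text of the <string> that directly follows the given <key>, or None."""
    for key in dom.getElementsByTagName("key"):
        if key.firstChild is not None and key.firstChild.data == key_name:
            sibling = key.nextSibling
            # Skip whitespace text nodes between elements.
            while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
                sibling = sibling.nextSibling
            if sibling is not None and sibling.tagName == "string":
                return sibling.firstChild.data if sibling.firstChild else ""
    return None


print(value_for("CFBundleIdentifier"))
```

Looking the value up by its key is more robust than matching on the "com" prefix, because it does not depend on what the value happens to look like.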

Python BeautifulSoup giving different results

I am trying to parse an XML file using BeautifulSoup. Consider a sample input XML file as follows:
<DOC>
<DOCNO>1</DOCNO>
....
</DOC>
<DOC>
<DOCNO>2</DOCNO>
....
</DOC>
...
This file consists of 130 <DOC> tags. However, when I tried to parse it using BeautifulSoup's findAll function, it retrieved a seemingly random number of tags (usually between 15 and 25) but never 130. The code I used was as follows:
from bs4 import BeautifulSoup
z = open("filename").read()
soup = BeautifulSoup(z, "lxml")
print(len(soup.findAll('doc')))
#more code involving manipulation of results
Can anybody tell me what I am doing wrong? Thanks in advance!
You are telling BeautifulSoup to use the HTML parser provided by lxml. If you have an XML document, you should stick to the XML parser option:
soup = BeautifulSoup(z, 'xml')
otherwise the parser will attempt to 'repair' the XML to fit HTML rules. XML parsing in BeautifulSoup is also handled by the lxml library.
Note that XML is case sensitive so you'll need to search for the DOC element now.
For XML documents it may be that the ElementTree API offered by lxml is more productive; it supports XPath queries for example, while BeautifulSoup does not.
However, from your sample it looks like there is no one top level element; it is as if your document consists of a whole series of XML documents instead. This makes your input invalid, and a parser may just stick to only parsing the first element as the top-level document instead.
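One way to work around the missing top-level element is to wrap the text in a synthetic root before parsing. The sketch below uses the standard library's ElementTree instead of BeautifulSoup; the <ROOT> wrapper name is made up, and the approach assumes the file has no XML declaration and that each <DOC> block is itself well formed.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the file contents.
raw = """<DOC>
<DOCNO>1</DOCNO>
</DOC>
<DOC>
<DOCNO>2</DOCNO>
</DOC>"""

# <ROOT> is a made-up wrapper name; any tag not used in the data would do.
root = ET.fromstring("<ROOT>" + raw + "</ROOT>")
docs = root.findall("DOC")
print(len(docs))
print([doc.findtext("DOCNO") for doc in docs])
```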

Single out tags in an xml document?

I have what I believe to be a fairly simple issue.
I've retrieved a file from gdata, this file: https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments
I'm attempting to single out the text between the
<author>HERE</author>
tags so I'll be left with an output containing only usernames. Is Python even the best way to go about this or should I use another language? I've been googling since 8:00am (4hrs) and I've yet to find anything for such a seemingly easy task.
Best regards,
- Mitch Powell
You have an atom feed there, so I'd use feedparser to handle that:
import feedparser
result = feedparser.parse('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
for entry in result.entries:
    print(entry.author)
This prints:
FreebieFM
micromicros
FreebieFM
Sarah Grimstone
FreebieFM
# etc.
Feedparser is an external library, but easily installed. If you have to use only the standard library, you could use the ElementTree API, but to parse the Atom feed you need to include HTML entities in the parser, and you'll have to deal with namespaces (not ElementTree's strong point):
from urllib2 import urlopen
from xml.etree import ElementTree
response = urlopen('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
tree = ElementTree.parse(response)
nsmap = {'a': 'http://www.w3.org/2005/Atom'}
for author in tree.findall('.//a:author/a:name', namespaces=nsmap):
    print(author.text)
The nsmap dictionary lets ElementTree translate the a: prefix to the correct namespace for those elements.
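To see the effect of the namespace map without fetching anything, here is a small offline sketch. The two-entry feed string is made up for illustration (it only borrows two of the author names from the output above); the real feed contains many more elements.

```python
from xml.etree import ElementTree

# Made-up, minimal Atom-like document for illustration.
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><author><name>FreebieFM</name></author></entry>
  <entry><author><name>micromicros</name></author></entry>
</feed>"""

root = ElementTree.fromstring(feed)
nsmap = {'a': 'http://www.w3.org/2005/Atom'}

# With the prefix map, 'a:name' matches '{http://www.w3.org/2005/Atom}name'.
authors = [name.text for name in root.findall('.//a:author/a:name', namespaces=nsmap)]
print(authors)

# Without a namespace, the same path finds nothing.
unqualified = root.findall('.//author/name')
print(unqualified)
```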

python parse xml text [duplicate]

I would like to parse xml in python, but as a string, not taken from a file. Can someone help me do this?
From a file, you could normally do it as
from xml.dom import minidom
xmldoc = minidom.parse('~/diveintopython/common/py/kgp/binary.xml')
For a string, you can change it to
from xml.dom import minidom
xmldoc = minidom.parseString(your_string)  # your_string holds the XML text
You could use: xml.dom.minidom.parseString(text)
This method creates a StringIO object for the string and passes that on to parse().
You could also use the same technique of using StringIO for any other XML parser that expects a file-like object.
import StringIO
your_favourite_xml_parser.parse(StringIO.StringIO('<xml>...</xml>'))
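On Python 3 the StringIO module was replaced by io.StringIO. A minimal sketch of the same technique, using xml.etree.ElementTree as the stand-in for "your favourite parser" (that choice is an assumption, any parser accepting a file-like object should work similarly):

```python
import io
import xml.etree.ElementTree as ET

xml_text = '<Root id="UUID_1"><Item id="id_Item" /></Root>'

# ET.parse() expects a filename or file object, so wrap the string.
tree = ET.parse(io.StringIO(xml_text))
root = tree.getroot()
print(root.tag, root.attrib["id"])
```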
You can also use xml.etree.ElementTree (older code imports xml.etree.cElementTree, but that module was removed in Python 3.9).
import xml.etree.ElementTree as ET
aElement = ET.fromstring('<Root id="UUID_1"><Item id="id_Item" /></Root>')
See the Python documentation for the element type.
Each element has a number of properties associated with it:
a tag which is a string identifying what kind of data this element represents (the element type, in other words).
a number of attributes, stored in a Python dictionary.
a text string.
an optional tail string.
a number of child elements, stored in a Python sequence
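The properties listed above can be inspected directly on a parsed element. This is a small sketch with a made-up document that adds some text around the child so that text and tail are not empty:

```python
import xml.etree.ElementTree as ET

# Made-up sample with text before and after the child element.
root = ET.fromstring('<Root id="UUID_1">hello<Item id="id_Item" />after item</Root>')
item = root[0]

print(root.tag)      # element type
print(root.attrib)   # attributes as a dictionary
print(root.text)     # text before the first child
print(item.tail)     # text after the child's closing tag
print(len(root))     # number of child elements
```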
You can also use lxml. My startup (http://dealites.com) involves a lot of XML processing everyday. I have tried almost every xml library available in python. lxml is the best library available for xml processing.
You can also try Beautiful Soup. It is best known for HTML parsing, but it is also a good alternative to lxml for XML.
lxml example:
from lxml import etree
parsedfeed = etree.XML('your xml here')
Beautiful Soup example:
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup('your xml here')

Preventing BeautifulSoup from converting my XML tags to lowercase

I am using BeautifulStoneSoup to parse an XML document and change some attributes. I noticed that it automatically converts all XML tags to lowercase. For example, my source file has <DocData> elements, which BeautifulSoup converts to <docdata>.
This appears to be causing problems since the program I am feeding my modified XML document to does not seem to accept the lowercase versions. Is there a way to prevent this behavior in BeautifulSoup?
No, that's not a built-in option. The source is pretty straightforward, though. It looks like you want to change the value of encodedName in Tag.__str__.
Simple Answer
Change the parser from the default html.parser to the XML parser:
soup = BeautifulSoup(yourXmlStr, 'xml')
Detailed Explanation
Refer to my answer in another post.
