Parsing XML in Python with regex

I'm trying to use regex to parse an XML file (in my case this seems the simplest way).
For example a line might be:
line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
To access the text for the tag City_State, I'm using:
attr = re.match('>.*<', line)
but nothing is being returned.
Can someone point out what I'm doing wrong?

You normally don't want to use re.match. Quoting from the docs:
If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
Note:
>>> print re.match('>.*<', line)
None
>>> print re.search('>.*<', line)
<_sre.SRE_Match object at 0x10f666238>
>>> print re.search('>.*<', line).group(0)
>PLAINSBORO, NJ 08536-1906<
Also, why parse XML with regex when you can use something like BeautifulSoup :).
>>> from bs4 import BeautifulSoup as BS
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> soup = BS(line)
>>> print soup.find('city_state').text
PLAINSBORO, NJ 08536-1906

Please, just use an XML parser like ElementTree
>>> from xml.etree import ElementTree as ET
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> ET.fromstring(line).text
'PLAINSBORO, NJ 08536-1906'

re.match returns a match only if the pattern matches at the beginning of the string. To find a match anywhere in the string, use re.search.
And yes, this is a simple way to parse XML, but I would highly encourage you to use a library specifically designed for the task.
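The difference is easy to check. Below is a minimal sketch in Python 3 (the transcripts above are Python 2), using the line from the question. The capture group is an addition to the original pattern, and it assumes the element holds plain text with no nested tags:

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# re.match anchors at position 0, where the string has '<', not '>'.
assert re.match(r'>.*<', line) is None

# re.search scans forward until the pattern fits somewhere in the string.
found = re.search(r'>(.*)<', line)
city_state = found.group(1)  # group(1) drops the surrounding '>' and '<'
print(city_state)
```

This only illustrates match versus search; it is still not a general way to read XML.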

Related

A regular expression in BeautifulSoup 4

I need to find an element with a 'random' id in HTML.
My code looks like this:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html)
print soup.find(id="id_123456_name")
The 123456 part may change every time, so I need to match it with a pattern, but I can't work out how to do that.
I tried:
soup.find(id="id_%s_name" % (re.compile("\d+")) )
But it finds nothing. What's the problem?

You need to make the whole value a regular expression object:
soup.find(id=re.compile(r"id_\d+_name"))
In your version, you are still looking for a literal string, not a regular expression, because you converted the regular expression object into a string instead. The literal string has a very strange value:
>>> import re
>>> "id_%s_name" % (re.compile("\d+"))
'id_<_sre.SRE_Pattern object at 0x10f111750>_name'
This value of course is never found in your HTML document.
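The same point can be reproduced with the re module alone. This is a rough sketch in Python 3; the list of id values is hypothetical, standing in for attributes found in a page, and the use of fullmatch is a choice made for this sketch rather than a statement about how BeautifulSoup applies the pattern:

```python
import re

# Formatting a compiled pattern into a string yields its repr, not a regex.
literal = "id_%s_name" % re.compile(r"\d+")

# Compiling the whole attribute value gives a usable pattern.
id_pattern = re.compile(r"id_\d+_name")

# Hypothetical id values, standing in for attributes found in a page.
ids = ["header", "id_123456_name", "id_987_name", "id__name"]
matching = [value for value in ids if id_pattern.fullmatch(value)]

print(literal)
print(matching)
```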

Extract string using regex

How can I extract the content (how are you) from the string:
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.
Can I use a regex for this? If so, what is a suitable regex?
Note: I don't want to use the split function to extract the result. Also, can you suggest some links for a beginner learning regex?
I am using Python 2.7.2.

You could use a regular expression for this (as Joey demonstrates).
However, if your XML document is any bigger than this one-liner, a regex will not be reliable, since XML is not a regular language.
Use BeautifulSoup (or another XML parser) instead:
>>> from BeautifulSoup import BeautifulSoup
>>> xml_as_str = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. '
>>> soup = BeautifulSoup(xml_as_str)
>>> print soup.text
how are you.
Or...
>>> for string_tag in soup.findAll('string'):
...     print string_tag.text
...
how are you
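If you would rather stay in the standard library, ElementTree handles the same input. A minimal sketch in Python 3, assuming the trailing ". " from the question is not part of the XML (a strict parser rejects text after the root element):

```python
from xml.etree import ElementTree as ET

xml_as_str = ('<string xmlns="http://schemas.microsoft.com/2003/10/'
              'Serialization/">how are you</string>')

root = ET.fromstring(xml_as_str)
content = root.text

# The default namespace is folded into the tag name as {uri}string.
print(root.tag)
print(content)
```

The namespace only affects the tag name; the element text is available directly as root.text.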

Try the following regex:
/<[^>]*>(.*?)</

(?<=<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">)[^<]+(?=</string>)
would match what you want, as a trivial example.
(?<=<)[^<]+
would, too. It all depends a bit on how your input is formatted exactly.

This will match a generic HTML tag (Replace "string" with the tag you want to match):
/<string[^<]*>(.*?)<\/string>/i
(i=case insensitive)

Match "without this"

I need to remove every <p></p> pair that is the only <p> in its <td>.
How can this be done?
import re
text = """
<td><p>111</p></td>
<td><p>111</p><p>222</p></td>
"""
text = re.sub(r'<td><p>(??no</p>inside??)</p></td>', r'<td>\1</td>', text)
How can I match a span that has no </p> inside?

I would use minidom. I took the following snippet from here; you should be able to adapt it to your case:
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
Thanks @Ivo Bosticky

While using regexps with HTML is bad, matching a string that does not contain a given pattern is an interesting question in itself.
Let's assume we want to match a string that begins with an a and ends with a z, and capture whatever is in between only when the string bar does not occur inside.
Here's my take: "a((?:(?<!ba)r|[^r])+)z"
It basically says: find an a, then repeat at least once either an r that is not preceded by ba or any character other than r, then find a z. So a bar cannot sneak into the capture group.
Note that this approach uses a negative lookbehind, and Python's re module only supports lookbehind patterns of fixed length (like ba).
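Here is a quick check of that pattern in Python 3, followed by a different technique that is not from the answer above: a negative lookahead tested before every character, which is not limited to fixed-length patterns. The name only_p is hypothetical, and the second pattern assumes each <td> sits on one line with no attributes or extra whitespace, as in the question:

```python
import re

# Pattern from the answer: 'a', then a capture with no "bar" in it, then 'z'.
no_bar = re.compile(r"a((?:(?<!ba)r|[^r])+)z")

allowed = no_bar.search("a car z")   # this 'r' is not preceded by "ba"
rejected = no_bar.search("abarz")    # "bar" sits between 'a' and 'z'
print(allowed.group(1), rejected)

# Alternative: consume one character at a time, but only where "</p>"
# does not start, so the capture can never contain a closing </p>.
only_p = re.compile(r"<td><p>((?:(?!</p>).)*)</p></td>")

text = """
<td><p>111</p></td>
<td><p>111</p><p>222</p></td>
"""
cleaned = only_p.sub(r"<td>\1</td>", text)
print(cleaned)
```

Only the first cell is rewritten; the cell with two paragraphs is left alone. An HTML parser remains the more robust choice for real documents.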

I would definitely recommend using BeautifulSoup for this. It's a Python HTML/XML parser.
http://www.crummy.com/software/BeautifulSoup/

I'm not quite sure why you want to remove the P tags which don't have closing tags.
However, if this is an attempt to clean up the markup, an advantage of BeautifulSoup is that it can clean HTML for you:
from BeautifulSoup import BeautifulSoup
html = """
<td><p>111</td>
<td><p>111<p>222</p></td>
"""
soup = BeautifulSoup(html)
print soup.prettify()
This doesn't get rid of your unmatched tags, but it adds the missing closing ones.

How to regex in python?

I am trying to parse the keywords from google suggest, this is the url:
http://google.com/complete/search?output=toolbar&q=test
I've done it with php using:
'|<CompleteSuggestion><suggestion data="(.*?)"/><num_queries int="(.*?)"/></CompleteSuggestion>|is'
But that won't work with Python's re.match(pattern, string). I tried a few variations, but some raise an error and some return None.
How can I parse that info? I don't want to use minidom because I think regex will be less code.

You could use etree:
>>> from xml.etree.ElementTree import XMLParser
>>> x = XMLParser()
>>> x.feed('<toplevel><CompleteSuggestion><suggestion data=...')
>>> tree = x.close()
>>> [(e.find('suggestion').get('data'), int(e.find('num_queries').get('int')))
...  for e in tree.findall('CompleteSuggestion')]
[('test internet speed', 31800000), ('test', 686000000), ...]
It is more code than a regex, but it also does more. Specifically, it will fetch the entire list of matches in one go, and unescape any weird stuff like double-quotes in the data attribute. It also won't get confused if additional elements start appearing in the XML.
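To make the comparison concrete, here is a small Python 3 sketch run against a hypothetical sample shaped like the response described in the question (the real service output may differ). The regex is the question's pattern with the PHP delimiters and trailing flags replaced by re flags, which is likely why the original failed in Python:

```python
import re
from xml.etree import ElementTree as ET

# Hypothetical sample; the second entry has escaped double quotes in it.
sample = (
    '<toplevel>'
    '<CompleteSuggestion><suggestion data="test"/>'
    '<num_queries int="686000000"/></CompleteSuggestion>'
    '<CompleteSuggestion><suggestion data="test &quot;speed&quot;"/>'
    '<num_queries int="31800000"/></CompleteSuggestion>'
    '</toplevel>'
)

# Parser version: attribute values come back unescaped.
tree = ET.fromstring(sample)
suggestions = [
    (item.find('suggestion').get('data'), int(item.find('num_queries').get('int')))
    for item in tree.findall('CompleteSuggestion')
]

# Regex version: PHP's |...|is becomes re.IGNORECASE | re.DOTALL.
pattern = re.compile(
    r'<CompleteSuggestion><suggestion data="(.*?)"/>'
    r'<num_queries int="(.*?)"/></CompleteSuggestion>',
    re.IGNORECASE | re.DOTALL,
)
raw_pairs = pattern.findall(sample)

print(suggestions)
print(raw_pairs)
```

The regex result keeps the &quot; entities and returns the counts as strings, which is the extra work the parser does for you.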

See "RegEx match open tags except XHTML self-contained tags".
This is an XML document. Please consider an XML parser. It will be more robust and will probably take you less time in the end, even if it is more code.

Extracting some HTML tag values in Python

How do I get the value of a nested <b> HTML tag in Python using regular expressions?
<b>LG</b> X110
# => LG X110

You don't.
Regular Expressions are not well suited to deal with the nested structure of HTML. Use an HTML parser instead.

Don't use regular expressions for parsing HTML. Use an HTML parser like BeautifulSoup. Just look how easy it is:
from BeautifulSoup import BeautifulSoup
html = r'<b>LG</b> X110'
soup = BeautifulSoup(html)
print ''.join(soup.findAll(text=True))
# LG X110
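If installing BeautifulSoup is not an option, the standard library's html.parser can do the same job. A minimal sketch in Python 3; TextCollector and visible_text are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of a document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return ''.join(parser.chunks)


result = visible_text('<b>LG</b> X110')
print(result)
```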

Your question was very hard to understand, but from the given output example it looks like you want to strip everything within < and > from the input text. That can be done like so:
import re
input_text = '<a bob>i <b>c</b></a>'
output_text = re.sub('<[^>]*>', '', input_text)
print output_text
Which gives you:
i c
If that is not what you want, please clarify.
Please note that the regular expression approach for parsing XML is very brittle. For instance, the above example would break on the input <a name="b>c">hey</a>. (> is a valid character in an attribute value: see XML specs)
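The failure is easy to reproduce. A short Python 3 sketch comparing the tag-stripping regex with ElementTree on that input:

```python
import re
from xml.etree import ElementTree as ET

tricky = '<a name="b>c">hey</a>'

# The regex ends the "tag" at the first '>', inside the attribute value.
stripped = re.sub(r'<[^>]*>', '', tricky)

# A parser knows where the attribute value ends.
parsed = ''.join(ET.fromstring(tricky).itertext())

print(repr(stripped))
print(repr(parsed))
```

The regex leaves part of the attribute behind, while the parser returns only the element text.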

Try this...
<a.*<b>(.*)</b>(.*)</a>
$1 and $2 should be what you want, or whatever means Python has for printing captured groups.

+1 for Jens's answer. lxml is a good library you can use to actually parse this in a robust fashion. If you'd prefer something in the standard library, you can use sax, dom or ElementTree.
