Skipping XML elements using Regular Expressions in Python 3

I have an XML document where I wish to extract certain text contained in specific tags such as-
<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>
<bdy>
some text
</bdy>
In this toy example, I want to extract all the text contained in the <title> tags by using the following regular expression code in Python 3:
# Python 3 code using re
import re

file = open("some_xml_file.xml", "r")
xml_doc = file.read()
file.close()

title_text = re.findall(r'<title>.+</title>', xml_doc)
if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")
It gives me the text within the XML tags ALONG with the tags. An example of a single output would be-
<title>Four-minute warning</title>
My question is, how should I frame the pattern within the re.findall() or re.search() methods so that the <title> and </title> tags are skipped and all I get is the text between them?
Thanks for your help!

Just use a capture group in your regex (re.findall() takes care of the rest in this case). For example:
import re
s = '<title>Four-minute warning</title>'
title_text = re.findall(r'<title>(.+)</title>', s)
print(title_text[0])
# OUTPUT
# Four-minute warning
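The same capture-group trick applies to the other tags in the question's snippet; for example, pulling every <category> value out at once:

```python
import re

xml_doc = """<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
</categories>"""

# the parentheses capture only the text between the tags;
# .+? is non-greedy so each match stops at the first closing tag
categories = re.findall(r'<category>(.+?)</category>', xml_doc)
print(categories)
# -> ['Nuclear warfare', 'Cold War']
```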

Related

How to grab accurate titles from web pages without including site data

I found this link [and a few others], which talks a bit about BeautifulSoup for reading HTML. It mostly does what I want: it grabs the title of a webpage.
import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        contents = BeautifulSoup(html)
        title = contents.title.string
        return title
    return None
The issue that I run into is that sometimes articles will come back with metadata attached at the end with " - some_data". A good example is this link to a BBC Sport article which reports the title as
Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport
I could do something simple like cut off anything after the last ' - ' separator
title = title.rsplit(' - ', 1)[0]
But that assumes that any metadata comes after a "-" value. I don't want to assume that there will never be an article whose title ends in " - part_of_title"
I found the Newspaper3k library but it's definitely more than I need - all I need is to grab a title and ensure that it's the same as what the user posted. My friend who pointed me to Newspaper3k also mentioned it could be buggy and didn't always find titles correctly, so I would be inclined to use something else if possible.
My current thought is to continue using BeautifulSoup and just add on fuzzywuzzy which would honestly also help with slight misspellings or punctuation differences. But, I would certainly prefer to start from a place that included comparing against accurate titles.
Here is how reddit handles getting title data.
https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255
def extract_title(data):
    """Try to extract the page title from a string of HTML.

    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head

    title = None

    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")

    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string

        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        # left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)

        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]

    if not title:
        return

    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

    return title.encode('utf-8').strip()
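The trimming heuristic is the part that answers the " - BBC Sport" problem, and it can be exercised on its own. Below is a minimal Python 3 sketch of that step; the trim_site_name helper name is hypothetical:

```python
import re

def trim_site_name(title):
    """Drop a trailing ' - Site Name' style suffix: find the *last*
    spaced delimiter by searching the reversed string, and only trim
    if doing so removes less than half the title."""
    delimiters = '\u00ab\u00bb\u2013\u2014|-'  # «, », endash, emdash, |, -
    m = re.search(r'\s[' + delimiters + r']\s', title[::-1])
    if m and m.end() < len(title) / 2:
        return title[:-m.end()]
    return title

print(trim_site_name('Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport'))
# -> Jack Charlton: 1966 England World Cup winner dies aged 85
```

Because the trim only fires when it removes under half the string, a short title such as 'A - B' is left alone.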

Selecting certain XML tags with criterion matching other tags

I have an XML file with a structure like the following:
<text>
<dialogue>
<pattern>
We're having a {nice|great} time.
</pattern>
<criterion>
<!-- match this tag, get the above pattern -->
average_person, tourist, delighted
</criterion>
</dialogue>
<pattern>
The service {here stinks|is terrible}!
</pattern>
<criterion>
tourist, disgruntled, average_person
</criterion>
<dialogue>
<pattern>
They have {smoothies|funny hats}. Neat!
</pattern>
<criterion>
tourist, smoothie_enthusiast
</criterion>
</dialogue>
<dialogue>
<pattern>
I wonder how {expensive|valuable} these resort tickets are?
</pattern>
<criterion>
merchant, average_person
</criterion>
</dialogue>
</text>
What I would like to do is go through the dialogue tags, look at the criterion tag, and match a list of words. If they match, I would then like to use the pattern in that dialogue tag. I'm using Python for this task.
What I'm currently doing is walking through the tags by utilizing an lxml "etree" which looks like this:
from lxml import etree

tree = etree.parse('tourists.xml')
root = tree.getroot()
g = 0
for i in root.iterfind('dialogue/criterion'):
    a = i.text.split(',')
    # The "personality" variable has a value like "delighted" or "disgruntled".
    # "tags_to_match" are the criterion that we want to, well, match. It may
    # have criterion like "merchant", "tourist", or "delighted".
    # When the tags match (i.e. the "match_tags" function returns true), it
    # appends the pattern to the "tourist_patterns" list.
    if personality != 'average_person' and match_tags(tags_to_match, a):
        tourist_patterns.append(root[g][0].text)
    g += 1

# When we don't have a match, we just go with the "average_person" tag.
if len(tourist_patterns) == 0:
    # Go through the tags again, choosing the ones that match the
    # 'average_person' personality and put it in the "tourist_patterns" list.
I then go through the elements in the "tourist_patterns" list and pluck out what I want.
I'm trying to simplify this. How can I go through the tags, match the text I want in the criterion tags, and then take the pattern in the pattern tags? I've also been trying to set a default for when the criterion isn't matched (hence the "average_person" personality criterion).
Edit: Some commentators asked for the list of what to match. Basically, I would want it to match some or all of the words in the criterion tags, and it would give the text in the pattern tag underneath that dialogue tag. So if I was looking for "tourist" and "smoothie_enthusiast", it would get one match in my XML example. I would then like to get the pattern tag text "They have {smoothies|funny hats}. Neat!". If that fails to match any of the words in criterion tags, I would just try to match "average_person" and "tourist".
In turn, tourist_patterns would look like this when it matches:
>>> tourist_pattern
['They have {smoothies|funny hats}. Neat!']
And when it doesn't match, it would match this:
>>> tourist_pattern
['They have {smoothies|funny hats}. Neat!', 'The service {here stinks|is terrible}!']
Hope that clears things up.
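One way to simplify this (a sketch, not the poster's exact code): iterate over the dialogue elements themselves, so each criterion stays paired with its sibling pattern instead of being indexed back into root with a counter. This uses the stdlib xml.etree.ElementTree, whose iterfind/findtext calls behave the same way in lxml's etree; the "any shared tag counts as a match" rule is an assumption standing in for match_tags:

```python
import xml.etree.ElementTree as ET

xml = """<text>
<dialogue>
  <pattern>They have {smoothies|funny hats}. Neat!</pattern>
  <criterion>tourist, smoothie_enthusiast</criterion>
</dialogue>
<dialogue>
  <pattern>The service {here stinks|is terrible}!</pattern>
  <criterion>tourist, disgruntled, average_person</criterion>
</dialogue>
</text>"""

def matching_patterns(root, wanted):
    """Collect the pattern text of every dialogue whose criterion
    shares at least one tag with the wanted list."""
    found = []
    for dialogue in root.iterfind('dialogue'):
        tags = {t.strip() for t in dialogue.findtext('criterion', '').split(',')}
        if tags & set(wanted):
            found.append(dialogue.findtext('pattern').strip())
    return found

root = ET.fromstring(xml)
patterns = matching_patterns(root, ['smoothie_enthusiast'])
# fall back to the default personality when nothing matched
if not patterns:
    patterns = matching_patterns(root, ['average_person', 'tourist'])
print(patterns)
# -> ['They have {smoothies|funny hats}. Neat!']
```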

Remove items in string paragraph if they belong to a list of strings?

import urllib2, sys
from bs4 import BeautifulSoup, NavigableString

obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>', '</p>', '</div>', '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char, '')
print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches off of the above website. Now, I need to replace some residual HTML in an efficient manner. I've stored a list of elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but am getting the error: TypeError: expected a character object buffer. It's a beginner question, I know, but how can I get around this?
Since you are using BeautifulSoup already, you can directly use obama_4427_div.text instead of str(obama_4427_div) to get the correctly formatted text. The text you get then would not contain any residual HTML elements.
Example -
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness: to remove elements from a string, I would create a list of the substrings to remove (like the remove_char list you have created) and then call str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>', '</p>', '</div>', '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char, '')
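A tiny self-contained run of that loop, using a hypothetical snippet standing in for the scraped div:

```python
# hypothetical stand-in for str(obama_4427_div)
snippet = '<div class="indent" id="transcript"><h2>Transcript</h2><p>Four score<br/></p></div>'
remove_char = ['<br/>', '</p>', '</div>', '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']

# strip each unwanted substring in turn
for char in remove_char:
    snippet = snippet.replace(char, '')
print(snippet)
# -> TranscriptFour score
```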

Creating a regular expression for parsing IUPAC organic compound names

I am trying to create a parser in my free time that can parse out all the functional groups from the name of an organic compound. Side by side, I am also trying to make a display program that can read data from files to draw a visual representation of the compound on screen. Both are being done in Python.
Right now the displayer is using a coordinate system to store the positions of atoms, which is why I am making the parser.
Here's the code so far:
import re
main_pattern = r"(.*)(meth|eth|prop|but|pent|hex|hept|oct|non|dec|isodec|dodec)-?([,?\d+,?]*)?-?(di|tri|tetra|penta)?(ane|ene|yne)(.*)"
prefix_patterns = [r"(?<!\d-\()(?<!-\()-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(methyl|ethyl|propyl|butyl|pentyl|hexyl|heptyl|octyl|nonyl|decyl)(?!\))",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(bromo|chloro|iodo|flouro)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(cyano)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(oxo|keto)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(alkoxy)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(hydroxy)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(formyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(carboxy)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(alkoxycabonyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(halocarbonyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(amino)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(carbamoyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(nitro)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(suplho)"]
branch_pattern = r"-?(\d+,?)*?-\((.*?)\)"
compound_name = r"1-methyl-2-pentyl-3,64,7-trihexyl-5-oxo-12,6,7-triketo-23-(siugvuis)-68-(asdlkhdrjnkln)-42-(3,4-dimethylpentyl)pent-5,2,7-triyne"
prefixes = list(prefix_patterns)
print compound_name
print '\n\n'
main=re.findall(main_pattern,compound_name)
print main
print '\n\n'
for x in prefix_patterns:
    prefixes = re.findall(x, main[0][0])
    print prefixes
branches = re.findall(branch_pattern,main[0][0])
print branches
In the example, when the regex matches the prefix methyl in "1-methyl", it also matches methyl from
-42-(3,4-dimethylpentyl). I looked up negative lookahead/lookbehind but couldn't get satisfying results.
Could you kindly point out the problem and guide me to the answer?
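One way around the lookbehind headaches (a sketch, under the assumption that everything inside parentheses is a branch): blank out the parenthesized branches before running the prefix patterns, so their contents can never match the backbone prefixes. The cut-down prefix regex below stands in for the fuller patterns above:

```python
import re

name = "1-methyl-42-(3,4-dimethylpentyl)pent-5,2,7-triyne"

# blank out parenthesized branches so their contents can't match;
# the empty () keeps the surrounding locant structure intact
backbone = re.sub(r'\([^)]*\)', '()', name)

# a cut-down stand-in for the prefix patterns above
matches = re.findall(r'(\d[\d,]*)-(di|tri|tetra|penta)?(methyl|ethyl)', backbone)
print(matches)
# -> [('1', '', 'methyl')]
```

The "methyl" inside "(3,4-dimethylpentyl)" is gone before the prefix pass runs, so only the backbone's "1-methyl" is matched.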

Python regex to print all sentences that contain two identified classes of markup

I wish to read in an XML file, find all sentences that contain both the markup <emotion> and the markup <LOCATION>, then print those entire sentences to a unique line. Here is a sample of the code:
import re
text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard."
out = open('out.txt', 'w')
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
The regex here grabs all sentences with "wonderful" and "omaha" in them, and returns:
Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.
Which is perfect, but I really want to print all sentences that contain both <emotion> and <LOCATION>. For some reason, though, when I replace "wonderful" in the regex above with "emotion," the regex fails to return any output. So the following code yields no result:
import re
text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard."
out = open('out.txt', 'w')
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
My question is: How can I modify my regular expression in order to grab only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.
(For what it's worth, I'm working on parsing my text in BeautifulSoup as well, but wanted to give regular expressions one last shot before throwing in the towel.)
Your problem appears to be that your regex expects a space (\s) to follow the matching word, as seen in:
emotion(?=\s|\.|$)
When the word is part of a tag, it is followed by a > rather than a space, so that lookahead fails and no match is found. To fix it, you can just add the > after emotion, like:
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
Upon testing, this seems to solve your problem. Make sure to treat "LOCATION" similarly:
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
If I'm not misunderstanding, what you are trying to do is remove the <emotion> </emotion> and <LOCATION> </LOCATION> tags?
Well, if that is what you want to do, you can do this:
import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard."
out = open('out.txt', 'w')

def remove_xml_tags(xml):
    content = re.compile(r'<.*?>')
    return content.sub('', xml)

data = remove_xml_tags(text)
out.write(data + '\n')
out.close()
I have just discovered that the regex may be bypassed altogether. To find (and print) all sentences that contain two identified classes of markup, you can use a simple for loop. In case it might help others who find themselves where I found myself, I'll post my code:
# read in your file
f = open('sampleinput.txt', 'r')
# use read method to convert the read data object into string
readfile = f.read()
#########################
# now use the replace() method to clean data
#########################
# replace all \n with " "
nolinebreaks = readfile.replace('\n', ' ')
# replace all commas with ""
nocommas = nolinebreaks.replace(',', '')
# replace all ? with .
noquestions = nocommas.replace('?', '.')
# replace all ! with .
noexclamations = noquestions.replace('!', '.')
# replace all ; with .
nosemicolons = noexclamations.replace(';', '.')
######################
# now use replace() to get rid of periods that don't end sentences
######################
# replace all Mr. with Mr
nomisters = nosemicolons.replace('Mr.', 'Mr')
#replace 'Mrs.' with 'Mrs' etc.
cleantext = nomisters
# now, having cleaned the input, find all sentences that contain your two
# target words. To find markup, just replace "Toby" and "pipe" with
# <markupclassone> and <markupclasstwo>
periodsplit = cleantext.split('.')
for x in periodsplit:
    if 'Toby' in x and 'pipe' in x:
        print(x)
