I have the following XML file:
<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>
and I just want to get the text between <div xml:lang="unknown"> and </div>.
So I've tried this code:
import os, re
html = open("2.xml", "r")
text = html.read()
lon = re.compile(r'<div xml:lang="unknown">\n(.+)\n</div>', re.MULTILINE)
lon = lon.search(text).group(1)
print lon
but it doesn't seem to work.
1) Don't parse XML with regex. It just doesn't work. Use an XML parser.
2) If you do use regex for this, you don't want re.MULTILINE, which controls how ^ and $ work in a multiple-line string. You want re.DOTALL, which controls whether . matches \n or not.
3) You probably also want your pattern to return the shortest possible match, using the non-greedy +? operator.
lon = re.compile(r'<div xml:lang="unknown">\n(.+?)\n</div>', re.DOTALL)
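Putting those three points together, a minimal sketch of the regex route (reusing the file name from the question; the XML-parser route is still the more robust fix):

import re

with open("2.xml", "r") as f:
    text = f.read()

# DOTALL lets . match newlines; +? keeps the match from swallowing later divs
lon = re.compile(r'<div xml:lang="unknown">\n(.+?)\n</div>', re.DOTALL)
match = lon.search(text)
if match:
    print(match.group(1))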
You can parse a block of code like this: when you enter the block, set a flag to True; when you leave it, set the flag back to False and break out.
def get_infobox(self):
    """returns Infobox wikitext from text blob
    learning from https://github.com/siznax/wptools/blob/master/wp_infobox.py
    """
    if self._rawtext:
        text = self._rawtext
    else:
        text = self.get_rawtext()
    output = []
    region = False
    braces = 0
    lines = text.split("\n")
    if len(lines) < 3:
        raise RuntimeError("too few lines!")
    for line in lines:
        match = re.search(r'(?im){{[^{]*box$', line)
        braces += len(re.findall(r'{{', line))
        braces -= len(re.findall(r'}}', line))
        if match:
            region = True
        if region:
            output.append(line.lstrip())
            if braces <= 0:
                region = False
                break
    self._infobox = "\n".join(output)
    assert self._infobox
    return self._infobox
You can try splitting on the div and just matching within each piece. This works well for regexes on large data as well.
import re
html = """<p style="1">
A
</p>
<div xml:lang="unknown">
<p style="3">
B
C
</div>
<div xml:lang="English">
<p style="5">
D
</p>
<p style="1">
Picture number 3?
</p>
"""
for div in html.split('<div'):
    m = re.search(r'xml:lang="unknown">.+(<p[^<]+)', div, re.DOTALL)
    if m:
        print m.group(1)
I'm trying to scrape data from a listing website with the following HTML structure:
<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-bedrooms="1" data-block="21st Floor" data-building_size="31" data-category="condominium" data-condominiumname="Twin Lakes Countrywoods" data-price="6000000" data-subcategories='["condominium","single-bedroom"]'>
<div class="ListingCell-TitleWrapper">
<h3 class="ListingCell-KeyInfo-title" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<a class="js-listing-link" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay
</a>
</h3>
<div class="ListingCell-KeyInfo-address ellipsis">
<a class="js-listing-link ellipsis" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<span class="icon-pin">
</span>
<span>
Tagaytay Hi-Way
Dayap Itaas, Laurel
</span>
</a>
</div>
What I want to get is the info on the <div class="ListingCell-AllInfo ListingUnit"...> tag itself, i.e. data-bathrooms, data-bedrooms, data-block, etc.
I tried to scrape it using Python BeautifulSoup
details = container.find('div',class_="ListingCell-AllInfo ListingUnit").text if container.find('div',class_="ListingCell-AllInfo ListingUnit") else "-"
It's been returning "-" for all listings. Complete newbie here!
You can use BeautifulSoup; that would be better, and it has always worked for me.
from urllib2 import Request, urlopen  # urllib.request on Python 3
from bs4 import BeautifulSoup

req = Request("put your url here", headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')
title = soup.find_all('tag you want to scrape', class_='class of that tag')
Visit the link for more info: https://pypi.org/project/beautifulsoup4/
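For the data-* attributes in this particular question, note that they live on the tag itself, not in its text, which is why .text comes back empty. A minimal sketch, assuming the question's HTML is already in a string named html:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('div', class_='ListingCell-AllInfo ListingUnit')
if cell is not None:
    print(cell['data-bedrooms'])  # a single attribute, e.g. '1'
    print(cell.attrs)             # the full attribute dictionary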
Hi there! You could use a regular expression to solve your issue.
I have introduced a few comments in my solution, but for more information
take a look at the official documentation
or read this
import re # regular expression module
txt = """insert your html here"""
# we create a regex pattern called p1 that will match a string starting with
# <div class="ListingCell-AllInfo ListingUnit"
# followed by anything (any character) found 0 or more times
# and the string must end with '>'
p1 = re.compile(r'<div class="ListingCell-AllInfo ListingUnit".*>')
# findall returns a list of the strings that match the pattern p1 in txt
ls = p1.findall(txt)
# now, what you want is the data, so we can create another pattern where the
# word "data" will be found
# match a string starting with data followed by '-', then 0 or more alphanumeric
# chars, then '=', then any characters after the '=' that are not spaces or tabs
p2 = re.compile(r'(data-\w*=\S*)')
data = p2.findall(ls[0])
print(data)
Note: Don't be scared by the funky symbols; they look way worse than they truly are.
I have an HTML file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is, if all the tags in a 'p' block are 'strong', then combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I've tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With this code I was able to strip off the strong tags in the desired part, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs
html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')
s = '<p><strong>'+s+'</strong></p>'
print s # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specifically for my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # If we didn't break, we found a p block to fix:
                # get rid of the stuff inside p, and put a SubElement in
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to @Scott, who helped me come down to this solution. Although I cannot mark his answer correct, I have no less appreciation for his guidance.
Alternatively, you can use more specific xpath to get the targeted p elements directly :
p_target = """
//p[strong]
[not(*[not(self::strong)])]
[not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    # logic inside the loop can also be the same as your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the XPath being used:
//p[strong] : find p elements, anywhere in the XML/HTML document, having a child element strong...
[not(*[not(self::strong)])] : ...and not having any child element other than strong...
[not(text()[normalize-space()])] : ...and not having a non-empty text node child.
normalize-space() : get all text nodes from the current context element, concatenated, with consecutive whitespace normalized to single spaces
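As a quick sanity check, here is a minimal sketch running that XPath over a small sample modeled on the question (wrapped in a body element so it parses on its own):

from lxml import etree

body = etree.fromstring(
    '<body>'
    '<p><strong>This is </strong><strong>a line.</strong></p>'
    '<p>2. <strong>But do not </strong><em>touch this</em></p>'
    '</body>')

p_target = "//p[strong][not(*[not(self::strong)])][not(text()[normalize-space()])]"
for p in body.xpath(p_target):
    print(etree.tostring(p))  # only the all-strong first p is returned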
Given a string like
"<p> >this line starts with an arrow <br /> this line does not </p>"
or
"<p> >this line starts with an arrow </p> <p> this line does not </p>"
How can I find the lines that start with an arrow and surround them with a div?
So that it becomes:
"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>
Since it is HTML you are parsing, use the tool for the job: an HTML parser, like BeautifulSoup.
Use find_all() to find all text nodes that start with > and wrap() them with a new div tag:
from bs4 import BeautifulSoup
data = "<p> >this line starts with an arrow <br /> this line does not </p>"
soup = BeautifulSoup(data)
for item in soup.find_all(text=lambda x: x.strip().startswith('>')):
    item.wrap(soup.new_tag('div'))
print soup.prettify()
Prints:
<p>
<div>
>this line starts with an arrow
</div>
<br/>
this line does not
</p>
You can try the >\s+(>.*?)< regex pattern.
import re
regex = re.compile("\\>\\s{1,}(\\>.{0,}?)\\<")
testString = "" # fill this in
matchArray = regex.findall(testString)
# the matchArray variable contains the list of matches
and replace the matched group with <div> matched_group </div>. Here the pattern looks for anything enclosed between > > and <.
Here is a demo on Debuggex.
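Putting the match and the replacement together, a minimal sketch (the exact replacement template is my own assumption about the wrapping you want; the spacing differs slightly from the question's example output):

import re

data = "<p> >this line starts with an arrow <br /> this line does not </p>"
# capture the delimiters separately so they can be kept around the new div
result = re.sub(r'(>\s+)(>.*?)(<)', r'\1<div>\2</div>\3', data)
print(result)
# <p> <div>>this line starts with an arrow </div><br /> this line does not </p>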
You could try this regex,
>(\w[^<]*)
DEMO
Python code would be,
>>> import re
>>> str = '"<p> >this line starts with an arrow <br /> this line does not </p>"'
>>> m = re.sub(r'>(\w[^<]*)', r"<div> >\1</div> ", str)
>>> m
'"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"'
A certain page, retrieved from a URL, has the following syntax:
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
I want to extract the data in Name, Surname etc. (I have to repeat this task for many pages)
For that I tried using the following code:
import urllib2
url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)
start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]
start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]
print(givenName)
print(surname)
When I call the source.read().split method only one time, it works fine. But when I use it twice, it gives a list index out of range error.
Can someone suggest a solution?
You can use BeautifulSoup for parsing the HTML string.
Here is some code you might try.
It uses BeautifulSoup (to get the text made by the HTML code), then parses the string to extract the data.
from bs4 import BeautifulSoup as bs
dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
soup = bs(data)
# Get the text on the html through BeautifulSoup
text = soup.get_text()
# parsing the text
lines = text.splitlines()
for line in lines:
    # check if the line has ':'; if it doesn't, move to the next line
    if line.find(':') == -1:
        continue
    # split the string at ':'
    parts = line.split(':')
    # You can add more tests here like
    # if len(parts) != 2:
    #     continue
    # stripping whitespace
    for i in range(len(parts)):
        parts[i] = parts[i].strip()
    # adding the values to a dictionary
    dic[parts[0]] = parts[1]
    # printing the data after processing
    print '%16s %20s' % (parts[0], parts[1])
A tip:
If you are going to use BeautifulSoup to parse HTML, it helps to have identifying attributes like class="input" or id="10" on the tags you care about; that is, keep all tags of the same type under the same id or class.
Update
Regarding your comment, see the code below.
It applies the tip above, making life (and coding) a lot easier.
from bs4 import BeautifulSoup as bs
c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka ON K7L LK <br>
"""
soup = bs(data)
for i in soup.find_all('div'):
    # get data using the "class" attribute
    addr = ""
    if i.get("class")[0] == u'address':  # unicode string
        text = i.get_text()
        for line in text.splitlines():  # line-wise
            line = line.strip()  # remove whitespace
            addr += line  # add to address string
        c_addr.append(addr)
    # get data using the "id" attribute
    addr = ""
    if int(i.get("id")) == 10:  # integer
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)

print "id_addr"
print id_addr
print "c_addr"
print c_addr
You are calling read() twice. That is the problem. Instead of doing that you want to call read once, store the data in a variable, and use that variable where you were calling read(). Something like this:
fetched_data = source.read()
Then later...
givenName=(fetched_data.split(start))[1].split(end)[0]
and...
surname=(fetched_data.split(start))[1].split(end)[0]
That should work. The reason your code didn't work is that the read() method reads the content the first time, but once it is done it is positioned at the end of the content. The next time you call read() there is no content remaining, so it returns an empty string, and indexing the one-element split result with [1] throws the exception.
Check out the docs for urllib2 and methods on file objects
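A quick way to see the behavior, using io.StringIO as a stand-in for the response object:

from io import StringIO

source = StringIO(u'<p><strong>Name:</strong> Pasan</p>')
print(repr(source.read()))  # the full document
print(repr(source.read()))  # '' -- nothing left, so split(...)[1] raises IndexError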
If you want to be quick, regexes are more useful for this kind of task. It can be a harsh learning curve at first but regexes will save your butt one day.
Try this code:
import re

# read the whole document into memory
full_source = source.read()

# note: flags must go into re.compile(); the second positional argument of a
# compiled pattern's search() is a start position, not a flags value
NAME_RE = re.compile('Name:.+?>(.*?)<')
SURNAME_RE = re.compile('Surname:.+?>(.*?)<')

name = NAME_RE.search(full_source).group(1).strip()
surname = SURNAME_RE.search(full_source).group(1).strip()
See here for more info on how to use regexes in Python.
A more comprehensive solution would involve parsing the HTML (using a lib like BeautifulSoup), but that can be overkill depending on your particular application.
You can use HTQL:
page="""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
import htql
print(htql.query(page, "<p>.<strong> {a=:tx; b=:xx} "))
# [('Name:', ' Pasan '),
# ('Surname: ', ' Wijesingher '),
# ('Former/AKA Name:', ' No Former/AKA Name '),
# ('Gender:', ' Male '),
# ('Language Fluency:', ' ENGLISH ')
# ]
I am using the code below to get all of the HTML content of a section to save to a database:
el = doc.get_element_by_id('productDescription')
lxml.html.tostring(el)
The product description has a tag that looks like this:
<div id='productDescription'>
<THE HTML CODE I WANT>
</div>
The code works great and gives me all of the HTML code, but how do I remove the outer layer, i.e. the <div id='productDescription'> and the closing tag </div>?
You could convert each child to string individually:
text = el.text or ''  # el.text is None when the div has no leading text
text += ''.join(map(lxml.html.tostring, el.iterchildren()))
Or in an even more hackish way:
el.attrib.clear()
el.tag = '|||'
text = lxml.html.tostring(el)
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]
If your productDescription div contains mixed text/element content, e.g.
<div id='productDescription'>
the
<b> html code </b>
i want
</div>
you can get the content (as a string) using xpath('node()') traversal:
s = ''
for node in el.xpath('node()'):
    if isinstance(node, basestring):
        s += node
    else:
        s += lxml.html.tostring(node, with_tail=False)
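For completeness, a self-contained sketch of the same traversal (Python 3 spelling: str instead of basestring, and tostring() returns bytes there, hence the decode()):

import lxml.html

el = lxml.html.fromstring(
    '<div id="productDescription"> the <b> html code </b> i want </div>')

s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):  # text node (use basestring on Python 2)
        s += node
    else:  # element node; with_tail=False avoids duplicating the tail text
        s += lxml.html.tostring(node, with_tail=False).decode()
print(s)  # prints: the <b> html code </b> i want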
Here is a function that does what you want.
def strip_outer(xml):
    """
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd">
    ... <mrow>
    ...   <msup>
    ...     <mi>x</mi>
    ...     <mn>2</mn>
    ...   </msup>
    ...   <mo> + </mo>
    ...   <mi>x</mi>
    ... </mrow>
    ... </math>'''
    >>> so = strip_outer(xml)
    >>> so.splitlines()[0] == '<mrow>'
    True
    """
    xml = xml.replace('xmlns=', 'xmlns:x=')  # lxml fails with an xmlns= attribute
    xml = '<root>\n' + xml + '\n</root>'  # ...and it can't strip the root element
    rx = lxml.etree.XML(xml)
    lxml.etree.strip_tags(rx, 'math')  # strip <math> with all its attributes
    uc = lxml.etree.tounicode(rx)
    uc = u'\n'.join(uc.splitlines()[1:-1])  # remove the temporary <root> again
    return uc.strip()
Use regexp.
def strip_outer_tag(html_fragment):
    import re
    outer_tag = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)
    return outer_tag.search(html_fragment).group(1)

html_fragment = strip_outer_tag(tostring(el, encoding='unicode'))  # `encoding` is optional