BeautifulSoup : Weird behavior with <p> - python

I have the following HTML content:
content = """
<div>
<div> <div>div A</div> </div>
<p>P A</p>
<div> <div>div B</div> </div>
<p> P B1</p>
<p> P B2</p>
<div> <div>div C</div> </div>
<p> P C1 <div>NODE</div> </p>
</div>
"""
If I use the following code:
soup = bs4.BeautifulSoup(content, "lxml")
firstDiv = soup.div
allElem = firstDiv.findAll(recursive=False)
for i, el in enumerate(allElem):
    print("element", i, ":", el)
I get this:
element 0 : <div> <div>div A</div> </div>
element 1 : <p>P A</p>
element 2 : <div> <div>div B</div> </div>
element 3 : <p> P B1</p>
element 4 : <p> P B2</p>
element 5 : <div> <div>div C</div> </div>
element 6 : <p> P C1 </p>
element 7 : <div>NODE</div>
As you can see, unlike elements 0, 2 or 5, element 6 doesn't contain its children. If I change its <p> to <b> or <div> then it acts as expected. Why this little difference with <p>? I still have the problem (if it is one?) after upgrading from 4.3.2 to 4.4.6.

p elements can only contain phrasing content, so what you have is actually invalid HTML. The HTML spec gives an example of how such markup is parsed:
For example, a form element isn't allowed inside phrasing content,
because when parsed as HTML, a form element's start tag will imply a
p element's end tag. Thus, the following markup results in two
paragraphs, not one:
<p>Welcome. <form><label>Name:</label> <input></form>
It is parsed exactly like the following:
<p>Welcome. </p><form><label>Name:</label> <input></form>
You can confirm in the DOM inspector that this is how browsers (e.g. Chrome 64) parse your HTML.
lxml is handling this correctly, as is html5lib. html.parser doesn't implement much of the HTML5 spec and doesn't care about these quirks.
I suggest you stick to lxml and html5lib if you don't want to be frustrated in the future by these parsing differences. It's annoying when what you see in your browser's DOM inspector differs from how your code parses it.
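You can see the html.parser side of this directly. Here is a minimal sketch (the markup below is a reduced version of the question's; parser names are the standard bs4 ones):

```python
from bs4 import BeautifulSoup

# The same invalid nesting as in the question: a <div> inside a <p>.
html = '<div><p>P C1 <div>NODE</div></p></div>'

# html.parser does not implement the HTML5 rule that a <div> start tag
# implies a </p>, so the inner <div> stays nested inside the <p>:
lenient = BeautifulSoup(html, 'html.parser')
print(lenient.p.div)  # <div>NODE</div>

# With 'lxml' or 'html5lib' (if installed), the <p> is closed first,
# so lenient.p.div would instead be None - matching what browsers do.
```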

Related

extract tags from soup with BeautifulSoup

'''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
I want to extract all the "kt-post-card__body" elements and then, from each one of them, extract ("kt-post-card__title", "kt-post-card__description") as a list.
I tried this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
but with ads[0].div I can only access "kt-post-card__title", while "kt-post-card__body" has three other sub-tags such as "kt-post-card__description" and "kt-post-card__bottom"... why is that?
Because your question is not entirely clear - to extract the classes:
for e in soup.select('.kt-post-card__body'):
    print([c for t in e.find_all() for c in t.get('class')])
Output:
['kt-post-card__title', 'kt-post-card__description', 'kt-post-card__bottom', 'kt-post-card__bottom-description', 'kt-text-truncate']
To get the texts you also have to iterate your ResultSet; you could access each element's text to fill your list, or use stripped_strings.
Example
from bs4 import BeautifulSoup
html_doc='''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
for e in soup.select('.kt-post-card__body'):
    data = [
        e.select_one('.kt-post-card__title').text,
        e.select_one('.kt-post-card__description').text
    ]
    print(data)
Output:
['Example_1', 'Example_2']
or
print(list(e.stripped_strings))
Output:
['Example_1', 'Example_2', 'Example_4']
Try this:
ads = soup.find_all('div', {'class': 'kt-post-card__body'})
ads[0]
I think you're only getting the first inner div because you called ads[0].div - .div returns just the first matching descendant, not all of them.

How to keep all html elements with selector but drop all others?

I would like to get an HTML string without certain elements. However, upfront I only know which elements to keep, not which ones to drop.
Let's say I just want to keep all p and a tags inside the div with class="A".
Input:
<div class="A">
<p>Text1</p>
<img src="A.jpg">
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
</div>
<div class="B">
ContentDiv2
</div>
Expected output:
<div class="A">
<p>Text1</p>
<p>Text2</p>
link text
</div>
If I knew the selectors of all the other elements I could just use lxml's drop_tree(). But the problem is that I don't know ['img', 'div.sub1', 'div.B'] upfront.
Example with drop_tree():
import lxml.cssselect
import lxml.html
tree = lxml.html.fromstring(html_str)
elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()
output = lxml.html.tostring(tree)
I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:
target = tree.xpath('//div[@class="A"]')[0]
to_keep = target.xpath('//p | //a')
for t in target.xpath('.//*'):
    if t not in to_keep:
        target.remove(t)  # I believe this method is better here than drop_tree()
print(lxml.html.tostring(target).decode())
The output I get is your expected output.
Try the below. The idea is to clear the root and add back the required sub-elements.
Note that no external lib is required.
import xml.etree.ElementTree as ET
html = '''<div class="A">
<p>Text1</p>
<img src="A.jpg"/>
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
ContentDiv2
</div>'''
root = ET.fromstring(html)
p_lst = root.findall('./p')
a_lst = root.findall('./a')
children = list(root)
for c in children:
    root.remove(c)
for p in p_lst:
    p.tail = ''
    root.append(p)
for a in a_lst:
    a.tail = ''
    root.append(a)
root.text = ''
ET.dump(root)
output
<?xml version="1.0" encoding="UTF-8"?>
<div class="A">
<p>Text1</p>
<p>Text2</p>
link text
</div>
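For completeness, since the rest of this page uses BeautifulSoup: the same result can be sketched with bs4's decompose(). This is a sketch, not part of the original answers, and it only handles disallowed tags that are direct children of div.A:

```python
from bs4 import BeautifulSoup

html_str = '''
<div class="A">
<p>Text1</p>
<img src="A.jpg">
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
</div>
<div class="B">
ContentDiv2
</div>
'''

soup = BeautifulSoup(html_str, 'html.parser')
target = soup.find('div', class_='A')

# Drop every direct child tag of div.A that is not a <p> or an <a>.
# (Disallowed tags nested inside a kept <p>/<a> would need extra handling.)
for tag in target.find_all(True, recursive=False):
    if tag.name not in ('p', 'a'):
        tag.decompose()

print(target)
```

Bare text like "link text" survives because decompose() only removes tags, which matches the expected output above.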

invalid xpath expression python

I am trying to get all elements whose id starts with some text, but when I test it out, it says my xpath is invalid.
This is my xpath:
xpath = "//*[contains(@id,'ClubDetails_" + str(team) + "']"
Which outputs when I print:
//*[contains(@id,'ClubDetails_1']
Here is the HTML:
<div id="ClubDetails_1_386" class="fadeout">
<div class="MapListDetail">
<div>Bev Garman</div>
armadaswim@aol.com
</div>
<div class="MapListDetail">
<div>Rolling Hills Country Day School</div>
<div>26444 Crenshaw Blvd</div>
<div>Rolling Hills Estates, CA 90274</div>
</div>
<div class="MapListDetailOtherLocs_1">
<div>This club also swims at other locations</div>
<span class="show_them_link">show them...</span>
</div>
</div>
What am I missing ?
Your expression is missing the closing parenthesis of contains() - it should end with "')]". More generally, an alternative is to use a pattern that matches an id starting with some text, collect all those elements in a list, and then iterate over the list checking whether each element's id attribute contains the team you want.
You can match the pattern with a CSS selector like this:
div[id^='ClubDetails_']
And like this in XPath:
//div[starts-with(@id, 'ClubDetails_')]
Code sample:
expectedid = "1_386"
clubdetails = driver.find_elements_by_css_selector("div[id^='ClubDetails_']")
for item in clubdetails:
    elementid = item.get_attribute('id')
    if expectedid in elementid:
        # further code
        pass
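And for the original expression itself, the fix is just the missing parenthesis (the team value below is hypothetical):

```python
team = 1  # hypothetical team id from the question

# broken:  "//*[contains(@id,'ClubDetails_" + str(team) + "']"   <- no ')'
xpath = "//*[contains(@id,'ClubDetails_" + str(team) + "')]"
print(xpath)  # //*[contains(@id,'ClubDetails_1')]
```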

Using Xpath axes to extract preceding element

I'm trying to scrape data from a site with the structure below. I want to extract the information in each of the <li id="entry">, but each entry should also include the category information from the preceding <li id="category"> / <h2>
<ul class="html-winners">
<li id="category">
<h2>Redaktionell Print - Dagstidning</h2>
<ul>
<li id="entry">
<div class="entry-info">
<div class="block">
<img src="bilder/tumme10/4.jpg" width="110" height="147">
<span class="gold">Guld: Svenska Dagbladet</span><br>
<strong><strong>Designer:</strong></strong> Anna W Thurfjell och SvD:s medarbetare<br>
<strong><strong>Motivering:</strong></strong> "Konsekvent design som är lätt igenkänningsbar. Små förändringar förnyar ständigt och blldmotiven utnyttjas föredömligt."
</div>
</div>
</li>
<li id="entry">
<div class="entry-info">
<div class="block"><img src="bilder/tumme10/3.jpg" width="110" height="147">
<span class="silver">Silver: K2 - Kristianstadsbladet</span>
</div>
</div>
</li>
</ul>
</li>
I use a Scrapy spider with the following code:
start_urls = [
    "http://www.designpriset.se/vinnare.php?year=2010"
]
rules = (
    Rule(LinkExtractor(allow="http://www.designpriset.se/", restrict_xpaths=('//*[@class="html-winners"]')), callback='parse_item'),
)
def parse(self, response):
    for sel in response.xpath('//*[@class="entry-info"]'):
        item = ByrauItem()
        annons_list = sel.xpath('//span[@class="gold"]/text()|//span[@class="silver"]/text()').extract()
        byrau_list = sel.xpath('//div/text()').extract()
        kategori_list = sel.xpath('/preceding::h2/text()').extract()
        for x in range(0, len(annons_list)):
            item['Annonsrubrik'] = annons_list[x]
            item['Byrau'] = byrau_list[x]
            item['Kategori'] = kategori_list[x]
            yield item
annons_list and byrau_list work perfectly; they use XPath to go down the hierarchy from the starting point //*[@class="entry-info"]. But kategori_list gives me "IndexError: list index out of range". Am I writing the preceding axis the wrong way?
As mentioned by @kjhughes in a comment, you need to add . just before / or // to make your XPath expression relative to the current context element. Otherwise the expression is evaluated relative to the root document, and that's why the expression /preceding::h2/text() returned nothing.
In the case of /, you can alternatively just remove it from the beginning of your XPath expression to make it relative to the current context element:
kategori_list = sel.xpath('preceding::h2/text()').extract()
Just a note: preceding::h2 will return all h2 elements located before the <div class="entry-info">. Given the HTML posted, I think the following XPath expression is safer against returning unwanted h2 elements (false positives):
query = 'parent::li/parent::ul/preceding-sibling::h2/text()'
kategori_list = sel.xpath(query).extract()
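Here is a self-contained illustration of that sibling-axis expression with plain lxml (the markup is a trimmed-down version of the page, not the actual spider):

```python
from lxml import html

doc = html.fromstring('''
<ul class="html-winners">
  <li id="category">
    <h2>Redaktionell Print - Dagstidning</h2>
    <ul>
      <li id="entry"><div class="entry-info">Guld: Svenska Dagbladet</div></li>
      <li id="entry"><div class="entry-info">Silver: K2</div></li>
    </ul>
  </li>
</ul>
''')

for sel in doc.xpath('//*[@class="entry-info"]'):
    # '/preceding::h2' would be evaluated from the document root and match
    # nothing; the relative sibling chain finds the category heading instead:
    category = sel.xpath('parent::li/parent::ul/preceding-sibling::h2/text()')[0]
    print(sel.text, '->', category)
```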

Beautiful Soup to Find and Regex Replace text 'not within <a></a> '

I am using Beautiful Soup to parse an HTML document to find all text that is:
1. Not contained inside any anchor elements
I came up with this code, which finds all links with an href but not the other way around.
How can I modify this code to get only the plain text using Beautiful Soup, so that I can do some find and replace and modify the soup?
for a in soup.findAll('a', href=True):
    print(a['href'])
EDIT:
Example:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I am doing this to find text not within <a></a>, then find "Identify" and replace it with "Replaced".
So the final output will be like this:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Replaced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Thanks for your time !
If I understand you correctly, you want to get the text that is inside an a element that has an href attribute. To get the text of the element, you can use the .text attribute.
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
Edit
This finds all the text elements with "identified" in them:
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n This should be identified \n\n Identify me 1 \n\n Identify me 2 \n ', u' identified ']
The returned objects are of type BeautifulSoup.NavigableString. If you want to check if the parent is an a element you can do txt.parent.name == 'a'.
Another edit:
Here's another example with a regex and a replacement.
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
    if re.search('identi', txt, re.I) and txt.parent.name != 'a':
        newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
        txt.replaceWith(newtext)
print(soup)
<html><body>
<div> test1 </div>
<div><br /></div>
<div>test2</div>
<div><br /></div><div><br /></div>
<div>
this should be replacefied
replacefy me 1
replacefy me 2
<p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>
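The answer above uses the long-obsolete BeautifulSoup 3 API (BeautifulSoup.BeautifulSoup, feed, replaceWith). Under bs4, the library used elsewhere on this page, the same replace-outside-anchors idea looks roughly like this (the markup below is a shortened version of the example):

```python
import re
from bs4 import BeautifulSoup

html = '''<div>
Identify me 1
<a href="http://something.com">Identify me inside a link</a>
<p>This paragraph should be <b>identified</b>.</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
for txt in soup.find_all(string=True):
    # Skip text nodes whose direct parent is an <a> element,
    # then rewrite Identify/identified -> replacefy/replacefied.
    if txt.parent.name != 'a' and re.search('identi', txt, re.I):
        txt.replace_with(re.sub(r'[Ii]denti(\w+)', r'replace\1', txt))

print(soup)
```

Note that find_all builds its result list before the loop runs, so calling replace_with while iterating is safe here.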
