I am using BeautifulSoup to parse a page's HTML. Because the HTML is broken, the markup is not consistent. On one page I have the following HTML:
<div id='VideoID'>
<a href=#><img src='file.png'></a>
</div>
On another page it is broken like this:
<div id='VideoID'>
<a href=#></a> [Image Tag not enclosed here]
<img src='file.png'>
</div>
The following line works for the first snippet, as expected:
imageURL = imageElement.contents[1].contents[0]['src'].strip()
It does not work for the second one, which is no surprise, since the img tag is no longer inside the anchor.
Is there any way to find the img tag inside the div with id 'VideoID', whether or not it is enclosed in the anchor tag?
Yes, with .descendants.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#descendants
Iterate through the div's descendants and check the .name of each one.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#name
Or even easier with CSS selectors:
soup.select("div#VideoID img")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
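The two approaches above could be sketched roughly as follows. This is a minimal sketch, assuming bs4 with the built-in html.parser; the helper names image_src_via_descendants and image_src_via_select and the two sample strings are made up for illustration and only mirror the snippets from the question.

```python
from bs4 import BeautifulSoup, Tag

# Sample markup mirroring the two snippets in the question (illustrative only).
well_formed = "<div id='VideoID'><a href='#'><img src='file.png'></a></div>"
broken = "<div id='VideoID'><a href='#'></a><img src='file.png'></div>"


def image_src_via_descendants(html):
    """Hypothetical helper: walk .descendants and return the first img src."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="VideoID")
    if div is None:
        return None
    for node in div.descendants:
        # .descendants also yields text nodes, so keep only Tag objects.
        if isinstance(node, Tag) and node.name == "img":
            return node.get("src", "").strip()
    return None


def image_src_via_select(html):
    """Hypothetical helper: same lookup using a descendant CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one("div#VideoID img")
    if img is None or not img.has_attr("src"):
        return None
    return img["src"].strip()


for markup in (well_formed, broken):
    print(image_src_via_descendants(markup), image_src_via_select(markup))
```

Both helpers return the same value for either layout, because neither depends on the img being a child of the anchor.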
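You can use recursiveChildGenerator() (the older name for .descendants) to walk every nested child element and pick out the img tag.
Example, where childs is the div element you already located:
for child in childs.recursiveChildGenerator():
    # text nodes have no tag name, so getattr() guards against them
    if getattr(child, "name", None) == "img":
        image_file = child["src"]
This finds the img tag at any depth in the hierarchy.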
I'm trying to parse an HTML file that contains many nested divs. I want to get all child divs of a given div, but not its grandchildren and deeper descendants.
Here is a pattern:
<div class='main_div'>
<div class='child_1'>
<div class='grandchild_1'></div>
</div>
<div class='child_2'>
...
...
</div>
So the command I'm looking for would return 2 elements: the divs whose classes are 'child_1' and 'child_2'.
Is it possible?
I've tried to use main_div.find_elements_by_tag_name('div') but it returned all nested divs in the div.
Here is a way to find the direct div children of the div with class name "main_div":
driver.find_elements_by_xpath('//div[@class="main_div"]/div')
The key here is the use of a single slash which would make the search inside the "main_div" non-recursive finding only direct div children.
Or, with a CSS selector:
driver.find_elements_by_css_selector("div.main_div > div")
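The difference between the single and the double slash can be checked without a browser. The following is only a sketch that swaps Selenium for lxml (an assumption made so the example runs offline) and uses a closed-up version of the pattern from the question.

```python
from lxml import etree

# Well-formed version of the pattern from the question (illustrative only).
markup = (
    '<div class="main_div">'
    '<div class="child_1"><div class="grandchild_1"></div></div>'
    '<div class="child_2"></div>'
    '</div>'
)
root = etree.fromstring(markup)

# Single slash: only direct div children of main_div.
direct = root.xpath('//div[@class="main_div"]/div')
# Double slash: every div nested anywhere inside main_div.
nested = root.xpath('//div[@class="main_div"]//div')

direct_classes = [d.get("class") for d in direct]
nested_classes = [d.get("class") for d in nested]
print(direct_classes)
print(nested_classes)
```

The same XPath string can be passed to Selenium unchanged. Note that newer Selenium releases drop the find_elements_by_* helpers in favour of driver.find_elements(By.XPATH, ...).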
I've defined CSS selectors within the script to get the text within span elements, and they work. However, the way I did it is definitely messy: I just separated different CSS selectors with a comma to tell the script I'm after this or that.
If I opted for XPath I could have used 'div//span[.="Featured" or .="Sponsored"]', but with CSS selectors I could not find anything similar that serves the same purpose. I know that with 'span:contains("Featured"),span:contains("Sponsored")' I can get the text, but the comma is still there.
What is the ideal way to locate the elements (within different ids) using CSS selectors without the comma?
My try so far with:
from lxml.html import fromstring
html = """
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">
Pizza Hut
</a>
<div id="featured other-dynamic-ids">
<span>Sponsored</span>
</div>
</div>
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/salads-up">
Salads UP
</a>
<div id="other-dynamic-ids border">
<span>Featured</span>
</div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("[id~='featured'] span,[id~='border'] span"):
    print(item.text)
You can do:
.rest-list-information div span
But I think it's a bad idea to consider the comma messy. You won't find many stylesheets that don't have commas.
If you are just looking to get all 'span' text from the HTML then the following should suffice:
root_spans = root.xpath('//span')
for span in root_spans:
    # take the first text node inside each span
    span_text = span.xpath('.//text()')[0]
    print(span_text)
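If the goal is a single expression with no comma, the XPath form mentioned in the question can be run directly through lxml. A small sketch, assuming a shortened copy of the question's HTML:

```python
from lxml.html import fromstring

# Shortened copy of the markup from the question (illustrative only).
html = """
<div class="rest-list-information">
  <div id="featured other-dynamic-ids"><span>Sponsored</span></div>
</div>
<div class="rest-list-information">
  <div id="other-dynamic-ids border"><span>Featured</span></div>
</div>
<div class="rest-list-information">
  <div id="something-else"><span>Regular</span></div>
</div>
"""
root = fromstring(html)

# One expression, two alternatives joined with "or" instead of a comma.
labels = [
    span.text
    for span in root.xpath('//div//span[.="Featured" or .="Sponsored"]')
]
print(labels)
```

This matches on the span text rather than on the parent id, so it keeps working if the dynamic ids change.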
While learning the BeautifulSoup library and trying to crawl a webpage, I found I can narrow the search by tag and attributes, for example an a tag with class name user-name, which I can find by inspecting the HTML source.
Here is a successful example:
<a href="https://thenewboston.com/profile.php?user=2" class="user-name">
Bucky Roberts </a>
I can easily select it with:
soup = BeautifulSoup(plain_text,'html.parser')
for link in soup.findAll('a', {'class': 'user-name'}):
However, when I try to get the profile photo's link, I see the code below by inspecting:
<div class="panel profile-photo">
<a href="https://thenewboston.com/profile.php?user=2">
<img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
</a>
</div>
In this case the .jpg link has nothing to refer to. Now what should I do to get the .jpg link for each user?
You can use the img element parent elements to create your locator. I would use the following CSS selector that would match img elements directly under the a elements directly under the element having profile-photo class:
soup.select(".profile-photo > a > img")
To get the src values:
for image in soup.select(".profile-photo > a > img"):
    print(image['src'])
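Put together with the markup from the question, this could look like the sketch below. The src value is site-relative, so the example also joins it with a base URL; BASE_URL is an assumption inferred from the profile link in the question, not something the answer states.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: relative image paths resolve against the site root.
BASE_URL = "https://thenewboston.com/"

plain_text = """
<div class="panel profile-photo">
<a href="https://thenewboston.com/profile.php?user=2">
<img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
</a>
</div>
"""
soup = BeautifulSoup(plain_text, "html.parser")

# Child combinators: img directly under a, directly under .profile-photo.
photo_urls = [
    urljoin(BASE_URL, img["src"])
    for img in soup.select(".profile-photo > a > img")
    if img.has_attr("src")
]
print(photo_urls)
```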
I'm trying to get a text from one tag using lxml etree.
<div class="litem__type">
<div>
Robbp
</div>
<div>Estimation</div>
+487 (0)639 14485653
•
<a href="mailto:herbrich#gmail.com">
Email Address
</a>
•
<a class="external" href="http://www.google.com">
Homepage
</a>
</div>
The problem is that I can't locate it reliably because these snippets vary a lot. Sometimes the first and second divs are not there at all. As you can see, the telephone number is not in its own div.
I suppose it would be possible to extract the telephone number using BeautifulSoup's contents, but I'm trying to use the lxml module's xpath.
Do you have any ideas? (The email is sometimes missing.)
EDIT: The best idea is probably to use regex, but I don't know how to tell it to extract just the text that follows the two <div></div> elements.
You should avoid using regex to parse XML/HTML wherever possible: patterns break easily when the markup varies, whereas an element tree lets you address nodes by their position in the structure.
The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:
content = '''
<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
<a href="mailto:herbrich#gmail.com">Email Address</a>
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''
from lxml import etree
tree = etree.XML(content)
phone_number = tree.xpath('div[2]')[0].tail.strip()
print(phone_number)
Output:
+487 (0)639 14485653
The strip() function is used here to remove whitespace on either side of the tail text.
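Because the question notes that the leading divs are sometimes missing, addressing div[2] may not always work. One possible alternative, sketched here under the assumption that the phone number is the only direct text of the container that contains digits, is to collect the container's own text nodes with text(). The helper name phone_candidates is hypothetical.

```python
from lxml import etree

with_divs = '''<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''

without_divs = '''<div class="litem__type">
+487 (0)639 14485653
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''


def phone_candidates(xml_text):
    """Hypothetical helper: direct text nodes of the root that contain a digit."""
    root = etree.XML(xml_text)
    # text() selects the root's own text and the tails of its children,
    # but not text nested inside child elements.
    pieces = [t.strip() for t in root.xpath('text()')]
    return [p for p in pieces if any(ch.isdigit() for ch in p)]


print(phone_candidates(with_divs))
print(phone_candidates(without_divs))
```

The digit check is deliberately crude; a stricter phone-number test could replace it.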
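You can iterate over the div elements and read the text that follows each one.
from lxml import etree
tree = etree.parse("filename.xml")
items = tree.xpath('//div')
for node in items:
    # tail is None when no text follows the element;
    # you can also check here whether it is a phone number
    if node.tail and node.tail.strip():
        print(node.tail.strip())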
I am using bs4 and want to extract a href of a specified image.
For example in the html code I have:
<div style="text-align:center;"><img src="page_files/image.jpg" alt="Picture" border="0" width="150" height="150"></div>
</div>
And I have my image src given (page_files/image.jpg) and want to extract corresponding href, so in this example it is: page/folder1/image.jpg. I was trying to use find_previous method, but I have a small problem to extract the href content:
soup = bs4.BeautifulSoup(page)
for img in soup('img'):
    imgLink = img.find_previous("a")
This returns the whole tag:
<img alt="Tumblr" border="0" src="Here_is_source"/>
But I can't get the href content, because when I try:
imgLink = img.find_previous("a")['href']
I have an error.
The same thing is when I try to use find_parent like
imgLink = img.find_parent("a")['href']
How can I fix that? And what is better: find_previous() or find_parent()?
Make sure you are only looking for images that have an <a> parent tag with an href attribute:
for img in soup.select('a[href] img'):
    link = img.find_parent('a', href=True)
    print(link['href'])
The CSS selector picks only images that have an <a href="..."> parent tag with an href attribute. The find_parent() search then again limits the search to those tags that have the attribute set.
If you are searching for all images, chances are you are finding some that have a <a> tag parent or preceding tag that does not have the a href attribute; <a> tags can also be used for link targets with <a name="...">, for example. If you are getting NoneType attribute errors, that simply means there is no such parent tag for the given <img> tag.
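To illustrate the failure mode described above, here is a small sketch that walks all images and guards against a missing parent link. The sample page is made up for illustration: it contains one linked image, one image inside an <a name="..."> target, and one image with no anchor at all.

```python
from bs4 import BeautifulSoup

# Hypothetical page, not the markup from the question.
page = """
<a href="page/folder1/image.jpg"><img src="page_files/image.jpg" alt="Picture"></a>
<a name="anchor"><img src="page_files/other.jpg"></a>
<img src="page_files/orphan.jpg">
"""
soup = BeautifulSoup(page, "html.parser")

links = {}
for img in soup.find_all("img"):
    # find_parent returns None when no enclosing <a> with an href exists.
    parent_link = img.find_parent("a", href=True)
    links[img["src"]] = parent_link["href"] if parent_link is not None else None

print(links)
```

Subscripting the result of find_parent() directly, as in the question, raises an error for the last two images because there is no matching tag to index into.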