Navigating DOM in BeautifulSoup - python

I'm currently able to find certain elements using the findAll function. Is there a way to navigate to their children?
The code I have is:
data = soup.findAll(id="profile-experience")
print(data[0].get_text())
And it returns a block of text in which, for example, some of the words aren't spaced out properly.
The DOM looks something like this:
<div id="profile-experience">
  <div class="module-body">
    <li class="position">
    <li class="position">
    <li class="position">
If I just do a findAll on class="position" I get far too much back. Is there a way, using BeautifulSoup, to find only the <li class="position"> elements that are nested underneath <div id="profile-experience">?
I want to do something like this:
data = soup.findAll('li', attrs={'class': 'position'})
(where I'm only getting the nested data)
for d in data:
    print(d.get_text())

Sure, you can "chain" the find* calls:
profile_experience = soup.find(id="profile-experience")
for li in profile_experience.find_all("li", class_="position"):
print(li.get_text())
Or, you can solve it in one go with a CSS selector:
for li in soup.select("#profile-experience li.position"):
    print(li.get_text())
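As an aside on the question's spacing complaint: get_text() accepts separator and strip arguments that keep text from different child tags from running together. A small sketch (the HTML below and the job-title strings are made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div id="profile-experience">
  <div class="module-body">
    <li class="position"><span>Engineer</span><span>Acme</span></li>
    <li class="position"><span>Manager</span><span>Initech</span></li>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# separator inserts a space between text fragments from different tags,
# and strip=True trims the surrounding whitespace
texts = [li.get_text(separator=" ", strip=True)
         for li in soup.select("#profile-experience li.position")]
print(texts)  # ['Engineer Acme', 'Manager Initech']
```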

Related

Scrapy Selector only getting the first element in for loop

I don't understand why the following code doesn't work when using Scrapy Selector.
In scrapy shell (to be easily replicable, but the issue remains the same in a spider):
from scrapy.selector import Selector
body = '''<html>
<body>
<li>
<p>1</p>
<p>2</p>
<p>3</p>
</li>
<li>
<p>4</p>
<p>5</p>
<p>6</p>
</li>
<li>
<p>7</p>
<p>8</p>
<p>9</p>
</li>
</body>
</html>'''
sel = Selector(text=body, type="html")
for elem in sel.xpath('//body'):
    first = elem.xpath('.//li/p[1]/text()').get()
    print(first)
And it prints:
1
while it should be printing:
1
4
7
Any idea how to solve this problem?
Thanks
Chances are the problem is that you're using the .get() method to fetch the data; replace it with .getall(). That method returns all the matching data as a list, from which you can pull your desired items with ordinary Python slicing.
Alternatively, the class name may differ on each "li" tag, in which case you may need to pass class="" in your XPath expression.
Note: rather than fetching data with the path elem.xpath('.//li/p[1]/text()').get(), you can simply get all the data using elem.xpath('.//li/p/text()').getall() and then apply your manipulation logic to the list, which is the easiest way if you don't get your desired output.
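The underlying cause is worth spelling out: the loop body runs only once, because there is a single <body>, and .get() then returns just the first of the three matches. Iterating over the <li> elements themselves (or switching to .getall()) yields all three. A sketch of the same idea using the stdlib's xml.etree.ElementTree rather than Scrapy, since the markup in the question is well-formed:

```python
import xml.etree.ElementTree as ET

body = """<html>
<body>
<li><p>1</p><p>2</p><p>3</p></li>
<li><p>4</p><p>5</p><p>6</p></li>
<li><p>7</p><p>8</p><p>9</p></li>
</body>
</html>"""

root = ET.fromstring(body)
# Loop over each <li> so the relative query runs once per item;
# find("p") returns the first <p> child, like p[1] in XPath.
firsts = [li.find("p").text for li in root.iter("li")]
print(firsts)  # ['1', '4', '7']
```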

Creating a css selector to locate multiple ids in a single-shot

I've defined CSS selectors within the script to get the text within span elements, and I'm getting them accordingly. However, the way I tried is definitely messy: I just separated the different CSS selectors with commas to tell the script I'm after this or that.
If I opted for XPath I could have used 'div//span[.="Featured" or .="Sponsored"]', but in the case of CSS selectors I could not find anything similar to serve the same purpose. I know that using 'span:contains("Featured"),span:contains("Sponsored")' I can get the text, but there is the comma in between as usual.
What is the ideal way to locate the elements (within different ids) using CSS selectors, other than with a comma?
My try so far with:
from lxml.html import fromstring
html = """
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">
Pizza Hut
</a>
<div id="featured other-dynamic-ids">
<span>Sponsored</span>
</div>
</div>
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/salads-up">
Salads UP
</a>
<div id="other-dynamic-ids border">
<span>Featured</span>
</div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("[id~='featured'] span,[id~='border'] span"):
    print(item.text)
You can do:
.rest-list-information div span
But I think it's a bad idea to consider the comma messy. You won't find many stylesheets that don't have commas.
If you are just looking to get all 'span' text from the HTML then the following should suffice:
root_spans = root.xpath('//span')
for root_span in root_spans:
    span_text = root_span.xpath('.//text()')[0]
    print(span_text)
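For completeness, the [id~='featured'] selector matches a whole space-separated word inside the id attribute. The same whole-word check can be spelled out in plain Python with the stdlib's ElementTree (a wrapper <root> element is added below so the two-fragment markup parses as one document):

```python
import xml.etree.ElementTree as ET

html = """<root>
<div class="rest-list-information">
  <div id="featured other-dynamic-ids"><span>Sponsored</span></div>
</div>
<div class="rest-list-information">
  <div id="other-dynamic-ids border"><span>Featured</span></div>
</div>
</root>"""

root = ET.fromstring(html)
# Mimic [id~='featured'] / [id~='border']: match whole words in the id value.
texts = [span.text
         for div in root.iter("div")
         if {"featured", "border"} & set(div.get("id", "").split())
         for span in div.iter("span")]
print(texts)  # ['Sponsored', 'Featured']
```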

Python parse all elements in a specific tag

I would like to know how to extract all of the elements under a specific tag.
For example:
<div class="text">
<h2>...</h2>
<p>...</p>
<p>...</p>
<h2>...</h2>
</div>
I would like to get these elements in a list
elements = ['<h2>...</h2>',
            '<p>...</p>',
            '<p>...</p>',
            '<h2>...</h2>']
The reason I need this, I want to know under what category (header) the text is written and extract the text.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# recursive=False restricts the result to the div's direct children
l = soup.find('div', {'class': 'text'}).findChildren(recursive=False)
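If BeautifulSoup isn't a hard requirement and the fragment is well-formed, the same list of serialized children can be produced with the stdlib alone (the tag contents below are placeholders):

```python
import xml.etree.ElementTree as ET

html = """<div class="text">
<h2>Heading A</h2>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<h2>Heading B</h2>
</div>"""

div = ET.fromstring(html)
# Iterating an Element yields its direct children in document order;
# tostring() serializes each one back to markup (strip() drops the tail newline).
children = [ET.tostring(child, encoding="unicode").strip() for child in div]
print(children)
```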

Xpath to get text following label

I want to get items according to their (preceding) <label> attributes, like this:
<div>
  <ul>
    <li class="phone">
      <label>Mobile</label>
      312-999-0000
<div>
  <ul>
    <li class="phone">
      <label>Home</label>
      312-999-0001
I want to put the first number in the "Mobile" column/list, and the second in the Home list. I currently have code grabbing both of them, but I don't know the proper syntax for getting the label as it is in the source. This is what I'm using now:
for target in targets:
    item = CrawlerItem()
    item['phonenumbers'] = target.xpath('div/ul/li[@class="phone"]/text()').extract()
How should I rewrite that for item['mobilephone'] and item['homephone'], using the labels?
I found the answer while finishing up the question, and thought I should share it:
item['mobilephone'] = target.xpath('div/ul/li/label[contains(text(), "Mobile")]/following-sibling::text()').extract()
item['homephone'] = target.xpath('div/ul/li/label[contains(text(), "Home")]/following-sibling::text()').extract()
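The same label-to-number pairing can also be done without the following-sibling axis: in ElementTree terms, the text that follows an element's closing tag is stored on that element's .tail attribute. A stdlib sketch (the markup is closed up so it parses; numbers taken from the question):

```python
import xml.etree.ElementTree as ET

snippet = """<div>
<ul>
<li class="phone"><label>Mobile</label> 312-999-0000</li>
<li class="phone"><label>Home</label> 312-999-0001</li>
</ul>
</div>"""

root = ET.fromstring(snippet)
numbers = {}
for li in root.iter("li"):
    label = li.find("label")
    # label.tail is the text between </label> and </li>
    numbers[label.text] = label.tail.strip()
print(numbers)  # {'Mobile': '312-999-0000', 'Home': '312-999-0001'}
```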

Get all tagged text under li tags

I have list like:
<ul>
<li><strong>Text 1</strong></li>
<li>Text 2</li>
<li>Text 3</li>
<li><strong>Text 4</strong></li>
</ul>
How can I get only the values under the strong tags, using Selenium WebDriver in Python?
Assuming the data is always in the presented form, a simple regex will do:
import re
re.findall(r'<li><strong>([^<]*)</strong></li>', my_text)
The simplest way using WebDriver (which is what I'm assuming you want, given your tags) is:
ul = driver.find_element_by_xpath("//ul")
elements = ul.find_elements_by_xpath("./li/strong")
for element in elements:
    print(element.text)
