I'm scraping a website for listing prices and can't figure out how to navigate the tree structure.
Ideally I would have a for loop iterating over all the lis and doing some data analysis on each, so I would love an iterator over the specific elements that are nested way down.
I tried to call nested elements à la .div.div. I think I'm just new to this; a few lines of help would be greatly appreciated!
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = 'https://www.2ememain.be/l/velos-velomoteurs/q/velo/'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "lxml")
containers = page_soup.findAll(
    "li", {"class": "mp-Listing mp-Listing--list-item"})
Here is the tree structure:
<li class="mp-Listing mp-Listing--list-item">
<figure class="mp-Listing-image-container"><a
data-tracking="mucLxVHX8FbvYBHPHfGkOCRq9VFszDlhSxgIClJUJRXbTYMnnOw8kI1NFuitzMperXfQZoyyS2Mx8VbGSZB7_jITV8iJZErGmgWsWp4Arvmpog9Hw3EO8q45U-6chavRHHXbOGPOeNci_683vlir1_SAK-XDa7Znjl22XHOxxH_n3QwloxZSRCxAKGjVYg8aQGTfUgZd2b9DDBdUR2fqyUEUXqnMGZ5hjKlTKTR67obF26tTc8kc1HAsv_fvTEfJW-UxpJCuVhXjKi3pcuL99F8QesdivVy1p_jhs7KL-528jJXZ-LGNSz6cloZlO3yEsAdN_NxI4vz76mTfPY-fiRuAlSPfcjP8KYuDw9e8Qz-QyhUNfhIzOZyU6r1suEfcihY9w_HYY-Qn6vmZ8Bw9ZZn4CEV7odI4_7RzYe8OBw4UmTXAODFxJgS-7fnlWgUAZqX8wu_WydbQLqDqpMXEMsbzKFxaerTLhhUGBqNlBEzpJ0jBIm7-hafuMH5v3IRU0Iha8fUbu7soVLYTuTcbBG2dUgEH-O2-bALjnkMB8XWlICCM14klxeRyOAFscVKg2m6p5aanRR38dgEXuvVE9UcSjHW43JeNSv3gJ7GwJww"
href="/a/velos-velomoteurs/velos-ancetres-oldtimers/a34926285-peugeot-velo-de-course-1970.html?c=17f70af2bde4a155c6d568ce3cad9ab7&previousPage=lr">
<div class="mp-Listing-image-item mp-Listing-image-item--main"
style="background-image:url(//i.ebayimg.com/00/s/NTI1WDcwMA==/z/LlYAAOSw3Rdc-miZ/$_82.JPG)"><img
alt="Peugeot - V�lo de course - 1970" data-img-src="Peugeot - V�lo de course - 1970"
src="//i.ebayimg.com/00/s/NTI1WDcwMA==/z/LlYAAOSw3Rdc-miZ/$_82.JPG"
title="Peugeot - V�lo de course - 1970" /></div>
</a></figure>
<div class="mp-Listing-content">
<div class="mp-Listing-group mp-Listing-group--main">
<h3 class="mp-Listing-title"><a
data-tracking="mucLxVHX8FbvYBHPHfGkOCRq9VFszDlhSxgIClJUJRXbTYMnnOw8kI1NFuitzMperXfQZoyyS2Mx8VbGSZB7_jITV8iJZErGmgWsWp4Arvmpog9Hw3EO8q45U-6chavRHHXbOGPOeNci_683vlir1_SAK-XDa7Znjl22XHOxxH_n3QwloxZSRCxAKGjVYg8aQGTfUgZd2b9DDBdUR2fqyUEUXqnMGZ5hjKlTKTR67obF26tTc8kc1HAsv_fvTEfJW-UxpJCuVhXjKi3pcuL99F8QesdivVy1p_jhs7KL-528jJXZ-LGNSz6cloZlO3yEsAdN_NxI4vz76mTfPY-fiRuAlSPfcjP8KYuDw9e8Qz-QyhUNfhIzOZyU6r1suEfcihY9w_HYY-Qn6vmZ8Bw9ZZn4CEV7odI4_7RzYe8OBw4UmTXAODFxJgS-7fnlWgUAZqX8wu_WydbQLqDqpMXEMsbzKFxaerTLhhUGBqNlBEzpJ0jBIm7-hafuMH5v3IRU0Iha8fUbu7soVLYTuTcbBG2dUgEH-O2-bALjnkMB8XWlICCM14klxeRyOAFscVKg2m6p5aanRR38dgEXuvVE9UcSjHW43JeNSv3gJ7GwJww"
href="/a/velos-velomoteurs/velos-ancetres-oldtimers/a34926285-peugeot-velo-de-course-1970.html?c=17f70af2bde4a155c6d568ce3cad9ab7&previousPage=lr">Peugeot
- Vélo de course - 1970</a></h3>
<p class="mp-Listing-description mp-text-paragraph">Cet objet est vendu par Catawiki. Cliquez sur le lien
pour être redirigé vers le site Catawiki et placer votre enchère.vélo de cou<span><input
class="mp-Listing-show-more" id="a34926285" type="checkbox" /><span
class="mp-Listing-description mp-Listing-description--extended">rse peugeot des années 70,
équipé de pneus neufs (michelin dynamic sport), freins Mafac racer, dérailleur allvit, 3
plateaux, 21 vitesses.selle Basano</span><label for="a34926285">...<span
class="mp-Icon mp-Icon--xs mp-svg-arrow-down"></span><span
class="mp-Icon mp-Icon--xs mp-svg-arrow-up"></span></label></span></p>
<div class="mp-Listing-attributes"></div>
</div>
<div class="mp-Listing-group mp-Listing-group--aside">
<div class="mp-Listing-group mp-Listing-group--top-block"><span
class="mp-Listing-price mp-text-price-label">Voir description</span><span
class="mp-Listing-seller-name"><a class="mp-TextLink"
href="/u/catawiki/38096837/">Catawiki</a></span><span
class="mp-Listing-date">Aujourd'hui</span><span class="mp-Listing-location">Toute la
Belgique<br /></span></div>
<div class="mp-Listing-group mp-Listing-group--bottom-block"><span class="mp-Listing-priority">Annonce au
top</span><span class="mp-Listing-seller-link"><a class="mp-TextLink undefined"
href="https://admarkt.2dehands.be/buyside/url/RK-f5Gyr8TS9VKWPn06TDHk8zCWeSU5-PsQDuvr5tYpoRXQYzjmhI4E8OX9dXcZb0TEQOFSDMueu3s5kqHSihdgWdlYIhSdweDBq0ckhYm7kU8NzKSx7FWvKA8-ZSJUz6PW439SHCTDUa2er4_kqge-fyr8zJemRXzISpFdvVIzVufagipJY-9jozmgnesM_bfBJxR6r0IvKWR8GYnfgv0bPsg1Ny5CQMsw4LsI33lUP_g6cYuGIcGOeEupRpJtf1sXv11G7BTj3gZAo5fvVk35hdfr5LVSJxJYsDUOxS7pdcFtkVO-0EEbZwLG3FlDYaPqLnComuKbmrSwzIW6EwfWXvr1lvifS5cOPflPSsVE319HKQ06w2vk4-4N9-E-cSXye9Yj_YHhNCJdEynvHV0XWkMkdLE_flG421UIIHVbDZdKHV429Ka7HQQSdpbyU6nQ94UsVzRfi2gEgXM18WuI96qkT8oFtqZwGrrE4wlyLuDJnPWkzaYmEwsSoPslrkv_mY66yEOLYsLolpTF3aTRU3sqv0GvZwnPkR04uZJY8GeL70uz3XaP5mYPxKz-pmCFbnJN_i9oiA_LjEIrEzSmvCEM_jViUfPB4FIib7VEi_gag5qWNYYxfkIyT4mC9Y0EKx0JbNHzyBs1062ETCiFvtPaAgconmyqW2ztnw4it_D10qAEemDppNOXKMmX_Jg-feuFKwq-MdIxiyJK3yoiKPXzMEEBa2WXqchDAPF52YmcVjq8HDORqYFkq5-iLumz6Y8ut-smKs_-vMG7k52nO3RW3RzuO0syMLBlZGiqUnADJtj0hmGmzqHXRqflq4QCTEE2vmG2flfMSIz9XJ7ECg73CP5OSNPg5VlzWfCVgd7o1TYd-rFBFXWM5Xz-ZlCA03LOZtP3BeQR3-TnSL6MNWo46vEtHq5ntcF-TrFTl4h01C5DNF_7R4W36CqQ4"
rel="noopener noreferrer nofollow" target="_blank">Visiter le site internet</a></span></div>
</div>
</div>
</li>
The idea is to fetch
<span class="mp-Listing-seller-name"><a class="mp-TextLink">
by navigating attributes, like containers.div.span....
I believe this is what you're looking for:
from bs4 import BeautifulSoup as bs

target = '''...'''  # paste the <li> snippet from the question here

page_soup = bs(target, "lxml")
containers = page_soup.find_all('li')

for container in containers:
    item = container.find_all("span", class_="mp-Listing-seller-name")
    print(item)
Output:
[<span class="mp-Listing-seller-name"><a class="mp-TextLink" href="/u/catawiki/38096837/">Catawiki</a></span>]
I am new to Python and have never worked with HTML, so any help would be appreciated.
I need to extract two numbers, '1062' and '348', from a website's inspect element.
This is my code:
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
Adv = soup.select_one ('.col-sm-6 .advDec:nth-child(1)').text[10:]
Dec = soup.select_two ('.col-sm-6 .advDec:nth-child(2)').text[10:]
The website element looks like below:
<div class="nifty-header-shade1 col-xs-12 col-sm-6 col-md-3">
<div class="row">
<div class="col-sm-12">
<h4>Stocks</h4>
</div>
<div class="col-sm-6">
<p class="advDec">Advanced: 1062</p>
</div>
<div class="col-sm-6">
<p class="advDec">Declined: 348</p>
</div>
</div>
</div>
Using my code, I am able to extract the first number (1062), but unable to extract the second number (348). Can you please help?
Assuming the pattern is always the same, you can select your elements by text and get their next_sibling:
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
Example
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
print(adv, dec)
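Note: the :-soup-contains() pseudo-class needs a reasonably recent soupsieve (the CSS selector library bundled with BeautifulSoup 4); older releases spelled it :contains(), which is now deprecated.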
If there are always exactly two elements, then the simplest way is probably to unpack the list of selected elements.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv, dec = [elm.next_sibling.strip() for elm in soup.select(".advDec a")]
print("Advanced:", adv)
print("Declined:", dec)
I am working on a scraping project and am looking to scrape from the following.
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only Christian, Islam as the output (without the 'Faith:').
This is my try:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I get this done?
There are several ways you can fix this; I would suggest the following - find all <span> in the <div> that do not have class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
from bs4 import BeautifulSoup
html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')
print(', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')]))
Output
Christian, Islam
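If you prefer find()/find_all() over CSS selectors, an equivalent sketch for the same HTML (no new assumptions) is:

faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
# keep only the spans whose class list does not include "h5"
faiths = [s.get_text(strip=True) for s in faithdiv.find_all('span')
          if 'h5' not in (s.get('class') or [])]
print(', '.join(faiths))  # Christian, Islam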
I'm practicing Python scraping on indeed.com with BeautifulSoup.
While extracting the 'job location' from [div class companyLocation],
what I want is to get the location string right after 'div class="companyLocation"'
(in the HTML below, "United States").
But in some cases there are extra 'a aria-label' or 'span' elements which contain unwanted strings such as "+1 location" etc.
I couldn't figure out how to get rid of these,
so I ask for your advice.
<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect?tk=1fgg7b6pa306m001&jk=d724dab9a2d2af2c&dest=%2Fjobs%3Fq%3Dpython%26limit%3D50%26grpKey%3DkAO5nvwVmAPOkxWgAwHyBwN0Y2w%253D" rel="nofollow">
+1 location</a></span>
<span class="remote-bullet">•</span><span>Remote</span></div>, United States+1 location•Remote
Here's my Python code for your reference.
The problem arises in the 'if a.string is None:' case.
You can see the div + span HTML clauses above with this code:
print(f"{a}, {a.text}")
import requests
from bs4 import BeautifulSoup
url = "https://www.indeed.com/jobs?q=python&limit=50"
extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})
for soup_job in soup_jobs:
for a in soup_job.select("div.companyLocation"):
if a.string is not None:
pass
#problem(below)
if a.string is None:
print(f"{a}, {a.text}")
You've mixed up the if statements; try the following instead:
import requests
from bs4 import BeautifulSoup
url = "https://www.indeed.com/jobs?q=python&limit=50"
extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})
for soup_job in soup_jobs:
for a in soup_job.select("div.companyLocation"):
if a.string is not None:
print(f"{a}, {a.text}")
Output:
<div class="companyLocation">United States</div>, United States
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Boulder, CO</div>, Boulder, CO
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Allen, TX</div>, Allen, TX
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York State</div>, New York State
<div class="companyLocation">Austin, TX</div>, Austin, TX
<div class="companyLocation">Research Triangle Park, NC</div>, Research Triangle Park, NC
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">Cary, NC</div>, Cary, NC
<div class="companyLocation">Raleigh, NC</div>, Raleigh, NC
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Bellevue, WA</div>, Bellevue, WA
<div class="companyLocation">Houston, TX</div>, Houston, TX
Now it works just fine.
Is this working?
#problem(below)
if a.string is None:
    data = ''
    # collect only the direct text children, skipping nested tags like <span> and <a>
    for child in a.children:
        if not child.name and child != '':
            data += child
    print(data)
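The same "direct text children only" idea can also be written with BeautifulSoup's own helpers; a sketch with the same assumed behaviour:

if a.string is None:
    # string=True with recursive=False yields only the text nodes that are
    # direct children of the <div>, skipping <span>/<a> descendants
    data = ''.join(a.find_all(string=True, recursive=False)).strip()
    print(data)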
I am new to Selenium, Python, and programming in general, but I am trying to write a small web scraper. I have encountered a website that has multiple links whose HTML code is not available to me when using
soup = bs4.BeautifulSoup(html, "lxml")
The HTML-Code is:
<div class="content">
<div class="vertical_page_list is-detailed">
<div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">[event]
<ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
<li class="vertical-page-list--item is-detailed infite-nodes--list-item" style="display: list-item;">
<li class="...>
...
</ul>
</div>
</div>
</div>
But soup only contains this part, missing the li classes:
<div class="content">
<div class="vertical_page_list is-detailed">
<div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">
<ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
</ul>
</div>
</div>
</div>
It has something to do with the [event] after the div, but I can't figure out what to do. My guess was that it is some lazy-loaded code, but using
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
or directly moving to the element
actions = ActionChains(driver)
actions.move_to_element(driver.find_element_by_xpath("//div['infinite-nodes=']")).perform()
did not yield any results. This is the code I am using:
# Enable headless Firefox for Selenium
options = Options()
#options.headless = True
options.add_argument("--headless")
options.page_load_strategy = 'normal'
driver = webdriver.Firefox(options=options, executable_path=r'C:\bin\geckodriver.exe')
print ("Headless Firefox Initialized")
# Load html source code from webpage
driver = webdriver.PhantomJS(executable_path=r'C:\phantomjs\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get("https://www.volkswagen-newsroom.com/de/pressemitteilungen?container_context=lg%2C1.0")
SCROLL_PAUSE_TIME = 2
# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait to load page
time.sleep(SCROLL_PAUSE_TIME)
print("Scrolled down to bottom")
# Extract html code
driver.find_element_by_xpath("//div['infinite-nodes=']").click() #just testing
time.sleep(SCROLL_PAUSE_TIME)
html = driver.page_source.encode('utf-8')
soup = bs4.BeautifulSoup(html, "lxml")
Could anyone help me please?
When you visit the page in a browser, and log your network traffic, every time the page loads (or you press the Mehr Pressemitteilungen anzeigen button) an XHR (XmlHttpRequest) request is made to some kind of API(?) - the response of which is JSON, which also contains HTML. It's this HTML that contains the list-item elements you're looking for. You don't need selenium for this:
def get_article_titles():
import requests
from bs4 import BeautifulSoup as Soup
url = "https://www.volkswagen-newsroom.com/de/pressemitteilungen"
params = {
"container_context": "lg,1.0",
"next": "1"
}
headers = {
"accept": "application/json",
"accept-encoding": "gzip, deflate",
"user-agent": "Mozilla/5.0",
"x-requested-with": "XMLHttpRequest"
}
while True:
response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
data = response.json()
params["next"] = data["next"]
soup = Soup(data["html"], "html.parser")
for tag in soup.select("h3.page-preview--title > a"):
yield tag.get_text().strip()
def main():
from itertools import islice
for num, title in enumerate(islice(get_article_titles(), 10), start=1):
print("{}.) {}".format(num, title))
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
Output:
1.) Volkswagen Konzern, BASF, Daimler AG und Fairphone starten Partnerschaft für nachhaltigen Lithiumabbau in Chile
2.) Verkehrsausschuss-Vorsitzender Cem Özdemir informiert sich über Transformation im Elektro-Werk in Zwickau
3.) Astypalea: Start der Transformation zur smarten, nachhaltigen Insel
4.) Vor 60 Jahren: Fußball-Legende Pelé zu Besuch im Volkswagen Werk Wolfsburg
5.) Novum unter den Kompakten: Neuer Polo ist mit „IQ.DRIVE Travel Assist“ teilautomatisiert unterwegs
6.) Der neue Tiguan Allspace – ab sofort bestellbar
7.) Volkswagen startet Vertriebsoffensive im deutschen Markt
8.) Vor 70 Jahren: Volkswagen erhält ersten Beirat
9.) „Experience our Volkswagen Way to Zero“ – neue Ausstellung im DRIVE. Volkswagen Group Forum für Gäste geöffnet
10.) Jetzt bestellbar: Der neue ID.4 GTX
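One note on the loop: get_article_titles() as written is an infinite generator that keeps requesting the next page, and it is the islice(..., 10) in main() that bounds it to ten titles. To fetch everything, you would need your own stopping condition, for example breaking once a response yields no new titles (an assumption about how the API behaves at the end of the list).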
Following is my code.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://steamcommunity.com/market/listings/730/Souvenir%20P2000%20%7C%20Chainmail%20%28Factory%20New%29'
# open and read
uClient = uReq(my_url)
page_html = uClient.read()
#close
uClient.close()
#html parse
page_soup = soup(page_html,"html.parser")
#grab all listings
containers = page_soup.findAll("div",{"class":"market_listing_item_name_block"})
for container in containers:
block_container = container.findAll("span",{"class":"market_listing_item_name"})
block_container returns multiple results that are all the same, except that each <span> has an
id="listing_#_name", where # is a combination of numbers that changes for each <span>.
For example -
</br></div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2060891817875196312_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/>
<span class="market_listing_game_name">Counter-Strike: Global Offensive</span>
</div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2076653149485426829_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/>
Can anyone explain how I can grab the id out of all the spans?
You can get the id from the span tags.
Try:
for container in containers:
for block_container in container.findAll("span", class_="market_listing_item_name"):
print(block_container.attrs['id'])
From Beautiful Soup Docs
A tag may have any number of attributes. The tag <b id="boldest"> has
an attribute "id" whose value is "boldest". You can access a tag's
attributes by treating the tag like a dictionary:
tag['id']
# u'boldest'
You can access that dictionary directly as .attrs:
tag.attrs
# {u'id': 'boldest'}
Reference:
Beautiful Soup Docs - Attributes
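If you also want just the numeric part out of ids like listing_2060891817875196312_name, here is a small sketch using the standard re module on top of the loop above (the id pattern is taken from your sample output):

import re

for container in containers:
    for span in container.findAll("span", class_="market_listing_item_name"):
        # capture the digits between "listing_" and "_name"
        m = re.search(r"listing_(\d+)_name", span["id"])
        if m:
            print(m.group(1))  # e.g. 2060891817875196312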
Maybe this can help you:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''</br></div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2060891817875196312_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/>
<span class="market_listing_game_name">Counter-Strike: Global Offensive</span>
</div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2076653149485426829_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/></div>'''
doc = SimplifiedDoc(html)
containers = doc.getElements(tag='div',value='market_listing_item_name_block')
for container in containers:
block_container = container.span
# or
# block_container = container.getElement(tag='span',value='market_listing_item_name')
# block_containers = container.getElements(tag='span',value='market_listing_item_name').contains('listing_',attr='id')
print (block_container.id, doc.replaceReg(block_container.id,"(listing_|_name)",""))