Scraping issue with embedded span. Don't print the nested child - python

I have some code (part of it is shown here) that uses BeautifulSoup to scrape the text from an h3:
company_name = job.find('h3', class_= 'joblist-comp-name').text.strip()
HTML looks like this:
<h3 class="joblist-comp-name">
ARK INFOSOFT
<span class="comp-more">(More Jobs) </span>
</h3>
My result looks like this:
Comapny Name: ARK INFOSOFT
(More Jobs)
As I understand it, this code also grabs the text inside the a tag, which is inside the span, which is inside the h3. I only want the text "ARK INFOSOFT". How can I avoid grabbing any other text from span or a tags within the h3?

To skip the nested span:
Find the h3 with the class you want.
Call find_next(text=True) on that tag. It returns only the first text node it finds, so the text of the nested span is excluded.
from bs4 import BeautifulSoup
html = """<h3 class="joblist-comp-name">
ARK INFOSOFT
<span class="comp-more">(More Jobs) </span>
</h3>
"""
soup = BeautifulSoup(html, "html.parser")
company_name = soup.find("h3", class_="joblist-comp-name").find_next(text=True).strip()
Another option: use .contents:
company_name = soup.find("h3", class_="joblist-comp-name").contents[0].strip()
Output (in both examples):
>>> print(company_name)
ARK INFOSOFT
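Both approaches above rely on the company name being the first thing inside the h3. If the wanted text could also come after the nested tag, one option is to collect only the text nodes that are direct children of the h3. This is a minimal sketch, assuming bs4 4.4 or later (for the string argument); own_text is a hypothetical helper name, not part of BeautifulSoup, and the span has been moved in front of the text on purpose:

```python
from bs4 import BeautifulSoup

# Same markup as in the question, but with the span placed BEFORE the text
# to show that the position of the nested tag does not matter here.
html = """<h3 class="joblist-comp-name">
<span class="comp-more">(More Jobs) </span>
ARK INFOSOFT
</h3>
"""


def own_text(tag):
    """Return the text that belongs to `tag` itself, ignoring child tags."""
    # recursive=False restricts the search to direct children of `tag`;
    # string=True keeps only text nodes, so nested tags are skipped.
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3", class_="joblist-comp-name")
company_name = own_text(h3)
print(company_name)
```

The helper joins several direct text nodes with a space, which seems a reasonable default when text sits on both sides of a nested tag.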

Related

BeautifulSoup Web Scraping - Can't access and extract element

This is my first question on Stack Overflow.
I am working on a web scraping project and am trying to access HTML elements with Beautiful Soup.
Can someone please advise me on how to extract the following elements?
The task is to scrape all job listings from a search result page.
The job listing elements are inside the "ResultsSectionContainer".
I want to access each article element and
extract its id, e.g. job-item-7460756
extract the href of the link where data-at="job-item-title"
extract its h2 text (solved)
How can I loop through the ResultsSectionContainer and extract this information for each article element (id job-item-...)?
The class name of the article elements seems to be dynamic and (I guess) changes every time a new search is done.
<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
<article class="sc-fzowVh cUgVEH" id="job-item-7465958">
...
You can do it like this:
Select the <div> whose class is ResultsSectionContainer-gdhf14-0.
Find all the <article> tags inside that <div> using .find_all(). This gives you a list of all the article tags.
Iterate over that list and extract the data you need.
from bs4 import BeautifulSoup
s = '''<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
</div>'''
soup = BeautifulSoup(s, 'lxml')
d = soup.find('div', class_='ResultsSectionContainer-gdhf14-0')
for i in d.find_all('article'):
    job_id = i['id']
    job_link = i.find('a', {'data-at': 'job-item-title'})['href']
    print(f'JOB_ID: {job_id}\nJOB_LINK: {job_link}')
JOB_ID: job-item-7460756
JOB_LINK: /stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html
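Since generated class names such as sc-fzowVh or the -gdhf14-0 suffix may change between searches, matching on the more stable attributes could be more robust. The following is only a sketch: it assumes that every listing keeps an id starting with "job-item-" and a title link with data-at="job-item-title", as in the question's snippet. The shortened href values and the second job title are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup modeled on the snippet in the question.
html = """
<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
  <article class="sc-fzowVh cUgVEH" id="job-item-7460756">
    <a class="sc-fzoiQi eRNcm" data-at="job-item-title" href="/job-7460756.html">
      <h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) </h2>
    </a>
  </article>
  <article class="sc-fzowVh cUgVEH" id="job-item-7465958">
    <a class="sc-fzoiQi eRNcm" data-at="job-item-title" href="/job-7465958.html">
      <h2 class="sc-fzqARJ iyolKq"> Softwareentwickler (m/w/d) </h2>
    </a>
  </article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

jobs = []
# [id^="job-item-"] matches any article whose id starts with that prefix,
# so the generated class names are not needed at all.
for article in soup.select('article[id^="job-item-"]'):
    link = article.select_one('a[data-at="job-item-title"]')
    if link is None:
        continue  # skip articles without a title link
    jobs.append({
        "id": article["id"],
        "href": link["href"],
        "title": link.get_text(strip=True),
    })

for job in jobs:
    print(job)
```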
If all the articles share the same classes, try this:
# data is assumed to be the parsed BeautifulSoup object
articles = data.find_all("article", attrs={"class": "sc-fzowVh cUgVEH"})
for article in articles:
    print(article.get("id"))
    print(article.a.get("href"))
    print(article.h2.text.strip())
You could do something like this:
results = soup.findAll('article', {'class': 'sc-fzowVh cUgVEH'})
for result in results:
    job_id = result.attrs['id']  # renamed from id to avoid shadowing the built-in
    href = result.find('a').attrs['href']
    h2 = result.find('h2').text.strip()  # text of the h2 only, not of the whole article
    print(f' Job id: \t{job_id}\n Job link: \t{href}\n Job desc: \t{h2}\n')
    print('---')
You may also want to prefix the href with the URL you are pulling the results from, since the links are relative.
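One way to build the full link is urljoin from the standard library. This is a small sketch; base_url is a hypothetical placeholder for whatever search page was actually scraped, and the relative href is shortened for readability:

```python
from urllib.parse import urljoin

# Hypothetical address of the search results page that was scraped.
base_url = "https://www.example.com/jobs/search?q=erp"

# A relative link as found in the href attribute (shortened here).
relative_href = "/stellenangebote--7460756-inline.html"

# urljoin resolves the href against the page URL the same way a browser would.
absolute_link = urljoin(base_url, relative_href)
print(absolute_link)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "https://other.example.org/job.html")
print(already_absolute)
```

Compared with plain string concatenation, this also copes with hrefs that are already absolute or that are relative to the current path.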

Get text from inside element without its children

I'm scraping a webpage with several p elements and I want to get the text inside them without including the text of their children.
The page is structured like this:
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
When I use
parent.find("p", {"class": "default"}).get_text()
this is the result I get:
I don't want this text
I want this text
I'm using BeautifulSoup 4 with Python 3
Edit: When I use
parent.find_all("p", {"class": "public item-cost"}, text=True, recursive=False)
It returns an empty list
You can use .find_next_sibling() with text=True parameter:
from bs4 import BeautifulSoup
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select_one(".default > div").find_next_sibling(text=True))
Prints:
I want this text
Or using .contents:
print(soup.find("p", class_="default").contents[-1])
EDIT: To strip the string:
print(soup.find("p", class_="default").contents[-1].strip())
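As a side note, the call from the question's edit most likely returns an empty list because text=True combined with a tag name only matches p tags whose .string is set, and a p with several children has no single .string. If the wanted text is not guaranteed to be the last child, another option is to work on a copy of the p, drop its child tags, and read what is left. This is a sketch under the assumption that html.parser is used, which keeps the div inside the p as written; other parsers may restructure this invalid nesting.

```python
import copy

from bs4 import BeautifulSoup

html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
p = soup.find("p", class_="default")

# Work on a copy so the original tree stays intact.
p_copy = copy.copy(p)

# find_all(True, recursive=False) returns every direct child TAG
# (text nodes are not tags), and decompose() removes each one.
for child in p_copy.find_all(True, recursive=False):
    child.decompose()

own_text = p_copy.get_text(strip=True)
print(own_text)
```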
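You can use XPath, which is a bit more complex but allows much more powerful queries. BeautifulSoup objects have no xpath() method, so this needs a tree parsed with lxml (for example tree = lxml.html.fromstring(html_doc)).
Something like this may work for you, depending on how the parser handles the <div> nested inside the <p>. It selects only the non-empty text nodes that are direct children of the p:
tree.xpath('//p[contains(@class, "default")]/text()[normalize-space()]')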

Scraping webpage with Python: how to return a list of titles of certain elements?

I had luck getting a list of telephone numbers using this code:
from lxml import html
import requests
lnk='https://docs.legis.wisconsin.gov/2019/legislators/assembly'
page=requests.get(lnk)
tree=html.fromstring(page.content)
ph_nums=tree.xpath('//span[@class="info telephone"]/text()')
print(ph_nums)
which is scraping info from an HTML element that looks like this:
<span class="info telephone">
<span class="title"><strong>Telephone</strong>:<br></span>
(608) 266-8580<br>(888) 534-0097
</span>
However, I can't do the same for this element when I change info telephone to info...
<span class="info" style="width:16em;">
<span>
<a id="A">
<strong></strong></a><strong>Jenkins, Leroy t</strong> <small>(R - Madison)</small>
</span>
<br>
<span style="width:8em;"><small>District 69</small></span>
<br>
<span style="width:8em;">Details</span>
<br>
<span style="width:8em;">
Website
</span>
<br>
<br>
</span>
since there are multiple titles in this element, whereas "info telephone" only had one. How would I return separate lists, each with a different piece of info (i.e. a list of names and a list of districts, in this scenario)?
FYI - I am not educated in HTML (and hardly experienced in Python) so I would appreciate a simplified explanation.
For this task I would recommend the BeautifulSoup Package for Python.
You don't have to deeply understand HTML to use it (I don't!), and it offers a very friendly approach to find certain items from a web page.
Your first example could be rewritten as follows:
from bs4 import BeautifulSoup
# soup holds the parsed HTML of the page
soup = BeautifulSoup(page.content, 'lxml')
# the find_all method finds all nodes in page.content whose type is 'span'
# and whose class is 'info telephone'
info_tels = soup.find_all('span', {"class": "info telephone"})
The info_tels list contains all instances of <span class="info telephone"> in your document. We can then go through it to pull out what's relevant:
list_tels = []
for tel in info_tels:
    tel_text = tel.text  # extracts the text from the info telephone node
    tel_text = tel_text.replace("\nTelephone:\n", "").replace('\n', "")  # removes the "Telephone:" part and line breaks
    tel_text = tel_text.strip()  # removes leading and trailing whitespace
    list_tels.append(tel_text)
You can do something similar for the 'info' class:
info_class = soup.find_all('span', {"class": "info"})
And then find the elements you want to put into lists:
info_class[0].find_all('a')[1].text #returns you the first name
The challenge here is to identify which tags or classes hold the names, districts, and so on. In your first example this is relatively clear ('span', {"class": "info telephone"}), but the "info" class contains various data points with no specific, identifiable type.
For instance, the '<span>' tag appears multiple times in your file, each time with a different data point (District, Details, etc.).
I came up with a small solution for the District problem. You might get inspired to tackle the other information too!
list_districts = []
for info in info_class:
    try:
        district_contenders = info.find_all('span', {'style': "width:8em;"})
        for element in district_contenders:
            if 'District' in element.text:
                list_districts.append(element.text)
    except:
        pass
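Building on the same idea, here is a rough sketch that fills a list of names and a list of districts in one pass. It assumes the markup looks like the snippets in the question: the name is the first non-empty strong inside the "info" span, and the district is a small whose text starts with "District". Note that searching for the class "info" also matches "info telephone", because BeautifulSoup compares against each class separately, so those spans are skipped explicitly. The list names are illustrative only.

```python
from bs4 import BeautifulSoup

# Markup condensed from the two snippets in the question.
html = """
<span class="info telephone">
<span class="title"><strong>Telephone</strong>:<br></span>
(608) 266-8580<br>(888) 534-0097
</span>
<span class="info" style="width:16em;">
<span>
<a id="A">
<strong></strong></a><strong>Jenkins, Leroy t</strong> <small>(R - Madison)</small>
</span>
<br>
<span style="width:8em;"><small>District 69</small></span>
<br>
<span style="width:8em;">Details</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

list_names = []
list_districts = []
for info in soup.find_all("span", class_="info"):
    # Keep only spans whose class attribute is exactly "info".
    if info.get("class") != ["info"]:
        continue

    # The name is assumed to be the first <strong> that is not empty.
    for strong in info.find_all("strong"):
        text = strong.get_text(strip=True)
        if text:
            list_names.append(text)
            break

    # The district is assumed to be a <small> starting with "District".
    for small in info.find_all("small"):
        text = small.get_text(strip=True)
        if text.startswith("District"):
            list_districts.append(text)

print(list_names)
print(list_districts)
```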

How to output a list of links from the following code

I want to output a series of links I've scraped from a website. The html is in a pretty standard hierarchy: div, h4, a, href.
Using Python and BeautifulSoup I've pulled the list out using the following script:
for record in soup.findAll('div',{"class":"title"}):
    print(record)
which outputs the following info as a repeating series:
<div class="title">
<h4>
<a href="[the link]">[the text]</a>
</h4>
So far, so good.
I then want to pull out the links alone. For some reason I can't separate them from the surrounding text.
I've tried the following script:
print(record.href) #outputs "None"
print(record.findAll('a',{"href"})) #outputs "[]"
print(record.findAll('h4',{"a":"href"})) #outputs "[]"
Any pointers as to where I'm going wrong?
You can just use findAll again and then access the href value via ["href"]:
from bs4 import BeautifulSoup
html = """<div class="title">
<h4>
<a href="[the link]">[the text]</a>
</h4>"""
soup = BeautifulSoup(html, "html.parser")
for record in soup.findAll("div", {"class": "title"}):
    print(record.findAll("a")[0]["href"])
Which prints:
[the link]
If there is more than one <a> inside the <div>, you can use a loop again, of course.
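For the case with several links, a CSS selector can replace the nested loop. A minimal sketch follows; the sample markup and link targets are invented for illustration. As background, record.href printed None in the question because attribute-style access on a tag looks for a child tag named href, not for an attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with zero, one, and two links per div.
html = """
<div class="title"><h4><a href="/first">First</a></h4></div>
<div class="title"><h4><a href="/second">Second</a> <a href="/third">Third</a></h4></div>
<div class="title"><h4><a>No link here</a></h4></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.title a[href]" selects every <a> that has an href attribute
# and sits anywhere inside a <div class="title">.
links = [a["href"] for a in soup.select("div.title a[href]")]
print(links)
```

Requiring [href] in the selector avoids a KeyError for anchors that have no href attribute.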

How to extract data(text) using beautiful soup when they are in the same class?

I'm working on a personal project where I scrape data from a website. I'm trying to use beautiful soup to do this but I came across data in the same class but a different attribute. For example:
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
How do I just get $11.99/kg? Right now I'm getting
$11.99 /kg
$5.44 /lb.
I've done x.select('.pi--secondary-price') but it returns both prices. How do I only get 1 price ($11.99 /kg)?
You could first get the <abbr> tag and then search for the respective parent tag. Like this:
from bs4 import BeautifulSoup
html = '''
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
kg = soup.find(title="Kilogram")
print(kg.parent.text)
This gives you the desired output $11.99 /kg. For more information, see the BeautifulSoup documentation.
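If both prices are needed later, another option is to key them by the title of their abbr and then pick the unit you want. This is a sketch that assumes every price span contains an abbr with a title attribute, as in the question's markup; prices is just an illustrative name.

```python
from bs4 import BeautifulSoup

html = '''
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

prices = {}
for span in soup.select(".pi--secondary-price .pi--price"):
    abbr = span.find("abbr")
    if abbr is None or not abbr.has_attr("title"):
        continue  # skip spans that do not follow the assumed structure
    # Map the unit name (e.g. "Kilogram") to the full price text.
    prices[abbr["title"]] = span.get_text(strip=True)

print(prices["Kilogram"])
```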
