This is my first question on Stack Overflow.
I am working on a web scraping project and am trying to access HTML elements with Beautiful Soup.
Can someone please advise me on how to extract the following elements?
The task is to scrape all job listings from a search result page.
The job listing elements are inside the "ResultsSectionContainer".
I want to access each article element and:
extract its id, e.g. job-item-7460756
extract its href where data-at="job-item-title"
extract its h2 text (solved)
How can I loop through the ResultsSectionContainer and access/extract this information for each article element / job-item id?
The name of the article class is somehow dynamic/unique and (I guess) changes every time a new search is done.
<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
<article class="sc-fzowVh cUgVEH" id="job-item-7465958">
...
You can do it like this:
Select the <div> with the class name ResultsSectionContainer-gdhf14-0.
Find all the <article> tags inside that <div> using .find_all(); this gives you a list of all article tags.
Iterate over that list and extract the data you need.
from bs4 import BeautifulSoup
s = '''<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
</div>'''
soup = BeautifulSoup(s, 'lxml')
d = soup.find('div', class_='ResultsSectionContainer-gdhf14-0')
for i in d.find_all('article'):
    job_id = i['id']
    job_link = i.find('a', {'data-at': 'job-item-title'})['href']
    print(f'JOB_ID: {job_id}\nJOB_LINK: {job_link}')
Output:
JOB_ID: job-item-7460756
JOB_LINK: /stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html
If all the article classes are the same, try this:
articles = data.find_all("article", attrs={"class": "sc-fzowVh cUgVEH"})
for article in articles:
    print(article.get("id"))
    print(article.a.get("href"))
    print(article.h2.text.strip())
You could do something like this:
results = soup.find_all('article', {'class': 'sc-fzowVh cUgVEH'})
for result in results:
    id = result.attrs['id']
    href = result.find('a').attrs['href']
    h2 = result.find('h2').text.strip()
    print(f' Job id: \t{id}\n Job link: \t{href}\n Job desc: \t{h2}\n')
    print('---')
You may also want to prefix the href with the URL of the site you're pulling the results from.
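A minimal sketch of that prefixing step with the stdlib urljoin; the base URL here is an assumption, so substitute the site the results actually came from:

```python
from urllib.parse import urljoin

# base URL is an assumption -- substitute the site you scraped the results from
base_url = 'https://www.example.com'
relative_href = '/stellenangebote--Wirtschaftsinformatiker--7460756-inline.html'

# urljoin resolves the site-relative href against the base URL
full_link = urljoin(base_url, relative_href)
print(full_link)  # https://www.example.com/stellenangebote--Wirtschaftsinformatiker--7460756-inline.html
```

Unlike naive string concatenation, urljoin also handles cases where the base already ends in a slash or the href is absolute.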
How does one access a link from HTML divs?
Here is the HTML I am trying to scrape; I want to get the href value:
<div class="item-info-wrap">
<div class="news-feed_item-meta icon-font-before icon-espnplus-before"> <span class="timestamp">5d</span><span class="author">Field Yates</span> </div>
<h1> <a name="&lpos=nfl:feed:5:news" href="/nfl/insider/story/_/id/31949666/six-preseason-nfl-trades-teams-make-imagining-deals-nick-foles-xavien-howard-more" class=" realStory" data-sport="nfl" data-mptype="story">
Six NFL trades we'd love to see in August: Here's where Foles could help, but it's not the Colts</a></h1>
<p>Nick Foles is running the third team in Chicago. Xavien Howard wants out of Miami. Let's project six logical deals.</p></div>
Here is the code I have been trying to use to access the href value:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.espn.com/nfl/team/_/name/phi/philadelphia-eagles').text
soup = BeautifulSoup(source, 'lxml')

for article in soup.find_all('div', class_='item-info-wrap'):
    headline = article.h1.a.text
    print(headline)
    summary = article.p.text
    print(summary)
    try:
        link_src = article.h1.a.href  # Having difficulty getting href value
        print(link_src)
        link = f'https://espn.com/{link_src}'
    except Exception as e:
        link = None
    print(link)
The output I am getting is https://espn.com/None for every ESPN article. Appreciate any help and feedback!
If you change the line link_src = article.h1.a.href as shown below, it should work.
link_src = article.h1.a["href"]
FYI https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
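A minimal sketch of why .href returns None: dot access on a bs4 Tag looks up a child tag with that name, not an attribute, so attribute values need subscripting or .get() (the HTML below is a trimmed stand-in for the ESPN markup):

```python
from bs4 import BeautifulSoup

html = '<h1><a href="/nfl/story" class="realStory">Headline</a></h1>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.h1.a

print(a.href)         # None -- this searches for a child <href> tag, which doesn't exist
print(a['href'])      # /nfl/story -- subscripting reads the attribute
print(a.get('href'))  # /nfl/story -- .get() returns None instead of raising if the attribute is missing
```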
I have some code (part of it shown below) where I use BeautifulSoup to scrape the text from an h3:
company_name = job.find('h3', class_= 'joblist-comp-name').text.strip()
HTML looks like this:
<h3 class="joblist-comp-name">
ARK INFOSOFT
<span class="comp-more">(More Jobs) </span>
</h3>
My result looks like this:
Comapny Name: ARK INFOSOFT
(More Jobs)
As I understand it, this code grabs all the text inside the h3, including the span nested within it. I only want the text "ARK INFOSOFT". How can I avoid grabbing the text of any spans or a tags inside the h3?
In order to not get the nested span:
Find the tag you want by class.
Call the find_next(text=True) method on the found tag; it returns only the first text node and excludes the nested span.
from bs4 import BeautifulSoup
html = """<h3 class="joblist-comp-name">
ARK INFOSOFT
<span class="comp-more">(More Jobs) </span>
</h3>
"""
soup = BeautifulSoup(html, "html.parser")
company_name = soup.find("h3", class_="joblist-comp-name").find_next(text=True).strip()
Another option: use .contents:
company_name = soup.find("h3", class_="joblist-comp-name").contents[0].strip()
Output (in both examples):
>>> print(company_name)
ARK INFOSOFT
I am trying to save the contents of each article in its own text file. What I am having trouble with is coming up with a Beautiful Soup approach that returns only articles of type News while ignoring the other article types.
Website in question: https://www.nature.com/nature/articles
Info
Every article is enclosed in a pair of <article> tags
Each article type is hidden inside a <span> tag containing the data-test attribute with the article.type value.
Title to the article is placed inside the <a> tag with the data-track-label="link" attribute.
The article body is wrapped in a <div> tag (look for "body" in the class attribute).
Current code
I was able to get to the point where I can query the <span> for articles of the News type, but I am struggling with the next steps to return the other article-specific information.
How can I take this further? For articles of type News, I'd like to also return the article's title and body, while ignoring articles that are not of type News.
# Send HTTP requests
import requests
from bs4 import BeautifulSoup

class WebScraper:
    @staticmethod
    def get_the_source():
        # Obtain the URL
        url = 'https://www.nature.com/nature/articles'
        # Get the webpage
        r = requests.get(url)
        # Check response object's status code
        if r:
            the_source = open("source.html", "wb")
            soup = BeautifulSoup(r.content, 'html.parser')
            type_news = soup.find_all("span", string='News')
            for i in type_news:
                print(i.text)
            the_source.write(r.content)
            the_source.close()
            print('\nContent saved.')
        else:
            print(f'The URL returned {r.status_code}!')

WebScraper.get_the_source()
Sample HTML for an article that is of type News
The source code has 19 other articles with similar and different article types.
<article class="u-full-height c-card c-card--flush" itemscope itemtype="http://schema.org/ScholarlyArticle">
<div class="c-card__image">
<picture>
<source
type="image/webp"
srcset="
//media.springernature.com/w165h90/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 160w,
//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 290w"
sizes="
(max-width: 640px) 160px,
(max-width: 1200px) 290px,
290px">
<img src="//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg"
alt=""
itemprop="image">
</picture>
</div>
<div class="c-card__body u-display-flex u-flex-direction-column">
<h3 class="c-card__title" itemprop="name headline">
<a href="/articles/d41586-021-00485-2"
class="c-card__link u-link-inherit"
itemprop="url"
data-track="click"
data-track-action="view article"
data-track-label="link">Mars arrivals and Etna eruption — February's best science images</a>
</h3>
<div class="c-card__summary u-mb-16 u-hide-sm-max"
itemprop="description">
<p>The month’s sharpest science shots, selected by <i>Nature's</i> photo team.</p>
</div>
<div class="u-mt-auto">
<ul data-test="author-list" class="c-author-list c-author-list--compact u-mb-4">
<li itemprop="creator" itemscope="" itemtype="http://schema.org/Person"><span itemprop="name">Emma Stoye</span></li>
</ul>
<div class="c-card__section c-meta">
<span class="c-meta__item c-meta__item--block-at-xl" data-test="article.type">
<span class="c-meta__type">News</span>
</span>
<time class="c-meta__item c-meta__item--block-at-xl" datetime="2021-03-05" itemprop="datePublished">05 Mar 2021</time>
</div>
</div>
</div>
</article>
</div>
</li>
<li class="app-article-list-row__item">
<div class="u-full-height" data-native-ad-placement="false">
The simplest way, which also gives you more results per request, is to add the type into the query string as a param:
https://www.nature.com/nature/articles?type=news
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nature.com/nature/articles?type=news')
soup = bs(r.content, 'lxml')
news_articles = soup.select('.app-article-list-row__item')
for n in news_articles:
    print(n.select_one('.c-card__link').text)
A variety of params for page 2 of news:
https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&page=2
If you monitor the browser's network tab while filtering on the page manually, or while selecting different page numbers, you can see how the query strings are constructed and tailor your requests accordingly, e.g.
https://www.nature.com/nature/articles?type=news&year=2021
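As a sketch, the same query strings can be assembled programmatically with the stdlib urlencode, using the parameter names visible in the URLs above:

```python
from urllib.parse import urlencode

# parameter names taken from the query strings shown above
params = {'searchType': 'journalSearch', 'sort': 'PubDate', 'type': 'news', 'page': 2}
url = 'https://www.nature.com/nature/articles?' + urlencode(params)
print(url)  # https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&page=2
```

This keeps the parameters readable and handles URL escaping for you if any value contains special characters.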
Otherwise, you could do a more convoluted (in/ex)clusion with CSS selectors, based on whether article nodes have a specific child containing "News" (inclusion); the exclusions cover "News" combined with another word/symbol (as per the categories list):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nature.com/nature/articles')
soup = bs(r.content, 'lxml')
news_articles = soup.select('.app-article-list-row__item:has(.c-meta__type:contains("News"):not( \
    :contains("&"), \
    :contains("in"), \
    :contains("Career"), \
    :contains("Feature")))')  # exclusions

for n in news_articles:
    print(n.select_one('.c-card__link').text)
You can remove categories from the :not() list if you do want "News &", "News in", etc.
If you don't want to filter via the URL, loop through the <article> elements and check the text of the element with class c-meta__type:
articles = soup.select('article')
for article in articles:
    article_type = article.select_one('.c-meta__type').text.strip()
    if article_type == 'News':
        # or, if the type only needs to contain News:
        # if 'News' in article_type:
        title = article.select_one('a').text
        summary = article.select_one('.c-card__summary p').text
        print("{}: {}\n{}\n\n".format(article_type, title, summary))
I want to scrape a list of player names from a website, but the names are stored in label attributes. I don't know how to scrape text held in labels.
Here is the link
https://athletics.baruch.cuny.edu/sports/mens-swimming-and-diving/roster
For example, the HTML contains the following. How can I scrape the text from these labels?
<div class="sidearm-roster-player-image column">
<a data-bind="click: function() { return true; }, clickBubble: false" href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555" aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
<img class="lazyload" data-src="/images/2018/10/19/GREGORY_BECKER.jpg?width=80" alt="GREGORY BECKER">
</a>
</div>
You can use the .get() method in BeautifulSoup. First select your element into elem (or any other variable) using any selector or find/find_all, then try:
print(elem.get('aria-label'))
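A minimal, self-contained sketch using the roster HTML from the question; the selector and the " - View Full Bio" suffix are assumptions based on that snippet:

```python
from bs4 import BeautifulSoup

html = '''
<div class="sidearm-roster-player-image column">
  <a href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555"
     aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
    <img class="lazyload" alt="GREGORY BECKER">
  </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# select only the anchors inside the player-image divs that carry an aria-label
for elem in soup.select('.sidearm-roster-player-image a[aria-label]'):
    # the player name is the part of the label before " - View Full Bio"
    name = elem.get('aria-label').split(' - ')[0]
    print(name)  # Gregory Becker
```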
Below is code that will help you extract the name from the a tags:
from bs4 import BeautifulSoup

with open("<path-to-html-file>") as fp:
    soup = BeautifulSoup(fp, 'html.parser')  # parse the html

tags = soup.find_all('a')  # get all the a tags
for tag in tags:
    print(tag.get('aria-label'))  # get the required text
I've tried to get the link from a Google Maps element, which looks like this:
<div class="something1">
<span class="something2"></span>
<a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
from which I only want to get https://www.google.com/maps/dir//11111/#22222
My code is
gpslocation = []
for gps in (secondpage_parser.find("a", {"data-track-id":"Google Map"})):
    gpslocation.append(gps.attrs["href"])
I'm scraping a blog website across two pages (a main page and a second page); this element is on the second page. Other info like the story title or author name appears as text, so I can use get_text(). But in this case I could not get the link in the href. Please help.
P.S. In the end I only want the latitude and longitude from the link (11111 and 22222); is there a way to use str.rsplit for that?
Thank you so much.
You can use the following:
secondpage_parser.find("a", {"data-track-id":"Google Map"})['href']
Use soup.find(...)['href'] to get the href of a single link, or soup.find_all('a', href=True) for all links that have an href.
Yes, you can use split to get only the lat and long:
First split on // and take the last element ([-1]).
Then split on /# to get both lat and long.
from bs4 import BeautifulSoup
data = """
<div class="something1">
<span class="something2"></span>
<a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for gps in soup.find_all('a', href=True):
    href = gps['href']
    print(href)
    lati, longi = href.split("//")[-1].split('/#')
    print(lati)
    print(longi)