How to scrape aria-label text in python? - python

I want to scrape the list of player names from a website, but the names only appear in aria-label attributes. How can I scrape the text of these labels?
Here is the link:
https://athletics.baruch.cuny.edu/sports/mens-swimming-and-diving/roster
For example, the HTML contains:
<div class="sidearm-roster-player-image column">
<a data-bind="click: function() { return true; }, clickBubble: false" href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555" aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
<img class="lazyload" data-src="/images/2018/10/19/GREGORY_BECKER.jpg?width=80" alt="GREGORY BECKER">
</a>
</div>

You can use the .get() method in BeautifulSoup. First select your element into elem (or any other variable) using a selector or find/find_all, then try:
print(elem.get('aria-label'))

The code below extracts the name from each a tag:
from bs4 import BeautifulSoup

with open("<path-to-html-file>") as fp:
    soup = BeautifulSoup(fp, 'html.parser')  # parse the HTML
tags = soup.find_all('a')  # get all the a tags
for tag in tags:
    print(tag.get('aria-label'))  # label text, or None if the attribute is missing
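
As a minimal sketch of the same idea, the snippet below parses the HTML fragment from the question directly (so no file is needed), keeps only the links that actually carry an aria-label, and strips the trailing " - View Full Bio" text. The SUFFIX constant and the assumption that every label ends with it are mine, based only on the single example shown.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (data-bind attribute omitted for brevity)
html = """
<div class="sidearm-roster-player-image column">
<a href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555" aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
<img class="lazyload" data-src="/images/2018/10/19/GREGORY_BECKER.jpg?width=80" alt="GREGORY BECKER">
</a>
</div>
"""

SUFFIX = " - View Full Bio"  # assumption: every player label ends with this text

soup = BeautifulSoup(html, "html.parser")
names = []
# attrs={"aria-label": True} matches only <a> tags that have the attribute
for tag in soup.find_all("a", attrs={"aria-label": True}):
    label = tag["aria-label"]
    if label.endswith(SUFFIX):
        names.append(label[: -len(SUFFIX)])

print(names)
```

Filtering on the attribute avoids the None values that tag.get('aria-label') returns for links without a label.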

Related

How to scrape <span > and next <p>?

I am trying to scrape some information from a webpage using Selenium. From <span id='text'> I want to extract the id value (text), and from the same div I want to extract the <p> element.
Here is what I have tried:
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the website and retrieve the HTML code of the webpage
response = requests.get('https://www.osha.gov/laws-regs/regulations/standardnumber/1926/1926.451#1926.451(a)(6)')
html = response.text
# Parse the HTML code using Beautiful Soup to extract the desired information
soup = BeautifulSoup(html, 'html.parser')
# find all <a> elements on the page with name attribute
links = soup.find_all('a', attrs={'name': True})
print(links)
linq = []
for link in links:
    #print(link['name'])
    linq.append(link['name'])
information = soup.find_all('p')  # find all <p> elements on the page
# This is how I did it
with open('osha.txt', 'w') as f:
    for i in range(len(linq)):
        f.write(linq[i])
        f.write('\n')
        f.write(information[i].get_text())  # text of the i-th <p>
        f.write('\n')
        f.write('-' * 50)
        f.write('\n')
Below is the HTML code.
What I want to save in a separate text file is this information:
1926.451(a)
Capacity
<div class="field--item">
<div class="paragraph paragraph--type--regulations-standard-number paragraph--view-mode--token">
<span id="1926.451(a)">
<a href="/laws-regs/interlinking/standards/1926.451(a)" name="1926.451(a)">
1926.451(a)
</a>
</span>
<div class="field field--name-field-standard-paragraph-body-p">
<p>"Capacity"</p>
</div>
</div>
</div>
Some of the a tags and paragraphs might be missing on the page, so use a try/except block to handle that.
Use a CSS selector to get the parent node, then get the respective child nodes.
Use a DataFrame to store the values and export them to a CSV file.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the website and retrieve the HTML code of the webpage
response = requests.get('https://www.osha.gov/laws-regs/regulations/standardnumber/1926/1926.451#1926.451(a)(6)')
html = response.text
code = []
para = []
# Parse the HTML code using Beautiful Soup to extract the desired information
soup = BeautifulSoup(html, 'html.parser')
for item in soup.select(".field.field--name-field-reg-standard-number .field--item"):
    try:
        code.append(item.find("a").text.strip())
    except AttributeError:
        # no <a> in this item: fall back to the <span> text
        code.append(item.find("span").text.strip())
    try:
        para.append(item.find("p").text.strip())
    except AttributeError:
        # no <p> in this item: store a placeholder
        para.append("Nan")
df = pd.DataFrame({"code": code, "paragraph": para})
print(df)
df.to_csv("path/to/filename")
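
The parent-then-children approach can also be tried offline. The sketch below runs against only the HTML fragment from the question, so the selector div.field--item is an assumption that fits this fragment rather than the full page, and explicit None checks stand in for try/except.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question
html = """
<div class="field--item">
<div class="paragraph paragraph--type--regulations-standard-number paragraph--view-mode--token">
<span id="1926.451(a)">
<a href="/laws-regs/interlinking/standards/1926.451(a)" name="1926.451(a)">
1926.451(a)
</a>
</span>
<div class="field field--name-field-standard-paragraph-body-p">
<p>"Capacity"</p>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.select("div.field--item"):
    span = item.find("span", id=True)  # the <span> whose id holds the standard number
    para = item.find("p")              # the paragraph in the same item, if any
    code = span["id"] if span is not None else None
    text = para.get_text(strip=True) if para is not None else None
    rows.append((code, text))

print(rows)
```

Each tuple pairs a standard number with the paragraph from the same parent node, which avoids the mismatch that comes from collecting all a tags and all p tags in two independent lists.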

How Can I Get Information From An A Tag Between Two Span Tags in BeautifulSoup Using Python?

I am trying to get information from the <a> tag in between these two span tags
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>
For example I would like to be able to get the value of the href in it. How can I do this?
You can look for the mentioned-123 class and then access the href with:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the markup shown above
print(soup.find("a", class_="mentioned-123")["href"])
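
If the numeric part of the class varies between links, a variation is to select the a tag through its parent span instead. This is only a sketch on the fragment from the question; the selector span.mentioned > a assumes the link is always a direct child of that span.

```python
from bs4 import BeautifulSoup

# HTML fragment from the question (the >> in the link text is written as entities)
html = """
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">&gt;&gt;28669</a>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
# select_one returns the first match, or None when nothing matches
link = soup.select_one("span.mentioned > a")
href = link.get("href") if link is not None else None
print(href)
```

Using .get("href") together with the None check avoids an exception when the link or the attribute is absent.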

BeautifulSoup Web Scraping - Can't access and extract element

This is my first question on Stack Overflow.
I am working on a web scraping project and am trying to access HTML elements with Beautiful Soup.
Can someone please advise me how to extract the following elements?
The task is to scrape all job listings from a search result page.
The job listing elements are inside the "ResultsSectionContainer".
I want to access each article element and
extract its id, e.g. job-item-7460756
extract its href where data-at="job-item-title"
extract its h2 text (solved)
How can I loop through the ResultsSectionContainer and extract this information for each article element (id job-item-...)?
The class name of the article is somehow dynamic/unique and changes (I guess) every time a new search is done.
<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
<article class="sc-fzowVh cUgVEH" id="job-item-7465958">
...
You can do it like this:
Select the <div> whose class name is ResultsSectionContainer-gdhf14-0.
Find all the <article> tags inside that <div> using .find_all(). This gives you a list of all article tags.
Iterate over that list and extract the data you need.
from bs4 import BeautifulSoup

s = '''<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
</div>'''
soup = BeautifulSoup(s, 'lxml')
d = soup.find('div', class_='ResultsSectionContainer-gdhf14-0')
for i in d.find_all('article'):
    job_id = i['id']
    job_link = i.find('a', {'data-at': 'job-item-title'})['href']
    print(f'JOB_ID: {job_id}\nJOB_LINK: {job_link}')
JOB_ID: job-item-7460756
JOB_LINK: /stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html
If all the article classes are the same, try this:
# data is the BeautifulSoup object for the page
articles = data.find_all("article", attrs={"class": "sc-fzowVh cUgVEH"})
for article in articles:
    print(article.get("id"))
    print(article.a.get("href"))
    print(article.h2.text.strip())
You could do something like this:
results = soup.find_all('article', {'class': 'sc-fzowVh cUgVEH'})
for result in results:
    job_id = result.attrs['id']  # avoid shadowing the built-in id()
    href = result.find('a').attrs['href']
    h2 = result.find('h2').text.strip()
    print(f' Job id: \t{job_id}\n Job link: \t{href}\n Job desc: \t{h2}\n')
    print('---')
You may also want to prefix href with the URL you are pulling the results from.
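
One way to do that prefixing is urllib.parse.urljoin from the standard library. In this sketch BASE_URL is a hypothetical placeholder, since the question does not give the address of the search page; the href is the one from the example above.

```python
from urllib.parse import urljoin

# Hypothetical address of the search results page (not given in the question)
BASE_URL = "https://www.example.com/jobs/search"

# Relative link as extracted from the <a data-at="job-item-title"> tag
href = ("/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-"
        "ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html")

# Because href starts with "/", urljoin keeps the scheme and host of BASE_URL
# and replaces its path with href
full_url = urljoin(BASE_URL, href)
print(full_url)
```

urljoin also leaves an href untouched when it is already an absolute URL, so it is safe to apply to every extracted link.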

How to extract href url from html anchor using lxml?

I am trying to extract the next-page href string using lxml.
For example, I want to extract "/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" from the following HTML:
<nav rel="nav" class="pagination-container AjaxPager">
<a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next" data-track-link="{'target': 'Company profile', 'name': 'navigation', 'navigationType': 'next'}">
Next page
</a>
</nav>
I have tried the following, but it returns a list, not the string that I am looking for:
import requests
import lxml.html as html

URL = "https://uk.trustpilot.com/review/bulb.co.uk"
page = requests.get(URL)
tree = html.fromstring(page.content)
href = tree.xpath('//a/@href')
Any idea what I am doing wrong?
Making this change to your code:
href = tree.xpath('//a[@class="button button--primary next-page"]/@href')
href[0]
gives me this output:
'/review/bulb.co.uk?b=MTYxOTk1ODMxMzAwMHw2MDhlOWEyOWY5ZjQ4NzA4ZTA4MjMxNTE'
which is close to the output in your question (the value may change dynamically).
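
xpath() always returns a list, so the string has to be taken out of it explicitly. The sketch below works offline on the fragment from the question (with the data-track-link attribute left out) and matches on rel="next" rather than the full class string; that choice is an assumption of mine that the pagination link keeps this attribute.

```python
import lxml.html as html

# HTML fragment from the question, data-track-link attribute omitted
snippet = """
<nav rel="nav" class="pagination-container AjaxPager">
<a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next">
Next page
</a>
</nav>
"""

tree = html.fromstring(snippet)
# @href selects the attribute value; the result is a list of strings
hrefs = tree.xpath('//a[@rel="next"]/@href')
# Take the first match, or None when there is no next page
next_href = hrefs[0] if hrefs else None
print(next_href)
```

Checking for an empty list before indexing avoids an IndexError on the last page, where no next-page link is expected.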

How can I get the link from href in "a" with class name by using python 3

I've tried to get the Google Maps link from this element:
<div class="something1">
<span class="something2"></span>
<a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
From this I would only like to get https://www.google.com/maps/dir//11111/#22222
My code is:
gpslocation = []
for gps in (secondpage_parser.find("a", {"data-track-id":"Google Map"})):
    gpslocation.append(gps.attrs["href"])
I'm using two pages (main and secondpage) to scrape a blog website, and this element is on the second page. Other fields such as Story-Title or Author Name work because they appear as text, so I can use get_text().
But in this case I could not get the link from href. Please help.
P.S. If I only want the latitude and longitude in the link (11111 and 22222), is there a way to use str.rsplit?
Thank you so much
You can use the following:
secondpage_parser.find("a", {"data-track-id":"Google Map"})['href']
Use soup.find(...)['href'] to get the href of the first matching link, or soup.find_all('a' ... , href=True) to get all links that have an href.
Yes, you can use split to get only the latitude and longitude:
First split on // and take the last element with [-1].
Then split on /# to get both the latitude and the longitude.
from bs4 import BeautifulSoup

data = """
<div class="something1">
<span class="something2"></span>
<a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for gps in soup.find_all('a', href=True):
    href = gps['href']
    print(href)
    lati, longi = href.split("//")[-1].split('/#')
    print(lati)
    print(longi)
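
As an alternative sketch that uses rsplit, as the question asks, the URL can first be broken into parts with urllib.parse.urlsplit from the standard library. It assumes the link always has the shape shown in the question, with the latitude as the last path segment and the longitude after the #.

```python
from urllib.parse import urlsplit

href = "https://www.google.com/maps/dir//11111/#22222"

parts = urlsplit(href)
# parts.path is "/maps/dir//11111/": drop the trailing slash,
# then keep only the last path segment
lati = parts.path.rstrip("/").rsplit("/", 1)[-1]
# Everything after "#" is the URL fragment
longi = parts.fragment

print(lati)
print(longi)
```

Splitting the parsed URL rather than the raw string keeps the logic independent of the scheme and host part of the link.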
