I want to find an a element in a soup object by a substring present in its class name. This particular element will always have JobTitle inside the class name, with random preceding and trailing characters, so I need to locate it by its substring of JobTitle.
You can see the element here:
It's safe to assume there is only one a element to find, so using find should work. However, my attempts (there have been more than the two shown below) have not worked. I've also included the enclosing elements in case they're relevant for locating it.
I'm on Windows 10, Python 3.10.5, and BS4 4.11.1.
I've created a reproducible example below (I thought the regex way would have worked, but I guess not):
import re
from bs4 import BeautifulSoup
# Parse this HTML, getting the only a['href'] in it (line 22)
html_to_parse = """
<li>
<div class="cardOutline tapItem fs-unmask result job_5ef6bf779263a83c sponsoredJob resultWithShelf sponTapItem desktop vjs-highlight">
<div class="slider_container css-g7s71f eu4oa1w0">
<div class="slider_list css-kyg8or eu4oa1w0">
<div class="slider_item css-kyg8or eu4oa1w0">
<div class="job_seen_beacon">
<div class="fe_logo">
<img alt="CyberCoders logo" class="feLogoImg desktop" src="https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/f0b43dcaa7850e2110bc8847ebad087b" />
</div>
<table cellpadding="0" cellspacing="0" class="jobCard_mainContent big6_visualChanges" role="presentation">
<tbody>
<tr>
<td class="resultContent">
<div class="css-1xpvg2o e37uo190">
<h2 class="jobTitle jobTitle-newJob css-bdjp2m eu4oa1w0" tabindex="-1">
<a aria-label="full details of REMOTE Senior Python Developer" class="jcs-JobTitle css-jspxzf eu4oa1w0" data-ci="385558680" data-empn="8690912762161442" data-hide-spinner="true" data-hiring-event="false" data-jk="5ef6bf779263a83c" data-mobtk="1g9u19rmn2ea6000" data-tu="https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=521&rx_c=&rx_campaign=indeed16&rx_group=110383&rx_source=Indeed&job=KE2-168714218&rx_r=none&rx_ts=20220808T034442Z&rx_pre=1&indeed=sp" href="/pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3" id="sj_5ef6bf779263a83c" role="button" target="_blank">
<span id="jobTitle-5ef6bf779263a83c" title="REMOTE Senior Python Developer">REMOTE Senior Python Developer</span>
</a>
</h2>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</li>
"""
# Soupify it
soup = BeautifulSoup(html_to_parse, "html.parser")
# Start by making sure "find_all("a")" works
all_links = soup.find_all("a")
print(all_links)
# Good.
# Attempt 1
job_url = soup.find('a[class*="JobTitle"]').a['href']
print(job_url)
# Nope.
# Attempt 2
job_url = soup.find("a", {"class": re.compile("^.*jobTitle.*")}).a['href']
print(job_url)
# Nope...
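To find an element by a partial class name, use a CSS selector with select or select_one rather than passing the selector to find, which treats its first argument as a tag name. This will give you the <a> tag itself, and the href is an attribute of it: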
job_url = soup.select_one('a[class*="JobTitle"]')['href']
print(job_url)
# /pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3
The CSS selector only works with the .select() method.
See documentation here.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Change your code to something like
job_links = soup.select('a[class*="JobTitle"]')
print(job_links)
for job_link in job_links:
    print(job_link.get("href"))
job_url = soup.select('a[class*="JobTitle"]')[0]["href"]
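To make the difference between the two lookups concrete, here is a minimal sketch on a shortened copy of the markup (the trimmed HTML and the variable names are assumptions for illustration). It also shows that find can do the same job when given a compiled regex. The pattern has to be spelled JobTitle, because the original attempt's jobTitle does not occur in the a tag's classes, and the trailing .a has to go, because the match already is the a tag.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
html_doc = """
<h2 class="jobTitle jobTitle-newJob css-bdjp2m">
  <a class="jcs-JobTitle css-jspxzf eu4oa1w0" href="/pagead/clk?mo=r">
    <span>REMOTE Senior Python Developer</span>
  </a>
</h2>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() looks for a tag literally named 'a[class*="JobTitle"]', so it finds nothing.
not_found = soup.find('a[class*="JobTitle"]')

# CSS attribute selector: substring match on the class attribute.
via_css = soup.select_one('a[class*="JobTitle"]')

# find() with a compiled regex, matched against the tag's class values.
via_regex = soup.find("a", class_=re.compile("JobTitle"))

print(not_found)
print(via_css["href"])
print(via_regex["href"])
```

Both successful lookups return the a tag, so the href is read directly from the result.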
I have HTML that looks roughly like this (shortened):
<div id="activities" class="ListItems">
<h2>Standards</h2>
<ul>
<li>
<a class="Title" href="http://www.google.com" >Guidelines on management</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
</li>
</ul>
</div>
<div class="DocList">
<h3>Reports</h3>
<p class="SupLink">+ <a href="http://www.google.com/test" >View More</a></p>
<ul>
<li class="pdf">
<a class="Title" href="document.pdf" target="_blank" >Document</a>
<span class="Size">
[1,542.3KB]
</span>
<div class="Info">
<p>
text <a href="http://www.google.com" >Read more</a>
</p>
<p class="Date">
14/03/2018
</p>
</div>
</li>
</ul>
</div>
I am trying to select the value in 'href=' under 'a class="Title"' by using this code:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
But I get two returns, the one under 'div class="DocList"' is also returned.
I am trying to change my xpath expressions so that I would only look within the node but I cannot get it to work.
Could someone please help me understand how to "search" within a specific node. I have gone through multiple xpath documentations but I cannot seem to figure it out.
Using // you are already selecting all the a elements in the document.
To search in a specific div try specifying the parent with // and then use //a again to look anywhere in the div
//div[@class="ListItems"]//a[@class="Title"]
for node in tree.xpath('//div[@class="ListItems"]//a[@class="Title"]'):
    url2.append(node.get("href"))
Try this xpath expression to select the div with a specific id recursively :
'//div[@id="activities"]//a[@class="Title"]'
so :
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//div[@id="activities"]//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
Note :
It's always better to select by id than by class, because an id should be unique (in real life you sometimes find bad markup with the same id repeated on one page), whereas a class can be repeated any number of times.
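As a rough offline check of the scoping idea, the sketch below runs both expressions against a trimmed copy of the HTML from the question instead of a live page. The SAMPLE string and variable names are assumptions for illustration, and lxml must be installed.

```python
from lxml import html

# Trimmed, hypothetical copy of the markup from the question.
SAMPLE = """
<div>
  <div id="activities" class="ListItems">
    <ul><li><a class="Title" href="http://www.google.com">Guidelines on management</a></li></ul>
  </div>
  <div class="DocList">
    <ul><li class="pdf"><a class="Title" href="document.pdf">Document</a></li></ul>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE.strip())

# Unscoped: every a.Title anywhere in the document.
all_titles = tree.xpath('//a[@class="Title"]/@href')

# Scoped: only a.Title elements below the div whose id is "activities".
scoped_titles = tree.xpath('//div[@id="activities"]//a[@class="Title"]/@href')

print(all_titles)
print(scoped_titles)
```

Appending /@href to the expression returns the attribute values directly, which avoids the explicit loop over nodes.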
I'm trying to parse the following HTML in Python using Beautiful Soup. I would like to search for the text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do the same for the other tags, so that for a given text category I can return its corresponding information.
However, I'm finding it very difficult to find the right code to do this.
<h2>Details</h2>
<div class="section-inner">
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Recommended Use</span>
<span class="_KDu">For cell phone</span>
</div>
<div class="_JDu">
<span class="_IDu">Protection</span>
<span class="_KDu">Impact protection</span>
</div>
<div class="_JDu">
<span class="_IDu">Cover Type</span>
<span class="_KDu">Back cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
</div>
I use the following code to retrieve my div tag
soup.find_all("div", "_JDu")
Once I have retrieved the tag I can navigate inside it but I can't find the right code that will enable me to find the text inside one tag and return the text in the tag after it.
Any help would be really really appreciated as I'm new to python and I have hit a dead end.
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text
color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text=key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. Only thing left is getting the text of second <span>, which is what the line key_tag.find_all('span')[1].text does.
Give this a go; it can also give you the corresponding values. To see how it works, put the HTML in a triple-quoted string assigned to a variable named content.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for elem in soup.select("._JDu"):
    item = elem.select_one("span")
    if "Features" in item.text: #try to see if it misses the corresponding values
        val = item.find_next("span").text
        print(val)
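If every category is needed rather than one key at a time, the same pairing of label span and value span can be collected into a dictionary in a single pass. This is only a sketch on a shortened copy of the markup; the content string and the details name are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the markup from the question.
content = """
<div class="_UCu">
  <h3 class="_mEu">General</h3>
  <div class="_JDu">
    <span class="_IDu">Color</span>
    <span class="_KDu">Slate, mykonos</span>
  </div>
</div>
<div class="_UCu">
  <h3 class="_mEu">Carrying Case</h3>
  <div class="_JDu">
    <span class="_IDu">Type</span>
    <span class="_KDu">Protective cover</span>
  </div>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

details = {}
for row in soup.find_all("div", class_="_JDu"):
    spans = row.find_all("span")
    # Skip rows that do not have both a label span and a value span.
    if len(spans) >= 2:
        details[spans[0].get_text(strip=True)] = spans[1].get_text(strip=True)

print(details["Color"])
print(details)
```

A lookup such as details.get("Features") then returns None instead of raising when a category is absent.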
My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
from bs4 import BeautifulSoup as bs
import requests as req
import pprint as pp
url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
#pp.pprint(soup) Verify that the page has been found
all_items = soup.find_all('li',class_= 'top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags fitting those criteria.
Inspect Element in Chrome does show the list items.
However, in the page source the <ul class="top10-items"> is empty and is followed by a script template, which seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
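The behaviour described above can be reproduced offline with a small sketch. The page_source string is a trimmed, hypothetical copy of the markup, and the variable names are assumptions. The parser keeps the contents of the script tag as one raw string, so the li inside it is invisible to find_all until that string is parsed again. Note that the second parse only recovers the template placeholders such as {{productName}}; the real product names are filled in by JavaScript at runtime and are not present in this HTML.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the page source from the question.
page_source = """
<div class="container">
  <ul class="top10-items loading" id="bestselling-top10">
  </ul>
  <script class="X-template-top10" type="text/mustache-template">
  {{#topList}}
  <li class="top10-item">
    <a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
  </li>
  {{/topList}}</script>
</div>
"""

soup = BeautifulSoup(page_source, "html.parser")

# The li only exists as text inside the script tag, so nothing matches here.
direct_hits = soup.find_all("li", class_="top10-item")
print(direct_hits)

# Take the raw template text and parse it as HTML in its own right.
script = soup.find("script", class_="X-template-top10")
template_soup = BeautifulSoup(str(script.string), "html.parser")
template_items = template_soup.find_all("li", class_="top10-item")
placeholders = [a.get_text(strip=True) for a in template_soup.select("a.item-desc")]
print(placeholders)
```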
This is my first question, so please forgive me if I have explained anything wrong.
I am trying to scrape URLs from a specific website in Python and write the links to a CSV. The problem is that when I parse the page with BeautifulSoup I can't extract the URLs, because all I get is <div id="dvScores" style="min-height: 400px;">\n</div> and nothing under that branch. But when I open the browser console, copy the table that holds the links, and paste it into a text editor, I get 600 pages of HTML. What I want is a for loop that prints the links. The structure of the HTML is below:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
#shadow-root (open)
<head>...</head>
<body>
<div id="body">
<div id="wrapper">
#multiple divs but i don't need them
<div id="live-master"> #what I need is under this div
<span id="contextual">
#multiple divs but i don't need them
<div id="live-score-master"> #what I need is under this div
<div ng-app="live-menu" id="live-score-rightcoll">
#multiple divs but i don't need them
<div id="left-score-lefttemp" style="padding-top: 35px;">
<div id="dvScores">
<table cellspacing=0 ...>
<colgroup>...</colgroup>
<tbody>
<tr class="row line-bg1"> #this changes to bg2 or bg3
<td class="row">
<span class="row">
<a href="www.example.com" target="_blank" class="td_row">
#I need to extract this link
</span>
</td>
#Multiple td's
</tr>
#multiple tr class="row line-bg1" or "row line-bg2"
.
.
.
</tbody>
</table>
</div>
</div>
</div>
</div>
</span>
</div>
</div>
</body>
</html>
What am I doing wrong? I want Python to automate this, rather than pasting the HTML into a text file and extracting the links with a regex.
My python code is below also:
import requests
from bs4 import BeautifulSoup
r=requests.get("http://example.com/example")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all("span",id="contextual")
span=all[0].find_all("tbody")
If you are trying to scrape URLs, then you should get the hrefs:
urls = soup.find_all('a', href=True)
This site uses JavaScript to populate its content, so you can't get the URLs with BeautifulSoup alone. If you inspect the network tab in your browser, you can spot the request used in the code below. Its response contains all the data you need, so you can simply parse it and extract the desired values.
import requests
req = requests.get('http://goapi.mackolik.com/livedata?group=0').json()
for el in req['m'][4:100]:
    index = el[0]
    team_1 = el[2].replace(' ', '-')
    team_2 = el[4].replace(' ', '-')
    print('http://www.mackolik.com/Mac/{}/{}-{}'.format(index, team_1, team_2))
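The URL-building step can be tried without network access by separating it from the request. The sketch below is only illustrative: build_match_urls and the sample rows are hypothetical, and the only thing taken from the answer above is the assumption that positions 0, 2 and 4 of each row hold the match index and the two team names.

```python
def build_match_urls(matches, base="http://www.mackolik.com/Mac"):
    """Build one URL per row, using the row layout assumed in the answer above."""
    urls = []
    for el in matches:
        index = el[0]
        team_1 = el[2].replace(" ", "-")
        team_2 = el[4].replace(" ", "-")
        urls.append("{}/{}/{}-{}".format(base, index, team_1, team_2))
    return urls


# Hypothetical rows; real rows would come from the JSON response.
sample_rows = [
    [101, None, "Team A", None, "Team B"],
    [102, None, "Real Example", None, "FC Sample"],
]

match_urls = build_match_urls(sample_rows)
for url in match_urls:
    print(url)
```

Keeping the function free of I/O also makes it easy to write the resulting list to a CSV afterwards.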
It seems the HTML is generated dynamically by JavaScript, so you would need something that behaves like a browser to render it. Note that a requests session only persists cookies and connection settings between requests; it does not execute JavaScript:
session = requests.session()
data = session.get("http://website.com").content  # usage example
After this you can do the parsing, additional scraping, etc.
Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all class="g" divs to grabbing just the items of interest in that specific div, but I just get None returned or nothing printed.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is within each <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
    site = item.find('cite')
    comment = item.find('span', {'class': 'st'})
    print site
    print comment
I have also attempted stepping into the initial div and finding all;
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
print site
print comment
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right, but still nothing was printed. So I took another look at the soup, and it looks different. I was previously looking at response.text from requests. I can only think that BeautifulSoup modifies response.text, or that I somehow got the sample completely wrong the first time (not sure how). Below is the new sample, based on what I see when I print the soup, and below that is my attempt to get to the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried Alecxe's approach without success so far. Am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
First get the div with class srg, then find all divs with class g inside it, and get the text of the cite (site) and span (comment) in each. Below is the code that works for me:
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
    site = data.find('cite', {'class': '_Rm'})
    comment = data.find('span', {'class': 'st'})
    if site:  # Check that site is not None
        if site.text.strip() not in sites:
            sites.append(site.text.strip())
    if comment:  # Check that comment is not None
        if comment.text.strip() not in comments:
            comments.append(comment.text.strip())
print(sites)
print(comments)
Output-
[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']
EDIT--
Why your code does not work
For try One-
You are using result = main.find('div', {'class': 'g'}), which grabs only the first matching element, and that first element has no div with class s inside it. So the rest of the code cannot work.
For try Two-
You are printing site and comment outside the for loop, so only the values from the last iteration are printed. Print inside the loop instead.
soup = BeautifulSoup(html,'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print(site.text)  # Grab text
    print(comment.text)
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g" located inside the div element with class="srg" - div.srg div.g cite CSS selector would find us exactly what we are asking about:
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Also, make sure you are using BeautifulSoup version 4:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup
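The sibling lookup above assumes every cite has a span with class st next to it. A minimal sketch that tolerates a missing span is shown here; the html_doc sample, with one complete result and one without a snippet, is a hypothetical stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second result has no <span class="st">.
html_doc = """
<div class="srg">
  <div class="g"><div class="s"><div>
    <cite>http://www.url.com.stuff/here</cite>
    <span class="st">Some info on url</span>
  </div></div></div>
  <div class="g"><div class="s"><div>
    <cite>http://www.example.org/no-snippet</cite>
  </div></div></div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

results = []
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    # find_next_sibling returns None when there is no matching sibling.
    snippet = span.get_text(strip=True) if span is not None else None
    results.append((cite.get_text(strip=True), snippet))

print(results)
```

Checking for None before calling get_text avoids the AttributeError that would otherwise stop the loop at the first incomplete result.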