Matching specific table within HTML, BeautifulSoup


There are several similar tables on the page I'm trying to scrape.
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
The only difference between them is the text within the h2 tags, here: Points
How can I specify which table I need to search in?
I have this code and need to adjust it to take the h2 text into account:
my_tab = soup.find('table', {'class':'tabelle_grafik lh'})
Need some help guys.

This works for me. Walk each table's previous siblings: if the nearest h2 you reach has the text "Points" rather than some other text, you've found the right table.
from bs4 import BeautifulSoup
t = """
<h2 class="tabellen_ueberschrift al">Points</h2>
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>yes me!</td></tr></table>
<h2 class="tabellen_ueberschrift al">Bad</h2>
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>woo woo</td></tr></table>
"""
soup = BeautifulSoup(t, 'html.parser')
for ta in soup.find_all('table'):
    # Siblings are visited from nearest to farthest
    for s in ta.find_previous_siblings():
        if s.name == 'h2':
            if s.text == 'Points':
                print(ta)
            # Only the nearest preceding h2 decides, so stop here
            break
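
The sibling walk above assumes the table sits directly beside its h2. In the question's markup the table is nested in a div that follows the heading, so a variation is to find the h2 by its text and take the next matching table in document order. A minimal sketch, assuming bs4 with the built-in html.parser; the markup is adapted from the question and the cell contents are made up:

```python
from bs4 import BeautifulSoup

html = """
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1"><tr><td>a</td></tr></table>
</div>
<h2 class="tabellen_ueberschrift al">Illegal</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1"><tr><td>b</td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its exact text
heading = soup.find("h2", string="Points")

# Take the first matching table that comes after the heading in document order
points_table = None
if heading is not None:
    points_table = heading.find_next("table", class_="tabelle_grafik")

print(points_table)
```

Note that find_next does not stop at the next h2, so if the "Points" section had no table at all this would return a table from a later section.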

This looks like a job for XPath, but BeautifulSoup doesn't support XPath expressions.
Consider switching to lxml or Scrapy.
For example, given test markup like:
<html>
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">a</table>
</div>
<h2 class="tabellen_ueberschrift al">Illegal</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">b</table>
</div>
</html>
the XPath expression to find the table with class "tabelle_grafik lh" in the div after the h2 with text "Points" is:
//table[@class="tabelle_grafik lh" and ../preceding-sibling::h2[1][text()="Points"]]
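
For reference, a sketch of evaluating that expression with lxml. It assumes lxml is installed, and the test markup gets table rows added so the cell text sits in well-formed cells:

```python
from lxml import html as lxml_html

page = """<html><body>
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1"><tr><td>a</td></tr></table>
</div>
<h2 class="tabellen_ueberschrift al">Illegal</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1"><tr><td>b</td></tr></table>
</div>
</body></html>"""

tree = lxml_html.fromstring(page)

# The table's parent div must have "Points" as its nearest preceding h2 sibling
expr = ('//table[@class="tabelle_grafik lh" and '
        '../preceding-sibling::h2[1][text()="Points"]]')
tables = tree.xpath(expr)

for table in tables:
    print(table.text_content().strip())
```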

Related

Find "a" element in BS4 by partial class name not working?

I want to find an a element in a soup object by a substring present in its class name. This particular element will always have JobTitle inside the class name, with random preceding and trailing characters, so I need to locate it by its substring of JobTitle.
It's safe to assume there is only 1 a element to find, so using find should work, however my attempts (there have been more than the 2 shown below) have not worked. I've also included the top elements in case it's relevant for location for some reason.
I'm on Windows 10, Python 3.10.5, and BS4 4.11.1.
I've created a reproducible example below (I thought the regex way would have worked, but I guess not):
import re
from bs4 import BeautifulSoup
# Parse this HTML, getting the only a['href'] in it (line 22)
html_to_parse = """
<li>
<div class="cardOutline tapItem fs-unmask result job_5ef6bf779263a83c sponsoredJob resultWithShelf sponTapItem desktop vjs-highlight">
<div class="slider_container css-g7s71f eu4oa1w0">
<div class="slider_list css-kyg8or eu4oa1w0">
<div class="slider_item css-kyg8or eu4oa1w0">
<div class="job_seen_beacon">
<div class="fe_logo">
<img alt="CyberCoders logo" class="feLogoImg desktop" src="https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/f0b43dcaa7850e2110bc8847ebad087b" />
</div>
<table cellpadding="0" cellspacing="0" class="jobCard_mainContent big6_visualChanges" role="presentation">
<tbody>
<tr>
<td class="resultContent">
<div class="css-1xpvg2o e37uo190">
<h2 class="jobTitle jobTitle-newJob css-bdjp2m eu4oa1w0" tabindex="-1">
<a aria-label="full details of REMOTE Senior Python Developer" class="jcs-JobTitle css-jspxzf eu4oa1w0" data-ci="385558680" data-empn="8690912762161442" data-hide-spinner="true" data-hiring-event="false" data-jk="5ef6bf779263a83c" data-mobtk="1g9u19rmn2ea6000" data-tu="https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=521&rx_c=&rx_campaign=indeed16&rx_group=110383&rx_source=Indeed&job=KE2-168714218&rx_r=none&rx_ts=20220808T034442Z&rx_pre=1&indeed=sp" href="/pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3" id="sj_5ef6bf779263a83c" role="button" target="_blank">
<span id="jobTitle-5ef6bf779263a83c" title="REMOTE Senior Python Developer">REMOTE Senior Python Developer</span>
</a>
</h2>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</li>
"""
# Soupify it
soup = BeautifulSoup(html_to_parse, "html.parser")
# Start by making sure "find_all("a")" works
all_links = soup.find_all("a")
print(all_links)
# Good.
# Attempt 1
job_url = soup.find('a[class*="JobTitle"]').a['href']
print(job_url)
# Nope.
# Attempt 2
job_url = soup.find("a", {"class": re.compile("^.*jobTitle.*")}).a['href']
print(job_url)
# Nope...

To match an element by a partial class name with a CSS selector, use select (or select_one), not find. This gives you the <a> tag itself, and the href is one of its attributes:
job_url = soup.select_one('a[class*="JobTitle"]')['href']
print(job_url)
# /pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3
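
As a side note, find can also do partial class matching when it is given a compiled regular expression instead of a CSS selector string. The pattern is case-sensitive, which matters here because the link's class is jcs-JobTitle while the h2's is jobTitle. A minimal sketch on a shortened version of the markup; the href value is a made-up placeholder:

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup; "/job/123" is hypothetical
html = (
    '<h2 class="jobTitle jobTitle-newJob">'
    '<a class="jcs-JobTitle css-jspxzf" href="/job/123">'
    '<span>REMOTE Senior Python Developer</span></a></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# CSS attribute-substring selector: requires select / select_one
by_css = soup.select_one('a[class*="JobTitle"]')

# Regex filter on the class attribute: works with find / find_all
by_regex = soup.find("a", class_=re.compile("JobTitle"))

print(by_css["href"])
print(by_regex["href"])
```

In both cases the result is already the a tag, so the href is read from it directly rather than through a further .a lookup.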

CSS selectors only work with the .select() and .select_one() methods; find() does not parse them.
See the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Change your code to something like:
job_links = soup.select('a[class*="JobTitle"]')
print(job_links)
for job_link in job_links:
    print(job_link.get("href"))

# select() takes a CSS selector string, not an attribute dict, so the class test goes in the selector
job_url = soup.select('a[class*="JobTitle"]')[0]["href"]

Element finding with repeated tags in python selenium

This is the html I have on a website:
<table class="table table-fixed table-header-right text-medium">
<tbody><tr><th class="no-border">Certification Number</th><td class="no-border">48487270</td></tr>
<tr>
<th>Label Type</th>
<td>
<img width="69" height="38" class="margin-right-min" alt="" aria-hidden="true" src="https://i.psacard.com/psacard/images/cert/table-image-ink.png" style="">
<span class="inline-block padding-top-min">with fugitive ink technology</span>
</td>
</tr>
<tr><th>Reverse Cert Number/Barcode</th><td>Yes</td></tr>
<tr><th>Year</th><td>2020</td></tr>
<tr><th>Brand</th><td>TOPPS</td></tr>
<tr><th>Sport</th><td>BASEBALL CARDS</td></tr>
<tr><th>Card Number</th><td>20</td></tr>
<tr><th>Player</th><td>ARISTIDES AQUINO</td></tr>
<tr><th>Variety/Pedigree</th><td></td></tr>
<tr><th>Grade</th><td>NM-MT 8</td></tr>
</tbody></table>
I am trying to find a way to read the year into a variable. I normally locate elements with XPath, but these tags are repeated many times with no other identifying attributes, so I am unsure how to go about it. The year will change, so I can't search by its text. Any help would be appreciated.

Use BeautifulSoup to find the <th> tag with the text 'Year'. Then find the next <td> tag and extract the text from that:
from bs4 import BeautifulSoup
html = '''<table class="table table-fixed table-header-right text-medium">
<tbody><tr><th class="no-border">Certification Number</th><td class="no-border">48487270</td></tr>
<tr>
<th>Label Type</th>
<td>
<img width="69" height="38" class="margin-right-min" alt="" aria-hidden="true" src="https://i.psacard.com/psacard/images/cert/table-image-ink.png" style="">
<span class="inline-block padding-top-min">with fugitive ink technology</span>
</td>
</tr>
<tr><th>Reverse Cert Number/Barcode</th><td>Yes</td></tr>
<tr><th>Year</th><td>2020</td></tr>
<tr><th>Brand</th><td>TOPPS</td></tr>
<tr><th>Sport</th><td>BASEBALL CARDS</td></tr>
<tr><th>Card Number</th><td>20</td></tr>
<tr><th>Player</th><td>ARISTIDES AQUINO</td></tr>
<tr><th>Variety/Pedigree</th><td></td></tr>
<tr><th>Grade</th><td>NM-MT 8</td></tr>
</tbody></table>'''
soup = BeautifulSoup(html, 'html.parser')
year = soup.find('th', string='Year').find_next('td').text
print(year)
Output:
2020

First, collect the matching web elements with driver.find_elements, using the class name, and then pick the element you need from the returned list by index.
Alternatively, store all the th/td elements in a list and search that list for the Year header to find the value next to it.
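
Since the question mentions XPath, another option is to anchor on the header text instead of the value: select the th whose text is Year and take its following td sibling. The sketch below evaluates the expression with lxml so it can run without a browser. It is an assumption, not tested here, that the same string can be passed to Selenium as driver.find_element(By.XPATH, year_xpath).text:

```python
from lxml import html as lxml_html

# Shortened copy of the table from the question
page = (
    '<table class="table table-fixed table-header-right text-medium"><tbody>'
    '<tr><th>Reverse Cert Number/Barcode</th><td>Yes</td></tr>'
    '<tr><th>Year</th><td>2020</td></tr>'
    '<tr><th>Brand</th><td>TOPPS</td></tr>'
    '</tbody></table>'
)
tree = lxml_html.fromstring(page)

# Match the header cell by its label, then step to the value cell beside it
year_xpath = '//th[normalize-space()="Year"]/following-sibling::td[1]'
cells = tree.xpath(year_xpath)

year = cells[0].text_content().strip() if cells else None
print(year)
```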

Extract Text from HTML Python (BeautifulSoup, RE, Other Option?)

I am familiar with BeautifulSoup and Regular Expressions as a means of extracting text from HTML but not as familiar with others, such as ElementTree, Minidom, etc.
My question is fairly straightforward. Given the HTML snippet below, which library is best for extracting the integer stored in the data-tooltip attribute?
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>

With BeautifulSoup it is fairly straight-forward:
from bs4 import BeautifulSoup
data = """
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.td['data-tooltip'])
If you have multiple td elements and you need to extract the data-tooltip from each one:
for td in soup.find_all('td', {'data-tooltip': True}):
    print(td['data-tooltip'])
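
The attribute value comes back as a string with thousands separators. If an actual integer is needed, one way is to strip the commas before converting. A small sketch, assuming the value always uses commas as separators:

```python
from bs4 import BeautifulSoup

data = '<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant=""></td>'
soup = BeautifulSoup(data, "html.parser")

raw = soup.td["data-tooltip"]            # the string "7,944,796"
popularity = int(raw.replace(",", ""))   # drop separators, then convert

print(popularity)
```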

Right way to strip tags except some in python

For example, I have HTML that contains markup like this:
<a class="some" href="some" onclick="return false;">anchor</a>
<table id="some">
<tr>
<td class="some">
</td>
</tr>
</table>
<p class="" style="">content</p>
I want to remove all tag attributes and keep only certain tags (for example table, tr, td, th), so that I get something like this:
anchor
<table>
<tr>
<td>
</td>
</tr>
</table>
<p>content</p>
I do it with a for loop, but my code retrieves each tag and cleans it one at a time, and I think that approach is slow.
What can you suggest? Thanks.
Update #1
In my solution I use this code for removing tags (borrowed from Django):
import re

def remove_tags(html, tags):
    """Returns the given HTML with given tags removed."""
    # tags is a space-separated string of tag names
    tags = [re.escape(tag) for tag in tags.split()]
    tags_re = '(%s)' % '|'.join(tags)
    # Match opening tags (with or without attributes) and closing tags
    starttag_re = re.compile(r'<%s(/?>|(\s+[^>]*>))' % tags_re, re.U)
    endtag_re = re.compile('</%s>' % tags_re)
    html = starttag_re.sub('', html)
    html = endtag_re.sub('', html)
    return html
And this code to clean HTML attributes:
# But this code doesn't remove empty tags (without content, etc.) like this `<div><img></div>`
import lxml.html.clean
html = 'Some html code'
safe_attrs = lxml.html.clean.defs.safe_attrs
cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
html = cleaner.clean_html(html)

Use BeautifulSoup.
html = """
<a class="some" href="some" onclick="return false;">anchor</a>
<table id="some">
<tr>
<td class="some">
</td>
</tr>
</table>
<p class="" style="">content</p>
"""
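from bs4 import BeautifulSoup
# The lxml parser adds the <html>/<body> wrapper seen in the output
soup = BeautifulSoup(html, "lxml")
# Reset the attributes of the td and the table to empty dicts
soup.table.tr.td.attrs = {}
soup.table.attrs = {}
print(soup.prettify())
Output: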
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table>
<tr>
<td>
</td>
</tr>
</table>
<p class="" style="">
content
</p>
</body>
</html>
To clear a tag's contents:
soup = BeautifulSoup(html, "lxml")
soup.table.clear()
print(soup.prettify())
Output:
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table id="some">
</table>
<p class="" style="">
content
</p>
</body>
</html>
To delete a particular attribute:
soup = BeautifulSoup(html, "lxml")
td_tag = soup.table.td
del td_tag['class']
print(soup.prettify())
Output:
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table id="some">
<tr>
<td>
</td>
</tr>
</table>
<p class="" style="">
content
</p>
</body>
</html>
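
The snippets above touch individual tags. To do what the question asks across a whole document, one possible approach is a single pass over every tag that clears the attributes of tags on a whitelist and unwraps the rest, so their contents stay in place. A rough sketch, assuming bs4 with html.parser; KEEP_TAGS, strip_markup, and the sample string are hypothetical names and data chosen for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist of tags to keep (without their attributes)
KEEP_TAGS = {"a", "table", "tr", "td", "th", "p"}

def strip_markup(html, keep=KEEP_TAGS):
    """Drop all attributes from kept tags and unwrap every other tag."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns every tag as a list built up front,
    # so the tree can be modified safely while looping over it
    for tag in soup.find_all(True):
        if tag.name in keep:
            tag.attrs = {}
        else:
            tag.unwrap()  # remove the tag itself but keep its children
    return str(soup)

sample = '<div id="x"><p class="" style="">content</p><span class="s">more</span></div>'
cleaned = strip_markup(sample)
print(cleaned)
```

Replacing tag.unwrap() with tag.decompose() would instead delete the unwanted tags together with everything inside them.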

What you are looking for is called parsing.
BeautifulSoup is one of the most popular libraries for parsing HTML.
You can use it to remove tags, and it is well documented.
If for some reason you cannot use BeautifulSoup, look into Python's re module.

Add parent tags with beautiful soup

I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add a new parent tag around each of these, so that the code snippet becomes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 such that they enclose the identified tags. insert() and insert_before() only add content inside or beside the identified tags, not around them.
I started by trying string manipulation:
for tags in soup.find_all(attrs={"footnote"}):
    tags = BeautifulSoup("""<div class="footnote-out">"""+str(tags)+("</div>"))
but I believe this isn't the best course.
Thanks for any help. Just started using bs/bs4 but can't seem to crack this.

How about this:
def wrap(to_wrap, wrap_in):
    # Swap the wrapper into the tree, then move the original tag inside it
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>", "html.parser")
wrap(soup.a, soup.new_tag("b"))
print(soup.body)
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
    new_tag = soup.new_tag("div")
    new_tag['class'] = 'footnote-out'
    wrap(footnote, new_tag)
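
Current bs4 releases also ship a built-in Tag.wrap() method that does the same thing as the helper above. A short sketch on a trimmed version of the question's snippet, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = '<div class="footnote" id="footnote-1"><h3>Reference:</h3></div>'
soup = BeautifulSoup(html, "html.parser")

for footnote in soup.find_all("div", class_="footnote"):
    wrapper = soup.new_tag("div")
    wrapper["class"] = "footnote-out"
    # wrap() puts the footnote inside the new tag, in the footnote's old position
    footnote.wrap(wrapper)

result = str(soup)
print(result)
```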
