A href catching - python

I'm using BeautifulSoup for parsing some html. Here is the content:
<tr>
<th>Your provider:</th>
<td>
<img src="/isp_logos/la-la-la.ico" alt=""/>
<a href="/isp/SomeProvider">
Provider name </a>
<a href="http://*/isp-comparer/?isp=000000">
</a>
</td>
</tr>
I need to get the SomeProvider text from the link. My code is:
contentSoup = BeautifulSoup(ThatHtml)
print contentSoup.findAll('a', href=re.compile('/isp/(.*)'))
The result is an empty array. Why? Is there another way to do this?

With your posted code and input, I'm getting:
[<a href="/isp/SomeProvider"> Provider name </a>]
as the returned array. Are you using the newest 3.1.x version of BeautifulSoup? I actually had the same problem, but it turned out I had downloaded the 2.x version of BeautifulSoup, thinking that the 2.x meant it was compatible with Python 2.x.
Assuming that the first <a> tag is the one containing SomeProvider, you could just use:
contentSoup.a
to extract that tag.
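If you actually want the SomeProvider text rather than the whole tag, a minimal sketch (assuming ThatHtml holds the markup above) could capture it from the href with the same pattern:
import re
link = contentSoup.find('a', href=re.compile('/isp/(.*)'))
if link is not None:
    # pull "SomeProvider" out of "/isp/SomeProvider"
    print(re.search('/isp/(.+)', link['href']).group(1))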

Related

Find "a" element in BS4 by partial class name not working?

I want to find an a element in a soup object by a substring present in its class name. This particular element will always have JobTitle somewhere in the class name, with random preceding and trailing characters, so I need to locate it by the JobTitle substring.
It's safe to assume there is only one a element to find, so using find should work; however, my attempts (there have been more than the two shown below) have not worked. I've also included the enclosing elements in case they're somehow relevant for locating it.
I'm on Windows 10, Python 3.10.5, and BS4 4.11.1.
I've created a reproducible example below (I thought the regex way would have worked, but I guess not):
import re
from bs4 import BeautifulSoup
# Parse this HTML, getting the only a['href'] in it (line 22)
html_to_parse = """
<li>
<div class="cardOutline tapItem fs-unmask result job_5ef6bf779263a83c sponsoredJob resultWithShelf sponTapItem desktop vjs-highlight">
<div class="slider_container css-g7s71f eu4oa1w0">
<div class="slider_list css-kyg8or eu4oa1w0">
<div class="slider_item css-kyg8or eu4oa1w0">
<div class="job_seen_beacon">
<div class="fe_logo">
<img alt="CyberCoders logo" class="feLogoImg desktop" src="https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/f0b43dcaa7850e2110bc8847ebad087b" />
</div>
<table cellpadding="0" cellspacing="0" class="jobCard_mainContent big6_visualChanges" role="presentation">
<tbody>
<tr>
<td class="resultContent">
<div class="css-1xpvg2o e37uo190">
<h2 class="jobTitle jobTitle-newJob css-bdjp2m eu4oa1w0" tabindex="-1">
<a aria-label="full details of REMOTE Senior Python Developer" class="jcs-JobTitle css-jspxzf eu4oa1w0" data-ci="385558680" data-empn="8690912762161442" data-hide-spinner="true" data-hiring-event="false" data-jk="5ef6bf779263a83c" data-mobtk="1g9u19rmn2ea6000" data-tu="https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=521&rx_c=&rx_campaign=indeed16&rx_group=110383&rx_source=Indeed&job=KE2-168714218&rx_r=none&rx_ts=20220808T034442Z&rx_pre=1&indeed=sp" href="/pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3" id="sj_5ef6bf779263a83c" role="button" target="_blank">
<span id="jobTitle-5ef6bf779263a83c" title="REMOTE Senior Python Developer">REMOTE Senior Python Developer</span>
</a>
</h2>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</li>
"""
# Soupify it
soup = BeautifulSoup(html_to_parse, "html.parser")
# Start by making sure "find_all("a")" works
all_links = soup.find_all("a")
print(all_links)
# Good.
# Attempt 1
job_url = soup.find('a[class*="JobTitle"]').a['href']
print(job_url)
# Nope.
# Attempt 2
job_url = soup.find("a", {"class": re.compile("^.*jobTitle.*")}).a['href']
print(job_url)
# Nope...
To find an element by a partial class name you need to use select, not find. This will give you the <a> tag; the href will be in it:
job_url = soup.select_one('a[class*="JobTitle"]')['href']
print(job_url)
# /pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3
CSS selectors only work with the .select() method. See the documentation here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Change your code to something like
job_links = soup.select('a[class*="JobTitle"]')
print(job_links)
for job_link in job_links:
print(job_link.get("href"))
job_url = soup.select("a", {"class": "jobTitle"})[0]["href"]

Python XPath keeps returning empty list

XPath via lxml in Python has been making me run in circles. I can't get it to extract text from an HTML table despite having what I believe to be the correct XPath. I'm using Chrome to inspect and extract the XPath, then using it in my code.
Here is the HTML table taken directly from the page:
<div id="vehicle-detail-model-specs-container">
<table id="vehicle-detail-model-specs" class="table table-striped vdp-feature-table">
<!-- Price -->
<tr>
<td><strong>Price:</strong></td>
<td>
<strong id="vehicle-detail-price" itemprop="price">$ 2,210.00</strong> </td>
</tr>
<!-- VIN -->
<tr><td><strong>VIN</strong></td><td> *0343</td></tr>
<!-- MILEAGE -->
<tr><td><strong>Mileage</strong></td><td>0 mi</td></tr>
</table>
I'm trying to extract the Mileage. The XPath I'm using is:
//*[#id="vehicle-detail-model-specs"]/tbody/tr[3]/td[2]
And the Python code that I'm using is:
page = requests.get(URL)
tree = html.fromstring(page.content)
mileage = tree.xpath('//*[@id="vehicle-detail-model-specs"]/tbody/tr[3]/td[2]')
print mileage
Note: I've tried adding /text() to the end and I still get nothing back, just an empty list [].
What am I doing wrong and why am I not able to extract the table value from the above examples?
As Amber has pointed out, you should omit the tbody part: you are using tbody in your XPath even though there is no <tbody> tag in the HTML source of your table. Chrome inserts one into its DOM, which is why the copied XPath contains it.
Using the HTML you posted, I am able to extract the mileage value with the following XPath:
tree.xpath('//*[@id="vehicle-detail-model-specs"]/tr[3]/td[2]')[0].text_content()
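Putting it together, here is a self-contained sketch that runs the corrected XPath against a trimmed version of the fragment posted above instead of a live page:
from lxml import html

fragment = """
<table id="vehicle-detail-model-specs">
<tr><td><strong>Price:</strong></td><td>$ 2,210.00</td></tr>
<tr><td><strong>VIN</strong></td><td> *0343</td></tr>
<tr><td><strong>Mileage</strong></td><td>0 mi</td></tr>
</table>
"""

tree = html.fromstring(fragment)
# No /tbody step: lxml's parser does not insert one the way Chrome's DOM inspector does.
print(tree.xpath('//*[@id="vehicle-detail-model-specs"]/tr[3]/td[2]/text()'))  # ['0 mi']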

Scrape a form on incorrect web page

I'm trying to scrape an HTML form using robobrowser with Python 3.4. I use the default html parser:
self._browser = RoboBrowser(history=True, parser="html.parser")
It works fine for correct web pages, but now I have to parse an incorrectly written page. Here is the HTML fragment:
<form method="post" action="decide.php?act=submit_advance">
<table class="td_advanced">
<tr class="td_advance">
<td colspan="4" class="td_advance"></strong><br></td>
<td colspan="3" class="td_left">Case sensitive:<br><br></td>
<td><input type="checkbox" name="case_sensitive" /><br><br></td>
[...]
</form>
The closing strong tag is incorrect. This error prevents the parser from reading the inputs that follow the incorrect tag:
form = self._browser.get_form()
print(form)
>>> <RoboForm>
Any suggestions?
I have found the solution myself. The comment about BeautifulSoup was helpful and pointed my search in the right direction.
The solution is to use another HTML parser. I tried lxml and it works for me:
self._browser = RoboBrowser(history=True, parser="lxml")
As PyPI doesn't currently have an lxml installer that works with my Python version, I downloaded it from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
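As a quick sanity check of the difference, one can feed the broken fragment straight to BeautifulSoup with each parser and count the inputs each one keeps inside the form (a sketch; the exact html.parser behaviour can vary across Python versions):
from bs4 import BeautifulSoup

broken = '<form><table><tr><td></strong><br></td><td><input type="checkbox" name="case_sensitive" /></td></tr></table></form>'
for parser_name in ('html.parser', 'lxml'):
    soup = BeautifulSoup(broken, parser_name)
    form = soup.find('form')
    print(parser_name, len(form.find_all('input')) if form else 'no form parsed')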

python BeautifulSoup4 break for loop when tag found

I have a problem breaking a for loop when going through HTML with bs4.
I want to save a list separated by headings.
The HTML can look something like the snippet below; the real page contains more information between the desired tags:
<h2>List One</h2>
<td class="title">
<a title="Title One">This is Title One</a>
</td>
<td class="title">
<a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
<a title="Title Three">This is Title Three</a>
</td>
<td class="title">
<a title="Title Four">This is Title Four</a>
</td>
I would like to have the results printed like this:
List One
This is Title One
This is Title Two
List Two
This is Title Three
This is Title Four
I have come this far with my script:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('some webiste')
soup = BeautifulSoup(html, "lxml")
quote1 = soup.h2
print quote1.text
quote2 = quote1.find_next_sibling('h2')
print quote2.text
for quotes in soup.findAll('h2'):
if quotes.find(text=True) == quote2.text:
break
if quotes.find(text=True) == quote1.text:
for anchor in soup.findAll('td', {'class':'title'}):
print anchor.text
print quotes.text
I have tried to break the loop when "quote2" (List Two) is found, but the script gets all the td content and ignores the following h2 tags.
So how do I break the for loop at the next h2 tag?
In my opinion the problem lies in your HTML syntax. According to https://validator.w3.org it's not legal to mix "td" and "h2" (or generally any heading tag) like this. Also, implementing a list with tables is most likely not good practice.
If you can manipulate your input files, the list you seem to need could be implemented with "ul" and "li" tags (the first 'li' in each 'ul' containing the header) or, if you need to use tables, just put your header inside a "td" tag, or even more cleanly inside "th"s:
<table>
<tr>
<th>Your title</th>
</tr>
<tr>
<td>Your data</td>
</tr>
</table>
If the input is not under your control, your script could perform search and replace on the input text anyway, putting the headers into table cells or list items.
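Alternatively, if rewriting the input is too much trouble, a minimal sketch of the grouping itself (assuming the html.parser backend, which leaves the stray td tags in place rather than trying to repair the table structure) can walk the h2 and td tags in document order, which makes breaking out of the loop unnecessary:
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all(['h2', 'td']):
    if tag.name == 'h2':
        print(tag.get_text(strip=True))   # heading, e.g. "List One"
    elif 'title' in tag.get('class', []):
        print(tag.get_text(strip=True))   # anchor text inside the cell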

Using regex on python + beautiful soup

I have an html page like this:
<td class="subject windowbg2">
<div>
<span id="msg_152617">
<a href= SOME INFO THAT I WANT </a>
</span>
</div>
<div>
<span id="msg_465412">
<a href= SOME INFO THAT I WANT</a>
</span>
</div>
As you can see, the id="msg_465412" has a variable number, so this is my code:
import urllib.request, http.cookiejar,re
from bs4 import BeautifulSoup
contenturl = "http://megahd.me/peliculas-microhd/"
htmll=urllib.request.urlopen(contenturl).read()
soup = BeautifulSoup(htmll)
print (soup.find('span', attrs=re.compile(r"{'id': 'msg_\d{6}'}")))
In the last line I tried to find all the "span" tags whose id can be msg_###### (with any number), but something is wrong in my code and it doesn't find anything.
P.S.: all the content I want is in a table with 6 columns, and I want the third column of every row, but I thought it would be easier to use regex.
You're a bit mixed up with your attrs argument ... at the moment it's a regex which contains the string representation of a dictionary, when it needs to be a dictionary containing the attribute you're searching for and a regex for its value.
This ought to work:
print (soup.find('span', attrs={'id': re.compile(r"msg_\d{6}")}))
Try using the following:
soup.find_all("span", id=re.compile(r"msg_\d{6}"))
