I have an html page like this:
<td class="subject windowbg2">
<div>
<span id="msg_152617">
<a href= SOME INFO THAT I WANT </a>
</span>
</div>
<div>
<span id="msg_465412">
<a href= SOME INFO THAT I WANT</a>
</span>
</div>
as you can see, the id="msg_465412" has a variable number, so this is my code:
import urllib.request, http.cookiejar,re
from bs4 import BeautifulSoup
contenturl = "http://megahd.me/peliculas-microhd/"
htmll=urllib.request.urlopen(contenturl).read()
soup = BeautifulSoup(htmll)
print (soup.find('span', attrs=re.compile(r"{'id': 'msg_\d{6}'}")))
In the last line I tried to find all the span tags whose id matches msg_###### (any six digits), but something is wrong in my code and it doesn't find anything.
P.S.: everything I want is in a table with 6 columns, and I want the third column of every row, but I thought it would be easier to use a regex.
You're a bit mixed up with your attrs argument: at the moment it's a regex containing the string representation of a dictionary, when it needs to be a dictionary mapping the attribute you're searching for to a regex for its value.
This ought to work:
print (soup.find('span', attrs={'id': re.compile(r"msg_\d{6}")}))
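If you need every matching span (and the href of the link inside each one) rather than just the first, a find_all variant along these lines should work; this is only a minimal sketch built on the URL from the question, not tested against the live page:
import re
import urllib.request
from bs4 import BeautifulSoup

contenturl = "http://megahd.me/peliculas-microhd/"
html = urllib.request.urlopen(contenturl).read()
soup = BeautifulSoup(html, "html.parser")

# every <span id="msg_NNNNNN"> and the href of the <a> inside it
for span in soup.find_all('span', attrs={'id': re.compile(r"msg_\d{6}")}):
    link = span.find('a')
    if link is not None:
        print(link.get('href'))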
Try using the following:
soup.find_all("span" id=re.compile("msg_\d{6}"))
I want to find an a element in a soup object by a substring in its class name. This particular element will always have JobTitle somewhere in the class name, with random preceding and trailing characters, so I need to locate it by the JobTitle substring.
It's safe to assume there is only one a element to find, so using find should work; however, my attempts (there have been more than the two shown below) have not worked. I've also included the enclosing elements in case they're relevant for locating it.
I'm on Windows 10, Python 3.10.5, and BS4 4.11.1.
I've created a reproducible example below (I thought the regex way would have worked, but I guess not):
import re
from bs4 import BeautifulSoup
# Parse this HTML, getting the only a['href'] in it (line 22)
html_to_parse = """
<li>
<div class="cardOutline tapItem fs-unmask result job_5ef6bf779263a83c sponsoredJob resultWithShelf sponTapItem desktop vjs-highlight">
<div class="slider_container css-g7s71f eu4oa1w0">
<div class="slider_list css-kyg8or eu4oa1w0">
<div class="slider_item css-kyg8or eu4oa1w0">
<div class="job_seen_beacon">
<div class="fe_logo">
<img alt="CyberCoders logo" class="feLogoImg desktop" src="https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/f0b43dcaa7850e2110bc8847ebad087b" />
</div>
<table cellpadding="0" cellspacing="0" class="jobCard_mainContent big6_visualChanges" role="presentation">
<tbody>
<tr>
<td class="resultContent">
<div class="css-1xpvg2o e37uo190">
<h2 class="jobTitle jobTitle-newJob css-bdjp2m eu4oa1w0" tabindex="-1">
<a aria-label="full details of REMOTE Senior Python Developer" class="jcs-JobTitle css-jspxzf eu4oa1w0" data-ci="385558680" data-empn="8690912762161442" data-hide-spinner="true" data-hiring-event="false" data-jk="5ef6bf779263a83c" data-mobtk="1g9u19rmn2ea6000" data-tu="https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=521&rx_c=&rx_campaign=indeed16&rx_group=110383&rx_source=Indeed&job=KE2-168714218&rx_r=none&rx_ts=20220808T034442Z&rx_pre=1&indeed=sp" href="/pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3" id="sj_5ef6bf779263a83c" role="button" target="_blank">
<span id="jobTitle-5ef6bf779263a83c" title="REMOTE Senior Python Developer">REMOTE Senior Python Developer</span>
</a>
</h2>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</li>
"""
# Soupify it
soup = BeautifulSoup(html_to_parse, "html.parser")
# Start by making sure "find_all("a")" works
all_links = soup.find_all("a")
print(all_links)
# Good.
# Attempt 1
job_url = soup.find('a[class*="JobTitle"]').a['href']
print(job_url)
# Nope.
# Attempt 2
job_url = soup.find("a", {"class": re.compile("^.*jobTitle.*")}).a['href']
print(job_url)
# Nope...
To find an element by a partial class name you need to use select, not find. This will give you the <a> tag; the href will be in it.
job_url = soup.select_one('a[class*="JobTitle"]')['href']
print(job_url)
# /pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3
The CSS selector only works with the .select() method.
See the documentation here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Change your code to something like
job_links = soup.select('a[class*="JobTitle"]')
print(job_links)
for job_link in job_links:
    print(job_link.get("href"))
job_url = soup.select("a", {"class": "jobTitle"})[0]["href"]
I have the following HTML code:
<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>
I would like to get the anchor tag that has Shop as text disregarding the spacing before and after. I have tried the following code, but I keep getting an empty array:
import re
from bs4 import BeautifulSoup
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
prog = re.compile(r'\s*Shop\s*')
print(soup.find_all("a", string=prog))
# Output: []
I also tried retrieving the text using get_text():
text = soup.find_all("a")[0].get_text()
print(repr(text))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'
and ran the following code to make sure my regex was right, which seems to be the case.
result = prog.match(text)
print(repr(result.group()))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'
I also tried selecting span instead of a, but I get the same issue. I'm guessing it's something with find_all; I have read the BeautifulSoup documentation but I still can't find the issue. Any help would be appreciated. Thanks!
The problem you have here is that the text you are looking for is inside a tag that contains child tags, and when a tag has child tags its .string property is None, so a string filter won't match it.
You can use a lambda expression in the .find call, and since you are looking for a fixed string, a plain 'Shop' in t.text condition will do instead of a regex check:
soup.find(lambda t: t.name == "a" and 'Shop' in t.text)
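The lambda returns the <a> tag itself, so the URL is just one attribute lookup away; a small usage sketch with the soup from the question:
link = soup.find(lambda t: t.name == "a" and 'Shop' in t.text)
if link is not None:
    print(link['href'])  # https://cbd420.ch/fr/tous-les-produits/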
The text Shop you are searching for is inside the span tag, so matching the a tag directly against a regular expression does not find it.
You can use a regex to find the text node and then take the parent of that:
import re
from bs4 import BeautifulSoup
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(text=re.compile('Shop')).parent.parent)
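Since the text node's parent is the <span> and its parent is the enclosing <a>, the href is one more lookup away (same soup as above):
# the text node's parent is the <span>; its parent is the <a>
anchor = soup.find(text=re.compile('Shop')).parent.parent
print(anchor['href'])  # https://cbd420.ch/fr/tous-les-produits/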
If you have BS 4.7.1 or above, you can use the following CSS selector.
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('a:contains("Shop")'))
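On newer soupsieve versions (2.1 and later), where :contains is deprecated, the same selector can be written with :-soup-contains:
# equivalent on soupsieve >= 2.1, avoiding the :contains deprecation warning
print(soup.select_one('a:-soup-contains("Shop")'))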
I want to extract the price of a player on futbin. Part of the HTML is here:
<div class="pr pr_pc" id="pr_pc">PR: 10,250 - 150,000</div>
<div id="pclowest" class="hide">23500</div>
I've programmed this with Python:
from lxml import html
import requests
page = requests.get('https://www.futbin.com/18/player/15660/Mar%C3%A7al/')
tree = html.fromstring(page.content)
player = tree.xpath('//*[@id="pclowest"]')
print('player: ', player)
I want to extract the value 23500 automatically, but I cannot. Can someone help me?
Edit:
There's another piece of code where the data could maybe be extracted from:
<div class="bin_price lbin">
<span class="price_big_right">
<span id="pc-lowest-1" data-price="23,000">23,000 <img alt="c" class="coins_icon_l_bin" src="https://cdn.futbin.com/design/img/coins_bin.png">
</span>
</span>
</div>
Would it be possible to extract data-price here?
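A minimal lxml sketch for that idea, assuming the span shown in the edit is present in the downloaded HTML rather than injected by JavaScript, would read the attribute directly with XPath (the id comes from the snippet above):
from lxml import html
import requests

page = requests.get('https://www.futbin.com/18/player/15660/Mar%C3%A7al/')
tree = html.fromstring(page.content)

# /@data-price selects the attribute value directly; strip the thousands separator
prices = tree.xpath('//span[@id="pc-lowest-1"]/@data-price')
if prices:
    print(int(prices[0].replace(',', '')))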
I parsed an HTML page via beautifulsoup, extracting all div elements with specific class names into a list.
I now have to clean out HTML strings from this list, leaving behind string tokens I need.
The list I start with looks like this:
[<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b <span class="bold">Score3b</span>\n</div>]
The whitespaces are deliberate.
I need to reduce that list to:
[('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a'), ('Name2b', 'Score2b'), ('Name3a', 'Score3a'), ('Name3b', 'Score3b')]
What's an efficient way to parse out substrings like this?
I've tried using the split method (e.g. [item.split('<div class="info-1">\n',1) for item in string_list]), but splitting just results in a substring that requires further splitting (hence inefficient). Likewise for using replace.
I feel I ought to go the other way around and extract the tokens I need, but I can't seem to wrap my head around an elegant way to do this. Being new to this hasn't helped either. I appreciate your help.
Do not convert the BS object to a string unless you really need to.
Use a CSS selector to find the divs whose class starts with info.
Use stripped_strings to get all the non-empty strings under a tag.
Use tuple() to convert the iterable into a tuple.
import bs4
html = '''<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b <span class="bold">Score3b</span>\n</div>'''
soup = bs4.BeautifulSoup(html, 'lxml')
for div in soup.select('div[class^="info"]'):
    t = tuple(text for text in div.stripped_strings)
    print(t)
out:
('Name1a', 'Score1a')
('Name1b', 'Score1b')
('Name2a', 'Score2a')
('Name2b', 'Score2b')
('Name3a', 'Score3a')
('Name3b', 'Score3b')
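To get the single list of tuples the question asks for, the same pieces fit into a list comprehension (a small sketch building on the loop above):
result = [tuple(div.stripped_strings) for div in soup.select('div[class^="info"]')]
print(result)
# [('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ..., ('Name3b', 'Score3b')]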
I am parsing a web page made up of various HTML elements, among them the fragment below:
<p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
<p style="text-align: center;"><strong>My keywords : some text </strong></p>
<p style="text-align: center;"><strong>some other words : some other words</strong></p>
I am interested in the URL after My keywords (http://example.com/hello.html in the example above). The combination of My keywords and the link afterwards is unique in the page.
Right now I use a regex to extract the URL:
import requests
import re
def getfile(link):
    r = requests.get(link).text
    try:
        link = re.search('My keywords : <a href="(.+)" target', r).group(1)
    except AttributeError:
        print("no direct link for {link}".format(link=link))
    else:
        return link

print(getfile('http://example.com'))
This method, while it works, is very dependent on the exact format of the matched string. I would much prefer to use BeautifulSoup to:
search for My keywords
get its context (by that I mean the whole value of the tag which contains that string, My keywords : some text in the case above)
run it again through BeautifulSoup in order to extract the URL in the <a>
I am failing on the second part; I only get
[u'My keywords : ']
when trying a string search:
import bs4
import re
thehtml = '''
<p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
<p style="text-align: center;"><strong>My keywords : some text </strong></p>
<p style="text-align: center;"><strong>some other words : some other words</strong></p>
'''
soup = bs4.BeautifulSoup(thehtml)
k = soup.find_all(text=re.compile("My keywords"))
print(k)
How can I get the whole content of the surrounding tag? (I cannot assume that this will always be <strong> as in the example above)
You can use find() instead of find_all() because there is only one match. Then use next_sibling to reach the <a> tag and href to get its value. Example:
import bs4
import re
thehtml = '''
<p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
<p style="text-align: center;"><strong>My keywords : some text </strong></p>
<p style="text-align: center;"><strong>some other words : some other words</strong></p>
'''
soup = bs4.BeautifulSoup(thehtml)
k = soup.find(text=re.compile("My keywords")).next_sibling['href']
print(k)
yields:
http://example.com/hello.html
UPDATE: Based on the comments, to get the element that contains the text, use parent, like:
k = soup.find(text=re.compile("My keywords")).parent
That yields:
<strong>My keywords : <a href="http://example.com/hello.html" target="_blank">some text</a></strong>
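Putting the pieces back into the original getfile() function, a hedged sketch of a BeautifulSoup version (assuming the real pages follow the structure of the example) could look like this:
import re
import requests
import bs4

def getfile(link):
    r = requests.get(link).text
    soup = bs4.BeautifulSoup(r, "html.parser")
    # find the "My keywords" text node, then take the <a> that follows it
    node = soup.find(text=re.compile("My keywords"))
    link_tag = node.next_sibling if node is not None else None
    if not isinstance(link_tag, bs4.Tag):
        print("no direct link for {link}".format(link=link))
        return None
    return link_tag.get("href")

print(getfile('http://example.com'))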