how to get the context of a search in BeautifulSoup? - python

I am parsing a web page made up of various HTML elements, among them the fragment below:
<p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
<p style="text-align: center;"><strong>My keywords : some text </strong></p>
<p style="text-align: center;"><strong>some other words : some other words</strong></p>
I am interested in the URL after My keywords (http://example.com/hello.html in the example above). The combination of My keywords and the link afterwards is unique in the page.
Right now I use a regex to extract the URL:
import requests
import re
def getfile(link):
    r = requests.get(link).text
    try:
        link = re.search('My keyword : <a href="(.+)" target', r).group(1)
    except AttributeError:
        print("no direct link for {link}".format(link=link))
    else:
        return link
print(getfile('http://example.com'))
This method, while working, is very dependent on the exact format of the matched string. I would very much prefer to use BeautifulSoup to:
search for My keyword
get its context (by that I mean the whole value of the tag which contains that string, My keywords : some text in the case above)
run it again through BeautifulSoup in order to extract the URL in the <a>
I am failing on the second step; I only get
[u'My keywords : ']
when trying a string search
import bs4
import re
thehtml = '''
<p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
<p style="text-align: center;"><strong>My keywords : some text </strong></p>
<p style="text-align: center;"><strong>some other words : some other words</strong></p>
'''
soup = bs4.BeautifulSoup(thehtml)
k = soup.find_all(text=re.compile("My keywords"))
print(k)
How can I get the whole content of the surrounding tag? (I cannot assume that this will always be <strong> as in the example above)

You can use find() instead of find_all() because there is only one match. Then use next_sibling to reach the <a> tag and ['href'] to get its value. Example:
import bs4
import re
thehtml = '''
<p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
<p style="text-align: center;"><strong>My keywords : some text </strong></p>
<p style="text-align: center;"><strong>some other words : some other words</strong></p>
'''
soup = bs4.BeautifulSoup(thehtml)
k = soup.find(text=re.compile("My keywords")).next_sibling['href']
print(k)
yields:
http://example.com/hello.html
UPDATE: Based on the comments, to get the element that contains the matched text, use parent:
k = soup.find(text=re.compile("My keywords")).parent
That yields:
<strong>My keywords : <a href="http://example.com/hello.html" target="_blank">some text</a></strong>
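Putting the two steps together — find the text, go up to its parent tag, then read the href of the <a> inside that tag — here is a minimal, self-contained sketch reusing the fragment from the question:
import bs4
import re

thehtml = '<p style="text-align: center;"><strong>My keywords : <a href="http://example.com/hello.html" target="_blank">some text</a></strong></p>'
soup = bs4.BeautifulSoup(thehtml, "html.parser")
# the matched NavigableString's parent is the surrounding <strong> tag
context = soup.find(text=re.compile("My keywords")).parent
print(context.find("a")["href"])  # http://example.com/hello.html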

Related

How to scrape just one text value on one p tag from bs4

The website has one <p>, but inside it there are two text values, and I just want to scrape one of them. The website HTML is below:
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
In the HTML above, there are two text values ("Great Clips" & "Request Info") if we target <p>. I just want to scrape "Great Clips", not both. How would I do that with bs4?
You could use .contents with indexing to extract only the first child:
soup.p.contents[0].strip()
Example
from bs4 import BeautifulSoup
html = '''
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
'''
soup = BeautifulSoup(html)
soup.p.contents[0].strip()
Output
Great Clips
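If you cannot rely on the first child of the <p> always being the text node you want, another option (a sketch of the same idea) is to take the first non-empty string inside the tag with stripped_strings:
from bs4 import BeautifulSoup

html = '''
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
'''
soup = BeautifulSoup(html, "html.parser")
# stripped_strings yields each piece of text inside <p> with whitespace removed;
# next() takes just the first one.
print(next(soup.p.stripped_strings))  # Great Clips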

Find "a" element in BS4 by partial class name not working?

I want to find an a element in a soup object by a substring present in its class name. This particular element will always have JobTitle somewhere in its class name, with random preceding and trailing characters, so I need to locate it by the JobTitle substring.
It's safe to assume there is only one a element to find, so using find should work; however, my attempts (there have been more than the two shown below) have not worked. I've also included the surrounding elements in case the structure is relevant for some reason.
I'm on Windows 10, Python 3.10.5, and BS4 4.11.1.
I've created a reproducible example below (I thought the regex way would have worked, but I guess not):
import re
from bs4 import BeautifulSoup
# Parse this HTML, getting the only a['href'] in it
html_to_parse = """
<li>
<div class="cardOutline tapItem fs-unmask result job_5ef6bf779263a83c sponsoredJob resultWithShelf sponTapItem desktop vjs-highlight">
<div class="slider_container css-g7s71f eu4oa1w0">
<div class="slider_list css-kyg8or eu4oa1w0">
<div class="slider_item css-kyg8or eu4oa1w0">
<div class="job_seen_beacon">
<div class="fe_logo">
<img alt="CyberCoders logo" class="feLogoImg desktop" src="https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/f0b43dcaa7850e2110bc8847ebad087b" />
</div>
<table cellpadding="0" cellspacing="0" class="jobCard_mainContent big6_visualChanges" role="presentation">
<tbody>
<tr>
<td class="resultContent">
<div class="css-1xpvg2o e37uo190">
<h2 class="jobTitle jobTitle-newJob css-bdjp2m eu4oa1w0" tabindex="-1">
<a aria-label="full details of REMOTE Senior Python Developer" class="jcs-JobTitle css-jspxzf eu4oa1w0" data-ci="385558680" data-empn="8690912762161442" data-hide-spinner="true" data-hiring-event="false" data-jk="5ef6bf779263a83c" data-mobtk="1g9u19rmn2ea6000" data-tu="https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=521&rx_c=&rx_campaign=indeed16&rx_group=110383&rx_source=Indeed&job=KE2-168714218&rx_r=none&rx_ts=20220808T034442Z&rx_pre=1&indeed=sp" href="/pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3" id="sj_5ef6bf779263a83c" role="button" target="_blank">
<span id="jobTitle-5ef6bf779263a83c" title="REMOTE Senior Python Developer">REMOTE Senior Python Developer</span>
</a>
</h2>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</li>
"""
# Soupify it
soup = BeautifulSoup(html_to_parse, "html.parser")
# Start by making sure "find_all("a")" works
all_links = soup.find_all("a")
print(all_links)
# Good.
# Attempt 1
job_url = soup.find('a[class*="JobTitle"]').a['href']
print(job_url)
# Nope.
# Attempt 2
job_url = soup.find("a", {"class": re.compile("^.*jobTitle.*")}).a['href']
print(job_url)
# Nope...
To find an element by a partial class name you need to use select, not find. This will give you the <a> tag; the href will be in it:
job_url = soup.select_one('a[class*="JobTitle"]')['href']
print(job_url)
# /pagead/clk?mo=r&ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eoBlI4EFrmor2FYZMP3muM35UEpv7D8dnBwRFuIf8XmtgYykaU5Nl3fSsXZ8xXiGdq3dZVwYJYR2-iS1SqyS7j4jGQ4Clod3n72L285Zn7LuBKMjFoBPi4tB5X2mdRnx-UikeGviwDC-ahkoLgSBwNaEmvShQxaFt_IoqJP6OlMtTd7XlgeNdWJKY9Ph9u8n4tcsN_tCjwIc3RJRtS1O7U0xcsVy5Gi1JBR1W7vmqcg5n4WW1R_JnTwQQ8LVnUF3sDzT4IWevccQb289ocL5T4jSfRi7fZ6z14jrR6bKwoffT6ZMypqw4pXgZ0uvKv2v9m3vJu_e5Qit1D77G1lNCk9jWiUHjWcTSYwhhwNoRzjAwd4kvmzeoMJeUG0gbTDrXFf3V2uJQwjZhTul-nbfNeFPRX6vIb4jgiTn4h3JVq-zw0woq3hTrLq1z9Xpocf5lIGs9U7WJnZM-Mh7QugzLk1yM3prCk7tQYRl3aKrDdTsOdbl5Afs1DkatDI7TgQgFrr5Iauhiv7I9Ss-fzPJvezhlYR4hjkkmSSAKr3Esz06bh5GlZKFONpq1I0IG5aejSdS_kJUhnQ1D4Uj4x7X_mBBN-fjQmL_CdyWM1FzNNK0cZwdLjKL-d8UK1xPx3MS-O-WxVGaMq0rn4lyXgOx7op9EHQ2Qdxy9Dbtg6GNYg5qBv0iDURQqi7_MNiEBD-AaEyqMF3riCBJ4wQiVaMjSTiH_DTyBIsYc0UsjRGG4a949oMHZ8yL4mGg57QUvvn5M_urCwCtQTuyWZBzJhWFmdtcPKCn7LpvKTFGQRUUjsr6mMFTQpA0oCYSO7E-w2Kjj0loPccA9hul3tEwQm1Eh58zHI7lJO77kseFQND7Zm9OMz19oN45mvwlEgHBEj4YcENhG6wdB6M5agUoyyPm8fLCTOejStoecXYnYizm2tGFLfqNnV-XtyDZNV_sQKQ2TQ==&xkcb=SoD0-_M3b-KooEWCyR0LbzkdCdPP&p=0&fvj=0&vjs=3
CSS selectors only work with the .select() method; see the documentation here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Change your code to something like
job_links = soup.select('a[class*="JobTitle"]')
print(job_links)
for job_link in job_links:
    print(job_link.get("href"))
job_url = soup.select('a[class*="JobTitle"]')[0]["href"]
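If you would rather stay with find(), a regular expression passed to the class_ keyword also matches a partial class name. A minimal sketch, reusing the soup object built from html_to_parse above:
import re

# class_ with a regex is matched against each individual class value,
# so this finds the <a> whose class list contains "jcs-JobTitle".
job_url = soup.find("a", class_=re.compile("JobTitle"))["href"]
print(job_url)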

Python: scrape a part of source code and save it as html

Here is the case: I need to save a web page's source code as an HTML file. But the page has lots of sections I don't need; I only want to save the source code of the article itself.
code:
from urllib.request import urlopen
page = urlopen('http://www.abcde.com')
page_content = page.read()
with open('page_content.html', 'wb') as f:
    f.write(page_content)
I can save the whole source code with my code, but how can I save only the part I want? To explain:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>
I need to save the source code of this tag and everything inside it, not just extract the sentences from the tags.
The result I want to save looks like this:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<div class="col-md-12 col-xs-12" style="padding-left:10px;">
<h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
</div>
<!--Article Start-->
<section class="page_article_div" id="print">
<article itemprop="text" class="page_article_content">
<p>
<img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
<strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
<li>
Germanic paganism</li>
<li>
Greek mythology</li>
</ol>
<p style="text-align: right;">
【Jane】</p>
<p style="text-align: right;">
Credit : Wiki</p>
</article>
<div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
<br />
<div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
</section>
<!--Article End-->
</div>
My own solution here:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")
list = []
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    list.append(str(tag))
list2 = (', '.join(list))
#print(list2)
#print(type(list2))
with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(list2)
I am a beginner, so I am trying to keep it as simple as possible. This is my answer, and it's working quite well at the moment :)
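A slightly shorter variant of the same idea (just a sketch, assuming the page has exactly one such div as in the example): find() returns the matching element, and str() keeps its full markup rather than just its text:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page_content = urlopen('http://www.abcde.com').read()
soup = BeautifulSoup(page_content, "lxml")

# find() returns the first (here, the only) div with that itemtype
article = soup.find('div', itemtype="http://schema.org/MedicalWebPage")

with open('try.html', 'w', encoding='UTF-8') as f:
    f.write(str(article))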
You can search for the tag by one of its properties, such as its class, tag name, or id, and save it in whatever format you want, as in the example below.
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)
tag_for_me will contain the markup you need.
You can use Beautiful Soup to get any HTML source you need.
import requests
from bs4 import BeautifulSoup
target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")
for elem in soup.find_all(attrs={"class":target_class}):
    if elem.text == target_text:
        print(elem)
Output:
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
Use BeautifulSoup to find the place in the HTML where you want to insert, get the HTML you want to insert, use insert() to add it, then overwrite the original file.
from bs4 import BeautifulSoup
import requests

# Use Beautiful Soup to get the place where you want to insert.
# div_tag is the extracted div, e.g.
# <div itemscope itemtype="http://schema.org/MedicalWebPage"> ... </div>
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'itemtype': 'http://schema.org/MedicalWebPage'})

res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text, 'lxml')
insert_data = soup1.find('your div/data to insert')

# This will insert the tag into div_tag. You can then overwrite your
# original page_content.html with it.
div_tag.insert(3, insert_data)
# div_tag now contains your desired output; write it back to the original file.

Extract html data using regular expressions

I have an html page that looks like this
<tr>
<td align=left>
<a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
Th
</td>
</tr>
<tr align=right>
<td align=left>
<a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
aa
</td>
</tr>
I need to get the text
history/2c0b65635b3ac68a4d53b89521216d26.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
I wrote a script in python that uses regular expressions
import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))
where s holds the HTML page above. However, when I run this script I get None. Where did I go wrong?
Don't use regex for parsing HTML content.
Use a specialized tool - an HTML Parser.
Example (using BeautifulSoup):
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
data = u"""Your HTML here"""
soup = BeautifulSoup(data)
for link in soup.select('td a[href]'):
    print(link['href'])
Prints:
history/2c0b65635b3ac68a4d53b89521216d26.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
Or, if you want to get the href values that follow a pattern, use:
import re
for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print(link['href'])
where r'\w+/\w{32}\.html' is a regular expression applied to the href attribute of every a tag found. It matches one or more alphanumeric characters (\w+), followed by a slash, followed by exactly 32 alphanumeric characters (\w{32}), followed by a dot (\., which needs to be escaped), followed by html.
You can also write something like
>>> soup = BeautifulSoup(html) #html is the string containing the data to be parsed
>>> for a in soup.select('a'):
...     print(a['href'])
...
history/2c0b65635b3ac68a4d53b89521216d26.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html

Using regex on python + beautiful soup

I have an html page like this:
<td class="subject windowbg2">
<div>
<span id="msg_152617">
<a href= SOME INFO THAT I WANT </a>
</span>
</div>
<div>
<span id="msg_465412">
<a href= SOME INFO THAT I WANT</a>
</span>
</div>
As you can see, the id="msg_465412" has a variable number, so this is my code:
import urllib.request, http.cookiejar,re
from bs4 import BeautifulSoup
contenturl = "http://megahd.me/peliculas-microhd/"
htmll=urllib.request.urlopen(contenturl).read()
soup = BeautifulSoup(htmll)
print (soup.find('span', attrs=re.compile(r"{'id': 'msg_\d{6}'}")))
In the last line I tried to find all the span tags whose id is of the form msg_###### (with any number), but something is wrong in my code and it doesn't find anything.
P.S.: All the content I want is in a table with 6 columns, and I want the third column of every row, but I thought it would be easier to use a regex.
You're a bit mixed up with your attrs argument: at the moment it's a regex containing the string representation of a dictionary, when it needs to be a dictionary mapping the attribute you're searching on to a regex for its value.
This ought to work:
print (soup.find('span', attrs={'id': re.compile(r"msg_\d{6}")}))
Try using the following:
soup.find_all("span" id=re.compile("msg_\d{6}"))
