Use Beautiful Soup findall to extract text between single quotations - python

I'm using Beautiful Soup and I want to extract the text within '' with the findall method.
content = urllib.urlopen(address).read()
soup = BeautifulSoup(content, from_encoding='utf-8')
soup.prettify()
x = soup.findAll(do not know what to write)
An extract from soup as an example:
<td class="leftCell identityColumn snap" onclick="fundview('Schroder
European Special Situations');" title="Schroder European Special
Situations"> <a class="coreExpandArrow" href="javascript:
void(0);"></a> <span class="sigill"><a class="qtpop"
href="/vips/ska/all/sv/quicktake/redirect?perfid=0P0000XZZ3&flik=Chosen">
<img
src="/vips/Content/corestyles/4pSigillGubbe.gif"/></a></span>
<span class="bluetext" style="white-space: nowrap; overflow:
hidden;">Schroder European Spe..</span>
I would like the result from soup.findAll(do not know what to write) to be: Schroder European Special Situations and the findall logic should be based on that it is the text between single quotation marks.

Locate the td element and get the onclick attribute value - the BeautifulSoup's job at this point would be completed. The next step would be to extract the desired text from the attribute value - let's use regular expressions for that. Implementation:
import re
onclick = soup.select_one("td.identityColumn[onclick]")["onclick"]
match = re.search(r"fundview\('(.*?)'\);", onclick)
if match:
print(match.group(1))
Alternatively, it looks like the span with bluetext class has the desired text inside:
soup.select_one("td.identityColumn span.bluetext").get_text()
Also, make sure you are using the 4th BeautifulSoup version and your import statement is:
from bs4 import BeautifulSoup

Related

Putting Links in Parenthesis with BeautifulSoup

BeautifulSoup's get_text() function only records the textual information of an HTML webpage. However, I want my program to return the href link of an tag in parenthesis directly after it returns the actual text.
In other words, using get_text() will just return "17.602" on the following HTML:
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
However, I want my program to return "17.602 (17.602.html#FAR_17_602)". How would I go about doing this?
EDIT: What if you need to print text from other tags, such as:
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
In other words, how would you compose a program that would print
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
You can format the output using f-strings.
Access the tag's text using .text, and then access the href attribute.
from bs4 import BeautifulSoup
html = """
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.text} ({a_tag['href']})")
Output:
17.602 (17.602.html#FAR_17_602)
Edit: You can use .next_sibling and .previous_sibling
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.

Get text from inside element without its children

I'm scraping a webpage with several p elements and I wanna get the text inside of them without including their children.
The page is structured like this:
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
When I use
parent.find_all("p", {"class": "default").get_text() this is the result I get:
I don't want this text
I want this text
I'm using BeautifulSoup 4 with Python 3
Edit: When I use
parent.find_all("p", {"class": "public item-cost"}, text=True, recursive=False)
It returns an empty list
You can use .find_next_sibling() with text=True parameter:
from bs4 import BeautifulSoup
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select_one(".default > div").find_next_sibling(text=True))
Prints:
I want this text
Or using .contents:
print(soup.find("p", class_="default").contents[-1])
EDIT: To strip the string:
print(soup.find("p", class_="default").contents[-1].strip())
You can use xpath, which is a bit complex but provides much powerful querying.
Something like this will work for you:
soup.xpath('//p[contains(#class, "default")]//text()[normalize-space()]')

Regex using Python

I am trying to catch from pattern that was downloaded from specific URL, specific values but without success.
Part of the pattern is:
"All My Loving"</td>\n<td style="text-align:center;">1963</td>\n<td><i>UK: With the Beatles<br />\nUS: Meet The Beatles!</i></td>\n<td>McCartney</td>\n<td>McCartney</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;"><span style="display:none" class="sortkey">7001450000000000000\xe2\x99\xa0</span>45</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Things Must Pass"</td>\n<td style="text-align:center;">1969</td>\n<td><i>Anthology 3</i></td>\n<td>Harrison</td>\n<td>Harrison</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Together Now"</td>\n<td style="text-align:center;">1967</td>\n<td><i>Yellow Submarine</i></td>\n<td>McCartney, with Lennon</td>\n<td>McCartney, with Lennon</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"
I want to catch the Title and the 1st <td>McCartney</td> with specific values from the file and to print it out as a JSON file.
Can I run with FOR loop with regex ? How I can do it using python ?
Thanks,
If you want to parse HTML use an HTML parser (such as BeautifulSoup), not regex.
from bs4 import BeautifulSoup
html = '''All My Loving"</td>\n<td style="text-align:center;">1963</td>\n<td><i>UK: With the Beatles<br />\nUS: Meet The Beatles!</i></td>\n<td>McCartney</td>\n<td>McCartney</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;"><span style="display:none" class="sortkey">7001450000000000000\xe2\x99\xa0</span>45</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Things Must Pass"</td>\n<td style="text-align:center;">1969</td>\n<td><i>Anthology 3</i></td>\n<td>Harrison</td>\n<td>Harrison</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Together Now"</td>\n<td style="text-align:center;">1967</td>\n<td><i>Yellow Submarine</i></td>\n<td>McCartney, with Lennon</td>\n<td>McCartney, with Lennon</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>
'''
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a') # will only find the first <a> tag
print(a.attrs['title'])
tds = soup.find_all('td') # will find all <td> tags
for td in tds:
if 'McCartney' in td.text:
print(td)
# All My Loving
# <td>McCartney</td>
# <td>McCartney</td>
# <td>McCartney, with Lennon</td>
# <td>McCartney, with Lennon</td>

BeautifulSoup replace_with for non-standard tags

I'm trying to write a parser that will take HTML and convert/output to Wiki syntax (<b> = ''', <i> = '', etc).
So far, BeautifulSoup seems only capable of replacing the contents within a tag, so <b> becomes <'''> instead of '''. I can use a re.sub() to swap these out, but since BS turns the document into a 'complex tree of Python objects', I can't figure out how to swap out these tags and re-insert them into the overall document.
Does anyone have ideas?
I am pretty sure there are already tools that would do that for you, but if you are asking about how to do that with BeautifulSoup, you can use replace_with(), but you would need to preserve the text of the element. Naive and simple example:
from bs4 import BeautifulSoup
data = """
<div>
<b>test1</b>
<i>test2</i>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
b.replace_with("'''%s'''" % b.text)
for i in soup.find_all("i"):
i.replace_with("''%s''" % i.text)
print(soup.prettify())
Prints:
<div>
'''test1'''
''test2''
</div>
To also handle nested tags, e.g. "<div><b>bold with some <i>italics</i></b></div>" you have to be a bit more careful.
I put together the following implementation when I needed to do something similar:
from bs4 import BeautifulSoup
def wikify_tag(tag, replacement):
tag.insert(0, replacement)
tag.append(replacement)
tag.unwrap()
data = """
<div>
<b>test1</b>
<i>test2</i>
<b>bold with some <i>italics</i></b>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
wikify_tag(b, "'''")
for i in soup.find_all("i"):
wikify_tag(i, "''")
print(soup)
Prints (note that .prettify() makes it look uglier):
<div>
'''test1'''
''test2''
'''bold with some ''italics'''''
</div>
If you also want to replace tags with wiki-templates you can extend wikify_tag to take a start and an end string.

What how I parse an html string with beautifulsoup that has inner tags within text

I have a the following html content in a variable and need a way to read the text from the html by removing the inner tags
html=<td class="row">India (ASIA) (india – photos)</td>
I just want to extract the string India (ASIA) out of this with BeautifulSoup. Is it possible or should I resort to use regular expressions for this.
This is one possible way using beautifulsoup, by extracting text content before child element <a> :
from bs4 import BeautifulSoup
html = """<td class="row">India (ASIA) (india – photos)</td>"""
soup = BeautifulSoup(html)
result = soup.find("a").previousSibling
print(result.decode('utf-8'))
output :
India (ASIA) (
tweaking the code further to remove trailing ( from result should be straightforward

Categories