I'm scraping a webpage with several <p> elements and I want to get the text inside them without including their children.
The page is structured like this:
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
When I use
parent.find_all("p", {"class": "default"}).get_text()
this is the result I get:
I don't want this text
I want this text
I'm using BeautifulSoup 4 with Python 3
Edit: When I use
parent.find_all("p", {"class": "public item-cost"}, text=True, recursive=False)
it returns an empty list.
You can use .find_next_sibling() with the text=True parameter:
from bs4 import BeautifulSoup
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select_one(".default > div").find_next_sibling(text=True))
Prints:
I want this text
Or using .contents:
print(soup.find("p", class_="default").contents[-1])
EDIT: To strip the string:
print(soup.find("p", class_="default").contents[-1].strip())
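A further variant, as a rough sketch: calling find_all(text=True, recursive=False) on the <p> tag itself collects only its own text nodes, which also works when there are several of them:
from bs4 import BeautifulSoup
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for p in soup.find_all("p", class_="default"):
    # Only the direct text children of the <p>, not text inside nested tags.
    print("".join(p.find_all(text=True, recursive=False)).strip())
Prints:
I want this text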
You can use XPath, which is a bit more complex but provides much more powerful querying. Note that BeautifulSoup itself has no .xpath() method, so this route needs lxml.
Something like this will work for you:
'//p[contains(@class, "default")]/text()[normalize-space()]'
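A self-contained sketch of that lxml route (this particular snippet is parsed as XML so the <div>-inside-<p> nesting is kept exactly as written; an HTML parser would close the <p> before the <div>):
from lxml import etree
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
# Parse as XML to preserve the nesting as written.
root = etree.fromstring(html_doc.strip())
# Direct text children of the <p>, skipping whitespace-only nodes.
texts = root.xpath('//p[contains(@class, "default")]/text()[normalize-space()]')
print([t.strip() for t in texts])  # ['I want this text']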
BeautifulSoup's get_text() function only records the textual information of an HTML webpage. However, I want my program to return the href link of an <a> tag in parentheses directly after it returns the actual text.
In other words, using get_text() will just return "17.602" on the following HTML:
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
However, I want my program to return "17.602 (17.602.html#FAR_17_602)". How would I go about doing this?
EDIT: What if you need to print text from other tags, such as:
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
In other words, how would you compose a program that would print
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
You can format the output using f-strings.
Access the tag's text using .text, and then access the href attribute.
from bs4 import BeautifulSoup
html = """
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.text} ({a_tag['href']})")
Output:
17.602 (17.602.html#FAR_17_602)
Edit: You can use .next_sibling and .previous_sibling
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
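If the paragraph can hold more text nodes and links than in this example, a rough sketch of a more general approach is to walk the <p>'s children and format each <a> as it appears:
from bs4 import BeautifulSoup
html = """<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>"""
soup = BeautifulSoup(html, "html.parser")
parts = []
for child in soup.find("p").children:
    if getattr(child, "name", None) == "a":
        # Render each link as "text (href)".
        parts.append(f"{child.get_text(strip=True)} ({child['href']})")
    else:
        # Plain text nodes between the tags.
        text = str(child).strip()
        if text:
            parts.append(text)
print(" ".join(parts))
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.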
I want to output a series of links I've scraped from a website. The html is in a pretty standard hierarchy: div, h4, a, href.
Using Python and BeautifulSoup I've pulled the list out using the following script:
for record in soup.findAll('div',{"class":"title"}):
    print(record)
which outputs the following info as a repeating series:
<div class="title">
<h4>
<a href="[the link]">[the text]</a>
</h4>
So far, so good.
I then want to pull out the links alone. For some reason I can't separate them from the surrounding text.
I've tried the following script:
print(record.href) #outputs "None"
print(record.findAll('a',{"href"})) #outputs "[]"
print(record.findAll('h4',{"a":"href"})) #outputs "[]"
Any pointers as to where I'm going wrong?
You can just use findAll again and then access the href value via ["href"]:
from bs4 import BeautifulSoup
html = """<div class="title">
<h4>
<a href="[the link]">[the text]</a>
</h4>"""
soup = BeautifulSoup(html, "html.parser")
for record in soup.findAll("div", {"class": "title"}):
    print(record.findAll("a")[0]["href"])
Which prints:
[the link]
If there is more than one <a> inside the <div>, you can use a loop again, of course.
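The same extraction can also be written with a CSS selector (a sketch assuming the structure above); the a[href] part skips any <a> that has no href attribute:
from bs4 import BeautifulSoup
html = """<div class="title">
<h4>
<a href="[the link]">[the text]</a>
</h4>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for a in soup.select("div.title h4 a[href]"):
    print(a["href"])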
How can I extract the text after the <br/> tag?
I only want that text and not whatever is inside the <strong> tag.
<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>
Have tried code such as
text_content = paragraph.get_text(separator='strong/').strip()
But this will also include the text in the "strong" tag.
The "paragraph" variable is a bs4.element.Tag if that was not clear.
Any help appreciated!
If you have the <p> tag, then find the <br> within that and use .next_siblings
import bs4
html = '''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
paragraph = soup.find('p')
text_wanted = ''.join(paragraph.find('br').next_siblings)
print(text_wanted)
Output:
Text I want which also
includes linebreaks.
Find the <br> tag and use .next_element:
from bs4 import BeautifulSoup
data = '''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup = BeautifulSoup(data, 'html.parser')
item = soup.find('p').find('br').next_element
print(item)
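A third option, as a rough sketch: remove the <strong> from the tree and then take whatever text the paragraph still contains, which also works when several text nodes follow the <br/>:
from bs4 import BeautifulSoup
data = '''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup = BeautifulSoup(data, 'html.parser')
paragraph = soup.find('p')
# Drop the <strong> (and its text) from the tree, then read the remaining text.
paragraph.strong.decompose()
print(paragraph.get_text().strip())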
I'm trying to write a parser that will take HTML and convert/output to Wiki syntax (<b> = ''', <i> = '', etc).
So far, BeautifulSoup seems only capable of replacing the contents within a tag, so <b> becomes <'''> instead of '''. I can use a re.sub() to swap these out, but since BS turns the document into a 'complex tree of Python objects', I can't figure out how to swap out these tags and re-insert them into the overall document.
Does anyone have ideas?
I am pretty sure there are already tools that would do that for you, but if you are asking how to do it with BeautifulSoup, you can use replace_with(), though you need to preserve the element's text yourself. A naive and simple example:
from bs4 import BeautifulSoup
data = """
<div>
<b>test1</b>
<i>test2</i>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
    b.replace_with("'''%s'''" % b.text)
for i in soup.find_all("i"):
    i.replace_with("''%s''" % i.text)
print(soup.prettify())
Prints:
<div>
'''test1'''
''test2''
</div>
To also handle nested tags, e.g. "<div><b>bold with some <i>italics</i></b></div>", you have to be a bit more careful.
I put together the following implementation when I needed to do something similar:
from bs4 import BeautifulSoup
def wikify_tag(tag, replacement):
    tag.insert(0, replacement)
    tag.append(replacement)
    tag.unwrap()
data = """
<div>
<b>test1</b>
<i>test2</i>
<b>bold with some <i>italics</i></b>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
    wikify_tag(b, "'''")
for i in soup.find_all("i"):
    wikify_tag(i, "''")
print(soup)
Prints (note that .prettify() makes it look uglier):
<div>
'''test1'''
''test2''
'''bold with some ''italics'''''
</div>
If you also want to replace tags with wiki-templates you can extend wikify_tag to take a start and an end string.
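For instance, a sketch of such an extension (the {{code|...}} template name here is just a made-up example):
from bs4 import BeautifulSoup
def wikify_tag(tag, start, end):
    # Put the wiki markup inside the tag, then dissolve the tag itself.
    tag.insert(0, start)
    tag.append(end)
    tag.unwrap()
data = '<div><code>print("hi")</code> and <b>bold</b></div>'
soup = BeautifulSoup(data, "html.parser")
for code in soup.find_all("code"):
    wikify_tag(code, "{{code|", "}}")  # illustrative template name
for b in soup.find_all("b"):
    wikify_tag(b, "'''", "'''")
print(soup)
Prints:
<div>{{code|print("hi")}} and '''bold'''</div>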
I want to use BeautifulSoup to search and replace </a> with </a><br>. I know how to open with urllib2 and then parse to extract all the <a> tags. What I want to do is search and replace the closing tag with the closing tag plus the break. Any help, much appreciated.
EDIT
I would assume it would be something similar to:
soup.findAll('a').
In the documentation, there is a:
find(text="ahh").replaceWith('Hooray')
So I would assume it would be along the lines of:
soup.findAll(tag = '</a>').replaceWith(tag = '</a><br>')
But that doesn't work and the python help() doesn't give much
This will insert a <br> tag after the end of each <a>...</a> element:
from BeautifulSoup import BeautifulSoup, Tag
# ....
soup = BeautifulSoup(data)
for a in soup.findAll('a'):
    a.parent.insert(a.parent.index(a)+1, Tag(soup, 'br'))
You can't use soup.findAll(tag = '</a>') because BeautifulSoup doesn't operate on the end tags separately - they are considered part of the same element.
If you wanted to put the <a> elements inside a <p> element as you ask in a comment, you can use this:
for a in soup.findAll('a'):
    p = Tag(soup, 'p')  # create a P element
    a.replaceWith(p)    # put it where the A element is
    p.insert(0, a)      # put the A element inside the P (between <p> and </p>)
Again, you don't create the <p> and </p> separately because they are part of the same thing.
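In BeautifulSoup 4, by the way, this wrapping step has a ready-made one-liner, wrap() (a small sketch):
from bs4 import BeautifulSoup
soup = BeautifulSoup('<body>blah <a>blah</a> blah</body>', 'html.parser')
for a in soup.find_all('a'):
    # wrap() puts the <a> element inside a newly created <p>.
    a.wrap(soup.new_tag('p'))
print(soup)  # <body>blah <p><a>blah</a></p> blah</body>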
Suppose you have an element which you know contains <br/> tags; one way to remove and replace them with a different string is like this:
originalSoup = BeautifulSoup(open("your_html_file.html"))
replaceString = ", "  # replace each <br/> tag with ", "
# Ex. <p>Hello<br/>World</p> to <p>Hello, World</p>
cleanSoup = BeautifulSoup(str(originalSoup).replace("<br/>", replaceString))
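A tree-based alternative (a sketch using the bs4 API rather than string replacement) is to replace each <br/> node directly, which avoids serialising and re-parsing the whole document:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>Hello<br/>World</p>", "html.parser")
for br in soup.find_all("br"):
    # Swap the <br/> element for a plain text node.
    br.replace_with(", ")
print(soup)  # <p>Hello, World</p>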
You don't replace an end-tag; in BeautifulSoup you are dealing with a document object model like in a browser, not a string full of HTML. So you couldn't ‘replace’ an end-tag without also replacing the start-tag.
What you want to do is insert a new <br> element immediately after the <a>...</a> element. To do so you'll need to find out the index of the <a> element inside its parent element, and insert the new element just after that index, e.g.:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup('<body>blah <a>blah</a> blah</body>')
for link in soup.findAll('a'):
    br = Tag(soup, 'br')
    index = link.parent.contents.index(link)
    link.parent.insert(index+1, br)
# soup now serialises to '<body>blah <a>blah</a><br /> blah</body>'
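For completeness, the same idea in BeautifulSoup 4 (a sketch; the answers above use the old BeautifulSoup 3 API) becomes shorter thanks to new_tag() and insert_after():
from bs4 import BeautifulSoup
soup = BeautifulSoup('<body>blah <a>blah</a> blah</body>', 'html.parser')
for link in soup.find_all('a'):
    # Create a fresh <br> element and place it right after the </a>.
    link.insert_after(soup.new_tag('br'))
print(soup)  # <body>blah <a>blah</a><br/> blah</body>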