BeautifulSoup's get_text() function only returns the text content of an HTML page. However, I want my program to return the href link of an <a> tag in parentheses directly after the tag's text.
In other words, using get_text() will just return "17.602" on the following HTML:
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
However, I want my program to return "17.602 (17.602.html#FAR_17_602)". How would I go about doing this?
EDIT: What if you need to print text from other tags, such as:
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
In other words, how would you compose a program that would print
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
You can format the output using f-strings.
Access the tag's text using .text, and then access the href attribute.
from bs4 import BeautifulSoup
html = """
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.text} ({a_tag['href']})")
Output:
17.602 (17.602.html#FAR_17_602)
Edit: You can use .next_sibling and .previous_sibling
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
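If the paragraph contains several links, or you would rather not depend on the exact sibling layout, another option is to replace every <a> with its text plus href and then call get_text() on the result. This is only a minimal sketch, assuming html.parser and the sample markup from the question; collapsing whitespace with split() and join() is my own choice, not something the question asks for.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>"""

soup = BeautifulSoup(html, "html.parser")

# Swap each link for a plain string of the form "text (href)".
# href=True skips <a> tags that have no href attribute.
for a_tag in soup.find_all("a", href=True):
    a_tag.replace_with(NavigableString(f"{a_tag.get_text()} ({a_tag['href']})"))

# Collapse runs of whitespace (including newlines) into single spaces.
result = " ".join(soup.get_text().split())
print(result)
```

Note that this modifies the parsed tree in place, so parse the HTML again if you still need the original tags afterwards.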
Related
I'm trying to scrape imdb.com with BeautifulSoup in Python, but some HTML tags contain hyphens ('-') in their text, and my code does not seem to be able to read them.
The page I was trying to scrape: click here
I'm trying to extract "TV-MA" from tag below
<span class="certificate">TV-MA</span>
So, I will crawl using code like this :
item.find("span",{"class": "certificate"}).text
But the code above raises a NoneType error (AttributeError: 'NoneType' object has no attribute 'text'). That means the span tag was not found when I called .find(). The span wasn't in the HTML I downloaded either (I know this because I printed the HTML). But when I inspect the element in the browser, the span tag is there...
I've tried crawling with other text that contain "-" with the same span, such as:
<span class="certificate">PG-13</span>
And the crawl code I've used above will work (which means it will return: "PG-13"). So I don't think the problem is with the code.
This worked for me:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.imdb.com/search/title/?genres=documentary')
soup = BeautifulSoup(r.text, 'html.parser')
text = soup.find("span",{"class": "certificate"}).text
print(text)
Prints:
TV-MA
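If the span is missing for some items, find() returns None and calling .text on it raises the NoneType error from the question. A small sketch of guarding against that; the div.item wrappers below are made-up markup for illustration, not IMDb's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second item has no certificate span.
html = """
<div class="item"><span class="certificate">TV-MA</span></div>
<div class="item"></div>
"""
soup = BeautifulSoup(html, "html.parser")

certificates = []
for item in soup.find_all("div", class_="item"):
    span = item.find("span", {"class": "certificate"})
    # find() returns None when nothing matches, so check before reading text.
    certificates.append(span.get_text(strip=True) if span is not None else None)

print(certificates)
```

The hyphen in "TV-MA" is ordinary text as far as the parser is concerned, so a missing tag is the more likely cause of the error than the hyphen itself.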
I'm scraping a webpage with several p elements and I wanna get the text inside of them without including their children.
The page is structured like this:
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
When I use
parent.find("p", {"class": "default"}).get_text() this is the result I get:
I don't want this text
I want this text
I'm using BeautifulSoup 4 with Python 3
Edit: When I use
parent.find_all("p", {"class": "public item-cost"}, text=True, recursive=False)
It returns an empty list
You can use .find_next_sibling() with text=True parameter:
from bs4 import BeautifulSoup
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select_one(".default > div").find_next_sibling(text=True))
Prints:
I want this text
Or using .contents:
print(soup.find("p", class_="default").contents[-1])
EDIT: To strip the string:
print(soup.find("p", class_="default").contents[-1].strip())
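If the wanted text is not guaranteed to be the last child, you could instead keep only the strings that are direct children of the <p>. A minimal sketch, assuming html.parser (other parsers may restructure a <div> inside a <p>, since that nesting is not valid HTML):

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.find("p", class_="default")

# .children yields only direct children; keep the non-blank text nodes
# and skip child tags such as the <div>.
direct_text = [
    str(node).strip()
    for node in p.children
    if isinstance(node, NavigableString) and node.strip()
]
print(direct_text)
```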
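You can use XPath, which is a bit more complex but provides much more powerful querying. BeautifulSoup itself has no xpath() method, so this requires lxml instead.
Something like this selects the non-blank text nodes directly under the matching p, where tree is a document parsed with lxml (note that lxml may restructure invalid markup such as a div inside a p):
tree.xpath('//p[contains(@class, "default")]/text()[normalize-space()]')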
How can I extract the text after the "br/" tag?
I only want that text and not whatever is inside the "strong" tag.
<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>
Have tried code such as
text_content = paragraph.get_text(separator='strong/').strip()
But this will also include the text in the "strong" tag.
The "paragraph" variable is a bs4.element.Tag if that was not clear.
Any help appreciated!
If you have the <p> tag, then find the <br> within that and use .next_siblings
import bs4
html = '''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
paragraph = soup.find('p')
text_wanted = ''.join(paragraph.find('br').next_siblings)
print (text_wanted)
Output:
Text I want which also
includes linebreaks.
Find <br> tag and use next_element
from bs4 import BeautifulSoup
data='''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup=BeautifulSoup(data,'html.parser')
item=soup.find('p').find('br').next_element
print(item)
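Both approaches above work when everything after the <br> is plain text. If tags can appear after the <br> as well, ''.join() over next_siblings raises a TypeError and next_element only returns the first node. A sketch that handles both cases; the <em> tag is my own addition to the question's sample, purely for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Same sample as the question, plus a hypothetical <em> after the <br>.
html = '''<p><strong>A title</strong><br/>
Text I want which also
<em>includes</em> linebreaks.</p>'''

soup = BeautifulSoup(html, "html.parser")
br = soup.find("p").find("br")

parts = []
for sibling in br.next_siblings:
    if isinstance(sibling, NavigableString):
        parts.append(str(sibling))        # plain text node
    else:
        parts.append(sibling.get_text())  # tag: take its inner text

text_wanted = "".join(parts).strip()
print(text_wanted)
```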
I'm using Beautiful Soup and I want to extract the text between the single quotes ('...') with the findAll method.
content = urllib.urlopen(address).read()
soup = BeautifulSoup(content, from_encoding='utf-8')
soup.prettify()
x = soup.findAll(do not know what to write)
An extract from soup as an example:
<td class="leftCell identityColumn snap" onclick="fundview('Schroder
European Special Situations');" title="Schroder European Special
Situations"> <a class="coreExpandArrow" href="javascript:
void(0);"></a> <span class="sigill"><a class="qtpop"
href="/vips/ska/all/sv/quicktake/redirect?perfid=0P0000XZZ3&flik=Chosen">
<img
src="/vips/Content/corestyles/4pSigillGubbe.gif"/></a></span>
<span class="bluetext" style="white-space: nowrap; overflow:
hidden;">Schroder European Spe..</span>
I would like the result from soup.findAll(do not know what to write) to be: Schroder European Special Situations and the findall logic should be based on that it is the text between single quotation marks.
Locate the td element and get its onclick attribute value; at that point BeautifulSoup's job is done. The next step is to extract the desired text from the attribute value, and regular expressions work well for that. Implementation:
import re
onclick = soup.select_one("td.identityColumn[onclick]")["onclick"]
match = re.search(r"fundview\('(.*?)'\);", onclick)
if match:
    print(match.group(1))
Alternatively, the span with the bluetext class also holds the name, although in this extract it is truncated to "Schroder European Spe..":
soup.select_one("td.identityColumn span.bluetext").get_text()
Also, make sure you are using the 4th BeautifulSoup version and your import statement is:
from bs4 import BeautifulSoup
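For reference, here is a self-contained version of the regex approach that runs against a trimmed copy of the extract. It is a sketch under a few assumptions: the surrounding table and tr tags are mine, the onclick value is wrapped across two lines as in the extract (hence re.DOTALL and the whitespace normalization), and html.parser is used.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the extract; <table>/<tr> wrappers added for illustration.
html = '''<table><tr>
<td class="leftCell identityColumn snap" onclick="fundview('Schroder
European Special Situations');" title="Schroder European Special Situations">
<span class="bluetext">Schroder European Spe..</span>
</td></tr></table>'''

soup = BeautifulSoup(html, "html.parser")
cell = soup.select_one("td.identityColumn[onclick]")

# re.DOTALL lets (.*?) span the line break inside the attribute value.
match = re.search(r"fundview\('(.*?)'\);", cell["onclick"], re.DOTALL)

# Collapse any newline or repeated whitespace into single spaces.
fund_name = " ".join(match.group(1).split()) if match else None
print(fund_name)
```

In this extract the td's title attribute carries the same full name, so cell["title"] may be a simpler source if it is present on every row.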
The page is: http://item.taobao.com/item.htm?id=13015989524
you can see its source code.
In its source code the following code exists
<a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank">
But when I use BeautifulSoup to read the source code and execute the following
soup.findAll('a', href="http://item.taobao.com/item.htm?id=13015989524")
It returns an empty list, []. Why does it return '[]'?
As far as I can see, the <a> tag you are trying to find is inside a <textarea> tag. BS does not parse the contents of <textarea> as HTML, and rightly so since <textarea> should not contain HTML. In short, that page is doing something sketchy.
If you really need to get that, you might "cheat" and parse the contents of <textarea> again and search within them:
import urllib
from BeautifulSoup import BeautifulSoup as BS
soup = BS(urllib.urlopen("http://item.taobao.com/item.htm?id=13015989524"))
a = []
for textarea in soup.findAll("textarea"):
    textsoup = BS(textarea.text) # parse the contents as html
    a.extend(textsoup.findAll("a", attrs={"href":"http://item.taobao.com/item.htm?id=13015989524"}))
for tag in a:
    print tag
# outputs
# <a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank"><img ...
# <a href="http://item.taobao.com/item.htm?id=13015989524" title="901 ...
Use a dictionary to store the attribute:
soup.findAll('a', {
'href': "http://item.taobao.com/item.htm?id=13015989524"
})
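With bs4, the dictionary form and the keyword form should be equivalent for an attribute like href. A quick check on made-up markup (the second link and the link texts are placeholders of mine); note that this only shows the two call styles agree and does not change what the parser sees inside a <textarea>.

```python
from bs4 import BeautifulSoup

url = "http://item.taobao.com/item.htm?id=13015989524"

# Hypothetical markup with one matching and one non-matching link.
html = f'<a href="{url}" target="_blank">item</a><a href="http://example.com/">other</a>'
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", {"href": url})  # attrs passed as a dictionary
by_keyword = soup.find_all("a", href=url)    # same filter as a keyword argument

print(by_dict)
print(by_dict == by_keyword)
```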