Get Paragraph Content - Python

I am a bit confused about getting the content of a paragraph tag.
<div class="SomeID">
<p>What a voice! </p>
</div>
I have reached this point:
list = soup.find_all("div","SomeID")
But how do I get the paragraph content ("What a voice!")?
The basic problem is to get the content of all the paragraph tags from this page:
import urllib
from bs4 import BeautifulSoup
html = urllib.urlopen('http://www.dawn.com/news/1267272/democracys-woes').read()
soup = BeautifulSoup(html, 'html.parser')
list = soup.find_all("div","comment__body cf")
print list

You can actually do it in one go with a CSS selector:
for p in soup.select("div.SomeID > p"):
    print(p.get_text(strip=True))
Or, if you need a single p element:
soup.select_one("div.SomeID > p").get_text(strip=True)
Note that > here means the direct parent-child relationship.
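Putting it together for the page from the question, here is a minimal sketch, assuming the comment markup on that page still uses class="comment__body cf" (and using Python 3's urllib):
import urllib.request
from bs4 import BeautifulSoup

# Fetch the article and print the text of every <p> inside a comment div.
# The selector is an assumption based on the class from the question.
html = urllib.request.urlopen('http://www.dawn.com/news/1267272/democracys-woes').read()
soup = BeautifulSoup(html, 'html.parser')

for p in soup.select("div.comment__body.cf > p"):
    print(p.get_text(strip=True))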

Related

Python / Beautifulsoup: HTML Path to the current element

For a class project, I'm working on extracting all links on a webpage. This is what I have so far.
from bs4 import BeautifulSoup, SoupStrainer
with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile)

outputFile = open('output.txt', 'w')
for link in soup.find_all('a', href=True):
    outputFile.write(str(link) + '\n')
outputFile.close()
This works very well.
Here's the complication: for every <a> element, my project requires me to know the entire "tree structure" down to the current link. In other words, I'd like to know all of its ancestor elements, starting with the <body> element, and the class and id of each along the way.
Like the navigation pane in Windows Explorer, or the navigation panel in many browsers' element-inspection tools.
For example, if you look at the Bible page on Wikipedia and its link to the Wikipedia page for the Talmud, the following "path" is what I'm looking for.
<body class="mediawiki ...>
  <div id="content" class="mw-body" role="main">
    <div id="bodyContent" class="mw-body-content">
      <div id="mw-content-text" ...>
        <div class="mw-parser-output">
          <div role="navigation" ...>
            <table class="nowraplinks ...>
              <tbody>
                <td class="navbox-list ...>
                  <div style="padding:0em 0.25em">
                    <ul>
                      <li>
                        <a href="/wiki/Talmud"
Thanks a bunch.
-Maureen
Try this code:
soup = BeautifulSoup(inputFile, 'html.parser')
Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')
If it is not installed:
pip install lxml
Here is a solution I just wrote. It works by finding the element, then navigating up the tree via the element's parent. I parse out just the opening tag of each ancestor and add it to a list, then reverse the list at the end, so we finish with a list that resembles the tree you requested.
I have written it for one element; you can modify it to work with your find_all.
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')

tree = []
hrefElement = soup.find('a', href=True)

# Keep just the opening tag of the element itself.
hrefString = str(hrefElement).split(">")[0] + ">"
tree.append(hrefString)

# Walk up through the ancestors, stopping at <html>.
hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()

tree.reverse()
print(tree)
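The same idea can be written more compactly with BeautifulSoup's .parents generator, which walks up the tree for you. A sketch producing the same kind of list, reusing the opening-tag trick from above:
# Collect the opening tag of the element and each of its ancestors,
# skipping <html> and the BeautifulSoup document node itself.
tree = [str(el).split(">")[0] + ">"
        for el in [hrefElement, *hrefElement.parents]
        if el.name not in ("html", "[document]")]
tree.reverse()
print(tree)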

How to find an element based on its text, ignoring child tags, in BeautifulSoup

I am looking for a solution using Python and BeautifulSoup to find an element based on the inside text. For example:
<div> <b>Ignore this text</b>Find based on this text </div>
How can I find this div? Thanks for your help!
You can use .find with the text argument and then use findParent to get to the parent element.
Ex:
from bs4 import BeautifulSoup
s="""<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, 'html.parser')
t = soup.find(text="Find based on this text ")
print(t.findParent())
Output:
<div> <b>Ignore this text</b>Find based on this text </div>
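If you would rather not depend on the exact trailing whitespace, a regex match is a variant of the same approach; find_parent can also be given the tag name you expect:
import re
from bs4 import BeautifulSoup

s = """<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, 'html.parser')

# Match the text node by pattern instead of by exact string.
t = soup.find(text=re.compile("Find based on this text"))
print(t.find_parent("div"))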
Try this; it is like your example, but it works:
from bs4 import BeautifulSoup

html = """
<div> <b>Ignore this text</b>Find based on this text </div>
"""
soup = BeautifulSoup(html, 'lxml')

s = soup.find('div')
# Remove every <b> child so only the remaining text is left.
for child in s.find_all('b'):
    child.decompose()

print(s.get_text())
Output
Find based on this text
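Note that decompose() mutates the tree in place, so the <b> elements are gone from soup afterwards; if you still need the original markup, keep the source string around or parse a fresh copy before stripping children.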

BeautifulSoup replace_with for non-standard tags

I'm trying to write a parser that will take HTML and convert/output to Wiki syntax (<b> = ''', <i> = '', etc).
So far, BeautifulSoup seems only capable of replacing the contents within a tag, so <b> becomes <'''> instead of '''. I can use a re.sub() to swap these out, but since BS turns the document into a 'complex tree of Python objects', I can't figure out how to swap out these tags and re-insert them into the overall document.
Does anyone have ideas?
I am pretty sure there are already tools that would do this for you, but if you are asking how to do it with BeautifulSoup, you can use replace_with(); you just need to preserve the text of the element. A naive and simple example:
from bs4 import BeautifulSoup

data = """
<div>
<b>test1</b>
<i>test2</i>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

for b in soup.find_all("b"):
    b.replace_with("'''%s'''" % b.text)

for i in soup.find_all("i"):
    i.replace_with("''%s''" % i.text)

print(soup.prettify())
Prints:
<div>
'''test1'''
''test2''
</div>
To also handle nested tags, e.g. "<div><b>bold with some <i>italics</i></b></div>" you have to be a bit more careful.
I put together the following implementation when I needed to do something similar:
from bs4 import BeautifulSoup

def wikify_tag(tag, replacement):
    # Put the marker at both ends of the tag's contents,
    # then remove the tag itself while keeping its children.
    tag.insert(0, replacement)
    tag.append(replacement)
    tag.unwrap()

data = """
<div>
<b>test1</b>
<i>test2</i>
<b>bold with some <i>italics</i></b>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

for b in soup.find_all("b"):
    wikify_tag(b, "'''")

for i in soup.find_all("i"):
    wikify_tag(i, "''")

print(soup)
Prints (note that .prettify() makes it look uglier):
<div>
'''test1'''
''test2''
'''bold with some ''italics'''''
</div>
If you also want to replace tags with wiki-templates you can extend wikify_tag to take a start and an end string.
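For example, a sketch of that extension (the {{mono|...}} template name here is just a placeholder):
def wikify_tag_between(tag, start, end):
    # Same trick as above, but with distinct start and end markers.
    tag.insert(0, start)
    tag.append(end)
    tag.unwrap()

# Hypothetical usage: turn <code> elements into a wiki template.
for c in soup.find_all("code"):
    wikify_tag_between(c, "{{mono|", "}}")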

I am struggling with Python and HTML. I know the class of a certain header; I need the info from the generic <a href=...> in this h1

So, I have this:
<h1 class='entry-title'>
<a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>
How can I retrieve the URL (it is not always the same) and the title (also not always the same)?
Parse it with an HTML parser, e.g. with BeautifulSoup it would be:
from bs4 import BeautifulSoup

data = "your HTML here"  # data can be the result of urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")

link = soup.select("h1.entry-title > a")[0]
print(link.get("href"))
print(link.get_text())
where h1.entry-title > a is a CSS selector matching an a element directly under h1 element with class="entry-title".
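For reference, here is a self-contained run against the snippet from the question, using select_one (which returns the first match, or None if nothing matches):
from bs4 import BeautifulSoup

data = """<h1 class='entry-title'>
<a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>"""

soup = BeautifulSoup(data, "html.parser")
link = soup.select_one("h1.entry-title > a")
print(link.get("href"))   # http://theurlthatvariesinlengthbasedonwhenirequesthehtml
print(link.get_text())    # theTitleIneedthatvariesinlength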
Well, just working with strings, you can do this:
>>> s = '''<h1 class='entry-title'>
... <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
... </h1>'''
>>> s.split('>')[1].strip().split('=')[1].strip("'")
'http://theurlthatvariesinlengthbasedonwhenirequesthehtml'
>>> s.split('>')[2][:-3]
'theTitleIneedthatvariesinlength'
There are other (and better) options for parsing HTML though.

Beautiful Soup can't get news titles

from bs4 import BeautifulSoup
import requests
url ="http://www.basketnews.lt/lygos/59-nacionaline-krepsinio-asociacija/2013/naujienos.html"
r = requests.get(url)
soup = BeautifulSoup(r.text)
naujienos = soup.findAll('a', {'class':'title'})
print naujienos
Here is the important part of the HTML:
<div class="title">
<span class="feedbacks"></span>
</div>
I get an empty list. Where is my mistake?
EDIT:
Thanks, it worked. Now I want to print the news titles. This is how I am trying to do it:
nba = soup.select('div.title > a')
for i in nba:
    print ""+i.string+"\n"
I get at most 5 titles before an error occurs: cannot concatenate 'str' and 'NoneType' objects
soup.findAll('a', {'class':'title'})
This says: give me all a tags that also have class="title". That's obviously not what you're trying to do.
I think you want the a tags that are direct children of a tag with class="title". You can try using a CSS selector:
soup.select('div.title > a')
Out[58]:
[Blatche’as: „Garantuoju, kad laimėsime“,
<a href="/news-73147-rockets-veikiausiai-pasiliks-mchalea.html">„Rockets“ veikiausiai pasiliks McHale’ą
</a>,
# snip lots of other links
]
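As for the edit: .string returns None whenever a tag has more than one child (for example a link that contains a <span> as well as text), which is what triggers the "cannot concatenate" error. Using get_text() instead avoids that, e.g.:
nba = soup.select('div.title > a')
for i in nba:
    # get_text() flattens all descendants, so it never returns None.
    print(i.get_text(strip=True))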
