How do I extract all the <p> elements from the following HTML using BeautifulSoup in Python? The code I am trying is shown below.
HTML:
<div class="text_details">
<p>
Allah's Messenger (ﷺ) said: Islam is based on (the following) five (principles):
</p>
<p> 1. To testify that none has the right to be worshipped but Allah and Muhammad is Allah's Messenger (ﷺ).</p>
<p> 2. To offer the (compulsory congregational) prayers dutifully and perfectly.</p>
<p> 3. To pay Zakat (i.e. obligatory charity)</p>
<p> 4. To perform Hajj. (i.e. Pilgrimage to Mecca)</p>
<p> 5. To observe fast during the month of Ramadan.</p>
<p></p>
</div>
Code:
import requests
from bs4 import BeautifulSoup
url = "https://www.sunnah.com/bukhari/11"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# This line raises AttributeError: find_all() returns a ResultSet, which has no .p
for pp in soup.find_all(class_='text_details').p:
    print(pp.text)
You can use select() with a child combinator (>) to fetch every p whose direct parent is the div with class text_details, like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.sunnah.com/bukhari/11"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for pp in soup.select("div.text_details > p"):
    print(pp.text)
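The difference between the child combinator and a plain descendant selector can be seen on a small inline sample. This is a minimal sketch: the HTML below is made up for illustration (it is not the sunnah.com markup), and it uses the built-in html.parser so no network access or lxml install is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample so the snippet runs without a network request.
html = """
<div class="text_details">
  <p>First paragraph.</p>
  <blockquote><p>Nested paragraph.</p></blockquote>
  <p>Second paragraph.</p>
</div>
<p>Outside the div.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.text_details > p" matches only <p> tags that are direct children of the div.
direct = [p.get_text(strip=True) for p in soup.select("div.text_details > p")]

# "div.text_details p" (no ">") also matches <p> tags nested deeper inside the div.
nested = [p.get_text(strip=True) for p in soup.select("div.text_details p")]

print(direct)
print(nested)
```

If the paragraphs you want may be wrapped in other tags, the descendant form is probably the safer choice.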
You should first find() the div tag (find() returns a single tag), then call find_all() on it to get the p tags (find_all() returns a list of tags):
In [59]: for pp in soup.find(class_='text_details').find_all('p'):
    ...:     print(pp.text)
    ...:
I heard Allah's Messenger (ﷺ) (p.b.u.h) saying, "We (Muslims) are the last (to come) but (will be) the
foremost on the Day of Resurrection though the former nations were given the Holy Scriptures before
us. And this was their day (Friday) the celebration of which was made compulsory for them but they
differed about it. So Allah gave us the guidance for it (Friday) and all the other people are behind us in
this respect: the Jews' (holy day is) tomorrow (i.e. Saturday) and the Christians' (is) the day after
tomorrow (i.e. Sunday)."
The div contains only p tags, so you can get all of its text with:
In [60]: soup.find(class_='text_details').text
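To make the find() versus find_all() distinction concrete, here is a small sketch on a made-up two-paragraph snippet (an assumption for illustration, not the real page). It shows why attribute access such as .p works on a single Tag but not on the list-like result of find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = '<div class="text_details"><p>One</p><p>Two</p></div>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(class_="text_details")  # list-like ResultSet of Tags
first = soup.find(class_="text_details")        # a single Tag, or None if absent

# Attribute access such as .p on a Tag returns only the FIRST matching descendant.
first_p_text = first.p.get_text()

# find_all("p") on the Tag returns every <p> below it.
all_p_text = [p.get_text() for p in first.find_all("p")]

# A ResultSet must be indexed or looped over before navigating further.
via_result_set = [p.get_text() for div in matches for p in div.find_all("p")]

print(first_p_text, all_p_text, via_result_set)
```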
I am new to web scraping and am using Beautiful Soup 4. I am trying to print just the text inside the <p></p> tags that are nested under a div. I can get them to print, but the tags still appear, and Beautiful Soup won't let me use .text to extract the text from the paragraph tags. Is there a better way to do this, or am I just doing it wrong?
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.wuxiaworld.com/novel/tales-of-demons-and-gods/tdg-chapter-1"
headers = {'User-Agent': 'Mozilla/5.0'}
req = requests.get(url, headers=headers)
src = req.content
print(req.status_code)
print()
bs4soup = soup(src, 'lxml')
bs4soup.prettify()
# div = bs4soup.find('div', attrs={'id': 'chapter-content'})
# par = div.find_all('p')
# print(par) # This just prints all of the <p> tags on the same line.
for paragraph in bs4soup.find('div', attrs={'id': 'chapter-content'}):
    print(paragraph)
This is what my output looks like
<p><strong>Chapter 1 - Reborn</strong></p>
<p><span style="">On the colossal St. Ancestral Mountain range, sunlight shined through the gap between the mountains into the deep canyons. On the hillside beside the valley, some snow remained.</span></p>
<p><span style="">Even though it’s already the early summer, the snow has yet to melt. The cold here is exceptionally long. The roars of beasts can occasionally be heard.</span></p>
and it continues like that for every paragraph.
I tried to use Beautiful Soup's .text attribute, but when I changed the loop body to paragraph.text it raised AttributeError: 'NavigableString' object has no attribute 'text'
What would be the best way to tackle this?
Since the structure of the page looks something like this:
<div id="chapter-content">
<script></script>
<p></p>
<p></p>
...
</div>
You can use the .find_all() method to get all the p tags and print just the text from those tags:
for paragraph in bs4soup.find('div', attrs={'id': 'chapter-content'}).find_all('p'):
    print(paragraph.text)
Output will look like this:
Chapter 1 - Reborn
On the colossal St. Ancestral Mountain range, sunlight shined through the gap between the mountains into the deep canyons. On the hillside beside the valley, some snow remained.
Even though it’s already the early summer, the snow has yet to melt. The cold here is exceptionally long. The roars of beasts can occasionally be heard.
....
You're getting the error because you are assuming that every child of the selected div is a p tag, which is not the case.
You can check this by running:
main_div = bs4soup.find('div', attrs={'id': 'chapter-content'})  # Get the div first
for item in main_div:
    print(item.name)
You will then see items that are not p tags, such as the script tag and the text nodes between tags. The error occurs because those text nodes are NavigableString objects, which in your version of Beautiful Soup have no .text attribute.
To solve it, check inside the loop whether each item is a p tag:
main_div = bs4soup.find('div', attrs={'id': 'chapter-content'})
for item in main_div:
    if item.name == "p":  # Only p tags get past this check
        print(item.text)
Or, as I recommend, find all the p tags inside that div directly:
main_div = bs4soup.find('div', attrs={'id': 'chapter-content'})
p_tags = main_div.find_all("p")  # Only p tags are returned
# Every p tag has a .text attribute, so looping is now safe
for p in p_tags:
    print(p.text)
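If you do need to iterate over the direct children, another option is to test the type of each child explicitly. The sketch below assumes a cut-down, made-up version of the chapter markup and uses isinstance() with bs4's Tag class, so text nodes are skipped regardless of which attributes they expose.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened stand-in for the real page.
html = """
<div id="chapter-content">
  <script>var x = 1;</script>
  <p><strong>Chapter 1 - Reborn</strong></p>
  <p><span>Some snow remained.</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
main_div = soup.find("div", attrs={"id": "chapter-content"})

# Iterating a Tag yields its direct children: Tag objects and text nodes alike.
child_types = [type(child).__name__ for child in main_div]

# Keep only real <p> tags; whitespace text nodes and <script> are skipped.
paragraphs = [
    child.get_text(strip=True)
    for child in main_div
    if isinstance(child, Tag) and child.name == "p"
]

print(child_types)
print(paragraphs)
```

Note that this only sees direct children, whereas find_all("p") also finds paragraphs nested deeper.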
I am using BeautifulSoup4 to do some HTML scraping.
I am trying to extract important information such as the title, metadata, paragraphs and list items.
My issue is I can take the paragraphs like so:
import urllib.request
from bs4 import BeautifulSoup
def main():
    response = urllib.request.urlopen('https://ecir2019.org/industry-day/')
    html = response.read()
    soup = BeautifulSoup(html, features="html.parser")
    # Collect the text of every <p> tag on the page
    text = [e.get_text() for e in soup.find_all('p')]
    article = '\n'.join(text)
    print(article)
main()
But if the page has bullet points in the body text, they are missing from the output, and if I change p to li or ul to pick them up, the navigation bar is included as well.
For example what I want to get as output is:
The Industry Day's objectives are three-fold:
The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.
The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.
Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.
What I actually get:
The Industry Day's objectives are three-fold:
The tags in the HTML Source:
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>
You can use the CSS selector list syntax (a comma acts as "or") to select the li elements as well:
import requests
from bs4 import BeautifulSoup
url = 'https://ecir2019.org/industry-day/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('p, ol li')]
print(items)
Just that section:
import requests
from bs4 import BeautifulSoup
url = 'https://ecir2019.org/industry-day/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.kg-card-markdown p:nth-of-type(2), .kg-card-markdown p:nth-of-type(2) + ol li')]
print(items)
The page appears to have changed, so I am using a cached version (this will only work until the cache is updated). You can limit the selection to the post body with an additional class selector:
import requests
from bs4 import BeautifulSoup
url = 'http://webcache.googleusercontent.com/search?q=cache:https://ecir2019.org/industry-day'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.post-body p, .post-body ol li, .post-body ul li')]
print(items)
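The effect of scoping each alternative in the selector list can be checked offline. This is a rough sketch on invented markup: the .post-body class name is taken from the selector above, but the nav bar and body content here are assumptions, not the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a navigation list and a post body.
html = """
<nav><ul><li>Home</li><li>Programme</li></ul></nav>
<div class="post-body">
  <p>The objectives are three-fold:</p>
  <ol>
    <li>First objective.</li>
    <li>Second objective.</li>
  </ol>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# An unscoped "p, li" also picks up the navigation bar entries.
unscoped = [el.get_text(strip=True) for el in soup.select("p, li")]

# Prefixing every alternative with the container class keeps only body text.
scoped = [el.get_text(strip=True) for el in soup.select(".post-body p, .post-body li")]

article = "\n".join(scoped)
print(article)
```

Matches come back in document order, so the paragraph stays ahead of its list items when joined.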
I'm having trouble understanding how the bs4 parsing works to pull out information that is several levels down in a tag hierarchy.
Here is an example of what I'm trying to parse (from www.j-archive.com/showgame.php?game_id=50):
...
<table>
<tr>
<td>
<div onmouseover="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', '<em class="correct_response"><i>The Red Badge of Courage</i></em><br /><br /><table width="100%"><tr><td class="right">Kelley</td></tr></table>')" onmouseout="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', 'This classic by Stephen Crane is subtitled "An Episode of the American Civil War"')" onclick="togglestick('clue_DJ_1_1_stuck')">
I specifically want to get out the words "The Red Badge of Courage", so that's within <table>, <tr>, <td>, and <div>, and then appears to be part of the attribute onmouseover.
I can pull out all onmouseover statements with:
for tag in soup.findAll(onmouseover=True):
    print(tag['onmouseover'])
But I don't know how to parse inside that output.
Thanks in advance.
Since the text you're interested in is in the <em> tags, it's pretty easy to parse using substring indexing:
import requests
from bs4 import BeautifulSoup
req = requests.get('http://www.j-archive.com/showgame.php?game_id=50')
soup = BeautifulSoup(req.text, 'lxml')
for tag in soup.findAll('div', onmouseover=True):
    parseText = str(tag['onmouseover'])
    tag1 = '<em class="correct_response">'
    tag2 = '</em>'
    i1 = parseText.index(tag1)
    i2 = parseText.index(tag2)
    print(parseText[i1+len(tag1):i2])
This code got all of the entries until 'Final Jeopardy'.
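An alternative to substring indexing is to parse the attribute value a second time with BeautifulSoup. This is a sketch under assumptions: the two div tags below are simplified stand-ins (the second one, with the made-up id clue_FJ, just represents an attribute without the em markup), and the real page may differ. One practical difference is that str.index raises ValueError when the substring is missing, while select_one returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the HTML inside the attribute is entity-escaped.
html = """
<div onmouseover="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;&lt;i&gt;The Red Badge of Courage&lt;/i&gt;&lt;/em&gt;')">clue 1</div>
<div onmouseover="toggle('clue_FJ', 'clue_FJ_stuck', 'no answer markup here')">clue 2</div>
"""
soup = BeautifulSoup(html, "html.parser")

answers = []
for tag in soup.find_all("div", onmouseover=True):
    # The attribute value is plain text to the outer parser, so parse it again.
    fragment = BeautifulSoup(tag["onmouseover"], "html.parser")
    em = fragment.select_one("em.correct_response")
    # select_one returns None when nothing matches, so no exception is raised.
    answers.append(em.get_text() if em is not None else None)

print(answers)
```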
from BeautifulSoup import BeautifulSoup
html = '''<div class="thisText">
Poem The RavenOnce upon a midnight dreary, while I pondered, weak and weary... </div>
<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace
</div>'''
soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)
I have this sample code and I can't find how to add spaces around the removed tags, so that the text inside the <a href...> tags stays readable instead of being displayed like this:
PoemThe RavenOnce upon a midnight dreary, while I pondered, weak and weary...
In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace
get_text() in beautifulsoup4 has an optional argument called separator. You can use it as follows:
soup = BeautifulSoup(html)
text = soup.get_text(separator=' ')
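A small before-and-after comparison may help here. The snippet below is a sketch on a shortened, made-up version of the poem markup with one inline link; it also passes strip=True, which trims each text fragment before joining so no doubled spaces appear.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an inline <a> tag between two text fragments.
html = '<div class="thisText">Poem <a href="#raven">The Raven</a>Once upon a midnight dreary...</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="thisText")

# Default: text fragments are concatenated with nothing in between.
plain = div.get_text()

# separator=" " joins fragments with a space; strip=True trims each fragment first.
spaced = div.get_text(separator=" ", strip=True)

print(repr(plain))
print(repr(spaced))
```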
One option would be to find all text nodes and join them with a space:
" ".join(item.strip() for item in poems.find_all(text=True))
Additionally, you are using the BeautifulSoup 3 package, which is outdated and no longer maintained. Upgrade to beautifulsoup4:
pip install beautifulsoup4
and replace:
from BeautifulSoup import BeautifulSoup
with:
from bs4 import BeautifulSoup
Here is an alternative using lxml and its xpath function to search for all text nodes:
from lxml import etree
html = '''<div class="thisText">
Poem The RavenOnce upon a midnight dreary, while I pondered, weak and weary... </div>
<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace
</div>'''
root = etree.fromstring(html, etree.HTMLParser())
print(' '.join(root.xpath("//text()")))
It yields:
Poem The Raven Once upon a midnight dreary, while I pondered, weak and weary...
In the greenest of our valleys By good angels tenanted..., part of The Haunted Palace
I'm having a very hard time getting BeautifulSoup to scrape some data for me. What's the best way to access the date (the actual number, 2008) from this code sample? It's my first time using BeautifulSoup. I've figured out how to scrape URLs off the page, but I can't quite narrow it down to select only the word Date and then return whatever numeric date follows (in the dd tags). Is what I'm asking even possible?
<div class='dl_item_container clearfix detail_date'>
<dt>Date</dt>
<dd>
2008
</dd>
</div>
Find the dt tag by text and find the next dd sibling:
soup.find('div', class_='detail_date').find('dt', text='Date').find_next_sibling('dd').text
The complete code:
from bs4 import BeautifulSoup
data = """
<div class='dl_item_container clearfix detail_date'>
<dt>Date</dt>
<dd>
2008
</dd>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
date_field = soup.find('div', class_='detail_date').find('dt', text='Date')
print(date_field.find_next_sibling('dd').text.strip())
Prints 2008.
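If the page has several such dt/dd pairs, the same sibling lookup can be generalized into a dictionary. This is only a sketch: the second div (with the made-up class detail_medium and its content) is an assumption added to show more than one pair.

```python
from bs4 import BeautifulSoup

data = """
<div class='dl_item_container clearfix detail_date'>
  <dt>Date</dt>
  <dd>
    2008
  </dd>
</div>
<div class='dl_item_container clearfix detail_medium'>
  <dt>Medium</dt>
  <dd>Oil on canvas</dd>
</div>
"""
soup = BeautifulSoup(data, "html.parser")

details = {}
for dt in soup.find_all("dt"):
    # find_next_sibling only looks at later siblings under the same parent.
    dd = dt.find_next_sibling("dd")
    if dd is not None:
        details[dt.get_text(strip=True)] = dd.get_text(strip=True)

print(details["Date"])
```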