BeautifulSoup4 Parsing inside tag hierarchy


I'm having trouble understanding how the bs4 parsing works to pull out information that is several levels down in a tag hierarchy.
Here is an example of what I'm trying to parse (from www.j-archive.com/showgame.php?game_id=50):
...
<table>
<tr>
<td>
<div onmouseover="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', '<em class="correct_response"><i>The Red Badge of Courage</i></em><br /><br /><table width="100%"><tr><td class="right">Kelley</td></tr></table>')" onmouseout="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', 'This classic by Stephen Crane is subtitled "An Episode of the American Civil War"')" onclick="togglestick('clue_DJ_1_1_stuck')">
I specifically want to get out the words "The Red Badge of Courage", so that's within <table>, <tr>, <td>, and <div>, and then appears to be part of the attribute onmouseover.
I can pull out all onmouseover statements with:
for tag in soup.findAll(onmouseover=True):
    print(tag['onmouseover'])
But I don't know how to parse inside that output.
Thanks in advance.

Since the text you're interested in is in the <em> tags, it's pretty easy to parse using substring indexing:
import requests
from bs4 import BeautifulSoup
req = requests.get('http://www.j-archive.com/showgame.php?game_id=50')
soup = BeautifulSoup(req.text, 'lxml')
for tag in soup.findAll('div', onmouseover=True):
    parseText = str(tag['onmouseover'])
    tag1 = '<em class="correct_response">'
    tag2 = '</em>'
    i1 = parseText.index(tag1)
    i2 = parseText.index(tag2)
    print(parseText[i1 + len(tag1):i2])
This code got all of the entries until 'Final Jeopardy'.
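One thing to watch with the indexing approach is that str.index raises ValueError when the marker is missing, which may be why the loop stops at a clue whose markup differs. Below is a minimal sketch of a more tolerant variant that uses a regular expression on the attribute string and returns None when nothing matches. The helper name extract_correct_response is hypothetical, and the sample string is the onmouseover value quoted in the question:

```python
import re

# Matches the contents of <em class="correct_response">...</em> (non-greedy).
RESPONSE_RE = re.compile(r'<em class="correct_response">(.*?)</em>', re.DOTALL)
# Matches any remaining inline tag such as <i> or </i>.
TAG_RE = re.compile(r'<[^>]+>')


def extract_correct_response(onmouseover):
    """Return the correct response text, or None if the marker is absent."""
    match = RESPONSE_RE.search(onmouseover)
    if match is None:
        return None
    return TAG_RE.sub('', match.group(1)).strip()


sample = (
    "toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', "
    "'<em class=\"correct_response\"><i>The Red Badge of Courage</i></em>"
    "<br /><br /><table width=\"100%\"><tr><td class=\"right\">Kelley</td></tr></table>')"
)
print(extract_correct_response(sample))      # The Red Badge of Courage
print(extract_correct_response("toggle()"))  # None
```

In the loop above, this would replace the two index calls, and entries that return None could simply be skipped.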

Related

How can I get Beautifulsoup to print text from paragraphs that are nested inside of a div?

I am new to web scraping and using beautiful soup 4. I was trying to get just the text printed from inside the <p></p> tags which are nested under a div. I am able to get them to print but the tags still appear and the beautiful soup won't let me use the .text function to extract the text from the paragraph tags. Is there a better way to do this or am I just doing it wrong?
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.wuxiaworld.com/novel/tales-of-demons-and-gods/tdg-chapter-1"
headers = {'User-Agent': 'Mozilla/5.0'}
req = requests.get(url, headers=headers)
src = req.content
print(req.status_code)
print()
bs4soup = soup(src, 'lxml')
bs4soup.prettify()
# div = bs4soup.find('div', attrs={'id': 'chapter-content'})
# par = div.find_all('p')
# print(par) # This just prints all of the <p> tags on the same line.
for paragraph in bs4soup.find('div', attrs={'id': 'chapter-content'}):
    print(paragraph)
This is what my output looks like
<p><strong>Chapter 1 - Reborn</strong></p>
<p><span style="">On the colossal St. Ancestral Mountain range, sunlight shined through the gap between the mountains into the deep canyons. On the hillside beside the valley, some snow remained.</span></p>
<p><span style="">Even though it’s already the early summer, the snow has yet to melt. The cold here is exceptionally long. The roars of beasts can occasionally be heard.</span></p>
and continues down like that for all the lines in the paragraph.
I tried to use the text function built into Beautiful Soup but when I added that to the paragraph.text it spit out an AttributeError: 'NavigableString' object has no attribute 'text'
What would be the best way to tackle this?

Since the structure of the page looks something like this:
<div id="chapter-content">
<script></script>
<p></p>
<p></p>
...
</div>
you can use the .find_all() method to get all the p tags and print only the text from those tags:
for paragraph in bs4soup.find('div', attrs={'id': 'chapter-content'}).find_all('p'):
    print(paragraph.text)
Output will look like this:
Chapter 1 - Reborn
On the colossal St. Ancestral Mountain range, sunlight shined through the gap between the mountains into the deep canyons. On the hillside beside the valley, some snow remained.
Even though it’s already the early summer, the snow has yet to melt. The cold here is exceptionally long. The roars of beasts can occasionally be heard.
....

You're getting the error because you are assuming that all the children of the selected div are p tags, which is not the case.
You can check this with the following code:
main_div = bs4soup.find('div', attrs={'id': 'chapter-content'})  # retrieve the div first
for item in main_div:
    print(item.name)
You will then see items other than p, and the error occurs because some of them don't have the .text attribute.
To solve it, you can check inside the loop whether each item is a p tag:
main_div = bs4soup.find('div', attrs={'id': 'chapter-content'})
for item in main_div:
    if item.name == "p":  # only p tags get through
        print(item.text)
Or, as I recommend, find all the p tags inside that div:
main_div = bs4soup.find('div', attrs={'id': 'chapter-content'})
p_tags = main_div.find_all("p")  # only p tags are returned
# Each p tag has a .text attribute, so looping is now safe
for p in p_tags:
    print(p.text)
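To see what iterating over the div actually yields, here is a small self-contained sketch. The HTML is a hypothetical stand-in that follows the structure described above, not the real page, and it uses the built-in html.parser so that lxml is not needed:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the real page, following the structure sketched above.
html = """
<div id="chapter-content">
<script></script>
<p><strong>Chapter 1 - Reborn</strong></p>
<p><span style="">Some snow remained.</span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main_div = soup.find("div", attrs={"id": "chapter-content"})

# Iterating over a tag walks its direct children: tags and the strings between them.
child_types = [type(child).__name__ for child in main_div]
print(child_types)

# Keep only real <p> tags, then read their text.
paragraphs = [
    child.get_text()
    for child in main_div
    if isinstance(child, Tag) and child.name == "p"
]
print(paragraphs)  # ['Chapter 1 - Reborn', 'Some snow remained.']
```

The whitespace between tags shows up as NavigableString children, which is why filtering on the child type (or calling find_all('p')) is needed before reading the text.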

Scraping webpage with Python: how to return a list of titles of certain elements?

I had luck getting a list of telephone numbers using this code:
from lxml import html
import requests
lnk='https://docs.legis.wisconsin.gov/2019/legislators/assembly'
page=requests.get(lnk)
tree=html.fromstring(page.content)
ph_nums=tree.xpath('//span[@class="info telephone"]/text()')
print(ph_nums)
which is scraping info from an HTML element that looks like this:
<span class="info telephone">
<span class="title"><strong>Telephone</strong>:<br></span>
(608) 266-8580<br>(888) 534-0097
</span>
However, I can't do the same for this element when I change info telephone to info...
<span class="info" style="width:16em;">
<span>
<a id="A">
<strong></strong></a><strong>Jenkins, Leroy t</strong> <small>(R - Madison)</small>
</span>
<br>
<span style="width:8em;"><small>District 69</small></span>
<br>
<span style="width:8em;">Details</span>
<br>
<span style="width:8em;">
Website
</span>
<br>
<br>
</span>
since there are multiple titles in this element, whereas "info telephone" only had one. How would I return separate lists, each with a different piece of info (i.e. a list of names and a list of districts, in this scenario)?
FYI - I am not educated in HTML (and hardly experienced in Python) so I would appreciate a simplified explanation.

For this task I would recommend the BeautifulSoup Package for Python.
You don't have to deeply understand HTML to use it (I don't!), and it offers a very friendly approach to find certain items from a web page.
Your first example could be rewritten as follows:
from bs4 import BeautifulSoup
# soup holds the parsed page content
soup = BeautifulSoup(page.content, 'lxml')
# the find_all method finds all nodes in page.content whose tag name is 'span'
# and whose class is 'info telephone'
info_tels = soup.find_all('span', {"class": "info telephone"})
The info_tels element contains all instances of <span class="info telephone"> in your document. We can then parse it to find what's relevant:
list_tels = []
for tel in info_tels:
    tel_text = tel.text  # extracts text from the info telephone node
    tel_text = tel_text.replace("\nTelephone:\n", "").replace('\n', "")  # removes the "Telephone:" part and line breaks
    tel_text = tel_text.strip()  # removes leading and trailing whitespace
    list_tels.append(tel_text)
You can do something similar for the 'info' class:
info_class = soup.find_all('span', {"class": "info"})
And then find the elements you want to put into lists:
info_class[0].find_all('a')[1].text  # returns the first name
The challenge here is to identify which types/classes these names, districts, etc. have. In your first example it is relatively clear ('span', {"class": "info telephone"}), but the "info" class has various data points inside it with no specific, identifiable type.
For instance, the '<span style="width:8em;">' tag appears multiple times in your file, each time with a different data point (District, Details, etc.).
I came up with a small solution for the District problem - you might get inspired to tackle the other information too!!
list_districts = []
for info in info_class:
    try:
        district_contenders = info.find_all('span', {'style': "width:8em;"})
        for element in district_contenders:
            if 'District' in element.text:
                list_districts.append(element.text)
    except:
        pass
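The same idea can be extended to build the name and district lists side by side. What follows is only a sketch run against the two sample elements from the question (slightly shortened), not against the live page, and the helper first_text is a hypothetical name. Note that matching on class "info" also matches class="info telephone", so blocks without a district line are skipped:

```python
from bs4 import BeautifulSoup

# Shortened copies of the two sample elements from the question.
html = """
<span class="info" style="width:16em;">
<span>
<a id="A">
<strong></strong></a><strong>Jenkins, Leroy t</strong> <small>(R - Madison)</small>
</span>
<br>
<span style="width:8em;"><small>District 69</small></span>
<br>
<span style="width:8em;">Details</span>
</span>
<span class="info telephone">
<span class="title"><strong>Telephone</strong>:<br></span>
(608) 266-8580<br>(888) 534-0097
</span>
"""


def first_text(tags, keep):
    """Return the stripped text of the first tag whose text passes keep(), else None."""
    for tag in tags:
        text = tag.get_text(strip=True)
        if keep(text):
            return text
    return None


soup = BeautifulSoup(html, "html.parser")
names, districts = [], []
for info in soup.find_all("span", class_="info"):
    district = first_text(info.find_all("small"), lambda t: t.startswith("District"))
    if district is None:
        continue  # e.g. the "info telephone" block
    # The name is assumed to be the first non-empty <strong> in the block.
    names.append(first_text(info.find_all("strong"), bool))
    districts.append(district)

print(names)      # ['Jenkins, Leroy t']
print(districts)  # ['District 69']
```

Appending to both lists in the same loop iteration keeps them aligned, so names[i] and districts[i] always describe the same legislator.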

How can I extract an html tag value for specific classes using python beautiful soup?

I have one part of code which works, and one block which doesn't. The structure of the Python and HTML seems the same to me, so I can't understand why the second has no output.
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Google Chrome v6.9"}
r = requests.get("https://www.oddschecker.com/horse-racing/will-rogers-downs/22:45/winner", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
for tag in soup.select('.bk-logo-main-90[data-bk]'):
    print(tag['data-bk'], end=",")
for tag in soup.select('.popup selTxt[data-name]'):
    print(tag['data-name'], end=",")
The first for loop gives me a list of 2 letter acronyms, the second should give me a list of runners in this race but there is no output. Both are trying to find any tags of a certain class which contains a tag of another name and return the value of that tag.
Examples of the specific html I'm looking at in case the race is run already and the page is removed:
<a class="bk-logo-main-90 bk-logo-click" data-bk="B3" data-track="&lid=BookieLogo-B3&lpos=oddsTable" title="Bet365" href="javascript:void(0);">
<a class="popup selTxt" target="" title="View odds history for My Wise Cat" href="https://www.oddschecker.com/horse-racing/will-rogers-downs/22:45/winner/bet-history/my-wise-cat" data-name="My Wise Cat">
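A likely cause is the space in the second selector. In CSS, '.popup selTxt' means an element named selTxt nested inside something with class popup, whereas the sample tag is a single element carrying both classes, which is written '.popup.selTxt'. The sketch below runs both selectors on trimmed copies of the two sample tags; it is not a test against the live page, whose markup may differ:

```python
from bs4 import BeautifulSoup

# Trimmed copies of the two sample tags from the question.
html = """
<a class="bk-logo-main-90 bk-logo-click" data-bk="B3" title="Bet365" href="javascript:void(0);"></a>
<a class="popup selTxt" title="View odds history for My Wise Cat" data-name="My Wise Cat"></a>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: looks for a <selTxt> element inside a .popup element.
print(soup.select(".popup selTxt[data-name]"))  # []

# Compound selector: one element that has both classes and a data-name attribute.
runners = [tag["data-name"] for tag in soup.select(".popup.selTxt[data-name]")]
print(runners)  # ['My Wise Cat']

# The first loop from the question already uses a compound selector, so it matches.
bookies = [tag["data-bk"] for tag in soup.select(".bk-logo-main-90[data-bk]")]
print(bookies)  # ['B3']
```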

Python - Extracting data from this Html tag using BS4, instead of getting None

This is my code:
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').string)
It returns None. I think it has to do with that span tag which is empty. I think it goes into that span tag, and returns those contents? So I either want to delete that span tag, or stop as soon as it finds the 'Data I want to extract', or tell it to ignore empty tags
If there are no empty tags inside 'td' it actually works.
Is there a way to ignore empty tags in general and go one step back? Instead of ignoring this specific span tag?
Sorry if this is too elementary, but I spent a fair amount of time searching.

Use the .text property, not .string:
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').text)
Output:
Data I want to extract

Use .text:
>>> soup.find('td').text
u'Data I want to extract'
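The reason .string returns None here is that it is only defined when a tag has a single child to descend into, and this td has two: the text and the empty span. A short sketch on a simplified version of the markup (the span attributes are dropped for brevity) shows the difference:

```python
from bs4 import BeautifulSoup

# Simplified version of the markup from the question.
html = '<td class="ClassName class">Data I want to extract<span></span></td>'
td = BeautifulSoup(html, "html.parser").select_one("td")

# The td has two children, so .string cannot pick one and gives None.
print(len(td.contents))  # 2
print(td.string)         # None

# .text / get_text() joins every string found anywhere below the tag.
print(td.get_text())     # Data I want to extract

# .stripped_strings yields each non-blank string separately.
print(list(td.stripped_strings))  # ['Data I want to extract']
```

If only the td's own text is wanted and nested tags might contain text of their own, stripped_strings makes it possible to pick the first string rather than the concatenation.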

Extract multiple <p> text using for loop in Beautiful Soup

How do I extract the text of all the <p> tags from the following HTML using BeautifulSoup in Python? The code I am trying is shown below.
HTML code:
<div class="text_details">
<p>
Allah's Messenger (ﷺ) said: Islam is based on (the following) five (principles):
</p>
<p> 1. To testify that none has the right to be worshipped but Allah and Muhammad is Allah's Messenger (ﷺ).</p>
<p> 2. To offer the (compulsory congregational) prayers dutifully and perfectly.</p>
<p> 3. To pay Zakat (i.e. obligatory charity)</p>
<p> 4. To perform Hajj. (i.e. Pilgrimage to Mecca)</p>
<p> 5. To observe fast during the month of Ramadan.</p>
<p></p>
</div>
Code:
import requests
from bs4 import BeautifulSoup
url = "https://www.sunnah.com/bukhari/11"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for pp in soup.find_all(class_='text_details').p:
    print pp.text

You can use select to fetch every p tag that is a direct child of the div with class text_details.
Like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.sunnah.com/bukhari/11"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for pp in soup.select("div.text_details > p"):
    print(pp.text)

You should first find the div tag (find returns one tag), then find_all the p tags inside it (find_all returns a list of tags):
In [59]: for pp in soup.find(class_='text_details').find_all('p'):
    ...:     print(pp.text)
    ...:
I heard Allah's Messenger (ﷺ) (p.b.u.h) saying, "We (Muslims) are the last (to come) but (will be) the
foremost on the Day of Resurrection though the former nations were given the Holy Scriptures before
us. And this was their day (Friday) the celebration of which was made compulsory for them but they
differed about it. So Allah gave us the guidance for it (Friday) and all the other people are behind us in
this respect: the Jews' (holy day is) tomorrow (i.e. Saturday) and the Christians' (is) the day after
tomorrow (i.e. Sunday)."
The div tag contains only p tags, so you can get all the text with:
In [60]: soup.find(class_='text_details').text
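To make the difference between the attempts concrete, the sketch below runs them on a shortened copy of the HTML from the question, using the built-in html.parser instead of lxml. It shows why the original loop fails (find_all returns a list-like ResultSet, which has no .p of its own) and that the find/find_all chain and the select call return the same paragraphs:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question.
html = """
<div class="text_details">
<p>Islam is based on (the following) five (principles):</p>
<p> 3. To pay Zakat (i.e. obligatory charity)</p>
<p></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet, so asking it for .p raises AttributeError.
try:
    soup.find_all(class_="text_details").p
    failed = False
except AttributeError:
    failed = True
print(failed)  # True

# find returns a single Tag, so find_all can be chained on it.
via_find = [p.get_text(strip=True) for p in soup.find(class_="text_details").find_all("p")]

# select with a child combinator gives the same tags in one call.
via_select = [p.get_text(strip=True) for p in soup.select("div.text_details > p")]
print(via_find == via_select)  # True

# Drop the empty trailing <p>.
non_empty = [text for text in via_find if text]
print(non_empty)
```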
