Parsing <ul> tag using beautiful soup - python

Consider this code:
divTag = soup.find_all("div", {"class":"classname"})
print(divTag)
for tag in divTag:
    ulTag = soup.find_all("ul", {"class":"classname"})
    print(ulTag)
    for tag in ulTag:
        liTag = soup.find_all("li", {"class":"classname"})
        print(liTag)
        for tag in liTag:
            diTag = soup.find_all("div", {"class":"classname"})
            print(diTag)
            for tag in diTag:
                aTags = tag.find_next("a")
                value = aTags.string
                print(value)
It prints only "divTag" & "ulTag". I'm sure all the class names are right. There are about 7 'li' tags within the 'ul' tag but it does not print any of the 'li' tags. Please help. Thanks in advance.
UPDATE:
<div class="classname">
<ul auto-load="true" class="classname" data-href="">
<li class="classname">
<div class="classname">"value" string string1 <a class="muted"><abbr class="timeago" title=" 1 Jun, 2015, 10:23 am">7 hours ago</abbr></a>
</div>
</li>
<li>
</li>
</ul>
</div>
I basically want to extract the "string" value within the 'a' tag.

The full solution, using next_sibling:
ulTag = soup.find("ul", {"class": "classname"})
aTags = ulTag.find_all("a")
for aTag in aTags:
    sibling = aTag.next_sibling
    siblingString = str(sibling).strip()
    if len(siblingString) > 0:
        print(siblingString)
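As a rough illustration of how sibling navigation behaves on the HTML posted in the question, the sketch below compares previous_sibling and next_sibling of the 'a' tag. It assumes the built-in html.parser and the markup exactly as posted; a real page may be laid out differently, in which case the text may sit on the other side of the tag.

```python
from bs4 import BeautifulSoup

# HTML copied from the question's UPDATE section.
html = """<div class="classname">
<ul auto-load="true" class="classname" data-href="">
<li class="classname">
<div class="classname">"value" string string1 <a class="muted"><abbr class="timeago" title=" 1 Jun, 2015, 10:23 am">7 hours ago</abbr></a>
</div>
</li>
<li>
</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("ul", {"class": "classname"}).find("a")

# Text node immediately before the <a> tag.
before = str(a_tag.previous_sibling).strip()
# Text node immediately after the <a> tag (only a newline in this markup).
after = str(a_tag.next_sibling).strip()

print(repr(before))  # '"value" string string1'
print(repr(after))   # ''
```

In this particular markup the text the question asks about precedes the 'a' tag, so previous_sibling is the one that returns it.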

Every call here searches the whole soup instead of the tag found in the previous step, so the nesting has no effect. You should search for each tag within its parent tag.
Try something like this:
divTag = soup.find_all("div", {"class":"classname"})
for ulTag in divTag:
    for liTag in ulTag.find_all("li", {"class":"classname"}):
        for tag in liTag.find_all("div", {"class":"classname"}):
            for aTag in tag.find_all('a'):
                print(aTag.string)
For the HTML you provided, the output is:
"value"
string1
7 hours ago
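To make the difference between searching the whole document and searching inside a parent concrete, here is a minimal sketch. The markup and the class names "menu" and "news" are hypothetical, chosen only for illustration, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two lists, only one of which we care about.
html = """
<ul class="menu"><li><a>home</a></li></ul>
<ul class="news"><li><a>first</a></li><li><a>second</a></li></ul>
"""

soup = BeautifulSoup(html, "html.parser")
news = soup.find("ul", {"class": "news"})

# Searching from soup ignores which <ul> we are interested in.
global_links = [a.get_text() for a in soup.find_all("a")]
# Searching from the parent tag only returns its descendants.
scoped_links = [a.get_text() for a in news.find_all("a")]

print(global_links)  # ['home', 'first', 'second']
print(scoped_links)  # ['first', 'second']
```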

Related

Trouble getting <a> tag using BeautifulSoup

I need to get the href attribute from an <a> tag, but it doesn't work:
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
print(a_tags[0].p) #print <p> tag
print(a_tags[0].a) #print 'None'
print(a_tags[0].a.get('href')) #doesn't work
but if I try to print(a_tags) it shows them:
[<a href="/org/colleges/instrcol/Pages/pic1.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic1" src="iblock/6ba/%d0%90%d0%b1%d1%80%d0%b0%d0%bc%d0%be%d0%b2%20%d0%a1%d0%b5%d1%80%d0%b3%d0%b5%d0%b9%20%d0%90%d0%bd%d1%82%d0%be%d0%bd%d0%b8%d0%b4%d0%be%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic1</p></div>
</a>, <a href="/org/colleges/instrcol/Pages/pic2.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic2" src="iblock/1ee/%d0%90%d0%b3%d0%b0%d1%84%d0%be%d0%bd%d0%be%d0%b2%20%d0%9f%d0%b0%d0%b2%d0%b5%d0%bb%20%d0%92%d0%b8%d1%82%d0%b0%d0%bb%d1%8c%d0%b5%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic2</p></div>
</a>,
...
What is causing this problem?
Pass href=True to find_all() so that only <a> tags that have an href attribute are returned, then read the attribute from each tag directly.
Try this:
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a', href=True)
for a_tag in a_tags:
    print(a_tag['href'])
Each element of a_tags is already an <a> tag, so a_tags[0].a looks for another <a> nested inside it and returns None.
Replace a_tags[0].a.get('href') with a_tags[0].get('href').
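Both answers can be checked with a small sketch. The markup below is a shortened version of the HTML in the question (the image paths are simplified, and the third <a> without an href is a hypothetical addition to show what href=True filters out); html.parser is assumed.

```python
from bs4 import BeautifulSoup

html = """
<a href="/org/colleges/instrcol/Pages/pic1.aspx" target="_blank">
<div><img alt="pic1" src="pic1.jpg"/><p>Pic1</p></div>
</a>
<a href="/org/colleges/instrcol/Pages/pic2.aspx" target="_blank">
<div><img alt="pic2" src="pic2.jpg"/><p>Pic2</p></div>
</a>
<a name="no-href">plain anchor</a>
"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find_all("a")[0]

# first is already the <a> tag; there is no <a> nested inside it.
nested = first.a
# Read the attribute from the tag itself.
first_href = first.get("href")
# href=True keeps only the tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(nested)      # None
print(first_href)  # /org/colleges/instrcol/Pages/pic1.aspx
print(hrefs)
```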

python web scraping with beautifulsoup find <li> in different <ul> name tag

<ul id="name1">
<li>
<li>
<li>
<li>
</ul>
<ul id="name2">
<li>
<li>
<li>
</ul>
Hi. I am scraping a website. The ul tags have different ids, and each one contains a different number of li tags. I think there is something wrong with my method, and I would appreciate your help.
NOTE:
The last part of the code handles the image I take from the site.
ts1 = brosursoup.find("div", attrs={"id": "name"})
ts2 = ts1.find("ul")
hesap = 0
count2 = len(ts1.find_all('ul'))
if (hesap <= count2):
    hesap = hesap + 1
    for qwe in ts1.find_all("ul", attrs={"id": f"name{hesap}"}):
        for bnm in ts1.find_all("li"):
            for klo in ts1.find_all("div"):
                tgf = ts1.find("span", attrs={"class": "img_w_v8"})
                for abn in tgf.find_all("img"):
                    picture = abn.get("src")
                    picturename= abn.get("title")
                    print(picture + " ------ " + picturename)
You can just find which ul tag you want and then use find_all.
from bs4 import BeautifulSoup
# A multi-line string needs triple quotes.
page = '''<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'''
soup = BeautifulSoup(page,'lxml')
ul_tag = soup.find('ul', {'id': 'name2'})
li_tags = ul_tag.find_all('li')
for i in li_tags:
    print(i.text)
# output
li5
li6
li7
If you are trying to match all ul elements of the form id='nameXXX' then you can use a regular expression as follows:
from bs4 import BeautifulSoup
import re
page = '''<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'''
soup = BeautifulSoup(page, 'lxml')
# Raw string so that \d is passed to the regex engine unchanged.
for ul in soup.find_all('ul', {'id': re.compile(r'name\d+')}):
    for li in ul.find_all('li'):
        print(li.text)
This would display:
li1
li2
li3
li4
li5
li6
li7
Alternatively, pass a list of ids to match any of them:
from bs4 import BeautifulSoup
page = """<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>"""
soup = BeautifulSoup(page,'lxml')
ul_tag = soup.find_all('ul', {"id": ["name1", "name2"]})
for i in ul_tag:
    print(i.text)
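If the li texts need to stay grouped by the ul they came from, one possible approach is to build a dictionary keyed by the ul id. This is only a sketch: it assumes the ids follow the name1, name2, ... pattern from the question, and it uses the built-in html.parser instead of lxml so that no extra package is needed.

```python
import re

from bs4 import BeautifulSoup

page = """<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>"""

soup = BeautifulSoup(page, "html.parser")

# Map each matching ul id to the texts of the li tags inside that ul only.
items_by_ul = {
    ul["id"]: [li.get_text(strip=True) for li in ul.find_all("li")]
    for ul in soup.find_all("ul", id=re.compile(r"^name\d+$"))
}

print(items_by_ul)
# {'name1': ['li1', 'li2', 'li3', 'li4'], 'name2': ['li5', 'li6', 'li7']}
```

Because each inner find_all is called on ul rather than on soup, the number of li tags per list does not need to be known in advance.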

How to extract a column from HTML into a list? [duplicate]

With the code below:
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class' :'flagPageTitle'})
I get the following html:
<div id="ctl00_ContentPlaceHolder1_Item65404" class="flagPageTitle" style=" ">
<span></span><p>Some text here</p>
</div>
How can I get "Some text here" without any tags? Is there an InnerText equivalent in BeautifulSoup?
All you need is:
result = soup.find('div', {'class' :'flagPageTitle'}).text
You can use findAll(text=True) to only find text nodes.
result = u''.join(result.findAll(text=True))
You can search for <p> and get its text:
soup = BeautifulSoup.BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class': 'flagPageTitle'})
result = result.find('p').text
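For comparison, here is a small sketch of the same extraction with the bs4 package, using the HTML from the question. It assumes html.parser; get_text(strip=True) and stripped_strings are used here as alternatives to .text when surrounding whitespace is not wanted.

```python
from bs4 import BeautifulSoup

html = """<div id="ctl00_ContentPlaceHolder1_Item65404" class="flagPageTitle" style=" ">
<span></span><p>Some text here</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "flagPageTitle"})

raw = div.text                        # includes the newlines around the <p>
clean = div.get_text(strip=True)      # whitespace-stripped text of all descendants
pieces = list(div.stripped_strings)   # one entry per non-empty text node

print(repr(raw))    # '\nSome text here\n'
print(clean)        # Some text here
print(pieces)       # ['Some text here']
```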

How to extract the contents of a div tag containing a particular text using BeautifulSoup

I am new to BeautifulSoup and am looking to extract text from a list inside a div tag. This is the HTML:
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
I would like to extract the text "Awaiting bone marrow transplant". This is the code which I use now which gives me an empty list:
for link in soup.findAll('div', text = re.compile('Description Synonyms ')):
    print(link)
Sorry for not adding this earlier: I do have other divs with the same class name. I am interested only in the "Description Synonyms" one. The other div is listed below:
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
soup.find(text='...') doesn't work if there's other text or tags inside that tag.
Try:
[i.find('ul', {'class': "definitionList"}).find('li').text
 for i in soup.find_all('div', {'class': "contentBlurb"})
 if 'Description Synonyms' in str(i.text)][0]
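The reason the text filter fails can be seen in a short sketch. It uses the two divs from the question with html.parser, and the string= keyword, which is the newer bs4 name for the text= argument (an assumption about the installed version; older releases only accept text=).

```python
import re

from bs4 import BeautifulSoup

html = """<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("Description Synonyms")

# The div has several children (text, <ul>, text), so its .string is None
# and a tag search filtered on the text finds nothing.
first_div = soup.find("div", {"class": "contentBlurb"})
by_tag = soup.find("div", string=pattern)

# Searching for the text node itself works; its parent is the div we want.
text_node = soup.find(string=pattern)
item = text_node.parent.find("li").get_text()

print(first_div.string)  # None
print(by_tag)            # None
print(item)              # Awaiting bone marrow transplant
```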
You can do this:
# coding: utf-8
from bs4 import BeautifulSoup
html = """
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
"""
# Name the parser explicitly to avoid the "no parser specified" warning.
souped = BeautifulSoup(html, 'html.parser')
# Keep only the divs whose text mentions the label we want.
matching_divs = [div for div in souped.find_all(
    'div', {'class': 'contentBlurb'}) if 'Description Synonyms' in div.getText()]
li_elements = []
matching_uls = []
for mdiv in matching_divs:
    matching_uls.extend(mdiv.findAll('ul', {'class': 'definitionList'}))
for muls in matching_uls:
    li_elements.extend(muls.findAll('li'))
for li in li_elements:
    print(li.getText())
EDIT: Updated for matching particular div.
Try this, changing the string in the if clause as required. The snippet below prints the li text if the first contentBlurb div contains "Description Synonyms":
val = soup.find('div', {'class': 'contentBlurb'}).text
if "Description Synonyms" in val:
    print(soup.find('div', {'class': 'contentBlurb'}).find('ul', {'class': 'definitionList'}).find('li').text)
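Since the snippet above only inspects the first contentBlurb div, a variant that checks every div may be more robust. The helper below is a sketch under that assumption; the function name items_for_label is hypothetical and html.parser is assumed.

```python
from bs4 import BeautifulSoup

html = """<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>"""


def items_for_label(soup, label):
    """Return the li texts of every contentBlurb div whose text contains label."""
    items = []
    for div in soup.find_all("div", {"class": "contentBlurb"}):
        if label in div.get_text():
            # Search inside this div only, not the whole document.
            items.extend(li.get_text(strip=True) for li in div.find_all("li"))
    return items


soup = BeautifulSoup(html, "html.parser")
synonyms = items_for_label(soup, "Description Synonyms")
applicable = items_for_label(soup, "Applicable To")

print(synonyms)    # ['Awaiting bone marrow transplant']
print(applicable)  # ['Patient waiting for organ availability']
```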

