I'm using BeautifulSoup for the first time and can't figure out how to extract the text from specific nodes. Here is my code.
html:
...
<p class="dsm">...</p>
<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>
...
python:
from urllib.request import urlopen
from bs4 import BeautifulSoup

current_page = urlopen(url)
current_soup = BeautifulSoup(current_page, 'html.parser')
derivative_list = current_soup.select('p.dsm + ul.also li')
for li in derivative_list:
    print(li)
output:
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
It's outputting the correct list items, but what I want are the text values of i.ab and span.at, something like:
desired output:
abdrea, groups
shokdia, techs
After getting a list of all the <li> tags, simply iterate over them and find the texts of the <i class="ab"> and <span class="at"> tags individually.
for li in soup.select('p.dsm + ul.also li'):
    print(li.i.text, li.span.text)
    # abdrea groups
    # shokdia techs
If there are other <i> and <span> tags inside the <li> tags, you can use find() on the li variable.
for li in soup.select('p.dsm + ul.also li'):
    print(li.find('i', class_='ab').text, li.find('span', class_='at').text)
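If you want the comma-separated format shown in the desired output, a minimal variation (assuming soup is the parsed page, i.e. the question's current_soup) would be:
for li in soup.select('p.dsm + ul.also li'):
    # join the i.ab and span.at texts with a comma, as in the desired output
    print(f"{li.find('i', class_='ab').text}, {li.find('span', class_='at').text}")
# abdrea, groups
# shokdia, techs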
The exact answer you are looking for:
data = """<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>"""
from bs4 import BeautifulSoup
page_soup = BeautifulSoup(data, "html.parser")
# zip pairs the nth <i> text with the nth <span> text; unpacking gives one pair per variable
i_data, span_data = zip([x.text for x in page_soup.find_all("i")], [y.text for y in page_soup.find_all("span")])
print(i_data)
print(span_data)
output:
(u'abdrea', u'groups')
(u'shokdia', u'techs')
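If you would rather iterate over the pairs instead of unpacking them into two variables, a small variation on the same zip idea (using the same page_soup) could be:
# zip yields one (<i> tag, <span> tag) pair per list item
for ab, at in zip(page_soup.find_all("i"), page_soup.find_all("span")):
    print(ab.text + ", " + at.text)
# abdrea, groups
# shokdia, techs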
Related
I have a div tag which contains three anchor tags, each with a URL in it.
I am able to print those 3 hrefs, but they get merged into one value.
Is there a way I can get three separate values?
Div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
</div>
What I have tried so far:
social_media_url = soup.find_all('div', class_ = 'foo')
for url in social_media_url:
    print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do like this:
from bs4 import BeautifulSoup
import requests
url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
    print(i['href'])
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
Your selector gives you the div itself, not the array of URLs. You need something more like:
social_media_div = soup.find('div', class_='speaker_social_wrap')  # find() returns a single Tag you can search within
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
    print(anchor.get('href'))
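Alternatively, a single CSS selector can do the drilling down for you; a minimal sketch (assuming the speaker_social_wrap class from the question's HTML) would be:
# select every <a> that has an href inside the speaker_social_wrap div
hrefs = [a['href'] for a in soup.select('div.speaker_social_wrap a[href]')]
print(hrefs)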
I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code that I used, but it did not succeed:
import requests
from bs4 import BeautifulSoup
req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text, 'html.parser')
def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) is not None

def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']

def matches_text(tag):
    return tag.name == 'a' and tag.get_text()

for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-toggle'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
Any suggestion would be a great help!
Edit: Adding part of HTML below:
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="">
<i class="fa fa-home"> </i>
</li>
<li>
<a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
<ul class="dropdown-menu multi-level">
<!-- <li> -->
<li class="">
<a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
</a>
</li>
<!-- </li> -->
<!-- <li>
I have tried to get the links to all the PDF files that you need.
I selected the <a> tags whose href matches the pattern - see patt in the code below. This pattern is common to all the PDF files that you need.
Once it runs, you have all the links to the PDF files in the links list.
from bs4 import BeautifulSoup
import requests
url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
a = soup.find('a', attrs= {'title': 'Guidelines Value (CVC)'})
lst = a.parent()  # calling a tag is shorthand for .find_all(), so this collects the tags nested under the parent
links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
    temp = i.find('a')
    if temp:
        if patt in temp['href']:
            links.append(temp['href'].strip())
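To verify what ended up in links, a short, purely illustrative follow-up could be:
# show how many PDF links were collected and the first few of them
print(len(links))
for link in links[:5]:
    print(link)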
I first find the ul_tag in which all the data is available, then call find_all on it for the a tags with the attribute target="_blank"; those hold the .pdf hrefs, so from them we can extract only the .pdf links.
from bs4 import BeautifulSoup
import requests
res=requests.get("https://igr.karnataka.gov.in/english#")
soup=BeautifulSoup(res.text,"lxml")
ul_tag=soup.find("ul",class_="nav navbar-nav")
a_tag=ul_tag.find_all("a",attrs={"target":"_blank"})
for i in a_tag:
    print(i.get_text(strip=True))
    print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....
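Once you have one of these hrefs, downloading the PDF is a small extra step; the sketch below is illustrative (the local file name is an arbitrary choice), and it percent-encodes the spaces in the URL the same way the code further down does:
# pick one of the hrefs printed above and encode its spaces before requesting it
pdf_url = "https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf".replace(' ', '%20')
pdf_resp = requests.get(pdf_url)
with open("gudibande_sro.pdf", "wb") as f:  # arbitrary local file name
    f.write(pdf_resp.content)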
So, I used the following approach to complete the above-mentioned part:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
    sqlite_dict = {
        "district_value": district_value,
        "sro_value": sro_value,
        "pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
        "status": "PENDING"
    }
    sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
    return sqlite_dict

li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list = []

for ele in li_element_list:
    district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
    sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})

    if len(sro_pdf_a_tags) >= 1:
        for sro_a_tag in sro_pdf_a_tags:
            sqlite_dict = make_sqlite_dict_from_parsed_row(
                district_value,
                sro_a_tag.get_text(strip=True),
                sro_a_tag.get('href')
            )
            parsed_row_list.append(sqlite_dict)
    else:
        print("District: ", district_value, "'s pdf is corrupted")
This will give a proper pdf_link, sro_name and district_name.
The website code looks like this:
<ul class="article-list">
<li>
<p class="promo">
"sentence sentence sentence sentence"
<a class="readmore" href="https://link.blahblah.com"> Read more >> </a>
</p>
</li>
</ul>
I tried
ul = soup.find_all("ul", class_= "article-list")
for elem in ul:
    lis = elem.find_all("li")
    for x in lis:
        preview = x.find("p", class_="promo").get_text()
This returns
"sentence sentence sentence sentence Read more"
How can I return "sentence sentence sentence sentence" only without "Read more"?
You can use the .find_next() method with the text=True parameter:
data = '''<ul class="article-list">
<li>
<p class="promo">
"sentence sentence sentence sentence"
<a class="readmore" href="https://link.blahblah.com"> Read more >> </a>
</p>
</li>
</ul>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('p.promo').find_next(text=True))
Prints:
"sentence sentence sentence sentence"
I'm not sure though:
preview = x.find("p", class_="promo").a.text
You could try adding to a list:
soup = bs(resp, 'html.parser')
ul = soup.find_all("ul", class_="article-list")
preview = []
for elem in ul:
    lis = elem.find_all("li")
    for x in lis:
        promo = x.find("p", class_="promo")  # use a separate name so the preview list is not overwritten
        preview.append(promo.get_text())
I have the following html code:
<div>
<span class="test">
<span class="f1">
5 times
</span>
</span>
</span>
</div>
<div>
</div>
<div>
<span class="test">
<span class="f1">
6 times
</span>
</span>
</span>
</div>
I managed to navigate the tree, but when trying to print I get the following error:
AttributeError: 'list' object has no attribute 'text'
Python code working:
x=soup.select('.f1')
print(x)
gives the following:
[]
[]
[]
[]
[<span class="f1"> 19 times</span>]
[<span class="f1"> 12 times</span>]
[<span class="f1"> 6 times</span>]
[]
[]
[]
[<span class="f1"> 6 times</span>]
[<span class="f1"> 1 time</span>]
[<span class="f1"> 11 times</span>]
but print(x.prettify) throws the error above. I am basically trying to get the text between the span tags for all instances: blank when there is none, and the string when it is available.
select() returns a list of results, even when there are 0 items. Since a list object does not have a text attribute, it gives you the AttributeError.
Likewise, prettify() is for making a tag's html more readable, not a way to interpret the list.
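If you do want prettified output, call prettify() on each element of the list rather than on the list itself (assuming x is the list returned by select('.f1')):
for tag in x:
    # prettify() works on an individual Tag, not on the list returned by select()
    print(tag.prettify())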
If all you're looking to do is extract the texts when available:
texts = [''.join(i.stripped_strings) for i in x if i]
# ['5 times', '6 times']
This removes all the superfluous space/newline characters in the string and gives you just the bare text. The trailing if i means the text is only returned when i is non-empty.
If you actually care for the spaces/newlines, do this instead:
texts = [i.text for i in x if i]
# ['\n 5 times\n ', '\n 6 times\n ']
from bs4 import BeautifulSoup
html = '''<div>
<span class="test">
<span class="f1">
5 times
</span>
</span>
</span>
</div>
<div>
</div>
<div>
<span class="test">
<span class="f1">
6 times
</span>
</span>
</span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
aaa = soup.find_all('span', attrs={'class':'f1'})
for i in aaa:
    print(i.text)
Output:
5 times
6 times
I'd recommend using the .find_all method and looping over the matched spans.
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for span in soup.find_all("span", class_="f1"):
    if span.text.isspace():
        continue
    else:
        print(span.text)
The .isspace() method checks whether a string consists only of whitespace (checking whether the string is truthy won't work here, since an "empty" html span still contains spaces).
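An equivalent check is to strip the text first; a small variation of the loop above (same soup assumed) might be:
for span in soup.find_all("span", class_="f1"):
    text = span.get_text(strip=True)
    # get_text(strip=True) returns '' for spans that only contain whitespace
    if text:
        print(text)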
I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
    # prints 123
    # prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
<a href="http://linktoitem2">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
<a href="http://linktoitem3">$1.23</a>
</span>
</div>
</body>
"""
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select() - this method lets you use CSS selectors to search over the page. Also note the use of the class_ argument - the underscore is important, since class is a reserved keyword in Python.
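To make the comparison concrete, these two calls find the same item-number spans in the markup above:
# CSS-selector style
print(soup.select('div.inventory span.item-number'))
# equivalent find_all style, using the class_ keyword
print(soup.find_all('span', class_='item-number'))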