How to parse the drop-down list and get all the links to the PDFs using Beautiful Soup in Python?

I'm trying to scrape the PDF links from the drop-down menu on this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code that I used, but it did not succeed:
import requests
from bs4 import BeautifulSoup

req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text)

def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) != None

def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']

def matches_text(tag):
    return tag.name == 'a' and tag.get_text()

for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-toggle'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
Any suggestion would be a great help!
Edit: Adding part of HTML below:
<div class="collapse navbar-collapse">
  <ul class="nav navbar-nav">
    <li class="">
      <i class="fa fa-home"> </i>
    </li>
    <li>
      <a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
      <ul class="dropdown-menu multi-level">
        <!-- <li> -->
        <li class="">
          <a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
          </a>
        </li>
        <!-- </li> -->
        <!-- <li>

I have tried to get the links of all the PDF files that you need.
I selected the <a> tags whose href matches the pattern (see patt in the code); this pattern is common to all the PDF files that you need.
After running the code, you have all the links to the PDF files in the links list.
from bs4 import BeautifulSoup
import requests

url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

a = soup.find('a', attrs={'title': 'Guidelines Value (CVC)'})
# Calling a tag like a function is shorthand for find_all() with no
# arguments, so this collects every tag nested under the parent of the
# matched <a>.
lst = a.parent()

links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
    temp = i.find('a')
    if temp:
        if patt in temp['href']:
            links.append(temp['href'].strip())

I first find the ul tag in which all the data is available; then, calling the find_all method on it for a tags with the attribute target="_blank" picks out the entries whose href points to a .pdf, so from them we can extract only the .pdf links.
from bs4 import BeautifulSoup
import requests

res = requests.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(res.text, "lxml")
ul_tag = soup.find("ul", class_="nav navbar-nav")
a_tag = ul_tag.find_all("a", attrs={"target": "_blank"})
for i in a_tag:
    print(i.get_text(strip=True))
    print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....

So, I used the following approach to complete the above-mentioned part:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
    sqlite_dict = {
        "district_value": district_value,
        "sro_value": sro_value,
        "pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
        "status": "PENDING"
    }
    sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
    return sqlite_dict

li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list = []

for ele in li_element_list:
    district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
    sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})

    if len(sro_pdf_a_tags) >= 1:
        for sro_a_tag in sro_pdf_a_tags:
            sqlite_dict = make_sqlite_dict_from_parsed_row(
                district_value,
                sro_a_tag.get_text(strip=True),
                sro_a_tag.get('href')
            )
            parsed_row_list.append(sqlite_dict)
    else:
        print("District: ", district_value, "'s pdf is corrupted")
This will give a proper pdf_file_link, sro_name and district_name for each row.
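As a side note, the .replace(' ', '%20') call above only handles spaces. A more general option (a sketch, not part of the original solution) is urllib.parse.quote from the standard library, which percent-encodes all unsafe characters while leaving the URL delimiters intact:

```python
from urllib.parse import quote

# One of the scraped hrefs, which contains literal spaces.
pdf_file_link = "https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf"

# safe=':/' keeps the scheme and path separators unencoded while
# converting spaces (and any other unsafe characters) to %XX form.
encoded = quote(pdf_file_link, safe=':/')
print(encoded)
# https://igr.karnataka.gov.in/storage/pdf-files/Guidelines%20Value/chikkaballapur%20sro.pdf
```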

Related

How to get child value of div separately using beautifulsoup

I have a div tag which contains three anchor tags, each with a URL in it.
I am able to print those 3 hrefs, but they get merged into one value.
Is there a way I can get three separate values?
The div looks like this:
<div class="speaker_social_wrap">
  <a href="https://twitter.com/Sigve_telenor" target="_blank">
    <i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
  </a>
  <a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
    <i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
  </a>
  <a href="https://www.facebook.com/sigve.telenor" target="_blank">
    <i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
  </a>
</div>
What I have tried so far:
social_media_url = soup.find_all('div', class_='foo')
for url in social_media_url:
    print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do it like this:
from bs4 import BeautifulSoup
import requests

url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
    print(i['href'])
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
Your selector gives you the div, not the array of URLs. Note also that find_all returns a list, so you need find here to get a single tag you can search within. You need something more like:
social_media_div = soup.find('div', class_='foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
    print(anchor.get('href'))

python web scraping with beautifulsoup find <li> in different <ul> name tag

<ul id="name1">
<li>
<li>
<li>
<li>
</ul>
<ul id="name2">
<li>
<li>
<li>
</ul>
Hi. I am scraping a website. There are different numbers of li tags inside the different ul tag names. I think there is something wrong with my method, and I want your help.
NOTE:
The bottom part of the code corresponds to the image I took from the site.
ts1 = brosursoup.find("div", attrs={"id": "name"})
ts2 = ts1.find("ul")
hesap = 0
count2 = len(ts1.find_all('ul'))
if hesap <= count2:
    hesap = hesap + 1
for qwe in ts1.find_all("ul", attrs={"id": f"name{hesap}"}):
    for bnm in ts1.find_all("li"):
        for klo in ts1.find_all("div"):
            tgf = ts1.find("span", attrs={"class": "img_w_v8"})
            for abn in tgf.find_all("img"):
                picture = abn.get("src")
                picturename = abn.get("title")
                print(picture + " ------ " + picturename)
You can just find which ul tag you want and then use find_all.
from bs4 import BeautifulSoup

page = '''<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'''
soup = BeautifulSoup(page, 'lxml')
ul_tag = soup.find('ul', {'id': 'name2'})
li_tags = ul_tag.find_all('li')
for i in li_tags:
    print(i.text)
# output
li5
li6
li7
If you are trying to match all ul elements of the form id='nameXXX' then you can use a regular expression as follows:
from bs4 import BeautifulSoup
import re
page = '''<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'''
soup = BeautifulSoup(page, 'lxml')
for ul in soup.find_all('ul', {'id': re.compile(r'name\d+')}):
    for li in ul.find_all('li'):
        print(li.text)
This would display:
li1
li2
li3
li4
li5
li6
li7
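As a quick stdlib-only illustration of the pattern itself (independent of BeautifulSoup), r'name\d+' matches any id made of "name" followed by one or more digits:

```python
import re

patt = re.compile(r'name\d+')

# A few sample id values; only those of the form "name<digits>" match.
ids = ['name1', 'name2', 'title', 'name_x', 'name42']
# fullmatch ensures the whole id fits the pattern, not just a prefix.
matching = [i for i in ids if patt.fullmatch(i)]
print(matching)  # ['name1', 'name2', 'name42']
```

Note that BeautifulSoup applies compiled patterns with search, so a pattern like this also matches ids that merely contain "name" plus digits somewhere inside.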
Try this
from bs4 import BeautifulSoup
page = """<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>"""
soup = BeautifulSoup(page,'lxml')
ul_tag = soup.find_all('ul', {"id": ["name1", "name2"]})
for i in ul_tag:
    print(i.text)

How do I get this text residing within the tags using Beautiful Soup?

I want to fetch the number 121 from the page, but the soup object that I am getting does not show the number.
[<div class="open_pln" id="pln_1">
<ul>
<li>
<div class="box_check_txt">
<input id="cp1" name="cp1" onclick="change_plan(2,102,2);" type="checkbox"/>
<label for="cp1"><span class="green"></span></label>
</div>
</li>
<li id="li_open"><span>Desk</span> <br/></li>
<li> </li>
</ul>
</div>]
The number 121 for open offices is not inside HTML code, but in the JavaScript. You can use regex to extract it:
import re
import requests
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
htmlpage = requests.get(url).text
open_offices = re.findall(r'var openOffices\s*=\s*(\d+)', htmlpage)[0]
private_offices = re.findall(r'var privateOffices\s*=\s*(\d+)', htmlpage)[0]
print('Open offices: {}'.format(open_offices))
print('Private offices: {}'.format(private_offices))
Prints:
Open offices: 121
Private offices: 40
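To see the regex at work without hitting the site, here is a self-contained sketch against an inline snippet of the kind of JavaScript the page embeds (the script text below is a stand-in written to mirror the output above):

```python
import re

# Hypothetical excerpt of the page's inline JavaScript.
html = """
<script>
var openOffices = 121;
var privateOffices = 40;
</script>
"""

# \s* tolerates any spacing around '='; (\d+) captures the number.
open_offices = re.findall(r'var openOffices\s*=\s*(\d+)', html)[0]
private_offices = re.findall(r'var privateOffices\s*=\s*(\d+)', html)[0]
print(open_offices, private_offices)  # 121 40
```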
Without re module:
import requests
from bs4 import BeautifulSoup
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
searchstr = "var openOffices = "
script = soup.select_one(f"script:contains('{searchstr}')").text
print(script.split(searchstr)[1].split(";")[0])
Output:
121
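The split trick in that answer is plain string handling, so it can be sketched on its own (the script text below is a made-up stand-in for what select_one returns):

```python
# Hypothetical script text as BeautifulSoup would return it.
script = "var foo = 7; var openOffices = 121; var bar = 9;"

searchstr = "var openOffices = "
# Take everything after the marker, then cut at the next semicolon.
value = script.split(searchstr)[1].split(";")[0]
print(value)  # 121
```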
You have to find all the li tags using soup, like this:
all_links = soup.find_all("li")
for link in all_links:
    print(link.text.strip())

How to extracts contents of a div tag containing a particular text using BeautifulSoup

I am new to BeautifulSoup and am looking to extract texts from a list inside a div tag. This is the code:
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
I would like to extract the text "Awaiting bone marrow transplant". This is the code I use now, which gives me an empty list:
for link in soup.findAll('div', text=re.compile('Description Synonyms ')):
    print link
Sorry for not adding this earlier: I do have other divs with the same class name, and I am interested only in the description synonyms. The other div is listed below:
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
soup.find(text='...') doesn't work if there's other text or tags inside that tag.
Try:
[i.find('ul', {'class': "definitionList"}).find('li').text
for i in soup.find_all('div', {'class': "contentBlurb"})
if 'Description Synonyms' in str(i.text)][0]
You can do this:
# coding: utf-8
from bs4 import BeautifulSoup
html = """
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
"""
souped = BeautifulSoup(html)
matching_divs = [div for div in souped.find_all(
    'div', {'class': 'contentBlurb'}) if 'Description Synonyms' in div.getText()]
li_elements = []
matching_uls = []
for mdiv in matching_divs:
    matching_uls.extend(mdiv.findAll('ul', {'class': 'definitionList'}))
for muls in matching_uls:
    li_elements.extend(muls.findAll('li'))
for li in li_elements:
    print(li.getText())
EDIT: Updated for matching particular div.
Try this; change the string in the if clause to whatever you require. The snippet below prints the list item only if the tag's text contains "Description Synonyms", and you can adapt it to your requirement.
val = soup.find('div', {'class': 'contentBlurb'}).text
if "Description Synonyms" in val:
    print soup.find('div', {'class': 'contentBlurb'}).find('ul', {'class': 'definitionList'}).find('li').text

Beautiful Soup and searching in results

These are my first steps with python, please bear with me.
Basically I want to parse a Table of Contents from a single Dokuwiki page with Beautiful Soup. The TOC looks like this:
<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>
<ul class="toc">
<li class="level1"><div class="li"><a href="#link0">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
I would like to be able to search the content of the a-tags and, if a result is found, return the content and also the href link. So if I search for "one" the result should be:
One
#link1
What I have done so far:
#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup
import urllib2
#Grab and open URL, create BeatifulSoup object
url = "http://www.somewiki.at/wiki/doku.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#Grab Table of Contents
grab_toc = soup.find('div', {"id":"dw__toc"})
#Look for all divs with class: li
ftext = grab_toc.findAll('div', {"class":"li"})
#Look for links
links = grab_toc.findAll('a',href=True)
#Iterate
for everytext in ftext:
    text = ''.join(everytext.findAll(text=True))
    data = text.strip()
    print data

for everylink in links:
    print everylink['href']
This prints out the data I want, but I'm kind of lost on how to rewrite it so that I can search within the result and return only the search term. I tried something like:
if data == 'searchterm':
    print data
    break
else:
    print 'Nothing found'
But this is kind of a weak search. Is there a nicer way to do this? In my example the Beautiful Soup result set is changed into a list. Is it better to search the result set in the first place, and if so, how?
Instead of searching through the links one-by-one, have BeautifulSoup search for you, using a regular expression:
import re
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))
This would find the first a link in the table of contents with the 3 characters one in the text somewhere. Then just print the link and text:
print matching_link.string
print matching_link['href']
Short demo based on your sample:
>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('''\
... <div id="dw__toc">
... <h3 class="toggle">Table of Contents</h3>
... <div>
...
... <ul class="toc">
... <li class="level1"><div class="li"><a href="#link0">#</a></div>
... <ul class="toc">
... <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
... <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
... <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
... </ul></ul>''')
>>> matching_link = soup.find('a', text=re.compile('one', re.IGNORECASE))
>>> print matching_link.string
One
>>> print matching_link['href']
#link1
In BeautifulSoup version 3, the above .find() call returns the contained NavigableString object instead. To get back to the parent a element, use the .parent attribute:
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE)).parent
print matching_link.string
print matching_link['href']
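The case-insensitive matching itself is plain re, so here is a minimal stdlib sketch of how re.compile('one', re.IGNORECASE) behaves on the link texts from the TOC above:

```python
import re

patt = re.compile('one', re.IGNORECASE)

# The visible link texts from the table of contents.
texts = ['#', 'One', 'Two', 'Three']
# search() finds the substring anywhere in the string, ignoring case,
# which is exactly what BeautifulSoup does with a compiled pattern.
matches = [t for t in texts if patt.search(t)]
print(matches)  # ['One']
```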
