How to get child value of div seperately using beautifulsoup - python

I have a div tag which contains three anchor tags and have url in them.
I am able to print those 3 hrefs but they get merged into one value.
Is there a way I can get three seperate values.
Div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
What I have tried so far:
social_media_url = soup.find_all('div', class_ = 'foo')
for url in social_media_url:
print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>

You can do like this:
from bs4 import BeautifulSoup
import requests
url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
print(i['href'])
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor

Your selector gives you the div not the urls array. You need something more like:
social_media_div = soup.find_all('div', class_ = 'foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
print(anchor.get('href'))

Related

Web scraping from the span element

I am on a scraping project and I am lookin to scrape from the following.
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only Christian, Islam as the output.(Without the 'Faith:').
This is my try:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I make this done?
There are several ways you can fix this, I would suggest the following - Find all <span> in <div> that have not the class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
import requests
html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')
', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')])
Output
Christian, Islam

How to parse the drop down list and get the all the links for the pdf using Beautiful Soup in Python?

I'm trying to scrape the pdf links from the drop down this website. I want to scrape just the Guideline Values (CVC) drop down. Following is the code that i used but did not succeed
import requests
from bs4 import BeautifulSoup
req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text)
def matches_block(tag):
return matches_dropdown(tag) and tag.find(matches_text) != None
def matches_dropdown(tag):
return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']
def matches_text(tag):
return tag.name == 'a' and tag.get_text()
for li in soup.find_all(matches_block):
for ul in li.find_all('ul', class_='dropdown-toggle'):
for a in ul.find_all('a'):
if a.has_attr('href'):
print (a['href'])
any suggestion would be great help !
Edit: Adding part of HTML below:
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="">
<i class="fa fa-home"> </i>
</li>
<li>
<a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
<ul class="dropdown-menu multi-level">
<!-- <li> -->
<li class="">
<a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
</a>
</li>
<!-- </li> -->
<!-- <li>
I have tried to get the links of all the PDF files that you need.
I have selected the <a> tags whose href matches with the pattern - see patt in code. This pattern is common to all the PDF files that you need.
Now you have all the links to the PDF files in links list.
from bs4 import BeautifulSoup
import requests
url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
a = soup.find('a', attrs= {'title': 'Guidelines Value (CVC)'})
lst = a.parent()
links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
temp = i.find('a')
if temp:
if patt in temp['href']:
links.append(temp['href'].strip())
I have first find ul_tag in which all the data is available now from that find_all method on a where it contains .pdf href with attrs having target:_blank so from it we can extract only .pdf links
from bs4 import BeautifulSoup
import requests
res=requests.get("https://igr.karnataka.gov.in/english#")
soup=BeautifulSoup(res.text,"lxml")
ul_tag=soup.find("ul",class_="nav navbar-nav")
a_tag=ul_tag.find_all("a",attrs={"target":"_blank"})
for i in a_tag:
print(i.get_text(strip=True))
print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....
So, i used the following approach to complete the above mentioned part:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
sqlite_dict = {
"district_value": district_value,
"sro_value": sro_value,
"pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
"status": "PENDING"
}
sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
return sqlite_dict
li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list=[]
for ele in li_element_list:
district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})
if len(sro_pdf_a_tags) >=1:
for sro_a_tag in sro_pdf_a_tags:
sqlite_dict = make_sqlite_dict_from_parsed_row(
district_value,
sro_a_tag.get_text(strip=True),
sro_a_tag.get('href')
)
parsed_row_list.append(sqlite_dict)
else:
print("District: ", district_value, "'s pdf is corrupted")
this will give a proper_pdf_link, sro_name and disctrict_name

python web scraping with beautifulsoup find <li> in different <ul> name tag

<ul id="name1">
<li>
<li>
<li>
<li>
</ul>
<ul id="name2">
<li>
<li>
<li>
</ul>
Hi. I scraping something website. There are different numbers of li tags in different ul tag names. I think there is something wrong with my method. I want your help.
NOTE:
The bottom part of the code is the code of the image I took from the site
ts1 = brosursoup.find("div", attrs={"id": "name"})
ts2 = ts1.find("ul")
hesap = 0
count2 = len(ts1.find_all('ul'))
if (hesap <= count2):
hesap = hesap + 1
for qwe in ts1.find_all("ul", attrs={"id": f"name{hesap}"}):
for bnm in ts1.find_all("li"):
for klo in ts1.find_all("div"):
tgf = ts1.find("span", attrs={"class": "img_w_v8"})
for abn in tgf.find_all("img"):
picture = abn.get("src")
picturename= abn.get("title")
print(picture + " ------ " + picturename)
You can just find which ul tag you want and then use find_all.
page = '<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'
soup = BeautifulSoup(page,'lxml')
ul_tag = soup.find('ul', {'id': 'name2'})
li_tags = ul_tag.find_all('li')
for i in li_tags:
print(i.text)
# output
li5
li6
li7
If you are trying to match all ul elements of the form id='nameXXX' then you can use a regular expression as follows:
from bs4 import BeautifulSoup
import re
page = '''<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'''
soup = BeautifulSoup(page, 'lxml')
for ul in soup.find_all('ul', {'id': re.compile('name\d+')}):
for li in ul.find_all('li'):
print(li.text)
This would display:
li1
li2
li3
li4
li5
li6
li7
Try this
from bs4 import BeautifulSoup
page = """<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>"""
soup = BeautifulSoup(page,'lxml')
ul_tag = soup.find_all('ul', {"id": ["name1", "name2"]})
for i in ul_tag:
print(i.text)

Beautifulsoup return none value for the href how to select it?

How can I get the hrefs from the Remixed From part on the left hand side for the following page
webpage = 'http://www.thingiverse.com/thing:1492411'
I tried something like:
response = requests.get('http://www.thingiverse.com/thing:1492411')
soup = BeautifulSoup(response.text, "lxml")
remm = soup.find_all("a", class_="thing-stamp")
for rems in remm:
adress = rems.find("href")
print(adress)
Of course didn't work
EDIT: HTML tag looks like this:
<a class="thing-stamp" href="/thing:1430125">
<div class="stamp-content">
<img alt="" class="render" src="https://cdn.thingiverse.com/renders/57/ea/1b/cc/c2/1105df2596f30c4df6dcf12ca5800547_thumb_small.JPG"/> <div>Anet A8 button guide by Simhopp</div>
</div>
</a>
<a class="thing-stamp" href="/thing:1430727">
<div class="stamp-content">
<img alt="" class="render" src="https://cdn.thingiverse.com/renders/93/a9/c8/45/da/05fe99b5215b04ec33d22ea70266ac72_thumb_small.JPG"/> <div>Frame brace for Anet A8 by Simhopp</div>
</div>
</a>
try:
remm = soup.find_all("a", class_="thing-stamp")
for rems in remm:
adress = rems.get('href')
print(adress)

Why does BeautifulSoup work the second time parsing, but not the first

This is the ResultSet of running soup[0].find_all('div', {'class':'font-160 line-110'}):
[<div class="font-160 line-110" data-container=".snippet-container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">
<a class="no-underline group-ib color-inherit"
href="/en/ais/details/ports/959">
<span class="text-default">CN</span><span class="text-default text-darker">XMN
</span>
</a>
</div>]
In an attempt to pull out XIAMEN [CN] after title I could not use a[0].find('div')['title] (where a is the above BeautifulSoup ResultSet). However, if I copy and paste that HTML as a new string, say,
b = '''<div class="font-160 line-110" data-container=".snippet container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">'''
Then do:
>>soup = BeautifulSoup(b, 'html.parser')
>>soup.find('div')['title']
>>XIAMEN [CN] #prints contents of title
Why do I have to reSoup the Soup? Why doesn't this work on my first search?
Edit, origin of soup:
I have a list of urls that I'm going though via grequests. One of the things I'm looking for is that title that contains XIAMEN [CN].
So soup was created when I did
soup = []
with i in range(2) #number of pages parsed
rawSoup = BeautifulSoup(response[i].content, 'html.parser')
souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
soup.append(souporigin)
The urls are
[
'http://www.marinetraffic.com/en/ais/details/ships/shipid:564352/imo:9643752/mmsi:511228000/vessel:DE%20MI',
'http://www.marinetraffic.com/en/ais/details/ships/shipid:3780155/imo:9712395/mmsi:477588800/vessel:SITC%20GUANGXI?cb=2267'
]
I found out the problem occurred when I set up my BeautifulSoup. I created a list of partial search results then had to iterate over the list to research it. I fixed this by just searching for what I wanted in on line:
I changed:
soup = []
with i in range(2) #number of pages parsed
rawSoup = BeautifulSoup(response[i].content, 'html.parser')
souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
soup.append(souporigin)
to:
a = soup.find("div", class_='font-160 line-110')["title"]
And run this search as soon as I create my soup which removes a lot of redundancies in the code-- I had been creating lists of ResultSets and having to use find on them for new fields.
You use wrong selection.
Selection soup[0].find_all('div', {'class':'font-160 line-110'}) finds <div> and you can even see <div> when you print it. But when you add .find() it starts searching inside <div> - so .find('div') tries to find new div in current div
You need
a[0]['title']
When you create new soup then main/root element is not div but [document] and div is its child (div is inside main "tag") so you can use find('div').
>>> a[0].name
div
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.name
[document]

Categories