The layout is as follows:
<div class="App">
  <div class="content">
    <div class="title">Application Name #1</div>
    <div class="image" style="background-image: url(https://img_url)">
    </div>
    <a class="signed button" href="http://app_url">install app</a>
  </div>
</div>
I'm trying to grab the TITLE and the APP_URL, and ideally, when I print via HTML, I would like the TITLE to become a hyperlink pointing to the APP_URL.
My code is below, but it doesn't yield the desired results. I believe I need another call inside the loop to grab the title. The only problem is: how do I make sure that the TITLE and APP_URL I grab go together? There are at least 15 apps with the class <div class="App">, and of course I want all 15 results.
IMPORTANT: for the href links, I need the one from the class called "signed button".
soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
    a = div.findAll('a')[1]
    print a.text.strip(), '=>', a.attrs['href']
Use CSS selectors:
from bs4 import BeautifulSoup
html = """
<div class="App">
  <div class="content">
    <div class="title">Application Name #1</div>
    <div class="image" style="background-image: url(https://img_url)">
    </div>
    <a class="signed button" href="http://app_url">install app</a>
  </div>
</div>"""

soup = BeautifulSoup(html, 'html5lib')
for div in soup.select('div.App'):
    title = div.select_one('div.title')
    link = div.select_one('a')
    print("Click here: <a href='{}'>{}</a>".format(link["href"], title.text))
Which yields
Click here: <a href='http://app_url'>Application Name #1</a>
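Since the question stresses that the href must come from the "signed button" class, the selector can target that class directly, which also keeps each title paired with its own link. A minimal sketch, assuming the anchor carries both classes as in the sample above:

```python
from bs4 import BeautifulSoup

html = """
<div class="App">
  <div class="content">
    <div class="title">Application Name #1</div>
    <a class="signed button" href="http://app_url">install app</a>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
for app in soup.select("div.App"):
    title = app.select_one("div.title")
    # a.signed.button matches an <a> carrying both the "signed" and "button" classes
    link = app.select_one("a.signed.button")
    if title and link:
        print("<a href='{}'>{}</a>".format(link["href"], title.get_text(strip=True)))
```

Because title and link are looked up inside the same div.App iteration, they always belong together, for all 15 apps.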
Maybe something like this will work?
soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
    a = div.findAll('a')[0]
    print div.findAll('div', {'class': 'title'})[0].text, '=>', a.attrs['href']
I want to ignore one class when using find_all. I've followed this solution: Select all divs except ones with certain classes in BeautifulSoup.
My divs are a bit different; I want to ignore description-0:
<div class="abc">...</div>
<div class="parent">
  <div class="description-0"></div>
  <div class="description-1"></div>
  <div class="description-2"></div>
</div>
<div class="xyz">...</div>
Here is my code:
classToIgnore = ["description-0"]
all = soup.find_all('div', class_=lambda x: x not in classToIgnore)
It matches every div on the page, instead of just the description-n ones (minus description-0). How do I fix it?
Use regex, like this, for example:
import re
from bs4 import BeautifulSoup
sample_html = """<div class="abc">...</div>
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
<div class="xyz">...</div>"""
classes_regex = (
    BeautifulSoup(sample_html, "lxml")
    .find_all("div", {"class": re.compile(r"description-[1-9]")})
)
print(classes_regex)
Output:
[<div class="description-1"></div>, <div class="description-2"></div>]
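Alternatively, the original lambda can be repaired without a regex. The class_ callable is invoked once per class string (or with None when a tag has no class attribute), so restrict it to description-* classes before excluding the ignored one. A sketch against the same sample:

```python
from bs4 import BeautifulSoup

sample_html = """<div class="abc">...</div>
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
<div class="xyz">...</div>"""

classToIgnore = ["description-0"]
soup = BeautifulSoup(sample_html, "html.parser")
# Only accept classes that start with "description-" and are not in the ignore list
divs = soup.find_all(
    "div",
    class_=lambda c: c is not None
    and c.startswith("description-")
    and c not in classToIgnore,
)
print(divs)
```

The original version matched every div because "abc", "xyz", and None are all "not in classToIgnore"; the startswith guard is what narrows it down.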
I am trying to scrape the text between nested divs but am unable to get the text (TEXT HERE). The text is found inside the nested divs, as shown below. Since the text is not inside a <p> tag, I was unable to print it. I am using BeautifulSoup to extract the text; when I run the code below, it does not print anything.
The structure of the 'div' is
<div class="_333v _45kb" ...>
  <div class="_2a_i" ...>
    <div class="_2a_j" ...></div>
    <div class="_2b04" ...>
      <div class="_14v5" ...>
        <div class="_2b06" ...>
          <div class="_2b05" ...></div>
          <div id="..." ...>**TEXT HERE**</div>
        </div>
      </div>
    </div>
  </div>
</div>
My code:
import urllib.request
from bs4 import BeautifulSoup

theurl = "here URL"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
comm_list = soup.findAll('div', class_="_333v _45kb")
for lists in comm_list:
    print(lists.find('div').text)
Because the OP did not provide enough information, here is a sample:
from bs4 import BeautifulSoup
html = '''
<div class="foo">
    <div class="bar">
        <div class="spam">Some Spam Here</div>
        <div id="eggs">**TEXT HERE**</div>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# This will print all the text
div = soup.find('div', {'class':'foo'})
print(div.text)
print('\n----\n')
# if the other divs don't have an id
for div in soup.findAll('div'):
    if div.has_attr('id'):
        print(div.text)
Output:
Some Spam Here
**TEXT HERE**

----

**TEXT HERE**
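The same lookup can also be written as a CSS selector; a sketch against the same sample, using the [id] attribute-presence selector:

```python
from bs4 import BeautifulSoup

html = '''
<div class="foo">
    <div class="bar">
        <div class="spam">Some Spam Here</div>
        <div id="eggs">**TEXT HERE**</div>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# div[id] matches any div that has an id attribute, at any depth under div.foo
target = soup.select_one('div.foo div[id]')
print(target.text)  # **TEXT HERE**
```

This only works, of course, under the same assumption as the loop above: that the target div is the only one carrying an id.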
I have the following structure,
<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>
How do I access the data-ret attribute inside the div with id done? It's for some web scraping.
I tried a couple of ways but can't seem to make it stick.
Thanks
Using the Beautiful Soup library:
from bs4 import BeautifulSoup
html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''
soup = BeautifulSoup(html,"lxml")
data_ret = soup.find("div",{'id':'done'})
print(data_ret['data-ret'])
Output:
512,500
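Since the attribute value here is a comma-separated pair, a natural next step when scraping is to split it into integers. A sketch using the same sample (the names x and y are just illustrative; the source doesn't say what the two numbers mean):

```python
from bs4 import BeautifulSoup

html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
raw = soup.find("div", {"id": "done"})["data-ret"]
# Split "512,500" on the comma and convert each half to an int
x, y = (int(part) for part in raw.split(","))
print(x, y)  # 512 500
```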
HTML of page:
<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
<!-- many more like this. -->
I am using Beautiful Soup to scrape a page. On that page I am able to get the form tag by its name.
tag = soup.find("form", {"name": "compareprd"})
Now I want to count all immediate child divs, not all nested divs.
Say, for example, there are 20 immediate divs inside the form.
I tried :
len(tag.findChildren("div"))
But it gives 1500.
I think it counts every "div" inside the "form" tag.
Any help appreciated.
You can use a single CSS selector, form[name=compareprd] > div, which will find divs that are immediate children of the form:
html = """<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
</form>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(len(soup.select("form[name=compareprd] > div")))
Or, as commented, pass recursive=False but use find_all; findChildren goes back to the bs2 days and is only provided for backwards compatibility.
len(tag.find_all("div", recursive=False))
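Both approaches can be checked against the sample form above; a quick sketch showing the direct-children count versus the full nested count:

```python
from bs4 import BeautifulSoup

html = """<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
</form>"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("form", {"name": "compareprd"})
# recursive=False restricts the search to direct children of the form
direct = len(tag.find_all("div", recursive=False))
# The default recursive=True descends into every level
nested = len(tag.find_all("div"))
print(direct, nested)  # 2 4
```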
Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all class="g" divs to grabbing just the items of interest in that specific div, but I just get None returns or nothing printed.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
    site = item.find('cite')
    comment = item.find('span', {'class': 'st'})
    print site
    print comment
I have also attempted stepping into the initial div and finding all:
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print site
    print comment
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right, but still nothing printed. So I decided to take another look at the soup, and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text, or I somehow got the sample completely wrong the first time (not sure how). Below is the new sample based on what I see from a soup print, and below that my attempt to get to the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried taking Alecxe's approach with no success so far; am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
First get the div with class srg, then find all divs with class g inside it, and get the text of the site and comment from each. Below is the working code for me:
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
    site = data.find('cite', {'class': '_Rm'})
    comment = data.find('span', {'class': 'st'})
    if site:  # check that site is not None
        if site.text.strip() not in sites:
            sites.append(site.text.strip())
    if comment:  # check that comment is not None
        if comment.text.strip() not in comments:
            comments.append(comment.text.strip())
print sites
print comments
Output-
[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']
EDIT:
Why your code does not work
For try one: you used result = main.find('div', {'class': 'g'}), which grabs only the first matching element, and that first element has no div with class s, so the rest of the code finds nothing.
For try two: you printed the tag objects themselves instead of their text; print site.text and comment.text inside the loop:
soup = BeautifulSoup(html,'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print site.text  # grab the text
    print comment.text
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g" located inside the div element with class="srg" - div.srg div.g cite CSS selector would find us exactly what we are asking about:
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
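For reference, the selector approach as a self-contained script against a trimmed version of the test data above:

```python
from bs4 import BeautifulSoup

html = """<div class="srg">
<div class="g">
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
</div>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
# cite elements inside class="g" divs, themselves inside the class="srg" div
for cite in soup.select("div.srg div.g cite"):
    # The span with class="st" is the next sibling of the cite
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
```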
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Also, make sure you are using BeautifulSoup version 4:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup