I have the following structure,
<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>
How do I access the data-ret attribute inside the div with id "done"? I need it for some web scraping.
I tried a couple of ways but can't seem to get it to work.
Thanks
Using the Beautiful Soup library:
from bs4 import BeautifulSoup
html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''
soup = BeautifulSoup(html,"lxml")
data_ret = soup.find("div",{'id':'done'})
print(data_ret['data-ret'])
Output:
512,500
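If the numbers themselves are needed rather than the raw string, the attribute can be split after lookup. This is a minimal sketch, not part of the original answer: it uses the built-in "html.parser" instead of lxml, and it assumes the attribute always holds comma-separated integers.

```python
from bs4 import BeautifulSoup

html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# select_one returns None when nothing matches, so check before reading attributes.
done = soup.select_one("div#done")

# .get() returns None instead of raising KeyError if the attribute is missing.
raw = done.get("data-ret") if done is not None else None

# Assumption: the value is always a comma-separated list of integers.
values = [int(part) for part in raw.split(",")] if raw else []
print(values)
```

Using .get() rather than indexing keeps the scraper from crashing on pages where the attribute is absent.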
I need to find all tags of a certain kind (class "nice") but excluding those after a certain other tag (class "stop").
<div class="nice"></div>
<div class="nice"></div>
<div class="stop">here should be the end of found items</div>
<div class="nice"></div>
<div class="nice"></div>
How do I accomplish this using bs4?
I found this as a similar question but it appears a bit fuzzy.
You can use for example .find_previous to filter out unwanted tags:
from bs4 import BeautifulSoup
html_doc = """\
<div class="nice">want 1</div>
<div class="nice">want 2</div>
<div class="stop">here should be the end of found items</div>
<div class="nice">do not want 1</div>
<div class="nice">do not want 2</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
for div in soup.find_all("div", class_="nice"):
    if div.find_previous("div", class_="stop"):
        break
    print(div)
Prints:
<div class="nice">want 1</div>
<div class="nice">want 2</div>
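Another way to look at the same problem is to start from the "stop" tag and walk backwards. The sketch below is an alternative, not the answer's method: find_all_previous returns matches in reverse document order, so the result is reversed to restore page order. The fallback for a missing "stop" tag is an assumption about the desired behaviour.

```python
from bs4 import BeautifulSoup

html_doc = """\
<div class="nice">want 1</div>
<div class="nice">want 2</div>
<div class="stop">here should be the end of found items</div>
<div class="nice">do not want 1</div>
<div class="nice">do not want 2</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
stop = soup.find("div", class_="stop")

if stop is None:
    # Assumption: with no "stop" marker, every "nice" div is wanted.
    wanted = list(soup.find_all("div", class_="nice"))
else:
    # Matches come back nearest-first, so reverse them into document order.
    wanted = stop.find_all_previous("div", class_="nice")[::-1]

texts = [div.get_text() for div in wanted]
print(texts)
```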
I am trying to fetch all classes (including the data inside "data_from", "data_to") from the following structure:
<div class="alldata">
<div class="data_from">
<div class="data_to">
<div class="data_to">
<div class="data_from">
</div>
So far I have tried finding all classes, without success. The "data_from", "data_to" classes are not being fetched by:
soup.find_all(class_=True)
When I try to iterate over the "alldata" class, I fetch only the first "data_from" class.
for data in soup.findAll('div', attrs={"class": "alldata"}):
    print(data.prettify())
All assistance is greatly appreciated. Thank you.
In new code, avoid the old findAll() syntax, and don't mix it with the new syntax; use find_all() only. For more details, take a minute to check the docs.
Your HTML is not valid, but with valid HTML you could reach your goal using a CSS selector that selects every <div> that has a class attribute and is contained in your outer <div>:
soup.select('.alldata div[class]')
Example
from bs4 import BeautifulSoup
html='''<div class="alldata">
<div class="data_from"></div>
<div class="data_to"></div>
<div class="data_to"></div>
<div class="data_from"></div>
</div>'''
soup = BeautifulSoup(html, "html.parser")
soup.select('.alldata div[class]')
Output
[<div class="data_from"></div>,
<div class="data_to"></div>,
<div class="data_to"></div>,
<div class="data_from"></div>]
In addition, if you want their text, iterate over the ResultSet:
for e in soup.select('.alldata div[class]'):
    print(e.text)
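Since the question asks for the classes themselves, the same selector can also be used to collect the class names. A short sketch, assuming the valid HTML from the example above; note that Beautiful Soup returns the class attribute as a list, because an element can carry several classes.

```python
from bs4 import BeautifulSoup

html = '''<div class="alldata">
<div class="data_from"></div>
<div class="data_to"></div>
<div class="data_to"></div>
<div class="data_from"></div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# div["class"] is a list of class names, so flatten it across all matched divs.
class_names = [
    name
    for div in soup.select(".alldata div[class]")
    for name in div["class"]
]
print(class_names)
```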
I want to ignore one class when using find_all. I've followed the solution from "Select all divs except ones with certain classes in BeautifulSoup".
My divs are a bit different; I want to ignore description-0:
<div class="abc">...</div>
<div class="parent">
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
</div>
<div class="xyz">...</div>
Following is my code
classToIgnore = ["description-0"]
all = soup.find_all('div', class_=lambda x: x not in classToIgnore)
It is reading all divs on the page, instead of just the ones with "description-n". How do I fix it?
Use regex, like this, for example:
import re
from bs4 import BeautifulSoup
sample_html = """<div class="abc">...</div>
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
<div class="xyz">...</div>"""
classes_regex = (
    BeautifulSoup(sample_html, "lxml")
    .find_all("div", {"class": re.compile(r"description-[1-9]")})
)
print(classes_regex)
Output:
[<div class="description-1"></div>, <div class="description-2"></div>]
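The lambda in the question matches every div because "x not in classToIgnore" is also true for unrelated classes and for divs with no class at all (the function then receives None). A regex-free sketch of a fix follows; the helper name wanted_class is hypothetical, and the "description-" prefix check is an assumption about which divs are of interest.

```python
from bs4 import BeautifulSoup

sample_html = """<div class="abc">...</div>
<div class="parent">
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
</div>
<div class="xyz">...</div>"""

ignored = {"description-0"}

def wanted_class(value):
    # value is None for tags without a class attribute, so guard against that.
    return (
        value is not None
        and value.startswith("description-")
        and value not in ignored
    )

soup = BeautifulSoup(sample_html, "html.parser")
matches = soup.find_all("div", class_=wanted_class)
matched_classes = [div["class"] for div in matches]
print(matched_classes)
```

Requiring the prefix as well as excluding the ignored names is what keeps "abc", "parent" and "xyz" out of the result.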
I am trying to scrape the text inside nested divs, but I am unable to get the text (TEXT HERE). As you can see below, the text I want to print is found inside all those divs, and because it is not inside a 'p' tag I was unable to print it. I am using BeautifulSoup to extract the text. When I run the code below, it does not print anything.
The structure of the 'div' is
<div class="_333v _45kb".....
<div class="_2a_i" ...............
<div class="_2a_j".......</div>
<div class="_2b04"...........
<div class="_14v5"........
<div class="_2b06".....
<div class="_2b05".....</div>
<div id=............>**TEXT HERE**</div>
</div>
</div>
</div>
</div>
</div>
My code:
theurl = "here URL"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.praser")
comm_list = soup.findAll('div', class_="_333v _45kb")
for lists in comm_list:
    print(comm_list.find('div').text)
Because the OP still has not provided enough information, here is a sample:
from bs4 import BeautifulSoup
html = '''
<div class="foo">
<div class="bar">
<div class="spam">Some Spam Here</div>
<div id="eggs">**TEXT HERE**</div>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# This will print all the text
div = soup.find('div', {'class':'foo'})
print(div.text)
print('\n----\n')
# if other divs don't have id
for div in soup.find_all('div'):
    if div.has_attr('id'):
        print(div.text)
Output:
Some Spam Here
**TEXT HERE**
---------
**TEXT HERE**
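The loop over divs with an id can also be written as a single CSS selector. This is a sketch on the same sample markup, not the OP's real page; the class name "foo" and the id "eggs" come from the sample above, and get_text(strip=True) trims the surrounding whitespace.

```python
from bs4 import BeautifulSoup

html = '''
<div class="foo">
  <div class="bar">
    <div class="spam">Some Spam Here</div>
    <div id="eggs">**TEXT HERE**</div>
  </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# "div.foo div[id]" matches any div that has an id attribute, at any depth inside div.foo.
texts = [div.get_text(strip=True) for div in soup.select("div.foo div[id]")]
print(texts)
```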
The layout is as follows:
<div class="App">
<div class="content">
<div class="title">Application Name #1</div>
<div class="image" style="background-image: url(https://img_url)">
</div>
<a class="signed button" href="http://app_url">install app</a>
</div>
</div>
I'm trying to grab the TITLE and then the APP_URL. Ideally, when I print the result as HTML, the TITLE should become a hyperlink to the APP_URL.
My code is below, but it doesn't yield the desired results. I believe I need to add another command within the loop to grab the title. The only problem is: how do I make sure that each TITLE is paired with its own APP_URL? There are at least 15 apps with <div class="App">, and I want all 15 results.
IMPORTANT: the href links must come from the element with the class "signed button".
soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
    a = div.findAll('a')[1]
    print(a.text.strip(), '=>', a.attrs['href'])
Use CSS selectors:
from bs4 import BeautifulSoup
html = """
<div class="App">
<div class="content">
<div class="title">Application Name #1</div>
<div class="image" style="background-image: url(https://img_url)">
</div>
<a class="signed button" href="http://app_url">install app</a>
</div>
</div>"""
soup = BeautifulSoup(html, 'html5lib')
for div in soup.select('div.App'):
    title = div.select_one('div.title')
    link = div.select_one('a')
    print("Click here: <a href='{}'>{}</a>".format(link["href"], title.text))
Which yields
Click here: <a href='http://app_url'>Application Name #1</a>
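With 15 or so apps on a page, searching inside each div.App keeps every title paired with its own link. The sketch below extends the idea to several apps and restricts the link to the "signed button" class mentioned in the question. The markup, the URLs app_url_1 and app_url_2, and the third app without a link are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="App">
  <div class="content">
    <div class="title">Application Name #1</div>
    <a class="signed button" href="http://app_url_1">install app</a>
  </div>
</div>
<div class="App">
  <div class="content">
    <div class="title">Application Name #2</div>
    <a class="signed button" href="http://app_url_2">install app</a>
  </div>
</div>
<div class="App">
  <div class="content">
    <div class="title">No link here</div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

apps = []
for app in soup.select("div.App"):
    # Searching within each app div pairs a title with the link from the same app.
    title = app.select_one("div.title")
    link = app.select_one("a.signed.button[href]")
    if title is None or link is None:
        continue  # skip incomplete entries instead of raising
    apps.append((title.get_text(strip=True), link["href"]))

for name, url in apps:
    print("<a href='{}'>{}</a>".format(url, name))
```

The selector "a.signed.button" matches an anchor carrying both classes, which is how the class attribute "signed button" is interpreted.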
Maybe something like this will work?
soup = BeautifulSoup(example, 'html.parser')
for div in soup.find_all('div', {'class': 'App'}):
    a = div.find_all('a')[0]
    print(div.find_all('div', {'class': 'title'})[0].text, '=>', a.attrs['href'])