How to select a div containing a span with a specific id via BeautifulSoup? - python

I want to scrape the text of a div that includes a span with a specific id or class.
For example:
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span2"></span>
text to scrape
</div>
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span3"></span>
text to scrape
</div>
I want to get the text in the div (class1), but specifically only the ones which include the span with id span1.
thanks
Thank you, I have solved my problem using the code below:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for x in soup.find_all(class_='class1'):
    x_span = x.find('span', id='span1')  # span1 is an id in the sample html, not a class
    if x_span:
        print(x_span.parent.text.strip())

This should do it:
soup.select_one('.class1:has(#span1)').text
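A minimal runnable sketch of that one-liner, with a shortened sample of the question's HTML; note that select_one only returns the first matching div (the answer below shows how to collect all of them):
from bs4 import BeautifulSoup

html = '''
<div class="class1"><span id="span1"></span>text to scrape</div>
<div class="class1"><span id="span2"></span>other text</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# :has(#span1) keeps only elements that contain a descendant with id="span1"
print(soup.select_one('.class1:has(#span1)').text.strip())  # text to scrape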

As Beso mentioned, there is a typo in your HTML that should be fixed.
How to select?
The simplest approach in my opinion is to use a CSS selector to select all <div> that have a <span> with the id "span1" (because there is more than one in your example HTML):
soup.select('div:has(> #span1)')
or even more specifically, as mentioned by diggusbickus:
soup.select('.class1:has(> #span1)')
To get the text of all of them you have to iterate the result set:
[x.get_text(strip=True) for x in soup.select('div:has(> #span1)')]
That will give you a list of texts:
['text to scrape', 'text to scrape']
Example
from bs4 import BeautifulSoup
html='''
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span2"></span>
text to scrape
</div>
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span3"></span>
text to scrape
</div>
'''
soup = BeautifulSoup(html, "html.parser")
[x.get_text(strip=True) for x in soup.select('div:has(> #span1)')]
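If you run this as a script rather than in an interactive session, wrap the comprehension in print() to see the result:
print([x.get_text(strip=True) for x in soup.select('div:has(> #span1)')])
# ['text to scrape', 'text to scrape']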

Related

python beautifulsoup: how to find all before certain stop tag?

I need to find all tags of a certain kind (class "nice") but excluding those after a certain other tag (class "stop").
<div class="nice"></div>
<div class="nice"></div>
<div class="stop">here should be the end of found items</div>
<div class="nice"></div>
<div class="nice"></div>
How do I accomplish this using bs4?
I found this similar question, but it appears a bit fuzzy.
You can use, for example, .find_previous to filter out unwanted tags:
from bs4 import BeautifulSoup
html_doc = """\
<div class="nice">want 1</div>
<div class="nice">want 2</div>
<div class="stop">here should be the end of found items</div>
<div class="nice">do not want 1</div>
<div class="nice">do not want 2</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
for div in soup.find_all("div", class_="nice"):
    if div.find_previous("div", class_="stop"):
        break
    print(div)
Prints:
<div class="nice">want 1</div>
<div class="nice">want 2</div>
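If you want the matching tags as a list rather than printing them, the same find_previous check also works with itertools.takewhile; a small sketch, assuming the same soup object as above:
from itertools import takewhile

wanted = list(takewhile(
    lambda div: not div.find_previous("div", class_="stop"),
    soup.find_all("div", class_="nice"),
))
print(wanted)  # [<div class="nice">want 1</div>, <div class="nice">want 2</div>]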

how to get specific links with BeautifulSoup?

I am trying to crawl HTML source with Python using BeautifulSoup.
I need to get the href of specific link <a> tags.
This is my test code. I want to get the links <a href="/example/test/link/activity1~10" target="_blank">:
<div class="listArea">
<div class="activity_sticky" id="activity">
.
.
</div>
<div class="activity_content activity_loaded">
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
.
.
.
</div>
</div>
Check the main page of the bs4 documentation:
for link in soup.find_all('a'):
    print(link.get('href'))
This is code for the problem. You should find all the <a></a> tags, then get the value of href.
soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    if i.get('target') == "_blank":  # .get() avoids a KeyError on <a> tags without a target attribute
        print(i['href'])
Hope my answer could help you.
Select the specific <a> - as an alternative to Mason Ma's answer you can also use css selectors:
soup.select('.activity_content a')
or by its attribute target -
soup.select('.activity_content a[target="_blank"]')
Example
This will give you a list of links matching your condition:
from bs4 import BeautifulSoup
html = '''
<div class="activity_content activity_loaded">
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # specify a parser explicitly
[x['href'] for x in soup.select('.activity_content a[target="_blank"]')]
Output
['/example/test/link/activity1', '/example/test/link/activity2']
Based on my understanding of your question, you're trying to extract the links (href) from anchor tags where the target value is _blank. You can do this by searching for anchor tags and narrowing down to those whose target == '_blank':
links = soup.findAll('a', attrs={'target': '_blank'})
for link in links:
    print(link.get('href'))
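For a self-contained check, the same filter can be run against the html string from the example above (reusing that variable is my assumption here):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('a', attrs={'target': '_blank'})
print([link.get('href') for link in links])
# ['/example/test/link/activity1', '/example/test/link/activity2']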

Get the text which is found inside a nested Div tag using python BeautifulSoup

I am trying to scrape the text between nested divs but I am unable to get the text (TEXT HERE). The text is found inside the nested div. So, as you see below, I want to print out the text (TEXT HERE) which is found inside all those 'div's; as the text is not inside a 'p' tag, I was unable to print it. I am using BeautifulSoup to extract the text. When I run the code below, it does not print out anything.
The structure of the 'div' is
<div class="_333v _45kb".....
<div class="_2a_i" ...............
<div class="_2a_j".......</div>
<div class="_2b04"...........
<div class="_14v5"........
<div class="_2b06".....
<div class="_2b05".....</div>
<div id=............>**TEXT HERE**</div>
</div>
</div>
</div>
</div>
</div>
My code:
theurl = "here URL"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.praser")
comm_list = soup.findAll('div', class_="_333v _45kb")
for lists in comm_list:
    print(comm_list.find('div').text)
Because the OP continues to not provide enough information, here is a sample:
from bs4 import BeautifulSoup
html = '''
<div class="foo">
<div class="bar">
<div class="spam">Some Spam Here</div>
<div id="eggs">**TEXT HERE**</div>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# This will print all the text
div = soup.find('div', {'class':'foo'})
print(div.text)
print('\n----\n')
# if other divs don't have id
for div in soup.findAll('div'):
    if div.has_attr('id'):
        print(div.text)
Output:
Some Spam Here
**TEXT HERE**
----
**TEXT HERE**
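A CSS attribute selector can replace the has_attr loop; a minimal sketch against the same soup object built from the sample HTML above (the [id] selector matches any div that has an id attribute, whatever its value):
for div in soup.select('div[id]'):
    print(div.text)  # **TEXT HERE**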

Search for text inside a tag using beautifulsoup and returning the text in the tag after it

I'm trying to parse the following HTML code in Python using Beautiful Soup. I would like to be able to search for text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do so for the following tags, so that for a given text category I can return its corresponding information.
However, I'm finding it very difficult to find the right code to do this.
<h2>Details</h2>
<div class="section-inner">
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Recommended Use</span>
<span class="_KDu">For cell phone</span>
</div>
<div class="_JDu">
<span class="_IDu">Protection</span>
<span class="_KDu">Impact protection</span>
</div>
<div class="_JDu">
<span class="_IDu">Cover Type</span>
<span class="_KDu">Back cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
</div>
I use the following code to retrieve my div tags:
soup.find_all("div", "_JDu")
Once I have retrieved the tag I can navigate inside it but I can't find the right code that will enable me to find the text inside one tag and return the text in the tag after it.
Any help would be really really appreciated as I'm new to python and I have hit a dead end.
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text

color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text=key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. The only thing left is getting the text of the second <span>, which is what the line key_tag.find_all('span')[1].text does.
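One caveat (my addition, not part of the original answer): if the key is absent, soup.find returns None and .parent raises an AttributeError. A small defensive variant of the same helper, assuming the same soup object built from the question's HTML:
def get_txt_safe(soup, key, default=None):
    # find() returns None when no <span> has exactly this text
    key_span = soup.find('span', text=key)
    if key_span is None:
        return default
    return key_span.parent.find_all('span')[1].text

print(get_txt_safe(soup, 'Color'))   # Slate, mykonos
print(get_txt_safe(soup, 'Weight'))  # None ('Weight' is a made-up key not present in the sample HTML)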
Give it a go. It will also give you the corresponding values. Make sure to wrap the HTML elements in a content = """ """ variable between triple quotes to see how it works.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, "lxml")
for elem in soup.select("._JDu"):
    item = elem.select_one("span")
    if "Features" in item.text:  # try to see if it misses the corresponding values
        val = item.find_next("span").text
        print(val)
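With the question's HTML stored in content, this prints the value belonging to the Features row:
Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges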

Python RegEx with Beautifulsoup 4 not working

I want to find all div tags which have a certain pattern in their class name but my code is not working as desired.
This is the code snippet
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':re.compile(r'common text .*')})
where html_doc is the string with the following html
<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>
But all_findings is coming out as an empty list while it should have found one item.
It works in the case of an exact match:
all_findings = soup.findAll('div',attrs={'class':re.compile(r'hide-c')})
I am using bs4.
Instead of using a regular expression, put the classes you are looking for in a list:
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
Example code:
from bs4 import BeautifulSoup
html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
print(all_findings)
This outputs:
[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]
To extend @Andy's answer, you can make a list of class names and compiled regular expressions:
soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})
Note that, in this case, you'll get the div elements with one of the specified classes/patterns - in other words, it's common or text or sighting_ followed by five digits.
If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml":
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print(all_findings)
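As an aside, if the goal is simply "divs that carry both the common and text classes", a CSS selector avoids the regular expression entirely; a small sketch, reusing the html_doc string from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
# .common.text requires both classes on the same tag, regardless of order
# or of any extra classes such as sighting_4619012
print(soup.select('div.common.text'))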
