Extract <div title="xpto"> element with BS4 - python

I'm trying to extract all <div title="some different text here"> elements, but I couldn't find a way.
There are several of them spread across the HTML structure, and it seems I would need to pass the title attribute with the exact text value for each one, but the values are all different.
What I am trying:
container_indicators = today_container.find_all('div', {'title'})
But it didn't work out.
I'm working with beautiful soup + python.
Any help?

Try:
container_indicators = today_container.find_all('div', attrs={'title': True})
Or:
container_indicators = today_container.select('div[title]')
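Either form matches any div that simply has a title attribute, whatever its value. As a minimal sketch (assuming today_container is a Tag from an already-parsed document), you could then read each title like this:
container_indicators = today_container.find_all('div', attrs={'title': True})
for div in container_indicators:
    # Every matched div carries a title attribute, whatever its value.
    print(div['title'])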

Related

Get content from certain tags with certain attributes using BS4

I need to get the content from the following tag with these attributes: <span class="h6 m-0">.
An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it obviously needs to return Hello world.
My current code is as follows:
page = BeautifulSoup(text, 'html.parser')
names = [item["class"] for item in page.find_all('span')]
This works fine, and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class "h6 m-0" and grab the content inside. How will I go about doing this?
page = BeautifulSoup(text, 'html.parser')
names = page.find_all('span', class_='h6 m-0')
Without knowing your use case, I don't know if this will work:
names = [item["class"] for item in page.find_all('span', class_="h6 m-0")]
Can you please be more specific about what problem you are facing? This should work fine for you, though.
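To grab the content inside each matching span (which is what the question ultimately asks for), a minimal sketch using the same soup would be:
page = BeautifulSoup(text, 'html.parser')
# get_text() returns the text inside each matching span,
# e.g. "Hello world" for <span class="h6 m-0">Hello world</span>.
contents = [span.get_text() for span in page.find_all('span', class_='h6 m-0')]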

BS4 to find embedded ID number

I'm trying to use BS4 to find ID numbers that are embedded in html. However, they aren't really attached to anything I'm used to working with, like a tag or the like. I've looked at how to pull something from a div class, but haven't had success with that either. Below is what soup looks like after I collect the html:
<div class="result-bio">
<div class="profile-image">
<img class="search-image load-member-profile" ng-click="results.loadProfile(result.UGuid)"
ng-src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"/>
</div>
The code I have been attempting is:
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup)
ID_Numbers = []
for IDs in soup.find_all('div', string='http'):
    ID_Numbers.append(IDs.text)
Does anyone have a suggestion on how to solve this? I imagine I will have to strip later, but really all I want is that id=xxxx value embedded in there. I've tried most of the solutions I've seen on stack with no success. Thanks!
You need to find all the image elements, then loop over these elements to get the source for each image. Then, simply break the src down to get the id.
import bs4

soup = bs4.BeautifulSoup(page_source, features='html.parser')
for i in soup.find_all('img'):
    src = i['src']
    try:
        id = src.split('?id=')[1]
        print(id)
    except IndexError:
        continue
Here, I have split the src to get the id, but in more complicated cases you may need to use regex.
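For example, a regex sketch along these lines (assuming the IDs are always GUID-shaped, which is an assumption on my part) would pull every id value out of the raw HTML in one pass:
import re
# Match the 36-character GUID that follows "id=" anywhere in the page source.
guid_pattern = re.compile(r'id=([0-9a-f\-]{36})')
ids = set(guid_pattern.findall(page_source))  # set() drops the duplicate from ng-src vs. src
print(ids)  # e.g. {'9091a557-fd44-44be-9468-9386d90a39b8'}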
You can use the built-in urllib.parse library like below:
from bs4 import BeautifulSoup
from urllib.parse import urlsplit
html_doc = """
<div class="result-bio">
<div class="profile-image">
<img class="search-image load-member-profile" ng-click="results.loadProfile(result.UGuid)"
ng-src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"/>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.findAll('img')
ids = []
for i in images:
    parsed = urlsplit(i['src'])
    id = parsed.query.replace("id=", "")
    ids.append(id)
print(ids)
This gave me the output:
['9091a557-fd44-44be-9468-9386d90a39b8']
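If the query string ever carries more than one parameter, parse_qs would be a more robust way to isolate the id; a small sketch (not part of the original answer):
from urllib.parse import urlsplit, parse_qs
src = "http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
# parse_qs returns a dict of query parameters, so extra parameters don't break the extraction.
query = parse_qs(urlsplit(src).query)
print(query['id'][0])  # 9091a557-fd44-44be-9468-9386d90a39b8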

Unable to Scrape Content that comes after a Comment Python BeautifulSoup

I am trying to scrape the tables from the following page:
https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml
When I reach the HTML for the batting tables, I encounter a very long comment which contains the HTML for the table:
<div id="all_WashingtonSenatorsbatting" class="table_wrapper table_controls">
<div class="section_heading">
<div class="section_heading_text">
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
.....
-->
<div class="table_outer_container mobile_table">
<div class="footer no_hide_long">
Where the last two divs are what I am interested in scraping, and everything between the <!-- and the --> is a comment which happens to contain a copy of the table in the table_outer_container class below.
The problem is that when I read the page source into Beautiful Soup, it will not read anything after the comment within the table_wrapper class div which contains everything. The following code illustrates the problem:
batting = page_source.find('div', {'id':'all_WashingtonSenatorsbatting'})
divs = batting.find_all('div')
len(divs)
gives me
Out[1]: 3
When there are obviously 5 div children under the div id="all_WashingtonSenatorsbatting" element.
Even when I extract the comment using
from bs4 import Comment
for comments in soup.findAll(text=lambda text: isinstance(text, Comment)):
    comments.extract()
The resulting soup still doesn't contain the last two div elements I want to scrape. I have been trying to play with the code using regular expressions, but so far no luck. Any suggestions?
I found a workable solution. Using the following code, I extract the comment (which brings with it the last two div elements I wanted to scrape), process it again in BeautifulSoup, and scrape the table:
import requests
from bs4 import BeautifulSoup, Comment

s = requests.get(url).content
soup = BeautifulSoup(s, "html.parser")
table = soup.find_all('div', {'class': 'table_wrapper'})[0]
comment = table(text=lambda x: isinstance(x, Comment))[0]
newsoup = BeautifulSoup(comment, 'html.parser')
table = newsoup.find('table')
It took me a while to get to this, and I would be interested to see if anyone comes up with other solutions or can offer an explanation of how this problem came to be.
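For what it's worth, these pages appear to ship most of their tables inside HTML comments and inject the live copies with JavaScript, which is why re-parsing the comment text as its own document recovers them. A sketch that generalizes the workaround to every commented table on the page (variable names are illustrative):
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = []
for comment in soup.find_all(text=lambda x: isinstance(x, Comment)):
    # Re-parse each comment's text and keep any tables hidden inside it.
    inner = BeautifulSoup(comment, 'html.parser')
    tables.extend(inner.find_all('table'))
print(len(tables))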

problems scraping web page using python

Hi, I'm quite new to Python and my boss has asked me to scrape this data; however, it is not my strong point, so I was wondering how I would go about this.
The text that I'm after (inside the quote marks) also changes every few minutes, so I'm not sure how to locate it either.
I am using Beautiful Soup and lxml at the moment; however, if there are better alternatives I'm happy to try them.
This is the inspected element of the webpage:
<div class="sometext">
<h3> somemoretext </h3>
<p>
<span class="title" title="text i want">text i want</span>
<br>
</p>
</div>
I have tried using:
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[#class="title"]/text()')
print r
Thank you in advance,any help would be appreciated!
First do this to get what you are looking at in the soup:
soup = BeautifulSoup(page)
print soup
That way you can double check that you are actually dealing will what you think you are dealing with.
Then do this:
r = soup.findAll('span', attrs={"class": "title"})
for span in r:
    print span.text
This will get all the span tags with a class=title, and then text will print out all the text in between the tags.
Edited to Add
Note that esecules' answer will get you the title within the tag (<span class = "title" title="text i want">) whereas mine will get the title from the text (<span class = "title" >text i want</span>)
Perhaps find is the method you really need, since you're only ever looking for one element (see the docs).
r = soup.find('div', 'sometext').find('span','title')['title']
If you're familiar with XPath and you don't need features that are specific to BeautifulSoup, then using lxml alone is enough (or maybe even better, since lxml is known to be faster):
from lxml import html
import requests
page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print r

Parse JavaScript href with Python

Been having a lot of trouble with this... new to Python so sorry if I just don't know the proper search terms to find the info myself. I'm not even positive it's because of the JS but that's the best idea I've got.
Here's the section of HTML I'm parsing:
...
<div class="promotion">
<div class="address">
5203 Alhama Drive
</div>
</div>
...
...and the Python I'm using to do it (this version is the closest I've gotten to success):
homeFinderSoup = BeautifulSoup(open("homeFinderHTML.html"), "html5lib")
addressClass = homeFinderSoup.find_all('div', 'address')
for row in addressClass:
    print row.get('href')
...which returns
None
None
None
# Create soup from the html. (Here I am assuming that you have already read the file into
# the variable "html" as a string).
soup = BeautifulSoup(html)
# Find all divs with class="address"
address_class = soup.find_all('div', {"class": "address"})
# Loop over the results
for row in address_class:
    # Each result has one <a> tag, and we need to get the href property from it.
    print row.find('a').get('href')
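Note that the snippet in the question doesn't actually show an <a> tag inside the address div, so row.find('a') may return None. A hedged variant (keeping the Python 2 print style of the answer above) that falls back to the visible text when no link is present:
for row in address_class:
    link = row.find('a')
    if link is not None:
        # An anchor is present, so report its href (which may be a javascript: URL).
        print link.get('href')
    else:
        # No anchor: fall back to the visible address text, e.g. "5203 Alhama Drive".
        print row.get_text(strip=True)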
