Get values from CSS span element with constantly changing values - python

I am trying to scrape a website that seems to use different values each time a particular span element appears. For example, the first few times the span element appears, it could be:
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
I have tried the following, but I keep getting empty lists:
site = BeautifulSoup(link.text, "html.parser")
jobs_a = site.find_all("span title")
or
jobs_a = site.find_all("span", attrs="title")
or
jobs_a = site.find_all("span", attrs="title*")
Any suggestions?

I prefer using a CSS selector. span[title] matches every span that has a title attribute, whatever its value.
from bs4 import BeautifulSoup
data = '''\
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
'''
soup = BeautifulSoup(data, 'html.parser')
for s in soup.select('span[title]'):
    print(f"{s.text=}\t{s.attrs['title']=}")
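The same match can probably be written with find_all as well. The sketch below assumes a small made-up snippet (the third span without a title is added only for illustration) and uses title=True, which asks for any span that carries a title attribute regardless of its value.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the last span is hypothetical and has no title attribute.
data = '''\
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span>no title here</span>
'''
soup = BeautifulSoup(data, "html.parser")

# title=True matches any <span> that has a title attribute, whatever its value.
with_find_all = [s["title"] for s in soup.find_all("span", title=True)]

# The CSS attribute selector selects the same elements.
with_select = [s["title"] for s in soup.select("span[title]")]

print(with_find_all)
print(with_select)
```

The attempts in the question return empty lists because "span title" is looked up as a tag name, and attrs given as a plain string is treated as a class to match rather than an attribute name.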

Related

How to find all the tags matching two values using BeautifulSoup in Python

I am trying to get all the values in the span class="last value" sections. Sometimes the sections have a minor variation, span class="last value empty", and my code skips those. I would like to get all the sections whose class starts with "last value" or, alternatively, all the sections that are either "last value" or "last value empty".
This is the point where I am stuck:
r = requests.get(baseurl)
soup = BeautifulSoup(r.content)
elem = soup.find_all('span', {'class':"last value"})
The problem is that they are treated as two separate classes, last and value. You can use a CSS selector like this:
soup.select('span.last.value')
Example
html="""
<span class="last value">
1
</span>
<span class="last value empty">
2
</span>
"""
from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html,'html5lib')
print(soup.select('span.last.value'))
Output
[<span class="last value">
1
</span>, <span class="last value empty">
2
</span>]
You can use a CSS attribute selector like:
soup.select('span[class*="last value"]')
or you can use the scrapy selector with XPath support:
from scrapy.selector import Selector
sel = Selector(text=r.text)
elem = sel.xpath('//span[contains(@class, "last value")]')
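To make the difference concrete, here is a minimal sketch comparing the exact class string match from the question with the CSS class selector. The third span is a hypothetical extra added to show that span.last.value requires both classes.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the span with only class="value" is hypothetical.
html = """
<span class="last value">1</span>
<span class="last value empty">2</span>
<span class="value">3</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A class string containing a space is compared against the whole attribute value.
exact = [s.get_text(strip=True)
         for s in soup.find_all("span", {"class": "last value"})]

# The CSS selector requires both classes but allows additional ones.
both = [s.get_text(strip=True) for s in soup.select("span.last.value")]

print(exact)
print(both)
```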

Scrape 2 inner texts within div as one value

I have the following html
<div class="price-block__highlight"><span class="promo-price" data-test="price">102,
<sup class="promo-price__fraction" data-test="price-fraction">99</sup>
</span>
</div>
I want to print the price from this HTML with a decimal point instead of the comma, so
print price should result in:
102.99
I have the following code
pricea = page_soup.find("div", {"class":"price-block__highlight"})
price = str(pricea.text.replace('-','').replace(',','.').strip())
print price
This however results in:
102.
99
When written to a CSV file, this creates multiple rows. How can I get both numbers in one value?
I think you are using bs4. Split the text on the comma, strip each part, and join the parts with a dot:
from bs4 import BeautifulSoup
html_doc = """
<div class="price-block__highlight"><span class="promo-price" data-test="price">102,
<sup class="promo-price__fraction" data-test="price-fraction">99</sup>
</span>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
price_div = soup.find("div", {"class": 'price-block__highlight'})
texts = [x.strip() for x in price_div.text.split(',')]
print('.'.join(texts))
Output
102.99
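An alternative sketch, assuming the same markup: stripped_strings yields each text fragment with surrounding whitespace removed, so joining the fragments avoids the stray newline without splitting on the comma. Converting to Decimal is an extra step not in the original answer and is shown only as one way to get a numeric value.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

html_doc = """
<div class="price-block__highlight"><span class="promo-price" data-test="price">102,
<sup class="promo-price__fraction" data-test="price-fraction">99</sup>
</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
price_div = soup.find("div", class_="price-block__highlight")

# Join the whitespace-stripped fragments; whitespace-only fragments are skipped.
raw = "".join(price_div.stripped_strings)

# Swap the decimal comma for a dot before converting to a number.
price = Decimal(raw.replace(",", "."))
print(price)
```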

How to find an ID in a div class with multiple values BS4 Python

I am trying to find an ID in a div class which has multiple values using BS4. The HTML is:
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk. My current code is:
soup = bs(size.text, "html.parser")
sizes = soup.find_all("div",{"class":"size"})
size = sizes[0]["data-test5-uk"]
size.text comes from a GET request to the site that serves this HTML. However, it fails with:
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation and then the solution.
.find_all('tag') finds all instances of that tag, which we can then loop through.
.find('tag') finds only the first instance.
We can read an attribute's value with either ['attr'] or .get('attr'). They return the same value when the attribute exists; when it is missing, ['attr'] raises KeyError while .get('attr') returns None.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"></a>
</div>'''
soup = BeautifulSoup(html, 'lxml')
# match on the class name itself; the trailing space in class="size " is not part of it
one_div = soup.find('div', class_='size')
# your code didn't work because the attribute is on the <a> tag, not on the <div>
print(one_div.find('a')['data-test5-uk'])
# for multiple divs
for each in soup.find_all('div', class_='size'):
    # we loop through each instance and do the same
    datauk = each.find('a')['data-test5-uk']
    print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why did your ['arg'] lookup fail? You tried to read ["data-test5-uk"] from the div. <div class="size "> has no such attribute; its only attribute is class="size ".
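The difference between the two lookups can be seen in a small sketch. The markup here is a shortened, well-formed version of the HTML above, reduced to the attributes needed for the illustration.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed version of the markup from the question.
html = '<div class="size "><a class="selectSize" id="25746" data-test5-uk="7"></a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="size")
link = div.find("a")

# The attribute lives on the <a> tag, so the lookup succeeds there.
uk_size = link["data-test5-uk"]

# On the <div>, .get() returns None while [] raises KeyError.
missing = div.get("data-test5-uk")
try:
    div["data-test5-uk"]
    raised = False
except KeyError:
    raised = True

print(uk_size, missing, raised)
```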

Why does BeautifulSoup work the second time parsing, but not the first

This is the ResultSet of running soup[0].find_all('div', {'class':'font-160 line-110'}):
[<div class="font-160 line-110" data-container=".snippet-container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">
<a class="no-underline group-ib color-inherit"
href="/en/ais/details/ports/959">
<span class="text-default">CN</span><span class="text-default text-darker">XMN
</span>
</a>
</div>]
In an attempt to pull out XIAMEN [CN] from the title attribute, I could not use a[0].find('div')['title'] (where a is the above BeautifulSoup ResultSet). However, if I copy and paste that HTML as a new string, say,
b = '''<div class="font-160 line-110" data-container=".snippet container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">'''
Then do:
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.find('div')['title']
'XIAMEN [CN]'
Why do I have to reSoup the Soup? Why doesn't this work on my first search?
Edit, origin of soup:
I have a list of URLs that I'm going through via grequests. One of the things I'm looking for is the title that contains XIAMEN [CN].
So soup was created when I did
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
    soup.append(souporigin)
The urls are
[
'http://www.marinetraffic.com/en/ais/details/ships/shipid:564352/imo:9643752/mmsi:511228000/vessel:DE%20MI',
'http://www.marinetraffic.com/en/ais/details/ships/shipid:3780155/imo:9712395/mmsi:477588800/vessel:SITC%20GUANGXI?cb=2267'
]
I found out that the problem came from how I set up my BeautifulSoup. I created a list of partial search results and then had to iterate over that list to search it again. I fixed this by searching for what I wanted in one line:
I changed:
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
    soup.append(souporigin)
to:
a = soup.find("div", class_='font-160 line-110')["title"]
I now run this search as soon as I create my soup, which removes a lot of redundancy in the code. I had been creating lists of ResultSets and then calling find on them for each new field.
You are using the wrong selection.
The selection soup[0].find_all('div', {'class':'font-160 line-110'}) already finds the <div>, and you can see the <div> when you print it. But when you add .find(), it starts searching inside that <div>, so .find('div') looks for another div nested in the current one and returns None.
You need
a[0]['title']
When you create a new soup, the root element is not the div but [document], and the div is its child, so find('div') works there.
>>> a[0].name
div
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.name
[document]
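The behaviour described above can be reproduced with a minimal sketch. The markup is a trimmed version of the div from the question; most of its attributes are left out for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed version of the <div> from the question.
html = ('<div class="font-160 line-110" title="XIAMEN [CN]">'
        '<a href="/en/ais/details/ports/959"><span>CN</span></a></div>')
soup = BeautifulSoup(html, "html.parser")

a = soup.find_all("div", {"class": "font-160 line-110"})
first = a[0]

# .find() searches the descendants of the tag, and this div has no nested div.
nested = first.find("div")

# The attribute is read directly from the tag that was already found.
title = first["title"]

# A fresh soup has [document] as its root, so find('div') reaches the div itself.
root_name = soup.name
from_root = soup.find("div")["title"]

print(nested, title, root_name, from_root)
```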

Unable to fetch <div> tag values in python

The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text'):
But in the output I get "None". Could you please help me resolve this issue?
An element with both a sub-element and string content can be accessed using stripped_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h, 'html.parser')
for s in soup.select("div.search-page-text")[0].stripped_strings:
    print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes the string content of both the span and the div. But if you know that the div first contains the span with text, you could get the interesting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]
If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for div in soup.find_all("div", { "class" : "search-page-text" }):
    print ''.join(div.find_all(text=True, recursive=False)).strip()
    #print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For Python 3 the print line should be
print (''.join(div.find_all(text=True, recursive=False)).strip())
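Putting the pieces together for Python 3, a minimal sketch might look like the following. It assumes the same markup and uses string=True, the newer spelling of the text=True argument in recent BeautifulSoup 4 releases. The helper name direct_text is made up for this illustration.

```python
from bs4 import BeautifulSoup

txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt, "html.parser")


def direct_text(tag):
    """Return only the text that is an immediate child of tag (hypothetical helper)."""
    # recursive=False keeps the search to direct children, so the <span> text is skipped.
    return "".join(tag.find_all(string=True, recursive=False)).strip()


costs = [direct_text(div) for div in soup.select("div.search-page-text")]
print(costs)
```

Joining and stripping, as in the answer above, avoids depending on which index the wanted string happens to have.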
