I am trying to extract the string enclosed by the span with id="titleDescriptionID" using BeautifulSoup.
<div class="itemText">
<div class="wrapper">
<span class="itemPromo">Customer Choice Award Winner</span>
<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details" >
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
<span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
</a>
</div>
Code snippet
f = open('egg.data', 'rb')
content = f.read()
content = content.decode('utf-8', 'replace')
content = ''.join([x for x in content if ord(x) < 128])
soup = bs(content)
for itemText in soup.find_all('div', attrs={'class':'itemText'}):
    wrapper = itemText.div
    wrapper_href = wrapper.a
    for child in wrapper_href.descendants:
        if child['id'] == 'titleDescriptionID':
            print(child, "\n")
Traceback Error:
Traceback (most recent call last):
  File "egg.py", line 66, in <module>
    if child['id'] == 'titleDescriptionID':
TypeError: string indices must be integers
spans = soup.find_all('span', attrs={'id':'titleDescriptionID'})
for span in spans:
    print(span.string)
In your code, wrapper_href.descendants contains at least four elements: the two span tags and the two strings they enclose. It walks the children recursively.
wrapper_href.descendants also yields NavigableString objects, and that is what you are tripping over. A NavigableString is essentially a string object, and the child['id'] line tries to index it with a string key:
>>> next(wrapper_href.descendants)
u'\n'
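If you do want to keep walking .descendants, one option is to skip the string nodes before indexing. The following is a minimal sketch under that assumption; the trimmed HTML and the find_title helper are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed-down version of the markup from the question (illustrative only).
html = '''<a href="#" title="View Details">
<span class="itemDescription" id="titleDescriptionID">Intel Core i7-3770K</span>
<span class="itemDescription" id="lineDescriptionID">Intel Core i7-3770K</span>
</a>'''

soup = BeautifulSoup(html, "html.parser")


def find_title(anchor):
    """Hypothetical helper: return the text of the title span, or None."""
    for child in anchor.descendants:
        # NavigableString nodes have no attributes, so only inspect Tag nodes.
        # .get() returns None instead of raising when the attribute is absent.
        if isinstance(child, Tag) and child.get("id") == "titleDescriptionID":
            return child.get_text()
    return None


title = find_title(soup.a)
print(title)
```

The isinstance check avoids the TypeError, and .get("id") avoids a KeyError on tags that carry no id attribute.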
Why not just load the tag directly using itemText.find('span', id='titleDescriptionID')?
Demo:
>>> for itemText in soup.find_all('div', attrs={'class':'itemText'}):
...     print itemText.find('span', id='titleDescriptionID')
...     print itemText.find('span', id='titleDescriptionID').text
...
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K
from BeautifulSoup import BeautifulSoup
pool = BeautifulSoup(html) # where html contains the whole html as string
for item in pool.findAll('span', attrs={'id' : 'titleDescriptionID'}):
    print item.string
When we search for a tag using BeautifulSoup, we get a BeautifulSoup.Tag object, which can directly be used to access its other attributes like inner content, style, href etc.
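As a rough illustration of that point, the sketch below reads a few attributes from a found tag. It assumes the newer bs4 package rather than the BeautifulSoup 3 import used above, and the shortened HTML string is only an example.

```python
from bs4 import BeautifulSoup

# Shortened markup based on the question; the product text is abbreviated.
html = (
    '<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" '
    'title="View Details">'
    '<span class="itemDescription" id="titleDescriptionID" '
    'style="display:inline">Intel Core i7-3770K</span></a>'
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="titleDescriptionID")

text = span.string                   # inner content (the span's only child)
style = span["style"]                # attribute access by key
classes = span["class"]              # class is multi-valued, so this is a list
href = span.parent.get("href")       # attribute of the enclosing <a> tag
missing = span.get("data-missing")   # .get() returns None for absent attributes

print(text, style, classes, href, missing)
```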
Related
I'm trying to scrape some information about the positions, artists and songs from a ranking list online. Here is the ranking list website: https://kma.kkbox.com/charts/weekly/newrelease?terr=my&lang=en
I was trying to use the following code to scrape:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://kma.kkbox.com/charts/weekly/newrelease?terr=my&lang=en')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
all_songs = soup.find_all(class_="charts-list-song")
all_artists = soup.find_all(class_="charts-list-artist")
print(all_songs)
print(all_artists)
However, the output only shows:
[<span class="charts-list-desc">
<span class="charts-list-song"></span>
<span class="charts-list-artist"></span>
</span>, <span class="charts-list-desc">
<span class="charts-list-song"></span>
...
and
<span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>,
My expected output should be:
Pos artist songs
1 張哲瀚 洪荒劇場Primordial Theater
2 張哲瀚 冰川消失那天Lost Glacier
3 告五人 又到天黑
Use View Source in Chrome and you can see that the actual chart content is at the end of the HTML source, loaded into a JavaScript variable named chart.
code
import requests
from bs4 import BeautifulSoup
import json, re
page = requests.get('https://kma.kkbox.com/charts/weekly/newrelease?terr=my&lang=en')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.select('script')[-2].string
m = re.search(r'var chart = (\[{.*}\])', data)
songs = json.loads(m.group(1))
for song in songs:
    print(song['rankings']['this_period'], song['artist_name'], song['song_name'])
output
1 張哲瀚 洪荒劇場Primordial Theater
2 張哲瀚 冰川消失那天Lost Glacier
3 告五人 又到天黑
4 孫盛希 Shi Shi 眼淚記得你 (Remembered)
5 陳零九 Nine Chen 夢裡的女孩 (The Girl)
6 告五人 一念之間
7 苏有朋 玫瑰急救箱
8 林俊傑 想見你想見你想見你
...
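The code above assumes the chart lives in the second-to-last script tag and that the regex always matches. A slightly more defensive sketch searches the raw page source and returns an empty list when the variable is missing. The page_source string and its values are made up for illustration; only the field names (rankings, this_period, artist_name, song_name) come from the answer, and extract_chart is a hypothetical helper.

```python
import json
import re

# Stand-in for page.text; the real page embeds a much larger array.
page_source = '''<html><body>
<script>var unrelated = 1;</script>
<script>
var chart = [{"rankings": {"this_period": 1}, "artist_name": "Artist A", "song_name": "Song A"}, {"rankings": {"this_period": 2}, "artist_name": "Artist B", "song_name": "Song B"}];
</script>
</body></html>'''


def extract_chart(source):
    """Return the list assigned to `var chart`, or [] if it is not found."""
    # Non-greedy match up to the first "];" after the assignment.
    match = re.search(r'var chart = (\[.*?\]);', source, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))


rows = [
    (song["rankings"]["this_period"], song["artist_name"], song["song_name"])
    for song in extract_chart(page_source)
]
for row in rows:
    print(*row)
```

The non-greedy pattern would stop early if a song title ever contained "];", so treat this as a starting point rather than a robust parser.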
I'm scraping a page and found that with my XPath and regex methods I can't seem to get to a set of values that sit inside a div class.
I have tried the method described on this page:
How to get all the li tag within div tag
and then the current logic from my file, shown below:
#PRODUCT ATTRIBUTES (STYLE, SKU, BRAND) need to figure out how to loop thru a class and pull out the 2 list tags
prodattr = re.compile(r'<div class=\"pdp-desc-attr spec-prod-attr\">([^<]+)</div>', re.IGNORECASE)
prodattrmatches = re.findall(prodattr, html)
for m in prodattrmatches:
    m = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
    stymatches = re.findall(m, html)
#STYLE
sty = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
stymatches = re.findall(sty, html)
#BRAND
brd = re.compile(r'<li class=\"first first-item\">([^<]+)</li>', re.IGNORECASE)
brdmatches = re.findall(brd, html)
The code above is what I currently have, and it is NOT working: everything comes back empty. For testing purposes I'm merely printing the data, if any, so I can see it on the console.
itmDetails2 = dets['sku'] +","+ dets['description']+","+ dets['price']+","+ dets['brand']
In the console I get the following, which is what I expect; the generic messages are just placeholders until I get this logic figured out.
SKUE GOES HERE,adidas Women's Essentials Tricot Track Jacket,34.97, BRAND GOES HERE
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
Do not use Regex to parse HTML
There are better and safer ways to do this.
Take a look at this code, which uses Parsel and BeautifulSoup to extract the li tags from your sample:
from parsel import Selector
from bs4 import BeautifulSoup
html = ('<div class="pdp-desc-attr spec-prod-attr">'
'<ul class="prod-attr-list">'
'<li class="first first-item">Brand: adidas</li>'
'<li>Country of Origin: Imported</li>'
'<li class="last last-item">Style: F18AAW400D</li>'
'</ul>'
'</div>')
# Using parsel
sel = Selector(text=html)
for li in sel.xpath('//li'):
    print(li.xpath('./text()').get())
# Using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for li in soup.find_all('li'):
    print(li.text)
Output:
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
I would use an html parser and look for the class of the ul. Using bs4 4.7.1
from bs4 import BeautifulSoup as bs
html = '''
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
'''
soup = bs(html, 'lxml')
for item in soup.select('.prod-attr-list:has(> li)'):
    print([sub_item.text for sub_item in item.select('li')])
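To get from the li text to the separate brand and style values the question was after, one possibility is to split each item on its first colon. This is only a sketch: the attrs dictionary is a name chosen here, and it assumes every item follows the "Label: value" pattern seen in the sample. As a side note, the first regex in the question cannot match this markup because [^<]+ stops at the first <, and the div contains a nested <ul>.

```python
from bs4 import BeautifulSoup

html = '''
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Map each label to its value, e.g. "Brand" -> "adidas".
attrs = {}
for li in soup.select("div.pdp-desc-attr ul.prod-attr-list li"):
    # partition() splits on the first colon only, so values may contain colons.
    label, _, value = li.get_text(strip=True).partition(":")
    attrs[label.strip()] = value.strip()

print(attrs.get("Brand"), attrs.get("Style"))
```

Using attrs.get() keeps the lookup from raising if a product page omits one of the items.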
The HTML source was
html = """
<td>
<a href="/urlM5CLw" target="_blank">
<img alt="I" height="132" src="VZhAy" width="132"/>
</a>
<br/>
<cite title="mac-os-x-lion-icon-pack.en.softonic.com">
mac-os-x-lion-icon-pac...
</cite>
<br/>
<b>
Mac
</b>
OS X Lion Icon Pack's
<br/>
535 × 535 - 135k - png
</td>"""
My python code
soup = BeautifulSoup(html)
text = soup.find('td').renderContents()
With this code I get a string like:
<img alt="I" height="132" src="VZhAy" width="132"/><br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png
But I don't want <a>....</a>, I just need:
<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png
Try removing the <a> tag and then fetch what you were trying to.
>>> soup.find('a').extract()
>>> text = soup.find('td').renderContents()
>>> text
'<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 \xd7 535 - 135k - png'
You can use the Tag.decompose() method to remove the a tag and completely destroy its contents. You may also need to decode() the byte string and replace every \n occurrence with ''.
soup = BeautifulSoup(html, 'lxml')
soup.a.decompose()
print(soup.td.renderContents().decode().replace('\n', ''))
yields:
<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com"> mac-os-x-lion-icon-pac... </cite><br/><b> Mac </b> OS X Lion Icon Pack's <br/> 535 × 535 - 135k - png
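A variation on the same idea, sketched here with decode_contents(), which returns the inner markup as a str so no separate decode() step is needed. The HTML is a condensed single-line copy of the sample, which is an assumption made to keep whitespace out of the result.

```python
from bs4 import BeautifulSoup

# Condensed copy of the <td> from the question, without line breaks.
html = '''<td><a href="/urlM5CLw" target="_blank"><img alt="I" height="132" src="VZhAy" width="132"/></a><br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's</td>'''

soup = BeautifulSoup(html, "html.parser")

# decompose() removes the <a> tag and everything inside it from the tree.
soup.td.a.decompose()

# decode_contents() renders the remaining children of <td> as a str.
remaining = soup.td.decode_contents()
print(remaining)
```

extract() would work here as well; the difference is that extract() hands the removed tag back, while decompose() discards it.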
The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text'):
But in the output I get "None". Could you please help me resolve this issue?
The strings of an element that has both a sub-element and string content can be accessed using stripped_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h)
for s in soup.select("div.search-page-text")[0].stripped_strings:
    print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes the string content of both the span and the div. But if you know that the div always starts with the span, you could get the interesting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]
If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt)
for div in soup.find_all("div", { "class" : "search-page-text" }):
    print ''.join(div.find_all(text=True, recursive=False)).strip()
    #print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For python 3 the print line should be
print (''.join(div.find_all(text=True, recursive=False)).strip())
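Another way to express "only the div's own text", sketched below, is to iterate over the direct children and keep the string nodes. The direct_text name is chosen here for illustration, and the parser is passed explicitly as an assumption about the setup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''

soup = BeautifulSoup(txt, "html.parser")
div = soup.find("div", class_="search-page-text")

# .children yields only direct children, so the span's text is skipped.
direct_text = "".join(
    child for child in div.children if isinstance(child, NavigableString)
).strip()

print(direct_text)
```

As with the join-and-strip version above, this does not depend on the text sitting at a particular index.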
I am scraping this URL.
I have to scrape the main content of the page like Room Features and Internet Access
Here is my code:
for h3s in Column: # Suppose this is div.RightColumn
    for index,test in enumerate(h3s.select("h3")):
        print("Feature title: "+str(test.text))
        for v in h3s.select("ul")[index]:
            print(v.string.strip())
This code scrapes all the <li>s, but when it gets to Internet Access
I get
AttributeError: 'NoneType' object has no attribute 'strip'
because the <li> data under the Internet Access heading is contained inside double quotes, like "Wired High Speed Internet Access..."
I have tried replacing print(v.string.strip()) with print(v), which prints <li>Wired High...</li>
I have also tried print(v.text), but that does not work either.
The relevant section looks like:
<h3>Internet Access</h3>
<ul>
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
</ul>
BeautifulSoup elements only have a .string value if that string is the only child of the element. Your <li> tag contains a <span> element as well as text.
Use the .text attribute instead to extract all strings as one:
print(v.text.strip())
or use the element.get_text() method:
print(v.get_text().strip())
which also takes a handy strip flag to remove extra whitespace:
print(v.get_text(' ', strip=True))
The first argument is the separator used to join the various strings together; I used a space here.
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <h3>Internet Access</h3>
... <ul>
... <li>Wired High Speed Internet Access in All Guest Rooms
... <span class="fee">
... 25 USD per day
... </span>
... </li>
... </ul>
... '''
>>> soup = BeautifulSoup(sample)
>>> soup.li
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
>>> soup.li.string
>>> soup.li.text
u'Wired High Speed Internet Access in All Guest Rooms\n \n 25 USD per day\n \n'
>>> soup.li.get_text(' ', strip=True)
u'Wired High Speed Internet Access in All Guest Rooms 25 USD per day'
Do make sure you call it on the element:
for index, test in enumerate(h3s.select("h3")):
    print("Feature title: ", test.text)
    ul = h3s.select("ul")[index]
    print(ul.get_text(' ', strip=True))
You could use the find_next_sibling() function here instead of indexing into a .select():
for header in h3s.select("h3"):
    print("Feature title: ", header.text)
    ul = header.find_next_sibling("ul")
    print(ul.get_text(' ', strip=True))
Demo:
>>> for header in h3s.select("h3"):
...     print("Feature title: ", header.text)
...     ul = header.find_next_sibling("ul")
...     print(ul.get_text(' ', strip=True))
...
Feature title: Room Features
Non-Smoking Room Connecting Rooms Available Private Terrace Sea View Room Suites Available Private Balcony Bay View Room Honeymoon Suite Starwood Preferred Guest Room Room with Sitting Area
Feature title: Internet Access
Wired High Speed Internet Access in All Guest Rooms 25 USD per day
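If you would rather have one entry per <li> than one long string per heading, a sketch along these lines may help. The sample markup is a shortened stand-in for the real page, the trailing "Parking" heading is hypothetical and only there to show the None check, and features is a name chosen here.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for div.RightColumn; "Parking" has no list on purpose.
sample = '''<div class="RightColumn">
<h3>Room Features</h3>
<ul><li>Non-Smoking Room</li><li>Sea View Room</li></ul>
<h3>Internet Access</h3>
<ul>
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
</ul>
<h3>Parking</h3>
</div>'''

soup = BeautifulSoup(sample, "html.parser")

features = {}
for header in soup.select("div.RightColumn h3"):
    ul = header.find_next_sibling("ul")
    if ul is None:
        # find_next_sibling() returns None when no <ul> follows the heading.
        continue
    features[header.get_text(strip=True)] = [
        li.get_text(" ", strip=True) for li in ul.find_all("li")
    ]

for title, items in features.items():
    print(title, items)
```

Note that find_next_sibling("ul") picks the next <ul> at the same level even if another heading sits in between, so a heading without its own list in the middle of the page would need extra handling.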