Unable to scrape the text from a certain LI element

I am scraping this URL.
I have to scrape the main content of the page like Room Features and Internet Access
Here is my code:
for h3s in Column:  # Suppose this is div.RightColumn
    for index, test in enumerate(h3s.select("h3")):
        print("Feature title: " + str(test.text))
        for v in h3s.select("ul")[index]:
            print(v.string.strip())
This code scrapes all the <li> elements, but when it reaches the Internet Access section
I get
AttributeError: 'NoneType' object has no attribute 'strip'
I think this happens because the <li> data under the Internet Access heading is wrapped in double quotes, like "Wired High Speed Internet Access..."
I have tried replacing print(v.string.strip()) with print(v), which prints <li>Wired High...</li>
I have also tried print(v.text), but that does not work either.
The relevant section looks like:
<h3>Internet Access</h3>
<ul>
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
</ul>

BeautifulSoup elements only have a .string value if that string is the only child of the element. Your <li> tag contains a <span> element as well as text, so .string is None.
Use the .text attribute instead to extract all strings as one:
print(v.text.strip())
or use the element.get_text() method:
print(v.get_text().strip())
which also takes a handy strip flag to remove extra whitespace:
print(v.get_text(' ', strip=True))
The first argument is the separator used to join the various strings together; I used a space here.
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <h3>Internet Access</h3>
... <ul>
... <li>Wired High Speed Internet Access in All Guest Rooms
... <span class="fee">
... 25 USD per day
... </span>
... </li>
... </ul>
... '''
>>> soup = BeautifulSoup(sample)
>>> soup.li
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
>>> soup.li.string
>>> soup.li.text
u'Wired High Speed Internet Access in All Guest Rooms\n \n 25 USD per day\n \n'
>>> soup.li.get_text(' ', strip=True)
u'Wired High Speed Internet Access in All Guest Rooms 25 USD per day'
Do make sure you call it on the element:
for index, test in enumerate(h3s.select("h3")):
    print("Feature title: ", test.text)
    ul = h3s.select("ul")[index]
    print(ul.get_text(' ', strip=True))
You could use the find_next_sibling() function here instead of indexing into a .select():
for header in h3s.select("h3"):
    print("Feature title: ", header.text)
    ul = header.find_next_sibling("ul")
    print(ul.get_text(' ', strip=True))
Demo:
>>> for header in h3s.select("h3"):
...     print("Feature title: ", header.text)
...     ul = header.find_next_sibling("ul")
...     print(ul.get_text(' ', strip=True))
...
Feature title: Room Features
Non-Smoking Room Connecting Rooms Available Private Terrace Sea View Room Suites Available Private Balcony Bay View Room Honeymoon Suite Starwood Preferred Guest Room Room with Sitting Area
Feature title: Internet Access
Wired High Speed Internet Access in All Guest Rooms 25 USD per day
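The same idea can be sketched as a small helper that collects each heading's list items into a dictionary. This is only an illustration: the sample markup, the collect_features name, and the extra heading without a list are assumptions, not part of the original page. It also guards against find_next_sibling() returning None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet in the question.
SAMPLE = """
<div class="RightColumn">
  <h3>Room Features</h3>
  <ul>
    <li>Non-Smoking Room</li>
    <li>Sea View Room</li>
  </ul>
  <h3>Internet Access</h3>
  <ul>
    <li>Wired High Speed Internet Access in All Guest Rooms
      <span class="fee">
        25 USD per day
      </span>
    </li>
  </ul>
  <h3>Heading without a list</h3>
</div>
"""


def collect_features(column):
    """Map each <h3> title to the cleaned text of the <li> items after it."""
    features = {}
    for header in column.select("h3"):
        title = header.get_text(strip=True)
        ul = header.find_next_sibling("ul")
        if ul is None:
            # No <ul> follows this heading, so record an empty list.
            features[title] = []
            continue
        # get_text(" ", strip=True) joins the <li> text and the <span> text.
        features[title] = [li.get_text(" ", strip=True) for li in ul.find_all("li")]
    return features


soup = BeautifulSoup(SAMPLE, "html.parser")
features = collect_features(soup.select_one("div.RightColumn"))
for title, items in features.items():
    print(title, items)
```

Note that find_next_sibling("ul") returns the next <ul> sibling even if another heading sits in between, so a heading without its own list that is followed by a later list would pick up that later list.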

Related

How to scrape last string of <p> tag element?

To start, Python is the first language I am learning.
I am scraping a website for rent prices across my city and I am using BeautifulSoup to get the price data, but I am unable to get the value of this tag.
Here is the tag:
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
Here is my code:
text = soup.find_all("div", {"class": "plan-group rent"})
for item in text:
    rent = item.find_all("p")
    for price in rent:
        print(price.string)
I also tried:
text = soup.find_all("div", {"class": "plan-group rent"})
for item in text:
    rent = item.find_all("p")
    for price in rent:
        items = price.find_all("strong")
        for strong in items:
            print(strong.string)
and that prints "Monthly Rent:", but I don't understand why I can't get the actual price. The code above shows me that the "Monthly Rent:" label is in the strong tag, which I took to mean that the p tag itself only contains the price, which is what I want.

As mentioned by @kyrony, there are two children in your <p>. Because you select the <strong>, you only get one of the texts.
You could use different approaches, for example stripped_strings:
list(soup.p.stripped_strings)[-1]
or contents
soup.p.contents[-1]
or with recursive argument
soup.p.find(text=True, recursive=False)
Example
from bs4 import BeautifulSoup
html = '''<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>'''
soup = BeautifulSoup(html)
soup.p.contents[-1]
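As a rough side-by-side check, the three approaches can be run against the same tag. This sketch uses string=True, the newer spelling of the text=True argument (Beautiful Soup 4.4.0 and later), and the final conversion of the price to an integer is an added assumption about what the asker ultimately wants.

```python
from bs4 import BeautifulSoup

html = '<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>'
soup = BeautifulSoup(html, "html.parser")

# 1. All strings under <p>, whitespace-stripped; the price is the last one.
last_stripped = list(soup.p.stripped_strings)[-1]

# 2. The last direct child of <p>, which here is the bare text node.
last_child = soup.p.contents[-1]

# 3. The first text node that is a direct child of <p>
#    (the text inside <strong> is skipped because recursive=False).
direct_text = soup.p.find(string=True, recursive=False)

print(last_stripped, last_child, direct_text, sep=" | ")

# Assumption: the rent is wanted as a number, so drop " +" and the comma.
rent = int(last_stripped.rstrip(" +").replace(",", ""))
print(rent)
```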

Technically your content has two children
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
A strong tag
<strong class="hidden show-mobile-inline">Monthly Rent: </strong>
and a string
2,450 +
The .string attribute in Beautiful Soup only returns a value when the tag has a single child, so here it returns None. To get the second string, you can use the stripped_strings generator.
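The rule for .string can be illustrated with a small sketch. The three paragraphs and their id values below are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div>"
    "<p id='one'>only text</p>"
    "<p id='nested'><b>wrapped</b></p>"
    "<p id='mixed'><b>label: </b>value</p>"
    "</div>",
    "html.parser",
)

one = soup.find(id="one")        # a single text child
nested = soup.find(id="nested")  # a single tag child that itself has a .string
mixed = soup.find(id="mixed")    # two children: a tag and a text node

print(one.string)     # only text
print(nested.string)  # wrapped
print(mixed.string)   # None

# .strings yields every text node under the tag, in document order.
mixed_strings = list(mixed.strings)
print(mixed_strings)
```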

How to get the tag name of html using Python Beautiful Soup?

header = head.find_all('span')
[<span itemprop="name">Raj</span>, <span itemprop="street">24 Omni Street</span>, <span itemprop="address">Ohio</span>, <span itemprop="Region">US</span>, <span itemprop="postal">40232</span>, <span class="number">334646344</span>]
print (header[0].tag)
print(header[0].text)
####output
None
Raj
...
####Expected output
Name
Raj
...
I am not able to extract the itemprop value of each span; I get None instead. Am I doing something wrong?
Thanks,
Raj

Yes, a bs4.element.Tag object does not have a tag attribute, as it is itself a Tag. Beautiful Soup treats header[0].tag as a search for a child element named tag, finds none, and returns None. From the docs:
You can access a tag’s attributes by treating the tag like a dictionary.
So you've got the list of all the span tags; now just iterate over the list and get the attribute you want (i.e. 'itemprop'):
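spans = head.find_all('span')
for span in spans:
    try:
        # The attribute value is already a string, so no decode() is needed.
        print(span['itemprop'].title() + ': ' + span.text)
    except KeyError:
        # Skip spans that have no itemprop attribute.
        continue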
output:
Name: Raj
Street: 24 Omni Street
Address: Ohio
Region: US
Postal: 40232
Format the output or store the data as needed
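For reference, here is a minimal sketch of the attribute lookups involved, using a shortened, hypothetical copy of the markup from the question. The tag name itself is available as .name, and .get() returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Shortened copy of the spans from the question, wrapped in a <div>.
html = (
    '<div><span itemprop="name">Raj</span>'
    '<span itemprop="street">24 Omni Street</span>'
    '<span class="number">334646344</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.span
print(first.name)   # the tag name: span
print(first.attrs)  # the attributes as a dict
print(first.tag)    # None: treated as a search for a child <tag> element

rows = []
for span in soup.find_all("span"):
    label = span.get("itemprop")  # None when the attribute is absent
    if label is None:
        continue
    rows.append((span.name, label.title(), span.get_text(strip=True)))

for name, label, value in rows:
    print(label + ": " + value)
```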

Parsing IMDB with BeautifulSoup

I've scraped the following markup from IMDB's mobile site using BeautifulSoup, with Python 2.7.
I want to create a separate object for the episode number '1', title 'Winter is Coming', and IMDB score '8.9'. I can't figure out how to split apart the episode number and the title.
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>

You can use find to locate the span with the class text-large.
Once you have that span, you can use next to grab the text node that follows its opening tag, which contains the episode number, and find to locate the strong tag containing the title:
html = """
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
span = soup.find('span', class_='text-large')
ep = str(span.next).strip()
title = str(span.find('strong').text).strip()
print(ep)
print(title)
> 1.
> Winter Is Coming

Once you have each a.btn-full element, you can use the span classes to get the tags you want. The strong tag is a child of the span with the text-large class, so you just need to call .strong.text on that Tag. For the span with the CSS classes mobile-sprite tiny-star, you need to find the next strong tag, as it is a sibling of the span, not a child:
h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
title = soup.select_one("span.text-large").strong.text.strip()
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(title, score)
Which gives you:
(u'Winter Is Coming', u'8.9')
If you really want to get the episode the simplest way is to split the text once:
soup = BeautifulSoup(h)
ep, title = soup.select_one("span.text-large").text.split(None, 1)
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(ep, title.strip(), score)
Which will give you:
(u'1.', u'Winter Is Coming', u'8.9')
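If separate typed values are wanted, the split above can be wrapped in a small function. This is a sketch only: the parse_episode name and the dictionary layout are assumptions, and it expects every a.btn-full link to follow the structure shown in the question.

```python
from bs4 import BeautifulSoup

h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""


def parse_episode(link):
    """Return the episode number, title and score of one a.btn-full link."""
    # Split once on whitespace: "1." goes to ep, the rest is the title.
    ep, title = link.select_one("span.text-large").get_text().split(None, 1)
    # The score lives in the <strong> that follows the star span.
    star = link.select_one("span.mobile-sprite.tiny-star")
    score = star.find_next("strong").get_text(strip=True)
    return {
        "episode": int(ep.rstrip(".")),
        "title": title.strip(),
        "score": float(score),
    }


soup = BeautifulSoup(h, "html.parser")
episodes = [parse_episode(a) for a in soup.select("a.btn-full")]
print(episodes)
```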

Another option is to fetch the page with requests and search the HTML with regular expressions:
import re
import requests
frame = 'http://www.imdb.com/title/tt1480055?ref_=m_ttep_ep_ep1'
f = requests.get(frame)
helpme = f.text
result = re.findall('itemprop="name" class="">(.*?) ', helpme)
result2 = re.findall('"ratingCount">(.*?)</span>', helpme)
result3 = re.findall('"ratingValue">(.*?)</span>', helpme)
print result[0].encode('utf-8')
print result2[0]
print result3[0]
output:
Winter Is Coming
24,474
9.0

Removing particular content from parsed results using BeautifulSoup

def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc
This is the code which gives me text from this html
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
Read all announcements in Prestige Estate </p><p> </p>
</div>
This result is fine for me. I just want to exclude the text
Read all announcements in Prestige Estate
from the result (desc in my script) when it is present, and leave the result unchanged when it is not. How can I do this?

You can use extract() to remove unnecessary tags from the find() result:
descItem = soup.find('div', attrs={'class': 'op_gd14 FL'})  # get the DIV
for s in descItem('a'):  # remove <a> tags
    s.extract()
return descItem.get_text()  # return the text
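A self-contained sketch of this approach follows. It assumes that the "Read all announcements" text sits inside an <a> tag, as the answer implies; the href and the shortened sentence in the sample are made up, and the function takes an HTML string rather than a URL so that it runs offline. It uses decompose(), which removes the tag like extract() but does not return it.

```python
from bs4 import BeautifulSoup

# Assumption: the link text is wrapped in an <a> tag; the href is hypothetical.
html = """
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE.<br><br>
<a href="/announcements">Read all announcements in Prestige Estate</a> </p><p> </p>
</div>
"""


def get_description(page_html):
    """Return the description text without any link text, or None if absent."""
    soup = BeautifulSoup(page_html, "html.parser")
    div = soup.find("div", attrs={"class": "op_gd14 FL"})
    if div is None:
        return None
    # find_all returns an empty list when there are no <a> tags,
    # so the loop is simply skipped in that case.
    for link in div.find_all("a"):
        link.decompose()
    return div.get_text().strip()


desc = get_description(html)
print(desc)
```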

Just change the last line and import the re module:
...
return re.sub(r'<a(.*)</a>','',desc)
Output:
'<div class="op_gd14 FL">\n <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> \n </p><p>

Python, BeautifulSoup - <div> text and <img> attributes in correct order

I have a short piece of HTML that I would like to run through using BeautifulSoup. I've got basic navigation down, but this one has me stumped.
Here's an example piece of HTML (totally made it up):
<div class="textbox">
Buying this item will cost you
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
silver credits and
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
golden credits
</div>
Using the 'alt' attributes of the img tags I would like to see the following result:
Buying this item will cost you 1 silver credits and 1 golden credits
I have no idea how to loop through the div tag sequentially. I can extract all the text contained in the div tag like this:
html = BeautifulSoup(string)
print(html.get_text())
but that gives me a result like this:
Buying this item will cost you silver credits and golden credits
Likewise, I can get the values of the alt-attributes from the img-tags by doing this:
html = BeautifulSoup(string).img
print html['alt']
But of course this only gives me the attribute value.
How can I iterate through all these elements in the correct order? Is it possible to read the text in the div element and the attributes of the img elements in consecutive order?

You can loop through all children of a tag, including text; test for their type to see if they are Tag or NavigableString objects:
from bs4 import Tag
result = []
for child in html.find('div', class_='textbox').children:
    if isinstance(child, Tag):
        result.append(child.get('alt', ''))
    else:
        result.append(child.strip())
print(' '.join(result))
Demo:
>>> from bs4 import BeautifulSoup, Tag
>>> sample = '''\
... <div class="textbox">
... Buying this item will cost you
... <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
... silver credits and
... <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
... golden credits
... </div>
... '''
>>> html = BeautifulSoup(sample)
>>> result = []
>>> for child in html.find('div', class_='textbox').children:
...     if isinstance(child, Tag):
...         result.append(child.get('alt', ''))
...     else:
...         result.append(child.strip())
...
>>> print(' '.join(result))
Buying this item will cost you 1 silver credits and 1 golden credits
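The loop above only looks at direct children of the div. If the text may also be nested inside other tags, one possible variation is to walk .descendants instead. In this sketch the text_with_alt name and the sample markup (including the <b> tag and the second alt value) are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_with_alt(element):
    """Join all text under element, replacing each <img> with its alt text."""
    pieces = []
    for node in element.descendants:
        if isinstance(node, Tag):
            # Only <img> tags contribute directly; the text of other tags
            # is picked up when their string descendants are visited.
            if node.name == "img":
                pieces.append(node.get("alt", ""))
        elif isinstance(node, NavigableString):
            pieces.append(node.strip())
    # Drop empty pieces such as whitespace-only text nodes.
    return " ".join(piece for piece in pieces if piece)


sample = (
    '<div class="textbox">Buying this item will cost you '
    '<img alt="1" src="/1.jpg"/> <b>silver</b> credits and '
    '<img alt="2" src="/2.jpg"/> golden credits</div>'
)
soup = BeautifulSoup(sample, "html.parser")
sentence = text_with_alt(soup.find("div", class_="textbox"))
print(sentence)
```

HTML comments are also NavigableString objects in Beautiful Soup, so this sketch would include their text; filter them out if the markup contains any.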

This can also be done with a single XPath query:
//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt
Unfortunately, BeautifulSoup doesn't support XPath, but lxml does:
import lxml.html
root = lxml.html.fromstring("""
<div class="textbox">
Buying this item will cost you
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
silver credits and
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
golden credits
</div>
""")
pieces = root.xpath('//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt')
print(' '.join(map(str.strip, pieces)))
