How to extract html using beautifulsoup?

The HTML source was
html = """
<td>
<a href="/urlM5CLw" target="_blank">
<img alt="I" height="132" src="VZhAy" width="132"/>
</a>
<br/>
<cite title="mac-os-x-lion-icon-pack.en.softonic.com">
mac-os-x-lion-icon-pac...
</cite>
<br/>
<b>
Mac
</b>
OS X Lion Icon Pack's
<br/>
535 × 535 - 135k - png
</td>"""
My Python code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.find('td').renderContents()
With this code I get a string like:
<img alt="I" height="132" src="VZhAy" width="132"/><br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png
But I don't want the <a>...</a> element. I just need:
<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png

Try removing the <a> tag first, then fetch the contents as before:
>>> soup.find('a').extract()
>>> text = soup.find('td').renderContents()
>>> text
'<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 \xd7 535 - 135k - png'

You can use the Tag.decompose() method to remove the <a> tag and destroy its contents. You may also need to decode() the resulting byte string and replace every \n with ''.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
soup.a.decompose()
print(soup.td.renderContents().decode().replace('\n', ''))
yields:
<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com"> mac-os-x-lion-icon-pac... </cite><br/><b> Mac </b> OS X Lion Icon Pack's <br/> 535 × 535 - 135k - png
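The difference between the two removal methods in the answers above can be sketched as follows. This is a minimal example, assuming bs4 is installed and using the built-in html.parser. The shortened HTML string and the variable names are illustrative and not from the original question. decode_contents() is used here as the str-returning counterpart of renderContents().

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the <td> from the question.
html = '<td><a href="/urlM5CLw"><img src="VZhAy"/></a><br/><b>Mac</b> OS X</td>'

# extract() detaches the tag from the tree and returns it, so it can be reused.
soup = BeautifulSoup(html, "html.parser")
removed = soup.a.extract()
after_extract = soup.td.decode_contents()  # inner HTML as str, no bytes to decode

# decompose() removes the tag and destroys it; nothing useful is returned.
soup2 = BeautifulSoup(html, "html.parser")
soup2.a.decompose()
after_decompose = soup2.td.decode_contents()

print(removed["href"])
print(after_extract)
```

Either method leaves the same markup in the <td>. extract() is the one to pick if the removed <a> tag is still needed afterwards.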

Related

How to exclude inner tags with beautifulsoup

Hey, I'm currently trying to parse a website and I'm almost done, but there's a little problem. I want to exclude inner tags from this HTML:
<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
I tried using
...find("span", "moto-color5_5") but this returns
Text 1 Text 2
instead of only returning Text 1.
Any suggestions?
Sincerely :)

Excluding all inner tags would also exclude Text 1, because it sits inside the inner <strong> tag.
You can, however, look up the <strong> tag inside the span you already found:
html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
result = soup.find("span", "moto-color5_5").find('strong')
print(result.text) # Text 1
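If the goal is instead to drop one specific inner tag and keep everything else, a possible alternative is to remove the nested span before reading the text. This is only a sketch, assuming bs4 is installed and that modifying the parsed tree is acceptable. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span", class_="moto-color5_5")

# Remove every span nested inside the outer span (this mutates the tree).
for inner in outer.find_all("span"):
    inner.decompose()

# strip=True trims each text node and skips the whitespace-only ones.
text = outer.get_text(strip=True)
print(text)
```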

How to extract tuples using findall?

I'm trying to extract tuples from a URL. I've managed to extract string text and tuples using re.search(pattern_str, text_str), but I got stuck when I tried to extract a list of tuples using re.findall(pattern_str, text_str).
The text looks like:
<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>
... # repeating
...
...
and I'm using the following pattern & code to extract the tuples:
text_above = "..." # this is the text above
pat_str = '<a href="(\d+)">\n(.+)\n<span class'
pat = re.compile(pat_str)
# following line is supposed to return the numbers from the 2nd line
# and the string from the 3rd line for each repeating sequence
list_of_tuples = re.findall(pat, text_above)
for t in list_of_tuples:
    # supposed to print "11111 -> blah blah 111"
    print(t[0], '->', t[1])
Maybe I'm trying something weird and impossible, and maybe it's better to extract the data with primitive string manipulation. But is there a solution along these lines?

Your regex does not account for the whitespace (indentation) between \n and <span. (It also ignores the whitespace at the start of the line you want to capture, but that is less of a problem.) To fix it, add some \s* and use a raw string so the backslashes reach the regex engine unchanged:
pat_str = r'<a href="(\d+)">\n\s*(.+)\n\s*<span class'
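To see the corrected pattern at work, here is a small self-contained check using only the standard library. The sample text and its indentation are assumptions, since the original indentation was not preserved in the question as shown here.

```python
import re

# Hypothetical sample with indentation, mimicking the structure in the question.
text = """<li>
    <a href="11111">
        some text 111
        <span class="some-class">
            #11111
        </span>
    </a>
</li><li>
    <a href="22222">
        some text 222
        <span class="some-class">
            #22222
        </span>
    </a>
</li>"""

# \s* absorbs the indentation before the captured text and before <span.
pat = re.compile(r'<a href="(\d+)">\n\s*(.+)\n\s*<span class')

# With two capture groups, findall returns a list of 2-tuples.
pairs = pat.findall(text)
for href, label in pairs:
    print(href, '->', label)
```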

As suggested in the comments, use an HTML parser like BeautifulSoup:
from bs4 import BeautifulSoup
h = """<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>"""
soup = BeautifulSoup(h)
You can get the href and the span's previous_sibling (the text node just before the span):
print([(a["href"].strip(), a.span.previous_sibling.strip()) for a in soup.find_all("a")])
[('11111', u'some text 111'), ('22222', u'some text 222'), ('33333', u'some text 333')]
Or the href and the first content from the anchor:
print([(a["href"].strip(), a.contents[0].strip()) for a in soup.find_all("a")])
Or use .find(text=True) to get only the first text node inside the anchor (the text before the span) rather than the text of its children:
[(a["href"].strip(), a.find(text=True).strip()) for a in soup.find_all("a")]
Also, if you just want the anchors inside the list tags, you can select only those:
[(a["href"].strip(), a.contents[0].strip()) for a in soup.select("li a")]

Parsing IMDB with BeautifulSoup

I've stripped the following code from IMDB's mobile site using BeautifulSoup, with Python 2.7.
I want to create a separate object for the episode number '1', title 'Winter is Coming', and IMDB score '8.9'. I can't figure out how to split apart the episode number and the title.
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>

You can use find to locate the span with the class text-large.
Once you have that span, the element right after its opening tag is the text node containing the episode number, and find locates the strong tag containing the title:
html = """
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
span = soup.find('span', class_='text-large')
ep = span.next_element.strip()
title = span.find('strong').text.strip()
print(ep)
print(title)
> 1.
> Winter Is Coming

Once you have each <a class="btn-full"> tag, you can use the span classes to reach the tags you want. The strong tag holding the title is a child of the span with the text-large class, so you just need .strong.text on that span. The score is different: its strong tag is a sibling of the span with the classes mobile-sprite tiny-star, not a child, so you need to find the next strong tag after that span:
h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
title = soup.select_one("span.text-large").strong.text.strip()
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(title, score)
Which gives you:
(u'Winter Is Coming', u'8.9')
If you also want the episode number, the simplest way is to split the span's text once:
soup = BeautifulSoup(h)
ep, title = soup.select_one("span.text-large").text.split(None, 1)
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(ep, title.strip(), score)
Which will give you:
(u'1.', u'Winter Is Coming', u'8.9')
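The steps above can be combined into one helper that returns typed values for every episode link. This is a sketch under assumptions: bs4 is installed, each link has exactly the structure shown in the question, and the function name parse_episode is made up for illustration.

```python
from bs4 import BeautifulSoup

h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>"""


def parse_episode(a_tag):
    """Return (episode number, title, score) for one episode link (hypothetical helper)."""
    span = a_tag.select_one("span.text-large")
    # Split once on whitespace: "1." and then the rest, which is the title.
    number_text, title = span.get_text().split(None, 1)
    # The score's <strong> follows the star span as a sibling, not a child.
    score = a_tag.select_one("span.tiny-star").find_next("strong").get_text(strip=True)
    return int(number_text.rstrip(".")), title.strip(), float(score)


soup = BeautifulSoup(h, "html.parser")
episodes = [parse_episode(a) for a in soup.select("a.btn-full")]
print(episodes)
```

On a page with many episodes, the same list comprehension would yield one tuple per link, provided each link follows this layout.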

Scraping the page HTML with requests and a regular-expression search:
import re
import requests
frame = 'http://www.imdb.com/title/tt1480055?ref_=m_ttep_ep_ep1'
f = requests.get(frame)
helpme = f.text
result = re.findall('itemprop="name" class="">(.*?) ', helpme)
result2 = re.findall('"ratingCount">(.*?)</span>', helpme)
result3 = re.findall('"ratingValue">(.*?)</span>', helpme)
print(result[0])
print(result2[0])
print(result3[0])
Output:
Winter Is Coming
24,474
9.0

Unable to fetch <div> tag values in python

The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text'):
But in the output I get "None". Could you please help me resolve this issue?

An element with both a sub-element and string content can be read using stripped_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h)
for s in soup.select("div.search-page-text")[0].stripped_strings:
    print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes the string content of both the span and the div. But if you know that the div always starts with the span's text, you could get the interesting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]

If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt)
for div in soup.find_all("div", { "class" : "search-page-text" }):
    print ''.join(div.find_all(text=True, recursive=False)).strip()
    #print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For Python 3 the print line should be
    print(''.join(div.find_all(text=True, recursive=False)).strip())
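The join-and-strip approach can be wrapped in a small reusable function. The sketch below assumes bs4 4.4 or later, where the string= argument is the current name for text=. The helper name own_text is made up for this example.

```python
from bs4 import BeautifulSoup

txt = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""


def own_text(tag):
    """Return the text that belongs directly to tag, ignoring child elements."""
    # recursive=False limits the search to direct children of the tag.
    direct_strings = tag.find_all(string=True, recursive=False)
    return "".join(direct_strings).strip()


soup = BeautifulSoup(txt, "html.parser")
costs = [own_text(div) for div in soup.find_all("div", class_="search-page-text")]
print(costs)
```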

BeautifulSoup: get tag text behind another tag

How can I find a tag relative to another tag using BeautifulSoup? In this example I want to get, for instance, '0993 999 999', which is in a div right after another div containing label text such as 'Telefon:'.
I tried to get it using this:
print parsed.findAll('div',{'class':"dva" })[3].text
But it does not work properly. I think there must be a way to tell BeautifulSoup that the value comes right after the 'Telefon:' text, or some other way to do this.
<div class="kontakt">
<h2 class="section-head">Kontaktné údaje</h2>
<address itemprop="address" itemscope itemtype="http://schema.org/PostalAddress" >
<span itemprop="streetAddress" >SNP 12</span>, <span itemprop="postalCode" >904 01</span> <span itemprop="addressLocality" >Pezinok</span> </address>
<div class="jedna">Telefon:</div>
<div class="dva">013 / 688 27 78</div>
<div class="jedna">Mobil:</div>
<div class="dva">0993 999 999</div>
<div class="jedna">Fax:</div>
<div class="dva">033 / 690 97 94</div>
<div class="jedna">E-mail:</div>
<div class="dva"><br /></div></div>
EDIT: I tried this, but it does not work either.
tags = parsed.findAll('div',{'class':"jedna"})
for tag in tags:
    if tag.text=='Telefon:':
        print tag.next_siebling.string
Could you guys please give me a hint how to do that?
Thanks!

You can use find_next_sibling():
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
data = u"""html here"""
soup = BeautifulSoup(data)
print soup.find('div', text='Telefon:').find_next_sibling('div', class_='dva').text
print soup.find('div', text='Mobil:').find_next_sibling('div', class_='dva').text
Prints:
013 / 688 27 78
0993 999 999
FYI, you can extract the duplication and have a nice reusable function:
def get_field_value(soup, field):
    return soup.find('div', text=field+':').find_next_sibling('div', class_='dva').text
soup = BeautifulSoup(data)
print get_field_value(soup, 'Telefon') # prints 013 / 688 27 78
print get_field_value(soup, 'Mobil') # prints 0993 999 999
Hope that helps.
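When all label/value pairs are needed at once, the same find_next_sibling() idea can build a dictionary in one pass. This is a sketch, assuming bs4 is installed and that every label div with class jedna is followed by its value div with class dva. The HTML below is a trimmed copy of the question's markup with the closing tag after Fax: written out in full.

```python
from bs4 import BeautifulSoup

data = """<div class="kontakt">
<div class="jedna">Telefon:</div>
<div class="dva">013 / 688 27 78</div>
<div class="jedna">Mobil:</div>
<div class="dva">0993 999 999</div>
<div class="jedna">Fax:</div>
<div class="dva">033 / 690 97 94</div>
</div>"""

soup = BeautifulSoup(data, "html.parser")

fields = {}
for label in soup.find_all("div", class_="jedna"):
    # The value lives in the next sibling div that has class "dva".
    value = label.find_next_sibling("div", class_="dva")
    if value is not None:
        key = label.get_text(strip=True).rstrip(":")
        fields[key] = value.get_text(strip=True)

print(fields["Mobil"])
```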
