Parse complex <li> tag using beautifulSoup - python

I have a webpage with following code:
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
I need to parse the output such that the result will be extracting words like: Thalassery, Tellicherry, Thanjavur, Tanjore, Thane, Tannah, Thoothukudi, Tuticorin
Can anyone please help with this

You can use .findAll() to get all the li elements and use find() 'a' and 'i' tag
for item in soup.findAll('li'):
print(item.find('a').text,item.find('i').text)
>>>
Thalassery Tellicherry
Thanjavur Tanjore
Thane Tannah
Thoothukudi Tuticorin

Try simplified_scrapy's solution, its fault tolerance
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''
doc = SimplifiedDoc(html)
lis = doc.lis
print ([(li.a.text,li.i.text if li.i else '') for li in lis])
Result:
[('Thalassery', 'Tellicherry'), ('Thanjavur', 'Tanjore'), ('Thane', 'Tannah'), ('Thoothukudi', 'Tuticorin')]

Related

How to get text of selected node using beautifulsoup

Using BeautifulSoup for the first time and not able to get the idea about how I can extract the text from some specific node. Here is my code
html:
...
<p class="dsm">...</p>
<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>
...
python:
current_page = urlopen(url)
current_soup = BeautifulSoup(current_page, 'html.parser')
derivative_list = current_soup.select('p.dsm + ul.also li')
for li in derivative_list:
print(li)
output:
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
Its outputting the correct list items, but what I want to get is text values of i.ab and span.at, something like
desired output:
abdrea, groups
shokdia, techs
After getting a list of all the <li> tags, simply iterate over them and find the texts of the <i class="ab"> and <span class="at"> tags individually.
for li in soup.select('p.dsm + ul.also li'):
print(li.i.text, li.span.text)
# abdrea groups
# shokdia techs
If there are other <i> and <span> tags inside the <li> tags, you can use find() on the li variable.
for li in soup.select('p.dsm + ul.also li'):
print(li.find('i', class_='ab').text, li.find('span', class_='at').text)
The Exact answer you looking for:
data = """<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>"""
from bs4 import BeautifulSoup
page_soup = BeautifulSoup(data, "html.parser")
i_data, span_data= zip([x.text for x in page_soup.find_all("i")], [y.text for y in page_soup.find_all("span")])
print(i_data )
print(span_data)
output:
(u'abdrea', u'groups')
(u'shokdia', u'techs')

How to extract tuples using findall?

I'm trying to extract tuples from an url and I've managed to extract string text and tuples using the re.search(pattern_str, text_str). However, I got stuck when I tried to extract a list of tuples using re.findall(pattern_str, text_str).
The text looks like:
<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>
... # repeating
...
...
and I'm using the following pattern & code to extract the tuples:
text_above = "..." # this is the text above
pat_str = '<a href="(\d+)">\n(.+)\n<span class'
pat = re.compile(pat_str)
# following line is supposed to return the numbers from the 2nd line
# and the string from the 3rd line for each repeating sequence
list_of_tuples = re.findall(pat, text_above)
for t in list_of tuples:
# supposed to print "11111 -> blah blah 111"
print(t[0], '->', t[1])
Maybe I'm trying something weird & impossible, maybe its better to extract the data using primitive string manipulations... But in case there exists a solution?
Your regex does not take into account the whitespace (indentation) between \n and <span. (And neither the whitespace at the start of the line you want to capture, but that's not as much of a problem.) To fix it, you could add some \s*:
pat_str = '<a href="(\d+)">\n\s*(.+)\n\s*<span class'
As suggested in the comments, use a html parser like BeautifulSoup:
from bs4 import BeautifulSoup
h = """<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>"""
soup = BeautifulSoup(h)
You can get the href and the previous_sibling to the span:
print([(a["href"].strip(), a.span.previous_sibling.strip()) for a in soup.find_all("a")])
[('11111', u'some text 111'), ('22222', u'some text 222'), ('33333', u'some text 333')]
Or the href and the first content from the anchor:
print([(a["href"].strip(), a.contents[0].strip()) for a in soup.find_all("a")])
Or with .find(text=True) to only get the tag text and not from the children.
[(a["href"].strip(), a.find(text=True).strip()) for a in soup.find_all("a")]
Also if you just want the anchors inside the list tags, you can specifically parse those:
[(a["href"].strip(), a.contents[0].strip()) for a in soup.select("li a")]

Parsing IMDB with BeautifulSoup

I've stripped the following code from IMDB's mobile site using BeautifulSoup, with Python 2.7.
I want to create a separate object for the episode number '1', title 'Winter is Coming', and IMDB score '8.9'. Can't seem to figure out how to split apart the episode number and the title.
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
You can use find to locate the span with the class text-large to the specific element you need.
Once you have your desired span, you can use next to grab the next line, containing the episode number and find to locate the strong containing the title
html = """
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
span = soup.find('span', attrs={'text-large'})
ep = str(span.next).strip()
title = str(span.find('strong').text).strip()
print ep
print title
> 1.
> Winter Is Coming
Once you have each a class="btn-full", you can use the span classes to get the tags you want, the strong tag is a child of the span with the text-large class so you just need to call .strong.text on the Tag, for the span with the css class mobile-sprite tiny-star, you need to find the next strong tag as it is a sibling of the span not a child:
h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
title = soup.select_one("span.text-large").strong.text.strip()
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(title, score)
Which gives you:
(u'Winter Is Coming', u'8.9')
If you really want to get the episode the simplest way is to split the text once:
soup = BeautifulSoup(h)
ep, title = soup.select_one("span.text-large").text.split(None, 1)
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(ep, title.strip(), score)
Which will give you:
(u'1.', u'Winter Is Coming', u'8.9')
Using url html scraping with reguest and regular expression search.
import os, sys, requests
frame = ('http://www.imdb.com/title/tt1480055?ref_=m_ttep_ep_ep1')
f = requests.get(frame)
helpme = f.text
import re
result = re.findall('itemprop="name" class="">(.*?) ', helpme)
result2 = re.findall('"ratingCount">(.*?)</span>', helpme)
result3 = re.findall('"ratingValue">(.*?)</span>', helpme)
print result[0].encode('utf-8')
print result2[0]
print result3[0]
output:
Winter Is Coming
24,474
9.0

How to find the number of text objects in BeautifulSoup object

I'm web scraping a wikipedia page using BeautifulSoup in python and I was wondering whether there is anyone to know the number of text objects in an HTML object. For example the following code gets me the following HTML:
soup.find_all(class_ = 'toctext')
<span class="toctext">Actors and actresses</span>, <span class="toctext">Archaeologists and anthropologists</span>, <span class="toctext">Architects</span>, <span class="toctext">Artists</span>, <span class="toctext">Broadcasters</span>, <span class="toctext">Businessmen</span>, <span class="toctext">Chefs</span>, <span class="toctext">Clergy</span>, <span class="toctext">Criminals</span>, <span class="toctext">Conspirators</span>, <span class="toctext">Economists</span>, <span class="toctext">Engineers</span>, <span class="toctext">Explorers</span>, <span class="toctext">Filmmakers</span>, <span class="toctext">Historians</span>, <span class="toctext">Humourists</span>, <span class="toctext">Inventors / engineers</span>, <span class="toctext">Journalists / newsreaders</span>, <span class="toctext">Military: soldiers/sailors/airmen</span>, <span class="toctext">Monarchs</span>, <span class="toctext">Musicians</span>, <span class="toctext">Philosophers</span>, <span class="toctext">Photographers</span>, <span class="toctext">Politicians</span>, <span class="toctext">Scientists</span>, <span class="toctext">Sportsmen and sportswomen</span>, <span class="toctext">Writers</span>, <span class="toctext">Other notables</span>, <span class="toctext">English expatriates</span>, <span class="toctext">References</span>, <span class="toctext">See also</span>
I can get the first text object by running the following:
soup.find_all(class_ = 'toctext')[0].text
My goal here is to get and store all of the text objects in a list. I'm doing this by using a for loop, however I don't know how many text objects there are in the html block. Naturally I would hit an error if I get to an index that doesn't exist Is there an alternative?
You can use a for...in loop.
In [13]: [t.text for t in soup.find_all(class_ = 'toctext')]
Out[13]:
['Actors and actresses',
'Archaeologists and anthropologists',
'Architects',
'Artists',
'Broadcasters',
'Businessmen',
'Chefs',
'Clergy',
'Criminals',
'Conspirators',
'Economists',
'Engineers',
'Explorers',
'Filmmakers',
'Historians',
'Humourists',
'Inventors / engineers',
'Journalists / newsreaders',
'Military: soldiers/sailors/airmen',
'Monarchs',
'Musicians',
'Philosophers',
'Photographers',
'Politicians',
'Scientists',
'Sportsmen and sportswomen',
'Writers',
'Other notables',
'English expatriates',
'References',
'See also']
Try the following code:
for txt in soup.find_all(class_ = 'toctext'):
print(txt.text)

How to print the text inside of a child tag and the href of a grandchild element with a single BeautifulSoup Object?

I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
#prints 123
#prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
$1.23
</span>
</div>
</body>
"""
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select() - this method allows to use CSS Selectors for searching over the page. Also note the use of class_ argument - underscore is important since class is a reversed keyword in Python.

Categories