Beautiful Soup / Regular Expressions: Extract a portion of text from NavigableString - python

I'm really new to learning python so this could be really obvious, but I have extracted a NavigableString from BeautifulSoup and I need to find data in the string. However, it's not as easy as some of the examples I've seen online.
My end goal is to create a dictionary that looks something like this:
dict = {'Fandom':'Undertale (Video Game)', 'Works':15341}
Here are two examples of the strings:
<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li>
<a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
I've already succeeded in extracting the fandom from the string, but now I need the works count in parentheses. How would I use Beautiful Soup and/or regular expressions to do this?
I also need to do error handling because while a fandom will always be displayed, it may not have a work count next to it.
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>
Here are the relevant pieces of code:
for each_f in cate:
    #print(each_f)
    result = each_f.find('a')
    if result is not None:  # find() returns None when nothing matches, not -1
        #here is where I grab the Fandom vals
        fandom_name = result.contents
        #print(result.contents)
NOTE: I know I'm missing the code to append to the dictionary; I haven't made it that far yet. I'm just trying to get the values to print to the screen.

use dict.fromkeys(('Fandom', 'Works')) to get:
In [17]: dict.fromkeys(('Fandom', 'Works'))
Out[17]: {'Fandom': None, 'Works': None}
use zip to combine the keys with the strings in the li tag; zip only pairs up to the shortest sequence, so a missing work count simply leaves 'Works' unset:
zip(('Fandom', 'Works'),li.stripped_strings)
[('Fandom', 'Undertale (Video Game)'), ('Works', '(15341)')]
[('Fandom', 'Sherlock Holmes & Related Fandoms'), ('Works', '(101015)')]
[('Fandom', 'Composer - Fandom')]
then we update the dict with those data:
In [20]: for li in soup.find_all('li'):
    ...:     d = dict.fromkeys(('Fandom', 'Works'))
    ...:     out = zip(('Fandom', 'Works'), li.stripped_strings)
    ...:     d.update(out)
    ...:     print(d)
Output:
{'Works': '(15341)', 'Fandom': 'Undertale (Video Game)'}
{'Works': '(101015)', 'Fandom': 'Sherlock Holmes & Related Fandoms'}
{'Works': None, 'Fandom': 'Composer - Fandom'}
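If you also want the count as a plain integer, as in the dictionary from the question, a minimal follow-up sketch (building on the same soup and loop as above) can strip the parentheses and convert:
for li in soup.find_all('li'):
    d = dict.fromkeys(('Fandom', 'Works'))
    d.update(zip(('Fandom', 'Works'), li.stripped_strings))
    if d['Works'] is not None:
        d['Works'] = int(d['Works'].strip('()'))  # '(15341)' -> 15341
    print(d)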

You can use stripped_strings and unpack the values to get your blocks of text. You can store the results in a dict so that you can use them later.
Example:
from bs4 import BeautifulSoup

example = """<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li><a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>"""

soup = BeautifulSoup(example, "html.parser")
Fandom = {"Fandom": []}
for li in soup.find_all("li"):
    try:
        # unpacking raises ValueError when the li has no work count
        fandom, count = li.stripped_strings
        Fandom["Fandom"].append({fandom.strip(): count[1:-1]})
    except ValueError:
        fandom = li.text.strip()
        Fandom["Fandom"].append({fandom: 0})
print(Fandom)
This outputs:
{'Fandom': [{'Undertale (Video Game)': '15341'}, {'Sherlock Holmes & Related Fandoms': '101015'}, {'Composer - Fandom': 0}]}
The try/except will catch any unpacking that doesn't contain exactly two values: your fandom title and the work count.

Related

Ignore a link in parenthesis while trying to extract other links

I'm trying to extract links from a p block but I'd like to ignore anything within parentheses. For example,
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
I would like to select only links after what is in parentheses, so in the above case just the link_text2 link. I currently grab the links using this...
ps = content.find_all('p', recursive=False)
for p in ps:
    links = p.find_all('a', recursive=False)
I think I have to use a regex but I'm not sure how to incorporate it so that it ignores any links in parentheses. This regex works to isolate anything in parentheses: \(.*?\).
Anyone able to help?
You can analyze the elements in BeautifulSoup.contents to find all a objects. The latter can then be filtered to ensure that the surrounding content does not create a ( and ) pair:
from bs4 import BeautifulSoup as soup

def is_valid(ind: int, content: list, flag=False) -> bool:
    # a neighbor is safe if it is a tag, or a string that lacks the
    # relevant parenthesis character ('(' before the link, ')' after it)
    return not isinstance(content[ind], str) or (['(', ')'][flag] not in content[ind])

s = """
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
"""
d = soup(s, 'html.parser').p.contents
l = [[i, a] for i, a in enumerate(d) if getattr(a, 'name', None) == 'a']
new_l = [a for i, a in l if (not i or i == len(d)-1) or (is_valid(i-1, d) and is_valid(i+1, d, True))]
Output:
[link_text2]
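Since the question already has a working regex for parenthesized spans, another option (a rough sketch, assuming the parentheses never nest) is to delete those spans from the markup before parsing:
import re
from bs4 import BeautifulSoup

s = """
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
"""
# drop every (...) span, links included, then parse what remains
cleaned = re.sub(r'\(.*?\)', '', s)
print(BeautifulSoup(cleaned, 'html.parser').find_all('a'))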

How to extract text which is surrounded by two <span> element?

I am in the middle of scraping data from a website, but I encountered the following code:
code = """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372 
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>"""
I need to extract only "₹ 7,372".
I have tried the following:
code.text
but it results in
'\n\n₹ 7,372\xa0\r\n \n–\n\n'
code.text.strip()
but it results in
'₹ 7,372\xa0\r\n \n–'
Is there any method?
Please let me know, so that I can complete my project.
Ok, I managed to clean the data that you need. This way is a little ugly, but it works =)
from bs4 import BeautifulSoup as BS

html = """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> """
soup = BS(html, 'html.parser')
li = soup.find('li').text
for j in range(3):
    for i in ['\n', ' ', '–', '\xa0', '\r', '\x20', '\x0a', '\x09', '\x0c', '\x0d']:
        li = li.strip(i)
print(li)
Output:
₹ 7,372
In the loop list I included all the ASCII whitespace characters I know of, plus the symbols you were getting.
The loop runs 3 times because the value doesn't come out clean on the first pass; you can check it at every iteration in the variable explorer.
Optionally, you can also try to figure out which precise character produces all the pseudo-spaces between the <span> tags.
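As a side note, str.strip() treats its argument as a set of characters and removes any combination of them from both ends, so a single call should do the same job as the nested loop; a minimal sketch, assuming the soup from the snippet above:
li = soup.find('li').text.strip('\n\r\t\x0c \xa0–')
print(li)  # ₹ 7,372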
from bs4 import BeautifulSoup as bs
code = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''
soup = bs(code,'html.parser')
w = soup.find_all('li')
l = []
for item in w:
    l.append(item)
words = str(l)
t = words.split('\n')
print(t[2][7:])
₹ 7,372
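Alternatively, since the amount is a direct text child of the li (not inside either span), you could pick out the direct text nodes and keep the non-blank one; a hedged sketch using the same soup:
texts = soup.find('li').find_all(text=True, recursive=False)
print([t.strip() for t in texts if t.strip()][0])  # ₹ 7,372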

Parsing HTML tags using python

I am new to Python. I want to store the text of each HTML tag as a separate item in a list.
from bs4 import BeautifulSoup
text = """
<body>
<div class="product">
<div class="x">orange</div>
<div class="x">apple</div>
<p> This is text </p>
</div>
</body>"""
soup = BeautifulSoup(text, "html.parser")
y = []
for i in soup.find_all("div", class_="product"):
    y.append(i.get_text().encode("utf-8").strip())
With the above code, the length of y is 1 and all the text is stored in one list item. However, I need to parse using "div product" and store the text inside each child HTML tag as a different item in the list.
So the y will be :
y =['orange', 'apple', 'This is text']
instead of :
y=['orange\napple\n This is text']
Don't use text if all you want is the directly contained string, and ask only for elements contained in the div.product tag:
for elem in soup.select("div.product *"):
    y.append(elem.string.strip().encode('utf8'))
Demo:
>>> y = []
>>> for elem in soup.select("div.product *"):
...     y.append(elem.string.strip().encode('utf8'))
...
>>> y
['orange', 'apple', 'This is text']
soup.find_all("div",class_="product")
gives all the div tags with class product so u have a list now. So when u run the for loop it iterates only once and gives the complete text in div tag.
So to get data of every child use something like this
for child in soup.find_all("div", class_="product").findChildren():
y.append(child.string.strip().encode('utf8'))
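For what it's worth, a shorter route to the same list is the stripped_strings generator (a sketch, assuming the same soup as above; note it yields plain text rather than encoded bytes):
# stripped_strings yields each contained string, whitespace-stripped, in document order
y = list(soup.find("div", class_="product").stripped_strings)
print(y)  # ['orange', 'apple', 'This is text']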

Python: Can we Loop inside a string?

For example, I have a string containing an HTML ordered list. Inside that ordered list, I want to write n list items. How can I accomplish the task of adding the list items to this string?
Here is the example code:
html = """
<ol>
<li>
<!--Something-->
</li>
... <!--n lists-->
{} #str().format()
<li>
<!--Something-->
</li>
</ol>
"""
for li in html_lists:  # where li is a '<li>...</li>' string inside the Python list
    html.format(li)
As far as I know, strings are immutable and .format() will add the <li> at {}. Hence this won't work for more than one <li>.
Like you said, strings are immutable, so having html.format(li) on a line by itself won't do anything; you need html = html.format(li), because format() doesn't modify html in place, it returns a new string.
As for using a loop with str.format(), you should be able to use the following assuming each element in html_lists is a string that contains a single <li> entry:
html = html.format('\n'.join(html_lists))
This works because '\n'.join(html_lists) will construct a single string from your list of strings, which can then be passed to html.format() to replace the single {} with the content from every element in html_lists. Note that you could also use ''.join(html_lists); the newline is just there to make it more readable when html is displayed.
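A minimal runnable sketch of that join-then-format approach, with a simplified template and made-up list items:
html = """
<ol>
{}
</ol>
"""
html_lists = ['<li>first</li>', '<li>second</li>', '<li>third</li>']
html = html.format('\n'.join(html_lists))
print(html)  # each <li> lands on its own line inside the <ol>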
You could use lxml to build HTML:
import lxml.html as LH
import lxml.builder as builder
html_lists = 'one two three'.split()
E = builder.E
html = (
    E.ol(
        *([E.li('something')]
          + [E.li(item) for item in html_lists]
          + [E.li('else')])
    )
)
print(LH.tostring(html, pretty_print=True))
prints
<ol>
<li>something</li>
<li>one</li>
<li>two</li>
<li>three</li>
<li>else</li>
</ol>
Python is really good for processing text, so here's an example using it to do what you want:
import textwrap

def indent(amt, s):
    dent = amt * ' '
    return ''.join(map(lambda i: dent + i + '\n', s.split('\n')[:-1])).rstrip()

ordered_list_html = textwrap.dedent('''\
    <ol>
    {}
    </ol>
    ''')

# create some test data
html_lists = [textwrap.dedent('''\
    <li>
    list #{}
    </li>
    ''').format(n) for n in xrange(5)]

print ordered_list_html.format(indent(2, ''.join(html_lists)))
Output:
<ol>
  <li>
  list #0
  </li>
  <li>
  list #1
  </li>
  <li>
  list #2
  </li>
  <li>
  list #3
  </li>
  <li>
  list #4
  </li>
</ol>

Python, parsing html

Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module so that my script will work with minimal overhead. Today, I've been experimenting with parsing modules. I've come across BeautifulSoup... this is all great, but I don't understand it.
For educational purposes, I'd like to extract the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web crawler; I'm not trying to crawl the whole site, just extract the information from this page to learn how parsing modules work!)
Movie Title
Quality
Torrent Link
There are 22 of these items, and I wish for them to be stored in lists in order, i.e. item_1, item_2, and so on. Each list needs to contain these three items. For instance:
item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]
And then, to make matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the info needs to be strictly ordered. This is all good, but all I'm getting is either the entire source contained in each list item, or empty items! An example item divider is as follows:
<div class="browse-info">
<span class="info">
<h3>James Bond: Casino Royale (2006)</h3>
<p><b>Size:</b> 1018.26 MB</p>
<p><b>Quality:</b> 720p</p>
<p><b>Genre:</b> Action | Crime</p>
<p><b>IMDB Rating:</b> 7.9/10</p>
<span>
<p class="peers"><b>Peers:</b> 698</p>
<p class="peers"><b>Seeds:</b> 356</p>
</span>
</span>
<span class="links">
View Info<span></span>
Download<span></span>
</span>
</div>
Any ideas? Would someone please do me the honour of giving me an example of how to do this? I'm not sure Beautiful Soup accommodates all of my requirements! PS. Sorry for the poor English, it's not my first language.
from bs4 import BeautifulSoup
import urllib2
f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)
In [25]: for i in soup.findAll("div", {"class": "browse-info"}):
    ...:     name = i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text == "Quality:":
    ...:             quality = x.parent.text
    ...:     link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    ...:     print [name, quality, link]
    ...:
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...
or to get exactly the output you wanted:
In [26]: for i in soup.findAll("div", {"class": "browse-info"}):
    ...:     name = i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text == "Quality:":
    ...:             quality = x.parent.find(text=True, recursive=False).strip()
    ...:     link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    ...:     print [name, quality, link]
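If you'd rather collect the rows than print them (the ordered item_1, item_2 lists from the question), a small variation of the same loop can append each triple to a list; a minimal sketch, assuming the same soup:
items = []
for i in soup.findAll("div", {"class": "browse-info"}):
    name = i.find('a').text
    quality = None  # fallback in case a row has no Quality field
    for x in i.findAll('b'):
        if x.text == "Quality:":
            quality = x.parent.find(text=True, recursive=False).strip()
    link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    items.append([name, quality, link])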
As you requested, here is a simple example of a parser. As you can see, it uses lxml. With lxml you have two ways to work with the DOM tree: one is XPath and the other is CSS selectors.
I preferred XPath.
import lxml.html
import urllib

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    tr = []
    for el in main_div.getchildren():
        if el.xpath("descendant::a[contains(@name,'tn')]/text()"):
            category = el.xpath("descendant::a[contains(@name,'tn')]/text()")[0]
            main[category] = ''
            tr = []
        else:
            for element in el.getchildren():
                if '&#8212' in lxml.html.tostring(element):
                    tr.append(element)
            print category, tr

parse()
LXML official site
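For comparison, the CSS-selector route would look something like this for the same div lookup (a sketch; lxml's cssselect method requires the separate cssselect package to be installed):
# equivalent of doc.xpath("//div[@id='line']")[0]
main_div = doc.cssselect("div#line")[0]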
