I'm trying to scrape a forum, but I can't deal with the comments, because users use emoticons, bold text, quotes of previous messages, and so on...
For example, here's one of the comments that I have a problem with:
<div class="content">
<blockquote>
<div>
<cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div>
</blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>
I've been searching for help for the last 4 days and couldn't find anything, so I decided to ask here.
This is the code that I'm using:
from urllib2 import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)
    msg = soup.find("div", {"class": "content"})
    # msg now holds the whole message, exactly as I wrote it above
    print msg
    # Here I get:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print item
I'm looking for a way to handle messages in a general way, no matter how they are structured. Sometimes users put emoticons in the middle of the text; I need to remove them and get the whole message (in that situation, bs4 will put each part of the message (first part, emoticon, second part) in a different item).
Thanks in advance!
Use decompose: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
decompose() removes a tag you don't want from the tree and destroys it. In your case:
soup.blockquote.decompose()
or all unwanted tags:
for tag in ['blockquote', 'img', ... ]:
    soup.find(tag).decompose()
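Two caveats with that loop, for robustness: find() returns None when a tag is absent (and None.decompose() raises AttributeError), and it only removes the first match. A guarded variant of the same idea:
for name in ['blockquote', 'img']:
    for tag in soup.find_all(name):
        # removes every occurrence; does nothing if the list is empty
        tag.decompose()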
Your example:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
... <blockquote>
... <div>
... <cite>User write:</cite>
... I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
... </div>
... </blockquote>
... <br/>
... THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
Update
Sometimes all you have is a tag starting point but you are actually interested in the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...     txt += elm.string
...     elm = elm.next_sibling
...
>>> txt
u'Yes.Yes!Yes?'
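The .string access is only safe here because every sibling is either a text node or a tag with a single string child; for a tag with nested children, .string is None and the += would fail. A more defensive sketch of the same walk:
from bs4 import NavigableString

txt = ""
elm = soup.blockquote.next_sibling
while elm:
    if isinstance(elm, NavigableString):
        txt += str(elm)          # plain text node
    else:
        txt += elm.get_text()    # tag: collect all nested text
    elm = elm.next_sibling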
BeautifulSoup has a get_text method. Maybe this is what you want.
From their documentation:
markup = '\nI linked to <i>example.com</i>\n'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
If the text you want is never within any additional tags, as in your example, you can use extract() to get rid of all the tags and their contents:
html = '<div class="content">\
<blockquote>\
<div>\
<cite>User write:</cite>\
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">\
</div>\
</blockquote>\
<br/>\
THIS IS THE COMMENT THAT I NEED!\
</div>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.findAll(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)
This gives:
THIS IS THE COMMENT THAT I NEED!
To deal with emoticons, you'll have to do something more complicated. You'll probably have to define a list of emoticons to recognize yourself, and then parse the text to look for them.
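If the emoticons are always rendered as <img class="smilies"> tags, as in the sample, one approach is to decompose them (along with the quoted block) before reading the text, so the split text nodes are joined back together by get_text(). A sketch under that assumption:
def clean_comment(content_div):
    # assumption: the forum marks every emoticon with class="smilies"
    for quote in content_div.find_all('blockquote'):
        quote.decompose()
    for smiley in content_div.find_all('img', class_='smilies'):
        smiley.decompose()
    return content_div.get_text(strip=True)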
Related
The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.
With clear you remove the tag's content.
Without altering the soup you can either make a hard copy with copy or use a DIY approach. Here is an example with copy:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: the DIY approaches, without copy:
Use the Tag class:
from bs4 import BeautifulSoup, Tag
...
div_only = Tag(name='div', attrs=div.attrs)
Or use strings:
div_only = '<div {}></div>'.format(' '.join(map(lambda p: f'{p[0]}="{p[1]}"', div.attrs.items())))
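Either DIY variant should produce the same markup as the copy-based version, and neither touches the original soup:
print(div_only)
# <div name="tag-i-want"></div>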
@cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
    tag.clear()
    return tag

divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]
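With the sample HTML above, the soup stays intact and divs should be:
print(divs)
# ['<div name="tag-i-want"></div>']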
I have HTML that looks like this:
<h3>
Heading 3
</h3>
<ol>
<li>
<ol>
....
</li>
</ol>
I need to highlight the entire HTML starting from the first ol. I have found this solution:
import bs4

soup = bs4.BeautifulSoup(open('temp.html').read(), 'lxml')
new_h1 = soup.new_tag('h1')
new_h1.string = 'Hello '
mark = soup.new_tag('mark')
mark.string = 'World'
new_h1.append(mark)
h1 = soup.h1
h1.replace_with(new_h1)
print(soup.prettify())
Is there any way to highlight the entire HTML without having to find the specific text?
Edit:
I have tried this code, but it only highlights the innermost li:
for node in soup2.findAll('li'):
    if not node.string:
        continue
    value = node.string
    mark = soup2.new_tag('mark')
    mark.string = value
    node.replace_with(mark)
This will highlight all the <li> content.
As I have no clear idea of what your HTML looks like, I have highlighted every <li>'s content; you can modify this code to suit your requirements.
from bs4 import BeautifulSoup

with open('index.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

tag = soup.findAll('li')

# Highlights the <li> content
for li in tag:
    newtag = soup.new_tag('mark')
    li.string.wrap(newtag)

print(soup)
After Highlighting: https://i.stack.imgur.com/iIbXk.jpg
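If some <li> elements contain nested markup, as in the question, li.string is None and the .wrap call raises AttributeError. A hedged variant that wraps each direct text node inside every list item instead, so nested lists are handled by their own <li> pass:
for li in soup.findAll('li'):
    # wrap each direct text child; skip whitespace-only nodes
    for text_node in li.findAll(text=True, recursive=False):
        if text_node.strip():
            text_node.wrap(soup.new_tag('mark'))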
There are lots of HTML pages which are structured as a sequence of such groups:
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.
How can I extract the keywords separately from each of these pages? I've tried to use BeautifulSoup, but without success; I've only managed to write a program that prints the titles of the groups (between <b> and </b>).
from bs4 import BeautifulSoup
from urllib2 import urlopen

html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc)

for link in soup.find_all('a'):
    print 'https://some.page.org' + link.get('href')

for node in soup.findAll('b'):
    print ''.join(node.findAll(text=True))
I can't test this without knowing the actual source format, but it seems you want the <p> tag's text value:
for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    #     print(keywords)
You need to split your string, which in this case is the URL, on /, and then you can choose the chunks you want.
For example, if the URL is https://some.page.org/year/0001, I use the split function to split the URL on the / sign; it converts it to a list, and then I choose what I need and can convert it back to a string with the join() method. You can read about the split method in the Python string documentation.
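A minimal sketch of that idea, assuming the page identifier is the last path segment:
url = 'https://some.page.org/year/0001'
parts = url.split('/')   # ['https:', '', 'some.page.org', 'year', '0001']
page_id = parts[-1]      # '0001'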
There are different ways to parse the desired categories and keywords out of this kind of HTML structure, but here is one of the "BeautifulSoup" ways to do it:
find the b elements whose text ends with a colon
use .next_sibling to get to the next text node, which contains the keywords
Working example:
from bs4 import BeautifulSoup
data = """
<div>
<p>
<b> Category 1:</b>
"keyword_a, keyword_b"
</p>
<p>
<b> Category 2:</b>
"keyword_c, keyword_d"
</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for category in soup('b', text=lambda text: text and text.endswith(":")):
    keywords = category.next_sibling.strip('" \n').split(", ")
    print(category.get_text(strip=True), keywords)
Prints:
Category 1: ['keyword_a', 'keyword_b']
Category 2: ['keyword_c', 'keyword_d']
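Since the question mentions a whole sequence of pages (/year/0001, /year/0002, ...), the same parsing can be wrapped in a loop over the URLs; a sketch, assuming the pages are numbered consecutively and the upper bound is known:
from urllib.request import urlopen
from bs4 import BeautifulSoup

for n in range(1, 100):   # assumption: pages 0001..0099 exist
    url = 'https://some.page.org/2018/%04d' % n
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    for category in soup('b', text=lambda t: t and t.endswith(':')):
        keywords = category.next_sibling.strip('" \n').split(', ')
        print(url, category.get_text(strip=True), keywords)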
Assuming for each block
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
you want to extract keyword_a and keyword_b for each Keywords/Category. An example would be:
<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>
Once you have the HTML code, you can do:
from bs4 import BeautifulSoup
html = '''<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>'''
soup = BeautifulSoup(html, 'html.parser')
p_elements = soup.find_all('p')
for p_element in p_elements:
    b_element = p_element.find('b')   # the category label inside this <p>
    b_element.extract()
    category = b_element.text.strip()
    keywords = p_element.text.strip()
    keyword_a, keyword_b = keywords[1:-1].split(', ')
    print('Category:', category)
    print('Keyword A:', keyword_a)
    print('Keyword B:', keyword_b)
Which prints:
Category: Mammals
Keyword A: elephant
Keyword B: rhino
Category: Birds
Keyword A: hummingbird
Keyword B: ostrich
I have some code that parses a div out of a page and then finds all the "p" tags, each of which has a title and some text.
sample:
import os
from bs4 import BeautifulSoup

for fn in os.listdir('.'):
    if os.path.isfile(fn):
        url = "%s/%s" % (path, fn)
        page = open(url)
        soup = BeautifulSoup(page, 'html.parser')
        soup2 = soup.find("div", {"class": "aui-field-wrapper-content"})
        print soup2.p.prettify()
        for node in soup2.findAll('p'):
            print ''.join(node.findAll(text=True))
which returns
sample:
<p>
<b>
<strong class="TooltipInline" data-toggle="tooltip" title="Molecular formula">
Mol. formula:
</strong>
</b>
C23H30O6
</p>
In this instance I want to access the title "Mol. formula:" and the text "C23H30O6" individually. Currently I am able to return
"Mol. formula: C23H30O6", but not the individual components. I am relatively new to Beautiful Soup and am unsure how to reference each component of a "p" tag.
The other way to approach the problem is to get the b element inside the p element and consider it your "label", then go sideways and get the next sibling element:
label = p.b
value = label.next_sibling.strip()
print(label.get_text(strip=True), value)
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <p>
... <b>
... <strong class="TooltipInline" data-toggle="tooltip" title="Molecular formula">
... Mol. formula:
... </strong>
... </b>
... C23H30O6
... </p>
... """
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>>
>>> p = soup.p
>>>
>>> label = p.b
>>> value = label.next_sibling.strip()
>>> print(label.get_text(strip=True), value)
Mol. formula: C23H30O6
Your method of findAll(text=True) is doing the same thing as the get_text() method from Beautiful Soup. It will get all the text in the <p> tag. If you have a stable format a simple way to do it would be:
ptext = node.get_text().split(':',1)
title = ptext[0].strip()
value = ptext[1].strip()
In reference to the child tag question: note that the molecular formula isn't in any tag except for the <p> tag.
Here's a piece of HTML code (from delicious):
<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a>
<span class="saverem">
<em class="bookmark-actions">
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&jump=%2Fdux&key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&original_user=dux&copyuser=dux&copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
</em>
</span>
</h4>
I'm trying to find all the links where class="inlinesave action". Here's the code:
import urllib2
from BeautifulSoup import BeautifulSoup

sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class': 'inlinesave action'})
print len(tags)
But it doesn't find anything!
Any thoughts?
Thanks
If you want to look for an anchor with exactly those two classes, you'd have to use a regexp, I think:
import re
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})
Keep in mind that this regexp won't work if the ordering of the class names is reversed (class="action inlinesave").
The following statement should work for all cases (even though it looks ugly imo.):
soup.findAll('a',
             attrs={'class': re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')})
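If you are on bs4 rather than the old BeautifulSoup 3, note that bs4 treats class as a multi-valued attribute, so you can match one class and filter on the other without a regexp; a sketch:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = [a for a in soup.find_all('a', class_='inlinesave')
        if 'action' in a.get('class', [])]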
Python string methods
html = open("file").read()
for item in html.split("<strong>"):
    if "class" in item and "inlinesave action" in item:
        url_with_junk = item.split('href="')[1]
        m = url_with_junk.index('">')
        print url_with_junk[:m]
Maybe that issue is fixed in version 3.1.0; I could parse yours:
>>> html="""<h4>
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony
... <span class="saverem">
... <em class="bookmark-actions">
... <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Gen
... </em>
... </span>
... </h4>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'})
>>> print len(tags)
1
>>> tags
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%
>>>
I have tried with BeautifulSoup 2.1.1 also; it does not work at all.
You might make some forward progress using pyparsing:
from pyparsing import makeHTMLTags, withAttribute
htmlsrc="""<h4>... etc."""
atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class","inlinesave action")))
for result in atag.searchString(htmlsrc):
    print result.href
Gives (long result output snipped at '...'):
/save?url=http%3A%2F%2Fimfy.us%2F&title=Genera...+anonymous+links