Using Beautiful Soup v4, I have a span as follows:
<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>
I'd like to extract the text between the <br/> elements individually. How can I do it?
Try this:
from bs4 import BeautifulSoup
txt = '''<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>'''
soup = BeautifulSoup(txt, 'html.parser')
for tag in soup.select('span br'):
    print(tag.next)
Output:
10454 Downloads
35:25 Mins
128kbps Stereo
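Note that the loop above prints only the text that follows each `<br/>`, so the first chunk ("32.44 MB") is skipped. If you want all four pieces in one go, the span's stripped_strings generator yields every text node with surrounding whitespace removed (a sketch on the same markup):

```python
from bs4 import BeautifulSoup

txt = '<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>'
soup = BeautifulSoup(txt, 'html.parser')

# stripped_strings yields each text node of the span in document order
parts = list(soup.span.stripped_strings)
print(parts)  # ['32.44 MB', '10454 Downloads', '35:25 Mins', '128kbps Stereo']
```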
This may not be the proper way to do it, but if you treat your span as a string, you can extract the words like this:
user_input = '<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>'.split("<br/>")
WordList = []
for word in user_input:
    if word.startswith("<"):
        word = word[word.index(">") + 1:]  # drop the leading opening tag
    if "<" in word:
        word = word[:word.index("<")]  # drop the trailing closing tag
    if word:
        WordList.append(word)
print(WordList)
# ['32.44 MB', '10454 Downloads', '35:25 Mins', '128kbps Stereo']
Previously I used req = soup.find("td", string = "tags text") (just an example) to find elements by their text, but in this case the tag's string has some spaces before and after the text.
Tag is like the below:
<dt> I am text </dt>
How should I ignore the leading and trailing spaces?
If I want to use the previous method, I'd have to write req = soup.find("td", string = " I am text "), but I think there should be a better way.
You can pass a function to the string= parameter of .find() (or text= in older BeautifulSoup versions):
from bs4 import BeautifulSoup
html_doc = '''<dt>Other text</dt>
<dt> I am text </dt>'''
soup = BeautifulSoup(html_doc, 'html.parser')
dt = soup.find('dt', string=lambda t: t and t.strip() == 'I am text')
print(dt)
Prints:
<dt> I am text </dt>
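A compiled regular expression works as well, since BeautifulSoup accepts regex objects wherever it accepts a string filter. This sketch matches the same element while tolerating any amount of surrounding whitespace:

```python
import re
from bs4 import BeautifulSoup

html_doc = '''<dt>Other text</dt>
<dt> I am text </dt>'''
soup = BeautifulSoup(html_doc, 'html.parser')

# BeautifulSoup applies re.search, so anchor the pattern explicitly
dt = soup.find('dt', string=re.compile(r'^\s*I am text\s*$'))
print(dt)  # <dt> I am text </dt>
```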
There are lots of HTML pages which are structured as a sequence of such groups:
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.
How can I extract the keywords separately from each such page? I've tried BeautifulSoup, but without success: so far I've only written a program that prints the titles of the groups (the text between <b> and </b>).
from bs4 import BeautifulSoup
from urllib2 import urlopen
import re

html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc)

for link in soup.find_all('a'):
    print 'https://some.page.org' + link.get('href')

for node in soup.findAll('b'):
    print ''.join(node.findAll(text=True))
I can't test this without knowing the actual source code format, but it seems you want the <p> tag's text value:
for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    # print(keywords)
You need to split your string, which in this case is the URL, on /.
Then you can choose the chunks you want.
For example, if the URL is https://some.page.org/year/0001, I use the split function to split the URL on the / sign.
It converts the string to a list; you can then choose the parts you need and convert them back to a string with ''.join(). You can read about the split method in this link.
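For instance, a sketch of that approach on the URL pattern from the question:

```python
url = 'https://some.page.org/year/0001'

# split on '/' -> ['https:', '', 'some.page.org', 'year', '0001']
parts = url.split('/')
year, page_id = parts[-2], parts[-1]
print(year, page_id)  # year 0001
```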
There are different ways to HTML-parse the desired categories and keywords from this kind of HTML structure, but here is one of the "BeautifulSoup" ways to do it:
find the b elements whose text ends with a colon (:)
use .next_sibling to get to the next text node, which contains the keywords
Working example:
from bs4 import BeautifulSoup
data = """
<div>
<p>
<b> Category 1:</b>
"keyword_a, keyword_b"
</p>
<p>
<b> Category 2:</b>
"keyword_c, keyword_d"
</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for category in soup('b', text=lambda text: text and text.endswith(":")):
    keywords = category.next_sibling.strip('" \n').split(", ")
    print(category.get_text(strip=True), keywords)
Prints:
Category 1: ['keyword_a', 'keyword_b']
Category 2: ['keyword_c', 'keyword_d']
Assuming for each block
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
you want to extract keyword_a and keyword_b for each Keywords/Category. So an example would be:
<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>
Once you have the HTML code, you can do:
from bs4 import BeautifulSoup
html = '''<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>'''
soup = BeautifulSoup(html, 'html.parser')
p_elements = soup.find_all('p')
for p_element in p_elements:
    # find the <b> inside this <p>, not the first <b> in the whole document
    b_element = p_element.find('b')
    b_element.extract()
    category = b_element.text.strip()
    keywords = p_element.text.strip()
    keyword_a, keyword_b = keywords[1:-1].split(', ')
    print('Category:', category)
    print('Keyword A:', keyword_a)
    print('Keyword B:', keyword_b)
Which prints:
Category: Mammals
Keyword A: elephant
Keyword B: rhino
Category: Birds
Keyword A: hummingbird
Keyword B: ostrich
I'm trying to scrape a forum, but I can't deal with the comments, because users use emoticons and bold fonts and quote previous messages, and so on...
For example, here's one of the comments that I have a problem with:
<div class="content">
<blockquote>
<div>
<cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div>
</blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>
I've been searching for help for the last 4 days and couldn't find anything, so I decided to ask here.
This is the code that I'm using:
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)
    msg = soup.find("div", {"class": "content"})
    # msg holds the whole message, exactly as I wrote previously
    print msg
    # Here I get:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print item
I'm looking for a way to deal with messages in a general way, no matter how they are written. Sometimes users put emoticons between the text, and I need to remove them and get the whole message (in this situation, BeautifulSoup will put each part of the message (first part, emoticon, second part) in different items).
Thanks in advance!
Use decompose: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
Decompose extracts tags that you don't want. In your case:
soup.blockquote.decompose()
or all unwanted tags:
for tag in ['blockquote', 'img', ... ]:
    soup.find(tag).decompose()
Your example:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
... <blockquote>
... <div>
... <cite>User write:</cite>
... I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
... </div>
... </blockquote>
... <br/>
... THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
Update
Sometimes all you have is a tag starting point but you are actually interested in the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...     txt += elm.string
...     elm = elm.next_sibling
...
>>> print(txt)
Yes.Yes!Yes?
BeautifulSoup has a get_text method. Maybe this is what you want.
From their documentation:
markup = '\nI linked to <i>example.com</i>\n'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
If the text you want is never within any additional tags, as in your example, you can use extract() to get rid of all the tags and their contents:
html = '''<div class="content">
<blockquote>
<div>
<cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div>
</blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.findAll(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)
This gives:
THIS IS THE COMMENT THAT I NEED!
To deal with emoticons, you'll have to do something more complicated. You'll probably have to define a list of emoticons to recognize yourself, and then parse the text to look for them.
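One hedged approach: on many forums an emoticon is an <img> whose alt attribute holds its text form, so you can replace each image with its alt text (or an empty string) before extracting the comment. A sketch, assuming the smilies class from the question's markup:

```python
from bs4 import BeautifulSoup

html = '<div class="content">Nice post <img alt=":116:" class="smilies" title="116"/> indeed</div>'
soup = BeautifulSoup(html, 'html.parser')

# swap every smiley image for its alt text; use '' instead to drop them
for img in soup.find_all('img', class_='smilies'):
    img.replace_with(img.get('alt', ''))

print(soup.get_text())  # Nice post :116: indeed
```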
I need to replace a matching pair of HTML tags by another tag. Probably BeautifulSoup (4) would be suitable for the task, but I've never used it before and haven't found a suitable example anywhere, can someone give me a hint?
For example, this HTML code:
<font color="red">this text is red</font>
Should be changed to this:
<span style="color: red;">this text is red</span>
The beginning and ending HTML tags may not be in the same line.
Use replace_with() to replace elements. Adapting the documentation example to your example gives:
>>> from bs4 import BeautifulSoup
>>> markup = '<font color="red">this text is red</font>'
>>> soup = BeautifulSoup(markup)
>>> soup.font
<font color="red">this text is red</font>
>>> new_tag = soup.new_tag('span')
>>> new_tag['style'] = 'color: ' + soup.font['color']
>>> new_tag.string = soup.font.string
>>> soup.font.replace_with(new_tag)
<font color="red">this text is red</font>
>>> soup
<span style="color: red">this text is red</span>
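Since the question mentions tags spread across a whole document, note that replace_with() swaps one element at a time; to convert every <font> tag, loop over find_all('font'). A sketch (it copies only the tag's plain text, so nested markup inside a font tag would need extra handling):

```python
from bs4 import BeautifulSoup

html = '<p><font color="red">red text</font> and <font color="blue">blue text</font></p>'
soup = BeautifulSoup(html, 'html.parser')

# build a replacement span for each font tag, then swap them
for font in soup.find_all('font'):
    span = soup.new_tag('span', style='color: %s;' % font['color'])
    span.string = font.get_text()
    font.replace_with(span)

print(soup)
# <p><span style="color: red;">red text</span> and <span style="color: blue;">blue text</span></p>
```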
I have a string:
data = 'very <strong class="keyword">Awesome</strong> <strong class="keyword">Book</strong> discount'
I want to get the output as a list:
ans = ['very','<strong class="keyword">Awesome</strong>','<strong class="keyword">Book</strong>','discount']
So I can have the position of each word and also know which words occurred inside <strong> tags.
I used BeautifulSoup to extract the words inside <strong> tags and the words that are not inside them, but I still need the positions.
The code I tried:
from bs4 import BeautifulSoup as BS
import re

data = 'very <strong class="keyword">Awesome</strong> <strong class="keyword">Book</strong>'
soup = BS(data)
to_extract = soup.findAll('strong')
[comment.extract() for comment in to_extract]
soup = str(soup)
InStrongWords = []
for t in to_extract:
    t_soup = BS('{0}'.format(t))
    t_tag = t_soup.strong
    InStrongWords.append(t_tag.string)
soup = re.sub("[^A-Za-z0-9\\-\\.\\(\\)\\\\\/\\&': ]+", ' ', soup)
notInStrongWords = re.findall('[(][^)]*[)]|\S+', soup)
Thanks in Advance.
Based on Andrew Alcock's answer. Thank you, Andrew.
Let's say:
data = ['very <strong class="keyword">Awesome</strong> <strong class="keyword">Book</strong>','<strong class="keyword">Awesome</strong> <strong class="keyword">Book</strong> discount']
Then, for Python 2.x and BeautifulSoup 4:
from bs4 import BeautifulSoup as BS

for d in data:
    soup = BS(d)
    soupPTag = soup.p
    if soupPTag:
        soupList = [unicode(child) for child in soupPTag.children if child != " "]
        print soupList
    else:
        soupBodyTag = soup.body
        soupList = [unicode(child) for child in soupBodyTag.children if child != " "]
        print soupList
This will give the required answer.
Try (for Python 2.x - Python 3 does unicode differently):
from bs4 import BeautifulSoup as BS
data = 'very <strong class="keyword">Awesome</strong> <strong class="keyword">Book</strong>'
soup = BS(data)
pTag = soup.p
list = [ unicode(child) for child in pTag.children ]
print list
Returns:
[u'very ', u'<strong class="keyword">Awesome</strong>', u' ', u'<strong class="keyword">Book</strong>']
Basically, this iterates over the child elements and turns them back into Unicode strings. You may want to filter out the space, but it is technically present in your HTML.
If you need to check which children are "strong", you could do something like this:
import bs4
data = 'very <strong class="keyword">Awesome</strong> <strong class="keyword">Book</strong>'
soup = bs4.BeautifulSoup(data)
list = [ (child.name if isinstance(child, bs4.Tag) else None, unicode(child)) for child in soup.children ]
print list
Which returns a list of tuples, each tuple being the (name of the tag or None where no tag, HTML):
[(None, u'very '), (u'strong', u'<strong class="keyword">Awesome</strong>'), (None, u' '), (u'strong', u'<strong class="keyword">Book</strong>')]
re.finditer (instead of re.findall) gives you match objects that you can get the start() and end() of.
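A sketch of that idea on the string from the question; the regex for <strong> elements is my own assumption, not something from the original code:

```python
import re

data = 'very <strong class="keyword">Awesome</strong> <strong class="keyword">Book</strong> discount'

# each match object records where the tagged word sits in the original string
matches = [(m.group(1), m.start(), m.end())
           for m in re.finditer(r'<strong[^>]*>(.*?)</strong>', data)]
print(matches)  # [('Awesome', 5, 45), ('Book', 46, 83)]
```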