I'm creating a parser, and I have the following construct:
quotes = soup.findAll('div',{'class':'text'})
But it strips all HTML tags (like br). How can I change that?
findAll itself will give you a list of HTML nodes.
If you want to retrieve their text content (without tags), use .get_text().
To get the children of these nodes (as objects too), use .contents or .children.
In order to print a node's children as a well-formatted string, you can use .prettify(). Note that this won't exactly preserve the original formatting.
See also:
BeautifulSoup innerhtml?
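As a rough illustration of the difference between these accessors, here is a minimal sketch. It assumes the bs4 package (BeautifulSoup 4) with the built-in html.parser backend, and the sample markup is made up for the example. decode_contents() is used here as one way to get a node's inner HTML as a string.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (assumption, not from the question)
html = '<div class="text">first line<br/>second line</div>'
soup = BeautifulSoup(html, "html.parser")

quotes = soup.find_all("div", {"class": "text"})
node = quotes[0]

text_only = node.get_text()          # text content, tags dropped
inner_html = node.decode_contents()  # children rendered back to markup
outer_html = str(node)               # the node itself, tags included

print(text_only)
print(inner_html)
print(outer_html)
```

So if the br tags matter, keep working with the node (or its markup string) rather than with get_text().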
If you want to take out the tags from the text, you could try something like this:
import re
for item in quotes:
    quote = re.sub(r"<.*?>", "", str(item))
I'd like to remove script and noscript tags under the given tag (node).
for t in node.find_all(["script", "noscript"]):
    t.unwrap()
for s in node.stripped_strings:
    print(s)
But the above loop will still print the content of script tags.
Where is the fault?
You are using the wrong method: unwrap() replaces a tag with its contents, so the script text stays in the tree. Use the decompose() method instead, especially if you don't need to keep the tag or string that you are removing.
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
for t in node.find_all(["script", "noscript"]):
    t.decompose()
You need the extract() method instead:
PageElement.extract() removes a tag or string from the tree.
for t in node.find_all(["script", "noscript"]):
    t.extract()
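To see the three methods side by side, here is a small sketch. It assumes bs4 with the html.parser backend; the markup and the helper name remaining_markup are made up for illustration. It compares the markup left in the node rather than stripped_strings, because how script text shows up in the string iterators can differ between bs4 versions.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (assumption, not from the question)
html = ("<div><p>visible</p><script>var hidden = 1;</script>"
        "<noscript>fallback</noscript></div>")

def remaining_markup(method_name):
    """Hypothetical helper: apply one removal method, return what is left."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.div
    for t in node.find_all(["script", "noscript"]):
        getattr(t, method_name)()  # calls t.unwrap(), t.decompose() or t.extract()
    return str(node)

after_unwrap = remaining_markup("unwrap")        # tags gone, their text still there
after_decompose = remaining_markup("decompose")  # tags and text gone
after_extract = remaining_markup("extract")      # tags and text gone

print(after_unwrap)
print(after_decompose)
print(after_extract)
```

The practical difference between the last two is that extract() hands the removed element back to you, while decompose() destroys it.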
I have to retrieve text from an HTML table; in the cells, the text is sometimes inside a <div> and sometimes not.
How can I make a div in an XPath expression optional?
My actual code:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div/text()")
Wanted pseudocode:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div or nothing/text()")
You want the string value of the td[5] element. Use string():
stuff = tree.xpath("string(/html/body/table/tbody/tr/td[5])")
This will return text without markup beneath td[5].
You can also indirectly obtain the string value of an element via normalize-space() as suggested by splash58 in the comments, if you also want whitespace to be trimmed on the ends and reduced interiorly.
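One thing to keep in mind: in XPath 1.0, string() and normalize-space() applied to a node-set use only its first node, so with several rows you would get just the first cell. A minimal sketch of a per-cell variant follows, assuming lxml and a made-up two-row table (the variable names are only for illustration):

```python
import lxml.html

# Made-up sample table (assumption): the fifth cell is wrapped in a div
# in the first row and bare in the second.
html = """
<html><body><table><tbody>
<tr><td>a</td><td>b</td><td>c</td><td>d</td><td><div> inside div </div></td></tr>
<tr><td>a</td><td>b</td><td>c</td><td>d</td><td>bare text</td></tr>
</tbody></table></body></html>
"""
tree = lxml.html.fromstring(html)

# Only the first matching cell contributes to the result here
first_only = tree.xpath("string(/html/body/table/tbody/tr/td[5])")

# Select the cells first, then take the string value of each one
cells = tree.xpath("/html/body/table/tbody/tr/td[5]")
raw = [cell.xpath("string(.)") for cell in cells]
trimmed = [cell.xpath("normalize-space(.)") for cell in cells]

print(repr(first_only))
print(raw)
print(trimmed)
```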
I'm currently using Beautiful Soup to parse comment tags out of a set block of HTML. The issue I'm having is that the scraped HTML has no quotation marks around the attribute values of its tags. However, BeautifulSoup seems to add these in, which in some cases may be desirable, but unfortunately not in mine.
What would be the best route to leave the actual HTML intact, without BeautifulSoup adding the quotes, or can these be added back in?
You have a tag where some attribute values are quoted and some are unquoted. What do you mean by 'add quoting back'? You could either edit each attribute value to kludge the quotes in (probably a terrible idea), or add quoting when the tag is rendered. It depends on what other processing you're doing to the tag. Here's code to add quotes when it prints:
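from bs4 import BeautifulSoup
markup = "<html><sometag attr1=dont_quote_me attr2='but this one is quoted'>Text</sometag></html>"
bs = BeautifulSoup(markup, "html.parser")
bs2 = bs.find('sometag')
# In bs4, attrs is a dict mapping attribute names to values
for attr, aval in bs2.attrs.items():
    print("%s='%s'" % (attr, aval), end=" ")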
gives attr1='dont_quote_me' attr2='but this one is quoted'
Which way you go is up to you. I assume the unquoted values are all single words, i.e. they match the regex \w+
How can one tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, like:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Perhaps a more elegant approach I don't know of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired Output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
or even better:
This is some text with multiple tags and sometimes they are nested.
You can use the lxml.html.clean module (in recent lxml releases it is shipped separately as the lxml_html_clean package):
import lxml.html, lxml.html.clean
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
This answer is a bit late, but a simpler solution than the one in the initial answer by ars might be handy to have on record.
Short Answer
Use the "*" argument when you call strip_tags() to specify all tags to be stripped.
Long Answer
Given your XML string, we can create an lxml Element:
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect that instance like so:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspection shows that all child tags are gone:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
Which is your desired output. Note that this will modify the lxml Element instance itself! To make it even better (as you asked :-)) just grab the text property:
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
I have some webpage content and need to get some data from it. It looks like:
< div class="deg">DATA< /div>
As I understand it, I have to use a regex, but I can't work out the right one.
I tried the code below but got no results. Please correct me:
regexHandler = re.compile('(<div class="deg">(?P<div class="deg">.*?)</div>)')
result = regexHandler.search( pageData )
I suggest using a good HTML parser (such as BeautifulSoup -- but for your purposes, i.e. with well-formed HTML as input, the ones that come with the Python standard library, such as HTMLParser, should also work well) rather than raw REs to parse HTML.
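For what it's worth, here is a minimal sketch of the standard-library route. It assumes Python 3, where HTMLParser lives in the html.parser module; the class name DegDivParser and the sample markup are made up for illustration, and it only matches divs whose class attribute is exactly "deg".

```python
from html.parser import HTMLParser

class DegDivParser(HTMLParser):
    """Hypothetical helper: collects the text of every <div class="deg">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching div
        self.results = []   # one entry per matching div
        self._buf = []      # text chunks of the div being read

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested div inside a match
        elif tag == "div" and ("class", "deg") in attrs:
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buf))

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)

parser = DegDivParser()
parser.feed('<html><body><div class="deg">DATA1</div><p>skip</p>'
            '<div class="deg">DATA2</div></body></html>')
parser.close()
print(parser.results)
```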
If you want to persist with the raw RE approach, the pattern:
r'<div class="deg">([^<]*)</div>'
looks like the simplest way to get the string 'DATA' out of the string '<div class="deg">DATA</div>' -- assuming that's what you're after. You may need to add one or more \s* in spots where you need to tolerate optional whitespace.
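As a sketch of what the \s* tolerance might look like, the pattern below accepts the stray spaces shown in the question (such as "< div" and "< /div"). The sample string and variable names are made up for the example, and this is only one possible way to loosen the pattern.

```python
import re

# Made-up sample input (assumption): one tag pair with stray spaces, one without
page_data = '<p>x</p>< div class="deg">DATA< /div><div class="deg"> MORE </div>'

# \s* / \s+ allow optional whitespace inside the opening and closing tags
pattern = re.compile(r'<\s*div\s+class="deg"\s*>([^<]*)<\s*/\s*div\s*>')

# findall returns the captured group for every match; strip() trims padding
matches = [m.strip() for m in pattern.findall(page_data)]
print(matches)
```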
If you want the div tags included in the matched item:
regexpHandler = re.compile('(<div class="deg">.*?</div>)')
If you don't want the div tags included, only the DATA portion:
regexpHandler = re.compile('<div class="deg">(.*?)</div>')
Then to run the match and get the result:
result = regexpHandler.search(pageData)
matchedText = result.group(1)
You can use simple string functions in Python; there is no need for a regex:
mystr = """< div class="deg">DATA< /div>"""
if "div" in mystr and "class" in mystr and "deg" in mystr:
    s = mystr.split(">")
    for n, item in enumerate(s):
        if "deg" in item:
            # the data is the chunk after the opening tag, up to the next "<"
            print(s[n+1][:s[n+1].index("<")])
My approach is to find something to split on. In the example above, I split on ">". Then I go through the split items, check for "deg", and take the item after it, since "deg" appears just before the data you want. Of course, this is not the only approach.
While it is OK to use a regex for quick and dirty HTML processing, a much better and cleaner way is to use an HTML parser like lxml.html and to query the parsed tree with XPath or CSS selectors.
html = """<html><body><div class="deg">DATA1</div><div class="deg">DATA2</div></body></html>"""
import lxml.html
page = lxml.html.fromstring(html)
#page = lxml.html.parse(url)
for element in page.findall('.//div[@class="deg"]'):
    print(element.text)
#using css selectors
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.deg")
for element in sel(page):
    print(element.text)