I have raw HTML files, and I have removed the script tags.
I want to identify the block-level elements in the DOM (like <h1>, <p>, <div>, etc., not <a>, <em>, <b>, etc.) and enclose them in <div> tags.
Is there an easy way to do this in Python?
Is there a library in Python to identify the block elements?
Thanks
UPDATE
Actually, I want to extract content from the HTML document. I have to identify the blocks which contain text. For each text element I have to find its closest parent element that is displayed as a block. After that, for each block I will extract features such as the size and position of the block.
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say 'data' is the raw content of a site; then you could do:
soup = BeautifulSoup(data)  # you may need to add from_encoding="utf-8" or so
Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:
def walker(soup):
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
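Following up on the update in the question, here is a minimal sketch of how you might find, for each text node, its closest block-level ancestor with BeautifulSoup. The BLOCK_TAGS set and the closest_block helper are illustrative assumptions, not part of any library; adjust the tag list to whatever you treat as block-level.
from bs4 import BeautifulSoup

# Assumption: a hand-picked set of tags treated as block-level; extend as needed.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "ul", "ol", "li", "table", "section", "article", "blockquote"}

def closest_block(text_node):
    # Walk up from a text node to its nearest block-level ancestor.
    parent = text_node.parent
    while parent is not None and parent.name not in BLOCK_TAGS:
        parent = parent.parent
    return parent

soup = BeautifulSoup("<div><p>Hello <em>world</em></p></div>", "html.parser")
for node in soup.find_all(string=True):
    if node.strip():  # skip whitespace-only text nodes
        block = closest_block(node)
        print(repr(node), "->", block.name if block else None)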
Related
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class names with a space, the CSS selector understands that you are looking for a descendant element with that name. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector returns the text directly inside the div, but not the text in its children. Since there is also a strong element inside the p, I would suggest a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
It's relevant to point out that extracting text with CSS selectors is an extension of the standard selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
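As a quick sanity check, here is a minimal sketch that runs both selectors against inline HTML (a stand-in for the real page, not its actual markup) using parsel, the selector library Scrapy builds on:
from parsel import Selector

html = '''
<div class="home-hero-blurb no-select">
  <p>First paragraph with <strong>bold</strong> text.</p>
  <p>Second paragraph.</p>
</div>
'''
sel = Selector(text=html)

# Both should return the text of both paragraphs, including the <strong> part.
print(sel.css("div.home-hero-blurb.no-select *::text").getall())
print(sel.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall())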
What I want to do is change every tag (whether it's <a href=> or <title> or </title> or </div>, etc.) to a symbol.
I tried using Beautiful Soup, but it only finds tags that I define...
I found some code in HTMLParser.py:
tagfind = re.compile('([a-zA-Z][^\t\n\r\f />\x00]*)(?:\s|/(?!>))*')
I believe this is what I'm looking for; I just don't know how to use it properly.
Also, I figured I could use:
def handle_starttag(self, tag, attrs):
But I don't want to define the tag, I just want the script to find every single tag and change it to something...
Is this possible?
Thank you for all of your help!!
A much more reliable way is to recursively visit each tag. In the example below I just change the tag name, but you can do whatever you want once you have the tag:
from bs4 import BeautifulSoup, element

def visit(s):
    if isinstance(s, element.Tag):
        has_children = s.find_all()
        if has_children:
            s.name = "foobar"
            for child in s:
                visit(child)
        else:
            s.name = "foobar"
To use it:
soup = BeautifulSoup(...)
visit(soup)
Then any changes will be reflected in the soup.
BeautifulSoup isn't a good idea here; it's designed for parsing HTML, not editing it.
Also, that regex doesn't seem like a very good one (it only matches the content inside a tag rather than the whole tag itself), so I found a different one that would be better suited to your purposes:
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
This regex will match anything like the following:
<h1>
</h1>
<img src="foo.com/image.png">
We can use this to replace all tags with re.sub, which finds all matches for a given regex and replaces them with something else. Here's how you'd use it for what you want to do:
import re
html_regex = r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
html = "<h1>Foo</h1>"
print(re.sub(html_regex, "#", html))
This would print:
#Foo#
I have markup as:
<p>Sample text. Click Here</p>
I want to replace the <p> tag with <span> without changing the children or text of the tag
Your task can be done using replaceWith. You have to duplicate the element you want to use as the replacement, and then feed that as the argument to replaceWith. The documentation for replaceWith is pretty clear on how to do this.
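For example, here is a minimal sketch of that approach with bs4 (using replace_with, the underscore spelling bs4 gives replaceWith), reusing the markup from the question: build the new <span>, move the children over, then swap it in.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Sample text. Click Here</p>', 'html.parser')
p = soup.p

# Build the replacement tag and move the original children into it.
span = soup.new_tag('span')
for child in list(p.contents):
    span.append(child.extract())

p.replace_with(span)
print(soup)  # <span>Sample text. Click Here</span>
Setting p.name = 'span' directly, as in the linked question below, also keeps the children and text intact.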
In any case you could read this question How to change tag name with BeautifulSoup?
I am trying to scrape all the data within a div as follows. However, the quotes are throwing me off.
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
I am trying to start it with something along the lines of
addressStart = page.find("<div id="address">")
but the quotes within the div are messing me up. Does anybody know how I can fix this?
To answer your specific question, you need to escape the quotes, or use a different type of quote on the string itself:
addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
But don't do that. If you are trying to "parse" HTML, let a third-party library do that. Try Beautiful Soup. You get a nice object back which you can use to traverse or search. You can grab attributes, values, etc... without having to worry about the complexities of parsing HTML or XML:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page)
for address in soup.find_all('div', id='address'):  # returns a list; use find if you just want the first
    for info in address.find_all('div', class_='info'):  # for the class attribute, use class_ since class is a reserved word
        print(info.string)
I need to manipulate certain text in an HTML document after I have identified the text in the original document. Let's say I have this HTML code
<div id="identifier">
<a href="link" id="linkid">
</a>
</div>
I want to delete the id attribute in the <a> tag. I can identify a particular tag using BeautifulSoup, but because it changes the formatting of the original document, I can't just search/replace the string either. I don't want to simply write out BeautifulSoup's output; instead I want to identify the <a href="link" id="linkid"> tag in the original document and replace it with just <a href="link">. Any idea how to proceed?
Answering a few questions raised:
This is a huge existing codebase that needs some updating, so it's not just a single search/replace kind of job.
The original formatting is important because the organization follows certain coding standards for formatting code, which I want to retain. Also, BS introduces extra tags for the sake of completeness, like <html> and <body>, and so on.
Which version of BeautifulSoup are you using?
You can edit HTML nodes like dictionaries in bs4.
From the documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#changing-tag-names-and-attributes
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
del tag['class']
del tag['id']
Also, you seem to have a problem with the way Beautiful Soup outputs the modified HTML code.
If you want to pretty-print your document, or use custom formatting, you can do it easily:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output
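For example, here is a minimal sketch (using the markup from the question) that removes the attribute and shows two output options; str(soup) keeps the markup close to how it was parsed, while prettify() re-indents the whole tree:
from bs4 import BeautifulSoup

html = '''<div id="identifier">
    <a href="link" id="linkid">
    </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')
del a['id']  # drop just the id attribute

print(str(soup))        # compact output, close to the parsed formatting
print(soup.prettify())  # re-indented output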