Remove URLs and Images from FeedParser - python

I am using http://code.google.com/p/feedparser/ to write a simple news aggregator. I want pure text (with <p> tags), but no URLs or images (i.e. no <a> or <img> tags).
Here are two methods to do that:
1. Edit the source code at http://code.google.com/p/feedparser/source/browse/branches/f8dy/feedparser/feedparser.py:
class _HTMLSanitizer(_BaseHTMLProcessor):
    acceptable_elements = [...]
Simply remove the a and img tags from that list.
2.
import feedparser
# remove() mutates the list in place and returns None
feedparser._HTMLSanitizer.acceptable_elements.remove('a')
feedparser._HTMLSanitizer.acceptable_elements.remove('img')
That is, each time I use feedparser, I first remove the two tags.
Which method is better?
Are there any other good methods?
Thanks a lot!

Usually, quicker is better, and you can measure that with Python's timeit module. But in your case I'd prefer not to alter the source code and would stick with the second option; it helps maintainability.
Other options include writing a custom parser (use a C extension for maximum speed) or just letting your site's templating engine (Django, maybe?) strip those tags. Actually, I've changed my mind: the last solution seems the best all-around.
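For reference, a minimal sketch of the second option in use, matching the feedparser version from the question (the feed URL is hypothetical):
import feedparser

# Rebuild the whitelist without 'a' and 'img'. Building a new set avoids
# the trap of assigning back the None that list.remove() returns.
feedparser._HTMLSanitizer.acceptable_elements = set(
    feedparser._HTMLSanitizer.acceptable_elements) - {'a', 'img'}

feed = feedparser.parse('http://example.com/news.rss')  # hypothetical URL
for entry in feed.entries:
    print(entry.title)
    print(entry.summary)  # sanitized HTML: <p> kept, <a>/<img> dropped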

Related

Python MiniDom not removing elements properly

I'm converting a piece of JS code to Python, and I have been using minidom, but certain things aren't working right. They were working fine when running in JavaScript. I'm converting because I want consistent changes/order (i.e. where the class attribute is added), as well as so I can use some of Python's easier features.
My latest issue that I've come across is this:
fonts = doc.getElementsByTagName('font')
while fonts.length > 0:
    # Create a new span
    span = doc.createElement("span")
    # Give it a class name based on the color (colors is a map)
    span.setAttribute("class", colors[fonts[0].getAttribute("color")])
    # Place all the children inside
    while fonts[0].firstChild:
        span.appendChild(fonts[0].firstChild)
    # end while
    # Replace the <font> with the <span>
    print(fonts[0].parentNode.toxml())
    fonts[0].parentNode.replaceChild(span, fonts[0])
# end while
The problem is that, unlike in JavaScript, the element isn't removed from fonts like it should be. Is there a better library I should be using that follows the standard (level 3) DOM rules, or am I just going to have to hack it out if I don't want to use XPath (which all the other DOM parsers seem to use)?
Thanks.
You can see in the documentation for Python DOM (very bottom of the page) that it doesn't work like a "real" DOM in the sense that collections like the one you get from getElementsByTagName are not "live". getElementsByTagName here just returns a static snapshot of the matching elements at that moment. This isn't usually a problem with Python, because when you're using xml.dom you're not working with a live-updating page inside a browser; you're just manipulating a static DOM parsed from a file or string, so you know no other code is messing with the DOM while you aren't looking.
In most cases, you can probably get what you want by changing the structure of your code to reflect this. For this case, you should be able to accomplish your goal with something like this:
fonts = doc.getElementsByTagName('font')
for font in fonts:
    # Create a new span
    span = doc.createElement("span")
    # Give it a class name based on the color (colors is a map)
    span.setAttribute("class", colors[font.getAttribute("color")])
    # Place all the children inside
    while font.firstChild:
        span.appendChild(font.firstChild)
    # end while
    # Replace the <font> with the <span>
    font.parentNode.replaceChild(span, font)
The idea is that instead of always looking at the first element in fonts, you iterate over each one and replace them one at a time.
Because of these differences, if your JavaScript DOM code makes use of these sorts of on-the-fly DOM updates, you won't be able to port it "verbatim" to Python (using the same DOM calls). However, sometimes doing it in this less dynamic way can be easier, because things change less under your feet.
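Here is a self-contained demonstration of the snapshot behavior (the colors map and markup are made up for illustration):
from xml.dom.minidom import parseString

colors = {'red': 'c-red', 'blue': 'c-blue'}  # hypothetical color-to-class map
doc = parseString('<p><font color="red">hot</font> and '
                  '<font color="blue">cold</font></p>')

fonts = doc.getElementsByTagName('font')
print(fonts.length)  # 2 -- computed once, as a static snapshot

for font in fonts:
    span = doc.createElement('span')
    span.setAttribute('class', colors[font.getAttribute('color')])
    while font.firstChild:
        span.appendChild(font.firstChild)
    font.parentNode.replaceChild(span, font)

print(fonts.length)  # still 2: replacing nodes does not shrink the snapshot
print(doc.toxml())   # the document itself now contains <span> elements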

Remove Script tag and on attributes from HTML

I have the following HTML, and I need to remove the script tags and any script-related attributes in the HTML. By script-related attributes I mean any attribute that starts with on.
<body>
  <script src="...">
  </script>
  <div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">
    <script type="text/javascript" language="javascript">
      //<![CDATA[
      function CreateFixedHeaders() {}//]]>
    </script>
    <script>
      var ClientReportfb64a4706a3749c484169e...
    </script>
</body>
My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. With BeautifulSoup off the table, I can see two options for doing this. The first option is splitting the strings and parsing based on index. This seems like a bad solution to me.
The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).
Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.
So for removing the attributes I have:
script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)
As I've said before, I personally think the above is a perfectly acceptable use of Regular Expressions with HTML. But I would still like to get some opinions on that usage.
Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:
<script(.*)</script>
The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.
I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.
So I need help to go against my nature and not be evil. I want to be evil and use RegEx, so somebody please show me the light and guide me to the promised land of non-Regular Expressions.
Thanks
Update:
It looks like I wasn't very clear about what my question actually is, I apologize for that. My question is how can I parse the HTML using pure Python without Regular Expressions?
<script(.*)</script>
As for the above code example, it's wrong. I know it is wrong, I was using it as an example of a starting point.
I hope this clears up my question some
Update 2
I just wanted to add a few more notes about what I am doing.
I am crawling a web site to get the data I need.
Once I have the page that contains the data, it is saved to the database.
Then the saved web page is displayed to the user.
The issue I am trying to solve happens here. The application throws a script error when you attempt to interact with the page, which forces the user to click on a confirmation box. The application is not a web browser but uses the web-browser DLL in Windows (I cannot remember its name at the moment).
The error in question only happens in this one page for this one web site.
Update 3
After adding the update I realized I was overthinking the problem; I was looking for a more generic solution. However, in this case that isn't what is needed.
The page is dynamically generated, but the script tags stay static. With that in mind the solution becomes much simpler: I no longer need to treat the input as HTML, just as static strings.
So the solution I'm looking at is
import re

def strip_script_tags(page_source: str) -> str:
    pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
    result = re.sub(pattern, "", page_source)
    pattern2 = re.compile(r'<script[\s\S]+?/script>')
    result = re.sub(pattern2, "", result)
    return result
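As a quick check, applied to an abbreviated version of the sample markup above (output shown in the comments):
page_source = '''<body>
<script src="x.js"></script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv">
<script>var x = 1;</script>
</div>
</body>'''

print(strip_script_tags(page_source))
# <body>
#
# <div id="oReportDiv">
#
# </div>
# </body>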
I would like to avoid Regular Expressions; however, since I'm limited to the standard library, regular expressions seem like the best solution in this case. Which means #skamazin's answer is correct.
As for removing all the attributes that start with on, you can try this
It uses the regex:
\s?on\w+="[^"]+"\s?
And substitutes with the empty string (deletion). So in Python it should be:
pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
subst = ""
result = re.sub(pattern, subst, page_source)
If you are trying to match anything between the script tags try:
<script[\s\S]+?/script>
The problem with your regex is that the dot (.) doesn't match the newline character. Pairing a character class with its complement, [\s\S], matches every possible character, including newlines. And make sure to use the ? in [\s\S]+? so that it is lazy instead of greedy.
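For completeness, if the regex requirement is ever relaxed, here is a standard-library sketch using html.parser that re-emits the markup while dropping script elements and on* attributes. It is a sketch, not a hardened sanitizer: comments and doctype declarations are silently dropped, and the class and function names are mine.
from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    # Re-emits HTML, dropping <script> elements and on* attributes.
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skipping = False  # True while inside a <script> element

    @staticmethod
    def _format_attrs(attrs):
        # keep every attribute that does not start with "on"
        return ''.join(
            ' %s="%s"' % (k, v) if v is not None else ' ' + k
            for k, v in attrs if not k.lower().startswith('on'))

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.skipping = True
        elif not self.skipping:
            self.parts.append('<%s%s>' % (tag, self._format_attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        if tag != 'script' and not self.skipping:
            self.parts.append('<%s%s/>' % (tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        if tag == 'script':
            self.skipping = False
        elif not self.skipping:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skipping:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if not self.skipping:
            self.parts.append('&#%s;' % name)

def strip_scripts(page_source):
    parser = ScriptStripper()
    parser.feed(page_source)
    parser.close()
    return ''.join(parser.parts)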

restructuredtext: what is the best way of extracting bibliographic and other fields?

I am putting together a website where the content is maintained as reStructuredText that is then converted into HTML. I need more control than e.g. rst2html.py gives, so I am using my own Python script that uses things like
docutils.core.publish_parts(source, writer_name='html')
to create the html.
publish_parts() gives me useful parts like the title, body, etc. However, it seems I must look elsewhere to get the values of rst fields like
:Authors:
:version:
etc. For this, I have been using publish_doctree() as in
doctree = core.publish_doctree(source).asdom()
and then going through this recursively using getElementsByTagName() as in
doctree.getElementsByTagName('authors')
doctree.getElementsByTagName('version')
etc.
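For concreteness, here is a self-contained sketch of what I am doing (the sample document is invented for illustration):
from docutils import core

source = """\
My Page Title
=============

:Authors: Alice; Bob
:Version: 0.1

Body text goes here.
"""

# publish_parts() for the rendered pieces
parts = core.publish_parts(source, writer_name='html')
print(parts['title'])  # My Page Title

# publish_doctree() for the bibliographic fields
doctree = core.publish_doctree(source).asdom()
for name in ('authors', 'version'):
    for node in doctree.getElementsByTagName(name):
        print(name, '=>', node.toxml())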
Using publish_doctree() to extract fields does the job, and that's good, but it does seem more convoluted than using e.g. publish_parts().
My question is simply whether this is the best recommended way of extracting out these rst fields, or is there a more direct and less convoluted way? If not, that is fine, but I thought I would inquire in case I am missing something.

Django - converting URL into links, images, objects

I'm creating a simple comment-like application and need to convert plain URLs into links, image links into images, and YouTube/Vimeo/etc. links into flash objects. E.g.:
http://foo.bar to <a href="http://foo.bar">http://foo.bar</a>
http://foo.bar/image.gif to <img src="http://foo.bar/image.gif"/>
etc.
Of course I could write all of that myself, but it is such an obvious piece of code that I think somebody has already written it (maybe even with splitting text into paragraphs). I was googling for some time but couldn't find anything comprehensive, just a few snippets. Does such a filter (or something like it) exist?
Thanks!
PS. There is urlize, but it handles only the first case.
Write a custom filter to handle all the necessary cases. Look at the source code for urlize to get started. You'll also need the urlize function from django.utils.html.
In your filter, first test for the first case and call urlize on it. Then handle the second case and any other cases you may have.
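A minimal sketch of such a filter (the module path, filter name, and image-extension test are all made up for illustration, and it does no escaping of its own, so treat it as a starting point, not production code):
# hypothetical module: myapp/templatetags/comment_filters.py
import re

from django import template
from django.template.defaultfilters import stringfilter
from django.utils.html import urlize
from django.utils.safestring import mark_safe

register = template.Library()

# crude "is this an image URL?" test -- adjust the extensions to taste
IMAGE_RE = re.compile(r'^https?://\S+\.(?:png|jpe?g|gif)$', re.IGNORECASE)

@register.filter
@stringfilter
def urlize_media(value):
    # urlize each word, except that image URLs become <img> tags
    words = []
    for word in value.split():
        if IMAGE_RE.match(word):
            words.append('<img src="%s"/>' % word)
        else:
            words.append(urlize(word))
    return mark_safe(' '.join(words))
In a template that would be used as {% load comment_filters %} followed by {{ comment.text|urlize_media }}. A YouTube/Vimeo branch would slot in the same way, keyed on those sites' URL patterns. Note that splitting on whitespace collapses line breaks, so paragraph handling needs a separate pass.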

Python lxml.html linebreaks?

I'm using lxml.html.cleaner to clean HTML from an input text. How can I change \n to <br /> in lxml.html?
Fairly easy, slightly hacky way: you could do this as a two-step process, assuming you have used lxml.html.parse (or whichever method) to build the DOM.
1. Iterate through the text and tail attributes of the nodes, doing string replacements. Look at the iterdescendants method, which walks through everything for you.
2. Run lxml.html.clean as per normal.
A more complex way would be to monkey patch the lxml.html.clean module. Unlike lots of lxml, this module is written in Python and is fairly accessible. For example, there is currently a _substitute_whitespace function.
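A sketch of step 1 (the helper name is mine; note that splicing a literal "<br />" string into .text would just get escaped on serialization, so you have to insert real br elements):
import lxml.html

def newlines_to_br(root):
    for el in list(root.iter()):
        if not isinstance(el.tag, str):  # skip comments and PIs
            continue
        # '\n' inside el.text becomes leading <br/> children of el
        if el.text and '\n' in el.text:
            parts = el.text.split('\n')
            el.text = parts[0]
            for i, chunk in enumerate(parts[1:]):
                br = lxml.html.Element('br')
                br.tail = chunk
                el.insert(i, br)
        # '\n' inside el.tail becomes <br/> siblings after el
        if el.tail and '\n' in el.tail and el.getparent() is not None:
            parts = el.tail.split('\n')
            el.tail = parts[0]
            parent = el.getparent()
            idx = parent.index(el)
            for i, chunk in enumerate(parts[1:]):
                br = lxml.html.Element('br')
                br.tail = chunk
                parent.insert(idx + 1 + i, br)

doc = lxml.html.fromstring('<div>line one\nline two</div>')
newlines_to_br(doc)
print(lxml.html.tostring(doc))  # b'<div>line one<br>line two</div>'
Then run lxml.html.clean over the tree as per normal (step 2).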
