I'm using lxml.html.clean's Cleaner to clean HTML from an input text. How can I change \n to <br /> in lxml.html?
Fairly easy, slightly hacky way: you could do this as a two-step process, assuming you have already built the DOM with lxml.html.parse (or whichever method):
1. Iterate through the text and tail attributes of the nodes and do the replacements there. Look at the iterdescendants method, which walks through everything for you. A sketch follows this list.
2. Run lxml.html.clean as per normal.
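Here's a rough, untested sketch of step 1. Note that a plain string replacement isn't quite enough, because <br /> has to be a real element in the tree, so each \n becomes an inserted <br> element that carries the remaining text in its tail:

import lxml.html

def newlines_to_br(root):
    # Snapshot the elements first, since we insert new ones as we go
    for el in [root] + list(root.iterdescendants()):
        if el.tag in ('script', 'style', 'pre'):
            continue
        if el.text and '\n' in el.text:
            parts = el.text.split('\n')
            el.text = parts[0]
            for i, part in enumerate(parts[1:]):
                br = lxml.html.Element('br')
                br.tail = part
                el.insert(i, br)  # the <br>s go before any existing children
        if el.tail and '\n' in el.tail and el.getparent() is not None:
            parts = el.tail.split('\n')
            el.tail = parts[0]
            parent = el.getparent()
            pos = parent.index(el) + 1
            for i, part in enumerate(parts[1:]):
                br = lxml.html.Element('br')
                br.tail = part
                parent.insert(pos + i, br)

# Step 2, cleaning as per normal:
# doc = lxml.html.fromstring(raw)
# newlines_to_br(doc)
# cleaned = lxml.html.clean.Cleaner().clean_html(doc)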
A more complex way would be to monkey patch the lxml.html.clean module. Unlike lots of lxml, this module is written in Python and is fairly accessible. For example, there is currently a _substitute_whitespace function.
I'm converting a piece of JS code to Python, and I have been using minidom, but certain things aren't working right even though they were working fine in JavaScript. I'm converting because I want consistent changes and ordering (i.e. where the class attribute is added), and so I can use some of Python's easier features.
My latest issue that I've come across is this:
fonts = doc.getElementsByTagName('font')
while fonts.length > 0:
    # Create a new span
    span = doc.createElement("span")
    # Give it a class name based on the color (colors is a map)
    span.setAttribute("class", colors[fonts[0].getAttribute("color")])
    # Place all the children inside
    while fonts[0].firstChild:
        span.appendChild(fonts[0].firstChild)
    # end while
    # Replace the <font> with the <span>
    print(fonts[0].parentNode.toxml())
    fonts[0].parentNode.replaceChild(span, fonts[0])
# end while
The problem is that, unlike in JavaScript, the element isn't removed from fonts like it should be. Is there a better library I should be using that follows the standard (Level 3) DOM rules, or am I just going to have to hack around it if I don't want to use XPath (which all the other DOM parsers seem to use)?
Thanks.
You can see in the documentation for Python's DOM (at the very bottom of the page) that it doesn't work like a "real" DOM, in the sense that collections such as the one you get from getElementsByTagName are not "live": getElementsByTagName just returns a static snapshot of the matching elements at that moment. This usually isn't a problem in Python, because with xml.dom you're not working with a live-updating page inside a browser; you're just manipulating a static DOM parsed from a file or string, so you know no other code is changing the DOM while you aren't looking.
In most cases, you can probably get what you want by changing the structure of your code to reflect this. For this case, you should be able to accomplish your goal with something like this:
fonts = doc.getElementsByTagName('font')
for font in fonts:
    # Create a new span
    span = doc.createElement("span")
    # Give it a class name based on the color (colors is a map)
    span.setAttribute("class", colors[font.getAttribute("color")])
    # Place all the children inside
    while font.firstChild:
        span.appendChild(font.firstChild)
    # end while
    # Replace the <font> with the <span>
    font.parentNode.replaceChild(span, font)
The idea is that instead of always looking at the first element in fonts, you iterate over each one and replace them one at a time.
Because of these differences, if your JavaScript DOM code makes use of these sorts of on-the-fly DOM updates, you won't be able to port it "verbatim" to Python (using the same DOM calls). However, sometimes doing it in this less dynamic way can be easier, because things change less under your feet.
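For completeness, here is a small self-contained sketch of the snapshot-friendly version above (the input markup and the colors map are made up for illustration):

from xml.dom import minidom

colors = {"red": "c-red", "blue": "c-blue"}  # hypothetical map
doc = minidom.parseString(
    '<div><font color="red">one</font> <font color="blue">two</font></div>')

# getElementsByTagName returns a plain static list, so iterating while
# replacing nodes in the tree is safe
for font in doc.getElementsByTagName('font'):
    span = doc.createElement("span")
    span.setAttribute("class", colors[font.getAttribute("color")])
    while font.firstChild:
        span.appendChild(font.firstChild)
    font.parentNode.replaceChild(span, font)

print(doc.documentElement.toxml())
# <div><span class="c-red">one</span> <span class="c-blue">two</span></div>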
I have the following HTML, and I need to remove the script tags and any script-related attributes in it. By script-related attributes I mean any attribute that starts with on.
<body>
<script src="...">
</script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">
<script type="text/javascript" language="javascript">
//<![CDATA[
function CreateFixedHeaders() {}//]]>
</script>
<script>
var ClientReportfb64a4706a3749c484169e...
</script>
</body>
My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. Seeing that BeautifulSoup is off the table, I can see two options for doing this. The first is splitting the strings and parsing based on index, which seems like a bad solution to me.
The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).
Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.
So for removing the attributes I have:
script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)
As I've said before, I personally think the above is a perfectly acceptable use of Regular Expressions with HTML. But I would still like to get some opinions on that usage.
Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:
<script(.*)</script>
The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.
I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.
So I need help going against my nature and not being evil. I want to be evil and use RegEx, so somebody please show me the light and guide me to the promised land of non-Regular Expressions.
Thanks
Update:
It looks like I wasn't very clear about what my question actually is, I apologize for that. My question is how can I parse the HTML using pure Python without Regular Expressions?
<script(.*)</script>
As for the above code example, it's wrong. I know it is wrong, I was using it as an example of a starting point.
I hope this clears up my question some
Update 2
I just wanted to add a few more notes about what I am doing.
I am crawling a web site to get the data I need.
Once we have the page that contains the data we need it is saved to the database.
Then the saved web page is displayed to the user.
The issue I am trying to solve happens here: the application throws a script error when you attempt to interact with the page, forcing the user to click through a confirmation box. The application is not a web browser, but it uses the web browser DLL in Windows (I cannot remember the name at the moment).
The error in question only happens in this one page for this one web site.
Update 3
After adding the update I realized I was overthinking the problem; I was looking for a more generic solution than this case needs. The page is dynamically generated, but the script tags stay static. With that in mind the solution becomes much simpler: I no longer need to treat the input as HTML, just as static strings.
So the solution I'm looking at is:

import re

def strip_script_tags(page_source: str) -> str:
    # Strip inline event-handler attributes like onscroll="..."
    pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
    result = re.sub(pattern, "", page_source)
    # Strip whole <script>...</script> blocks (lazy match spans newlines)
    pattern2 = re.compile(r'<script[\s\S]+?/script>')
    result = re.sub(pattern2, "", result)
    return result
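A quick sanity check against a cut-down version of the markup above (hypothetical input, just to show the two substitutions composing):

sample = ('<body><script src="..."></script>'
          '<div onscroll="CreateFixedHeaders()">x</div></body>')
print(strip_script_tags(sample))
# -> <body><div>x</div></body>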
I would like to avoid Regular Expressions; however, since I'm limited to the standard library, Regular Expressions seem like the best solution in this case. Which means #skamazin's answer is correct.
As for removing all the attributes that start with on, you can try this
It uses the regex:
\s?on\w+="[^"]+"\s?
And substitutes with the empty string (deletion). So in Python it should be:
pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
subst = ""
result = re.sub(pattern, subst, page_source)
If you are trying to match anything between the script tags try:
<script[\s\S]+?/script>
The problem with your regex is that the dot (.) doesn't match the newline character. [\s\S] (whitespace or non-whitespace) matches every possible character, including newlines. And make sure to use the ? in [\s\S]+? so that the match is lazy instead of greedy.
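A small illustration of the difference (re.DOTALL, which makes . match newlines too, would be the other common fix):

import re

html = '<script>\nvar x = 1;\n</script>'
print(re.search(r'<script(.*)</script>', html))     # None: '.' stops at \n
print(re.search(r'<script[\s\S]+?/script>', html))  # matches across lines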
I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:
from lxml import etree

for e, tag in etree.iterparse(source, tag='Foo'):
    print(tag.xpath('bar/baz')[42])  # there's actually a function call here
The problem is, some of the documents have a namespace declaration and some don't have any, which means that in the code above neither the tag='Foo' filter nor the xpath call will work for both kinds.
For now I've been putting up with the ugly
for e, tag in etree.iterparse(source):
    if tag.tag.endswith('Foo'):
        print(tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])
but this is so awful that I want to get it right even though it works fine. (I guess it should be slower, too.)
Is there a way to write sane code that would account for both cases using iterparse?
For now I can only think of catching start-ns and end-ns events and updating a "state-keeping" variable, which I'll have to pass to the function that is called within the loop to do the work. The function will then construct the xpath queries accordingly. This makes some sense, but I'm wondering if there's a simpler way around this.
P.S. I've obviously tried searching around, but haven't found a solution that would work both with and without a namespace. I would also accept a solution that eliminates namespaces from the XML, but only if it doesn't store the whole tree in RAM in the process.
All elements have a .nsmap mapping attribute; use it to detect your namespace and branch accordingly.
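For instance, a sketch along those lines (the function name is mine; QName pulls the namespace straight off the element's tag, and .nsmap would give you the same information via the prefix map):

from lxml import etree

def iter_foo_hits(source):
    for event, elem in etree.iterparse(source):
        qname = etree.QName(elem.tag)
        if qname.localname != 'Foo':
            continue
        if qname.namespace:
            # Namespaced document: bind a prefix for the xpath call
            hits = elem.xpath('n:bar/n:baz', namespaces={'n': qname.namespace})
        else:
            # Namespace-free document: the plain path works
            hits = elem.xpath('bar/baz')
        yield hits
        elem.clear()  # keeps memory bounded; only clear once you're done with the hits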
I am using http://code.google.com/p/feedparser/ to write a simple news aggregator.
But I want plain text (with <p> tags), and no URLs or images (i.e. no <a> or <img> tags).
Here are two methods to do that:
1. Edit the source code (http://code.google.com/p/feedparser/source/browse/branches/f8dy/feedparser/feedparser.py):

class _HTMLSanitizer(_BaseHTMLProcessor):
    acceptable_elements = [....]

Simply remove the a and img tags from that list.
2. Monkey-patch it at runtime, before parsing:

import feedparser
# list.remove() mutates in place and returns None, so don't assign the
# result back to acceptable_elements
feedparser._HTMLSanitizer.acceptable_elements.remove('a')
feedparser._HTMLSanitizer.acceptable_elements.remove('img')

That is, whenever I use feedparser, remove the two tags first.
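An end-to-end sketch of the second method (the feed URL is a placeholder; feedparser sanitizes entry HTML by default, so the patched whitelist takes effect when you parse):

import feedparser

feedparser._HTMLSanitizer.acceptable_elements.remove('a')
feedparser._HTMLSanitizer.acceptable_elements.remove('img')

d = feedparser.parse('http://example.com/feed.xml')  # placeholder URL
print(d.entries[0].summary)  # <a>/<img> tags are stripped, <p> survives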
Which method is better?
Are there any other good methods?
Thanks a lot!
Usually, quicker is better, and that can be measured with Python's timeit module. But in your case I'd prefer not to alter the source code and would stick with the second option; it helps maintainability. Other options include writing a custom parser (use a C extension for maximum speed) or just letting your site's templating engine (Django, maybe?) strip those tags. Actually, I've changed my mind: that last option seems the best all-around.
I am trying to override the default implementation of the raw/endraw block tag in jinja2. I am familiar with how to write custom tag extensions, but in this case my extension is not firing (the default implementation of the raw tag is still being called).
Can this even be done? If not, can someone point me to where in the source the raw tag is implemented so I can patch it to fit my needs.
Thanks.
It looks like overriding the raw/endraw tags is not supported.
The code for dealing with the raw/endraw tags is directly in the lexer, and the handling is hard coded.
So you would probably have to patch the code. Luckily, the code is hosted on GitHub, so it would be easy to keep your own shallow fork of jinja2 while staying up to date with future improvements from the main distribution.