I'm converting a piece of JS code to Python, and I have been using minidom, but certain things aren't working right. They were working fine when running in JavaScript. I'm converting because I want consistent changes/order (i.e. where the class attribute is added), as well as so I can use some of Python's easier features.
My latest issue that I've come across is this:
fonts = doc.getElementsByTagName('font')
while fonts.length > 0:
    # Create a new span
    span = doc.createElement("span")
    # Give it a class name based on the color (colors is a map)
    span.setAttribute("class", colors[fonts[0].getAttribute("color")])
    # Place all the children inside
    while fonts[0].firstChild:
        span.appendChild(fonts[0].firstChild)
    # end while
    # Replace the <font> with the <span>
    print(fonts[0].parentNode.toxml())
    fonts[0].parentNode.replaceChild(span, fonts[0])
# end while
The problem is that, unlike in JavaScript, the element isn't removed from fonts like it should be. Is there a better library I should be using that follows the standard (Level 3) DOM rules, or am I just going to have to hack around it if I don't want to use XPath (which is what all the other DOM parsers seem to use)?
Thanks.
You can see in the documentation for Python's DOM (at the very bottom of the page) that it doesn't work like a "real" DOM, in the sense that collections such as the one returned by getElementsByTagName are not "live": getElementsByTagName just returns a static snapshot of the matching elements at that moment. This usually isn't a problem in Python, because with xml.dom you're not working with a live-updating page inside a browser; you're manipulating a static DOM parsed from a file or string, so you know no other code is modifying the DOM while you aren't looking.
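For example, a minimal sketch (with made-up markup) showing the snapshot behaviour:

from xml.dom.minidom import parseString

doc = parseString('<p><font color="red">a</font><font color="blue">b</font></p>')
fonts = doc.getElementsByTagName('font')
print(len(fonts))  # 2 (minidom's NodeList also supports .length)

# Detaching a <font> from the tree does not shrink the list
fonts[0].parentNode.removeChild(fonts[0])
print(len(fonts))  # still 2 -- the collection is a snapshot, not "live"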
In most cases, you can probably get what you want by changing the structure of your code to reflect this. For this case, you should be able to accomplish your goal with something like this:
fonts = doc.getElementsByTagName('font')
for font in fonts:
    # Create a new span
    span = doc.createElement("span")
    # Give it a class name based on the color (colors is a map)
    span.setAttribute("class", colors[font.getAttribute("color")])
    # Place all the children inside
    while font.firstChild:
        span.appendChild(font.firstChild)
    # end while
    # Replace the <font> with the <span>
    font.parentNode.replaceChild(span, font)
The idea is that instead of always looking at the first element in fonts, you iterate over each one and replace them one at a time.
Because of these differences, if your JavaScript DOM code makes use of these sorts of on-the-fly DOM updates, you won't be able to port it "verbatim" to Python (using the same DOM calls). However, sometimes doing it in this less dynamic way can be easier, because things change less under your feet.
Related
I'm trying to get data from a sliding table on a website (like those stock market prices on some websites).
I'm using this line:
elem=driver.find_elements_by_xpath('/html/body/div[1]/div/div/article/div/div/div/div/div[1]/div/div/aside/div/div/div/ul/li')
It seems to get all the elements into the list just fine.
But once I use any method on the list, let's say:

for i in elem:
    print(i.text)

it actually just returns the values visible at that very moment.
Can somebody help?
So in most cases, try in the following order: getText(); if it doesn't work, use getAttribute('textContent'); if that too doesn't work, use getAttribute('value').
Note that getAttribute('value') works only if there actually is an attribute called value on your element (just like id, name, etc. must be present to be read).
So in most cases, if getText doesn't work, use .getAttribute('textContent').
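In Python's Selenium bindings the equivalents are .text and .get_attribute(). A minimal helper sketching that fallback order (the helper name is mine, not part of Selenium):

def element_text(element):
    # Try the visible text first, then fall back to textContent,
    # then to a value attribute if the element has one.
    text = element.text
    if text:
        return text
    text = element.get_attribute('textContent')
    if text:
        return text
    return element.get_attribute('value')

You would then call print(element_text(i)) inside the loop above.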
Use:
i.get_attribute("textContent")
Because getText or text() actually uses innerText, and will not detect text from hidden elements.
Don't get confused by the differences between Node.textContent and
HTMLElement.innerText. Although the names seem similar, there are
important differences:
textContent gets the content of all elements, including <script> and
<style> elements. In contrast, innerText only shows "human-readable" elements.
textContent returns every element in the node. In contrast, innerText
is aware of styling and won’t return the text of “hidden” elements.
Moreover, since innerText takes CSS styles into account, reading the
value of innerText triggers a reflow to ensure up-to-date computed
styles. (Reflows can be computationally expensive, and thus should be
avoided when possible.)
Unlike textContent, altering innerText in Internet Explorer (version
11 and below) removes child nodes from the element and permanently
destroys all descendant text nodes. It is impossible to insert the
nodes again into any other element or the same element after
doing so.
https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent
Try

for i in elem:
    print(i.get_attribute('textContent'))

to get text from hidden elements as well.
I'm making a program using PyQt5. One of its functions is printing HTML with some tags highlighted in different colours. Every new string is processed and added to the text using the .append method. I need to print the HTML itself as plain text, which is why the QTextEdit class is not suitable. To solve this problem, one can use QPlainTextEdit. But then I got a new problem: now I can't use <font> tags to assign a colour to a certain tag. Escaping the tags in the QTextEdit class is not a good idea. Also, I can't assign a colour to the whole field.
How can I solve this problem?
P.S. Sorry for mistakes in my English. You can tell me about them.
I would like to make a comment but I don't have enough reputation.
The comment section already has a good way of doing it, and here is another way.
You can just import html and use html.escape(text). This way you can escape the parts of the HTML that are supposed to be literal strings while keeping the rest of the HTML working. This way you can also keep using QTextEdit.
Here's a quick example. What I did was:

a = html.escape("<font size=\"3\" color=\"red\">This is some text!</font>")
self.append(a)

The escaped <font> tag is then displayed as literal text instead of being rendered.
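A minimal self-contained sketch of the same idea (the widget setup is mine, not from the original post):

import html
import sys
from PyQt5.QtWidgets import QApplication, QTextEdit

app = QApplication(sys.argv)
view = QTextEdit()
view.setReadOnly(True)

raw = '<font size="3" color="red">This is some text!</font>'
# Escaped: the tag shows up as literal text
view.append(html.escape(raw))
# Unescaped markup still renders, so the two can be mixed
view.append('<b>source:</b> ' + html.escape(raw))

view.show()
sys.exit(app.exec_())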
Just started using XPath, I'm parsing a website with lxml.
Is it preferable to do:
number_I_want = parsed_body.xpath('.//b')[6].text
# or
number_I_want = parsed_body.xpath('.//span[@class="class_name"]')[0].text
I'd rather find this out now, rather than much further down the line. Actually I couldn't get something like the second expression to work for my particular case.
But essentially, the question: is it better to rely on class names (or other keywords) or indices of occurrence (such as 7th occurrence of bolded text)?
I'd say that it is generally better to rely on id attributes, or class by default, than on the number and order of appearance of specific tags.
That is more resilient to change in the page content.
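For example, in this made-up snippet (the markup and class name are hypothetical), both queries find the same node today, but only the class-based one survives if another <b> is inserted earlier in the page:

from lxml import html

page = html.fromstring(
    '<div><b>label</b><span class="class_name"><b>42</b></span></div>')

by_index = page.xpath('.//b')[1].text  # breaks if a <b> is added before it
by_class = page.xpath('.//span[@class="class_name"]/b')[0].text  # keyed to meaning
print(by_index, by_class)  # 42 42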
I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:
from lxml import etree
for e, tag in etree.iterparse(source, tag='Foo'):
    print(tag.xpath('bar/baz')[42])  # there's actually a function call here
The problem is, some of the documents have a namespace declaration, and some don't have any. That means that in the code above both tag='Foo' and xpath parts won't work.
For now I've been putting up with the ugly
for e, tag in etree.iterparse(source):
    if tag.tag.endswith('Foo'):
        print(tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])
but this is so awful that I want to get it right even though it works fine. (I guess it should be slower, too.)
Is there a way to write sane code that would account for both cases using iterparse?
For now I can only think of catching start-ns and end-ns events and updating a "state-keeping" variable, which I'll have to pass to the function that is called within the loop to do the work. The function will then construct the xpath queries accordingly. This makes some sense, but I'm wondering if there's a simpler way around this.
P.S. I've obviously tried searching around, but haven't found a solution that would work both with and without a namespace. I would also accept a solution that eliminates namespaces from the XML, but only if it doesn't store the whole tree in RAM in the process.
All elements have a .nsmap mapping attribute; use it to detect your namespace and branch accordingly.
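A sketch of that branching (my own illustration; it assumes bar and baz live in the same namespace as Foo, and source is the same input as in the question):

from lxml import etree

for event, elem in etree.iterparse(source):
    qname = etree.QName(elem.tag)  # splits '{uri}Foo' into namespace + local name
    if qname.localname != 'Foo':
        continue
    if qname.namespace:
        hits = elem.xpath('d:bar/d:baz', namespaces={'d': qname.namespace})
    else:
        hits = elem.xpath('bar/baz')
    print(hits[42])  # stand-in for the real function call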
I am using http://code.google.com/p/feedparser/ to write a simple news integrator.
But I want pure text (with <p> tags) and no URLs or images (i.e. no <a> or <img> tags).
Here are two methods to do that:
1. Edit the source code: http://code.google.com/p/feedparser/source/browse/branches/f8dy/feedparser/feedparser.py

class _HTMLSanitizer(_BaseHTMLProcessor):
    acceptable_elements = [....]

Simply remove the a & img tags.
2. Patch it at runtime:

import feedparser
feedparser._HTMLSanitizer.acceptable_elements.remove('a')
feedparser._HTMLSanitizer.acceptable_elements.remove('img')

When I use feedparser, this first removes the two tags. (Note that remove() mutates the list in place and returns None, so its result must not be assigned back.)
Which method is better?
Are there any other good methods?
Thanks a lot!
Usually, the quicker option is better, and that can be determined using Python's timeit module. But in your case, I'd prefer not to alter the source code and instead stick with the second option; it helps maintainability.
Other options include writing a custom parser (using a C extension for maximum speed) or just letting your site's templating engine (Django, maybe?) strip those tags. Well, I've changed my mind, the last solution seems the best all-around...
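For reference, a sketch of the second option in use (the feed URL is a placeholder):

import feedparser

# Drop <a> and <img> from the sanitizer's whitelist before parsing.
for tag in ('a', 'img'):
    if tag in feedparser._HTMLSanitizer.acceptable_elements:
        feedparser._HTMLSanitizer.acceptable_elements.remove(tag)

feed = feedparser.parse('http://example.com/feed.xml')
for entry in feed.entries:
    print(entry.title)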