Using getElementsByTagName from xml.dom.minidom - python

I'm going through Asheesh Laroia's "Scrape the Web" presentation from PyCon 2010 and I have a question about a particular line of code which is this line:
title_element = parsed.getElementsByTagName('title')[0]
from the function:
def main(filename):
    # Parse the file
    parsed = xml.dom.minidom.parse(open(filename))
    # Get title element
    title_element = parsed.getElementsByTagName('title')[0]
    # Print just the text underneath it
    print title_element.firstChild.wholeText
I don't know what role '[0]' is performing at the end of that line. Does 'xml.dom.minidom.parse' parse the input into a list?

parse() does not return a list; getElementsByTagName() does. You're asking for all elements with a tag of <title>. Most tags can appear multiple times in a document, so when you ask for those elements, you'll get more than one. The obvious way to return them is as a list or tuple.
In this case you expect only one <title> tag in the document, so you just take the first element in the list.
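To make this concrete, here is a minimal sketch using parseString with a throwaway document instead of the talk's file:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<html><head><title>Hello</title></head><body/></html>")

elements = doc.getElementsByTagName('title')  # a list-like NodeList
title_element = elements[0]                   # [0] takes the first match
print(len(elements))                          # 1
print(title_element.firstChild.wholeText)     # Hello
```

In minidom the return value is a NodeList, which subclasses the built-in list, so indexing, len() and iteration all work as you'd expect.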

This method's (getElementsByTagName) documentation says:
Search for all descendants (direct children, children’s children,
etc.) with a particular element type name.
Since it mentions "all descendants", then yes, in all likelihood it returns a list, which this code simply indexes to take the first element.
Looking at the code of this method (in Lib/xml/dom/minidom.py) - it indeed returns a list.

Related

How to delete bs4.element.Tag element in Python list?

I have a Python list which is
url_list = [<img src="https://test.com/temp.jpg" style="display:block"/>, <img src="https://test.com/not_temp.jpg" style="display:block"/>]
Both elements in that list are of type 'bs4.element.Tag'.
How do I delete '<img src="https://test.com/temp.jpg" style="display:block"/>' element while keeping its 'bs4.element.Tag' type?
And the list will keep changing over time, so del url_list[0] is not going to work.
I tried url_list.remove('<img src="https://test.com/temp.jpg" style="display:block"/>')
but it didn't work since its type was different.
Edit:
I want to remove this exact '<img src="https://test.com/temp.jpg" style="display:block"/>' element, and "while keeping its 'bs4.element.Tag' type" means I don't want to change the type of the list's elements.
Convert the string representation of the tag into a BS object:
tag = '<img src="https://test.com/temp.jpg" style="display:block"/>'
unwanted = bs4.BeautifulSoup(tag, 'html.parser').img
And remove it:
url_list.remove(unwanted)
Easiest thing is probably to simply go through every tag and check whether it has a certain attribute value, which you can do with the tag.get() method. Note that removing items from a list while iterating over it skips elements, so iterate over a copy:
for tag in url_list[:]:  # iterate over a copy so removal is safe
    if tag.get('src') == 'some_url':
        url_list.remove(tag)
The get() method can be used to extract any of the individual attributes of the tag, not just src. How you filter out which tag to remove is then up to you.
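A runnable sketch of filtering by src, using the question's two tags (the html.parser choice and the list comprehension are mine, not from the original answers):

```python
import bs4

html = ('<img src="https://test.com/temp.jpg" style="display:block"/>'
        '<img src="https://test.com/not_temp.jpg" style="display:block"/>')
url_list = bs4.BeautifulSoup(html, 'html.parser').find_all('img')

# Rebuild the list without the unwanted URL; every element stays a Tag.
url_list = [tag for tag in url_list
            if tag.get('src') != 'https://test.com/temp.jpg']

print([tag['src'] for tag in url_list])  # ['https://test.com/not_temp.jpg']
```

Building a new list sidesteps the remove-while-iterating problem entirely, and the surviving elements keep their bs4.element.Tag type.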

Get the List of Multiple Elements with the Same ClassName

I'd like to crawl every case whose Panel Report has already been composed from the WTO official page
(https://www.wto.org/english/tratop_e/dispu_e/dispu_status_e.htm).
Every case is indexed with "DS XXX", and right below it the page denotes whether the "Panel Composed" or the case is still "in Consultation".
If I inspect, they all share the same
<p class="panel-title-simple">
So I tried the following two commands:
elem_info = driver.find_element_by_class_name("panel-title-simple")
elem_info = driver.find_element_by_xpath("//p[@class='panel-title-simple']")
but each of them only gives me the topmost case, the most recent one.
I have to locate every case's info and then make a for-loop to check whether the panel is composed or not.
How could I do that?
Use find_elements (note the 's'). This returns a list that you can then loop through:
documents = driver.find_elements_by_class_name("panel-title-simple")
for document in documents:
    # continue with your code
You can use the XPath below to get all the <li> elements that have a current status of 'Panel composed':
//li[.//p[contains(.,'Panel composed')]]
From there you can get the DS number
.//small
or the details
./p
and so on.
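That XPath can be tried offline with lxml. The markup below is a toy reconstruction of the page structure as described in the question, not the live WTO HTML:

```python
from lxml import html

page = html.fromstring("""
<ul>
  <li><small>DS101</small><p class="panel-title-simple">Panel composed</p></li>
  <li><small>DS102</small><p class="panel-title-simple">In consultations</p></li>
  <li><small>DS103</small><p class="panel-title-simple">Panel composed</p></li>
</ul>""")

# Keep only the <li> items whose status paragraph says 'Panel composed',
# then read each matching item's DS number from its <small> child.
composed = page.xpath("//li[.//p[contains(., 'Panel composed')]]")
print([li.xpath(".//small")[0].text for li in composed])  # ['DS101', 'DS103']
```

The same two expressions plug straight into Selenium's find_elements(By.XPATH, ...) once the structure matches the real page.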

Python Beautifulsoup finding correct tag

I'm having a problem trying to figure out how to grab the specific tag I need.
<div class="meaning"><span class="hinshi">[名]</span><span class="hinshi">(スル)</span></div>, <div class="meaning"><b>1</b> 今まで経験してきた仕事・身分・地位・学業などの事柄。履歴。「―を偽る」</div>,
Right now I have it so it finds all the "meaning" classes, but I need to narrow it down further to get what I want. Above is an example. I need to grab just the
<div class="meaning"><b>
divs and ignore all the "hinshi" classes.
Edit: It seems to be showing the number, which I guess is what the <b> holds, but I need the text next to it. Any ideas?
You can find a specific attribute by using keyword arguments to the find method. In your case, you'll want to match on the class_ keyword. See the documentation regarding the class_ keyword.
Assuming that you want to filter the elements that don't contain any children with the "hinshi" class, you could try something like this:
soup = BeautifulSoup(data)
potential_matches = soup.find_all(class_="meaning")
matches = []
for match in potential_matches:
    bad_children = match.find_all(class_="hinshi")
    if not bad_children:
        matches.append(match)
return matches
If you'd like, you could make it a little shorter, for example:
matches = soup.find_all(class_="meaning")
return [x for x in matches if not x.find_all(class_="hinshi")]
Or, depending on your Python version, i.e. 2.x (note that filter takes the function first, then the iterable):
matches = soup.find_all(class_="meaning")
return filter(lambda x: not x.find_all(class_="hinshi"), matches)
EDIT: If you want to find the foreign characters next to the number in your example, you should first remove the b element, then use the get_text method. For example
# Assuming `element` is one of the matches from above
element.find('b').extract()
print(element.get_text())
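Put together, a runnable sketch of that approach on the question's sample markup (the html.parser choice is mine):

```python
from bs4 import BeautifulSoup

data = ('<div class="meaning"><span class="hinshi">[名]</span>'
        '<span class="hinshi">(スル)</span></div>'
        '<div class="meaning"><b>1</b> '
        '今まで経験してきた仕事・身分・地位・学業などの事柄。履歴。「―を偽る」</div>')
soup = BeautifulSoup(data, 'html.parser')

# Keep only the "meaning" divs that have no "hinshi" children.
matches = [x for x in soup.find_all(class_='meaning')
           if not x.find_all(class_='hinshi')]

# Drop the <b> number, leaving just the definition text.
for element in matches:
    element.find('b').extract()
    print(element.get_text().strip())
```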
You could try using the .select function, which takes a CSS selector:
soup.select('.meaning b')
You can just do it like this:
for s in soup.findAll("div", {"class": "meaning"}):
    for b in s.findAll("b"):
        print(b.get_text())
Then adjust the body of the inner loop to do whatever you need with each result.

Python: Parsing XML autoadd all key/value pairs

I have searched a long time and tried a lot, but I can't get my head around this seemingly easy scenario. I should say that I'm a Python newbie but a decent bash coder ;o) I have written some code with Python, but there is clearly a lot I still need to learn, so don't be too harsh on me ;o) I'm willing to learn, I've read the Python docs and many examples, and I've tried a lot on my own, but now I'm at a point where I'm poking in the dark.
I parse content provided as XML. It is about 20-50 MB big.
My XML Example:
<MAIN>
    <NOSUBEL>abcd</NOSUBEL>
    <NOSUBEL2>adasdasa</NOSUBEL2>
    <MULTISUB>
        <WHATEVER>
            <ANOTHERSUBEL>
                <ANOTHERONE>
                (how many levels can not be said / can change)
                </ANOTHERONE>
            </ANOTHERSUBEL>
        </WHATEVER>
    </MULTISUB>
    <SUBEL2>
        <FOO>abcdefg</FOO>
    </SUBEL2>
    <NOSUBEL3>abc</NOSUBEL3>
    ...
    and so on
</MAIN>
This is the main part of parsing it (if you need more details pls ask):
from lxml import etree

resp = my.request(some call args)
xml = etree.XML(resp)

for element in xml.findall(".//MAIN"):
    # this works fine but is not generic enough:
    my_dict = OrderedDict()
    for only1sub in element.iter(tag="SUBEL2"):
        for i in only1sub:
            my_dict[i.tag] = i.text
This works just fine with one subelement, but it means I need to know which elements in the tree have subelements and which do not. That could change in the future, or new elements could be added.
Another problem is MULTISUB. With the above code I'm only able to parse down to the first tag.
The goal
What I WANT to achieve is - at best:
A) Having one function / code snippet which is able to parse the whole XML content and if there is a subelement (e.g. with "if len(x)" or whatever) then parse to the next level until you reach a level without a subelement/tree. Then go on to B)
B) For each XML tag found which has NO subelements I want to update the dictionary with the tag name and the tag text.
C) I want to do that for all available elements - the tag and the direct child tag names (e.g. "NOSUBEL2" or "MULTISUB") will not change (often) so it will be ok to use them as a start point for parsing.
What I tried so far was to chain several loops (for, while, for again, and so on), but nothing was fully successful. I also dipped into Python generators because I thought I could do something with the next() function, but no luck there either. Then again, I may not have the knowledge to use them correctly, so I'm happy for every answer.
In the end, the thing I need is so easy, I believe: I only want key/value pairs from the tag names and the tag contents. That can't be so hard? Any help greatly appreciated.
Can you help me reaching the goal?
(Already a thanks for reading until here!)
What you are looking for is recursion: a technique where a procedure calls itself, but on a sub-problem of the original problem. In this case: for each subelement of some element, either run this procedure again (if that subelement itself has subelements) or update your dictionary with the element's tag name and text.
I assume at the end you're interested in having dictionary (OrderedDict) containing "flat representation" of whole element tree's leaves' (nodes without subelements) tag names/text values, which in your case, printed out, would look like this:
OrderedDict([('NOSUBEL', 'abcd'), ('NOSUBEL2', 'adasdasa'), ('ANOTHERONE', '(how many levels can not be said / can change)'), ('FOO', 'abcdefg'), ('NOSUBEL3', 'abc')])
Generally, you would define a function that will either call itself with part of your data (in this case: subelements, if there are any) or do something (in this case: update some instance of dictionary).
Since I don't know the details behind my.request call, I've replaced that by parsing from string containing valid XML, based on the one you provided. Just replace constructing the tree object.
resp = """<MAIN>
    <NOSUBEL>abcd</NOSUBEL>
    <NOSUBEL2>adasdasa</NOSUBEL2>
    <MULTISUB>
        <WHATEVER>
            <ANOTHERSUBEL>
                <ANOTHERONE>(how many levels can not be said / can change)</ANOTHERONE>
            </ANOTHERSUBEL>
        </WHATEVER>
    </MULTISUB>
    <SUBEL2>
        <FOO>abcdefg</FOO>
    </SUBEL2>
    <NOSUBEL3>abc</NOSUBEL3>
</MAIN>"""

from collections import OrderedDict
from lxml import etree


def update_dict(element, my_dict):
    # lxml defines the "length" of an element as its number of children.
    if len(element):  # If "length" is other than 0.
        for subelement in element:
            # That's where the recursion happens. We're calling the same
            # function for a subelement of the element.
            update_dict(subelement, my_dict)
    else:  # Otherwise, the element is a leaf.
        my_dict[element.tag] = element.text


if __name__ == "__main__":
    # Change/amend this with your my.request call.
    tree = etree.XML(resp)  # That's a <MAIN> element, too.
    my_dict = OrderedDict()
    # That's the first invocation of the procedure. We're passing the
    # entire tree and an instance of the dictionary.
    update_dict(tree, my_dict)
    print(my_dict)  # Just to see that the dictionary was filled with values.
As you can see, I didn't use any tag name in the code (except for the XML source, of course).
I've also added missing import from collections.
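For comparison, the same leaf-only collection can be written without explicit recursion, since lxml's iter() already walks every descendant in document order. A sketch, shown here on a trimmed-down resp:

```python
from collections import OrderedDict
from lxml import etree

resp = ("<MAIN><NOSUBEL>abcd</NOSUBEL>"
        "<SUBEL2><FOO>abcdefg</FOO></SUBEL2></MAIN>")
tree = etree.XML(resp)

# iter() yields every element; keep only leaves (len(el) == 0).
my_dict = OrderedDict((el.tag, el.text) for el in tree.iter() if not len(el))
print(my_dict)  # OrderedDict([('NOSUBEL', 'abcd'), ('FOO', 'abcdefg')])
```

The recursive version makes the "descend until there are no children" logic explicit, which is why the answer above teaches it; iter() simply hides that walk inside the library.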

pulling multiple values from python ElementTree with lxml and xpath

I am almost certainly doing this horribly wrong, and the cause of my problem is my own ignorance, but reading python docs and examples isn't helping.
I am web-scraping. The pages I am scraping have the following salient elements:
<div class='parent'>
    <span class='title'>
        <a>THIS IS THE TITLE</a>
    </span>
    <div class='copy'>
        <p>THIS IS THE COPY</p>
    </div>
</div>
My objective is to pull the text nodes from 'title' and 'copy', grouped by their parent div. In the above example, I should like to retrieve a tuple ('THIS IS THE TITLE', 'THIS IS THE COPY')
Below is my code
## 'tree' is the ElementTree of the document I've just pulled
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)

arr = []
for i in filtered_html:
    title_filter = "//span[@class='author']/a/text()"  # xpath for title text
    copy_filter = "//div[@class='copy']/p/text()"      # xpath for copy text
    title = i.getroottree().xpath(title_filter)
    copy = i.getroottree().xpath(copy_filter)
    arr.append((title, copy))
I'm expecting filtered_html to be a list of n elements (which it is). I'm then trying to iterate over that list of elements and for each one, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration, I'm expecting title to be a list of length 1, containing the title text for element i, and copy to be a corresponding list for the copy text.
What I end up with: at every iteration, title is a list of length n containing all elements in the document matching the title_filter xpath expression, and copy is a corresponding list of length n for the copy text.
I'm sure that by now, anyone who knows what they're doing with xpath and etree can recognise I'm doing something horrible and mistaken and stupid. If so, can they please tell me how I should be doing this instead?
Your core problem is that the getroottree call you're making on each element resets your xpath search to the whole tree. getroottree does exactly what it sounds like: it returns the root element tree of the element you call it on. Drop that call and make the inner paths relative (a leading dot, e.g. ".//span[@class='title']/a/text()"), so that each search is scoped to the current parent div, and you'll get what you want.
I personally would use the iterfind method on the element tree for my main loop, and would probably use the findtext method on the resulting elements to ensure that I receive only one title and one copy.
My (untested!) code would look like this:
parent_div_xpath = ".//div[@class='parent']"
title_filter = ".//span[@class='title']/a"
copy_filter = ".//div[@class='copy']/p"
arr = [(i.findtext(title_filter), i.findtext(copy_filter)) for i in tree.iterfind(parent_div_xpath)]
Alternately, you could skip explicit iteration entirely:
from itertools import izip  # Python 2; on Python 3 use the built-in zip

title_filter = "//div[@class='parent']/span[@class='title']/a/text()"
copy_filter = "//div[@class='parent']/div[@class='copy']/p/text()"
arr = izip(tree.findall(title_filter), tree.findall(copy_filter))
You might need to drop the text() call from the xpath and move it into a generator expression, I'm not sure offhand whether findall will respect it. If it doesn't, something like:
arr = izip((title.text for title in tree.findall(title_filter)), (copy.text for copy in tree.findall(copy_filter)))
And you might need to tweak that xpath if having more than one title/copy pair in a parent div is a possibility.
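To see the relative-path fix in action, here's a self-contained check using the question's markup with two parent divs. The stdlib's xml.etree.ElementTree is used purely for convenience; lxml's find/iterfind accept the same paths:

```python
import xml.etree.ElementTree as ET

html = """<root>
    <div class="parent">
        <span class="title"><a>TITLE ONE</a></span>
        <div class="copy"><p>COPY ONE</p></div>
    </div>
    <div class="parent">
        <span class="title"><a>TITLE TWO</a></span>
        <div class="copy"><p>COPY TWO</p></div>
    </div>
</root>"""
tree = ET.fromstring(html)

# Relative paths (leading '.') scope each inner search to one parent div.
arr = [(i.findtext(".//span[@class='title']/a"),
        i.findtext(".//div[@class='copy']/p"))
       for i in tree.iterfind(".//div[@class='parent']")]
print(arr)  # [('TITLE ONE', 'COPY ONE'), ('TITLE TWO', 'COPY TWO')]
```

With an absolute "//..." search per iteration you would instead see every title and every copy repeated for each parent, which is exactly the symptom described in the question.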
