Handling XML with Python - python

All I want to do is get the content of an XML tag in Python. I may be using the wrong import; ideally I'd love something like the way PHP deals with XML (e.g. $XML->this_tag), or the way pyodbc handles database rows (e.g. table.field).
Here's my example:
from xml.dom.minidom import parseString
dom = parseString("<test>I want to read this</test>")
dom.getElementsByTagName("test")[0].toxml()
>>> u'<test>I want to read this</test>'
All I want to be able to do is read the contents of the tag (like innerHTML in JavaScript).

Instead of dom.getElementsByTagName("test")[0].toxml(), use dom.getElementsByTagName("test")[0].firstChild.data. It returns the text value of the node.

I like BeautifulSoup:
from BeautifulSoup import BeautifulStoneSoup
xml = """<test>I want to read this</test>"""
soup = BeautifulStoneSoup(xml)
print soup.find('test').string
I want to read this
Looks somewhat better.

Use firstChild.data instead of toxml:
from xml.dom.minidom import parseString
dom = parseString('<test>I want to read this</test>')
element = dom.getElementsByTagName('test')[0]
print element.firstChild.data
Output:
>>> I want to read this
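For comparison, here is a minimal sketch of the same lookup with the standard library's xml.etree.ElementTree, which is somewhat closer to the attribute-style access the question asks for. It is written for Python 3, and the wrapping <doc> element in the second half is an assumption added only to show a nested lookup.

```python
import xml.etree.ElementTree as ET

# Parse the same one-element document used in the question
root = ET.fromstring("<test>I want to read this</test>")
text = root.text  # text directly inside <test>
print(text)

# Assumed wrapper element, added here only to illustrate a child lookup
doc = ET.fromstring("<doc><test>I want to read this</test></doc>")
nested = doc.findtext("test")  # text of the first <test> child of <doc>
print(nested)
```

Both lookups yield the tag's text without the surrounding markup, so no toxml() call or string slicing is needed.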

Related

TypeError: argument 1 must be convertible to a buffer, not BeautifulSoup

from bs4 import BeautifulSoup
import requests
import csv
page=requests.get("http://www.gigantti.fi/catalog/tietokoneet/fi_kannettavat/kannettavat-tietokoneet")
data=BeautifulSoup(page.content)
h=open("test.csv","wb+")
h.write(data)
h.close()
print (data)
I have tried running the code as is, without writing to the CSV file, and it runs perfectly, but the moment I try to save to the CSV file I get the error: argument 1 must be convertible to a buffer, not BeautifulSoup. Please help, and thanks in advance.
I don't know whether anyone else was able to solve it, but trial and error worked for me. The problem was that I was not converting the content to a string.
# what I needed to add
# after the line data=BeautifulSoup(page.content):
a=str(data)
# and then write a instead of data: h.write(a)
Hopefully this helps.
What you are trying to do doesn't make any sense.
As mentioned on Beautiful Soup Documentation:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
You do not seem to be pulling any data but you are trying to write a BeautifulSoup object into a file which doesn't make sense.
>>> type(data)
<class 'bs4.BeautifulSoup'>
What you should be using BeautifulSoup for is to search the data for some information and then use that information. Here's a trivial example:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://www.gigantti.fi/catalog/tietokoneet/fi_kannettavat/kannettavat-tietokoneet")
data = BeautifulSoup(page.content)
with open("test.txt", "wb+") as f:
    # find the first `<title>` tag and retrieve its value
    value = data.findAll('title')[0].text
    f.write(value)
If I'm guessing correctly, you should be using BeautifulSoup to retrieve the information for each product in the product listing and put it into columns in a CSV file, but I will leave that work up to you. You must use BeautifulSoup to find each product in the HTML, then retrieve all of its details and write them to the CSV file.
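The CSV-writing half of that suggestion could be sketched roughly as follows, in Python 3. The product names, prices, column headers, and the write_products helper are all hypothetical placeholders; in practice the rows would come from whatever BeautifulSoup finds in the page.

```python
import csv
import io

# Hypothetical rows, standing in for details pulled out of the parsed page
products = [
    ("Example Laptop A", "499"),
    ("Example Laptop B", "899"),
]

def write_products(handle, rows):
    """Write a header line followed by one CSV line per product."""
    writer = csv.writer(handle)
    writer.writerow(["name", "price"])  # assumed column names
    writer.writerows(rows)

# An in-memory buffer keeps the sketch self-contained
buffer = io.StringIO()
write_products(buffer, products)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, a handle from open("test.csv", "w", newline="") can be passed in place of the buffer. The point is that csv.writer receives plain strings, not a BeautifulSoup object.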

Python ElementTree doesn't seem to recognize text nodes

I am trying to parse a simple XML document located at http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk using the ElementTree module. The code (so far):
import urllib2
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement
url = "http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk"
s = urllib2.urlopen(url)
print s
document = ElementTree.parse(s)
root = document.getroot()
print root
dataset = SubElement(root, 'NewDataSet')
print dataset
table = SubElement(dataset, 'Table')
print table
airportName = SubElement(table, 'CityOrAirportName')
print airportName.text
The final line yields "None", not the name of the airport in the XML. Can anyone assist? This should be relatively simple, but I am missing something.
Look at the documentation for that module. It says, among other things:
The SubElement() function also provides a convenient way to create new sub-elements for a given element
In particular note the word create. You are creating a new element, not reading the elements that are already there.
If you want to locate certain elements within the parsed XML, read the rest of the documentation on that page to understand how to use the library to do that.
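A minimal sketch of the difference, written for Python 3. The sample XML below is an assumption that mirrors the element names in the question; the real service response may be shaped differently (for example, wrapped in another element or namespaced), in which case the path would need adjusting.

```python
from xml.etree import ElementTree
from xml.etree.ElementTree import SubElement

# Assumed sample; the real service response may be shaped differently
xml_text = """<NewDataSet>
  <Table>
    <CityOrAirportName>Example Airport</CityOrAirportName>
  </Table>
</NewDataSet>"""

root = ElementTree.fromstring(xml_text)

# find() looks up an existing element by path relative to root
existing = root.find("Table/CityOrAirportName")
print(existing.text)

# SubElement() appends a brand-new, empty element, so its text is None
created = SubElement(root, "CityOrAirportName")
print(created.text)
```

After the SubElement() call the root has two children: the original Table and the newly created, empty CityOrAirportName. That is why the code in the question prints None.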

Downloading a web page and searching a text with python

I'm trying to scrape specific text from a website. Because I'm new to Python, I find it difficult to do this in a single script, so I started with this code:
import urllib
import requests
from bs4 import BeautifulSoup
htmltext = urllib.urlopen("https://io.winmasters.com/Feeds/api/event/282576?lang=el").read()
data = htmltext
soup = BeautifulSoup(data)
f = open('/Desktop/text.txt', 'w')
f.write(data)
f.close()
Next, I'm trying to write a script that searches the saved text and prints specific words:
with open("/Desktop/text.txt") as openfile:
    for line in openfile:
        for part in line.split():
            if "odds=" in part:
                print part
But the search script doesn't return the text I'm searching for. Any suggestions, please?
If you simply want the values associated with the odds key, without any context at all, you could simply do the following:
import urllib
from json import loads # JSON parser
jsontext = urllib.urlopen("https://io.winmasters.com/Feeds/api/event/282576?lang=el").read()
data = loads(jsontext) # Parse the JSON
odds = [[b['odds'] for b in a['children']] for a in data['children']]
The nested list comprehension takes advantage of the structure of the data. An advantage of working with the parsed data structure is that you can do quite rich analysis without much effort. If you wanted other information in addition to the odds, this would probably be better implemented as a nested for-loop.
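A rough sketch of that nested for-loop variant, in Python 3. The payload is a hypothetical stand-in that only mirrors the children -> children -> odds nesting used above; the "name" keys in particular are an assumption and may not exist in the real feed.

```python
import json

# Hypothetical payload mirroring the children -> children -> odds nesting
jsontext = """
{"children": [
  {"name": "market-1", "children": [{"name": "home", "odds": 1.5},
                                    {"name": "away", "odds": 2.5}]},
  {"name": "market-2", "children": [{"name": "over", "odds": 1.9}]}
]}
"""
data = json.loads(jsontext)

rows = []
for market in data["children"]:
    for selection in market["children"]:
        # Keep some context alongside each odds value ("name" is assumed)
        rows.append((market["name"], selection["name"], selection["odds"]))

for row in rows:
    print(row)
```

Each row keeps the odds together with the names of its parent entries, which the bare list comprehension discards.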
How about:
import sys
from bs4 import BeautifulSoup
import mechanize
def viewPage(url):
    browser=mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders=[('user-agent','Mozilla/5.0')]
    page=browser.open(url)
    source_code=page.read()
    soup=BeautifulSoup(source_code)
    info=soup.findAll("insert what you want to locate")
    print(info)
viewPage("http://www.xkcd.com")
I have a program that when you choose a webpage it reads off all the links, chooses one at random and goes to it, doing the same. It basically crawls across the interweb. The code above is a modified excerpt.

How to parse YouTube XML using Python?

I am trying to parse the XML from YouTube that is fetched by the code below. I want to display all of the titles. However, when I try to print the title, only blank lines appear. Any advice?
#import library to do http requests:
import urllib2
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file:
file = urllib2.urlopen('http://gdata.youtube.com/feeds/api/users/buzzfeed/uploads?v=2&max-results=50')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
entry=dom.getElementsByTagName('entry')
for node in entry:
    video_title=node.getAttribute('title')
    print video_title
Title is not an attribute; it is a child element of an entry.
Here is an example of how to extract it:
for node in entry:
    video_title = node.getElementsByTagName('title')[0].firstChild.nodeValue
    print video_title
lxml can be a bit difficult to figure out, so here's a really simple beautiful soup solution (It's called beautifulsoup for a reason). You can also set up beautiful soup to use the lxml parser, so the speed is about the same.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data) # data as is seen in your code
soup.findAll('title')
This returns a list of title elements. You can also use soup.findAll('media:title') in this case to return just the media:title elements (the actual video names).
There is a small bug in your code. You access title as an attribute, although it's a child element of entry. Your code can be fixed by:
dom = parseString(data)
for node in dom.getElementsByTagName('entry'):
    print node.getElementsByTagName('title')[0].firstChild.data
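A self-contained sketch of the same fix that can be run without the network, in Python 3. The feed below is a hypothetical stand-in for the downloaded data, and the emptiness check is an added precaution rather than part of the answers above.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for the downloaded feed
data = """<feed>
  <title>Uploads</title>
  <entry><title>First video</title></entry>
  <entry><title>Second video</title></entry>
</feed>"""

dom = parseString(data)
titles = []
for node in dom.getElementsByTagName("entry"):
    title_nodes = node.getElementsByTagName("title")
    # Skip entries with no <title> or an empty one instead of raising
    if title_nodes and title_nodes[0].firstChild is not None:
        titles.append(title_nodes[0].firstChild.data)
print(titles)
```

Because the search starts from each entry node, the feed-level title is not included in the result.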

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.etree.ElementTree, the tree.write(file) method replaces CRLF with LF only, which also creates issues when trying to import the XML file into, e.g., PyXB.
The solution I found is to use ElementTree just to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace')), and finally file.write(source_XML).
It's not nice, but it solves the issue. However, I don't care about the indentation, so I can't really speak to that. I would only use pprint.pprint() whenever I need to print it.
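A text-level replacement along those lines might look like the sketch below, in Python 3. The helper name replace_tag_text and the sample document are hypothetical, and the pattern assumes the tag carries no attributes and contains no nested tags; it is a string edit rather than a real XML transformation.

```python
import re

def replace_tag_text(xml_text, tag, new_value, index=0):
    """Replace the text of the index-th <tag>...</tag>, leaving all else untouched."""
    pattern = re.compile(r"<%s>(.*?)</%s>" % (re.escape(tag), re.escape(tag)), re.DOTALL)
    matches = list(pattern.finditer(xml_text))
    if index >= len(matches):
        raise IndexError("tag occurrence not found")
    match = matches[index]
    # Splice the new value in between the untouched prefix and suffix
    return xml_text[:match.start(1)] + new_value + xml_text[match.end(1):]

# Hypothetical document with CRLF line endings and hand-made indentation
source = (
    "<root>\r\n"
    "    <version>0.9</version>\r\n"
    "    <build>\r\n"
    "        <version>1.0</version>\r\n"
    "    </build>\r\n"
    "</root>\r\n"
)
updated = replace_tag_text(source, "version", "1.2", index=1)
print(updated)
```

Since only the matched span changes, indentation and line endings survive as long as the file is read and written without newline translation, for example with open(path, newline="") in Python 3.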
