I am trying to use this code to add a simple piece of information, which I have in a table, into an XML tree. Each file has an id which I need to add into it. The corresp dictionary holds file name and id pairs. The XML already contains an empty element, idno[#type='TM'], into which I need to enter the corresponding id number.
from bs4 import BeautifulSoup
DIR = 'files/'
corresp = {"00100004":"362375", "00100005":"362376", "00100006":"362377", "00100007":"362378"}
for fileName, tm in corresp.iteritems():
    soup = BeautifulSoup(open(DIR + fileName + ".xml"))
    tmid = soup.find("idno", type="TM")
    tmid.append(tm)
    print soup
My first problem is that sometimes it works, and sometimes it says
tmid.append(tm)
AttributeError: 'NoneType' object has no attribute 'append'
I have no idea why. Yesterday evening I ran the same sample code, and now it complains in this way.
I have also tried etree:
import xml.etree.ElementTree as ET
DIR = 'files/'
corresp = {"00100004":"362375", "00100005":"362376", "00100006":"362377", "00100007":"362378"}
for fileName, tm in corresp.iteritems():
    f = open(DIR + fileName + ".xml")
    tree = ET.parse(f)
    tmid = tree.findall("//idno[#type='TM']")
    tmid.append(tm)
    tree.write('output.xml', encoding='utf-8', xml_declaration=True)
But it says "no element found: line 1, column 0"
My second, probably related problem is that when it did work, I was not able to write the output to a file. Ideally I would like to simply write it back to the file I am modifying.
Thank you very much for any advice on this.
For your first question:
find() just returns a single result. If find() can't find anything, it returns None, and None has no append() method; that is why the same code works for some files and fails for others.
check the doc:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
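As an illustrative sketch of the fix (using the stdlib ElementTree instead of BeautifulSoup, on an invented sample document, so adapt the names to your files), check the result for None before touching it, and set the element's text rather than appending:

```python
import xml.etree.ElementTree as ET

# Invented sample standing in for one of your files.
doc = '<text><idno type="TM"></idno></text>'
root = ET.fromstring(doc)

tm = "362375"  # the id to insert

# find() returns None when nothing matches, so guard before using the result.
tmid = root.find(".//idno[@type='TM']")
if tmid is None:
    print("no idno[@type='TM'] element in this file")
else:
    tmid.text = tm

print(ET.tostring(root, encoding="unicode"))
# <text><idno type="TM">362375</idno></text>
```

BeautifulSoup's find() behaves the same way: test `if tmid is not None:` before calling append() on it.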
It's my first time working with XML files, yet I have a problem with code as simple as:
from xml.etree import ElementTree as ET
tree = ET.parse('some_xml_file.xml')
s = ET.tostring(tree, method = 'xml')
root = tree.getroot()
All I am trying to do here is read the XML file as a string,
but whenever I try to run this I get an error:
AttributeError: 'ElementTree' object has no attribute 'tag'
I have no idea what I did wrong yet, so I would appreciate any hint.
Thanks in advance.
You can't use ET.tostring on the full tree; you can use it on the root element.
from xml.etree import ElementTree as ET
tree = ET.parse('some_xml_file.xml')
s = ET.tostring(tree.getroot(), method='xml')
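One related gotcha worth noting (a small sketch with an invented document): ET.tostring() returns bytes by default; pass encoding='unicode' if you want a str back.

```python
from xml.etree import ElementTree as ET

root = ET.fromstring('<root><child/></root>')  # invented sample document

b = ET.tostring(root, method='xml')        # bytes by default
s = ET.tostring(root, encoding='unicode')  # str

print(type(b).__name__)  # bytes
print(type(s).__name__)  # str
```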
from lxml import objectify
import pandas as pd
xml = objectify.parse(open('C:/Users/admin/Downloads/XMLData2.xml'))
root = xml.getroot() # root contains 4 'record' nodes
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))
for i in range(0, 4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
search = pd.DataFrame.duplicated(df)
print (df)
print
print (search[search == True])
This is the error I get: No such file or directory: 'C:/Users/admin/Downloads/XMLData2.xml'
I assume you are new to Python.
The code you posted has no errors,
so the problem most likely lies in your path.
Please check the address of the XML file and try again.
In case you have trouble finding the address:
go to the folder containing the file,
right-click the file and click Properties;
you will see the file location there.
Copy it, append the file name, and replace each \ with /.
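As a small illustrative sketch (the path here is invented), a Windows path can be written as a raw string, with doubled backslashes, or with forward slashes; the first two are literally the same string, and forward slashes are also accepted by Windows file APIs:

```python
# Three ways to spell the same Windows path in Python.
p1 = r'C:\Users\admin\Downloads\XMLData2.xml'     # raw string: backslashes kept as-is
p2 = 'C:\\Users\\admin\\Downloads\\XMLData2.xml'  # escaped backslashes
p3 = 'C:/Users/admin/Downloads/XMLData2.xml'      # forward slashes work too

print(p1 == p2)                     # True
print(p3.replace('/', '\\') == p1)  # True
```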
Because I'm not able to use an XSL IDE, I've written a super-simple Python script using lxml to transform a given XML file with a given XSL transform, and write the results to a file. As follows (abridged):
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
print(xml_root.tag)
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = transform(xml)
with open(output, 'w') as f:
    f.write(str(newtext))
I'm getting the following error:
"lxml.etree.XSLTApplyError: Failed to evaluate the 'select' expression"
...but I have quite a number of select expressions in my XSLT. After having looked carefully and isolated blocks of code, I'm still at a loss as to which select is failing, or why.
Without trying to debug the code, is there a way to get more information out of lxml, like a line number or quote from the failing expression?
aaaaaand of course as soon as I actually take the time to post the question, I stumble upon the answer.
This might be a duplicate of this question, but I think the added benefit here is the Python side of things.
The linked answer points out that each parser includes an error log that you can access. The only "trick" is catching those errors so that you can look in the log once it's been created.
I did it thusly (perhaps also poorly, but it worked):
import os
import lxml.etree as etree
from lxml.etree import XMLParser
import sys
xml_filename = '(some path to an XML file)'
xsl_filename = '(some path to an XSL file)'
output = '(some path to a file)'
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = None
try:
    newtext = transform(xml)
    with open(output, 'w') as f:
        f.write(str(newtext))
except:
    for error in transform.error_log:
        print(error.message, error.line)
The messages in this log are more descriptive than those printed to the console, and the "line" element will point you to the line number where the failure occurred.
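As a minimal, self-contained sketch of the same idea (the stylesheet and document are invented, with a select that deliberately references an unbound variable), the error_log gives you both a message and a line number:

```python
import lxml.etree as etree

# Deliberately broken stylesheet: $undefined is never declared, so the
# select expression fails when the transform is applied.
xslt_doc = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:value-of select="$undefined"/>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(xslt_doc))

try:
    transform(etree.fromstring(b"<root/>"))
except etree.XSLTApplyError:
    for error in transform.error_log:
        print(error.message, error.line)
```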
This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to Python. I did read the documentation and a couple of tutorials, but clearly I have still done something wrong. I don't believe the problem is the XML file itself, because it does this with two different XML files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source actually contains a bunch of XML, not a file name.
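To make the distinction concrete (with an invented in-memory document), fromstring() takes the XML text itself, while parse() wants a filename or a file-like object; io.StringIO can fake the latter:

```python
import io
import xml.etree.ElementTree as ET

data = '<rss><channel><title>Example feed</title></channel></rss>'

# fromstring() parses the XML text directly and returns the root Element.
root = ET.fromstring(data)
print(root.tag)  # rss

# parse() expects a filename or file object; wrap the string to fake a file.
tree = ET.parse(io.StringIO(data))
print(tree.getroot().find('channel/title').text)  # Example feed
```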
I think this block of code is pretty close to being right, but something is throwing it off. I'm trying to loop through 10 URLs and download the contents of each to a text file, and make sure everything is structured orderly, in a dataframe.
import pandas as pd
rawHtml = ''
url = r'http://www.pga.com/golf-courses/search?page=" + i + "&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0'
g = open("C:/Users/rshuell001/Desktop/MyData.txt", "w")
for i in range(0, 10):
    df = pd.DataFrame.from_csv(url)
    print(df)
    g.write(str(df))
g.close()
The error that I get says:
CParserError: Error tokenizing data.
C error: Expected 1 fields in line 22, saw 2
I have no idea what that means. I only have 9 lines of code, so I don't know why it's mentioning a problem on line 22.
Can someone give me a push to get this working?
pandas.DataFrame.from_csv() takes a first argument which is either a path or a file-like handle, and either is supposed to point at a valid CSV file.
You are providing it with a URL.
It seems that you want a different function: the top-level pandas.read_csv. That function will actually fetch the data for you from a valid URL, then parse it.
If for any reason you insist on using pandas.DataFrame.from_csv(), you will have to:
Get the text from the page.
Persist the text, or parts thereof, as a valid CSV file, or a file-like object.
Provide the path to the file, or the handler of the file-like, as the first argument to pandas.DataFrame.from_csv().
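As a small sketch of the read_csv route (using an invented in-memory CSV in place of a live URL, since the page you are scraping is HTML rather than CSV), pandas.read_csv accepts a file-like object or a URL string in the same way:

```python
import io
import pandas as pd

# Invented CSV data standing in for what a valid CSV URL would return;
# pd.read_csv("http://...") works the same way on a real CSV URL.
csv_text = "course,city\nPebble Beach,Monterey\nPinehurst,Pinehurst"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['course', 'city']
```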
I finally got it working. This is what I was trying to do all along:
import requests
from bs4 import BeautifulSoup
link = "http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
html = requests.get(link).text
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("div", {"class": "views-field-nothing"})
for r in res:
    print("Address: " + r.find("span", {'class': 'field-content'}).text)