Parse all XML files in a directory Python - python

Hi, I'm trying to parse all XML files in a given directory using Python. I can parse one file at a time, but doing that by hand is impractical given the large number of files. It works when I pre-define the tree and root for a single file, but not when I try to run it over all the files.
This is what I implemented so far:
import xml.etree.ElementTree as ET
import os

directory = "C:/Users/danie/Desktop/NLP/blogs/"

def clean_dir(directory):
    path = os.listdir(directory)
    print(path)
    for filename in path:
        tree = ET.parse(filename)
        root = tree.getroot()
        doc_parser(root)

post_list = []

def doc_parser(root):
    for child in root.findall('post'):
        post_list.append(child.text)

clean_dir(directory)
print(post_list[0])
The error I'm getting is as follows:
File "D:\Anaconda\envs\Deep Learning New\lib\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-91-fce6b0119ea7>", line 1, in <module>
runfile('C:/Users/danie/Desktop/NLP/blogs/Parser_Tes.py', wdir='C:/Users/danie/Desktop/NLP/blogs')
File "D:\Anaconda\envs\Deep Learning New\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "D:\Anaconda\envs\Deep Learning New\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/danie/Desktop/NLP/blogs/Parser_Tes.py", line 19, in <module>
clean_dir(directory)
File "C:/Users/danie/Desktop/NLP/blogs/Parser_Tes.py", line 9, in clean_dir
tree = ET.parse(filename)
File "D:\Anaconda\envs\Deep Learning New\lib\xml\etree\ElementTree.py", line 1196, in parse
tree.parse(source, parser)
File "D:\Anaconda\envs\Deep Learning New\lib\xml\etree\ElementTree.py", line 597, in parse
self._root = parser._parse_whole(source)
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 103, column 225
In terms of printing out the path, all correct filenames are being printed out. Some of which are:
['1000331.female.37.indUnk.Leo.xml', '1000866.female.17.Student.Libra.xml', '1004904.male.23.Arts.Capricorn.xml', '1005076.female.25.Arts.Cancer.xml', '1005545.male.25.Engineering.Sagittarius.xml', '1007188.male.48.Religion.Libra.xml', '100812.female.26.Architecture.Aries.xml', '1008329.female.16.Student.Pisces.xml', '1009572.male.25.indUnk.Cancer.xml', '1011153.female.27.Technology.Virgo.xml', '1011289.female.25.indUnk.Libra.xml', '1011311.female.17.indUnk.Scorpio.xml', '1013637.male.17.RealEstate.Virgo.xml', '1015252.female.23.indUnk.Pisces.xml', '1015556.male.34.Technology.Virgo.xml', '1016560.male.41.Publishing.Sagittarius.xml', '1016738.male.26.Publishing.Libra.xml', '1016787.female.24.Communications-Media.Leo.xml', '1019224.female.27.RealEstate.Libra.xml', '1019622.female.24.indUnk.Aquarius.xml', '1019710.male.16.Student.Pisces.xml', '1021779.female.25.indUnk.Scorpio.xml', '1022037.male.23.indUnk.Cancer.xml', '1022086.female.17.Student.Cancer.xml', '1024234.female.17.indUnk.Libra.xml', '1025783.female.17.Student.Gemini.xml', '1026164.female.23.Education.Aries.xml', '1026443.female.15.Student.Scorpio.xml', '1028027.female.16.indUnk.Libra.xml', '1028257.male.26.Education.Aries.xml', '1029959.male.17.indUnk.Aries.xml', '1031806.male.17.Technology.Sagittarius.xml', '1032153.male.27.Technology.Pisces.xml', '1032591.female.24.Banking.Aquarius.xml', '1032824.female.15.Student.Libra.xml', '1034874.female.43.Publishing.Capricorn.xml', '1039136.male.24.Student.Capricorn.xml', '1039908.female.16.indUnk.Gemini.xml', '1040084.male.17.indUnk.Taurus.xml', '1042993.male.15.Student.Sagittarius.xml', '1043329.male.23.Government.Pisces.xml', '1043569.male.26.indUnk.Virgo.xml', '1043785.female.26.Biotech.Leo.xml', '1044338.female.23.Student.Leo.xml', '1045289.female.25.Arts.Aquarius.xml', '1045316.male.27.Non-Profit.Capricorn.xml', '1045831.male.23.Student.Libra.xml', '1046946.female.25.Arts.Virgo.xml', '1047241.male.16.indUnk.Aries.xml', '1050060.female.24.Student.Pisces.xml', 
'1051122.female.17.Student.Libra.xml', '1052611.male.23.Student.Aries.xml', '1054833.female.24.indUnk.Scorpio.xml', '1055228.female.16.Student.Cancer.xml', '1056232.female.17.indUnk.Aquarius.xml', '1056581.female.26.indUnk.Leo.xml', ....]
So I took the advice of both @wundermahn and @Kevin and used try...except. This is the output now, i.e. 482 out of 19320 items. The issue now is that when I try to print a certain element from the list post_list, I get the following error:
IndexError: list index out of range
Files with errors:
ERROR ON FILE: 669116.female.26.indUnk.Gemini.xml
ERROR ON FILE: 669514.female.27.indUnk.Sagittarius.xml
ERROR ON FILE: 669656.female.23.Advertising.Aries.xml
ERROR ON FILE: 669719.male.26.Science.Taurus.xml
ERROR ON FILE: 669764.female.17.indUnk.Sagittarius.xml
ERROR ON FILE: 670277.female.27.Education.Sagittarius.xml
ERROR ON FILE: 670314.male.24.indUnk.Leo.xml
ERROR ON FILE: 670684.male.24.Student.Libra.xml
ERROR ON FILE: 671748.male.27.Communications-Media.Aries.xml
ERROR ON FILE: 673093.male.27.Construction.Scorpio.xml
ERROR ON FILE: 673235.male.37.Internet.Capricorn.xml
ERROR ON FILE: 67459.male.34.Arts.Capricorn.xml
ERROR ON FILE: 674684.female.23.Religion.Libra.xml
I checked further and printed out post_list: for some reason the data is not being appended and the list is empty.
Thanks again!

@Kevin was correct in his comment that this error relates to ElementTree not being able to parse the document. Something in the file is not well-formed XML, and it could be something as simple as a single odd character, such as a byte that is invalid in the file's encoding or an unescaped & or <.
What you can try to do to help debug is:
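import xml.etree.ElementTree as ET
import os

directory = "C:/Users/danie/Desktop/NLP/blogs/"

def clean_dir(directory):
    path = os.listdir(directory)
    print(path)
    for filename in path:
        try:
            # os.listdir() returns bare names, so join each with the directory
            tree = ET.parse(os.path.join(directory, filename))
            root = tree.getroot()
            doc_parser(root)
        except ET.ParseError:
            # Report the malformed file and carry on with the next one
            print("ERROR ON FILE: {}".format(filename))

post_list = []

def doc_parser(root):
    # Collect the text of every <post> element directly under the root
    for child in root.findall('post'):
        post_list.append(child.text)

clean_dir(directory)
print(post_list[0])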
Adding a try...except statement attempts each file in turn and, if one fails to parse, prints the name of the file causing the error instead of stopping.
I don't have any data to test with, but this should get you past the error and show which files need fixing.
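To see where each file breaks rather than only which file, the same loop can record the position reported by the parser. The following is a minimal sketch, not the original poster's code: the function name parse_directory and the shape of the returned tuples are assumptions made for illustration, and it uses root.iter("post") so that <post> elements are found at any depth.

```python
import os
import xml.etree.ElementTree as ET


def parse_directory(directory):
    """Parse every .xml file in `directory`.

    Returns (posts, failures): the stripped text of each <post> element,
    and a list of (filename, line, column, message) for files that failed.
    """
    posts, failures = [], []
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith(".xml"):
            continue  # ignore scripts and other non-XML files
        full_path = os.path.join(directory, filename)
        try:
            root = ET.parse(full_path).getroot()
        except ET.ParseError as err:
            line, column = err.position  # where the parser gave up
            failures.append((filename, line, column, str(err)))
            continue
        for post in root.iter("post"):
            posts.append((post.text or "").strip())
    return posts, failures
```

Checking len(posts) before indexing into the list avoids the IndexError when nothing was collected, and the recorded line and column point at the character to inspect in each failing file.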

Related

Python - Parsing all XML files in a folder to CSV files

I just started learning python so this might be a very basic question but here's where I'm stuck.
I'm trying to parse ALL XML files in a given folder and outputting CSV files, with the same filename as the original XML files. I've tested with single files and it works perfectly but the issue I'm having is with performing the same for all of them and having that running on a loop as it would be a perpetual script.
Here's my code:
import os
import xml.etree.cElementTree as Eltree
import pandas as pd

path = r'C:/python_test'
filenames = []
for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    fullname = os.path.join(path, filename)
    print(fullname)
    filenames.append(fullname)

cols = ["serviceName", "startDate", "endDate"]
rows = []
for filename in filenames:
    xmlparse = Eltree.parse(filename)
    root = xmlparse.getroot()
    csvoutput=[]
    for fixed in root.iter('{http://www.w3.org/2001/XMLSchema}channel'):
        channel = fixed.find("channelName").text
    for dyn in root.iter('programInformation'):
        start = dyn.find("publishedStartTime").text
        end = dyn.find("endTime").text
        rows.append({"serviceName": channel, "startDate": start, "endDate": end})
    df = pd.DataFrame(rows, columns=cols)
    df.to_csv(csvoutput)
This is the error I'm getting:
C:/python_test\1.xml
C:/python_test\2.xml
C:/python_test\3.xml
C:/python_test\4.xml
C:/python_test\5.xml
C:/python_test\6.xml
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 49, in <module>
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 3466, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\format.py", line 1105, in to_csv
csv_formatter.save()
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\csvs.py", line 237, in save
with get_handle(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\common.py", line 609, in get_handle
ioargs = _get_filepath_or_buffer(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\common.py", line 396, in _get_filepath_or_buffer
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'list'>
Any kind of suggestions would be greatly appreciated!
Many thanks!
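This is your bug:
csvoutput=[] is defined as a list. Later on you pass it as the argument to df.to_csv(csvoutput), so you are passing a list to a method that expects a file path or a file-like object. Pass a string path instead, for example one derived from the XML filename.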
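One way to build an output path with the same name as the XML file could be sketched as follows. The helper name csv_path_for is hypothetical, and the commented to_csv call assumes the df and filename variables from the question's loop.

```python
import os


def csv_path_for(xml_path):
    """Return the path of the CSV that sits next to `xml_path`."""
    base, _ext = os.path.splitext(xml_path)
    return base + ".csv"


# Inside the question's loop, the call would then look like:
#     df.to_csv(csv_path_for(filename), index=False)
```

Note also that rows is created once, outside the loop, so each CSV would accumulate the rows of every earlier file; moving rows = [] inside the loop gives one file's data per CSV.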

Python Post Request Response Xml Error load fromstring

I'm new to Python and I have encountered something that I'm not sure how to resolve. I'm sure it must be a simple fix, but I haven't found a solution and hope someone with more knowledge of Python will be able to help.
My request:
...
contacts = requests.post(url,data=readContactsXml,headers=headers);
#print (contacts.content) ;
outF = open("contact.xml", "wb")
outF.write(contacts.content)
outF.close();
All is fine with the above until I have to manipulate the data before saving it.
E.g.:
...
contacts = requests.post(url,data=readContactsXml,headers=headers);
import xml.etree.ElementTree as ET
# contacts.encoding = 'utf-8'
parser = ET.XMLParser(encoding="UTF-8")
tree = ET.fromstring(contacts.content, parser=parser)
root = tree.getroot()
for item in root[0][0].findall('.//fields'):
    if item[0].text == 'maching-text-here':
        if not item[1].text:
            item[1].text = 'N/A'
        print(item[1].text)
#print (contacts.content) ;
outF = open("contact.xml", "wb")
outF.write(contacts.content)
outF.close();
In the above I'm simply replacing an empty value with 'N/A'.
The error that I'm receiving is:
Traceback (most recent call last):
File "Desktop/PythonTests/test.py", line 107, in <module>
tree = ET.fromstring(contacts.content, parser=parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1311, in XML
parser.feed(text)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1659, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1523, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 192300
Looking around this column, I can see text with accented characters, e.g. Sinéd; the É is the problem here. When I just save this XML file and open it in the browser, I get much the same error at roughly the same column (off by 2):
This page contains the following errors:
error on line 1 at column 192298: Encoding error
Below is a rendering of the page up to the first error.
I wonder what I can do with an XML response that contains data with such characters?
Any help appreciated!
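Found my answer after digging through Stack Overflow. ET.fromstring returns an Element rather than an ElementTree, so I wrapped the result in an ElementTree and dropped the parser that was forced to UTF-8.
I changed this:
tree = ET.fromstring(contacts.content, parser=parser)
to this (with ElementTree and fromstring imported from xml.etree.ElementTree):
tree = ElementTree(fromstring(contacts.content))
Ref: https://stackoverflow.com/questions/33962620/elementtree-returns-element-instead-of-elementtree/44483259#44483259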
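The whole round trip (parse the response bytes, fill in the empty values, serialize the modified tree) might look like the sketch below. The function name fill_empty_values and the two-child layout of each fields element are assumptions for illustration; the real response structure is not shown in the question.

```python
import io
import xml.etree.ElementTree as ET


def fill_empty_values(xml_bytes, field_name, placeholder="N/A"):
    """Return the document as UTF-8 bytes with empty matching values filled in."""
    # Passing bytes lets the parser honour the encoding named in the
    # XML declaration instead of forcing UTF-8.
    root = ET.fromstring(xml_bytes)   # returns an Element
    tree = ET.ElementTree(root)       # wrap it to get write()
    for item in root.iter("fields"):
        # Assumed layout: first child holds the name, second the value
        if len(item) >= 2 and item[0].text == field_name and not item[1].text:
            item[1].text = placeholder
    buffer = io.BytesIO()
    tree.write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()
```

The returned bytes are what should be written to contact.xml. The question's code writes contacts.content, which is the unmodified response, so the 'N/A' replacements would never reach the file.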

TypeError: cannot parse from 'list'

I need to parse a lot of xml files and load the data into the database. Running the following:
import os
from lxml import etree
path = 'C:/Users/xxx/Desktop/python/python-parsing/data'
filename = os.listdir(path)
tree = etree.parse(filename)
test = tree.xpath('///p[@name="bName"]')
print ("".join(test))
Result:
Desktop\python\python-parsing\parser.py", line 6, in <module>
tree = etree.parse(filename)
File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1863, in lxml.etree._parseDocument
TypeError: cannot parse from 'list'
Any ideas how to fix this?
This is because os.listdir(path) returns a list of the names of all entries in your data folder, even if there is only one file. As such, you need to pick the filename you want from this list (or loop over the list) and join it with path before parsing it.
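Looping over the list could be sketched like this. To keep the example dependency-free it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the function name collect_bnames is made up for illustration.

```python
import os
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def collect_bnames(path):
    """Return the text of every <p name="bName"> in the XML files under `path`."""
    names = []
    for filename in sorted(os.listdir(path)):  # a list of bare names
        if not filename.endswith(".xml"):
            continue
        # Parse one file at a time, using its full path
        tree = ET.parse(os.path.join(path, filename))
        for p in tree.iterfind('.//p[@name="bName"]'):
            names.append(p.text or "")
    return names
```

With lxml itself, tree.xpath('//p[@name="bName"]') returns elements rather than strings, so "".join(test) would still fail; appending /text() to the expression is one way to get strings back.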

parsing xml file in python - no element found

I'm a python beginner.
I want to be able to pick values of certain elements in an xml sheet. Below is what my xml sheet looks like:
<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<AnalysisFolder>D:\Mooniology\MiSeqAnalysis\160708_M0209831_0202_000000000-APC99</AnalysisFolder>
<RunStartDate>160708</RunStartDate>
<MostRecentWashType>PostRun</MostRecentWashType>
<RecipeFolder>D:\Mooniology\MiSeq Control Software\CustomRecipe</RecipeFolder>
<ILMNOnlyRecipeFolder>C:\Mooniology\MiSeq Control Software\Recipe</ILMNOnlyRecipeFolder>
<SampleSheetName>20160708 ALK Amplicon NGS cDNA synthesis kit comparison</SampleSheetName>
<SampleSheetFolder>Q:\GNO MiSeq\Jaya</SampleSheetFolder>
<ManifestFolder>Q:\GNO MiSeq</ManifestFolder>
<OutputFolder>\\rpbns4-lab\vol10\RMSdisect\160708_M02091_0202_000000000-APC99</OutputFolder>
<FocusMethod>AutoFocus</FocusMethod>
<SurfaceToScan>Both</SurfaceToScan>
<SaveFocusImages>true</SaveFocusImages>
<SaveScanImages>true</SaveScanImages>
And by "picking values", suppose I want the value of the element called TempFolder. I want the script to print D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99
Below is the code I'm using to attempt to scan it:
#!/usr/bin/python2.7
import xml.etree.ElementTree as ET
tree = ET.parse('online.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
Every time I run this code, no matter how I modify it (based on what I found on Google), the end result is always the following error:
Traceback (most recent call last):
File "./mindo.py", line 5, in <module>
tree = ET.parse('online.xml')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 657, in parse
self._root = parser.close()
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: no element found: line 75, column 0
I suspected that the issue could be the XML file I'm using. But since I'm new to Python, I have to presume it's my code.
This is because the XML is not well-formed and therefore cannot be parsed:
In [4]: tree = ET.parse('online.xml')
...:
File "<string>", line unknown
ParseError: junk after document element: line 2, column 2
The XML needs to have a single root element, i.e.:
<params>
<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<AnalysisFolder>D:\Mooniology\MiSeqAnalysis\160708_M0209831_0202_000000000-APC99</AnalysisFolder>
<RunStartDate>160708</RunStartDate>
<MostRecentWashType>PostRun</MostRecentWashType>
...
...
...
</params>
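If the file itself cannot be edited, the wrapping can be done in memory before parsing. This is only a sketch under the assumption that the fragment contains no XML declaration of its own; the helper name parse_fragment is hypothetical and the root tag params is taken from the example above.

```python
import xml.etree.ElementTree as ET


def parse_fragment(fragment, root_tag="params"):
    """Wrap sibling elements in one root element and parse the result."""
    wrapped = "<{0}>{1}</{0}>".format(root_tag, fragment)
    return ET.fromstring(wrapped)


# Two of the sibling elements from the question, with no enclosing root
fragment = (
    r"<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>"
    r"<RunStartDate>160708</RunStartDate>"
)
root = parse_fragment(fragment)
print(root.findtext("TempFolder"))
```

findtext returns the text of the first matching child, which is the "picking values" operation the question asks for.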

lxml: some XML from URL give this lxml.etree.XMLSyntaxError

I have a script which is supposed to extract some terms from XML files from a list of URLs.
All the URLs give access to XML data.
It works fine at first, opening, parsing and extracting correctly, but then gets interrupted by some XML files with this error:
File "<stdin>", line 18, in <module>
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1555, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82511)
File "parser.pxi", line 1585, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:82832)
File "parser.pxi", line 1468, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:81688)
File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:78735)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
From my search, it might be because some XML files have white space, but I'm not sure if that is the problem. I can't tell which files give the error.
Is there a way to get around this error?
Here is my script:
URLlist = ["http://www.uniprot.org/uniprot/"+x+".xml" for x in IDlist]
for id, item in zip(IDlist, URLlist):
    goterm_location = []
    goterm_function = []
    goterm_process = []
    location_list[id] = []
    function_list[id] = []
    biological_list[id] = []
    try:
        textfile = urllib2.urlopen(item);
    except urllib2.HTTPError:
        print("URL", item, "could not be read.")
        continue
    #Try to solve empty line error#
    tree = etree.parse(textfile);
    #root = tree.getroot()
    for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
        if node.attrib.get('type') == 'GO':
            for child in node:
                value = child.attrib.get('value');
                if value.startswith('C:'):
                    goterm_C = node.attrib.get('id')
                    if goterm_C:
                        location_list[id].append(goterm_C);
                if value.startswith('F:'):
                    goterm_F = node.attrib.get('id')
                    if goterm_F:
                        function_list[id].append(goterm_F);
                if value.startswith('P:'):
                    goterm_P = node.attrib.get('id')
                    if goterm_P:
                        biological_list[id].append(goterm_P);
I have tried:
tree = etree.iterparse(textfile, events = ("start","end"));
OR
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(textfile, parser)
Without success.
Any help would be greatly appreciated
"I can't tell which files give the error"
Debug by printing the name of the file/URL prior to parsing. Then you'll see which file(s) cause the error.
Also, read the error message:
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
this suggests that the downloaded XML file is empty. Once you have determined the URL(s) that cause the problem, try downloading the file and check its contents. I suspect it might be empty.
You can ignore problematic files (empty or otherwise syntactically invalid) by using a try/except block when parsing:
try:
    tree = etree.parse(textfile)
except etree.XMLSyntaxError:
    print('Skipping invalid XML from URL {}'.format(item))
    continue # go on to the next URL
Or you could check just for empty files by checking the 'Content-length' header, or even by reading the resource returned by urlopen(), but I think that the above is better as it will also catch other potential errors.
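The "read the resource first" variant might look like the sketch below, which checks for an empty body before handing it to the parser. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages, and the function name parse_response_body is an assumption for illustration.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def parse_response_body(body):
    """Return the root element, or None if `body` is empty or not well-formed."""
    if not body.strip():
        return None  # nothing but whitespace came back
    try:
        return ET.fromstring(body)
    except ET.ParseError:
        return None  # downloaded something, but it is not valid XML
```

In the question's loop, body would be the bytes returned by textfile.read(), and a None result is the signal to log the URL and continue with the next one.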
I got the same error message in Python 3.6
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
In my case the XML file was not empty; the issue was the encoding.
Initially I used utf-8:
from lxml import etree
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='utf-8')
Changing the encoding to iso-8859-1 solved my issue:
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='iso-8859-1')
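Why the encoding matters can be shown with a few bytes. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, so it illustrates the same idea and not the exact iterparse call above; the sample data and the helper name parse_with_encoding are made up for the example.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Latin-1 bytes with no XML declaration: 0xE9 is "é" in ISO-8859-1
# but an incomplete byte sequence in UTF-8.
data = b"<name>Sin\xe9ad</name>"


def parse_with_encoding(raw, encoding):
    """Parse `raw`, telling the parser which encoding to assume."""
    parser = ET.XMLParser(encoding=encoding)
    return ET.fromstring(raw, parser=parser)


try:
    parse_with_encoding(data, "utf-8")
    utf8_ok = True
except ET.ParseError:
    utf8_ok = False  # the stray 0xE9 byte is rejected as an invalid token

latin1_text = parse_with_encoding(data, "iso-8859-1").text
```

Overriding the encoding is only appropriate when the file really is in that encoding; if the document carries a correct XML declaration, parsing the raw bytes without an override is usually enough.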
