I'm trying to parse an XML file that contains vCards. I need FN and the NOTE values (SIREN and A), printed as a list of the form FN, SIREN_A. I would also like to add them to a list only if the string in the description equals "diviseur".
I've tried different things (vobject, finditer) but none of them work. For my parser, I'm using the xml.etree.ElementTree library and pandas, which usually cause some incompatibilities.
Python code:
import xml.etree.ElementTree as ET
import vobject
newlist=[]
data=[]
data.append(newlist)
diviseur=[]
tree=ET.parse('test_oc.xml')
root=tree.getroot()
newlist=[]
for lifeCycle in root.findall('{http://ltsc.ieee.org/xsd/LOM}lifeCycle'):
    for contribute in lifeCycle.findall('{http://ltsc.ieee.org/xsd/LOM}contribute'):
        for entity in contribute.findall('{http://ltsc.ieee.org/xsd/LOM}entity'):
            vcard = vobject.readOne(entity)
            siren = vcard.contents['note'].value,":",vcard.contents['fn'].value
            print ('siren',siren.text)
        for date in contribute.findall('{http://ltsc.ieee.org/xsd/LOM}date'):
            for description in date.findall('{http://ltsc.ieee.org/xsd/LOM}description'):
                entite=description.find('{http://ltsc.ieee.org/xsd/LOM}string')
                print ('Type entité:', entite.text)
                newlist.append(entite)
                j=0
                for j in range(len(entite)-1):
                    if entite[j]=="diviseur":
                        diviseur.append(siren[j])
                        print('diviseur:', diviseur)
                        newlist.append(diviseur)
data.append(newlist)
print(data)
XML file to parse:
<?xml version="1.0" encoding="UTF-8"?>
<lom:lom xmlns:lom="http://ltsc.ieee.org/xsd/LOM" xmlns:lomfr="http://www.lom-fr.fr/xsd/LOMFR" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ltsc.ieee.org/xsd/LOM">
<lom:version uniqueElementName="version">
<lom:string language="http://id.loc.gov/vocabulary/iso639-2/fre">V4.1</lom:string>
</lom:version>
<lom:lifeCycle uniqueElementName="lifeCycle">
<lom:contribute>
<lom:entity><![CDATA[
BEGIN:VCARD
VERSION:4.0
FN:Cailler
N:;Valérie;;Mr;
ORG:Veoli
NOTE:SIREN=203025106
NOTE :ISNI=0000000000000000
END:VCARD
]]></lom:entity>
<lom:date uniqueElementName="date">
<lom:dateTime uniqueElementName="dateTime">2019-07-10</lom:dateTime>
<lom:description uniqueElementName="description">
<lom:string>departure</lom:string>
</lom:description>
</lom:date>
</lom:contribute>
<lom:contribute>
<lom:entity><![CDATA[
BEGIN:VCARD
VERSION:4.0
FN:Besnard
N:;Ugo;;Mr;
ORG:MG
NOTE:SIREN=501 025 205
NOTE :A=0000 0000
END:VCARD
]]></lom:entity>
<lom:date uniqueElementName="date">
<lom:dateTime uniqueElementName="dateTime">2019-07-10</lom:dateTime>
<lom:description uniqueElementName="description">
<lom:string>diviseur</lom:string>
</lom:description>
</lom:date>
</lom:contribute>
</lom:lifeCycle>
</lom:lom>
Traceback (most recent call last):
File "parser_export_csv_V2.py", line 73, in
vcard = vobject.readOne(entity)
File "C:\Users\b\AppData\Local\Programs\Python\Python36-32\lib\site-packages\vobject\base.py", line 1156, in readOne
allowQP))
File "C:\Users\b\AppData\Local\Programs\Python\Python36-32\lib\site-packages\vobject\base.py", line 1089, in readComponents
for line, n in getLogicalLines(stream, allowQP):
File "C:\Users\b\AppData\Local\Programs\Python\Python36-32\lib\site-packages\vobject\base.py", line 869, in getLogicalLines
val = fp.read(-1)
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'read'
There are a few problems here.
entity is an Element instance, and vCard is a plain text data format. vobject.readOne() expects text.
There is unwanted whitespace adjacent to the vCard properties in the XML file.
NOTE :ISNI=0000000000000000 is invalid; it should be NOTE:ISNI=0000000000000000 (space removed).
vcard.contents['note'] is a list and does not have a value property.
Here is code that probably doesn't produce exactly what you want, but I hope it helps:
import xml.etree.ElementTree as ET
import vobject
NS = {"lom": "http://ltsc.ieee.org/xsd/LOM"}
tree = ET.parse('test_oc.xml')
for contribute in tree.findall('.//lom:contribute', NS):
    desc_string = contribute.find('.//lom:string', NS)
    print(desc_string.text)
    entity = contribute.find('lom:entity', NS)
    txt = entity.text.replace(" ", "")  # Text with spaces removed
    vcard = vobject.readOne(txt)
    for p in vcard.contents["note"]:
        print(p.name, p.value)
    for p in vcard.contents["fn"]:
        print(p.name, p.value)
    print()
Output:
departure
NOTE SIREN=203025106
NOTE ISNI=0000000000000000
FN Cailler
diviseur
NOTE SIREN=501025205
NOTE A=00000000
FN Besnard
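To get closer to the FN, SIREN_A list filtered on "diviseur" that the question asks for, here is a minimal sketch. It is an assumption-laden illustration, not the answer's method: it swaps vobject for a small hand-written line parser so that it runs with the standard library only, it uses a trimmed inline copy of the question's XML, and the helper names parse_vcard and collect are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"lom": "http://ltsc.ieee.org/xsd/LOM"}

# Trimmed copy of the question's XML (two contribute blocks).
XML = """<lom:lom xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
<lom:lifeCycle>
<lom:contribute>
<lom:entity><![CDATA[
BEGIN:VCARD
FN:Cailler
NOTE:SIREN=203025106
NOTE :ISNI=0000000000000000
END:VCARD
]]></lom:entity>
<lom:date><lom:description><lom:string>departure</lom:string></lom:description></lom:date>
</lom:contribute>
<lom:contribute>
<lom:entity><![CDATA[
BEGIN:VCARD
FN:Besnard
NOTE:SIREN=501 025 205
NOTE :A=0000 0000
END:VCARD
]]></lom:entity>
<lom:date><lom:description><lom:string>diviseur</lom:string></lom:description></lom:date>
</lom:contribute>
</lom:lifeCycle>
</lom:lom>"""


def parse_vcard(text):
    """Hypothetical helper: naive NAME:VALUE reader, not a full vCard parser."""
    props = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank line or no colon
        # Stripping the name tolerates the stray space in "NOTE :A=..."
        props.setdefault(name.strip().upper(), []).append(value.strip())
    return props


def collect(root, wanted="diviseur"):
    """Return [FN, SIREN_A] for each contribute whose description equals `wanted`."""
    rows = []
    for contribute in root.findall(".//lom:contribute", NS):
        desc = contribute.find("lom:date/lom:description/lom:string", NS)
        if desc is None or (desc.text or "").strip() != wanted:
            continue
        card = parse_vcard(contribute.findtext("lom:entity", default="", namespaces=NS))
        # Split each NOTE value "KEY=VALUE" into a dict
        notes = dict(n.split("=", 1) for n in card.get("NOTE", []) if "=" in n)
        siren = notes.get("SIREN", "").replace(" ", "")
        a = notes.get("A", "").replace(" ", "")
        rows.append([card.get("FN", [""])[0], siren + "_" + a])
    return rows


diviseur = collect(ET.fromstring(XML))
print(diviseur)
```

With vobject installed, the same filter can wrap the loop above: check the description string first, then read p.value from vcard.contents["note"] and vcard.contents["fn"].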
Related
I am trying to read an XML file using lxml but I keep receiving this error:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
my code:
import lxml
parser16= lxml.etree.XMLParser(encoding = "utf-16")
tree = lxml.etree.parse(input_filepath, parser = parser16)
XML file header:
<?xml version="1.0" encoding="utf-16"?>
<LoanResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<encompassId>b100ab9f-444-5555-949c-1873ed1ddeb3</encompassId>
<adverseActionDate>2018-06-20T00:00:00Z</adverseActionDate>
<applications>
<Application>
<id>_borrower1</id>
<applicationId>_borrower1</applicationId>
<applicationIndex>0</applicationIndex>
<assets>
<Asset>
<id>Asset/0</id>
<assetType>LifeInsurance</assetType>
<borrowerId>_borrower1</borrowerId>
<isEmpty>true</isEmpty>
</Asset>
Error:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
Appreciate any help / input in resolving.
I'm new to Python and am currently converting XML to CSV using Python 3.6.1.
The input file is file1.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Package>
<name>AllFeatureRules</name>
<pkgId>13569656</pkgId>
<pkgMetadata>
<creator>rsikhapa</creator>
<createdDate>13-05-2018 10:07:16</createdDate>
<pkgVersion>3.0.29</pkgVersion>
<application>All</application>
<icType>Feature</icType>
<businessService>Common</businessService>
<technology>All,NA</technology>
<runTimeFormat>RBML</runTimeFormat>
<inputForTranslation></inputForTranslation>
<pkgDescription></pkgDescription>
</pkgMetadata>
<rules>
<rule>
<name>ip_slas_scheduling</name>
<ruleId>46288</ruleId>
<ruleVersion>1.3.0</ruleVersion>
<ruleVersionId>1698132</ruleVersionId>
<nuggetId>619577</nuggetId>
<nuggetVersionId>225380</nuggetVersionId>
<icType>Feature</icType>
<creator>paws</creator>
<customer></customer>
</rule>
</rules>
<versionChanges>
<rulesAdded/>
<rulesModified/>
<rulesDeleted/>
</versionChanges>
</Package>
python code:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file1.xml")
root = tree.getroot()
get_range = lambda col: range(len(col))
l = [{r[i].tag:r[i].text for i in get_range(r)} for r in root]
df = pd.DataFrame.from_dict(l)
df.to_csv('ABC.csv')
With the code above, only the parent element (pkgMetadata) is converted to CSV, not the child elements (rules).
The whole XML file is not converted to CSV. Please let me know how to fix this.
To iterate over every element, you can use ElementTree's Element.iter() method.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file1.xml")
root = tree.getroot()
iter_root = root.iter()
l = {}
for elem in iter_root:
    l[str(elem.tag)] = str(elem.text)
df = pd.DataFrame.from_dict(l,orient="index")
df.to_csv('ABC.csv')
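This produces a CSV (note that tags that occur more than once, such as name and creator, keep only their last value):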
;0
Package;"
"
name;ip_slas_scheduling
pkgId;13569656
pkgMetadata;"
"
creator;paws
createdDate;13-05-2018 10:07:16
pkgVersion;3.0.29
application;All
icType;Feature
businessService;Common
technology;All,NA
runTimeFormat;RBML
inputForTranslation;None
pkgDescription;None
rules;"
"
rule;"
"
ruleId;46288
ruleVersion;1.3.0
ruleVersionId;1698132
nuggetId;619577
nuggetVersionId;225380
customer;None
versionChanges;"
"
rulesAdded;None
rulesModified;None
rulesDeleted;None
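Because the dictionary above is keyed by tag, repeated tags overwrite each other. One possible alternative is a row per rule, with package-level fields repeated on each row. The sketch below is only an illustration under assumptions: it uses a trimmed inline copy of file1.xml, the standard csv module instead of pandas so that it runs without third-party packages, and hypothetical helper names (rule_rows, to_csv) and column prefixes (pkg_, meta_, rule_).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of file1.xml from the question.
XML = """<Package>
  <name>AllFeatureRules</name>
  <pkgId>13569656</pkgId>
  <pkgMetadata>
    <creator>rsikhapa</creator>
    <pkgVersion>3.0.29</pkgVersion>
  </pkgMetadata>
  <rules>
    <rule>
      <name>ip_slas_scheduling</name>
      <ruleId>46288</ruleId>
      <creator>paws</creator>
    </rule>
  </rules>
</Package>"""


def rule_rows(root):
    """Hypothetical helper: one dict per <rule>, plus package-level fields."""
    # Leaf children of <Package> (name, pkgId); prefixed to avoid tag collisions
    base = {"pkg_" + c.tag: c.text or "" for c in root if len(c) == 0}
    base.update({"meta_" + m.tag: m.text or "" for m in root.findall("pkgMetadata/*")})
    rows = []
    for rule in root.findall("rules/rule"):
        row = dict(base)
        row.update({"rule_" + f.tag: f.text or "" for f in rule})
        rows.append(row)
    return rows


def to_csv(rows):
    """Hypothetical helper: serialize the row dicts to CSV text."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = rule_rows(ET.fromstring(XML))
print(to_csv(rows))
```

If pandas is preferred, the same list of dicts can be passed to pd.DataFrame(rows) before calling to_csv().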
I am trying to use findall to select some XML elements, but I can't get any results.
import xml.etree.ElementTree as ET
import sys
storefront = sys.argv[1]
xmlFileName = 'promotions{0}.xml'
xmlFile = xmlFileName.format(storefront)
csvFileName = 'hrz{0}.csv'
csvFile = csvFileName.format(storefront)
ET.register_namespace('', "http://www.demandware.com/xml/impex/promotion/2008-01-31")
tree = ET.parse(xmlFile)
root = tree.getroot()
print('------------------Generate test-------------\n')
csv = open(csvFile,'w')
n = 0
for child in root.findall('campaign'):
    print(child.attrib['campaign-id'])
    print(n)
    n+=1
The XML looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<promotions xmlns="http://www.demandware.com/xml/impex/promotion/2008-01-31">
<campaign campaign-id="10off-310781">
<enabled-flag>true</enabled-flag>
<campaign-scope>
<applicable-online/>
</campaign-scope>
<customer-groups match-mode="any">
<customer-group group-id="Everyone"/>
</customer-groups>
</campaign>
<campaign campaign-id="MNT-deals">
<enabled-flag>true</enabled-flag>
<campaign-scope>
<applicable-online/>
</campaign-scope>
<start-date>2017-07-03T22:00:00.000Z</start-date>
<end-date>2017-07-31T22:00:00.000Z</end-date>
<customer-groups match-mode="any">
<customer-group group-id="Everyone"/>
</customer-groups>
</campaign>
<campaign campaign-id="black-friday">
<enabled-flag>true</enabled-flag>
<campaign-scope>
<applicable-online/>
</campaign-scope>
<start-date>2017-11-23T23:00:00.000Z</start-date>
<end-date>2017-11-24T23:00:00.000Z</end-date>
<customer-groups match-mode="any">
<customer-group group-id="Everyone"/>
</customer-groups>
<custom-attributes>
<custom-attribute attribute-id="expires_date">2017-11-29</custom-attribute>
</custom-attributes>
</campaign>
<promotion-campaign-assignment promotion-id="winter17-new-bubble" campaign-id="winter17-new-bubble">
<qualifiers match-mode="any">
<customer-groups/>
<source-codes/>
<coupons/>
</qualifiers>
<rank>100</rank>
</promotion-campaign-assignment>
<promotion-campaign-assignment promotion-id="xmas" campaign-id="xmas">
<qualifiers match-mode="any">
<customer-groups/>
<source-codes/>
<coupons/>
</qualifiers>
</promotion-campaign-assignment>
</promotions>
Any ideas what I am doing wrong?
I have tried different solutions that I found on Stack Overflow, but none of the things I tried seem to work for me.
The list is empty.
Sorry if it is something very obvious; I am new to Python.
As mentioned here by @MartijnPieters, etree's .findall() uses the namespaces argument, while .register_namespace() only affects the XML output of the tree. Therefore, consider mapping the default namespace to an explicit prefix. The code below uses doc, but any prefix works (even cosmin).
Additionally, consider with, enumerate(), and even the csv module as better handlers for your print and CSV output.
import csv
...
root = tree.getroot()
print('------------------Generate test-------------\n')
with open(csvFile, 'w') as f:
    c = csv.writer(f, lineterminator='\n')
    for n, child in enumerate(root.findall('doc:campaign', namespaces={'doc':'http://www.demandware.com/xml/impex/promotion/2008-01-31'})):
        print(child.attrib['campaign-id'])
        print(n)
        c.writerow([child.attrib['campaign-id']])
# ------------------Generate test-------------
# 10off-310781
# 0
# MNT-deals
# 1
# black-friday
# 2
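To see why the unqualified search comes back empty, here is a small self-contained sketch. It assumes a trimmed inline copy of the promotions XML, and the prefix doc is arbitrary. It compares an unqualified tag, a prefix mapped through the namespaces argument, and the fully qualified {uri}tag form.

```python
import xml.etree.ElementTree as ET

URI = "http://www.demandware.com/xml/impex/promotion/2008-01-31"

# Trimmed copy of the question's XML; the default namespace applies to every element.
XML = """<promotions xmlns="http://www.demandware.com/xml/impex/promotion/2008-01-31">
  <campaign campaign-id="10off-310781"/>
  <campaign campaign-id="MNT-deals"/>
  <promotion-campaign-assignment promotion-id="xmas" campaign-id="xmas"/>
</promotions>"""

root = ET.fromstring(XML)

# 1. Unqualified name: matches nothing, because the real tag is "{URI}campaign".
unqualified = root.findall("campaign")

# 2. Prefix mapped to the namespace URI via the namespaces argument.
with_prefix = [c.get("campaign-id") for c in root.findall("doc:campaign", {"doc": URI})]

# 3. Fully qualified tag name, no mapping needed.
with_uri = [c.get("campaign-id") for c in root.findall("{%s}campaign" % URI)]

print(unqualified, with_prefix, with_uri)
```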
here is my code:
from lxml import etree, objectify
def parseXML(xmlFile):
    with open(xmlFile) as f:
        xml = f.read()
    root = objectify.fromstring(xml)
    #returns attributes in element node as dict
    attrib = root.attrib
    #how to extract element data
    begin = root.appointment.begin
    uid = root.appointment.uid
    #loop over elements and print their tags and text
    for appt in root.getchildren():
        for e in appt.getchildren():
            print('%s => %s' % (e.tag, e.text))
        print()
    #how to change element's text
    root.appointment.begin = 'something else'
    print(root.appointment.begin)
    #how to add a new element
    root.appointment.new_element = 'new data'
    #remove the py:pytype stuff
    objectify.deannotate(root)
    etree.cleanup_namespaces(root)
    obj_xml = etree.tostring(root, pretty_print=True)
    print(obj_xml)
    #save your xml
    with open('new.xml', 'w') as f:
        f.write(obj_xml)
parseXML('example.xml')
Here is the XML file being parsed:
<?xml version="1.0" ?>
<zAppointments reminder="15">
<appointment>
<begin>1181251600</begin>
<uid>0400000008200E000</uid>
<alarmTime>1181572063</alarmTime>
<state></state>
<location></location>
<duration>1800</duration>
<subject>Bring pizza home</subject>
</appointment>
<appointment>
<begin>1234567890</begin>
<duration>1800</duration>
<subject>Check MS office webstie for updates</subject>
<state>dismissed</state>
<location></location>
<uid>502fq14-12551ss-255sf2</uid>
</appointment>
</zAppointments>
And here is the output with the error:
/usr/bin/python3.5 "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py"
begin => 1181251600
uid => 0400000008200E000
Traceback (most recent call last):
alarmTime => 1181572063
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 87, in <module>
state => None
location => None
parseXML('example.xml')
duration => 1800
subject => Bring pizza home
begin => 1234567890
duration => 1800
subject => Check MS office webstie for updates
state => dismissed
location => None
uid => 502fq14-12551ss-255sf2
something else
b'<zAppointments reminder="15">\n <appointment>\n <begin>something else</begin>\n <uid>0400000008200E000</uid>\n <alarmTime>1181572063</alarmTime>\n <state/>\n <location/>\n <duration>1800</duration>\n <subject>Bring pizza home</subject>\n <new_element>new data</new_element>\n </appointment>\n <appointment>\n <begin>1234567890</begin>\n <duration>1800</duration>\n <subject>Check MS office webstie for updates</subject>\n <state>dismissed</state>\n <location/>\n <uid>502fq14-12551ss-255sf2</uid>\n </appointment>\n</zAppointments>\n'
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 85, in parseXML
f.write(obj_xml)
TypeError: write() argument must be str, not bytes
Process finished with exit code 1
What can I do to turn that f object into a string? Is it even possible? I got that error a few times earlier and still don't know how to fix it (doing Python 101 exercises).
obj_xml is of type bytes, so you can't pass it to write() on a file opened in text mode without decoding it first. Change
f.write(obj_xml)
to:
f.write(obj_xml.decode('utf-8'))
And it works great!
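The two usual ways around this TypeError can be sketched as follows. This is an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (its tostring() also returns bytes by default), a tiny inline document, and temporary file names chosen for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<zAppointments><appointment><begin>1181251600</begin></appointment></zAppointments>"
)
xml_bytes = ET.tostring(root)  # bytes by default, like the obj_xml in the question

tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, "text_mode.xml")      # hypothetical file name
binary_path = os.path.join(tmpdir, "binary_mode.xml")  # hypothetical file name

# Option 1: the file is opened in text mode, so decode the bytes to str first.
with open(text_path, "w", encoding="utf-8") as f:
    f.write(xml_bytes.decode("utf-8"))

# Option 2: open the file in binary mode and write the bytes unchanged.
with open(binary_path, "wb") as f:
    f.write(xml_bytes)
```

Both files end up with the same content; binary mode avoids the decode step entirely.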
Here's my project: I'm graphing weather data from WeatherBug using RRDTool. I need a simple, efficient way to download the weather data from WeatherBug. I was using a terribly inefficient bash-script-scraper but moved on to BeautifulSoup. The performance is just too slow (it's running on a Raspberry Pi) so I need to use LXML.
What I have so far:
from lxml import etree
doc=etree.parse('weather.xml')
print doc.xpath("//aws:weather/aws:ob/aws:temp")
But I get an error message. Weather.xml is this:
<?xml version="1.0" encoding="UTF-8"?>
<aws:weather xmlns:aws="http://www.aws.com/aws">
<aws:api version="2.0"/>
<aws:WebURL>http://weather.weatherbug.com/PA/Tunkhannock-weather.html?ZCode=Z5546&amp;Units=0&amp;stat=TNKCN</aws:WebURL>
<aws:InputLocationURL>http://weather.weatherbug.com/PA/Tunkhannock-weather.html?ZCode=Z5546&amp;Units=0</aws:InputLocationURL>
<aws:ob>
<aws:ob-date>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="10" hour-24="22"/>
<aws:minute number="26"/>
<aws:second number="00"/>
<aws:am-pm abbrv="PM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:ob-date>
<aws:requested-station-id/>
<aws:station-id>TNKCN</aws:station-id>
<aws:station>Tunkhannock HS</aws:station>
<aws:city-state zipcode="18657">Tunkhannock, PA</aws:city-state>
<aws:country>USA</aws:country>
<aws:latitude>41.5663871765137</aws:latitude>
<aws:longitude>-75.9794464111328</aws:longitude>
<aws:site-url>http://www.tasd.net/highschool/index.cfm</aws:site-url>
<aws:aux-temp units="°F">-100</aws:aux-temp>
<aws:aux-temp-rate units="°F">0</aws:aux-temp-rate>
<aws:current-condition icon="http://deskwx.weatherbug.com/images/Forecast/icons/cond013.gif">Cloudy</aws:current-condition>
<aws:dew-point units="°F">40</aws:dew-point>
<aws:elevation units="ft">886</aws:elevation>
<aws:feels-like units="°F">41</aws:feels-like>
<aws:gust-time>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="12" hour-24="12"/>
<aws:minute number="18"/>
<aws:second number="00"/>
<aws:am-pm abbrv="PM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:gust-time>
<aws:gust-direction>NNW</aws:gust-direction>
<aws:gust-direction-degrees>323</aws:gust-direction-degrees>
<aws:gust-speed units="mph">17</aws:gust-speed>
<aws:humidity units="%">98</aws:humidity>
<aws:humidity-high units="%">100</aws:humidity-high>
<aws:humidity-low units="%">61</aws:humidity-low>
<aws:humidity-rate>3</aws:humidity-rate>
<aws:indoor-temp units="°F">77</aws:indoor-temp>
<aws:indoor-temp-rate units="°F">-1.1</aws:indoor-temp-rate>
<aws:light>0</aws:light>
<aws:light-rate>0</aws:light-rate>
<aws:moon-phase moon-phase-img="http://api.wxbug.net/images/moonphase/mphase01.gif">0</aws:moon-phase>
<aws:pressure units="&quot;">30.09</aws:pressure>
<aws:pressure-high units="&quot;">30.5</aws:pressure-high>
<aws:pressure-low units="&quot;">30.08</aws:pressure-low>
<aws:pressure-rate units="&quot;/h">-0.01</aws:pressure-rate>
<aws:rain-month units="&quot;">0.11</aws:rain-month>
<aws:rain-rate units="&quot;/h">0</aws:rain-rate>
<aws:rain-rate-max units="&quot;/h">0.12</aws:rain-rate-max>
<aws:rain-today units="&quot;">0.09</aws:rain-today>
<aws:rain-year units="&quot;">0.11</aws:rain-year>
<aws:temp units="°F">41</aws:temp>
<aws:temp-high units="°F">42</aws:temp-high>
<aws:temp-low units="°F">29</aws:temp-low>
<aws:temp-rate units="°F/h">-0.9</aws:temp-rate>
<aws:sunrise>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="7" hour-24="07"/>
<aws:minute number="29"/>
<aws:second number="53"/>
<aws:am-pm abbrv="AM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:sunrise>
<aws:sunset>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="4" hour-24="16"/>
<aws:minute number="54"/>
<aws:second number="19"/>
<aws:am-pm abbrv="PM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:sunset>
<aws:wet-bulb units="°F">40.802</aws:wet-bulb>
<aws:wind-speed units="mph">3</aws:wind-speed>
<aws:wind-speed-avg units="mph">1</aws:wind-speed-avg>
<aws:wind-direction>S</aws:wind-direction>
<aws:wind-direction-degrees>163</aws:wind-direction-degrees>
<aws:wind-direction-avg>SE</aws:wind-direction-avg>
</aws:ob>
</aws:weather>
I used http://www.xpathtester.com/test to test my xpath and it worked there. But I get the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2043, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:47570)
File "xpath.pxi", line 376, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:118247)
File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:116911)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116728)
lxml.etree.XPathEvalError: Undefined namespace prefix
This is all very new to me -- Python, XML, and LXML. All I want is the observed time and the temperature.
Do my problems have anything to do with that aws: prefix in front of everything? What does that even mean?
Any help you can offer is greatly appreciated!
The problem has all "to do with that aws: prefix in front of everything"; it is a namespace prefix which you have to define. This is easily achievable, as in:
print doc.xpath('//aws:weather/aws:ob/aws:temp',
namespaces={'aws': 'http://www.aws.com/aws'})[0].text
The need for this mapping from the namespace prefix to its URI is documented at http://lxml.de/xpathxslt.html.
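Since the question asks for both the observed time and the temperature, here is a minimal sketch of the same prefix-to-URI mapping applied to both. It rests on assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml (its find() accepts the same kind of mapping), a trimmed inline copy of weather.xml, and a hypothetical helper named num.

```python
import xml.etree.ElementTree as ET

NS = {"aws": "http://www.aws.com/aws"}

# Trimmed copy of weather.xml from the question (attributes reduced).
XML = """<aws:weather xmlns:aws="http://www.aws.com/aws">
  <aws:ob>
    <aws:ob-date>
      <aws:year number="2013"/>
      <aws:month number="1" text="January" abbrv="Jan"/>
      <aws:day number="11" text="Friday" abbrv="Fri"/>
      <aws:hour number="10" hour-24="22"/>
      <aws:minute number="26"/>
      <aws:second number="00"/>
    </aws:ob-date>
    <aws:temp>41</aws:temp>
  </aws:ob>
</aws:weather>"""

root = ET.fromstring(XML)
ob = root.find("aws:ob", NS)
date = ob.find("aws:ob-date", NS)


def num(tag, attr="number"):
    """Hypothetical helper: read an integer attribute from a child of ob-date."""
    return int(date.find("aws:" + tag, NS).get(attr))


# Build a timestamp string from the individual date parts (24-hour clock).
observed = "{:04d}-{:02d}-{:02d} {:02d}:{:02d}:{:02d}".format(
    num("year"), num("month"), num("day"),
    num("hour", "hour-24"), num("minute"), num("second"),
)
temperature = ob.find("aws:temp", NS).text

print(observed, temperature)
```

With lxml, the same mapping would be passed as the namespaces argument to xpath(), as in the answer above.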
Try something like this:
from lxml import etree
ns = etree.FunctionNamespace("http://www.aws.com/aws")
ns.prefix = "aws"
doc=etree.parse('weather.xml')
print doc.xpath("//aws:weather/aws:ob/aws:temp")[0].text
See this link: http://lxml.de/extensions.html