Python 3: XML Tag Value not being written to csv file

My python 3 script takes an xml file and creates a csv file.
Small excerpt of xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<metadata>
<dc>
<title>Golden days for boys and girls, 1895-03-16, v. XVI #17</title>
<subject>Children's literature--Children's periodicals</subject>
<description>Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries</description>
<publisher>James Elverson, 1880-</publisher>
<date>1895-06-15</date>
<type>Text | periodicals</type>
<format>image/jp2</format>
<handle>http://hdl.handle.net/11134/20002:860074494</handle>
<accessionNumber/>
<barcode/>
<identifier>20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870 | hdl:  | http://hdl.handle.net/11134/20002:860074494</identifier>
<rights>These Materials are provided for educational and research purposes only. The University of Connecticut Libraries hold the copyright except where noted. Permission must be obtained in writing from the University of Connecticut Libraries and/or theowner(s) of the copyright to publish reproductions or quotations beyond "fair use." | The collection is open and available for research.</rights>
<creator/>
<relation/>
<coverage/>
<language/>
</dc>
</metadata>
Python3 code:
import csv
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='ctda_set1_uniqueTags.xml')
doc = ET.parse("ctda_set1_uniqueTags.xml")
root = tree.getroot()
oaidc_data = open('ctda_set1_uniqueTags.csv', 'w', encoding='utf-8')
titles = 'dc/title'
subjects = 'dc/subject'
csvwriter = csv.writer(oaidc_data)
oaidc_head = ['Title', 'Subject', 'Description', 'Publisher', 'Date', 'Type', 'Format', 'Handle', 'Accession Number', 'Barcode', 'Identifiers', 'Rights', 'Creator', 'Relation', 'Coverage', 'Language']
count = 0
for member in root.findall('dc'):
    if count == 0:
        csvwriter.writerow(oaidc_head)
        count = count + 1
    dcdata = []
    titles = member.find('title').text
    dcdata.append(titles)
    subjects = member.find('subject').text
    dcdata.append(subjects)
    descriptions = member.find('description').text
    dcdata.append(descriptions)
    publishers = member.find('publisher').text
    dcdata.append(publishers)
    dates = member.find('date').text
    dcdata.append(dates)
    types = member.find('type').text
    dcdata.append(types)
    formats = member.find('format').text
    dcdata.append(formats)
    handle = member.find('handle').text
    dcdata.append(handle)
    accessionNo = member.find('accessionNumber').text
    dcdata.append(accessionNo)
    barcodes = member.find('barcode').text
    dcdata.append(barcodes)
    identifiers = member.find('identifier').text
    dcdata.append(identifiers)
    rt = member.find('rights').text
    print(member.find('rights').text)
    dcdata.append('rt')
    ct = member.find('creator').text
    dcdata.append('ct')
    rt = member.find('relation').text
    dcdata.append('rt')
    ce = member.find('coverage').text
    dcdata.append('ce')
    lang = member.find('language').text
    dcdata.append('lang')
    csvwriter.writerow(dcdata)
oaidc_data.close()
Everything works as expected except for rt, ct, ce, and lang. All the other data is written to the csv correctly, but for these columns the value is always the literal text: rt for rt, ce for ce, lang for lang, and so on.
Here's a snippet of the output:
Title,Subject,Description,Publisher,Date,Type,Format,Handle,Accession Number,Barcode,Identifiers,Rights,Creator,Relation,Coverage,Language
"Golden days for boys and girls, 1895-03-16, v. XVI #17",Children's literature--Children's periodicals,"Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries","James Elverson, 1880-",1895-06-15,Text | periodicals,image/jp2,hdl.handle.net/11134/20002:860074494,,,20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870,rt,ct,rt,ce,lang
Some of the rights statements get very long, so I wondered whether that was the issue. That's why I added print(member.find('rights').text) to see the output. The text prints just fine; it just isn't written to the csv. What I'd like is to have the text of these xml tags written to the file. Any help would be appreciated.
Thanks.
Jennifer

In the line dcdata.append('rt') the quotes turn rt into a string literal, so the text 'rt' is appended instead of the value stored in the variable rt. Use dcdata.append(rt) instead. The same applies to the ct, ce and lang lines.
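One way to avoid this kind of slip altogether is to drop the per-tag variables and loop over the tag names. The following is only a sketch under assumptions: the sample XML is a shortened, made-up excerpt in the shape of the question's file, COLUMNS lists just four of the sixteen columns, and the output goes to an in-memory buffer instead of ctda_set1_uniqueTags.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the same shape as the question's file
SAMPLE = """<metadata>
<dc>
<title>Golden days for boys and girls</title>
<rights>The collection is open and available for research.</rights>
<creator/>
<language/>
</dc>
</metadata>"""

# (CSV header, XML tag) pairs; extend with the remaining columns as needed
COLUMNS = [("Title", "title"), ("Rights", "rights"),
           ("Creator", "creator"), ("Language", "language")]

def xml_to_csv_text(xml_text):
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header for header, _ in COLUMNS])
    for member in root.findall("dc"):
        # findtext gives the element's text; empty or missing tags become ""
        writer.writerow([member.findtext(tag, default="") for _, tag in COLUMNS])
    return out.getvalue()

csv_text = xml_to_csv_text(SAMPLE)
print(csv_text)
```

Because the values are never bound to names like rt or ce, there is nothing to quote by mistake, and adding a column only means adding one pair to COLUMNS.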

Related

encoding Lithuanian character in xml using python

I have this code:
import xml.etree.ElementTree as ET
from xml.dom import minidom

def convert_df_to_xml(df,fd,ld):
    # create the main (root) element named Invoices.
    root = ET.Element("Invoices")
    root.set("from", str(fd))
    root.set("till", str(ld))
    for i in range(len(df['partner_id'])):
        # add a sub-element.
        invoices = ET.SubElement(root, "Invoice")
        invoices.set('clientid',df['company_registry'][i])
        invoices.set('imones_pavadinimas', df['partner_id'][i])
        # add a sub-sub-element.
        quebec = ET.SubElement(invoices, "Product")
        # load the row information from the dataframe
        sectin_1 = ET.SubElement(quebec, "Name")
        sectin_1.text = str(df["Name"][i])
        sectin_2 = ET.SubElement(quebec, 'Quantity')
        sectin_2.text = str(df["time_dif"][i])
        sectin_3 = ET.SubElement(quebec, 'Price')
        sectin_3.text = str(df["price_unit"][i])
    xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ", encoding="UTF-8").decode("UTF-8")
    with open("bandomasis_itp_xml_failas_V_1.1.xml", "w") as f:
        f.write(xmlstr)
I'm creating an xml file from a pandas DataFrame. The problem is that in the xml file I get "?" marks instead of the "ė" character.
In the dataframe I have strings with the characters "ė,ą,š,ų" and I need them to appear in the xml file.
My dataframe:
df1 = pd.DataFrame({'partner_id': ['MED GRUPĖ, UAB'], 'Name':['Pirmas'], 'company_registry': ['3432543'],
                    'time_dif':['2'],'price_unit':['23']})
What is the problem with the encoding here?
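A plausible cause, assuming the script runs where the default text encoding is not UTF-8: open(..., "w") without an encoding argument encodes the string with the platform default, which may not be able to represent "ė". The sketch below keeps the UTF-8 bytes that toprettyxml(encoding="UTF-8") already returns and writes them in binary mode. The helper name to_pretty_utf8 and the output file name are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.dom import minidom

def to_pretty_utf8(root):
    # Serialize as UTF-8 bytes, then let minidom re-indent; the result is bytes
    raw = ET.tostring(root, encoding="utf-8")
    return minidom.parseString(raw).toprettyxml(indent=" ", encoding="UTF-8")

root = ET.Element("Invoices")
invoice = ET.SubElement(root, "Invoice")
invoice.set("imones_pavadinimas", "MED GRUPĖ, UAB")

pretty_bytes = to_pretty_utf8(root)

# Hypothetical output path; binary mode writes the UTF-8 bytes unchanged
out_path = os.path.join(tempfile.gettempdir(), "invoices_utf8_sketch.xml")
with open(out_path, "wb") as f:
    f.write(pretty_bytes)
```

An alternative with the same effect is to keep the .decode("UTF-8") step and open the file with open(path, "w", encoding="utf-8"). Either way, the viewer used to inspect the file also has to read it as UTF-8.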

Python re.findall organize list

I have a text file with entries like this:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Applications_GetResponse xmlns="http://www.country.com">
<Applications>
<CS_Application>
<Name>Spain</Name>
<Key>2345364564</Key>
<Status>NORMAL</Status>
<Modules>
<CS_Module>
<Name>zaragoza</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
<CS_Module>
<Name>malaga</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
</Modules>
<CreatedBy>7</CreatedBy>
</CS_Application>
<CS_Application>
<Name>UK</Name>
<Key>2345364564</Key>
<Status>NORMAL</Status>
<Modules>
<CS_Module>
<Name>london</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
<CS_Module>
<Name>liverpool</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
</Modules>
<CreatedBy>7</CreatedBy>
</CS_Application>
</Applications>
</Applications_GetResponse>
</soap:Body>
</soap:Envelope>
I would like to parse it and get the name of each country followed by its cities.
I tried a few things with Python's re.findall, but I didn't get anything like that:
print("HERE APPLICATIONS")
applications = re.findall('<CS_Application><Name>(.*?)</Name>', response_apply.text)
print(applications)
print("HERE MODULES")
modules = re.findall('<CS_Module><Name>(.*?)</Name>', response_apply.text)
print(modules)
Output:
host-10$ sudo python3 capture.py
HERE APPLICATIONS
['Spain', 'UK']
HERE MODULES
['zaragoza', 'malaga', 'london', 'liverpool']
I would like the result to look like this:
HERE
The Country: Spain - Cities: zaragoza,malaga
The Country: UK - Cities: london,liverpool
Regex is not a good tool for parsing XML; it is better to use an XML parser.
If you still want a regex solution, I hope the code below helps.
import re
s = """\n<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n <soap:Body>\n <Applications_GetResponse xmlns="http://www.country.com">\n <Applications>\n <CS_Application>\n <Name>Spain</Name>\n <Key>2345364564</Key>\n <Status>NORMAL</Status>\n <Modules>\n <CS_Module>\n <Name>zaragoza</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n <CS_Module>\n <Name>malaga</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n </Modules>\n <CreatedBy>7</CreatedBy>\n </CS_Application>\n <CS_Application>\n <Name>UK</Name>\n <Key>2345364564</Key>\n <Status>NORMAL</Status>\n <Modules>\n <CS_Module>\n <Name>london</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n <CS_Module>\n <Name>liverpool</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n </Modules>\n <CreatedBy>7</CreatedBy>\n </CS_Application>\n </Applications>\n </Applications_GetResponse>\n </soap:Body>\n</soap:Envelope>\n"""
pattern1 = re.compile(r'<CS_Application>([\s\S]*?)</CS_Application>')
# non-greedy group, so several <Name> tags on one line are matched separately
pattern2 = re.compile(r'<Name>(.*?)</Name>')
for m in re.finditer(pattern1, s):
    ss = m.group(1)
    res = []
    for mm in re.finditer(pattern2, ss):
        res.append(mm.group(1))
    # the first <Name> is the country, the remaining ones are its cities
    print("The Country: "+res[0]+" - Cities: "+",".join(res[1:]))
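For comparison, here is roughly what the XML parser route could look like with the standard library's ElementTree. This is a sketch, not a drop-in replacement: SAMPLE is a trimmed copy of the response from the question, the prefix "c" is an arbitrary label chosen here for the http://www.country.com namespace, and countries_with_cities is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the response shown in the question
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<Applications_GetResponse xmlns="http://www.country.com">
<Applications>
<CS_Application><Name>Spain</Name><Modules>
<CS_Module><Name>zaragoza</Name></CS_Module>
<CS_Module><Name>malaga</Name></CS_Module>
</Modules></CS_Application>
<CS_Application><Name>UK</Name><Modules>
<CS_Module><Name>london</Name></CS_Module>
<CS_Module><Name>liverpool</Name></CS_Module>
</Modules></CS_Application>
</Applications>
</Applications_GetResponse>
</soap:Body>
</soap:Envelope>"""

# "c" is just a local prefix for the default namespace of the payload
NS = {"c": "http://www.country.com"}

def countries_with_cities(xml_text):
    root = ET.fromstring(xml_text)
    result = []
    for app in root.iter("{http://www.country.com}CS_Application"):
        country = app.findtext("c:Name", namespaces=NS)
        cities = [m.findtext("c:Name", namespaces=NS)
                  for m in app.findall("c:Modules/c:CS_Module", NS)]
        result.append((country, cities))
    return result

pairs = countries_with_cities(SAMPLE)
for country, cities in pairs:
    print("The Country: " + country + " - Cities: " + ",".join(cities))
```

Unlike the regex version, this does not depend on whitespace or line breaks between the tags, and the country's own Name is never confused with a module's Name.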

XML Parsing Python ElementTree - Nested for loops

I'm using Jupyter Notebook and ElementTree (Python 3) to create a dataframe and save as csv from an XML file. Here is the XML format (in Estonian):
<asutused hetk="2020-04-14T03:53:33" ver="2">
<asutus>
<registrikood>10000515</registrikood>
<nimi>Osaühing B.Braun Medical</nimi>
<aadress />
<tegevusload>
<tegevusluba>
<tegevusloa_number>L04647</tegevusloa_number>
<alates>2019-12-10</alates>
<kuni />
<loaliik_kood>1</loaliik_kood>
<loaliik_nimi>Eriarstiabi</loaliik_nimi>
<haiglaliik_kood />
<haiglaliik_nimi />
<tegevuskohad>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
</tegevuskohad>
</tegevusluba>
<tegevusluba>
<tegevusloa_number>L04651</tegevusloa_number>
<alates>2019-12-11</alates>
<kuni />
<loaliik_kood>2</loaliik_kood>
<loaliik_nimi>Õendusabi</loaliik_nimi>
<haiglaliik_kood />
<haiglaliik_nimi />
<tegevuskohad>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
</tegevuskohad>
</tegevusluba>
</tegevusload>
<tootajad>
<tootaja>
<kood>D03091</kood>
<eesnimi>Evo</eesnimi>
<perenimi>Kaha</perenimi>
<kutse_kood>11</kutse_kood>
<kutse_nimi>Arst</kutse_nimi>
<erialad>
<eriala>
<kood>E420</kood>
<nimi>üldkirurgia</nimi>
</eriala>
</erialad>
</tootaja>
<tootaja>
<kood>N01146</kood>
<eesnimi>Karmen</eesnimi>
<perenimi>Mežulis</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
<tootaja>
<kood>N01153</kood>
<eesnimi>Nele</eesnimi>
<perenimi>Terras</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
<tootaja>
<kood>N02767</kood>
<eesnimi>Helena</eesnimi>
<perenimi>Tern</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
<tootaja>
<kood>N12882</kood>
<eesnimi>Hanna</eesnimi>
<perenimi>Leemet</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
</tootajad>
</asutus>
</asutused>
Each "asutus" is a hospital and I need some of the information inside. Here is my code:
tree = ET.parse("od_asutused.xml")
root = tree.getroot()
# open a file for writing
data = open('EE.csv', 'w')
# create the csv writer object
csvwriter = csv.writer(data, delimiter=';')
head = []
count = 0
for member in root.findall('asutus'):
    hospital = []
    if count == 0:
        ident = member.find('registrikood').tag
        head.append(id)
        name = member.find('nimi').tag
        head.append(name)
        address = member.find('aadress').tag
        head.append(address)
        facility_type = member.find('./tegevusload/tegevusluba/haiglaliik_nimi').tag
        head.append(facility_type)
        site_address = member.find('./tegevusload/tegevusluba/tegevuskohad/tegevuskoht/aadress').tag
        head.append(site_address)
        for elem in member.findall('tegevusload'):
            list_specs = elem.find('./tegevusluba/tegevuskohad/tegevuskoht/teenused/teenus/nimi').tag
            head.append(list_specs)
        csvwriter.writerow(head)
        count = count + 1
    ident = member.find('registrikood').text
    hospital.append(ident)
    name = member.find('nimi').text
    hospital.append(name)
    address = member.find('aadress').text
    hospital.append(address)
    facility_type = member.find('./tegevusload/tegevusluba/haiglaliik_nimi').text
    hospital.append(facility_type)
    site_address = member.find('./tegevusload/tegevusluba/tegevuskohad/tegevuskoht/aadress').text
    hospital.append(site_address)
    for spec in elem.findall('tegevusload'):
        list_specs = spec.find('./tegevusluba/tegevuskohad/tegevuskoht/teenused/teenus/nimi').text
        hospital.append(list_specs)
    csvwriter.writerow(hospital)
data.close()
#Upload csv for geocoding
df = pd.read_csv(r'EE.csv', na_filter= False, delimiter=';')
#Rename columns
df.rename(columns = {'<built-in function id>':'id',
'nimi':'name',
'aadress':'address',
'haiglaliik_nimi':'facility_type',
'haiglaliik_kood':'facility_type_c',
'aadress.1':'site_address',
'nimi.1':'list_specs'},
inplace = True)
#Add columns
df['country'] = 'Estonia'
df['cc'] = 'EE'
df.head(10)
And the result of the df.head(10):
Result of dataframe
The "list_specs" is blank no matter what I do. How can I populate this field with a list of each 'nimi' for each site address? Thank you.
I found the following points to change in your code:
On my computer, at least, csv.writer doubles the newline characters. The remedy I found is to open the output file with additional parameters (the csv module documentation recommends newline='' for this purpose):
data = open('EE.csv', 'w', newline='\n', encoding='utf-8')
There is no point in writing the header with Estonian column names and then renaming the columns. Note also that head.append(id) does not use your variable ident: id is Python's built-in function, which is why the first column ends up named '<built-in function id>'. This matters little, though, because I replaced the whole section and write the target column names directly (see below).
Since the CSV file is going to be read by read_csv, it should contain a fixed number of columns, so it is bad practice to use a loop to write one element.
The loop for spec in elem.findall('tegevusload') is wrong: elem is left over from the header loop and is itself a tegevusload element, so findall('tegevusload') finds nothing and list_specs is never appended. You should start from member instead (I solved this detail another way).
There is no point in creating a variable only to use it once. More concise and readable code is e.g. hospital.append(member.findtext('nimi')).
To avoid long XPath expressions with a repeated initial part, I set a temporary variable "in the middle" of the path, e.g. tgvLb = member.find('tegevusload/tegevusluba'), and then use a relative XPath starting from that node.
Your rename instruction contains one column that is not needed, namely facility_type_c. You read only 6 columns, not 7.
So change the middle part of your code to:
data = open('EE.csv', 'w', newline='\n', encoding='utf-8')
csvwriter = csv.writer(data, delimiter=';')
head = ['id', 'name', 'address', 'facility_type', 'site_address', 'list_specs']
csvwriter.writerow(head)
for member in root.findall('asutus'):
    hospital = []
    hospital.append(member.findtext('registrikood'))
    hospital.append(member.findtext('nimi'))
    hospital.append(member.findtext('aadress'))
    tgvLb = member.find('tegevusload/tegevusluba')
    hospital.append(tgvLb.findtext('haiglaliik_nimi'))
    tgvKoht = tgvLb.find('tegevuskohad/tegevuskoht')
    hospital.append(tgvKoht.findtext('aadress'))
    hospital.append(tgvKoht.findtext('teenused/teenus/nimi'))
    csvwriter.writerow(hospital)
data.close()
df = pd.read_csv(r'EE.csv', na_filter= False, delimiter=';')
and drop df.rename from your code.
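The code above stores only the first nimi of the first site. If the goal is, as the question asks, a list of every nimi per site address, one possible approach is to emit one row per tegevuskoht and join its service names into a single cell. The sketch below makes several assumptions: the sample is a cut-down version of the question's XML, the " | " separator is an arbitrary choice, and site_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cut-down sample in the shape of od_asutused.xml from the question
SAMPLE = """<asutused>
<asutus>
<registrikood>10000515</registrikood>
<nimi>Osaühing B.Braun Medical</nimi>
<tegevusload><tegevusluba><tegevuskohad>
<tegevuskoht>
<aadress>J. Sütiste tee 17/1</aadress>
<teenused>
<teenus><kood>T0038</kood><nimi>ambulatoorsed üldkirurgiateenused</nimi></teenus>
<teenus><kood>T0236</kood><nimi>õe vastuvõtuteenus</nimi></teenus>
</teenused>
</tegevuskoht>
</tegevuskohad></tegevusluba></tegevusload>
</asutus>
</asutused>"""

SITE_PATH = "tegevusload/tegevusluba/tegevuskohad/tegevuskoht"

def site_rows(root):
    rows = []
    for asutus in root.findall("asutus"):
        # one output row per site (tegevuskoht) of every licence
        for koht in asutus.findall(SITE_PATH):
            specs = [n.text for n in koht.findall("teenused/teenus/nimi") if n.text]
            rows.append([
                asutus.findtext("registrikood"),
                asutus.findtext("nimi"),
                koht.findtext("aadress"),
                " | ".join(specs),  # all service names in one cell
            ])
    return rows

rows = site_rows(ET.fromstring(SAMPLE))
print(rows)
```

Each row still has a fixed number of columns, so the result can be passed to csvwriter.writerow and read back with read_csv as before.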

How to iterate over XML tags in Python using ElementTree & save to CSV

I am trying to iterate over all nodes and child nodes in a tree using ElementTree. I would like to use the parent tags and their child tags as CSV columns, with the child values appended to the parent's row. I am using Python 2.7. The header should be printed only once, with the respective values in the rows below it.
XML File :
<Customers>
<Customer CustomerID="GREAL">
<CompanyName>Great Lakes Food Market</CompanyName>
<ContactName>Howard Snyder</ContactName>
<ContactTitle>Marketing Manager</ContactTitle>
<Phone>(503) 555-7555</Phone>
<FullAddress>
<Address>2732 Baker Blvd.</Address>
<City>Eugene</City>
<Region>OR</Region>
<PostalCode>97403</PostalCode>
<Country>USA</Country>
</FullAddress>
</Customer>
<Customer CustomerID="HUNGC">
<CompanyName>Hungry Coyote Import Store</CompanyName>
<ContactName>Yoshi Latimer</ContactName>
<ContactTitle>Sales Representative</ContactTitle>
<Phone>(503) 555-6874</Phone>
<Fax>(503) 555-2376</Fax>
<FullAddress>
<Address>City Center Plaza 516 Main St.</Address>
<City>Elgin</City>
<Region>OR</Region>
<PostalCode>97827</PostalCode>
<Country>USA</Country>
</FullAddress>
</Customer>
<Customer CustomerID="LAZYK">
<CompanyName>Lazy K Kountry Store</CompanyName>
<ContactName>John Steel</ContactName>
<ContactTitle>Marketing Manager</ContactTitle>
<Phone>(509) 555-7969</Phone>
<Fax>(509) 555-6221</Fax>
<FullAddress>
<Address>12 Orchestra Terrace</Address>
<City>Walla Walla</City>
<Region>WA</Region>
<PostalCode>99362</PostalCode>
<Country>USA</Country>
</FullAddress>
</Customer>
<Customer CustomerID="LETSS">
<CompanyName>Let's Stop N Shop</CompanyName>
<ContactName>Jaime Yorres</ContactName>
<ContactTitle>Owner</ContactTitle>
<Phone>(415) 555-5938</Phone>
<FullAddress>
<Address>87 Polk St. Suite 5</Address>
<City>San Francisco</City>
<Region>CA</Region>
<PostalCode>94117</PostalCode>
<Country>USA</Country>
</FullAddress>
</Customer>
</Customers>
My Code:
#Import Libraries
import csv
import xmlschema
import xml.etree.ElementTree as ET
#Define the variable to store the XML Document
xml_file = 'C:/Users/391648/Desktop/BOSS_20190618_20190516_18062019141928_CUMA/source_Files_XML/CustomersOrders.xml'
#using XML Schema Library validate the XML against XSD
my_schema = xmlschema.XMLSchema('C:/Users/391648/Desktop/BOSS_20190618_20190516_18062019141928_CUMA/source_Files_XML/CustomersOrders.xsd')
SchemaCheck = my_schema.is_valid(xml_file)
print(SchemaCheck) #Prints as True if the document is validated with XSD
#Parse XML & get root
tree = ET.parse(xml_file)
root = tree.getroot()
#Create & Open CSV file
xml_data_to_csv = open('C:/Users/391648/Desktop/BOSS_20190618_20190516_18062019141928_CUMA/source_Files_XML/PythonXMl.csv','w')
#create variable to write to csv
csvWriter = csv.writer(xml_data_to_csv)
#Create list contains header
count =0
#Loop for each node
for element in root.findall('Customers/Customer'):
    List_nodes = []
    #Get head by Tag
    if count ==0:
        list_header =[]
        Full_Address = []
        CompanyName = element.find('CompanyName').tag
        list_header.append(CompanyName)
        ContactName = element.find('ContactName').tag
        list_header.append(ContactName)
        ContactTitle = element.find('ContactTitle').tag
        list_header.append(ContactTitle)
        Phone = element.find('Phone').tag
        list_header.append(Phone)
        print(list_header)
        csvWriter.writerow(list_header)
        count = count + 1
    #Get the data of the Node
    CompanyName = element.find('CompanyName').text
    List_nodes.append(CompanyName)
    ContactName = element.find('ContactName').text
    List_nodes.append(ContactName)
    ContactTitle = element.find('ContactTitle').text
    List_nodes.append(ContactTitle)
    Phone = element.find('Phone').text
    List_nodes.append(Phone)
    print(List_nodes)
    #Write List_Nodes to CSV
    csvWriter.writerow(List_nodes)
xml_data_to_csv.close()
Expected CSV output:
CompanyName,ContactName,ContactTitle,Phone, Address, City, Region, PostalCode, Country
Great Lakes Food Market,Howard Snyder,Marketing Manager,(503) 555-7555, City Center Plaza 516 Main St., Elgin, OR, 97827, USA
Hungry Coyote Import Store,Yoshi Latimer,Sales Representative,(503) 555-6874, 12 Orchestra Terrace, Walla Walla, WA, 99362, USA
You might be better off using lxml. It has most of the desired functionality for finding elements built in.
from lxml import etree
import csv
with open('file.xml') as fp:
    xml = etree.fromstring(fp.read())
# CSV column name -> path of the element relative to each Customer
field_dict = {
    'CompanyName': 'CompanyName',
    'ContactName': 'ContactName',
    'ContactTitle': 'ContactTitle',
    'Phone': 'Phone',
    'Address': 'FullAddress/Address',
    'City': 'FullAddress/City',
    'Region': 'FullAddress/Region',
    'PostalCode': 'FullAddress/PostalCode',
    'Country': 'FullAddress/Country'
}
customers = []
for customer in xml:
    line = {k: customer.find(v).text for k, v in field_dict.items()}
    customers.append(line)
with open('customers.csv', 'w') as fp:
    writer = csv.DictWriter(fp, fieldnames=list(field_dict))
    # writerows() emits only data rows, so write the header explicitly
    writer.writeheader()
    writer.writerows(customers)
You can use xmltodict to convert the XML into nested dictionaries (much like parsed JSON) instead of walking the XML tree yourself:
import xmltodict
import pandas as pd
with open('data.xml', 'r') as f:
    data = xmltodict.parse(f.read())['Customers']['Customer']
data_pd = {'CompanyName': [i['CompanyName'] for i in data],
           'ContactName': [i['ContactName'] for i in data],
           'ContactTitle': [i['ContactTitle'] for i in data],
           'Phone': [i['Phone'] for i in data],
           'Address': [i['FullAddress']['Address'] for i in data],
           'City': [i['FullAddress']['City'] for i in data],
           'Region': [i['FullAddress']['Region'] for i in data],
           'PostalCode': [i['FullAddress']['PostalCode'] for i in data],
           'Country': [i['FullAddress']['Country'] for i in data]}
df = pd.DataFrame(data_pd)
df.to_csv('result.csv', index=False)
Output CSV file:
CompanyName,ContactName,ContactTitle,Phone,Address,City,Region,PostalCode,Country
Great Lakes Food Market,Howard Snyder,Marketing Manager,(503) 555-7555,2732 Baker Blvd.,Eugene,OR,97403,USA
Hungry Coyote Import Store,Yoshi Latimer,Sales Representative,(503) 555-6874,City Center Plaza 516 Main St.,Elgin,OR,97827,USA
Lazy K Kountry Store,John Steel,Marketing Manager,(509) 555-7969,12 Orchestra Terrace,Walla Walla,WA,99362,USA
Let's Stop N Shop,Jaime Yorres,Owner,(415) 555-5938,87 Polk St. Suite 5,San Francisco,CA,94117,USA
A couple of things I have changed:
Removed the schema validation, since I do not have the XSD. You may add it back.
Made the child node traversal dynamic instead of referring to each child node statically.
Changed the main loop from for customer in root.findall('Customers/Customer') to for customer in root.findall('Customer'). The root element is already Customers, so the original path matches nothing.
Apart from that, I tried to keep your program structure and library usage intact. Here is the modified program:
import xml.etree.ElementTree as et
import csv
tree = et.parse("../data/customers.xml")
root = tree.getroot()
headers = []
count = 0
xml_data_to_csv = open('../data/customers.csv', 'w')
csvWriter = csv.writer(xml_data_to_csv)
for customer in root.findall('Customer'):
    data = []
    for detail in customer:
        if(detail.tag == 'FullAddress'):
            for addresspart in detail:
                # strip trailing line breaks; '\r\n' needs backslashes, since
                # rstrip('/n/r') would strip the letters n and r instead
                data.append(addresspart.text.rstrip('\r\n'))
                if(count == 0):
                    headers.append(addresspart.tag)
        else:
            data.append(detail.text.rstrip('\r\n'))
            if(count == 0):
                headers.append(detail.tag)
    if(count == 0):
        csvWriter.writerow(headers)
    csvWriter.writerow(data)
    count = count + 1
xml_data_to_csv.close()
With the given input XML content it produces:
CompanyName,ContactName,ContactTitle,Phone,Address,City,Region,PostalCode,Country
Great Lakes Food Market,Howard Snyder,Marketing Manager,(503) 555-7555,2732 Baker Blvd.,Eugene,OR,97403,USA
Hungry Coyote Import Store,Yoshi Latimer,Sales Representative,(503) 555-6874,(503) 555-2376,City Center Plaza 516 Main St.,Elgin,OR,97827,USA
Lazy K Kountry Store,John Steel,Marketing Manager,(509) 555-7969,(509) 555-6221,12 Orchestra Terrace,Walla Walla,WA,99362,USA
Let's Stop N Shop,Jaime Yorres,Owner,(415) 555-5938,87 Polk St. Suite 5,San Francisco,CA,94117,USA
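Note: Instead of writing to the CSV inside the loop, you may collect the rows in a list and write them in one go; which is better depends on your content size and performance needs. Also note that the header is built from the first customer only, so customers that have a Fax element produce one more column than the header has.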
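Optional elements such as Fax shift the columns of some rows when the header comes from the first customer alone. One possible way around this, sketched here under assumptions, is to flatten every customer into a dictionary first, build the header from the union of all tags, and let csv.DictWriter fill the gaps. The sketch assumes Python 3.7+ (not the 2.7 from the question), uses a cut-down sample, writes to an in-memory buffer, and the helper names flatten and customers_to_csv are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Cut-down sample: only the second customer has a Fax element
SAMPLE = """<Customers>
<Customer CustomerID="GREAL">
<CompanyName>Great Lakes Food Market</CompanyName>
<Phone>(503) 555-7555</Phone>
<FullAddress><City>Eugene</City></FullAddress>
</Customer>
<Customer CustomerID="HUNGC">
<CompanyName>Hungry Coyote Import Store</CompanyName>
<Phone>(503) 555-6874</Phone>
<Fax>(503) 555-2376</Fax>
<FullAddress><City>Elgin</City></FullAddress>
</Customer>
</Customers>"""

def flatten(customer):
    row = {}
    for detail in customer:
        if len(detail):  # has children, e.g. FullAddress
            for part in detail:
                row[part.tag] = (part.text or "").strip()
        else:
            row[detail.tag] = (detail.text or "").strip()
    return row

def customers_to_csv(xml_text):
    root = ET.fromstring(xml_text)
    rows = [flatten(c) for c in root.findall("Customer")]
    # header = every tag seen in any customer, in first-seen order
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

csv_text = customers_to_csv(SAMPLE)
print(csv_text)
```

The trade-off is that all rows are held in memory before anything is written, which is usually fine for files of this size.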
Update: When you have customers and their orders in the XML
The XML processing and CSV writing structure remains the same. In addition, the Orders element is processed while processing each customer: the Order elements under Orders can be handled exactly like Customer, and, as you mentioned, each Order has a ShipInfo element as well.
The input XML is assumed to be (based on the comment below):
<Customers>
<Customer CustomerID="GREAL">
<CompanyName>Great Lakes Food Market</CompanyName>
<ContactName>Howard Snyder</ContactName>
<ContactTitle>Marketing Manager</ContactTitle>
<Phone>(503) 555-7555</Phone>
<FullAddress>
<Address>2732 Baker Blvd.</Address>
<City>Eugene</City>
<Region>OR</Region>
<PostalCode>97403</PostalCode>
<Country>USA</Country>
</FullAddress>
<Orders>
<Order>
<Param1>Value1</Param1>
<Param2>Value2</Param2>
<ShipInfo>
<ShipInfoParam1>Value3</ShipInfoParam1>
<ShipInfoParam2>Value4</ShipInfoParam2>
</ShipInfo>
</Order>
<Order>
<Param1>Value5</Param1>
<Param2>Value6</Param2>
<ShipInfo>
<ShipInfoParam1>Value7</ShipInfoParam1>
<ShipInfoParam2>Value8</ShipInfoParam2>
</ShipInfo>
</Order>
</Orders>
</Customer>
<Customer CustomerID="HUNGC">
<CompanyName>Hungry Coyote Import Store</CompanyName>
<ContactName>Yoshi Latimer</ContactName>
<ContactTitle>Sales Representative</ContactTitle>
<Phone>(503) 555-6874</Phone>
<Fax>(503) 555-2376</Fax>
<FullAddress>
<Address>City Center Plaza 516 Main St.</Address>
<City>Elgin</City>
<Region>OR</Region>
<PostalCode>97827</PostalCode>
<Country>USA</Country>
</FullAddress>
<Orders>
<Order>
<Param1>Value7</Param1>
<Param2>Value8</Param2>
<ShipInfo>
<ShipInfoParam1>Value9</ShipInfoParam1>
<ShipInfoParam2>Value10</ShipInfoParam2>
</ShipInfo>
</Order>
</Orders>
</Customer>
<Customer CustomerID="LAZYK">
<CompanyName>Lazy K Kountry Store</CompanyName>
<ContactName>John Steel</ContactName>
<ContactTitle>Marketing Manager</ContactTitle>
<Phone>(509) 555-7969</Phone>
<Fax>(509) 555-6221</Fax>
<FullAddress>
<Address>12 Orchestra Terrace</Address>
<City>Walla Walla</City>
<Region>WA</Region>
<PostalCode>99362</PostalCode>
<Country>USA</Country>
</FullAddress>
</Customer>
<Customer CustomerID="LETSS">
<CompanyName>Let's Stop N Shop</CompanyName>
<ContactName>Jaime Yorres</ContactName>
<ContactTitle>Owner</ContactTitle>
<Phone>(415) 555-5938</Phone>
<FullAddress>
<Address>87 Polk St. Suite 5</Address>
<City>San Francisco</City>
<Region>CA</Region>
<PostalCode>94117</PostalCode>
<Country>USA</Country>
</FullAddress>
</Customer>
</Customers>
Here is the modified code that processes both customers and orders:
import xml.etree.ElementTree as et
import csv
tree = et.parse("../data/customers-with-orders.xml")
root = tree.getroot()
customer_csv = open('../data/customers-part.csv', 'w')
order_csv = open('../data/orders-part.csv', 'w')
customerCsvWriter = csv.writer(customer_csv)
orderCsvWriter = csv.writer(order_csv)
customerHeaders = []
orderHeaders = ['CustomerID']
isFirstCustomer = True
isFirstOrder = True
def processOrders(customerId, orders):
    # orders is the <Orders> element of the current customer
    global isFirstOrder
    for order in orders.findall('Order'):
        orderData = [customerId]
        for orderdetail in order:
            if(orderdetail.tag == 'ShipInfo'):
                for shipinfopart in orderdetail:
                    orderData.append(shipinfopart.text.rstrip('\r\n'))
                    if(isFirstOrder):
                        orderHeaders.append(shipinfopart.tag)
            else:
                orderData.append(orderdetail.text.rstrip('\r\n'))
                if(isFirstOrder):
                    orderHeaders.append(orderdetail.tag)
        if(isFirstOrder):
            orderCsvWriter.writerow(orderHeaders)
        orderCsvWriter.writerow(orderData)
        isFirstOrder = False
for customer in root.findall('Customer'):
    customerData = []
    customerId = customer.get('CustomerID')
    for detail in customer:
        if(detail.tag == 'FullAddress'):
            for addresspart in detail:
                # '\r\n' with backslashes: rstrip('/n/r') would strip the letters n and r
                customerData.append(addresspart.text.rstrip('\r\n'))
                if(isFirstCustomer):
                    customerHeaders.append(addresspart.tag)
        elif(detail.tag == 'Orders'):
            processOrders(customerId, detail)
        else:
            customerData.append(detail.text.rstrip('\r\n'))
            if(isFirstCustomer):
                customerHeaders.append(detail.tag)
    if(isFirstCustomer):
        customerCsvWriter.writerow(customerHeaders)
    customerCsvWriter.writerow(customerData)
    isFirstCustomer = False
customer_csv.close()
order_csv.close()
Output produced in customers-part.csv:
CompanyName,ContactName,ContactTitle,Phone,Address,City,Region,PostalCode,Country
Great Lakes Food Market,Howard Snyder,Marketing Manager,(503) 555-7555,2732 Baker Blvd.,Eugene,OR,97403,USA
Hungry Coyote Import Store,Yoshi Latimer,Sales Representative,(503) 555-6874,(503) 555-2376,City Center Plaza 516 Main St.,Elgin,OR,97827,USA
Lazy K Kountry Store,John Steel,Marketing Manager,(509) 555-7969,(509) 555-6221,12 Orchestra Terrace,Walla Walla,WA,99362,USA
Let's Stop N Shop,Jaime Yorres,Owner,(415) 555-5938,87 Polk St. Suite 5,San Francisco,CA,94117,USA
Output produced in orders-part.csv:
CustomerID,Param1,Param2,ShipInfoParam1,ShipInfoParam2
GREAL,Value1,Value2,Value3,Value4
GREAL,Value5,Value6,Value7,Value8
HUNGC,Value7,Value8,Value9,Value10
Note: the code can be optimized further by factoring out the repeated logic; I am leaving that part to you. Also notice that the customer ID is added to each order row so that orders can be matched to their customer.

remove content between tags in python using regex

I was trying to clean up wikitext. Specifically I was trying to remove all the {{.....}} and <..>...</..> in the wikitext. For example, for this wikitext:
"{{Infobox UK place\n|country = England\n|official_name =
Morcombelake\n|static_image_name = Morecombelake from Golden Cap -
geograph.org.uk - 1184424.jpg\n|static_image_caption = Morcombelake as
seen from Golden Cap\n|coordinates =
{{coord|50.74361|-2.85153|display=inline,title}}\n|map_type =
Dorset\n|population = \n|population_ref = \n|shire_district = [[West
Dorset]]\n|shire_county = [[Dorset]]\n|region = South West
England\n|constituency_westminster = West Dorset\n|post_town =
\n|postcode_district = \n|postcode_area = DT\n|os_grid_reference =
SY405938\n|website = \n}}\n'''Morcombelake''' (also spelled
'''Morecombelake''') is a small village near [[Bridport]] in
[[Dorset]], [[England]], within the ancient parish of [[Whitchurch
Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World
Heritage Site, is nearby.{{cite
web|url=http://www.nationaltrust.org.uk/golden-cap/|title=Golden
Cap|publisher=National Trust|accessdate=2014-05-04}}\n\n==
References ==\n{{reflist}}\n\n{{West
Dorset}}\n\n\n{{Dorset-geo-stub}}\n[[Category:Villages in
Dorset]]\n\n== External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n"
How can I use regular expressions in python to produce output like this:
\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small
village near [[Bridport]] in [[Dorset]], [[England]], within the
ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of
the [[Jurassic Coast]] World Heritage Site, is nearby.\n\n==
References ==\n\n\n\n\n\n\n[[Category:Villages in Dorset]]\n\n==
External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n
As the tags are nested into each other, you can find and remove them in a loop:
n = 1
while n > 0:
    s, n = re.subn(r'{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', '', s, flags=re.DOTALL)
s is a string containing the wikitext.
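Your example contains no <...> tags, but the second alternative of the pattern removes them as well. Note that it strips the tags themselves, not the text between an opening and a closing tag.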
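To see why the loop is needed, it helps to run it on a small nested input: each pass can only remove the innermost {{...}} groups, so the outer ones disappear on later passes. The sketch below wraps the same idea in a function; the name strip_wikitext and the sample strings are made up, and the braces are escaped only to make it obvious that they are literals.

```python
import re

# Innermost {{...}} template (no "{{" inside it) or a single <...> tag
TEMPLATE_OR_TAG = re.compile(
    r'\{\{(?!\{)(?:(?!\{\{).)*?\}\}|<[^<]*?>',
    flags=re.DOTALL,
)

def strip_wikitext(text):
    n = 1
    # repeat until a pass removes nothing, peeling nested templates layer by layer
    while n > 0:
        text, n = TEMPLATE_OR_TAG.subn('', text)
    return text

nested = strip_wikitext("{{a {{b}} c}}text{{d}}")
article = strip_wikitext("Site.{{cite web|url=x}}\n\n== References ==\n{{reflist}}")
print(repr(nested))
print(repr(article))
```

On the first input the first pass removes {{b}} and {{d}}, the second pass removes what is left of the outer template, and the third pass finds nothing and ends the loop.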
