Extracting data from XML into a csv file using BeautifulSoup - python

My goal is to extract all the data from an XML file, so I tried parsing it with BeautifulSoup and writing everything to a single .csv file.
My XML looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<jobs>
<job>
<title>
<![CDATA[Personal Shopper]]>
</title>
<date>
<![CDATA[Sat, 05 Dec 2020 12:25:52 UTC]]>
</date>
<referencenumber>
<![CDATA[12312414141]]>
</referencenumber>
<city>
<![CDATA[Powell]]>
</city>
<state>
<![CDATA[Washington]]>
</state>
<country>
<![CDATA[US]]>
</country>
<postalcode>
<![CDATA[98388]]>
</postalcode>
<salary>
<![CDATA[]]>
</salary>
<description>
<![CDATA[Sample of description]]>
</description>
</job>
<job>
<title>
<![CDATA[CEO]]>
</title>
<date>
<![CDATA[Sat, 28 Nov 2020 00:54:32 UTC]]>
</date>
<referencenumber>
<![CDATA[1231314211241412]]>
</referencenumber>
<city>
<![CDATA[peanut]]>
</city>
<country>
<![CDATA[US]]>
</country>
<postalcode>
<![CDATA[01961]]>
</postalcode>
<description>
<![CDATA[sample of description]]>
</description>
<source>
<![CDATA[Get me a job]]>
</source>
<cpc>0.36</cpc>
</job>
I used the Python code below, which is supposed to write every reference number from the XML to a CSV file, but it only extracted one reference number. Can someone point out what I did wrong?
from bs4 import BeautifulSoup
fd = open('/users/minion/downloads/goodnight.xml')
xml_file = fd.read()
output_csv = "output.csv"
soup = BeautifulSoup(xml_file,'html.parser')
with open(output_csv, 'w') as fout:
#print header
header="idx,reference_number"
fout.write("{}\n".format(header))
for idx,tag in enumerate(soup.findAll("referencenumber")):
data_row="{},{}".format(idx,tag)
fout.write("{}\n".format(data_row))
fd.close()
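The indentation of the code above did not survive, so the cause cannot be confirmed from it. One possibility is that the final fout.write call sat outside the for loop, which would write only a single data row. Separately, formatting the tag itself writes the whole element, markup included, rather than its text. Below is a minimal sketch using the standard library's xml.etree.ElementTree and csv modules instead of BeautifulSoup. It assumes the file is well-formed XML (the sample above has no closing </jobs>, which a strict parser would reject). The function name and the inline sample are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the XML file in the question.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<jobs>
<job>
<referencenumber>
<![CDATA[12312414141]]>
</referencenumber>
</job>
<job>
<referencenumber>
<![CDATA[1231314211241412]]>
</referencenumber>
</job>
</jobs>"""


def reference_numbers_to_csv(xml_bytes):
    """Return CSV text with one row per <referencenumber> element."""
    root = ET.fromstring(xml_bytes)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["idx", "reference_number"])
    for idx, tag in enumerate(root.iter("referencenumber")):
        # Write the element's text, not the element itself, and
        # strip the newlines that surround the CDATA section.
        writer.writerow([idx, (tag.text or "").strip()])
    return buffer.getvalue()


print(reference_numbers_to_csv(SAMPLE))
```

To write to disk instead, the same csv.writer can be pointed at a file opened with open("output.csv", "w", newline="").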

Related

To find element based on grandchildren tags using elementtree

I'm completely new to XML parsing. I have several thousand XML files, and I want to find every DE element, but only when it contains a country tag.
Here is my sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<DE>
<CT>
<IG>
<FS id="01">
<FE id="A" fId="B">
<title>Apple</title>
</FE>
</FS>
<country syse="21" subSys="2">
<FF FR="101" fe="01" />
<referTo refType="t06">
<CF Code="350" />
</referTo>
<place id="00A" placeValue="00AB">
<Q>001</Q>
<TQ>0001</TQ>
<PR Value="A" CodeValue="C" />
</place>
<place id="00E" placeValue="00EF">
<Q>001</Q>
<TQ>0001</TQ>
<PR Value="03" AValue="957" />
<Books>
<IA>
<Part />
</IA>
<PRGroup>
<country Code="5">
<PR Value="02" AValue="345" />
<constrain>Double condition.</constrain>
<constrain>Double condition.</constrain>
</country>
</PRGroup>
</Books>
</place>
</country>
</IG>
</CT>
</DE>
import xml.etree.ElementTree as ET
tree = ET.parse(content)
root = tree.getroot()
Num = root.findall("//DE[//place/Books/PRGroup/country]")
I get either a predicate error or an absolute-path error whichever way I try, and I can't figure it out.
How can I retrieve the matching elements and access their attributes?
Could you please help me with this?
With lxml it should be something along these lines:
from lxml import etree
content = """[your xml above]"""
root = etree.fromstring(content.encode())
Num = root.xpath("//DE[//place/Books/PRGroup/country]")
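If lxml is not available, something similar may be possible with the standard library. ElementTree supports only a limited XPath subset: paths passed to findall() are relative to the element they are called on, so an absolute path starting with // is rejected. The sketch below assumes DE is the root element, as in the sample, and searches beneath it. The function name and the trimmed sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed version of the sample document.
SAMPLE = """<DE>
<CT>
<IG>
<country syse="21" subSys="2">
<place id="00E" placeValue="00EF">
<Books>
<PRGroup>
<country Code="5">
<PR Value="02" AValue="345" />
</country>
</PRGroup>
</Books>
</place>
</country>
</IG>
</CT>
</DE>"""


def nested_countries(xml_text):
    """Return the country elements found under place/Books/PRGroup."""
    root = ET.fromstring(xml_text)
    if root.tag != "DE":
        return []
    # A relative path: ".//" searches all descendants of root.
    return root.findall(".//place/Books/PRGroup/country")


for country in nested_countries(SAMPLE):
    # Attributes are available as a dict via .attrib or one at a time via .get().
    print(country.tag, country.attrib)
    for pr in country.findall("PR"):
        print(pr.get("Value"), pr.get("AValue"))
```

An empty result means the file has no such nested country tag, so it can be skipped when looping over many files.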

iterate through XML?

What is the easiest way to navigate through XML with python?
<html>
<body>
<soapenv:envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:body>
<getservicebyidresponse xmlns="http://www.something.com/soa/2.0/SRIManagement">
<code xmlns="">
0
</code>
<client xmlns="">
<action xsi:nil="true">
</action>
<actionmode xsi:nil="true">
</actionmode>
<clientid>
405965216
</clientid>
<firstname xsi:nil="true">
</firstname>
<id xsi:nil="true">
</id>
<lastname>
Last Name
</lastname>
<role xsi:nil="true">
</role>
<state xsi:nil="true">
</state>
</client>
</getservicebyidresponse>
</soapenv:body>
</soapenv:envelope>
</body>
</html>
I could use a regex to pull out the values from the lines I need, but is there a more Pythonic way? Something like xml[0][1]?
As @deceze already pointed out, you can use xml.etree.ElementTree here.
import xml.etree.ElementTree as ET
tree = ET.parse("path_to_xml_file")
root = tree.getroot()
You can iterate over every element in the tree (root.iter() walks the root and all of its descendants, not only the direct children):
for child in root.iter():
    if child.tag == 'clientid':
        print(child.tag, child.text.strip())
Children are nested, and we can access specific child nodes by index, so root[0][1] should work (as long as the indices are correct).
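As a rough illustration of index access, here is a sketch on a hypothetical, simplified document (not the SOAP response above, whose nesting is deeper). Indexing is positional, so a path lookup such as findtext() is usually more robust if the element order can change.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document used only to illustrate indexing.
SAMPLE = """<response>
<code>
0
</code>
<client>
<clientid>
405965216
</clientid>
<lastname>
Last Name
</lastname>
</client>
</response>"""

root = ET.fromstring(SAMPLE)

# root[1] is the second child of <response>, i.e. <client>;
# root[1][0] is the first child of <client>, i.e. <clientid>.
client = root[1]
client_id = client[0].text.strip()

# The same kind of lookup by path instead of by position.
last_name = root.findtext("client/lastname").strip()

print(client.tag, client_id, last_name)
```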

XML Prettifying from file in Python

I have an xml file which looks like the example below.
Many of the text values start with a space, contain a stray \n (newline), or have other odd whitespace. I'm working with xml.etree.ElementTree, and it parses this file fine.
But I want more! :) I tried to prettify this mess, without success. I tried many tutorials, but none of them ended with pretty XML.
<?xml version="1.0"?>
<import>
<article>
<name> Name with space
</name>
<source> Daily Telegraph
</source>
<number>72/2015
</number>
<page>10
</page>
<date>2015-03-26
</date>
<author> Tomas First
</author>
<description>Economy
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text is here
</text>
</article>
<article>
<name> How to parse
</name>
<source> Internet article
</source>
<number>72/2015
</number>
<page>1
</page>
<date>2015-03-26
</date>
<author>Some author
</author>
<description> description
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text here
</text>
</article>
</import>
When I tried other answers from SO, they produced either the same file or even messier XML.
bs4 can do it (the 'xml' parser requires lxml to be installed):
from bs4 import BeautifulSoup
doc = BeautifulSoup(xmlstring, 'xml')
print(doc.prettify())
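A standard-library alternative is sketched below. It assumes Python 3.9 or later, where xml.etree.ElementTree.indent() is available. It strips the stray whitespace from every text value first and then re-indents the tree. The function name and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the file in the question.
SAMPLE = """<import>
<article>
<name> Name with space
</name>
<page>10
</page>
<attachment>
</attachment>
</article>
</import>"""


def prettify(xml_text):
    """Strip stray whitespace from text nodes, then indent the tree."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # .text is the content before the first child, .tail the content
        # after the closing tag; whitespace-only values become None.
        if element.text is not None:
            element.text = element.text.strip() or None
        if element.tail is not None:
            element.tail = element.tail.strip() or None
    ET.indent(root, space="  ")  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


print(prettify(SAMPLE))
```

Note that stripping every text node is only appropriate when leading and trailing whitespace carries no meaning, as seems to be the case here.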

Parsing XML in python - stumped how to do this

I've looked through a number of support pages, examples and documents however I am still stumped as to how I can achieve what I am after using python.
I need to process/parse an xml feed and just take very specific values from the XML document. Which is where I am stumped.
The xml looks like the following:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed>
<title type="text">DailyTreasuryYieldCurveRateData</title>
<id></id>
<updated>2014-12-03T07:44:30Z</updated>
<link rel="self" title="DailyTreasuryYieldCurveRateData" href="DailyTreasuryYieldCurveRateData" />
<entry>
<id></id>
<title type="text"></title>
<updated>2014-12-03T07:44:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6235)" />
<category />
<content type="application/xml">
<m:properties>
<d:Id m:type="Edm.Int32">6235</d:Id>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-01T00:00:00</d:NEW_DATE>
<d:BC_1MONTH m:type="Edm.Double">0.01</d:BC_1MONTH>
<d:BC_3MONTH m:type="Edm.Double">0.03</d:BC_3MONTH>
<d:BC_6MONTH m:type="Edm.Double">0.08</d:BC_6MONTH>
<d:BC_1YEAR m:type="Edm.Double">0.13</d:BC_1YEAR>
<d:BC_2YEAR m:type="Edm.Double">0.49</d:BC_2YEAR>
<d:BC_3YEAR m:type="Edm.Double">0.9</d:BC_3YEAR>
<d:BC_5YEAR m:type="Edm.Double">1.52</d:BC_5YEAR>
<d:BC_7YEAR m:type="Edm.Double">1.93</d:BC_7YEAR>
<d:BC_10YEAR m:type="Edm.Double">2.22</d:BC_10YEAR>
<d:BC_20YEAR m:type="Edm.Double">2.66</d:BC_20YEAR>
<d:BC_30YEAR m:type="Edm.Double">2.95</d:BC_30YEAR>
<d:BC_30YEARDISPLAY m:type="Edm.Double">2.95</d:BC_30YEARDISPLAY>
</m:properties>
</content>
</entry>
<entry>
<id></id>
<title type="text"></title>
<updated>2014-12-03T07:44:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6236)" />
<category />
<content type="application/xml">
<m:properties>
<d:Id m:type="Edm.Int32">6236</d:Id>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-02T00:00:00</d:NEW_DATE>
<d:BC_1MONTH m:type="Edm.Double">0.04</d:BC_1MONTH>
<d:BC_3MONTH m:type="Edm.Double">0.03</d:BC_3MONTH>
<d:BC_6MONTH m:type="Edm.Double">0.08</d:BC_6MONTH>
<d:BC_1YEAR m:type="Edm.Double">0.14</d:BC_1YEAR>
<d:BC_2YEAR m:type="Edm.Double">0.55</d:BC_2YEAR>
<d:BC_3YEAR m:type="Edm.Double">0.96</d:BC_3YEAR>
<d:BC_5YEAR m:type="Edm.Double">1.59</d:BC_5YEAR>
<d:BC_7YEAR m:type="Edm.Double">2</d:BC_7YEAR>
<d:BC_10YEAR m:type="Edm.Double">2.28</d:BC_10YEAR>
<d:BC_20YEAR m:type="Edm.Double">2.72</d:BC_20YEAR>
<d:BC_30YEAR m:type="Edm.Double">3</d:BC_30YEAR>
<d:BC_30YEARDISPLAY m:type="Edm.Double">3</d:BC_30YEARDISPLAY>
</m:properties>
</content>
</entry>
</feed>
This XML document gets a new entry appended each day for the duration of the month, then resets and starts again on the 1st of the next month.
I need to extract the date from d:NEW_DATE and the value from d:BC_10YEAR. With a single entry this is no problem, but I am struggling to work out how to go through the file and extract the relevant date and value from each entry block.
Any assistance is very much appreciated.
BeautifulSoup is probably the easiest way to do what you're looking for (shown here with the bs4 package and Python 3; the HTML parser lowercases tag names, hence the lowercase lookups):
from bs4 import BeautifulSoup

with open('datafile.xml', 'r') as f:
    xmldoc = f.read()
bs = BeautifulSoup(xmldoc, 'html.parser')
entryList = bs.find_all('entry')
for entry in entryList:
    props = entry.content.find('m:properties')
    print(props.find('d:new_date').contents[0])
    print(props.find('d:bc_10year').contents[0])
You can then replace the print with whatever you want to do with the data (add to a list etc.).
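For collecting the values into a list, a standard-library sketch might look like the following. The sample in the question does not show where the d: and m: prefixes are declared, so the namespace URIs below are assumptions (the usual OData ones), declared here on a hypothetical, trimmed feed. A real feed may declare them differently, and ElementTree will reject a document that uses undeclared prefixes.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the root element of the real feed.
NS = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Hypothetical, trimmed feed with the assumed declarations added.
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
<entry><content type="application/xml"><m:properties>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-01T00:00:00</d:NEW_DATE>
<d:BC_10YEAR m:type="Edm.Double">2.22</d:BC_10YEAR>
</m:properties></content></entry>
<entry><content type="application/xml"><m:properties>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-02T00:00:00</d:NEW_DATE>
<d:BC_10YEAR m:type="Edm.Double">2.28</d:BC_10YEAR>
</m:properties></content></entry>
</feed>"""


def ten_year_rates(xml_text):
    """Return a list of (date string, 10-year rate) tuples, one per entry."""
    root = ET.fromstring(xml_text)
    rows = []
    # Each entry has exactly one m:properties block, so iterate over those.
    for props in root.iter("{%s}properties" % NS["m"]):
        date = props.findtext("d:NEW_DATE", namespaces=NS)
        rate = props.findtext("d:BC_10YEAR", namespaces=NS)
        rows.append((date, float(rate)))
    return rows


for date, rate in ten_year_rates(SAMPLE):
    print(date, rate)
```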

python remove element containing namespace

I am trying to remove an element in an xml which contains a namespace.
Here is my code:
templateXml = """<?xml version="1.0" encoding="UTF-8"?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor>
<ActorName locale="en-GB">XXX</ActorName>
<Character locale="en-GB">XXX</Character>
</Actor>
</TitleInfo>
</Movie>
</Metadata>"""
from lxml import etree
tree = etree.fromstring(templateXml)
namespaces = {'ns':'http://www.amazon.com/UnboxMetadata/v1'}
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
    etree.strip_elements(tree, 'ns:Actor')
In my actual XML I have lots of tags, so I am trying to find the Actor tags that contain XXX and remove each such tag along with its contents. But it's not working.
Use the remove() method:
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
    checkActor.getparent().remove(checkActor)
print(etree.tostring(tree, pretty_print=True, xml_declaration=True).decode())
prints:
<?xml version='1.0' encoding='ASCII'?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
</TitleInfo>
</Movie>
</Metadata>
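The loop above removes every Actor element. To remove only the actors whose name is XXX, as the question asks, a check can be added before the removal. The sketch below uses the standard library's ElementTree rather than lxml, so it runs without extra packages. ElementTree has no getparent(), so the sketch walks over candidate parents instead. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"ns": "http://www.amazon.com/UnboxMetadata/v1"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor>
<ActorName locale="en-GB">XXX</ActorName>
<Character locale="en-GB">XXX</Character>
</Actor>
</TitleInfo>
</Movie>
</Metadata>"""


def remove_actors_named(xml_bytes, name):
    """Remove each Actor element whose ActorName text equals `name`."""
    root = ET.fromstring(xml_bytes)
    # Snapshot the elements first so removal does not disturb the walk.
    for parent in list(root.iter()):
        for actor in parent.findall("ns:Actor", NS):
            actor_name = actor.findtext("ns:ActorName", default="", namespaces=NS)
            if actor_name.strip() == name:
                parent.remove(actor)
    return root


root = remove_actors_named(SAMPLE, "XXX")
print(root.find(".//ns:Actor", NS))  # None once the matching actor is gone
```

With lxml, the same check could be placed inside the xpath loop shown above, before the call to getparent().remove().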
