XML Prettifying from file in Python

I have an xml file which looks like the example below.
Many of the text values start with a space or a newline (\n), or contain other odd whitespace. I'm working with xml.etree.ElementTree, and it parses this file just fine.
But I want more! :) I tried to prettify this mess, but without success. I followed many tutorials, but it never ends in pretty XML.
<?xml version="1.0"?>
<import>
<article>
<name> Name with space
</name>
<source> Daily Telegraph
</source>
<number>72/2015
</number>
<page>10
</page>
<date>2015-03-26
</date>
<author> Tomas First
</author>
<description>Economy
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text is here
</text>
</article>
<article>
<name> How to parse
</name>
<source> Internet article
</source>
<number>72/2015
</number>
<page>1
</page>
<date>2015-03-26
</date>
<author>Some author
</author>
<description> description
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text here
</text>
</article>
</import>
When I tried other answers from SO, they generated the same file or even messier XML.

bs4 can do it:
from bs4 import BeautifulSoup

# xmlstring holds the XML read from your file; the 'xml' parser requires lxml
doc = BeautifulSoup(xmlstring, 'xml')
print(doc.prettify())
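Alternatively, the standard library alone can handle it. A minimal sketch, assuming Python 3.9+ for ET.indent and using input.xml / pretty.xml as placeholder file names: strip the stray whitespace first, then re-indent.

import xml.etree.ElementTree as ET

tree = ET.parse('input.xml')                      # placeholder file name
for elem in tree.iter():
    # drop the leading/trailing whitespace inside each element
    if elem.text:
        elem.text = elem.text.strip()
    if elem.tail:
        elem.tail = elem.tail.strip()
ET.indent(tree, space='  ')                       # available in Python 3.9+
tree.write('pretty.xml', encoding='utf-8', xml_declaration=True)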

Related

Extracting data from XML into a csv file using BeautifulSoup

My objective is to get all the data from an XML file, so I tried to parse the XML using BeautifulSoup and extract it all into a single .csv file.
My XML look like below:
<?xml version="1.0" encoding="ISO-8859-1"?>
<jobs>
<job>
<title>
<![CDATA[Personal Shopper]]>
</title>
<date>
<![CDATA[Sat, 05 Dec 2020 12:25:52 UTC]]>
</date>
<referencenumber>
<![CDATA[12312414141]]>
</referencenumber>
<city>
<![CDATA[Powell]]>
</city>
<state>
<![CDATA[Washington]]>
</state>
<country>
<![CDATA[US]]>
</country>
<postalcode>
<![CDATA[98388]]>
</postalcode>
<salary>
<![CDATA[]]>
</salary>
<description>
<![CDATA[Sample of description]]>
</description>
</job>
<job>
<title>
<![CDATA[CEO]]>
</title>
<date>
<![CDATA[Sat, 28 Nov 2020 00:54:32 UTC]]>
</date>
<referencenumber>
<![CDATA[1231314211241412]]>
</referencenumber>
<city>
<![CDATA[peanut]]>
</city> <country>
<![CDATA[US]]>
</country>
<postalcode>
<![CDATA[01961]]>
</postalcode>
<description>
<![CDATA[sample of description]]>
</description>
<source>
<![CDATA[Get me a job]]>
</source>
<cpc>0.36</cpc>
</job>
I used the .py code below, which is supposed to print every reference number from the XML into a csv file; however, it only extracted one reference number from my xml file. Can someone point out which part I did wrong?
from bs4 import BeautifulSoup
fd = open('/users/minion/downloads/goodnight.xml')
xml_file = fd.read()
output_csv = "output.csv"
soup = BeautifulSoup(xml_file, 'html.parser')
with open(output_csv, 'w') as fout:
    # print header
    header = "idx,reference_number"
    fout.write("{}\n".format(header))
    for idx, tag in enumerate(soup.findAll("referencenumber")):
        data_row = "{},{}".format(idx, tag)
        fout.write("{}\n".format(data_row))
fd.close()
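For reference, a minimal sketch of one way to pull every reference number out, using the csv module and bs4's 'xml' parser (which needs lxml installed) so the CDATA text comes back through get_text(); the file paths are the ones from the question:

import csv
from bs4 import BeautifulSoup

with open('/users/minion/downloads/goodnight.xml') as fd:
    soup = BeautifulSoup(fd.read(), 'xml')        # the 'xml' parser requires lxml

with open('output.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['idx', 'reference_number'])
    # write the text inside each <referencenumber> tag, not the tag itself
    for idx, tag in enumerate(soup.find_all('referencenumber')):
        writer.writerow([idx, tag.get_text(strip=True)])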

BeautifulSoup can't access content of <text> tag

I'm using BeautifulSoup (version 4.4) to preprocess a Wikipedia textdump from https://dumps.wikimedia.org/enwiki/ for further parsing.
The textdump document contains multiple articles, each contained in a <page> tag.
Unfortunately, something about the document structure seems incompatible with BeautifulSoup: Within each <page>, the text body of an article is contained in a <text> block:
<text xml:space="preserve">...</text>
Once I've selected a certain <page> block, I should be able to access the content of the text block as page.text.string.
In BeautifulSoup, .text used to be reserved for the content of a tag between its brackets. In more recent versions, .string is used for that.
Unfortunately, it seems like page.text is still interpreted the same as page.string for backwards compatibility. (Edit: getattr(page, "text") does the same.)
Is there any way I can get around this and access an HTML tag named <text>?
(Edit: For a syntax example, see https://pastebin.com/WQvJn0gf.)
Using .find and .text works as expected:
from bs4 import BeautifulSoup
string = '''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>...</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>854851586</id>
<parentid>834079434</parentid>
<timestamp>2018-08-14T06:47:24Z</timestamp>
<contributor>
<username>Godsy</username>
<id>23257138</id>
</contributor>
<comment>remove from category for seeking instructions on rcats</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
<sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
</revision>
</page>
...
</mediawiki>'''
soup = BeautifulSoup(string, 'html.parser')
page_tag = soup.find('page')
text_tag = page_tag.find('text')
print(text_tag.text)
# #REDIRECT [[Computer accessibility]]
# {{R from move}}
# {{R from CamelCase}}
# {{R unprintworthy}}
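As a quick sanity check against the same soup object built above, the difference between the .text attribute and the <text> element shows up directly:

# .text on a tag concatenates every string inside it, so page_tag.text also
# contains the title, ids, timestamps and the contributor name
print('Godsy' in page_tag.text)                   # True

# the <text> element has to be reached with find(); its body is then
# available via .string or .text on the found tag
print(text_tag.string.startswith('#REDIRECT'))    # True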

xml.etree.ElementTree.ParseError: not well-formed

I have the following code:
from xml.etree import ElementTree
file_path = 'some_file_path'
document = ElementTree.parse(file_path, ElementTree.XMLParser(encoding='utf-8'))
If my XML looks like the following, it gives me the error: "xml.etree.ElementTree.ParseError: not well-formed"
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
In Sublime or Notepad++ I see highlighted characters such as ACK, DC4, or STX, which seem to be the culprit (one of them appears as a "-" in the above xml, in the second "text" node). If I remove these characters it works. What are they and how can I fix this?
Running your code as follows, it works fine:
from xml.etree import ElementTree
from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
xml_content = """<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""
print("parsing xml document")
# using StringIO to simulate reading from file
document = ElementTree.parse(StringIO(xml_content), ElementTree.XMLParser(encoding='utf-8'))
for elem in document.iter():
    print(elem.tag)
And the output is as expected:
parsing xml document
pages
page
textbox
textline
text
text
text
So the issue is in how you are copying and pasting your file from Notepad++; maybe it is adding some special characters, so try another editor.
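If the stray bytes cannot be avoided at the source, another option (just a sketch, not part of the answer above) is to strip the control characters that XML 1.0 forbids before parsing; ACK (0x06), DC4 (0x14) and STX (0x02) all fall in that range. The file path below is the placeholder from the question:

import re
from xml.etree import ElementTree

with open('some_file_path', encoding='utf-8') as f:
    raw = f.read()

# remove control characters not allowed in XML 1.0
# (everything below 0x20 except tab, newline and carriage return)
cleaned = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', raw)

root = ElementTree.fromstring(cleaned)
for elem in root.iter():
    print(elem.tag)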

How to download the content of a url into a pandas dataframe with python-twitter?

I have an xml like this:
<author ="twitter" lang="english" type="xx" age_misc="xx" url="https://twitter.com/Carmen_RRHH">
<documents count="436">
<document id="106259332342342348513" url="https://twitter.com/Carmen_RRHH/status/106259338234048513"> </document>
<document id="232342342342323423" url="https://twitter.com/Carmen_RRHH/status/106260629999992832"> </document>
<document id="107084815504908291" url="https://twitter.com/Carmen_RRHH/status/107084815504908291"> </document>
<document id="108611036164276224" url="https://twitter.com/Carmen_RRHH/status/108611036164276224"> </document>
<document id="23423423423423" url="https://twitter.com/Carmen_RRHH/status/108611275851956224"> </document>
<document id="109283650823423480806912" url="https://twitter.com/Carmen_RRHH/status/109283650880806912"> </document>
<document id="10951489623423290488320" url="https://twitter.com/Carmen_RRHH/status/109514896290488320"> </document>
<document id="1095159513234234355080704" url="https://twitter.com/Carmen_RRHH/status/109515951355080704"> </document>
<document id="96252622234239511966720" url="https://twitter.com/Carmen_RRHH/status/96252629511966720"> </document>
</documents>
</author>
Is it possible to get the content of these links and place it into a pandas dataframe? Any idea of how to approach this task? Thanks in advance.
You have access to Python, so requests is a good choice:
import requests
r = requests.get("https://twitter.com/Carmen_RRHH/status/106259338234048513")
r.text  # the html
However, to get them into a pandas DataFrame this content needs to be structured (like a table), which it generally isn't going to be...
I recommend looking into the Twitter API, or an existing Twitter client for Python, e.g. https://github.com/bear/python-twitter; that way you can extract the features you want cleanly (into columns) rather than munging them out of the HTML.
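For the XML side of the task, a minimal sketch that collects the tweet ids and urls into a DataFrame, assuming the document is saved as author.xml (a placeholder name) and that the author element in the real file is well-formed; fetching each url would then go through requests or a Twitter client as suggested above:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('author.xml')                     # placeholder file name
rows = [{'id': doc.get('id'), 'url': doc.get('url')}
        for doc in tree.iter('document')]
df = pd.DataFrame(rows)
print(df.head())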

Parsing XML in python - stumped how to do this

I've looked through a number of support pages, examples and documents; however, I am still stumped as to how I can achieve what I am after using Python.
I need to process/parse an XML feed and take just a few very specific values from the document, which is where I am stumped.
The xml looks like the following:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed>
<title type="text">DailyTreasuryYieldCurveRateData</title>
<id></id>
<updated>2014-12-03T07:44:30Z</updated>
<link rel="self" title="DailyTreasuryYieldCurveRateData" href="DailyTreasuryYieldCurveRateData" />
<entry>
<id></id>
<title type="text"></title>
<updated>2014-12-03T07:44:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6235)" />
<category />
<content type="application/xml">
<m:properties>
<d:Id m:type="Edm.Int32">6235</d:Id>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-01T00:00:00</d:NEW_DATE>
<d:BC_1MONTH m:type="Edm.Double">0.01</d:BC_1MONTH>
<d:BC_3MONTH m:type="Edm.Double">0.03</d:BC_3MONTH>
<d:BC_6MONTH m:type="Edm.Double">0.08</d:BC_6MONTH>
<d:BC_1YEAR m:type="Edm.Double">0.13</d:BC_1YEAR>
<d:BC_2YEAR m:type="Edm.Double">0.49</d:BC_2YEAR>
<d:BC_3YEAR m:type="Edm.Double">0.9</d:BC_3YEAR>
<d:BC_5YEAR m:type="Edm.Double">1.52</d:BC_5YEAR>
<d:BC_7YEAR m:type="Edm.Double">1.93</d:BC_7YEAR>
<d:BC_10YEAR m:type="Edm.Double">2.22</d:BC_10YEAR>
<d:BC_20YEAR m:type="Edm.Double">2.66</d:BC_20YEAR>
<d:BC_30YEAR m:type="Edm.Double">2.95</d:BC_30YEAR>
<d:BC_30YEARDISPLAY m:type="Edm.Double">2.95</d:BC_30YEARDISPLAY>
</m:properties>
</content>
</entry>
<entry>
<id></id>
<title type="text"></title>
<updated>2014-12-03T07:44:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6236)" />
<category />
<content type="application/xml">
<m:properties>
<d:Id m:type="Edm.Int32">6236</d:Id>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-02T00:00:00</d:NEW_DATE>
<d:BC_1MONTH m:type="Edm.Double">0.04</d:BC_1MONTH>
<d:BC_3MONTH m:type="Edm.Double">0.03</d:BC_3MONTH>
<d:BC_6MONTH m:type="Edm.Double">0.08</d:BC_6MONTH>
<d:BC_1YEAR m:type="Edm.Double">0.14</d:BC_1YEAR>
<d:BC_2YEAR m:type="Edm.Double">0.55</d:BC_2YEAR>
<d:BC_3YEAR m:type="Edm.Double">0.96</d:BC_3YEAR>
<d:BC_5YEAR m:type="Edm.Double">1.59</d:BC_5YEAR>
<d:BC_7YEAR m:type="Edm.Double">2</d:BC_7YEAR>
<d:BC_10YEAR m:type="Edm.Double">2.28</d:BC_10YEAR>
<d:BC_20YEAR m:type="Edm.Double">2.72</d:BC_20YEAR>
<d:BC_30YEAR m:type="Edm.Double">3</d:BC_30YEAR>
<d:BC_30YEARDISPLAY m:type="Edm.Double">3</d:BC_30YEARDISPLAY>
</m:properties>
</content>
</entry>
</feed>
This XML document gets a new entry appended each day for the duration of the month; it then resets and starts again on the 1st of the next month.
I need to extract the date from d:NEW_DATE and the value from d:BC_10YEAR. When there is just a single entry this is no problem, but I am struggling to work out how to go through the file and extract the relevant date and value from each entry block.
Any assistance is very much appreciated.
BeautifulSoup is probably the easiest way to do what you're looking for:
from BeautifulSoup import BeautifulSoup
xmldoc = open('datafile.xml', 'r').read()
bs = BeautifulSoup(xmldoc)
entryList = bs.findAll('entry')
for entry in entryList:
    print entry.content.find('m:properties').find('d:new_date').contents[0]
    print entry.content.find('m:properties').find('d:bc_10year').contents[0]
You can then replace the print with whatever you want to do with the data (add to a list etc.).
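The answer above is BeautifulSoup 3 / Python 2. For completeness, a rough Python 3 / bs4 equivalent, assuming the feed is saved as feed.xml (a placeholder name); with html.parser the namespace prefixes are simply part of the lowercased tag names:

from bs4 import BeautifulSoup

with open('feed.xml') as f:                       # placeholder file name
    soup = BeautifulSoup(f.read(), 'html.parser')

rates = []
for entry in soup.find_all('entry'):
    date = entry.find('d:new_date').get_text(strip=True)
    rate = entry.find('d:bc_10year').get_text(strip=True)
    rates.append((date, rate))

print(rates)
# e.g. [('2014-12-01T00:00:00', '2.22'), ('2014-12-02T00:00:00', '2.28')]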
