How to remove all " \n" in xml payload by using lxml library - python

I'm trying to change a text value in an XML file and return the updated XML content using the lxml library. I can update the value successfully, but the updated XML contains "\n" (newline) characters, as shown below.
Output:
<?xml version='1.0' encoding='ASCII'?>\n<Order>\n <content>\n <sID>123</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
<content>\n <sID>111</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
</Order>
Note: I didn't format the XML output above; I posted it exactly as it appears in the output console.
Input:
<Order>
<content>
<sID>123</sID>
<spNumber>UserTemp</spNumber>
<client>WALLMART</client>
<orderType>Dashboard</orderType>
</content>
<content>
<sID>111</sID>
<spNumber>UserTemp</spNumber>
<client>D&B</client>
<orderType>Dashboard</orderType>
</content>
</Order>
Also, I tried to remove the \n character in output xml file by using
getValue = getValue.replace('\n','')
but, no luck.
I used the code below to update the XML (the <client> tag) and tried to return the updated XML content.
Python Code:
from lxml import etree
from io import StringIO
import six
import numpy
def getListOfNodes(location):
    f = open(location)
    xml = f.read()
    f.close()
    #print(xml)
    getXml = etree.parse(location)
    for elm in getXml.xpath('.//Order//content/client'):
        index='ARRCHANA'
        elm.text=index
    #with open('C:\\New folder\\temp.xml','w',newline='\r\n') as writeFile:
    #writeFile.write(str(etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    getValue=str((etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    #getValue = getValue.replace('\n','')
    #getValue=getValue.replace("\n","<br/>")
    print(getValue)
    return getValue
When I try to open the response payload in Firefox, it shows the error message below:
XML Parsing Error: no element found Location:
file:///C:/New%20folder/Confidential.xml
Line Number 1, Column 1:
It reports "no element found" at line 1, column 1 of the XML file when the file contains the "\n" characters.
Can somebody suggest a better way to update the text value and return the XML without any additional characters?

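I fixed it myself using the script below:
# root is the parsed root element, e.g. etree.parse(location).getroot()
code = root.xpath('.//Order//content/client')
if code:
    code[0].text = 'ARRCHANA'
# raw string so that "\t" in the path is not read as a tab character
etree.ElementTree(root).write(r'D:\test.xml', pretty_print=True)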
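A likely reason the literal "\n" showed up in the first place: the serializer returns bytes, and calling str() on bytes produces the printable representation (a leading b, quotes, and escaped newlines) rather than the text itself. The sketch below demonstrates the difference with the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml's etree.tostring also returns bytes by default and accepts encoding='unicode' to return a string directly. The sample document and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the input from the question.
SAMPLE = "<Order>\n  <content>\n    <client>WALLMART</client>\n  </content>\n</Order>"

def serialize_both_ways(xml_text, new_client):
    """Update every <client> and return (str() of the bytes, decoded text)."""
    root = ET.fromstring(xml_text)
    for elm in root.iter("client"):
        elm.text = new_client
    raw = ET.tostring(root)        # bytes by default
    as_repr = str(raw)             # repr of the bytes: b'...' with backslash-n pairs
    as_text = raw.decode("ascii")  # real text with real newlines
    return as_repr, as_text

as_repr, as_text = serialize_both_ways(SAMPLE, "ARRCHANA")
print(as_repr)
print(as_text)
```

This would also explain why replace('\n', '') had no effect: the string held a backslash followed by the letter n, not a newline character.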

Related

Editing existing XML file and sending post via Jboss

I have the following Python method, which runs through an XML file, parses it, and TRIES to edit a field:
import requests
import xml.etree.ElementTree as ET
import random
def runThrougheTree():
    #open xml file
    with open("testxml.xml") as xml:
        from lxml import etree
        #parse
        parser = etree.XMLParser(strip_cdata=True, recover=True)
        tree = etree.parse("testxml.xml", parser)
        root= tree.getroot()
        #ATTEMPT to edit field - will not work as of now
        for ci in root.iter("CurrentlyInjured"):
            ci.text = randomCurrentlyInjured(['sffdgdg', 'sdfsdfdsfsfsfsd','sfdsdfsdfds'])
        #Overwrite initial xml file with new fields - will not work as of now
        etree.ElementTree(root).write("testxml.xml",pretty_print=True, encoding='utf-8', xml_declaration=True)
        #send post (Jboss)
        requests.post('http://localhost:9000/something/RuleServiceImpl', data="testxml.xml")
def randomCurrentlyInjured(ran):
    random.shuffle(ran)
    return ran[0]
#-----------------------------------------------
if __name__ == "__main__":
    runThrougheTree()
Edited XML file:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:rule="http://somewebsite.com/" xmlns:ws="http://somewebsite.com/" xmlns:bus="http://somewebsite.com/">
<soapenv:Header/>
<soapenv:Body>
<ws:Respond>
<ws:busMessage>
<bus:SomeRef>insertnumericvaluehere</bus:SomeRef>
<bus:Content><![CDATA[<SomeDef>
<SomeType>ABCD</Sometype>
<A_Message>
<body>
<AnonymousField>
<RefIndicator>1111111111111</RefIndicator>
<OneMoreType>HIJK</OneMoreType>
<CurrentlyInjured>ABCDE</CurentlyInjured>
</AnonymousField>
</body>
</A_Message>
</SomeDef>]]></bus:Content>
<bus:MessageTypeId>somenumericvalue</bus:MessageTypeId>
</ws:busMessage>
</ws:Respond>
</soapenv:Body>
</soapenv:Envelope>
Issues:
The field is not being edited.
Jboss error: Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
Note: I have ensured that there are no characters prior to the first XML tag.
In the end, I was unable to use lxml or ElementTree to edit the fields and post to Jboss because:
I had CDATA in the XML, as mzjn pointed out in the comments.
Jboss did not like the request after it had been parsed, even when the CDATA tags were removed.
Workaround/eventual solution: I was able to (somewhat tediously) use .replace() in my script to edit the plain text successfully, and then send the POST to Jboss. I hope this helps someone else someday!
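For readers who hit the same CDATA problem: everything inside a CDATA section is plain text to the parser, so iterating over the outer tree never finds CurrentlyInjured. One possible approach, sketched below under the assumption that the embedded document is itself well-formed XML, is to parse the text of bus:Content separately, edit it, and store it back. The envelope is a shortened, hypothetical version of the one in the question, and the sketch uses the standard library parser to stay self-contained. When written back this way the inner markup is escaped rather than wrapped in CDATA, which is equivalent for an XML parser; lxml provides etree.CDATA if the CDATA wrapper must be kept.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened envelope with well-formed inner XML inside CDATA.
ENVELOPE = (
    '<Envelope xmlns:bus="http://somewebsite.com/">'
    '<bus:Content><![CDATA[<SomeDef><CurrentlyInjured>ABCDE'
    '</CurrentlyInjured></SomeDef>]]></bus:Content>'
    '</Envelope>'
)
BUS = "{http://somewebsite.com/}"  # namespace URI taken from the question

def edit_embedded_field(envelope_text, tag, new_value):
    """Edit a field inside the XML document embedded in <bus:Content>."""
    outer = ET.fromstring(envelope_text)
    content = outer.find(".//" + BUS + "Content")
    inner = ET.fromstring(content.text)  # the CDATA payload is just text
    for elm in inner.iter(tag):
        elm.text = new_value
    # Store the edited inner document back as text (escaped on output).
    content.text = ET.tostring(inner, encoding="unicode")
    return ET.tostring(outer, encoding="unicode")

result = edit_embedded_field(ENVELOPE, "CurrentlyInjured", "sffdgdg")
print(result)
```

Separately, the posted code passes the string "testxml.xml" as the request body instead of the file's contents, which may be what triggers "Content is not allowed in prolog"; reading the file and passing its bytes as data would be the thing to try first.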

Take the content of a single tag from an xml file with the requests module

I make a request to a SOAP service with the Python requests module, using this code:
response = requests.get(url,data=body,headers=headers)
and the service returns this XML as the response:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:aa="example.com/api"><soap:Body>
<aa:GetStockFileResponse> GetStockFileResponseType
<aa:TestMode> boolean </aa:TestMode>
<aa:Errors> ArrayOfError
<aa:Error> Error
<aa:Code> int </aa:Code>
<aa:Description> string </aa:Description>
</aa:Error>
</aa:Errors>
<aa:Warnings> ArrayOfWarning
<aa:Warning> Warning
<aa:Code> int </aa:Code>
<aa:Description> string </aa:Description>
</aa:Warning>
</aa:Warnings>
<aa:StockFileFormat> StockFileFormat (string) </aa:StockFileFormat>
<aa:FieldDelimiter> StringLength1 (string) </aa:FieldDelimiter>
<aa:File> base64Binary </aa:File>
</aa:GetStockFileResponse>
</soap:Body></soap:Envelope>
I need to write the content of <aa:File> base64Binary </aa:File>, which is a base64-encoded CSV file, to a CSV file.
My code to write the response is:
with open ('test.csv','wb') as f:
    f.write (response.content)
which obviously writes the whole XML...
How can I extract only the <aa:File> base64Binary </aa:File> content?
Would something like this be the solution?
import re
xmlText = '<foo>Foo</foo><aa:File> base64Binary </aa:File><bar>Bar</bar>'
# Target to extract: " base64Binary "
content = re.findall(r'<aa:File>(.+?)</aa:File>', xmlText)
print(content) # outputs [' base64Binary '] (findall returns a list)
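The regular expression returns the encoded text, which still has to be base64-decoded before it is a usable CSV. An alternative sketch, using the standard library's XML parser with the namespace URIs shown in the response above, might look like this. The function name and the sample payload are invented for illustration; in the real script the input would be response.content.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace URIs copied from the response shown in the question.
NS = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "aa": "example.com/api",
}

def extract_file_bytes(xml_bytes):
    """Return the decoded bytes of the first <aa:File> element."""
    root = ET.fromstring(xml_bytes)
    node = root.find(".//aa:File", NS)
    if node is None or node.text is None:
        raise ValueError("no <aa:File> content in the response")
    return base64.b64decode(node.text.strip())

# Hypothetical response body carrying a tiny CSV.
payload = base64.b64encode(b"sku,qty\nA1,3\n").decode("ascii")
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" '
    'xmlns:aa="example.com/api"><soap:Body><aa:GetStockFileResponse>'
    '<aa:File> ' + payload + ' </aa:File>'
    '</aa:GetStockFileResponse></soap:Body></soap:Envelope>'
).encode("utf-8")

csv_bytes = extract_file_bytes(SAMPLE)
print(csv_bytes)
```

The decoded bytes can then be written with open('test.csv', 'wb'), as in the original snippet.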

How to parse an XML file and get its data Python

I have the web service below: 'https://news.google.com/news/rss/?ned=us&hl=en'
I need to parse it and get the title and date values of each item in the XML.
I tried saving the data to an XML file and parsing it, but I see only blank values:
import requests
import xml.etree.ElementTree as ET
response = requests.get('https://news.google.com/news/rss/?ned=us&hl=en')
with open('text.xml','w') as xmlfile:
    xmlfile.write(response.text)
with open('text.xml','rt') as f:
    tree = ET.parse(f)
for node in tree.iter():
    print (node.tag, node.attrib)
I am not sure where I am going wrong. I need to extract the title and published date of every item in the XML.
Thanks in advance for any answers.
@Ilja Everilä is right, you should use feedparser.
There is certainly no need to write an XML file, unless you want to archive it.
I didn't quite get what output you expected, but something like this works (Python 3):
import feedparser
url = 'https://news.google.com/news/rss/?ned=us&hl=en'
d = feedparser.parse(url)
#print the feed title
print(d['feed']['title'])
#print tuples (title, tag)
print([(d['entries'][i]['title'], d['entries'][i]['tags'][0]['term']) for i in range(len(d['entries']))] )
To print them explicitly as UTF-8 strings, use:
print([(d['entries'][i]['title'].encode('utf8'), d['entries'][i]['tags'][0]['term'].encode('utf8')) for i in range(len(d['entries']))])
Maybe if you show your expected output, we could help you to get the right content from the parser.
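If installing feedparser is not an option, the original ElementTree attempt can probably be made to work as well. It printed node.attrib, which is empty for most RSS elements; the values live in each element's text. The sketch below assumes a plain RSS 2.0 layout (item elements with title and pubDate children) and uses a made-up feed so that it runs offline; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 sample standing in for the downloaded feed.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Top stories</title>
<item><title>First headline</title><pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate></item>
<item><title>Second headline</title><pubDate>Mon, 01 Jan 2018 11:30:00 GMT</pubDate></item>
</channel></rss>"""

def titles_and_dates(rss_text):
    """Return a list of (title, pubDate) tuples, one per <item>."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("pubDate"))
        for item in root.iter("item")
    ]

for title, date in titles_and_dates(SAMPLE_RSS):
    print(title, "|", date)
```

With the live feed, the text passed in would be the body returned by requests.get, with no intermediate file needed.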

Adding <root> tag to XML doc with Python

I am trying to add a root tag to the beginning and end of a 2-million-line XML file so that my Python code can process it properly.
I tried using this code from a previous post, but I am getting the error "XMLSyntaxError: Extra content at the end of the document, line __, column 1".
How do I solve this? Or is there a better way to add a root tag to the beginning and end of my large XML doc?
import lxml.etree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
newroot = ET.Element("root")
newroot.insert(0, root)
print(ET.tostring(newroot, pretty_print=True))
My test XML
<pub>
<ID>75</ID>
<title>Use of Lexicon Density in Evaluating Word Recognizers</title>
<year>2000</year>
<booktitle>Multiple Classifier Systems</booktitle>
<pages>310-319</pages>
<authors>
<author>Petr Slavík</author>
<author>Venu Govindaraju</author>
</authors>
</pub>
<pub>
<ID>120</ID>
<title>Virtual endoscopy with force feedback - a new system for neurosurgical training</title>
<year>2003</year>
<booktitle>CARS</booktitle>
<pages>782-787</pages>
<authors>
<author>Christos Trantakis</author>
<author>Friedrich Bootz</author>
<author>Gero Strauß</author>
<author>Edgar Nowatius</author>
<author>Dirk Lindner</author>
<author>Hüseyin Kemâl Çakmak</author>
<author>Heiko Maaß</author>
<author>Uwe G. Kühnapfel</author>
<author>Jürgen Meixensberger</author>
</authors>
</pub>
I suspect that gambit works only when there is a single element at the highest level. Fortunately, even with two million lines it's easy to add the lines you need.
In doing this I noticed that the lxml parser seems unable to process the accented characters. I have therefore added code to anglicise them.
import re
def anglicise(matchobj):
    return matchobj.group(0)[1]
outputFilename = 'result.xml'
with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    # iterate over the file directly so the whole file is not held in memory
    for line in inXML:
        outXML.write(re.sub('&[a-zA-Z]+;',anglicise,line))
    outXML.write('</root>\n')
from lxml import etree
tree = etree.parse(outputFilename)
years = tree.xpath('.//year')
print (years[0].text)
Edit: Replace anglicise with this version to avoid replacing &amp;.
def anglicise(matchobj):
    if matchobj.group(0) == '&amp;':
        return matchobj.group(0)
    else:
        return matchobj.group(0)[1]
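The "Extra content at the end of the document" error is what a parser reports when a file has more than one top-level element, which is why the wrapping step is needed before parsing. As a variation on the answer above (not the answerer's method), the sketch below keeps the accented characters instead of anglicising them: it converts HTML named entities to the real characters with the standard library's html.unescape and leaves the five entities that XML itself defines untouched. It assumes the input file is UTF-8 and that the unparseable characters were HTML entities such as &uuml;; the function names and the tiny sample file are illustrative only.

```python
import html
import os
import re
import tempfile
import xml.etree.ElementTree as ET

XML_PREDEFINED = {"&amp;", "&lt;", "&gt;", "&quot;", "&apos;"}
ENTITY = re.compile(r"&[a-zA-Z]+;")

def _decode(match):
    """Keep XML's own entities, turn HTML named entities into characters."""
    entity = match.group(0)
    return entity if entity in XML_PREDEFINED else html.unescape(entity)

def wrap_with_root(in_path, out_path, root_tag="root"):
    """Stream in_path to out_path line by line, wrapped in a single root element."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write("<" + root_tag + ">\n")
        for line in src:
            dst.write(ENTITY.sub(_decode, line))
        dst.write("</" + root_tag + ">\n")

# Small demonstration with a made-up two-record file.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "test.xml")
    dst_path = os.path.join(tmp, "result.xml")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("<pub><author>J&uuml;rgen</author></pub>\n")
        f.write("<pub><title>A &amp; B</title></pub>\n")
    wrap_with_root(src_path, dst_path)
    root = ET.parse(dst_path).getroot()
    record_count = len(root)
    authors = [a.text for a in root.iter("author")]
    titles = [t.text for t in root.iter("title")]

print(record_count, authors, titles)
```

Entity names that html.unescape does not recognise are left as they are and would still need separate handling.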

writing string as element attribute of xml file using embedded python in C

I'm using etree from lxml to write a plain string as an element attribute in an XML file. The problem is that when my string contains non-English (non-ASCII) characters, the program fails to write it, even with a unicode conversion such as
unicode(mystring, "utf-8")
Here is an example:
from lxml import etree as ET
doc = ET.parse('my_file.xml')
info = "préparation"
top = doc.getroot()
element1 = ET.Element("first_element")
element1.set("msg ",unicode(info,"utf-8"))
top.append(element1)
file1= open('my_file.xml',"wb")
doc.write(file1)
file1.close()
This code doesn't give any errors, but I don't get any result.
Content of 'my_file.xml':
<?xml-stylesheet type="text/xsl" href="parse.xsl" encoding="UTF-8"?>
<article date="2017-04-27" langue="french" time="16:33:20"/>
I want to add 'element1' as a child of 'article' and add 'msg' as an attribute of 'element1'.
Edit: I'm sure this issue is related to embedding Python in C, because the program works when I run the Python code without embedding it in C.
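For reference, here is what the intended result could look like in plain Python 3, where str is already Unicode and the Python 2 only unicode() call is not needed. This is a minimal sketch with the standard library's ElementTree and a simplified copy of the article element (the xml-stylesheet line is left out); it does not address the C embedding itself, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the <article> element from my_file.xml.
ARTICLE = '<article date="2017-04-27" langue="french" time="16:33:20"/>'

def add_message(xml_text, message):
    """Append <first_element msg="..."/> to the root and return UTF-8 bytes."""
    root = ET.fromstring(xml_text)
    child = ET.SubElement(root, "first_element")
    child.set("msg", message)  # a Python 3 str can hold non-ASCII directly
    return ET.tostring(root, encoding="utf-8")

out = add_message(ARTICLE, "préparation")
print(out.decode("utf-8"))
```

If this works standalone but not when embedded, the place to look is probably how the string is passed from C into the interpreter and which encoding is assumed there.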
