re.sub() does not keep blanks and new lines - python

I have an XML file with the following line:
<CREATION_DATE>2009-12-20T10:47:07.000Z</CREATION_DATE>
that I would like to replace with the following:
<CREATION_DATE>XXX</CREATION_DATE>
I thought it would be pretty straightforward using the re module in the Python script I'm supposed to modify. I did something like this:
if '</CREATION_DATE>' in ligne:
    out_lines[i] = re.sub(r'(^.*<CREATION_DATE>).*(</CREATION_DATE>.*$)', r'\1XXX\2', ligne)
The field with the date is correctly replaced, but the trailing newline and indentation are lost in the process. I tried converting ligne and the result of the sub function to a raw string with .encode('string-escape'), with no success. I am a noob in Python, but I am somewhat accustomed to regexes, and I really cannot see what I am doing wrong.

A simpler and more reliable alternative for replacing the text of an XML element would be to use an XML parser. There is even one in the Python Standard Library:
>>> import xml.etree.ElementTree as ET
>>>
>>> s = '<ROOT><CREATION_DATE>2009-12-20T10:47:07.000Z</CREATION_DATE></ROOT>'
>>> root = ET.fromstring(s)
>>> root.find("CREATION_DATE").text = 'XXX'
>>> ET.tostring(root)
'<ROOT><CREATION_DATE>XXX</CREATION_DATE></ROOT>'

As stated in comments, the variable ligne was stripped of blanks and new lines with ligne = ligne.strip() elsewhere in the code... I am not deleting my question though because alecxe's answer on the xml module is very informative.
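To make that resolution concrete, here is a minimal sketch (using the asker's own pattern) showing that re.sub itself preserves the surrounding whitespace as long as the input line has not been stripped:

```python
import re

# an unstripped line: leading indentation and a trailing newline
ligne = '    <CREATION_DATE>2009-12-20T10:47:07.000Z</CREATION_DATE>\n'
result = re.sub(r'(^.*<CREATION_DATE>).*(</CREATION_DATE>.*$)', r'\1XXX\2', ligne)
# the indentation is captured in group 1 and the newline is never part of the
# match (. does not match \n by default), so both survive the substitution
assert result == '    <CREATION_DATE>XXX</CREATION_DATE>\n'
```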

Python reading from an xml file without the special characters

from lxml import etree
import xml.etree.ElementTree as ET

tree2 = ET.parse(r'C:\Users\W\Desktop\220-01.xml')
root = tree2.getroot()
txt = ""
for c in root:
    txt += c.text
    break
I wrote the code above a month ago, so apologies for importing both libraries; I think I only use one. I only need to read the text of the first child of the root, but the issue is that the stored text contains special characters, for example:
"\n\n\nPatient went"
Is there a way to get rid of the \n's? I have a similar issue with other special characters too; I want the text to look exactly as it does within the XML document, because the indices are very important for my work.
Thanks
EDIT: I found a working solution for the time being. After some more searching I ran into this post: replace special characters in a string python
Following the suggestion from user 'Kobi K', I replaced all the \n's with " "'s; this maintained the integrity of the document, and the indices still match as I want.
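For reference, the fix described above can be sketched in one line; since '\n' and ' ' are both one character wide, the replacement leaves every character index unchanged:

```python
# sample text as it comes back from ElementTree, per the question
raw = "\n\n\nPatient went"
# replace each newline with a single space: same length, same indices
cleaned = raw.replace("\n", " ")
assert len(cleaned) == len(raw)
assert cleaned == "   Patient went"
```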

Latex command substitution using regexp in python

I wrote a very ugly script to parse some lines of LaTeX in Python and do string substitution. I'm here because I want to write something to be proud of, and to learn :P
More specifically, I'd like to change:
\ket{(.*)} into |(.*)\rangle
\bra{(.*)} into \langle(.*)|
To this end, I wrote a very very ugly script. The intended use is to do a thing like this:
cat file.tex | python script.py > new_file.tex
So this is what I did. It's working, but it is not nice at all, and I'm wondering if you could give me a suggestion; even a link to the right command to use is fine. Note that I use recursion because once I have found the first "\ket{", I know that I want to replace the first occurring "}" (i.e. I'm sure there are no other subcommands within "\ket{"). But again, it's not the right way to parse LaTeX.
import re
import sys

def recursion_ket(string_input, string_output=""):
    match = re.search(r"\\ket{", string_input)
    if not match:
        return string_input
    else:
        string_output = re.sub(r"\\ket{", '|', string_input, 1)
        string_output_second = re.sub(r"}", "\rangle", string_output.split('|', 1)[1], 1)
        string_output = string_output.split('|', 1)[0] + string_output_second
        string_output = recursion_ket(string_output, string_output)
        return string_output

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        content = f.readlines()
    new = []
    for line in content:
        new.append(recursion_ket(line))
    z = open(sys.argv[2], 'w')
    for i in new:
        z.write(i.replace("\r", '\\r').replace("\b", '\\b'))
        z.write("")
Which I know is very ugly, and it's definitely not the right way of doing it. It's probably because I come from Perl and am not used to Python regexes.
First problem: is it possible to use a regex to substitute just the "border" of a matching string and leave the inside as it is? I want to leave the content of \command{xxx} untouched.
Second problem: the \r. Apparently, when I print each string to the terminal or to a file, I need to make sure \r is not interpreted as a carriage return. I tried the automatic escaping, but it's not what I need: it escapes the \n with another backslash, which is not what I want.
To answer your questions:
First problem: you can use (named) capture groups to rewrite only the delimiters and keep the content as it is.
Second problem: use a raw string (e.g. r"\rangle") so the backslash is not interpreted as an escape character.
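Putting both points together, a regex-only sketch of the conversion (assuming, as the asker notes, that there are no nested braces inside the argument):

```python
import re

def convert(line):
    # \ket{...} -> |...\rangle and \bra{...} -> \langle...|
    # [^}]* grabs the argument up to the first closing brace, mirroring the
    # asker's assumption that \ket/\bra contain no sub-commands
    line = re.sub(r'\\ket\{([^}]*)\}', r'|\1\\rangle', line)
    line = re.sub(r'\\bra\{([^}]*)\}', r'\\langle\1|', line)
    return line

assert convert(r'\ket{\psi} and \bra{\phi}') == r'|\psi\rangle and \langle\phi|'
```

Raw strings on both the pattern and the replacement sidestep the \r problem entirely, since no backslash ever becomes a carriage return.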
Using a LaTeX parser like github.com/alvinwan/TexSoup, we can simplify the code a bit. I know the OP has asked for regex, but if the OP is tool-agnostic, a parser would be more robust.
Nice Function
We can abstract this into a replace function
def replaceTex(soup, command, replacement):
    for node in soup.find_all(command):
        node.replace(replacement.format(args=node.args))
Then, use this replaceTex function in the following way
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> replaceTex(soup, 'bra', r"\langle{args[0]}|")
>>> replaceTex(soup, 'ket', r"|{args[0]}\rangle")
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol
Demo
Here's a self-contained demonstration, based on TexSoup:
>>> from TexSoup import TexSoup
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> soup
\section{hello} text \bra{(.)} haha \ket{(.)}lol
>>> soup.ket.replace(r"|{args[0]}\rangle".format(args=soup.ket.args))
>>> soup.bra.replace(r"\langle{args[0]}|".format(args=soup.bra.args))
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol

Parsing xml with "not well-formed" characters in python

I am getting xml data from an application, which I want to parse in python:
#!/usr/bin/python
import xml.etree.ElementTree as ET
import re
xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)
It works for smaller datasets with example data, but when I go to real live data, I get
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72
Looking at the xml file, I see this line 364658:
WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>
I guess it is the ^[ which makes python choke - it is also highlighted blue in vim. Now I was hoping that I could clean the data with my regex substitution, but that did not work.
The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.
You already do:
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
but the character ^[ is Python's \x1b (ESC), which lies inside the range \x01-\x7f and therefore survives your substitution. If xml.parsers.expat chokes on it, you simply need to clean up more aggressively, accepting only the few characters below 0x20 (space) that XML allows. For example:
xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
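A quick sanity check of that pattern, on a hypothetical line containing the ESC control character (\x1b) like the one in the log:

```python
import re

# made-up sample: printable text with embedded ESC (\x1b) control characters
dirty = 'WARNING - \x1b[0;36mnotice: Scope(Class[Hwsw]): Not required on \x1b[0m'
# keep tab/newline/CR and printable ASCII; strip everything else
cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+', u'', dirty)
assert '\x1b' not in cleaned
assert cleaned == 'WARNING - [0;36mnotice: Scope(Class[Hwsw]): Not required on [0m'
```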
I know this is pretty old, but I stumbled upon the following URL, which has a list of all of the primary characters and their encodings:
https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e

Python strip XML tags from document

I am trying to strip XML tags from a document using Python, a language I am a novice in. Here is my first attempt using regex, which was really a hope-for-the-best idea.
import re

mfile = open("somefile.xml", "r")
for line in mfile:
    re.sub('<./>', "", line)  # trying to match elements between < and />
That failed miserably. I would like to know how it should be done with regex.
Secondly, I googled and found: http://code.activestate.com/recipes/440481-strips-xmlhtml-tags-from-string/
which seems to work. But I would like to know: is there a simpler way to get rid of all XML tags? Maybe using ElementTree?
The most reliable way to do this is probably with LXML.
from lxml import etree
...
tree = etree.parse('somefile.xml')
notags = etree.tostring(tree, encoding='utf8', method='text')
print(notags)
It will avoid the problems with "parsing" XML with regular expressions, and should correctly handle escaping and everything.
An alternative to Jeremiah's answer without requiring the lxml external library:
import xml.etree.ElementTree as ET
...
tree = ET.fromstring(Text)
notags = ET.tostring(tree, encoding='utf8', method='text')
print(notags)
Should work with any Python >= 2.5
Please note that it is usually not a good idea to do this with regular expressions. See Jeremiah's answer.
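A quick sanity check of the ElementTree approach, on a small made-up document; note that tags are dropped and entities are unescaped in the output:

```python
import xml.etree.ElementTree as ET

# hypothetical sample document
xml = '<note><to>Alice</to><body>Hello &amp; welcome</body></note>'
root = ET.fromstring(xml)
# encoding='unicode' returns a str instead of bytes;
# method='text' concatenates only the text content
text = ET.tostring(root, encoding='unicode', method='text')
assert text == 'AliceHello & welcome'
```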
Try this:
import re
text = re.sub('<[^<]+>', "", open("/path/to/file").read())
with open("/path/to/file", "w") as f:
f.write(text)

How to regex in python?

I am trying to parse the keywords from google suggest, this is the url:
http://google.com/complete/search?output=toolbar&q=test
I've done it with PHP using:
'|<CompleteSuggestion><suggestion data="(.*?)"/><num_queries int="(.*?)"/></CompleteSuggestion>|is'
But that won't work with Python's re.match(pattern, string); I tried a few variations, but some raise errors and some return None.
How can I parse that info? I don't want to use minidom, because I think regex will be less code.
You could use etree:
>>> from xml.etree.ElementTree import XMLParser
>>> x = XMLParser()
>>> x.feed('<toplevel><CompleteSuggestion><suggestion data=...')
>>> tree = x.close()
>>> [(e.find('suggestion').get('data'), int(e.find('num_queries').get('int')))
...  for e in tree.findall('CompleteSuggestion')]
[('test internet speed', 31800000), ('test', 686000000), ...]
It is more code than a regex, but it also does more. Specifically, it will fetch the entire list of matches in one go, and unescape any weird stuff like double-quotes in the data attribute. It also won't get confused if additional elements start appearing in the XML.
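As for why the asker's attempts returned None: re.match only matches at the start of the string, so re.search or re.findall is needed. For comparison, a regex sketch against a made-up response in the format the question shows:

```python
import re

# hypothetical response fragment in the format quoted in the question
xml = ('<toplevel><CompleteSuggestion><suggestion data="test internet speed"/>'
       '<num_queries int="31800000"/></CompleteSuggestion></toplevel>')
pattern = (r'<CompleteSuggestion><suggestion data="(.*?)"/>'
           r'<num_queries int="(.*?)"/></CompleteSuggestion>')
# re.findall returns one tuple per match, one element per capture group
matches = re.findall(pattern, xml, re.I | re.S)
assert matches == [('test internet speed', '31800000')]
```

It works on well-behaved input, but unlike the parser it will not unescape entities or tolerate extra elements.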
RegEx match open tags except XHTML self-contained tags
This is an XML document. Please, reconsider an XML parser. It will be more robust and probably take you less time in the end, even if it is more code.
