We're using IronPython from C#, and I get different results in the console and in our application.
This code runs fine in the IronPython console:
str=[]
a = 1
b = 0
c = 1
if a==1:
    str.append('One')
if b==1:
    str.append('Two')
if c==1:
    str.append('Three')
out=','.join(str)
print out
But the same code raises an error in our application:
unexpected token 'if'
I suspect that the problem is in my newlines, because the string containing the Python code is passed through XML (XML>C#>Python):
<Set key="PythonCode" value="ipy:str=[]
a = 1
b = 0
c = 1
if a==1:
    str.append('One')
if b==1:
    str.append('Two')
if c==1:
    str.append('Three')
out=','.join(str)"/>
Other commands return the expected results; my problem is only with indented statements (conditionals, loops).
As I don't have access to the C# code, I'm looking for a way to write one-liners, or any other way to avoid depending on indentation or newlines.
I tried this:
<Set key="PythonCode" value="ipy:str=[];
a = 1;
b = 0;
c = 1;
if a==1: str.append('One');
if b==1: str.append('Two');
if c==1: str.append('Three');
out=','.join(str);"/>
But I get the same error again, because there should be a blank line after each condition.
Any ideas?
Non-significant whitespace in XML is not preserved:
http://www.w3.org/TR/1998/REC-xml-19980210#AVNormalize
Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize it as follows:
(...)
a whitespace character (#x20, #xD, #xA, #x9) is processed by appending #x20 to the normalized value, except that only a single #x20 is appended for a "#xD#xA" sequence that is part of an external parsed entity or the literal entity value of an internal parsed entity
(...)
If the declared value is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
If you want to transmit text with significant whitespace within XML, you need to enclose it inside a CDATA section:
<Set key="PythonCode"><![CDATA[
YOUR CODE HERE
]]></Set>
As far as I know, you cannot use a CDATA section inside an attribute value, so you will have to change that part of your XML format to enclose the code in tags instead.
Another workaround would be to tell your XML exporter as well as your XML importer to preserve non-significant whitespace.
For C#, how to do this depends on which method you use to parse the XML (XDocument, XmlDocument, ...); see for example:
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.preservewhitespace(v=vs.71).aspx
http://msdn.microsoft.com/en-us/library/bb387014.aspx
http://msdn.microsoft.com/en-us/library/bb387103.aspx
But using CDATA is definitely the better solution.
What you definitely should not do is use Whython – Python For People Who Hate Whitespace.
It seems like the code that comes out of the XML has no line breaks.
If so, you have little hope of running Python code.
I have no idea how to make the XML behave differently. Maybe there's something you can embed in the text which would translate to a newline (maybe \n or <br>).
The if statement can't work without newlines, even in the single-line format, because a single line can't have : twice.
For this program, you could replace the if statements with and:
a==1 and str.append('One')
This way your code can be a one-liner.
But if you try to take this further, you'll find it very hard to program this way.
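For this snippet, the whole program could then be collapsed into one semicolon-separated line (a sketch of the idea above; untested in the asker's application):
str=[]; a = 1; b = 0; c = 1; a==1 and str.append('One'); b==1 and str.append('Two'); c==1 and str.append('Three'); out=','.join(str)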
Related
I am processing, with Python, a long list of data that looks like this:
The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved on this site.)
29/07/2016 04:00:12 0.125143
Now, when I read such a file into a script using something like open and readlines, there is an error reading:
SyntaxError: EOL while scanning string literal
I know (or can look up) how to use replace and regex functions, but I cannot apply them in my script. The biggest problem is that wherever such a strange character is included or read, an error occurs pointing at the very line where it is read, so I cannot do anything to the characters first.
Are you reading a file? If so, try to extract the values using regexps rather than removing the extra characters:
re.search(r'^([\d/: ]{19})', line).group(1)
re.search(r'([\d.]{7})', line).group(1)
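A quick sketch of those patterns applied to the sample line above; note that the fixed {7} in the second pattern would truncate the 8-character value shown, so a variable-length variant like [\d.]+$ (my assumption, not part of the answer above) may be safer:

import re

line = "29/07/2016 04:00:12 0.125143"
print(re.search(r'^([\d/: ]{19})', line).group(1))  # '29/07/2016 04:00:12'
print(re.search(r'([\d.]+)\s*$', line).group(1))    # '0.125143' (variable-length variant)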
I find that re.findall works. (I am sorry I did not have time to test all the other methods; the significance of this job has vanished, and I have even forgotten the question itself.)
import re

def extract_numbers(str_i):
    pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
    match_h = re.findall(pat, str_i)
    return match_h[0]  # tuple: day, month, year, hours, minutes, seconds, integer and fractional parts

# ....
# `f` is the handle of the file in question
lines = f.readlines()
for l in lines:
    ls_f = extract_numbers(l)
    # process them....
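For instance, on the sample line from the question this returns a tuple of strings (a quick check, assuming the line format shown above):

print(extract_numbers("29/07/2016 04:00:12 0.125143"))
# ('29', '07', '2016', '04', '00', '12', '0', '125143')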
I use lxml to read an XML file which has a structure like the one below:
<domain>http://www.trademe.co.nz</domain>
<start>http://www.trademe.co.nz/Browse/CategoryAttributeSearchResults.aspx?search=1&cid=5748&sidebar=1&rptpath=350-5748-4233-&132=FLAT&134=&153=&29=&122=0&122=0&59=0&59=0&178=0&178=0&sidebarSearch_keypresses=0&sidebarSearch_suggested=0</start>
and my Python code is:
from lxml import etree
tree = etree.parse('metaWeb.xml')
When I run it, I get:
entityref: expecting ';' error
However, when I remove the & symbols from the XML file, everything is fine.
How can I solve this error?
The problem is that this isn't valid XML. In XML, a & symbol always starts an entity reference, like &#1234; for the character U+04D2 (aka Ӓ), &quot; for the character ", or some custom entity defined in your document/DTD/schema.*
If you want to put a literal & into a string, you have to replace it with something else, typically &amp;, which is a character entity reference for the ampersand character.
So, if you're sure there are no actual entity references in your document, just un-escaped ampersands, you can fix it pretty simply:
from lxml import etree

with open('metaWeb.xml') as f:
    xml = f.read().replace('&', '&amp;')
tree = etree.fromstring(xml)
However, a better solution, if possible, is to fix whatever program is generating this incorrect XML.
* This is not quite true; strictly speaking, a numeric character reference is not actually an entity reference. Also, a character entity reference like &quot; or &amp; is the same as any other reference with replacement text; the entities just happen to be implicitly defined by the XML/HTML base DTDs. But lxml, like most XML software, uses the term "entity reference" slightly more broadly than the standard.
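As a further workaround (my own suggestion, not spelled out in the answer above), lxml can also be asked to parse leniently: recover=True tells libxml2 to skip constructs it cannot parse instead of raising an error, so the bare ampersands are tolerated.

from lxml import etree

parser = etree.XMLParser(recover=True)  # skip unparseable constructs such as bare '&'
tree = etree.parse('metaWeb.xml', parser)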
Replace & with &amp; in your XML file; otherwise your XML is not compliant with the XML standard.
I'm writing a Python 3.2 script to find characters in a Unicode XML-formatted text file which aren't valid in XML 1.0. The file itself isn't XML 1.0, so it could easily contain characters supported in 1.1 and later, but the application which uses it can only handle characters valid in XML 1.0 so I need to find them.
XML 1.0 doesn't support any characters in the range \u0001-\u0020, except for \u0009, \u000A, \u000D, and \u0020. Above that, \u0021-\uD7FF, \uE000-\uFFFD, and \u010000-\u10FFFF are also supported ranges, but nothing else. In my Python code, I define that regex pattern this way:
re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
However, the code below isn't finding a known bad character in my sample file (\u0007, the 'bell' character). Unfortunately I can't provide a sample line (proprietary data).
I think the problem is in one of two places: either a bad regex pattern, or how I'm opening the file and reading in lines (i.e. an encoding problem). I could be wrong, of course.
Here's the relevant code snippet.
processChunkFile() takes three parameters: chunkfile is an absolute path to a file (a 'chunk' of 500,000 lines of the original file, in this case) which may or may not contain a bad character. outputfile is an absolute path to an optional, pre-existing file to write output to. verbose is a boolean flag to enable more verbose command-line output. The rest of the code is just getting command-line arguments (using argparse) and breaking the single large file up into smaller files. (The original file's typically larger than 4GB, hence the need to 'chunk' it.)
def processChunkFile(chunkfile, outputfile, verbose):
    """
    Processes a given chunk file, looking for XML 1.0 chars.
    Outputs any line containing such a character.
    """
    badlines = []
    if verbose:
        print("Processing file {0}".format(os.path.basename(chunkfile)))
    # open given chunk file and read it as a list of lines
    with open(chunkfile, 'r') as chunk:
        chunklines = chunk.readlines()
    # check to see if a line contains a bad character;
    # if so, add it to the badlines list
    for line in chunklines:
        if badCharacterCheck(line, verbose) == True:
            badlines.append(line)
    # output to file if required
    if outputfile is not None:
        with open(outputfile.encode(), 'a') as outfile:
            for badline in badlines:
                outfile.write(str(badline) + '\n')
    # return list of bad lines
    return badlines

def badCharacterCheck(line, verbose):
    """
    Use regular expressions to seek characters in a line
    which aren't supported in XML 1.0.
    """
    invalidCharacters = re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
    matches = re.search(invalidCharacters, line)
    if matches:
        if verbose:
            print(line)
            print("FOUND: " + matches.group())  # .group(), not .groups(): the pattern has no capture groups
        return True
    return False
\u010000
Python \u escapes are four digits only, so that is U+0100 followed by two U+0030 Digit Zeros. Use the capital-U escape with eight digits for characters outside the BMP:
\U00010000-\U0010FFFF
Note that this, and your expression in general, won't work on ‘narrow builds’ of Python, where strings are based on UTF-16 code units and characters outside the BMP are handled as two surrogate code units. (Narrow builds were the default on Windows. Thankfully they go away with Python 3.3.)
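Putting that together, the character class from the question could be written like this (a quick sketch; the bell character is used only as an illustration):

import re

# capital-U escapes with eight digits now really cover U+10000..U+10FFFF
invalid_xml10 = re.compile(
    "[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")

print(invalid_xml10.search("ding\u0007dong"))  # matches the U+0007 Bell character
print(invalid_xml10.search("plain text"))      # None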
it could easily contain characters supported in 1.1 and later
(Although XML 1.1 can only contain those characters when they're encoded as numeric character references &#...;, so the file itself may still not be well-formed.)
open(chunkfile, 'r')
Are you sure the chunkfile is encoded in locale.getpreferredencoding?
The original file's typically larger than 4GB, hence the need to 'chunk' it.
Ugh, monster XML is painful. But with sensible streaming APIs (and filesystems!) it should still be possible to handle. For example, here you could process each line one at a time using for line in chunk: instead of reading the whole chunk at once with readlines().
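A sketch of the reading loop rewritten that way, reusing the names from the question and assuming everything else stays the same:

with open(chunkfile, 'r') as chunk:
    for line in chunk:  # the file object yields lines lazily; no huge list in memory
        if badCharacterCheck(line, verbose):
            badlines.append(line)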
re.search(invalidCharacters, line)
As invalidCharacters is already a compiled pattern object, you can just call invalidCharacters.search(...).
Having said all that, it still matches U+0007 Bell for me.
The fastest way to remove words, characters, or strings between two known tags or two known characters in a string is to use re directly, whose matching engine runs in native C, as shown below.
import re

var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
# And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything between the markers and, for this narrow task, works faster and cleaner than Beautiful Soup. Python's regular expressions have not strayed far from the classic regular expressions used at the machine level, so rather than iterating many times, a single non-greedy pattern can find it all as one chunk in one pass. You can do the same with individual characters:
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
# And finally
var = re.sub('<!--.*?-->', '', var)  # wipes out everything in between, along with the markers
And you do not need Beautiful Soup. You can also extract data using the same patterns, if you understand how this works.
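For illustration, a self-contained run of the idea above (the sample string is invented):

import re

var = "keep <script>drop me</script> this [and this] please"
var = re.sub(r'<script>', '<!--', var)
var = re.sub(r'</script>', '-->', var)
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
var = re.sub(r'<!--.*?-->', '', var)
print(var)  # "keep  this  please"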
I wrote a little script to parse an XML file and print its character data, but each piece of character data seems to invoke the characters() callback three times.
code:
def characters(self, chrs):
    if self.flag == 1:
        self.outfile.write(chrs + '\n')
xml file:
<e1>9308</e1>
<e2>865</e2>
and the output is like below, with many blank lines:

9308

865

I think it should be like:

9308
865
Why are there blank lines? I read the doc info:
characters(self, content)
Receive notification of character data.
The Parser will call this method to report each chunk of
character data. SAX parsers may return all contiguous
character data in a single chunk, or they may split it into
several chunks; however, all of the characters in any single
event must come from the same external entity so that the
Locator provides useful information.
So SAX may process one contiguous run of character data as several fragments, and call back several times?
The example XML you posted is obviously not the full XML, because that would be malformed (and the SAX parser would tell you that instead of producing your output). So I'll assume that there's more to the XML than you showed us.
You need to be aware that any whitespace between XML elements is character data. So if you have something like this:
<foo>
  <bar>123</bar>
</foo>
Then you have at least 3 text nodes: one containing "\n  " (i.e. one newline, two space characters), one containing "123", and last but not least another one with "\n" (i.e. just a newline).
Using self.outfile.write(chrs+'\n') you don't have a chance of seeing exactly what is happening.
Try self.outfile.write("Chrs: %r\n" % chrs)
Look up the built-in function repr() ... "%r" % foo produces the same as repr(foo); both constructs are very useful in error messages and when debugging.
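For illustration, a minimal handler in this style (the handler name and sample document are made up for the sketch) makes the whitespace-only events visible:

import xml.sax

class EchoHandler(xml.sax.ContentHandler):
    def characters(self, chrs):
        # %r makes newlines and spaces explicit, e.g. '\n  '
        print("Chrs: %r" % chrs)

xml.sax.parseString(b"<foo>\n  <bar>123</bar>\n</foo>", EchoHandler())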
So SAX may process one contiguous run of character data as several fragments, and call back several times?
This is obviously happening in your case - any doubt?
But your problem description is poor, since you did not mention which parser exactly you are using.
How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
E.g.:
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else, in places where you don't care, or where it doesn't matter whether Python 3 changes the default string type.
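A minimal illustration of that convention (the values are invented for the sketch):

data = b'\x89PNG\r\n'   # really means bytes: a binary file signature
name = u'Zo\xeb'        # inherently Unicode text
label = 'id'            # don't care; the default string type either way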
It seems to me like you really need to parse the code with an honest-to-goodness Python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")

print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now, docstrings aren't unicode but should be allowed, so you might have to do something more complicated, like from symbol import sym_name, where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
Good question!
Edit
Just a follow-up comment. Conveniently for your purposes, parser.suite does not actually evaluate your Python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains:
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files, using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, comprising various language keywords and pattern tokens that match varying source-language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using the query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using the query token "S" for "any kind of string literal") or for a specific type of string (for Python, including "UnicodeStrings", non-Unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode strings (u"...").
What you want is all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "which strings aren't unicode", is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't Unicode strings: exactly what you asked for.
The SCSE provides logging facilities so that you can record hits. You can run the SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.