Python text processing and parsing

I have a file at gran/config.py and I cannot import it (that is not an option).
Inside this config.py there is the following code:
...<more code>
animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
...<more code>
I want to be able to parse out r'^bear4x' or r'^.*\tiger\b.*$' based on bear or tiger.
I started out with
try:
    text = open('gran/config.py', 'r')
    tline = filter('not sure', text.readlines())
    text.close()
except IOError, str:
    pass
I was hoping to grab the whole animal dict by
grab = re.compile("^animal\s*=\s*('.*')") or something like that
and maybe change tline to tline = filter(grab.search,text.readlines())
but it only grabs animal = dict( and not the following lines of the dict.
How can I grab multiple lines?
Look for animal, then confirm the first '(', then continue looking until ')'?
Note: the size of the animal dict may change, so any static approach (like grabbing 4 extra lines after animal is found) wouldn't work.
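A minimal sketch of that paren-counting idea (it would be fooled by parentheses inside the pattern strings, but it works for a block like the one above):
import re

def grab_block(path, name):
    """Collect the lines of 'name = dict(...)' by balancing parentheses."""
    lines = []
    depth = 0
    with open(path) as f:
        for line in f:
            if not lines:
                # wait for the opening line, e.g. "animal = dict("
                if re.match(r"\s*%s\s*=\s*dict\(" % name, line):
                    lines.append(line)
                    depth = line.count('(') - line.count(')')
                    if depth <= 0:
                        break
            else:
                lines.append(line)
                depth += line.count('(') - line.count(')')
                if depth <= 0:
                    break
    return ''.join(lines)

print(grab_block('gran/config.py', 'animal'))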

Maybe you should try some AST hacks? With Python it is easy:
import ast
config = ast.parse(open('gran/config.py').read())
Now you have your parsed module. You need to extract the assignment to animal and evaluate it. There is a safe ast.literal_eval function, but since the config makes a call to dict it won't work here. The idea is to traverse the whole module tree, keep only the assignments, and run the result locally:
class OnlyAssigns(ast.NodeTransformer):
    def generic_visit(self, node):
        return None  # throw everything else away
    def visit_Module(self, node):
        # We need to visit the Module itself and pass it through
        return ast.NodeTransformer.generic_visit(self, node)
    def visit_Assign(self, node):
        if node.targets[0].id == 'animal':  # the name you may want to change
            return node  # pass it through
        return None  # throw it away
config = OnlyAssigns().visit(config)
Compile it and run:
exec(compile(config, 'gran/config.py', 'exec'))
print(animal)
If animal should end up in a dictionary of your own, pass one as the locals to exec:
data = {}
exec(compile(config, 'gran/config.py', 'exec'), globals(), data)
print(data['animal'])
There is much more you can do with AST hacking, like visiting every If or For statement. Check the ast documentation for the details.
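Stitching the pieces above together into one self-contained sketch (assuming the gran/config.py from the question):
import ast

class OnlyAssigns(ast.NodeTransformer):
    def generic_visit(self, node):
        return None  # drop everything that is not handled below
    def visit_Module(self, node):
        return ast.NodeTransformer.generic_visit(self, node)
    def visit_Assign(self, node):
        # keep only the simple assignment to 'animal'
        target = node.targets[0]
        if isinstance(target, ast.Name) and target.id == 'animal':
            return node
        return None

tree = ast.parse(open('gran/config.py').read())
tree = OnlyAssigns().visit(tree)
data = {}
exec(compile(tree, 'gran/config.py', 'exec'), {}, data)
print(data['animal']['bear'])   # -> ^bear4x
print(data['animal']['tiger'])  # -> ^.*\tiger\b.*$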

If the only reason you can't import that file as-is is that some of its imports will fail, you can potentially hack your way around that rather than processing a perfectly good Python file as plain text.
For example, if I have a file named busted_import.py with:
import doesnotexist
foo = 'imported!'
And I try to import it, I will get an ImportError. But if I define what the doesnotexist module refers to using sys.modules before trying to import it, the import will succeed:
>>> import sys
>>> sys.modules['doesnotexist'] = ""
>>> import busted_import
>>> busted_import.foo
'imported!'
So if you can isolate the imports that will fail in your Python file and redefine those prior to attempting the import, you can work around the ImportErrors.
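For the gran/config.py from the question, a sketch of that workaround might look like this (badmodule is a hypothetical stand-in for whatever import actually fails, and gran is assumed to be an importable package):
import sys
import types

# Stub out the failing import before gran.config is loaded;
# replace 'badmodule' with the module that actually fails to import.
sys.modules['badmodule'] = types.ModuleType('badmodule')

from gran import config
print(config.animal['bear'])  # -> ^bear4x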

I am not getting what exactly you are trying to do.
If you want to process each line with a regular expression: you have ^ in re.compile(r"^animal\s*=\s*('.*')"), so it matches only when animal is at the start of a line, not after some spaces. And of course it does not match bear or tiger - use something like re.compile(r"^\s*([a-z]+)\s*=\s*('.*')").
If you want to process multiple lines with a single regular expression,
read about re.DOTALL and re.MULTILINE and how they affect matching newline characters:
http://docs.python.org/2/library/re.html#re.MULTILINE
Also note that text.readlines() reads lines, so the function in filter('not sure', text.readlines()) is run on each line separately, not on the whole file. You cannot pass a regular expression to filter(<re here>, text.readlines()) and hope it will match multiple lines.
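For example, a line-by-line version using the corrected pattern might look like this (a sketch that assumes each entry sits on its own line, as in the question's snippet; it would also pick up any other assignment in the file that happens to match):
import re

pattern = re.compile(r"^\s*([a-z]+)\s*=\s*r?'(.*)'")

animal = {}
with open('gran/config.py') as f:
    for line in f:
        m = pattern.match(line)
        if m:
            animal[m.group(1)] = m.group(2)

print(animal.get('bear'))   # -> ^bear4x
print(animal.get('tiger'))  # -> ^.*\tiger\b.*$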
BTW, processing Python files (and HTML, XML, JSON... files) with regular expressions is not wise. For every regular expression you write there are cases where it will not work. Use a parser designed for the given format - for Python source code that's ast. But for your use case ast may be too complex.
Maybe it would be better to use classic config files and configparser. More structured data like lists and dicts can be easily stored in JSON or YAML files.

Related

Python comment-preserving parsing using only builtin libraries?

I wrote a library using just the ast and inspect libraries to parse and emit (using astor on Python < 3.9) internal Python constructs.
I just realised that I really need to preserve comments after all, preferably without resorting to RedBaron or LibCST, as I just need to emit the unaltered commentary. Is there a clean and concise way of comment-preserving parsing/emitting of Python source with just the stdlib?
What I ended up doing was writing a simple parser, without a meta-language, in 339 source lines:
https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py
Implementation of Concrete Syntax Tree [List!]
Reads the source character by character;
Once the end of a statement† is detected, add the statement type into a 1D list;
†end of statement is end of line if line.lstrip().startswith("#"), or if the line doesn't end with '\\' and balanced_parens(line) holds; otherwise continue munching until that condition is true… plus some edge cases around multiline strings and the like (balanced_parens is sketched after this list);
Once finished, there is a big (1D) list where each element is a namedtuple with a value property.
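For illustration, the balanced_parens check referenced in the footnote might look roughly like this (a simplified sketch; the real implementation in the linked repo also has to ignore brackets inside string literals and track bracket types separately):
def balanced_parens(line):
    # Zero nesting depth means the statement can end on this line.
    depth = 0
    for ch in line:
        if ch in '([{':
            depth += 1
        elif ch in ')]}':
            depth -= 1
    return depth == 0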
Integration with builtin Abstract Syntax Tree ast
Limit the ast nodes to modify (not remove) to: the docstring of {ClassDef, AsyncFunctionDef, FunctionDef} (the first body element, a Constant|Str), plus Assign and AnnAssign;
cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
if doc_str node then maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list)
…
Now the cst_list contains only changes to those aforementioned nodes, and only when the change is more than whitespace; it can be turned into a string with "".join(map(attrgetter("value"), cst_list)) for outputting to eval or straight out to a source file (e.g., overriding it in place).
Quality control
100% test coverage
100% doc coverage
Support for last 6 versions of Python (including latest alpha)
CI/CD
(Apache-2.0 OR MIT) licensed
Limitations
Lack of a meta-language; specifically, not using Python's provided grammar means new syntax elements won't automatically be supported (match/case is supported, but if new syntax has been introduced since, it isn't [yet?] supported… at least not automatically);
Not built into the stdlib, so the stdlib could break compatibility;
Deleting nodes is [probably] not supported;
Nodes can be incorrectly identified if there are shadow variables or similar issues that linters should point out.
Comments can be preserved by capturing them with the tokenizer and merging them back into the generated source code.
Given a toy program in the variable program, we can demonstrate how comments get lost in the AST:
import ast
program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
       'b'] # Ending comment for b lost
def p1f(x):
    "p1f docstring"
    # Comment in function p1f lost
    return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))
The output shows all comments gone:
== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']
def p1f(x):
"""p1f docstring"""
return x
print(p1f(p1l), p1f(p1v))
However, if we scan the comments with the tokenizer, we can
use this to merge the comments back in:
from io import StringIO
import tokenize
def scan_comments(source):
    """Scan source code file for relevant comments"""
    # Find the token type number used for comments
    for k, v in tokenize.tok_name.items():
        if v == 'COMMENT':
            comment = k
            break
    comtokens = []
    with StringIO(source) as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            if token.type != comment:
                continue
            comtokens += [token]
    return comtokens
comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])
Output (edited to split long line):
== Comment after p1l[0]
TokenInfo(type=60 (COMMENT),
string='# Implicit line joining comment for a lost',
start=(4, 12), end=(4, 54),
line="p1l = ['a', # Implicit line joining comment for a lost\n")
Using a slightly modified version of ast.unparse(), replacing the methods maybe_newline() and traverse() with modified versions, you should be able to merge back in all comments at their approximate locations, using the location info from the comment scanner (the start variable) combined with the location info from the AST; most nodes have a lineno attribute.
The locations will not be exact, though. See for example the list variable assignment: the source code is split over two lines, but ast.unparse() generates only one line (see the output in the second code segment). Also, you need to make sure to update the location info in the AST using ast.increment_lineno() after adding code. It seems some more calls to maybe_newline() might be needed in the library code (or its replacement).

Python - Import formatted lines as indexed list objects

I am writing a minor OP5 plugin in Python 2.7 (the version is out of my hands) that iterates over a multidimensional list to verify that fallback zip downloads have gone as they should.
Up until now I have put each host with their IP address in a multidimensional list looking like (cut short for brevity):
fallback = [
    ["host1", "192.168.1.3"],
    ["host2", "192.168.15.59"]
]
...and so on.
This lets me iterate through fallback[i] and use that along with fallback[i][1] for the IP address; the rest of the script uses both of these pieces of information for various tasks and string manipulations. The script as it is now is mechanically sound but relies on the availability of these indexes.
There is however a hidden file (.fallbackinfo) containing the same information for another script, but it is written for Perl, as is the script that uses that file as a source.
The file looks like this:
#hosts = (
["host1", "192.168.1.3", "type of firmware", "subfolder"],
["host2", "192.168.15.59", "type of firmware", "subfolder"],
);
I wish to import this into an iterable multidimensional list in my Python script, but am getting incredibly stuck.
My current attempt is the closest I have gotten:
with open("/home/runninguser/.fallbackinfo") as f:
lines = []
for line in f:
lines.append(line.rstrip().strip())
fallback = lines[1:len(lines)-1]
This has successfully made the list look the way I want it, but all lines get imported as str objects. I have attempted to use list() to force the object to become a list, but most of the time that makes each character in the line become a list element instead. The network in question is cut off from internet access, so I have to rely on built-in modules. My interpretation is that since it is formatted as a list, it should somehow be able to be interpreted as a list.
Can this be done at all, and if so, how?
You can use the json package (built-in) to achieve this:
import json
with open("/home/runninguser/.fallbackinfo") as f:
# For each line
for line in f:
# If the line starts with a bracket
if line.strip()[0] == "[":
# Print the line after removing spaces in front and the comma in the back
# and converting it into a list
print(json.loads(line.strip().rstrip(",")))
If you now use the type() function, you will see that the list-formatted strings are now of <class 'list'>.
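To build the complete fallback list instead of printing each row, the same idea can accumulate the parsed lines (a sketch under the same assumptions about the file format):
import json

fallback = []
with open("/home/runninguser/.fallbackinfo") as f:
    for line in f:
        line = line.strip()
        if line.startswith("["):
            fallback.append(json.loads(line.rstrip(",")))

print(fallback[0][0], fallback[0][1])  # -> host1 192.168.1.3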

Add returned values to list in python?

When I print this I get:
FuncA
FuncB
FuncC
When really what I want is:
['FuncA', 'FuncB', 'FuncC']
How would I be able to iterate through my returned values and add them to the list?
Rather than manually looking for text (which can easily lead to false positives), use the ast module to build an abstract syntax tree, then extract the function names from that:
import ast

functions = []
with open('codefile.py', 'r') as file:
    tree = ast.parse(file.read(), 'codefile.py')
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            functions.append(node.name)

print(functions)
This finds all function definitions anywhere in the source code, just like your search for the def text would have, except that it skips commented-out code or the word def in a string literal, for instance.
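To see the difference in action, here is a small self-contained demonstration (the source string is made up for illustration):
import ast

source = '''
# def commented_out(): pass
label = "def inside_a_string(): pass"
def real_function():
    return 42
'''

tree = ast.parse(source)
names = [node.name for node in ast.walk(tree)
         if isinstance(node, ast.FunctionDef)]
print(names)  # -> ['real_function']; the comment and the string are skipped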

How to check the file names are in os.listdir('.')?

I use a RegEx and a string to check whether this file name (and names similar to it) exists in os.listdir('.') or not: if it exists, print('Yes'); if not, print('No'). But even when the file name doesn't exist in my listdir('.'), it shows me Yes.
How should I check that?
search = str(args[0])
pattern = re.compile('.*%s.*\.pdf' % search, re.I)
if filter(pattern.search, os.listdir('.')):
    print('Yes ...')
else:
    print('No ...')
filter on Python 3 is lazy: it doesn't return a list, it returns an iterator, which is always "truthy", whether or not it would produce items (it doesn't know whether it will until it has run out). If you want to check whether it got any hits, the most efficient way is to try to pull an item from it. On Python 3, you'd use two-arg next to do this lazily (so you stop at the first hit and don't look further):
if next(filter(pattern.search, os.listdir('.')), False):
If you need the complete list a la Py2, you'd just wrap it in the list constructor:
matches = list(filter(pattern.search, os.listdir('.')))
On Python 2, your existing code should work as written.
I'll note that what you're doing would usually be handled much better with the glob module; I'd strongly recommend taking a look at it.
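For instance, a glob version of the same check might look like this (a sketch; note that unlike the re.I flag in the original, glob matching is case-sensitive on most platforms):
import glob

search = str(args[0])
# True if any .pdf in the current directory contains the search term
if glob.glob('*%s*.pdf' % search):
    print('Yes ...')
else:
    print('No ...')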
An alternative to your code (not considering additional requirements you might not have listed):
from pathlib import Path

search = str(args[0]).lower()
file_cnt = sum(search in file.stem.lower() for file in Path('.').glob('*.pdf'))
if file_cnt > 0:
    print('Yes')
else:
    print('No')

Manipulating Directory Paths in Python

Basically I've got a current url and another key that I want to merge into a new url, and there are three different cases.
Suppose the current url is localhost:32401/A/B/foo
if the key is bar, then I want to return localhost:32401/A/B/bar
if the key starts with a slash, e.g. /A/bar, then I want to return localhost:32401/A/bar
finally, if the key is its own independent url, I just want to return that key: http://foo.com/bar -> http://foo.com/bar
I assume there is a way to do at least the first two cases without manipulating the strings manually, but nothing jumped out at me immediately in the os.path module.
Have you checked out the urlparse module?
From the docs,
from urlparse import urljoin
urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
# -> 'http://www.cwi.nl/%7Eguido/FAQ.html'
That should help with your first case.
Obviously, you can always do basic string manipulation for the rest.
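In fact, urljoin covers all three of the question's cases, as long as the base URL carries a scheme (a sketch; the http:// prefix on the base is an assumption, since the question's URL has none and urljoin needs a scheme to resolve against):
from urlparse import urljoin  # Python 3: from urllib.parse import urljoin

base = 'http://localhost:32401/A/B/foo'

print(urljoin(base, 'bar'))                 # -> http://localhost:32401/A/B/bar
print(urljoin(base, '/A/bar'))              # -> http://localhost:32401/A/bar
print(urljoin(base, 'http://foo.com/bar'))  # -> http://foo.com/bar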
I assume there is a way to do at least the first two cases without manipulating the strings manually, but nothing jumped out at me immediately in the os.path module.
That's because you want to use urllib.parse (for Python 3.x) or urlparse (for Python 2.x) instead.
I don't have much experience with it, though, so here's a snippet using str.split() and str.join().
urlparts = url.split('/')
if key.startswith('http://'):
    return key
elif key.startswith('/'):
    # urlparts[0] is the host:port part of 'localhost:32401/A/B/foo',
    # and the key already carries its leading slash
    return urlparts[0] + key
else:
    # replace the last path segment with the key
    urlparts[-1] = key
    return '/'.join(urlparts)
String objects in Python all have startswith and endswith methods that should be able to get you there. Something like this, perhaps?
def merge(current, key):
    if key.startswith('http'):
        return key
    if key.startswith('/'):
        # partition at the first '/' leaves the host:port part in parts[0];
        # the key already carries its leading slash
        parts = current.partition('/')
        return parts[0] + key
    # rpartition at the last '/' drops the final path segment
    parts = current.rpartition('/')
    return '/'.join((parts[0], key))
