Optimized lemmatization method in Python

I have written a Python script which contains the function below. The lemmatization function is taking so much time that it affects the efficiency of the whole script. I am using the spacy module for lemmatization.
def lemmatization(cleaned_data, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    try:
        logging.info("loading function lemmatization")
        texts = list(sent_to_words(cleaned_data))
        texts_out = []
        # Initialize spacy 'en' model, keeping only tagger component (for efficiency)
        # Run in terminal: python3 -m spacy download en
        nlp = spacy.load('en', disable=['parser', 'ner'])
        for sent in texts:
            doc = nlp(" ".join(sent))
            texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-']
                                       else '' for token in doc if token.pos_ in allowed_postags]))
    except Exception as error:
        logging.info("Error occurred in lemmatization method. Error is %s", error)
    return texts_out
Is there any way to optimize it?
Thanks in advance!

Variable names and variable transformations. I do not quite understand what the data variables represent: cleaned_data is text, texts is again a list of words, and what is sent in texts? Things would improve if you changed the variable names, documented the arguments in function docstrings, and added type annotations (Python 3.6+). This is very typical when you work with a program as a script, but unclear variables haunt both an outside reader like myself and, probably, the code's authors two or three months from now, so better to change them.
Ideas for speedup. As for speedup, I think there can be the following cases:
the nlp function is slow itself (if so, batching documents helps; see the nlp.pipe sketch below)
nlp() encounters lots of errors and does a lot of logging
something is slow in the rest of the script (but these things seem rather minimal)
sent_to_words() is not shown; maybe something happens there
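If the nlp call itself is the bottleneck, spaCy's nlp.pipe batches documents through the pipeline and is usually much faster than calling nlp() on each sentence. A minimal sketch of that idea (assuming the same 'en' model, with texts being a list of token lists as in the original; lemmatize_batch is an illustrative name):

import spacy

def lemmatize_batch(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize with nlp.pipe, which streams documents in batches."""
    nlp = spacy.load('en', disable=['parser', 'ner'])
    texts_out = []
    # nlp.pipe processes an iterable of strings far faster than one-by-one calls
    for doc in nlp.pipe((" ".join(sent) for sent in texts), batch_size=50):
        texts_out.append(" ".join(
            '' if token.lemma_ == '-PRON-' else token.lemma_
            for token in doc
            if token.pos_ in allowed_postags))
    return texts_out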
Refactoring. For profiling the program you need to split it into functions to see what actually takes a lot of time. See a refactoring below; hope it helps.
import logging

import spacy
from profilehooks import profile

# your actual function here
def sent_to_words(x):
    pass

# a small speedup comes from == vs in
def exclude_pron(token):
    x = token.lemma_
    if x == '-PRON-':
        return ''
    return x

# functional approach, could be faster than a single comprehension
def extract_lemmas(doc, allowed_postags):
    gen = (token for token in doc if token.pos_ in allowed_postags)
    return map(exclude_pron, gen)

def make_model():
    """Initialize spacy 'en' model, keeping only tagger component for efficiency.

    Run in terminal: python3 -m spacy download en
    """
    return spacy.load('en', disable=['parser', 'ner'])

def make_texts_out(texts, nlp, allowed_postags):
    texts_out = []
    for sent in texts:
        # really important and bothersome: what is 'sent'?
        doc = nlp(" ".join(sent))
        res = extract_lemmas(doc, allowed_postags)
        texts_out.append(res)
    return texts_out

# FIXME:
# - *cleaned_data* is too generic a variable name, better rename
# - flow of variables is unclear: cleaned_data is split to words,
#   and then combined to text " ".join(sent) again,
#   it is not so clear what happens
# @profile(immediate=True, entries=20)
def lemmatization(cleaned_data: list, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    logging.info("loading function lemmatization")
    texts = list(sent_to_words(cleaned_data))
    nlp = make_model()
    try:
        texts_out = list(make_texts_out(texts, nlp, allowed_postags))
    except Exception as error:
        logging.info("Error occurred in lemmatization method. Error is %s", error)
    return texts_out


Run a dynamic multi-line string of code in python at runtime

How can I run a multi-line string of code from inside Python? Additionally, the string is generated at runtime and stored in either a string or an array.
Background
I have a function which dynamically generates random code snippets in high volume.
I wish to pass this code to a variable/string and evaluate its syntax/return code.
Code cannot be imported from a file, an API request, etc., as the focus is on high throughput and speed.
Example
# my function returns random code snippets
myCode = myRandomCodeGenerator()
# I wish to perform some in-flight validation (ideally RC or syntax)
someTestingFunction(myCode)
My attempts so far
I have seen solutions such as the one below; however, since my code is generated dynamically, I have issues formatting it. I have tried generating \n and \r, or adding """ to the string, to compensate, with no luck.
code = """
def multiply(x,y):
return x*y
print('Multiply of 2 and 3 is: ',multiply(2,3))
"""
exec(code)
I have also tried using the call function on a string, which yields similar formatting issues.
So far the best I can do is to perform a line-by-line syntax check and feed it a line at a time, as shown below.
import codeop

def is_valid_code(line):
    try:
        codeop.compile_command(line)
    except SyntaxError:
        return False
    else:
        return True
I suspect there may be some syntax tricks I could apply to preserve the formatting of indentation and returns. I am also aware of the risk of generating dynamic code at runtime, and have a filter of allowed terms in the function.
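One such trick: instead of checking line by line, the whole snippet can be syntax-checked at once, since both compile() and ast.parse() raise SyntaxError for invalid code while fully respecting multi-line indentation. A small sketch (is_valid_snippet is an illustrative name):

import ast

def is_valid_snippet(source):
    """Syntax-check a whole multi-line snippet without executing it."""
    try:
        ast.parse(source)  # or: compile(source, '<string>', 'exec')
    except SyntaxError:
        return False
    return True

snippet = "def multiply(x, y):\n    return x * y\n"
print(is_valid_snippet(snippet))         # True
print(is_valid_snippet("def broken(:"))  # False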
Another way, instead of eval and exec, is to compile the code first and then exec it:
Example:
import contextlib, sys
from io import StringIO

@contextlib.contextmanager
def stdoutIO(stdout=None):
    old = sys.stdout
    if stdout is None:
        stdout = StringIO()
    sys.stdout = stdout
    yield stdout
    sys.stdout = old

def run_code(override_kale_blocks):
    compiled123 = []
    for b123 in override_kale_blocks:
        compiled123.append(compile(b123, "<string>", "exec"))
    with stdoutIO() as s:
        for c123 in compiled123:
            exec(c123)
    return s.getvalue()

block0 = '''
import time
a=5
b=6
b=a+b
'''

block1 = '''
b=a+b
'''

block2 = "print(b)"

blocksleep = '''
print('startsleep')
time.sleep(1)
print('donesleep')
'''

pc = (block0, blocksleep, block1, block2)
cb = []
print('before')
output = run_code(pc)
print(output)
print('after')
print("Hello World!\n")
source:
https://stackoverflow.com/a/3906390/1211174
Question was answered by Joran Beasley.
Essentially, my issue was that I was adding a space after \n. Other compilers get confused by this, and Python takes it into account too: if you add a space after a carriage return in a def statement, it will produce a syntax error. I have kept this answer in case others encounter similar issues.
Try using '\t' instead of placing tabs and spaces for the indentation; that may preserve your formatting.
code = """
def multiply(x,y):
\t return x*y
print('Multiply of 2 and 3 is: ',multiply(2,3))
"""
exec(code)
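In the same spirit, textwrap.dedent lets the generator indent the triple-quoted literal naturally and still produce compilable code, since it strips the common leading whitespace. A small sketch:

import textwrap

code = textwrap.dedent("""
    def multiply(x, y):
        return x * y

    print('Multiply of 2 and 3 is:', multiply(2, 3))
""")
exec(code)  # prints: Multiply of 2 and 3 is: 6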

Loading library on each map() execution

The library called spaCy has some problems when shared across executors (a problem with pickling). One of the workarounds is to import it independently in each map execution, but the load takes a while.
I'm new to Spark, so I don't understand the exact mechanism behind map. What will happen in the example case below?
I'm afraid of the worst-case scenario, where individual lines of text are processed independently and each one imports spaCy. A fresh import can take a good 10+ seconds, and we have 1,000,000+ lines of text.
class SpacyMagic(object):
    _spacys = {}

    @classmethod
    def get(cls, lang):
        if lang not in cls._spacys:
            import spacy
            cls._spacys[lang] = spacy.load(lang)
        return cls._spacys[lang]

def run_spacy(sent):
    nlp = SpacyMagic.get('en')
    return [wrd.text for wrd in nlp(sent)]

sc = SparkContext(appName="LineTokenizer")
data = sc.textFile(s3in)
res = data.map(run_spacy)
print res.take(100)
sc.stop()
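One way to bound the number of loads (a sketch, assuming PySpark; tokenize_partition is an illustrative name): mapPartitions runs its function once per partition, so the model is fetched once per partition rather than once per line, and SpacyMagic's cache keeps it to one spacy.load per executor process.

def tokenize_partition(lines):
    # runs once per partition: load (or fetch the cached) model a single time,
    # then reuse it for every line in the partition
    nlp = SpacyMagic.get('en')
    for line in lines:
        yield [wrd.text for wrd in nlp(line)]

res = data.mapPartitions(tokenize_partition)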

Generate parser for Python3 in python, using ANTLR 4.6

I'm using the ANTLRv4 Python3 grammar from here:
https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4
and running:
java -jar antlr-4.6-complete.jar -Dlanguage=Python2 Python3.g4
to generate Python3Lexer.py + some other files.
However, Python3Lexer.py contains code which is not Python, e.g.:
def __init__(self, input=None):
    super(Python3Lexer, self).__init__(input)
    self.checkVersion("4.6")
    self._interp = LexerATNSimulator(self, self.atn, self.decisionsToDFA, PredictionContextCache())
    self._actions = None
    self._predicates = None

// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();

// The stack that keeps track of the indentation level.
private java.util.Stack<Integer> indents = new java.util.Stack<>();
It's unusable because of this. Does anyone know why this is happening and how I can fix it? Thanks!
This grammar is full of action code written in Java to deal with the specialities of Python. You have to port that manually to Python to make the grammar usable for you. This is why grammar writers are encouraged to move action code out into base classes or listener code.
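To give a flavour of such a port (a rough, illustrative sketch only; the class and attribute names are mine): the Java fields from the generated lexer can become plain attributes on a hand-written Python base class, which the grammar then references via options { superClass = ...; }:

from antlr4 import Lexer

class Python3LexerBase(Lexer):
    """Hand-ported lexer state from the Java actions (sketch)."""
    def __init__(self, input=None):
        super(Python3LexerBase, self).__init__(input)
        # Java: private java.util.LinkedList<Token> tokens = new LinkedList<>();
        self.tokens = []   # queue for extra tokens pushed by the NEWLINE rule
        # Java: private java.util.Stack<Integer> indents = new Stack<>();
        self.indents = []  # stack of indentation levels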

ANTLR4 and the Python target

I'm having issues getting going with a Python target in ANTLR4. There seem to be very few examples available, and going to the corresponding Java code doesn't seem relevant.
I'm using the standard Hello.g4 grammar:
// Define a grammar called Hello
grammar Hello;
r : 'hello' ID ; // match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
The example (built from the standard Hello.g4 example):
input_ = antlr4.FileStream(_FILENAME)
lexer = HelloLexer.HelloLexer(input_)
stream = antlr4.CommonTokenStream(lexer)
parser = HelloParser.HelloParser(stream)
rule_name = 'r'
tree = getattr(parser, rule_name)()
I also wrote a listener. To assert/verify that this is correct, I'll repeat it here:
class HelloListener(antlr4.ParseTreeListener):
    def enterR(self, ctx):
        print("enterR")

    def exitR(self, ctx):
        print("exitR")

    def enterId(self, ctx):
        print("enterId")

    def exitId(self, ctx):
        print("exitId")
So, first, I can't guarantee that the string I'm giving it is valid, because I'm not getting any screen output. How do I tell from the tree object whether anything was matched? How do I extract the matching rules/tokens?
A Python example would be great, if possible.
I hear you; I'm having the same issues right now. The Python documentation for v4 is useless, and v3 differs too much to be usable. I'm thinking about switching back to Java to implement my stuff.
Regarding your code: I think your own custom listener has to inherit from the generated HelloListener. You can do the printing there.
Also try parsing invalid input to see whether the parser starts at all. I'm not sure about the line with getattr(parser, rule_name)(), though. I followed the steps in the (unfortunately very short) documentation for the ANTLR4 Python target: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Python+Target
You can also find some documentation about the listener stuff there. Hope it helps.
This question seems to be old, but I also had the same problems and found out how to deal with them. When using strings in Python, you have to use the function antlr4.InputStream, as pointed out here.
So, in the end, you could get a working example with this sort of code (based on Alan's answer and an example from DZone):
from antlr4 import *
from grammar.HelloListener import HelloListener
from grammar.HelloLexer import HelloLexer
from grammar.HelloParser import HelloParser
import sys

class HelloPrintListener(HelloListener):
    def enterHi(self, ctx):
        print("Hello: %s" % ctx.ID())

def main():
    giveMeInput = input("say hello XXXX\n")
    print("giveMeInput is {0}".format(giveMeInput))
    # https://www.programcreek.com/python/example/93166/antlr4.InputStream
    # https://groups.google.com/forum/#!msg/antlr-discussion/-9VJ5H9NcDs/OukVNCTQCAAJ
    i_stream = InputStream(giveMeInput)
    lexer = HelloLexer(i_stream)
    t_stream = CommonTokenStream(lexer)
    parser = HelloParser(t_stream)
    tree = parser.hi()
    printer = HelloPrintListener()
    walker = ParseTreeWalker()
    walker.walk(printer, tree)

if __name__ == '__main__':
    main()
I've created an example for Python 2 using the Hello grammar.
Here's the relevant code:
from antlr4 import *
from HelloLexer import HelloLexer
from HelloListener import HelloListener
from HelloParser import HelloParser
import sys

class HelloPrintListener(HelloListener):
    def enterHi(self, ctx):
        print("Hello: %s" % ctx.ID())

def main():
    lexer = HelloLexer(StdinStream())
    stream = CommonTokenStream(lexer)
    parser = HelloParser(stream)
    tree = parser.hi()
    printer = HelloPrintListener()
    walker = ParseTreeWalker()
    walker.walk(printer, tree)

if __name__ == '__main__':
    main()
As fabs said, the key is to inherit from the generated HelloListener. There seems to be some pickiness on this issue, as you can see if you modify my HelloPrintListener to inherit directly from ANTLR's ParseTreeListener. I imagined that would work since the generated HelloListener just has empty methods, but I saw the same behavior you saw (listener methods never being called).
Even though the documentation for Python listeners is lacking, the available methods are similar to Java's.
The ANTLR documentation has been updated to document the support for the Python 2 and Python 3 targets. The examples from the ANTLR book converted to Python 3 can be found here; they are sufficient to get anyone started.

Pyparsing operating on context

How can I build a pyparsing program that allows operations to be executed on a context/state object?
An example of my program looks like this:
load 'data.txt'
remove line 1
remove line 4
The first line should load a file and line 2 and 3 are commands that operate on the content of the file. As a result, I expect the content of the file after all commands have been executed.
load_cmd = Literal('load') + filename
remove_cmd = Literal('remove line') + line_no
more_cmd = ...

def load_action(s, loc, toks):
    # load file, where should I store it?
    pass

load_cmd.setParseAction(load_action)

def remove_line_action(s, loc, toks):
    # remove line, how to obtain data to operate on? where to write result?
    pass

remove_cmd.setParseAction(remove_line_action)

# Is this the right way to define a whole program, i.e. not only one line?
program = load_cmd + remove_cmd | more_cmd | ...

# How do I obtain the result?
program.scanString("""
load 'data.txt'
remove line 1
remove line 4
""")
I have written a few pyparsing examples of this command-parsing style; you can find them online at:
http://pyparsing.wikispaces.com/file/view/simpleBool.py/451074414/simpleBool.py
http://pyparsing.wikispaces.com/file/view/eval_arith.py/68273277/eval_arith.py
I have also written a simple Adventure-style game processor, which accepts parsed command structures and executes them against a game "world", which functions as the command executor. I presented this at PyCon 2006, but the link from the conference page has gone stale - you can find it now at http://www.ptmcg.com/geo/python/confs/pyCon2006_pres2.html (the presentation is written using S5 - mouse over the lower right corner to see navigation buttons). The code is at http://www.ptmcg.com/geo/python/confs/adventureEngine.py.txt, and UML diagram for the code is at http://www.ptmcg.com/geo/python/confs/pyparsing_adventure.pdf.
The general pattern I have found to work best is similar to the old Model-View-Controller pattern.
The Model is your virtual machine, which maintains the context from command to command. In simple_bool the context is just the inferred local variable scope, since each parsed statement is just evaled. In eval_arith, this context is kept in the EvalConstant._vars dict, containing the names and values of pre-defined and parsed variables. In the Adventure engine, the context is kept in the Player object (containing attributes that point to the current Room and the collection of Items), which is passed to the parsed command object to execute the command.
The View is the parser itself. It extracts the pieces of each command and composes an instance of a command class. The interface to the command class's exec method depends on how you have set up the Model. But in general you can envision that the exec method you define will take the Model as one of, if not its only, parameter.
Then the Controller is a simple loop that implements the following pseudo-code:
while not finished:
    read a command, assign to commandstring
    parse commandstring, use parsed results to create commandobj (null if bad parse)
    if commandobj is not null:
        commandobj.exec(context)
    finished = context.is_finished()
If you implement your parser using pyparsing, then you can define your Command classes as subclasses of this abstract class:
class Command(object):
    def __init__(self, s, l, t):
        self.parameters = t

    def exec(self, context):
        self._do_exec(context)
When you define each command, the corresponding subclass can be passed directly as the command expression's parse action. For instance, a simplified GO command for moving through a maze would look like:
goExpression = Literal("GO") + oneOf("NORTH SOUTH EAST WEST")("direction")
goExpression.setParseAction(GoCommand)
For the abstract Command class above, a GoCommand class might look like:
class GoCommand(Command):
    def _do_exec(self, context):
        if context.is_valid_move(self.parameters.direction):
            context.move(self.parameters.direction)
        else:
            context.report("Sorry, you can't go " +
                           self.parameters.direction +
                           " from here.")
By parsing a statement like "GO NORTH", you would get back not a ParseResults containing the tokens "GO" and "NORTH", but a GoCommand instance, whose parameters include a named token "direction", giving the direction parameter for the GO command.
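A quick check of that behaviour (assuming the classes above):

result = goExpression.parseString("GO NORTH")
go_cmd = result[0]                  # a GoCommand instance, not raw tokens
print(go_cmd.parameters.direction)  # -> NORTH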
So the design steps to do this are:
design your virtual machine, and its command interface
create a class to capture the state/context in the virtual machine
design your commands, and their corresponding Command subclasses
create the pyparsing parser expressions for each command
attach the Command subclass as a parse action to each command's pyparsing expression
create an overall parser by combining all the command expressions using '|'
implement the command processor loop (a sketch follows below)
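That last step might look like this in Python (a sketch, assuming the Command classes above and a context object exposing is_finished(); command_loop is an illustrative name):

import pyparsing as pp

def command_loop(parser, context):
    """Read-parse-execute loop from the pseudo-code above."""
    while not context.is_finished():
        command_string = input("> ")
        try:
            # the parse action has already replaced the tokens with a Command instance
            command_obj = parser.parseString(command_string)[0]
        except pp.ParseException as err:
            print("Could not parse command:", err)
            continue
        command_obj.exec(context)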
I would do something like this:
cmdStrs = '''
load
remove line
add line
some other command
'''

def loadParse(val): print 'Load --> ' + val
def removeParse(val): print 'remove --> ' + val
def addLineParse(val): print 'addLine --> ' + val
def someOtherCommandParse(val): print 'someOther --> ' + val

commands = [l.strip() for l in cmdStrs.split('\n') if l.strip() != '']
functions = [loadParse,
             removeParse,
             addLineParse,
             someOtherCommandParse]
funcDict = dict(zip(commands, functions))

program = '''
# This is a comment
load 'data.txt' # This is another comment
remove line 1
remove line 4
'''

for l in program.split('\n'):
    l = l.strip().split('#')[0].strip()  # remove comments
    if l == '': continue
    commandFound = False
    for c in commands:
        if c in l:
            funcDict[c](l.split(c)[-1])
            commandFound = True
    if not commandFound:
        print 'Error: Unknown command : ', l
Of course, you can put the entire thing within a class and make it an object, but you see the general structure. If you have an object, then you can go ahead and create a version which can handle contextual/state information. Then, the functions above will simply be member functions.
Why do I get a sense that you are starting on Python after learning Haskell? Generally people go the other way. In Python you get state for free; you don't need classes. You can use classes to handle more than one state within the same program :).
