Line profiling Python code running as a background service

I am aware of how standalone scripts are profiled using kernprof/profile/cProfile. But how can I profile a Python web application running as a background service / long-running application?

After some drilling down and exploring potential solutions, I came up with the following:
Add the function below to a source file and decorate the function you want to profile with @do_cprofile:
import cProfile

def do_cprofile(func):
    def profiled_func(*args, **kwargs):
        profile = cProfile.Profile()
        profile.enable()
        try:
            return func(*args, **kwargs)
        finally:
            # always stop profiling and write the stats, even if func raises
            profile.disable()
            profile.dump_stats('/tmp/profile_bin.prof')
    return profiled_func
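For example, in the service code (handle_request is a hypothetical entry point, not from the original post):

@do_cprofile
def handle_request(payload):
    ...  # the original request-handling logic goes here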
Convert the generated /tmp/profile_bin.prof into a human-readable file:
import pstats

with open('/tmp/human_readable_profile.prof', 'w') as f:
    stats = pstats.Stats('/tmp/profile_bin.prof', stream=f)
    stats.sort_stats('cumulative').print_stats()
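The same stats can also be dumped straight to stdout without the intermediate file, since sort_stats() returns the Stats object:

import pstats
pstats.Stats('/tmp/profile_bin.prof').sort_stats('cumulative').print_stats(20)  # top 20 rows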

Related

Create a gzip file like object for unit testing

I want to test a Python function that reads a gzip file and extracts something from the file (using pytest).
import gzip

def my_function(file_path):
    output = []
    with gzip.open(file_path, 'rt') as f:
        for line in f:
            output.append('something from line')
    return output
Can I create a gzip file-like object that I can pass to my_function? The object should have defined content and should work with gzip.open().
I know that I can create a temporary gzip file in a fixture, but that depends on the filesystem and other properties of the environment. Creating a file-like object from code would be more portable.
You can use the io and gzip libraries to create in-memory file objects. Example:
import io, gzip

def inmem():
    stream = io.BytesIO()
    with gzip.open(stream, 'wb') as f:
        f.write(b'spam\neggs\n')
    stream.seek(0)
    return stream
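Since gzip.open() accepts file objects as well as paths, the test can pass the in-memory stream straight to my_function:

def test_my_function():
    # two input lines, so two items in the output
    assert my_function(inmem()) == ['something from line', 'something from line']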
You should never try to test outside code in a unit test. Only test the code you wrote. If you're testing gzip, then gzip is doing something wrong (they should be writing their own unit tests). Instead, do something like this:
from unittest import mock

@mock.patch('gzip.open')
def test_my_function(mock_gzip_open):
    # make the mocked context manager yield an iterable "file"
    mock_gzip_open.return_value.__enter__.return_value = ['one line\n']
    file_path = 'testpath'
    output = my_function(file_path=file_path)
    mock_gzip_open.assert_called_with(file_path, 'rt')
    assert output == ['something from line']  # whatever you expect from your method
That's your whole unit test. All you want to know is that gzip.open() was called (and you assume it works or else gzip is failing and that's their problem) and that you got back what you expected from the method being tested. You specify what gzip returns based on what you expect it to return, but you don't actually call the function in your test.
It's a bit verbose but I'd do something like this (I have assumed that you saved my_function to a file called patch_one.py):
import patch_one # this is the file with my_function in it
from unittest.mock import patch
from unittest import TestCase
class MyTestCase(TestCase):
def test_my_function(self):
# because you used "with open(...) as f", we need a mock context
class MyContext:
def __enter__(self, *args, **kwargs):
return [1, 2] # note the two items
def __exit__(self, *args, **kwargs):
return None
# in case we want to know the arguments to open()
open_args = None
def f(*args, **kwargs):
def my_open(*args, **kwargs):
nonlocal open_args
open_args = args
return MyContext()
return my_open
# patch the gzip.open in our file under test
with patch('patch_one.gzip.open', new_callable=f):
# finally, we can call the function we want to test
ret_val = patch_one.my_function('not a real file path')
# note the two items, corresponding to the list in __enter__()
self.assertListEqual(['something from line', 'something from line'], ret_val)
# check the arguments, just for fun
self.assertEqual('rt', open_args[1])
If you want to try anything more complicated, I would recommend reading the unittest.mock docs, because how you import the patch_one file matters, as does the string you pass to patch().
There will definitely be a way to do this with Mock or MagicMock but I find them a bit hard to debug so I went the long way round.
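As a quick illustration of the "where to patch" point (a sketch; patch_one.py is assumed to do a plain "import gzip"):

from unittest.mock import patch

# patch_one.py does "import gzip", so either target works here,
# because both end up replacing the open attribute on the gzip module:
with patch('patch_one.gzip.open'):
    ...
with patch('gzip.open'):
    ...

# but if patch_one.py instead did "from gzip import open as gzip_open",
# you would have to patch the name where it is looked up:
# with patch('patch_one.gzip_open'): ...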

How to design a command line interface (CLI) which accepts Python functions or code?

I have a function that parses a given string according to specific rules. I would like to design a CLI for this function, but the user should be able to call it from the CLI with reader and writer functions of their own. To make this clear, here is some sample code demonstrating what I'm trying to explain.
# mylib.py
# piece of code that belongs to my lib
def parser(_id, text):
    # parse the text & do some magic
    return (_id, parsed_text)

# user-side code
def reader():
    # read from a database
    # or file or network or who knows where
    yield (_id, text)

# user-side code
def writer(_id, text):
    # write to somewhere
    return True  # or False, depending on the write action
A sample call should be something like this:
$ python mylib.py --reader <something-that-I-dont-know>
I don't want to use eval tricks, but I also want the user to be flexible in how they pass data to my library. Is this possible? Or should I try another approach?
With the help of @AlexHall, I've come up with the following solution:
import pathlib
import importlib.util

def load_module(filepath):
    module_path = pathlib.Path(filepath)
    abs_path = module_path.resolve()
    module_name = module_path.stem
    spec = importlib.util.spec_from_file_location(module_name, abs_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
Using this function, I am able to import any valid Python module that exists on the filesystem, even if it is not on the path.
Here is a sample usage:
parser = make_parser(prog="tokenizer")
args = parser.parse_args()

# if nothing is passed, the default action defined in the parser applies
module = load_module(args.writer)
writer = module.writer

module = load_module(args.reader)
reader = module.reader

# do what you want to do with them
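make_parser is not shown in the original post; a minimal argparse-based sketch might look like this (the option names and defaults are assumptions):

import argparse

def make_parser(prog):
    parser = argparse.ArgumentParser(prog=prog)
    # hypothetical options: each takes the path of a user module on disk
    parser.add_argument('--reader', default='default_io.py',
                        help='path to a Python file defining reader()')
    parser.add_argument('--writer', default='default_io.py',
                        help='path to a Python file defining writer()')
    return parser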

Python mock running in pytest crashing my instance

I've got the following test module, running in Python 2.7 on FreeBSD:
from mock import patch
from t.conf import load_config

@patch('t.conf._load_file')
def test_open_file(mock_open):
    string_read = load_config('foo')
    print(repr(string_read))
It runs fine if I comment out the patch line. It crashes my machine after hanging for about 10 minutes if I leave it in. I'm using patch in other places without any issues. What's going on?
_load_file looks like this:
def _load_file(filename):
    with open(filename, 'rb') as f:
        return f.read()
And load_config looks like this:
def load_config(filename):
    text = _load_file(filename)
    data = yaml.safe_load(text)
    return data

How to pass a value to a Pytest fixture

I am using Pytest to test an executable. This .exe file reads a configuration file on startup.
I have written a fixture that spawns this .exe file at the start of each test and closes it down at the end. However, I cannot work out how to tell the fixture which configuration file to use. I want the fixture to copy a specified config file to a directory before spawning the .exe file.
@pytest.fixture
def session(request):
    copy_config_file(specific_file)  # how do I specify the file to use?
    link = spawn_exe()
    def fin():
        close_down_exe()
    request.addfinalizer(fin)  # teardown at the end of the test
    return link

# needs to use config file foo.xml
def test_1(session):
    session.talk_to_exe()

# needs to use config file bar.xml
def test_2(session):
    session.talk_to_exe()
How do I tell the fixture to use foo.xml for test_1 function and bar.xml for test_2 function?
Thanks
John
One solution is to use pytest.mark for that:
import pytest

@pytest.fixture
def session(request):
    m = request.node.get_closest_marker('session_config')
    if m is None:
        pytest.fail('please use "session_config" marker')
    specific_file = m.args[0]
    copy_config_file(specific_file)
    link = spawn_exe()
    yield link
    close_down_exe(link)

@pytest.mark.session_config("foo.xml")
def test_1(session):
    session.talk_to_exe()

@pytest.mark.session_config("bar.xml")
def test_2(session):
    session.talk_to_exe()
Another approach would be to just change your session fixture slightly to delegate the creation of the link to the test function:
import pytest

@pytest.fixture
def session_factory(request):
    links = []
    def make_link(specific_file):
        copy_config_file(specific_file)
        link = spawn_exe()
        links.append(link)
        return link
    yield make_link
    for link in links:
        close_down_exe(link)

def test_1(session_factory):
    session = session_factory('foo.xml')
    session.talk_to_exe()

def test_2(session_factory):
    session = session_factory('bar.xml')
    session.talk_to_exe()
I prefer the latter, as it's simpler to understand and allows for more improvements later, for example if you need to use @pytest.mark.parametrize in a test based on the config value. The latter also lets you spawn more than one executable in the same test.
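For instance, combining the factory with parametrization might look like this (a sketch; the config file names are from the question):

import pytest

@pytest.mark.parametrize('config_file', ['foo.xml', 'bar.xml'])
def test_talk(session_factory, config_file):
    session = session_factory(config_file)
    session.talk_to_exe()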

Module seemingly reimported in memoized Python function

I'm writing a Python command line utility that involves converting a string into a TextBlob, which is part of a natural language processing module. Importing the module is very slow, ~300 ms on my system. For speediness, I created a memoized function that converts text to a TextBlob only the first time the function is called. Importantly, if I run my script over the same text twice, I want to avoid reimporting TextBlob and recomputing the blob, instead pulling it from the cache. That's all done and works fine, except, for some reason, the function is still very slow. In fact, it's as slow as it was before. I think this must be because the module is getting reimported even though the function is memoized and the import statement happens inside the memoized function.
The goal here is to fix the following code so that the memoized runs are as speedy as they ought to be, given that the result does not need to be recomputed.
Here's a minimal example of the core code:
@memoize
def make_blob(text):
    import textblob
    return textblob.TextBlob(text)

if __name__ == '__main__':
    make_blob("hello")
And here's the memoization decorator:
import os
import shelve
import functools
import inspect

def memoize(f):
    """Cache results of computations on disk in a directory called 'cache'."""
    path_of_this_file = os.path.dirname(os.path.realpath(__file__))
    cache_dirname = os.path.join(path_of_this_file, "cache")
    if not os.path.isdir(cache_dirname):
        os.mkdir(cache_dirname)
    cache_filename = f.__module__ + "." + f.__name__
    cachepath = os.path.join(cache_dirname, cache_filename)
    try:
        cache = shelve.open(cachepath, protocol=2)
    except:
        print 'Could not open cache file %s, maybe name collision' % cachepath
        cache = None

    @functools.wraps(f)
    def wrapped(*args, **kwargs):
        argdict = {}
        # handle instance methods
        if hasattr(f, '__self__'):
            args = args[1:]
        tempargdict = inspect.getcallargs(f, *args, **kwargs)
        for k, v in tempargdict.iteritems():
            argdict[k] = v
        key = str(hash(frozenset(argdict.items())))
        try:
            return cache[key]
        except KeyError:
            value = f(*args, **kwargs)
            cache[key] = value
            cache.sync()
            return value
        except TypeError:
            call_to = f.__module__ + '.' + f.__name__
            print ('Warning: could not disk cache call to '
                   '%s; it probably has unhashable args' % call_to)
            return f(*args, **kwargs)
    return wrapped
And here's a demonstration that the memoization doesn't currently save any time:
❯ time python test.py
python test.py 0.33s user 0.11s system 100% cpu 0.437 total
~/Desktop
❯ time python test.py
python test.py 0.33s user 0.11s system 100% cpu 0.436 total
This is happening even though the function is correctly being memoized (print statements put inside the memoized function only give output the first time the script is run).
I've put everything together into a GitHub Gist in case it's helpful.
What about a different approach:
import pickle

CACHE_FILE = 'cache.pkl'

try:
    with open(CACHE_FILE) as pkl:
        obj = pickle.load(pkl)
except:
    import slowmodule
    obj = "something"
    with open(CACHE_FILE, 'w') as pkl:
        pickle.dump(obj, pkl)

print obj
Here we cache the object, not the module. Note that this will not give you any savings if the object you're caching requires slowmodule. In the above example you would see savings, since "something" is a plain string and doesn't require the slowmodule module to understand it. But if you did something like
obj = slowmodule.Foo("bar")
The unpickling process would automatically import slowmodule, negating any benefit of caching.
So if you can turn textblob.TextBlob(text) into something that, when unpickled, doesn't require the textblob module, then you'll see savings using this approach.
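For example, one option (a sketch, assuming plain sentence strings are enough for your downstream use) is to memoize a plain-data representation instead of the TextBlob itself:

@memoize
def make_sentences(text):
    import textblob
    blob = textblob.TextBlob(text)
    # plain Python strings unpickle without importing textblob
    return [str(sentence) for sentence in blob.sentences]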
