python importlib seems to be sharing data between instances - python

Ok, so I am having a weird one. I am running python in a SideFX Hython (their custom build) implementation that is using PDG. The only real difference between Hython and vanilla Python is some internal functions for handling geometry data and compiled nodes, which shouldn't be an issue even though they are being used.
The way the code runs, I am generating a list of files from the disk which creates PDG work items. Those work items are then processed in parallel by PDG. Here is the code for that:
import importlib.util
import pdg
import os
from pdg.processor import PyProcessor
import json
class CustomProcessor(PyProcessor):
def __init__(self, node):
PyProcessor.__init__(self,node)
self.extractor_module = 'GeoExtractor'
def onGenerate(self, item_holder, upstream_items, generation_type):
for upstream_item in upstream_items:
new_item = item_holder.addWorkItem(parent=upstream_item, inProcess=True)
return pdg.result.Success
def onCookTask(self, work_item):
spec = importlib.util.spec_from_file_location("callback", "Geo2Custom.py")
GE = importlib.util.module_from_spec(spec)
spec.loader.exec_module(GE)
GE.convert(f"{work_item.attribValue('directory')}/{work_item.attribValue('filename')}{work_item.attribValue('extension')}", work_item.index, f'FRAME { work_item.index }', self.extractor_module)
return pdg.result.Success
def bulk_convert (path_pattern, extractor_module = 'GeoExtractor'):
type_registry = pdg.TypeRegistry.types()
try:
type_registry.registerNode(CustomProcessor, pdg.nodeType.Processor, name="customprocessor", label="Custom Processor", category="Custom")
except Exception:
pass
whereItWorks = pdg.GraphContext("testBed")
whatWorks = whereItWorks.addScheduler("localscheduler")
whatWorks.setWorkingDir(os.getcwd (), '$HIP')
whereItWorks.setValues(f'{whatWorks.name}', {'maxprocsmenu':-1, 'tempdirmenu':0, 'verbose':1})
findem = whereItWorks.addNode("filepattern")
whereItWorks.setValue(f'{findem.name}', 'pattern', path_pattern, 0)
generic = whereItWorks.addNode("genericgenerator")
whereItWorks.setValue(generic.name, 'itemcount', 4, 0)
custom = whereItWorks.addNode("customprocessor")
custom.extractor_module = extractor_module
node1 = [findem]
node2 = [custom]*len(node1)
for n1, n2 in zip(node1, node2):
whereItWorks.connect(f'{n1.name}.output', f'{n2.name}.input')
n2.cook(True)
for node in whereItWorks.graph.nodes():
node.dirty(False)
whereItWorks.disconnect(f'{n1.name}.output', f'{n2.name}.input')
print ("FULLY DONE")
import os
import hou
import traceback
import CustomWriter
import importlib
def convert (filename, frame_id, marker, extractor_module = 'GeoExtractor'):
Extractor = importlib.__import__ (extractor_module)
base, ext = os.path.splitext (filename)
if ext == '.sc':
base = os.path.splitext (base)[0]
dest_file = base + ".custom"
geo = hou.Geometry ()
geo.loadFromFile (filename)
try:
frame = Extractor.extract_geometry (geo, frame_id)
except Exception as e:
print (f'F{ frame_id } Geometry extraction failed: { traceback.format_exc () }.')
return None
print (f'F{ frame_id } Geometry extracted. Writing file { dest_file }.')
try:
CustomWriter.write_frame (frame, dest_file)
except Exception as e:
print (f'F{ frame_id } writing failed: { e }.')
print (marker + " SUCCESS")
The onCookTask code is run when the work item is processed.
Inside of the GeoExtractor.py program I am importing the geometry file defined by the work item, then converting it into a couple Pandas dataframes to collate and process the massive volumes of data quickly, which is then passed to a custom set of functions for writing binary files to disk from the Pandas data.
Everything appears to run flawlessly, until I check my output binaries and see that they escalate in file size much more than they should, indicating that either something is being shared between instances or not cleared from memory and subsequent loads of the extractor code is appending the dataframes which are named the same.
I have run the GeoExtractor code sequentially with the python instance closing between each file conversion using the exact same code and the files are fine, growing only very slowly as the geometry data volume grows, so the issue has to lie somewhere in the parallelization of it using PDG and calling the GeoExtractor.py code over and over for each work item.
I have contemplated moving the importlib stuff to the __init__() for the class leaving only the call to the member function in the onCookTask() function. Maybe even going so far as to pass a unique variable for each work item which is used inside GeoExtractor to create a closure of the internal functions so they are unique instances in memory.
I tried to do a stripped down version of GeoExtractor and since I'm not sure where the leak is, I just ended up pulling out comments with proprietary or superfluous information and changing some custom library names, but the file ended up kinda long so I am including a pastebin: https://pastebin.com/4HHS8D2W
As for CustomGeometry and CustomWriter, there is no working form of either of those libraries that will be NDA safe, so unfortunately they have to stay blackboxed. The CustomGeometry is a handful of container classes which organize all of the data coming out of the geometry, and the writer is a formatter/writer for the binary format we are utilizing. I am hoping the issue wouldn't be in either of them.
Edit 1: I fixed an issue in the example code.
Edit 2: Added larger examples.

Related

self.zone.locations==None when parsing zones.mpk

I've finally made it through all of the linting errors when I converted the project from the archaic Python 2.7 code to Python 3.11. When I run python MapperGUI-3.pyw, the GUI loads but something is keeping the daoc game path from being loaded properly and stored, resulting in the error message "Game not found".
I've beat my ahead against the functions and variables and I am sure I'm missing something obvious, but for the life of me I cannot figure out where it is going wrong parsing the game directory of 'C:/Program Files (x86)/Electronic Arts/Dark Age of Camelot/'
For whatever reason, the program is stuck thinking self.zone.locations==None and I confirmed this by manually declaring the variable to a valid one and things worked fine.
MapperGUI/modules/zones.py is the module that loads the zones from the .mpk file dynamically. I verified mapper/all_locations.txt is being created properly when zones.py is parsed and called from the main program.
Did I miss something in zones.py? When I ran the 2 to 3 conversion on it, nothing was changed and I am not seeing any runtime errors.
#zones.py
import os, sys
sys.path.append("../mapper")
import dempak
import datParser as ConfigParser
class Zones:
def __init__(self, gamepath):
try:
self.locations=[]
cp=ConfigParser.ConfigParser()
cp.readfp(dempak.getMPAKEntry(os.path.join(gamepath, 'zones', 'zones.mpk'), 'zones.dat'))
sections=cp.sections()
sections.sort()
for s in sections:
if s[:4] == 'zone':
id=int(s[4:])
enabled=int(cp.get(s, 'enabled'))
name=cp.get(s, 'name')
region=cp.get(s,'region'); # need region to id realm -- cch
try:
type=int(cp.get(s, 'type'))
except ConfigParser.NoOptionError:
type=0
if type==0 or type==3:
# add region to tuple -- cch
self.locations.append(("%03d" % id, name, region))
else: # ignore type 1,2,4 (city,dungeon,TOA city)
continue
dataFile=open("mapper/all_locations.txt","w")
for x in range(len(self.locations)):
dataFile.write("%s : %s\n" % (self.locations[x][0], self.locations[x][1]))
dataFile.close()
except IOError:
self.locations=None

How to solve class objecto has no atribute

beginner Python user here.
So, I´m trying to make a program that orders the files of my (many) Downloads folder.
I made a class object to work with the many folders:
class cContenedora:
def __int__(self, nCarp, dCarp): #nCarp Stands is the file name and dCarp Stands for file directory.
self.nCarp = nCarp
self.dCarp = dCarp
So, y wrote a instance like this:
Download = cContenedora()
Download.nCarp = "Downloads/"
#The side bar is for making a path to move my archives from with shutil.move(path, dest)
Download.dCarp = "/Users/MyName/Download/"
#This is for searching the folder with os.listdir(Something.dCarp)
Then, I wrote my function, and it goes something like this:
def ordenador(carpetaContenedora, formato, directorioFinal): #carpetaContenedora is a Download Folder
carpetaContenedora = cContenedora() #carpetaContenedora one of the class objects
dirCCont = os.listdir(carpetaContenedora.dCarp) #The to directory is carpetaContenedora.cCarp
for a in dirCCont:
if a.endswith(formato):
path = "/Users/Aurelio Induni/" + carpetaContenedora().nCarp + a
try:
shutil.move(path, directorioFinal)
print(Fore.GREEN + a + "fue movido exitosamente.")
except:
print(Fore.RED + "Error con el archivo" + a)
pass
for trys in range(len(listaCarpetasDestino)-1): #Is a list full of directories.
for container in listaCarpetasFuente: #A short list of all my Downloads Folder.
for formatx in listaFormatos: #listaFormatos is a list ful of format extensions like ".pdf"
#try: #I disabled this to see the error istead of "Error Total"
ordenador(container, formatx, listaCarpetasDestino[trys])
#except:
#print(Fore.RED + "Error Total") #I disabled this to see the error.
But every time I run it I get the following:
AttributeError: 'cContenedora' object has no attribute 'dCarp'
It says the error is in line 47 (the one with the os.listdir(carpetaContenedora.dCarp))
I´m sure is something small. Python is so amazing, but it also can be so frustrating not knowing what´s wrong.
There is a spelling mistake in the initialization of your instance. It should be "init" instead of "int".
In the class cContenedora, the function should be
class cContenedora:
def __init__(self, nCarp, dCarp):
self.nCarp = nCarp
self.dCarp = dCarp
Additionally, When you are passing in the parameter. Make sure to pass in both of the parameters in the line with Value.
CContenedora(nCarp="something",dCarp="something")
Your class initializer, i.e., __init__() function has 2 parameters nCarp and dCarp but when you are actually creating the object there are no parameters passed.
Your function ordenador takes the first parameter as carpetaContenedora, on the first line same variable is assigned a new object of cContenedora, at this line the original values you passed are lost forever.
This could be the reason it is giving for the error.
Refer this link for more details on how to create classes and instantiate the object.

Fastai - failed initiation of language model in Sentence Piece Processor, cache_dir parameter

I've been already browsing web for hours to find a solution for my, which i believe so might be a pretty petty issue.
I'm using fastai's Sentence Piece Processor (SPProcesor) at the very first steps of initiation of a language model.
My code for these steps looks like this:
bs = 48
processor = SPProcessor(lang='pl')
data_lm = (TextList.from_csv('', target_corpus, processor=processor)
.split_by_rand_pct(0.1)
.label_for_lm()
.databunch(bs=bs)
)
data_lm.save(data_lm_file)
After execution i get an error which is as follows:
~/x/miniconda3/envs/fastai/lib/python3.6/site-packages/fastai/text/data.py in process(self, ds)
466 self.sp_model,self.sp_vocab = cache_dir/'spm.model',cache_dir/'spm.vocab'
467 if not getattr(self, 'vocab', False):
--> 468 with open(self.sp_vocab, 'r', encoding=self.enc) as f: self.vocab = Vocab([line.split('\t')[0] for line in f.readlines()])
469 if self.n_cpus <= 1: ds.items = self._encode_batch(ds.items)
470 else:
FileNotFoundError: [Errno 2] No such file or directory: 'tmp/spm/spm.vocab'
The proper outcome of the code executed above should be as following:
created folder named 'tmp', containing folder 'spm', within which should be placed 2 files named respectively: spm.vocab and spm.model.
What happens instead is that 'tmp' folder is created along with files named "cache_dir".vocab and "cache_dir".model inside my current directory.
Folder 'spm' is nowhere to be found.
I've found a sort of workaround solution.
It consists of manually creating a 'spm' folder inside 'tmp' and moving those 2 other mentioned above files into it, and changing their names to spm.vocab and spm.model.
That allows me to carry on with my processing yet I'd like to find a way to skip that neccessity of manually moving created files and else.
Maybe I need to pass some paramateres (probably cache_dir) with specific values before processing?
If you'd have any idea on how to solve that issue, please point me those.
I'd be grateful.
I can see similar error if I switch the code in fastai/text/data.py to an earlier version of this commit. Then, if I apply changes from the same commit it all works nicely. Now, the most recent version of the same file (the one which supposed to help with paths with spaces) seems to have yet another bug introduced there.
So pretty much it seems that the problem is that fastai is trying to give argument --model_prefix with quotes to the sentencepiece .SentencePieceTrainer.Train which makes it "misbehave".
One possibility for you would be to either (1) update to the later version of fastai (which might not help due to another bug in a newer version), or (2) manually apply changes from here to your installation's fastai/text/data.py. It's a very small change - just delete the line:
cache_dir = cache_dir/'spm'
and replace
f'--model_prefix="cache_dir" --vocab_size={vocab_sz} --model_type={model_type}']))
with the
f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
In case you are not comfortable with updating the code of the installation you can monkey-patch the module by substituting existing train_sentencepiece function by writing fixed version in your code and then doing something like fastai.text.data.train_sentencepiece = my_fixed_train_sentencepiece before other calls.
So if you are using newer version of the library the code might look like this:
import fastai
from fastai.core import PathOrStr
from fastai.text.data import ListRules, get_default_size, quotemark, full_char_coverage_langs
from typing import Collection
def train_sentencepiece(texts:Collection[str], path:PathOrStr, pre_rules: ListRules=None, post_rules:ListRules=None,
vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en',
char_coverage=None, tmp_dir='tmp', enc='utf8'):
"Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
from sentencepiece import SentencePieceTrainer
cache_dir = Path(path)/tmp_dir
os.makedirs(cache_dir, exist_ok=True)
if vocab_sz is None: vocab_sz=get_default_size(texts, max_vocab_sz)
raw_text_path = cache_dir / 'all_text.out'
with open(raw_text_path, 'w', encoding=enc) as f: f.write("\n".join(texts))
spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]
SentencePieceTrainer.Train(" ".join([
f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
f"--user_defined_symbols={','.join(spec_tokens)}",
f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
raw_text_path.unlink()
return cache_dir
fastai.text.data.train_sentencepiece = train_sentencepiece
And if you are using older version, then like the following:
import fastai
from fastai.core import PathOrStr
from fastai.text.data import ListRules, get_default_size, full_char_coverage_langs
from typing import Collection
def train_sentencepiece(texts:Collection[str], path:PathOrStr, pre_rules: ListRules=None, post_rules:ListRules=None,
vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en',
char_coverage=None, tmp_dir='tmp', enc='utf8'):
"Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
from sentencepiece import SentencePieceTrainer
cache_dir = Path(path)/tmp_dir
os.makedirs(cache_dir, exist_ok=True)
if vocab_sz is None: vocab_sz=get_default_size(texts, max_vocab_sz)
raw_text_path = cache_dir / 'all_text.out'
with open(raw_text_path, 'w', encoding=enc) as f: f.write("\n".join(texts))
spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]
SentencePieceTrainer.Train(" ".join([
f"--input={raw_text_path} --max_sentence_length={max_sentence_len}",
f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
f"--user_defined_symbols={','.join(spec_tokens)}",
f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
raw_text_path.unlink()
return cache_dir
fastai.text.data.train_sentencepiece = train_sentencepiece

Loading and dumping multiple yaml files with ruamel.yaml (python)

Using python 2 (atm) and ruamel.yaml 0.13.14 (RedHat EPEL)
I'm currently writing some code to load yaml definitions, but they are split up in multiple files. The user-editable part contains eg.
users:
xxxx1:
timestamp: '2018-10-22 11:38:28.541810'
<< : *userdefaults
xxxx2:
<< : *userdefaults
timestamp: '2018-10-22 11:38:28.541810'
the defaults are stored in another file, which is not editable:
userdefaults: &userdefaults
# Default values for user settings
fileCountQuota: 1000
diskSizeQuota: "300g"
I can process these together by loading both and concatinating the strings, and then running them through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)) which correctly resolves everything. (when not using RoundTripLoader I get errors that the references cannot be resolved, which is normal)
Now, I want to do some updates via python code (eg. update the timestamp), and for that I need to just write back the user part. And that's where things get hairy. I sofar haven't found a way to just write that yaml document, not both.
First of all, unless there are multiple documents in your defaults file, you
don't have to use load_all, as you don't concatenate two documents into a
multiple-document stream. If you had by using a format string with a document-end
marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}")
your aliases would not carry over from one document to another, as per the
YAML specification:
It is an error for an alias node to use an anchor that does not
previously occur in the document.
The anchor has to be in the document, not just in the stream (which can consist of multiple
documents).
I tried some hocus pocus, pre-populating the already represented dictionary
of anchored nodes:
import sys
import datetime
from ruamel import yaml
def load():
with open('defaults.yaml') as fp:
defaults_data = fp.read()
with open('user.yaml') as fp:
user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
Loader=yaml.RoundTripLoader)
return merged_data
class MyRTDGen(object):
class MyRTD(yaml.RoundTripDumper):
def __init__(self, *args, **kw):
pps = kw.pop('pre_populate', None)
yaml.RoundTripDumper.__init__(self, *args, **kw)
if pps is not None:
for pp in pps:
try:
anchor = pp.yaml_anchor()
except AttributeError:
anchor = None
node = yaml.nodes.MappingNode(
u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
self.represented_objects[id(pp)] = node
def __init__(self, pre_populate=None):
assert isinstance(pre_populate, list)
self._pre_populate = pre_populate
def __call__(self, *args, **kw):
kw1 = kw.copy()
kw1['pre_populate'] = self._pre_populate
myrtd = self.MyRTD(*args, **kw1)
return myrtd
def update(md, file_name):
ud = md.pop('userdefaults')
MyRTD = MyRTDGen([ud])
yaml.dump(md, sys.stdout, Dumper=MyRTD)
with open(file_name, 'w') as fp:
yaml.dump(md, fp, Dumper=MyRTD)
md = load()
md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
update(md, 'user.yaml')
Since the PyYAML based API requires a class instead of an object, you need to
use a class generator, that actually adds the data elements to pre-populate on
the fly from withing yaml.load().
But this doesn't work, as a node only gets written out with an anchor once it is
determined that the anchor is used (i.e. there is a second reference). So actually the
first merge key gets written out as an anchor. And although I am quite familiar
with the code base, I could not get this to work properly in a reasonable amount of time.
So instead, I would just rely on the fact that there is only one key that matches
the first key of users.yaml at the root level of the dump of the combined updated
file and strip anything before that.
import sys
import datetime
from ruamel import yaml
with open('defaults.yaml') as fp:
defaults_data = fp.read()
with open('user.yaml') as fp:
user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
Loader=yaml.RoundTripLoader)
# find the key
for line in user_data.splitlines():
line = line.split('# ')[0].rstrip() # end of line comment, not checking for strings
if line and line[-1] == ':' and line[0] != ' ':
split_key = line
break
merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
buf = yaml.compat.StringIO()
yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
document = split_key + buf.getvalue().split('\n' + split_key)[1]
sys.stdout.write(document)
which gives:
users:
xxxx1:
<<: *userdefaults
timestamp: '2018-10-22 11:38:28.541810'
xxxx2:
<<: *userdefaults
timestamp: '2018-10-23 09:59:13.829978'
I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14.
That version is from the time I was still young (I won't claim to have been innocent).
There have been over 85 releases of the library since then.
I can understand that you might not be able to run anything but
Python2 at the moment and cannot compile/use a newer version. But what
you really should do is install virtualenv (can be done using EPEL, but also without
further "polluting" your system installation), make a virtualenv for the
code you are developping and install the latest version of ruamel.yaml (and
your other libraries) in there. You can also do that if you need
to distribute your software to other systems, just install virtualenv there as well.
I have all my utilties under /opt/util, and managed
virtualenvutils a
wrapper around virtualenv.
For writing the user part, you will have to manually split the output of yaml.dump() multifile output and write the appropriate part back to users yaml file.
import datetime
import StringIO
import ruamel.yaml
yaml = ruamel.yaml.YAML(typ='rt')
data = None
with open('defaults.yaml', 'r') as defaults:
with open('users.yaml', 'r') as users:
raw = "{}\n{}".format(''.join(defaults.readlines()), ''.join(users.readlines()))
data = list(yaml.load_all(raw))
data[0]['users']['xxxx1']['timestamp'] = datetime.datetime.now().isoformat()
with open('users.yaml', 'w') as outfile:
sio = StringIO.StringIO()
yaml.dump(data[0], sio)
out = sio.getvalue()
outfile.write(out.split('\n\n')[1]) # write the second part here as this is the contents of users.yaml

Module seemingly reimported in memoized Python function

I'm writing a Python command line utility that involves converting a string into a TextBlob, which is part of a natural language processing module. Importing the module is very slow, ~300 ms on my system. For speediness, I created a memoized function that converts text to a TextBlob only the first time the function is called. Importantly, if I run my script over the same text twice, I want to avoid reimporting TextBlob and recomputing the blob, instead pulling it from the cache. That's all done and works fine, except, for some reason, the function is still very slow. In fact, it's as slow as it was before. I think this must be because the module is getting reimported even though the function is memoized and the import statement happens inside the memoized function.
The goal here is to fix the following code so that the memoized runs are as speedy as they ought to be, given that the result does not need to be recomputed.
Here's a minimal example of the core code:
#memoize
def make_blob(text):
import textblob
return textblob.TextBlob(text)
if __name__ == '__main__':
make_blob("hello")
And here's the memoization decorator:
import os
import shelve
import functools
import inspect
def memoize(f):
"""Cache results of computations on disk in a directory called 'cache'."""
path_of_this_file = os.path.dirname(os.path.realpath(__file__))
cache_dirname = os.path.join(path_of_this_file, "cache")
if not os.path.isdir(cache_dirname):
os.mkdir(cache_dirname)
cache_filename = f.__module__ + "." + f.__name__
cachepath = os.path.join(cache_dirname, cache_filename)
try:
cache = shelve.open(cachepath, protocol=2)
except:
print 'Could not open cache file %s, maybe name collision' % cachepath
cache = None
#functools.wraps(f)
def wrapped(*args, **kwargs):
argdict = {}
# handle instance methods
if hasattr(f, '__self__'):
args = args[1:]
tempargdict = inspect.getcallargs(f, *args, **kwargs)
for k, v in tempargdict.iteritems():
argdict[k] = v
key = str(hash(frozenset(argdict.items())))
try:
return cache[key]
except KeyError:
value = f(*args, **kwargs)
cache[key] = value
cache.sync()
return value
except TypeError:
call_to = f.__module__ + '.' + f.__name__
print ['Warning: could not disk cache call to ',
'%s; it probably has unhashable args'] % (call_to)
return f(*args, **kwargs)
return wrapped
And here's a demonstration that the memoization doesn't currently save any time:
❯ time python test.py
python test.py 0.33s user 0.11s system 100% cpu 0.437 total
~/Desktop
❯ time python test.py
python test.py 0.33s user 0.11s system 100% cpu 0.436 total
This is happening even though the function is correctly being memoized (print statements put inside the memoized function only give output the first time the script is run).
I've put everything together into a GitHub Gist in case it's helpful.
What about a different approach:
import pickle
CACHE_FILE = 'cache.pkl'
try:
with open(CACHE_FILE) as pkl:
obj = pickle.load(pkl)
except:
import slowmodule
obj = "something"
with open(CACHE_FILE, 'w') as pkl:
pickle.dump(obj, pkl)
print obj
Here we cache the object, not the module. Note that this will not give you any savings if the object your caching requires slowmodule. So in the above example you would see savings, since "something" is a string and doesn't require the slowmodule module to understand it. But if you did something like
obj = slowmodule.Foo("bar")
The unpickling process would automatically import slowmodule, negating any benefit of caching.
So if you can manipulate textblob.TextBlob(text) into something that, when unpickled doesn't require the textblob module, then you'll see savings using this approach.

Categories