Edit:
Firstly, thank you @martineau and @jonrsharpe for your prompt replies.
I was initially hesitant to write a verbose description, but I now realize that I am sacrificing clarity for brevity (thanks @jonrsharpe for the link).
So here's my attempt to describe what I am up to as succinctly as possible:
I have implemented the Lempel-Ziv-Welch text file compression algorithm in the form of a Python package. Here's the link to the repository.
Basically, I have a compress class in the lzw.Compress module, which takes as input the file name (and a bunch of other parameters) and generates the compressed file, which is then decompressed by the decompress class within the lzw.Decompress module to regenerate the original file.
Now what I want to do is compress and decompress a bunch of files of various sizes stored in a directory, then save and graphically visualize the time taken for compression/decompression along with the compression ratio and other metrics. For this, I am iterating over the list of file names and passing each one as a parameter to instantiate the compress class, then starting compression by calling its encode() method, as follows:
import os

os.chdir('/path/to/files/to/be/compressed/')
results = dict()
results['compress_time'] = []
results['other_metrics'] = []
file_path = '/path/to/files/to/be/compressed/'
comp_path = '/path/to/store/compressed/files/'
decomp_path = '/path/to/store/decompressed/file'
files = [_ for _ in os.listdir()]

for f in files:
    from lzw.Compress import compress as comp
    from lzw.Decompress import decompress as decomp

    c = comp(file_path + f, comp_path)  # passing the input file and the output path for storing the compressed file
    c.encode()
    # Then measure the time required for compression using time.monotonic()
    del c
    del comp

    d = decomp('/path/to/compressed/file', decomp_path)  # decompressing
    d.decode()
    # Then measure the time required for decompression using time.monotonic()
    # Append the metrics to the lists in the results dict for this particular file
    if decompressed_file_size != original_file_size:
        print("error")
        break
    del d
    del decomp
I have run this code independently for each file, without the for loop, and compression and decompression both succeed, so there is nothing wrong with the files I wish to compress.
What happens is that whenever I run this loop, the first file (the first iteration) runs successfully, and then on the next iteration, after the entire process has run for the 2nd file, "error" is printed and the loop exits. I have tried reordering the list and even reversing it (in case a particular file was causing the problem), but to no avail.
For the second file/iteration, the decompressed file contents are dubious (they do not match the original file). Typically, the decompressed file size is nearly double that of the original.
I strongly suspect this has something to do with variables of the class/package somehow retaining their state across iterations of the loop. (To counter this I am deleting both the instance and the class at the end of the loop, as shown in the snippet above, but with no success.)
I have also tried importing the classes outside the loop, but with no success.
P.S.: I am a Python newbie and don't have much expertise, so forgive me for not being "pythonic" in my exposition and for raising a rather naive issue.
Update:
Thanks to @martineau, one of the problems was related to the importing of global variables from another submodule.
But there was another issue that crept in owing to my superficial knowledge of the del operator in Python 3.
I have a trie data structure in my program, which is basically a prefix tree, similar in spirit to a binary tree.
I had a self_destruct method to delete the tree, as follows:
class trie():
    def __init__(self):
        self.next = {}
        self.value = None
        self.addr = None

    def insert(self, word=str(), addr=int()):
        node = self
        for index, letter in enumerate(word):
            if letter in node.next.keys():
                node = node.next[letter]
            else:
                node.next[letter] = trie()
                node = node.next[letter]
            if index == len(word) - 1:
                node.value = word
                node.addr = addr

    def self_destruct(self):
        node = self
        if node.next == {}:
            return
        for i in node.next.keys():
            node.next[i].self_destruct()
        del node
It turns out that this C-like recursive deletion of objects makes no sense in Python: del simply removes the name's binding from the namespace, while the real work of reclaiming the object is done by the garbage collector.
Still, it's somewhat odd that Python retains the state/association of variables even when a new object is created (as shown in my loop snippet in the edit).
So two things solved the problem. Firstly, I removed the global variables and made them local to the module where I need them (so there is no need to import them). Secondly, I deleted the self_destruct method of the trie and simply did del root, where root = trie(), after use.
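For illustration, a minimal sketch of that change (the usage code here is hypothetical; only the trie class above and the final del come from my program):
root = trie()
for addr, word in enumerate(['ab', 'abc', 'b']):
    root.insert(word, addr)
# ... use the trie during compression ...
del root  # only unbinds the name; the garbage collector reclaims the whole trie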
Thanks @martineau & @jonrsharpe.
Related
I have the following code:
def incrCount(root):
    root.attrib['count'] = int(root.attrib['count']) + 1
    # root.set('count', int(root.attrib['count']) + 1)
root = getXMLRoot('test.xml')
incrCount(root)
print root.attrib['count']
When I run it, the correct value is printed, but the change is never visible in the file at the end of execution. I have tried both methods above with no success. Can anyone point out where I made the mistake?
As exemplified in the documentation (19.7.1.4. Modifying an XML File), you need to write back to the file after all modification operations have been performed. Assuming that root references an instance of ElementTree, you can use the ElementTree.write() method for this purpose:
.....
root = getXMLRoot('test.xml')
incrCount(root)
print root.attrib['count']
root.write('test.xml')
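If getXMLRoot actually returns the root Element rather than an ElementTree (Elements have no write() method), a common pattern, sketched here under that assumption, is to keep the tree object and write through it:
import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')                                 # keep the ElementTree around
root = tree.getroot()
root.attrib['count'] = str(int(root.attrib['count']) + 1)   # store the value as a string so it serializes cleanly
tree.write('test.xml')                                      # persist the modification back to disk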
This is more of a puzzle than a question per se, because I think I already have the right answer - I just don't know how to test it (if it works under all scenarios).
My program takes input from the user about which modules it will need to load (in the form of an unsorted list or set). Some of those modules depend on other modules. The module-dependency information is stored in a dictionary like this:
all_modules = { 'moduleE':[], 'moduleD':['moduleC'], 'moduleC':['moduleB'], 'moduleB':[], 'moduleA':['moduleD'] }
Here moduleE has no dependencies, moduleD depends on moduleC, and so on.
Also, there is no definitive list of all possible modules, since users can generate their own and I create new ones from time to time, so this solution has to be fairly generic (and thus tested to work in all cases).
What I want to do is get a list of modules to run, in order, such that modules that depend on other modules are only run after their dependencies.
So I wrote the following code to try and do this:
def sort_dependencies(modules_to_sort, all_modules, recursions):
    ## Takes a small set of modules the user wants to run (as a list) and
    ## the full dependency tree (as a dict) and returns a list of all the
    ## modules/dependencies needed to be run, in the order to be run in.
    ## Cycles poorly detected as recursions going above 10.
    ## If that happens, this function returns False.
    if recursions == 10:
        return False
    result = []
    for module in modules_to_sort:
        if module not in result:
            result.append(module)
        dependencies = all_modules[module]
        for dependency in dependencies:
            if dependency not in result:
                result.append(dependency)
            else:
                result += [result.pop(result.index(dependency))]
        subdependencies = sort_dependencies(dependencies, all_modules, recursions + 1)
        if subdependencies == False:
            return False
        else:
            for subdependency in subdependencies:
                if subdependency not in result:
                    result.append(subdependency)
                else:
                    result += [result.pop(result.index(subdependency))]
    return result
And it works like this:
>>> all_modules = { 'moduleE':[], 'moduleD':['moduleC'], 'moduleC':['moduleB'], 'moduleB':[], 'moduleA':['moduleD'] }
>>> sort_dependencies(['moduleA'],all_modules,0)
['moduleA', 'moduleD', 'moduleC', 'moduleB']
Note that 'moduleE' isn't returned, since the user doesn't need to run that.
The question is: does it work for any given all_modules dictionary and any given required modules_to_load list? Is there a dependency graph I can put in, and a set of user module lists to try, such that if they all work I can say that all graphs/user lists will work?
After the excellent advice by Marshall Farrier, it looks like what I'm trying to do is a topological sort, so after watching this and this I implemented it as follows:
EDIT: Now with cyclic dependency checking!
def sort_dependencies(all_modules):
    post_order = []
    tree_edges = {}
    for fromNode, toNodes in all_modules.items():
        if fromNode not in tree_edges:
            tree_edges[fromNode] = 'root'
            for toNode in toNodes:
                if toNode not in tree_edges:
                    post_order += get_posts(fromNode, toNode, tree_edges)
            post_order.append(fromNode)
    return post_order

def get_posts(fromNode, toNode, tree_edges):
    post_order = []
    tree_edges[toNode] = fromNode
    for dependancy in all_modules[toNode]:
        if dependancy not in tree_edges:
            post_order += get_posts(toNode, dependancy, tree_edges)
        else:
            parent = tree_edges[toNode]
            while parent != 'root':
                if parent == dependancy: print 'cyclic dependancy found!'; exit()
                parent = tree_edges[parent]
    return post_order + [toNode]

sort_dependencies(all_modules)
However, a topological sort like the one above sorts the whole tree and doesn't actually return just the modules the user needs to run. Of course, having the topological sort of the tree helps solve this problem, but it's not really the same question as the OP. I think for my data the topological sort is probably best, but for a huge graph like all the packages in apt/yum/pip/npm, it's probably better to use the original algorithm in the OP (which I don't know works in all scenarios...) as it only sorts what needs to be used.
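For what it's worth, here is a minimal sketch (my own illustration, not tested against every graph) of cutting the full topological order down to just the requested modules and their transitive dependencies:
def needed_modules(modules_to_load, all_modules):
    # Collect the requested modules plus everything they transitively depend on.
    needed = set()
    stack = list(modules_to_load)
    while stack:
        module = stack.pop()
        if module not in needed:
            needed.add(module)
            stack.extend(all_modules[module])
    return needed

full_order = sort_dependencies(all_modules)            # the whole-tree order from above
wanted = needed_modules(['moduleA'], all_modules)
run_order = [m for m in full_order if m in wanted]     # same order, restricted to what is needed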
So I'm leaving the question up as unanswered, because the real question is "How do I test this?"
My question is the exact opposite of this one.
This is an excerpt from my test file
f1 = open('seed1234','r')
f2 = open('seed7883','r')
s1 = eval(f1.read())
s2 = eval(f2.read())
f1.close()
f2.close()
####
test_sampler1.random_inst.setstate(s1)
out1 = test_sampler1.run()
self.assertEqual(out1,self.out1_regress) # this is fine and passes
test_sampler2.random_inst.setstate(s2)
out2 = test_sampler2.run()
self.assertEqual(out2,self.out2_regress) # this FAILS
Some info -
test_sampler1 and test_sampler2 are two objects of a class that performs some stochastic sampling. The class has an attribute random_inst, which is an object of type random.Random(). The file seed1234 contains a TestSampler's random_inst state as returned by random.getstate() when it was given a seed of 1234, and you can guess what seed7883 is. What I did was create a TestSampler in the terminal, give it a random seed of 1234, acquire the state with rand_inst.getstate(), and save it to a file. I then recreate the regression test and I always get the same output.
HOWEVER
The same procedure as above doesn't work for test_sampler2 - whatever I do, I do not get the same random sequence of numbers. I am using Python's random module and I am not importing it anywhere else, but I do use numpy in some places (though not numpy.random).
The only difference between test_sampler1 and test_sampler2 is that they are created from 2 different files. I know this is a big deal and it is totally dependent on the code I wrote, but I also can't simply paste ~800 lines of code here; I am merely looking for some general idea of what I might be messing up...
What might be scrambling the state of test_sampler2's random number generator?
Solution
There were 2 separate issues with my code:
1
My script is a command-line script, and after I refactored it to use Python's optparse library I found out that I was setting the seed for my sampler using something like seed = sys.argv[1], which meant that I was setting the seed to a str, not an int - seed can take any hashable object, and I found that out the hard way. This explains why I would get 2 different sequences even when I used the "same" seed: one when I ran my script from the command line with something like python sample 1234 (seed is 1234), and another from my unit_tests.py file when I would create an object instance like test_sampler1 = TestSampler(seed=1234).
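For reference, the gist of the fix (sketched; the real script parses its options with optparse, and TestSampler is my sampler class):
import sys

seed = int(sys.argv[1])                  # previously: seed = sys.argv[1], i.e. the str '1234'
test_sampler = TestSampler(seed=seed)    # now seeds with the int 1234, matching the unit test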
2
I have a function for discrete distribution sampling which I borrowed from here (look at the accepted answer). The code there was missing something fundamental: it was still non-deterministic in the sense that if you give it the same values and probabilities arrays, but transformed by a permutation (say values ['a','b'] with probs [0.1,0.9] versus values ['b','a'] with probs [0.9,0.1]), and the seed is set, the PRNG will hand you the same random sample, say 0.3, but since the cumulative probability intervals are laid out differently, in one case you'll get a b and in the other an a. To fix it, I just zipped the values and probabilities together, sorted them by probability, and tada - I now always get the same probability intervals.
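For reference, a minimal sketch of the idea (my own reconstruction rather than the exact code I borrowed): canonicalize the order of the value/probability pairs before building the cumulative intervals, so a permuted input yields the same intervals.
import bisect
import random

def sample_discrete(values, probs, rng=random):
    # Sort the pairs by probability (ties broken by value) so permuted inputs
    # always produce the same cumulative intervals.
    pairs = sorted(zip(values, probs), key=lambda vp: (vp[1], vp[0]))
    cumulative, total = [], 0.0
    for _, p in pairs:
        total += p
        cumulative.append(total)
    r = rng.random() * total
    return pairs[bisect.bisect_left(cumulative, r)][0]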
After fixing both issues the code worked as expected i.e. out2 started behaving deterministically.
The only thing (apart from an internal Python bug) that can change the state of a random.Random instance is calling methods on that instance. So the problem lies in something you haven't shown us. Here's a little test program:
from random import Random

r1 = Random()
r2 = Random()

for _ in range(100):
    r1.random()
for _ in range(200):
    r2.random()

r1state = r1.getstate()
r2state = r2.getstate()

with open("r1state", "w") as f:
    print >> f, r1state
with open("r2state", "w") as f:
    print >> f, r2state

for _ in range(100):
    with open("r1state") as f:
        r1.setstate(eval(f.read()))
    with open("r2state") as f:
        r2.setstate(eval(f.read()))
    assert r1state == r1.getstate()
    assert r2state == r2.getstate()
I haven't run that all day, but I bet I could and never see a failing assert ;-)
BTW, it's certainly more common to use pickle for this kind of thing, but it's not going to solve your real problem. The problem is not in getting or setting the state. The problem is that something you haven't yet found is calling methods on your random.Random instance(s).
While it's a major pain in the butt to do so, you could try adding print statements to random.py to find out what's doing it. There are cleverer ways to do that, but better to keep it dirt simple so that you don't end up actually debugging the debugging code.
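One of those cleverer ways, sketched here (untested, and random_inst is just the attribute name from your description), is to subclass Random so every draw logs who asked for it, then swap that in instead of editing random.py:
import random
import traceback

class LoggingRandom(random.Random):
    def random(self):
        # Print a short stack trace for every draw so the culprit shows up.
        # (Other methods such as getrandbits could be wrapped the same way.)
        traceback.print_stack(limit=4)
        return super(LoggingRandom, self).random()

# Hypothetical usage: replace the sampler's RNG before running the test.
# test_sampler2.random_inst = LoggingRandom()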
I have written a simple MapReduce flow to read in lines from a CSV from a file on Google Cloud Storage and subsequently make an Entity. However, I can't seem to get it to run on more than one shard.
The code makes use of mapreduce.control.start_map and looks something like this.
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
                           handler_spec="backend.line_processor",
                           reader_spec="mapreduce.input_readers.FileInputReader",
                           queue_name=get_queue_name("q-1"),
                           shard_count=shard_count,
                           mapper_parameters={
                               'shard_count': shard_count,
                               'batch_size': 50,
                               'processing_rate': 1000000,
                               'files': [gsfile],
                               'format': 'lines'})
I have shard_count in both places because I'm not sure which methods actually need it. Setting shard_count anywhere from 8 to 32 doesn't change anything: the status page always says 1/1 shards running. To separate things, I've made everything run on a backend queue with a large number of instances. I've tried adjusting the queue parameters per this wiki. In the end, it seems to just run serially.
Any ideas? Thanks!
Update (Still no success):
In trying to isolate things, I tried calling the pipeline directly, like so:
class ImportHandler(webapp2.RequestHandler):
    def get(self, gsfile):
        pipeline = LoadEntitiesPipeline2(gsfile)
        pipeline.start(queue_name=get_queue_name("q-1"))
        self.redirect(pipeline.base_path + "/status?root=" + pipeline.pipeline_id)

class LoadEntitiesPipeline2(base_handler.PipelineBase):
    def run(self, gsfile):
        yield mapreduce_pipeline.MapperPipeline(
            'loadentities2_' + gsfile,
            'backend.line_processor',
            'mapreduce.input_readers.FileInputReader',
            params={'files': [gsfile], 'format': 'lines'},
            shards=32
        )
With this new code, it still only runs on one shard. I'm starting to wonder if mapreduce.input_readers.FileInputReader is capable of parallelizing input by line.
It looks like FileInputReader can only shard by file. The format parameter only changes the way the mapper function gets called. If you pass more than one file to the mapper, it will start running on more than one shard; otherwise it will only use one shard to process the data.
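For illustration (a sketch based on your original start_map call; the gsfile_part* names are hypothetical pre-split pieces), passing a list of files gives FileInputReader something to hand to separate shards:
id = control.start_map(map_name,
                       handler_spec="backend.line_processor",
                       reader_spec="mapreduce.input_readers.FileInputReader",
                       queue_name=get_queue_name("q-1"),
                       shard_count=8,
                       mapper_parameters={
                           'batch_size': 50,
                           'files': [gsfile_part1, gsfile_part2, gsfile_part3],  # hypothetical pre-split pieces
                           'format': 'lines'})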
EDIT #1:
After digging deeper into the mapreduce library: MapReduce decides whether or not to split a file into pieces based on what the can_split method returns for each file type it defines. Currently, the only format that implements the split method is ZipFormat. So, if your file format is not zip, the file won't be split to run on more than one shard.
@classmethod
def can_split(cls):
    """Indicates whether this format support splitting within a file boundary.

    Returns:
        True if a FileFormat allows its inputs to be splitted into
        different shards.
    """
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/file_formats.py
But it looks like it is possible to write your own file format split method. You can try hacking in a split method on _TextFormat first and see whether more than one shard runs.
@classmethod
def split(cls, desired_size, start_index, opened_file, cache):
    pass
EDIT #2:
An easy workaround would be to leave the FileInputReader running serially but move the time-consuming task to the parallel reduce stage.
def line_processor(line):
    # serial
    yield (random.randrange(1000), line)

def reducer(key, values):
    # parallel
    entities = []
    for v in values:
        entities.append(CREATE_ENTITY_FROM_VALUE(v))
    db.put(entities)
EDIT #3:
If you want to try modifying the FileFormat, here is an example (it hasn't been tested yet):
from file_formats import _TextFormat, FORMATS

class _LinesSplitFormat(_TextFormat):
    """Read file line by line."""

    NAME = 'split_lines'

    def get_next(self):
        """Inherited."""
        index = self.get_index()
        cache = self.get_cache()
        offset = sum(cache['infolist'][:index])
        self.get_current_file().seek(offset)
        result = self.get_current_file().readline()
        if not result:
            raise EOFError()
        if 'encoding' in self._kwargs:
            result = result.encode(self._kwargs['encoding'])
        return result

    @classmethod
    def can_split(cls):
        """Inherited."""
        return True

    @classmethod
    def split(cls, desired_size, start_index, opened_file, cache):
        """Inherited."""
        if 'infolist' in cache:
            infolist = cache['infolist']
        else:
            infolist = []
            for i in opened_file:
                infolist.append(len(i))
            cache['infolist'] = infolist
        index = start_index
        while desired_size > 0 and index < len(infolist):
            desired_size -= infolist[index]
            index += 1
        return desired_size, index

FORMATS['split_lines'] = _LinesSplitFormat
Then the new file format can be used by changing the format in mapper_parameters from lines to split_lines.
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
                           handler_spec="backend.line_processor",
                           reader_spec="mapreduce.input_readers.FileInputReader",
                           queue_name=get_queue_name("q-1"),
                           shard_count=shard_count,
                           mapper_parameters={
                               'shard_count': shard_count,
                               'batch_size': 50,
                               'processing_rate': 1000000,
                               'files': [gsfile],
                               'format': 'split_lines'})
It looks to me like FileInputReader should be capable of sharding based on a quick reading of:
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/input_readers.py
It looks like 'format': 'lines' should split using: self.get_current_file().readline()
Does it seem to be interpreting the lines correctly when it is working serially? Maybe the line breaks are the wrong encoding or something.
From experience, FileInputReader will do at most one shard per file.
Solution: Split your big files. I use split_file in https://github.com/johnwlockwood/karl_data to shard files before uploading them to Cloud Storage.
If the big files are already up there, you can use a Compute Engine instance to pull them down and do the sharding because the transfer speed will be fastest.
FYI: karld is in the cheeseshop so you can pip install karld
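If you'd rather not add a dependency, a rough plain-Python sketch of the same idea (this is not karld's implementation) just chops a big file into fixed-size line chunks before upload:
def split_by_lines(path, lines_per_chunk=100000):
    # Write path.part0000, path.part0001, ... each holding at most lines_per_chunk lines.
    chunk_index, count, out = 0, 0, None
    with open(path) as src:
        for line in src:
            if count % lines_per_chunk == 0:
                if out:
                    out.close()
                out = open('%s.part%04d' % (path, chunk_index), 'w')
                chunk_index += 1
            out.write(line)
            count += 1
    if out:
        out.close()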
I am writing code with python that might run wild and do unexpected things. These might include trying to save very large arrays to disk and trying to allocate huge amounts of memory for arrays (more than is physically available on the system).
I want to run the code in a constrained environment in Mac OSX 10.7.5 with the following rules:
The program can write files to one specific directory and no others (i.e. it cannot modify files outside this directory but it's ok to read files from outside)
The directory has a maximum "capacity" so the program cannot save gigabytes worth of data
Program can allocate only a finite amount of memory
Does anyone have any ideas on how to set up such a controlled environment?
Thanks.
import os

stats = os.stat('possibly_big_file.txt')
if (stats.st_size > TOOBIG):
    print "Oh no....."
A simple and naive solution that can be expanded to achieve what you want:
WRITABLE_DIRECTORY = '/full/path/to/writable/directory'

class MaxSizeFile(object):
    def __init__(self, fobj, max_bytes=float('+inf')):
        self._fobj = fobj
        self._max = max_bytes
        self._cur = 0
    def write(self, data):
        # should take into account file position...
        if self._cur + len(data) > self._max:
            raise IOError('The file is too big!')
        self._fobj.write(data)
        self._cur += len(data)
    def __getattr__(self, attr):
        return getattr(self._fobj, attr)

def my_open(filename, mode='r', ..., max_size=float('+inf')):
    if '+' in mode or 'w' in mode:
        if os.path.dirname(filename) != WRITABLE_DIRECTORY:
            raise OSError('Cannot write outside the writable directory.')
    return MaxSizeFile(open(filename, mode, ...), max_size)
Then, instead of using the built-in open, you call my_open. The same can be done for the arrays: instead of allocating arrays directly, you call a function that keeps track of how much memory has been allocated and eventually raises an exception.
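For example, a sketch of the array side of that idea (assuming numpy arrays for concreteness and a single process-wide budget):
import numpy as np

MAX_ARRAY_BYTES = 2 * 1024**3   # arbitrary example budget: ~2 GiB of tracked allocations
_allocated_bytes = 0

def my_empty(shape, dtype=float):
    global _allocated_bytes
    nbytes = np.dtype(dtype).itemsize * int(np.prod(shape))
    if _allocated_bytes + nbytes > MAX_ARRAY_BYTES:
        raise MemoryError('Allocation would exceed the configured limit.')
    _allocated_bytes += nbytes
    return np.empty(shape, dtype=dtype)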
Obviously this gives only really light constraints, but if the program wasn't written with the goal of causing problems it should be enough.