Me: I am running Python 2.3.3 without the possibility to upgrade, and I don't have much experience with Python. My method for learning is googling and reading tons of Stack Overflow.
Background: I am creating a python script whose purpose is to take two directories as arguments and then perform comparisons/diff of all the files found within the two directories. The directories have sub-directories that also have to be included in the diff.
Each directory is a list, sub-directories are nested lists, and so on...
the two directories:
oldfiles/
    a_tar_ball.tar
    a_text_file.txt
    nest1/
        file_in_nest
        nest1a/
            file_in_nest
newfiles/
    a_tar_ball.tar
    a_text_file.txt
    nest1/
        file_in_nest
        nest1a/
Problem: Normally everything should go fine, as all files in oldfiles should exist in newfiles, but in the above example one of the 'file_in_nest' files is missing from 'newfiles/'.
I wish to print an error message telling me which file is missing, but with the code structure below the current instance of my 'compare' function only knows about the directory closest to it. I wonder if there is some built-in error handling that can send information about the file and directory up the recursion ladder, adding to it as we go. If I just printed the filename of the missing file I would not know which one it is, as there are two 'file_in_nest' entries in 'oldfiles'.
def compare(file_tree):
    for counter, entry in enumerate(file_tree[0][1:]):
        if entry not in file_tree[1]:
            # raise "some" error and send information about the file back to
            # the function calling this compare, which might be another compare.
            pass
        elif not isinstance(entry, basestring):
            os.chdir(entry[0])
            compare(entry)
            os.chdir('..')
        else:
            # perform comparison (not relevant to the problem)
            # detect if "some" error has been raised
            # prepend current directory found in entry[0] to file information
            break
def main():
    file_tree = [['/oldfiles', 'a_tar_ball.tar', 'a_text_file.txt',
                  ['/nest1', 'file_in_nest', ['/nest1a', 'file_in_nest']],
                  'yet_another_file'],
                 ['/newfiles', 'a_tar_ball.tar', 'a_text_file.txt',
                  ['/nest1', 'file_in_nest', ['/nest1a']],
                  'yet_another_file']]
    compare(file_tree)
    # detect if "some" error has been raised and print error message
This is my first activity on Stack Overflow other than reading, so please tell me if I should improve the question!
// Stefan
Well, it depends whether you want to report an error as an exception or as some form of status.
Let's say you want to go the 'exception' way and have the whole program stop if one file is missing; you can define your own exception that carries state from the callee back to the caller:
class PathException(Exception):
    def __init__(self, path):
        self.path = path
        Exception.__init__(self)

def compare(filetree):
    old, new = filetree
    for counter, entry in enumerate(old[1:]):
        if entry not in new:
            raise PathException(entry)
        elif not isinstance(entry, basestring):
            os.chdir(entry[0])
            try:
                compare(entry)
                os.chdir("..")
            except PathException, e:  # Python 2.x syntax; 2.3 has no "as" form
                os.chdir("..")
                raise PathException(os.path.join(entry[0], e.path))
        else:
            ...
Where you try a recursive call, and update any incoming exception with the information of the caller.
To see it on a smaller example, let's try to deep-compare two lists, and raise an exception if they are not equal:
class MyException(Exception):
    def __init__(self, path):
        self.path = path
        Exception.__init__(self)

def assertEq(X, Y):
    if hasattr(X, '__iter__') and hasattr(Y, '__iter__'):
        for i, (x, y) in enumerate(zip(X, Y)):
            try:
                assertEq(x, y)
            except MyException as e:
                raise MyException([i] + e.path)
    elif X != Y:
        raise MyException([])  # Empty path -> base case
This gives us:
>>> L1 = [[[1,2,3],[4,5],[[6,7,8],[7,9]]],[3,5,[7,8]]]
>>> assertEq(L1, L1)
Nothing happens (the lists are identical), and:
>>> L1 = [[[1,2,3],[4,5],[[6,7,8],[7,9]]],[3,5,[7,8]]]
>>> L2 = [[[1,2,3],[4,5],[[6,7,8],[7,5]]],[3,5,[7,8]]] # Note the [7,9] -> [7,5]
>>> try:
...     assertEq(L1, L2)
... except MyException as e:
...     print "Diff at", e.path
...
Diff at [0, 2, 1, 1]
>>> print L1[0][2][1][1], L2[0][2][1][1]
9 5
Which gives the full path.
Since nested lists and filesystem paths have the same recursive structure, it is easy to adapt this to your use case.
Another simple way of solving this would be to report a missing file as just another kind of diff: either return it as a difference between the old file and the (non-existent) new file, or return both the list of file differences and the list of missing files, updating the values recursively as they come back from the recursive calls.
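For instance, a minimal sketch of that second approach (not part of the original code; it assumes the same nested-list layout as in the question, and uses basestring since you are on Python 2.3): each call returns the list of missing paths and the caller prefixes its own directory name on the way back up:

import os

def compare_report(file_tree):
    """Return a list of paths present in the old tree but missing from the new one."""
    old, new = file_tree
    missing = []
    for entry in old[1:]:
        if isinstance(entry, basestring):
            if entry not in new:
                missing.append(entry)
        else:
            # entry is a nested [dirname, ...] list; look for the matching
            # sub-list in the new tree
            counterpart = None
            for other in new[1:]:
                if not isinstance(other, basestring) and other[0] == entry[0]:
                    counterpart = other
                    break
            if counterpart is None:
                missing.append(entry[0])
            else:
                missing.extend(compare_report([entry, counterpart]))
    # prepend this directory's name on the way back up the recursion
    # (leading '/' stripped so the joins nest instead of resetting the path)
    prefix = old[0].lstrip('/')
    return [os.path.join(prefix, path) for path in missing]

With the trees from the question this would return something like ['oldfiles/nest1/nest1a/file_in_nest'], which main() can then print directly.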
I'm reorganising Python code from a Jupyter notebook into OOP style in a text editor.
I'd like the program to ingest data, check the maximum fraction of null values, and, if it is above a threshold, print a customised message and stop the program.
for example:
class ProcessData():
    def read_data(path):
        df = pd.read_csv(path)
        # Deal with NA values
        try:
            max_null = df.isnull().mean().max()
        except max_null > 0.01:
            raise Exception(f"Missing value percentage too high: {max_null}. Review data")
        else:
            df.dropna(inplace=True)
            return df
Is this correct? Or overly complicated?
You just want to use if for this; except expects an exception class that inherits from BaseException.
https://docs.python.org/3/library/exceptions.html
if condition:
    raise Exception(f"failed due to {condition}")
You can also avoid creating a class that only does one thing; simply creating a file with a .py extension and putting your function(s) into it is enough to create a new namespace:
import myfile # namespace from myfile.py
myfile.function_in_myfile()
So recently I was working on a recursive file-I/O homework assignment involving basic os methods, and I have run into a problem where values set in one part of the recursive function are not passed on when I call the function again.
def findLargestFile(path):
    findLargestFileHelper(path)

def findLargestFileHelper(path, size=0, pathToLargest=""):
    if (os.path.isdir(path) == False):
        if os.path.getsize(path) > size:
            size = os.path.getsize(path)
            pathToLargest = print(path)
    else:
        for filename in os.listdir(path):
            findLargestFileHelper(path + "/" + filename, size, pathToLargest)
    return pathToLargest
What I am trying to do is find the largest file in a folder, by recursively looping through all the folders till I find a file, and seeing if that is the largest file and if it is, pass the size in to "size" and pass the path to the file to "pathToLargest".
It seems that size is not passed along in the else branch, and when I just assign path to pathToLargest, that does not work either. (Pretty sure print(path) isn't the way to go about it either.)
If someone could suggest what I should do instead, that would be greatly appreciated.
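Not an exact fix of the code above, but a minimal sketch of the usual approach: make every recursive call return its best (size, path) pair and compare the returned values in the caller, instead of expecting the arguments to be updated in place (the helper name _find_largest is just illustrative):

import os

def findLargestFile(path):
    """Return the path of the largest file under `path`, or "" if there is none."""
    _, largest = _find_largest(path)
    return largest

def _find_largest(path):
    # Base case: a plain file is its own best answer.
    if not os.path.isdir(path):
        return os.path.getsize(path), path
    best_size, best_path = -1, ""
    for name in os.listdir(path):
        size, candidate = _find_largest(os.path.join(path, name))
        if size > best_size:
            best_size, best_path = size, candidate
    return best_size, best_path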
Edit:
Firstly, thank you #martineau and #jonrsharpe for your prompt replies.
I was initially hesitant to write a verbose description, but I now realize that I was sacrificing clarity for brevity (thanks #jonrsharpe for the link).
So here's my attempt to describe what I am up to as succinctly as possible:
I have implemented the Lempel-Ziv-Welch text file compression algorithm in form of a python package. Here's the link to the repository.
Basically, I have a compress class in the lzw.Compress module, which takes as input the file name (and a bunch of other parameters) and generates the compressed file, which is then decompressed by the decompress class within the lzw.Decompress module, regenerating the original file.
Now what I want to do is compress and decompress a bunch of files of various sizes stored in a directory, and save and visualize graphically the time taken for compression/decompression along with the compression ratio and other metrics. For this, I am iterating over the list of file names and passing them as parameters to instantiate the compress class, then starting compression by calling its encode() method, as follows:
import os

os.chdir('/path/to/files/to/be/compressed/')

results = dict()
results['compress_time'] = []
results['other_metrics'] = []

file_path = '/path/to/files/to/be/compressed/'
comp_path = '/path/to/store/compressed/files/'
decomp_path = '/path/to/store/decompressed/file'

files = [_ for _ in os.listdir()]

for f in files:
    from lzw.Compress import compress as comp
    from lzw.Decompress import decompress as decomp

    c = comp(file_path+f, comp_path)  # input file and output path for the compressed file
    c.encode()
    # Then measure time required for compression using time.monotonic()
    del c
    del comp

    d = decomp('/path/to/compressed/file', decomp_path)  # decompressing
    d.decode()
    # Then measure time required for decompression using time.monotonic()

    # append metrics to lists in the results dict for this particular file

    if decompressed_file_size != original_file_size:
        print("error")
        break

    del d
    del decomp
I have run this code independently for each file without the for loop and have achieved compression and decompression successfully. So there are no problems in the files I wish to compress.
What happens is that whenever I run this loop, the first file (the first iteration) runs successfully, and then on the next iteration, after the entire process happens for the 2nd file, "error" is printed and the loop exits. I have tried reordering the list and even reversing it (in case a particular file was the problem), but to no avail.
For the second file/iteration, the decompressed file contents are wrong (they do not match the original file). Typically, the decompressed file size is nearly double that of the original.
I strongly suspect it has something to do with the variables of the class/package somehow retaining their state between iterations of the loop. (To counter this I am deleting both the instance and the class at the end of the loop, as shown in the above snippet, but with no success.)
I have also tried importing the classes outside the loop, but with no success.
P.S.: I am a Python newbie and don't have much expertise, so forgive me for not being "pythonic" in my exposition and for raising a rather naive issue.
Update:
Thanks to #martineau, one of the problems was the importing of global variables from another submodule.
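To illustrate the kind of thing that was going on (this is a made-up example, not code from the lzw package): a module-level variable is created once per interpreter, so re-importing the module or creating new class instances does not reset it:

import sys
import types

# Build a throwaway module with a module-level (global) list, so the example
# is self-contained; a real submodule behaves the same way.
state = types.ModuleType("state")
state.buffer = []
sys.modules["state"] = state

def use():
    import state          # re-importing just returns the cached module object
    state.buffer.append(1)

use()
use()
print(state.buffer)       # [1, 1] -- the module global kept its contents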
But there was another issue which crept in owing to my superficial knowledge of the del statement in Python 3.
I have this trie data structure in my program, which is basically similar to a binary tree.
I had a self_destruct method to delete the tree, as follows:
class trie():
    def __init__(self):
        self.next = {}
        self.value = None
        self.addr = None

    def insert(self, word=str(), addr=int()):
        node = self
        for index, letter in enumerate(word):
            if letter in node.next.keys():
                node = node.next[letter]
            else:
                node.next[letter] = trie()
                node = node.next[letter]
            if index == len(word) - 1:
                node.value = word
                node.addr = addr

    def self_destruct(self):
        node = self
        if node.next == {}:
            return
        for i in node.next.keys():
            node.next[i].self_destruct()
        del node
It turns out that this C-like recursive deletion of objects makes no sense in Python: del simply removes the name's binding from the namespace, while the real work is done by the garbage collector once nothing refers to the object any more.
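A tiny illustration of that point (nothing to do with the trie itself):

class Node(object):
    pass

a = Node()
b = a        # a second name bound to the same object
del a        # removes only the name 'a' from the namespace
print(b)     # the Node object still exists, because 'b' still refers to it
del b        # now the last reference is gone and the garbage collector can reclaim it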
Still, it's kind of weird that Python retains the state/association of variables even when a new object is created (as shown in my loop snippet in the edit).
So two things solved the problem. Firstly, I removed the global variables and made them local to the module where I need them (so there is no need to import them). Also, I deleted the self_destruct method of the trie and simply did del root (where root = trie()) after use.
Thanks #martineau & #jonrsharpe.
This is more of a puzzle than a question per se, because I think I already have the right answer - I just don't know how to test it (if it works under all scenarios).
My program takes input from the user about which modules it will need to load (in the form of an unsorted list or set). Some of those modules depend on other modules. The module-dependency information is stored in a dictionary like this:
all_modules = { 'moduleE':[], 'moduleD':['moduleC'], 'moduleC':['moduleB'], 'moduleB':[], 'moduleA':['moduleD'] }
Where moduleE has no dependencies, moduleD depends on moduleC, etc...
Also, there is no definitive list of all possible modules, since users can generate their own and I create new ones from time to time, so this solution has to be fairly generic (and thus tested to work in all cases).
What I want to do is get a list of modules to run, in order, such that modules that depend on other modules are only run after their dependencies.
So I wrote the following code to try and do this:
def sort_dependencies(modules_to_sort, all_modules, recursions):
    ## Takes a small set of modules the user wants to run (as a list) and
    ## the full dependency tree (as a dict) and returns a list of all the
    ## modules/dependencies needed to be run, in the order to be run in.
    ## Cycles poorly detected as recursions going above 10.
    ## If that happens, this function returns False.
    if recursions == 10:
        return False
    result = []
    for module in modules_to_sort:
        if module not in result:
            result.append(module)
        dependencies = all_modules[module]
        for dependency in dependencies:
            if dependency not in result:
                result.append(dependency)
            else:
                result += [ result.pop(result.index(dependency)) ]
        subdependencies = sort_dependencies(dependencies, all_modules, recursions+1)
        if subdependencies == False:
            return False
        else:
            for subdependency in subdependencies:
                if subdependency not in result:
                    result.append(subdependency)
                else:
                    result += [ result.pop(result.index(subdependency)) ]
    return result
And it works like this:
>>> all_modules = { 'moduleE':[], 'moduleD':['moduleC'], 'moduleC':['moduleB'], 'moduleB':[], 'moduleA':['moduleD'] }
>>> sort_dependencies(['moduleA'],all_modules,0)
['moduleA', 'moduleD', 'moduleC', 'moduleB']
Note that 'moduleE' isn't returned, since the user doesn't need to run that.
The question is: does it work for any given all_modules dictionary and any given required modules_to_load list? Is there a dependency graph I can put in, and a number of user module lists to try, such that if they work I can say all graphs/user lists will work?
After the excellent advice by Marshall Farrier, it looks like what I'm trying to do is a topological sort, so after watching this and this I implemented it as follows:
EDIT: Now with cyclic dependency checking!
def sort_dependencies(all_modules):
    post_order = []
    tree_edges = {}
    for fromNode, toNodes in all_modules.items():
        if fromNode not in tree_edges:
            tree_edges[fromNode] = 'root'
            for toNode in toNodes:
                if toNode not in tree_edges:
                    post_order += get_posts(fromNode, toNode, tree_edges)
            post_order.append(fromNode)
    return post_order

def get_posts(fromNode, toNode, tree_edges):
    post_order = []
    tree_edges[toNode] = fromNode
    for dependancy in all_modules[toNode]:
        if dependancy not in tree_edges:
            post_order += get_posts(toNode, dependancy, tree_edges)
        else:
            parent = tree_edges[toNode]
            while parent != 'root':
                if parent == dependancy: print 'cyclic dependancy found!'; exit()
                parent = tree_edges[parent]
    return post_order + [toNode]

sort_dependencies(all_modules)
However, a topological sort like the one above sorts the whole tree and doesn't return just the modules the user needs to run. Of course, having the topological sort of the tree helps solve this problem, but it's not really the same question as the OP. I think for my data the topological sort is probably best, but for a huge graph like all the packages in apt/yum/pip/npm, it's probably better to use the original algorithm in the OP (which I don't know actually works in all scenarios...) as it only sorts what needs to be used.
So I'm leaving the question up unanswered, because the real question is "How do I test this?"
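For reference, a minimal sketch (not the code above) of a depth-first post-order walk that starts only from the requested modules, so it returns just the needed subset with every dependency listed before the module that needs it, and raises on a cycle:

def resolve(modules_to_run, all_modules):
    order = []            # dependencies end up before the modules that need them
    done = set()
    in_progress = set()   # used for cycle detection

    def visit(module):
        if module in done:
            return
        if module in in_progress:
            raise ValueError("cyclic dependency involving %r" % module)
        in_progress.add(module)
        for dependency in all_modules.get(module, []):
            visit(dependency)
        in_progress.discard(module)
        done.add(module)
        order.append(module)

    for module in modules_to_run:
        visit(module)
    return order

# resolve(['moduleA'], all_modules) -> ['moduleB', 'moduleC', 'moduleD', 'moduleA']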
I have two functions—one that builds the path to a set of files and another that reads the files. Below are the two functions:
def pass_file_name(self):
    self.log_files = []
    file_name = self.path + "\\access_" + self.appliacation + ".log"
    if os.path.isfile(file_name):
        self.log_files.append(file_name)
    for i in xrange(7):
        file_name = self.path + "\\access_" + self.appliacation + ".log" + "." + str(i+1)
        if os.path.isfile(file_name):
            self.log_files.append(file_name)
    return self.log_files

def read_log_files(self, log_file_names):
    self.log_entrys = []
    self.log_line = []
    for i in log_file_names:
        self.f = open(i)
        for line in self.f:
            self.log_line = line.split(" ")
            #print self.log_line
            self.log_entrys.append(self.log_line)
    return self.log_entrys
What would be the best way to unit test these two functions?
You have two units here:
One that generates file paths
A second that reads them
Thus there should be two unit test cases (i.e. classes with tests). The first would test only file path generation. The second would test reading from a predefined set of files you prepared in a special subdirectory of the tests directory; it should be tested in isolation from the first test case.
In your case you could probably have very short log files for the tests. In that case, for better readability and maintenance, it is a good idea to embed them right in the test code. But then you'll have to improve your reading function a bit so it can take either a file name or a file-like object:
from cStringIO import StringIO

# ...

def test_some_log_reading_scenario(self):
    log1 = '\n'.join([
        'log line',
        'another log line',
    ])
    log2 = '\n'.join([
        'another log another line',
        'lala blah blah',
    ])
    # ...
    result = myobj.read_log_files([StringIO(log1), StringIO(log2)])
    # assert result
Personally, I'd build a test harness that set up the required files before testing those two functions.
For each test case (where you expect the file to be present - remember to test failure cases too!), write some known logs into the appropriately named files; then call the functions under test and check the results.
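For instance, a rough sketch of such a harness with unittest and tempfile (LogParser and its constructor are placeholders for however your class is actually built; the access_myapp.log names just mirror the naming scheme in the question):

import os
import shutil
import tempfile
import unittest

class LogFileTests(unittest.TestCase):
    def setUp(self):
        # A fresh directory with known log files for every test.
        self.dir = tempfile.mkdtemp()
        self._write("access_myapp.log", "GET /index 200\n")
        self._write("access_myapp.log.1", "GET /index 500\n")

    def tearDown(self):
        shutil.rmtree(self.dir)

    def _write(self, name, text):
        f = open(os.path.join(self.dir, name), "w")
        f.write(text)
        f.close()

    def test_finds_existing_files(self):
        parser = LogParser(self.dir, "myapp")   # hypothetical constructor
        self.assertEqual(len(parser.pass_file_name()), 2)

    def test_reads_entries(self):
        parser = LogParser(self.dir, "myapp")   # hypothetical constructor
        entries = parser.read_log_files(parser.pass_file_name())
        self.assertEqual(len(entries), 2)
        self.assertEqual(entries[0][0], "GET")

if __name__ == "__main__":
    unittest.main()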
I'm no expert but I'll give it a go. First a bit of refactoring: make them functional (remove all class stuff), remove unneeded things. This should make it much easier to test. You can always make the class call these functions if you really want it in a class.
def pass_file_name(base_filename, exists):
    """Return a list of filenames that exist,
    based upon `base_filename`.
    Use `os.path.isfile` for `exists`."""
    log_files = []
    if exists(base_filename):
        log_files.append(base_filename)
    for i in range(1, 8):
        filename = base_filename + "." + str(i)
        if exists(filename):
            log_files.append(filename)
    return log_files

def read_log_files(log_files):
    """Read and parse each line from log_files.
    Use `pass_file_name` for `log_files`."""
    log_entrys = []
    for filename in log_files:
        with open(filename) as myfile:
            for line in myfile:
                log_entrys.append(line.split())
    return log_entrys
Now we can easily test pass_file_name by passing in a custom function to exists.
class Test_pass_file_name(unittest.TestCase):
    def test_1(self):
        """Assume every file exists;
        make sure all log files are there."""
        exists = lambda _: True
        log_files = pass_file_name("a", exists)
        self.assertEqual(log_files,
                         ["a", "a.1", "a.2", "a.3",
                          "a.4", "a.5", "a.6", "a.7"])

    def test_2(self):
        """Assume no files exist;
        make sure nothing is returned."""
        exists = lambda _: False
        log_files = pass_file_name("a", exists)
        self.assertEqual(log_files, [])

    # ...more tests here ...
As we assume os.path.isfile works, we should have gotten pretty good coverage of the first function. Though you could always have the test actually create some files and then call pass_file_name with exists = os.path.isfile.
The second one is harder to test; I have been told that the best (unit) tests don't touch the network, databases, the GUI or the hard drive. So maybe some more refactoring would make it easier. Mocking open could work, or we could actually write some long files in the test function and read them in.
How do I mock an open used in a with statement (using the Mock framework in Python)?
Bind the open name in the module to a function that mocks the file opening.
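For example, with the standard mock support (unittest.mock on Python 3, or the standalone mock package on Python 2), mock_open builds a mock whose return value works as a context manager; read_config below is just a stand-in for whatever function actually does the with open(...):

from unittest import mock

def read_config(path):
    # Stand-in for the real code under test.
    with open(path) as f:
        return f.read()

def test_read_config():
    m = mock.mock_open(read_data="key=value\n")
    # Patch open as the code under test sees it; use "mymodule.open"
    # instead of "builtins.open" to patch it only inside one module.
    with mock.patch("builtins.open", m):
        assert read_config("settings.ini") == "key=value\n"
    m.assert_called_once_with("settings.ini")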