I've been teaching myself Python at my new job, and really enjoying the language. I've written a short class to do some basic data manipulation, and I'm pretty confident about it.
But old habits from my structured/modular programming days are hard to break, and I know there must be a better way to write this. So, I was wondering if anyone would like to take a look at the following, and suggest some possible improvements, or put me on to a resource that could help me discover those for myself.
A quick note: The RandomItems root class was written by someone else, and I'm still wrapping my head around the itertools library. Also, this isn't the entire module - just the class I'm working on, and its prerequisites.
What do you think?
import itertools
import urllib2
import random
import string

class RandomItems(object):
    """This is the root class for the randomizer subclasses. These
    are used to generate arbitrary content for each of the fields
    in a csv file data row. The purpose is to automatically generate
    content that can be used as functional testing fixture data.
    """
    def __iter__(self):
        while True:
            yield self.next()

    def slice(self, times):
        return itertools.islice(self, times)

class RandomWords(RandomItems):
    """Obtain a list of random real words from the internet, place them
    in an iterable list object, and provide a method for retrieving
    a subset of length 1-n, of random words from the root list.
    """
    def __init__(self):
        urls = [
            "http://dictionary-thesaurus.com/wordlists/Nouns%285,449%29.txt",
            "http://dictionary-thesaurus.com/wordlists/Verbs%284,874%29.txt",
            "http://dictionary-thesaurus.com/wordlists/Adjectives%2850%29.txt",
            "http://dictionary-thesaurus.com/wordlists/Adjectives%28929%29.txt",
            "http://dictionary-thesaurus.com/wordlists/DescriptiveActionWords%2835%29.txt",
            "http://dictionary-thesaurus.com/wordlists/WordsThatDescribe%2886%29.txt",
            "http://dictionary-thesaurus.com/wordlists/DescriptiveWords%2886%29.txt",
            "http://dictionary-thesaurus.com/wordlists/WordsFunToUse%28100%29.txt",
            "http://dictionary-thesaurus.com/wordlists/Materials%2847%29.txt",
            "http://dictionary-thesaurus.com/wordlists/NewsSubjects%28197%29.txt",
            "http://dictionary-thesaurus.com/wordlists/Skills%28341%29.txt",
            "http://dictionary-thesaurus.com/wordlists/TechnicalManualWords%281495%29.txt",
            "http://dictionary-thesaurus.com/wordlists/GRE_WordList%281264%29.txt"
        ]
        self._words = []
        for url in urls:
            urlresp = urllib2.urlopen(urllib2.Request(url))
            self._words.extend([word for word in urlresp.read().split("\r\n")])
        self._words = list(set(self._words)) # Removes duplicates
        self._words.sort() # sorts the list

    def next(self):
        """Return a single random word from the list
        """
        return random.choice(self._words)

    def get(self):
        """Return the entire list, if needed.
        """
        return self._words

    def wordcount(self):
        """Return the total number of words in the list
        """
        return len(self._words)

    def sublist(self,size=3):
        """Return a random segment of _size_ length. The default is 3 words.
        """
        segment = []
        for i in range(size):
            segment.append(self.next())
        #printable = " ".join(segment)
        return segment

    def random_name(self):
        """Return a string-formatted list of 3 random words.
        """
        words = self.sublist()
        return "%s %s %s" % (words[0], words[1], words[2])

def main():
    """Just to see it work...
    """
    wl = RandomWords()
    print wl.wordcount()
    print wl.next()
    print wl.sublist()
    print 'Three Word Name = %s' % wl.random_name()
    #print wl.get()

if __name__ == "__main__":
    main()
Here are my five cents:
Constructor should be called __init__.
You could drop some code by using random.sample; it does what your next() and sublist() do, but it's prepackaged. Note that random.sample picks without replacement, so a sublist built with it will never repeat a word.
Override __iter__ (define the method in your class) and you can get rid of RandomItems. You can read more about it in the docs (note: they cover Py3K, so some of it may not apply to earlier versions). You could use yield for this, which, as you may or may not know, creates a generator and so wastes little to no memory.
random_name could use str.join instead. Note that you may need to convert the values if they are not guaranteed to be strings. This can be done with [str(x) for x in iterable] or the built-in map.
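To make the three suggestions above concrete, here is a rough sketch. It assumes the words are handed in as a plain list (a stand-in for the downloaded word lists) and is written for Python 3, so treat it as an illustration rather than a drop-in replacement for the class in the question.

```python
import itertools
import random


class RandomWords(object):
    """Serve random words from an in-memory list (hypothetical stand-in
    for the word lists the original class downloads)."""

    def __init__(self, words):
        # De-duplicate and sort, as the original constructor does.
        self._words = sorted(set(words))

    def __iter__(self):
        # Generator: yields one random word at a time, indefinitely.
        while True:
            yield random.choice(self._words)

    def sublist(self, size=3):
        # random.sample returns `size` distinct elements (no repeats).
        return random.sample(self._words, size)

    def random_name(self, size=3):
        # str.join works for any number of words, not just three.
        return " ".join(str(word) for word in self.sublist(size))


wl = RandomWords(["spam", "eggs", "ham", "toast", "eggs"])
first_two = list(itertools.islice(wl, 2))  # take two words from the iterator
name = wl.random_name()
```

Because the class defines __iter__ itself, itertools.islice can be applied to it directly and no separate root class is needed for that.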
First knee-jerk reaction: I would offload your hard-coded URLs into a constructor parameter passed to the class and perhaps read from configuration somewhere; this will allow for easier change without necessitating a redeploy.
The drawback of this is that consumers of the class have to know where those URLs are stored... so you could create a companion class whose only job is to know what the URLs are (i.e. in configuration, or even hard-coded) and how to get them. You could allow the consumer of your class to provide the URLs, or if they are not provided, the class could hit up the companion class for the URLs.
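A minimal sketch of that arrangement might look like the following. The WordListSources class, its get_urls method, and the example.com URLs are all made up for illustration; the real companion class could just as well read a config file.

```python
class WordListSources(object):
    """Hypothetical companion class: its only job is to know the URLs."""

    DEFAULT_URLS = (
        "http://example.com/wordlists/nouns.txt",  # placeholder URL
        "http://example.com/wordlists/verbs.txt",  # placeholder URL
    )

    def get_urls(self):
        # Could read from configuration instead of a hard-coded tuple.
        return list(self.DEFAULT_URLS)


class RandomWords(object):
    def __init__(self, urls=None, sources=None):
        # Use the caller's URLs if given; otherwise ask the companion class.
        if urls is None:
            sources = sources or WordListSources()
            urls = sources.get_urls()
        self.urls = list(urls)


default_words = RandomWords()
custom_words = RandomWords(urls=["http://example.com/wordlists/custom.txt"])
```

Consumers that do not care where the lists live call RandomWords() with no arguments, while tests or special cases can pass their own URLs.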
I have code that takes some dirty data as input, parses it, cleans it, munges it, and so on, and then should return a value.
At the moment I have structured it as a class where the __init__ method receives the input and calls the other methods in a given sequence.
My class at the moment looks something like this:
class myProcedure:
    def __init__(self, dirty_data, file_name):
        self.variable1, self.variable2 = self.clean_data(dirty_data)
        self.variable3 = self.get_data_from_file(file_name)
        self.do_something()

    def clean_data(self, dirty_data):
        #clean the data
        return variable1, variable2

    def get_data_from_file(self, file_name):
        #load some data
        return loaded_data

    def do_something(self):
        #the interesting part goes here
        self.result = the_result
Using a class instead of separate functions makes it easier to share data. In my real code I have a few tens of variables that get shared. The alternative would be to put them all in a dict or to have each function take 10-20 inputs. I find both of these solutions a bit cumbersome.
At the moment I must call it as:
useless_class_obj = myProcedure(dirty_data, file_name)
interesting_stuff = useless_class_obj.result
My concerns come from the fact that, once run, useless_class_obj no longer has any purpose and is just a useless piece of junk.
I think it would be more elegant to be able to use the class as:
interesting_stuff = myProcedure(dirty_data, file_name)
however this would require __init__ to return something different than self.
Is there a better way to do this?
Am I doing this in a bad or hard-to-read way?
Well... you could also do...
interesting_stuff = myProcedure(dirty_data, file_name).result
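Another option, if you would rather not expose the class at all, is to hide it behind a small function. The sketch below uses made-up cleaning logic and a made-up offset parameter purely so that it runs; only the overall shape is the point.

```python
class _Procedure(object):
    """Internal helper: holds the shared state while the steps run."""

    def __init__(self, dirty_data, offset):
        self.values = self.clean_data(dirty_data)
        self.offset = offset
        self.result = self.do_something()

    def clean_data(self, dirty_data):
        # Toy cleaning step: drop blank entries, convert the rest to int.
        return [int(item) for item in dirty_data if item.strip()]

    def do_something(self):
        # The interesting part would go here.
        return sum(self.values) + self.offset


def my_procedure(dirty_data, offset=0):
    """Run the whole pipeline and hand back only the result."""
    return _Procedure(dirty_data, offset).result


interesting_stuff = my_procedure([" 1", "2 ", "", "3"], offset=10)
```

The caller never sees the temporary object, and the methods still share state through self.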
Considering the following example of post-processing using inheritance in python (from this website):
import os

class FileCat(object):
    def cat(self, filepath):
        f = file(filepath)
        lines = f.readlines()
        f.close()
        return lines

class FileCatNoEmpty(FileCat):
    def cat(self, filepath):
        lines = super(FileCatNoEmpty, self).cat(filepath)
        nonempty_lines = [l for l in lines if l != '\n']
        return nonempty_lines
Basically, when we are post-processing, we don't really care about the original invocation, we just want to work with the data returned by the function.
So ideally, in my opinion, there should be no need for us to redeclare the original function signature just to be able to forward the call to the original function.
If the FileCat class had 100 different functions (cat1, cat2, cat3, ...) that returned the same type of data and we wanted to use a post-processed NoEmpty version, then we would have to define the same 100 function signatures in FileCatNoEmpty just to forward the calls.
So the question is: Is there a more elegant way of solving this problem?
That is, something like the FileCatNoEmpty class that would automatically make available all methods from FileCat but that still allows us to process the returned value?
Something like
class FileCatNoEmpty(FileCat):
    # Any method with whatever arguments
    def f(self,args):
        lines = super(FileCatNoEmpty, self).f(args)
        nonempty_lines = [l for l in lines if l != '\n']
        return nonempty_lines
Or maybe even another solution that does not use inheritance.
Thanks!
This answer, using a wrapper class that receives the original one in the constructor (instead of inheriting from it), solves the problem:
https://stackoverflow.com/a/4723921/3444175
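For readers who do not want to follow the link, the general idea of such a wrapper can be sketched roughly as below. This is my own illustration, not the code from the linked answer: NoEmptyLines is a made-up name, and the toy FileCat works on strings instead of files so the example is self-contained.

```python
import functools


class FileCat(object):
    # Toy stand-in: splits a string into lines instead of reading a file.
    def cat(self, text):
        return text.splitlines(True)

    def cat_upper(self, text):
        return text.upper().splitlines(True)


class NoEmptyLines(object):
    """Wrap any object and drop blank lines from every method's result."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself.
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def post_processed(*args, **kwargs):
            lines = attr(*args, **kwargs)
            return [line for line in lines if line != "\n"]

        return post_processed


cat = NoEmptyLines(FileCat())
result = cat.cat("a\n\nb\n")
upper_result = cat.cat_upper("a\n\nb\n")
```

Every method of the wrapped object is forwarded automatically, so adding cat3, cat4, and so on to FileCat needs no change in the wrapper.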
I wanted to shorten my code, since I have more functions like this. I was wondering if I could use getattr() to do something like what this guy asked.
Well, here is what I've got:
def getAllMarkersFrom(db, asJSON=False):
    '''Gets all markers from given database. Returns list or Json string'''
    markers = []
    for marker in db.markers.find():
        markers.append(marker)
    if not asJSON:
        return markers
    else:
        return json.dumps(markers, default=json_util.default)

def getAllUsersFrom(db, asJSON=False):
    '''Gets all users from given database. Returns list or Json string'''
    users = []
    for user in db.users.find():
        users.append(user)
    if not asJSON:
        return users
    else:
        return json.dumps(users, default=json_util.default)
I'm using pymongo and Flask helpers for JSON.
What I want is a single getAllFrom(x, db) function that accepts any type of object. I don't know how to do this, but I want to call db.X.find(), where X is passed to the function.
Well, there it is. Hope you can help me. Thank you!
There's hardly any real code in either of those functions. Half of each is a slow recreation of the list() constructor. Once you get rid of that, you're left with a conditional, which can easily be condensed to a single line. So:
def getAllUsersFrom(db, asJSON=False):
    users = list(db.users.find())
    return json.dumps(users, default=json_util.default) if asJSON else users
This seems simple enough to me to not bother refactoring. There are some commonalities between the two functions, but breaking them out wouldn't reduce the number of lines of code any further.
One direction for possible simplification, however, is to not pass in a flag to tell the function what format to return. Let the caller do that. If they want it as a list, there's list(). For JSON, you can provide your own helper function. So, just write your functions to return the desired iterator:
def getAllUsersFrom(db):
    return db.users.find()

def getAllMarkersFrom(db):
    return db.markers.find()
And the helper function to convert the result to JSON:
def to_json(cur):
    return json.dumps(list(cur), default=json_util.default)
So then, putting it all together, you just call:
markers = list(getAllMarkersFrom(mydb))
or:
users = to_json(getAllUsersFrom(mydb))
As you need.
If you really want a generic function for requesting various types of records, that'd be:
def getAllRecordsFrom(db, kind):
    return getattr(db, kind).find()
Then call it:
users = list(getAllRecordsFrom(mydb, "users"))
etc.
I would say that it's better to have separate functions for each task. And then you can have decorators for common functionality between different functions. For example:
@to_json
def getAllUsersFrom(db):
    return list(db.users.find())
enjoy!
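The answer above does not show the decorator itself, so here is one way it could be written. This is only a sketch: it uses a plain dict as a fake database, and default=str is a placeholder where the pymongo version would presumably pass json_util.default, as in the question.

```python
import functools
import json


def to_json(func):
    """Decorator: serialize whatever iterable `func` returns to JSON."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        records = func(*args, **kwargs)
        # default=str is a stand-in for json_util.default from the question.
        return json.dumps(list(records), default=str)
    return wrapper


@to_json
def get_all_users_from(db):
    # `db` is a plain dict here, standing in for a real database handle.
    return db["users"]


fake_db = {"users": [{"name": "ann"}, {"name": "bob"}]}
users_json = get_all_users_from(fake_db)
```

The trade-off is that a decorated function can only ever return JSON; if some callers want the raw list, the separate to_json(cur) helper from the earlier answer is more flexible.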
I made a program that extracts the text from an HTML file. It recurses down the HTML document and returns the list of tags. For example:
input < li >no way < b > you < /b > are doing this < /li >
output ['no','way','you','are'...].
Here is a highly simplified pseudocode for this:
def get_leaves(node):
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)

def calling_fn():
    list_of_leaves=[] #which is now in global scope
    get_leaves(rootnode)
    print list_of_leaves
I am now using list_of_leaves in a global scope from the calling function. The calling_fn() declares this variable, get_leaves() appends to this.
My question is, how do I modify my function so that I am able to do something like list_of_leaves=get_leaves(rootnode), ie without using a global variable?
I don't want each call of the function to duplicate the list, as the list can get quite big.
Please don't criticize the design of this particular pseudocode, as I simplified it. It is meant for another purpose: extracting tokens along with associated tags using BeautifulSoup.
You can pass the result list as optional argument.
def get_leaves(node, list_of_leaves=None):
    list_of_leaves = [] if list_of_leaves is None else list_of_leaves
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i, list_of_leaves)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)

def calling_fn():
    result = []
    get_leaves(rootnode, list_of_leaves=result)
    print result
Python always passes references to objects (the reference itself is passed by value), so a function works on the same object the caller holds rather than on a copy. This has been discussed before here. Some of the built-in types are immutable (e.g. int, str), so you cannot modify them in place (a new string is created when you concatenate two strings and assign the result to a variable). Instances of mutable types (e.g. list) can be modified in place. We take advantage of this by passing the original list to accumulate results across our recursive calls.
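A small sketch of the difference, using made-up function names: mutating a list inside a function is visible to the caller, while rebinding a name (or "changing" an immutable string) is not.

```python
def append_item(items):
    items.append("added")       # mutates the caller's list in place


def rebind(items):
    items = items + ["added"]   # builds a new list; the caller's is untouched
    return items


def append_text(text):
    text += " added"            # str is immutable: this rebinds the local name
    return text


shared = []
append_item(shared)

untouched = []
rebound = rebind(untouched)

original = "text"
changed = append_text(original)
```

After this runs, shared holds the appended item, while untouched and original are exactly as they were before the calls.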
For extracting text from HTML in a real application, using a mature library like BeautifulSoup or lxml.html is always a much better option (as others have suggested).
There is no need to pass an accumulator to the function or to access it through a global name if you turn get_leaves() into a generator:
def get_leaves(node):
    for child in getchildren(node):
        if leafnode(child):
            for each in get_leaves(child):
                yield each
        else:
            yield process_leaf(child)

def calling_fn():
    list_of_leaves = list(get_leaves(rootnode))
    print list_of_leaves
Use a decent HTML parser like BeautifulSoup instead of trying to be smarter than existing software.
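For anyone on Python 3.3 or later, the inner for loop can be replaced with yield from. Below is a runnable sketch of the same generator idea; it assumes the tree is just nested lists (a stand-in for the getchildren/leafnode/process_leaf helpers, which the question does not show).

```python
def get_leaves(node):
    """Yield every leaf of a nested-list tree, depth first."""
    for child in node:
        if isinstance(child, list):   # inner node: recurse into it
            yield from get_leaves(child)
        else:                         # leaf: emit it
            yield child


tree = ["no", "way", ["you"], "are", ["doing", ["this"]]]
list_of_leaves = list(get_leaves(tree))
```

No list is copied along the way; the caller decides whether to collect the results with list() or just loop over them.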
@pillmincher's generator answer is the best, but as another alternative, you can turn your function into a class:
class TagFinder:
    def __init__(self):
        self.list_of_leaves = []

    def get_leaves(self, node):
        kids = getchildren(node)
        for i in kids:
            if leafnode(i):
                self.get_leaves(i)
            else:
                a = process_leaf(i)
                self.list_of_leaves.append(a)

def calling_fn():
    finder = TagFinder()
    finder.get_leaves(rootnode)
    print finder.list_of_leaves
Your code likely involves a number of helper functions anyway, like leafnode, so a class also helps group them all together into one unit.
As a general question about recursion, this is a good one. It is common to have a recursive function that accumulates data into some collection. Either the collection needs to be a global variable (bad) or it is passed to the recursive function. When collections are passed in almost every language, only a reference is passed so you do not have to worry about space. Someone just posted an answer showing how to do this.
Basically, I have a list like: [START, 'foo', 'bar', 'spam', 'eggs', END], and the START/END identifiers are needed so I can compare against them later on. Right now, I have it set up like this:
START = object()
END = object()
This works fine, but it suffers from the problem of not working with pickling. I tried doing it the following way, but it seems like a terrible method of accomplishing this:
class START(object):pass
class END(object):pass
Could anybody share a better means of doing this? Also, the example I have set up above is just an oversimplification of a different problem.
If you want an object that's guaranteed to be unique and can also be guaranteed to get restored to exactly the same identity if pickled and unpickled right back, top-level functions, classes, class instances, and, if you care about is rather than ==, also lists (and other mutables), are all fine. I.e., any of:
# work for == as well as is
class START(object): pass
def START(): pass
class Whatever(object): pass
START = Whatever()
# if you don't care for "accidental" == and only check with `is`
START = []
START = {}
START = set()
None of these is terrible, none has any special advantage (depending if you care about == or just is). Probably def wins by dint of generality, conciseness, and lighter weight.
You can define a Symbol class for handling START and END.
class Symbol:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Symbol) and other.value == self.value

    def __repr__(self):
        return "<sym: %r>" % self.value

    def __str__(self):
        return str(self.value)

START = Symbol("START")
END = Symbol("END")

# test pickle
import pickle
assert START == pickle.loads(pickle.dumps(START))
assert END == pickle.loads(pickle.dumps(END))
Actually, I like your solution.
A while back I was hacking on a Python module, and I wanted to have a special magical value that could not appear anywhere else. I spent some time thinking about it and the best I came up with is the same trick you used: declare a class, and use the class object as the special magical value.
When you are checking for the sentinel, you should of course use the is operator, for object identity:
for x in my_list:
    if x is START:
        # handle start of list
        pass
    elif x is END:
        # handle end of list
        pass
    else:
        # handle item from list
        pass
If your list didn't have strings, I'd just use "start", "end" as Python makes the comparison O(1) due to interning.
If you do need strings, but not tuples, the complete cheapskate method is:
[("START",), 'foo', 'bar', 'spam', 'eggs', ("END",)]
PS: I was sure your list was numbers before, not strings, but I can't see any revisions so I must have imagined it
I think maybe this would be easier to answer if you were more explicit about what you need this for, but my inclination if faced with a problem like this would be something like:
>>> START = os.urandom(16).encode('hex')
>>> END = os.urandom(16).encode('hex')
Pros of this approach, as I'm seeing it
Your markers are strings (can pickle or otherwise easily serialize, eg to JSON or a DB, without any special effort)
Very unlikely to collide either accidentally or on purpose
Will serialize and deserialize to identical values, even across process restarts, which (I think) would not be the case for object() or an empty class.
Cons(?)
Each time they are newly chosen they will be completely different. (This being good or bad depends on details you have not provided, I would think).
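One caveat: .encode('hex') on a byte string only works in Python 2. A rough Python 3 equivalent, with a quick round-trip check through pickle and JSON, might look like this (the sample list is borrowed from the question):

```python
import json
import os
import pickle

# bytes.hex() is available in Python 3.5+; it gives 32 hex characters here.
START = os.urandom(16).hex()
END = os.urandom(16).hex()

data = [START, "foo", "bar", "spam", "eggs", END]

# Plain strings survive both serializers and compare equal afterwards.
via_pickle = pickle.loads(pickle.dumps(data))
via_json = json.loads(json.dumps(data))
```

Since the markers are ordinary strings, they should be compared with == after a round trip, not with is.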