Are there any guidelines on when to stop chaining methods and instead break up the chain into several expressions?
Consider e.g. this Python code, which builds up a dictionary with each word as a key and the corresponding count as the value:
from collections import defaultdict

def build_dict(filename):
    with open(filename, 'r') as f:
        d = defaultdict(int)
        for word in f.read().lower().split():  # too much?
            d[word] += 1
        return d
Is chaining three methods okay? Would I gain any noticeable benefit by splitting the expression up?
What would be the point of chaining only two? If you do method chaining, do it right.
It's more an issue of formatting; if it gets too much for a single line, I prefer
(x.Foo()
  .Bar()
  .FooBar()
  .Barf());
Another issue can be debuggers that force you to trace into Foo if you want to trace into Bar.
This is largely a matter of personal preference, but if the text in f isn't going to be used elsewhere then that's fine. The point at which it becomes unclear to a casual reader what the chain actually returns is the point at which it's too long. The only benefits to splitting it up are that you can use intermediate results and you may gain clarity.
One reason not to use long chains is that it obscures traceback error messages.
When an exception is raised anywhere in the long chain, the traceback error message only tells you the line on which the Exception occurred, not which part of the chain.
If you are confident that no exception will occur, then
for word in f.read().lower().split():
    d[word] += 1
might be preferable to
contents = f.read()
contents = contents.lower()
words = contents.split()
for word in words:
    d[word] += 1
because memory is consumed by the string contents and the list words and is not released until this block of code ends (assuming no other references are made to the same objects). So if memory is tight, you might want to consider chaining.
If memory is not a problem, and particularly if words or contents could be used again later in the code, then assigning a variable to reference them will of course be faster since the read, lower and/or split methods won't have to be called again.
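For example, a minimal sketch of the reuse case, assuming filename is defined as in the question and the word list really is needed twice (the second use is purely illustrative):
from collections import defaultdict

with open(filename) as f:
    words = f.read().lower().split()   # computed once, kept alive

d = defaultdict(int)
for word in words:
    d[word] += 1

longest = max(words, key=len)          # reused without re-reading the file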
Related
I'm relatively new to Python, and I keep seeing examples like:
def max_wordnum(texts):
    count = 0
    for text in texts:
        if len(text.split()) > count:
            count = len(text.split())
    return count
Is the repeated len(text.split()) somehow optimized away by the interpreter/compiler in Python, or will this just take twice the CPU cycles of storing len(text.split()) in a variable?
Duplicate expressions are not "somehow optimized away". Use a local variable to capture and re-use a result that is 'known not to change' and 'takes some not-insignificant time' to create; or where using a variable increases clarity.
In this case, it's impossible for Python to know that 'text.split()' is pure - a pure function is one with no side-effects and always returns the same value for the given input.
Trivially: Python, being a dynamically-typed language, doesn't even know the type of 'text' before it actually gets a value, so generalized optimization of this kind is not possible. (Some classes may provide their own internal cache optimizations, but that's a digression.)
Even a statically typed language like C# won't and can't optimize away general method calls, because, again, there is no enforceable guarantee of purity in C#. (What if the method returned a different value on the second call, or wrote to the console?)
Haskell, a purely functional language, does have the option of not evaluating the call twice, being a different language with different rules.
Even if Python did optimize this (which isn't the case), the duplicated expression is copy/pasted all over and harder to maintain, so creating a variable to hold the result of a complex computation is still a good idea.
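For example, a minimal rewrite of the function above using a local variable:
def max_wordnum(texts):
    count = 0
    for text in texts:
        word_count = len(text.split())  # computed once per text
        if word_count > count:
            count = word_count
    return count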
A better idea yet in this case is to use max with a generator expression:
return max(len(text.split()) for text in texts)
This is also faster.
Also note that len(text.split()) creates a list and you just count the items. A better way would be to count the spaces (if words are separated by only one space) by doing
return max(text.count(" ") for text in texts) + 1
If there can be more than one space, use a regex with finditer to avoid creating lists:
return max(sum(1 for _ in re.finditer(r"\s+", text)) for text in texts) + 1
Note the 1 added at the end to correct the result (the number of separators is one less than the number of words).
As an aside, even if the value isn't cached, you can still use complex expressions in loops with range:
for i in range(len(text.split())):
The range object is created at the start, and the expression is only evaluated once (as opposed to C for loops, for instance).
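A small sketch that makes this visible; the print call is only there to show how often the expression is evaluated:
def word_count(text):
    print("splitting")                 # side effect so we can see each evaluation
    return len(text.split())

text = "one two three"
for i in range(word_count(text)):      # "splitting" is printed exactly once
    print(i)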
I need to extract (a lot of) info from different text files.
I wonder if there is a shorter and more efficient way than the following:
First part: (N lines long)
N1 = re.compile(r'')
N2 = re.compile(r'')
.
Nn = re.compile(r'')
Second part: (2N lines long)
with open(filename) as f:
    for line in f:
        if N1.match(line):
            var1 = N1.match(line).group(x).strip()
        elif N2.match(line):
            var2 = N2.match(line).group(x).strip()
        elif Nn.match(line):
            varn = Nn.match(line).group(x).strip()
Do you recommend keeping the re.compile variables (part 1) separate from part 2? What do you use in these cases? Perhaps a function that takes the regex as an argument and is called every time?
In my case N is 30, meaning I have 90 lines just to feed a dictionary, with very little or no logic at all.
I’m going to attempt to answer this without really knowing what you are actually doing there. So this answer might help you, or it might not.
First of all, what re.compile does is pre-compile a regular expression, so you can use it later and do not have to compile it every time you use it. This is primarily useful when you have a regular expression that is used multiple times throughout your program. But if the expression is only used a few times, then there is not really that much of a benefit to compiling it up front.
So you should ask yourself, how often the code runs that attempts to match all those expressions. Is it just once during the script execution? Then you can make your code simpler by inlining the expressions. Since you’re running the matches for each line in a file, pre-compiling likely makes sense here.
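A minimal sketch of the difference; the pattern here is just an illustration, not one of your expressions:
import re

NUMBER = re.compile(r'\d+')            # compiled once, when the module is loaded

def first_number(line):
    m = NUMBER.match(line)             # reuses the pre-compiled pattern
    return m.group(0) if m else None

def first_number_inline(line):
    m = re.match(r'\d+', line)         # re-states the pattern on every call; re caches it internally
    return m.group(0) if m else None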
But just because you pre-compiled the expression, that does not mean that you should be sloppy and match the same expression too often. Look at this code:
if N1.match(line):
    var1 = N1.match(line).group(x).strip()
Assuming there is a match, this will run N1.match() twice. That’s an overhead you should avoid since matching expressions can be relatively expensive (depending on the expression), even if the expression is already pre-compiled.
Instead, just match it once, and then reuse the result:
n1_match = N1.match(line)
if n1_match:
    var1 = n1_match.group(x).strip()
Looking at your code, your regular expressions also appear to be mutually exclusive, or at least you only ever use the first match and skip the remaining ones. In that case, you should make sure that you order your checks so that the most common checks are done first. That way, you avoid running too many expressions that won't match anyway. Also, try to order them so that the more complex expressions are run less often.
Finally, you are collecting the match result in separate variables varN. At this point, I’m questioning what exactly you are doing there, since after all your if checks, you do not have a clear way of figuring out what the result was and which variable to use. At this point, it might make more sense to just collect it in a single variable, or to move specific logic within the condition bodies. But it’s difficult to tell with the amount of information you gave.
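For example, one way to collect everything in a single structure instead of numbered variables (a sketch; the pattern names, the group index, and filename are placeholders, not your actual expressions):
import re

patterns = {
    'name': re.compile(r'Name:\s*(.+)'),   # placeholder patterns
    'date': re.compile(r'Date:\s*(.+)'),
    # ... one compiled expression per field
}

results = {}
with open(filename) as f:
    for line in f:
        for key, regex in patterns.items():
            m = regex.match(line)
            if m:
                results[key] = m.group(1).strip()
                break                      # first matching pattern wins for this line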
As mentioned in the re module documentation, the regexes you pass to the re functions are cached: depending on the number of expressions you have, caching them yourself might not be useful.
That being said, you should make a list of your regexes, so that a simple for loop would allow you to test all your patterns.
regexes = [re.compile(p) for p in ['', '', '', '', ...]]
vars = [''] * len(regexes)
with open(filename) as f:
    for line in f:
        for i, regex in enumerate(regexes):
            match = regex.match(line)
            if match:
                vars[i] = match.group(x).strip()
                break  # break here if you only want the first match for any given line
I'm doing this week's 'easy' Daily Programmer Challenge on Reddit. The description is at the link, but essentially the challenge is to read a text file from a url and do a word count. Needless to say the resulting output is a fairly large dictionary object. I have a few questions, mostly regarding accessing or sorting keys according to their value.
First, I developed the code according to what I currently understand about OOP and good Python style. I wanted it to be as robust as possible but I also wanted to use the least amount of imported modules. My goal is to become a good programmer, thus I believe it's important to develop a strong foundation and figure out how to do things myself whenever possible. That being said, the code:
from urllib2 import urlopen

class Word(object):

    def __init__(self):
        self.word_count = {}

    def alpha_only(self, word):
        """Converts word to lowercase and removes any non-alphabetic characters."""
        x = ''
        for letter in word:
            s = letter.lower()
            if s in 'abcdefghijklmnopqrstuvwxyz':
                x += s
        if len(x) > 0:
            return x

    def count(self, line):
        """Takes a line from the file and builds a list of lowercased words containing only alphabetic chars.
        Adds each word to word_count if not already present, if present increases the count by 1."""
        words = [self.alpha_only(x) for x in line.split(' ') if self.alpha_only(x) != None]
        for word in words:
            if word in self.word_count:
                self.word_count[word] += 1
            elif word != None:
                self.word_count[word] = 1


class File(object):

    def __init__(self, book):
        self.book = urlopen(book)
        self.word = Word()

    def strip_line(self, line):
        """Strips newlines, tabs, and return characters from beginning and end of line. If remaining string > 1,
        splits up the line and passes it along to the count method of the word object."""
        s = line.strip('\n\r\t')
        if s > 1:
            self.word.count(s)

    def process_book(self):
        """Main processing loop, will not begin processing until the first line after the line containing "START".
        After processing it will close the file."""
        begin = False
        for line in self.book:
            if begin == True:
                self.strip_line(line)
            elif 'START' in line:
                begin = True
        self.book.close()


book = File('http://www.gutenberg.org/cache/epub/47498/pg47498.txt')
book.process_book()
count = book.word.word_count
So now I have a fairly accurate and robust word count that probably doesn't have any duplicates or blank entries, but it is nevertheless a dict object containing over 3k key/value pairs. I can't iterate over it using for k,v in count without getting the exception ValueError: too many values to unpack, which rules out using a list comprehension or mapping to a function to perform any kind of sorting.
I was reading this HowTo on Sorting and playing with it a few minutes ago and noticed that for x in count.items() lets me iterate through a list of key/value pairs without throwing a ValueError exception, so I removed the line count = book.word.word_count and added the following:
s_count = sorted(book.word.word_count.items(), key=lambda count: count[1], reverse=True)
# Delete the original dict, it is no longer needed
del book.word.word_count
Now I finally have a sorted list of words, s_count. PHEW! So, my questions are:
Is a dict even the best data type to perform the original counting? Would a list of tuples like that returned by count.items() have been preferable? But that would probably slow it down, right?
This seems kind of 'clunky', as I'm building a dict, converting it to a list containing tuples, then sorting the list and returning a new list. However, it is my understanding that dictionaries allow me to perform the fastest lookups, so am I missing something here?
I read briefly about hashing. While I think I understand that the point is that hashing will save space in memory and allow me to perform faster look-ups and comparisons, wouldn't the trade-off be that the program becomes more computationally expensive (higher CPU load), because it would then be calculating hashes for each word? Is hashing relevant here?
Any feedback on naming conventions (which I am terrible at), or any other suggestions about basically anything (including style), would be greatly appreciated.
Are you sure that for k,v in count: gives the exception ValueError: too many values to unpack? I expect it to give ValueError: need more than 1 value to unpack.
When you use a dict as an iterator (eg in a for loop) you just get the keys, you don't get the values. If you want key, value pairs you need to use the dict's iteritems() method as mentioned by figs in the comment (or in Python 3 the items() method).
Of course, you can always do something like:
for k in count:
    print k, count[k]
...
I think that most of your questions are more suited to Code Review than to Stack Overflow. But since you've asked so nicely here, I'll mention a few points. :)
It's rather inefficient to build up a string char by char, so your alpha_only() method would be better if it collected chars in a list then used the str.join() method to join them into a single string. The usual Python idiom would do that using a list comprehension.
The list comprehension in your count() method calls alpha_only() twice for each word, which is inefficient.
You could make your strip() call simpler by using the default argument, as that strips all white space (and you don't need to preserve space chars in this application). Similarly, using split() with its default arg will split on any runs of blank space, which is probably better in this application, since giving an arg of a single space means that you'll get some empty strings in the list returned by split if there are any runs of multiple spaces within a line.
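For example, with a line containing runs of spaces:
line = "  two   words \n"
line.strip('\n\r\t')   # '  two   words ' -- leading/trailing spaces remain
line.strip()           # 'two   words'
line.split(' ')        # ['', '', 'two', '', '', 'words', '\n'] -- empty strings and stray whitespace
line.split()           # ['two', 'words']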
...
You mention hashing in your question, and whether it's useful for this application. Yes, it is. Python dictionaries actually use hashing of their keys, so you don't need to worry about the details. And yes, a dictionary is a good data structure to use for this task. There are fancier forms of dictionary that make things a bit simpler, but to use them does require importing a (standard) module. But using a dictionary (of some flavour or another) to hold data and then generating a list of tuples from it for final sorting is a fairly common practice in Python. And there's no need to specifically delete the dictionary when you've finished with it if the program's about to terminate anyway.
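For example, one of those "fancier" dictionaries is collections.Counter, which handles both the counting and the sort-by-value step; this is only a sketch, not a drop-in replacement for your classes, and 'lines' stands in for whatever iterable of text lines you are reading:
from collections import Counter

word_count = Counter()
for line in lines:
    word_count.update(line.lower().split())

top_ten = word_count.most_common(10)   # list of (word, count) pairs, most frequent first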
...
As for the duplicated call of alpha_only(), whenever you find yourself doing that sort of thing it's a sign that a list comprehension isn't really suitable for the task and that you should just use a normal for loop so that you can save the result of the function call rather than having to recalculate it. Eg,
words = []
for word in line.split():
    word = self.alpha_only(word)
    if word is not None:
        words.append(word)
Suppose I have a function and I want to analyze its run-time by counting the exact number of steps:
def function(L):
    print ("Hello")
    i = 0                 # 1 step
    while i < len(L):     # 3 steps
        print (L[i] + 1)
        i += 2            # 2 steps
    print ("Goodbye")
I was wondering if print statements count as a step?
"Step" isn't a well-defined term in this context -- as pswaminathan points out, usually what we care about isn't a precise numerical value but the mathematical behavior in a broader sense: as the problem size gets bigger (in this case, if you increase the size of the input L) does the execution time stay the same? Grow linearly? Quadratically? Exponentially?
Explicitly counting steps can be a part of that analysis -- it can be a good way to get an intuitive handle on an algorithm. But we don't have a consistent way of determining what counts as a "step" and what doesn't. You could be looking at lines of code -- but in Python, for example, a list comprehension can express a long, complicated loop in a single line. Or you could be counting CPU instructions, but in a higher-level language full of abstractions that's hopelessly difficult. So you pick a heuristic -- in straightforward code like your example, "every execution of a single line of code is a step" is a decent rule. And in that case, you'd certainly count the print statements. If you wanted to get in a little deeper, you could look at the bytecode as tobias_k suggests, to get a sense for what the instructions look like behind Python's syntax.
But there's no single agreed-upon rule. You mention it's for homework; in that case, only your instructor knows what definition they want you to use. That said, the simple answer to your question is most likely "yes, the print statements count."
If your task is to count the exact number of steps, then yes, print would count as a step. But note also that your second print is at least three steps long: list access, addition, and print.
In fact, print (and other 'atomic' statements) might actually be worth many "steps", depending on how you interpret step, e.g., CPU cycles, etc. This may be overkill for your assignment, but to be accurate, it might be worth having a look at the generated byte code. Try this:
import dis
print dis.dis(function)
This will give you the full list of more-or-less atomic steps in your function, e.g., loading a function, passing arguments to that function, popping elements from the stack, etc. According to this, even your first print is worth three steps (in Python 2.6):
  2           0 LOAD_CONST               1 ('Hello')
              3 PRINT_ITEM
              4 PRINT_NEWLINE
How to interpret this: the first number (2) is the line number (i.e., all the following instructions are for that line alone); the numbers in the next column (0, 3, 4) are byte offsets within the compiled code (jump instructions in, e.g., if statements and loops refer to these offsets); then come the actual instructions (LOAD_CONST); the right column holds the arguments for those instructions ('Hello'). For more details, see this answer.
I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then basically doing a large number (100-200) of substitutions that fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
the str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure if this would be better? I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)  # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by working on larger strings (call re.sub once on the whole text instead of once for each line). That increases memory use, which shouldn't be a problem unless the file is huge, and it improves execution time.
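For instance, a sketch of doing the substitutions on the whole file contents at once rather than per line; filename is a placeholder and the patterns are taken from the question:
import re

with open(filename) as f:
    text = f.read()                          # one big string instead of a per-line loop

text = text.replace("<\\n>", "")             # literal replacement over the whole text
text = re.sub(r"\*t\([^)]*\)", "", text)     # regex substitution, called once per pattern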
If you don't actually need the regex and are just doing literal replacing, str.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution though would probably be to use cStringIO
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
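A sketch of that idea using re.split with a capturing group, so the tags end up as their own tokens (the substitution pattern is the one from the question):
import re

parts = re.split(r'(<[^>]+>)', line)         # capturing group keeps the tags as separate list items
for i, part in enumerate(parts):
    if part.startswith('<'):                 # only tag tokens get the expensive substitutions
        parts[i] = re.sub(r'\*t\([^)]*\)', '', part)
line = ''.join(parts)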
You can pass a function object to re.sub instead of a substitution string; it takes the match object and returns the substitution. For example:
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes the substitution may be significant.
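For the literal, non-overlapping replacements in the question, a common sketch is to build one alternation from a dict and let the function object look up each match:
import re

replacements = {
    "-":     "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
    "\xe1":  "•",
}
# If any key were a prefix of another, the longer one would need to come first in the alternation.
pattern = re.compile("|".join(re.escape(key) for key in replacements))
line = pattern.sub(lambda m: replacements[m.group(0)], line)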