I'm trying to use the timeit module in Python (EDIT: We are using Python 3) to decide between a couple of different code flows. In our code, we have a series of if-statements that test for the existence of a character code in a string, and if it's there replace it like this:
if "<substring>" in str_var:
    str_var = str_var.replace("<substring>", "<new_substring>")
We do this a number of times for different substrings. We're debating between that and using just the replace like this:
str_var = str_var.replace("<substring>", "<new_substring>")
We tried to use timeit to determine which one was faster. If the first code-block above is "stmt1" and the second is "stmt2", and our setup string looks like
str_var = '<string><substring><more_string>',
our timeit statements will look like this:
timeit.timeit(stmt=stmt1, setup=setup)
and
timeit.timeit(stmt=stmt2, setup=setup)
Now, running it just like that, on 2 of our laptops (same hardware, similar processing load) stmt1 (the statement with the if-statement) runs faster even after multiple runs (3-4 hundredths of a second vs. about a quarter of a second for stmt2).
However, if we define functions to do both things (including the setup creating the variable) like so:
def foo():
    str_var = '<string><substring><more_string>'
    if "<substring>" in str_var:
        str_var = str_var.replace("<substring>", "<new_substring>")
and
def foo2():
    str_var = '<string><substring><more_string>'
    str_var = str_var.replace("<substring>", "<new_substring>")
and run timeit like:
timeit.timeit("foo()", setup="from __main__ import foo")
timeit.timeit("foo2()", setup="from __main__ import foo2")
the statement without the if-statement (foo2) runs faster, contradicting the non-functioned results.
Are we missing something about how Timeit works? Or how Python handles a case like this?
edit here is our actual code:
>>> def foo():
...     s = "hi 1 2 3"
...     s = s.replace('1','5')
...
>>> def foo2():
...     s = "hi 1 2 3"
...     if '1' in s:
...         s = s.replace('1','5')
...
>>> timeit.timeit(foo, "from __main__ import foo")
0.4094226634183542
>>> timeit.timeit(foo2, "from __main__ import foo2")
0.4815539780738618
vs this code:
>>> timeit.timeit("""s = s.replace("1","5")""", setup="s = 'hi 1 2 3'")
0.18738432400277816
>>> timeit.timeit("""if '1' in s: s = s.replace('1','5')""", setup="s = 'hi 1 2 3'")
0.02985000199987553
I think I've got it.
Look at this code:
timeit.timeit("""if '1' in s: s = s.replace('1','5')""", setup="s = 'hi 1 2 3'")
In this code, setup is run exactly once. That means that s effectively behaves like a "global" shared by every iteration of the timing loop. As a result, it gets modified to hi 5 2 3 in the first iteration, and the in test returns False for all successive iterations.
See this code:
timeit.timeit("""if '1' in s: s = s.replace('1','5'); print(s)""", setup="s = 'hi 1 2 3'")
This will print out hi 5 2 3 a single time because the print is part of the if statement. Contrast this, which will fill up your screen with a ton of hi 5 2 3s:
timeit.timeit("""s = s.replace("1","5"); print(s)""", setup="s = 'hi 1 2 3'")
So the problem here is that the non-function test with the if is flawed and is giving you false timings, unless repeated calls on an already processed string are what you were trying to test. (If that is what you were trying to test, your function versions are flawed.) The reason the function with the if doesn't fare better is that it runs the replace on a fresh copy of the string on each iteration.
The following test does what I believe you intended since it doesn't re-assign the result of the replace back to s, leaving it unmodified for each iteration:
>>> timeit.timeit("""if '1' in s: s.replace('1','5')""", setup="s = 'hi 1 2 3'")
0.3221409016812231
>>> timeit.timeit("""s.replace('1','5')""", setup="s = 'hi 1 2 3'")
0.28558505721252914
This change adds a lot of time to the if test and adds a little bit of time to the non-if test for me, but I'm using Python 2.7. If the Python 3 results are consistent, though, these results suggest that in saves a lot of time when the strings rarely need any replacing. If they usually do require replacement, it appears in costs a little bit of time.
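A small sketch can make the persistence of s visible. It assumes Python 3.5+ (for the globals argument of timeit.timeit). The hits and fair_hits lists are hypothetical counters added only for illustration, and they add overhead of their own, so the timings returned by this particular run should be ignored.

```python
import timeit

hits = []       # hypothetical counter: one entry per replace in the rebinding version
fair_hits = []  # same, for the version that leaves s untouched

# Rebinding s: after the first iteration s is 'hi 5 2 3', so the branch is skipped.
rebinding_stmt = """
if '1' in s:
    s = s.replace('1', '5')
    hits.append(1)
"""

# Not rebinding s: every iteration sees the original string.
fair_stmt = """
if '1' in s:
    t = s.replace('1', '5')
    fair_hits.append(1)
"""

counters = {"hits": hits, "fair_hits": fair_hits}
timeit.timeit(rebinding_stmt, setup="s = 'hi 1 2 3'", number=1000, globals=counters)
timeit.timeit(fair_stmt, setup="s = 'hi 1 2 3'", number=1000, globals=counters)

print(len(hits), len(fair_hits))  # 1 1000
```

Out of 1000 iterations, the rebinding statement performs the replace once, while the non-rebinding one performs it every time.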
Made even weirder by looking at the disassembled code. The second block has the if version (which clocks in faster for me using timeit just as in the OP's example).
Yet, by looking at the op codes, it purely appears to have 7 extra op codes, starting with the first BUILD_MAP and also involving one extra POP_JUMP_IF_TRUE (presumably for the if statement check itself). Before and after that, all codes are the same.
This would suggest that building and performing the check in the if statement somehow reduces the computation time for then checking within the call to replace. How can we see specific timing information for the different op codes?
In [55]: dis.disassemble_string("s='HI 1 2 3'; s = s.replace('1','4')")
0 POP_JUMP_IF_TRUE 10045
3 PRINT_NEWLINE
4 PRINT_ITEM_TO
5 SLICE+2
6 <49>
7 SLICE+2
8 DELETE_SLICE+0
9 SLICE+2
10 DELETE_SLICE+1
11 <39>
12 INPLACE_MODULO
13 SLICE+2
14 POP_JUMP_IF_TRUE 15648
17 SLICE+2
18 POP_JUMP_IF_TRUE 29230
21 LOAD_NAME 27760 (27760)
24 STORE_GLOBAL 25955 (25955)
27 STORE_SLICE+0
28 <39>
29 <49>
30 <39>
31 <44>
32 <39>
33 DELETE_SLICE+2
34 <39>
35 STORE_SLICE+1
In [56]: dis.disassemble_string("s='HI 1 2 3'; if '1' in s: s = s.replace('1','4')")
0 POP_JUMP_IF_TRUE 10045
3 PRINT_NEWLINE
4 PRINT_ITEM_TO
5 SLICE+2
6 <49>
7 SLICE+2
8 DELETE_SLICE+0
9 SLICE+2
10 DELETE_SLICE+1
11 <39>
12 INPLACE_MODULO
13 SLICE+2
14 BUILD_MAP 8294
17 <39>
18 <49>
19 <39>
20 SLICE+2
21 BUILD_MAP 8302
24 POP_JUMP_IF_TRUE 8250
27 POP_JUMP_IF_TRUE 15648
30 SLICE+2
31 POP_JUMP_IF_TRUE 29230
34 LOAD_NAME 27760 (27760)
37 STORE_GLOBAL 25955 (25955)
40 STORE_SLICE+0
41 <39>
42 <49>
43 <39>
44 <44>
45 <39>
46 DELETE_SLICE+2
47 <39>
48 STORE_SLICE+1
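One caveat about the listings above: dis.disassemble_string expects raw bytecode, so handing it source text decodes the characters of the source as if they were opcodes, and the result says nothing about what the interpreter executes. A hedged sketch for Python 3.4+ (the version the OP uses) compiles the source first. The helper name opnames is hypothetical, and the exact opcodes vary between CPython versions.

```python
import dis

def opnames(source):
    """Compile source text, then list the opcode names of the resulting bytecode."""
    code = compile(source, "<stmt>", "exec")
    return [instr.opname for instr in dis.get_instructions(code)]

plain = opnames("s = 'HI 1 2 3'\ns = s.replace('1', '4')")
guarded = opnames("s = 'HI 1 2 3'\nif '1' in s:\n    s = s.replace('1', '4')")

# The guarded version is longer: it adds the membership test and a conditional jump.
print(len(plain), len(guarded))
print(guarded)
```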
Which one takes more time to compile in python? This one?
if age > 30:
    if height > 5:
        print('perfect')
or this one?
if age > 30 and height > 5:
    print('perfect')
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> def x():
...     if age > 30 and height > 5:
...         print('perfect')
...
>>> def y():
...     if age > 30:
...         if height > 5:
...             print('perfect')
...
>>> import dis
>>> dis.dis(x)
2 0 LOAD_GLOBAL 0 (age)
2 LOAD_CONST 1 (30)
4 COMPARE_OP 4 (>)
6 POP_JUMP_IF_FALSE 24
8 LOAD_GLOBAL 1 (height)
10 LOAD_CONST 2 (5)
12 COMPARE_OP 4 (>)
14 POP_JUMP_IF_FALSE 24
3 16 LOAD_GLOBAL 2 (print)
18 LOAD_CONST 3 ('perfect')
20 CALL_FUNCTION 1
22 POP_TOP
>> 24 LOAD_CONST 0 (None)
26 RETURN_VALUE
>>> dis.dis(y)
2 0 LOAD_GLOBAL 0 (age)
2 LOAD_CONST 1 (30)
4 COMPARE_OP 4 (>)
6 POP_JUMP_IF_FALSE 24
3 8 LOAD_GLOBAL 1 (height)
10 LOAD_CONST 2 (5)
12 COMPARE_OP 4 (>)
14 POP_JUMP_IF_FALSE 24
4 16 LOAD_GLOBAL 2 (print)
18 LOAD_CONST 3 ('perfect')
20 CALL_FUNCTION 1
22 POP_TOP
>> 24 LOAD_CONST 0 (None)
26 RETURN_VALUE
>>>
In my test, they produced identical compiled bytecode.
Boolean conditions are evaluated using short-circuit logic. Any performance difference between the two would be negligible, if any.
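To see the short-circuit behaviour directly, here is a minimal sketch. The check helper and the calls list are hypothetical names used only to record which conditions get evaluated.

```python
calls = []

def check(name, result):
    """Hypothetical helper: record that a condition was evaluated, then return it."""
    calls.append(name)
    return result

# Combined condition: the second operand is skipped once the first is false.
if check("age", False) and check("height", True):
    print('perfect')
combined_calls = list(calls)

calls.clear()

# Nested form: the inner test is never reached either.
if check("age", False):
    if check("height", True):
        print('perfect')
nested_calls = list(calls)

print(combined_calls, nested_calls)  # ['age'] ['age']
```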
The two conditions are equivalent, including in terms of time complexity. I'm linking another post where you can see that both cases are equivalent, since they compile to the same instructions.
nested if vs. and condition
This is what basically happens when you compare stuff with and:
def _and(*args):
    for arg in args:
        if not arg: return arg
    return arg
As you can see, it does the same as the nested ifs. Therefore, there wouldn't be much difference between the two.
I'm not an expert for python in any way, but based on my knowledge of compilers for C and C++, here is my answer.
When you write a logical condition, the compiler will try to load the code it thinks is the most likely to come up next, in order to run faster.
so if you write
if (age_of_bob > 200) {
    foo();
} else {
    bar();
}
since the guy named bob has very little chance of being over 200 years old, the compiler will try to preload the bar() code instead of foo(). (the example is trash but you get it)
Of course this doesn't always work, but compilers are smart and very often, they load the correct code in advance. (read about instruction pipelining and branchless programming)
so in your example, which is python and not C, the interpreter would have to make this kind of guess twice in the first example, and once in the second. Of course this would matter more if you had else clauses.
Now, this is only a guess as I'm not quite sure that the python interpreter does things like the gcc compiler, but if there is a difference, it could come from here. The only way to make sure, is to do a benchmark. Beware of testing only this section of the code and not the whole code in case you have other variables that may change the results. Run it 100000 times and check if there really is a difference.
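A benchmark along those lines could be sketched with timeit as follows. The sample values of age and height are hypothetical, and the functions return the string instead of printing it so that console output does not dominate the measurement.

```python
import timeit

age, height = 35, 6  # hypothetical sample values

def combined():
    if age > 30 and height > 5:
        return 'perfect'

def nested():
    if age > 30:
        if height > 5:
            return 'perfect'

# repeat() runs the whole measurement several times; the minimum is the least noisy figure.
combined_best = min(timeit.repeat(combined, number=100000, repeat=5))
nested_best = min(timeit.repeat(nested, number=100000, repeat=5))
print(combined_best, nested_best)
```

Any gap between the two numbers should be compared with the run-to-run noise before reading anything into it.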
I found out something weird.
I defined two test functions as such:
def with_brackets(n=10000):
    d = dict()
    for i in range(n):
        d["hello"] = i

def with_setitem(n=10000):
    d = dict()
    st = d.__setitem__
    for i in range(n):
        st("hello", i)
One would expect the two functions to be roughly the same execution speed. However:
>>> timeit(with_brackets, number=1000)
0.6558860000222921
>>> timeit(with_setitem, number=1000)
0.9857697170227766
There is possibly something I missed, but it does seem like __setitem__ takes about 50% longer, and I don't really understand why. Isn't dict[key] = x supposed to call __setitem__?
(Using CPython 3.9)
Edit: Using timeit instead of time
Isn't dict[key] = x supposed to call __setitem__?
Strictly speaking, no. Running both your functions through dis.dis, we get (I am only including the for loop):
>>> dis.dis(with_brackets)
...
>> 22 FOR_ITER 12 (to 36)
24 STORE_FAST 3 (i)
5 26 LOAD_FAST 0 (n)
28 LOAD_FAST 1 (d)
30 LOAD_CONST 1 ('hello')
32 STORE_SUBSCR
34 JUMP_ABSOLUTE 22
...
Vs
>>> dis.dis(with_setitem)
...
>> 28 FOR_ITER 14 (to 44)
30 STORE_FAST 4 (i)
6 32 LOAD_FAST 2 (setitem)
34 LOAD_CONST 1 ('hello')
36 LOAD_FAST 0 (n)
38 CALL_FUNCTION 2
40 POP_TOP
42 JUMP_ABSOLUTE 28
...
The usage of __setitem__ involves a function call (see the usage of CALL_FUNCTION and POP_TOP instead of just STORE_SUBSCR - that's the difference underneath the hood), and function calls do add some amount of overhead, so using the bracket accessor leads to more optimised opcode.
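The same check can be done programmatically rather than by reading the listings. This is a sketch only: opnames is a hypothetical helper, the return d lines are added so the two functions can be compared, and the names of the call opcodes differ between CPython versions, so only STORE_SUBSCR is tested here.

```python
import dis

def with_brackets(n=10000):
    d = dict()
    for i in range(n):
        d["hello"] = i
    return d  # added for this sketch

def with_setitem(n=10000):
    d = dict()
    st = d.__setitem__
    for i in range(n):
        st("hello", i)
    return d  # added for this sketch

def opnames(func):
    """Hypothetical helper: the set of opcode names used by func."""
    return {instr.opname for instr in dis.get_instructions(func)}

print("STORE_SUBSCR" in opnames(with_brackets))  # the subscript store is a single opcode
print("STORE_SUBSCR" in opnames(with_setitem))   # here the bound method is called instead
```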
Fowler's Extract Variable refactoring method, formerly Introduce Explaining Variable, says to use a temporary variable to make code clearer for humans. The idea is to elucidate complex code by introducing an otherwise unneeded local variable, and naming that variable for exposition purposes. It also advocates this kind of explaining over comments. Other languages optimize away temporary variables so there's no cost in time or space resources. Why doesn't Python do this?
In [3]: def multiple_of_six_fat(n):
   ...:     multiple_of_two = n%2 == 0
   ...:     multiple_of_three = n%3 == 0
   ...:     return multiple_of_two and multiple_of_three
   ...:
In [4]: def multiple_of_six_lean(n):
   ...:     return n%2 == 0 and n%3 == 0
   ...:
In [5]: import dis
In [6]: dis.dis(multiple_of_six_fat)
2 0 LOAD_FAST 0 (n)
3 LOAD_CONST 1 (2)
6 BINARY_MODULO
7 LOAD_CONST 2 (0)
10 COMPARE_OP 2 (==)
13 STORE_FAST 1 (multiple_of_two)
3 16 LOAD_FAST 0 (n)
19 LOAD_CONST 3 (3)
22 BINARY_MODULO
23 LOAD_CONST 2 (0)
26 COMPARE_OP 2 (==)
29 STORE_FAST 2 (multiple_of_three)
4 32 LOAD_FAST 1 (multiple_of_two)
35 JUMP_IF_FALSE_OR_POP 41
38 LOAD_FAST 2 (multiple_of_three)
>> 41 RETURN_VALUE
In [7]: dis.dis(multiple_of_six_lean)
2 0 LOAD_FAST 0 (n)
3 LOAD_CONST 1 (2)
6 BINARY_MODULO
7 LOAD_CONST 2 (0)
10 COMPARE_OP 2 (==)
13 JUMP_IF_FALSE_OR_POP 29
16 LOAD_FAST 0 (n)
19 LOAD_CONST 3 (3)
22 BINARY_MODULO
23 LOAD_CONST 2 (0)
26 COMPARE_OP 2 (==)
>> 29 RETURN_VALUE
Because Python is a highly dynamic language, and references can influence behaviour.
Compare the following, for example:
>>> id(object()) == id(object())
True
>>> ob1 = object()
>>> ob2 = object()
>>> id(ob1) == id(ob2)
False
Had Python 'optimised' the ob1 and ob2 variables away, behaviour would have changed.
Python object lifetime is governed by reference counts. Add weak references into the mix plus threading, and you'll see that optimising away variables (even local ones) can lead to undesirable behaviour changes.
Besides, in Python, removing those variables would hardly have changed anything from a performance perspective. The local namespace is already highly optimised (values are looked up by index in an array); if you are worried about the speed of dereferencing local variables, you are using the wrong programming language for that time critical section of your project.
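The lifetime point can be illustrated with a weak reference. This is a minimal sketch that relies on CPython's reference counting (other implementations may free objects later); Resource, with_local and without_local are hypothetical names.

```python
import weakref

class Resource:
    """Hypothetical stand-in for any object whose lifetime matters."""

def with_local():
    r = Resource()          # the local name holds a reference
    ref = weakref.ref(r)
    return ref() is not None

def without_local():
    ref = weakref.ref(Resource())  # no name: CPython frees the object right away
    return ref() is not None

print(with_local(), without_local())  # CPython: True False
```

Removing the local name r changes when the object dies, which is exactly the kind of observable difference an optimiser would have to preserve.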
Issue 2181 (optimize out local variables at end of function) has some interesting points:
It can make debugging harder since the symbols no longer exist. Guido says only do it for -O.
Might break some usages of inspect or sys._getframe().
Changes the lifetime of objects. For example myfunc in the following example might fail after optimization because at the moment Python guarantees that the file object is not closed before the function exits. (bad style, but still)
import os

def myfunc():
    f = open('somewhere', 'r')
    fd = f.fileno()
    return os.fstat(fd)
cannot be rewritten as:
def bogus():
    fd = open('somewhere', 'r').fileno()
    # the file is auto-closed here and fd becomes invalid
    return os.fstat(fd)
A core developer says that "it is unlikely to give any speedup in real-world code, I don't think we should add complexity to the compiler."
I just ran across the disassembler function in Python, but I couldn't make out what its output means. Can anyone explain how it works and what it is used for, based on the results for the factorial function (recursive and loop-based)?
The recursive code and the corresponding dis code:
>>> def fact(n):
...     if n==1:
...         return 1
...     return n*fact(n-1)
...
>>> dis.dis(fact)
2 0 LOAD_FAST 0 (n)
3 LOAD_CONST 1 (1)
6 COMPARE_OP 2 (==)
9 POP_JUMP_IF_FALSE 16
3 12 LOAD_CONST 1 (1)
15 RETURN_VALUE
4 >> 16 LOAD_FAST 0 (n)
19 LOAD_GLOBAL 0 (fact)
22 LOAD_FAST 0 (n)
25 LOAD_CONST 1 (1)
28 BINARY_SUBTRACT
29 CALL_FUNCTION 1
32 BINARY_MULTIPLY
33 RETURN_VALUE
And the factorial function using loop gives the following result:
>>> def factor(n):
...     f=1
...     while n>1:
...         f*=n
...         n-=1
...
>>> dis.dis(factor)
2 0 LOAD_CONST 1 (1)
3 STORE_FAST 1 (f)
3 6 SETUP_LOOP 36 (to 45)
>> 9 LOAD_FAST 0 (n)
12 LOAD_CONST 1 (1)
15 COMPARE_OP 4 (>)
18 POP_JUMP_IF_FALSE 44
4 21 LOAD_FAST 1 (f)
24 LOAD_FAST 0 (n)
27 INPLACE_MULTIPLY
28 STORE_FAST 1 (f)
5 31 LOAD_FAST 0 (n)
34 LOAD_CONST 1 (1)
37 INPLACE_SUBTRACT
38 STORE_FAST 0 (n)
41 JUMP_ABSOLUTE 9
>> 44 POP_BLOCK
>> 45 LOAD_CONST 0 (None)
48 RETURN_VALUE
Can anyone tell me how to determine which one is faster?
To measure how fast something is running, use the timeit module, which comes with Python.
The dis module is used to get some idea of what the bytecode may look like, and it's very specific to CPython.
One use of it is to see what, when and how storage is assigned for variables in a loop or method. However, this is a specialized module that is not normally used for efficiency calculations; use timeit to figure out how fast something is, and then dis to get an understanding of what is going on under the hood - to arrive at a possible why.
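As a rough sketch of the timeit half of that advice, the two factorial versions from the question can be timed as below. The return f line is an addition (the loop version in the question returns None), and the argument 20 and the repeat counts are arbitrary choices.

```python
import timeit

def fact(n):
    if n == 1:
        return 1
    return n * fact(n - 1)

def factor(n):
    f = 1
    while n > 1:
        f *= n
        n -= 1
    return f  # added here so both versions can be checked against each other

assert fact(20) == factor(20)

# Take the minimum of several repeats to reduce noise from other processes.
recursive_time = min(timeit.repeat(lambda: fact(20), number=10000, repeat=5))
loop_time = min(timeit.repeat(lambda: factor(20), number=10000, repeat=5))
print(recursive_time, loop_time)
```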
It's impossible to determine which one will be faster simply by looking at the bytecode; each VM has a different cost associated with each opcode and so runtimes can vary widely.
The dis.dis() function disassembles a function into its bytecode interpretation.
Timing
As stated by Ignacio, the sheer length of the bytecode does not accurately represent the running time, because Python interpreters differ in how they execute each opcode; the timeit module is what you want to use there.
Actual Purpose
There are several uses of this function, but they are not things that most people would end up doing. You can look at the output to help as part of the process of optimizing or debugging speed issues. It would also likely prove useful in working directly on the python interpreter, or writing your own. You can look at the documentation here to see a full list of the opcodes (though, just as that page will state, it's perfectly likely to change between versions of python).
Overall, this is not something you'd really use much in a production application (unless your application is a python disassembler!) but when you really, really need to optimize your code and debug at the lowest level, this is where the function would come in handy.
I have written this code to convert string in such format "0(532) 222 22 22" to integer such as 05322222222 .
class Phone():
    def __init__(self,input):
        self.phone = input
    def __str__(self):
        return self.phone
    #convert to integer.
    def to_int(self):
        return int((self.phone).replace(" ","").replace("(","").replace(")",""))
test = Phone("0(532) 222 22 22")
print test.to_int()
It feels very clumsy to use 3 replace methods to solve this. I am curious if there is a better solution?
p = "0(532) 222 22 22"
print ''.join([x for x in p if x.isdigit()])
Note that you'll "lose" the leading zero if you want to convert it to int (like you suggested in the title). If you want to do that, just wrap the above in an int() call. A telephone number does make more sense as a string though (in my opinion).
In Python 2.6 or 2.7,
(self.phone).translate(None,' ()') will remove any spaces or ( or ) from the phone string. See Python 2.6 doc on str.translate for details.
In Python 3.x, str.translate() takes a mapping (rather than two strings as shown above). The corresponding snippet therefore is something like the following, using str.maketrans() to produce the mapping.
(self.phone).translate(str.maketrans('','', '()-/ '))
See Python 3.1 doc on str.translate for details.
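As a self-contained Python 3 sketch of the same idea (the variable names phone, strip_table and digits are just for illustration):

```python
phone = "0(532) 222 22 22"

# The third argument of str.maketrans lists the characters to delete.
strip_table = str.maketrans('', '', '()-/ ')
digits = phone.translate(strip_table)
print(digits)  # 05322222222
```

Building the table once and reusing it avoids recreating the mapping on every call.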
How about just using regular expressions?
Example:
>>> import re
>>> num = '0(532) 222 22 22'
>>> re.sub(r'\D', '', num)  # Match every non-digit (\D) in `num` and replace it with the empty string.
'05322222222'
The suggestion made by ChristopheD will work just fine, but is not as efficient.
The following is a test program to demonstrate this using the dis module (See Doug Hellman's PyMOTW on the module here for more detailed info).
TEST_PHONE_NUM = '0(532) 222 22 22'

def replace_method():
    print (TEST_PHONE_NUM).replace(" ","").replace("(","").replace(")","")

def list_comp_is_digit_method():
    print ''.join([x for x in TEST_PHONE_NUM if x.isdigit()])

def translate_method():
    print (TEST_PHONE_NUM).translate(None,' ()')

import re
def regex_method():
    print re.sub('[\D]', '', TEST_PHONE_NUM)

if __name__ == '__main__':
    from dis import dis
    print 'replace_method:'
    dis(replace_method)
    print
    print
    print 'list_comp_is_digit_method:'
    dis(list_comp_is_digit_method)
    print
    print
    print 'translate_method:'
    dis(translate_method)
    print
    print
    print "regex_method:"
    dis(regex_method)
    print
Output:
replace_method:
5 0 LOAD_GLOBAL 0 (TEST_PHONE_NUM)
3 LOAD_ATTR 1 (replace)
6 LOAD_CONST 1 (' ')
9 LOAD_CONST 2 ('')
12 CALL_FUNCTION 2
15 LOAD_ATTR 1 (replace)
18 LOAD_CONST 3 ('(')
21 LOAD_CONST 2 ('')
24 CALL_FUNCTION 2
27 LOAD_ATTR 1 (replace)
30 LOAD_CONST 4 (')')
33 LOAD_CONST 2 ('')
36 CALL_FUNCTION 2
39 PRINT_ITEM
40 PRINT_NEWLINE
41 LOAD_CONST 0 (None)
44 RETURN_VALUE
phone_digit_strip_list_comp:
3 0 LOAD_CONST 1 ('0(532) 222 22 22')
3 STORE_FAST 0 (phone)
4 6 LOAD_CONST 2 ('')
9 LOAD_ATTR 0 (join)
12 BUILD_LIST 0
15 DUP_TOP
16 STORE_FAST 1 (_[1])
19 LOAD_GLOBAL 1 (test_phone_num)
22 GET_ITER
23 FOR_ITER 30 (to 56)
26 STORE_FAST 2 (x)
29 LOAD_FAST 2 (x)
32 LOAD_ATTR 2 (isdigit)
35 CALL_FUNCTION 0
38 JUMP_IF_FALSE 11 (to 52)
41 POP_TOP
42 LOAD_FAST 1 (_[1])
45 LOAD_FAST 2 (x)
48 LIST_APPEND
49 JUMP_ABSOLUTE 23
52 POP_TOP
53 JUMP_ABSOLUTE 23
56 DELETE_FAST 1 (_[1])
59 CALL_FUNCTION 1
62 PRINT_ITEM
63 PRINT_NEWLINE
64 LOAD_CONST 0 (None)
67 RETURN_VALUE
translate_method:
11 0 LOAD_GLOBAL 0 (TEST_PHONE_NUM)
3 LOAD_ATTR 1 (translate)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 1 (' ()')
12 CALL_FUNCTION 2
15 PRINT_ITEM
16 PRINT_NEWLINE
17 LOAD_CONST 0 (None)
20 RETURN_VALUE
phone_digit_strip_regex:
8 0 LOAD_CONST 1 ('0(532) 222 22 22')
3 STORE_FAST 0 (phone)
9 6 LOAD_GLOBAL 0 (re)
9 LOAD_ATTR 1 (sub)
12 LOAD_CONST 2 ('[\\D]')
15 LOAD_CONST 3 ('')
18 LOAD_GLOBAL 2 (test_phone_num)
21 CALL_FUNCTION 3
24 PRINT_ITEM
25 PRINT_NEWLINE
26 LOAD_CONST 0 (None)
29 RETURN_VALUE
The translate method will be the most efficient, though it relies on Python 2.6+. The regex is slightly less efficient, but more compatible (which I take to be a requirement for you). The original replace method adds 6 additional instructions per replacement, while all of the others stay constant.
On a side note, store your phone numbers as strings to deal with leading zeros, and use a phone formatter where needed. Trust me, it's bitten me before.
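Since instruction counts are only a rough proxy (a single call opcode can hide an arbitrary amount of work), it may be worth timing the four approaches as well. The following is a Python 3 sketch, not the program above: the functions return their result instead of printing it, STRIP_TABLE is an assumed Python 3 equivalent of translate(None, ' ()'), and the repeat counts are arbitrary.

```python
import re
import timeit

TEST_PHONE_NUM = '0(532) 222 22 22'
STRIP_TABLE = str.maketrans('', '', ' ()')  # Python 3 form of translate(None, ' ()')

def replace_method():
    return TEST_PHONE_NUM.replace(" ", "").replace("(", "").replace(")", "")

def list_comp_is_digit_method():
    return ''.join([x for x in TEST_PHONE_NUM if x.isdigit()])

def translate_method():
    return TEST_PHONE_NUM.translate(STRIP_TABLE)

def regex_method():
    return re.sub(r'\D', '', TEST_PHONE_NUM)

methods = [replace_method, list_comp_is_digit_method, translate_method, regex_method]

# Best of three runs per method, printed from fastest to slowest.
timings = {m.__name__: min(timeit.repeat(m, number=20000, repeat=3)) for m in methods}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(name, seconds)
```

The ordering it prints depends on the interpreter version and the input, so it is best rerun on the strings you actually expect.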
SilentGhost: dis.dis does demonstrate underlying conceptual / executional complexity. after all, the OP complained about the original replacement chain being too ‘clumsy’, not too ‘slow’.
i recommend against using regular expressions where not inevitable; they just add conceptual overhead and a speed penalty otherwise. to use translate() here is IMHO just the wrong tool, and nowhere as conceptually simple and generic as the original replacement chain.
so you say tamaytoes, and i say tomahtoes: the original solution is quite good in terms of clarity and genericity. it is not clumsy at all. in order to make it a little denser and more parametrized, consider changing it to
phone_nr_translations = [
    ( ' ', '', ),
    ( '(', '', ),
    ( ')', '', ), ]
def sanitize_phone_nr( phone_nr ):
    R = phone_nr
    for probe, replacement in phone_nr_translations:
        R = R.replace( probe, replacement )
    return R
in this special application, of course, what you really want to do is just cancelling out any unwanted characters, so you can simplify this:
probes = ' ()'
def sanitize_phone_nr( phone_nr ):
    R = phone_nr
    for probe in probes:
        R = R.replace( probe, '' )
    return R
coming to think of it, it is not quite clear to me why you want to turn a phone nr into an integer—that is simply the wrong data type. this can be demonstrated by the fact that at least in mobile nets, + and # and maybe more are valid characters in a dial string (dial, string—see?).
but apart from that, sanitizing a user input phone nr to get out a normalized and safe representation is a very, very valid concern—only i feel that your methodology is too specific. why not re-write the sanitizing method to something very generic without becoming more complex? after all, how can you be sure your users never input other deviant characters in that web form field?
so what you want is really not to dis-allow specific characters (there are about a hundred thousand defined codepoints in unicode 5.1, so how do you catch up with those?), but to allow those very characters that are deemed legal in dial strings. and you can do that with a regular expression...
from re import compile as _new_regex
illegal_phone_nr_chrs_re = _new_regex( r"[^0-9#+]" )
def sanitize_phone_nr( phone_nr ):
    return illegal_phone_nr_chrs_re.sub( '', phone_nr )
...or with a set:
legal_phone_nr_chrs = set( '0123456789#+' )
def sanitize_phone_nr( phone_nr ):
    return ''.join(
        chr for chr in phone_nr
        if chr in legal_phone_nr_chrs )
that last stanza could well be written on a single line. the disadvantage of this solution would be that you iterate over the input characters from within Python, not making use of the potentially speedier C traversal as offered by str.replace() or even a regular expression. however, performance would in any case be dependent on the expected usage pattern (i am sure you truncate your phone nrs first thing, right? so those would be many small strings to be processed, not few big ones).
notice a few points here: i strive for clarity, which is why i try to avoid over-using abbreviations. chr for character, nr for number and R for the return value (more likely to be, ugh, retval where used in the standard library) are in my style book. programming is about getting things understood and done, not about programmers writing code that approaches the spatial efficiency of gzip. now look, the last solution does fairly much what the OP managed to get done (and more), in...
legal_phone_nr_chrs = set( '0123456789#+' )
def sanitize_phone_nr( phone_nr ): return ''.join( chr for chr in phone_nr if chr in legal_phone_nr_chrs )
...two lines of code if need be, whereas the OP’s code...
class Phone():
    def __init__ ( self, input ): self.phone = self._sanitize( input )
    def __str__ ( self ): return self.phone
    def _sanitize ( self, input ): return input.replace( ' ', '' ).replace( '(', '' ).replace( ')', '' )
...can hardly be compressed below four lines. see what additional baggage that strictly-OOP solution gives you? i believe it can be left out of the picture most of the time.