Python, a smarter way of string to integer conversion

I have written this code to convert a string in a format such as "0(532) 222 22 22" to an integer such as 05322222222.
class Phone():
    def __init__(self,input):
        self.phone = input
    def __str__(self):
        return self.phone
    #convert to integer.
    def to_int(self):
        return int((self.phone).replace(" ","").replace("(","").replace(")",""))
test = Phone("0(532) 222 22 22")
print test.to_int()
It feels very clumsy to use 3 replace methods to solve this. I am curious if there is a better solution?

p = "0(532) 222 22 22"
print ''.join([x for x in p if x.isdigit()])
Note that you'll "lose" the leading zero if you want to convert it to int (like you suggested in the title). If you want to do that, just wrap the above in an int() call. A telephone number does make more sense as a string though (in my opinion).
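A quick illustration of the leading-zero point, written as a Python 3 sketch (the variable names are only illustrative):

```python
p = "0(532) 222 22 22"

# Keep only the digit characters; the result is still a string.
digits = "".join(ch for ch in p if ch.isdigit())
print(digits)       # 05322222222

# Converting to int drops the leading zero.
print(int(digits))  # 5322222222
```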

In Python 2.6 or 2.7,
(self.phone).translate(None,' ()') will remove any spaces or ( or ) from the phone string. See Python 2.6 doc on str.translate for details.
In Python 3.x, str.translate() takes a mapping (rather than two strings as shown above). The corresponding snippet therefore is something like the following, using str.maketrans() to produce the mapping.
(self.phone).translate(str.maketrans('','', '()-/ '))
See Python 3.1 doc on str.translate for details.
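As a self-contained Python 3 sketch of the same idea (the helper name strip_phone and the exact set of characters to delete are assumptions for illustration):

```python
# Characters to delete; this set is an assumption, extend it as needed.
UNWANTED = "()-/ "

# str.maketrans('', '', chars) builds a table that maps each char to None,
# which str.translate() treats as "delete this character".
DELETE_TABLE = str.maketrans("", "", UNWANTED)

def strip_phone(phone):
    """Return phone with every character in UNWANTED removed."""
    return phone.translate(DELETE_TABLE)

print(strip_phone("0(532) 222 22 22"))  # 05322222222
```

Building the table once and reusing it avoids recreating the mapping on every call.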

How about just using regular expressions?
Example:
>>> import re
>>> num = '0(532) 222 22 22'
>>> re.sub('[\D]', '', num) # Match all non-digits ([\D]), replace them with empty string, where found in the `num` variable.
'05322222222'
The suggestion made by ChristopheD will work just fine, but is not as efficient.
The following test program demonstrates this using the dis module (see Doug Hellmann's PyMOTW article on the module for more detail).
TEST_PHONE_NUM = '0(532) 222 22 22'
def replace_method():
    print (TEST_PHONE_NUM).replace(" ","").replace("(","").replace(")","")
def list_comp_is_digit_method():
    print ''.join([x for x in TEST_PHONE_NUM if x.isdigit()])
def translate_method():
    print (TEST_PHONE_NUM).translate(None,' ()')
import re
def regex_method():
    print re.sub('[\D]', '', TEST_PHONE_NUM)
if __name__ == '__main__':
    from dis import dis
    print 'replace_method:'
    dis(replace_method)
    print
    print
    print 'list_comp_is_digit_method:'
    dis(list_comp_is_digit_method)
    print
    print
    print 'translate_method:'
    dis(translate_method)
    print
    print
    print "regex_method:"
    dis(regex_method)  # was dis(phone_digit_strip_regex), a name not defined in this listing
    print
Output (some function and variable names differ from the listing above):
replace_method:
5 0 LOAD_GLOBAL 0 (TEST_PHONE_NUM)
3 LOAD_ATTR 1 (replace)
6 LOAD_CONST 1 (' ')
9 LOAD_CONST 2 ('')
12 CALL_FUNCTION 2
15 LOAD_ATTR 1 (replace)
18 LOAD_CONST 3 ('(')
21 LOAD_CONST 2 ('')
24 CALL_FUNCTION 2
27 LOAD_ATTR 1 (replace)
30 LOAD_CONST 4 (')')
33 LOAD_CONST 2 ('')
36 CALL_FUNCTION 2
39 PRINT_ITEM
40 PRINT_NEWLINE
41 LOAD_CONST 0 (None)
44 RETURN_VALUE
phone_digit_strip_list_comp:
3 0 LOAD_CONST 1 ('0(532) 222 22 22')
3 STORE_FAST 0 (phone)
4 6 LOAD_CONST 2 ('')
9 LOAD_ATTR 0 (join)
12 BUILD_LIST 0
15 DUP_TOP
16 STORE_FAST 1 (_[1])
19 LOAD_GLOBAL 1 (test_phone_num)
22 GET_ITER
23 FOR_ITER 30 (to 56)
26 STORE_FAST 2 (x)
29 LOAD_FAST 2 (x)
32 LOAD_ATTR 2 (isdigit)
35 CALL_FUNCTION 0
38 JUMP_IF_FALSE 11 (to 52)
41 POP_TOP
42 LOAD_FAST 1 (_[1])
45 LOAD_FAST 2 (x)
48 LIST_APPEND
49 JUMP_ABSOLUTE 23
52 POP_TOP
53 JUMP_ABSOLUTE 23
56 DELETE_FAST 1 (_[1])
59 CALL_FUNCTION 1
62 PRINT_ITEM
63 PRINT_NEWLINE
64 LOAD_CONST 0 (None)
67 RETURN_VALUE
translate_method:
11 0 LOAD_GLOBAL 0 (TEST_PHONE_NUM)
3 LOAD_ATTR 1 (translate)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 1 (' ()')
12 CALL_FUNCTION 2
15 PRINT_ITEM
16 PRINT_NEWLINE
17 LOAD_CONST 0 (None)
20 RETURN_VALUE
phone_digit_strip_regex:
8 0 LOAD_CONST 1 ('0(532) 222 22 22')
3 STORE_FAST 0 (phone)
9 6 LOAD_GLOBAL 0 (re)
9 LOAD_ATTR 1 (sub)
12 LOAD_CONST 2 ('[\\D]')
15 LOAD_CONST 3 ('')
18 LOAD_GLOBAL 2 (test_phone_num)
21 CALL_FUNCTION 3
24 PRINT_ITEM
25 PRINT_NEWLINE
26 LOAD_CONST 0 (None)
29 RETURN_VALUE
The translate method will be the most efficient, though it relies on Python 2.6+. The regex is slightly less efficient but more compatible (which I see is a requirement for you). The original replace method adds 4 additional instructions per replacement (LOAD_ATTR, two LOAD_CONST and CALL_FUNCTION in the listing above), while all of the others stay constant.
On a side note, store your phone numbers as strings to deal with leading zeros, and use a phone formatter where needed. Trust me, it's bitten me before.
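Instruction counts are only a rough proxy for speed, so it may be worth measuring directly. The following is a minimal Python 3 sketch using timeit; the function names mirror the test program above, the functions return their result instead of printing it, and the precompiled table and pattern are assumptions added for the sketch. Actual timings depend on the interpreter and machine, so none are claimed here.

```python
import re
import timeit

TEST_PHONE_NUM = "0(532) 222 22 22"
DELETE_TABLE = str.maketrans("", "", " ()")  # Python 3 form of translate(None, ' ()')
NON_DIGITS = re.compile(r"\D")               # precompiled non-digit pattern

def replace_method():
    return TEST_PHONE_NUM.replace(" ", "").replace("(", "").replace(")", "")

def isdigit_method():
    return "".join([x for x in TEST_PHONE_NUM if x.isdigit()])

def translate_method():
    return TEST_PHONE_NUM.translate(DELETE_TABLE)

def regex_method():
    return NON_DIGITS.sub("", TEST_PHONE_NUM)

if __name__ == "__main__":
    for func in (replace_method, isdigit_method, translate_method, regex_method):
        # Total time in seconds for 100000 calls of each approach.
        seconds = timeit.timeit(func, number=100000)
        print("%-18s %.4f s" % (func.__name__, seconds))
```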

SilentGhost: dis.dis does demonstrate underlying conceptual / executional complexity. after all, the OP complained about the original replacement chain being too ‘clumsy’, not too ‘slow’.
i recommend against using regular expressions where not inevitable; they just add conceptual overhead and a speed penalty otherwise. to use translate() here is IMHO just the wrong tool, and nowhere as conceptually simple and generic as the original replacement chain.
so you say tamaytoes, and i say tomahtoes: the original solution is quite good in terms of clarity and genericity. it is not clumsy at all. in order to make it a little denser and more parametrized, consider changing it to
phone_nr_translations = [
    ( ' ', '', ),
    ( '(', '', ),
    ( ')', '', ), ]
def sanitize_phone_nr( phone_nr ):
    R = phone_nr
    for probe, replacement in phone_nr_translations:
        R = R.replace( probe, replacement )
    return R
in this special application, of course, what you really want to do is just cancelling out any unwanted characters, so you can simplify this:
probes = ' ()'
def sanitize_phone_nr( phone_nr ):
    R = phone_nr
    for probe in probes:
        R = R.replace( probe, '' )
    return R
coming to think of it, it is not quite clear to me why you want to turn a phone nr into an integer—that is simply the wrong data type. this can be demonstrated by the fact that at least in mobile nets, + and # and maybe more are valid characters in a dial string (dial, string—see?).
but apart from that, sanitizing a user input phone nr to get out a normalized and safe representation is a very, very valid concern—only i feel that your methodology is too specific. why not re-write the sanitizing method to something very generic without becoming more complex? after all, how can you be sure your users never input other deviant characters in that web form field?
so what you want is really not to dis-allow specific characters (there are about a hundred thousand defined codepoints in unicode 5.1, so how do catch up with those?), but to allow those very characters that are deemed legal in dial strings. and you can do that with a regular expression...
from re import compile as _new_regex
illegal_phone_nr_chrs_re = _new_regex( r"[^0-9#+]" )
def sanitize_phone_nr( phone_nr ):
    return illegal_phone_nr_chrs_re.sub( '', phone_nr )
...or with a set:
legal_phone_nr_chrs = set( '0123456789#+' )
def sanitize_phone_nr( phone_nr ):
    return ''.join(
        chr for chr in phone_nr
        if chr in legal_phone_nr_chrs )
that last stanza could well be written on a single line. the disadvantage of this solution would be that you iterate over the input characters from within Python, not making use of the potentially speedier C traversal as offered by str.replace() or even a regular expression. however, performance would in any case be dependent on the expected usage pattern (i am sure you truncate your phone nrs first thing, right? so those would be many small strings to be processed, not few big ones).
notice a few points here: i strive for clarity, which is why i try to avoid over-using abbreviations. chr for character, nr for number and R for the return value (more likely to be, ugh, retval where used in the standard library) are in my style book. programming is about getting things understood and done, not about programmers writing code that approaches the spatial efficiency of gzip. now look, the last solution does fairly much what the OP managed to get done (and more), in...
legal_phone_nr_chrs = set( '0123456789#+' )
def sanitize_phone_nr( phone_nr ): return ''.join( chr for chr in phone_nr if chr in legal_phone_nr_chrs )
...two lines of code if need be, whereas the OP’s code...
class Phone():
    def __init__ ( self, input ): self.phone = self._sanitize( input )
    def __str__ ( self ): return self.phone
    def _sanitize ( self, input ): return input.replace( ' ', '' ).replace( '(', '' ).replace( ')', '' )
...can hardly be compressed below four lines. see what additional baggage that strictly-OOP solution gives you? i believe it can be left out of the picture most of the time.

Related

Is dict.__setitem__(key, x) slower (or faster) than dict[key] = x, and why?

I found out something weird.
I defined two test functions as such:
def with_brackets(n=10000):
    d = dict()
    for i in range(n):
        d["hello"] = i
def with_setitem(n=10000):
    d = dict()
    st = d.__setitem__
    for i in range(n):
        st("hello", i)
One would expect the two functions to be roughly the same execution speed. However:
>>> timeit(with_brackets, number=1000)
0.6558860000222921
>>> timeit(with_setitem, number=1000)
0.9857697170227766
There is possibly something I missed, but it does seem like setitem is almost twice as long, and I don't really understand why. Isn't dict[key] = x supposed to call __setitem__?
(Using CPython 3.9)
Edit: Using timeit instead of time
Isn't dict[key] = x supposed to call __setitem__?
Strictly speaking, no. Running both your functions through dis.dis, we get (I am only including the for loop):
>>> dis.dis(with_brackets)
...
>> 22 FOR_ITER 12 (to 36)
24 STORE_FAST 3 (i)
5 26 LOAD_FAST 0 (n)
28 LOAD_FAST 1 (d)
30 LOAD_CONST 1 ('hello')
32 STORE_SUBSCR
34 JUMP_ABSOLUTE 22
...
Vs
>>> dis.dis(with_setitem)
...
>> 28 FOR_ITER 14 (to 44)
30 STORE_FAST 4 (i)
6 32 LOAD_FAST 2 (setitem)
34 LOAD_CONST 1 ('hello')
36 LOAD_FAST 0 (n)
38 CALL_FUNCTION 2
40 POP_TOP
42 JUMP_ABSOLUTE 28
...
The usage of __setitem__ involves a function call (see the usage of CALL_FUNCTION and POP_TOP instead of just STORE_SUBSCR - that's the difference underneath the hood), and function calls do add some amount of overhead, so using the bracket accessor leads to more optimised opcode.
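For a plain dict the bracket form is handled by STORE_SUBSCR without a Python-level method call, but the two spellings are still semantically linked. A small sketch (LoggingDict is a hypothetical class, not from the question) shows that when a subclass overrides __setitem__, the bracket form does reach the override:

```python
class LoggingDict(dict):
    """Hypothetical dict subclass that records every __setitem__ call."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def __setitem__(self, key, value):
        self.calls.append((key, value))
        super().__setitem__(key, value)

d = LoggingDict()
d["hello"] = 1          # bracket syntax dispatches to the override
d.__setitem__("hi", 2)  # explicit method call
print(d.calls)          # [('hello', 1), ('hi', 2)]
```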

What happens if you send() to a generator *expression* in Python?

I was surprised to find that, say,
ge=(x*x for x in [1,2,3])
accepts the .send method. The argument of the first call must be None, as with any other generator, but the behaviour of further calls, say, ans=ge.send(99), seems identical to ans=next(ge).
Where does my 99 go? There are no yield expressions within ge, nothing to be assigned. Is the injected value simply discarded (as I suspect), or is there some Mystery involved?
Has anybody seen that?
Same thing as if you send to the equivalent generator created with a generator function:
def genfunc(outer_iterable):
    for x in outer_iterable:
        yield x*x
ge = genfunc([1, 2, 3])
which is to say, the send argument gets discarded.
We can disassemble the bytecode for further confirmation:
import dis
ge=(x*x for x in [1,2,3])
print('Genexp:')
dis.dis(ge)
def genfunc(outer_iterable):
    for x in outer_iterable:
        yield x*x
ge = genfunc([1, 2, 3])
print()
print('Generator function:')
dis.dis(ge)
Output:
Genexp:
3 0 LOAD_FAST 0 (.0)
>> 3 FOR_ITER 15 (to 21)
6 STORE_FAST 1 (x)
9 LOAD_FAST 1 (x)
12 LOAD_FAST 1 (x)
15 BINARY_MULTIPLY
16 YIELD_VALUE
17 POP_TOP
18 JUMP_ABSOLUTE 3
>> 21 LOAD_CONST 0 (None)
24 RETURN_VALUE
Generator function:
9 0 SETUP_LOOP 23 (to 26)
3 LOAD_FAST 0 (outer_iterable)
6 GET_ITER
>> 7 FOR_ITER 15 (to 25)
10 STORE_FAST 1 (x)
10 13 LOAD_FAST 1 (x)
16 LOAD_FAST 1 (x)
19 BINARY_MULTIPLY
20 YIELD_VALUE
21 POP_TOP
22 JUMP_ABSOLUTE 7
>> 25 POP_BLOCK
>> 26 LOAD_CONST 0 (None)
29 RETURN_VALUE
The genexp and the generator created through the generator function have very similar disassemblies, and in both, the YIELD_VALUE is immediately followed by a POP_TOP that discards any value sent in from send.
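The behaviour can also be checked directly. In this Python 3 sketch, the generator function echo_squares is a hypothetical counterpart that does use the sent value, for contrast with the generator expression:

```python
ge = (x * x for x in [1, 2, 3])
first = next(ge)      # 1: a fresh generator must be started with next() or send(None)
second = ge.send(99)  # 4: 99 becomes the result of the hidden yield and is discarded
third = next(ge)      # 9

def echo_squares(iterable):
    """Hypothetical generator function that keeps the value passed to send()."""
    for x in iterable:
        received = yield x * x
        if received is not None:
            print("received", received)

gen = echo_squares([1, 2, 3])
a = next(gen)      # 1
b = gen.send(99)   # prints "received 99", then yields 4
```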
Thank you, all. So yes, the arg of send is discarded, but the fact that send is accepted seems to be an anomaly.
Another, related bug has been already commented here (page 32139885), the yield expression should be forbidden in genexps, but it isn't. The form ge=((yield x*x) for x in [1,2,3]) is accepted, and .send() works.
The answer returned by send is then a mixture of elements in the internal iterable, and the args of send... If I am not mistaken, GvR wrote that in Python 3.8 this (the yield expression) will be treated as an error, and in 3.7 it should signal that it is deprecated. (People agreed that it was confusing.)
But I tested that in Python 3.7 (Anaconda, Windows 64), and I got no deprecation warning. Anyway, this seems to be a real bug, not a feature to be deprecated. I believe that for the moment there is nothing more to say...
JK

Where did I go wrong in this time complexity calculation?

I have this function in Python:
digit_sum = 0
while number > 0:
    digit_sum += (number % 10)
    number = number // 10
For determining the time complexity, I applied the following logic:
Line 1: 1 basic operation (assignment), gets executed 1 time so gets a value of 1
Line 2: 2 basic operations (reading the variable 'number' and comparing against zero), gets executed n+1 times so gets a value of 2*(n+1)
Line 3: 4 basic operations (reading the variable 'number', %10, computing the sum, and assignment), gets executed n times so gets a value of 4*n
Line 4: 3 basic operations (reading the variable 'number', //10 and assignment), gets executed n times so gets a value of 3*n
This brings me to a total of 1 + 2n+2 + 4n + 3n = 9n+3
But my textbook has a solution of 8n+3. Where did I go wrong in my logic?
Thanks,
Alex
When talking about complexity all you really care about is asymptotic complexity. Here, O(n). The 8 or 9 or 42 doesn't really matter, especially as there is no way for you to know.
Thus counting "operations" is pointless. It exposes the architectural details of the underlying processor (be it an actual hw proc or an interpreter). The only way to actually get the "real" count of operations would be to have a look at a specific implementation (for instance, say CPython 3.6.0) and see the bytecode it generates from your program.
Here is what my CPython 2.7.12 does:
>>> def test(number):
...     digit_sum = 0
...     while number > 0:
...         digit_sum += (number % 10)
...         number = number // 10
...
>>> import dis
>>> dis.dis(test)
2 0 LOAD_CONST 1 (0)
3 STORE_FAST 1 (digit_sum)
3 6 SETUP_LOOP 40 (to 49)
>> 9 LOAD_FAST 0 (number)
12 LOAD_CONST 1 (0)
15 COMPARE_OP 4 (>)
18 POP_JUMP_IF_FALSE 48
4 21 LOAD_FAST 1 (digit_sum)
24 LOAD_FAST 0 (number)
27 LOAD_CONST 2 (10)
30 BINARY_MODULO
31 INPLACE_ADD
32 STORE_FAST 1 (digit_sum)
5 35 LOAD_FAST 0 (number)
38 LOAD_CONST 2 (10)
41 BINARY_FLOOR_DIVIDE
42 STORE_FAST 0 (number)
45 JUMP_ABSOLUTE 9
>> 48 POP_BLOCK
>> 49 LOAD_CONST 0 (None)
52 RETURN_VALUE
I'll let you draw your own conclusions as to what you want to actually count as a basic operation. The Python interpreter executes bytecode instructions one after the other, so arguably you have 15 "basic operations" inside your loop. That's the closest you can get to a meaningful number. Still, each of those operations has a different runtime, so that 15 carries no valuable information.
Also, keep in mind this is specific to CPython 2.7.12. It's very likely another version will generate something else, taking advantage of new bytecodes that might make it possible to express some operations in a simpler way.

What is the disassembler function in Python, and how is it used?

I just ran across the disassembler function in Python, but I couldn't make out what it means. Can anyone explain how it works and what it is used for, based on the results for the factorial function (recursive and loop versions)?
The recursive code and the corresponding dis code:
>>> def fact(n):
...     if n==1:
...         return 1
...     return n*fact(n-1)
...
>>> dis.dis(fact)
2 0 LOAD_FAST 0 (n)
3 LOAD_CONST 1 (1)
6 COMPARE_OP 2 (==)
9 POP_JUMP_IF_FALSE 16
3 12 LOAD_CONST 1 (1)
15 RETURN_VALUE
4 >> 16 LOAD_FAST 0 (n)
19 LOAD_GLOBAL 0 (fact)
22 LOAD_FAST 0 (n)
25 LOAD_CONST 1 (1)
28 BINARY_SUBTRACT
29 CALL_FUNCTION 1
32 BINARY_MULTIPLY
33 RETURN_VALUE
And the factorial function using loop gives the following result:
>>> def factor(n):
...     f=1
...     while n>1:
...         f*=n
...         n-=1
...
>>> dis.dis(factor)
2 0 LOAD_CONST 1 (1)
3 STORE_FAST 1 (f)
3 6 SETUP_LOOP 36 (to 45)
>> 9 LOAD_FAST 0 (n)
12 LOAD_CONST 1 (1)
15 COMPARE_OP 4 (>)
18 POP_JUMP_IF_FALSE 44
4 21 LOAD_FAST 1 (f)
24 LOAD_FAST 0 (n)
27 INPLACE_MULTIPLY
28 STORE_FAST 1 (f)
5 31 LOAD_FAST 0 (n)
34 LOAD_CONST 1 (1)
37 INPLACE_SUBTRACT
38 STORE_FAST 0 (n)
41 JUMP_ABSOLUTE 9
>> 44 POP_BLOCK
>> 45 LOAD_CONST 0 (None)
48 RETURN_VALUE
Can anyone tell me how to determine which one is faster?
To measure how fast something is running, use the timeit module, which comes with Python.
The dis module is used to get some idea of what the bytecode may look like, and it's very specific to CPython.
One use of it is to see what, when and how storage is assigned for variables in a loop or method. However, this is a specialized module that is not normally used for efficiency calculations; use timeit to figure out how fast something is, and then dis to get an understanding of what is going on under the hood - to arrive at a possible why.
It's impossible to determine which one will be faster simply by looking at the bytecode; each VM has a different cost associated with each opcode and so runtimes can vary widely.
The dis.dis() function disassembles a function into its bytecode interpretation.
Timing
As stated by Ignacio, the pure length of the bytecode does not accurately represent the running time due to differences in how python interpreters actually run opcode and the timeit module would be what you want to use there.
Actual Purpose
There are several uses of this function, but they are not things that most people would end up doing. You can look at the output as part of the process of optimizing or debugging speed issues. It would also likely prove useful when working directly on the Python interpreter, or writing your own. You can look at the dis module documentation to see a full list of the opcodes (though, just as that page will state, it's perfectly likely to change between versions of Python).
Overall, this is not something you'd really use much in a production application (unless your application is a python disassembler!) but when you really, really need to optimize your code and debug at the lowest level, this is where the function would come in handy.
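For the "which one is faster" part, a minimal timeit sketch in Python 3 could look like this. The return in factor is an addition (the loop version in the question never returns f), and the argument 20 and the repeat count are arbitrary choices; timings vary by machine, so none are quoted.

```python
import timeit

def fact(n):
    if n == 1:
        return 1
    return n * fact(n - 1)

def factor(n):
    f = 1
    while n > 1:
        f *= n
        n -= 1
    return f  # added here; the question's version omits the return

if __name__ == "__main__":
    for func in (fact, factor):
        # Total time in seconds for 100000 calls with n = 20.
        seconds = timeit.timeit(lambda: func(20), number=100000)
        print(func.__name__, seconds)
```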

Is it better to save the length of a list that I use several time?

I know about inlining, and from what I checked it is not done by Python's compiler.
My question is: are there any optimizations in Python's compiler that transform:
print myList.__len__()
for i in range(0, myList.__len__()):
    print i + myList.__len__()
to
l = myList.__len__()
print l
for i in range(0, l):
    print i + l
So is it done by the compiler?
If it is not: is it worth doing it myself?
Bonus question (not so related): I like to have a lot of functions (better for readability IMHO). Since there is no inlining in Python, is having lots of functions something to avoid?
No, there isn't. You can check what Python does by compiling the code to byte-code using the dis module:
>>> def test():
...     print myList.__len__()
...     for i in range(0, myList.__len__()):
...         print i + myList.__len__()
...
>>> import dis
>>> dis.dis(test)
2 0 LOAD_GLOBAL 0 (myList)
3 LOAD_ATTR 1 (__len__)
6 CALL_FUNCTION 0
9 PRINT_ITEM
10 PRINT_NEWLINE
3 11 SETUP_LOOP 44 (to 58)
14 LOAD_GLOBAL 2 (range)
17 LOAD_CONST 1 (0)
20 LOAD_GLOBAL 0 (myList)
23 LOAD_ATTR 1 (__len__)
26 CALL_FUNCTION 0
29 CALL_FUNCTION 2
32 GET_ITER
>> 33 FOR_ITER 21 (to 57)
36 STORE_FAST 0 (i)
4 39 LOAD_FAST 0 (i)
42 LOAD_GLOBAL 0 (myList)
45 LOAD_ATTR 1 (__len__)
48 CALL_FUNCTION 0
51 BINARY_ADD
52 PRINT_ITEM
53 PRINT_NEWLINE
54 JUMP_ABSOLUTE 33
>> 57 POP_BLOCK
>> 58 LOAD_CONST 0 (None)
61 RETURN_VALUE
As you can see, the __len__ attribute is looked up and called each time.
Python cannot know what a given method will return between calls, and the __len__ method is no exception. If Python were to try to optimize that by assuming the value returned would be the same between calls, you'd run into countless different problems, and we haven't even tried to use multi-threading yet.
Note that you would be much better off using len(myList), and not call the __len__() hook directly:
print len(myList)
for i in xrange(len(myList)):
    print i + len(myList)
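To illustrate why the call cannot safely be hoisted automatically, here is a small Python 3 sketch (names are illustrative) in which the length changes between calls inside the loop:

```python
my_list = [1, 2, 3]

lengths_seen = []
for i in range(len(my_list)):          # len() is evaluated once here: range(3)
    lengths_seen.append(len(my_list))  # evaluated again on every iteration
    my_list.append(i)                  # the list grows while we loop

print(lengths_seen)  # [3, 4, 5]
```

Caching the length in a local variable before the loop would change the result here, which is exactly why the compiler leaves the decision to the programmer.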
No, the optimization you're asking about is not done by the CPython compiler. In fact hardly any optimizations are done by the CPython compiler.
To see for yourself, import dis and disassemble a function with code like you're asking about: dis.dis(func).
The reason this isn't optimized is that it is entirely possible that an attribute (even a method like __len__) will be a completely different object the next time it is accessed. This rarely happens, of course, but Python supports it.
Attribute access does consume time, so storing a reference to an attribute you will be using repeatedly (especially in a local variable) can make your code run faster. However, it decreases readability, so I'd wait until you know that a given piece of code is a bottleneck before applying it. In your case, the time spent printing is easily going to overwhelm the attribute access.
In the final analysis, if performance were paramount you'd be using something other than Python in the first place, no?
