I have just started learning Python and am wondering whether dict.get(key, default_value) differs from writing the check myself. Do these two approaches have any differences?
First method:
counts = {}
for c in string:
    if c in counts:
        counts[c] += 1
    else:
        counts[c] = 1
and the second, using the dict.get() method that Python provides:
for c in string:
    counts[c] = counts.get(c, 0) + 1
Do they differ in efficiency or speed, or are they the same, with the second just saving a few lines of code?
For this specific case, use either a collections.Counter() or a collections.defaultdict() object instead:
import collections
dct = collections.defaultdict(int)
for c in string:
    dct[c] += 1
or
dct = collections.Counter(string)
Both are subclasses of the standard dict type. The Counter type adds some more helpful functionality like summing two counters or listing the most common entities that have been counted. The defaultdict class can also be given other default types; use defaultdict(list) for example to collect things into lists per key.
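For example, a quick sketch of that extra functionality (the sample strings and words here are made up for the demo):
from collections import Counter, defaultdict

c1 = Counter('aabbbc')
c2 = Counter('bcc')
c1.most_common(2)  # [('b', 3), ('a', 2)]
c1 + c2            # sums the counts: Counter({'b': 4, 'c': 3, 'a': 2})

# defaultdict(list) collects things into lists per key
groups = defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)
# groups == {'a': ['apple', 'avocado'], 'b': ['banana']}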
When you want to compare performance of two different approaches, you want to use the timeit module:
>>> import timeit
>>> def intest(dct, values):
...     for c in values:
...         if c in dct:
...             dct[c] += 1
...         else:
...             dct[c] = 1
...
>>> def get(dct, values):
...     for c in values:
...         dct[c] = dct.get(c, 0) + 1
...
>>> values = list(range(10)) * 10  # list(...) keeps this working on Python 3; plain range(10) * 10 was Python 2
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, intest as test; dct={}')
22.210275888442993
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, get as test; dct={}')
27.442166090011597
This shows that using in is a little faster.
There is, however, a third option to consider; catching the KeyError exception:
>>> def tryexcept(dct, values):
...     for c in values:
...         try:
...             dct[c] += 1
...         except KeyError:
...             dct[c] = 1
...
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, tryexcept as test; dct={}')
18.023509979248047
which happens to be the fastest, because only 1 in 10 cases are for a new key.
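As a hedged caveat (the input here is made up for the sketch, and no timings are reproduced): the balance tips the other way when most lookups miss, because a KeyError that is actually raised is comparatively expensive:
>>> new_keys = list(range(100))
>>> # a fresh dict per call makes every single lookup raise KeyError
>>> timeit.timeit('test({}, new_keys)', 'from __main__ import new_keys, tryexcept as test')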
Last but not least, the two alternatives I proposed:
>>> def default(dct, values):
...     for c in values:
...         dct[c] += 1
...
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, default as test; from collections import defaultdict; dct=defaultdict(int)')
15.277361154556274
>>> timeit.timeit('Counter(values)', 'from __main__ import values; from collections import Counter')
38.657804012298584
So the Counter() type is slowest, but defaultdict is very fast indeed. Counter()s do a lot more work though, and the extra functionality can bring ease of development and execution speed benefits elsewhere.
Why does the following code work while the code after it breaks?
I'm not sure how to articulate my question in English, so I've attached the smallest code I could come up with to highlight my problem.
(Context: I'm trying to create a terminal environment for Python, but for some reason the namespaces seem to be messed up, and the code below seems to be the essence of my problem.)
No errors:
d={}
exec('def a():b',d)
exec('b=None',d)
exec('a()',d)
Errors:
d={}
exec('def a():b',d)
d=d.copy()
exec('b=None',d)
d=d.copy()
exec('a()',d)
It is because the function a does not use the globals dictionary you pass to the later exec calls; it uses the mapping it captured a reference to when it was defined in the first exec. While you set 'b' in the copied dictionary, you never set b in the globals of that function.
>>> d={}
>>> exec('def a():b',d)
>>> exec('b=None',d)
>>> d['a'].__globals__ is d
True
>>> 'b' in d['a'].__globals__
True
vs
>>> d={}
>>> exec('def a():b',d)
>>> d = d.copy()
>>> exec('b=None',d)
>>> d['a'].__globals__ is d
False
>>> 'b' in d['a'].__globals__
False
If exec didn't work this way, then this too would fail:
mod.py
b = None
def d():
    b
main.py
from mod import d
d()
A function will remember the environment where it was first created.
It is not possible to change the dictionary that an existing function points to. You can either modify its globals explicitly, or you can make another function object altogether:
from types import FunctionType

def rebind_globals(func, new_globals):
    f = FunctionType(
        code=func.__code__,
        globals=new_globals,
        name=func.__name__,
        argdefs=func.__defaults__,
        closure=func.__closure__,
    )
    f.__kwdefaults__ = func.__kwdefaults__
    return f
def foo(a, b=1, *, c=2):
    print(a, b, c, d)

# add __builtins__ so that `print` is found...
new_globals = {'d': 3, '__builtins__': __builtins__}
new_foo = rebind_globals(foo, new_globals)
new_foo(a=0)
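The first option, modifying the existing function's globals in place, is a one-liner using the d/a/b names from the question (a sketch):
>>> d = {}
>>> exec('def a():b', d)
>>> d = d.copy()
>>> d['a'].__globals__['b'] = None  # write into the mapping a() actually uses
>>> exec('a()', d)  # no NameError now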
If a dictionary contains something to which you can hold a reference, you can default-or-update it with one dictionary lookup:
d.setdefault('k', []).append(2)
However, modifying dictionary entries in the same manner is not possible if they're numbers:
d.setdefault('k', 0) += 1  # SyntaxError: can't assign to a function call
Instead, you need to do two dict lookups, one for read and one for write:
d['k'] = d.get('k', 0) + 1
This doesn't seem like a great idea for dictionaries with a huge number of keys. So, is there a way to do a default-or-update operation on dictionaries containing numbers? Or, phrased another way, what's the most performant way to apply a default-or-update operation on such dictionaries?
A quick test suggests that collections.defaultdict is about 2.5 times faster than your double-lookup (tested on Python 2.6):
>>> import timeit
>>> s1 = "d = dict((str(n), 0) for n in range(1000000))"
>>> timeit.repeat("d['a'] = d.get('a', 0) + 1", setup=s1)
[0.17711305618286133, 0.17411494255065918, 0.17812514305114746]
>>> s2 = """
... from collections import defaultdict
... d = defaultdict(int, ((str(n), 0) for n in range(1000000)))
... """
>>> timeit.repeat("d['a'] += 1", setup=s2)
[0.07185506820678711, 0.07294416427612305, 0.12155508995056152]
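If the goal is counting specifically, collections.Counter gives the same single-lookup update plus some conveniences (a sketch; not benchmarked here):
from collections import Counter

d = Counter()
d['k'] += 1       # missing keys default to 0, so no .get() dance is needed
d.most_common(1)  # [('k', 1)]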
I have a list of strings from the Amazon S3 API which contain full file paths, like this:
fileA.jpg
fileB.jpg
images/
I want to partition the folders and files into different lists.
How can I divide them?
I was thinking of a regex, like this:
for path in list:
    if re.search("/$", path):
        dir_list.append(path)
    else:
        file_list.append(path)
Is there any better way?
Don't use a regular expression; just use .endswith('/'):
for path in lst:
    if path.endswith('/'):
        dir_list.append(path)
    else:
        file_list.append(path)
.endswith() performs better than a regular expression and is simpler to boot:
>>> sample = ['fileA.jpg', 'fileB.jpg', 'images/'] * 30
>>> import random
>>> random.shuffle(sample)
>>> from timeit import timeit
>>> import re
>>> def re_partition(pattern=re.compile(r'/$')):
...     for e in sample:
...         if pattern.search(e): pass
...         else: pass
...
>>> def endswith_partition():
...     for e in sample:
...         if e.endswith('/'): pass
...         else: pass
...
>>> timeit('f()', 'from __main__ import re_partition as f, sample', number=10000)
0.2553541660308838
>>> timeit('f()', 'from __main__ import endswith_partition as f, sample', number=10000)
0.20675897598266602
From Filter a list into two parts, an iterable version:
from itertools import tee
a, b = tee((p.endswith("/"), p) for p in paths)
dirs = (path for isdir, path in a if isdir)
files = (path for isdir, path in b if not isdir)
It allows consuming an infinite stream of paths from the service, as long as the dirs and files generators are advanced roughly in sync.
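For a finite list, the same pipeline can be drained eagerly too (a quick demo continuing from the snippet above; note that fully consuming one generator first makes tee buffer the other's items):
>>> paths = ['fileA.jpg', 'fileB.jpg', 'images/']
>>> a, b = tee((p.endswith('/'), p) for p in paths)
>>> dirs = (path for isdir, path in a if isdir)
>>> files = (path for isdir, path in b if not isdir)
>>> list(dirs)
['images/']
>>> list(files)
['fileA.jpg', 'fileB.jpg']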
You could use the itertools module for item grouping:
import itertools

items = ["fileA.jpg", "fileB.jpg", "images/"]
sorter = lambda x: x.endswith("/")
items = sorted(items, key=sorter)  # in case items are not sorted yet
files, dirs = [tuple(i[1]) for i in itertools.groupby(items, sorter)]
print(files, dirs)
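Note that the two-element unpacking assumes both kinds occur in the input; with only files (or only dirs), groupby yields a single group and the assignment raises ValueError. A hedged guard could collect the groups into a dict first:
grouped = {key: tuple(group) for key, group in itertools.groupby(items, sorter)}
files = grouped.get(False, ())  # sorter returns False for files
dirs = grouped.get(True, ())    # ...and True for directory entries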
I wonder which of the following is quicker for a tuple (also for a list or an int):
a_tuple = ('a', 'b')
if len(a_tuple) != 0: pass
if len(a_tuple) > 0: pass
I did some timeit experiments and the results were quite similar (they varied each time I ran timeit for 100000 iterations). I just wonder if there is a time benefit.
Use not a_tuple (true if empty) or a_tuple (true if not empty) instead of testing the length:
if a_tuple:
    pass
Or, as a demonstration speaks louder than words:
>>> if not ():
...     print('empty!')
...
empty!
>>> if (1, 0):
...     print('not empty!')
...
not empty!
Apart from the fact that this is a micro-optimization, testing the falsiness of the empty tuple is faster too. When in doubt about speed, use the timeit module:
>>> import timeit
>>> a_tuple = (1,0)
>>> def ft_bool():
...     if a_tuple:
...         pass
...
>>> def ft_len_gt():
...     if len(a_tuple) > 0:
...         pass
...
>>> def ft_len_ne():
...     if len(a_tuple) != 0:
...         pass
...
>>> timeit.timeit('ft()', 'from __main__ import ft_bool as ft')
0.17232918739318848
>>> timeit.timeit('ft()', 'from __main__ import ft_len_gt as ft')
0.2506139278411865
>>> timeit.timeit('ft()', 'from __main__ import ft_len_ne as ft')
0.23904109001159668
In Python, is there a difference between calling clear() and assigning {} to a dictionary? If yes, what is it?
Example:
d = {"stuff": "things"}
d.clear()  # this way
d = {}     # vs this way
If you have another variable also referring to the same dictionary, there is a big difference:
>>> d = {"stuff": "things"}
>>> d2 = d
>>> d = {}
>>> d2
{'stuff': 'things'}
>>> d = {"stuff": "things"}
>>> d2 = d
>>> d.clear()
>>> d2
{}
This is because assigning d = {} creates a new, empty dictionary and assigns it to the d variable. This leaves d2 pointing at the old dictionary with items still in it. However, d.clear() clears the same dictionary that d and d2 both point at.
d = {} will create a new instance for d but all other references will still point to the old contents.
d.clear() will reset the contents, but all references to the same instance will still be correct.
In addition to the differences mentioned in other answers, there also is a speed difference. d = {} is over twice as fast:
python -m timeit -s "d = {}" "for i in xrange(500000): d.clear()"
10 loops, best of 3: 127 msec per loop
python -m timeit -s "d = {}" "for i in xrange(500000): d = {}"
10 loops, best of 3: 53.6 msec per loop
As an illustration for the things already mentioned before:
>>> a = {1:2}
>>> id(a)
3073677212L
>>> a.clear()
>>> id(a)
3073677212L
>>> a = {}
>>> id(a)
3073675716L
In addition to @odano's answer, it seems using d.clear() is faster if you need to clear the dict many times.
import timeit
p1 = '''
d = {}
for i in xrange(1000):
    d[i] = i * i
for j in xrange(100):
    d = {}
    for i in xrange(1000):
        d[i] = i * i
'''
p2 = '''
d = {}
for i in xrange(1000):
    d[i] = i * i
for j in xrange(100):
    d.clear()
    for i in xrange(1000):
        d[i] = i * i
'''
print timeit.timeit(p1, number=1000)
print timeit.timeit(p2, number=1000)
The result is:
20.0367929935
19.6444659233
Mutating methods are always useful if the original object is not in scope:
def fun(d):
    d.clear()
    d["b"] = 2

d = {"a": 2}
fun(d)
d  # {'b': 2}
Re-assigning the dictionary would create a new object and wouldn't modify the original one.
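For contrast, a sketch of the reassignment variant (same names as above): rebinding d inside the function leaves the caller's dictionary untouched.
def fun2(d):
    d = {}      # rebinds the local name only; the caller's dict is unaffected
    d["b"] = 2

d = {"a": 2}
fun2(d)
d  # still {'a': 2}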
One thing not mentioned is scoping issues. Not a great example, but here's the case where I ran into the problem:
from functools import wraps

def conf_decorator(dec):
    """Enables behavior like this:

    @threaded
    def f(): ...

    or

    @threaded(thread=KThread)
    def f(): ...

    (assuming threaded is wrapped with this function.)
    Sends any accumulated kwargs to threaded.
    """
    c_kwargs = {}
    @wraps(dec)
    def wrapped(f=None, **kwargs):
        if f:
            r = dec(f, **c_kwargs)
            c_kwargs = {}
            return r
        else:
            c_kwargs.update(kwargs)  # <- UnboundLocalError: local variable 'c_kwargs' referenced before assignment
            return wrapped
    return wrapped
The solution is to replace c_kwargs = {} with c_kwargs.clear()
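For illustration, the fixed inner function (same names as above) would read:
def wrapped(f=None, **kwargs):
    if f:
        r = dec(f, **c_kwargs)
        c_kwargs.clear()  # mutate in place; no rebinding, so no UnboundLocalError
        return r
    else:
        c_kwargs.update(kwargs)
        return wrapped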
If someone thinks up a more practical example, feel free to edit this post.
In addition, sometimes the dict instance might be a subclass of dict (a defaultdict, for example). In that case, using clear() is preferred, as we don't have to remember the exact type of the dict, and we avoid duplicate code (which would couple the clearing line to the initialization line).
from collections import defaultdict

x = defaultdict(list)
x[1].append(2)
...
x.clear()  # instead of the longer x = defaultdict(list)