I'm currently working on a small side project in which I want to sort a 20GB file on my machine as fast as possible. The idea is to chunk the file, sort the chunks, and merge the chunks. I used pyenv to time the radixsort code with different Python versions and saw that 2.7.18 is way faster than 3.6.10, 3.7.7, 3.8.3 and 3.9.0a. Can anybody explain why Python 3.x is slower than 2.7.18 in this simple example? Were there new features added?
import os

def chunk_data(filepath, prefixes):
    """
    Pre-sort and chunk the content of filepath according to the prefixes.

    Parameters
    ----------
    filepath : str
        Path to a text file which should get sorted. Each line contains
        a string which has at least 2 characters and the first two
        characters are guaranteed to be in prefixes
    prefixes : List[str]
    """
    prefix2file = {}
    for prefix in prefixes:
        chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
        prefix2file[prefix] = open(chunk, "w")

    # This is where most of the execution time is spent:
    with open(filepath) as fp:
        for line in fp:
            prefix2file[line[:2]].write(line)
Execution times (multiple runs):
2.7.18: 192.2s, 220.3s, 225.8s
3.6.10: 302.5s
3.7.7: 308.5s
3.8.3: 279.8s, 279.7s (binary mode), 295.3s (binary mode), 307.7s, 380.6s (wtf?)
3.9.0a: 292.6s
The complete code is on GitHub, along with a minimal complete version.
Unicode
Yes, I know that Python 3 and Python 2 deal differently with strings. I tried opening the files in binary mode (rb / wb); see the "binary mode" comments above. Binary mode is a tiny bit faster on a couple of runs, yet Python 2.7 is still WAY faster on all runs.
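For reference, here is a minimal sketch of what I mean by the binary-mode variant (assuming prefixes still holds str values as above; the dict keys are encoded because line[:2] yields bytes when reading in "rb" mode):
import os

def chunk_data_binary(filepath, prefixes):
    """Like chunk_data, but reads/writes bytes to avoid decode/encode."""
    prefix2file = {}
    for prefix in prefixes:
        chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
        prefix2file[prefix.encode()] = open(chunk, "wb")

    with open(filepath, "rb") as fp:
        for line in fp:
            prefix2file[line[:2]].write(line)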
Try 1: Dictionary access
When I phrased this question, I thought that dictionary access might be a reason for this difference. However, the total execution time is far smaller for dictionary access than for I/O, and timeit did not show anything significant:
import timeit
import numpy as np

durations = timeit.repeat(
    'a["b"]',
    repeat=10 ** 6,
    number=1,
    setup="a = {'b': 3, 'c': 4, 'd': 5}"
)

mul = 10 ** -7
print(
    "mean = {:0.1f} * 10^-7, std={:0.1f} * 10^-7".format(
        np.mean(durations) / mul,
        np.std(durations) / mul
    )
)
print("min = {:0.1f} * 10^-7".format(np.min(durations) / mul))
print("max = {:0.1f} * 10^-7".format(np.max(durations) / mul))
Try 2: Copy time
As a simplified experiment, I tried to copy the 20GB file:
cp via shell: 230s
Python 2.7.18: 237s, 249s
Python 3.8.3: 233s, 267s, 272s
The Python numbers were generated by the code below.
My first thought was that the variance is quite high, so that could be the reason. But then, the variance of the chunk_data execution time is also high, yet the mean is noticeably lower for Python 2.7 than for Python 3.x. So the cause does not seem to be an I/O scenario as simple as the one I tried here.
import time
import sys
import os

version = sys.version_info
version = "{}.{}.{}".format(version.major, version.minor, version.micro)

if os.path.isfile("numbers-tmp.txt"):
    os.remove("numbers-tmp.txt")

t0 = time.time()
with open("numbers-large.txt") as fin, open("numbers-tmp.txt", "w") as fout:
    for line in fin:
        fout.write(line)
t1 = time.time()

print("Python {}: {:0.0f}s".format(version, t1 - t0))
My System
Ubuntu 20.04
Thinkpad T460p
Python through pyenv
This is a combination of multiple effects, mostly the fact that Python 3 needs to perform unicode decoding/encoding when working in text mode, and that in binary mode it sends the data through dedicated buffered I/O implementations.
First of all, measuring execution time with time.time uses wall time and hence includes all sorts of Python-unrelated things such as OS-level caching and buffering, as well as buffering of the storage medium. It also reflects any interference from other processes that need the storage medium. That's why you are seeing these wild variations in the timing results. Here are the results for my system, from seven consecutive runs for each version:
py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6] # 661.79 +/- 38.58
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4] # 615.84 +/- 15.09
Despite the large variation, it seems that these results indeed indicate different timings, as can be confirmed for example by a statistical test:
>>> from scipy.stats import ttest_ind
>>> ttest_ind(py2, py3)[1]
0.018729004515179636
i.e. if the timings came from the same distribution, a difference this large would occur only about 2% of the time.
We can get a more precise picture by measuring the process time rather than the wall time. In Python 2 this can be done via time.clock while Python 3.3+ offers time.process_time. These two functions report the following timings:
py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8] # 224.90 +/- 1.09
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4] # 171.16 +/- 0.16
Now there's much less spread in the data since the timings reflect the Python process only.
This data suggests that Python 3 takes about 53.7 seconds longer to execute. Given the large amount of lines in the input file (550_000_000) this amounts to about 97.7 nanoseconds per iteration.
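As an aside, here is a small version-portable helper for this kind of measurement (a sketch; the time.clock fallback is only meaningful on Linux, where it reports CPU time under Python 2, and time.clock was removed in Python 3.8):
import sys
import time

# time.process_time exists from Python 3.3 on; on Linux, Python 2's
# time.clock reports the CPU time of the current process.
process_time = time.process_time if sys.version_info >= (3, 3) else time.clock

t0 = process_time()
sum(i * i for i in range(10 ** 6))  # stand-in workload
print("process time: {:0.3f}s".format(process_time() - t0))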
The first effect causing increased execution time is unicode strings in Python 3. The binary data is read from the file, decoded, and then encoded again when it is written back. In Python 2 all strings are stored as binary strings right away, so no encoding/decoding overhead is introduced. You don't see this effect clearly in your tests because it disappears in the large variation introduced by the various external resources that are reflected in the wall-time difference. For example, we can measure the time it takes for a roundtrip from binary to unicode to binary:
In [1]: %timeit b'000000000000000000000000000000000000'.decode().encode()
162 ns ± 2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
This does include two attribute lookups as well as two function calls, so the actual time needed is smaller than the value reported above. To see the effect on execution time, we can change the test script to use binary modes "rb" and "wb" instead of text modes "r" and "w". This reduces the timing results for Python 3 as follows:
py3_binary_mode = [200.6, 203.0, 207.2] # 203.60 +/- 2.73
That reduces the process time by about 21.3 seconds or 38.7 nanoseconds per iteration. This is in agreement with timing results for the roundtrip benchmark minus timing results for name lookups and function calls:
In [2]: class C:
   ...:     def f(self): pass
   ...:
In [3]: x = C()
In [4]: %timeit x.f()
82.2 ns ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: %timeit x
17.8 ns ± 0.0564 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
Here %timeit x measures the additional overhead of resolving the global name x, and hence the attribute lookup plus function call amount to 82.2 - 17.8 == 64.4 nanoseconds. Subtracting this overhead twice from the above roundtrip figure gives 162 - 2*64.4 == 33.2 nanoseconds for the roundtrip itself.
Now there's still a difference of 32.4 seconds between Python 3 using binary mode and Python 2. This comes from the fact that all the I/O in Python 3 goes through the (quite complex) implementation of io.BufferedWriter.write, while in Python 2 the file.write method proceeds fairly directly to fwrite.
We can check the types of the file objects in both implementations:
$ python3.8
>>> type(open('/tmp/test', 'wb'))
<class '_io.BufferedWriter'>
$ python2.7
>>> type(open('/tmp/test', 'wb'))
<type 'file'>
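The buffering layer is also visible directly: asking Python 3 for an unbuffered binary file (buffering=0, which is only allowed in binary mode) hands back the raw FileIO object that BufferedWriter otherwise wraps:
$ python3.8
>>> type(open('/tmp/test', 'wb', buffering=0))
<class '_io.FileIO'>
>>> open('/tmp/test', 'wb').raw
<_io.FileIO name='/tmp/test' mode='wb' closefd=True>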
Here we also need to note that the above timing results for Python 2 were obtained in text mode, not binary mode. Binary mode aims to support all objects implementing the buffer protocol, which means additional work is performed even for strings (see also this question). If we switch to binary mode for Python 2 as well, then we obtain:
py2_binary_mode = [212.9, 213.9, 214.3] # 213.70 +/- 0.59
which is actually a bit slower than the Python 3 results (by 18.4 ns per iteration).
The two implementations also differ in other details such as the dict implementation. To measure this effect we can create a corresponding setup:
from __future__ import print_function
import timeit

N = 10**6
R = 7
results = timeit.repeat(
    "d[b'10'].write",
    setup="d = dict.fromkeys((str(i).encode() for i in range(10, 100)), open('test', 'rb'))",  # requires file 'test' to exist
    repeat=R, number=N
)
results = [x / N for x in results]
print(['{:.3e}'.format(x) for x in results])
print(sum(results) / R)
This gives the following results for Python 2 and Python 3:
Python 2: ~ 56.9 nanoseconds
Python 3: ~ 78.1 nanoseconds
This additional difference of about 21.2 nanoseconds amounts to about 12 seconds for the full 550M iterations.
The above timing code checks the dict lookup for only one key, so we also need to verify that there are no hash collisions:
$ python3.8 -c "print(len({str(i).encode() for i in range(10, 100)}))"
90
$ python2.7 -c "print len({str(i).encode() for i in range(10, 100)})"
90
Using a list's insert function is much slower than achieving the same effect using slice assignment:
> python -m timeit -n 100000 -s "a=[]" "a.insert(0,0)"
100000 loops, best of 5: 19.2 usec per loop
> python -m timeit -n 100000 -s "a=[]" "a[0:0]=[0]"
100000 loops, best of 5: 6.78 usec per loop
(Note that a=[] is only the setup, so a starts empty but then grows to 100,000 elements.)
At first I thought it might be attribute lookup or function call overhead, but inserting near the end shows that that's negligible:
> python -m timeit -n 100000 -s "a=[]" "a.insert(-1,0)"
100000 loops, best of 5: 79.1 nsec per loop
Why is the presumably simpler dedicated "insert single element" function so much slower?
I can also reproduce it at repl.it:
from timeit import repeat

for _ in range(3):
    for stmt in 'a.insert(0,0)', 'a[0:0]=[0]', 'a.insert(-1,0)':
        t = min(repeat(stmt, 'a=[]', number=10**5))
        print('%.6f' % t, stmt)
    print()
# Example output:
#
# 4.803514 a.insert(0,0)
# 1.807832 a[0:0]=[0]
# 0.012533 a.insert(-1,0)
#
# 4.967313 a.insert(0,0)
# 1.821665 a[0:0]=[0]
# 0.012738 a.insert(-1,0)
#
# 5.694100 a.insert(0,0)
# 1.899940 a[0:0]=[0]
# 0.012664 a.insert(-1,0)
I use Python 3.8.1 32-bit on Windows 10 64-bit.
repl.it uses Python 3.8.1 64-bit on Linux 64-bit.
I think it's probably just that they forgot to use memmove in list.insert. If you take a look at the code list.insert uses to shift elements, you can see it's just a manual loop:
for (i = n; --i >= where; )
    items[i+1] = items[i];
while list.__setitem__ on the slice assignment path uses memmove:
memmove(&item[ihigh+d], &item[ihigh],
        (k - ihigh)*sizeof(PyObject *));
memmove typically has a lot of optimization put into it, such as taking advantage of SSE/AVX instructions.
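Until or unless that changes in CPython, the workaround is simple: in a hot loop, spell front-insertion as slice assignment. A tiny sketch:
def insert_front(lst, value):
    # Equivalent to lst.insert(0, value), but goes through the
    # slice-assignment path, which shifts elements with memmove.
    lst[0:0] = [value]

a = list(range(5))
insert_front(a, -1)
print(a)  # [-1, 0, 1, 2, 3, 4]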
I'm trying to compile an expression that contains an UndefinedFunction which has an implementation provided. (Alternatively: an expression which contains a Symbol which represents a call to an external numerical function.)
Is there a way to do this? Either with autowrap or codegen or perhaps with some manual editing of the generated files?
The following naive example does not work:
import sympy as sp
import numpy as np
from sympy.abc import *
from sympy.utilities.lambdify import implemented_function
from sympy.utilities.autowrap import autowrap, ufuncify

def g_implementation(a):
    """represents some numerical function"""
    return a*5

# sympy wrapper around the above function
g = implemented_function('g', g_implementation)

# some random expression using the above function
e = (x+g(a))**2+100

# try to compile said expression
f = autowrap(e, backend='cython')
# fails with "undefined reference to `g'"
EDIT:
I have several large Sympy expressions
The expressions are generated automatically (via differentiation and such)
The expressions contain some "implemented UndefinedFunctions" that call some numerical functions (i.e. NOT Sympy expressions)
The final script/program that has to evaluate the expressions for some input will be called quite often. That means that evaluating the expression in Sympy (via evalf) is definitely not feasible. Even compiling just in time (lambdify, autowrap, ufuncify, numba.jit) produces too much of an overhead.
Basically I want to create a binary python extension for those expressions without having to implement them by hand in C, which I consider too error prone.
OS is Windows 7 64bit
You may want to take a look at this answer about serialization of SymPy lambdas (generated by lambdify).
It's not exactly what you asked, but it may alleviate your problem with startup performance: the lambdify-ed functions would then mostly cost only their execution time.
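A minimal sketch of that serialization idea, assuming the third-party dill package is available (dill can pickle the functions lambdify generates once its recurse setting is enabled):
import dill
from sympy import symbols, lambdify

dill.settings['recurse'] = True  # needed so dill can serialize lambdify output

x, y = symbols('x y')
f = lambdify([x, y], (x - y)**3)

# Serialize once (e.g. in a build step) ...
with open('f.dill', 'wb') as fh:
    dill.dump(f, fh)

# ... and later load without paying the lambdify cost again:
with open('f.dill', 'rb') as fh:
    f2 = dill.load(fh)

print(f2(3, 1))  # 8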
You may also take a look at Theano. It has nice integration with SymPy.
Ok, this could do the trick, I hope; if not, let me know and I'll try again.
I compare a compiled version of an expression using Cython against the lambdified expression.
from sympy.utilities.autowrap import autowrap
from sympy import symbols, lambdify

def wraping(expression):
    return autowrap(expression, backend='cython')

def lamFunc(expression, x, y):
    return lambdify([x, y], expression)

x, y = symbols('x y')
expr = ((x - y)**(25)).expand()
print expr

binary_callable = wraping(expr)
print binary_callable(1, 2)

lamb = lamFunc(expr, x, y)
print lamb(1, 2)
which outputs:
x**25 - 25*x**24*y + 300*x**23*y**2 - 2300*x**22*y**3 + 12650*x**21*y**4 - 53130*x**20*y**5 + 177100*x**19*y**6 - 480700*x**18*y**7 + 1081575*x**17*y**8 - 2042975*x**16*y**9 + 3268760*x**15*y**10 - 4457400*x**14*y**11 + 5200300*x**13*y**12 - 5200300*x**12*y**13 + 4457400*x**11*y**14 - 3268760*x**10*y**15 + 2042975*x**9*y**16 - 1081575*x**8*y**17 + 480700*x**7*y**18 - 177100*x**6*y**19 + 53130*x**5*y**20 - 12650*x**4*y**21 + 2300*x**3*y**22 - 300*x**2*y**23 + 25*x*y**24 - y**25
-1.0
-1
If I time the execution, the autowrapped function is 10x faster (depending on the problem; I have also observed cases where the factor was as little as two):
%timeit binary_callable(12, 21)
100000 loops, best of 3: 2.87 µs per loop
%timeit lamb(12, 21)
100000 loops, best of 3: 28.7 µs per loop
So here wraping(expr) wraps your expression expr and returns a wrapped object binary_callable, which you can use at any time for numerical evaluation.
EDIT: I have done this on Linux/Ubuntu and Windows OS, both seem to work fine!
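Regarding the failing "undefined reference to `g'" example in the question: autowrap also accepts a helpers argument of (name, expression, arguments) tuples for auxiliary routines that the main expression calls. This only works if the helper can itself be written as a SymPy expression, which was exactly the limitation described in the question, but for the a*5 stand-in it would look roughly like this (an untested sketch):
import sympy as sp
from sympy.utilities.autowrap import autowrap

x, a = sp.symbols('x a')
g = sp.Function('g')
e = (x + g(a))**2 + 100

# The helper tuple tells the code generator how to generate and link g;
# this only works because a*5 is expressible symbolically.
f = autowrap(e, backend='cython', args=[x, a],
             helpers=[('g', a * 5, [a])])
print(f(1, 2))  # (1 + 2*5)**2 + 100 == 221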
Let's use, for example, numpy.sin()
The following code will return the value of the sine for each value of the array a:
import numpy
a = numpy.arange( 1000000 )
result = numpy.sin( a )
But my machine has 32 cores, so I'd like to make use of them. (The overhead might not be worthwhile for something like numpy.sin() but the function I actually want to use is quite a bit more complicated, and I will be working with a huge amount of data.)
Is this the best (read: smartest or fastest) method:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    result = pool.map( numpy.sin, a )
or is there a better way to do this?
There is a better way: numexpr
Slightly reworded from their main page:
It's a multi-threaded VM written in C that analyzes expressions, rewrites them more efficiently, and compiles them on the fly into code that achieves near-optimal parallel performance for both memory- and CPU-bound operations.
For example, on my 4-core machine, evaluating a sine is just slightly less than 4 times faster than numpy:
In [1]: import numpy as np
In [2]: import numexpr as ne
In [3]: a = np.arange(1000000)
In [4]: timeit ne.evaluate('sin(a)')
100 loops, best of 3: 15.6 ms per loop
In [5]: timeit np.sin(a)
10 loops, best of 3: 54 ms per loop
Documentation, including the supported functions, is available here. You'll have to check, or give us more information, to see whether your more complicated function can be evaluated by numexpr.
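For example, a compound expression with explicit thread control might look like this (ne.set_num_threads is part of numexpr's API; the actual speedup depends on your expression and array sizes):
import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

ne.set_num_threads(4)  # e.g. limit numexpr to 4 worker threads

# Evaluated in one fused, multi-threaded pass, without numpy temporaries:
result = ne.evaluate('sin(a) * exp(-b) + sqrt(a)')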
Well, here's an interesting note. Run the following commands:
import numpy
from multiprocessing import Pool
a = numpy.arange(1000000)
pool = Pool(processes = 5)
result = pool.map(numpy.sin, a)
UnpicklingError: NEWOBJ class argument has NULL tp_new
I wasn't expecting that. So what's going on? Well:
>>> help(numpy.sin)
Help on ufunc object:
sin = class ufunc(__builtin__.object)
| Functions that operate element by element on whole arrays.
|
| To see the documentation for a specific ufunc, use np.info(). For
| example, np.info(np.sin). Because ufuncs are written in C
| (for speed) and linked into Python with NumPy's ufunc facility,
| Python's help() function finds this page whenever help() is called
| on a ufunc.
Yep, numpy.sin is implemented in C, and as such it can't be pickled, so you can't really use it directly with multiprocessing.
So we have to wrap it with another function.
perf:
import time
import numpy
from multiprocessing import Pool

def numpy_sin(value):
    return numpy.sin(value)

a = numpy.arange(1000000)
pool = Pool(processes = 5)

start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)

start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.032201
Multithreaded 10.550432
Wow, I wasn't expecting that either. Well, there are a couple of issues. For starters, we are using a Python function, even if it's just a wrapper, versus a pure C function. And there's also the overhead of copying the values: multiprocessing by default doesn't share data, so each value needs to be copied back and forth.
Do note that if we properly segment our data:
import time
import numpy
from multiprocessing import Pool

def numpy_sin(value):
    return numpy.sin(value)

a = [numpy.arange(100000) for _ in xrange(10)]
pool = Pool(processes = 5)

start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)

start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.150192
Multithreaded 0.055083
So what can we take from this? multiprocessing is great, but we should always test and compare; sometimes it's faster and sometimes it's slower, depending on how it's used ...
Granted, you are not using numpy.sin but another function. I would recommend that you first verify that multiprocessing will indeed speed up the computation; the overhead of copying values back and forth may affect you.
Either way, I also believe that using pool.map is the best and safest way to parallelize code, as in the sketch below.
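Building on the segmenting point above, numpy.array_split is a convenient way to produce the chunks, and numpy.concatenate stitches the result back together (a sketch; the chunk count and pool size are knobs to tune):
import numpy as np
from multiprocessing import Pool

def numpy_sin(chunk):
    return np.sin(chunk)

if __name__ == '__main__':
    a = np.arange(1000000)
    chunks = np.array_split(a, 10)  # a handful of big chunks, not 10**6 tiny tasks
    pool = Pool(processes=5)
    result = np.concatenate(pool.map(numpy_sin, chunks))
    pool.close()
    pool.join()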
I hope this helps.
SciPy actually has a pretty good writeup on this subject here.
I've been told os.path.join is horribly slow in Python and that I should use string concatenation ('%s/%s' % (x, y)) instead. Is there really that big a difference, and if so, how can I track it?
$ python -mtimeit -s 'import os.path' 'os.path.join("/root", "file")'
1000000 loops, best of 3: 1.02 usec per loop
$ python -mtimeit '"/root" + "file"'
10000000 loops, best of 3: 0.0223 usec per loop
So yes, it's nearly 50 times slower. But 1 microsecond is still nothing, so I really wouldn't factor the difference in. Use os.path.join: it's cross-platform, more readable and less bug-prone.
EDIT: Two people have now commented that the import explains the difference. This is not true: -s is a setup flag, so the import is not factored into the reported runtime. Read the docs.
I don't know who told you not to use it, but they're wrong.
Even if it were slow, it would never be slow to a program-breaking extent. I've never noticed it being remotely slow.
It's key to cross-platform programming. Path separators etc. differ by platform, and os.path.join will always join paths correctly regardless of platform.
Readability. Everyone knows what join is doing. People might have to do a double take for string concatenation for paths.
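A quick illustration of both points (output shown for a POSIX system; on Windows the separator would be a backslash):
>>> import os.path
>>> os.path.join('usr', 'local', 'bin')
'usr/local/bin'
>>> os.path.join('usr/', 'local')  # no doubled separator
'usr/local'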
Also be aware that the dotted attribute lookup in a function call has a measurable cost. Compare:
python -mtimeit -s "import os.path;x=range(10)" "os.path.join(x)"
1000000 loops, best of 3: 0.405 usec per loop
python -mtimeit -s "from os.path import join;x=range(10)" "join(x)"
1000000 loops, best of 3: 0.29 usec per loop
So that's a slowdown of about 40% just from resolving the dotted name on each invocation.
Curiously, these two are different speeds:
$ python -mtimeit -s "from os.path import sep;join=sep.join;x=map(str,range(10))" "join(x)"
1000000 loops, best of 3: 0.253 usec per loop
$ python -mtimeit -s "from os.path import join;x=map(str,range(10))" "join(x)"
1000000 loops, best of 3: 0.285 usec per loop
It may be nearly 50 times faster, but unless you're doing it in a CPU-bound tight inner loop, the speed difference isn't going to matter at all. The portability difference, on the other hand, will determine whether or not your program can be easily ported to a non-Unix platform.
So, please use os.path.join unless you've profiled and discovered that it really is a major impediment to your program's performance.
You should use os.path.join simply for portability.
I don't get the point of comparing os.path.join (which works for any number of parts, on any platform) with something as trivial as string-formatting two paths.
To answer the question in the title, "Is Python's os.path.join slow?" you have to at least compare it with a remotely similar function to find out what speed you can expect from a function like this.
As you can see below, compared to a similar function, there is nothing slow about os.path.join:
python -mtimeit -s "x = tuple(map(str, range(10)))" "'/'.join(x)"
1000000 loops, best of 3: 0.26 usec per loop
python -mtimeit -s "from os.path import join;x = tuple(range(10))" "join(x)"
1000000 loops, best of 3: 0.27 usec per loop
python -mtimeit -s "x = tuple(range(3))" "('/%s'*len(x)) % x"
1000000 loops, best of 3: 0.456 usec per loop
python -mtimeit -s "x = tuple(map(str, range(3)))" "'/'.join(x)"
10000000 loops, best of 3: 0.178 usec per loop
In this hot controversy, I dare to propose my own measurement.
(I know, I know, there is timeit, but I'm not so practiced with timeit, and clock() seems to me to be sufficient for the case.)
import os
from time import clock

separ = os.sep
ospath = os.path
ospathjoin = os.path.join

A, B, C, D, E, F, G, H = [], [], [], [], [], [], [], []
n = 1000
for essays in xrange(100):

    te = clock()
    for i in xrange(n):
        xa = os.path.join('C:\WINNT\system32', 'Microsoft\Crypto', 'RSA\MachineKeys')
    A.append(clock() - te)

    te = clock()
    for i in xrange(n):
        xb = ospath.join('C:\WINNT\system32', 'Microsoft\Crypto', 'RSA\MachineKeys')
    B.append(clock() - te)

    te = clock()
    for i in xrange(n):
        xc = ospathjoin('C:\WINNT\system32', 'Microsoft\Crypto', 'RSA\MachineKeys')
    C.append(clock() - te)

    te = clock()
    for i in xrange(n):
        xd = 'C:\WINNT\system32' + os.sep + 'Microsoft\Crypto' + os.sep + 'RSA\MachineKeys'
    D.append(clock() - te)

    te = clock()
    for i in xrange(n):
        xe = '%s\\%s\\%s' % ('C:\WINNT\system32', 'Microsoft\Crypto', 'RSA\MachineKeys')
    E.append(clock() - te)

    te = clock()
    for i in xrange(n):
        xf = 'C:\WINNT\system32' + separ + 'Microsoft\Crypto' + separ + 'RSA\MachineKeys'
    F.append(clock() - te)

    te = clock()
    for i in xrange(n):
        xg = os.sep.join(('C:\WINNT\system32', 'Microsoft\Crypto', 'RSA\MachineKeys'))
    G.append(clock() - te)

    te = clock()
    for i in xrange(n):
        xh = separ.join(('C:\WINNT\system32', 'Microsoft\Crypto', 'RSA\MachineKeys'))
    H.append(clock() - te)

print min(A), "os.path.join('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')"
print min(B), "ospath.join('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')"
print min(C), "ospathjoin('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')"
print min(D), "'C:\WINNT\system32'+os.sep+'Microsoft\Crypto'+os.sep+'RSA\MachineKeys'"
print min(E), "'%s\\%s\\%s' % ('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')"
print min(F), "'C:\WINNT\system32'+separ+'Microsoft\Crypto'+separ+'RSA\MachineKeys'"
print min(G), "os.sep.join(('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys'))"
print min(H), "separ.join(('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys'))"
print 'xa==xb==xc==xd==xe==xf==xg==xh ==', xa == xb == xc == xd == xe == xf == xg == xh
result
0.0284533369465 os.path.join('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')
0.0277652606686 ospath.join('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')
0.0272489939364 ospathjoin('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')
0.00398598145854 'C:\WINNT\system32'+os.sep+'Microsoft\Crypto'+os.sep+'RSA\MachineKeys'
0.00375075603184 '%s\%s\%s' % ('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys')
0.00330824168994 'C:\WINNT\system32'+separ+'Microsoft\Crypto'+separ+'RSA\MachineKeys'
0.00292467338726 os.sep.join(('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys'))
0.00261401937956 separ.join(('C:\WINNT\system32','Microsoft\Crypto','RSA\MachineKeys'))
True
with
separ = os.sep
ospath = os.path
ospathjoin = os.path.join
Everyone should know one non-obvious feature of os.path.join():
os.path.join( 'a', 'b' ) == 'a/b'
os.path.join( 'a', '/b' ) == '/b'
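In other words, an absolute component discards everything joined before it, which is worth remembering when any part of the path comes from user input:
>>> import os.path
>>> os.path.join('/usr', 'local', '/etc', 'passwd')
'/etc/passwd'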