Hi, I want to compute the sum of the integers from 1 to N using threading to speed things up. So I wrote:
import threading

N = int(input())
sum = 0
i = 1
lock = threading.Lock()

def thread_worker():
    global sum
    global i
    lock.acquire()
    sum += i
    i += 1
    lock.release()

for j in range(N):
    w = threading.Thread(target=thread_worker)
    w.start()
Because I don't want to mess up my variable i, I used a lock. However, the code is no faster, since only one thread can access the shared variable at a time.
So I want to know: is there any way I can improve the threading runtime? Would adding more variables and locks help?
Thanks!
I feel like your question deserves two answers:
One for how to use threading the right way.
One for how to speed up the summation of the integers from 1 to N.
How to use threading
There is always some overhead to threading: creating threads, locking, and so on. Even if this overhead is small, some of it will remain. This means that threading is only worth it if each thread has a significant amount of work to do. Your function does almost nothing inside the lock.acquire()/release() pair, so each thread spends essentially all of its time on overhead. Threads would help if you were doing more expensive work there. You are using the lock correctly -- this is just not a good problem for threading.
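For contrast, here is a minimal sketch (my own illustration, not part of your code) of what giving each thread substantial work could look like: each thread sums a contiguous subrange into its own slot, so no lock is needed on the hot path, and the partial results are combined at the end. Note that in CPython the GIL still keeps pure-Python arithmetic from running in parallel, so even this layout will not beat a sequential loop; it only shows the right structure.

import threading

def threaded_sum(N: int, num_threads: int = 4) -> int:
    partials = [0] * num_threads

    def worker(idx: int, start: int, stop: int):
        # Sum a whole subrange locally; only one write to shared state.
        total = 0
        for value in range(start, stop):
            total += value
        partials[idx] = total

    step = N // num_threads + 1
    threads = [
        threading.Thread(target=worker, args=(idx, lo, min(lo + step, N + 1)))
        for idx, lo in enumerate(range(1, N + 1, step))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

print(threaded_sum(100))  # 5050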
Fast Summation of 1..N
Measure, Measure, Measure
I'm running Python 3.7, and have pytest-benchmark 3.2.2 installed. Pytest-benchmark runs a function for you an appropriate number of times, and reports statistics on the runtime.
import threading
import pytest

def compute_variant1(N: int) -> int:
    global sum_
    global i
    sum_ = 0
    i = 1
    lock = threading.Lock()

    def thread_worker():
        global sum_
        global i
        lock.acquire()
        sum_ += i
        i += 1
        lock.release()

    threads = []
    for j in range(N):
        threads.append(threading.Thread(target=thread_worker))
        threads[-1].start()
    for t in threads:
        t.join()
    return sum_

@pytest.mark.parametrize("func", [compute_variant1])
@pytest.mark.parametrize("N,expected", [(10, 55), (30, 465), (100, 5050)])
def test_var1(benchmark, func, N, expected):
    result = benchmark(func, N=N)
    assert result == expected
Run this with py.test --benchmark-histogram=bench and open the generated bench.svg. It also prints this table:
----------------------------------------------------------------------------------------------------- benchmark: 3 tests -----------------------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_var1[10-55-compute_variant1] 570.0710 (1.0) 4,900.3160 (1.51) 658.8392 (1.0) 270.6056 (1.36) 602.5220 (1.0) 42.6373 (1.0) 19;82 1,517.8209 (1.0) 529 1
test_var1[30-465-compute_variant1] 1,701.2970 (2.98) 3,237.5830 (1.0) 1,879.8490 (2.85) 198.3905 (1.0) 1,802.5160 (2.99) 146.4465 (3.43) 59;43 531.9576 (0.35) 432 1
test_var1[100-5050-compute_variant1] 5,809.7500 (10.19) 13,354.3520 (4.12) 6,698.4778 (10.17) 1,413.1428 (7.12) 6,235.4355 (10.35) 766.0440 (17.97) 6;7 149.2876 (0.10) 74 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simpler Algorithms are Faster
Often it is really helpful to compare a fast implementation against the most straightforward one.
The simplest implementation I can think of -- without using Python tricks -- is a simple loop:
def compute_variant2(N):
    sum_ = 0
    for i in range(1, N+1):
        sum_ += i
    return sum_
Python has a built-in function called sum() which takes a list or any other iterable, like range(N):
def compute_variant3(N: int) -> int:
    return sum(range(1, N+1))
Here are the results (I've removed some columns):
------------------------------------------------------- benchmark: 3 tests ------------------------------------------------
Name (time in us) Mean StdDev Median Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------
test_var1[100-5050-compute_variant2] 4.2328 (1.0) 1.6411 (1.0) 4.1150 (1.0) 163106 1
test_var1[100-5050-compute_variant3] 4.7113 (1.11) 1.6773 (1.02) 4.5560 (1.11) 141744 1
test_var1[100-5050-compute_variant1] 6,404.1856 (>1000.0) 668.2502 (407.21) 6,257.6385 (>1000.0) 106 1
---------------------------------------------------------------------------------------------------------------------------
As you can see the variant1 based on threading is many, many times slower than the sequential implementations.
Even faster using Mathematics
Carl Friedrich Gauss, when he was a child, rediscovered a closed-form solution to our problem: N * (N + 1) / 2.
This can be expressed in Python:
def compute_variant4(N: int) -> int:
    return N * (N + 1) // 2
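For intuition, the classic pairing argument: write the sum forwards and backwards and add columnwise; each of the N columns sums to N + 1, so twice the sum is N * (N + 1):

     1    +    2     + ... +    N
  +  N    + (N - 1)  + ... +    1
  = (N + 1) + (N + 1) + ... + (N + 1)    (N columns)
  = N * (N + 1)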
Let's compare that to the other fast implementations (I'll leave out the first one). As you can see in the table below, the last variant is faster, and, importantly, its runtime is independent of N:
---------------------------------------------------------------------------------------------------- benchmark: 12 tests -----------------------------------------------------------------------------------------------------
Name (time in ns) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
var1[10-55-compute_variant4] 162.0100 (1.0) 1,053.6800 (1.0) 170.1799 (1.0) 32.5862 (1.0) 163.7200 (1.0) 8.1800 (1.59) 1135;1225 5,876.1353 (1.0) 58310 100
var1[13-91-compute_variant4] 162.1400 (1.00) 1,354.3200 (1.29) 181.5037 (1.07) 46.2132 (1.42) 176.1800 (1.08) 5.1500 (1.0) 2214;14374 5,509.5296 (0.94) 61452 100
var1[30-465-compute_variant4] 188.8900 (1.17) 1,874.2800 (1.78) 200.4983 (1.18) 40.8533 (1.25) 191.0000 (1.17) 9.8300 (1.91) 1245;1342 4,987.5732 (0.85) 51919 100
var1[100-5050-compute_variant4] 192.9091 (1.19) 5,938.7273 (5.64) 209.0628 (1.23) 75.8845 (2.33) 198.5455 (1.21) 10.9091 (2.12) 1879;4696 4,783.2508 (0.81) 194515 22
var1[10-55-compute_variant2] 676.1000 (4.17) 18,987.8000 (18.02) 719.4231 (4.23) 194.5556 (5.97) 689.0000 (4.21) 34.9000 (6.78) 1447;2199 1,390.0027 (0.24) 125898 10
var1[13-91-compute_variant2] 753.9000 (4.65) 12,103.8000 (11.49) 799.2837 (4.70) 201.3654 (6.18) 766.5000 (4.68) 38.1000 (7.40) 1554;3049 1,251.1203 (0.21) 124441 10
var1[10-55-compute_variant3] 1,021.0000 (6.30) 77,718.8000 (73.76) 1,157.6125 (6.80) 544.1982 (16.70) 1,098.2000 (6.71) 73.0000 (14.17) 3802;12244 863.8469 (0.15) 186672 5
var1[13-91-compute_variant3] 1,127.6000 (6.96) 44,606.4000 (42.33) 1,279.9332 (7.52) 476.7700 (14.63) 1,200.2000 (7.33) 90.0000 (17.48) 4022;21018 781.2908 (0.13) 172147 5
var1[30-465-compute_variant2] 1,304.7500 (8.05) 48,218.5000 (45.76) 1,457.3923 (8.56) 550.7975 (16.90) 1,385.7500 (8.46) 80.2500 (15.58) 3944;22221 686.1570 (0.12) 177086 4
var1[30-465-compute_variant3] 1,738.6667 (10.73) 86,587.3333 (82.18) 1,935.1659 (11.37) 762.7118 (23.41) 1,860.3333 (11.36) 128.6667 (24.98) 2474;6870 516.7516 (0.09) 176773 3
var1[100-5050-compute_variant2] 3,891.0000 (24.02) 181,009.0000 (171.79) 4,218.4608 (24.79) 1,721.8413 (52.84) 3,998.0000 (24.42) 210.0000 (40.78) 1788;2670 237.0533 (0.04) 171881 1
var1[100-5050-compute_variant3] 4,200.0000 (25.92) 76,885.0000 (72.97) 4,516.5853 (26.54) 1,452.5713 (44.58) 4,343.0000 (26.53) 210.0000 (40.78) 1204;2311 221.4062 (0.04) 153587 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Related
I'm running some runtime tests to understand what I can gain from parallelization and how runtime scales (linearly?).
For a given integer n I compute the i-th Fibonacci number for each i in {0, 1, ..., n} and vary the degree of parallelization by using from 1 up to 16 parallel processes.
import pandas as pd
import time
import multiprocessing as mp

# n-th Fibonacci number
def f(n: int):
    if n in {0, 1}:
        return n
    return f(n - 1) + f(n - 2)

if __name__ == "__main__":
    K = range(1, 16 + 1)
    n = 100
    N = range(n)
    df_dauern = pd.DataFrame(index=K, columns=N)  # "dauern" = durations
    for _n in N:
        _N = range(_n)
        print(f'\nn = {_n}')
        for k in K:
            start = time.time()
            pool = mp.Pool(k)
            pool.map(f, _N)
            pool.close()
            pool.join()
            ende = time.time()   # end time
            dauer = ende - start  # duration
            m, s = divmod(dauer, 60)
            h, m = divmod(m, 60)
            h, m, s = round(h), round(m), round(s)
            df_dauern.loc[k, _n] = f'{h}:{m}:{s}'
            print(f'... k = {k:02d}, duration: {h}:{m}:{s}')
    df_dauern.to_excel('Dauern.xlsx')
In the following DataFrame I display the duration (h:m:s) for n in {45, 46, 47}; the rows give the number of parallel processes k.
45 46 47
1 0:9:40 0:15:24 0:24:54
2 0:7:24 0:13:23 0:22:59
3 0:5:3 0:9:37 0:19:7
4 0:7:18 0:7:19 0:15:29
5 0:7:21 0:7:17 0:15:35
6 0:3:41 0:9:34 0:9:36
7 0:3:40 0:9:46 0:9:34
8 0:3:41 0:9:33 0:9:33
9 0:3:39 0:9:33 0:9:33
10 0:3:39 0:9:32 0:9:32
11 0:3:39 0:9:34 0:9:45
12 0:3:40 0:6:4 0:9:37
13 0:3:39 0:5:54 0:9:32
14 0:3:39 0:5:55 0:9:32
15 0:3:40 0:5:53 0:9:33
16 0:3:39 0:5:55 0:9:33
In my opinion the results are odd in two ways. First, the duration is not monotonically decreasing with increasing parallelization, and second, the runtime does not decrease linearly (that is, doubling the processes does not halve the runtime).
Is this behavior to be expected?
Is this behavior due to the chosen example of computing Fibonacci numbers?
How is it even possible that runtime increases with increasing parallelization (e.g. always when moving from 2 to 3 parallel processes)?
How come it does not make a difference whether I use 6 or 16 parallel processes?
It's because of multiprocessing's scheduling algorithm, together with the fact that the recursive Fibonacci task has exponential cost in n. By default the pool chooses a chunksize relative to the number of workers: multiprocessing splits the work into equal chunks to reduce serialization overhead, and the chunksize is given by:
chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
chunksize += bool(extra)
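As a worked example (using the n = 45 run from the question, i.e. 45 tasks):

# 45 tasks, 4 workers: divmod(45, 4 * 4) -> (2, 13), so chunksize = 3
# 45 tasks, 5 workers: divmod(45, 5 * 4) -> (2, 5),  so chunksize = 3
# 45 tasks, 8 workers: divmod(45, 8 * 4) -> (1, 13), so chunksize = 2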
For 4 and 5 workers the chunksize is the same (3), and 99.9% of the time is taken by the last 3 tasks, which are scheduled on the same core (because they sit in one chunk), so one core ends up doing 99.9% of the work regardless of the core count. The extra 3 seconds between 4 and 5 workers are most likely scheduling overhead (more workers = more scheduling). You'll get a speedup if you set chunksize=1 in the pool.map parameters manually, as each of those 3 heavy tasks will then be scheduled to a different core; see the sketch below.
For worker counts higher than 6 the chunksize is calculated to be 2, but you have an odd number of tasks, which means you always end up waiting for the last chunk to finish, and that chunk contains the longest task. The entire 3:40 minutes are spent inside a single function call, which cannot be broken down further, so it doesn't matter whether you launch 6 workers or 100; you are still limited by the slowest task (or rather, the slowest chunk).
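A minimal sketch of the chunksize=1 change (reusing the naive f from the question):

import multiprocessing as mp

def f(n: int):
    # naive exponential-time Fibonacci, as in the question
    if n in {0, 1}:
        return n
    return f(n - 1) + f(n - 2)

if __name__ == "__main__":
    pool = mp.Pool(4)
    # chunksize=1 hands tasks to workers one at a time, so the few most
    # expensive calls (f(42), f(43), f(44)) can run on different cores
    # instead of being grouped into one chunk.
    results = pool.map(f, range(45), chunksize=1)
    pool.close()
    pool.join()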
I am a beginner with parallel processing, and I am currently experimenting with a simple program to understand how Ray works.
import numpy as np
import time
from pprint import pprint
import ray

ray.init(num_cpus=4)  # Specify this system has 4 CPUs.

data_rows = 800
data_cols = 10000
batch_size = int(data_rows/4)

# Prepare data
np.random.RandomState(100)
arr = np.random.randint(0, 100, size=[data_rows, data_cols])
data = arr.tolist()

# Solution without parallelization
def howmany_within_range(row, minimum, maximum):
    """Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

results = []
start = time.time()
for row in data:
    results.append(howmany_within_range(row, minimum=75, maximum=100))
end = time.time()
print("Without parallelization")
print("-----------------------")
pprint(results[:5])
print("Total time: ", end-start, "sec")

# Parallelization with Ray
results = []
y = []
z = []
w = []

@ray.remote
def solve(data, minimum, maximum):
    count = 0
    count_row = 0
    for i in data:
        for n in i:
            if minimum <= n <= maximum:
                count = count + 1
        count_row = count
        count = 0
    return count_row

start = time.time()
results = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(0, batch_size)])
y = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(1*batch_size, 2*batch_size)])
z = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(2*batch_size, 3*batch_size)])
w = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(3*batch_size, 4*batch_size)])
end = time.time()
results += y+z+w
print("With parallelization")
print("--------------------")
print(results[:5])
print("Total time: ", end-start, "sec")
I am getting much slower performance with Ray:
$ python3 raytest.py
Without parallelization
-----------------------
[2501, 2543, 2530, 2410, 2467]
Total time: 0.5162293910980225 sec
(solve pid=26294)
With parallelization
--------------------
[2501, 2543, 2530, 2410, 2467]
Total time: 1.1760196685791016 sec
In fact, if I scale up the input data, I get messages in the terminal with the pid of the function, and the program stalls.
Essentially, I am trying to split the computation into batches of rows and assign each batch to a CPU core. What am I doing wrong?
There are two main problems with multiprocessing (and with your code):
There is overhead associated with spawning new processes to do your work.
There is overhead associated with transferring data between different processes.
In order to spawn a new process, a new instance of the Python interpreter has to be created and initialized (separate processes are needed at all because of the GIL). Also, when you transfer data between processes, the data has to be serialized and deserialized at the sender and receiver, which in your program happens twice (once from the main process to the workers, and again from the workers back to the main process). In short, your program spends all its time paying this overhead instead of doing the actual computation.
If you want to benefit from multiprocessing in Python, you should have more computation done in the workers with as little data transfer as possible. The way I usually decide whether multiprocessing is a good idea is to ask whether the task would take more than about 5 seconds to complete on a single CPU.
Another good way to reduce data transfer is to slice your array into chunks (multiple rows) instead of sending a single row per function call, since each row has to be serialized separately, which adds extra overhead. A sketch of that idea follows.
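For illustration, here is a minimal sketch of that chunking idea (my own example; solve_batch is a hypothetical name, and it reuses the data, data_rows, and batch_size variables from the question):

import ray

@ray.remote
def solve_batch(rows, minimum, maximum):
    # One serialization round-trip covers a whole block of rows
    # instead of a single row per remote call.
    counts = []
    for row in rows:
        counts.append(sum(1 for n in row if minimum <= n <= maximum))
    return counts

# Four remote calls in total -- one per core -- instead of 800:
batches = [data[i:i + batch_size] for i in range(0, data_rows, batch_size)]
nested = ray.get([solve_batch.remote(b, 75, 100) for b in batches])
results = [c for counts in nested for c in counts]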
I have this code for computing Fibonacci numbers using a cache (a dictionary):
import time

cache = {}

def dynamic_fib(n):
    print n
    if n == 0 or n == 1:
        return 1
    if not (n in cache):
        print "caching %d" % n
        cache[n] = dynamic_fib(n-1) + dynamic_fib(n-2)
    return cache[n]

if __name__ == "__main__":
    start = time.time()
    print "DYNAMIC: ", dynamic_fib(2000)
    print (time.time() - start)
It works fine with small numbers, but with more than 1000 as an input, it seems to stop.
This is the result with 2000 as an input.
....
caching 1008
1007
caching 1007
1006
caching 1006
1005
caching 1005
This is a result with 1000 as an input.
....
8
caching 8
7
caching 7
6
caching 6
5
caching 5
It looks like after about 995 entries are stored in the dictionary, it just hangs.
What might be wrong here? What debugging techniques can I use to see what went wrong in Python?
I run Python on Mac OS X 10.7.5 and have 4 GB of RAM, so a few KB (or even MB) of memory usage shouldn't matter much.
Python has a default recursion limit of 1000. You need to increase it in your program:
import sys
sys.setrecursionlimit(5000)
From http://docs.python.org/2/library/sys.html#sys.setrecursionlimit:
sys.setrecursionlimit(limit)
Set the maximum depth of the Python interpreter stack to limit. This limit prevents infinite recursion from causing an overflow of the C stack and crashing Python.
The highest possible limit is platform-dependent. A user may need to set the limit higher when she has a program that requires deep recursion and a platform that supports a higher limit. This should be done with care, because a too-high limit can lead to a crash.
You don't really gain anything by storing the cache as a dictionary, since to calculate f(n) you need to know f(n-1) (and f(n-2)). In other words, your dictionary will always have keys from 2 to n. You might as well just use a list instead (it's only an extra 2 elements). Here's a version which caches properly and doesn't hit the recursion limit (ever):
import time

cache = [1, 1]

def dynamic_fib(n):
    #print n
    if n >= len(cache):
        for i in range(len(cache), n):
            dynamic_fib(i)
        cache.append(dynamic_fib(n-1) + dynamic_fib(n-2))
        print "caching %d" % n
    return cache[n]

if __name__ == "__main__":
    start = time.time()
    a = dynamic_fib(4000)
    print "Dynamic", a
    print (time.time() - start)
Note that you could do the same thing with a dict, but I'm almost positive that a list will be faster.
Just for fun, here's a bunch of options (and timings!):
def fib_iter(n):
    a, b = 1, 1
    for i in xrange(n):
        a, b = b, a + b
    return a

memo_iter = [1, 1]
def fib_iter_memo(n):
    if n == 0:
        return 1
    else:
        try:
            return memo_iter[n+1]
        except IndexError:
            a, b = memo_iter[-2:]
            for i in xrange(len(memo_iter), n+2):
                a, b = b, a + b
                memo_iter.append(a)
            return memo_iter[-1]

dyn_cache = [1, 1]
def dynamic_fib(n):
    if n >= len(dyn_cache):
        for i in xrange(len(dyn_cache), n):
            dynamic_fib(i)
        dyn_cache.append(dynamic_fib(n-1) + dynamic_fib(n-2))
    return dyn_cache[n]

dyn_cache2 = [1, 1]
def dynamic_fib2(n):
    if n >= len(dyn_cache2):
        for i in xrange(len(dyn_cache2), n):
            dynamic_fib2(i)
        dyn_cache2.append(dyn_cache2[-1] + dyn_cache2[-2])
    return dyn_cache2[n]

cache_fibo = [1, 1]
def dyn_fib_simple(n):
    while len(cache_fibo) <= n:
        cache_fibo.append(cache_fibo[-1] + cache_fibo[-2])
    return cache_fibo[n]

import timeit

for func in ('dyn_fib_simple', 'dynamic_fib2', 'dynamic_fib', 'fib_iter_memo', 'fib_iter'):
    print timeit.timeit('%s(100)' % func, setup='from __main__ import %s' % func), func

print fib_iter(100)
print fib_iter_memo(100)
print fib_iter_memo(100)
print dynamic_fib(100)
print dynamic_fib2(100)
print dyn_fib_simple(100)
And the results:
0.269892930984 dyn_fib_simple
0.256865024567 dynamic_fib2
0.241492033005 dynamic_fib
0.222282171249 fib_iter_memo
7.23831701279 fib_iter
573147844013817084101
573147844013817084101
573147844013817084101
573147844013817084101
573147844013817084101
573147844013817084101
A recursion-free version:
def fibo(n):
    cache = [1, 1]
    while len(cache) <= n:
        cache.append(cache[-1] + cache[-2])
    return cache[n]
It's probably because of the stack depth limit, which results in a RuntimeError. You can increase the recursion limit by calling
sys.setrecursionlimit(<number>)
from the sys module.
I wanted to see how much faster reduce was than using a for loop for simple numerical operations. Here's what I found (using the standard timeit library):
In [54]: print(setup)
from operator import add, iadd
r = range(100)
In [55]: print(stmt1)
c = 0
for i in r:
    c+=i
In [56]: timeit(stmt1, setup)
Out[56]: 8.948904991149902
In [58]: print(stmt3)
reduce(add, r)
In [59]: timeit(stmt3, setup)
Out[59]: 13.316915035247803
Looking a little more:
In [68]: timeit("1+2", setup)
Out[68]: 0.04145693778991699
In [69]: timeit("add(1,2)", setup)
Out[69]: 0.22807812690734863
What's going on here? Obviously, reduce does the looping faster than for, but the function call seems to dominate. Shouldn't the reduce version run almost entirely in C? Using iadd(c, i) in the for loop version makes it run in ~24 seconds. Why would using operator.add be so much slower than +? I was under the impression that + and operator.add run the same C code (I checked to make sure operator.add wasn't just calling + in Python or anything).
BTW, just using sum runs in ~2.3 seconds.
In [70]: print(sys.version)
2.7.1 (r271:86882M, Nov 30 2010, 09:39:13)
[GCC 4.0.1 (Apple Inc. build 5494)]
The reduce(add, r) must invoke the add() function 100 times, so the overhead of the function calls adds up -- reduce uses PyEval_CallObject to invoke add on each iteration:
for (;;) {
    ...
    if (result == NULL)
        result = op2;
    else {
        /* here it is creating a tuple to pass the previous result and the
           next value from range(100) into func add(): */
        PyTuple_SetItem(args, 0, result);
        PyTuple_SetItem(args, 1, op2);
        if ((result = PyEval_CallObject(func, args)) == NULL)
            goto Fail;
    }
Updated: response to a question in the comments.
When you type 1 + 2 in Python source code, the bytecode compiler performs the addition at compile time and replaces that expression with the constant 3:
import byteplay

f1 = lambda: 1 + 2
c1 = byteplay.Code.from_code(f1.func_code)
print c1.code

1           1 LOAD_CONST           3
            2 RETURN_VALUE
If you add two variables a + b the compiler will generate bytecode which loads the two variables and performs a BINARY_ADD, which is far faster than calling a function to perform the addition:
f2 = lambda a, b: a + b
c2 = byteplay.Code.from_code(f2.func_code)
print c2.code

1           1 LOAD_FAST            a
            2 LOAD_FAST            b
            3 BINARY_ADD
            4 RETURN_VALUE
It could be the overhead of copying args and return values (i.e. "add(1, 2)"), as opposed to simply operating on numeric literals.
edit: Switching to zeros() for the accumulator, instead of building it with array([0] * width), closes the gap big time.
from functools import reduce
from numpy import array, arange, zeros
from time import time

def add(x, y):
    return x + y

def sum_columns(x):
    if x.any():
        width = len(x[0])
        total = zeros(width)
        for row in x:
            total += array(row)
        return total

l = arange(3000000)
l = array([l, l, l])

start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))

start = time()
print(sum_columns(l))
print('For loop took {}'.format(time() - start))
That gets you down to almost no difference at all:
Reduce took 0.03230619430541992
For loop took 0.058577775955200195
Old: If reduce is used for adding together NumPy arrays by index, it can be faster than a for loop.
from functools import reduce
from numpy import array, arange
from time import time

def add(x, y):
    return x + y

def sum_columns(x):
    if x.any():
        width = len(x[0])
        total = array([0] * width)
        for row in x:
            total += array(row)
        return total

l = arange(3000000)
l = array([l, l, l])

start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))

start = time()
print(sum_columns(l))
print('For loop took {}'.format(time() - start))
With the result of
[ 0 3 6 ..., 8999991 8999994 8999997]
Reduce took 0.024930953979492188
[ 0 3 6 ..., 8999991 8999994 8999997]
For loop took 0.3731539249420166
I have a simple script to set up a Poisson distribution: it constructs an array of "events" with probability 0.1 each, and then counts the number of successes in each group of 10. It almost works, but the distribution is not quite right (P(0) should equal P(1), but is instead about 90% of P(1)). It's like there's an off-by-one kind of error, but I can't figure out what it is. The script uses the Counter class from here (because I have Python 2.6 and not 2.7), and the grouping uses itertools as discussed here. It's not a stochastic issue: repeats give pretty tight results, the overall mean looks good, and the group size looks good. Any ideas where I've messed up?
from itertools import izip_longest
import numpy as np
import Counter

def groups(iterable, n=3, padvalue=0):
    "groups('abcde', 3, 'x') --> ('a','b','c'), ('d','e','x')"
    return izip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

def event():
    f = 0.1
    r = np.random.random()
    if r < f: return 1
    return 0

L = [event() for i in range(100000)]
rL = [sum(g) for g in groups(L, n=10)]
print len(rL)
print sum(list(L))
C = Counter.Counter(rL)
for i in range(max(C.keys())+1):
    print str(i).rjust(2), C[i]
$ python script.py
10000
9949
0 3509
1 3845
2 1971
3 555
4 104
5 15
6 1
$ python script.py
10000
10152
0 3417
1 3879
2 1978
3 599
4 115
5 12
I did a combinatorial reality check on your math, and it looks like your results are actually correct: P(0) should not be roughly equal to P(1). Counting successes in groups of 10 gives a binomial(10, 0.1) distribution, which only approximates a Poisson; for the binomial, P(0)/P(1) = 0.9^10 / (10 * 0.1 * 0.9^9) = 0.9, which is exactly the ~90% you observed.
.9^10 = 0.34867844 = probability of 0 events
.1 * .9^9 * (10 choose 1) = .1 * .9^9 * 10 = 0.387420489 = probability of 1 event
I wonder if you accidentally did your math thusly:
.1 * .9^10 * (10 choose 1) = 0.34867844 = incorrect probability of 1 event
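As a quick sanity check (my addition, not part of the original answer), you can compare the exact binomial(10, 0.1) probabilities against the observed counts from the script's 10000 groups:

def choose(n, k):
    # exact binomial coefficient, built up incrementally
    result = 1
    for i in range(k):
        result = result * (n - i) // (i + 1)
    return result

p, n, groups = 0.1, 10, 10000
for k in (0, 1):
    prob = choose(n, k) * p**k * (1 - p)**(n - k)
    print k, int(round(prob * groups))
# prints: 0 3487 and 1 3874 -- close to the observed 3509 and 3845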