I ran a little test between Excel (VBA) and Python performing a simple loop. Code listed below. To my surprise, VBA was significantly faster than Python, almost 6 times faster. I thought that because Python runs from the command line, its performance would be better. Do you guys have any comments on this?
Python
import time
import ctypes  # included with the standard Python install (used below for the Windows message box)
start_time = time.time()
for x in range(0, 1000000):
    print x
x = ("--- %s seconds ---" % (time.time() - start_time))
ctypes.windll.user32.MessageBoxA(0, x, "Your title", 1)
Excel (VBA)
Sub looptest()
    Dim MyTimer As Double
    MyTimer = Timer
    Dim rng As Range, cell As Range
    Set rng = Range("A1:A1000000")
    x = 1
    For Each cell In rng
        cell.Value = x
        x = x + 1
    Next cell
    MsgBox Timer - MyTimer
End Sub
Your two code samples are not doing the same thing. In the Python code, the inner loop has to:
Ask for the next number in range(0, 1000000).
Display it.
In the VBA code, Excel has to:
Ask for the next cell in Range("A1:A1000000") (which has nothing to do with Python ranges).
Set the cell.Value property.
Run through various code Excel executes whenever it changes a cell.
Check to see if any formulas need to be recalculated.
Display it.
Increment x.
Let's rewrite this so the Python and VBA loops do the same thing, as near as we can:
Python
import time
import ctypes
start_time = time.time()
x = 0
while x <= 1000000:
    x = x + 1
x = ("--- %s seconds ---" % (time.time() - start_time))
ctypes.windll.user32.MessageBoxA(0, x, "Your title", 1)
VBA
Declare Function QueryPerformanceCounter Lib "kernel32" (t As Currency) As Boolean
Declare Function QueryPerformanceFrequency Lib "kernel32" (t As Currency) As Boolean
Sub looptest()
    Dim StartTime As Currency
    QueryPerformanceCounter StartTime
    x = 0
    Do While x <= 1000000
        x = x + 1
    Loop
    Dim EndTime As Currency
    QueryPerformanceCounter EndTime
    Dim Frequency As Currency
    QueryPerformanceFrequency Frequency
    MsgBox Format$((EndTime - StartTime) / Frequency, "0.000")
End Sub
On my computer, Python takes about 96 ms, and VBA 33 ms – VBA performs three times faster. If you throw in a Dim x As Long, it performs six times faster.
Why? Well, let's look at how each gets run. Python internally compiles your .py file into a .pyc, and runs it under the Python VM. Another answer describes the Python case in detail. Excel compiles VBA into MS P-Code, and runs it under the Visual Basic VM.
At this point, it doesn't matter that python.exe is command-line and Excel is GUI. The VM runs your code, and it lives a little deeper in the bowels of your computer. Performance depends on what specific instructions are in the compiled code, and how efficiently the VM runs these instructions. In this case, the VB VM ran its P-Code faster than the Python VM ran its .pyc.
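If you want to see that compilation step for yourself, the standard library exposes it directly. A small sketch (the loop.py filename is just a placeholder; dis prints the instructions the Python VM interprets):
import py_compile
import dis

py_compile.compile("loop.py")  # writes the cached .pyc for a script named loop.py

code = compile("x = 0\nwhile x <= 1000000:\n    x = x + 1\n", "<loop>", "exec")
dis.dis(code)  # the bytecode instructions the Python VM executes for the loop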
The slow part here is the print. Printing to the console is incredibly slow, so you should avoid it in a benchmark. I assume that setting cell values in Excel is simply much faster.
If you want to compare computation speed, you should not have any I/O inside the loop. Instead, time the whole loop while doing nothing inside it (or doing something trivial such as adding a number). If you do that, you will see that Python is very fast.
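For instance, a minimal way to time just the loop in Python, with all I/O kept outside the timed region (a sketch assuming Python 3, where time.perf_counter is available):
import time

start = time.perf_counter()
x = 0
for _ in range(1000000):
    x += 1  # trivial work only; no printing or other I/O inside the loop
elapsed = time.perf_counter() - start
print("loop took %.3f seconds" % elapsed)  # report once, after timing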
It varies with the computation you need to perform, and it is difficult to prove which is faster with a single simple comparison. So I will share my experience in general terms, without proof, but with some comments on the things that can make a big difference.
In my experience, if you compare a simple for loop with exactly the same code in pure Python and pure VBA, VBA is around 3 times faster. But nobody writes the same code in different languages; you have to apply each language's best practices.
If you apply VBA best practices, you can make it even faster by declaring variable types and applying similar optimizations that are not available in Python. In my experience this can make the code around 2-3 times faster again, so VBA can end up around 6-9 times faster than a simple for loop in Python.
On the other hand, if you apply Python best practices, you generally won't write a plain for loop at all. You will use a list comprehension, or numpy and scipy, which run in compiled C libraries. Those solutions are much faster than VBA code.
In general, if you perform complex matrix calculations that you can do with numpy and scipy, Python will be faster than VBA. In other cases, VBA is faster.
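As a rough illustration of what "best practices" means here, this sketch (assuming numpy is installed) writes the same summation three ways; on typical machines the vectorized numpy version is by far the fastest, but the exact ratios depend on your setup:
import numpy as np

n = 1000000

# 1) plain for loop, evaluated step by step by the interpreter
total = 0
for i in range(n):
    total += i

# 2) built-in sum over a range, closer to idiomatic Python
total2 = sum(range(n))

# 3) vectorized numpy, executed inside compiled C code
total3 = np.arange(n, dtype=np.int64).sum()
You can wrap each variant in timeit to see the ratios on your own machine.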
In Python you can also use Numba, which adds a little complexity to the code but generates compiled code and can even run it on the GPU. It will make your code faster still.
This is my experience with pure computation, mainly involving in-memory arrays. I haven't compared performance for I/O with a GUI, external files and databases, network communication, or APIs.
To compare the performance of Spark when using Python and Scala, I created the same job in both languages and compared the runtimes. I expected both jobs to take roughly the same amount of time, but the Python job took only 27 min, while the Scala job took 37 min (almost 40% longer!). I implemented the same job in Java as well, and it took 37 minutes too. How is it possible that Python is so much faster?
Minimal verifiable example:
Python job:
# Configuration
conf = pyspark.SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)
# 960 Files from a public dataset in 2 batches
input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
# Count occurrences of a certain string
logData = sc.textFile(input_files)
logData2 = sc.textFile(input_files2)
a = logData.filter(lambda value: value.startswith('WARC-Type: response')).count()
b = logData2.filter(lambda value: value.startswith('WARC-Type: response')).count()
print(a, b)
Scala job:
// Configuration
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
// 960 Files from a public dataset in 2 batches
val input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
val input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
// Count occurrences of a certain string
val logData1 = sc.textFile(input_files)
val logData2 = sc.textFile(input_files2)
val num1 = logData1.filter(line => line.startsWith("WARC-Type: response")).count()
val num2 = logData2.filter(line => line.startsWith("WARC-Type: response")).count()
println(s"Lines with a: $num1, Lines with b: $num2")
Just by looking at the code, they seem to be identical. I looked at the DAGs and they didn't provide any insights (or at least I lack the know-how to come up with an explanation based on them).
I would really appreciate any pointers.
Your basic assumption, that Scala or Java should be faster for this specific task, is just incorrect. You can easily verify it with minimal local applications. Scala one:
import scala.io.Source
import java.time.{Duration, Instant}
object App {
  def main(args: Array[String]) {
    val Array(filename, string) = args
    val start = Instant.now()
    Source
      .fromFile(filename)
      .getLines
      .filter(line => line.startsWith(string))
      .length
    val stop = Instant.now()
    val duration = Duration.between(start, stop).toMillis
    println(s"${start},${stop},${duration}")
  }
}
Python one
import datetime
import sys
if __name__ == "__main__":
    _, filename, string = sys.argv
    start = datetime.datetime.now()
    with open(filename) as fr:
        # Not idiomatic or the most efficient but that's what
        # PySpark will use
        sum(1 for _ in filter(lambda line: line.startswith(string), fr))
    end = datetime.datetime.now()
    duration = round((end - start).total_seconds() * 1000)
    print(f"{start},{end},{duration}")
Results (300 repetitions each, Python 3.7.6, Scala 2.11.12), on Posts.xml from the hermeneutics.stackexchange.com data dump, with a mix of matching and non-matching patterns:
Python 273.50 (258.84, 288.16)
Scala 634.13 (533.81, 734.45)
As you can see, Python is not only systematically faster, but also more consistent (lower spread).
The take-away message is: don't believe unsubstantiated FUD. Languages can be faster or slower on specific tasks or in specific environments (for example, Scala here can be hit by JVM startup and/or GC and/or JIT), but if you see claims like "XYZ is 4x faster" or "XYZ is slow as compared to ZYX (..), approximately 10x slower", it usually means that someone wrote really bad code to test things.
Edit:
To address some concerns raised in the comments:
In the OP's code, data is passed mostly in one direction (JVM -> Python) and no real serialization is required (this particular path just passes the bytestring as-is and decodes it as UTF-8 on the other side). That's as cheap as it gets when it comes to "serialization".
What is passed back is just a single integer per partition, so in that direction the impact is negligible.
Communication is done over local sockets (all communication on the worker beyond the initial connect and auth is performed using the file descriptor returned from local_connect_and_auth, which is nothing more than a file associated with a socket). Again, as cheap as it gets for communication between processes.
Considering the difference in raw performance shown above (much higher than what you see in your program), there is a lot of margin for the overheads listed above.
This case is completely different from cases where simple or complex objects have to be passed to and from the Python interpreter in a form that is accessible to both parties as pickle-compatible dumps (the most notable examples include old-style UDFs and some parts of old-style MLlib).
Edit 2:
Since Jasper-M was concerned about startup cost here, one can easily prove that Python still has a significant advantage over Scala even if the input size is significantly increased.
Here are the results for 2003360 lines / 5.6G (the same input, just duplicated multiple times, 30 repetitions), which far exceeds anything you can expect in a single Spark task.
Python 22809.57 (21466.26, 24152.87)
Scala 27315.28 (24367.24, 30263.31)
Please note non-overlapping confidence intervals.
Edit 3:
To address another comment from Jasper-M:
The bulk of all the processing is still happening inside a JVM in the Spark case.
That is simply incorrect in this particular case:
The job in question is a map job with a single global reduce, using PySpark RDDs.
PySpark RDDs (unlike, say, DataFrames) implement the bulk of their functionality natively in Python, with the exception of input, output, and inter-node communication.
Since it is a single-stage job, and the final output is small enough to be ignored, the main responsibility of the JVM (if one were to nitpick, this is implemented mostly in Java, not Scala) is to invoke the Hadoop input format and push data through the socket file to Python.
The read part is identical for the JVM and Python APIs, so it can be considered a constant overhead. It also doesn't qualify as the bulk of the processing, even for a job as simple as this one.
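For contrast, if you did want the filtering itself to stay inside the JVM, you could express roughly the same count through the DataFrame API instead of RDDs. A sketch only (not benchmarked here, reusing the input_files path from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Each line becomes a row with a single "value" column; the filter and count
# are compiled into a JVM execution plan, so no rows are shipped to Python workers.
df = spark.read.text(input_files)
a = df.filter(col("value").startswith("WARC-Type: response")).count()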
The Scala job takes longer because of a misconfiguration: as a result, the Python and Scala jobs were provided with unequal resources.
There are two mistakes in the code:
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
sc.hadoopConfiguration.set("spark.executor.instances", "4") // LINE #4
sc.hadoopConfiguration.set("spark.executor.cores", "8") // LINE #5
LINE 1. Once this line has been executed, the resource configuration of the Spark job is established and fixed. From this point on there is no way to adjust anything, neither the number of executors nor the number of cores per executor.
LINES 4-5. sc.hadoopConfiguration is the wrong place to set any Spark configuration. It should be set in the config instance you pass to new SparkContext(config).
[ADDED]
Bearing the above in mind, I would propose changing the code of the Scala job to
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
and re-testing it. I bet the Scala version is going to be X times faster now.
When I run the PyCharm profiler (a quick intro video is here: https://www.youtube.com/watch?v=QSueV8MYtlw ) I get thousands of lines like hasattr or npyio.py (where did that come from? I don't even use numpy), which do not help me understand what's going on at all.
How can I make the PyCharm profiler show only timings of my source code, not any libraries or system calls?
In other words, can the time spent in system calls and libraries be assigned to my functions which call them?
In other words (version two), all I want is the number of milliseconds next to each line of my Python code, nothing else.
I created some code to provide an example and, hopefully, an acceptable answer:
import datetime as dt
class something:
    def something_else(self):
        other_list = range(100000)
        for num in other_list:
            datetimeobj = dt.datetime.fromtimestamp(num)
            print(num)
            print(datetimeobj)

    def something_different(self):
        other_list = range(100000)
        for num in other_list:
            datetimeobj = dt.datetime.fromtimestamp(num)
            print(num)
            print(datetimeobj)

st = something()
st.something_else()
st.something_different()
The code resulted in the picture below, which I have sorted by name. (In my case this is possible because all of the built-in methods are prefixed by "<".) After doing this I can see that:
main took 100% of the total time (column: Time (ms)).
something_else took 50.8% of the time and something_different took 49.2% (also totaling 100%) (column: Time (ms)).
The time spent inside each of the two home-grown methods was 2.0% (column: Own Time (ms)).
This means that the underlying calls from something_else accounted for 48.8% and those from something_different for 47.2%, so the parts that I wrote accounted for 4.0% of the total time. The remaining 96.0% is spent in the built-in methods that I call.
Your questions were:
How can I make the PyCharm profiler show only timings of my source code, not any libraries or system calls? -> That's what you see in the column "Own Time (ms)" -> 2.0% (time spent inside the specific method itself).
In other words, can the time spent in system calls and libraries be assigned to my functions which call them? -> That's what you see in the column "Time (ms)" (time spent including underlying calls).
Subtract the two columns and you get the time spent only in the underlying methods.
I have unfortunately been unable to find a way to filter inside the profiler itself, but it is possible to export the list by copying it; that way you could build something else to do the filtering, e.g. on "<built_in", to clean up the data.
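If you are willing to step outside PyCharm's GUI for a moment, a possible workaround is to drive cProfile yourself and restrict the printed statistics to your own file. A minimal sketch using the example class above (the "my_script" string is a placeholder for whatever substring matches your source file's path):
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

st = something()
st.something_else()
st.something_different()

profiler.disable()
stats = pstats.Stats(profiler)
# print_stats() accepts regex restrictions matched against "file:lineno(function)",
# so only entries whose path matches "my_script" are shown.
stats.sort_stats("cumulative").print_stats("my_script")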
I am new to Python and figured I'd play around with problems on Project Euler to have something concrete to work on in the meantime.
I came across the idea of timing different solutions to see how they rate against each other. That simple task turned out to be too complicated for my taste, however. I read that time.clock() calls are not accurate enough on Unix systems (seconds resolution is simply pathetic with modern processors). Thus I stumbled upon the timeit module, which seems to be the first choice for profiling tasks.
I have to say I really don't understand why they went with such a counter-intuitive way of going about it. I can't seem to get it to work without rewriting/restructuring my code, which I find very frustrating.
Take the code below, and never mind for a second that it's neither pretty nor particularly efficient:
import math
import sys
from timeit import Timer
def digitsum(number):
    rem = 0
    while number > 0:
        rem += number % 10
        number //= 10
    return rem

def prime_form(p):
    if p == 2 or p == 3 or p == 5:
        return True
    elif (p-1) % 6 != 0 and (p+1) % 6 != 0:
        return False
    elif digitsum(p) % 3 == 0:
        return False
    elif p % 10 == 0 or p % 10 == 5:
        return False
    else:
        return True

def lfactor(n):
    if n <= 3:
        return 1
    limit = int(math.sqrt(n))
    if limit % 2 == 0:
        limit -= 1
    lfac = 1
    for i in range(3, limit+1, 2):
        if prime_form(i):
            (div, rem) = divmod(n, i)
            if rem == 0:
                lfac = max(lfac, max(lfactor(div), lfactor(i)))
    return lfac if lfac != 1 else n

number = int(sys.argv[1])
t = Timer("""print lfactor(number)""", """import primefacs""")
t.timeit(100)
#print lfactor(number)
If I want to time the line print lfactor(number), why should I have to jump through hoops defining a setup statement, etc.? I understand why one would want debugging tools that are detached from the code being tested (à la unit testing), but shouldn't there be a simple and straightforward way to get the process time of a chunk of code without much hassle (importing/defining a setup, etc.)? What I am thinking of is something like the way one would do it in Java:
long t0 = System.currentTimeMillis();
// do something
long t = System.currentTimeMillis() - t0;
.. or even better with MATLAB, using the tic/toc commands:
tic
x = A\b;
t(n) = toc;
I hope this doesn't come across as a rant; I am really trying to understand "the Pythonic way of thinking", but honestly it doesn't come naturally here, not at all...
Simple: the logic behind separating the statement and the setup is that the setup is not part of the code you want to benchmark. Normally a Python module is loaded once, while the functions inside it are run more than once, often much more.
A Pythonic way to use timeit?
$ python -m timeit -h
Tool for measuring execution time of small code snippets.
This module avoids a number of common traps for measuring execution
times. See also Tim Peters' introduction to the Algorithms chapter in
the Python Cookbook, published by O'Reilly.
Library usage: see the Timer class.
Command line usage:
python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-h] [--] [statement]
Options:
-n/--number N: how many times to execute 'statement' (default: see below)
-r/--repeat N: how many times to repeat the timer (default 3)
-s/--setup S: statement to be executed once initially (default 'pass')
-t/--time: use time.time() (default on Unix)
-c/--clock: use time.clock() (default on Windows)
-v/--verbose: print raw timing results; repeat for more digits precision
-h/--help: print this usage message and exit
--: separate options from statement, use when statement starts with -
statement: statement to be timed (default 'pass')
[cut]
$ python -m timeit -s 'from primefacs import lfactor' 'lfactor(42)'
$ # this does not work, primefacs is not bound, i.e. not loaded
$ python -m timeit 'primefacs.lfactor(42)'
$ # this does not work either, lfactor is not defined
$ python -m timeit 'lfactor(42)'
$ # this works, but the time to import primefacs is benchmarked too
$ # (the import only runs the first time; later iterations use the cached module)
$ python -m timeit 'import primefacs; primefacs.lfactor(42)'
As you can see, the way timeit works is much more intuitive than you think.
Edit to add:
I read that the time.clock() calls are not accurate enough on unix
systems (seconds resolution is simply pathetic with modern
processors).
quoting the documentation:
On Unix, return the current processor time as a floating point number
expressed in seconds. The precision, and in fact the very definition
of the meaning of “processor time”, depends on that of the C function
of the same name, but in any case, this is the function to use for
benchmarking Python or timing algorithms... The resolution is
typically better than one microsecond.
going on..
I have to say I really don't understand why they went with such a
counter-intuitive way to go about it. I can't seem to get it to work,
without needing to rewrite/restructure my code, which I find very
frustrating.
Yes, it could be, but this is one of those cases where the documentation can help you: here is a link to the examples for the impatient, and here a more gentle introduction to timeit.
When timing a statement, you want to time just that statement, not the setup. The setup could be considerably slower than the statement-under-test.
Note that timeit runs your statement thousands of times to get a reasonable average. It does this to eliminate the effects of OS scheduling and other processes (including but not limited to disk buffer flushing, cronjob execution, memory swapping, etc); only an average time would have any meaning when comparing different code alternatives.
For your case, just test lfactor(number) directly, and just use the timeit() function:
timeit.timeit('lfactor(number)', 'from __main__ import lfactor, number')
The setup code retrieves the lfactor() function, as well as number taken from sys.argv from the main script; the function and number won't otherwise be seen.
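If you want to see run-to-run variation rather than a single number, timeit.repeat() works the same way and returns one total per repetition; taking the minimum is the usual convention. A sketch building on the call above:
import timeit

times = timeit.repeat('lfactor(number)',
                      'from __main__ import lfactor, number',
                      repeat=5, number=1000)
print(min(times))  # best of 5 repetitions of 1000 calls each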
There is absolutely no point in performance testing the print statement, that's not what you are trying to time. Using timeit is not about seeing the result of the call, just the time it takes to run it. Since the code-under-test is run thousands of times, all you'd get is thousands of prints of (presumably) the same result.
Note that usually timeit is used to compare performance characteristics of short python snippets; to find performance bottlenecks in more complex code, use profiling instead.
If you want to time just one run, use the timeit.default_timer() function to get the most accurate timer for your platform:
import timeit

timer = timeit.default_timer
start = timer()
print lfactor(number)
time_taken = timer() - start
I wish to plot time against input size for the longest common subsequence problem, in both the recursive and the dynamic programming approaches. So far I have developed programs for evaluating the LCS functions both ways, a simple random string generator (with help from here), and a program to plot the graph.
Now I have to connect all of these: the two programs that compute the LCS should each run about 10 times, with output from the simple random string generator given as command-line arguments to these programs.
The time taken to execute these programs is measured and, along with the length of the strings used, stored in a file like
l=15, r=0.003, c=0.001
This is parsed by the Python program to populate the following lists
sequence_lengths = []
recursive_times = []
dynamic_times = []
and then the graph is plotted. I have the following questions regarding the above.
1) How do I pass the output of one C program to another C program as command line arguments?
2) Is there any function to measure the time taken to execute a function in microseconds? At present the only option I have is the time utility in Unix, and being a command-line tool it is harder to integrate.
Any help would be much appreciated.
If the data being passed from program to program is small and can be converted to character format, you can pass it as one or more command-line arguments. If not, you can write it to a file and pass its name as an argument.
For Python programs, many people use the timeit module's Timer class to measure code execution speed. You can also roll your own using the clock() or time() functions in the time module; the resolution depends on what platform you're running on.
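For the plumbing itself you don't even need shell tricks: the Python script that draws the graph can run the generator and the two C programs directly and time them. A rough sketch, where gen_string, lcs_recursive and lcs_dynamic are placeholder names for your binaries, and the generator is assumed to print two whitespace-separated strings:
import subprocess
import time

sequence_lengths = []
recursive_times = []
dynamic_times = []

for _ in range(10):
    # Output of the string generator becomes the command-line arguments of the LCS programs.
    s1, s2 = subprocess.check_output(["./gen_string"]).split()

    start = time.time()
    subprocess.check_call(["./lcs_recursive", s1, s2])
    recursive_times.append(time.time() - start)

    start = time.time()
    subprocess.check_call(["./lcs_dynamic", s1, s2])
    dynamic_times.append(time.time() - start)

    sequence_lengths.append(len(s1))
Note that this times the whole process, including startup overhead; for microsecond precision you would time inside the C programs themselves (e.g. with clock() or getrusage()).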
1) There are many ways. The simplest is to use system() with a string constructed from the output (or popen() to open it as a pipe if you need to read back its output); or, if you want to leave the current program, you can use one of the exec family of functions (placing the output in the arguments).
In an sh shell you can also do this with command2 $(command1 args_to_command_1)
2) For timing in C, see clock and getrusage.
I was optimising some Python code, and tried the following experiment:
import time
start = time.clock()
x = 0
for i in range(10000000):
    x += 1
end = time.clock()
print '+=',end-start
start = time.clock()
x = 0
for i in range(10000000):
    x -= -1
end = time.clock()
print '-=',end-start
The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried varying the order of the loops, the number of executions, etc., and it still seems to hold.
Stranger,
for i in range(10000000, 0, -1):
(ie running the loop backwards) is faster than
for i in range(10000000):
even when loop contents are identical.
What gives, and is there a more general programming lesson here?
I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:
('+=', 11.370000000000001)
('-=', 10.769999999999998)
First, some observations:
This is 5% for a trivial operation. That's significant.
The speed of the native addition and subtraction opcodes is irrelevant. It's in the noise floor, completely dwarfed by the bytecode evaluation; we're talking about one or two native instructions among thousands.
The bytecode contains exactly the same number of instructions; the only difference is INPLACE_ADD vs. INPLACE_SUBTRACT and +1 vs. -1.
Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx. INPLACE_ADD has a significant extra block of code to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT, since you can't subtract strings. That means INPLACE_ADD contains more native code. Depending (heavily!) on how the code is generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtraction. This could be causing extra L2 cache misses, which could cause a significant performance difference.
This is heavily dependent on the system you're on (different processors have different amounts of cache and cache architectures), the compiler in use, including the particular version and compilation options (different compilers will decide differently which bits of code are on the critical path, which determines how assembly code is lumped together), and so on.
Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.
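You can check the bytecode claim yourself with the dis module. A quick sketch (the exact output differs between Python versions, but on 2.x the listings should differ only in INPLACE_ADD vs. INPLACE_SUBTRACT and the constant 1 vs. -1):
import dis

def add_loop():
    x = 0
    for i in range(10000000):
        x += 1

def sub_loop():
    x = 0
    for i in range(10000000):
        x -= -1

dis.dis(add_loop)  # look for INPLACE_ADD
dis.dis(sub_loop)  # look for INPLACE_SUBTRACT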
$ python -m timeit -s "x=0" "x+=1"
10000000 loops, best of 3: 0.151 usec per loop
$ python -m timeit -s "x=0" "x-=-1"
10000000 loops, best of 3: 0.154 usec per loop
Looks like you have some measurement bias.
I think the "general programming lesson" is that it is really hard to predict, solely by looking at the source code, which sequence of statements will be the fastest. Programmers at all levels frequently get caught out by this sort of "intuitive" optimisation. What you think you know may not necessarily be true.
There is simply no substitute for actually measuring your program performance. Kudos for doing so; answering why undoubtedly requires delving deep into the implementation of Python, in this case.
With byte-compiled languages such as Java, Python, and .NET, it is not even sufficient to measure performance on just one machine. Differences between VM versions, native code translation implementations, CPU-specific optimisations, and so on will make this sort of question ever more tricky to answer.
"The second loop is reliably faster ..."
That's your explanation right there. Re-order your script so the subtraction test is timed first, then the addition, and suddenly addition becomes the faster operation again:
-= 3.05
+= 2.84
Obviously something happens in the second half of the script that makes it faster. My guess is that the first call to range() is slower because Python needs to allocate enough memory for such a long list, but it is able to reuse that memory for the second call to range():
import time
start = time.clock()
x = range(10000000)
end = time.clock()
del x
print 'first range()',end-start
start = time.clock()
x = range(10000000)
end = time.clock()
print 'second range()',end-start
A few runs of this script show that the extra time needed for the first range() accounts for nearly all of the time difference between '+=' and '-=' seen above:
first range() 0.4
second range() 0.23
It's always a good idea when asking a question to say what platform and what version of Python you are using. Sometimes it doesn't matter. This is NOT one of those times:
time.clock() is appropriate only on Windows. Throw away your own measuring code and use -m timeit as demonstrated in pixelbeat's answer.
Python 2.X's range() builds a list. If you are using Python 2.x, replace range with xrange and see what happens.
Python 3.X's int is Python2.X's long.
Is there a more general programming lesson here?
The more general programming lesson here is that intuition is a poor guide when predicting run-time performance of computer code.
One can reason about algorithmic complexity, hypothesise about compiler optimisations, estimate cache performance and so on. However, since these things can interact in non-trivial ways, the only way to be sure about how fast a particular piece of code is going to be is to benchmark it in the target environment (as you have rightfully done.)
With Python 2.5 the biggest problem here is using range, which will allocate a list that big to iterate over it. When using xrange, whichever is done second is a tiny bit faster for me. (Not sure if range has become a generator in Python 3.)
Your experiment is flawed. The way this experiment should be designed is to write two different programs: one for addition, one for subtraction. They should be exactly the same and run under the same conditions, with the results written to a file. Then you need to average the runs (at least several thousand), though you'd need a statistician to tell you an appropriate number.
If you wanted to analyze different methods of addition, subtraction, and looping, again each of those should be a separate program.
Experimental error might arise from processor heat and other activity going on in the CPU, so I'd execute the runs in a variety of patterns...
That would be remarkable, so I thoroughly evaluated your code and also set up the experiment as I would consider more correct (all declarations and function calls outside the loop). I ran both versions five times.
Running your code validated your claims:
-= consistently takes less time; 3.6% on average.
Running my code, though, contradicts the outcome of your experiment:
+= takes on average (though not always) 0.5% less time.
To show all results I have put plots online:
Your evaluation: http://bayimg.com/kadAeaAcN
My evaluation: http://bayimg.com/KadaAaAcN
So, I conclude that your experiment has a bias, and it is significant.
Finally here is my code:
import time
addtimes = [0.] * 100
subtracttimes = [0.] * 100
range100 = range(100)
range10000000 = range(10000000)
j = 0
i = 0
x = 0
start = 0.
for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x += 1
    addtimes[j] = time.clock() - start

for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x -= -1
    subtracttimes[j] = time.clock() - start
print '+=', sum(addtimes)
print '-=', sum(subtracttimes)
Running the loop backwards is faster because the computer has an easier time testing whether a number is equal to 0.