Spark: Why does Python significantly outperform Scala in my use case? - python

To compare the performance of Spark when using Python and Scala, I created the same job in both languages and compared the runtimes. I expected both jobs to take roughly the same amount of time, but the Python job took only 27 minutes, while the Scala job took 37 minutes (almost 40% longer!). I implemented the same job in Java as well and it also took 37 minutes. How is it possible that Python is so much faster?
Minimal verifiable example:
Python job:
import pyspark

# Configuration
conf = pyspark.SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)
# 960 Files from a public dataset in 2 batches
input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
# Count occurrences of a certain string
logData = sc.textFile(input_files)
logData2 = sc.textFile(input_files2)
a = logData.filter(lambda value: value.startswith('WARC-Type: response')).count()
b = logData2.filter(lambda value: value.startswith('WARC-Type: response')).count()
print(a, b)
Scala job:
import org.apache.spark.{SparkConf, SparkContext}

// Configuration
val config = new SparkConf()
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
// 960 Files from a public dataset in 2 batches
val input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
val input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
// Count occurrences of a certain string
val logData1 = sc.textFile(input_files)
val logData2 = sc.textFile(input_files2)
val num1 = logData1.filter(line => line.startsWith("WARC-Type: response")).count()
val num2 = logData2.filter(line => line.startsWith("WARC-Type: response")).count()
println(s"Lines with a: $num1, Lines with b: $num2")
Just by looking at the code, they seem to be identical. I looked at the DAGs and they didn't provide any insights (or at least I lack the know-how to come up with an explanation based on them).
I would really appreciate any pointers.

Your basic assumption, that Scala or Java should be faster for this specific task, is just incorrect. You can easily verify it with minimal local applications. Scala one:
import scala.io.Source
import java.time.{Duration, Instant}

object App {
  def main(args: Array[String]) {
    val Array(filename, string) = args
    val start = Instant.now()
    Source
      .fromFile(filename)
      .getLines
      .filter(line => line.startsWith(string))
      .length
    val stop = Instant.now()
    val duration = Duration.between(start, stop).toMillis
    println(s"${start},${stop},${duration}")
  }
}
Python one:
import datetime
import sys

if __name__ == "__main__":
    _, filename, string = sys.argv
    start = datetime.datetime.now()
    with open(filename) as fr:
        # Not idiomatic or the most efficient, but that's what
        # PySpark will use
        sum(1 for _ in filter(lambda line: line.startswith(string), fr))
    end = datetime.datetime.now()
    duration = round((end - start).total_seconds() * 1000)
    print(f"{start},{end},{duration}")
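Both programs print a start,stop,duration line, so the repetitions can be driven by a small harness. A minimal sketch of such a driver (the file names app.py and App, the pattern, and the output paths are hypothetical, not from the original benchmark):
import subprocess

RUNS = 300
INPUT_FILE = "Posts.xml"  # hermeneutics.stackexchange.com dump
PATTERN = "<row"          # example prefix; the original used a mix of matching and non-matching patterns

commands = {
    "python": ["python3", "app.py", INPUT_FILE, PATTERN],
    "scala": ["scala", "App", INPUT_FILE, PATTERN],
}

# Run each implementation RUNS times and collect the CSV lines it prints.
for name, cmd in commands.items():
    with open(name + "_timings.csv", "w") as out:
        for _ in range(RUNS):
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            out.write(result.stdout)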
Results (300 repetitions each, Python 3.7.6, Scala 2.11.12), on Posts.xml from the hermeneutics.stackexchange.com data dump, with a mix of matching and non-matching patterns:
Python 273.50 (258.84, 288.16)
Scala 634.13 (533.81, 734.45)
As you can see, Python is not only systematically faster, but also more consistent (lower spread).
The take-away message is: don't believe unsubstantiated FUD. Languages can be faster or slower on specific tasks or in specific environments (for example, here Scala can be hit by JVM startup and/or GC and/or JIT), but if you see claims like "XYZ is X4 faster" or "XYZ is slow as compared to ZYX (..) Approximately, 10x slower", it usually means that someone wrote really bad code to test things.
Edit:
To address some concerns raised in the comments:
In the OP's code, data is passed mostly in one direction (JVM -> Python) and no real serialization is required (this specific path just passes the bytestring as-is and decodes it as UTF-8 on the other side). That's as cheap as it gets when it comes to "serialization".
What is passed back is just a single integer per partition, so the impact in that direction is negligible.
Communication is done over local sockets (all communication on the worker beyond the initial connect and auth is performed using the file descriptor returned from local_connect_and_auth, which is nothing more than a file associated with a socket). Again, this is as cheap as it gets when it comes to communication between processes.
Considering the difference in raw performance shown above (much higher than what you see in your program), there is a lot of margin for the overheads listed above.
This case is completely different from cases where either simple or complex objects have to be passed to and from the Python interpreter in a form that is accessible to both parties as pickle-compatible dumps (the most notable examples include old-style UDFs and some parts of old-style MLlib).
Edit 2:
Since Jasper-M was concerned about startup cost here, one can easily prove that Python still has a significant advantage over Scala even if the input size is significantly increased.
Here are the results for 2,003,360 lines / 5.6 GB (the same input, just duplicated multiple times, 30 repetitions), which far exceeds anything you can expect in a single Spark task.
Python 22809.57 (21466.26, 24152.87)
Scala 27315.28 (24367.24, 30263.31)
Please note the non-overlapping confidence intervals.
Edit 3:
To address another comment from Jasper-M:
The bulk of all the processing is still happening inside a JVM in the Spark case.
That is simply incorrect in this particular case:
The job in question is a map job with a single global reduce, using PySpark RDDs.
PySpark RDDs (unlike, let's say, DataFrames) implement the bulk of their functionality natively in Python, with the exception of input, output and inter-node communication.
Since it is a single-stage job, and the final output is small enough to be ignored, the main responsibility of the JVM (if one were to nitpick, this is implemented mostly in Java, not Scala) is to invoke the Hadoop input format and push data through the socket file to Python.
The read part is identical for the JVM and Python APIs, so it can be considered constant overhead. It also doesn't qualify as the bulk of the processing, even for a job as simple as this one.

The Scala job takes longer because of a misconfiguration: the Python and Scala jobs were provided with unequal resources.
There are two mistakes in the code:
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
sc.hadoopConfiguration.set("spark.executor.instances", "4") // LINE #4
sc.hadoopConfiguration.set("spark.executor.cores", "8") // LINE #5
LINE 1. Once this line has been executed, the resource configuration of the Spark job is established and fixed. From this point on, there is no way to adjust anything: neither the number of executors nor the number of cores per executor.
LINES 4-5. sc.hadoopConfiguration is the wrong place to set any Spark configuration. It should be set on the config instance you pass to new SparkContext(config).
[ADDED]
Bearing the above in mind, I would propose to change the code of the Scala job to
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
and re-test it again. I bet the Scala version is going to be X times faster now.
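As a quick sanity check (a hedged sketch, not part of the original answer; shown with the PySpark API, which behaves the same way), you can read the effective configuration back from the context once it is created:
import pyspark

conf = pyspark.SparkConf()
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)

# Settings applied before the context exists show up in its effective
# configuration; anything set later (e.g. on hadoopConfiguration) would not.
print(sc.getConf().get("spark.executor.instances"))  # "4"
print(sc.getConf().get("spark.executor.cores"))      # "8"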

Related

Unit testing + Mocking eBPF programs in python incl emulating kernel to user-space interactions

So I need to get some mocking and testing done on a very specific kind of Python code: an eBPF (Berkeley Packet Filter) program which, in essence, runs a C program in kernel space through a Python script via the BPF Compiler Collection (BCC).
Example code:
from bcc import BPF

# define BPF program
prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
I now want to write tests for the C and the Python code without the two interfering (I would even be satisfied with some basic Python unit testing). Any and all possible approaches to the problem are welcome, or even a general direction to look for info on this kind of testing. I'm really just looking to address the following concerns:
Can I easily mock the kernel interactions when testing the python code?
Can I emulate a message coming from the kernel to user-space?
Can I test a message that’s just larger or just smaller than the expected message size?
I tried looking into the topic for days, to no avail, so even pointing me in the right direction would be greatly appreciated...
So it turns out there is no quick way to really get into testing the C code in a basic way, like you would use pytest to test the Python logic. However, if you absolutely need solid testing done, then browsing through here and making sense of how BCC does its version of "unit testing" may be of help...
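For the Python-side concerns specifically, here is a minimal sketch of mocking the kernel interactions with the standard library's unittest.mock; read_one_event is a hypothetical refactoring of the loop body from the question, not part of BCC:
import unittest
from unittest import mock

def read_one_event(b):
    # Hypothetical helper: formats one trace record from a BPF-like object.
    task, pid, cpu, flags, ts, msg = b.trace_fields()
    return "%-18.9f %-16s %-6d %s" % (ts, task, pid, msg)

class TestTraceFormatting(unittest.TestCase):
    def test_formats_fake_kernel_message(self):
        # Emulate a message "coming from the kernel" with a stub that has
        # the same interface as a real BPF object: no kernel involved.
        fake_bpf = mock.Mock()
        fake_bpf.trace_fields.return_value = (b"bash", 1234, 0, 0, 1.5, b"Hello, World!")
        self.assertIn("1234", read_one_event(fake_bpf))

    def test_oversized_message(self):
        # Feed a message larger than the expected size; your real code
        # would assert its size-handling behaviour here.
        fake_bpf = mock.Mock()
        fake_bpf.trace_fields.return_value = (b"bash", 1234, 0, 0, 1.5, b"x" * 10000)
        self.assertTrue(read_one_event(fake_bpf))

if __name__ == "__main__":
    unittest.main()
Patching bcc.BPF itself with mock.patch works the same way if your script constructs the BPF object at import time.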

Writing to memory in a single operation

I'm writing a userspace driver for accessing FPGA registers in Python 3.5 that mmaps the FPGA's PCI address space, obtains a memoryview to provide direct access to the memory-mapped register space, and then uses struct.pack_into("<I", ...) to write a 32-bit value into the selected 32-bit aligned address.
import mmap
import pathlib
import struct

def write_u32(address, data):
    assert address % 4 == 0, "Address must be 32-bit aligned"
    path = pathlib.Path("/dev/uio0")
    file_size = path.stat().st_size
    with path.open(mode='w+b') as f:
        mv = memoryview(mmap.mmap(f.fileno(), file_size))
        struct.pack_into("<I", mv, address, data)
Unfortunately, it appears that struct.pack_into does a memset(buf, 0, ...) that clears the register before the actual value is written. By examining write operations within the FPGA, I can see that the register is set to 0x00000000 before the true value is set, so there are at least two writes across the PCI bus (in fact, for 32-bit access there are three: two zero writes, then the actual data; 64-bit access involves six writes). This causes side effects with some registers that count the number of write operations, or some that "clear on write" or trigger some event when written.
I'd like to use an alternative method to write the register data in a single write to the memory-mapped register space. I've looked into ctypes.memmove and it looks promising (not yet working), but I'm wondering if there are other ways to do this.
Note that a register read using struct.unpack_from works perfectly.
Note that I've also eliminated the FPGA from this by using a QEMU driver that logs all accesses - I see the same double zero-write access before data is written.
I revisited this in 2022 and the situation hasn't really changed. If you're considering using memoryview to write blocks of data at once, you may find this interesting.
Perhaps this would work as needed?
mv[address:address+4] = struct.pack("<I", data)
Update:
As seen from the comments, the code above does not solve the problem. The following variation of it does, however:
mv_as_int = mv.cast('I')
mv_as_int[address // 4] = data
Unfortunately, precise understanding of what happens under the hood and why exactly memoryview behaves this way is beyond the capabilities of modern technology and will thus stay open for the researchers of the future to tackle.
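Putting the pieces together, a hedged sketch of a single-store version of the question's write_u32 (same assumptions as the original: /dev/uio0 and 32-bit-aligned addresses):
import mmap
import pathlib

def write_u32(address, data):
    # Write a 32-bit value with a single store, via a cast memoryview.
    assert address % 4 == 0, "Address must be 32-bit aligned"
    path = pathlib.Path("/dev/uio0")
    file_size = path.stat().st_size
    with path.open(mode='w+b') as f:
        mv = memoryview(mmap.mmap(f.fileno(), file_size))
        # Casting to 'I' makes item assignment store the u32 directly,
        # avoiding struct.pack_into's intermediate zero writes.
        mv.cast('I')[address // 4] = data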
You could try something like this:
import mmap
import os
import numpy as np

# The methods below presumably belong to a class wrapping one mapped
# region; the class name here is just a placeholder.
class MappedRegion:
    def __init__(self, offset, size=0x10000):
        self.offset = offset
        self.size = size
        mmap_file = os.open('/dev/mem', os.O_RDWR | os.O_SYNC)
        mem = mmap.mmap(mmap_file, self.size,
                        mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE,
                        offset=self.offset)
        os.close(mmap_file)
        self.array = np.frombuffer(mem, np.uint32, self.size >> 2)

    def wread(self, address):
        idx = address >> 2
        return int(self.array[idx])

    def wwrite(self, address, data):
        idx = address >> 2
        self.array[idx] = np.uint32(data)
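A short usage sketch, assuming the placeholder class name above and a hypothetical physical offset for the register block:
# Hypothetical base offset; substitute your device's actual address.
regs = MappedRegion(offset=0x40000000)
regs.wwrite(0x10, 0xDEADBEEF)   # single 32-bit store, no zeroing writes
print(hex(regs.wread(0x10)))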

Optimizing speed of python code that tests results of an API

I'm trying to test a publicly available web page that takes a GET request and returns a different JSON file depending on the GET argument.
The API looks like
https://www.example.com/api/page?type=check&code=[Insert string here]
I made a program to check the results of all possible 4-letter strings on this API. My code looks something like this (with the actual URL replaced):
import time, urllib.request

for a in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    for b in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
            for d in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
                test = urllib.request.urlopen("https://www.example.com/api/page?type=check&code=" + a + b + c + d).read()
                if test != b'{"result":null}':
                    print(a + b + c + d)
                    f = open("codes", "a")
                    f.write(a + b + c + d + ",")
                    f.close()
This code is completely functional and works as expected. However, there is a problem: because the program can't progress until it receives a response, this method is very slow. If the ping time for the API is 100 ms, then each check takes 100 ms. When I modified this code so that it tested half of the results in one instance and half in another, I noticed that the speed doubled.
Because of this, I'm led to believe that the ping time of the site is the limiting factor in this script. What I want to do is basically check each code and then immediately check the next one, without waiting for a response.
That would be the equivalent of opening up the page a few thousand times in my browser. It could load many tabs at the same time, since each page is less than a kilobyte.
I looked into using threading to do this, but I'm not sure if it's relevant or helpful.
Use a worker pool, as described here: https://docs.python.org/3.7/library/multiprocessing.html
from multiprocessing import Pool

def test_url(code):
    ''' insert code to test URL '''
    pass

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(test_url, [code1, code2, code3]))
Just be aware that the website might be rate-limiting the number of requests you are making.
To be more specific with your example, I would split it up into two phases: (1) generate the test codes; (2) test a URL, given one test code. Once you have the list of codes generated, you can apply the above strategy of applying the verifier to each generated code, using a worker pool.
To generate the test codes, you can use itertools:
import itertools
import string

codes_to_test = [''.join(i) for i in itertools.product(string.ascii_uppercase, repeat=4)]
You have a better understanding of how to test a URL given one test code, so I assume you can write a function test_url(test_code) that will make the appropriate URL request and verify the result as necessary. Then you can call:
with Pool(5) as p:
    print(p.map(test_url, codes_to_test))
On top of this, I would suggest two things: (1) make sure codes_to_test is not enormous at first (for example, by taking a sublist of the generated codes) so you can verify that your code works correctly, and (2) play with the size of the worker pool so as not to overwhelm your machine or the API.
Alternatively, you can use asyncio (https://docs.python.org/3/library/asyncio.html) to keep everything in a single process.
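A minimal, stdlib-only sketch of the asyncio route, assuming the question's URL and null-result check; the blocking urlopen is pushed onto a thread pool via run_in_executor, and the batch size of 50 is an arbitrary throttle:
import asyncio
import itertools
import string
import urllib.request

URL = "https://www.example.com/api/page?type=check&code="

def check(code):
    # Blocking request; runs on the default thread-pool executor.
    return code, urllib.request.urlopen(URL + code).read()

async def main():
    loop = asyncio.get_running_loop()
    codes = [''.join(c) for c in itertools.product(string.ascii_uppercase, repeat=4)]
    # Bound the number of in-flight requests to stay under rate limits.
    for start in range(0, len(codes), 50):
        tasks = [loop.run_in_executor(None, check, code)
                 for code in codes[start:start + 50]]
        for code, body in await asyncio.gather(*tasks):
            if body != b'{"result":null}':
                print(code)

asyncio.run(main())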

Python loop slower than Excel VBA?

I ran a little test between Excel (VBA) and Python performing a simple loop; the code is listed below. To my surprise, VBA was significantly faster than Python, almost six times faster. I thought that, because Python runs through the command line, its performance would be better. Do you guys have any comments on this?
Python
import time
import ctypes  # An included library with Python install.

start_time = time.time()
for x in range(0, 1000000):
    print x
x = ("--- %s seconds ---" % (time.time() - start_time))
ctypes.windll.user32.MessageBoxA(0, x, "Your title", 1)
Excel (VBA)
Sub looptest()
    Dim MyTimer As Double
    MyTimer = Timer
    Dim rng As Range, cell As Range
    Set rng = Range("A1:A1000000")
    x = 1
    For Each cell In rng
        cell.Value = x
        x = x + 1
    Next cell
    MsgBox Timer - MyTimer
End Sub
Your two code samples are not doing the same thing. In the Python code, the inner loop has to:
Ask for the next number in range(0, 1000000).
Display it.
In the VBA code, Excel has to:
Ask for the next cell in Range("A1:A1000000") (which has nothing to do with Python ranges).
Set the cell.Value property.
Run through various code Excel executes whenever it changes a cell.
Check to see if any formulas need to be recalculated.
Display it.
Increment x.
Let's rewrite this so the Python and VBA loops do the same thing, as near as we can:
Python
import time
import ctypes

start_time = time.time()
x = 0
while x <= 1000000:
    x = x + 1
x = ("--- %s seconds ---" % (time.time() - start_time))
ctypes.windll.user32.MessageBoxA(0, x, "Your title", 1)
VBA
Declare Function QueryPerformanceCounter Lib "kernel32" (t As Currency) As Boolean
Declare Function QueryPerformanceFrequency Lib "kernel32" (t As Currency) As Boolean

Sub looptest()
    Dim StartTime As Currency
    QueryPerformanceCounter StartTime
    x = 0
    Do While x <= 1000000
        x = x + 1
    Loop
    Dim EndTime As Currency
    QueryPerformanceCounter EndTime
    Dim Frequency As Currency
    QueryPerformanceFrequency Frequency
    MsgBox Format$((EndTime - StartTime) / Frequency, "0.000")
End Sub
On my computer, Python takes about 96 ms and VBA 33 ms; VBA performs three times faster. If you throw in a Dim x As Long, it performs six times faster.
Why? Well, let's look at how each gets run. Python internally compiles your .py file into a .pyc, and runs it under the Python VM. Another answer describes the Python case in detail. Excel compiles VBA into MS P-Code, and runs it under the Visual Basic VM.
At this point, it doesn't matter that python.exe is command-line and Excel is GUI. The VM runs your code, and it lives a little deeper in the bowels of your computer. Performance depends on what specific instructions are in the compiled code, and how efficiently the VM runs these instructions. In this case, the VB VM ran its P-Code faster than the Python VM ran its .pyc.
The slow part about this is the print. Printing to the console is incredibly slow, so you should totally avoid it. I assume that setting cell values in Excel is just way faster.
If you want to compare computation speed you should not have any I/O within the loop. Instead, only calculate the time it took to process the whole loop without doing anything inside (or doing something simple like adding a number or something). If you do that, you will see that Python is very fast.
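For example, a quick hedged way to time just the loop with the standard library, keeping all I/O out of the measured section:
import timeit

# Time an empty 1,000,000-iteration loop; take the best of 5 runs.
print(min(timeit.repeat("for x in range(1000000): pass", number=1, repeat=5)))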
It varies with the computation you need to perform, and it is difficult to prove which is faster in a simple way, with a simple comparison. So I will share my experience in a generic way, without proofs, but with some comments on things that may make a big difference.
In my experience, if you compare a simple for loop with exactly the same code between pure Python and pure VBA, VBA is around 3 times faster. But nobody writes the same code in different languages; you have to apply each language's best practices.
If you apply VBA best practices, you can make it even faster by declaring the variables and applying other similar optimizations not available in Python. In my experience, this can make the code around 2-3 times faster, so you can make VBA code around 6-9 times faster than a simple for loop in Python.
On the other hand, if you apply Python best practices, you generally won't write a regular for loop. You will use a list comprehension, or you will use numpy and scipy, which run as compiled C libraries. Those solutions are much faster than VBA code.
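A hedged illustration of what that looks like in practice (the sum-of-squares task is just an arbitrary example):
import numpy as np

n = 1000000

# Plain loop: slowest.
total = 0
for i in range(n):
    total += i * i

# Generator expression: faster and more idiomatic.
total = sum(i * i for i in range(n))

# numpy vectorization: fastest, since the loop runs in compiled C.
a = np.arange(n, dtype=np.int64)
total = int((a * a).sum())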
In general, if you perform complex matrix calculations that you can do with numpy and scipy, Python will be faster than VBA. In other cases, VBA is faster.
In Python, you can also use Numba, which adds a bit of complexity to the code, but it helps generate compiled code and handles running it on the GPU. It will make your code even faster.
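A minimal sketch of the Numba approach (it assumes numba is installed; the first call pays a one-off compilation cost):
from numba import njit

@njit
def sum_of_squares(n):
    # Compiled to machine code on first call; later calls run at C-like speed.
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(1000000))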
This is my experience with pure computation mainly involving internal arrays. I haven't compared performance for I/O with GUI, external files and databases, network communication or API.

Is there a hidden possible deadlock in ppmap/parallel python?

I am having some trouble with using a parallel version of map (ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but no longer using any CPU).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry: if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and restarting the program multiple times.)
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    # Imports are inside the function so pp can ship them to the workers.
    import re
    import Bio.SeqIO
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % letter).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])

res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
    pass
If I use a simple map instead of the ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
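A hedged sketch of that drop-in change, reusing the question's analyse_repeats and input lists (written for the question's Python 2 environment):
from multiprocessing import Pool

pool = Pool()  # defaults to one worker process per CPU core
results = pool.map(analyse_repeats,
                   zip(todo_list, organism_ids, filenames))
pool.close()
pool.join()

for res in results:
    # process the output, same as with ppmap
    pass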
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but I hope it helps you.
