I am learning Scala by converting some of my Python code to Scala. I just encountered an issue where the Python code significantly outperforms the Scala code. The code is supposed to construct a set of candidate pairs based on some conditions. Scala had runtime performance comparable to Python for all previous parts.
The id_map is an array of maps from Long to sets of String. The average number of k-v pairs per map is 1942.
The Scala code snippet is below:
// id_map: Array[mutable.Map[Long, Set[String]]]
val candidate_pairs = id_map
  .flatMap(hashmap => hashmap.values)
  .filter(_.size >= 2)
  .flatMap(strset => strset.toList.combinations(2))
  .map(_.sorted)
  .toSet
and the corresponding Python code is
from itertools import combinations

candidate_pairs = set()
for hashmap in id_map.values():
    for strset in hashmap.values():
        if len(strset) >= 2:
            for pair in combinations(strset, 2):
                candidate_pairs.add(tuple(sorted(pair)))
The Scala snippet takes 80 seconds while the Python version takes 10 seconds.
I am wondering what I can do to optimize the above code to make it faster. What I have been trying is updating the set in a for loop:
var candidate_pairs = Set.empty[List[String]]
for (
  hashmap: mutable.Map[Long, Set[String]] <- id_map;
  setstr: Set[String] <- hashmap.values if setstr.size >= 2;
  pair <- setstr.toList.combinations(2)
)
  candidate_pairs += pair.sorted
Although candidate_pairs is updated many times, and each update creates a new set, this is actually faster than the previous Scala version, taking about 50 seconds; still worse than Python, though. I also tried a mutable set, but the result was about the same as the immutable version.
Any help would be appreciated! Thanks!
Being slower than Python sounds ... surprising.
First of all, make sure you have adequate memory settings, and that it is not spending half of those 80 seconds in GC.
Also, be sure to "warm up" the JVM (run your function a few times before the actual measurement), use exactly the same data for the Python and Scala runs (not just the same statistics, exactly the same data), and do not include the time spent acquiring/generating data in the measurement. Make several runs and compare the average time, not how long a single run took.
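For the Python side of the comparison, something along these lines keeps the warm-up and averaging honest (a minimal sketch; bench and its parameter names are hypothetical, and time.perf_counter needs Python 3.3+):

import time

def bench(f, *args, warmup=3, runs=5):
    # Hypothetical helper: discard warm-up runs, then average the timed ones.
    for _ in range(warmup):
        f(*args)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        f(*args)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)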
Having said that, a few ways to make your code faster:
Adding .view (or .iterator) after id_map in your implementation cuts the execution time by about a factor of 4 in my experiments.
(.view makes your chained transformations apply "lazily": essentially, a single pass through a single instance of the array, instead of multiple passes producing multiple intermediate copies.)
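If it helps to map this back to Python terms (a rough analogy, not a literal equivalent), .view/.iterator play roughly the role that generator expressions play relative to list comprehensions, here using the question's id_map:

# Eager: each step builds a full intermediate list in memory.
values = [s for m in id_map.values() for s in m.values()]
big = [s for s in values if len(s) >= 2]

# Lazy: generator expressions stream one element at a time instead.
values = (s for m in id_map.values() for s in m.values())
big = (s for s in values if len(s) >= 2)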
- Replacing .map(_.sorted) with
.map {
  case List(a, b) if a < b => (a, b)
  case List(a, b)          => (b, a)
}
shaves off about another 75% (sorting two-element lists is mostly overhead).
This changes the return type to tuples rather than lists (constructing lots of tiny lists also adds up), but tuples actually seem even more appropriate here.
- Removing .filter(_.size >= 2) (it is redundant anyway, and computing the size of a collection can get expensive) yields a further, fairly small improvement that I did not bother to measure exactly.
Additionally, it may be cheaper to get rid of the separate sort step altogether and just add .sorted before .combinations. I have not tested that, because it would be futile without knowing more about your data profile.
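For reference, the same trick is visible on the Python side of the comparison: itertools.combinations emits pairs in input order, so sorting the input once means every pair comes out already ordered (a minimal demonstration):

from itertools import combinations

strset = {"b", "c", "a"}
# Sort once up front; each emitted pair is then already in order,
# so the per-pair tuple(sorted(pair)) call can be dropped.
pairs = set(combinations(sorted(strset), 2))
print(pairs)  # the three pairs ('a','b'), ('a','c'), ('b','c'), each pre-sorted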
These are some general improvements that should help either way. It is hard to be sure you'll see the same effect as I do, since I don't know anything about your data beyond that average map size; the improvement you see might be even better than mine, or somewhat smaller, but you should see some.
I ran this version with some test Scala code I created. On a list of 1944 elements, it completed in about 15 ms on my laptop.
id_map
  .flatMap(hashmap => hashmap.values)
  .flatMap { strset =>
    if (strset.size >= 2) {
      strset.toIndexedSeq.combinations(2)
    } else IndexedSeq.empty
  }
  .map(_.sorted)
  .toSet
The main changes I made are to use an IndexedSeq instead of a List (which is a linked list), and to do the filter on the fly.
I assume you didn't want to hyper-optimize, but if you do, you could still remove a lot of the intermediate collections created by the flatMap, the combinations call, the conversion to IndexedSeq, and the toSet call.
Related
I'm trying to get the 15 most relevant items for each user, but every function I tried takes an eternity (more than 6 hours; I shut it down after that ...).
I have 418 unique users and 3718 unique items.
The U2tfifd dict also has 418 entries, and there are 32645 words in tfidf_feature_names.
The shape of my interactions_full_df is (40733, 3).
I tried:
def index_tfidf_users(user_id):
    return [users for users in U2tfifd[user_id].flatten().tolist()]

def get_relevant_items(user_id):
    return sorted(zip(tfidf_feature_names, index_tfidf_users(user_id)), key=lambda x: -x[1])[:15]

def get_tfidf_token(user_id):
    return [words for words, values in get_relevant_items(user_id)]

then

interactions_full_df["tags"] = interactions_full_df["user_id"].apply(lambda x: get_tfidf_token(x))
or
def get_tfidf_token(user_id):
    tags = []
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    for words, values in v:
        tags.append(words)
    return tags
or
def get_tfidf_token(user_id):
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    tags = [words for words in v]
    return tags
U2tfifd is a dict with keys = user_id, values = an array
There are several things going on which could cause poor performance in your code. The impact of each of these will depend on things like your Python version (2.x or 3.x), your RAM speed, and whatnot. You'll need to experiment and benchmark the various potential improvements yourself.
1. TFIDF Sparsity (~10x speedup depending on sparsity)
One glaring potential problem is that TFIDF naturally returns sparse data (e.g. a paragraph doesn't use anywhere near as many unique words as an entire book), and working with dense structures like numpy arrays is a strange choice when the data is probably zero almost everywhere.
If you'll be doing this same analysis in the future, it might be helpful to make/use a version of TFIDF with sparse array outputs so that when you extract your tokens you can skip over the zero values. This would likely have the secondary benefit of the entire sparse array for each user fitting in the cache and preventing costly RAM access in your sorts and other operations.
It might be worth sparsifying your data anyway. On my potato, a quick benchmark on data which should be similar to yours indicates that the process can be done in ~30s. The process replaces much of the work you're doing with a highly optimized routine coded in C and wrapped for use in Python. The only real cost is the second pass through the non-zero entries, but unless that pass is pretty efficient to begin with you should be better off working with sparse data.
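As a sketch of that direction, assuming scikit-learn's TfidfVectorizer (one common way to get sparse TFIDF output; your pipeline may differ, and the documents below are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["hypothetical document for user 0", "hypothetical document for user 1"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # scipy.sparse CSR matrix; zeros are not stored

row = tfidf.getrow(0)                   # sparse row for one user
# Only the non-zero (word_index, score) entries ever need to be ranked.
nonzero = list(zip(row.indices, row.data))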
2. Duplicated Efforts and Memoization (~100x speedup)
If U2tfifd has 418 entries and interactions_full_df has 40733 rows then at least 40315 (or 99.0%) of your calls to get_tfidf_token() are wasted since you've already computed the answer. There are tons of memoization decorators out there, but you don't need anything very complicated for your use case.
def memoize(f):
    _cache = {}
    def _f(arg):
        if arg not in _cache:
            _cache[arg] = f(arg)
        return _cache[arg]
    return _f

@memoize
def get_tfidf_token(user_id):
    ...
Breaking this down, the function memoize() returns another function. The behavior of that function is to check a local cache for the expected return value before computing it and storing it if necessary.
The syntax @memoize above the def is short for something like the following.
def uncached_get_tfidf_token(user_id):
    ...

get_tfidf_token = memoize(uncached_get_tfidf_token)
The @ symbol is used to signify that we want the modified, or decorated, version of get_tfidf_token() instead of the original. Depending on your application, it might be beneficial to chain decorators together.
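For what it's worth, on Python 3 the standard library already ships an equivalent decorator, functools.lru_cache, so you don't even need the hand-rolled version:

from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache, like the hand-rolled memoize above
def get_tfidf_token(user_id):
    ...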
3. Vectorized Operations (varying speedup, benchmarking necessary)
Python doesn't really have a notion of primitive types like other languages, and even integers take 24 bytes in memory on my machine. Lists usually aren't packed, so you can incur costly cache misses as you're plowing through them. No matter how little work the CPU does for the sorting and whatnot, clobbering a whole new chunk of memory to turn your array into a list, and only using that brand new, expensive memory once, is going to incur a performance hit.
Many of the things you are trying to do have fast numpy equivalents (SIMD-vectorized, parallelized, memory-efficient, packed memory, and other fun optimizations) that also avoid unnecessary array copies and type conversions. It seems you're already using numpy anyway, so you won't have any extra imports or dependencies.
As one example, zip() creates another list in memory in Python 2.x and still does unnecessary work in Python 3.x when you really only care about the indices of tfidf_feature_names. To compute those indices, you can use something like the following, which avoids an unnecessary list creation and uses an optimized routine with slightly better asymptotic complexity as an added bonus.
import numpy as np

def get_tfidf_token(user_id):
    temp = U2tfifd[user_id].flatten()
    ind = np.argpartition(temp, len(temp) - 15)[-15:]
    # Keep exactly one of the following two returns:
    return tfidf_feature_names[ind]               # works if tfidf_feature_names is a numpy array
    return [tfidf_feature_names[i] for i in ind]  # always works
Depending on the shape of U2tfifd[user_id], you could avoid the costly .flatten() computation by passing an axis argument to np.argpartition() and flattening the 15 obtained indices instead.
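To make the argpartition call concrete, here is a minimal, self-contained example (the array values are made up):

import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
# Indices of the 2 largest values; argpartition only guarantees the top-k
# land at the end of the partition, not that they come out internally sorted.
ind = np.argpartition(scores, len(scores) - 2)[-2:]
print(scores[ind])  # the values 0.7 and 0.9, in either order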
4. Bonus
The sorted() function supports a reverse argument so that you can avoid extra computations like throwing a negative on every value. Simply use
sorted(..., reverse=True)
Even better, since you really don't care about the sort itself but just the 15 largest values you can get away with
sorted(...)[-15:]
to take the largest 15, instead of negating every value so that the largest 15 sort to the front. That doesn't really matter if you're using a better function for the job like np.argpartition(), but it could be helpful in the future.
You can also avoid some function calls by replacing .apply(lambda x : get_tfidf_token(x)) with .apply(get_tfidf_token) since get_tfidf_token is already a function which has the intended behavior. You don't really need the extra lambda.
As far as I can see, though, most additional gains are fairly nitpicky and system-dependent. You could make most things faster with Cython or straight C given enough time, for example, but you already have reasonably fast routines that do what you want out of the box. The extra engineering effort probably isn't worth any potential gains.
The question arose when answering another SO question (there).
When I iterate several times over a Python set (without changing it between calls), can I assume it will always return the elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation-defined?
And when I call the same Python program repeatedly (not random, not input-dependent), will I get the same ordering for sets?
The underlying question is whether Python set iteration order depends only on the algorithm used to implement sets, or also on the execution context.
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers less than the machine word size (32-bit or 64-bit) hash to themselves, but text strings, byte strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
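A quick way to see the effect yourself (save this as, say, demo.py, a hypothetical file name, and run it several times with and without the variable set):

# With a fixed seed the order is the same on every run:
#   PYTHONHASHSEED=0 python demo.py
# Without it, string hashes are salted and the order can differ per run:
#   python demo.py
print(list({'apple', 'banana', 'cherry'}))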
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime objects are "salted" with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle

seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly, the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation-dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper, as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members on the object, although I would dearly love to know the underlying explanation and ways to avoid it. Also useful would be some way to reproduce/repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
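A minimal sketch of that idea (the OrderedSet class below is hypothetical; only a handful of the set operations are shown):

from collections import OrderedDict

class OrderedSet:
    """A set that iterates in insertion order, backed by OrderedDict keys."""
    def __init__(self, iterable=()):
        self._d = OrderedDict.fromkeys(iterable)
    def add(self, item):
        self._d[item] = None
    def discard(self, item):
        self._d.pop(item, None)
    def __contains__(self, item):
        return item in self._d
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)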
The answer is simply no: Python set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')

for j in set(x):
    print(j.a, j.b)
Run this twice and you will get:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:

- Set PYTHONHASHSEED to 0 (see details here, here and here).
- Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.
I have a few questions that have been bothering me for a few days. I'm a beginner Python/Django programmer, so I just want to clear up a few things before I dive into real-time product development (for Python 2.7.*).
1) Saving a value in a variable before using it in a function
for x in some_list/tuple:
    func(do_something(x))

or

for x in some_list/tuple:
    y = do_something(x)
    func(y)
Which one is faster, or which one SHOULD I use?
2) Creating a new object of a model in Django
def myview(request):
    u = User(username="xyz12", city="TA", name="xyz", ...)
    u.save()

or

def myview(request):
    d = {'username': "xyz12", 'city': "TA", 'name': "xyz", ...}
    u = User(**d)
    u.save()
3) Creating a dictionary
var = dict(key1=val1, key2=val2, ...)
var = {'key1': val1, 'key2': val2, ...}
4) I know .append() is faster than +=, but what if I want to append one list's elements to another?
a = [1, 2, 3]
b = [4, 5, 6]
a += b

or

for i in b:
    a.append(i)
This is a very interesting question, but I think you aren't asking it for the right reason. The performance gained by such optimisations is negligible, especially if you're working with small numbers of elements.
On the other hand, what is really important is the ease of reading the code and its clarity.
def myview(request):
    d = {'username': "xyz12", 'city': "TA", 'name': "xyz", ...}
    u = User(**d)
    u.save()
This code, for example, isn't "easy" to read and understand at first sight. It requires you to think about it before finding out what it actually does. Unless you need the intermediary step, don't do it.
For the 4th point, I'd go for the first solution, which is much clearer (and it avoids the function-call overhead of calling the same function in a loop). You could also use more specialised functions for better performance, such as reduce (see this answer: https://stackoverflow.com/a/11739570/3768672 and this thread as well: What is the fastest way to merge two lists in python?).
The 1st and 3rd points are usually a matter of preference, as both forms are really similar and will probably be optimised when compiled to bytecode anyway.
If you really want to optimise more your code, I advise you to go check this out : https://wiki.python.org/moin/PythonSpeed/PerformanceTips
PS: Ultimately, you can still do your own tests. Write two functions doing exactly the same thing with the two different methods you want to test, measure their execution times and compare them (and be careful: run the tests multiple times to reduce the uncertainty).
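As a concrete starting point for such a test, a minimal sketch using the standard timeit module (the list size and repeat count are arbitrary, and the numbers you get will vary by machine and Python version):

import timeit

setup = "b = list(range(1000))"

# += extends the list in place in a single C-level operation.
print(timeit.timeit("a = []; a += b", setup=setup, number=10000))

# The explicit loop pays a method-call overhead on every element.
print(timeit.timeit("a = []\nfor i in b:\n    a.append(i)", setup=setup, number=10000))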
Python is a kind of "scripting" programming language.
In this situation:
def dic_test():
    a = {}
    a[0] = [0, 0, 0]
    for i in range(10000000):
        a[0][0] += 1
        a[0][1] += 1
        a[0][2] += 1
    print(a)

def no_dic_test():
    a = {}
    a[0] = [0, 0, 0]
    target = a[0]
    for i in range(10000000):
        target[0] += 1
        target[1] += 1
        target[2] += 1
    print(a)
Will no_dic_test() be faster than dic_test()?
I thought yes, because Python is dynamic: each statement is translated separately.
I used profile to benchmark. The first function was slower than the second one, but the difference was slight.
First function: 5 function calls in 26.113 seconds
Second function: 5 function calls in 23.835 seconds
That is an extreme case. In my own case, with around 10k keys and 10k operations, direct use of a dictionary was faster. I am so surprised.
To end: is there a "static compiler" like C's, or cache optimisation for dictionaries in Python? Or are Python hash tables just too fast for this to matter?
Thanks!
It's pretty obvious that the second function does far less work on each loop iteration.
The first function has to do a dict lookup on every pass through the loop, whereas the second function does the lookup (and stores the result in a local) just once.
There are runtimes like PyPy that spot the hot loop and JIT-compile it for added performance, but the CPython runtime doesn't do this kind of optimisation yet.
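You can see the per-statement cost directly with the standard dis module (a minimal sketch; the exact opcode names vary across CPython versions):

import dis

def with_lookup(a):
    a[0][0] += 1   # looks up a[0] afresh on every use

def with_alias(a):
    target = a[0]  # the dict lookup happens once
    target[0] += 1

# The first disassembly shows an extra subscript operation per access;
# CPython re-executes it each time because nothing tells it a[0] is stable.
dis.dis(with_lookup)
dis.dis(with_alias)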
I wrote a simple stack-based virtual machine in Python, and now I'm trying to rewrite it in Clojure, which is proving difficult as I don't have much experience with Lisp. This Python snippet processes the bytecode, which is represented as a list of tuples like so:
[("label", "entry"),
("load", 0),
("load", 1),
("add",),
("store", 0)]
Or in Clojure:
[[:label :entry]
 [:load 0]
 [:load 1]
 [:add]
 [:store 0]]
When a Function object loads the bytecode, every "label" tuple is processed specially to mark that position, while every other tuple stays in the final bytecode. I would assume that the Clojure equivalent of this function would involve a fold, but I'm not sure how to do that in an elegant or idiomatic way. Any ideas?
Reading that Python snippet, it looks like you want the eventual output to look like
{:code [[:load 0]
        [:load 1]
        [:add]
        [:store 0]]
 :labels {:entry 0}}
It's much easier to write the code once you have a firm description of the goal, and indeed this is a pretty simple reduce. There are a number of stylistically-different ways to write the reductor, but this way seems easiest to read, for me.
(defn load [asm]
  (reduce (fn [{:keys [code labels]} [op arg1 & args :as instruction]]
            (if (= :label op)
              {:code code
               :labels (assoc labels arg1 (count code))}
              {:code (conj code instruction)
               :labels labels}))
          {:code [], :labels {}}
          asm))
Edit
This version supports a name argument, and simplifies the reduction step by not repeating elements that don't change.
(defn load [name asm]
  (reduce (fn [program [op arg1 :as instruction]]
            (if (= :label op)
              (assoc-in program [:labels arg1] (count (:code program)))
              (update-in program [:code] conj instruction)))
          {:code [], :labels {}, :name name}
          asm))
I can't guarantee that this is idiomatic Clojure, but this is a functional version of your Python code, which should at least get you pretty close.
(def prog [[:label :entry]
           [:load 0]
           [:load 1]
           [:add]
           [:store 0]])
(defn parse [stats]
  (let [f (fn [[out-stats labels pc] stat]
            (if (= :label (first stat))
              [out-stats (conj labels [(second stat) pc]) pc]
              [(conj out-stats stat) labels (+ 1 pc)]))
        init [[] {} 0]]
    (reduce f init stats)))
(println (parse prog))
So I think you're correct that a fold is what you want. All functional folds walk a collection and "reduce" that collection into a single value. However, nothing says that the resulting single value can't also be a collection or, as in this case, a collection of collections.
In our case, we are going to use the three-argument version of reduce, which lets us provide an initial accumulator value. We need to do this because we are going to track a lot of state as we iterate across the collection of bytecodes, and the two-argument version pretty much requires that your accumulator be similar to the items in the list (cf. (reduce + [1 2 3 4])).
When working with a functional fold, you need to think in terms of what you are accumulating, and how each element in the input collection contributes to that accumulation. If you look at your Python code, there are three values that can be updated on each turn of the loop:
The output statements (self.code)
The label mapping (self.labels)
The program counter (pc)
Nothing else is written during the loop. So, our accumulator value will need to store those three values.
That previous bit is the most important part.
Once you have that, the rest should be pretty easy. We need an initial accumulator value, which has no code, no label mappings, and a PC that starts at 0. On each iteration, we will update the accumulator in one of two ways:
Add a new label mapping
Add a new output program statement, and increment the program counter
And now, the output:
[[[:load 0] [:load 1] [:add] [:store 0]]
{:entry 0}
4]
That's a 3-element vector. The first element is the program. The second element is the label mappings. The third element is the next PC value. Now, you might modify parse to only produce two values; that's not an unreasonable thing to do. There are reasons you might not want to do it, but that's more an issue of API design than anything. I'll leave it as an exercise to the reader.
I should also mention that, initially, I had omitted the let block and had simply inlined the named values. I decided to pull them out to hopefully increase readability. Again, I don't know which is more idiomatic. That might be more of a per-project convention.
Finally, I don't know if monads have really taken off in the Clojure community, but you could also create a monad for bytecode parsing, and define the operations "add-statement" and "add-label" to be values in that monad. This would greatly increase the set-up complexity, but would simplify the actual parsing code. In fact, it would allow your parsing code to look fairly procedural, which may or may not be a good thing. (don't worry, it's still functional and side-effect free; monads just let you hide plumbing.) If your Python sample is pretty representative of the kind of data you need to process, then monads are almost certainly unnecessary overhead. On the other hand, if you actually have to manage much more state than indicated by your sample, then monads might help to keep you sane.
(defn make-function [name code]
  (let [[code labels] (reduce (fn [[code labels] inst]
                                (if (= (first inst) :label)
                                  [code (assoc labels (second inst) (count code))]
                                  [(conj code inst) labels]))
                              [[] {}] ;; initial state of code and labels
                              code)]
    {:name name, :code code, :labels labels}))
It's a bit wide for my liking, but not too bad.
I'm going to give you a general solution for this kind of problem.
Most loops can be done effortlessly with a straightforward map, filter or reduce, and if your data structure is recursive, naturally the loop will be a recursion.
Your loop, however, is a different kind of loop. Your loop accumulates a result (which suggests using reduce), but it also carries a local variable along (pc), so it's not a straight reduce.
It's a reasonably common kind of loop. If this were Racket, I would use for/fold, but since it's not, we will have to shoehorn your loop onto reduce.
Let's define a function called load which returns two things, the processed code and the processed labels. I will also use a helper function called is-label?.
(defn load [asm]
  (defn is-label? [x] (= (first x) :label))
  {:code <<< CODE GOES HERE >>>
   :labels <<< CODE GOES HERE >>>})
Right now, your loop does two things, it processes the code, and it processes the labels. As much as possible, I try to keep loops to a single task. It makes them easier to understand, and it often reveals opportunities for using the simpler loop constructs.
To get the code, we simply need to remove the labels. That's a call to filter.
{:code (filter (complement is-label?) asm)
 :labels <<< CODE GOES HERE >>>}
Reduce normally has only one accumulator, but your loop needs two: the result, and the local variable pc. I will package these two into a vector which will be immediately deconstructed by the body of the loop. The two slots of the vector will be my two local variables.
The initial values for these two variables appear as the 2nd argument to reduce.
(first
  (reduce
    (fn [[result, pc] inst]
      << MORE CODE >>)
    [{} 0] asm))
(Note how the initial values for the variables are placed far from their declaration. If the body is long, this can be hard to read. That's the problem Racket's for/fold solves.)
Once reduce returns, I call first to discard the local variable pc and keep just the result.
Filling in the body of the loop is straightforward. If the instruction is a label, assoc it into the result; otherwise, increase pc by one. In either case, I construct a vector containing the new values of all the local variables.
(fn [[result, pc] [_ arg :as inst]]
  (if (is-label? inst)
    [(assoc result arg pc) pc]
    [result (inc pc)]))
This technique can be used to convert any accumulator-with-locals loop into a reduce. Here's the full code.
(defn load [asm]
  (defn is-label? [x] (= (first x) :label))
  {:code (filter (complement is-label?) asm)
   :labels
   (first
     (reduce
       (fn [[result, pc] [_ arg :as inst]]
         (if (is-label? inst)
           [(assoc result arg pc) pc]
           [result (inc pc)]))
       [{} 0] asm))})