Extract a tuple array in python in Spark - python

I have an RDD of the form:
(2, [hello, hi, how, are, you])
I need to map this tuple to:
((2, hello), (2, hi), (2, how), (2, are), (2, you))
I am trying this in python:
PairRDD = rdd.flatMap(lambda (k,v): v.split(',')).map(lambda x: (k,x)).reduceByKey())
This will not work because k is not available in the map transformation. I am not sure how to do it. Any comments?
Thanking you in advance.

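I think your core issue is a misplaced right parenthesis. Consider the following code (I've tested the equivalent in Scala; in PySpark the inner map has to be written as a list comprehension, because Python lists have no map method, and kv[0] and kv[1] stand for the key k and the value v):
PairRDD = rdd.flatMap(lambda kv: [(kv[0], x) for x in kv[1].split(',')])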
v is split into a list of strings, and then that list is mapped to a tuple of (key, string), and then that list is returned to flatMap, splitting it out into multiple rows in the RDD. With the additional right parens after v.split(','), you were throwing away the key (since you only returned a list of strings).
Are the key values unique in the original dataset? If so and you want a list of tuples, then instead of flatMap use map and you'll get what you want without a shuffle. If you do want to combine multiple rows from the original dataset, then a groupByKey is called for, not reduceByKey.
I'm also curious if the split is necessary--is your tuple (Int, String) or (Int, List(String))?
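As a minimal sketch of the idea above, the function handed to flatMap can be written and checked in plain Python, without a cluster. The name expand is hypothetical, and the sketch assumes the value is either a list of strings or a single comma-separated string (the open question above).

```python
from itertools import chain

def expand(pair):
    """Turn (key, values) into a list of (key, value) tuples."""
    key, values = pair
    # Assumption: values may be a comma-separated string or already a list.
    if isinstance(values, str):
        values = [v.strip() for v in values.split(',')]
    return [(key, v) for v in values]

rows = [(2, ['hello', 'hi', 'how', 'are', 'you']), (3, 'a, b')]

# chain.from_iterable flattens one level, which mirrors what flatMap
# does with the list returned for each row.
flat = list(chain.from_iterable(expand(r) for r in rows))
print(flat)
```

With a real RDD, the same function would be passed as rdd.flatMap(expand).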

Related

How do I make a list comprehension for 1/3 of a dictionary?

I am relatively new to Python and have started to learn list comprehensions. They seemed a bit complicated at first but are getting more intuitive over time.
However, there is one problem I have not been able to solve yet.
A function receives a dictionary as a parameter with the values looking like this:
datadict = dict(a1=[], a2=[], a3=[], b1=[], b2=[], b3=[], c1=[], c2=[], c3=[])
It is one dictionary holding several lists, which in turn contain (float) values.
All lists have the same len().
The function should separate the nested lists from the dict according to their letters into a list of tuples like this:
list_a = []
for entry in range(len(datadict["a1"])):
    list_a.append((
        datadict["a1"][entry],
        datadict["a2"][entry],
        datadict["a3"][entry]
    ))
This works but looks very clunky / hard to read, in my opinion and I wasn't able to make a list comprehension out of it.
I tried using the zip() method, tried unpacking, but I couldn't figure it out.
I am not sure I have understood correctly the description of the task.
In general, you should make a custom function to embed the logic and use that in the list comprehension. It is way more readable.
Here is an example based on your post:
datadict = dict(a1=[1,2,3], a2=[4,5,6], a3=[7,8,9], b1=[0,0,0], b2=[1,1,1], b3=[2,2,2])
def f(d, i):
    return (d['a1'][i], d['a2'][i], d['a3'][i])
result = [f(datadict, n) for n in range(len(datadict['a1']))]
print(result)
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
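Since the question mentions trying zip(), here is a hedged sketch of that route for comparison. The helper rows_for is hypothetical, and it assumes the numeric suffixes are single digits so that sorting the key names gives the intended column order.

```python
datadict = dict(a1=[1, 2, 3], a2=[4, 5, 6], a3=[7, 8, 9],
                b1=[0, 0, 0], b2=[1, 1, 1], b3=[2, 2, 2])

# zip pairs up the i-th elements of the three lists into tuples.
list_a = list(zip(datadict['a1'], datadict['a2'], datadict['a3']))

def rows_for(d, letter):
    """Hypothetical helper: zip every list whose key starts with `letter`."""
    keys = sorted(k for k in d if k.startswith(letter))
    return list(zip(*(d[k] for k in keys)))

list_b = rows_for(datadict, 'b')
print(list_a)
print(list_b)
```

The first form matches the loop in the question one to one; the helper only saves repeating the key names for each letter.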

How is 'key=lambda x: x[1]' working here?

I am sorting a tuple of tuples by second item. The tuple I need to sort is this: tuple1 = (('a', 23),('b', 37),('c', 11), ('d',29)). The solution to this program given on the internet is as follows:
tuple1 = (('a', 23),('b', 37),('c', 11), ('d',29))
print(tuple(sorted(list(tuple1), key=lambda x: x[1])))
What I can't understand is the function of key=lambda x: x[1] expression in the code. What does the keyword key denote here? I know lambda is an anonymous function. But how is it working in this code to give the desired output?
The key argument is meant to specify how to perform the sort. You can refer to the following link:
https://www.w3schools.com/python/ref_func_sorted.asp
For a more in-depth explanation of sorted and its arguments, have a look at the following link:
https://developers.google.com/edu/python/sorting
In your case, you sort the list of tuples based on the second element from each tuple.
The keyword key is an argument to sorted. It is a function that is called on each element of list(tuple1), and the values it returns are what get compared when sorting.
The lambda function simply selects the second element of each tuple in the list, so we're comparing the ints, not the characters.
For a list of T, key takes a function T -> int (or anything that's sortable, but ints behave in the most expected way), and sorts by those values. Here T = (str, int) and the lambda returns the int.
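To make this concrete, the sketch below spells the lambda out as a named function (second is a hypothetical name) and shows the values that sorted() ends up comparing.

```python
tuple1 = (('a', 23), ('b', 37), ('c', 11), ('d', 29))

def second(item):
    # Equivalent to: lambda x: x[1]
    return item[1]

# These are the values sorted() compares, one per element.
keys = [second(t) for t in tuple1]

# sorted() accepts any iterable, so list(tuple1) is optional here.
result = tuple(sorted(tuple1, key=second))
print(keys)    # [23, 37, 11, 29]
print(result)  # (('c', 11), ('a', 23), ('d', 29), ('b', 37))
```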

sort list of tuples with multiple criteria

I have a list of tuples of k elements. I'd like to sort with respect to element 0, then element 1 and so on and so forth. I googled but I still can't quite figure out how to do it. Would it be something like this?
list.sort(key = lambda x : (x[0], x[1], ...., x[k-1]))
In particular, I'd like to sort using different criteria, for example, descending on element 0, ascending on element 1 and so on.
Since Python's sort is stable for versions after 2.2 (or perhaps 2.3), the easiest implementation I can think of is a serial repetition of sort using a series of (index, reverse_value) tuples:
# Specify the index, and whether reverse should be True/False
sort_spec = ((0, True), (1, False), (2, False), (3, True))
# Sort repeatedly from last tuple to the first, to have final output be
# sorted by first tuple, and ties sorted by second tuple etc
for index, reverse_value in sort_spec[::-1]:
    list_of_tuples.sort(key = lambda x: x[index], reverse=reverse_value)
This does multiple passes, so it costs a larger constant factor than a single sort, but it is still O(n log n) in terms of asymptotic complexity.
If the sort order for indices is truly 0, 1, ..., n-1 for a list of n-sized tuples as shown in your example, then all you need is a sequence of True and False to denote whether you want reverse or not, and you can use enumerate to add the index.
sort_spec = (True, False, False, True)
for index, reverse_value in list(enumerate(sort_spec))[::-1]:
    list_of_tuples.sort(key = lambda x: x[index], reverse=reverse_value)
The original code, by contrast, allows sorting by the indices in any order.
Incidentally, this "sequence of sorts" method is recommended in the Python Sorting HOWTO with minor modifications.
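A small runnable sketch of the multi-pass approach on sample data follows. The function name multisort and the sample rows are illustrative assumptions, not part of the original answer.

```python
def multisort(rows, spec):
    """Sort by several (index, reverse) criteria; the first criterion wins."""
    rows = list(rows)  # work on a copy
    # Apply the least significant criterion first; stability keeps
    # earlier passes intact among ties of later passes.
    for index, reverse in reversed(spec):
        rows.sort(key=lambda r: r[index], reverse=reverse)
    return rows

data = [(1, 'b'), (2, 'a'), (1, 'a'), (2, 'b')]

# Descending on element 0, ascending on element 1.
result = multisort(data, [(0, True), (1, False)])
print(result)  # [(2, 'a'), (2, 'b'), (1, 'a'), (1, 'b')]
```

This relies on reverse=True also preserving the original order of equal elements, which Python's sort guarantees.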
Edit
If you didn't have the requirement to sort ascending by some indices and descending by others, then
from operator import itemgetter
list_of_tuples.sort(key = itemgetter(1, 3, 5))
will sort by index 1, then ties will be sorted by index 3, and further ties by index 5. However, changing the ascending/descending order of each index is non-trivial in one-pass.
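One special case where a single pass is easy: if every field that should be descending is numeric, its sign can be flipped inside the key. This is a sketch under that assumption only; it does not work for strings or other non-numeric fields.

```python
rows = [(1, 5), (2, 3), (1, 2), (2, 9)]

# Descending on element 0 (negated), ascending on element 1.
one_pass = sorted(rows, key=lambda r: (-r[0], r[1]))
print(one_pass)  # [(2, 3), (2, 9), (1, 2), (1, 5)]
```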
list.sort(key = lambda x : (x[0], x[1], ...., x[k-1]))
This is actually using the tuple as its own sort key. In other words, the same thing as calling sort() with no argument.
If I assume that you simplified the question, and the actual elements are actually not in the same order you want to sort by (for instance, the last value has the most precedence), you can use the same technique, but reorder the parts of the key based on precedence:
list.sort(key = lambda x : (x[k-1], x[1], ...., x[0]))
In general, this is a very handy trick, even in other languages like C++ (if you're using libraries): when you want to sort a list of objects by several members with varying precedence, you can construct a sort key by making a tuple containing all the relevant members, in the order of precedence.
Final trick (this one is off topic, but it may help you at some point): When using a library that doesn't support the idea of "sort by" keys, you can usually get the same effect by building a list that contains the sort key. So, instead of sorting a list of Obj, you would construct and then sort a list of tuples: (ObjSortKey, Obj). Also, just inserting the objects into a sorted set will work, if the sort key is unique. (The sort key would be the index, in that case.)
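The "list of (sort key, object)" trick can be sketched as below. The Obj class and its fields are hypothetical. The sketch adds the original position as a second tuple element, which is an addition to the description above: without it, two equal keys would make Python compare the objects themselves.

```python
class Obj:
    """Hypothetical object with no ordering of its own."""
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority

objs = [Obj('x', 3), Obj('y', 1), Obj('z', 2)]

# Decorate: (sort key, position, object). The position breaks ties,
# so two Obj instances are never compared directly.
decorated = [(o.priority, i, o) for i, o in enumerate(objs)]
decorated.sort()

# Undecorate: keep only the objects, now in key order.
names = [o.name for _, _, o in decorated]
print(names)  # ['y', 'z', 'x']
```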
So I am assuming you want to sort tuple_0 ascending, then tuple_1 descending, and so on. A bit verbose but this is what you might be looking for:
ctr = 0
for i in range(len(list_of_tuples)):
    # Sort the elements of each tuple, alternating ascending and descending
    if ctr % 2 == 0:
        list_of_tuples[i] = sorted(list_of_tuples[i])
    else:
        list_of_tuples[i] = sorted(list_of_tuples[i], reverse=True)
    ctr += 1
print(list_of_tuples)

sorting list of tuples by arbitrary key

order = ['w','x','a','z']
[(object,'a'),(object,'x'),(object,'z'),(object,'a'),(object,'w')]
How do I sort the above list of tuples by the second element according to the key list provided by 'order'?
UPDATE on 11/18/13:
I found a much better approach to a variation of this question where the keys are certain to be unique, detailed in this question: Python: using a dict to speed sorting of a list of tuples.
My question above doesn't quite apply because the given list of tuples has two tuples with the key value of 'a'.
You can use sorted, and give as the key a function that returns the index of the second value of each tuple in the order list.
>>> sorted(mylist,key=lambda x: order.index(x[1]))
[('object', 'w'), ('object', 'x'), ('object', 'a'), ('object', 'a'), ('object', 'z')]
Beware, this fails whenever a value from the tuples is not present within the order list.
Edit:
To be a little safer, you could use a conditional expression (an "and ... or" chain would misplace the entry whose index is 0, because 0 is falsy):
sorted(mylist,key=lambda x: order.index(x[1]) if x[1] in order else len(order))
This will put all entries with a key that is missing from order list at the end of the resulting list.
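Because order.index scans the list on every call, a precomputed lookup table is a common alternative, in the spirit of the dict-based approach mentioned in the update above. This is a sketch, with rank as a hypothetical name and an extra 'q' entry added to the sample data to show a value that is missing from order.

```python
order = ['w', 'x', 'a', 'z']
mylist = [('object', 'a'), ('object', 'x'), ('object', 'z'),
          ('object', 'a'), ('object', 'w'), ('object', 'q')]

# Map each value to its position once, instead of calling order.index per element.
rank = {value: position for position, value in enumerate(order)}

# Values missing from `order` get rank len(order), so they sort last.
result = sorted(mylist, key=lambda x: rank.get(x[1], len(order)))
print(result)
```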

Python, how to sort dictionary by keys, when keys are floating point numbers in scientific format?

I need to sort a Python dictionary by keys, when keys are floating point numbers in scientific format.
Example:
a={'1.12e+3':1,'1.10e+3':5,'1.19e+3':7,...}
I need to maintain key-value links unchanged.
What is the simplest way to do this?
Probably by simply converting back to a number:
sorted(a, key=float)
['1.10e+3', '1.12e+3', '1.19e+3']
This just gives you a sorted copy of the keys. I'm not sure if you can write to a dictionary and change its list of keys (the list returned by keys() on the dictionary) in-place. Sounds a bit evil.
You can sort the (key, value) pairs by the float value
a={'1.12e+3':1,'1.10e+3':5,'1.19e+3':7,...}
print(sorted(a.items(), key=lambda kv: float(kv[0])))
# [('1.10e+3', 5), ('1.12e+3', 1), ('1.19e+3', 7)]
I guess you want floats anyways eventually so you can just convert them right away:
print(sorted((float(x), y) for x, y in a.items()))
# [(1100.0, 5), (1120.0, 1), (1190.0, 7)]
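If the goal is a dictionary that keeps its original string keys but iterates in numeric order, one possible sketch is shown below. It assumes Python 3.7 or later, where a plain dict preserves insertion order, and it uses only the three sample entries from the question.

```python
a = {'1.12e+3': 1, '1.10e+3': 5, '1.19e+3': 7}

# Sort the (key, value) pairs by the numeric value of the key,
# then rebuild a dict; each key stays linked to its value.
ordered = dict(sorted(a.items(), key=lambda kv: float(kv[0])))
print(list(ordered))       # ['1.10e+3', '1.12e+3', '1.19e+3']
print(ordered['1.12e+3'])  # 1
```

On older Python versions, collections.OrderedDict could be used in place of dict for the same effect.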
