I'm converting some PySpark code to its Scala equivalent and am having trouble with the correct syntax for a double list comprehension. The code takes a list of key-value pairs and returns, for each key, the pairs of values that occurred under that key. For example, (2, ('user1','user2','user3')) would yield (('user1','user2'),('user1','user3'),('user2','user3')).
#source rdd
[(2, ['user1', 'user3']), (1, ['user1', 'user2', 'user1']), (3, ['user2', 'user4', 'user4', 'user3'])]
#current list comprehension in pySpark
rdd2 = rdd.flatMap(lambda kv: [(x, y) for x in kv[1] for y in kv[1] if x < y])
//scala attempt at an equivalent, currently throwing syntax errors
val rdd2 = rdd.flatMap((x,y) => for (x <- _(1)) yield x for(y <- _(1)) yield y if x < y)
Scala supports multiple iterators in a comprehension.
Try this
val rdd2 = rdd.flatMap {
  case (_, v) => for {
    x <- v
    y <- v if x < y
  } yield (x, y)
}
Notes
The underscore won't work the way you used it (twice). In any case, unpacking the tuple with Scala's pattern matching is clearer (and closer to Python*). Since you don't use the first tuple item, you can put an underscore there to "throw it away".
*FWIW, you could write the Python slightly more neatly (Python 2 only; tuple parameter unpacking was removed in Python 3):
lambda (_,v): [(x, y) for x in v for y in v if x < y]
While the answer provided by Nick B translates your code directly it makes more sense to use combinations here:
rdd.values.flatMap(_.toSeq.distinct.sorted.combinations(2))
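To see how the two approaches relate without a Spark cluster, here is a minimal plain-Python sketch. The helper names pairs_by_comprehension and pairs_by_combinations are made up for illustration. Note that the comprehension can emit the same pair more than once when the input list contains repeated values, whereas combinations over the distinct, sorted values emits each pair exactly once.

```python
from itertools import combinations

def pairs_by_comprehension(values):
    # Direct translation of the PySpark lambda: every pair with x < y.
    # Repeated input values lead to repeated output pairs.
    return [(x, y) for x in values for y in values if x < y]

def pairs_by_combinations(values):
    # Deduplicate and sort first, so each pair appears once, smaller item first.
    return list(combinations(sorted(set(values)), 2))

users = ['user2', 'user4', 'user4', 'user3']
print(pairs_by_comprehension(users))
print(pairs_by_combinations(users))
```

As sets, both results contain the same pairs; only the duplicates differ.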
I have a nested list in the following format:
[['john'],['jack','john','mary'],['howard','john'],['jude']...]
I want to find the first 3 or 5 positions at which john occurs in the nested list (the list is really long) and return the indices, for example:
(0,0),(1,1),(2,1) or in any format which is advisable.
I'm fairly new to nested lists. Any help would be much appreciated.
Question 1: Here is one way, using a nested list comprehension. I will, however, check whether this is a duplicate.
nested_list = [['john'],['jack','john','mary'],['howard','john'],['jude']]
out = [(ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if y == 'john']
print(out)
Returns: [(0, 0), (1, 1), (2, 1)]
Update: Something similar can be found here: Finding the index of an element in nested lists in python. That answer, however, only takes the first value, which could be translated into:
out = next(((ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if y == 'john'),None)
print(out) # (0,0)
Question 2: (from comment)
Yes, this is quite easy: change y == 'john' to 'john' in y.
nested_list = [['john xyz'],['jack','john dow','mary'],['howard','john'],['jude']]
out = [(ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if 'john' in y]
print(out)
Returns: [(0, 0), (1, 1), (2, 1)]
Question 3: (from comment)
The most efficient way to get the first N elements is to use Python's itertools library, like this:
import itertools
nested_list = [['john xyz'],['jack','john dow','mary'],['howard','john'],['jude']]
gen = ((ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if 'john' in y)
out = list(itertools.islice(gen, 2)) # <-- Next 2
print(out)
Returns: [(0, 0), (1, 1)]
This is also answered here: How to take the first N items from a generator or list in Python?
Question 3 extended:
And say now that you want to take them in chunks of N, then you can do this:
import itertools
nested_list = [['john xyz'],['jack','john dow','mary'],['howard','john'],['jude']]
gen = ((ind,ind2) for ind,i in enumerate(nested_list)
for ind2,y in enumerate(i) if 'john' in y)
f = lambda x: list(itertools.islice(x, 2)) # Take two elements from generator
print(f(gen)) # calls the lambda function asking for 2 elements from gen
print(f(gen)) # calls the lambda function asking for 2 elements from gen
print(f(gen)) # calls the lambda function asking for 2 elements from gen
Returns:
[(0, 0), (1, 1)]
[(2, 1)]
[]
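If you would rather loop over the chunks than call the lambda by hand, one possible sketch uses the two-argument form of iter, which keeps calling a function until it returns a sentinel value (here, the empty list). The helper name chunks is an assumption made for this example.

```python
from itertools import islice

def chunks(iterable, size):
    """Return an iterator of lists with up to `size` items each."""
    it = iter(iterable)
    # iter(callable, sentinel) stops as soon as the callable returns [].
    return iter(lambda: list(islice(it, size)), [])

nested_list = [['john xyz'], ['jack', 'john dow', 'mary'], ['howard', 'john'], ['jude']]
gen = ((ind, ind2) for ind, i in enumerate(nested_list)
       for ind2, y in enumerate(i) if 'john' in y)

for chunk in chunks(gen, 2):
    print(chunk)
```

The loop ends by itself once the generator is exhausted, so no empty list is printed at the end.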
I have following two lists:
advanced_filtered_list_val1 = [row for row in cleaned_list if float(row[val1]) < wert1]
advanced_filtered_list_val2 = [row for row in cleaned_list if float(row[val2]) < wert2]
How can I combine the two filtered lists into one list, with a choice between and and or?
The data in the lists are dictionaries, and I search and filter some rows in these lists. I want to filter on two values. This works fine. But how can I now combine both filters into one list?
I tried following things:
select = int(input())
#and operation
if select == 1:
    mapped_list = [row for row in advanced_filtered_list_val1 and advanced_filtered_list_val2]
    for x in mapped_list:
        print(x)
#or operation
if select == 2:
    mapped_list = [row for row in advanced_filtered_list_val1 or advanced_filtered_list_val2]
    for x in mapped_list:
        print(x)
I import the data as follows:
faelle = [{k: v for k, v in row.items()}
for row in csv.DictReader(csvfile, delimiter=";")]
I now want to filter on wert1 and wert2, and on wert1 or wert2. That means that for the and clause both filters must be true, and for the or clause at least one of wert1 or wert2 must be true.
You want to keep the dictionaries in cleaned_list that satisfy either both wert-like conditions (AND) or at least one of them (OR). What you can do is:
import operator as op
ineq_1 = 'gt'
ineq_2 = 'lt'
select = 2
andor = {
    1: lambda L: filter(
        lambda d: getattr(op, ineq_1)(float(d[val1]), wert1)
        and getattr(op, ineq_2)(float(d[val2]), wert2),
        L
    ),
    2: lambda L: filter(
        lambda d: getattr(op, ineq_1)(float(d[val1]), wert1)
        or getattr(op, ineq_2)(float(d[val2]), wert2),
        L
    ),
}
mapped_list = andor[select](cleaned_list)
for x in mapped_list:
    print(dict(x))
The possible choices are gt (greater than), lt (less than), or eq (equal).
Note that you can make things a little more "dynamic" by also using the functions and_ and or_ of Python's built-in operator module. For example:
#Where the two following ix2-like stuffs are defined to make
# a correspondence between names one knows, and methods of the
# module operator.
ix2conj = {
1:'and_',
2:'or_',
}
ix2ineq = {
'<' :'lt',
'==':'eq',
'>' :'gt',
}
def my_filter(conjunction, inequality1, inequality2, my_cleaned_list):
    return filter(
        lambda d: getattr(op, ix2conj[conjunction])(
            getattr(op, ix2ineq[inequality1])(float(d[val1]), wert1),
            getattr(op, ix2ineq[inequality2])(float(d[val2]), wert2)
        ),
        my_cleaned_list
    )
ineq_1 = '>'
ineq_2 = '<'
select = 2
print(list(my_filter(select, ineq_1, ineq_2, cleaned_list)))  # list() because filter is lazy in Python 3
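A simpler variant of the same idea, as a rough sketch: pass the conditions as data and let the built-ins all (AND) or any (OR) combine them. The sample rows, the column names 'a' and 'b', the thresholds, and the helper name filter_rows are all hypothetical; in the question the rows come from csv.DictReader.

```python
import operator as op

# Hypothetical sample rows standing in for the CSV data.
cleaned_list = [
    {'a': '1.0', 'b': '9.0'},
    {'a': '5.0', 'b': '2.0'},
    {'a': '7.0', 'b': '8.0'},
]

def filter_rows(rows, conditions, combine=all):
    """Keep rows where `combine` (all = AND, any = OR) holds over the conditions.

    Each condition is a (column, comparison function, threshold) triple.
    """
    return [row for row in rows
            if combine(compare(float(row[col]), limit)
                       for col, compare, limit in conditions)]

conditions = [('a', op.lt, 6.0), ('b', op.lt, 5.0)]
print(filter_rows(cleaned_list, conditions, all))  # rows passing both conditions
print(filter_rows(cleaned_list, conditions, any))  # rows passing at least one
```

Adding a third condition then only means appending another triple to the list.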
I see where you're coming from with that syntax, but that's not what the and and or keywords in Python do at all: applied to two lists, a and b simply evaluates to one of the two lists, and so does a or b. To do what you're looking for, I think you'll want set-style union and intersection. You could do something like
# note that this is already the "or" one
both = list1 + [x for x in list2 if x not in list1]
# for "and"
mapped_list = [x for x in both if x in list1 and x in list2]
This keeps only unique values in the resulting lists. If duplicates don't matter, you could build the "or" list with just
both = list1 + list2
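Because the rows here are dictionaries, which are not hashable, they cannot be put into a set directly. One possible workaround, sketched under the assumption that every value in a row is itself hashable (strings, as csv.DictReader produces), is to compare rows through a hashable snapshot of their items. The helper row_key and the sample data are made up for illustration.

```python
def row_key(row):
    # A frozenset of (key, value) pairs is hashable and ignores key order.
    # Assumes all values in the row are hashable.
    return frozenset(row.items())

list1 = [{'id': '1'}, {'id': '2'}]
list2 = [{'id': '2'}, {'id': '3'}]

keys1 = {row_key(r) for r in list1}
keys2 = {row_key(r) for r in list2}

both = [r for r in list1 if row_key(r) in keys2]                 # "and"
either = list1 + [r for r in list2 if row_key(r) not in keys1]   # "or"

print(both)
print(either)
```

Membership tests against keys1 and keys2 are set lookups, so this avoids rescanning the lists for every row.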
I have a list as below
tlist = [('abc','HYD','user1'), ('xyz','SNG','user2'), ('pppp','US','user3'), ('qq','HK','user4')]
I want to display the second field of the tuple whose first field matches a given value.
Ex:
tlist('xyz')
SNG
Is there a way to get it?
A list of tuples doesn't have a hash table lookup like a dictionary, so you will need to loop through it in sequence until you find the entry:
def find_in_tuple(tlist, search_term):
    for x, y, z in tlist:
        if x == search_term:
            return y
print(find_in_tuple(tlist, 'xyz')) # prints 'SNG'
If you plan to do this multiple times, you definitely want to convert to a dictionary. I would recommend making the first element of the tuple the key and then the other two the values for that key. You can do this very easily using a dictionary comprehension.
>>> tlist_dict = { k: (x, y) for k, x, y in tlist } # Python 3 alternative: { k: v for k, *v in tlist } (values become lists)
>>> tlist_dict
{'abc': ('HYD', 'user1'), 'xyz': ('SNG', 'user2'), 'pppp': ('US', 'user3'), 'qq': ('HK', 'user4')}
You can then select the second element as follows:
>>> tlist_dict['xyz'][0]
'SNG'
If multiple tuples have xyz as their first item, use the following simple approach (with a modified example):
tlist = [('abc','HYD','user1'), ('xyz','SNG','user2'), ('pppp','US','user3'), ('xyz','HK','user4')]
second_fields = [f[1] for f in tlist if f[0] == 'xyz']
print(second_fields) # ['SNG', 'HK']
Suppose we have the following set, S, and the value v:
S = {(0,1),(2,3),(4,5)}
v = 3
I want to test if v is the second element of any of the pairs within the set. My current approach is:
for _, y in S:
    if y == v:
        return True
return False
I don't really like this, as I have to put it in a separate function and something is telling me there's probably a nicer way to do it. Can anyone shed some light?
The any function is tailor-made for this:
any( y == v for (_, y) in S )
If you have a large set that doesn't change often, you might want to project the y values onto a set.
yy = set( y for (_, y) in S )
v in yy
Of course, this is only of benefit if you compute yy once after S changes, not before every membership test.
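One property of any worth knowing is that it stops at the first match, just like the explicit loop with return True. The sketch below makes that visible by recording which items were inspected; it uses a list instead of a set so the scan order is predictable, and the helper second_items and the list seen exist only for this demonstration.

```python
pairs = [(0, 1), (2, 3), (4, 5)]
v = 3
seen = []

def second_items(items):
    for _, y in items:
        seen.append(y)  # record how far the scan got
        yield y

found = any(y == v for y in second_items(pairs))
print(found, seen)  # True [1, 3]
```

The last pair is never visited because any returns as soon as one comparison is true.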
You can't do an O(1) lookup, so you don't get much benefit from having a set. You might consider building a second set, especially if you'll be doing lots of lookups.
S = {(0,1), (2,3), (4,5)}
T = {x[1] for x in S}
v = 3
if v in T:
    pass  # do something
The trivial answer is any (see Marcelo's answer).
An alternative is zip.
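>>> list(zip(*S))  # list() is needed in Python 3; element order follows the set's arbitrary iteration order
[(4, 0, 2), (5, 1, 3)]
>>> v in list(zip(*S))[1]
True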
Given a list of items, and a map from a predicate function to the "value" function, the code below applies "value" functions to the items satisfying the corresponding predicates:
import re
my_re0 = re.compile(r'^([a-z]+)$')
my_re1 = re.compile(r'^([0-9]+)$')
my_map = [
(my_re0.search, lambda x: x),
(my_re1.search, lambda x: x),
]
for x in ['abc','123','a1']:
    for p, f in my_map:
        v = p(x)
        if v:
            print(f(v.groups()))
            break
Is there a way to express the same with a single statement?
If I did not have to pass the value returned by the predicate to the "value" function then I could do
for x in ['abc','123','a1']:
    print(next((f(x) for p, f in my_map if p(x)), None))
Can something similar be done for the code above? I know, maybe it is better to leave these nested for loops, but I am just curious whether it is possible.
A bit less terse than Nate's ;-)
from itertools import product
comb = product(my_map, ['abc','123','a1'])
mapped = ((p(x),f) for (p,f),x in comb)
groups = (f(v.groups()) for v,f in mapped if v)
print(next(groups), list(groups)) # first match and the rest of them
[f(v.groups()) for x in ['abc','123','a1'] for p, f in my_map for v in [p(x)] if v]
You said more terse, right? ;^)
Here is my version:
for x in ['abc','123','a1']:
    print(next((f(v.groups()) for p, f in my_map for v in [p(x)] if v), None))
This version does not iterate over the whole my_map but stops as soon as the first successful mapping is found.
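On Python 3.8 or later, an assignment expression can replace the `for v in [p(x)]` trick. This is a sketch under that version assumption; the function name first_match is made up for the example, and the regular expressions are the ones from the question.

```python
import re

my_map = [
    (re.compile(r'^([a-z]+)$').search, lambda groups: groups),
    (re.compile(r'^([0-9]+)$').search, lambda groups: groups),
]

def first_match(x):
    # (m := p(x)) binds the match object so f(m.groups()) can reuse it.
    return next((f(m.groups()) for p, f in my_map if (m := p(x))), None)

for x in ['abc', '123', 'a1']:
    print(first_match(x))
```

Like the version above, next stops at the first predicate that matches and falls back to None when none does.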