pyspark reduceByKey modify single results - python

I have a dataset that looks like this in pyspark:
samp = sc.parallelize([(1,'TAGA'), (1, 'TGGA'), (1, 'ATGA'), (1, 'GTGT'), (2, 'GTAT'), (2, 'ATGT'), (3, 'TAAT'), (4, 'TAGC')])
I have a function that I'm using to combine the strings:
def combine_strings(x,y):
    if (isinstance(x,list) and isinstance(y, list)):
        z = x + y
        return z
    if (isinstance(x, list) and isinstance(y, str)):
        x.append(y)
        return x
    if (isinstance(x, str) and isinstance(y, list)):
        y.append(x)
        return y
    return [x,y]
The result I get is:
samp.reduceByKey(lambda x,y : combine_strings(x,y)).collect()
[(1, ['TAGA', 'TGGA', 'ATGA', 'GTGT']), (2, ['GTAT', 'ATGT']), (3, 'TAAT'), (4, 'TAGC')]
What I want is:
[(1, ['TAGA', 'TGGA', 'ATGA', 'GTGT']), (2, ['GTAT', 'ATGT']), (3, ['TAAT']), (4, ['TAGC'])]
Here every value is a list, including the singletons. I can't tell whether pyspark calls combine_strings on a key that has only one entry, or whether I can tell reduceByKey to do something with singleton results. How do I modify the reduceByKey() call or the combine_strings function to produce what I'd like?

You could first map the values into lists and then only combine those lists:
samp.mapValues(lambda x : [x]).reduceByKey(lambda x,y : x + y).collect()
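The problem here is that those singletons are not affected by reduceByKey: the function is only called to merge two values that share a key, so a key with a single value is returned untouched. Here is another example: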
samp = sc.parallelize([(1,1),(2,2),(2,2),(3,3)])
>>> samp.reduceByKey(lambda x, y : x + y + 1).collect()
[(3, 3), (1, 1), (2, 5)]
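The same behaviour can be reproduced without a Spark cluster. The sketch below is a pure-Python stand-in, not the PySpark API: reduce_by_key is a hypothetical helper that groups values per key and folds them with functools.reduce, which, like reduceByKey, never calls the function when a key has only one value. Wrapping each value in a list first (the role mapValues plays above) makes every result a list.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical pure-Python stand-in for RDD.reduceByKey (not the Spark API)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # reduce() without an initializer returns a lone value unchanged,
    # so func is never called for keys that occur only once.
    return sorted((key, reduce(func, values)) for key, values in groups.items())


samp = [(1, 'TAGA'), (1, 'TGGA'), (1, 'ATGA'), (1, 'GTGT'),
        (2, 'GTAT'), (2, 'ATGT'), (3, 'TAAT'), (4, 'TAGC')]

# Singletons pass through untouched: keys 1 and 3 keep their original value.
numbers = reduce_by_key([(1, 1), (2, 2), (2, 2), (3, 3)], lambda x, y: x + y + 1)
print(numbers)  # [(1, 1), (2, 5), (3, 3)]

# Wrap every value in a list first (what mapValues does), then concatenate.
wrapped = [(key, [value]) for key, value in samp]
combined = reduce_by_key(wrapped, lambda x, y: x + y)
print(combined)
```

Because every value is already a list before the merge step, no type checks are needed in the combining function.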

Related

Remove overlapping tuple ranges from list leaving only the longest range

For a given list of range-tuples, I need to remove overlapping range-tuples while leaving the longest range among those that overlap, or keeping both if they have the same length.
e.g.
input = [ [(1, 7), (2, 3), (7, 8), (9, 20)], [(4, 7), (2, 3), (7, 10)], [(1, 7), (2, 3), (7, 8)]]
expected_output = [ [(1,7), (9,20)], [(4,7), (2, 3), (7,10)], [(1,7)] ]
So among overlapping range-tuples, only the longest should be kept.
def overlap(x:tuple, y:tuple) -> bool:
    return bool(len( range(max(x[0],y[0]), min(x[1], y[1])+1 ) ))

def drop_overlaps(tuples: list):
    def other_tuples(elems: list, t: tuple)-> list:
        return [e for e in elems if e != t]
    return [ t for t in tuples if not any( overlap(t, other_tuple)
             for other_tuple in other_tuples(tuples, t)) ]
How do I remove the overlaps and keep the longest of them and those that are non-overlapping?
You can sort the tuples by their first element, then compare each one with the last kept tuple using your overlap function. If they overlap and have the same length, append the new tuple to the result; if they overlap and the lengths differ, replace the last element of the result with the longer of the two. If they don't overlap, just append the new tuple.
def drop(lst):
    sorted_lst = sorted(lst, key=lambda x: x[0])
    diff = lambda x: abs(x[0]-x[1])
    res = [sorted_lst[0]]
    for x in sorted_lst[1:]:
        if overlap(res[-1], x):
            if diff(res[-1]) == diff(x):
                res.append(x)
            else:
                res[-1] = max(res[-1], x, key=diff)
        else:
            res.append(x)
    return res
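As a quick check, the sort-and-compare approach can be run against the input from the question. This is a sketch under a few assumptions: overlap is rewritten as a plain comparison (equivalent to the range-based test for integer endpoints), the names length, ordered and data are illustrative, and each tuple is only compared with the last kept one, so unusual inputs may need a more thorough algorithm.

```python
def overlap(x, y):
    # Closed integer ranges overlap when the later start is not past the earlier end.
    return max(x[0], y[0]) <= min(x[1], y[1])


def drop(lst):
    length = lambda t: abs(t[0] - t[1])
    ordered = sorted(lst, key=lambda t: t[0])
    res = [ordered[0]]
    for t in ordered[1:]:
        if not overlap(res[-1], t):
            res.append(t)                      # no overlap: keep it
        elif length(res[-1]) == length(t):
            res.append(t)                      # same length: keep both
        else:
            res[-1] = max(res[-1], t, key=length)  # keep only the longer one
    return res


data = [[(1, 7), (2, 3), (7, 8), (9, 20)],
        [(4, 7), (2, 3), (7, 10)],
        [(1, 7), (2, 3), (7, 8)]]

result = [drop(group) for group in data]
print(result)
# [[(1, 7), (9, 20)], [(2, 3), (4, 7), (7, 10)], [(1, 7)]]
```

Note that each output group comes back sorted by start value, so the second group contains the expected tuples but in a different order than the question lists them.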

How to join elements inside a tuple, in list of tuples?

I have a list like
A = [(1, 2, 3), (3, 4, 5), (3, 5, 7)]
and I want to turn it into
A = [[123], [345], [357]]
Is there any way to do this?
My list of tuples comes from a permutation function, so maybe you can recommend a change to that code:
import itertools

def converter(N):
    y = list(str(N))
    t = [int(x) for x in y]
    f = list(itertools.permutations(t))
    return f

r = converter(345)
print(r)
You can swizzle that up like so:
Code:
[[int(''.join(str(i) for i in x))] for x in a]
This converts each integer digit to a str, joins the digits, and converts the result back to an integer.
Test Code:
a = [(1, 2, 3), (3, 4, 5), (3, 5, 7)]
print([[int(''.join(str(i) for i in x))] for x in a])
Results:
[[123], [345], [357]]
For fun (and to provide a radically different approach):
>>> [[sum(i * 10**(len(t) - k - 1) for k, i in enumerate(t))] for t in A]
[[123], [345], [357]]
Use map with a list-comprehension:
[[int(''.join(map(str, x)))] for x in A]
# [[123], [345], [357]]
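Since the question also asks about changing the permutation code itself, here is a minimal sketch that joins the digits while the permutations are generated. The name converter_joined is hypothetical; it permutes the digit characters directly, so no int-to-str round trip is needed per digit.

```python
from itertools import permutations


def converter_joined(n):
    """Hypothetical variant of converter() that joins each permutation's digits."""
    digits = str(n)
    # permutations() yields tuples of characters; join them and convert once.
    return [[int(''.join(p))] for p in permutations(digits)]


joined = converter_joined(345)
print(joined)
# [[345], [354], [435], [453], [534], [543]]
```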

Searching for item in two dictionaries and returning result based on the match

co_1 = {'a1': [(1, 1)], 'b1': [(0, 4), (0, 0), (4, 0)]}
co_2 = {'a2': [(2, 2)], 'b2': [(1, 5), (1, 2), (5, 1)]}
position = (x, y)
How do I check whether a position, e.g. (1, 5), is present in the values of the two dictionaries co_1 and co_2?
So far I have:
for key, value in co_1.items():
    if position in value:
        return (statement1)
for key, value in co_2.items():
    if position in value:
        return(statement2)
#if position not in either value --> return None
Is there a way to clean this up so I can search for position in both dictionaries together and then have an if-else statement: if position present in values (co_1 or co_2) return (statement) else return None.
Such as:
for key, value in co_1.items() and co_2.items():
    if position in value in co_1.items():
        return statement1
    elif position in value in co_2.items():
        return statement2
    else:
        return None
#example if position = (2, 2) --> return statement2
#example if position = (3, 1) --> return None
It seems the keys are not relevant here – you could just build a set of tuples from both dictionaries.
co_1_set = {y for x in co_1 for y in co_1[x] }
co_2_set = {y for x in co_2 for y in co_2[x] }
Now, a membership test is as simple as an if-elif statement:
def foo(position):
    if position in co_1_set:
        return statement1
    elif position in co_2_set:
        return statement2
You'll want to perform the set constructions as little as possible - ideally only when the dictionary contents change.
If both your dictionaries contain position, this code returns statement1 only. If you want something different, you'll need to make changes as appropriate.
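Put together, the set-based lookup might look like the sketch below. The helper flatten_values and the function find_statement are illustrative names, and the strings 'statement1' and 'statement2' are placeholders for whatever the real code should return.

```python
co_1 = {'a1': [(1, 1)], 'b1': [(0, 4), (0, 0), (4, 0)]}
co_2 = {'a2': [(2, 2)], 'b2': [(1, 5), (1, 2), (5, 1)]}


def flatten_values(d):
    # Collect every position from every list in the dictionary into one set.
    return {pos for positions in d.values() for pos in positions}


# Build the sets once; rebuild them only when the dictionaries change.
co_1_set = flatten_values(co_1)
co_2_set = flatten_values(co_2)


def find_statement(position):
    if position in co_1_set:
        return 'statement1'   # placeholder result
    if position in co_2_set:
        return 'statement2'   # placeholder result
    return None


print(find_statement((2, 2)))  # statement2
print(find_statement((3, 1)))  # None
```

Set membership is a hash lookup, so repeated queries do not rescan the lists.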
You can flatten the values of each dict, as below:
co_1 = {'a1': [(1, 1)], 'b1': [(0, 4), (0, 0), (4, 0)]}
co_2 = {'a2': [(2, 2)], 'b2': [(1, 5), (1, 2), (5, 1)]}
def check_position(pos):
    if pos in [item for sublist in co_1.values() for item in sublist]:
        return 'statement1'
    elif pos in [item for sublist in co_2.values() for item in sublist]:
        return 'statement2'
    else:
        return None
pos1 = (2, 2)
pos2 = (0, 4)
print(check_position(pos1)) # Output: statement2
print(check_position(pos2)) # Output: statement1
Another way: sum(pos in v for v in arg.values()) acts as a membership test. It returns 0, which is falsy, when pos is not found.
Below it is used as the condition in a list comprehension that returns the enumerate index of each of the input *args that contains pos:
co_1 = {'a1': [(1, 1)], 'b1': [(0, 4), (0, 0), (4, 0)]}
co_2 = {'a2': [(2, 2)], 'b2': [(1, 5), (1, 2), (5, 1)]}
pos = (1, 5)
def q(pos, *args):
    return [i for i, arg in enumerate(args)
            if sum(pos in v for v in arg.values())]
q(pos, co_1, co_2)
Out[184]: [1]
q((0, 0), co_1, co_2)
Out[185]: [0]
q((0, 10), co_1, co_2)
Out[186]: []
You can do that using two Boolean variables:
co_1 = {'a1': [(1, 1)], 'b1': [(0, 4), (0, 0), (4, 0)]}
co_2 = {'a2': [(1, 2)], 'b2': [(1, 5), (1, 2), (5, 1)]}
position = (1,1)
tco1=False
tco2=False
for key, value in co_1.items():
    if position in value:
        tco1=True
for key, value in co_2.items():
    if position in value:
        tco2=True
if tco1==True:
    print("Statement 1") #Or return Statement 1
elif tco2==True:
    print("Statement 2") #Or return Statement 2
else :
    print("None") #Or return None
Try using a generator:
def search(item, co_1, co_2):
    co_1_values = co_1.values()
    co_2_values = co_2.values()
    for value in list(co_1_values) + list(co_2_values):
        yield 'statement1' if item in value in co_1_values else ''
        yield 'statement2' if item in value in co_2_values else ''
In this case you will receive all found statements.
If you need just the first occurrence:
for item in search((1, 1), co_1, co_2):
    if item.strip():
        print(item)
        break
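A generator can also be written so that it yields only the matches, which lets next() with a default pick the first hit or None. This is a sketch of that variation, not the code above: search_matches is a hypothetical name, and the 'statementN' strings are placeholders numbered by the dictionary's position in the argument list.

```python
co_1 = {'a1': [(1, 1)], 'b1': [(0, 4), (0, 0), (4, 0)]}
co_2 = {'a2': [(2, 2)], 'b2': [(1, 5), (1, 2), (5, 1)]}


def search_matches(item, *dicts):
    # Yield one placeholder statement per dictionary that contains the item.
    for index, d in enumerate(dicts, start=1):
        if any(item in positions for positions in d.values()):
            yield 'statement{}'.format(index)


# next() with a default returns the first match, or None when nothing matches.
first = next(search_matches((1, 1), co_1, co_2), None)
missing = next(search_matches((3, 1), co_1, co_2), None)
print(first)    # statement1
print(missing)  # None
```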

Accumulate items in a list of tuples

I have a list of tuples that looks like this:
lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
What is the best way to accumulate the sum of the first and second tuple elements? Using the example above, I'm looking for the best way to produce this list:
new_lst = [(0, 0), (2, 3), (6, 6), (11, 7)]
I am looking for a solution in Python 2.6
I would argue the best solution is itertools.accumulate() to accumulate the values, and using zip() to split up your columns and merge them back. This means the generator just handles a single column, and makes the method entirely scalable.
>>> from itertools import accumulate
>>> lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
>>> list(zip(*map(accumulate, zip(*lst))))
[(0, 0), (2, 3), (6, 6), (11, 7)]
We use zip() to take the columns, then apply itertools.accumulate() to each column, then use zip() to merge them back into the original format.
This method will work for any iterable, not just sequences, and should be relatively efficient.
Prior to 3.2, accumulate can be defined as:
def accumulate(iterator):
    total = 0
    for item in iterator:
        total += item
        yield total
(The docs page gives a more generic implementation, but for this use case, we can use this simple implementation).
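To make the zip/accumulate pipeline easier to follow, the sketch below splits the one-liner into its three steps. It assumes Python 3.2 or later for itertools.accumulate; the intermediate names columns and running are illustrative.

```python
from itertools import accumulate

lst = [(0, 0), (2, 3), (4, 3), (5, 1)]

# Step 1: transpose rows into columns.
columns = list(zip(*lst))
print(columns)   # [(0, 2, 4, 5), (0, 3, 3, 1)]

# Step 2: compute the running total of each column independently.
running = [list(accumulate(column)) for column in columns]
print(running)   # [[0, 2, 6, 11], [0, 3, 6, 7]]

# Step 3: transpose back into rows of tuples.
new_lst = list(zip(*running))
print(new_lst)   # [(0, 0), (2, 3), (6, 6), (11, 7)]
```

Because each column is handled on its own, the same three steps work for tuples of any width.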
How about this generator:
def accumulate_tuples(iterable):
    accum_a = accum_b = 0
    for a, b in iterable:
        accum_a += a
        accum_b += b
        yield accum_a, accum_b
If you need a list, just call list(accumulate_tuples(your_list)).
Here's a version that works for arbitrary length tuples:
def accumulate_tuples(iterable):
    it = iter(iterable)
    accum = next(it) # initialize with the first value
    yield accum
    for val in it: # iterate over the rest of the values
        accum = tuple(a+b for a, b in zip(accum, val))
        yield accum
>>> reduce(lambda x,y: (x[0] + y[0], x[1] + y[1]), lst)
(11, 7)
EDIT. I can see your updated question. To get the running list you can do:
>>> [reduce(lambda x,y: (x[0]+y[0], x[1]+y[1]), lst[:i]) for i in range(1,len(lst)+1)]
[(0, 0), (2, 3), (6, 6), (11, 7)]
Not super efficient, but at least it works and does what you want :)
This works for any length of tuples or other iterables.
from collections import defaultdict
def accumulate(lst):
    sums = defaultdict(int)
    for item in lst:
        for index, subitem in enumerate(item):
            sums[index] += subitem
        yield [sums[index] for index in xrange(len(sums))]
print [tuple(x) for x in accumulate([(0, 0), (2, 3), (4, 3), (5, 1)])]
In Python 2.7+ you would use a Counter instead of defaultdict(int).
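This is a plain loop that builds the result with list.append. It is not the most elegant way to do this, but it works.
last = lst[0]
new_list = [last]
for t in lst[1:]:
    # element-wise sum; last += t would concatenate the tuples instead
    last = tuple(a + b for a, b in zip(last, t))
    new_list.append(last)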
Simple method:
>>> x = [(0, 0), (2, 3), (4, 3), (5, 1)]
>>> [(sum(a for a,b in x[:t] ),sum(b for a,b in x[:t])) for t in range(1,len(x)+1)]
[(0, 0), (2, 3), (6, 6), (11, 7)]
lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
lst2 = [lst[0]]
for idx in range(1, len(lst)):
    newItem = [0,0]
    for idx2 in range(0, idx + 1):
        newItem[0] = newItem[0] + lst[idx2][0]
        newItem[1] = newItem[1] + lst[idx2][1]
    lst2.append(newItem)
print(lst2)
You can use the following function
>>> def my_accumulate(lst):
...     new_lst = [lst[0]]
...     for x, y in lst[1:]:
...         new_lst.append((new_lst[-1][0]+x, new_lst[-1][1]+y))
...     return new_lst
...
>>> lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
>>> my_accumulate(lst)
[(0, 0), (2, 3), (6, 6), (11, 7)]
Changed my code to a terser version:
lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
def accumulate(the_list):
    the_item = iter(the_list)
    accumulator = next(the_item)
    yield accumulator
    for item in the_item:  # a for loop ends cleanly when the input is exhausted
        accumulator = tuple(x+y for (x,y) in zip(accumulator, item))
        yield accumulator
new_lst = list(accumulate(lst))

Tuple and recursive list conversion

A recursive list is represented by a chain of pairs. The first element of each pair is an element in the list, while the second is a pair that represents the rest of the list. The second element of the final pair is None, which indicates that the list has ended. We can construct this structure using a nested tuple literal. Example:
(1, (2, (3, (4, None))))
So far, I've created a method that converts a tuple of values or the value None into a corresponding rlist. The method is called to_rlist(items). Example:
>>> to_rlist((1, (0, 2), (), 3))
(1, ((0, (2, None)), (None, (3, None))))
How do I write the inverse of to_rlist, a function that takes an rlist as input and returns the corresponding tuple? The method should be called to_tuple(parameter). Example of what should happen:
>>> x = to_rlist((1, (0, 2), (), 3))
>>> to_tuple(x)
(1, (0, 2), (), 3)
Note: The method to_rlist works as intended.
This is what I have so far:
def to_tuple(L):
    if not could_be_rlist(L):
        return (L,)
    x, y = L
    if not x is None and not type(x) is tuple and y is None:
        return (x,)
    elif x is None and not y is None:
        return ((),) + to_tuple(y)
    elif not x is None and not y is None:
        return to_tuple(x) + to_tuple(y)
Which gives me the following result (which is incorrect):
>>> x = to_rlist((1, (0, 2), (), 3))
>>> to_tuple(x)
(1, 0, 2, (), 3)
How can I fix my method to return a nested tuple properly?
def to_list(x):
    if x is None:
        return ()
    if type(x) != tuple:
        return x
    a, b = x
    return (to_list(a),) + to_list(b)
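A round-trip check makes it easy to see that this recursion rebuilds the nesting. The sketch below is self-contained, so it includes a guessed to_rlist: the question does not show its implementation, and this reconstruction is only an assumption that happens to reproduce the example output given there. The inverse is named to_tuple, as the question requests, and otherwise follows the recursive answer.

```python
def to_rlist(items):
    """Assumed reconstruction of the question's to_rlist (not the original code)."""
    rlist = None                      # None marks the end of the list
    for item in reversed(items):
        if isinstance(item, tuple):
            item = to_rlist(item)     # nested tuples become nested rlists
        rlist = (item, rlist)
    return rlist


def to_tuple(rlist):
    if rlist is None:
        return ()                     # empty rlist -> empty tuple
    if not isinstance(rlist, tuple):
        return rlist                  # plain element, returned as-is
    first, rest = rlist
    # Wrap the converted first element so nesting is kept, then append the rest.
    return (to_tuple(first),) + to_tuple(rest)


original = (1, (0, 2), (), 3)
x = to_rlist(original)
print(x)            # (1, ((0, (2, None)), (None, (3, None))))
print(to_tuple(x))  # (1, (0, 2), (), 3)
```

The key difference from the attempt in the question is the wrapping in `(to_tuple(first),)`: concatenating the converted first element directly flattens nested lists into the parent.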
This one worked for my HW ;)
def to_rlist(items):
    r = empty_rlist
    for i in items[::-1]:
        if is_tuple(i): r1 = to_rlist(i)
        else: r1 = i
        r = make_rlist(r1,r)
    return r
