Understanding PySpark reduce()

I'm learning Spark with PySpark and I'm trying different things with the reduce() function to understand it properly, but I did something that produced a result that makes no sense to me.
The previous examples I ran with reduce were basic things like:
>>> a = sc.parallelize(['a','b','c','d'])
>>> a.reduce(lambda x,y:x+y)
'abcd'
>>> a = sc.parallelize([1,2,3,4])
>>> a.reduce(lambda x,y:x+y)
10
>>> a = sc.parallelize(['azul','verde','azul','rojo','amarillo'])
>>> aV2 = a.map(lambda x:(x,1))
>>> aRes = aV2.reduceByKey(lambda x,y: x+y)
>>> aRes.collect()
[('rojo', 1), ('azul', 2), ('verde', 1), ('amarillo', 1)]
But I tried this:
>>> a = sc.parallelize(['a','b','c','d'])
>>> a.reduce(lambda x,y:x+x)
'aaaaaaaa'
I was expecting 'aaaa' as a result, not 'aaaaaaaa'.
I tried to find an answer by reading the reduce() docs, but I think I'm missing something.
Thanks!

The x in your lambda keeps changing: because your function returns x + x and ignores y, the accumulator doubles at every step, so the x seen at each successive step is
a
aa
aaaa
which gives the final result 'aaaaaaaa'. With your expression the number of characters doubles on every reduction step.
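If it helps, you can trace the same fold locally with plain functools.reduce (Spark may combine partial results per partition, but the doubling effect is the same); the trace function below is only for illustration:
>>> from functools import reduce
>>> def trace(x, y):
...     print('x =', x, '| y =', y)
...     return x + x          # same as your lambda: y is ignored, x doubles
...
>>> reduce(trace, ['a', 'b', 'c', 'd'])
x = a | y = b
x = aa | y = c
x = aaaa | y = d
'aaaaaaaa'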

How to use natsort in python to sort folder names?

I have three folders, whose names are ["-folder2-", "-folder1-", "-folder-"].
When I use sorted (or sort them in Windows), I get ["-folder-", "-folder1-", "-folder2-"].
But using natsort, it returns ["-folder1-", "-folder2-", "-folder-"].
I want to get the same result using natsort.
How can I do it?
a = ["-folder1-", "-folder2-", "-folder-"]
import natsort
sorting = natsort.natsorted(a, alg = natsort.ns.PATH | natsort.ns.LOCALE | natsort.ns.IGNORECASE)
print(sorted(a)) #---> ["-folder-", "-folder1-", "-folder2-"]
print(sorting) #---> ["-folder1-", "-folder2-", "-folder-"]
Before I answer your question, first I want to explain what is going on. natsort is looking for numbers in your input and separating them out from the non-numeric components. The easiest way to see this is by looking at the output of the natural sorting key. (I omitted the PATH and LOCALE options because they completely mangle the output).
>>> import natsort
>>> ns_key = natsort.natsort_keygen(alg=natsort.ns.IGNORECASE)
>>> a = ["-folder1-", "-folder2-", "-folder-"]
>>> [ns_key(x) for x in a]
[('-folder', 1, '-'), ('-folder', 2, '-'), ('-folder-',)]
When '-folder' is compared against '-folder-', the former is considered to be first according to Python's sorting heuristics, so your folders with numbers get placed first.
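To make that comparison concrete, here is the tuple comparison Python performs on those keys (a string that is a prefix of a longer one sorts before it):
>>> ('-folder', 1, '-') < ('-folder-',)
True
>>> '-folder' < '-folder-'
True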
To answer your question, we need to trick natsort into thinking that '-' followed by no numbers should be treated like the case with numbers. One way to do that is with regex.
>>> import re
>>> r = re.compile(r"(?<!\d)-")
>>> # What does the regex do?
>>> [r.sub(r"0\g<0>", x) for x in a]
['0-folder1-', '0-folder2-', '0-folder0-']
>>> # What does natsort generate?
>>> ns_key = natsort.natsort_keygen(key=lambda x: r.sub(r"0\g<0>", x), alg=natsort.ns.IGNORECASE)
>>> [ns_key(x) for x in a]
[('', 0, '-folder', 1, '-'), ('', 0, '-folder', 2, '-'), ('', 0, '-folder', 0, '-')]
>>> # Does it actually work?
>>> natsort.natsorted(a, key=lambda x: r.sub(r"0\g<0>", x), alg=natsort.ns.PATH | natsort.ns.LOCALE | natsort.ns.IGNORECASE)
['-folder-', '-folder1-', '-folder2-']
An alternative method would be to "split" your input on '-', which has a similar effect. This is one of the things that PATH does under the hood, but for file separators.
>>> # What does natsort generate?
>>> ns_key = natsort.natsort_keygen(key=lambda x: x.split('-'), alg=natsort.ns.IGNORECASE)
>>> [ns_key(x) for x in a]
[((), ('folder', 1), ()), ((), ('folder', 2), ()), ((), ('folder',), ())]
>>> # Does it actually work?
>>> natsort.natsorted(a, key=lambda x: x.split('-'), alg=natsort.ns.PATH | natsort.ns.LOCALE | natsort.ns.IGNORECASE)
['-folder-', '-folder1-', '-folder2-']
You may be wondering why PATH does not automatically take care of this. PATH was intended to handle oddities that arise because of file separators or file extensions. Your examples have neither, so it does not help. If the examples given here are representative, I would recommend removing the PATH option since it will only add runtime but give no benefit.

Python PriorityQueue order

I'm trying to figure out in what order PriorityQueue.get() returns values in Python. At first I thought that smaller priority values get returned first, but after a few examples it doesn't seem to work like that. This is the example that I ran:
>>> qe = PriorityQueue()
>>> qe.put("Br", 0)
>>> qe.put("Sh", 0.54743812441605)
>>> qe.put("Gl", 1.1008112004388)
>>> qe.get()
'Br'
>>> qe.get()
'Gl'
>>> qe.get()
'Sh'
Why is it returning values in this order?
According to the docs, put() has the signature put(item, block=True, timeout=None): the first parameter is the item and the second is the block flag, not a priority. So each of your strings went in as a bare item, and the queue simply returns them in their natural (lexicographic) order: 'Br', 'Gl', 'Sh'.
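As a quick illustration, the strings themselves sort like this, which matches the order you observed:
>>> sorted(["Br", "Sh", "Gl"])
['Br', 'Gl', 'Sh']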
A typical pattern for entries is a tuple in the form: (priority_number, data).
So you should pass a tuple to put like this:
>>> q = PriorityQueue()
>>> q.put((10,'ten'))
>>> q.put((1,'one'))
>>> q.put((5,'five'))
>>> q.get()
(1, 'one')
>>> q.get()
(5, 'five')
>>> q.get()
(10, 'ten')
Notice the additional parentheses: put() receives a single tuple as its item.

loop for inside lambda

I need to simplify my code as much as possible: it needs to be one line of code.
I need to put a for loop inside a lambda expression, something like this:
x = lambda x: (for i in x : print i)
Just in case someone is looking for a similar problem...
Most solutions given here are one line and are quite readable and simple. I just wanted to add one more that does not need a lambda (I am assuming that you are trying to use lambda just for the sake of making it one line of code).
Instead, you can use a simple list comprehension.
[print(i) for i in x]
BTW, the return value will be a list of Nones.
Since a for loop is a statement (as is print, in Python 2.x), you cannot include it in a lambda expression. Instead, you need to use the write method on sys.stdout along with the join method.
import sys
x = lambda x: sys.stdout.write("\n".join(x) + "\n")
To add on to chepner's answer, in Python 3 you can alternatively do:
x = lambda x: list(map(print, x))
Of course this only applies if you are able to use Python 3. It looks a bit cleaner in my opinion, but it also has a weird return value, which you're probably discarding anyway.
I'll just leave this here for reference.
anon and chepner's answers are on the right track. Python 3.x has a print function and this is what you will need if you want to embed print within a function (and, a fortiori, lambdas).
However, you can get the print function very easily in Python 2.x by importing it from __future__. Check it out:
>>> from __future__ import print_function
>>>
>>> iterable = ["a", "b", "c"]
>>> map(print, iterable)
a
b
c
[None, None, None]
>>>
I guess that looks kind of weird, so feel free to assign the return to _ if you would like to suppress [None, None, None]'s output (you are interested in the side-effects only, I assume):
>>> _ = map(print, iterable)
a
b
c
>>>
If you, like me, just want to print a sequence within a lambda without getting the return value (a list of Nones):
x = range(3)
from __future__ import print_function  # only needed if not on Python 3
pra = lambda seq=x: list(map(print, seq)) and None  # pra for 'print all'; list() forces the lazy map on Python 3
pra()
pra('abc')
A lambda is nothing but an anonymous function, so there is no need to define a function with def name():. Its general form is
lambda <inputs>: <expression>
[print(x) for x in a] -- This is the for loop in one line
a = [1,2,3,4]
l = lambda : [print(x) for x in a]
l()
output
1
2
3
4
We can also use a lambda function inside a for loop; see the code below.
list1 = [1, 2, 3, 4, 5]
list2 = []
for i in list1:
    f = lambda i: i / 2
    list2.append(f(i))
print(list2)
First of all, it is bad practice to write a lambda function as x = some_lambda_function. Lambda functions are fundamentally meant to be used inline, not stored. Writing x = some_lambda_function is equivalent to
def some_lambda_function():
    pass
Moving on to the actual answer: you can map the lambda function over an iterable, so something like the following snippet will serve the purpose.
a = map(lambda x: print(x), [1, 2, 3, 4])
list(a)
If you want to use the print function for debugging inside a reduce cycle, the logical or operator helps avoid carrying print's None return value into the accumulator.
def test_lam():
    '''printing in lambda within reduce'''
    from functools import reduce
    lam = lambda x, y: print(x, y) or x + y
    print(reduce(lam, [1, 2, 3]))

if __name__ == '__main__':
    test_lam()
Will print out the following:
1 2
3 3
6
You can make it a one-liner.
Sample
myList = [1, 2, 3]
print_list = lambda list: [print(f'Item {x}') for x in list]
print_list(myList)
otherList = [11, 12, 13]
print_list(otherList)
Output
Item 1
Item 2
Item 3
Item 11
Item 12
Item 13

python create interchangeable strings

Let's say I have 2 strings that are interchangeable, like a full word and its abbreviation: 'max' and 'maximum'.
I would like to set it up so that they behave the same; for example, if I have the following dictionary:
d = {'max':10,'a':5,'b':9}
d['maximum'] will return 10
is this even remotely possible?
Note:
these two strings could be 'dog' and 'cat'; they do not have to be related.
What I am asking is whether I could do something like:
a = 'a' or 'b'
so that the two strings are interchangeable. I understand the above is not correct syntax; I am just curious whether anything like it is possible.
You can do that using two dicts:
>>> key_dic = {'maximum':'max', 'minimum':'min'}
>>> d = {'max':10,'a':5,'b':9, 'min':-1}
>>> def get_value(key):
...     return d[key_dic.get(key, key)]
...
>>> get_value('maximum')
10
>>> get_value('max')
10
>>> get_value('min')
-1
>>> get_value('minimum')
-1
You'll need to convert it into a function, class or something similar.
d_array = {'max': 10, 'a': 5, 'b': 9}
def d(keyword):
    if keyword == "maximum":
        keyword = "max"
    return d_array[keyword]
>>> print d("maximum")
10
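Since the answer above also mentions a class, here is a minimal sketch of that route (AliasDict and its alias table are just illustrative assumptions): a dict subclass whose __missing__ hook resolves aliases to the real key.
>>> class AliasDict(dict):
...     aliases = {'maximum': 'max'}          # hypothetical alias table
...     def __missing__(self, key):
...         # only called when the key is not found directly
...         return self[self.aliases[key]]
...
>>> data = AliasDict({'max': 10, 'a': 5, 'b': 9})
>>> data['maximum']
10
>>> data['max']
10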

Square brackets not required in list comprehensions when used in a function

I submitted a pull request with this code:
my_sum = sum([x for x in range(10)])
One of the reviewers suggested this instead:
my_sum = sum(x for x in range(10))
(the difference is just that the square brackets are missing).
I was surprised that the second form seems to behave identically. But when I tried to use it in other contexts where the first one works, it fails:
y = x for x in range(10)
^ SyntaxError !!!
Are the two forms identical? Is there any important reason why the square brackets aren't necessary in the function call? Or is this just something that I have to know?
This is a generator expression. To get it to work in the standalone case, use parentheses:
y = (x for x in range(10))
and y becomes a generator. You can iterate over generators, so it works where an iterable is expected, such as the sum function.
Usage examples and pitfalls:
>>> y = (x for x in range(10))
>>> y
<generator object <genexpr> at 0x0000000001E15A20>
>>> sum(y)
45
Be careful when keeping generators around, you can only go through them once. So after the above, if you try to use sum again, this will happen:
>>> sum(y)
0
So if you pass a generator where actually a list or a set or something similar is expected, you have to be careful. If the function or class stores the argument and tries to iterate over it multiple times, you will run into problems. For example consider this:
def foo(numbers):
    s = sum(numbers)
    p = reduce(lambda x, y: x*y, numbers, 1)
    print "The sum is:", s, "and the product:", p
it will silently compute the wrong product if you hand it a generator, because sum() exhausts the values before reduce() runs:
>>> foo(x for x in range(1, 10))
The sum is: 45 and the product: 1
You can easily get a list from the values a generator produces:
>>> y = (x for x in range(10))
>>> list(y)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
You can use this to fix the previous example:
>>> foo(list(x for x in range(1, 10)))
The sum is: 45 and the product: 362880
However keep in mind that if you build a list from a generator, you will need to store every value. This might use a lot more memory in situations where you have lots of items.
Why use a generator in your situation?
The much lower memory consumption is the reason why sum(generator expression) is better than sum(list): the generator version only has to hold a single value at a time, while the list variant has to store all N values. Therefore you should prefer a generator expression whenever you only need to iterate over the values once.
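As a rough illustration of that difference (exact sizes vary across Python versions and platforms), sys.getsizeof shows that the generator object stays tiny no matter how many values it will yield:
>>> import sys
>>> big_list = [x for x in range(100000)]
>>> big_gen = (x for x in range(100000))
>>> sys.getsizeof(big_list) > 100 * sys.getsizeof(big_gen)   # the list object alone dwarfs the generator
True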
They are not identical.
The first form,
[x for x in l]
is a list comprehension. The other is a generator expression and written thus:
(x for x in l)
It returns a generator, not a list.
If the generator expression is the only argument in a function call, its parentheses can be skipped.
See PEP 289
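A small example of that rule (the values here are arbitrary): the bare form is only allowed when the generator expression is the sole argument; adding any other argument means the parentheses must be written out, otherwise you get a SyntaxError.
>>> sum(x*x for x in range(10))          # sole argument: extra parentheses optional
285
>>> sum((x*x for x in range(10)), 0)     # a second argument forces the parentheses
285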
The first one is a list comprehension, whereas the second one is a generator expression:
>>> (x for x in range(10))
<generator object at 0x01C38580>
>>> a = (x for x in range(10))
>>> sum(a)
45
>>>
Use parentheses for generators:
>>> y = (x for x in range(10))
>>> y
<generator object at 0x01C3D2D8>
>>>
Read this PEP: 289
For instance, the following summation code will build a full list of squares in memory, iterate over those values, and, when the reference is no longer needed, delete the list:
sum([x*x for x in range(10)])
Memory is conserved by using a generator expression instead:
sum(x*x for x in range(10))
As the data volumes grow larger, generator expressions tend to perform better because they do not exhaust cache memory and they allow Python to re-use objects between iterations.
Using parentheses produces a generator:
>>> y = (x for x in range(10))
>>> y
<generator object <genexpr> at 0x00AC3AA8>
