Function invoked by PySpark map does not modify global list


I have defined this function, which operates on the global list signature. I have tested the function and it works.
def add_to_list_initial(x):
    global signature
    signature.append([x])
    print(x)
    return x
The print is there to check whether the function is invoked.
I have to run this function for each row of a PySpark RDD, so I wrote this code:
rdd.map(lambda x: min([str(int.from_bytes(hash_functions[0](str(shingle)), 'big')) for shingle in x])).map(lambda x: add_to_list_initial(x))
But the function is not invoked, so, to get around the laziness of map, I added .count() at the end, like this:
rdd.map(lambda x: min([str(int.from_bytes(hash_functions[0](str(shingle)), 'big')) for shingle in x])).map(lambda x: add_to_list_initial(x)).count()
Now the print output appears. However, when I check afterwards whether the list signature was updated and print its size, the result is 0: the list has not been updated at all.
I have also tried foreach instead of map, but the result is the same:
rdd1 = rdd.map(lambda x: min([str(int.from_bytes(hash_functions[0](str(shingle)), 'big')) for shingle in x]))
rdd1.foreach(add_to_list_initial)
These are the first lines of the output. Everything, including the printed values, appears in red in my PyCharm console:
19/11/19 21:56:51 WARN TaskSetManager: Stage 2 contains a task of very large size (76414 KB). The maximum recommended task size is 100 KB.
1000052032941703168135263382785614272239884872602
1001548144792848500380180424836160638323674923493
1001192257270049214326810337735024900266705408878
1005273115771118475643621392239203192516851021236
100392090499199786517408984837575190060861208673
1001304115299775295352319010425102201971454728176
1009952688729976061710890304226612996334789156125
1001064097828097404652846404629529563217707288121
1001774517560471388799843553771453069473894089066
1001111820875570611167329779043376285257015448116
1001339474866718130058118603277141156508303423308
1003194269601172112216983411469283303300285500716
1003194269601172112216983411469283303300285500716
1003194269601172112216983411469283303300285500716
1003194269601172112216983411469283303300285500716
1003194269601172112216983411469283303300285500716
How can I resolve this efficiently?
I use Python 3.7 and Pyspark 3.2.1
I'm doing this in order to obtain a min-hash signature for each set of hashed shingles, where the id of the document is
Then, to compute the other permutations, I plan to proceed like this:
def add_to_list(x):
    global num_announcements
    global signature
    global i
    print(len(signature))
    if i == num_announcements:
        i = 0
    signature[i].append(x)
    print(i)
    i += 1
for function in hash_functions[1:]:
    rdd.map(lambda x: min([str(int.from_bytes(function(str(shingle)), 'big')) for shingle in x])).foreach(add_to_list)
But the problem is the same.
I would also welcome suggestions for my min-hashing problem, but the question is about the problem described above.
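A likely explanation, offered as a sketch rather than a definitive diagnosis: PySpark serializes the function passed to map or foreach, together with the globals it references, and runs it in separate Python worker processes. Each task therefore appends to its own copy of signature, and the list on the driver never changes. The snippet below only mimics that behaviour with pickle on a plain list; no Spark is involved, and worker_task is a hypothetical name.

```python
import pickle

signature = []  # the "driver-side" global list

def worker_task(serialized_list, values):
    # Hypothetical stand-in for a Spark task: it works on a deserialized copy.
    local_signature = pickle.loads(serialized_list)
    for value in values:
        local_signature.append([value])
    return local_signature

worker_result = worker_task(pickle.dumps(signature), ["a", "b", "c"])

print(len(worker_result))  # 3: the copy was modified
print(len(signature))      # 0: the original list was not
```

In real PySpark code, the usual way to get values back is to return them from the transformation and call an action such as collect() on the driver, instead of relying on side effects.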

I resolved it this way, although I did not find a general solution to the problem:
signatures = shingles.flatMap(lambda x: [[(x[1]+1, (x[1]+1)%lsh_b), min([int.from_bytes(function(str(s)), 'big') for s in x[0]])] for function in hash_functions]).cache()
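The same idea can be sketched in plain Python, without Spark: each document's min-hash values are returned from a function instead of being appended to a global list. This is only an illustration under assumptions: make_hash_function and the salted SHA-1 hashes are hypothetical stand-ins for the hash_functions of the question, which are not shown there.

```python
import hashlib

def make_hash_function(salt):
    # Hypothetical stand-in for one entry of hash_functions:
    # it takes a string and returns bytes.
    def hash_function(text):
        return hashlib.sha1((salt + text).encode("utf-8")).digest()
    return hash_function

hash_functions = [make_hash_function(str(seed)) for seed in range(4)]

def minhash_signature(shingles):
    # One minimum per hash function, returned rather than stored globally.
    return [
        min(int.from_bytes(h(str(s)), "big") for s in shingles)
        for h in hash_functions
    ]

documents = [{"ab", "bc", "cd"}, {"bc", "cd", "de"}]
signatures = [minhash_signature(doc) for doc in documents]

print(len(signatures), len(signatures[0]))
```

Because the function has no side effects, it can be passed to map or flatMap as is, and the results can then be collected on the driver.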

Instead of a global variable, you can use a class (a Callable).
For instance:
from collections.abc import Callable
class Signature(Callable):
    def __init__(self):
        self.signature = []
    def __call__(self, x):
        self.signature.append([x])
        return x
Then, you can instantiate this callable where you need it:
add_to_list_initial = Signature()
rdd.map(lambda x: min([str(int.from_bytes(hash_functions[0](str(shingle)), 'big')) for shingle in x])).map(
    lambda x: add_to_list_initial(x)
).count()
print(add_to_list_initial.signature)
Note: you can avoid the lambda expression here and simplify to:
rdd.map(lambda x: min([str(int.from_bytes(hash_functions[0](str(shingle)), 'big')) for shingle in x])).map(
    add_to_list_initial
).count()
EDIT
To allow pickling, you can use:
class Signature:
    def __init__(self):
        self.signature = []
    def __call__(self, x):
        self.signature.append([x])
        return x

Related

python: Is it possible to make another variable that contains an undefined variable?

The following code runs, but its problem is that v_rates cannot be changed dynamically.
import numpy as np
def odes(x,t=0):
    v_rates = np.array([x[0]*x[2], x[1], x[1], x[0]*x[3]])
    v_k = np.array([[-1,1,1,-1],
                    [1,-1,-1,1],
                    [-1,1, 0,0],
                    [0, 0,1,-1]])
    return np.matmul(v_k, v_rates)
print(odes([1,2,2,1]))
For my use case I want to be able to use different versions of the v_rates array, i.e. I would like to pass v_rates as an argument so that the function becomes 'odes(x,t, v_rates)'. However, there is a problem: because x is not defined in advance, I cannot build v_rates outside the function, since it would refer to an undefined variable. My question is how to define the function so that an extra argument determines whether v_rates is version_1 or version_2 from below:
def version_1(x):
    return np.array([x[0]*x[2], x[1], x[1], x[0]*x[3]])
def version_2(x):
    return np.array([x[3], x[3], x[4], x[3]])

You could pass a function into odes that computes v_rates from x:
def odes(x, t=0, get_vrates_func=version_1):
    v_rates = get_vrates_func(x)
    # ... then build v_k and return np.matmul(v_k, v_rates) as before
and then call it either with the default, or with your own function or a lambda (x must be a sequence, as in the question):
odes(x, t)
odes(x, t, version_2)
odes(x, t, lambda x: np.array([a, b, c, ...]))

python lambda evaluate expression

I am trying out lambda in python and came across this question:
def foo(y):
    return lambda x: x(x(y))
def bar(x):
    return lambda y: x(y)
print((bar)(bar)(foo)(2)(lambda x:x+1))
Can someone explain or break down how this code works? I am having trouble figuring out what x and y are.

Lambda functions are just functions. They're almost syntactic sugar, as you can think of this structure:
anony_mouse = lambda x: x # don't actually assign lambdas
as equivalent to this structure:
def anony_mouse(x):
    return x
(Almost, as there is no other way of getting a function without assigning it to some variable, and the syntax prevents you doing some things with them, such as using multiple lines.)
Thus let's write out the top example using standard function notation:
def foo(y):
    # note that y exists here
    def baz(x):
        return x(x(y))
    return baz
So we have a factory function, which generates a function which... expects to be called with a function (x), and returns x(x(arg_to_factory_function)). Consider:
>>> def add_six(x):
...     return x + 6
...
>>> bazzer = foo(3)
>>> bazzer(add_six) # add_six(add_six(3)) = 6+(6+3)
15
I could go on, but does that make it clearer?
Incidentally that code is horrible, and almost makes me agree with Guido that lambdas are bad.

The first '(bar)' is just 'bar', so this is an ordinary function call whose argument is the second 'bar', i.e. bar(bar). Substitute 'bar' for 'x' in the body of bar and you get the result of bar(bar). Then '(foo)' is passed as the argument to that result, which is a lambda function taking one argument: substitute 'foo' for that argument to get the next result, and so on until you reach the end of the expression.
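The substitution described above can be written out one call at a time. This is a sketch that reuses foo and bar from the question; the step1 to step3 names are introduced here only for illustration.

```python
def foo(y):
    return lambda x: x(x(y))

def bar(x):
    return lambda y: x(y)

step1 = bar(bar)                  # lambda y: bar(y)
step2 = step1(foo)                # bar(foo), i.e. lambda y: foo(y)
step3 = step2(2)                  # foo(2), i.e. lambda x: x(x(2))
result = step3(lambda x: x + 1)   # (2 + 1) + 1

print(result)  # 4
```

Each intermediate value is itself a function until the last call, which finally applies x + 1 twice to 2.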

I slightly modified your original functions to make it clearer what's going on (in particular, which parameter is callable!)
# given a function, return a function that evaluates it at value p
def eval(func): # your bar (note that this name shadows the built-in eval)
    return lambda p: func(p)
# given a value p, return a function that applies a function twice, starting from p
def iter_2(p): # your foo
    return lambda func: func(func(p))
increment = lambda x: x + 1 # variable binding only for readability
This example is quite hard to understand because one of the functions, eval, does nothing special, and its composition with itself is equivalent to the identity! So it can be quite confusing.
foo(2)(lambda x: x+1):
x = 2
iter_2(x)(increment) # increment by 2 because iter_2 calls increment twice
# 4
Idempotency (composing eval with itself acts as the identity function):
increment(3) == eval(increment)(3)
# True
# idempotency - second composition is equivalent to the identity
eval(increment)(3) == eval(eval)(increment)(3)
# True
eval(increment)(3) == eval(eval)(eval)(increment)(3)
# True
# ... and so on
Finally, as a consequence of idempotency, bar does nothing except add confusion:
eval(eval)(iter_2)(x)(increment) == iter_2(x)(increment)
# True
Remark:
in (bar)(bar)(foo)(2)(lambda x:x+1) you can omit the brackets around the 1st term, just bar(bar)(foo)(2)(lambda x:x+1)
Digression: [since your example is quite scary]
Lambda functions are also known as anonymous functions. Why? Simply because they don't need to be declared with a name. They are designed to be single-purpose, so you should "never" assign them to a variable. They arise, for example, in functional programming, where the basic ingredients are... functions! They are used to modify the behavior of other functions (for example by decoration!). Your example is just a standalone syntactical one... essentially a nonsense example which hides the true "power" of lambda functions. There is also a branch of mathematics based on them, called lambda calculus.
Here is a totally different example of an application of lambda functions, useful for decoration (but this is another story):
def action(func1):
    return lambda func2: lambda p: func2(p, func1())
def save(path, content):
    print(f'content saved to "{path}"')
def content():
    return 'content' # i.e. from a file, url, ...
# call
action(content)(save)('./path')
# with each parameter passed by keyword, this would be
action(func1=content)(func2=save)(p='./path')
Output
content saved to "./path"

Modifying functional python compose() to return list of all intermediate values

In Python 3, here is my compose function one-liner, which I am trying to modify:
import functools
def compose(*fncs):
    return functools.reduce(lambda f,g: lambda x: f(g(x)), fncs, lambda x: x)
When I compose a function with c = compose(h, g, f), calling c(x) is equivalent to calling h(g(f(x))).
By changing my existing one-liner as little as possible, I would like to create a compose_intermed(*fncs) function which returns a slightly different kind of composed function. This function, when called, returns not the final value of the composed functions, but a list whose first element is the final value, followed by all the intermediate values at each step in which composed functions are applied.
When I compose a function with ci = compose_intermed(h, g, f), calling ci(x) would return the list [h(g(f(x))), g(f(x)), f(x)].
I would like to modify the compose function as little as possible, continuing to use either reduce or perhaps a list comprehension, rather than loops. I know there may be easier ways to do this, but I'm trying to use this as an exercise to improve my general understanding of the nexus of functional programming and Python 3.
Bonus question: Does this function have another more standardized name in the functional programming world? I've searched several libraries, and I haven't yet found a library function for what I am trying to do.

Ry's comment is a good starting point. In this post, I'll try to demonstrate what he/she is talking about -
from functools import reduce
def identity(x):
    return x
def pipeline(f = identity, *fs):
    return reduce(lambda r,f: lambda x: f(r(x)), fs, f)
Make two simple functions and test it out. Notice how pipeline applies the functions in left-to-right order -
def add1(x):
    return x + 1
def mult2(x):
    return x * 2
f = pipeline(mult2, add1, add1, add1)
print(f(10))
# 23
Next, implement pipeline_intermediate. Just as Ry comments, the output is reversed at the end using [::-1] -
def pipeline_intermediate(f = identity, *fs):
    return lambda x: reduce(lambda r,f: [f(r[0])]+r, fs, [f(x)])[::-1]
g = pipeline_intermediate(mult2, add1, add1, add1)
print(g(10))
# [20, 21, 22, 23]
Now can you see how to implement right-to-left compose_intermediate? Can you see why it's more challenging?
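One possible answer to that exercise, given as a sketch and not as the only way to do it: walk the functions in reverse so that the rightmost one is applied first, prepend each new value to the accumulator, and drop the original input at the end. The name compose_intermed comes from the question; the add1 and mult2 helpers are the ones defined above.

```python
from functools import reduce

def compose_intermed(*fncs):
    # reversed(fncs) applies the rightmost function first (right-to-left composition).
    # Each step prepends the newest value, so the final result ends up first.
    # The trailing [:-1] drops the original input x from the list.
    return lambda x: reduce(
        lambda acc, f: [f(acc[0])] + acc, reversed(fncs), [x]
    )[:-1]

def add1(x):
    return x + 1

def mult2(x):
    return x * 2

ci = compose_intermed(add1, add1, mult2)
print(ci(10))  # [22, 21, 20]
```

The extra work compared with the left-to-right pipeline is the reversal of the argument order and the removal of the seed value.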

python 3.4 lambda and reverse, how?

I have a question about Python 3.4.
Let's say I do:
inches_to_meters = lambda x: x*0.0254
inches_to_feets = lambda x: x*(1/12)
miles_to_feets = lambda x: x*5280
I want to know how to compute the opposite (inverse) function using only lambda. How can I do it?
For example:
feets_to_inches = opposite(inches_to_feets)
As a further example, I want composition using only lambda:
miles_to_inches = composition(feets_to_inches, miles_to_feets)
Thanks for the help.

The task is specifically limited to converting distances, using the specified lambdas as the base and opposite and composition lambdas for the rest.
Since these unit conversions are simple multiplications, you can recover the conversion factor by calling the function with 1, and then divide by that factor. Basically:
opposite = lambda f: lambda x: x/f(1)
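A short sketch of how this fits with the lambdas from the question. It assumes every conversion is a pure multiplication (f(x) = k * x); the trick does not work for conversions with an offset. The lambda form of composition is an assumption added here to keep everything lambda-only.

```python
inches_to_feets = lambda x: x * (1 / 12)
miles_to_feets = lambda x: x * 5280

# f(1) is the conversion factor k, so dividing by it undoes f.
opposite = lambda f: lambda x: x / f(1)
# Apply g first, then f.
composition = lambda f, g: lambda x: f(g(x))

feets_to_inches = opposite(inches_to_feets)
miles_to_inches = composition(feets_to_inches, miles_to_feets)

print(feets_to_inches(1))   # approximately 12
print(miles_to_inches(1))   # approximately 63360
```

Because of floating-point rounding, the printed values may differ from 12 and 63360 in the last digits.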

As Willem Van Onsem says, you cannot automatically define the inverse of a function. You can, however, compute the appropriate conversion factor and pass that to a converter-making function.
def make_converter(factor):
    def _(x):
        return x * factor
    return _
inches_in_feet = 12
feet_to_inches = make_converter(inches_in_feet)
inches_to_feet = make_converter(1/inches_in_feet)
Composition, however, is trivial (assuming the output of the first function is the expected input of the second):
def composition(f, g):
    return lambda x: f(g(x))

Pythonic way to re-apply a function to its own output n times?

Assume there are some useful transformation functions, for example random_spelling_error, that we would like to apply n times.
My temporary solution looks like this:
def reapply(n, fn, arg):
    for i in range(n):
        arg = fn(arg)
    return arg
reapply(3, random_spelling_error, "This is not a test!")
Is there a built-in or otherwise better way to do this?
It need not handle variable-length args or keyword args, but it could. The function will be called at scale, but the values of n will be low and the size of the argument and return value will be small.
We could call this reduce, but that name was of course taken for a function that can do this and much more, and it was removed from the built-ins in Python 3. Here is Guido's argument:
So in my mind, the applicability of reduce() is pretty much limited to
associative operators, and in all other cases it's better to write out
the accumulation loop explicitly.

reduce is still available in python 3 using the functools module. I don't really know that it's any more pythonic, but here's how you could achieve it in one line:
from functools import reduce
def reapply(n, fn, arg):
    return reduce(lambda x, _: fn(x), range(n), arg)
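A quick way to convince yourself that this one-liner behaves like the loop version is to run it with a deterministic function. The double helper below is hypothetical and stands in for random_spelling_error, whose output would not be reproducible.

```python
from functools import reduce

def reapply(n, fn, arg):
    # range(n) only drives the number of iterations; its values are ignored.
    return reduce(lambda x, _: fn(x), range(n), arg)

def double(x):
    # Hypothetical stand-in for a transformation function.
    return x * 2

print(reapply(3, double, 1))  # 8
print(reapply(0, double, 5))  # 5: with n == 0 the argument is returned unchanged
```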

Get rid of the custom function completely: you're trying to compress two readable lines into one confusing function call. Which one do you think is easier to read and understand, your way:
foo = reapply(3, random_spelling_error, foo)
Or a simple for loop that's one more line:
for _ in range(3):
    foo = random_spelling_error(foo)
Update: According to your comment
Let's assume that there are many transformation functions I may want to apply.
Why not try something like this:
modifiers = (random_spelling_error, another_function, apply_this_too)
for modifier in modifiers:
    for _ in range(3):
        foo = modifier(foo)
Or, if you need a different number of repeats for different functions, try creating a list of tuples:
modifiers = [
    (random_spelling_error, 5),
    (another_function, 3),
    ...
]
for modifier, count in modifiers:
    for _ in range(count):
        foo = modifier(foo)

Some like recursion, though it is not always obviously 'better':
def reapply(n, fn, arg):
    if n:
        arg = reapply(n-1, fn, fn(arg))
    return arg
reapply(1, lambda x: x**2, 2)
Out[161]: 4
reapply(2, lambda x: x**2, 2)
Out[162]: 16
