Intersecting two sets, retaining all (up to) three parts efficiently - python

If you have two sets a and b and intersect them, there are three interesting parts (any of which may be empty): the h(ead), elements of a not in b; the i(ntersection), elements in both a and b; and the t(ail), elements of b not in a.
For example: {1, 2, 3} & {2, 3, 4} -> h:{1}, i:{2, 3}, t:{4} (not actual Python code, clearly)
One very clean way to code that in Python:
h, i, t = a - b, a & b, b - a
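For instance, with the sets from the example above:
a, b = {1, 2, 3}, {2, 3, 4}
h, i, t = a - b, a & b, b - a
print(h, i, t)  # {1} {2, 3} {4}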
I figure that this can be slightly more efficient though:
h, t = a - (i := a & b), b - i
It first computes the intersection and then subtracts only that from a and then from b, which should help if i is small while a and b are large - although I suppose it depends on the implementation of the subtraction whether it's truly faster. It's not likely to be worse, as far as I can tell.
I was unable to find such an operator or function, but since I can imagine efficient implementations that perform the three-way split of a and b into h, i, and t in fewer passes, am I missing something that may already exist? Something like:
from magical_set_stuff import hit
h, i, t = hit(a, b)

It's not in Python, and I haven't seen such a thing in a 3rd-party library either.
Here's a perhaps unexpected approach that's largely insensitive to which sets are bigger than others, and to how much overlap among inputs there may be. I dreamed it up when facing a related problem: suppose you had 3 input sets, and wanted to derive the 7 interesting sets of overlaps (in A only, B only, C only, both A and B, both A and C, both B and C, or in all 3). This version strips that down to the 2-input case. In general, assign a unique power of 2 to each input, and use those as bit flags:
from collections import defaultdict

def hit(a, b):
    x2flags = defaultdict(int)
    for x in a:
        x2flags[x] = 1
    for x in b:
        x2flags[x] |= 2
    result = [None, set(), set(), set()]
    for x, flag in x2flags.items():
        result[flag].add(x)
    return result[1], result[3], result[2]
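For reference, here's a sketch of the three-input generalisation described above, using the same bit-flag idea (illustrative code, not from any library):
from collections import defaultdict

def overlaps3(a, b, c):
    # Split three sets into the 7 regions of their Venn diagram.
    x2flags = defaultdict(int)
    for x in a:
        x2flags[x] |= 1
    for x in b:
        x2flags[x] |= 2
    for x in c:
        x2flags[x] |= 4
    # result[f] holds the elements whose membership pattern is bit mask f,
    # e.g. result[3] = in a and b only, result[7] = in all three.
    result = [set() for _ in range(8)]
    for x, flag in x2flags.items():
        result[flag].add(x)
    return result[1:]  # drop the unused slot 0, leaving the 7 overlap sets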

I won't accept my own answer unless nobody manages to beat it or any of the good and concise Python solutions.
But for anyone interested in some numbers:
from random import randint
from timeit import timeit
from collections import defaultdict

def grismar(a: set, b: set):
    h, i, t = set(), set(), b.copy()
    for x in a:
        if x in t:
            i.add(x)
            t.remove(x)
        else:
            h.add(x)
    return h, i, t

def good(a: set, b: set):
    return a - b, a & b, b - a

def better(a: set, b: set):
    h, t = a - (i := a & b), b - i
    return h, i, t

def ok(a: set, b: set):
    return a - (a & b), a & b, b - (a & b)

def tim(a, b):
    x2flags = defaultdict(int)
    for x in a:
        x2flags[x] = 1
    for x in b:
        x2flags[x] |= 2
    result = [None, set(), set(), set()]
    for x, flag in x2flags.items():
        result[flag].add(x)
    return result[1], result[3], result[2]

def pychopath(a, b):
    h, t = set(), b.copy()
    h_add = h.add
    t_remove = t.remove
    i = {x for x in a
         if x in t and not t_remove(x) or h_add(x)}
    return h, i, t

def enke(a, b):
    t = b - (i := a - (h := a - b))
    return h, i, t

xs = set(randint(0, 10000) for _ in range(10000))
ys = set(randint(0, 10000) for _ in range(10000))

# validation
g = (f(xs, ys) for f in (grismar, good, better, ok, tim, enke))
l = set(tuple(tuple(sorted(s)) for s in t) for t in g)
assert len(l) == 1, 'functions are equivalent'

# warmup, not competing
timeit(lambda: grismar(xs, ys), number=500)

# competition
print('a - b, a & b, b - a            ', timeit(lambda: good(xs, ys), number=10000))
print('a - (i := a & b), b - i        ', timeit(lambda: better(xs, ys), number=10000))
print('a - (a & b), a & b, b - (a & b)', timeit(lambda: ok(xs, ys), number=10000))
print('tim                            ', timeit(lambda: tim(xs, ys), number=10000))
print('grismar                        ', timeit(lambda: grismar(xs, ys), number=10000))
print('pychopath                      ', timeit(lambda: pychopath(xs, ys), number=10000))
print('b - (i := a - (h := a - b))    ', timeit(lambda: enke(xs, ys), number=10000))
Results:
a - b, a & b, b - a 5.6963334
a - (i := a & b), b - i 5.3934624
a - (a & b), a & b, b - (a & b) 9.7732018
tim 16.3080373
grismar 7.709292500000004
pychopath 6.76331460000074
b - (i := a - (h := a - b)) 5.197220600000001
So far, the optimisation proposed by @enke in the comments appears to win out:
t = b - (i := a - (h := a - b))
return h, i, t
Edit: added @Pychopath's results, which are indeed substantially faster than my own, although @enke's result is still the one to beat (and likely won't be beaten with just Python). If @enke posts their own answer, I'd happily accept it.

An optimized version of yours; it seems to be about 20% faster than yours in your benchmark:
def hit(a, b):
    h, t = set(), b.copy()
    h_add = h.add
    t_remove = t.remove
    i = {x for x in a
         if x in t and not t_remove(x) or h_add(x)}
    return h, i, t
And you might want to do this at the start, especially if the two sets can have significantly different sizes:
if len(a) > len(b):
    return hit(b, a)[::-1]
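To make the swap concrete (using the hit above):
a, b = {1, 2, 3}, {2, 3, 4, 5, 6, 7}
# hit(b, a) returns (elements only in b, intersection, elements only in a);
# reversing that tuple gives (h, i, t) for the original (a, b) order.
h, i, t = hit(b, a)[::-1]
print(h, i, t)  # {1} {2, 3} {4, 5, 6, 7} (set display order may vary)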

Find the missing starting number of shuffled cumsum series

a and b are two arrays of floats of length n each. a can have both negative and positive entries.
b is the cumulative sum of a.
b[0] != a[0]. In fact, b[0] = a[0] + k
Both a and b are shuffled such that the relative order between them is maintained, i.e., if say a[0] becomes a[6] then b[0] will become b[6] and so on.
Can someone suggest an algorithm to find k for randomly shuffled a and b such that their relative order is maintained?
My naive attempt is below (it takes forever for n >= 10):
import numpy as np
import itertools

def get_starting_point(a, b):
    for msk in itertools.permutations(range(len(a))):  # NOTE: Takes forever for n>=10.
        new_a = a[list(msk)]
        new_b = b[list(msk)]
        k = new_b[0] - new_a[0]
        new_a = np.cumsum(new_a) + k
        if np.nansum(np.abs(new_b - new_a)) < 0.001:
            return k
    return None
Generate samples of a, b and expected k to try your solution:
def get_a_b_k(n=14):
    a = np.round(np.random.uniform(low=-10, high=10, size=(n,)), 2)
    b = np.cumsum(a)
    prob = np.random.uniform(0, 1)
    if prob < 0.4:
        k = np.round(np.random.uniform(-10, 10), 2)
    # NOTE: this elif can be removed as it's just a sub-case of the else block.
    elif prob < 0.6:  # k same as the last b.
        k = b[n - 1]
        a[n - 2] -= k
    else:  # k same as one of b's
        idx = np.random.choice(n, size=1)
        k = b[idx]
        a[idx] -= k
    b = np.cumsum(a)
    msk = np.random.choice(n, size=n, replace=False)  # Randomly generated mask of size n.
    return a[msk], b[msk] + k, k
We have:
b = np.cumsum(a) + k
We can compute b - a to recover the previous partial sums: at every position except the starting one, b[i] - a[i] equals an earlier element of b, while at the starting position it equals k. Thus the only element of b - a that does not belong to b indicates the position of the start.
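For example, with illustrative numbers of my own (shown unshuffled for readability), take a = [3, -1, 2] and k = 10:
import numpy as np

a = np.array([3.0, -1.0, 2.0])
k = 10.0
b = np.cumsum(a) + k   # [13., 12., 14.]
print(b - a)           # [10., 13., 12.]: 13. and 12. are elements of b; 10. is not, and equals k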
As we are working with floating point numbers, we need a function that matches floating point values within a tolerance. I used the isin_tolerance function that is defined here.
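For completeness, here is a minimal sketch of what such a tolerance-based membership test could look like (my own assumption about its behaviour, not necessarily the linked implementation, and quadratic in memory):
import numpy as np

def isin_tolerance(elements, test_elements, tol):
    # For each value in `elements`, check whether some value in `test_elements`
    # lies within `tol` of it; returns a boolean array shaped like `elements`.
    elements = np.asarray(elements)
    test_elements = np.asarray(test_elements)
    return (np.abs(elements[:, None] - test_elements[None, :]) <= tol).any(axis=1)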
def solve(a, b):
    m = isin_tolerance(b - a, b, 1e-8)
    return (b[~m] - a[~m])[0]

np.random.seed(0)
for i in range(1_000_000):
    a, b, k = get_a_b_k()
    assert np.isclose(k, solve(a, b))
This takes a few minutes to run on 1M attempts but did not fail. On 10k tests with n=200 this runs in ~2s.
NB. This could fail if, coincidentally, k is equal to one of the values in b, but this is fairly unlikely and did not happen in my random tests.

performing more complex calculations on groupby objects

I am currently looking to do some calculations on a large dataset of options. I want to first split the data by strike price and expiry, then perform the set of calculations shown below on each subgroup. I have been able to separate the data using groupby to get the split I want, and I have also written the calculation I want to do, which works when tested on a single subgroup. The only problem I have is combining the two.
Here is the code I used to group my data:
grouped = df.groupby(['Expiry', 'Strike'])
I read online about the apply function, but the examples only covered simple operations such as sums or averages.
Here is the calculation that I would like to perform on each subgroup, where x, y, z, u, R are columns of the subset (the same columns exist in every subgroup):
import numpy as np

def p(d, S, B, c):
    return d * S + B - c

def b_t(r, b_old, S, d, d_old, t):
    return np.exp(r * t) * b_old + S * (d_old - d)

def e_t(d_old, S, c, r, t, b_old):
    return d_old * S - c + np.exp(r * t) * b_old

P_results = []
B_results = []
E_results = []
for i, (d, S, c, t, r) in enumerate(zip(x, y, z, u, R)):
    B = b_t(r, b_old, S, d, d_old, t)
    P = p(d, S, B, c)
    E = e_t(d_old, S, c, r, t, b_old)
    print('i={},P={},B={},E={}'.format(i, P, B, E))
    B_results.append(B)
    P_results.append(P)
    E_results.append(E)
    b_old = B
    d_old = d
I thought that if I could save each subset as a new dataframe variable it might work, but I haven't been able to do that.
I hope this is clear; I think posting some data would help, but I am not sure how best to upload it here.
I'd much appreciate your help!
UPDATE 1: Found a solution that works
grouped = df.groupby(['Expiry', 'Strike'])
lg = list(grouped)

P_results = []
l_results = []
B_results = []
E_results = []

for l in range(len(lg)):
    df2 = lg[l][1]
    d_old = df2.iloc[0, 4]
    S_old = df2.iloc[0, 8]
    c_old = df2.iloc[0, 10]
    b_old = c_old - d_old * S_old
    x = df2.iloc[1:, 4]
    y = df2.iloc[1:, 8]
    z = df2.iloc[1:, 10]
    u = df2.iloc[1:, 9]
    R = df2.iloc[1:, 7]
    for i, (d, S, c, t, r) in enumerate(zip(x, y, z, u, R)):
        B = b_t(r, b_old, S, d, d_old, t)
        P = p(d, S, B, c)
        E = e_t(d_old, S, c, r, t, b_old)
        print('i={},P={},B={},E={}'.format(i, P, B, E))
        l_results.append(l)
        B_results.append(B)
        P_results.append(P)
        E_results.append(E)
        b_old = B
        d_old = d

BB = pd.DataFrame(np.column_stack([l_results, P_results, E_results, B_results]),
                  columns=['l', 'P', 'E', 'B'])
All I did was turn grouped into a list, pull each group out with a for loop, and then use an inner for loop to perform the calculations. It is not the prettiest output; I put l_results there to show which group each calculation refers to, but it seems sufficient for now. If there is a better way, please let me know!
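For what it's worth, here's a minimal sketch of how the outer loop could be pushed into groupby().apply() instead; it just reuses the p, b_t, e_t functions and the positional columns from the update above, so treat it as a starting point rather than a definitive implementation:
import numpy as np
import pandas as pd

def process_group(df2):
    # Same per-group logic as above, packaged so pandas can apply it to each (Expiry, Strike) group.
    d_old, S_old, c_old = df2.iloc[0, 4], df2.iloc[0, 8], df2.iloc[0, 10]
    b_old = c_old - d_old * S_old
    rows = []
    for d, S, c, t, r in zip(df2.iloc[1:, 4], df2.iloc[1:, 8], df2.iloc[1:, 10],
                             df2.iloc[1:, 9], df2.iloc[1:, 7]):
        B = b_t(r, b_old, S, d, d_old, t)
        P = p(d, S, B, c)
        E = e_t(d_old, S, c, r, t, b_old)
        rows.append({'P': P, 'E': E, 'B': B})
        b_old, d_old = B, d
    return pd.DataFrame(rows)

# One call replaces the outer loop; the group keys ('Expiry', 'Strike') end up in the index.
BB = df.groupby(['Expiry', 'Strike']).apply(process_group)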

How to find reverse of pow(a,b,c) in python?

The pow(a, b, c) function in Python returns (a**b) % c. If I have the values of b, c, and the result of this operation (res = pow(a, b, c)), how can I find the value of a?
Despite the statements in the comments, this is not the discrete logarithm problem. It more closely resembles the RSA problem, in which c is the product of two large primes, b is the encryption exponent, and a is the unknown plaintext. I always like to make x the unknown variable you want to solve for, so you have y = x^b mod c, where y, b, and c are known and you want to solve for x. Solving it involves the same basic number theory as in RSA: you must compute z = b^(-1) mod λ(c), and then you can solve for x via x = y^z mod c. λ is Carmichael's lambda function, but you can also use Euler's phi (totient) function instead. We have reduced the original problem to computing an inverse mod λ(c). This is easy to do if c is easy to factor or we already know the factorization of c, and hard otherwise. If c is small then brute force is an acceptable technique and you can ignore all the complicated math.
Here is some code showing these steps:
import functools
import math

def egcd(a, b):
    """Extended gcd of a and b. Returns (d, x, y) such that
    d = a*x + b*y where d is the greatest common divisor of a and b."""
    x0, x1, y0, y1 = 1, 0, 0, 1
    while b != 0:
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0

def inverse(a, n):
    """Returns the inverse x of a mod n, i.e. x*a = 1 mod n. Raises a
    ZeroDivisionError if gcd(a,n) != 1."""
    d, a_inv, n_inv = egcd(a, n)
    if d != 1:
        raise ZeroDivisionError('{} is not coprime to {}'.format(a, n))
    else:
        return a_inv % n

def lcm(*x):
    """
    Returns the least common multiple of its arguments. At least two arguments must be
    supplied.
    :param x:
    :return:
    """
    if not x or len(x) < 2:
        raise ValueError("at least two arguments must be supplied to lcm")
    lcm_of_2 = lambda x, y: (x * y) // math.gcd(x, y)
    return functools.reduce(lcm_of_2, x)

def carmichael_pp(p, e):
    phi = pow(p, e - 1) * (p - 1)
    if (p % 2 == 1) or (e >= 2):
        return phi
    else:
        return phi // 2

def carmichael_lambda(pp):
    """
    pp is a sequence representing the unique prime-power factorization of the
    integer whose Carmichael function is to be computed.
    :param pp: the prime-power factorization, a sequence of pairs (p,e) where p is prime and e>=1.
    :return: Carmichael's function result
    """
    return lcm(*[carmichael_pp(p, e) for p, e in pp])

a = 182989423414314437
b = 112388918933488834121
c = 128391911110189182102909037 * 256
y = pow(a, b, c)

lam = carmichael_lambda([(2, 8), (128391911110189182102909037, 1)])
z = inverse(b, lam)
x = pow(y, z, c)
print(x)
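As a side note, on Python 3.8+ the modular inverse can also be obtained with the built-in three-argument pow using a negative exponent. A tiny self-contained example with an easy-to-factor modulus of my own choosing (c = 3 * 7, so λ(c) = lcm(2, 6) = 6):
b, c, lam = 5, 21, 6   # b must be coprime to lam for the inverse to exist
a = 4                  # the secret base, coprime to c
y = pow(a, b, c)       # the known result, 16
z = pow(b, -1, lam)    # modular inverse of b mod lam (Python 3.8+)
x = pow(y, z, c)       # recovers a
assert x == a == 4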
The best you can do is something like this:
a = 12
b = 5
c = 125

def is_int(a):
    return a - int(a) <= 1e-5

# ============= Without C ========== #
print("Process without c")
rslt = pow(a, b)
print("a**b:", rslt)
print("a:", pow(rslt, (1.0 / b)))

# ============= With C ========== #
print("\nProcess with c")
rslt = pow(a, b, c)
i = 0
while True:
    a = pow(rslt + i * c, (1.0 / b))
    if is_int(a):
        break
    else:
        i += 1
print("a**b % c:", rslt)
print("a:", a)
You can never be sure that you have found the original value; it is simply the first value that is compatible with your settings. The algorithm is based on the assumption that a, b and c are integers. If they are not, you do not have a guaranteed solution, only a likely combination that may have been the original one.
Outputs:
Process without c
a**b: 248832
a: 12.000000000000002
Process with c
a**b % c: 82
a: 12.000000000000002

sum function vs long addition

I am wondering if the sum() builtin has an advantage over a long addition?
Is
sum(filter(None, [a, b, c, d]))
faster than
a + b + c + d
assuming I am using CPython?
Thanks.
EDIT: What if those variables are Decimals?
A quick example (note that, to try to be fairer, the sum version takes a tuple argument, so you don't include the time for building the (a, b, c, d) structure, and it doesn't include the unnecessary filter):
>>> import timeit
>>> def add_up(a, b, c, d):
...     return a + b + c + d
...
>>> def sum_up(t):
...     return sum(t)
...
>>> t = (1, 2, 3, 4)
>>> timeit.timeit("add_up(1, 2, 3, 4)", setup="from __main__ import sum_up, add_up, t")
0.2710826617188786
>>> timeit.timeit("sum_up(t)", setup="from __main__ import sum_up, add_up, t")
0.3691424539089212
This is pretty much inevitable - add_up doesn't have any function call overhead; it just does 3 binary adds. But the different forms have different uses - sum doesn't care how many items are given to it, whereas you have to write each name out with +. In an example with a fixed number of items, where speed is crucial, + has the edge, but for almost all general cases sum is the way to go.
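One way to see where the difference comes from is to look at the bytecode (the exact instruction names vary across CPython versions):
>>> import dis
>>> dis.dis("a + b + c + d")   # three inline binary-add instructions, no call
>>> dis.dis("sum(t)")          # a global name lookup for sum plus a function call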
With Decimals:
>>> from decimal import Decimal
>>> t = tuple(map(Decimal, t))
>>> a = Decimal(1)
>>> b = Decimal(2)
>>> c = Decimal(3)
>>> d = Decimal(4)
>>> timeit.timeit("add_up(a, b, c, d)", setup="from __main__ import sum_up, add_up, t, a, b, c, d")
0.5005962150420373
>>> timeit.timeit("sum_up(t)", setup="from __main__ import sum_up, add_up, t, a, b, c, d")
0.7599533142681025

Error in for loop. (Finding three integers)

So, our teacher gave us an assignment to find three integers a, b, c, all between 0 and 450, using Python.
a = c + 11 if b is even
a = 2c-129 if b is odd
b = ac mod 2377
c = (∑(b-7k) from k = 0 to a-1) + 142 (Edited. I wrote it wrong before; it was -149.)
I tried my code that looks like this (still a newbie; I guess a lot of my code is wrong):
for a, b, c in range(0, 450):
    if b % 2 == 0:
        a = c + 11
    else:
        a = 2 * c - 129
    b = (a * c) % 2377
    c = sum(b - 7 * k for k in range(0, a - 1))
but I get the error:
for a, b, c in range(0, 450):
TypeError: 'int' object is not iterable
What am I doing wrong and how can I make it check every number between 0 and 450?
The answers by Nick T and Eric hopefully helped you solve your issue with iterating over values of a, b, and c. I would like to also point out that the way you're approaching this problem isn't going to work. What's the point of iterating over various values of a if you're going to re-assign a to something anyway at each iteration of the loop? And likewise for b and c. A better approach involves checking that any given triple (a, b, c) satisfies the conditions given in the assignment. For example:
from itertools import product, tee

def test(a, b, c):
    flags = {'a': False,
             'b': False,
             'c': False}
    if (b % 2 == 0 and a == c + 11) or (b % 2 == 1 and a == 2 * c - 129):
        flags['a'] = True
    if b == (a * c) % 2377:
        flags['b'] = True
    if c == sum(b - 7 * k for k in range(a - 1)) - 149:
        flags['c'] = True
    return all(flags.values())  # True if zero flags are False

def run_tests():
    # iterate over all combinations of a=0..450, b=0..450, c=0..450
    for a, b, c in product(*tee(range(451), 3)):
        if test(a, b, c):
            return (a, b, c)

print(run_tests())
NOTE: This is a slow solution. One that does fewer loops, like in glglgl's answer, or Duncan's comment, is obviously favorable. This is really more for illustrative purposes than anything.
import itertools

for b, c in itertools.product(*[range(450)] * 2):
    if b % 2 == 0:
        a = c + 11
    else:
        a = 2 * c - 129
    derived_b = (a * c) % 2377
    derived_c = sum(b - 7 * k for k in range(0, a - 1))
    if derived_b == b and derived_c == c:
        print a, b, c
You need to nest the loops to brute-force it like you are attempting:
for a in range(451):  # range(450) excludes 450
    for b in range(451):
        for c in range(451):
            ...
It's very obviously O(n³), but if you want a quick and dirty answer, I guess it'll work - only 91 million loops, worst case.
The [0, 450] bound is just there as a hint.
In fact, your variables are coupled together. You can immediately eliminate at least one loop directly:
for b in range(0, 451):
    for c in range(0, 451):
        if b % 2:  # odd
            a = 2 * c - 129
        else:
            a = c + 11
        if b != (a * c) % 2377: continue  # test failed
        if c != sum(b - 7 * k for k in range(a)): continue  # test failed as well
        print a, b, c
should do the job.
I won't post full code (after all, it is homework), but you can eliminate two of the outer loops. This is easiest if you iterate over c.
Your code should then look something like:
for c in range(451):
    # calculate a assuming b is even
    # calculate b
    # if b is even and a and b are in range:
    #     calculate what c should be and compare against what it is
    # calculate a assuming b is odd
    # calculate b
    # if b is odd and a and b are in range:
    #     calculate what c should be and compare against what it is
Extra credit for eliminating the duplication of the code to calculate c
a = c + 11 if b is even
a = 2c-129 if b is odd
b = ac mod 2377
c = (∑(b-7k) from k = 0 to a-1) +142
This gives you a strong relation between all 3 numbers.
Given a value a, there are 2 candidate values for c (a-11 or (a+129)/2), which in turn give 2 values for b (ac mod 2377 for each candidate c, filtered on the required parity of b), which in turn get applied in the formula for validating c.
The overall complexity for this is O(n²) because of the formula to compute c.
Here is an implementation example:
for a in xrange(451):
    c_even = a - 11
    b = (a * c_even) % 2377
    if b % 2 == 0:
        c = sum(b - 7 * k for k in range(a)) + 142
        if c == c_even:
            print (a, b, c)
            break
    c_odd = (a + 129) / 2
    b = (a * c_odd) % 2377
    if b % 2 == 1:
        c = sum(b - 7 * k for k in range(a)) + 142
        if c == c_odd:
            print (a, b, c)
            break
