I have a list of tuples like this (note that the first element can be fairly large, though not extremely large, e.g. 2**12 - 1 is OK, and the second will always be in the range [0, 255]).
t = [(0, 137), (0, 80), (0, 78), (0, 71), (0, 13), ...]
I want to store these numbers as bytes on the file system (for compression). That means I also want to read these bytes back later to recover the tuples. Also note that big-endian byte order is a requirement.
for idx, v in compressed:
    if v:
        f.write(struct.pack(">I", idx))
        f.write(struct.pack(">I", v))
However, when I try to get the numbers, like this:
with open(filepath, 'rb') as file:
    data = file.read(4)
    nums = []
    while data:
        num = struct.unpack(">I", data)[0]
        print(num)
        data = file.read(4)
        nums.append(num)
I am not getting the numbers above back (some of them are right, but later it gets messed up, probably because of bit padding).
How can I stay consistent with bit padding? How can I write something with struct.pack('>I', ...) that I can later read back reliably?
Update:
For the following tuple
[(0, 137), (0, 80), (0, 78), (0, 71), (0, 13), (0, 10), (0, 26), (6, 0), (0, 0), (9, 13), (0, 73), (0, 72), (0, 68), (0, 82), (9, 0), (0, 1), (0, 44), (15, 1), (17, 8), (0, 2), (15, 0), (0, 246) ...]
I get the following numbers using my approach:
[0, 137, 0, 80, 0, 78, 0, 71, 0, 13, 0, 10, 0, 26, 9, 13, 0, 73, 0, 72, 0, 68, 0, 82, 0, 1, 0, 44, 15, 1, 17, 8, 0, 2, 0, 246 ...]
As you can see, the output starts to diverge at (6, 0); up to that point it is fine. It then seems to correct itself at (9, 13) and continues correctly.
Your code seems to work fine. However, you do have the following line in there:
if v:
v is falsy for every tuple whose second element is 0, so those tuples are never written to the file, and you won't see them when you read the file back.
Also, since you're writing your elements in pairs anyways, you could use >II as your format:
from struct import pack, unpack, calcsize
original = [(0, 137), (0, 80), (0, 78), (0, 71), (0, 13), (0, 10), (0, 26), (6, 0), (0, 0), (9, 13), (0, 73), (0, 72), (0, 68), (0, 82), (9, 0), (0, 1), (0, 44), (15, 1), (17, 8), (0, 2), (15, 0), (0, 246)]
filename = "test.txt"
fileformat = ">II"
with open(filename, "wb") as fp:
    for element in original:
        fp.write(pack(fileformat, *element))
with open(filename, "rb") as fp:
    elements = iter(lambda: fp.read(calcsize(fileformat)), b"")
    readback = [unpack(fileformat, element) for element in elements]
print(readback == original)
Given the following input:
compressed = [(0, 137), (0, 80), (0, 78), (0, 71), (0, 13), (0, 10), (0, 26), (6, 0), (0, 0), (9, 13), (0, 73), (0, 72), (0, 68), (0, 82), (9, 0), (0, 1), (0, 44), (15, 1), (17, 8), (0, 2), (15, 0), (0, 246)]
try this code for writing the data:
import struct
with open('file.dat', 'wb') as f:
    for idx, v in compressed:
        f.write(struct.pack(">I", idx))
        f.write(struct.pack(">I", v))
and this code for reading:
with open('file.dat', 'rb') as f:
    data = f.read(4)
    nums = []
    while data:
        idx = struct.unpack(">I", data)[0]
        data = f.read(4)
        v = struct.unpack(">I", data)[0]
        data = f.read(4)
        nums.append((idx,v))
and nums contains:
[(0, 137), (0, 80), (0, 78), (0, 71), (0, 13), (0, 10), (0, 26), (6, 0), (0, 0), (9, 13), (0, 73), (0, 72), (0, 68), (0, 82), (9, 0), (0, 1), (0, 44), (15, 1), (17, 8), (0, 2), (15, 0), (0, 246)]
This is the same as the input; in fact, nums == compressed gives True.
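If file size matters (the stated goal is compression), a narrower record format may help. The sketch below relies on the question's statement that the second value always fits in [0, 255], and additionally assumes that the first value fits in an unsigned 16-bit integer. The format string `>HB` and the helper names `write_pairs` and `read_pairs` are illustrative choices, not part of the original code. With an explicit byte order such as `>`, struct inserts no padding, so each record takes exactly 3 bytes.

```python
import struct

# Big-endian record: 2-byte unsigned idx, 1-byte unsigned v (assumed value ranges).
RECORD = struct.Struct(">HB")

def write_pairs(path, pairs):
    """Write every (idx, v) pair, including pairs where v == 0."""
    with open(path, "wb") as f:
        for idx, v in pairs:
            f.write(RECORD.pack(idx, v))

def read_pairs(path):
    """Read the whole file back as a list of (idx, v) tuples."""
    with open(path, "rb") as f:
        return list(RECORD.iter_unpack(f.read()))
```

If the first value can exceed 65535, the same pattern should work with `>IB` (5 bytes per record) instead.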
I have a list:
[(14, 2), (14, 2), (16, 2), (14, 2), (15, 2), (15, 2), (21, 2), (15, 2), (18, 2), (15, 2), (19, 2), (25, 2), (22, 2), (17, 2), (31, 2), (26, 2), (21, 2), (25, 2), (29, 2), (33, 2), (25, 2), (23, 2), (25, 2), (19, 2), (12, 2), (29, 2), (18, 2), (21, 2), (13, 2), (13, 2), (18, 2), (11, 2), (12, 2), (20, 2), (23, 2), (17, 2), (14, 2), (17, 2), (12, 2), (13, 2), (15, 2), (21, 2), (15, 2), (19, 2), (22, 2), (16, 2), (16, 2), (13, 2), (17, 2), (18, 2), (20, 2), (18, 2), (13, 2), (13, 2), (18, 2), (14, 2), (13, 2), (22, 2), (14, 2), (25, 2), (22, 2), (9, 2), (18, 2), (22, 2), (19, 2), (13, 2), (14, 2), (15, 2), (13, 2), (17, 2), (21, 2), (18, 2), (21, 2), (18, 2), (15, 2), (16, 2), (13, 2), (16, 2), (16, 2), (15, 2), (11, 2), (24, 2), (15, 2), (12, 2), (20, 2), (21, 2), (21, 2), (14, 2), (11, 2), (26, 2), (17, 2), (21, 2), (16, 2), (13, 2), (15, 2), (13, 2), (12, 2), (22, 2), (16, 2), (13, 2), (13, 2), (22, 2), (12, 2), (16, 2), (16, 2), (21, 2), (19, 2), (15, 2), (16, 2), (16, 2), (13, 2), (14, 2), (14, 2), (20, 2), (14, 2), (20, 2), (13, 2), (19, 2), (20, 2), (17, 2), (17, 2), (25, 2), (22, 2), (22, 2), (22, 2), (14, 2), (19, 2), (20, 2), (16, 2), (13, 2), (19, 2), (16, 2), (12, 2), (18, 2), (20, 2), (19, 2), (18, 2), (15, 2), (22, 2), (18, 2), (20, 2), (14, 2), (19, 2), (16, 2), (18, 2), (28, 2), (14, 2), (17, 2), (17, 2), (23, 2), (18, 2), (24, 2), (17, 2), (18, 2), (18, 2), (22, 2)]
I want my output to be a list of lists, where each sub-list holds as many consecutive elements as can be added together without exceeding a certain threshold, and each element is stored only as the sum of its tuple:
Example, if the threshold is 50 (inclusive):
[[16, 16, 18], [16, 17, 17], [23, 17], [20, 17], [21, 27], [24, 19], [33], [28], [23, 27], [31], [35], [27], [25], [27, 21] ...]
The second value of the tuple may vary. Preferred as a list comprehension.
EDIT:
As requested, my original code which I want to cleanup/optimize:
padding = len("Packages () ") + math.floor(math.log10(len(apps))+1)
line_length = columns - (padding * 2) - 2
spacings = 2
element_length = [item for sublist in [list(a) for a in zip([len(i) for i in apps],[i for i in itertools.repeat(spacings, len([len(i) for i in apps]))])] for item in sublist]
limits = []
outer_limit = 0
while line_length <= sum(element_length):
    while line_length >= sum(element_length[0:outer_limit]):
        outer_limit += 1
    limits.append(outer_limit - 1)
    element_length = element_length[outer_limit - 1:]
    outer_limit = 0
message = ""
a = 0
b = 0
for amount in limits:
    b += math.ceil(amount / 2)
    message += (" " * spacings).join(apps[a:b]) + ("\n" + " " * padding)
    a = b
print("Packages ({}) {}".format(len(apps), message))
Here is a pretty simple way of doing that with just a single for-loop:
tups = [(14, 2), (14, 2), (16, 2), (14, 2)]
threshold = 50
result = [[]]
for tup in tups:
    tupSum = sum(tup)
    # Start a new sublist if adding this tuple's sum would exceed the threshold
    if tupSum + sum(result[-1]) > threshold:
        result.append([])
    result[-1].append(tupSum)
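The loop above recomputes sum(result[-1]) on every iteration. A possible variant, sketched here under the assumption that the grouping rule is the same (the threshold is inclusive), keeps a running total instead. The function name `group_by_threshold` is made up for this illustration.

```python
def group_by_threshold(tuples, threshold):
    """Group per-tuple sums so that each group's total stays <= threshold."""
    groups = []
    current = []   # sums collected for the group being built
    running = 0    # total of `current`, kept up to date to avoid re-summing
    for tup in tuples:
        value = sum(tup)
        # Close the current group only if it is non-empty and would overflow.
        if current and running + value > threshold:
            groups.append(current)
            current, running = [], 0
        current.append(value)
        running += value
    if current:
        groups.append(current)
    return groups

print(group_by_threshold([(14, 2), (14, 2), (16, 2), (14, 2), (15, 2), (15, 2)], 50))
# [[16, 16, 18], [16, 17, 17]]
```

Unlike the loop above, this sketch returns an empty list for empty input and puts a single value larger than the threshold in a group of its own.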
This could be done with the reduce function from functools:
a = [(14, 2), (14, 2), (16, 2), (14, 2), (15, 2), (15, 2), (21, 2), (15, 2), (18, 2), (15, 2), (19, 2), (25, 2), (22, 2), (17, 2), (31, 2), (26, 2), (21, 2), (25, 2), (29, 2), (33, 2), (25, 2), (23, 2), (25, 2), (19, 2), (12, 2), (29, 2), (18, 2), (21, 2), (13, 2), (13, 2), (18, 2), (11, 2), (12, 2), (20, 2), (23, 2), (17, 2), (14, 2), (17, 2), (12, 2), (13, 2), (15, 2), (21, 2), (15, 2), (19, 2), (22, 2), (16, 2), (16, 2), (13, 2), (17, 2), (18, 2), (20, 2), (18, 2), (13, 2), (13, 2), (18, 2), (14, 2), (13, 2), (22, 2), (14, 2), (25, 2), (22, 2), (9, 2), (18, 2), (22, 2), (19, 2), (13, 2), (14, 2), (15, 2), (13, 2), (17, 2), (21, 2), (18, 2), (21, 2), (18, 2), (15, 2), (16, 2), (13, 2), (16, 2), (16, 2), (15, 2), (11, 2), (24, 2), (15, 2), (12, 2), (20, 2), (21, 2), (21, 2), (14, 2), (11, 2), (26, 2), (17, 2), (21, 2), (16, 2), (13, 2), (15, 2), (13, 2), (12, 2), (22, 2), (16, 2), (13, 2), (13, 2), (22, 2), (12, 2), (16, 2), (16, 2), (21, 2), (19, 2), (15, 2), (16, 2), (16, 2), (13, 2), (14, 2), (14, 2), (20, 2), (14, 2), (20, 2), (13, 2), (19, 2), (20, 2), (17, 2), (17, 2), (25, 2), (22, 2), (22, 2), (22, 2), (14, 2), (19, 2), (20, 2), (16, 2), (13, 2), (19, 2), (16, 2), (12, 2), (18, 2), (20, 2), (19, 2), (18, 2), (15, 2), (22, 2), (18, 2), (20, 2), (14, 2), (19, 2), (16, 2), (18, 2), (28, 2), (14, 2), (17, 2), (17, 2), (23, 2), (18, 2), (24, 2), (17, 2), (18, 2), (18, 2), (22, 2)]
from functools import reduce
b = list(reduce(lambda a,b:a+[[b]] if sum(a[-1])+b>50 else a[:-1]+[a[-1]+[b]],map(sum,a),[[]]))
print(b)
# [[16, 16, 18], [16, 17, 17], [23, 17], [20, 17], [21, 27], [24, 19], [33], [28], [23, 27], [31], [35], [27], [25], [27, 21], [14, 31], [20, 23], [15, 15, 20], [13, 14, 22], [25, 19], [16, 19, 14], [15, 17], [23, 17], [21, 24], [18, 18], [15, 19], [20, 22], [20, 15, 15], [20, 16], [15, 24], [16, 27], [24, 11], [20, 24], [21, 15], [16, 17, 15], [19, 23], [20, 23], [20, 17], [18, 15], [18, 18], [17, 13], [26, 17], [14, 22], [23, 23], [16, 13], [28, 19], [23, 18], [15, 17, 15], [14, 24], [18, 15, 15], [24, 14], [18, 18], [23, 21], [17, 18], [18, 15, 16], [16, 22], [16, 22], [15, 21], [22, 19], [19, 27], [24, 24], [24, 16], [21, 22], [18, 15], [21, 18], [14, 20], [22, 21], [20, 17], [24, 20], [22, 16], [21, 18], [20, 30], [16, 19], [19, 25], [20, 26], [19, 20], [20, 24]]
This is a part of a large program. I have a list like
cnfn=[(1, -3), (2, -3), (-1, -2, 3), (-1, 4), (-2, 4), (1, 2, -4), (-4, -5), (4, 5), (-3, 6), (-5, 6), (3, 5, -6), (7, -8), (6, -8), (-7, -6, 8), (-6, 9), (-7, 9), (6, 7, -9), (-9, -10), (9, 10), (-8, 11), (-10, 11), (8, 10, -11), (7, -12), (4, -12), (-7, -4, 12), (-12, 13), (-3, 13), (12, 3, -13), (14, -16), (15, -16), (-14, -15, 16), (-16, -17), (16, 17), (-14, 18), (-15, 18), (14, 15, -18), (17, -19), (18, -19), (-17, -18, 19), (13, -20), (19, -20), (-13, -19, 20), (-20, -21), (20, 21), (-19, 22), (-13, 22), (19, 13, -22), (21, -23), (22, -23), (-21, -22, 23), (13, -24), (18, -24), (-13, -18, 24), (-24, 25), (-16, 25), (24, 16, -25), (26, -28), (27, -28), (-26, -27, 28), (-28, -29), (28, 29), (-26, 30), (-27, 30), (26, 27, -30), (29, -31), (30, -31), (-29, -30, 31), (25, -32), (31, -32), (-25, -31, 32), (-32, -33), (32, 33), (-31, 34), (-25, 34), (31, 25, -34), (33, -35), (34, -35), (-33, -34, 35), (25, -36), (30, -36), (-25, -30, 36), (-36, 37), (-28, 37), (36, 28, -37), (38, -40), (39, -40), (-38, -39, 40), (-40, -41), (40, 41), (-38, 42), (-39, 42), (38, 39, -42), (41, -43), (42, -43), (-41, -42, 43), (37, -44), (43, -44), (-37, -43, 44), (-44, -45), (44, 45), (-43, 46), (-37, 46), (43, 37, -46), (45, -47), (46, -47), (-45, -46, 47), (37, -48), (42, -48), (-37, -42, 48), (-48, 49), (-40, 49), (48, 40, -49), (-50, -51), (50, 51), (-51, 53), (-52, 53), (51, 52, -53), (-52, -54), (52, 54), (-54, 55), (-50, 55), (54, 50, -55), (53, -56), (55, -56), (-53, -55, 56), (-56, -57), (56, 57), (58, -59), (57, -59), (-58, -57, 59), (52, -60), (50, -60), (-52, -50, 60), (-59, 61), (-60, 61), (59, 60, -61), (56, -62), (58, -62), (-56, -58, 62), (-58, -63), (58, 63), (57, -64), (63, -64), (-57, -63, 64), (-62, 65), (-64, 65), (62, 64, -65), (-66, -67), (66, 67), (-67, 69), (-68, 69), (67, 68, -69), (-68, -70), (68, 70), (-70, 71), (-66, 71), (70, 66, -71), (69, -72), (71, -72), (-69, -71, 72), (-72, -73), (72, 73), (61, -74), (73, -74), (-61, -73, 74), (68, -75), (66, -75), 
(-68, -66, 75), (-74, 76), (-75, 76), (74, 75, -76), (72, -77), (61, -77), (-72, -61, 77), (-61, -78), (61, 78), (73, -79), (78, -79), (-73, -78, 79), (-77, 80), (-79, 80), (77, 79, -80), (-81, -82), (81, 82), (-82, 84), (-83, 84), (82, 83, -84), (-83, -85), (83, 85), (-85, 86), (-81, 86), (85, 81, -86), (84, -87), (86, -87), (-84, -86, 87), (-87, -88), (87, 88), (76, -89), (88, -89), (-76, -88, 89), (83, -90), (81, -90), (-83, -81, 90), (-89, 91), (-90, 91), (89, 90, -91), (87, -92), (76, -92), (-87, -76, 92), (-76, -93), (76, 93), (88, -94), (93, -94), (-88, -93, 94), (-92, 95), (-94, 95), (92, 94, -95), (-96, -97), (96, 97), (-97, 99), (-98, 99), (97, 98, -99), (-98, -100), (98, 100), (-100, 101), (-96, 101), (100, 96, -101), (99, -102), (101, -102), (-99, -101, 102), (-102, -103), (102, 103), (91, -104), (103, -104), (-91, -103, 104), (-104, -105), (104, 105), (-104, 106), (-105, 106), (104, 105, -106), (102, -107), (91, -107), (-102, -91, 107), (-91, -108), (91, 108), (103, -109), (108, -109), (-103, -108, 109), (-107, 110), (-109, 110), (107, 109, -110), (-1, 50), (1, -50), (-2, 52), (2, -52), (-7, 58), (7, -58), (-14, 66), (14, -66), (-15, 68), (15, -68), (-26, 81), (26, -81), (-27, 83), (27, -83), (-38, 96), (38, -96), (-39, 98), (39, -98), (-11, -65, -111), (-11, 65, 111), (11, -65, 111), (11, 65, -111), (-23, -80, -112), (-23, 80, 112), (23, -80, 112), (23, 80, -112), (-35, -95, -113), (-35, 95, 113), (35, -95, 113), (35, 95, -113), (-47, -106, -114), (-47, 106, 114), (47, -106, 114), (47, 106, -114), (-49, -110, -115), (-49, 110, 115), (49, -110, 115), (49, 110, -115), (111, 112, 113, 114, 115)]
And there is another list
cnfb=[(1, -3), (2, -3), (-1, -2, 3), (-1, 4), (-2, 4), (1, 2, -4), (4, 5), (-4, -5), (-3, 6), (-5, 6), (3, 5, -6), (7, -8), (6, -8), (-7, -6, 8), (-6, 9), (-7, 9), (6, 7, -9), (9, 10), (-9, -10), (-8, 11), (-10, 11), (8, 10, -11), (7, -12), (4, -12), (-7, -4, 12), (-12, 13), (-3, 13), (12, 3, -13), (14, -16), (15, -16), (-14, -15, 16), (16, 17), (-16, -17), (-14, 18), (-15, 18), (14, 15, -18), (17, -19), (18, -19), (-17, -18, 19), (13, -20), (19, -20), (-13, -19, 20), (20, 21), (-20, -21), (-19, 22), (-13, 22), (19, 13, -22), (21, -23), (22, -23), (-21, -22, 23), (13, -24), (18, -24), (-13, -18, 24), (-24, 25), (-16, 25), (24, 16, -25), (26, -28), (27, -28), (-26, -27, 28), (28, 29), (-28, -29), (-26, 30), (-27, 30), (26, 27, -30), (29, -31), (30, -31), (-29, -30, 31), (25, -32), (31, -32), (-25, -31, 32), (32, 33), (-32, -33), (-31, 34), (-25, 34), (31, 25, -34), (33, -35), (34, -35), (-33, -34, 35), (25, -36), (30, -36), (-25, -30, 36), (-36, 37), (-28, 37), (36, 28, -37), (38, -40), (39, -40), (-38, -39, 40), (40, 41), (-40, -41), (-38, 42), (-39, 42), (38, 39, -42), (41, -43), (42, -43), (-41, -42, 43), (37, -44), (43, -44), (-37, -43, 44), (44, 45), (-44, -45), (-43, 46), (-37, 46), (43, 37, -46), (45, -47), (46, -47), (-45, -46, 47), (37, -48), (42, -48), (-37, -42, 48), (-48, 49), (-40, 49), (48, 40, -49), (50, 51), (-50, -51), (-51, 53), (-52, 53), (51, 52, -53), (52, 54), (-52, -54), (-54, 55), (-50, 55), (54, 50, -55), (53, -56), (55, -56), (-53, -55, 56), (56, 57), (-56, -57), (58, -59), (57, -59), (-58, -57, 59), (52, -60), (50, -60), (-52, -50, 60), (-59, 61), (-60, 61), (59, 60, -61), (56, -62), (58, -62), (-56, -58, 62), (58, 63), (-58, -63), (57, -64), (63, -64), (-57, -63, 64), (-62, 65), (-64, 65), (62, 64, -65), (66, 67), (-66, -67), (-67, 69), (-68, 69), (67, 68, -69), (68, 70), (-68, -70), (-70, 71), (-66, 71), (70, 66, -71), (69, -72), (71, -72), (-69, -71, 72), (72, 73), (-72, -73), (61, -74), (73, -74), (-61, -73, 74), (68, -75), (66, -75), 
(-68, -66, 75), (-74, 76), (-75, 76), (74, 75, -76), (72, -77), (61, -77), (-72, -61, 77), (61, 78), (-61, -78), (73, -79), (78, -79), (-73, -78, 79), (-77, 80), (-79, 80), (77, 79, -80), (81, 82), (-81, -82), (-82, 84), (-83, 84), (82, 83, -84), (83, 85), (-83, -85), (-85, 86), (-81, 86), (85, 81, -86), (84, -87), (86, -87), (-84, -86, 87), (87, 88), (-87, -88), (76, -89), (88, -89), (-76, -88, 89), (83, -90), (81, -90), (-83, -81, 90), (-89, 91), (-90, 91), (89, 90, -91), (87, -92), (76, -92), (-87, -76, 92), (76, 93), (-76, -93), (88, -94), (93, -94), (-88, -93, 94), (-92, 95), (-94, 95), (92, 94, -95), (96, 97), (-96, -97), (-97, 99), (-98, 99), (97, 98, -99), (98, 100), (-98, -100), (-100, 101), (-96, 101), (100, 96, -101), (99, -102), (101, -102), (-99, -101, 102), (102, 103), (-102, -103), (91, -104), (103, -104), (-91, -103, 104), (104, 105), (-104, -105), (-104, 106), (-105, 106), (104, 105, -106), (102, -107), (91, -107), (-102, -91, 107), (91, 108), (-91, -108), (103, -109), (108, -109), (-103, -108, 109), (-107, 110), (-109, 110), (107, 109, -110), (35, 95, -111), (-35, -95, -111), (-35, 95, 111), (35, -95, 111), (23, 80, -112), (-23, -80, -112), (-23, 80, 112), (23, -80, 112), (49, 106, -113), (-49, -106, -113), (-49, 106, 113), (49, -106, 113), (47, 110, -114), (-47, -110, -114), (-47, 110, 114), (47, -110, 114), (11, 65, -115), (-11, -65, -115), (-11, 65, 115), (11, -65, 115), [111, 112, 113, 114, 115], (-26, 83), (26, -83), (-2, 50), (2, -50), (-38, 98), (38, -98), (-27, 81), (27, -81), (-39, 96), (39, -96), (-7, 58), (7, -58), (-14, 68), (14, -68), (-15, 66), (15, -66), (-1, 52), (1, -52)]
By eye they look like they have the same values, but when I pass them to the same function the results are different. How can I determine whether the two have exactly the same type and the same values?
The two lists are NOT the same. That is why a function may be giving you a different result for the different lists.
To check if 2 lists are identical, you can do:
list1 == list2
So to give some examples:
>>> [1, 2, 3, 4, 5] == [1, 2, 3, 4, 5]
True
>>> [1, 2, 3, 4, 5] == [1, 2, 3, 4, 3]
False
>>> [1, 2, 3, 4, 5] == [5, 4, 3, 2, 1]
False
>>> [(1, 2), (3, 4)] == [(1, 2), (3, 4)]
True
>>> [(1, 2), (3, 4)] == [(1, 2), (3, 5)]
False
If you want to find what the differences are, you can do the following:
[e for e in list1 if e not in list2] + [e for e in list2 if e not in list1]
which I think is actually very readable for what it is.
So we could put that inside a function:
def comp(list1, list2):
    return [e for e in list1 if e not in list2] + [e for e in list2 if e not in list1]
and some examples:
>>> comp([1, 2, 3], [1, 2, 3])  # should be empty, as there is no difference
[]
>>> comp([(1, 2), (3, 4)], [(1, 2), (3, 5)])
[(3, 4), (3, 5)]
>>> comp([(1, 2), (3, 4)], [(1, 2), (3, 5), (6, 7)])
[(3, 4), (3, 5), (6, 7)]
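Note that == on lists is order-sensitive, and a list never compares equal to a tuple with the same items. If only the content of the clauses matters, one option is to normalise each clause to a tuple and compare multisets. This is a sketch: the helper names `diff_clauses` and `non_tuple_clauses` are invented for this example, and whether ordering and container type should be ignored depends on what the consuming function expects.

```python
from collections import Counter

def diff_clauses(a, b):
    """Return (only_in_a, only_in_b), ignoring order and list-vs-tuple differences."""
    count_a = Counter(tuple(clause) for clause in a)
    count_b = Counter(tuple(clause) for clause in b)
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    return list((count_a - count_b).elements()), list((count_b - count_a).elements())

def non_tuple_clauses(clauses):
    """List the clauses that are not tuples (for example, a stray list)."""
    return [clause for clause in clauses if not isinstance(clause, tuple)]

print(diff_clauses([(1, 2), (3, 4)], [(3, 4), [1, 2], (5, 6)]))
# ([], [(5, 6)])
print(non_tuple_clauses([(1, 2), [3, 4]]))
# [[3, 4]]
```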
With hash function:
balanceLoad = lambda x: bisect.bisect_left(boundary_array, -keyfunc(x))
Where boundary_array is [-64, -10, 35]
The following tells me which partition to assign each element to:
rdd.partitionBy(numPartitions, balanceLoad)
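For reference, the partition index that this function yields can be checked without Spark. The sketch below assumes keyfunc is the identity, as in the sortByKey call used later in this question; `balance_load` is a plain-function rewrite of the lambda above.

```python
import bisect

boundary_array = [-64, -10, 35]  # boundaries from the question

def keyfunc(key):
    # Assumption: identity key function.
    return key

def balance_load(key):
    """Return the partition index for a key, using the negated key as in the lambda."""
    return bisect.bisect_left(boundary_array, -keyfunc(key))

for key in (98, 64, 62, 10, 8, -35, -37, -99):
    print(key, "->", balance_load(key))
```

Under that assumption, keys 64 and above land in partition 0, keys 10 to 62 in partition 1, keys -35 to 8 in partition 2, and keys -37 and below in partition 3.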
However, is there a way to determine or control WHERE within each partition the elements are placed, for example {1,2,3} vs {3,2,1}?
For example when I do this:
rdd = CleanRDD(sc.parallelize(range(100), 4).map(lambda x: (x *((-1) ** x) , x)))
sortByKey(rdd, keyfunc=lambda key: key, ascending=False).collect()
Elements in each partition are in reverse order:
[(64, 64),
(66, 66),
(68, 68),
(70, 70),
(72, 72),
(74, 74),
(76, 76),
(78, 78),
(80, 80),
(82, 82),
(84, 84),
(86, 86),
(88, 88),
(90, 90),
(92, 92),
(94, 94),
(96, 96),
(98, 98),
(10, 10),
(12, 12),
(14, 14),
(16, 16),
(18, 18),
(20, 20),
(22, 22),
(24, 24),
(26, 26),
(28, 28),
(30, 30),
(32, 32),
(34, 34),
(36, 36),
(38, 38),
(40, 40),
(42, 42),
(44, 44),
(46, 46),
(48, 48),
(50, 50),
(52, 52),
(54, 54),
(56, 56),
(58, 58),
(60, 60),
(62, 62),
(-35, 35),
(-33, 33),
(-31, 31),
(-29, 29),
(-27, 27),
(-25, 25),
(-23, 23),
(-21, 21),
(-19, 19),
(-17, 17),
(-15, 15),
(-13, 13),
(-11, 11),
(-9, 9),
(-7, 7),
(-5, 5),
(-3, 3),
(-1, 1),
(0, 0),
(2, 2),
(4, 4),
(6, 6),
(8, 8),
(-99, 99),
(-97, 97),
(-95, 95),
(-93, 93),
(-91, 91),
(-89, 89),
(-87, 87),
(-85, 85),
(-83, 83),
(-81, 81),
(-79, 79),
(-77, 77),
(-75, 75),
(-73, 73),
(-71, 71),
(-69, 69),
(-67, 67),
(-65, 65),
(-63, 63),
(-61, 61),
(-59, 59),
(-57, 57),
(-55, 55),
(-53, 53),
(-51, 51),
(-49, 49),
(-47, 47),
(-45, 45),
(-43, 43),
(-41, 41),
(-39, 39),
(-37, 37)]
Notice that elements in each of the three groups are in reverse order.
How can I correct this?
Determine: no, because the order of the shuffle is nondeterministic.
You can control the order, but not as part of the partitioning process, or at least not in PySpark. Instead, you can take an approach similar to sortByKey and enforce the order per partition afterwards:
def applyOrdering(iter):
    """Takes an itertools.chain object
    and returns iterable with specific ordering"""
    ...
rdd.partitionBy(numPartitions, balanceLoad).mapPartitions(applyOrdering)
Note that iter may be too large to fit into memory, so you should either increase the granularity or use a sorting mechanism that doesn't require reading all the data at once.
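As one possible illustration of such a per-partition function, the sketch below sorts a partition's (key, value) records in memory. It assumes the partition fits in memory, which is exactly the limitation mentioned above, and the name `apply_ordering` and its parameters are hypothetical.

```python
def apply_ordering(records, keyfunc=lambda key: key, ascending=False):
    """Sort one partition's (key, value) records in memory and return an iterator.

    Assumes the whole partition fits in memory.
    """
    return iter(sorted(records, key=lambda kv: keyfunc(kv[0]), reverse=not ascending))

# Hypothetical usage with the RDD from the question:
# rdd.partitionBy(numPartitions, balanceLoad).mapPartitions(apply_ordering)

partition = iter([(10, 10), (12, 12), (62, 62)])
print(list(apply_ordering(partition)))
# [(62, 62), (12, 12), (10, 10)]
```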
I start with two numpy arrays, the "x values" and the "y values":
import numpy as np
x = np.arange(100)
y = np.arange(100)
The output is
[ 0 1 2 3 4 ..... 96 97 98 99]
[ 0 1 2 3 4 ..... 96 97 98 99]
I would like to append these values together into an array of len() = 100 such that the output is
[ (0,0) (1,1) (2,2) (3,3) .... (98,98) (99,99) ]
How does one use indexing to both (A) put the pairs in the correct order and (B) put the parentheses ( and the comma , in the correct places?
For your particular requirement, you can use the built-in zip function, which combines multiple iterables at their corresponding indexes (that is, the i-th elements of all the iterables passed to it are combined into the i-th tuple of the result).
Example -
import numpy as np
x = np.arange(100)
y = np.arange(100)
print(list(zip(x,y)))
>>> [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15), (16, 16), (17, 17), (18, 18), (19, 19), (20, 20), (21, 21), (22, 22), (23, 23), (24, 24), (25, 25), (26, 26), (27, 27), (28, 28), (29, 29), (30, 30), (31, 31), (32, 32), (33, 33), (34, 34), (35, 35), (36, 36), (37, 37), (38, 38), (39, 39), (40, 40), (41, 41), (42, 42), (43, 43), (44, 44), (45, 45), (46, 46), (47, 47), (48, 48), (49, 49), (50, 50), (51, 51), (52, 52), (53, 53), (54, 54), (55, 55), (56, 56), (57, 57), (58, 58), (59, 59), (60, 60), (61, 61), (62, 62), (63, 63), (64, 64), (65, 65), (66, 66), (67, 67), (68, 68), (69, 69), (70, 70), (71, 71), (72, 72), (73, 73), (74, 74), (75, 75), (76, 76), (77, 77), (78, 78), (79, 79), (80, 80), (81, 81), (82, 82), (83, 83), (84, 84), (85, 85), (86, 86), (87, 87), (88, 88), (89, 89), (90, 90), (91, 91), (92, 92), (93, 93), (94, 94), (95, 95), (96, 96), (97, 97), (98, 98), (99, 99)]
For Python 2.x, note that you do not need list(zip(...)), since zip itself returns a list. In Python 3.x, zip returns an iterator, so to print it we need to convert it into a list.
You can use np.dstack to get the columns :
>>> np.dstack((x,y))
array([[[ 0, 0],
[ 1, 1],
[ 2, 2],
[ 3, 3],
[ 4, 4],
[ 5, 5],
[ 6, 6],
[ 7, 7],
[ 8, 8],
[ 9, 9],
...
[99, 99]]])
And if you want tuples instead of lists, you can use map to convert each row to a tuple (in Python 3, map returns an iterator, so wrap the call in list(...) to get the output shown):
>>> map(tuple,np.dstack((x,y))[0])
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15), (16, 16), (17, 17), (18, 18), (19, 19), (20, 20), (21, 21), (22, 22), (23, 23), (24, 24), (25, 25), (26, 26), (27, 27), (28, 28), (29, 29), (30, 30), (31, 31), (32, 32), (33, 33), (34, 34), (35, 35), (36, 36), (37, 37), (38, 38), (39, 39), (40, 40), (41, 41), (42, 42), (43, 43), (44, 44), (45, 45), (46, 46), (47, 47), (48, 48), (49, 49), (50, 50), (51, 51), (52, 52), (53, 53), (54, 54), (55, 55), (56, 56), (57, 57), (58, 58), (59, 59), (60, 60), (61, 61), (62, 62), (63, 63), (64, 64), (65, 65), (66, 66), (67, 67), (68, 68), (69, 69), (70, 70), (71, 71), (72, 72), (73, 73), (74, 74), (75, 75), (76, 76), (77, 77), (78, 78), (79, 79), (80, 80), (81, 81), (82, 82), (83, 83), (84, 84), (85, 85), (86, 86), (87, 87), (88, 88), (89, 89), (90, 90), (91, 91), (92, 92), (93, 93), (94, 94), (95, 95), (96, 96), (97, 97), (98, 98), (99, 99)]
You could use vstack
In [36]: xy = np.vstack((x,y)).T
In [37]: xy.shape
Out[37]: (100, 2)
In [38]: xy[0]
Out[38]: array([0, 0])
In [39]: xy[1]
Out[39]: array([1, 1])
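A closely related option, sketched here as an alternative rather than as part of the answer above, is np.column_stack, which builds the same (100, 2) array without the transpose. Converting through tolist() first gives tuples of plain Python ints.

```python
import numpy as np

x = np.arange(100)
y = np.arange(100)

xy = np.column_stack((x, y))            # shape (100, 2), one (x, y) pair per row
pairs = list(map(tuple, xy.tolist()))   # list of tuples of plain Python ints

print(xy.shape)
print(pairs[:3])
```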