I need to match two very large NumPy arrays (one is 20000 rows, the other about 100000 rows) and I am trying to build a script to do it efficiently. Simple looping over the arrays is incredibly slow; can someone suggest a better way?
Here is what I am trying to do: arrays datesSecondDict and pwfs2Dates contain datetime values. I need to take each datetime value from pwfs2Dates (the smaller array) and see if there is a datetime value like it (plus/minus 5 minutes) in datesSecondDict (there might be more than one). If there is one (or more), I populate a new array (of the same size as pwfs2Dates) with the value (one of the values) from valsSecondDict (which is just the array of numerical values corresponding to datesSecondDict).
Here is a solution by @unutbu and @joaquin that worked for me (thanks, guys!):
import time
import datetime as dt
import numpy as np
def combineArs(dict1, dict2):
    """Combine data from 2 dictionaries into a list.
    dict1 contains primary data (e.g. seeing parameter).
    The function compares each timestamp in dict1 to dict2
    to see if there is a matching timestamp record(s)
    in dict2 (plus/minus 5 minutes).
    ==If yes: a list called data gets appended with the
    corresponding parameter value from dict2.
    (Note that if there is more than 1 matching record,
    the first occurring value gets appended to the list).
    ==If no: a list called data gets appended with 0."""
    # Specify the keys to use
    pwfs2Key = 'pwfs2:dc:seeing'
    dimmKey = 'ws:seeFwhm'
    # Create an iterator for the primary dict
    datesPrimDictIter = iter(dict1[pwfs2Key]['datetimes'])
    # Take the first timestamp value in the primary dict
    nextDatePrimDict = next(datesPrimDictIter)
    # Split the second dictionary into lists
    datesSecondDict = dict2[dimmKey]['datetime']
    valsSecondDict = dict2[dimmKey]['values']
    # Define the time window
    fiveMins = dt.timedelta(minutes=5)
    data = []
    #st = time.time()
    for i, nextDateSecondDict in enumerate(datesSecondDict):
        try:
            while nextDatePrimDict < nextDateSecondDict - fiveMins:
                # If there is no match: append zero and move on
                data.append(0)
                nextDatePrimDict = next(datesPrimDictIter)
            while nextDatePrimDict < nextDateSecondDict + fiveMins:
                # If there is a match: append the value from the second dict
                data.append(valsSecondDict[i])
                nextDatePrimDict = next(datesPrimDictIter)
        except StopIteration:
            break
    data = np.array(data)
    #st = time.time() - st
    return data
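For comparison, here is a fully vectorized sketch of the same matching with np.searchsorted. This is only a sketch, not part of the accepted solution: it assumes both date arrays are sorted NumPy arrays and that valsSecondDict is also a NumPy array, and it takes the first value in each plus/minus 5 minute window, or 0 if the window is empty:
import numpy as np
import datetime as dt

def combineArsVectorized(pwfs2Dates, datesSecondDict, valsSecondDict):
    fiveMins = dt.timedelta(minutes=5)
    # For each primary date, locate where its 5-minute window would start
    # and end inside the sorted secondary dates
    lo = np.searchsorted(datesSecondDict, pwfs2Dates - fiveMins, side='left')
    hi = np.searchsorted(datesSecondDict, pwfs2Dates + fiveMins, side='right')
    data = np.zeros(len(pwfs2Dates))
    matched = hi > lo                            # windows holding at least one record
    data[matched] = valsSecondDict[lo[matched]]  # first value in each window
    return data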
Thanks,
Aina.
Are the array dates sorted?
If yes, you can speed up your comparisons by breaking out of the inner
loop comparison once its dates become bigger than the date given by the
outer loop. That way you make a one-pass comparison instead of
looping over dimVals items len(pwfs2Vals) times.
If no, maybe you should transform the current pwfs2Dates array into, for example,
an array of pairs [(date, array_index), ...]. Then you can sort all your arrays by
date to make the one-pass comparison indicated above, and at the
same time be able to get the original indexes needed to set data[i].
For example, if the arrays were already sorted (I use lists here, not sure you need arrays for that):
(Edited: now using an iterator so as not to loop pwfs2Dates from the beginning on each step):
pdates = iter(enumerate(pwfs2Dates))
i, datei = pdates.next()
for datej, valuej in zip(dimmDates, dimVals):
    while datei < datej - fiveMinutes:
        i, datei = pdates.next()
    while datei < datej + fiveMinutes:
        data[i] = valuej
        i, datei = pdates.next()
Otherwise, if they were not ordered and you created the sorted, indexed lists like this:
pwfs2Dates = sorted([(date, idx) for idx, date in enumerate(pwfs2Dates)])
dimmDates = sorted([(date, idx) for idx, date in enumerate(dimmDates)])
the code would be:
(Edited: now using an iterator so as not to loop pwfs2Dates from the beginning on each step):
pdates = iter(pwfs2Dates)
datei, i = pdates.next()
for datej, j in dimmDates:
    while datei < datej - fiveMinutes:
        datei, i = pdates.next()
    while datei < datej + fiveMinutes:
        data[i] = dimVals[j]
        datei, i = pdates.next()
Note that dimVals:
dimVals = np.array(dict1[dimmKey]['values'])
is not used in your code and can be eliminated.
Note that your code gets greatly simplified by looping through the
array itself instead of using xrange
Edit: The answer from unutbu addresses some weak parts in the code above.
I indicate them here for completeness:
Use of next: next(iterator) is preferred to iterator.next().
iterator.next() is an exception to a conventional naming rule that
has been fixed in py3k, which renames this method
iterator.__next__().
Check for the end of the iterator with a try/except. After all the
items in the iterator are exhausted, the next call to next()
raises a StopIteration exception. Use try/except to kindly
break out of the loop when that happens. For the specific case of the
OP's question this is not an issue, because the two arrays are the same
size, so the for loop finishes at the same time as the iterator and no
exception is raised. However, there could be cases where dict1 and dict2
are not the same size, and then there is the possibility of an
exception being raised.
The question is: which is better, to use try/except, or to prepare the arrays
before looping by equalizing them to the shorter one?
Building on joaquin's idea:
import datetime as dt

def combineArs(dict1, dict2, delta=dt.timedelta(minutes=5)):
    marks = dict1['datetime']
    values = dict1['values']
    pdates = iter(dict2['datetime'])
    data = []
    datei = next(pdates)
    for datej, val in zip(marks, values):
        try:
            while datei < datej - delta:
                data.append(0)
                datei = next(pdates)
            while datei < datej + delta:
                data.append(val)
                datei = next(pdates)
        except StopIteration:
            break
    return data
dict1 = {'ws:seeFwhm':
         {'datetime': [dt.datetime(2011, 12, 19, 12, 0, 0),
                       dt.datetime(2011, 12, 19, 12, 1, 0),
                       dt.datetime(2011, 12, 19, 12, 20, 0),
                       dt.datetime(2011, 12, 19, 12, 22, 0),
                       dt.datetime(2011, 12, 19, 12, 40, 0), ],
          'values': [1, 2, 3, 4, 5]}}
dict2 = {'pwfs2:dc:seeing':
         {'datetime': [dt.datetime(2011, 12, 19, 12, 9),
                       dt.datetime(2011, 12, 19, 12, 19),
                       dt.datetime(2011, 12, 19, 12, 29),
                       dt.datetime(2011, 12, 19, 12, 39),
                       ], }}

if __name__ == '__main__':
    dimmKey = 'ws:seeFwhm'
    pwfs2Key = 'pwfs2:dc:seeing'
    print(combineArs(dict1[dimmKey], dict2[pwfs2Key]))
yields
[0, 3, 0, 5]
I think you can do it with one fewer loop:
import datetime
import numpy

# Test data
# Create an array of dates spaced at 1 minute intervals
m = range(1, 21)
n = datetime.datetime.now()
a = numpy.array([n + datetime.timedelta(minutes=i) for i in m])
# A smaller array with three of those dates
m = [5, 10, 15]
b = numpy.array([n + datetime.timedelta(minutes=i) for i in m])
# End of test data

def date_range(date_array, single_date, delta):
    plus = single_date + datetime.timedelta(minutes=delta)
    minus = single_date - datetime.timedelta(minutes=delta)
    return date_array[(date_array < plus) * (date_array > minus)]

dates = []
for i in b:
    dates.append(date_range(a, i, 5))
all_matches = numpy.unique(numpy.array(dates).flatten())
There is surely a better way to gather and merge the matches, but you get the idea... You could also use numpy.argwhere((a < plus) * (a > minus)) to return the index instead of the date and use the index to grab the whole row and place it into your new array.
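For instance, a minimal sketch of that index-based variant (index_range is my own hypothetical helper, reusing the imports and the a and b test data above):
def index_range(date_array, single_date, delta):
    plus = single_date + datetime.timedelta(minutes=delta)
    minus = single_date - datetime.timedelta(minutes=delta)
    # Indices of the dates inside the window, instead of the dates themselves
    return numpy.argwhere((date_array < plus) * (date_array > minus)).ravel()

for i in b:
    idx = index_range(a, i, 5)
    print(idx)  # these indices could grab whole rows from a companion array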
I have a daypart column (str), which has a 1 or 0 for each hour of the day, depending on whether we choose to run a campaign during that hour.
Example:
daypart = '110011100111111100011110'
I want to convert this to the following string format:
'0-1, 4-6, 9-15, 19-22'
The above format is more readable, and shows during which hours the campaign ran.
Here's what I'm doing:
hours_list = []
ind = 0
for x in daypart:
if int(x) == 1:
hours_list.append(ind)
else:
hours_list.append('exclude')
ind += 1
The above gives me a list like this:
[0, 1, 'exclude', 'exclude', 4, 5, 6, 'exclude', 'exclude', 9, 10, 11, 12, 13, 14, 15, 'exclude', 'exclude', 'exclude', 19, 20, 21, 22, 'exclude']
Now I want to find a way to turn the above into my desired output. What I am thinking of doing is finding which elements exist between 'exclude' entries and adding them to new lists. I can then take the smallest and largest element from each list, join them with a '-', and append all such lists together.
Any ideas how I can do this, or a simpler way to do all of this?
Here's simple, readable code to get all intervals:
daypart = '1111111111111111111111'
hours = []
start, end = -1, -1
for i in range(len(daypart)):
    if daypart[i] == "1":
        if end != -1:
            end += 1
        else:
            start = i
            end = i
    else:
        if end != -1:
            hours.append([start, end])
            start, end = -1, -1
if end != -1:
    hours.append([start, end])
    start, end = -1, -1
print(hours)
I suggest that you convert directly to your desired format rather than using an intermediate representation that has the exact same information as the original input. Let's think about how we can do this in words:
Look for the first 1 in the input string
Add the index to a list
Look for the next 0 in the string.
Append one less than the found index to the list. (Or maybe append the indexes from steps 2 and 4 as a pair?)
Continue by looking for the next 1 and repeat steps 2-4.
I leave translating this into code as an exercise for the reader.
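That said, a minimal sketch of this scan using str.find (my own illustration, not part of the original answer):
def to_ranges(daypart):
    ranges = []
    i = daypart.find('1')                    # step 1: first 1 in the string
    while i != -1:
        j = daypart.find('0', i)             # step 3: next 0 after it
        if j == -1:                          # the run of 1s reaches the end
            j = len(daypart)
        ranges.append('%d-%d' % (i, j - 1))  # steps 2 and 4 as a pair
        i = daypart.find('1', j)             # step 5: continue with the next 1
    return ', '.join(ranges)

print(to_ranges('110011100111111100011110'))  # 0-1, 4-6, 9-15, 19-22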
This can be done using itertools.groupby, operator.itemgetter and enumerate in a comprehension as well:
from itertools import groupby
from operator import itemgetter
daypart = '110011100111111100011110'
get_ends, get_one = itemgetter(0,-1), itemgetter(1)
output = ', '.join('{0[0]}-{1[0]}'.format(*get_ends(list(g))) for k,g in groupby(enumerate(daypart), get_one) if k=='1')
print(output)
0-1, 4-6, 9-15, 19-22
get_ends gets the first and last elements in each group, and get_one just gets element 1 (the character) so it can be used as the key.
I'm having trouble trying to create a list of values from a list of tuples by chaining them together, linking each tuple whose second value matches the first value of another tuple, starting and ending with certain values.
For example:
start = 11
end = 0
list_tups = [(0,1),(0, 2),(0, 3),(261, 0),(8, 15),(118, 32),(11, 8),(15, 118),(32, 261)]
So I want to iterate through that list of tuples, starting with the one whose first value matches the start value, and follow the chain until it ends with the end value.
So my desired output would be:
[11, 8, 15, 118, 32, 261, 0]
I understand how to check the values; I'm just having trouble with iterating through the tuples every time to check if there is a tuple in the list that matches the second value.
You are describing pathfinding in a directed graph.
>>> import networkx as nx
>>> g = nx.DiGraph(list_tups)
>>> nx.shortest_path(g, start, end)
[11, 8, 15, 118, 32, 261, 0]
This doesn't work with end = 0 because there is no 0 at the end, but here it is with 32:
>>> start = 11
>>> end = 32
>>> flattened = [i for t in list_tups for i in t]
>>> flattened[flattened.index(start):flattened.index(end, flattened.index(start))+1]
[11, 8, 15, 118, 32]
You can recursively search the tuples, moving the start value closer and closer to the end. The path is accumulated as we move back up through the chain. You may need to tweak the path a little to get your desired outcome (I believe you'll need to append the first starting value, then reverse it).
def find(start, end, tuples, path):
    for t in tuples:
        if t[0] == start:
            if t[1] == end or find(t[1], end, tuples, path):
                path.append(t[1])
                return True
    return False
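For illustration, here is how that tweak might look in use (my own sketch, reusing find above and the example's list_tups, and assuming acyclic data like the example):
path = []
if find(11, 0, list_tups, path):
    path.append(11)  # append the first starting value...
    path.reverse()   # ...then reverse to get start -> end order
print(path)  # [11, 8, 15, 118, 32, 261, 0]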
I have two nested NumPy arrays (dateValArr & searchDates). dateValArr contains all dates for May 2011 (1st - 31st) and a value associated with each date. searchDates contains 2 dates and an associated value as well (the 2 dates correspond to a date range).
Using the date ranges specified in the searchDates array, I want to find dates in the dateValArr array. Next, for those selected dates in dateValArr, I want to find the value closest to the specified value of searchDates.
I have come up with this code, but the first part only works if a single value is specified.
#setup arrays ---------------------------------------------------------------------------
import numpy as np
import pandas as pd
from datetime import datetime

# Generate dates
st_date = '2011-05-01'
ed_date = '2011-05-31'
dates = pd.date_range(st_date, ed_date).to_numpy(dtype=object)
# Generate values
val_arr = np.random.uniform(1, 12, 31)
dateValLs = []
for i, j in zip(dates, val_arr):
    dateValLs.append((i, j))
dateValArr = np.asarray(dateValLs)
print(dateValArr)
#out:
[[Timestamp('2011-05-01 00:00:00', freq='D') 7.667399233149668]
 [Timestamp('2011-05-02 00:00:00', freq='D') 5.906099813052642]
 [Timestamp('2011-05-03 00:00:00', freq='D') 3.254485533826182]
 ...]
#Generate search dates
searchDates = np.array([(datetime(2011, 5, 11), datetime(2011, 5, 20), 9), (datetime(2011, 5, 25), datetime(2011, 5, 29), 2)])
print(searchDates)
#out:
[[datetime.datetime(2011, 5, 11, 0, 0) datetime.datetime(2011, 5, 20, 0, 0) 9]
 [datetime.datetime(2011, 5, 25, 0, 0) datetime.datetime(2011, 5, 29, 0, 0) 2]]
#end setup ------------------------------------------------------------------------------

x = np.where(np.logical_and(dateValArr[:,0] > searchDates[0][0], dateValArr[:,0] < searchDates[0][1]))
print(x)
#out: (array([11, 12, 13, 14, 15, 16, 17, 18], dtype=int64),)
However, the code works only if I select the first element of searchDates (searchDates[0][0]). It will not run for all values in searchDates. What I mean is, if I replace it with the following code:
x = np.where(np.logical_and(dateValArr[:,0] > searchDates[0], dateValArr[:,0] < searchDates[0]))
then I get the following error: operands could not be broadcast together with shapes (31,) (3,)
To find the closest value, I was hoping to somehow combine it with the following line of code:
n = (np.abs(dateValArr[:,1] - searchDates[:,2])).argmin()
Any ideas on how to solve it? Thanks in advance.
The only thing that came to my mind is a for loop.
result = np.array([])
for search_term in searchDates:
    mask = (dateValArr[:,0] > search_term[0]) & (dateValArr[:,0] < search_term[1])
    date_search_result = dateValArr[mask, :]
    # Distance of each candidate value from this search term's target value
    d = np.abs(date_search_result[:,1] - search_term[2])
    result = np.hstack([result, date_search_result[d.argmin()]])
print(result)
I kind of figured it out as well:
date_value = []
for i in searchDates:
    dateidx_arr = np.where(np.logical_and(dateValArr[:,0] >= i[0], dateValArr[:,0] <= i[1]))  # get the index of the specified date ranges
    date_arr = dateValArr[dateidx_arr]  # based on the index, get the dates and values
    value_arr = (np.abs(date_arr[:,1] - i[2])).argmin()  # for those dates, calculate the index of the closest value
    date_value.append(date_arr[value_arr])  # use the index to get the closest date and value
I'm dealing with polygonal data in realtime here, but the problem's quite simple.
I have a huge list containing thousands of sets of polygon indices (integers) and I need to simplify the list as "fast" as possible into a list of sets of "connected" indices.
i.e. Any sets containing integers that are also in another set become one set in the result. I've read several possible solutions involving sets & graphs etc. All I'm after is a final list of sets which had any degree of commonality.
I'm dealing with lots of data here, but for simplicity's sake here's some sample data:
setA = set([0,1,2])
setB = set([6,7,8,9])
setC = set([4,5,6])
setD = set([3,4,5,0])
setE = set([10,11,12])
setF = set([11,13,14,15])
setG = set([16,17,18,19])
listOfSets = [setA,setB,setC,setD,setE,setF,setG]
In this case I'm after a list with a result like this, although ordering is irrelevant:
connectedFacesListOfSets = [ set([0,1,2,3,4,5,6,7,8,9]), set([10,11,12,13,14,15]), set([16,17,18,19])]
I've looked for similar solutions, but the one with the highest votes gave incorrect results on my large test data.
Merge lists that share common elements
It's hard to tell the performance without a sufficiently large set, but here is some basic code to start from:
while True:
    merged_one = False
    supersets = [listOfSets[0]]
    for s in listOfSets[1:]:
        in_super_set = False
        for ss in supersets:
            if s & ss:
                ss |= s
                merged_one = True
                in_super_set = True
                break
        if not in_super_set:
            supersets.append(s)
    print(supersets)
    if not merged_one:
        break
    listOfSets = supersets
This works in 3 iterations on the provided data. And the output is as follows:
[{0, 1, 2, 3, 4, 5}, {4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14, 15}, {16, 17, 18, 19}]
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14, 15}, {16, 17, 18, 19}]
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14, 15}, {16, 17, 18, 19}]
This is a union find problem.
Though I haven't used it, this Python code looks good to me.
http://code.activestate.com/recipes/577225-union-find/
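For reference, here is a minimal union-find sketch of the idea (my own illustration, not the linked recipe), applied to the question's listOfSets:
def merge_sets(list_of_sets):
    parent = {}

    def find(x):
        # Walk up to the root, compressing the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for s in list_of_sets:
        items = list(s)
        for x in items:
            parent.setdefault(x, x)
        # Union every element of the set with the first one
        for x in items[1:]:
            union(items[0], x)

    # Group elements by their root
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

print(merge_sets(listOfSets))
# e.g. [{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14, 15}, {16, 17, 18, 19}]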
Forgive the messed up caps (autocorrect...):
import copy

# The results container
connected = set()
sets = listOfSets  # some list of sets (e.g. the question's data)
# Convert the sets to frozensets (which are hashable and can be added to sets themselves)
sets = [frozenset(s) for s in sets]
for s1 in sets:
    res = copy.copy(s1)
    for s2 in sets:
        if s1 & s2:
            res = res | s2
    connected.add(res)
So.. I think I got it. It's a mess but I got it. Here's what I did:
def connected_valid(li):
    for i, l in enumerate(li):
        for j, k in enumerate(li):
            if i != j and contains(l, k):
                return False
    return True

def contains(set1, set2):
    for s in set1:
        if s in set2:
            return True
    return False

def combine(set1, set2):
    set2 |= set1
    return set2

def connect_sets(li):
    while not connected_valid(li):
        s1 = li.pop(0)
        s2 = li[0]
        if contains(s1, s2):
            li[0] = combine(s1, s2)
        else:
            li.append(s1)
    return li
Then in the main function you'd do something like this:
setA = set([0,1,2])
setB = set([6,7,8,9])
setC = set([4,5,6])
setD = set([3,4,5,0])
setE = set([10,11,12])
setF = set([11,13,14,15])
setG = set([16,17,18,19])
connected_sets = connect_sets([setA,setB,setC,setD,setE,setF,setG,])
After running it, I got the following output:
print(connected_sets)
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14, 15}, {16, 17, 18, 19}]
Hope that's what you're looking for.
EDIT: Added code to randomly generate sets:
from random import sample

# Creates a list of 4000 sets with a random number of values ranging from 0 to 20000
sets = []
for x in range(4000):
    rand_num = sample(range(20), 1)[0]
    tmp_set_li = sample(range(20000), rand_num)
    sets.append(set(tmp_set_li))
The last 3 lines can be condensed into one if you really wanted to.
I tried to do something different: this algorithm loops once for each set and once for each element:
# Our test sets
setA = set([0,1,2])
setB = set([6,7,8,9])
setC = set([4,5,6])
setD = set([3,4,5,0])
setE = set([10,11,12])
setF = set([11,13,14,15])
setG = set([16,17,18,19])
list_of_sets = [setA,setB,setC,setD,setE,setF,setG]

# We will use a map to store our new merged sets.
# This map will work as a reference abstraction, so it will
# map set ids to the set or to another set id.
# This map may have an indirection level greater than 1
merged_sets = {}

# We will also use a map between indexes and set ids.
index_to_id = {}

# Given a set id, returns an equivalent set id that refers directly
# to a set in the merged_sets map
def resolve_id(id):
    if not isinstance(id, int):
        return None
    while isinstance(merged_sets[id], int):
        id = merged_sets[id]
    return id

# Points the informed set to the destination id
def link_id(id_source, id_destination):
    point_to = merged_sets[id_source]
    merged_sets[id_source] = id_destination
    if isinstance(point_to, int):
        link_id(point_to, id_destination)

empty_set_found = False
# For each set
for current_set_id, current_set in enumerate(list_of_sets):
    if len(current_set) == 0 and empty_set_found:
        continue
    if len(current_set) == 0:
        empty_set_found = True
    # Create a set id for the set and place it on the merged sets map
    merged_sets[current_set_id] = current_set
    # For each index in the current set
    possibly_merged_current_set = current_set
    for index in current_set:
        # See if the index is free, i.e., has not been assigned to any set id
        if index not in index_to_id:
            # If it is free, then assign the set id to the index
            index_to_id[index] = current_set_id
            # ... and then go to the next index
        else:
            # If it is not free, then we may need to merge the sets
            # Find out to which set we need to merge the current one,
            # ... dereferencing if necessary
            id_to_merge = resolve_id(index_to_id[index])
            # First we check to see if the assignment is to the current set or not
            if id_to_merge == resolve_id(merged_sets[current_set_id]):
                continue
            # Merge the current set to the one found
            print('Merging %d with %d' % (current_set_id, id_to_merge))
            merged_sets[id_to_merge] |= possibly_merged_current_set
            possibly_merged_current_set = merged_sets[id_to_merge]
            # Map the current set id to the set id of the merged set
            link_id(current_set_id, id_to_merge)

# Print all the sets in the merged sets map (ignore the references)
print([x for x in merged_sets.values() if not isinstance(x, int)])
It prints:
Merging 2 with 1
Merging 3 with 0
Merging 3 with 1
Merging 5 with 4
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14, 15}, {16, 17, 18, 19}]
I have a list of tuples:
[(3,4), (18,27), (4,14)]
and need code that merges tuples which have repeated numbers, making another list where all list elements contain only unique numbers. The list should be sorted by the length of the tuples, i.e.:
>>> MergeThat([(3,4), (18,27), (4,14)])
[(3,4,14), (18,27)]
>>> MergeThat([(1,3), (15,21), (1,10), (57,66), (76,85), (66,76)])
[(57,66,76,85), (1,3,10), (15,21)]
I understand it's something similar to hierarchical clustering algorithms, which I've read about, but can't figure them out.
Is there a relatively simple code for a MergeThat() function?
I tried hard to figure this out, but only after I tried the approach suggested by Ian's answer (thanks!) did I realize what the theoretical problem is: the input is a list of edges and defines a graph, and we are looking for the connected components of this graph. It's as simple as that.
While you can do this efficiently, there is actually no reason to implement it yourself! Just import a good graph library:
import networkx as nx

# one of your examples
g1 = nx.Graph([(1,3), (15,21), (1,10), (57,66), (76,85), (66,76)])
print(list(nx.connected_components(g1)))  # [{57, 66, 76, 85}, {1, 3, 10}, {15, 21}] (order may vary)

# my own test case
g2 = nx.Graph([(1,2), (2,10), (20,3), (3,4), (4,10)])
print(list(nx.connected_components(g2)))  # [{1, 2, 3, 4, 10, 20}]
import itertools

def merge_it(lot):
    merged = [set(x) for x in lot]  # operate on sets only
    finished = False
    while not finished:
        finished = True
        for a, b in itertools.combinations(merged, 2):
            if a & b:
                # we merged in this iteration, we may have to do one more
                finished = False
                if a in merged: merged.remove(a)
                if b in merged: merged.remove(b)
                merged.append(a.union(b))
                break  # don't inflate 'merged' with intermediate results
    return merged

if __name__ == '__main__':
    print(merge_it([(3,4), (18,27), (4,14)]))
    # => [{18, 27}, {3, 4, 14}]
    print(merge_it([(1,3), (15,21), (1,10), (57,66), (76,85), (66,76)]))
    # => [{15, 21}, {1, 3, 10}, {57, 66, 76, 85}]
    print(merge_it([(1,2), (2,3), (3,4), (4,5), (5,9)]))
    # => [{1, 2, 3, 4, 5, 9}]
Here's a snippet (including doctests): http://gist.github.com/586252
def collapse(L):
    """ The input L is a list that contains tuples of various sizes.
    If any tuples have shared elements,
    exactly one instance of the shared and unshared elements is merged into the first tuple with a shared element.
    This function returns a new list that contains the merged tuples and an int that represents how many merges were performed."""
    answer = []
    merges = 0
    seen = []  # a list of all the numbers that we've seen so far
    for t in L:
        tAdded = False
        for num in t:
            pleaseMerge = True
            if num in seen and pleaseMerge:
                answer += merge(t, answer)
                merges += 1
                pleaseMerge = False
                tAdded = True
            else:
                seen.append(num)
        if not tAdded:
            answer.append(t)
    return (answer, merges)

def merge(t, L):
    """ The input L is a list that contains tuples of various sizes.
    The input t is a tuple that contains an element that is contained in another tuple in L.
    Return a new list that is similar to L but contains the new elements in t added to the tuple with which t has a common element."""
    answer = []
    while L:
        tup = L[0]
        tupAdded = False
        for i in tup:
            if i in t:
                try:
                    L.remove(tup)
                    newTup = set(tup)
                    for i in t:
                        newTup.add(i)
                    answer.append(tuple(newTup))
                    tupAdded = True
                except ValueError:
                    pass
        if not tupAdded:
            L.remove(tup)
            answer.append(tup)
    return answer

def sortByLength(L):
    """ L is a list of n-tuples, where n>0.
    This function will return a list with the same contents as L
    except that the tuples are sorted in non-ascending order by length"""
    lengths = {}
    for t in L:
        if len(t) in lengths:
            lengths[len(t)].append(t)
        else:
            lengths[len(t)] = [t]
    answer = []
    for i in sorted(lengths.keys(), reverse=True):
        answer += lengths[i]
    return answer

def MergeThat(L):
    answer, merges = collapse(L)
    while merges:
        answer, merges = collapse(answer)
    return sortByLength(answer)

if __name__ == "__main__":
    print('starting')
    print(MergeThat([(3,4), (18,27), (4,14)]))
    # [(3, 4, 14), (18, 27)]
    print(MergeThat([(1,3), (15,21), (1,10), (57,66), (76,85), (66,76)]))
    # [(57, 66, 76, 85), (1, 10, 3), (15, 21)]
Here's another solution that doesn't use itertools and takes a different, slightly more verbose, approach. The tricky bit of this solution is the merging of cluster sets when t0 in index and t1 in index.
import doctest

def MergeThat(a):
    """ http://stackoverflow.com/questions/3744048/python-how-to-merge-a-list-into-clusters
    >>> MergeThat([(3,4), (18,27), (4,14)])
    [(3, 4, 14), (18, 27)]
    >>> MergeThat([(1,3), (15,21), (1,10), (57,66), (76,85), (66,76)])
    [(57, 66, 76, 85), (1, 3, 10), (15, 21)]
    """
    index = {}
    for t0, t1 in a:
        if t0 not in index and t1 not in index:
            index[t0] = set()
            index[t1] = index[t0]
        elif t0 in index and t1 in index:
            index[t0] |= index[t1]
            oldt1 = index[t1]
            for x in index.keys():
                if index[x] is oldt1:
                    index[x] = index[t0]
        elif t0 not in index:
            index[t0] = index[t1]
        else:
            index[t1] = index[t0]
        assert index[t0] is index[t1]
        index[t0].add(t0)
        index[t0].add(t1)
    return sorted([tuple(sorted(x)) for x in set(map(frozenset, index.values()))], key=len, reverse=True)

if __name__ == "__main__":
    doctest.testmod()
The code others have written will surely work, but here's another option, maybe simpler to understand and maybe of lower algorithmic complexity.
Keep a dictionary from numbers to the cluster (implemented as a Python set) they're a member of. Also include that number in the corresponding set. Process an input pair as one of these cases (see the sketch after this list):
Neither element is in the dictionary: create a new set, hook up the dictionary links appropriately.
One element, but not both, is in the dictionary: add the yet-unseen element to the set of its brother, and add its dictionary link to the correct set.
Both elements are seen before, but in different sets: take the union of the old sets and update all dictionary links to the new set.
You've seen both members before, and they're in the same set: do nothing.
Afterward, simply collect the unique values from the dictionary and sort in descending order of size. This portion of the job is O(m log n) and thus will not dominate the runtime.
This should work in a single pass. Writing the actual code is left as an exercise for the reader.
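For illustration, here is a minimal sketch of that approach (my own code, not the answerer's):
def merge_pairs(pairs):
    cluster_of = {}  # maps each number to the cluster (set) it belongs to
    for a, b in pairs:
        in_a, in_b = a in cluster_of, b in cluster_of
        if not in_a and not in_b:
            # Case 1: neither element seen, start a new cluster
            cluster_of[a] = cluster_of[b] = {a, b}
        elif in_a != in_b:
            # Case 2: one seen, add the new element to its brother's cluster
            seen, new = (a, b) if in_a else (b, a)
            cluster_of[seen].add(new)
            cluster_of[new] = cluster_of[seen]
        elif cluster_of[a] is not cluster_of[b]:
            # Case 3: both seen in different clusters, union and relink
            merged = cluster_of[a] | cluster_of[b]
            for x in merged:
                cluster_of[x] = merged
        # Case 4: both seen, same cluster, nothing to do
    unique = {id(s): s for s in cluster_of.values()}
    return sorted(unique.values(), key=len, reverse=True)

print(merge_pairs([(1,3), (15,21), (1,10), (57,66), (76,85), (66,76)]))
# e.g. [{57, 66, 76, 85}, {1, 3, 10}, {15, 21}]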
This is not efficient for huge lists.
def merge_that(lot):
    final_list = []
    while len(lot) > 0:
        temp_set = set(lot[0])
        deletable = [0]  # indexes of all tuples consumed by temp_set
        for i, tup2 in enumerate(lot[1:]):
            if tup2[0] in temp_set or tup2[1] in temp_set:
                deletable.append(i + 1)  # enumerate(lot[1:]) is offset by one
                temp_set = temp_set.union(tup2)
        for d in reversed(deletable):  # delete from the back so indexes stay valid
            del lot[d]
        deletable = []
        # Some of the tuples consumed later might have missed their brothers
        # So, looping again after deleting the consumed tuples
        for i, tup2 in enumerate(lot):
            if tup2[0] in temp_set or tup2[1] in temp_set:
                deletable.append(i)
                temp_set = temp_set.union(tup2)
        for d in reversed(deletable):
            del lot[d]
        final_list.append(tuple(temp_set))
    return final_list
It looks ugly but works.