Python: simple list merging based on intersections











Consider some lists of integers, such as:



#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n
#--------------------------------------


The question is how to merge all lists that share at least one common element. For the part shown above, the result would be as follows:



#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------


What is the most efficient way to do this for large data (the elements are just numbers)?
Is a tree structure something to think about?
I do the job now by converting the lists to sets and iterating over intersections, but it is slow! Furthermore, I have a feeling it is too elementary. In addition, the implementation lacks something (unknown), because some lists sometimes remain unmerged! Having said that, if you propose rolling your own implementation, please be generous and provide simple sample code [apparently Python is my favorite :)] or pseudo-code.
Update 1:
Here is the code I was using:



#--------------------------------------
lsts = [[0, 1, 3],
        [1, 0, 3, 4, 5, 10, 11],
        [2, 8],
        [3, 1, 0, 16]]
#--------------------------------------


The function is (buggy!!):



#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i + 1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else:
                j += 1  #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------


The result is:



#--------------------------------------
>>> merge(lsts)
[[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------


Update 2:
In my experience the code given by Niklas Baumstark below showed itself to be a bit faster for the simple cases. I have not tested the method given by "Hooked" yet, since it is a completely different approach (by the way, it seems interesting).
Testing all of these thoroughly could be really hard, or even impossible, if one wants to be certain of the results. The real data set I will use is so large and complex that it is impossible to trace any error just by repeating the run. That is, I need to be 100% satisfied with the reliability of the method before pushing it into its place as a module within a large code. For now, Niklas's method is simply faster, and the answer for simple sets is of course correct.
However, how can I be sure that it works well for a really large data set, given that I will not be able to trace the errors visually?
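One practical way to gain confidence is to compare the candidate routine against an obviously correct, slow reference on many small random inputs. The sketch below is purely illustrative (reference_merge is a deliberately naive brute-force stand-in, and fast_merge is a placeholder for whichever implementation is being vetted; neither is code from any answer):

import random

def reference_merge(lsts):
    # Deliberately simple and slow: keep merging any two overlapping sets
    # until no pair overlaps any more.
    sets = [set(l) for l in lsts if l]
    changed = True
    while changed:
        changed = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i] & sets[j]:
                    sets[i] |= sets.pop(j)
                    changed = True
                    break
            if changed:
                break
    return sets

def random_case(n_lists=30, max_val=40, max_len=6):
    return [random.sample(range(max_val), random.randint(0, max_len))
            for _ in range(n_lists)]

def same_partition(a, b):
    # Compare two groupings while ignoring group order and element order.
    return sorted(map(sorted, a)) == sorted(map(sorted, b))

# fast_merge stands for whichever implementation is being vetted:
# for trial in range(1000):
#     case = random_case()
#     assert same_partition(reference_merge(case), fast_merge(case))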



Update 3:
Note that reliability of the method is much more important than speed for this problem. Eventually I hope to be able to translate the Python code to Fortran for maximum performance.



Update 4:

There are many interesting points in this post, along with generously given answers and constructive comments. I would recommend reading all of them thoroughly. Please accept my appreciation for the development of the question, the amazing answers, and the constructive comments and discussion.










python merge tree set-intersection equivalence-classes

asked Feb 2 '12 at 10:36 by Developer, edited Feb 22 '12 at 5:48
  • Should the merge be recursive? I.e., if you have (1) [0, 1, 2], (2) [1, 3, 4] and (3) [4, 5, 6], do you expect the result to be one list since the union of (1) and (2) will share the 4 with (3), or do you expect two lists since (1) and (3) are disjoint?
    – Ferdinand Beyer
    Feb 2 '12 at 10:42










  • If you show us the code you have now, we may be able to help you find your bug.
    – cha0site
    Feb 2 '12 at 10:44










  • "large data" means many many lists or very long lists? Maybe a smart multithreading can buy you some time.
    – Rik Poggi
    Feb 2 '12 at 11:03










  • @FerdinandBeyer According to your explanation it should be recursive, so at the end the remaining lists have no common element.
    – Developer
    Feb 2 '12 at 12:15










  • @RikPoggi Almost both: many lists, each of which could be long.
    – Developer
    Feb 2 '12 at 12:16















15 Answers
Answer (21 votes, accepted)
My attempt:



def merge(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = True
                    common |= x
            results.append(common)
        sets = results
    return sets

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print merge(lst)


Benchmark:



import random

# adapt parameters to your own usage scenario
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5

if False:  # change to true to generate the test data file (takes a while)
    with open("/tmp/test.txt", "w") as f:
        lists = []
        classes = [
            range(class_size * i, class_size * (i + 1)) for i in range(class_count)
        ]
        for c in classes:
            # distribute each class across ~300 lists
            for i in xrange(list_count_per_class):
                lst = []
                if random.random() < large_list_probability:
                    size = random.choice(large_list_sizes)
                else:
                    size = random.choice(small_list_sizes)
                nums = set(c)
                for j in xrange(size):
                    x = random.choice(list(nums))
                    lst.append(x)
                    nums.remove(x)
                random.shuffle(lst)
                lists.append(lst)
        random.shuffle(lists)
        for lst in lists:
            f.write(" ".join(str(x) for x in lst) + "\n")

setup = """
# Niklas'
def merge_niklas(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

# Rik's
def merge_rik(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set
    return results

# katrielalex's
def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

import networkx

def merge_katrielalex(lsts):
    g = networkx.Graph()
    for lst in lsts:
        for edge in pairs(lst):
            g.add_edge(*edge)
    return networkx.connected_components(g)

# agf's (optimized)
from collections import deque

def merge_agf_optimized(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

# agf's (simple)
def merge_agf_simple(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

# alexis'
def merge_alexis(data):
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin

                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin

                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest

    # Filter out the empty bins
    have = [m for m in data if m]
    return have

def locatebin(bins, n):
    while bins[n] != n:
        n = bins[n]
    return n

lsts = []
size = 0
num = 0
max = 0
for line in open("/tmp/test.txt", "r"):
    lst = [int(x) for x in line.split()]
    size += len(lst)
    if len(lst) > max:
        max = len(lst)
    num += 1
    lsts.append(lst)
"""

setup += """
print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
""".format(class_count=class_count)

import timeit
print "niklas"
print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
print "rik"
print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
print "katrielalex"
print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
print "agf (1)"
print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
print "agf (2)"
print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
print "alexis"
print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)


These timings are obviously dependent on the specific parameters to the benchmark, like number of classes, number of lists, list size, etc. Adapt those parameters to your need to get more helpful results.



Below are some example outputs on my machine for different parameters. They show that all the algorithms have their strengths and weaknesses, depending on the kind of input they get:



=====================
# many disjoint classes, large lists
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
=====================

niklas
5000 lists, 50 equally distributed classes, average size 298, max size 999
4.80084705353
rik
5000 lists, 50 equally distributed classes, average size 298, max size 999
9.49251699448
katrielalex
5000 lists, 50 equally distributed classes, average size 298, max size 999
21.5317108631
agf (1)
5000 lists, 50 equally distributed classes, average size 298, max size 999
8.61671280861
agf (2)
5000 lists, 50 equally distributed classes, average size 298, max size 999
5.18117713928
=> alexis
=> 5000 lists, 50 equally distributed classes, average size 298, max size 999
=> 3.73504281044

===================
# less number of classes, large lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
===================

niklas
4500 lists, 15 equally distributed classes, average size 296, max size 999
1.79993700981
rik
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.58237695694
katrielalex
4500 lists, 15 equally distributed classes, average size 296, max size 999
19.5465381145
agf (1)
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.75445604324
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 296, max size 999
=> 1.77850699425
alexis
4500 lists, 15 equally distributed classes, average size 296, max size 999
3.23530197144

===================
# less number of classes, smaller lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.1
===================

niklas
4500 lists, 15 equally distributed classes, average size 95, max size 997
0.773697137833
rik
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.0523750782
katrielalex
4500 lists, 15 equally distributed classes, average size 95, max size 997
6.04466891289
agf (1)
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.20285701752
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 95, max size 997
=> 0.714507102966
alexis
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.1286110878





  • You could use not x.isdisjoint(common) instead of x & common to avoid building the full intersection.
    – Janne Karila
    Feb 2 '12 at 12:57










  • Janne Karila: Thanks for the information! I changed this.
    – Niklas B.
    Feb 2 '12 at 12:59












  • lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.
    – Developer
    Feb 2 '12 at 15:45












  • @Developer: Yeah, you are right. My thinking was faulty, disjointness is not an equivalency relation!
    – Niklas B.
    Feb 2 '12 at 15:59








  • This is such a fun problem, anyway :-) Thanks for providing the test setup.
    – alexis
    Feb 26 '12 at 14:30


















Answer (13 votes, +50 bounty)
I tried to summarize everything that's been said and done about this topic, in this question and in the duplicate one.



I tried to test and time every solution (all the code here).



Testing



This is the TestCase from the testing module:



class MergeTestCase(unittest.TestCase):

    def setUp(self):
        with open('./lists/test_list.txt') as f:
            self.lsts = json.loads(f.read())
        self.merged = self.merge_func(deepcopy(self.lsts))

    def test_disjoint(self):
        """Check disjoint-ness of merged results"""
        from itertools import combinations
        for a, b in combinations(self.merged, 2):
            self.assertTrue(a.isdisjoint(b))

    def test_coverage(self):  # Credit to katrielalex
        """Check coverage original data"""
        merged_flat = set()
        for s in self.merged:
            merged_flat |= s

        original_flat = set()
        for lst in self.lsts:
            original_flat |= set(lst)

        self.assertTrue(merged_flat == original_flat)

    def test_subset(self):  # Credit to WolframH
        """Check that every original data is a subset"""
        for lst in self.lsts:
            self.assertTrue(any(set(lst) <= e for e in self.merged))


This test assumes a list of sets as the result, so I couldn't test a couple of solutions that worked with lists.
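For what it's worth, a thin adapter along these lines (purely illustrative, not code from the repository) would make a list-returning merge function testable with the same TestCase:

def sets_adapter(merge_func):
    """Wrap a merge function that returns lists so it returns sets instead."""
    def wrapped(lsts):
        return [set(group) for group in merge_func(lsts)]
    return wrapped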



I couldn't test the following:



katrielalex
steabert


Among the ones I could test, two failed:



  -- Going to test: agf (optimized) --
Check disjoint-ness of merged results ... FAIL

-- Going to test: robert king --
Check disjoint-ness of merged results ... FAIL


Timing



The performances are strongly related with the data test employed.



So far three answers tried to time theirs and others solution. Since they used different testing data they had different results.





  1. Niklas' benchmark is very tweakable. With his benchmark one can run different tests by changing some parameters.



    I've used the same three sets of parameters he used in his own answer, and I put them in three different files:



    filename = './lists/timing_1.txt'
    class_count = 50,
    class_size = 1000,
    list_count_per_class = 100,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,

    filename = './lists/timing_2.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,

    filename = './lists/timing_3.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.1,


    These are the results that I got:



    From file: timing_1.txt



    Timing with: >> Niklas << Benchmark
    Info: 5000 lists, average size 305, max size 999

    Timing Results:
    10.434 -- alexis
    11.476 -- agf
    11.555 -- Niklas B.
    13.622 -- Rik. Poggi
    14.016 -- agf (optimized)
    14.057 -- ChessMaster
    20.208 -- katrielalex
    21.697 -- steabert
    25.101 -- robert king
    76.870 -- Sven Marnach
    133.399 -- hochl


    From file: timing_2.txt



    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 305, max size 999

    Timing Results:
    8.247 -- Niklas B.
    8.286 -- agf
    8.637 -- Rik. Poggi
    8.967 -- alexis
    9.090 -- ChessMaster
    9.091 -- agf (optimized)
    18.186 -- katrielalex
    19.543 -- steabert
    22.852 -- robert king
    70.486 -- Sven Marnach
    104.405 -- hochl


    From file: timing_3.txt



    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 98, max size 999

    Timing Results:
    2.746 -- agf
    2.850 -- Niklas B.
    2.887 -- Rik. Poggi
    2.972 -- alexis
    3.077 -- ChessMaster
    3.174 -- agf (optimized)
    5.811 -- katrielalex
    7.208 -- robert king
    9.193 -- steabert
    23.536 -- Sven Marnach
    37.436 -- hochl



  2. With Sven's testing data I got the following results:



    Timing with: >> Sven << Benchmark
    Info: 200 lists, average size 10, max size 10

    Timing Results:
    2.053 -- alexis
    2.199 -- ChessMaster
    2.410 -- agf (optimized)
    3.394 -- agf
    3.398 -- Rik. Poggi
    3.640 -- robert king
    3.719 -- steabert
    3.776 -- Niklas B.
    3.888 -- hochl
    4.610 -- Sven Marnach
    5.018 -- katrielalex



  3. And finally with Agf's benchmark I got:



    Timing with: >> Agf << Benchmark
    Info: 2000 lists, average size 246, max size 500

    Timing Results:
    3.446 -- Rik. Poggi
    3.500 -- ChessMaster
    3.520 -- agf (optimized)
    3.527 -- Niklas B.
    3.527 -- agf
    3.902 -- hochl
    5.080 -- alexis
    15.997 -- steabert
    16.422 -- katrielalex
    18.317 -- robert king
    1257.152 -- Sven Marnach



As I said at the beginning, all the code is available at this git repository. All the merging functions are in a file called core.py; every function there whose name ends with _merge will be auto-loaded during the tests, so it shouldn't be hard to add/test/improve your own solution.
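As a rough sketch of such suffix-based auto-loading (illustrative only, not necessarily how the repository actually does it), introspection over the core module is enough:

import inspect
import core  # the module holding the *_merge functions

merge_functions = {
    name: func
    for name, func in inspect.getmembers(core, inspect.isfunction)
    if name.endswith('_merge')
}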



Let me also know if there's something wrong, it's been a lot of coding and I could use a couple of fresh eyes :)






  • how about my rewrite of Niklas B. answer. just wondering if the timings of that are relevant.
    – ChessMaster
    Feb 26 '12 at 20:59










  • @ChessMaster: Sometimes does slightly better, others slightly worse, this is why I didn't put it among the results. If you're interested you can try it yourself, at the link there's a git repository with all the merge functions in a file called core.py. Every function there with the name ending with _merge will be auto loaded. I just pushed yours, so you'll find it already there in skip mode. :)
    – Rik Poggi
    Feb 26 '12 at 22:38










  • Thanks for your great effort.
    – Developer
    Feb 27 '12 at 14:29






  • Sometimes I'm really surprised how much high-quality effort and knowledge is put into the answers on this site. Really nice job putting this compilation together!
    – Niklas B.
    Feb 27 '12 at 20:33












  • Where is the answer by niklas-b? None of the 14 answers on this page are by niklas-b...
    – tommy.carstensen
    Apr 28 '15 at 18:36


















Answer (7 votes)
Using Matrix Manipulations



Let me preface this answer with the following comment:



THIS IS THE WRONG WAY TO DO THIS. IT IS PRONE TO NUMERICAL INSTABILITY AND IS MUCH SLOWER THAN THE OTHER METHODS PRESENTED, USE AT YOUR OWN RISK.



That being said, I couldn't resist solving the problem from a dynamical point of view (and I hope you'll get a fresh perspective on the problem). In theory this should work all the time, but eigenvalue calculations can often fail. The idea is to think of your list as a flow from rows to columns. If two rows share a common value there is a connecting flow between them. If we were to think of these flows as water, we would see that the flows cluster into little pools when there is a connecting path between them. For simplicity, I'm going to use a smaller set, though it works with your data set as well:



from numpy import where, newaxis
from scipy import linalg, array, zeros

X = [[0,1,3],[2],[3,1]]


We need to convert the data into a flow graph. If row i flows into value j we put it in the matrix. Here we have 3 rows and 4 unique values:



A = zeros((4,len(X)), dtype=float)
for i,row in enumerate(X):
    for val in row: A[val,i] = 1


In general, you'll need to change the 4 to capture the number of unique values you have. If the set is a list of integers starting from 0, as we have, you can simply make this one more than the largest number. We now perform an eigenvalue decomposition. An SVD to be exact, since our matrix is not square.



S  = linalg.svd(A)


We want to keep only the 3x3 portion of this answer, since it will represent the flow of the pools. In fact we only want the absolute values of this matrix; we only care if there is a flow in this cluster space.



M  = abs(S[2])


We can think of this matrix M as a Markov matrix and make it explicit by row normalizing. Once we have this we compute the (left) eigenvalue decomp. of this matrix.



M /=  M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)
V = abs(V)


Now a disconnected (non-ergodic) Markov matrix has the nice property that, for each non-connected cluster, there is an eigenvalue of unity. The eigenvectors associated with these unity values are the ones we want:



idx = where(U > .999)[0]
C = V.T[idx] > 0


I have to use .999 due to the aforementioned numerical instability. At this point, we are done! Each independent cluster can now pull the corresponding rows out:



for cluster in C:
    print where(A[:,cluster].sum(axis=1))[0]


Which gives, as intended:



[0 1 3]
[2]


Change X to your lst and you'll get: [ 0 1 3 4 5 10 11 16] [2 8].





Addendum



Why might this be useful? I don't know where your underlying data comes from, but what happens when the connections are not absolute? Say row 1 has entry 3 80% of the time - how would you generalize the problem? The flow method above would work just fine, and would be completely parametrized by that .999 value: the further away from unity it is, the looser the association.
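A sketch of that generalization (n_values and weights are assumed placeholders here; the only point is that the matrix entries no longer have to be 0 or 1):

# weights[i] is assumed to map each value in row i to its association strength,
# e.g. 0.8 for the "80% of the time" case above; anything absent defaults to 1.
A = zeros((n_values, len(X)), dtype=float)
for i, row in enumerate(X):
    for val in row:
        A[val, i] = weights[i].get(val, 1.0)
# The rest of the pipeline (M, the eigendecomposition, the threshold) is unchanged;
# lowering the .999 threshold then loosens what counts as "connected".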





Visual Representation



Since a picture is worth 1K words, here are the plots of the matrices A and V for my example and your lst respectively. Notice how V splits into two clusters (it is a block-diagonal matrix with two blocks after permutation), since for each example there were only two unique lists!



[images: "My Example" and "Your sample data", plots of A and V for each case]





Faster Implementation



In hindsight, I realized that you can skip the SVD step and compute only a single decomp:



from numpy import dot

M = dot(A.T,A)
M /= M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)


The advantage with this method (besides speed) is that M is now symmetric, hence the computation can be faster and more accurate (no imaginary values to worry about).






  • Thanks for the answer. It requires a bit deeper reading. So I am interested in testing it and see what happens.
    – Developer
    Feb 3 '12 at 2:37










  • How did you generate these really nice images?
    – Zachary Young
    Feb 3 '12 at 2:53






  • @ZacharyYoung check MatPlotLib for graphing, see Gallery.
    – Developer
    Feb 3 '12 at 3:20






  • @Developer: Hah, that's great! I've seen tons of SO questions about MatPlotLib but never bothered to check it out. Looks like I just found my new favorite SO tag :)
    – Zachary Young
    Feb 3 '12 at 5:05






  • At the stage A = zeros((4,len(X)), dtype=float) we then need to know how many unique values exist in the entire megalist! For the example given it was '4'; how about many unknown large lists of lists? How to find the unique values among them?
    – Developer
    Feb 3 '12 at 5:49


















Answer (5 votes)
EDIT: OK, the other question has been closed, posting here.



Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.



def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx
g = networkx.Graph()
for sub_list in lists:
    for edge in pairs(sub_list):
        g.add_edge(*edge)

networkx.connected_components(g)
[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]




Explanation



We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.



Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.






  • Actually I was under the impression that this is already a solved problem, so I don't see much sense in reviving this old thread. However, because I also played with the idea of using a graph library for that, I integrated this solution into my benchmark. It doesn't seem to compete too well, unfortunately, looks like the Python hackers did a nice job at implementing and optimizing sets :)
    – Niklas B.
    Feb 20 '12 at 0:12












  • Thanks for your different approach. It is really good to bring networkx to work for this purpose. It is very good package.
    – Developer
    Feb 21 '12 at 11:11










  • @NiklasB. cheers =) when I was looking at the question it seemed that nobody had actually written a working solution, but I guess I read it wrong.
    – Katriel
    Feb 21 '12 at 11:25










  • I love networkx. btw.. this could have been more simple if for each list you just added an edge between the first element in the list and all the other elements in the list. It could have been about 4 lines of code.
    – robert king
    Feb 21 '12 at 12:38










  • @robertking feel free to edit =)... I just wrote what came to mind!
    – Katriel
    Feb 21 '12 at 14:19




















Answer (4 votes)
Here's my answer. I haven't checked it against today's batch of answers.



The intersection-based algorithms are O(N^2) since they check each new set against all the existing ones, so I used an approach that indexes each number and runs in close to O(N) time (if we accept that dictionary lookups are O(1)). Then I ran the benchmarks and felt like a complete idiot because it ran slower, but on closer inspection it turned out that the test data ends up with only a handful of distinct result sets, so the quadratic algorithms don't have a lot of work to do. Test it with more than 10-15 distinct bins and my algorithm is much faster. Try test data with more than 50 distinct bins, and it is enormously faster.



(Edit: There was also a problem with the way the benchmark is run, but I was wrong in my diagnosis. I altered my code to work with the way the repeated tests are run).



def mergelists5(data):
    """Check each number in our arrays only once, merging when we find
    a number we have seen before.
    """

    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin

                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin

                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest

    # Filter out the empty bins
    have = [m for m in data if m]
    print len(have), "groups in result"
    return have


def locatebin(bins, n):
    """
    Find the bin where list n has ended up: Follow bin references until
    we find a bin that has not moved.
    """
    while bins[n] != n:
        n = bins[n]
    return n
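As a side note (not part of the answer above), the standard union-find refinement would be to let locatebin also compress the chain it follows, so that later lookups stay short; a sketch:

def locatebin_compressed(bins, n):
    """Variant of locatebin with path compression: point every visited bin at the root."""
    root = n
    while bins[root] != root:
        root = bins[root]
    while bins[n] != root:
        bins[n], n = root, bins[n]  # re-point n at the root, then follow the old link
    return root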





  • This code uses sets to get around the way timeit does repetition, but it works just as well (slightly faster, in fact) if you just append lists to each other and discard duplicates only when have is constructed. So it may have the benefit of being more Fortran-like (I never thought I would say this in a positive sense! :-)
    – alexis
    Feb 26 '12 at 14:25










  • (+1) Interesting analysis, and it's also one of the fastest solution. I've test it among all the others in various situation in my summary :)
    – Rik Poggi
    Feb 27 '12 at 14:05










  • Yet another great approach.
    – Developer
    Feb 27 '12 at 14:36












  • Thanks. It should do better if you don't have zillions of collisions. I actually took out the structure that optimizes merge tracking, because the data I tested on only have a couple of thousand collisions and it didn't seem worth the complexity. It all depends on what your data really looks like, I guess.
    – alexis
    Feb 28 '12 at 0:14


















Answer (3 votes)
This new function only does the minimum necessary number of disjointness tests, something the other similar solutions fail to do. It also uses a deque to avoid as many linear time operations as possible, like list slicing and deletion from early in the list.



from collections import deque

def merge(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results


The less overlap between the sets in a given set of data, the better this will do compared to the other functions.



Here is an example case. If you have 4 sets, you need to compare:




1, 2
1, 3
1, 4
2, 3
2, 4
3, 4


If 1 overlaps with 3, then 2 needs to be re-tested to see if it now overlaps with 1, in order to safely skip testing 2 against 3.



There are two ways to deal with this. The first is to restart the testing of set 1 against the other sets after each overlap and merge. The second is to continue with the testing by comparing 1 with 4, then going back and re-testing. The latter results in fewer disjointness tests, as more merges happen in a single pass, so on the re-test pass, there are fewer sets left to test against.



The problem is to track which sets have to be re-tested. In the above example, 1 needs to be re-tested against 2 but not against 4, since 1 was already in its current state before 4 was tested the first time.



The disjoint counter allows this to be tracked.
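For example (a quick check of the function above; the ordering of the returned groups and of the elements inside each set may differ):

>>> merge([[1, 2], [2, 3], [4, 5]])
[set([4, 5]), set([1, 2, 3])]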





My answer doesn't help with the main problem of finding an improved algorithm for recoding into FORTRAN; it is just what appears to me to be the simplest and most elegant way to implement the algorithm in Python.



According to my testing (or the test in the accepted answer), it's slightly (up to 10%) faster than the next fastest solution.



def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets


No need for the un-Pythonic counters (i, range) or complicated mutation (del, pop, insert) used in the other implementations. It uses only simple iteration, merges overlapping sets in the simplest manner, and builds a single new list on each pass through the data.



My (faster and simpler) version of the test code:



import random

tenk = range(10000)
lsts = [random.sample(tenk, random.randint(0, 500)) for _ in range(2000)]

setup = """
def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

def merge1(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

lsts = """ + repr(lsts)

import timeit
print timeit.timeit("merge0(lsts)", setup=setup, number=10)
print timeit.timeit("merge1(lsts)", setup=setup, number=10)





  • thank you again for pointing out my mistake, i think my code is fine now.
    – ChessMaster
    Feb 23 '12 at 3:42










  • Good to see another method. Thanks. It is elegant.
    – Developer
    Feb 23 '12 at 11:27










  • What's wrong with pop and insert? It's the obvious way to treat a list as a stack, and del just removes elements from a list. Also the next fastest solution happens to be mine, so you should timeit against that. With your test speed I got that mine is still faster, with Niklas test sometimes yours is faster, sometimes it's the other way around, but the difference seems to be very little, nothing that a local declaration can't shadow. Sir, there's no need to brag, if you had questions about my code you could have just asked there instead of mocking it here.
    – Rik Poggi
    Feb 25 '12 at 18:35










  • @RikPoggi pop from the end of a list or deque is constant time, while insert / del from elsewhere in the list is O(n) because all the items after that point in the list have to be moved.
    – agf
    Feb 26 '12 at 0:32










  • Have you actually looked at my code? Or are you just throwing random Big-O? I didn't guess, I've timed and profiled my code, and it also seems that I've done a good job since according to your own test mine is faster than yours. There's one insert(0,i) that's going to cost exactly like an append(i) (and I use insert to avoid to reverse the list later). del doesn't really cost much, and anyway that's how my code works: it does a minimum amount of disjoint check at the price of keeping the result table updated.
    – Rik Poggi
    Feb 26 '12 at 1:08




















Answer (1 vote)
This would be my updated approach:



def merge(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set

    return results


Note: During the merging empty lists will be removed.



Update: Reliability.



You need two tests for 100% reliability:





  • Check that all the resulting sets are mutually disjoint:

    merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]

    from itertools import combinations
    for a, b in combinations(merged, 2):
        if not a.isdisjoint(b):
            raise Exception(a, b)  # just an example

  • Check that the merged sets cover the original data (as suggested by katrielalex); a sketch of that check follows below.



I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.
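A sketch of that second, coverage check (reusing the merged list from the snippet above, with lsts taken from the question; essentially katrielalex's suggestion spelled out):

from itertools import chain

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]   # original data

original_flat = set(chain.from_iterable(lsts))
merged_flat = set(chain.from_iterable(merged))
if original_flat != merged_flat:
    raise Exception(original_flat ^ merged_flat)  # just an example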






  • lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.
    – Developer
    Feb 2 '12 at 15:47










  • If you try the given lists you will see 9 exists in two sets of outputs. Therefore the code suffers from the problem originally mentioned in the question, not reliability!
    – Developer
    Feb 2 '12 at 16:01










  • @Developer: I see.. that's because there's a list that have 2 different numbers each one in common with 2 disjoined sets. I'll take a look at it.
    – Rik Poggi
    Feb 2 '12 at 16:11












  • @Rik: I can't reproduce your timings. Even with my fixed version, the difference is only 10% or so (I added the benchmark to my answer). Please add the test code, otherwise this isn't of much use.
    – Niklas B.
    Feb 2 '12 at 16:15












  • @NiklasBaumstark: I fixed my code and posted my timing code, if it has problem let me know. It also seems that sometimes our solutions are different, because yours doesn't add the empty set, but I'm not sure, maybe is my fault.
    – Rik Poggi
    Feb 2 '12 at 16:46


















Answer (1 vote)
Here's a function (Python 3.1) to check if the result of a merge function is OK. It checks:




  • Are the result sets disjoint? (number of elements of union == sum of numbers of elements)

  • Are the elements of the result sets the same as of the input lists?

  • Is every input list a subset of a result set?

  • Is every result set minimal, i.e. is it impossible to split it into two smaller sets?

  • It does not check if there are empty result sets - I don't know if you want them or not...





from itertools import chain

def check(lsts, result):
    lsts = [set(s) for s in lsts]
    all_items = set(chain(*lsts))
    all_result_items = set(chain(*result))
    num_result_items = sum(len(s) for s in result)
    if num_result_items != len(all_result_items):
        print("Error: result sets overlap!")
        print(num_result_items, len(all_result_items))
        print(sorted(map(len, result)), sorted(map(len, lsts)))
    if all_items != all_result_items:
        print("Error: result doesn't match input lists!")
    if not all(any(set(s).issubset(t) for t in result) for s in lsts):
        print("Error: not all input lists are contained in a result set!")

    seen = set()
    todo = list(filter(bool, lsts))
    done = False
    while not done:
        deletes = []
        for i, s in enumerate(todo):  # intersection with seen, or with unseen result set, is OK
            if not s.isdisjoint(seen) or any(t.isdisjoint(seen) for t in result if not s.isdisjoint(t)):
                seen.update(s)
                deletes.append(i)
        for i in reversed(deletes):
            del todo[i]
        done = not deletes
    if todo:
        print("Error: A result set should be split into two or more parts!")
        print(todo)
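A quick usage sketch with the data from the question (check prints nothing when the result passes all the checks):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
result = [{0, 1, 3, 4, 5, 10, 11, 16}, {2, 8}]
check(lsts, result)   # silent: disjoint, covering, minimal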





  • It would be neat if you could write this in unit test language =)
    – Katriel
    Feb 20 '12 at 16:34


















Answer (1 vote)
lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx as nx
g = nx.Graph()

for sub_list in lists:
    for i in range(1,len(sub_list)):
        g.add_edge(sub_list[0],sub_list[i])

print nx.connected_components(g)
#[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]


Performance:



5000 lists, 5 classes, average size 74, max size 1000
15.2264976415


Performance of merge1:



print timeit.timeit("merge1(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26998780571



So it is 11x slower than the fastest, but the code is much simpler and more readable!







Answer (1 vote)
This is slower than the solution offered by Niklas (I got 3.9s on the test.txt instead of 0.5s for his solution), but yields the same result and might be easier to implement in e.g. Fortran, since it doesn't use sets, only sorting of the total amount of elements and then a single run through all of them.

It returns a list with the ids of the merged lists, so it also keeps track of empty lists; they stay unmerged.

def merge(lsts):
    # this is an index list that stores the joined id for each list
    joined = range(len(lsts))
    # create an ordered list with indices
    indexed_list = sorted((el, index) for index, lst in enumerate(lsts) for el in lst)
    # loop through the ordered list, and if two elements are the same and
    # the lists are not yet joined, alter the list with joined id
    el_0, idx_0 = None, None
    for el, idx in indexed_list:
        if el == el_0 and joined[idx] != joined[idx_0]:
            old = joined[idx]
            rep = joined[idx_0]
            joined = [rep if id == old else id for id in joined]
        el_0, idx_0 = el, idx
    return joined
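For example, with the data from the question this returns group ids rather than merged lists; a small (illustrative) post-processing step turns the ids back into groups:

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
print merge(lsts)   # [0, 0, 2, 0]: lists 0, 1 and 3 ended up in one group, list 2 in another

from collections import defaultdict
groups = defaultdict(set)
for list_id, group_id in enumerate(merge(lsts)):
    groups[group_id].update(lsts[list_id])
print groups.values()   # two merged groups (ordering may vary)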





    • Thanks for sharing your idea. The good point is that you use labeling (i.e., ID) which can be used for further developments.
      – Developer
      Feb 7 '12 at 9:07




















Answer (1 vote)
    Firstly I'm not exactly sure if the benchmarks are fair:



    Adding the following code to the start of my function:



    c = Counter(chain(*lists))
    print c[1]
    "88"


    This means that out of all the values in all the lists, there are only 88 distinct values. Usually in the real world duplicates are rare, and you would expect a lot more distinct values. (of course i don't know where your data from so can't make assumptions).



    Because Duplicates are more common, it means sets are less likely to be disjoint. This means the set.isdisjoint() method will be much faster because only after a few tests it will find that the sets aren't disjoint.



    Having said all that, I do believe the methods presented that use disjoint are the fastest anyway, but i'm just saying, instead of being 20x faster maybe they should only be 10x faster than the other methods with different benchmark testing.



    Anyway, i Thought I would try a slightly different technique to solve this, however the merge sorting was too slow, this method is about 20x slower than the two fastest methods using the benchmarking:



    I thought I would order everything



    import heapq
    from itertools import chain
    def merge6(lists):
    for l in lists:
    l.sort()
    one_list = heapq.merge(*[zip(l,[i]*len(l)) for i,l in enumerate(lists)]) #iterating through one_list takes 25 seconds!!
    previous = one_list.next()

    d = {i:i for i in range(len(lists))}
    for current in one_list:
    if current[0]==previous[0]:
    d[current[1]] = d[previous[1]]
    previous=current

    groups=[ for i in range(len(lists))]
    for k in d:
    groups[d[k]].append(lists[k]) #add a each list to its group

    return [set(chain(*g)) for g in groups if g] #since each subroup in each g is sorted, it would be faster to merge these subgroups removing duplicates along the way.


    lists = [[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]
    print merge6(lists)
    # [set([1, 2, 3, 5, 6]), set([8, 9, 10]), set([11, 12, 13])]



    import timeit
    print timeit.timeit("merge1(lsts)", setup=setup, number=10)
    print timeit.timeit("merge4(lsts)", setup=setup, number=10)
    print timeit.timeit("merge6(lsts)", setup=setup, number=10)
    5000 lists, 5 classes, average size 74, max size 1000
    1.26732238315
    5000 lists, 5 classes, average size 74, max size 1000
    1.16062907437
    5000 lists, 5 classes, average size 74, max size 1000
    30.7257182826






    • Your point is correct for the data being used. The data could have any form of relationship, i.e., 0 to n joined lists. Statistically, however, between 10% and 30% are of interest.
      – Developer
      Feb 24 '12 at 2:11










    • Does not work with this input
      – timdiels
      Dec 23 '15 at 12:15


















    up vote
    1
    down vote













    Just for fun...



    def merge(mylists):
        results, sets = [], [set(lst) for lst in mylists if lst]
        upd, isd, pop = set.update, set.isdisjoint, sets.pop
        while sets:
            if not [upd(sets[0], pop(i)) for i in xrange(len(sets)-1, 0, -1) if not isd(sets[0], sets[i])]:
                results.append(pop(0))
        return results


    and my rewrite of the best answer



    def merge(lsts):
        sets = map(set, lsts)
        results = []
        while sets:
            first, rest = sets[0], sets[1:]
            merged = False
            sets = []
            for s in rest:
                if s and s.isdisjoint(first):
                    sets.append(s)
                else:
                    first |= s
                    merged = True
            if merged: sets.append(first)
            else: results.append(first)
        return results






    • Your algorithm is incorrect. You only ever update the first set. Try merge([[1], [2, 3], [3, 4]]).
      – agf
      Feb 22 '12 at 20:34










    • @agf The algorithm is OK, I think, but I tried to make the code smaller by using update = sets[0].update and isdisjoint = sets[0].isdisjoint, and that does not work well in this case. Thank you.
      – ChessMaster
      Feb 23 '12 at 3:34










    • Yep, that fixes it. The one you added performs terribly though in my tests.
      – agf
      Feb 24 '12 at 8:30


















    up vote
    1
    down vote













    Here's an implementation using a disjoint-set data structure (specifically a disjoint forest), thanks to comingstorm's hint about merging sets which have even one element in common. I'm using path compression for a slight (~5%) speed improvement; it's not entirely necessary (and it prevents find from being tail recursive, which could slow things down). Note that I'm using a dict to represent the disjoint forest; given that the data are ints, an array would also work, although it might not be much faster.



    def merge(data):
        parents = {}
        def find(i):
            j = parents.get(i, i)
            if j == i:
                return i
            k = find(j)
            if k != j:
                parents[i] = k
            return k
        for l in filter(None, data):
            parents.update(dict.fromkeys(map(find, l), find(l[0])))
        merged = {}
        for k, v in parents.items():
            merged.setdefault(find(v), []).append(k)
        return merged.values()


    This approach is comparable to the other best algorithms on Rik's benchmarks.







    • FWIW I seem to find the opposite, that this takes roughly 3-4 times longer. Guess (as noted before) it depends strongly on what sets you're testing it on..
      – DSM
      Aug 22 '12 at 2:02


















    up vote
    0
    down vote













    My solution works well on small lists and is quite readable, with no dependencies.



    def merge_list(starting_list):
        final_list = []
        for i, v in enumerate(starting_list[:-1]):
            if set(v) & set(starting_list[i+1]):
                starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
            else:
                final_list.append(v)
        final_list.append(starting_list[-1])
        return final_list


    Benchmarking it:



    lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]



    %timeit merge_list(lists)



    100000 loops, best of 3: 4.9 µs per loop







    • This fails on this input
      – timdiels
      Dec 23 '15 at 12:21


















    up vote
    0
    down vote













    This can be solved in (nearly) O(n) by using the union-find algorithm. Given the first two rows of your data, the edges to use in the union-find are the following pairs:
    (0,1), (1,3), (1,0), (0,3), (3,4), (4,5), (5,10)
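
    As a rough sketch of that idea (my addition, not part of the original answer; the names are made up), one could union consecutive elements of every list and then group each element by the root of its tree:

    def merge_union_find(lsts):
        parent = {}

        def find(x):
            # walk up to the root of x's tree
            while parent.get(x, x) != x:
                x = parent[x]
            return x

        def union(a, b):
            # hang one root under the other
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        # union consecutive elements of every list: the (0,1), (1,3), ... edges
        for lst in lsts:
            for a, b in zip(lst, lst[1:]):
                union(a, b)

        # group every element by the root of its tree
        groups = {}
        for lst in lsts:
            for x in lst:
                groups.setdefault(find(x), set()).add(x)
        return list(groups.values())

    print merge_union_find([[0, 1, 3], [1, 0, 3, 4, 5, 10], [2, 8], [3, 1, 0, 16]])
    # e.g. [set([0, 1, 3, 4, 5, 10, 16]), set([8, 2])]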






      up vote
      21
      down vote



      accepted










      My attempt:



      def merge(lsts):
          sets = [set(lst) for lst in lsts if lst]
          merged = True
          while merged:
              merged = False
              results = []
              while sets:
                  common, rest = sets[0], sets[1:]
                  sets = []
                  for x in rest:
                      if x.isdisjoint(common):
                          sets.append(x)
                      else:
                          merged = True
                          common |= x
                  results.append(common)
              sets = results
          return sets

      lst = [[65, 17, 5, 30, 79, 56, 48, 62],
             [6, 97, 32, 93, 55, 14, 70, 32],
             [75, 37, 83, 34, 9, 19, 14, 64],
             [43, 71],
             [],
             [89, 49, 1, 30, 28, 3, 63],
             [35, 21, 68, 94, 57, 94, 9, 3],
             [16],
             [29, 9, 97, 43],
             [17, 63, 24]]
      print merge(lst)


      Benchmark:



      import random

      # adapt parameters to your own usage scenario
      class_count = 50
      class_size = 1000
      list_count_per_class = 100
      large_list_sizes = list(range(100, 1000))
      small_list_sizes = list(range(0, 100))
      large_list_probability = 0.5

      if False:  # change to True to generate the test data file (takes a while)
          with open("/tmp/test.txt", "w") as f:
              lists = []
              classes = [
                  range(class_size * i, class_size * (i + 1)) for i in range(class_count)
              ]
              for c in classes:
                  # distribute each class across list_count_per_class lists
                  for i in xrange(list_count_per_class):
                      lst = []
                      if random.random() < large_list_probability:
                          size = random.choice(large_list_sizes)
                      else:
                          size = random.choice(small_list_sizes)
                      nums = set(c)
                      for j in xrange(size):
                          x = random.choice(list(nums))
                          lst.append(x)
                          nums.remove(x)
                      random.shuffle(lst)
                      lists.append(lst)
              random.shuffle(lists)
              for lst in lists:
                  f.write(" ".join(str(x) for x in lst) + "\n")

      setup = """
      # Niklas'
      def merge_niklas(lsts):
          sets = [set(lst) for lst in lsts if lst]
          merged = 1
          while merged:
              merged = 0
              results = []
              while sets:
                  common, rest = sets[0], sets[1:]
                  sets = []
                  for x in rest:
                      if x.isdisjoint(common):
                          sets.append(x)
                      else:
                          merged = 1
                          common |= x
                  results.append(common)
              sets = results
          return sets

      # Rik's
      def merge_rik(data):
          sets = (set(e) for e in data if e)
          results = [next(sets)]
          for e_set in sets:
              to_update = []
              for i, res in enumerate(results):
                  if not e_set.isdisjoint(res):
                      to_update.insert(0, i)

              if not to_update:
                  results.append(e_set)
              else:
                  last = results[to_update.pop(-1)]
                  for i in to_update:
                      last |= results[i]
                      del results[i]
                  last |= e_set
          return results

      # katrielalex's
      def pairs(lst):
          i = iter(lst)
          first = prev = item = i.next()
          for item in i:
              yield prev, item
              prev = item
          yield item, first

      import networkx

      def merge_katrielalex(lsts):
          g = networkx.Graph()
          for lst in lsts:
              for edge in pairs(lst):
                  g.add_edge(*edge)
          return networkx.connected_components(g)

      # agf's (optimized)
      from collections import deque

      def merge_agf_optimized(lists):
          sets = deque(set(lst) for lst in lists if lst)
          results = []
          disjoint = 0
          current = sets.pop()
          while True:
              merged = False
              newsets = deque()
              for _ in xrange(disjoint, len(sets)):
                  this = sets.pop()
                  if not current.isdisjoint(this):
                      current.update(this)
                      merged = True
                      disjoint = 0
                  else:
                      newsets.append(this)
                      disjoint += 1
              if sets:
                  newsets.extendleft(sets)
              if not merged:
                  results.append(current)
                  try:
                      current = newsets.pop()
                  except IndexError:
                      break
                  disjoint = 0
              sets = newsets
          return results

      # agf's (simple)
      def merge_agf_simple(lists):
          newsets, sets = [set(lst) for lst in lists if lst], []
          while len(sets) != len(newsets):
              sets, newsets = newsets, []
              for aset in sets:
                  for eachset in newsets:
                      if not aset.isdisjoint(eachset):
                          eachset.update(aset)
                          break
                  else:
                      newsets.append(aset)
          return newsets

      # alexis'
      def merge_alexis(data):
          bins = range(len(data))  # Initialize each bin[n] == n
          nums = dict()

          data = [set(m) for m in data]  # Convert to sets
          for r, row in enumerate(data):
              for num in row:
                  if num not in nums:
                      # New number: tag it with a pointer to this row's bin
                      nums[num] = r
                      continue
                  else:
                      dest = locatebin(bins, nums[num])
                      if dest == r:
                          continue  # already in the same bin

                      if dest > r:
                          dest, r = r, dest  # always merge into the smallest bin

                      data[dest].update(data[r])
                      data[r] = None
                      # Update our indices to reflect the move
                      bins[r] = dest
                      r = dest

          # Filter out the empty bins
          have = [m for m in data if m]
          return have

      def locatebin(bins, n):
          while bins[n] != n:
              n = bins[n]
          return n

      lsts = []
      size = 0
      num = 0
      max = 0
      for line in open("/tmp/test.txt", "r"):
          lst = [int(x) for x in line.split()]
          size += len(lst)
          if len(lst) > max:
              max = len(lst)
          num += 1
          lsts.append(lst)
      """

      setup += """
      print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
      """.format(class_count=class_count)

      import timeit
      print "niklas"
      print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
      print "rik"
      print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
      print "katrielalex"
      print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
      print "agf (1)"
      print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
      print "agf (2)"
      print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
      print "alexis"
      print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)


      These timings are obviously dependent on the specific parameters of the benchmark, like the number of classes, number of lists, list size, etc. Adapt those parameters to your needs to get more helpful results.



      Below are some example outputs on my machine for different parameters. They show that all the algorithms have their strengths and weaknesses, depending on the kind of input they get:



      =====================
      # many disjoint classes, large lists
      class_count = 50
      class_size = 1000
      list_count_per_class = 100
      large_list_sizes = list(range(100, 1000))
      small_list_sizes = list(range(0, 100))
      large_list_probability = 0.5
      =====================

      niklas
      5000 lists, 50 equally distributed classes, average size 298, max size 999
      4.80084705353
      rik
      5000 lists, 50 equally distributed classes, average size 298, max size 999
      9.49251699448
      katrielalex
      5000 lists, 50 equally distributed classes, average size 298, max size 999
      21.5317108631
      agf (1)
      5000 lists, 50 equally distributed classes, average size 298, max size 999
      8.61671280861
      agf (2)
      5000 lists, 50 equally distributed classes, average size 298, max size 999
      5.18117713928
      => alexis
      => 5000 lists, 50 equally distributed classes, average size 298, max size 999
      => 3.73504281044

      ===================
      # less number of classes, large lists
      class_count = 15
      class_size = 1000
      list_count_per_class = 300
      large_list_sizes = list(range(100, 1000))
      small_list_sizes = list(range(0, 100))
      large_list_probability = 0.5
      ===================

      niklas
      4500 lists, 15 equally distributed classes, average size 296, max size 999
      1.79993700981
      rik
      4500 lists, 15 equally distributed classes, average size 296, max size 999
      2.58237695694
      katrielalex
      4500 lists, 15 equally distributed classes, average size 296, max size 999
      19.5465381145
      agf (1)
      4500 lists, 15 equally distributed classes, average size 296, max size 999
      2.75445604324
      => agf (2)
      => 4500 lists, 15 equally distributed classes, average size 296, max size 999
      => 1.77850699425
      alexis
      4500 lists, 15 equally distributed classes, average size 296, max size 999
      3.23530197144

      ===================
      # less number of classes, smaller lists
      class_count = 15
      class_size = 1000
      list_count_per_class = 300
      large_list_sizes = list(range(100, 1000))
      small_list_sizes = list(range(0, 100))
      large_list_probability = 0.1
      ===================

      niklas
      4500 lists, 15 equally distributed classes, average size 95, max size 997
      0.773697137833
      rik
      4500 lists, 15 equally distributed classes, average size 95, max size 997
      1.0523750782
      katrielalex
      4500 lists, 15 equally distributed classes, average size 95, max size 997
      6.04466891289
      agf (1)
      4500 lists, 15 equally distributed classes, average size 95, max size 997
      1.20285701752
      => agf (2)
      => 4500 lists, 15 equally distributed classes, average size 95, max size 997
      => 0.714507102966
      alexis
      4500 lists, 15 equally distributed classes, average size 95, max size 997
      1.1286110878






      • 1




        You could use not x.isdisjoint(common) instead of x & common to avoid building the full intersection.
        – Janne Karila
        Feb 2 '12 at 12:57










      • Janne Karila: Thanks for the information! I changed this.
        – Niklas B.
        Feb 2 '12 at 12:59












      • lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.
        – Developer
        Feb 2 '12 at 15:45












      • @Developer: Yeah, you are right. My thinking was faulty, disjointness is not an equivalence relation!
        – Niklas B.
        Feb 2 '12 at 15:59








      • 2




        This is such a fun problem, anyway :-) Thanks for providing the test setup.
        – alexis
        Feb 26 '12 at 14:30




















      up vote
      13
      down vote



      +50










      I tried to summarize everything that's been said and done about this topic, in this question and in the duplicate one.



      I tried to test and time every solution (all the code here).



      Testing



      This is the TestCase from the testing module:



      import json
      import unittest
      from copy import deepcopy

      class MergeTestCase(unittest.TestCase):

          def setUp(self):
              # self.merge_func is set by the test harness for each solution
              with open('./lists/test_list.txt') as f:
                  self.lsts = json.loads(f.read())
              self.merged = self.merge_func(deepcopy(self.lsts))

          def test_disjoint(self):
              """Check disjoint-ness of merged results"""
              from itertools import combinations
              for a, b in combinations(self.merged, 2):
                  self.assertTrue(a.isdisjoint(b))

          def test_coverage(self):    # Credit to katrielalex
              """Check coverage of the original data"""
              merged_flat = set()
              for s in self.merged:
                  merged_flat |= s

              original_flat = set()
              for lst in self.lsts:
                  original_flat |= set(lst)

              self.assertTrue(merged_flat == original_flat)

          def test_subset(self):    # Credit to WolframH
              """Check that every original list is a subset of some result"""
              for lst in self.lsts:
                  self.assertTrue(any(set(lst) <= e for e in self.merged))

      This test assumes a list of sets as the result, so I couldn't test a couple of solutions that worked with lists.



      I couldn't test the following:



      katrielalex
      steabert


      Among the ones I could test, two failed:



        -- Going to test: agf (optimized) --
      Check disjoint-ness of merged results ... FAIL

      -- Going to test: robert king --
      Check disjoint-ness of merged results ... FAIL


      Timing



      The performance is strongly related to the test data employed.



      So far three answers have tried to time their own and the other solutions. Since they used different test data, they got different results.





      1. Niklas' benchmark is very tweakable. With his benchmark one could run different tests by changing some parameters.



        I've used the same three sets of parameters he used in his own answer, and I put them in three different files:



        filename = './lists/timing_1.txt'
        class_count = 50,
        class_size = 1000,
        list_count_per_class = 100,
        large_list_sizes = (100, 1000),
        small_list_sizes = (0, 100),
        large_list_probability = 0.5,

        filename = './lists/timing_2.txt'
        class_count = 15,
        class_size = 1000,
        list_count_per_class = 300,
        large_list_sizes = (100, 1000),
        small_list_sizes = (0, 100),
        large_list_probability = 0.5,

        filename = './lists/timing_3.txt'
        class_count = 15,
        class_size = 1000,
        list_count_per_class = 300,
        large_list_sizes = (100, 1000),
        small_list_sizes = (0, 100),
        large_list_probability = 0.1,


        These are the results that I got:



        From file: timing_1.txt



        Timing with: >> Niklas << Benchmark
        Info: 5000 lists, average size 305, max size 999

        Timing Results:
        10.434 -- alexis
        11.476 -- agf
        11.555 -- Niklas B.
        13.622 -- Rik. Poggi
        14.016 -- agf (optimized)
        14.057 -- ChessMaster
        20.208 -- katrielalex
        21.697 -- steabert
        25.101 -- robert king
        76.870 -- Sven Marnach
        133.399 -- hochl


        From file: timing_2.txt



        Timing with: >> Niklas << Benchmark
        Info: 4500 lists, average size 305, max size 999

        Timing Results:
        8.247 -- Niklas B.
        8.286 -- agf
        8.637 -- Rik. Poggi
        8.967 -- alexis
        9.090 -- ChessMaster
        9.091 -- agf (optimized)
        18.186 -- katrielalex
        19.543 -- steabert
        22.852 -- robert king
        70.486 -- Sven Marnach
        104.405 -- hochl


        From file: timing_3.txt



        Timing with: >> Niklas << Benchmark
        Info: 4500 lists, average size 98, max size 999

        Timing Results:
        2.746 -- agf
        2.850 -- Niklas B.
        2.887 -- Rik. Poggi
        2.972 -- alexis
        3.077 -- ChessMaster
        3.174 -- agf (optimized)
        5.811 -- katrielalex
        7.208 -- robert king
        9.193 -- steabert
        23.536 -- Sven Marnach
        37.436 -- hochl



      2. With Sven's testing data I got the following results:



        Timing with: >> Sven << Benchmark
        Info: 200 lists, average size 10, max size 10

        Timing Results:
        2.053 -- alexis
        2.199 -- ChessMaster
        2.410 -- agf (optimized)
        3.394 -- agf
        3.398 -- Rik. Poggi
        3.640 -- robert king
        3.719 -- steabert
        3.776 -- Niklas B.
        3.888 -- hochl
        4.610 -- Sven Marnach
        5.018 -- katrielalex



      3. And finally with Agf's benchmark I got:



        Timing with: >> Agf << Benchmark
        Info: 2000 lists, average size 246, max size 500

        Timing Results:
        3.446 -- Rik. Poggi
        3.500 -- ChessMaster
        3.520 -- agf (optimized)
        3.527 -- Niklas B.
        3.527 -- agf
        3.902 -- hochl
        5.080 -- alexis
        15.997 -- steabert
        16.422 -- katrielalex
        18.317 -- robert king
        1257.152 -- Sven Marnach



As I said at the beginning, all the code is available at this git repository. All the merging functions are in a file called core.py; every function there whose name ends with _merge will be auto-loaded during the tests, so it shouldn't be hard to add/test/improve your own solution.
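For illustration, here is a minimal sketch of a function that follows that convention: its name ends with _merge, it takes a list of lists, and it returns a list of disjoint sets, which is what the tests above expect. It is only an example of the expected interface, not one of the benchmarked solutions:

def example_merge(lsts):
    # Repeatedly fold together any sets that share an element,
    # until no further merges happen.
    sets = [set(lst) for lst in lsts]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for s in rest:
                if common & s:        # at least one shared element
                    common |= s
                    merged = True
                else:
                    sets.append(s)
            results.append(common)
        sets = results
    return sets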



Let me also know if there's something wrong; it's been a lot of coding and I could use a couple of fresh eyes :)






      share|improve this answer























• How about my rewrite of Niklas B.'s answer? Just wondering if the timings of that are relevant.
        – ChessMaster
        Feb 26 '12 at 20:59










• @ChessMaster: It sometimes does slightly better, other times slightly worse; that's why I didn't put it among the results. If you're interested you can try it yourself: at the link there's a git repository with all the merge functions in a file called core.py. Every function there with a name ending in _merge will be auto-loaded. I just pushed yours, so you'll find it already there in skip mode. :)
        – Rik Poggi
        Feb 26 '12 at 22:38










      • Thanks for your great effort.
        – Developer
        Feb 27 '12 at 14:29






      • 2




        Sometimes I'm really surprised how much high-quality effort and knowledge is put into the answers on this site. Really nice job putting this compilation together!
        – Niklas B.
        Feb 27 '12 at 20:33












      • Where is the answer by niklas-b? None of the 14 answers on this page are by niklas-b...
        – tommy.carstensen
        Apr 28 '15 at 18:36















answered Feb 26 '12 at 16:38 by Rik Poggi


      up vote
      7
      down vote













      Using Matrix Manipulations



      Let me preface this answer with the following comment:



      THIS IS THE WRONG WAY TO DO THIS. IT IS PRONE TO NUMERICAL INSTABILITY AND IS MUCH SLOWER THAN THE OTHER METHODS PRESENTED, USE AT YOUR OWN RISK.



That being said, I couldn't resist solving the problem from a dynamical point of view (and I hope you'll get a fresh perspective on the problem). In theory this should work all the time, but eigenvalue calculations can often fail. The idea is to think of your list as a flow from rows to columns. If two rows share a common value there is a connecting flow between them. If we were to think of these flows as water, we would see that the flows cluster into little pools when there is a connecting path between them. For simplicity, I'm going to use a smaller set, though it works with your data set as well:



      from numpy import where, newaxis
      from scipy import linalg, array, zeros

      X = [[0,1,3],[2],[3,1]]


      We need to convert the data into a flow graph. If row i flows into value j we put it in the matrix. Here we have 3 rows and 4 unique values:



A = zeros((4,len(X)), dtype=float)
for i,row in enumerate(X):
    for val in row: A[val,i] = 1
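(A detail that also comes up in the comments below: if the largest value isn't known in advance, the matrix height can be derived from the data instead of hard-coding it. A small sketch of that alternative, assuming non-negative integer entries as in the example:)

n_values = 1 + max(max(row) for row in X if row)   # values run 0..max
A = zeros((n_values, len(X)), dtype=float)
for i, row in enumerate(X):
    for val in row:
        A[val, i] = 1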


In general, you'll need to change the 4 to capture the number of unique values you have. If the data is a list of integers starting from 0, as we have here, you can simply make this the largest number plus one. We now perform an eigenvalue decomposition, an SVD to be exact, since our matrix is not square.



      S  = linalg.svd(A)


We want to keep only the 3x3 portion of this result, since it represents the flow of the pools. In fact we only want the absolute values of this matrix; we only care whether there is a flow in this cluster space.



      M  = abs(S[2])


We can think of this matrix M as a Markov matrix and make that explicit by row-normalizing it. Once we have this, we compute the (left) eigenvalue decomposition of the matrix.



      M /=  M.sum(axis=1)[:,newaxis]
      U,V = linalg.eig(M,left=True, right=False)
      V = abs(V)


Now a disconnected (non-ergodic) Markov matrix has the nice property that, for each non-connected cluster, there is an eigenvalue of unity. The eigenvectors associated with these unity values are the ones we want:



      idx = where(U > .999)[0]
      C = V.T[idx] > 0


      I have to use .999 due to the aforementioned numerical instability. At this point, we are done! Each independent cluster can now pull the corresponding rows out:



for cluster in C:
    print where(A[:,cluster].sum(axis=1))[0]


      Which gives, as intended:



      [0 1 3]
      [2]


      Change X to your lst and you'll get: [ 0 1 3 4 5 10 11 16] [2 8].





      Addendum



Why might this be useful? I don't know where your underlying data comes from, but what happens when the connections are not absolute? Say row 1 has entry 3 only 80% of the time; how would you generalize the problem? The flow method above would work just fine, and would be completely parametrized by that .999 value: the further away from unity it is, the looser the association.
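For instance, a fuzzy membership could be encoded directly in the flow matrix; a hypothetical sketch of the idea (not part of the code above):

A[3, 1] = 0.8              # row 1 contains the value 3 only 80% of the time
# ...rebuild M and redo the eigenvalue decomposition exactly as above...
idx = where(U > 0.9)[0]    # a looser cutoff than .999 tolerates weak links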





      Visual Representation



Since a picture is worth 1K words, here are the plots of the matrices A and V for my example and your lst respectively. Notice how V splits into two clusters (it is a block-diagonal matrix with two blocks after permutation), since for each example there are only two merged groups!



(figures: "My Example" and "Your sample data")





      Faster Implementation



In hindsight, I realized that you can skip the SVD step and compute only a single decomposition:



from numpy import dot

M = dot(A.T,A)
M /= M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)


The advantage of this method (besides speed) is that M is now symmetric, so the computation can be faster and more accurate (no imaginary values to worry about).
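Presumably the rest of the pipeline stays exactly the same after this faster decomposition, i.e. something like:

V = abs(V)
idx = where(U > .999)[0]
C = V.T[idx] > 0
for cluster in C:
    print where(A[:,cluster].sum(axis=1))[0]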






      share|improve this answer























• Thanks for the answer. It requires a bit of deeper reading, so I am interested in testing it and seeing what happens.
        – Developer
        Feb 3 '12 at 2:37










      • How did you generate these really nice images?
        – Zachary Young
        Feb 3 '12 at 2:53






      • 1




        @ZacharyYoung check MatPlotLib for graphing, see Gallery.
        – Developer
        Feb 3 '12 at 3:20






      • 1




        @Developer: Hah, that's great! I've seen tons of SO questions about MatPlotLib but never bothered to check it out. Looks like I just found my new favorite SO tag :)
        – Zachary Young
        Feb 3 '12 at 5:05






      • 1




At the stage A = zeros((4,len(X)), dtype=float) we then need to know how many unique values exist in the entire megalist! For the example given it was '4'; what about many unknown large lists of lists? How do we find the unique values among them?
        – Developer
        Feb 3 '12 at 5:49















edited Feb 3 '12 at 14:36; answered Feb 2 '12 at 16:01 by Hooked

      up vote
      5
      down vote













EDIT: OK, the other question has been closed, so I'm posting here.



      Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.



def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first          # close the cycle back to the first element

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx
g = networkx.Graph()
for sub_list in lists:
    for edge in pairs(sub_list):
        g.add_edge(*edge)

networkx.connected_components(g)
[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]




      Explanation



      We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.



      Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.
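As robert king suggests in the comments below, the edge-adding loop can be made even shorter by connecting every element of a sub-list to that sub-list's first element instead of chaining adjacent pairs. A rough sketch of that variant (note that single-element lists then need their lone node added explicitly):

import networkx
g = networkx.Graph()
for sub_list in lists:
    if sub_list:                            # skip empty sub-lists
        g.add_node(sub_list[0])             # keep single-element lists in the graph
        for other in sub_list[1:]:
            g.add_edge(sub_list[0], other)  # star from the first element

networkx.connected_components(g)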






      share|improve this answer



















      • 2




Actually I was under the impression that this is already a solved problem, so I don't see much sense in reviving this old thread. However, because I also played with the idea of using a graph library for that, I integrated this solution into my benchmark. It doesn't seem to compete too well, unfortunately; it looks like the Python hackers did a nice job at implementing and optimizing sets :)
        – Niklas B.
        Feb 20 '12 at 0:12












• Thanks for your different approach. It is really good to bring networkx to work for this purpose. It is a very good package.
        – Developer
        Feb 21 '12 at 11:11










      • @NiklasB. cheers =) when I was looking at the question it seemed that nobody had actually written a working solution, but I guess I read it wrong.
        – Katriel
        Feb 21 '12 at 11:25










• I love networkx. By the way, this could have been simpler if, for each list, you just added an edge between the first element and all the other elements in the list. It could have been about 4 lines of code.
        – robert king
        Feb 21 '12 at 12:38










      • @robertking feel free to edit =)... I just wrote what came to mind!
        – Katriel
        Feb 21 '12 at 14:19

















      up vote
      5
      down vote













      EDIT: OK, the other questions has been closed, posting here.



      Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.



      def pairs(lst):
      i = iter(lst)
      first = prev = item = i.next()
      for item in i:
      yield prev, item
      prev = item
      yield item, first

      lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

      import networkx
      g = networkx.Graph()
      for sub_list in lists:
      for edge in pairs(sub_list):
      g.add_edge(*edge)

      networkx.connected_components(g)
      [[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]




      Explanation



      We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.



      Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.






      share|improve this answer



















      • 2




        Actually I was under the impression that this is already a solved problem, so I don't see much sense in reviving this old thread. However, because I also played with the idea of using a graph library for that, I integrated this solution into my benchmark. It doesn't seem to compete too well, unfortunately, looks like the Python hackers did a nice job at implementing and optimizing sets :)
        – Niklas B.
        Feb 20 '12 at 0:12












      • Thanks for your different approach. It is really good to bring networkx to work for this purpose. It is very good package.
        – Developer
        Feb 21 '12 at 11:11










      • @NiklasB. cheers =) when I was looking at the question it seemed that nobody had actually written a working solution, but I guess I read it wrong.
        – Katriel
        Feb 21 '12 at 11:25










      • I love networkx. btw.. this could have been more simple if for each list you just added an edge between the first element in the list and all the other elements in the list. It could have been about 4 lines of code.
        – robert king
        Feb 21 '12 at 12:38










      • @robertking feel free to edit =)... I just wrote what came to mind!
        – Katriel
        Feb 21 '12 at 14:19















      up vote
      5
      down vote










      up vote
      5
      down vote









      EDIT: OK, the other questions has been closed, posting here.



      Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.



      def pairs(lst):
      i = iter(lst)
      first = prev = item = i.next()
      for item in i:
      yield prev, item
      prev = item
      yield item, first

      lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

      import networkx
      g = networkx.Graph()
      for sub_list in lists:
      for edge in pairs(sub_list):
      g.add_edge(*edge)

      networkx.connected_components(g)
      [[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]




      Explanation



      We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.



      Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.






      share|improve this answer














      EDIT: OK, the other questions has been closed, posting here.



      Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.



      def pairs(lst):
      i = iter(lst)
      first = prev = item = i.next()
      for item in i:
      yield prev, item
      prev = item
      yield item, first

      lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

      import networkx
      g = networkx.Graph()
      for sub_list in lists:
      for edge in pairs(sub_list):
      g.add_edge(*edge)

      networkx.connected_components(g)
      [[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]




      Explanation



      We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.



      Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited May 23 '17 at 11:53









      Community

      11




      11










      answered Feb 19 '12 at 22:46









      Katriel

      83.1k15102147




      83.1k15102147








      • 2




        Actually I was under the impression that this is already a solved problem, so I don't see much sense in reviving this old thread. However, because I also played with the idea of using a graph library for that, I integrated this solution into my benchmark. It doesn't seem to compete too well, unfortunately, looks like the Python hackers did a nice job at implementing and optimizing sets :)
        – Niklas B.
        Feb 20 '12 at 0:12












      • Thanks for your different approach. It is really good to bring networkx to work for this purpose. It is a very good package.
        – Developer
        Feb 21 '12 at 11:11










      • @NiklasB. cheers =) when I was looking at the question it seemed that nobody had actually written a working solution, but I guess I read it wrong.
        – Katriel
        Feb 21 '12 at 11:25










      • I love networkx. btw.. this could have been more simple if for each list you just added an edge between the first element in the list and all the other elements in the list. It could have been about 4 lines of code.
        – robert king
        Feb 21 '12 at 12:38










      • @robertking feel free to edit =)... I just wrote what came to mind!
        – Katriel
        Feb 21 '12 at 14:19
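
Following robert king's comment, a shorter variant simply connects each sub-list's first element to all of its other elements (one "star" per sub-list). The sketch below is mine, not part of the answer; the function name merge_via_star_edges is purely illustrative, and note that in newer networkx releases connected_components yields sets of nodes rather than returning a list of lists.

import networkx

def merge_via_star_edges(lists):
    g = networkx.Graph()
    for sub_list in lists:
        if not sub_list:
            continue
        g.add_node(sub_list[0])  # keep single-element lists as their own component
        for other in sub_list[1:]:
            g.add_edge(sub_list[0], other)  # connectedness only needs a "star", not all pairs
    # connected_components yields sets of nodes in networkx >= 2.0
    return [sorted(component) for component in networkx.connected_components(g)]

print(merge_via_star_edges([[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]))
# -> [[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]  (component order may vary)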




























      up vote
      4
      down vote













      Here's my answer. I haven't checked it against today's batch of answers.



The intersection-based algorithms are O(N^2) since they check each new set against all the existing ones, so I used an approach that indexes each number and runs in close to O(N) time (if we accept that dictionary lookups are O(1)). Then I ran the benchmarks and felt like a complete idiot because it ran slower, but on closer inspection it turned out that the test data ends up with only a handful of distinct result sets, so the quadratic algorithms don't have a lot of work to do. Test it with more than 10-15 distinct bins and my algorithm is much faster. Try test data with more than 50 distinct bins, and it is enormously faster.



      (Edit: There was also a problem with the way the benchmark is run, but I was wrong in my diagnosis. I altered my code to work with the way the repeated tests are run).



def mergelists5(data):
    """Check each number in our arrays only once, merging when we find
    a number we have seen before.
    """

    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin

                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin

                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest

    # Filter out the empty bins
    have = [m for m in data if m]
    print len(have), "groups in result"
    return have


def locatebin(bins, n):
    """
    Find the bin where list n has ended up: Follow bin references until
    we find a bin that has not moved.
    """
    while bins[n] != n:
        n = bins[n]
    return n
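
For reference, here is a quick run of mergelists5 on the lists from the question (my own usage sketch, assuming the Python 2 code above, where range returns a list):

lsts = [[0, 1, 3],
        [1, 0, 3, 4, 5, 10, 11],
        [2, 8],
        [3, 1, 0, 16]]

groups = mergelists5(lsts)   # prints "2 groups in result"
print(groups)
# -> [set([0, 1, 3, 4, 5, 10, 11, 16]), set([8, 2])]  (element order within sets may vary)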





answered Feb 26 '12 at 12:56 – alexis (edited Feb 26 '12 at 13:50)

      • This code uses sets to get around the way timeit does repetition, but it works just as well (slightly faster, in fact) if you just append lists to each other and discard duplicates only when have is constructed. So it may have the benefit of being more Fortran-like (I never thought I would say this in a positive sense! :-)
        – alexis
        Feb 26 '12 at 14:25










      • (+1) Interesting analysis, and it's also one of the fastest solutions. I've tested it among all the others in various situations in my summary :)
        – Rik Poggi
        Feb 27 '12 at 14:05










      • Yet another great approach.
        – Developer
        Feb 27 '12 at 14:36












      • Thanks. It should do better if you don't have zillions of collisions. I actually took out the structure that optimizes merge tracking, because the data I tested on only have a couple of thousand collisions and it didn't seem worth the complexity. It all depends on what your data really looks like, I guess.
        – alexis
        Feb 28 '12 at 0:14
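
The merge-tracking optimization that alexis mentions removing in the last comment is, as far as I can tell, path compression on the bins list; a sketch of locatebin with compression added (an assumption on my part, not code from the answer):

def locatebin_compressed(bins, n):
    # First pass: follow the references to find the root bin
    root = n
    while bins[root] != root:
        root = bins[root]
    # Second pass: point every bin on the path directly at the root,
    # so later lookups finish in one step (path compression)
    while bins[n] != root:
        bins[n], n = root, bins[n]
    return root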

























      up vote
      3
      down vote













      This new function only does the minimum necessary number of disjointness tests, something the other similar solutions fail to do. It also uses a deque to avoid as many linear time operations as possible, like list slicing and deletion from early in the list.



from collections import deque

def merge(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results


      The less overlap between the sets in a given set of data, the better this will do compared to the other functions.



      Here is an example case. If you have 4 sets, you need to compare:




      1, 2
      1, 3
      1, 4
      2, 3
      2, 4
      3, 4


      If 1 overlaps with 3, then 2 needs to be re-tested to see if it now overlaps with 1, in order to safely skip testing 2 against 3.



      There are two ways to deal with this. The first is to restart the testing of set 1 against the other sets after each overlap and merge. The second is to continue with the testing by comparing 1 with 4, then going back and re-testing. The latter results in fewer disjointness tests, as more merges happen in a single pass, so on the re-test pass, there are fewer sets left to test against.



      The problem is to track which sets have to be re-tested. In the above example, 1 needs to be re-tested against 2 but not against 4, since 1 was already in its current state before 4 was tested the first time.



      The disjoint counter allows this to be tracked.
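
As a concrete check, here are the four lists from the question run through the deque-based merge above (my own usage sketch, not part of the answer):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
print(merge(lsts))
# -> [set([0, 1, 3, 4, 5, 10, 11, 16]), set([8, 2])]  (element order within sets may vary)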





      My answer doesn't help with the main problem of finding an improved algorithm for recoding into FORTRAN; it is just what appears to me to be the simplest and most elegant way to implement the algorithm in Python.



      According to my testing (or the test in the accepted answer), it's slightly (up to 10%) faster than the next fastest solution.



def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets


      No need for the un-Pythonic counters (i, range) or complicated mutation (del, pop, insert) used in the other implementations. It uses only simple iteration, merges overlapping sets in the simplest manner, and builds a single new list on each pass through the data.
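
merge0 leans on Python's for/else: the else branch runs only when the inner loop completes without hitting break, i.e. when aset is disjoint from every set collected so far, in which case it starts a new group. A minimal illustration of the idiom (my own sketch, not from the answer):

for eachset in [{1, 2}, {3, 4}]:
    if not {5, 6}.isdisjoint(eachset):
        break
else:
    print("disjoint from all -> would be appended as a new group")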



      My (faster and simpler) version of the test code:



import random

tenk = range(10000)
lsts = [random.sample(tenk, random.randint(0, 500)) for _ in range(2000)]

setup = """
def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

def merge1(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

lsts = """ + repr(lsts)

import timeit
print timeit.timeit("merge0(lsts)", setup=setup, number=10)
print timeit.timeit("merge1(lsts)", setup=setup, number=10)





answered Feb 22 '12 at 18:14 – agf (edited Feb 25 '12 at 13:44)























      • Thank you again for pointing out my mistake, I think my code is fine now.
        – ChessMaster
        Feb 23 '12 at 3:42










      • Good to see another method. Thanks. It is elegant.
        – Developer
        Feb 23 '12 at 11:27










      • What's wrong with pop and insert? It's the obvious way to treat a list as a stack, and del just removes elements from a list. Also the next fastest solution happens to be mine, so you should timeit against that. With your test speed I got that mine is still faster, with Niklas test sometimes yours is faster, sometimes it's the other way around, but the difference seems to be very little, nothing that a local declaration can't shadow. Sir, there's no need to brag, if you had questions about my code you could have just asked there instead of mocking it here.
        – Rik Poggi
        Feb 25 '12 at 18:35










      • @RikPoggi pop from the end of a list or deque is constant time, while insert / del from elsewhere in the list is O(n) because all the items after that point in the list have to be moved.
        – agf
        Feb 26 '12 at 0:32










      • Have you actually looked at my code? Or are you just throwing random Big-O? I didn't guess, I've timed and profiled my code, and it also seems that I've done a good job since according to your own test mine is faster than yours. There's one insert(0,i) that's going to cost exactly like an append(i) (and I use insert to avoid to reverse the list later). del doesn't really cost much, and anyway that's how my code works: it does a minimum amount of disjoint check at the price of keeping the result table updated.
        – Rik Poggi
        Feb 26 '12 at 1:08





























      up vote
      1
      down vote













      This would be my updated approach:



def merge(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set

    return results


Note: empty lists are removed during the merge.



      Update: Reliability.



You need two tests for 100% reliability:





      • Check that all the resulting sets are mutually disjoint:

        merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]

        from itertools import combinations
        for a, b in combinations(merged, 2):
            if not a.isdisjoint(b):
                raise Exception(a, b)  # just an example

      • Check that the merged sets cover the original data (as suggested by katrielalex); a sketch of such a check is given below.
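
A sketch of that second check (my own, not from the answer): collect every element of the original input and verify that the union of the merged sets is exactly the same collection.

def covers(original_data, merged_sets):
    # Every input element must appear somewhere in the merged output
    # (disjointness between the output sets is the first check's job).
    original = set()
    for lst in original_data:
        original.update(lst)
    combined = set()
    for s in merged_sets:
        combined |= s
    return combined == original

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
assert covers(lsts, merge(lsts))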



      I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.






answered Feb 2 '12 at 14:55 – Rik Poggi (edited Feb 19 '12 at 23:38)























      • lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.
        – Developer
        Feb 2 '12 at 15:47










      • If you try the given lists you will see 9 exists in two sets of outputs. Therefore the code suffers from the problem originally mentioned in the question, not reliability!
        – Developer
        Feb 2 '12 at 16:01










      • @Developer: I see.. that's because there's a list that have 2 different numbers each one in common with 2 disjoined sets. I'll take a look at it.
        – Rik Poggi
        Feb 2 '12 at 16:11












      • @Rik: I can't reproduce your timings. Even with my fixed version, the difference is only 10% or so (I added the benchmark to my answer). Please add the test code, otherwise this isn't of much use.
        – Niklas B.
        Feb 2 '12 at 16:15












      • @NiklasBaumstark: I fixed my code and posted my timing code, if it has problem let me know. It also seems that sometimes our solutions are different, because yours doesn't add the empty set, but I'm not sure, maybe is my fault.
        – Rik Poggi
        Feb 2 '12 at 16:46















      up vote
      1
      down vote













      This would be my updated approach:



      def merge(data):
      sets = (set(e) for e in data if e)
      results = [next(sets)]
      for e_set in sets:
      to_update =
      for i,res in enumerate(results):
      if not e_set.isdisjoint(res):
      to_update.insert(0,i)

      if not to_update:
      results.append(e_set)
      else:
      last = results[to_update.pop(-1)]
      for i in to_update:
      last |= results[i]
      del results[i]
      last |= e_set

      return results


      Note: During the merging empty lists will be removed.



      Update: Reliability.



      You need two tests for a 100% reliabilty of success:





      • Check that all the resulting sets are mutually disjointed:



        merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]

        from itertools import combinations
        for a,b in combinations(merged,2):
        if not a.isdisjoint(b):
        raise Exception(a,b) # just an example


      • Check that the merged set cover the original data. (as suggested by katrielalex)



      I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.






      share|improve this answer























      • lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], , [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.
        – Developer
        Feb 2 '12 at 15:47










      • If you try the given lists you will see 9 exists in two sets of outputs. Therefore the code suffers from the problem originally mentioned in the question, not reliability!
        – Developer
        Feb 2 '12 at 16:01










      • @Developer: I see.. that's because there's a list that have 2 different numbers each one in common with 2 disjoined sets. I'll take a look at it.
        – Rik Poggi
        Feb 2 '12 at 16:11












      • @Rik: I can't reproduce your timings. Even with my fixed version, the difference is only 10% or so (I added the benchmark to my answer). Please add the test code, otherwise this isn't of much use.
        – Niklas B.
        Feb 2 '12 at 16:15












      • @NiklasBaumstark: I fixed my code and posted my timing code, if it has problem let me know. It also seems that sometimes our solutions are different, because yours doesn't add the empty set, but I'm not sure, maybe is my fault.
        – Rik Poggi
        Feb 2 '12 at 16:46













      up vote
      1
      down vote










      up vote
      1
      down vote









      This would be my updated approach:



      def merge(data):
      sets = (set(e) for e in data if e)
      results = [next(sets)]
      for e_set in sets:
      to_update =
      for i,res in enumerate(results):
      if not e_set.isdisjoint(res):
      to_update.insert(0,i)

      if not to_update:
      results.append(e_set)
      else:
      last = results[to_update.pop(-1)]
      for i in to_update:
      last |= results[i]
      del results[i]
      last |= e_set

      return results


      Note: During the merging empty lists will be removed.



      Update: Reliability.



      You need two tests for a 100% reliabilty of success:





      • Check that all the resulting sets are mutually disjointed:



        merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]

        from itertools import combinations
        for a,b in combinations(merged,2):
        if not a.isdisjoint(b):
        raise Exception(a,b) # just an example


      • Check that the merged set cover the original data. (as suggested by katrielalex)



      I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.






      share|improve this answer














      This would be my updated approach:



      def merge(data):
      sets = (set(e) for e in data if e)
      results = [next(sets)]
      for e_set in sets:
      to_update =
      for i,res in enumerate(results):
      if not e_set.isdisjoint(res):
      to_update.insert(0,i)

      if not to_update:
      results.append(e_set)
      else:
      last = results[to_update.pop(-1)]
      for i in to_update:
      last |= results[i]
      del results[i]
      last |= e_set

      return results


      Note: During the merging empty lists will be removed.



      Update: Reliability.



      You need two tests for a 100% reliabilty of success:





      • Check that all the resulting sets are mutually disjointed:



        merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]

        from itertools import combinations
        for a,b in combinations(merged,2):
        if not a.isdisjoint(b):
        raise Exception(a,b) # just an example


      • Check that the merged set cover the original data. (as suggested by katrielalex)



      I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Feb 19 '12 at 23:38

























      answered Feb 2 '12 at 14:55









      Rik Poggi

      18.9k54871




      18.9k54871












      • lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], , [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.
        – Developer
        Feb 2 '12 at 15:47










      • If you try the given lists you will see 9 exists in two sets of outputs. Therefore the code suffers from the problem originally mentioned in the question, not reliability!
        – Developer
        Feb 2 '12 at 16:01










      • @Developer: I see.. that's because there's a list that have 2 different numbers each one in common with 2 disjoined sets. I'll take a look at it.
        – Rik Poggi
        Feb 2 '12 at 16:11












      • @Rik: I can't reproduce your timings. Even with my fixed version, the difference is only 10% or so (I added the benchmark to my answer). Please add the test code, otherwise this isn't of much use.
        – Niklas B.
        Feb 2 '12 at 16:15












• @NiklasBaumstark: I fixed my code and posted my timing code; if it has a problem let me know. It also seems that sometimes our solutions are different, because yours doesn't add the empty set, but I'm not sure, maybe it's my fault.
  – Rik Poggi
  Feb 2 '12 at 16:46


















      Here's a function (Python 3.1) to check if the result of a merge function is OK. It checks:




      • Are the result sets disjoint? (number of elements of union == sum of numbers of elements)

• Are the elements of the result sets the same as those of the input lists?

      • Is every input list a subset of a result set?

      • Is every result set minimal, i.e. is it impossible to split it into two smaller sets?

      • It does not check if there are empty result sets - I don't know if you want them or not...





from itertools import chain

def check(lsts, result):
    lsts = [set(s) for s in lsts]
    all_items = set(chain(*lsts))
    all_result_items = set(chain(*result))
    num_result_items = sum(len(s) for s in result)
    if num_result_items != len(all_result_items):
        print("Error: result sets overlap!")
        print(num_result_items, len(all_result_items))
        print(sorted(map(len, result)), sorted(map(len, lsts)))
    if all_items != all_result_items:
        print("Error: result doesn't match input lists!")
    if not all(any(s.issubset(t) for t in result) for s in lsts):
        print("Error: not all input lists are contained in a result set!")

    seen = set()
    todo = list(filter(bool, lsts))
    done = False
    while not done:
        deletes = []
        for i, s in enumerate(todo):  # intersection with seen, or with unseen result set, is OK
            if not s.isdisjoint(seen) or any(t.isdisjoint(seen) for t in result if not s.isdisjoint(t)):
                seen.update(s)
                deletes.append(i)
        for i in reversed(deletes):
            del todo[i]
        done = not deletes
    if todo:
        print("Error: A result set should be split into two or more parts!")
        print(todo)
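
For instance, the checker can be applied to any of the merge functions in this thread (a minimal usage sketch of my own, assuming merge is one of them; not part of the original answer):

  lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
  check(lsts, merge(lsts))   # prints nothing if the merge result passes all checks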





– WolframH (answered Feb 20 '12 at 3:21)

      • It would be neat if you could write this in unit test language =)
        – Katriel
        Feb 20 '12 at 16:34















lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx as nx
g = nx.Graph()

for sub_list in lists:
    for i in range(1, len(sub_list)):
        g.add_edge(sub_list[0], sub_list[i])

print nx.connected_components(g)
#[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]
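
A side note (my addition, not part of the original answer): the snippet above is Python 2 style; on Python 3 with current networkx releases, connected_components returns a generator of sets, so you would print it as:

  print(list(nx.connected_components(g)))
  # e.g. [{1, 2, 3, 5, 6}, {8, 9, 10}, {11, 12, 13}]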


      Performance:



      5000 lists, 5 classes, average size 74, max size 1000
      15.2264976415


      Performance of merge1:



      print timeit.timeit("merge1(lsts)", setup=setup, number=10)
      5000 lists, 5 classes, average size 74, max size 1000
      1.26998780571



So it is about 11x slower than the fastest, but the code is much simpler and more readable!







– robert king (answered Feb 21 '12 at 23:40, edited Feb 22 '12 at 0:01)

This is slower than the solution offered by Niklas (I got 3.9s on the test.txt instead of 0.5s for his solution), but yields the same result and might be easier to implement in e.g. Fortran, since it doesn't use sets, only sorting the total number of elements and then a single pass through all of them.



It returns a list with the ids of the merged lists, so it also keeps track of empty lists; they stay unmerged.



def merge(lsts):
    # this is an index list that stores the joined id for each list
    joined = range(len(lsts))
    # create an ordered list with indices
    indexed_list = sorted((el, index) for index, lst in enumerate(lsts) for el in lst)
    # loop through the ordered list, and if two elements are the same and
    # the lists are not yet joined, alter the list with joined id
    el_0, idx_0 = None, None
    for el, idx in indexed_list:
        if el == el_0 and joined[idx] != joined[idx_0]:
            old = joined[idx]
            rep = joined[idx_0]
            joined = [rep if id == old else id for id in joined]
        el_0, idx_0 = el, idx
    return joined
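
As a small usage sketch (my addition, not from the original answer), running it on the question's example shows the joined ids:

  lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
  print(merge(lsts))
  # [0, 0, 2, 0]  -> lists 0, 1 and 3 share joined id 0, list 2 stays on its own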





– steabert (answered Feb 7 '12 at 8:58, edited Feb 22 '12 at 18:25 by agf)

              • Thanks for sharing your idea. The good point is that you use labeling (i.e., ID) which can be used for further developments.
                – Developer
                Feb 7 '12 at 9:07

















              Firstly I'm not exactly sure if the benchmarks are fair:



              Adding the following code to the start of my function:



from collections import Counter
from itertools import chain
c = Counter(chain(*lists))
print c[1]
# prints 88


This means that out of all the values in all the lists, there are only 88 distinct values. Usually in the real world duplicates are rare, and you would expect a lot more distinct values. (Of course I don't know where your data is from, so I can't make assumptions.)



Because duplicates are more common, sets are less likely to be disjoint, which means the set.isdisjoint() method will be much faster: it only needs a few tests to find that the sets aren't disjoint.



Having said all that, I do believe the methods presented that use disjoint are the fastest anyway, but I'm just saying: instead of being 20x faster, maybe they should only be 10x faster than the other methods with different benchmark testing.



Anyway, I thought I would try a slightly different technique to solve this; however the merge sorting was too slow, and this method is about 20x slower than the two fastest methods using the benchmarking:



              I thought I would order everything



import heapq
from itertools import chain

def merge6(lists):
    for l in lists:
        l.sort()
    one_list = heapq.merge(*[zip(l, [i]*len(l)) for i, l in enumerate(lists)])  # iterating through one_list takes 25 seconds!!
    previous = one_list.next()

    d = {i: i for i in range(len(lists))}
    for current in one_list:
        if current[0] == previous[0]:
            d[current[1]] = d[previous[1]]
        previous = current

    groups = [[] for i in range(len(lists))]
    for k in d:
        groups[d[k]].append(lists[k])  # add each list to its group

    return [set(chain(*g)) for g in groups if g]  # since each subgroup in each g is sorted, it would be faster to merge these subgroups removing duplicates along the way


lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
print merge6(lists)
# [set([1, 2, 3, 5, 6]), set([8, 9, 10]), set([11, 12, 13])]



              import timeit
              print timeit.timeit("merge1(lsts)", setup=setup, number=10)
              print timeit.timeit("merge4(lsts)", setup=setup, number=10)
              print timeit.timeit("merge6(lsts)", setup=setup, number=10)
              5000 lists, 5 classes, average size 74, max size 1000
              1.26732238315
              5000 lists, 5 classes, average size 74, max size 1000
              1.16062907437
              5000 lists, 5 classes, average size 74, max size 1000
              30.7257182826





– robert king (answered Feb 23 '12 at 22:16)

• Your point is correct for the data being used. Data could have any form of relationship, i.e., 0 to n joined lists. Statistically, however, between 10% and 30% are of interest.
  – Developer
  Feb 24 '12 at 2:11










              • Does not work with this input
                – timdiels
                Dec 23 '15 at 12:15















              Just for fun...



def merge(mylists):
    results, sets = [], [set(lst) for lst in mylists if lst]
    upd, isd, pop = set.update, set.isdisjoint, sets.pop
    while sets:
        if not [upd(sets[0], pop(i)) for i in xrange(len(sets)-1, 0, -1) if not isd(sets[0], sets[i])]:
            results.append(pop(0))
    return results


              and my rewrite of the best answer



def merge(lsts):
    sets = map(set, lsts)
    results = []
    while sets:
        first, rest = sets[0], sets[1:]
        merged = False
        sets = []
        for s in rest:
            if s and s.isdisjoint(first):
                sets.append(s)
            else:
                first |= s
                merged = True
        if merged: sets.append(first)
        else: results.append(first)
    return results





– ChessMaster (answered Feb 21 '12 at 18:52, edited Feb 26 '12 at 21:12)

              • Your algorithm is incorrect. You only ever update the first set. Try merge([[1], [2, 3], [3, 4]]).
                – agf
                Feb 22 '12 at 20:34










• @agf the algorithm is OK I think, but I tried to make the code smaller by using update = sets[0].update and isdisjoint = sets[0].isdisjoint, and that does not work well in this case, thank you.
  – ChessMaster
  Feb 23 '12 at 3:34










              • Yep, that fixes it. The one you added performs terribly though in my tests.
                – agf
                Feb 24 '12 at 8:30















              Here's an implementation using a disjoint-set data structure (specifically a disjoint forest), thanks to comingstorm's hint at merging sets which have even one element in common. I'm using path compression for a slight (~5%) speed improvement; it's not entirely necessary (and it prevents find being tail recursive, which could slow things down). Note that I'm using a dict to represent the disjoint forest; given that the data are ints, an array would also work although it might not be much faster.



def merge(data):
    parents = {}
    def find(i):
        j = parents.get(i, i)
        if j == i:
            return i
        k = find(j)
        if k != j:
            parents[i] = k
        return k
    for l in filter(None, data):
        parents.update(dict.fromkeys(map(find, l), find(l[0])))
    merged = {}
    for k, v in parents.items():
        merged.setdefault(find(v), []).append(k)
    return merged.values()


              This approach is comparable to the other best algorithms on Rik's benchmarks.
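
A small usage sketch (my addition, not part of the original answer; on Python 3, dict.values() is a view, so wrap the result in list()):

  data = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
  print(list(merge(data)))
  # e.g. [[0, 1, 3, 4, 5, 10, 11, 16], [2, 8]]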






– ecatmur (answered Aug 22 '12 at 0:15)

              • FWIW I seem to find the opposite, that this takes roughly 3-4 times longer. Guess (as noted before) it depends strongly on what sets you're testing it on..
                – DSM
                Aug 22 '12 at 2:02















My solution works well on small lists and is quite readable, with no dependencies.



def merge_list(starting_list):
    final_list = []
    for i, v in enumerate(starting_list[:-1]):
        if set(v) & set(starting_list[i+1]):
            starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
        else:
            final_list.append(v)
    final_list.append(starting_list[-1])
    return final_list


              Benchmarking it:



              lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]



              %timeit merge_list(lists)



              100000 loops, best of 3: 4.9 µs per loop
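
Note, though, that this only merges neighbouring lists, so overlaps between non-adjacent lists are missed (as the comment below points out). An illustrative failing case of my own (not necessarily the input linked in that comment):

  print(merge_list([[1, 2], [3, 4], [2, 3]]))
  # [[1, 2], [2, 3, 4]]  -- 2 ends up in both groups, so the merge is incomplete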






– Chrismit (answered Feb 23 '15 at 22:11)

answered Feb 23 '15 at 22:11
Chrismit

• This fails on this input
  – timdiels
  Dec 23 '15 at 12:21
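The failing input referenced in that comment is not preserved in this copy, but as an illustration (my own example, not necessarily the one timdiels linked): merge_list only compares each list with its immediate successor, so overlaps that are not adjacent in the input are missed:

>>> merge_list([[1, 2], [3, 4], [2, 3]])
[[1, 2], [2, 3, 4]]    # [1, 2] and [2, 3, 4] share 2 and should have been merged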










              up vote
              0
              down vote













This can be solved in effectively linear time with the union-find (disjoint-set) algorithm: add an edge between consecutive elements of each list and union the endpoints of every edge. Given the first two rows of your data, the edges to feed to union-find are the following pairs:
(0,1),(1,3),(1,0),(0,3),(3,4),(4,5),(5,10)
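A minimal sketch of this approach (my own code and naming, not the answerer's), using a small hand-rolled disjoint-set structure: union consecutive elements of each list, then group elements by their representative.

def merge_lists_union_find(lsts):
    # parent maps each element to its current representative
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression (halving)
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # an edge between consecutive elements links everything in one list together
    for lst in lsts:
        if lst:
            find(lst[0])                    # register single-element lists too
        for a, b in zip(lst, lst[1:]):
            union(a, b)

    # collect the elements of each connected component
    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

On the question's sample data this yields the expected grouping (element order within a group follows first appearance):

>>> merge_lists_union_find([[0,1,3], [1,0,3,4,5,10,11], [2,8], [3,1,0,16]])
[[0, 1, 3, 4, 5, 10, 11, 16], [2, 8]]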






              share|improve this answer

























answered Sep 2 '16 at 18:33
ekardes