What I need:
I have a dataframe where the elements of a column are lists. There are no duplications of elements in a list. For example, a dataframe like the following:
import pandas as pd
>>d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
>>df = pd.DataFrame(data=d)
col1
0 [1, 2, 4, 8]
1 [15, 16, 17]
2 [18, 3]
3 [2, 19]
4 [10, 4]
I would like to obtain a dataframe where, if at least a number contained in a list at row i is also contained in a list at row j, then the two list are merged (without duplication). But the values could also be shared by more than two lists, in that case I want all lists that share at least a value to be merged.
col1
0 [1, 2, 4, 8, 19, 10]
1 [15, 16, 17]
2 [18, 3]
The order of the rows of the output dataframe, nor the values inside a list is important.
What I tried:
I have found this answer, that shows how to tell if at least one item in list is contained in another list, e.g.
>>not set([1, 2, 4, 8]).isdisjoint([2, 19])
True
Returns True, since 2 is contained in both lists.
I have also found this useful answer that shows how to compare each row of a dataframe with each other. The answer applies a custom function to each row of the dataframe using a lambda.
df.apply(lambda row: func(row['col1']), axis=1)
However I'm not sure how to put this two things together, how to create the func method. Also I don't know if this approach is even feasible since the resulting rows will probably be less than the ones of the original dataframe.
Thanks!
You can use networkx and graphs for that:
import networkx as nx
G = nx.Graph([edge for nodes in df['col1'] for edge in zip(nodes, nodes[1:])])
result = pd.Series(nx.connected_components(G))
This is basically treating every number as a node, and whenever two number are in the same list then you connect them. Finally you find the connected components.
Output:
0 {1, 2, 4, 8, 10, 19}
1 {16, 17, 15}
2 {18, 3}
This is not straightforward. Merging lists has many pitfalls.
One solid approach is to use a specialized library, for example networkx to use a graph approach. You can generate successive edges and find the connected components.
Here is your graph:
You can thus:
generate successive edges with add_edges_from
find the connected_components
craft a dictionary and map the first item of each list
groupby and merge the lists (you could use the connected components directly but I'm giving a pandas solution in case you have more columns to handle)
import networkx as nx
G = nx.Graph()
for l in df['col1']:
G.add_edges_from(zip(l, l[1:]))
groups = {k:v for v,l in enumerate(nx.connected_components(G)) for k in l}
# {1: 0, 2: 0, 4: 0, 8: 0, 10: 0, 19: 0, 16: 1, 17: 1, 15: 1, 18: 2, 3: 2}
out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
.agg(lambda x: sorted(set().union(*x)))
)
output:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
Seems more like a Python problem than pandas one, so here's one attempt that checks every after list, merges (and removes) if intersecting:
vals = d["col1"]
# while there are at least 1 more list after to process...
i = 0
while i < len(vals) - 1:
current = set(vals[i])
# for the next lists...
j = i + 1
while j < len(vals):
# any intersection?
# then update the current and delete the other
other = vals[j]
if current.intersection(other):
current.update(other)
del vals[j]
else:
# no intersection, so keep going for next lists
j += 1
# put back the updated current back, and move on
vals[i] = current
i += 1
at the end, vals is
In [108]: vals
Out[108]: [{1, 2, 4, 8, 10, 19}, {15, 16, 17}, {3, 18}]
In [109]: pd.Series(map(list, vals))
Out[109]:
0 [1, 2, 19, 4, 8, 10]
1 [16, 17, 15]
2 [18, 3]
dtype: object
if you don't want vals modified, can chain .copy() for it.
To add on mozway's answer. It wasn't clear from the question, but I also had rows with single-valued lists. This values aren't clearly added to the graph when calling add_edges_from(zip(l, l[1:]), since l[1:] is empty. I solved it adding a singular node to the graph when encountering emtpy l[1:] lists. I leave the solution in case anyone needs it.
import networkx as nx
import pandas as pd
d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4], [9]]}
df= pd.DataFrame(data=d)
G = nx.Graph()
for l in df['col1']:
if len(l[1:]) == 0:
G.add_node(l[0])
else:
G.add_edges_from(zip(l, l[1:]))
groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
out= (df.groupby(df['col1'].str[0].map(groups), as_index=False)
.agg(lambda x: sorted(set().union(*x))))
Result:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
3 [9]
I am attempting to create a piece of code that will watch a counter with an output something like:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
I want the code to be able to tally the total and tell me how many counts are missed for example if this happened:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24, 25, 26, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,1,2]
I would get a total of 92 still, but get feedback that 8 are missing.
I have gotten very close with the following code:
Blk_Tot = 0
CBN = 0
LBN = 0
x = 0
y = 0
z = 0
MissedBlocks = 0
for i in range(len(a1)):
CBN = a1[i]
if CBN - LBN <= 0:
if LBN == 30:
y = 30 - abs(CBN - LBN)
elif LBN < 30:
z = 30 - LBN
y = 30 - abs(CBN - LBN) + z
print(z)
Blk_Tot = Blk_Tot + y
else:
x = CBN - LBN
Blk_Tot = Blk_Tot + x
if x > 1:
MissedBlocks = MissedBlocks - 1 + x
LBN = CBN
print(Blk_Tot)
print(MissedBlocks)
If I delete anywhere between 1 and 30 it works perfectly, however if I delete across 30, say 29,30,1,2 it breaks.I don't expect it to be able to miss 30 in a row and still be able to come up with a proper total however.
Anyone have any ideas on how this might be achieved? I feel like I am missing an obvious answer :D
Sorry I think I was unclear, a1 is a counter coming from an external device that counts from 1 to 30 and then wraps around to 1 again. Each count is actually part of a message to show that the message was received; so say 1 2 4, I know that the 3rd message is missing. What I am trying to do is found out the total that should have been recieved and how many are missing from the count.
Update after an idea from the posts below, another method of doing this maybe:
Input:
123456
List[1,2,3,4,5,6]
1.Check first input to see which part of the list it is in and start from there (in case we don't start from zero)
2.every time an input is received check if that matches the next value in the array
3.if not then how many steps does it take to find that value
You don't need to keep track if you past the 30 line.
Just compare with the ideal sequence and count the missing numbers.
No knowledge if parts missing at the end.
No knowledge if more than 30 parts are missing in a block.
from itertools import cycle
def idealSeqGen():
for i in cycle(range(1,31)):
yield(i)
def receivedSeqGen():
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1,2]
for i in a1:
yield(i)
receivedSeq = receivedSeqGen()
idealSeq = idealSeqGen()
missing = 0
ideal = next(idealSeq)
try:
while True:
received = next(receivedSeq)
while received != ideal:
missing += 1
ideal = next(idealSeq)
ideal = next(idealSeq)
except StopIteration:
pass
print (f'There are {missing} items missing')
Edit
The loop part can be a little bit simpler
missing = 0
try:
while True:
ideal = next(idealSeq)
received = next(receivedSeq)
while received != ideal:
missing += 1
ideal = next(idealSeq)
except StopIteration:
pass
print (f'There are {missing} items missing')
In general, if you want to count the number of differences between two lists, you can easily use a dictionary. The other answer would also work, but it is highly inefficient for even slightly larger lists.
def counter(lst):
# create a dictionary with count of each element
d = {}
for i in lst:
if d.get(i, None):
d[i] += 1
else:
d[i] = 1
return d
def compare(d1, d2):
# d1 and d2 are dictionaries
ans = 0
for i in d1.values():
if d2.get(i, None):
# comapares the common values in both lists
ans += abs(d1[i]-d2[i])
d2[i] = 0
else:
#for elements only in the first list
ans += d1[i]
for i in d2.values():
# for elements only in the second list
if d2[i]>0:
ans += d2[i]
return ans
l1 = [...]
l2 = [...]
print(compare(counter(l1), counter(l2)))
New code to check for missing elements from a repeating sequence pattern
Now that I have understood your question more clearly, here's the code. The assumption in this code is the list will always be in ascending order from 1 thru 30 and then repeats again from 1. There can be missing elements between 1 and 30 but the order will always be in ascending order between 1 and 30.
If the source data is as shown in list a1, then the code will result in 8 missing elements.
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1,2]
a2 = a1.copy()
c = 1
missing = 0
while a2:
if a2[0] == c:
c+=1
a2.pop(0)
elif a2[0] > c:
missing +=1
c+=1
elif a2[0] < c:
missing += 31-c
c = 1
if c == 31: c=1
print (f'There are {missing} items missing in the list')
The output of this will be:
There are 8 items missing in the list
Let me know if this addresses your question
earlier code to compare two lists
You cannot use set as the items are repeated. So you need to sort them and find out how many times each element is in both lists. The difference will give you the missing counts. You may have an element in a1 but not in a2 or vice versa. So finding out the absolute count of missing items will give you the results.
I will update the response with better variables in my next update.
Here's how I did it:
code with comments:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
a2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,1,2]
#step 1: Find out which list is longer. We will use that as the master list
if len(a1) > len(a2):
master_list = a1.copy()
second_list = a2.copy()
else:
master_list = a2.copy()
second_list = a1.copy()
#step 2: We must sort both master and second list
# so we can compare against each other
master_list.sort()
second_list.sort()
#set the counter to zero
missing = 0
#iterate through the master list and find all values in master against second list
#for each iteration, remove the value in master[0] from both master and second list
#when you have iterated through the full list, you will get an empty master_list
#this will help you to use while statement to iterate until master_list is empty
while master_list:
#pick the first element of master list to search for
x = master_list[0]
#count the number of times master_list[0] is found in both master and second list
a_count = master_list.count(x)
b_count = second_list.count(x)
#absolute difference of both gives you how many are missing from each other
#master may have 4 occurrences and second may have 2 occurrences. abs diff is 2
#master may have 2 occurrences and second may have 5 occurrences. abs diff is 3
missing += abs(a_count - b_count)
#now remove all occurrences of master_list[0] from both master and second list
master_list = [i for i in master_list if i != x]
second_list = [i for i in second_list if i != x]
#iterate until master_list is empty
#you may end up with a few items in second_list that are not found in master list
#add them to the missing items list
#thats your absolute total of all missing items between lists a1 and a2
#if you want to know the difference between the bigger list and shorter one,
#then don't add the missing items from second list
missing += len(second_list)
#now print the count of missig elements between the two lists
print ('Total number of missing elements are:', missing)
The output from this is:
Total number of missing elements are: 7
If you want to find out which elements are missing, then you need to add a few more lines of code.
In the above example, elements 27,28,29,30, 4, 5 are missing from a2 and 31 from a1. So total number of missing elements is 7.
code without comments:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
a2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,1,2]
if len(a1) > len(a2):
master_list = a1.copy()
second_list = a2.copy()
else:
master_list = a2.copy()
second_list = a1.copy()
master_list.sort()
second_list.sort()
missing = 0
while master_list:
x = master_list[0]
a_count = master_list.count(x)
b_count = second_list.count(x)
missing += abs(a_count - b_count)
master_list = [i for i in master_list if i != x]
second_list = [i for i in second_list if i != x]
missing += len(second_list)
print ('Total number of missing elements are:', missing)