Removing non-matching items from datasets

I have two datasets, each a list of nested lists, such that each item looks like list1[i] = [a, x, y, b] and list2[j] = [c, x, y, d]; the two lists do not necessarily have the same length. I'd like to go through the lists, preserve their order, and eliminate any sub-lists whose x value has no match in the other list. In the end, I want two lists of identical length where, at each index, the corresponding sub-lists have the same x value.
Right now I have some somewhat messy code that assumes that the set of x values in list2 is a subset of those in list1 (true at the moment) and then proceeds to remove items where the x values don't match.
len_diff = len(list1) - len(list2)
if len_diff > 0:
    removed = []
    for (counter, row) in enumerate(list2):
        # drop rows from list1 until the x values line up again
        while list1[counter][1] != list2[counter][1]:
            removed.append(list1.pop(counter))
    new_len_diff = len(list1) - len(list2)
    if new_len_diff < 0:
        raise IndexError('Data sets do not completely overlap')
    else:
        # trim the tail of list1 that extends past list2
        for i in range(new_len_diff):
            removed.append(list1.pop())
So basically I'm removing any items whose x values don't match until they start matching again, and then removing the end of list1 beyond the x values in list2 (raising an exception if I've cut too much out of list1).
Is there a better way to do this?
I don't necessarily need to relax the assumption that all x values in list2 are in list1 at the moment but it would make this code more useful to me in the future for other data manipulations. The biggest hole in my code now is that if there is a gap in my list1 data, I'll remove my entire list.

You should try this:
list1 = list2 = [x for x in list1 if x[1] in zip(*list2)[1]]
EDIT
Based on the comments below, the OP adapted this answer to do what was wanted by doing
list1 = [x for x in list1 if x[1] in zip(*list2)[1]]
list2 = [x for x in list2 if x[1] in zip(*list1)[1]]
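On Python 3, `zip(...)` returns an iterator and cannot be indexed, so the lines above need adjusting there. The same idea can be sketched with sets. This is only a minimal sketch, not the answerer's code: the helper name `keep_matching` and the sample rows are hypothetical, and it assumes the x values are hashable and sit at index 1. The two results are only guaranteed to have the same length if no x value repeats within a list.

```python
def keep_matching(list1, list2, key=1):
    """Return copies of both lists keeping only rows whose value at
    index `key` occurs in both lists; the original order is preserved."""
    common = {row[key] for row in list1} & {row[key] for row in list2}
    kept1 = [row for row in list1 if row[key] in common]
    kept2 = [row for row in list2 if row[key] in common]
    return kept1, kept2


# Hypothetical sample data in the [a, x, y, b] layout from the question.
list1 = [['a', 1, 0.1, 'b'], ['a', 2, 0.2, 'b'], ['a', 3, 0.3, 'b'], ['a', 5, 0.5, 'b']]
list2 = [['c', 2, 9.2, 'd'], ['c', 3, 9.3, 'd'], ['c', 4, 9.4, 'd']]

list1, list2 = keep_matching(list1, list2)
print([row[1] for row in list1])  # x values left in list1
print([row[1] for row in list2])  # x values left in list2
```

Building the set of common x values once also avoids recomputing the `zip` for every row.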

Related

Grouping a grouped list of str without duplicates

I have a grouped list of strings that sort of looks like this, the lists inside of these groups will always contain 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
['fff','ggg','hhh','iii','jjj'],
['xxx','mmm','ccc','bbb','aaa'],
['fff','xxx','aaa','bbb','ddd'],
['aaa','bbb','ccc','ddd','eee'],
['fff','xxx','aaa','ddd','eee'],
['iii','xxx','ggg','jjj','aaa']]
The objective is simple: group all of the lists that are similar, where the first 3 elements of one list are compared against all of the elements of every other list.
So from the above example the output might look like this (output is the index of the list):
[[0,2,4],[3,5]]
Notice that a group containing the same elements as another group, but in a different order, is removed.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also think this might not be the most efficient way to do the extraction, as the real list can contain upwards of millions of groups:
grouped_list = []
for i in range(0,len(text_list)):
    int_temp = []
    for m in range(0,len(text_list)):
        if i == m:
            continue
        bool_check = all( x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate group afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, using only the first 3 elements of each list for the comparison. For example, take the first 3 elements of list A and check whether lists B, C, D, ... each contain all 3 of them. Repeat for the entire list, then remove any group that contains the same elements as another.

You can build a set of frozensets to keep track of indices of groups with the first 3 items being a subset of the rest of the members:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
If the input list is long, it would be faster to create a set of frozensets of the first 3 items of all sub-lists and use the set to filter all combinations of 3 items from each sub-list, so that the time complexity is essentially linear to the input list rather than quadratic despite the overhead in generating combinations:
from itertools import combinations
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])

Generating a list using another list and an index list

Suppose I have the following two lists and a smaller list of indices:
list1=[2,3,4,6,7]
list2=[0,0,0,0,0]
idx=[1,2]
I want to replace the values in list2 using the values in list1 at the specified indices.
I could do so using the following loop:
for i in idx:
    list2[i]=list1[i]
If I just have list1 and idx, how could I write a list comprehension to generate list2 (same length as list1) such that list2 has the values of list1 at the indices in idx and 0 otherwise?

This will call __contains__ on every call for idx but should be reasonable for small(ish) lists.
list2 = [list1[i] if i in idx else 0 for i in range(len(list1))]
or
list2 = [e if i in idx else 0 for i, e in enumerate(list1)]
Also, do not write code like this. It is much less readable than your example. Furthermore, numpy may give you the kind of syntax you desire without sacrificing readability or speed.
import numpy as np
...
arr1 = np.array(list1)
arr2 = np.zeros_like(list1)
arr2[idx] = arr1[idx]
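If `idx` grows large, the repeated `i in idx` test against a list becomes the slow part of the comprehension. One possible way around that, sketched here under the assumption that the indices are plain integers, is to convert `idx` to a set once; the name `idx_set` is introduced only for illustration.

```python
list1 = [2, 3, 4, 6, 7]
idx = [1, 2]

# Build the set once so each membership test is O(1) on average.
idx_set = set(idx)
list2 = [value if i in idx_set else 0 for i, value in enumerate(list1)]
print(list2)  # [0, 3, 4, 0, 0]
```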

I assume that you want to generate list2 by appending values of list1 at specific indexes. All you need to do is check whether the idx list contains any values and then use a for-each loop to append the specific list1 values to list2. If idx is empty, you would only append list1[0] to list2.
if(len(idx) > 0):
    for i in idx:
        list2.append(list1[i])
else:
    list2.append(list1[0])

Python trouble with matching tuples

For reference this is my code:
list1 = [('10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01')]
list2 = [('0.0.0.0', 'STCMGMTUNIX01')]
matching_app = set()
for i in list1:
    for j in list2:
        for k in j:
            print (k)
            if k.upper() in i:
                matching_app.add(j)
for i in matching_app:
    print (i)
When I run it, it does not match. This list can contain two or three variables and I need it to add it to the matching_app set if ANY value from list2 = ANY value from list1. It does not work unless the tuples are of equal length.
Any direction to how to resolve this logic error will be appreciated.

You can solve this in a few different ways. Here are two approaches:
Looping:
list1 = [('10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01')]
list2 = [('0.0.0.0', 'STCMGMTUNIX01')]
matches = []
for i in list1[0]:
    if i in list2[0]:
        matches.append(i)
print(matches)
#['STCMGMTUNIX01']
List comprehension with a set:
merged = list(list1[0] + list2[0])
matches2 = set([i for i in merged if merged.count(i) > 1])
print(matches2)
#{'STCMGMTUNIX01'}

I'm not clear on what you want to do. You have two lists, each containing exactly one tuple. There also seems to be one missing comma in the first tuple.
For finding an item from a list in another list you can:
list1 = ['10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01']
list2 = ['0.0.0.0', 'STCMGMTUNIX01']
for item in list2:
    if item.upper() in list1: # Check if item is in list
        print(item, 'found in', list1)
Works the same way with tuples.
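To cover the requirement stated in the question (add a tuple from list2 to `matching_app` when any of its values equals any value in list1, whatever the tuple lengths), one possible sketch uses `set.isdisjoint`. It assumes all values are strings and that the comparison should ignore case, as the `.upper()` call in the question suggests; the second tuple in `list2` is made-up sample data.

```python
list1 = [('10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01')]
list2 = [('0.0.0.0', 'STCMGMTUNIX01'), ('1.1.1.1', 'otherhost')]  # second tuple is hypothetical

# Every value that appears anywhere in list1, upper-cased once.
known = {value.upper() for row in list1 for value in row}

# Keep each list2 tuple that shares at least one value with list1.
matching_app = {
    row for row in list2
    if not known.isdisjoint(value.upper() for value in row)
}
print(matching_app)  # {('0.0.0.0', 'STCMGMTUNIX01')}
```

Because the values are compared through a set, the tuples may have any length.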

Referencing value and values after it

Let's say I have a list:
list1 = [1,2,3,4,5,6,7,8,9,10]
I want a loop that will, for every value, check if a concatenated version of it and any other beyond it are the same as an existing value in another list. Make sense? No? Here's what I want to come out.
The lists
list1 = ['3','ex','fish nets','orange','banana','exampl','apple']
list2 = ['e','x','blah','exam','p','l','blahblah']
Given these lists, whenever a value in list2 and any number of adjacent values after it concatenate to a value that exists in list1, I want them merged into that value. (E.g. the values e and x concatenate to ex, which exists in list1.) So it modifies list2 to be:
list2 = ['ex','blah','exam','p','l','blahblah']
The same would be for three or four or however many values are in the list. The loop would reexamine the rest of possible combinations (only left -> right) and do the same. exam, p, and l concatenate to the value in list1 exampl. list2 then becomes:
list2 = ['ex','blah','exampl','blahblah']
The wording on this is pretty poor but I hope the examples were in-depth enough to give a representation of what I need.

list1 = ['3','ex','fish nets','orange','banana','exampl','apple']
list2 = ['e','x','blah','exam','p','l','blahblah']
a=list2
j=['e','x','blah','exam','p','l','blahblah']
t=[]
for y in range(0, len(j)):
    for i in range(1,len(j)-y):
        r=''
        t=j[y:y+i+1] #Create a list of adjacent elements.
        for x in t:
            r+=str(x) #make it string and check if it fits.
        if r in list1:
            list2[y]=r
            for e in range(y+1,y+i+1):
                list2[e]=0 # to avoid a mess with indexes.
while 0 in list2:
    list2.remove(0)
print(list2)
I know it's very clumsy, but it seems to work.
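The same behaviour can also be sketched without placeholder zeros by building a new list. This is an alternative sketch rather than a cleaned-up copy of the answer: the function name `merge_adjacent` is hypothetical, it assumes every element is a string, and it tries the longest run first at each position, which is a design choice the question does not specify.

```python
def merge_adjacent(items, targets):
    """Join runs of two or more adjacent items whose concatenation is in targets."""
    targets = set(targets)
    result = []
    start = 0
    while start < len(items):
        # Try the longest run beginning at `start` first, down to a run of two.
        for end in range(len(items), start + 1, -1):
            joined = ''.join(items[start:end])
            if joined in targets:
                result.append(joined)
                start = end
                break
        else:
            # No run starting here matched, so keep the single item unchanged.
            result.append(items[start])
            start += 1
    return result


list1 = ['3', 'ex', 'fish nets', 'orange', 'banana', 'exampl', 'apple']
list2 = ['e', 'x', 'blah', 'exam', 'p', 'l', 'blahblah']
result = merge_adjacent(list2, list1)
print(result)  # ['ex', 'blah', 'exampl', 'blahblah']
```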

Finding duplicates in few lists

In my case a duplicate is not just an item that reappears within one list; it must also reappear at the same positions in the other lists. For example:
list1 = [1,2,3,3,3,4,5,5]
list2 = ['a','b','b','c','b','d','e','e']
list3 = ['T1','T2','T3','T4','T3','T4','T5','T5']
So the positions of the real duplicates in all 3 lists are [2,4] and [6,7], because in list1 3 is repeated, in list2 'b' is repeated at the same positions as in list1, and in list3 so is 'T3'. In the second case, 5, 'e' and 'T5' are duplicated items at the same positions in their lists. I am having a hard time producing the results "automatically" in one step.
1) I find the duplicates in the first list
# Find duplicated part numbers (exact matches)
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all others to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list(seen_twice)
# List of duplicated part numbers
D_list1 = list_duplicates(list1)
D_list2 = list_duplicates(list2)
2) Then I find the positions of a given duplicate and look at those positions in the second list
# find the row position of duplicated part numbers
def list_position_duplicates(list1,n,D_list1):
    position = []
    gen = (i for i,x in enumerate(list1) if x == D_list1[n])
    for i in gen: position.append(i)
    return position
# Actual calculation: find the row position of duplicated part numbers, beginning and end
lpd_part = list_position_duplicates(list1,1,D_list1)
start = lpd_part[0]
end = lpd_part[-1]
lpd_parent = list_position_duplicates(list2[start:end+1],0,D_list2)
So in step 2 I have to supply n (the position of the found duplicate in the list) myself. I would like this step to happen automatically, giving the positions of elements that are duplicated at the same positions in all the lists, for all duplicates at the same time and not one by one "manually". I think it just needs a for loop or an if, but I'm new to Python; I tried many combinations and none of them worked.

You can use the items from all 3 lists at the same index as a key and store the corresponding index as a value (in a list). If any key has more than 1 index stored in its list, it is a duplicate:
from itertools import izip
def solve(*lists):
    d = {}
    for i, k in enumerate(izip(*lists)):
        d.setdefault(k, []).append(i)
    for k, v in d.items():
        if len(v) > 1:
            print k, v
solve(list1, list2, list3)
#(3, 'b', 'T3') [2, 4]
#(5, 'e', 'T5') [6, 7]
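The answer above is written for Python 2 (`izip` and the `print` statement). A rough Python 3 equivalent, assuming the items in every list are hashable, might look like the sketch below; the function name `duplicate_positions` is hypothetical, and it returns the result as a dictionary instead of printing it.

```python
from collections import defaultdict


def duplicate_positions(*lists):
    """Map each tuple of same-index items to its positions,
    keeping only tuples that occur at more than one position."""
    positions = defaultdict(list)
    for index, key in enumerate(zip(*lists)):
        positions[key].append(index)
    return {key: found for key, found in positions.items() if len(found) > 1}


list1 = [1, 2, 3, 3, 3, 4, 5, 5]
list2 = ['a', 'b', 'b', 'c', 'b', 'd', 'e', 'e']
list3 = ['T1', 'T2', 'T3', 'T4', 'T3', 'T4', 'T5', 'T5']

duplicates = duplicate_positions(list1, list2, list3)
print(duplicates)
# {(3, 'b', 'T3'): [2, 4], (5, 'e', 'T5'): [6, 7]}
```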

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.