Matching two string lists that partially match into another list - python

I am trying to match a List containing strings (50 strings) with a list containing strings that are part of some of the strings of the previous list (5 strings). I will post the complete code in order to give context below but I also want to give a short example:
List1 = ['abcd12', 'efgh34', 'ijkl56', 'mnop78']
List2 = ['abc', 'ijk']
I want to return a list of the strings from List1 that have matches in List2. I have tried to do something with set.intersection, but it seems you can't do partial matches with it (or at least I can't with my limited abilities). I also tried any(), but I had no success making it work with my lists. My book says I should use a nested loop, but I don't know which function to use or how to apply it to lists.
Here is the complete code as reference:
#!/usr/bin/env python3.4
# -*- coding: utf-8 -*-
import random
def generateSequences (n):
    L = []
    dna = ["A","G","C","T"]
    for i in range(int(n)):
        random_sequence=''
        for i in range(50):
            random_sequence+=random.choice(dna)
        L.append(random_sequence)
    print(L)
    return L
def generatePrefixes (p, L):
    S = [x[:20] for x in L]
    D = []
    for i in range(p):
        randomPrefix = random.choice(S)
        D.append(randomPrefix)
    return S, D
if __name__ == "__main__":
    L = generateSequences(15)
    print (L)
    S, D = generatePrefixes(5, L)
    print (S)
    print (D)
Edit: As this was flagged as a possible duplicate, I want to point out that this post is about Python, while the other one is about R. I don't know R or whether there are any similarities, but at first glance it doesn't look like it to me. Sorry for the inconvenience.

Using a nested for loop:
def intersect(List1, List2):
    # empty list for values that match
    ret = []
    for i in List2:
        for j in List1:
            if i in j:
                ret.append(j)
    return ret
List1 = ['abcd12', 'efgh34', 'ijkl56', 'mnop78']
List2 = ['abc', 'ijk']
print(intersect(List1, List2))

This may not be the most efficient way, but it works:
matches = []
for seq_1 in List1:
    for seq_2 in List2:
        if seq_1 in seq_2 or seq_2 in seq_1:
            matches.append(seq_1)
            break  # stop after the first match so seq_1 is added only once

You can just compare the strings directly. The loop collects every item of List1 that contains an item of List2, and the set conversion then removes duplicates (note that it does not preserve the original order). This basically does what you want:
f = []
for i in List1:
    for j in List2:
        if j in i:
            f.append(i)
result = list(set(f))

Try
[l1 for l1 in List1 if any([l2 in l1 for l2 in List2])]
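The any() approach can be wrapped in a small reusable function, sketched below. The name partial_matches is made up for illustration, and the sketch assumes that a "match" means a string from List2 occurs anywhere inside a string from List1. Passing a generator to any() rather than a list lets it stop at the first hit.

```python
def partial_matches(candidates, fragments):
    """Return the items of candidates that contain at least one fragment."""
    # Hypothetical helper: any() short-circuits at the first matching fragment
    return [c for c in candidates if any(f in c for f in fragments)]


List1 = ['abcd12', 'efgh34', 'ijkl56', 'mnop78']
List2 = ['abc', 'ijk']
print(partial_matches(List1, List2))  # ['abcd12', 'ijkl56']
```

If the fragments are always prefixes, as in the DNA example, c.startswith(tuple(fragments)) would be a stricter test than the substring check.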

Related

Python parsing a list of strings through another list and return matches

I have two text files that I have turned into lists. List1 has lines that look like this:
'U|blah|USAA032812134||blah|blah|25|USAA032812134|blah|A||4||blah|2019-05-28 12:54:59|blah|123456||blah'
list2 has lines that look like this:
['smuspid\n', 'USAA032367605\n', 'USAA032367776\n', 'USAA044754265\n', 'USAA044754267\n']
I want to return every line in list1 that has a match in list2. I've tried using regex for this:
found = []
check = re.compile('|'.join(list2))
for elem in list1:
    if check.match(elem):
        found.append(elem)
but my code above is returning an empty list. Any suggestions?
I guess you can do that without a regular expression:
Method 1
list1 = ['U|blah|USAA032812134||blah|blah|25|USAA032812134|blah|A||4||blah|2019-05-28 12:54:59|blah|123456||blah']
list2 = ['|USAA032812134', '|USAA0328121304', '|USAA032999812134']
found = []
for i in list1:
    for j in list2:
        if j in i:
            found.append(j)
print(found)
Output 1
['|USAA032812134']
Method 2 using List Comprehension
list1 = ['U|blah|USAA032812134||blah|blah|25|USAA032812134|blah|A||4||blah|2019-05-28 12:54:59|blah|123456||blah']
list2 = ['|USAA032812134', '|USAA0328121304', '|USAA032999812134', 'blah']
print([j for i in list1 for j in list2 if j in i])
Output 2
['|USAA032812134', 'blah']
Method 3: strip() for new lines
You can simply strip() and append() to your found list:
list1 = ['U|blah|USAA032812134||blah|blah|25|USAA032812134|blah|A||4||blah|2019-05-28 12:54:59|blah|123456||blah']
list2 = ['smuspid\n', 'USAA032812134\n', 'USAA032367605\n', 'USAA032367776\n',
         'USAA044754265\n', 'USAA044754267\n']
found = []
for i in list1:
    for j in list2:
        if j.strip() in i:
            found.append(j.strip())
print(found)
Output 3
['USAA032812134']
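The regex from the question probably comes back empty for two reasons: the entries in list2 still end in a newline, and re.match only tests the beginning of each line. Here is a minimal sketch of a repaired version, assuming the IDs may appear anywhere in a line and that the lines of list1 (not the IDs) are wanted. The sample data is shortened and partly invented for illustration.

```python
import re

list1 = [
    'U|blah|USAA032812134||blah|blah|25|USAA032812134|blah',
    'U|blah|USAA099999999||blah|blah|25|USAA099999999|blah',
]
list2 = ['smuspid\n', 'USAA032812134\n', 'USAA032367605\n']

# Strip the trailing newlines and escape each ID so it is matched literally
ids = [re.escape(item.strip()) for item in list2 if item.strip()]
check = re.compile('|'.join(ids))

# search() scans the whole line; match() would only test its beginning
found = [line for line in list1 if check.search(line)]
print(found)
```

Only the first line of list1 is kept here, because it is the only one containing an ID from list2.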

Python trouble with matching tuples

For reference this is my code:
list1 = [('10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01')]
list2 = [('0.0.0.0', 'STCMGMTUNIX01')]
matching_app = set()  # not defined in the snippet as posted; assumed to be a set
for i in list1:
    for j in list2:
        for k in j:
            print (k)
            if k.upper() in i:
                matching_app.add(j)
for i in matching_app:
    print (i)
When I run it, it does not match. This list can contain two or three variables and I need it to add it to the matching_app set if ANY value from list2 = ANY value from list1. It does not work unless the tuples are of equal length.
Any direction to how to resolve this logic error will be appreciated.
You can solve this in a few different ways. Here are two approaches:
Looping:
list1 = [('10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01')]
list2 = [('0.0.0.0', 'STCMGMTUNIX01')]
matches = []
for i in list1[0]:
    if i in list2[0]:
        matches.append(i)
print(matches)
#['STCMGMTUNIX01']
List Comp with a set
merged = list(list1[0] + list2[0])
matches2 = set([i for i in merged if merged.count(i) > 1])
print(matches2)
#{'STCMGMTUNIX01'}
I'm not clear on what you want to do. You have two lists, each containing exactly one tuple. There also seems to be a comma missing in the first tuple.
For finding an item from a list in another list you can:
list1 = ['10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01']
list2 = ['0.0.0.0', 'STCMGMTUNIX01']
for item in list2:
    if item.upper() in list1: # Check if item is in list
        print(item, 'found in', list1)
Works the same way with tuples.
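If the goal is to collect every tuple from list2 that shares at least one value with some tuple in list1, regardless of tuple length, a set-based check is one option. This is only a sketch under that assumption: the upper-casing mirrors the k.upper() call in the question, and the second sample tuple is invented.

```python
list1 = [('10.180.13.101', '10.50.60.30', 'STCMGMTUNIX01')]
list2 = [('0.0.0.0', 'stcmgmtunix01'), ('1.1.1.1', 'OTHERHOST')]

matching_app = set()
for known in list1:
    # Normalise case once per tuple from list1
    known_values = {value.upper() for value in known}
    for candidate in list2:
        # isdisjoint() returns False as soon as one value is shared
        if not known_values.isdisjoint(v.upper() for v in candidate):
            matching_app.add(candidate)

print(matching_app)
```

Because the comparison works on sets of values, the tuples do not need to have the same length.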

Python, list of lists with sorted numbers by 4 first digits

I have a list which contains numbers and I want to create a new list which contains separate lists of all numbers with the same 4 first digits.
Example: make l2 from l1
l1 = [100023,100069,222236,22258,41415258,41413265,1214568...]
l2 = [[100023,100069],[222236,22258],[41415258,41413265],[1214568]...]
how can I create l2 from l1?
I tried iterating over the elements of l1, but without success:
def main():
    l1=[100023,100069,222236,22258,41415258,41413265,1214568]
    l2=[[100023,100069],[222236,22258],[41415258,41413265],[1214568]]
    x=0
    n=1
    for i in l2:
        if i[0:4] == l2[n][0:4]:
            l2[x].append(i)
        else:
            l2[x+1].append(i)
    print(l2)
if __name__ == '__main__':
    main()
I still don't know how to proceed.
You could create a dict as intermediate result and then convert this dict back to a list.
You also need to convert your integers to strings first.
l1 = [100023,100069,222236,22258,41415258,41413265,1214568]
l2 = []
l2dict = {}
for i in l1:
    prefix = str(i)[0:4]
    if prefix in l2dict.keys():
        l2dict[prefix].append(i)
    else:
        l2dict[prefix] = [i]
for item in l2dict.values():
    l2.append(item)
print(l2)
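You could use itertools.groupby after converting the list elements to strings and using the first four digits as keys. Note that groupby only groups consecutive elements, so this relies on numbers with the same prefix already being adjacent in l1 (sort by the same key first if they are not):
import itertools
l2 = [list(value) for key, value
      in itertools.groupby(l1, lambda x: str(x)[:4])]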
print(l2)
EDIT: Frieder's solution is pretty much how this is implemented behind the scenes.
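A slightly shorter variant of the dict approach uses collections.defaultdict, which removes the need for the explicit key check. This is a sketch, assuming Python 3.7+ so that the dict keeps the order in which prefixes first appear.

```python
from collections import defaultdict

l1 = [100023, 100069, 222236, 22258, 41415258, 41413265, 1214568]

groups = defaultdict(list)
for number in l1:
    # Key each number by the first four characters of its decimal form
    groups[str(number)[:4]].append(number)

l2 = list(groups.values())
print(l2)
# [[100023, 100069], [222236], [22258], [41415258, 41413265], [1214568]]
```

Unlike groupby, this does not require equal prefixes to be adjacent. Note that 222236 and 22258 land in separate groups, because their first four digits (2222 and 2225) differ.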

How to merge n lists together item by item for each list

I want to make one large list for entering into a database with values from 4 different lists. I want it to be like
[[list1[0], list2[0], list3[0], list4[0]], [list1[1], list2[1], list3[1], list4[1]], etc.....]
Another issue is that currently the data is received like this:
[ [ [list1[0], list1[1], [list1[3]]], [[list2[0]]], etc.....]
I've tried looping through each list by index and adding the items to a new list, but it hasn't worked. I'm pretty sure that's because some of the lists have different lengths (they're not meant to, but it's automated data, so sometimes there's a mistake).
Anyone know what's the best way to go about this? Thanks.
The first list can be constructed with the zip function as follows (for 4 lists):
list1 = [1,2,3,4]
list2 = [5,6,7,8]
list3 = [9,10,11,12]
list4 = [13,14,15,16]
res = list(zip(list1,list2,list3,list4))
For an arbitrary number of lists stored in another list, you can use the * notation to unpack the outer list:
lists = [...]
res = list(zip(*lists))
To build the list of lists to zip from the data in your second issue, flatten each nested list first and then zip:
def flatten(l):
    res = []
    for el in l:
        if(isinstance(el, list)):
            res += flatten(el)
        else:
            res.append(el)
    return res
auto_data = [...]
res = list(zip(*[flatten(el) for el in auto_data]))
One clarification at the end:
zip stops at the shortest of its inputs, so to avoid losing data you need to extend the flattened lists to a common length in the list comprehension on the last line of the code above.
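When the lists can differ in length, itertools.zip_longest from the standard library is one way to keep every item without padding by hand. The sketch below assumes that None is an acceptable placeholder for the missing values; the sample lists are the ones from above with one item removed.

```python
from itertools import zip_longest

list1 = [1, 2, 3, 4]
list2 = [5, 6, 7]          # one value missing
list3 = [9, 10, 11, 12]
list4 = [13, 14, 15, 16]

# zip_longest runs until the longest input is exhausted and
# fills the gaps with fillvalue instead of truncating
rows = [list(row) for row in zip_longest(list1, list2, list3, list4, fillvalue=None)]
print(rows)
# [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, None, 12, 16]]
```

A None in a row then marks exactly where the automated data was incomplete, which may be easier to handle before the database insert than a silently shortened result.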
So if I understand correctly, this is your input:
l = [[1.1,1.2,1.3,1.4],[2.1,2.2,2.3,2.4],[3.1,3.2,3.3,3.4],[4.1,4.2,4.3,4.4]]
and you would like to have this output
[[1.1,2.1,3.1,4.1],...]
If so, this could be done by using zip
zip(*l)
Make a for loop that only gives you a counter variable and use that variable to index the lists. Create a temporary list, fill it with the values from the other lists, and add it to the final list. This gives you the desired structure.
nestedlist = []
for counter in range(0,x):  # x = number of items to take from each list
    temporarylist = []
    temporarylist.append(firstlist[counter])
    temporarylist.append(secondlist[counter])
    temporarylist.append(thirdlist[counter])
    temporarylist.append(fourthlist[counter])
    nestedlist.append(temporarylist)
If all 4 lists are the same length, you can use this code to make it even nicer:
nestedlist = []
for counter in range(0,len(firstlist)): #changed line
    temporarylist = []
    temporarylist.append(firstlist[counter])
    temporarylist.append(secondlist[counter])
    temporarylist.append(thirdlist[counter])
    temporarylist.append(fourthlist[counter])
    nestedlist.append(temporarylist)
This comprehension should work, with a little help from zip:
mylist = [i for i in zip(list1, list2, list3, list4)]
But this assumes all the lists are the same length. If that's not the case (or you're not sure), you can "pad" them first so that they are:
def padlist(some_list, desired_length, pad_with):
    while len(some_list) < desired_length:
        some_list.append(pad_with)
    return some_list
list_of_lists = [list1, list2, list3, list4]
maxlength = len(max(list_of_lists, key=len))
list_of_lists = [padlist(l, maxlength, 0) for l in list_of_lists]
Now run the comprehension from above on the padded lists; it worked well in my testing:
mylist = [i for i in zip(*list_of_lists)]
If the flatten concept doesn't work, try this out:
import numpy as np
myArray = np.array([[list1[0], list2[0], list3[0], list4[0]], [list1[1], list2[1], list3[1], list4[1]]])
np.hstack(myArray)
This one should also work (axis=0, because iterating over myArray yields 1-D rows, for which axis=1 would be out of bounds):
np.concatenate(myArray, axis=0)
Just for those who will search for the solution of this problem when lists are of the same length:
def flatten(lists):
    results = []
    for numbers in lists:
        for output in numbers:
            results.append(output)
    return results
print(flatten(n))

Reduce list based off of element substrings

I'm looking for the most efficient way to reduce a given list based off of substrings already in the list.
For example
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
would be reduced to:
mylist = ['abcd','qrs']
because both 'abcd' and 'qrs' are the smallest substring of other elements in that list. I was able to do this with about 30 lines of code, but I suspect there is a crafty one-liner out there..
This seems to work (but I suppose it's not very efficient):
def reduce_prefixes(strings):
    sorted_strings = sorted(strings)
    return [element
            for index, element in enumerate(sorted_strings)
            if all(not previous.startswith(element) and
                   not element.startswith(previous)
                   for previous in sorted_strings[:index])]
Tests:
>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu'])
['abcd', 'qrs']
>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu',
...                  'gabcd', 'gab', 'ab'])
['ab', 'gab', 'qrs']
Probably not the most efficient, but at least short:
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
outlist = []
for l in mylist:
    if any(o.startswith(l) for o in outlist):
        # l is a prefix of some elements in outlist, so it replaces them
        outlist = [ o for o in outlist if not o.startswith(l) ] + [ l ]
    if not any(l.startswith(o) for o in outlist):
        # l has no prefix in outlist yet, so it becomes a prefix candidate
        outlist.append(l)
print(outlist)
One solution is to iterate over all the strings and split them based on if they had different characters, and recursively apply that function.
def reduce_substrings(strings):
    return list(_reduce_substrings(map(iter, strings)))
def _reduce_substrings(strings):
    # A dictionary of characters to a list of strings that begin with that character
    nexts = {}
    for string in strings:
        try:
            nexts.setdefault(next(string), []).append(string)
        except StopIteration:
            # Reached the end of this string. It is the only shortest substring.
            yield ''
            return
    for next_char, next_strings in nexts.items():
        for next_substrings in _reduce_substrings(next_strings):
            yield next_char + next_substrings
This splits it into a dictionary based on the character, and tries to find the shortest substring out of those that it split into a different list in the dictionary.
Of course, because of the recursive nature of this function, a one-liner wouldn't be possible as efficiently.
Try this one:
import re
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
new_list=[]
for i in mylist:
    if re.match("^abcd$",i):
        new_list.append(i)
    elif re.match("^qrs$",i):
        new_list.append(i)
print(new_list)
#['abcd', 'qrs']
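A general version that does not hard-code the patterns can rely on sorting: in lexicographic order a string comes directly before everything it is a prefix of, so each candidate only needs to be compared with the last string that was kept. The sketch below treats "substring" as "prefix", as the answers above do, and the function name reduce_prefixes_sorted is made up for illustration.

```python
def reduce_prefixes_sorted(strings):
    """Keep only the strings that have no shorter prefix in the input."""
    kept = []
    for s in sorted(strings):
        # After sorting, the only possible prefix of s among the kept
        # strings is the most recently kept one
        if not kept or not s.startswith(kept[-1]):
            kept.append(s)
    return kept


mylist = ['abcd', 'abcde', 'abcdef', 'qrs', 'qrst', 'qrstu']
print(reduce_prefixes_sorted(mylist))  # ['abcd', 'qrs']
```

The cost is dominated by the sort, so this should scale better than comparing every pair of strings.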
