Merging items in list given condition - python

Let's say I have ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'].
If the second word of an element is the same as the first word of another element in the list, the two should be merged into one item. The order matters as well.
['A B C D E G', 'X Y Z'] should be the final product.
Letters will not form a cycle, i.e., ['A B', 'B C', 'C A'] is not possible.
As for why G was added at the end, let's say we are at 'A B C' and we see 'C D'. And later on in the list we have 'C G'. We add 'C D' since it comes first, so we have 'A B C D', and then if there is 'D E' we merge that to 'A B C D E'. Now at the end of a 'chain', we add 'C G' so we have 'A B C D E G'. If there had been a 'B H', since B comes first, then it would be 'A B C D E H G'.
Since order matters, ['A B', 'C A'] is still ['A B', 'C A'] since B does not connect to C.
Another example is:
If we have ['A B', 'A C'] then at step 1 we just have A B and then we see A C and we merge that into A B and we have A B C.
I have tried many things, such as dissecting each element into two lists, then indexing and removing etc. It is way too complex and I feel there should be a more intuitive way to solve this in Python. Alas, I can't seem to figure it out.

A simple algorithm solving this appears to be:
    initialize results as an empty list
    repeat for each pair in the input list:
        repeat for each sublist R in results:
            if R contains the first item of pair, append the second item to R and continue with the next pair
        if no sublist R contained the first item of pair, append pair as a new sublist to results
The implementation in Python is straightforward and should be self-explanatory:
def merge_pairs(pairs):
    results = []
    for pair in pairs:
        first, second = pair.split()
        for result in results:
            if first in result:
                result.append(second)
                break
        else:
            # no existing sublist contained `first`, so start a new one
            results.append([first, second])
    return [' '.join(result) for result in results]
The only extra steps are the conversions between space-separated letters and lists of letters by .split() and ' '.join().
Note the else clause of the for statement which is only executed if no break was encountered.
Some examples:
>>> merge_pairs(['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'])
['A B C D E G', 'X Y Z']
>>> merge_pairs(['A B', 'A C'])
['A B C']
>>> merge_pairs(['B C', 'A B'])
['B C', 'A B']
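The for ... else construct used in merge_pairs can be seen in isolation in the following minimal sketch. The helper name first_index_containing is hypothetical and only serves to illustrate when the else branch runs:

```python
def first_index_containing(groups, item):
    """Return the index of the first group containing item, or -1 if none does."""
    for index, group in enumerate(groups):
        if item in group:
            break  # found a match: the else branch below is skipped
    else:
        index = -1  # the loop ran to completion without hitting break
    return index


print(first_index_containing([['A', 'B'], ['X', 'Y']], 'X'))
print(first_index_containing([['A', 'B'], ['X', 'Y']], 'Q'))
```

The first call finds 'X' in the second group, while the second call falls through to the else branch.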

A bit messy, but this works.
a = ['AB', 'BC', 'XY', 'CD', 'YZ', 'DE', 'CG']
used_item = []
result = []
for f, fword in enumerate(a):
    for s, sword in enumerate(a):
        if f == s:
            continue
        if fword[1] == sword[0]:
            if f in used_item or s in used_item:
                # extend the first existing chain that already contains the shared letter
                idx = [i for i, w in enumerate(result) if fword[1] in w][0]
                result = [r + sword[1] if i == idx else r for i, r in enumerate(result)]
                used_item.append(f)
                used_item.append(s)
            else:
                # neither item has been used yet: start a new chain
                result.append(fword + sword[1])
                used_item.append(f)
                used_item.append(s)
print(result)
The output is ['ABCDGE', 'XYZ']. Note that G ends up before E here, unlike the requested 'A B C D E G'.
You could sort 'ABCDGE' if necessary.
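If alphabetical order within each merged string is what is wanted, the sorting step could look like this minimal sketch. The name sorted_result is only illustrative, and the approach assumes single-character items as in this answer:

```python
result = ['ABCDGE', 'XYZ']

# sorted() returns a list of characters, so join them back into a string
sorted_result = [''.join(sorted(chain)) for chain in result]
print(sorted_result)
```

Keep in mind that sorting discards the chain order, so it only matches the requested output when the chains happen to be alphabetical.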

I appreciate all the above answers and wanted to attempt the question using Graph Theory.
This question is a classic example of connected components, where we assign a unique identifier to each node (the nodes are the distinct characters in the list: A, B, C, ..., Z).
Also, as we need to maintain the order, DFS is the correct choice to proceed with.
Algorithm:
1. Treat the characters A, B, C, ..., Z as 26 different nodes.
2. Perform DFS on each node.
    2.1 If the node has not been assigned a unique identifier:
        2.1.1. Assign a new unique identifier to the node.
        2.1.2. dfs(node)
    2.2 Else:
        do nothing
The function dfs() will group the nodes according to the unique identifier. While performing DFS, the child gets the same unique identifier as its parent.
Have a look at the following implementation:
class Graph:
    def __init__(self, n):
        self.number_of_nodes = n
        self.visited = []
        self.adjacency_list = []
        self.identifier = [-1] * self.number_of_nodes
        self.merged_list = []
    def add_edge(self, edge_start, edge_end):
        if(self.encode(edge_end) not in self.adjacency_list[self.encode(edge_start)]):
            self.adjacency_list[self.encode(edge_start)] = self.adjacency_list[self.encode(edge_start)] + [self.encode(edge_end)]
    def initialize_graph(self):
        self.visited = [False] * self.number_of_nodes
        for i in range(0, self.number_of_nodes):
            self.adjacency_list = self.adjacency_list + [[]]
    def get_adjacency_list(self):
        return self.adjacency_list
    def encode(self, node):
        # 'A' -> 0, 'B' -> 1, ...
        return ord(node) - 65
    def decode(self, node):
        # 0 -> 'A', 1 -> 'B', ...
        return chr(node + 65)
    def dfs(self, start_index):
        if(self.visited[self.encode(start_index)] == True):
            return
        self.visited[self.encode(start_index)] = True
        for node in self.adjacency_list[self.encode(start_index)]:
            if(self.identifier[node] == -1):
                # the child inherits the identifier of its parent
                self.identifier[node] = self.identifier[self.encode(start_index)]
            if(self.visited[node] == False):
                self.merged_list[self.identifier[node]] = self.merged_list[self.identifier[node]] + self.decode(node)
                self.dfs(self.decode(node))
graph = Graph(26)
graph.initialize_graph()
input_list = ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G']
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    edge_end = edge[1]
    graph.add_edge(edge_start, edge_end)
unique_identifier = 0
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    if(graph.identifier[graph.encode(edge_start)] == -1):
        graph.identifier[graph.encode(edge_start)] = unique_identifier
        graph.merged_list = graph.merged_list + [edge_start]
        unique_identifier = unique_identifier + 1
        graph.dfs(edge_start)
print(graph.merged_list)
Output:
['ABCDEG', 'XYZ']
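The same DFS idea can be sketched more compactly with a dictionary as the adjacency list, which also removes the restriction to the 26 letters A to Z. This is a minimal sketch rather than the answer's own code: the function name merge_chains and the nested helper visit are hypothetical, and it assumes the input contains no cycles, as stated in the question.

```python
from collections import defaultdict


def merge_chains(pairs):
    # Build the adjacency list, keeping children in first-seen order.
    graph = defaultdict(list)
    for pair in pairs:
        start, end = pair.split()
        if end not in graph[start]:
            graph[start].append(end)

    seen = set()
    chains = []

    def visit(node, chain):
        # Depth-first walk that appends each newly reached node to the chain.
        seen.add(node)
        chain.append(node)
        for child in graph.get(node, []):
            if child not in seen:
                visit(child, chain)

    # Start a new chain from every start node that has not been reached yet.
    for pair in pairs:
        start = pair.split()[0]
        if start not in seen:
            chain = []
            visit(start, chain)
            chains.append(' '.join(chain))
    return chains


print(merge_chains(['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G']))
```

For very long chains the recursion could hit Python's recursion limit, in which case an explicit stack would be the safer choice.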

Related

How to find a substring in a string list?

I try to find some string in a list, but have problems because of word order.
list = ['a b c d', 'e f g', 'h i j k']
str = 'e g'
I need to find the 2nd item in a list and output it.
You can use combination of any() and all() to check the presence in one line:
>>> my_list = ['a b c d', 'e f g', 'h i j k']
>>> my_str = 'e g'
>>> any(all(s in sub_list for s in my_str.split()) for sub_list in my_list)
True
Here, the above expression returns True or False depending on whether every word of my_str occurs (as a substring) in a single item of the list.
To also get that sub-list as the return value, replace any() with a list comprehension:
>>> [sub_list for sub_list in my_list if all(s in sub_list for s in my_str.split())]
['e f g']
It'll return the list of strings containing your chars.
You can try:
for l in list:
    l_words = l.split(" ")
    if all([x in l_words for x in str.split(" ")]):
        print(l_words)
You can try this:
list = ['a b c d', 'e f g', 'h i j k']
str = list[2].split()
for letter in str:
    print(letter)
This can be achieved by using sets and list comprehension
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
print([i for i in ls if len(set(s.replace(" ", "")).intersection(set(i.replace(" ", "")))) == len(s.replace(" ", ""))])
OR
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
s_set = set(s.replace(" ", ""))
print([i for i in ls if len(s_set.intersection(set(i.replace(" ", "")))) == len(s_set)])
Output
['e f g']
The list comprehension keeps only those items of ls that contain every character of s; any item that is missing one of the characters of s is filtered out.
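The same set idea can also be applied to whole words instead of single characters, which avoids accidental matches when the items contain longer words. A minimal sketch, assuming words are separated by whitespace (the function name items_containing_words is hypothetical):

```python
def items_containing_words(items, query):
    # A query matches an item when its words are a subset of the item's words.
    wanted = set(query.split())
    return [item for item in items if wanted <= set(item.split())]


ls = ['a b c d', 'e f g', 'h i j k']
print(items_containing_words(ls, 'e g'))
```

Because sets ignore order, the query 'g e' matches the same items as 'e g'.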

Python dictionary, constant complexity way to return all keys in dict contain certain string

I have a dictionary like: mydict = {'A B C':0, 'A B E':1, 'E F':0}
Then I have a search key search_string = 'A B'
I would like to find all keys (and their values) in mydict that contain search_string. In this case, 'A B C' and 'A B E' satisfy that.
Since mydict can be very large, is there a constant-time way to search it, rather than:
result = [search_string in key for key, val in mydict.items()]
I am also open to restructure the dictionary if needed.
If the search is always for a prefix of the key, then you can use a prefix tree (trie); the third-party pytrie package provides one.
A trie allows finding matches in O(M) time, where M is the maximum string length, i.e. the cost depends upon the maximum key length rather than the number of keys.
Code
from pytrie import StringTrie
def create_prefix(dict):
    " Creates a prefix tree based upon a dictionary "
    # create empty trie
    trie = StringTrie()
    for k in dict:
        trie[k] = k
    return trie
Test 1
# Preprocess to create the prefix tree
mydict = {'A B C':0, 'A B E':1, 'E F':0}
prefix_tree = create_prefix(mydict)
# Now you can use the search tree multiple times to speed up individual searches
for search_string in ['A B', 'A B C', 'E', 'B']:
    results = prefix_tree.values(search_string)  # .values returns the list of values whose keys have this prefix
    if results:
        print(f'Search String {search_string} found in keys {results}')
    else:
        print(f'Search String {search_string} not found')
Output
Search String A B found in keys ['A B C', 'A B E']
Search String A B C found in keys ['A B C']
Search String E found in keys ['E F']
Search String B not found
Test 2 (added to answer question from OP)
mydict = {'A B C':0, 'A B C D':0, 'A B C D E':0}
prefix_tree = create_prefix(mydict)
# Now you can use the search tree multiple times to speed up individual searches
for search_string in ['A B', 'A B C', 'A B C D', 'A B C D E', 'B C']:
    results = prefix_tree.values(search_string)  # .values returns the list of values whose keys have this prefix
    if results:
        print(f'Search String {search_string} found in keys {results}')
    else:
        print(f'Search String {search_string} not found')
Output
Search String A B found in keys ['A B C', 'A B C D', 'A B C D E']
Search String A B C found in keys ['A B C', 'A B C D', 'A B C D E']
Search String A B C D found in keys ['A B C D', 'A B C D E']
Search String A B C D E found in keys ['A B C D E']
Search String B C not found
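If installing a third-party package is not an option, prefix lookups can also be sketched with the standard library alone by keeping the keys sorted and using bisect. This is a swapped-in technique, not the trie described above: it is not constant time either, but after a one-off sort each lookup needs a binary search plus one step per match. The function name keys_with_prefix is hypothetical.

```python
import bisect


def keys_with_prefix(sorted_keys, prefix):
    # All keys sharing a prefix are adjacent in sorted order,
    # so find the first candidate and walk forward while it still matches.
    matches = []
    i = bisect.bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


mydict = {'A B C': 0, 'A B E': 1, 'E F': 0}
sorted_keys = sorted(mydict)  # one-off preprocessing step

for search_string in ['A B', 'E', 'B']:
    print(search_string, keys_with_prefix(sorted_keys, search_string))
```

The sorted list has to be rebuilt or updated with bisect.insort whenever keys are added to the dictionary.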
You have two potential solutions here. The first doesn't have O(1) complexity, but it's probably the way you'll want to go:
We can build a tree and do the search that way. So essentially,
you could have mydict look like this:
test_dict = {
    'A': {
        'B': {
            'C': 0,
            'E': 1
        }
    },
    'E': {
        'F': 0
    }
}
def get_recursive_values(mydict):
    # collect every leaf value below this node
    results = []
    for key in mydict:
        if isinstance(mydict[key], dict):
            results.extend(get_recursive_values(mydict[key]))
        else:
            results.append(mydict[key])
    return results
def search(mydict, search_text):
    components = search_text.split(' ')
    if components[0] in mydict:
        next_res = mydict[components[0]]
        if isinstance(next_res, dict):
            if len(components) == 1:
                return get_recursive_values(next_res)
            return search(next_res, " ".join(components[1:]))
        else:
            return [mydict[components[0]]]
    raise KeyError(components[0])
Probably could be written a little nicer, but that'll work for you. Try calling search(test_dict, 'A B')
and you'll get both results, [0, 1].
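The nested dictionary does not have to be written by hand. A minimal sketch of building it from the flat dictionary in the question could look like this; the function name build_tree is hypothetical, and the sketch assumes that no key is a word-level prefix of another key (for example 'A B' alongside 'A B C'), since a node cannot be both a value and a sub-tree in this layout:

```python
def build_tree(flat):
    tree = {}
    for key, value in flat.items():
        # All words but the last become nested dictionaries; the last holds the value.
        *parents, leaf = key.split()
        node = tree
        for word in parents:
            node = node.setdefault(word, {})
        node[leaf] = value
    return tree


mydict = {'A B C': 0, 'A B E': 1, 'E F': 0}
print(build_tree(mydict))
```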
Another potential solution, if you don't care about insertion time, is to store the values under every prefix of every key. This may sound a bit ridiculous, but you'll get values in O(1) time, while insertion time will be large, i.e.
mydict = {
    'A': [0, 1],
    'A B': [0, 1],
    'A B C': [0],
    'A B E': [1],
    'E': [0],
    'E F': [0]
}
def insert(mydict, key, value):
    # Record the value under every word-level prefix of key,
    # e.g. 'A B C' updates 'A', 'A B' and 'A B C'.
    words = key.split()
    for i in range(1, len(words) + 1):
        mydict.setdefault(' '.join(words[:i]), []).append(value)

Extracting all uppercase words following each other from string

Out of a string like this "A B c de F G A" I would like to get the following list: ["A B", "F G A"]. That means, I need to get all the sequences of uppercase words.
I tried something like this:
text = "A B c de F G A"
result = []
for i, word in enumerate(text.split()):
    if word[0].isupper():
        s = ""
        while word[0].isupper():
            s += word
            i += 1
            word = text[i]
        result.append(s)
But it produces the following output: ['A', 'BB', 'F', 'G', 'A']
I suppose it happens because you can't skip a list element by just incrementing i. How can I avoid this situation and get the right output?
You can use itertools.groupby:
import itertools
s = "A B c de F G A"
new_s = [' '.join(b) for a, b in itertools.groupby(s.split(), key=str.isupper) if a]
Output:
['A B', 'F G A']
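To see why this works, it may help to look at what groupby yields before the filtering and joining. The following sketch simply materialises each group; the variable name groups is only illustrative:

```python
import itertools

s = "A B c de F G A"

# groupby emits one (key, run) pair per run of consecutive words with the same key
groups = [(is_upper, list(words))
          for is_upper, words in itertools.groupby(s.split(), key=str.isupper)]
print(groups)
```

Each run of consecutive uppercase words arrives as one group with the key True, so keeping only those groups and joining their words gives the desired list.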
You can use re.split to split a string with a regex.
import re
def get_upper_sequences(s):
    return re.split(r'\s+[a-z][a-z\s]*', s)
Example
>>> get_upper_sequences( "A B c de F G A")
['A B', 'F G A']
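Here is a solution without itertools or re:
def findTitles(text):
    # Each non-title word is replaced by a single space, so runs of title words end up separated by at least two spaces.
    filtered = " ".join([x if x.istitle() else " " for x in text.split()])
    # Split on two spaces (not one) so that the words inside a run stay together.
    return [y.strip() for y in filtered.split("  ") if y]
print(findTitles(text="A B c de F G A"))
#['A B', 'F G A']
print(findTitles(text="A Bbb c de F G A"))
#['A Bbb', 'F G A']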
The following example also extracts the runs of uppercase words, by splitting on runs of lowercase letters:
import re
string = "A B c de F G A"
print([val.strip() for val in re.split(r'[a-z]+', string) if val.strip()])
#['A B', 'F G A']
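The loop from the question can also be repaired directly, without re or itertools, by collecting words in a buffer instead of trying to advance the loop index. A minimal sketch (the function name upper_runs is hypothetical):

```python
def upper_runs(text):
    runs, current = [], []
    for word in text.split():
        if word[0].isupper():
            current.append(word)  # extend the current run
        elif current:
            runs.append(' '.join(current))  # a lowercase word ends the run
            current = []
    if current:
        runs.append(' '.join(current))  # flush a run that reaches the end of the text
    return runs


print(upper_runs("A B c de F G A"))
```

This sidesteps the original problem, because reassigning i inside a for loop over enumerate() has no effect on the next iteration.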

Concatenating Arbitrary number of items of a string in Python

Given a list ['a','b','c','d','e','f'] and a number of divisions to be made (here 2), I want the first string to take elements 0, 2, 4 of the list, joined by a space delimiter, and the second string to take elements 1, 3, 5.
The output needs to be in the form of k = ["a c e", "b d f"]
The actual program takes in a string (e.g. {ball,bat,doll,choclate,bat,kite}) and the number of kids who take those gifts (e.g. 2), and then divides the gifts so that the first kid gets a gift and goes to the back of the line, the second kid takes a gift and stands at the back, and so on until all kids have taken a gift. If gifts remain, the first kid takes another gift and the cycle continues.
desired output for above eg: {"ball doll bat" , "bat choclate kite"}
Here is a general way to do this for any number of groups:
def merge(lst, ngroups):
    # range replaces the Python 2-only xrange
    return [' '.join(lst[start::ngroups]) for start in range(ngroups)]
Here is how it's used:
>>> lst = ['a','b','c','d','e','f']
>>> merge(lst, 2)
['a c e', 'b d f']
>>> merge(lst, 3)
['a d', 'b e', 'c f']
lst = ['a','b','c','d','e','f']
k = [" ".join(lst[::2]), " ".join(lst[1::2])]
output:
['a c e', 'b d f']
A more generic solution:
def group(lst, n):
    return [" ".join(lst[i::n]) for i in range(n)]
lst = ['a','b','c','d','e','f']
print(group(lst, 3))
output:
['a d', 'b e', 'c f']
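Applied to the gift example from the question, the slicing approach could be combined with a small parsing step. This is a minimal sketch that assumes the input really arrives as a brace-wrapped, comma-separated string; the function names parse_gifts and distribute are hypothetical:

```python
def parse_gifts(raw):
    # Drop the surrounding braces, then split on commas and trim whitespace.
    return [gift.strip() for gift in raw.strip().strip('{}').split(',') if gift.strip()]


def distribute(gifts, kids):
    # Kid k takes gifts k, k + kids, k + 2 * kids, ... (round-robin order).
    return [' '.join(gifts[k::kids]) for k in range(kids)]


gifts = parse_gifts('{ball,bat,doll,choclate,bat,kite}')
print(distribute(gifts, 2))
```

When the number of gifts is not a multiple of the number of kids, the kids at the front of the line simply end up with one gift more.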

Lining up DNA base pairs in an array

Trying to import a series of base pairs into an array.
I want it in the form ['AA','AT','AG','AC'...]
Here's my code:
paths = [str(x[4:7]) for x in mm_start]
paths
['A A', 'A T', 'A G', 'A C', 'T A', 'T T', 'T G', 'T C', 'G A', 'G T', 'G G', 'G C', 'C A', 'C T', 'C G', 'C C']
I get spaces in between the letters!
This strip command isn't helping either.
paths = str(paths).replace(" ","")
paths
"['AA','AT','AG','AC','TA','TT','TG','TC','GA','GT','GG','GC','CA','CT','CG','CC']"
Now I get a (") at the beginning and end of this array.
Any ideas very welcome!
The text file has the base pairs laid out
1 2 3 4 A A
1 2 3 1 A T
...
Thanks
You're converting the list into a string. You really want to say
paths = [pair.replace(" ", "") for pair in paths]
That iterates over each pair in your list of strings and removes spaces rather than converting the entire list to a string.
paths = str(paths).replace(" ","")
Well of course it doesn't work, because you converted the list to a string (and you didn't convert it back). What you need to do is apply the replace to each element of the list, not to the string expression of the list.
paths = [str(x[4:7]).replace(" ","") for x in mm_start]
(that's only one way of many).
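If the pairs are read straight from the text file, the spaces can be avoided from the start by splitting each line into fields rather than slicing by character position. A minimal sketch, assuming every data line ends with the two bases as its last two whitespace-separated fields, as in the sample lines from the question (the function name read_pairs is hypothetical):

```python
def read_pairs(lines):
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or too-short lines
            pairs.append(''.join(fields[-2:]))  # join the last two fields without a space
    return pairs


sample = ['1 2 3 4 A A', '1 2 3 1 A T']
print(read_pairs(sample))
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines.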
