Extracting all uppercase words following each other from string - python

Out of a string like this "A B c de F G A" I would like to get the following list: ["A B", "F G A"]. That means, I need to get all the sequences of uppercase words.
I tried something like this:
text = "A B c de F G A"
result = []
for i, word in enumerate(text.split()):
    if word[0].isupper():
        s = ""
        while word[0].isupper():
            s += word
            i += 1
            word = text[i]
        result.append(s)
But it produces the following output: ['A', 'BB', 'F', 'G', 'A']
I suppose it happens because you can't skip a list element by just incrementing i. How can I avoid this situation and get the right output?

You can use itertools.groupby:
import itertools
s = "A B c de F G A"
new_s = [' '.join(b) for a, b in itertools.groupby(s.split(), key=str.isupper) if a]
Output:
['A B', 'F G A']
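To see why the one-liner works, it may help to look at the intermediate groups. The sketch below only prints what groupby yields for the sample string; the variable names words and runs are illustrative and not part of the answer above.

```python
from itertools import groupby

words = "A B c de F G A".split()

# groupby yields one (key, group) pair per run of consecutive words
# that share the same key, here the result of str.isupper.
runs = [(is_upper, list(group)) for is_upper, group in groupby(words, key=str.isupper)]
print(runs)
# [(True, ['A', 'B']), (False, ['c', 'de']), (True, ['F', 'G', 'A'])]

# Keeping only the runs whose key is True gives the uppercase sequences.
print([" ".join(group) for is_upper, group in runs if is_upper])
# ['A B', 'F G A']
```

Note that str.isupper is True only for words written entirely in uppercase, so a word such as "Bbb" would end a run.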

You can use re.split to split a string with a regex.
import re
def get_upper_sequences(s):
    return re.split(r'\s+[a-z][a-z\s]*', s)
Example
>>> get_upper_sequences( "A B c de F G A")
['A B', 'F G A']
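One caveat with splitting: when the string begins or ends with a lowercase word, re.split can leave stray pieces in the result. A possible alternative, sketched here under the assumption that "uppercase word" means a word made only of the letters A to Z, is to match the runs directly with re.findall. The function name find_upper_sequences is hypothetical.

```python
import re

def find_upper_sequences(s):
    # Match one all-uppercase word, optionally followed by more
    # all-uppercase words separated by whitespace. The group is
    # non-capturing, so findall returns the whole match.
    return re.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b', s)

print(find_upper_sequences("A B c de F G A"))  # ['A B', 'F G A']
print(find_upper_sequences("x A B c"))         # ['A B']
```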

Here is solution without itertools or re:
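def findTitles(text):
    # Replace every non-title-case word with a space, so runs end up separated by several spaces
    filtered = " ".join([x if x.istitle() else " " for x in text.split()])
    # Split on a double space; a single space would break each run into single words
    return [y.strip() for y in filtered.split("  ") if y.strip()]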
print(findTitles(text="A B c de F G A"))
#['A B', 'F G A']
print(findTitles(text="A Bbb c de F G A"))
#['A Bbb', 'F G A']

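The following example extracts all sequences of consecutive uppercase words from a string:
import re
string = "A B c de F G A"
# Split on runs of lowercase letters ('+' instead of '*', so the pattern never matches the empty string), then drop whitespace-only pieces
[val.strip() for val in re.split('[a-z]+', string) if val.strip()]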
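The loop from the question can also be repaired without any library. The sketch below is one possible fix, not the asker's code: instead of trying to advance the loop index by hand, it collects the current run in a list and flushes it whenever a non-uppercase word appears. The name upper_runs is made up for this example.

```python
def upper_runs(text):
    result = []
    current = []  # words of the run being collected
    for word in text.split():
        if word[0].isupper():
            current.append(word)
        elif current:
            # A lowercase word ends the current run.
            result.append(" ".join(current))
            current = []
    if current:
        # Flush a run that reaches the end of the string.
        result.append(" ".join(current))
    return result

print(upper_runs("A B c de F G A"))  # ['A B', 'F G A']
```

Reassigning i inside a for loop over enumerate has no effect on the next iteration, which is why the original attempt revisits words it has already consumed.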

Related

How to compare the elements of the list and matrix and eliminate spaces

I have a matrix(l2) and a list(l1).
I want to check whether any word of the list (the elements of l1) appears in the matrix. If it does, the algorithm should join the separated pieces of that word. For example,
l1=['soheila','mahin','rose','pari']
l2=[['so he ila ha','soh e i la h a','so h eil a ha'],
    ['ma h in','m ah in','mahi n'],
    ['r os es', 'ro se s','ros e s'],
    ['pa ri sa','pari s a','p ar i sa']]
and my desire output is:
output=[['soheila ha','soheila h a','soheila ha'],
        ['mahin','mahin','mahin'],
        ['roses', 'rose s','rose s'],
        ['pari sa','pari s a','pari sa']]
As the example shows, the spaces inside the part of each string that matches a word from the list are removed, while the rest of the string is left unchanged.
I use this code to delete all of the spaces between characters in the matrix:
import numpy.core.defchararray as np_f
l3=list()
l3=l2
new_data = np_f.replace(l3, ' ', '')
#print(new_data)
for count_i, value in enumerate(l2):
    result = [i for i in l3[count_i] and new_data[count_i] if l1[count_i] in i]
    print(result)
But I don't know how to code the condition of the problem to produce the output matrix. I have searched a lot, and I know I could do it with equalsIgnoreCase in Java, but I am a beginner in Python. Could you help me solve this problem and produce the output matrix?
You can do it like this
l1=['soheila','mahin','rose','pari']
l2=[['so he ila ha','soh e i la h a','so h eil a ha'],
    ['ma h in','m ah in','mahi n'],
    ['r os es', 'ro se s','ros e s'],
    ['pa ri sa','pari s a','p ar i sa']]
l = [[ word + " " + subword.replace(' ', '').replace(word, '') for subword in sublist if word in subword.replace(' ', '')] for sublist in l2 for word in l1]
result = list(filter(None, l))
print(result)
Output:
[['soheila ha', 'soheila ha', 'soheila ha'], ['mahin ', 'mahin ', 'mahin '], ['rose s', 'rose s', 'rose s'], ['pari sa', 'pari sa', 'pari sa']]
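The output above differs slightly from the one requested in the question, which keeps the original spacing of whatever follows the matched word. A possible sketch that reproduces the requested output is shown below. It assumes that the word, if present, starts at the beginning of the string and that row i of l2 belongs to l1[i]; the helper name attach is hypothetical.

```python
def attach(word, spaced):
    tokens = spaced.split()
    if not tokens:
        return spaced
    joined = ""
    # Glue tokens together until the glued text is at least as long as the word.
    for i, tok in enumerate(tokens):
        joined += tok
        if len(joined) >= len(word):
            break
    if not joined.startswith(word):
        return spaced  # the word is not at the start: leave the string alone
    # Keep the remaining tokens with their original separation.
    return " ".join([joined] + tokens[i + 1:])

l1 = ['soheila', 'mahin', 'rose', 'pari']
l2 = [['so he ila ha', 'soh e i la h a', 'so h eil a ha'],
      ['ma h in', 'm ah in', 'mahi n'],
      ['r os es', 'ro se s', 'ros e s'],
      ['pa ri sa', 'pari s a', 'p ar i sa']]

output = [[attach(word, s) for s in row] for word, row in zip(l1, l2)]
print(output)
```

When the word ends in the middle of a token, as in 'r os es', the rest of that token stays attached, which gives 'roses' as in the question's expected output.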
If you don't mind my asking, why not just remove the spaces from every string in l2:
l3 = []
for row in l2:
    l3.append(["".join(s.split(" ")) for s in row])
Then check normally whether a word in l1 is in the corresponding row of l3.
I haven't checked whether this works, but it is what I thought of off the top of my head.

Merging items in list given condition

Let's say I have ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'].
If the second word in each element of the list is same as first word in any other elements in the list, they should be merged into one item. The order matters as well.
['A B C D E G', 'X Y Z'] should be the final product.
Letters will not form a cycle, i.e., ['A B', 'B C', 'C A'] is not possible.
As for why G was added at the end, let's say we are at 'A B C' and we see 'C D'. And later on in the list we have 'C G'. We add 'C D' since it comes first, so we have 'A B C D', and then if there is 'D E' we merge that to 'A B C D E'. Now at the end of a 'chain', we add 'C G' so we have 'A B C D E G'. If there had been a 'B H', since B comes first, then it would be 'A B C D E H G'.
Since order matters, ['A B', 'C A'] is still ['A B', 'C A'] since B does not connect to C.
Another example is:
If we have ['A B', 'A C'] then at step 1 we just have A B and then we see A C and we merge that into A B and we have A B C.
I have tried many things, such as dissecting each element into two lists, then indexing and removing etc. It is way too complex and I feel there should be a more intuitive way to solve this in Python. Alas, I can't seem to figure it out.
A simple algorithm solving this appears to be:
initialize results as empty list
repeat for each pair in input list:
    repeat for each sublist R in results:
        if R contains the first item of pair, append second item to R and continue with next pair
    if no sublist R contained the first item of pair, append pair as new sublist to results
The implementation in Python is straightforward and should be self-explanatory:
def merge_pairs(pairs):
    results = []
    for pair in pairs:
        first, second = pair.split()
        for result in results:
            if first in result:
                result.append(second)
                break
        else:
            results.append([first, second])
    return [' '.join(result) for result in results]
The only extra steps are the conversions between space-separated letters and lists of letters by .split() and ' '.join().
Note the else clause of the for statement which is only executed if no break was encountered.
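For readers who have not met for/else before, here is a small standalone illustration of the construct, unrelated to the merging problem itself. The function first_negative is a made-up example.

```python
def first_negative(numbers):
    for n in numbers:
        if n < 0:
            result = n
            break          # leaving via break skips the else block
    else:
        result = None      # runs only when the loop finished without break
    return result

print(first_negative([3, -1, -5]))  # -1
print(first_negative([1, 2]))       # None
```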
Some examples:
>>> merge_pairs(['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'])
['A B C D E G', 'X Y Z']
>>> merge_pairs(['A B', 'A C'])
['A B C']
>>> merge_pairs(['B C', 'A B'])
['B C', 'A B']
A bit messy, but this works:
a = ['AB', 'BC', 'XY', 'CD', 'YZ', 'DE', 'CG']
used_item = []
result = []
for f, fword in enumerate(a):
    for s, sword in enumerate(a):
        if f == s:
            continue
        if fword[1] == sword[0]:
            if f in used_item or s in used_item:
                idx = [i for i, w in enumerate(result) if fword[1] in w][0]
                result = [r + sword[1] if i == idx else r for i, r in enumerate(result)]
                used_item.append(f)
                used_item.append(s)
            else:
                result.append(fword+sword[1])
                used_item.append(f)
                used_item.append(s)
print(result)
The output is ['ABCDGE', 'XYZ'].
You could sort the letters of 'ABCDGE' if necessary.
I appreciate all the above answers and wanted to attempt the question using Graph Theory.
This question is a classic example of connected components, where we can assign a unique identifier to each node (nodes are the different characters in the list = A, B, C, .... Z) accordingly.
Also, as we need to maintain the order, DFS is the correct choice to proceed with.
Algorithm:
1. Treat the characters A, B, C, .... Z as 26 different nodes.
2. Perform DFS on each node.
   2.1 If the node has not been assigned a unique identifier:
       2.1.1. Assign a new unique identifier to the node.
       2.1.2. dfs(node)
   2.2 Else:
       do nothing
The function dfs() will group the nodes according to the unique identifier. While performing DFS, the child gets the same unique identifier as its parent.
Have a look at the following implementation:
class Graph:
    def __init__(self, n):
        self.number_of_nodes = n
        self.visited = []
        self.adjacency_list = []
        self.identifier = [-1] * self.number_of_nodes
        self.merged_list = []

    def add_edge(self, edge_start, edge_end):
        if(self.encode(edge_end) not in self.adjacency_list[self.encode(edge_start)]):
            self.adjacency_list[self.encode(edge_start)] = self.adjacency_list[self.encode(edge_start)] + [self.encode(edge_end)]

    def initialize_graph(self):
        self.visited = [False] * self.number_of_nodes
        for i in range(0, self.number_of_nodes):
            self.adjacency_list = self.adjacency_list + [[]]

    def get_adjacency_list(self):
        return self.adjacency_list

    def encode(self, node):
        return ord(node) - 65

    def decode(self, node):
        return chr(node + 65)

    def dfs(self, start_index):
        if(self.visited[self.encode(start_index)] == True):
            return
        self.visited[self.encode(start_index)] = True
        for node in self.adjacency_list[self.encode(start_index)]:
            if(self.identifier[node] == -1):
                self.identifier[node] = self.identifier[self.encode(start_index)]
            if(self.visited[node] == False):
                self.merged_list[self.identifier[node]] = self.merged_list[self.identifier[node]] + self.decode(node)
                self.dfs(self.decode(node))
graph = Graph(26)
graph.initialize_graph()
input_list = ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G']
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    edge_end = edge[1]
    graph.add_edge(edge_start, edge_end)
unique_identifier = 0
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    if(graph.identifier[graph.encode(edge_start)] == -1):
        graph.identifier[graph.encode(edge_start)] = unique_identifier
        graph.merged_list = graph.merged_list + [edge_start]
        unique_identifier = unique_identifier + 1
        graph.dfs(edge_start)
print(graph.merged_list)
Output:
['ABCDEG', 'XYZ']

How to find a substring in a string list?

I try to find some string in a list, but have problems because of word order.
list = ['a b c d', 'e f g', 'h i j k']
str = 'e g'
I need to find the 2nd item in a list and output it.
You can use combination of any() and all() to check the presence in one line:
>>> my_list = ['a b c d', 'e f g', 'h i j k']
>>> my_str = 'e g'
>>> any(all(s in sub_list for s in my_str.split()) for sub_list in my_list)
True
The expression above returns True or False depending on whether every character of your search string is present in some item of the list.
To also get the matching item(s) as the return value, replace any() with a list comprehension:
>>> [sub_list for sub_list in my_list if all(s in sub_list for s in my_str.split())]
['e f g']
It'll return the list of strings containing your chars.
You can try:
for l in list:
    l_words = l.split(" ")
    if all([x in l_words for x in str.split(" ")]):
        print(l_words)
You can try this
list = ['a b c d', 'e f g', 'h i j k']
str = list[2].split()
for letter in str:
    print(letter)
This can be achieved by using sets and list comprehension
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
print([i for i in ls if len(set(s.replace(" ", "")).intersection(set(i.replace(" ", "")))) == len(s.replace(" ", ""))])
OR
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
s_set = set(s.replace(" ", ""))
print([i for i in ls if len(s_set.intersection(set(i.replace(" ", "")))) == len(s_set)])
Output
['e f g']
The list comprehension keeps only the items of ls that contain every character of s.
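The set-based answers above compare individual characters, so an item such as 'eg h' would also match the query 'e g'. If whole words should be compared instead, one possible sketch is a subset test on the split words. The function name items_containing_words and the extra item 'eg h' are illustrative additions.

```python
def items_containing_words(items, query):
    wanted = set(query.split())
    # An item matches when every query word is one of its words,
    # regardless of order or of the words in between.
    return [item for item in items if wanted <= set(item.split())]

items = ['a b c d', 'e f g', 'h i j k', 'eg h']
print(items_containing_words(items, 'e g'))  # ['e f g']
print(items_containing_words(items, 'g e'))  # ['e f g']
```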

Python dictionary, constant complexity way to return all keys in dict contain certain string

I have a dictionary like: mydict = {'A B C':0, 'A B E':1, 'E F':0}
Then I have a search key search_string = 'A B'
I would like to find all keys (and their values) in mydict that contain search_string. In this case, 'A B C' and 'A B E' qualify.
Since mydict can be very large, is there a constant-time way to search, rather than:
result = [search_string in key for key, val in mydict.items()]
I am also open to restructure the dictionary if needed.
If the search is always for a prefix of the keys, then you can use a prefix tree (trie); the third-party pytrie package used below provides one.
"Trie allows finding matches in O(M) time, where M is the maximum string length" (reference), i.e. the cost depends on the maximum key length rather than on the number of keys.
Code
from pytrie import StringTrie
def create_prefix(dictionary):
    """Create a prefix tree from the keys of a dictionary."""
    # create empty trie
    trie = StringTrie()
    for k in dictionary:
        trie[k] = k
    return trie
Test 1
# Preprocess to create prefix tree
mydict = {'A B C':0, 'A B E':1, 'E F':0}
prefix_tree = create_prefix(mydict)
# Now you can reuse the prefix tree multiple times to speed up individual searches
for search_string in ['A B', 'A B C', 'E', 'B']:
    results = prefix_tree.values(search_string)  # .values returns the stored keys that have this prefix
    if results:
        print(f'Search String {search_string} found in keys {results}')
    else:
        print(f'Search String {search_string} not found')
Output
Search String A B found in keys ['A B C', 'A B E']
Search String A B C found in keys ['A B C']
Search String E found in keys ['E F']
Search String B not found
Test 2 (added to answer question from OP)
mydict = {'A B C':0, 'A B C D':0, 'A B C D E':0}
prefix_tree = create_prefix(mydict)
# Now you can reuse the prefix tree multiple times to speed up individual searches
for search_string in ['A B', 'A B C', 'A B C D', 'A B C D E', 'B C']:
    results = prefix_tree.values(search_string)  # .values returns the stored keys that have this prefix
    if results:
        print(f'Search String {search_string} found in keys {results}')
    else:
        print(f'Search String {search_string} not found')
Output
Search String A B found in keys ['A B C', 'A B C D', 'A B C D E']
Search String A B C found in keys ['A B C', 'A B C D', 'A B C D E']
Search String A B C D found in keys ['A B C D', 'A B C D E']
Search String A B C D E found in keys ['A B C D E']
Search String B C not found
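To illustrate what a prefix tree does internally, here is a minimal character-level trie in plain Python. It is only a sketch of the data structure and does not reflect how pytrie is implemented; the class Trie and its methods insert and keys_with_prefix are names invented for this example.

```python
class Trie:
    """Minimal character-level prefix tree (illustrative only)."""

    def __init__(self):
        self.children = {}  # maps one character to the child node
        self.key = None     # set when a complete key ends at this node

    def insert(self, key):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.key = key

    def keys_with_prefix(self, prefix):
        # Walk down one node per character: O(len(prefix)) steps.
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every complete key stored below that node.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            if current.key is not None:
                found.append(current.key)
            stack.extend(current.children.values())
        return sorted(found)

mydict = {'A B C': 0, 'A B E': 1, 'E F': 0}
trie = Trie()
for k in mydict:
    trie.insert(k)

matches = {k: mydict[k] for k in trie.keys_with_prefix('A B')}
print(matches)  # {'A B C': 0, 'A B E': 1}
```

The descent depends only on the length of the search string; collecting the results then costs time proportional to the size of the matching subtree, so the total is not constant but is independent of the keys that do not match.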
You have two potential solutions here- the first doesn't have O(1) complexity, but it's probably the way you'll want to go:
We can try building a tree and doing the search that way- so essentially:
You could have mydict look like this:
test_dict = {
    'A': {
        'B': {
            'C': 0,
            'E': 1
        }
    },
    'E': {
        'F': 1
    }
}
def get_recursive_values(mydict):
    results = []
    for key in mydict:
        if isinstance(mydict[key], dict):
            results.extend(get_recursive_values(mydict[key]))
        else:
            results.append(mydict[key])
    return results

def search(mydict, search_text):
    components = search_text.split(' ')
    if components[0] in mydict:
        next_res = mydict[components[0]]
        if isinstance(next_res, dict):
            if len(components) == 1:
                return get_recursive_values(next_res)
            return search(next_res, " ".join(components[1:]))
        else:
            return [mydict[components[0]]]
    raise KeyError(components[0])
Probably could be written a little nicer, but that'll work for you- try calling search(test_dict, 'A B')
and you'll get both the results.
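The nested layout does not have to be typed by hand. A possible sketch for deriving it from a flat dictionary like the one in the question follows. It assumes that no key is a word-wise prefix of another key (for example 'A B' together with 'A B C'), since a node cannot be both a value and a sub-dictionary in this layout; the helper name nest is hypothetical.

```python
def nest(flat):
    tree = {}
    for key, value in flat.items():
        *parents, last = key.split(' ')
        node = tree
        # Walk down, creating intermediate dictionaries as needed.
        for part in parents:
            node = node.setdefault(part, {})
        node[last] = value
    return tree

flat = {'A B C': 0, 'A B E': 1, 'E F': 0}
print(nest(flat))
# {'A': {'B': {'C': 0, 'E': 1}}, 'E': {'F': 0}}
```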
Another potential solution, if you don't care about insertion time, is to store the list of values under every possible prefix of every key. This may sound a bit ridiculous, but lookups then take O(1) time (on average, as for any dict) while insertion gets slower, i.e.
mydict = {
    'A': [0, 1],
    'A B': [0, 1],
    'A B C': [0],
    'A B E': [1],
    'E': [1],
    'E F': [1]
}
def insert(mydict, key, value):
    # Append the value under every space-separated prefix of the key, including the key itself
    parts = key.split(' ')
    for i in range(1, len(parts) + 1):
        mydict.setdefault(' '.join(parts[:i]), []).append(value)

Specialized Powerset Requirement

Let's say I have compiled five regular expression patterns and then created five Boolean variables:
a = re.search(first, mystr)
b = re.search(second, mystr)
c = re.search(third, mystr)
d = re.search(fourth, mystr)
e = re.search(fifth, mystr)
I want to use the Powerset of (a, b, c, d, e) in a function so it finds more specific matches first then falls through. As you can see, the Powerset (well, its list representation) should be sorted by # of elements descending.
Desired behavior:
if a and b and c and d and e:
    return 'abcde'
if a and b and c and d:
    return 'abcd'
[... and all the other 4-matches ]
[now the three-matches]
[now the two-matches]
[now the single matches]
return 'No Match' # did not match anything
Is there a way to utilize the Powerset programmatically and ideally, tersely, to get this function's behavior?
You could use the powerset() generator function recipe in the itertools documentation like this:
from itertools import chain, combinations
from pprint import pprint
import re
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
mystr = "abcdefghijklmnopqrstuvwxyz"
first = "a"
second = "B" # won't match, should be omitted from result
third = "c"
fourth = "d"
fifth = "e"
a = 'a' if re.search(first, mystr) else ''
b = 'b' if re.search(second, mystr) else ''
c = 'c' if re.search(third, mystr) else ''
d = 'd' if re.search(fourth, mystr) else ''
e = 'e' if re.search(fifth, mystr) else ''
elements = (elem for elem in [a, b, c, d, e] if elem != '')  # compare with !=, not 'is not'
spec_ps = [''.join(item for item in group)
           for group in sorted(powerset(elements), key=len, reverse=True)
           if any(item for item in group)]
pprint(spec_ps)
Output:
['acde',
'acd',
'ace',
'ade',
'cde',
'ac',
'ad',
'ae',
'cd',
'ce',
'de',
'a',
'c',
'd',
'e']
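If only the largest-first ordering is needed, the sort can be avoided by generating the combinations from the largest size downwards. The generator below is a sketch of that idea; the name descending_subsets is made up, and the input 'acde' stands for the labels that matched in the example above.

```python
from itertools import combinations

def descending_subsets(labels):
    # Yield the non-empty subsets, largest first, without sorting.
    for size in range(len(labels), 0, -1):
        for combo in combinations(labels, size):
            yield ''.join(combo)

print(list(descending_subsets('acde')))
# ['acde', 'acd', 'ace', 'ade', 'cde', 'ac', 'ad', 'ae', 'cd', 'ce', 'de', 'a', 'c', 'd', 'e']
```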
First, those aren't booleans; they're either match objects or None. Second, going through the power set would be a terribly inefficient way to go about this. Just stick each letter in the string if the corresponding regex matched:
return ''.join(letter for letter, match in zip('abcde', [a, b, c, d, e]) if match)
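Put into a complete function, that one-liner might look like the sketch below. The function name match_label, its parameters, and the sample patterns are assumptions made for illustration; the 'No Match' fallback is taken from the question.

```python
import re

def match_label(patterns, text, labels='abcde'):
    # Keep the label of every pattern that matches somewhere in the text.
    hits = ''.join(label for label, pattern in zip(labels, patterns)
                   if re.search(pattern, text))
    # An empty string means nothing matched.
    return hits or 'No Match'

mystr = "abcdefghijklmnopqrstuvwxyz"
print(match_label(['a', 'B', 'c', 'd', 'e'], mystr))  # acde
print(match_label(['A', 'B', 'C', 'D', 'E'], mystr))  # No Match
```

Because every label is decided independently, the result is automatically the most specific combination, which is why enumerating the power set is unnecessary.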
