Searching for similar values within a regex string

I'm trying to use a regex to compare two lists whose strings are similar but not identical, keeping only the entries of list2 that match no word in list1. How do I fix the fault below?
Script:
import re
list1 = [
    'juice',
    'potato']
list2 = [
    'juice;44',
    'potato;55',
    'apple;66']
correlation = []
for a in list1:
    r = re.compile(r'\b{}\b'.format(a), re.I)
    for b in list2:
        if r.search(b):
            pass
        else:
            correlation.append(b)
print(correlation)
Output:
['potato;55', 'apple;66', 'juice;44', 'apple;66']
Desired Output:
['apple;66']

You can create a single regex pattern to match terms from list1 as whole words, and then use filter:
import re
list1 = ['juice', 'potato']
list2 = ['juice;44', 'potato;55', 'apple;66']
rx = re.compile(r'\b(?:{})\b'.format("|".join(list1)))
print( list(filter(lambda x: not rx.search(x), list2)) )
# => ['apple;66']
The regex is \b(?:juice|potato)\b. \b is a word boundary, so the regex matches juice or potato only as whole words. filter(lambda x: not rx.search(x), list2) drops every item of list2 that matches the regex.
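As a hedged extension of the answer above, the sketch below adds two things the original snippet does not have: re.escape on each term (so a term containing regex metacharacters is treated literally) and the re.I flag used in the question. The helper name build_word_filter and the two extra entries in list2 are made up for illustration.

```python
import re

def build_word_filter(terms):
    """Compile one case-insensitive pattern matching any term as a whole word."""
    escaped = (re.escape(t) for t in terms)  # treat each term literally
    return re.compile(r'\b(?:{})\b'.format('|'.join(escaped)), re.I)

list1 = ['juice', 'potato']
# 'Juice;77' and 'juicer;88' are illustrative additions, not from the question
list2 = ['juice;44', 'potato;55', 'apple;66', 'Juice;77', 'juicer;88']

rx = build_word_filter(list1)
unmatched = [item for item in list2 if not rx.search(item)]
print(unmatched)  # ['apple;66', 'juicer;88']
```

'juicer;88' is kept because \b requires a word boundary right after juice, and 'Juice;77' is dropped because of re.I.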

First, the inner and outer for loops must be swapped to make this work.
Then set a flag to False before the inner loop, set it to True inside the inner loop when a match is found, and after the loop append to correlation if the flag is still False.
The result looks like this:
import re
list1 = [
    'juice',
    'potato']
list2 = [
    'juice;44',
    'potato;55',
    'apple;66']
correlation = []
for b in list2:
    found = False
    for a in list1:
        r = re.compile(r'\b{}\b'.format(a), re.I)
        if r.search(b):
            found = True
    if not found:
        correlation.append(b)
print(correlation)
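The flag-based loop above can also be written with any(), which plays the role of the found flag. This is only a sketch of the same idea; compiling the patterns once up front and wrapping each word in re.escape are additions that the answer itself does not make.

```python
import re

list1 = ['juice', 'potato']
list2 = ['juice;44', 'potato;55', 'apple;66']

# Compile each whole-word pattern once instead of once per (a, b) pair
patterns = [re.compile(r'\b{}\b'.format(re.escape(a)), re.I) for a in list1]

# Keep b only if none of the patterns matches it
correlation = [b for b in list2 if not any(p.search(b) for p in patterns)]
print(correlation)  # ['apple;66']
```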

Convert list1 into a single regexp that matches all the words. Then keep each element of list2 that doesn't match the regexp.
regex = re.compile(r'\b(?:' + '|'.join(re.escape(word) for word in list1) + r')\b')
correlation = [a for a in list2 if not regex.search(a)]

Related

Del part of list in python by for loop

I am trying to remove some words from a string of words:
list1 = "abc dfc kmc jhh jkl"
My goal is to remove the words from 'dfc' to 'jhh'. I am new to Python, so I tried some index-based approaches I know from C#, but they don't work here.
I am trying this:
index=0
for x in list1:
    if x=='dfc'
        currentindex=index
        for y in list1[currentindex:]
            if y!='jhh'
                break;
            del list1[currentindex]
            currentindex=index
    elif x=='jhh'
        break;

Instead of a long for loop, a simple slice in Python does the trick:
words = ['abc', 'dfc', 'kmc', 'jhh', 'jkl']
del words[1:4]
print(words)
Indexes start at 0, so you want to delete indexes 1 to 3. The slice is written 1:4 because the end of a slice is exclusive: Python stops one position before the end index (here at index 3). This is much easier than a loop.
Here is your output:
['abc', 'jkl']
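If the positions of 'dfc' and 'jhh' are not known in advance, the slice bounds can be looked up with list.index instead of being hard-coded. The function name remove_range below is hypothetical, and the sketch assumes both words are present (list.index raises ValueError otherwise).

```python
def remove_range(words, first, last):
    """Return a copy of words without the span from first through last."""
    start = words.index(first)        # position of the first word to drop
    end = words.index(last, start)    # search for last at or after start
    return words[:start] + words[end + 1:]

words = "abc dfc kmc jhh jkl".split()
print(remove_range(words, "dfc", "jhh"))  # ['abc', 'jkl']
```

Returning a new list instead of using del leaves the caller's list untouched.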

>>> a = "abc dfc kmc jhh jkl"
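>>> print(a.split("dfc")[0] + a.split("jhh")[1].lstrip())
abc jkl
The lstrip() call drops the space left in front of the remainder, which would otherwise produce two spaces in the result.
You can wrap the same treatment in a lambda:
remove_between = lambda s, start, end: s.split(start)[0] + s.split(end)[1].lstrip()
print(remove_between(a, "dfc", "jhh"))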

First, split the string into words:
list1 = "abc dfc kmc jhh jkl"
words = list1.split(" ")
Next, iterate through the words until you find a match:
start_match = "dfc"
start_index = 0
end_match = "jhh"
end_index = 0
for i in range(len(words)):
    if words[i] == start_match:
        start_index = i
    if words[i] == end_match:
        end_index = i
        break
print(' '.join(words[:start_index] + words[end_index+1:]))
Note: In the case of multiple matches, this deletes the fewest words (it uses the last start_match seen before the first end_match).

list1= "abc dfc kmc jhh jkl".split() makes list1 as follows:
['abc', 'dfc', 'kmc', 'jhh', 'jkl']
Now if you want to remove a list element you can try either
list1.remove(item) #removes first occurrence of 'item' in list1
Or
list1.pop(index) #removes item at 'index' in list1

Create a list of words by splitting the string
list1= "abc dfc kmc jhh jkl".split()
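Then iterate over the list, using a flag variable to indicate whether an element should be dropped. Build a new list rather than calling remove() on the list being iterated, because removing items during iteration makes the loop skip elements.
flag = False
result = []
for x in list1:
    if x == 'dfc':
        flag = True    # start dropping words
    if not flag:
        result.append(x)
    if x == 'jhh':
        flag = False   # 'jhh' itself is dropped, later words are kept
list1 = result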

There are several problems with what you have tried, especially:
list1 is a string, not a list
when you write list1[i], you get the character at index i (not a word)
in your for loop, you try to modify the object you are iterating over, which is a very bad idea (and a string is immutable, so del on it fails anyway).
Here is my one-line suggestion using re.sub(), which substitutes the part of the string that matches the given regex pattern. It may be sufficient for your purpose:
import re
list1= "abc dfc kmc jhh jkl"
list1 = re.sub(r'dfc .* jhh ', "", list1)
print(list1)
Note: I kept the identifier list1 even if it is a string.
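One detail worth checking with the re.sub() approach is that .* is greedy. The sketch below uses a longer, made-up input string (not from the question) to show how the greedy pattern and its non-greedy variant .*? differ when the markers occur more than once.

```python
import re

# Illustrative input with two dfc ... jhh spans (an assumption, not from the question)
text = "abc dfc kmc jhh jkl dfc x jhh end"

# Greedy: removes everything from the first 'dfc ' to the last ' jhh '
greedy = re.sub(r'dfc .* jhh ', "", text)

# Non-greedy: removes each dfc ... jhh span separately
lazy = re.sub(r'dfc .*? jhh ', "", text)

print(greedy)  # abc end
print(lazy)    # abc jkl end
```

For the single-span string in the question both patterns give the same result.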

You can do it like this, although it removes only the single word "dfc" rather than the whole range:
test = list1.replace("dfc", "")

split and flatten tuple of tuples

What is the best way to split and flatten the tuple of tuples below?
I have this tuple of tuples:
(('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',), ('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',), ('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',), ('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
I need to:
split by \n, \r, \\n, \\r, ",", " "
and flatten it. So the end result should look like this:
['aaaa_BBB_wacker*','cccc', 'aaaa_BBB_tttt*','aaaa_BBB2_wacker','aaaa_BBB','BBB_ffff','aaaa_BBB2MM*','naaaa_BBB_cccc2MM*','BBBMM','BBB2MM BBB','aaaa_BBB_cccc2MM_tttt','aaaa_BBB_tttt', 'aaaa_BBB']
I tried the following and it eventually completes the job but I have to repeat it multiple times for each pattern.
patterns = [[i.split('\\r') for i in patterns]]
patterns = [item for sublist in patterns for item in sublist]
patterns = [item for sublist in patterns for item in sublist]
patterns = [[i.split('\\n') for i in patterns]]

You should use a regexp to split each string s:
import re
re.split(r'[\n\r, ]+', s)
Applying it to every string is easiest with a loop (here l is the tuple of one-element tuples):
patterns = []
for item in l:
    patterns += re.split(r'[\n\r, ]+', item[0])

Given
tups = (('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',),
('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',),
('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',),
('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
Do
import re
delimiters = ('\r', '\n', ',', ' ', '\\r', '\\n')
pat = '(?:{})+'.format('|'.join(map(re.escape, delimiters)))
result = [s for tup in tups for item in tup for s in re.split(pat, item)]
Notes. Calling re.escape on your delimiters makes sure that they are properly escaped for your regular expression. | makes them alternatives. ?: makes your delimiter group non-capturing so it isn't returned by re.split. + means match the previous group one or more times.
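One thing to watch for with re.split is that a string starting or ending with a delimiter yields empty strings. The sketch below shows one way to drop them, using a shortened, made-up input and itertools.chain.from_iterable for the flattening step; the pattern is written by hand rather than built with re.escape.

```python
import re
from itertools import chain

# Shortened illustrative data; the last string ends with delimiters
tups = (('a b',), ('c,d\r\ne',), ('f\\r\\ng, ',))

# Literal backslash-r / backslash-n, or a real CR, LF, comma, or space
pat = re.compile(r'(?:\\r|\\n|[\r\n, ])+')

# Flatten the tuple of tuples, split each string, and skip empty pieces
parts = [s for item in chain.from_iterable(tups)
         for s in pat.split(item) if s]
print(parts)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```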

Here is a one-liner, though it's not simple. You can chain as many replace() calls as you need, one per delimiter.
start = (('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',), ('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',), ('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',), ('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
output = [final_item for sublist in start for item in sublist for final_item in item.replace('\\r',' ').replace('\\n',' ').replace(',',' ').split()]

Filter strings where there are n equal characters in a row

Is there a way to filter, from a list of strings, those that contain for example 3 equal characters in a row? I wrote a function that does this, but I'm curious whether there is a more Pythonic, more efficient, or simpler way.
list_of_strings = []
def check_3_in_row(string):
    for ch in set(string):
        if ch*3 in string:
            return True
    return False
new_list = [x for x in list_of_strings if check_3_in_row(x)]
EDIT:
I've just found out one solution:
new_list = [x for x in set(keywords) if any(ch*3 in x for ch in x)]
But I'm not sure which way is faster - regexp or this.

You can use a regular expression, like this:
>>> import re
>>> list_of_strings = ["aaa", "dasdas", "aaafff", "afff", "abbbc"]
>>> [x for x in list_of_strings if re.search(r'(.)\1{2}', x)]
['aaa', 'aaafff', 'afff', 'abbbc']
Here, . matches any character, and the parentheses capture it in a group ((.)). The backreference \1 refers to that first captured group, and {2} requires it to appear two more times, which gives three equal characters in a row.
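Since the title asks about n equal characters, the same backreference pattern can be parameterized. This is a small sketch; the function name has_run is made up, and n is assumed to be at least 1.

```python
import re

def has_run(s, n):
    """Return True if s contains n equal characters in a row (n >= 1)."""
    # One captured character followed by n - 1 repetitions of it
    pattern = r'(.)\1{%d}' % (n - 1)
    return re.search(pattern, s) is not None

list_of_strings = ["aaa", "dasdas", "aaafff", "afff", "abbbc"]
print([x for x in list_of_strings if has_run(x, 3)])
# ['aaa', 'aaafff', 'afff', 'abbbc']
```

Note that . does not match a newline unless re.DOTALL is passed, so runs of newlines are not detected by this sketch.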

filter string in comprehension list with nested loops

Just for fun, I'd like to know whether this can be done in a list comprehension, something like:
text = "give the most longest word"
def LongestWord(text):
    l = 0
    words = list()
    for x in text.split():
        word = ''.join(y for y in x if y.isalnum())
        words.append(word)
    for word in words:
        if l < len(word):
            l = len(word)
            r = word
    return r

Not one but two list comprehensions:
s = 'ab, c d'
cmpfn = lambda x: -len(x)
sorted([''.join(y for y in x if y.isalnum()) for x in s.split()], key=cmpfn)[0]

Zero list comprehensions:
import re
def longest_word(text):
    return sorted(re.findall(r'\w+', text), key=len, reverse=True)[0]
print(longest_word("this is an example.")) # prints "example"
Or, if you insist, the same thing but with a list comprehension:
def longest_word(text):
    return [w for w in sorted(re.findall(r'\w+', text), key=len, reverse=True)][0]

No need for a list comprehension, really.
import re
my_longest_word = max(re.findall(r'\w+', text), key=len)
Alternatively, if you don't want to import re, you can still avoid a lambda expression and use max once again with one list comprehension:
my_longest_word = max([''.join(l for l in w if l.isalnum())
                       for w in text.split()], key=len)
How this works:
Splits the text on whitespace, then uses a list comprehension and isalnum() to drop the non-alphanumeric characters from each word.
Takes the max by length.
How the regex solution works:
\w+ matches every run of one or more word characters (letters, digits, and underscore).
findall() returns the matches as a list of strings.
max finds the longest element of the list.
Outputs (in both cases):
>>> text = "give the most longest word"
>>> my_longest_word
'longest'
>>> text = "what!! is ??with !##$ these special CharACTERS?"
>>> my_longest_word
'CharACTERS'
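Two edge cases are not covered by the outputs above: text with no words at all, and ties. A hedged sketch, assuming Python 3.4 or later for the default argument of max; the function name longest_word is reused here for illustration only.

```python
import re

def longest_word(text):
    """Return the longest run of word characters, or '' if there is none."""
    # default='' avoids the ValueError that max raises on an empty list
    return max(re.findall(r'\w+', text), key=len, default='')

print(longest_word("give the most longest word"))  # longest
print(repr(longest_word("!!! ???")))               # ''
print(longest_word("aa bb"))                       # aa
```

On a tie, max returns the first of the longest words it encounters.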

Cross-matching two lists

I have two lists, and I want to find out whether any string in the second list occurs as a substring of an element in the first list.
["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
["2311","233412","2231"]
For example, "Po2311tato" matches "2311". Every element of the first list that contains one of the strings from the second list should be placed in a new list, so the new list would be ["Po2311tato","Pin2231eap","Orange2231edg"]

You can use the syntax 'substring' in string to do this:
a = ["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
b = ["2311","233412","2231"]
def has_substring(word):
    for substring in b:
        if substring in word:
            return True
    return False
print(list(filter(has_substring, a)))
Hope this helps!

This can be made a little more concise than jobby's answer by using a list comprehension:
>>> list1 = ["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
>>> list2 = ["2311","233412","2231"]
>>> list3 = [string for string in list1 if any(substring in string for substring in list2)]
>>> list3
['Po2311tato', 'Pin2231eap', 'Orange2231edg']
Whether or not this is clearer / more elegant than jobby's version is a matter of taste!

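import re
list1 = ["Po2311tato","Pin2231eap","Orange2231edg","add22131dfes"]
list2 = ["2311","233412","2231"]
matchlist = []
for str1 in list1:
    for str2 in list2:
        # re.escape keeps str2 literal even if it contains regex metacharacters
        if re.search(re.escape(str2), str1):
            matchlist.append(str1)
            break  # one match is enough, move on to the next str1
print(matchlist)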
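The nested loop above can also be collapsed into a single compiled alternation, so each string in list1 is scanned once. This is a sketch of that variation rather than part of the original answer; the variable name rx is arbitrary.

```python
import re

list1 = ["Po2311tato", "Pin2231eap", "Orange2231edg", "add22131dfes"]
list2 = ["2311", "233412", "2231"]

# One pattern of the form 2311|233412|2231, with each piece escaped
rx = re.compile('|'.join(map(re.escape, list2)))

matchlist = [s for s in list1 if rx.search(s)]
print(matchlist)  # ['Po2311tato', 'Pin2231eap', 'Orange2231edg']
```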
