I am wondering about the best way to approach this particular problem, and whether any libraries exist for it (Python preferably, but I can be flexible if need be).
I have a file with a string on each line. I would like to find the longest common patterns and their locations in each line. I know that I can use SequenceMatcher to compare line one and two, one and three, and so on, and then correlate the results, but is there something that already does it?
Ideally these matches could appear anywhere on each line, but for starters I would be fine with them existing at the same offset in each line and go from there. Something like a compression library with a good API for accessing its string table might be ideal, but I have not found anything so far that fits that description.
For instance with these lines:
\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b
\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed
\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e
I would want to see that 0-1, and 10-12 match in all lines at the same position and line1[4,5] matches line2[5,6] matches line3[7,8].
Thanks,
If all you want is to find common substrings that are at the same offset in each line, all you need is something like this:
matches = []
# zip() returns an iterator in Python 3, so materialize it to index it and take its length
zipped_strings = list(zip(s1, s2, s3))
startpos = -1
for i in range(len(zipped_strings)):
    c1, c2, c3 = zipped_strings[i]
    # if you're not inside a match,
    # look for matching characters and save the match start position
    if startpos == -1 and c1 == c2 == c3:
        startpos = i
    # if you are inside a match,
    # look for non-matching characters, save the match to matches, reset startpos
    elif startpos > -1 and not c1 == c2 == c3:
        # matches will contain (startpos, endpos, matchstring) tuples
        matches.append((startpos, i, s1[startpos:i]))
        startpos = -1
# if you're still inside a match when you run out of string, save that match too!
if startpos > -1:
    endpos = len(zipped_strings)
    matches.append((startpos, endpos, s1[startpos:endpos]))
To find the longest common pattern regardless of location, SequenceMatcher does sound like the best idea, but instead of comparing string1 to string2 and then string1 to string3 and trying to merge the results, just get all common substrings of string1 and string2 (with get_matching_blocks), and then compare each result of that to string3 to get matches between all three strings.
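As a rough sketch of that idea (assuming s1, s2 and s3 hold the three lines; the helper name is just for illustration):
from difflib import SequenceMatcher

def common_to_all_three(s1, s2, s3):
    # collect every non-empty block that s1 and s2 share
    common = []
    for block in SequenceMatcher(None, s1, s2).get_matching_blocks():
        if block.size == 0:  # the final block is always a zero-length sentinel
            continue
        candidate = s1[block.a:block.a + block.size]
        # keep only the pieces of each candidate that also appear in s3
        for sub in SequenceMatcher(None, candidate, s3).get_matching_blocks():
            if sub.size > 0:
                common.append(candidate[sub.a:sub.a + sub.size])
    return common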
I have a web story that has censored words in it, marked with asterisks.
Right now I'm doing it with a simple and dumb str.replace,
but as you can imagine this is a pain, and I need to search the text to find every instance of the censoring.
Here are the "bastard" instances, which are capitalized, pluralized, and have asterisks in different places:
toReplace = toReplace.replace("b*stard", "bastard")
toReplace = toReplace.replace("b*stards", "bastards")
toReplace = toReplace.replace("B*stard", "Bastard")
toReplace = toReplace.replace("B*stards", "Bastards")
toReplace = toReplace.replace("b*st*rd", "bastard")
toReplace = toReplace.replace("b*st*rds", "bastards")
toReplace = toReplace.replace("B*st*rd", "Bastard")
toReplace = toReplace.replace("B*st*rds", "Bastards")
Is there a way to compare every word containing "*" (or any other replacement character) to an already compiled dict and replace it with the uncensored version of the word?
Maybe regex, but I don't think so.
Using regex alone will likely not give you a full solution here. You would have an easier time keeping a simple list of the words that you want to restore and using Levenshtein distance to determine which one is closest to a given word in which you have found a *.
One library that may help with this is fuzzywuzzy.
The two approaches that I can think of quickly:
Split the text so that you have 1 string per word. For each word, if '*' in word, then compare it to the list of replacements to find which is closest.
Use re.sub to identify the words that contain a * character, and write a function that you would use as the repl argument to determine which replacement it is closest to and return that replacement.
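As a rough sketch of the second approach, here using the standard library's difflib.get_close_matches instead of an external Levenshtein package (the word list, pattern and cutoff are assumptions):
import re
from difflib import get_close_matches

wordlist = ["bastard", "bastards"]  # assumed list of words to restore

def uncensor(match):
    word = match.group(0)
    # find the closest candidate to the censored word
    candidates = get_close_matches(word.lower(), wordlist, n=1, cutoff=0.6)
    if not candidates:
        return word  # nothing close enough, leave it untouched
    restored = candidates[0]
    # keep the original capitalization of the first letter
    return restored.capitalize() if word[0].isupper() else restored

text = "That B*st*rd and his b*stards ran off."
print(re.sub(r"\S*\*\S*", uncensor, text))
# That Bastard and his bastards ran off.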
Additional resources:
Python: find closest string (from a list) to another string
Find closest string match from list
How to find closest match of a string from a list of different length strings python?
You can use the re module to find matches between the censored word and the words in your wordlist.
Replace * with . (the dot has a special meaning in regex: it matches any single character) and then use re.match:
import re

wordlist = ["bastard", "apple", "orange"]

def find_matches(censored_word, wordlist):
    pat = re.compile(censored_word.replace("*", "."))
    return [w for w in wordlist if pat.match(w)]

print(find_matches("b*st*rd", wordlist))
Prints:
['bastard']
Note: If you want to match the exact word, add $ at the end of your pattern. That way appl* will not match applejuice in your dictionary, for example.
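For instance, a minimal tweak of the function above (the applejuice entry is just for illustration):
wordlist = ["apple", "applejuice"]

def find_exact_matches(censored_word, wordlist):
    # the trailing $ anchors the pattern to the end of the word
    pat = re.compile(censored_word.replace("*", ".") + "$")
    return [w for w in wordlist if pat.match(w)]

print(find_exact_matches("appl*", wordlist))  # ['apple'], not ['apple', 'applejuice']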
I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant, AATCGA, and will be used to locate the region to replace. This is then followed by a "spacer" (in this case a single character, but it could be more than one and needs to be specified), and finally by the last bit that follows, the "tail" (in this case four characters, but it could also be more or less). A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!
I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string, but making it a raw string means we
# don't need to escape the backslashes.
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
How do you know when to replace it when it's in reverse instead of forward? After all, all you're doing is matching a short string followed/led by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: a longer input with the expected output.
I have a list of keywords. A sample is:
['IO', 'IO Combination','CPI Combos']
Now what I am trying to do is see if any of these keywords is present in a string. For example, if my string is "there is a IO competition coming in Summer 2018", then since it contains IO, it should identify that; but if the string is "there is a competition coming in Summer 2018", then it should not identify any keywords.
I wrote this Python code but it also identifies IO in competition:
if any(word.lower() in string_1.lower() for word in keyword_list):
    print('FOUND A KEYWORD IN STRING')
I also want to identify which keyword was identified in the string (if any present). What is the issue in my code and how can I make sure that it matches only complete words?
Regex solution
You'll need to implement word boundaries here:
import re

keywords = ['IO', 'IO Combination', 'CPI Combos']
words_flat = "|".join(r'\b{}\b'.format(word) for word in keywords)
rx = re.compile(words_flat)

string = "there is a IO competition coming in Summer 2018"
match = rx.search(string)
if match:
    print("Found: {}".format(match.group(0)))
else:
    print("Not found")
Here, your list is joined with | and \b on both sides.
Afterwards, you may search with re.search() which prints "Found: IO" in this example.
Even shorter with a direct comprehension:
rx = re.compile("|".join(r'\b{}\b'.format(word) for word in keywords))
Non-regex solution
Please note that you can even use a non-regex solution for single words; you just have to reorder your comprehension and use split(), like:
found = any(word in keywords for word in string.split())
if found:
    # do sth. here
    pass
Notes
The latter has the drawback that strings like
there is a IO. competition coming in Summer 2018
# ---^---
won't work, while they do count as a "word" in the regex solution (hence the approaches yield different results). Additionally, because of the split() function, combined phrases like CPI Combos cannot be found. The regex solution has the advantage of also supporting lower- and uppercase scenarios (just pass flags=re.IGNORECASE to re.compile()).
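For example, a small sketch of the case-insensitive variant:
rx = re.compile("|".join(r'\b{}\b'.format(word) for word in keywords),
                flags=re.IGNORECASE)
match = rx.search("there is an io competition coming in Summer 2018")
print(match.group(0) if match else "Not found")  # io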
It really depends on your actual requirements.
for index, key in enumerate(mylist):
    if key.find(mystring) != -1:
        return index
It loops over your list; for every item in the list, it checks whether your string is contained in the item. find() returns -1 when the substring is not found, so if the result is not -1 the string is contained, and when that happens you get the index of the item where it was found with the help of enumerate().
I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that, as I have several substrings to replace, they can overlap, as in the example above. If I analyze the longest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I analyze the shortest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems I will need to replace from the longest substring to the shortest, but I am not sure how to keep a shorter substring from matching again inside text that has already been replaced. Also remember that a substring can appear several times in the string.
Note: assume the string ** will never occur in the original string, so we can use it to bold our words.
You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)
You could use a group that is non-capturing and not required. If you look at the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the" in the string and, if it is present, to include it in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub('(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> is used to get the entire string matched by the group repl.
You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through and replace like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
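A minimal sketch of that two-pass idea (the placeholder names are arbitrary; they just must not occur in the text or contain the words being replaced):
mystring = "Carl's house is big. He is asking 1M for that(the house)."

# first pass: longest to shortest, swap in unique placeholders
mystring = mystring.replace("the house", "AA_THE_HOUSE")
mystring = mystring.replace("house", "BB_HOUSE")

# second pass: swap the placeholders back, now bolded
mystring = mystring.replace("AA_THE_HOUSE", "**the house**")
mystring = mystring.replace("BB_HOUSE", "**house**")

print(mystring)
# Carl's **house** is big. He is asking 1M for that(**the house**).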
Replace the strings with some unique placeholder values (making sure a placeholder does not itself contain any of the strings being replaced), and then replace them back with the original string enclosed in ** to make them bold.
For example:
'the house' with 'temp_a'
'house' with 'temp_b'
then 'temp_a' with '**the house**'
and 'temp_b' with '**house**'
Should work fine. You can automate this by using two lists.
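A rough sketch of that automation with two lists (the names are assumed):
originals = ['the house', 'house']   # longest first
placeholders = ['temp_a', 'temp_b']  # must not contain any of the originals

text = "Carl's house is big. He is asking 1M for that(the house)."

# forward pass: hide every original behind its placeholder
for original, placeholder in zip(originals, placeholders):
    text = text.replace(original, placeholder)

# backward pass: restore each placeholder, bolded
for original, placeholder in zip(originals, placeholders):
    text = text.replace(placeholder, '**{}**'.format(original))

print(text)
# Carl's **house** is big. He is asking 1M for that(**the house**).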
Is it possible to get all overlapping matches which start from the same index but are from different matching groups?
e.g. when I look for the pattern "(A)|(AB)" in "ABC", the regex should return the following matches:
(0,"A") and (0,"AB")
For one possibility see the answer of Evpok. The second interpretation of your question can be that you want to match all patterns at the same time from the same position. You can use a lookahead expression in this case. E.g. the regular expression
(?=(A))(?=(AB))
will give you the desired result (i.e. all places where both patterns match together with the groups).
Update: With the additional clarification this can still be done with a single regex. You just have to make both groups above optional, i.e.
(?=(A))?(?=(AB))?(?:(?:A)|(?:AB))
Nevertheless, I wouldn't suggest doing so. You can much more easily look for each pattern separately and join the results later.
string = "AABAABA"
result = [(g.start(), g.group()) for g in re.compile('A').finditer(string)]
result += [(g.start(), g.group()) for g in re.compile('AB').finditer(string)]
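Sorting the combined list then groups the overlapping matches by position; for the string above this gives:
print(sorted(result))
# [(0, 'A'), (1, 'A'), (1, 'AB'), (3, 'A'), (4, 'A'), (4, 'AB'), (6, 'A')]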
I got this from somewhere, though I can't recall where or from whom:
def myfindall(regex, seq):
    resultlist = []
    pos = 0
    while True:
        result = regex.search(seq, pos)
        if result is None:
            break
        resultlist.append(seq[result.start():result.end()])
        pos = result.start() + 1
    return resultlist
It returns a list of all (even overlapping) matches, with the limitation of no more than one match starting at each index.
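For example, with a pattern and string chosen just for illustration, each overlapping run of A is reported once per starting position:
import re

print(myfindall(re.compile('A+'), 'AABAABA'))
# ['AA', 'A', 'AA', 'A', 'A']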