Let's say I have 2 strings
AAABBBCCCCC
and
AAAABBBBCCCC
To make these strings as similar as possible, given that I can only remove characters, I should:
delete the last C from the first string
delete the last A and the last B from the second string,
so that they become
AAABBBCCCC
What would be an efficient algorithm to find out which characters to remove from each string?
I'm currently crushing my brain cells thinking about a solution that takes substrings of one string and looks for them in the other string.
Levenshtein distance tells you how many edits are needed to convert one string into another. With a small change to an implementation's source, you can get not only the distance but also the edits themselves.
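Since only deletions are allowed here, the string both inputs can be reduced to is their longest common subsequence (LCS). A minimal sketch of that idea follows; the function name `deletions_for_common` and its return shape are my own choices for illustration, not taken from any library. When several equally good answers exist, it returns just one of them, so the exact positions reported may differ from the ones in the question.

```python
def deletions_for_common(s1, s2):
    """Return (common, drop1, drop2): one LCS and the indexes to delete."""
    n, m = len(s1), len(s2)
    # lcs[i][j] holds the LCS length of the suffixes s1[i:] and s2[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start, recording which characters get dropped.
    i = j = 0
    common, drop1, drop2 = [], [], []
    while i < n and j < m:
        if s1[i] == s2[j]:
            common.append(s1[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            drop1.append(i)
            i += 1
        else:
            drop2.append(j)
            j += 1
    # Whatever is left over in either string has no partner and is dropped.
    drop1.extend(range(i, n))
    drop2.extend(range(j, m))
    return "".join(common), drop1, drop2
```

This runs in time and memory proportional to len(s1) * len(s2), which is fine for short strings like the ones above.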
How about using difflib?
import difflib
s1 = 'AAABBBCCCCC'
s2 = 'AAAABBBBCCCC'
for difference in difflib.ndiff(s1, s2):
    # Each item is a two-character code ('+ ', '- ' or '  ') followed by the character.
    print(difference, end=' ')
    if difference[0] == '+':
        print('remove this char from s2')
    elif difference[0] == '-':
        print('remove this char from s1')
    else:
        print('no change here')
This prints the differences between the two strings, which you can then use to decide what to remove. Here is the output:
A no change here
A no change here
A no change here
+ A remove this char from s2
+ B remove this char from s2
B no change here
B no change here
B no change here
C no change here
C no change here
C no change here
C no change here
- C remove this char from s1
Don't know if it's the fastest, but as code goes, it is at least short:
import difflib
''.join([c[-1] for c in difflib.Differ().compare('AAABBBCCCCC','AAAABBBBCCCC') if c[0] == ' '])
I think a regular expression can do this. It's a string distance problem.
Say we have two strings:
import re
str1 = 'abc'
str2 = 'aabbcc'
First, I take the shorter string and construct a regular expression like this:
regex = r'(\w*)' + r'(\w*)'.join(list(str1)) + r'(\w*)'
Then we can search:
matches = re.search(regex, str2)
I use parentheses to capture the sections I am interested in.
The groups in matches.groups() hold the extra characters in the longer string, so together they give the distance between the two strings. From them I can work out which characters should be removed.
That's my idea; I hope it helps.
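To make the idea above concrete, here is a small sketch. The helper name `extra_characters` is hypothetical, and I added re.escape so that characters with a special meaning in regular expressions are matched literally. It only works when every character of the shorter string appears, in order, in the longer one, and when the extra characters are word characters matched by \w.

```python
import re

def extra_characters(short, long):
    """Return the capture groups holding the characters that `long` has
    beyond `short`, or None when `short` is not a subsequence of `long`."""
    pattern = r'(\w*)' + r'(\w*)'.join(re.escape(ch) for ch in short) + r'(\w*)'
    match = re.search(pattern, long)
    return match.groups() if match else None

groups = extra_characters('abc', 'aabbcc')
print(groups)                        # the extra characters, group by group
print(sum(len(g) for g in groups))   # how many characters would be removed
```

Because the groups are greedy, there can be several valid ways to split the extra characters; the regex engine just picks one.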
I have a string of text
hfHrpphHBppfTvmzgMmbLbgf
I have separated this string into two halves
hfHrpphHBppf,TvmzgMmbLbgf
I'd like to check whether any of the characters in the first string also appear in the second string, treating lowercase and uppercase characters as distinct (so if string 1 had a and string 2 had A, this would not be a match).
For the strings above, this would return:
f
split_text = ['hfHrpphHBppf', 'TvmzgMmbLbgf']
for char in split_text[0]:
    if char in split_text[1]:
        print(char)
There is probably a better way to do it, but this is a quick and simple way to do what you want. Note that it prints a character once for every time it occurs in the first string.
Edit:
split_text = ['hfHrpphHBppf', 'TvmzgMmbLbgf']
found_chars = []
for char in split_text[0]:
    if char in split_text[1] and char not in found_chars:
        found_chars.append(char)
        print(char)
There is almost certainly a better way of doing this, but this is a way of doing it with the answer I already gave
You could use the "in" keyword.
Something like this:
for i in range(len(word1)):
    if word1[i] in word2:
        print(word1[i])
Not optimal, but it should print all the letters the two words have in common.
You can achieve this using set() and intersection
text = "hfHrpphHBppf,TvmzgMmbLbgf"
text = text.split(",")
print(set(text[0]).intersection(set(text[1])))
You can use a list comprehension to check whether letters from string a appear in string b.
a='hfHrpphHBppf'
b='TvmzgMmbLbgf'
c=[x for x in a if x in b]
print(' '.join(set(c)))
The output will be:
f
But you can use a for loop, too:
a='hfHrpphHBppf'
b='TvmzgMmbLbgf'
c=[]
for i in a:
    if i in b:
        c.append(i)
print(set(c))
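If the order of first appearance matters, or the second string is long, a variant of the answers above might look like the sketch below. The function name `common_chars` is made up for illustration. It builds a set once for fast lookups and uses dict.fromkeys to drop duplicates while keeping order. The comparison is case-sensitive, as the question asks.

```python
def common_chars(first, second):
    """Characters of `first` that also occur in `second`,
    deduplicated and in order of first appearance."""
    lookup = set(second)  # set membership tests are fast
    return list(dict.fromkeys(ch for ch in first if ch in lookup))

print(common_chars('hfHrpphHBppf', 'TvmzgMmbLbgf'))
```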
I have a column of strings that look similar to the following:
1 IX-1-a
2 IX-1-b
3 IX-1-C
4 IX-1-D
Some end in lowercase letters while others end in uppercase. I need to standardize all endings to lowercase without affecting the letters at the beginning of the string. Below is a code fragment that I am using to make changes within the series, but it doesn't quite work.
if i in tw4515['Unnamed: 0'].str[-1].str.isupper() == True:
    tw4515['Unnamed: 0'].str[-1].str.lower()
How can the truth table from tw4515['Unnamed: 0'].str[-1].str.isupper() be utilized efficiently to affect conditional changes?
One option is to split once from the right side, make the second part lowercase, then combine:
tmp = s.str.rsplit('-', n=1)
out = tmp.str[0] + '-' + tmp.str[1].str.lower()
If the last part is always a single letter, @Barmar's solution is even better:
out = s.str[:-1] + s.str[-1].str.lower()
Output:
1 IX-1-a
2 IX-1-b
3 IX-1-c
4 IX-1-d
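To answer the literal question about using the boolean mask: one possible sketch is below. The sample Series and the function name `lowercase_upper_endings` are stand-ins I made up; in the question the data would be tw4515['Unnamed: 0']. It assumes every value is a non-empty string.

```python
import pandas as pd

def lowercase_upper_endings(series):
    """Lowercase the final character only where it is uppercase."""
    mask = series.str[-1].str.isupper()           # True where the ending is uppercase
    lowered = series.str[:-1] + series.str[-1].str.lower()
    # where() keeps the original value when the condition is True,
    # so the condition passed in is "ending is NOT uppercase".
    return series.where(~mask, lowered)

s = pd.Series(["IX-1-a", "IX-1-b", "IX-1-C", "IX-1-D"], index=[1, 2, 3, 4])
print(lowercase_upper_endings(s))
```

In practice the mask is not really needed, because lowercasing an already lowercase letter changes nothing, which is why the shorter solutions above work.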
While learning Python, I came across a demanding beginner's exercise.
Let's say you have a string made up of "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on the old one but in accordance with a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character stands for some single letter, and '*' stands for an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?
To split a string you can use the str.split() method.
For pattern detection in strings you can use regular expressions (regex) with the re module.
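Putting those two tips together, a rough sketch could look like this. It rests on my reading of the examples in the question: "2(c)" expands to "cc", '?' means exactly one letter, '*' means zero or more letters, and blocks that do not fit the pattern are left out of the result. The function names `expand`, `pattern_to_regex` and `apply_pattern` are invented for the example.

```python
import re

def expand(block):
    """Expand groups such as '2(c)' into 'cc' (assumed notation)."""
    return re.sub(r"(\d+)\(([^)]*)\)",
                  lambda m: m.group(2) * int(m.group(1)),
                  block)

def pattern_to_regex(pattern):
    """Translate the exercise's pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z]")      # exactly one letter
        elif ch == "*":
            parts.append("[A-Za-z]*")     # assumption: zero or more letters
        else:
            parts.append(re.escape(ch))   # literal character
    return re.compile("".join(parts))

def apply_pattern(string, pattern):
    """Expand every block and keep only the blocks that fully match."""
    rx = pattern_to_regex(pattern)
    blocks = [expand(b) for b in string.split(";")]
    return ";".join(b for b in blocks if rx.fullmatch(b))

print(apply_pattern("cdk;2(c)3(i)s;c", "c?*"))
print(apply_pattern("2(a)bxaxb;ab", "a?*b"))
```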
I'm wondering if there's any way to find how many pairs of parentheses are in a string.
I have to do some string manipulation and I sometimes have something like:
some_string = '1.8.0*99(0000000*kWh)'
or something like
some_string = '1.6.1*01(007.717*kW)(1604041815)'
What I'd like to do is:
get all the digits between the parentheses (e.g. for the first string: 0000000)
if there are 2 pairs of parentheses (there will always be at most 2 pairs), get all the digits and join them (e.g. for the second string I'll have: 0077171604041815)
How can I check how many pairs of parentheses are in a string, so that I can later do something like:
if number_of_pairs == 1:
    do_this
else:
    do_that
Or maybe there's an easier way to do what I want, but I couldn't think of one so far.
I know how to get only the digits in a string: final_string = re.sub('[^0-9]', '', my_string), but I'm wondering how I could handle both cases.
Since parentheses always come in pairs, just count the left (or right) parentheses in the string and you'll have your answer.
num_of_parenthesis = some_string.count('(')
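If you want both the number of pairs and the digits in one go, something along these lines might do. It is a sketch that assumes the parentheses are well-formed and never nested, and the helper name `digits_in_parens` is made up.

```python
import re

def digits_in_parens(text):
    """Return (number_of_pairs, digits found inside all pairs, joined)."""
    groups = re.findall(r"\(([^()]*)\)", text)   # contents of each (...) pair
    digits = "".join(re.sub(r"\D", "", g) for g in groups)
    return len(groups), digits

number_of_pairs, digits = digits_in_parens('1.6.1*01(007.717*kW)(1604041815)')
if number_of_pairs == 1:
    print("one pair:", digits)
else:
    print("several pairs:", digits)
```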
You can do that (assuming you already know there's at least one parenthesis):
re.sub(r'[^0-9]+', '', some_string.split('(', 1)[1])
or only with re.sub:
re.sub(r'^[^(]*\(|[^0-9]+', '', some_string)
If you want all the digits in a single string, use re.findall after removing any . characters, then join the matches into a single string:
In [15]: s = '1.6.1*01(007.717*kW)(1604041815)'
In [16]: "".join(re.findall(r"\((\d+).*?\)", s.replace(".", "")))
Out[16]: '0077171604041815'
In [17]: s = '1.8.0*99(0000000*kWh)'
In [18]: "".join(re.findall(r"\((\d+).*?\)", s.replace(".", "")))
Out[18]: '0000000'
The count of parens is irrelevant when all you want is to extract any digits inside them. Based on the fact "you only have max two pairs" I presume the format is consistent.
Or, if the parens always contain digits, find the data in the parens and strip everything but the digits:
In [20]: "".join([re.sub(r"[^0-9]", "", m) for m in re.findall(r"\((.*?)\)", s)])
Out[20]: '0077171604041815'
I'm building an app that gets incoming SMSs, then based on a keyword, it looks to see if that keyword is associated with any campaigns that it is running. The way I'm doing it now is to load a list of keywords and possible spelling combinations, then when the SMS comes in, I look through all keywords and combinations to see if there is a match.
How would you do this without such a list, by actually looking for words that approximately match another word?
Let's say the correct spelling is HAMSTER, normally I would give the campaign alternatives like HMSTER HIMSTER HAMSTAR HAMSTR HAMSTIR etc.
Is there a smart way of doing this?
HAMSTER
"hamstir".compare_to("hamster") ? match
EDIT:
How about 2 words?
Say we know there are two words that need to match in the SMS:
correct for first word = THE FIRST WORD
correct for second word = AND SECOND WORD
SMS = FIRST WORD SECOND
EDIT:
Ideally people should SMS the words comma-separated; that way I would know where to split and look for the words.
But what if they don't, like:
UNIQUE KEYWORD SECOND PARAMATER
How would I tell where the words split? The first word might be 3 words long and the second 3 or 1 or 2 etc.
In these examples, how would you use the techniques below to find the two words?
Would you look twice, once for each needed parameter or keyword?
The simplest solution is to use the difflib module, which has a get_close_matches function for approximate string matching:
import difflib
difflib.get_close_matches(word, possibilities)
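As a rough illustration with the keyword from the question: the campaign list and the helper name `match_keyword` below are placeholders, and the cutoff of 0.6 is simply difflib's default, which you would probably want to tune on real messages.

```python
import difflib

# Placeholder campaign keywords; replace with your own list.
CAMPAIGN_KEYWORDS = ["HAMSTER", "TWO WORDED", "FRIDAY"]

def match_keyword(sms_word, cutoff=0.6):
    """Return the closest campaign keyword, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        sms_word.upper(), CAMPAIGN_KEYWORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_keyword("hamstir"))
print(match_keyword("qqqq"))
```

get_close_matches returns a list sorted by similarity, so asking for n=1 gives the best candidate, and an empty list means nothing reached the cutoff.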
What you're looking for is Levenshtein Distance.
Assuming your list of campaigns isn't too large, you can calculate the distance between the input word and each campaign, then select the campaign with the shortest distance. To filter out completely wrong words you might need to set a maximum acceptable distance and discard the input if the shortest distance is still beyond that limit.
To calculate the distance between two words, you can try one of these modules:
levenshtein.py
python-Levenshtein
py-editdist
For example, using levenshtein.py:
from levenshtein import levenshtein
campaigns = (
    "HAMSTER",
    "TWO WORDED",
    "FRIDAY",
)
def get_campaign(word):
    return min(campaigns, key=lambda x: levenshtein(word, x))
Usage:
>>> get_campaign("HAMSTA")
'HAMSTER'
>>> get_campaign("HAM WORDED")
'TWO WORDED'
>>> get_campaign("FROODY")
'FRIDAY'
>>> get_campaign("FRIDAY")
'FRIDAY'
Note that this is a very simple-minded approach and will always return something, even if the input is completely different.
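One way to add the cut-off described earlier is sketched below. To keep it self-contained it carries its own small Levenshtein function instead of importing one of the modules listed above, and the limit of 3 edits is an arbitrary assumption you would need to tune.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

CAMPAIGNS = ("HAMSTER", "TWO WORDED", "FRIDAY")

def get_campaign(word, max_distance=3):
    """Closest campaign, or None when even the best one is too far away."""
    best = min(CAMPAIGNS, key=lambda c: levenshtein(word, c))
    return best if levenshtein(word, best) <= max_distance else None

print(get_campaign("HAMSTA"))
print(get_campaign("XYZZY"))
```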
I use Levenshtein distance to solve a similar problem;
see http://en.wikipedia.org/wiki/Levenshtein_distance
def distance(s1, s2):
    # Python 3 strings are already Unicode, so no conversion is needed.
    # Keep the longer string as s1 so each row has len(s2) + 1 entries.
    if len(s1) < len(s2):
        return distance(s2, s1)
    if not s2:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # j + 1 instead of j since previous_row and current_row
            # are one character longer than s2
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
>>> distance("hamstir", "hamster") < 3
True
>>> distance("god", "hamster") < 3
False
It seems to me that you're trying to build a spell checker. You could use minimum edit distance matching. Alternatively, look at Peter Norvig's Python spell checker.
Hope that helps
You could use fuzzy matching and a named list with the regex library, e.g. to find any phrase from a list with at most one error (insertion, deletion, or substitution):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex as re  # pip install regex
words = ["first word", "second word", "third"]
sms = "junk Furst Word second Third"
for m in re.finditer(r"(?fie)\L<words>{e<=1}", sms, words=words):
    print(m[0])  # the match
    print(m.span())  # indexes where the match was found in the sms
    # to find out which of the words matched:
    print(next(w for w in words
               if re.match(r"(?fi)(?:%s){e<=1}" % re.escape(w), m[0])))
Output
Furst Word
(5, 14)
first word
Third
(22, 27)
third
Or you could iterate over the words directly:
for w in words:
    for m in re.finditer(r"(?fie)(?:%s){e<=1}" % re.escape(w), sms):
        print(m[0])
        print(m.span())
        print(w)
It produces the same output as the first example.