re to find longest matching postfix of two strings - python

I have two strings like:
a = '54515923333558964'
b = '48596478923333558964'
Now the longest postfix match is
c = '923333558964'
what will be a solution using re?
Here is a solution I found for prefix match:
import re
pattern = re.compile("(?P<mt>\S*)\S*\s+(?P=mt)")
a = '923333221486456'
b = '923333221486234567'
c = pattern.match(a + ' ' + b).group('mt')

Try the difflib.SequenceMatcher:
import difflib
a = '54515923333558964'
b = '48596478923333558964'
s = difflib.SequenceMatcher(None, a, b)
m = s.find_longest_match(0, len(a), 0, len(b))
print(a[m.a:m.a+m.size])
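One caveat worth checking: find_longest_match returns the longest common block anywhere in the two strings, which is not necessarily a common suffix. Below is a minimal sketch of a guard. The helper name longest_block_is_suffix is hypothetical (it is not part of difflib), and autojunk=False is passed so that long inputs are not affected by the junk heuristic.

```python
import difflib

def longest_block_is_suffix(a, b):
    """True if the longest common block ends both strings (hypothetical helper)."""
    s = difflib.SequenceMatcher(None, a, b, autojunk=False)
    m = s.find_longest_match(0, len(a), 0, len(b))
    # The block is a common suffix only if it runs to the end of a and of b.
    return m.size > 0 and m.a + m.size == len(a) and m.b + m.size == len(b)

print(longest_block_is_suffix('54515923333558964', '48596478923333558964'))
print(longest_block_is_suffix('abcdeZ', 'abcdeYZ'))  # longest block is 'abcde'
```

When the check fails, the two strings may still share a shorter suffix, so a dedicated suffix comparison is the safer choice.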

You can use this variation of the regex pattern:
\S*?(?P<mt>\S*)\s+\S*(?P=mt)$
EDIT.
Note, however, that this may require O(n^3) time with some inputs. Try e.g.
a = 1000 * 'a'
b = 1000 * 'a' + 'b'
This takes one second to process on my system.
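If the regex cost matters, a plain right-to-left scan gives the same answer in linear time. This is a sketch that does not use re; longest_common_suffix is a hypothetical name chosen for illustration. The regex variation is run alongside it for comparison on the strings from the question.

```python
import re

def longest_common_suffix(a, b):
    """Compare characters from the right until they differ (hypothetical helper)."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return a[len(a) - n:]

a = '54515923333558964'
b = '48596478923333558964'

# The regex variation, applied to the two strings joined by a space.
pattern = re.compile(r"\S*?(?P<mt>\S*)\s+\S*(?P=mt)$")
print(pattern.match(a + ' ' + b).group('mt'))
print(longest_common_suffix(a, b))
```

Both lines should print the same suffix. The scan, unlike the regex, does not assume the strings are free of whitespace.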

Related

Regex with python dictionary

I am trying to do some "batch" find and replace.
I have the following string:
abc123 = abc122 + V[2] + V[3]
I would like to find every instance of abc{someNumber} = and replace the instance's abc portion with int ijk{someNumber} =, and also replace V[3] with a keyword in a dictionary.
dictToReplace={"[1]": "_i", "[2]":"_j", "[3]":"_k"}
The expected end result would be:
int ijk123 = ijk122 + V_j + V_k
What is the best way to achieve this? RegEx for the first part? Can it also be used for the second?
I'd split the logic in two steps:
1.) First replace the keyword abc\d+
2.) Replace the keys found in dictionary with their respective values
import re
dictToReplace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}
s = "abc123 = abc122 + V[2] + V[3]"
pat1 = re.compile(r"abc(\d+)")
pat2 = re.compile("|".join(map(re.escape, dictToReplace)))
s = pat1.sub(r"ijk\1", s)
s = pat2.sub(lambda g: dictToReplace[g.group(0)], s)
print(s)
Prints:
ijk123 = ijk122 + V_j + V_k
Use a function as the replacement value in re.sub(). It can then look up the matched value in the dictionary to get the replacement.
string = 'abc123 = abc122 + V[2] + V[3]'
# change abc### to ijk###
result = re.sub(r'abc(\d+)', r'ijk\1', string)
# replace any V[###] with V_xxx from the dict.
result = re.sub(r'V(\[\d+\])', lambda m: 'V' + dictToReplace.get(m.group(1), m.group(1)), result)
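The expected result in the question starts with "int ", which the snippets above do not add. Below is a sketch that handles the assignment target separately. It assumes the target appears at the start of the line, and convert and dict_to_replace are hypothetical names used only for illustration.

```python
import re

dict_to_replace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}

def convert(line, mapping=dict_to_replace):
    # Assumption: the assignment target is at the start of the line.
    # "abc123 =" becomes "int ijk123 =".
    line = re.sub(r"^(\s*)abc(\d+)(\s*=)", r"\1int ijk\2\3", line)
    # Any remaining abc### becomes ijk###.
    line = re.sub(r"abc(\d+)", r"ijk\1", line)
    # Replace every dictionary key with its value in a single pass.
    keys = re.compile("|".join(map(re.escape, mapping)))
    return keys.sub(lambda m: mapping[m.group(0)], line)

print(convert("abc123 = abc122 + V[2] + V[3]"))
```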

How to make a python script that gives you every iteration of a string from a pattern

So I'm trying to make a python script that takes a pattern (ex: c**l) where it'll return every iteration of the string (* = any character in the alphabet)...
So, we get something like: caal, cbal, ccal and so forth.
I've tried using the itertools library's product but I haven't been able to make it work properly. So after 2 hours I've decided to turn to Stack Overflow.
Here's my current code. It's not complete since I feel stuck
alphabet = list('abcdefghijklmnopqrstuvwxyz')
wildChar = False
tmp_string = ""
combinations = []
if '*' in pattern:
    wildChar = True
    tmp_string = pattern.replace('*', '', pattern.count('*')+1)
if wildChar:
    tmp = []
    for _ in range(pattern.count('*')):
        tmp.append(list(product(tmp_string, alphabet)))
    for array in tmp:
        for instance in array:
            combinations.append("".join(instance))
    tmp = []
print(combinations)
You could try:
from itertools import product
from string import ascii_lowercase
pattern = "c**l"
repeat = pattern.count("*")
pattern = pattern.replace("*", "{}")
for letters in product(ascii_lowercase, repeat=repeat):
    print(pattern.format(*letters))
Result:
caal
cabl
cacl
...
czxl
czyl
czzl
Use itertools.product
import itertools
import string
s = 'c**l'
l = [c if c != '*' else string.ascii_lowercase for c in s]
out = [''.join(c) for c in itertools.product(*l)]
Output:
>>> out
['caal',
'cabl',
'cacl',
'cadl',
'cael',
'cafl',
'cagl',
'cahl',
'cail',
'cajl'
...
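For longer patterns the full list grows quickly (26 to the power of the number of wildcards), so a lazy generator may be preferable. A minimal sketch of the same itertools.product idea follows; expand and its parameters are hypothetical names.

```python
from itertools import product
from string import ascii_lowercase

def expand(pattern, alphabet=ascii_lowercase, wildcard="*"):
    """Yield every string obtained by filling each wildcard from alphabet."""
    # Each position contributes either its fixed character or the whole alphabet.
    pools = [alphabet if ch == wildcard else ch for ch in pattern]
    for combo in product(*pools):
        yield "".join(combo)

words = list(expand("c**l"))
print(len(words), words[0], words[-1])
```

Because expand is a generator, iterating over it directly avoids holding every combination in memory at once.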

How to get portion of string from 2 different strings and concat?

I have 2 strings a and b with - as the delimiter. I want to build a third string by concatenating the substring of a up to the last % (which is one-two-three-%whatever% in the example below) with what remains of b after dropping as many dash-separated fields as there are dashes in the resulting string (4 in the example below, which leaves bar-bazz). This is what I have so far; is there a better way?
>>> a='one-two-three-%whatever%-foo-bar'
>>> b='1one-2two-3three-4four-bar-bazz'
>>> k="%".join(a.split('%')[:-1]) + '%-'
>>> k
'one-two-three-%whatever%-'
>>> k.count('-')
4
>>> y=b.split("-",k.count('-'))[-1]
>>> y
'bar-bazz'
>>> k+y
'one-two-three-%whatever%-bar-bazz'
>>>
An alternative approach using Regex:
import re
a = 'one-two-three-%whatever%-foo-bar'
b = '1one-2two-3three-4four-bar-bazz'
part1 = re.findall(r".*%-",a)[0] # one-two-three-%whatever%-
num = part1.count("-") # 4
part2 = re.findall(r"\w+",b) # ['1one', '2two', '3three', '4four', 'bar', 'bazz']
part2 = '-'.join(part2[num:]) # bar-bazz
print(part1+part2) # one-two-three-%whatever%-bar-bazz
For the first substring obtained from a, you can use rsplit():
k = a.rsplit('%', 1)[0] + '%-'
The rest looks good to me.
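A related sketch uses str.rpartition instead of rsplit, wrapped in a function. The name merge_at_marker is hypothetical, and returning b unchanged when a contains no % is an assumption, since the question does not cover that case.

```python
def merge_at_marker(a, b, delim="-", marker="%"):
    """Keep a up to its last marker, then append the remaining fields of b."""
    head, found, _ = a.rpartition(marker)
    if not found:
        # Assumption: with no marker in a, fall back to b unchanged.
        return b
    kept = head + found
    # kept spans this many delimiter-separated fields, so skip as many in b.
    n = kept.count(delim) + 1
    return delim.join([kept] + b.split(delim)[n:])

a = 'one-two-three-%whatever%-foo-bar'
b = '1one-2two-3three-4four-bar-bazz'
print(merge_at_marker(a, b))
```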
Maybe a little shorter ?
a = 'one-two-three-%whatever%-foo-bar'
b = '1one-2two-3three-4four-bar-bazz'
def merge (a,b):
    res = a[:a.rfind ('%')+1]+'-'
    return (res + "-".join (b.split ("-")[res.count ('-'):]))
print (merge (a,b) == 'one-two-three-%whatever%-bar-bazz')
I personally get nervous when I need to manually increment indexes or concatenate bare strings.
This answer is pretty similar to hingev's, just without the additional concat/addition operators.
t = "-"
ix = list(reversed(a)).index("%")
s = a[:-ix]  # a up to and including its last "%"
t.join([s] + b.split(t)[len(s.split(t)):])
yet another possible answer:
import itertools
def custom_merge(a, b):
    result = []
    idx = 0
    for x in itertools.zip_longest(a.split('-'), b.split('-')):
        result.append(x[idx])
        if x[0][0] == '%' == x[0][-1]:
            idx = 1
    return "-".join(result)
Your question is specific enough that you might be optimizing the wrong thing (a smaller piece of a bigger problem). That being said, one way that feels easier to follow, and avoids some of the repeated linear traversals (splits and joins and counts) would be this:
def funky_percent_join(a, b):
    split_a = a.split('-')
    split_b = b.split('-')
    breakpoint = 0 # len(split_a) if all of a should be used on no %
    for neg, segment in enumerate(reversed(split_a)):
        if '%' in segment:
            breakpoint = len(split_a) - neg
            break
    return '-'.join(split_a[:breakpoint] + split_b[breakpoint:])
and then
>>> funky_percent_join('one-two-three-%whatever%-foo-bar', '1one-2two-3three-4four-bar-bazz')
'one-two-three-%whatever%-bar-bazz'
print(f"one-two-three-%{''.join(a.split('%')[1])}%")
that would work for the first, and then you could do the same for the second, and when you're ready to concat, you can do:
part1 = str(f"one-two-three-%{''.join(a.split('%')[1])}%")
part2 = str(f"-{''.join(b.split('-')[-2])}-{''.join(b.split('-')[-1])}")
result = part1+part2
this way it'll grab whatever you set the a/b variables to, provided they follow the same format.
but then again, why not do something like:
result = str(a[:-8] + b[22:])

Match file names and not substrings in python regex

I am trying to match a list of file names using a regex. Instead of matching just the full name, it is matching both the name and a substring of the name.
Three example files are
t0 = r"1997_06_daily.txt"
t1 = r"2010_12_monthly.txt"
t2 = r"2018_01_daily_images.txt"
I am using the regex d.
a = r"[0-9]{4}"
b = r"_[0-9]{2}_"
c = r"(daily|daily_images|monthly)"
d = r"(" + a + b + c + r".txt)"
when I run
import re
t0 = r"1997_06_daily.txt"
t1 = r"2010_12_monthly.txt"
t2 = r"2018_01_daily_images.txt"
a = r"[0-9]{4}"
b = r"_[0-9]{2}_"
c = r"(daily|daily_images|monthly)"
d = r"(" + a + b + c + r".txt)"
for t in (t0, t1, t2):
    m = re.match(d, t)
    if m is not None:
        print(t, m.groups(), sep="\n", end="\n\n")
I get
1997_06_daily.txt
("1997_06_daily.txt", "daily")
2010_12_monthly.txt
("2010_12_monthly.txt", "monthly")
2018_01_daily_images.txt
("2018_01_daily_images.txt", "daily_images")
How can I force the regex to only return the version that includes the full file name and not the substring?
You should make your c pattern non-capturing with '?:'
c = r"(?:daily|daily_images|monthly)"
This is working correctly. The issue you are seeing is how groups work in regex. Your regex c is in parentheses, and parentheses in a regex create a capturing group. By printing m.groups(), you are printing a tuple of all the captured groups. Because your whole pattern d is also wrapped in parentheses, the first element of that tuple is the full match (m.group(0) gives the same string), so just use the following:
print(t, m.groups()[0], sep="\n", end="\n\n")
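If the goal is to accept only complete file names, one option is re.fullmatch combined with a non-capturing group and an escaped dot. This is a sketch under that assumption about the intent; FILE_NAME and full_name are hypothetical names.

```python
import re

# Non-capturing group for the suffix; "\." matches a literal dot only.
FILE_NAME = re.compile(r"[0-9]{4}_[0-9]{2}_(?:daily|daily_images|monthly)\.txt")

def full_name(name):
    """Return name if the whole string fits the pattern, else None."""
    m = FILE_NAME.fullmatch(name)
    return m.group(0) if m else None

for t in ("1997_06_daily.txt", "2010_12_monthly.txt", "2018_01_daily_images.txt"):
    print(full_name(t))
```

With no capturing groups in the pattern, m.groups() is an empty tuple and m.group(0) is the only value to read.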
I know you're only looking for regex solutions but you could easily use os module to split the extension and return index 0. Otherwise, as Bill S. stated, m.groups()[0] returns the 0th index of the regex group.
# os solution
import os
s = "1997_06_daily.txt"
os.path.splitext(s)[0]

Python: Replace ith occurence of x with ith element in list

Suppose we have a string a = "01000111000011" with n=6 "1"s. The ith "1", I would like to replace with the ith character in "ORANGE".
My result should look like:
b = "0O000RAN0000GE"
What could be the finest way to solve this problem in Python? Is it possible to bind an index to each substitution?
Many thanks!
Helga
Tons of answers/ways to do it. Mine relies on the fundamental assumption that the number of 1s is equal to the length of the word you are substituting.
a = "01000111000011"
a = a.replace("1", "%s")
b = "ORANGE"
print(a % tuple(b))
Or the pythonic 1 liner ;)
print("01000111000011".replace("1", "%s") % tuple("ORANGE"))
a = '01000111000011'
for char in 'ORANGE':
    a = a.replace('1', char, 1)
Or:
b = iter('ORANGE')
a = ''.join(next(b) if i == '1' else i for i in '01000111000011')
Or:
import re
a = re.sub('1', lambda x, b=iter('ORANGE'): next(b), '01000111000011')
s_iter = iter("ORANGE")
"".join(next(s_iter) if c == "1" else c for c in "01000111000011")
If the number of 1's in your source string doesn't match the length of your replacement string you can use this solution:
def helper(source, replacement):
    i = 0
    for c in source:
        if c == '1' and i < len(replacement):
            yield replacement[i]
            i += 1
        else:
            yield c
a = '010001110001101010101'
b = 'ORANGE'
a = ''.join(helper(a, b)) # => '0O000RAN000GE01010101'
Improving on bluepnume's solution:
>>> from itertools import chain, repeat
>>> b = chain('ORANGE', repeat(None))
>>> a = ''.join((next(b) or c) if c == '1' else c for c in '010001110000110101')
>>> a
'0O000RAN0000GE0101'
[EDIT]
Or even simpler:
>>> from itertools import chain, repeat
>>> b = chain('ORANGE', repeat('1'))
>>> a = ''.join(next(b) if c == '1' else c for c in '010001110000110101')
>>> a
'0O000RAN0000GE0101'
[EDIT] #2
Also this works:
>>> import re
>>> r = 'ORANGE'
>>> s = '010001110000110101'
>>> re.sub('1', lambda _,c=iter(r):next(c), s, len(r))
'0O000RAN0000GE0101'
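The iterator-based variants can be folded into one reusable function. This is a sketch in which replace_each is a hypothetical name; it relies on the two-argument form of next(), so any occurrences beyond the end of the replacements are left unchanged.

```python
import re

def replace_each(source, target, replacements):
    """Replace the ith occurrence of target with the ith item of replacements."""
    it = iter(replacements)
    # next(it, default) keeps the original text once replacements run out.
    return re.sub(re.escape(target), lambda m: next(it, m.group(0)), source)

print(replace_each("01000111000011", "1", "ORANGE"))
print(replace_each("010001110000110101", "1", "ORANGE"))
```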
