Match file names and not substrings in python regex


I am trying to match a list of file names using a regex. Instead of matching just the full name, it is matching both the name and a substring of the name.
Three example files are
t0 = r"1997_06_daily.txt"
t1 = r"2010_12_monthly.txt"
t2 = r"2018_01_daily_images.txt"
I am using the regex d.
a = r"[0-9]{4}"
b = r"_[0-9]{2}_"
c = r"(daily|daily_images|monthly)"
d = r"(" + a + b + c + r".txt)"
When I run
import re
t0 = r"1997_06_daily.txt"
t1 = r"2010_12_monthly.txt"
t2 = r"2018_01_daily_images.txt"
a = r"[0-9]{4}"
b = r"_[0-9]{2}_"
c = r"(daily|daily_images|monthly)"
d = r"(" + a + b + c + r".txt)"
for t in (t0, t1, t2):
    m = re.match(d, t)
    if m is not None:
        print(t, m.groups(), sep="\n", end="\n\n")
I get
1997_06_daily.txt
('1997_06_daily.txt', 'daily')
2010_12_monthly.txt
('2010_12_monthly.txt', 'monthly')
2018_01_daily_images.txt
('2018_01_daily_images.txt', 'daily_images')
How can I force the regex to only return the version that includes the full file name and not the substring?

Make the group in your c pattern non-capturing by starting it with ?: so that it no longer shows up in m.groups():
c = r"(?:daily|daily_images|monthly)"
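A minimal sketch of that change applied to the three file names from the question. Two details are my own assumptions rather than part of the answer: the dot before txt is escaped, and re.fullmatch is used so that the whole name has to match. With only the outer group left, m.groups() has a single entry per file.

```python
import re

a = r"[0-9]{4}"
b = r"_[0-9]{2}_"
c = r"(?:daily|daily_images|monthly)"   # non-capturing: not reported in groups()
d = r"(" + a + b + c + r"\.txt)"        # assumption: escape the dot so it only matches "."

names = ["1997_06_daily.txt", "2010_12_monthly.txt", "2018_01_daily_images.txt"]
results = []
for name in names:
    m = re.fullmatch(d, name)           # assumption: require the whole name to match
    if m is not None:
        results.append(m.groups())
print(results)
```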

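This is working correctly. What you are seeing is how groups work in regex. Parentheses create a capturing group, and your pattern has two of them: the outer pair in d and the inner pair in c. By printing m.groups(), you are printing a tuple with one entry per capturing group, so you get the full file name (from the outer group) followed by the alternative that matched (from the inner group). Because the outer group wraps the whole pattern, its entry always comes first, so just use the following (m.group(0), which is always the entire match, works as well):
print(t, m.groups()[0], sep="\n", end="\n\n")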
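To make the difference concrete, here is a small sketch using the pattern from the question written out as one string. It only contrasts m.group(0) with m.groups(); the variable names full_match and all_groups are illustrative.

```python
import re

# The question's pattern d, written out in full.
d = r"([0-9]{4}_[0-9]{2}_(daily|daily_images|monthly).txt)"
m = re.match(d, "2018_01_daily_images.txt")

full_match = m.group(0)   # group 0 is always the entire match
all_groups = m.groups()   # one entry per pair of capturing parentheses
print(full_match)
print(all_groups)
```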

I know you're only looking for regex solutions, but you could easily use the os module to split off the extension and take index 0, which gives you the name without ".txt". Otherwise, as Bill S. stated, m.groups()[0] returns the first regex group.
# os solution
import os
s = "1997_06_daily.txt"
os.path.splitext(s)[0]  # '1997_06_daily'

Related

Regex with python dictionary

I am trying to do some "batch" find and replace.
I have the following string:
abc123 = abc122 + V[2] + V[3]
I would like to find every instance of abc{someNumber} = and replace the instance's abc portion with int ijk{someNumber} =, and also replace V[3] with a keyword in a dictionary.
dictToReplace={"[1]": "_i", "[2]":"_j", "[3]":"_k"}
The expected end result would be:
int ijk123 = ijk122 + V_j + V_k
What is the best way to achieve this? RegEx for the first part? Can it also be used for the second?

I'd split the logic in two steps:
1.) First replace the keyword abc\d+
2.) Replace the keys found in dictionary with their respective values
import re
dictToReplace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}
s = "abc123 = abc122 + V[2] + V[3]"
pat1 = re.compile(r"abc(\d+)")
pat2 = re.compile("|".join(map(re.escape, dictToReplace)))
s = pat1.sub(r"ijk\1", s)
s = pat2.sub(lambda g: dictToReplace[g.group(0)], s)
print(s)
Prints:
ijk123 = ijk122 + V_j + V_k

Use a function as the replacement value in re.sub(). It can then look up the matched value in the dictionary to get the replacement.
import re
dictToReplace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}
string = 'abc123 = abc122 + V[2] + V[3]'
# change abc### to ijk###
result = re.sub(r'abc(\d+)', r'ijk\1', string)
# replace any V[###] with V_xxx from the dict; unknown indices are left unchanged
result = re.sub(r'V(\[\d+\])', lambda m: 'V' + dictToReplace.get(m.group(1), m.group(1)), result)
print(result)  # ijk123 = ijk122 + V_j + V_k
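Neither answer adds the "int " prefix that appears in the expected result. A possible sketch that does is shown here. The helper name convert is hypothetical, and it assumes the prefix belongs only in front of an assignment target at the start of the line.

```python
import re

dict_to_replace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}

def convert(line, mapping):
    """Rename abc### to ijk###, map bracketed indices through `mapping`, declare the target as int."""
    # Rename every abcNNN identifier.
    line = re.sub(r"\babc(\d+)", r"ijk\1", line)
    # Replace bracketed indices found in the mapping; leave unknown ones untouched.
    line = re.sub(r"\[\d+\]", lambda m: mapping.get(m.group(0), m.group(0)), line)
    # Assumption: only an assignment target at the start of the line gets the "int " prefix.
    line = re.sub(r"^(ijk\d+\s*=)", r"int \1", line)
    return line

converted = convert("abc123 = abc122 + V[2] + V[3]", dict_to_replace)
print(converted)
```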

How can I shift patterns in a string one place ahead (removing the first, replacing the last)?

I have a string in Python, and I would like to shift a pattern 1 place earlier.
This is my string:
my_string = "[AudioLengthInSecs: 37.4]hello[seconds_silence: 0.65]one[seconds_silence: 0.54]two[seconds_silence: 0.59]three[seconds_silence: 0.48]hello[seconds_silence: 2.32]"
I would like to shift the number inside each [seconds_silence: XXXX] tag one place earlier, dropping the first number and removing the last tag (since its number has moved into the previous tag). The result should be like this:
my_desired_string = "[AudioLengthInSecs: 37.4]hello[seconds_silence: 0.54]one[seconds_silence: 0.59]two[seconds_silence: 0.48]three[seconds_silence: 2.32]hello"
Here is my code:
import re
my_string = "[AudioLengthInSecs: 37.4]hello[seconds_silence:0.65]one[seconds_silence: 0.54]two[seconds_silence: 0.59]three[seconds_silence: 0.48]hello[seconds_silence: 2.32]"
# First, find all the numbers in the string
all_numbers = re.findall(r'\d+', my_string)
# Second, remove the first 4 numbers
all_numbers = all_numbers[4:]
# Combine the numbers into one string
all_numbers
combined_numbers = [i+j for i,j in zip(all_numbers[::2], all_numbers[1::2])]
# Then loop over the string and insert
for word in my_string.split():
    print(word)
    if word == "[seconds_silence":
        print(word)
        # here I wanted to check if [seconds_silence was recognized
        # and replace it with the value from combined_numbers
        # however, this is failing obviously

The idea is to find all pairs:
the string preceding [seconds_silence: ...] fragment (capturing group No 1),
and the above fragment itself (capturing group No 2).
Then:
drop the first [seconds_silence: ...] fragment,
and join both lists,
but as the two lists now have different lengths, itertools.zip_longest is needed.
So the whole code to do your task is:
import itertools
import re
my_string = '[AudioLengthInSecs: 37.4]hello[seconds_silence:0.65]'\
    'one[seconds_silence: 0.54]two[seconds_silence: 0.59]'\
    'three[seconds_silence: 0.48]hello[seconds_silence: 2.32]'
gr1 = []
gr2 = []
for mtch in re.findall(r'(.+?)(\[seconds_silence: ?[\d.]+\])', my_string):
    g1, g2 = mtch
    gr1.append(g1)
    gr2.append(g2)
gr2.pop(0)
my_desired_string = ''
for g1, g2 in itertools.zip_longest(gr1, gr2, fillvalue=''):
    my_desired_string += g1 + g2
print(my_desired_string)
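The same shift can also be sketched with a single re.sub call and a replacement function. This is a different technique from the two-list approach above, not a restatement of it. The names tag, shifted and replace_tag are illustrative. Every tag takes the value of the tag after it, and the last tag, which has no successor, is removed.

```python
import re

my_string = ("[AudioLengthInSecs: 37.4]hello[seconds_silence:0.65]"
             "one[seconds_silence: 0.54]two[seconds_silence: 0.59]"
             "three[seconds_silence: 0.48]hello[seconds_silence: 2.32]")

tag = re.compile(r"\[seconds_silence: ?([\d.]+)\]")

# Collect every silence value, then drop the first so each tag receives its successor's value.
values = tag.findall(my_string)
shifted = iter(values[1:])

def replace_tag(match):
    # Take the next shifted value; once they run out, the last tag is removed.
    value = next(shifted, None)
    return "" if value is None else "[seconds_silence: {}]".format(value)

shifted_string = tag.sub(replace_tag, my_string)
print(shifted_string)
```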

split or chunk dynamic string into specific parts and merging in python

Is there a way to split a dynamic string into fixed-size chunks? Let me explain.
Suppose:
name = Natalie
Family = David12
length = len(name) # 7 characters
length = len(Family) # 7 characters
I want to split name and Family into chunks and merge them as:
result=nadatavilid1e2
and then split the result again to extract the 2 strings:
x= Natalie
y= david
Another example:
Name = john
Family= mark
Split and merge:
result= jomahnrk
and then split again to extract the 2 strings:
x=john
y= mark
Remember that name and family have a different length every time; they are not static. I hope my question is clear. I have seen several related solutions, but none of them does what I'm looking for. Any suggestions? Thanks.
I'm using Spyder with Python 3.6.4.
I have tried this code to split the data into two parts:
def split(data):
    indices = list(int(x) for x in data[-1:])
    data = data[:-1]
    rv = []
    for i in indices[::-1]:
        rv.append(data[-i:])
        data = data[:-i]
    rv.append(data)
    return rv[::-1]
data = 'Natalie'
x, c = split(str(data))
print(x)
print(c)

Given that you have stated the names will always be of equal length, you could use textwrap.wrap to split them into 2-character pairs and then zip and chain to join them up. In the split part you can again use wrap to split into 2-character pairs, but if the number of pairs is odd then you need to split the last pair into 2 single entries. Something like:
from textwrap import wrap
from itertools import chain
def merge_names(name, family):
    name_split = wrap(name, 2)
    family_split = wrap(family, 2)
    return "".join(chain(*zip(name_split, family_split)))
def split_names(merged_name):
    names = ["", ""]
    char_pairs = wrap(merged_name, 2)
    if len(char_pairs) % 2:
        char_pairs.append(char_pairs[-1][1])
        char_pairs[-2] = char_pairs[-2][0]
    for index, chars in enumerate(char_pairs):
        pos = 1 if index % 2 else 0
        names[pos] += chars
    return names
print(merge_names("john", "mark"))
print(split_names("jomahnrk"))
print(merge_names("stephen", "natalie"))
print(split_names("stnaeptaheline"))
print(merge_names("Natalie", "David12"))
print(split_names("NaDatavilid1e2"))
OUTPUT
jomahnrk
['john', 'mark']
stnaeptaheline
['stephen', 'natalie']
NaDatavilid1e2
['Natalie', 'David12']
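textwrap.wrap is meant for wrapping prose and treats whitespace specially, so names containing spaces may not survive it. A possible variant of the same interleaving that uses plain slicing is sketched below. It keeps the assumption that both strings have the same length, and the chunks helper and the size parameter are hypothetical additions.

```python
def chunks(text, size=2):
    """Split text into consecutive pieces of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def merge_names(name, family, size=2):
    # Assumption: both strings have the same length, as in the answer above.
    if len(name) != len(family):
        raise ValueError("name and family must have the same length")
    return "".join(n + f for n, f in zip(chunks(name, size), chunks(family, size)))

def split_names(merged, size=2):
    half = len(merged) // 2              # each original string contributed half the characters
    name, family = [], []
    pos = 0
    for start in range(0, half, size):
        width = min(size, half - start)  # the final chunk may be shorter than `size`
        name.append(merged[pos:pos + width])
        family.append(merged[pos + width:pos + 2 * width])
        pos += 2 * width
    return "".join(name), "".join(family)

merged = merge_names("Natalie", "David12")
print(merged)
print(split_names(merged))
```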

Something like:
a = "Eleonora"
b = "James"
l = max(len(a), len(b))
a = a.lower() + " " * (l-len(a))
b = b.lower() + " " * (l-len(b))
n = 2
a = [a[i:i+n] for i in range(0, len(a), n)]
b = [b[i:i+n] for i in range(0, len(b), n)]
ans = "".join(map(lambda xy: "".join(xy), zip(a, b))).replace(" ", "")
Giving for this example:
eljaeomenosra

Cut string within a specific pattern in python

I have a string of some length consisting of only 4 characters: 'A', 'T', 'G' and 'C'. The pattern 'GAATTC' is present multiple times in the string, and I have to cut the string wherever this pattern occurs.
For example, for the string 'ATCGAATTCATA', I should get an output of
string one - ATCGA
string two - ATTCATA
I am a newbie in Python, but I have come up with the following (incomplete) code:
seq = seq.upper()
str1 = "GAATTC"
seqlen = len(seq)
seq = list(seq)
for i in range(0,seqlen-1):
    site = seq.find(str1)
    print(site[0:(i+2)])
Any help would be really appreciated.

First, let's develop your idea of using find, so you can figure out your mistakes.
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GAATTC"
split_at = 2
seqlen = len(seq)
i = 0
while i < seqlen:
    # search for the next occurrence, starting at the current position
    site = seq.find(pattern, i)
    if site != -1:
        print(seq[i: site + split_at])
        i = site + split_at
    else:
        # no more occurrences: print the remainder and stop
        print(seq[i:])
        break
However, Python strings have a powerful replace method that directly replaces fragments of a string. The snippet below uses replace to insert separators where needed:
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GA","ATTC"
pattern1 = ''.join(pattern) # 'GAATTC'
pattern2 = ' '.join(pattern) # 'GA ATTC'
splited_seq = seq.replace(pattern1, pattern2) # 'ATCGA ATTCATAATCGA ATTCATAATCGA ATTCATA'
print(splited_seq.split())
I believe this is more intuitive, and it should be faster than re (which might have lower performance, depending on the library and usage).

Here is a simple solution. Note that, as written, the even/odd index test assumes the pattern occurs exactly once in the sequence:
seq = 'ATCGAATTCATA'
seq_split = seq.upper().split('GAATTC')
result = [
    (seq_split[i] + 'GA') if i % 2 == 0 else ('ATTC' + seq_split[i])
    for i in range(len(seq_split)) if len(seq_split[i]) > 0
]
Result:
print(result)
['ATCGA', 'ATTCATA']
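The split-based idea can be extended to any number of occurrences by looking at each piece's position rather than its parity. The following is only a sketch: the function name cut_at_sites and its site and offset parameters are hypothetical. Every piece except the first gets the right half of the site put back in front, and every piece except the last gets the left half appended.

```python
def cut_at_sites(seq, site="GAATTC", offset=2):
    """Cut seq inside every occurrence of `site`, `offset` characters into the site."""
    left, right = site[:offset], site[offset:]
    pieces = seq.upper().split(site)
    fragments = []
    for i, piece in enumerate(pieces):
        if i > 0:
            piece = right + piece      # a site ended just before this piece
        if i < len(pieces) - 1:
            piece = piece + left       # a site starts just after this piece
        fragments.append(piece)
    return fragments

print(cut_at_sites("ATCGAATTCATA"))
print(cut_at_sites("ATCGAATTCATAGAATTCGG"))
```

Joining the fragments gives back the original sequence, which is a quick way to check that nothing was lost.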

BioPython has a restriction enzyme package to do exactly what you're asking.
from Bio.Restriction import EcoRI
from Bio.Seq import Seq
# Note: Bio.Alphabet was removed in Biopython 1.78, so no alphabet is passed to Seq here.
print(EcoRI.site) # You will see that this is the enzyme you listed above
test = 'ATCGAATTCATA'.upper() # This is the sequence you want to search
my_seq = Seq(test) # Create a Biopython Seq object with our sequence
cut_sites = EcoRI.search(my_seq)
cut_sites contains a list of exactly where to cut the input sequence (such that GA is in the left sequence and ATTC is in the right sequence).
You can then split the sequence into contigs using:
cut_sites = [0] + cut_sites # We add a leading zero so this works for the first
# contig. This might not always be needed.
contigs = [test[i:j] for i,j in zip(cut_sites, cut_sites[1:]+[None])]
See the Biopython documentation for more details.

My code is a bit sloppy, but you could try something like this when you want to iterate over multiple occurrences of the string:
def split_strings(seq):
    # cut 2 characters into the first occurrence of str1
    string1 = seq[:seq.find(str1) + 2]
    string2 = seq[seq.find(str1) + 2:]
    return string1, string2
test = 'ATCGAATTCATA'.upper()
str1 = 'GAATTC'
seq = test
while str1 in seq:
    string1, seq = split_strings(seq)
    print(string1)
print(seq)

Here's a solution using the regular expression module:
import re
seq = 'ATCGAATTCATA'
restriction_site = re.compile('GAATTC')
subseq_start = 0
for match in restriction_site.finditer(seq):
    # cut 2 characters into each match
    print(seq[subseq_start:match.start()+2])
    subseq_start = match.start()+2
print(seq[subseq_start:])
Output:
ATCGA
ATTCATA
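Another regex option, sketched here under the assumption of Python 3.7 or later (where re.split can split on empty matches), is to describe the cut point itself with a lookbehind and a lookahead. The sequence used below is just the example repeated twice.

```python
import re

seq = "ATCGAATTCATAATCGAATTCATA"
# Zero-width split point: preceded by "GA" and followed by "ATTC".
fragments = re.split(r"(?<=GA)(?=ATTC)", seq)
print(fragments)
```

Because the split point has zero width, no characters are consumed and the fragments concatenate back to the original sequence.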

re to find longest matching postfix of two strings

I have two strings like:
a = '54515923333558964'
b = '48596478923333558964'
Now the longest postfix match is
c = '923333558964'
What would be a solution using re?
Here is a solution I found for the prefix match:
import re
pattern = re.compile(r"(?P<mt>\S*)\S*\s+(?P=mt)")
a = '923333221486456'
b = '923333221486234567'
c = pattern.match(a + ' ' + b).group('mt')

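Try difflib.SequenceMatcher. Note that find_longest_match returns the longest common substring, which for these two inputs is also their common suffix:
import difflib
a = '54515923333558964'
b = '48596478923333558964'
s = difflib.SequenceMatcher(None, a, b)
m = s.find_longest_match(0, len(a), 0, len(b))
print(a[m.a:m.a+m.size])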

You can use this variation of the regex pattern:
\S*?(?P<mt>\S*)\s+\S*(?P=mt)$
EDIT.
Note, however, that this may require O(n^3) time with some inputs. Try e.g.
a = 1000 * 'a'
b = 1000 * 'a' + 'b'
This takes one second to process on my system.
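If a regex is not a hard requirement, a linear-time alternative is to reverse both strings and take their common prefix. The sketch below uses os.path.commonprefix, which compares strings character by character. This swaps in a non-regex technique, and the function name longest_common_suffix is illustrative.

```python
import os.path

def longest_common_suffix(a, b):
    # commonprefix compares character by character, so reversing both
    # strings turns the suffix problem into a prefix problem.
    return os.path.commonprefix([a[::-1], b[::-1]])[::-1]

a = '54515923333558964'
b = '48596478923333558964'
c = longest_common_suffix(a, b)
print(c)
```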
