Regex with python dictionary - python

I am trying to do some "batch" find and replace.
I have the following string:
abc123 = abc122 + V[2] + V[3]
I would like to find every instance of abc{someNumber} = and replace the instance's abc portion with int ijk{someNumber} =, and also replace V[3] with a keyword in a dictionary.
dictToReplace={"[1]": "_i", "[2]":"_j", "[3]":"_k"}
The expected end result would be:
int ijk123 = ijk122 + V_j + V_k
What is the best way to achieve this? RegEx for the first part? Can it also be used for the second?

I'd split the logic in two steps:
1.) First replace the keyword abc\d+
2.) Replace the keys found in dictionary with their respective values
import re
dictToReplace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}
s = "abc123 = abc122 + V[2] + V[3]"
pat1 = re.compile(r"abc(\d+)")
pat2 = re.compile("|".join(map(re.escape, dictToReplace)))
s = pat1.sub(r"ijk\1", s)
s = pat2.sub(lambda g: dictToReplace[g.group(0)], s)
print(s)
Prints:
ijk123 = ijk122 + V_j + V_k

Use a function as the replacement value in re.sub(). It can then look up the matched value in the dictionary to get the replacement.
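import re
dictToReplace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}
string = 'abc123 = abc122 + V[2] + V[3]'
# change abc### to ijk###
result = re.sub(r'abc(\d+)', r'ijk\1', string)
# replace any V[###] with V_xxx from the dict; keys missing from the dict are left unchanged
result = re.sub(r'V(\[\d+\])', lambda m: 'V' + dictToReplace.get(m.group(1), m.group(1)), result)
print(result)  # ijk123 = ijk122 + V_j + V_k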
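Neither snippet above adds the int prefix that the expected result asks for. A minimal sketch that combines both steps and also emits the prefix might look like the following. The function name convert and the assumption that the assignment target sits at the very start of the line are illustrative choices, not something stated in the question.

```python
import re

dict_to_replace = {"[1]": "_i", "[2]": "_j", "[3]": "_k"}

def convert(line, mapping=dict_to_replace):
    """Rename abc<number> to ijk<number>, prefix the assignment target
    with 'int ', and replace bracketed keys using the mapping."""
    # Assumption: the assignment target is at the start of the line.
    line = re.sub(r"^abc(\d+)(?=\s*=)", r"int ijk\1", line)
    # Rename every remaining abc<number>.
    line = re.sub(r"abc(\d+)", r"ijk\1", line)
    # Build one alternation from the escaped dictionary keys.
    key_pattern = re.compile("|".join(map(re.escape, mapping)))
    return key_pattern.sub(lambda m: mapping[m.group(0)], line)

print(convert("abc123 = abc122 + V[2] + V[3]"))
```

Escaping the keys with re.escape matters here because "[" and "]" are regex metacharacters.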

Related

Replace ip partially with x in python

I have several ip addresses like
162.1.10.15
160.15.20.222
145.155.222.1
I am trying to replace the ip's like below.
162.x.xx.xx
160.xx.xx.xxx
145.xxx.xxx.x
How can I achieve this in Python?
Here's a slightly simpler solution:
import re
txt = "192.1.2.3"
x = txt.split(".", 1) # ['192', '1.2.3']
y = x[0] + "." + re.sub(r"\d", "x", x[1])
print(y) # 192.x.x.x
We can use re.sub with a callback function here:
import re
def repl(m):
    return m.group(1) + '.' + re.sub(r'.', 'x', m.group(2)) + '.' + re.sub(r'.', 'x', m.group(3)) + '.' + re.sub(r'.', 'x', m.group(4))
inp = "160.15.20.222"
output = re.sub(r'\b(\d+)\.(\d+)\.(\d+)\.(\d+)\b', repl, inp)
print(output) # 160.xx.xx.xxx
In the callback, the idea is to use re.sub to surgically replace each digit by x. This keeps the same width of each original number.
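The same width-preserving idea can be sketched more compactly by letting the callback build the run of x characters from the match length. This is only one possible variant; the name mask_ip is hypothetical.

```python
import re

def mask_ip(ip):
    """Replace every octet that follows a dot with x's of the same width."""
    # (?<=\.) is a fixed-width look-behind: only digit runs that come
    # directly after a dot match, so the first octet is left untouched.
    return re.sub(r"(?<=\.)\d+", lambda m: "x" * len(m.group(0)), ip)

for ip in ["162.1.10.15", "160.15.20.222", "145.155.222.1"]:
    print(mask_ip(ip))
```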
This is not the most optimized solution, but it works for me.
import re
Ip_string = "160.15.20.222"
Ip_string = Ip_string.split('.')
Ip_String_x = ""
flag = False
for num in Ip_string:
    if flag:
        # every octet after the first: replace each digit with x
        num = re.sub(r'\d', 'x', num)
        Ip_String_x = Ip_String_x + '.' + num
    else:
        # first octet: keep as-is
        flag = True
        Ip_String_x = num
print(Ip_String_x)  # 160.xx.xx.xxx
Solution 1
Other answers are good, and this single regex works, too:
import re
strings = [
'162.1.10.15',
'160.15.20.222',
'145.155.222.1',
]
for string in strings:
    print(re.sub(r'(?:(?<=\.)|(?<=\.\d)|(?<=\.\d\d))\d', 'x', string))
output:
162.x.xx.xx
160.xx.xx.xxx
145.xxx.xxx.x
Explanation
(?<=\.) means preceded by a single dot.
(?<=\.\d) means preceded by a dot and one digit.
(?<=\.\d\d) means preceded by a dot and two digits.
\d means a digit.
So every digit that is preceded by a dot and zero, one, or two digits is replaced with 'x'.
(?<=\.\d{0,2}) or similar patterns are not allowed, since a look-behind ((?<=...)) must have a fixed width in Python's re module.
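The fixed-width restriction can be checked directly by trying to compile both patterns. This is a small sketch; the helper name compiles is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Variable-width look-behind: rejected by the re module.
variable_width = r"(?<=\.\d{0,2})\d"
# Alternation of three fixed-width look-behinds: accepted.
fixed_width_alternatives = r"(?:(?<=\.)|(?<=\.\d)|(?<=\.\d\d))\d"

print(compiles(variable_width))
print(compiles(fixed_width_alternatives))
```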
Solution 2
Without the re module:
for string in strings:
    first, *rest = string.split('.')
    print('.'.join([first, *map(lambda x: 'x' * len(x), rest)]))
The code above produces the same result.
There are multiple ways to go about this. Regex is the most versatile way to write string manipulation code, but you can also do it with plain for loops and the split and join functions.
ip = "162.1.10.15"
#Splitting the IPv4 address using '.' as the delimiter
ip = ip.split(".")
#Converting the substrings to x's except 1st string
for i,val in enumerate(ip[1:]):
    cnt = 0
    for x in val:
        cnt += 1
    ip[i+1] = "x" * cnt
#Combining the substrings back to ip
ip = ".".join(ip)
print(ip)
I highly recommend checking Regex but this is also a valid way to go about this task.
Hope you find this useful!
Pass an array of IPs to this function:
def replace_ips(ip_list):
    r_list=[]
    for i in ip_list:
        first,*other=i.split(".",3)
        r_item=[]
        r_item.append(first)
        for i2 in other:
            r_item.append("x"*len(i2))
        r_list.append(".".join(r_item))
    return r_list
In case of your example:
print(replace_ips(["162.1.10.15","160.15.20.222","145.155.222.1"]))#==> expected output: ["162.x.xx.xx","160.xx.xx.xxx","145.xxx.xxx.x"]
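One-liner, FYI (note that it replaces each of the last three octets with a single x, e.g. 162.x.x.x, so the original widths are not preserved):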
import re
ips = ['162.1.10.15', '160.15.20.222', '145.155.222.1']
pattern = r'\d{1,3}'
replacement_sign = 'x'
res = [re.sub(pattern, replacement_sign, ip[::-1], 3)[::-1] for ip in ips]
print(res)

Python extract string starting with index up to character

Say I have an incoming string that varies a little:
" 1 |r|=1.2e10 |v|=2.4e10"
" 12 |r|=-2.3e10 |v|=3.5e-04"
"134 |r|= 3.2e10 |v|=4.3e05"
I need to extract the numbers (i.e. 1.2e10, 3.5e-04, etc.), so I would like to start at the end of '|r|' and grab all characters up to the ' ' (space) after it. Same for '|v|'.
I've been looking for something that would:
Extract a substring from a string, starting at an index and ending on a specific character...
But have not found anything remotely close.
Ideas?
NOTE: Added new scenario, which is the one that is causing lots of head-scratching...
To keep it elegant and generic, let's use split:
First, we split on ' |' to get the tokens.
Then, for each token that contains an equals sign, we parse the key and the value.
sabich = "134 |r| = 3.2e10 |v|=4.3e05"
parts = sabich.split(' |')
values = {}
for p in parts:
    if '=' in p:
        k, v = p.split('=')
        values[k.replace('|', '').strip()] = v.strip(' ')
# {'r': '3.2e10', 'v': '4.3e05'}
print(values)
This can be converted to a one-liner:
sabich = "134 |r| = 3.2e10 |v|=4.3e05"
values = {t[0].replace('|', '').strip() : t[1].strip(' ') for t in [tuple(p.split('=')) for p in sabich.split(' |') if '=' in p]}
# {'r': '3.2e10', 'v': '4.3e05'}
print(values)
You can solve it with a regular expression.
import re
strings = [
" 1 |r|=1.2e10 |v|=2.4e10",
" 12 |r|=-2.3e10 |v|=3.5e-04"
]
out = []
pattern = r'(?P<name>\|[\w]+\|)=(?P<value>-?\d+(?:\.\d*)(?:e-?\d*)?)'
for s in strings:
    out.append(dict(re.findall(pattern, s)))
print(out)
Output
[{'|r|': '1.2e10', '|v|': '2.4e10'}, {'|r|': '-2.3e10', '|v|': '3.5e-04'}]
And if you want to convert the strings to numbers:
out = []
pattern = r'(?P<name>\|[\w]+\|)=(?P<value>-?\d+(?:\.\d*)(?:e-?\d*)?)'
for s in strings:
    # out.append(dict(re.findall(pattern, s)))
    out.append({
        name: float(value)
        for name, value in re.findall(pattern, s)
    })
print(out)
Output
[{'|r|': 12000000000.0, '|v|': 24000000000.0}, {'|r|': -23000000000.0, '|v|': 0.00035}]
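The pattern above expects the value to follow the equals sign immediately, so it would skip the third sample line from the question ("|r|= 3.2e10"). A hedged sketch of a more tolerant variant is shown below. The function name parse_line and the exact number grammar are assumptions made for illustration.

```python
import re

# Optional whitespace is allowed on both sides of '='.
# The number part accepts an optional sign, an optional fraction
# and an optional exponent such as e10 or e-04.
FIELD = re.compile(
    r"\|(\w+)\|\s*=\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)"
)

def parse_line(line):
    """Return a dict mapping each field name to its value as a float."""
    return {name: float(value) for name, value in FIELD.findall(line)}

lines = [
    " 1 |r|=1.2e10 |v|=2.4e10",
    " 12 |r|=-2.3e10 |v|=3.5e-04",
    "134 |r|= 3.2e10 |v|=4.3e05",
]
for line in lines:
    print(parse_line(line))
```

Here the keys are the bare names (r, v) rather than |r| and |v|; keep the pipes inside the first group if the original keys are preferred.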

Is there a way to replace the first and last three characters in a list of sequences using Python?

I am attempting to use Python to replace certain characters in a list of sequences that will be sent out for synthesis. The characters in question are the first and last three of each sequence. I am also attempting to add a * between each character.
The tricky part is that the first and last character need to be different from the other two.
For example: the DNA sequence TGTACGTTGCTCCGAC would need to be changed to /52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
The first character needs to be /52MOEr_/ and the last needs to be /32MOEr_/, where the _ is the character at that index. For the example above it would be T for the first and C for the last. The other two, the GT and GA would need to be /i2MOEr_/ modifications.
So far I have converted the sequences into a list using the .split() function. The end result was ['AAGTCTGGTTAACCAT', 'AATACTAGGTAACTAC', 'TGTACGTTGCTCCGTC', 'TGTAGTTAGCTCCGTC']. I have been playing around for a bit but I feel I need some guidance.
Is this not as easy to do as I thought it would be?
You can split the sequence into three slices (first three, middle, last three) and handle each one separately. Here's my solution to achieve your goal.
dna = "TGTACGTTGCTCCGAC"
dnaFirst3Chars = '/52MOEr' + dna[0] + '/*/i2MOEr' + dna[1] + '/*/i2MOEr' + dna[2] + '/*'
dnaMiddle = '*'.join(dna[3:-3])
dnaLast3Chars = '*/i2MOEr' + dna[-3] + '/*/i2MOEr' + dna[-2] + '/*/32MOEr' + dna[-1] + '/'
dnaTransformed = dnaFirst3Chars + dnaMiddle + dnaLast3Chars
print(dnaTransformed)
Output:
/52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
UPDATE:
For simplicity, you can wrap the code above in a function like this:
def dna_transformation(dna):
    """ Takes a DNA string and returns the transformed DNA """
    dnaFirst3Chars = '/52MOEr' + dna[0] + '/*/i2MOEr' + dna[1] + '/*/i2MOEr' + dna[2] + '/*'
    dnaMiddle = '*'.join(dna[3:-3])
    dnaLast3Chars = '*/i2MOEr' + dna[-3] + '/*/i2MOEr' + dna[-2] + '/*/32MOEr' + dna[-1] + '/'
    return dnaFirst3Chars + dnaMiddle + dnaLast3Chars
print(dna_transformation("TGTACGTTGCTCCGAC")) # call the function
Output: /52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
Assuming there's a typo in your expected result and it should actually be
/52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/ the code below will work:
# python3
def encode_sequence(seq):
    seq_front = seq[:3]
    seq_back = seq[-3:]
    seq_middle = seq[3:-3]
    front_ix = ["/52MOEr{}/", "/i2MOEr{}/", "/i2MOEr{}/"]
    back_ix = ["/i2MOEr{}/", "/i2MOEr{}/", "/32MOEr{}/"]
    encoded = []
    for base, index in zip(seq_front, front_ix):
        encoded.append(index.format(base))
    encoded.extend(seq_middle)
    for base, index in zip(seq_back, back_ix):
        encoded.append(index.format(base))
    return "*".join(encoded)
Read through the code and make sure you understand it. Essentially we're just slicing the original string and inserting the bases into the format you need. Each element of the final output is added to a list and joined by the * character at the end.
If you need to dynamically specify the number and name of the bases you extract from the front and back of the sequence you can use this version. Note that the {} braces tell the string.format function where to insert the base.
def encode_sequence_2(seq, front_ix, back_ix):
    seq_front = seq[:len(front_ix)]
    seq_back = seq[-len(back_ix):]
    seq_middle = seq[len(front_ix):-len(back_ix)]
    encoded = []
    for base, index in zip(seq_front, front_ix):
        encoded.append(index.format(base))
    encoded.extend(seq_middle)
    for base, index in zip(seq_back, back_ix):
        encoded.append(index.format(base))
    return "*".join(encoded)
And here's the output:
> seq = "TGTACGTTGCTCCGAC"
> encode_sequence(seq)
/52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
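The template-driven version is not demonstrated above, so here is a hedged sketch of how such a function could be called with custom template lists. The name encode_with_templates and the length check are additions for illustration; the slicing is written with explicit end indexes so that an empty back list does not produce a -0 slice.

```python
def encode_with_templates(seq, front, back):
    """Format the first len(front) and last len(back) bases with the
    given templates and join everything with '*'."""
    if len(seq) < len(front) + len(back):
        raise ValueError("sequence is shorter than the number of templates")
    middle_end = len(seq) - len(back)
    # zip stops at the shorter input, so only the leading bases are used.
    parts = [tpl.format(base) for tpl, base in zip(front, seq)]
    # Middle bases are kept as plain single characters.
    parts.extend(seq[len(front):middle_end])
    parts.extend(tpl.format(base) for tpl, base in zip(back, seq[middle_end:]))
    return "*".join(parts)

front = ["/52MOEr{}/", "/i2MOEr{}/", "/i2MOEr{}/"]
back = ["/i2MOEr{}/", "/i2MOEr{}/", "/32MOEr{}/"]
print(encode_with_templates("TGTACGTTGCTCCGAC", front, back))

# Two modified bases at the front, one at the back.
print(encode_with_templates("TGTACGTTGCTCCGAC", front[:2], back[-1:]))
```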
If you have a list of sequences to encode you can iterate over the list and encode each:
encoded_list = []
for seq in dna_list:
    encoded_list.append(encode_sequence(seq))
Or with a list comprehension:
encoded_list = [encode_sequence(seq) for seq in dna_list]

Match file names and not substrings in python regex

I am trying to match a list of file names using a regex. Instead of matching just the full name, it is matching both the name and a substring of the name.
Three example files are
t0 = r"1997_06_daily.txt"
t1 = r"2010_12_monthly.txt"
t2 = r"2018_01_daily_images.txt"
I am using the regex d.
a = r"[0-9]{4}"
b = r"_[0-9]{2}_"
c = r"(daily|daily_images|monthly)"
d = r"(" + a + b + c + r".txt)"
When I run
import re
t0 = r"1997_06_daily.txt"
t1 = r"2010_12_monthly.txt"
t2 = r"2018_01_daily_images.txt"
a = r"[0-9]{4}"
b = r"_[0-9]{2}_"
c = r"(daily|daily_images|monthly)"
d = r"(" + a + b + c + r".txt)"
for t in (t0, t1, t2):
    m = re.match(d, t)
    if m is not None:
        print(t, m.groups(), sep="\n", end="\n\n")
I get
1997_06_daily.txt
("1997_06_daily.txt", "daily")
2010_12_monthly.txt
("2010_12_monthly.txt", "monthly")
2018_01_daily_images.txt
("2018_01_daily_images.txt", "daily_images")
How can I force the regex to only return the version that includes the full file name and not the substring?
You should make your c pattern non-capturing with '?:'
c = r"(?:daily|daily_images|monthly)"
This is working correctly. The issue you are seeing is how groups work in regex. Your regex c is in parentheses, and parentheses in a regex mark a capturing group. By printing m.groups(), you are printing a tuple of all the captured groups. Because d wraps the whole pattern in parentheses, the first element of that tuple is the full match, so just use the following:
print(t, m.groups()[0], sep="\n", end="\n\n")
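As a hedged alternative sketch, m.group(0) always returns the entire match regardless of how many groups the pattern contains, and re.fullmatch rejects names with extra trailing text. The names FILE_NAME and full_name are hypothetical, and the escaped dot in \.txt is a small tightening that the original pattern does not have.

```python
import re

# Non-capturing alternation, literal dot before the extension.
FILE_NAME = re.compile(
    r"[0-9]{4}_[0-9]{2}_(?:daily|daily_images|monthly)\.txt"
)

def full_name(name):
    """Return the file name if the whole string matches, else None."""
    m = FILE_NAME.fullmatch(name)
    return m.group(0) if m else None

for t in ("1997_06_daily.txt", "2010_12_monthly.txt",
          "2018_01_daily_images.txt", "1997_06_daily.txt.bak"):
    print(t, full_name(t))
```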
I know you're only looking for regex solutions, but you could easily use the os module to split off the extension and take index 0, which is the file name without the extension. Otherwise, as Bill S. stated, m.groups()[0] returns the first captured group of the match.
# os solution
import os
s = "1997_06_daily.txt"
os.path.splitext(s)[0]

Write a Python formatted generator

To generate a Tecplot file I use:
import numpy as np
x, y = np.genfromtxt('./files.dat', unpack=True)
nb_value = x.size
x_splitted = np.split(x, nb_value // 1000 + 1)
y_splitted = np.split(y, nb_value // 1000 + 1)
with open('./test.dat', 'w') as f:
    f.write('TITLE = \" YOUPI \" \n')
    f.write('VARIABLES = \"x\" \"Y\" \n')
    f.write('ZONE T = \"zone1 \" , I=' + str(nb_value) + ', F=BLOCK \n')
    for idx in range(len(x_splitted)):
        string_list = ["%.7E" % val for val in x_splitted[idx]]
        f.write('\t'.join(string_list)+'\n')
    for idx in range(len(y_splitted)):
        string_list = ["%.7E" % val for val in y_splitted[idx]]
        f.write('\t'.join(string_list)+'\n')
Here is an example of file.dat:
-6.491083147394967334e-02 6.917197804459292456e+02
-6.489978349202699115e-02 6.871829941905543819e+02
-6.481115367048655151e-02 6.707292800160890920e+02
-6.479991205404790622e-02 6.756112033303363660e+02
-6.471117816968344205e-02 7.666798999627604871e+02
-6.469995628177811764e-02 7.819675271405360490e+02
This code is working but I have seen that I should use .format() instead of %. This is running: string_list = ["{}".format(list(val for val in y_splitted[idx]))] but won't work with Tecplot because we need 7E.
If I try: string_list = ["{.7E}".format(list(val for val in y_splitted[idx]))] it doesn't work at all. I got: AttributeError: 'list' object has no attribute '7E'
What would be the best way to do what I am trying to do?
Formatting specifiers come after a : colon:
["{:.7E}".format(val) for val in y_splitted[idx]]
Note that I had to adjust your list comprehension syntax as well; you only want to apply each val to str.format(), not the whole loop. In essence, you only needed to replace the "%.7E" % val part here.
See the Format String Syntax documentation:
replacement_field ::= "{" [field_name] ["!" conversion] [":" format_spec] "}"
Demo:
>>> ["%.7E" % val for val in (2.8, 4.2e5)]
['2.8000000E+00', '4.2000000E+05']
>>> ["{:.7E}".format(val) for val in (2.8, 4.2e5)]
['2.8000000E+00', '4.2000000E+05']
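To make the grammar above concrete, here is a short sketch showing that the same format spec can be attached to a positional field, a named field, an f-string, or passed to the built-in format(). The variable names are illustrative only.

```python
val = 4.2e5

old_style = "%.7E" % val                          # printf-style
positional = "{0:.7E}".format(val)                # field_name is 0
named = "{value:.7E}".format(value=val)           # field_name is value
f_string = f"{val:.7E}"                           # same spec in an f-string
builtin = format(val, ".7E")                      # spec passed directly

results = [old_style, positional, named, f_string, builtin]
print(results)
```

In every case the part after the colon is the format_spec; what comes before it only selects which argument to format.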
Not that you really need to use str.format(), since there are no other parts to the string; if all you have is "{:<formatspec>}", just use the format() function and pass in the <formatspec> as the second argument:
[format(val, ".7E") for val in y_splitted[idx]]
Note that in Python, you generally don't loop over a range() then use the index to get a list value. Just loop over the list directly:
for xsplit in x_splitted:
    string_list = [format(val, ".7E") for val in xsplit]
    f.write('\t'.join(string_list) + '\n')
for ysplit in y_splitted:
    string_list = [format(val, ".7E") for val in ysplit]
    f.write('\t'.join(string_list) + '\n')
You also don't have to escape the " characters in your strings; you only need to do that when the string delimiters are also " characters; you are using ' instead. You can use str.format() to insert the nb_value there too:
f.write('TITLE = " YOUPI " \n')
f.write('VARIABLES = "x" "Y" \n')
f.write('ZONE T = "zone1 " , I={}, F=BLOCK \n'.format(nb_value))
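One detail from the question is worth a note: np.split with an integer number of sections raises an error when the array cannot be divided equally, whereas np.array_split allows unequal chunks. The chunking can also be sketched without NumPy, as below. The function name format_block and the per_line parameter are hypothetical, and plain Python lists stand in for the arrays.

```python
def format_block(values, per_line=1000):
    """Format values as tab-separated '.7E' fields, per_line values per row."""
    lines = []
    for start in range(0, len(values), per_line):
        chunk = values[start:start + per_line]  # last chunk may be shorter
        lines.append("\t".join(format(v, ".7E") for v in chunk))
    return "\n".join(lines) + "\n"

x = [2.8, 4.2e5, -6.491083147394967334e-02]
print(format_block(x, per_line=2), end="")
```

The returned string can be passed to f.write() once per variable, in place of the two inner loops.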
