find substring between "a" and "(" + 2 unknown characters +")" - python

I have some strings which are built this way:
string = "blabla y the_blabla_I_want (two_digits_number) blabla"
I would like to get the_blabla_I_want.
I know re.search can help, but my problem is how to represent (two_digits_number).

To represent (two_digits_number), you may use "\([0-9]{2}\)".
Here is a regex tutorial in python.
To get the_blabla_I_want, you may try the following code:
import re
x = re.search(r"y (.*) \([0-9]{2}\)", string)
x[1]
Depending on how you define a two-digit number, you may want to change "\([0-9]{2}\)" to "\([1-9][0-9]\)" to exclude numbers with a leading zero.
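As a quick check of the pattern above, here is a minimal sketch. The helper name extract_between and the sample strings are made up for illustration; the point is that re.search returns None when nothing matches, so indexing the result directly would fail in that case.

```python
import re

# Hypothetical helper: returns the text between "y " and " (NN)",
# where NN is exactly two digits, or None when nothing matches.
def extract_between(text):
    match = re.search(r"y (.*) \([0-9]{2}\)", text)
    return match.group(1) if match else None

print(extract_between("blabla y the_blabla_I_want (42) blabla"))  # the_blabla_I_want
print(extract_between("blabla y the_blabla_I_want (4) blabla"))   # None
```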

string = "blabla y the_blabla_I_want (99) blabla"
if string.count("(") == 1 and string.count(")") == 1:
    start_idx = string.find("(")
    end_idx = string.find(")")
    two_digits_number = string[start_idx+1:end_idx]
    print(two_digits_number) # output: 99
    two_digits_number = string[start_idx:end_idx+1]
    print(two_digits_number) # output: (99)


Replace ip partially with x in python

I have several ip addresses like
162.1.10.15
160.15.20.222
145.155.222.1
I am trying to mask the IPs like this:
162.x.xx.xx
160.xx.xx.xxx
145.xxx.xxx.x
How can I achieve this in Python?
Here’s a slightly simpler solution
import re
txt = "192.1.2.3"
x = txt.split(".", 1) # ['192', '1.2.3']
y = x[0] + "." + re.sub(r"\d", "x", x[1])
print(y) # 192.x.x.x
We can use re.sub with a callback function here:
import re
def repl(m):
    return m.group(1) + '.' + re.sub(r'.', 'x', m.group(2)) + '.' + re.sub(r'.', 'x', m.group(3)) + '.' + re.sub(r'.', 'x', m.group(4))
inp = "160.15.20.222"
output = re.sub(r'\b(\d+)\.(\d+)\.(\d+)\.(\d+)\b', repl, inp)
print(output) # 160.xx.xx.xxx
In the callback, the idea is to use re.sub to surgically replace each digit by x. This keeps the same width of each original number.
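The same width-preserving idea can be sketched more compactly by letting the callback work on one octet at a time. This is only an illustration, not the answerer's code; the function name mask_ip is hypothetical.

```python
import re

def mask_ip(ip):
    # Replace every run of digits that directly follows a dot with as many
    # "x" characters as the run has digits; the first octet is left untouched.
    return re.sub(r"(?<=\.)\d+", lambda m: "x" * len(m.group()), ip)

print(mask_ip("160.15.20.222"))  # 160.xx.xx.xxx
```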
This is not the most optimized solution, but it works for me.
import re
Ip_string = "160.15.20.222"
Ip_string = Ip_string.split('.')
Ip_String_x = ""
flag = False
for num in Ip_string:
    if flag:
        # every octet after the first: replace each digit with x
        num = re.sub(r'\d', 'x', num)
        Ip_String_x = Ip_String_x + '.' + num
    else:
        # first octet: keep it unchanged
        flag = True
        Ip_String_x = num
print(Ip_String_x) # 160.xx.xx.xxx
Solution 1
Other answers are good, and this single regex works, too:
import re
strings = [
    '162.1.10.15',
    '160.15.20.222',
    '145.155.222.1',
]
for string in strings:
    print(re.sub(r'(?:(?<=\.)|(?<=\.\d)|(?<=\.\d\d))\d', 'x', string))
output:
162.x.xx.xx
160.xx.xx.xxx
145.xxx.xxx.x
Explanation
(?<=\.) means preceded by a single dot.
(?<=\.\d) means preceded by a dot and one digit.
(?<=\.\d\d) means preceded by a dot and two digits.
\d means a digit.
So every digit that is preceded by a dot and zero, one, or two digits is replaced with 'x'.
(?<=\.\d{0,2}) or similar patterns are not allowed, since a look-behind ((?<=...)) must have a fixed width.
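To see the fixed-width restriction in practice, the following sketch tries to compile the variable-width look-behind and then falls back to the three fixed-width alternatives. It assumes the standard library re module; the variable names are illustrative.

```python
import re

try:
    # Variable-width look-behind: the standard re module rejects this pattern.
    re.compile(r"(?<=\.\d{0,2})\d")
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)

# Three fixed-width look-behinds combined with alternation are accepted.
fixed = re.compile(r"(?:(?<=\.)|(?<=\.\d)|(?<=\.\d\d))\d")
print(fixed.sub("x", "145.155.222.1"))  # 145.xxx.xxx.x
```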
Solution 2
Without the re module and regex:
for string in strings:
    first, *rest = string.split('.')
    print('.'.join([first, *map(lambda x: 'x' * len(x), rest)]))
The code above produces the same result.
There are multiple ways to go about this. Regex is the most versatile way to write string-manipulation code, but you can also do it with plain for loops and the split and join functions.
ip = "162.1.10.15"
# Split the IPv4 address using '.' as the delimiter
ip = ip.split(".")
# Convert every substring except the first to x's
for i, val in enumerate(ip[1:]):
    cnt = 0
    for x in val:
        cnt += 1
    ip[i+1] = "x" * cnt
# Join the substrings back into an IP
ip = ".".join(ip)
print(ip)
I highly recommend looking into regex, but this is also a valid way to go about this task.
Hope you find this useful!
Pass an array of IPs to this function:
def replace_ips(ip_list):
    r_list = []
    for i in ip_list:
        first, *other = i.split(".", 3)
        r_item = []
        r_item.append(first)
        for i2 in other:
            r_item.append("x" * len(i2))
        r_list.append(".".join(r_item))
    return r_list
In case of your example:
print(replace_ips(["162.1.10.15","160.15.20.222","145.155.222.1"])) # ==> expected output: ["162.x.xx.xx","160.xx.xx.xxx","145.xxx.xxx.x"]
Oneliner FYI:
import re
ips = ['162.1.10.15', '160.15.20.222', '145.155.222.1']
pattern = r'\d{1,3}'
replacement_sign = 'x'
res = [re.sub(pattern, replacement_sign, ip[::-1], 3)[::-1] for ip in ips]
print(res) # ['162.x.x.x', '160.x.x.x', '145.x.x.x'] (one x per octet, so the digit widths are not preserved)

Add ": " for every nth character in a list

Say I have this:
x = ["hello-543hello-454hello-765", "hello-745hello-635hello-321"]
How can I get this output?
["hello-543: hello-454: hello-765", "hello-745: hello-635: hello-321"]
You can split each string based on substring length with a list comprehension using range where the step value is the number of characters each substring should contain. Then use join to convert each list back to a string with the desired separator characters.
x = ["hello-543hello-454hello-765", "hello-745hello-635hello-321"]
n = 9
result = [': '.join([s[i:i+n] for i in range(0, len(s), n)]) for s in x]
print(result)
# ['hello-543: hello-454: hello-765', 'hello-745: hello-635: hello-321']
Or with textwrap.wrap:
from textwrap import wrap
x = ["hello-543hello-454hello-765", "hello-745hello-635hello-321"]
n = 9
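# break_on_hyphens=False cuts strictly every n characters; with the default, newer Python versions may break after a hyphen instead
result = [': '.join(wrap(s, n, break_on_hyphens=False)) for s in x]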
print(result)
# ['hello-543: hello-454: hello-765', 'hello-745: hello-635: hello-321']
If you are sure the length of every string is a multiple of n, I would use re.findall for the task.
import re
txt1 = "hello-543hello-454hello-765"
txt2 = "hello-745hello-635hello-321"
out1 = ": ".join(re.findall(r'.{9}',txt1))
out2 = ": ".join(re.findall(r'.{9}',txt2))
print(out1) #hello-543: hello-454: hello-765
print(out2) #hello-745: hello-635: hello-321
.{9} in re.findall means 9 of any character except newline (\n), so this code works properly as long as your strings do not contain \n. If they might, pass re.DOTALL as the third argument of re.findall.
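To illustrate the newline caveat, here is a small sketch with a made-up input string that contains \n. Without the flag, the chunk containing the newline is skipped and the chunk boundaries shift; with re.DOTALL every 9-character chunk is returned.

```python
import re

txt = "hello-543hello\n454hello-765"  # made-up input containing a newline

without_flag = re.findall(r".{9}", txt)
with_flag = re.findall(r".{9}", txt, re.DOTALL)

print(without_flag)  # ['hello-543', '454hello-']
print(with_flag)     # ['hello-543', 'hello\n454', 'hello-765']
```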

Cut string within a specific pattern in python

I have a string of some length consisting of only four characters: 'A', 'T', 'G' and 'C'. The pattern 'GAATTC' occurs multiple times in the string, and I have to cut the string wherever this pattern occurs.
For example, for the string 'ATCGAATTCATA', I should get the output
string one - ATCGA
string two - ATTCATA
I am a newbie in Python, but I have come up with the following (incomplete) code:
seq = seq.upper()
str1 = "GAATTC"
seqlen = len(seq)
seq = list(seq)
for i in range(0,seqlen-1):
    site = seq.find(str1)
    print(site[0:(i+2)])
Any help would be really appreciated.
First, let's develop your idea of using find, so you can figure out your mistakes.
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GAATTC"
split_at = 2
seqlen = len(seq)
i = 0
while i < seqlen:
    # look for the next occurrence of the pattern, starting at position i
    site = seq.find(pattern, i)
    if site != -1:
        print(seq[i: site + split_at])
        i = site + split_at
    else:
        # no more occurrences: print the remainder and stop
        print(seq[i:])
        break
Python strings also have a powerful replace method that directly replaces fragments of a string. The snippet below uses replace to insert separators where needed:
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GA","ATTC"
pattern1 = ''.join(pattern) # 'GAATTC'
pattern2 = ' '.join(pattern) # 'GA ATTC'
splited_seq = seq.replace(pattern1, pattern2) # 'ATCGA ATTCATAATCGA ATTCATAATCGA ATTCATA'
print (splited_seq.split())
I believe it is more intuitive and should be faster than a regular expression (which might perform worse, depending on the library and usage).
Here is a simple solution:
seq = 'ATCGAATTCATA'
seq_split = seq.upper().split('GAATTC')
last = len(seq_split) - 1
# Every piece except the first starts with 'ATTC'; every piece except the last ends with 'GA'
result = [
    ('ATTC' if i > 0 else '') + seq_split[i] + ('GA' if i < last else '')
    for i in range(len(seq_split))
]
Result :
print(result)
['ATCGA', 'ATTCATA']
BioPython has a restriction enzyme package to do exactly what you're asking.
from Bio.Restriction import EcoRI
from Bio.Seq import Seq
print(EcoRI.site) # You will see that this is the enzyme you listed above
test = 'ATCGAATTCATA'.upper() # This is the sequence you want to search
my_seq = Seq(test) # Create a Biopython Seq object with our sequence (Bio.Alphabet was removed in Biopython 1.78, so no alphabet is passed)
cut_sites = EcoRI.search(my_seq)
cut_sites contains a list of exactly where to cut the input sequence (such that GA is in the left sequence and ATTC is in the right sequence).
You can then split the sequence into contigs using:
cut_sites = [0] + cut_sites # Add a leading zero so this works for the first contig; this might not always be needed
contigs = [test[i:j] for i,j in zip(cut_sites, cut_sites[1:]+[None])]
You can see this page for more details about BioPython.
My code is a bit sloppy, but you could try something like this when you want to iterate over multiple occurrences of the string:
def split_strings(seq):
    string1 = seq[:seq.find(str1) + 2]
    string2 = seq[seq.find(str1) + 2:]
    return string1, string2
test = 'ATCGAATTCATA'.upper()
str1 = 'GAATTC'
seq = test
while str1 in seq:
    string1, seq = split_strings(seq)
    print(string1)
print(seq)
Here's a solution using the regular expression module:
import re
seq = 'ATCGAATTCATA'
restriction_site = re.compile('GAATTC')
subseq_start = 0
for match in restriction_site.finditer(seq):
    # cut two characters into the match, i.e. between GA and ATTC
    print(seq[subseq_start:match.start()+2])
    subseq_start = match.start()+2
print(seq[subseq_start:])
Output:
ATCGA
ATTCATA
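The finditer approach can also be wrapped in a reusable function that returns the pieces instead of printing them. This is a sketch under stated assumptions: the function name cut_at_sites and the parameters site and offset are hypothetical, and overlapping occurrences of the site are not handled.

```python
import re

def cut_at_sites(seq, site="GAATTC", offset=2):
    """Split seq so that each cut falls `offset` characters into every match of `site`."""
    pieces = []
    start = 0
    for match in re.finditer(re.escape(site), seq):
        cut = match.start() + offset
        pieces.append(seq[start:cut])
        start = cut
    pieces.append(seq[start:])  # whatever remains after the last cut
    return pieces

print(cut_at_sites("ATCGAATTCATA"))              # ['ATCGA', 'ATTCATA']
print(cut_at_sites("ATCGAATTCATAATCGAATTCATA"))  # ['ATCGA', 'ATTCATAATCGA', 'ATTCATA']
```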

Splitting an unspaced string of decimal values - Python

An awful person has given me a string like this
values = '.850000.900000.9500001.000001.50000'
and I need to split it to create the following list:
['.850000', '.900000', '.950000', '1.00000', '1.50000']
I know that if I were dealing only with numbers < 1, I could use the code
dl = '.'
splitvalues = [dl+e for e in values.split(dl) if e != ""]
But in cases like this one, where there are numbers greater than 1 buried in the string, splitvalues would end up being
['.850000', '.900000', '.9500001', '.000001', '.50000']
So is there a way to split a string with multiple delimiters while also splitting the string differently based on which delimiter is encountered?
I think this is somewhat closer to a fixed width format string. Try a regular expression like this:
import re
pattern = r"(\d{1,2}\.\d{5})"
m = re.search(pattern, input_str) # input_str is the string to parse
your_first_number = m.group(0)
Try this repeatedly on the remaining string to consume all numbers.
>>> import re
>>> source = '0.850000.900000.9500001.000001.50000'
>>> re.findall(r"(.*?00+(?!0))", source)
['0.850000', '.900000', '.950000', '1.00000', '1.50000']
The split is based on looking for "anything (as little as possible), then a run of two or more zeros that is not followed by another zero".
Assuming the value before the decimal point is less than 10, we have:
values = '0.850000.900000.9500001.000001.50000'
result = list()
last_digit = None
for value in values.split('.'):
    if value.endswith('0'):
        result.append(''.join([i for i in [last_digit, '.', value] if i]))
        last_digit = None
    else:
        # the trailing non-zero digit belongs to the next number
        result.append(''.join([i for i in [last_digit, '.', value[0:-1]] if i]))
        last_digit = value[-1]
if values.startswith('0'):
    result = result[1:]
print(result)
# Output
['.850000', '.900000', '.950000', '1.00000', '1.50000']
How about using re.split():
import re
values = '0.850000.900000.9500001.000001.50000'
print([a + b for a, b in zip(*(lambda x: (x[1::2], x[2::2]))(re.split(r"(\d\.)", values)))])
OUTPUT
['0.85000', '0.90000', '0.950000', '1.00000', '1.50000']
Here the values have a fixed width: 6 digits, or 7 characters including the dot. Take the slices from 0 to 7, 7 to 14, and so on. Because we don't need the initial zero, I slice values[1:] before extracting.
values = '0.850000.900000.9500001.000001.50000'
[values[1:][start:start+7] for start in range(0,len(values[1:]),7)]
['.850000', '.900000', '.950000', '1.00000', '1.50000']
Test:
''.join([values[1:][start:start+7] for start in range(0,len(values[1:]),7)]) == values[1:]
True
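If the goal is to end up with numbers rather than strings, the fixed-width slices can be passed straight to float. This is a sketch assuming, as above, that every value occupies exactly 7 characters once the leading zero is dropped; the names width, body, fields and numbers are illustrative.

```python
values = '0.850000.900000.9500001.000001.50000'
width = 7  # assumed fixed field width, dot included

body = values[1:]  # drop the leading zero, as in the answer above
fields = [body[i:i + width] for i in range(0, len(body), width)]
numbers = [float(field) for field in fields]

print(fields)   # ['.850000', '.900000', '.950000', '1.00000', '1.50000']
print(numbers)  # [0.85, 0.9, 0.95, 1.0, 1.5]
```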
With a fixed / variable string, you may try something like:
values = '0.850000.900000.9500001.000001.50000'
str_list = []
first_index = values.find('.')
while first_index > 0:
    last_index = values.find('.', first_index + 1)
    if last_index != -1:
        str_list.append(values[first_index - 1: last_index - 2])
        first_index = last_index
    else:
        str_list.append(values[first_index - 1: len(values) - 1])
        break
print(str_list)
Output:
['0.8500', '0.9000', '0.95000', '1.0000', '1.5000']
This assumes there will always be a single digit before the decimal point.
Please take this as a starting point and not a copy paste solution.

How to simplify list processing in Python?

Here's my first Python program, a little utility that converts from a Unix octal code for file permissions to the symbolic form:
s=raw_input("Octal? ");
digits=[int(s[0]),int(s[1]),int(s[2])];
lookup=['','x','w','wx','r','rx','rw','rwx'];
uout='u='+lookup[digits[0]];
gout='g='+lookup[digits[1]];
oout='o='+lookup[digits[2]];
print(uout+','+gout+','+oout);
Are there ways to shorten this code that take advantage of some kind of "list processing"? For example, to apply the int function all at once to all three characters of s without having to do explicit indexing. And to index into lookup using the whole list digits at once?
digits=[int(s[0]),int(s[1]),int(s[2])];
can be written as:
digits = map(int,s)
or:
digits = [ int(x) for x in s ] #list comprehension
As it looks like you might be using python3.x (or planning on using it in the future based on your function-like print usage), you may want to opt for the list-comprehension unless you want to dig in further and use zip as demonstrated by one of the later answers.
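In Python 3, map returns a lazy iterator rather than a list, and raw_input is spelled input. A minimal Python 3 sketch of the same conversion, with the hypothetical function name octal_to_symbolic and a hard-coded sample value instead of user input, could look like this:

```python
def octal_to_symbolic(octal):
    """Convert a three-digit octal mode such as '754' to 'u=rwx,g=rx,o=r'."""
    lookup = ['', 'x', 'w', 'wx', 'r', 'rx', 'rw', 'rwx']
    digits = [int(ch) for ch in octal]  # a real list, so it can be indexed and reused
    return ','.join(
        '{}={}'.format(who, lookup[d]) for who, d in zip('ugo', digits)
    )

print(octal_to_symbolic('754'))  # u=rwx,g=rx,o=r
```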
Here is a slightly optimized version of your code:
s = raw_input("Octal? ")
digits = map(int, s)
lookup = ['','x','w','wx','r','rx','rw','rwx']
perms = [lookup[d] for d in digits]
rights = ['{}={}'.format(*x) for x in zip('ugo', perms)]
print(','.join(rights))
You can also do it with bitmasks:
masks = {
    0b100: 'r', # 4
    0b010: 'w', # 2
    0b001: 'x' # 1
}
octal = raw_input('Octal? ')
result = '-'
for digit in octal[1:]: # skips the first character, so the input is assumed to have a leading 0, as in '0755'
    for mask, letter in sorted(masks.items(), reverse=True):
        if int(digit, 8) & mask:
            result += letter
        else:
            result += '-'
print(result)
Here's my version, inspired by Blender's solution:
bits = list(zip([4, 2, 1], "rwx")) # a list, so it can be iterated once per group
groups = "ugo"
s = raw_input("Octal? ")
digits = map(int, s)
parts = []
for group, digit in zip(groups, digits):
    letters = [letter for bit, letter in bits if digit & bit]
    parts.append("{0}={1}".format(group, "".join(letters)))
print(",".join(parts))
I think it's better not to have to explicitly enter the lookup list.
Here's my crack at it (including '-' for missing permissions):
lookup = {
    0b000 : '---',
    0b001 : '--x',
    0b010 : '-w-',
    0b011 : '-wx',
    0b100 : 'r--',
    0b101 : 'r-x',
    0b110 : 'rw-',
    0b111 : 'rwx'
}
s = raw_input('octal?: ')
print(','.join( # using ',' as the delimiter
    r + '=' + lookup[int(n, 8)] # the letter followed by the permissions
    for n, r in zip(tuple(s), 'ugo'))) # for each number/letter pair
