regular expression replace (if pattern found replace symbol for symbol) - python

I have several lines of text (RNA sequences) that are aligned by similarity, and I want to build a matrix describing the conservation of each character.
The alignment contains gaps (-). A very long run of gaps (e.g. more than 100 dashes) actually means that a whole structure is missing. When that happens, I want to replace the run with the same number of dots (a different symbol, to distinguish it from ordinary gaps).
I thought I could do this with a regular expression, but I cannot replace only the matched pattern: either the run is replaced with the wrong number of dots, or every dash in the line gets replaced.
My code looks like this:
with alnfile as f_in:
    if re.search('-{100,}', elem,):
        elem = re.sub('-{100,}','.', elem, ) #failed alternative*len(m.groups(x)), elem)
    print len(elem) # check if I am keeping the length of my sequence
    print elem[0:100] # check the start
    f1.write(elem)
If my file is:
ONE ----(*100)atgtgca----(*20)
I am getting:
ONE ..(*100)atgtgca----(*20)
With my other attempt, every dash is replaced by a dot, and I get:
ONE ....(*100)atgtgca....(*20)
WHAT I NEED:
ONE ....(*100)atgtgca----(*20)
I know that I am missing something, but I cannot figure out what. Is there a flag or something similar that would let me make exactly this change?

You could try the following:
import re
data = "ONE " + "-" * 100 + "atgtgca" + "-" * 20
print(re.sub(r'-{100,}', lambda x: '.' * len(x.group(0)), data))
This would display:
ONE ....................................................................................................atgtgca--------------------
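A slightly fuller sketch of the same idea, wrapped in a function so it can be applied line by line. The function name `mask_long_gaps` and the `min_run` parameter are made up for illustration; the threshold of 100 is taken from the question.

```python
import re

def mask_long_gaps(line, min_run=100):
    """Replace each run of at least `min_run` dashes with the same number of dots."""
    pattern = re.compile(r'-{%d,}' % min_run)
    # The callable receives each match object; returning a string of equal
    # length keeps the alignment columns intact.
    return pattern.sub(lambda m: '.' * len(m.group(0)), line)

data = "ONE " + "-" * 100 + "atgtgca" + "-" * 20
masked = mask_long_gaps(data)
print(len(masked) == len(data))  # the sequence length is preserved
print(masked[-27:])              # the short trailing gap is still dashes
```

Passing a function as the replacement is what makes the per-match length available; a plain replacement string cannot know how long each run was.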

Related

regex in python - how to understand this ip label without parentheses

I have this code to check if a string is a valid IPv4 address:
import re
def is_ip4(IP):
    label = "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
    pattern = re.compile("(" + label + "\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
It works fine. But if I remove the parentheses from the label, like this:
import re
def is_ip4(IP):
    label = "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"
    pattern = re.compile("(" + label + "\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
it reports "2090.1.11.0" and "20.1.11.0" as valid IPs, but not "2.1.11.0". I'm confused about the difference between the versions with and without parentheses. Can someone explain this to me? Thanks.
The reason you need the parentheses is the two-step process you're using. By themselves, the parentheses don't do anything (other than capturing in a group). But you're also doing this:
pattern = re.compile("(" + label + "\.){3}" + label + "$")
The label regex is copied twice, first for three repetitions followed by a period. That copy is fine (almost), because in the statement, it is enclosed in parentheses once more. However, the second copy is outside any parentheses, so you end up with a regex like (simplified):
pattern == '(a|ab|abc\.){3}a|ab|abc$'
This matches if (a|ab|abc\.){3}a matches, or ab, or abc$. Inside the group, the period also belongs only to the last alternative, abc. With parentheses, it would be like:
pattern == '((a|ab|abc)\.){3}(a|ab|abc)$'
So, although the parentheses appear superfluous, they are not, for two reasons: they keep the period separate from the last option abc, and they keep the final choices together and apart from the first bit.
However, you shouldn't be doing this in the first place. Just use:
from ipaddress import IPv4Address
def is_ip4(ip):
    try:
        # IPv4Address rejects IPv6 input; ip_address() would accept both
        IPv4Address(ip)
        return True
    except ValueError:
        return False
No installation required; ipaddress is part of the standard library.
The reason you get a match for '2090.1.11.0' is that matching it against this:
'([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$'
comes down to matching it against this:
'([0-9]){3}[0-9]'
This is because [0-9] is the first option in the 'or' expression in parentheses, which is repeated three times, and the second [0-9] is the first option in the 'or' expression after the {3}.
Note that the $ you put in to ensure the entire string was matched is lumped in with the final 'or' option, so it doesn't do anything here.
Try running the code below and note the identical first match:
import re
print(re.findall('([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$', '2090.1.11.0'))
print(re.findall('([0-9]){3}[0-9]', '2090.1.11.0'))
(Ignore the second match in the first line of output; it is not relevant here.)
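To see the effect of the grouping in isolation, here is a small sketch that builds the pattern both ways and compares them on the three inputs from the question. Wrapping the label in a non-capturing group `(?:...)` and using `re.fullmatch` instead of a trailing `$` are choices made for this sketch, not part of the original code.

```python
import re

label = r"[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"

# Without grouping, the alternation leaks into the surrounding pattern.
broken = re.compile("(" + label + r"\.){3}" + label + "$")

# Wrapping the label in a non-capturing group keeps the alternatives together.
grouped = "(?:" + label + ")"
fixed = re.compile("(?:" + grouped + r"\.){3}" + grouped)

for ip in ["2090.1.11.0", "20.1.11.0", "2.1.11.0"]:
    # Columns: input, result of the ungrouped pattern, result of the grouped one
    print(ip, bool(broken.match(ip)), bool(fixed.fullmatch(ip)))
```

The ungrouped pattern accepts the first two inputs and rejects the third, as described in the question; the grouped one rejects only "2090.1.11.0".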

Not getting intended results with "either/or" character in python regex

I'm trying to match some fairly simple text but am having trouble with the "|" character. The text is:
"TF0876 some text Y N 2.31 - 0.01\n TF9788 more text N Y - 2.3 -\n TF1626"
and I want to extract two items using re.findall:
"TF0876 some text Y N 2.31" and
"TF9788 more text N Y -"
The code I thought would work is:
mat = re.compile(r"TF\d{4}.*?[Y|N] [Y|N] [-|\d\.\d*]",flags=re.DOTALL)
test2 = re.findall(mat,text)
print(test2)
However, this gives me the following list:
['TF0876 some text Y N 2', 'TF9788 more text N Y -']
For some reason, the first match stops at the "2" rather than at the "2.31" that I want. If I simply type 2.31 in place of \d\.\d*, it still matches only up to the "2". In fact, whatever I type, I only seem to get one character from either side of the "|". I don't understand this; the regex HOWTO says that the expression Crow|Servo will match "Crow" or "Servo", but nothing smaller (such as "Cro"). In my case the opposite seems to be happening, so I clearly don't understand something and would be grateful for help.
Thanks.
The problem lies in your compiled pattern. Try changing it to:
mat = re.compile(r"TF\d{4}.*?[YN] [YN] [-\d\.]*",flags=re.DOTALL)
You do not need the "|" inside "[]". Square brackets already denote a set of characters, any one of which may match, so a "|" inside them is treated as a literal pipe.
A second option is to use groups, with "()" in place of your "[]". Which one to choose depends on what exactly you want to match; both will work on your example text.
The problem is that you are using brackets [] instead of parentheses () to separate subgroups. Try this:
import re
text = "TF0876 some text Y N 2.31 - 0.01\n TF9788 more text N Y - 2.3 -\n TF1626"
mat = re.compile(r"TF\d{4}.*?(?:Y|N) (?:Y|N) (?:-|\d\.\d*)",flags=re.DOTALL)
test2 = re.findall(mat, text)
print(test2)
# ['TF0876 some text Y N 2.31', 'TF9788 more text N Y -']
Here the ?: bits are just so subgroups are not captured. Note that (?:Y|N) is basically the same as simply [YN].
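The difference between a character class and a group can be checked directly. The following is a minimal sketch using short made-up test strings rather than the full text from the question:

```python
import re

# Inside [...] the pipe is just another literal character.
pipe_in_class = re.findall(r"[Y|N]", "Y|N")
print(pipe_in_class)   # ['Y', '|', 'N']

# A character class always consumes exactly one character,
# so [-|\d\.\d*] can never match the four-character "2.31" in one go.
one_char_each = re.findall(r"[-|\d\.\d*]", "2.31")
print(one_char_each)   # ['2', '.', '3', '1']

# Alternation between multi-character options needs a group.
with_group = re.findall(r"(?:-|\d\.\d*)", "2.31 -")
print(with_group)      # ['2.31', '-']
```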

python substitute a substring with one character less

I need to process lines having a syntax similar to markdown http://daringfireball.net/projects/markdown/syntax, where header lines in my case are something like:
=== a sample header ===
===== a deeper header =====
and I need to change their depth, i.e. reduce it (or increase it) so:
== a sample header ==
==== a deeper header ====
My limited knowledge of Python regexes is not enough to work out how to replace a run of n '=' signs with (n-1) '=' signs.
You could use backreferences and two negative lookarounds to find two corresponding sets of = characters.
output = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', input)
That will also work if you have a longer string that contains multiple headers (and will change all of them).
What does the regex do?
(?<!=) # make sure there is no preceding =
= # match a literal =
( # start capturing group 1
=+ # match one or more =
) # end capturing group 1
( # start capturing group 2
.*? # match zero or more characters, but as few as possible (due to ?)
) # end capturing group 2
= # match a =
\1 # match exactly what was matched with group 1 (i.e. the same amount of =)
(?!=) # make sure there is no trailing =
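As a rough sketch of how this pattern could cover both directions asked about in the question, the same regex can be reused with a different replacement string to increase the depth. The names `HEADER`, `decrease_depth` and `increase_depth` are made up for illustration.

```python
import re

HEADER = re.compile(r'(?<!=)=(=+)(.*?)=\1(?!=)')

def decrease_depth(text):
    # Group 1 already holds n-1 '=' signs, so reuse it on both sides.
    return HEADER.sub(r'\1\2\1', text)

def increase_depth(text):
    # Restore the single '=' matched outside group 1 and add one more.
    return HEADER.sub(r'==\1\2==\1', text)

sample = "=== a sample header ===\n===== a deeper header ====="
print(decrease_depth(sample))
print(increase_depth(sample))
```

Because the pattern requires at least two `=` on each side, a header that is already at depth one (`= title =`) is left unchanged by both functions.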
No need for regexes. I would go very simple and direct:
import sys
for line in sys.stdin:
    trimmed = line.strip()
    if len(trimmed) >= 2 and trimmed[0] == '=' and trimmed[-1] == '=':
        print(trimmed[1:-1])
    else:
        print(line.rstrip())
The initial strip is useful because in Markdown people sometimes leave blank spaces at the end of a line (and maybe the beginning). Adjust accordingly to meet your requirements.
I think it can be as simple as replacing '=(=+)' with \1 .
Is there any reason for not doing so?
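For input that consists only of header lines, that shorter substitution gives the same result. One possible reason to prefer a stricter pattern is that `=(=+)` shortens every run of two or more `=` anywhere in the text, including runs that are not header markers. A small sketch (the function name and the second sample string are made up):

```python
import re

def shorten_runs(text):
    # Remove one '=' from every run of two or more '=' characters.
    return re.sub(r'=(=+)', r'\1', text)

print(shorten_runs("=== a sample header ==="))  # == a sample header ==
print(shorten_runs("if a == b: pass"))          # if a = b: pass
```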
How about a simple solution?
lines = ['=== a sample header ===', '===== a deeper header =====']
new_lines = []
for line in lines:
    if line.startswith('==') and line.endswith('=='):
        new_lines.append(line[1:-1])
Results:
['== a sample header ==', '==== a deeper header ====']
Or in one line:
new_lines = [line[1:-1] for line in lines if line.startswith('==') and line.endswith('==')]
The logic here is that if a line starts and ends with '==', it has at least two '=' on each side, so after trimming one character from each end at least one '=' remains on each side.
This works as long as each line starts and ends with its run of '=' characters, which header lines will do once you strip the trailing newlines.
For either the first or the second header, you can just use string replacement like this:
s = "=== a sample header ==="
s = s.replace("= ", " ")  # replace() returns a new string, so reassign it
s = s.replace(" =", " ")
The same two calls handle the second header as well.
By the way, you could also use the sub function of the re module, but it's not necessary.

Matching optional numbers in regex

This one is probably a simple one, but I could not find an example that's simple enough to understand (sorry, I'm new with RegEx).
I'm writing some Python code to search for any string that matches any of the following examples:
float[20]
float[7532]
float[]
So this is what I have so far:
import re
p = re.compile(r'float\[[0-9]+\]')
print(p.match("float[20]"))
print(p.match("float[7532]"))
print(p.match("float[]"))
The code works great for the first and second scenarios, but not the third (no numbers between brackets). What's the best way to add that condition?
Thanks a lot!
p = re.compile(r'float\[[0-9]*\]')
Putting a * after the character class means zero or more matches of the character class.
Try
float\[\d*\]
\d is a shortcut for [0-9].
The asterisk matches 0..n (any number) of characters of the character class.
The + operator requires at least one instance of whatever it's applying to, which your third option doesn't have. You want the * operator which is 0 or more. So:
p = re.compile(r'float\[[0-9]*\]')
Try:
import re
p = re.compile(r'float\[[0-9]*\]')
print(p.match("float[20]"))
print(p.match("float[7532]"))
print(p.match("float[]"))
+ means one or more elements, and * means zero or more elements.
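The two quantifiers can be compared side by side. This is a minimal sketch; the variable names `plus` and `star` are only for illustration.

```python
import re

plus = re.compile(r'float\[[0-9]+\]')  # at least one digit required
star = re.compile(r'float\[[0-9]*\]')  # digits are optional

for s in ["float[20]", "float[7532]", "float[]"]:
    # Columns: input, matched by the + version, matched by the * version
    print(s, bool(plus.match(s)), bool(star.match(s)))
```

Note that `match` only anchors at the start of the string, so `float[20]xyz` would also be accepted; `fullmatch` is an option if the whole string has to fit the pattern.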

Python regex: find lines where period is missing in

I'm looking for a regular expression, implemented in Python, that will match on this text
WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'
but will not match on this text
WHERE AsPolicy.PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'
I'm doing this to find places in a large piece of SQL where the developer did not explicitly reference the table name. All I want to do is print the offending lines (the first WHERE clause above). I have all of the code done except for the regex.
re.compile('''WHERE [^.]+ =''')
Here, the [] indicates "match a set of characters," the ^ means "not" and the dot is a literal period. The + means "one or more."
Was that what you were looking for?
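Something like
WHERE .*\..* = .*
I'm not sure how accurate it will be; that depends on what your data looks like. Note that this pattern matches the qualified form (the one containing a period), so the offending lines are the ones it does not match. If you provide a bigger sample, it can be refined.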
Something like this would work in Java, C#, or JavaScript; I suppose you can adapt it to Python:
/WHERE +[^\.]+ *\=/
>>> l
["WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753' ", "WHERE AsPolicy.PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753' "]
>>> [line for line in l if re.match('WHERE [^.]+ =', line)]
["WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753' "]
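For a larger piece of SQL, the same check could be run line by line. The sketch below is a variant of the patterns above, not the original answer's code: it also excludes whitespace from the column name so that the match cannot run on past the first token, and it assumes the column name directly follows WHERE. The sample SQL is made up.

```python
import re

# Hypothetical sample; in practice these would be lines read from the SQL file.
sql = """SELECT * FROM AsPolicy
WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'
SELECT * FROM AsPolicy
WHERE AsPolicy.PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'"""

# WHERE, then a single token containing no period, then '='.
unqualified = re.compile(r'WHERE\s+[^.\s]+\s*=')

offending = [line for line in sql.splitlines() if unqualified.search(line)]
for line in offending:
    print(line)
```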
