Python RegexHelp - python

I have a sentence and want to run a regex on it to match a word.
Test Inputs :
This is about CHG6784532
Starting CHG4560986.
Code Snippet:
regVal = re.compile(r"(CHG\w+)")
for i in text:
    if regVal.search(i):
        print(i)
Desired Output:
CHG4560986 ( NOT CHG4560986.)
The output for the first input is correct: it prints "CHG6784532". The second, however, prints "CHG4560986." with the trailing period. I tried adding ^ and $ to the regex, but that did not help. Is there something I am missing here?
Thanks!

Make sure text is a string variable (if it is a list use " ".join(text) instead of text in the code below) and then you may use
import re
text="This is about CHG6784532\nStarting CHG4560986."
regVal = re.compile(r"CHG\w+")
res = regVal.findall(text)
print(res)
# => ['CHG6784532', 'CHG4560986']
See the Python demo.
Details
regVal = re.compile(r"CHG\w+") - compiles the CHG\w+ pattern and stores it in regVal: it matches CHG followed by one or more word characters (letters, digits, or underscores), so the trailing period is never part of a match
res = regVal.findall(text) - finds all non-overlapping matches in text and stores them in res as a list
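If the input really is a list of lines, as the loop in the question suggests, a minimal sketch along these lines prints only the matched text instead of the whole line. The names lines, reg_val, and matches are hypothetical and chosen for illustration.

```python
import re

# Hypothetical input: one sentence per list element.
lines = ["This is about CHG6784532", "Starting CHG4560986."]

reg_val = re.compile(r"CHG\w+")

matches = []
for line in lines:
    m = reg_val.search(line)
    if m:
        # m.group() is only the matched text, not the whole line,
        # and \w+ stops before the trailing period.
        matches.append(m.group())

print(matches)
```

The key difference from the question's loop is calling group() on the match object rather than printing the line that was searched.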

Related

Replace parentheses with "_" leaving all the contents as they are

I have a Verilog file in which some inputs and outputs are named like 133GAT(123). For example:
nand2 g679(.a(n752), .b(n750), .O(1355GAT(558) ));
Here, I have to replace only 1355GAT(558) with 1355GAT_558 and leave .a(n752) untouched. There are multiple such instances.
I tried with python3.
re.sub(r'GAT*\((\w+)\)',r'_\1',"nand2 g679(.a(n752), .b(n750), .O(1355GAT(558) ) ")
It is giving output as
'nand2 g679(.a(n752), .b(n750), .O(1355_558 ) '
My expectation is to get the output as
'nand2 g679(.a(n752), .b(n750), .O(1355GAT_558 ) '
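Why your code is not giving you the expected result
Your regex GAT*\((\w+)\) matches GA followed by zero or more T characters (GA, GAT, GATT, and so on) and then a parenthesized word. It does match GAT(558) in your string, but GAT sits outside the capture group, so the substitution _\1 replaces the whole match and GAT is lost. To keep it, either capture it and include it in the replacement, or match it with a lookbehind so it is not consumed.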
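To make the diagnosis concrete, here is a small sketch that runs the original pattern next to a variant that captures GAT. The variable names dropped and kept are illustrative only.

```python
import re

s = "nand2 g679(.a(n752), .b(n750), .O(1355GAT(558) ));"

# GAT is matched but not captured, so the replacement drops it.
dropped = re.sub(r"GAT*\((\w+)\)", r"_\1", s)

# Capturing GAT and re-emitting it with \1 keeps it in the output.
kept = re.sub(r"(GAT)\((\w+)\)", r"\1_\2", s)

print(dropped)
print(kept)
```

Everything the pattern matches is replaced, so any text that must survive has to appear in the replacement through a backreference.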
Regex 1
This works and gives you the option to check for digits before GAT.
See this regex in use here
# regex
(\d+GAT)\((\d+)\)
# replacement
\1_\2
Code 1
See code in use here
import re
s = "nand2 g679(.a(n752), .b(n750), .O(1355GAT(558) ));"
r = r'(\d+GAT)\((\d+)\)'
x = re.sub(r,r'\1_\2',s)
print(x)
Regex 2
This works too, but uses one capture group rather than two.
See this regex in use here
# regex
(?<=\dGAT)\((\d+)\)
# replacement
_\1
Code 2
See code in use here
import re
s = "nand2 g679(.a(n752), .b(n750), .O(1355GAT(558) ));"
r = r'(?<=\dGAT)\((\d+)\)'
x = re.sub(r,r'_\1',s)
print(x)

Strike through Markdown text between two symbols

I'm trying to emulate the strike-through markdown from GitHub in python and I managed to do half of the job. Now there's just one problem I have: The pattern I'm using doesn't seem to replace the text containing symbols and I couldn't figure it out so I hope someone can help me
text = "This is a ~~test?~~"
match = re.findall(r"(?<![.+?])(~{2})(?!~~)(.+?)(?<!~~)\1(?![.+?])", text) # Finds all the text between ~~ symbols
if match:
    for _, m in match: # Iterates through the matches. The first variable (_) contains the symbol ~ and the second one (m) contains the text I want to replace
        text = re.sub(f"~~{m}~~", "\u0336".join(m) + "\u0336", text) # Should replace ~~test?~~ with t̶e̶s̶t̶?̶ but it fails
The problem is in the pattern you build for the replacement. With m equal to test?, the pattern ~~{m}~~ becomes ~~test?~~, and the unescaped ? is a quantifier that makes the preceding t optional. The pattern therefore matches ~~tes~~ or ~~test~~ but never the literal ~~test?~~, so nothing is replaced. Use re.escape(m) instead of m so that metacharacters are escaped and treated as literals.
Try your modified Python code,
import re
text = "This is a ~~test?~~"
match = re.findall(r"(?<![.+?])(~{2})(?!~~)(.+?)(?<!~~)\1(?![.+?])", text) # Finds all the text between ~~ symbols
if match:
    for _, m in match: # _ holds the ~~ delimiter, m holds the text to replace
        # re.escape(m) makes the ? in test? match literally
        text = re.sub(f"~~{re.escape(m)}~~", "\u0336".join(m) + "\u0336", text) # Replaces ~~test?~~ with t̶e̶s̶t̶?̶
print(text)
This replaces like you expected and prints,
This is a t̶e̶s̶t̶?̶
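As an alternative sketch, the find-then-substitute loop can be collapsed into a single re.sub call with a function as the replacement, which avoids building a second pattern from matched text and so avoids the escaping problem entirely. The pattern here is a simplified assumption (the shortest run between two ~~ markers), not the asker's original lookaround pattern, and the strike and repl names are hypothetical.

```python
import re

def strike(text):
    """Replace ~~...~~ spans with struck-through text (a sketch)."""
    def repl(match):
        inner = match.group(1)
        # Append the combining stroke U+0336 after every character.
        return "".join(ch + "\u0336" for ch in inner)
    # Simplified pattern (an assumption): shortest run between two ~~ markers.
    return re.sub(r"~~(.+?)~~", repl, text)

result = strike("This is a ~~test?~~")
print(result)
```

A string returned by a replacement function is inserted as is, so special characters in the matched text need no escaping.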

How to extract function name python regex

Hello, I am trying to extract a function name in Python using a regex, but I am new to Python and nothing seems to work for me. For example, if I have the string "def myFunction(s): ....", I want to return just myFunction.
import re
def extractName(s):
    string = []
    regexp = re.compile(r"\s*(def)\s+\([^\)]*\)\s*{?\s*")
    for m in regexp.finditer(s):
        string += [m.group()]
    return string
Assumption: You want the name myFunction from "...def myFunction(s):..."
Something is missing from your regex and the way it is structured.
\s*(def)\s+\([^\)]*\)\s*{?\s*
Let's look at it step by step:
\s*: matches zero or more whitespace characters.
(def): matches the word def.
\s+: matches one or more whitespace characters.
\([^\)]*\): matches an opening parenthesis, any characters other than ), and a closing parenthesis.
\s*: matches zero or more whitespace characters.
What comes after that hardly matters if you only want the name of the function. The problem is that nothing between \s+ and \( matches the name itself, so the pattern never matches def myFunction(s) and no group captures the name.
You can try this regex if you want to do it with a regex:
\s*(def)\s([a-zA-Z]*)\([a-zA-Z]*\)
With the regex structured this way, group 0 is def myFunction(s), group 1 is def, and group 2 is myFunction. Note that it only handles names and parameter lists made of ASCII letters. You can use the following code to get your result:
import re
def extractName(s):
    string = ""
    # group(2) is the function name
    regexp = re.compile(r"(def)\s([a-zA-Z]*)\([a-zA-Z]*\)")
    for m in regexp.finditer(s):
        string += m.group(2)
    return string
You can check your regex live by going on this site.
Hope it helps!
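A slightly more general sketch, in case the names contain digits or underscores or the parameter list contains more than letters. The pattern and the names DEF_RE and extract_names are assumptions for illustration. It captures only the identifier and does not try to match the parameter list at all.

```python
import re

# Assumed pattern: "def", whitespace, an ASCII identifier, optional whitespace, "(".
DEF_RE = re.compile(r"\bdef\s+([A-Za-z_]\w*)\s*\(")

def extract_names(source):
    """Return every function name that follows a def keyword."""
    # With a single capture group, findall returns the captured names directly.
    return DEF_RE.findall(source)

names = extract_names("def myFunction(s): ...\ndef _helper_2(a, b): ...")
print(names)
```

Returning a list rather than concatenating into one string keeps the names separate when the input contains several definitions.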

How to substitute character in string using regex

I want to change the following string
^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
to this:
^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
(20151204 changed to 2015-12-04 only)
I can accomplish it by:
re.sub("20151204", "2015-12-04", string)
where
string= ^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
But the value 20151204 is a date and will change and I can't have it hardcoded.
I tried:
re.sub("2015\\d{2}\\d{2}", "2015\-\\d{2}\-\\d{2}", string)
However this did not work.
You need to use capture groups in the pattern and backreferences in the replacement:
result = re.sub("2015(\\d{2})(\\d{2})", "2015-\\1-\\2", string)
# The two (\\d{2}) groups capture the month and day; \\1 and \\2 insert them back.
# => ^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
See IDEONE demo
If you need to match any year after ^mylog\., you can use
result = re.sub(r"^\^mylog\\\.(\d{4})(\d{2})(\d{2})", r"^mylog\.\1-\2-\3", string)
See another demo
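The same idea can be sketched with named groups, which some readers find easier to follow than numbered backreferences. This assumes the date is the only standalone run of exactly eight digits in the string. The group names y, m, and d and the variable date_re are illustrative.

```python
import re

# The pattern string from the question, kept raw so the backslashes stay literal.
string = r"^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$"

# Assumption: the date is a run of exactly eight digits with no digit on either side.
date_re = re.compile(r"(?<!\d)(?P<y>\d{4})(?P<m>\d{2})(?P<d>\d{2})(?!\d)")

# \g<name> inserts the text captured by the named group.
result = date_re.sub(r"\g<y>-\g<m>-\g<d>", string, count=1)
print(result)
```

The lookarounds keep the pattern from matching part of a longer digit run, and count=1 limits the change to the first date found.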
You first need to find the date and then convert it into the required format and then replace the new string in your old text.
See the code below:
import re
text = r"^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$"  # raw string keeps the backslashes literal
search = re.search(r'\d{4}\d{2}\d{2}', text)
search = search.group()
you get search as:
20151204
Now create the date as you want:
new_text = search[0:4] + "-" + search[4:6] + "-" + search[6:8]
So new_text will be:
2015-12-04
Now substitute new_text for the original date using re.sub():
text = re.sub(search,new_text,text)
So now text will be:
^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ;, and then use a regular expression that returns all matches, not just a single one, with re.findall('c?[0-9]+', text).
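A short sketch of both points: the repeated group keeps only its last capture, while the two-step approach returns every code. The first pattern is the question's regex with an optional space added after the comma (an adjustment made here so that it matches the sample string). The names last_only, body, and codes are illustrative.

```python
import re

string = "start: c12354, c3456, 34526; other stuff that I don't care about"

# A repeated group keeps only its final capture.
# (" ?" is added so the spaces after the commas are allowed.)
m = re.search(r"start: (c?[0-9]+,? ?)+;", string)
last_only = m.groups()

# Two steps: isolate the text between "start:" and ";", then collect every code.
body = re.match(r"start:([^;]*);", string).group(1)
codes = re.findall(r"c?[0-9]+", body)

print(last_only)
print(codes)
```

The first re.match also serves as the validation step: it returns None when the string does not start with start: or has no terminating semicolon.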
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if s starts with this string
s.endswith(";") # True if s ends with this string
s[len("start:"):-1].strip().split(', ') # list of tokens separated by ", " (strip removes the space after "start:")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # Each group is [prefix, digits, comma]; c[1] is the digits
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = sstr.match(mystr)
if match:
    # group(1) is the text between "start:" and ";"
    res = slst.findall(match.group(1))
    print(res)
results in
['12354', '3456', '34526']
