re.compile() python: issue in getting particular pattern - python

I have been trying to extract a particular pattern, such as (PSSA) or (FJFD10), from a string.
In a string like the one below, I want to extract whatever is inside the parentheses, (PNDM) in this case, and print it without the parentheses.
eg_string = """DAAAAAAJFF: Hellllllllo (PNDM)
CC [MIM:606176]: Blalblablalbalbl. {CCO:0000069|Pubd:160,
CC ECO:0000269|PubMed:18162506}. Note=elllelefjfjfjf HAahndfd
"""
What I did was:
patti = re.compile(r'([A-Z]+)')
www = patti.findall(eg_string)
However, this gave me more than I needed. It included PNDM, but it also included things like DAAAAAAJFF and ECO.
Another thing I tried was r'(^[A-Z]+)', which I knew would print only DAAAAAAJFF. I want to know how to print (PNDM), which is in the middle of the string.

Use the regex r"\([A-Z]+\)" to get results that include the parentheses.
Demo: https://regex101.com/r/e2gyly/1
Explanation:
\( - matches a literal opening parenthesis (
[A-Z]+ - one or more characters in the range A to Z
\) - matches a literal closing parenthesis )
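The pieces above can be combined as in the following minimal sketch. It shows the difference between matching with and without a capturing group. The [A-Z0-9] class in the second pattern is an assumption, added so that codes with digits such as (FJFD10) from the question would also match.

```python
import re

eg_string = """DAAAAAAJFF: Hellllllllo (PNDM)
CC [MIM:606176]: Blalblablalbalbl. {CCO:0000069|Pubd:160,
CC ECO:0000269|PubMed:18162506}. Note=elllelefjfjfjf HAahndfd
"""

# Without a capturing group, findall returns the whole match,
# so the literal parentheses are part of the result.
with_parens = re.findall(r"\([A-Z]+\)", eg_string)

# With a capturing group, findall returns only the group's contents.
# [A-Z0-9] is an assumption so that codes like (FJFD10) match too.
without_parens = re.findall(r"\(([A-Z0-9]+)\)", eg_string)

print(with_parens)     # ['(PNDM)']
print(without_parens)  # ['PNDM']
```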

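Here ([A-Z]+) is treated as a capturing group around [A-Z]+, so it matches any run of capital letters; the parentheses are not matched literally. Change it to \(([A-Z]+)\), which matches literal parentheses and captures only what is between them.
Your code will then look like this: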
import re
eg_string = """DAAAAAAJFF: Hellllllllo (PNDM)
CC [MIM:606176]: Blalblablalbalbl. {CCO:0000069|Pubd:160,
CC ECO:0000269|PubMed:18162506}. Note=elllelefjfjfjf HAahndfd
"""
patti = re.compile(r'\(([A-Z]+)\)')
www = patti.findall(eg_string)
print(www)
#Output : ['PNDM']
Hope this will Help...

Related

How to extract value from re?

import re
cc = 'test 5555555555555555/03/22/284 test'
cc = re.findall('[0-9]{15,16}\/[0-9]{2,4}\/[0-9]{2,4}\/[0-9]{3,4}', cc)
print(cc)
['5555555555555555/03/22/284']
This code works fine, but if I put 5555555555555555|03|22|284 in the cc variable, the output is:
[]
I want it to handle both cases: if the string contains '|' the output should be 5555555555555555|03|22|284, and if it contains '/' the output should be 5555555555555555/03/22/284.
Just replace all the /s in your regex (which incidentally don't need to be backslashed) with [/|], which matches either a / or a |. Or if you want backslashes, too, as in your comment on Zain's answer, [/|\\]. (You should always use raw strings r'...' for regexes since they have their own interpretation of backslashes; in a regular string, [/|\\] would have to be written [/|\\\\].)
match = re.findall(
r'[0-9]{15,16}[/|\\][0-9]{2,4}[/|\\][0-9]{2,4}[/|\\][0-9]{3,4}',
cc)
Any other characters you want to include, like colons, can likewise be added between the square brackets.
If you want to accept repeated characters – and treat them as a single delimiter – you can add + to accept "1 or more" of any of the characters:
match = re.findall(
r'[0-9]{15,16}[:/|\\]+[0-9]{2,4}[:/|\\]+[0-9]{2,4}[:/|\\]+[0-9]{3,4}',
cc)
But that will accept, for example, 555555555555555:/|\\03::|::22\\//284 as valid. If you want to be pickier you can replace the character class with a set of alternates, which can be any length. Just separate the options via | – note that outside of the square brackets, a literal | needs a backslash – and put (?:...) around the whole thing: (?:/|\\|\||:|...) whatever, in place of the square-bracketed expressions up there.
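A minimal sketch of the alternation approach is shown below. The delimiter set is hypothetical: a single /, \ or |, or the two-character sequence ::, chosen only to illustrate that alternates can have different lengths.

```python
import re

# Hypothetical delimiter set: "/", "\", "|" or the two-character "::".
# Outside a character class, a literal | must be escaped as \|.
DELIM = r'(?:/|\\|\||::)'

pat = re.compile(
    r'[0-9]{15,16}' + DELIM + r'[0-9]{2,4}' + DELIM + r'[0-9]{2,4}' + DELIM + r'[0-9]{3,4}'
)

samples = [
    'test 5555555555555555/03/22/284 test',
    'test 5555555555555555|03|22|284 test',
    'test 5555555555555555::03::22::284 test',
    'test 5555555555555555:/03:/22:/284 test',  # mixed delimiters, rejected
]

# The group is non-capturing, so findall returns the whole match.
results = [pat.findall(sample) for sample in samples]
for result in results:
    print(result)
```

The first three samples each yield one match, and the last yields an empty list because :/ is not one of the listed alternates.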
I don't recommend assigning the result of the findall back to the original cc variable; for one thing, it's a list, not a string. (You can get the string with e.g. new_cc = match[0]).
Better to create a new variable so (1) you still have the original value in case you need it and (2) when you use the new value in later code, it's clear that it's different.
In fact, if you're going to the trouble of matching this pattern, you might as well go ahead and extract all the components of it at the same time. Just put (...) around the bits you want to keep, and they'll be put in a tuple as the result of that match:
import re
pat = re.compile(r'([0-9]{15,16})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{3,4})')
cc = 'test 5555555555555555/03/22/284 test'
match, = pat.findall(cc)
print(match)
Which outputs this:
('5555555555555555', '03', '22', '284')
Define both options in the regex so that it works with either delimiter. For example, the following regex accepts both "/" and "|" between the number groups:
import re
cc = 'test 5555555555555555/03/22/284 test'
#cc = 'test 5555555555555555|03|22|284 test'
cc = re.findall(r'[0-9]{15,16}[/|][0-9]{2,4}[/|][0-9]{2,4}[/|][0-9]{3,4}', cc)
print(cc)

How to extract function name python regex

Hello, I am trying to extract a function name in Python using a regex. I am new to Python and nothing seems to work for me. For example, if I have the string "def myFunction(s): ...." I want to return just myFunction.
import re
def extractName(s):
    string = []
    regexp = re.compile(r"\s*(def)\s+\([^\)]*\)\s*{?\s*")
    for m in regexp.finditer(s):
        string += [m.group()]
    return string
Assumption: You want the name myFunction from "...def myFunction(s):..."
Something is missing from your regex and the way it is structured.
\s*(def)\s+\([^\)]*\)\s*{?\s*
Let's look at it step by step:
\s*: matches zero or more whitespace characters.
(def): matches the word def.
\s+: matches one or more whitespace characters.
\([^\)]*\): matches a literal (, then any characters other than ), then a literal ).
\s*: matches zero or more whitespace characters.
What follows does not matter much if you only want the name of the function. The regex has no part that matches the name itself: after def and the whitespace it expects ( immediately, so it never matches "def myFunction(s)".
If you want to do it with a regex, you can try this one:
(def)\s+([a-zA-Z_][a-zA-Z0-9_]*)\([^)]*\)
With the regex structured this way, group 0 is def myFunction(s), group 1 is def, and group 2 is myFunction. So you can use the following code to get your result:
import re
def extractName(s):
    string = ""
    # group 2 captures the identifier between "def" and "("
    regexp = re.compile(r"(def)\s+([a-zA-Z_][a-zA-Z0-9_]*)\([^)]*\)")
    for m in regexp.finditer(s):
        string += m.group(2)
    return string
You can check your regex live by going on this site.
Hope it helps!
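If the input may contain more than one definition, a variant that returns a list of names may be more convenient. The following is a minimal sketch, not the answerer's code: extract_names and DEF_NAME are hypothetical names, and it assumes a function name is a plain identifier followed by an opening parenthesis.

```python
import re

# Assumption: a function name is a letter or underscore followed by
# letters, digits or underscores; the parameter list is not inspected.
DEF_NAME = re.compile(r"\bdef\s+([A-Za-z_]\w*)\s*\(")

def extract_names(source):
    """Return the names of all functions defined in source, in order."""
    # With exactly one capturing group, findall returns the group contents.
    return DEF_NAME.findall(source)

print(extract_names("def myFunction(s): ...."))  # ['myFunction']
print(extract_names("def a(x):\n    pass\ndef _b2():\n    pass"))  # ['a', '_b2']
```

This is plain text matching, so it will also pick up a def that appears inside a comment or a string literal.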

how to replace symbols using regex.sub in python

I have a string s, where:
s = 'id=,value=<<<,RMOrigin=[0]>>>BasicData:id=ABCvalue=<<<ABCRMGrade=[0]>>>BasicData:id=ABCvalue='
I want to replace ABC with DEF whenever
<<<ABC\w+=\[0]>>>
occurs, so that the output is
<<<DEF\w+=\[0]>>>
In the text, \w+ stands for RMGrade, but this part changes randomly.
The desired output is:
S = id=,value=<<<,RMOrigin=[0]>>>BasicData:id=ABCvalue=<<<DEFRMGrade=[0]>>>BasicData:id=ABCvalue=
I have tried this:
s = re.sub('<<<ABC\w+=\[0]>>>','<<<DEF\w+=\[0]>>>',s)
but I get this output:
'id=,value=<<<,RMOrigin=[0]>>>BasicData:id=ABCvalue=<<<DEF\\w+=\\[0]>>>BasicData:id=ABCvalue='
I'm a bit confused about what exactly you want to achieve. But if you want to replace ABC in every match of the pattern <<<ABC\w+=\[0]>>>, then you can use backreferences to groups.
For example, modify the pattern so that you can reference the groups: (<<<)ABC(\w+=\[0]>>>). Now group 1 refers to the part before ABC and group 2 refers to the part after ABC. The replacement string is then \1DEF\2, where \1 is group 1 and \2 is group 2.
import re
s = 'id=,value=<<<,RMOrigin=[0]>>>BasicData:id=ABCvalue=<<<ABCRMGrade=[0]>>>BasicData:id=ABCvalue='
res = re.sub(r'(<<<)ABC(\w+=\[0]>>>)', r'\1DEF\2', s)
print(res)
The output: id=,value=<<<,RMOrigin=[0]>>>BasicData:id=ABCvalue=<<<DEFRMGrade=[0]>>>BasicData:id=ABCvalue=
You can also pass a function as the replacement. For more details, check the re.sub documentation.
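A function replacement could look like the following minimal sketch. The helper name swap_prefix is hypothetical. re.sub calls it once per match and substitutes whatever string it returns.

```python
import re

s = 'id=,value=<<<,RMOrigin=[0]>>>BasicData:id=ABCvalue=<<<ABCRMGrade=[0]>>>BasicData:id=ABCvalue='

def swap_prefix(match):
    # match.group(1) is the \w+ part that follows ABC (e.g. "RMGrade").
    return '<<<DEF' + match.group(1) + '=[0]>>>'

res = re.sub(r'<<<ABC(\w+)=\[0]>>>', swap_prefix, s)
print(res)
# id=,value=<<<,RMOrigin=[0]>>>BasicData:id=ABCvalue=<<<DEFRMGrade=[0]>>>BasicData:id=ABCvalue=
```

A function is mostly useful when the replacement depends on the matched text in a way a backreference cannot express.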

Python replace regex

I have a string in which there are some attributes that may be empty:
[attribute1=value1, attribute2=, attribute3=value3, attribute4=]
With Python I need to substitute the empty values with the value 'None'. I know I can use string.replace('=,','=None,').replace('=]','=None]') for the string, but I'm wondering if there is a way to do it using a regex, maybe with the ?P<name> option.
You can use
import re
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(,|])', r'=None\1', s)
\1 refers to whatever the parenthesized group matched (here , or ]).
With python's re module, you can do something like this:
# import it first
import re
# your code
re.sub(r'=([,\]])', r'=None\1', your_string)
You can use
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(?!\w)', r'=None', s)
This works because the negative lookahead (?!\w) checks if the = character is not followed by a 'word' character. The definition of "word character", in regular expressions, is usually something like "a to z, 0 to 9, plus underscore" (case insensitive).
From your example data it seems all attribute values match this. It will not work if the values may start with something like a comma (unlikely), may be quoted, or may start with anything else. If so, you need a more fool proof setup, such as parse from the start: skipping the attribute name by locating the first = character.
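One way to sketch the more robust approach mentioned above, without a regex, is to split the list into items and cut each item at its first = character. The function name fill_empty is hypothetical, and the sketch assumes the list is wrapped in square brackets and that items are separated by a comma and a space that never occur inside a value.

```python
def fill_empty(text, default='None'):
    # Assumptions: text looks like "[a=1, b=, c=3]" and neither names
    # nor values contain the ", " separator themselves.
    items = text.strip()[1:-1].split(', ')
    filled = []
    for item in items:
        # partition splits at the first "=" only.
        name, sep, value = item.partition('=')
        if sep and not value:
            value = default
        filled.append(name + sep + value)
    return '[' + ', '.join(filled) + ']'

s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
print(fill_empty(s))
# [attribute1=value1, attribute2=None, attribute3=value3, attribute4=None]
```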
Be specific and use a character class:
import re
string = "[attribute1=value1, attribute2=, attribute3=value3, attribute4=]"
rx = r'\w+=(?=[,\]])'
string = re.sub(rx, r'\g<0>None', string)
print(string)
# [attribute1=value1, attribute2=None, attribute3=value3, attribute4=None]

How to find a string between two special characters in python?

I have a set of strings like this:
uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;
I need to extract the sub-string between two semicolons and I only need the first occurrence.
The result should be: C1orf159
I have tried this code, but it does not work:
import re
info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"
name = re.search(r'\;(.*)\;', info)
print(name.group())
Please help me.
Thanks
You can split the string and limit it to two splits.
x = info.split(';',2)[1]
import re
pattern = re.compile(r".*?;([a-zA-Z0-9]+);.*")
print(pattern.match(info).groups())
This lazily consumes everything up to the first ; via .*?, captures the alphanumeric string up to the next ;, and then consumes the rest of the string. The captured text is retrieved through .groups().
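The re.search attempt from the question can also be made to work with two small changes, sketched below: make the quantifier non-greedy so it stops at the first closing semicolon, and read group(1) instead of the whole match.

```python
import re

info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"

# .*? is non-greedy, so it stops at the first closing semicolon;
# group(1) is the captured text without the surrounding semicolons.
m = re.search(r';(.*?);', info)
name = m.group(1) if m else None
print(name)  # C1orf159
```

The greedy .* in the original pattern runs to the last semicolon in the string, which is why it returned far more than the first field.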
