regex: split by + except if inside brackets - python

I'm dealing with equations like 'x_{t+1}+y_{t}=z_{t-1}'. My objective is to obtain all "variables", that is, a list with x_{t+1}, y_{t}, z_{t-1}.
I'd like to split the string by [+-=*/], but not if + or - are inside {}.
Something like re.split('(?<!t)[\+\-\=]','x_{t+1}+y_{t}=z_{t-1}') partly does the job by not splitting when the symbol is preceded by t. But I'd like to be more general. Assume there are no nested brackets.
How can I do this?

Instead of splitting at those characters, you could find sequences of all other characters (like x and _) and bracket parts (like {t+1}). The first such sequence in the example is x, _, {t+1}, i.e., the substring x_{t+1}.
import re
s = 'x_{t+1}+y_{t}=z_{t-1}'
print(re.findall(r'(?:\{.*?}|[^-+=*/])+', s))
Output:
['x_{t+1}', 'y_{t}', 'z_{t-1}']
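As a quick check of how general this is, here is a minimal sketch that wraps the same pattern in a helper and runs it on a second equation containing * and /. The helper name extract_terms and the second equation are made up for illustration. Note that whitespace around operators would end up inside the terms, so strip it first if your equations contain spaces.

```python
import re

# Pattern taken from the answer above: a term is a run of brace groups
# and characters that are not one of the operators - + = * /
TERM = re.compile(r'(?:\{.*?}|[^-+=*/])+')

def extract_terms(equation):
    """Return the maximal operator-free pieces, keeping {...} groups intact."""
    return TERM.findall(equation)

print(extract_terms('x_{t+1}+y_{t}=z_{t-1}'))  # ['x_{t+1}', 'y_{t}', 'z_{t-1}']
print(extract_terms('a_{i+1}*b_{j-2}/c=d'))    # ['a_{i+1}', 'b_{j-2}', 'c', 'd']
```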

Instead of re.split, consider using re.findall to match only the variables:
>>> re.findall(r"[a-z0-9]+(?:_\{[^\}]+\})?","x_{t+1}+y_{t}=z_{t-1}+pi", re.IGNORECASE)
['x_{t+1}', 'y_{t}', 'z_{t-1}', 'pi']
Explanation of regex:
[a-z0-9]+(?:_\{[^\}]+\})?
[a-z0-9]+ : One or more alphanumeric characters
(?: )?: A non-capturing group, optional
_\{ \} : Underscore, and opening/closing brackets
[^\}]+ : One or more non-close-bracket characters
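The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of that idea; the name VARIABLE is made up. One thing it makes visible: because the first part accepts digits, a bare number such as the 2 in 2*x_{t} is also returned as a "variable".

```python
import re

# Same pattern as in the answer, spread out so each part carries its comment.
VARIABLE = re.compile(r"""
    [a-z0-9]+      # one or more alphanumeric characters
    (?:            # start of the optional non-capturing group
        _\{        # underscore and opening brace
        [^}]+      # one or more characters that are not a closing brace
        \}         # closing brace
    )?             # the whole group may be absent
""", re.IGNORECASE | re.VERBOSE)

variables = VARIABLE.findall("x_{t+1}+y_{t}=z_{t-1}+pi")
print(variables)                      # ['x_{t+1}', 'y_{t}', 'z_{t-1}', 'pi']
print(VARIABLE.findall("2*x_{t}"))    # ['2', 'x_{t}']
```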

Related

Regex matching: Case insensitive German words with spaces (Python)

I have a problem where I want to match any number of German words inside [] braces, ignoring the case. The expression should only match spaces and words, nothing else i.e no punctuation marks or parenthesis
E.g :
The expression ['über das thema schreibt'] should be matched with ['Über', 'das', 'Thema', 'schreibt']
I have one list with items of the former order and another with the latter order, as long as the words are same, they both should match.
The code I tried with is -
regex = re.findall('[(a-zA-Z_äöüÄÖÜß\s+)]', str(term))
or
re.findall('[(\S\s+)]', str(term))
But they are not working. Kindly help me find a solution
In the simplest form using \w+ works for finding words (needs Unicode flag for non-ascii chars), but since you want them to be within the square brackets (and quotes I assume) you'd need something a bit complex
\[(['\"])((\w+\s?)+)\1\]
\[ and \] are used to match the square brackets
['\"] matches either quote, and the \1 makes sure the same quote is on the other end
\w+ captures 1 word. The \s? is for an optional space.
The whole string is in the second group which you can split to get the list
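import re
text = "['über das thema schreibt']"
# Raw string so the backslashes reach the regex engine unchanged
regex = re.compile(r"\[(['\"])((\w+\s?)+)['\"]\]", flags=re.U)
match = regex.match(text)
if match:
    # Group 2 holds everything between the quotes
    print(match.group(2).split())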
(slight edit as \1 did not seem to work in the terminal for me)
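A likely reason \1 did not work: in an ordinary (non-raw) Python string, "\1" is an octal escape for the character with code 1, so the regex engine never sees a backreference. The sketch below, which is an assumption about what went wrong rather than something confirmed in the thread, uses a raw string so that \1 works, and then compares the extracted words with the second list case-insensitively. The helper name same_words is made up.

```python
import re

text = "['über das thema schreibt']"
expected = ['Über', 'das', 'Thema', 'schreibt']

# In a raw string, \1 reaches the regex engine as a backreference to group 1,
# so the closing quote has to be the same character as the opening quote.
bracketed = re.compile(r"\[(['\"])((?:\w+\s?)+)\1\]")

match = bracketed.match(text)
words = match.group(2).split() if match else []

def same_words(left, right):
    """Compare two word lists while ignoring case."""
    return [w.casefold() for w in left] == [w.casefold() for w in right]

print(words)                        # ['über', 'das', 'thema', 'schreibt']
print(same_words(words, expected))  # True
```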
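I found the easiest solution to it:
import re

def lists_match(list1, list2):  # wrapper name added here so that return is valid
    for a, b in zip(list1, list2):
        # Keep only word characters, whitespace, '+' and parentheses, lowercased
        reg_a = re.findall(r'[(\w\s+)]', str(a).lower())
        reg_b = re.findall(r'[(\w\s+)]', str(b).lower())
        if reg_a != reg_b:
            return False
    return True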
Updated based on comments to match each word. This simply ignores spaces, single quotes and square braces
import re
text = "['über das thema schreibt']"
re.findall("([a-zA-Z_äöüÄÖÜß]+)", str(text))
# ['über', 'das', 'thema', 'schreibt']
If you are solving a case-sensitivity issue, add the regex flag re.IGNORECASE, like this:
re.findall('[(\S\s+)]', str(term),re.IGNORECASE)
You might need to consider converting them to unicode, if it did not help.

Regex to check if it is exactly one single word

I am basically trying to match string pattern(wildcard match)
Please carefully look at this -
*(star) - means exactly one word .
This is not a regex pattern...it is a convention.
So,if there patterns like -
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' precedes exactly one word.
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in a pattern, I have to replace it with a regex that means exactly one word ([a-zA-Z0-9_]). Can anyone please help me do this in Python?
To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
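import sre_parse

def convert_to_regexp(pattern):
    # Every regex metacharacter except '*', which has its own meaning here
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    # Escape the metacharacters so they are matched literally
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern])
    # '*' stands for exactly one word
    return safe_pattern.replace('*', '\\w+')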
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
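An alternative sketch of the same conversion that avoids sre_parse, which is an internal, undocumented module (and, as far as I know, deprecated in recent Python versions). It splits the pattern on *, escapes the literal pieces with re.escape, and joins them with \w+. Using re.fullmatch is my choice here so that the whole string has to match; the function names are made up.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a '*'-wildcard pattern into a compiled regex ('*' = one word)."""
    # Everything between the stars is literal text, so escape it as a whole.
    literal_parts = [re.escape(part) for part in pattern.split('*')]
    return re.compile(r'\w+'.join(literal_parts))

def wildcard_match(pattern, text):
    """True if the entire text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None

print(wildcard_match('*.key', 'door.key'))               # True
print(wildcard_match('*.key', 'brown.door.key'))         # False
print(wildcard_match('*.key.*', 'brown.key.door'))       # True
print(wildcard_match('*.key.*', 'brown.iron.key.door'))  # False
```

Because fullmatch anchors both ends, a string such as door.keyhole is rejected for *.key, which re.match alone would accept.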
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..
^ means a string that starts with the given set of characters in a regular expression.
$ means a string that ends with the given set of characters in a regular expression.
\s means a whitespace character.
\S means a non-whitespace character.
+ means 1 or more characters matching given condition.
Now, you want to match just a single word meaning a string of characters that start and end with non-spaced string. So, the required regular expression is:
^\S+$
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.
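A small sketch of the difference described in the edit, for the *.key case. The end anchor on the [^.]* version and the test strings are my own assumptions. Besides allowing whitespace, [^.]* also accepts an empty word, because * means zero or more.

```python
import re

loose = re.compile(r'^[^.]*\.key$')   # "anything that is not a period"
strict = re.compile(r'^\w+\.key$')    # "one or more word characters"

candidates = ['door.key', 'my door.key', '.key', 'brown.door.key']
results = {c: (bool(loose.match(c)), bool(strict.match(c))) for c in candidates}

for candidate, (loose_ok, strict_ok) in results.items():
    print(f'{candidate!r}: [^.]* -> {loose_ok}, \\w+ -> {strict_ok}')
```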

python3: regex, find all substrings that starts with and end with certain string

Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that start with a digit and end with a letter.
I tried,
b = re.findall('\d.*\w',a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1253abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex method, and would really appreciate it if anyone can show how to do this in different method within regex, and explain what's going on.
\w matches any word character, that is, digits, letters, and the underscore. You need to use [a-zA-Z] to capture letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall('(\d+[A-Za-z]+)',a)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d will match digits. \d+ will match one or more consecutive digits. For example:
>>> re.findall('(\d+)',a)
['1253', '4567', '8910']
Similarly, [a-zA-Z]+ will match one or more letters.
>>> re.findall('([a-zA-Z]+)',a)
['abcd', 'efgh', 'ijkl']
Now put them together to match what you exactly want.
From the Python manual on regular expressions, it tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
So you are actually over capturing what you need. Refine your regular expression a bit:
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I makes your expression case insensitive, so it will match upper and lower case letters as well:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
\w matches any alphanumeric character or underscore, and you have used it after .*, which is greedy. So your pattern starts at the first digit and then consumes as much of the string as it can, which is why you get the whole string back.
Solution:
>>> b = re.findall('\d*[A-Za-z]*', a)
>>> b
['1253abcd', '4567efgh', '8910ijkl', '']
You will get '' (an empty string) at the end of the list, because the pattern can also match nothing at the end of the string. You can remove it using
b.pop(-1)
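To see side by side why the original attempt returned the whole string, here is a sketch comparing four patterns on the same input. The variable names are made up. Making .* lazy is not enough on its own, because the match then stops at the first letter; using + instead of * avoids the empty match without needing pop.

```python
import re

a = '1253abcd4567efgh8910ijkl'

# Greedy .* runs to the end of the string and gives back only one character.
greedy = re.findall(r'\d.*\w', a)

# Lazy .*? stops as soon as a single letter has been seen.
lazy = re.findall(r'\d.*?[A-Za-z]', a)

# One or more digits followed by one or more letters.
runs = re.findall(r'\d+[A-Za-z]+', a)

# The * version also matches the empty string, so filter those out.
starred = [m for m in re.findall(r'\d*[A-Za-z]*', a) if m]

print(greedy)   # ['1253abcd4567efgh8910ijkl']
print(lazy)     # ['1253a', '4567e', '8910i']
print(runs)     # ['1253abcd', '4567efgh', '8910ijkl']
print(starred)  # ['1253abcd', '4567efgh', '8910ijkl']
```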

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if I could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
Here is a short explanation based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, so a comma between two digits, as in 2,4,5-TP, is left alone.
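The lookaround behaviour described above can be checked in isolation before applying the full separator. This is a minimal sketch; the short test strings and the names SEPARATOR, entries and parsed are made up for illustration.

```python
import re

# Lookahead: the u has to follow, but only the q is part of the match.
lookahead_demo = re.findall(r'q(?=u)', 'quit Iraq')

# Lookbehind: only the comma preceded by a non-digit is found (index 1).
lookbehind_demo = [m.start() for m in re.finditer(r'(?<=\D),', 'a,1,b')]

# The full separator from the answer: a comma with a non-digit on either side.
SEPARATOR = re.compile(r'(?<=\D),\s*|\s*,(?=\D)')

entries = [
    '2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP',
    'Lead,Paints/Pigments,Zinc',
]
parsed = [SEPARATOR.split(entry) for entry in entries]

print(lookahead_demo)   # ['q']
print(lookbehind_demo)  # [1]
for row in parsed:
    print(row)
```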
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertions:
>>> s = '2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP'
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

Regex for match parentheses in Python

I have a list of fasta sequences, each of which look like this:
>>> sequence_list[0]
'gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)
I'd like to be able to extract the gene names from each of the fasta entries in my list, but I'm having difficulty finding the right regular expression. I thought this one would work: "^/(.+/),$". Start with a parentheses, then any number of any character, then end with a parentheses followed by a comma. Unfortunately: this returns None:
test = re.search(r"^/(.+/),$", sequence_list[0])
print(test)
Can someone point out the error in this regex?
Without any capturing groups,
>>> import re
>>> text = """
... gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
... ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
... CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)"""
>>> m = re.findall(r'(?<=\().*?(?=\),)', text)
>>> m
['Ndufa10']
It matches the text inside parentheses only when the closing parenthesis is followed by a comma.
Explanation:
(?<=\() In regex, (?<=pattern) is called a lookbehind. It asserts that the text immediately before the current position matches the pattern, without making that text part of the match. In our case the pattern inside the lookbehind is \(, which means a literal (.
.*?(?=\),) The .* matches any character zero or more times. The ? after the * makes the match reluctant, so it does the shortest match. The lookahead (?=\),) requires that the matched characters are followed by ),
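One caveat worth checking, sketched below under the assumption that the real header is a single line (the sample above is wrapped across lines, and . does not match a newline). With the line break removed, the reluctant match still starts right after the first (, so it runs from ubiquinone up to the ), that follows Ndufa10. A capturing group with [^)]* does not have this problem. The sequence is shortened and the variable names are made up.

```python
import re

# Header as pasted (with the terminal line wrap) and as one line.
wrapped = ("gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp\n"
           "ha subcomplex 10 (Ndufa10), mRNAGCC")
single_line = wrapped.replace("\n", "")

lookaround = re.compile(r'(?<=\().*?(?=\),)')   # pattern from the answer
no_paren = re.compile(r'\(([^)]*)\),')          # capture text without any ')'

print(lookaround.findall(wrapped))      # ['Ndufa10']
print(lookaround.findall(single_line))  # ['ubiquinone) 1 alpha subcomplex 10 (Ndufa10']
print(no_paren.findall(wrapped))        # ['Ndufa10']
print(no_paren.findall(single_line))    # ['Ndufa10']
```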
You need to escape the parentheses:
>>> re.findall(r'\([^)]*\),', txt)
['(Ndufa10),']
Can someone point out the error in this regex? r"^/(.+/),$"
regex escape character is \ not / (do not confuse with python escape character which is also \, but is not needed when using raw strings)
=> r"^\(.+\),$"
^ and $ match start/end of the input string, not what you want to output
=> r"\(.+\),"
you need to match any characters up to the 1st occurrence of ), not to the last one, so you need the lazy operator +?
=> r"\(.+?\),"
in case gene names cannot contain a ) character, you can use a faster regex that avoids backtracking
=> r"\([^)]+\),"
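Running the four patterns from this answer against the sample header shows what each step changes. This is a sketch using a shortened, single-line copy of the header; the names header, steps and results are made up. On this particular input the lazy version returns the same text as the greedy one, because the search still begins at the first ( and there is only one ), in the string, so the [^)]+ version is the one that isolates the gene name.

```python
import re

header = ("gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase "
          "(ubiquinone) 1 alpha subcomplex 10 (Ndufa10), mRNAGCC")

steps = [
    r"^\(.+\),$",   # escaped, but still anchored to the whole string
    r"\(.+\),",     # anchors removed, greedy
    r"\(.+?\),",    # lazy
    r"\([^)]+\),",  # no ')' allowed inside the parentheses
]

results = {}
for pattern in steps:
    m = re.search(pattern, header)
    results[pattern] = m.group() if m else None
    print(pattern, '->', results[pattern])
```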
