I have a string that looks as follows.
s = 'string with %%substring1%% and %%substring2%%'
I want to extract the text in the substrings including the %% and I cannot figure out how to make a regular expression inclusive.
For example, re.findall('%%(.*?)%%', s, re.DOTALL) will output ['substring1', 'substring2'], but what I really want is for it to return ['%%substring1%%', '%%substring2%%'].
Any suggestions?
You were close. Make the group cover the entire portion you want, delimiters included, rather than only the text in between:
>>> s = 'string with %%substring1%% and %%substring2%%'
>>> import re
>>> re.findall('(%%.*?%%)', s, re.DOTALL)
['%%substring1%%', '%%substring2%%']
You actually do not need the parens at all, because re.findall returns the whole match when the pattern has no capturing group:
>>> re.findall('%%.*?%%', s, re.DOTALL) # Even this works !!!
['%%substring1%%', '%%substring2%%']
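The difference between the two results comes from how re.findall treats capturing groups. The following is a small sketch using the string from the question; the variable names are only illustrative.

```python
import re

s = 'string with %%substring1%% and %%substring2%%'

# One capturing group: findall returns only the group's text.
inner = re.findall(r'%%(.*?)%%', s)

# No capturing group: findall returns each whole match, delimiters included.
whole = re.findall(r'%%.*?%%', s)

# A non-capturing group (?:...) groups without changing what findall returns.
also_whole = re.findall(r'%%(?:.*?)%%', s)

print(inner)       # ['substring1', 'substring2']
print(whole)       # ['%%substring1%%', '%%substring2%%']
print(also_whole)  # ['%%substring1%%', '%%substring2%%']
```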
I am currently working on a problem where I want to extract part of a string.
For example, I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers, like '123.49, 19.30'. The split function is not an option, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working. Can someone help me?
Thanks in advance
You could do something like this:
import re
reg = r'(\d+\.\d+), (\d+\.\d+).*'
match = re.search(reg, your_text)  # search once and reuse the result
if match:
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, add the ^ anchor at the beginning to make sure you always take only the first two numbers.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile(r'^(\d*\.?\d*), (\d*\.?\d*)')  # raw string; \. matches a literal dot
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In your code you are using import re as regex. If you do that, you have to call regex.search instead of re.search.
But in this case you can just use import re.
If you use , (.*) you capture everything after the first occurrence of , and you are not taking digits into account.
If you want the first 2 numbers as stated in the question, '123.49, 19.30', separated by a comma, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or match a number followed by 1 or more repetitions of a comma and another number:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
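Since the question mentions many lines, some with and some without a third number, the same idea can be wrapped in a small helper. This is only a sketch: first_two_numbers and the sample lines are hypothetical names and data, and the pattern is the first of the two shown above.

```python
import re

# Two decimal numbers separated by a comma and optional whitespace.
FIRST_TWO = re.compile(r"\b\d+\.\d+,\s*\d+\.\d+\b")

def first_two_numbers(line):
    """Return the leading 'x.y, x.y' pair, or None if the line has no such pair."""
    match = FIRST_TWO.search(line)
    return match.group() if match else None

# Hypothetical sample data: with a third number, without one, and with none at all.
lines = ['123.49, 19.30, 02\n', '7.50, 8.25\n', 'no numbers here\n']
results = [first_two_numbers(line) for line in lines]
print(results)  # ['123.49, 19.30', '7.50, 8.25', None]
```

Compiling the pattern once and returning None for non-matching lines lets the caller decide how to handle bad rows.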
I have a lot of strings like the following:
\frac{l_{2}\,\mathrm{phi2dd}\,\sin\left(\varphi _{2}\right)}{2}
I want to replace the \frac{***}{2} to \frac{1}{2} ***
The desired string would then become:
\frac{1}{2} l_{2}\,\mathrm{phi2dd}\,\sin\left(\varphi _{2}\right)
I thought I could use a regular expression to do so, but I can't quite figure out how to extract the 'main string' from the substring.
Update: I simplified the problem a bit too much. The strings I have to replace actually contain multiple 'fracs', like so:
I_{2}\,\mathrm{phi2dd}-\frac{l_{2}\,\mathrm{lm}_{4}\,\cos\left(\varphi _{2}\right)}{2}+\frac{l_{2}\,\mathrm{lm}_{3}\,\sin\left(\varphi _{2}\right)}{2}=0
I don't know the number of occurrences in the string; it varies.
Match using \\frac\{(.*?)\}\{2} and substitute using \\frac{1}{2} \1
Updated code:
import re
regex = r"\\frac\{(.*?)\}\{2}"
test_str = "I_{2}\\,\\mathrm{phi2dd}-\\frac{l_{2}\\,\\mathrm{lm}_{4}\\,\\cos\\left(\\varphi _{2}\\right)}{2}+\\frac{l_{2}\\,\\mathrm{lm}_{3}\\,\\sin\\left(\\varphi _{2}\\right)}{2}=0"
subst = "\\\\frac{1}{2} \\1"
# count=0 (the default) replaces every occurrence
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
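As a cross-check, the same pattern can be applied to the single-fraction string from the question. The sketch below uses a function as the replacement, which is one way to avoid escaping backslashes in the replacement template; pull_out_half is a hypothetical helper name.

```python
import re

# Same pattern as in the answer above.
FRAC_HALF = re.compile(r"\\frac\{(.*?)\}\{2\}")

def pull_out_half(tex):
    # The lambda builds the replacement text, so no template escaping is needed.
    return FRAC_HALF.sub(lambda m: r"\frac{1}{2} " + m.group(1), tex)

single = r"\frac{l_{2}\,\mathrm{phi2dd}\,\sin\left(\varphi _{2}\right)}{2}"
print(pull_out_half(single))
# \frac{1}{2} l_{2}\,\mathrm{phi2dd}\,\sin\left(\varphi _{2}\right)
```

Note that the lazy .*? stops at the first }{2} it finds, so this approach assumes the numerator does not itself contain }{2} (for example a nested \frac{a}{2}); handling nested braces in general needs a real parser rather than a regex.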
I have a relatively complex string that contains a bunch of data. I am trying to extract the relevant pieces of the string using a regex command. The portions I am interested in are contained in square brackets, like this:
s = '"data":["value":3.44}] lol haha "data":["value":55.34}] "data":["value":2.44}] lol haha "data":["value":56.34}]'
And the regex expression I have built is as follows:
l = re.findall(r'\"data\"\:.*(\[.*\])', s)
I was expecting this to return
['["value":3.44}]', '["value":55.34}]', '["value":2.44}]', '["value":56.34}]']
But instead all I get is the last one, i.e.,
['["value":56.34}]']
How can I catch 'em all?
It's because quantifiers are greedy by default. So .* will match everything between the first "data": and the last [, so there's only one [...] left to match.
Use non-greedy quantifiers by adding ?.
l = re.findall(r'\"data\"\:.*?(\[.*?\])', s)
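To see the difference side by side, here is a short sketch on a shortened version of the string from the question (two records instead of four); the variable names are only illustrative.

```python
import re

s = '"data":["value":3.44}] lol haha "data":["value":55.34}]'

# Greedy: the first .* runs to the end and backtracks only as far as the last [.
greedy = re.findall(r'"data":.*(\[.*\])', s)

# Lazy: each .*? stops as soon as the rest of the pattern can match.
lazy = re.findall(r'"data":.*?(\[.*?\])', s)

print(greedy)  # ['["value":55.34}]']
print(lazy)    # ['["value":3.44}]', '["value":55.34}]']
```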
You can also use finditer to extract the relevant content iteratively:
import re
s = '"data":["value":3.44}] lol haha "data":["value":55.34}] "data":["value":2.44}] lol haha "data":["value":56.34}]'
for m in re.finditer(r'(\[.*?\])', s):
    print(m.group(1))
OUTPUT
["value":3.44}]
["value":55.34}]
["value":2.44}]
["value":56.34}]
I have a string:
This is #lame
Here I want to extract lame. The catch is that the string can also be
This is lame
in which case I don't extract anything. And the string can also be:
This is #lame but that is #not
Here I extract lame and not.
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters are preceded by a # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
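For comparison, a capturing group gives the same results as the lookbehind on these inputs. This is a sketch of that alternative rather than part of the answer above; HASHTAG and hashtags are hypothetical names.

```python
import re

# Match the # but capture only the word after it; findall returns the group.
HASHTAG = re.compile(r'#(\w+)')

def hashtags(text):
    """Return the words that directly follow a # in text."""
    return HASHTAG.findall(text)

print(hashtags('This is #lame'))                   # ['lame']
print(hashtags('This is lame'))                    # []
print(hashtags('This is #lame but that is #not'))  # ['lame', 'not']
```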
You should use the re library. Here is an example:
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#(\w+)")  # the group keeps the word and drops the #
lst = regular.findall(test_case)
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))
How would I go about using regex to return all characters between two brackets?
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
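The attempt above goes wrong because an unescaped [ starts a character class, so the pattern no longer means what it appears to, and the negated class still excludes } rather than ]. A minimal sketch of the lookaround version adapted to square brackets, using the example string from the question:

```python
import re

s = "foobar['infoNeededHere']ddd"

# Escape the brackets and exclude ] instead of } in the negated class.
with_quotes = re.search(r"(?<=\[)[^\]]*(?=\])", s).group()

# Fold the quotes into the lookarounds to drop them from the result.
without_quotes = re.search(r"(?<=\[')[^']*(?='\])", s).group()

print(with_quotes)     # 'infoNeededHere'
print(without_quotes)  # infoNeededHere
```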
Final Solution:
import re
s = "foobar['InfoNeeded'],"
match = re.match(r"^.*\['(.*)'\].*$", s)
print(match.group(1))
If you're new to REG(ular) EX(pressions), you can learn about them in the Python docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means "repeat this 0 or more times". ? makes the * non-greedy, i.e., . will match as few characters as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case is [. If we didn't do that, [ would start a character class instead of matching a literal bracket.
- (.*): Parentheses 'group' whatever is inside them, and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: \] matches a literal ], and .* matches whatever follows it.
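One thing the breakdown above implies: the group (.*) is greedy, so on a string with more than one bracket pair it runs to the last ]. The sketch below contrasts it with a lazy group on a made-up input string.

```python
import re

# Hypothetical input with two bracket pairs.
s = "a[x]b[y]"

# Greedy group, as in the pattern above: captures up to the last ].
greedy = re.search(r'.*?\[(.*)\].*', s).group(1)

# Lazy group: stops at the first ].
lazy = re.search(r'.*?\[(.*?)\].*', s).group(1)

print(greedy)  # x]b[y
print(lazy)    # x
```

With a single bracket pair, as in the question, both forms capture the same text.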
Implementation
First, import the re module (it is part of the standard library, but it must be imported) wherever you want to use the expression.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This returns a match object (or None if nothing matched), which you can store in a temporary variable. You should then call its group() method and pass 1 as an argument (to see the 'Group 1' we captured using parentheses earlier). It should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?<=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repetitions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match without consuming it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings for regex patterns by adding an 'r' before the string (e.g. r'blah blah blah'), so that backslashes reach the regex engine unchanged.
Thanks for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 more had come up and one had been accepted. :(
^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.
If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
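One caveat with the slicing approach: str.find returns -1 when the character is missing, which silently produces a wrong slice. A guarded sketch, where between_brackets is a hypothetical helper name:

```python
def between_brackets(text):
    """Return the text between the first [ and the next ], or None if either is missing."""
    start = text.find("[")
    # Look for the closing bracket only after the opening one.
    end = text.find("]", start + 1)
    if start == -1 or end == -1:
        return None
    return text[start + 1:end]

print(between_brackets("Bacon, [eggs], and spam"))  # eggs
print(between_brackets("no brackets here"))         # None
```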
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']