Extract string within parentheses - PYTHON - python

I have a string "Name(something)" and I am trying to extract the portion of the string within the parentheses!
Iv'e tried the following solutions but don't seem to be getting the results I'm looking for.
n.split('()')
name, something = n.split('()')

You can use a simple regex to catch everything between the parenthesis:
>>> import re
>>> s = 'Name(something)'
>>> re.search('\(([^)]+)', s).group(1)
'something'
The regex matches the first "(", then it matches everything that's not a ")":
\( matches the character "(" literally
the capturing group ([^)]+) greedily matches anything that's not a ")"

as an improvement on #Maroun Maroun 's answer:
re.findall('\(([^)]+)', s)
it finds all instances of strings in between parentheses

You can use split as in your example but this way
val = s.split('(', 1)[1].split(')')[0]
or using regex

You can use re.match:
>>> import re
>>> s = "name(something)"
>>> na, so = re.match(r"(.*)\((.*)\)" ,s).groups()
>>> na, so
('name', 'something')
that matches two (.*) which means anything, where the second is between parentheses \( & \).

You can look for ( and ) (need to escape these using backslash in regex) and then match every character using .* (capturing this in a group).
Example:
import re
s = "name(something)"
regex = r'\((.*)\)'
text_inside_paranthesis = re.match(regex, s).group(1)
print(text_inside_paranthesis)
Outputs:
something
Without regex you can do the following:
text_inside_paranthesis = s[s.find('(')+1:s.find(')')]
Outputs:
something

Related

Python replace between two chars (no split function)

I currently investigate a problem that I want to replace something in a string.
For example. I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers like '123.49, 19.30'. The split function is not possible, because a I have a lot of data and some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working finde. Can someone help me?
Thanks in advance
You could do something like this:
reg=r'(\d+\.\d+), (\d+\.\d+).*'
if(re.search(reg, your_text)):
match = re.search(reg, your_text)
first_num = match.group(1)
second_num = match.group(2)
Alternatively, also adding the ^ sign at the beginning, making sure to always only take the first two.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile('^(\d*.?\d*), (\d*.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you would have to use regex.search instead or re.search.
But in this case you can just use re.
If you use , (.*) you would capture all after the first occurrence of , and you are not taking digits into account.
If you want the first 2 numbers as stated in the question '123.49, 19.30' separated by comma's you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
regex demo | Python demo
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
print(match.group())
Output
123.49, 19.30

How to create non-greedy regular expression from right?

I have a file named 'ab9c_xy8z_12a3.pdf' . I want to capture part after the last underscore and before '.pdf'.
Writing regular expression like :
s = 'ab9c_xy8z_12a3.pdf'
m = re.search(r'_.*?\.pdf',s)
m.group(0)
returns:
'_xy8z_12a3.pdf'
In this example, I would like to capture only '12a3' part. Thank you for your help.
The _.*?\.pdf regex matches the first underscore with _, then matches any 0+ chars other than a newline, as few as possible, but up to the leftmost occurrence of .pdf, which turns out to be at the end of the string. So, . matched all underscores on its way to .pdf, just because of the way a regex engine parses the string (from left to right) and due to . pattern.
You may fix the pattern by using a negated character class [^_] instead of . that will "subtract" underscores from . pattern.
([^_]+)\.pdf
and grab Group 1 value. See the regex demo.
Python demo:
import re
rx = r"([^_]+)\.pdf"
s = "ab9c_xy8z_12a3.pdf"
m = re.search(rx, s)
if m:
print(m.group(1)) # => 12a3
Use re.split instead:
>>> re.split('[_.]', 'ab9c_xy8z_12a3.pdf')[-2]
'12a3'

About how to find all desired format in a str

I have a text like this format,
s = '[aaa]foo[bbb]bar[ccc]foobar'
Actually the text is Chinese car review like this
【最满意】整车都很满意,最满意就是性价比,...【空间】空间真的超乎想象,毫不夸张,...【内饰】内饰还可以吧,没有多少可以说的...
Now I want to split it to these parts
[aaa]foo
[bbb]bar
[ccc]foobar
first I tried
>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']
only got first half.
Then I tried
>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']
still only got first half
At last I have to get the two parts respectively then zip them
>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']
>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']
>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
... print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar
So I want to know if exists some regex could directly split it to these parts?
One of the approaches:
import re
s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)
print(result)
The output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
\[ or \] - matches the bracket literally
[^]]+ - matches one or more characters except ]
[^\[\]]+ - matches any character(s) except brackets \[\]
I think this could work:
r'\[.+?\]\w+'
Here it is:
>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Explanation:
parenthesis means the group to search. Witch group:
it should start by a braked \[ followed by some letters \w
then the matched braked braked \] followed by more letters \w
Notice you should to escape braked with \.
I think if input string format is "strict enough", it's possible to try something w/o regexp. It may look as a microoptimisation, but could be interesting as a challenge.
result = map(lambda x: '[' + x, s[1:].split("["))
So I tried to check performance on a 1Mil iterations and here are my results (seconds):
result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277
\[.*?\][a-zA-Z]*
This regex should capture anything that start with [somethinghere]Any letters from a to Z
you can play on regex101 to try out different ones and it's easy to make your own regex there
All you need is findall and here is very simple pattern without making it complicated:
import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))
output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Detailed solution:
import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))
explanation:
\[\w+\]\w+
\[ matches the character [ literally (case sensitive)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed

python regex: match to the first "}"

I have a string s containing:-
Hello {full_name} this is my special address named {address1}_{address2}.
I am attempting to match all instances of strings that is contained within the curly brackets.
Attempting:-
matches = re.findall(r'{.*}', s)
gives me
['{full_name}', '{address1}_{address2}']
but what I am actually trying to retrieve is
['{full_name}', '{address1}', '{address2}']
How can I do that?
>>> import re
>>> text = 'Hello {full_name} this is my special address named {address1}_{address2}.'
>>> re.findall(r'{[^{}]*}', text)
['{full_name}', '{address1}', '{address2}']
Try a non-greedy match:
matches = re.findall(r'{.*?}', s)
You need a non-greedy quantifier:
matches = re.findall(r'{.*?}', s)
Note the question mark ?.

Regular expression to return all characters between two special characters

How would I go about using regx to return all characters between two brackets.
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
Final Solution:
import re
str = "foobar['InfoNeeded'],"
match = re.match(r"^.*\['(.*)'\].*$",str)
print match.group(1)
If you're new to REG(gular) EX(pressions) you learn about them at Python Docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means Repeat this 0 or more times. ? makes the * non-greedy, i.e., . will match up as few chars as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case, is [. If we didn't do that, [ would do something very weird instead.
- (.*): Parenthesis 'groups' whatever is inside it and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: You should know enough by now to know what this means.
Implementation
First, import the re module -- it's not a built-in -- to where-ever you want to use the expression.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This will return a MatchObject which you can store to a temporary variable. You should then call it's group() method and pass 1 as an argument (to see the 'Group 1' we captured using parenthesis earlier). I should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?>=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repititions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match w/o capturing it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings by adding an 'r' before the string (E.g.: r'blah blah blah').
10x for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 ore came up and one got accepted. :( x<
^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.
If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']

Categories