u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
All I need is the contents inside the parenthesis.
If your problem is really just this simple, you don't need regex:
s[s.find("(")+1:s.find(")")]
Use re.search(r'\((.*?)\)',s).group(1):
>>> import re
>>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
>>> re.search(r'\((.*?)\)',s).group(1)
u"date='2/xc2/xb2',time='/case/test.png'"
If you want to find all occurences:
>>> re.findall('\(.*?\)',s)
[u"(date='2/xc2/xb2',time='/case/test.png')", u'(eee)']
>>> re.findall('\((.*?)\)',s)
[u"date='2/xc2/xb2',time='/case/test.png'", u'eee']
Building on tkerwin's answer, if you happen to have nested parentheses like in
st = "sum((a+b)/(c+d))"
his answer will not work if you need to take everything between the first opening parenthesis and the last closing parenthesis to get (a+b)/(c+d), because find searches from the left of the string, and would stop at the first closing parenthesis.
To fix that, you need to use rfind for the second part of the operation, so it would become
st[st.find("(")+1:st.rfind(")")]
import re
fancy = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
print re.compile( "\((.*)\)" ).search( fancy ).group( 1 )
contents_re = re.match(r'[^\(]*\((?P<contents>[^\(]+)\)', data)
if contents_re:
print(contents_re.groupdict()['contents'])
No need to use regex ....
Just use list slicing ...
string="(tidtkdgkxkxlgxlhxl) ¥£%#_¥#_¥#_¥#"
print(string[string.find("(")+1:string.find(")")])
TheSoulkiller's answer is great. just in my case, I needed to handle extra parentheses and only extract the word inside the parentheses. a very small change would solve the problem
>>> s=u'abcde((((a+b))))-((a*b))'
>>> re.findall('\((.*?)\)',s)
['(((a+b', '(a*b']
>>> re.findall('\(+(.*?)\)',s)
['a+b', 'a*b']
Here are several ways to extract strings between parentheses in Pandas with the \(([^()]+)\) regex (see its online demo) that matches
\( - a ( char
([^()]+) - then captures into Group 1 any one or more chars other than ( and )
\) - a ) char.
Extracting the first occurrence using Series.str.extract:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.extract(r'\(([^()]+)\)')
# => df['Values']
# 0 value 1
# Name: Values, dtype: object
Extracting (finding) all occurrences using Series.str.findall:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)')
# => df['Values']
# 0 [value 1, value 2]
# Name: Values, dtype: object
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)').str.join(', ')
# => df['Values']
# 0 value 1, value 2
# Name: Values, dtype: object
Note that .str.join(', ') is used to create a comma-separated string out of the resulting list of strings. You may adjust this separator for your scenario.
testcase
s = "(rein<unint>(pBuf) +fsizeof(LOG_RECH))"
result
['pBuf', 'LOG_RECH', 'rein<unint>(pBuf) +fsizeof(LOG_RECH)']
implement
def getParenthesesList(s):
res = list()
left = list()
for i in range(len(s)):
if s[i] == '(':
left.append(i)
if s[i] == ')':
le = left.pop()
res.append(s[le + 1:i])
print(res)
return res
If im not missing something, a small fix to #tkerwin:
s[s.find("(")+1:s.rfind(")")]
The 2nd find should be rfind so you start search from end of string
Related
u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
All I need is the contents inside the parenthesis.
If your problem is really just this simple, you don't need regex:
s[s.find("(")+1:s.find(")")]
Use re.search(r'\((.*?)\)',s).group(1):
>>> import re
>>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
>>> re.search(r'\((.*?)\)',s).group(1)
u"date='2/xc2/xb2',time='/case/test.png'"
If you want to find all occurences:
>>> re.findall('\(.*?\)',s)
[u"(date='2/xc2/xb2',time='/case/test.png')", u'(eee)']
>>> re.findall('\((.*?)\)',s)
[u"date='2/xc2/xb2',time='/case/test.png'", u'eee']
Building on tkerwin's answer, if you happen to have nested parentheses like in
st = "sum((a+b)/(c+d))"
his answer will not work if you need to take everything between the first opening parenthesis and the last closing parenthesis to get (a+b)/(c+d), because find searches from the left of the string, and would stop at the first closing parenthesis.
To fix that, you need to use rfind for the second part of the operation, so it would become
st[st.find("(")+1:st.rfind(")")]
import re
fancy = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
print re.compile( "\((.*)\)" ).search( fancy ).group( 1 )
contents_re = re.match(r'[^\(]*\((?P<contents>[^\(]+)\)', data)
if contents_re:
print(contents_re.groupdict()['contents'])
No need to use regex ....
Just use list slicing ...
string="(tidtkdgkxkxlgxlhxl) ¥£%#_¥#_¥#_¥#"
print(string[string.find("(")+1:string.find(")")])
TheSoulkiller's answer is great. just in my case, I needed to handle extra parentheses and only extract the word inside the parentheses. a very small change would solve the problem
>>> s=u'abcde((((a+b))))-((a*b))'
>>> re.findall('\((.*?)\)',s)
['(((a+b', '(a*b']
>>> re.findall('\(+(.*?)\)',s)
['a+b', 'a*b']
Here are several ways to extract strings between parentheses in Pandas with the \(([^()]+)\) regex (see its online demo) that matches
\( - a ( char
([^()]+) - then captures into Group 1 any one or more chars other than ( and )
\) - a ) char.
Extracting the first occurrence using Series.str.extract:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.extract(r'\(([^()]+)\)')
# => df['Values']
# 0 value 1
# Name: Values, dtype: object
Extracting (finding) all occurrences using Series.str.findall:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)')
# => df['Values']
# 0 [value 1, value 2]
# Name: Values, dtype: object
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)').str.join(', ')
# => df['Values']
# 0 value 1, value 2
# Name: Values, dtype: object
Note that .str.join(', ') is used to create a comma-separated string out of the resulting list of strings. You may adjust this separator for your scenario.
testcase
s = "(rein<unint>(pBuf) +fsizeof(LOG_RECH))"
result
['pBuf', 'LOG_RECH', 'rein<unint>(pBuf) +fsizeof(LOG_RECH)']
implement
def getParenthesesList(s):
res = list()
left = list()
for i in range(len(s)):
if s[i] == '(':
left.append(i)
if s[i] == ')':
le = left.pop()
res.append(s[le + 1:i])
print(res)
return res
If im not missing something, a small fix to #tkerwin:
s[s.find("(")+1:s.rfind(")")]
The 2nd find should be rfind so you start search from end of string
I have a long string that may contain multiple same sub-strings. I would like to extract certain sub-strings by using regex. Then, for each extracted sub-string, I want to append [i] and replace the original one.
By using Regex, I extracted ['df.Libor3m','df.Libor3m_lag1','df.Libor3m_lag1']. However, when I tried to add [i] to each item, the first 'df.Libor3m_lag1' in string is replaced twice.
function_text_MD='0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
read_var = re.findall(r"df.[\w+][^\W]+",function_text_MD)
for var_name in read_var:
function_text_MD.find(var_name)
new_var_name = var_name+'[i]'
function_text_MD=function_text_MD.replace(var_name,new_var_name,1)
So I got '0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i][i],0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'.
df.Libor3m_lag1[i][i] was added [i] twice.
What I want to get:
'0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i],0.9))+0.7*np.maximum(df.Libor3m_lag1[i],0.9)'
Thanks in advance!
Here is the code.
import re
function_text_MD='0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
read_var = re.findall(r"df.[\w+][^\W]+",function_text_MD)
for var_name in read_var:
function_text_MD = function_text_MD.replace(var_name,var_name+'[i]')
print(function_text_MD)
t = "0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)"
p = re.split("(?<=df\.)[a-zA-Z_0-9]+", t)
s = re.findall("(?<=df\.)[a-zA-Z_0-9]+", t)
s = [x+"[i]" for x in s]
result = "".join([p[0],s[0],p[1],s[1],p[2],s[2]])
use the regular expression to split string first.
use the same regular expression to find the spliters
change the spliters to what you want
put the 2 list together and join.
I need to writing my dataframe to csv, and some of the series start with "+-= ", so I need to remove them first.
I tried to test by using a string:
test="+++++-= I love Mercedes-Benz"
while True:
if test.startswith('+') or test.startswith('-') or test.startswith('=') or test.startswith(' '):
test=test[1:]
continue
else:
print(test)
break
Output looks perfect:
I love Mercedes-Benz.
Now when I want to do the same while using lambda in my dataframe:
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df.loc[len(my_df)] = ["++++-= I love Mercedes-Benz", 4, "Love this"]
my_df.loc[len(my_df)] = ["=Looks so good!", 2, "5-year-old"]
my_df
my_df["A"]=my_df["A"].map(lambda x: x[1:] if x.startswith('=') else x)
print(my_df["A"])
I am not sure how to put 4 startswith "-","=","+"," " together and loop them until they meet the first alphabet or character(sometimes it might be in Japanese or Chinese.)
expected final my_df:
A B C
0 I love Mercedes-Benz 4 Love this
1 Looks so good! 2 5-year-old
You can use str.lstrip in order to remove these leading characters:
my_df.A.str.lstrip('+-=')
0 I love Mercedes-Benz
1 Looks so good!
Name: A, dtype: object
One way to achieve it could be
old = ""
while old != my_df["A"]:
old = my_df["A"]
my_df["A"]=my_df["A"].map(lambda x: x[1:] if any(x.startswith(char) for char in "-=+ ") else x)
But I'd like to warn you about the strip() method for strings:
>>> test="+++++-= I love Mercedes-Benz"
>>> test.strip("+-=")
' I love Mercedes-Benz'
So your data extraction can become simpler:
my_df["A"].str=my_df["A"].str.strip("+=- ")
Just be careful because strip will remove the characters from both sides of the string. lstrip instead can do the job only on the left side.
The function startswith accepts a tuple of prefixes:
while test.startswith(('+','-','=',' ')):
test=test[1:]
But you can't put that in a lambda. But then, you don't need a lambda: just write the function and pass its name to map.
As a lover of regex and possibly convoluted solutions, I will add this solution as well:
import re
my_df["A"]=my_df["A"].map(lambda x: re.sub('^[*-=\s]*', '', x))
the regex reads:
^ from the beginning
[] items in this group
\s any whitespace
* zero or more
so this will match (and replace with nothing) all the characters from the beginning of the string that are in the square brackets
So I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment and I want to truncate them such that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search('\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then, it will output a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub and $ to match and substitute alphabetical characters
and underscores until the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
... i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
for name in names:
while not name[-1].isdigit():
name = name[:-1]
print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010
I want to execute substitutions using regex, not for all matches but only for specific ones. However, re.sub substitutes for all matches. How can I do this?
Here is an example.
Say, I have a string with the following content:
FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3
What I want to do is this:
re.sub(r'^BAR', '#BAR', s, index=[1,2], flags=re.MULTILINE)
to get the below result.
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could pass replacement function to re.sub that keeps track of count and checks if the given index should be substituted:
import re
s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''
i = 0
index = {1, 2}
def repl(x):
global i
if i in index:
res = '#' + x.group(0)
else:
res = x.group(0)
i += 1
return res
print re.sub(r'^BAR', repl, s, flags=re.MULTILINE)
Output:
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could
Split your string using s.splitlines()
Iterate over the individual lines in a for loop
Track how many matches you have found so far
Only perform substitutions on those matches in the numerical ranges you want (e.g. matches 1 and 2)
And then join them back into a single string (if need be).