Python: using lambda with startswith

I need to write my dataframe to CSV, and some of the series start with "+-= ", so I need to remove those characters first.
I tried to test by using a string:
test="+++++-= I love Mercedes-Benz"
while True:
if test.startswith('+') or test.startswith('-') or test.startswith('=') or test.startswith(' '):
test=test[1:]
continue
else:
print(test)
break
Output looks perfect:
I love Mercedes-Benz
Now when I want to do the same while using lambda in my dataframe:
import pandas as pd

col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns=col_names)
my_df.loc[len(my_df)] = ["++++-= I love Mercedes-Benz", 4, "Love this"]
my_df.loc[len(my_df)] = ["=Looks so good!", 2, "5-year-old"]
my_df

my_df["A"] = my_df["A"].map(lambda x: x[1:] if x.startswith('=') else x)
print(my_df["A"])
I am not sure how to put the four startswith checks for "-", "=", "+" and " " together and loop until the string starts with a real character (which might sometimes be Japanese or Chinese).
expected final my_df:
A B C
0 I love Mercedes-Benz 4 Love this
1 Looks so good! 2 5-year-old

You can use str.lstrip in order to remove these leading characters (include the space in the strip set, since the unwanted prefix ends with a blank):
my_df.A.str.lstrip('+-= ')
0 I love Mercedes-Benz
1 Looks so good!
Name: A, dtype: object
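
To keep the cleaned values, assign the result back to the column:
my_df["A"] = my_df["A"].str.lstrip('+-= ')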

One way to achieve it could be:
old = None
while old is None or not old.equals(my_df["A"]):  # Series.equals avoids an ambiguous truth value
    old = my_df["A"].copy()
    my_df["A"] = my_df["A"].map(lambda x: x[1:] if any(x.startswith(char) for char in "-=+ ") else x)
But I'd like to warn you about the strip() method for strings:
>>> test="+++++-= I love Mercedes-Benz"
>>> test.strip("+-=")
' I love Mercedes-Benz'
So your data extraction can become simpler:
my_df["A"].str=my_df["A"].str.strip("+=- ")
Just be careful because strip will remove the characters from both sides of the string. lstrip instead can do the job only on the left side.
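A quick illustration of the difference:
>>> "=abc= ".strip("= ")
'abc'
>>> "=abc= ".lstrip("= ")
'abc= '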

The function startswith accepts a tuple of prefixes:
while test.startswith(('+', '-', '=', ' ')):
    test = test[1:]
But you can't put a while loop in a lambda. Then again, you don't need a lambda: just write a function and pass its name to map.
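For example, a minimal sketch built on the my_df from the question (strip_leading is a hypothetical helper name):
def strip_leading(x):
    # peel off leading '+', '-', '=' and spaces one at a time
    while x.startswith(('+', '-', '=', ' ')):
        x = x[1:]
    return x

my_df["A"] = my_df["A"].map(strip_leading)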

As a lover of regex and possibly convoluted solutions, I will add this solution as well:
import re
my_df["A"] = my_df["A"].map(lambda x: re.sub(r'^[-=+\s]*', '', x))
the regex reads:
^ - from the beginning
[-=+\s] - any character in this group: -, =, + or whitespace
\s - any whitespace character
* - zero or more times
so this will match (and replace with nothing) all the characters at the beginning of the string that are in the square brackets
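The same pattern also works without a Python-level lambda, via pandas' vectorized string methods (a sketch; the regex= keyword assumes a reasonably recent pandas):
my_df["A"] = my_df["A"].str.replace(r'^[-=+\s]+', '', regex=True)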

Related

pandas regex look ahead and behind from a 1st occurrence of character

I have Python strings like the ones below:
"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the below
a) extract characters that appear before and after 1st dot
b) The keywords that I want are always found after the last _ symbol
For example, for the 2nd input string I would like to get only PQRST.GHI as output: it comes after the last _ and before the 1st ., and we also keep the keyword after the 1st .
So, I tried the below
for s in strings:
    after_part = s.split('.')[1]
    before_part = s.split('.')[0]
    before_part = before_part.split('_')[-1]
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)
Though this works, it is definitely not a nice and elegant way to solve this.
Is there any other better way to write this?
I expect my output to be as below. As you can see, we get the keywords before and after the 1st dot character:
GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Try:
import re

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

for s in strings:
    print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
You can do:
df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)
Output:
0 ABCDEF.GHI
1 PQRST.GHI
2 JKLMN.OPQ
3 WXY.TUV
Name: text, dtype: object
You can also do it with rsplit(). Specify maxsplit, so that you don't split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
If there are strings with fewer than 2 dots and each returned string should keep one dot in it, then add a conditional expression that splits (or not) depending on the number of dots in the string:
[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x
 for s in strings
 for x in [s.rsplit('_', maxsplit=1)[1]]]
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
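If you prefer a named helper over the nested comprehension, the same logic reads as follows (a sketch; extract_keyword is a hypothetical name):
def extract_keyword(s):
    # take the part after the last '_' ...
    tail = s.rsplit('_', maxsplit=1)[1]
    # ... and drop the trailing extension only when there is more than one dot
    return tail.rsplit('.', maxsplit=1)[0] if tail.count('.') > 1 else tail

print([extract_keyword(s) for s in strings])
# with the question's strings: ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']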

list named argument in a string [duplicate]

u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
All I need is the contents inside the parentheses.
If your problem is really just this simple, you don't need regex:
s[s.find("(")+1:s.find(")")]
Use re.search(r'\((.*?)\)',s).group(1):
>>> import re
>>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
>>> re.search(r'\((.*?)\)',s).group(1)
u"date='2/xc2/xb2',time='/case/test.png'"
If you want to find all occurrences (extending s with a second pair of parentheses for the sake of the example):
>>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')(eee)'
>>> re.findall('\(.*?\)',s)
[u"(date='2/xc2/xb2',time='/case/test.png')", u'(eee)']
>>> re.findall('\((.*?)\)',s)
[u"date='2/xc2/xb2',time='/case/test.png'", u'eee']
Building on tkerwin's answer, if you happen to have nested parentheses like in
st = "sum((a+b)/(c+d))"
his answer will not work if you need to take everything between the first opening parenthesis and the last closing parenthesis to get (a+b)/(c+d), because find searches from the left of the string, and would stop at the first closing parenthesis.
To fix that, you need to use rfind for the second part of the operation, so it would become
st[st.find("(")+1:st.rfind(")")]
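Verifying on the nested example:
>>> st = "sum((a+b)/(c+d))"
>>> st[st.find("(")+1:st.rfind(")")]
'(a+b)/(c+d)'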
import re
fancy = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
print(re.compile(r"\((.*)\)").search(fancy).group(1))
contents_re = re.match(r'[^\(]*\((?P<contents>[^\(]+)\)', data)
if contents_re:
    print(contents_re.groupdict()['contents'])
No need to use regex. Just use string slicing:
string="(tidtkdgkxkxlgxlhxl) ¥£%#_¥#_¥#_¥#"
print(string[string.find("(")+1:string.find(")")])
TheSoulkiller's answer is great. Just in my case, I needed to handle extra parentheses and extract only the word inside them; a very small change solves the problem:
>>> s=u'abcde((((a+b))))-((a*b))'
>>> re.findall('\((.*?)\)',s)
['(((a+b', '(a*b']
>>> re.findall('\(+(.*?)\)',s)
['a+b', 'a*b']
Here are several ways to extract strings between parentheses in Pandas with the \(([^()]+)\) regex (see its online demo) that matches
\( - a ( char
([^()]+) - then captures into Group 1 any one or more chars other than ( and )
\) - a ) char.
Extracting the first occurrence using Series.str.extract:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.extract(r'\(([^()]+)\)')
# => df['Values']
# 0 value 1
# Name: Values, dtype: object
Extracting (finding) all occurrences using Series.str.findall:
import pandas as pd
df = pd.DataFrame({'Description':['some text (value 1) and (value 2)']})
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)')
# => df['Values']
# 0 [value 1, value 2]
# Name: Values, dtype: object
df['Values'] = df['Description'].str.findall(r'\(([^()]+)\)').str.join(', ')
# => df['Values']
# 0 value 1, value 2
# Name: Values, dtype: object
Note that .str.join(', ') is used to create a comma-separated string out of the resulting list of strings. You may adjust this separator for your scenario.
Test case:
s = "(rein<unint>(pBuf) +fsizeof(LOG_RECH))"
Result:
['pBuf', 'LOG_RECH', 'rein<unint>(pBuf) +fsizeof(LOG_RECH)']
Implementation:
def getParenthesesList(s):
    res = list()
    left = list()
    for i in range(len(s)):
        if s[i] == '(':
            left.append(i)  # remember where each '(' opens
        if s[i] == ')':
            le = left.pop()  # match it with the most recent '('
            res.append(s[le + 1:i])
    print(res)
    return res
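A quick check against the test case above:
getParenthesesList("(rein<unint>(pBuf) +fsizeof(LOG_RECH))")
# prints ['pBuf', 'LOG_RECH', 'rein<unint>(pBuf) +fsizeof(LOG_RECH)']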
If I'm not missing something, a small fix to @tkerwin's answer:
s[s.find("(")+1:s.rfind(")")]
The second find should be rfind, so that the search starts from the end of the string.

Using regular expressions to remove a string from a column

I am trying to remove a string from a column using regular expressions and replace.
Name
"George # ACkDk02gfe" sold
I want to remove " # ACkDk02gfe"
I have tried several different variations of the code below, but I can't seem to remove the string I want.
df['Name'] = df['Name'].str.replace('(\#\D+\"$)','')
The output should be
George sold
The "ACkDk02gfe" portion of the string is entirely random.
Let's try this using regex with | ("OR") and regex group:
df['Name'].str.replace('"|(\s#\s\w+)','', regex=True)
Output:
0 George sold
Name: Name, dtype: object
Updated to also handle tokens that contain a hyphen:
df['Name'].str.replace(r'"|(\s#\s\w*[-]?\w+)', '', regex=True)
Where df is:
Name
0 "George # ACkDk02gfe" sold
1 "Mike # AisBcIy-rW" sold
Output:
0 George sold
1 Mike sold
Name: Name, dtype: object
Your pattern and syntax are wrong.
import pandas as pd

# set up the df
df = pd.DataFrame.from_dict(({'Name': '"George # ACkDk02gfe" sold'},))
# use a raw string for the pattern (regex=True is needed in newer pandas)
df['Name'] = df['Name'].str.replace(r'^"(\w+)\s#.*?"', r'\1', regex=True)
I'll let someone else post a regex answer, but this could also be done with split. I don't know how consistent the data you are looking at is, but this would work for the provided string:
df['Name'] = df['Name'].str.split(' ').str[0].str[1:] + ' ' + df['Name'].str.split(' ').str[-1]
output:
George sold
This should do it for you.
Split the string on a chain of whitespace, #, the text immediately after #, and the whitespace after that text. This results in a list; .str.join(' ') then glues the remaining pieces back together with a space.
df.Name = df.Name.str.split(r'\s\#\s\w+\s').str.join(' ')
0 George sold
You can also use Python's re module directly. Note that re.sub() operates on a single string, not a whole column, so apply it element-wise:
import re

df['Name'] = df['Name'].apply(lambda x: re.sub(r'"|\s#\s\w+', '', x))
should work.
import re

ss = '"George # ACkDk02gfe" sold'
ss = re.sub(r'"', "", ss)         # (1) remove the literal quotes
ss = re.sub(r'\#\s*\w+', "", ss)  # (2) remove '#' plus the token after it
ss = re.sub(r'\s+', " ", ss)      # (3) collapse runs of whitespace into one space
George sold
Given that this is the general format of your data, here is the process I followed: (1) substitute the literal "; (2) substitute the regex \#\s*\w+, a literal # possibly followed by whitespace and then an alphanumeric token; (3) substitute runs of whitespace with a single space. Note the last pattern must be \s+, not \s*, since \s* also matches the empty string between every pair of characters.
You can wrap this process in a function that you can simply apply to a column. Hope it helps!
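For instance, a minimal sketch of such a wrapper (clean_name is a hypothetical name, applied to the df from the question):
import re

def clean_name(s):
    s = re.sub(r'"', '', s)         # (1) drop the literal quotes
    s = re.sub(r'\#\s*\w+', '', s)  # (2) drop '#' and the random token
    return re.sub(r'\s+', ' ', s)   # (3) collapse whitespace

df['Name'] = df['Name'].apply(clean_name)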

Slice string at last digit in Python

So I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment and I want to truncate them such that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search('\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then, it will output a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub and $ to match and substitute alphabetical characters
and underscores until the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
... i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
    names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
    for name in names:
        while not name[-1].isdigit():
            name = name[:-1]
        print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010
