Using regular expressions to remove a string from a column - python

I am trying to remove a string from a column using regular expressions and replace.
Name
"George # ACkDk02gfe" sold
I want to remove " # ACkDk02gfe"
I have tried several different variations of the code below, but I can't seem to remove the string I want.
df['Name'] = df['Name'].str.replace('(\#\D+\"$)','')
The output should be
George sold
The "ACkDk02gfe" portion of the string is entirely random.

Let's try this using a regex with | ("OR") and a regex group:
df['Name'].str.replace(r'"|(\s#\s\w+)', '', regex=True)
Output:
0 George sold
Name: Name, dtype: object
Updated, to also handle hashes that contain a hyphen:
df['Name'].str.replace(r'"|(\s#\s\w*[-]?\w+)', '', regex=True)
Where df is:
Name
0 "George # ACkDk02gfe" sold
1 "Mike # AisBcIy-rW" sold
Output:
0 George sold
1 Mike sold
Name: Name, dtype: object

Your pattern and syntax are wrong.
import pandas as pd
# set up the df
df = pd.DataFrame.from_dict(({'Name': '"George # ACkDk02gfe" sold'},))
# use a raw string for the pattern; the capture group keeps the first word
df['Name'] = df['Name'].str.replace(r'^"(\w+)\s#.*?"', r'\1', regex=True)
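A quick check (this is the question's expected output):
print(df['Name'].iloc[0])  # George sold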

I'll let someone else post a regex answer, but this could also be done with split. I don't know how consistent your data is, but this works for the provided string:
df['Name'] = df['Name'].str.split(' ').str[0].str[1:] + ' ' + df['Name'].str.split(' ').str[-1]
output:
George sold

This should do it for you.
Strip the quotes, then split the string on the chain whitespace, #, the text right after #, and the whitespace after that text. The split yields a list; rejoin its elements with a space using .str.join(' '):
df.Name = df.Name.str.replace('"', '').str.split(r'\s#\s\w+\s*').str.join(' ')
0 George sold

If you want to use re.sub() from the re module directly, note that it operates on a single string, not on a whole Series, so apply it element-wise:
import re
df['Name'] = df['Name'].map(lambda s: re.sub(r'"|\s#\s\w+', '', s))
should work. (pandas' own .str.replace(..., regex=True) achieves the same without the explicit element-wise call.)

import re
ss = '"George # ACkDk02gfe" sold'
ss = re.sub('"', "", ss)         # (1) remove the literal quotes
ss = re.sub(r"#\s*\w+", "", ss)  # (2) remove '#', any whitespace after it, and the following word
ss = re.sub(r"\s+", " ", ss)     # (3) collapse runs of whitespace into a single space
George sold
Given that this is the general format of your data, here is the process step by step: (1) substitute the literal "; (2) substitute the regex #\s*\w+, i.e. a literal # that may be followed by whitespace and then a word of several characters; (3) substitute runs of whitespace with a single whitespace (note \s+ rather than \s*, which would also match the empty string at every position).
You can wrap this process in a function which you can simply apply to a column. Hope it helps!
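A minimal sketch of that wrapper idea (the name clean_name is made up here):
import re

def clean_name(s):
    # the three substitutions from above, in order
    s = re.sub('"', "", s)
    s = re.sub(r"#\s*\w+", "", s)
    return re.sub(r"\s+", " ", s).strip()

df['Name'] = df['Name'].map(clean_name)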

Related

pandas regex look ahead and behind from a 1st occurrence of character

I have Python strings like the ones below:
"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the following:
a) extract the characters that appear before and after the 1st dot
b) the keywords I want are always found after the last _ symbol
For example, for the 2nd input string I would like to get PQRST.GHI as output: it sits after the last _ and before the 1st ., and we also keep the keyword after the 1st .
So, I tried the below
for s in strings:
    after_part = s.split('.')[1]
    before_part = s.split('.')[0]
    before_part = before_part.split('_')[-1]
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)
Though this works, it is definitely not a nice and elegant way to do it.
Is there a better way to write this, e.g. with a regex?
I expect my output to be as below. As you can see, we get the keywords before and after the 1st dot character:
GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Try (regex101):
import re
strings = [
"1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")
for s in strings:
    print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
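If it helps, here is the same pattern annotated with re.VERBOSE (a sketch, same behavior):
import re

pat = re.compile(r"""
    [^.]+           # greedy: everything before the first dot; backtracks
                    # so that the '_' below lands on the LAST underscore
    _               # the last underscore before the first dot
    ([^.]+\.[^.]+)  # capture: keyword, the first dot, keyword after it
    """, re.VERBOSE)

print(pat.search("1234_PQRST.GHI.xlsx").group(1))  # PQRST.GHI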
You can do it with str.extract:
df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)
Output:
0 ABCDEF.GHI
1 PQRST.GHI
2 JKLMN.OPQ
3 WXY.TUV
Name: text, dtype: object
You can also do it with rsplit(). Specify maxsplit, so that you don't split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
If there are strings with fewer than 2 dots, and each returned string should have one dot in it, then add a ternary operator that splits (or not) depending on the number of dots in the string:
[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x
for s in strings
for x in [s.rsplit('_', maxsplit=1)[1]]]
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']

How to replace strings in pandas column that are in a list?

I have scrolled through the posts on this question and was unable to find an answer to my situation. I have a pandas dataframe with a list of company names, some of which are represented as their domain name.
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
I want to remove the domain extension from all the strings. The extensions are given in a list:
web_domains = ['.com', '.us']
The following attempt did not yield any results:
df['firm'].str.lower().replace(web_domains, '')
Can someone please help me out, and possibly also explain why my solution does not work?
You need to use regex=True for Series.replace, since it matches the exact string under regex=False.
For example, 'a' will only be replaced when the value is exactly 'a', not 'ab' nor 'bab'.
web_domains = [r'\.com', r'\.us'] # escape `.` and use raw strings for regex=True
df['removed'] = df['firm'].str.lower().replace(web_domains, '', regex=True)
print(df)
firm removed
0 amazon.us amazon
1 pepsi pepsi
2 YOUTUBE.COM youtube
3 apple.inc apple.inc
Form a regex alternation from the domain endings, then use str.replace to remove them:
web_domains = ['com', 'us']
regex = r'(?i)\.(?:' + r'|'.join(web_domains) + r')$'  # (?i): ignore case, so YOUTUBE.COM matches too
df["firm"] = df["firm"].str.replace(regex, '', regex=True)
Use str.replace with a regular expression:
import re
import pandas as pd
web_domains = ['.com', '.us']
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
regex_pattern = f"({ '|'.join(re.escape(wb) for wb in web_domains)})$"
res = df["firm"].str.replace(regex_pattern, "", regex=True, flags=re.IGNORECASE)
print(res)
Output
0 amazon
1 pepsi
2 YOUTUBE
3 apple.inc
Name: firm, dtype: object
The variable regex_pattern points to the string:
(\.com|\.us)$
it will match either .com or .us at the end of the string
explain why my solution does not work?
The first argument of pandas.Series.str.replace should be a str or compiled regex; you have passed a list here, which is not allowed.
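If you would rather keep the list as-is, a minimal alternative sketch is one literal replacement per entry (note this is not anchored to the end of the string):
out = df['firm'].str.lower()
for dom in web_domains:
    out = out.str.replace(dom, '', regex=False)  # literal substring removal
df['removed'] = out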

Regex find keyword followed by N characters

I have a df column with URLs that contain keywords followed by hash values, for example:
/someurl/env40d929fadbe746ecagjbf6c515d30686/end
/some/other/url/envlabel40d929fadbe746ecagjbf6c517t30686/envvar40d929fadbe746ecagjbf6c515d306r6
The goal is to replace env followed by a 32-character hash with {env}, envlabel followed by its hash with {envlabel}, and similarly for the others.
I am trying to use a regex in the replace method:
to_replace_key = ['env', 'envlabel', 'envvar']
for word in to_replace_key:
    df['URL'] = df['URL'].str.replace(f"{word}\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w", f'{{{word}Id}}', regex=True)
Challenges:
env also replaces the start of envlabel and envvar
the keyword-plus-32-character hash sits between / characters or at the end of the line
Expected output:
/someurl/{env}/end
/some/other/url/{envlabel}/{envvar}
Thanks !!
You can use the regex (env(label|var)?)\w{32}, which captures env together with label or var when present (the ? makes that suffix optional). Replace the matched string with the first captured group, i.e. \1, wrapped in curly braces:
df['URL'].str.replace(r'(env(label|var)?)\w{32}', r'{\1}', regex=True)
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}
Edit:
If you have a lot of items in the list, do:
pat = f"({'|'.join(sorted(to_replace_key, key=len, reverse=True))})\w{{32}}"
df['URL'].str.replace(pat, r'{\1}', regex=True)
Output:
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}
Name: URL, dtype: object
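A quick illustration of why the longest alternative must come first (using a made-up 32-character hash):
import re

s = "/url/envlabel" + "a" * 32

# 'env' listed first: the engine matches 'env' and counts 'label'
# toward the 32 \w characters, leaving stray hash characters behind
print(re.sub(r'(env|envlabel)\w{32}', r'{\1}', s))  # /url/{env}aaaaa

# longest alternative first: the whole keyword is captured
print(re.sub(r'(envlabel|env)\w{32}', r'{\1}', s))  # /url/{envlabel}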
Don't loop, use:
df['URL'] = df['URL'].str.replace(r'(envlabel|envvar|env)\w{32}', r'{\1}', regex=True)
output:
URL
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}

unable to replace white spaces with dash in python using string.replace(" ", "_")

Here is my code:
import pandas as pd
import requests
url = "https://www.coingecko.com/en/coins/recently_added?page=1"
df = pd.read_html(requests.get(url).text, flavor="bs4")
df = pd.concat(df).drop(["Unnamed: 0", "Unnamed: 1"], axis=1)
df.to_csv("your_table.csv", index=False)
names = df["Coin"].str.rsplit(n=2).str[0].str.lower()
coins=names.replace(" ", "-")
print(coins)
The print statement is still printing coins with spaces in their names. I want to replace the spaces with dashes (-). Can someone please help?
You can add a parameter regex=True to make it work:
coins = names.replace(" ", "-", regex=True)
The reason is that the .replace() function looks for an exact match when replacing strings without regex=True (the default is regex=False).
Official doc:
str: string exactly matching to_replace will be replaced with value
regex: regexs matching to_replace will be replaced with value
Therefore, unless a value to replace contains exactly one space and nothing else, there will be no match, and hence no result, without regex=True.
Alternatively, you can also use .str.replace() instead of .replace():
coins = names.str.replace(" ", "-")
.str.replace() does not require an exact match; it replaces partial (substring) matches regardless of whether regex=True or regex=False.
Result:
print(coins)
0 safebreastinu
1 unicly-bored-ape-yacht-club-collection
2 bundle-dao
3 shibamax
4 swapp
...
95 apollo-space-token
96 futurov-governance-token
97 safemoon-inu
98 x
99 black-kishu-inu
Name: Coin, Length: 100, dtype: object
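For what it's worth, a minimal sketch of the difference on a toy Series (not the coin data):
import pandas as pd

s = pd.Series(["a b", "ab"])
print(s.replace(" ", "-"))              # unchanged: no value is exactly " "
print(s.replace(" ", "-", regex=True))  # a-b, ab: substitutes within each string
print(s.str.replace(" ", "-"))          # a-b, ab: always substring-based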

Python: using lambda with startswith

I need to write my dataframe to csv, and some of the values start with "+-= ", so I need to remove those characters first.
I tried to test by using a string:
test="+++++-= I love Mercedes-Benz"
while True:
if test.startswith('+') or test.startswith('-') or test.startswith('=') or test.startswith(' '):
test=test[1:]
continue
else:
print(test)
break
Output looks perfect:
I love Mercedes-Benz
Now when I want to do the same while using lambda in my dataframe:
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df.loc[len(my_df)] = ["++++-= I love Mercedes-Benz", 4, "Love this"]
my_df.loc[len(my_df)] = ["=Looks so good!", 2, "5-year-old"]
my_df
my_df["A"]=my_df["A"].map(lambda x: x[1:] if x.startswith('=') else x)
print(my_df["A"])
I am not sure how to combine the four startswith checks ("-", "=", "+", " ") and loop until they reach the first real character (which might sometimes be Japanese or Chinese).
expected final my_df:
A B C
0 I love Mercedes-Benz 4 Love this
1 Looks so good! 2 5-year-old
You can use str.lstrip in order to remove these leading characters:
my_df.A.str.lstrip('+-=')
0 I love Mercedes-Benz
1 Looks so good!
Name: A, dtype: object
One way to achieve it could be
old = pd.Series(dtype=object)
while not my_df["A"].equals(old):  # repeat until a full pass changes nothing
    old = my_df["A"]
    my_df["A"] = my_df["A"].map(lambda x: x[1:] if any(x.startswith(char) for char in "-=+ ") else x)
But I'd like to point out the strip() method for strings:
>>> test="+++++-= I love Mercedes-Benz"
>>> test.strip("+-=")
' I love Mercedes-Benz'
So your data extraction can become simpler:
my_df["A"].str=my_df["A"].str.strip("+=- ")
Just be careful, because strip will remove the characters from both sides of the string; lstrip does the job only on the left side.
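A tiny illustration of the difference (made-up string):
>>> "+= text =+".strip("+= ")
'text'
>>> "+= text =+".lstrip("+= ")
'text =+'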
The function startswith accepts a tuple of prefixes:
while test.startswith(('+', '-', '=', ' ')):
    test = test[1:]
But you can't put a while loop in a lambda. Then again, you don't need a lambda: just write a function and pass its name to map.
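A minimal sketch of that suggestion (the name strip_leading is made up here):
def strip_leading(x):
    # peel off leading '+', '-', '=' and spaces one character at a time
    while x.startswith(('+', '-', '=', ' ')):
        x = x[1:]
    return x

my_df["A"] = my_df["A"].map(strip_leading)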
As a lover of regex and possibly convoluted solutions, I will add this solution as well:
import re
my_df["A"]=my_df["A"].map(lambda x: re.sub('^[*-=\s]*', '', x))
the regex reads:
^ from the beginning
[] items in this group
\s any whitespace
* zero or more
so this will match (and replace with nothing) all the characters from the beginning of the string that are in the square brackets
