Regex find keyword followed by N characters - python

I have a df column with URLs containing keywords followed by hash values, for example:
/someurl/env40d929fadbe746ecagjbf6c515d30686/end
/some/other/url/envlabel40d929fadbe746ecagjbf6c517t30686/envvar40d929fadbe746ecagjbf6c515d306r6
The goal is to replace env followed by a 32-character hash with {env}, envlabel followed by a 32-character hash with {envlabel}, and similarly for the others.
I am trying to use regex in the replace method:
to_replace_key = ['env', 'envlabel', 'envvar']
for word in to_replace_key:
    df['URL'] = df['URL'].str.replace(f"{word}\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w", f'{{{word}Id}}', regex=True)
Challenges:
replacing env also clobbers the env prefix of envlabel and envvar
the keyword followed by its 32-char hash appears as a substring either between / characters or at the end of the line
Expected output:
/someurl/{env}/end
/some/other/url/{envlabel}/{envvar}
Thanks !!

You can use the regex r'(env(label|var)?)\w{32}', which captures env together with label or var when present; the ? makes the suffix optional. Replace the matched string with the first captured group, i.e. \1 inside the curly braces.
df['URL'].str.replace(r'(env(label|var)?)\w{32}', '{\\1}', regex=True)
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}
Edit:
If you have a lot of items in the list, do:
pat = f"({'|'.join(sorted(to_replace_key, key=len, reverse=True))})" + r'\w{32}'
df['URL'].str.replace(pat, '{\\1}', regex=True)
Output:
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}
Name: URL, dtype: object
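Why the longest-first sort matters: regex alternation tries branches left to right, so a bare env can eat the prefix of envlabel. A standalone sketch with plain re (the url value here is a shortened, made-up variant of the question's examples):

```python
import re

to_replace_key = ['env', 'envlabel', 'envvar']
url = '/url/envlabel40d929fadbe746ecagjbf6c517t30686'

# Unsorted: 'env' is tried first, matches the prefix of 'envlabel',
# and then \w{32} consumes 'label' plus part of the hash.
bad = f"({'|'.join(to_replace_key)})" + r'\w{32}'
print(re.sub(bad, r'{\1}', url))    # /url/{env}30686

# Longest-first: 'envlabel' and 'envvar' are tried before 'env'.
good = f"({'|'.join(sorted(to_replace_key, key=len, reverse=True))})" + r'\w{32}'
print(re.sub(good, r'{\1}', url))   # /url/{envlabel}
```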

Don't loop, use:
df['URL'] = df['URL'].str.replace(r'(envlabel|envvar|env)\w{32}', r'{\1}', regex=True)
output:
URL
0 /someurl/{env}/end
1 /some/other/url/{envlabel}/{envvar}

How to replace strings in pandas column that are in a list?

I have scrolled through the posts on this question and was unable to find an answer to my situation. I have a pandas dataframe with a list of company names, some of which are represented as their domain name.
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
I want to remove the domain extension from all the strings. The extensions are given in a list:
web_domains = ['.com', '.us']
The following attempt did not yield any results:
df['firm'].str.lower().replace(web_domains, '')
Can someone please help me out, and possibly also explain why my solution does not work?
You need to use regex=True for Series.replace, since it matches the exact string under regex=False.
For example, a will only be replaced when the target is exactly a, not ab nor bab.
web_domains = [r'\.com', r'\.us']  # escape `.` for regex=True
df['removed'] = df['firm'].str.lower().replace(web_domains, '', regex=True)
print(df)
firm removed
0 amazon.us amazon
1 pepsi pepsi
2 YOUTUBE.COM youtube
3 apple.inc apple.inc
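The exact-match behaviour described above is easy to see in isolation; a minimal sketch with a two-element Series:

```python
import pandas as pd

s = pd.Series(['amazon.us', 'pepsi'])

# Default regex=False: '.us' must equal the WHOLE cell value, so nothing changes.
print(s.replace('.us', '').tolist())                 # ['amazon.us', 'pepsi']

# regex=True: the pattern is searched anywhere inside each value.
print(s.replace(r'\.us', '', regex=True).tolist())   # ['amazon', 'pepsi']
```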
Form a regex alternation from the domain endings, then use str.replace to remove them:
web_domains = ['com', 'us']
regex = r'\.(?:' + r'|'.join(web_domains) + r')$'
df["firm"] = df["firm"].str.replace(regex, '', regex=True)
Use str.replace with a regular expression:
import re
import pandas as pd
web_domains = ['.com', '.us']
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
regex_pattern = f"({ '|'.join(re.escape(wb) for wb in web_domains)})$"
res = df["firm"].str.replace(regex_pattern, "", regex=True, flags=re.IGNORECASE)
print(res)
Output
0 amazon
1 pepsi
2 YOUTUBE
3 apple.inc
Name: firm, dtype: object
The variable regex_pattern points to the string:
(\.com|\.us)$
it will match either .com or .us at the end of the string
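A quick check of the $ anchor with plain re (site.us.backup is a made-up value to show a non-final .us):

```python
import re

pattern = r'(\.com|\.us)$'

print(re.sub(pattern, '', 'amazon.us'))       # amazon
print(re.sub(pattern, '', 'site.us.backup'))  # site.us.backup ('.us' is not at the end)
```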
explain why my solution does not work?
pandas.Series.str.replace's first argument must be a
str or compiled regex
and you have passed a list here, which is not allowed. (Series.replace does accept a list, but under the default regex=False each element must match the whole string, as explained above.)

Python re.search for multiple values in the same line

I am trying to use re.search (or re.findall) to interpret a line and replace the keyword with a value.
My example string is:
line = 'Text1 <<ALTER, variable = Ion1>> Text2 <<ALTER, variable = Value1>>\n'
With values of Ion1 of 'Na' and Value1 of 1.0, I would like to have the return of
processedline = 'Text1 Na Text2 1.0'
To do so, I tried the following code:
result = re.search('<<ALTER(.*)>>', line)
aux_txt = result.group(1).split('=')
var = aux_txt[-1].strip()
value = ParameterDictionary[var]
processedline = re.sub('<<ALTER(.*)>>', str(value), line, flags=re.DOTALL)
However, the return I am getting, for the variable result, is ', variable = Ion1>> Text2 <<ALTER, variable = Value1', i.e., it does not treat independently both keywords.
Does anyone have an idea? Thanks in advance!
That is because your regex matches the entire string (up to the last >>) instead of stopping at the first occurrence of >> after Ion1. You need to use the lazy operator with your .* to limit the match.
What .*? does is this: It matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Here is an example with an explanation: https://regex101.com/r/oKyOIn/1
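The greedy/lazy difference on the question's own line:

```python
import re

line = 'Text1 <<ALTER, variable = Ion1>> Text2 <<ALTER, variable = Value1>>\n'

# Greedy .* runs to the LAST '>>', which is why result.group(1)
# spans both placeholders:
print(re.search(r'<<ALTER(.*)>>', line).group(1))
# ', variable = Ion1>> Text2 <<ALTER, variable = Value1'

# Lazy .*? stops at the FIRST '>>':
print(re.search(r'<<ALTER(.*?)>>', line).group(1))
# ', variable = Ion1'
```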
re.search is the wrong tool for this task: it returns the first (leftmost) match, or None if no match was found. You should use either re.finditer, which yields an iterator of Match objects, or re.findall, which returns a list of strs or tuples.
Also, as already noted, you need to change your pattern <<ALTER(.*)>> as it matches too much; you might use the non-greedy version, i.e.
<<ALTER(.*?)>>
or, if > is not allowed inside << and >>, harness that as follows:
<<ALTER([^>]*)>>
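Both variants behave the same on the sample line; a quick sketch with re.findall:

```python
import re

line = 'Text1 <<ALTER, variable = Ion1>> Text2 <<ALTER, variable = Value1>>\n'

# Non-greedy version
print(re.findall(r'<<ALTER(.*?)>>', line))
# Negated-class version: cannot cross a '>' at all
print(re.findall(r'<<ALTER([^>]*)>>', line))
# Both print [', variable = Ion1', ', variable = Value1']
```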
You need to capture one or more word characters (alphanumeric including underscores) inside <<ALTER, variable = and >>, and then use a callable in the re.sub method replacement argument:
See the Python demo:
import re
ParameterDictionary = {'Ion1': 'Na', 'Value1': '1.0'}
line = 'Text1 <<ALTER, variable = Ion1>> Text2 <<ALTER, variable = Value1>>\n'
rx = r'<<ALTER, variable = (\w+)>>'
result = re.sub(rx, lambda x: ParameterDictionary.get(x.group(1), x.group()), line)
print(result)
# => Text1 Na Text2 1.0
Here,
<<ALTER, variable = (\w+)>> matches <<ALTER, variable =, space, then (\w+) captures into Group 1 any one or more word chars and then >> is matched
The match is passed into re.sub within a lambda expression, as x, and the ParameterDictionary.get(x.group(1), x.group()) either returns the corresponding value by found key, or the whole match (x.group()).
Using groups to capture within re.sub seems to be what you are looking for. re.sub accepts a function as the repl (replacement string argument). The function is evaluated with the match object as argument. See docs.
>>> param_dict = {'Ion1': 'Na', 'Value1': '1.0'}
>>> re.sub(r'<<ALTER, variable = ([\w\d]+)>>', lambda m: param_dict[m.group(1)], line)
'Text1 Na Text2 1.0\n'
The regex group ([\w\d]+) can be adapted to the kind of values you expect to find.
Using raw strings (starting with r') for regexes in python is good practice and can save you from headaches.
Using .* is too broad and captures everything between <<ALTER and >>. Why not use a more specific regexp?
>>> re.findall(r"<<ALTER, variable = (\w+)>>", line)
['Ion1', 'Value1']
Thanks a lot!
It worked perfectly like this:
import re
ParameterDictionary = {'Ion1': 'Na', 'Value1': '1.0'}
line = 'Text1 <<ALTER, variable = Ion1>> Text2 <<ALTER, variable = Value1>>\n'
result = re.findall(r'<<ALTER, variable = (\w+)>>', line)
for txt in result:
    aux_txt = f'<<ALTER, variable = {txt}>>'
    value = ParameterDictionary[txt]
    line = re.sub(aux_txt, str(value), line, flags=re.DOTALL)

unable to replace white spaces with dash in python using string.replace(" ", "_")

Here is my code:
import pandas as pd
import requests
url = "https://www.coingecko.com/en/coins/recently_added?page=1"
df = pd.read_html(requests.get(url).text, flavor="bs4")
df = pd.concat(df).drop(["Unnamed: 0", "Unnamed: 1"], axis=1)
df.to_csv("your_table.csv", index=False)
names = df["Coin"].str.rsplit(n=2).str[0].str.lower()
coins=names.replace(" ", "-")
print(coins)
The print is still printing coins with spaces in their names. I want to replace the spaces with dashes (-). Can someone please help?
You can add a parameter regex=True to make it work:
coins = names.replace(" ", "-", regex=True)
The reason is that for .replace() function, it looks for exact match when replacing strings without regex=True (default is regex=False):
Official doc:
str: string exactly matching to_replace will be replaced with value
regex: regexs matching to_replace will be replaced with value
Therefore, unless your strings to replace contains only exactly one space (no other characters), there will be no match for the string to replace. Hence, no result with the code without regex=True.
Alternatively, you can also use .str.replace() instead of .replace():
coins = names.str.replace(" ", "-")
.str.replace() does not require exact match and matches for partial string regardless of whether regex=True or regex=False.
Result:
print(coins)
0 safebreastinu
1 unicly-bored-ape-yacht-club-collection
2 bundle-dao
3 shibamax
4 swapp
...
95 apollo-space-token
96 futurov-governance-token
97 safemoon-inu
98 x
99 black-kishu-inu
Name: Coin, Length: 100, dtype: object
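A minimal side-by-side of the two methods, using a couple of the coin names from the output above:

```python
import pandas as pd

names = pd.Series(['bundle dao', 'swapp'])

# Series.replace, default regex=False: whole-value match only, so no change.
print(names.replace(' ', '-').tolist())      # ['bundle dao', 'swapp']

# str.replace substitutes inside each string; no regex needed for a literal space.
print(names.str.replace(' ', '-').tolist())  # ['bundle-dao', 'swapp']
```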

Using regular expressions to remove a string from a column

I am trying to remove a string from a column using regular expressions and replace.
Name
"George # ACkDk02gfe" sold
I want to remove " # ACkDk02gfe"
I have tried several different variations of the code below, but I can't seem to remove the string I want.
df['Name'] = df['Name'].str.replace('(\#\D+\"$)','')
The output should be
George sold
This portion of the string "ACkDk02gfe is entirely random.
Let's try this using regex with | ("OR") and regex group:
df['Name'].str.replace(r'"|(\s#\s\w+)', '', regex=True)
Output:
0 George sold
Name: Name, dtype: object
Updated
df['Name'].str.replace(r'"|(\s#\s\w*[-]?\w+)', '', regex=True)
Where df,
Name
0 "George # ACkDk02gfe" sold
1 "Mike # AisBcIy-rW" sold
Output:
0 George sold
1 Mike sold
Name: Name, dtype: object
Your pattern and syntax are wrong.
import pandas as pd
# set up the df
df = pd.DataFrame.from_dict(({'Name': '"George # ACkDk02gfe" sold'},))
# use a raw string for the pattern
df['Name'] = df['Name'].str.replace(r'^"(\w+)\s#.*?"', '\\1', regex=True)
I'll let someone else post a regex answer, but this could also be done with split. I don't know how consistent the data you are looking at is, but this would work for the provided string:
df['Name'] = df['Name'].str.split(' ').str[0].str[1:] + ' ' + df['Name'].str.split(' ').str[-1]
output:
George sold
This should do for you
Split the string by the chain of whitespace, #, the text immediately after #, and the whitespace after that text. This results in a list; join its elements back with a space using .str.join(' '). Strip the quotes first so they don't survive the split:
df.Name = df.Name.str.replace('"', '').str.split(r'\s#\s\w+\s').str.join(' ')
0 George sold
To use re.sub() for the replacement, note that it operates on a single string rather than a whole Series, so apply it element-wise:
import re
df['Name'] = df['Name'].apply(lambda s: re.sub(r'"(\w+)\s#\s\w+"', r'\1', s))
should work.
import re
ss = '"George # ACkDk02gfe" sold'
ss = re.sub('"', "", ss)
ss = re.sub(r"\#\s*\w+", "", ss)
ss = re.sub(r"\s+", " ", ss)
George sold
Given that this is the general format of your code, here's the process I made: (1) substitute the literal " (2) substitute the regex \#\s*\w+ (a literal # that may be followed by whitespace(s), then a word of one or more characters) (3) substitute runs of whitespace with a single space.
You can wrap around a function to this process which you can simply call to a column. Hope it helps!

Substitute specific matches using regex

I want to execute substitutions using regex, not for all matches but only for specific ones. However, re.sub substitutes for all matches. How can I do this?
Here is an example.
Say, I have a string with the following content:
FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3
What I want to do is this:
re.sub(r'^BAR', '#BAR', s, index=[1,2], flags=re.MULTILINE)
to get the below result.
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could pass a replacement function to re.sub that keeps track of a count and checks whether the given index should be substituted:
import re
s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''
i = 0
index = {1, 2}

def repl(x):
    global i
    if i in index:
        res = '#' + x.group(0)
    else:
        res = x.group(0)
    i += 1
    return res

print(re.sub(r'^BAR', repl, s, flags=re.MULTILINE))
Output:
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could
Split your string using s.splitlines()
Iterate over the individual lines in a for loop
Track how many matches you have found so far
Only perform substitutions on those matches in the numerical ranges you want (e.g. matches 1 and 2)
And then join them back into a single string (if need be).
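The steps above can be sketched as follows (sub_selected is a hypothetical helper name; match indices are 0-based here):

```python
import re

s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''

def sub_selected(pattern, repl, text, indices):
    """Apply re.sub line by line, but only to the n-th matching lines
    listed in `indices` (counting matches from 0)."""
    out, n = [], 0
    for ln in text.splitlines():
        if re.match(pattern, ln):
            if n in indices:
                ln = re.sub(pattern, repl, ln)
            n += 1
        out.append(ln)
    return '\n'.join(out)

print(sub_selected(r'^BAR', '#BAR', s, {1, 2}))
# FOO=foo1
# BAR=bar1
# FOO=foo2
# #BAR=bar2
# #BAR=bar3
```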
