Use regex to replace words before any digit with nothing - python

I want to replace words and spaces that appear before a digit in a string with nothing. For example, for the string = 'Juice of 1/2', I want to return '1/2'. I tried the following, but it did not work.
string = "Juice of 1/2"
new = string.replace(r"^.+?(?=\d)", "")
Also, I am trying to perform this on every cell of a list of columns using the following code. How would I incorporate the new regex pattern into the existing pattern of r"\(|\)|,"?
df[pd.Index(cols2) + "_clean"] = (
df[cols2]
.apply(lambda col: col.str.replace(r"\(|\)|,", "", regex=True))
)

You might be able to phrase this using str.extract:
df["col2"] = df["col2"].str.extract(r'([0-9/-]+)')
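For a single string, the same extraction idea can be sketched with re.search. This is only an illustration: the helper name first_quantity and the stricter pattern are assumptions, not part of the answer above. The character class [0-9/-]+ also matches a lone hyphen or slash, so a pattern that insists on a leading digit may be safer if the text can contain those characters.

```python
import re

LOOSE = re.compile(r"[0-9/-]+")           # pattern from the answer above
STRICT = re.compile(r"\d+(?:[/-]\d+)*")   # assumed variant: must start with a digit

def first_quantity(text, pattern=STRICT):
    """Return the first quantity-like token in text, or None if there is none."""
    m = pattern.search(text)
    return m.group() if m else None

print(first_quantity("Juice of 1/2"))         # 1/2
print(first_quantity("Juice - 1/2", LOOSE))   # -
print(first_quantity("Juice - 1/2"))          # 1/2
```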

.+? will match anything, including other digits. It will also match the / in 1/2. Since you only want to replace letters and spaces, use [a-z\s]+.
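You also have to use re.sub(), not string.replace(), because str.replace() only handles literal substrings. (In pandas, .str.replace() treats the pattern as a regular expression when regex=True is passed; older versions did so by default, but since pandas 2.0 the default is regex=False.)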
new = re.sub(r'[a-z\s]+(?=\d)', '', string, flags=re.I)
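To fold this into the existing cleanup pattern from the question, one possibility is to join the two with an alternation. The sketch below uses plain re so it runs without pandas; the names CLEAN and clean_cell and the sample strings are made up for illustration. The same pattern string could presumably be passed to .str.replace() with regex=True and flags=re.I in the DataFrame code from the question.

```python
import re

# Letters/spaces directly before a digit, or any of the characters ( ) ,
CLEAN = re.compile(r"[a-z\s]+(?=\d)|[(),]", flags=re.I)

def clean_cell(value):
    """Strip words that lead up to a digit and drop parentheses and commas."""
    return CLEAN.sub("", value)

print(clean_cell("Juice of 1/2"))       # 1/2
print(clean_cell("(Juice of 1/2)"))     # 1/2
print(clean_cell("Zest of 1, lemon"))   # 1 lemon
```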

Maybe something like this would work.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"[A-Za-z\s]+"
test_str = "Juice of 1/2 hede"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Related

RegEx for capturing and replacing all digits in a string except for special patterns

I have a text where digits appear in every possible way.
For example,
text = "hello23 the2e are 13 5.12apples *specially_x00123 named 31st"
I want to replace all digits with '#'s except the ones in a special pattern starting with *, a word, underscore, any character, and number such that *\w+_[a-z]\d+ (i.e., *specially_x00123).
I've tried to use lookaround syntax and non-capturing group but can't find a way to exactly change this to as below
text_cleaned = "hello## the#e are ## #.##apples *specially_x00123 named ##st"
I can use a pattern like below:
p1 = r'\d(?<!\*\w+_\w+)'
Then, it complains like this; "look-behind requires fixed-width pattern"
I tried to use non-capturing group:
p2 = r'(?:\*[a-z]+_\w+)\b|\d'
It takes out the special token (*specially_x00123) and all the digits. I think this is something that I may include in the solution, but I can't find how. Any ideas?
What you might do is capture the digit in a capturing group (\d) and use a callback in the replacement checking for the first capturing group.
If it is group 1, replace with a #, else return the match.
As \w+ also matches an underscore, you might first match word characters other than the underscore using a negated character class [^\W_\n]+
\*[^\W_\n]+_[a-z]\d+\b|(\d)
import re
text = "hello23 the2e are 13 5.12apples *specially_x00123 named 31st"
pattern = r"\*[^\W_\n]+_[a-z]\d+\b|(\d)"
print (re.sub(pattern, lambda x: "#" if x.group(1) else x.group(), text))
Result
hello## the#e are ## #.##apples *specially_x00123 named ##st
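The callback idea above can be wrapped in a small reusable helper. This is a sketch under assumptions: the function name mask_digits and its parameters are hypothetical, and a named group is used instead of group 1 so that the protected pattern may contain groups of its own.

```python
import re

def mask_digits(text, protected, mask="#"):
    """Replace every digit with mask, except inside matches of protected."""
    # The protected alternative is tried first at each position;
    # only digits outside such spans end up in the named group "digit".
    pattern = re.compile(rf"(?:{protected})|(?P<digit>\d)")
    return pattern.sub(lambda m: mask if m.group("digit") else m.group(0), text)

text = "hello23 the2e are 13 5.12apples *specially_x00123 named 31st"
print(mask_digits(text, r"\*[^\W_]+_[a-z]\d+\b"))
# hello## the#e are ## #.##apples *specially_x00123 named ##st
```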
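Another option is to split the string at the star. The expression (\d) captures each digit before the star, which we replace with #, while (\*.*) captures the star and everything after it so that it can be put back with $2 (\2 in Python). Note that the replacement also writes a # in front of the starred part, and digits that follow the special token on the same line (such as 31st) are left untouched, so this only approximates the requested output: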
(\d)|(\*.*)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(\d)|(\*.*)"
test_str = ("hello23 the2e are 13 5.12apples *specially_x00123 named\n\n"
            "hello## the#e are ## #.##apples *specially_x00123 named")
subst = "#\\2"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
regex101.com
const regex = /(\d)|(\*.*)/gm;
const str = `hello23 the2e are 13 5.12apples *specially_x00123 named`;
const subst = `#$2`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);

RegEx for capturing and replacing digits in a pattern

I would like to replace the 3rd argument in the string with a new number (let's say 100). The matched string always starts with function, with the first argument either true or false, and with the number as second argument.
Expected
                   |                           |
                   v                           v
'function(true, 0, 15)' --> 'function(true, 0, 100)'
'function(false, 0, 23)' --> 'function(false, 0, 100)'
I have been reading the related posts but I believe I must have misunderstood some regex concept. The following code is that I had tried but it always replaces the whole string:
import re
string = 'function(true, 0, 15)'
regex = re.compile('function\([a-zA-Z]*, [0-9]*, ([0-9]*)\)')
res = re.sub(regex, '100', string)
print(res) # 100
# Expected: function(true, 0, 100)
Question: Could you point me out why the above code doesn't work? How would I write the code to achieve the expected result?
As the number you are trying to replace is immediately followed by a closing parenthesis ), you can use the regex \d+(?=\s*\)) and replace the match with 100 or whatever value you want. Try this Python code:
import re
string = 'function(true, 0, 15)'
regex = re.compile(r'\d+(?=\s*\))')
res = re.sub(regex, '100', string)
print(res)
Prints,
function(true, 0, 100)
Your code replaces the whole string because your regex matches the entire input, and re.sub() replaces everything the pattern matches with the replacement string. To change only the third argument, the regex should match only that value, as the pattern above does.
If you would rather match the whole input and selectively replace only the third argument, you can capture the function name and the first two parameters in group 1, much as in your original regex:
(function\([a-zA-Z]*, [0-9]*, )[0-9]*\)
and replace it with \g<1>100), where \g<1> references the text captured in group 1 and 100) is appended after it. The \g<1> form is needed because \1100 would not be read as group 1 followed by 100.
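As a runnable sketch of the group-based replacement described above (the helper name set_third_arg is hypothetical), the template form and a function replacement can be compared side by side. A function replacement may be preferable when the new value is arbitrary text, since it is not parsed as a replacement template.

```python
import re

pattern = re.compile(r"(function\([a-zA-Z]*, [0-9]*, )[0-9]*\)")

def set_third_arg(call, value):
    """Return call with its third argument replaced by value."""
    # Group 1 holds everything up to the third argument; rebuild the tail by hand.
    return pattern.sub(lambda m: f"{m.group(1)}{value})", call)

print(pattern.sub(r"\g<1>100)", "function(true, 0, 15)"))   # function(true, 0, 100)
print(set_third_arg("function(false, 0, 23)", 100))         # function(false, 0, 100)
```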
This expression also might work:
(?:\d+)(\))
It has a non-capturing group around the digits, (?:\d+), followed by a captured closing parenthesis, (\)), so we can replace the match with the new number followed by group 1 (\1 in Python, $1 on regex101).
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:\d+)(\))"
test_str = "function(true, 0, 15)"
subst = "100\\1"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Alternatively, you can capture everything before and after the number, then rebuild the string around the new value:
regex = re.compile(r'(function\([a-zA-Z]*, [0-9]*, )([0-9]*)(\))')
res = re.sub(regex, r'\g<1>100\3', string)
Basically, I placed parentheses around the text before and after the number. The replacement is then \g<1> (group 1), 100 (the new text), and \3 (group 3). Writing \g<1> rather than \1 keeps the group reference from running into the digits of 100, and avoids the extra space that '\1 100\3' would insert.
The reason why I propose this particular expression is in case OP specifically needs to only match strings that also contain the preceding "function(" section (or some other pattern). Plus, this is just an extension of OP's solution, so it may be more intuitive to OP.

How do I write a regex that either substitutes OR just adds new substring at the beginning of a string?

I have a string that can either look like "string" (first case) or [word]string[word] (second case).
My goal is to change it to be [new_word]string[new_word].
If I use my_string = re.sub(r'\[[^\]]*\]', '[new_word]', my_string), it only works for the second case.
Can I modify the regex to work for both cases or should I use if statement instead?
You can use a regex alternation (|) to achieve this:
my_string = re.sub(r'(?:\[[^\]]*\]|")', '[new_word]', my_string)
Explanation:
(?:             # Beginning of alternating group
  \[[^\]]*\]    # Matches [word]
  |             # OR
  "             # Matches literal double quote
)
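A quick check of the alternation against both input shapes might look like this. The names DELIM and retag are made up for the example, and a function replacement is used so that the new word is inserted literally.

```python
import re

# Either a bracketed word or a literal double quote
DELIM = re.compile(r'(?:\[[^\]]*\]|")')

def retag(s, new_word="new_word"):
    """Replace each [word] or double quote in s with [new_word]."""
    return DELIM.sub(lambda m: f"[{new_word}]", s)

print(retag('"string"'))             # [new_word]string[new_word]
print(retag('[word]string[word]'))   # [new_word]string[new_word]
```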

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back before every 2 characters.
replaced = mystr.replace(r"\x", "")
# Use // so that range() receives an integer on Python 3
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced) // 2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
#for elem in match: # match gives you a list of strings with the hex values
# print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"  # Python's re uses \1 for group 1; $1 would be inserted literally
result = re.sub(regex, subst, test_str, count=0, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring neither of the following matches.
  (?<=\\x) Positive lookbehind ensuring what precedes is \x.
  (?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.
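To see that the approaches in this thread agree, here is a small comparison on a short made-up input (the sample string and variable names are assumptions for illustration). Note that Python's re module writes group 1 as \1 in the replacement, and that all three variants assume the unprefixed runs contain an even number of hex digits.

```python
import re

sample = r"3033\x90\x0A6F"            # short made-up input
expected = r"\x30\x33\x90\x0A\x6F"

# 1. Optional-prefix regex: consume an existing \x if present, then re-emit it.
by_prefix = re.sub(r"(?:\\x)?([0-9A-Fa-f]{2})", r"\\x\1", sample)

# 2. Lookaround regex: only touch pairs that are not already preceded by \x.
by_lookaround = re.sub(
    r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})", r"\\x\1", sample, flags=re.IGNORECASE
)

# 3. No regex: strip every \x, then put one back in front of each pair.
stripped = sample.replace(r"\x", "")
by_slicing = "".join(r"\x" + stripped[i:i + 2] for i in range(0, len(stripped), 2))

print(by_prefix == by_lookaround == by_slicing == expected)  # True
```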

Regular expression to replace with XML node

I'm using Python to write a regular expression for replacing parts of a string with an XML node.
The source string looks like:
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
And the result string should be like:
Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>
Can anyone help me?
What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.
Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"  # raw string avoids invalid-escape warnings
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
def mksub(m):
    return '<replace name="%s">%s</replace>' % m.groups()
s_output = re.sub(pat, mksub, s_input)
The only tricky part is the regular expression pattern. Let's look at it in detail.
^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.
\s* matches optional whitespace.
REPLACE matches the literal string "REPLACE".
\( matches the literal string "(".
( begins a "match group".
[^)] means "match any character except )".
+ means "match one or more of the preceding pattern".
) closes a "match group".
\) matches the literal string ")"
(.*) is another match group containing ".*".
$ matches the end of a string. With re.MULTILINE, it also matches at the end of each line within a multiline string; in other words, it matches just before each newline character, without consuming it.
. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.
So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.
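The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is a sketch of the same pattern, not a different technique; the replacement template with a space before the closing tag is chosen here to reproduce the output requested in the question.

```python
import re

# Same pattern as above, spelled out piece by piece with re.VERBOSE
pat = re.compile(
    r"""
    ^\s*        # start of a line (re.MULTILINE), then optional whitespace
    REPLACE     # the literal text REPLACE
    \(          # a literal opening parenthesis
    ([^)]+)     # group 1: one or more characters that are not ')'
    \)          # a literal closing parenthesis
    (.*)        # group 2: the rest of the line
    $           # end of the line (re.MULTILINE)
    """,
    re.MULTILINE | re.VERBOSE,
)

s_input = "Hello\nREPLACE(str1) this is to replace\nREPLACE(str2) this is to replace"
m = pat.search(s_input)
print(m.groups())   # ('str1', ' this is to replace')
s_output = pat.sub(r'<replace name="\1">\2 </replace>', s_input)
print(s_output)
```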
EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds to the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1". But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_repl = r'<replace name="\1">\2</replace>'
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
s_output = re.sub(pat, s_repl, s_input)
Here is an excellent tutorial on how to write regular expressions in Python.
Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.
src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""
from pyparsing import Suppress, Word, alphas, alphanums, restOfLine
LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
    lambda toks: '<replace name="%(name)s">%(body)s </replace>' % toks
)
print(replExpr.transformString(src))
In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.
In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:
s_pat = r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
Unfortunately, to access these groups by name you have to invoke the group() method on the re match object; you can't do the named string interpolation directly on the match, as in my lambda parse action. But this is Python, right? We can wrap that callable in a class that gives us dict-like access to the groups by name:
class CallableDict(object):
    def __init__(self, fn):
        self.fn = fn
    def __getitem__(self, name):
        return self.fn(name)
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % CallableDict(m.group)
s_output = re.sub(pat, mksub, s_input)
Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.
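For comparison, the standard library already offers two ways to use named groups without a wrapper class: match.groupdict() returns the named groups as a plain dict, and the replacement template accepts \g<name>. The sketch below uses shortened sample input; the variable names via_function and via_template are made up for the example.

```python
import re

pat = re.compile(r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$", re.MULTILINE)
s_input = "Hello\nREPLACE(str1) this is to replace\nREPLACE(str2) this is to replace"

def mksub(m):
    # groupdict() gives {'name': ..., 'body': ...}, which %-formatting accepts directly
    return '<replace name="%(name)s">%(body)s</replace>' % m.groupdict()

via_function = pat.sub(mksub, s_input)
# The same result using named references in the replacement template
via_template = pat.sub(r'<replace name="\g<name>">\g<body></replace>', s_input)

print(via_function)
print(via_function == via_template)   # True
```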
Maybe like this?
import re
mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')
for line in mystr.split("\n"):
    # Emit a well-formed element, including the closing tag
    print(prog.sub(r'<replace name="\1"> \2 </replace>', line))
Something like this should work:
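import re, sys
# Read the file named on the command line and rewrite each REPLACE(...) line
with open(sys.argv[1], 'r') as f:
    for line in f:
        g = re.match(r'REPLACE\((.*)\)(.*)', line)
        if g is None:
            print(line, end='')  # the line already ends with a newline
        else:
            print('<replace name="%s">%s</replace>' % (g.group(1), g.group(2)))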
import re
a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)
b = re.sub(regex, r'<replace name="\1"> \2 </replace>', a)
print(b)
This does the replacement in a single re.sub() call.
