.replace example in Python - python

Going through some introductory classes in python and came across a manipulation as follows:
energy['Country'] = energy['Country'].str.replace(r" \(.*\)","")
Can someone explain the first part of the replace? Not quite sure how to interpret all the special characters. Thanks.

From the Python docs:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
Your example appears to be trying to replace a regex pattern, but as far as I'm aware, str.replace() doesn't support regex.
You might be looking for re.sub().
Here's an example.
import re
energy['Country'] = re.sub(r' \(.*\)','',energy['Country'])
This code will delete anything between parentheses in energy['Country']. The regex matches a space, \( matches a left paren, . matches any non-line-break character, * allows an unlimited number of those, and \) matches a right paren. This regex searches for any text between parentheses that follows a space. The replacement argument in re.sub() in this case is an empty string, so the string Hello (World) will get replaced with Hello. Note the space, parentheses, and all text contained inside parentheses that's not a line break get matched and replaced.
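As a quick check of that walk-through, here is a minimal sketch that runs re.sub() on a plain Python string. The sample text is the Hello (World) example from the paragraph above; the variable names are only illustrative.

```python
import re

text = "Hello (World)"

# r" \(.*\)" reads as: a space, a literal "(", any run of
# non-newline characters, then a literal ")".
cleaned = re.sub(r" \(.*\)", "", text)

print(repr(cleaned))  # 'Hello'
```

Note that re.sub() expects a single string as its third argument, so this form suits individual values rather than a whole column.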
RegExr is a handy online tool to test and explain regular expressions.
Regular-Expressions.info provides comprehensive explanations of how regex works.
Edit: Christian pointed out that the code appears to be using pandas.Series.str.replace()
In this case, regex is indeed supported directly. The question's code simply replaces all occurrences of the regex pattern (which matches text enclosed within parentheses that follow a space, the parentheses themselves, and their preceding space) with an empty string and overwrites the existing series energy['Country'] with the parenthesis-stripped version.
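To illustrate what that substitution does to each element, here is a rough standard-library sketch that applies the same pattern to a plain list. The country names are made-up sample data, and the list comprehension only stands in for the Series; it is not how pandas is implemented.

```python
import re

pattern = re.compile(r" \(.*\)")

# Hypothetical sample values standing in for energy['Country']
countries = ["Bolivia (Plurinational State of)", "Japan", "Iran (Islamic Republic of)"]

# Apply the substitution to every element, one string at a time
stripped = [pattern.sub("", name) for name in countries]
print(stripped)  # ['Bolivia', 'Japan', 'Iran']

# .* is greedy, so the match runs to the LAST ")" in the string
greedy = pattern.sub("", "A (b) c (d)")
print(repr(greedy))  # 'A'
```

The greedy behaviour is worth keeping in mind if a value can contain more than one parenthesised part.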

Related

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow up to this SO post which gives a solution to replace text in a string column
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0
Add the regex=False parameter because, as the docs show, regex defaults to True (in pandas versions before 2.0; from 2.0 on the default is False):
regex : bool, default True
Determines if the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? and . are special characters in regular expressions.
So, one way to do it without regex will be this double replacing:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time
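The two surprising results from the question can be reproduced with the standard re module alone, which is a reasonable stand-in for what happens when the pattern is treated as a regular expression. This is a sketch on one of the question's strings, not pandas code.

```python
import re

s = "this is a.. test stence"

# Unescaped, "." matches any character, so ".." eats the text
# two characters at a time and leaves only the odd one out.
all_dots = re.sub("..", ".", s)
print(all_dots)  # ...........e

# "?" is a quantifier; at the start of a pattern it has nothing to repeat.
try:
    re.compile("?.")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# re.escape() builds a pattern that matches the literal text exactly.
literal = re.sub(re.escape(".."), ".", s)
print(literal)  # this is a. test stence
```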
Replace using a regular expression. In this case, replace any literal '.' that is immediately followed by whitespace. This is a bit convoluted; I advise you to go with @Mark Reed's answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time
str.replace() works with a Regex where . is a special character which denotes "any" character. If you want a literal dot, you need to escape it: "\.". Same for other special Regex characters like ?.
First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDf.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
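Here is a small sketch of the same pattern and replacement using re.sub() on the question's two strings, plus a check of what happens to '\1' without the r prefix. It uses plain strings instead of a DataFrame, which is an assumption made to keep the example self-contained.

```python
import re

samples = ["this is a.. test stence", "for which is ?. was a time"]

# ([.?]) remembers the period or question mark; \. consumes the extra period.
fixed = [re.sub(r"([.?])\.", r"\1", s) for s in samples]
print(fixed)  # ['this is a. test stence', 'for which is ? was a time']

# Without the r prefix, Python itself turns '\1' into the control character U+0001.
print("\1" == "\x01")  # True
print(len(r"\1"))      # 2 (a backslash and the digit 1)
```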
To handle both the ? and the . cases in one call, you can separate the alternatives with | (the regex OR operator). Note that this replaces either match with a single period, so '?.' becomes '.' rather than '?':
testDf['strings'].str.replace(r'\?\.|\.\.', '.', regex=True)
Prefix each . with a \, because . is a special regex character and has to be escaped:
testDf['strings'].str.replace(r'\.\.', '.', regex=True)
You can do the same with the ?, which is another special regex character:
testDf['strings'].str.replace(r'\?\.', '?', regex=True)

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should be 5. In order to remove these leading zeros in front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In Python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit referring to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
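A short sketch to check the claim about "\1": in this pattern, group 1 only wraps zero-width lookbehinds, so it always captures an empty string and both replacements give the same result. The input strings are illustrative, and eval is deliberately left out.

```python
import re

pattern = r"((?<=^)|(?<=[^\.\d]))0+(\d+)"

# Group 1 is always empty here, so r"\1\2" and r"\2" behave the same.
with_both = re.sub(pattern, r"\1\2", "02+03")
only_second = re.sub(pattern, r"\2", "02+03")
print(with_both, only_second)  # 2+3 2+3

# Zeros that follow a digit or a decimal point are left alone.
untouched = re.sub(pattern, r"\2", "100+0.05")
print(untouched)  # 100+0.05
```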
Firstly, repl is what each match gets replaced with.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here, your regex splits every match it finds into numbered groups (1, 2, and so on) because of the parentheses () you have placed in the regex.
In Python, \1 and \2 (or \g<1> and \g<2>) refer to those groups; some other regex flavours write them as $1 and $2.
In this case, the regex replaces each match (the leading zeros plus the digits after them) with just the digits that follow the zeros, which are captured by group 2.
Note: \1 is not necessary; it works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub(r'\d', ' ', s)
'awd   frr cr  '
>>>
Explanation:
'\d' matches any single digit, so each digit in s is replaced with repl (in this case ' ').

Regex - Why won't this regex work in Python?

I have this expression
:([^"]*) \(([^"]*)\)
and this text
:chkpf_uid ("{4astr-hn389-918ks}")
:"#cert" ("false")
I'm trying to match it so that on the first line I'll get these groups:
chkpf_uid
{4astr-hn389-918ks}
and on the second, I'll get these:
#cert
false
I want to avoid getting the quotes.
I can't seem to understand why the expression I use won't match these, especially if I switch the [^"]* to a (.*).
with ([^"]*): won't match
with (.*): does match, but with quotes
This is using the re module in python 2.7
Sidenote: your input may require a specific parser to handle, especially if it may have escape sequences.
Answering the question itself, remember that a regex is processed sequentially from left to right, and so is the string. A match is returned if the pattern matches a portion of the string or the whole string (depending on the method used).
If there are quotation marks in the string and your pattern does not allow for them, the match fails and nothing is returned.
A possible solution is to add the quotes as optional subpatterns:
:"?([^"]*)"? \("?([^"]*)"?\)
 ^^       ^^   ^^       ^^
See the regex demo
The parts you need are captured into groups, and the quotes, present or not, are just matched, left out of your re.findall reach.
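As a sketch of how the pattern with optional quotes behaves, the following runs re.findall() on each of the two lines from the question separately. Processing line by line is a choice made here, since [^"]* can also match newlines.

```python
import re

pattern = re.compile(r':"?([^"]*)"? \("?([^"]*)"?\)')

lines = [
    ':chkpf_uid ("{4astr-hn389-918ks}")',
    ':"#cert" ("false")',
]

# The quotes are matched but sit outside the groups, so findall omits them.
results = [pattern.findall(line) for line in lines]
for found in results:
    print(found)
# [('chkpf_uid', '{4astr-hn389-918ks}')]
# [('#cert', 'false')]
```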

Python regex reference conflicting with substitution number

I'm taking a beginning Python course, and am having problems trying to do a regex substitution.
The question states: Write a substitution command that will change names like file1, file2, etc. to file01, file02, etc. but will not add a zero to names like file10 or file20.
Here's my solution:
re.sub(r'(\D+)(\d)$',r'\10\2','file1')
As you can see, the 0 is messing with my \1 reference. Can anyone help me with an easy solution? Thanks!
import re
print(re.sub(r'(\D+)(\d)$', r'\g<1>0\2', 'file1'))
Don't ask.. just do the \g<#> thing and it'll work fine in python. Other languages have the same issue:
http://resbook.wordpress.com/2011/01/04/regex-with-back-references-followed-by-number/
I don't know Python, but in your regex you just want one digit, not two.
For the match you can do it like this:
.+[^\d]\d$
test1 will match
test10 will not match
Good luck
@sdanzig has the correct answer, but if you insist on asking, it is actually a documented feature:
http://docs.python.org/2/library/re.html
Read the last paragraph for re.sub().
In string-type repl arguments, in addition to the character escapes
and backreferences described above, \g<name> will use the substring
matched by the group named name, as defined by the (?P<name>...)
syntax. \g<number> uses the corresponding group number; \g<2> is
therefore equivalent to \2, but isn’t ambiguous in a replacement such
as \g<2>0. \20 would be interpreted as a reference to group 20, not a
reference to group 2 followed by the literal character '0'. The
backreference \g<0> substitutes in the entire substring matched by the
RE.
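To make the difference concrete, here is a small sketch that applies the \g<1> form to a few file names and shows that the ambiguous \10 form is rejected. The list of names is sample data based on the exercise text.

```python
import re

pattern = r"(\D+)(\d)$"
names = ["file1", "file2", "file10", "file20"]

# \g<1>0\2 means: group 1, a literal "0", then group 2.
renamed = [re.sub(pattern, r"\g<1>0\2", name) for name in names]
print(renamed)  # ['file01', 'file02', 'file10', 'file20']

# r"\10\2" is read as group 10 followed by group 2; this pattern has no group 10.
try:
    re.sub(pattern, r"\10\2", "file1")
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False
print(ambiguous_ok)  # False
```

Names ending in two digits are untouched because the pattern requires a non-digit immediately before the final digit.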

Python regex - Ignore parenthesis as indexing?

I've currently written a nooby regex pattern which involves excessive use of the "(" and ")" characters, but I'm using them for 'or' operators, such as (A|B|C) meaning A or B or C.
I need to find every match of the pattern in a string.
Trying to use the re.findall(pattern, text) method is no good, since it interprets the parenthesis characters as indexing signifiers (or whatever the correct jargon be), and so each element of the produced List is not a string showing the matched text sections, but instead is a tuple (which contain very ugly snippets of pattern match).
Is there an argument I can pass to findall to ignore parentheses as indexing?
Or will I have to use a very ugly combination of re.search and re.sub?
(This is the only solution I can think of; Find the index of the re.search, add the matched section of text to the List then remove it from the original string {by using ugly index tricks}, continuing this until there's no more matches. Obviously, this is horrible and undesirable).
Thanks!
Yes, add ?: to a group to make it non-capturing.
import re
print(re.findall('(.(foo))', "Xfoo")) # [('Xfoo', 'foo')]
print(re.findall('(.(?:foo))', "Xfoo")) # ['Xfoo']
See re syntax for more information.
re.findall(r"(?:A|B|C)D", "BDE")
or
re.findall(r"((?:A|B|C)D)", "BDE")
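To see what those two calls return, and how they differ from a plain capturing group, here is a short sketch on the same input string:

```python
import re

text = "BDE"

# A capturing group makes findall return only the group's text.
capturing = re.findall(r"(A|B|C)D", text)
print(capturing)  # ['B']

# A non-capturing group is used for alternation only, so the whole match is returned.
non_capturing = re.findall(r"(?:A|B|C)D", text)
print(non_capturing)  # ['BD']

# Wrapping the whole pattern in one capturing group gives the same result.
wrapped = re.findall(r"((?:A|B|C)D)", text)
print(wrapped)  # ['BD']
```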
