How to replace text within parentheses in pandas - python

I have some data with parenthetical text I want to remove. I know the following piece of code works to remove it; I just want to understand what exactly it's doing. What does the r do? How about the \? I know the .* stands for any number of characters between the parentheses.
df['name'].str.replace(r"\(.*\)","")

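I've never used pandas before, but the r before the string literal marks it as a raw string: Python leaves the backslashes in it alone instead of treating them as escape sequences, so they reach the regex engine intact. The r does not by itself turn the string into a regex; whether str.replace treats the pattern as a regex or as a literal is controlled by its regex argument.
As for the RegEx, here it is broken down: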
\(.*\)
\( Escapes the open parenthesis character to match it literally
.* Matches any character except for line breaks zero or more times
\) Escapes the close parenthesis character to match it literally
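To make the breakdown concrete, here is a minimal sketch using the standard re module on a few made-up names (the sample data is hypothetical). As far as I know the same pattern behaves the same way in pandas when passed as df['name'].str.replace(pattern, "", regex=True). It also shows that the greedy .* runs from the first ( to the last ).

```python
import re

# The raw string and the doubled-backslash string are the same pattern.
assert r"\(.*\)" == "\\(.*\\)"

names = ["Alice (admin)", "Bob", "Carol (a) and (b)"]  # hypothetical sample data

# Greedy .* spans from the first "(" to the last ")".
greedy = [re.sub(r"\(.*\)", "", n) for n in names]
# Non-greedy .*? stops at the nearest ")".
lazy = [re.sub(r"\(.*?\)", "", n) for n in names]

print(greedy)
print(lazy)
```

If a value can contain more than one parenthetical, the non-greedy form is probably what you want.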

Related

Python regex: Line can't start with certain words, can only contain certain characters

I am reading in lines from a file, and I want to remove lines that only contain letters, colon, parentheses, underscores, spaces and backslashes. This regex was working fine to find those lines...
[^A-Za-z0-9:()_\s\\]
...as passed to re.search() as a raw string.
Now, I need to add to it that the lines cannot start with THEN or ELSE; otherwise they should not match and thus be exempted from being removed.
I tried just taking the ^ out of the brackets and adding a negative lookahead before the bracketed expression, like so...
r'^(?!(ELSE|THEN))[A-Za-z0-9:()_\s\\]'
...but now it just matches every line. What am I missing?
^(?:(?:.*[^A-Za-z0-9:()_\s\\])|(?:THEN|ELSE)).*$
Broken down
^(?: ).*$ # Starts with
(?: )|(?: ) # Either
.*[^A-Za-z0-9:()_\s\\] # Anything that contains a character outside the allowed set
THEN|ELSE # THEN/ELSE
See the example on regex101.com
Just use an alternation:
^(?:THEN|ELSE|[A-Za-z0-9:()_\s\\]*$)
and remove the lines that don't match the pattern.
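For what it's worth, the attempt in the question matches every line because the bracketed class is only applied to the first character: there is no quantifier and no end anchor. Here is a rough sketch of one reading of the requirement (drop lines made up only of the allowed characters, unless they start with THEN or ELSE). The sample lines are made up for illustration.

```python
import re

# Lines to drop: only allowed characters AND not starting with THEN/ELSE.
removable = re.compile(r'^(?!THEN|ELSE)[A-Za-z0-9:()_\s\\]*$')

lines = [  # hypothetical sample input
    "foo_bar: (baz)",
    "THEN foo_bar",
    "ELSE x",
    "x = y + 1",
]

kept = [ln for ln in lines if not removable.search(ln)]
print(kept)
```

Only the first sample line is dropped: it contains nothing but allowed characters and does not start with THEN or ELSE.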

Python regex needed for format: 'delete([any text here])'

I am a total regex beginner. I want to create a regular expression that strictly allows the word delete followed by a pair of parentheses that contain any kind of characters (http://www.waynesworld1.com).
If I put it all together, it should accept the following: delete(http://www.waynesworld123.com).
Let me emphasize that the regex should strictly accept delete() and shouldn't accept elete(). As long as the user types in delete(), anything is acceptable within the parentheses (example: this would be fine: delete(12!#Ww)).
How can I craft this regex in Python? So far all I have is /delete/ for my regex.
Here you go:
^delete\(.*\)$
^ assert position at start of the string
delete matches the characters delete literally (case sensitive)
\( matches the character ( literally
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\) matches the character ) literally
$ assert position at end of the string
Here is some Python test code:
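import re

# Sample inputs: the first two should pass, the rest should fail.
txt = ["delete(http://www.waynesworld123.com)",
       "delete(12!#Ww)",
       "elete(test)",
       "delete[test]",
       "test"]
# Raw string so the backslashes reach the regex engine unchanged.
pattern = re.compile(r'^delete\(.*\)$', re.DOTALL)
for line in txt:
    if pattern.search(line):
        print('PASS', line)
    else:
        print('FAIL', line)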
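One caveat with ^...$ in Python: $ also matches just before a trailing newline, so "delete(x)\n" would pass the test above. If that matters, re.fullmatch is one way around it. A small sketch (is_delete_command is a hypothetical helper name, not something from the question):

```python
import re

def is_delete_command(text):
    # Hypothetical helper: True only if the whole string is delete(<anything>).
    return re.fullmatch(r'delete\(.*\)', text, re.DOTALL) is not None

for sample in ["delete(12!#Ww)", "elete(test)", "delete(x)\n"]:
    print(repr(sample), is_delete_command(sample))
```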

How does the regex "\" character and grouping "()" character work together?

I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () have a \ before it. Does this mean that the statement must have a ( and )? Would that mean the statements without ( or ) be unmatched?
Statements:
'404-678-2347'
'(123)-1247890'
'456-900-900'
'(678)-2001236'
'404123-1234'
'(404123-123'
Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))456-----7890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
Normally, the parentheses would act as grouping characters; however, regex metacharacters are reduced to plain literal characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html
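To check this against the statements in the question, here is a quick sketch. It assumes the separators in the statements are plain ASCII hyphens and uses re.match, which anchors at the start of the string. It runs both the original pattern and the simplified one suggested above.

```python
import re

original = re.compile(r'\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+')
simplified = re.compile(r'[(]?\d{3}[)]?-?\d{6,}')

# Assumption: the separators in the question's statements are ASCII hyphens.
statements = [
    '404-678-2347',
    '(123)-1247890',
    '456-900-900',
    '(678)-2001236',
    '404123-1234',
    '(404123-123',
]

results = {s: bool(original.match(s)) for s in statements}
for s, ok in results.items():
    print(s, ok, bool(simplified.match(s)))
```

Under that assumption only '(123)-1247890' and '(678)-2001236' match. The others fail because the pattern needs six or more consecutive digits after the optional hyphen, not because of the parentheses.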

What is the use of the following statement in python regular expression?

I am new to Python and I need to work on an existing Python script. Can someone explain the meaning of the following statement?
pgre = re.compile("([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]")
You need to consult the references for the exact meanings of each part of that regular expression, but the basic purpose of it is to parse the GC logging. Each parenthesized part of the expression () is a group that matches a useful part of the GC line.
For example, the start of the regex ([^T]+)T matches everything up to the first "T", and the grouped part returns the text before the "T", i.e. the date "2013-08-28"
The content of the group, [^T]+ means "at least one character that is not a T"
Patterns in square brackets [] are character classes - consult the references in the comments above for details. Note that your input text contains literal square brackets, so the pattern handles those with the \[ escape sequence - see below.
I think you can simplify ([^T]+)T to just (.+?)T, incidentally (non-greedy, so that it still stops at the first T).
Other useful sub-patterns:
\s matches whitespace
\d matches numeric digits
\. \( and \[ match literal periods, parentheses, and square brackets, respectively, rather than interpreting them as special regex characters
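To see the groups in action, here is a small sketch that runs the same pattern (written as raw strings) against a made-up line. The sample line is hypothetical: I constructed it to fit the pattern, and real GC logs may look different. The variable names for the groups are my own guesses at what each one holds.

```python
import re

pgre = re.compile(
    r"([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+"
    r"(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]"
)

# Hypothetical sample line, made up to fit the pattern; real logs may differ.
line = ("2013-08-28T10:15:30.123+0000: 12.345: "
        "[GC [PSYoungGen: 1024K->128K(2048K)] 4096K->3200K(8192K), 0.0123456 secs]")

m = pgre.match(line)
if m:
    # Names below are guesses at the meaning of each captured group.
    date, time_of_day, uptime, before, after, total, pause = m.groups()
    print(date, time_of_day, uptime, before, after, total, pause)
```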

Regular Expression that Includes a Character Only If Another Character Precedes It

I'm new to Stack so not sure if I'm asking this right.
I'm trying to form a regular expression to match all characters except 3 specific ones (%, &, and $), but I want to ignore that exception if a backslash (\) precedes any of those characters. For example, if I have the string
abcd\$&
I would want the regular expression to match
abcd\$
because a backslash precedes the dollar sign, but not match the & because no backslash precedes it.
So far I have:
^[^%$&]+
which matches any string that doesn't have the characters (%, $, or &), but it stops at the backslash rather than include the backslash and the next character.
Thanks in advance!
^([^%$&\\]|\\.)+$
should work.
It excludes \ from the character class and then allows \ followed by any character.
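One thing to watch: with the trailing $, the pattern only accepts strings that are valid all the way to the end, so the example from the question is rejected outright. Dropping the $ gives the prefix the question asks for. A quick sketch in Python (the question doesn't say which regex flavor is in use, so this is an assumption):

```python
import re

text = r'abcd\$&'

# Without the trailing $ anchor, the match stops before the unescaped &.
prefix = re.match(r'^([^%$&\\]|\\.)+', text)
print(prefix.group(0))

# With the trailing $, the whole string must qualify, so this input is rejected.
whole = re.match(r'^([^%$&\\]|\\.)+$', text)
print(whole)
```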
