Python regex to pick either regex A or regex B - python

I am trying to create a regex statement that will choose one regex or the other, for example:
string = '123 Test String'
pattern = r'( ?)([T](?P<name1>\w+))|([A](?P<name2>\w+))'
m = re.search(pattern, string)
Basically, I want the regex to pick one regex or the other.

By using the alternation (or) operator "|" in your pattern, you are effectively comparing your string against two regular expressions. If your string matches the expression on either side of the "|", then re.search will return a match object.
See the Python documentation:
Alternation, or the “or” operator. If A and B are regular expressions, A|B will match any string that matches either A or B. | has very low precedence in order to make it work reasonably when you’re alternating multi-character strings. Crow|Servo will match either Crow or Servo, not Cro, a 'w' or an 'S', and ervo.
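As a rough sketch of how to tell which side of the "|" matched: give each alternative its own named group and inspect the match object. The pattern below is a simplified variant of the one in the question (an assumption about the intent), and the helper name which_branch and the extra sample strings are made up for illustration. Note that because | has such low precedence, the leading ( ?) in the original pattern belongs only to the left alternative.

```python
import re

# Simplified variant of the question's pattern (an assumption about the intent):
# each alternative carries its own named group.
pattern = re.compile(r'T(?P<name1>\w+)|A(?P<name2>\w+)')

def which_branch(text):
    """Return (group_name, captured_text) for the first match, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    # m.lastgroup is the name of the last capturing group that matched
    return m.lastgroup, m.group(m.lastgroup)

print(which_branch('123 Test String'))   # ('name1', 'est')
print(which_branch('123 Another one'))   # ('name2', 'nother')
print(which_branch('123 nothing'))       # None
```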

Related

Split according to regex condition

Here is another question of mine:
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
How can I capture everything after "Organization: ", whether it is a full stop, a digit, or anything else, using regex?
result_organization = re.search("(Organization: )(\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*)", string)
My code above is very long and not a sensible approach.
I would recommend using the find method like this:
print(string[string.find("Organization")+14:])
You don't need a regex for that; this simple code should give you the desired result:
s = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
if s.startswith("Organization: "):
    s = s[14:]  # len("Organization: ") == 14
print(s)
You could also use the pattern (?<=Organization: ).+
Explanation:
(?<=Organization: ) - positive lookbehind; asserts that what precedes the match is "Organization: "
.+ - matches one or more of any character except newline characters.
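A minimal sketch of the lookbehind pattern applied to the string from the question. The variable name organization is just for illustration.

```python
import re

string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"

# The lookbehind requires "Organization: " to sit right before the match,
# but does not include it in the matched text.
m = re.search(r'(?<=Organization: ).+', string)
organization = m.group(0) if m else None
print(organization)  # S.P. Dyer Computer Consulting, Cambridge MA
```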
You could use a single capturing group instead of 2 capturing groups.
Instead of spelling out every word with (\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*), you might match any character except a newline using the dot and repeat it so that it matches until the end of the line.
But note that this would also match strings like ##$$ ++
^Organization: (.+)
For example
import re
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
result_organization = re.search("Organization: (.*)", string)
print(result_organization.group(1))
If you want a somewhat more restrictive pattern you might use a character class and specify what you would allow to match. For example:
^Organization: ([\w.,]+(?: [\w.,]+)*)
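To illustrate the difference, here is a small sketch of the more restrictive pattern. The helper name organization_name is hypothetical; the second input is the "##$$ ++" case mentioned earlier, which the character class rejects.

```python
import re

# Words may contain word characters, dots and commas, separated by single spaces.
pattern = re.compile(r'^Organization: ([\w.,]+(?: [\w.,]+)*)')

def organization_name(line):
    """Return the captured organization text, or None if the line does not fit."""
    m = pattern.search(line)
    return m.group(1) if m else None

print(organization_name("Organization: S.P. Dyer Computer Consulting, Cambridge MA"))
# S.P. Dyer Computer Consulting, Cambridge MA
print(organization_name("Organization: ##$$ ++"))
# None
```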

Python - Regular Expression count "he" and "she"

I'm working on a regular expression that finds he or she surrounded by white space, that is, as standalone words and not he inside other words. It is searching through a book.
I have tried the '+' 'and'
def q9():
    pattern = r'\s(he)\s'
    return re.compile(pattern)
This returns 1371 matches when it should be 2000. (This part doesn't really apply to you unless you know the book.)
Use this:
re.compile(r'\bs?he\b', re.I)
re.I does case-insensitive matching, \b is a word boundary, and s?he means the s is optional while he must always be matched. An equivalent, more readable way to write this is r'\b(she|he)\b'.
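A small sketch of why the word-boundary version finds more matches than the whitespace version. The sample sentence is made up for illustration and is not from the book in the question.

```python
import re

sample = "He said she would help. The theme: he, she, and SHE."

# \b also matches at the start of the string and next to punctuation,
# so "He" at the start and "he," before a comma are found.
word_boundary = re.findall(r'\bs?he\b', sample, re.I)

# \s requires real whitespace on both sides, so those occurrences are missed.
whitespace_only = re.findall(r'\s(he)\s', sample)

print(word_boundary)    # ['He', 'she', 'he', 'she', 'SHE']
print(whitespace_only)  # []
```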

How do I write a regex that either substitutes OR just adds a new substring at the beginning of a string?

I have a string that can either look like "string" (first case) or [word]string[word] (second case).
My goal is to change it to be [new_word]string[new_word].
If I use my_string = re.sub(r'\[[^\]]*\]', '[new_word]', my_string) it only works for the second case.
Can I modify the regex to work for both cases, or should I use an if statement instead?
You can use a regex alternation (|) to achieve this:
my_string = re.sub(r'(?:\[[^\]]*\]|")', '[new_word]', my_string)
Explanation:
(?: # Beginning of alternating group
\[[^\]]*\] # Matches [word]
| # OR
" # Matches literal double quote
)
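A short sketch of the alternation applied to both input shapes from the question. The helper name retag is hypothetical, and it assumes the first case really contains literal double quotes around the string.

```python
import re

# Either a bracketed tag like [word], or a literal double quote.
pattern = re.compile(r'(?:\[[^\]]*\]|")')

def retag(text, new_tag='[new_word]'):
    """Replace every [..] tag or double quote with new_tag."""
    return pattern.sub(new_tag, text)

print(retag('"string"'))            # [new_word]string[new_word]
print(retag('[word]string[word]'))  # [new_word]string[new_word]
```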

python3: regex, find all substrings that start with and end with a certain string

Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that start with a digit and end with a letter.
I tried,
b = re.findall('\d.*\w',a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1253abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex, and would really appreciate it if anyone could show different ways to do this with regex and explain what's going on.
\w will match any word character, which covers digits, letters and the underscore. You need to use [a-zA-Z] to capture letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall(r'(\d+[A-Za-z]+)', a)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d will match digits. \d+ will match one or more consecutive digits. For example:
>>> re.findall(r'(\d+)', a)
['1253', '4567', '8910']
Similarly, [a-zA-Z]+ will match one or more letters.
>>> re.findall(r'([a-zA-Z]+)', a)
['abcd', 'efgh', 'ijkl']
Now put them together to match exactly what you want.
The Python manual on regular expressions tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
So you are actually capturing more than you need. Refine your regular expression a bit:
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I makes your expression case insensitive, so it will match upper and lower case letters as well:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
\w matches any alphanumeric character (and the underscore), and your pattern puts the greedy .* in front of it. So your code matches one single string that starts with a digit and runs through to the last word character, whatever lies in between.
Solution:
>>> b = re.findall(r'\d*[A-Za-z]*', a)
>>> b
['1253abcd', '4567efgh', '8910ijkl', '']
You will get '' (an empty string) at the end of the list because the pattern can also match zero characters at the end of the string. You can remove it using
b.pop(-1)
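For comparison, here is a quick sketch of the three patterns discussed in these answers side by side. The variable names are just for illustration. Using + instead of * requires at least one digit and one letter, so no empty string is produced and nothing needs to be popped.

```python
import re

a = '1253abcd4567efgh8910ijkl'

# Greedy .* swallows everything up to the last word character: one big match.
greedy = re.findall(r'\d.*\w', a)

# * allows zero repetitions, so an empty match appears at the end of the string.
star = re.findall(r'\d*[A-Za-z]*', a)

# + requires at least one digit followed by at least one letter.
plus = re.findall(r'\d+[A-Za-z]+', a)

print(greedy)  # ['1253abcd4567efgh8910ijkl']
print(star)    # ['1253abcd', '4567efgh', '8910ijkl', '']
print(plus)    # ['1253abcd', '4567efgh', '8910ijkl']
```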

How to remove substrings marked with special characters from a string?

I have a string in Python:
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(Tt)
'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'
You can see that some words are repeated inside the <\" \"> parts.
My question is, how to delete those repeated parts (delimited with the named characters)?
The result should be like:
'This is a string, It should be changed to a nummber.'
Use regular expressions:
import re
Tt = re.sub('<\".*?\">', '', Tt)
Note the ? after *. It makes the expression non-greedy,
so it tries to match as few characters as possible between <\" and \">.
James's solution will work only when each delimiter consists of a single character (< and >). In that case it is possible to use a negated character class like [^>]. If you want to remove a substring delimited by character sequences (e.g. by begin and end), you should use non-greedy regular expressions (i.e. .*?).
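A rough sketch of the multi-character delimiter case, using the words begin and end as hypothetical markers and a made-up sample text. It shows how the greedy version removes too much.

```python
import re

# "begin" and "end" are hypothetical multi-character delimiters.
text = "keep begin drop this end keep too begin and this end done"

# Non-greedy: each match stops at the nearest following "end".
non_greedy = re.sub(r'begin.*?end', '', text)

# Greedy: one match runs from the first "begin" to the last "end".
greedy = re.sub(r'begin.*end', '', text)

print(repr(non_greedy))  # 'keep  keep too  done'
print(repr(greedy))      # 'keep  done'
```

The leftover double spaces come from the spaces on either side of each removed span; they can be collapsed in a separate step if needed.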
I'd use a quick regular expression:
import re
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(re.sub("<[^<]+>","",Tt))
#Out: This is a string, It should be changed to a nummber.
Ah, similar to Igor's post; he beat me by a bit. Rather than making the expression non-greedy, I don't match an expression if it contains another start tag "<" in it, so it will only match a start tag that's followed by an end tag ">".
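As a quick check, this sketch runs both approaches from the answers on the string from the question and confirms they give the same result here. The variable names negated and non_greedy are just for illustration.

```python
import re

Tt = 'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'

# Negated character class: a tag may not contain another "<".
negated = re.sub(r'<[^<]+>', '', Tt)

# Non-greedy quantifier: stop at the nearest closing delimiter.
non_greedy = re.sub(r'<".*?">', '', Tt)

print(negated)
print(negated == non_greedy)  # True
```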
