Combining two regular expressions with different grouping requirements [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I have two different repeated character substitution rules I'd like to combine into one regex.
I can do this in Python 3.x by applying them one after the other:
import re
s = r'http://www.google.com/search=ooo-eeee-aa-ii-uuuu'
aiu=re.compile(r'(([aiu])\2{1,})')
eo=re.compile(r'(([eo])\2{2,})')
eo.sub(r'\2',aiu.sub(r'\2',s))
This operation will be applied millions of times. If it would give a major performance gain, is there a single regular expression that achieves what these two achieve together, without nesting calls as I did above?

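You can combine the two substitutions with an alternation pattern. The replacement string can contain both \1 and \2, since the group that did not take part in the match is empty and does not affect the output. This relies on Python 3.5 or later; earlier versions raise an error when the replacement refers to an unmatched group.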
aeiou = re.compile(r'([aiu])\1{1,}|([eo])\2{2,}')
aeiou.sub(r'\1\2', s)
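As a quick check, the sketch below runs the nested and the combined substitutions on the sample string from the question and compares the results. The names nested, combined and combined_fn are only illustrative, and the redundant outer groups of the original patterns are dropped, so the backreference is \1 rather than \2. The function-based replacement is an assumption on my part about what you might prefer if relying on empty unmatched groups feels fragile; it is not part of the answer above.

```python
import re

s = r'http://www.google.com/search=ooo-eeee-aa-ii-uuuu'

# The two original rules, applied one after the other.
aiu = re.compile(r'([aiu])\1{1,}')  # a, i or u repeated two or more times
eo = re.compile(r'([eo])\1{2,}')    # e or o repeated three or more times
nested = eo.sub(r'\1', aiu.sub(r'\1', s))

# The combined rule: only one of the two groups takes part in each match,
# so the other one contributes an empty string (Python 3.5+).
aeiou = re.compile(r'([aiu])\1{1,}|([eo])\2{2,}')
combined = aeiou.sub(r'\1\2', s)

# Alternative: pick whichever group matched in a replacement function.
combined_fn = aeiou.sub(lambda m: m.group(1) or m.group(2), s)

print(nested)
print(combined)
print(combined_fn)
```

All three results are the same for this input. Whether the single pattern is actually faster depends on the data, so it is worth timing both versions (for example with timeit) on a representative sample before switching.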

Related

Guidance on basic python assignment [closed]

I need to write Python code that produces a list of tuples (searched word, list of occurrences).
The searched words are listed in a thesaurus and need to be searched for in a series of documents in a corpus.
Any suggestions or guidance?
After you read the file, you could simply call split() to get a list of words. This, however, leaves punctuation attached to the words. To remove it, take the punctuation characters from the string module's punctuation attribute and replace each occurrence of them in the word list with the empty string, "". Your words might also contain special symbols, such as "/" to represent "or"; in that case you would need regular expressions to extract the words.
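A minimal sketch of that approach follows. The function name find_occurrences, the choice to lowercase everything, and the use of word positions as the "occurrences" are all assumptions, since the question does not say what an occurrence should look like. It uses str.translate to strip punctuation in one pass instead of repeated replace calls.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)


def find_occurrences(text, search_words):
    """Return [(word, [positions])] for each searched word.

    Positions are indexes into the list of whitespace-separated words,
    after punctuation has been removed and case has been folded.
    """
    words = [w.translate(_STRIP_PUNCTUATION).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return [
        (term, [i for i, w in enumerate(words) if w == term.lower()])
        for term in search_words
    ]


document = 'The cat sat. The dog, the cat!'
print(find_occurrences(document, ['cat', 'the']))
```

For a corpus, you would call this once per document and keep the document name alongside each result.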

When to use Groups in Regular Expressions? [closed]

I'm a newbie learning about regular expressions, and I'm still unclear as to why we use groups. I used them in the regular expression below:
(http:)\//(\w)+\.(\w)+\.(\w)+
This will extract URLs, as in the sentence below:
This is http://www.google.com, this is http://www.yahoo.com.
I did use groups, but I was very unsure as to why. I saw this explanation online, but I'm confused about what it means:
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternation to part of the regex.
So any simplified clarification of groups would be great.
When I use groups it is usually because I need to replace some, but not all, of a specific regular expression pattern.
As an example let's say you have a large text file, and you want to change all hostnames that end in .com to end in .biz instead.
Obviously you can't just blindly replace .com with .biz, because that text could occur somewhere that isn't a hostname. So you need a way to identify just pieces of text that look like hostnames.
I won't go into the full hostname rules here, but for purposes of this example, let's pretend that hostnames are two to four sequences of alphabetic characters separated by periods, such as ibm.com or www.santa.northpole.org.
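A regular expression to identify hostnames that end in .com might look like this:
((?:[a-z]+\.){1,3})com
Which means "one or more letters followed by a period, occurring one to three times, followed by com." The inner (?:...) is a non-capturing group that exists only so the {1,3} quantifier applies to the whole "letters plus period" unit. The outer parentheses capture everything before com. The extra pair is needed because a capturing group that is itself repeated, as in ([a-z]+\.){1,3}, keeps only its last repetition.
Because the first part of the expression is captured, it can be handled separately from the rest. So you could have a replacement pattern like this:
\1biz
Meaning "Keep the text of the first group unchanged and put biz at the end." The captured text already ends with a period, so the replacement does not add another one.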
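Here is how that could look in Python, as a sketch rather than a complete hostname matcher. The sample text and variable names are made up for illustration, and the \b word boundaries are an addition of mine so the pattern does not match inside longer words. The last lines show why a repeated capturing group on its own would not be enough.

```python
import re

# Group 1 captures every label before "com", each with its trailing period.
hostname_com = re.compile(r'\b((?:[a-z]+\.){1,3})com\b')

text = 'Visit ibm.com or www.santa.northpole.com, not example.org.'
renamed = hostname_com.sub(r'\1biz', text)
print(renamed)

# A capturing group that is repeated keeps only its final repetition.
repeated = re.search(r'([a-z]+\.){1,3}com', 'www.santa.northpole.com')
print(repeated.group(0))  # the whole match
print(repeated.group(1))  # only the last label and its period
```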

how to understand re.match in python? [closed]

I have the following Python code:
import re
result = re.match('a.*b', 'aabab')
result.groups() # result is ()
len(result.groups()) # it's 0
result.group(0) # result is 'aabab'
I only know some basic regex, and I cannot understand the difference between groups() and group(). Could someone give a basic explanation?
If possible, please also explain the Python equivalents of Pattern and Matcher.
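Groups are used to capture specific parts of a match. A group is created by wrapping part of the pattern in parentheses. Your regex does not define any groups, therefore groups() returns an empty tuple and the group count is zero.
group(0) is a special case: it always contains the entire match, whether or not the pattern defines any groups.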
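To make the difference concrete, the sketch below matches the same string twice: once with the pattern from the question and once with a variant that adds two groups. The second pattern is only an illustration and was not part of the question. In Python, re.compile returns a pattern object and match returns a match object, which are roughly what Java calls Pattern and Matcher.

```python
import re

# Pattern from the question: no parentheses, so no groups.
no_groups = re.match(r'a.*b', 'aabab')
print(no_groups.groups())   # empty tuple
print(no_groups.group(0))   # the entire match

# Same idea with two capturing groups added for illustration.
pattern = re.compile(r'(a.*)(b)')    # a compiled pattern object
with_groups = pattern.match('aabab')  # a match object
print(with_groups.groups())  # one entry per group
print(with_groups.group(0))  # still the entire match
print(with_groups.group(1))  # text captured by the first group
print(with_groups.group(2))  # text captured by the second group
```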

Regular expression in Python [closed]

What regular expression accepts only words that start with a letter and rejects any word containing a run of forward slashes that is shorter or longer than three (///)? Slashes are optional; the rule applies only if they are present.
Example:
ABC2123_987 is allowed.
AV23DS///KOLJH is allowed.
But, the word FDG56/HJU is not allowed.
Also, FDG56////HJU is not allowed.
This matches any string that begins with an alphabetic character and contains either no slashes or exactly one run of three consecutive slashes.
^[A-Za-z][^/]*(///[^/]*)?$
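Try the pattern (written as a raw string so that \w is not treated as a string escape):
re.compile(r'(\w+(///)?)*')
A better pattern, as seen in the comments for this answer, which anchors both ends and requires a leading letter:
re.compile(r'^[a-zA-Z]\w*?(///)?(\w+(///)?)*$')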
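The sketch below checks both anchored patterns against the four examples from the question. The variable names and the accepted helper are illustrative only. Note that the two patterns are not equivalent in general: the first allows any non-slash characters and at most one run of three slashes, while the second allows only word characters and several separate runs.

```python
import re

# First answer: a letter, then non-slash text, optionally one "///" run.
single_run = re.compile(r'^[A-Za-z][^/]*(///[^/]*)?$')

# Second answer: a letter, then word characters and "///" runs only.
word_chars = re.compile(r'^[a-zA-Z]\w*?(///)?(\w+(///)?)*$')

samples = ['ABC2123_987', 'AV23DS///KOLJH', 'FDG56/HJU', 'FDG56////HJU']


def accepted(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if pattern.match(w)]


print(accepted(single_run, samples))
print(accepted(word_chars, samples))
```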

How to match words containing characters and digits, but not numbers? [closed]

I'm trying to create a regular expression in Python that matches words made up of A-Za-z or A-Za-z0-9, but not words made up only of 0-9.
For example I want to match fooT, foo23, fo24ooo, fo4o444, but NOT 40.
Is it possible?
One way would be:
r'\w*[A-Za-z]\w*'
Note that \w matches _ as well as A-Za-z0-9 (and, for str patterns in Python 3, non-ASCII letters and digits too). If that is not what you want, write out the whole class:
r'[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*'
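As a sketch of how the explicit pattern behaves on the examples from the question, the code below wraps it in \b word boundaries and uses findall to pull the words out of a sentence. The boundaries and the sample text are my additions; without them the pattern could match part of a longer token.

```python
import re

# At least one letter somewhere, surrounded by optional letters or digits.
word_with_letter = re.compile(r'\b[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*\b')

text = 'fooT foo23 fo24ooo fo4o444 40'
matches = word_with_letter.findall(text)
print(matches)  # every word except the purely numeric one
```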
