Python regex matching words with repeating consonant

First off, this is homework. (I couldn't use a tag in the title and nothing showed up in the tag list at the bottom for homework, so please let me know if I should EDIT something else regarding this matter).
So I have been reading through the python docs and scavenging SO, finding several solutions that are close to what I want, but not exact.
I have a dictionary which I read in to a string:
a
aa
aabbaa
...
z
We are practicing various regex patterns on this data.
The specific problem here is to return a list of words which match the pattern, NOT tuples with the groups within each match.
For example:
Given a subset of this dictionary like:
someword
sommmmmeword
someworddddd
sooooomeword
I want to return:
['sommmmmeword', 'someworddddd']
NOT:
[('sommmmmeword', 'mmmmm', ...), ...] # or any other variant
EDIT:
My reasoning behind the above example is that I want to see how I can avoid making a second pass over the results. That is, instead of saying:
res = re.match(re.compile(r'pattern'), dictionary)
return [r[0] for r in res]
I specifically want a mechanism where I can just use:
return re.match(re.compile(r'pattern'), dictionary)
I know that may sound silly, but I am doing this to really dig into regex. I mention this at the bottom.
This is what I have tried:
# learned about back refs
r'\b([b-z&&[^eiou]])\1+\b' -> # nothing
# back refs were weird, I want to match something N times
r'\b[b-z&&[^eiou]]{2}\b' -> # nothing
Somewhere in testing I noticed a pattern returning things like '\nsomeword'. I couldn't figure out what it was but if I find the pattern again I will include it here for completeness.
# Maybe the \b word markers don't work how I think?
r'.*[b-z&&[^eiou]]{2}' -> # still nothing
# Okay lets just try to match something in between anything
r'.*[b-z&&[^eiou]].*' -> # nope
# Since it's words, maybe I should be more explicit.
r'[a-z]*[b-z&&[^eiou]][a-z]*' -> # still nope
# Decided to go back to grouping.
r'([b-z&&[^eiou]])(\1)' # I realize set difference may be the issue
# I saw someone (on SO) use set difference claiming it works
# but I gave up on it...
# OKAY getting close
r'(([b-df-hj-np-tv-xz])(\2))' -> [('ll', 'l', 'l'), ...]
# Trying the previous ones without set difference
r'\b(.*(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # returned everything (all words)
# Here I realize I need a non-greedy leading pattern (.* -> .*?)
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # still everything
# Maybe I need the comma in {3,} to get anything 3 or more
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3,}).*)\b' -> # still everything
# okay I'll try a 1 line test just in case
r'\b(.*?([b-df-hj-np-tv-xz])(\2{3,}).*)\b'
# Using 'asdfdffff' -> [('asdfdffff', 'f', 'fff')]
# Using dictionary -> [] # WAIT WHAT?!
How does this last one work? Maybe there are no words with 3+ repeating consonants? I'm using /usr/share/dict/cracklib-small on my school's server, which is about 50,000 words I think.
I am still working on this but any advice would be awesome.
One thing I find curious is that you cannot backreference a non-capturing group. If I want to output only the full word, I use (?:...) to avoid capture, but then I cannot backreference. Obviously I could leave the captures in, loop over the results and filter out the extra stuff, but I absolutely want to figure this out using ONLY regex!
Perhaps there is a way to do the non-capture but still allow a backreference? Or maybe there is an entirely different expression I haven't tested yet.

Here are some points to consider:
Use re.findall to get all the results, not re.match (which only looks for one match, and only at the start of the string).
[b-z&&[^eiou]] is a Java/ICU regex, this syntax is not supported by Python re. In Python, you can either redefine the ranges to skip the vowels, or use (?![eiou])[b-z].
To avoid "extra" values in tuples with re.findall, do not use capturing groups. If you need backreferences, use re.finditer instead of re.findall and access .group() of each match.
Coming back to the question of how you can use a backreference and still get the whole match, here is a working demo:
import re
s = """someword
sommmmmeword
someworddddd
sooooomeword"""
res = [x.group() for x in re.finditer(r"\w*([b-df-hj-np-tv-xz])\1\w*", s)]
print(res)
# => ['sommmmmeword', 'someworddddd']
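For completeness, here is the same demo rewritten with the lookahead-based consonant class from the second point above; for this sample the output should be identical, though I have not run it against a full dictionary:
res = [x.group() for x in re.finditer(r"\w*((?![eiou])[b-z])\1\w*", s)]
print(res)
# => ['sommmmmeword', 'someworddddd']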

Related

Extract values in name=value lines with regex

I'm really sorry for asking because there are some questions like this around, but I can't adapt their answers to fix my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign with a Python 3.9 regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected results (what Python's re.findall() should return) are:
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*) and look at the Regex 101 example.
import re
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex; split should work absolutely fine:
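For illustration, a rough sketch of that split-based approach, assuming the config lines live in a string named txt as in the regex example above:
values = []
for line in txt.splitlines():
    name, sep, value = line.partition('=')
    if sep and name.startswith('profile'):
        values.append(value)
print(values)
# ['share2', 'share8', 'shareSSH', 'share9']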
Try removing the ? quantifier. It will make your capture group match an empty string.
regex101

Python Regex return both results when 2 conditions set which partially satisfy one another WITHOUT IF statements nor Test groups and NOT AS A TUPLE

I'm going to have quite a few questions about regex in the coming days. Out of 10 challenges I gave myself over the past 5 days, I managed to solve 6.
I'm hoping the following isn't simple and embarrassing, but what I'm trying to do is use re.findall to return results for both conditions, even though the condition for set 2 may have already been partially satisfied by set 1.
Example (Problem):
>>> str = 'ab1cd2efg1hij2k'
>>> re.findall('ab1cd|ab', str)
['ab1cd']
>>> re.findall('ab|ab1cd', str)
['ab']
So notice that whichever comes first in the OR determines what the single element of the array is. What I want is to be able to return both, as a 2-element array, and preferably not a tuple. The readings I've done on regex ANDing have focused on making regexes match 2 different strings, as opposed to returning multiple results that may mutually satisfy one another partially. Below is what I desire.
Desired Output:
>>> str = 'ab1cd2efg1hij2k'
>>> re.findall('{SOMETHING_THAT_RETURNS_BOTH}', str)
['ab', 'ab1cd']
The closest I've gotten is the following:
>>> re.findall('ab|[\S]+?(?=2e)', str)
['ab', '1cd']
>>> re.findall('ab|[\S]+(?=2e)', str)
['ab', '1cd']
but the second capture group ignores ab. Is there a directive in regex to say restart from the beginning? (?:^) seems to work the same as a ^ and using it in several ways didn't help thus far. Please note I DO NOT want to use regex IF statements nor test to see if a previous group matched just yet because I'm not quite ready to learn those methods before forming a more solid foundation for the things I don't yet know.
Thanks so much.
If you can relax the tuple requirement, then the following regex with 2 independent lookaheads is needed, due to your requirement of capturing overlapping text:
>>> print re.search(r'(?=(ab1cd))(?=(ab))', str).groups()
('ab1cd', 'ab')
Both lookaheads have a capturing group, thus giving us the required output.
You can also use findall:
>>> print re.findall(r'(?=(ab1cd))(?=(ab))', str)[0]
('ab1cd', 'ab')
Looking at the desired output, the regex pattern shouldn't really require any lookaheads:
import re
str = 'ab1cd2efg1hij2k1cd'
res = re.findall(r'((ab)?1cd)', str)
[list(row) for row in res][0]
The ? Quantifier — Matches between zero and one times, as many times
as possible, giving back as needed (greedy).
Result:
['ab1cd', 'ab']

Try to rescan with re.scanner if "False"

How can I get re.Scanner to attempt next options?
I have some nested groups. When we return None, we skip over this text.
Is there a way, within the function, to have the scanner retry the next rules (i.e. have the function be hit, but not produce output, and retry matching)?
Since we cannot have negative lookahead, I was thinking it would be useful for the function itself to be able to tell the scanner to try the next rules.
s = re.Scanner([
    ("a", retry_under_some_conditions),  # I want to try the next rule, "aa"
    ("aa", None)                         # essentially means skip over "aa"
])
s.scan("aa")  # the output should not come from the retry function but from the "aa" rule (None); it matches on "aa"
This is not the actual problem, but it is related enough (just don't try to solve this actual problem in a traditional way). Being able to apply next rules would be very powerful.
I'm now thinking it might not be possible. If the regex engine optimization can only match in a straightforward way and does a complete parse, and only afterwards applies the functions, then this is not possible.
How can I get re.Scanner to attempt next options?
Short answer: You can't.
Since we cannot have negative lookahead
I'm a bit puzzled. Because you can:
(r"a(?!a)", retry_under_some_conditions)
That would work, and the scanner would try the next rule. So I'm assuming the reason you can't has something to do with your project / actual problem.
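Here is a small runnable check of that claim; the callback body below is made up just to show which rule fired:
import re

def retry_under_some_conditions(scanner, token):
    return ("single-a", token)  # placeholder: just label the token

s = re.Scanner([
    (r"a(?!a)", retry_under_some_conditions),  # fires only when the 'a' is not followed by another 'a'
    (r"aa", None),                             # "aa" matches here and is skipped
])
print(s.scan("aa"))  # ([], '') -- the "aa" rule won, nothing was emitted
print(s.scan("a"))   # ([('single-a', 'a')], '')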
I have some nested groups. When we return None, we skip over this text.
That's just because of how the scanner works. When you create the scanner, it takes your patterns (in this case a and aa) and combines them into (a)|(aa). Which subpattern matched is the important part, because that is how the scanner works internally.
If we instead take the negative lookahead into account and use the pattern (a(?!a))|(aa), then doing this:
string = "a aa"
pattern = r"(a(?!a))|(aa)"
for match in re.finditer(pattern, string):
print(match.lastindex, match.span(), match.group())
Would print:
1 (0, 1) a
2 (2, 4) aa
The key here is that the scanner leverages match.lastindex to map back to the callback or value to return as a result.
I'm now thinking it might not be possible
Without using the lookahead or some other way to make it try the next subpattern, this isn't possible, neither with the scanner nor with regex in general. The engine matches the first subpattern that succeeds; once it does, the others are skipped. Call it a limitation of regex if you will, but that's just how regex works.
The solution would then be to make your own scanner and scan each pattern individually.
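Here is a rough sketch of what such a hand-rolled scanner could look like; the RETRY sentinel and the one-argument callback signature are my own choices, not part of re.Scanner:
import re

RETRY = object()  # sentinel a callback can return to mean "pretend this rule did not match"

def scan(lexicon, string):
    rules = [(re.compile(pat), action) for pat, action in lexicon]
    results, pos = [], 0
    while pos < len(string):
        for regex, action in rules:
            m = regex.match(string, pos)
            if not m or m.end() == pos:
                continue  # no match (or an empty match, which we skip to avoid looping forever)
            if callable(action):
                out = action(m.group())
                if out is RETRY:
                    continue  # fall through and try the next rule at the same position
                if out is not None:
                    results.append(out)
            elif action is not None:
                results.append(action)
            pos = m.end()
            break
        else:
            break  # no rule matched here: stop and return the rest as unscanned
    return results, string[pos:]

def retry_under_some_conditions(token):
    return RETRY  # in this toy example, always hand the position over to the "aa" rule

print(scan([("a", retry_under_some_conditions), ("aa", None)], "aa"))
# => ([], '') -- the first rule matched but deferred, and the second consumed "aa" silently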

Simple regular expression not working

I am trying to match a string with a regular expression but it is not working.
What I am trying to do is simple: it is the typical situation where a user introduces a range of pages, or single pages. I am reading the string and checking whether it is correct or not.
Expressions I am expecting for a range of pages are like: 1-3, 5-6, 12-67
Expressions I am expecting for single pages are like: 1,5,6,9,10,12
This is what I have done so far:
pagesOption1 = re.compile(r'\b\d\-\d{1,10}\b')
pagesOption2 = re.compile(r'\b\d\,{1,10}\b')
Seems like the first expression works, but not the second.
And would it be possible to merge both of them into one single regular expression, in a way that, if the user introduces either something like 1-2, 7-10 or something like 3,5,6,7, the expression will be recognized as good?
Simpler is better
Matching the entire input isn't simple, as the proposed solutions show; at least, it is not as simple as it could/should be. It will become read-only very quickly and will probably be scrapped for a simpler, more explicit solution by anyone who isn't regex savvy when they need to modify it.
Simplest
First, .split(",") the entire string into individual data entries; you will need these anyway to parse out the usable numbers.
Then the test for each entry becomes very simple:
^(\d+)(?:-(\d+))?$
It says that the string must start with one or more digits, optionally followed by a single - and one or more digits, and then the string must end.
This makes your logic as simple and maintainable as possible. You also get the benefit of knowing exactly what part of the input is wrong and why, so you can report it back to the user.
The capturing groups are there because you are going to need the input parsed out to actually use it; this way you get the numbers if they match, without having to add more code to parse them again.
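A minimal sketch of that split-then-validate flow (the variable names here are mine, not from the question):
import re

page_entry = re.compile(r'^(\d+)(?:-(\d+))?$')

user_input = '1-3, 5-6, 12-67'
for entry in user_input.split(','):
    m = page_entry.match(entry.strip())
    if m is None:
        print('bad entry:', entry)
    else:
        start, end = m.group(1), m.group(2)  # group 2 is None for single pages
        print('pages', start, 'to', end or start)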
This regex should work -
^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$
Demo here
Testing this -
>>> test_vals = [
...     '1-3, 5-6, 12-67',
...     '1,5,6,9,10,12',
...     '1-3,1,2,4.5',
...     'abcd',
... ]
>>> regex = re.compile(r'^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$')
>>> for val in test_vals:
...     print val
...     if regex.match(val) == None:
...         print "Fail"
...     else:
...         print "Pass"
1-3, 5-6, 12-67
Pass
1,5,6,9,10,12
Pass
1-3,1,2,4.5
Fail
abcd
Fail

Regular expression how to get middle strings

I want to search for strings that occur between certain strings. For example:
\start
\problem{number}
\subproblem{number}
/* strings that I want to get */
\subproblem{number}
/* strings that I want to get */
\problem{number}
\subproblem{number}
...
...
\end
More specifically, I want to get the problem number and subproblem number, and the strings in between, which are the answers.
I came up with an expression like
'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
but it seems like it doesn't work as I expect. What is wrong with this expression?
This one:
(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
returns three matches for me:
Match 1:
group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"
Match 2:
group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"
Match 3:
group 1: "number"
group 2: "number"
group 3: " ...\n ..."
However I'd rather parse it in two steps.
First find the problem's number (group 1) and content (group 2) using:
\\problem\{(.*?)\}\n(.+?)\\end
Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:
\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
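As a rough illustration of that two-step idea in Python, here is a sketch; the patterns are adjusted slightly from the ones above so it runs on a small sample as-is, so treat them as an assumption rather than the exact answer:
import re

doc = r"""\start
\problem{1}
\subproblem{a}
first answer
\subproblem{b}
second answer
\problem{2}
\subproblem{a}
third answer
\end"""

# Step 1: each problem number and everything up to the next \problem or \end.
problem_re = re.compile(r'\\problem\{(.*?)\}(.*?)(?=\\problem\{|\\end)', re.S)
# Step 2: each subproblem number and its answer text inside a problem's body.
subproblem_re = re.compile(r'\\subproblem\{(.*?)\}\s*(.*?)\s*(?=\\subproblem\{|$)', re.S)

for number, body in problem_re.findall(doc):
    for sub_number, answer in subproblem_re.findall(body):
        print(number, sub_number, repr(answer))
# => 1 a 'first answer' / 1 b 'second answer' / 2 a 'third answer'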
TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.
That said, your regular expression has two issues:
You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)
Give this a try:
>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n ...\n ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n ...')]
If the question really is "What is wrong with this expression?", here's the answer:
You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.
That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:
r'(?sx)(\\problem{(.*?)})? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.
However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX usually being a bad idea), or, even better, doing it in two steps as Regexident does.
If using regex to parse TeX is not a good idea, then what method would you suggest for parsing TeX?
First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.
More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.
So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.
If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.
