Regular Expression Testing - python

So I have been working on this project to help myself understand regular expressions. There are 6 lines of input. The first line contains 10 character strings. The last 5 lines each contain a valid regular expression.
For the output, for each regular expression, print all of the character strings from line 1 that match it; if none match, print none. # is used to represent the empty string. I have gotten everything working except the empty-string part, so here is my code
and an example input:
1)#,aac,acc,abc,ac,abbc,abbbc,abbbbc,aabc,accb
and I would like the second input to be
2)b*
The output I'm trying to get is #
and so far it outputs nothing.
import re
inp = input("Search String:").upper().split(',')
for runs in range(50):
    temp = []
    query = input("Search Query:").replace("?", "[A-Z_0-9]+?+$").upper()
    for item in inp:
        search = re.match(query, item)
        if search:
            if search.group() not in temp:
                temp.append(search.group())
    if len(temp) > 0:
        print(" ".join(temp))
    else:
        print("NONE")

b matches only the literal character 'b', so your search string b* matches only a sequence of zero or more b's: the empty string, or
b
or
bbbb
or
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb (and so on)
Your match string will not match anything else.
I don't know why you are using a specific letter, but I assume you intended an escape sequence, like "\b*". However, \b only matches word boundaries (the transitions between word and non-word characters), so it won't match # in this context. If you use \W*, it will match # (I'm not sure whether it will also match the other strings you want).
If you haven't already, check out the following resources on regular expressions, including all of the escape characters and metacharacters:
Wikipedia
Python.org 2.7
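For what it's worth, here is a minimal sketch of one way to get # in the output. It rests on two assumptions that are not confirmed by the exercise text: that # should be treated as the empty string before matching, and that a string only counts as a match when the whole string matches (re.fullmatch) rather than just a prefix (re.match). The helper name matching_strings is made up for illustration.

```python
import re

def matching_strings(candidates, pattern):
    """Return the candidates that the pattern matches in full.

    Assumption: '#' in the input stands for the empty string.
    """
    regex = re.compile(pattern)
    hits = []
    for item in candidates:
        text = "" if item == "#" else item  # translate '#' to the empty string
        if regex.fullmatch(text) and item not in hits:
            hits.append(item)  # report the original spelling, so '' shows up as '#'
    return hits

candidates = "#,aac,acc,abc,ac,abbc,abbbc,abbbbc,aabc,accb".split(",")
print(" ".join(matching_strings(candidates, "b*")) or "NONE")  # prints: #
```

With re.match, b* succeeds on every string because it can match zero characters at the start, and the matched text is the empty string, which is why the original code prints what looks like nothing.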

Related

Can't wrap my head around how to remove a list of characters from another list

I've been able to isolate the list (or string) of characters I want excluded from a user entered string. But I don't see how to then remove all these unwanted characters. After I do this, I think I can try joining the user string so it all becomes one alphabet input like the instructions say.
Instructions:
Remove all non-alpha characters
Write a program that removes all non-alpha characters from the given input.
For example, if the input is:
-Hello, 1 world$!
the output should be:
Helloworld
My code:
userEntered = input()
makeList = userEntered.split()
def split(userEntered):
    return list(userEntered)
if userEntered.isalnum() == False:
    for i in userEntered:
        if i.isalpha() == False:
            #answer = userEntered[slice(userEntered.index(i))]
            reference = split(userEntered)
            excludeThis = i
            print(excludeThis)
When I print excludeThis, I get this as my output:
-
,
1
$
!
So I think I might be on the right track. I need to figure out how to get these characters out of the user input. Any help is appreciated.
Loop over the input string. If the character is alphabetic, add it to the result string.
userEntered = input()
result = ''
for char in userEntered:
    if char.isalpha():
        result += char
print(result)
This can also be done with a regular expression:
import re
userEntered = input()
result = re.sub(r'[^a-z]', '', userEntered, flags=re.I)
print(result)
The regexp [^a-z] matches any character that is not a lowercase ASCII letter; the re.I flag makes it case-insensitive, so uppercase letters are kept as well. Every match is replaced with an empty string, which removes it.
There are basically two main parts to this: distinguish alpha from non-alpha, and get a string with only the former. If isalpha() is satisfactory for the former, then that leaves the latter. My understanding is that the solution that is considered most Pythonic would be to join a comprehension. This would look like this:
''.join(char for char in userEntered if char.isalpha())
BTW, there are several places in the code where you are making it more complicated than it needs to be. In Python, you can iterate over strings, so there's no need to convert userEntered to a list. isalnum() checks whether the string is all alphanumeric, so it's rather irrelevant (alphanumeric includes digits). You shouldn't ever compare a boolean to True or False, just use the boolean. So, for instance, if i.isalpha() == False: can be simplified to just if not i.isalpha():.
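To compare the approaches from the answers side by side, here is a small sketch that wraps each one in a function and runs it on the sample input from the question. The function names are made up for illustration.

```python
import re

def keep_alpha_loop(text):
    # Build the result one character at a time.
    result = ''
    for char in text:
        if char.isalpha():
            result += char
    return result

def keep_alpha_join(text):
    # Same filter, written as a generator expression passed to join.
    return ''.join(char for char in text if char.isalpha())

def keep_alpha_regex(text):
    # Delete every character that is not an ASCII letter (case-insensitive).
    return re.sub(r'[^a-z]', '', text, flags=re.I)

sample = '-Hello, 1 world$!'
for func in (keep_alpha_loop, keep_alpha_join, keep_alpha_regex):
    print(func.__name__, func(sample))  # each prints Helloworld
```

One difference to be aware of: str.isalpha() also accepts non-ASCII letters such as 'é', while [^a-z] with re.I keeps only ASCII letters, so the two families can disagree on accented input.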

Why does the order of expressions matter in re.match?

I'm making a function that will take a string like "three()" or something like "{1 + 2}" and put its tokens into a list (e.g. "three()" becomes ["three", "(", ")"]). I'm using re.match to help separate the string.
import re

def lex(s):
    # scan input string and return a list of its tokens
    seq = []
    patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*")
    m = re.match(patterns,s)
    while m != None:
        if s == '':
            break
        seq.append(m.group(2))
        s = s[len(m.group(0)):]
        m = re.match(patterns,s)
    return seq
This one works if the string is just "three". But if the string contains "()" or any symbol, it stays in the while loop.
But a funny thing happens: when I move ([a-z])* to the end of the alternation in the pattern string, it works. Why is that happening?
works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
This one is a bit tricky, but the problem is with this part: ([a-z])*. This matches any string of lowercase letters of length 0 (zero) or more.
If you put this sequence at the end, like here:
patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
The regex engine will try the other alternatives first, and if it finds a match, stop there. Only if none of the others match does it try ([a-z])*, and since * is 'greedy', it will match all of three. On the following passes the symbol alternatives match ( and finally ).
Read an explanation of how the full expression is tested in the documentation (thanks to @kaya3).
However, if you put that sequence at the start, like here:
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
It will now try to match it first. It's still greedy, so three still gets matched. But then on the next try, it will try to match ([a-z])* to the remaining '()' - and it matches, since that string starts with zero letters.
It keeps matching it like that, and gets stuck in the loop. You can fix it by changing the * for a + which will only match if there is 1 or more matches:
patterns = (r"^(\t|\n|\r| )*(([a-z])+|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
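The effect of the ordering can be checked directly. The sketch below first shows, on a cut-down alternation, that putting ([a-z])* first produces an empty match on "()", and then runs a version of the lexer that uses the + fix. The ValueError for unrecognised input is an addition of mine, not part of the original code; it is there so that a character the pattern does not cover (such as "{") stops the loop instead of hanging it.

```python
import re

# With ([a-z])* listed first, the alternation succeeds with an empty match on "()".
first = re.match(r"(([a-z])*|\(|\))", "()")
# With the symbols listed first, "(" is consumed before ([a-z])* is ever tried.
last = re.match(r"(\(|\)|([a-z])*)", "()")
print(repr(first.group(0)), repr(last.group(0)))  # prints: '' '('

# The fixed pattern from the answer: ([a-z])+ needs at least one letter.
TOKEN = re.compile(r"^(\t|\n|\r| )*(([a-z])+|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")

def lex(s):
    # Scan the input string and return a list of its tokens.
    seq = []
    while s:
        m = TOKEN.match(s)
        if m is None:
            # Added safeguard (not in the original): fail loudly on unknown input.
            raise ValueError("cannot tokenize: %r" % s)
        seq.append(m.group(2))
        s = s[m.end():]  # every successful match consumes at least one character
    return seq

print(lex("three()"))  # prints: ['three', '(', ')']
```

Because every alternative in the fixed pattern consumes at least one character, each pass through the loop shortens s, so the loop always terminates.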

Change string for defined pattern (Python)

Learning Python, I came across a demanding beginner's exercise.
Let's say you have a string constituted by "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on old one but in accordance to a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character must be replaced by some other letter, and finally '*' by an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?
To split a string you can use the split() function.
For pattern detection in strings you can use regular expressions (regex) with the re module.
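Putting those two tips together, here is a rough sketch that reproduces both examples from the question. It is based on my reading of the exercise, which the question does not spell out: n(x) is expanded to x repeated n times, '?' stands for exactly one lowercase letter, '*' stands for zero or more lowercase letters, and blocks that do not fit the pattern are dropped. The helper names expand and pattern_to_regex are made up for illustration.

```python
import re

def expand(block):
    # "2(c)3(i)s" -> "cciiis": repeat the parenthesised text the given number of times.
    return re.sub(r"(\d+)\(([^)]*)\)",
                  lambda m: m.group(2) * int(m.group(1)),
                  block)

def pattern_to_regex(pattern):
    # Assumed meaning: '?' is one letter, '*' is zero or more letters,
    # anything else must appear literally.
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[a-z]")
        elif ch == "*":
            parts.append("[a-z]*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def switch(string, pattern):
    regex = pattern_to_regex(pattern)
    blocks = [expand(block) for block in string.split(";")]
    # Keep only the blocks that the whole pattern matches.
    return ";".join(block for block in blocks if regex.fullmatch(block))

print(switch("cdk;2(c)3(i)s;c", "c?*"))   # prints: cdk;cciiis
print(switch("2(a)bxaxb;ab", "a?*b"))     # prints: aabxaxb
```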

Python regular expression for consonants and vowels

I was trying to create a regular expression which would allow any number of consonants, or any number of vowels, or a mix of consonants and vowels such that any consonants come first and are followed ONLY by vowels; no consonant should be allowed after the vowels. It should be clearer from the examples:
The following cases should pass:
TR, EE, TREE, Y, BY.
But the following should not pass the expression :
TROUBLE, OATS, TREES, IVY, TROUBLES, PRIVATE, OATEN, ORRERY.
So generally it can be visualized as : [C] [V]
C - Consonants
V - Vowels
[ ] - where the square brackets denote arbitrary presence of their contents.
And I got as far as this piece of code:
import re
def find_m(word):
    if re.match("[^aeiou]*?[aeiou]*?",word):
        print("PASS")
    else:
        print("FAIL")
find_m("tr")
find_m("ee")
find_m("tree")
find_m("y")
find_m("by")
find_m("trouble")
find_m("oats")
find_m("trees")
find_m("ivy")
find_m("aaabbbbaaa")
But it passes for all the cases. I need a correct expression that gives the desired results.
All you need to do is add another anchor, $, at the end of your regex:
if re.match("[^aeiou]*[aeiou]*$",word):
$ anchors the regex at the end of the string, so nothing is allowed after the vowels.
Note
You can drop the non-greedy ? from the regex; once the pattern is anchored with $, lazy and greedy quantifiers accept exactly the same strings.
The regex matches empty strings as well.
A mandatory part of the string can be specified by replacing * with +. For example, if the input must contain vowels, the regex must be
if re.match("[^aeiou]*[aeiou]+$",word):
Or you can use a lookahead to ensure the string is non-empty:
if re.match("^(?=.+)[^aeiou]*[aeiou]*$",word):
Test
$ cat test.py
import re
def find_m(word):
    if re.match("[^aeiou]*[aeiou]*$",word):
        print("PASS")
    else:
        print("FAIL")
find_m("tr")
find_m("ee")
find_m("tree")
find_m("y")
find_m("by")
find_m("trouble")
find_m("oats")
find_m("trees")
find_m("ivy")
find_m("aaabbbbaaa")
$ python test.py
PASS
PASS
PASS
PASS
PASS
FAIL
FAIL
FAIL
FAIL
FAIL
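One thing the test above does not cover: the examples in the question are written in upper case, and [^aeiou] treats upper-case vowels as "not a vowel", so TROUBLE would pass. A possible variant, sketched below, adds re.IGNORECASE and uses re.fullmatch instead of the trailing $. The function name is_consonants_then_vowels is made up for illustration, and like the original pattern it counts any non-vowel character (digits, punctuation) as a consonant and accepts the empty string.

```python
import re

CV_PATTERN = re.compile(r"[^aeiou]*[aeiou]*", re.IGNORECASE)

def is_consonants_then_vowels(word):
    # fullmatch requires the whole word to fit, so no trailing $ is needed.
    return CV_PATTERN.fullmatch(word) is not None

should_pass = ["TR", "EE", "TREE", "Y", "BY"]
should_fail = ["TROUBLE", "OATS", "TREES", "IVY", "TROUBLES",
               "PRIVATE", "OATEN", "ORRERY"]

for word in should_pass + should_fail:
    print(word, "PASS" if is_consonants_then_vowels(word) else "FAIL")
```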
Fill in the code to check if the text passed contains the vowels a, e and i, with exactly one occurrence of any other character in between.
import re
def check_aei (text):
    result = re.search(r"a.e.i", text)
    return result != None
print(check_aei("academia")) # True
print(check_aei("aerial")) # False
print(check_aei("paramedic")) # True

Can you place an instance of a member of a list within a regex to match in python?

So essentially I am trying to read lines from multiple files in a directory and use a regex to find the beginning of a sort of timestamp. I also want to place a member of a list of months within the regex and then keep a counter for each month based on how many times it appears. I have some code below, but it is still a work in progress. I know I closed off date_parse, but that's why I'm asking. Please leave another suggestion if you can think of a more efficient method. Thanks.
months = ['Jan','Feb','Mar','Apr','May','Jun',\
          'Jul','Aug','Sep','Oct','Nov',' Dec']
date_parse = re.compile('[Date:\s]+[[A-Za-z]{3},]+[[0-9]{1,2}\s]')
counter=0
for line in sys.stdin:
    if data_parse.match(line):
        for month in months in line:
            print '%s %d' % (month, counter)
In a regular expression, you can have a list of alternative patterns, separated using vertical bars.
http://docs.python.org/library/re.html
import re
import sys
from collections import defaultdict

date_parse = re.compile(r'Date:\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)')
c = defaultdict(int)
for line in sys.stdin:
    m = date_parse.match(line)
    if m is None:
        # pattern did not match
        # could handle error or log it here if desired
        continue # skip to handling next input line
    month = m.group(1)
    c[month] += 1
Some notes:
I recommend you use a raw string (with r'' or r"") for a pattern, so that backslashes will not become string escapes. For example, inside a normal string, \s is not an escape and you will get a backslash followed by an 's', but \n is an escape and you will get a single character (a newline).
In a regular expression, when you enclose a series of characters in square brackets, you get a "character class" that matches any of the characters. So when you put [Date:\s]+ you would match Date: but you would also match taD:e or any other combination of those characters. It's perfectly okay to just put in a string that should match itself, like Date:.
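To answer the title question more literally: the alternation does not have to be typed out by hand, it can be built from the list. The sketch below is one way to do that, assuming the month name comes directly after "Date:" and whitespace, as in the pattern above. The sample lines and the function name count_months are invented for illustration; real log lines may need a different pattern.

```python
import re
from collections import Counter

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Join the list into "Jan|Feb|...|Dec"; re.escape guards against special characters.
date_parse = re.compile(r'Date:\s+(' + '|'.join(map(re.escape, MONTHS)) + r')')

def count_months(lines):
    counts = Counter()
    for line in lines:
        m = date_parse.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical input lines, only to show the shape of the result.
sample = ["Date: Jan 5 10:00", "Date: Feb 1 09:30", "Date: Jan 9 17:45", "Subject: hi"]
for month, n in sorted(count_months(sample).items()):
    print('%s %d' % (month, n))

# The character-class pitfall from the notes: this matches a scrambled "Date:".
print(bool(re.fullmatch(r'[Date:\s]+', 'taD:e')))  # prints: True
```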
