How to match through new line in regular expression in python?

This is my code. I want to ignore everything between ~ and ~, even if it contains newlines or whitespace, so that I can skip comments.
for letter in code:
    tok += letter  # add each character to the token
    if not is_str and (tok == " " or tok == "\n"):
        # ignore whitespace and newlines outside a string
        tok = ""  # reset the token
        # Always remember: it's not the lexer's job to generate errors.
        # That is the parser's work. One thing should only do one thing.
    elif re.search(r'Enter', tok):
        tokens.append("ENTER")
        tok = ""
    elif re.search(r'~(.*?|\n*?)~', tok):
        # to ignore the comments written within ~this~
        tok = ""

If no other ~ is allowed inside a ~...~ comment, you can use:
r'~[^~]*~'
[^~] matches any character except ~, including newlines, so no flag is needed.
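As a rough sketch of how the negated character class behaves on a comment that spans two lines (the sample text and variable names are made up for illustration):

```python
import re

# Hypothetical source text with a comment that spans two lines.
code = "Enter ~first line\nsecond line~ Enter"

# [^~] matches every character except '~', newlines included.
comment = re.compile(r'~[^~]*~')

# Replace every comment with an empty string.
stripped = comment.sub('', code)
print(repr(stripped))
```

Applying the pattern to the whole input, rather than to a token built one character at a time, is a design choice of this sketch and not something the question's lexer does.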

You can use the re.DOTALL flag:
Make the '.' special character match any character at all, including a
newline; without this flag, '.' will match anything except a newline.
pattern = re.compile(r'~(.*?)~', re.DOTALL)
Trial:
>>> import re
>>> s = '''~dksdjs
... sdjs~'''
>>> pattern = re.compile(r'~(.*?)~', re.DOTALL)
>>> pattern.search(s)
<_sre.SRE_Match object; span=(0, 13), match='~dksdjs\nsdjs~'>
# The \n inside the match shows that '.' crossed the line break.
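To see the effect of the flag side by side, here is a small sketch using re.sub on a made-up string; the inline (?s) form is shown as an alternative way of switching DOTALL on:

```python
import re

text = "keep ~drop\nthis~ keep"

# Without DOTALL, '.' stops at the newline, so nothing is removed.
plain = re.sub(r'~.*?~', '', text)

# With DOTALL, '.' also matches '\n' and the whole comment is removed.
dotall = re.sub(r'~.*?~', '', text, flags=re.DOTALL)

# The inline flag (?s) has the same effect as passing re.DOTALL.
inline = re.sub(r'(?s)~.*?~', '', text)

print(repr(plain), repr(dotall), repr(inline))
```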

Related

re.compile and putting variables with it [duplicate]

I'd like to use a variable inside a regex. How can I do this in Python?
TEXTO = sys.argv[1]
if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed

You have to build the regex as a string:
TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"
if re.search(my_regex, subject, re.IGNORECASE):
    ...  # etc.
Note the use of re.escape: if your text contains special characters, they are matched literally instead of being interpreted as regex syntax.
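A minimal sketch of what re.escape changes here, assuming a hypothetical TEXTO that contains a dot:

```python
import re

TEXTO = "a.b"  # hypothetical user input containing a regex metacharacter

unescaped = r"\b(?=\w)" + TEXTO + r"\b(?!\w)"
escaped = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"

# Unescaped, '.' matches any character, so "axb" is accepted.
print(bool(re.search(unescaped, "axb")))

# Escaped, only a literal dot is accepted.
print(bool(re.search(escaped, "axb")))
print(bool(re.search(escaped, "see a.b here")))
```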

From Python 3.6 on you can also use Literal String Interpolation, "f-strings". In your particular case the solution would be:
if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
    ...  # do something
EDIT:
Since there have been some questions in the comment on how to deal with special characters I'd like to extend my answer:
raw strings ('r'):
One of the main concepts you have to understand when dealing with special characters in regular expressions is the distinction between string literals and the regular expression itself.
In short:
Let's say that instead of a word boundary \b after TEXTO you want to match the literal string \boundary. Then you have to write:
TEXTO = "Var"
subject = r"Var\boundary"
if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):
    print("match")
This only works because we are using a raw string (the regex is prefixed with 'r'); otherwise we would have to write "\\\\boundary" in the regex (four backslashes). Additionally, without the 'r' prefix, '\b' would no longer be interpreted as a word boundary but as a backspace character!
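The difference between the two kinds of literal can be checked directly. This is only an illustrative sketch with a made-up subject string:

```python
import re

normal = "\bcat\b"   # "\b" is one backspace character (0x08) here
raw = r"\bcat\b"     # backslash + 'b': the regex engine sees a word boundary

# The normal literal is two characters shorter than the raw one.
print(len(normal), len(raw))

# The normal literal looks for backspace characters and finds nothing.
print(re.search(normal, "a cat sat"))

# The raw literal finds the whole word.
print(re.search(raw, "a cat sat"))
```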
re.escape:
Basically puts a backslash in front of any special character. Hence, if you expect a special character in TEXTO, you need to write:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
NOTE: From Python 3.7 on, !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are no longer escaped; only characters that can have special meaning in a regex are still escaped. _ has not been escaped since Python 3.3.
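A quick way to check what re.escape does on your interpreter; the last line assumes Python 3.7 or later, and the sample strings are arbitrary:

```python
import re

# Metacharacters get a backslash; plain letters are untouched.
print(re.escape("a.b"))        # a\.b
print(re.escape("happy(+)"))   # happy\(\+\)

# The underscore is left as it is.
print(re.escape("_"))

# Assumes Python 3.7+: '@' is left alone, while '#' is still escaped
# because it starts a comment in re.VERBOSE mode.
print(re.escape("@"), re.escape("#"))
```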
Curly braces:
If you want to use quantifiers within the regular expression using f-strings, you have to use double curly braces. Let's say you want to match TEXTO followed by exactly 2 digits:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
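To make the brace doubling concrete, this sketch prints the pattern the f-string actually produces (TEXTO and the test strings are hypothetical):

```python
import re

TEXTO = "Var"  # hypothetical variable value

# {{2}} in the f-string becomes the literal quantifier {2} in the pattern.
pattern = rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)"
print(pattern)

print(bool(re.search(pattern, "x Var42 y")))  # exactly two digits
print(bool(re.search(pattern, "Var4")))       # one digit only
print(bool(re.search(pattern, "Var423")))     # three digits
```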

if re.search(r"\b(?=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):
    ...
This inserts the contents of TEXTO into the regex as a string. Special characters in TEXTO are not escaped.

rx = r'\b(?=\w){0}\b(?!\w)'.format(TEXTO)

I find it very convenient to build a regular expression pattern by stringing together multiple smaller patterns.
import re
string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)
print(match)
Output:
[('begin', 'id1'), ('middl', 'id2')]

I agree with all of the above, unless sys.argv[1] is itself meant to be a regex, for example:
Chicken\d{2}-\d{2}An\s*important\s*anchor
In that case you would not want to use re.escape, because you want the text to behave like a regex:
TEXTO = sys.argv[1]  # e.g. r"Chicken\d{2}-\d{2}An\s*important\s*anchor"
if re.search(r"\b(?=\w)" + TEXTO + r"\b(?!\w)", subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed
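If the text is deliberately treated as a regex, it may be worth checking that it compiles before using it. The helper below is a hypothetical sketch, not part of the answer above; it relies only on re.compile raising re.error for an invalid pattern:

```python
import re

def compile_user_pattern(texto):
    """Compile user-supplied regex text, or return None if it is invalid."""
    try:
        return re.compile(r"\b(?=\w)" + texto + r"\b(?!\w)", re.IGNORECASE)
    except re.error:
        return None

good = compile_user_pattern(r"Chicken\d{2}-\d{2}")
bad = compile_user_pattern(r"Chicken(")   # unbalanced parenthesis

print(good is not None, bad is None)
```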

You can also build the pattern with str.format:
re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)

I needed to search for usernames that are similar to each other, and what Ned Batchelder said was incredibly helpful. However, I found I had cleaner output when I used re.compile to create my search term:
pattern = re.compile(r"(" + username + r".*):(.*?):(.*?):(.*?):(.*)")
matches = re.findall(pattern, lines)
Output can be printed using the following:
print(matches[0])  # prints the groups of the first matching line as a tuple
print(matches[0][3])  # prints the fourth group (established by the parentheses in the regex) of the first matching line

from re import search, IGNORECASE
def is_string_match(word1, word2):
    # Case-insensitive check for word1 as a whole word in word2
    # word1: string
    # word2: string | list
    # if word2 is a list, check each of its words
    if isinstance(word2, list):
        for word in word2:
            if search(rf'\b{word1}\b', word, IGNORECASE):
                return True
        return False
    # otherwise compare word1 against the single string word2
    if search(rf'\b{word1}\b', word2, IGNORECASE):
        return True
    return False
is_match_word = is_string_match("Hello", "hELLO")
# True
is_match_word = is_string_match("Hello", ["Bye", "hELLO", "#vagavela"])
# True
is_match_word = is_string_match("Hello", "Bye")
# False

Here's another format you can use (tested on Python 3.7):
regex_str = r'\b(?=\w)%s\b(?!\w)' % TEXTO
I find it useful when you can't use {} for the variable (here replaced with %s).

You can use the format method as well. It replaces the {} placeholder with the variable you pass to it as an argument.
if re.search(r"\b(?=\w){}\b(?!\w)".format(TEXTO), subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed

Another example: I have a config file, configus.yml, that describes the flows:
"pattern":
  - _(\d{14})_
"datetime_string":
  - "%m%d%Y%H%M%f"
In the Python code I use:
data_time_real_file = re.findall(flows[flow]["pattern"][0], latest_file)
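Here is a self-contained sketch of the same idea. The dictionary stands in for the parsed YAML file, and the flow name and file name are invented for illustration:

```python
import re

# Hypothetical stand-in for the parsed YAML file; the flow name and
# file name below are made up.
flows = {"daily": {"pattern": [r"_(\d{14})_"]}}
latest_file = "export_20240131123000_v2.csv"

# The pattern read from the config is used as a regex unchanged.
pattern = re.compile(flows["daily"]["pattern"][0])
stamps = pattern.findall(latest_file)
print(stamps)
```

Because the pattern comes from a config file that you control, it is used as a regex here and is not passed through re.escape.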

How to use regex to tell if first and last character of a string match?

I'm relatively new to Python and regex, and I want to check whether a string's first and last characters are the same.
If the first and last characters are the same, return 'True' (e.g. 'aba').
If the first and last characters are not the same, return 'False' (e.g. 'ab').
Below is the code I've written:
import re
string = 'aba'
pattern = re.compile(r'^/w./1w$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
But with the above code I don't see any output.

If and only if you really want to use a regex (for learning purposes):
import re
string = 'aba'
string2 = 'no match'
pattern = re.compile(r'^(.).*\1$')
if re.match(pattern, string):
    print('ok')
else:
    print('nok')
if re.match(pattern, string2):
    print('ok')
else:
    print('nok')
output:
ok
nok
Explanations:
^(.).*\1$
^ start-of-line anchor
(.) match the first character of the line and store it in a group
.* match any character, any number of times
\1 backreference to the first group, here the first character, which forces the last character to equal the first
$ end-of-line anchor
Demo: https://regex101.com/r/DaOPEl/1/
Otherwise the best approach is simply to compare string[0] == string[-1]:
string = 'aba'
if string[0] == string[-1]:
    print('same')
output:
same
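The two approaches differ on edge cases: ^(.).*\1$ needs at least two characters, so it does not match 'a', and string[0] raises IndexError on an empty string. The sketch below is one possible way to make them agree, assuming a one-character string should count as a match; the optional group and the function names are my own variation, not the pattern from the answer:

```python
import re

# Assumption: a one-character string counts as "first == last".
# The optional group lets a single character match on its own, and
# DOTALL lets '.' cover strings that contain newlines.
SAME_ENDS = re.compile(r'(.)(?:.*\1)?', re.DOTALL)

def same_ends_regex(s):
    return SAME_ENDS.fullmatch(s) is not None

def same_ends_index(s):
    # Guard against the empty string, where s[0] would raise IndexError.
    return bool(s) and s[0] == s[-1]

for sample in ['aba', 'ab', 'a', '']:
    print(repr(sample), same_ends_regex(sample), same_ends_index(sample))
```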

Why overengineer this with a regex at all? One principle of programming is to keep it simple:
string[0] == string[-1]
Or is there a need for regex?

The above answer by @Tobias is perfect and simple, but if you want a solution using regex, try the code below.
Code:
import re
string = 'abbaaaa'
pattern = re.compile(r'^(.).*\1$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
Output:
<_sre.SRE_Match object; span=(0, 7), match='abbaaaa'>

I think this is the regex you are trying to execute (note that it only matches strings of exactly three characters):
Code:
import re
string = 'aba'
pattern = re.compile(r'^(\w).(\1)$')
matches = pattern.finditer(string)
for match in matches:
    print(match.group(0))
Output:
aba

If you want to check with a regex, use the code below:
import re
string = 'aba is a cowa'
pat = r'^(.).*\1$'
if re.findall(pat, string):
    print(string)
re.findall returns the captured first character when the first and last characters of the string match, so the string is printed; otherwise the list is empty and the string is skipped.

Trouble joining a list of multiple strings back to a list of a single string in Python

I am trying to write a Python program to check if the phrases in a file occur in a document. My program works fine until it hits a phrase such as "happy(+) feet". I think the error is related to the "(+)" in the phrase; however, I am not sure how to revise my regular expression to make it work.
This is my code:
import re
handle = open('document.txt', 'r')
text = handle.read()
lst = list()
with open('phrases.txt', 'r') as phrases:
    for phrase in phrases:
        phrase = phrase.rstrip()
        if len(phrase) > 0 and phrase not in lst:
            lst.append(phrase)
counts = {}
for each_phrase in lst:
    word = each_phrase.rsplit()
    pattern = re.compile(r'%s' % r'\s+'.join(word), re.IGNORECASE)
    counts[each_phrase] = len(pattern.findall(text))
for key, value in counts.items():
    if value > 0:
        print(key, ',', value)
handle.close()

You need to apply re.escape to each word:
word = map(re.escape, each_phrase.rsplit())
And perhaps change \s+ to \s* to make the whitespace optional:
pattern = re.compile(r'%s' % r'\s*'.join(word), re.IGNORECASE)
The parentheses ( and ) and the plus sign + are special regex characters, so outside a character class they must be escaped to match literally.
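Put together, the escaping step might look like the following sketch. The function name and the sample text are hypothetical, and it keeps \s+ between words rather than \s*:

```python
import re

def count_phrase(phrase, text):
    """Count case-insensitive occurrences of phrase, matched literally."""
    # Escape each word so that '(', ')' and '+' lose their regex meaning.
    words = map(re.escape, phrase.split())
    pattern = re.compile(r'\s+'.join(words), re.IGNORECASE)
    return len(pattern.findall(text))

# Hypothetical document text for illustration.
text = "Happy(+)  feet, then happy(+) feet again."
print(count_phrase("happy(+) feet", text))
```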

Python regex - understanding the difference between match and search

From what I figured,
match: given a string str and a pattern pat, match checks if str matches the pattern from str's start.
search: given a string str and a pattern pat, search checks if str matches the pattern from every index of str.
If so, is there any point in using '^' at the start of a regex with match?
From what I understood, since match already checks from the start, there isn't. I'm probably wrong; where is my mistake?

I believe there is no use. The following is copy/pasted from: http://docs.python.org/library/re.html#search-vs-match
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object at ...>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object at ...>
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object at ...>

When calling the function re.match specifically, the ^ character has little meaning, because this function begins the matching process at the beginning of the string. However, it does have meaning for other functions in the re module, and when calling match on a compiled regular expression object.
For example:
text = """\
Mares eat oats
and does eat oats
"""
print(re.findall(r'^(\w+)', text, re.MULTILINE))
This prints:
['Mares', 'and']
With a re.findall() and re.MULTILINE enabled, it gives you the first word (with no leading whitespace) on each line of your text.
It can be useful for something more complex, like lexical analysis with regular expressions, where you pass the compiled pattern a starting position in the text (for example the end position of the previous match). See the documentation for the RegexObject.match method (Pattern.match in Python 3).
Simple lexer / scanner as an example:
import re
text = """\
Mares eat oats
and does eat oats
"""
pattern = r"""
(?P<firstword>^\w+)
|(?P<lastword>\w+$)
|(?P<word>\w+)
|(?P<whitespace>\s+)
|(?P<other>.)
"""
rx = re.compile(pattern, re.MULTILINE | re.VERBOSE)
def scan(text):
    pos = 0
    m = rx.match(text, pos)
    while m:
        toktype = m.lastgroup
        tokvalue = m.group(toktype)
        pos = m.end()
        yield toktype, tokvalue
        m = rx.match(text, pos)
for tok in scan(text):
    print(tok)
which prints
('firstword', 'Mares')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')
('firstword', 'and')
('whitespace', ' ')
('word', 'does')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')
This distinguishes between types of word; a word at the beginning of a line, a word at the end of a line, and any other word.

In normal mode, you don't need ^ if you are using match.
But in multiline mode (re.MULTILINE) it can be useful, because ^ then matches not only at the beginning of the whole string but also at the beginning of every line, for example when a compiled pattern's match method is given a starting position.
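A small sketch of that behaviour, using a compiled pattern's match method with an explicit start position (the text and the positions are chosen for illustration):

```python
import re

text = "one two\nthree"
anchored = re.compile(r'^\w+', re.MULTILINE)
plain = re.compile(r'\w+')

print(anchored.match(text, 0))   # matches 'one': start of the string
print(anchored.match(text, 4))   # None: position 4 is in the middle of a line
print(anchored.match(text, 8))   # matches 'three': just after the newline
print(plain.match(text, 4))      # matches 'two': no anchor to satisfy
```

So with a start position, ^ in multiline mode restricts match to positions that really begin a line, which is what the lexer example relies on for its firstword rule.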
