Python Regex to find String between two strings

I am trying to use a regex to look through a specific part of a string and extract what lies between two markers, but I can't get the pattern right.
My biggest issue is forming the regex pattern itself. I've tried a bunch of variations close to the example below, so it should be close.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = (re.search("(?<=Link Created)(.+?)(?=Research Done)", text).lower())
# Gets rid of whitespace in case they move the []/[x] around
result = result.replace(" ", "")
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")
Happy Path:
I take the string (text) and use a regex to get the substring between Link Created and Research Done.
Then I make the result lowercase and strip whitespace in case they move the []/[x] around. Finally, the code checks the string (result) for '[]' or '[x]' and prints the outcome.
Actual Output:
At the moment all I keep getting is None because the regex syntax is off...

If you want . to match newlines, you have to use the re.S option.
Also, it would be better to check whether the regex matched before making further calls. Your call to lower() gave me an error because the regex didn't match (re.search returned None), so it is safer to call result.group(0).lower() only when result is truthy.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = (re.search("(?<=Link Created)(.+?)(?=Research Done)", text, re.S))
if result:
    # Gets rid of whitespace in case they move the []/[x] around
    result = result.group(0).lower().replace(" ", "")
    if any(x in result for x in toFind):
        print("Exists")
    else:
        print("Doesn't Exist")
else:
    print("re did not match")
PS: all the re options are documented in the re module documentation. Search for re.DOTALL for the details on re.S (they're synonyms). If you want to combine options, use bitwise OR. E.g., re.S|re.I will have . match newline and do case-insensitive matching.

I believe it's the \n newline characters giving issues. You can get around this using [\s\S]+ as such:
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# New regex to match text between the two markers
result = re.search(r"Link Created([\s\S]+)Research Done", text).group(1)
# Remove all newlines, tabs, whitespace and column separators
result = re.sub(r"[\n\t\s\|]*", "", result)
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")

Seems like regex is overkill for this particular job unless I am missing something (also not clear to me why you need the step that removes the whitespace from the substring). You could just split on "Link Created" and then split the following string on "Research Done".
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
s = text.split("Link Created")[1].split("Research Done")[0].lower()
if "[]" in s or "[x]" in s:
    print("Exists")
else:
    print("Doesn't Exist")
# Exists

Related

Regex to exclude words followed by space

I tried a lot of solutions but can't get this Regex to work.
The string-
"Flow Control None"
I want to exclude "Flow Control" plus the blank space, and only return whatever is on the right.
Since you have tagged your question with #python and #regex, I'll outline a simple solution to your problem using these tools. Furthermore, the other two answers don't really tackle the exact problem of matching "whatever is on the right" of your "Flow Control " prefix.
First, start by importing the re builtin module (read the docs).
import re
Define the pattern you want to match. Here, we're matching "whatever is on the right" ((?P<suffix>.+)$) of ^Flow Control .
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
Grab the match for a given string (e.g. "Flow Control None")
suffix = pattern.search("Flow Control None").group("suffix")
print(suffix) # Out: None
Hopefully, this complete working example will also help you
import re
def get_suffix(text: str):
    pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
    matches = pattern.search(text)
    return matches.group("suffix") if matches else None
examples = [
    "Flow Control None",
    "Flow Control None None",
    "Flow Control None",
    "Flow Control ",
]
for example in examples:
    suffix = get_suffix(text=example)
    if suffix:
        print(f"Matched: {repr(suffix)}")
    else:
        print(f"No matches for: {repr(example)}")
Use split like so:
my_str = 'Flow Control None'
out_str = my_str.split()[-1]
# 'None'
Or use re.findall:
import re
out_str = re.findall(r'^.*\s(\S+)$', my_str)[0]
If you really want a purely regex solution try this: (?<= )[a-zA-Z]*$
The (?<= ) matches a single ' ' but doesn't include it in the match. [a-zA-Z]* matches anything from a to z or A to Z any number of times. $ matches the end of the line.
You could also try replacing the * with a + if you want to ensure that your match has at least one letter (* will produce a 0-length match if your string ends in a space, + will match nothing).
But it may be clearer to do something like
data = "Flow Control None"
split = data.split(' ')
split[len(split) - 1] # returns "None"
EDIT data.split(' ')[-1] also returns "None"
or
data[data.rfind(' ') + 1:] # returns "None"
that don't involve regexes at all.
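To see the difference between * and + in the lookbehind pattern described above, a small check like the following may help. The sample strings are taken from the question plus one with a trailing space; the variable names are mine.

```python
import re

star = re.compile(r"(?<= )[a-zA-Z]*$")  # zero or more letters before end of line
plus = re.compile(r"(?<= )[a-zA-Z]+$")  # at least one letter before end of line

results = {}
for sample in ("Flow Control None", "Flow Control "):
    star_match = star.search(sample)
    plus_match = plus.search(sample)
    results[sample] = (
        star_match.group(0) if star_match else None,
        plus_match.group(0) if plus_match else None,
    )
    print(repr(sample), results[sample])
```

With the trailing-space input, the * version yields an empty string while the + version yields no match at all, so the caller can tell "nothing there" apart from a real value.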

How to replace a word which occurs before another word in python

I want to replace(re-spell) a word A in a text string with another word B if the word A occurs before an operator. Word A can be any word.
E.G:
Hi I am Not == you
Since "Not" occurs before operator "==", I want to replace it with alist["Not"]
So, the sentence above should be changed to
Hi I am alist["Not"] == you
Another example
My height > your height
should become
My alist["height"] > your height
Edit:
On @Paul's suggestion, I am adding the code I wrote myself.
It works, but it's too bulky and I am not happy with it.
operators = ["==", ">", "<", "!="]
text_list = text.split(" ")
for index in range(len(text_list)):
    if text_list[index] in operators:
        prev = text_list[index - 1]
        if "." in prev:
            tokens = prev.split(".")
            prev = "alist"
            for token in tokens:
                prev = "%s[\"%s\"]" % (prev, token)
        else:
            prev = "alist[\"%s\"]" % prev
        text_list[index - 1] = prev
text = " ".join(text_list)
This can be done using regular expressions
import re
...
def replacement(match):
    return "alist[\"{}\"]".format(match.group(0))
...
re.sub(r"[^ ]+(?= +==)", replacement, s)
If the space between the word and the "==" in your case is not needed, the last line becomes:
re.sub(r"[^ ]+(?= *==)", replacement, s)
I'd highly recommend you to look into regular expressions, and the python implementation of them, as they are really useful.
Explanation for my solution:
re.sub(pattern, replacement, s) replaces occurrences of the pattern, given as a regular expression, with a given string or with the output of a function.
I use the output of a function that puts the whole match into the 'alist["..."]' construct (match.group(0) returns the whole match).
[^ ] match anything but space.
+ match the last subpattern as often as possible, but at least once.
* match the last subpattern as often as possible, but it is optional.
(?=...) is a lookahead. It checks if the stuff after the current cursor position matches the pattern inside the parentheses, but doesn't include them in the final match (at least not in .group(0), if you have groups inside a lookahead, those are retrievable by .group(index)).
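The pieces explained above can be combined to cover all four operators from the question, not only "==". This is only a sketch: the OPERATORS list is copied from the question's code, wrap_operands() is a hypothetical helper name, and the dotted-name case (a.b becoming alist["a"]["b"]) from the question is not handled here.

```python
import re

# Operators taken from the question's own list.
OPERATORS = ["==", "!=", ">", "<"]
operator_group = "|".join(re.escape(op) for op in OPERATORS)

# A run of non-space characters, followed (lookahead only) by spaces and an operator.
pattern = re.compile(r"[^ ]+(?= +(?:%s))" % operator_group)

def replacement(match):
    return 'alist["{}"]'.format(match.group(0))

def wrap_operands(text):
    return pattern.sub(replacement, text)

first = wrap_operands("Hi I am Not == you")
second = wrap_operands("My height > your height")
print(first)
print(second)
```

Because the operator sits inside the lookahead, it is checked but never consumed, so it stays in the output untouched.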
text = "Hi I am Not == you"
s = text.split()
y = ''
str2 = ''
for x in s:
    if x == "==":
        # y holds the word seen just before the operator
        str2 = text.replace(y, 'alist["' + y + '"]')
        break
    y = x
print(str2)
You could try using the regular expression library. I was able to create a simple solution to your problem, as shown here.
import re
data = "Hi I am Not == You"
x = re.search(r'(\w+) ==', data)
print(x.groups())
In this code, re.search looks for one or more word characters followed by the operator (" ==") and stores the resulting match object in the variable x. The whole match is "Not ==", and x.groups() returns the captured word as the tuple ('Not',).
Then, for swapping, you could use the re.sub() method that CodenameLambda suggested.
I'd also recommend learning how to use regular expressions, as they are useful for solving many different problems and are similar across programming languages.
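For the swapping step, one possible way is to keep the capture group and refer to it in the replacement string as \1. This sketch moves the " ==" part into a lookahead so that only the word is replaced; that change is my own and is not in the answer's pattern.

```python
import re

data = "Hi I am Not == You"

# (\w+) captures the word; the lookahead requires " ==" after it without consuming it.
swapped = re.sub(r'(\w+)(?= ==)', r'alist["\1"]', data)
print(swapped)
```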

Efficient way to search for invalid characters in python

I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan the whole post for invalid characters. What I have so far is below, although it does not work correctly and I do not think the approach is very efficient.
def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    words = topic_message.split()
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    for word in words:
        if (re.match(r'[^<>/\{}[]~`]$',topic_message)):
            raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))
    return topic_message
Thanks for any help.
For a regex solution, there are two ways to go here:
Find one invalid char anywhere in the string.
Validate every char in the string.
Here is a script that implements both:
import re
topic_message = 'This topic is a-ok'
# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]")
if re1.search(topic_message):
    print("RE1: Invalid char detected.")
else:
    print("RE1: No invalid char detected.")
# Option 2: Validate all chars in string.
re2 = re.compile(r"^[^<>/{}[\]~`]*$")
if re2.match(topic_message):
    print("RE2: All chars are valid.")
else:
    print("RE2: Not all chars are valid.")
Take your pick.
Note: the original regex erroneously has a right square bracket in the character class which needs to be escaped.
Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious to find out which of these methods would actually be fastest, so I decided to measure them. Here are the benchmark data and statements measured and the timeit result values:
Test data:
r"""
TEST topic_message STRINGS:
ok: 'This topic is A-ok. This topic is A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'
MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""
Results:
r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method  Ok-time  Bad-time
1       1.054    1.190
2       1.830    1.636
3       4.364    4.577
"""
The benchmark tests show that Option 1 is slightly faster than option 2 and both are much faster than the set().intersection() method. This is true for strings which both match and don't match.
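For anyone who wants to repeat the measurement, a harness along these lines might do. It is a sketch only: the loop count is reduced, the structure is my own, and the timings will differ by machine and Python version, so no numbers are claimed here.

```python
import re
import timeit

INVALID_CHARS = "<>/{}[]~`"
re1 = re.compile(r"[<>/{}[\]~`]")      # Method 1: find one invalid char
re2 = re.compile(r"^[^<>/{}[\]~`]*$")  # Method 2: validate every char

samples = {
    "ok": "This topic is A-ok. This topic is A-ok.",
    "bad": "This topic is <not>-ok. This topic is {not}-ok.",
}

methods = {
    "re1.search": lambda s: re1.search(s),
    "re2.match": lambda s: re2.match(s),
    "set.intersection": lambda s: set(INVALID_CHARS).intersection(s),
}

timings = {}
for label, message in samples.items():
    for name, func in methods.items():
        # Each timing runs immediately, so the loop variables are safe to close over.
        timings[(name, label)] = timeit.timeit(lambda: func(message), number=100_000)

for (name, label), seconds in sorted(timings.items()):
    print(f"{name:18s} {label:4s} {seconds:.3f}s")
```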
You have to be much more careful when using regular expressions - they are full of traps.
In the case of [^<>/\{}[]~] the first ] closes the character class, which is probably not what you intended. If you want to use an unescaped ] in a character class, it has to be the first character after the [, e.g. []^<>/\{}[~]
A simple test confirms this:
>>> import re
>>> re.search("[[]]","]")
>>> re.search("[][]","]")
<_sre.SRE_Match object at 0xb7883db0>
regex is overkill for this problem anyway
def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    invalid_chars = '^<>/\{}[]~`$'
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    if set(invalid_chars).intersection(topic_message):
        raise forms.ValidationError(_(u'Topic message cannot contain the following: %s'%invalid_chars))
    return topic_message
If efficiency is a major concern I would re.compile() the re string, since you're going to use the same regex many times.
re.match and re.search behave differently. Splitting words is not required to search using regular expressions.
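import re
# The ] is escaped and the leading ^ dropped, so the class lists the invalid characters themselves
symbols_re = re.compile(r"[<>/\{}[\]~`]")
if symbols_re.search(self.cleaned_data['topic_message']):
    pass  # raise the validation error here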
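The difference between re.match and re.search mentioned above can be seen in a short check. The character class is the corrected one from the first answer; the sample message is made up for the demo.

```python
import re

symbols_re = re.compile(r"[<>/{}[\]~`]")

message = "hello <b>world</b>"  # hypothetical forum post

at_start = symbols_re.match(message)   # only tries position 0
anywhere = symbols_re.search(message)  # scans the whole string

print(at_start)          # None: "h" is not an invalid character
print(anywhere.start())  # index of the first invalid character
```

So for "does the post contain an invalid character anywhere", search is the call that fits, and no word splitting is needed.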
I can't say what would be more efficient, but you certainly should get rid of the $ (unless it's an invalid character for the message)... right now you only match the re if the characters are at the end of topic_message because $ anchors the match to the right-hand side of the line.
In any case, you need to scan the entire message. So wouldn't something simple like this work?
def checkMessage(topic_message):
    for char in topic_message:
        if char in "<>/\\{}[]~`":
            return False
    return True
is_valid = not any(k in text for k in '<>/{}[]~`')
I agree with gnibbler, regex is overkill for this situation. After removing these unwanted chars you'll probably want to remove unwanted words as well; here's a basic way to do it:
def remove_bad_words(title):
    '''Helper to remove bad words from a sentence based on a list of words.
    '''
    # Build a new list instead of removing items while iterating over it,
    # which would skip the word that follows each removed one.
    # BAD_WORDS is a list of unwanted words
    word_list = [word for word in title.split(' ') if word not in BAD_WORDS]
    # let's build the string again
    return u' '.join(word_list)
Example: just tailor to your needs.
### valid chars: 0-9 , a-z, A-Z only
import re
REGEX_FOR_INVALID_CHARS=re.compile( r'[^0-9a-zA-Z]+' )
list_of_invalid_chars_found=REGEX_FOR_INVALID_CHARS.findall( topic_message )

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single one: re.findall('c?[0-9]+', text).
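Both points can be sketched in a few lines. The first pattern is the question's regex with an optional space added after the comma (my adjustment, so that it matches the sample at all); the second part is the two-step approach, with the intermediate pattern start:([^;]*); being one reasonable choice rather than the only one.

```python
import re

string = "start: c12354, c3456, 34526; other stuff that I don't care about"

# A repeated group keeps only its last capture.
repeated = re.search(r"start: (c?[0-9]+,? ?)+;", string)
print(repeated.groups())

# Step 1: verify the framing and isolate the part between "start:" and ";".
framed = re.match(r"start:([^;]*);", string)

# Step 2: collect every code from that part.
codes = re.findall(r"c?[0-9]+", framed.group(1)) if framed else []
print(codes)
```

If the framing check fails, codes is simply an empty list, which doubles as the verification step the question asks for.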
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns True if it starts with this string
s.endswith(";") # returns True if it ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']

Python Regular expression must strip whitespace except between quotes

I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This will match anything between quotes, but now it needs to ignore that match and add matches for whitespace..
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)
print(stripwhite('This is a string with some "text in quotes."'))
Here is a one-liner version, based on @kindall's idea, that does not use regex at all. First split on ", then split() every other item and re-join the pieces, which takes care of the whitespace:
stripWS = lambda txt: '"'.join(it if i % 2 else ''.join(it.split())
                               for i, it in enumerate(txt.split('"')))
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
import shlex
print(" ".join(shlex.split('Hello "world this is" a test')))
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here is a slightly longer version that checks for a quote without its pair. It only deals with one style of start and end delimiter (adaptable, for example, with start,end='()').
start, end = '"', '"'
for test in ('Hello "world this is" atest',
             'This is a string with some " text inside in quotes."',
             'This is without quote.',
             'This is sentence with bad "quote'):
    result = ''
    while start in test:
        clean, _, test = test.partition(start)
        clean = clean.replace(' ', '') + start
        inside, tag, test = test.partition(end)
        if not tag:
            raise SyntaxError('Missing end quote %s' % end)
        else:
            clean += inside + tag  # whitespace inside the quotes is kept
        result += clean
    result += test.replace(' ', '')
    print(result)
