I tried a lot of solutions but can't get this Regex to work.
The string-
"Flow Control None"
I want to exclude "Flow Control" plus the blank space, and only return whatever is on the right.
Since you have tagged your question with #python and #regex, I'll outline a simple solution to your problem using these tools. Furthermore, the other two answers don't really tackle the exact problem of matching "whatever is on the right" of your "Flow Control " prefix.
First, start by importing the re builtin module (read the docs).
import re
Define the pattern you want to match. Here, we're matching "whatever is on the right" ((?P<suffix>.+)$) of ^Flow Control .
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
Grab the match for a given string (e.g. "Flow Control None")
suffix = pattern.search("Flow Control None").group("suffix")
print(suffix) # Out: None
Hopefully, this complete working example will also help you
import re
def get_suffix(text: str):
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
matches = pattern.search(text)
return matches.group("suffix") if matches else None
examples = [
"Flow Control None",
"Flow Control None None",
"Flow Control None",
"Flow Control ",
]
for example in examples:
suffix = get_suffix(text=example)
if suffix:
print(f"Matched: {repr(suffix)}")
else:
print(f"No matches for: {repr(example)}")
Use split like so:
my_str = 'Flow Control None'
out_str = my_str.split()[-1]
# 'None'
Or use re.findall:
import re
out_str = re.findall(r'^.*\s(\S+)$', my_str)[0]
If you really want a purely regex solution try this: (?<= )[a-zA-Z]*$
The (?<= ) matches a single ' ' but doesn't include it in the match. [a-zA-Z]* matches anything from a to z or A to Z any number of times. $ matches the end of the line.
You could also try replacing the * with a + if you want to ensure that your match has at least one letter (* will produce a 0-length match if your string ends in a space, + will match nothing).
But it may be clearer to do something like
data = "Flow Control None"
split = data.split(' ')
split[len(split) - 1] # returns "None"
EDIT data.split(' ')[-1] also returns "None"
or
data[data.rfind(' ') + 1:] # returns "None"
that don't involve regexes at all.
Related
I have this code to check if a string is a valid IPv4 address:
import re
def is_ip4(IP):
label = "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
pattern = re.compile("(" + label + "\.){3}" + label + "$")
if pattern.match(IP):
print("matched!")
else:
print("No!")
it works fine. but if I remove the parentheses from the label, as this
import re
def is_ip4(IP):
label = "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"
pattern = re.compile("(" + label + "\.){3}" + label + "$")
if pattern.match(IP):
print("matched!")
else:
print("No!")
it show valid ip for "2090.1.11.0", "20.1.11.0", but not for "2.1.11.0". I'm actually a bit confused for the cases with vs without parentheses. Can someone explain this for me? thanks
The reason you need the parentheses is because of the two-step process you're using. By itself, the parentheses don't do anything (other than capturing in a group). But you're also doing this:
pattern = re.compile("(" + label + "\.){3}" + label + "$")
The label regex is copied twice, first for three repetitions followed by a period. That copy is fine (almost), because in the statement, it is enclosed in parentheses once more. However, the second copy is outside any parentheses, so you end up with a regex like (simplified):
pattern == '(a|ab|abc\.){3}a|ab|abc$'
This matches if either (a|ab|abc\.){3}a matches, or ab or abc. With parentheses, it would be like:
pattern == '((a|ab|abc)\.){3}(a|ab|abc)$'
So, although the parentheses appear superfluous, they are not for two reasons. They are keeping the period separate from the last option abc and they are keeping the final choices together and apart from the first bit.
However, you shouldn't be doing this in the first place. Just use:
from ipaddress import ip_address
def is_ip4(ip):
try:
ip_address(ip)
return True
except ValueError:
return False
No installation required, it's a standard library.
The reason you get a match for '2090.1.11.0' is because matching it to this:
'([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$'
Comes down to matching it to this:
'([0-9]){3}[0-9]'
Since, [0-9] is the first option in the 'or' expression in parentheses, repeated three times and the second [0-9] is the first option in the 'or' expression after the {3}.
Note that the $ you put in to ensure the entire string was matches is lumped in with the final 'or' option, so that doesn't do anything here.
Try running the below and note the identical first match:
import re
print(re.findall('([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$', '2090.1.11.0'))
print(re.findall('([0-9]){3}[0-9]', '2090.1.11.0'))
(ignore the second match on the first line, not as relevant)
My code does the following:
Take a large text file (i.e. a legal document that is 300 pages as a PDF).
Find a certain keyword (e.g. "small").
Return n words to the left and n words to the right of the keyword.
NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.
Here is my code:
fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()
def search(searchText, doc, n):
#Searches for text, and retrieves n words either side of the text, which are returned separately
surround = r"\s*(\S*)\s*"
groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()
return groups[:n],groups[n:]
Here is the nasty culprit:
print search("\$27.5 million", document, 10)
Here's how you can test this code:
Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.
How do I speed up my code?
How about using re.search (or even string.find if you're only searching for fixed strings) to find the string, without any surrounding capturing groups. Then you use the position and length of the match (.start and .end on a re matchobject, or the return value of find plus the length of the search string). Get the substring before the match and do /\s*(\S*)\s*\z/ etc. on it, and get the substring after the match and do /\A\s*(\S*)\s*/ etc. on it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt up two consecutive \s*s like you do. I think r'\S+'.join([[r'\s+']*(n)) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).
I see several problems here. The First, and probably worst, is that everything in your "surround" regex is, not just optional but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:
regex: \s* \S* \s* tritani
offset 0: '' 'Lorem' ' ' FAIL
'' 'Lorem' '' FAIL
'' 'Lore' '' FAIL
'' 'Lor' '' FAIL
'' 'Lo' '' FAIL
'' 'L' '' FAIL
'' '' '' FAIL
...then it bumps ahead one position and starts over:
offset 1: '' 'orem' ' ' FAIL
'' 'orem' '' FAIL
'' 'ore' '' FAIL
'' 'or' '' FAIL
'' 'o' '' FAIL
'' '' '' FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I sure you can see where this is is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.
Finally, you may have noticed the \s*\s* the auto-generated regex
You could try using mmap and appropriate regex flags, eg (untested):
import re
import mmap
with open('your file') as fin:
mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
for match in re.finditer(your_re, mf, flags=re.DOTALL):
print match.group() # do something with your match
This'll only keep memory usage lower though...
The alternative is to have a sliding window of words (simple example of just single word before and after)...:
import re
import mmap
from itertools import islice, tee, izip_longest
with open('testingdata.txt') as fin:
mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
words = (m.group() for m in re.finditer('\w+', mf, flags=re.DOTALL))
grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
for group in izip_longest(*grouped, fillvalue=''):
if group[1] == 'something': # check criteria for group
print group
I think you are going about this completely backwards (I'm a little confused as to what you are doing in the first place!)
I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox
with re_search you could solve this problem with something like:
from cloudtb import textools
data_list = textools.re_search('my match', pdf_text_str) # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
if isinstance(regpart, basestring):
words = textools.re_search('\w+', regpart)
# do stuff with words
else:
# I Think you are ignoring these? Not totally sure
Here is a link on how to use and how it works:
http://cloudformdesign.com/?p=183
In addition to this, your regular expressions would also be printed out in more readable format.
You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.
I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then using a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, Optional, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
for line in f:
try:
result = parser.parseString(line)
codes = [c[1] for c in result[1:-1]]
# Do something with teh codez...
except ParseException exc:
# Oh noes: string doesn't match!
continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This will match anything between quotes, but now it needs to ignore that match and add matches for whitespace..
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
lst = text.split('"')
for i, item in enumerate(lst):
if not i % 2:
lst[i] = re.sub("\s+", "", item)
return '"'.join(lst)
print stripwhite('This is a string with some "text in quotes."')
Here is a one-liner version, based on #kindall's idea - yet it does not use regex at all! First split on ", then split() every other item and re-join them, that takes care of whitespaces:
stripWS = lambda txt:'"'.join( it if i%2 else ''.join(it.split())
for i,it in enumerate(txt.split('"')) )
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
print " ".join(shlex.split('Hello "world this is" a test'))
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
if m.group(1):
return ""
else:
return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here little longish version with check for quote without pair. Only deals with one style of start and end string (adaptable for example for example start,end='()')
start, end = '"', '"'
for test in ('Hello "world this is" atest',
'This is a string with some " text inside in quotes."',
'This is without quote.',
'This is sentence with bad "quote'):
result = ''
while start in test :
clean, _, test = test.partition(start)
clean = clean.replace(' ','') + start
inside, tag, test = test.partition(end)
if not tag:
raise SyntaxError, 'Missing end quote %s' % end
else:
clean += inside + tag # inside not removing of white space
result += clean
result += test.replace(' ','')
print result
I am close but I am not sure what to do with the restuling match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end up with a space. This is what I want. However this returns a Match object that I dont' know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression do what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print p.group(1)
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print p.group(1) # first thing matched inside of ( .. )
But as usually with regex, there are tons of examples that break this, for example if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
p.group(0) should return guy. If you want to find out what function an object has, you can use the dir(p) method to find out. This will return a list of attributes and methods that are available for that object instance.
As it's evident from the answers so far regex is the most efficient solution for your problem. Answers differ slightly regarding what you allow to be followed by the #:
[^ ] anything but space
\w in python-2.x is equivalent to [A-Za-z0-9_], in py3k is locale dependent
If you have better idea what characters might be included in the user name you might adjust your regex to reflect that, e.g., only lower case ascii letters, would be:
[a-z]
NB: I skipped quantifiers for simplicity.
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: """If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space."" But this is incorrect -- that pattern is a character class which will match ONE character in the set #/.* and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.