Regex for word exclusion in Python

I have a regular expression '[\w_-]+' which allows alphanumeric characters, underscores, and hyphens.
I have a set of words in a python list which I don't want to allow
listIgnore = ['summary', 'config']
What changes need to be made in the regex?
P.S: I am new to regex

>>> line="This is a line containing a summary of config changes"
>>> listIgnore = ['summary', 'config']
>>> patterns = "|".join(listIgnore)
>>> print re.findall(r'\b(?!(?:' + patterns + r'))[\w_-]+', line)
['This', 'is', 'a', 'line', 'containing', 'a', 'of', 'changes']

This question intrigued me, so I set about for an answer:
'^(?!summary)(?!config)[\w_-]+$'
Now this only works if you want to match the regex against a complete string:
>>> print re.match('^(?!summary)(?!config)[\w_-]+$', 'config_test')
None
>>> re.match('^(?!summary)(?!config)[\w_-]+$', 'confi_test')
<_sre.SRE_Match object at 0x21d34a8>
So to use your list, just add another (?!<word here>) after the ^ in your regex for each word. These are called negative lookaheads.
If you're trying to match within a string (i.e. without the ^ and $) then I'm not sure it's possible. For instance the regex will just pick a subset of the string that doesn't match. Example: ummary for summary.
Obviously the more exclusions you pick the more inefficient it will get. There's probably better ways to do it.
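One caveat with the lookahead patterns above: `(?!config)` also rejects longer words that merely *start* with an ignored word (e.g. configuration for config). Adding a word boundary inside the lookahead restricts the exclusion to exact words; a minimal sketch in modern Python:

```python
import re

line = "This is a summary of configuration and config changes"
listIgnore = ['summary', 'config']
# \b after the alternation: reject only whole words, not prefixes
pattern = r'\b(?!(?:' + "|".join(listIgnore) + r')\b)[\w-]+'
print(re.findall(pattern, line))
# ['This', 'is', 'a', 'of', 'configuration', 'and', 'changes']
```

Note that 'configuration' survives even though 'config' is ignored, which is usually what you want.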


How to escape null characters i.e. [''] while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = filter(None, str_list)
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
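A quick sketch of that named-group variant (the names groupA/groupB come from the pattern above):

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'
m = re.search(r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.', f)
print(m.groupdict())
# {'groupA': '20111007T084734', 'groupB': '20111008T023142'}
```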
If the timestamps are always after the second _ then you can use str.split and str.strip:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
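A caution about `strs.strip(".txt")`: `strip` treats its argument as a *set of characters* to remove from both ends, not a literal suffix, so it only works here by luck. A name ending in `2t.txt`, for instance, would lose the extra `t` as well. Removing the literal suffix (with `str.removesuffix` on Python 3.9+, or slicing) is safer; a sketch:

```python
strs = "000014_L_20111007T084734-20111008T023142.txt"

# strip() removes any of the characters '.', 't', 'x' from the ends:
assert "abc2t.txt".strip(".txt") == "abc2"   # the 't' before '.txt' is eaten too

# Remove the literal suffix instead (str.removesuffix needs Python 3.9+)
name = strs.removesuffix(".txt")
print(name.split("_", 2)[-1].split("-"))
# ['20111007T084734', '20111008T023142']
```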
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue- your list of matches suddenly have zero-length strings in it and you don't want that.
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Python Regex Simple Split - Empty at first index

I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]; the date has no separators and is digits-only, in YYYYMMDD order (don't think that matters).
I am trying to use re. I have a working version by writing:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
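If the date should also be validated, the fixed-width slice pairs naturally with `datetime.strptime`, which raises `ValueError` on a malformed date (a sketch, not from the original answer):

```python
from datetime import datetime

test = '20170125NBCNightlyNews'
date_part, show_part = test[:8], test[8:]
# strptime raises ValueError if the first 8 chars are not a valid YYYYMMDD date
parsed = datetime.strptime(date_part, '%Y%m%d')
print(parsed.date(), show_part)
# 2017-01-25 NBCNightlyNews
```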
What you are actually doing with re.split('(^\d+)', test) is splitting your test string on an occurrence of one or more digits.
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result in the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get an empty string which is before first set of digits.
To avoid that you can use filter:
>>> print filter(None, re.split('(\d+)',test))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile('(\d+)(\w+)')
result = re.match(pattern, test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
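The same match reads a little more clearly with named groups (the names `date` and `show` here are my own, not from the original answer):

```python
import re

test = '20170125NBCNightlyNews'
m = re.match(r'(?P<date>\d+)(?P<show>\w+)', test)
print(m.group('date'), m.group('show'))
# 20170125 NBCNightlyNews
```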
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>>re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']

Python Regex Subgroup Capturing

I'm trying to parse the following string:
constructor: function(some, parameters, here) {
With the following regex:
re.search("(\w*):\s*function\((?:(\w*)(?:,\s)*)*\)", line).groups()
And I'm getting:
('constructor', '')
But I was expecting something more like:
('constructor', 'some', 'parameters', 'here')
What am I missing?
If you change your pattern to:
print re.search(r"(\w*):\s*function\((?:(\w+)(?:,\s)?)*\)", line).groups()
You'll get:
('constructor', 'here')
This is because (from docs):
If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
If you can do this in one step, I don't know how. Your alternative, of course is to do something like:
def parse_line(line):
cons, args = re.search(r'(\w*):\s*function\((.*)\)', line).groups()
mats = re.findall(r'(\w+)(?:,\s*)?', args)
return [cons] + mats
print parse_line(line) # ['constructor', 'some', 'parameters', 'here']
One option is to use more advanced regex instead of the stock re. Among other nice things, it supports captures, which, unlike groups, save every matching substring:
>>> line = "constructor: function(some, parameters, here) {"
>>> import regex
>>> regex.search("(\w*):\s*function\((?:(\w+)(?:,\s)*)*\)", line).captures(2)
['some', 'parameters', 'here']
The re module doesn't support repeated captures: the group count is fixed. Possible workarounds include:
1) Capture the parameters as a string and then split it:
match = re.search("(\w*):\s*function\(([\w\s,]*)\)", line).groups()
args = [arg.strip() for arg in match[1].split(",")]
2) Capture the parameters as a string and then findall it:
match = re.search("(\w*):\s*function\(([\w\s,]*)\)", line).groups()
args = re.findall("(\w+)(?:,\s)*", match[1])
3) If your input string has already been verified, you can just findall the whole thing:
re.findall("(\w+)[:,)]", string)
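A quick check of option 3 against the example line:

```python
import re

line = "constructor: function(some, parameters, here) {"
print(re.findall(r"(\w+)[:,)]", line))
# ['constructor', 'some', 'parameters', 'here']
```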
Alternatively, you can use the regex module and captures(), as suggested by @georg.
You might need two operations here (search and findall):
[re.search(r'[^:]+', given_string).group()] + re.findall(r'(?<=[ (])\w+?(?=[,)])', given_string)
Output: ['constructor', 'some', 'parameters', 'here']

Python: Getting text of a Regex match

I have a regex match object in Python. I want to get the text it matched. Say if the pattern is '1.3', and the search string is 'abc123xyz', I want to get '123'. How can I do that?
I know I can use match.string[match.start():match.end()], but I find that to be quite cumbersome (and in some cases wasteful) for such a basic query.
Is there a simpler way?
You can simply use the match object's group function, like:
match = re.search(r"1.3", "abc123xyz")
if match:
doSomethingWith(match.group(0))
to get the entire match. EDIT: as thg435 points out, you can also omit the 0 and just call match.group().
Additional note: if your pattern contains parentheses, you can even get these submatches, by passing 1, 2 and so on to group().
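A small illustration of those submatches (the pattern here is my own, splitting off the letters before the digits):

```python
import re

match = re.search(r"([a-z]+)(1.3)", "abc123xyz")
print(match.group(0), match.group(1), match.group(2))
# abc123 abc 123
```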
You need to put the regex inside "()" to be able to get that part
>>> var = 'abc123xyz'
>>> exp = re.compile(".*(1.3).*")
>>> exp.match(var)
<_sre.SRE_Match object at 0x691738>
>>> exp.match(var).groups()
('123',)
>>> exp.match(var).group(0)
'abc123xyz'
>>> exp.match(var).group(1)
'123'
Note that re.match anchors at the start of the string, so without the surrounding .* the pattern will not match here at all:
>>> var = 'abc123xyz'
>>> exp = re.compile("1.3")
>>> print exp.match(var)
None

Checking and removing extra symbols

I'm interested by removing extra symbols from strings in python.
What could be the most efficient and pythonic way to do that? Is there some grammar module?
My first idea would be to locate the most nested text and walk outward to the left and the right, counting the opening and closing symbols. Then I would remove the surplus symbols on whichever side has too many.
An example would be this string
text = "(This (is an example)"
You can clearly see that the first parenthesis is not balanced by another one. So i want to delete it.
text = "This (is an example)"
The solution has to be independant of the position of the parentheses.
Others example could be :
text = "(This (is another example) )) (to) explain) the question"
That would become :
text = "(This (is another example) ) (to) explain the question"
Had to break this into an answer for formatting. Check Python's re regular expression module.
If I'm understanding what you are asking, look at re.sub. You can use a regular expression to find the character you'd like to remove, and replace them with an empty string.
Suppose we want to remove all instances of '.', '&', and '*'.
>>> import re
>>> s = "abc&def.ghi**jkl&"
>>> re.sub('[\.\&\*]', '', s)
'abcdefghijkl'
If the pattern to be matched is larger, you can use re.compile and pass that as the first argument to sub.
>>> r = re.compile('[\.\&\*]')
>>> re.sub(r, '', s)
'abcdefghijkl'
Hope this helps.
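The re.sub answers above strip *every* occurrence of the symbols; the original question, though, asks to drop only the *unbalanced* parentheses, which a regex alone cannot do (it cannot count nesting). A common approach is a single stack-based pass; a minimal sketch, not from the original thread:

```python
def remove_unbalanced(text):
    """Drop every '(' or ')' that has no matching partner."""
    drop = set()   # indices of characters to remove
    stack = []     # indices of '(' still waiting for a ')'
    for i, ch in enumerate(text):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if stack:
                stack.pop()        # matches the most recent open '('
            else:
                drop.add(i)        # ')' with no opener
    drop.update(stack)             # '(' that were never closed
    return ''.join(ch for i, ch in enumerate(text) if i not in drop)

print(remove_unbalanced("(This (is an example)"))
# This (is an example)
```

This reproduces both transformations from the question, including the longer example with multiple stray parentheses.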
