In the Python re module, I'm making use of re.split():
string = '$aText$bFor$cStack$dOverflow'
parts = re.split(r'\$\w', string)
assert parts == ['Text', 'For', 'Stack', 'Overflow']
My question: is it possible to return the instance of the delimiter at the same time as the list of parts? I'd like to know if the delimiter was $c, $d, etc. preceding the various parts.
I suppose I could do a findall() call first, but then I'd have to line up positions across two lists by hand, which would introduce bugs. That also doesn't seem very Pythonic.
If you put the pattern in a capture group, the delimiters appear in the results:
>>> string = '$aText$bFor$cStack$dOverflow'
>>> re.split(r'(\$\w)', string)
['', '$a', 'Text', '$b', 'For', '$c', 'Stack', '$d', 'Overflow']
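Because the captured delimiters land at the odd indices of the result, you can pair each delimiter with the part that follows it. A small sketch (not part of the original answer):

```python
import re

string = '$aText$bFor$cStack$dOverflow'
pieces = re.split(r'(\$\w)', string)
# pieces alternates: ['', '$a', 'Text', '$b', 'For', ...]
# Skip the leading empty string and zip delimiters with their parts.
pairs = list(zip(pieces[1::2], pieces[2::2]))
print(pairs)
```

This prints [('$a', 'Text'), ('$b', 'For'), ('$c', 'Stack'), ('$d', 'Overflow')].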
Related
I have a regex like --
query = "(A((hh)|(hn)|(n))?)"
and an input inp = "Ahhwps edAn". I want to extract all the matched patterns along with the unmatched (remaining) text, while preserving the order of the input.
The output should look like -- ['Ahh', 'wps ed', 'An'] or ['Ahh', 'w', 'p', 's', ' ', 'e', 'd', 'An'].
I searched online but found nothing.
How can I do this?
The re.split method includes captured submatches in the resulting list.
Capturing groups are the constructs formed with a pair of unescaped parentheses. Your pattern abounds in redundant capturing groups, and re.split will return all of them. You need to remove the unnecessary ones: convert the inner capturing groups to non-capturing ones, and keep just the outer pair of parentheses so the whole pattern is a single capturing group.
Use
re.split(r'(A(?:hh|hn|n)?)', s)
Note that there may be empty elements in the output list. Just use filter(None, result) to get rid of them.
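Putting both steps together on the question's input, as a sketch:

```python
import re

inp = "Ahhwps edAn"
parts = re.split(r'(A(?:hh|hn|n)?)', inp)
# A match at the very start or end of the string leaves an
# empty element behind; filter(None, ...) drops those.
parts = list(filter(None, parts))
print(parts)
```

This prints ['Ahh', 'wps ed', 'An'].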
The match objects' span() method is really useful for what you're after.
import re

pat = re.compile("(A((hh)|(hn)|(n))?)")
inp = "Ahhwps edAn"
result = []
i = k = 0
for m in re.finditer(pat, inp):
    j, k = m.span()
    if i < j:
        result.append(inp[i:j])
    result.append(inp[j:k])
    i = k
if i < len(inp):
    result.append(inp[k:])
print(result)
Here's what the output looks like.
['Ahh', 'wps ed', 'An']
This technique handles any non-matching prefix and suffix text as well. If you use an inp value of "prefixAhhwps edAnsuffix", you'll get the output I think you'd want:
['prefix', 'Ahh', 'wps ed', 'An', 'suffix']
You can try this:
import re
import itertools
new_data = list(itertools.chain.from_iterable(
    [re.findall(".{" + str(len(i) // 2) + "}", i) for i in inp.split()]
))
Output:
['Ahh', 'wps', 'ed', 'An']
I tried using pyparsing to parse a CSV with:
Commas between parenthesis (or brackets, etc): "a(1,2),b" should return the list ["a(1,2)","b"]
Missing values: "a,b,,c," should return the list ['a','b','','c','']
I worked out a solution, but it seems "dirty". Mainly, the Optional sits inside only one of the possible atomics. I think the Optional should be independent of the atomics; that is, I feel it should go somewhere else, for example in the optional arguments of delimitedList, but in my trial and error that was the only place that worked and made sense. It could be in any of the possible atomics, so I chose the first.
Also, I don't fully understand what originalTextFor is doing, but if I remove it, it stops working.
Working example:
import pyparsing as pp

# Function that parses a line of columns separated by commas and returns a list of the columns
def fromLineToRow(line):
    sqbrackets_col = pp.Word(pp.printables, excludeChars="[],") | pp.nestedExpr(opener="[", closer="]")  # matches "a[1,2]"
    parens_col = pp.Word(pp.printables, excludeChars="(),") | pp.nestedExpr(opener="(", closer=")")  # matches "a(1,2)"
    # In the following line:
    # * The "^" means "choose the longest option"
    # * The "pp.Optional" can be in any of the expressions separated by "^". I put it only on the first. It's used for when there are missing values
    atomic = pp.originalTextFor(pp.Optional(pp.OneOrMore(parens_col))) ^ pp.originalTextFor(pp.OneOrMore(sqbrackets_col))
    grammar = pp.delimitedList(atomic)
    row = grammar.parseString(line).asList()
    return row

file_str = \
"""YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""

for line in file_str.splitlines():
    row = fromLineToRow(line)
    print(row)
Prints:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
Is this the right way to do this? Is there a "cleaner" way to use the Optional inside the first atomic?
Working inside-out, I get this:
# chars not in ()'s or []'s - also disallow ','
non_grouped = pp.Word(pp.printables, excludeChars="[](),")
# grouped expressions in ()'s or []'s
grouped = pp.nestedExpr(opener="[",closer="]") | pp.nestedExpr(opener="(",closer=")")
# use OneOrMore to allow non_grouped and grouped together
atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
# or based on your examples, you *could* tighten this up to:
# atomic = pp.originalTextFor(non_grouped + pp.Optional(grouped))
originalTextFor recombines the original input text within the leading and trailing boundaries of the matched expressions, and returns a single string. If you leave this out, then you will get all the sub-expressions in a nested list of strings, like ['a',['2,3']]. You could rejoin them with repeated calls to ''.join, but that would collapse out whitespace (or use ' '.join, but that has the opposite problem of potentially introducing whitespace).
To optionalize the elements of the list, just say so in the definition of the delimited list:
grammar = pp.delimitedList(pp.Optional(atomic, default=''))
Be sure to add the default value, else the empty slots will just get dropped.
With these changes I get:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
You can do this with a regex using the re module, for instance:
>>> import re
>>> re.split(r',\s*(?![^()]*\))', line1)
['a(1,2)', 'b']
>>> re.split(r',\s*(?![^()]*\))', line2)
['a', 'b', '', 'c', '']
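For reference, here is the same thing as a self-contained snippet; line1 and line2 are assumed to be the question's two example strings:

```python
import re

line1 = "a(1,2),b"
line2 = "a,b,,c,"
# Split only on commas not inside parentheses: the negative lookahead
# rejects a comma when a ')' appears before any '(' in the remaining text.
pattern = r',\s*(?![^()]*\))'
print(re.split(pattern, line1))
print(re.split(pattern, line2))
```

This prints ['a(1,2)', 'b'] and ['a', 'b', '', 'c', ''].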
import re

with open('44289614.csv') as f:
    for line in map(str.strip, f):
        l = re.split(r',\s*(?![^()\[\]]*[\)\]])', line)
        print(len(l), l)
Output:
3 ['YEAR', 'a(2,3)', 'b[3,4]']
3 ['1960', '2.8', '3']
3 ['1961', '4', '']
3 ['1962', '', '1']
3 ['1963', '1.27', '3']
Modified from this answer.
I also like this answer, which suggests modifying the input slightly and using quotechar of the csv module.
I tried a simple example with string split, but get some unexpected behavior. Here is the sample code:
def split_string(source, splitlist):
    for delim in splitlist:
        source = source.replace(delim, ' ')
    return source.split(' ')

out = split_string("This is a test-of the,string separation-code!", " ,!-")
print(out)

['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code', '']
As you can see, I get an extra empty string at the end of the list when I use a space as the delimiter argument for split(). However, if I don't pass any argument to split(), there is no empty string at the end of the output list.
From what I read in the Python docs, the default for split() is to split on whitespace. So why does explicitly passing ' ' as the delimiter create an empty string at the end of the output list?
The docs:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
That may happen if you have multiple spaces separating two words.
For example,
'a    b'.split(' ') will return ['a', '', '', '', 'b']
But I would suggest you to use split from re module. Check the example below:
import re
print(re.split(r'[\s,; !]+', 'a b !!!!!!! , hello ;;;;; world'))
When we run the above piece, it outputs ['a', 'b', 'hello', 'world']
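Going back to the original question, a minimal sketch showing the two split() behaviors side by side:

```python
s = 'a    b '
print(s.split(' '))  # every single space separates: empty strings appear
print(s.split())     # default: runs of whitespace collapse, edges trimmed
```

The first call prints ['a', '', '', '', 'b', ''], the second ['a', 'b'].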
I have a regular expression '[\w_-]+' which allows alphanumeric characters, underscores and hyphens.
I have a set of words in a python list which I don't want to allow
listIgnore = ['summary', 'config']
What changes need to be made in the regex?
P.S: I am new to regex
>>> line="This is a line containing a summary of config changes"
>>> listIgnore = ['summary', 'config']
>>> patterns = "|".join(listIgnore)
>>> print(re.findall(r'\b(?!(?:' + patterns + r'))[\w_-]+', line))
['This', 'is', 'a', 'line', 'containing', 'a', 'of', 'changes']
This question intrigued me, so I set about for an answer:
'^(?!summary)(?!config)[\w_-]+$'
Now this only works if you want to match the regex against a complete string:
>>> print(re.match('^(?!summary)(?!config)[\w_-]+$', 'config_test'))
None
>>> re.match('^(?!summary)(?!config)[\w_-]+$', 'confi_test')
<_sre.SRE_Match object at 0x21d34a8>
So to use your list, just add in more (?!<word here>) for each word after ^ in your regex. These are called lookaheads. Here's some good info.
If you're trying to match within a string (i.e. without the ^ and $), then I'm not sure it's possible this way: the regex will just match a substring that gets past the lookaheads, for example ummary for summary.
Obviously the more exclusions you pick the more inefficient it will get. There's probably better ways to do it.
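As an illustration of one way around that limitation (not from the original answers): anchoring the ignored words with \b on both sides makes the lookahead fail only on the exact words, so fragments like ummary can't sneak through, while longer words such as summaries are still kept:

```python
import re

line = "This is a line containing a summary of config changes"
listIgnore = ['summary', 'config']
# \b(?!(?:summary|config)\b) fails only on the exact ignored words,
# not on words that merely start with them.
pattern = r'\b(?!(?:%s)\b)[\w_-]+\b' % '|'.join(map(re.escape, listIgnore))
print(re.findall(pattern, line))
```

This prints ['This', 'is', 'a', 'line', 'containing', 'a', 'of', 'changes'].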
I have a string that I need to split on multiple characters without the use of regular expressions. For example, I would need something like the following:
>>>string="hello there[my]friend"
>>>string.split(' []')
['hello','there','my','friend']
is there anything in python like this?
If you need multiple delimiters, re.split is the way to go.
Without using a regex, it's not possible unless you write a custom function for it.
Here's such a function - it might or might not do what you want (consecutive delimiters cause empty elements):
>>> def multisplit(s, delims):
... pos = 0
... for i, c in enumerate(s):
... if c in delims:
... yield s[pos:i]
... pos = i + 1
... yield s[pos:]
...
>>> list(multisplit('hello there[my]friend', ' []'))
['hello', 'there', 'my', 'friend']
Solution without regexp:
from itertools import groupby
sep = ' []'
s = 'hello there[my]friend'
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])
I've just posted an explanation here https://stackoverflow.com/a/19211729/2468006
A recursive solution without use of regex. Uses only base python in contrast to the other answers.
def split_on_multiple_chars(string_to_split, set_of_chars_as_string):
    # Recursive splitting
    # Returns a list of strings
    s = string_to_split
    chars = set_of_chars_as_string
    # If no more characters to split on, return input
    if len(chars) == 0:
        return [s]
    # Split on the first of the delimiter characters
    ss = s.split(chars[0])
    # Recursive call without the first splitting character
    bb = []
    for e in ss:
        aa = split_on_multiple_chars(e, chars[1:])
        bb.extend(aa)
    return bb
Works very similarly to Python's regular string.split(...), but accepts several delimiters.
Example use:
print(split_on_multiple_chars('my"example_string.with:funny?delimiters', '_.:;'))
Output:
['my"example', 'string', 'with', 'funny?delimiters']
If you're not worried about long strings, you could force all delimiters to be the same using string.replace(). The following splits a string by both - and ,
x.replace('-', ',').split(',')
If you have many delimiters you could do the following:
def split(x, delimiters):
    for d in delimiters:
        x = x.replace(d, delimiters[0])
    return x.split(delimiters[0])
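A self-contained sketch of this helper applied to the question's sample string:

```python
def split(x, delimiters):
    # Normalize every delimiter to the first one, then split once.
    for d in delimiters:
        x = x.replace(d, delimiters[0])
    return x.split(delimiters[0])

print(split("hello there[my]friend", " []"))
```

This prints ['hello', 'there', 'my', 'friend'].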
re.split is the right tool here.
>>> string="hello there[my]friend"
>>> import re
>>> re.split('[] []', string)
['hello', 'there', 'my', 'friend']
In regex, [...] defines a character class. Any characters inside the brackets will match. The way I've spaced the brackets avoids needing to escape them, but the pattern [\[\] ] also works.
>>> re.split(r'[\[\] ]', string)
['hello', 'there', 'my', 'friend']
The re.DEBUG flag to re.compile is also useful, as it prints out what the pattern will match:
>>> re.compile('[] []', re.DEBUG)
in
literal 93
literal 32
literal 91
<_sre.SRE_Pattern object at 0x16b0850>
(Where 32, 91, and 93 are the ASCII values of the space, [ and ] characters.)