String split using regex with pattern present in text - python

I have many strings that I need to split by commas. Example:
myString = r'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
myString = r'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
My desired output would be:
["test", "Test", "NEAR(this,that,DISTANCE=4)", "test again", '"another test"'] #list length = 5
I can't figure out how to keep the commas between "this,that,DISTANCE" in one item. I tried this:
l = re.compile(r',').split(myString) # matches all commas
l = re.compile(r'(?<!\(),(?=\))').split(myString) # (negative lookback/lookforward) - no matches at all
Any ideas? Let's say the list of allowed "functions" is defined as:
f = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]

You may use
(?:\([^()]*\)|[^,])+
See the regex demo.
The (?:\([^()]*\)|[^,])+ pattern matches one or more occurrences of any substring between parentheses with no ( and ) in them or any char other than ,.
See the Python demo:
import re
rx = r"(?:\([^()]*\)|[^,])+"
s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(re.findall(rx, s))
# => ['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
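The pattern above assumes the parentheses are not nested. If nested calls such as MAX(NEAR(x,y),z) can occur, one option is to track the parenthesis depth by hand instead of using a regex. This is only a sketch: split_top_level is a hypothetical helper name, and unlike re.findall it keeps empty fields.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        if ch == sep and depth == 0:
            # A top-level separator closes the current item.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(split_top_level(s))
print(split_top_level("a,MAX(NEAR(x,y),z),b"))
```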

If you explicitly want to specify which strings count as functions, you need to build the regex dynamically. Otherwise, go with Wiktor's solution.
>>> import re
>>> functions = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
>>> funcs = '|'.join(r'{}\([^\)]+\)'.format(f) for f in functions)
>>> regex = '({})|,'.format(funcs)
>>>
>>> myString1 = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString1)))
['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
>>> myString2 = 'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString2)))
['test',
'Test',
'FOLLOWEDBY(this,that,DISTANCE=4)',
'test again',
'"another test"']
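A variation on the same idea, as a sketch only: the function names can be passed through re.escape (in case a name ever contains a regex metacharacter) and combined with the findall approach, so no filtering of empty strings is needed. The names `names` and `token` are made up for this example, and the pattern still assumes the arguments contain no nested parentheses.

```python
import re

functions = ["NEAR", "FOLLOWEDBY", "AND", "OR", "MAX"]

# Escape each name, then allow either "NAME(flat args)" or any non-comma char.
names = "|".join(re.escape(name) for name in functions)
token = re.compile(r"(?:\b(?:{})\([^()]*\)|[^,])+".format(names))

s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(token.findall(s))

# A name that is not in the list is split at its commas as usual.
print(token.findall("x,FOR(a,b),y"))
```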

Related

Removing punctuations and spaces in a string without using regex

I used import string and string.punctuation but I realized I still have '…' after conducting string.split(). I also get '', which I don't know why I would get it after doing strip(). As far as I understand, strip() removes the peripheral spaces, so if I have spaces between a string it would not matter:
>>> s = 'a dog barks meow! # … '
>>> s.strip()
'a dog barks meow! # …'
>>> import string
>>> k = []
>>> for item in s.split():
...     k.append(item.strip(string.punctuation))
...
>>> k
['a', 'dog', 'barks', 'meow', '', '…']
I would like to get rid of '', '…', the final output I'd like is ['a', 'dog', 'barks', 'meow'].
I would like to refrain from using regex, but if that's the only solution I will consider it .. for now I'm more interested in solving this without resorting to regex.
You can remove punctuation by retaining only alphanumeric characters and spaces:
s = 'a dog barks meow! # …'
print(''.join(c for c in s if c.isalnum() or c.isspace()).split())
This outputs:
['a', 'dog', 'barks', 'meow']
I used the following:
s = 'a dog barks Meow! # … '
import string
p = string.punctuation+'…'
k = []
for item in s.split():
    k.append(item.strip(p).lower())
k = [x for x in k if x]
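Another regex-free possibility, sketched here under the assumption that punctuation may be deleted everywhere (not only at the edges of each word), is str.translate with a deletion table. Note that this also removes punctuation inside words, so "don't" would become "dont".

```python
import string

s = 'a dog barks Meow! # … '

# Map every ASCII punctuation character, plus the ellipsis, to None (delete it).
table = str.maketrans('', '', string.punctuation + '…')
words = s.translate(table).lower().split()
print(words)
```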
Building on the accepted answer to this question:
import itertools
k = []
for ok, grp in itertools.groupby(s, lambda c: c.isalnum()):
    if ok:
        k.append(''.join(list(grp)))
or the same as a one-liner (except for the import):
k = [''.join(list(grp)) for ok, grp in itertools.groupby(s, lambda c: c.isalnum()) if ok]
itertools.groupby() scans the string s as a list of characters, grouping them (grp) by the value (ok) of the lambda expression. The if ok filters out the groups not matching the lambda. The groups are iterators that have to be converted to a list of characters and then joined to get back the words.
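To make the grouping visible, this small sketch prints every (key, group) pair, including the groups that the if ok filter would discard. The variable name `groups` is just for illustration.

```python
import itertools

s = 'a dog barks meow! # …'

# Each run of consecutive characters with the same isalnum() value is one group.
groups = [(ok, ''.join(grp)) for ok, grp in itertools.groupby(s, str.isalnum)]
for ok, text in groups:
    print(ok, repr(text))
```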
The meaning of isalnum() is essentially “is alphanumeric”. Depending on your use case, you might prefer isalpha(). In both cases, for this input:
s = 'a 狗 barks meow! # …'
the output is
['a', '狗', 'barks', 'meow']
(For experts: this is a reminder that not all languages separate words with non-word characters.)

Python regex match between characters

I'm doing a pretty straightforward regex in python and seeing some odd behavior when I use the "or" operator.
I am trying to parse the following:
>> str = "blah [in brackets] stuff"
so that it returns:
>> ['blah', 'in brackets', 'stuff']
To match the text between brackets, I am using look behind and look ahead, i.e.:
>> '(?<=\[).*?(?=\])'
If used alone this does indeed capture the text in brackets:
>> re.findall( '(?<=\[).*?(?=\])' , str )
>> ['in brackets']
But when I combine the or operator to parse the strings between spaces, the bracket-match somehow breaks down:
>> [x for x in re.findall( '(?<=\[).*?(?=\])|.*?[, ]' , str ) if x!=' ' ]
>> ['blah', '[in ', 'brackets] ']
For the life of me I can't understand this behavior. Any help would be appreciated.
Thanks!
You can do:
>>> s = "blah [in brackets] stuff"
>>> re.findall(r'\b\w+\s*\w+\b', s)
['blah', 'in brackets', 'stuff']
For those interested, this is the successful regex that I ended up going with. There is probably a more elegant solution somewhere but this works:
>>> s = "blah 2.0 stuff 1 1 0 [in brackets] more stuff [1]"
>>> brackets_re = '(?<=\[).*?(?=\])'
>>> space_re = '[-\.\w]+(?= )'
>>> my_re = brackets_re + '|' + space_re
>>> re.findall(my_re, s)
['blah', '2.0', 'stuff', '1', '1', '0', 'in brackets', 'more', 'stuff', '1']
If you are looking for an easy way to do this, then use this.
Note: I replaced str with string, as 'str' is a built-in type in Python.
import re
string = "blah [in brackets] stuff"
f = re.findall(r'\w+\w', string)
print(f)
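Output: ['blah', 'in', 'brackets', 'stuff']
Note that \w does not match a space, so this returns the two bracketed words separately rather than as 'in brackets'.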
The answers so far don't take into account that you may have more than two words inside the brackets, or even just one. The following regex splits on the brackets together with any white space immediately outside them. It also works if there is more than one bracketed section in the string.
s = "blah [in brackets] stuff"
s = re.split(r'\s*\[|\]\s*', s) # note the 'or' operator is used and literal opening and closing brackets '\[' and '\]'
print(s)
output: ['blah', 'in brackets', 'stuff']
And an example using a string with different amounts of words inside brackets and using several sets of brackets:
s = "blah [in brackets] stuff [three words here] more stuff [one-word] stuff [a digit 1!] stuff."
s = re.split(r'\s*\[|\]\s*', s)
print (s)
output: ['blah', 'in brackets', 'stuff', 'three words here', 'more stuff', 'one-word', 'stuff', 'a digit 1!', 'stuff.']
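As for why the combined pattern in the question fails: the engine scans left to right, and at the '[' itself the lookbehind cannot succeed (the previous character is a space), so the second alternative consumes '[in ' before the bracket alternative ever gets a chance. One way around this, shown as a sketch with made-up names, is to let the pattern consume the brackets and use capture groups to pick out the content.

```python
import re

s = "blah 2.0 stuff [in brackets] more [1]"

# Group 1: text inside [...]; group 2: a run of non-space, non-bracket chars.
pattern = re.compile(r'\[([^\]]*)\]|([^\s\[\]]+)')

# findall returns (group1, group2) tuples; exactly one of them is non-empty
# for each match (unless the brackets are empty).
tokens = [inside or outside for inside, outside in pattern.findall(s)]
print(tokens)
```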

How do you set a variable number of regex expressions?

Currently I have out = re.sub(r'[0-9][0-9][0-9]', '', input). I would like to have a variable number of [0-9]'s.
So far I have:
string = ''
for i in xrange(numlen):
    string = string + '[0-9]'
string = 'r' + string
out = re.sub(string, '', input)
This doesn't work, and I've tried using re.compile, but haven't had any luck. Is there a better way of doing this? Or am I just missing something trivial?
You can specify repetition using {}, for example 3 digits would be
[0-9]{3}
So you can do something like
reps = 5 # or whatever value you'd like
out = re.sub('[0-9]{{{}}}'.format(reps), '', input)
Or if you don't know how many digits there will be
out = re.sub('[0-9]+', '', input)
Use the + quantifier, which matches one or more occurrences of a digit:
out = re.sub(r'[0-9]+', '', input)
See how the regex matches http://regex101.com/r/cE6yS6/1
For example
>>> import re
>>> word="hello 123"
>>> out = re.sub(r'[0-9]+', '', word)
>>> word
'hello 123'
>>> out
'hello '
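On why the loop in the question did not work: the r in r'...' is part of Python's string literal syntax, not part of the pattern, so prepending the character 'r' to a string makes the regex look for a literal letter r. A short sketch with hypothetical variable names:

```python
import re

numlen = 3
pattern = '[0-9]' * numlen   # same text as r'[0-9][0-9][0-9]'
broken = 'r' + pattern       # now requires a literal letter "r" before the digits

text = 'abc123 r456'
print(re.sub(pattern, '', text))
print(re.sub(broken, '', text))
```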

How to remove non-alphanumeric characters at the beginning or end of a string

I have a list with elements that have unnecessary (non-alphanumeric) characters at the beginning or end of each string.
Ex.
'cats--'
I want to get rid of the --
I tried:
for i in thelist:
    newlist.append(i.strip('\W'))
That didn't work. Any suggestions?
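def strip_nonalnum(word):
    if not word:
        return word  # nothing to strip
    # index of the first alphanumeric character
    for start, c in enumerate(word):
        if c.isalnum():
            break
    else:
        return ''  # no alphanumeric characters at all
    # number of trailing non-alphanumeric characters
    for end, c in enumerate(word[::-1]):
        if c.isalnum():
            break
    return word[start:len(word) - end]
print([strip_nonalnum(s) for s in thelist])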
Or
import re
def strip_nonalnum_re(word):
    return re.sub(r"^\W+|\W+$", "", word)
To remove one or more chars other than letters, digits and _ from both ends you may use
re.sub(r'^\W+|\W+$', '', '??cats--') # => cats
Or, if _ is to be removed, too, wrap \W into a character class and add _ there:
re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_')
See the Python demo:
import re
print( re.sub(r'^\W+|\W+$', '', '??cats--') ) # => cats
print( re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_') ) # => cats
You can use a regular expression. The function re.sub() takes three parameters:
The regular expression
The replacement
The string
Code:
import re
s = 'cats--'
output = re.sub("[^\\w]", "", s)
print(output)
Explanation:
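The part "\\w" matches any alphanumeric character or underscore.
[^x] will match any character that is not x, so this removes every non-word character in the string, not only those at the beginning or end.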
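To see how the unanchored pattern differs from the anchored one used in the other answers, here is a quick side-by-side sketch (the variable names are only for illustration):

```python
import re

word = '--the#!cats-%'

# Unanchored: deletes non-word characters wherever they occur.
everywhere = re.sub(r'[^\w]', '', word)

# Anchored: deletes runs of non-word characters only at the start or end.
ends_only = re.sub(r'^\W+|\W+$', '', word)

print(everywhere)
print(ends_only)
```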
I believe that this is the shortest non-regex solution:
text = "`23`12foo--=+"
while len(text) > 0 and not text[0].isalnum():
    text = text[1:]
while len(text) > 0 and not text[-1].isalnum():
    text = text[:-1]
print(text)
With strip you have to know which characters are to be stripped.
>>> 'cats--'.strip('-')
'cats'
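This is also why strip('\W') in the question had no effect: str.strip treats its argument as a set of individual characters, not as a regex, so '\W' means "a backslash or a capital W". A small sketch, which uses string.punctuation on the assumption that only ASCII punctuation needs to go:

```python
import string

# '\W' is read as the two characters backslash and W, not as a regex class.
print('cats--'.strip(r'\W'))
print(r'W\cats\W'.strip(r'\W'))

# Passing the full set of ASCII punctuation strips any mix of them at both ends.
print('??cats--'.strip(string.punctuation))
```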
You could use re to get rid of the non-alphanumeric characters, but IMO that would be shooting a mouse with a cannon. With str.isalpha() you can test whether a character is alphabetic, so you only need to keep those:
>>> ''.join(char for char in '#!cats-%' if char.isalpha())
'cats'
>>> thelist = ['cats5--', '#!cats-%', '--the#!cats-%', '--5cats-%', '--5!cats-%']
>>> [''.join(c for c in e if c.isalpha()) for e in thelist]
['cats', 'cats', 'thecats', 'cats', 'cats']
You want to get rid of non-alphanumeric so we can make this better:
>>> [''.join(c for c in e if c.isalnum()) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
For this list, that is exactly the same result you would get with re (as in Christian's answer):
>>> import re
>>> [re.sub("[^\\w]", "", e) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
However, if you want to strip non-alphanumeric characters only from the ends of the strings, you should use another pattern like this one (check the re documentation):
>>> [''.join(re.search('^\W*(.+)(?!\W*$)(.)', e).groups()) for e in thelist]
['cats5', 'cats', 'the#!cats', '5cats', '5!cats']

list comprehension using regex conditional

I have a list of strings.
If any of these strings has a 4-digit year, I want to truncate the string at the end of the year.
Otherwise I leave the strings alone.
I tried using:
for x in my_strings:
    m=re.search("\D\d\d\d\d\D",x)
    if m: x=x[:m.end()]
I also tried:
my_strings=[x[:re.search("\D\d\d\d\d\D",x).end()] if re.search("\D\d\d\d\d\D",x) for x in my_strings]
Neither of these is working.
Can you tell me what I am doing wrong?
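Regarding the first attempt: assigning to x inside the loop only rebinds the loop variable; it does not write back into the list. The sketch below (with sample data invented for illustration) shows the difference between rebinding and assigning by index. Note that the question's pattern also keeps the non-digit that follows the year.

```python
import re

my_strings = ['foo 1999 never see this', 'bar']

for x in my_strings:
    m = re.search(r"\D\d\d\d\d\D", x)
    if m:
        x = x[:m.end()]  # rebinds the local name x only
unchanged = list(my_strings)  # the list still holds the original strings

for i, x in enumerate(my_strings):
    m = re.search(r"\D\d\d\d\d\D", x)
    if m:
        my_strings[i] = x[:m.end()]  # writes the result back into the list

print(unchanged)
print(my_strings)
```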
Something like this seems to work on trivial data:
>>> regex = re.compile(r'^(.*(?<=\D)\d{4}(?=\D))(.*)')
>>> strings = ['foo', 'bar', 'baz', 'foo 1999', 'foo 1999 never see this', 'bar 2010 n 2015', 'bar 20156 see this']
>>> [regex.sub(r'\1', s) for s in strings]
['foo', 'bar', 'baz', 'foo 1999', 'foo 1999', 'bar 2010', 'bar 20156 see this']
Looks like your only bound on the result string is at the end(), so you should be using re.match() instead, and modify your regex to:
my_expr = r".*?\D\d{4}\D"
Then, in your code, do:
regex = re.compile(my_expr)
my_new_strings = []
for string in my_strings:
    match = regex.match(string)
    if match:
        my_new_strings.append(match.group())
    else:
        my_new_strings.append(string)
Or as a list comprehension:
regex = re.compile(my_expr)
matches = ((regex.match(string), string) for string in my_strings)
my_new_strings = [match.group() if match else string for match, string in matches]
Alternatively, you could use re.sub:
regex = re.compile(r'(\D\d{4})\D.*')
new_strings = [regex.sub(r'\1', string) for string in my_strings]
I am not entirely sure of your use case, but the following code can give you some hints:
import re
my_strings = ['abcd', 'ab12cd34', 'ab1234', 'ab1234cd', '1234cd', '123cd1234cd']
for index, string in enumerate(my_strings):
    match = re.search(r'\d{4}', string)
    if match:
        my_strings[index] = string[0:match.end()]
print(my_strings)
# ['abcd', 'ab12cd34', 'ab1234', 'ab1234', '1234', '123cd1234']
You were actually pretty close with the list comprehension, but your syntax is off - you need to make the first expression a "conditional expression" aka x if <boolean> else y:
[x[:re.search("\D\d\d\d\d\D",x).end()] if re.search("\D\d\d\d\d\D",x) else x for x in my_strings]
Obviously this is pretty ugly and hard to read. There are several better ways to split your string around a 4-digit year, such as:
[re.split(r'(?<=\D\d{4})\D', x)[0] for x in my_strings]
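A possibly more readable variant is to compile the pattern once and move the conditional into a small helper. This is a sketch, not the asker's exact rule: truncate_after_year and YEAR are hypothetical names, and the pattern assumes a "year" is any run of exactly four digits not adjacent to another digit, which (unlike \D\d{4}\D) also matches at the very start or end of the string.

```python
import re

# Exactly four digits, with no digit immediately before or after.
YEAR = re.compile(r'(?<!\d)\d{4}(?!\d)')


def truncate_after_year(text):
    """Cut text right after the first 4-digit year, or return it unchanged."""
    m = YEAR.search(text)
    return text[:m.end()] if m else text


my_strings = ['foo', 'foo 1999', 'foo 1999 never see this',
              'bar 2010 n 2015', 'bar 20156 see this', '1999 x']
result = [truncate_after_year(x) for x in my_strings]
print(result)
```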
