Python regex match between characters

I'm doing a pretty straightforward regex in python and seeing some odd behavior when I use the "or" operator.
I am trying to parse the following:
>> str = "blah [in brackets] stuff"
so that it returns:
>> ['blah', 'in brackets', 'stuff']
To match the text between brackets, I am using look behind and look ahead, i.e.:
>> '(?<=\[).*?(?=\])'
If used alone this does indeed capture the text in brackets:
>> re.findall( '(?<=\[).*?(?=\])' , str )
>> ['in brackets']
But when I combine the or operator to parse the strings between spaces, the bracket-match somehow breaks down:
>> [x for x in re.findall( '(?<=\[).*?(?=\])|.*?[, ]' , str ) if x!=' ' ]
>> ['blah ', '[in ', 'brackets] ']
For the life of me I can't understand this behavior. Any help would be appreciated.
Thanks!

You can do:
>>> s = "blah [in brackets] stuff"
>>> re.findall(r'\b\w+\s*\w+\b', s)
['blah', 'in brackets', 'stuff']

For those interested, this is the successful regex that I ended up going with. There is probably a more elegant solution somewhere but this works:
>>> s = "blah 2.0 stuff 1 1 0 [in brackets] more stuff [1]"
>>> brackets_re = r'(?<=\[).*?(?=\])'
>>> space_re = r'[-\.\w]+(?= )'
>>> my_re = brackets_re + '|' + space_re
>>> re.findall(my_re, s)
['blah', '2.0', 'stuff', '1', '1', '0', 'in brackets', 'more', 'stuff', '1']
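For anyone wondering why the original combined pattern misbehaved, here is a minimal sketch (not taken from the posts above) that records where each match starts. It uses only the standard re module; the names broken, fixed, fixed_eol and starts_and_matches are made up for illustration, and fixed_eol is a hypothetical tweak rather than something the answer proposes.

```python
import re

s = "blah [in brackets] stuff"

# Pattern from the question: the second alternative may start on '['.
broken = r'(?<=\[).*?(?=\])|.*?[, ]'
# Pattern from the answer above: the second alternative cannot consume '['.
fixed = r'(?<=\[).*?(?=\])|[-.\w]+(?= )'
# Hypothetical tweak: also accept a word that ends the string.
fixed_eol = r'(?<=\[).*?(?=\])|[-.\w]+(?= |$)'

def starts_and_matches(pattern, text):
    """Return (start_index, matched_text) pairs in scan order."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, text)]

print(starts_and_matches(broken, s))
print(starts_and_matches(fixed, s))
print(re.findall(fixed_eol, s))
```

With the broken pattern, the second alternative matches '[in ' starting at the bracket itself, so the scan resumes after that text and the lookbehind never gets tried at the 'i'. The answer's pattern avoids this, but it skips a final word that has no space after it, which is what the (?= |$) variant is meant to cover.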

If you are looking for an easy way to do this, use the following.
Note: I replaced str with string because 'str' is a built-in type in Python, and reusing the name shadows it.
import re
string = "blah [in brackets] stuff"
f = re.findall(r'\w+(?: \w+)*', string)
print(f)
Output: ['blah', 'in brackets', 'stuff']

The answers so far don't take into account that you may have more than two words inside the brackets, or only one. The following regex splits on the brackets together with any whitespace before the opening bracket or after the closing one. It also works if the string contains several bracketed sections.
s = "blah [in brackets] stuff"
s = re.split(r'\s*\[|\]\s*', s) # note the 'or' operator is used and literal opening and closing brackets '\[' and '\]'
print(s)
output: ['blah', 'in brackets', 'stuff']
And an example using a string with different amounts of words inside brackets and using several sets of brackets:
s = "blah [in brackets] stuff [three words here] more stuff [one-word] stuff [a digit 1!] stuff."
s = re.split(r'\s*\[|\]\s*', s)
print (s)
output: ['blah', 'in brackets', 'stuff', 'three words here', 'more stuff', 'one-word', 'stuff', 'a digit 1!', 'stuff.']
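One edge case worth noting: when the string starts or ends with a bracketed section, re.split returns an empty string at that end. A minimal sketch of one way to handle it, assuming empty pieces are never wanted (split_on_brackets is a name made up here, not part of the answer):

```python
import re

def split_on_brackets(text):
    """Split on '[' and ']' (plus adjacent whitespace) and drop empty pieces."""
    parts = re.split(r'\s*\[|\]\s*', text)
    return [p for p in parts if p]

print(split_on_brackets("blah [in brackets] stuff [1]"))
print(split_on_brackets("[lead] blah"))
```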

Related

String split using regex with pattern present in text

I have many strings that I need to split by commas. Example:
myString = r'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
myString = r'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
My desired output would be:
["test", "Test", "NEAR(this,that,DISTANCE=4)", "test again", """another test"""] #list length = 5
I can't figure out how to keep the commas between "this,that,DISTANCE" in one item. I tried this:
l = re.compile(r',').split(myString) # matches all commas
l = re.compile(r'(?<!\(),(?=\))').split(myString) # (negative lookback/lookforward) - no matches at all
Any ideas? Let's say the list of allowed "functions" is defined as:
f = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
You may use
(?:\([^()]*\)|[^,])+
See the regex demo.
The (?:\([^()]*\)|[^,])+ pattern matches one or more occurrences of any substring between parentheses with no ( and ) in them or any char other than ,.
See the Python demo:
import re
rx = r"(?:\([^()]*\)|[^,])+"
s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(re.findall(rx, s))
# => ['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
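The pattern above covers a single level of parentheses. If the calls can be nested, for example NEAR(a,MAX(b,c)), the [^()]* part no longer keeps the outer call together. A regex-free sketch that tracks parenthesis depth is one possible alternative; split_top_level is a hypothetical helper, and it assumes commas inside double quotes need no special treatment.

```python
def split_top_level(text, sep=','):
    """Split text on sep, ignoring separators nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        if ch == sep and depth == 0:
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

print(split_top_level('test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'))
print(split_top_level('test,NEAR(a,MAX(b,c)),x'))
```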
If you explicitly want to specify which strings count as functions, you need to build the regex dynamically. Otherwise, go with Wiktor's solution.
>>> functions = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
>>> funcs = '|'.join(r'{}\([^\)]+\)'.format(f) for f in functions)
>>> regex = '({})|,'.format(funcs)
>>>
>>> myString1 = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString1)))
['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
>>> myString2 = 'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString2)))
['test',
'Test',
'FOLLOWEDBY(this,that,DISTANCE=4)',
'test again',
'"another test"']

Removing punctuations and spaces in a string without using regex

I used import string and string.punctuation but I realized I still have '…' after conducting string.split(). I also get '', which I don't know why I would get it after doing strip(). As far as I understand, strip() removes the peripheral spaces, so if I have spaces between a string it would not matter:
>>> s = 'a dog barks meow! # … '
>>> s.strip()
'a dog barks meow! # …'
>>> import string
>>> k = []
>>> for item in s.split():
... k.append(item.strip(string.punctuation))
...
>>> k
['a', 'dog', 'barks', 'meow', '', '…']
I would like to get rid of '', '…', the final output I'd like is ['a', 'dog', 'barks', 'meow'].
I would like to refrain from using regex, but if that's the only solution I will consider it .. for now I'm more interested in solving this without resorting to regex.
You can remove punctuation by retaining only alphanumeric characters and spaces:
s = 'a dog barks meow! # …'
print(''.join(c for c in s if c.isalnum() or c.isspace()).split())
This outputs:
['a', 'dog', 'barks', 'meow']
I used the following:
s = 'a dog barks Meow! # … '
import string
p = string.punctuation+'…'
k = []
for item in s.split():
    k.append(item.strip(p).lower())
k = [x for x in k if x]
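Another regex-free option, sketched here as an alternative rather than as part of the answer above, is to delete the unwanted characters with str.translate before splitting. The helper name words_without_punctuation is made up for illustration, and the set of characters to drop (string.punctuation plus '…') is an assumption carried over from the code above.

```python
import string

# Translation table that deletes ASCII punctuation and the ellipsis character.
_DROP = str.maketrans('', '', string.punctuation + '…')

def words_without_punctuation(text):
    """Delete punctuation everywhere, then split on whitespace."""
    return text.translate(_DROP).split()

print(words_without_punctuation('a dog barks Meow! # … '))
```

Unlike strip(), this also removes punctuation inside a word, so "don't" becomes "dont"; whether that is acceptable depends on the data.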
Building on the accepted answer to this question:
import itertools
k = []
for ok, grp in itertools.groupby(s, lambda c: c.isalnum()):
    if ok:
        k.append(''.join(list(grp)))
or the same as a one-liner (except for the import):
k = [''.join(list(grp)) for ok, grp in itertools.groupby(s, lambda c: c.isalnum()) if ok]
itertools.groupby() scans the string s as a list of characters, grouping them (grp) by the value (ok) of the lambda expression. The if ok filters out the groups not matching the lambda. The groups are iterators that have to be converted to a list of characters and then joined to get back the words.
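To make the grouping visible, here is a small sketch that keeps both kinds of groups instead of filtering right away. The variable names groups and words are illustrative only.

```python
import itertools

s = 'a dog barks meow! # …'

# Each entry is (is_alphanumeric, run_of_consecutive_characters).
groups = [(ok, ''.join(grp)) for ok, grp in itertools.groupby(s, str.isalnum)]
for ok, text in groups:
    print(ok, repr(text))

# Keeping only the alphanumeric runs gives the words.
words = [text for ok, text in groups if ok]
print(words)
```

The last group holds every trailing non-alphanumeric character in one run, because groupby only starts a new group when the key changes.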
The meaning of isalnum() is essentially “is alphanumeric”. Depending on your use case, you might prefer isalpha(). In both cases, for this input:
s = 'a 狗 barks meow! # …'
the output is
['a', '狗', 'barks', 'meow']
(For experts: this reminds us that not all languages separate words with non-word characters.)

list comprehension using regex conditional

I have a list of strings.
If any of these strings has a 4-digit year, I want to truncate the string at the end of the year.
Otherwise I leave the strings alone.
I tried using:
for x in my_strings:
    m=re.search("\D\d\d\d\d\D",x)
    if m: x=x[:m.end()]
I also tried:
my_strings=[x[:re.search("\D\d\d\d\d\D",x).end()] if re.search("\D\d\d\d\d\D",x) for x in my_strings]
Neither of these is working.
Can you tell me what I am doing wrong?
Something like this seems to work on trivial data:
>>> regex = re.compile(r'^(.*(?<=\D)\d{4}(?=\D))(.*)')
>>> strings = ['foo', 'bar', 'baz', 'foo 1999', 'foo 1999 never see this', 'bar 2010 n 2015', 'bar 20156 see this']
>>> [regex.sub(r'\1', s) for s in strings]
['foo', 'bar', 'baz', 'foo 1999', 'foo 1999', 'bar 2010', 'bar 20156 see this']
Looks like your only bound on the result string is at the end(), so you should be using re.match() instead, and modify your regex to:
my_expr = r".*?\D\d{4}\D"
Then, in your code, do:
regex = re.compile(my_expr)
my_new_strings = []
for string in my_strings:
    match = regex.match(string)
    if match:
        my_new_strings.append(match.group())
    else:
        my_new_strings.append(string)
Or as a list comprehension:
regex = re.compile(my_expr)
matches = ((regex.match(string), string) for string in my_strings)
my_new_strings = [match.group() if match else string for match, string in matches]
Alternatively, you could use re.sub, matching everything after the year so that it is dropped:
regex = re.compile(r'(\D\d{4})\D.*')
new_strings = [regex.sub(r'\1', string) for string in my_strings]
I am not entirely sure of your use case, but the following code can give you some hints:
import re
my_strings = ['abcd', 'ab12cd34', 'ab1234', 'ab1234cd', '1234cd', '123cd1234cd']
for index, string in enumerate(my_strings):
    match = re.search(r'\d{4}', string)
    if match:
        my_strings[index] = string[0:match.end()]
print(my_strings)
# ['abcd', 'ab12cd34', 'ab1234', 'ab1234', '1234', '123cd1234']
You were actually pretty close with the list comprehension, but your syntax is off - you need to make the first expression a "conditional expression" aka x if <boolean> else y:
[x[:re.search("\D\d\d\d\d\D",x).end()] if re.search("\D\d\d\d\d\D",x) else x for x in my_strings]
Obviously this is pretty ugly and hard to read. There are several better ways to split your string around a 4-digit year, such as:
[re.split(r'(?<=\D\d{4})\D', x)[0] for x in my_strings]
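Several of the patterns in this thread consume a non-digit on each side of the year, so a year at the very start or end of a string is missed, and some variants keep one character past the year. A minimal sketch using lookarounds instead, under the assumption that "4-digit year" means exactly four digits not adjacent to another digit (YEAR and truncate_after_year are names invented for this example):

```python
import re

# Exactly four digits, with no digit immediately before or after them.
YEAR = re.compile(r'(?<!\d)\d{4}(?!\d)')

def truncate_after_year(text):
    """Cut text right after the first standalone 4-digit number, if any."""
    m = YEAR.search(text)
    return text[:m.end()] if m else text

my_strings = ['foo', 'foo 1999', 'foo 1999 never see this',
              'bar 2010 n 2015', 'bar 20156 see this', '1999 at start']
print([truncate_after_year(x) for x in my_strings])
```

Because the lookarounds consume nothing, m.end() is the position right after the last digit of the year.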

Search list: match only exact word/string

How can I match an exact string/word while searching a list? I have tried, but it's not correct. Below I have given the sample list, my code, and the test results.
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
My code:
for str in list:
    if s in str:
        print(str)
test results:
s = "hello" ~ expected output: 'Hi, hello' ~ output I get: 'Hi, hello'
s = "123" ~ expected output: *nothing* ~ output I get: 'hi mr 12345'
s = "12345" ~ expected output: 'hi mr 12345' ~ output I get: 'hi mr 12345'
s = "come" ~ expected output: *nothing* ~ output I get: 'welcome sir'
s = "welcome" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
s = "welcome sir" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
My list contains more than 200K strings.
It looks like you need to perform this search more than once, so I would recommend converting your list into a dictionary:
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> d = dict()
>>> for item in l:
...     for word in item.split():
...         d.setdefault(word, list()).append(item)
...
So now you can easily do:
>>> d.get('hi')
['hi mr 12345']
>>> d.get('come') # nothing
>>> d.get('welcome')
['welcome sir']
P.S. You will probably have to improve item.split() to handle commas, periods, and other separators, perhaps with a regex and \w.
P.P.S. As cularion mentioned, this won't match "welcome sir". If you want to match the whole string, it takes just one additional line in the proposed solution. But if you have to match part of a string bounded by spaces and punctuation, a regex should be your choice.
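A rough sketch of the improvement the P.S. hints at, indexing on \w+ runs so that punctuation such as the comma in 'Hi, hello' does not end up in the keys. The function name build_index is hypothetical, and the index is case-sensitive, like the original.

```python
import re
from collections import defaultdict

def build_index(items):
    """Map each word (a run of \\w characters) to the items that contain it."""
    index = defaultdict(list)
    for item in items:
        # set() avoids listing an item twice when a word repeats in it.
        for word in set(re.findall(r'\w+', item)):
            index[word].append(item)
    return index

l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
index = build_index(l)
print(index.get('Hi'))     # the key is 'Hi', not 'Hi,'
print(index.get('hello'))
print(index.get('123'))    # partial words are not indexed
```

Building the index is a single pass over the list; after that, each lookup is a dictionary access, which matters with 200K strings and repeated searches.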
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> search = lambda word: [x for x in l if word in x.split()]
>>> search('123')
[]
>>> search('12345')
['hi mr 12345']
>>> search('hello')
['Hi, hello']
If you search for an exact match:
for str in list:
    if set(s.split()) & set(str.split()):
        print(str)
Provided s only ever consists of just a few words, you could do
s = s.split()
n = len(s)
for x in my_list:
    words = x.split()
    if s in (words[i:i+n] for i in range(len(words) - n + 1)):
        print(x)
If s consists of many words, there are more efficient, but also much more complex, algorithms for this.
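Use a regular expression here to match the exact word with the word boundary \b:
import re
.....
for str in list:
    if re.search(r'\b' + wordToLook + r'\b', str):
        print(str)
\b matches at the boundary between a word character and a non-word character (such as a space or line break), or at the start or end of the string. Both '\b' pieces must be raw strings; in a normal string literal '\b' is a backspace character.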
Or do something like this to avoid typing the search word again and again:
import re
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
listOfWords = ['hello', 'Mr', '123']
reg = re.compile(r'(?i)\b(?:%s)\b' % '|'.join(listOfWords))
for str in list:
    if reg.search(str):
        print(str)
(?i) is to search for without worrying about the case of words, if you want to search with case sensitivity then remove it.
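A compact sketch of the word-boundary check as a reusable function, run against the test cases from the question. The name contains_word is made up; re.escape is added on the assumption that search terms may contain regex metacharacters.

```python
import re

def contains_word(word, text):
    """True if word occurs in text delimited by word boundaries."""
    # re.escape keeps characters such as '.' or '+' in word literal.
    return re.search(r'\b' + re.escape(word) + r'\b', text) is not None

items = ['Hi, hello', 'hi mr 12345', 'welcome sir']
for s in ['hello', '123', '12345', 'come', 'welcome', 'welcome sir']:
    print(s, [item for item in items if contains_word(s, item)])

# Why the raw string matters: a plain '\b' is the backspace character.
print('\b' == '\x08', r'\b' == '\\b')
```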

How can i parse a comma delimited string into a list (caveat)?

I need to be able to take a string like:
'''foo, bar, "one, two", three four'''
into:
['foo', 'bar', 'one, two', 'three four']
I have a feeling (with hints from #python) that the solution is going to involve the shlex module.
It depends how complicated you want to get. Do you want to allow more than one type of quoting? How about escaped quotes?
Your syntax looks very much like the common CSV file format, which is supported by the Python standard library:
import csv
reader = csv.reader(['''foo, bar, "one, two", three four'''], skipinitialspace=True)
for r in reader:
    print(r)
Outputs:
['foo', 'bar', 'one, two', 'three four']
HTH!
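Since there is only one line to parse, the same idea can be wrapped in a small function that returns the list directly. This is a sketch for Python 3; parse_line is a hypothetical name, and it assumes double quotes are the only quoting character.

```python
import csv

def parse_line(line):
    """Parse a single comma-delimited line, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines; next() takes the only row.
    return next(csv.reader([line], skipinitialspace=True))

print(parse_line('foo, bar, "one, two", three four'))
```

skipinitialspace=True is what lets a quoted field start after the space that follows a comma.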
The shlex module solution allows escaped quotes, one kind of quote escaping another, and all the fancy stuff the shell supports.
>>> import shlex
>>> my_splitter = shlex.shlex('''foo, bar, "one, two", three four''', posix=True)
>>> my_splitter.whitespace += ','
>>> my_splitter.whitespace_split = True
>>> print(list(my_splitter))
['foo', 'bar', 'one, two', 'three', 'four']
escaped quotes example:
>>> my_splitter = shlex.shlex('''"test, a",'foo,bar",baz',bar \xc3\xa4 baz''',
posix=True)
>>> my_splitter.whitespace = ',' ; my_splitter.whitespace_split = True
>>> print list(my_splitter)
['test, a', 'foo,bar",baz', 'bar \xc3\xa4 baz']
You may also want to consider the csv module. I haven't tried it, but it looks like your input data is closer to CSV than to shell syntax (which is what shlex parses).
You could do something like this:
>>> import re
>>> pattern = re.compile(r'\s*("[^"]*"|.*?)\s*,')
>>> def split(line):
...     return [x[1:-1] if x[:1] == x[-1:] == '"' else x
...             for x in pattern.findall(line.rstrip(',') + ',')]
...
>>> split("foo, bar, baz")
['foo', 'bar', 'baz']
>>> split('foo, bar, baz, "blub blah"')
['foo', 'bar', 'baz', 'blub blah']
I'd say a regular expression would be what you're looking for here, though I'm not terribly familiar with Python's Regex engine.
Assuming you use lazy matches, you can get a set of matches on a string which you can put into your array.
If it doesn't need to be pretty, this might get you on your way:
def f(s, splitifeven):
    if splitifeven & 1:
        return [s]
    return [x.strip() for x in s.split(",") if x.strip() != '']
ss = 'foo, bar, "one, two", three four'
print(sum([f(s, sie) for sie, s in enumerate(ss.split('"'))], []))
