list comprehension using regex conditional - python

I have a list of strings.
If any of these strings contains a 4-digit year, I want to truncate the string at the end of the year.
Otherwise I leave the strings alone.
I tried using:
for x in my_strings:
    m=re.search("\D\d\d\d\d\D",x)
    if m: x=x[:m.end()]
I also tried:
my_strings=[x[:re.search("\D\d\d\d\d\D",x).end()] if re.search("\D\d\d\d\d\D",x) for x in my_strings]
Neither of these is working.
Can you tell me what I am doing wrong?

Something like this seems to work on trivial data:
>>> regex = re.compile(r'^(.*(?<=\D)\d{4}(?=\D))(.*)')
>>> strings = ['foo', 'bar', 'baz', 'foo 1999', 'foo 1999 never see this', 'bar 2010 n 2015', 'bar 20156 see this']
>>> [regex.sub(r'\1', s) for s in strings]
['foo', 'bar', 'baz', 'foo 1999', 'foo 1999', 'bar 2010', 'bar 20156 see this']

It looks like the only bound on the result string is its end, so you can use re.match() instead and change your regex to:
my_expr = r".*?\D\d{4}\D"
Then, in your code, do:
regex = re.compile(my_expr)
my_new_strings = []
for string in my_strings:
    match = regex.match(string)
    if match:
        my_new_strings.append(match.group())
    else:
        my_new_strings.append(string)
Or as a list comprehension:
regex = re.compile(my_expr)
matches = ((regex.match(string), string) for string in my_strings)
my_new_strings = [match.group() if match else string for match, string in matches]
Alternatively, you could use re.sub, matching everything after the year so that the substitution drops it:
regex = re.compile(r'(\D\d{4})\D.*')
new_strings = [regex.sub(r'\1', string) for string in my_strings]

I am not entirely sure of your use case, but the following code may give you some hints:
import re
my_strings = ['abcd', 'ab12cd34', 'ab1234', 'ab1234cd', '1234cd', '123cd1234cd']
for index, string in enumerate(my_strings):
    match = re.search(r'\d{4}', string)
    if match:
        # assign through the index so the list itself is updated
        my_strings[index] = string[0:match.end()]
print(my_strings)
# ['abcd', 'ab12cd34', 'ab1234', 'ab1234', '1234', '123cd1234']
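The index assignment matters here. In the question's loop, x = x[:m.end()] only rebinds the loop variable and leaves the list untouched. A minimal sketch of the difference (the names items and snapshot are made up for illustration):

```python
items = ['ab1234cd', 'abcd']

# Rebinding the loop variable does not change the list.
for x in items:
    x = x[:6]
snapshot = list(items)  # copy taken after the first loop

# Assigning through the index does change the list.
for index, x in enumerate(items):
    items[index] = x[:6]
```

After the first loop the list is unchanged; only the second loop truncates the first element.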

You were actually pretty close with the list comprehension, but your syntax is off - you need to make the first expression a "conditional expression" aka x if <boolean> else y:
[x[:re.search("\D\d\d\d\d\D",x).end()] if re.search("\D\d\d\d\d\D",x) else x for x in my_strings]
This is hard to read, though. There are better ways to split your string around a 4-digit year, such as:
[re.split(r'(?<=\D\d{4})\D', x)[0] for x in my_strings]
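If it helps, here is one more sketch that calls re.search only once per string. It assumes that a year at the very start or end of the string should count too, which differs from the \D...\D pattern in the question; YEAR and truncate_at_year are names I made up.

```python
import re

# Exactly four digits, not preceded or followed by another digit.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")

def truncate_at_year(text):
    """Cut text right after the first standalone 4-digit number, if any."""
    match = YEAR.search(text)
    return text[:match.end()] if match else text

my_strings = ['foo', 'foo 1999', 'foo 1999 never see this',
              'bar 2010 n 2015', 'bar 20156 see this', '1984 was a year']
truncated = [truncate_at_year(s) for s in my_strings]
```

Using lookarounds instead of \D means the characters around the year are not consumed, so match.end() is the end of the year itself.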

Related

String split using regex with pattern present in text

I have many string that I need to split by commas. Example:
myString = r'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
myString = r'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
My desired output would be:
["test", "Test", "NEAR(this,that,DISTANCE=4)", "test again", """another test"""] #list length = 5
I can't figure out how to keep the commas between "this,that,DISTANCE" in one item. I tried this:
l = re.compile(r',').split(myString) # matches all commas
l = re.compile(r'(?<!\(),(?=\))').split(myString) # (negative lookback/lookforward) - no matches at all
Any ideas? Let's say the list of allowed "functions" is defined as:
f = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
You may use
(?:\([^()]*\)|[^,])+
The (?:\([^()]*\)|[^,])+ pattern matches one or more occurrences of any substring between parentheses with no ( and ) in them or any char other than ,.
See the Python demo:
import re
rx = r"(?:\([^()]*\)|[^,])+"
s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(re.findall(rx, s))
# => ['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
If you explicitly want to specify which strings count as functions, you need to build the regex dynamically. Otherwise, go with Wiktor's solution.
>>> functions = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
>>> funcs = '|'.join(r'{}\([^\)]+\)'.format(f) for f in functions)
>>> regex = '({})|,'.format(funcs)
>>>
>>> myString1 = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString1)))
['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
>>> myString2 = 'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString2)))
['test',
'Test',
'FOLLOWEDBY(this,that,DISTANCE=4)',
'test again',
'"another test"']
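The two ideas can also be combined: use findall as in the first answer, but only protect commas inside the listed functions. This is just a sketch under the assumption that the arguments never contain nested parentheses; token is a name I made up, and re.escape is there in case a function name ever contains a regex metacharacter.

```python
import re

functions = ["NEAR", "FOLLOWEDBY", "AND", "OR", "MAX"]

# One alternative per allowed function name, escaped for safety.
alternation = "|".join(re.escape(name) for name in functions)

# A token is a run of: an allowed function call, or any single non-comma character.
token = re.compile(r"(?:(?:%s)\([^()]*\)|[^,])+" % alternation)

s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
parts = token.findall(s)

# Commas inside an unlisted "function" are not protected.
other = token.findall('x,other(a,b),y')
```

Note that the pattern does not check what comes before a function name, so a name that merely ends in OR or AND followed by parentheses would be treated the same way.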

Removing punctuations and spaces in a string without using regex

I used string.punctuation from the string module, but I still have '…' left after calling split(). I also get '', and I don't understand why that would remain after strip(). As far as I understand, strip() removes leading and trailing whitespace, so spaces in the middle of a string should not matter:
>>> s = 'a dog barks meow! # … '
>>> s.strip()
'a dog barks meow! # …'
>>> import string
>>> k = []
>>> for item in s.split():
...     k.append(item.strip(string.punctuation))
...
>>> k
['a', 'dog', 'barks', 'meow', '', '…']
I would like to get rid of '', '…', the final output I'd like is ['a', 'dog', 'barks', 'meow'].
I would like to refrain from using regex, but if that's the only solution I will consider it .. for now I'm more interested in solving this without resorting to regex.
You can remove punctuation by retaining only alphanumeric characters and spaces:
s = 'a dog barks meow! # …'
print(''.join(c for c in s if c.isalnum() or c.isspace()).split())
This outputs:
['a', 'dog', 'barks', 'meow']
I used the following:
s = 'a dog barks Meow! # … '
import string
p = string.punctuation+'…'
k = []
for item in s.split():
    k.append(item.strip(p).lower())
k = [x for x in k if x]
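The same idea can probably be written without the loop by using str.translate. This sketch assumes Python 3 (where str.maketrans accepts a third argument listing characters to delete); table and words are names I made up.

```python
import string

s = 'a dog barks Meow! # … '

# Map every ASCII punctuation character, plus the ellipsis, to "delete".
table = str.maketrans('', '', string.punctuation + '…')

# Delete those characters, lowercase, then split on whitespace.
words = s.translate(table).lower().split()
```

One difference from strip(p): translate also removes punctuation inside a word, so "don't" would become "dont".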
Building on the accepted answer to this question:
import itertools
k = []
for ok, grp in itertools.groupby(s, lambda c: c.isalnum()):
    if ok:
        k.append(''.join(list(grp)))
or the same as a one-liner (except for the import):
k = [''.join(list(grp)) for ok, grp in itertools.groupby(s, lambda c: c.isalnum()) if ok]
itertools.groupby() scans the string s as a list of characters, grouping them (grp) by the value (ok) of the lambda expression. The if ok filters out the groups not matching the lambda. The groups are iterators that have to be converted to a list of characters and then joined to get back the words.
The meaning of isalnum() is essentially “is alphanumeric”. Depending on your use case, you might prefer isalpha(). In both cases, for this input:
s = 'a 狗 barks meow! # …'
the output is
['a', '狗', 'barks', 'meow']
(For experts: this is a reminder that not all languages separate words with non-word characters.)

Python re: if string has one word AND any one of a list of words?

I want to find if a string matches on this rule using a regular expression:
list_of_words = ['a', 'boo', 'blah']
if 'foo' in temp_string and any(word in temp_string for word in list_of_words)
The reason I want it in a regular expression is that I have hundreds of rules like it and different from it so I want to save them all as patterns in a dict.
The only one I could think of is this but it doesn't seem pretty:
re.search(r'foo.*(a|boo|blah)|(a|boo|blah).*foo', temp_string)
You can join the list elements using | to construct a lookahead assertion regex:
>>> list_of_words = ['a', 'boo', 'blah']
>>> reg = re.compile( r'^(?=.*\b(?:' + "|".join(list_of_words) + r')\b).*foo' )
>>> print(reg.pattern)
^(?=.*\b(?:a|boo|blah)\b).*foo
>>> reg.findall(r'abcd foo blah')
['abcd foo']
As you can see we have constructed a regex ^(?=.*\b(?:a|boo|blah)\b).*foo which asserts presence of one word from list_of_words and matches foo anywhere.
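Since the goal is to keep hundreds of such rules in a dict, a small builder function may help. This is only a sketch: build_rule, rules and the key "foo_rule" are names I made up. It uses two lookaheads and no \b, so it keeps the plain substring semantics of the original "in" test, and re.escape protects words that contain regex metacharacters.

```python
import re

def build_rule(required, any_of):
    """Compile a pattern that succeeds when `required` and at least one
    of the `any_of` words occur anywhere in the text (substring test)."""
    alternatives = "|".join(re.escape(word) for word in any_of)
    pattern = r"^(?=.*%s)(?=.*(?:%s))" % (re.escape(required), alternatives)
    return re.compile(pattern, re.DOTALL)  # DOTALL lets .* cross newlines

rules = {"foo_rule": build_rule("foo", ["a", "boo", "blah"])}

def matches(rule_name, text):
    return rules[rule_name].search(text) is not None
```

Because both conditions are lookaheads anchored at the start, the order in which the words appear does not matter, which avoids writing the alternation twice.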

Python regex match between characters

I'm doing a pretty straightforward regex in python and seeing some odd behavior when I use the "or" operator.
I am trying to parse the following:
>> str = "blah [in brackets] stuff"
so that it returns:
>> ['blah', 'in brackets', 'stuff']
To match the text between brackets, I am using look behind and look ahead, i.e.:
>> '(?<=\[).*?(?=\])'
If used alone this does indeed capture the text in brackets:
>> re.findall( '(?<=\[).*?(?=\])' , str )
>> ['in brackets']
But when I combine the or operator to parse the strings between spaces, the bracket-match somehow breaks down:
>> [x for x in re.findall( '(?<=\[).*?(?=\])|.*?[, ]' , str ) if x!=' ' ]
>> ['blah', '[in ', 'brackets] ']
For the life of me I can't understand this behavior. Any help would be appreciated.
Thanks!
You can do:
>>> s = "blah [in brackets] stuff"
>>> re.findall(r'\b\w+\s*\w+\b', s)
['blah', 'in brackets', 'stuff']
For those interested, this is the successful regex that I ended up going with. There is probably a more elegant solution somewhere but this works:
>>> s = "blah 2.0 stuff 1 1 0 [in brackets] more stuff [1]"
>>> brackets_re = '(?<=\[).*?(?=\])'
>>> space_re = '[-\.\w]+(?= )'
>>> my_re = brackets_re + '|' + space_re
>>> re.findall(my_re, s)
['blah', '2.0', 'stuff', '1', '1', '0', 'in brackets', 'more', 'stuff', '1']
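If you are looking for an easy way to do this, then use this. Note that it returns every run of two or more word characters separately, so the words inside the brackets are not kept together.
Note: I replaced str with string because str is the name of a built-in type in Python.
import re
string = "blah [in brackets] stuff"
f = re.findall(r'\w+\w', string)
print(f)
Output: ['blah', 'in', 'brackets', 'stuff']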
The answers so far don't take into account that you may have more than two words inside the brackets, or only one. The following regex splits on the brackets together with any whitespace just outside them. It also works if the string contains several bracketed sections.
s = "blah [in brackets] stuff"
s = re.split(r'\s*\[|\]\s*', s) # note the 'or' operator is used and literal opening and closing brackets '\[' and '\]'
print(s)
output: ['blah', 'in brackets', 'stuff']
And an example using a string with different amounts of words inside brackets and using several sets of brackets:
s = "blah [in brackets] stuff [three words here] more stuff [one-word] stuff [a digit 1!] stuff."
s = re.split(r'\s*\[|\]\s*', s)
print (s)
output: ['blah', 'in brackets', 'stuff', 'three words here', 'more stuff', 'one-word', 'stuff', 'a digit 1!', 'stuff.']
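As for why the alternation in the question breaks: findall tries the alternatives at each position in turn. When the scan reaches the "[" itself, the look-behind cannot succeed there (the character before it is a space), so the second alternative matches and consumes "[in ", and the bracket branch never gets a chance. One way around that, sketched here with made-up names (token, tokenize), is to let the bracket branch consume the brackets and capture only what is inside:

```python
import re

# Group 1: text inside [...]; group 2: a run of characters that are
# neither whitespace nor brackets.
token = re.compile(r"\[([^\]]*)\]|([^\s\[\]]+)")

def tokenize(text):
    result = []
    for match in token.finditer(text):
        inside, word = match.groups()
        # Exactly one of the two groups took part in the match.
        result.append(inside if inside is not None else word)
    return result

tokens = tokenize("blah [in brackets] stuff")
```

Unlike the re.split version, this separates the words outside the brackets individually, which is closer to what the question's second pattern was trying to do.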

How can i parse a comma delimited string into a list (caveat)?

I need to be able to take a string like:
'''foo, bar, "one, two", three four'''
into:
['foo', 'bar', 'one, two', 'three four']
I have a feeling (with hints from #python) that the solution is going to involve the shlex module.
It depends how complicated you want to get. Do you want to allow more than one type of quoting? What about escaped quotes?
Your syntax looks very much like the common CSV file format, which is supported by the Python standard library:
import csv
reader = csv.reader(['''foo, bar, "one, two", three four'''], skipinitialspace=True)
for r in reader:
    print(r)
Outputs:
['foo', 'bar', 'one, two', 'three four']
HTH!
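For a single line, the same csv approach can be wrapped in a small helper. A sketch only; split_csv_line is a name I made up.

```python
import csv

def split_csv_line(line):
    """Parse one comma-separated line, honouring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so wrap the string in a list.
    # skipinitialspace=True drops the blank after each comma, which also
    # lets ' "one, two"' be recognised as a quoted field.
    return next(csv.reader([line], skipinitialspace=True))

fields = split_csv_line('foo, bar, "one, two", three four')
```

Without skipinitialspace=True the quote would not be at the start of the field, and the reader would split "one, two" at its comma.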
The shlex module solution handles escaped quotes, one kind of quote nested inside another, and the other quoting features a shell supports.
>>> import shlex
>>> my_splitter = shlex.shlex('''foo, bar, "one, two", three four''', posix=True)
>>> my_splitter.whitespace += ','
>>> my_splitter.whitespace_split = True
>>> print list(my_splitter)
['foo', 'bar', 'one, two', 'three', 'four']
escaped quotes example:
>>> my_splitter = shlex.shlex('''"test, a",'foo,bar",baz',bar \xc3\xa4 baz''',
...                           posix=True)
>>> my_splitter.whitespace = ',' ; my_splitter.whitespace_split = True
>>> print list(my_splitter)
['test, a', 'foo,bar",baz', 'bar \xc3\xa4 baz']
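To keep 'three four' together as in the question, one option is to make the comma the only separator and strip the leftover blanks afterwards. This is a sketch assuming Python 3; split_fields is a name I made up.

```python
import shlex

def split_fields(line):
    lexer = shlex.shlex(line, posix=True)
    lexer.whitespace = ','         # only commas separate tokens
    lexer.whitespace_split = True  # everything else belongs to a token
    # Blanks next to the commas end up inside the tokens, so strip them.
    return [token.strip() for token in lexer]

fields = split_fields('foo, bar, "one, two", three four')
```

Keep in mind that shlex follows shell rules, so a stray apostrophe in the input is read as an opening quote and will likely raise a ValueError.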
You may also want to consider the csv module. I haven't tried it, but it looks like your input data is closer to CSV than to shell syntax (which is what shlex parses).
You could do something like this:
>>> import re
>>> pattern = re.compile(r'\s*("[^"]*"|.*?)\s*,')
>>> def split(line):
...     return [x[1:-1] if x[:1] == x[-1:] == '"' else x
...             for x in pattern.findall(line.rstrip(',') + ',')]
...
>>> split("foo, bar, baz")
['foo', 'bar', 'baz']
>>> split('foo, bar, baz, "blub blah"')
['foo', 'bar', 'baz', 'blub blah']
I'd say a regular expression is what you're looking for here, though I'm not very familiar with Python's regex engine.
Assuming you use lazy matches, you can collect the matches from the string into a list.
If it doesn't need to be pretty, this might get you on your way:
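def f(s, splitifeven):
    # odd-numbered pieces of ss.split('"') were inside quotes: keep them whole
    if splitifeven & 1:
        return [s]
    # even-numbered pieces are outside quotes: split on commas, drop empties
    return [x.strip() for x in s.split(",") if x.strip() != '']
ss = 'foo, bar, "one, two", three four'
print(sum([f(s, sie) for sie, s in enumerate(ss.split('"'))], []))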
