Regex for multiple characters - python

I am trying to write a regex that captures dates like:
14-July-2012-11_31_59
I do:
\d{2}-\w{4}-\d{4}-\d{2}_\d{2}_\d{2}$
But the month part here is 4 letters, and it could be longer, e.g. September.
That is the only variable part. The length of the digit groups is fine.
How do I write the regex so that the word part means "at least 3 letters"?

In general, X{n,} means "X at least n times". But \w also matches digits and underscores, so you probably want [a-zA-Z]{3,} instead, since month names shouldn't contain digits or underscores.
\d{2}-[a-zA-Z]{3,}-\d{4}-\d{2}_\d{2}_\d{2}$
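To make the difference between the two character classes concrete, here is a small sketch comparing the stricter pattern above with the \w{3,} variant. The names strict and loose and the sample strings are made up for illustration.

```python
import re

# Stricter pattern: the month part may only contain ASCII letters.
strict = re.compile(r'\d{2}-[a-zA-Z]{3,}-\d{4}-\d{2}_\d{2}_\d{2}$')
# Looser pattern: \w also accepts digits and underscores.
loose = re.compile(r'\d{2}-\w{3,}-\d{4}-\d{2}_\d{2}_\d{2}$')

samples = [
    '14-July-2012-11_31_59',
    '14-September-2012-11_31_59',
    '14-Ju-2012-11_31_59',      # month part too short for both patterns
    '14-J_1y-2012-11_31_59',    # only the \w version accepts this
]

for s in samples:
    print(s, bool(strict.match(s)), bool(loose.match(s)))
```

Only the last sample separates the two patterns, so which one is appropriate depends on how much you trust the input.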

Try this:
\d{2}-\w{3,}-\d{4}-\d{2}_\d{2}_\d{2}$

Is this what you're looking for?
>>> import re
>>> a = '14-July-2012-11_31_59'
>>>
>>> pat = r'\b\d{2}\-\w{3,}\-\d{2,4}\-\d{2}\_\d{2}\_\d{2}\b'
>>> regexp = re.compile(pat)
>>> m = regexp.match(a)
>>> m
<_sre.SRE_Match object at 0xa54c870>
>>> m.group()
'14-July-2012-11_31_59'
>>> m = regexp.match('14-September-2012-11_31_59')
>>> m.group()
'14-September-2012-11_31_59'
>>> m = regexp.match('14-September-12-11_31_59')
>>> m.group()
'14-September-12-11_31_59'
>>> m = regexp.match('14-Sep-12-11_31_59')
>>> m.group()
'14-Sep-12-11_31_59'
>>> m = regexp.match('14-Se-12-11_31_59')
>>> m.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>>
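The AttributeError at the end of that session appears because match() returns None when nothing matches. A minimal sketch of guarding against that follows; the helper name match_or_none is hypothetical and not part of the re module.

```python
import re

# Same pattern as above, without the unnecessary escapes on - and _.
regexp = re.compile(r'\b\d{2}-\w{3,}-\d{2,4}-\d{2}_\d{2}_\d{2}\b')


def match_or_none(text):
    """Return the matched text, or None if text does not match (hypothetical helper)."""
    m = regexp.match(text)
    if m is None:
        return None
    return m.group()


for candidate in ['14-Sep-12-11_31_59', '14-Se-12-11_31_59']:
    print(candidate, '->', match_or_none(candidate))
```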


multi line v single line for-loop different results

This is an exercise from Kaggle (Python: Strings and Dictionaries). I wasn't able to solve it, so I peeked at the solution and tried to rewrite it the way I would do it (i.e. not necessarily as sophisticated, but in a way I understand). I use Python Tutor to visualise what the code is doing and I understand most of it, but the for-loop is confusing me.
normalised = (token.strip(",.").lower() for token in tokens)
This works and gives me the indices [0], but if I rewrite it as:
for token in tokens:
    normalised = token.strip(",.").lower()
it doesn't work; it gives me [0, 2] (presumably because casino is in casinoville). Can someone write the multi-line equivalent, for token in tokens: ...?
The code is below for a bit more context.
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword.
    Returns list of the index values into the original list for all documents
    containing the keyword.

    Example:
    doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    indices = []
    counter = 0
    for doc in doc_list:
        tokens = doc.split()
        normalised = (token.strip(",.").lower() for token in tokens)
        if keyword.lower() in normalised:
            indices.append(counter)
        counter += 1
    return indices

# Test - output should be [0]
doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
keyword = 'Casino'
print(word_search(doc_list, keyword))
normalised = (token.strip(",.").lower() for token in tokens) creates a generator, not a tuple. Let's explore this:
>>> a = [1,2,3]
>>> [x**2 for x in a]
[1, 4, 9]
This is a list comprehension. The multi-line equivalent is:
>>> a = [1,2,3]
>>> b = []
>>> for x in a:
...     b.append(x**2)
...
>>> print(b)
[1, 4, 9]
Using parentheses instead of square brackets does not return a tuple (as one might suspect naively, as I did earlier), but a generator:
>>> a = [1,2,3]
>>> (x**2 for x in a)
<generator object <genexpr> at 0x0000024BD6E33B48>
We can iterate over this object with next:
>>> a = [1,2,3]
>>> b = (x**2 for x in a)
>>> next(b)
1
>>> next(b)
4
>>> next(b)
9
>>> next(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
The same generator can be written as a multi-line generator function like this:
>>> a = [1,2,3]
>>> def my_iterator(x):
...     for k in x:
...         yield k**2
...
>>> b = my_iterator(a)
>>> next(b)
1
>>> next(b)
4
>>> next(b)
9
>>> next(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
In the original example, an in membership test is used. This works for both the list and the generator, but for the generator it only works once, because the test consumes the generator while it searches:
>>> a = [1,2,3]
>>> b = [x**2 for x in a]
>>> 9 in b
True
>>> 5 in b
False
>>> b = (x**2 for x in a)
>>> 9 in b
True
>>> 9 in b
False
Here is a discussion of the issue with generator reset: Resetting generator object in Python
I hope that clarified the differences between list comprehensions, generators and multi-line loops.
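To answer the question directly, here is one possible multi-line equivalent, as a sketch. In the asker's loop, normalised is overwritten on every iteration, so after the loop it holds only the last token, a string, and in on a string is a substring test. Collecting the tokens in a list keeps in as a whole-word membership test. The function name word_search_multiline and the use of enumerate instead of a manual counter are my own choices, not part of the exercise.

```python
def word_search_multiline(doc_list, keyword):
    """Multi-line version: build the normalised tokens as a list (hypothetical name)."""
    indices = []
    for counter, doc in enumerate(doc_list):
        normalised = []
        for token in doc.split():
            # Append each cleaned token instead of overwriting a single variable.
            normalised.append(token.strip(",.").lower())
        if keyword.lower() in normalised:
            indices.append(counter)
    return indices


doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
print(word_search_multiline(doc_list, 'Casino'))

# Why the overwritten variable misbehaves: substring test vs. list membership.
print('casino' in 'casinoville')    # substring test on a string
print('casino' in ['casinoville'])  # membership test on a list
```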

Python re.search string search for open and close bracket []

Can someone explain why my regex does not match in the examples below? How can I match literal [ and ] characters?
>>> str = li= "a.b.\[c\]"
>>> if re.search(li,str,re.IGNORECASE):
...     print("Matched")
...
>>>
>>> str = li= r"a.b.[c]"
>>> if re.search(li,str,re.IGNORECASE):
...     print("Matched")
...
>>>
If I remove open and close brackets I get match
>>> str = li= 'a.b.c'
>>> if re.search(li,str,re.IGNORECASE):
...     print("matched")
...
matched
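You are using the same value as both the pattern and the string being searched. In the first case the searched string is a.b.\[c\], backslashes included, rather than a.b.[c]; in the second case the pattern's [c] is a character class that matches a single c, not the literal text [c].
Keep the pattern and the string separate, and escape the brackets (and the dots) only in the pattern:
import re
li = r"a\.b\.\[c\]"
s = "a.b.[c]"
if re.search(li, s, re.IGNORECASE):
    print("Matched")
By the way, re.IGNORECASE is not needed here.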
You can try the following code:
import re
s = "a.b.[c]"
if re.search(r".*\[.*\].*", s):
    print("Matched")
Output:
Matched
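If the goal is to search for a literal string rather than a hand-written pattern, re.escape can do the escaping for you. A minimal sketch, with variable names chosen for illustration:

```python
import re

literal = "a.b.[c]"
# re.escape backslash-escapes regex metacharacters such as . [ and ]
pattern = re.escape(literal)

print(re.search(pattern, "a.b.[c]") is not None)   # literal text is found
print(re.search(pattern, "aXbX[c]") is not None)   # dots no longer match any character
```

For a plain literal search with no regex features at all, the in operator on strings would also do.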

Python Regular Expression with special characters

Having trouble writing a robust regular expression to grab information out of a string.
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string2 = 'A_ABC_THOSE_WORDS'
I would like a robust solution that pulls 'THESE_WORDS' out of string1 and 'THOSE_WORDS' out of string2, respectively.
Basically, I need something that removes everything before the first two underscores (_), but the text before them will vary.
>>> get_text = re.search(r'(?<=A_)\w+(_)', string1)
>>> print(get_text.group())
XYZ_THESE_
Based on your problem statement:
I need something that removes everything before the first two underscores
you don't necessarily need a regular expression:
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string1.split("_", 2)[2]
'THESE_WORDS'
The second argument to str.split is the maximum number of times to split. This will split on the first two '_'s, then take the third item (the rest of the string) from the resulting list.
This will throw an IndexError if there are fewer than two underscores in the string - this lets you know that the string is not in a format you expect, but if this behaviour is not desirable, consider:
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string1.split("_", 2)[-1]
'THESE_WORDS'
This takes the last item of the list returned by str.split, rather than assuming that there will be three. Comparison:
>>> "JUST_ONE".split("_", 2)[2]
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
"JUST_ONE".split("_", 2)[2]
IndexError: list index out of range
>>> "JUST_ONE".split("_", 2)[-1]
'ONE'
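If neither the bare IndexError nor the silent fallback is what you want, the split can be wrapped in a small function that reports malformed input explicitly. This is only a sketch; the name after_second_underscore and the choice of ValueError are assumptions of mine.

```python
def after_second_underscore(text):
    """Return everything after the second underscore (hypothetical helper)."""
    parts = text.split("_", 2)  # at most two splits, so at most three parts
    if len(parts) < 3:
        raise ValueError("expected at least two underscores in %r" % text)
    return parts[2]


print(after_second_underscore('A_XYZ_THESE_WORDS'))
print(after_second_underscore('A_ABC_THOSE_WORDS'))
```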
The regex below captures the text that follows the second underscore (_):
>>> import re
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string2 = 'A_ABC_THOSE_WORDS'
>>> m = re.search(r'^[^_]*_[^_]*_(.*)$', string1)
>>> m.group(1)
'THESE_WORDS'
>>> m = re.search(r'^[^_]*_[^_]*_(.*)$', string2)
>>> m.group(1)
'THOSE_WORDS'
In [21]: regex = re.compile(r'^([a-zA-Z]+_){2}(.*)$')
In [22]: m = regex.search(string1)
In [23]: m.groups()
Out[23]: ('XYZ_', 'THESE_WORDS')
In [24]: m = regex.search(string2)
In [25]: m.groups()
Out[25]: ('ABC_', 'THOSE_WORDS')
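In the session above, the first group shows 'XYZ_' rather than 'A_' because a repeated capturing group only keeps its last repetition. If only the trailing text is needed, a non-capturing group (?:...) avoids the extra group. A sketch of that variation:

```python
import re

# (?:...) groups the prefix for repetition without capturing it.
regex = re.compile(r'^(?:[a-zA-Z]+_){2}(.*)$')

for s in ['A_XYZ_THESE_WORDS', 'A_ABC_THOSE_WORDS']:
    m = regex.search(s)
    print(m.groups())
```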

Python: select digits from NoneType object

I have a 'NoneType' object like:
A='ABC:123'
I would like to get an object keeping only the digits:
A2=digitsof(A)='123'
Split at the colon:
>>> A = 'ABC:123'
>>> numA = int(A.split(':')[1])
>>> numA
123
How about:
>>> import re
>>> def digitsof(a):
...     return [int(x) for x in re.findall(r'\d+', a)]
...
>>> digitsof('ABC:123')
[123]
>>> digitsof('ABC:123,123')
[123, 123]
>>>
Regular Expressions?
>>> from re import sub
>>> A = 'ABC:123'
>>> sub(r'\D', '', A)
'123'
A simple filter function (joined back into a string, since filter returns an iterator in Python 3):
>>> A = 'ABC:123'
>>> ''.join(filter(str.isdigit, A))
'123'
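The approaches above agree on 'ABC:123' but behave differently once the input contains more than one run of digits. A small sketch on a made-up input, 'ABC:123,456':

```python
import re

text = 'ABC:123,456'

# findall keeps each run of digits separate.
runs = re.findall(r'\d+', text)
# sub and filter both drop every non-digit, gluing the runs together.
stripped = re.sub(r'\D', '', text)
filtered = ''.join(filter(str.isdigit, text))

print(runs)
print(stripped)
print(filtered)

# The split approach hands '123,456' to int(), which rejects it.
try:
    int(text.split(':')[1])
except ValueError as exc:
    print('split + int failed:', exc)
```

Which behaviour is right depends on whether the digits are meant as one number or several.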

Extract square-bracketed text from a string

Could someone please help me strip characters from a string so that I am left with just the part held within '[....]', brackets included?
For example:
a = "newyork_74[mylocation]"
b = # strip the first characters until you reach the first bracket [
c = "[mylocation]"
Something like this:
>>> import re
>>> strs = "newyork_74[mylocation]"
>>> re.sub(r'(.*)?(\[)', r'\g<2>', strs)
'[mylocation]'
Assuming no nested structures, one way would be to use itertools.dropwhile:
>>> from itertools import dropwhile
>>> a = "newyork_74[mylocation]"
>>> b = ''.join(dropwhile(lambda c: c != '[', a))
>>> b
'[mylocation]'
Another would be to use a regex:
>>> import re
>>> pat = re.compile(r'\[.*\]')
>>> b = pat.search(a).group(0)
>>> b
'[mylocation]'
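One caveat worth illustrating: \[.*\] is greedy, so with more than one bracketed part it runs from the first [ to the last ]. A sketch on a made-up string, using a negated character class to get each part separately:

```python
import re

text = "newyork_74[mylocation] and london_12[other]"

# Greedy: spans from the first '[' to the last ']'.
greedy = re.search(r'\[.*\]', text).group(0)
# Negated class: each match stops at the first ']' it meets.
parts = re.findall(r'\[[^\]]*\]', text)

print(greedy)
print(parts)
```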
