regex fails to find in Python - python

With a given string:
Surname,MM,Forename,JTA19 R <first.second#domain.com>
I can match all the groups with this:
([A-Za-z]+),([A-Z]+),([A-Za-z]+),([A-Z0-9]+)\s([A-Z])\s<([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})
However, when I apply it to Python it always fails to find it
regex=re.compile(r"(?P<lastname>[A-Za-z]+),"
r"(?P<initials>[A-Z]+)"
r",(?P<firstname>[A-Za-z]+),"
r"(?P<ouc1>[A-Z0-9]+)\s"
r"(?P<ouc2>[A-Z])\s<"
r"(?P<email>[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})"
)
I think I've narrowed it down to this part of email:
[A-Z0-9._%+-]
What is wrong?

Replace
r"(?P<email>[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})"
with
r"(?P<email>[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
to allow for lowercase letters too.
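A quick way to check this fix is sketched below, assuming Python 3. It rebuilds the pattern from the question with the lowercase ranges added and runs it against the sample string. Python joins adjacent string literals into one pattern, so the only change is in the email group. The names `match` and `fields` are just for illustration.

```python
import re

s = "Surname,MM,Forename,JTA19 R <first.second#domain.com>"

# Adjacent string literals are concatenated into a single pattern string.
regex = re.compile(
    r"(?P<lastname>[A-Za-z]+),"
    r"(?P<initials>[A-Z]+),"
    r"(?P<firstname>[A-Za-z]+),"
    r"(?P<ouc1>[A-Z0-9]+)\s"
    r"(?P<ouc2>[A-Z])\s<"
    # a-z added so the lowercase address in the sample can match
    r"(?P<email>[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
)

match = regex.search(s)
fields = match.groupdict() if match else {}
print(fields)
```

Passing re.IGNORECASE to re.compile instead of widening the ranges would also let the email group match, but it would make the uppercase-only groups accept lowercase as well.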

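Python concatenates adjacent string literals, so your call already passes a single pattern to re.compile. The same expression is easier to read as one verbose pattern (note the lowercase ranges in the email group, and the escaped #, which would otherwise start a comment in verbose mode):
import re
exp = r'''
(?P<lastname>[A-Za-z]+),
(?P<initials>[A-Z]+),
(?P<firstname>[A-Za-z]+),
(?P<ouc1>[A-Z0-9]+)\s
(?P<ouc2>[A-Z])\s<
(?P<email>[A-Za-z0-9._%+-]+\#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})'''
regex = re.compile(exp, re.VERBOSE)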
Although I have to say, your string is just comma separated, so this might be a bit easier:
>>> s = "Surname,MM,Forename,JTA19 R <first.second#domain.com>"
>>> lastname,initials,firstname,rest = s.split(',')
>>> ouc1,ouc2,email = rest.split(' ')
>>> lastname,initials,firstname,ouc1,ouc2,email[1:-1]
('Surname', 'MM', 'Forename', 'JTA19', 'R', 'first.second#domain.com')

Related

Split Python String by letters and keep delimiters

Using regex, how can I split a string and keep its delimiters in the returned results? I'm trying to split a string containing numbers and strings by a set of letters followed by any numerical value including '.'; however, it's not appearing to work correctly.
Below is my test string. I'm using Python 2.7 and it's not producing what I'd expect.
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, re.IGNORECASE))
print len(parts), parts
>>> 3 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z']
I would expect it to give me this
>>> 10 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
It should output a list of strings where each string starts with a letter, found in the original regex MLHVCSQTAZ
In your code you are passing re.IGNORECASE as the third argument to re.split, but the third positional argument of re.split is maxsplit, not flags.
re.IGNORECASE equals 2, hence your input is split only two times.
You may use:
>>> list(filter(None, re.split(r'([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, 0, re.I)))
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
Or use inline mode for ignore case:
re.split(r'(?i)([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s)
I suggest using this simple re.findall code that uses almost identical regex:
parts = re.findall('(?i)[MLHVCSQTAZ][^MLHVCSQTAZ]*', s)
Reference: SRE_FLAG_IGNORECASE = 2 in lib/python2.7/sre_constants.py (thanks to a comment from @vks)
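The maxsplit mix-up described above can be reproduced with a small sketch, assuming Python 3. It passes the flag's integer value as maxsplit explicitly (which is what the original positional call amounted to) and then passes the flag by keyword. The names `flag_value`, `limited` and `full` are illustrative only.

```python
import re

s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
pattern = r'([MLHVCSQTAZ][^MLHVCSQTAZ]+)'

# re.IGNORECASE is an integer flag with value 2 ...
flag_value = int(re.IGNORECASE)

# ... so putting it in the maxsplit slot limits the split to two cuts.
limited = [p for p in re.split(pattern, s, maxsplit=flag_value) if p]

# Passing it by keyword applies it as a flag and splits the whole string.
full = [p for p in re.split(pattern, s, flags=re.IGNORECASE) if p]

print(len(limited), len(full))
```

Using keyword arguments for both maxsplit and flags avoids this class of mistake altogether.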
You can use re.findall:
import re
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
result = re.findall(r'[A-Z][.\d,]+|[A-Z]', s)
Output:
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, flags=re.IGNORECASE))
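You need to pass the flag by keyword (flags=); check the re.split function definition.
In Python 2.7 the standard re module does not split on zero-width matches (re.split only gained that in Python 3.7), so you can also use the third-party regex module for that.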
import regex
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
print regex.split('(?=[MLHVCSQTAZ][^MLHVCSQTAZ])', s, flags=regex.IGNORECASE|regex.VERSION1)
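On Python 3.7 or later the standard re module can split on a zero-width lookahead as well, so the extra dependency may not be needed. The sketch below assumes such a Python version. It uses a slightly simpler lookahead than the one above (just the command letter), so that a trailing Z is split off too.

```python
import re

s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'

# Zero-width lookahead: cut just before each command letter (Python 3.7+).
# The empty string produced before the first letter is filtered out.
parts = [p for p in re.split(r'(?i)(?=[MLHVCSQTAZ])', s) if p]

print(parts)
```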

Slice string at last digit in Python

So I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment and I want to truncate them such that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search('\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then, it will output a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub and $ to match and substitute alphabetical characters and underscores until the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
...     i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
    names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
    for name in names:
        while not name[-1].isdigit():
            name = name[:-1]
        print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010
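The approaches above behave differently once the tail contains something other than letters and underscores, or when the string has no digit at all (the index-based loops then run off the start of the string). A minimal sketch of a helper that covers both cases, assuming Python 3; the function name `truncate_after_last_digit` and the sample strings are made up for illustration:

```python
import re
import string

def truncate_after_last_digit(s):
    """Cut s just after its last digit; return s unchanged if it has no digit."""
    # Greedy .* backtracks until \d lands on the final digit in the string.
    m = re.match(r'.*\d', s, re.DOTALL)
    return m.group(0) if m else s

plain = '111_Joe_Smith_2010_Assessment'
messy = '111_Joe_Smith_2010_Assessment (final)'

print(truncate_after_last_digit(plain))
print(truncate_after_last_digit(messy))
# rstrip stops at ')' because it is not in the strip set, so nothing is removed.
print(messy.rstrip(string.ascii_letters + '_'))
print(truncate_after_last_digit('no_digits_here'))
```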

Extract substrings from logical expressions

Let's say I have a string that looks like this:
myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
What I would like to obtain in the end would be:
myStr_l1 = '(Txt_l1) or (Txt2_l1)'
and
myStr_l2 = '(Txt_l2) or (Txt2_l2)'
Some properties:
all "Txt_"-elements of the string start with an uppercase letter
the string can contain much more elements (so there could also be Txt3, Txt4,...)
the suffixes '_l1' and '_l2' look different in reality; they cannot be used for matching (I chose them for demonstration purposes)
I found a way to get the first part done by using:
myStr_l1 = re.sub('\(\w+\)','',myStr)
which gives me
'(Txt_l1 ) or (Txt2_l1 )'
However, I don't know how to obtain myStr_l2. My idea was to remove everything between two open parentheses. But when I do something like this:
re.sub('\(w+\(', '', myStr)
the entire string is returned.
re.sub('\(.*\(', '', myStr)
removes - of course - far too much and gives me
'Txt2_l2))'
Does anyone have an idea how to get myStr_l2?
When there is an "and" instead of an "or", the strings look slightly different:
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'
Then I can still use the command from above:
re.sub('\(\w+\)','',myStr2)
which gives:
'(Txt_l1 and Txt2_l1 )'
but I again fail to get myStr2_l2. How would I do this for this kind of string?
And how would one then do this for mixed expressions with "and" and "or" e.g. like this:
myStr3 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2)) or (Txt3_l1 (Txt3_l2) and Txt4_l1 (Txt2_l2))'
re.sub('\(\w+\)','',myStr3)
gives me
'(Txt_l1 and Txt2_l1 ) or (Txt3_l1 and Txt4_l1 )'
but again: How would I obtain myStr3_l2?
Regexp is not powerful enough for nested expressions (in your case: nested elements in parentheses). You will have to write a parser. Look at https://pyparsing.wikispaces.com/
I'm not entirely sure what you want, but I wrote this to strip everything between the parentheses.
import re
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
sets = mystr.split(' or ')
noParens = []
for line in sets:
    mat = re.match(r'\((.* )\((.*\)\))', line, re.M)
    if mat:
        noParens.append(mat.group(1))
        noParens.append(mat.group(2).replace(')',''))
print(noParens)
This takes all the parentheses away and puts your elements in a list. Here's an alternate way of doing it without using regular expressions.
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
noParens = []
mystr = mystr.replace(' or ', ' ')
mystr = mystr.replace(')','')
mystr = mystr.replace('(','')
noParens = mystr.split()
print(noParens)
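For the fixed two-level shape shown in the question, a pair of capture groups may be enough to build both result strings without a full parser. This is only a sketch, assuming Python 3 and that every element looks like `outer (inner)` with word characters only and exactly one space before the inner parenthesis. The names `PAIR` and `split_levels` are hypothetical. Deeper or irregular nesting would still call for a parser.

```python
import re

# One "outer (inner)" pair: group 1 is the outer name, group 2 the inner one.
PAIR = re.compile(r'(\w+) \((\w+)\)')

def split_levels(expr):
    """Return (level-1 string, level-2 string) for a two-level expression."""
    level1 = PAIR.sub(r'\1', expr)  # keep only the outer names
    level2 = PAIR.sub(r'\2', expr)  # keep only the inner names
    return level1, level2

myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'

print(split_levels(myStr))
print(split_levels(myStr2))
```

Because the pattern never looks at the suffixes, it does not rely on '_l1' or '_l2' for matching.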

Python Regex Get String Between Two Substrings

First off, I know this may seem like a duplicate question, however, I could find no working solution to my problem.
I have string that looks like the following:
string = "api('randomkey123xyz987', 'key', 'text')"
I need to extract randomkey123xyz987 which will always be between api(' and ',
I was planning on using Regex for this, however, I seem to be having some trouble.
This is the only progress that I have made:
import re
string = "api('randomkey123xyz987', 'key', 'text')"
match = re.findall("\((.*?)\)", string)[0]
print match
This code returns 'randomkey123xyz987', 'key', 'text'
I have tried to use [^'], but my guess is that I am not properly inserting it into the re.findall function.
Everything that I am trying is failing.
Update: My current workaround is using [2:-4], but I would still like to avoid using match[2:-4].
If the string contains only one instance, use re.search() instead:
>>> import re
>>> s = "api('randomkey123xyz987', 'key', 'text')"
>>> match = re.search(r"api\('([^']*)'", s).group(1)
>>> print match
randomkey123xyz987
You want the string between the ( and ,, you are catching everything between the parens:
match = re.findall("api\((.*?),", string)
print match
["'randomkey123xyz987'"]
Or match between the '':
match = re.findall("api\('(.*?)'", string)
print match
['randomkey123xyz987']
If that is how your strings actually look you can split:
string = "api('randomkey123xyz987', 'key', 'text')"
print(string.split(",", 1)[0][5:-1])
You should use the following regex:
api\('(.*?)'
Assuming that api( is a fixed prefix, it matches api(', then captures whatever follows up to the next ' character.
>>> re.findall(r"api\('(.*?)'", "api('randomkey123xyz987', 'key', 'text')")
['randomkey123xyz987']
If you are certain that randomkey123xyz987 will always be between "api('" and "',", then the split() method can get it done in one line, with no regex matching needed. It takes the text between the starting delimiter "api('" and the ending delimiter "',".
>>> string = "api('randomkey123xyz987', 'key', 'text')"
>>> value = (string.split("api('")[1]).split("',")[0]
>>> print value
randomkey123xyz987
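The answers above hard-code the delimiters into each pattern. A more general sketch, assuming Python 3, wraps the idea in a small helper that escapes arbitrary delimiters with re.escape. The function name `between` is hypothetical and not part of any library.

```python
import re

def between(text, start, end):
    """Return the first substring found between start and end, or None."""
    # re.escape makes characters such as ( and ' match literally.
    m = re.search(re.escape(start) + r'(.*?)' + re.escape(end), text)
    return m.group(1) if m else None

string = "api('randomkey123xyz987', 'key', 'text')"
print(between(string, "api('", "',"))
```

Returning None when nothing matches avoids the IndexError or AttributeError that the one-liners raise on unexpected input.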

Removing many types of chars from a Python string

I have some string X and I wish to remove semicolons, periods, commas, colons, etc, all in one go. Is there a way to do this that doesn't require a big chain of .replace(somechar,"") calls?
You can use the translate method with a first argument of None:
string2 = string1.translate(None, ";.,:")
Alternatively, you can use the filter function:
string2 = filter(lambda x: x not in ";,.:", string1)
Note that both of these options only work for non-Unicode strings and only in Python 2.
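In Python 3 the same two ideas still work with small changes: translate takes a table built by str.maketrans, and filter returns an iterator that has to be joined back into a string. A minimal sketch, with the sample value of `string1` made up for illustration:

```python
string1 = "a;b.c,d:e"

# Three-argument maketrans: every character in the third string maps to None,
# so translate() deletes it.
table = str.maketrans('', '', ';.,:')
string2 = string1.translate(table)

# filter() returns an iterator in Python 3, so join the result back together.
string3 = ''.join(filter(lambda ch: ch not in ';,.:', string1))

print(string2, string3)
```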
You can use re.sub to pattern match and replace. The following replaces h and i only with empty strings:
In [1]: s = 'byehibyehbyei'
In [1]: re.sub('[hi]', '', s)
Out[1]: 'byebyebye'
Don't forget to import re.
>>> import re
>>> foo = "asdf;:,*_-"
>>> re.sub('[;:,*_-]', '', foo)
'asdf'
[;:,*_-] - List of characters to be matched
'' - Replace match with nothing
Using the string foo.
For more information take a look at the re.sub(pattern, repl, string, count=0, flags=0) documentation.
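When the characters to remove come from a variable rather than a hand-written class, some of them (such as ], ^, \ or -) can change the meaning of the character class. One way around that, sketched here for Python 3, is to escape them first with re.escape. The helper name `remove_chars` is hypothetical.

```python
import re

def remove_chars(text, chars):
    """Delete every character listed in chars from text."""
    if not chars:
        return text  # an empty class '[]' would not be a valid pattern
    # re.escape keeps characters such as ], ^, \ and - from changing the class.
    pattern = '[' + re.escape(chars) + ']'
    return re.sub(pattern, '', text)

print(remove_chars("asdf;:,*_-", ";:,*_-"))
print(remove_chars("a[b]c^d-e", "[]^-"))
```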
Don't know about the speed, but here's another example without using re.
commas_and_stuff = ",+;:"
words = "words; and stuff!!!!"
cleaned_words = "".join(c for c in words if c not in commas_and_stuff)
Gives you:
'words and stuff!!!!'
