How can I split this string in two parts using a regex? - python

Imagine this:
a = "('a','b','c'),('d','e','f')"
I am trying to split it using re so that I get a list of two elements, "('a','b','c')" and "('d','e','f')". I tried:
matches = re.split("(?:\)),(?:\()",a)
but this gives me the result:
'(2,3,4'
'1,6,7)'
I could parse it, character by character, but if a regex solution is possible, I would prefer it.

You need to split on the comma that is preceded by a ) and followed by a (. But the parentheses themselves should not be part of the split point. For that you need positive lookahead and positive lookbehind assertions:
matches = re.split(r"(?<=\)),(?=\()", a)
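A minimal sketch of the difference, using the string from the question. The variable names parts and consumed are only for illustration.

```python
import re

a = "('a','b','c'),('d','e','f')"

# Lookbehind and lookahead only assert the parentheses; they do not consume
# them, so just the comma between ")" and "(" is removed.
parts = re.split(r"(?<=\)),(?=\()", a)
print(parts)     # ["('a','b','c')", "('d','e','f')"]

# The non-capturing groups from the question do consume the parentheses,
# which is why they disappear from the result.
consumed = re.split(r"(?:\)),(?:\()", a)
print(consumed)  # ["('a','b','c'", "'d','e','f')"]
```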

Try this:
from ast import literal_eval
a = "('a','b','c'),('d','e','f')"
x, y = literal_eval(a)
After this, x will be ('a', 'b', 'c') which can be stringized with str(x), or, if spaces matter,
"(%s)" % ",".join(repr(z) for z in x)
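Put together, the approach above might look like this. It assumes the input is always a valid Python literal; the name compact is just for illustration.

```python
from ast import literal_eval

a = "('a','b','c'),('d','e','f')"

# literal_eval safely parses the text as a Python literal:
# here, a tuple containing two tuples.
x, y = literal_eval(a)

# str() inserts a space after each comma; joining repr() of the items does not.
spaced = str(x)
compact = "(%s)" % ",".join(repr(z) for z in x)

print(spaced)   # ('a', 'b', 'c')
print(compact)  # ('a','b','c')
```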

split is the wrong tool here. You want findall:
import re
a = "('a','b','c'),('d','e','f')"
matches = re.findall(r"\([^)]*\)", a)
or pretty much equivalently,
matches = re.findall(r"\(.*?\)", a)
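A small sketch of where the two patterns agree and where they differ. The multi-line input is a made-up example, not from the question: the dot does not match a newline by default, while the negated character class does.

```python
import re

a = "('a','b','c'),('d','e','f')"

with_class = re.findall(r"\([^)]*\)", a)
with_lazy_dot = re.findall(r"\(.*?\)", a)
print(with_class == with_lazy_dot)  # True

# Hypothetical input with a newline inside the first pair of parentheses.
multiline = "('a',\n'b'),('c')"
print(re.findall(r"\([^)]*\)", multiline))  # ["('a',\n'b')", "('c')"]
print(re.findall(r"\(.*?\)", multiline))    # ["('c')"]
```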

Related

Split Python String by letters and keep deliminators

Using regex, how can I split a string and keep its delimiters in the returned results? I'm trying to split a string containing numbers and letters at each letter from a given set, where each letter is followed by any numerical value (including '.' and ','), but it doesn't appear to work correctly.
Below is my test string. I'm using Python 2.7 and it's not producing what I'd expect.
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, re.IGNORECASE))
print len(parts), parts
>>> 3 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z']
I would expect it to give me this
>>> 10 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
It should output a list of strings where each string starts with a letter, found in the original regex MLHVCSQTAZ
In your code you are passing re.IGNORECASE as the third argument to re.split, but the third positional argument of re.split is maxsplit, not flags.
re.IGNORECASE equals 2, hence your input is split only two times.
You may use:
>>> list(filter(None, re.split(r'([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, 0, re.I)))
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
Or use inline mode for ignore case:
re.split(r'(?i)([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s)
I suggest using this simple re.findall code that uses almost identical regex:
parts = re.findall('(?i)[MLHVCSQTAZ][^MLHVCSQTAZ]*', s)
Reference: SRE_FLAG_IGNORECASE = 2 in lib/python2.7/sre_constants.py (thanks to a comment from @vks)
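To see the effect in isolation, here is a small sketch on a shortened, made-up path string (not the one from the question). It passes the flag's integer value as maxsplit by keyword, which is what the positional call in the question amounts to.

```python
import re

s = 'M1,2L3,4L5,6Z'  # shortened, hypothetical path data
pattern = r'([MLHVCSQTAZ][^MLHVCSQTAZ]+)'

# The flag is an integer-valued constant.
flag_value = int(re.IGNORECASE)

# Used as maxsplit, it limits the result to two splits.
limited = [p for p in re.split(pattern, s, maxsplit=flag_value) if p]

# Passed by keyword as flags, it does not limit the number of splits.
full = [p for p in re.split(pattern, s, flags=re.IGNORECASE) if p]

print(flag_value)  # 2
print(limited)     # ['M1,2', 'L3,4', 'L5,6Z']
print(full)        # ['M1,2', 'L3,4', 'L5,6', 'Z']
```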
You can use re.findall:
import re
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
result = re.findall(r'[A-Z][\.\d,]+|[A-Z]', s)
Output:
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, flags=re.IGNORECASE))
You need to pass flags as a keyword argument. Check the signature of re.split.
Before Python 3.7, the standard re module could not split on a zero-width match, so you can also use the third-party regex module for that.
import regex
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
print regex.split('(?=[MLHVCSQTAZ][^MLHVCSQTAZ])', s, flags=regex.IGNORECASE|regex.VERSION1)
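On Python 3.7 or later, the standard re module can split on a zero-width match as well. A minimal sketch, assuming such an interpreter and using a shortened, made-up path string; the lookahead here is simplified to a single letter so that a trailing Z is also split off.

```python
import re

s = 'M1,2L3,4L5,6Z'  # shortened, hypothetical path data

# Split at every position that is followed by a command letter.
# The empty match before the first letter yields a leading '', which is dropped.
parts = [p for p in re.split(r'(?i)(?=[MLHVCSQTAZ])', s) if p]

print(parts)  # ['M1,2', 'L3,4', 'L5,6', 'Z']
```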

How to start at a specific letter and end when it hits a digit?

I have some sample strings:
s = 'neg(able-23, never-21) s2-1/3'
i = 'amod(Market-8, magical-5) s1'
I can already figure out whether a string ends with 's1' or 's3' using:
word = re.search(r's\d$', s)
But if I want to know whether the string contains 's2-1/3', it won't work.
Is there a regular expression that works for both cases, 's#' and 's#' followed by more characters?
Thanks!
You can allow the characters "-" and "/" to be captured as well, in addition to just digits. It's hard to tell the exact pattern you're going for here, but something like this would capture "s2-1/3" from your example:
import re
s = "neg(able-23, never-21) s2-1/3"
word = re.search(r"s\d[-/\d]*$", s)
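A quick check of that pattern against both sample strings from the question. The helper name trailing_tag is made up for this sketch.

```python
import re

pattern = re.compile(r"s\d[-/\d]*$")

def trailing_tag(text):
    """Return the trailing 's#...' token, or None if there is none."""
    m = pattern.search(text)
    return m.group() if m else None

print(trailing_tag("neg(able-23, never-21) s2-1/3"))  # s2-1/3
print(trailing_tag("amod(Market-8, magical-5) s1"))   # s1
print(trailing_tag("amod(Market-8, magical-5)"))      # None
```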
I'm guessing that maybe you would want to extract that with some expression, such as:
(s\d+)-?(.*)$
or:
(s\d+)-?([0-9]+)?\/?([0-9]+)?$
Test
import re
expression = r"(s\d+)-?(.*)$"
string = """
neg(able-23, never-21) s211-12/31
neg(able-23, never-21) s2-1/3
amod(Market-8, magical-5) s1
"""
print(re.findall(expression, string, re.M))
Output
[('s211', '12/31'), ('s2', '1/3'), ('s1', '')]

Getting word from string

How can I get the word example from a string like this:
str = "http://test-example:123/wd/hub"
I wrote something like this:
print(str[10:str.rfind(':')])
but it doesn't work correctly if the string is, for example,
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
A non-regex way to get the same result is to use these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use the following non-regex approach because you know example is a 7-letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
There are many ways.
Using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
Using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
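If the fragility is a concern, one alternative (not from the answers here, just a sketch) is to let the standard library isolate the host name first and only then split on the hyphen. The function name word_after_hyphen is hypothetical.

```python
from urllib.parse import urlsplit

def word_after_hyphen(url):
    """Return the part of the host name after the first hyphen, or None."""
    host = urlsplit(url).hostname  # e.g. 'test-example', without port or path
    if host is None or '-' not in host:
        return None
    return host.split('-', 1)[1]

print(word_after_hyphen("http://test-example:123/wd/hub"))   # example
print(word_after_hyphen("http://tests-example:123/wd/hub"))  # example
```

Hyphens or colons in the path no longer matter here, since only the host name is inspected.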
I am not sure why you want to get a particular word from a string. I guess you want to check whether this word is present in the given string.
If that is the case, the code below can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
using re
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
    print(m.group())
Python strings have a built-in find method:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
will return:
12
-1
It is the index of the found substring. If it equals -1, the substring was not found in the string. You can also use the in keyword:
'example' in 'http://test-example:123/wd/hub'
True

regex fails to find in Python

With a given string:
Surname,MM,Forename,JTA19 R <first.second#domain.com>
I can match all the groups with this:
([A-Za-z]+),([A-Z]+),([A-Za-z]+),([A-Z0-9]+)\s([A-Z])\s<([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})
However, when I use it in Python, it never matches:
regex=re.compile(r"(?P<lastname>[A-Za-z]+),"
r"(?P<initials>[A-Z]+)"
r",(?P<firstname>[A-Za-z]+),"
r"(?P<ouc1>[A-Z0-9]+)\s"
r"(?P<ouc2>[A-Z])\s<"
r"(?P<email>[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})"
)
I think I've narrowed it down to this part of email:
[A-Z0-9._%+-]
What is wrong?
Replace
r"(?P<email>[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})"
with
r"(?P<email>[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
to allow for lowercase letters too.
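A quick way to confirm this, using the sample line from the question. The names prefix, upper_only and mixed_case are only for this sketch.

```python
import re

line = "Surname,MM,Forename,JTA19 R <first.second#domain.com>"

# Everything up to the email group, unchanged from the question.
prefix = (r"(?P<lastname>[A-Za-z]+),"
          r"(?P<initials>[A-Z]+)"
          r",(?P<firstname>[A-Za-z]+),"
          r"(?P<ouc1>[A-Z0-9]+)\s"
          r"(?P<ouc2>[A-Z])\s<")

upper_only = re.compile(prefix + r"(?P<email>[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})")
mixed_case = re.compile(prefix + r"(?P<email>[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})")

print(upper_only.search(line))  # None: the address is lowercase
m = mixed_case.search(line)
print(m.group("lastname"), m.group("email"))  # Surname first.second#domain.com
```

Compiling the original pattern with re.IGNORECASE would have a similar effect, at the cost of also relaxing the other groups.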
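Python joins adjacent string literals, so your call does pass a single pattern to compile. Still, it can be easier to read as one whole regular expression in verbose mode. Note that in verbose mode an unescaped # starts a comment, so it has to be escaped, and the letter classes in the email part need lowercase as well:
exp = r'''
(?P<lastname>[A-Za-z]+),
(?P<initials>[A-Z]+),
(?P<firstname>[A-Za-z]+),
(?P<ouc1>[A-Z0-9]+)\s
(?P<ouc2>[A-Z])\s<
(?P<email>[A-Za-z0-9._%+-]+\#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})'''
regex = re.compile(exp, re.VERBOSE)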
Although I have to say, your string is just comma separated, so this might be a bit easier:
>>> s = "Surname,MM,Forename,JTA19 R <first.second#domain.com>"
>>> lastname,initials,firstname,rest = s.split(',')
>>> ouc1,ouc2,email = rest.split(' ')
>>> lastname,initials,firstname,ouc1,ouc2,email[1:-1]
('Surname', 'MM', 'Forename', 'JTA19', 'R', 'first.second#domain.com')

Python RegEx search and replace with part of original expression

I'm new to Python and looking for a way to replace all occurrences of "[A-Z]0" with the [A-Z] portion of the string to get rid of certain numbers that are padded with a zero. I used this snippet to get rid of the whole occurrence from the field I'm processing:
import re
def strip_zeros(s):
    return re.sub("[A-Z]0", "", s)
test = strip_zeros(!S_fromManhole!)
How do I perform the same type of procedure but without removing the leading letter of the "[A-Z]0" expression?
Thanks in advance!
Use backreferences.
http://www.regular-expressions.info/refadv.html "\1 through \9 Substituted with the text matched between the 1st through 9th pair of capturing parentheses."
http://docs.python.org/2/library/re.html#re.sub "Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern."
Untested, but it would look like this:
return re.sub(r"([A-Z])0", r"\1", s)
This places the leading letter inside a capture group and references it with \1 in the replacement.
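A short, tested version of that idea. The sample values are made up, since the question does not show real field contents.

```python
import re

def strip_zeros(s):
    # Keep the captured letter (group 1) and drop the single "0" after it.
    return re.sub(r"([A-Z])0", r"\1", s)

print(strip_zeros("Q0"))      # Q
print(strip_zeros("MH-A01"))  # MH-A1
print(strip_zeros("A007"))    # A07  (only the zero right after the letter goes)
print(strip_zeros("10"))      # 10   (no preceding letter, so unchanged)
```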
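You can also try str.translate (Python 2 syntax shown; note that it removes every 0 in the string, not only those that follow a letter):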
In [47]: s = "ab0"
In [48]: s.translate(None, '0')
Out[48]: 'ab'
In [49]: s = "ab0zy"
In [50]: s.translate(None, '0')
Out[50]: 'abzy'
I like Patashu's answer for this case but for the sake of completeness, passing a function to re.sub instead of a replacement string may be cleaner in more complicated cases. The function should take a single match object and return a string.
>>> def strip_zeros(s):
...     def unpadded(m):
...         return m.group(1)
...     return re.sub("([A-Z])0", unpadded, s)
...
>>> strip_zeros("Q0")
'Q'
