Split with single colon but not double colon using regex - python

I have a string like this
"yJdz:jkj8h:jkhd::hjkjh"
I want to split it using colon as a separator, but not a double colon. Desired result:
("yJdz", "jkj8h", "jkhd::hjkjh")
I'm trying with:
re.split(":{1}", "yJdz:jkj8h:jkhd::hjkjh")
but this gives the wrong result, because it also splits on each colon of the double colon.
In the meantime I'm escaping "::" with string.replace("::", "$$").

You could split on (?<!:):(?!:). This uses two negative lookarounds (a lookbehind and a lookahead) which assert that a valid match only has one colon, without a colon before or after it.
To explain the pattern:
(?<!:) # assert that the previous character is not a colon
: # match a literal : character
(?!:) # assert that the next character is not a colon
Both lookarounds are needed, because if there was only the lookbehind, then the regular expression engine would match the first colon in :: (because the previous character isn't a colon), and if there was only the lookahead, the second colon would match (because the next character isn't a colon).
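A quick way to see why both assertions are needed is to run the three variants side by side. This is a minimal sketch using the string from the question; the variable names are illustrative only.

```python
import re

s = "yJdz:jkj8h:jkhd::hjkjh"

# Both lookarounds: only a lone colon counts as a separator.
both = re.split(r"(?<!:):(?!:)", s)

# Lookbehind only: the first colon of "::" still matches.
only_lookbehind = re.split(r"(?<!:):", s)

# Lookahead only: the second colon of "::" still matches.
only_lookahead = re.split(r":(?!:)", s)

print(both)             # ['yJdz', 'jkj8h', 'jkhd::hjkjh']
print(only_lookbehind)  # ['yJdz', 'jkj8h', 'jkhd', ':hjkjh']
print(only_lookahead)   # ['yJdz', 'jkj8h', 'jkhd:', 'hjkjh']
```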

You can do this with lookahead and lookbehind, if you want:
>>> import re
>>> s = "yJdz:jkj8h:jkhd::hjkjh"
>>> l = re.split(r"(?<!:):(?!:)", s)
>>> print(l)
['yJdz', 'jkj8h', 'jkhd::hjkjh']
This regex essentially says "match a : that is neither preceded nor followed by another :".

Related

How to match and replace this pattern in Python RE?

s = "[abc]abx[abc]b"
s = re.sub(r"\[([^\]]*)\]a", "ABC", s)
# s is now 'ABCbx[abc]b'
In the string s, I want to match 'abc' when it is enclosed in [] and followed by an 'a'. So in that string, the first [abc] should be replaced and the second should not.
The pattern above reads as follows:
match a '[', followed by any number of characters that are not ']', then a ']', followed by the character 'a'.
However, after the replacement I want the string to be:
[ABC]abx[abc]b (not ABCbx[abc]b)
Namely, I don't want the whole match to be replaced, only the text inside the brackets []. How can I achieve that?
match.group(1) returns the content inside []. But how can I take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub(r"\[([^\]]*)\]a", "[ABC]a", s)
There is more than one method; one of them is to exploit groups.
import re
s = "[abc]abx[abc]b"
out = re.sub(r'(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in parentheses) in the first argument of re.sub. I refer to the 1st and 3rd (indexing starts at 1) so they remain unchanged, and put ABC in place of the 2nd group. The second argument of re.sub is a raw string, so I do not need to escape \.
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.
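Because the lookarounds consume nothing, the pattern can be passed to re.sub directly and only the text between the brackets is replaced. A minimal sketch with the string from the question:

```python
import re

s = "[abc]abx[abc]b"

# Only the text between "[" and "]a" is part of the match,
# so the brackets and the trailing "a" are left untouched.
out = re.sub(r"(?<=\[)[^]]*(?=\]a)", "ABC", s)

print(out)  # [ABC]abx[abc]b
```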

regex for extracting all urls from dict like string

Here is the string from which I have to extract URLs:
s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"
My attempt so far is:
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', s)
but it only prints repetitions of this truncated URL:
['https://www.riteaid.com']
Since this is a dict-like string, a regex tailored to this particular format can be used:
s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"
urls = re.findall(r"url:'(https?://.*?)'}", s)
result:
['https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442',
'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009',
'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249',
'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568']
Explanation
url:': literal string
(https?://: start of the capture group; literal "http", an optional "s", then literal "://"
.*?): any characters, matched non-greedily; end of the capture group
'}: literal string
If you must use a regex for your current example, you can match between {url:' and '} with a positive lookbehind (?<= and a positive lookahead (?=, and match the URL itself with a negated character class [^']+, which matches any character except ' one or more times.
(?<={url:')[^']+(?='})
You can also be less restrictive for your example data and leave out the leading { and trailing }:
(?<=url:')[^']+(?=')
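The two lookaround patterns can be tried with re.findall as in the sketch below. The sample string is shortened to two of the entries from the question, and the braces are escaped here as a precaution, which is an addition not present in the patterns above.

```python
import re

# Shortened sample: two of the four entries from the question.
s = ("'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},"
     "'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}")

# Strict form: the URL must sit between {url:' and '}.
strict = re.findall(r"(?<=\{url:')[^']+(?='\})", s)

# Looser form: only url:' before and a closing quote after.
loose = re.findall(r"(?<=url:')[^']+(?=')", s)

for url in strict:
    print(url)
```

Since the matches contain only the URL text, no capture group is needed.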

regular expressions excluding words that begin with a colon

I have been trying to figure out how to include certain word groups and exclude others. I have this string, for example:
string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
I want to find HI:MYDLKJL:ajkld? and app? but not :JKLJBLKJD:DKJL? because it begins with a colon. I wrote the code below, but it still matches JKLJBLKJD:DKJL?, just without the leading colon:
match3=re.findall("[A-Za-z]{1,15}[:]{0,1}[A-Za-z]{0,15}[:]{0,1}[A-Za-z]{0,15}[:]{0,1}[A-Za-z]{0,15}[\?]{1}",string1)
The actual pattern is pretty simple to specify. But, you'll also need to specify a look-behind to handle the second term appropriately.
>>> re.findall(r'(?:(?<=\s)|(?<=^))[^:]\S+\?', string1)
['HI:MYDLKJL:ajkld?', 'app?']
The regex means "any expression that does not start with a colon but ends with a question mark".
(?: # lookbehind
(?<=\s) # space
| # OR
(?<=^) # start-of-line metachar
)
[^:] # anything that is not a colon
\S+ # one or more characters that are not a space
\? # literal question mark
A simple word boundary does not work because \b will also match the boundary between : and JKLJBLKJD... no bueno, hence the lookbehind.
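A more compact spelling of the same idea is sketched below. It swaps the two-branch lookbehind for a single negative lookbehind (?<!\S), meaning "not preceded by a non-space character", which holds both at the start of the string and after whitespace. This variant is an assumption of equivalence for this input, not the pattern given above.

```python
import re

string1 = "HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"

# (?<!\S)  : at the start of the string or right after whitespace
# [^:\s]   : first character is neither a colon nor whitespace
# \S*\?    : rest of the token, ending in a literal question mark
matches = re.findall(r"(?<!\S)[^:\s]\S*\?", string1)

print(matches)  # ['HI:MYDLKJL:ajkld?', 'app?']
```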
Alternate approach
>>> string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
>>> string1.split()
['HI:MYDLKJL:ajkld?', ':JKLJBLKJD:DKJL?', 'app?']
# filter out elements not needed
>>> [s for s in string1.split() if not s.startswith(':')]
['HI:MYDLKJL:ajkld?', 'app?']
Or, using the third-party regex module (import regex), which supports the (*SKIP)(*F) verbs:
>>> string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
>>> regex.findall(r'(?:^|\s):\S+(*SKIP)(*F)|\S+', string1)
['HI:MYDLKJL:ajkld?', 'app?']
(?:^|\s):\S+(*SKIP)(*F) will effectively ignore strings starting with :
(?: means non-capturing group

Split string at capital letter but only if no whitespace

Set-up
I've got a string of names which need to be separated into a list.
Following this answer, I have,
string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)
where the last line gives me,
['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']
Problems
1) Whitespace is ignored
'Prenzlauer Berg' is actually 1 name but the code splits according to the 'split-at-capital-letter' rule.
How do I prevent a split at a capital letter when the preceding character is whitespace?
2) Special characters not handled well
The code used cannot handle 'ö'. How do I include such 'German' characters?
I.e. I want to obtain,
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
You can use positive and negative lookbehind and just list the Umlauts explicitly:
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
(?<!\s)...: matches ... that is not preceded by \s
(?<=\s)...: matches ... that is preceded by \s
(?:...): non-capturing group so as to not mess with the findall results
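The same lookaround idea can also be turned around to split instead of match. The sketch below is an alternative not taken from the answer above: it splits at every position that has a lowercase letter on the left and a capital on the right. It assumes Python 3.7 or later, where re.split can split on empty matches, and it lists the German letters explicitly.

```python
import re

string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'

# Zero-width split point: lowercase letter behind, capital letter ahead.
# "Prenzlauer Berg" stays intact because the "B" follows a space.
names = re.split(r'(?<=[a-zäöüß])(?=[A-ZÄÖÜ])', string)

print(names)  # ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```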
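This works for the example string, although the range a-ü is broader than it looks: it spans every code point from a to ü, including punctuation such as { and } and uppercase letters such as Ä.
string="KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
pattern=r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']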

Regex replace with negative look ahead in Python

I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use a lookbehind and a lookahead, as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print [fixRepeatedQuotes(str) for str in alist]
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!
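This should work for plainly quoted text such as 'ABC' (subject is the input string). Note that it also matches the leading '\' of the escaped items in the list, so those would need to be excluded first: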
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
explanation
"""
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
( # Match the regular expression below and capture its match into backreference number 1
[^'"] # Match a single character NOT present in the list “'"” from this character class (aka any character matches except a single and double quote)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
"""
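re.sub accepts a function as the replacement. The function must return the captured group, match.group(1), because match.group() with no argument returns the whole match including the quotes. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), "'ABC'")
yields
'ABC'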

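Applied to the whole list from the question, the function-replacement idea might look like the sketch below. It assumes the quoted text to unwrap consists only of ASCII letters, and strip_plain_quotes is a hypothetical helper name.

```python
import re

alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]',
         "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]

def strip_plain_quotes(text):
    # Hypothetical helper: remove quotes only around runs of ASCII letters.
    # group(1) is the text between the quotes; group() would keep the quotes.
    return re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), text)

cleaned = [strip_plain_quotes(item) for item in alist]

print(cleaned[0])  # ABC
```

The escaped items are left alone because none of them contains a quote, a run of letters, and a closing quote in direct sequence.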