Regex to split string excluding delimiters between escapable quotes - python

I need to be able space separate a string unless the space is contained within escapable quotes. In other words spam spam spam "and \"eggs" should return spam, spam, spam and and "eggs. I intend to do this using the re.split method in python where you identify the characters to split on using regex.
I found this which finds everything between escapable quotes:
((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1
from: https://www.metaltoad.com/blog/regex-quoted-string-escapable-quotes
and this which splits by character unless between quotes:
\s(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
from: https://stackabuse.com/regex-splitting-by-character-unless-in-quotes/. This finds all spaces with an even number of doubles quotes between the space and the end of the line.
I'm struggling join those two solution together.
For ref reference I found this I found this super-useful regex cheat sheet: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
I also found https://regex101.com/ extremely useful: allows you to test regex

Finally managed it:
\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)
This combines to two solutions in the question to find spaces with even numbers of unescaped double quotes to the right hand side. Explanation:
\s # space
(?= # followed by (not included in match though)
(?: # match pattern (but don't capture)
(?:
\\\" # match escaped double quotes
| # OR
[^\"] # any character that is not double quotes
)* # 0 or more times
(?<!\\)\" # followed by unescaped quotes
(?:\\\"|[^\"])* # as above match escaped double quotes OR any character that is not double quotes
(?<!\\)\" # as above - followed by unescaped quotes
# the above pairs of unescaped quotes
)* # repeated 0 or more times (acting on pairs of quotes given an even number of quotes returned)
(?:\\\"|[^\"])* # as above
$ # end of the line
)
So the final python is:
import re
test_str = r'spam spam spam "and \"eggs"'
regex = r'\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)'
test_list = re.split(regex, test_str)
print(test_list)
>>> ['spam', 'spam', 'spam', '"and \\"eggs"']
The only down side to this method is that it leave leading trailing quotes, however I can easily identify and remove these with the following python:
# remove leading and trailing unescaped quotes
test_list = list(map(lambda x: re.sub(r'(?<!\\)"', '', x), test_list))
# remove escape characters - they are no longer required
test_list = list(map(lambda x: x.replace(r'\"', '"'), test_list))
print(test_list)
>>> ['spam', 'spam', 'spam', 'and "eggs']

Related

Splitting whitespace string into list but not splitting whitespace in quotes and also allow special characters (like $, %, etc) in quotes in Python

s = 'hello "ok and #com" name'
s.split()
Is there a way to split this into a list that splits whitespace characters but as well not split white characters in quotes and allow special characters in the quotes.
["hello", '"ok and #com"', "name"]
I want it to be able to output like this but also allow the special characters in it no matter what.
Can someone help me with this?
(I've looked at other posts that are related to this, but those posts don't allow the special characters when I have tested it.)
You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346
import re
re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)',s)
Returns:
['hello', '"ok and #com"', 'name']
Explanation of regex:
\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
One option is to use regular expressions to capture the strings in quotes, delete them, and then to split the remaining text on whitespace. Note that this won't work if the order of the resulting list matters.
import re
items = []
s = 'hello "ok and #com" name'
patt = re.compile(r'(".*?")')
# regex to find quoted strings
match = re.search(patt, s)
if match:
for item in match.groups():
items.append(item)
# split on whitespace after removing quoted strings
for item in re.sub(patt, '', s).split():
items.append(item)
>>>items
['"ok and #com"', 'hello', 'name']

Extracting substrings between single quotes

I am new in python and trying to extract substrings between single quotes. Do you know how to do this with regex?
E.G input
text = "[(u'apple',), (u'banana',)]"
I want to extract apple and banana as list items like ['apple', 'banana']
In the general case, to extract any chars in between single quotes, the most efficient regex approach is
re.findall(r"'([^']*)'", text) # to also extract empty values
re.findall(r"'([^']+)'", text) # to only extract non-empty values
See the regex demo.
Details
' - a single quote (no need to escape inside a double quote string literal)
([^']*) - a capturing group that captures any 0+ (or 1+ if you use + quantifier) chars other than ' (the [^...] is a negated character class that matches any chars other than those specified in the class)
' - a closing single quote.
Note that re.findall only returns captured substrings if capturing groups are specified in the pattern:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Python demo:
import re
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"'([^']*)'", text))
# => ['apple', 'banana']
Escaped quote support
If you need to support escaped quotes (so as to match abc\'def in 'abc\'def' you will need a regex like
re.findall(r"'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # in case the text contains only "valid" pairs of quotes
re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # if your text is too messed up and there can be "wild" single quotes out there
See regex variation 1 and regex variation 2 demos.
Pattern details
(?<!\\) - a negative lookbehind that fails the match if there is a backslash immediately to the left of the current position
(?:\\\\)* - 0 or more consecutive double backslashes (since these are not escaping the neighboring character)
' - an open '
([^'\\]*(?:\\.[^'\\]*)*) - Group 1 (what will be returned by re.findall)matching...
[^'\\]* - 0 or more chars other than ' and \
(?: - start of a non-capturing group that matches
\\. - any escaped char (a backslash and any char including line breaks due to the re.DOTALL modifier)
[^'\\]* - 0 or more chars other than ' and \
)* - ... zero or more times
' - a closing '.
See another Python demo:
import re
text = r"[(u'apple',), (u'banana',)] [(u'apple',), (u'banana',), (u'abc\'def',)] \\'abc''def' \\\'abc 'abc\\\\\'def'"
print(re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text))
# => apple, banana, apple, banana, abc\'def, abc, def, abc\\\\\'def
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"\(u'(.*?)',\)", text)
['apple', 'banana']
text = "[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"
print(re.findall(r"\(u'(.*?)',\)", text)[0])
this string contains' an escaped quote mark and \ an escaped slash
You may alternatively use ast.literal_eval then extract the first item by list comprehension:
from ast import literal_eval
text = "[(u'apple',), (u'banana',)]"
literal_eval(text)
Out[3]: [(u'apple',), (u'banana',)]
[t[0] for t in literal_eval(text)]
Out[4]: [u'apple', u'banana']

How can I find a match and update it with RegEx?

I have a string as
a = "hello i am stackoverflow.com user +-"
Now I want to convert the escape characters in the string except the quotation marks and white space. So my expected output is :
a = "hello i am stackoverflow\.com user \+\-"
What I did so far is find all the special characters in a string except whitespace and double quote using
re.findall(r'[^\w" ]',a)
Now, once I found all the required special characters I want to update the string. I even tried re.sub but it replaces the special characters. Is there anyway I can do it?
Use re.escape.
>>> a = "hello i am stackoverflow.com user +-"
>>> print(re.sub(r'\\(?=[\s"])', r'', re.escape(a)))
hello i am stackoverflow\.com user \+\-
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
r'\\(?=[\s"])' matches all the backslashes which exists just before to space or double quotes. Replacing the matched backslashes with an empty string will give you the desired output.
OR
>>> a = 'hello i am stackoverflow.com user "+-'
>>> print(re.sub(r'((?![\s"])\W)', r'\\\1', a))
hello i am stackoverflow\.com user "\+\-
((?![\s"])\W) captures all the non-word characters but not of space or double quotes. Replacing the matched characters with backslash + chars inside group index 1 will give you the desired output.
It seems like you could use backreferences with re.sub to achieve what your desired output:
import re
a = "hello i am stackoverflow.com user +-"
print re.sub(r'([^\w" ])', r'\\\1', a) # hello i am stackoverflow\.com user \+\-
The replacement pattern r'\\\1' is just \\ which means a literal backslash, followed \1 which means capture group 1, the pattern captured in the parentheses in the first argument.
In other words, it will escape everything except:
alphanumeric characters
underscore
double quotes
space

Add string between tabs and text

I simply want to add string after (0 or more) tabs in the beginning of a string.
i.e.
a = '\t\t\tHere is the next part of string. More garbage.'
(insert Added String here.)
to
b = '\t\t\t Added String here. Here is the next part of string. More garbage.'
What is the easiest/simplest way to go about it?
Simple:
re.sub(r'^(\t*)', r'\1 Added String here. ', inputtext)
The ^ caret matches the start of the string, \t a tab character, of which there should be zero or more (*). The parenthesis capture the matched tabs for use in the replacement string, where \1 inserts them again in front of the string you need adding.
Demo:
>>> import re
>>> a = '\t\t\tHere is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', a)
'\t\t\t Added String here. Here is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', 'No leading tabs.')
' Added String here. No leading tabs.'

Match single quotes from python re

How to match the following i want all the names with in the single quotes
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How to extract the name within the single quotes only
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve, this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"

Categories