Extracting substrings between single quotes - python

I am new to Python and am trying to extract the substrings between single quotes. Do you know how to do this with a regex?
Example input:
text = "[(u'apple',), (u'banana',)]"
I want to extract apple and banana as list items like ['apple', 'banana']

In the general case, to extract any chars in between single quotes, the most efficient regex approach is
re.findall(r"'([^']*)'", text) # to also extract empty values
re.findall(r"'([^']+)'", text) # to only extract non-empty values
See the regex demo.
Details
' - a single quote (no need to escape inside a double quote string literal)
([^']*) - a capturing group that captures any 0+ (or 1+ if you use + quantifier) chars other than ' (the [^...] is a negated character class that matches any chars other than those specified in the class)
' - a closing single quote.
Note that re.findall only returns captured substrings if capturing groups are specified in the pattern:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Python demo:
import re
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"'([^']*)'", text))
# => ['apple', 'banana']
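To make the capturing-group behaviour of re.findall concrete, here is a minimal sketch on the same input. The variable names and the extra (u) group are only there for illustration; they are not part of the answer above.

```python
import re

text = "[(u'apple',), (u'banana',)]"

# No capturing group: findall returns the whole match, quotes included.
whole = re.findall(r"'[^']*'", text)

# One capturing group: findall returns only what the group captured.
inner = re.findall(r"'([^']*)'", text)

# Two capturing groups: findall returns a list of tuples.
pairs = re.findall(r"(u)'([^']*)'", text)

print(whole)  # ["'apple'", "'banana'"]
print(inner)  # ['apple', 'banana']
print(pairs)  # [('u', 'apple'), ('u', 'banana')]
```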
Escaped quote support
If you need to support escaped quotes (so as to match abc\'def in 'abc\'def'), you will need a regex like
re.findall(r"'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # in case the text contains only "valid" pairs of quotes
re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # if your text is too messed up and there can be "wild" single quotes out there
See regex variation 1 and regex variation 2 demos.
Pattern details
(?<!\\) - a negative lookbehind that fails the match if there is a backslash immediately to the left of the current position
(?:\\\\)* - 0 or more consecutive double backslashes (since these are not escaping the neighboring character)
' - an open '
([^'\\]*(?:\\.[^'\\]*)*) - Group 1 (what will be returned by re.findall) matching...
[^'\\]* - 0 or more chars other than ' and \
(?: - start of a non-capturing group that matches
\\. - any escaped char (a backslash and any char including line breaks due to the re.DOTALL modifier)
[^'\\]* - 0 or more chars other than ' and \
)* - ... zero or more times
' - a closing '.
See another Python demo:
import re
text = r"[(u'apple',), (u'banana',)] [(u'apple',), (u'banana',), (u'abc\'def',)] \\'abc''def' \\\'abc 'abc\\\\\'def'"
print(re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text))
# => apple, banana, apple, banana, abc\'def, abc, def, abc\\\\\'def
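As a smaller illustration of the first (no lookbehind) variation, the sketch below runs it on a made-up input with one escaped quote and one escaped backslash. The sample text and the final unescaping step are assumptions added for illustration; the answer itself stops at extracting the raw values.

```python
import re

# Hypothetical sample: an escaped quote, an escaped backslash, and a plain value.
text = r"say 'abc\'def' then 'x\\' and 'plain'"

QUOTED = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)

# Values exactly as they appear between the quotes, escapes included.
raw_values = QUOTED.findall(text)

# Optional post-processing (not part of the answer): drop each escaping
# backslash and keep the character it protected.
values = [re.sub(r"\\(.)", r"\1", v, flags=re.DOTALL) for v in raw_values]

print(raw_values)
print(values)
```

Whether the backslashes should be kept or removed depends on what the extracted values are used for, so the last step is a design choice rather than a requirement.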

import re
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"\(u'(.*?)',\)", text))
# => ['apple', 'banana']
text = "[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"
print(re.findall(r"\(u'(.*?)',\)", text)[0])
# => this string contains' an escaped quote mark and\ an escaped slash

You may alternatively use ast.literal_eval then extract the first item by list comprehension:
from ast import literal_eval
text = "[(u'apple',), (u'banana',)]"
literal_eval(text)
Out[3]: [(u'apple',), (u'banana',)]
[t[0] for t in literal_eval(text)]
Out[4]: [u'apple', u'banana']
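The session above shows Python 2 output. A rough Python 3 equivalent is sketched below, assuming the input is always a valid Python literal; the second value with an escaped quote is a made-up example to show that the parser resolves escapes for you.

```python
from ast import literal_eval

# Hypothetical input: the second value contains an escaped single quote.
text = "[(u'apple',), (u'it\\'s',)]"

# literal_eval parses the text as a Python literal, so string escapes are
# handled by the parser rather than by a hand-written regex.
values = [t[0] for t in literal_eval(text)]

print(values)  # ['apple', "it's"]
```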

Related

Regex to split string excluding delimiters between escapable quotes

I need to split a string on spaces unless the space is contained within escapable quotes. In other words, spam spam spam "and \"eggs" should return spam, spam, spam and and "eggs. I intend to do this using the re.split method in Python, where a regex identifies the characters to split on.
I found this which finds everything between escapable quotes:
((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1
from: https://www.metaltoad.com/blog/regex-quoted-string-escapable-quotes
and this which splits by character unless between quotes:
\s(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
from: https://stackabuse.com/regex-splitting-by-character-unless-in-quotes/. This finds all spaces with an even number of double quotes between the space and the end of the line.
I'm struggling to join those two solutions together.
For reference, I found this super-useful regex cheat sheet: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
I also found https://regex101.com/ extremely useful: it allows you to test a regex.
Finally managed it:
\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)
This combines the two solutions in the question to find spaces with an even number of unescaped double quotes to their right. Explanation:
\s # space
(?= # followed by (not included in match though)
(?: # match pattern (but don't capture)
(?:
\\\" # match escaped double quotes
| # OR
[^\"] # any character that is not double quotes
)* # 0 or more times
(?<!\\)\" # followed by unescaped quotes
(?:\\\"|[^\"])* # as above match escaped double quotes OR any character that is not double quotes
(?<!\\)\" # as above - followed by unescaped quotes
# the above pairs of unescaped quotes
)* # repeated 0 or more times (each repetition consumes one pair of quotes, so the total number of unescaped quotes is even)
(?:\\\"|[^\"])* # as above
$ # end of the line
)
So the final python is:
import re
test_str = r'spam spam spam "and \"eggs"'
regex = r'\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)'
test_list = re.split(regex, test_str)
print(test_list)
>>> ['spam', 'spam', 'spam', '"and \\"eggs"']
The only downside to this method is that it leaves the leading and trailing quotes in place; however, I can easily identify and remove these with the following Python:
# remove leading and trailing unescaped quotes
test_list = list(map(lambda x: re.sub(r'(?<!\\)"', '', x), test_list))
# remove escape characters - they are no longer required
test_list = list(map(lambda x: x.replace(r'\"', '"'), test_list))
print(test_list)
>>> ['spam', 'spam', 'spam', 'and "eggs']
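The commented breakdown given earlier in this answer can also be kept inside the pattern itself by compiling it with re.VERBOSE. The sketch below does that; the names SPLIT_OUTSIDE_QUOTES and split_outside_quotes are hypothetical, and the regex is meant to be the same one as above, only laid out over several lines.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so the comments live in the regex.
SPLIT_OUTSIDE_QUOTES = re.compile(r'''
    \s                        # a whitespace character to split on,
    (?=                       # only if the rest of the line consists of
        (?:
            (?:\\"|[^"])*     #   escaped quotes or non-quote characters
            (?<!\\)"          #   an unescaped quote
            (?:\\"|[^"])*     #   escaped quotes or non-quote characters
            (?<!\\)"          #   the unescaped quote that pairs with it
        )*                    # zero or more complete pairs,
        (?:\\"|[^"])*         # then text without any unescaped quote,
        $                     # up to the end of the line
    )
''', re.VERBOSE)


def split_outside_quotes(line):
    """Split on whitespace that is not inside unescaped double quotes."""
    return SPLIT_OUTSIDE_QUOTES.split(line)


parts = split_outside_quotes(r'spam spam spam "and \"eggs"')
print(parts)
```

Keeping the explanation next to the pattern makes it harder for the two to drift apart when the regex is edited later.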

Splitting whitespace string into list but not splitting whitespace in quotes and also allow special characters (like $, %, etc) in quotes in Python

s = 'hello "ok and #com" name'
s.split()
Is there a way to split this into a list on whitespace, without splitting on whitespace inside quotes, while still allowing special characters inside the quotes?
["hello", '"ok and #com"', "name"]
I want the output to look like this, with special characters allowed inside the quotes no matter what they are.
Can someone help me with this?
(I've looked at other posts that are related to this, but those posts don't allow the special characters when I have tested it.)
You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346
import re
re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)',s)
Returns:
['hello', '"ok and #com"', 'name']
Explanation of regex:
\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
One option is to use regular expressions to capture the strings in quotes, delete them, and then to split the remaining text on whitespace. Note that this won't work if the order of the resulting list matters.
import re
items = []
s = 'hello "ok and #com" name'
# regex to find quoted strings
patt = re.compile(r'(".*?")')
# collect every quoted string (re.search would stop at the first one)
for item in patt.findall(s):
    items.append(item)
# split on whitespace after removing quoted strings
for item in patt.sub('', s).split():
    items.append(item)
>>> items
['"ok and #com"', 'hello', 'name']
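If the order of the items does matter, one possible variation is to collect the tokens in a single re.findall pass instead of deleting the quoted parts first. This is only a sketch under the same assumptions as the answers above: quotes are balanced, are not escaped, and each quoted chunk is surrounded by whitespace.

```python
import re

s = 'hello "ok and #com" name'

# Try a double-quoted chunk first; otherwise take a run of non-whitespace.
tokens = re.findall(r'"[^"]*"|\S+', s)

print(tokens)  # ['hello', '"ok and #com"', 'name']
```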

remove only consecutive special characters but keep consecutive [a-zA-Z0-9] and single characters

How can I remove multiple consecutive occurrences of all the special characters in a string?
I can write code like:
re.sub('\.\.+',' ',string)
re.sub('##+',' ',string)
re.sub('\s\s+',' ',string)
for each character individually or, at best, use a loop over all the characters in a list, like:
from string import punctuation
for i in punctuation:
    to = ('\\' + i + '\\' + i + '+')
    string = re.sub(to, ' ', string)
but I'm sure there is a more efficient method.
I tried:
re.sub('[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', '\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y.')
but it removes all the special characters except those preceded by a letter.
The string may keep different consecutive special characters, as in 99#aaaa*!##$., but not repeats of the same character, as in ++--....
A pattern to match all non-alphanumeric characters in Python is [\W_].
So, all you need is to wrap the pattern with a capturing group and add \1+ after it to match 2 or more consecutive occurrences of the same non-alphanumeric characters:
text = re.sub(r'([\W_])\1+',' ',text)
In Python 3.x, if you wish to make the pattern ASCII aware only, use the re.A or re.ASCII flag:
text = re.sub(r'([\W_])\1+',' ',text, flags=re.A)
Mind the use of the r prefix that defines a raw string literal (so that you do not have to escape \ char).
See the regex demo. See the Python demo:
import re
text = "\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y."
print(re.sub(r'([\W_])\1+',' ',text))
Output, written as a Python string literal so that the whitespace is visible:
' .AAA.x. +* = xx000 x \t.x \nx*+Y.'
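To see what the \1 backreference contributes, the sketch below compares it with a plain {2,} quantifier on the sample string from the question. The variable names are illustrative only.

```python
import re

sample = "99#aaaa*!##$."  # sample string taken from the question

# Backreference: only runs of the SAME special character are replaced.
same_char_runs = re.sub(r'([\W_])\1+', ' ', sample)

# Plain quantifier: any run of two or more special characters is replaced.
any_special_runs = re.sub(r'[\W_]{2,}', ' ', sample)

print(repr(same_char_runs))    # '99#aaaa*! $.'
print(repr(any_special_runs))  # '99#aaaa '
```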

find all substrings wrapped in double quotes satisfying several constraints in python regular expression

I want to find all the substrings wrapped in the double quotes satisfying the following two constraints:
The shortest substring starting with "http"
End with ".bmp" or ".jpg"
My code is as follows:
import re
pat = r'"(http.+?\.(jpg|bmp))"' # I don't know how to modify this pattern
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))
My expected outputs are
['http:afd/aa.bmp', 'http:kk/bb.jpg']
But the execution results are
[('http:afd/aa.bmp', 'bmp'), ('http--test--http:kk/bb.jpg', 'jpg')]
I have already tried several kinds of patterns but I still can't get what I want.
How should I modify my code to get the results I expect? Thanks!
Use a [^"]* negated character class after the first " to stay within the double-quoted substring and get to the last http, then add it at the end, too, to get to the trailing ". Note that this will only work if there are no escape sequences in the string.
import re
pat = r'"[^"]*(http.*?\.(?:jpg|bmp))[^"]*"'
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))
# => ['http:afd/aa.bmp', 'http:kk/bb.jpg']
See the Python demo online.
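For comparison, here is a Python 3 sketch that runs the pattern from the question and the pattern from this answer side by side on the same input. The names question_pat and answer_pat are made up for the example.

```python
import re

aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'

# Pattern from the question: two capturing groups, so findall returns tuples,
# and the lazy .+? still starts at the first http inside the quotes.
question_pat = re.compile(r'"(http.+?\.(jpg|bmp))"')

# Pattern from the answer: one capturing group plus a non-capturing (?:...)
# group, with [^"]* skipping ahead to the last http inside the quotes.
answer_pat = re.compile(r'"[^"]*(http.*?\.(?:jpg|bmp))[^"]*"')

question_result = question_pat.findall(aa)
answer_result = answer_pat.findall(aa)

print(question_result)
print(answer_result)
```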
Pattern details:
" - a literal double quote
[^"]* - 0+ chars other than a double quote, as many as possible, since * is a greedy quantifier
(http.*?\.(?:jpg|bmp)) - Group 1 (extracted with re.findall) that matches:
http - a literal substring http
.*? - any 0+ chars, as few as possible (as *? is a lazy quantifier)
\. - a literal dot
(?:jpg|bmp) - a non-capturing group (so that the text it matches could not be output with re.findall) matching either jpg or bmp substring
[^"]* - 0+ chars other than a double quote, as many as possible
" - a literal double quote

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
I'll explain a little, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, which covers your requirement: ',' with only two non-numeric characters on either side.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
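The key property used above is that lookarounds test the neighbouring characters without consuming them. A small sketch of the difference, on made-up inputs and with illustrative variable names:

```python
import re

# q(?=u): the u is checked but not consumed, so only "q" is returned.
lookahead_hits = re.findall(r'q(?=u)', 'quit qatar')

# Consuming the neighbours removes them from the split pieces...
consumed = re.split(r'\D,\D', 'Lead,Zinc')

# ...while lookarounds only test them, so the pieces stay intact.
preserved = re.split(r'(?<=\D),(?=\D)', 'Lead,Zinc')

print(lookahead_hits)  # ['q']
print(consumed)        # ['Lea', 'inc']
print(preserved)       # ['Lead', 'Zinc']
```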
Use a regex with lookbehind/lookahead assertions:
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
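As a quick cross-check, the three patterns from the answers above can be run on both sample entries in Python 3. The function names below are hypothetical wrappers added for the comparison; agreement on these two entries does not guarantee that the patterns agree on every possible entry.

```python
import re

entries = [
    "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP",
    "Lead,Paints/Pigments,Zinc",
]


def split_one_nondigit(s):
    # Comma with a non-digit directly before it, or directly after it.
    return re.split(r'(?<=\D),\s*|\s*,(?=\D)', s)


def split_two_nondigits(s):
    # Comma with two non-digits before it, or two non-digits after it.
    return re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)


def find_until_letter(s):
    # Shortest run that ends in a letter followed by a comma or end of string.
    return re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s)


results = []
for entry in entries:
    a = split_one_nondigit(entry)
    b = split_two_nondigits(entry)
    c = find_until_letter(entry)
    results.append((a, b, c))
    print(a, a == b == c)
```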
