I need to be able to split a string on spaces, unless the space is contained within escapable quotes. In other words, spam spam spam "and \"eggs" should return four items: spam, spam, spam and and "eggs. I intend to do this using the re.split method in Python, where you identify the characters to split on using a regex.
I found this which finds everything between escapable quotes:
((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1
from: https://www.metaltoad.com/blog/regex-quoted-string-escapable-quotes
and this which splits by character unless between quotes:
\s(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
from: https://stackabuse.com/regex-splitting-by-character-unless-in-quotes/. This finds all spaces with an even number of double quotes between the space and the end of the line.
I'm struggling to join those two solutions together.
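To make the gap concrete, here is a minimal sketch of what the second pattern does on its own. The variable names are just for illustration. It counts every double quote, escaped or not, so it works on plainly quoted input but splits in the wrong place once an escaped quote appears:

```python
import re

# Pattern from the stackabuse article quoted above.
# It treats every double quote the same, whether escaped or not.
NAIVE_SPLIT = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)'

plain = 'spam "and eggs" spam'
escaped = r'spam spam spam "and \"eggs"'

plain_parts = re.split(NAIVE_SPLIT, plain)
escaped_parts = re.split(NAIVE_SPLIT, escaped)

print(plain_parts)    # ['spam', '"and eggs"', 'spam']
print(escaped_parts)  # ['spam spam spam "and', '\\"eggs"']
```

With the escaped quote, only the space inside the quotes has an even number of quote characters to its right, which is exactly the one space that should not be split on.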
For reference, I found this super-useful regex cheat sheet: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
I also found https://regex101.com/ extremely useful: it allows you to test a regex.
Finally managed it:
\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)
This combines the two solutions in the question to find spaces with an even number of unescaped double quotes to their right. Explanation:
\s # space
(?= # followed by (not included in match though)
(?: # match pattern (but don't capture)
(?:
\\\" # match escaped double quotes
| # OR
[^\"] # any character that is not double quotes
)* # 0 or more times
(?<!\\)\" # followed by unescaped quotes
(?:\\\"|[^\"])* # as above match escaped double quotes OR any character that is not double quotes
(?<!\\)\" # as above - followed by unescaped quotes
# the above pairs of unescaped quotes
)* # repeated 0 or more times (each repetition consumes one pair of quotes, so the number of unescaped quotes matched is always even)
(?:\\\"|[^\"])* # as above
$ # end of the line
)
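The commented breakdown above can also be compiled directly with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch of the same regex laid out that way; SPLIT_RE is a name chosen here for illustration:

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live in the regex.
SPLIT_RE = re.compile(r'''
    \s                      # a whitespace character
    (?=                     # followed by (lookahead, not part of the match)
        (?:                 # one pair of unescaped quotes:
            (?:\\"|[^"])*   #   escaped quotes or non-quote characters
            (?<!\\)"        #   an unescaped opening quote
            (?:\\"|[^"])*   #   escaped quotes or non-quote characters
            (?<!\\)"        #   an unescaped closing quote
        )*                  # repeated 0 or more times
        (?:\\"|[^"])*       # then only escaped quotes or non-quote characters
        $                   # up to the end of the string
    )
''', re.VERBOSE)

parts = SPLIT_RE.split(r'spam spam spam "and \"eggs"')
print(parts)  # ['spam', 'spam', 'spam', '"and \\"eggs"']
```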
So the final Python is:
import re
test_str = r'spam spam spam "and \"eggs"'
regex = r'\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)'
test_list = re.split(regex, test_str)
print(test_list)
>>> ['spam', 'spam', 'spam', '"and \\"eggs"']
The only downside to this method is that it leaves the leading and trailing quotes in place; however, I can easily identify and remove these with the following Python:
# remove leading and trailing unescaped quotes
test_list = list(map(lambda x: re.sub(r'(?<!\\)"', '', x), test_list))
# remove escape characters - they are no longer required
test_list = list(map(lambda x: x.replace(r'\"', '"'), test_list))
print(test_list)
>>> ['spam', 'spam', 'spam', 'and "eggs']
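The split and the two clean-up steps can be wrapped into one helper. This is a sketch under the same assumptions as above (only double quotes, escaped with a backslash); the function name split_outside_quotes is hypothetical:

```python
import re

# Pattern from the answer above: whitespace with an even number of
# unescaped double quotes to its right.
SPLIT_RE = re.compile(r'\s(?=(?:(?:\\"|[^"])*(?<!\\)"(?:\\"|[^"])*(?<!\\)")*(?:\\"|[^"])*$)')
UNESCAPED_QUOTE_RE = re.compile(r'(?<!\\)"')


def split_outside_quotes(text):
    """Split on whitespace outside quotes, then drop quotes and escapes."""
    parts = SPLIT_RE.split(text)
    # remove every unescaped quote (this includes the leading and trailing ones)
    parts = [UNESCAPED_QUOTE_RE.sub('', part) for part in parts]
    # remove escape characters, they are no longer required
    return [part.replace(r'\"', '"') for part in parts]


print(split_outside_quotes(r'spam spam spam "and \"eggs"'))
# ['spam', 'spam', 'spam', 'and "eggs']
```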
s = 'hello "ok and #com" name'
s.split()
Is there a way to split this into a list on whitespace characters, without splitting on whitespace inside quotes, while allowing special characters inside the quotes?
["hello", '"ok and #com"', "name"]
I want it to be able to output like this but also allow the special characters in it no matter what.
Can someone help me with this?
(I've looked at other posts that are related to this, but those posts don't allow the special characters when I have tested it.)
You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346
import re
re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)',s)
Returns:
['hello', '"ok and #com"', 'name']
Explanation of regex:
\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
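A short sketch of how this pattern behaves on a few inputs may help. The extra test strings are made up for illustration. Special characters inside the quotes are fine because the pattern only looks at double quotes and whitespace, but note that an unbalanced quote changes which spaces are split:

```python
import re

SPLIT_RE = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')

balanced = SPLIT_RE.split('hello "ok and #com" name')
special = SPLIT_RE.split('say "a$b  &c|d" now')
# Only one quote: the space before it has an odd number of quotes ahead,
# so it is NOT split, while the space after it is.
unbalanced = SPLIT_RE.split('a "b c')

print(balanced)    # ['hello', '"ok and #com"', 'name']
print(special)     # ['say', '"a$b  &c|d"', 'now']
print(unbalanced)  # ['a "b', 'c']
```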
One option is to use regular expressions to capture the strings in quotes, delete them, and then to split the remaining text on whitespace. Note that this won't work if the order of the resulting list matters.
import re
items = []
s = 'hello "ok and #com" name'
patt = re.compile(r'(".*?")')
# regex to find quoted strings (findall returns every one, not just the first)
items.extend(patt.findall(s))
# split on whitespace after removing quoted strings
items.extend(patt.sub('', s).split())
>>> items
['"ok and #com"', 'hello', 'name']
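If the order does matter, a different technique is to match the tokens instead of removing them: use re.findall with an alternation that tries a quoted string first and a run of non-whitespace second. This is a sketch, not the method from the answer above, and it assumes quotes are balanced, unescaped, and start at the beginning of a token:

```python
import re

# Try a whole quoted string first, otherwise take a run of non-whitespace.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')

s = 'hello "ok and #com" name'
tokens = TOKEN_RE.findall(s)
print(tokens)  # ['hello', '"ok and #com"', 'name']
```

Because findall scans left to right, the tokens come back in their original order.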
How can I remove multiple consecutive occurrences of all the special characters in a string?
I can get the code like:
re.sub('\.\.+',' ',string)
re.sub('##+',' ',string)
re.sub('\s\s+',' ',string)
for individual characters and, at best, use a loop over all the characters in a list, like:
from string import punctuation
for i in punctuation:
    to = ('\\' + i + '\\' + i + '+')
    string = re.sub(to, ' ', string)
but I'm sure there is a more efficient method.
I tried:
re.sub('[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', '\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y.')
but it removes every run of special characters, leaving only single ones that are preceded by a letter.
The string can have different consecutive special characters, like 99#aaaa*!##$., but not the same one repeated, like ++--....
A pattern to match all non-alphanumeric characters in Python is [\W_].
So, all you need is to wrap the pattern with a capturing group and add \1+ after it to match 2 or more consecutive occurrences of the same non-alphanumeric characters:
text = re.sub(r'([\W_])\1+',' ',text)
In Python 3.x, if you wish to make the pattern ASCII aware only, use the re.A or re.ASCII flag:
text = re.sub(r'([\W_])\1+',' ',text, flags=re.A)
Mind the use of the r prefix that defines a raw string literal (so that you do not have to escape \ char).
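A small sketch of the difference the re.A flag makes. The sample strings are made up, and \u00e9 is the letter é: without the flag it counts as a word character and is left alone, with the flag it is treated like any other non-alphanumeric character:

```python
import re

pattern = r'([\W_])\1+'
text = 'caf\u00e9\u00e9!!'

unicode_result = re.sub(pattern, ' ', text)             # only '!!' is collapsed
ascii_result = re.sub(pattern, ' ', text, flags=re.A)   # the doubled e-acute and '!!' are both collapsed
underscore_result = re.sub(pattern, ' ', 'a__b')        # '_' is covered by [\W_]

print(repr(unicode_result))     # 'caféé '
print(repr(ascii_result))       # 'caf  '
print(repr(underscore_result))  # 'a b'
```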
Python demo:
import re
text = "\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y."
print(re.sub(r'([\W_])\1+',' ',text))
Output:
.AAA.x. +* = xx000 x .x
x*+Y.
I want to find all the substrings wrapped in the double quotes satisfying the following two constraints:
The shortest substring starting with "http"
End with ".bmp" or ".jpg"
My code is below:
import re
pat = '"(http.+?\.(jpg|bmp))"' # I don't know how to modify this pattern
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))
My expected outputs are
['http:afd/aa.bmp', 'http:kk/bb.jpg']
But the execution results are
[('http:afd/aa.bmp', 'bmp'), ('http--test--http:kk/bb.jpg', 'jpg')]
I have already tried several kinds of patterns but I still can't get what I want.
How should I modify my code to get the results I expect? Thanks!
Use a [^"]* negated character class after the first " to stay within the double-quoted substring and get to the last http, then add it at the end, too, to get to the trailing ". Note that this will only work if there are no escape sequences in the string.
import re
pat = r'"[^"]*(http.*?\.(?:jpg|bmp))[^"]*"'
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))
# => ['http:afd/aa.bmp', 'http:kk/bb.jpg']
Pattern details:
" - a literal double quote
[^"]* - 0+ chars other than a double quote, as many as possible, since * is a greedy quantifier
(http.*?\.(?:jpg|bmp)) - Group 1 (extracted with re.findall) that matches:
http - a literal substring http
.*? - any 0+ chars, as few as possible (as *? is a lazy quantifier)
\. - a literal dot
(?:jpg|bmp) - a non-capturing group (so that the text it matches could not be output with re.findall) matching either jpg or bmp substring
[^"]* - 0+ chars other than a double quote, as many as possible
" - a literal double quote
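To see where the question's two problems came from, the patterns can be compared side by side. This is a minimal sketch with variable names chosen for illustration: the tuples come from having two capturing groups, and the overly long match comes from starting at the first http after the quote:

```python
import re

aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'

# Two capturing groups: re.findall returns one tuple per match.
two_groups = re.findall(r'"(http.+?\.(jpg|bmp))"', aa)
# Inner group made non-capturing: plain strings, but still the first http.
one_group = re.findall(r'"(http.+?\.(?:jpg|bmp))"', aa)
# Greedy [^"]* in front pushes the match to the last http inside the quotes.
last_http = re.findall(r'"[^"]*(http.*?\.(?:jpg|bmp))[^"]*"', aa)

print(two_groups)  # [('http:afd/aa.bmp', 'bmp'), ('http--test--http:kk/bb.jpg', 'jpg')]
print(one_group)   # ['http:afd/aa.bmp', 'http--test--http:kk/bb.jpg']
print(last_http)   # ['http:afd/aa.bmp', 'http:kk/bb.jpg']
```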
I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could easily split any string on a ',' that has only non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
Here is a little more explanation, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex will find any , that satisfies either case, which meets your requirement: ',' with only two non-numeric characters on either side.
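To see why both halves are needed, each one can be tried on its own. A minimal sketch follows; the entry is made up for illustration and is not from the original data. It has one comma after a digit, one after a letter, and one between two digits:

```python
import re

lookbehind_half = r'(?<=\D),\s*'
lookahead_half = r'\s*,(?=\D)'
combined = lookbehind_half + '|' + lookahead_half

# Hypothetical entry, chosen so that each half catches a different comma.
entry = 'Aroclor 1260,Lead,2,4-D'

only_lookbehind = re.split(lookbehind_half, entry)
only_lookahead = re.split(lookahead_half, entry)
both = re.split(combined, entry)

print(only_lookbehind)  # ['Aroclor 1260,Lead', '2,4-D']
print(only_lookahead)   # ['Aroclor 1260', 'Lead,2,4-D']
print(both)             # ['Aroclor 1260', 'Lead', '2,4-D']
```

The comma between 2 and 4 has a digit on both sides, so neither half matches it and 2,4-D stays intact.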
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use a regex with lookbehind/lookahead assertions:
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
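One thing worth checking before relying on the findall pattern is that it only ends an item after a letter, so a name that ends in a digit gets merged with the next one. A sketch of that edge case, using a made-up entry that is not from the original data, compared with the lookbehind/lookahead split from the earlier answer:

```python
import re

FINDALL_RE = re.compile(r"\s*(.*?[A-Za-z])(?:,|$)")
SPLIT_RE = re.compile(r'(?<=\D),\s*|\s*,(?=\D)')

# Hypothetical entry where the middle name ends in a digit.
entry = 'Lead,PCB-1260,Zinc'

found = FINDALL_RE.findall(entry)
split = SPLIT_RE.split(entry)

print(found)  # ['Lead', 'PCB-1260,Zinc']
print(split)  # ['Lead', 'PCB-1260', 'Zinc']
```

Whether this matters depends on the data; for the two sample entries in the question both approaches give the same result.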