Python 2.7 code:
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = cStr.split(',')
print newStr # -> ['"aaaa"','"bbbb"','"ccc','ddd"' ]
But I want this result:
result = ['"aaaa"','"bbbb"','"ccc,ddd"']
The solution using re.split() function:
import re
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = re.split(r',(?=")', cStr)
print newStr
The output:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
,(?=") is a positive lookahead assertion: it ensures that the delimiter , is followed by a double quote ".
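One caveat with this lookahead, sketched below: it only splits at commas that are directly followed by a double quote, so it relies on every field being quoted. The second input, with unquoted fields, is an assumption added for illustration and is not from the question.

```python
import re

# Split only at commas that are immediately followed by a double quote.
QUOTE_AHEAD = re.compile(r',(?=")')

all_quoted = QUOTE_AHEAD.split('"aaaa","bbbb","ccc,ddd"')
print(all_quoted)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']

# Hypothetical input with unquoted fields: the comma between a and b
# is not followed by a quote, so it is not treated as a delimiter.
mixed = QUOTE_AHEAD.split('a,b,"c,d"')
print(mixed)  # ['a,b', '"c,d"']
```

If unquoted fields can occur in your data, one of the parser-based approaches is probably the safer choice.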
Try the csv module.
import csv
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = [ '"{}"'.format(x) for x in list(csv.reader([cStr], delimiter=',', quotechar='"'))[0] ]
print newStr
See also: Python parse CSV ignoring comma with double-quotes
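For reference, here is a minimal sketch of the same idea written so that it also runs on Python 3. The last input, with a doubled quote inside a field, is a made-up example to show what the csv module does with it under its default dialect.

```python
import csv

cStr = '"aaaa","bbbb","ccc,ddd"'

# csv.reader expects an iterable of lines, so wrap the single string in a list.
fields = next(csv.reader([cStr]))
print(fields)  # ['aaaa', 'bbbb', 'ccc,ddd']

# Re-add the quotes only if the quoted form is really needed.
quoted = ['"{}"'.format(field) for field in fields]
print(quoted)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']

# Hypothetical input: a doubled quote inside a quoted field is read as one quote.
escaped = next(csv.reader(['"say ""hi""",x']))
print(escaped)  # ['say "hi"', 'x']
```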
With a regex, try this:
import re
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")
split_result = COMMA_MATCHER.split(string)  # string is the text to split
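To see what this pattern does, here is a small sketch. The lookahead accepts a comma only if an even number of quote characters follows it up to the end of the string, which means the comma is outside any quoted field. The second input is a hypothetical one, added to show that unquoted fields are split too.

```python
import re

# A comma counts as a delimiter only if an even number of quote
# characters (single or double) remains between it and the end of the string.
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")

parts = COMMA_MATCHER.split('"aaaa","bbbb","ccc,ddd"')
print(parts)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']

# Hypothetical input: unquoted fields are split as well.
mixed_parts = COMMA_MATCHER.split('a,b,"c,d"')
print(mixed_parts)  # ['a', 'b', '"c,d"']
```

Because the pattern treats ' and " alike, a stray apostrophe inside a field would upset the count, so this is best suited to input where quotes always come in pairs.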
pyparsing has a builtin expression, commaSeparatedList:
cStr = '"aaaa","bbbb","ccc,ddd"'
import pyparsing as pp
print(pp.commaSeparatedList.parseString(cStr).asList())
prints:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
You can also add a parse-time action to strip those double-quotes (since you probably just want the content, not the quotation marks too):
csv_line = pp.commaSeparatedList.copy().addParseAction(pp.tokenMap(lambda s: s.strip('"')))
print(csv_line.parseString(cStr).asList())
gives:
['aaaa', 'bbbb', 'ccc,ddd']
It would be better to use a regex in this case.
re.findall('".*?"', cStr) returns exactly what you need.
The asterisk is a greedy quantifier: if you used '".*"', it would return the longest possible match, i.e. everything between the very first and the very last double quote. The question mark makes it non-greedy, so '".*?"' returns the shortest possible match.
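The difference between the greedy and the non-greedy form is easy to check side by side; a minimal sketch using the string from the question:

```python
import re

cStr = '"aaaa","bbbb","ccc,ddd"'

# Greedy: .* runs to the last double quote, so there is a single match.
greedy = re.findall('".*"', cStr)
print(greedy)  # ['"aaaa","bbbb","ccc,ddd"']

# Non-greedy: .*? stops at the next double quote, one match per field.
lazy = re.findall('".*?"', cStr)
print(lazy)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
```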
I liked Mark de Haan's solution, but I had to rework it: it removed the quote characters (although they were needed), so an assertion in his original example failed. I also added two parameters to deal with different separators and quote characters.
def tokenize(string, separator=',', quote='"'):
    """
    Split a comma separated string into a List of strings.
    Separator characters inside the quotes are ignored.
    :param string: A string to be split into chunks
    :param separator: A separator character
    :param quote: A character to define beginning and end of the quoted string
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list = []
    chunk = ''
    in_quotes = False
    for character in string:
        if character == separator and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''
        else:
            chunk += character
            if character == quote:
                in_quotes = not in_quotes
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_tokenizer():
    string = '"aaaa","bbbb","ccc,ddd"'
    expected = ['"aaaa"', '"bbbb"', '"ccc,ddd"']
    actual = tokenize(string)
    assert expected == actual
It is always better to use existing libraries when you can, but I was struggling to get my specific use case to work with all the above answers, so I wrote my own for Python 3.9 (it will probably work back to 3.6, and removing the type hints will get you to 2.x compatibility).
from typing import List

def separate(string) -> List[str]:
    """
    Split a comma separated string into a List of strings.
    Resulting list elements are trimmed of double quotes.
    Commas inside double quotes are ignored.
    :param string: A string to be split into chunks
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list: List[str] = []
    chunk: str = ''
    in_quotes: bool = False
    for character in string:
        if character == ',' and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''
        elif character == '"':
            in_quotes = not in_quotes
        else:
            chunk += character
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_separator():
    string = '"aaaa","bbbb","ccc,ddd"'
    # separate() strips the double quotes, so the expected values are unquoted
    expected = ['aaaa', 'bbbb', 'ccc,ddd']
    actual = separate(string)
    assert expected == actual
You can first split the string by ", then filter out '' and ',', and finally re-add the quotes. This may be the simplest way:
['"%s"' % s for s in cStr.split('"') if s and s != ',']
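A quick check of this one-liner, wrapped in a function whose name is made up here. The second input is a hypothetical one with an empty field, added to show a case where the filter discards more than intended.

```python
def split_on_quotes(text):
    # Same expression as above, wrapped in a function (the name is hypothetical).
    return ['"%s"' % s for s in text.split('"') if s and s != ',']

question_result = split_on_quotes('"aaaa","bbbb","ccc,ddd"')
print(question_result)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']

# Hypothetical input with an empty field: the filter drops it.
empty_field_result = split_on_quotes('"a","","b"')
print(empty_field_result)  # ['"a"', '"b"']
```

So this works for the string in the question, but it is worth knowing that empty fields (and a field whose whole content is a comma) are lost.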
You need a parser. You can build your own, or you may be able to press one of the library ones into service. In this case, json could be (ab)used.
import json
cStr = '"aaaa","bbbb","ccc,ddd"'
jstr = '[' + cStr + ']'
result = json.loads( jstr) # ['aaaa', 'bbbb', 'ccc,ddd']
result = [ '"'+r+'"' for r in result ] # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
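The "(ab)use" has a price, as the sketch below suggests: the input has to be valid JSON once wrapped in brackets, so every field must be double-quoted. The helper name and the second input are assumptions for illustration.

```python
import json

def split_via_json(text):
    # Hypothetical helper: wrap the fields in [] and parse them as a JSON array.
    return json.loads('[' + text + ']')

json_fields = split_via_json('"aaaa","bbbb","ccc,ddd"')
print(json_fields)  # ['aaaa', 'bbbb', 'ccc,ddd']

# Hypothetical input with an unquoted field: not valid JSON.
try:
    split_via_json('aaaa,"bbbb"')
    parsed_unquoted = True
except ValueError:  # on Python 3, json.JSONDecodeError is a subclass of ValueError
    parsed_unquoted = False
print(parsed_unquoted)  # False
```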
Another option is tssplit. It is not a standard-library module, so you have to install it via pip:
In [3]: from tssplit import tssplit
In [4]: tssplit('"aaaa","bbbb","ccc,ddd"', quote='"', delimiter=',')
Out[4]: ['aaaa', 'bbbb', 'ccc,ddd']
I'm trying to write a function parse such that, for example,
assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
The string consists of a fixed prefix file://, followed by a file name (which can consist of one or more of any character), followed by a colon and a string representing access flags.
Here is one implementation using regular expressions:
import re
def parse(string):
    SCHEME = r"file://"  # File prefix
    PATH_PATTERN = r"(?P<path>.+)"  # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>[rwab+0-9]+)"  # The letters r, w, a, b, a '+' symbol, or any digit
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$"  # The full pattern, anchored at the end of the string
    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
    return tokens['path'], tokens['flags']
I would prefer to use PyParsing, however, because it typically gives more detailed error messages if the string doesn't match the expression (rather than re.match which simply returns None), and I would eventually like to make the flags optional.
Following Paul McGuire's answer in python regex in pyparsing, I made the following attempt:
from pyparsing import Word, alphas, nums, StringEnd, Regex, FollowedBy, Suppress, Literal
def parse(string):
    scheme = Literal("file://")
    path = Regex(".+")
    flags = Word(alphas + nums + "+")
    expression = Suppress(scheme) + (~(Suppress(":") + flags + StringEnd()) + path("path") + Suppress(":") + flags("flags") + StringEnd())
    tokens = expression.parseString(string)
    return tokens['path'], tokens['flags']
In the second part of the expression, I'm basically trying the negative lookahead (~suffix + path + suffix), where suffix is ":" + flags + StringEnd(). However, when trying to parse "file://foo:bar.txt:r+", I run into the following error:
pyparsing.ParseException: Expected ":" (at char 21), (line:1, col:22)
Since the string is 21 characters long, I take this to mean that the Regex has 'consumed' the entire string, so that the suffix can no longer be found.
How can I fix the parse method using pyparsing?
Try this:
import re
s = "file://foo:bar.txt:r+"
# re.sub would return a single string; match(...).groups() yields the two parts
path, flag = re.match(r'.*//(.*):(.*)$', s).groups()
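If the main goal is clearer error messages than a bare None from re.match, one alternative that needs neither pyparsing nor a single big regex is to split at the last colon with str.rpartition. This is a sketch under that assumption, not a pyparsing solution; the error messages and the names PREFIX and FLAGS are made up for illustration, and the flag characters are taken from the regex version in the question.

```python
import re

PREFIX = "file://"
FLAGS = re.compile(r"[rwab+0-9]+\Z")  # same flag characters as in the question

def parse(string):
    """Return (path, flags) or raise ValueError with a specific message."""
    if not string.startswith(PREFIX):
        raise ValueError("expected prefix %r in %r" % (PREFIX, string))
    # rpartition splits at the LAST colon, so colons in the path are kept.
    path, colon, flags = string[len(PREFIX):].rpartition(":")
    if not colon or not path:
        raise ValueError("expected '<path>:<flags>' after the prefix in %r" % string)
    if not FLAGS.match(flags):
        raise ValueError("invalid flags %r in %r" % (flags, string))
    return path, flags

print(parse("file://foo:bar.txt:r+"))  # ('foo:bar.txt', 'r+')
```

Making the flags optional later would mean deciding what a path with a colon but no flags should look like, since the last colon is otherwise always taken as the separator.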
I have strings that are unpredictable in terms of character content, but I know that every string contains exactly one character '*'.
How can I replace the two characters after the '*' with a string that is not hard-coded? The replacement is actually a calculated checksum converted into a string:
checksum_str = str(hex(csum).lstrip('0x'))
You want something like:
star_pos = my_string.find('*')
my_string = my_string[:star_pos] + '*' + checksum_str + my_string[star_pos + 3:]
You can do it with a regular expression:
import re
my_string = re.sub(r'(?<=\*)..', checksum_str, my_string, count=1)
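Both approaches can be compared on a small example. The sample string and the checksum value below are made up; format(csum, '02x') is an assumption about the intended format (two lowercase hex digits). It is used here because hex(csum).lstrip('0x') strips every leading '0' and 'x' character and can therefore return fewer than two characters.

```python
import re

def replace_after_star(text, checksum_str):
    # Slicing variant: keep everything up to and including '*',
    # then skip the two characters that follow it.
    star_pos = text.find('*')
    return text[:star_pos + 1] + checksum_str + text[star_pos + 3:]

def replace_after_star_re(text, checksum_str):
    # Regex variant: replace the two characters preceded by a literal '*'.
    # A function as replacement avoids interpreting backslashes in checksum_str.
    return re.sub(r'(?<=\*)..', lambda m: checksum_str, text, count=1)

csum = 10                            # hypothetical checksum value
checksum_str = format(csum, '02x')   # '0a': always at least two hex digits
sample = 'GPGGA,1234*00'             # hypothetical input string

print(replace_after_star(sample, checksum_str))     # GPGGA,1234*0a
print(replace_after_star_re(sample, checksum_str))  # GPGGA,1234*0a
print(repr(hex(0).lstrip('0x')))                    # '' (every leading '0' and 'x' is removed)
```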
I need to get certain words out of a string and into a new format. For example, I call the function with the input:
text2function('$sin (x)$ is an function of x')
and I need to put them into a StringFunction:
StringFunction(function, independent_variables=[vari])
where I need to get just 'sin (x)' for function and 'x' for vari. So in the end it would look like this:
StringFunction('sin (x)', independent_variables=['x'])
The problem is, I can't seem to obtain function and vari. I have tried:
start = string.index(start_marker) + len(start_marker)
end = string.index(end_marker, start)
return string[start:end]
and
r = re.compile('$()$')
m = r.search(string)
if m:
    lyrics = m.group(1)
and
send = re.findall('$([^"]*)$',string)
All of these seem to give me nothing. Am I doing something wrong? All help is appreciated. Thanks.
Tweaky way!
>>> char1 = '('
>>> char2 = ')'
>>> mystr = "mystring(123234sample)"
>>> print mystr[mystr.find(char1)+1 : mystr.find(char2)]
123234sample
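One thing to watch with str.find is that it returns -1 when the character is missing, which silently produces a wrong slice. A small helper (hypothetical name between) can guard against that; this is only a sketch of the idea:

```python
def between(text, start_marker, end_marker):
    """Return the text between the markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)       # first character after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]

print(between('$sin (x)$ is an function of x', '$', '$'))  # sin (x)
print(between('mystring(123234sample)', '(', ')'))         # 123234sample
print(between('no markers here', '$', '$'))                # None
```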
$ is a special character in regex (it denotes the end of the string). You need to escape it:
>>> re.findall(r'\$(.*?)\$', '$sin (x)$ is an function of x')
['sin (x)']
If you want to cut a string between two identical characters (e.g. !234567890!), you can use
line_word = line.split('!')
print (line_word[1])
You need to start searching for the second character beyond start:
end = string.index(end_marker, start + 1)
because otherwise it'll find the same character at the same location again:
>>> start_marker = end_marker = '$'
>>> string = '$sin (x)$ is an function of x'
>>> start = string.index(start_marker) + len(start_marker)
>>> end = string.index(end_marker, start + 1)
>>> string[start:end]
'sin (x)'
For your regular expressions, the $ character is interpreted as an anchor, not the literal character. Escape it to match the literal $ (and look for things that are not $ instead of not "):
send = re.findall(r'\$([^$]*)\$', string)
which gives:
>>> import re
>>> re.findall(r'\$([^$]*)\$', string)
['sin (x)']
Even if you did escape the $ characters, the regular expression $()$ would not capture anything between them, because the parentheses form an empty group.
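Putting the pieces together for the original goal (both 'sin (x)' and 'x'), a sketch could look like the following. The helper name is hypothetical, and taking the variable from the first pair of parentheses is an assumption about the input format, not something the question states.

```python
import re

def extract_function_and_variable(text):
    # Hypothetical helper: pull the expression between two literal $ signs,
    # then take the name inside the first pair of parentheses as the variable.
    function_match = re.search(r'\$([^$]*)\$', text)
    if function_match is None:
        return None
    function = function_match.group(1)
    variable_match = re.search(r'\(\s*(\w+)\s*\)', function)
    vari = variable_match.group(1) if variable_match else None
    return function, vari

print(extract_function_and_variable('$sin (x)$ is an function of x'))  # ('sin (x)', 'x')
```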
When passing an empty string to a regular expression object, the result of a search is a match object and not None. Shouldn't it be None, since there is nothing to match?
import re
m = re.search("", "some text")
if m is None:
    print "Returned None"
else:
    print "Return a match"
Incidentally, using the special symbols ^ and $ yields the same result.
An empty pattern matches at any position in the string.
Check this:
import re
re.search("", "ffff")
<_sre.SRE_Match object at 0xb7166410>
re.search("", "ffff").start()
0
re.search("$", "ffff").start()
4
Adding $ doesn't yield the same result. The match is at the end, because that is the only place it can be.
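The same check, written so that it also runs on Python 3 and shows the matched text as well as the position (a minimal sketch):

```python
import re

text = "ffff"

# Each pattern matches an empty string; only the position differs.
empty = re.search("", text)
start_anchor = re.search("^", text)
end_anchor = re.search("$", text)

print(empty.span(), repr(empty.group()))  # (0, 0) ''
print(start_anchor.span())                # (0, 0)
print(end_anchor.span())                  # (4, 4)
print(bool(empty))                        # True
```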
Look at it this way: Everything you searched for was matched, therefore the search was successful and you get a match object.
What you need to do is not check whether m is None; rather, you want to check whether m is truthy:
if m:
    print "Found a match"
else:
    print "No match"
Also, the empty pattern matches any string, with an empty match at position 0.
Those regular expressions are successfully matching 0 literal characters.
All strings can be thought of as containing an infinite number of empty strings between the characters:
'Foo' = '' + '' + ... + 'F' + '' + ... + '' + 'oo' + '' + ...
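The "empty strings between the characters" picture can be made visible with re.findall and re.sub; a short sketch:

```python
import re

# One empty match before each character and one at the end: len('Foo') + 1.
empty_matches = re.findall("", "Foo")
print(empty_matches)  # ['', '', '', '']

# Substituting at every empty match makes the positions visible.
marked = re.sub("", "-", "Foo")
print(marked)  # -F-o-o-
```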