python 2.7 code
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = cStr.split(',')
print newStr # -> ['"aaaa"','"bbbb"','"ccc','ddd"' ]
But I want this result:
result = ['"aaaa"','"bbbb"','"ccc,ddd"']
The solution using re.split() function:
import re
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = re.split(r',(?=")', cStr)
print newStr
The output:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
,(?=") is a positive lookahead assertion: it ensures that the delimiter , is followed by a double quote ".
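A quick check of where the lookahead split holds and where it does not, written for Python 3. The helper name split_quoted and the second sample line are made up for illustration. The pattern relies on every field being quoted, so a comma in front of an unquoted field is not treated as a delimiter.

```python
import re

def split_quoted(line):
    # Split only on commas that are immediately followed by a double quote.
    return re.split(r',(?=")', line)

print(split_quoted('"aaaa","bbbb","ccc,ddd"'))
# ['"aaaa"', '"bbbb"', '"ccc,ddd"']

# Hypothetical line mixing quoted and unquoted fields: the comma before
# bbbb is not followed by a quote, so no split happens there.
print(split_quoted('"aaaa",bbbb,"ccc,ddd"'))
# ['"aaaa",bbbb', '"ccc,ddd"']
```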
Try using the csv module:
import csv
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = [ '"{}"'.format(x) for x in list(csv.reader([cStr], delimiter=',', quotechar='"'))[0] ]
print newStr
Check Python parse CSV ignoring comma with double-quotes
With a regex, try this:
import re
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")
split_result = COMMA_MATCHER.split(cStr)
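To make the pattern less opaque: it accepts a comma as a delimiter only when an even number of quote characters remains between that comma and the end of the string, which means the comma sits outside any quoted field. A small Python 3 sketch, where the second input is a made-up example with an unquoted field:

```python
import re

# A comma is a delimiter only if the rest of the string contains an even
# number of quote characters (single or double).
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")

all_quoted = COMMA_MATCHER.split('"aaaa","bbbb","ccc,ddd"')
mixed = COMMA_MATCHER.split('"aaaa",bbbb,"ccc,ddd"')

print(all_quoted)  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
print(mixed)       # ['"aaaa"', 'bbbb', '"ccc,ddd"']
```

Because the lookahead rescans the remainder of the string for every comma, this is likely to be slower than the csv module on long lines.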
pyparsing has a builtin expression, commaSeparatedList:
cStr = '"aaaa","bbbb","ccc,ddd"'
import pyparsing as pp
print(pp.commaSeparatedList.parseString(cStr).asList())
prints:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
You can also add a parse-time action to strip those double-quotes (since you probably just want the content, not the quotation marks too):
csv_line = pp.commaSeparatedList.copy().addParseAction(pp.tokenMap(lambda s: s.strip('"')))
print(csv_line.parseString(cStr).asList())
gives:
['aaaa', 'bbbb', 'ccc,ddd']
A regex works well in this case.
re.findall('".*?"', cStr) returns exactly what you need.
The asterisk is a greedy quantifier: if you used '".*"', it would return the maximal match, i.e. everything between the very first and the very last double quote. The question mark makes it non-greedy, so '".*?"' returns the smallest possible match.
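The difference between the greedy and the non-greedy form can be seen side by side in this short Python 3 sketch (the variable names greedy and lazy are only for illustration):

```python
import re

cStr = '"aaaa","bbbb","ccc,ddd"'

# Greedy: .* runs to the last double quote, so the whole string is one match.
greedy = re.findall('".*"', cStr)
# Non-greedy: .*? stops at the nearest closing quote.
lazy = re.findall('".*?"', cStr)

print(greedy)  # ['"aaaa","bbbb","ccc,ddd"']
print(lazy)    # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
```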
I liked Mark de Haan's solution, but I had to rework it, as it removed the quote characters (although I needed them), which made the assertion in his original example fail. I also added two parameters to deal with different separators and quote characters.
def tokenize(string, separator=',', quote='"'):
    """
    Split a comma separated string into a List of strings.
    Separator characters inside the quotes are ignored.
    :param string: A string to be split into chunks
    :param separator: A separator character
    :param quote: A character to define beginning and end of the quoted string
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list = []
    chunk = ''
    in_quotes = False
    for character in string:
        if character == separator and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''
        else:
            chunk += character
            if character == quote:
                in_quotes = not in_quotes
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_tokenizer():
    string = '"aaaa","bbbb","ccc,ddd"'
    expected = ['"aaaa"', '"bbbb"', '"ccc,ddd"']
    actual = tokenize(string)
    assert expected == actual
It is always better to use existing libraries when you can, but I was struggling to get my specific use case to work with all the above answers, so I wrote my own for Python 3.9 (it should work back to 3.6, and removing the type hints will get you to 2.x compatibility).
from typing import List

def separate(string) -> List[str]:
    """
    Split a comma separated string into a List of strings.
    Resulting list elements are trimmed of double quotes.
    Commas inside double quotes are ignored.
    :param string: A string to be split into chunks
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list: List[str] = []
    chunk: str = ''
    in_quotes: bool = False
    for character in string:
        if character == ',' and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''
        elif character == '"':
            in_quotes = not in_quotes
        else:
            chunk += character
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_separator():
    string = '"aaaa","bbbb","ccc,ddd"'
    # separate() strips the double quotes, so the expected chunks have none
    expected = ['aaaa', 'bbbb', 'ccc,ddd']
    actual = separate(string)
    assert expected == actual
You can first split the string on ", then filter out the '' and ',' pieces, and finally put the quotes back. This may be the simplest way, as long as every field is quoted:
['"%s"' % s for s in cStr.split('"') if s and s != ',']
You need a parser. You can build your own, or you may be able to press one of the library ones into service. In this case, json could be (ab)used.
import json
cStr = '"aaaa","bbbb","ccc,ddd"'
jstr = '[' + cStr + ']'
result = json.loads( jstr) # ['aaaa', 'bbbb', 'ccc,ddd']
result = [ '"'+r+'"' for r in result ] # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
This is not a standard module, you have to install it via pip, but as an option try tssplit:
In [3]: from tssplit import tssplit
In [4]: tssplit('"aaaa","bbbb","ccc,ddd"', quote='"', delimiter=',')
Out[4]: ['aaaa', 'bbbb', 'ccc,ddd']
Currently I can have many different separators in a string, like
new_123_12313131
new$123$12313131
new#123#12313131
and so on. I just want to check whether the string contains a special character and, if so, get the value after the last separator. In this example that is just 12313131.
This is a good use case for isdigit(). Note that it keeps every digit in the string, not only the digits after the last separator:
l = [
    'new_123_12313131',
    'new$123$12313131',
    'new#123#12313131',
]
output = []
for s in l:
    temp = ''
    for char in s:
        if char.isdigit():
            temp += char
    output.append(temp)
print(output)
Result: ['12312313131', '12312313131', '12312313131']
Assuming you define 'special character' as anything that's not alphanumeric, you can use the str.isalnum() function to find the last special character and split on it, something like this:
def split_non_special(input) -> str:
    """
    Find first special character starting from the end and get the last piece
    """
    for i in reversed(input):
        if not i.isalnum():
            return input.split(i)[-1]  # return as soon as a separator is found
    return ''  # no separator found
# inputs = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131', 'eefwfwrfwfwf3243']
# outputs = [split_non_special(input) for input in inputs]
# ['12313131', '12313131', '12313131', ''] # outputs
To "just get value after last separator", a more direct way is re.findall, assuming that value always consists of digits:
from re import findall
findall(r'\d+$', text)  # ['12313131']
Python provides what you seem to consider "special" characters in the string library as string.punctuation, which contains these characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Using that in conjunction with the re module you can do this:
from string import punctuation
import re
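re.split(f"[{re.escape(punctuation)}]", my_string)
my_string being the string you want to split. re.escape keeps characters such as ], \ and ^ from being misread inside the character class.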
Results for your examples
['new', '123', '12313131']
To get just the digits you can use:
re.findall(r"\d+", my_string)
Results:
['123', '12313131']
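Putting these pieces together for the original question, here is a minimal Python 3 sketch that returns whatever follows the last separator. The function name last_token is made up, "special character" is assumed to mean anything other than an ASCII letter or digit, and returning an empty string when no separator is present is also an assumption.

```python
import re

def last_token(text):
    # Split on runs of characters that are not ASCII letters or digits.
    parts = re.split(r'[^A-Za-z0-9]+', text)
    # A single part means no separator was found (assumed result: '').
    return parts[-1] if len(parts) > 1 else ''

samples = ['new_123_12313131', 'new$123$12313131',
           'new#123#12313131', 'eefwfwrfwfwf3243']
print([last_token(s) for s in samples])
# ['12313131', '12313131', '12313131', '']
```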
I am having a hard time doing Data Analysis on a large text that has lots of non-alphabetical chars. I tried using
string = filter(str.isalnum, string)
but I also have "#" in my text that I want to keep. How do I make an exception for a character like "#" ?
It is easier to use regular expressions:
string = re.sub("[^A-Za-z0-9#]", "", string)
You can use re.sub. Note that \w already covers digits and also keeps underscores:
re.sub(r'[^\w\s#]', '', string)
Example:
>>> re.sub(r'[^\w\s#]', '', 'This is # string 123 *$^%')
'This is # string 123 '
One way to do this would be to create a function that returns True if an input character is valid and False otherwise.
import string
valid_characters = string.ascii_letters + string.digits + '#'
def is_valid_character(character):
    return character in valid_characters
# Instead of using `filter`, we `join` all characters in the input string
# if `is_valid_character` is `True`.
def get_valid_characters(string):
    return "".join(char for char in string if is_valid_character(char))
Some example output:
>>> print(valid_characters)
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#
>>> get_valid_characters("!Hello_#world?")
'Hello#world'
>>> get_valid_characters("user#example")
'user#example'
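A simpler way to write it would be using a regex. This accomplishes the same thing (the character class lists exactly the valid characters, whereas \w would also keep underscores and non-ASCII letters):
import re
def get_valid_characters(string):
    return re.sub(r"[^A-Za-z0-9#]", "", string)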
You could use a lambda function to specify your allowed characters. Note, however, that in Python 3 filter returns a <filter object>, which is an iterator over the kept characters, so you will have to stitch it back into a string:
string = "?filter_#->me3!"
extra_chars = "#!"
filtered_object = filter(lambda c: c.isalnum() or c in extra_chars, string)
string = "".join(filtered_object)
print(string)
Gives:
filter#me3!
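One difference between the approaches in this thread is worth seeing on non-ASCII input: str.isalnum() is Unicode-aware, while a class like [^A-Za-z0-9#] keeps ASCII only. A minimal Python 3 sketch, where the helper name keep_alnum_plus and the sample text are made up for illustration:

```python
import re

def keep_alnum_plus(text, extra="#"):
    # Keep alphanumeric characters (Unicode-aware) plus any in `extra`.
    return "".join(ch for ch in text if ch.isalnum() or ch in extra)

sample = "Héllo_#wörld 123!"
unicode_aware = keep_alnum_plus(sample)
ascii_only = re.sub(r"[^A-Za-z0-9#]", "", sample)

print(unicode_aware)  # Héllo#wörld123
print(ascii_only)     # Hllo#wrld123
```

Which behavior is right depends on whether accented letters count as noise in the text being analyzed.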
S1C(SCC1)C1=COC2C(C{OC}C3C(OC=C3)C2)C1=O
In the above string, I want the program to ignore {OC}, or technically anything between these curly braces, but work normally with the rest of the string. I have a file which has thousands of such strings. Some strings have more than one set of curly braces. How should it be done?
I currently use Python 2.5.
This might help. Using regex.
import re
s = "S1C(SCC1)C1=COC2C(C{OC}C3C(OC=C3)C2)C1=O"
print re.sub(r"\{(.*?)\}", " ", s)  # Replace each pair of curly braces and its content with a space.
Output:
S1C(SCC1)C1=COC2C(C C3C(OC=C3)C2)C1=O
You can use string slicing for this.
Note: this works correctly only if the string contains a single pair of braces.
str = "S1C(SCC1)C1=COC2C(C{OC}C3C(OC=C3)C2)C1=O"
startofbracket = str.find("{")
endofbracket = str.find("}")
print str[:startofbracket]+str[endofbracket+1:]
You can iterate over the string and keep only the characters that are not between braces. The following code assumes the braces are not nested:
string = "S1C(SCC1)C1=COC2C(C{OC}C3C(OC=C3)C2)C1=O"
output = ""
brace_found = False
for i in range(len(string)):
    if brace_found:
        if string[i] == "}":
            brace_found = False
    else:
        if string[i] != "{":
            output += string[i]
        else:
            brace_found = True
print output
# S1C(SCC1)C1=COC2C(CC3C(OC=C3)C2)C1=O
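Since the question mentions strings with more than one set of braces, here is a minimal sketch that removes every group in one pass. It is written for Python 3 (on Python 2.5 the print calls would need the statement form), the name strip_braces and the second sample string are made up, and it assumes the braces are not nested.

```python
import re

# Match an opening brace, any run of non-brace characters, then a closing brace.
BRACES = re.compile(r"\{[^{}]*\}")

def strip_braces(line):
    # Remove every {...} group from the line; assumes no nested braces.
    return BRACES.sub("", line)

print(strip_braces("S1C(SCC1)C1=COC2C(C{OC}C3C(OC=C3)C2)C1=O"))
# S1C(SCC1)C1=COC2C(CC3C(OC=C3)C2)C1=O
print(strip_braces("C{OC}C{N}C"))  # hypothetical string with two groups
# CCC
```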
I have a string. How do I remove all text after a certain character? (In this case ...)
The text after the ... will change, which is why I want to remove all characters after a certain one.
Split on your separator at most once, and take the first piece:
sep = '...'
stripped = text.split(sep, 1)[0]
You didn't say what should happen if the separator isn't present. Both this and Alex's solution will return the entire string in that case.
Assuming your separator is '...', but it can be any string.
text = 'some string... this part will be removed.'
head, sep, tail = text.partition('...')
>>> print head
some string
If the separator is not found, head will contain all of the original string.
The partition function was added in Python 2.5.
S.partition(sep) -> (head, sep, tail)
Searches for the separator sep in S, and returns the part before it,
the separator itself, and the part after it. If the separator is not
found, returns S and two empty strings.
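A small check of both cases described above, written for Python 3; the string without a separator is made up for illustration.

```python
text = 'some string... this part will be removed.'
head, sep, tail = text.partition('...')
print(repr(head))  # 'some string'
print(repr(tail))  # ' this part will be removed.'

# Separator absent: head is the whole string, sep and tail are empty.
missing = 'no separator here'.partition('...')
print(missing)  # ('no separator here', '', '')
```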
If you want to remove everything after the last occurrence of a separator in a string, I find this works well:
<separator>.join(string_to_split.split(<separator>)[:-1])
For example, if string_to_split is a path like root/location/child/too_far.exe and you only want the folder path, "/".join(string_to_split.split("/")[:-1]) gives you
root/location/child
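An alternative sketch for the same task uses str.rpartition, which splits at the last occurrence directly. The helper name before_last is made up, and returning the text unchanged when the separator is absent is an assumption about the desired behavior.

```python
def before_last(text, sep):
    # rpartition splits at the LAST occurrence of sep.
    head, found, tail = text.rpartition(sep)
    # When sep is absent, rpartition returns ('', '', text); in that case
    # keep the whole string (assumed behavior).
    return head if found else text

print(before_last("root/location/child/too_far.exe", "/"))  # root/location/child
print(before_last("too_far.exe", "/"))                      # too_far.exe
```

By contrast, the join/split form returns an empty string when the separator does not occur at all.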
Without a regular expression (which I assume is what you want):
def remafterellipsis(text):
    where_ellipsis = text.find('...')
    if where_ellipsis == -1:
        return text
    # + 3 keeps the ellipsis itself and drops only what follows it
    return text[:where_ellipsis + 3]
or, with a regular expression:
import re
def remwithre(text, there=re.compile(re.escape('...')+'.*')):
    return there.sub('', text)
import re
test = "This is a test...we should not be able to see this"
res = re.sub(r'\.\.\..*',"",test)
print(res)
Output: "This is a test"
The index method returns the position of a character in a string (find does the same, but returns -1 instead of raising ValueError when the character is missing). To remove everything from that character onward, do this:
mystring = "123⋯567"
mystring[ 0 : mystring.index("⋯")]
>> '123'
If you want to keep the character, add 1 to the character position.
From a file:
sep = '...'
with open("requirements.txt") as file_in:
    for line in file_in:
        res = line.split(sep, 1)[0]
        print(res)
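The same idea can be wrapped in a small generator that works on any iterable of lines. In this Python 3 sketch the name strip_after is made up, and a list of sample lines stands in for the open file so the example runs on its own.

```python
def strip_after(lines, sep="..."):
    # Works on any iterable of lines, including an open file object.
    for line in lines:
        # Drop the trailing newline first, then cut at the first separator.
        yield line.rstrip("\n").split(sep, 1)[0]

# Hypothetical sample lines standing in for the contents of a file.
sample = ["keep this... drop this\n", "no separator here\n"]
print(list(strip_after(sample)))
# ['keep this', 'no separator here']
```

Stripping the newline first avoids printing a blank line after lines that contain no separator.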
This works for me in Python 3.7.
In my case I need to remove everything after the dot in my variable fees. Since fees is a number, convert it to a string first:
fees = 45.05
split_string = str(fees).split(".", 1)
substring = split_string[0]
print(substring)
Yet another way to remove all characters after the last occurrence of a character in a string (assume that you want to remove all characters after the final '/').
path = 'I/only/want/the/containing/directory/not/the/file.txt'
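while path and path[-1] != '/':  # the "path and" guard avoids an IndexError when there is no '/'
    path = path[:-1]
# path is now 'I/only/want/the/containing/directory/not/the/'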
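A loop-free variant of the same idea uses str.rfind to locate the last occurrence. This is a Python 3 sketch; the name up_to_last is made up, and leaving the text unchanged when the character is absent is an assumption about the desired behavior.

```python
def up_to_last(text, char="/"):
    idx = text.rfind(char)
    # rfind returns -1 when char is absent; return the text unchanged then.
    return text if idx == -1 else text[:idx + 1]

path = 'I/only/want/the/containing/directory/not/the/file.txt'
print(up_to_last(path))
# I/only/want/the/containing/directory/not/the/
```

Like the loop, this keeps the final '/'; slice with text[:idx] instead to drop it.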
Another easy way, using re:
import re
text = 'some string... this part will be removed.'
text = re.search(r'(\A.*)\.\.\..+', text, re.DOTALL|re.IGNORECASE).group(1)
# text == 'some string'