Python, split a string at commas, except within quotes, ignoring whitespace - python

I've found some solutions, but the results I am getting don't match what I'm expecting.
I want to take a string, and split it at commas, except when the commas are contained within double quotation marks. I would like to ignore whitespace. I can live with losing the double quotes in the process, but it's not necessary.
Is csv the best way to do this? Would a regex solution be better?
#!/usr/local/bin/python2.7
import csv
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
result = csv.reader(s, delimiter=',', quotechar='"')
for r in result:
    print r
# Should display:
# abc
# def
# ghi
# jkl, mno, pqr
# stu
#
# But I get:
# ['a']
# ['b']
# ['c']
# ['', '']
# ['d']
# ['e']
# ['f']
# ['', '']
# [' ']
# ['g']
# ['h']
# ['i']
# ['', '']
# [' ']
# ['jkl, mno, pqr']
# ['', '']
# ['stu']
print r[1] # Should be "def" but I get a "list index out of range" error.

You can use the regular expression
".+?"|[\w-]+
This matches a double quote followed by any characters up to the next double quote, OR a run of word characters and hyphens (no commas or quotes).
https://regex101.com/r/IThYf7/1
import re
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
for r in re.findall(r'".+?"|[\w-]+', s):
    print(r)
If you want to get rid of the "s around the quoted sections, the best I could come up with uses the third-party regex module (so that \K is available):
(?:^"?|, ?"?)\K(?:(?<=").+?(?=")|[\w-]+)
https://regex101.com/r/IThYf7/3
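If the third-party regex module is not an option, a similar result seems achievable with the standard re module by capturing the two alternatives in separate groups. This is only a sketch: the pattern and the variable names are my own, and it assumes quoted sections never contain escaped quotes.

```python
import re

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# Group 1 captures the inside of a quoted section, group 2 a bare word.
# At most one of the two groups is non-empty for each match.
pattern = re.compile(r'"([^"]*)"|([\w-]+)')

# findall returns (quoted, bare) tuples; keep whichever group matched.
fields = [quoted or bare for quoted, bare in pattern.findall(s)]
print(fields)
# ['abc', 'def', 'ghi', 'jkl, mno, pqr', 'stu']
```

Because the quotes sit outside the capturing group, they are dropped without needing \K or lookarounds.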

Besides csv, another nice approach uses the third-party regex module (i.e. pip install regex), which supports the (*SKIP)(*FAIL) verbs:
"[^"]*"(*SKIP)(*FAIL)|,\s*
This reads as follows:
"[^"]*"(*SKIP)(*FAIL) # match everything between two double quotes and "forget" about them
| # or
,\s* # match a comma and 0+ whitespaces
In Python:
import regex as re
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,\s*')
string = 'abc,def, ghi, "jkl, mno, pqr","stu"'
parts = rx.split(string)
print(parts)
This yields
['abc', 'def', 'ghi', '"jkl, mno, pqr"', '"stu"']
See a demo on regex101.com.
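As for the csv route the question asks about: the original attempt passes the string itself to csv.reader, which then iterates over it character by character. A minimal sketch of a fix, assuming the input is a single line, is to wrap the string in a list and enable skipinitialspace so that the space after each comma is ignored:

```python
import csv

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# csv.reader expects an iterable of lines, so the string is wrapped in a list.
# skipinitialspace=True ignores whitespace right after each delimiter, which
# also lets the quote in ', "jkl...' be recognised as an opening quote.
reader = csv.reader([s], delimiter=',', quotechar='"', skipinitialspace=True)
row = next(reader)
print(row)
# ['abc', 'def', 'ghi', 'jkl, mno, pqr', 'stu']
```

Unlike the regex split above, this drops the surrounding double quotes.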

Related

Not getting expected output for some reason?

Question: please debug logic to reflect expected output
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(word_list)
OUTPUT is :
['Hello', 'there', '.', '']
Problem: the output should not contain the empty string or any whitespace entries
Expected: ['Hello', 'there', '.']
First of all, the output you shared does not match your earlier code, which produces ['Hello', ' ', 'there', '.', ''].
\W matches anything other than a letter, digit or underscore (equivalent to [^a-zA-Z0-9_] for ASCII text), so the pattern splits your string on the space and on the literal dot, and the capturing group keeps both separators in the result.
To get the expected output you need some further processing, as below.
With the earlier code:
import re
s = "Hello there."
l = list(filter(str.strip,re.split(r"(\W+)", s)))
print(l)
With Edited code:
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(list(filter(None,word_list)))
Output:
['Hello', 'there', '.']
Working Code: https://rextester.com/KWJN38243
Assuming word is "Hello there.", the results make sense. See the split function documentation: Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
You have put capturing parentheses in the pattern, so you are splitting the string on non-word characters and also returning the characters used for splitting.
Here is the string:
Hello there.
Here is how it is split:
Hello|there|
That means you have three values: Hello, there, and an empty string '' in the last place.
The values you split on are a space and a period.
So the output should be the three values and the two separators that we split on:
Hello - space - there - period - empty string
which is exactly what I get.
import re
s = "Hello there."
t = re.split(r"(\W+)", s)
print(t)
output:
['Hello', ' ', 'there', '.', '']
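To make the effect of the capturing parentheses concrete, here is a small comparison of the same split with and without the group (the variable names are only for illustration):

```python
import re

s = "Hello there."

# Without a capturing group, only the pieces between separators are returned.
without_group = re.split(r"\W+", s)
# With a capturing group, the separators themselves are returned as well.
with_group = re.split(r"(\W+)", s)

print(without_group)  # ['Hello', 'there', '']
print(with_group)     # ['Hello', ' ', 'there', '.', '']
```

Either way the trailing empty string is present, because the string ends with a separator.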
Further Explanation
From your question, it may be that you think that because the string ends with a non-word character there would be nothing "after" it, but this is not how splitting works. Think back to CSV files (which have been around forever) and consider a CSV file like this:
date,product,qty,price
20220821,P1,10,20.00
20220821,P2,10,
The above represents a CSV file with four fields, but in the last row the final field (which definitely exists) is empty. It would be parsed as an empty string if we split on the comma.
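The same behaviour can be seen with plain str.split on that last row. This is just an illustrative sketch using the sample data above:

```python
line = "20220821,P2,10,"

# The separator at the end still delimits a field, which happens to be empty.
fields = line.split(",")
print(fields)  # ['20220821', 'P2', '10', '']

# Dropping empty fields afterwards is a separate, explicit step.
non_empty = [f for f in fields if f]
print(non_empty)  # ['20220821', 'P2', '10']
```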

Regex to split string excluding delimiters between escapable quotes

I need to be able to split a string on spaces unless the space is contained within escapable quotes. In other words, spam spam spam "and \"eggs" should return four items: spam, spam, spam and and "eggs. I intend to do this using the re.split method in Python, where you identify the characters to split on using a regex.
I found this which finds everything between escapable quotes:
((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1
from: https://www.metaltoad.com/blog/regex-quoted-string-escapable-quotes
and this which splits by character unless between quotes:
\s(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
from: https://stackabuse.com/regex-splitting-by-character-unless-in-quotes/. This finds all spaces with an even number of double quotes between the space and the end of the line.
I'm struggling to join those two solutions together.
For reference, I found this super-useful regex cheat sheet: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
I also found https://regex101.com/ extremely useful: it allows you to test regexes.
Finally managed it:
\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)
This combines the two solutions in the question to find spaces with an even number of unescaped double quotes to the right-hand side. Explanation:
\s # space
(?= # followed by (not included in match though)
(?: # match pattern (but don't capture)
(?:
\\\" # match escaped double quotes
| # OR
[^\"] # any character that is not double quotes
)* # 0 or more times
(?<!\\)\" # followed by unescaped quotes
(?:\\\"|[^\"])* # as above match escaped double quotes OR any character that is not double quotes
(?<!\\)\" # as above - followed by unescaped quotes
# the above matches one pair of unescaped quotes
)* # repeated 0 or more times (matching quotes in pairs guarantees an even number of unescaped quotes)
(?:\\\"|[^\"])* # as above
$ # end of the line
)
So the final python is:
import re
test_str = r'spam spam spam "and \"eggs"'
regex = r'\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)'
test_list = re.split(regex, test_str)
print(test_list)
>>> ['spam', 'spam', 'spam', '"and \\"eggs"']
The only downside to this method is that it leaves the leading and trailing quotes in place; however, I can easily identify and remove these with the following Python:
# remove leading and trailing unescaped quotes
test_list = list(map(lambda x: re.sub(r'(?<!\\)"', '', x), test_list))
# remove escape characters - they are no longer required
test_list = list(map(lambda x: x.replace(r'\"', '"'), test_list))
print(test_list)
>>> ['spam', 'spam', 'spam', 'and "eggs']
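As an alternative to the regex, the standard library's shlex module tokenizes strings using shell-like rules, which happen to cover this example: spaces separate tokens, double quotes group them, and \" is an escaped quote inside double quotes. This is a different technique from the one above, shown only as a sketch; note that shlex also gives single quotes and backslashes outside quotes a special meaning, which may or may not suit your data.

```python
import shlex

test_str = r'spam spam spam "and \"eggs"'

# shlex.split applies POSIX shell quoting rules and strips the
# surrounding quotes and the escape characters in one step.
tokens = shlex.split(test_str)
print(tokens)
# ['spam', 'spam', 'spam', 'and "eggs']
```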

How to extract out a specific field from structured string?

I'm now trying to extract the text from a structured string by regex.
For instance,
string = "field1:afield3:bfield2:cfield3:d"
All I want are the values of field3, which are 'b' and 'd'.
I tried to use the regex "(field1:.*?)?(field2:.*?)?field3:"
and split the raw string by it,
but I got this:
['', 'field1:a', None, 'b', None, 'field2:c', 'd']
So, what is the solution?
The real case is:
string = "1st sentence---------------------- Forwarded by Michelle
Cash/HOU/ECT on ---------------------------Ava Syon#ENRON To: Michelle
Cash/HOU/ECT#ECTcc: Twanda Sweet/HOU/ECT#ECT Subject: 2nd sentence---------
------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------
----Ava Syon#ENRON To: Michelle Cash/HOU/ECT#ECTcc: Twanda
Sweet/HOU/ECT#ECT Subject: 3rd sentence"
(a one-line string, without \n)
The result I need is the list
re = ["1st sentence","2nd sentence","3rd sentence"]
Thanks!
Use
re.findall(r'field3:(.*?)(?=field\d+:|$)', s)
See the regex demo. NOTE: re.findall returns the contents of the capturing group, thus, you do not need a lookbehind in the pattern, a capturing group will do.
The regex matches:
field3: - a literal char sequence
(.*?) - any 0+ chars other than line break (if you use re.DOTALL modifier, the dot will match a newline, too)
(?=field\d+:|$) - a positive lookahead that requires (but does not consume, does not add to the match or capture) the presence of field, 1+ digits, : or the end of string after the current position.
Python demo:
import re
rx = r"field3:(.*?)(?=field\d+:|$)"
s = "field1:afield3:b and morefield2:cfield3:d and here"
res = re.findall(rx, s)
print(res)
# => ['b and more', 'd and here']
NOTE: A more efficient (unrolled) version of the same regex is
field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)
See the regex demo
You could use a positive lookbehind. It will find the single character directly after field3: :
>>> import re
>>> string = "field1:afield3:bfield2:cfield3:d"
>>> re.findall(r'(?<=field3:).', string)
['b', 'd']
This only works for single-character values. I would add a positive lookahead, but it would become the same answer as Wiktor's.
So here's an alternative with re.split():
>>> string = "field1:afield3:boatfield2:cfield3:dolphin"
>>> elements = re.split(r'(field\d+:)',string)
>>> [elements[i+1] for i, x in enumerate(elements) if x == 'field3:']
['boat', 'dolphin']
A complex solution to get field values by field number using built-in str.replace(), str.split() and str.startswith() functions:
def getFieldValues(s, field_number):
    delimited = s.replace('field', '|field') # setting delimiter between fields
    return [i.split(':')[1] for i in delimited.split('|') if i.startswith('field' + str(field_number))]
s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
print(getFieldValues(s, 3))
# ['b some text', 'd and data']
print(getFieldValues(s, 1))
# ['a hello-again']
print(getFieldValues(s, 2))
# ['c another text']
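If the values of several fields are needed at once, the lookahead idea from the first answer can be extended to capture the field name as well and group the values by name. This is a sketch under the same assumption as the answers above (every field starts with "field", digits and a colon); the variable names are hypothetical.

```python
import re
from collections import defaultdict

s = "field1:afield3:bfield2:cfield3:d"

# Capture the field name and its value; the lookahead stops each value
# at the next field name or at the end of the string.
pairs = re.findall(r"(field\d+):(.*?)(?=field\d+:|$)", s)

values = defaultdict(list)
for name, value in pairs:
    values[name].append(value)

print(values["field3"])  # ['b', 'd']
print(values["field1"])  # ['a']
```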

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These string are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one(Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
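This treats the whole argument as one pattern, so it looks for the sequence ' .,()_ (with . matching any character) rather than for any one of those characters. You probably meant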
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, instead of writing '| |.|,|(|..., you can put the characters inside [] to say "match any one of these".
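To illustrate that the alternation form and the bracket form split on the same delimiters, here is a small sketch. The delimiter list and variable names are mine, and re.escape is used so that characters such as . and ( are taken literally in the alternation:

```python
import re

row = "feature.append(freq_and_feature(text, freq))"
delimiters = ["'", " ", ".", ",", "(", ")", "_"]

# "this OR that" spelled out: each delimiter escaped and joined with |
alternation = "|".join(re.escape(d) for d in delimiters)
# the same set of delimiters written as a character class
char_class = "[" + "".join(re.escape(d) for d in delimiters) + "]"

by_alternation = re.split(alternation, row)
by_class = re.split(char_class, row)
print(by_alternation == by_class)  # True

# Adjacent delimiters produce empty strings, so filter them out.
words = [w for w in by_class if w]
print(words)
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```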
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
See the IDEONE demo
The [\W_]+ regex matches one or more characters that are either non-word characters (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
You can try this
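parts = re.split('[.(_,)]+', row)  # named parts to avoid shadowing the built-in str
parts.pop()  # drop the last element, which is whatever follows the final delimiter
print(parts)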
This will result:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can also be written as [\W_] (for ASCII input)
Python Code
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, matching the words directly instead of splitting on the delimiters:
import re
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Ideone Demo

Find all strings that are in between two sub strings

I have the following string as an example:
string = "## cat $$ ##dog$^"
I want to extract all the strings that are enclosed between "##" and "$", so the output will be:
[" cat ","dog"]
I only know how to extract the first occurrence:
import re
r = re.compile('##(.*?)$')
m = r.search(string)
if m:
    result_str = m.group(1)
Thoughts & suggestions on how to catch them all are welcomed.
Use re.findall() to get every occurrence of your substring. $ is a special character in regular expressions (the "end of the string" anchor), so you need to escape it as \$ to match a literal $ character.
>>> import re
>>> s = '## cat $$ ##dog$^'
>>> re.findall(r'##(.*?)\$', s)
[' cat ', 'dog']
To remove the leading and trailing whitespace, you can simply match it outside of the capture group.
>>> re.findall(r'##\s*(.*?)\s*\$', s)
['cat', 'dog']
Also, if the content can span multiple lines, consider using a negated character class, since . does not match a newline by default.
>>> re.findall(r'##\s*([^$]*)\s*\$', s)
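A short sketch of why this matters, using a made-up input in which the first delimited section contains a newline. The input string and variable names are assumptions for illustration only:

```python
import re

# Hypothetical input: the first section spans two lines.
s = "## cat\nfish $$ ##dog$^"

# . does not match a newline by default, so the first section is missed.
with_dot = re.findall(r'##(.*?)\$', s)
# A negated character class matches anything except $, newlines included.
with_negation = re.findall(r'##([^$]*)\$', s)
# re.DOTALL lets . match newlines too, giving the same result here.
with_dotall = re.findall(r'##(.*?)\$', s, flags=re.DOTALL)

print(with_dot)       # ['dog']
print(with_negation)  # [' cat\nfish ', 'dog']
print(with_dotall)    # [' cat\nfish ', 'dog']
```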
