I am splitting a string in Python and my goal is to split by commas, except those between quotation marks. I am using
fields = line.strip().split(",")
but some strings are like the following one:
10,20,"Installations, machines",3,5
How can I use regular expressions for accomplishing this?
Although I agree that regular expressions may not be the best tool for the job, I found the problem quite interesting on its own.
import re
split_on_commas = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall
This regexp consists of four alternatives, tried in this order:
any number of non-commas, followed by a substring enclosed between double quotes, followed by any number of non-commas;
at least one non-comma;
an empty substring following a comma;
an empty substring at the start of the string, and followed by a comma.
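The four alternatives can also be written out with re.VERBOSE so that each branch carries its own comment. This is only a sketch of the same pattern (the name SPLIT_PATTERN is illustrative and not part of the original answer); it is meant to behave like the one-liner:

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# SPLIT_PATTERN is a hypothetical name used for illustration only.
SPLIT_PATTERN = re.compile(r'''
      [^,]* ".*" [^,]*   # non-commas, a double-quoted substring, non-commas
    | [^,]+              # at least one non-comma
    | (?<=,)             # an empty substring following a comma
    | ^(?=,)             # an empty substring at the start, followed by a comma
''', re.VERBOSE)

split_on_commas = SPLIT_PATTERN.findall

print(split_on_commas('10,20,"Installations, machines",3,5'))
# ['10', '20', '"Installations, machines"', '3', '5']
```

One thing to keep in mind: because `.*` is greedy, everything from the first to the last double quote on the line ends up in a single field, so a line with two separate quoted fields is not split between them.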
Some tests:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,20,"aaa, bbb",3,5') == ['10', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,,20,"aaa, bbb",3,5') == ['10', '', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas(',10,20,"aaa, bbb",3,5') == ['', '10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb",3,5,') == ['10', '20', '"aaa, bbb"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', '"aaa, bbb" ccc', '3', '5']
assert split_on_commas('10,20,ccc "aaa, bbb",3,5') == ['10', '20', 'ccc "aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb" "ccc",3,5,') == ['10', '20', '"aaa, bbb" "ccc"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" "ccc, ddd",3,5,') == ['10', '20', '"aaa, bbb" "ccc, ddd"', '3', '5', '']
assert split_on_commas('10,20,"aaa, "bbb",3,5') == ['10', '20', '"aaa, "bbb"', '3', '5']
assert split_on_commas('10,20,"",3,5') == ['10', '20', '""', '3', '5']
assert split_on_commas('10,20,",",3,5') == ['10', '20', '","', '3', '5']
assert split_on_commas(',,,') == ['', '', '', '']
assert split_on_commas('') == []
assert split_on_commas(',') == ['', '']
assert split_on_commas('","') == ['","']
assert split_on_commas('",') == ['"', '']
assert split_on_commas(',"') == ['', '"']
assert split_on_commas('"') == ['"']
Update: comparison with the csv module solution
Similar questions have been asked many times on SO, and each time the best / accepted answer was "Just use the csv module". Perhaps it's useful to point out some differences between the recommended solution and my re proposition. But first, devise a csv function with the same interface as split (not idiomatic, but consistent with the original requirement):
import csv
split_on_commas = lambda s: csv.reader([s]).next()
The first thing to be aware of is that csv.reader does more than a smart split. The external delimiters are suppressed:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
Which can lead to some strange behaviours:
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', 'aaa, bbb ccc', '3', '5']
assert split_on_commas('10,20,aaa", bbb ccc",3,5') == ['10', '20', 'aaa"', ' bbb ccc"', '3', '5']
I am sure this is not a problem with a generated CSV, since the offending double quotes would be escaped.
More shocking is the fact that, in Python 2 (used here), this module still does not support Unicode input:
split_on_commas(u'10,20,"Juan, Chô",3,5')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-83-a0ef82b5fc26> in <module>()
----> 1 split_on_commas(u'10,20,"Juan, Chô",3,5')
<ipython-input-81-18a2b4070348> in <lambda>(s)
1 if __name__ == "__main__":
2 import csv
----> 3 split_on_commas = lambda s: csv.reader([s]).next()
4
5 assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 15: ordinal not in range(128)
But there is of course a third difference: my solution has not been thoroughly tested, and is not guaranteed to work in the cases I didn't think of... Now, since this approach seems to have several real use cases (e.g., non-TSV files, non-ASCII input), I would be glad if some regex guru, far from dismissing it as dangerous, could help to find out its limitations and improve it.
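On the Unicode point: the traceback above comes from Python 2. As a hedged sketch, assuming Python 3 (where str is Unicode and reader objects no longer have a .next() method), the same wrapper can be written with the built-in next():

```python
import csv

def split_on_commas(s):
    # csv.reader accepts any iterable of lines, so wrap the single line in a list
    return next(csv.reader([s]))

print(split_on_commas('10,20,"Juan, Chô",3,5'))
# ['10', '20', 'Juan, Chô', '3', '5']
```

As before, the surrounding double quotes are removed by the reader, which is the first difference described above.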
This is how I'd do it:
import re
data = "my string \"string is nice\" other string "
print(re.findall(r'(\w+|".*?")', data))
The output will be:
['my', 'string', '"string is nice"', 'other', 'string']
I don't think there's anything to explain here as the regex speaks for itself. Anyway, if you have any doubts I recommend regex101
\w+ - matches one or more word characters [a-zA-Z0-9_]
" - matches the character " literally
.*? - matches any character (except newline), as few times as possible (lazy)
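To see why the lazy quantifier matters, here is a small sketch comparing `.*?` with the greedy `.*` on a string that contains two quoted parts (the sample string is made up for illustration):

```python
import re

data = 'a "b c" d "e f"'

# Lazy: each quoted part stops at the nearest closing quote
lazy = re.findall(r'\w+|".*?"', data)
print(lazy)    # ['a', '"b c"', 'd', '"e f"']

# Greedy: the match runs from the first quote to the last one
greedy = re.findall(r'\w+|".*"', data)
print(greedy)  # ['a', '"b c" d "e f"']
```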
If you also want to get rid of the square brackets, do this:
import re
string = "my string \"string is nice\" other string "
parsed_string = re.findall(r'(\w+|".*?")', string)
print(", ".join(parsed_string))
The output will be:
my, string, "string is nice", other, string
As jonrsharpe and Alan Moore mentioned, Python's built-in csv module would be a much better solution.
As per their own example:
import csv
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Regular expressions will not work well here.
You can split by comma and then recombine...
Or use the csv module as suggested in the comments...
line = '10,20,"Installations, machines",3,5'
fields = line.strip().split(",")
result = []
tmpfield = ''
for checkfield in fields:
    # glue the piece back onto an unfinished quoted field, if there is one
    tmpfield = checkfield if tmpfield == '' else tmpfield + ',' + checkfield
    if tmpfield.strip().startswith('"'):
        # a quoted field is complete only once it also ends with a quote
        if tmpfield.strip().endswith('"'):
            result.append(tmpfield)
            tmpfield = ''
    else:
        result.append(tmpfield)
        tmpfield = ''
if tmpfield != '':
    result.append(tmpfield)
print(result)
Related
I am trying to process a list of files
file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']
the goal is to find the files whose name has only one character.
I tried this code
r = re.compile('[0-9]')
result_list = list(filter(r.match, file_list))
result_list
and got
['9', '7', '6', '8', '01', '4', '3', '2', '5']
where '01' should not be included.
I made a workaround
tmp = []
for i in file_list:
    if len(i)==1:
        tmp.append(i)
tmp
and I got
['9', '7', '6', '8', '4', '3', '2', '5']
This is exactly what I want, although the method is ugly.
How can I use regex in Python to finish the task?
r = re.compile('^[0-9]$')
The ^ matches the beginning of a line and $ matches the end.
And if you really want it to match any character, not just numbers, it should be
r = re.compile('^.$')
The . in the regex is a single-character wildcard.
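As an alternative to adding the ^ and $ anchors by hand, the pattern can be left as it is and applied with fullmatch, which only succeeds if the whole string matches. A minimal sketch, assuming Python 3.4 or later (where fullmatch is available); the name `digit` is just for illustration:

```python
import re

file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']

digit = re.compile(r'[0-9]')
# fullmatch requires the entire name to be a single digit
result_list = [name for name in file_list if digit.fullmatch(name)]
print(result_list)
# ['9', '7', '6', '8', '4', '3', '2', '5']
```

A small difference worth knowing: `$` also matches just before a trailing newline, so `re.match(r'^[0-9]$', '7\n')` succeeds, whereas `digit.fullmatch('7\n')` returns None.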
Match a string if it's simply any single character appearing at the beginning of the string (^.) right before the end of the string ($):
^.$
Regex101
Your Python then becomes:
r = re.compile('^.$')
result_list = list(filter(r.match, file_list))
Your code is equivalent to
[ i for i in file_list if len(i)==1]
This approach works for any file name that is exactly one character long, not only digits.
I have a tool that outputs some data. It is known that whenever '10' appears in the data, an extra '10' is added, i.e. the new data becomes ... '10', '10'. Sometimes there can be four consecutive '10's, which means that there are actually two '10's.
While reading the data I am trying to remove the duplicates. So far I have learnt how to remove duplicates when only two adjacent duplicates are found, but when an even number of duplicates is found, I want to keep half of them.
x = [ '10', '10', '00', 'DF', '20' ,'10' ,'10' ,'10' ,'10', ....]
Expected output
[ '10', '00', 'DF', '20', '10', '10' ..]
You may try to use groupby() from itertools:
X = ['10', '10', '00', 'DF', '20', '10', '10', '10', '10']
from itertools import groupby
result = []
for k, g in groupby(X):
    group = list(g)
    if k == '10':
        # keep half of the run (rounded up); // keeps this an integer on Python 3 too
        result.extend(group[:(len(group) + 1) // 2])
    else:
        result.extend(group)
print(result)
gives:
['10', '00', 'DF', '20', '10', '10']
A pure python approach
ls = []
dupe = True
for item in x:
    if ls and ls[-1] == item and dupe:
        dupe = False
        continue
    dupe = True
    ls.append(item)
['10', '00', 'DF', '20', '10', '10']
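Note that the loop above drops every second item of any repeated run, not only runs of '10'. If only '10' is ever doubled, as the question describes, a function along these lines may be closer to what is needed. It is a sketch; the name `halve_runs` and its `marker` parameter are made up for illustration:

```python
from itertools import groupby

def halve_runs(items, marker='10'):
    """Keep half (rounded up) of each run of `marker`; leave other items untouched."""
    result = []
    for key, run in groupby(items):
        run = list(run)
        if key == marker:
            run = run[:(len(run) + 1) // 2]
        result.extend(run)
    return result

print(halve_runs(['10', '10', '00', 'DF', '20', '10', '10', '10', '10']))
# ['10', '00', 'DF', '20', '10', '10']
print(halve_runs(['AA', 'AA', '10', '10']))
# ['AA', 'AA', '10']
```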
I have the following string
"h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
I would like to use regular expressions to extract the groups:
group1 56,7,1
group2 88,9,1
group3 58,8,1
group4 45
group5 100
group6 null
My ultimate goal is to have tuples such as (group1, group2), (group3, group4), (group5, group6). I am not sure if this all can be accomplished with regular expressions.
I have the following regular expression, which gives me partial results:
(?<=h=|d=)(.*?)(?=h=|d=)
The matches have an extra comma at the end like 56,7,1, which I would like to remove and d=, is not returning a null.
You likely do not need to use regex. A list comprehension and .split() can likely do what you need like:
Code:
def split_it(a_string):
    if not a_string.endswith(','):
        a_string += ','
    return [x.split(',')[:-1] for x in a_string.split('=') if len(x)][1:]
Test Code:
tests = (
    "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,",
    "h=56,7,1,d=88,9,1,d=,h=58,8,1,d=45,h=100",
)
for test in tests:
    print(split_it(test))
Results:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], ['']]
[['56', '7', '1'], ['88', '9', '1'], [''], ['58', '8', '1'], ['45'], ['100']]
You could match rather than split using the expression
[dh]=([\d,]*),
and grab the first group, see a demo on regex101.com.
That is
[dh]= # d or h, followed by =
([\d,]*) # capture digits and commas 0+ times
, # require a comma afterwards
In Python:
import re
rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
numbers = [m.group(1) for m in rx.finditer(string)]
print(numbers)
Which yields
['56,7,1', '88,9,1', '58,8,1', '45', '100', '']
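To get from this flat list to the (h, d) tuples the question asks for, the matches can be paired up two at a time. A short sketch, assuming the h= and d= entries strictly alternate as in the sample string (the name `pairs` is illustrative):

```python
import re

rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
numbers = [m.group(1) for m in rx.finditer(string)]

# pair every even-indexed match (h) with the odd-indexed one that follows (d)
pairs = list(zip(numbers[::2], numbers[1::2]))
print(pairs)
# [('56,7,1', '88,9,1'), ('58,8,1', '45'), ('100', '')]
```

The empty d= value shows up as an empty string here; it can be mapped to None afterwards if that is preferred.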
You can use ([a-z]=)([0-9,]+)(,)? and then pick the group you need by its index.
You could use $ in positive lookahead to match against the end of the string:
import re
input_str = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
groups = []
for x in re.findall('(?<=h=|d=)(.*?)(?=d=|h=|$)', input_str):
    m = x.strip(',')
    if m:
        groups.append(m.split(','))
    else:
        groups.append(None)
print(groups)
Output:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], None]
Here, I have assumed that parameters will only have numerical values. If it is so, then you can try this.
(?<=h=|d=)([0-9,]*)
Hope it helps.
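For what it's worth, here is how that pattern could be applied in Python. This is a sketch: each match still carries the trailing comma, so it is stripped afterwards (the names `raw` and `values` are illustrative):

```python
import re

input_str = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"

# each match starts right after "h=" or "d=" and takes the digits and commas that follow
raw = re.findall(r'(?<=h=|d=)([0-9,]*)', input_str)
values = [m.rstrip(',') for m in raw]
print(values)
# ['56,7,1', '88,9,1', '58,8,1', '45', '100', '']
```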
Here is my input file sample (z.txt)
>qrst
ABCDE-- 6 6 35 25 10
>qqqq
ABBDE-- 7 7 28 29 2
I store the alpha and numeric in separate lists. Here is the output of numerics list
#Output : ['', '6', '', '6', '35', '25', '10']
['', '7', '', '7', '28', '29', '', '2']
The output has an extra space when there are single digits because of the way the file has been created. Is there any way to get rid of the '' (empty strings)?
You can take advantage of filter with None as function for that:
numbers = ['', '7', '', '7', '28', '29', '', '2']
numbers = list(filter(None, numbers))  # list() is needed on Python 3, where filter is lazy
print(numbers)
See it in action here: https://eval.in/640707
If your input looks like this:
>>> li=[' 6 6 35 25 10', ' 7 7 28 29 2']
Just use .split() which will handle the repeated whitespace as a single delimiter:
>>> [e.split() for e in li]
[['6', '6', '35', '25', '10'], ['7', '7', '28', '29', '2']]
vs .split(" "):
>>> [e.split(" ") for e in li]
[['', '6', '', '6', '', '35', '', '25', '', '10'], ['', '7', '7', '28', '', '29', '2']]
I guess there are many ways to do this. I prefer using regular expressions, although this might be slower if you have a large input file with tens of thousands of lines. For smaller files, it's okay.
Few points:
Use context manager (with statement) to open files. When the with statement ends, the file will automatically be closed.
An alternative to re.findall() is re.match() or re.search(). Subsequent code will be slightly different.
If org, sequence and numbers are related element-wise, I suggest you maintain a list of 3-element tuples instead. Of course, you have to buffer the org field and add to the list of tuples when the next line is obtained.
import re
org = []
sequence = []
numbers = []
with open('ddd', 'r') as f:
    for line in f.readlines():
        line = line.strip()
        if re.search(r'^>', line):
            org.append(line)
        else:
            # the line is already stripped, so anchor on the end of the line;
            # requiring trailing whitespace here would drop the last number
            m = re.findall(r'^([A-Z]+--)\s+(.*)$', line)
            if m:
                sequence.append(m[0][0])
                numbers.append(list(map(int, m[0][1].split())))  # convert from str to int
print(org, sequence, numbers)
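The suggestion about keeping 3-element tuples could look roughly like this. It is a sketch under a few assumptions: the input is passed as a list of lines rather than read from a file, header lines start with '>', and every data line is a sequence followed by whitespace-separated integers. The name `parse_records` is made up for illustration:

```python
def parse_records(lines):
    """Return (org, sequence, numbers) tuples from alternating header/data lines."""
    records = []
    org = None  # buffered header line, waiting for its data line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            org = line
        else:
            # split() with no argument treats runs of whitespace as one delimiter
            sequence, *fields = line.split()
            records.append((org, sequence, [int(n) for n in fields]))
            org = None
    return records

sample = ['>qrst', 'ABCDE--  6  6 35 25 10', '>qqqq', 'ABBDE--  7  7 28 29  2']
print(parse_records(sample))
# [('>qrst', 'ABCDE--', [6, 6, 35, 25, 10]), ('>qqqq', 'ABBDE--', [7, 7, 28, 29, 2])]
```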
I'm looking for a way to expand numbers that are separated by slashes. In addition to the slashes, parentheses may be used around some (or all) numbers to indicate a "group" which may be repeated (by the number of times directly following the parentheses) or repeated in reverse (followed by 's' as shown in the second set of examples). Some examples are:
1 -> ['1'] -> No slashes, no parentheses
1/2/3/4 -> ['1', '2', '3', '4'] -> basic example with slashes
1/(2)4/3 -> ['1', '2', '2', '2', '2', '3'] -> 2 gets repeated 4 times
1/(2/3)2/4 -> ['1', '2', '3', '2', '3', '4'] -> 2/3 is repeated 2 times
(1/2/3)2 -> ['1', '2', '3', '1', '2', '3'] -> Entire sequence is repeated twice
(1/2/3)s -> ['1', '2', '3', '3', '2', '1'] -> Entire sequence is repeated in reverse
1/(2/3)s/4 -> ['1', '2', '3', '3', '2', '4'] -> 2/3 is repeated in reverse
In the most general case, there could even be nested parentheses, which I know generally make the use of regex impossible. In the current set of data I need to process, there are no nested sets like this, but I could see potential use for it in the future. For example:
1/(2/(3)2/4)s/5 -> 1/(2/3/3/4)s/5
-> 1/2/3/3/4/4/3/3/2/5
-> ['1', '2', '3', '3', '4', '4', '3', '3', '2', '5']
I know of course that regex cannot do all of this (especially with the repeating/reversing sets of parenthesis). But if I can get a regex that at least separates the strings of parenthesis from those not in parenthesis, then I could probably make some loop pretty easily to take care of the rest. So, the regex I'd be looking for would do something like:
1 -> ['1']
1/2/3/4 -> ['1', '2', '3', '4']
1/(2)4/3 -> ['1', '(2)4', '3']
1/(2/3)2/4 -> ['1', '(2/3)2', '4']
1/(2/(3)2/4)s/5 -> ['1', '(2/(3)2/4)s', '5']
And then I could loop on this result and continue expanding any parentheses until I have only digits.
EDIT
I wasn't totally clear in my original post. In my attempt to make the examples as simple as possible, I perhaps oversimplified them. This needs to work for numbers >= 10 as well as negative numbers.
For example:
1/(15/-23)s/4 -> ['1', '(15/-23)s', '4']
-> ['1', '15', '-23', '-23', '15', '4']
Since you are dealing with nested parentheses, regex can't help you much here. It cannot easily convert the string to the list you want at the end.
You would be better off parsing the string yourself. You can try this code, which meets your requirement:
Parsing the string into a list without losing the parentheses:
def parse(s):
    li = []
    open = 0
    closed = False
    start_index = -1
    for index, c in enumerate(s):
        if c == '(':
            if open == 0:
                start_index = index
            open += 1
        elif c == ')':
            open -= 1
            if open == 0:
                closed = True
        elif closed:
            li.append(s[start_index: index + 1])
            closed = False
        elif open == 0 and c.isdigit():
            li.append(c)
    return li
This will give you for the string '1/(2/(3)2/4)s/5' the following list:
['1', '(2/(3)2/4)s', '5']
and for the string '1/(15/-23)s/4', as per your changed requirement, this gives:
['1', '(15/-23)s', '4']
Now you need to break the parenthesized elements down further to get the individual list elements.
Expanding the strings with parentheses into individual list elements:
Here you can make use of a regex, dealing with only the innermost parentheses at each step:
import re
import itertools

def expand(s):
    ''' Group 1 contains the string inside the parenthesis
    Group 2 contains the digit or character `s` after the closing parenthesis
    '''
    match = re.search(r'\(([^()]*)\)(\d|s)', s)
    if match:
        group0 = match.group()
        group1 = match.group(1)
        group2 = match.group(2)
        if group2.isdigit():
            # A digit after the closing parenthesis. Repeat the string inside
            s = s.replace(group0, ((group1 + '/') * int(group2))[:-1])
        else:
            # An `s` after the closing parenthesis. Append the reversed sequence
            s = s.replace(group0, '/'.join(group1.split('/') + group1.split('/')[::-1]))
    if '(' in s:
        return expand(s)
    return s

li = parse('1/(15/-23)s/4')
for index, s in enumerate(li):
    if '(' in s:
        s = expand(s)
        li[index] = s.split('/')
print(list(itertools.chain(*li)))
This will give you the required result:
['1', '15', '-23', '-23', '15', '4']
The above code iterates over the list generated by parse(s) and, for each element, recursively expands the innermost parentheses.
Here is another way to get that done.
def expand(string):
    level = 0
    buffer = ""
    container = []
    for char in string:
        if char == "/":
            if level == 0:
                container.append(buffer)
                buffer = ""
            else:
                buffer += char
        elif char == "(":
            level += 1
            buffer += char
        elif char == ")":
            level -= 1
            buffer += char
        else:
            buffer += char
    if buffer != "":
        container.append(buffer)
    return container
Regular expressions are the completely wrong tool for this job. There's a long, drawn out explanation as to why regular expressions are not appropriate (If you want to know why, here's an online course). A simple recursive parser is easy enough to write to handle this that you'd probably be done with it well before you finished debugging your regular expression.
It's a slow day, so I went ahead and wrote one myself, complete with doctests.
def parse(s):
    """
    >>> parse('1')
    ['1']
    >>> parse('1/2/3/4')
    ['1', '2', '3', '4']
    >>> parse('1/(2)4/3')
    ['1', '2', '2', '2', '2', '3']
    >>> parse('1/(2/3)2/4')
    ['1', '2', '3', '2', '3', '4']
    >>> parse('(1/2/3)2')
    ['1', '2', '3', '1', '2', '3']
    >>> parse('1/(2/3)s/4')
    ['1', '2', '3', '3', '2', '4']
    >>> parse('(1/2/3)s')
    ['1', '2', '3', '3', '2', '1']
    >>> parse('1/(2/(3)2/4)s/5')
    ['1', '2', '3', '3', '4', '4', '3', '3', '2', '5']
    """
    return _parse(list(s))

def _parse(chars):
    output = []
    while len(chars):
        c = chars.pop(0)
        if c == '/':
            continue
        elif c == '(':
            sub = _parse(chars)
            nextC = chars.pop(0)
            if nextC.isdigit():
                n = int(nextC)
                sub = n * sub
                output.extend(sub)
            elif nextC == 's':
                output.extend(sub)
                output.extend(reversed(sub))
        elif c == ')':
            return output
        else:
            output.extend(c)
    return output

if __name__ == "__main__":
    import doctest
    doctest.testmod()
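The parser above works one character at a time, so it covers the single-digit examples in the doctests. The edit to the question also asks for numbers >= 10 and negative numbers. One possible way to handle that, shown here only as a sketch, is to split the string into tokens first and then run the same recursive idea over the tokens. The names `TOKEN`, `expand_groups` and `_expand` are made up for illustration, and the code assumes well-formed input in which every ')' is directly followed by 's' or a repeat count:

```python
import re

# a (possibly negative) integer, or one of the structural characters
TOKEN = re.compile(r'-?\d+|[()s/]')

def expand_groups(text):
    """Expand a slash-separated string with (..)N and (..)s groups into a flat list."""
    tokens = TOKEN.findall(text)
    result, _ = _expand(tokens, 0)
    return result

def _expand(tokens, pos):
    """Expand tokens from pos up to the matching ')' or the end; return (items, next_pos)."""
    output = []
    while pos < len(tokens):
        tok = tokens[pos]
        pos += 1
        if tok == '/':
            continue
        if tok == ')':
            break
        if tok == '(':
            sub, pos = _expand(tokens, pos)
            # the token right after ')' is either 's' or a repeat count;
            # anything else (malformed input) makes int() raise ValueError
            suffix = tokens[pos] if pos < len(tokens) else ''
            pos += 1
            if suffix == 's':
                output.extend(sub + sub[::-1])
            else:
                output.extend(sub * int(suffix))
        else:
            output.append(tok)
    return output, pos

print(expand_groups('1/(15/-23)s/4'))
# ['1', '15', '-23', '-23', '15', '4']
```

Tokenizing first keeps the recursive part almost identical to the character-based version, while multi-digit repeat counts such as (1/2)12 come along for free.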