split string into list by regex - python

I need a regex that splits an input string into a list according to these rules:
1) Split on dots;
2) Do not split inside a quoted expression.
Examples:
'a.b.c' -> ['a', 'b', 'c'];
'a."b.c".d' -> ['a', 'b.c', 'd'];
"a.'b.c'.d" -> ['a', 'b.c', 'd'];
"a.'b c'.d" -> ['a', 'b c', 'd'];

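You could use the third-party regex module (a newer alternative to re) with the following expression:
(["']).*?\1(*SKIP)(*FAIL)|\.
This matches an opening quote, lazily consumes everything up to the matching closing quote, and then makes that whole quoted part fail via (*SKIP)(*FAIL), so it can never be a split point. The other alternative is the dot itself, which is where the string gets split.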
In Python:
import regex as re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1(*SKIP)(*FAIL)|\.""")
for line in data.split("\n"):
    if line:
        parts = [part.strip("'").strip('"') for part in rx.split(line) if part]
        print(parts)
Which yields
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']
['a', 'b c', 'd']
See a demo on regex101.com.
If you want to stick with the re module, you could first replace each unquoted dot with a placeholder and then split on that placeholder.
import re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1|(?P<dot>\.)""")
needle = "SUPERMAN"
def replacer(match):
    # Only dots outside quotes are captured in the 'dot' group.
    if match.group('dot') is not None:
        return needle
    else:
        return match.group(0)
for line in data.split("\n"):
    if line:
        line = rx.sub(replacer, line)
        parts = [part.strip("'").strip('"') for part in line.split(needle) if part]
        print(parts)
This yields the exact same output as above. Please note that both approaches won't work for escaped quotes.
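If escaped quotes do matter, one option is to drop the regex and walk the string character by character. The following is only a minimal sketch under stated assumptions: the function name split_outside_quotes and its parameters are hypothetical, a backslash is assumed to be the escape character, and empty parts are kept rather than filtered out.

```python
def split_outside_quotes(text, sep=".", quotes="\"'", escape="\\"):
    """Split text on sep, ignoring separators inside quoted parts."""
    parts = []      # finished pieces
    buf = []        # characters of the piece being built
    quote = None    # the quote character we are currently inside, if any
    chars = iter(text)
    for ch in chars:
        if ch == escape:
            # Keep the character after the escape literally.
            buf.append(next(chars, ""))
        elif quote:
            if ch == quote:
                quote = None        # closing quote, not part of the value
            else:
                buf.append(ch)
        elif ch in quotes:
            quote = ch              # opening quote, not part of the value
        elif ch == sep:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts


print(split_outside_quotes('a."b.c".d'))
print(split_outside_quotes(r'a."b\".c".d'))
```

Unlike the regex versions, this removes the quotes while scanning, so no strip() pass is needed afterwards.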

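You can do it with some extra effort. Here is how.
First split on '.' and then rejoin the pieces that belong to a quoted part.
string_data = 'a."b.c".d'
data = string_data.split('.')
result = []
value = None
for i in range(len(data)):
    if value:
        # This piece was already merged into the previous quoted value.
        value = None
    else:
        if '"' in data[i]:
            # Rejoin the two halves of the quoted part and drop the quotes.
            value = (data[i] + '.' + data[i + 1]).strip('"')
        if value:
            result.append(value)
        else:
            result.append(data[i])
print(result)
This gives the same output as in your question for this input. Note that it only handles double quotes, and only a quoted part that contains exactly one dot.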

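As an alternative, you could use an alternation (|) with a positive lookbehind (?<=...) and a positive lookahead (?=...) for the double and single quotes:
(?<=").*?(?=")|(?<=').*?(?=')|[a-z]+
import re
regex = r"(?<=\").*?(?=\")|(?<=').*?(?=')|[a-z]+"
line = "a.\"b.t\".qq.d.d.'d'.'d.g.r'.d.d"
print(re.findall(regex, line))
['a', 'b.t', 'qq', 'd', 'd', 'd', '.', 'd.g.r', 'd', 'd']
Note the stray '.' in the output: because the lookarounds do not consume the quotes, the text between two adjacent quoted items also counts as "between quotes" and is matched.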
Test output python
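One way to avoid matching text that merely sits between two quoted items is to let the pattern consume the quotes themselves instead of using lookarounds. This is a hedged sketch, not the answer's own pattern: TOKEN and split_tokens are hypothetical names, and escaped quotes are still not handled.

```python
import re

# Each alternative consumes one whole token: a double-quoted part,
# a single-quoted part, or a run of characters that are neither dots nor quotes.
TOKEN = re.compile(r'''"([^"]*)"|'([^']*)'|([^.'"]+)''')


def split_tokens(line):
    # Every alternative has exactly one capture group, so m.lastindex
    # points at the group that actually matched.
    return [m.group(m.lastindex) for m in TOKEN.finditer(line)]


line = "a.\"b.t\".qq.d.d.'d'.'d.g.r'.d.d"
print(split_tokens(line))
```

Unlike [a-z]+, the third alternative also accepts digits, spaces and uppercase letters in unquoted parts.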

Here is a regex for you:
\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?
Implementation:
import re
regex = re.compile(r"""\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?""")
def str2list(string):
    # findall returns one tuple per match with one slot per capture group;
    # keep only the non-empty slot of each tuple.
    result = []
    for groups in regex.findall(string):
        for item in groups:
            if item:
                result.append(item)
    return result
print(str2list('a.b.c'))
print(str2list('a."b.c".d'))
print(str2list("a.'b.c'.d"))
Output:
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']

Related

Splitting a string and retaining the delimiter with the delimiter appearing contiguously

I have the following string:
bar = 'F9B2Z1F8B30Z4'
I have a function foo that splits the string on F, then adds back the F delimiter.
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F') if elem != '']
    return res
This works unless there are two "F"s back-to-back in the string. For example,
foo('FF9B2Z1F8B30Z4')
returns
['F9B2Z1', 'F8B30Z4']
(the double "F" at the start of the string is not processed)
I'd like the function to split on the first "F" and add it to the list, as follows:
['F', 'F9B2Z1', 'F8B30Z4']
If there is a double "F" in the middle of the string, then the desired behavior would be:
foo('F9B2Z1FF8B30Z4')
['F9B2Z1', 'F', 'F8B30Z4']
Any help would be greatly appreciated.
Instead of filtering with if, use slicing, because an empty string is a problem only at the beginning:
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F')]
    return res[1:] if my_str and my_str[0]=='F' else res
Output:
>>> foo('FF9B2Z1F8B30Z4')
['F', 'F9B2Z1', 'F8B30Z4']
>>> foo('FF9B2Z1FF8B30Z4FF')
['F', 'F9B2Z1', 'F', 'F8B30Z4', 'F', 'F']
>>> foo('9B2Z1F8B30Z4')
['F9B2Z1', 'F8B30Z4']
>>> foo('')
['F']
Using regex it can be done with
import re
pattern = r'^[^F]+|(?<=F)[^F]*'
^[^F]+ matches the leading run of non-F characters when the string does not start with F.
(?<=F)[^F]* matches whatever follows each F up to the next F, including the empty string when two Fs are adjacent.
>>> print(['F' + x for x in re.findall(pattern, 'abcFFFAFF')])
['Fabc', 'F', 'F', 'FA', 'F', 'F']
>>> print(['F' + x for x in re.findall(pattern, 'FFabcFA')])
['F', 'Fabc', 'FA']
>>> print(['F' + x for x in re.findall(pattern, 'abc')])
['Fabc']
Note that this returns nothing for empty strings. If empty strings need to return ['F'] then pattern can be changed to pattern = r'^[^F]+|(?<=F)[^F]*|^$' adding ^$ to capture empty strings.
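A variation on the same idea is to let the pattern match the F together with what follows it, so nothing has to be glued back on afterwards. This is a sketch with a hypothetical function name, split_keep_f, and it deliberately differs from the answers above in two ways: leading text before the first F is returned as-is (no F is prepended), and an empty string gives an empty list.

```python
import re


def split_keep_f(s):
    # Either the leading run before the first F, or an F followed by
    # everything up to (but not including) the next F.
    return re.findall(r'^[^F]+|F[^F]*', s)


print(split_keep_f('FF9B2Z1F8B30Z4'))
print(split_keep_f('F9B2Z1FF8B30Z4'))
```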

How to extract the value between the key using RegEx?

I have text like:
"abababba"
I want to extract, as a list, the characters between each pair of a's.
For the above text, I am expecting output like:
['b', 'b', 'bb']
I have used:
re.split(r'^a(.*?)a$', data)
But it doesn't work.
You could use re.findall to return the capture group values with the pattern:
a([^\sa]+)(?=a)
a matches a literal a
([^\sa]+) is capture group 1: one or more characters other than a or whitespace (drop the \s if spaces should be allowed in the match)
(?=a) is a positive lookahead asserting that an a follows
Regex demo
import re
pattern = r"a([^\sa]+)(?=a)"
s = "abababba"
print(re.findall(pattern, s))
Output
['b', 'b', 'bb']
You could use a list comprehension to achieve this:
s = "abababba"
l = [x for x in s.split("a") if not x == ""]
print(l)
Output:
['b', 'b', 'bb']
The ^ and $ anchors only match at the beginning and end of the string (or of each line with re.MULTILINE), so your pattern can match at most once, spanning the whole string.
Without the anchors, you get the desired list with:
re.split(r'a(.*?)a', data)[1:-1]
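To see the difference the anchors make, it can help to print both splits side by side. This is just an illustrative sketch using the question's sample string; the variable names are mine.

```python
import re

data = "abababba"

# Anchored: the single match covers the whole string, so the only
# captured group is everything between the first and the last a.
anchored = re.split(r'^a(.*?)a$', data)
print(anchored)          # ['', 'bababb', '']

# Unanchored: the pattern matches 'aba' and then 'abba'.
unanchored = re.split(r'a(.*?)a', data)
print(unanchored)        # ['', 'b', 'b', 'bb', '']
print(unanchored[1:-1])  # ['b', 'b', 'bb']
```

Note that the middle 'b' in the unanchored result is the text left over between the two matches rather than a captured group, so this approach depends on how the matches happen to tile the input.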
Why not use a normal split:
"abababba".split("a") --> ['', 'b', 'b', 'bb', '']
And remove the empty parts as needed:
# remove all empties:
[*filter(None,"abababba".split("a"))] -> ['b', 'b', 'bb']
or
# only leading/trailing empties (if any)
"abababba".strip("a").split("a") --> ['b', 'b', 'bb']
or
# only leading/trailing empties (assuming always enclosed in 'a')
"abababba".split("a")[1:-1] --> ['b', 'b', 'bb']
If you must use a regular expression, perhaps findall() will let you use a simpler pattern while covering all edge cases (ignoring all empties):
re.findall(r"[^a]+","abababba") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","abababb") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","bababb") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","babaabb") --> ['b', 'b', 'bb']

A simple question about regex usage in list

I have a list of lists, list_all, shown below, and I am looking for the item 'c' in each sublist. There is no 'c' in the second sublist. The code below gives ['c', 'c'], but I want ['c', '', 'c'] so that the result has the same length as list_all. Could you please help me add an empty element to the result when there is no match?
import re
list_all = [['a','b','c','d'],['a','b','d'],['a','b','c','d','e']]
listofresult =[]
for h in [*range(len(list_all))]:
    for item in list_all[h]:
        patern = r"(c)"
        if re.search(patern, item):
            listofresult.append(item)
        else:
            None
print(listofresult)
Try this:
import re
list_all = [['a','b','c','d'],['a','b','d'],['a','b','c','d','e']]
temp = True
listofresult =[]
for h in range(len(list_all)):
    for item in list_all[h]:
        patern = r"(c)"
        if re.search(patern, item):
            listofresult.append(item)
            temp = False
    # No match in this sublist: add a placeholder, then reset the flag.
    if temp:
        listofresult.append("")
    temp = True
print(listofresult)
That's an unusual use of regex! But if you insist, this correction might help:
import re
list_all = [['a', 'b', 'c', 'd'], ['a', 'b', 'd'], ['a', 'b', 'c', 'd', 'e']]
list_of_result = []
for h in list_all:
    result = ''
    for item in h:
        pattern = r"(c)"
        if re.search(pattern, item):
            result = item
            break
    if result:
        list_of_result.append(result)
    else:
        list_of_result.append('')
print(list_of_result)
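The same "first match or empty string" logic can be written more compactly with next() and a default value. This is a sketch rather than a replacement for the answers above; the names pattern and list_of_result are reused only for illustration.

```python
import re

list_all = [['a', 'b', 'c', 'd'], ['a', 'b', 'd'], ['a', 'b', 'c', 'd', 'e']]
pattern = re.compile(r"c")

# For each sublist take the first item that matches, or '' if none does.
list_of_result = [
    next((item for item in sub if pattern.search(item)), '')
    for sub in list_all
]
print(list_of_result)  # ['c', '', 'c']
```

Because next() stops at the first match, each sublist contributes exactly one element, which keeps the result the same length as list_all.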

how to turn string into nested list with elements separated with commas

I have a string which looks like this:
'(a (b (c d e f)) g)'
I want to turn it into such a nested list:
['a', ['b', ['c', 'd', 'e', 'f']], 'g']
I used this function:
import re
def tree_to_list(text, left=r'[(]', right=r'[)]', sep=r','):
    pat = r'({}|{}|{})'.format(left, right, sep)
    tokens = re.split(pat, text)
    stack = [[]]
    for x in tokens:
        if not x or re.match(sep, x): continue
        if re.match(left, x):
            stack[-1].append([])
            stack.append(stack[-1][-1])
        elif re.match(right, x):
            stack.pop()
            if not stack:
                raise ValueError('error: opening bracket is missing')
        else:
            stack[-1].append(x)
    if len(stack) > 1:
        print(stack)
        raise ValueError('error: closing bracket is missing')
    return stack.pop()
But the result is not what I expected. There are no commas between the strings:
['a', ['b', ['c' 'd' 'e' 'f']], 'g']
Could you please help me with that?
You can use recursion with a generator:
import re
data = '(a (b (c d e f)) g)'
def group(d):
    a = next(d, ')')
    if a != ')':
        yield list(group(d)) if a == '(' else a
        yield from group(d)
print(next(group(iter(re.findall(r'\w+|[()]', data)))))
Output:
['a', ['b', ['c', 'd', 'e', 'f']], 'g']
Using string replacements to turn the input into the string with the desired Python value, and literal_eval to turn it into the value itself:
>>> import ast, re
>>> data = '(a (b (c d e f)) g)'
>>> s = re.sub(r'(\w+)', r'"\1"', data) # quote words
>>> s = re.sub(r'\s+', ',', s) # whitespace to comma
>>> s = s.replace('(', '[').replace(')', ']') # () -> []
>>> ast.literal_eval(s)
['a', ['b', ['c', 'd', 'e', 'f']], 'g']
People have suggested their own solutions, but the problem with the code you are using is that sep is set to the regex r',', which matches a single comma. As you say, you don't use commas to separate the text, you use whitespace. If you replace the default value of sep with r'\s', or call the function like tree_to_list('(a (b (c d e f)) g)', sep=r'\s'), then it works for me.
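For comparison, here is a minimal sketch of the same stack idea that tokenizes with re.findall, so whitespace never becomes a token and no separator pattern is needed. The name parse_nested is hypothetical, and the error messages are only illustrative.

```python
import re


def parse_nested(text):
    """Parse '(a (b c))' style text into nested lists of strings."""
    root = []
    stack = [root]  # stack[-1] is the list currently being filled
    # Tokens are single parentheses or runs of non-space, non-paren characters.
    for tok in re.findall(r"[()]|[^\s()]+", text):
        if tok == "(":
            new = []
            stack[-1].append(new)
            stack.append(new)
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("opening bracket is missing")
            stack.pop()
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("closing bracket is missing")
    return root


# The outermost parentheses produce one wrapping list, so take element 0.
print(parse_nested('(a (b (c d e f)) g)')[0])
```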

Split a string in Python having parenthesis (multiple splitters)

I have a string, for example:
"ab(abcds)kadf(sd)k(afsd)(lbne)"
I want to split it to a list such that the list is stored like this:
a
b
abcds
k
a
d
f
sd
k
afsd
lbne
I need to get the elements outside the parenthesis in separate rows and the ones inside it in separate ones.
I am not able to think of any solution to this problem.
You can use iter to make an iterator and use itertools.takewhile to extract the strings between the parens:
from itertools import takewhile
s = "ab(abcds)kadf(sd)k(afsd)(lbne)"
it = iter(s)
print([ch if ch != "(" else "".join(takewhile(lambda x: x != ")", it)) for ch in it])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
If ch is not a ( we just take the character; if it is a (, takewhile keeps taking characters from the same iterator until it hits the closing ).
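The trick relies on takewhile and the outer loop sharing one iterator. A small step-by-step sketch (the sample string and variable names are mine) shows that takewhile also consumes the closing parenthesis, which is why it never appears in the result:

```python
from itertools import takewhile

it = iter("x(ab)y")

first = next(it)   # 'x', an ordinary character
paren = next(it)   # '(', signals the start of a group

# takewhile pulls 'a' and 'b' from the shared iterator; it also pulls ')'
# to test it, and that character is then discarded.
inside = "".join(takewhile(lambda c: c != ")", it))

rest = list(it)    # whatever is left after the group

print(first, paren, inside, rest)
```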
Or, using re.findall, get all strings enclosed in parentheses with \((.+?)\) and all other characters with (\w):
import re
print([''.join(tup) for tup in re.findall(r'\((.+?)\)|(\w)', s)])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
You just need to use the magic of re.split and some logic.
import re
string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
temp = []
x = re.split(r'[(]', string)
# x = ['ab', 'abcds)kadf', 'sd)k', 'afsd)', 'lbne)']
for i in x:
    if ')' not in i:
        temp.extend(list(i))
    else:
        t = re.split(r'[)]', i)
        temp.append(t[0])
        temp.extend(list(t[1]))
print(temp)
# temp = ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
Note the difference between append and extend here: append adds the parenthesized string as one element, while extend adds each character outside the parentheses as a separate element.
I hope this helps.
You have two options. The really easy one is to just iterate over the string. For example:
my_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
my_list = []
in_parens = False
buffer = ''
for char in my_string:
    if char == '(':
        in_parens = True
    elif char == ')':
        in_parens = False
        my_list.append(buffer)
        buffer = ''
    elif in_parens:
        buffer += char
    else:
        my_list.append(char)
print(my_list)
The other option is regex.
I would suggest regex. It is worth practicing.
Try Python's re module. If you are new to re it may take a bit of time, but you can do all kinds of string manipulation once you get it.
import re
search_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
re_pattern = re.compile(r'(\w)|\((\w*)\)') # Match single character or characters in parentheses
print([x if x else y for x, y in re_pattern.findall(search_string)])
