Python re.split with comma and parenthesis "(" and ")" [duplicate] - python

This question already has answers here:
Tuple to List - Python / PostgreSQL return type of SETOF Record
(2 answers)
Closed 8 years ago.
So I got this code:
from dosql import *
import cgi
import simplejson as json
import re
def index(req, userID):
    userID = cgi.escape(userID)
    get = doSql()
    rec = get.execqry("select get_progressrecord('" + userID + "');",False)[0][0]
    result = str(rec)
    stringed = re.split(',', result)
    return json.dumps(stringed)
And it returns this:
But I want to exclude the parentheses "(" and ")" too. How can I put multiple delimiters in the regex?

Using str.strip, you can remove the specified surrounding characters:
>>> row = ["(178.00", "65.00", "20.52", "normal", "18", "0.00)"]
>>> [x.strip('(),') for x in row]
['178.00', '65.00', '20.52', 'normal', '18', '0.00']
BTW, if get.execqry(..) returns a tuple, string manipulation is not necessary.
a_tuple = get.execqry(....)
# or (if you want a list)
a_list = list(get.execqry(....))

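Put a | between them, and escape the parentheses with backslashes because ( and ) are regex metacharacters:
stringed = re.split(r',|\(|\)', result)
Note that a delimiter at the very start or end of the string produces an empty string at that end of the result.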

You can use a simple regex like this:
[,()]
The idea is to match the characters you want using a regex character class [...]. So, this will match commas or parentheses.
On the other hand, if you want to capture the following content:
(250.00", "612.00", "55.55", "normal", "1811", "0.00)
You could use something like this:
([\w.]+)
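As a minimal sketch of both patterns in Python: the sample string below is an assumption modelled on the row shown in the answers, not the asker's real query result. Splitting on the character class leaves empty strings at the ends, so they are filtered out. Matching the tokens directly avoids that step.

```python
import re

# Hypothetical record text, modelled on the row shown in the answers above.
result = "(178.00,65.00,20.52,normal,18,0.00)"

# Splitting on the class leaves empty strings where the input starts or
# ends with a delimiter, so those are filtered out.
split_parts = [p for p in re.split(r"[,()]", result) if p]

# Matching the wanted tokens directly avoids the empty strings altogether.
found_parts = re.findall(r"[\w.]+", result)

print(split_parts)
print(found_parts)
```

Both approaches give the same list here. The findall variant would also drop characters such as quotes or spaces, which may or may not be what you want.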

Related

Regex search for a word and extract until a character [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I need help parsing a string that contains a value for each attribute. Below is my sample string:
Type=<Series VR> Model=<1Ac4> ID=<34> conn seq=<2>
From the above, I have to generate the attribute values as shown below.
Type=Series VR
Model=1Ac4
ID=34
conn seq=2
I am new to using regex. Any help is appreciated. Thanks.
This script will extract key, value from the string:
import re
s = 'Type=<Series VR> Model=<1Ac4> ID=<34> conn seq=<2>'
for k, v in re.findall(r'([^=]+)=<([^>]+)>\s*', s):
    print('{}={}'.format(k, v))
Prints:
Type=Series VR
Model=1Ac4
ID=34
conn seq=2
EDIT: You can extract key,values to dictionary and then access it via .get():
import re
s = 'Type=<Series VR> Model=<1Ac4> ID=<34> conn seq=<2>'
d = dict(re.findall(r'([^=]+)=<([^>]+)>\s*', s))
print(d.get('Model', ''))
print(d.get('NonExistentKey', ''))
Prints:
1Ac4
Try this:
r'([^=]+)=<([^>]+)>'
This works as follows:
([^=]+) matches a group of one or more characters that aren't =.
=< matches those characters literally.
([^>]+) matches another group of one or more characters that aren't >.
> matches the closing bracket literally.
To match a specific key and extract its value, try this (example shown is 'Model'):
r'Model=<([^>]+)>'
This now has only one group, ([^>]+), which matches the value contained within <>. This could be generalized for any key, like so: f'{key}=<([^>]+)>'
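A minimal sketch of that generalization follows. The helper name extract_value is hypothetical, and re.escape is added on the assumption that a key might contain regex metacharacters.

```python
import re

def extract_value(s, key):
    """Return the value stored as key=<value> in s, or None if the key is absent."""
    # re.escape keeps characters in the key from being read as regex syntax.
    m = re.search(re.escape(key) + r"=<([^>]+)>", s)
    return m.group(1) if m else None

s = 'Type=<Series VR> Model=<1Ac4> ID=<34> conn seq=<2>'
print(extract_value(s, 'Model'))
print(extract_value(s, 'conn seq'))
print(extract_value(s, 'Missing'))
```

Note that the search is not anchored, so a key that is the tail of a longer key (for example 'seq' inside 'conn seq') would also match.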

how to remove parentheses and string from a string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I've been trying to find a way to solve this, but it seems that I'm poor at regex ;)
I need to remove "(WindowsPath" and ")" from the strings in a list:
x= ["(WindowsPath('D:/test/1_birds_bp.png'),WindowsPath('D:/test/1_eagle_mp.png'))", "(WindowsPath('D:/test/2_reptiles_bp.png'),WindowsPath('D:/test/2_crocodile_mp.png'))"]
So I tried
import re
cleaned_x = [re.sub("(?<=WindowsPath\(').*?(?='\))",'',a) for a in x]
outputs
["(WindowsPath(''),WindowsPath(''))", "(WindowsPath(''),WindowsPath(''))"]
What I need to have is:
cleaned_x= [('D:/test/1_birds_bp.png','D:/test/1_eagle_mp.png'), ('D:/test/2_reptiles_bp.png','D:/test/2_crocodile_mp.png')]
basically tuples in a list.
You can accomplish this by using re.findall like this:
>>> cleaned_x = [tuple(re.findall(r"[A-Z]:/[^']+", a)) for a in x]
>>> cleaned_x
[('D:/test/1_birds_bp.png', 'D:/test/1_eagle_mp.png'), ('D:/test/2_reptiles_bp.png', 'D:/test/2_crocodile_mp.png')]
Hope it helps.
Perhaps you could use capturing groups? For instance:
import re
re_winpath = re.compile(r'^\(WindowsPath\(\'(.*)\'\)\,WindowsPath\(\'(.*)\'\)\)$')
def extract_pair(s):
    m = re_winpath.match(s)
    if m is None:
        raise ValueError(f"cannot extract pair from string: {s}")
    return m.groups()
pairs = list(map(extract_pair, x))
Here's my take. It's not pretty, and I did it in two steps so as not to make regexp spaghetti. You could turn it into a list comprehension if you like, but it should work:
for a in x:
    a = re.sub(r'(\()?WindowsPath', '', a)
    a = re.sub(r'\)$', '', a)
    print(a)

How to read characters from a string? [duplicate]

This question already has answers here:
Python partition and split [closed]
(2 answers)
Closed last month.
I have Python code that takes an input parameter, which another application is going to pass in. Here's the requirement. The parameter will be passed like this:
abc_201901.txt
abc.201901.txt
abcde_201901.txt
From the above parameter, I need to extract "abc_", "abc." or "abcde_", and then go to the directory and search for the files.
I am having a hard time finding a function that can do this.
If you use only "_" and ".", you can use re.search:
import re
a = "test.123.txt"
regex = r"([0-9].*)"
m = re.search(regex, a) # match from any digit to end
m.group(1) # ->'123.txt'
Good luck
--- Correction ---
It's better to take all characters except number with the regex:
import re
a = "test.123.txt"
regex = r"([^0-9]*)"
m = re.search(regex, a) # match the leading run of non-digit characters
m.group(1) # -> 'test.'
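Applied to the three file names from the question, the same idea can be sketched as follows. The helper name leading_prefix is hypothetical, and the sketch assumes the prefix never contains a digit.

```python
import re

def leading_prefix(filename):
    """Return everything before the first digit (the whole name if there is none)."""
    # [^0-9]* can match an empty string, so re.match always succeeds here.
    return re.match(r"[^0-9]*", filename).group(0)

for name in ["abc_201901.txt", "abc.201901.txt", "abcde_201901.txt"]:
    print(leading_prefix(name))
```

This prints abc_, abc. and abcde_ in turn. The returned prefix could then be combined with a wildcard (for example with glob.glob) to search the directory.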
A possible solution with slicing:
def extract_prefix(filename):
    # list of separators, checked in this order
    separators = ['.', '_', '-']
    for sep in separators:
        # ignore the file extension (assumed to be the last 4 characters, e.g. ".txt")
        if sep in filename[:-4]:
            # slice up to and including the separator (+1)
            return filename[:filename.find(sep) + 1]
    return None
a = "test_123.txt"
prefix = extract_prefix(a)
print(prefix)

how to find more than one match with a regular expression? [duplicate]

This question already has answers here:
regexes: How to access multiple matches of a group? [duplicate]
(2 answers)
Closed 3 years ago.
I have a string like this:
to_search = "example <a>first</a> asdqwe <a>second</a>"
and I want to find both matches, like this:
list = ["first","second"]
I know that when searching for one match I should use this code:
import re
if to_search.find("<a>") > -1:
    result = re.search('<a>(.*?)</a>', to_search)
    s = result.group(1)
    print(s)
but that only prints:
first
I tried result.group(2) and result.group(0) but I get the same result.
How can I make a list of all matches?
Just use:
import re
to_search = "example <a>first</a> asdqwe <a>second</a>"
matches = re.findall(r'<a>(.*?)</a>', to_search)
print(matches)
OUTPUT
['first', 'second']
It is better to use an HTML parser than a regex, but if you stay with regex, change re.search to re.findall, or iterate with re.finditer:
import re
to_search = "example <a>first</a> asdqwe <a>second</a>"
for match in re.finditer("<a>(.*?)</a>", to_search):
    captured_group = match.group(1)
    # do something with captured group
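For the HTML-parser route, here is a minimal sketch using the standard library's html.parser module. The class name AnchorText and its attributes are hypothetical, and the sketch assumes the <a> tags are not nested.

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.in_a = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a = True
            self.texts.append("")  # start a new entry for this element

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.in_a:
            self.texts[-1] += data

parser = AnchorText()
parser.feed("example <a>first</a> asdqwe <a>second</a>")
parser.close()
print(parser.texts)
```

Unlike the regex, this keeps working when the tags carry attributes such as href.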

Python, how do I parse key=value list ignoring what is inside parentheses?

Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" their contents, gets the split positions and applies them to the original string. This does not seem very elegant; there must be some prepackaged Pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
    '''
    unpackages parameter string to struct
    for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
    a: '21'
    b: '35'
    c.fn: 'pluto'
    c.h='zzz'
    d: '2d3f'
    fn_: 'pippo'
    '''
    ob=strfind(parameters,brc[0])
    dp=strfind(parameters,defs)
    out={}
    if len(ob)>0:
        if ob[0]<dp[0]:
            #opening function
            out['fn_']=parameters[:ob[0]]
            parameters=parameters[(ob[0]+1):-1]
    if len(dp)>0:
        temp=smart_tokenize(parameters,sep,brc)
        for v in temp:
            defp=strfind(v,defs)
            pname=v[:defp[0]]
            pval=v[1+defp[0]:]
            if len(strfind(pval,brc[0]))>0:
                out[pname]=pparams(pval,sep,defs,brc)
            else:
                out[pname]=pval
    else:
        out['fn_']=parameters
    return out
def smart_tokenize( instr, sep=';', brc='()' ):
    '''
    tokenize string ignoring separators contained within brc
    '''
    tstr=instr
    ob=strfind(instr,brc[0])
    while len(ob)>0:
        cb=findclsbrc(tstr,ob[0],brc)
        # mask the bracketed span so its separators are not seen
        tstr=tstr[:ob[0]]+'?'*(cb-ob[0]+1)+tstr[cb+1:]
        # look for the next opening bracket (brc[0], not brc[1])
        ob=strfind(tstr,brc[0])
    sepp=[-1]+strfind(tstr,sep)+[len(instr)+1]
    out=[]
    for i in range(1,len(sepp)):
        out.append(instr[(sepp[i-1]+1):(sepp[i])])
    return out
def findclsbrc(instr, brc_pos, brc='()'):
    '''
    given a string containing an opening bracket, finds the
    corresponding closing bracket
    '''
    tstr=instr[brc_pos:]
    o=strfind(tstr,brc[0])
    c=strfind(tstr,brc[1])
    p=o+c
    p.sort()
    s1=[1 if v in o else 0 for v in p]
    s2=[-1 if v in c else 0 for v in p]
    s=[s1v+s2v for s1v,s2v in zip(s1,s2)]
    s=[sum(s[:i+1]) for i in range(len(s))] #cumsum
    return p[s.index(0)]+brc_pos
def strfind(instr, substr):
    '''
    returns starting position of each occurrence of substr within instr
    '''
    i=0
    out=[]
    while i<=len(instr):
        try:
            p=instr[i:].index(substr)
            out.append(i+p)
            i+=p+1
        except ValueError:
            # no further occurrence: stop the loop
            i=len(instr)+1
    return out
If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.
Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
Group, Suppress, Dict)
collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))
coll = collection.parseString(
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")
print(coll['key1']) # value1
print(coll['key2']) # value2
print(coll['key3']['key3.1']) # value3.1
You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile(r'(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
This regex says:
(\w+) # Find and capture a group with 1 or more word characters (letters, digits, underscores)
= # Followed by the literal character '='
(\w+ # Followed by a group with 1 or more word characters
|\([^)]+\) # or a group that starts with an open paren (parens escaped as '\(' and '\)'), followed by anything up to a closing paren, which terminates the alternative
);? # Optionally, this group might be followed by a semicolon.
Note that [^)]+ stops at the first closing paren, so nested parentheses are not handled.
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!
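For nested parentheses without an external library, one option is to track the bracket depth while scanning and split only at depth zero. This is a sketch, not a full parser. The names split_top_level and parse_params are hypothetical, and the sketch assumes that keys contain no '=' and that brackets are balanced.

```python
def split_top_level(s, sep=";", open_ch="(", close_ch=")"):
    """Split s on sep, ignoring separators nested inside brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing bracket at index %d" % i)
        elif ch == sep and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced opening bracket")
    parts.append(s[start:])
    return parts

def parse_params(s):
    """Turn 'k1=v1;k2=(k3=v3)' into a (possibly nested) dictionary."""
    out = {}
    for item in split_top_level(s):
        if not item:
            continue  # skip empty pieces, e.g. from a trailing separator
        key, _, value = item.partition("=")
        if value.startswith("(") and value.endswith(")"):
            # Recurse into the bracketed value to build the sub-dictionary.
            out[key] = parse_params(value[1:-1])
        else:
            out[key] = value
    return out

print(parse_params("key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"))
```

If the value for key3 should stay a plain string, as in the first part of the question, use split_top_level on its own and skip the recursion.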
