Python: parse substring in brackets from string

What's a pythonic way to parse this string in the brackets:
txt = 'foo[bar]'
to get as result:
bar
What I have tried:
This is how I would solve it, though I don't find it very elegant:
result = txt.split('[')[1].split(']')[0]
I strongly think there is a library or method out there that has a more fault-tolerant and elegant solution for this. That's why I created this question.

Using Regex.
Ex:
import re
txt = 'foo[bar]'
print(re.findall(r"\[(.*?)\]", txt))
Output:
['bar']

One out of many ways, using slicing:
print(txt[4:].strip("[]"))
OR
import re
txt = 'foo[bar]'
m = re.search(r"\[([A-Za-z0-9_]+)\]", txt)
print(m.group(1))
OUTPUT:
bar

Another take on slicing, using str.index to find the locations of the starting and ending delimiters. Once you have a Python slice, you can use it just like an index into the string, except that the slice doesn't give a single character but the range of characters from the start to the end.
def make_slice(s, first, second):
    floc = s.index(first)
    sloc = s.index(second, floc)
    # return a Python slice that will extract the contents between the first
    # and second delimiters
    return slice(floc+1, sloc)
txt = 'foo[bar]'
slc = make_slice(txt, '[', ']')
print(txt[slc])
Prints:
bar
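Since the question asks for something more fault-tolerant, here is a minimal sketch that wraps the regex approach in a helper which returns a default value instead of raising when no bracketed part is found. The name extract_bracketed and its default parameter are made up for illustration; they are not part of any library.

```python
import re

def extract_bracketed(text, default=None):
    """Return the contents of the first [...] pair in text, or default if none."""
    # [^\]]* matches any run of characters that are not a closing bracket
    match = re.search(r"\[([^\]]*)\]", text)
    return match.group(1) if match else default

print(extract_bracketed('foo[bar]'))     # bar
print(extract_bracketed('no brackets'))  # None
```

Returning a default keeps the caller free to decide whether a missing bracket is an error, which the chained split() version cannot do without raising IndexError.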

Related

Change line in regex python

I have a few lines of string e.g:
AR0003242303
TR0402304004
CR0402340404
I want to create a dictionary from these lines.
And I need to change them with a regex to:
KOLAORM0003242303
KOLTORM0402304004
KOLCORM0402340404
So I need to split off the first 2 characters, put KOL before them, put O between them, and put M after the second character. How can I achieve this? After many attempts I have lost patience with regex, and unfortunately I have no time to learn it better right now. I need a result now :(
Could someone help me with this case?
Using re.sub --> re.sub(r"^([A-Z])([A-Z])", r"KOL\1O\2M", string)
Ex:
import re
s = ["AR0003242303", "TR0402304004", "CR0402340404"]
for i in s:
    print(re.sub(r"^([A-Z])([A-Z])", r"KOL\1O\2M", i))
Output:
KOLAORM0003242303
KOLTORM0402304004
KOLCORM0402340404
You don't need regex for this. You can get the list of characters from the string, rebuild the list with the extra pieces, and join it back into a string:
def get_convert_s(s):
    li = list(s)
    # insert the letter 'O' (not the digit zero) between the first two characters
    li = ['KOL', li[0], 'O', li[1], 'M', *li[2:]]
    return ''.join(li)
print(get_convert_s('AR0003242303'))
#KOLAORM0003242303
print(get_convert_s('TR0402304004'))
#KOLTORM0402304004
print(get_convert_s('CR0402340404'))
#KOLCORM0402340404
import re
regex = re.compile(r"([A-Z])([A-Z])([0-9]+)")
inputs = [
    'AR0003242303',
    'TR0402304004',
    'CR0402340404'
]
results = []
for value in inputs:
    matches = regex.match(value)
    groups = matches.groups()
    results.append('KOL{}O{}M{}'.format(*groups))
print(results)
Assuming the length of the strings in your list will always be the same, Devesh's answer is pretty much the best approach (no reason to overcomplicate it).
My solution is similar to Devesh's; I just like writing functions as one-liners:
items = ["AR0003242303", "TR0402304004", "CR0402340404"]
def convert_s(s):
    return "KOL"+s[0]+"O"+s[1]+"M"+s[2:]
for item in items:
    print(convert_s(item))
It returns the same output, though.

How to move all special characters to the end of the string in Python?

I'm trying to move all non-alphanumeric characters to the end of the strings. I am having a hard time with the regex since I don't know where the special characters will be. Here are a couple of simple examples.
hello*there*this*is*a*str*ing*with*asterisks
and&this&is&a&str&ing&&with&ampersands&in&i&t
one%mo%refor%good%mea%sure%I%think%you%get%it
How would I go about sliding all the special characters to the end of the string?
Here is what I tried, but I didn't get anything.
r = re.compile(r'(.+?)(\**)')
r.sub(r'\1\2', string)
Edit:
Expected output for the first string would be:
hellotherethisisastringwithasterisks********
There's no need for regex here. Just use str.isalpha and build up two lists, then join them:
strings = ['hello*there*this*is*a*str*ing*with*asterisks',
'and&this&is&a&str&ing&&with&ampersands&in&i&t',
'one%mo%refor%good%mea%sure%I%think%you%get%it']
for s in strings:
    a = []
    b = []
    for c in s:
        if c.isalpha():
            a.append(c)
        else:
            b.append(c)
    print(''.join(a+b))
Result:
hellotherethisisastringwithasterisks********
andthisisastringwithampersandsinit&&&&&&&&&&&
onemoreforgoodmeasureIthinkyougetit%%%%%%%%%%
Alternative print() call for Python 3.5 and higher:
print(*a, *b, sep='')
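Note that the question asks about non-alphanumeric characters, while str.isalpha also treats digits as "special". A small variant using str.isalnum keeps digits in place. This is only a sketch; the function name move_special_to_end is made up for illustration.

```python
def move_special_to_end(text):
    """Move every non-alphanumeric character to the end, preserving order."""
    kept = [c for c in text if c.isalnum()]
    moved = [c for c in text if not c.isalnum()]
    return ''.join(kept + moved)

print(move_special_to_end('hello*there*this*is*a*str*ing*with*asterisks'))
# hellotherethisisastringwithasterisks********
print(move_special_to_end('a1*b2&'))
# a1b2*&
```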
Here is my proposed solution for this with regex:
import re
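def move_nonalpha(string, char):
    # escape the character so regex metacharacters such as * are matched literally
    pattern = re.escape(char)
    char_list = re.findall(pattern, string)
    # if char does not occur, split returns [string] and the input comes back unchanged
    items = re.split(pattern, string)
    return ''.join(items) + ''.join(char_list)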
Usage:
string = "hello*there*this*is*a*str*ing*with*asterisks"
print (move_nonalpha(string,"*"))
Gives me output:
hellotherethisisastringwithasterisks********
I tried with your other input patterns as well and it's working. Hope it'll help.

How to split a string and keeping the pattern

This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}\n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
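To make the difference concrete, here is a minimal sketch comparing the two when the marker is missing. The helper name keep_through and the sample string 'no marker here' are made up for illustration.

```python
def keep_through(text, marker):
    """Return text up to and including the first marker.

    Raises ValueError (from str.index) when marker is absent.
    """
    return text[:text.index(marker) + len(marker)]

print(keep_through('012za}/n}ddfsdfk', '}/n}'))  # 012za}/n}

# str.find returns -1 on a miss, so the same slice silently yields garbage:
missing = 'no marker here'
print(repr(missing[:missing.find('}/n}') + len('}/n}')]))  # 'no '
```

With str.find the slice end becomes -1 + 4 = 3, so the first three characters are returned without any sign that the marker was never found.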
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with re.split() -- the key is putting parens around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient or the most Pythonic way to achieve what you want, i.e. I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:
for item_with_delimiter in output:
...
It might be useful to look up os.linesep if you're not sure what the line ending will be. os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. It depends where input data is from, and how flexible your code needs to be across environments.
Adapted from "Slice a string after a certain phrase?", you can combine find and slice to get the first part of the string and retain }/n}.
s = "012za}/n}ddfsdfk"
s[:s.find("}/n}")+4]
This will result in 012za}/n}

Extract substrings from logical expressions

Let's say I have a string that looks like this:
myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
What I would like to obtain in the end would be:
myStr_l1 = '(Txt_l1) or (Txt2_l1)'
and
myStr_l2 = '(Txt_l2) or (Txt2_l2)'
Some properties:
all "Txt_"-elements of the string start with an uppercase letter
the string can contain much more elements (so there could also be Txt3, Txt4,...)
the suffixes '_l1' and '_l2' look different in reality; they cannot be used for matching (I chose them for demonstration purposes)
I found a way to get the first part done by using:
myStr_l1 = re.sub('\(\w+\)','',myStr)
which gives me
'(Txt_l1 ) or (Txt2_l1 )'
However, I don't know how to obtain myStr_l2. My idea was to remove everything between two open parentheses. But when I do something like this:
re.sub('\(w+\(', '', myStr)
the entire string is returned.
re.sub('\(.*\(', '', myStr)
removes - of course - far too much and gives me
'Txt2_l2))'
Does anyone have an idea how to get myStr_l2?
When there is an "and" instead of an "or", the strings look slightly different:
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'
Then I can still use the command from above:
re.sub('\(\w+\)','',myStr2)
which gives:
'(Txt_l1 and Txt2_l1 )'
but I again fail to get myStr2_l2. How would I do this for these kind of strings?
And how would one then do this for mixed expressions with "and" and "or" e.g. like this:
myStr3 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2)) or (Txt3_l1 (Txt3_l2) and Txt4_l1 (Txt2_l2))'
re.sub('\(\w+\)','',myStr3)
gives me
'(Txt_l1 and Txt2_l1 ) or (Txt3_l1 and Txt4_l1 )'
but again: How would I obtain myStr3_l2?
Regexp is not powerful enough for nested expressions (in your case: nested elements in parentheses). You will have to write a parser. Look at https://pyparsing.wikispaces.com/
I'm not entirely sure what you want, but I wrote this to strip the parentheses and collect the elements.
import re
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
sets = mystr.split(' or ')
noParens = []
for line in sets:
    mat = re.match(r'\((.* )\((.*\)\))', line, re.M)
    if mat:
        noParens.append(mat.group(1))
        noParens.append(mat.group(2).replace(')',''))
print(noParens)
This takes all the parentheses away and puts your elements in a list. Here's an alternate way of doing it without using regular expressions.
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
noParens = []
mystr = mystr.replace(' or ', ' ')
mystr = mystr.replace(')','')
mystr = mystr.replace('(','')
noParens = mystr.split()
print(noParens)
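For the exact shape shown in the question (each element written as "outer (inner)", with exactly one level of nesting and names made of word characters), two substitutions appear to be enough to produce both strings. This is only a sketch under those assumptions; the function names level_1 and level_2 are made up, and anything with deeper nesting would still need a real parser, as noted above.

```python
import re

def level_1(expr):
    # drop each innermost "(name)" group together with the whitespace before it
    return re.sub(r"\s*\(\w+\)", "", expr)

def level_2(expr):
    # replace every "outer (inner)" pair with just "inner"
    return re.sub(r"\w+ \((\w+)\)", r"\1", expr)

myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
print(level_1(myStr))  # (Txt_l1) or (Txt2_l1)
print(level_2(myStr))  # (Txt_l2) or (Txt2_l2)

myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'
print(level_1(myStr2))  # (Txt_l1 and Txt2_l1)
print(level_2(myStr2))  # (Txt_l2 and Txt2_l2)
```

The "and"/"or" keywords are left alone because they are never directly followed by " (name)" in these inputs.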

Replacing variable length items in a list using regex in python

I am trying to replace variable length items in a list using regex. For example, the item "HD479659" should be replaced by "HD0000000479659"; I just need to insert seven 0s in between. I have written the following program, but every time I run it I get the following error: "TypeError: object of type '_sre.SRE_Pattern' has no len()". Can you please help me solve this error?
thank you very much
Here is the program
import xlrd
import re
import string
wb = xlrd.open_workbook("3_1.xls")
sh = wb.sheet_by_index(0)
outfile=open('out.txt','w')
s_pat=r"HD[1-9]{1}[0-9]{5}"
s_pat1=r"HD[0]{7}[0-9]{6}"
pat = re.compile(s_pat)
pat1 = re.compile(s_pat1)
for rownum1 in range(sh.nrows):
    str1= str(sh.row_values(rownum1))
    m1=[]
    m1 = pat.findall(str1)
    m1=list(set(m1))
    for a in m1:
        a=re.sub(pat,pat1,a)
    print >> outfile, m1
I think your solution is too complicated. This one should do the job and is much simpler:
import re
def repl(match):
    # keep the "HD" prefix, insert seven zeros, then the original digits
    return match.group(1) + ("0"*7) + match.group(2)
print(re.sub(r"(HD)([1-9]{1}[0-9]{5})", repl, "HD479659"))
See also: http://docs.python.org/library/re.html#re.sub
Update:
To transform a list of values, you have to iterate over all values. You don't have to search the matching values first:
import re
values_to_transform = [
    'HD479659',
    'HD477899',
    'HD423455',
    'does not match',
    'but does not matter'
]
def repl(match):
    return match.group(1) + ("0"*7) + match.group(2)
for value in values_to_transform:
    print(re.sub(r"(HD)([1-9]{1}[0-9]{5})", repl, value))
The result is:
HD0000000479659
HD0000000477899
HD0000000423455
does not match
but does not matter
What you need to do is extract the variable-length portion of the ID explicitly, then pad with 0s based on the desired length minus the matched length.
If I understand the pattern correctly you want to use the regex
r"HD(?P<zeroes>0*)(?P<num>\d+)"
At that point you can do
results = re.search(...bla...).groupdict()
Which returns the dict {'zeroes': '', 'num':'479659'} in this case. From there you can pad as necessary.
It's 5am at the moment or I'd have a better solution for you, but I hope this helps.
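The padding step described above could be sketched with str.zfill, which left-pads a string with zeros to a given width. The total width of 13 digits is an assumption inferred from the single example "HD0000000479659" (7 zeros plus 6 digits); the name pad_id and the simplified pattern HD(\d+) are made up for illustration.

```python
import re

TOTAL_DIGITS = 13  # assumption: 7 padding zeros + 6 digits, as in "HD0000000479659"

def pad_id(value, width=TOTAL_DIGITS):
    """Left-pad the numeric part of each 'HD<digits>' ID with zeros to width digits."""
    return re.sub(r"HD(\d+)", lambda m: "HD" + m.group(1).zfill(width), value)

print(pad_id("HD479659"))        # HD0000000479659
print(pad_id("does not match"))  # does not match
```

Because zfill pads to a fixed total width, shorter or longer numeric parts get the matching number of zeros rather than always seven, and IDs that are already wide enough are left unchanged.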
