Extract data from nested parentheses? - python

I have a string:
test_string = 'RGBA(30(25VARGHK_65FVDFKDGV_10FVDSSLBA)_10UJN(85VOEZSR_5VAVUSR_10SQMCFE)_20BBLRG(SSLCN)_10UDSCT(80EDYFIH_10VAP_10SNE)_30EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'
I need to extract the data from the string, and the final result should look like this:
RGBA,
30TCH:25VARGHK, 65FVDFKDGV, 10FVDSSLBA,
10UJN:85VOEZSR, 5VAVUSR, 10SQMCFE
....
and so on..
I thought about using regex, but it does not seem like a good solution here.

Regex will work fine. After you remove the outer RGBA(...), what remains is a series of groups, each a "prefix" followed by a (group_of_data).
If you don't want trailing commas, try this:
import re

# One group: a prefix (anything but "("), then "(", the data, then ")".
regex = r"[^(]+\([^)]+\)"
s = 'RGBA(30(25VARGHK_65FVDFKDGV_10FVDSSLBA)_10UJN(85VOEZSR_5VAVUSR_10SQMCFE)_20BBLRG(SSLCN)_10UDSCT(80EDYFIH_10VAP_10SNE)_30EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'
first_start = s.index('(')
print(s[:first_start])  # the outer name
# Search only inside the outer pair of parentheses.
for match in re.finditer(regex, s[first_start + 1:-1]):
    g = match.group().lstrip('_')  # drop the "_" separator between groups
    data_start = g.index('(')
    prefix = g[:data_start]
    data = ', '.join(g[data_start + 1:-1].split('_'))
    print(f'{prefix}:{data}')
Output
RGBA
30:25VARGHK, 65FVDFKDGV, 10FVDSSLBA
10UJN:85VOEZSR, 5VAVUSR, 10SQMCFE
20BBLRG:SSLCN
10UDSCT:80EDYFIH, 10VAP, 10SNE
30EDU:50EDFva, 50VAP
10EDP:50EDFva, 50SNE

This seems to get you (almost) there:
import re
[part.replace("(", ": ").replace("_", ", ") for part in re.split(r"\)_", test_string)]
Output
['RGBA: 30: 25VARGHK, 65FVDFKDGV, 10FVDSSLBA',
 '10UJN: 85VOEZSR, 5VAVUSR, 10SQMCFE',
 '20BBLRG: SSLCN',
 '10UDSCT: 80EDYFIH, 10VAP, 10SNE',
 '30EDU: 50EDFva, 50VAP',
 '10EDP: 50EDFva, 50SNE))']

I think we may need a little more clarification on the logic. It looks like ( should translate into a :, but not every time. Here is my crack at it using regexes. This might not be exactly what you are looking for, but should be pretty close:
import re

def main():
    test_string = 'RGBA(30(25VARGHK_65FVDFKDGV_10FVDSSLBA)_10UJN(85VOEZSR_5VAVUSR_10SQMCFE)_20BBLRG(SSLCN)_10UDSCT(80EDYFIH_10VAP_10SNE)_30EDURKA(50EDFL_50VAP)_10EDPJ(50EDFV_50SNOL))'
    # Order matters: each substitution works on the output of the previous one.
    test_string = re.sub(r"\)_", ",\n", test_string)  # end of a group -> comma + newline
    test_string = re.sub(r"_", ",", test_string)      # remaining separators -> commas
    test_string = re.sub(r"\(", ":", test_string)     # opening parenthesis -> colon
    test_string = re.sub(r"\)\)", "", test_string)    # drop the closing "))"
    print(test_string)

if __name__ == "__main__":
    main()
results:
RGBA:30:25VARGHK,65FVDFKDGV,10FVDSSLBA,
10UJN:85VOEZSR,5VAVUSR,10SQMCFE,
20BBLRG:SSLCN,
10UDSCT:80EDYFIH,10VAP,10SNE,
30EDURKA:50EDFL,50VAP,
10EDPJ:50EDFV,50SNOL
It is pretty much just a series of regex substitutions. Because the re.sub calls run in a fixed order, each one cleans up the string left by the previous one. You could also special-case the beginning of the string to change the first : to a ,\n but I'm not sure whether the data you are getting always has the same shape.
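If you would rather not lean on regex at all, a small depth counter can walk the string instead. This is only a sketch: the function name parse_groups is made up for illustration, and it assumes the input always has one outer NAME(...) wrapper with groups nested exactly one level deep, as in the question's string.

```python
def parse_groups(text):
    """Return (name, [(prefix, [items]), ...]) for 'NAME(p1(a_b)_p2(c_d))'."""
    start = text.index('(')
    name = text[:start]
    body = text[start + 1:-1]  # assumes the string ends with the outer ")"

    groups = []
    depth = 0
    token_start = 0   # where the current prefix begins
    items_start = 0   # where the current group's data begins
    prefix = ''
    for i, ch in enumerate(body):
        if ch == '(':
            depth += 1
            if depth == 1:
                prefix = body[token_start:i].lstrip('_')
                items_start = i + 1
        elif ch == ')':
            depth -= 1
            if depth == 0:
                groups.append((prefix, body[items_start:i].split('_')))
                token_start = i + 1
    return name, groups


test_string = 'RGBA(30(25VARGHK_65FVDFKDGV_10FVDSSLBA)_10UJN(85VOEZSR_5VAVUSR_10SQMCFE)_20BBLRG(SSLCN)_10UDSCT(80EDYFIH_10VAP_10SNE)_30EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'
name, groups = parse_groups(test_string)
print(name)
for prefix, items in groups:
    print(f"{prefix}:{', '.join(items)}")
```

Returning a list of (prefix, items) pairs rather than printing directly leaves the formatting (trailing commas or not) up to the caller.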

Related

How to start at a specific letter and end when it hits a digit?

I have some sample strings:
s = 'neg(able-23, never-21) s2-1/3'
i = 'amod(Market-8, magical-5) s1'
I can already figure out whether the string ends with something like 's1' or 's3' using:
word = re.search(r's\d$', s)
But if I want to know whether the string contains 's2-1/3', it won't work.
Is there a regular expression that works for both cases, 's#' and 's#+'?
Thanks!
You can allow the characters "-" and "/" to be captured as well, in addition to just digits. It's hard to tell the exact pattern you're going for here, but something like this would capture "s2-1/3" from your example:
import re
s = "neg(able-23, never-21) s2-1/3"
word = re.search(r"s\d[-/\d]*$", s)
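As a quick check, here is that same pattern run against both sample strings from the question. The names TAG, samples and tags are just for illustration.

```python
import re

# "s", one digit, then any run of digits, "-" or "/" up to the end of the string.
TAG = re.compile(r"s\d[-/\d]*$")

samples = [
    "neg(able-23, never-21) s2-1/3",
    "amod(Market-8, magical-5) s1",
]

tags = []
for text in samples:
    match = TAG.search(text)
    tags.append(match.group() if match else None)

print(tags)  # ['s2-1/3', 's1']
```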
I'm guessing that maybe you would want to extract that with some expression, such as:
(s\d+)-?(.*)$
or:
(s\d+)-?([0-9]+)?\/?([0-9]+)?$
Test
import re
expression = r"(s\d+)-?(.*)$"
string = """
neg(able-23, never-21) s211-12/31
neg(able-23, never-21) s2-1/3
amod(Market-8, magical-5) s1
"""
print(re.findall(expression, string, re.M))
Output
[('s211', '12/31'), ('s2', '1/3'), ('s1', '')]

Convert a string containing integers into an integer

I am trying to convert the digits contained in a string like "15m" into an integer.
With the code below I can achieve what I want. But I am wondering if there is a better solution for this, or a function I'm not aware of which already implements this.
s = "15m"
s_result = ""
for char in s:
    try:
        int(char)  # raises ValueError for non-digit characters
        s_result = s_result + char
    except ValueError:
        pass
result = int(s_result)
print(result)
This code outputs the result below:
>>>
15
Maybe there is no such "better" solution but I would like to see other solutions, like using regex maybe.
I found a good solution using regex.
import re
result = int(re.sub(r'[^0-9]', '', s))  # strip everything that is not a digit
print(result)
Which results in:
>>>
15
You could also match one or more digits from the start of the line with ^\d+
import re
regex = r"^\d+"
test_str = "15m"
match = re.search(regex, test_str)
if match:
    print(int(match.group()))
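The two regex answers agree on "15m" but not on every input, so it may help to see them side by side. The sketch below is only illustrative: the function names and the second input "1h30m" are made up here, not taken from the question.

```python
import re

def strip_to_digits(text):
    """Delete every non-digit, then convert what is left."""
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None

def leading_int(text):
    """Convert only the run of digits at the start of the string."""
    match = re.match(r"\d+", text)
    return int(match.group()) if match else None

def all_ints(text):
    """Convert every separate run of digits."""
    return [int(run) for run in re.findall(r"\d+", text)]

print(strip_to_digits("15m"), leading_int("15m"), all_ints("15m"))        # 15 15 [15]
print(strip_to_digits("1h30m"), leading_int("1h30m"), all_ints("1h30m"))  # 130 1 [1, 30]
```

Which one is "right" depends on what a string like "1h30m" should mean in your data.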

Python - remove parts of a string

I have many fill-in-the-blank sentences in strings,
e.g. "6d) We took no [pains] to hide it ."
How can I efficiently parse this string (in Python) to be
"We took no to hide it"?
I also would like to be able to store the word in brackets (e.g. "pains") in a list for use later. I think the regex module could be better than Python string operations like split().
This will give you all the words inside the brackets.
import re
s = "6d) We took no [pains] to hide it ."
matches = re.findall(r'\[(.*?)\]', s)
Then you can run this to remove all bracketed words.
re.sub(r'\[(.*?)\]', '', s)
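Just for fun (to do the gathering and the substitution in one pass):
import re
s = "6d) We took no [pains] to hide it ."
matches = []
def subber(m):
    matches.append(m.group(1))  # remember the bracketed word
    return ""                   # and replace the whole [word] with nothing
new_text = re.sub(r"\[(.*?)\]", subber, s)
print(new_text)
print(matches)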
import re
s = 'this is [test] string'
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
Output
test
For your example you could use this regex:
(.*\))(.+)\[(.+)\](.+)
You will get four groups that you can use to build your resulting string, and you can save the third group for later use:
6d)
We took no
pains
to hide it .
I used .+ here because I don't know if your strings always look like your example. You can change the .+ to an alphanumeric class or something more specific to your case.
import re
s = '6d) We took no [pains] to hide it .'
m = re.search(r"(.*\))(.+)\[(.+)\](.+)", s)
print(m.group(2) + m.group(4))  # " We took no  to hide it ." (the spaces around the removed word remain)
print(m.group(3)) # pains
import re
m = re.search(r".*\) (.*)\[.*\] (.*)", "6d) We took no [pains] to hide it .")
if m:
    g = m.groups()
    print(g[0] + g[1])
Output :
We took no to hide it .
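Putting the pieces from the answers above together, one possible helper returns both the cleaned sentence and the removed words. This is a sketch only: the name split_blanks is hypothetical, and it assumes every sentence starts with a short label such as "6d)" and that extra whitespace should be collapsed.

```python
import re

def split_blanks(sentence):
    """Return (sentence without label and bracketed words, list of bracketed words)."""
    words = re.findall(r"\[(.*?)\]", sentence)      # collect the bracketed words
    text = re.sub(r"\[.*?\]", "", sentence)         # remove them from the sentence
    text = re.sub(r"^\s*\w+\)\s*", "", text)        # drop a leading label such as "6d)"
    text = re.sub(r"\s+", " ", text).strip()        # collapse the leftover double space
    return text, words

text, words = split_blanks("6d) We took no [pains] to hide it .")
print(text)   # We took no to hide it .
print(words)  # ['pains']
```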

Changing only one character when there are many similar ones in a string

Suppose I have the following string:
I.like.football
sky.is.blue
I need to make a loop that changes the last '.' to '_' so it looks like this:
I.like_football
sky.is_blue
They all have a similar style (3 words, 2 dots).
How do I do that in a loop?
s = 'I.like.football'
parts = s.rsplit('.', 1)  # split from the right, at the last '.' only
print('_'.join(parts))    # then join the two parts with '_'
# output: I.like_football
Or in a single line:
s = '_'.join(s.rsplit('.', 1))
str.replace lets you specify the number of replacements. Unfortunately there is no str.rreplace, so you'd need to reverse the string before and after :) eg.
>>> def f(s):
...     return s[::-1].replace(".", "_", 1)[::-1]
...
>>> f('I.like.football')
'I.like_football'
>>> f('sky.is.blue')
'sky.is_blue'
Alternatively you could use one of str.rpartition, str.rsplit, str.rfind
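For instance, a version built on str.rpartition could look like the sketch below. The helper name replace_last and the "nodots" input are made up for illustration.

```python
def replace_last(text, old, new):
    """Replace only the last occurrence of `old` in `text`."""
    head, sep, tail = text.rpartition(old)
    if not sep:  # `old` does not occur, so leave the text unchanged
        return text
    return head + new + tail

for line in ["I.like.football", "sky.is.blue", "nodots"]:
    print(replace_last(line, ".", "_"))
```

This avoids reversing the string twice and handles a string with no dot at all.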
This doesn't even need to run in a loop:
import re
# Match a '.' that is followed only by non-dot characters up to the end of its line.
p = re.compile(r'\.(?=[^.]+$)', re.MULTILINE)
test_str = "I.like.football\nsky.is.blue"
subst = "_"
result = p.sub(subst, test_str)

Regex: Replace one pattern with another

I am trying to replace one regex pattern with another regex pattern.
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
pattern = re.compile('\d+x\d+') # for st_srt
re.sub(pattern, 'S\1E\2',st_srt)
I know the use of S\1E\2 is wrong here. I am using \1 and \2 to capture the values 01 and 02 and reuse them in S\1E\2.
My desired output is:
st_srt = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.srt'
So, what is the correct way to achieve this?
You need to capture what you're trying to preserve. Try this:
pattern = re.compile(r'(\d+)x(\d+)') # for st_srt
st_srt = re.sub(pattern, r'S\1E\2', st_srt)
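To see why both the capture groups and the r prefix matter, here is a short sketch. Without the raw-string prefix, Python itself reads '\1' as an octal escape for a control character, so re.sub never receives a backreference. The \g<1> spelling used below is an alternative way to write the same backreference; the variable name renamed is just for illustration.

```python
import re

st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'
pattern = re.compile(r'(\d+)x(\d+)')

# In a normal (non-raw) string literal, '\1' and '\2' are the characters chr(1) and chr(2).
print('S\1E\2' == 'S' + chr(1) + 'E' + chr(2))  # True

# In a raw string the backslashes survive, and \g<n> refers to capture group n.
renamed = pattern.sub(r'S\g<1>E\g<2>', st_srt)
print(renamed)  # Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.srt
```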
Well, it looks like you already accepted an answer, but I think this is what you said you're trying to do, which is get the replace string from 'st_mkv', then use it in 'st_srt':
import re
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
replace_pattern = re.compile(r'Awake\.([^.]+)\.')
m = replace_pattern.match(st_mkv)
replace_string = m.group(1)
new_srt = re.sub(r'^Awake\.[^.]+\.', 'Awake.{0}.'.format(replace_string), st_srt)
print(new_srt)
Try using this regex:
([\w+\.]+){5}\-\w+
Copy the strings into http://www.gskinner.com/RegExr/
and paste the regex at the top.
It captures the names of each string, leaving out the extension.
You can then go ahead and append the extension you want, to the string you want.
EDIT:
Here's what I used to do what you're after:
import re
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'  # don't actually need this one
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
replace_pattern = re.compile(r'([\w+\.]+){5}\-\w+')
m = replace_pattern.match(st_mkv)
new_string = m.group(0)
new_string += '.srt'
>>> new_string
'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.srt'
import re
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
pattern = re.compile(r'(\d+)x(\d+)')
st_srt_new = re.sub(pattern, r'S\1E\2', st_srt)
print(st_srt_new)
