Extracting part of string in parenthesis using python - python

I have a csv file with a column with strings. Part of the string is in parentheses. I wish to move the part of string in parentheses to a different column and retain the rest of the string as it is.
For instance: I wish to convert:
LC(Carbamidomethyl)RLK
to
LCRLK Carbamidomethyl

Regex solution
If you have only one parenthesized group in your string, you can use this regex:
>>> import re
>>> a = "LC(Carbamidomethyl)RLK"
>>> re.sub(r'(.*)\((.+)\)(.*)', r'\g<1>\g<3> \g<2>', a)
'LCRLK Carbamidomethyl'
>>> a = "LCRLK"
>>> re.sub(r'(.*)\((.+)\)(.*)', r'\g<1>\g<3> \g<2>', a)
'LCRLK' # works with no parentheses too
Regex decomposed:
(.*)   #! Capture the beginning of the string
\(     # Match the opening parenthesis
(.+)   #! Capture the content inside the parentheses
\)     # Match the closing parenthesis
(.*)   #! Capture everything after
---------------
\g<1>\g<3> \g<2>   # Write the captures back in the desired order
String manipulation solution
A faster solution for huge data sets is:
begin, end = a.find('('), a.find(')')
if begin != -1 and end != -1:
    a = a[:begin] + a[end+1:] + " " + a[begin+1:end]
The process is to find the positions of the parentheses (if there are any), cut the string at those positions, and concatenate the pieces in the new order.
Performance of each method
Timing both with timeit (which reports seconds for its default of 1,000,000 runs) shows that string manipulation is roughly ten times faster:
>>> timeit.timeit("re.sub('(.*)\((.+)\)(.*)', '\g<1>\g<3> \g<2>', a)", setup="a = 'LC(Carbadidomethyl)RLK'; import re")
15.214869976043701
>>> timeit.timeit("begin, end = a.find('('), a.find(')') ; b = a[:begin] + a[end+1:] + ' ' + a[begin+1:end]", setup="a = 'LC(Carbamidomethyl)RL'")
1.44008207321167
Multiple parentheses groups
If the string contains several parenthesized groups, repeat the same cut until no parentheses remain:
>>> a = "DRC(Carbamidomethyl)KPVNTFVHESLADVQAVC(Carbamidomethyl)SQKNVACK"
>>> while True:
...     begin, end = a.find('('), a.find(')')
...     if begin != -1 and end != -1:
...         a = a[:begin] + a[end+1:] + " " + a[begin+1:end]
...     else:
...         break
...
>>> a
'DRCKPVNTFVHESLADVQAVCSQKNVACK Carbamidomethyl Carbamidomethyl'
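Since the question asks for the parenthesized part to end up in a different CSV column, it may be more convenient to return the two parts separately instead of joining them with a space. A minimal sketch, assuming parentheses are never nested; the helper name split_modifications is hypothetical and not part of the answer above:

```python
import re

# Matches one parenthesized group and captures its content (no nesting assumed).
PAREN_RE = re.compile(r'\(([^()]*)\)')

def split_modifications(s):
    """Return (string without parenthesized parts, list of the removed parts)."""
    mods = PAREN_RE.findall(s)      # contents of every (...) group, in order
    stripped = PAREN_RE.sub('', s)  # the string with every (...) group removed
    return stripped, mods

seq, mods = split_modifications(
    "DRC(Carbamidomethyl)KPVNTFVHESLADVQAVC(Carbamidomethyl)SQKNVACK")
print(seq, " ".join(mods))
```

Each returned pair can then be written as two cells of a row, for example with csv.writer from the standard library.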

Related

Is there a more efficient way of replacing certain regex matches with elements?

I am required to use regex module.
I have coded this little program to replace certain regex matches such as orange with the length of orange in # signs, for example, if orange is in the string then it will be replaced with ######.
If a string has been changed it will add " !! This string has been changed !!" to the end of the string.
If a string has not been changed but has a # in it then it will not add " !! This string has been changed !!".
I am wondering: is there a more efficient way of coding this, using regex functions and better Python code?
orange = re.compile(r'\borange\b', re.IGNORECASE)
frog = re.compile(r'\bfrog\b', re.IGNORECASE)
cat = re.compile(r'\bcat\b', re.IGNORECASE)
num = 0
if re.search(orange, s):
    s = re.sub(orange, "!!!!!!", s)
    num += 1
if re.search(frog, s):
    s = re.sub(frog, "!!!!", s)
    num += 1
if re.search(cat, s):
    s = re.sub(cat, "!!!", s)
    num += 1
if num > 0:
    return s + " !! This string has been changed !!"
else:
    return s
Assuming a line can contain 'orange', 'frog' and 'cat' at the same time, one possible solution is to build a single regex pattern that matches any of the words, iterate over the matches, and replace each match with as many 'x' characters as the matched word has letters. The string is then printed, whether it was modified or not.
The code is:
import re
string = "orange frog cat test"
#string = "one two tree testing stackoverflow"
regex_pattern = re.compile(r"\b(orange|frog|cat)\b", re.IGNORECASE)
total_matches = regex_pattern.finditer(string)
# Did we find any of the options? Then changes will be made
changes_done = regex_pattern.search(string)
for match in total_matches:
    element_find = match.group(0)
    # The third argument (count=1) replaces only the first remaining match
    string = regex_pattern.sub("x" * len(element_find), string, 1)
if changes_done:
    print(string + " | changes were made")
else:
    print(string + " | no changes made")
What really shines in this particular solution is the third parameter of sub (count), which limits how many matches are replaced. As I said, this is one particular solution to your problem.
The output generated for the replacement will be: xxxxxx xxxx xxx test | changes were made
I guess you're using this code inside a function, since you're returning some values.
Anyway, without the num counter:
import re
pattern = r"\b(orange|frog|cat)\b"
s = "an orange eaten by a frog and a cat"
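rgx_matches = re.findall(pattern, s, flags=re.IGNORECASE)
for rgx_match in rgx_matches:
    # Update s itself so that the replacements accumulate
    s = re.sub(r"\b" + re.escape(rgx_match) + r"\b", "#" * len(rgx_match), s)
if rgx_matches:
    s += " !! This string has been changed !!"
print(s)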

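Both answers above can be condensed into a single pass. A minimal sketch, assuming the word list from the question: re.subn accepts a function as the replacement and returns the number of substitutions, so neither a counter nor a second search is needed. The function name censor is illustrative only:

```python
import re

# One pattern for all words; \b keeps it from matching inside longer words.
WORDS_RE = re.compile(r"\b(?:orange|frog|cat)\b", re.IGNORECASE)

def censor(s):
    """Replace each listed word with '#' of the same length and flag changes."""
    # subn returns (new_string, number_of_replacements)
    new_s, count = WORDS_RE.subn(lambda m: "#" * len(m.group()), s)
    if count:
        new_s += " !! This string has been changed !!"
    return new_s

print(censor("an orange eaten by a frog and a cat"))
print(censor("no # fruit here"))
```

Because the decision is based on the replacement count, a string that merely contains a # is left unflagged, as the question requires.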
What Regex will match on pairs of comma-separated numbers, with pairs separated by pipes?

I am currently trying to RegEx match (in Python) on inputs that look like:
37.1000,-88.1000
37.1000,-88.1000|37.1450,-88.1060
37.1000,-88.1000|37.1450,-88.1060|35.1450,-83.1060
So, pairs of decimal numbers, separated by commas, and then those pairs (if > 1 pair) separated by |. I've tried a few things but cannot seem to get a regex string that matches properly.
Attempt 1:
((((\d*\.?\d+,\d*\.?\d+)\|)+)|(\d*\.?\d+,\d*\.?\d+))
Attempt 2:
((((-?\d*\.?\d+,-?\d*\.?\d+)\|)+)|(-?\d*\.?\d+,-?\d*\.?\d+))
I was hoping someone might have done this before, or has enough RegEx experience to do something like this.
If you want to match the whole string, you could match a decimal number and then repeat that pattern preceded by a comma.
Then reuse that whole pattern and repeat it preceded by a |:
^[+-]?\d+\.\d+(?:,[+-]?\d+\.\d+)*(?:\|[+-]?\d+\.\d+(?:,[+-]?\d+\.\d+)*)*$
^ Start of string
[+-]?\d+\.\d+ Match an optional + or - and a decimal part
(?: Non capturing group
,[+-]?\d+\.\d+ Match the same pattern as before prepended by a comma
)* Close group and repeat 0+ times
(?: Non capturing group
\| Match |
[+-]?\d+\.\d+ Match an optional + or - and a decimal part
(?: Non capturing group
,[+-]?\d+\.\d+ Match the same pattern as before prepended by a comma
)* Close group and repeat 0+ times
)* Close group and repeat 0+ times
$ End of string
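In Python, the same idea can be assembled from smaller pieces so that the number pattern is written only once. A minimal sketch, assuming every pair must hold exactly two numbers (slightly stricter than the pattern above, which accepts any count of comma-separated numbers per group); the names NUMBER, PAIR and is_valid are illustrative:

```python
import re

NUMBER = r"[+-]?\d+\.\d+"       # optional sign, digits, a dot, digits
PAIR = rf"{NUMBER},{NUMBER}"    # exactly two numbers separated by a comma
PAIRS_RE = re.compile(rf"{PAIR}(?:\|{PAIR})*")  # one or more pairs joined by |

def is_valid(s):
    """True if the whole string is one or more pipe-separated pairs."""
    return PAIRS_RE.fullmatch(s) is not None

for s in ["37.1000,-88.1000",
          "37.1000,-88.1000|37.1450,-88.1060",
          "37.1000|37.1450,-88.1060"]:
    print(s, is_valid(s))
```

Using fullmatch replaces the explicit ^ and $ anchors.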
This is what parsers are for (checking the correct format, that is):
from parsimonious.grammar import Grammar
from parsimonious.exceptions import ParseError
data = """
37.1000,-88.1000
37.1000,-88.1000|37.1450,-88.1060
37.1000,-88.1000|37.1450,-88.1060|35.1450,-83.1060
"""
grammar = Grammar(
    r"""
    line = pair (pipe pair)*
    pair = point ws? comma ws? point
    point = ~"-?\d+(?:\.\d+)?"
    comma = ","
    pipe = "|"
    ws = ~"\s+"
    """
)
for line in data.split("\n"):
    try:
        grammar.parse(line)
        print("Correct format: {}".format(line))
    except ParseError:
        print("Not correct: {}".format(line))
This will yield
Not correct:
Correct format: 37.1000,-88.1000
Correct format: 37.1000,-88.1000|37.1450,-88.1060
Correct format: 37.1000,-88.1000|37.1450,-88.1060|35.1450,-83.1060
Not correct:
Both Not correct: statements come from the empty lines.
If you actually want to retrieve the values, you'd need to write a NodeVisitor class:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
from parsimonious.exceptions import ParseError
class Points(NodeVisitor):
    grammar = Grammar(
        r"""
        line = pair (pipe pair)*
        pair = point ws? comma ws? point
        point = ~"-?\d+(?:\.\d+)?"
        comma = ","
        pipe = "|"
        ws = ~"\s+"
        """
    )
    def generic_visit(self, node, visited_children):
        return visited_children or node
    def visit_pair(self, node, visited_children):
        # The first and last children are the two points
        x, *_, y = visited_children
        return (x.text, y.text)
    def visit_line(self, node, visited_children):
        pairs = [visited_children[0]]
        # Each repetition is a (pipe, pair) sequence; keep the pair
        for potential_pair in [item[1] for item in visited_children[1]]:
            pairs.append(potential_pair)
        return pairs
point = Points()
for line in data.split("\n"):
    try:
        pairs = point.parse(line)
        print(pairs)
    except ParseError:
        print("Not correct: {}".format(line))
You don't even need regex for this. Keep it simple.
Step 1
Split on ,.
s.split(',')
Step 2
Split on | and ensure each result is of type float (rather, that it can be converted to this type without fault). The second step here (validation) can be removed if it's not required.
r = s.split('|')
for v in r:
    try:
        float(v)
    except ValueError:
        print(v + ' is not a float')
Step 3
Combine.
Complete example:
strings = [
    '37.1000,-88.1000',
    '37.1000,-88.1000|37.1450,-88.1060',
    '37.1000,-88.1000|37.1450,-88.1060|35.1450,-83.1060'
]
def split_on_comma(s):
    return s.split(',')
def split_on_bar(s):
    r = s.split('|')
    for v in r:
        try:
            float(v)
        except ValueError:
            print(v + ' is not a float')
    return r
for s in strings:
    for c in split_on_comma(s):
        print(split_on_bar(c))
Without validation and functions, your code becomes:
for s in strings:
    for c in s.split(','):
        for b in c.split('|'):
            print(b)
You can change the output to your liking, but this presents each required step for splitting and validating the data.
If you want to retrieve the values pair by pair, you can use a simple regex or just split():
import re
# values is a list of input strings like the ones in the question
for value in values:
    pairs = re.findall(r"([\d. ,-]+)\|?", value)
    for pair in pairs:
        v1, v2 = pair.strip().split(",")
# or
for value in values:
    pairs = value.split("|")
    for pair in pairs:
        v1, v2 = pair.strip().split(",")
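The split-based approach can also keep the pair structure and validate at the same time. A minimal sketch, assuming that a ValueError is an acceptable way to signal bad input; note that float() is more permissive than the regexes above (it also accepts text such as "1e5" or "nan"). The function name parse_pairs is hypothetical:

```python
def parse_pairs(s):
    """Parse 'a,b|c,d|...' into a list of (float, float) tuples.

    Raises ValueError if a pair does not hold exactly two numbers.
    """
    pairs = []
    for chunk in s.split("|"):
        parts = chunk.split(",")
        if len(parts) != 2:
            raise ValueError(f"expected two numbers, got {chunk!r}")
        # float() itself raises ValueError on non-numeric text
        pairs.append((float(parts[0]), float(parts[1])))
    return pairs

print(parse_pairs("37.1000,-88.1000|37.1450,-88.1060"))
```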

Finding the index of the second match of a regular expression in python

So I am trying to rename files to match the naming convention for Plex Media Server (SxxEyy).
Now I have a ton of files that use e.g. 411 for S04E11. I have written a little function that searches for an occurrence of this pattern and replaces it with the correct convention, like this:
pattern1 = re.compile(r'[Ss]\d+[Ee]\d+')
pattern2 = re.compile(r'[.-]\d{3,4}')
def plexify_name(string):
    # If the file matches the pattern we want, don't change it
    if pattern1.search(string):
        return string
    elif pattern2.search(string):
        piece_to_change = pattern2.search(string)
        endpos = piece_to_change.end()
        startpos = piece_to_change.start()
        # Cut out the digits to change (skip the leading separator)
        cut = string[startpos+1:endpos]
        if len(cut) == 4:
            cut = 'S' + cut[0:2] + 'E' + cut[2:4]
        elif len(cut) == 3:
            cut = 'S0' + cut[0:1] + 'E' + cut[1:3]
        return string[0:startpos+1] + cut + string[endpos:]
And this works very well. But it turns out that some of the filenames have a year in them, e.g. the.flash.2014.118.mp4, in which case it will change the 2014.
I tried using
pattern2.findall(string)
which does return a list of strings like ['.2014', '.118'], but what I want is a list of match objects, so that I can check whether there are two and, in that case, use the start/end of the second. I can't seem to find a way to do this in the re documentation. Am I missing something, or do I need to take a totally different approach?
You could try anchoring the match to the file extension:
pattern2 = re.compile(r'[.-]\d{3,4}(?=[.]mp4$)')
Here, (?= ... ) is a look-ahead assertion, meaning that the thing has to be there for the regex to match, but it's not part of the match:
>>> pattern2.findall('test.118.mp4')
['.118']
>>> pattern2.findall('test.2014.118.mp4')
['.118']
>>> pattern2.findall('test.123.mp4.118.mp4')
['.118']
Of course, you want it to work with all possible extensions:
>>> p2 = re.compile(r'[.-]\d{3,4}(?=[.][^.]+$)')
>>> p2.findall('test.2014.118.avi')
['.118']
>>> p2.findall('test.2014.118.mov')
['.118']
If there is more stuff between the episode number and the extension, regexes for matching that start to get tricky, so I would suggest a non-regex approach for dealing with that:
>>> f = 'test.123.castle.2014.118.x264.mp4'
>>> [p for p in f.split('.') if p.isdigit()][-1]
'118'
Or, alternatively, you can get match objects for all matches by using finditer and expanding the iterator by converting it to a list:
>>> p2 = re.compile(r'[.-]\d{3,4}')
>>> f = 'test.2014.712.x264.mp4'
>>> matches = list(p2.finditer(f))
>>> matches[-1].group(0)
'.712'
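Putting the finditer idea back into the renaming function from the question could look like the sketch below. It assumes that the episode number is always the last 3 or 4 digit group sitting between separators, so a file whose only such group is a year would still be renamed incorrectly. EPISODE_RE is an illustrative name, and plexify_name here is a rewritten version, not the asker's original:

```python
import re

ALREADY_OK_RE = re.compile(r'[Ss]\d+[Ee]\d+')
# 3 or 4 digits preceded by . or - and followed by . or - (the look-ahead
# leaves the trailing separator unconsumed so adjacent groups still match).
EPISODE_RE = re.compile(r'[.-](\d{3,4})(?=[.-])')

def plexify_name(name):
    if ALREADY_OK_RE.search(name):
        return name
    matches = list(EPISODE_RE.finditer(name))
    if not matches:
        return name
    m = matches[-1]                      # use the last candidate, not the first
    digits = m.group(1)
    season, episode = digits[:-2], digits[-2:]
    prefix, suffix = name[:m.start(1)], name[m.end(1):]
    return prefix + "S{:02d}E{}".format(int(season), episode) + suffix

print(plexify_name('the.flash.2014.118.mp4'))
```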

Capture a Repeating Group in Python using RegEx (see example)

I am writing a regular expression in python to capture the contents inside an SSI tag.
I want to parse the tag:
<!--#include file="/var/www/localhost/index.html" set="one" -->
into the following components:
Tag Function (ex: include, echo or set)
Name of attribute, found before the = sign
Value of attribute, found in between the "'s
The problem is that I am at a loss on how to grab these repeating groups, as name/value pairs may occur one or more times in a tag. I have spent hours on this.
Here is my current regex string:
^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$
It captures the include in the first group and file="/var/www/localhost/index.html" set="one" in the second group, but what I am after is this:
group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"
(continue for every other name="value" pair)
I am using this site to develop my regex
Grab everything that can be repeated, then parse them individually. This is probably a good use case for named groups, as well!
import re
data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''
result = re.match(pat, data)
result.groups()
('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')
Then iterate through it:
g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split():  # split on whitespace
    key, value = keyvalue.split('=')
    # do something with them
A way with the new python regex module:
#!/usr/bin/python
import regex
s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
p = r'''(?x)
(?>
\G(?<!^)
|
<!-- \# (?<function> [a-z]+ )
)
\s+
(?<key> [a-z]+ ) \s* = \s* " (?<val> [^"]* ) "
'''
matches = regex.finditer(p, s)
for m in matches:
    if m.group("function"):
        print("function: " + m.group("function"))
    print(" key: " + m.group("key") + "\n value: " + m.group("val") + "\n")
The way with re module:
#!/usr/bin/python
import re
s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
p = r'''(?x)
<!-- \# (?P<function> [a-z]+ )
\s+
(?P<params> (?: [a-z]+ \s* = \s* " [^"]* " \s*? )+ )
-->
'''
matches = re.finditer(p, s)
for m in matches:
    print("function: " + m.group("function"))
    for param in re.finditer(r'[a-z]+|"([^"]*)"', m.group("params")):
        # group(1) is None for a key and a string (possibly empty) for a value
        if param.group(1) is not None:
            print(" value: " + param.group(1) + "\n")
        else:
            print(" key: " + param.group())
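The same two-step idea with the built-in re module can be wrapped in a function that returns the pieces instead of printing them. A minimal sketch, assuming lowercase attribute names and double-quoted values without embedded quotes, as in the example tag; the names TAG_RE, ATTR_RE and parse_ssi are hypothetical:

```python
import re

# Step 1: the tag function plus the whole run of name="value" pairs.
TAG_RE = re.compile(r'<!--#([a-z]+)((?:\s+[a-z]+\s*=\s*"[^"]*")+)\s*-->')
# Step 2: one name="value" pair inside that run.
ATTR_RE = re.compile(r'([a-z]+)\s*=\s*"([^"]*)"')

def parse_ssi(tag):
    """Return (function, [(name, value), ...]) for one SSI tag."""
    m = TAG_RE.fullmatch(tag.strip())
    if m is None:
        raise ValueError("not an SSI tag: {!r}".format(tag))
    return m.group(1), ATTR_RE.findall(m.group(2))

print(parse_ssi('<!--#include file="/var/www/localhost/index.html" set="one" -->'))
```

Returning a list of pairs rather than a dict keeps the attribute order and any repeated names.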
I recommend against using a single regular expression to capture every item in a repeating group. Instead--and unfortunately, I don't know Python, so I'm answering it in the language I understand, which is Java--I recommend first extracting all attributes, and then looping through each item, like this:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class AllAttributesInTagWithRegexLoop {
    public static final void main(String[] ignored) {
        String input = "<!--#include file=\"/var/www/localhost/index.html\" set=\"one\" -->";
        Matcher m = Pattern.compile(
            "<!--#(include|echo|set) +(.*)-->").matcher(input);
        m.matches();
        String tagFunc = m.group(1);
        String allAttrs = m.group(2);
        System.out.println("Tag function: " + tagFunc);
        System.out.println("All attributes: " + allAttrs);
        m = Pattern.compile("(\\w+)=\"([^\"]+)\"").matcher(allAttrs);
        while(m.find()) {
            System.out.println("name=\"" + m.group(1) +
                "\", value=\"" + m.group(2) + "\"");
        }
    }
}
Output:
Tag function: include
All attributes: file="/var/www/localhost/index.html" set="one"
name="file", value="/var/www/localhost/index.html"
name="set", value="one"
Here's an answer that may be of interest: https://stackoverflow.com/a/23062553/2736496
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.
Unfortunately, Python's built-in re module does not support recursive regular expressions.
You can instead do this:
import re
string = '''<!--#include file="/var/www/localhost/index.html" set="one" set2="two" -->'''
regexString = r'''<!--\#(?P<tag>\w+)\s(?P<name>\w+)="(?P<value>.*?)"\s(?P<keyVal>.*)\s-->'''
regex = re.compile(regexString)
match = regex.match(string)
tag = match.group('tag')
name = match.group('name')
value = match.group('value')
keyVal = match.group('keyVal').split()
for item in keyVal:
    key, val = item.split('=')
    # You can now do whatever you want with the key=val pair
The regex library allows capturing repeated groups (while builtin re does not). This allows for a simple solution without needing external for-loops to parse the groups afterwards.
import regex
string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = regex.compile(
r'<!--#(?<fun>[a-z]+)(\s+(?<key>[a-z]+)\s*=\s*"(?<val>[^"]*)")+')
match = rgx.match(string)
keys, values = match.captures('key', 'val')
print(match['fun'], *map(' = '.join, zip(keys, values)), sep='\n ')
gives you what you're after
include
file = /var/www/localhost/index.html
set = one

getting string between 2 characters in python

I need to get certain words out from a string in to a new format. For example, I call the function with the input:
text2function('$sin (x)$ is an function of x')
and I need to put them into a StringFunction:
StringFunction(function, independent_variables=[vari])
where I need to get just 'sin (x)' for function and 'x' for vari. So it would look like this finally:
StringFunction('sin (x)', independent_variables=['x'])
The problem is, I can't seem to obtain function and vari. I have tried:
start = string.index(start_marker) + len(start_marker)
end = string.index(end_marker, start)
return string[start:end]
and
r = re.compile('$()$')
m = r.search(string)
if m:
    lyrics = m.group(1)
and
send = re.findall('$([^"]*)$',string)
All of these seem to give me nothing. Am I doing something wrong? All help is appreciated. Thanks.
A hacky way:
>>> char1 = '('
>>> char2 = ')'
>>> mystr = "mystring(123234sample)"
>>> print(mystr[mystr.find(char1)+1 : mystr.find(char2)])
123234sample
$ is a special character in regex (it denotes the end of the string). You need to escape it:
>>> re.findall(r'\$(.*?)\$', '$sin (x)$ is an function of x')
['sin (x)']
If you want to cut a string between two identical characters (e.g. !234567890!),
you can use
line_word = line.split('!')
print (line_word[1])
You need to start searching for the second character beyond start:
end = string.index(end_marker, start + 1)
because otherwise it'll find the same character at the same location again:
>>> start_marker = end_marker = '$'
>>> string = '$sin (x)$ is an function of x'
>>> start = string.index(start_marker) + len(start_marker)
>>> end = string.index(end_marker, start + 1)
>>> string[start:end]
'sin (x)'
For your regular expressions, the $ character is interpreted as an anchor, not the literal character. Escape it to match a literal $ (and look for characters that are not $ instead of not "):
send = re.findall(r'\$([^$]*)\$', string)
which gives:
>>> import re
>>> re.findall(r'\$([^$]*)\$', string)
['sin (x)']
The regular expression $()$ otherwise doesn't really match anything between the parentheses, even if you did escape the $ characters.
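To get both pieces the question asks for, the escaped pattern can be combined with a second search for the variable name. A minimal sketch, assuming the variable is a single identifier inside parentheses as in 'sin (x)'; the function name extract_function is hypothetical, and the StringFunction call from the question is deliberately left out:

```python
import re

def extract_function(text):
    """Return (expression between $...$, variable inside parentheses) or None."""
    m = re.search(r'\$([^$]*)\$', text)
    if m is None:
        return None
    expr = m.group(1)
    # Look for a single identifier wrapped in parentheses, e.g. "(x)"
    var = re.search(r'\(\s*([A-Za-z_]\w*)\s*\)', expr)
    return expr, (var.group(1) if var else None)

print(extract_function('$sin (x)$ is an function of x'))
```

The two returned values correspond to function and vari in the question.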
