Regex replacement for strip() - python

Long time/first time.
I am a pharmacist by trade but am going through the motions of teaching myself how to code in a variety of languages that are useful to me for things like task automation at work, but mainly Python 3.x. I am working through the automatetheboringstuff eBook and finding it great.
I am trying to complete one of the practice questions from Chapter 7:
"Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed other than the string to strip, then whitespace characters will be removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function will be removed from the string."
I am stuck on the case where the characters I want to strip also appear inside the string I want to strip them from, e.g. 'ssstestsss'.strip('s')
#!python3
import re
respecchar = ['?', '*', '+', '{', '}', '.', '\\', '^', '$', '[', ']']
def regexstrip(string, _strip):
    if _strip == '' or _strip == ' ':
        _strip = r'\s'
    elif _strip in respecchar:
        _strip = '\\' + _strip
    print(_strip) #just for troubleshooting
    re_strip = re.compile('^'+_strip+'*(.+)'+_strip+'*$')
    print(re_strip) #just for troubleshooting
    mstring = re_strip.search(string)
    print(mstring) #just for troubleshooting
    stripped = mstring.group(1)
    print(stripped)
As it is shown, running it on ('ssstestsss', 's') will yield 'testsss' as the .+ gets all of it and the * lets it ignore the final 'sss'. If I change the final * to a + it only improves a bit to yield 'testss'. If I make the capture group non-greedy (i.e. (.+)? ) I still get 'testsss', and if I exclude the character to be stripped from the character class for the capture group and remove the end-of-string anchor (i.e. re.compile('^'+_strip+'*([^'+_strip+'.]+)'+_strip+'*')) I get 'te'. If I don't remove the end-of-string anchor, then it obviously errors.
Apologies for the verbose and ramble-y question.
I deliberately included all the code (work in progress) as I am only learning so I realise that my code is probably rather inefficient, so if you can see any other areas where I can improve my code, please let me know. I know that there is no practical application for this code, but I'm going through this as a learning exercise.
I hope I have asked this question appropriately and haven't missed anything in my searches.
Regards
Lobsta

Your (.+) is greedy by default. Just change it to non-greedy by using (.+?).
You can test python regex at this site
edit: As someone commented, (.+?) and (.+)? do not do the same thing: (.+?) is the non-greedy version of (.+), while (.+)? makes the greedy (.+) group optional.
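The difference is easy to check on the string from the question. This is only a minimal sketch, with the strip character hard-coded as s for illustration:

```python
import re

text = 'ssstestsss'

# Greedy capture: (.+) swallows the trailing 'sss', leaving nothing for the final s*
greedy = re.search(r'^s*(.+)s*$', text).group(1)

# Optional greedy group: the ? applies to the whole group, which is still greedy
optional = re.search(r'^s*(.+)?s*$', text).group(1)

# Lazy capture: (.+?) grows one character at a time until s*$ can match
lazy = re.search(r'^s*(.+?)s*$', text).group(1)

print(greedy, optional, lazy)  # testsss testsss test
```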

As I mentioned in my comment, you did not include special chars into the character class.
Also, the .* without a re.S / re.DOTALL modifier does not match newlines. You may avoid using it with ^PATTERN|PATTERN$ or \APATTERN|PATTERN\Z (note that \A matches the start of a string, and \Z matches the very end of the string, $ can match before the final newline symbol in a string, and thus, you cannot use $).
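Both points can be seen in a few lines. The sample text below is made up for the illustration:

```python
import re

text = "line one\nline two\n"

# Without re.DOTALL, . stops at the first newline
first_line = re.match(r".*", text).group(0)
everything = re.match(r".*", text, re.DOTALL).group(0)

# $ is allowed to match just before a trailing newline; \Z is not
dollar_hit = re.search(r"two$", text) is not None
z_hit = re.search(r"two\Z", text) is not None

print(repr(first_line), repr(everything), dollar_hit, z_hit)
```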
I'd suggest shrinking your code to
import re
def regexstrip(string, _strip=None):
    _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z"
    print(_strip) #just for troubleshooting
    return re.sub(_strip, '', string)
print(regexstrip(" ([no more stripping'] ) ", " ()[]'"))
# \A[\s\ \(\)\[\]\']+|[\s\ \(\)\[\]\']+\Z
# no more stripping
print(regexstrip(" ([no more stripping'] ) "))
# \A\s+|\s+\Z
# ([no more stripping'] )
See the Python demo
Note that:
The _strip argument is optional with a =None
The _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z" inits the regex pattern: if _strip is passed, the symbols are put inside a [...] character class and escaped (since we cannot control the symbol positions much, it is the quickest easiest way to make them all treated as literal symbols).
With re.sub, we remove the matched substrings.
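One behavioural difference from str.strip() is worth noting: because \s stays in the character class, the function above also removes whitespace when a character set is passed. If you want only the given characters stripped, a variant could look like the sketch below. The names regex_strip and chars are made up for this illustration, and an empty chars is treated like None here, which is an assumption taken from the exercise text and differs from str.strip('').

```python
import re

def regex_strip(text, chars=None):
    """Strip whitespace by default, otherwise only the characters in `chars`."""
    if chars:
        # re.escape makes every character literal, which is also safe inside [...]
        char_class = "[" + re.escape(chars) + "]"
    else:
        char_class = r"\s"
    # \A and \Z anchor at the very start and very end of the string
    pattern = r"\A{0}+|{0}+\Z".format(char_class)
    return re.sub(pattern, "", text)

print(regex_strip("ssstestsss", "s"))   # test
print(regex_strip("  padded \n"))       # padded
```

For these inputs the result agrees with the built-in: 'ssstestsss'.strip('s') also gives 'test'.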

Related

How to parse parameters from text?

I have a text that looks like:
ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c,
,second d), third, fourth)
The engine can be different (instead of CollapsingMergeTree, there can be a different word: ReplacingMergeTree, SummingMergeTree...) but the text is always in the format ENGINE = word (). There can be spaces around the "=" sign, but they are not mandatory.
Inside the parentheses are several parameters, usually a single word followed by a comma, but some parameters are themselves in parentheses, like the second one in the example above.
Line breaks could be anywhere. Line can end with comma, parenthesis or anything else.
I need to extract n parameters (I don't know how many in advance). In example above, there are 4 parameters:
first = first_param
second = (second_a, second_b, second_c, second_d) [extract with parenthesis]
third = third
fourth = fourth
How to do that with python (regex or anything else)?
You'd probably want to use a proper parser (and so look up how to hand-roll a parser for a simple language) for whatever language that is, but since what little you show here looks Python-compatible you could just parse it as if it were Python using the ast module (from the standard library) and then manipulate the result.
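For what it's worth, a hand-rolled parser for this particular shape can be quite small. The sketch below is not a full parser for the language: it only tracks parenthesis depth and splits on commas at the top level. The function name split_top_level is made up, and the sample text assumes the "second d" and doubled comma in the question were typos.

```python
def split_top_level(text):
    """Return the comma-separated parameters inside the outermost parentheses."""
    start = text.index("(")
    params, current, depth = [], [], 0
    for ch in text[start + 1:]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:      # closing parenthesis of the outermost group
                break
            depth -= 1
        if ch == "," and depth == 0:
            # A comma outside any nested parentheses ends the current parameter
            params.append("".join(current))
            current = []
        else:
            current.append(ch)
    params.append("".join(current))
    # Drop all whitespace, including line breaks, inside each parameter
    return ["".join(p.split()) for p in params]

text = """ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)"""

print(split_top_level(text))
# ['first_param', '(second_a,second_b,second_c,second_d)', 'third', 'fourth']
```

Because it counts depth instead of matching a fixed pattern, it does not need to know the engine name or the number of parameters in advance.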
I came up with a regex solution for your problem. I tried to keep the regex pattern as 'generic' as I could, because I don't know if there will always be newlines and whitespace in your text, which means the pattern selects a lot of whitespace, which is then removed afterwards.
# Import the module for regular expressions
import re
# Text to search. I corrected it a bit, as your example said "second d" and second_c was followed by two commas. I am assuming those were typos.
text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''
# Regex search pattern. re.S means that . (which represents ANY character) also matches \n (newlines)
pattern = re.compile(r'ENGINE = CollapsingMergeTree \((.*?),\((.*?)\),(.*?), (.*?)\)', re.S)
# Apply the pattern to the text and save the match in 'result'. result[0] would return the whole matched text.
# The items you want are sub-expressions enclosed in parentheses () and can be accessed with result[1] and above
result = pattern.match(text)
# result[1] gets everything after the parenthesis following CollapsingMergeTree up to the first , (comma), including whitespace and newlines. re.sub replaces all whitespace, including newlines, with nothing
first = re.sub(r'\s', '', result[1])
# result[2] gets second_a to second_d, including whitespace and newlines, which re.sub removes in the same way
second = re.sub(r'\s', '', result[2])
third = re.sub(r'\s', '', result[3])
fourth = re.sub(r'\s', '', result[4])
print(first)
print(second)
print(third)
print(fourth)
OUTPUT:
first_param
second_a,second_b,second_c,second_d
third
fourth
Regex explanation:
\ = Escapes a control character, which is a character regex would interpret to mean something special. More here.
\( = Escape parentheses
() = Mark the expression in the parentheses as a sub-group. See result[1] and so on.
. = Matches any character (including newline, because of re.S)
* = Matches 0 or more occurrences of preceding expression.
? = Matches 0 or 1 occurrence of preceding expression.
NOTE: *? combined is called a non-greedy (lazy) repetition, meaning the preceding expression is matched as few times as possible while still letting the rest of the pattern match, instead of as many times as possible.
I am no expert, but I hope I got the explanations right.
I hope this helps.
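To make the difference between * and *? concrete, here is a tiny comparison on a made-up string (not the text from the question):

```python
import re

text = "(a), (b), (c)"

# Greedy: .* runs to the end and backtracks to the LAST closing parenthesis
greedy = re.search(r"\((.*)\)", text).group(1)

# Lazy: .*? stops at the FIRST closing parenthesis that lets the pattern match
lazy = re.search(r"\((.*?)\)", text).group(1)

print(repr(greedy))  # 'a), (b), (c'
print(repr(lazy))    # 'a'
```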

Match regex with \\n in it

I have the following string:
>>> repr(s)
" NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp
I want to match the string before the \\n -- everything before a whitespace character. The output should be:
['NBCUniversal', 'VOLGAFILMINC']
Here is what I have so far:
re.findall(r'[^s].+\\n\d{1,2}', s)
What would be the correct regex for this?
EDIT: sorry, I hadn't read your question carefully.
If you want to find all groups of letters immediately before a literal \n, re.findall is appropriate. You can obtain the result you want with:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> re.findall(r'(?i)[a-z]+(?=\\n)', s)
['NBCUniversal', 'VOLGAFILMINC']
OLD ANSWER:
re.findall is not the appropriate method since you only need one result (that is a pair of strings). Here the re.search method is more appropriate:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> res = re.search(r'(?i)^[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', s)
>>> res.groups()
('NBCUniversal', 'VOLGAFILM')
Note: I have assumed that there are no other characters between the first word and the literal \n, but if it isn't the case, you can add [^a-z\\]* before the \\n in the pattern.
If you want to fix your existing code instead of replacing it, you're on the right track; you've just got a few minor problems.
Let's start with your pattern:
>>> re.findall(r'[^s].+\\n\d{1,2}', s)
[' NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64']
The first problem is that .+ will match everything that it can, all the way up to the very last \\n\d{1,2}, rather than just to the next \\n\d{1,2}. To fix that, add a ? to make it non-greedy:
>>> re.findall(r'[^s].+?\\n\d{1,2}', s)
[' NBCUniversal\\n63', ' VOLGAFILM, INC VOLGAFILMINC\\n64']
Notice that we now have two strings, as we should. The problem is, those strings don't just have whatever matched the .+?, they have whatever matched the entire pattern. To fix that, wrap the part you want to capture in () to make it a capturing group:
>>> re.findall(r'[^s](.+?)\\n\d{1,2}', s)
[' NBCUniversal', ' VOLGAFILM, INC VOLGAFILMINC']
That's nicer, but it still has a bunch of extra stuff on the left end. Why? Well, you're capturing everything after [^s]. That means any character except the letter s. You almost certainly meant [\s], meaning any character in the whitespace class. (Note that \s is already the whitespace class, so [\s], meaning the class consisting of the whitespace class, is unnecessary.) That's better, but that's still only going to match one space, not all the spaces. And it will match the earliest space it can that still leaves .+? something to match, not the latest. So if you want to soak up all the excess spaces, you need to repeat it:
>>> re.findall(r'\s+(.+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILM, INC VOLGAFILMINC']
Getting closer… but the .+? matches anything, including the space between VOLGAFILM and VOLGAFILMINC, and again, the \s+ is going to match the first run of spaces it can, leaving the .+? to match everything after that.
You could fiddle with the prefix, but there's an easier solution. If you don't want spaces in your capture group, just capture a run of nonspaces instead of a run of anything, using \S:
>>> re.findall(r'\s+(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
And notice that once you've done that, the \s+ isn't really doing anything anymore, so let's just drop it:
>>> re.findall(r'(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
I've obviously made some assumptions above that are correct for your sample input, but may not be correct for real data. For example, if you had a string like Weyland-Yutani\\n…, I'm assuming you want Weyland-Yutani, not just Yutani. If you have a different rule, like only letters, just change the part in parentheses to whatever fits that rule, like (\w+?) or ([A-Za-z]+?).
Assuming that the input actually has the sequence \n (backslash followed by letter 'n') and not a newline, this will work:
>>> re.findall(r'(\S+)\\n', s)
['NBCUniversal', 'VOLGAFILMINC']
If the string actually contains newlines then replace \\n with \n in the regular expression.
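Here is a small side-by-side check of the two cases. The variable names are made up, and the second string is simply the first with its backslash-n pairs turned into real newlines:

```python
import re

# Backslash followed by the letter n, as in the question (two characters)
literal = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
# The same text with real newline characters instead
real = literal.replace("\\n", "\n")

# r"\\n" matches a backslash plus 'n'; r"\n" matches an actual newline
from_literal = re.findall(r"(\S+)\\n", literal)
from_newline = re.findall(r"(\S+)\n", real)

print(from_literal)  # ['NBCUniversal', 'VOLGAFILMINC']
print(from_newline)  # ['NBCUniversal', 'VOLGAFILMINC']
```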

Best way to split a string for the last space

I'm wondering the best way to split a string separated by spaces for the last space in the string which is not inside [, {, ( or ". For instance I could have:
a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
For a it should parse into ['a', 'b', 'c', 'd', 'e', 'f'], ["something else here"]
and b should parse into ['another', 'parse', 'option'], ['{(["gets confusing"])}']
Right now I have this:
import sys
def getMin(aList):
    # Smallest index in aList, ignoring -1 (str.find's "not found" value)
    smallest = sys.maxsize
    for item in aList:
        if item < smallest and item != -1:
            smallest = item
    return smallest
myList = []
myList.append(b.find('['))
myList.append(b.find('{'))
myList.append(b.find('('))
myList.append(b.find('"'))
myMin = getMin(myList)
print(b[:myMin], b[myMin:])
I'm sure there's better ways to do this and I'm open to all suggestions
Matching vs. Splitting
There is an easy solution. The key is to understand that matching and splitting are two sides of the same coin. When you say "match all", that means "split on what I don't want to match", and vice-versa. Instead of splitting, we're going to match, and you'll end up with the same result.
The Reduced, Simple Version
Let's start with the simplest version of the regex so you don't get scared by something long:
{[^{}]*}|\S+
This matches all the items of your second string—the same as if we were splitting (see demo)
The left side of the | alternation matches complete sets of {braces}.
The right side of the | matches any characters that are not whitespace characters.
It's that simple!
The Full Regex
We also need to match "full quotes", (full parentheses) and [full brackets]. No problem: we just add them to the alternation. Just for clarity, I'm throwing them together in a non-capture group (?: so that the \S+ pops out on its own, but there is no need.
(?:{[^{}]*}|"[^"]*"|\([^()]*\)|\[[^][]*\])|\S+
See demo.
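In Python, the matching approach could be wired up roughly like this. It is only a sketch: the helper name split_last_group is made up, and the pattern is the full regex with the braces and brackets escaped explicitly and the non-capture group dropped, which should be equivalent.

```python
import re

# Either one complete quoted/bracketed group, or a run of non-whitespace
TOKEN = re.compile(r'\{[^{}]*\}|"[^"]*"|\([^()]*\)|\[[^\[\]]*\]|\S+')
OPENERS = ('{', '"', '(', '[')

def split_last_group(text):
    """Return (plain words, [trailing group]) using matching instead of splitting."""
    tokens = TOKEN.findall(text)
    if tokens and tokens[-1].startswith(OPENERS):
        return tokens[:-1], tokens[-1:]
    return tokens, []

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
print(split_last_group(a))
print(split_last_group(b))
```

As in the regex itself, nesting of the same bracket type is not handled; the outermost braces in b work only because there are no other braces inside them.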
Notes and Potential Improvements
We could replace the quoted string regex by one that accepts escaped quotes
We could replace the brace, brackets and parentheses expressions by recursive expressions to allow nested constructions, but you'd have to use Matthew Barnett's (awesome) regex module instead of re
The technique is related to a simple and beautiful trick to Match (or replace) a pattern except when...
Let me know if you have questions!
You can use regular expressions:
import re
def parse(text):
    m = re.search(r'(.*) ([\[({"].*)', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
The first part (.*) catches everything up to the section in quotes or parenthesis, and the second part catches anything starting at a character in ([{".
If you need something more robust, this has a more complicated regular expression, but it will make sure that the opening token is matched, and it makes the last expression optional.
def parse(text):
    m = re.search(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
Perhaps this link will help:
Split a string by spaces -- preserving quoted substrings -- in Python
It explains how to preserve quoted substrings when splitting a string by spaces.

Python Regex - Match a character without consuming it

I would like to convert the following string
"For "The" Win","Way "To" Go"
to
"For ""The"" Win","Way ""To"" Go"
The straightforward regex would be
str2 = re.sub(r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
i.e., Double the quotes that are
Followed by a letter but not preceded by a comma or the beginning of line
Preceded by a letter but not followed by a comma or the end of line
The problem is I am using Python and its regex engine does not allow using the OR operator in the lookbehind construct. I get the error
sre_constants.error: look-behind requires fixed-width pattern
What I am looking for is a regex that will replace the '"' around 'The' and 'To' with '""'.
I can use the following regex (An answer provided to another question)
\b\s*"(?!,|[ \t]*$)
but that consumes the space just before the 'The' and 'To' and I get the below
"For""The"" Win","Way""To"" Go"
Is there a workaround so that I can double the quotes around 'The' and 'To' without consuming the spaces just before them?
Instead of saying not preceded by comma or the line start, say preceded by a non-comma character:
r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)'
Looks to me like you don't need to bother with anchors.
If there is a character before the quote, you know it's not at the beginning of the string.
If that character is not a newline, you're not at the beginning of a line.
If the character is not a comma, you're not at the beginning of a field.
So you don't need to use anchors, just do a positive lookbehind/lookahead for a single character:
result = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', subject)
I threw in the " on the chance that there might be some quotes that are already escaped. But realistically, if that's the case you're probably screwed anyway. ;)
re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)
Most direct workaround whenever you encounter this issue: explode the look-behind into two look-behinds.
str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
(don't name your strings str)
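To check the split look-behind against the sample from the question, here is a small sketch (double_inner_quotes is a name made up for the example):

```python
import re

# Two separate fixed-width look-behinds replace the rejected (?<!,|^)
PATTERN = re.compile(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', re.MULTILINE)

def double_inner_quotes(line):
    """Double the quotes inside a field, leaving the field delimiters alone."""
    return PATTERN.sub('""', line)

sample = '"For "The" Win","Way "To" Go"'
print(double_inner_quotes(sample))
# "For ""The"" Win","Way ""To"" Go"
```

Nothing before the quote is consumed, so the spaces in front of The and To survive.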
str2 = re.sub('(?<=[^,])"(?=\w)'
              '|'
              '(?<=\w)"(?!,|$)',
              '""', ss,
              flags=re.MULTILINE)
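I always wonder why people use raw strings for regex patterns when they aren't needed. (Be aware, though, that since Python 3.6 an unrecognized escape such as '\w' in a non-raw string triggers a deprecation warning, so raw strings are the safer habit.)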
Note I changed your str, which is the name of a builtin class, to ss.
For "fun":
str2 = re.sub('"'
              '('
              '(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)'
              ')',
              '""', ss,
              flags=re.MULTILINE)
or also
str2 = re.sub('(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)',
              '"', ss,
              flags=re.MULTILINE)

Print the line between specific pattern

I want to print the lines between specific string, my string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string I want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can I do this?
I tried to find the lines between two ##start and search for main.
Try something like:
def get_mains(my_string):
    section = ''
    for line in my_string.split('\n'):
        if line.startswith("##start"):
            section = line
            continue
        if 'main' in line:
            yield '/'.join([section, line])
for main in get_mains(my_string):
    print(main)
There is a way to do this with Python's regular expression module, re ("regex" for short).
Basically, regex is a whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it would match the regex pattern 'llo, Wor', because it contains an l followed by an l followed by an o followed by a comma and a space and a capital W and so on. On the surface it just looks like a substring test. The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special sequence that stands for any letter in the alphabet (plus digits and the underscore). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these special sequences and they are all very useful. To actually answer your question,
I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line})(?:(?!##start){line})*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double the backslash in \n, see the linked webpage). Everything in plain parentheses becomes a group; groups are pieces of text that we want to be able to recall later, while (?:...) groups without capturing. There are two capturing groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is (?:(?!##start){line})*?, which means 'match as few complete lines as possible, none of which starts with ##start'. Without the (?!##start) lookahead, a section that contains no main line (such as ##start/new) would be paired with the main line of a later section. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines within that section, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line})(?:(?!##start){line})*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1] # don't want the trailing \n
    string += '/'
    string += match[1]
    print(string)
For your example, it outputs
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
So Regex is pretty cool and other languages have it too. It is a very powerful tool, and you should learn how to use it here.
Also just a side note, I use the .format function, because I think it looks much cleaner and easier to read, so
r'hello\n{line}world'.format(line=r'(?:.*\n)') is simply evaluated to r'hello\n(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world
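A quick way to see what a {line} placeholder expands to and what it matches is sketched below. The sample strings are made up. Note that {line} consumes the remainder of the current line plus its newline, so an explicit \n after hello is needed for a whole line to sit in between.

```python
import re

LINE = r'(?:.*\n)'  # the rest of the current line, including its newline

pattern = r'hello\n{line}world'.format(line=LINE)
text = 'hello\nAny text here. Anything at all.\nworld'

print(pattern)  # hello\n(?:.*\n)world
three_lines = re.search(pattern, text) is not None

# Without the explicit \n, {line} only consumes the end of the 'hello' line
short = 'hello{line}world'.format(line=LINE)
short_on_three_lines = re.search(short, text) is not None
short_on_two_lines = re.search(short, 'hello there\nworld') is not None

print(three_lines, short_on_three_lines, short_on_two_lines)  # True False True
```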
