I have a text that looks like:
ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c,
,second d), third, fourth)
The engine can differ (instead of CollapsingMergeTree there can be another word: ReplacingMergeTree, SummingMergeTree, ...), but the text is always in the format ENGINE = word (). There may be spaces around the "=" sign, but they are not mandatory.
Inside the parentheses are several parameters, usually a single word followed by a comma, but some parameters are themselves in parentheses, like the second one in the example above.
Line breaks can be anywhere. A line can end with a comma, a parenthesis or anything else.
I need to extract n parameters (I don't know how many in advance). In example above, there are 4 parameters:
first = first_param
second = (second_a, second_b, second_c, second_d) [extract with parenthesis]
third = third
fourth = fourth
How to do that with python (regex or anything else)?
You'd probably want to use a proper parser (and so look up how to hand-roll a parser for a simple language) for whatever language that is, but since what little you show here looks Python-compatible you could just parse it as if it were Python using the ast module (from the standard library) and then manipulate the result.
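A minimal sketch of the hand-rolled approach, assuming the parentheses are balanced and that parameters never contain quoted commas or parentheses. The function name split_engine_params is made up for illustration. It walks the text character by character, tracks the nesting depth, and only splits on commas at the top level:

```python
import re

def split_engine_params(text):
    """Return (engine_name, [params]) for text like 'ENGINE = Name (a, (b, c), d)'."""
    header = re.search(r'ENGINE\s*=\s*(\w+)\s*\(', text)
    if header is None:
        raise ValueError('no ENGINE clause found')
    params, current, depth = [], [], 1
    for ch in text[header.end():]:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:            # closing parenthesis of the ENGINE clause
                break
        if ch == ',' and depth == 1:  # a top-level comma ends the current parameter
            params.append(''.join(current))
            current = []
        elif not ch.isspace():        # drop spaces and line breaks
            current.append(ch)
    else:
        # The loop ended without reaching the matching closing parenthesis.
        raise ValueError('unbalanced parentheses')
    params.append(''.join(current))
    return header.group(1), params

text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''
engine, params = split_engine_params(text)
print(engine)
print(params)
```

Because nothing is hard-coded, this works for any engine name and any number of parameters; a parenthesised parameter is returned with its parentheses.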
I came up with a regex solution for your problem. I tried to keep the regex pattern as 'generic' as I could, because I don't know if there will always be newlines and whitespace in your text, which means the pattern selects a lot of whitespace, which is then removed afterwards.
#Import the module for regular expressions
import re
#Text to search. I CORRECTED IT A BIT AS YOUR EXAMPLE SAID second d AND second_c WAS FOLLOWED BY TWO COMMAS. I am assuming those were typos.
text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''
#Regex search pattern. re.S means . which represents ANY character, includes \n (newlines). The r prefix makes it a raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r'ENGINE = CollapsingMergeTree \((.*?),\((.*?)\),(.*?), (.*?)\)', re.S)
#Apply the pattern to the text and save the results in variable 'result'. result[0] would return whole text.
#The items you want are sub-expressions which are enclosed in parentheses () and can be accessed by using result[1] and above
result = re.match(pattern, text)
#result[1] will get everything after the parenthesis after CollapsingMergeTree until it reaches a , (comma), but with whitespace and newlines. re.sub is used to replace all whitespace, including newlines, with nothing
first = re.sub(r'\s', '', result[1])
#result[2] will get second a-d, but with whitespace and newlines. re.sub is used to replace all whitespace, including newlines, with nothing
second = re.sub(r'\s', '', result[2])
third = re.sub(r'\s', '', result[3])
fourth = re.sub(r'\s', '', result[4])
print(first)
print(second)
print(third)
print(fourth)
OUTPUT:
first_param
second_a,second_b,second_c,second_d
third
fourth
Regex explanation:
\ = Escapes a control character, which is a character regex would interpret to mean something special.
\( = Escape parentheses
() = Mark the expression in the parentheses as a sub-group. See result[1] and so on.
. = Matches any character (including newline, because of re.S)
* = Matches 0 or more occurrences of preceding expression.
? = Matches 0 or 1 occurrence of preceding expression.
NOTE: *? combined is called a non-greedy repetition, meaning the preceding expression is matched as few times as possible (here, only up to the next comma or parenthesis), instead of as many times as possible.
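To see the difference between greedy and non-greedy repetition in practice, here is a small sketch. The sample string is made up for illustration and is not part of the question:

```python
import re

text = 'f(a) g(b)'

# Greedy: .* grabs as much as it can, so it runs to the LAST closing parenthesis.
greedy = re.findall(r'\((.*)\)', text)

# Non-greedy: .*? grabs as little as it can, so it stops at the FIRST closing parenthesis.
lazy = re.findall(r'\((.*?)\)', text)

print(greedy)  # ['a) g(b']
print(lazy)    # ['a', 'b']
```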
I am no expert, but I hope I got the explanations right.
I hope this helps.
I am a newbie in Python and I want to use my regex in re.sub. I tried it on regex101 and it works. Somehow, when I try to use it in Python (version 3.6), it doesn't work properly. I get the following error message:
bad character in group name '?=[^\t]*' at position 5
This is my code:
re = r"(?(?=[^\t]*)([\t]+))";
str = 'a bold, italic, teletype';
subst = ',';
result = re.sub($re, $subst, $str);
The problem is that you cannot use lookarounds in conditional constructs in Python's re. The condition can only be a capturing group ID or name, which tests whether that group has matched.
(?(id/name)yes-pattern|no-pattern)
Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted.
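For reference, a minimal sketch of the form that Python's re does accept, with a group number as the condition. The pattern and the sample strings are made up for illustration:

```python
import re

# (<)? optionally captures an opening bracket as group 1.
# (?(1)>) then requires a closing bracket only if group 1 took part in the match.
pattern = re.compile(r'(<)?(\w+)(?(1)>)$')

print(bool(pattern.match('<tag>')))  # True: opening bracket, so '>' is required and present
print(bool(pattern.match('tag')))    # True: no opening bracket, so no '>' is expected
print(bool(pattern.match('<tag')))   # False: opening bracket without the closing one
```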
The (?(?=[^\t]*)([\t]+)) regex checks if there are 0+ chars other than tabs at the current location, and if yes, matches and captures 1 or more tabs. This makes no sense. If you want to match the first occurrence of 1 or more tabs, you may use re.sub with a mere "\t+" pattern and count=1 argument.
import re
reg = r"\t+"
s = 'a\tbold, italic, teletype'  # \t is the tab after the first word
result = re.sub(reg, ',', s, count=1)
print(result)
I suppose you could do this:
import re
regex = r'(^\w*?[\t]+)'
s = 'a\tbold, italic, teletype'  # \t is the tab after the first word
def repl(match):
    s = match.group(0)
    return s.rstrip() + ', '
print(re.sub(regex, repl, s))
Output:
a, bold, italic, teletype
Here we are capturing the beginning of the string through any tabs that may occur after the first word, and passing the match to a callable. The callable removes trailing tabs with rstrip and adds a trailing comma.
Note: if the first tab occurs anywhere other than directly after the first word, it's not replaced; i.e. 'a bold, italic, teletype' is left unchanged. Is that what you want?
I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no simple "not" in REs: ^ inside [ ] only negates a single character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you deal with whole words and want to exclude some of them, you need to remember that most regex engines (most of them use a traditional NFA) analyze strings character by character. Since you want to exclude two words with a negative lookahead, you need to define the allowed strings as words (using word boundaries). And since sub replaces each matched pattern with its replacement string, you can't just pass '_': a part like 'I am' would then be replaced with 3 underscores (one each for 'I', ' ' and 'am'). Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
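Another way to get the same result, sketched here as an alternative rather than as either answerer's method, is to let an alternation try the protected words first and fall back to a single character. The helper name keep_or_mask is made up for illustration. Note that . does not match newlines unless re.S is passed:

```python
import re

s = 'I am going home now, thank you.'

def keep_or_mask(match):
    # Group 1 is set only when one of the protected words matched;
    # otherwise the match is a single other character, which is masked.
    return match.group(1) or '_'

masked = re.sub(r'(going|you)|.', keep_or_mask, s)
print(masked)
```

Like the first solution above, this keeps "going" and "you" wherever they occur, including inside longer words.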
I have some lines stored in a txt file like this.
A () Definition of A
Aardvark (n.) Definition of Aardvark
These definitions are just placeholders for now, since the actual definitions are long.
So I want to split these strings with ("any word here") as a delimiter. But my code doesn't seem to work.
import re
f1=open("A-shortened.txt","r")
l=f1.readlines()
#print l
for i in l:
    list=re.split("\(()')",i)
    print list
The part inside re.split is confusing me a lot. I've tried many arguments, but none seem to split the string as I want.
Any help is appreciated.
The delimiter must match the parts of the string that you want to be removed in between. It's a little tricky because ( and ) are the group chars, but here you go:
text = 'Aardvark (n.) Definition of Aardvark'
re.split(r'\s*\([^)]*\)\s*', text)
r'' prevents any accidental backslash escapes. It does not matter here, but better safe than sorry.
\( and \) match literal ( and ).
[^)] matches anything but right brackets. Note: no backslash escape here.
\s* makes any spaces before and after the brackets part of the delimiter, so that your split parts do not start with spaces. Simplify to r'\([^)]*\)' if you want those spaces.
Change the regex to something like this:
>>> s = 'Aardvark (n.) Definition of Aardvark'
>>> re.split(r'\s*\(.*?\)\s*', s)
['Aardvark', 'Definition of Aardvark']
'\s*\(.*?\)\s*' basically means: match any spaces, followed by anything enclosed in ( and ), along with any spaces after that.
You'll want to split by an open paren, any number of non-close paren characters, and a close paren. So your regex should look like:
split_by = re.compile(r"""
\( # literal open paren
[^)]*? # anything but a close paren, repeated 0+ times (non-greedy)
\) # literal close paren""", re.X)
s = "Aardvark (n.) Definition of Aardvark"
split_by.split(s) # ["Aardvark ", " Definition of Aardvark"]
If you'd like to grab those spaces too:
re.split(r"\s*\([^)]*?\)\s*", s)
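If the part in parentheses should be kept rather than thrown away, a capturing group around it makes re.split return it as well. This is a minimal sketch, assuming every line contains at least one parenthesised tag; the helper name split_entry is made up for illustration:

```python
import re

def split_entry(line):
    # The capturing group around the parenthesised part makes re.split keep it;
    # maxsplit=1 leaves any later parentheses inside the definition alone.
    # strip() removes the trailing newline that readlines() leaves on each line.
    word, tag, definition = re.split(r'\s*(\([^)]*\))\s*', line.strip(), maxsplit=1)
    return word, tag, definition

for line in ['A () Definition of A\n', 'Aardvark (n.) Definition of Aardvark\n']:
    print(split_entry(line))
```

A line without any parentheses would make the three-way unpacking fail, so real data may need a check for that case.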
Given a string like this:
ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",
With regex, how do I get a tuple that looks like the following:
('ORTH', ['cali.ber,kl','calf','done'])
I've been doing it as such:
txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1
But re.search() returns None, so .group(1) fails. How should it be done to get the desired output?
The reason you don't get a match is that your regex doesn't match:
r"<([A-Za-z0-9_]+)>" is missing comma, quotation marks and the space character, which all can occur inside the < > according to your sample.
This one would match:
re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)
What also may trip you up is the fact that the list of names is delimited by comma, which itself can be part of the values, unescaped.
That means you can't just split that string by ',', but instead need to consider the two different quotation characters(' and " ) in order to separate the fields.
So I'd use this approach:
Use re.match to split the string into PREFIX < NAMES > parts, and discard the rest.
Use re.findall() to split the names into fields according to quotation marks
Edit:
1) According to your first comment, your data can also contain a preamble before the prefix that contains newlines. The default behavior for . is to match everything except newlines.
From the Python re docs:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
So you need to construct that regex with the re.DOTALL flag. You do this by compiling it first and passing the ORed flags:
re.compile(pattern, flags=re.DOTALL)
2) If you include the space character before PREFIX in the regex, it will only match for data that actually contains that space - but not anymore for your first piece of example data. So I use .*?([A-Z\.]*)... to cover both cases. The ? is for non-greedy matching, so it matches the shortest possible match instead of the longest.
3) To cover PREFIX.FOO just extend the pattern for the prefix to ([A-Z\.]*) by including the . character and escaping it.
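As a quick illustration of point 1), here is a small sketch of how re.DOTALL changes what . matches. The sample text is made up for illustration:

```python
import re

text = 'first line\n< value >'

# Without the flag, '.' stops at the newline, so a match anchored at the start fails.
without_flag = re.match(r'.*< (.*) >', text)

# With re.DOTALL, '.' also crosses the newline and the bracketed part is found.
with_flag = re.match(r'.*< (.*) >', text, flags=re.DOTALL)

print(without_flag)        # None
print(with_flag.group(1))  # value
```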
Updated example covering all the cases you mentioned:
import re
TEST_VALUES = [
"""ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",""",
"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""
]
EXPECTED = ('ORTH.FOO', ['cali.ber,kl','calf','done'])
pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)
for value in TEST_VALUES:
    prefix, names_str = pattern.match(value).groups()
    names = re.findall('[\'"](.*?)["\']', names_str)
    result = prefix, names
    assert(result == EXPECTED)
    print(result)
I want to print the lines between specific string, my string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string I want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can I do this?
I tried to find the lines between two ##start lines and search them for main.
Try something like:
def get_mains(my_string):
    section = ''
    for line in my_string.split('\n'):
        if line[0:7] == "##start":
            section = line
            continue
        if 'main' in line:
            yield '/'.join([section, line])
for main in get_mains(my_string):
    print(main)
There is a way to do this with regular expressions ("regex" for short), which Python provides through the re module.
Basically, regex is this whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it would match the regex pattern 'llo, Wor', because it contains an ell followed by an ell followed by an o followed by a comma and a space and a capital double-you and so on. On the surface it just looks like a substring test. The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special character that stands for any letter in the alphabet (plus a few extras). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these 'special characters' and they are all very useful. To actually answer your question,
I constructed a pattern that matches
##start/textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about/main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double the backslash in \n). Then, everything in parentheses becomes a group (except the (?:...) parts, which are non-capturing). Groups are pieces of text that we want to be able to recall later. There are two groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*, which means 'match any thing that is a complete line, and match any number of them'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1] # don't want the trailing \n
    string += '/'
    string += match[1]
    print(string)
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
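Note that the second line is not quite what the question asked for: the lazy {line}*? is free to run across a later ##start line, so the main line from the ##start/info/version section gets attached to ##start/new. One possible fix, sketched here as an assumption rather than as part of the original answer, is a negative lookahead so that the skipped lines (and the main line itself) never start with ##start:

```python
import re

my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
'''

# (?!##start) refuses to skip over (or end on) a line that opens a new section,
# so each match stays inside the section it started in.
pattern = re.compile(r'(##start.*)\n(?:(?!##start).*\n)*?(?!##start)(.*main)')

results = ['/'.join(match) for match in pattern.findall(my_string)]
for line in results:
    print(line)
```

With the sample data this prints the two lines requested in the question; the ##start/new section contains no main line and is skipped.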
So regex is pretty cool, and other languages have it too. It is a very powerful tool, and you should learn how to use it.
Also just a side note, I use the .format function, because I think it looks much cleaner and easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') simply evaluates to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world