Extract text between double square brackets in Python

Extract text between double square brackets in Python - python

If I have a string that may look like this:
"[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
How do I extract the categories and put them into a list?
I'm having a hard time getting the regular expression to work.

To expand on the explanation of the regex used by Avinash in his answer:
Category:([^\[\]]*) consists of several parts:
Category: which matches the text "Category:"
(...) is a capture group meaning roughly "the expression inside this group is a block that I want to extract"
[^...] is a negated set which means "do not match any characters in this set".
\[ and \] match "[" and "]" in the text respectively.
* means "match zero or more of the preceding regex defined items"
Where I have used ... to indicate that I removed some characters that were not important for the explanation.
So putting it all together, the regex does this:
Finds "Category:" and then matches any number (including zero) characters after that that are not the excluded characters "[" or "]". When it hits an excluded character it stops and the text matched by the regex inside the (...) part is returned. So the regex does not actually look for "[[" or "]]" as you might expect and so will match even if they are left out. You could force it to look for the double square brackets at the beginning and end by changing it to \[\[Category:([^\[\]]*)\]\].
For the second regex, Category:[^\[\]]*, the capture group (...) is excluded, so Python returns everything matched which includes "Category:".

Seems like you want something like this,
>>> str = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
>>> re.findall(r'Category:([^\[\]]*)', str)
['Political culture', 'Political ideologies']
>>> re.findall(r'Category:[^\[\]]*', str)
['Category:Political culture', 'Category:Political ideologies']
By default re.findall will print only the strings which are matched by the pattern present inside a capturing group. If no capturing group was present, then only the findall function would return the matches in list. So in our case , this Category: matches the string category: and this ([^\[\]]*) would capture any character but not of [ or ] zero or more times. Now the findall function would return the characters which are present inside the group index 1.

Python code:
s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
cats = [line.strip().strip("[").strip("]") for line in s.splitlines() if line]
print(cats)
Output:
['Category:Political culture', 'Category:Political ideologies']

Related

How to match and replace this pattern in Python RE?

s = "[abc]abx[abc]b"
s = re.sub("\[([^\]]*)\]a", "ABC", s)
'ABCbx[abc]b'
In the string, s, I want to match 'abc' when it's enclosed in [], and followed by a 'a'. So in that string, the first [abc] will be replaced, and the second won't.
I wrote the pattern above, it matches:
match anything starting with a '[', followed by any number of characters which is not ']', then followed by the character 'a'.
However, in the replacement, I want the string to be like:
[ABC]abx[abc]b . // NOT ABCbx[abc]b
Namely, I don't want the whole matched pattern to be replaced, but only anything with the bracket []. How to achieve that?
match.group(1) will return the content in []. But how to take advantage of this in re.sub?

Why not simply include [ and ] in the substitution?
s = re.sub("\[([^\]]*)\]a", "[ABC]a", s)

There exist more than 1 method, one of them is exploting groups.
import re
s = "[abc]abx[abc]b"
out = re.sub('(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in brackets) in first argument of re.sub, then I refer to 1st and 3rd (note indexing starts at 1) so they remain unchanged, instead of 2nd group I put ABC. Second argument of re.sub is raw string, so I do not need to escape \.

This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.

How to parse parameters from text?

I have a text that looks like:
ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c,
,second d), third, fourth)
Engine can be different (instead of CollapsingMergeTree, there can be different word, ReplacingMergeTree, SummingMergeTree...) but the text is always in format ENGINE = word (). Around "=" sign, can be space, but it is not mandatory.
Inside parenthesis are several parameters usually a single word and comma, but some parameters are in parenthesis like second in the example above.
Line breaks could be anywhere. Line can end with comma, parenthesis or anything else.
I need to extract n parameters (I don't know how many in advance). In example above, there are 4 parameters:
first = first_param
second = (second_a, second_b, second_c, second_d) [extract with parenthesis]
third = third
fourth = fourth
How to do that with python (regex or anything else)?

You'd probably want to use a proper parser (and so look up how to hand-roll a parser for a simple language) for whatever language that is, but since what little you show here looks Python-compatible you could just parse it as if it were Python using the ast module (from the standard library) and then manipulate the result.

I came up with a regex solution for your problem. I tried to keep the regex pattern as 'generic' as I could, because I don't know if there will always be newlines and whitespace in your text, which means the pattern selects a lot of whitespace, which is then removed afterwards.
#Import the module for regular expressions
import re
#Text to search. I CORRECTED IT A BIT AS YOUR EXAMPLE SAID second d AND second_c WAS FOLLOWED BY TWO COMMAS. I am assuming those were typos.
text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''
#Regex search pattern. re.S means . which represents ANY character, includes \n (newlines)
pattern = re.compile('ENGINE = CollapsingMergeTree \((.*?),\((.*?)\),(.*?), (.*?)\)', re.S) #ENGINE = CollapsingMergeTree \((.*?),\((.*?)\), (.*?), (.*?)\)
#Apply the pattern to the text and save the results in variable 'result'. result[0] would return whole text.
#The items you want are sub-expressions which are enclosed in parentheses () and can be accessed by using result[1] and above
result = re.match(pattern, text)
#result[1] will get everything after theparenteses after CollapsingMergeTree until it reaches a , (comma), but with whitespace and newlines. re.sub is used to replace all whitespace, including newlines, with nothing
first = re.sub('\s', '', result[1])
#result[2] will get second a-d, but with whitespace and newlines. re.sub is used to replace all whitespace, including newlines, with nothing
second = re.sub('\s', '', result[2])
third = re.sub('\s', '', result[3])
fourth = re.sub('\s', '', result[4])
print(first)
print(second)
print(third)
print(fourth)
OUTPUT:
first_param
second_a,second_b,second_c,second_d
third
fourth
Regex explanation:
\ = Escapes a control character, which is a character regex would interpret to mean something special. More here.
\( = Escape parentheses
() = Mark the expression in the parentheses as a sub-group. See result[1] and so on.
. = Matches any character (including newline, because of re.S)
* = Matches 0 or more occurrences of preceding expression.
? = Matches 0 or 1 occurrence of preceding expression.
NOTE: *? combined is called a nongreedy repetition, meaning the preceding expression is only matched once, instead of over and over again.
I am no expert, but I hope I got the explanations right.
I hope this helps.

Python regex match only if standalone

Using re in python3, I want to match appearances of percentages in text, and substitute them with a special token (e.g. substitute "A 30% increase" by "A #percent# increase").
I only want to match if the percent expression is a standalone item. For example, it should not match "The product's code is A322%n43%". However, it should match when a line contains only one percentage expression like "89%".
I've tried using delimiters in my regex like \b, but because % is itself a non-alphanumeric character, it doesn't catch the end of the expression. Using \s makes it impossible to catch expression standing by themselves in a line.
At the moment, I have the code:
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "1,211.21%")
' #percent '
which still matches if the expression is followed by letters or other text (like the product code example above).
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "EEE1,211.21%asd")
'EEE #percent# asd'
What would you recommend?

Looks like a perfect job for Negative Lookbehind and Negative Lookahead:
re.sub(r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
'#percent#', string, flags=re.VERBOSE)
(?<![^\s]) means "no space immediately before the current position is allowed" (add more forbidden characters if you need).
(?![^\s.,;!?'"]) means "no space, period, etc. immediately after the current position are allowed".
Demo: https://regex101.com/r/khV7MZ/1.

Try putting "first" capture group with a "second".
original: r"[+-]?[.,;]?(\d+[.,;']?)+%"
suggestd: r"[+-]?[.,;]?((\d+[.,;']?)+%)\b"

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.

Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
stuff, word = m.groups()
return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print re.sub(r'(.+?)(going|you|$)', subit, s)
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.

Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with words and you want to exclude them or etc. you need to remember that most of the regex engines (most of them use traditional NFA) analyze the strings by characters. And here since you want to exclude two word and want to use a negative lookahead you need to define the allowed strings as words (using word boundary) and since in sub it replaces the matched patterns with it's replace string you can't just pass the _ because in that case it will replace a part like I am with 3 underscore (I, ' ', 'am' ). So you can use a function to pass as the second argument of sub and multiply the _ with length of matched string to be replace.

Regexp Word within a word with a fullstop

I'm having trouble matching a string with regexp (I'm not that experienced with regexp). I have a string which contains a forward slash after each word and a tag. An example:
led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION
In those strings, I am only interested in all strings that precede /PERSON. Here's the regexp pattern that I came up with:
(\w)*\/PERSON
And my code:
match = re.findall(r'(\w)*\/PERSON', string)
Basically, I am matching any word that comes before /PERSON. The output:
>>> reg
['Timothy', '', 'Geithner']
My problem is that the second match, matched to an empty string as for R./PERSON, the dot is not a word character. I changed my regexp to:
match = re.findall(r'(\w|.*?)\/PERSON', string)
But the match now is:
['led/O by/O Timothy', ' R.', ' Geithner']
It is taking everything prior to the first /PERSON which includes led/O by/O instead of just matching Timothy. Could someone please help me on how to do this matching, while including a full stop as an abbreviation? Or at least, not have an empty string match?
Thanks,

Match everything but a space character ([^ ]*). You also need the star (*) inside the capture:
match = re.findall(r'([^ ]*)\/PERSON', string)

Firstly, (\w|.) matches "a word character, or any character" (dot matches any character which is why you're getting those spaces).
Escaping this with a backslash will do the trick: (\w|\.)
Second, as #Ionut Hulub points out you may want to use + instead of * to ensure you match something but Regular Expressions work on the principle of "leftmost, longest" so it'll always try to match the longest part that it can before the slash.
If you want to match any non-whitespace character you can use \S instead of (\w|\.), which may actually be what you want.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract text between double square brackets in Python - python

If I have a string that may look like this: "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n" How do I extract the categories and put them into a list? I'm having a hard time getting the regular expression to work.

Python code: s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n" cats = [line.strip().strip("[").strip("]") for line in s.splitlines() if line] print(cats) Output: ['Category:Political culture', 'Category:Political ideologies']

Related

How to match and replace this pattern in Python RE?

How to parse parameters from text?

Python regex match only if standalone

Python regular expression to replace everything but specific words

Regexp Word within a word with a fullstop

Categories

Resources