Python regex: show all characters between span

I want to view everything between span=(179, 331). How can I display this? Thanks in advance.
new_v1 = re.compile(r'Sprzedawca:')
new_v2 = re.compile(r'lp ')
print(new_v1.search(txt))
print(new_v2.search(txt))
Output:
<re.Match object; span=(179, 199), match='Sprzedawca: Nabywca:'>
<re.Match object; span=(328, 331), match='lp '>

This is an example how to match text between start and stop. I chose a simple text, adjust the regexp for your needs:
import re
RE = re.compile(r'(?:Start)(.*)(?:End)')
# re.compile(..., flags=re.DOTALL) to match also newlines
match = RE.search('testStartTextBetween123ABCxyzEndtest')
if match:
    print(match.group(1)) # TextBetween123ABCxyz
There are three groups in the regexp (groups are in parentheses). The first one matches the start-of-text mark, the second one matches everything in between, and the last one matches the end-of-text mark.
The (?: notation marks a non-capturing group: what it matches is not saved. Only the middle group is saved as the first (and only) captured subgroup, which corresponds to match.group(1).
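One caveat may be worth illustrating: (.*) is greedy, so if the end mark occurs more than once it runs up to the last occurrence, and without re.DOTALL it stops at newlines. A minimal sketch, assuming made-up sample strings (the strings and variable names below are illustrative, not from the question):

```python
import re

# Hypothetical sample with two Start/End pairs on one line.
text = "Start one End ... Start two End"

greedy = re.findall(r'Start(.*)End', text)   # runs to the LAST "End"
lazy = re.findall(r'Start(.*?)End', text)    # stops at the NEAREST "End"
print(greedy)  # [' one End ... Start two ']
print(lazy)    # [' one ', ' two ']

# Without re.DOTALL, '.' does not match '\n', so nothing is found here.
multiline = "Start\nline\nEnd"
print(re.search(r'Start(.*)End', multiline))  # None
print(repr(re.search(r'Start(.*)End', multiline, re.DOTALL).group(1)))  # '\nline\n'
```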

The function call new_v1.search(txt) returns a match object which has various attributes. You can call its methods to retrieve various facts about the match and the matched text. The simplest way to pull out the text which matched is probably
print(new_v1.search(txt).group(0))
but you could certainly also pull out the start and end attributes and extract the span yourself:
matched = new_v1.search(txt)
print(txt[matched.start():matched.end()])
Demo: https://ideone.com/CFXjx2
Of course, with a trivial regex, the matched text will be exactly the regex itself; perhaps you are actually more interested in where exactly in the string the match was found.
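To get everything from the start of the first match to the end of the second, which is what the question seems to ask for, the two match objects can be combined in one slice. A minimal sketch, assuming a made-up txt (the real text is not shown in the question) and reusing the names new_v1 and new_v2:

```python
import re

# Hypothetical stand-in for the asker's text.
txt = "Faktura 1/2020 Sprzedawca: Nabywca: ACME Sp. z o.o. lp nazwa"

new_v1 = re.compile(r'Sprzedawca:')
new_v2 = re.compile(r'lp ')

start = new_v1.search(txt)
# Look for the end marker only after the start marker.
stop = new_v2.search(txt, start.end()) if start else None

if start and stop:
    # Slice from the beginning of the first match to the end of the second.
    between = txt[start.start():stop.end()]
    print(between)
```

Slicing with start.end() and stop.start() instead would exclude the two markers themselves.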

Related

Match all identifiers in a string

Problem:
I am looking for a way to match certain identifiers in a given line
that starts with certain words. The ID consists of
characters, possibly followed by digits, followed by a dash then some
more digits. An ID should only be matched on lines where the
starting word is one of the following: Closes, Fixes, Resolves. If a
line contains more than one ID, they will be separated by
the string " and ". Any number of IDs can be present on a
line.
Example Test String:
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
What I tried:
Using regular expressions to get all the matches, I always come up short in some regard. For example, one of the regexps I tried is this:
^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*
I intended to have a non-capturing group where the line needs to
start with one of the allowed words, followed by a single space:
^(?:Closes|Fixes|Resolves)
Then at least one ID needs to follow the starting word,
which I intend to capture: (\w+-\d+)
Finally, zero or more IDs can follow the first one, which are
separated by the string " and ", but I only want to capture the
IDs here, not the separator: (?:(?: and )(\w+-\d+))*
Result of this regexp in python:
test_string = """
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
"""
import re
ids = []
for match in re.findall(r"^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*", test_string, re.M):
    for group in match:
        if group:
            ids.append(group)
print(ids)
['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-44']
Also, here is the result on regex101.com. If more than one ID follows the initial one, unfortunately it only captures the last match, not all of them. I read that a repeated capturing group will only capture the last iteration, and I should put a capturing group around the repeated group to capture all iterations, but I couldn't make it work.
Summary:
Is there a solution for this with regular expressions, something similar to what I tried but which captures all the occurrences of the IDs? Or is there a better way to parse this string for the IDs, using Python?
You could use a single capturing group that matches the first ID and then repeats the same pattern zero or more times, each repetition preceded by " and " (space, and, space).
The values are in group 1.
To get the separate values, split group 1 on " and ".
^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)
Regex demo
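A minimal sketch of how that pattern could be used from Python with the standard re module; the sample string and variable names are illustrative, not taken from the question:

```python
import re

pattern = re.compile(
    r'^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)',
    re.M,  # let ^ match at the start of every line
)

# Hypothetical sample input.
sample = (
    "Closes PD-1\n"
    "Fixes PD-21 and PD-22\n"
    "Related to PD-2\n"
    "Resolves PD4-41 and PD4-42 and PD4-43"
)

ids = []
for m in pattern.finditer(sample):
    # Group 1 holds e.g. "PD-21 and PD-22"; split it into single IDs.
    ids.extend(m.group(1).split(' and '))

print(ids)  # ['PD-1', 'PD-21', 'PD-22', 'PD4-41', 'PD4-42', 'PD4-43']
```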
It might be easier with the two-stage approach, such as:
import re

def get_matches(test): #assume test is a list of strings
    regex1 = re.compile(r'^(?:Closes|Fixes|Resolves) \w+-\d+')
    regex2 = re.compile(r'\w+-\d+')
    results = []
    for line in test:
        if regex1.search(line):
            results.extend(regex2.findall(line))
    return results
gives:
['PD-1','PD-21','PD-22','PD-31','PD-32',
'PD-33','PD4-41','PD4-42','PD4-43','PD4-44']
If you need to work with repeated capturing groups, you should install the PyPI regex module with pip install regex and use
import regex
test_string = "your string here"
ids = []
for match in regex.finditer(r"^(?:Closes|Fixes|Resolves) (?P<id>\w+-\d+)(?:(?: and )(?P<id>\w+-\d+))*", test_string, regex.M):
    ids.extend(match.captures("id"))
print(ids)
# => ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
See the Python demo
The capture stack for each group is accessible via match.captures(X).
The regex you have is fine to use as is, but it is more user-friendly with a named capturing group here.

How to parse parameters from text?

I have a text that looks like:
ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c,
,second d), third, fourth)
The engine can differ (instead of CollapsingMergeTree there can be another word: ReplacingMergeTree, SummingMergeTree...), but the text is always in the format ENGINE = word (). There can be spaces around the "=" sign, but they are not mandatory.
Inside the parentheses are several parameters, usually a single word followed by a comma, but some parameters are themselves in parentheses, like the second one in the example above.
Line breaks can be anywhere. A line can end with a comma, a parenthesis, or anything else.
I need to extract n parameters (I don't know how many in advance). In the example above, there are 4 parameters:
first = first_param
second = (second_a, second_b, second_c, second_d) [extract with parenthesis]
third = third
fourth = fourth
How to do that with python (regex or anything else)?
You'd probably want to use a proper parser (and so look up how to hand-roll a parser for a simple language) for whatever language that is, but since what little you show here looks Python-compatible you could just parse it as if it were Python using the ast module (from the standard library) and then manipulate the result.
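A minimal sketch of the hand-rolled route, assuming the input follows the ENGINE = word (...) shape from the question: it walks the characters after the opening parenthesis, tracks nesting depth, and splits only on commas at the top level. The function name split_top_level is hypothetical, and the sample uses a cleaned-up version of the question's text.

```python
import re

def split_top_level(text):
    """Return the top-level parameters of `ENGINE = Name(...)`, whitespace removed."""
    head = re.search(r'ENGINE\s*=\s*\w+\s*\(', text)
    if not head:
        return []
    params, current, depth = [], [], 1  # depth 1: inside the outer parentheses
    for ch in text[head.end():]:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:      # closing parenthesis of the ENGINE call
                break
        if ch == ',' and depth == 1:
            params.append(''.join(current))   # a top-level comma ends a parameter
            current = []
        else:
            current.append(ch)
    params.append(''.join(current))
    # Drop all whitespace, including newlines, inside each parameter.
    return [re.sub(r'\s+', '', p) for p in params]

text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''

print(split_top_level(text))
# ['first_param', '(second_a,second_b,second_c,second_d)', 'third', 'fourth']
```

Unlike a fixed regex, this does not need to know the engine name or the number of parameters in advance.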
I came up with a regex solution for your problem. I tried to keep the regex pattern as 'generic' as I could, because I don't know if there will always be newlines and whitespace in your text, which means the pattern selects a lot of whitespace, which is then removed afterwards.
#Import the module for regular expressions
import re
#Text to search. I CORRECTED IT A BIT AS YOUR EXAMPLE SAID second d AND second_c WAS FOLLOWED BY TWO COMMAS. I am assuming those were typos.
text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''
#Regex search pattern. re.S means . which represents ANY character, includes \n (newlines)
pattern = re.compile(r'ENGINE = CollapsingMergeTree \((.*?),\((.*?)\),(.*?), (.*?)\)', re.S)
#Apply the pattern to the text and save the results in variable 'result'. result[0] would return the whole matched text.
#The items you want are sub-expressions which are enclosed in parentheses () and can be accessed by using result[1] and above
result = re.match(pattern, text)
#result[1] will get everything after the parenthesis following CollapsingMergeTree until it reaches a , (comma), but with whitespace and newlines. re.sub is used to replace all whitespace, including newlines, with nothing
first = re.sub(r'\s', '', result[1])
#result[2] will get second a-d, but with whitespace and newlines. re.sub is used to replace all whitespace, including newlines, with nothing
second = re.sub(r'\s', '', result[2])
third = re.sub(r'\s', '', result[3])
fourth = re.sub(r'\s', '', result[4])
print(first)
print(second)
print(third)
print(fourth)
OUTPUT:
first_param
second_a,second_b,second_c,second_d
third
fourth
Regex explanation:
\ = Escapes a metacharacter, which is a character regex would otherwise interpret to mean something special.
\( = Escape parentheses
() = Mark the expression in the parentheses as a sub-group. See result[1] and so on.
. = Matches any character (including newline, because of re.S)
* = Matches 0 or more occurrences of preceding expression.
? = Matches 0 or 1 occurrence of preceding expression.
NOTE: *? combined is called a non-greedy (lazy) repetition, meaning the preceding expression is matched as few times as possible while still letting the rest of the pattern match, instead of as many times as possible.
I am no expert, but I hope I got the explanations right.
I hope this helps.

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
You need to use re.search, since re.match only tries to match from the beginning of the string. To match until a space or period is encountered:
re.search(r'(?<=test :)[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
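A short sketch of how these calls behave on the sample string from the question. One thing to watch, if I read the patterns correctly: with \s excluded, the match is empty here, because a space directly follows "test :".

```python
import re

text = "test : match this."

# re.match anchors at index 0, where the lookbehind cannot succeed.
print(re.match(r'(?<=test :).*', text))  # None

# re.search scans the whole string and stops before the period.
until_period = re.search(r'(?<=test :)[^.]*', text)
print(repr(until_period.group(0)))  # ' match this'

# Excluding whitespace too yields an empty match on this sample,
# since the character right after "test :" is a space.
until_space = re.search(r'(?<=test :)[^.\s]*', text)
print(repr(until_space.group(0)))  # ''
```

Calling .strip() on the result, or consuming the whitespace in the pattern itself (as test\s*:\s* does), avoids the leading space.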
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. including the trailing period. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, it is awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
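The generalisation described above could look roughly like this. It is a sketch only: the helper name text_after_prefix and its stop parameter are hypothetical, and it strips surrounding whitespace, which the one-liner above does not.

```python
def text_after_prefix(line, prefix, stop='.'):
    """Return the text between `prefix` and the first `stop`, or None if no prefix."""
    if not line.startswith(prefix):
        return None
    # Search for the stop character only after the prefix.
    end = line.find(stop, len(prefix))
    if end == -1:
        end = len(line)  # no stop character: take the rest of the line
    return line[len(prefix):end].strip()

for line in ["test: match this.", "other: ignore this.", "test: no period"]:
    print(text_after_prefix(line, "test:"))
```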

Extract text between double square brackets in Python

If I have a string that may look like this:
"[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
How do I extract the categories and put them into a list?
I'm having a hard time getting the regular expression to work.
To expand on the explanation of the regex used by Avinash in his answer:
Category:([^\[\]]*) consists of several parts:
Category: which matches the text "Category:"
(...) is a capture group meaning roughly "the expression inside this group is a block that I want to extract"
[^...] is a negated set which means "do not match any characters in this set".
\[ and \] match "[" and "]" in the text respectively.
* means "match zero or more of the preceding regex defined items"
Where I have used ... to indicate that I removed some characters that were not important for the explanation.
So putting it all together, the regex does this:
Finds "Category:" and then matches any number (including zero) characters after that that are not the excluded characters "[" or "]". When it hits an excluded character it stops and the text matched by the regex inside the (...) part is returned. So the regex does not actually look for "[[" or "]]" as you might expect and so will match even if they are left out. You could force it to look for the double square brackets at the beginning and end by changing it to \[\[Category:([^\[\]]*)\]\].
For the second regex, Category:[^\[\]]*, the capture group (...) is excluded, so Python returns everything matched which includes "Category:".
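To make the difference between the loose and the bracket-anchored pattern concrete, here is a small sketch. The extra "Category:Loose" at the end of the sample is made up for illustration:

```python
import re

# The question's string plus one hypothetical unbracketed entry.
s = ("[[Category:Political culture]]\n\n"
     " [[Category:Political ideologies]]\n\n"
     "Category:Loose")

loose = re.findall(r'Category:([^\[\]]*)', s)
strict = re.findall(r'\[\[Category:([^\[\]]*)\]\]', s)

print(loose)   # ['Political culture', 'Political ideologies', 'Loose']
print(strict)  # ['Political culture', 'Political ideologies']
```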
Seems like you want something like this,
>>> str = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
>>> re.findall(r'Category:([^\[\]]*)', str)
['Political culture', 'Political ideologies']
>>> re.findall(r'Category:[^\[\]]*', str)
['Category:Political culture', 'Category:Political ideologies']
When the pattern contains a capturing group, re.findall returns only the text captured by that group. If the pattern has no capturing group, findall returns the full matches as a list. In our case, Category: matches the literal string Category: and ([^\[\]]*) captures zero or more characters other than [ or ]. findall therefore returns the contents of group 1.
Python code:
s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
cats = [line.strip().strip("[").strip("]") for line in s.splitlines() if line]
print(cats)
Output:
['Category:Political culture', 'Category:Political ideologies']

Python regex: how to check if a character in a string is within the span of a regex matched substring?

I have a regex pattern, which I used on a large piece of text (a single string). Several discontiguous regions of the original text match the regexp. Now I'm attempting to build a state machine that iterates over the text and does different things based on the char at a position, and on whether this position is within the span of a regex match.
With RE.finditer(text), I can find all substrings, and extract their spans, thus I have a list of tuples to work with e.g.
(1, 5)
(10, 15)
(20, 55),
etc.
With this information, given the index of a character in my string, I can write an algorithm to see if that character is part of a regex match. For example, given character 6, I can go through the list of spans and determine that it is not part of a matched substring.
Is there a better way of doing this?
Thanks in advance,
JW
EDIT: It sounds like you want to write your own parser FSM which (among other things) tokenizes comma characters, only when they are not escaped.
The following regex works for an identifier, possibly containing escaped commas. You could use this with antlr/lex:
import re
text = r'aaaaa,bbbb/,ccccc,dddddd,'
pat = re.compile(r'((\w+|/,)+)')
for mat in pat.finditer(text):
    print(mat.group(0))  # do stuff with mat.group(0)
Original answer:
That could be a good solution, but you're not giving us enough context to tell.
Does the character occur once or multiple times? If it occurs once, you could just check whether the index from string.find(char) lies inside the spans of the regex matches.
Is the character an arbitrary character? Can you give us a specific example?
Why are you doing this on a per-character basis? Presumably you're not sequentially checking multiple chars?
Is your desired result boolean ('Yes, char was found inside the span of some regex match')? And what do you do when the char is found outside a regex match?
Edit
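Here's a regex which will grab the text between commas, ignoring escaped commas (/,). It uses a lookbehind for the preceding comma and a lookahead for the following one:
(?<=,)(?:[^,]|(?<=/),)*(?=,)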
Original Answer
Here is a Python sketch that should do what you're looking for:
def classify_chars(pattern, haystack):
    # pattern is a compiled regex; yields (index, char, inside_match)
    pos = 0
    for match in pattern.finditer(haystack):
        for i in range(pos, match.start()):
            # These chars are outside the match.
            yield i, haystack[i], False
        for i in range(match.start(), match.end()):
            # These chars are in the match.
            yield i, haystack[i], True
        pos = match.end()
    # Finish with the rest of the chars not matched
    for i in range(pos, len(haystack)):
        # These chars are outside the match.
        yield i, haystack[i], False
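For the random-access variant described in the question (given an index, decide whether it lies inside any match span), a binary search over the sorted spans is one option, since finditer yields non-overlapping spans in ascending order. A minimal sketch; the helper name in_match is hypothetical and the spans are the ones listed in the question:

```python
import bisect

def in_match(spans, starts, index):
    """True if `index` falls inside one of the sorted, non-overlapping spans."""
    # Find the last span whose start is <= index.
    k = bisect.bisect_right(starts, index) - 1
    # Span ends are exclusive, as with match.span().
    return k >= 0 and index < spans[k][1]

# In practice: spans = [m.span() for m in pattern.finditer(text)]
spans = [(1, 5), (10, 15), (20, 55)]
starts = [start for start, _ in spans]

print(in_match(spans, starts, 6))   # False
print(in_match(spans, starts, 12))  # True
```

Each lookup is O(log n) in the number of spans, rather than a linear scan of the list.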
