python regex mediawiki section parsing - python

I have text similar to the following:
==Mainsection1==
Some text here
===Subsection1.1===
Other text here
==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here.
In the above text, Mainsection1 and Mainsection2 have different names, which can be anything the user wants. The same goes for the subsections.
What I want to do with a regex is get the text of a main section, including its subsections (if there are any).
Yes, this is from a wiki page. All main section names start with == and end with ==.
All subsections have more than two = signs in their names.
regex =re.compile('==(.*)==([^=]*)', re.MULTILINE)
regex.findall(text)
But the above returns each section separately.
That is, it returns a main section perfectly, but treats each subsection as a section of its own.
I hope someone can help me with this, as it's been bugging me for some time.
edit:
The result should be:
[('Mainsection1', 'Some text here\n===Subsection1.1===
Other text here\n'), ('Mainsection2', 'Text goes here\n===Subsecttion2.1===
Other text goes here.\n')]
Edit 2:
I have rewritten my code to not use a regex. I came to the conclusion that it's easy enough to just parse it myself. Which makes it a bit more readable for me.
So here is my code:
def createTokensFromText(text):
    sections = []
    cur_section = None
    cur_lines = []
    for line in text.split('\n'):
        line = line.strip()
        if line.startswith('==') and not line.startswith('==='):
            if cur_section:
                sections.append( (cur_section, '\n'.join(cur_lines)) )
                cur_lines = []
            cur_section = line
            continue
        if cur_section:
            cur_lines.append(line)
    if cur_section:
        sections.append( (cur_section, '\n'.join(cur_lines)) )
    return sections
Thanks everyone for the help!
All the answers provided have helped me a lot!

First, it should be known, I know a little about Python, but I have never programmed formally in it... Codepad said this works, so here goes! :D -- Sorry the expression is so complex:
(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
This does what you asked for, I believe! on Codepad, this code:
import re
wikiText = """==Mainsection1==
Some text here
===Subsection1.1===
Other text here
==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here. """
outputArray = re.findall('(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))', wikiText)
print outputArray
Produces this result:
[('Mainsection1', '\nSome text here\n===Subsection1.1===\nOther text here\n\n'), ('Mainsection2', '\nText goes here\n===Subsecttion2.1===\nOther text goes here. ')]
EDIT: Broken down, the expression essentially says:
01 (?<!=) # First, look behind to assert that there is not an equals sign
02 == # Match two equals signs
03 ([^=]+) # Capture one or more characters that are not an equals sign
04 == # Match two equals signs
05 (?!=) # Then verify that there are no equals signs following this
06 ( # Start a capturing group
07 [\s\S]*? # Match zero or more of ANY character (even CrLf), but BE LAZY
08 (?= # Look ahead to verify that either...
09 $ # this is the end of the source text
10 | # -OR-
11 (?<!=) # when I look behind there is no equals sign
12 == # then there are two equals signs
13 [^=]+ # then one or more characters that are not equals signs
14 == # then two equals signs
15 (?!=) # then verify that there are no equals signs following this
16 ) # End look-ahead group
17 ) # End capturing group
Line 03 and Line 06 specify the capturing groups for the Main Section Title and the Main Section Content, respectively.
Line 07 begs for a lot of explanation if you're not pretty fluent in Regex...
The \s and \S inside a character class [] will match anything that is whitespace or is not whitespace (i.e. ANYTHING WHATSOEVER) - one alternative to this is using the . operator, but depending upon your compiler options (or ability to specify options) this might or might not match CrLf (or Carriage-Return/Line-Feed). Since you want to match multiple lines, this is the easiest way to ensure a match.
The *? at the end means that it will match zero or more instances of the "anything" character class, but BE LAZY ABOUT IT - "lazy" quantifiers (sometimes called "reluctant") are the opposite of the default "greedy" quantifier (without the ? following it), and will not consume a source character unless the source that follows it cannot be matched by the part of the expression that follows the lazy quantifier. In other words, this will consume any characters until it finds either the end of the source text OR another main section which is specified by exactly two and only two equals signs on either side of one or more characters that are not an equals sign (including whitespace). Without the lazy operator, it would try to consume the entire source text then "backtrack" until it could match one of the things after it in the expression (end of source or a section header)
Line 08 is a "look-ahead" that specifies that the expression following it should be ABLE to be matched, but should not be consumed.
END EDIT
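To make the greedy/lazy difference described above concrete, here is a minimal sketch. The two-section sample text and the simplified patterns are assumptions chosen for illustration; they are shorter than the full expression and do not handle subsections.

```python
import re

# Hypothetical two-section sample, much smaller than a real wiki page.
text = "==A==\nfoo\n==B==\nbar\n"

# Greedy: [\s\S]* runs to the end of the text, swallowing the second header.
greedy = re.findall(r"==([^=]+)==([\s\S]*)", text)

# Lazy: [\s\S]*? grows one character at a time and stops as soon as the
# look-ahead sees either the end of the text or the next "==".
lazy = re.findall(r"==([^=]+)==([\s\S]*?(?=$|==))", text)

print(greedy)  # one match containing everything after the first header
print(lazy)    # one match per header
```

The greedy version finds a single section because nothing after the quantifier forces it to give characters back, while the lazy version yields one tuple per header.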
AFAIK, it has to be this complex in order to properly exclude the subsections... If you want to match the Section Name and Section Content into named groups, you can try this:
(?<!=)==(?P<SectionName>[^=]+)==(?!=)(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
If you'd like, I can break it down for you! Just ask! EDIT (see edit above) END EDIT
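As a rough sketch of how the named-group version might be used, the snippet below collects the sections into a dictionary. The `wiki_text` sample and the `sections` name are assumptions for illustration, and stripping the surrounding whitespace from the content is a choice made here, not part of the expression itself.

```python
import re

# Sample text mirroring the question (no blank lines between sections).
wiki_text = (
    "==Mainsection1==\n"
    "Some text here\n"
    "===Subsection1.1===\n"
    "Other text here\n"
    "==Mainsection2==\n"
    "Text goes here\n"
    "===Subsecttion2.1===\n"
    "Other text goes here.\n"
)

pattern = re.compile(
    r"(?<!=)==(?P<SectionName>[^=]+)==(?!=)"
    r"(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))"
)

# Map each main section title to its content, subsections included.
sections = {
    m.group("SectionName"): m.group("SectionContent").strip()
    for m in pattern.finditer(wiki_text)
}

for name, content in sections.items():
    print(name)
    print(content)
    print("-" * 20)
```

Named groups make the call site a little easier to read than positional indexes, at the cost of a longer pattern.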

The problem here is that ==(.*)== matches ==(=Subsection=)==, so the first thing to do is to make sure there is no = inside the title : ==([^=]*)==([^=]*).
We then need to make sure that there is no = before the beginning of the match, otherwise, the first = of the three is ignored and the subtitle is matched. This will do the trick : (?<!=)==([^=]*)==([^=]*), it means "Matches if not preceded by ...".
We can also do this at the end to make sure, which gives as a final result (?<!=)==([^=]*)==(?!=)([^=]*).
>>> re.findall('(?<!=)==([^=]*)==(?!=)([^=]*)', x,re.MULTILINE)
[('Mainsection1', '\nSome text here\n'),
('Mainsection2', '\nText goes here\n')]
You could also remove the check at the end of the title and replace it with a newline. That may be better if you are sure there is a new line at the end of each title.
>>> re.findall('(?<!=)==([^=]*)==\n([^=]*)', x,re.MULTILINE)
[('Mainsection1', 'Some text here\n'), ('Mainsection2', 'Text goes here\n')]
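The effect of each refinement above can be checked on the titles alone. This is a small sketch; the sample string `x` is an assumption modelled on the question's text, and the variable names are made up for illustration.

```python
import re

# Sample modelled on the question's text.
x = (
    "==Mainsection1==\nSome text here\n"
    "===Subsection1.1===\nOther text here\n"
    "==Mainsection2==\nText goes here\n"
)

# Original title pattern: ".*" happily includes "=" characters,
# so the subsection is reported with stray "=" around its name.
naive = re.findall(r"==(.*)==", x)

# Forbidding "=" inside the title still matches the subsection,
# because the match can start at the second "=" of "===".
no_equals_in_title = re.findall(r"==([^=]*)==", x)

# The look-behind and look-ahead reject any "=" next to the two markers.
with_lookarounds = re.findall(r"(?<!=)==([^=]*)==(?!=)", x)

print(naive)
print(no_equals_in_title)
print(with_lookarounds)
```

Only the last pattern leaves the subsection title out, which is why both lookarounds are kept in the final expression.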
EDIT :
section = re.compile(r"(?<!=)==([^=]*)==(?!=)")
result = []
mo = section.search(x)
previous_end = 0
previous_section = None
while mo is not None:
    start = mo.start()
    if previous_section:
        result.append((previous_section, x[previous_end:start]))
    previous_section = mo.group(0)
    previous_end = mo.end()
    mo = section.search(x, previous_end)
result.append((previous_section, x[previous_end:]))
print result
It's simpler than it looks: we repeatedly search for the next section title after the previous one, and add the previous section to the result together with the text between the end of its title and the beginning of the new one. Adjust it to suit your style and your needs. The result is:
[('==Mainsection1==',
' \nSome text here \n===Subsection1.1=== \nOther text here \n\n'),
('==Mainsection2==',
' \nText goes here \n===Subsecttion2.1=== \nOther text goes here. ')]

Related

get all the text between two newline characters(\n) of a raw_text using python regex

So I have several examples of raw text from which I have to extract the characters after 'Terms'. The common pattern I see is that the word 'Terms' is followed by a '\n', and there is another '\n' at the end. I want to extract all the characters (words, numbers, symbols) between these two \n, but only after the keyword 'Terms'.
Some examples of text are given below:
1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n
The code I have written is given below:
def get_term_regex(s):
    raw_text = s
    term_regex1 = r'(TERMS\s*\\n(.*?)\\n)'
    try:
        if ('TERMS' or 'Terms') in raw_text:
            pattern1 = re.search(term_regex1,raw_text)
            #print(pattern1)
            return pattern1
    except:
        pass
But I am not getting any output, as there is no match.
The expected output is:
1) Direct deposit; Routing #256078514, acct. #160935
2) Due on receipt
3) NET 30 DAYS
Any help would be really appreciated.
Try the following:
import re
text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines
for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
    print(m.group(2))
Note that your regex could not deal with the third 'line' because there is a colon : after TERMS. So I replaced \s with \W.
('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is just the same as 'TERMS' in raw_text: Python evaluates the parenthesized part first, and since or returns its first truthy operand and every non-empty string is truthy, the result is always 'TERMS'. As a result, 'Terms' can never be picked up by that test!
So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
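The behaviour of `or` described here is easy to check, and a case-insensitive pattern avoids the membership test altogether. The snippet below is a sketch: the helper name `get_terms` and the use of `re.IGNORECASE` are assumptions introduced for illustration, not part of the original answer.

```python
import re

# `or` returns its first truthy operand, so the parenthesized
# expression is always just 'TERMS'.
first_operand = ('TERMS' or 'Terms')

raw_text = "\nTerms\nDue on receipt\nDue Date\n1/31/2021"
only_upper = ('TERMS' or 'Terms') in raw_text                   # misses 'Terms'
either = any(word in raw_text for word in ('TERMS', 'Terms'))   # checks both


def get_terms(raw_text):
    """Return the line following a 'Terms' heading, or None (hypothetical helper)."""
    # re.IGNORECASE covers TERMS, Terms, terms, ... in a single pattern.
    m = re.search(r"terms\W*\n(.*?)\n", raw_text, re.IGNORECASE)
    return m.group(1) if m else None


samples = [
    "\nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n",
    "\nTerms\nDue on receipt\nDue Date\n1/31/2021",
    "\nTERMS: \nNET 30 DAYS\n",
]
for sample in samples:
    print(get_terms(sample))
```

Returning `None` when nothing matches keeps the caller's check explicit instead of hiding failures behind a bare `except`.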

Python 2.7: Matching a subtitle events in VTT subtitles using a regular expression

I'm writing a python script to parse VTT subtitle files.
I am using a regular expression to match and extract specific elements:
'in timecode'
'out timecode'
'other info' (mostly alignment information, like align:middle or line:-1)
subtitle content (the actual text)
I am using Python's 're' module from the standard library, and I am looking for a regular expression that will match all (5) of the below 'subtitle events':
WEBVTT
00:00:00.440 --> 00:00:02.320 align:middle line:-1
Hi.
00:00:03.440 --> 00:00:07.520 align:middle line:-1
This subtitle has one line.
00:00:09.240 --> 00:00:11.080 align:middle line:-2
This subtitle has
two lines.
00:00:15.240 --> 00:00:23.960 align:middle line:-4
Now...
Let's try
four...
lines...
00:00:24.080 --> 00:00:27.080 align:middle
PS: Note that stackoverflow doesn't allow me to add an empty line at the end of the code block. Normally the last 'empty' line will exist because a line break (\r\n or \n). After: 00:00:24.080 --> 00:00:27.080 align:middle
Below is my code. My problem is that I can't figure out a regular expression that will match all of the 'subtitle events' (including the one with an empty line as 'subtitle content').
import re
import io
webvttFileObject = io.open("C:\Users\john.doe\Documents\subtitle_sample.vtt", 'r', encoding = 'utf-8') # opens WebVTT file forcing UTF-8 encoding
textBuffer = webvttFileObject.read()
regex = re.compile(r"""(^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3}) # match TC-IN in group1
[ ]-->[ ] # VTT/SRT style TC-IN--TC-OUT separator
([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3}) # match TC-OUT in group2
(.*)?\n # additional VTT info (like) alignment
(^.+\n)+\n? # subtitle_content """, re.MULTILINE|re.VERBOSE)
subtitle_match_count = 0
for match in regex.finditer(textBuffer):
    subtitle_match_count += 1
    group1, group2, group3, group4 = match.groups()
    tc_in = group1.strip()
    tc_out = group2.strip()
    vtt_extra_info = group3
    subtitle_content = group4
    print "*** subtitle match count: %d ***" % subtitle_match_count
    print "TIMECODE IN".ljust(20), tc_in
    print "TIMECODE OUT".ljust(20), tc_out
    print "ALIGN".ljust(20), vtt_extra_info.strip()
    print "SUBTITLE CONTENT".ljust(20), subtitle_content
    print
I've tried several variations of the regex in the code. All without success. What is also very strange to me is that if I put regex groups in a variable and print them, like I'm doing with this code, I only get the last line as SUBTITLE CONTENT. But I must be doing something wrong (right?). Any help is greatly appreciated.
Thanks in advance.
The reason why your regex doesn't match the last subtitle is here:
(^.+\n)+\n?
The ^.+\n is looking for a line with 1 or more characters. But the last line in the file is empty, so it doesn't match.
The reason why subtitle_content only contains the last line is also there. You're matching each line one by one with (^.+\n)+, i.e. the capture group always captures only a single line. With each matched line, the capture group's previous value is discarded, so in the end all you're left with is the last line. If you want to capture all lines, you have to match them all in one go inside of the capture group, for example like this:
((?:^.+\n)+)
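A quick way to see the difference between the two groupings is to run both on a two-line string. This is a minimal sketch; the sample text and variable names are assumptions for illustration.

```python
import re

text = "This subtitle has\ntwo lines.\n"

# Repeated capture group: each repetition overwrites the previous capture,
# so only the last matched line survives in group 1.
repeated = re.match(r"(^.+\n)+", text, re.MULTILINE).group(1)

# Non-capturing group repeated inside one capture group:
# all lines are captured together.
wrapped = re.match(r"((?:^.+\n)+)", text, re.MULTILINE).group(1)

print(repr(repeated))
print(repr(wrapped))
```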
In order to make the regex work correctly, I've slightly changed the last two lines:
(^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})
[ ]-->[ ]
([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})
([^\n]*)?\n # replaced `.*` with `[^\n]*` here because of the S-modifier
(.*?)(?:\n\n|\Z) # this now captures everything up to 2 consecutive
# newlines or the end of the string
This regex requires the modifiers m (multiline), s (single-line) and of course x (verbose).
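In Python's re module those modifiers correspond to re.MULTILINE, re.DOTALL and re.VERBOSE. The following sketch applies the adjusted expression to a shortened sample; the `vtt_text` sample, the `cues` list and the decision to strip whitespace from the captured groups are assumptions made for this example.

```python
import re

# Shortened sample with \n line endings; the last cue has empty content.
vtt_text = (
    "WEBVTT\n"
    "\n"
    "00:00:00.440 --> 00:00:02.320 align:middle line:-1\n"
    "Hi.\n"
    "\n"
    "00:00:09.240 --> 00:00:11.080 align:middle line:-2\n"
    "This subtitle has\n"
    "two lines.\n"
    "\n"
    "00:00:24.080 --> 00:00:27.080 align:middle\n"
    "\n"
)

cue_re = re.compile(
    r"""
    (^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})   # group 1: TC-IN
    [ ]-->[ ]                                       # separator
    ([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})    # group 2: TC-OUT
    ([^\n]*)?\n                                     # group 3: extra VTT info
    (.*?)(?:\n\n|\Z)                                # group 4: subtitle content
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Collect (tc_in, tc_out, extra_info, content) tuples, one per cue.
cues = [
    (m.group(1), m.group(2), m.group(3).strip(), m.group(4).strip())
    for m in cue_re.finditer(vtt_text)
]

for tc_in, tc_out, info, content in cues:
    print(tc_in, tc_out, info, repr(content))
```

Files with \r\n line endings would need the newlines normalised first (or the pattern adapted), since the expression only looks for \n.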

How to ignore case when using regex in python?

I have problem related to case insensitive search for regular expression. Here is part of the code that I wrote:
engType = 'XM665'
The value of engType was extracted from other files. Based on engType, I want to find the line in another text file that contains this part and extract the description information from it; the description is the text between the engType string and 'Serial'.
for instance:
lines = ['xxxxxxxxxxx','mmmmmmmmmmm','jjjjj','xM665 Module 01 Serial (10-11)']
pat = re.compile(engType+'(.*?)[Ss][Ee][Rr][Ii][Aa][Ll]')
for line in lines:
    des = pat.search(line).strip()
    if des:
        break;
print des.group(1).strip()
I know this will raise an error, since the case of my string engType differs from that in 'xM665 Module 01 Serial (10-11)'. I understand that I can use [Ss] to make the comparison case-insensitive, as I have done in the last part of pat. However, since engType is a variable, I cannot apply that trick to it. I know I could search in lower case instead, like this:
lines = ['xxxxxxxxxxx','mmmmmmmmmmm','jjjjj','xM665 Module 01 Serial (10-11)']
pat = re.compile(engType.lower()+'(.*?)serial')
for line in lines:
    des = pat.search(line.lower()).strip()
    if des:
        break;
print des.group(1).strip()
result:
module 01
The case is now different from Module 01. If I want to keep the original case, how can I do this? Thank you!
re.IGNORECASE is the flag you're looking for.
pat = re.compile(engType+'(.*?)[Ss][Ee][Rr][Ii][Aa][Ll]',re.IGNORECASE)
Or, more simply re.compile(engType+'(.*?)serial',re.IGNORECASE).
Also, there is a bug in this line:
des = pat.search(line.lower()).strip()
Remove the .strip(): pat.search() returns either None or a match object, and neither has a .strip() method, so this line raises an AttributeError either way.
Check out re.IGNORECASE in http://docs.python.org/3/library/re.html
I believe it'll look like:
pat = re.compile(engType.lower()+'(.*?)serial', re.IGNORECASE)
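Putting these suggestions together, a minimal sketch of the loop could look like the following. Wrapping engType in re.escape is an extra precaution assumed here (in case the value ever contains regex metacharacters), and the `description` variable is a name introduced for this example.

```python
import re

engType = 'XM665'
lines = ['xxxxxxxxxxx', 'mmmmmmmmmmm', 'jjjjj', 'xM665 Module 01 Serial (10-11)']

# re.IGNORECASE applies to the whole pattern, including the variable part;
# re.escape keeps any special characters in engType literal.
pat = re.compile(re.escape(engType) + r'(.*?)serial', re.IGNORECASE)

description = None
for line in lines:
    mo = pat.search(line)      # either None or a match object
    if mo:
        description = mo.group(1).strip()
        break

print(description)
```

Because the original line is searched rather than a lower-cased copy, the captured text keeps its case.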

python substitute a substring with one character less

I need to process lines having a syntax similar to markdown http://daringfireball.net/projects/markdown/syntax, where header lines in my case are something like:
=== a sample header ===
===== a deeper header =====
and I need to change their depth, i.e. reduce it (or increase it) so:
== a sample header ==
==== a deeper header ====
My small knowledge of Python regexes is not enough to understand how to replace a number n of '=' signs with (n-1) '=' signs.
You could use backreferences and two negative lookarounds to find two corresponding sets of = characters.
output = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', input)
That will also work if you have a longer string that contains multiple headers (and will change all of them).
What does the regex do?
(?<!=) # make sure there is no preceding =
= # match a literal =
( # start capturing group 1
=+ # match one or more =
) # end capturing group 1
( # start capturing group 2
.*? # match zero or more characters, but as few as possible (due to ?)
) # end capturing group 2
= # match a =
\1 # match exactly what was matched with group 1 (i.e. the same amount of =)
(?!=) # make sure there is no trailing =
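Wrapped in a function, the substitution can be tried on a multi-line string. This is a sketch: the function names are made up for illustration, and the `increase_depth` variant is an assumed counterpart for the "or increase it" case that does not come from the answer above.

```python
import re

# Pattern from the answer: one "=" plus a captured run of "=" on each side.
DECREASE_RE = re.compile(r'(?<!=)=(=+)(.*?)=\1(?!=)')

# Assumed counterpart: capture the whole run and add one "=" on each side.
INCREASE_RE = re.compile(r'(?<!=)(=+)(.*?)\1(?!=)')


def decrease_depth(text):
    """Remove one '=' from each side of every header in text."""
    return DECREASE_RE.sub(r'\1\2\1', text)


def increase_depth(text):
    """Add one '=' to each side of every header in text (hypothetical helper)."""
    return INCREASE_RE.sub(r'=\1\2\1=', text)


doc = "=== a sample header ===\nbody text\n===== a deeper header =====\n"
print(decrease_depth(doc))
print(increase_depth("== a sample header =="))
```

Note that the increase pattern would also rewrite an ordinary line that happens to contain two separate "=" characters, so it is best applied only to lines already known to be headers.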
No need for regexes. I would go very simple and direct:
import sys
for line in sys.stdin:
    trimmed = line.strip()
    if len(trimmed) >= 2 and trimmed[0] == '=' and trimmed[-1] == '=':
        print(trimmed[1:-1])
    else:
        print(line.rstrip())
The initial strip is useful because in Markdown people sometimes leave blank spaces at the end of a line (and maybe the beginning). Adjust accordingly to meet your requirements.
I think it can be as simple as replacing '=(=+)' with \1 .
Is there any reason for not doing so?
How about a simple solution?
lines = ['=== a sample header ===', '===== a deeper header =====']
new_lines = []
for line in lines:
    if line.startswith('==') and line.endswith('=='):
        new_lines.append(line[1:-1])
results:
['== a sample header ==', '==== a deeper header ====']
or in one line:
new_lines = [line[1:-1] for line in lines if line.startswith('==') and line.endswith('==')]
The logic here is that if a line starts and ends with '==', it must have at least two = signs on each side, so when we trim one character from each side, we are left with at least one '=' on each side.
This will work as long as each line starts and ends with its '==...' run, and if you are using these as headers, they will, as long as you strip the newlines off.
For either the first header or the second header, you can just use string replace like this:
s = "=== a sample header ==="
s = s.replace("= "," ").replace(" ="," ")
This relies on the spaces between the = signs and the header text. str.replace returns a new string, so the result has to be assigned.
You can deal with the second header in the same way.
By the way, you can also use the sub function of the re module, but it's not necessary.

Regular expression in python to capture multiple forms of badly formatted addresses

I have been tweaking a regular expression over several days to try to capture, with a single definition, several cases of inconsistent format in the address field of a database.
I am new to Python and regular expressions, and have gotten great feedback here on Stack Overflow. With my new knowledge, I built a regex that is getting close to the final result, but I still can't spot the problem.
import re
r1 = r"([\w\s+]+),?\s*\(?([\w\s+\\/]+)\)?\s*\(?([\w\s+\\/]+)\)?"
match1 = re.match(r1, 'caracas, venezuela')
match2 = re.match(r1, 'caracas (venezuela)')
match3 = re.match(r1, 'caracas, (venezuela) (df)')
group1 = match1.groups()
group2 = match2.groups()
group3 = match3.groups()
print group1
print group2
print group3
This should return 'caracas, venezuela' for groups 1 and 2, and 'caracas, venezuela, df' for group 3. Instead, it returns:
('caracas', 'venezuel', 'a')
('caracas ', 'venezuel', 'a')
('caracas', 'venezuela', 'df')
The only perfect match is group 3. The other two isolate the 'a' at the end, and the second one has an extra space at the end of 'caracas '.
Thanks in advance for any insight.
Cheers!
Regular expressions might be overkill... what exactly is your problem statement? What do you need to capture?
Some things I caught (in order of appearance in your regex; sometimes it helps to read it out, left-to-right, English-style):
([\w\s+]+)
This says, "capture one or more characters, each of which is a word character, whitespace, or a literal plus sign".
Do you really want to capture the spaces at the end of the city name? Also, you don't need (indeed, shouldn't have) the + inside your brackets [ ]: inside a character class it is not the 1-or-more quantifier but a literal plus sign, and the outer + already matches one or more characters. I'd rewrite this part like this:
([\w\s]*\w)
Which will match greedily up to the last word character ("zero or more (word characters or spaces) followed by a word character"). This does assume you have at least one character, but is better than your pattern, which would accept a single space as well.
Next you have:
,?\s*\(?
which looks okay to me except that it doesn't guarantee that you'll see either a comma or an open paren anymore. What about:
(?:,\s*\(|,\s*|\s*\()
which says, "non-capturingly match either (a comma with maybe some spaces and then an open paren) OR (a comma with maybe some spaces) OR (maybe some spaces and then an open paren)". This enforces that you must have either a comma or a paren or both.
Next you have the capturing expression, very similar to the first:
([\w\s+\\/]+)
Again, you don't want the spaces (or slashes in this case) at the end of the city name, and you don't want the + inside the [ ]:
([\w\s\\/]*\w)
The next expression is probably where you're getting your venezuel a problem; let's take a look:
\)?\s*\(?([\w\s+\\/]+)\)?
This is a rather long one, so let's break it down:
\)?\s*\(?
says to "maybe match a close paren, and then maybe some spaces, and then maybe an open paren". This is okay I guess, let's move on to the real problem:
([\w\s+\\/]+)
This capturing group MUST match at least one character. If the matcher sees "venezuela" at the end of your address, it will eagerly match the characters venezuel and then need to satisfy this final expression with what it has left, a. Try instead:
\)?\s*
Followed by making your entire final expression optional, and the outer expression non-capturing:
(?:\(?([\w\s+\\/]+)\)?)?
The final expression would be:
([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?
Edit: fixed a problem that made the final group capture twice, once with the parens, once without. Now it should only capture the text inside the parens.
Testing it on your examples:
>>> re.match(r, 'caracas, venezuela').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas (venezuela)').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas, (venezuela) (df)').groups()
('caracas', 'venezuela', 'df')
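To get from these tuples to the comma-separated strings the question asks for, the non-empty groups can simply be joined. This is a small sketch; the `ADDRESS_RE` constant and the `normalise` helper are names assumed for illustration.

```python
import re

# The final expression from above, compiled once.
ADDRESS_RE = re.compile(
    r"([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?"
)


def normalise(address):
    """Return the captured parts joined by ', ', or None if nothing matches."""
    mo = ADDRESS_RE.match(address)
    if mo is None:
        return None
    # The third group is None when the optional part is absent; skip it.
    return ', '.join(part for part in mo.groups() if part)


for sample in ['caracas, venezuela', 'caracas (venezuela)', 'caracas, (venezuela) (df)']:
    print(normalise(sample))
```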
Could you not just find all the words in the text?
E.g.:
>>> import re
>>> samples = ['caracas, venezuela','caracas (venezuela)','caracas, (venezuela) (df)']
>>>
>>> def find_words(text):
...     return re.findall('\w+',text)
...
>>> for sample in samples:
...     print find_words(sample)
...
['caracas', 'venezuela']
['caracas', 'venezuela']
['caracas', 'venezuela', 'df']
